Summary
There are many machine learning algorithms, each with its own advantages and disadvantages. I have described algorithms commonly used in machine learning, together with their pros, cons, and sample code. For this post I have compiled information and videos from the web which I thought did a great job of explaining the concepts. The intention of this post isn't to teach you specific algorithms but to make you familiar with some of the concepts of machine learning. You can also use it as a future reference when in doubt about which algorithm you should consider using. I have used information provided by John Paul Mueller, Luca Massaron and Thales Sehn Korting for this post. I have not included neural networks, as they are quite difficult to explain in a small section within this post. If you are interested in learning more about neural networks (deep learning), you can start by checking out this link.
Support Vector Machines
Explanation:
SVMs (linear or otherwise) are inherently binary classifiers. However, there are various procedures for extending them to multiclass problems. The most common methods transform the problem into a set of binary classification problems, using one of two strategies (a minimal sketch of both follows the list):
1. One vs. the rest. For K classes, K binary classifiers are trained. Each determines whether an example belongs to its 'own' class versus any other class. The classifier with the largest output is taken to be the class of the example.
2. One vs. one. A binary classifier is trained for each pair of classes. A voting procedure is used to combine the outputs.
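For illustration, here is a minimal sketch of my own (not from the sources above) showing both strategies on the three-class iris dataset, using scikit-learn's OneVsRestClassifier and OneVsOneClassifier meta-estimators around a linear SVM:

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)  # 3 classes

# One vs. the rest: one binary classifier per class (3 here).
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)
# One vs. one: one binary classifier per pair of classes (3*(3-1)/2 = 3 here).
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)

print(len(ovr.estimators_), len(ovo.estimators_))  # 3 3
print(ovr.predict(X[:3]), ovo.predict(X[:3]))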
Best at:
- Character recognition
- Image recognition
- Text classification
Pros:
- Automatic nonlinear feature creation
- Can approximate complex nonlinear functions
Cons:
- Difficult to interpret when applying nonlinear kernels
- Scales poorly with the number of examples; beyond roughly 10,000 examples training starts taking too long
Sample Python Code:
print(__doc__)

# Author: Gael Varoquaux "gael dot varoquaux at normalesup dot org"
# License: BSD 3 clause

# Standard scientific Python imports
import matplotlib.pyplot as plt

# Import datasets, classifiers and performance metrics
from sklearn import datasets, svm, metrics

# The digits dataset
digits = datasets.load_digits()
plt.ion()

# The data that we are interested in is made of 8x8 images of digits; let's
# have a look at the first 4 images, stored in the `images` attribute of the
# dataset. If we were working from image files, we could load them using
# matplotlib.pyplot.imread. Note that each image must have the same size.
# For these images, we know which digit they represent: it is given in the
# 'target' of the dataset.
images_and_labels = list(zip(digits.images, digits.target))
for index, (image, label) in enumerate(images_and_labels[:4]):
    plt.subplot(2, 4, index + 1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Training: %i' % label)

# To apply a classifier on this data, we need to flatten the images, to
# turn the data into a (samples, features) matrix:
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))

# Create a classifier: a support vector classifier
classifier = svm.SVC(gamma=0.001)

# We learn the digits on the first half of the digits
classifier.fit(data[:n_samples // 2], digits.target[:n_samples // 2])

# Now predict the value of the digit on the second half:
expected = digits.target[n_samples // 2:]
predicted = classifier.predict(data[n_samples // 2:])

print("Classification report for classifier %s:\n%s\n"
      % (classifier, metrics.classification_report(expected, predicted)))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted))

images_and_predictions = list(zip(digits.images[n_samples // 2:], predicted))
for index, (image, prediction) in enumerate(images_and_predictions[:4]):
    plt.subplot(2, 4, index + 5)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Prediction: %i' % prediction)

plt.show()
Random Forest
Explanation:
Best at:
- Suited to almost any machine learning problem
- Bioinformatics
Pros:
- Can work in parallel
- Seldom overfits
- Automatically handles missing values
- No need to transform any variable
- No need to tweak parameters
- Can be used by almost anyone with excellent results
Cons:
- Difficult to interpret
- Weaker on regression when estimating values at the extremities of the distribution of response values
- Biased in multiclass problems toward more frequent classes
Sample Python Code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt

# Assuming your data is in data.csv
dataset = pd.read_csv('data.csv', sep=',')
dataset.fillna(0, inplace=True)

# Turn your categorical variables into dummy variables
# (scikit-learn models need numeric inputs)
dataset = pd.get_dummies(dataset)

# Assuming your target is in the first column and your predictors are in the remaining columns
X, y = dataset.iloc[:, 1:], dataset.iloc[:, :1]

# Create a train/test split where 20% of the original data goes into the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Random Forest
clf = RandomForestClassifier(n_estimators=10)
yt = np.ravel(y_train)
clf = clf.fit(X_train, yt)
y_predict = clf.predict(X_test)
print("Rsquared for Random Forest: ", r2_score(y_test, y_predict))
Gradient Boosting
Explanation:
Best at:
- Suited to almost any machine learning problem
- Search engines (solving the problem of learning to rank)
Pros:
- It can approximate most nonlinear functions
- Best in class predictor
- Automatically handles missing values
- No need to transform any variable
Cons:
- It can overfit if run for too many iterations
- Sensitive to noisy data and outliers
- Doesn't work well without parameter tuning
Sample Python Code:
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]

# The number of weak learners (i.e. regression trees) is controlled by the parameter
# n_estimators. The size of each tree can be controlled either by setting the tree depth
# via max_depth or by setting the number of leaf nodes via max_leaf_nodes. The
# learning_rate is a hyper-parameter in the range (0.0, 1.0] that controls overfitting
# via shrinkage.
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
                                 max_depth=1, random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))
Adaboost (Adaptive Boost)
Explanation:
Zhaojun Zhang, a software engineer at Coursera, describes the differences between AdaBoost and gradient boosting quite well. As he mentions, both methods use a set of weak learners and try to boost them into a strong learner, which is assumed to be an additive combination of the weak learners.
Gradient boosting generates learners during the learning process. It builds a first learner to predict the values/labels of the samples and calculates the loss (the difference between the first learner's output and the real values). It then builds a second learner to predict that loss, and the process continues with a third, fourth, ... learner until a stopping criterion is reached.
AdaBoost requires the user to specify a set of weak learners (alternatively, it will randomly generate a set of weak learners before the real learning process). It learns the weights with which these learners are added together to form a strong learner. The weight of each learner depends on whether it predicts samples correctly or not: if a learner mispredicts many samples, its weight is reduced. The process repeats until convergence.
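To make the residual-fitting idea behind gradient boosting concrete, here is a minimal sketch of my own (not from the sources above) of two boosting rounds with squared loss, using scikit-learn decision stumps as the weak learners:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# First weak learner predicts y directly.
tree1 = DecisionTreeRegressor(max_depth=1).fit(X, y)
residual = y - tree1.predict(X)

# Second weak learner predicts the residual (the loss) left by the first one.
tree2 = DecisionTreeRegressor(max_depth=1).fit(X, residual)

# The strong learner is the additive combination of the weak learners.
prediction = tree1.predict(X) + tree2.predict(X)
print("MSE after one tree: ", np.mean(residual ** 2))
print("MSE after two trees:", np.mean((y - prediction) ** 2))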
Best at:
- Face detection
Pros:
- Automatically handles missing values
- No need to transform any variable
- It doesn’t overfit easily
- Few parameters to tweak
- It can leverage many different weak-learners
Cons:
- Sensitive to noisy data and outliers
- Rarely delivers best-in-class predictions
Sample Python Code:
# AdaBoost Classification
import pandas
from sklearn import model_selection
from sklearn.ensemble import AdaBoostClassifier

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]

seed = 7
num_trees = 30
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
Naive Bayes
Explanation:
Best at:
- Face recognition
- Sentiment analysis
- Spam detection
- Text classification
Pros:
- Easy and fast to implement, doesn’t require too much memory and can be used for online learning
- Easy to understand
- Takes into account prior knowledge
Cons:
- Strong and unrealistic feature independence assumptions
- Fails at estimating rare occurrences
- Suffers from irrelevant features
Sample Python Code:
# Email spam detection
# You will need a folder containing spam-only email (called 'spam' here)
# and a folder containing non-spam email only (called 'ham' here).
import os
import io
from pandas import DataFrame, concat
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def readFiles(path):
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            path = os.path.join(root, filename)
            inBody = False
            lines = []
            f = io.open(path, 'r', encoding='latin1')
            for line in f:
                if inBody:
                    lines.append(line)
                elif line == '\n':
                    inBody = True
            f.close()
            message = '\n'.join(lines)
            yield path, message

def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)
    return DataFrame(rows, index=index)

data = concat([dataFrameFromDirectory('e:/directory/spam', 'spam'),
               dataFrameFromDirectory('e:/directory/ham', 'ham')])

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)

classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)

examples = ['Free Viagra now!!!', "Hi Bob, how about a game of golf tomorrow?"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
print(predictions)
K-nearest Neighbors
Explanation:
Best at:
- Computer vision
- Multilabel tagging
- Recommender systems
- Spell checking problems
Pros:
- Fast, lazy training
- Can naturally handle extreme multiclass problems (like tagging text)
Cons:
- Slow and cumbersome in the predicting phase
- Can fail to predict correctly due to the curse of dimensionality
Sample Python Code:
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)
print(neigh.predict([[1.1]]))        # [0]
print(neigh.predict_proba([[0.9]]))  # [[ 0.66666667  0.33333333]]
Linear Discriminant Analysis (LDA)
Explanation:
Both LDA and PCA are linear transformation techniques: LDA is supervised, whereas PCA is unsupervised (it ignores class labels).
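As a minimal sketch of that difference (my own example, assuming scikit-learn is available): LDA's fit requires the class labels while PCA's does not, and LDA can produce at most C-1 components.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 3 classes

pca = PCA(n_components=2).fit(X)                            # unsupervised: labels ignored
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)  # supervised: labels required

print(pca.transform(X).shape)  # (150, 2)
print(lda.transform(X).shape)  # (150, 2) -- at most C - 1 = 2 components for 3 classes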
Best at:
- Face recognition
- Reducing dimensions of the dataset
Pros:
- Can reduce data dimensionality
Cons:
- LDA produces at most C-1 feature projections, where C is the number of class labels
- LDA is a parametric method (it assumes unimodal Gaussian likelihoods)
Sample Python Code:
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])
clf = LDA()
clf.fit(X, y)
print(clf.predict([[-0.8, -1]]))
Logistic Regression
Explanation:
Best at:
- Ordering results by probability
- Modelling marketing responses
Pros:
- Simple to understand and explain
- It seldom overfits
- Using L1 & L2 regularization is effective in feature selection
- Fast to train
- Easy to train on big data thanks to its stochastic version
Cons:
- You have to work hard to make it fit nonlinear functions
- Can suffer from outliers
Sample Python Code:
print(__doc__)

# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features
Y = iris.target

h = .02  # step size in the mesh

# Create an instance of the logistic regression classifier and fit the data.
logreg = linear_model.LogisticRegression(C=1e5)
logreg.fit(X, Y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())

plt.show()
Linear Regression
Explanation:
Best at:
- Baseline predictions
- Econometric predictions
- Modelling marketing responses
Pros:
- Simple to understand and explain
- It seldom overfits
- Using L1 & L2 regularization is effective in feature selection
- Fast to train
- Easy to train on big data thanks to its stochastic version
Cons:
- You have to work hard to make it fit nonlinear functions
- Can suffer from outliers
Sample Python Code:
print(__doc__)

# Code source: Jaques Grobler
# License: BSD 3 clause

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, regr.predict(diabetes_X_test), color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
SVD (Singular Value Decomposition)
Intuitive Explanation:
Mathematical Explanation:
Best at:
- Recommender systems
Pros:
- Can restructure data in a meaningful way
Cons:
- Difficult to understand why data has been restructured in a certain way
Sample Python Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the iris data points and their class labels
iris = load_iris()
raw_data = iris.data
labels = iris.target
samples, features = raw_data.shape

data = np.mat(raw_data[:, :4])

def svd(data, S=2):
    # Calculate the SVD
    U, s, V = np.linalg.svd(data)
    Sig = np.mat(np.eye(S) * s[:S])
    # Keep only the first S columns of U
    newdata = U[:, :S]
    # This line could be used to reconstruct the dataset from the truncated factors
    # new = U[:, :S] * Sig * V[:S, :]
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    colors = ['blue', 'red', 'black']
    for i in range(samples):
        ax.scatter(newdata[i, 0], newdata[i, 1], color=colors[labels[i]])
    plt.xlabel('SVD1')
    plt.ylabel('SVD2')
    plt.show()

svd(data, 2)
PCA (Principal Component Analysis)
Explanation:
Best at:
- Removing collinearity
- Reducing dimensions of the dataset
Pros:
- Can reduce data dimensionality
Cons:
- Implies strong linear assumptions (components are weighted summations of the features)
Sample Python Code:
print(__doc__)

# PCA example with Iris Data-set
# Code source: Gaël Varoquaux
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection)
from sklearn import decomposition
from sklearn import datasets

np.random.seed(5)

iris = datasets.load_iris()
X = iris.data
y = iris.target

fig = plt.figure(1, figsize=(4, 3))
plt.clf()
ax = fig.add_axes([0, 0, .95, 1], projection='3d', elev=48, azim=134)
plt.cla()

pca = decomposition.PCA(n_components=3)
pca.fit(X)
X = pca.transform(X)

for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
    ax.text3D(X[y == label, 0].mean(),
              X[y == label, 1].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))

# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap=plt.cm.nipy_spectral)

ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])

plt.show()
K-means Clustering
Explanation:
Best at:
- Segmentation
Pros:
- Fast in finding clusters
- Can detect outliers in multiple dimensions
Cons:
- Suffers from multicollinearity
- Clusters are spherical; it can’t detect groups of other shapes
- Unstable solutions, depends on initialization
Sample Python Code:
print(__doc__)

# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection)
from sklearn.cluster import KMeans
from sklearn import datasets

np.random.seed(5)

iris = datasets.load_iris()
X = iris.data
y = iris.target

estimators = {'k_means_iris_3': KMeans(n_clusters=3),
              'k_means_iris_8': KMeans(n_clusters=8),
              'k_means_iris_bad_init': KMeans(n_clusters=3, n_init=1, init='random')}

fignum = 1
for name, est in estimators.items():
    fig = plt.figure(fignum, figsize=(4, 3))
    plt.clf()
    ax = fig.add_axes([0, 0, .95, 1], projection='3d', elev=48, azim=134)
    plt.cla()
    est.fit(X)
    labels = est.labels_

    ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(float))

    ax.xaxis.set_ticklabels([])
    ax.yaxis.set_ticklabels([])
    ax.zaxis.set_ticklabels([])
    ax.set_xlabel('Petal width')
    ax.set_ylabel('Sepal length')
    ax.set_zlabel('Petal length')
    fignum = fignum + 1

# Plot the ground truth
fig = plt.figure(fignum, figsize=(4, 3))
plt.clf()
ax = fig.add_axes([0, 0, .95, 1], projection='3d', elev=48, azim=134)
plt.cla()

for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
    ax.text3D(X[y == label, 3].mean(),
              X[y == label, 0].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))

# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y)

ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
plt.show()
Content Based Filtering
Explanation:
Best at:
- Recommender systems
Pros:
- Ability to recommend to users with unique tastes
- No need for data on other users
- Ability to recommend new and unpopular items
Cons:
- Finding features about items can be difficult
- Difficult to build profile for new users
- Might not recommend items outside of a user's profile
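Sample Python Code:
The original sources do not include code for this method, so the following is a minimal sketch of my own, assuming scikit-learn: items are represented by TF-IDF vectors of short, made-up descriptions, and we recommend the items most similar to one the user already liked.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical item catalogue: titles and short descriptions.
items = {
    'Movie A': 'space adventure with robots and aliens',
    'Movie B': 'romantic comedy set in Paris',
    'Movie C': 'robots rebel in a distant space colony',
    'Movie D': 'documentary about French cooking',
}
titles = list(items.keys())

# Build item feature vectors from their descriptions.
tfidf = TfidfVectorizer()
item_vectors = tfidf.fit_transform(items.values())

# Recommend the items most similar to one the user liked.
liked = 'Movie A'
similarities = cosine_similarity(item_vectors[titles.index(liked)], item_vectors).ravel()
ranking = sorted(zip(similarities, titles), reverse=True)
print([title for score, title in ranking if title != liked])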
Collaborative Filtering (CF)
Explanation:
Best at:
- Recommender systems
Pros:
- No feature selection required, as CF does not need machine-recognizable content information about either users or items
- Can suggest serendipitous items by observing the behavior of similar-minded people
Cons:
- CF systems are not content aware, meaning that information about items is not considered when producing recommendations
- The user/rating matrix is sparse, so it can be hard to find users who have rated similar items
- Cannot recommend an unrated item and tends to be biased towards popular items
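Sample Python Code:
Again, the original sources do not include code here, so this is a minimal sketch of my own of user-based collaborative filtering on a tiny, made-up rating matrix: a user's missing rating is predicted as a similarity-weighted average of other users' ratings for that item.

import numpy as np

# Hypothetical user-item rating matrix (rows = users, columns = items, 0 = unrated).
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Predict user 0's rating for item 2 from the ratings of similar users.
target_user, target_item = 0, 2
others = [u for u in range(len(ratings)) if u != target_user and ratings[u, target_item] > 0]
weights = np.array([cosine(ratings[target_user], ratings[u]) for u in others])
neighbour_ratings = np.array([ratings[u, target_item] for u in others])
predicted = np.dot(weights, neighbour_ratings) / weights.sum()
print("Predicted rating:", round(predicted, 2))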