Machine Learning Algorithms - Quick Reference

Summary


There are many machine learning algorithms, each with its own advantages and disadvantages. I have described algorithms commonly used in machine learning, along with their pros, cons, and sample code. For this post I have compiled information and videos from around the web that I thought did a great job of explaining the concepts. The intention of this post isn't to teach you particular algorithms but to familiarize you with some of the concepts of machine learning. You can also use it as a future reference when in doubt about which algorithm to consider. I have used information provided by John Paul Mueller, Luca Massaron and Thales Sehn Korting for this post. I have not included neural networks, as they are difficult to explain in a small section within this post. If you are interested in learning more about neural networks (deep learning), you can start by checking out this link.



Support Vector Machines


Explanation:
SVMs (linear or otherwise) inherently perform binary classification. However, there are various procedures for extending them to multiclass problems. The most common methods transform the problem into a set of binary classification problems, using one of two strategies (a short scikit-learn sketch of both follows this list):

1. One vs. the rest. For K classes, K binary classifiers are trained. Each determines whether an example belongs to its 'own' class versus any other class. The classifier with the largest output is taken to be the class of the example.

2. One vs. one. A binary classifier is trained for each pair of classes. A voting procedure is used to combine the outputs.
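
A minimal sketch of both strategies using scikit-learn's multiclass wrappers around a linear SVC. The dataset and parameter choices are illustrative assumptions, not part of the original example:

											from sklearn import datasets, svm
											from sklearn.model_selection import train_test_split
											from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

											# Iris has three classes, so a binary SVM needs a multiclass strategy.
											X, y = datasets.load_iris(return_X_y=True)
											X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

											# One vs. the rest: trains K binary classifiers, one per class.
											ovr = OneVsRestClassifier(svm.SVC(kernel='linear')).fit(X_train, y_train)

											# One vs. one: trains a classifier per pair of classes and combines them by voting.
											ovo = OneVsOneClassifier(svm.SVC(kernel='linear')).fit(X_train, y_train)

											print("One-vs-rest accuracy:", ovr.score(X_test, y_test))
											print("One-vs-one accuracy:", ovo.score(X_test, y_test))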


Best at:

  1. Character recognition
  2. Image recognition
  3. Text classification


Pros:

  1. Automatic nonlinear feature creation
  2. Can approximate complex nonlinear functions


Cons:

  1. Difficult to interpret when applying nonlinear kernels
  2. Scales poorly with the number of examples; beyond roughly 10,000 examples training starts to take too long


Sample Python Code:

											print(__doc__)


											# Author: Gael Varoquaux "gael dot varoquaux at normalesup dot org"
											# License: BSD 3 clause

											# Standard scientific Python imports
											import matplotlib.pyplot as plt

											# Import datasets, classifiers and performance metrics
											from sklearn import datasets, svm, metrics

											# The digits dataset
											digits = datasets.load_digits()

											# If running inside a Jupyter notebook, enable inline plots with: %matplotlib inline
											plt.ion()

											# The data that we are interested in is made of 8x8 images of digits, let's
											# have a look at the first 4 images, stored in the `images` attribute of the
											# dataset.  If we were working from image files, we could load them using
											# matplotlib.pyplot.imread.  Note that each image must have the same size. For these
											# images, we know which digit they represent: it is given in the 'target' of
											# the dataset.
											images_and_labels = list(zip(digits.images, digits.target))
											for index, (image, label) in enumerate(images_and_labels[:4]):
											    plt.subplot(2, 4, index + 1)
											    plt.axis('off')
											    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
											    plt.title('Training: %i' % label)

											# To apply a classifier on this data, we need to flatten the image, to
											# turn the data in a (samples, feature) matrix:
											n_samples = len(digits.images)
											data = digits.images.reshape((n_samples, -1))

											# Create a classifier: a support vector classifier
											classifier = svm.SVC(gamma=0.001)

											# We learn the digits on the first half of the digits
											classifier.fit(data[:n_samples // 2], digits.target[:n_samples // 2])

											# Now predict the value of the digit on the second half:
											expected = digits.target[n_samples // 2:]
											predicted = classifier.predict(data[n_samples // 2:])

											print("Classification report for classifier %s:\n%s\n"
											      % (classifier, metrics.classification_report(expected, predicted)))
											print("Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted))

											images_and_predictions = list(zip(digits.images[n_samples // 2:], predicted))
											for index, (image, prediction) in enumerate(images_and_predictions[:4]):
											    plt.subplot(2, 4, index + 5)
											    plt.axis('off')
											    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
											    plt.title('Prediction: %i' % prediction)

											plt.show()

										


Random Forest


Explanation:


Best at:

  1. Suited to almost any machine learning problem
  2. Bioinformatics


Pros:

  1. Can work in parallel
  2. Seldom overfits
  3. Automatically handles missing values
  4. No need to transform any variable
  5. No need to tweak parameters
  6. Can be used by almost anyone with excellent results


Cons:

  1. Difficult to interpret
  2. Weaker on regression when estimating values at the extremities of the distribution of response values
  3. Biased in multiclass problems toward more frequent classes


Sample Python Code:

											import pandas as pd
											import numpy as np
											from sklearn.model_selection import train_test_split
											from sklearn.ensemble import RandomForestClassifier
											from sklearn.metrics import accuracy_score
											import matplotlib.pyplot as plt

											#Assuming your data is in data.csv
											
											dataset = pd.read_csv('data.csv', sep=',')
											
											dataset.fillna(0, inplace=True)
											
											#For turning your categorical variables into dummy variables (sklearn can only deal with numerical variables)

											dataset=pd.get_dummies(dataset)

											#assuming your classification/target is in the first column and your predictors are in the other columns
											X, y = dataset.iloc[:,1:], dataset.iloc[:,:1]

											#for creating a test/training split where 20% of the original data is in the test set and the rest is in training
											X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
											
											#Random Forest
											
											clf = RandomForestClassifier(n_estimators=10)
											yt=np.ravel(y_train)
											clf = clf.fit(X_train, yt)
											y_predict=clf.predict(X_test)
											print("Rsquared for Random Forest: ",r2_score(y_test, y_predict))
										


Gradient Boosting


Explanation:


Best at:

  1. Suited to almost any machine learning problem
  2. Search engines (solving the problem of learning to rank)


Pros:

  1. It can approximate most nonlinear functions
  2. Best in class predictor
  3. Automatically handles missing values
  4. No need to transform any variable


Cons:

  1. It can overfit if run for too many iterations
  2. Sensitive to noisy data and outliers
  3. Doesn't work well without parameter tuning


Sample Python Code:

											from sklearn.datasets import make_hastie_10_2
											from sklearn.ensemble import GradientBoostingClassifier

											X, y = make_hastie_10_2(random_state=0)
											X_train, X_test = X[:2000], X[2000:]
											y_train, y_test = y[:2000], y[2000:]

											clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
											    max_depth=1, random_state=0).fit(X_train, y_train)
											#The number of weak learners (i.e. regression trees) is controlled by the
											#parameter n_estimators. The size of each tree can be controlled either by
											#setting the tree depth via max_depth or by setting the number of leaf nodes
											#via max_leaf_nodes. The learning_rate is a hyper-parameter in the range
											#(0.0, 1.0] that controls overfitting via shrinkage.

											clf.score(X_test, y_test)  

										


Adaboost (Adaptive Boost)


Explanation:


Zhaojun Zhang, a software engineer at Coursera, describes the differences between AdaBoost and gradient boosting quite well. As he explains, both methods use a set of weak learners and try to boost them into a strong learner, under the assumption that the strong learner is an additive combination of the weak learners.

Gradient boosting generates its learners during the learning process. It builds a first learner to predict the values/labels of the samples and calculates the loss (the difference between the first learner's output and the real values). It then builds a second learner to predict that remaining error, and the process continues with a third, fourth, and so on, until some stopping threshold is reached.

AdaBoost requires the user to specify a set of weak learners (alternatively, it can randomly generate a set of weak learners before the real learning process begins). It learns the weights with which to add these learners together into a strong learner. The weight of each learner is determined by whether it predicts samples correctly or not: if a learner mis-predicts a sample, the weight of that learner is reduced a bit. This process repeats until it converges.
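
To make the residual-fitting idea concrete, here is a minimal hand-rolled sketch of gradient boosting for squared-error regression, where each new tree is fit to the residuals of the current model. The toy dataset, tree depth and learning rate are illustrative assumptions:

											import numpy as np
											from sklearn.tree import DecisionTreeRegressor

											# Toy regression data (illustrative).
											rng = np.random.RandomState(0)
											X = rng.uniform(-3, 3, size=(200, 1))
											y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

											learning_rate = 0.1
											n_rounds = 50

											# Start from a constant prediction; each round fits a small tree to the residuals.
											prediction = np.full_like(y, y.mean())
											trees = []
											for _ in range(n_rounds):
											    residuals = y - prediction                     # what the current model still gets wrong
											    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
											    prediction += learning_rate * tree.predict(X)  # shrink each tree's contribution
											    trees.append(tree)

											print("Training MSE after boosting:", np.mean((y - prediction) ** 2))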


Best at:

  1. Face detection


Pros:

  1. Automatically handles missing values
  2. No need to transform any variable
  3. It doesn’t overfit easily
  4. Few parameters to tweak
  5. It can leverage many different weak-learners


Cons:

  1. Sensitive to noisy data and outliers
  2. Never the best-in-class predictor


Sample Python Code:

											# AdaBoost Classification
											import pandas
											from sklearn import model_selection
											from sklearn.ensemble import AdaBoostClassifier
											url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
											names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
											dataframe = pandas.read_csv(url, names=names)
											array = dataframe.values
											X = array[:,0:8]
											Y = array[:,8]
											seed = 7
											num_trees = 30
											kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
											model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
											results = model_selection.cross_val_score(model, X, Y, cv=kfold)
											print(results.mean())
										


Naive Bayes


Explanation:


Best at:

  1. Face recognition
  2. Sentiment analysis
  3. Spam detection
  4. Text classification


Pros:

  1. Easy and fast to implement, doesn’t require too much memory and can be used for online learning (see the partial_fit sketch after this list)
  2. Easy to understand
  3. Takes into account prior knowledge
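
Because scikit-learn's MultinomialNB supports incremental updates via partial_fit, it can be trained on data that arrives in batches (online learning). A minimal sketch, assuming two small illustrative batches of documents:

											import numpy as np
											from sklearn.feature_extraction.text import HashingVectorizer
											from sklearn.naive_bayes import MultinomialNB

											# HashingVectorizer is stateless, so each batch can be vectorized independently.
											vectorizer = HashingVectorizer(n_features=2**10, alternate_sign=False)
											clf = MultinomialNB()
											classes = np.array(['ham', 'spam'])  # all classes must be declared on the first partial_fit call

											# First batch (toy, illustrative documents).
											batch1 = ["win a free prize now", "meeting at noon tomorrow"]
											clf.partial_fit(vectorizer.transform(batch1), ['spam', 'ham'], classes=classes)

											# A later batch updates the model without retraining from scratch.
											batch2 = ["cheap pills free offer", "lunch with the team"]
											clf.partial_fit(vectorizer.transform(batch2), ['spam', 'ham'])

											print(clf.predict(vectorizer.transform(["free offer now"])))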


Cons:

  1. Strong and unrealistic feature independence assumptions
  2. Fails estimating rare occurrences
  3. Suffers from irrelevant features


Sample Python Code:

											#Email Spam Detection
											#You will need a folder containing only spam emails; we'll call it Spam here
											#You will also need a folder containing only non-spam emails; we'll call it Ham

											import os
											import io
											import numpy
											import pandas as pd
											from pandas import DataFrame
											from sklearn.feature_extraction.text import CountVectorizer
											from sklearn.naive_bayes import MultinomialNB

											def readFiles(path):
											    for root, dirnames, filenames in os.walk(path):
											        for filename in filenames:
											            path = os.path.join(root, filename)

											            inBody = False
											            lines = []
											            f = io.open(path, 'r', encoding='latin1')
											            for line in f:
											                if inBody:
											                    lines.append(line)
											                elif line == '\n':
											                    inBody = True
											            f.close()
											            message = '\n'.join(lines)
											            yield path, message


											def dataFrameFromDirectory(path, classification):
											    rows = []
											    index = []
											    for filename, message in readFiles(path):
											        rows.append({'message': message, 'class': classification})
											        index.append(filename)

											    return DataFrame(rows, index=index)

											data = pd.concat([dataFrameFromDirectory('e:/directory/spam', 'spam'),
											                  dataFrameFromDirectory('e:/directory/ham', 'ham')])
											vectorizer = CountVectorizer()
											counts = vectorizer.fit_transform(data['message'].values)

											classifier = MultinomialNB()
											targets = data['class'].values
											classifier.fit(counts, targets)
											examples = ['Free Viagra now!!!', "Hi Bob, how about a game of golf tomorrow?"]
											example_counts = vectorizer.transform(examples)
											predictions = classifier.predict(example_counts)
											print(predictions)
										


K-nearest Neighbors


Explanation:


Best at:

  1. Computer vision
  2. Multilabel tagging
  3. Recommender systems
  4. Spell checking problems


Pros:

  1. Fast, lazy training
  2. Can naturally handle extreme multiclass problems (like tagging text)


Cons:

  1. Slow and cumbersome in the predicting phase
  2. Can fail to predict correctly due to the curse of dimensionality


Sample Python Code:

											X = [[0], [1], [2], [3]]
											y = [0, 0, 1, 1]
											from sklearn.neighbors import KNeighborsClassifier
											neigh = KNeighborsClassifier(n_neighbors=3)
											neigh.fit(X, y) 

											print(neigh.predict([[1.1]]))
											#[0]
											print(neigh.predict_proba([[0.9]]))
											#[[ 0.66666667  0.33333333]]
										


LDA (Linear Discriminant Analysis)


Explanation:


Both LDA and PCA are linear transformation techniques: LDA is supervised, whereas PCA is unsupervised (it ignores class labels).
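
A minimal sketch contrasting the two on the iris dataset (the dataset choice and two-component projections are illustrative assumptions): LDA uses the class labels to find directions that separate the classes, while PCA only looks at overall variance.

											from sklearn import datasets
											from sklearn.decomposition import PCA
											from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

											X, y = datasets.load_iris(return_X_y=True)

											# PCA: unsupervised, fit on X alone.
											X_pca = PCA(n_components=2).fit_transform(X)

											# LDA: supervised, needs y; at most C-1 = 2 components for the 3 iris classes.
											X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

											print("PCA projection shape:", X_pca.shape)
											print("LDA projection shape:", X_lda.shape)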


Best at:

  1. Face recognition
  2. Reducing dimensions of the dataset


Pros:

  1. Can reduce data dimensionality


Cons:

  1. LDA produces at most C-1 feature projections, where C is the number of class labels
  2. LDA is a parametric method (it assumes unimodal Gaussian likelihoods)


Sample Python Code:

											import numpy as np
											from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
											X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
											y = np.array([1, 1, 1, 2, 2, 2])
											clf = LDA()
											clf.fit(X, y)

											print(clf.predict([[-0.8, -1]]))

										


Logistic Regression


Explanation:


Best at:

  1. Ordering results by probability
  2. Modelling marketing responses


Pros:

  1. Simple to understand and explain
  2. It seldom overfits
  3. Using L1 & L2 regularization is effective in feature selection
  4. Fast to train
  5. Easy to train on big data thanks to its stochastic version


Cons:

  1. You have to work hard to make it fit nonlinear functions
  2. Can suffer from outliers


Sample Python Code:

											print(__doc__)


											# Code source: Gaël Varoquaux
											# Modified for documentation by Jaques Grobler
											# License: BSD 3 clause

											import numpy as np
											import matplotlib.pyplot as plt
											from sklearn import linear_model, datasets

											# import some data to play with
											iris = datasets.load_iris()
											X = iris.data[:, :2]  # we only take the first two features.
											Y = iris.target

											h = .02  # step size in the mesh

											logreg = linear_model.LogisticRegression(C=1e5)

											# we create an instance of Neighbours Classifier and fit the data.
											logreg.fit(X, Y)

											# Plot the decision boundary. For that, we will assign a color to each
											# point in the mesh [x_min, x_max]x[y_min, y_max].
											x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
											y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
											xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
											Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])

											# Put the result into a color plot
											Z = Z.reshape(xx.shape)
											plt.figure(1, figsize=(4, 3))
											plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

											# Plot also the training points
											plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
											plt.xlabel('Sepal length')
											plt.ylabel('Sepal width')

											plt.xlim(xx.min(), xx.max())
											plt.ylim(yy.min(), yy.max())
											plt.xticks(())
											plt.yticks(())

											plt.show()

										


Linear Regression


Explanation:


Best at:

  1. Baseline predictions
  2. Econometric predictions
  3. Modelling marketing responses


Pros:

  1. Simple to understand and explain
  2. It seldom overfits
  3. Using L1 & L2 regularization is effective in feature selection
  4. Fast to train
  5. Easy to train on big data thanks to its stochastic version
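
As a minimal illustration of how an L1 penalty drives uninformative coefficients to exactly zero (and thereby performs feature selection), here is a Lasso sketch on the diabetes dataset; the alpha value is an illustrative assumption:

											import numpy as np
											from sklearn import datasets
											from sklearn.linear_model import Lasso

											X, y = datasets.load_diabetes(return_X_y=True)

											# L1 penalty (Lasso): larger alpha pushes more coefficients to exactly zero.
											lasso = Lasso(alpha=1.0).fit(X, y)

											print("Coefficients:", np.round(lasso.coef_, 2))
											print("Features kept:", int(np.sum(lasso.coef_ != 0)), "of", X.shape[1])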


Cons:

  1. You have to work hard to make it fit nonlinear functions
  2. Can suffer from outliers


Sample Python Code:

											print(__doc__)


											# Code source: Jaques Grobler
											# License: BSD 3 clause


											import matplotlib.pyplot as plt
											import numpy as np
											from sklearn import datasets, linear_model

											# Load the diabetes dataset
											diabetes = datasets.load_diabetes()


											# Use only one feature
											diabetes_X = diabetes.data[:, np.newaxis, 2]

											# Split the data into training/testing sets
											diabetes_X_train = diabetes_X[:-20]
											diabetes_X_test = diabetes_X[-20:]

											# Split the targets into training/testing sets
											diabetes_y_train = diabetes.target[:-20]
											diabetes_y_test = diabetes.target[-20:]

											# Create linear regression object
											regr = linear_model.LinearRegression()

											# Train the model using the training sets
											regr.fit(diabetes_X_train, diabetes_y_train)

											# The coefficients
											print('Coefficients: \n', regr.coef_)
											# The mean squared error
											print("Mean squared error: %.2f"
											      % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
											# Explained variance score: 1 is perfect prediction
											print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))

											# Plot outputs
											plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
											plt.plot(diabetes_X_test, regr.predict(diabetes_X_test), color='blue',
											         linewidth=3)

											plt.xticks(())
											plt.yticks(())

											plt.show()
										


SVD (Singular Value Decomposition)


Intuitive Explanation:
Mathematical Explanation:


Best at:

  1. Recommender systems


Pros:

  1. Can restructure data in a meaningful way


Cons:

  1. Difficult to understand why data has been restructured in a certain way


Sample Python Code:

											import numpy as np
											import matplotlib.pyplot as plt

											from sklearn.datasets import load_iris

											 
											# load data points
											iris = load_iris()
											raw_data = iris.data
											labels = iris.target
											samples, features = raw_data.shape
											 
											# keep all four features (no scaling is applied here)
											data = raw_data[:, :4]
											 
											def svd(data, S=2):
											     
											    #calculate the SVD
											    U, s, V = np.linalg.svd(data)
											    Sig = np.diag(s[:S])
											    #take only the first S columns of U (the reduced representation)
											    newdata = U[:, :S]
											     
											    # this line can be used to approximately reconstruct the dataset
											    #~ new = U[:, :S] @ Sig @ V[:S, :]
											 
											    fig = plt.figure()
											    ax = fig.add_subplot(1, 1, 1)
											    colors = ['blue', 'red', 'black']
											    for i in range(samples):
											        ax.scatter(newdata[i, 0], newdata[i, 1], color=colors[labels[i]])
											    plt.xlabel('SVD1')
											    plt.ylabel('SVD2')
											    plt.show()
											         
											 
											svd(data, 2)
										


PCA (Principal Component Analysis)


Explanation:


Best at:

  1. Removing collinearity
  2. Reducing dimensions of the dataset


Pros:

  1. Can reduce data dimensionality


Cons:

  1. Implies strong linear assumptions (components are weighted summations of the features)


Sample Python Code:

											print(__doc__)

											# PCA example with Iris Data-set
											# Code source: Gaël Varoquaux
											# License: BSD 3 clause

											import numpy as np
											import matplotlib.pyplot as plt
											from mpl_toolkits.mplot3d import Axes3D


											from sklearn import decomposition
											from sklearn import datasets

											np.random.seed(5)

											centers = [[1, 1], [-1, -1], [1, -1]]
											iris = datasets.load_iris()
											X = iris.data
											y = iris.target

											fig = plt.figure(1, figsize=(4, 3))
											plt.clf()
											ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)

											plt.cla()
											pca = decomposition.PCA(n_components=3)
											pca.fit(X)
											X = pca.transform(X)

											for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
											    ax.text3D(X[y == label, 0].mean(),
											              X[y == label, 1].mean() + 1.5,
											              X[y == label, 2].mean(), name,
											              horizontalalignment='center',
											              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
											# Reorder the labels to have colors matching the cluster results
											y = np.choose(y, [1, 2, 0]).astype(float)
											ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap=plt.cm.nipy_spectral)

											ax.w_xaxis.set_ticklabels([])
											ax.w_yaxis.set_ticklabels([])
											ax.w_zaxis.set_ticklabels([])

											plt.show()

										


K-means Clustering


Explanation:


Best at:

  1. Segmentation


Pros:

  1. Fast in finding clusters
  2. Can detect outliers in multiple dimensions


Cons:

  1. Suffers from multicollinearity
  2. Clusters are assumed to be spherical, so it can’t detect groups of other shapes
  3. Unstable solutions that depend on the initialization


Sample Python Code:

											print(__doc__)


											# Code source: Gaël Varoquaux
											# Modified for documentation by Jaques Grobler
											# License: BSD 3 clause

											import numpy as np
											import matplotlib.pyplot as plt
											from mpl_toolkits.mplot3d import Axes3D


											from sklearn.cluster import KMeans
											from sklearn import datasets

											np.random.seed(5)

											centers = [[1, 1], [-1, -1], [1, -1]]
											iris = datasets.load_iris()
											X = iris.data
											y = iris.target

											estimators = {'k_means_iris_3': KMeans(n_clusters=3),
											              'k_means_iris_8': KMeans(n_clusters=8),
											              'k_means_iris_bad_init': KMeans(n_clusters=3, n_init=1,
											                                              init='random')}


											fignum = 1
											for name, est in estimators.items():
											    fig = plt.figure(fignum, figsize=(4, 3))
											    plt.clf()
											    ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)

											    plt.cla()
											    est.fit(X)
											    labels = est.labels_

											    ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(float))

											    ax.w_xaxis.set_ticklabels([])
											    ax.w_yaxis.set_ticklabels([])
											    ax.w_zaxis.set_ticklabels([])
											    ax.set_xlabel('Petal width')
											    ax.set_ylabel('Sepal length')
											    ax.set_zlabel('Petal length')
											    fignum = fignum + 1

											# Plot the ground truth
											fig = plt.figure(fignum, figsize=(4, 3))
											plt.clf()
											ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)

											plt.cla()

											for name, label in [('Setosa', 0),
											                    ('Versicolour', 1),
											                    ('Virginica', 2)]:
											    ax.text3D(X[y == label, 3].mean(),
											              X[y == label, 0].mean() + 1.5,
											              X[y == label, 2].mean(), name,
											              horizontalalignment='center',
											              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
											# Reorder the labels to have colors matching the cluster results
											y = np.choose(y, [1, 2, 0]).astype(float)
											ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y)

											ax.w_xaxis.set_ticklabels([])
											ax.w_yaxis.set_ticklabels([])
											ax.w_zaxis.set_ticklabels([])
											ax.set_xlabel('Petal width')
											ax.set_ylabel('Sepal length')
											ax.set_zlabel('Petal length')
											plt.show()

										


Content Based Filtering


Explanation:


Best at:

  1. Recommender systems


Pros:

  1. Ability to recommend to users with unique tastes
  2. No need for data on other users
  3. Ability to recommend new and unpopular items


Cons:

  1. Finding features about items can be difficult
  2. Difficult to build profile for new users
  3. Might not recommend items outside of a user's profile
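

Sample Python Code:

A minimal, self-contained sketch of content-based filtering: build a user profile as a rating-weighted average of item feature vectors, then score unrated items by cosine similarity. The item features and ratings below are toy, illustrative assumptions:

											import numpy as np

											# Toy item feature matrix (rows = items, columns = hand-crafted features); illustrative only.
											item_features = np.array([
											    [1, 0, 1],   # item 0: action, sci-fi
											    [1, 0, 0],   # item 1: action
											    [0, 1, 0],   # item 2: romance
											], dtype=float)
											item_names = ['Action/Sci-fi', 'Action', 'Romance']

											# Ratings the user has already given (item index -> rating).
											user_ratings = {0: 5.0, 2: 1.0}

											# Build the user profile as a rating-weighted average of rated item features.
											profile = np.zeros(item_features.shape[1])
											for item, rating in user_ratings.items():
											    profile += rating * item_features[item]
											profile /= sum(user_ratings.values())

											# Recommend unrated items by cosine similarity between the profile and their features.
											def cosine(a, b):
											    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

											for item in range(len(item_features)):
											    if item not in user_ratings:
											        print(item_names[item], round(cosine(profile, item_features[item]), 3))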


Collaborative Filtering (CF)


Explanation:


Best at:

  1. Recommender systems


Pros:

  1. No feature selection required, since CF does not need content information about either users or items to be machine-recognizable
  2. Can suggest serendipitous items by observing similar-minded people's behavior


Cons:

  1. CF systems are not content-aware, meaning that information about the items themselves is not considered when producing recommendations
  2. User/rating matrix is sparse, can be hard to find users who have rated similar items
  3. Cannot recommend an unrated item and tends to be biased towards popular items
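

Sample Python Code:

A minimal, self-contained sketch of user-based collaborative filtering: predict a missing rating as a similarity-weighted average of other users' ratings for that item. The rating matrix is a toy, illustrative assumption:

											import numpy as np

											# Toy user x item rating matrix (0 = not rated); illustrative only.
											ratings = np.array([
											    [5, 4, 0, 1],
											    [4, 5, 1, 0],
											    [1, 0, 5, 4],
											], dtype=float)

											def cosine(a, b):
											    mask = (a > 0) & (b > 0)   # compare users only on items both have rated
											    if not mask.any():
											        return 0.0
											    return float(a[mask] @ b[mask] / (np.linalg.norm(a[mask]) * np.linalg.norm(b[mask])))

											target_user = 0
											target_item = 2   # an item the target user has not rated

											# Predict the missing rating as a similarity-weighted average of other users' ratings.
											num, den = 0.0, 0.0
											for other in range(ratings.shape[0]):
											    if other != target_user and ratings[other, target_item] > 0:
											        sim = cosine(ratings[target_user], ratings[other])
											        num += sim * ratings[other, target_item]
											        den += abs(sim)

											print("Predicted rating:", round(num / den, 2) if den else None)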



