Machine Learning Algorithms - Quick Reference

### Summary

There are many machine learning algorithms, each with its own advantages and disadvantages. In this post I describe algorithms commonly used in machine learning and list their pros and cons, along with sample code. I have compiled information and videos from the web that I thought did a great job of explaining the concepts.

The intention of this post isn't to teach you specific algorithms but to make you familiar with some of the concepts of machine learning. You can also use it as a future reference when in doubt about which algorithm to consider. I have used information provided by John Paul Mueller, Luca Massaron and Thales Sehn Korting for this post. I have not included neural networks, as they are quite difficult to explain in a small section within this post. If you are interested in learning more about neural networks (deep learning), you can start by checking out this link.

### Support Vector Machines

Explanation:
SVMs (linear or otherwise) inherently do binary classification. However, there are various procedures for extending them to multiclass problems. The most common methods involve transforming the problem into a set of binary classification problems, by one of two strategies:

1. One vs. the rest. For K classes, K binary classifiers are trained. Each determines whether an example belongs to its 'own' class versus any other class. The example is assigned to the class whose classifier produces the largest output.

2. One vs. one. A binary classifier is trained for each pair of classes. A voting procedure is used to combine the outputs.
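The two strategies above can be sketched with scikit-learn's generic multiclass wrappers. This is only an illustration, assuming the 10-class digits dataset and a linear-kernel `SVC` as the underlying binary classifier; it simply counts how many binary classifiers each strategy trains.

```
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Digits has K = 10 classes
X, y = load_digits(return_X_y=True)

# One vs. the rest: one binary classifier per class (K in total)
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

# One vs. one: one binary classifier per pair of classes (K * (K - 1) / 2 in total)
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

n_ovr = len(ovr.estimators_)
n_ovo = len(ovo.estimators_)
print("one vs. rest classifiers:", n_ovr)
print("one vs. one classifiers:", n_ovo)
```

One vs. one trains more classifiers, but each one only sees the examples of two classes, which can matter for an algorithm whose training time grows quickly with the number of examples.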

Best at:

1. Character recognition
2. Image recognition
3. Text classification

Pros:

1. Automatic nonlinear feature creation
2. Can approximate complex nonlinear functions

Cons:

1. Difficult to interpret when applying nonlinear kernels
2. Scales poorly with the number of examples; beyond roughly 10,000 examples it starts taking too long to train

Sample Python Code:

```
print(__doc__)

# Author: Gael Varoquaux "gael dot varoquaux at normalesup dot org"
# License: BSD 3 clause

# Standard scientific Python imports
import matplotlib.pyplot as plt

# Import datasets, classifiers and performance metrics
from sklearn import datasets, svm, metrics

# The digits dataset
digits = datasets.load_digits()

# In a Jupyter notebook, uncomment the next line to show plots inline
# %matplotlib inline
plt.ion()

# The data that we are interested in is made of 8x8 images of digits, let's
# have a look at the first 4 images, stored in the `images` attribute of the
# dataset.  If we were working from image files, we could load them using
# matplotlib.pyplot.imread.  Note that each image must have the same size. For these
# images, we know which digit they represent: it is given in the 'target' of
# the dataset.
images_and_labels = list(zip(digits.images, digits.target))
for index, (image, label) in enumerate(images_and_labels[:4]):
    plt.subplot(2, 4, index + 1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Training: %i' % label)

# To apply a classifier on this data, we need to flatten the image, to
# turn the data in a (samples, feature) matrix:
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))

# Create a classifier: a support vector classifier
classifier = svm.SVC(gamma=0.001)

# We learn the digits on the first half of the digits
# (integer division, so the result is a valid slice index)
classifier.fit(data[:n_samples // 2], digits.target[:n_samples // 2])

# Now predict the value of the digit on the second half:
expected = digits.target[n_samples // 2:]
predicted = classifier.predict(data[n_samples // 2:])

print("Classification report for classifier %s:\n%s\n"
      % (classifier, metrics.classification_report(expected, predicted)))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted))

images_and_predictions = list(zip(digits.images[n_samples // 2:], predicted))
for index, (image, prediction) in enumerate(images_and_predictions[:4]):
    plt.subplot(2, 4, index + 5)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Prediction: %i' % prediction)

plt.show()
```

### Random Forest

Explanation:

Best at:

1. Suited to almost any machine learning problem
2. Bioinformatics

Pros:

1. Can work in parallel
2. Seldom overfits
3. Automatically handles missing values
4. No need to transform any variable
5. No need to tweak parameters
6. Can be used by almost anyone with excellent results

Cons:

1. Difficult to interpret
2. Weaker on regression when estimating values at the extremities of the distribution of response values
3. Biased in multiclass problems toward more frequent classes

Sample Python Code:

```
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assuming your data is in data.csv
dataset = pd.read_csv('data.csv', sep=',')

# Replace missing values with 0
dataset.fillna(0, inplace=True)

# Assuming your classification is in the last column and your predictors are in the other columns
X, y = dataset.iloc[:, :-1], dataset.iloc[:, -1]

# Turn categorical predictors into dummy variables (sklearn only works with numeric input)
X = pd.get_dummies(X)

# Create a train/test split where 20% of the original data is in the test set and the rest is in training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Random Forest
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X_train, y_train)
y_predict = clf.predict(X_test)

# Accuracy is the natural score for a classifier (R squared is a regression metric)
print("Accuracy for Random Forest: ", accuracy_score(y_test, y_predict))
```

### Gradient Boosting

Explanation:

Best at:

1. Suited to almost any machine learning problem
2. Search engines (solving the problem of learning to rank)

Pros:

1. It can approximate most nonlinear functions
2. Best in class predictor
3. Automatically handles missing values
4. No need to transform any variable

Cons:

1. It can overfit if run for too many iterations
2. Sensitive to noisy data and outliers
3. Doesn't work well without parameter tuning

Sample Python Code:

```
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]

# The number of weak learners (i.e. regression trees) is controlled by the
# parameter n_estimators. The size of each tree can be controlled either by
# setting the tree depth via max_depth or by setting the number of leaf nodes
# via max_leaf_nodes. The learning_rate is a hyper-parameter in the range
# (0.0, 1.0] that controls overfitting via shrinkage.
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
                                 max_depth=1, random_state=0).fit(X_train, y_train)

# Mean accuracy on the held-out data
print(clf.score(X_test, y_test))
```

### AdaBoost (Adaptive Boosting)

Explanation:

Zhaojun Zhang, a software engineer at Coursera, describes the differences between AdaBoost and gradient boosting quite well. As he mentions, both methods use a set of weak learners and try to boost them into a strong learner, under the assumption that the strong learner is an additive combination of the weak learners.

Gradient boosting generates learners during the learning process. It builds a first learner to predict the values/labels of the samples and calculates the loss (the difference between the outcome of the first learner and the real value). It then builds a second learner to predict the loss remaining after the first step. The process continues with a third learner, a fourth, and so on, until a certain threshold is reached.

AdaBoost requires users to specify a set of weak learners (alternatively, it randomly generates a set of weak learners before the real learning process). It learns the weights with which these learners are added together into a strong learner. The weight of each learner depends on whether it predicts samples correctly: if a learner mispredicts a sample, its weight is reduced a bit. The process repeats until convergence.
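The gradient boosting loop described above can be sketched by hand for a regression problem. This is a minimal illustration rather than how any library implements it: it assumes squared-error loss (so "the loss" each new learner predicts is simply the residual), depth-1 regression trees as weak learners, and a synthetic sine-wave dataset and learning rate chosen only for the example.

```
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for this sketch): a noisy sine wave
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.5

# The first "learner" is just the mean of the targets
prediction = np.full_like(y, y.mean())
errors = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    # What the learners so far still get wrong
    residual = y - prediction
    # The next weak learner is trained to predict that residual
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    # Add the new learner to the additive model
    prediction += learning_rate * stump.predict(X)
    errors.append(np.mean((y - prediction) ** 2))

print("Training MSE before boosting: %.3f" % errors[0])
print("Training MSE after 20 rounds: %.3f" % errors[-1])
```

The training error shrinks with every round, which is also why the method can overfit if it is run for too many iterations.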

Best at:

1. Face detection

Pros:

1. Automatically handles missing values
2. No need to transform any variable
3. It doesn’t overfit easily
4. Few parameters to tweak
5. It can leverage many different weak-learners

Cons:

1. Sensitive to noisy data and outliers
2. Never the best-in-class predictor

Sample Python Code:

```
# AdaBoost Classification
import pandas
from sklearn import model_selection
from sklearn.ensemble import AdaBoostClassifier

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
seed = 7
num_trees = 30

# random_state only takes effect (and is only accepted) when shuffle=True
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
```

### Naive Bayes

Explanation:

Best at:

1. Face recognition
2. Sentiment analysis
3. Spam detection
4. Text classification

Pros:

1. Easy and fast to implement, doesn’t require too much memory and can be used for online learning
2. Easy to understand
3. Takes into account prior knowledge

Cons:

1. Strong and unrealistic feature independence assumptions
2. Fails at estimating rare occurrences
3. Suffers from irrelevant features

Sample Python Code:

```
# Email Spam Detection
# You will need a folder that contains only spam emails; we'll call it spam here
# You will also need a folder that contains only non-spam emails; we'll call it ham

import os
import io
import pandas as pd
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def readFiles(path):
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            file_path = os.path.join(root, filename)

            # Skip the email headers: the body starts after the first blank line
            inBody = False
            lines = []
            with io.open(file_path, 'r', encoding='latin1') as f:
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        inBody = True
            message = '\n'.join(lines)
            yield file_path, message

def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    return DataFrame(rows, index=index)

# DataFrame.append was removed in pandas 2.0, so the frames are combined with concat
data = pd.concat([dataFrameFromDirectory('e:/directory/spam', 'spam'),
                  dataFrameFromDirectory('e:/directory/ham', 'ham')])

# Turn each message into a vector of word counts
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)

classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)

examples = ['Free Viagra now!!!', "Hi Bob, how about a game of golf tomorrow?"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
print(predictions)
```

### K-nearest Neighbors

Explanation:

Best at:

1. Computer vision
2. Multilabel tagging
3. Recommender systems
4. Spell checking problems

Pros:

1. Fast, lazy training
2. Can naturally handle extreme multiclass problems (like tagging text)

Cons:

1. Slow and cumbersome in the predicting phase
2. Can fail to predict correctly due to the curse of dimensionality

Sample Python Code:

```
from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)

print(neigh.predict([[1.1]]))
#[0]
print(neigh.predict_proba([[0.9]]))
#[[ 0.66666667  0.33333333]]
```

### (LDA) Linear Discriminant Analysis

Explanation:

Both LDA and PCA are linear transformation techniques: LDA is supervised, whereas PCA is unsupervised (it ignores class labels).
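The difference shows up directly in how the two transformers are fitted. The sketch below assumes the iris dataset and two output dimensions, both chosen only for illustration: PCA is fitted on the features alone, while LDA also needs the class labels.

```
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# Unsupervised: PCA only sees the features
X_pca = PCA(n_components=2).fit_transform(X)

# Supervised: LDA uses the class labels to find directions that separate the classes
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)
```

Both results have two columns, but the PCA axes follow the directions of largest variance while the LDA axes follow the directions that best separate the three species.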

Best at:

1. Face recognition
2. Reducing dimensions of the dataset

Pros:

1. Can reduce data dimensionality

Cons:

1. LDA produces at most C-1 feature projections, where C is the number of class labels
2. LDA is a parametric method (it assumes unimodal Gaussian likelihoods)

Sample Python Code:

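```
import numpy as np
# sklearn.lda was removed; LDA now lives in sklearn.discriminant_analysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])
clf = LinearDiscriminantAnalysis()
clf.fit(X, y)

print(clf.predict([[-0.8, -1]]))
```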

### Logistic Regression

Explanation:

Best at:

1. Ordering results by probability
2. Modelling marketing responses

Pros:

1. Simple to understand and explain
2. It seldom overfits
3. Using L1 & L2 regularization is effective in feature selection
4. Fast to train
5. Easy to train on big data thanks to its stochastic version

Cons:

1. You have to work hard to make it fit nonlinear functions
2. Can suffer from outliers

Sample Python Code:

```
print(__doc__)

# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target

h = .02  # step size in the mesh

logreg = linear_model.LogisticRegression(C=1e5)

# we fit the logistic regression classifier created above to the data.
logreg.fit(X, Y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())

plt.show()

```

### Linear Regression

Explanation:

Best at:

1. Baseline predictions
2. Econometric predictions
3. Modelling marketing responses

Pros:

1. Simple to understand and explain
2. It seldom overfits
3. Using L1 & L2 regularization is effective in feature selection
4. Fast to train
5. Easy to train on big data thanks to its stochastic version

Cons:

1. You have to work hard to make it fit nonlinear functions
2. Can suffer from outliers

Sample Python Code:

```
print(__doc__)

# Code source: Jaques Grobler
# License: BSD 3 clause

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test,  color='black')
plt.plot(diabetes_X_test, regr.predict(diabetes_X_test), color='blue',
         linewidth=3)

plt.xticks(())
plt.yticks(())

plt.show()
```

### SVD (Singular Value Decomposition)

Intuitive Explanation:
Mathematical Explanation:

Best at:

1. Recommender systems

Pros:

1. Can restructure data in a meaningful way

Cons:

1. Difficult to understand why data has been restructured in a certain way

Sample Python Code:

```
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the data points and their class labels
iris = load_iris()
raw_data = iris.data
labels = iris.target
samples, features = raw_data.shape

# Use the four feature columns as they are (no normalization or mean removal is applied here)
data = raw_data[:, :4]

def svd(data, S=2):
    # Calculate the SVD
    U, s, Vt = np.linalg.svd(data, full_matrices=False)
    Sig = np.diag(s[:S])
    # Keep only the first S columns of U
    newdata = U[:, :S]

    # This line can be used to reconstruct a rank-S approximation of the dataset
    # new = U[:, :S] @ Sig @ Vt[:S, :]

    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    colors = ['blue', 'red', 'black']
    for i in range(samples):
        # Color each point by its class label (the last feature column is not a label)
        ax.scatter(newdata[i, 0], newdata[i, 1], color=colors[labels[i]])
    plt.xlabel('SVD1')
    plt.ylabel('SVD2')
    plt.show()

svd(data, 2)
```

### PCA (Principal Component Analysis)

Explanation:

Best at:

1. Removing collinearity
2. Reducing dimensions of the dataset

Pros:

1. Can reduce data dimensionality

Cons:

1. Implies strong linear assumptions (components are weighted sums of the features)

Sample Python Code:

```
print(__doc__)

# PCA example with Iris Data-set
# Code source: Gaël Varoquaux
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
# Importing Axes3D registers the 3D projection on older matplotlib versions
from mpl_toolkits.mplot3d import Axes3D

from sklearn import decomposition
from sklearn import datasets

np.random.seed(5)

iris = datasets.load_iris()
X = iris.data
y = iris.target

fig = plt.figure(1, figsize=(4, 3))
plt.clf()
ax = fig.add_subplot(111, projection='3d', elev=48, azim=134)

# Project the four features onto the first three principal components
pca = decomposition.PCA(n_components=3)
pca.fit(X)
X = pca.transform(X)

for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
    ax.text3D(X[y == label, 0].mean(),
              X[y == label, 1].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap=plt.cm.nipy_spectral)

ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])

plt.show()
```

### K-means Clustering

Explanation:

Best at:

1. Segmentation

Pros:

1. Fast in finding clusters
2. Can detect outliers in multiple dimensions

Cons:

1. Suffers from multicollinearity
2. Assumes spherical clusters; can’t detect groups of other shapes
3. Unstable solutions, depends on initialization

Sample Python Code:

```
print(__doc__)

# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
# Importing Axes3D registers the 3D projection on older matplotlib versions
from mpl_toolkits.mplot3d import Axes3D

from sklearn.cluster import KMeans
from sklearn import datasets

np.random.seed(5)

iris = datasets.load_iris()
X = iris.data
y = iris.target

estimators = {'k_means_iris_3': KMeans(n_clusters=3),
              'k_means_iris_8': KMeans(n_clusters=8),
              'k_means_iris_bad_init': KMeans(n_clusters=3, n_init=1,
                                              init='random')}

fignum = 1
for name, est in estimators.items():
    fig = plt.figure(fignum, figsize=(4, 3))
    plt.clf()
    ax = fig.add_subplot(111, projection='3d', elev=48, azim=134)

    est.fit(X)
    labels = est.labels_

    ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(float))

    ax.xaxis.set_ticklabels([])
    ax.yaxis.set_ticklabels([])
    ax.zaxis.set_ticklabels([])
    ax.set_xlabel('Petal width')
    ax.set_ylabel('Sepal length')
    ax.set_zlabel('Petal length')
    fignum = fignum + 1

# Plot the ground truth
fig = plt.figure(fignum, figsize=(4, 3))
plt.clf()
ax = fig.add_subplot(111, projection='3d', elev=48, azim=134)

for name, label in [('Setosa', 0),
                    ('Versicolour', 1),
                    ('Virginica', 2)]:
    ax.text3D(X[y == label, 3].mean(),
              X[y == label, 0].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y)

ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
plt.show()
```

### Content Based Filtering

Explanation:

Best at:

1. Recommender systems

Pros:

1. Ability to recommend to users with unique tastes
2. No need for data on other users
3. Ability to recommend new and unpopular items

Cons:

1. Finding features about items can be difficult
2. Difficult to build profile for new users
3. Might not recommend items outside of a user's profile
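As a rough sketch of the idea, a content-based recommender can score items by how closely their features match a profile built from the items a user already liked. Everything below is hypothetical and chosen for illustration: the item names, the four genre features, and the choice of cosine similarity as the matching score.

```
import numpy as np

# Hypothetical items described by genre features:
# columns = [sci-fi, horror, romance, comedy]
item_names = ["space horror", "space drama", "romcom one", "romcom two", "horror comedy"]
item_features = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
], dtype=float)

def recommend(liked, features, top_n=2):
    # The user profile is the average feature vector of the liked items
    profile = features[liked].mean(axis=0)
    # Cosine similarity between every item and the profile
    norms = np.linalg.norm(features, axis=1) * np.linalg.norm(profile)
    scores = features @ profile / norms
    # Never recommend something the user has already liked
    scores[liked] = -np.inf
    return np.argsort(-scores)[:top_n]

top = recommend([0], item_features)
print([item_names[i] for i in top])
```

Only this one user's likes are needed, which matches the pros above, and the two romcom items score zero for this user, which illustrates the con about not recommending outside the profile.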

### Collaborative Filtering (CF)

Explanation:

Best at:

1. Recommender systems

Pros:

1. No feature selection is required, as CF does not need machine-recognizable content information about either users or items
2. Can suggest serendipitous items by observing the behavior of like-minded people

Cons:

1. CF systems are not content-aware, meaning that information about items is not considered when they produce recommendations
2. The user/rating matrix is sparse, so it can be hard to find users who have rated similar items
3. Cannot recommend an unrated item and tends to be biased towards popular items
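A minimal user-based sketch of the idea follows. It assumes a small hypothetical rating matrix in which 0 means "not rated", cosine similarity between users' rating rows, and a similarity-weighted average as the predicted rating; real systems typically treat missing ratings more carefully than this.

```
import numpy as np

# Hypothetical user/rating matrix: rows = users, columns = items, 0 = not rated
ratings = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 5, 1, 0],
    [1, 1, 2, 5, 0],
], dtype=float)

def predict_rating(ratings, user, item):
    # Other users who have rated this item
    rated = ratings[:, item] > 0
    rated[user] = False
    if not rated.any():
        return None  # nobody rated the item, so CF has nothing to go on

    # Cosine similarity between the target user and every user
    norms = np.linalg.norm(ratings, axis=1)
    sims = ratings @ ratings[user] / (norms * norms[user])
    total = sims[rated].sum()
    if total == 0:
        return None  # no overlap with any user who rated the item

    # Similarity-weighted average of the other users' ratings
    return float(sims[rated] @ ratings[rated, item] / total)

predicted = predict_rating(ratings, user=0, item=2)
unrated = predict_rating(ratings, user=0, item=4)
print(predicted, unrated)
```

User 0 rates items much like user 1 does, so the prediction for item 2 lands closer to user 1's rating of 5 than to user 2's rating of 2. Item 4 has no ratings at all, so nothing can be predicted for it, which is the third con above.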
