Summary
There are many machine learning algorithms, each with its own advantages and disadvantages. I have described algorithms commonly used in machine learning, together with their pros, cons, and sample code. For this post I have compiled information and videos from the web which I thought did a great job of explaining the concepts. The intention of this post isn't to teach you specific algorithms but to make you familiar with some of the concepts of machine learning. You can also use it as a future reference when in doubt about which algorithm you should consider using. I have used information provided by John Paul Mueller, Luca Massaron and Thales Sehn Korting for this post. I have not included neural networks, as they are quite difficult to explain in a small section within this post. If you are interested in learning more about neural networks (deep learning), you can start by checking out this link.
Support Vector Machines
Explanation:
SVMs (linear or otherwise) are inherently binary classifiers. However, there are various procedures for extending them to multiclass problems. The most common methods transform the problem into a set of binary classification problems, using one of two strategies (a minimal sketch of both follows the list):
1. One vs. the rest. For K classes, K binary classifiers are trained. Each determines whether an example belongs to its 'own' class versus any other class. The classifier with the largest output is taken to be the class of the example.
2. One vs. one. A binary classifier is trained for each pair of classes. A voting procedure is used to combine the outputs.
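For illustration, here is a minimal sketch of my own (not from the sources above) showing both strategies on the three-class iris dataset, using scikit-learn's OneVsRestClassifier and OneVsOneClassifier meta-estimators around a linear SVM:

from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)  # 3 classes

# One vs. the rest: one binary classifier per class (3 here).
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)
# One vs. one: one binary classifier per pair of classes (3*(3-1)/2 = 3 here).
ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)

print(len(ovr.estimators_), len(ovo.estimators_))  # 3 3
print(ovr.predict(X[:3]), ovo.predict(X[:3]))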
Best at:
- Character recognition
- Image recognition
- Text classification
Pros:
- Automatic nonlinear feature creation
- Can approximate complex nonlinear functions
Cons:
- Difficult to interpret when applying nonlinear kernels
- Scales poorly with the number of examples; beyond roughly 10,000 examples training starts taking too long
Sample Python Code:
print(__doc__)

# Author: Gael Varoquaux "gael dot varoquaux at normalesup dot org"
# License: BSD 3 clause

# Standard scientific Python imports
import matplotlib.pyplot as plt

# Import datasets, classifiers and performance metrics
from sklearn import datasets, svm, metrics

# The digits dataset
digits = datasets.load_digits()
plt.ion()

# The data that we are interested in is made of 8x8 images of digits; let's
# have a look at the first 4 images, stored in the `images` attribute of the
# dataset. If we were working from image files, we could load them using
# matplotlib.pyplot.imread. Note that each image must have the same size.
# For these images, we know which digit they represent: it is given in the
# 'target' of the dataset.
images_and_labels = list(zip(digits.images, digits.target))
for index, (image, label) in enumerate(images_and_labels[:4]):
    plt.subplot(2, 4, index + 1)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Training: %i' % label)

# To apply a classifier on this data, we need to flatten the images, to
# turn the data into a (samples, features) matrix:
n_samples = len(digits.images)
data = digits.images.reshape((n_samples, -1))

# Create a classifier: a support vector classifier
classifier = svm.SVC(gamma=0.001)

# We learn the digits on the first half of the digits
classifier.fit(data[:n_samples // 2], digits.target[:n_samples // 2])

# Now predict the value of the digit on the second half:
expected = digits.target[n_samples // 2:]
predicted = classifier.predict(data[n_samples // 2:])

print("Classification report for classifier %s:\n%s\n"
      % (classifier, metrics.classification_report(expected, predicted)))
print("Confusion matrix:\n%s" % metrics.confusion_matrix(expected, predicted))

images_and_predictions = list(zip(digits.images[n_samples // 2:], predicted))
for index, (image, prediction) in enumerate(images_and_predictions[:4]):
    plt.subplot(2, 4, index + 5)
    plt.axis('off')
    plt.imshow(image, cmap=plt.cm.gray_r, interpolation='nearest')
    plt.title('Prediction: %i' % prediction)

plt.show()
Random Forest
Explanation:
Best at:
- Suited to almost any machine learning problem
- Bioinformatics
Pros:
- Can work in parallel
- Seldom overfits
- Automatically handles missing values
- No need to transform any variable
- No need to tweak parameters
- Can be used by almost anyone with excellent results
Cons:
- Difficult to interpret
- Weaker on regression when estimating values at the extremities of the distribution of response values
- Biased in multiclass problems toward more frequent classes
Sample Python Code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt

# Assuming your data is in data.csv
dataset = pd.read_csv('data.csv', sep=',')
dataset.fillna(0, inplace=True)

# Turn your categorical variables into dummy variables
# (scikit-learn models need numeric inputs)
dataset = pd.get_dummies(dataset)

# Assuming your target is in the first column and your predictors are in the remaining columns
X, y = dataset.iloc[:, 1:], dataset.iloc[:, :1]

# Create a train/test split where 20% of the original data goes into the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Random Forest
clf = RandomForestClassifier(n_estimators=10)
yt = np.ravel(y_train)
clf = clf.fit(X_train, yt)
y_predict = clf.predict(X_test)
print("Rsquared for Random Forest: ", r2_score(y_test, y_predict))
Gradient Boosting
Explanation:
Best at:
- Suited to almost any machine learning problem
- Search engines (solving the problem of learning to rank)
Pros:
- It can approximate most nonlinear functions
- Best in class predictor
- Automatically handles missing values
- No need to transform any variable
Cons:
- It can overfit if run for too many iterations
- Sensitive to noisy data and outliers
- Doesn't work well without parameter tuning
Sample Python Code:
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_hastie_10_2(random_state=0)
X_train, X_test = X[:2000], X[2000:]
y_train, y_test = y[:2000], y[2000:]

# The number of weak learners (i.e. regression trees) is controlled by the parameter
# n_estimators. The size of each tree can be controlled either by setting the tree depth
# via max_depth or by setting the number of leaf nodes via max_leaf_nodes. The
# learning_rate is a hyper-parameter in the range (0.0, 1.0] that controls overfitting
# via shrinkage.
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
                                 max_depth=1, random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))
Adaboost (Adaptive Boost)
Explanation:
Zhaojun Zhang, a software engineer at Coursera, describes the differences between AdaBoost and gradient boosting quite well. As he mentions, both methods use a set of weak learners and try to boost them into a strong learner, which is assumed to be an additive combination of the weak learners.
Gradient boosting generates learners during the learning process. It builds a first learner to predict the values/labels of the samples and calculates the loss (the difference between the first learner's output and the real values). It then builds a second learner to predict that loss, and the process continues with a third, fourth, ... learner until a stopping criterion is reached.
AdaBoost requires the user to specify a set of weak learners (alternatively, it will randomly generate a set of weak learners before the real learning process). It learns the weights with which these learners are added together to form a strong learner. The weight of each learner depends on whether it predicts samples correctly or not: if a learner mispredicts many samples, its weight is reduced. The process repeats until convergence.
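To make the residual-fitting idea behind gradient boosting concrete, here is a minimal sketch of my own (not from the sources above) of two boosting rounds with squared loss, using scikit-learn decision stumps as the weak learners:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

# First weak learner predicts y directly.
tree1 = DecisionTreeRegressor(max_depth=1).fit(X, y)
residual = y - tree1.predict(X)

# Second weak learner predicts the residual (the loss) left by the first one.
tree2 = DecisionTreeRegressor(max_depth=1).fit(X, residual)

# The strong learner is the additive combination of the weak learners.
prediction = tree1.predict(X) + tree2.predict(X)
print("MSE after one tree: ", np.mean(residual ** 2))
print("MSE after two trees:", np.mean((y - prediction) ** 2))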
Best at:
- Face detection
Pros:
- Automatically handles missing values
- No need to transform any variable
- It doesn’t overfit easily
- Few parameters to tweak
- It can leverage many different weak-learners
Cons:
- Sensitive to noisy data and outliers
- Rarely delivers best-in-class predictions
Sample Python Code:
# AdaBoost Classification
import pandas
from sklearn import model_selection
from sklearn.ensemble import AdaBoostClassifier

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = pandas.read_csv(url, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]

seed = 7
num_trees = 30
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)
results = model_selection.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
Naive Bayes
Explanation:
Best at:
- Face recognition
- Sentiment analysis
- Spam detection
- Text classification
Pros:
- Easy and fast to implement, doesn’t require too much memory and can be used for online learning
- Easy to understand
- Takes into account prior knowledge
Cons:
- Strong and unrealistic feature independence assumptions
- Fails at estimating rare occurrences
- Suffers from irrelevant features
Sample Python Code:
# Email spam detection
# You will need a folder containing spam-only email (called 'spam' here)
# and a folder containing non-spam email only (called 'ham' here).
import os
import io
from pandas import DataFrame, concat
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def readFiles(path):
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            path = os.path.join(root, filename)
            inBody = False
            lines = []
            f = io.open(path, 'r', encoding='latin1')
            for line in f:
                if inBody:
                    lines.append(line)
                elif line == '\n':
                    inBody = True
            f.close()
            message = '\n'.join(lines)
            yield path, message

def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)
    return DataFrame(rows, index=index)

data = concat([dataFrameFromDirectory('e:/directory/spam', 'spam'),
               dataFrameFromDirectory('e:/directory/ham', 'ham')])

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)

classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)

examples = ['Free Viagra now!!!', "Hi Bob, how about a game of golf tomorrow?"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
print(predictions)
K-nearest Neighbors
Explanation:
Best at:
- Computer vision
- Multilabel tagging
- Recommender systems
- Spell checking problems
Pros:
- Fast, lazy training
- Can naturally handle extreme multiclass problems (like tagging text)
Cons:
- Slow and cumbersome in the predicting phase
- Can fail to predict correctly due to the curse of dimensionality
Sample Python Code:
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)
print(neigh.predict([[1.1]]))        # [0]
print(neigh.predict_proba([[0.9]]))  # [[ 0.66666667  0.33333333]]
Linear Discriminant Analysis (LDA)
Explanation:
Both LDA and PCA are linear transformation techniques: LDA is supervised, whereas PCA is unsupervised (it ignores class labels).
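As a minimal sketch of that difference (my own example, assuming scikit-learn is available): LDA's fit requires the class labels while PCA's does not, and LDA can produce at most C-1 components.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)  # 3 classes

pca = PCA(n_components=2).fit(X)                            # unsupervised: labels ignored
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)  # supervised: labels required

print(pca.transform(X).shape)  # (150, 2)
print(lda.transform(X).shape)  # (150, 2) -- at most C - 1 = 2 components for 3 classes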
Best at:
- Face recognition
- Reducing dimensions of the dataset
Pros:
- Can reduce data dimensionality
Cons:
- LDA produces at most C-1 feature projections, where C is the number of class labels
- LDA is a parametric method (it assumes unimodal Gaussian likelihoods)
Sample Python Code:
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])
clf = LDA()
clf.fit(X, y)
print(clf.predict([[-0.8, -1]]))
Logistic Regression
Explanation:
Best at:
- Ordering results by probability
- Modelling marketing responses
Pros:
- Simple to understand and explain
- It seldom overfits
- Using L1 & L2 regularization is effective in feature selection
- Fast to train
- Easy to train on big data thanks to its stochastic version
Cons:
- You have to work hard to make it fit nonlinear functions
- Can suffer from outliers
Sample Python Code:
print(__doc__)

# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model, datasets

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features
Y = iris.target

h = .02  # step size in the mesh

# Create an instance of the logistic regression classifier and fit the data.
logreg = linear_model.LogisticRegression(C=1e5)
logreg.fit(X, Y)

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1, figsize=(4, 3))
plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=Y, edgecolors='k', cmap=plt.cm.Paired)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')

plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.xticks(())
plt.yticks(())

plt.show()
Linear Regression
Explanation:
Best at:
- Baseline predictions
- Econometric predictions
- Modelling marketing responses
Pros:
- Simple to understand and explain
- It seldom overfits
- Using L1 & L2 regularization is effective in feature selection
- Fast to train
- Easy to train on big data thanks to its stochastic version
Cons:
- You have to work hard to make it fit nonlinear functions
- Can suffer from outliers
Sample Python Code:
print(__doc__)

# Code source: Jaques Grobler
# License: BSD 3 clause

import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model

# Load the diabetes dataset
diabetes = datasets.load_diabetes()

# Use only one feature
diabetes_X = diabetes.data[:, np.newaxis, 2]

# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test = diabetes.target[-20:]

# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))

# Plot outputs
plt.scatter(diabetes_X_test, diabetes_y_test, color='black')
plt.plot(diabetes_X_test, regr.predict(diabetes_X_test), color='blue', linewidth=3)
plt.xticks(())
plt.yticks(())
plt.show()
SVD (Singular Value Decomposition)
Intuitive Explanation:
Mathematical Explanation:
Best at:
- Recommender systems
Pros:
- Can restructure data in a meaningful way
Cons:
- Difficult to understand why data has been restructured in a certain way
Sample Python Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

# Load the iris data points and their class labels
iris = load_iris()
raw_data = iris.data
labels = iris.target
samples, features = raw_data.shape

data = np.mat(raw_data[:, :4])

def svd(data, S=2):
    # Calculate the SVD
    U, s, V = np.linalg.svd(data)
    Sig = np.mat(np.eye(S) * s[:S])
    # Keep only the first S columns of U
    newdata = U[:, :S]
    # This line could be used to reconstruct the dataset from the truncated factors
    # new = U[:, :S] * Sig * V[:S, :]
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    colors = ['blue', 'red', 'black']
    for i in range(samples):
        ax.scatter(newdata[i, 0], newdata[i, 1], color=colors[labels[i]])
    plt.xlabel('SVD1')
    plt.ylabel('SVD2')
    plt.show()

svd(data, 2)
PCA (Principal Component Analysis)
Explanation:
Best at:
- Removing collinearity
- Reducing dimensions of the dataset
Pros:
- Can reduce data dimensionality
Cons:
- Implies strong linear assumptions (components are weighted summations of the features)
Sample Python Code:
print(__doc__)

# PCA example with Iris Data-set
# Code source: Gaël Varoquaux
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection)
from sklearn import decomposition
from sklearn import datasets

np.random.seed(5)

iris = datasets.load_iris()
X = iris.data
y = iris.target

fig = plt.figure(1, figsize=(4, 3))
plt.clf()
ax = fig.add_axes([0, 0, .95, 1], projection='3d', elev=48, azim=134)
plt.cla()

pca = decomposition.PCA(n_components=3)
pca.fit(X)
X = pca.transform(X)

for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
    ax.text3D(X[y == label, 0].mean(),
              X[y == label, 1].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))

# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap=plt.cm.nipy_spectral)

ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])

plt.show()
K-means Clustering
Explanation:
Best at:
- Segmentation
Pros:
- Fast in finding clusters
- Can detect outliers in multiple dimensions
Cons:
- Suffers from multicollinearity
- Clusters are spherical; it can’t detect groups of other shapes
- Unstable solutions, depends on initialization
Sample Python Code:
print(__doc__)

# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection)
from sklearn.cluster import KMeans
from sklearn import datasets

np.random.seed(5)

iris = datasets.load_iris()
X = iris.data
y = iris.target

estimators = {'k_means_iris_3': KMeans(n_clusters=3),
              'k_means_iris_8': KMeans(n_clusters=8),
              'k_means_iris_bad_init': KMeans(n_clusters=3, n_init=1, init='random')}

fignum = 1
for name, est in estimators.items():
    fig = plt.figure(fignum, figsize=(4, 3))
    plt.clf()
    ax = fig.add_axes([0, 0, .95, 1], projection='3d', elev=48, azim=134)
    plt.cla()
    est.fit(X)
    labels = est.labels_

    ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(float))

    ax.xaxis.set_ticklabels([])
    ax.yaxis.set_ticklabels([])
    ax.zaxis.set_ticklabels([])
    ax.set_xlabel('Petal width')
    ax.set_ylabel('Sepal length')
    ax.set_zlabel('Petal length')
    fignum = fignum + 1

# Plot the ground truth
fig = plt.figure(fignum, figsize=(4, 3))
plt.clf()
ax = fig.add_axes([0, 0, .95, 1], projection='3d', elev=48, azim=134)
plt.cla()

for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
    ax.text3D(X[y == label, 3].mean(),
              X[y == label, 0].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))

# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y)

ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
plt.show()
Content Based Filtering
Explanation:
Best at:
- Recommender systems
Pros:
- Ability to recommend to users with unique tastes
- No need for data on other users
- Ability to recommend new and unpopular items
Cons:
- Finding features about items can be difficult
- Difficult to build profile for new users
- Might not recommend items outside of a user's profile
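Sample Python Code:
The original sources do not include code for this method, so the following is a minimal sketch of my own, assuming scikit-learn: items are represented by TF-IDF vectors of short, made-up descriptions, and we recommend the items most similar to one the user already liked.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical item catalogue: titles and short descriptions.
items = {
    'Movie A': 'space adventure with robots and aliens',
    'Movie B': 'romantic comedy set in Paris',
    'Movie C': 'robots rebel in a distant space colony',
    'Movie D': 'documentary about French cooking',
}
titles = list(items.keys())

# Build item feature vectors from their descriptions.
tfidf = TfidfVectorizer()
item_vectors = tfidf.fit_transform(items.values())

# Recommend the items most similar to one the user liked.
liked = 'Movie A'
similarities = cosine_similarity(item_vectors[titles.index(liked)], item_vectors).ravel()
ranking = sorted(zip(similarities, titles), reverse=True)
print([title for score, title in ranking if title != liked])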
Collaborative Filtering (CF)
Explanation:
Best at:
- Recommender systems
Pros:
- No feature selection required, as CF does not need machine-recognizable content information about either users or items
- Can suggest serendipitous items by observing the behavior of similar-minded people
Cons:
- CF systems are not content aware, meaning that information about items is not considered when producing recommendations
- The user/rating matrix is sparse, so it can be hard to find users who have rated similar items
- Cannot recommend an unrated item and tends to be biased towards popular items
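Sample Python Code:
Again, the original sources do not include code here, so this is a minimal sketch of my own of user-based collaborative filtering on a tiny, made-up rating matrix: a user's missing rating is predicted as a similarity-weighted average of other users' ratings for that item.

import numpy as np

# Hypothetical user-item rating matrix (rows = users, columns = items, 0 = unrated).
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Predict user 0's rating for item 2 from the ratings of similar users.
target_user, target_item = 0, 2
others = [u for u in range(len(ratings)) if u != target_user and ratings[u, target_item] > 0]
weights = np.array([cosine(ratings[target_user], ratings[u]) for u in others])
neighbour_ratings = np.array([ratings[u, target_item] for u in others])
predicted = np.dot(weights, neighbour_ratings) / weights.sum()
print("Predicted rating:", round(predicted, 2))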