Support Vector Machine

Non-parametric supervised model.

One main hyper-parameter, called the "C" parameter.

Uses a hyper-plane as the decision boundary. Uses a kernel function to handle data that is not linearly separable. Performs well in regression and binary classification.

Good choice for complex but small-to-medium data sets.

Maximum Margin Classifier

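Find the hyper-plane that has the biggest margin M (the distance to the closest training points), so that it has the lowest probability of mis-classifying any new data points. With labels y[i] in {-1, +1}, every training point must satisfy:

y[i](B[0] + B[1]x[i,1] + B[2]x[i,2] + B[3]x[i,3] + ...) >= M

subject to sum(j=1, p) B[j]^2 = 1, so that M is an actual distance.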
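
A minimal sketch of the margin idea, using made-up toy data and two hypothetical candidate hyper-planes (none of these numbers come from the notes). The helper `margin` divides by the length of B, which has the same effect as assuming the coefficients are scaled so that sum(B[j]^2) = 1. The plane whose closest point is farthest away is the better maximum margin candidate.

```python
import math

def margin(beta0, beta, X, y):
    """Smallest value of y[i] * f(x[i]) / ||B|| over all training points.

    A positive result means every point is on its correct side, and the
    value is the distance from the hyper-plane to the closest point.
    """
    norm = math.sqrt(sum(b * b for b in beta))
    return min(
        yi * (beta0 + sum(b * xj for b, xj in zip(beta, xi))) / norm
        for xi, yi in zip(X, y)
    )

# Toy data (illustrative only): two classes along the diagonal.
X = [(1.0, 1.0), (2.0, 2.0), (4.0, 4.0), (5.0, 5.0)]
y = [-1, -1, 1, 1]

# Two candidate hyper-planes that both separate the data.
m_a = margin(-6.0, (1.0, 1.0), X, y)  # plane x1 + x2 = 6, halfway between classes
m_b = margin(-5.0, (1.0, 1.0), X, y)  # plane x1 + x2 = 5, closer to the -1 class

print(m_a, m_b)  # the first plane leaves the larger margin
```

Both planes classify every point correctly, but the maximum margin classifier would prefer the first because its closest point is farther away.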

When the data are not separable, you soften the margin and accept some error.

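Introduce a slack variable E[i] for each point, so the constraint becomes:

y[i](B[0] + B[1]x[i,1] + B[2]x[i,2] + B[3]x[i,3] + ...) >= M(1 - E[i])

E[i] >= 0
sum(i=1, n) E[i] <= C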

C is a hyper-parameter, so the user gets to choose how much total error is acceptable. (In the book a lower C means a tighter margin. sk-learn is the opposite: its C is a penalty on errors, so a higher C means a tighter margin.)
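
A small sketch of how the slack values and the budget interact, under the book-style definition where C caps the total slack. The hyper-plane, the margin M = 2, and the one-dimensional toy points are all made up for illustration, and the coefficients are assumed to be scaled so that sum(B[j]^2) = 1.

```python
def slacks(beta0, beta, M, X, y):
    """Smallest E[i] >= 0 satisfying y[i] * f(x[i]) >= M * (1 - E[i])."""
    out = []
    for xi, yi in zip(X, y):
        f = beta0 + sum(b * xj for b, xj in zip(beta, xi))
        out.append(max(0.0, 1.0 - yi * f / M))
    return out

# Toy 1-D data (illustrative only); the point at 0.5 is on the wrong side.
X = [(-3.0,), (-1.0,), (0.5,), (1.0,), (4.0,)]
y = [-1, -1, -1, 1, 1]

E = slacks(0.0, (1.0,), 2.0, X, y)
total = sum(E)

# E[i] = 0      -> outside the margin, on the correct side
# 0 < E[i] <= 1 -> inside the margin, still on the correct side
# E[i] > 1      -> on the wrong side of the hyper-plane
print(E, total)
print("fits budget C = 3:", total <= 3)
print("fits budget C = 2:", total <= 2)
```

With this margin the total slack is 2.25, so the plane is allowed under a budget of C = 3 but not under C = 2. Shrinking the budget forces a narrower margin or a different plane.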

A kernel is used to create a curved decision boundary.

Rewriting the function from the maximum margin classifier in terms of inner products:

f(x) = B[0] + sum(i=1, n) a[i]<x, x[i]>

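The inner product <x, x[i]> is the dot product. Treated as a kernel K, it is the linear kernel. (SVM uses this.) Computing it for every pair of training points takes on the order of N^2 * P run time.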
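
A short sketch of the inner-product form of f(x) above, with made-up coefficients a[i] and toy training points (all values are illustrative assumptions). It checks that summing a[i]<x, x[i]> gives the same result as first collapsing the training points into ordinary coefficients B[j] = sum(i) a[i]x[i,j], and it builds the full table of pairwise dot products to show where the N^2 * P cost comes from.

```python
def dot(u, v):
    """Dot product of two equal-length vectors (P multiplications)."""
    return sum(a * b for a, b in zip(u, v))

def f_inner(x, beta0, alphas, X_train):
    """f(x) = B[0] + sum(i) a[i] * <x, x[i]>."""
    return beta0 + sum(a * dot(x, xi) for a, xi in zip(alphas, X_train))

def f_coef(x, beta0, alphas, X_train):
    """Same function written as B[0] + B[1]x[1] + B[2]x[2] + ..."""
    beta = [
        sum(a * xi[j] for a, xi in zip(alphas, X_train))
        for j in range(len(x))
    ]
    return beta0 + dot(beta, x)

# Illustrative values only.
X_train = [(1.0, 2.0), (3.0, -1.0), (0.0, 4.0)]
alphas = [0.5, -0.25, 0.0]  # a[i] = 0 means the point does not affect f(x)
beta0 = 1.0
x_new = (2.0, 1.0)

v_inner = f_inner(x_new, beta0, alphas, X_train)
v_coef = f_coef(x_new, beta0, alphas, X_train)

# All pairwise dot products: N * N entries, each costing P multiplications.
gram = [[dot(u, v) for v in X_train] for u in X_train]

print(v_inner, v_coef)
print(gram)
```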

If the data are not linearly separable, map them into a higher-dimensional space (add dimensions) where they can be separated.

K(x[i], x[i`]) = (1 + sum(j=1, p) x[i,j]x[i`,j])^2
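
To see why this kernel amounts to adding dimensions, here is a sketch for p = 2. The feature map `phi` is the standard expansion of (1 + x1*z1 + x2*z2)^2 and is written out here as an illustration; the two test vectors are arbitrary. The kernel computed directly in 2 dimensions matches a plain dot product in the 6-dimensional expanded space, without ever building that space.

```python
import math

def poly2_kernel(u, v):
    """Degree-2 polynomial kernel: (1 + <u, v>)^2."""
    return (1.0 + sum(a * b for a, b in zip(u, v))) ** 2

def phi(u):
    """Explicit 6-D feature map matching poly2_kernel for 2-D inputs."""
    x1, x2 = u
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

u = (1.0, 2.0)
v = (3.0, -1.0)

k_direct = poly2_kernel(u, v)                            # computed in 2-D
k_expanded = sum(a * b for a, b in zip(phi(u), phi(v)))  # computed in 6-D

print(k_direct, k_expanded)
```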

=> f(x) = B[0] + sum(i in S) a[i]K(x, x[i]), where S is the set of support vectors.

Well-known kernels:

Choice of kernel strongly depends on the pattern of the data.
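
A rough illustration of how the data pattern drives the kernel choice, using XOR-style toy points. Note the swap: this is a kernel perceptron, a much simpler learner than an SVM, used only because it fits in a few lines. It is not the SVM optimisation and has no intercept B[0] or C. What it shares with the SVM is the form of the decision function, a weighted sum of K(x, x[i]) over training points. The data, the epoch limit, and all function names are made up for this sketch.

```python
def linear_kernel(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(u, v):
    return (1.0 + linear_kernel(u, v)) ** 2

def decision(x, X, y, a, kernel):
    """f(x) = sum(i) a[i] * y[i] * K(x[i], x); points with a[i] = 0 drop out."""
    return sum(a[i] * y[i] * kernel(X[i], x) for i in range(len(X)))

def train_kernel_perceptron(X, y, kernel, epochs=20):
    """Kernel perceptron: bump a[i] each time point i is misclassified."""
    a = [0.0] * len(X)
    for _ in range(epochs):
        mistakes = 0
        for i in range(len(X)):
            if y[i] * decision(X[i], X, y, a, kernel) <= 0:
                a[i] += 1.0
                mistakes += 1
        if mistakes == 0:  # every training point is classified correctly
            break
    return a

def accuracy(X, y, a, kernel):
    hits = sum(
        (1 if decision(xi, X, y, a, kernel) > 0 else -1) == yi
        for xi, yi in zip(X, y)
    )
    return hits / len(X)

# XOR pattern: opposite corners share a label, so no straight line separates them.
X = [(-1.0, -1.0), (1.0, 1.0), (-1.0, 1.0), (1.0, -1.0)]
y = [-1, -1, 1, 1]

a_lin = train_kernel_perceptron(X, y, linear_kernel)
a_poly = train_kernel_perceptron(X, y, poly2_kernel)
acc_linear = accuracy(X, y, a_lin, linear_kernel)
acc_poly = accuracy(X, y, a_poly, poly2_kernel)

print("linear kernel accuracy:", acc_linear)
print("degree-2 polynomial kernel accuracy:", acc_poly)
```

On this pattern the linear kernel cannot classify all four points, while the degree-2 polynomial kernel can, because its expanded space contains the x1*x2 term that separates the corners.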