# Support Vector Machines¶

## 1. Background¶

Support Vector Machines (SVMs) are a class of supervised models and algorithms used for classification.

Given a set of training examples, each marked as belonging to a category, an SVM training algorithm builds a model that assigns new examples to a category.

An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall.

More formally, the goal is to find the hyperplane that maximizes the margin between the categories.

Any hyperplane can be written as the set of points ${\vec {x}}$ satisfying

${\vec {w}}\cdot {\vec {x}}-b=0,\,$

where ${\vec {w}}$ is a vector perpendicular (normal) to the hyperplane, ${\vec {x}}$ is a sample, and ${\tfrac {b}{\|{\vec {w}}\|}}$ is the offset of the hyperplane from the origin along ${\vec {w}}$.

Thus, the dot product ${\vec {w}}\cdot {\vec {x}}$ represents the projection of ${\vec {x}}$ along ${\vec {w}}$.

If the training data are linearly separable, we can select two parallel hyperplanes that separate the two classes of data, so that the distance between them is as large as possible.

The region bounded by these two hyperplanes is called the “margin”, and the maximum-margin hyperplane is the hyperplane that lies halfway between them. These hyperplanes can be described by the equations

${\vec {w}}\cdot {\vec {x}}-b=1\,$

and

${\vec {w}}\cdot {\vec {x}}-b=-1.\,$

The size of the margin is ${\tfrac {2}{\|{\vec {w}}\|}}$: subtracting the two equations gives ${\vec {w}}\cdot ({\vec {x}}_{1}-{\vec {x}}_{2})=2$ for points ${\vec {x}}_{1}$, ${\vec {x}}_{2}$ on the two hyperplanes, and projecting onto the unit normal ${\vec {w}}/\|{\vec {w}}\|$ yields a separation of ${\tfrac {2}{\|{\vec {w}}\|}}$.

What we want, then, is to maximize the distance between the planes; that is, to minimize $\|{\vec {w}}\|$.

As we also have to prevent data points from falling into the margin, we add the following constraint: for each $i$, either

${\vec {w}}\cdot {\vec {x}}_{i}-b\geq 1$, if $y_{i}=1$,

or

${\vec {w}}\cdot {\vec {x}}_{i}-b\leq -1$, if $y_{i}=-1$.

These constraints state that each data point must lie on the correct side of the margin.

This can be rewritten as:

$y_{i}({\vec {w}}\cdot {\vec {x}}_{i}-b)\geq 1,\quad {\text{ for all }}1\leq i\leq n.\qquad \qquad (1)$

We can put this together to get the optimization problem:

Minimize $\|{\vec {w}}\|$ subject to

${\displaystyle y_{i}({\vec {w}}\cdot {\vec {x}}_{i}-b)\geq 1,\quad i=1,\,\ldots ,\,n.}$

The ${\vec {w}}$ and $b$ that solve this problem determine our classifier,

${\displaystyle {\vec {x}}\mapsto \operatorname {sgn}({\vec {w}}\cdot {\vec {x}}-b)}$.
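As a minimal sketch, this decision rule can be written directly in NumPy; the values of ${\vec {w}}$ and $b$ below are hypothetical, not the solution of any actual optimization:

```
import numpy as np

# Hypothetical w and b (illustrative values only, not fitted to any data)
w = np.array([2.0, -1.0])
b = 0.5

def classify(x):
    # the SVM decision rule: sign of w.x - b
    return int(np.sign(np.dot(w, x) - b))

print(classify(np.array([1.0, 0.0])))   # w.x - b = 1.5, so +1
print(classify(np.array([0.0, 1.0])))   # w.x - b = -1.5, so -1
```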

Solving this optimization problem amounts to solving its Lagrangian dual through quadratic programming algorithms. A more detailed explanation can be found in the Wikipedia article on SVMs and in the SVM lecture video from the MIT OpenCourseWare Artificial Intelligence course.

An easy-to-see but important consequence of this geometric description is that the max-margin hyperplane is completely determined by those ${\displaystyle {\vec {x}}_{i}}$ which lie nearest to it. These ${\displaystyle {\vec {x}}_{i}}$ are called support vectors.
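This property can be checked empirically: refitting on the support vectors alone recovers the same hyperplane. A small sketch on a made-up, linearly separable toy set (the data and the large `C` value are illustrative assumptions):

```
import numpy as np
from sklearn import svm

# Made-up, linearly separable toy data (illustrative only)
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [3., 3.], [3., 4.], [4., 3.]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin problem described above
clf = svm.SVC(kernel='linear', C=1e6).fit(X, y)
print("support vectors:\n", clf.support_vectors_)

# Refit on the support vectors alone: the hyperplane should be unchanged
clf_sv = svm.SVC(kernel='linear', C=1e6).fit(clf.support_vectors_, y[clf.support_])
print(np.allclose(clf.coef_, clf_sv.coef_, atol=1e-3))
```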

## Nonlinear classification: The Kernel Trick¶

In 1992, Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik suggested a way to create nonlinear classifiers by applying the kernel trick to maximum-margin hyperplanes.

Essentially, every dot product in the algorithm is replaced by a nonlinear kernel function ${\displaystyle k({\vec {x_{i}}},{\vec {x_{j}}})}$.

This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space.

The kernel is related to the transform

${\displaystyle \varphi ({\vec {x_{i}}})}$ by the equation

${\displaystyle k({\vec {x_{i}}},{\vec {x_{j}}})=\varphi ({\vec {x_{i}}})\cdot \varphi ({\vec {x_{j}}})}$.

The weight vector ${\vec {w}}$ also lives in the transformed space, with

${\displaystyle \textstyle {\vec {w}}=\sum _{i}\alpha _{i}y_{i}\varphi ({\vec {x}}_{i})}$.

A number of kernels exist for solving nonlinear problems, including polynomial, Gaussian radial basis function (RBF), and hyperbolic tangent (sigmoid) kernels.
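As a quick sanity check of the identity $k({\vec {x_{i}}},{\vec {x_{j}}})=\varphi ({\vec {x_{i}}})\cdot \varphi ({\vec {x_{j}}})$, the homogeneous degree-2 polynomial kernel $k(x,z)=(x\cdot z)^{2}$ in two dimensions has the explicit feature map $\varphi (x)=(x_{1}^{2},{\sqrt {2}}x_{1}x_{2},x_{2}^{2})$:

```
import numpy as np

# Explicit feature map for the degree-2 homogeneous polynomial kernel in 2-D
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

k_direct = np.dot(x, z) ** 2        # kernel evaluated in the input space
k_mapped = np.dot(phi(x), phi(z))   # same value via the explicit feature map
print(k_direct, k_mapped)           # both equal (x.z)^2 = 16
```

This identity is what lets the optimization work with $k$ alone, without ever computing $\varphi$ explicitly.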

## 2. Simple Linear SVM example¶

For this, we’ll consider the `iris` dataset provided in the sklearn datasets.

The iris dataset consists of 150 samples: 50 from each of three classes (species of Iris flowers), with four features per sample (Sepal Length, Sepal Width, Petal Length and Petal Width).

For visualization purposes, let’s select only two classes and only the first two features in the data.

```
# Visualize data
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 1, :2] # select only the first two features for classes 0 and 2 (exclude class 1)
y = y[y != 1] # exclude class 1
n_sample = len(X)
# samples are ordered so we need to randomize order so we don't pick/exclude samples from only one class
np.random.seed(0)
order = np.random.permutation(n_sample)
X = X[order]
y = y[order].astype(float) # np.float was removed from NumPy; use the builtin float
# use a 90/10 train/test split; int(.9 * n_sample) keeps the split index an integer
X_train = X[:int(.9 * n_sample)]
y_train = y[:int(.9 * n_sample)]
X_test = X[int(.9 * n_sample):]
y_test = y[int(.9 * n_sample):]
# Plot the training points
plt.xlabel('Sepal length (cm)') # iris measurements are in centimeters
plt.ylabel('Sepal width (cm)')
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm)
plt.title('Samples on Feature Space')
plt.show()
```

```
# we create an instance of SVM and fit our data. We do not scale our
# data since we want to plot the support vectors
# fit and plot the model
C = 1.0 # SVM regularization parameter
clf = svm.SVC(kernel='linear', C=C).fit(X_train, y_train)
```
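Once fitted, the hyperplane parameters of a linear-kernel model can be inspected directly. A self-contained sketch that repeats the fit from the snippets above; note that sklearn parameterizes the hyperplane as ${\vec {w}}\cdot {\vec {x}}+\text{intercept}$, so the $b$ of Section 1 is the negated intercept:

```
import numpy as np
from sklearn import svm, datasets

# Recreate the fit from the snippets above so this sketch is self-contained
iris = datasets.load_iris()
X, y = iris.data, iris.target
X, y = X[y != 1, :2], y[y != 1]
np.random.seed(0)
order = np.random.permutation(len(X))
X, y = X[order], y[order].astype(float)
X_train, y_train = X[:int(.9 * len(X))], y[:int(.9 * len(X))]

clf = svm.SVC(kernel='linear', C=1.0).fit(X_train, y_train)

# For a linear kernel, the fitted hyperplane is exposed directly:
print("w =", clf.coef_[0])          # normal vector of the separating hyperplane
print("b =", -clf.intercept_[0])    # sklearn stores w.x + intercept, so b = -intercept
print("support vectors per class:", clf.n_support_)
```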

## Visualize the linear SVM model¶

```
# Visualize model
h = .02 # step size in the mesh
plt.clf()
# plot samples
plt.scatter(X[:, 0], X[:, 1], c=y, zorder=10, cmap=plt.cm.coolwarm)
# Circle out the test data
plt.scatter(X_test[:, 0], X_test[:, 1], s=80, edgecolors='k', linestyles=['--'], facecolors='none', zorder=10)
plt.axis('tight')
x_min, x_max = X[:, 0].min(), X[:, 0].max()
y_min,y_max = X[:, 1].min(), X[:, 1].max()
XX, YY = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=80,
facecolors='none', zorder=10)
# Evaluate the decision function over the grid and reshape for plotting
Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()])
Z = Z.reshape(XX.shape)
#plt.contourf(XX, YY, Z, cmap=plt.cm.coolwarm, alpha=0.8)
plt.contour(XX, YY, Z, colors=['k', 'k', 'k'], linestyles=['--', '-', '--'],
levels=[-1, 0, 1]) # decision boundary (0) and margins (+/-1)
plt.xlabel('Sepal length (cm)')
plt.ylabel('Sepal width (cm)')
plt.xlim(XX.min(), XX.max())
plt.ylim(YY.min(), YY.max())
plt.title("Linear SVM")
plt.show()
print("Predicted values: \n",clf.predict(X_test))
print("Ground truth: \n",y_test)
print("Accuracy:",sum(clf.predict(X_test)==y_test)/len(y_test),"\n")
```
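The manual accuracy computation in the snippet above is equivalent to the estimator's built-in `score` method, which returns mean accuracy. A minimal sketch on a made-up toy set:

```
import numpy as np
from sklearn import svm

# Made-up, linearly separable toy data (illustrative only)
X = np.array([[0., 0.], [1., 1.], [2., 2.], [3., 3.]])
y = np.array([0, 0, 1, 1])
clf = svm.SVC(kernel='linear', C=1.0).fit(X, y)

# sum(pred == truth) / len(truth) is exactly what score() computes
manual = np.sum(clf.predict(X) == y) / len(y)
print(manual == clf.score(X, y))
```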

We can see that, in this example, the linear SVM did a good job of classifying the two classes, even for the test samples (circled in black).

## 3. Nonlinear SVMs using different Kernels¶

```
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets, svm
iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[:, :2]
#y = y[y != 0] # use if you want to exclude a class
n_sample = len(X)
np.random.seed(0)
order = np.random.permutation(n_sample)
X = X[order]
y = y[order].astype(float) # np.float was removed from NumPy; use the builtin float
h = .02 # step size in the mesh
X_train, y_train = X[:int(.8 * n_sample)], y[:int(.8 * n_sample)]
X_test, y_test = X[int(.8 * n_sample):], y[int(.8 * n_sample):]
# fit and plot the model
for fig_num, kernel in enumerate(('linear', 'rbf', 'poly')):
    clf = svm.SVC(kernel=kernel, gamma=10)
    clf.fit(X_train, y_train)
    plt.figure(fig_num)
    plt.clf()
    # plot samples
    plt.scatter(X[:, 0], X[:, 1], c=y, zorder=10, cmap=plt.cm.Blues)
    # Circle out the test data
    plt.scatter(X_test[:, 0], X_test[:, 1], s=80, edgecolors='k',
                facecolors='none', zorder=10)
    plt.axis('tight')
    x_min, x_max = X[:, 0].min(), X[:, 0].max()
    y_min, y_max = X[:, 1].min(), X[:, 1].max()
    XX, YY = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    #Z = clf.decision_function(np.c_[XX.ravel(), YY.ravel()]) # alternative: plot the decision function
    # Predict over the grid and put the result into a color plot
    Z = clf.predict(np.c_[XX.ravel(), YY.ravel()])
    Z = Z.reshape(XX.shape)
    plt.contourf(XX, YY, Z, cmap=plt.cm.coolwarm, alpha=0.8)
    plt.xlabel('Sepal length (cm)')
    plt.ylabel('Sepal width (cm)')
    plt.xlim(XX.min(), XX.max())
    plt.ylim(YY.min(), YY.max())
    plt.xticks(())
    plt.yticks(())
    print(kernel, ":")
    plt.show()
    print("Predicted values: \n", clf.predict(X_test))
    print("Ground truth: \n", y_test)
    print("Accuracy:", sum(clf.predict(X_test) == y_test) / len(y_test), "\n")
```
