Kernel Trick

SVMs

For training data \(\Xbf \in \R^{n \times d} = (\xbf_1, \dots, \xbf_n)^T \) with labels \(\ybf \in \{-1,1\}^n = (y_1,\dots,y_n)^T\) Support Vector Machines aim to find a hyperplane \[\mathcal{L} = \{\xbf \in \R^d: f(\xbf) = \beta_0 + \beta^T \xbf = 0\}\] which separates the data according to its labels \[\sgn f(\xbf_i) = y_i.\]

The distance of a point \(\xbf\) to the hyperplane \(\mathcal{L}\) is \[\frac{1}{\norm{\beta}}|\beta^T\xbf + \beta_0| = \frac{1}{\norm{\beta}}f(\xbf).\]

For arbitrary \(\xbf_0 \in \mathcal{L}\) we have \begin{align} |\beta^T \xbf + \beta_0| &=|\beta^T \xbf - \beta^T \xbf_0| \\ &= |\beta^T (\xbf - \xbf_0)| \\ &\le \norm{\beta} \norm{\xbf - \xbf_0} \end{align} So \[ \frac{1}{\norm{\beta}}|\beta^T \xbf + \beta_0| \le \norm{\xbf - \xbf_0} \] and for \(\xbf_0 = \xbf - \frac{\beta^T \xbf + \beta_0}{\norm{\beta}^2} \beta \) \[\xbf_0 \in \mathcal{L}, \quad \norm{\xbf - \xbf_0} = \frac{1}{\norm{\beta}}|\beta^T \xbf + \beta_0|.\]

\(\Box\)

If the data is can be linearly separated with \(\beta_0, \beta\) we have \[y_i f(\xbf_i) > 0\] and thus \[\frac{1}{\norm{\beta}} y_i (\beta^T \xbf_i + \beta_0) \ge M\] with \[M = \min_{1 \le i \le n} \frac{1}{\norm{\beta}} y_i f(\xbf_i).\] We can find \(\tilde \beta = \frac{1}{\norm{\beta}M}\beta, \tilde \beta_0 = \frac{1}{\norm{\beta}M}\beta_0\) such that \[y_i(\tilde\beta^T \xbf_i + \tilde \beta_0 )\ge 1\] and \[\min_{1 \le i \le n} \frac{1}{\norm{\tilde \beta}} y_i (\tilde\beta^T \xbf_i + \tilde \beta_0 ) = \frac{1}{\norm{\tilde \beta}}.\] Therefore, if we want to maximise the margin \(M\) it is sufficient to find \begin{gather} \min_{\beta, \beta_0} \norm{\beta} \\ \text{s.t.} \enspace y_i(\beta^T \xbf_i + \beta_0) \ge 1. \end{gather}

The solution \(\beta^*, \beta^*_0\) minimises \[L_p(\beta, \beta_0) = \frac{1}{2}\norm{\beta}^2 + \sum_{i=1}^n \alpha_i (y_i (\beta^T \xbf_i + \beta_0) - 1), \quad \alpha_i \ge 0.\] Differentiating yields \[ \ybf^T(\alpha^T \Xbf) = \beta, \quad \ybf^T \alpha = 0 \] at a local minimum.
We see by substituting for \(\beta\) it is equivalent to maximise \begin{gather} L_d(\alpha) = (1,\dots,1)^T \alpha - \frac{1}{2} \alpha^T (\ybf \ybf^T .\Xbf^T \Xbf ) \alpha \\ \text{s.t. } \alpha \ge 0 \text{ and } \ybf^T \alpha = 0 \end{gather} which is a quadratic program.

It holds that either \(\alpha_i = 0\) or \(y_i (\beta^T \xbf_i + \beta_0) = 1\). Thus an \(\xbf_i\) with \(\alpha_i > 0\) lies exactly on the boundary and is called an support vector.

Predictions are done by \[G(\xbf) = \sgn (\beta^T \xbf + \beta_0) = \sgn \left( \sum_{i=1}^n \alpha_i y_i \xbf_i^T \xbf + \beta_0 \right)\] where \(\beta_0 = -\frac{1}{2}(\max_{y_i =-1} \beta^T \xbf_i + \min_{y_i=1} \beta^T \xbf_i).\)

Kernel Trick

Consider following data:

The inner class is ellipse shaped. How do we find a hyperplane which separates both classes?
In two dimensions we don't.
Instead, we introduce a new feature, the distance of the points to the origin. \[ \phi(x_1,x_2) := (x_1, x_2, \sqrt{x_1^2 + x_2^2})^T \]

With three dimensions we are able to find a plane:

The kernel trick is that we can implicitly go to higher dimensions by introducing a kernel \[k(\xbf_1, \xbf_2) = \phi(\xbf_1)^T \phi(\xbf_2).\] We wouldn't even need to specifiy features and could directly specify a positive definte kernel.

Now simply replace all occurences of \(\xbf_i^T \xbf_j\) with \(k(\xbf_i, \xbf_j)\).

That is, we maximise \begin{gather} L_d(\alpha) = (1,\dots,1)^T \alpha - \frac{1}{2} \alpha^T (\ybf \ybf^T . k(\Xbf, \Xbf) ) \alpha \\ \text{s.t. } \alpha \ge 0 \text{ and } \ybf^T \alpha = 0 \end{gather} and predict with \[G(\xbf) = \sgn (\beta^T \xbf + \beta_0) = \sgn \left( \sum_{i=1}^n \alpha_i y_i k(\xbf_i, \xbf) + \beta_0 \right)\] where \(\beta_0 = -\frac{1}{2}(\max_{y_j =-1} \sum_{i=1}^n \alpha_i y_i k(\xbf_i, \xbf_j) + \min_{y_i=1} \sum_{i=1}^n \alpha_i y_i k(\xbf_i, \xbf_j).\)