1. Some materials are taken from Victor Kitov's machine learning course
General idea of SVM: select the hyperplane that maximizes the margin between the classes
Discriminant function $$g(x) = w_0 + w_1x_1 + w_2x_2 = w_0 + \langle w, x \rangle = w_0 + w^\top x $$
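To make the notation concrete, here is a minimal NumPy sketch (the weights and the sample point are arbitrary illustrative values, not from this notebook): the predicted class is $\operatorname{sign}(g(x))$.

```python
import numpy as np

w0 = -1.0                 # bias term w_0 (illustrative)
w = np.array([2.0, 0.5])  # weight vector w (illustrative)
x = np.array([1.0, 3.0])  # a single observation

g = w0 + w @ x            # g(x) = w_0 + <w, x>
print(g, np.sign(g))      # 2.5, class +1
```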
By the Karush-Kuhn-Tucker conditions:
$$\begin{cases} \mathcal{L}(w,w_0,\lambda) = \frac{1}{2} ||w||^2 - \sum\limits_i \lambda_i \left( y_i(\langle w, x_i \rangle + w_0 ) - 1\right) \rightarrow \min\limits_{w,w_0}\max\limits_{\lambda} \\ \lambda_i \geq 0 \quad i=1\dots n\\ \lambda_i = 0 \text{, or } y_{i}(\langle w, x_i \rangle + w_0)=1 \quad i=1\dots n \end{cases}$$

non-informative observations: $y_{i}(\langle w, x_i \rangle + w_0)>1$ (equivalently, $\lambda_i = 0$)
support vectors: $y_{i}(\langle w, x_i \rangle + w_{0})=1$ (equivalently, $\lambda_i > 0$)
Necessary conditions (stationarity of the Lagrangian):

$$\frac{\partial \mathcal{L}}{\partial w} = 0 \Rightarrow w = \sum\limits_i \lambda_i y_i x_i, \qquad \frac{\partial \mathcal{L}}{\partial w_0} = 0 \Rightarrow \sum\limits_i \lambda_i y_i = 0$$
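As a sanity check (a sketch, not one of this notebook's demo functions): scikit-learn's `SVC` with a very large `C` approximates the hard-margin problem, and its `dual_coef_` stores $\lambda_i y_i$ for the support vectors, so the stationarity condition $w = \sum_i \lambda_i y_i x_i$ can be verified numerically.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.6, random_state=0)
clf = SVC(kernel='linear', C=1e6).fit(X, y)   # C -> infinity ~ hard margin

# dual_coef_ holds lambda_i * y_i for support vectors only
w_from_dual = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_))    # True: w = sum_i lambda_i y_i x_i
print(clf.support_vectors_)                   # the x_i with lambda_i > 0
```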
interact(lin_sep_svm_demo, class_sep=FloatSlider(min=0.4, max=4, step=0.1, value=2))
And minimize:
$$ \frac{1}{2} ||w||^2 + C\sum\limits_i\xi_i \rightarrow \min\limits_{w,w_0,\xi} $$

That brings us to the following optimization task:

$$ \begin{cases} \frac{1}{2} ||w||^2 + C\sum\limits_i\xi_i \rightarrow \min\limits_{w,w_0,\xi} \\ y_i(\langle w, x_i \rangle + w_0 ) \geq 1 - \xi_i \quad i=1\dots n \\ \xi_i \geq 0 \quad i=1\dots n \end{cases} $$
Applying the same procedure, we obtain the following dual problem: $$\begin{cases} \mathcal{L}(\lambda) = \sum\limits_i\lambda_i - \frac{1}{2} \sum\limits_i\sum\limits_j \lambda_i \lambda_j y_i y_j \langle x_i, x_j \rangle \rightarrow \max\limits_\lambda \\ 0 \leq \lambda_i \leq C \quad i=1\dots n \\ \sum\limits_i \lambda_iy_i = 0 \end{cases}$$
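To see the dual at work, here is a sketch on synthetic data (assuming SciPy is available; this is not the notebook's demo): hand the dual to a generic SLSQP solver and recover $w$ and $w_0$ from the KKT conditions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1.0] * 20 + [1.0] * 20)
C = 1.0

G = (y[:, None] * X) @ (y[:, None] * X).T      # G_ij = y_i y_j <x_i, x_j>
neg_dual = lambda lam: 0.5 * lam @ G @ lam - lam.sum()  # minimize -L(lambda)

res = minimize(neg_dual, np.zeros(len(y)), method='SLSQP',
               bounds=[(0, C)] * len(y),
               constraints={'type': 'eq', 'fun': lambda lam: lam @ y})
lam = res.x
w = (lam * y) @ X                               # stationarity: w = sum_i lambda_i y_i x_i
sv = (lam > 1e-6) & (lam < C - 1e-6)            # margin support vectors (assumes some exist)
w0 = np.mean(y[sv] - X[sv] @ w)                 # from y_i(<w, x_i> + w_0) = 1
print(w, w0)
```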
interact(lin_sep_svm_demo_C, class_sep=FloatSlider(min=0.2, max=4, value=2, step=0.2), C=FloatSlider(min=0.002, max=10, step=0.002, value=1))
Problem $$ \begin{cases} \frac{1}{2} ||w||^2 + C\sum\limits_i\xi_i \rightarrow \min\limits_{w,w_0,\xi} \\ y_i(\langle w, x_i \rangle + w_0 ) \geq 1 - \xi_i \quad i=1\dots n \\ \xi_i \geq 0 \quad i=1\dots n \end{cases} $$ can be rewritten as $$ \frac{1}{2C} ||w||^2 + \sum\limits_i\max(0,1-M_i) \rightarrow \min\limits_{w,w_0}, $$ where $M_i = y_i(\langle w, x_i \rangle + w_0)$ is the margin of object $x_i$: at the optimum the slack is exactly $\xi_i = \max(0, 1 - M_i)$.
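A small sketch (the data and weights below are illustrative values, not from the notebook) that evaluates this unconstrained hinge-loss objective directly:

```python
import numpy as np

def svm_objective(w, w0, X, y, C):
    # M_i = y_i (<w, x_i> + w_0); optimal slack is xi_i = max(0, 1 - M_i)
    margins = y * (X @ w + w0)
    return (w @ w) / (2 * C) + np.maximum(0.0, 1.0 - margins).sum()

# Hypothetical two-point sample, one per class
X = np.array([[1.0, 2.0], [-1.0, -1.5]])
y = np.array([1.0, -1.0])
print(svm_objective(np.array([0.5, 0.5]), 0.0, X, y, C=1.0))  # 0.25
```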
demo_nonlin_data()
Discriminant function: the stationarity condition $w = \sum_i \lambda_i y_i x_i$ gives $$ g(x) = \sum\limits_i \lambda_i y_i \langle x_i, x \rangle + w_0, $$ so new points enter only through inner products, which can be replaced by a kernel: $g(x) = \sum_i \lambda_i y_i K(x_i, x) + w_0$.
Homogeneous polynomial kernel $$ \begin{eqnarray*} K(x,z) & = & (x^{T}z)^{2}=(x_{1}z_{1}+x_{2}z_{2})^{2}\\ & = & x_{1}^{2}z_{1}^{2}+x_{2}^{2}z_{2}^{2}+2x_{1}z_{1}x_{2}z_{2}\\ & = & \phi^{T}(x)\phi(z) \end{eqnarray*} $$
for $\phi(x)=(x_{1}^{2},x_{2}^{2},\sqrt{2}x_{1}x_{2})$
Inhomogeneous polynomial kernel $$ \begin{eqnarray*} K(x,z) & = & (1+x^{T}z)^{2}=(1+x_{1}z_{1}+x_{2}z_{2})^{2}\\ & = & 1+x_{1}^{2}z_{1}^{2}+x_{2}^{2}z_{2}^{2}+2x_{1}z_{1}+2x_{2}z_{2}+2x_{1}z_{1}x_{2}z_{2}\\ & = & \phi^{T}(x)\phi(z) \end{eqnarray*} $$ for $\phi(x)=(1,\,x_{1}^{2},\,x_{2}^{2},\,\sqrt{2}x_{1},\,\sqrt{2}x_{2},\,\sqrt{2}x_{1}x_{2})$
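A quick numerical check (a sketch, with arbitrary test points) that the inhomogeneous kernel indeed equals the inner product of the feature maps given above:

```python
import numpy as np

def phi(x):
    # Explicit feature map for K(x, z) = (1 + x^T z)^2 in 2D
    return np.array([1.0, x[0]**2, x[1]**2,
                     np.sqrt(2) * x[0], np.sqrt(2) * x[1],
                     np.sqrt(2) * x[0] * x[1]])

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
K = (1 + x @ z) ** 2
print(np.isclose(K, phi(x) @ phi(z)))   # True
```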
Mercer Theorem
Function $K(u, v)$ is a kernel iff it is symmetric, $K(u, v) = K(v, u)$, and non-negative definite: for any finite set of points $u_1, \dots, u_m$ the Gram matrix $\left[K(u_i, u_j)\right]_{i,j=1}^{m}$ is positive semi-definite.
We can construct new kernels from other kernels: for kernels $K_1, K_2$, the combinations $\alpha K_1 + \beta K_2$ (with $\alpha, \beta \geq 0$), the product $K_1 \cdot K_2$, and $f(u)\,K_1(u, v)\,f(v)$ for an arbitrary function $f$ are again kernels, as illustrated by the sketch below.
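A sketch illustrating both points numerically (synthetic sample, assumed parameters): the sum of an RBF kernel and a linear kernel is again a kernel, so its Gram matrix on any finite sample should be symmetric positive semi-definite, up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))

def rbf(u, v, gamma=0.5):
    return np.exp(-gamma * np.sum((u - v) ** 2))

# K1 + K2 (RBF + linear) is a kernel, so its Gram matrix must be PSD
K = np.array([[rbf(u, v) + u @ v for v in X] for u in X])
eigvals = np.linalg.eigvalsh(K)
print(np.allclose(K, K.T), eigvals.min() >= -1e-10)   # symmetric, PSD
```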
YouTubeVideo('3liCbRZPrZA', width=640, height=480)
interact(lin_sep_svm_demo_kernel_C, class_sep=FloatSlider(min=0.2, max=4, value=2, step=0.2), kernel=['rbf', 'linear', 'poly'], coef0=FloatSlider(min=0, max=10, step=0.2, value=0), C=FloatSlider(min=0.002, max=10, step=0.002, value=1), degree=FloatSlider(min=2, max=3, step=1, value=2), gamma=FloatSlider(min=0.01, max=5, step=0.01, value=0.5),)
$\xi_i \geq 0$, $\hat \xi_i \geq 0$ are slack variables that allow observations to leave the $\epsilon$-tube: $$ y_i \leq g(x_i) + \epsilon + \xi_i $$ $$ y_i \geq g(x_i) - \epsilon - \hat \xi_i $$
Optimization task
$$ \begin{cases} C \sum_{i=1}^n (\hat \xi_i + \xi_i) + \frac{1}{2}\|w\|^2 \rightarrow \min\limits_{w, w_0, \xi, \hat \xi} \\ g(x_i) - y_i \leq \epsilon + \xi_i \quad i=1\dots n \\ y_i - g(x_i) \leq \epsilon + \hat{\xi}_i \quad i=1\dots n \\ \xi_i \geq 0, \hat \xi_i \geq 0 \quad i=1\dots n \end{cases} $$
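A minimal scikit-learn sketch (synthetic data and assumed hyperparameters, not the notebook's demo): points strictly inside the $\epsilon$-tube get zero slack and do not become support vectors.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = np.sort(rng.uniform(0, 5, (60, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 60)   # noisy sine curve

svr = SVR(kernel='rbf', C=1.0, epsilon=0.1).fit(X, y)
print(svr.support_.shape[0], 'support vectors out of', len(y))
```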