1. Some materials are taken from the machine learning course of Victor Kitov
Vector $w$ is orthogonal to hyperplane $w^{T}x+w_{0}=0$
Proof: Consider arbitrary $x_{A},x_{B}\in\{x:\,w^{T}x+w_{0}=0\}$: $$ \begin{align} w^{T}x_{A}+w_{0}=0 \quad \text{ (1)}\\ w^{T}x_{B}+w_{0}=0 \quad \text{ (2)} \end{align} $$ By subtracting (2) from (1), obtain $w^{T}(x_{A}-x_{B})=0$, so $w$ is orthogonal to the hyperplane.
The signed distance from point $x$ to the hyperplane $w^{T}x+w_{0}=0$ is equal to $\frac{w^{T}x+w_{0}}{\left\lVert w\right\rVert }$ (its absolute value gives the geometric distance).
Proof: Project $x$ onto the hyperplane; let the projection be $p$ and the complement $h=x-p$, orthogonal to the hyperplane. Then $$ x=p+h $$
Since $p$ lies on the hyperplane, $$ w^{T}p+w_{0}=0 $$
Since $h$ is orthogonal to the hyperplane and, by the orthogonality statement above, collinear with $w$: $$ h=r\frac{w}{\left\lVert w\right\rVert },\quad r\in\mathbb{R}\text{ is the signed distance to the hyperplane}. $$
$$ x=p+r\frac{w}{\left\lVert w\right\rVert } $$
After multiplying by $w^{T}$ and adding $w_{0}$: $$ w^{T}x+w_{0}=w^{T}p+w_{0}+r\frac{w^{T}w}{\left\lVert w\right\rVert }=r\left\lVert w\right\rVert $$
because $w^{T}p+w_{0}=0$ and $\left\lVert w\right\rVert =\sqrt{w^{T}w}$. So we get that $$ r=\frac{w^{T}x+w_{0}}{\left\lVert w\right\rVert } $$
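A quick numerical check of this formula (a minimal sketch; the hyperplane coefficients below are arbitrary illustrative values):

```python
import numpy as np

# Illustrative hyperplane w^T x + w0 = 0 in 2D: x1 + x2 - 1 = 0
w = np.array([1.0, 1.0])
w0 = -1.0

def signed_distance(x, w, w0):
    """Signed distance r = (w^T x + w0) / ||w|| from point x to the hyperplane."""
    return (w @ x + w0) / np.linalg.norm(w)

x = np.array([2.0, 2.0])
r = signed_distance(x, w, w0)
print(r)                      # 3 / sqrt(2), about 2.12

# Check: the projection p = x - r * w / ||w|| lies on the hyperplane
p = x - r * w / np.linalg.norm(w)
print(w @ p + w0)             # approximately 0
```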
Comments:
Binary linear classifier: $$ \widehat{y}(x)= sign\left(w^{T}x+w_{0}\right) $$
divides feature space by hyperplane $w^{T}x+w_{0}=0$.
Approaches (a one-versus-all sketch follows this list):
one-versus-all
one-versus-one
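A minimal one-versus-all sketch built from binary linear classifiers (assuming a dataset `X`, `y`; `LogisticRegression` from scikit-learn is used here only as one possible binary base classifier):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_one_vs_all(X, y, classes):
    """Fit one binary linear classifier per class: class c versus all the others."""
    models = {}
    for c in classes:
        y_binary = np.where(y == c, 1, -1)        # relabel: current class vs. the rest
        models[c] = LogisticRegression().fit(X, y_binary)
    return models

def predict_one_vs_all(models, X):
    """Predict the class whose classifier gives the largest score w^T x + w_0."""
    classes = np.array(list(models.keys()))
    scores = np.column_stack([m.decision_function(X) for m in models.values()])
    return classes[np.argmax(scores, axis=1)]
```

One-versus-one works analogously but trains a classifier for every pair of classes and predicts by voting.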
The margin of object $(x,y)$ is $$ M(x,y) =y\left(w^{T}x+w_{0}\right) $$
Optimizing the binary misclassification indicator directly is not recommended: the continuous margin is more informative than a binary error indicator.
If we select a loss function $\mathcal{L}(M)$ such that $\mathbb{I}[M<0]\le\mathcal{L}(M)$, then we can optimize an upper bound on the misclassification rate: $$ \begin{gathered}\text{MISCLASSIFICATION RATE} =\frac{1}{N}\sum_{n=1}^{N}\mathbb{I}[M(x_{n},y_{n}|w)<0]\\ \le\frac{1}{N}\sum_{n=1}^{N}\mathcal{L}(M(x_{n},y_{n}|w))=L(w) \end{gathered} $$
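For instance, the hinge loss $\mathcal{L}(M)=\max(0,\,1-M)$ is one such upper bound on the indicator; a quick numerical sanity check (a sketch over a grid of synthetic margin values):

```python
import numpy as np

margins = np.linspace(-3, 3, 601)            # synthetic margin values M
indicator = (margins < 0).astype(float)      # 0/1 misclassification indicator I[M < 0]
hinge = np.maximum(0.0, 1.0 - margins)       # hinge loss, one possible upper bound L(M)

assert np.all(indicator <= hinge)            # I[M < 0] <= L(M) holds pointwise
```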
Same story as in linear regression
The $\lVert w\rVert_{1}$ regularizer should do feature selection, i.e. drive some weights exactly to zero.
Consider $$ L(w)=\sum_{n=1}^{N}\mathcal{L}\left(M(x_{n},y_{n}|w)\right)+\lambda\sum_{d=1}^{D}|w_{d}| $$
and the gradient components $$ \frac{\partial}{\partial w_{i}}L(w)=\sum_{n=1}^{N}\frac{\partial}{\partial w_{i}}\mathcal{L}\left(M(x_{n},y_{n}|w)\right)+\lambda\,\mathrm{sign}(w_{i}) $$
The penalty contribution does not vanish near zero: $$ \lambda\,\mathrm{sign}(w_{i})\nrightarrow0\text{ when }w_{i}\to0, $$ so the $\lVert w\rVert_{1}$ penalty keeps pushing small weights all the way to exactly zero.
With the $\lVert w\rVert_{2}^{2}$ regularizer, in contrast, $$ L(w)=\sum_{n=1}^{N}\mathcal{L}\left(M(x_{n},y_{n}|w)\right)+\lambda\sum_{d=1}^{D}w_{d}^{2} $$
$$ \frac{\partial}{\partial w_{i}}L(w)=\sum_{n=1}^{N}\frac{\partial}{\partial w_{i}}\mathcal{L}\left(M(x_{n},y_{n}|w)\right)+2\lambda w_{i} $$ $$ 2\lambda w_{i}\to0\text{ when }w_{i}\to0, $$ so the $\lVert w\rVert_{2}^{2}$ penalty only shrinks small weights and does not force them to exactly zero.
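This difference is easy to observe empirically. A small comparison on scikit-learn's logistic regression (a minimal sketch; the synthetic dataset and the regularization strength `C=0.1` are arbitrary illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data where only a few of the 20 features are actually informative
X, y = make_classification(n_samples=500, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2 = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X, y)

# L1 tends to zero out many weights (feature selection), L2 only shrinks them
print("zero weights with L1:", np.sum(l1.coef_ == 0))
print("zero weights with L2:", np.sum(l2.coef_ == 0))
```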
Consider the following objects
x1 | x2 |
---|---|
0 | 1 |
1 | 0 |
1 | 1 |
2 | 2 |
2 | 3 |
3 | 2 |
Find the class predictions for $(w_0 = -0.3,\ w_1 = 0.1,\ w_2 = 0.1)$
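A minimal sketch to check the answer numerically with $\widehat{y}(x)=sign\left(w_{0}+w_{1}x_{1}+w_{2}x_{2}\right)$ (the objects are the six rows of the table above):

```python
import numpy as np

X = np.array([[0, 1], [1, 0], [1, 1], [2, 2], [2, 3], [3, 2]])   # objects (x1, x2)
w0, w = -0.3, np.array([0.1, 0.1])

scores = X @ w + w0            # w^T x + w_0 for every object
predictions = np.sign(scores)  # +1 / -1 class predictions
print(np.column_stack([scores, predictions]))
```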
Logistic regression models the class probability as $p(y=+1|x)=\sigma(w^{T}x+w_{0})$, where $\sigma(z)=\frac{1}{1+e^{-z}}$ is the sigmoid function.
demo_sigmoid()
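`demo_sigmoid` is not defined in this excerpt; presumably it plots the sigmoid curve. A minimal sketch of what such a demo could look like:

```python
import numpy as np
import matplotlib.pyplot as plt

def demo_sigmoid():
    """Plot the sigmoid function sigma(z) = 1 / (1 + exp(-z))."""
    z = np.linspace(-8, 8, 400)
    plt.plot(z, 1.0 / (1.0 + np.exp(-z)))
    plt.xlabel("z")
    plt.ylabel(r"$\sigma(z)$")
    plt.title("Sigmoid function")
    plt.show()

demo_sigmoid()
```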
Using the property $1-\sigma(z)=\sigma(-z)$, we obtain that $$ p(y=+1|x)=\sigma(w^{T}x+w_0)\Longrightarrow p(y=-1|x)=\sigma(-w^{T}x - w_0) $$
So for $y\in\{+1,-1\}$ $$ p(y|x)=\sigma(y(\langle w,x\rangle + w_0)) $$
Therefore ML estimation can be written as: $$ \prod_{i=1}^{N}\sigma( y_{i}(\langle w,x_{i}\rangle + w_0))\to\max_{w} $$
For binary classification $p(y|x)=\sigma(y(\langle w,x\rangle + w_0))$
Estimation with ML:
$$ \prod_{i=1}^{n}\sigma(y_{i}(\langle w,x_{i}\rangle + w_0)) = \prod_{i=1}^{n}\sigma(y_{i}g(x_{i})) \to\max_{w}, $$ where $g(x)=\langle w,x\rangle + w_0$,
which, after taking the negative logarithm, is equivalent to $$ \sum_{i=1}^{n}\ln(1+e^{-y_{i}g(x_{i})})\to\min_{w} $$
It follows that logistic regression is a linear discriminant estimated with the loss function $\mathcal{L}(M)=\ln(1+e^{-M})$.
Let's present the likelihood function in another form: $$ \mathcal{L}(w) = \prod_{i=1}^{n} p(y=+1|x^{(i)})^{[y^{(i)} = +1]}\, p(y=-1|x^{(i)})^{[y^{(i)} = -1]} \rightarrow \max_w$$ $$ -\ln{\mathcal{L}(w)} = - \sum_{i=1}^{n} \left( [y^{(i)} = +1]\cdot\ln{\sigma(w^{T}x^{(i)}+w_0)} + [y^{(i)} = -1]\cdot\ln{(1-\sigma(w^{T}x^{(i)}+w_0))} \right) \rightarrow \min_w$$ $$L(w) = -\ln{\mathcal{L}(w)} \rightarrow \min_w $$
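A minimal gradient-descent sketch minimizing this negative log-likelihood in its margin form $\sum_{i}\ln(1+e^{-y_{i}(w^{T}x_{i}+w_{0})})$ (assuming labels $y_{i}\in\{+1,-1\}$; the step size and iteration count are arbitrary):

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z))"""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iter=1000):
    """Full-batch gradient descent on L(w) = sum_i ln(1 + exp(-y_i (w^T x_i + w_0)))."""
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])          # absorb the bias w_0 as an extra column
    w = np.zeros(d + 1)
    for _ in range(n_iter):
        margins = y * (Xb @ w)                    # margins M(x_i, y_i | w)
        # dL/dw = -sum_i sigma(-M_i) * y_i * x_i
        grad = -(Xb * (sigmoid(-margins) * y)[:, None]).sum(axis=0)
        w -= lr * grad / n
    return w[:-1], w[-1]                          # learned weights and bias w_0
```

On linearly separable data the weights can grow without bound, so in practice one of the regularizers from the previous section is usually added.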
Multiple class classification: $$ \begin{cases} score(\omega_{1}|x)=w_{1}^{T}x + w_{0,1} \\ score(\omega_{2}|x)=w_{2}^{T}x + w_{0,2}\\ \cdots\\ score(\omega_{C}|x)=w_{C}^{T}x + w_{0,C} \end{cases} $$
A relationship between the score and the class probability is assumed:
$$ p(\omega_{c}|x)=\mathrm{softmax}(\omega_c|W, x)=\frac{\exp(w_{c}^{T}x + w_{0,c})}{\sum_{i}\exp(w_{i}^{T}x + w_{0,i})} $$
Estimation with ML: $$ \prod_{i=1}^{N}\prod_{c=1}^{C} \mathrm{softmax}(\omega_c|W, x_i)^{[y_i = \omega_c]}\to\max_{W} $$
which, after taking the negative logarithm, leads to the cross-entropy loss.
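A minimal sketch of the softmax probabilities and the resulting cross-entropy loss (assuming `scores` contains the class scores $w_{c}^{T}x+w_{0,c}$ as an $N\times C$ array and `y` holds class indices):

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax: p(omega_c | x) = exp(s_c) / sum_i exp(s_i)."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(scores, y):
    """Average negative log-likelihood of the true classes: -1/N * sum_n ln p(omega_{y_n} | x_n)."""
    p = softmax(scores)
    n = len(y)
    return -np.log(p[np.arange(n), y]).mean()

# Tiny usage example with made-up scores for 3 objects and 2 classes
scores = np.array([[2.0, 0.5], [0.1, 1.5], [1.0, 1.0]])
y = np.array([0, 1, 0])
print(cross_entropy(scores, y))
```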