Data Analysis

Andrey Shestakov (avshestakov@hse.ru)


Linear classification. Logistic Regression¹

1. Some materials are taken from the machine learning course of Victor Kitov

Let's recall previous lecture

  • Linear regression
    • linear dependence between the target and the predictors
    • predictors themselves can be non-linear
  • Solution can be found
    • analytically
    • with gradient descent
  • Various loss function options
  • Robust setting
  • Regularization!

Linear Classification

Analytical geometry reminder

Reminder

  • $a=[a^{1},...a^{D}]^{T},\,b=[b^{1},...b^{D}]^{T}$
  • Scalar product $\langle a,b\rangle=a^{T}b=\sum_{d=1}^{D}a^{d}b^{d}$
  • $a\perp b$ means that $\langle a,b\rangle=0$
  • Norm $\left\lVert a\right\rVert =\sqrt{\langle a,a\rangle}$
  • Distance $\rho(a,b)=\left\lVert a-b\right\rVert =\sqrt{\langle a-b,a-b\rangle}$
  • $p=\langle a,\frac{b}{\left\lVert b\right\rVert }\rangle$ - signed length of the projection of $a$ onto $b$
  • $\left|p\right|=\left|\langle a,\frac{b}{\left\lVert b\right\rVert }\rangle\right|$ - unsigned projection length (see the sketch after this list)
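A minimal NumPy sketch of these operations; the example vectors are made up for illustration:

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0])

dot = a @ b                       # scalar product <a, b>
norm_a = np.sqrt(a @ a)           # ||a|| = sqrt(<a, a>), same as np.linalg.norm(a)
dist = np.linalg.norm(a - b)      # rho(a, b) = ||a - b||
p = a @ (b / np.linalg.norm(b))   # signed projection length of a onto b
print(dot, norm_a, dist, p)       # 3.0 5.0 4.47... 3.0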

Orthogonal vector to hyperplane

Theorem 1

Vector $w$ is orthogonal to the hyperplane $w^{T}x+w_{0}=0$.

Proof: Consider arbitrary $x_{A},x_{B}\in\{x:\,w^{T}x+w_{0}=0\}$: $$ \begin{align} w^{T}x_{A}+w_{0}=0 \quad \text{ (1)}\\ w^{T}x_{B}+w_{0}=0 \quad \text{ (2)} \end{align} $$ By subtracting (2) from (1) we obtain $w^{T}(x_{A}-x_{B})=0$. Since $x_{A}-x_{B}$ is an arbitrary vector lying in the hyperplane, $w$ is orthogonal to the hyperplane.

Distance from point to hyperplane

Theorem 2

The signed distance from a point $x$ to the hyperplane $w^{T}x+w_{0}=0$ is equal to $\frac{w^{T}x+w_{0}}{\left\lVert w\right\rVert }$.

Proof: Project $x$ onto the hyperplane; let the projection be $p$ and the complement $h=x-p$, which is orthogonal to the hyperplane. Then $$ x=p+h $$

Since $p$ lies on the hyperplane, $$ w^{T}p+w_{0}=0 $$

Since $h$ is orthogonal to the hyperplane, by Theorem 1 it is collinear with $w$: $$ h=r\frac{w}{\left\lVert w\right\rVert },\quad r\in\mathbb{R}\text{ - signed distance to the hyperplane}. $$

Distance from point to hyperplane

$$ x=p+r\frac{w}{\left\lVert w\right\rVert } $$

After multiplying by $w^{T}$ and adding $w_{0}$: $$ w^{T}x+w_{0}=w^{T}p+w_{0}+r\frac{w^{T}w}{\left\lVert w\right\rVert }=r\left\lVert w\right\rVert $$

because $w^{T}p+w_{0}=0$ and $\left\lVert w\right\rVert =\sqrt{w^{T}w}$. So we get, that $$ r=\frac{w^{T}x+w_{0}}{\left\lVert w\right\rVert } $$

Comments:

  • From one side of hyperplane $r>0\Leftrightarrow w^{T}x+w_{0}>0$
  • From the other side $r<0\Leftrightarrow w^{T}x+w_{0}<0$.
  • The signed distance from the origin to the hyperplane is $\frac{w_{0}}{\left\lVert w\right\rVert }$, so $w_{0}$ accounts for the hyperplane offset (see the sketch below).
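A small NumPy sketch computing the signed distance $r=\frac{w^{T}x+w_{0}}{\left\lVert w\right\rVert }$; the hyperplane here is an assumed example:

import numpy as np

def signed_distance(x, w, w0):
    # signed distance r = (w^T x + w0) / ||w|| from point x to the hyperplane
    return (w @ x + w0) / np.linalg.norm(w)

w, w0 = np.array([3.0, 4.0]), -5.0                    # assumed example hyperplane
print(signed_distance(np.array([0.0, 0.0]), w, w0))   # -1.0: origin on the negative side
print(signed_distance(np.array([1.0, 2.0]), w, w0))   #  1.2: point on the positive side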

Binary linear classifier geometric interpretation

Binary linear classifier: $$ \widehat{y}(x)= sign\left(w^{T}x+w_{0}\right) $$

divides feature space by hyperplane $w^{T}x+w_{0}=0$.

  • Confidence of decision is proportional to distance to hyperplane $\frac{\left|w^{T}x+w_{0}\right|}{\left\lVert w\right\rVert }$.
  • $w^{T}x+w_{0}$ is the confidence that class is positive.

Multiclass classification with binary classifiers

Multiclass classification with binary classifiers

  • Task - make $C$-class classification using many binary classifiers.
  • Approaches:

    • one-versus-all

      • for each $c=1,2,...,C$ train a binary classifier on all objects with target $\mathbb{I}[y_{n}=c]$,
      • assign the class whose classifier gives the highest score among the resulting $C$ classifiers.
    • one-versus-one

      • for each pair $i,j\in\{1,2,...,C\}$, $i\ne j$, train a classifier on the objects with $y_{n}\in\{i,j\}$ and output $y_{n}$
      • assign the class that wins the most of the resulting $C(C-1)/2$ pairwise classifiers (see the sketch after this list).
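A hedged sketch of both reductions using scikit-learn's wrappers; the iris dataset and the logistic regression base classifier are illustrative choices, not part of the lecture:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X, y = load_iris(return_X_y=True)            # toy 3-class dataset (illustrative choice)
base = LogisticRegression(max_iter=1000)     # any binary linear classifier works here

ova = OneVsRestClassifier(base).fit(X, y)    # C binary problems: class c vs the rest
ovo = OneVsOneClassifier(base).fit(X, y)     # C(C-1)/2 pairwise problems
print(len(ova.estimators_), len(ovo.estimators_))   # 3 and 3 for C = 3

One-vs-all trains only $C$ models but on imbalanced problems, while one-vs-one trains more models on smaller, balanced subsets.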

Binary linear classifier

  • For two classes $y\in\{+1,-1\}$ classifier becomes $$ \widehat{y}(x)=\begin{cases} +1, & w_{+1}^{T}x+w_{+1,0}>w_{-1}^{T}x+w_{-1,0}\\ -1 & \text{otherwise} \end{cases} $$
  • This decision rule is equivalent to $$ \begin{align*} \widehat{y}(x) & =sign(w_{+1}^{T}x+w_{+1,0}-w_{-1}^{T}x-w_{-1,0})=\\ & =sign\left(\left(w_{+1}^{T}-w_{-1}^{T}\right)x+\left(w_{+1,0}-w_{-1,0}\right)\right)\\ & =sign\left(w^{T}x+w_{0}\right) \end{align*} $$ for $w=w_{+1}-w_{-1},\,w_{0}=w_{+1,0}-w_{-1,0}$.
  • Decision boundary $w^{T}x+w_{0}=0$ is linear.
  • The multiclass case can be solved using multiple binary classifiers with one-vs-all or one-vs-one
    • can you imagine a faulty situation with these approaches for linear classifiers?

Linear classifier

  • Classification among classes 1,2,...C.
  • Use $C$ discriminant functions $g_{c}(x)=w_{c}^{T}x+w_{c0}$
  • Decision rule: $$ \widehat{y}(x)=\arg\max_{c}g_{c}(x) $$
  • Decision boundary between classes $y=i$ and $y=j$ is linear: $$ \left(w_{i}-w_{j}\right)^{T}x+\left(w_{i0}-w_{j0}\right)=0 $$
  • Decision regions are convex
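A minimal sketch of this decision rule in NumPy; the weight values are made up for illustration:

import numpy as np

W = np.array([[ 1.0,  0.0],    # w_1
              [ 0.0,  1.0],    # w_2
              [-1.0, -1.0]])   # w_3
w0 = np.array([0.0, 0.0, 0.5])

def predict(X):
    scores = X @ W.T + w0                  # g_c(x) = w_c^T x + w_c0 for every class c
    return np.argmax(scores, axis=1) + 1   # classes numbered 1..C

X = np.array([[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]])
print(predict(X))   # [1 2 3]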

Margin of binary linear classifier

$$ M(x,y|w) =y\left(w^{T}x+w_{0}\right) $$

  • The margin is a score of how well the classifier predicted the true label $y$ for object $x$.
  • $M(x,y|w)>0$ $\Leftrightarrow$ object $x$ is correctly classified as $y$
    • signs of $w^{T}x+w_{0}$ and $y$ coincide
  • $|M(x,y|w)|=\left|w^{T}x+w_{0}\right|$ - confidence of the decision
    • proportional to distance from $x$ to hyperplane $w^{T}x+w_{0}=0$.
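A short sketch computing margins for a labelled sample; weights and data are assumed:

import numpy as np

w, w0 = np.array([1.0, -1.0]), 0.5
X = np.array([[2.0, 0.0], [0.0, 2.0]])
y = np.array([1, 1])

margins = y * (X @ w + w0)   # M(x, y | w) = y (w^T x + w_0)
print(margins)               # [ 2.5 -1.5]: first object correct, second misclassified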

Margin

Weights optimization

  • The margin is a score of how well the classifier predicted the true label $y$ for object $x$.
  • Task: select $w$ so as to increase $M(x_{n},y_{n}|w)$ for all $n$.
  • Formalization: $$ \frac{1}{N}\sum_{n=1}^{N}\mathcal{L}\left(M(x_{n},y_{n}|w)\right)\to\min_{w} $$

Misclassification rate optimization

  • Misclassification rate optimization: $$ \frac{1}{N}\sum_{n=1}^{N}\mathbb{I}[M(x_{n},y_{n}|w)<0]\to\min_{w} $$

is not recommended:

  • discontinuous function, so gradient-based numerical optimization can't be used!
  • the continuous margin is more informative than a binary error indicator.

  • If we select a loss function $\mathcal{L}(M)$ such that $\mathbb{I}[M<0]\le\mathcal{L}(M)$, then we can optimize an upper bound on the misclassification rate: $$ \begin{gathered}\begin{gathered}\text{MISCLASSIFICATION RATE}\end{gathered} =\frac{1}{N}\sum_{n=1}^{N}\mathbb{I}[M(x_{n},y_{n}|w)<0]\\ \le\frac{1}{N}\sum_{n=1}^{N}\mathcal{L}(M(x_{n},y_{n}|w))=L(w) \end{gathered} $$

Common loss functions
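A sketch that plots several standard margin-based losses next to the 0/1 loss; the exact set of curves shown on the original slide is an assumption:

import numpy as np
import matplotlib.pyplot as plt

M = np.linspace(-3, 3, 300)
plt.plot(M, (M < 0).astype(float), label='0/1 loss [M < 0]')
plt.plot(M, np.maximum(0, 1 - M), label='hinge max(0, 1 - M) (SVM)')
plt.plot(M, np.log(1 + np.exp(-M)), label='logistic ln(1 + exp(-M))')
plt.plot(M, np.exp(-M), label='exponential exp(-M) (AdaBoost)')
plt.xlabel('margin M'); plt.ylabel('loss')
plt.ylim(0, 4); plt.legend(); plt.show()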

Optimization

Same story as in linear regression

Regularization

Same story as in linear regression

Regularization

  • Add an additional requirement that a regularizer $R(w)$ be small: $$ \sum_{n=1}^{N}\mathcal{L}\left(M(x_{n},y_{n}|w)\right)+\lambda R(w)\to\min_{w} $$
  • $\lambda>0$ - hyperparameter.
  • $R(w)$ penalizes the complexity of the model. $$ \begin{array}{ll} R(w)=||w||_{1} & L_{1}\text{ regularization}\\ R(w)=||w||_{2}^{2} & L_{2}\text{ regularization} \end{array} $$

$L_{1}$ regularization

  • The $||w||_{1}$ regularizer performs feature selection: it can drive some weights exactly to zero.

  • Consider $$ L(w)=\sum_{n=1}^{N}\mathcal{L}\left(M(x_{n},y_{n}|w)\right)+\lambda\sum_{d=1}^{D}|w_{d}| $$

  • And gradient updates $$ \frac{\partial}{\partial w_{i}}L(w)=\sum_{n=1}^{N}\frac{\partial}{\partial w_{i}}\mathcal{L}\left(M(x_{n},y_{n}|w)\right)+\lambda sign (w_{i}) $$

$$ \lambda\, sign (w_{i})\nrightarrow0\text{ when }w_{i}\to0, $$ so the regularization push does not vanish near zero, and weights can be set exactly to 0.

$L_{2}$ regularization

$$ L(w)=\sum_{n=1}^{N}\mathcal{L}\left(M(x_{n},y_{n}|w)\right)+\lambda\sum_{d=1}^{D}w_{d}^{2} $$

$$ \frac{\partial}{\partial w_{i}}L(w)=\sum_{n=1}^{N}\frac{\partial}{\partial w_{i}}\mathcal{L}\left(M(x_{n},y_{n}|w)\right)+2\lambda w_{i} $$ $$ 2\lambda w_{i}\to0\text{ when }w_{i}\to0 $$

  • Strength of regularization $\to0$ as weights $\to0$.
  • So $L_{2}$ regularization will not set weights exactly to 0 (compare both penalties in the sketch below).
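A hedged scikit-learn comparison of the two penalties; the breast cancer dataset and the regularization strength are assumptions for illustration:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)      # illustrative dataset choice
X = StandardScaler().fit_transform(X)

l1 = LogisticRegression(penalty='l1', C=0.05, solver='liblinear').fit(X, y)
l2 = LogisticRegression(penalty='l2', C=0.05, solver='liblinear').fit(X, y)
print('zero weights with L1:', np.sum(l1.coef_ == 0))   # many exact zeros
print('zero weights with L2:', np.sum(l2.coef_ == 0))   # typically none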

Consider the following objects:

x1 x2
0 1
1 0
1 1
2 2
2 3
3 2

Find the class predictions if $(w_0 = -0.3,\ w_1 = 0.1,\ w_2 = 0.1)$.
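One way to check the answer with NumPy; this just evaluates $sign(w_{0}+w_{1}x_{1}+w_{2}x_{2})$ for the table above:

import numpy as np

X = np.array([[0, 1], [1, 0], [1, 1], [2, 2], [2, 3], [3, 2]], dtype=float)
w, w0 = np.array([0.1, 0.1]), -0.3

scores = X @ w + w0
print(scores)            # [-0.2 -0.2 -0.1  0.1  0.2  0.2]
print(np.sign(scores))   # [-1. -1. -1.  1.  1.  1.]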

Logistic Regression

Binary classification

  • Linear classifier: $$ score(\omega_{1}|x)=w^{T}x + w_0 = g(x) $$
  • In addition, a relationship between the score and the class probability is assumed: $$ p(\omega_{1}|x)=\sigma(w^{T}x + w_0) $$

where $\sigma(z)=\frac{1}{1+e^{-z}}$ is the sigmoid function

demo_sigmoid()  # notebook helper that plots the sigmoid curve
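The helper itself is not included in the handout; a plausible definition, assuming it simply plots $\sigma(z)$ with matplotlib:

import numpy as np
import matplotlib.pyplot as plt

def demo_sigmoid():
    # plot sigma(z) = 1 / (1 + exp(-z)) on a symmetric range around zero
    z = np.linspace(-10, 10, 200)
    plt.plot(z, 1 / (1 + np.exp(-z)))
    plt.xlabel('z'); plt.ylabel('sigma(z)'); plt.title('Sigmoid function')
    plt.show()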

Binary classification: estimation

Using the property $1-\sigma(z)=\sigma(-z)$ obtain that $$ p(y=+1|x)=\sigma(w^{T}x+w_0)\Longrightarrow p(y=-1|x)=\sigma(-w^{T}x - w_0) $$

So for $y\in\{+1,-1\}$ $$ p(y|x)=\sigma(y(\langle w,x\rangle + w_0)) $$

Therefore ML estimation can be written as: $$ \prod_{i=1}^{N}\sigma( y_{i}(\langle w,x_{i}\rangle + w_0))\to\max_{w} $$

Loss function for 2-class logistic regression

For binary classification $p(y|x)=\sigma(y(\langle w,x\rangle + w_0))$

Estimation with ML:

$$ \prod_{i=1}^{n}\sigma(y_{i}(\langle w,x_{i}\rangle + w_0)) = \prod_{i=1}^{n}\sigma(y_{i}g(x_{i})) \to\max_{w} $$

Taking the negative logarithm and using $-\ln\sigma(z)=\ln(1+e^{-z})$, this is equivalent to $$ \sum_{i=1}^{n}\ln(1+e^{-y_{i}g(x_{i})})\to\min_{w} $$

It follows that logistic regression is linear discriminant estimated with loss function $\mathcal{L}(M)=\ln(1+e^{-M})$.
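A compact sketch that estimates $w$ by full-batch gradient descent on this loss; the learning rate, iteration count, and toy data are assumptions:

import numpy as np

def sigma(z):
    return 1 / (1 + np.exp(-z))

def fit_logreg(X, y, lr=0.1, n_iter=1000):
    # minimize mean ln(1 + exp(-y * g(x))) by full-batch gradient descent
    Xb = np.hstack([X, np.ones((len(X), 1))])     # constant column accounts for w_0
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        M = y * (Xb @ w)                          # margins M(x_i, y_i | w)
        grad = -(Xb * (y * sigma(-M))[:, None]).mean(axis=0)   # gradient of the mean loss
        w -= lr * grad
    return w

X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [2.0, 3.0]])
y = np.array([-1, -1, 1, 1])
print(fit_logreg(X, y))   # weights (and bias) separating the two groups

scikit-learn's LogisticRegression fits the same loss (plus a regularizer) with more robust solvers; this sketch only makes the link between the margin loss and gradient descent explicit.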

Another formulation

Let's present the likelihood function in another form: $$ \mathcal{L}(w) = \prod_{i=1}^n p(y=+1|x^{(i)})^{[y^{(i)} = +1]}\, p(y=-1|x^{(i)})^{[y^{(i)} = -1]} \rightarrow \max_w$$ $$ -\ln{\mathcal{L}(w)} = - \sum_{i=1}^n \left( [y^{(i)} = +1]\cdot\ln{\sigma(w^{T}x^{(i)}+w_0)} + [y^{(i)} = -1]\cdot\ln{(1-\sigma(w^{T}x^{(i)}+w_0))} \right) \rightarrow \min_w$$ $$L(w) = -\ln{\mathcal{L}(w)} \rightarrow \min_w $$

Multiple classes

Multiclass classification: $$ \begin{cases} score(\omega_{1}|x)=w_{1}^{T}x + w_{0,1} \\ score(\omega_{2}|x)=w_{2}^{T}x + w_{0,2}\\ \cdots\\ score(\omega_{C}|x)=w_{C}^{T}x + w_{0,C} \end{cases} $$

In addition, a relationship between the scores and the class probabilities is assumed:

$$ p(\omega_{c}|x)=softmax(\omega_c|W, x)=\frac{exp(w_{c}^{T}x + w_{0,c})}{\sum_{i}exp(w_{i}^{T}x + w_{0,i})} $$

Estimation with ML: $$ \prod_{i=1}^{N}\prod_{c=1}^{C} softmax(\omega_c|W, x_i)^{[y_i = \omega_c]} \to\max_{W} $$

which leads us to the cross-entropy loss.
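A small NumPy sketch of the softmax probabilities and the resulting cross-entropy loss; the weights and labels are made up:

import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=1, keepdims=True))   # stabilized exponents
    return e / e.sum(axis=1, keepdims=True)

W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # one weight row per class
w0 = np.zeros(3)
X = np.array([[2.0, 0.0], [0.0, 2.0]])
y = np.array([0, 1])                                   # true class indices

P = softmax(X @ W.T + w0)                              # p(omega_c | x) for every object
cross_entropy = -np.log(P[np.arange(len(y)), y]).mean()
print(P.round(3), cross_entropy)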

Summary

  • Linear classifier - classifier with linear discriminant functions.
  • Binary linear classifier: $\widehat{y}(x)=sign(w^{T}x+w_{0})$.
  • Perceptron, logistic regression, and SVM are linear classifiers estimated with different loss functions.
  • Weights are selected to minimize total loss on margins.
  • Gradient descent iteratively optimizes $L(w)$ by moving in the direction of steepest descent.
  • Stochastic gradient descent approximates $\nabla_{w}L$ by averaging gradients over a small subset of objects.
  • Regularization gives smooth control over model complexity.
  • $L_{1}$ regularization automatically selects features.