Data Analysis

Andrey Shestakov (avshestakov@hse.ru)


Regularization in linear models.

Linear classification. Logistic Regression [1]

[1] Some materials are taken from the machine learning course of Victor Kitov

Let's recall the previous lecture

  • Linear regression
    • linear dependence between the target feature and the predictors
$$f(x_{n}, \beta) = \hat{y}_{n} = \beta_0 + \beta_1x_{n}^1 + \dots$$
  • Fit by minimizing the Ordinary Least Squares (OLS) loss
  • The solution can be found
    • analytically
    • with gradient descent

Regularization & restrictions

Intuition

[Andrew Ng's Machine Learning Class - Stanford]

Regularization

  • Regularization is a technique that reduces model complexity to prevent overfitting
  • Desired outcomes:
    • train loss goes up
    • test loss goes down
  • Have we already seen it?

Regularization

  • Add the requirement that the regularizer $R(\beta)$ be small: $$ \sum_{n=1}^{N}\left(x_{n}^{T}\beta-y_{n}\right)^{2}+\lambda R(\beta)\to\min_{\beta} $$
  • $\lambda>0$ - hyperparameter.
  • $R(\beta)$ penalizes complexity of models. $$ \begin{array}{ll} R(\beta)=||\beta||_{1} & \mbox{Lasso regression}\\ R(\beta)=||\beta||_{2}^{2} & \text{Ridge regression} \end{array} $$
  • Not only accuracy matters for the solution but also model simplicity!
  • $\lambda$ controls the complexity of the model: $\uparrow\lambda\Leftrightarrow\text{complexity}\downarrow$.
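
For the ridge penalty the regularized least-squares objective above still has a closed-form solution, $\beta=(X^{T}X+\lambda I)^{-1}X^{T}y$. Below is a minimal NumPy sketch of it (synthetic data and variable names are illustrative only, not this lecture's dataset):

import numpy as np

# A minimal sketch: closed-form ridge solution beta = (X^T X + lambda*I)^{-1} X^T y
# on synthetic data (illustrative only).
rng = np.random.RandomState(0)
N, D = 100, 3
X = rng.randn(N, D)
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.randn(N)

lam = 1.0                                      # regularization strength
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)   # ordinary least squares
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

print('OLS:  ', np.round(beta_ols, 3))
print('Ridge:', np.round(beta_ridge, 3))       # shrunk towards zero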

Comments

  • Dependence of $\beta$ on $\lambda$ for ridge (A) and LASSO (B):

  • LASSO can be used for automatic feature selection.

  • $\lambda$ is usually found with cross-validation on an exponential grid, e.g. $[10^{-6},10^{-5},\dots,10^{5},10^{6}]$ (see the sketch after this list).
  • It is always recommended to use regularization because
    • it gives smooth control over model complexity.
    • it reduces ambiguity when multiple solutions exist.
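
A minimal sketch of such a grid search with cross-validation, using scikit-learn's Ridge (where the hyperparameter $\lambda$ is called alpha); the data here is synthetic and for illustration only:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data, illustrative only
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 3.0]) + rng.randn(200)

# Exponential grid for the regularization strength
param_grid = {'alpha': np.logspace(-6, 6, 13)}
search = GridSearchCV(Ridge(), param_grid, cv=5)
search.fit(X, y)
print('Best alpha:', search.best_params_['alpha'])
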
In [4]:
df_auto = df_auto.assign(kilometrage = lambda r: r.mileage*1.6)
df_auto.loc[:, ['mileage', 'kilometrage', 'price']].head()
Out[4]:
mileage kilometrage price
0 67697 108315.2 14995
1 73738 117980.8 11988
2 80313 128500.8 11999
3 86096 137753.6 12995
4 79607 127371.2 11333
In [11]:
X_train = df_auto.loc[:, ['mileage', 'kilometrage']].values
y_train = df_auto.price.values

model = Lasso()
# model = Ridge()
model.fit(X_train, y_train)

print('Model:\nprice = %.2f + (%.2f)*mileage + (%.2f)*kilometrage' % (model.intercept_, model.coef_[0], model.coef_[1],))

y_hat = model.predict(X_train)

df_auto.plot(x='mileage', y='price', kind='scatter')
_ = plt.plot(X_train[:, 0], y_hat, c='r')
Model:
price = 16762.02 + (-0.05)*mileage + (-0.00)*kilometrage

ElasticNet

  • ElasticNet:
$$ R(\beta)=\alpha||\beta||_{1}+(1-\alpha)||\beta||_{2}^{2} $$

$\alpha\in(0,1)$ is a hyperparameter controlling the impact of each part.

  • If two features $x^{i}$ and $x^{j}$ are equal:
    • LASSO may keep only one of them
    • Ridge will keep both with equal weights
      • but it doesn't remove useless features
  • ElasticNet both removes useless features and gives equal weights to useful equal features (see the sketch below)
    • good, because feature equality may be due to chance on this particular training set
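
A small sketch of this behaviour on synthetic data with two perfectly correlated features (mimicking the mileage/kilometrage pair above); the exact coefficients depend on the data and the regularization strength:

import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

# Two perfectly correlated features: the second is the first rescaled,
# as with mileage and kilometrage above. Synthetic data, illustrative only.
rng = np.random.RandomState(0)
x1 = rng.uniform(0, 100, size=200)
X = np.c_[x1, 1.6 * x1]
y = 3.0 * x1 + rng.randn(200)

for model in [Lasso(alpha=1.0), Ridge(alpha=1.0), ElasticNet(alpha=1.0, l1_ratio=0.5)]:
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 3))
# Typically: Lasso concentrates the weight on one of the duplicated features,
# while Ridge and ElasticNet spread it over both.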

Wisdom of the day

Overfitting = Death

Linear Classification (in general)

$$ w_1x_1 + w_2x_2 + w_0 = w^{T}x+w_{0}=0 $$

$$ y \in \{-1,+1\} $$

  • Given weights $w$ and $w_0$ and a feature vector $x$ we can
    • tell if $x$ is "below" or "above" the hyperplane
    • tell how far $x$ is from the hyperplane

Analytical geometry reminder

Reminder

  • $a=[a^{1},...a^{D}]^{T},\,b=[b^{1},...b^{D}]^{T}$
  • Scalar product $\langle a,b\rangle=a^{T}b=\sum_{d=1}^{D}a^{d}b^{d}$
  • $a\perp b$ means that $\langle a,b\rangle=0$
  • Norm $\left\lVert a\right\rVert =\sqrt{\langle a,a\rangle}$
  • Distance $\rho(a,b)=\left\lVert a-b\right\rVert =\sqrt{\langle a-b,a-b\rangle}$
  • $p=\langle a,\frac{b}{\left\lVert b\right\rVert }\rangle$ - signed projection length of $a$ onto $b$
  • $\left|p\right|=\left|\langle a,\frac{b}{\left\lVert b\right\rVert }\rangle\right|$ - unsigned projection length
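
These definitions translate directly into NumPy; a tiny sketch with arbitrary vectors:

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([1.0, 0.0])

dot = a @ b                       # scalar product <a, b>
norm_a = np.sqrt(a @ a)           # ||a||, same as np.linalg.norm(a)
dist = np.linalg.norm(a - b)      # rho(a, b)
p = a @ (b / np.linalg.norm(b))   # signed projection length of a onto b

print(dot, norm_a, dist, p)       # 3.0 5.0 4.47... 3.0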

Orthogonal vector to hyperplane

Theorem 1

Vector $w$ is orthogonal to the hyperplane $w^{T}x+w_{0}=0$

Proof: Consider arbitrary $x_{A},x_{B}\in\{x:\,w^{T}x+w_{0}=0\}$: $$ \begin{align} w^{T}x_{A}+w_{0}=0 \quad \text{ (1)}\\ w^{T}x_{B}+w_{0}=0 \quad \text{ (2)} \end{align} $$ By subtracting (2) from (1), we obtain $w^{T}(x_{A}-x_{B})=0$, so $w$ is orthogonal to the hyperplane.

Distance from point to hyperplane

Theorem 2

Distance from point $x$ to the hyperplane $w^{T}x+w_{0}=0$ equals $\frac{w^{T}x+w_{0}}{\left\lVert w\right\rVert }$ (a signed distance).

Proof: Project $x$ onto the hyperplane; let the projection be $t$ and the complement $h=x-t$, which is orthogonal to the hyperplane. Then $$ x=t+h $$

Since $t$ lies on the hyperplane, $$ w^{T}t+w_{0}=0 $$

Since $h$ is orthogonal to the hyperplane, by Theorem 1 it is collinear with $w$: $$ h=r\frac{w}{\left\lVert w\right\rVert },\,r\in\mathbb{R}\text{ - the signed distance to the hyperplane}. $$

Distance from point to hyperplane

$$ x=t+r\frac{w}{\left\lVert w\right\rVert } $$

After multiplying by $w^{T}$ and adding $w_{0}$: $$ w^{T}x+w_{0}=w^{T}t+w_{0}+r\frac{w^{T}w}{\left\lVert w\right\rVert }=r\left\lVert w\right\rVert $$

because $w^{T}t+w_{0}=0$ and $\left\lVert w\right\rVert =\sqrt{w^{T}w}$. So we get that $$ r=\frac{w^{T}x+w_{0}}{\left\lVert w\right\rVert } $$

Comments:

  • From one side of hyperplane $r>0\Leftrightarrow w^{T}x+w_{0}>0$
  • From the other side $r<0\Leftrightarrow w^{T}x+w_{0}<0$.
  • Signed distance from the origin to the hyperplane is $\frac{w_{0}}{\left\lVert w\right\rVert }$. So $w_{0}$ accounts for the hyperplane offset.
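
A quick numeric check of Theorem 2 and the comments above, with arbitrary example values for $w$, $w_{0}$ and $x$:

import numpy as np

w = np.array([0.1, 0.1])    # hyperplane normal
w0 = -0.3                   # offset
x = np.array([2.0, 3.0])    # a point

r = (w @ x + w0) / np.linalg.norm(w)   # signed distance from x to the hyperplane
r0 = w0 / np.linalg.norm(w)            # signed distance from the origin
print(r, r0)                           # r > 0: x lies on the positive side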

Binary linear classifier geometric interpretation

Binary linear classifier: $$ \widehat{y}(x)= sign\left(w^{T}x+w_{0}\right) $$

divides feature space by hyperplane $w^{T}x+w_{0}=0$.

  • Confidence of decision is proportional to distance to hyperplane $\frac{\left|w^{T}x+w_{0}\right|}{\left\lVert w\right\rVert }$.
  • $w^{T}x+w_{0}$ is the confidence that class is positive.

Consider the following objects

x1 x2
0 1
1 0
1 1
2 2
2 3
3 2

Find the class predictions if $(w_0 = -0.3,\, w_1 = 0.1,\, w_2 = 0.1)$
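
One way to check the answer is to compute $sign(w_{0}+w_{1}x_{1}+w_{2}x_{2})$ for every row of the table; a short sketch:

import numpy as np

X = np.array([[0, 1], [1, 0], [1, 1], [2, 2], [2, 3], [3, 2]], dtype=float)
w = np.array([0.1, 0.1])
w0 = -0.3

scores = X @ w + w0            # signed scores w^T x + w_0
y_hat = np.sign(scores)        # predicted classes
print(np.c_[X, scores, y_hat])
# The first three points get negative scores (class -1),
# the last three get positive scores (class +1).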

Fitting weights

Margin of binary linear classifier

$$ M(x,y|w) = y\cdot\left(w^{T}x+w_{0}\right) $$
  • $ y \in \{-1,+1\}$
  • Margin = a score of how well the classifier predicted the true $y$ for object $x$.
  • $M(x,y|w)>0$ $\Leftrightarrow$ object $x$ is correctly classified as $y$
    • signs of $w^{T}x+w_{0}$ and $y$ coincide
  • $|M(x,y|w)|=\left|w^{T}x+w_{0}\right|$ - confidence of decision
    • proportional to distance from $x$ to hyperplane $w^{T}x+w_{0}=0$.

Margin

Misclassification rate optimization

  • Misclassification rate optimization: $$ \frac{1}{N}\sum_{n=1}^{N}\mathbb{I}[M(x_{n},y_{n}|w)<0]\to\min_{w} $$

is not recommended:

  • it is a discontinuous function of $w$, so gradient-based numerical optimization can't be used!
  • the continuous margin is more informative than a binary error indicator.

  • If we select a loss function $\mathcal{L}(M)$ such that $\mathbb{I}[M<0]\le\mathcal{L}(M)$ then we can optimize an upper bound on the misclassification rate: $$ \begin{gathered}\text{MISCLASSIFICATION RATE}=\frac{1}{N}\sum_{n=1}^{N}\mathbb{I}[M(x_{n},y_{n}|w)<0]\\ \le\frac{1}{N}\sum_{n=1}^{N}\mathcal{L}(M(x_{n},y_{n}|w))=L(w) \end{gathered} $$

Optimization problem

  • Margin = a score of how well the classifier predicted the true $y$ for object $x$.
  • Task: select $w$ that increases $M(x_{n},y_{n}|w)$ for all $n$.
  • Formalization: $$ \frac{1}{N}\sum_{n=1}^{N}\mathcal{L}\left(M(x_{n},y_{n}|w)\right)\to\min_{w} $$

Common loss functions
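
The figure for this slide is not reproduced here; as a sketch, the losses most often compared as functions of the margin $M$ (zero-one, hinge, logistic, exponential) can be plotted like this (the exact set shown in the lecture may differ):

import numpy as np
import matplotlib.pyplot as plt

M = np.linspace(-3, 3, 500)

plt.plot(M, (M < 0).astype(float), label='zero-one')
plt.plot(M, np.maximum(0, 1 - M), label='hinge (SVM)')
# base-2 logarithm makes the logistic loss an upper bound of the zero-one loss
plt.plot(M, np.log2(1 + np.exp(-M)), label='logistic (base-2 log)')
plt.plot(M, np.exp(-M), label='exponential (AdaBoost)')
plt.xlabel('margin M')
plt.ylabel('loss')
plt.ylim(0, 4)
plt.legend()
plt.show()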

Optimization procedure

Same story as in linear regression

  • analytical solution (if possible)
  • or gradient descent (the most common case)

Regularization

Same story as in linear regression

  • add regularization terms to the initial loss function
  • question: is it a good idea to regularize the intercept?

$L_{1}$ regularization

  • The $||w||_{1}$ regularizer should perform automatic feature selection.

  • Consider $$ L(w)=\sum_{n=1}^{N}\mathcal{L}\left(M(x_{n},y_{n}|w)\right)+\lambda\sum_{d=1}^{D}|w_{d}| $$

  • And the gradient updates $$ \frac{\partial}{\partial w_{i}}L(w)=\sum_{n=1}^{N}\frac{\partial}{\partial w_{i}}\mathcal{L}\left(M(x_{n},y_{n}|w)\right)+\lambda\, sign (w_{i}) $$

$$ \lambda\, sign (w_{i})\nrightarrow0\text{ when }w_{i}\to0 $$

  • The regularization push does not vanish as $w_{i}\to0$, so $L_{1}$ regularization can set weights exactly to 0.

$L_{2}$ regularization

$$ L(w)=\sum_{n=1}^{N}\mathcal{L}\left(M(x_{n},y_{n}|w)\right)+\lambda\sum_{d=1}^{D}w_{d}^{2} $$

$$ \frac{\partial}{\partial w_{i}}L(w)=\sum_{n=1}^{N}\frac{\partial}{\partial w_{i}}\mathcal{L}\left(M(x_{n},y_{n}|w)\right)+2\lambda w_{i} $$

$$ 2\lambda w_{i}\to0\text{ when }w_{i}\to0 $$
  • Strength of regularization $\to0$ as weights $\to0$.
  • So $L_{2}$ regularization will not set weights exactly to 0.
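
A small sketch contrasting the two penalties on a linear classifier (scikit-learn's LogisticRegression; synthetic data with many noise features, so the exact counts will vary):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 20 features, only 5 informative; the rest is noise. Illustrative only.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)

l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
l2 = LogisticRegression(penalty='l2', solver='liblinear', C=0.1).fit(X, y)

print('L1: exactly-zero coefficients =', int(np.sum(l1.coef_ == 0)))  # usually many
print('L2: exactly-zero coefficients =', int(np.sum(l2.coef_ == 0)))  # usually none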

Logistic Regression

  • solves classification tasks #mindblown

Binary classification

  • Linear classifier: $$ score(y=+1|x)=w^{T}x + w_0 = g(x) $$
  • In addition, a relationship between the score and the class probability is assumed: $$ p(y=+1|x)=\sigma(w^{T}x + w_0) $$

where $\sigma(z)=\frac{1}{1+e^{-z}}$ is the sigmoid function

In [7]:
demo_sigmoid() 

Binary classification: estimation

  • Given our assumption:
    • $p(y=+1|x)=\sigma(w^{T}x + w_0)$
    • $p(y=-1|x)=1 - p(y=+1|x)$
  • we can write down Likelihood function: $$ \mathcal{L}(w) = \prod_{n=1}^N p(y_n=+1|x_n)^{\mathbb{I}[y_n = +1]} p(y_n=-1|x_n)^{\mathbb{I}[y_n = -1]} \rightarrow \max_w$$

  • Take the negative logarithm to turn the product into a sum: $$ -\ln{\mathcal{L}(w)} = - \sum_{n=1}^N \left( \mathbb{I}[y_n = +1]\cdot\ln{\sigma(w^{T}x_n+w_0)} + \mathbb{I}[y_n = -1]\cdot\ln{(1-\sigma(w^{T}x_n+w_0))} \right) \rightarrow \min_w$$ $$L(w) = -\ln{\mathcal{L}(w)} \rightarrow \min_w $$

  • The function $L(w)$ is also called log-loss

In [17]:
plot_logloss()
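
For reference, a direct NumPy transcription of the log-loss above (a sketch; the helper names are illustrative and are not the plot_logloss helper used in this notebook):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, w0, X, y):
    """Negative log-likelihood for labels y in {-1, +1}."""
    p = sigmoid(X @ w + w0)          # p(y = +1 | x)
    pos = (y == 1)
    return -(np.log(p[pos]).sum() + np.log(1 - p[~pos]).sum())

# Tiny illustrative example
X = np.array([[0., 1.], [1., 0.], [2., 2.], [2., 3.]])
y = np.array([-1, -1, 1, 1])
print(log_loss(np.array([0.1, 0.1]), -0.3, X, y))
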
In [9]:
_ = interact(linedist_demo, 
         w1=FloatSlider(min=-5, max=5, value=1., step=0.5),
         w2=FloatSlider(min=-5, max=5, value=-1., step=0.5),
         w0=FloatSlider(min=-5, max=5, value=0., step=0.5))

Another formulation with Margin

Using the property $1-\sigma(z)=\sigma(-z)$ obtain that $$ p(y=+1|x)=\sigma(w^{T}x+w_0)\Longrightarrow p(y=-1|x)=\sigma(-w^{T}x - w_0) $$

So for $y\in\{+1,-1\}$ $$ p(y|x)=\sigma(y(\langle w,x\rangle + w_0)) $$ "probability of correct classification"

Therefore ML estimation can be written as: $$ \prod_{n=1}^{N}\sigma( y_{n}(\langle w,x_{n}\rangle + w_0))\to\max_{w} $$

Loss function for binary logistic regression

For binary classification $p(y|x)=\sigma(y(\langle w,x\rangle + w_0))$

Estimation with ML:

$$ \prod_{n=1}^{N}\sigma(y_n(\langle w,x_n\rangle + w_0)) = \prod_{n=1}^{N}\sigma(y_n g(x_n)) \to\max_{w} $$

which is equivalent to $$ \sum_{n=1}^{N}\ln(1+e^{-y_ng(x_n)})\to\min_{w} $$

It follows that logistic regression is a linear classifier estimated with the loss function $\mathcal{L}(M)=\ln(1+e^{-M})$.
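
A quick numeric sanity check (a sketch) that the two formulations coincide, i.e. $-\ln\sigma(M)=\ln(1+e^{-M})$:

import numpy as np

M = np.linspace(-5, 5, 11)                # margins y * g(x)
lhs = -np.log(1.0 / (1.0 + np.exp(-M)))   # -ln sigma(M)
rhs = np.log(1.0 + np.exp(-M))            # ln(1 + e^{-M})
print(np.allclose(lhs, rhs))              # True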

Multiclass classification with binary classifiers

Multiclass classification with binary classifiers

  • Task - make $C$-class classification using many binary classifiers.
  • Approaches:

    • one-versus-all

      • for each $c=1,2,\dots,C$ train a binary classifier on all objects with target $\mathbb{I}[y_{n}=c]$,
      • assign the class whose classifier gives the highest score among the resulting $C$ classifiers.
    • one-versus-one

      • for each pair $i,j\in\{1,2,\dots,C\}$, $i\ne j$, train on objects with $y_{n}\in\{i,j\}$ with target $y_{n}$,
      • assign the class that wins the most of the resulting $C(C-1)/2$ pairwise classifiers.
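
Both strategies are available in scikit-learn as wrappers around any binary classifier; a minimal sketch on the iris dataset ($C=3$ classes, so both schemes happen to build 3 classifiers):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)   # C = 3 classes

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print('one-versus-all classifiers:', len(ovr.estimators_))   # C
print('one-versus-one classifiers:', len(ovo.estimators_))   # C*(C-1)/2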

Binary linear classifier

  • For two classes $y\in\{+1,-1\}$ classifier becomes $$ \widehat{y}(x)=\begin{cases} +1, & w_{+1}^{T}x+w_{+1,0}>w_{-1}^{T}x+w_{-1,0}\\ -1 & \text{otherwise} \end{cases} $$
  • This decision rule is equivalent to $$ \begin{align*} \widehat{y}(x) & =sign(w_{+1}^{T}x+w_{+1,0}-w_{-1}^{T}x-w_{-1,0})\\ & =sign\left(\left(w_{+1}-w_{-1}\right)^{T}x+\left(w_{+1,0}-w_{-1,0}\right)\right)\\ & =sign\left(w^{T}x+w_{0}\right) \end{align*} $$ for $w=w_{+1}-w_{-1},\,w_{0}=w_{+1,0}-w_{-1,0}$.
  • Decision boundary $w^{T}x+w_{0}=0$ is linear.
  • The multiclass case can be solved with multiple binary classifiers via one-vs-all or one-vs-one
    • can you imagine a faulty situation with these approaches for linear classifiers?

Linear classifier for multiple classes

  • Classification among classes $1,2,\dots,C$.
  • Use $C$ discriminant functions $g_{c}(x)=w_{c}^{T}x+w_{c0}$
  • Decision rule: $$ \widehat{y}(x)=\arg\max_{c}g_{c}(x) $$
  • Decision boundary between classes $y=i$ and $y=j$ is linear: $$ \left(w_{i}-w_{j}\right)^{T}x+\left(w_{i0}-w_{j0}\right)=0 $$
  • Decision regions are convex

Logistic regression for multiple classes

Each class has its own set of weights: $$ \begin{cases} score(\omega_{1}|x)=w_{1}^{T}x + w_{0,1} \\ score(\omega_{2}|x)=w_{2}^{T}x + w_{0,2}\\ \cdots\\ score(\omega_{C}|x)=w_{C}^{T}x + w_{0,C} \end{cases} $$

In addition, a relationship between the score and the class probability is assumed:

$$ p(\omega_{c}|x)=softmax(\omega_c|W, x)=\frac{exp(w_{c}^{T}x + w_{0,c})}{\sum_{i}exp(w_{i}^{T}x + w_{0,i})} $$

Estimation with ML: $$ \prod_{n=1}^{N}\prod_{c=1}^{C} softmax(\omega_c|W, x_n)^{\mathbb{I}[y_n = \omega_c]} \to\max_{W} $$

This leads us to the cross-entropy loss function $$L(W) = - \sum_{n=1}^N\sum_{c=1}^{C} \mathbb{I}[y_n = \omega_c]\cdot\ln{softmax(\omega_c|W, x_n)}$$
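
A direct NumPy transcription of the softmax probabilities and the cross-entropy loss above (a sketch with illustrative names and toy data):

import numpy as np

def softmax(scores):
    """Row-wise softmax; scores has shape (N, C)."""
    e = np.exp(scores - scores.max(axis=1, keepdims=True))  # stabilized
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(W, w0, X, y):
    """Cross-entropy loss L(W); y holds class indices 0..C-1."""
    P = softmax(X @ W.T + w0)              # shape (N, C), rows sum to 1
    return -np.log(P[np.arange(len(y)), y]).sum()

# Toy example: N = 4 objects, D = 2 features, C = 3 classes
X = np.array([[0., 1.], [1., 0.], [2., 2.], [3., 1.]])
y = np.array([0, 1, 2, 2])
W = np.zeros((3, 2))                       # one weight vector per class
w0 = np.zeros(3)
print(cross_entropy(W, w0, X, y))          # = 4 * ln(3) with zero weights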

Summary

  • Linear classifier - classifier with linear discriminant functions.
  • Binary linear classifier: $\widehat{y}(x)=sign(w^{T}x+w_{0})$.
  • Perceptron, logistic regression and SVM are linear classifiers estimated with different loss functions.
  • Weights are selected to minimize total loss on margins.
  • Regularization gives smooth control over model complexity.
  • $L_{1}$ regularization automatically selects features.

References

  • Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman) - Chapter 4.4
  • Pattern Recognition and Machine Learning (Christopher M. Bishop) - Chapter 4