Data Analysis

Andrey Shestakov (avshestakov@hse.ru)


Linear regression. Gradient-based optimization¹

1. Some materials are taken from the machine learning course of Victor Kitov

Let's recall previous lecture

  • Decision trees
    • Utilize the notion of impurity
    • Work for both classification and regression
  • Implicit feature selection
    • Good for features of different nature
  • Class-separating boundaries are parallel to the axes
  • Local greedy optimization
  • Sensitive to even tiny data perturbations

Linear regression

Example: flat prices

  • Obviously, these characteristics (total area, number of bedrooms, house age) somehow relate to the price ($f: X \rightarrow Y$)
  • Formalize a model to predict the flat price: $$a(x) = a(total\_area, nmbr\_of\_bedrooms, house\_age) = \hat{y}$$
  • Let it be a linear model (a toy illustration follows below): $$a(x) = \beta_0 + \beta_1\cdot total\_area + \beta_2 \cdot nmbr\_of\_bedrooms + \beta_3 \cdot house\_age$$
  • Learning: find coefficients $\beta_0,\dots, \beta_3$ that minimize the error on the training set
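
As a toy sketch of what such a model computes (the coefficient values below are made up for illustration only, not estimated from any data):

{python}
# Hypothetical coefficients, for illustration only
beta = [10_000, 1_200, 5_000, -300]   # beta_0, beta_1, beta_2, beta_3

def predict_price(total_area, nmbr_of_bedrooms, house_age):
    # a(x) = beta_0 + beta_1*total_area + beta_2*nmbr_of_bedrooms + beta_3*house_age
    return (beta[0] + beta[1] * total_area
            + beta[2] * nmbr_of_bedrooms + beta[3] * house_age)

predict_price(total_area=54, nmbr_of_bedrooms=2, house_age=10)   # 81800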

Cars price vs mileage

In [17]:
df_auto.plot(x='mileage', y='price', kind='scatter')

Linear regression

  • Our goal is to determine a linear dependence between the features $X$ and the target vector $y$
  • Define $x_n^d$ as the $d$-th feature of the $n$-th object and $y_{n} \in \mathbb{R}$ as the target value for the $n$-th object: $$f(x_{n}, \beta) = \hat{y}_{n} = \beta_0 + \beta_1x_{n}^1 + \dots + \beta_Dx_{n}^D$$
  • $x_{n}^0 = 1$ $\forall n$ - the intercept term
  • Need to estimate $\beta_i$.

Ordinary Least Squares: $$ L(\beta_0,\beta_1,\dots) = \frac{1}{2N}\sum^{N}_{n=1}(\hat{y}_{n} - y_{n})^2 = \frac{1}{2N}\sum^{N}_{n=1}\left(\sum_{d=0}^{D}\beta_{d}x_{n}^{d}-y_{n}\right)^{2} \rightarrow \min\limits_{\beta} $$
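
For concreteness, this loss can be computed literally from the sum above; a minimal sketch, assuming `X` holds the objects row by row with the constant column $x_n^0 = 1$ already included:

{python}
def ols_loss(beta, X, y):
    # L(beta) = 1/(2N) * sum_n ( sum_d beta_d * x_n^d - y_n )^2
    N = len(y)
    total = 0.0
    for n in range(N):
        y_hat = sum(beta[d] * X[n][d] for d in range(len(beta)))  # prediction for object n
        total += (y_hat - y[n]) ** 2
    return total / (2 * N)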

Solution

Let's say $f(x, \beta) = \beta_0 + \beta_1x^1$

Calculate partial derivatives wrt $\beta_0$, $\beta_1$ and set them to $0$:

$$ \frac{\partial L}{\partial \beta_0} = \frac{1}{N}\sum^{N}_{n=1}(\beta_0 + \beta_1x_{n}^1 - y_{n}) = 0$$

$$ \frac{\partial L}{\partial \beta_1} = \frac{1}{N}\sum^{N}_{n=1}(\beta_0 + \beta_1x_{n}^1 - y_{n})x^1_{n} = 0$$
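
Solving this system of two equations gives the standard closed-form estimates for simple linear regression:

$$ \hat{\beta}_1 = \frac{\sum_{n=1}^{N}(x_{n}^1 - \bar{x})(y_{n} - \bar{y})}{\sum_{n=1}^{N}(x_{n}^1 - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$$

where $\bar{x} = \frac{1}{N}\sum_n x_n^1$ and $\bar{y} = \frac{1}{N}\sum_n y_n$ are the sample means.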

Linear regression (in matrix form)

  • Define $X\in\mathbb{R}^{N\times D}$, where $\{X\}_{ij}$ is the $j$-th feature of the $i$-th object, and $y\in\mathbb{R}^{N}$ - the vector of target values $$f(x, \beta) = \hat{y} = X\beta \quad \Leftrightarrow \quad \left( \begin{array}{c} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_N \end{array} \right) = \left( \begin{array}{ccccc} 1 & x_1^1 & x_1^2 & \cdots & x_1^D\\ 1 & x_2^1 & x_2^2 & \cdots & x_2^D\\ \cdots & \cdots & \cdots & \cdots & \cdots\\ 1 & x_N^1 & x_N^2 & \cdots & x_N^D\end{array} \right) \cdot \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots\\ \beta_D \end{array} \right) $$
  • Need to estimate $\beta_i$.

Ordinary Least Squares: $$ L(\beta) = \frac{1}{2N}(\hat{y} - y)^{\top}(\hat{y} - y) = \frac{1}{2N}(X\beta - y)^{\top}(X\beta - y) \rightarrow \min\limits_{\beta} $$

Solution (in matrix form)

Expand a bit $$ \begin{align*} L(\beta) & = \frac{1}{2N}(X\beta - y)^{\top}(X\beta - y) \\ & = \frac{1}{2N}\left( \beta^\top X^\top X \beta - 2 (X\beta)^\top y + y^\top y \right) \end{align*} $$

Calculate the vector of partial derivatives (the gradient) and set it to zero:

$$ \begin{align*} \nabla L(\beta) = \left(\frac{\partial L(\beta)}{\partial\beta_i} \right)_{i=0\dots d} & = \frac{1}{2N}\left(2X^\top X \beta -2X^\top y\right) = 0 \\ X^{\top}X\beta-X^{\top}y & = 0 \\ \\ \beta & = (X^\top X)^{-1} X^\top y \quad\text{(Normal Equation)} \end{align*} $$
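
A minimal numpy sketch on synthetic data (all names and the generated data are illustrative): instead of forming the inverse explicitly, it is numerically safer to solve the linear system $X^\top X\beta = X^\top y$.

{python}
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])   # first column is the intercept x^0 = 1
true_beta = np.array([2.0, 1.0, -3.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=N)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)                 # normal equation without an explicit inverse
beta_hat                                                     # close to true_beta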

Comments

  • This is the global minimum, because the optimized criterion is convex.
  • Geometric interpretation:
    • find the linear combination of feature columns that best reproduces $y$
    • the solution is the combination of features giving the projection of $y$ onto the linear span of the feature columns.
  • Why could using the Normal Equation be bad?
    • Computing the inverse is expensive
    • Not every matrix has one (singular matrices)

Linearly dependent features (multicollinearity)

  • Solution $\widehat{\beta}=(X^{T}X)^{-1}X^{T}y$ exists when $X^{T}X$ is non-singular
  • Problem occurs when one of the features is a linear combination of the others (linear dependence)
    • interpretation: non-identifiability of $\widehat{\beta}$ for linearly dependent features:
      • linear dependence: $\exists\alpha:\,x^{T}\alpha=0\,\forall x$
      • suppose $\beta$ solves linear regression $y=x^{T}\beta$
      • then $x^{T}\beta\equiv x^{T}\beta+kx^{T}\alpha\equiv x^{T}(\beta+k\alpha)$, so $\beta+k\alpha$ is also a solution!
  • Multicollinearity can be exact or approximate (near-exact); the latter is also harmful

Examples of linear dependency

  • $x$ miles $\approx 1.6\cdot x$ kms
  • total flat area $\approx$ area of living rooms $+$ area of kitchen
  • dummy variable trap!
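
The miles/kilometres example above is easy to reproduce on synthetic data (a sketch; names and the generated values are illustrative): with an exactly dependent column, $X^\top X$ loses rank and its condition number blows up.

{python}
import numpy as np

rng = np.random.default_rng(0)
miles = rng.uniform(0, 100, size=50)
kms = 1.6 * miles                                  # exact linear dependence
X = np.column_stack([np.ones(50), miles, kms])

np.linalg.matrix_rank(X.T @ X)                     # 2, not 3: X^T X is singular
np.linalg.cond(X.T @ X)                            # enormous condition number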

Linearly dependent features

  • Problem may be solved by:
    • feature selection
    • dimensionality reduction
    • imposing additional requirements on the solution (regularization)

Analysis of linear regression

Advantages:

  • single optimum, which is global (for a non-singular $X^{\top}X$)
  • analytical solution
  • interpretable solution and algorithm

Drawbacks:

  • the model assumptions are quite restrictive (may not be satisfied)
  • $X^{T}X$ should be non-degenerate (and well-conditioned)

Optimization methods for LR

Gradient descent

Intuition

$$ L(\beta_0, \beta_1) = \frac{1}{2N}\sum_{n=1}^N(\beta_0 + \beta_1x^1_{n} - y_{n})^2$$

  • Suppose we have some initial approximation of $(\hat{\beta_0}, \hat{\beta_1})$
  • How should we change it in order to improve?
In [19]:
sq_loss_demo()

Tiny Refresher

Derivative of $f(x)$ at $x_0$: $$ f'(x_0) = \lim\limits_{h \rightarrow 0}\frac{f(x_0+h) - f(x_0)}{h}$$

The derivative gives the slope of the tangent line at $x_0$

  • If $x_0$ is an extremum of $f(x)$ and $f'(x_0)$ exists $\Rightarrow$ $f'(x_0) = 0$
In [21]:
interact(deriv_demo, h=FloatSlider(min=0.0001, max=2, step=0.005), x0=FloatSlider(min=1, max=15, step=.2))

Tiny Refresher

  • In the multidimensional world we switch to gradients and directional derivatives: $$ f'_v(x_0) = \lim\limits_{h \rightarrow 0}\frac{f(x_0+hv) - f(x_0)}{h} = \frac{d}{dh}f(x_{0,1} + hv_1, \dots, x_{0,d} + hv_d) \rvert_{h=0}, \quad ||v|| = 1 \quad \text{(directional derivative)}$$

$$ \frac{ \partial f(x_1,x_2,\dots,x_d)}{\partial x_i} = \lim\limits_{h \rightarrow 0}\frac{f(x_1, x_2, \dots, x_i + h, \dots, x_d) - f(x_1, x_2, \dots, x_i, \dots, x_d)}{h} \quad \text{(partial derivative)}$$

$$ \nabla f = \left(\frac{\partial f}{\partial x_i}\right),\quad i=1\dots d \quad \text{(gradient: the vector of partial derivatives)}$$

Tiny Refresher

  • Using the multivariate chain rule:

$$ f'_v(x_0) = \frac{d}{dh}f(x_{0,1} + hv_1, \dots, x_{0,d} + hv_d) \rvert_{h=0} = \sum_{i=1}^d \frac{\partial f}{\partial x_i} \frac{d}{dh} (x_{0,i} + hv_i) = \sum_{i=1}^d \frac{\partial f}{\partial x_i} v_i = \langle \nabla f, v \rangle$$

$$ \langle \nabla f, v \rangle = || \nabla f || \cdot ||v|| \cdot \cos{\phi} = || \nabla f || \cdot \cos{\phi}$$

Tiny Refresher

$$ \langle \nabla f, v \rangle = || \nabla f || \cdot \cos{\phi}$$
  • The directional derivative is maximal when the direction $v$ is collinear with the gradient
  • the gradient is the direction of steepest ascent of $f(x)$
  • the antigradient is the direction of steepest descent of $f(x)$ (a quick numerical check follows below)
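
A quick numerical check of this claim on a toy function (the function $f(x) = x_1^2 + 3x_2^2$, the point $x_0$ and the directions are arbitrary illustrative choices):

{python}
import numpy as np

f = lambda x: x[0] ** 2 + 3 * x[1] ** 2
grad_f = lambda x: np.array([2 * x[0], 6 * x[1]])        # analytic gradient

x0, h = np.array([1.0, 1.0]), 1e-6
directions = [np.array([1.0, 0.0]),
              np.array([0.0, 1.0]),
              grad_f(x0) / np.linalg.norm(grad_f(x0))]    # unit vector along the gradient
for v in directions:
    print(v, (f(x0 + h * v) - f(x0)) / h)                 # the largest value is along the gradient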

Given $L(\beta_0, \beta_1)$, calculate the gradient (partial derivatives): $$ \frac{\partial L}{\partial \beta_0} = \frac{1}{N}\sum^{N}_{n=1}(\beta_0 + \beta_1x_{n}^1 - y_{n})$$ $$ \frac{\partial L}{\partial \beta_1} = \frac{1}{N}\sum^{N}_{n=1}(\beta_0 + \beta_1x_{n}^1 - y_{n})x^1_{n}$$

Or in matrix form: $$ \nabla_{\beta}L(\beta) = \frac{1}{N} X^\top(X\beta - y)$$

Run the gradient update, which is a simultaneous(!) update of all components of $\beta$ in the antigradient direction:

$$ \beta := \beta - \alpha\nabla_{\beta}L(\beta)$$
  • $\alpha$ - the learning rate (descent "speed")

Pseudocode

{python}
import numpy as np

def gd(X, y, alpha, epsilon):
    beta = np.zeros(X.shape[1])                        # initialise beta
    while True:
        grad = X.T @ (X @ beta - y) / len(y)           # OLS gradient (matrix form above)
        new_beta = beta - alpha * grad                 # step in the antigradient direction
        if np.linalg.norm(new_beta - beta) < epsilon:  # until dist(new_beta, beta) < epsilon
            return new_beta
        beta = new_beta
In [23]:
grad_demo(iters=105, alpha=0.08)

Comments

  • How do we set $\alpha$?
  • Feature scales matter (see the standardization sketch below)
  • Local minima*
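
Since feature scales matter, features are usually standardized before running gradient descent; a minimal sketch (apply it to the raw features, before appending the intercept column; `sklearn.preprocessing.StandardScaler` does the same thing):

{python}
import numpy as np

def standardize(X):
    # centre each feature and divide by its standard deviation
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / sigma, mu, sigma   # keep mu, sigma to transform new data identically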

Gradient descent modifications

  • Stochastic gradient descent (!)
  • Descent with momentum
  • Adagrad

Methods overview 1, Methods overview 2

Stochastic gradient descent

{python}
import numpy as np

def sgd(X, y, alpha, epsilon):
    beta = np.zeros(X.shape[1])                         # initialise beta
    while True:
        old_beta = beta.copy()
        for i in np.random.permutation(len(y)):         # shuffle the objects
            grad = X[i] * (X[i] @ beta - y[i])          # gradient on a single object
            beta = beta - alpha * grad                  # update after every object
        if np.linalg.norm(beta - old_beta) < epsilon:   # until dist(new_beta, beta) < epsilon
            return beta
In [24]:
stoch_grad_demo(iters=105, alpha=0.08)

Momentum

Idea: move not only in the current antigradient direction but also take the previous step into account

$$ v_0 = 0$$

$$ v_t = \gamma v_{t - 1} + \alpha\nabla_{\beta}{L(\beta)}$$

$$ \beta = \beta - v_t$$

where

  • $\gamma$ — momentum term (usually 0.9)
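
A minimal sketch of a training loop with momentum for the OLS loss (the gradient is the matrix form derived earlier; the hyperparameter defaults and the fixed iteration count are illustrative):

{python}
import numpy as np

def gd_momentum(X, y, alpha=0.01, gamma=0.9, n_iters=1000):
    beta = np.zeros(X.shape[1])
    v = np.zeros_like(beta)                        # v_0 = 0
    for _ in range(n_iters):
        grad = X.T @ (X @ beta - y) / len(y)       # OLS gradient
        v = gamma * v + alpha * grad               # v_t = gamma * v_{t-1} + alpha * grad
        beta = beta - v                            # beta = beta - v_t
    return beta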

Adagrad

Idea: update the parameters $\beta_i$ for each feature differently.

Denote $\frac{\partial L}{\partial \beta_i}$ on iteration $t$ as $g_{t,i}$

Vanilla gradient descent update:

$$ \beta_{t+1, i} = \beta_{t, i} - \alpha \cdot g_{t,i}$$

In Adagrad, $\alpha$ is normalized with respect to the "size" of the previous derivatives:

$$ \beta_{t+1, i} = \beta_{t, i} - \dfrac{\alpha}{\sqrt{G_{t,ii} + \varepsilon}} \cdot g_{t,i}$$

where $G_t$ is a diagonal matrix whose $i$-th diagonal element is the sum of squared gradients with respect to $\beta_{i}$ accumulated before iteration $t$, and $\varepsilon$ is a smoothing hyperparameter.
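
A sketch of the same loop with the Adagrad update; only the diagonal of $G_t$ is stored as a per-coordinate accumulator (the hyperparameter defaults and the fixed iteration count are illustrative):

{python}
import numpy as np

def gd_adagrad(X, y, alpha=0.1, eps=1e-8, n_iters=1000):
    beta = np.zeros(X.shape[1])
    G = np.zeros_like(beta)                          # running sum of squared gradients, diag(G_t)
    for _ in range(n_iters):
        g = X.T @ (X @ beta - y) / len(y)            # g_t
        G += g ** 2
        beta = beta - alpha / np.sqrt(G + eps) * g   # per-feature adaptive step size
    return beta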

FYI

  • Zero-order methods
    • like...
  • 2nd order methods
    • Newton method

Nonlinear dependencies

Generalization by nonlinear transformations

Nonlinearity in $x$ may be achieved within linear regression by applying non-linear transformations to the features:

$$ x\to[\phi_{0}(x),\,\phi_{1}(x),\,\phi_{2}(x),\,...\,\phi_{M}(x)] $$

$$ f(x)=\mathbf{\phi}(x)^{T}\beta=\sum_{m=0}^{M}\beta_{m}\phi_{m}(x) $$

The model remains linear in $\beta$, so all the advantages of linear regression are preserved.

Typical transformations

  • $x^{i}\in[a,b]$ : binarization of feature
  • $x^{i}x^{j}$ : interaction of features
  • $\exp\left\{ -\gamma\left\lVert x-\tilde{x}\right\rVert ^{2}\right\} $ : closeness to reference point $\tilde{x}$
  • $\ln x^{k}$ : evens out distributions with heavy tails (a sketch of such transformations follows the list)
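
A sketch of building several of these transformed features by hand for a two-feature input (the function name, the reference point and the particular choice of transformations are illustrative; the regression is then fit on the transformed matrix exactly as before):

{python}
import numpy as np

def transform(X, ref_point, gamma=1.0):
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([
        np.ones(len(X)),                                      # phi_0(x) = 1 (intercept)
        x1, x2,                                               # original features
        x1 * x2,                                              # interaction
        np.exp(-gamma * ((X - ref_point) ** 2).sum(axis=1)),  # closeness to a reference point
        np.log1p(np.abs(x1)),                                 # log for heavy-tailed features
    ])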
In [26]:
demo_weights()