Data Analysis

Andrey Shestakov (avshestakov@hse.ru)


Linear regression. Gradient-based optimization1

1. Some materials are taken from the machine learning course of Victor Kitov

Let's recall previous lecture

  • Decision trees
    • Utilize the notion of impurity
    • Work for both classification and regression
  • Implicit feature selection
    • good for features of different nature
  • Class-separating boundaries are parallel to the axes
  • Local greedy optimization
  • Sensitive to even tiny data perturbations

Linear regression

In [3]:
df_auto.plot(x='mileage', y='price', kind='scatter')

Linear regression

  • Our goal is to determine a linear dependence between the features $X$ and the target vector $y$
  • Define $X\in\mathbb{R}^{N\times D}$, where $\{X\}_{ij}$ is the $j$-th feature of the $i$-th object, and $y\in\mathbb{R}^{N}$, where $y_{i}$ is the target value for the $i$-th object $$f(x, \beta) = \hat{y} = X\beta \quad \Leftrightarrow \quad f(x_{n}, \beta) = \hat{y}_{n} = \beta_0 + \beta_1x_{n}^1 + \dots$$
  • We need to estimate the coefficients $\beta_i$.

Ordinary Least Squares (assuming a constant feature $x_{n}^{0}=1$ for the intercept): $$ L(\beta_0,\beta_1,\dots) = \frac{1}{2N}\sum^{N}_{n=1}(\hat{y}_{n} - y_{n})^2 = \frac{1}{2N}\sum^{N}_{n=1}\left(\sum_{d=0}^{D}\beta_{d}x_{n}^{d}-y_{n}\right)^{2} \rightarrow \min\limits_{\beta} $$ $$ \Updownarrow $$
$$ L(\beta) = \frac{1}{2N}(\hat{y} - y)^{\top}(\hat{y} - y) = \frac{1}{2N}(X\beta - y)^{\top}(X\beta - y) \rightarrow \min\limits_{\beta} $$

Solution

Let's say $f(x, \beta) = \beta_0 + \beta_1x^1$

Calculate partial derivatives wrt $\beta_0$, $\beta_1$ and set them to $0$:

$$ \frac{\partial L}{\partial \beta_0} = \frac{1}{N}\sum^{N}_{n=1}(\beta_0 + \beta_1x_{n}^1 - y_{n}) = 0$$ $$ \frac{\partial L}{\partial \beta_1} = \frac{1}{N}\sum^{N}_{n=1}(\beta_0 + \beta_1x_{n}^1 - y_{n})x^1_{n} = 0$$

Solution (in matrix form)

Either with matrix calculus or with a bit of linear algebra we get

$$ X^{T}(X\beta-y)=0 \quad \Leftrightarrow \quad \beta = (X^\top X)^{-1} X^\top y \quad\text{(Normal Equation)}$$
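A minimal NumPy sketch of solving the normal equation on toy data (the data and the function name are illustrative, not part of the lecture code); `np.linalg.solve` is used instead of forming the inverse explicitly, which is cheaper and numerically safer:

{python}
import numpy as np

def ols_normal_equation(X, y):
    # solve X^T X beta = X^T y without explicitly inverting X^T X
    return np.linalg.solve(X.T @ X, X.T @ y)

# toy usage: y ≈ 2 + 3*x
rng = np.random.RandomState(0)
x = rng.rand(100)
X = np.column_stack([np.ones_like(x), x])   # prepend a constant column for the intercept
y = 2 + 3 * x + 0.1 * rng.randn(100)
print(ols_normal_equation(X, y))            # approximately [2, 3]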

Comments

  • This is the global minimum, because the optimized criterion is convex.
  • Geometric interpretation:
    • find the linear combination of feature measurements that best reproduces $y$
    • solution - the combination of features giving the projection of $y$ onto the linear span of the feature measurements.
  • Why could using the Normal Equation be bad?
    • Computing the inverse is expensive
    • Not every matrix has one (singular matrices)

Linearly dependent features (multicollinearity)

  • Solution $\widehat{\beta}=(X^{T}X)^{-1}X^{T}y$ exists when $X^{T}X$ is non-singular
  • The problem occurs when one of the features is a linear combination of the others (linear dependence)
    • because of the property $\forall X:\,rank(X)=rank(X^{T}X)$
    • interpretation: non-identifiability of $\widehat{\beta}$ for linearly dependent features:
      • linear dependence: $\exists\alpha:\,x^{T}\alpha=0\,\forall x$
      • suppose $\beta$ solves linear regression $y=x^{T}\beta$
      • then $x^{T}\beta\equiv x^{T}\beta+kx^{T}\alpha\equiv x^{T}(\beta+k\alpha)$, so $\beta+k\alpha$ is also a solution!
  • Multicollinearity can be exact or approximate (near-collinearity), which is also bad
  • Dummy variable trap!

Linearly dependent features

  • Problem may be solved by:
    • feature selection
    • dimensionality reduction
    • imposing additional requirements on the solution (regularization)

Analysis of linear regression

Advantages:

  • single optimum, which is global (for non-singular matrix)
  • analytical solution
  • interpretable solution and algorithm

Drawbacks:

  • model assumptions are too simple (may not hold)
  • $X^{T}X$ should be non-degenerate (and well-conditioned)

Optimization methods for LR

Gradient descent

Intuition

$$ L(\beta_0, \beta_1) = \frac{1}{2N}\sum_{n=1}^N(\beta_0 + \beta_1x^1_{n} - y_{n})^2$$

  • Suppose we have some initial approximation of $(\hat{\beta_0}, \hat{\beta_1})$
  • How should we change it in order to improve?
In [5]:
sq_loss_demo()

Tiny Refresher

Derivative of $f(x)$ in $x_0$: $$ f'(x_0) = \lim\limits_{h \rightarrow 0}\frac{f(x_0+h) - f(x_0)}{h}$$

The derivative gives the slope of the tangent line at $x_0$

  • If $x_0$ is an extreme point of $f(x)$ and $f'(x_0)$ exists $\Rightarrow$ $f'(x_0) = 0$
In [7]:
interact(deriv_demo, h=FloatSlider(min=0.0001, max=2, step=0.005), x0=FloatSlider(min=1, max=15, step=.2))

Tiny Refresher

  • In the multidimensional world we switch to gradients and directional derivatives: $$ f'_v(x_0) = \lim\limits_{h \rightarrow 0}\frac{f(x_0+hv) - f(x_0)}{h} = \frac{d}{dh}f(x_{0,1} + hv_1, \dots, x_{0,d} + hv_d) \rvert_{h=0}, \quad ||v|| = 1 \quad \text{directional derivative}$$

$$ \frac{ \partial f(x_1,x_2,\dots,x_d)}{\partial x_i} = \lim\limits_{h \rightarrow 0}\frac{f(x_1, x_2, \dots, x_i + h, \dots, x_d) - f(x_1, x_2, \dots, x_i, \dots, x_d)}{h} \quad \text{partial derivative}$$

$$ \nabla f = \left(\frac{\partial f}{\partial x_i}\right),\quad i=1\dots d \quad \text{Gradient = a vector of partial derivatives}$$

Tiny Refresher

  • Using the multivariate chain rule:

$$ f'_v(x_0) = \frac{d}{dh}f(x_{0,1} + hv_1, \dots, x_{0,d} + hv_d) \rvert_{h=0} = \sum_{i=1}^d \frac{\partial f}{\partial x_i} \frac{d}{dh} (x_{0,i} + hv_i) = \langle \nabla f, v \rangle$$

$$ \langle \nabla f, v \rangle = || \nabla f || \cdot ||v|| \cdot \cos{\phi} = || \nabla f || \cdot \cos{\phi}$$

Tiny Refresher

$$ \langle \nabla f, v \rangle = || \nabla f || \cdot \cos{\phi}$$

  • The directional derivative is maximal when the direction is collinear with the gradient
  • gradient — direction of steepest ascent of $f(x)$
  • antigradient — direction of steepest descent of $f(x)$

Given $L(\beta_0, \beta_1)$, calculate the gradient (partial derivatives): $$ \frac{\partial L}{\partial \beta_0} = \frac{1}{N}\sum^{N}_{n=1}(\beta_0 + \beta_1x_{n}^1 - y_{n})$$ $$ \frac{\partial L}{\partial \beta_1} = \frac{1}{N}\sum^{N}_{n=1}(\beta_0 + \beta_1x_{n}^1 - y_{n})x^1_{n}$$

Or in matrix form: $$ \nabla_{\beta}L(\beta) = \frac{1}{N} X^\top(X\beta - y)$$

Run the gradient update, which is a simultaneous(!) update of all components of $\beta$ in the antigradient direction:

$$ \beta := \beta - \alpha\nabla_{\beta}L(\beta)$$

  • $\alpha$ - the descent "speed" (learning rate)

Pseudocode

{python}
import numpy as np

def grad(X, y, beta):
    # gradient of the OLS loss: (1/N) * X^T (X beta - y)
    return X.T @ (X @ beta - y) / len(y)

def gd(X, y, alpha, epsilon):
    beta = np.zeros(X.shape[1])                      # initialise beta
    while True:
        new_beta = beta - alpha * grad(X, y, beta)   # step in the antigradient direction
        if np.linalg.norm(new_beta - beta) < epsilon:
            return new_beta
        beta = new_beta
In [9]:
grad_demo(iters=105, alpha=0.08)

Comments

  • How do we set $\alpha$?
  • Feature scales matter
  • Local minima*

Gradient descent modifications

  • Stochastic gradient descent (!)
  • Descent with momentum
  • Adagrad

Methods overview 1, Methods overview 2

Stochastic gradient descent

{python}
def sgd(X, y, alpha, epsilon):
    beta = np.zeros(X.shape[1])                          # initialise beta
    while True:
        prev_beta = beta.copy()
        for i in np.random.permutation(len(y)):          # shuffle the objects
            x_i, y_i = X[i:i + 1], y[i:i + 1]
            beta = beta - alpha * grad(x_i, y_i, beta)   # step on a single object
        if np.linalg.norm(beta - prev_beta) < epsilon:
            return beta
In [10]:
stoch_grad_demo(iters=105, alpha=0.08)

Momentum

Idea: move not only in the current antigradient direction but also take the previous step into account (see the sketch below)

$$ v_t = \gamma v_{t - 1} + \alpha\nabla_{\beta}{L(\beta)}$$ $$ \beta = \beta - v_t$$

where

  • $\gamma$ — momentum term (usually 0.9)
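A minimal sketch of gradient descent with momentum, reusing the `grad` helper and the `numpy` import from the gradient-descent block above; the step size, `gamma`, and iteration count are illustrative choices:

{python}
def gd_momentum(X, y, alpha=0.1, gamma=0.9, n_iters=1000):
    beta = np.zeros(X.shape[1])
    v = np.zeros_like(beta)                          # accumulated "velocity"
    for _ in range(n_iters):
        v = gamma * v + alpha * grad(X, y, beta)     # mix the previous direction with the current gradient
        beta = beta - v
    return beta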

Adagrad

Idea: update the parameter $\beta_i$ of each feature differently.

Denote $\frac{\partial L}{\partial \beta_i}$ on iteration $t$ as $g_{t,i}$

Vanilla gradient descent update

$$ \beta_{t+1, i} = \beta_{t, i} - \alpha \cdot g_{t,i}$$

In Adagrad $\alpha$ is normalized wrt "size" of previous derivatives:

$$ \beta_{t+1, i} = \beta_{t, i} - \dfrac{\alpha}{\sqrt{G_{t,ii} + \varepsilon}} \cdot g_{t,i}$$

where $G_t$ is a diagonal matrix whose entry $G_{t,ii}$ is the sum of squared partial derivatives w.r.t. $\beta_{i}$ accumulated before iteration $t$, and $\varepsilon$ is a smoothing hyperparameter.
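A minimal Adagrad sketch under the same assumptions as above (reusing `grad` and `numpy`; hyperparameter values are illustrative):

{python}
def adagrad(X, y, alpha=0.5, eps=1e-8, n_iters=1000):
    beta = np.zeros(X.shape[1])
    G = np.zeros_like(beta)                          # running sum of squared gradients per coordinate
    for _ in range(n_iters):
        g = grad(X, y, beta)
        G += g ** 2
        beta = beta - alpha * g / np.sqrt(G + eps)   # per-feature normalized step
    return beta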

FYI

  • Zero-order methods
    • like...
  • 2nd order methods
    • Newton method

Nonlinear dependencies

Generalization by nonlinear transformations

Nonlinearity in $x$ may be achieved in linear regression by applying non-linear transformations to the features:

$$ x\to[\phi_{0}(x),\,\phi_{1}(x),\,\phi_{2}(x),\,...\,\phi_{M}(x)] $$

$$ f(x)=\mathbf{\phi}(x)^{T}\beta=\sum_{m=0}^{M}\beta_{m}\phi_{m}(x) $$

The model remains linear in $\beta$, so all the advantages of linear regression are preserved.
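A minimal scikit-learn sketch of fitting a linear model on nonlinearly transformed features (the polynomial expansion and toy data are illustrative choices, not the lecture's `demo_weights` code):

{python}
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# degree-2 expansion adds squares and pairwise interactions x^i * x^j
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
x = np.linspace(0, 3, 50).reshape(-1, 1)
y_curve = np.sin(x).ravel() + 0.1 * np.random.randn(50)
model.fit(x, y_curve)
print(model.predict([[1.5]]))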

Typical transformations

  • $\mathbb{I}[x^{i}\in[a,b]]$ : binarization of a feature
  • $x^{i}x^{j}$ : interaction of features
  • $\exp\left\{ -\gamma\left\lVert x-\tilde{x}\right\rVert ^{2}\right\} $ : closeness to a reference point $\tilde{x}$
  • $\ln x^{k}$ : aligning distributions with heavy tails
In [12]:
demo_weights()

Regularization & restrictions

Intuition

[Andrew Ng's Machine Learning Class - Stanford]

Regularization

  • Add the requirement that the regularizer $R(\beta)$ be small: $$ \sum_{n=1}^{N}\left(x_{n}^{T}\beta-y_{n}\right)^{2}+\lambda R(\beta)\to\min_{\beta} $$
  • $\lambda>0$ - hyperparameter.
  • $R(\beta)$ penalizes the complexity of the model. $$ \begin{array}{ll} R(\beta)=||\beta||_{1} & \mbox{Lasso regression}\\ R(\beta)=||\beta||_{2}^{2} & \text{Ridge regression} \end{array} $$
  • Not only accuracy matters for the solution but also model simplicity!
  • $\lambda$ controls the complexity of the model: $\uparrow\lambda\Leftrightarrow\text{complexity}\downarrow$ (see the sketch below).
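A minimal scikit-learn sketch comparing ridge and lasso on illustrative synthetic data (here `alpha` plays the role of $\lambda$; the true coefficient vector is chosen to be sparse):

{python}
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.randn(100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print(ridge.coef_)   # all coefficients shrunk, but nonzero
print(lasso.coef_)   # some coefficients driven exactly to zero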

Comments

  • Dependence of $\beta$ on $\lambda$ for ridge (A) and LASSO (B):

  • LASSO can be used for automatic feature selection.

  • $\lambda$ is usually found using cross-validation on exponential grid, e.g. $[10^{-6},10^{-5},...10^{5},10^{6}]$.
  • It's always recommended to use regularization because
    • it gives smooth control over model complexity.
    • it reduces ambiguity when multiple solutions exist.

ElasticNet

  • ElasticNet:

$$ R(\beta)=\alpha||\beta||_{1}+(1-\alpha)||\beta||_{2}^{2} $$ $\alpha\in(0,1)$ - hyperparameter controlling the impact of each part.

  • If two features $x^{i}$ and $x^{j}$ are equal:
    • LASSO may keep only one of them
    • Ridge will take both with equal weights
      • but it doesn't remove useless features
  • ElasticNet both removes useless features and gives equal weights to useful identical features (see the sketch below)
    • good, because feature equality may be due to chance on this particular training set
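A minimal scikit-learn sketch, reusing the synthetic `X`, `y` from the ridge/lasso example above; in scikit-learn the mixing parameter is called `l1_ratio` and corresponds (up to a scaling convention) to $\alpha$ in the formula:

{python}
from sklearn.linear_model import ElasticNet

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)   # half L1, half L2 penalty
print(enet.coef_)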

Ridge regression analytic solution

Ridge regression criterion $$ \sum_{n=1}^{N}\left(x_{n}^{T}\beta-y_{n}\right)^{2}+\lambda\beta^{T}\beta\to\min_{\beta} $$

Stationarity condition can be written as:

$$ \begin{gathered}2\sum_{n=1}^{N}x_{n}\left(x_{n}^{T}\beta-y_{n}\right)+2\lambda\beta=0\\ 2X^{T}(X\beta-y)+2\lambda\beta=0\\ \left(X^{T}X+\lambda I\right)\beta=X^{T}y \end{gathered} $$

so

$$ \widehat{\beta}=(X^{T}X+\lambda I)^{-1}X^{T}y $$
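A direct NumPy sketch of this closed-form solution (the function name is illustrative; the synthetic `X`, `y` from the regularization sketch above are reused):

{python}
import numpy as np

def ridge_closed_form(X, y, lam):
    # (X^T X + lambda * I) beta = X^T y
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

print(ridge_closed_form(X, y, lam=1.0))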

Comments

  • $X^{T}X+\lambda I$ is always non-degenerate as a sum of:
    • non-negative definite $X^{T}X$
    • positive definite $\lambda I$
  • Intuition:
    • out of all valid solutions, select the one giving the simplest model
  • Other regularizations also restrict the set of solutions.

Different account for different features

  • The traditional approach regularizes all features uniformly: $$ \sum_{n=1}^{N}\left(x_{n}^{T}\beta-y_{n}\right)^{2}+\lambda R(\beta)\to\min_{\beta} $$
  • Suppose we have $K$ groups of features with indices: $$ I_{1},I_{2},...I_{K} $$
  • We may control the impact of each group on the model by: $$ \sum_{n=1}^{N}\left(x_{n}^{T}\beta-y_{n}\right)^{2}+\lambda_{1}R(\{\beta_{i}|i\in I_{1}\})+...+\lambda_{K}R(\{\beta_{i}|i\in I_{K}\})\to\min_{\beta} $$
  • $\lambda_{1},\lambda_{2},...\lambda_{K}$ can be set using cross-validation
  • In practice, a common regularizer is used, but with different feature scaling.

Different loss-functions

Idea

  • Generalize the squared loss to an arbitrary loss: $$ \sum_{n=1}^{N}\left(x_{n}^{T}\beta-y_{n}\right)^{2}\to\min_{\beta}\qquad\Longrightarrow\qquad\sum_{n=1}^{N}\mathcal{L}(x_{n}^{T}\beta-y_{n})\to\min_{\beta} $$

$$ \begin{array}{lll} \textbf{LOSS} & \textbf{NAME} & \textbf{PROPERTIES}\\ \mathcal{L}(\varepsilon)=\varepsilon^{2} & \text{quadratic} & \text{differentiable}\\ \mathcal{L}(\varepsilon)=\left|\varepsilon\right| & \text{absolute} & \text{robust}\\ \mathcal{L}(\varepsilon)=\begin{cases} \frac{1}{2}\varepsilon^{2}, & \left|\varepsilon\right|\le\delta\\ \delta\left(\left|\varepsilon\right|-\frac{1}{2}\delta\right) & \left|\varepsilon\right|>\delta \end{cases} & \text{Huber} & \text{differentiable, robust} \end{array} $$

  • Robust means the solution is robust to outliers in the training set (see the sketch below).
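A minimal scikit-learn sketch contrasting the quadratic and Huber losses on illustrative data with a few outliers (in `HuberRegressor` the threshold parameter is called `epsilon` and plays the role of $\delta$):

{python}
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.RandomState(0)
x = rng.rand(100, 1)
y = 2 + 3 * x.ravel() + 0.1 * rng.randn(100)
y[:5] += 20                                       # a few large outliers

ols = LinearRegression().fit(x, y)
huber = HuberRegressor(epsilon=1.35).fit(x, y)
print(ols.coef_, huber.coef_)                     # the Huber fit is far less affected by the outliers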

Loss function matters

  • For $y_{1},...y_{N}\in\mathbb{R}$ constant minimizers $\widehat{\mu}$: $$ \begin{align*} \arg\min_{\mu}\sum_{n=1}^{N}(y_{n}-\mu)^{2} & = {\frac{1}{N}\sum_{n=1}^{N}y_{n}}\\ \arg\min_{\mu}\sum_{n=1}^{N}\left|y_{n}-\mu\right| & = {\text{ median}\{y_{1},...y_{N}\}} \end{align*} $$

For $x,y\sim P(x,y)$ and functional minimizers $f(x)$: $$ \begin{align*} \arg\min_{f(x)}\mathbb{E}\left\{ \left.\left(f(x)-y\right)^{2}\right|x\right\} & ={\mathbb{E}[y|x]}\\ \arg\min_{f(x)}\mathbb{E}\left\{ \left.|f(x)-y|\,\right|x\right\} & =\text{median}[y|x] \end{align*} $$
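A quick numeric check of the two constant-minimizer facts on a small illustrative sample: minimizing the squared loss over a grid of candidate constants recovers the mean, minimizing the absolute loss recovers the median.

{python}
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
grid = np.linspace(0, 100, 10001)
sq = ((grid[:, None] - y) ** 2).sum(axis=1)       # squared loss for each candidate constant
ab = np.abs(grid[:, None] - y).sum(axis=1)        # absolute loss for each candidate constant
print(grid[sq.argmin()], y.mean())                # 22.0  22.0  -> the mean
print(grid[ab.argmin()], np.median(y))            # 3.0   3.0   -> the median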

Weighted account for observations

Weighted account for observations

Weight each observation's contribution to the loss: $$ \sum_{n=1}^{N}w_{n}(x_{n}^{T}\beta-y_{n})^{2} $$

  • Weights may be:
    • increased for incorrectly predicted objects
      • the algorithm becomes more focused on correcting its errors
    • decreased for incorrectly predicted objects
      • they may be considered outliers that break our model

Solution for weighted regression

$$ \sum_{n=1}^{N}w_{n}\left(x_{n}^{T}\beta-y_{n}\right)^{2}\to\min_{\beta} $$

Stationarity condition: $$ \sum_{n=1}^{N}w_{n}x_{n}^{d}\left(x_{n}^{T}\beta-y_{n}\right)=0 $$

Define $\{X\}_{n,d}=x_{n}^{d}$, $W=\mathrm{diag}\{w_{1},...,w_{N}\}$. Then

$$ X^{T}W\left(X\beta-y\right)=0 $$ $$ \beta=\left(X^{T}WX\right)^{-1}X^{T}Wy $$
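A minimal NumPy sketch of this weighted solution (the function name is illustrative; for a diagonal $W$ it would be cheaper to scale rows, but the explicit form mirrors the formula):

{python}
import numpy as np

def weighted_ols(X, y, w):
    W = np.diag(w)                                   # diagonal weight matrix
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)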

Robust regression

  • Initialize $w_{1}=...=w_{N}=1/N$

  • Repeat:

    • estimate regression $\widehat{y}(x)$ using observations $(x_{i},y_{i})$ with weights $w_{i}$.
    • for each $i=1,2,...N$:
      • re-estimate $\varepsilon_{i}=\widehat{y}(x_{i})-y_{i}$
      • recalculate $w_{i}=K\left(\left|\varepsilon_{i}\right|\right)$
    • normalize weights $w_{i}=\frac{w_{i}}{\sum_{n=1}^{N}w_{n}}$

Comments: $K(\cdot)$ is some decreasing function; the loop may be repeated

  • a predefined number of times
  • until convergence of the model parameters.
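A minimal sketch of this iteratively reweighted procedure, reusing the `weighted_ols` helper from the previous slide; the particular decreasing kernel $K(|\varepsilon|)=1/(1+|\varepsilon|)$ and the fixed number of repetitions are illustrative choices:

{python}
def robust_regression(X, y, n_iters=10):
    N = len(y)
    w = np.ones(N) / N                       # initialize w_i = 1/N
    for _ in range(n_iters):
        beta = weighted_ols(X, y, w)         # weighted fit with the current weights
        eps = X @ beta - y                   # residuals
        w = 1.0 / (1.0 + np.abs(eps))        # K(|eps|): a decreasing function of the residual
        w /= w.sum()                         # normalize the weights
    return beta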

Example

Nadaraya-Watson regression

Minimum squared error estimate

For a training sample $\left(x_{1},y_{1}\right),...\left(x_{N},y_{N}\right)$ consider finding a constant $\widehat{y}\in\mathbb{R}$ $$ L(\widehat{y})=\sum_{i=1}^{N}(\widehat{y}-y_{i})^{2}\to\min_{\widehat{y}\in\mathbb{R}} $$ $$ \frac{\partial L}{\partial\widehat{y}}=2\sum_{i=1}^{N}\left(\widehat{y}-y_{i}\right)=0\text{, so }\widehat{y}=\frac{1}{N}\sum_{i=1}^{N}y_{i} $$

We need to model general curve $y(x)$:

Nadaraya-Watson regression is a localized averaging approach.

Nadaraya-Watson regression

  • Equivalent names: local constant regression, kernel regression
  • For each $x$ assume $f(x)=const=\alpha,\,\alpha\in\mathbb{R}$. $$ Q(\alpha,X_{training})=\sum_{i=1}^{N}w_{i}(x)(\alpha-y_{i})^{2}\to\min_{\alpha\in\mathbb{R}} $$
  • Weights depend on the proximity of training objects to the predicted object:

$$ w_{i}(x)=K\left(\frac{\rho(x,x_{i})}{h}\right) $$

  • $K(u)$ - some decreasing function, called the kernel.
  • $h(x)$ - some non-negative function called the bandwidth.
    • Intuition: "window width"; consider $h(x)=h,\,K(u)=\mathbb{I}[u\le1].$

Parameters

  • Typically used $K(u)$: $$ \begin{eqnarray*} K_{G}(u) & = & e^{-\frac{1}{2}u^{2}}-\text{Gaussian kernel}\\ K_{P}(u) & = & (1-u^{2})^{2}\mathbb{I}[|u|<1]-\text{quartic kernel} \end{eqnarray*} $$
  • Typically used $h(x)$:
    • $h(x)=const$
    • $h(x)=\rho(x,x_{i_{K}})$, where $x_{i_{K}}$ - $K$-th nearest neighbour.
      • better for unequal distribution of objects

Solution

$$ \begin{align*} Q(\alpha,X_{training}) & =\sum_{i=1}^{N}w_{i}(x)(\alpha-y_{i})^{2}\to\min_{\alpha\in\mathbb{R}}\\ w_{i}(x) & =K\left(\frac{\rho(x,x_{i})}{h(x)}\right) \end{align*} $$

  • From stationarity condition $\frac{\partial Q}{\partial\alpha}=0$ obtain optimal $\widehat{\alpha}(x)$: $$ f(x,\alpha)=\widehat{\alpha}(x)=\frac{\sum_{i=1}^{N}y_{i}w_{i}(x)}{\sum_{i=1}^{N}w_{i}(x)}=\frac{\sum_{i=1}^{N}y_{i}K\left(\frac{\rho(x,x_{i})}{h(x)}\right)}{\sum_{i=1}^{N}K\left(\frac{\rho(x,x_{i})}{h(x)}\right)} $$
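A minimal NumPy sketch of this estimator for 1-D inputs with a Gaussian kernel and constant bandwidth (function name and toy data are illustrative):

{python}
import numpy as np

def nadaraya_watson(x_query, x_train, y_train, h=1.0):
    dists = np.abs(x_query[:, None] - x_train[None, :])   # pairwise |x - x_i|
    K = np.exp(-0.5 * (dists / h) ** 2)                    # Gaussian kernel weights w_i(x)
    return (K * y_train).sum(axis=1) / K.sum(axis=1)       # weighted local average

# toy usage
x_tr = np.linspace(0, 6, 50)
y_tr = np.sin(x_tr) + 0.1 * np.random.randn(50)
y_hat = nadaraya_watson(np.linspace(0, 6, 200), x_tr, y_tr, h=0.3)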

Comments

  • Under general regularity conditions $\widehat{\alpha}(x)\overset{P}{\to}\mathbb{E}[y|x]$
  • The specific form of the kernel function does not affect the accuracy much.
    • but may affect efficiency
  • Compared to K-NN: may use all objects, bandwidth controls smoothness

Comments

Instead of optimizing locally with a constant $\alpha$ $$ Q(\alpha,X_{training})=\sum_{i=1}^{N}w_{i}(x)({\alpha}-y_{i})^{2}\to\min_{\alpha\in\mathbb{R}} $$

we could have optimized a local linear regression $$ Q(\beta,X_{training})=\sum_{i=1}^{N}w_{i}(x)({x_{i}^{T}\beta}-y_{i})^{2}\to\min_{\beta} $$

This handles approximation at the edges of the domain and near local extrema better.