Data Analysis

Andrey Shestakov (avshestakov@hse.ru)


Linear regression. Gradient-based optimization [1]

[1] Some materials are taken from the machine learning course of Victor Kitov

Let's recall previous lecture

  • Decision trees
    • Utilize the notion of impurity
    • Work both for classification and regression
  • Implicit feature selection
    • Good for features of different nature
  • Class-separating boundaries are parallel to the axes
  • Local greedy optimization
  • Sensitive to even tiny data perturbations

Linear regression

Example: flat prices

  • Obviously, these characteristics somehow relate to the price ($f: X \rightarrow Y$)
  • Formalize a model to predict the flat price: $$a(x) = a(total\_area, nmbr\_of\_bedrooms, house\_age) = \hat{y}$$
  • Let it be a linear model: $$a(x) = \beta_0 + \beta_1\cdot total\_area + \beta_2 \cdot nmbr\_of\_bedrooms + \beta_3 \cdot house\_age$$
  • Task: find the coefficients $\beta_0,\dots, \beta_3$ that minimize a loss function on the training set

Car price vs. mileage

In [4]:
df_auto.plot(x='mileage', y='price', kind='scatter')

Linear regression

  • Our goal is to determine a linear dependence between the features $X$ and the target vector $y$
  • Define $x_n^d$ as the $d$-th feature of the $n$-th object and $y_{n} \in \mathbb{R}$ as the target value for the $n$-th object: $$f(x_{n}, \beta) = \hat{y}_{n} = \beta_0 + \beta_1x_{n}^1 + \dots + \beta_Dx_{n}^D$$
  • $x_{n}^0 = 1$ $\forall n$ - intercept term
  • Need to estimate $\beta_i$.

Ordinary Least Squares: $$ L(\beta_0,\beta_1,\dots) = \frac{1}{2N}\sum^{N}_{n=1}(\hat{y}_{n} - y_{n})^2 = \frac{1}{2N}\sum^{N}_{n=1}\left(\sum_{d=0}^{D}\beta_{d}x_{n}^{d}-y_{n}\right)^{2} \rightarrow \min\limits_{\beta} $$

Solution

Let's say $f(x, \beta) = \beta_0 + \beta_1x^1$

$$L(\beta_0, \beta_1) = \frac{1}{2N}\sum^{N}_{n=1}(\beta_0 + \beta_1x_{n}^1 - y_{n})^2$$

What do we do next?

Calculate partial derivatives wrt $\beta_0$, $\beta_1$ and set them to $0$:

$$ \frac{\partial L}{\partial \beta_0} = \frac{1}{N}\sum^{N}_{n=1}(\beta_0 + \beta_1x_{n}^1 - y_{n}) = 0$$
$$ \frac{\partial L}{\partial \beta_1} = \frac{1}{N}\sum^{N}_{n=1}(\beta_0 + \beta_1x_{n}^1 - y_{n})x^1_{n} = 0$$
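Solving these two equations gives the familiar closed-form estimates $\beta_1 = \frac{\sum_n(x_{n}^1-\bar{x})(y_{n}-\bar{y})}{\sum_n(x_{n}^1-\bar{x})^2}$, $\beta_0 = \bar{y} - \beta_1\bar{x}$. A minimal NumPy sketch on synthetic data (the data and names are illustrative, not from the lecture notebook):

{python}
import numpy as np

# Simple linear regression via the two normal equations above (synthetic data)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(scale=1.0, size=100)

beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
print(beta0, beta1)  # should be close to 3 and 2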

Linear regression (in matrix form)

  • Define $X\in\mathbb{R}^{N\times D}$, where $\{X\}_{ij}$ is the $j$-th feature of the $i$-th object, and $y\in\mathbb{R}^{N}$ - the vector of target values $$f(x, \beta) = \hat{y} = X\beta \quad \Leftrightarrow \quad \left( \begin{array}{c} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_n \end{array} \right) = \left( \begin{array}{ccccc} 1 & x_1^1 & x_1^2 & \cdots & x_1^d\\ 1 & x_2^1 & x_2^2 & \cdots & x_2^d\\ \cdots & \cdots & \cdots & \cdots & \cdots\\ 1 & x_n^1 & x_n^2 & \cdots & x_n^d\end{array} \right) \cdot \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots\\ \beta_d \end{array} \right) $$
  • Need to estimate $\beta_i$.

Ordinary Least Squares: $$ L(\beta) = \frac{1}{2N}(\hat{y} - y)^{\top}(\hat{y} - y) = \frac{1}{2N}(X\beta - y)^{\top}(X\beta - y) \rightarrow \min\limits_{\beta} $$

Solution (in matrix form)

Expand a bit $$ \begin{align*} L(\beta) & = \frac{1}{2N}(X\beta - y)^{\top}(X\beta - y) \\ & = \frac{1}{2N}\left( \beta^\top X^\top X \beta - 2 (X\beta)^\top y + y^\top y \right) \end{align*} $$

Calculate the vector of partial derivatives - the gradient

Some reminders

$$\begin{array}{rcl} \frac{\partial}{\partial x} x^T a &=& a \\ \frac{\partial}{\partial x} x^T A x &=& \left(A + A^T\right)x \\ \frac{\partial}{\partial A} x^T A y &=& xy^T \end{array}$$
$$ \begin{align*} \nabla L(\beta) = \left(\frac{\partial L(\beta)}{\partial\beta_i} \right)_{i=0\dots d} = \frac{1}{2N}\left(2X^\top X \beta -2X^\top y\right) & = 0 \\ X^{\top}X\beta-X^{T}y & = 0 \\ \\ \beta & = (X^\top X)^{-1} X^\top y \quad\text{(Normal Equation)} \end{align*} $$

Comments

  • This is the global minimum, because the optimized criterion is convex.
  • Geometric interpretation:
    • find the linear combination of feature columns that best reproduces $y$
    • the solution is the combination of features giving the projection of $y$ onto the linear span of the feature columns.
  • Why could using the Normal Equation be bad? (see the sketch below)
    • Computing the inverse is expensive
    • Not every matrix has an inverse (singular matrices)
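A minimal NumPy sketch of the normal equation on synthetic data (the data and names are illustrative); in practice np.linalg.solve or np.linalg.lstsq is preferred over an explicit inverse:

{python}
import numpy as np

# OLS via the normal equation on synthetic data
rng = np.random.default_rng(0)
N, D = 200, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])  # prepend the intercept column
true_beta = np.array([1.0, 2.0, -3.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=N)

# Solve (X^T X) beta = X^T y; solve() is cheaper and more stable than an explicit inverse
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Even safer: least squares also handles rank-deficient X
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat, beta_lstsq)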

Linearly dependent features (multicollinearity)

  • Solution $\widehat{\beta}=(X^{T}X)^{-1}X^{T}y$ exists when $X^{T}X$ is non-singular
  • Problem occurs when one of the features is a linear combination of the other(s) (linear dependency)
    • interpretation: non-identifiability of $\widehat{\beta}$ for linearly dependent features:
      • linear dependence: $\exists\alpha\neq0:\,x^{T}\alpha=0\,\,\forall x$
      • suppose $\beta$ solves linear regression $y=x^{T}\beta$
      • then $x^{T}\beta\equiv x^{T}\beta+kx^{T}\alpha\equiv x^{T}(\beta+k\alpha)$, so $\beta+k\alpha$ is also a solution!
  • Multicollinearity can be exact or approximate (near-exact), which is also bad

Examples of linear dependency

  • $x$ miles $\approx 1.6\cdot x$ kms
  • total flat area $\approx$ area of living rooms $+$ area of kitchen
  • dummy variable trap!
In [5]:
df_auto.plot(x='mileage', y='price', kind='scatter')
In [6]:
df_auto.loc[:, ['mileage', 'price']].head()
Out[6]:
mileage price
0 67697 14995
1 73738 11988
2 80313 11999
3 86096 12995
4 79607 11333
In [8]:
X_train = df_auto.mileage.values.reshape(-1, 1)
y_train = df_auto.price.values

model = LinearRegression()
model.fit(X_train, y_train)

print('Model:\nprice = %.2f + (%.2f)*mileage' % (model.intercept_, model.coef_[0]))

y_hat = model.predict(X_train)

df_auto.plot(x='mileage', y='price', kind='scatter')
_ = plt.plot(X_train, y_hat, c='r')
Model:
price = 16762.02 + (-0.05)*mileage
In [9]:
df_auto = df_auto.assign(kilometrage = lambda r: r.mileage*1.6)
df_auto.loc[:, ['mileage', 'kilometrage', 'price']].head()
Out[9]:
mileage kilometrage price
0 67697 108315.2 14995
1 73738 117980.8 11988
2 80313 128500.8 11999
3 86096 137753.6 12995
4 79607 127371.2 11333
In [10]:
X_train = df_auto.loc[:, ['mileage', 'kilometrage']].values
y_train = df_auto.price.values

model = LinearRegression()
model.fit(X_train, y_train)

print('Model:\nprice = %.2f + (%.2f)*mileage + (%.2f)*kilometrage' % (model.intercept_, model.coef_[0], model.coef_[1],))

y_hat = model.predict(X_train)

df_auto.plot(x='mileage', y='price', kind='scatter')
_ = plt.plot(X_train[:, 0], y_hat, c='r')
Model:
price = 16628.24 + (4363806708466.83)*mileage + (-2727379192791.80)*kilometrage

Linearly dependent features

  • Problem may be solved by:
    • feature selection
    • dimensionality reduction
    • imposing additional requirements on the solution (regularization)

Analysis of linear regression

Advantages:

  • single optimum, which is global (for a non-singular $X^{T}X$)
  • analytical solution
  • interpretable solution and algorithm

Drawbacks:

  • model assumptions are too simple (they may not hold)
  • $X^{T}X$ should be non-degenerate (and well-conditioned)

Optimization methods for LR

Gradient descent

Intuition

$$ L(\beta_0, \beta_1) = \frac{1}{2N}\sum_{n=1}^N(\beta_0 + \beta_1x^1_{n} - y_{n})^2$$

  • Suppose we have some initial approximation of $(\hat{\beta_0}, \hat{\beta_1})$
  • How should we change it in order to improve?
In [12]:
sq_loss_demo()

Tiny Refresher

Derivative of $f(x)$ at $x_0$: $$ f'(x_0) = \lim\limits_{h \rightarrow 0}\frac{f(x_0+h) - f(x_0)}{h}$$

The derivative is the slope of the tangent line at $x_0$

  • If $x_0$ is an extreme point of $f(x)$ and $f'(x_0)$ exists $\Rightarrow$ $f'(x_0) = 0$
In [14]:
interact(deriv_demo, h=FloatSlider(min=0.0001, max=2, step=0.005), x0=FloatSlider(min=1, max=15, step=.2))

Tiny Refresher

  • In the multidimensional world we switch to gradients and directional derivatives: $$ f'_v(x_0) = \lim\limits_{h \rightarrow 0}\frac{f(x_0+hv) - f(x_0)}{h} = \frac{d}{dh}f(x_{0,1} + hv_1, \dots, x_{0,d} + hv_d) \Big\rvert_{h=0}, \quad ||v|| = 1 \quad \text{(directional derivative)}$$
$$ \frac{ \partial f(x_1,x_2,\dots,x_d)}{\partial x_i} = \lim\limits_{h \rightarrow 0}\frac{f(x_1, \dots, x_i + h, \dots, x_d) - f(x_1, \dots, x_i, \dots, x_d)}{h} \quad \text{(partial derivative)}$$
$$ \nabla f = \left(\frac{\partial f}{\partial x_i}\right)_{i=1\dots d} \quad \text{(gradient = the vector of partial derivatives)}$$

Tiny Refresher

  • Using the multivariate chain rule:
$$ f'_v(x_0) = \frac{d}{dh}f(x_{0,1} + hv_1, \dots, x_{0,d} + hv_d) \Big\rvert_{h=0} = \sum_{i=1}^d \frac{\partial f}{\partial x_i} \frac{d}{dh} (x_{0,i} + hv_i) = \langle \nabla f, v \rangle$$
$$ \langle \nabla f, v \rangle = || \nabla f || \cdot ||v|| \cdot \cos{\phi} = || \nabla f || \cdot \cos{\phi}$$

Tiny Refresher

$$ \langle \nabla f, v \rangle = || \nabla f || \cdot \cos{\phi}$$
  • Directional derivative is maximal when the direction is collinear to gradient
  • gradient — direction of steepest ascent of $f(x)$
  • antigradient — direction of steepest descent of $f(x)$

Given $L(\beta_0, \beta_1)$, calculate the gradient (partial derivatives): $$ \frac{\partial L}{\partial \beta_0} = \frac{1}{N}\sum^{N}_{n=1}(\beta_0 + \beta_1x_{n}^1 - y_{n})$$ $$ \frac{\partial L}{\partial \beta_1} = \frac{1}{N}\sum^{N}_{n=1}(\beta_0 + \beta_1x_{n}^1 - y_{n})x^1_{n}$$

Or in matrix form: $$ \nabla_{\beta}L(\beta) = \frac{1}{N} X^\top(X\beta - y)$$

Run the gradient update, which is a simultaneous(!!!) update of all components of $\beta$ in the antigradient direction:

$$ \beta := \beta - \alpha\nabla_{\beta}L(\beta)$$
  • $\alpha$ - learning rate (descent "speed")

Pseudocode

{python}
def gd(X, y, alpha, epsilon):
    beta = initialize(X)                             # initialize beta (e.g. zeros)
    while True:
        new_beta = beta - alpha * grad(X, y, beta)   # step in the antigradient direction
        if dist(new_beta, beta) < epsilon:           # stop when the update becomes tiny
            return new_beta
        beta = new_beta
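A hedged, runnable NumPy version of the pseudocode above for the OLS loss (synthetic data; function and variable names are illustrative):

{python}
import numpy as np

def ols_gradient(X, y, beta):
    """Gradient of L(beta) = 1/(2N) * ||X beta - y||^2."""
    return X.T @ (X @ beta - y) / len(y)

def gradient_descent(X, y, alpha=0.1, epsilon=1e-8, max_iter=10000):
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        new_beta = beta - alpha * ols_gradient(X, y, beta)
        if np.linalg.norm(new_beta - beta) < epsilon:
            break
        beta = new_beta
    return beta

# Synthetic check: the result should be close to the normal-equation solution
rng = np.random.default_rng(0)
X = np.hstack([np.ones((500, 1)), rng.normal(size=(500, 2))])
y = X @ np.array([1.0, -2.0, 3.0]) + rng.normal(scale=0.1, size=500)
print(gradient_descent(X, y))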
In [23]:
interact(grad_demo, iters=IntSlider(min=0, max=20, step=1), alpha=FloatSlider(min=0.01, max=3, step=0.05))

Comments

  • How do we set $\alpha$?
  • Feature scaling matters (see the sketch below)
  • Local minima*
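Since features on very different scales stretch the loss surface and slow gradient descent down, they are usually standardized first. A hedged sketch with scikit-learn's StandardScaler (X_train is assumed to be a raw feature matrix, e.g. from the cells above):

{python}
from sklearn.preprocessing import StandardScaler

# Bring all features to zero mean and unit variance before running gradient descent
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)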

Gradient descent modifications

  • Stochastic gradient descent (!)
  • Descent with momentum
  • Adagrad

Methods overview 1, Methods overview 2

Stochastic gradient descent

{python}
def sgd(X, y, alpha, epsilon):
    beta = initialize(X)                     # initialize beta (e.g. zeros)
    while True:
        X, y = shuffle(X, y)                 # reshuffle the objects each epoch
        old_beta = beta
        for x_n, y_n in zip(X, y):
            beta = beta - alpha * grad(x_n, y_n, beta)   # update on a single object
        if dist(beta, old_beta) < epsilon:
            return beta
In [17]:
stoch_grad_demo(iters=105, alpha=0.08)
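A hedged NumPy sketch of stochastic gradient descent for the OLS loss (synthetic data, illustrative names):

{python}
import numpy as np

def sgd_linear_regression(X, y, alpha=0.01, n_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for n in rng.permutation(len(y)):    # reshuffle the objects each epoch
            residual = X[n] @ beta - y[n]    # error on a single object
            beta -= alpha * residual * X[n]  # single-object gradient step
    return beta

# Same synthetic data as in the gradient descent sketch above
rng = np.random.default_rng(0)
X = np.hstack([np.ones((500, 1)), rng.normal(size=(500, 2))])
y = X @ np.array([1.0, -2.0, 3.0]) + rng.normal(scale=0.1, size=500)
print(sgd_linear_regression(X, y))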

Momentum

Idea: move not only in the current antigradient direction, but also take the previous step into account

$$ v_0 = 0$$
$$ v_t = \gamma v_{t - 1} + \alpha\nabla_{\beta}{L(\beta)}$$
$$ \beta = \beta - v_t$$

where

  • $\gamma$ — momentum term (usually 0.9)
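A hedged sketch of this update inside a gradient-descent loop for the OLS loss (illustrative names, not a library API):

{python}
import numpy as np

def momentum_descent(X, y, alpha=0.1, gamma=0.9, n_iter=500):
    beta = np.zeros(X.shape[1])
    v = np.zeros_like(beta)                   # v_0 = 0
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / len(y)  # OLS gradient
        v = gamma * v + alpha * grad          # accumulate the velocity
        beta = beta - v
    return beta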

Nonlinear dependencies

Generalization by nonlinear transformations

Nonlinearity in $x$ in linear regression may be achieved by applying non-linear transformations to the features:

$$ x\to[\phi_{0}(x),\,\phi_{1}(x),\,\phi_{2}(x),\,...\,\phi_{M}(x)] $$
$$ f(x)=\mathbf{\phi}(x)^{T}\beta=\sum_{m=0}^{M}\beta_{m}\phi_{m}(x) $$

The model remains linear in $\beta$, so all advantages of linear regression are preserved.

Typical transformations

  • $[x^{i}\in[a,b]]$ : binarization of a feature
  • $x^{i}x^{j}$ : interaction of features
  • $\exp\left\{ -\gamma\left\lVert x-\tilde{x}\right\rVert ^{2}\right\} $ : closeness to a reference point $\tilde{x}$
  • $\ln x^{k}$ : tames distributions with heavy tails
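A hedged scikit-learn sketch: expanding the features with PolynomialFeatures and fitting an ordinary linear regression (degree and data are illustrative); the model stays linear in $\beta$, only the feature map $\phi$ changes.

{python}
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic 1-D example with a quadratic dependence
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = 1.0 + 2.0 * x[:, 0] - 0.5 * x[:, 0] ** 2 + rng.normal(scale=0.2, size=200)

# phi(x) = [1, x, x^2]; the regression is still linear in the coefficients
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.predict(np.array([[0.0], [1.0]])))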
In [19]:
demo_weights()

Regularization & restrictions

Intuition

[Figure from Andrew Ng's Machine Learning class, Stanford]

Regularization

  • Regularization is a technique that reduces model complexity to prevent overfitting
  • Desired outcomes:
    • train loss goes up
    • test loss goes down
  • Have we already seen it?

Regularization

  • Insert additional requirement for regularizer $R(\beta)$ to be small: $$ \sum_{n=1}^{N}\left(x_{n}^{T}\beta-y_{n}\right)^{2}+\lambda R(\beta)\to\min_{\beta} $$
  • $\lambda>0$ - hyperparameter.
  • $R(\beta)$ penalizes complexity of models. $$ \begin{array}{ll} R(\beta)=||\beta||_{1} & \mbox{Lasso regression}\\ R(\beta)=||\beta||_{2}^{2} & \text{Ridge regression} \end{array} $$
  • Not only accuracy matters for the solution but also model simplicity!
  • $\lambda$ controls the complexity of the model: $\uparrow\lambda \Leftrightarrow \text{complexity}\downarrow$.

Comments

  • Dependence of $\beta$ on $\lambda$ for ridge (A) and LASSO (B):

  • LASSO can be used for automatic feature selection.

  • $\lambda$ is usually found using cross-validation on an exponential grid, e.g. $[10^{-6},10^{-5},\dots,10^{5},10^{6}]$ (see the sketch below).
  • It is always recommended to use regularization because
    • it gives smooth control over model complexity.
    • it reduces ambiguity when multiple solutions exist.
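A hedged scikit-learn sketch of choosing the regularization strength on an exponential grid with cross-validation (in sklearn the strength is called alpha; X_train, y_train are assumed to be defined as in the cells above):

{python}
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV

alphas = np.logspace(-6, 6, 13)   # exponential grid of regularization strengths

ridge = RidgeCV(alphas=alphas, cv=5).fit(X_train, y_train)
lasso = LassoCV(alphas=alphas, cv=5).fit(X_train, y_train)
print(ridge.alpha_, lasso.alpha_)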
In [21]:
df_auto.loc[:, ['mileage', 'kilometrage', 'price']].head()
Out[21]:
mileage kilometrage price
0 67697 108315.2 14995
1 73738 117980.8 11988
2 80313 128500.8 11999
3 86096 137753.6 12995
4 79607 127371.2 11333
In [22]:
X_train = df_auto.loc[:, ['mileage', 'kilometrage']].values
y_train = df_auto.price.values

model = Lasso()
# model = Ridge()
model.fit(X_train, y_train)

print('Model:\nprice = %.2f + (%.2f)*mileage + (%.2f)*kilometrage' % (model.intercept_, model.coef_[0], model.coef_[1],))

y_hat = model.predict(X_train)

df_auto.plot(x='mileage', y='price', kind='scatter')
_ = plt.plot(X_train[:, 0], y_hat, c='r')
Model:
price = 16762.02 + (-0.05)*mileage + (-0.00)*kilometrage

ElasticNet

  • ElasticNet:
$$ R(\beta)=\alpha||\beta||_{1}+(1-\alpha)||\beta||_{2}^{2} $$

$\alpha\in(0,1)$ - hyperparameter controlling the impact of each part.

  • If two features $x^{i}$ and $x^{j}$ are equal:
    • LASSO may take only one of them
    • Ridge will take both with equal weights
      • but it doesn't remove useless features
  • ElasticNet both removes useless features and gives equal weights to useful identical features (see the sketch below)
    • good, because feature equality may be due to chance on this particular training set
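A hedged scikit-learn sketch; note that sklearn's l1_ratio plays the role of $\alpha$ above, while its alpha is the overall strength $\lambda$ (X_train, y_train as in the cells above):

{python}
from sklearn.linear_model import ElasticNet

# l1_ratio ~ alpha in the slide's notation, alpha ~ the overall strength lambda
model = ElasticNet(alpha=1.0, l1_ratio=0.5)
model.fit(X_train, y_train)
print(model.intercept_, model.coef_)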

Ridge regression analytic solution

Ridge regression criterion $$ \sum_{n=1}^{N}\left(x_{n}^{T}\beta-y_{n}\right)^{2}+\lambda\beta^{T}\beta\to\min_{\beta} $$

Stationarity condition can be written as:

$$ \begin{gathered}2\sum_{n=1}^{N}x_{n}\left(x_{n}^{T}\beta-y_{n}\right)+2\lambda\beta=0\\ 2X^{T}(X\beta-y)+2\lambda\beta=0\\ \left(X^{T}X+\lambda I\right)\beta=X^{T}y \end{gathered} $$

so

$$ \widehat{\beta}=(X^{T}X+\lambda I)^{-1}X^{T}y $$
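A minimal NumPy sketch of this closed form (illustrative; a more careful version would leave the intercept column unpenalized):

{python}
import numpy as np

def ridge_closed_form(X, y, lam):
    """beta = (X^T X + lam * I)^{-1} X^T y, computed via a linear solve."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)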

Comments

  • $X^{T}X+\lambda I$ is always non-degenerate as a sum of:
    • non-negative definite $X^{T}X$
    • positive definite $\lambda I$
  • Intuition:
    • out of all valid solutions, select the one giving the simplest model
  • Other regularizations also restrict the set of solutions.

Different treatment for different features

  • The traditional approach regularizes all features uniformly: $$ \sum_{n=1}^{N}\left(x_{n}^{T}\beta-y_{n}\right)^{2}+\lambda R(\beta)\to\min_{\beta} $$
  • Suppose we have $K$ groups of features with indices: $$ I_{1},I_{2},...I_{K} $$
  • We may control the impact of each group on the model by: $$ \sum_{n=1}^{N}\left(x_{n}^{T}\beta-y_{n}\right)^{2}+\lambda_{1}R(\{\beta_{i}|i\in I_{1}\})+...+\lambda_{K}R(\{\beta_{i}|i\in I_{K}\})\to\min_{\beta} $$
  • $\lambda_{1},\lambda_{2},...\lambda_{K}$ can be set using cross-validation
  • In practice a common regularizer is used, but with different feature scaling (see the sketch below).
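A hedged NumPy sketch of that scaling trick for the ridge penalty: rescaling the columns of group $k$ by $1/\sqrt{\lambda_k}$ and fitting a plain ridge with unit strength is equivalent to penalizing that group with $\lambda_k$ (intercept handling is omitted for brevity):

{python}
import numpy as np

def group_ridge(X, y, lambdas):
    """Minimize ||X beta - y||^2 + sum_i lambdas[i] * beta_i^2 via column rescaling.

    lambdas: array of shape (D,) holding the penalty for each column of X
    (equal values within each feature group I_k).
    """
    s = 1.0 / np.sqrt(lambdas)                                      # per-column scale
    Z = X * s                                                       # rescaled features
    gamma = np.linalg.solve(Z.T @ Z + np.eye(X.shape[1]), Z.T @ y)  # plain ridge, unit strength
    return s * gamma                                                # back to the original parametrization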

Different loss-functions

Idea

  • Generalize the squared loss to an arbitrary loss: $$ \sum_{n=1}^{N}\left(x_{n}^{T}\beta-y_{n}\right)^{2}\to\min_{\beta}\qquad\Longrightarrow\qquad\sum_{n=1}^{N}\mathcal{L}(x_{n}^{T}\beta-y_{n})\to\min_{\beta} $$
$$ \begin{array}{lll} \textbf{LOSS} & \textbf{NAME} & \textbf{PROPERTIES}\\ \mathcal{L}(\varepsilon)=\varepsilon^{2} & \text{quadratic} & \text{differentiable}\\ \mathcal{L}(\varepsilon)=\left|\varepsilon\right| & \text{absolute} & \text{robust}\\ \mathcal{L}(\varepsilon)=\begin{cases} \frac{1}{2}\varepsilon^{2}, & \left|\varepsilon\right|\le\delta\\ \delta\left(\left|\varepsilon\right|-\frac{1}{2}\delta\right), & \left|\varepsilon\right|>\delta \end{cases} & \text{Huber} & \text{differentiable, robust} \end{array} $$
  • Robust means the solution is robust to outliers in the training set (see the sketch below).
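A hedged scikit-learn sketch comparing ordinary least squares with the Huber loss on data containing outliers (synthetic data; sklearn's epsilon plays the role of $\delta$):

{python}
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

# Synthetic 1-D data with a few gross outliers
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=(200, 1))
y = 1.0 + 2.0 * x[:, 0] + rng.normal(scale=0.5, size=200)
y[:10] += 50                                     # corrupt a few targets

ols = LinearRegression().fit(x, y)
huber = HuberRegressor(epsilon=1.35).fit(x, y)   # epsilon plays the role of delta
print(ols.coef_, huber.coef_)                    # the Huber fit is much less affected by the outliers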

References

  • Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, Jerome Friedman) - Chapter 3