1. Some materials are taken from the machine learning course of Victor Kitov
df_auto.plot(x='mileage', y='price', kind='scatter')
Ordinary Least Squares: $$ L(\beta_0,\beta_1,\dots) = \frac{1}{2N}\sum^{N}_{n=1}(\hat{y}_{n} - y_{n})^2 = \frac{1}{2N}\sum^{N}_{n=1}\left(\sum_{d=0}^{D}\beta_{d}x_{n}^{d}-y_{n}\right)^{2} \rightarrow \min\limits_{\beta} $$
Let's say $f(x, \beta) = \beta_0 + \beta_1x_1$
$$L(\beta_0, \beta_1) = \dots$$

What do we do next?
Calculate partial derivatives wrt $\beta_0$, $\beta_1$ and set them to $0$:
$$ \frac{\partial L}{\partial \beta_0} = \frac{1}{N}\sum^{N}_{n=1}(\beta_0 + \beta_1x_{n}^1 - y_{n}) = 0$$

$$ \frac{\partial L}{\partial \beta_1} = \frac{1}{N}\sum^{N}_{n=1}(\beta_0 + \beta_1x_{n}^1 - y_{n})x^1_{n} = 0$$

Ordinary Least Squares in matrix form: $$ L(\beta) = \frac{1}{2N}(\hat{y} - y)^{\top}(\hat{y} - y) = \frac{1}{2N}(X\beta - y)^{\top}(X\beta - y) \rightarrow \min\limits_{\beta} $$
Expand a bit $$ \begin{align*} L(\beta) & = \frac{1}{2N}(X\beta - y)^{\top}(X\beta - y) \\ & = \frac{1}{2N}\left( \beta^\top X^\top X \beta - 2 (X\beta)^\top y + y^\top y \right) \end{align*} $$
Calculate the vector of partial derivatives, i.e. the gradient.
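Setting the gradient to zero gives the normal equations $X^{\top}X\beta = X^{\top}y$, whose solution is $\hat{\beta} = (X^{\top}X)^{-1}X^{\top}y$. A minimal NumPy sketch of this closed form on synthetic data (all names and numbers here are illustrative, not from the lecture dataset):

```python
import numpy as np

# Synthetic data: y = 2 + 3*x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2 + 3 * x + rng.normal(scale=0.5, size=200)

# Design matrix with a column of ones for the intercept beta_0
X = np.column_stack([np.ones_like(x), x])

# Solve the normal equations X^T X beta = X^T y
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # approximately [2, 3]
```

Using `np.linalg.solve` is preferred over explicitly inverting $X^{\top}X$ for numerical stability.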
df_auto.plot(x='mileage', y='price', kind='scatter')
df_auto.loc[:, ['mileage', 'price']].head()
| | mileage | price |
|---|---|---|
| 0 | 67697 | 14995 |
| 1 | 73738 | 11988 |
| 2 | 80313 | 11999 |
| 3 | 86096 | 12995 |
| 4 | 79607 | 11333 |
X_train = df_auto.mileage.values.reshape(-1, 1)
y_train = df_auto.price.values
model = LinearRegression()
model.fit(X_train, y_train)
print('Model:\nprice = %.2f + (%.2f)*mileage' % (model.intercept_, model.coef_[0]))
y_hat = model.predict(X_train)
df_auto.plot(x='mileage', y='price', kind='scatter')
_ = plt.plot(X_train, y_hat, c='r')
Model: price = 16762.02 + (-0.05)*mileage
df_auto = df_auto.assign(kilometrage = lambda r: r.mileage*1.6)
df_auto.loc[:, ['mileage', 'kilometrage', 'price']].head()
| | mileage | kilometrage | price |
|---|---|---|---|
| 0 | 67697 | 108315.2 | 14995 |
| 1 | 73738 | 117980.8 | 11988 |
| 2 | 80313 | 128500.8 | 11999 |
| 3 | 86096 | 137753.6 | 12995 |
| 4 | 79607 | 127371.2 | 11333 |
X_train = df_auto.loc[:, ['mileage', 'kilometrage']].values
y_train = df_auto.price.values
model = LinearRegression()
model.fit(X_train, y_train)
print('Model:\nprice = %.2f + (%.2f)*mileage + (%.2f)*kilometrage' % (model.intercept_, model.coef_[0], model.coef_[1],))
y_hat = model.predict(X_train)
df_auto.plot(x='mileage', y='price', kind='scatter')
_ = plt.plot(X_train[:, 0], y_hat, c='r')
Model: price = 16628.24 + (4363806708466.83)*mileage + (-2727379192791.80)*kilometrage
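The wild coefficients above are a symptom of perfect multicollinearity: `kilometrage` is an exact linear function of `mileage`, so $X^{\top}X$ is (numerically) singular and the least-squares solution is unstable. One quick diagnostic is the condition number of the design matrix; a hedged sketch on synthetic data (since `df_auto` is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(42)
mileage = rng.uniform(50_000, 100_000, size=100)
kilometrage = mileage * 1.6            # exact linear duplicate of mileage

X_ok = np.column_stack([np.ones(100), mileage])
X_bad = np.column_stack([np.ones(100), mileage, kilometrage])

print(np.linalg.cond(X_ok))   # large but finite
print(np.linalg.cond(X_bad))  # astronomically large: the matrix is rank-deficient
```

A very large condition number warns that coefficient estimates will be dominated by numerical noise, exactly as in the fitted model above.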
Advantages:
Drawbacks:
sq_loss_demo()
Derivative of $f(x)$ at $x_0$: $$ f'(x_0) = \lim\limits_{h \rightarrow 0}\frac{f(x_0+h) - f(x_0)}{h}$$
The derivative gives the slope of the tangent line at $x_0$.
interact(deriv_demo, h=FloatSlider(min=0.0001, max=2, step=0.005), x0=FloatSlider(min=1, max=15, step=.2))
Given $L(\beta_0, \beta_1)$, calculate the gradient (partial derivatives): $$ \frac{\partial L}{\partial \beta_0} = \frac{1}{N}\sum^{N}_{n=1}(\beta_0 + \beta_1x_{n}^1 - y_{n})$$ $$ \frac{\partial L}{\partial \beta_1} = \frac{1}{N}\sum^{N}_{n=1}(\beta_0 + \beta_1x_{n}^1 - y_{n})x^1_{n}$$
Or in matrix form: $$ \nabla_{\beta}L(\beta) = \frac{1}{N} X^\top(X\beta - y)$$
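The matrix form of the gradient can be sanity-checked against the limit definition of the derivative using finite differences; a small sketch on synthetic data (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=50)
beta = np.array([0.5, -0.5])
N = len(y)

def loss(b):
    r = X @ b - y
    return r @ r / (2 * N)

# Analytic gradient: (1/N) X^T (X beta - y)
grad_analytic = X.T @ (X @ beta - y) / N

# Central finite differences, one coordinate at a time
h = 1e-6
grad_numeric = np.array([
    (loss(beta + h * e) - loss(beta - h * e)) / (2 * h)
    for e in np.eye(2)
])
print(np.max(np.abs(grad_analytic - grad_numeric)))  # tiny, near machine precision
```

This kind of gradient check is a standard way to catch sign and scaling mistakes before running gradient descent.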
Run the gradient update, which is a simultaneous(!) update of all components of $\beta$ in the antigradient direction:
$$ \beta := \beta - \alpha\nabla_{\beta}L(\beta)$$

```{python}
function gd(X, y, alpha, epsilon):
    initialise beta
    do:
        old_beta = beta
        beta = old_beta - alpha * grad(X, y, old_beta)
    until dist(beta, old_beta) < epsilon
    return beta
```
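Batch gradient descent can be sketched as runnable NumPy (illustrative names; synthetic data with unit-scale features so that a plain constant learning rate converges):

```python
import numpy as np

def grad(X, y, beta):
    # Gradient of the MSE loss: (1/N) X^T (X beta - y)
    return X.T @ (X @ beta - y) / len(y)

def gd(X, y, alpha=0.1, epsilon=1e-8, max_iters=10_000):
    beta = np.zeros(X.shape[1])
    for _ in range(max_iters):
        new_beta = beta - alpha * grad(X, y, beta)
        if np.linalg.norm(new_beta - beta) < epsilon:
            break
        beta = new_beta
    return beta

rng = np.random.default_rng(0)
x = rng.normal(size=300)                      # zero-mean, unit-scale feature
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=300)
print(gd(X, y))  # close to [1, 2]
```

With badly scaled features (like raw `mileage` above) the same loop would diverge for this learning rate, which is one reason feature scaling matters for gradient methods.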
interact(grad_demo, iters=IntSlider(min=0, max=20, step=1), alpha=FloatSlider(min=0.01, max=3, step=0.05))
```{python}
function sgd(X, y, alpha, epsilon):
    initialise beta
    do:
        old_beta = beta
        shuffle the rows of (X, y)
        for each observation (x_n, y_n):
            beta = beta - alpha * grad(x_n, y_n, beta)
    until dist(beta, old_beta) < epsilon
    return beta
```
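A runnable NumPy sketch of stochastic gradient descent on synthetic data (illustrative names; one observation per update, fixed learning rate):

```python
import numpy as np

def sgd(X, y, alpha=0.01, epsilon=1e-6, max_epochs=200):
    rng = np.random.default_rng(0)
    beta = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        old_beta = beta.copy()
        idx = rng.permutation(len(y))        # shuffle before each epoch
        for i in idx:
            # Gradient on a single observation (x_i, y_i)
            beta -= alpha * (X[i] @ beta - y[i]) * X[i]
        if np.linalg.norm(beta - old_beta) < epsilon:
            break
    return beta

rng = np.random.default_rng(1)
x = rng.normal(size=500)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=500)
print(sgd(X, y))  # close to [1, 2]
```

With a constant step size the iterates keep oscillating in a small neighbourhood of the optimum; decaying the learning rate over epochs is the usual remedy.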
stoch_grad_demo(iters=105, alpha=0.08)
Idea: move not only in the current antigradient direction, but also take the previous step into account.
$$ v_0 = 0$$

$$ v_t = \gamma v_{t - 1} + \alpha\nabla_{\beta}{L(\beta)}$$

$$ \beta := \beta - v_t$$

where $\gamma \in [0, 1)$ is the momentum coefficient (how much of the previous step is kept) and $\alpha$ is the learning rate.
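The momentum update can be sketched inside the gradient-descent loop like this (illustrative names; $\gamma = 0.9$ is a common default, and the data is synthetic):

```python
import numpy as np

def gd_momentum(X, y, alpha=0.1, gamma=0.9, n_iters=200):
    beta = np.zeros(X.shape[1])
    v = np.zeros_like(beta)                  # v_0 = 0
    for _ in range(n_iters):
        g = X.T @ (X @ beta - y) / len(y)    # gradient of the MSE loss
        v = gamma * v + alpha * g            # accumulate velocity
        beta = beta - v                      # step along the velocity
    return beta

rng = np.random.default_rng(2)
x = rng.normal(size=300)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=300)
print(gd_momentum(X, y))  # close to [1, 2]
```

The velocity term damps oscillations across narrow valleys of the loss surface and speeds up movement along consistent gradient directions.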
Nonlinearity in $x$ in linear regression may be achieved by applying non-linear transformations to the features:

$$ x\to[\phi_{0}(x),\,\phi_{1}(x),\,\phi_{2}(x),\,...\,\phi_{M}(x)] $$

$$ f(x)=\mathbf{\phi}(x)^{T}\beta=\sum_{m=0}^{M}\beta_{m}\phi_{m}(x) $$

The model remains linear in $\beta$, so all advantages of linear regression are preserved.
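For example, polynomial features $\phi_m(x) = x^m$ can be generated with scikit-learn's `PolynomialFeatures` and fed into the same `LinearRegression`; a sketch on synthetic quadratic data (names and numbers are illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(200, 1))
y = 1 - 2 * x[:, 0] + 0.5 * x[:, 0] ** 2 + rng.normal(scale=0.2, size=200)

# Still a linear model in beta, but nonlinear in x
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.predict(np.array([[2.0]])))  # approx 1 - 4 + 2 = -1
```

Raising the degree increases flexibility but also the risk of overfitting, which motivates the regularization discussed below.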
demo_weights()
Dependence of $\beta$ on $\lambda$ for ridge (A) and LASSO (B):
LASSO can be used for automatic feature selection.
df_auto.loc[:, ['mileage', 'kilometrage', 'price']].head()
| | mileage | kilometrage | price |
|---|---|---|---|
| 0 | 67697 | 108315.2 | 14995 |
| 1 | 73738 | 117980.8 | 11988 |
| 2 | 80313 | 128500.8 | 11999 |
| 3 | 86096 | 137753.6 | 12995 |
| 4 | 79607 | 127371.2 | 11333 |
X_train = df_auto.loc[:, ['mileage', 'kilometrage']].values
y_train = df_auto.price.values
model = Lasso()
# model = Ridge()
model.fit(X_train, y_train)
print('Model:\nprice = %.2f + (%.2f)*mileage + (%.2f)*kilometrage' % (model.intercept_, model.coef_[0], model.coef_[1],))
y_hat = model.predict(X_train)
df_auto.plot(x='mileage', y='price', kind='scatter')
_ = plt.plot(X_train[:, 0], y_hat, c='r')
Model: price = 16762.02 + (-0.05)*mileage + (-0.00)*kilometrage
$\alpha\in(0,1)$ is a hyperparameter controlling the relative impact of the two penalty terms.
Ridge regression criterion $$ \sum_{n=1}^{N}\left(x_{n}^{T}\beta-y_{n}\right)^{2}+\lambda\beta^{T}\beta\to\min_{\beta} $$
Stationarity condition can be written as:
$$ \begin{gathered}2\sum_{n=1}^{N}x_{n}\left(x_{n}^{T}\beta-y_{n}\right)+2\lambda\beta=0\\ 2X^{T}(X\beta-y)+2\lambda\beta=0\\ \left(X^{T}X+\lambda I\right)\beta=X^{T}y \end{gathered} $$so
$$ \widehat{\beta}=(X^{T}X+\lambda I)^{-1}X^{T}y $$
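A quick NumPy check of this closed form on synthetic data (illustrative names; with $\lambda = 1$ and $N = 100$ the shrinkage toward zero is mild):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=100)

lam = 1.0
# Ridge estimate: solve (X^T X + lambda I) beta = X^T y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(beta_ridge)  # slightly shrunk toward zero relative to [1, -2, 0.5]
```

Note that $X^{\top}X + \lambda I$ is always invertible for $\lambda > 0$, so ridge regression also resolves the multicollinearity failure shown earlier with the `mileage`/`kilometrage` pair.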