# Data Analysis

## Linear regression. Gradient-based optimization¹

¹ Some materials are taken from the machine learning course of Victor Kitov.

# Let's recall previous lecture¶

• Decision trees
• Utilize the notion of impurity
• Work both for classification and regression
• Implicit feature selection
• Good for features of different nature
• Class-separating boundaries are parallel to the axes
• Local greedy optimization
• Sensitive to even tiny data perturbations

# Linear regression¶

## Example: flat prices¶

• Obviously, those characteristics somehow relate to the price ($f: X \rightarrow Y$)
• Formalize a model to predict flat price: $$a(x) = a(total\_area, nmbr\_of\_bedrooms, house\_age) = \hat{y}$$
• Let it be a linear model: $$a(x) = \beta_0 + \beta_1\cdot total\_area + \beta_2 \cdot nmbr\_of\_bedrooms + \beta_3 \cdot house\_age$$
• Learning: find coefficients $\beta_0,\dots,\beta_3$ that minimize the error on the training set

## Cars price vs mileage¶

In [17]:
df_auto.plot(x='mileage', y='price', kind='scatter')

Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x109dafc88>

## Linear regression¶

• Our goal is to determine a linear dependence between the features $X$ and the target vector $y$
• Define $x_n^d$ as the $d$-th feature of the $n$-th object and $y_{n} \in \mathbb{R}$ as the target value for the $n$-th object $$f(x_{n}, \beta) = \hat{y}_{n} = \beta_0 + \beta_1 x_{n}^1 + \dots + \beta_D x_{n}^D$$
• $x_{n}^0 = 1$ $\forall n$ - intercept
• Need to estimate $\beta_i$.

Ordinary Least Squares: $$L(\beta_0,\beta_1,\dots) = \frac{1}{2N}\sum^{N}_{n=1}(\hat{y}_{n} - y_{n})^2 = \frac{1}{2N}\sum^{N}_{n=1}\left(\sum_{d=0}^{D}\beta_{d}x_{n}^{d}-y_{n}\right)^{2} \rightarrow \min\limits_{\beta}$$
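As a quick illustration (a minimal NumPy sketch, not part of the lecture code; it assumes `X` already contains the constant column $x_{n}^{0}=1$), this loss can be computed directly:

{python}
import numpy as np

def ols_loss(beta, X, y):
    residuals = X @ beta - y                      # \hat{y}_n - y_n for every object
    return (residuals ** 2).sum() / (2 * len(y))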

## Solution¶

Let's say $f(x, \beta) = \beta_0 + \beta_1x^1$

Calculate partial derivatives wrt $\beta_0$, $\beta_1$ and set them to $0$:

$$\frac{\partial L}{\partial \beta_0} = \frac{1}{N}\sum^{N}_{n=1}(\beta_0 + \beta_1x_{n}^1 - y_{n}) = 0$$$$\frac{\partial L}{\partial \beta_1} = \frac{1}{N}\sum^{N}_{n=1}(\beta_0 + \beta_1x_{n}^1 - y_{n})x^1_{n} = 0$$
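Solving these two equations (substitute the first into the second) gives the familiar closed form for simple linear regression: $$\beta_1 = \frac{\sum_{n=1}^{N}(x_{n}^1 - \bar{x})(y_{n} - \bar{y})}{\sum_{n=1}^{N}(x_{n}^1 - \bar{x})^2}, \qquad \beta_0 = \bar{y} - \beta_1\bar{x},$$ where $\bar{x}$ and $\bar{y}$ are the sample means of the feature and the target.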

## Linear regression (in matrix form)¶

• Define $X\in\mathbb{R}^{N\times(D+1)}$, where $\{X\}_{nd}$ is the $d$-th feature of the $n$-th object, and $y\in\mathbb{R}^{N}$ is the vector of target values $$f(x, \beta) = \hat{y} = X\beta \quad \Leftrightarrow \quad \left( \begin{array}{c} \hat{y}_1 \\ \hat{y}_2 \\ \vdots \\ \hat{y}_N \end{array} \right) = \left( \begin{array}{ccccc} 1 & x_1^1 & x_1^2 & \cdots & x_1^D\\ 1 & x_2^1 & x_2^2 & \cdots & x_2^D\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & x_N^1 & x_N^2 & \cdots & x_N^D\end{array} \right) \cdot \left( \begin{array}{c} \beta_0 \\ \beta_1 \\ \vdots\\ \beta_D \end{array} \right)$$
• Need to estimate $\beta_i$.

Ordinary Least Squares: $$L(\beta) = \frac{1}{2N}(\hat{y} - y)^{\top}(\hat{y} - y) = \frac{1}{2N}(X\beta - y)^{\top}(X\beta - y) \rightarrow \min\limits_{\beta}$$

## Solution (in matrix form)¶

Expand a bit \begin{align*} L(\beta) & = \frac{1}{2N}(X\beta - y)^{\top}(X\beta - y) \\ & = \frac{1}{2N}\left( \beta^\top X^\top X \beta - 2 (X\beta)^\top y + y^\top y \right) \end{align*}

Calculate vector of partial derivatives - gradient

\begin{align*} \nabla L(\beta) = \left(\frac{\partial L(\beta)}{\partial\beta_i} \right)_{i=0\dots D} & = \frac{1}{2N}\left(2X^\top X \beta -2X^\top y\right) = 0 \\ X^{\top}X\beta-X^{\top}y & = 0 \\ \\ \beta & = (X^\top X)^{-1} X^\top y \quad\text{(Normal Equation)} \end{align*}
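A minimal NumPy sketch of the Normal Equation (illustrative only; the helper name `fit_ols` and the synthetic data are assumptions, not part of the lecture):

{python}
import numpy as np

def fit_ols(X, y):
    # solve (X^T X) beta = X^T y instead of forming the inverse explicitly
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.RandomState(0)
X = np.hstack([np.ones((100, 1)), rng.rand(100, 2)])  # intercept column x^0 = 1 plus two features
true_beta = np.array([1.0, 2.0, -3.0])
y = X @ true_beta + 0.1 * rng.randn(100)
beta_hat = fit_ols(X, y)                              # close to true_beta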

## Comments¶

• This is a global minimum, because the optimized criterion is convex.
• Geometric interpretation:
• find the linear combination of feature columns that best reproduces $y$
• the solution is the combination of features that gives the projection of $y$ onto the linear span of the feature columns.
• Why could using the Normal Equation be bad?
• Computing the inverse is expensive (roughly cubic in the number of features)
• Not every matrix is invertible (singular matrices)

## Linearly dependent features (multicollinearity)¶

• Solution $\widehat{\beta}=(X^{T}X)^{-1}X^{T}y$ exists when $X^{T}X$ is non-singular
• The problem occurs when one of the features is a linear combination of the others (linear dependency)
• interpretation: non-identifiability of $\widehat{\beta}$ for linearly dependent features:
• linear dependence: $\exists\alpha:\,x^{T}\alpha=0\,\forall x$
• suppose $\beta$ solves linear regression $y=x^{T}\beta$
• then $x^{T}\beta\equiv x^{T}\beta+kx^{T}\alpha\equiv x^{T}(\beta+k\alpha)$, so $\beta+k\alpha$ is also a solution!
• Multicollinearity can be exact or approximate (near-exact), and the latter is also harmful (a small numerical check follows this list)
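A small numerical check of this (an illustrative sketch with synthetic data, not from the lecture):

{python}
import numpy as np

rng = np.random.RandomState(0)
miles = rng.rand(50, 1) * 100
X = np.hstack([np.ones((50, 1)), miles, 1.609 * miles])  # the "km" column is an exact multiple of "miles"
print(np.linalg.matrix_rank(X.T @ X))                    # 2 instead of 3: X^T X is singular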

## Examples of linear dependency¶

• $x$ miles $\approx 1.6\cdot x$ kms
• total flat area $\approx$ area of living rooms $+$ area of kitchen
• dummy variable trap!

## Linearly dependent features¶

• Problem may be solved by:
• feature selection
• dimensionality reduction
• imposing additional requirements on the solution (regularization; see the note below)
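For example, the regularization route leads to ridge regression (a standard result, stated here for reference): adding a penalty proportional to $\lVert\beta\rVert^{2}$ to the loss gives a solution of the form $$\widehat{\beta}=(X^{\top}X+\lambda I)^{-1}X^{\top}y,$$ and $X^{\top}X+\lambda I$ is non-singular for any $\lambda>0$.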

## Analysis of linear regression¶

Advantages:

• single optimum, which is global (when $X^{\top}X$ is non-singular)
• analytical solution
• interpretable solution and algorithm

Drawbacks:

• model assumptions are too simplistic (may not be satisfied)
• $X^{T}X$ should be non-degenerate (and well-conditioned)

# Optimization methods for LR¶

## Intuition¶

$$L(\beta_0, \beta_1) = \frac{1}{2N}\sum_{n=1}^N(\beta_0 + \beta_1x^1_{n} - y_{n})^2$$

• Suppose we have some initial approximation of $(\hat{\beta_0}, \hat{\beta_1})$
• How should we change it in order to improve?
In [19]:
sq_loss_demo()


## Tiny Refresher¶

Derivative of $f(x)$ at $x_0$: $$f'(x_0) = \lim\limits_{h \rightarrow 0}\frac{f(x_0+h) - f(x_0)}{h}$$

The derivative is the slope of the tangent line at $x_0$

• If $x_0$ is an extremum of $f(x)$ and $f'(x_0)$ exists $\Rightarrow$ $f'(x_0) = 0$
In [21]:
interact(deriv_demo, h=FloatSlider(min=0.0001, max=2, step=0.005), x0=FloatSlider(min=1, max=15, step=.2))

Out[21]:
<function __main__.deriv_demo>

## Tiny Refresher¶

• In the multidimensional case we switch to gradients and directional derivatives (a small numerical check follows below): $$f'_v(x_0) = \lim\limits_{h \rightarrow 0}\frac{f(x_0+hv) - f(x_0)}{h} = \frac{d}{dh}f(x_{0,1} + hv_1, \dots, x_{0,d} + hv_d) \rvert_{h=0}, \quad ||v|| = 1 \quad \text{directional derivative}$$
$$\frac{ \partial f(x_1,x_2,\dots,x_d)}{\partial x_i} = \lim\limits_{h \rightarrow 0}\frac{f(x_1, x_2, \dots, x_i + h, \dots, x_d) - f(x_1, x_2, \dots, x_i, \dots, x_d)}{h} \quad \text{partial derivative}$$$$\nabla f = \left(\frac{\partial f}{\partial x_i}\right),\quad i=1\dots d \quad \text{Gradient = a vector of partial derivatives}$$
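A tiny numerical check of the partial-derivative definition (an illustrative sketch with a made-up function):

{python}
import numpy as np

f = lambda x: x[0] ** 2 + 3 * x[1]       # f(x_1, x_2) = x_1^2 + 3*x_2
x0 = np.array([2.0, 1.0])
h, e1 = 1e-6, np.array([1.0, 0.0])       # vary only the first coordinate
print((f(x0 + h * e1) - f(x0)) / h)      # ~ 4.0 = df/dx_1 at x0, since df/dx_1 = 2*x_1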

## Tiny Refresher¶

• Using the multivariate chain rule:
$$f'_v(x_0) = \frac{d}{dh}f(x_{0,1} + hv_1, \dots, x_{0,d} + hv_d) \rvert_{h=0} = \sum_{i=1}^d \frac{\partial f}{\partial x_i} \frac{d}{dh} (x_{0,i} + hv_i) = \sum_{i=1}^d \frac{\partial f}{\partial x_i} v_i = \langle \nabla f, v \rangle$$$$\langle \nabla f, v \rangle = || \nabla f || \cdot ||v|| \cdot \cos{\phi} = || \nabla f || \cdot \cos{\phi}$$

## Tiny Refresher¶

$$\langle \nabla f, v \rangle = || \nabla f || \cdot \cos{\phi}$$
• The directional derivative is maximal when the direction $v$ points along the gradient
• gradient — direction of steepest ascent of $f(x)$
• antigradient — direction of steepest descent of $f(x)$

Given $L(\beta_0, \beta_1)$, calculate the gradient (partial derivatives): $$\frac{\partial L}{\partial \beta_0} = \frac{1}{N}\sum^{N}_{n=1}(\beta_0 + \beta_1x_{n}^1 - y_{n})$$ $$\frac{\partial L}{\partial \beta_1} = \frac{1}{N}\sum^{N}_{n=1}(\beta_0 + \beta_1x_{n}^1 - y_{n})x^1_{n}$$

Or in matrix form: $$\nabla_{\beta}L(\beta) = \frac{1}{N} X^\top(X\beta - y)$$

Run the gradient update, which is a simultaneous(!!!) update of all components of $\beta$ in the antigradient direction:

$$\beta := \beta - \alpha\nabla_{\beta}L(\beta)$$
• $\alpha$ - the learning rate (descent "speed"); see the sketch below
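A minimal NumPy sketch of this gradient and of a single update step (illustrative; `X` and `y` are assumed to be given, with the intercept column already included in `X`):

{python}
import numpy as np

def grad(X, y, beta):
    return X.T @ (X @ beta - y) / len(y)   # matrix-form gradient: (1/N) X^T (X beta - y)

alpha = 0.1                                # descent "speed"
beta = np.zeros(X.shape[1])                # some initial approximation
beta = beta - alpha * grad(X, y, beta)     # one simultaneous update of all coefficients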

## Pseudocode¶

{python}
def gd(X, y, alpha, epsilon):
    beta = initialise_beta()                        # e.g. zeros or small random values
    while True:
        new_beta = beta - alpha * grad(X, y, beta)  # step in the antigradient direction
        if dist(new_beta, beta) < epsilon:          # stop once the update becomes negligible
            return new_beta
        beta = new_beta
In [23]:
grad_demo(iters=105, alpha=0.08)


## Comments¶

• How do we set $\alpha$?
• Feature scales matter
• Local minima*

## Gradient descent modifications¶

• Stochastic gradient descent (!)
• Descent with momentum
• Adagrad

## Stochastic gradient descent¶

{python}
def sgd(X, y, alpha, epsilon):
    beta = initialise_beta()
    while True:
        X, y = shuffle(X, y)                            # reshuffle objects before every epoch
        prev_beta = beta
        for x_n, y_n in zip(X, y):
            beta = beta - alpha * grad(x_n, y_n, beta)  # gradient estimated on a single object
        if dist(beta, prev_beta) < epsilon:             # stop once an epoch barely changes beta
            return beta
In [24]:
stoch_grad_demo(iters=105, alpha=0.08)


### Momentum¶

Idea: move not only along the current antigradient direction but also take the previous update into account (a short sketch follows below)

$$v_0 = 0$$$$v_t = \gamma v_{t - 1} + \alpha\nabla_{\beta}{L(\beta)}$$$$\beta = \beta - v_t$$

where

• $\gamma$ — momentum term (usually 0.9)
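A minimal sketch of this update rule (illustrative; it reuses `grad`, `X`, `y` and `beta` from the gradient-descent sketch above):

{python}
import numpy as np

gamma, alpha = 0.9, 0.1
v = np.zeros_like(beta)                        # v_0 = 0
for t in range(100):
    v = gamma * v + alpha * grad(X, y, beta)   # exponentially decaying sum of past gradients
    beta = beta - v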

### Adagrad¶

Idea: update the parameter $\beta_i$ of each feature differently.

Denote $\frac{\partial L}{\partial \beta_i}$ on iteration $t$ as $g_{t,i}$

Vanilla gradient descent update:

$$\beta_{t+1, i} = \beta_{t, i} - \alpha \cdot g_{t,i}$$

In Adagrad $\alpha$ is normalized wrt "size" of previous derivatives:

$$\beta_{t+1, i} = \beta_{t, i} - \dfrac{\alpha}{\sqrt{G_{t,ii} + \varepsilon}} \cdot g_{t,i}$$

where $G_t$ is a diagonal matrix whose $i$-th diagonal element is the sum of squares of the gradients w.r.t. $\beta_{i}$ over all iterations before $t$, and $\varepsilon$ is a smoothing hyperparameter that prevents division by zero.
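A minimal sketch of the Adagrad update (illustrative; again reusing `grad`, `X`, `y` and `beta` from the sketches above):

{python}
import numpy as np

alpha, eps = 0.1, 1e-8
G = np.zeros_like(beta)                          # running sums of squared per-coordinate gradients (diagonal of G_t)
for t in range(100):
    g = grad(X, y, beta)
    G += g ** 2
    beta = beta - alpha / np.sqrt(G + eps) * g   # the step size shrinks for frequently updated coordinates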

## FYI¶

• Zero-order methods
• like...
• 2nd order methods
• Newton's method

# Nonlinear dependencies¶

## Generalization by nonlinear transformations¶

Nonlinearity in $x$ may be introduced into linear regression by applying non-linear transformations to the features:

$$x\to[\phi_{0}(x),\,\phi_{1}(x),\,\phi_{2}(x),\,...\,\phi_{M}(x)]$$$$f(x)=\mathbf{\phi}(x)^{T}\beta=\sum_{m=0}^{M}\beta_{m}\phi_{m}(x)$$

The model remains linear in $\beta$, so all the advantages of linear regression are preserved.

## Typical transformations¶

• $x^{i}\in[a,b]$ : binarization of feature
• $x^{i}x^{j}$ : interaction of features
• $\exp\left\{ -\gamma\left\lVert x-\tilde{x}\right\rVert ^{2}\right\}$ : closeness to reference point $\tilde{x}$
• $\ln x^{k}$ : evens out distributions with heavy tails (several of these transformations are sketched below)
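A short sketch of building such transformed features (illustrative; the names `X_raw` and `x_ref` are placeholders, not from the lecture):

{python}
import numpy as np

rng = np.random.RandomState(0)
X_raw = 1 + rng.rand(100, 2)                               # two raw, positive features
x_ref = X_raw.mean(axis=0)                                 # a reference point
phi = np.column_stack([
    np.ones(len(X_raw)),                                   # phi_0(x) = 1 (intercept)
    X_raw,                                                  # original features
    X_raw[:, 0] * X_raw[:, 1],                             # interaction x^i * x^j
    np.log(X_raw[:, 0]),                                   # log transform of a heavy-tailed feature
    np.exp(-1.0 * ((X_raw - x_ref) ** 2).sum(axis=1)),     # closeness to the reference point
])
# The model stays linear in beta: fit phi with exactly the same OLS machinery as before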
In [26]:
demo_weights()