1. Some materials are taken from the machine learning course of Victor Kitov.
df_auto.plot(x='mileage', y='price', kind='scatter')
Ordinary Least Squares:
$$ L(\beta_0,\beta_1,\dots) = \frac{1}{2N}\sum^{N}_{n=1}(\hat{y}_{n} - y_{n})^2 = \frac{1}{2N}\sum^{N}_{n=1}\left(\sum_{d=1}^{D}\beta_{d}x_{n}^{d}-y_{n}\right)^{2} \rightarrow \min\limits_{\beta} $$
$$ \Updownarrow $$
$$ L(\beta) = \frac{1}{2N}(\hat{y} - y)^{\top}(\hat{y} - y) = \frac{1}{2N}(X\beta - y)^{\top}(X\beta - y) \rightarrow \min\limits_{\beta} $$
Let's say $f(x, \beta) = \beta_0 + \beta_1x_1$
Calculate partial derivatives wrt $\beta_0$, $\beta_1$ and set them to $0$:
$$ \frac{\partial L}{\partial \beta_0} = \frac{1}{N}\sum^{N}_{n=1}(\beta_0 + \beta_1x_{n}^1 - y_{n}) = 0$$ $$ \frac{\partial L}{\partial \beta_1} = \frac{1}{N}\sum^{N}_{n=1}(\beta_0 + \beta_1x_{n}^1 - y_{n})x^1_{n} = 0$$
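Solving this pair of equations yields the familiar simple-regression coefficients (writing $\bar{x}=\frac{1}{N}\sum_{n}x_{n}^{1}$ and $\bar{y}=\frac{1}{N}\sum_{n}y_{n}$):
$$ \beta_1 = \frac{\sum_{n=1}^{N}(x_{n}^{1}-\bar{x})(y_{n}-\bar{y})}{\sum_{n=1}^{N}(x_{n}^{1}-\bar{x})^{2}}, \qquad \beta_0 = \bar{y} - \beta_1\bar{x} $$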
Either with matrix calculus or with a bit of linear algebra we get
$$ X^{T}(X\beta-y)=0 \quad \Leftrightarrow \quad \beta = (X^\top X)^{-1} X^\top y \quad\text{(Normal Equation)}$$
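A minimal numpy sketch of fitting via the normal equation (assuming `X` already contains a column of ones for the intercept; `np.linalg.solve` is used instead of an explicit inverse):

{python}
import numpy as np

def fit_ols(X, y):
    # solve (X^T X) beta = X^T y rather than inverting X^T X explicitly
    return np.linalg.solve(X.T @ X, X.T @ y)

# hypothetical usage on the mileage/price data plotted above:
# X = np.column_stack([np.ones(len(df_auto)), df_auto['mileage'].values])
# beta = fit_ols(X, df_auto['price'].values)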
Advantages:
* exact solution in one step, no learning rate or iteration count to tune

Drawbacks:
* requires solving a $D\times D$ system, roughly $O(D^{3})$, which is expensive for many features
* $X^{\top}X$ may be singular or ill-conditioned (e.g. with strongly correlated features)
$$ L(\beta_0, \beta_1) = \frac{1}{2N}\sum_{n=1}^N(\beta_0 + \beta_1x^1_{n} - y_{n})^2$$
sq_loss_demo()
Derivative of $f(x)$ at $x_0$: $$ f'(x_0) = \lim\limits_{h \rightarrow 0}\frac{f(x_0+h) - f(x_0)}{h}$$
The derivative gives the slope of the tangent line at $x_0$.
interact(deriv_demo, h=FloatSlider(min=0.0001, max=2, step=0.005), x0=FloatSlider(min=1, max=15, step=.2))
$$ \frac{ \partial f(x_1,x_2,\dots,x_d)}{\partial x_i} = \lim\limits_{h \rightarrow 0}\frac{f(x_1, x_2, \dots, x_i + h, \dots, x_d) - f(x_1, x_2, \dots, x_i, \dots, x_d)}{h} \quad \text{partial derivative}$$
$$ \nabla f = \left(\frac{\partial f}{\partial x_i}\right),\quad i=1\dots d \quad \text{Gradient = a vector of partial derivatives}$$
$$ f'_v(x_0) = \frac{d}{dh}f(x_{0,1} + hv_1, \dots, x_{0,d} + hv_d) \rvert_{h=0} = \sum_{i=1}^d \frac{\partial f}{\partial x_i} \frac{d}{dh} (x_{0,i} + hv_i) = \langle \nabla f, v \rangle$$
For a unit direction vector $||v||=1$: $$ \langle \nabla f, v \rangle = || \nabla f || \cdot ||v|| \cdot \cos{\phi} = || \nabla f || \cdot \cos{\phi}$$
This is largest when $\cos\phi = 1$, i.e. when $v$ points along $\nabla f$: the gradient is the direction of steepest ascent, and the antigradient $-\nabla f$ is the direction of steepest descent.
Given $L(\beta_0, \beta_1)$, calculate the gradient (partial derivatives): $$ \frac{\partial L}{\partial \beta_0} = \frac{1}{N}\sum^{N}_{n=1}(\beta_0 + \beta_1x_{n}^1 - y_{n})$$ $$ \frac{\partial L}{\partial \beta_1} = \frac{1}{N}\sum^{N}_{n=1}(\beta_0 + \beta_1x_{n}^1 - y_{n})x^1_{n}$$
Or in matrix form: $$ \nabla_{\beta}L(\beta) = \frac{1}{N} X^\top(X\beta - y)$$
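A minimal numpy sketch of this matrix-form gradient (here `y` is passed explicitly, while the pseudocode below keeps the shorter `grad(X, beta)` signature):

{python}
def grad(X, y, beta):
    # (1/N) * X^T (X beta - y)
    return X.T @ (X @ beta - y) / len(y)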
Run the gradient update, which is a simultaneous(!!!) update of all components of $\beta$ in the antigradient direction:
$$ \beta := \beta - \alpha\nabla_{\beta}L(\beta)$$
{python}
def gd(X, alpha, epsilon):
    beta = initialise_beta()                     # e.g. zeros or small random values
    while True:
        new_beta = beta - alpha * grad(X, beta)  # step in the antigradient direction
        if dist(new_beta, beta) < epsilon:       # stop once the update is small enough
            return new_beta
        beta = new_beta
grad_demo(iters=105, alpha=0.08)
{python}
def sgd(X, alpha, epsilon):
    beta = initialise_beta()
    while True:
        prev_beta = beta
        X = shuffle(X)                           # reshuffle objects every epoch
        for x in X:
            beta = beta - alpha * grad(x, beta)  # one step per training object
        if dist(prev_beta, beta) < epsilon:      # stop once an epoch barely changes beta
            return beta
stoch_grad_demo(iters=105, alpha=0.08)
Idea: move not only in the current antigradient direction, but also take the previous step into account.
$$ v_t = \gamma v_{t - 1} + \alpha\nabla_{\beta}{L(\beta)}$$ $$ \beta = \beta - v_t$$
where $\gamma$ is the momentum coefficient (typically about $0.9$) and $v_0 = 0$.
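A minimal sketch of a training loop with momentum, reusing the `grad` helper sketched earlier; the fixed iteration count and the default values of `alpha` and `gamma` are illustrative assumptions:

{python}
def gd_momentum(X, y, alpha=0.1, gamma=0.9, n_iters=100):
    beta = np.zeros(X.shape[1])
    v = np.zeros_like(beta)                        # v_0 = 0
    for _ in range(n_iters):
        v = gamma * v + alpha * grad(X, y, beta)   # v_t = gamma * v_{t-1} + alpha * gradient
        beta = beta - v                            # beta = beta - v_t
    return beta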
Idea: update the parameters $\beta_i$ differently for each feature.
Denote $\frac{\partial L}{\partial \beta_i}$ at iteration $t$ as $g_{t,i}$.
Vanilla gradient descent update:
$$ \beta_{t+1, i} = \beta_{t, i} - \alpha \cdot g_{t,i}$$
In Adagrad, $\alpha$ is normalized by the "size" of the previous derivatives:
$$ \beta_{t+1, i} = \beta_{t, i} - \dfrac{\alpha}{\sqrt{G_{t,ii} + \varepsilon}} \cdot g_{t,i}$$
where $G_t$ is a diagonal matrix whose $i$-th diagonal element is the sum of squared derivatives w.r.t. $\beta_{i}$ over all iterations before $t$, and $\varepsilon$ is a smoothing hyperparameter that prevents division by zero.
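A per-coordinate Adagrad sketch under the same assumptions, keeping the diagonal of $G_t$ as a vector `G`:

{python}
def gd_adagrad(X, y, alpha=0.1, eps=1e-8, n_iters=100):
    beta = np.zeros(X.shape[1])
    G = np.zeros_like(beta)                          # running sums of squared derivatives (diagonal of G_t)
    for _ in range(n_iters):
        g = grad(X, y, beta)
        G += g ** 2
        beta = beta - alpha / np.sqrt(G + eps) * g   # per-coordinate step size
    return beta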
Nonlinearity in $x$ in linear regression may be achieved by applying non-linear transformations to the features:
$$ x\to[\phi_{0}(x),\,\phi_{1}(x),\,\phi_{2}(x),\,...\,\phi_{M}(x)] $$
$$ f(x)=\mathbf{\phi}(x)^{T}\beta=\sum_{m=0}^{M}\beta_{m}\phi_{m}(x) $$
The model remains linear in $\beta$, so all advantages of linear regression are preserved.
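For a single feature, a polynomial basis is one simple choice of $\phi_m$; a hedged numpy sketch, reusing the `fit_ols` helper from above:

{python}
def poly_features(x, M):
    # phi_0(x) = 1, phi_1(x) = x, ..., phi_M(x) = x^M for a 1-d feature vector x
    return np.column_stack([x ** m for m in range(M + 1)])

# Phi = poly_features(df_auto['mileage'].values, M=3)
# beta = fit_ols(Phi, df_auto['price'].values)   # the same linear machinery as before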
demo_weights()
Dependence of $\beta$ on $\lambda$ for ridge (A) and LASSO (B):
LASSO can be used for automatic feature selection.
$$ R(\beta)=\alpha||\beta||_{1}+(1-\alpha)||\beta||_{2}^{2}\to\min_{\beta} $$ where $\alpha\in(0,1)$ is a hyperparameter controlling the impact of each part.
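In scikit-learn this combined penalty is available as `ElasticNet`; note its parameterization differs slightly from the formula above (`alpha` there is the overall regularization strength and `l1_ratio` plays the role of the mixing coefficient). A sketch:

{python}
from sklearn.linear_model import ElasticNet

model = ElasticNet(alpha=1.0, l1_ratio=0.5)   # l1_ratio mixes the L1 and L2 penalties
# model.fit(X, y); model.coef_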
Ridge regression criterion $$ \sum_{n=1}^{N}\left(x_{n}^{T}\beta-y_{n}\right)^{2}+\lambda\beta^{T}\beta\to\min_{\beta} $$
Stationarity condition can be written as:
$$ \begin{gathered}2\sum_{n=1}^{N}x_{n}\left(x_{n}^{T}\beta-y_{n}\right)+2\lambda\beta=0\\ X^{T}(X\beta-y)+\lambda\beta=0\\ \left(X^{T}X+\lambda I\right)\beta=X^{T}y \end{gathered} $$
so
$$ \widehat{\beta}=(X^{T}X+\lambda I)^{-1}X^{T}y $$
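A minimal numpy sketch of this ridge solution (assuming a moderate number of features so the $D\times D$ solve is cheap):

{python}
def fit_ridge(X, y, lam):
    D = X.shape[1]
    # solve (X^T X + lambda * I) beta = X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)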
$$ \begin{array}{lll} \textbf{LOSS} & \textbf{NAME} & \textbf{PROPERTIES}\\ \mathcal{L}(\varepsilon)=\varepsilon^{2} & \text{quadratic} & \text{differentiable}\\ \mathcal{L}(\varepsilon)=\left|\varepsilon\right| & \text{absolute} & \text{robust}\\ \mathcal{L}(\varepsilon)=\begin{cases} \frac{1}{2}\varepsilon^{2}, & \left|\varepsilon\right|\le\delta\\ \delta\left(\left|\varepsilon\right|-\frac{1}{2}\delta\right) & \left|\varepsilon\right|>\delta \end{cases} & \text{Huber} & \text{differentiable, robust} \end{array} $$
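A numpy sketch of the Huber loss from the table, to make the piecewise definition concrete:

{python}
def huber(eps, delta=1.0):
    # quadratic near zero, linear in the tails
    return np.where(np.abs(eps) <= delta,
                    0.5 * eps ** 2,
                    delta * (np.abs(eps) - 0.5 * delta))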
For $x,y\sim P(x,y)$ and functional minimizers $f(x)$: $$ \begin{align*} \arg\min_{f(x)}\mathbb{E}\left\{ \left.\left(f(x)-y\right)^{2}\right|x\right\} & ={\mathbb{E}[y|x]}\\ \arg\min_{f(x)}\mathbb{E}\left\{ \left.|f(x)-y|\,\right|x\right\} & =\text{median}[y|x] \end{align*} $$
Weighted least squares accounts for observations with different importance by attaching a weight $w_{n}$ to each one: $$ \sum_{n=1}^{N}w_{n}\left(x_{n}^{T}\beta-y_{n}\right)^{2}\to\min_{\beta} $$
Stationarity condition: $$ \sum_{n=1}^{N}w_{n}x_{n}^{d}\left(x_{n}^{T}\beta-y_{n}\right)=0 $$
Define $\{X\}_{n,d}=x_{n}^{d}$, $W=\mathrm{diag}\{w_{1},...,w_{N}\}$. Then
$$ X^{T}W\left(X\beta-y\right)=0 $$ $$ \beta=\left(X^{T}WX\right)^{-1}X^{T}Wy $$
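A corresponding numpy sketch (the weights are passed as a vector and `W` is built as a diagonal matrix):

{python}
def fit_wls(X, y, w):
    W = np.diag(w)
    # beta = (X^T W X)^{-1} X^T W y, computed via a linear solve
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)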
Initialize $w_{1}=...=w_{N}=1/N$
Repeat:
* estimate $\beta$ from the weighted problem with the current weights $w_{n}$
* recompute the weights $w_{n}=K\left(\left|x_{n}^{T}\beta-y_{n}\right|\right)$, so that objects with large errors get smaller weights

Comments: $K(\cdot)$ is some decreasing function; the repetition may run
* a predefined number of times
* until convergence of model parameters.
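A sketch of the whole reweighting loop, reusing `fit_wls` from above and taking $K(\varepsilon)=1/(1+\varepsilon)$ as one possible decreasing function (an illustrative assumption):

{python}
def robust_regression(X, y, n_iters=10):
    w = np.ones(len(y)) / len(y)          # w_1 = ... = w_N = 1/N
    for _ in range(n_iters):              # a predefined number of repetitions
        beta = fit_wls(X, y, w)
        resid = np.abs(X @ beta - y)
        w = 1.0 / (1.0 + resid)           # K(.): one possible decreasing function
    return beta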
For training sample $\left(x_{1},y_{1}\right),...\left(x_{N},y_{N}\right)$ consider finding the best constant prediction $\widehat{y}\in\mathbb{R}$: $$ L(\widehat{y})=\sum_{i=1}^{N}(\widehat{y}-y_{i})^{2}\to\min_{\widehat{y}\in\mathbb{R}} $$ $$ \frac{\partial L}{\partial\widehat{y}}=2\sum_{i=1}^{N}\left(\widehat{y}-y_{i}\right)=0\text{, so }\widehat{y}=\frac{1}{N}\sum_{i=1}^{N}y_{i} $$
We need to model a general curve $y(x)$:
Nadaraya-Watson regression is a localized averaging approach.
$$ w_{i}(x)=K\left(\frac{\rho(x,x_{i})}{h}\right) $$
$$ \begin{align*} Q(\alpha,X_{training}) & =\sum_{i=1}^{N}w_{i}(x)(\alpha-y_{i})^{2}\to\min_{\alpha\in\mathbb{R}}\\ w_{i}(x) & =K\left(\frac{\rho(x,x_{i})}{h(x)}\right) \end{align*} $$
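The minimizer of this criterion is the weighted average $\widehat{y}(x)=\frac{\sum_{i}w_{i}(x)y_{i}}{\sum_{i}w_{i}(x)}$; a numpy sketch for a single feature with a Gaussian kernel (an illustrative choice of $K$):

{python}
def nadaraya_watson(x_query, x_train, y_train, h=1.0):
    # Gaussian kernel weights based on the distance to the query point
    w = np.exp(-0.5 * ((x_train - x_query) / h) ** 2)
    return np.sum(w * y_train) / np.sum(w)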
Instead of optimizing locally with a constant $\alpha$ $$ Q(\alpha,X_{training})=\sum_{i=1}^{N}w_{i}(x)({\alpha}-y_{i})^{2}\to\min_{\alpha\in\mathbb{R}} $$
we could optimize a local linear regression $$ Q(\beta,X_{training})=\sum_{i=1}^{N}w_{i}(x)({x_{i}^{T}\beta}-y_{i})^{2}\to\min_{\beta} $$
This better handles approximation at the edges of the domain and near local extrema.
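A sketch of local linear regression at a single query point, reusing `fit_wls` from above (Gaussian kernel again assumed, single feature):

{python}
def local_linear(x_query, x_train, y_train, h=1.0):
    w = np.exp(-0.5 * ((x_train - x_query) / h) ** 2)        # kernel weights around x_query
    Phi = np.column_stack([np.ones_like(x_train), x_train])  # local intercept + slope
    beta = fit_wls(Phi, y_train, w)
    return beta[0] + beta[1] * x_query                       # the local fit evaluated at the query point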