Data Analysis

Andrey Shestakov (avshestakov@hse.ru)

Boosting.¹

^{1. Some materials are taken from machine learning course of Victor Kitov}

Let's recall previous lecture¶

Ensembles¶

Random Forests, Bagging
Stacking, Blending

Boosting¶

Linear ensembles¶

$$ F(x)=f_{0}(x)+\alpha_{1}h_{1}(x)+...+\alpha_{M}h_{M}(x) $$

Regression: $\widehat{y}(x)=F(x)$

Binary classification: $score(y|x)=F(x),\,\widehat{y}(x)= sign(F(x))$

Notation: $h_{1}(x),...h_{M}(x)$ are called \textit{base learners, weak learners, base models}.
Too expensive to optimize $f_{0}(x),h_{1}(x),...h_{M}(x)$ and $\alpha_{1},...\alpha_{M}$ jointly for large $M$.
Idea: optimize $f_{0}(x)$ and then each pair $(h_{m}(x),\,\alpha_{m})$ greedily.

Forward stagewise additive modeling (FSAM)¶

Input: training dataset $(x_{i},y_{i}),\,i=1,2,...N$; loss function $\mathcal{L}(f,y)$, general form of ``base learner'' $h(x|\gamma)$ (dependent from parameter $\gamma$) and the number $M$ of successive additive approximations.

Fit initial approximation $f_{0}(x)=\arg\min_{f}\sum_{i=1}^{N}\mathcal{L}(f(x_{i}),y_{i})$
For $m=1,2,...M$:
1. find next best classifier $$ (\alpha_{m},h_{m})=\arg\min_{h,c}\sum_{i=1}^{N}\mathcal{L}(f_{m-1}(x_{i})+\alpha h(x_{i}),\,y_{i}) $$
2. set $$ f_{m}(x)=f_{m-1}(x)+\alpha_{m}h_{m}(x) $$ Output: approximation function $f_{M}(x)=f_{0}(x)+\sum_{m=1}^{M}\alpha_{m}h_{m}(x)$

Comments on FSAM¶

Number of steps $M$ should be determined by performance on validation set.
Step 1 need not be solved accurately, since its mistakes are expected to be corrected by future base learners.
- we can take $f_{0}(x)=\arg\min_{\beta\in\mathbb{R}}\sum_{i=1}^{N}\mathcal{L}(\beta,y_{i})$ or simply $f_{0}(x)\equiv0$.
By similar reasoning there is no need to solve 2.A accurately
- typically very simple base learners are used such as trees of depth=1,2,3.
For some loss functions, such as $\mathcal{L}(y,f(x))=e^{-yf(x)}$ we can solve FSAM explicitly.
For general loss functions gradient boosting scheme should be used.

AdaBoost¶

Adaboost (discrete version): assumptions¶

binary classification task $y\in\{+1,-1\}$
family of base classifiers $h(x)=h(x|\gamma)$ where $\gamma$ is some fitted parametrization.
$h(x)\in\{+1,-1\}$
classification is performed with $\widehat{y}=sign\{f_{0}(x)+\alpha_{1}f_{1}(x)+...+\alpha_{M}f_{M}(x)\}$
optimized loss is $\mathcal{L}(y,f(x))=e^{-yf(x)}$
FSAM is applied

Exponential loss¶

Adaboost (discrete version): algorithm¶

Input: training dataset $(x_{i},y_{i}),\,i=1,2,...n$; number of additive weak classifiers $M$, a family of weak classifiers $h(x)\in\{+1,-1\}$, trainable on weighted datasets.

Initialize observation weights $w_{i}=1/n$, $i=1,2,...n$.
for $m=1,2,...M$:
1. fit $h^{m}(x)$ to training data using weights $w_{i}$
2. compute weighted misclassification rate: $$ E_{m}=\sum_{i=1}^{N}w_{i}\mathbb{I}[h^{m}(x_i)\ne y_{i}] $$
3. compute $\alpha_{m}=\frac{1}{2}\ln\left((1-E_{m})/E_{m}\right)$
4. update sample weights: $$ w_{i}\leftarrow \frac{w_{i}e^{-\alpha_{m}y_i h^{m}(x_i)}}{W},$$ Where $W$ is normalization factor $\left(W = \sum_i w_i e^{-\alpha_m y_i h^m(x_i)}\right)$

Output: composite classifier $f(x)=sign\left(\sum_{m=1}^{M}\alpha_{m}h^{m}(x)\right)$

In [3]:

X = np.array([[-2, -1], [-2, 1], [2, -1], [2, 1], [-1, -1], [-1, 1], [1, -1], [1, 1]])
y = np.array([-1,-1,-1,-1,1,1,1,1]) 

plt.scatter(X[:, 0], X[:, 1], c=y, s=500)

ada = AdaBoostClassifier(n_estimators=3, algorithm='SAMME', 
                         base_estimator=DecisionTreeClassifier(max_depth=1))
ada.fit(X, y)

Out[3]:

AdaBoostClassifier(algorithm='SAMME',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
          learning_rate=1.0, n_estimators=3, random_state=None)

In [5]:

plot_decision(ada)
ada.estimator_weights_

Out[5]:

array([ 1.09861229,  1.60943791,  1.38629436])

In [7]:

X, y = make_moons(noise=0.1)
plt.figure(figsize=(7,5))
plt.scatter(X[:, 0], X[:, 1], c=y)

Out[7]:

<matplotlib.collections.PathCollection at 0x1a13a89990>

In [8]:

interact(ada_demo, n_est=IntSlider(min=1, max=150, value=1, step=1))

Out[8]:

<function __main__.ada_demo>

Gradient boosting¶

Motivation¶

Problem: For general loss function $L$ FSAM cannot be solved explicitly
Analogy with function minimization: when we can't find optimum explicitly we use numerical methods
Gradient boosting: numerical method for iterative loss minimization

Gradient descent algorithm¶

$$ F(w)\to\min_{w},\quad w\in\mathbb{R}^{N} $$

Gradient descend algorithm:

Input: $\eta$-parameter, controlling the speed of convergence $M$-number of iterations

ALGORITHM:

initialize $w$
for $m=1,2,...M$:
- $\Delta w = \frac{\partial F(w)}{\partial w}$
- $w = w-\eta \Delta w$

Modified gradient descent algorithm¶

Input: $M$-number of iterations

ALGORITHM:

initialize $w$
for $m=1,2,...M$:
- $\Delta w = \frac{\partial F(w)}{\partial w}$
- $c^* = \arg\min_c F(w-c \Delta w)$
- $w = w-c^* \Delta w$

Gradient boosting¶

Now consider $F\left(f(x_{1}),...f(x_{N})\right)=\sum_{n=1}^{N}\mathcal{L}\left(f(x_{n}),y_{n}\right)$
Gradient descent performs pointwise optimization, but we need generalization, so we optimize in space of functions.
Gradient boosting implements modified gradient descent in function space:
- find $z_{i}=-\frac{\partial\mathcal{L}(r,y_{i})}{\partial r}|_{r=f^{m-1}(x_{i})}$
- fit base learner $h_{m}(x)$ to $\left\{ (x_{i},z_{i})\right\} _{i=1}^{N}$

Gradient boosting¶

Input: training dataset $(x_{i},y_{i}),\,i=1,2,...N$; loss function $\mathcal{L}(f,y)$ and the number $M$ of successive additive approximations.

Fit initial approximation $f_{0}(x)$ (might be taken $f_{0}(x)\equiv0$)
For each step $m=1,2,...M$:
1. calculate derivatives $z_{i}=-\frac{\partial\mathcal{L}(r,y_{i})}{\partial r}|_{r=f^{m-1}(x_{i})}$
2. fit $h_{m}$ to $\{(x_{i},z_{i})\}_{i=1}^{N}$, for example by solving $$ \sum_{n=1}^{N}(h_{m}(x_{n})-z_{n})^{2}\to\min_{h_{m}} $$
3. solve univariate optimization problem: $$ \sum_{i=1}^{N}\mathcal{L}\left(f_{m-1}(x_{i})+c_{m}h_{m}(x_{i}),y_{i}\right)\to\min_{c_{m}\in\mathbb{R}_{+}} $$
4. set $f_{m}(x)=f_{m-1}(x)+c_{m}h_{m}(x)$

Output: approximation function $f_{M}(x)=f_{0}(x)+\sum_{m=1}^{M}c_{m}h_{m}(x)$

Gradient boosting of trees¶

Input : training dataset $(x_{i},y_{i}),\,i=1,2,...N$; loss function $\mathcal{L}(f,y)$ and the number $M$ of successive additive approximations.

Fit constant initial approximation $f_{0}(x)$: $f_{0}(x)=\arg\min_{\gamma}\sum_{i=1}^{N}\mathcal{L}(\gamma,\,y_{i})$
For each step $m=1,2,...M$:
1. calculate derivatives $z_{i}=-\frac{\partial\mathcal{L}(r,y)}{\partial r}|_{r=f^{m-1}(x)}$
2. fit regression tree $h^{m}$ on $\{(x_{i},z_{i})\}_{i=1}^{N}$ with some loss function, get leaf regions $\{R_{j}^{m}\}_{j=1}^{J_{m}}$.
3. for each terminal region $R_{j}^{m}$, $j=1,2,...J_{m}$ solve univariate optimization problem: $$ \gamma_{j}^{m}=\arg\min_{\gamma}\sum_{x_{i}\in R_{j}^{m}}\mathcal{L}(f_{m-1}(x_{i})+\gamma,\,y_{i}) $$
4. update $f_{m}(x)=f_{m-1}(x)+\sum_{j=1}^{J_{m}}\gamma_{j}^{m}\mathbb{I}[x\in R_{j}^{m}]$

Output: approximation function $f_{M}(x)$

In [9]:

from sklearn.ensemble import GradientBoostingRegressor

def grad_demo(n_est=1):
    
    np.random.seed(123)
    X = np.random.uniform(-10, 10, 500)

    y = np.sin(X)/X + np.random.normal(0, .1, 500)
    plt.scatter(X, y)
    

    gbr = GradientBoostingRegressor(n_estimators=n_est, learning_rate=0.15)
    gbr_full = GradientBoostingRegressor(n_estimators=200, learning_rate=0.15)
    gbr.fit(X.reshape(-1,1), y)
    gbr_full.fit(X.reshape(-1,1), y)
    
    x_range = np.linspace(X.min(), X.max(), 100).reshape((-1,1))

    for y_hat in gbr.staged_predict(x_range):
        plt.plot(x_range, y_hat, alpha=0.4, c='g')

    y_hat = gbr_full.predict(x_range)
    
    plt.title('Estimators %d' % n_est)
    plt.plot(x_range, y_hat, c='r')
    plt.ylim((-0.5, 1.3))
    plt.xlim(-11,11)
    
    plt.show()

In [10]:

interact(grad_demo, n_est=IntSlider(min=1, max=150, value=1, step=1))

Out[10]:

<function __main__.grad_demo>

Modification of boosting for trees¶

Compared to first method of gradient boosting, boosting of regression trees finds additive coefficients individually for each terminal region $R_{j}^{m}$, not globally for the whole classifier $h^{m}(x)$.
This is done to increase accuracy: forward stagewise algorithm cannot be applied to find $R_{j}^{m}$, but it can be applied to find $\gamma_{j}^{m}$, because second task is solvable for arbitrary $L$.
Max leaves $J$
- interaction between no more than $J-1$ terms
$M$ controls underfitting-overfitting tradeoff and selected using validation set

Shrinkage & subsampling¶

Shrinkage of general GB, step (d): $$ f_{m}(x)=f_{m-1}(x)+\nu c_{m}h_{m}(x) $$
Shrinkage of trees GB, step (d):

$$ f_{m}(x)=f_{m-1}(x)+\nu\sum_{j=1}^{J_{m}}\gamma_{jm}\mathbb{I}[x\in R_{jm}] $$

Comments:
- $\nu\in(0,1]$
- $\nu \text{ decrices } \implies M \text{ increases } $
Subsampling
- increases speed of fitting
- may increase accuracy

Interpretation - partial dependency plots¶

Problem - we have huge black-box model $\hat{y}^k = F(x^k_1, x^k_2,\dots,x^k_p)$
Want to have at least some interpretation
Idea - cosider a single predictor $x_j$
- Find out, its influence on prediction after we have "averaged out" the influence of all other variables:

$$ \phi_j(x) = \frac{1}{N}\sum_{k=1}^N F(x^k_1, x^k_2,\dots, x^k_{j-1}, x, x^k_{j+1} \dots,x^k_p) $$

Data Analysis

Andrey Shestakov (avshestakov@hse.ru)

Boosting.1

Let's recall previous lecture¶

Ensembles¶

Boosting¶

Linear ensembles¶

Forward stagewise additive modeling (FSAM)¶

Comments on FSAM¶

AdaBoost¶

Adaboost (discrete version): assumptions¶

Exponential loss¶

Adaboost (discrete version): algorithm¶

Gradient boosting¶

Motivation¶

Gradient descent algorithm¶

Modified gradient descent algorithm¶

Gradient boosting¶

Gradient boosting¶

Gradient boosting of trees¶

Modification of boosting for trees¶

Shrinkage & subsampling¶

Interpretation - partial dependency plots¶

Boosting.¹