1. Some materials are taken from the machine learning course of Victor Kitov.
The expected error decomposes into components that can be interpreted as variance, squared bias, and noise.
For brevity of notation, define $y(x)=y,\, a(x) = a,\, f(x) = f, \, \mathbb{E}=\mathbb{E}_{X,Y,\varepsilon}$:
\begin{equation} \begin{split} \text{Error} & = \mathbb{E}\left[(a - y)^2\right] = \mathbb{E}\left[a^2 - 2ay + y^2\right] = \\ & = \underbrace{\mathbb{E}\left[(a - \mathbb{E}[a])^2\right] + \mathbb{E}[a]^2}_{=\mathbb{E}[a^2]} - 2\mathbb{E}[ay] + \underbrace{\mathbb{E}\left[(y - \mathbb{E}[y])^2\right] + \mathbb{E}[y]^2}_{=\mathbb{E}[y^2]} = \\ & = \mathbb{E}\left[(a - \mathbb{E}[a])^2\right] + \left(\mathbb{E}\left[a\right] - f\right)^2 + \mathbb{E}\left[(y - f)^2\right] = \\ & = \text{Variance} + \text{Bias}^2 + \text{Noise} \end{split} \end{equation}

The third equality uses $\mathbb{E}[y]=f$ and the independence of $a$ from the noise $\varepsilon$, so that $\mathbb{E}[ay]=\mathbb{E}[a]\,\mathbb{E}[y]=\mathbb{E}[a]\,f$.

Ensemble learning is the use of multiple machine learning methods for a given problem, integrating their outputs to obtain the final result.
Synonyms: committee-based learning, multiple classifier systems.
Applications:
- underfitting (high model bias)
- overfitting (high model variance)
from scipy.special import comb  # scipy.misc.comb was removed in newer SciPy versions

M = 50          # number of independent base classifiers
T = M // 2 + 1  # minimum number of correct votes for the majority to be correct
p = 0.6         # accuracy of each base classifier

# probability that at least T of the M classifiers are correct (majority-vote accuracy)
p_ens = sum(comb(M, k) * p**k * (1 - p)**(M - k) for k in range(T, M + 1))
p_ens
0.9021926358467505
Output of base learner $k$: a hard class label, $\omega_{1}$ or $\omega_{2}$.
Combiner predicts $\omega_{1}$ if it receives more than half of the votes: $$ \sum_{k=1}^{K}\mathbb{I}\left[a_{k}(x)=\omega_{1}\right]>\frac{K}{2}, $$ where $a_{k}(x)$ is the prediction of classifier $k$.
Each classifier may be assigned a weight $w_{k}$ based on its performance; the combiner then predicts $\omega_{1}$ if $$ \sum_{k=1}^{K}w_{k}\,\mathbb{I}\left[a_{k}(x)=\omega_{1}\right]>\frac{1}{2}\sum_{k=1}^{K}w_{k}. $$
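To make the voting rules concrete, here is a minimal numpy sketch (the vote and weight values are hypothetical, not from the lecture):

import numpy as np

votes = np.array([1, 0, 1, 1, 0])              # hypothetical predictions of K=5 classifiers, omega_1 encoded as 1
weights = np.array([0.9, 0.6, 0.8, 0.7, 0.5])  # hypothetical weights, e.g. validation accuracies

# simple majority vote: predict omega_1 if more than half of the classifiers vote for it
majority_pred = int(votes.sum() > len(votes) / 2)

# weighted vote: predict omega_1 if the weighted votes exceed half of the total weight
weighted_pred = int((weights * votes).sum() > weights.sum() / 2)

majority_pred, weighted_pred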
Output of base learner $k$: a ranking of the classes $$ \omega_{k_{1}}\succeq\omega_{k_{2}}\succeq\ldots\succeq\omega_{k_{C}} $$
A ranking is equivalent to a scoring of the classes (scores are not comparable between classifiers).
Definition
Let $B_{k}(i)$ be the number of classes scored below $\omega_{i}$ by classifier $k$.
Borda count $B(i)$ of class $\omega_{i}$ is the total number of classes scored below $\omega_{i}$ by all classifiers: $$ B(i)=\sum_{k=1}^{K}B_{k}(i) $$
Combiner predicts $\omega_{i}$, where $i=\arg\max_{j}B(j)$.
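A small numpy illustration of Borda counting (the rankings are hypothetical, not from the lecture):

import numpy as np

C = 3  # number of classes
# rankings[k] lists class indices from best to worst for classifier k (hypothetical)
rankings = [
    [0, 2, 1],
    [2, 0, 1],
    [0, 1, 2],
]

# B_k(i) is the number of classes ranked below class i by classifier k
B = np.zeros(C)
for ranking in rankings:
    for position, cls in enumerate(ranking):
        B[cls] += C - 1 - position

predicted = int(np.argmax(B))  # combiner predicts the class with the largest Borda count
B, predicted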
Output of base learner $k$: a vector of class probabilities: $$ [p^{k}(\omega_{1}),\,p^{k}(\omega_{2}),\,\ldots,\,p^{k}(\omega_{C})] $$
Combiner predicts $\omega_{i}$, where $i=\arg\max_{j}F\bigl(p^{1}(\omega_{j}),p^{2}(\omega_{j}),\ldots,p^{K}(\omega_{j})\bigr)$ and $F$ is an aggregation function, e.g. the mean, median, or product.
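For example, a sketch of such a combiner with $F$ chosen as the arithmetic mean (the probability vectors are hypothetical):

import numpy as np

# probs[k] is the class-probability vector produced by base learner k (hypothetical values)
probs = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.4, 0.4, 0.2],
])

aggregated = probs.mean(axis=0)        # F = mean; median or product are other common choices
predicted = int(np.argmax(aggregated))
aggregated, predicted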
Intuitively, bagging should reduce variance: averaging models trained on different bootstrap samples smooths out their individual fluctuations, while the bias stays roughly the same.
Input: training set $(x_{i},y_{i})_{i=1}^{N}$, number of trees $B$.
for $b=1,2,\ldots,B$:
- draw a bootstrap sample of size $N$ (sampling with replacement) from the training set
- fit a tree on this bootstrap sample
Output: $B$ trees. Classification is done using majority vote and regression using averaging of the $B$ outputs.
Samples left out of each bootstrap sample (out-of-bag samples) may be used to evaluate model performance.
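For instance, with scikit-learn the out-of-bag estimate can be obtained as follows (the dataset and hyperparameters are illustrative, not from the lecture):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# B bagged trees (the default base estimator is a decision tree);
# oob_score=True evaluates each tree on the samples left out of its bootstrap sample
bagging = BaggingClassifier(n_estimators=100, oob_score=True, random_state=0)
bagging.fit(X, y)
bagging.oob_score_  # out-of-bag estimate of accuracy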
Less interpretable than individual trees
Extra-Random Trees: random sampling of (feature, value) pairs as split candidates.
RandomForest and ExtraRandomTrees do not overfit with increasing $B$: the test error levels off rather than growing as more trees are added.
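A quick way to see this empirically with scikit-learn (illustrative setup, not from the lecture): cross-validated accuracy should not degrade as $B$ grows.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for B in (10, 100, 300):
    rf = cross_val_score(RandomForestClassifier(n_estimators=B, random_state=0), X, y, cv=5).mean()
    et = cross_val_score(ExtraTreesClassifier(n_estimators=B, random_state=0), X, y, cv=5).mean()
    print(B, round(rf, 3), round(et, 3))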
Consider regression with $K$ predictor models $f_{k}(x)$, $k=1,2,...K$.
(Alternatively we may consider $K$ discriminant functions in classification)
Weighted averaging combiner
$$
f(x)=\sum_{k=1}^{K}w_{k}f_{k}(x)
$$
Naive fitting $$ \widehat{w}=\arg\min_{w}\sum_{i=1}^{N}\mathcal{L}(y_{i},\sum_{k=1}^{K}w_{k}f_{k}(x_{i})) $$
overfits: since the weights are chosen on the same data the base models were fitted on, the most overfitted base model gets the largest weight.
Definition
Linear stacking is a weighted averaging combiner whose weights are found using $$ \widehat{w}=\arg\min_{w}\sum_{i=1}^{N}\mathcal{L}\Bigl(y_{i},\,\sum_{k=1}^{K}w_{k}f_{k}^{-fold(i)}(x_{i})\Bigr), $$ where $f_{k}^{-fold(i)}$ denotes model $k$ fitted on all folds except the fold containing object $i$ (an out-of-fold prediction).
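A sketch of linear stacking using scikit-learn's cross_val_predict for the out-of-fold predictions (the choice of base models and dataset is illustrative):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

base_models = [Ridge(alpha=1.0), DecisionTreeRegressor(max_depth=5, random_state=0)]

# column k holds f_k^{-fold(i)}(x_i): predictions from models fitted without the fold containing object i
Z = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in base_models])

# weights w_k are fitted by least squares on the out-of-fold predictions
stacking_weights = LinearRegression(fit_intercept=False).fit(Z, y).coef_
stacking_weights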
Definition
Generalized stacking is the prediction $$ f(x)=A_{\theta}\left(f_{1}(x),f_{2}(x),\ldots,f_{K}(x)\right), $$
where $A$ is a general-form predictor and $\theta$ is a vector of parameters, estimated by $$ \widehat{\theta}=\arg\min_{\theta}\sum_{i=1}^{N}\mathcal{L}\left(y_{i},\,A_{\theta}\left(f_{1}^{-fold(i)}(x_{i}),f_{2}^{-fold(i)}(x_{i}),\ldots,f_{K}^{-fold(i)}(x_{i})\right)\right) $$
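scikit-learn's StackingRegressor implements this scheme: the meta-model $A_{\theta}$ is trained on cross-validated predictions of the base learners. A minimal sketch (the estimators are illustrative choices):

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

estimators = [
    ("ridge", Ridge(alpha=1.0)),
    ("tree", DecisionTreeRegressor(max_depth=5, random_state=0)),
]

# A_theta = gradient boosting meta-model, fitted on out-of-fold base predictions (cv=5)
stack = StackingRegressor(estimators=estimators, final_estimator=GradientBoostingRegressor(random_state=0), cv=5)
stack.fit(X, y)
stack.predict(X[:3])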