Data Analysis

Andrey Shestakov (avshestakov@hse.ru)


Feature Selection and Dimension Reduction. PCA¹

1. Some materials are taken from the machine learning course of Victor Kitov

Feature Selection vs. Dimension Reduction

  • Feature selection is the process of selecting a subset of the original features with minimum loss of information relevant to the final task (classification, regression, etc.)
  • Dimension reduction is the result of transforming the initial features into a (possibly lower-dimensional) feature space

Why feature selection?

  • increase predictive accuracy of classifier
  • improve optimization stability by removing multicollinearity
  • increase computational efficiency
  • reduce cost of future data collection
  • make classifier more interpretable

Not always a necessary step

  • some methods have implicit feature selection

Feature Selection Approaches

  • Unsupervised methods
    • don't use target feature
  • Filter methods
    • use the target feature
    • consider each feature independently
  • Wrapper methods
    • use model quality to evaluate feature subsets
  • Embedded methods
    • feature selection is embedded inside the model

"Unsupervised" methods

  • Determine feature importance regardless of target feature
  • Your ideas?

Filter methods

  • Features are considered independently of each other
  • Individual predictive power is measured

Basically

  • Order features by feature importance $I(f)$: $$ I(f_{1})\ge I(f_{2})\ge \dots\ge I(f_{D}) $$
  • Select the top $m$ (see the sketch below): $$ \hat{F}=\{f_{1},f_{2},...f_{m}\} $$
  • Simple to implement
  • Usually quite fast
  • When features are correlated, it may select many redundant features
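A minimal sketch of this top-$m$ filter selection with scikit-learn. The synthetic dataset, the scoring function and the value of $k$ are illustrative choices, not part of the lecture code:

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# score each feature independently, then keep the top m = 5
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_top = selector.fit_transform(X, y)
print(selector.get_support(indices=True))   # indices of the selected features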

Examples

  • Correlation
    • Which kind of relationship does correlation measure?
  • Mutual Information
    • Entropy of variable $Y$: $H(Y) = - \sum_y p(y)\ln p(y)$
    • Conditional entropy of $Y$ after observing $X$: $H(Y|X) = - \sum_x p(x) \sum_y p(y|x)\ln p(y|x)$
    • Mutual information: $$MI(Y, X) = \sum_{x,y} p(x,y) \ln\left[\frac{p(x,y)}{p(x)p(y)}\right]$$
      • Mutual information measures how much information $X$ and $Y$ share
      • $MI(Y,X) = H(Y) - H(Y|X)$
In [3]:
df_titanic = pd.read_csv('data/titanic.csv')
df_titanic.head()
Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [4]:
# joint distribution P(Survived, Sex)
P = pd.crosstab(df_titanic.Survived, df_titanic.Sex, normalize=True).values
print(P)
[[0.09090909 0.52525253]
 [0.26150393 0.12233446]]
In [5]:
# marginal distributions as column vectors: p(Survived) and p(Sex)
px = P.sum(axis=1)[:, np.newaxis]
py = P.sum(axis=0)[:, np.newaxis]
print(px)
print(py)
[[0.61616162]
 [0.38383838]]
[[0.35241302]
 [0.64758698]]
In [6]:
# what the joint distribution would look like if Sex and Survived were independent
px.dot(py.T)
Out[6]:
array([[0.21714338, 0.39901824],
       [0.13526964, 0.24856874]])
In [8]:
mutual_info(df_titanic.Sex, df_titanic.Survived)
Out[8]:
0.15087048925218172
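The helper mutual_info called in In [8] is defined in a cell not shown here; below is a minimal sketch of how it could be implemented from the crosstab above, consistent with the MI formula and with the output 0.1509 (the function name and signature are assumptions):

import numpy as np
import pandas as pd

def mutual_info(x, y):
    # joint distribution p(x, y) from the contingency table
    P = pd.crosstab(x, y, normalize=True).values
    px = P.sum(axis=1)[:, np.newaxis]    # marginal p(x) as a column
    py = P.sum(axis=0)[np.newaxis, :]    # marginal p(y) as a row
    mask = P > 0                         # skip empty cells to avoid log(0)
    return np.sum(P[mask] * np.log(P[mask] / px.dot(py)[mask]))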

Wrapper methods

  • Search for a (sub)optimal subset of features by evaluating model quality on candidate subsets (an RFE sketch follows this list)
  • Could be slow
  • Examples:
    • Recursive Feature Elimination
      • Consider full set of features
      • Fit a model, measure feature importance (based on model)
      • Remove least important feature(s)
      • Repeat
    • Boruta Algorithm
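A hedged sketch of Recursive Feature Elimination with scikit-learn on synthetic data; the estimator and the number of features to keep are arbitrary illustrative choices:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# repeatedly fit the model and drop the least important feature until 5 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5, step=1)
rfe.fit(X, y)
print(rfe.support_)    # boolean mask of the selected features
print(rfe.ranking_)    # 1 = kept; larger ranks were eliminated earlier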

Embedded methods

  • The feature selection process is included in the model (a Lasso sketch follows this list)
  • Examples:
    • Decision Trees
    • Linear model with L1 regularization
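A minimal sketch of embedded selection via L1 regularization on synthetic regression data; the alpha value is an illustrative choice:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=300, n_features=20, n_informative=5, noise=1.0, random_state=0)

# the L1 penalty drives some coefficients exactly to zero
lasso = Lasso(alpha=1.0).fit(X, y)
selected = np.flatnonzero(lasso.coef_)   # features that survived the penalty
print(selected)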

Dimension Reduction

Principal Component Analysis (PCA)

Feature Selection vs. Feature Extraction

Why dimension reduction?

Same as for feature selection, plus:

  • data visualization

PCA

  • Intuition 1: Find directions along which our datapoints have the greatest variance

PCA

  • Intuition 2: Find a subspace $L$ (of lower dimension) s.t. the sum of squared distances between points and their projections onto $L$ is minimized

In other words, we can consider PCA as

  • A transformation of your initial feature axes ...
    • New axes are just a linear combination of initial axes
    • New axes are orthogonal (orthonormal) to each other
    • Variance of data across those axes is maximized
  • ... which keeps only the most "informative" axes

How do we project data on new axes?

  • Consider an object $x$ with 3 features: $x=(-0.343, -0.754, 0.241)$
  • We can say that it is spanned on 3 basis vectors of our feature space: $$ e_1 = \left( \begin{array}{c} 1 \\ 0 \\ 0 \end{array} \right) \quad e_2 = \left( \begin{array}{c} 0 \\ 1 \\ 0 \end{array} \right) \quad e_3 = \left( \begin{array}{c} 0 \\ 0 \\ 1 \end{array} \right) \quad$$
$$ x = -0.343\, e_1 - 0.754\, e_2 + 0.241\, e_3 $$
  • Consider new basis vectors:
$$ a_1 = \left( \begin{array}{c} -0.390 \\ 0.089 \\ -0.916 \end{array} \right) \quad a_2 = \left( \begin{array}{c} -0.639 \\ -0.742 \\ 0.200 \end{array} \right) \quad a_3 = \left( \begin{array}{c} -0.663 \\ 0.664 \\ 0.346 \end{array} \right) \quad$$
  • How do we project $x$ on it?

Projecting

$$ a_1 = \left( \begin{array}{c} -0.390 \\ 0.089 \\ -0.916 \end{array} \right) \quad a_2 = \left( \begin{array}{c} -0.639 \\ -0.742 \\ 0.200 \end{array} \right) \quad a_3 = \left( \begin{array}{c} -0.663 \\ 0.664 \\ 0.346 \end{array} \right) $$

$$ z = A^\top x = \left( \begin{array}{ccc} -0.390 & 0.089 & -0.916\\ -0.639 & -0.742 & 0.200 \\ -0.663 & 0.664 & 0.346 \end{array} \right) \left( \begin{array}{c} -0.343 \\ -0.754 \\ 0.241 \end{array} \right) = \left( \begin{array}{c} -0.154 \\ 0.828 \\ -0.190 \end{array} \right)$$

That is: $$ x = -0.154\, a_1 + 0.828\, a_2 - 0.190\, a_3$$

(Example from Mohammed J. Zaki, Ch. 7)
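A quick numpy check of this projection, using the numbers above (since the basis is orthonormal, $Az$ reconstructs $x$):

import numpy as np

x = np.array([-0.343, -0.754, 0.241])
A = np.array([[-0.390, -0.639, -0.663],
              [ 0.089, -0.742,  0.664],
              [-0.916,  0.200,  0.346]])   # columns are a_1, a_2, a_3

z = A.T @ x     # coordinates of x in the new basis: approx [-0.154, 0.828, -0.190]
x_back = A @ z  # approx x again, since A is (nearly) orthonormal
print(z, x_back)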

  • So how do we find those $a_i$?
    • Orthonormality
    • Maximize variance
  • Vectors $a_i$ are called principal components

Construction of PCA

Construction of PCA

  • Principal components $a_{1},a_{2},...a_{D}\in\mathbb{R}^{D}$ are orthonormal: $$\langle a_{i},a_{j}\rangle=\begin{cases} 1, & i=j\\ 0 & i\ne j \end{cases}$$
  • Maximize variance

    • Datapoints in $X$ are assumed to be centered (and scaled)
    • $z_i = a^\top x_i$ - projection of $x_i$ on $a$
    • Variance along principal component $a$ over the dataset (note that $\mu_z = 0$ because the data is centered): $$ \begin{align} \sigma^2_a & = \frac{1}{n}\sum\limits_{i=1}^n(z_i - \mu_z)^2 = \frac{1}{n}\sum\limits_{i=1}^n(a^\top x_i)^2 \\ & = \frac{1}{n}\sum\limits_{i=1}^n a^\top( x_i x_i^\top) a = a^\top \left(\frac{1}{n}\sum\limits_{i=1}^n x_i x_i^\top \right) a \\ & = a^\top C a \propto a^\top X^\top X a \rightarrow \max_a \end{align} $$
    • $C = \frac{1}{n}X^\top X$ - covariance matrix (correlation matrix for a scaled dataset); the constant $\frac{1}{n}$ does not affect the maximization, so below we work with $X^\top X$

    • Interpreting Covariance Matrix

Construction of PCA

  1. $a_{1}$ is selected to maximize $a_1^\top X^\top X a_1$ subject to $\langle a_{1},a_{1}\rangle=1$
  2. $a_{2}$ is selected to maximize $a_2^\top X^\top X a_2$ subject to $\langle a_{2},a_{2}\rangle=1$, $\langle a_{2},a_{1}\rangle=0$
  3. $a_{3}$ is selected to maximize $a_3^\top X^\top X a_3$ subject to $\langle a_{3},a_{3}\rangle=1$, $\langle a_{3},a_{1}\rangle=\langle a_{3},a_{2}\rangle=0$

etc.

Derivation of 1st component

$$ \begin{equation} \begin{cases} a_1^\top X^\top X a_1 \rightarrow \max_{a_1} \\ a_1^\top a_1 = 1 \end{cases} \end{equation} $$
  • Lagrangian of the optimization problem $$ \mathcal{L}(a_1, \nu) = a_1^\top X^\top X a_1 - \nu (a_1^\top a_1 - 1) \rightarrow \max_{a_1}$$
  • Derivative w.r.t. $a_1$ $$ \frac{\partial\mathcal{L}}{\partial a_1} = 2X^\top X a_1 - 2\nu a_1 = 0 $$
  • so $a_1$ is selected from a set of eigenvectors of $X^\top X$. But which one?

Useful properties of $X^\top X$

  • $X^\top X$ - symmetric and positive semi-definite
    • $(X^\top X)^\top = X^\top X$
    • $\forall w \in \mathbb{R}^D:\ w^\top (X^\top X) w \geq 0$
  • Properties
    • All eigenvalues $\lambda_i \in \mathbb{R}$, $\lambda_i \geq 0$ (we also assume $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_D$)
    • Eigenvectors for $\lambda_i \neq \lambda_j$ are orthogonal: $v_i^\top v_j = 0$
    • For each distinct eigenvalue $\lambda_i$ the corresponding eigenvector $v_i$ is unique up to scaling

Back to component 1

Initially, our objective was $$a_1^\top X^\top X a_1 \rightarrow \max_{a_1}$$

From the Lagrangian we derived that $$X^\top X a_1 = \nu a_1$$

Substituting one into the other: $$ a_1^\top X^\top X a_1 = \nu a_1^\top a_1 = \nu \rightarrow \max$$

That means:

  • $\nu$ should be the greatest eigenvalue of the matrix $X^\top X$, which is $\lambda_1$
  • $a_1$ is the eigenvector corresponding to $\lambda_1$

Derivation of 2nd component

$$ \begin{equation} \begin{cases} a_2^\top X^\top X a_2 \rightarrow \max_{a_2} \\ a_2^\top a_2 = 1 \\ a_2^\top a_1 = 0 \end{cases} \end{equation} $$
  • Lagrangian of the optimization problem $$ \mathcal{L}(a_2, \nu,\alpha) = a_2^\top X^\top X a_2 - \nu (a_2^\top a_2 - 1) - \alpha (a_1^\top a_2) \rightarrow \max_{a_2}$$
  • Derivative w.r.t. $a_2$ $$ \frac{\partial\mathcal{L}}{\partial a_2} = 2X^\top X a_2 - 2\nu a_2 - \alpha a_1 = 0 $$

Derivation of 2nd component

  • By multiplying by $a_1^\top$ : $$ a_1^\top\frac{\partial\mathcal{L}}{\partial a_2} = 2a_1^\top X^\top X a_2 - 2\nu a_1^\top a_2 - \alpha a_1^\top a_1 = 0 $$

  • Since $a_1^\top X^\top X a_2 = \lambda_1 a_1^\top a_2 = 0$ and $a_1^\top a_2 = 0$ (the constraint), it follows that $\alpha a_1^\top a_1 = \alpha = 0$, which means that $$ \frac{\partial\mathcal{L}}{\partial a_2} = 2X^\top X a_2 - 2\nu a_2 = 0 $$ And $a_2$ is again selected from the set of eigenvectors of $X^\top X$. Again, which one?

Derivations of the other components proceed in the same manner

PCA algorithm

  1. Center (and scale) dataset
  2. Calculate covariance matrix $C=X^\top X$
  3. Find first $k$ eigenvalues and eigenvectors $$A = \left[ \begin{array}{cccc} \mid & \mid & & \mid\\ a_{1} & a_{2} & \ldots & a_{k} \\ \mid & \mid & & \mid \end{array} \right] $$
  4. Perform projection (see the sketch below): $$ Z = XA $$
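A minimal numpy sketch of the algorithm above; in practice one would use sklearn.decomposition.PCA or an SVD, which is numerically more stable:

import numpy as np

def pca(X, k):
    X = X - X.mean(axis=0)                 # 1. center the data (scaling is optional)
    C = X.T @ X                            # 2. (unnormalized) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)   # C is symmetric -> use eigh
    order = np.argsort(eigvals)[::-1]      # 3. sort eigenvalues in decreasing order
    A = eigvecs[:, order[:k]]              #    top-k eigenvectors as columns
    return X @ A, eigvals[order]           # 4. projection Z = XA, plus all eigenvalues

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
Z, lam = pca(X, k=2)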

Explained variance, or why do we need $\lambda$?

  • As we remember, $$a_i^\top X^\top X a_i = \lambda_i,$$ which means that $\lambda_i$ is the variance of the data "explained" by $a_i$
  • We can calculate the explained variance ratio for $a_i$ (computed in the snippet below): $$ \frac{\lambda_{i}}{\sum_{d=1}^{D}\lambda_{d}} $$
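With scikit-learn this quantity is available directly as explained_variance_ratio_; the toy data below is illustrative:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # correlated toy data

pca = PCA().fit(X)
print(pca.explained_variance_ratio_)              # lambda_i / sum_d lambda_d
print(np.cumsum(pca.explained_variance_ratio_))   # cumulative share, often used to choose k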

(Supplementary) alternative view on PCA

Best hyperplane fit

For point $x$ and subspace $L$ denote:

  • $p$: the projection of $x$ on $L$
  • $h$: orthogonal complement
  • $x=p+h$, $\langle p,h\rangle=0$,

Proposition

$\|x\|^2 = \|p\|^2 + \|h\|^2$

For training set $x_{1},x_{2},...x_{N}$ and subspace $L$ we can find:

  • projections: $p_{1},p_{2},...p_{N}$
  • orthogonal complements: $h_{1},h_{2},...h_{N}$.

Best subspace fit

Definition

Best-fit $k$-dimensional subspace for a set of points $x_1 \dots x_N$ is the subspace spanned by $k$ vectors $v_1$, $v_2$, $\dots$, $v_k$ solving

$$ \sum_{n=1}^N \| h_n \| ^2 \rightarrow \min\limits_{v_1, v_2,\dots,v_k}$$

or

$$ \sum_{n=1}^N \| p_n \| ^2 \rightarrow \max\limits_{v_1, v_2,\dots,v_k}$$

Definition of PCA

Principal components $a_1, a_2, \dots, a_k$ are the vectors forming an orthonormal basis of the $k$-dimensional subspace of best fit

Properties

  • Not invariant to translation:
    • center data before PCA: $$ x\leftarrow x-\mu\text{ where }\mu=\frac{1}{N}\sum_{n=1}^{N}x_{n} $$
  • Not invariant to scaling:
    • scale features to have unit variance before PCA

Example: line of best fit

  • In PCA the sum of squared perpendicular distances to the line is minimized:
  • What is the difference with least squares minimization in regression?

Example: plane of best fit

Construction of PCA

Construction of PCA

  • Datapoints in $X$ are assumed to be centered and scaled (!)
  • Principal components $a_{1},a_{2},...a_{D}\in\mathbb{R}^{D}$ are found such that $\langle a_{i},a_{j}\rangle=\begin{cases} 1, & i=j\\ 0 & i\ne j \end{cases}$
  • $Xa_{i}$ is a vector of projections of all objects onto the $i$-th principal component.
  • For any object $x$ its projections onto principal components are equal to: $$ p=A^{T}x=[\langle a_{1},x\rangle,...\langle a_{D},x\rangle]^{T} $$ where $A=[a_{1};a_{2};...a_{D}]\in\mathbb{R}^{D\times D}$.

Construction of PCA

  1. $a_{1}$ is selected to maximize $\left\lVert Xa_{1}\right\rVert $ subject to $\langle a_{1},a_{1}\rangle=1$
  2. $a_{2}$ is selected to maximize $\left\lVert Xa_{2}\right\rVert $ subject to $\langle a_{2},a_{2}\rangle=1$, $\langle a_{2},a_{1}\rangle=0$
  3. $a_{3}$ is selected to maximize $\left\lVert Xa_{3}\right\rVert $ subject to $\langle a_{3},a_{3}\rangle=1$, $\langle a_{3},a_{1}\rangle=\langle a_{3},a_{2}\rangle=0$

etc.

PCA summary

  • Dimensionality reduction - common preprocessing step for efficiency and numerical stability.
  • Solution vectors $a_1,\dots,a_k$ are called top $k$ principal components.
  • Principal component analysis - expression of $x$ in terms of first $k$ principal components.
  • It is unsupervised linear dimensionality reduction.
  • The solution is given by the top $k$ eigenvectors $a_{1},...a_{k}$ of the covariance matrix.
  • PCA tutorial

t-SNE

t-distributed stochastic neighbor embedding

Multidimensional scaling - intuition

Find a feature space of lower dimension such that distances in the initial space are preserved in the new one. A bit more formally:

  • Given $X = [x_1,\dots, x_n]\in \mathbb{R}^{N \times D}$ and/or $\delta_{ij}$ - proximity measure between $(x_i,x_j)$ in initial feature space.
  • Find $Y = [y_1,\dots,y_n] \in \mathbb{R}^{N \times d}$ such that $\delta_{ij} \approx d(y_i, y_j) = \|y_i-y_j\|$

Multidimensional scaling

Clearly, in most cases the distances cannot be preserved exactly; a small example with scikit-learn's MDS is sketched below:
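A small illustration with scikit-learn's metric MDS; the dataset and sample size are arbitrary choices:

from sklearn.datasets import load_digits
from sklearn.manifold import MDS

X, y = load_digits(return_X_y=True)

# embed a subsample of the 64-dimensional digits into 2-D, approximately preserving pairwise distances
mds = MDS(n_components=2, random_state=0)
Y = mds.fit_transform(X[:300])
print(Y.shape)   # (300, 2)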

Multidimensional scaling approaches

  • classical MDS (essentially PCA)
  • metric MDS
  • non-metric MDS
  • But what if we want to preserve not the distances themselves, but the structure of the initial dataset, like the neighbourhood of each point?
  • Here comes t-SNE !!!

t-SNE

  • t-SNE is not multidimensional scaling, but the goal is somewhat similar
  • There are going to be 3 types of similarities:
    • Similarity between points in initial feature space
    • Similarity between points in derived feature space
    • Similarity between feature spaces!

t-SNE

  • Similarity between points in the initial feature space $\mathbb{R}^D$ $$ p(i, j) = \frac{p(i | j) + p(j | i)}{2N}, \quad p(j | i) = \frac{\exp(-\|\mathbf{x}_j-\mathbf{x}_i\|^2/{2 \sigma_i^2})}{\sum_{k \neq i}\exp(-\|\mathbf{x}_k-\mathbf{x}_i\|^2/{2 \sigma_i^2})} $$ $\sigma_i$ is set implicitly by the user via the perplexity parameter (see the usage sketch after this list)
  • Similarity between points in the derived feature space $\mathbb{R}^d$, $d \ll D$ $$ q(i, j) = \frac{g(\|\mathbf{y}_i - \mathbf{y}_j\|)}{\sum_{k \neq l} g(\|\mathbf{y}_k - \mathbf{y}_l\|)} $$ where $g(z) = \frac{1}{1 + z^2}$ is the (unnormalized) density of Student's t-distribution with one degree of freedom
  • Similarity between feature spaces (Kullback–Leibler divergence) $$ J_{t-SNE}(y) = KL(P \| Q) = \sum_i \sum_j p(i, j) \log \frac{p(i, j)}{q(i, j)} \rightarrow \min\limits_{\mathbf{y}} $$
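A typical usage sketch with scikit-learn's TSNE; the parameter values are illustrative, and perplexity is what sets the $\sigma_i$ implicitly:

from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# 2-D embedding minimizing KL(P || Q); perplexity controls the effective neighbourhood size
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
Y = tsne.fit_transform(X)
print(Y.shape)   # (1797, 2)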

KL-Divergence

$$ KL(P \| Q) = \sum_z P(z) \log \frac{P(z)}{Q(z)} $$
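A tiny numpy illustration of the formula above for discrete distributions; the eps smoothing is only there to avoid log(0):

import numpy as np

def kl_divergence(P, Q, eps=1e-12):
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    return np.sum(P * np.log((P + eps) / (Q + eps)))

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))   # 0.0 for identical distributions
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))   # > 0, and not symmetric in P and Q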

Optimization

  • Optimize $J_{t-SNE}(y)$ with gradient descent
$$\frac{\partial J_{t-SNE}}{\partial y_i}=4 \sum_j(p(i,j)−q(i,j))(y_i−y_j)g(|y_i−y_j|)$$