# Data Analysis

## Feature Selection and Dimention Reduction. PCA.1

1. Some materials are taken from machine learning course of Victor Kitov

## Feature Selection vs. Dimension Reduction¶

• Feature selection is a process of selecting a subset of original features with minimum loss of information related to final task (classification, regression, etc.)
• Dimension reduction is a result of some transformation of initial features to (possibly) lower dimension feature space

## Why feature selection?¶

• increase predictive accuracy of classifier
• improve optimization stability by removing multicollinearity
• increase computational efficiency
• reduce cost of future data collection
• make classifier more interpretable

Not always necessary step

• some methods have implicit feature selection

## Feature Selection Approaches¶

• Unsupervised methods
• don't use target feature
• Filter methdos
• use target feature
• consider each feature independently
• Wrapper methods
• uses model quality
• Embedded methdos
• embedded inside model

### "Unsupervised" methods¶

• Determine feature importance regardless of target feature

### Filter methods¶

• Features are considered independently of each other
• Individual predictive power is measures

Basically

• Order features with respect to feature importances $I(f)$: $$I(f_{1})> I(f_{2})> \dots\ge I(f_{D})$$
• Select top $m$ $$\hat{F}=\{f_{1},f_{2},...f_{m}\}$$
• Simple to implement
• Usually quite fast
• When features are correlated, it will take many redundant features

#### Examples¶

• Correlation
• Which kind of relationship does correlation measure?
• Mutual Information
• Entropy of variable $Y$: $H(Y) = - \sum_y p(y)\ln p(y)$
• Conditional entropy of $Y$ after observing $X$: $H(y|x) = - \sum_x p(x) \sum_y p(y|x)\ln p(y|x)$
• Mutial information: $$MI(Y, X) = \sum_{x,y} p(x,y) \ln\left[\frac{p(x,y)}{p(x)p(y)}\right]$$
• Mutual information measures how much $X$ and $Y$ share information between each other
• $MI(Y,X) = H(Y) - H(Y|X)$
In [3]:
df_titanic = pd.read_csv('data/titanic.csv')

Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [4]:
P = pd.crosstab(df_titanic.Survived, df_titanic.Sex, normalize=True).values
print(P)

[[0.09090909 0.52525253]
[0.26150393 0.12233446]]

In [5]:
px = P.sum(axis=1)[:, np.newaxis]
py = P.sum(axis=0)[:, np.newaxis]
print(px)
print(py)

[[0.61616162]
[0.38383838]]
[[0.35241302]
[0.64758698]]

In [6]:
px.dot(py.T)

Out[6]:
array([[0.21714338, 0.39901824],
[0.13526964, 0.24856874]])
In [8]:
mutual_info(df_titanic.Pclass, df_titanic.Survived)

Out[8]:
0.05810725269032348

### Wrapper methods¶

• Selecting suboptimal subset of features
• Could be slow
• Examples:
• Recursive Feature Elimination
• Consider full set of features
• Fit a model, measure feature importance (based on model)
• Remove least important feature(s)
• Repeat
• Boruta Algorithm

### Embedded methods¶

• Feature selection process in included in the model
• Examples:
• Decision Trees
• Linear model with L1 regularization

## Something different: permutation importance¶

• Calculate quality measure on validation set
• For each feature:
• replace given values with random values from the same distribution
• ... or perform random shuffling
• observe the difference in quality measure

# Principal Component Analysis (PCA)¶

## Why dimension reduction?¶

same as in feature selection +

• data visualization
In [9]:
iris = sns.load_dataset('iris')
sns.pairplot(iris, hue='species')

Out[9]:
<seaborn.axisgrid.PairGrid at 0x7fada055da20>
In [11]:
# ะะตัะตะนะดะตะผ ะพั 4 ะฟัะธะทะฝะฐะบะพะฒ ะบ 2 ะบะพะผะฟะพะฝะตะฝัะฐะผ
pca = PCA(n_components=2, random_state=123)
pca.fit(X) # ะะธะบะฐะบะพะณะพ "y"
PC = pca.transform(X)
plt.scatter(PC[:,0], PC[:,1],c=y)

Out[11]:
<matplotlib.collections.PathCollection at 0x7fad98296208>

# PCA¶

• Intuition 1: Find directions along which our datapoints have the greatest variance

# PCA¶

• Intuition 2: Find a subspace $L$ (of lesser dimention) s.t. the sum of squares of differences between points and their projections is minimized

# PCA¶

• Intuition 2: Find a subspace $L$ (of lesser dimention) s.t. the sum of squares of differences between points and their projections is minimized

## In other words, we can consider PCA as¶

• A transformation of your inital feature axes ...
• New axes are just a linear combination of initial axes
• New axes are orthogonal (orthonormal) to each other
• Variance of data across those axes is maximized
• ... which keeps only the most "informative" axes

## How do we project data on new axes?¶

• Consider an object $x$ with 3 features: $x=(-0.343, -0.754, 0.241)$
• We can say that it is spanned on 3 basis vectors of our feature space: $$e_1 = \left( \begin{array}{c} 1 \\ 0 \\ 0 \end{array} \right) \quad e_2 = \left( \begin{array}{c} 0 \\ 1 \\ 0 \end{array} \right) \quad e_3 = \left( \begin{array}{c} 0 \\ 0 \\ 1 \end{array} \right) \quad$$
$$x = -0.343 e_1 + -0.754 e_2 + 0.241 e_3$$
• Consider new basis vectors:
$$a_1 = \left( \begin{array}{c} -0.390 \\ 0.089 \\ -0.916 \end{array} \right) \quad a_2 = \left( \begin{array}{c} -0.639 \\ -0.742 \\ 0.200 \end{array} \right) \quad a_3 = \left( \begin{array}{c} -0.663 \\ 0.664 \\ 0.346 \end{array} \right) \quad$$
• How do we project $x$ on it?

## Projecting¶

$$a_1 = \left( \begin{array}{c} -0.390 \\ 0.089 \\ -0.916 \end{array} \right) \quad a_2 = \left( \begin{array}{c} -0.639 \\ -0.742 \\ 0.200 \end{array} \right) \quad a_3 = \left( \begin{array}{c} -0.663 \\ 0.664 \\ 0.346 \end{array} \right) \quad$$$$z = A^\top x = \left( \begin{array}{ccc} -0.390 & 0.089 & -0.916\\ -0.639 & -0.742 & 0.200 \\ -0.663 & 0.664 & 0.346 \end{array} \right) \left( \begin{array}{c} -0.343 \\ -0.754 \\ 0.241 \end{array} \right) = \left( \begin{array}{c} -1.154 \\ 0.828 \\ 0.190 \end{array} \right)$$

That is: $$z = -1.154 a_1 + 0.828 a_2 + 0.190 a_3$$

(Example from Mohammed J. Zaki, Ch7 )

• So how do we find those $a_i$?
• Orthonormality
• Maximize variance
• Vectors $a_i$ are called principal components

# Construction of PCA¶

## Construction of PCA¶

• Principal components $a_{1},a_{2},...a_{D}\in\mathbb{R}^{D}$ are orthonormal: $$\langle a_{i},a_{j}\rangle=\begin{cases} 1, & i=j\\ 0 & i\ne j \end{cases}$$
• Maximize variance
• Datapoints in $X$ assumed to be centered (and scaled)
• $z_i = a^\top x_i$ - projection of $x_i$ on $a$
• Variance across principal component $a$ for dataset \begin{align} \sigma^2_a & = \frac{1}{n}\sum\limits_{i=1}^n(z_i - \mu_z)^2 \\ & = \frac{1}{n}\sum\limits_{i=1}^n(a^\top x_i)^2 \\ & = \frac{1}{n}\sum\limits_{i=1}^n a^\top( x_i x_i^\top) a \\ & = a^\top \left(\frac{1}{n}\sum\limits_{i=1}^n x_i x_i^\top \right) a \\ & = a^\top X^\top X a \\ & = a^\top C a \rightarrow \max_a \\ \end{align}
• $C = X^\top X$ - convariance (correlation in case of scaled dataset) matrix

## Covariance matrix reminder¶

• Given $N$ observations of variables $x$ and $y$, covariace is calculated as $${s}_{xy}=\mathrm {cov} (x,y)={1 \over N}\sum _{n=1}^{N}\left(x_{n}-{\overline {x}}\right)\left(y_{n}-{\overline {y}}\right)$$
• scaled measure of linear dependency between $x$ and $y$

Relationship with variance and correlation:

• $\mathrm {cov} (x,x) = \mathrm {var} (x)$

• $\mathrm {corr} (x,y)=\frac{\mathrm {cov} (x,y)}{\mathrm {var} (x)\mathrm {var} (y)}$

## Covariance matrix reminder¶

• If $X$ is feature matrix (features are on columns) we can calculate covariance matrix: $$\operatorname {cov} (X ) ={\begin{bmatrix}\mathrm {var}(x_1)&\mathrm {cov} (x_1,x_2)&\cdots &\mathrm {cov} (x_1,x_d)\\\\\mathrm {cov} (x_2,x_1)&\mathrm {var}(x_2)&\cdots &\mathrm {cov} (x_2,x_d)\\\\\vdots &\vdots &\ddots &\vdots \\\\\mathrm {cov} (x_d,x_1)&\mathrm {cov} (x_d,x_2)&\cdots &\mathrm {var}(x_d)\end{bmatrix}}$$

Examples of data distributions and corresponding covariances

## Useful properties of $X^\top X$¶

• $X^\top X$ - symmetric and positive semi-definite
• $(X^\top X)^\top = X^\top X$
• $\forall w \in \mathbb{R}^D:\ w^\top (X^\top X) w \geq 0$
• Properties
• All eigenvalues $\lambda_i \in \mathbb{R}, \lambda_i \geq 0$ (we also assume $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d$)
• Eigenvectors for $\lambda_i \neq \lambda_j$ are orthogonal: $v_i^\top v_j = 0$
• For each unique eigenvalue $\lambda_i$ there is a unique $v_i$

## Construction of PCA¶

1. $a_{1}$ is selected to maximize $a_1^\top X^\top X a_1$ subject to $\langle a_{1},a_{1}\rangle=1$
2. $a_{2}$ is selected to maximize $a_2^\top X^\top X a_2$ subject to $\langle a_{2},a_{2}\rangle=1$, $\langle a_{2},a_{1}\rangle=0$
3. $a_{3}$ is selected to maximize $a_3^\top X^\top X a_3$ subject to $\langle a_{3},a_{3}\rangle=1$, $\langle a_{3},a_{1}\rangle=\langle a_{3},a_{2}\rangle=0$

etc.

## Derivation of 1st component¶

$$$$\begin{cases} a_1^\top X^\top X a_1 \rightarrow \max_{a_1} \\ a_1^\top a_1 = 1 \end{cases}$$$$
• Lagrangian of optimization problem $$\mathcal{L}(a_1, \nu) = a_1^\top X^\top X a_1 - \nu (a_1^\top a_1 - 1) \rightarrow max_{a_1, \nu}$$
• Derivative w.r.t. $a_1$ $$\frac{\partial\mathcal{L}}{\partial a_1} = 2X^\top X a_1 - 2\nu a_1 = 0$$
• so $a_1$ is selected from a set of eigenvectors of $X^\top X$. But which one?

## Back to component 1¶

Initially, our objective was $$a_1^\top X^\top X a_1 \rightarrow \max_{a_1}$$

From lagrangian we derived that $$X^\top X a_1 = \nu a_1$$

Putting one in to another: $$a_1^\top X^\top X a_1 = \nu a_1^\top a_1 = \nu \rightarrow \max$$

That means:

• $\nu$ should be the greatest eigenvalue of matrix $X^\top X$, which is $\lambda_1$
• $a_1$ is eigenvector, correspondent to $\lambda_1$

## Derivation of 2nd component¶

$$$$\begin{cases} a_2^\top X^\top X a_2 \rightarrow \max_{a_2} \\ a_2^\top a_2 = 1 \\ a_2^\top a_1 = 0 \end{cases}$$$$
• Lagrangian of optimization problem $$\mathcal{L}(a_2, \nu,\alpha) = a_2^\top X^\top X a_2 - \nu (a_2^\top a_2 - 1) - \alpha (a_1^\top a_2) \rightarrow max_{a_2, \nu,\alpha}$$
• Derivative w.r.t. $a_2$ $$\frac{\partial\mathcal{L}}{\partial a_2} = 2X^\top X a_2 - 2\nu a_2 - \alpha a_1 = 0$$

## Derivation of 2nd component¶

• By multiplying by $a_1^\top$ : $$a_1^\top\frac{\partial\mathcal{L}}{\partial a_2} = 2a_1^\top X^\top X a_2 - 2\nu a_1^\top a_2 - \alpha a_1^\top a_1 = 0$$

• It follows that $\alpha a_1^\top a_1 = \alpha = 0$, which means that $$\frac{\partial\mathcal{L}}{\partial a_2} = 2X^\top X a_2 - \nu a_2 = 0$$ And $a_2$ is selected from a set of eigenvectors of $X^\top X$. Again, which one?

Derivations of other components proceeds in the same manner

## PCA algorithm¶

1. Center (and scale) dataset
2. Calculate covariance matrix $ะก=X^\top X$
3. Find first $k$ eigenvalues and eigenvectors $$A = \left[ \begin{array}{cccc} \mid & \mid & & \mid\\ a_{1} & a_{2} & \ldots & a_{k} \\ \mid & \mid & & \mid \end{array} \right]$$
4. Perform projection: $$Z = XA$$

## Explained variance or why do we neeed $\lambda$?¶

• As we remember $$a_i^\top X^\top X a_i = \lambda_i,$$ which means that $\lambda_i$ shows variance of data "explained" by $a_i$
• We can calculate explained variance ratio for $a_i$: $$\frac{\lambda_{i}}{\sum_{d=1}^{D}\lambda_{d}}$$

# (Supplementary) alternative view on PCA¶

## Best hyperplane fit¶

For point $x$ and subspace $L$ denote:

• $p$: the projection of $x$ on $L$
• $h$: orthogonal complement
• $x=p+h$, $\langle p,h\rangle=0$,

#### Proposition¶

$\|x\|^2 = \|p\|^2 + \|h\|^2$

For training set $x_{1},x_{2},...x_{N}$ and subspace $L$ we can find:

• projections: $p_{1},p_{2},...p_{N}$
• orthogonal complements: $h_{1},h_{2},...h_{N}$.

## Example: line of best fit¶

• In PCA the sum of squared perpendicular distances to line is minimized:

## Best subspace fit¶

#### Definition¶

Best-fit $k$-dimentional subspace for a set of points $x_1 \dots x_N$ is a subspace, spanned by $k$ vectors $a_1$, $a_2$, $\dots$, $a_k$, solving

$$\sum_{n=1}^N \| h_n \| ^2 \rightarrow \min\limits_{a_1, a_2,\dots,a_k}$$

or

$$\sum_{n=1}^N \| p_n \| ^2 \rightarrow \max\limits_{a_1, a_2,\dots,a_k}$$

## Definition of PCA¶

Principal components $a_1, a_2, \dots, a_k$ are vectors, forming orthonormal basis in the k-dimentional subspace of best fit

#### Properties¶

• Not invariant to translation:
• center data before PCA: $$x\leftarrow x-\mu\text{ where }\mu=\frac{1}{N}\sum_{n=1}^{N}x_{n}$$
• Not invariant to scaling:
• scale features to have unit variance before PCA

# Construction of PCA¶

## Construction of PCA¶

• Datapoints in $X$ assumed to be centralized and scaled (!)
• Principal components $a_{1},a_{2},...a_{D}\in\mathbb{R}^{D}$ are found such that $\langle a_{i},a_{j}\rangle=\begin{cases} 1, & i=j\\ 0 & i\ne j \end{cases}$
• $Xa_{i}$ is a vector of projections of all objects onto the $i$-th principal component.
• For any object $x$ its projections onto principal components are equal to: $$p=A^{T}x=[\langle a_{1},x\rangle,...\langle a_{D},x\rangle]^{T}$$ where $A=[a_{1};a_{2};...a_{D}]\in\mathbb{R}^{DxD}$.

## Construction of PCA¶

1. $a_{1}$ is selected to maximize $\left\lVert Xa_{1}\right\rVert$ subject to $\langle a_{1},a_{1}\rangle=1$
2. $a_{2}$ is selected to maximize $\left\lVert Xa_{2}\right\rVert$ subject to $\langle a_{2},a_{2}\rangle=1$, $\langle a_{2},a_{1}\rangle=0$
3. $a_{3}$ is selected to maximize $\left\lVert Xa_{3}\right\rVert$ subject to $\langle a_{3},a_{3}\rangle=1$, $\langle a_{3},a_{1}\rangle=\langle a_{3},a_{2}\rangle=0$

etc.

## PCA summary¶

• Dimensionality reduction - common preprocessing step for efficiency and numerical stability.
• Solution vectors $a_1,\dots,a_k$ are called top $k$ principal components.
• Principal component analysis - expression of $x$ in terms of first $k$ principal components.
• It is unsupervised linear dimensionality reduction.
• Solution is achieved on top $k$ eigenvectors $a_{1},...a_{k}$ of covariance matrix.