1. Some materials are taken from the machine learning course of Victor Kitov
This is not always a necessary step.
Basically:
import pandas as pd
import numpy as np

df_titanic = pd.read_csv('data/titanic.csv')
df_titanic.head()
  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
P = pd.crosstab(df_titanic.Survived, df_titanic.Sex, normalize=True).values
print(P)
[[0.09090909 0.52525253]
 [0.26150393 0.12233446]]
px = P.sum(axis=1)[:, np.newaxis]
py = P.sum(axis=0)[:, np.newaxis]
print(px)
print(py)
[[0.61616162]
 [0.38383838]]
[[0.35241302]
 [0.64758698]]
px.dot(py.T)
array([[0.21714338, 0.39901824],
       [0.13526964, 0.24856874]])
mutual_info(df_titanic.Sex, df_titanic.Survived)
0.15087048925218172
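The helper `mutual_info` used above is not defined in this document. A minimal sketch consistent with the printed value (mutual information in nats, computed from the empirical joint distribution, just as in the crosstab cells above) could look like this — the function name and signature are assumptions:

```python
import numpy as np
import pandas as pd

def mutual_info(x, y):
    """Empirical mutual information (in nats) between two discrete variables."""
    # Joint distribution P(x, y) from the empirical contingency table
    P = pd.crosstab(pd.Series(x), pd.Series(y), normalize=True).values
    px = P.sum(axis=1)[:, np.newaxis]  # marginal P(x), column vector
    py = P.sum(axis=0)[np.newaxis, :]  # marginal P(y), row vector
    mask = P > 0                       # convention: 0 * log(0) = 0
    return float((P[mask] * np.log(P[mask] / (px @ py)[mask])).sum())
```

For independent variables this returns (approximately) zero; for perfectly dependent binary variables it returns the full entropy $\ln 2$.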
That is: $$ z = -1.154 a_1 + 0.828 a_2 + 0.190 a_3$$
(Example from Mohammed J. Zaki, Ch. 7)
Maximize variance
$C = X^\top X$ is the covariance matrix (the correlation matrix if the dataset is standardized)
etc.
Initially, our objective was $$a_1^\top X^\top X a_1 \rightarrow \max_{a_1}, \quad \text{s.t. } a_1^\top a_1 = 1$$
From the Lagrangian we derived that $$X^\top X a_1 = \nu a_1$$
Substituting one into the other: $$ a_1^\top X^\top X a_1 = \nu a_1^\top a_1 = \nu \rightarrow \max$$
That means: to maximize the variance, $a_1$ must be the eigenvector of $X^\top X$ corresponding to the largest eigenvalue $\nu$.
By multiplying by $a_1^\top$ : $$ a_1^\top\frac{\partial\mathcal{L}}{\partial a_2} = 2a_1^\top X^\top X a_2 - 2\nu a_1^\top a_2 - \alpha a_1^\top a_1 = 0 $$
It follows that $\alpha a_1^\top a_1 = \alpha = 0$, which means that $$ \frac{\partial\mathcal{L}}{\partial a_2} = 2X^\top X a_2 - 2\nu a_2 = 0 $$ And $a_2$ is again selected from the eigenvectors of $X^\top X$. Which one this time?
The derivation of the remaining components proceeds in the same manner.
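The whole derivation can be checked numerically: the variance-maximizing directions are exactly the eigenvectors of $X^\top X$, ordered by eigenvalue. A sketch with toy data (the data matrix and its shape are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 points in 3-D with correlated features, centered
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.5, 0.2, 0.3]])
X -= X.mean(axis=0)

C = X.T @ X                                # (unnormalized) covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)       # eigh returns ascending eigenvalues
A = eigvecs[:, np.argsort(eigvals)[::-1]]  # columns a_1, a_2, a_3

# a_1^T C a_1 equals the largest eigenvalue and beats any other unit vector
top = A[:, 0] @ C @ A[:, 0]
for _ in range(1000):
    a = rng.normal(size=3)
    a /= np.linalg.norm(a)
    assert a @ C @ a <= top + 1e-9
```

The columns of `A` form an orthonormal basis, since `eigh` operates on a symmetric matrix.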
For a point $x$ and subspace $L$, let $p$ denote the orthogonal projection of $x$ onto $L$ and $h = x - p$ the residual. By the Pythagorean theorem:
$\|x\|^2 = \|p\|^2 + \|h\|^2$
For a training set $x_1, x_2, \dots, x_N$ and subspace $L$ we can find:
The best-fit $k$-dimensional subspace for a set of points $x_1, \dots, x_N$ is the subspace spanned by $k$ vectors $v_1, v_2, \dots, v_k$ solving
$$ \sum_{n=1}^N \| h_n \| ^2 \rightarrow \min\limits_{v_1, v_2,\dots,v_k}$$
or
$$ \sum_{n=1}^N \| p_n \| ^2 \rightarrow \max\limits_{v_1, v_2,\dots,v_k}$$
Principal components $a_1, a_2, \dots, a_k$ are vectors forming an orthonormal basis of the $k$-dimensional subspace of best fit.
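The equivalence of the two objectives follows from the point-wise Pythagorean identity above, since $\sum_n \|x_n\|^2$ is fixed. A quick numerical check, taking the top-$k$ right singular vectors as the best-fit basis (an assumption consistent with the PCA derivation; the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))  # rows are the points x_n
X -= X.mean(axis=0)

# Orthonormal basis of the best-fit 2-D subspace: top-2 right singular vectors
V = np.linalg.svd(X, full_matrices=False)[2][:2].T   # shape (4, 2)

P = X @ V @ V.T                # projections p_n onto the subspace
H = X - P                      # residuals h_n

# sum ||x_n||^2 = sum ||p_n||^2 + sum ||h_n||^2, so minimizing the residual
# term is the same as maximizing the projected term
assert np.allclose((X**2).sum(), (P**2).sum() + (H**2).sum())
```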
etc.
Find a feature space with fewer dimensions such that distances in the initial space are preserved in the new one. A bit more formally:
It is clear that, most of the time, distances won't be preserved exactly:
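As an illustration (synthetic data, not from the original text): an orthogonal projection onto a lower-dimensional subspace can only shrink pairwise distances, and in general it distorts them, which a stress-like ratio makes visible:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 5))

def pairwise_dists(Z):
    """Matrix of Euclidean distances between all rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

# PCA-style projection onto the top-2 directions
Xc = X - X.mean(axis=0)
V = np.linalg.svd(Xc, full_matrices=False)[2][:2].T
Y = Xc @ V

D_old, D_new = pairwise_dists(X), pairwise_dists(Y)
assert np.all(D_new <= D_old + 1e-9)  # distances can only shrink
# stress > 0: distances are distorted, not conserved exactly
stress = np.sqrt(((D_old - D_new) ** 2).sum() / (D_old ** 2).sum())
```

Centering does not affect `D_old`, since pairwise distances are translation-invariant.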