1. Some materials are taken from the machine learning course of Victor Kitov.
For a point $x$ and a subspace $L$, denote by $p$ the orthogonal projection of $x$ onto $L$ and by $h=x-p$ the residual. Then
$\|x\|^2 = \|p\|^2 + \|h\|^2$
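As a quick numerical illustration (a minimal numpy sketch; the subspace and its orthonormal basis below are made up for demonstration), the decomposition and the identity above can be checked directly:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)

# a made-up 2-dimensional subspace L, orthonormal basis in the columns of V
V, _ = np.linalg.qr(rng.normal(size=(5, 2)))

p = V @ (V.T @ x)   # projection of x onto L
h = x - p           # orthogonal residual

# Pythagorean identity: ||x||^2 = ||p||^2 + ||h||^2
print(np.allclose(x @ x, p @ p + h @ h))   # True
```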
For a training set $x_{1},x_{2},\dots,x_{N}$ and a subspace $L$ we obtain such a decomposition $x_{n}=p_{n}+h_{n}$ for every point.
The best-fit $k$-dimensional subspace for a set of points $x_1 \dots x_N$ is the subspace spanned by $k$ vectors $v_1$, $v_2$, $\dots$, $v_k$ solving
$$ \sum_{n=1}^N \| h_n \| ^2 \rightarrow \min\limits_{v_1, v_2,\dots,v_k}$$
or
$$ \sum_{n=1}^N \| p_n \| ^2 \rightarrow \max\limits_{v_1, v_2,\dots,v_k}$$
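A minimal numpy sketch of this criterion, using the standard fact that the top-$k$ right singular vectors of the data matrix span the best-fit $k$-dimensional subspace (through the origin; in practice the data is usually centered first):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))              # rows are the points x_1..x_N

k = 2
# the top-k right singular vectors of X span the best-fit k-dimensional subspace
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:k].T                               # orthonormal basis v_1..v_k as columns

P = X @ V @ V.T                            # projections p_n as rows
H = X - P                                  # residuals   h_n as rows
print(np.sum(H**2))                        # sum of ||h_n||^2, minimized over all
                                           # k-dimensional subspaces
print(np.sum(P**2))                        # sum of ||p_n||^2, maximized
```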
Principal components $a_1, a_2, \dots, a_k$ are the vectors forming an orthonormal basis of the $k$-dimensional best-fit subspace.
When PCA is applied to face images, the eigenvectors are called eigenfaces. Projections onto the first few eigenfaces describe most of the variability between faces.
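A hedged sketch of the eigenfaces idea with scikit-learn (the LFW dataset and the number of components are assumptions for illustration; the source does not specify a dataset):

```python
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA

# Labeled Faces in the Wild, downloaded on first use
faces = fetch_lfw_people(min_faces_per_person=20, resize=0.5)

n_components = 50
pca = PCA(n_components=n_components).fit(faces.data)

# each principal axis, reshaped back to image size, is an "eigenface"
eigenfaces = pca.components_.reshape((n_components,) + faces.images.shape[1:])
print(eigenfaces.shape)
print(pca.explained_variance_ratio_[:5])   # variability captured by the first eigenfaces
```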
Consider a vector $x$. Since all $D$ principal components form a complete orthonormal basis, $x$ can be written as $$ x=\langle x,a_{1}\rangle a_{1}+\langle x,a_{2}\rangle a_{2}+...+\langle x,a_{D}\rangle a_{D} $$
Let $p^{K}$ be the projection of $x$ onto the subspace spanned by the first $K$ principal components: $$ p^{K}=\langle x,a_{1}\rangle a_{1}+\langle x,a_{2}\rangle a_{2}+...+\langle x,a_{K}\rangle a_{K} $$
The error of this approximation is $$ h^{K}=x-p^{K}=\langle x,a_{K+1}\rangle a_{K+1}+...+\langle x,a_{D}\rangle a_{D} $$
Using the fact that $a_{1},\dots,a_{D}$ form an orthonormal set of vectors, we get
$$ \begin{gathered}\left\lVert x\right\rVert ^{2}=\langle x,x\rangle=\langle x,a_{1}\rangle^{2}+...+\langle x,a_{D}\rangle^{2}\\ \left\lVert p^{K}\right\rVert ^{2}=\langle p^{K},p^{K}\rangle=\langle x,a_{1}\rangle^{2}+...+\langle x,a_{K}\rangle^{2}\\ \left\lVert h^{K}\right\rVert ^{2}=\langle h^{K},h^{K}\rangle=\langle x,a_{K+1}\rangle^{2}+...+\langle x,a_{D}\rangle^{2} \end{gathered} $$
We can measure how well first $K$ components describe our dataset $x_{1},x_{2},...x_{N}$ using relative loss $$ \begin{equation} L(K)=\frac{\sum_{n=1}^{N}\left\lVert h_{n}^{K}\right\rVert ^{2}}{\sum_{n=1}^{N}\left\lVert x_{n}\right\rVert ^{2}}\label{eq:relative approximation error} \end{equation} $$ or relative score $$ \begin{equation} S(K)=\frac{\sum_{n=1}^{N}\left\lVert p_{n}^{K}\right\rVert ^{2}}{\sum_{n=1}^{N}\left\lVert x_{n}\right\rVert ^{2}}\label{eq:relative approximation quality} \end{equation} $$ Evidently $L(K)+S(K)=1$.
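A minimal numpy sketch checking $L(K)+S(K)=1$ on synthetic data (taking the orthonormal directions from the SVD of the data matrix, which is an assumption consistent with the derivation below):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))                     # rows are x_1..x_N, D = 10

# a full orthonormal set a_1..a_D (rows of At), here taken from the SVD of X
_, _, At = np.linalg.svd(X, full_matrices=False)

K = 3
proj = X @ At.T                                    # <x_n, a_d> for all n, d
S = np.sum(proj[:, :K]**2) / np.sum(X**2)          # relative score S(K)
L = np.sum(proj[:, K:]**2) / np.sum(X**2)          # relative loss  L(K)
print(S, L, S + L)                                 # S + L == 1
```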
The contribution of $a_{k}$ to explaining $x$ is $\langle x,a_{k}\rangle^{2}$.
The contribution of $a_{k}$ to explaining $x_{1},x_{2},\dots,x_{N}$ is: $$ \sum_{n=1}^{N}\langle x_{n},a_{k}\rangle^{2} $$
Explained variance ratio: $$ E(a_{k})=\frac{\sum_{n=1}^{N}\langle x_{n},a_{k}\rangle^{2}}{\sum_{d=1}^{D}\sum_{n=1}^{N}\langle x_{n},a_{d}\rangle^{2}}=\frac{\sum_{n=1}^{N}\langle x_{n},a_{k}\rangle^{2}}{\sum_{n=1}^{N}\left\lVert x_{n}\right\rVert ^{2}} $$
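For comparison, a sketch checking this formula against scikit-learn's `explained_variance_ratio_` (centering the data first is an assumption, since sklearn's `PCA` works with centered data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
X = X - X.mean(axis=0)                 # center the data so both definitions agree

pca = PCA().fit(X)
A = pca.components_                    # rows are a_1..a_D

E = np.sum((X @ A.T)**2, axis=0) / np.sum(X**2)         # formula above
print(np.allclose(E, pca.explained_variance_ratio_))    # True
```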
The first principal component $a_1$ is found by solving the constrained optimization problem
$$ \begin{equation} \begin{cases} \|X a_1 \|^2 \rightarrow \max_{a_1} \\ a_1^\top a_1 = 1 \end{cases} \end{equation} $$
Initially, our objective was $$\|X a_1 \|^2 = a_1^\top X^\top X a_1 \rightarrow \max_{a_1}$$
Writing the Lagrangian $\mathcal{L}(a_1,\nu)=a_1^\top X^\top X a_1-\nu\,(a_1^\top a_1-1)$ and setting $\frac{\partial\mathcal{L}}{\partial a_1}=0$, we derived that $$X^\top X a_1 = \nu a_1$$
Substituting one into the other: $$ a_1^\top X^\top X a_1 = \nu a_1^\top a_1 = \nu \rightarrow \max$$
That means $\nu$ must be the largest eigenvalue of $X^\top X$ and $a_1$ the corresponding eigenvector. The second principal component $a_2$ must, in addition, be orthogonal to $a_1$:
$$ \begin{equation} \begin{cases} \|X a_2 \|^2 \rightarrow \max_{a_2} \\ a_2^\top a_2 = 1 \\ a_2^\top a_1 = 0 \end{cases} \end{equation} $$
The Lagrangian is now $\mathcal{L}(a_2,\nu,\alpha)=a_2^\top X^\top X a_2-\nu\,(a_2^\top a_2-1)-\alpha\, a_2^\top a_1$. Multiplying $\frac{\partial\mathcal{L}}{\partial a_2}$ by $a_1^\top$: $$ a_1^\top\frac{\partial\mathcal{L}}{\partial a_2} = 2a_1^\top X^\top X a_2 - 2\nu a_1^\top a_2 - \alpha a_1^\top a_1 = 0 $$
Since $X^\top X a_1$ is proportional to $a_1$ and $a_1^\top a_2=0$, the first two terms vanish, so $\alpha a_1^\top a_1 = \alpha = 0$. This means that $$ \frac{\partial\mathcal{L}}{\partial a_2} = 2X^\top X a_2 - 2\nu a_2 = 0 $$ and $a_2$ is selected from the set of eigenvectors of $X^\top X$. Again, which one? As before, the objective equals $\nu$, so $a_2$ is the eigenvector corresponding to the second-largest eigenvalue (the largest one is already taken by $a_1$).
The derivations of the other components proceed in the same manner.
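A small numerical check of this conclusion (a sketch on synthetic data): the principal directions obtained from the SVD of $X$ coincide, up to sign, with the eigenvectors of $X^\top X$ sorted by decreasing eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))

# eigenvectors of X^T X, sorted by decreasing eigenvalue
eigval, eigvec = np.linalg.eigh(X.T @ X)
order = np.argsort(eigval)[::-1]
eigvec = eigvec[:, order]

# right singular vectors of X give the same directions (up to sign)
_, _, Vt = np.linalg.svd(X, full_matrices=False)

for k in range(X.shape[1]):
    print(np.allclose(np.abs(eigvec[:, k]), np.abs(Vt[k])))   # True for every k
```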
Thus the principal components span the subspace solving the original best-fit problem: $$ \left\lVert h_{1}\right\rVert ^{2}+...+\left\lVert h_{N}\right\rVert ^{2}\to\min_{v_{1},...v_{k}} $$
Find a feature space of lower dimension such that distances in the initial space are preserved in the new one. A bit more formally:
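One way to write this down (a sketch assuming the classical metric stress objective of multidimensional scaling; the source does not fix a particular variant): look for low-dimensional points $y_1,\dots,y_N$ minimizing
$$ \sum_{i<j}\bigl(\lVert x_i - x_j\rVert - \lVert y_i - y_j\rVert\bigr)^2 \;\rightarrow\; \min_{y_1,\dots,y_N} $$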
It is clear that in most cases the distances cannot be preserved exactly.
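A minimal illustration with scikit-learn (assuming the method in question is metric multidimensional scaling, `sklearn.manifold.MDS`): pairwise distances in the embedding correlate strongly with the original ones but do not match them exactly.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                  # original 10-dimensional points

# embed into 2 dimensions
Y = MDS(n_components=2, random_state=0).fit_transform(X)

d_X, d_Y = pdist(X), pdist(Y)                   # pairwise distances before / after
print(np.corrcoef(d_X, d_Y)[0, 1])              # high, but the distances differ
print(np.allclose(d_X, d_Y))                    # False
```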
t-SNE measures the mismatch between the pairwise-similarity distributions $P$ (in the original space) and $Q$ (in the embedding) with the Kullback–Leibler divergence $$ KL(P \| Q) = \sum_z P(z) \log \frac{P(z)}{Q(z)} $$
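A tiny numpy example of the definition (the two distributions are made up), which also shows that the divergence is not symmetric:

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.25, 0.25, 0.5])

kl_pq = np.sum(P * np.log(P / Q))
kl_qp = np.sum(Q * np.log(Q / P))
print(kl_pq, kl_qp)        # KL(P||Q) != KL(Q||P): the divergence is asymmetric
```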
The gradient of the t-SNE objective with respect to an embedded point $y_i$ is $$\frac{\partial J_{t\text{-}SNE}}{\partial y_i}=4 \sum_j \bigl(p(i,j)-q(i,j)\bigr)\,(y_i-y_j)\,g\bigl(\lVert y_i-y_j\rVert\bigr)$$
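A sketch of this gradient in numpy, assuming the usual t-SNE choice $g(d)=1/(1+d^2)$ (the Student-t kernel factor); `P` and `Q` are the pairwise similarity matrices and `Y` the current embedding:

```python
import numpy as np

def tsne_gradient(P, Q, Y):
    """Gradient of the t-SNE objective with respect to the embedding Y.

    P, Q : (N, N) pairwise similarities in the original / embedded space.
    Y    : (N, d) current embedding. Assumes g(d) = 1 / (1 + d^2)."""
    grad = np.zeros_like(Y)
    for i in range(len(Y)):
        diff = Y[i] - Y                               # rows are y_i - y_j
        g = 1.0 / (1.0 + np.sum(diff**2, axis=1))     # Student-t factor
        grad[i] = 4.0 * ((P[i] - Q[i]) * g) @ diff    # sum over j
    return grad
```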