BoW text representation of documents
Each document $d_i$ from corpus $D$ is represented with vector $$ d_i = (w_1^i, w_2^i, \dots, w_N^i) $$
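A minimal sketch of this representation, using scikit-learn's `CountVectorizer` on a made-up toy corpus (names and data are illustrative, not from the original slides):

```python
# Bag-of-words: one row per document d_i, one column per vocabulary word.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)        # shape: (|D|, N)

print(vectorizer.get_feature_names_out())   # the N vocabulary words
print(X.toarray())                          # each row is (w_1^i, ..., w_N^i)
```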
“You shall know a word by the company it keeps.” (J.R. Firth (1957))
Still, these vectors are huge, so let's run dimensionality reduction!
Use truncated SVD!
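As a rough sketch of this step, assuming `X` is the document-term matrix from the previous snippet (the choice of 2 components is arbitrary):

```python
# Truncated SVD compresses the N-dimensional BoW vectors to k dimensions
# (this is LSA when applied to a term-document matrix).
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)          # shape: (|D|, 2)

print(X_reduced)
print(svd.explained_variance_ratio_)      # variance kept by each component
```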
and our objective transforms to $$L(u,v) = -\sum_{(o,c)} \left(u_o^\top v_c - \log\left(\sum\limits_{w\in V} \exp(u_w^\top v_c)\right)\right) \rightarrow \min\limits_{u,v}$$
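A minimal NumPy sketch of this objective for a single (outside, center) pair; the matrices `U` and `V` are random stand-ins for the learned embedding tables:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4
U = rng.normal(size=(vocab_size, dim))   # u_w: outside/context vectors
V = rng.normal(size=(vocab_size, dim))   # v_w: center vectors

def pair_loss(o, c):
    """-(u_o^T v_c - log sum_w exp(u_w^T v_c)) for one (outside, center) pair."""
    scores = U @ V[c]                          # u_w^T v_c for every w in the vocabulary
    log_norm = np.log(np.sum(np.exp(scores)))  # log of the softmax normalizer
    return -(U[o] @ V[c] - log_norm)

print(pair_loss(o=3, c=7))
```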
"Combination of context vectors predicts center word" $$L = \frac{1}{T}\sum\limits_{t=1}^T\ \text{log} \space p(w_t \: | \: w_{t-m} , \cdots , w_{t-1}, w_{t+1}, \cdots , w_{t+m})$$
Use dimensionality reduction algorithms such as t-SNE and PCA to project points to a 2-D or 3-D space and visualize (a subset of) the embedding space.
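A minimal sketch of such a visualization, assuming `model` is the toy Word2Vec model trained above (any embedding matrix would work the same way):

```python
# Project a subset of the embeddings to 2-D with t-SNE and plot them.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

words = list(model.wv.index_to_key)[:100]   # take a subset of the vocabulary
vectors = model.wv[words]                   # shape: (len(words), vector_size)

coords = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y))
plt.show()
```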
Intrinsic evaluations are those where you can use embeddings to perform relatively simple, word-related tasks.
Tasks:
- Synonym detection: for a given word, n-1 words are the options, and only one of the options is a synonym (a small sketch of this task follows below).

In Extrinsic Evaluations, we have a more complex task we are interested in (e.g. text classification, machine translation, image captioning), where we can use embeddings as a way to represent words (or tokens). Assuming we have a model for the task and a dataset to train it on,
we can then train the model using different embeddings and evaluate its overall performance. The idea is that better embeddings will make it easier for the model to learn the overall task.
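A minimal sketch of the intrinsic synonym-detection task mentioned above: pick, among the options, the word whose embedding is closest to the query. It assumes `wv` is a gensim `KeyedVectors` object (e.g. `model.wv` or pretrained vectors); the example query is made up:

```python
def pick_synonym(wv, query, options):
    """Return the option with the highest cosine similarity to the query word."""
    return max(options, key=lambda w: wv.similarity(query, w))

# Hypothetical usage; accuracy over many such questions gives the intrinsic score.
# print(pick_synonym(wv, "happy", ["glad", "table", "run", "blue"]))
```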
People have trained word2vec-like models on huge datasets and released the resulting pretrained embeddings.
On the other hand, you have some text corpus for your specific task. Should you learn your own embeddings or use pretrained ones?
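A sketch of the "use pretrained" route via gensim's downloader API; `"glove-wiki-gigaword-100"` is one of the models gensim can fetch (downloading it requires internet access and some disk space):

```python
import gensim.downloader as api

pretrained = api.load("glove-wiki-gigaword-100")   # returns a KeyedVectors object
print(pretrained.most_similar("king", topn=3))
```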
The ideas of word2vec can be transferred to any domain with an appropriate data structure.
Action2Vec
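A minimal sketch of such a transfer, with made-up data: treat each user's sequence of actions as a "sentence" and each action as a "word", then train the same model:

```python
from gensim.models import Word2Vec

action_sequences = [
    ["open_app", "search", "view_item", "add_to_cart", "checkout"],
    ["open_app", "view_item", "add_to_cart", "checkout"],
    ["open_app", "search", "view_item", "close_app"],
]

action2vec = Word2Vec(action_sequences, vector_size=32, window=2, min_count=1, epochs=100)
print(action2vec.wv.most_similar("add_to_cart", topn=2))
```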
Find a feature space with fewer dimensions such that distances in the initial space are preserved in the new one. A bit more formally: find a mapping $f:\mathbb{R}^N \rightarrow \mathbb{R}^k$ with $k \ll N$ such that $\|x_i - x_j\| \approx \|f(x_i) - f(x_j)\|$ for all pairs of points $x_i, x_j$.
It is clear that, most of the time, distances won't be preserved exactly.
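As a rough illustration on random data: metric MDS explicitly searches for a low-dimensional layout that preserves pairwise distances, and its residual `stress_` value measures how much of the distance structure could not be preserved:

```python
import numpy as np
from sklearn.manifold import MDS

X = np.random.default_rng(0).normal(size=(50, 20))   # 50 points in a 20-D space

mds = MDS(n_components=2, random_state=0)
X_2d = mds.fit_transform(X)

print(X_2d.shape)      # (50, 2)
print(mds.stress_)     # > 0: distances are only approximately preserved
```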