Data Analysis

Andrey Shestakov (avshestakov@hse.ru)


Introduction to Natural Language Processing¹

1. Some materials are taken from the machine learning course of Victor Kitov

  • A huge amount of information is represented in the form of natural language:

    • Blog-posts
    • Comments
    • Reviews
    • Search queries
    • Code
    • ...
  • Need a way to process this information:

    • convert "characters" into vectors
    • recognize key entities in the text
    • understand the relationships between them
    • ...
  • Computational linguistics is an interdisciplinary field concerned with the statistical or rule-based modeling of natural language from a computational perspective.

Applications

  • Machine Translation
  • Information retrieval
  • Text summarization
  • Text Classification
    • Spam filtering
    • Sentiment analysis
    • Multilabel classification
  • Text Categorization
  • Named-entity recognition
  • QA-systems and assistants
  • Text Generation
  • Spell-checking and next-word suggestion

Machine Translation

Named Entity Recognition

QA-systems and assistants

Text suggestion

NLP is HARD!

  • Various linguistic ambiguities
    • morphological: "мой" ("my" / "wash!"), "три" ("three" / "rub!"), "стекло" ("glass" / "flowed down")
    • phonetic: "Надо ждать" ("we have to wait") - "Надо ж дать" ("we really ought to give")
    • lexical: "кран" ("crane" / "tap"), "ключ" ("key" / "spring")
    • syntactic: "мужу изменять нельзя" ("one must not cheat on the husband" / "a husband must not cheat")
  • Language is constantly changing!

Tokenization

Levels of abstraction

  • Characters
  • Words
  • Sentences
  • Paragraphs
  • Documents

Tokenization

  • Tokenization - the task of decomposing a document into tokens (elementary units)
  • Usually, word-level granularity is used
  • "Кролики - это не только ценный мех, но и 3-4 килограмма диетического, легкоусвояемого мяса" ("Rabbits are not only valuable fur, but also 3-4 kilograms of dietary, easily digestible meat")
  • "Через 1.5 часа поеду в Гусь-Хрустальный." ("In 1.5 hours I will go to Gus-Khrustalny.")

Tokenization of words

  • Special regular expressions can be used to work this out
In [7]:
line = u"Через 1.5 часа поеду в Гусь-Хрустальный."
for w in line.split(' '):
    print(w)
Через
1.5
часа
поеду
в
Гусь-Хрустальный.
In [8]:
from nltk.tokenize import RegexpTokenizer

line = u"Через 1.5 часа поеду в Гусь-Хрустальный."
# keep word characters together, '$'-prefixed numbers together,
# and fall back to any non-space sequence
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
for w in tokenizer.tokenize(line):
    print(w)
Через
1
.5
часа
поеду
в
Гусь
-Хрустальный.
In [9]:
from nltk.tokenize import wordpunct_tokenize

line = u"Через 1.5 часа поеду в Гусь-Хрустальный."
for w in wordpunct_tokenize(line):
    print(w)
Через
1
.
5
часа
поеду
в
Гусь
-
Хрустальный
.
In [6]:
line = u'テキストで表示ロシア語'  # Japanese: no whitespace between words at all
line = u'Nach der Wahrscheinlichkeitstheorie steckt alles in Schwierigkeiten!'  # German: long compound words

# =(  whitespace and simple regex tokenizers break down here

Tokenization of sentences

  • Also a non-trivial task
  • One may use regexes
  • Or some trained model :)
In [10]:
from nltk.tokenize import sent_tokenize

text = 'Good muffins cost $3.88 in New York. Please buy me two of them. Thanks.'
sent_tokenize(text, language='english')
Out[10]:
['Good muffins cost $3.88 in New York.',
 'Please buy me two of them.',
 'Thanks.']
In [11]:
text = u"Через 1.5 часа поеду в Гусь-Хрустальный. Куплю там квасу!"
for sent in sent_tokenize(text, language='english'):
    print(sent)
Через 1.5 часа поеду в Гусь-Хрустальный.
Куплю там квасу!

n-grams

  • Sometimes it is useful to consider sequences of tokens
  • To capture context
  • To be robust to spelling errors
In [12]:
from nltk.util import ngrams

def word_grams(words, min_=1, max_=4):
    # collect all n-grams for n in [min_, max_)
    s = []
    for n in range(min_, max_):
        for ngram in ngrams(words, n):
            s.append(u" ".join(str(i) for i in ngram))
    return s

word_grams(u"I prefer cheese sause".split(u" "))
Out[12]:
['I',
 'prefer',
 'cheese',
 'sause',
 'I prefer',
 'prefer cheese',
 'cheese sause',
 'I prefer cheese',
 'prefer cheese sause']
In [13]:
# character trigrams of the (intentionally misspelled) phrase
n = 3
line = "cheese sause"

for i in range(len(line) - n + 1):
    print(line[i:i+n])
che
hee
ees
ese
se 
e s
 sa
sau
aus
use

Text normalization

Two types of text normalization

  • Lemmatization - the process of transforming a word into its base (normal) form, based on dictionaries and morphological analysis
    • mystem, pymorphy
  • Stemming - a rule-based process of reducing a word to its stem
    • sses → ss (caresses → caress)
    • ies → i (ponies → poni)
In [14]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

plurals = ['caresses', 'flies', 'dies', 'mules', 'denied',
           'died', 'agreed', 'owned', 'humbled', 'sized',
           'meeting', 'stating', 'siezing', 'itemization',
           'sensational', 'traditional', 'reference', 'colonizer',
           'plotted']

singles = [stemmer.stem(plural) for plural in plurals]
print(' '.join(singles))
caress fli die mule deni die agre own humbl size meet state siez item sensat tradit refer colon plot
In [16]:
from pymystem3 import Mystem

m = Mystem()
# "a rally against the transfer of St. Isaac's Cathedral to the Russian Orthodox Church took place in Petersburg"
text = 'в петербурге прошел митинг против передачи исаакиевского собора рпц'

print(''.join(m.lemmatize(text)))
в петербург проходить митинг против передача исаакиевский собор рпц

In [17]:
import pymorphy2 
morph = pymorphy2.MorphAnalyzer()

res = morph.parse(u'пожарница')  # colloquial: "firewoman"
for item in res:
    print('====')
    print(u'norm_form: {}'.format(item.normal_form))
    print(u'tag: {}'.format(item.tag))
    print(u'score: {}'.format(item.score))
====
norm_form: пожарница
tag: NOUN,anim,femn sing,nomn
score: 0.9999999999999999

Other preprocessing techniques

  • high-frequency word (stop-word) removal
    • Conjunctions, prepositions, numerals, pronouns
  • low-frequency word removal
In [13]:
from nltk.corpus import stopwords
stop_words = stopwords.words('russian')
print(stop_words[:20])
['и', 'в', 'во', 'не', 'что', 'он', 'на', 'я', 'с', 'со', 'как', 'а', 'то', 'все', 'она', 'так', 'его', 'но', 'да', 'ты']

Other preprocessing techniques

  • Switching to lowercase
    • may lose meaning (US → us) or emotion (GREAT → great)

Document vectorization

Bag Of Words (BOW)

Bag of words

  • The most common way to represent documents
  • Each document $d_i$ from corpus $D$ is represented with vector $$ d_i = (w_1^i, w_2^i, \dots, w_N^i) $$
  • $w_j^i$ - weight of word $j$ in document $i$
  • word order is ignored

Bag of words

$w_j^i$ can be calculated as

  • Binary signal (does document contain the word)
  • Frequency of word $j$ in document $i$
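
A minimal sketch of both weighting schemes with scikit-learn's CountVectorizer (scikit-learn, the toy corpus, and the variable names are assumptions, not part of the original slides):
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['I prefer cheese sauce',
          'cheese is not a sauce',
          'good muffins cost a lot in New York']

# binary=True gives the binary signal; binary=False gives raw word frequencies
vectorizer = CountVectorizer(binary=False)
bow = vectorizer.fit_transform(corpus)   # sparse |D| x N matrix

print(vectorizer.get_feature_names_out())
print(bow.toarray())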

Tf - Idf

  • tf-idf (term frequency - inverse document frequency)
    • tf(term, doc) - term frequency, one of:
      • Binary signal
      • Frequency of word $j$ in document $i$
      • Smoothed frequency $\log(f_j^i + 1)$
    • idf(term, corpus) - inverse document frequency of word $j$ in corpus $D$: $\log\left(\frac{|D|}{n_j}\right)$
      • $|D|$ - size of the corpus
      • $n_j$ - number of documents which contain word $j$
      • idf measures how specific a word is
$$tfidf = tf(term, doc) \cdot idf(term, corpus)$$

<img src='img/tfidf.jpg' width=1200>
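
The same idea with tf-idf weights, sketched with scikit-learn's TfidfVectorizer (an assumption here; note that sklearn uses a slightly smoothed version of the idf formula above):
In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['I prefer cheese sauce',
          'cheese is not a sauce',
          'good muffins cost a lot in New York']

# sklearn's idf is smoothed: log((1 + |D|) / (1 + n_j)) + 1
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)          # rows are l2-normalized by default

print(tfidf.get_feature_names_out())
print(X.toarray().round(2))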

Bag of words

  • Additionally, l1 or l2 normalization is applied to document vectors
    • Normalization is usually good for models
    • It maintains the "meaning" of a document (longer documents usually have greater word frequencies)
  • Ok, what can we do with BoW now?
    • Use it as features in models (linear models, Naive Bayes)
    • Use it to find similar documents (with cosine or Jaccard distance, as sketched below)
    • Compress it into "topics"!
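
For instance, a sketch of finding similar documents with cosine similarity over tf-idf vectors (the toy corpus is an assumption):
In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ['I prefer cheese sauce',
          'cheese is not a sauce',
          'good muffins cost a lot in New York']

X = TfidfVectorizer().fit_transform(corpus)

# rows of X are l2-normalized, so cosine similarity is just X @ X.T
S = cosine_similarity(X)
print(S.round(2))   # documents 0 and 1 overlap in 'cheese' and 'sauce'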

Latent Semantic Indexing - LSI

aka

Latent Semantic Analysis - LSA

LSA as a feature reduction technique

The idea (and implementation) of LSI is very similar to PCA:

  • We want to compress the sparse BOW representation into a dense one
  • We want this representation to preserve as much of the initial information as possible and to be meaningful

Singular Value Decomposition (SVD)

Every matrix $A$ of size $n \times m$ and rank $r$ can be decomposed as: $$ A = U \Sigma V^\top ,$$ where

  • $U$ - unitary matrix, consists of eigenvectors of $AA^\top$
  • $V$ - unitary matrix, consists of eigenvectors of $A^\top A$
  • $\Sigma$ - diagonal matrix with singular values $\sigma_i = \sqrt{\lambda_i}$

PCA via SVD

Matrices $U$ and $V$ can be used for PCA $$ AV = U\Sigma $$
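
A quick numerical check of both facts with numpy (the random matrix is an assumption, just for illustration):
In [ ]:
import numpy as np

A = np.random.rand(5, 3)                      # toy n x m matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# reconstruction A = U Sigma V^T (error should be ~0)
print(np.abs(A - U @ np.diag(s) @ Vt).max())

# projection A V = U Sigma, i.e. dense PCA-style coordinates
print(np.allclose(A @ Vt.T, U @ np.diag(s)))  # True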

LSA

  • Let's apply SVD to the BOW matrix
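
A sketch with scikit-learn's TruncatedSVD, a standard way to run LSA on a sparse BOW/tf-idf matrix (the corpus and the number of topics are assumptions):
In [ ]:
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['I prefer cheese sauce',
          'cheese is not a sauce',
          'good muffins cost a lot in New York',
          'muffins with cheese cost even more']

X = TfidfVectorizer().fit_transform(corpus)   # sparse |D| x N

lsa = TruncatedSVD(n_components=2)            # keep 2 latent "topics"
docs_lsa = lsa.fit_transform(X)               # dense |D| x 2 document vectors

print(docs_lsa.round(2))
print(lsa.components_.round(2))               # "topics" over the vocabulary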

Pros

  • Easy
  • Can find similar words with $V$ and similar documents with $U$
  • LSA in search task is called Latent Semantic Indexing
    • Query $q$ is also some sequence of words
    • How do we find the most similar documents in the corpus with LSI?

Pros

  • Simple
  • Can find similar words with $V$ and similar documents with $U$
  • LSA in search task is called Latent Semantic Indexing
    • Query $q$ is also some sequence of words
    • Pass $q$ through the same preprocessing as the main text
    • Make a BOW representation of it
    • Make LSI representation with $$ \hat{q} = qV\Sigma^{-1} $$
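
A numpy sketch of this projection (the toy BOW matrix A, query q, and rank k are assumptions):
In [ ]:
import numpy as np

def lsi_project(q, A, k):
    # truncated SVD of the document-term matrix A
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    V_k = Vt[:k, :].T            # N x k
    return q @ V_k / s[:k]       # q V Sigma^{-1}

A = np.random.rand(6, 10)        # toy |D| x N BOW matrix
q = np.random.rand(10)           # toy query over the same vocabulary
print(lsi_project(q, A, k=2))    # coordinates comparable to rows of U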

Cons

  • Problems with interpretation

Example

LSI space

Query LSI representation

  • By itself, LSA is just a dimension-reduction algorithm applied to the data
  • If we consider topic modelling as a phenomenon, it is much deeper than simple compression

Probabilistic topic models
