Data Analysis

Andrey Shestakov (avshestakov@hse.ru)


Introduction to Natural Language Processing¹

1. Some materials are taken from the machine learning course of Victor Kitov

  • A huge amount of information is represented in the form of natural language:

    • Blog-posts
    • Comments
    • Reviews
    • Search queries
    • Code
    • ...
  • Need a way to process this information:

    • convert "characters" into vectors
    • recognize key entities in the text
    • understand the relationships between them
    • ...
  • Computational linguistics is an interdisciplinary field concerned with the statistical or rule-based modeling of natural language from a computational perspective.

Applications

  • Machine Translation
  • Information retrieval
  • Text summarization
  • Text Classification
    • Spam filtering
    • Sentiment analysis
    • Multilabel classification
  • Text Categorization
  • Named-entity recognition
  • QA-systems and assistants
  • Text Generation
  • Spell-checking and next-word suggestion

Machine Translation

Named Entity Recognition

QA-systems and assistants

Text suggestion

NLP is HARD!

  • Various linguistic ambiguities
    • morphological: "мой" ("my" / "wash!"), "три" ("three" / "rub!"), "стекло" ("glass" / "flowed down")
    • phonetic: "Надо ждать" ("we have to wait") - "Надо ж дать" ("we really ought to give")
    • lexical: "кран" ("crane" / "tap"), "ключ" ("key" / "spring")
    • syntactic: "мужу изменять нельзя" ("one must not cheat on the husband" / "a husband must not cheat")
  • Language is constantly changing!

Tokenization

Levels of abstraction

  • Characters
  • Words
  • Sentences
  • Paragraphs
  • Documents

Tokenization

  • Tokenization - the task of decomposing a document into tokens (elementary units)
  • Usually, word-level granularity is used
  • "Кролики - это не только ценный мех, но и 3-4 килограмма диетического, легкоусвояемого мяса" ("Rabbits are not only valuable fur, but also 3-4 kilograms of dietary, easily digestible meat")
  • "Через 1.5 часа поеду в Гусь-Хрустальный." ("In 1.5 hours I will go to Gus-Khrustalny.")

Tokenization of words

  • Special regular expressions can be used to work this out
In [7]:
line = u"Через 1.5 часа поеду в Гусь-Хрустальный."
for w in line.split(' '):
    print(w)
Через
1.5
часа
поеду
в
Гусь-Хрустальный.
In [8]:
from nltk.tokenize import RegexpTokenizer

line = u"Через 1.5 часа поеду в Гусь-Хрустальный."
# keep word characters together, '$'-prefixed numbers together,
# and fall back to any non-space sequence
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
for w in tokenizer.tokenize(line):
    print(w)
Через
1
.5
часа
поеду
в
Гусь
-Хрустальный.
In [9]:
from nltk.tokenize import wordpunct_tokenize

line = u"Через 1.5 часа поеду в Гусь-Хрустальный."
for w in wordpunct_tokenize(line):
    print(w)
Через
1
.
5
часа
поеду
в
Гусь
-
Хрустальный
.
In [6]:
line = u'テキストで表示ロシア語'  # Japanese: no whitespace between words at all
line = u'Nach der Wahrscheinlichkeitstheorie steckt alles in Schwierigkeiten!'  # German: long compound words

# =(  whitespace and simple regex tokenizers break down here

Tokenization of sentences

  • Also a non-trivial task
  • One may use regexes
  • Or some trained model :)
In [10]:
from nltk.tokenize import sent_tokenize

text = 'Good muffins cost $3.88 in New York. Please buy me two of them. Thanks.'
sent_tokenize(text, language='english')
Out[10]:
['Good muffins cost $3.88 in New York.',
 'Please buy me two of them.',
 'Thanks.']
In [11]:
text = u"Через 1.5 часа поеду в Гусь-Хрустальный. Куплю там квасу!"
for sent in sent_tokenize(text, language='english'):
    print(sent)
Через 1.5 часа поеду в Гусь-Хрустальный.
Куплю там квасу!

n-grams

  • Sometimes it is useful to consider sequences of tokens
  • To capture context
  • To be robust to spelling errors
In [12]:
from nltk.util import ngrams

def word_grams(words, min_=1, max_=4):
    # collect all n-grams for n in [min_, max_)
    s = []
    for n in range(min_, max_):
        for ngram in ngrams(words, n):
            s.append(u" ".join(str(i) for i in ngram))
    return s

word_grams(u"I prefer cheese sause".split(u" "))
Out[12]:
['I',
 'prefer',
 'cheese',
 'sause',
 'I prefer',
 'prefer cheese',
 'cheese sause',
 'I prefer cheese',
 'prefer cheese sause']
In [13]:
# character trigrams of the (intentionally misspelled) phrase
n = 3
line = "cheese sause"

for i in range(len(line) - n + 1):
    print(line[i:i+n])
che
hee
ees
ese
se 
e s
 sa
sau
aus
use

Text normalization

Two types of text normalization

  • Lemmatization - the process of transforming a word into its base (normal) form, based on dictionaries and morphological analysis
    • mystem, pymorphy
  • Stemming - a rule-based process of reducing a word to its stem
    • sses → ss (caresses → caress)
    • ies → i (ponies → poni)
In [14]:
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()

plurals = ['caresses', 'flies', 'dies', 'mules', 'denied',
           'died', 'agreed', 'owned', 'humbled', 'sized',
           'meeting', 'stating', 'siezing', 'itemization',
           'sensational', 'traditional', 'reference', 'colonizer',
           'plotted']

singles = [stemmer.stem(plural) for plural in plurals]
print(' '.join(singles))
caress fli die mule deni die agre own humbl size meet state siez item sensat tradit refer colon plot
In [16]:
from pymystem3 import Mystem

m = Mystem()
# "a rally against the transfer of St. Isaac's Cathedral to the Russian Orthodox Church took place in Petersburg"
text = 'в петербурге прошел митинг против передачи исаакиевского собора рпц'

print(''.join(m.lemmatize(text)))
в петербург проходить митинг против передача исаакиевский собор рпц

In [17]:
import pymorphy2 
morph = pymorphy2.MorphAnalyzer()

res = morph.parse(u'пожарница')  # colloquial: "firewoman"
for item in res:
    print('====')
    print(u'norm_form: {}'.format(item.normal_form))
    print(u'tag: {}'.format(item.tag))
    print(u'score: {}'.format(item.score))
====
norm_form: пожарница
tag: NOUN,anim,femn sing,nomn
score: 0.9999999999999999

Other preprocessing techniques

  • high-frequency word (stop-word) removal
    • Conjunctions, prepositions, numerals, pronouns
  • low-frequency word removal
In [13]:
from nltk.corpus import stopwords
stop_words = stopwords.words('russian')
print(stop_words[:20])
['и', 'в', 'во', 'не', 'что', 'он', 'на', 'я', 'с', 'со', 'как', 'а', 'то', 'все', 'она', 'так', 'его', 'но', 'да', 'ты']

Other preprocessing techniques

  • Switching to lowercase
    • may lose meaning (US → us) or emotion (GREAT → great)

Document vectorization

Bag Of Words (BOW)

Bag of words

  • The most common way to represent documents
  • Each document $d_i$ from corpus $D$ is represented with vector $$ d_i = (w_1^i, w_2^i, \dots, w_N^i) $$
  • $w_j^i$ - weight of word $j$ in document $i$
  • word order is ignored

Bag of words

$w_j^i$ can be calculated as

  • Binary signal (does document contain the word)
  • Frequency of word $j$ in document $i$
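
A minimal sketch of both weighting schemes with scikit-learn's CountVectorizer (scikit-learn, the toy corpus, and the variable names are assumptions, not part of the original slides):
In [ ]:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['I prefer cheese sauce',
          'cheese is not a sauce',
          'good muffins cost a lot in New York']

# binary=True gives the binary signal; binary=False gives raw word frequencies
vectorizer = CountVectorizer(binary=False)
bow = vectorizer.fit_transform(corpus)   # sparse |D| x N matrix

print(vectorizer.get_feature_names_out())
print(bow.toarray())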

Tf - Idf

  • tf-idf (term frequency - inverse document frequency)
    • tf(term, doc) - term frequency, one of:
      • Binary signal
      • Frequency of word $j$ in document $i$
      • Smoothed frequency $\log(f_j^i + 1)$
    • idf(term, corpus) - inverse document frequency of word $j$ in corpus $D$: $\log\left(\frac{|D|}{n_j}\right)$
      • $|D|$ - size of the corpus
      • $n_j$ - number of documents which contain word $j$
      • idf measures how specific a word is
$$tfidf = tf(term, doc) \cdot idf(term, corpus)$$

<img src='img/tfidf.jpg' width=1200>
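
The same idea with tf-idf weights, sketched with scikit-learn's TfidfVectorizer (an assumption here; note that sklearn uses a slightly smoothed version of the idf formula above):
In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['I prefer cheese sauce',
          'cheese is not a sauce',
          'good muffins cost a lot in New York']

# sklearn's idf is smoothed: log((1 + |D|) / (1 + n_j)) + 1
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)          # rows are l2-normalized by default

print(tfidf.get_feature_names_out())
print(X.toarray().round(2))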

Bag of words

  • Additionally, l1 or l2 normalization is applied to document vectors
    • Normalization is usually good for models
    • It maintains the "meaning" of a document (longer documents usually have greater word frequencies)
  • Ok, what can we do with BoW now?
    • Use it as features in models (linear models, Naive Bayes)
    • Use it to find similar documents (with cosine or Jaccard distance, as sketched below)
    • Compress it into "topics"!
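
For instance, a sketch of finding similar documents with cosine similarity over tf-idf vectors (the toy corpus is an assumption):
In [ ]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ['I prefer cheese sauce',
          'cheese is not a sauce',
          'good muffins cost a lot in New York']

X = TfidfVectorizer().fit_transform(corpus)

# rows of X are l2-normalized, so cosine similarity is just X @ X.T
S = cosine_similarity(X)
print(S.round(2))   # documents 0 and 1 overlap in 'cheese' and 'sauce'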

Latent Semantic Indexing - LSI

aka

Latent Semantic Analysis - LSA

LSA as a feature reduction technique

The idea (and implementation) of LSI is very similar to PCA:

  • We want to compress the sparse BOW representation into a dense one
  • We want this representation to preserve as much of the initial information as possible and to be meaningful

Singular Value Decomposition (SVD)

Every matrix $A$ of size $n \times m$ and rank $r$ can be decomposed as: $$ A = U \Sigma V^\top ,$$ where

  • $U$ - unitary matrix, consists of eigenvectors of $AA^\top$
  • $V$ - unitary matrix, consists of eigenvectors of $A^\top A$
  • $\Sigma$ - diagonal matrix with singular values $\sigma_i = \sqrt{\lambda_i}$

PCA via SVD

Matrices $U$ and $V$ can be used for PCA $$ AV = U\Sigma $$
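
A quick numerical check of both facts with numpy (the random matrix is an assumption, just for illustration):
In [ ]:
import numpy as np

A = np.random.rand(5, 3)                      # toy n x m matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# reconstruction A = U Sigma V^T (error should be ~0)
print(np.abs(A - U @ np.diag(s) @ Vt).max())

# projection A V = U Sigma, i.e. dense PCA-style coordinates
print(np.allclose(A @ Vt.T, U @ np.diag(s)))  # True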

LSA

  • Let's apply SVD to the BOW matrix
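
A sketch with scikit-learn's TruncatedSVD, a standard way to run LSA on a sparse BOW/tf-idf matrix (the corpus and the number of topics are assumptions):
In [ ]:
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['I prefer cheese sauce',
          'cheese is not a sauce',
          'good muffins cost a lot in New York',
          'muffins with cheese cost even more']

X = TfidfVectorizer().fit_transform(corpus)   # sparse |D| x N

lsa = TruncatedSVD(n_components=2)            # keep 2 latent "topics"
docs_lsa = lsa.fit_transform(X)               # dense |D| x 2 document vectors

print(docs_lsa.round(2))
print(lsa.components_.round(2))               # "topics" over the vocabulary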

Pros

  • Easy
  • Can find similar words with $V$ and similar documents with $U$
  • LSA in search task is called Latent Semantic Indexing
    • Query $q$ is also some sequence of words
    • How do we find the most similar documents in the corpus with LSI?

Pros

  • Simple
  • Can find similar words with $V$ and similar documents with $U$
  • LSA in search task is called Latent Semantic Indexing
    • Query $q$ is also some sequence of words
    • Pass $q$ through the same preprocessing as the main text
    • Make a BOW representation of it
    • Make LSI representation with $$ \hat{q} = qV\Sigma^{-1} $$
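
A numpy sketch of this projection (the toy BOW matrix A, query q, and rank k are assumptions):
In [ ]:
import numpy as np

def lsi_project(q, A, k):
    # truncated SVD of the document-term matrix A
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    V_k = Vt[:k, :].T            # N x k
    return q @ V_k / s[:k]       # q V Sigma^{-1}

A = np.random.rand(6, 10)        # toy |D| x N BOW matrix
q = np.random.rand(10)           # toy query over the same vocabulary
print(lsi_project(q, A, k=2))    # coordinates comparable to rows of U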

Cons

  • Problems with interpretation

Example

LSI space

Query LSI representation

  • By itself, LSA is just a dimension-reduction algorithm applied to the data
  • If we consider topic modelling as a phenomenon, it is much deeper than simple compression

Probabilistic topic models
