1. Some materials are taken from the machine learning course of Victor Kitov.
A huge amount of information is represented in the form of natural language.
We need a way to process this information:
Computational linguistics is an interdisciplinary field concerned with the statistical or rule-based modeling of natural language from a computational perspective.
line = u"Через 1.5 часа поеду в Гусь-Хрустальный."
for w in line.split(' '):
    print(w)
Через 1.5 часа поеду в Гусь-Хрустальный.
line = u"Через 1.5 часа поеду в Гусь-Хрустальный."
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
for w in tokenizer.tokenize(line):
    print(w)
Через 1 .5 часа поеду в Гусь -Хрустальный.
from nltk.tokenize import wordpunct_tokenize
line = u"Через 1.5 часа поеду в Гусь-Хрустальный."
for w in wordpunct_tokenize(line):
    print(w)
Через 1 . 5 часа поеду в Гусь - Хрустальный .
line = u'テキストで表示ロシア語'  # Japanese: no whitespace between words
line = u'Nach der Wahrscheinlichkeitstheorie steckt alles in Schwierigkeiten!'  # German: long compounds like "Wahrscheinlichkeitstheorie" ("probability theory")
# =(
from nltk.tokenize import sent_tokenize
text = 'Good muffins cost $3.88 in New York. Please buy me two of them. Thanks.'
sent_tokenize(text, language='english')
['Good muffins cost $3.88 in New York.', 'Please buy me two of them.', 'Thanks.']
text = u"Через 1.5 часа поеду в Гусь-Хрустальный. Куплю там квасу!"
for sent in sent_tokenize(text, language='english'):
    print(sent)
Через 1.5 часа поеду в Гусь-Хрустальный.
Куплю там квасу!
from nltk.util import ngrams
def word_grams(words, min_=1, max_=4):  # all n-grams for min_ <= n < max_ (max_ is exclusive)
    s = []
    for n in range(min_, max_):
        for ngram in ngrams(words, n):
            s.append(u" ".join(str(i) for i in ngram))
    return s
word_grams(u"I prefer cheese sauce".split(u" "))
['I', 'prefer', 'cheese', 'sauce', 'I prefer', 'prefer cheese', 'cheese sauce', 'I prefer cheese', 'prefer cheese sauce']
n = 3
line = "cheese sauce"
for i in range(len(line) - n + 1):
    print(line[i:i+n])
che hee ees ese se e s sa sau auc uce
Two types of text normalization: stemming and lemmatization.
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
plurals = ['caresses', 'flies', 'dies', 'mules', 'denied',
           'died', 'agreed', 'owned', 'humbled', 'sized',
           'meeting', 'stating', 'siezing', 'itemization',
           'sensational', 'traditional', 'reference', 'colonizer',
           'plotted']
singles = [stemmer.stem(plural) for plural in plurals]
print(' '.join(singles))
caress fli die mule deni die agre own humbl size meet state siez item sensat tradit refer colon plot
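The Porter stemmer only handles English. For Russian, NLTK's Snowball stemmer has a built-in `russian` variant; a minimal sketch (the sample words are my own):

```python
from nltk.stem.snowball import SnowballStemmer

# Snowball ("Porter2") ships language-specific rule sets, including Russian
stemmer = SnowballStemmer('russian')
words = ['митинги', 'передачи', 'соборах']
print(' '.join(stemmer.stem(w) for w in words))
```

Like Porter, it strips suffixes by rules, so the stems are prefixes of the original words and need not be valid dictionary forms.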
from pymystem3 import Mystem
m = Mystem()
text = 'в петербурге прошел митинг против передачи исаакиевского собора рпц'
print(''.join(m.lemmatize(text)))
в петербург проходить митинг против передача исаакиевский собор рпц
import pymorphy2
morph = pymorphy2.MorphAnalyzer()
res = morph.parse(u'пожарница')
for item in res:
    print('====')
    print(u'norm_form: {}'.format(item.normal_form))
    print(u'tag: {}'.format(item.tag))
    print(u'score: {}'.format(item.score))
====
norm_form: пожарница
tag: NOUN,anim,femn sing,nomn
score: 0.9999999999999999
from nltk.corpus import stopwords
stop_words = stopwords.words('russian')  # avoid shadowing the imported module
print(stop_words[:20])
['и', 'в', 'во', 'не', 'что', 'он', 'на', 'я', 'с', 'со', 'как', 'а', 'то', 'все', 'она', 'так', 'его', 'но', 'да', 'ты']
The weight $w_j^i$ of term $j$ in document $i$ can be calculated as $$ w_j^i = \mathrm{tf}_j^i \cdot \log\frac{N}{\mathrm{df}_j}, $$ where $\mathrm{tf}_j^i$ is the frequency of term $j$ in document $i$, $\mathrm{df}_j$ is the number of documents containing term $j$, and $N$ is the total number of documents.
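A common choice for such term weights $w_j^i$ is TF-IDF; a minimal pure-Python sketch (the toy corpus and the particular variant, relative term frequency times $\log(N/\mathrm{df})$, are my assumptions):

```python
import math
from collections import Counter

# toy corpus: each document is a list of tokens
docs = ['кот пьет молоко'.split(),
        'кот ловит мышь'.split(),
        'мышь ест сыр'.split()]
N = len(docs)
# document frequency: in how many documents each term occurs
df = Counter(w for doc in docs for w in set(doc))

def tfidf(doc):
    tf = Counter(doc)
    return {w: (tf[w] / len(doc)) * math.log(N / df[w]) for w in tf}

print(tfidf(docs[0]))
```

Terms occurring in every document get weight zero, while terms specific to one document get the largest weights.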
The idea (and implementation) of LSI (latent semantic indexing) is very similar to PCA.
Every matrix $A$ of size $n \times m$ and rank $r$ can be decomposed as $$ A = U \Sigma V^\top, $$ where $U$ is an $n \times r$ matrix with orthonormal columns, $\Sigma$ is an $r \times r$ diagonal matrix of singular values $\sigma_1 \ge \dots \ge \sigma_r > 0$, and $V$ is an $m \times r$ matrix with orthonormal columns.
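LSI keeps only the $k$ largest singular values; a minimal NumPy sketch (the toy term-document matrix and $k=2$ are my assumptions):

```python
import numpy as np

# toy term-document matrix: rows are terms, columns are documents
A = np.array([[1., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 1.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of latent "topics" to keep
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation of A
print(np.round(A_k, 2))
```

The rows of `Vt[:k, :]` give the $k$-dimensional document representations used for retrieval.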
Pros
Cons