1. Some materials are taken from the machine learning course of Victor Kitov
A huge amount of information is represented in the form of natural language:
We need a way to process this information:
Computational linguistics is an interdisciplinary field concerned with the statistical or rule-based modeling of natural language from a computational perspective.
line = u"Через 1.5 часа поеду в Гусь-Хрустальный."
for w in line.split(' '):
    print(w)
from nltk.tokenize import RegexpTokenizer

line = u"Через 1.5 часа поеду в Гусь-Хрустальный."
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')
for w in tokenizer.tokenize(line):
    print(w)
from nltk.tokenize import wordpunct_tokenize

line = u"Через 1.5 часа поеду в Гусь-Хрустальный."
for w in wordpunct_tokenize(line):
    print(w)
line = u'テキストで表示ロシア語'  # Japanese: no whitespace between words
line = u'Nach der Wahrscheinlichkeitstheorie steckt alles in Schwierigkeiten!'  # German: long compound nouns
# =(
from nltk.tokenize import sent_tokenize
text = 'Good muffins cost $3.88 in New York. Please buy me two of them. Thanks.'
sent_tokenize(text, language='english')
text = u"Через 1.5 часа поеду в Гусь-Хрустальный. Куплю там квасу!"
for sent in sent_tokenize(text, language='english'):
    print(sent)
from nltk.util import ngrams
def word_grams(words, min_=1, max_=4):
    s = []
    for n in range(min_, max_):
        for ngram in ngrams(words, n):
            s.append(u" ".join(str(i) for i in ngram))
    return s
word_grams(u"I prefer cheese sauce".split(u" "))
n = 3
line = "cheese sauce"
for i in range(len(line) - n + 1):
    print(line[i:i+n])
I saw a girl with a telescope.
->
['▁I', '▁saw', '▁a', '▁girl', '▁with', '▁a', '▁', 'te', 'le', 's', 'c', 'o', 'pe', '.']
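Subword segmentations like the one above are produced by byte-pair-encoding (BPE) style tokenizers such as SentencePiece. Below is a minimal pure-Python sketch of the classic BPE merge procedure; the toy corpus, word frequencies, and merge count are made up for illustration:

```python
import re
from collections import Counter

def get_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Merge every occurrence of the given symbol pair into one symbol."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words split into characters, with an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(5):
    stats = get_stats(vocab)
    best = max(stats, key=stats.get)
    vocab = merge_vocab(best, vocab)
    print('merged:', best)
```

Each iteration greedily merges the most frequent adjacent pair, so frequent word endings like "est" become single subword units; real tokenizers learn thousands of such merges from a large corpus.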
Two types of text normalization: stemming and lemmatization
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
plurals = ['caresses', 'flies', 'dies', 'mules', 'denied',
           'died', 'agreed', 'owned', 'humbled', 'sized',
           'meeting', 'stating', 'siezing', 'itemization',
           'sensational', 'traditional', 'reference', 'colonizer',
           'plotted']
singles = [stemmer.stem(plural) for plural in plurals]
print(' '.join(singles))
from pymystem3 import Mystem
m = Mystem()
text = 'в петербурге прошел митинг против передачи исаакиевского собора рпц'
print(''.join(m.lemmatize(text)))
import pymorphy2
morph = pymorphy2.MorphAnalyzer()
res = morph.parse(u'пожарница')
for item in res:
    print('====')
    print(u'norm_form: {}'.format(item.normal_form))
    print(u'tag: {}'.format(item.tag))
    print(u'score: {}'.format(item.score))
from nltk.corpus import stopwords

stop_words = stopwords.words('russian')
print(stop_words[:20])
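Once a stopword list is available, filtering is a one-line comprehension over the tokens. A sketch with a small hand-picked list (an illustrative subset assumed here, so the snippet does not depend on NLTK's downloaded data):

```python
# A tiny hand-picked Russian stopword list (illustrative subset, not NLTK's full list).
mini_stopwords = {'в', 'не', 'на', 'и', 'мне', 'этой'}

tokens = 'пожелай мне не остаться в этой траве'.split()
filtered = [t for t in tokens if t not in mini_stopwords]
print(filtered)  # ['пожелай', 'остаться', 'траве']
```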
The TF-IDF weight $w_j^i$ of term $j$ in document $i$ can be calculated as $$w_j^i = \mathrm{tf}_{ij} \cdot \log\frac{N}{\mathrm{df}_j},$$ where $\mathrm{tf}_{ij}$ is the frequency of term $j$ in document $i$, $\mathrm{df}_j$ is the number of documents containing term $j$, and $N$ is the total number of documents.
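Assuming the standard TF-IDF weighting $w_j^i = \mathrm{tf}_{ij} \cdot \log(N/\mathrm{df}_j)$ (the plain log-IDF variant, without the smoothing sklearn applies by default), a pure-Python sketch on a toy corpus:

```python
import math
from collections import Counter

docs = [doc.split() for doc in [
    'группа крови на рукаве',
    'мой порядковый номер на рукаве',
    'пожелай мне удачи в бою',
]]
N = len(docs)

# Document frequency: number of documents containing each term.
df = Counter(term for doc in docs for term in set(doc))

# w_j^i = tf_ij * log(N / df_j)
weights = []
for doc in docs:
    tf = Counter(doc)
    weights.append({term: tf[term] * math.log(N / df[term]) for term in tf})

print(weights[0]['крови'])   # in 1 of 3 docs -> log(3), a rare, informative term
print(weights[0]['рукаве'])  # in 2 of 3 docs -> log(3/2), a more common term
```

Terms occurring in many documents get low weights, while rare terms that characterize a document get high ones.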
The idea (and implementation) of LSI is very similar to PCA.
Every matrix $A$ of size $n \times m$ and rank $r$ can be decomposed as: $$ A = U \Sigma V^\top ,$$ where $U$ is an $n \times r$ matrix with orthonormal columns, $\Sigma$ is an $r \times r$ diagonal matrix of singular values $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$, and $V$ is an $m \times r$ matrix with orthonormal columns.
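The decomposition can be checked directly with NumPy; a sketch on a random full-rank matrix (`full_matrices=False` gives the compact form used above):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

# Compact SVD: U is 5x3, s holds the singular values, Vt is 3x3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Columns of U and rows of Vt are orthonormal; A is recovered exactly.
assert np.allclose(U.T @ U, np.eye(3))
assert np.allclose(Vt @ Vt.T, np.eye(3))
assert np.allclose(U @ np.diag(s) @ Vt, A)
print(s)  # singular values, in decreasing order
```

Truncating to the top $k$ singular values gives the best rank-$k$ approximation of $A$, which is exactly what `TruncatedSVD` computes below.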
Pros
Cons
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
texts = [
"Группа крови на рукаве",
"Мой порядковый номер на рукаве",
"Пожелай мне удачи в бою",
"Пожелай мне не остаться в этой траве",
"Не остаться в этой траве",
"Пожелай мне удачи"
]
X = TfidfVectorizer(norm=None).fit_transform(texts)
print(X.toarray().round(2))
Z = TruncatedSVD(n_components=3).fit_transform(X)[:, 1:]
Z
import matplotlib.pyplot as plt

_, ax = plt.subplots(1, 1)
ax.scatter(Z[:, 0], Z[:, 1])
ax.scatter(0, 0)
for i, z in enumerate(Z):
    ax.arrow(0., 0., z[0], z[1], width=0.001)
    ax.text(z[0], z[1], texts[i])