Issues with word2vec
I'm wondering why this program only creates a pyplot of letters rather than words. Am I giving Word2Vec the wrong kind of data? The program runs just fine if I use
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
['this', 'is', 'the', 'second', 'sentence'],
['yet', 'another', 'sentence'],
['one', 'more', 'sentence'],
['and', 'the', 'final', 'sentence']]
# train model
model = Word2Vec(sentences, min_count=1)
instead of the text document as my data. Here is the full program:
import string
from keras.preprocessing.text import Tokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot
filename = 'book.txt'
file = open(filename, 'rt')
text = file.read()
file.close()
words1 = list(text.split())
words1 = [word.lower() for word in words1]
table = str.maketrans('', '', string.punctuation)
removepunct = [w.translate(table) for w in words1]
print(removepunct[:100])
# train model
model = Word2Vec(removepunct, min_count=1)
# fit a 2D PCA model to the vectors
X = model.wv[model.wv.key_to_index]
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])
words = list(model.wv.key_to_index)
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()
Solution 1:[1]
I think the issue here is that removepunct is not a list of lists of tokens. If removepunct is a flat list of tokens, gensim's Word2Vec treats each token as a sentence, so each character in a token is considered a token.
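To make the difference concrete, here is a minimal sketch that needs no gensim. The `vocabulary` helper is hypothetical; it only imitates the "loop over sentences, then over each sentence's items" pattern, and the three-word `removepunct` list stands in for the asker's real data.

```python
# Hypothetical stand-in for the asker's `removepunct`: a flat list of tokens.
removepunct = ["the", "cat", "sat"]


def vocabulary(corpus):
    """Collect tokens by looping over "sentences", then over each sentence's items.

    This only imitates the iteration pattern; it is not gensim code.
    """
    return sorted({token for sentence in corpus for token in sentence})


# Flat list: every string is treated as a sentence, so its characters become tokens.
flat_vocab = vocabulary(removepunct)

# Wrapped list: a single sentence whose items are whole words.
nested_vocab = vocabulary([removepunct])

print(flat_vocab)    # ['a', 'c', 'e', 'h', 's', 't']
print(nested_vocab)  # ['cat', 'sat', 'the']
```

In the question's program the equivalent quick fix would be Word2Vec([removepunct], min_count=1). That treats the whole book as one sentence, so splitting the text into real sentences is usually preferable.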
Solution 2:[2]
Your successful example is a list of sentences, i.e. a list of lists of words. The second time, you pass it a list of strings, and Python, in its usual Python fashion, iterates over each string one letter at a time instead of iterating over lists of words.
NLTK provides high-level methods to digest a text file into a list of sentences. Here is a simple way to get tokenized sentences (it requires NLTK's Punkt sentence tokenizer data to be installed):
from nltk.corpus import PlaintextCorpusReader
corpus = PlaintextCorpusReader(".", ["book.txt"])
model = Word2Vec(corpus.sents(), min_count=1)
# Continue as before
If you want to remove punctuation, you must do it after the input has been parsed into sentences (since the sentence tokenizer relies on punctuation to find sentence boundaries), like this:
clean_sents = [ [w for w in s if w.isalnum()] for s in corpus.sents() ]
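If installing NLTK is not an option, the same order of operations (split into sentences first, strip punctuation second) can be sketched with the standard library alone. This is an assumption-laden illustration, not NLTK's algorithm: the `tokenize_sentences` helper and the `sample` text are hypothetical, and the naive regex will mis-split abbreviations such as "Dr." or "e.g.".

```python
import re


def tokenize_sentences(text):
    """Split text into sentences first, then keep only word-like tokens."""
    # Naive boundary rule: '.', '!' or '?' followed by whitespace.
    # This is an assumption for illustration, far cruder than NLTK's tokenizer.
    raw_sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    for raw in raw_sentences:
        # \w+ matches runs of letters, digits and underscores, dropping punctuation.
        tokens = re.findall(r"\w+", raw.lower())
        if tokens:
            sentences.append(tokens)
    return sentences


sample = "This is the first sentence. Is this the second? Yes, it is!"
clean_sents = tokenize_sentences(sample)
print(clean_sents)
# [['this', 'is', 'the', 'first', 'sentence'], ['is', 'this', 'the', 'second'], ['yes', 'it', 'is']]
```

The result is a list of lists of words, which is the shape Word2Vec expects, so it could be passed in the same way as corpus.sents() above.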
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | MehdAi |
| Solution 2 |
