Issues with word2vec

I'm wondering why this program's pyplot only shows single letters rather than words. Am I giving it the wrong kind of data for word2vec? The program runs just fine if I use

sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
            ['this', 'is', 'the', 'second', 'sentence'],
            ['yet', 'another', 'sentence'],
            ['one', 'more', 'sentence'],
            ['and', 'the', 'final', 'sentence']]
# train model
model = Word2Vec(sentences, min_count=1)

this instead of the text document as my data.

import string
from keras.preprocessing.text import Tokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot

filename = 'book.txt'
file = open(filename, 'rt')
text = file.read()
file.close()

words1 = list(text.split())
words1 = [word.lower() for word in words1]

table = str.maketrans('', '', string.punctuation)
removepunct = [w.translate(table) for w in words1]
print(removepunct[:100])

# train model
model = Word2Vec(removepunct, min_count=1)
# fit a 2D PCA model to the vectors
X = model.wv[model.wv.key_to_index]

pca = PCA(n_components=2)
result = pca.fit_transform(X)
# create a scatter plot of the projection
pyplot.scatter(result[:, 0], result[:, 1])
words = list(model.wv.key_to_index)
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()


Solution 1:[1]

I think the issue here is that removepunct is not a list of lists of tokens. If removepunct is just a flat list of tokens, then each character of each token is treated as a token by gensim's Word2Vec model.
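
As a rough sketch (reusing book.txt and the punctuation table from the question; the period split is only a crude placeholder for real sentence tokenization), grouping the tokens by sentence before training gives Word2Vec the shape it expects:

import string
from gensim.models import Word2Vec

with open('book.txt', 'rt') as f:
    text = f.read()

table = str.maketrans('', '', string.punctuation)
sentences = []
for raw_sentence in text.split('.'):   # crude split; a real sentence tokenizer is better
    tokens = [w.lower().translate(table) for w in raw_sentence.split()]
    tokens = [w for w in tokens if w]   # drop tokens emptied by punctuation removal
    if tokens:
        sentences.append(tokens)

# sentences is now a list of lists of words, so the vocabulary keys
# will be whole words instead of single letters
model = Word2Vec(sentences, min_count=1)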

Solution 2:[2]

Your successful example is a list of sentences, i.e. a list of lists of words. The second time, you pass it a list of strings and Python, in its usual Python fashion, iterates over the strings one letter at a time instead of iterating over lists of words.
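
You can see the effect for yourself:

for token in "sentence":
    print(token)   # prints one letter per line, which is exactly what Word2Vec receives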

NLTK provides high-level methods to digest a text file into a list of sentences. Here is a foolproof way to get tokenized sentences:

from nltk.corpus import PlaintextCorpusReader
corpus = PlaintextCorpusReader(".", ["book.txt"])

# corpus.sents() yields one list of word tokens per sentence
model = Word2Vec(corpus.sents(), min_count=1)
# Continue as before

If you want to remove punctuation, you must do it after the input has been parsed into sentences (the punctuation is what marks the sentence boundaries), like this:

clean_sents = [ [w for w in s if w.isalnum()] for s in corpus.sents() ]
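
Training then proceeds on the cleaned sentences exactly as before, for example:

model = Word2Vec(clean_sents, min_count=1)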

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: MehdAi
Solution 2: