Doc2Vec results not as expected
I'm evaluating Doc2Vec for a recommender API. I wasn't able to find a decent pre-trained model, so I trained a model on the corpus, which is about 8,000 small documents.
model = Doc2Vec(vector_size=25,
                alpha=0.025,
                min_alpha=0.00025,
                min_count=1,
                dm=1)
I then looped through the corpus to find similar documents for each document. The results were not very good compared to TF-IDF, even after testing different numbers of epochs and vector sizes.
inferred_vector = model.infer_vector(row['cleaned'].split())
sims = model.docvecs.most_similar([inferred_vector], topn=4)
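One sanity check that fits this loop is to see where each training document's own tag lands in its `most_similar` results, which come back as a list of `(tag, score)` pairs. The sketch below runs on made-up results instead of a real model; `own_tag_rank` and `toy_results` are hypothetical names used only for illustration.

```python
from collections import Counter


def own_tag_rank(sims, own_tag):
    """Return the position of own_tag in a [(tag, score), ...] list, or None if absent."""
    for rank, (tag, _score) in enumerate(sims):
        if tag == own_tag:
            return rank
    return None


# Made-up results in the shape most_similar returns, keyed by the document's own tag.
toy_results = {
    "0": [("0", 0.97), ("5", 0.61), ("2", 0.55)],
    "1": [("1", 0.95), ("3", 0.58), ("6", 0.40)],
    "2": [("7", 0.80), ("2", 0.79), ("0", 0.52)],
    "3": [("4", 0.66), ("1", 0.60), ("8", 0.31)],  # own tag not in the top results
}

# Count how often each rank occurs; a well-fitted model should mostly produce rank 0.
rank_counts = Counter(own_tag_rank(sims, tag) for tag, sims in toy_results.items())
print(rank_counts)
```

If most documents do not come back as their own nearest neighbor, that would point toward the training or inference setup, not just the corpus size.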
I also tried extracting the trained vectors and using cosine_similarity, but the results were, oddly, even worse.
cosine_similarities = cosine_similarity(model.docvecs.vectors_docs, model.docvecs.vectors_docs)
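When ranking neighbors from a full pairwise matrix like this, each row's highest score is the document itself (similarity 1.0), so the row's own index has to be skipped before taking the top results. Below is a minimal pure-Python sketch of that ranking step on made-up vectors; `cosine`, `top_n_neighbors`, and `toy_vectors` are hypothetical names, not part of gensim or scikit-learn.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_neighbors(vectors, index, n=3):
    """Return (other_index, score) pairs for the n most similar rows, skipping `index` itself."""
    scored = [
        (j, cosine(vectors[index], vec))
        for j, vec in enumerate(vectors)
        if j != index
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Made-up 3-dimensional vectors standing in for trained document vectors.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

neighbors = top_n_neighbors(toy_vectors, index=0, n=2)
print(neighbors)
```

It is also worth confirming that row `i` of the matrix is mapped back to the same document that was tagged `str(i)` during training.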
Am I doing something wrong or is the smallish corpus the problem?
Edit: Prep and training code
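import html

import texthero as hero
from texthero import preprocessing
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def unesc(s):
    # Unescape HTML entities (e.g. &amp;) in every row of the Series
    for idx, row in s.iteritems():
        s[idx] = html.unescape(row)
    return s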
custom_pipeline = [
    preprocessing.lowercase,
    unesc,
    preprocessing.remove_urls,
    preprocessing.remove_html_tags,
    preprocessing.remove_diacritics,
    preprocessing.remove_digits,
    preprocessing.remove_punctuation,
    lambda s: hero.remove_stopwords(s, stopwords=custom_stopwords),
    preprocessing.remove_whitespace,
    preprocessing.tokenize
]
ds['cleaned'] = ds['body'].pipe(hero.clean, pipeline=custom_pipeline)
w2v_total_data = list(ds['cleaned'])
tag_data = [TaggedDocument(words=doc, tags=[str(i)]) for i, doc in enumerate(w2v_total_data)]
model = Doc2Vec(vector_size=25,
                alpha=0.025,
                min_alpha=0.00025,
                min_count=1,
                epochs=20,
                dm=1)
model.build_vocab(tag_data)
model.train(tag_data, total_examples=model.corpus_count, epochs=model.epochs)
model.save("lmdocs_d2v.model")
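Since Doc2Vec expects `words` to be a list of string tokens, and a plain string would be iterated character by character, it may be worth verifying the tagged data before training. The sketch below is one way to do that under stated assumptions: `TaggedDoc` is a stand-in namedtuple so the example runs without gensim, and `find_malformed` and `toy_docs` are hypothetical names.

```python
from collections import namedtuple

# Stand-in with the same fields as gensim's TaggedDocument (assumption: used only
# so this sketch runs without gensim installed).
TaggedDoc = namedtuple("TaggedDoc", ["words", "tags"])


def find_malformed(tagged_docs):
    """Return indexes of documents whose words are not a non-empty list of strings."""
    bad = []
    for i, doc in enumerate(tagged_docs):
        ok = (
            isinstance(doc.words, list)
            and len(doc.words) > 0
            and all(isinstance(tok, str) for tok in doc.words)
        )
        if not ok:
            bad.append(i)
    return bad


# Made-up documents: one valid, one raw string, one empty after cleaning.
toy_docs = [
    TaggedDoc(words=["reset", "password", "email"], tags=["0"]),
    TaggedDoc(words="reset password email", tags=["1"]),
    TaggedDoc(words=[], tags=["2"]),
]

malformed = find_malformed(toy_docs)
print(malformed)
```

The same check applies at inference time: whatever is passed to `infer_vector` should be tokenized the same way as the training documents.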
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
