Doc2Vec results not as expected

I'm evaluating Doc2Vec for a recommender API. I wasn't able to find a decent pre-trained model, so I trained a model on the corpus, which is about 8,000 small documents.

    model = Doc2Vec(vector_size=25,
                    alpha=0.025,
                    min_alpha=0.00025,
                    min_count=1,
                    dm=1)

I then looped through the corpus to find similar documents for each document. The results were noticeably worse than what I get from TF-IDF, even after trying different epoch counts and vector sizes.

    inferred_vector = model.infer_vector(row['cleaned'].split())
    sims = model.docvecs.most_similar([inferred_vector], topn=4)
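One way to check whether the inference step is the weak link is to re-infer a vector for each training document and see where that document's own trained vector ranks among its nearest neighbours. The sketch below scores that check on plain lists of floats, so it does not depend on a particular gensim version. `self_rank_summary` and its arguments are hypothetical names, and the toy vectors are made up for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def self_rank_summary(inferred, trained):
    """For each document i, find the rank of trained[i] when all trained
    vectors are sorted by similarity to inferred[i] (rank 0 = most similar),
    and count how often each rank occurs."""
    ranks = Counter()
    for i, query in enumerate(inferred):
        order = sorted(range(len(trained)),
                       key=lambda j: cosine(query, trained[j]),
                       reverse=True)
        ranks[order.index(i)] += 1
    return ranks


# Toy data (made up). In practice inferred[i] would come from
# model.infer_vector(...) and trained[i] from the model's stored doc vectors.
trained = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
inferred = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.2]]
summary = self_rank_summary(inferred, trained)
print(summary)
```

If most documents do not rank themselves at or near position 0, that would suggest the model or the inference step is undertrained, independent of how the recommendations look.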

I also tried extracting the trained vectors and using cosine_similarity, but the results were, oddly, even worse.

    cosine_similarities = cosine_similarity(model.docvecs.vectors_docs, model.docvecs.vectors_docs)
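When ranking from a full similarity matrix like this, every document is its own nearest neighbour (similarity 1.0 on the diagonal), so the self-match has to be skipped before taking the top results. A minimal pure-Python sketch of that ranking step follows, using made-up vectors and a hypothetical helper name `top_n_neighbours`. It is an illustration of the idea, not the code used above.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_neighbours(vectors, index, n):
    """Return the n (doc_index, similarity) pairs most similar to
    vectors[index], excluding the document itself."""
    scored = [(j, cosine(vectors[index], v))
              for j, v in enumerate(vectors) if j != index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Toy vectors (made up); real ones would be the trained document vectors.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
neighbours = top_n_neighbours(vectors, index=0, n=2)
print(neighbours)
```

Comparing this list with the output of most_similar for the same document is a quick way to see whether the two approaches disagree because of the vectors or because of the ranking code.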

Am I doing something wrong or is the smallish corpus the problem?

Edit: Prep and training code

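    # Imports assumed; they were not shown in the original post.
    import html

    import texthero as hero
    from texthero import preprocessing
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # `ds` (a DataFrame with a 'body' column) and `custom_stopwords`
    # are defined elsewhere.

    def unesc(s):
        # Decode HTML entities (e.g. &amp; -> &) in every document.
        # Series.apply replaces the iteritems() loop, which was removed in pandas 2.0.
        return s.apply(html.unescape)

    custom_pipeline = [
        preprocessing.lowercase,
        unesc,
        preprocessing.remove_urls,
        preprocessing.remove_html_tags,
        preprocessing.remove_diacritics,
        preprocessing.remove_digits,
        preprocessing.remove_punctuation,
        lambda s: hero.remove_stopwords(s, stopwords=custom_stopwords),
        preprocessing.remove_whitespace,
        preprocessing.tokenize
    ]

    # Clean and tokenize each document body.
    ds['cleaned'] = ds['body'].pipe(hero.clean, pipeline=custom_pipeline)
    w2v_total_data = list(ds['cleaned'])

    # Tag each document with its position in the corpus.
    tag_data = [TaggedDocument(words=doc, tags=[str(i)]) for i, doc in enumerate(w2v_total_data)]

    model = Doc2Vec(vector_size=25,
                    alpha=0.025,
                    min_alpha=0.00025,
                    min_count=1,
                    epochs=20,
                    dm=1)

    # Build the vocabulary, then train for model.epochs passes over the corpus.
    model.build_vocab(tag_data)
    model.train(tag_data, total_examples=model.corpus_count, epochs=model.epochs)
    model.save("lmdocs_d2v.model")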


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
