LDA Gensim coherence model gives high values for very large NUM_TOPICS that generate a single repeated topic
I read that the coherence measure can help estimate the optimal number of topics (K) for an LDA model. I wrote the code below to train multiple LDA models with different numbers of topics and compute the coherence score for each.
import gensim
from gensim.models import CoherenceModel

def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=1):
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.ldamodel.LdaModel(corpus=corpus, num_topics=num_topics,
                                                random_state=100, chunksize=200, passes=10,
                                                per_word_topics=True, id2word=dictionary)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts,
                                        dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values

model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus,
                                                        texts=data_lemmatized,
                                                        start=2, limit=500, step=5)
# Show coherence graph
import matplotlib.pyplot as plt

limit, start, step = 500, 2, 5
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')
plt.show()
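For reference, this is how I then pick K from the sweep: take the topic count with the highest coherence score. A minimal, self-contained sketch (the helper name and the dummy scores are mine; in practice I pass in the `x` and `coherence_values` computed above):

```python
# Pick the topic count whose coherence score is highest.
def best_num_topics(topic_counts, scores):
    # Pair each candidate K with its score and keep the K of the best pair.
    return max(zip(topic_counts, scores), key=lambda pair: pair[1])[0]

# Dummy example; real usage: best_num_topics(x, coherence_values)
print(best_num_topics([2, 7, 12, 17], [0.30, 0.45, 0.50, 0.48]))  # -> 12
```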
Now, according to this graph, the coherence measure peaks at, for example, 150, 187, 200, and beyond!
- I don't quite understand why the coherence doesn't drop when the number of topics gets too large. Why does it plateau?
- When I run LDA with any of the numbers of topics mentioned above, almost every extracted topic is identical, with a 0.000 weight on each word!

Am I doing something wrong here?
Sample of output:
[(149, '0.000*"klebsiella" + 0.000*"cyclade" + 0.000*"slope"'), (103, '0.000*"klebsiella" + 0.000*"cyclade" + 0.000*"slope"'), (49, '0.000*"klebsiella" + 0.000*"cyclade" + 0.000*"slope"'), (40, '0.000*"klebsiella" + 0.000*"cyclade" + 0.000*"slope"'), (56, '0.000*"klebsiella" + 0.000*"cyclade" + 0.000*"slope"'), (105, '0.000*"klebsiella" + 0.000*"cyclade" + 0.000*"slope"'), (35, '0.000*"klebsiella" + 0.000*"cyclade" + 0.000*"slope"'), (63, '0.000*"klebsiella" + 0.000*"cyclade" + 0.000*"slope"'), (146, '0.288*"role" + 0.135*"machine" + 0.119*"age"'), (120, '0.000*"klebsiella" + 0.000*"cyclade" + 0.000*"slope"'), (18, '0.000*"klebsiella" + 0.000*"cyclade" + 0.000*"slope"'), (92, '0.000*"klebsiella" + 0.000*"cyclade" + 0.000*"slope"'), (157, '0.000*"klebsiella" + 0.000*"cyclade" + 0.000*"slope"'), (143, '0.000*"klebsiella" + 0.000*"cyclade" + 0.000*"slope"'), (39, '0.000*"klebsiella" + 0.000*"cyclade" + 0.000*"slope"'), (141, '0.000*"klebsiella" + 0.000*"cyclade" + 0.000*"slope"'), (78, '0.000*"klebsiella" + 0.000*"cyclade" + 0.000*"slope"'), (151, '0.000*"klebsiella" + 0.000*"cyclade" + 0.000*"slope"'), (90, '0.000*"klebsiella" + 0.000*"cyclade" + 0.000*"slope"'), (101, '0.000*"klebsiella" + 0.000*"cyclade" + 0.000*"slope"')]
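To quantify the problem, here is how I count how many topics print an identical top-word signature (a minimal, self-contained sketch using a small slice of the output above; `sample` is mine, in practice I build it from `model.print_topics()`):

```python
from collections import Counter

# A slice of the output above: (topic_id, top-words string) pairs.
sample = [
    (149, '0.000*"klebsiella" + 0.000*"cyclade" + 0.000*"slope"'),
    (103, '0.000*"klebsiella" + 0.000*"cyclade" + 0.000*"slope"'),
    (146, '0.288*"role" + 0.135*"machine" + 0.119*"age"'),
]

# Count topics sharing the exact same word string; repeats are degenerate.
counts = Counter(words for _, words in sample)
duplicates = {words: n for words, n in counts.items() if n > 1}
print(duplicates)  # the degenerate signature appears twice in this slice
```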
If I choose a number of topics below 100, I get OK results:
[(132, '0.263*"subject" + 0.155*"deputy" + 0.091*"plastic"'), (110, '0.208*"club" + 0.164*"fan" + 0.096*"ground"'), (200, '0.225*"book" + 0.080*"writer" + 0.078*"winner"'), (16, '0.000*"cellnet" + 0.000*"katherine" + 0.000*"accommodation"'), (71, '0.000*"cellnet" + 0.000*"katherine" + 0.000*"accommodation"'), (29, '0.543*"language" + 0.095*"rush" + 0.074*"hip"'), (66, '0.312*"case" + 0.195*"court" + 0.129*"charge"'), (34, '0.000*"cellnet" + 0.000*"katherine" + 0.000*"accommodation"'), (191, '0.492*"film" + 0.146*"cinema" + 0.094*"director"'), (28, '0.295*"number" + 0.132*"chart" + 0.107*"week"'), (116, '0.130*"email" + 0.109*"union" + 0.086*"difference"'), (144, '0.283*"rate" + 0.253*"interest" + 0.036*"month"'), (207, '0.202*"camera" + 0.107*"message" + 0.058*"text"'), (174, '0.260*"revenue" + 0.188*"earning" + 0.104*"world"'), (121, '0.631*"distribution" + 0.000*"accommodation" + 0.000*"cambridgeshire"'), (68, '0.382*"price" + 0.179*"oil" + 0.095*"demand"'), (163, '0.417*"action" + 0.258*"official" + 0.095*"lawsuit"'), (206, '0.135*"race" + 0.111*"world" + 0.066*"year"'), (215, '0.305*"technology" + 0.194*"device" + 0.085*"generation"'), (125, '0.312*"man" + 0.072*"hunt" + 0.069*"ban"')]
N.B.: the corpus is made up of 10k unique documents.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow