'How to measure the coherence in sklearn LDA model (Non-gensim LDA model)?
I have tried using two techniques, but I am getting different results. I just want to be sure about which one to go with.
Method 1: I tried using
from tmtoolkit.topicmod.evaluate import metric_coherence_gensim
metric_coherence_gensim(measure='u_mass',
top_n=25,
topic_word_distrib = lda.components_,
dtm = dtm,
vocab=np.array([x for x in tfidf_vect.vocabulary_.keys()]),
return_mean = True)
The source mentioned that a decent coherence score should be between -14 to +14. Any explanation on this also helps.
Method 2: I had to write functions to calculate the score without any in-built library.
def get_umass_score(dt_matrix, i, j):
zo_matrix = (dt_matrix > 0).astype(int)
col_i, col_j = zo_matrix[:, i], zo_matrix[:, j]
col_ij = col_i + col_j
col_ij = (col_ij == 2).astype(int)
Di, Dij = col_i.sum(), col_ij.sum()
return math.log((Dij + 1) / Di)
def get_topic_coherence(dt_matrix, topic, n_top_words):
indexed_topic = zip(topic, range(0, len(topic)))
topic_top = sorted(indexed_topic, key=lambda x: 1 - x[0])[0:n_top_words]
coherence = 0
for j_index in range(0, len(topic_top)):
for i_index in range(0, j_index - 1):
i = topic_top[i_index][1]
j = topic_top[j_index][1]
coherence += get_umass_score(dt_matrix, i, j)
return coherence
def get_average_topic_coherence(dt_matrix, topics, n_top_words):
total_coherence = 0
for i in range(0, len(topics)):
total_coherence += get_topic_coherence(dt_matrix, topics[i], n_top_words)
return total_coherence / len(topics)
I got this from a StackOverflow post. Credits to that guy who wrote this, but I was getting huge value depending on the value I pass for n_top_words.
Can someone tell me which method is reliable, or is there any better way I can find the coherence score for sklearn LDA models?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
