LDA: topic model gensim gives same set of topics
Why am I getting the same set of words for different topics in my gensim LDA model? I used the parameters below, and I checked that there are no duplicate documents in my corpus.
lda_model = gensim.models.ldamodel.LdaModel(corpus=MY_CORPUS,
                                            id2word=WORD_AND_ID,
                                            num_topics=4,
                                            minimum_probability=minimum_probability,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',  # other options: 'symmetric', 'asymmetric'
                                            per_word_topics=True)
Results
[
(0, '0.004*lily + 0.01*rose + 0.00*jasmine'),
(1, '0.005*geometry + 0.07*algebra + 0.01*calculation'),
(2, '0.003*painting + 0.001*brush + 0.01*colors'),
(3, '0.005*geometry + 0.07*algebra + 0.01*calculation')
]
Note that topics #1 and #3 are identical.
Solution 1:[1]
Each of the topics likely contains a large number of words weighted differently. When a topic is being displayed (e.g. using lda_model.show_topics()) you are going to get only a few words with the largest weights. This does not mean that there are no differences between topics among the remaining vocabulary.
You can control the number of displayed words to inspect the remaining weights:
lda_model.show_topics(num_topics=4, num_words=10, log=False, formatted=True)
and increase the num_words parameter to include even more words.
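To check whether two topics still match once more words are included, their longer top-word lists can be compared directly. The sketch below is plain Python and assumes each topic is a list of (word, weight) pairs, which is the shape lda_model.show_topic(topic_id, topn=...) returns. The helper name top_word_overlap and the sample weights are made up for illustration.

```python
def top_word_overlap(topic_a, topic_b):
    """Jaccard overlap between the word sets of two topics.

    Each topic is a list of (word, weight) pairs.
    Returns 1.0 for identical word sets and 0.0 for disjoint ones.
    """
    words_a = {word for word, _ in topic_a}
    words_b = {word for word, _ in topic_b}
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


# Hypothetical data; with a trained model you could use e.g.
#   topic_1 = lda_model.show_topic(1, topn=50)
topic_1 = [("algebra", 0.07), ("calculation", 0.01), ("geometry", 0.005), ("matrix", 0.004)]
topic_3 = [("algebra", 0.07), ("calculation", 0.01), ("geometry", 0.005), ("proof", 0.003)]

print(top_word_overlap(topic_1, topic_3))
```

In this made-up sample the first three words match and the fourth differs, so the overlap is below 1.0. An overlap of exactly 1.0 at a large topn would suggest the two topics really are duplicates.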
Now, there is also a possibility that:
- the number of topics should be different (e.g. 3),
- or minimum_probability should be smaller (what value do you use?),
- or the number of passes should be larger or chunksize smaller,
- or the corpus should be larger (what is its size?) or stripped of stop words (did you do that?).
I encourage you to experiment with different values of these parameters to check whether any combination works better.
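A small helper can enumerate the parameter combinations so that none is skipped. This is only a sketch: parameter_grid is a hypothetical helper name, and the values are illustrative rather than recommended settings.

```python
from itertools import product


def parameter_grid(**options):
    """Yield one dict per combination of the given parameter values."""
    names = list(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))


# Illustrative values only, not recommendations.
grid = list(parameter_grid(num_topics=[3, 4], passes=[10, 20], chunksize=[50, 100]))

for params in grid:
    # With gensim installed and a corpus at hand, each combination could be
    # passed on as keyword arguments, for example:
    #   model = gensim.models.ldamodel.LdaModel(corpus=MY_CORPUS,
    #                                           id2word=WORD_AND_ID, **params)
    print(params)
```

Each resulting model can then be inspected with show_topics to see whether the duplicated topic disappears.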
Solution 2:[2]
Change the alpha parameter to 50/i, where i is your number of topics, and set the eta parameter (eta = 0.1), like this:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=4,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha=50 / 4,  # 50 / num_topics
                                            eta=0.1,
                                            per_word_topics=True)
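Because alpha depends on the number of topics in this rule, it can help to compute both priors in one place so they do not drift apart when num_topics changes. The helper below is a hypothetical sketch of that arithmetic, not part of gensim.

```python
def symmetric_priors(num_topics, eta=0.1):
    """Return alpha = 50 / num_topics together with a fixed eta."""
    if num_topics <= 0:
        raise ValueError("num_topics must be a positive integer")
    return {"alpha": 50 / num_topics, "eta": eta}


priors = symmetric_priors(4)
print(priors)

# Assumed usage with gensim (not executed here):
#   LdaModel(corpus=corpus, id2word=id2word, num_topics=4, **priors)
```

For four topics this gives alpha = 12.5, the same value as 50/4 in the call above.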
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | sophros |
| Solution 2 | yoones_khosravi |
