Assigning Topic Probabilities to each Document in Anchored Topic Model using Python
I was interested in running an Anchored Topic Model (a.k.a. the CorEx topic model) and successfully ran it on my data set. But when I assigned topic probabilities to each document, I found that the probabilities were almost all either 1 or 0. For example, see this simplified output:
| Doc | Topic0 | Topic1 | Topic2 | Topic3 |
|---|---|---|---|---|
| A | 0.9999 | 0.0001 | 0.0103 | 0.9999 |
| B | 0.9999 | 0.0001 | 0.9999 | 0.9999 |
| C | 0.0025 | 0.9999 | 0.2033 | 0.9999 |
| ... | ... | ... | ... | ... |
I was wondering whether this result is natural. I understand that LDA and CorEx are based on different models: LDA is a generative model while CorEx is a discriminative model, which means the probabilities don't have to sum to 1 for each document.
My question is not about the probabilities summing to more than 1, but whether such extreme probabilities are normal when running CorEx topic modeling. I searched for example code, related papers, and other materials, but couldn't find any examples that showed the derived topic probabilities for each document.
Instead, I found code here where the author converted these probabilities to binary values, so there's no way to infer what the original probabilities looked like before the conversion.
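For intuition on why near-0/1 values can appear, here is a simplified, illustrative sketch (not CorEx's exact computation): in any model where each observed word contributes an independent log-likelihood-ratio term to a topic posterior, the summed evidence grows with document length, so the resulting sigmoid saturates toward 0 or 1, much like the table above. All names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def topic_posterior(log_ratios):
    """Posterior p(topic=1 | words) from per-word log likelihood ratios
    (a naive-Bayes-style simplification, not the CorEx objective)."""
    return 1.0 / (1.0 + np.exp(-np.sum(log_ratios)))

# A short document with 3 weakly informative words: moderate evidence.
few_words = rng.normal(loc=0.3, scale=0.5, size=3)
# A longer document with 200 such words: the summed evidence is large,
# so the posterior is pushed to an extreme value near 1.
many_words = rng.normal(loc=0.3, scale=0.5, size=200)

p_few = topic_posterior(few_words)
p_many = topic_posterior(many_words)
print(p_few, p_many)  # p_many ends up extremely close to 1
```

Under this simplification, longer documents almost inevitably produce saturated probabilities, which would make the near-0/1 pattern unsurprising rather than a bug.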
Solution 1:[1]
I have experienced the same issue while using CorEx. One way I found to counter your problem is to use a different vectorizer. For example, I used TfidfVectorizer (from sklearn.feature_extraction.text import TfidfVectorizer) and CountVectorizer (from sklearn.feature_extraction.text import CountVectorizer), which produced different results.
https://notebook.community/gregversteeg/corex_topic/corextopic/example/corex_topic_example
The above link showcases a method very similar to how I used CountVectorizer. The link you provided shows how to use the tf-idf method.
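As a minimal sketch of the comparison described above (the toy documents are invented for illustration): CorEx takes a document-word matrix as input, and swapping the vectorizer changes the weight each word contributes, which can in turn change how extreme the resulting topic probabilities are.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "cats chase mice",
    "dogs chase cats",
    "mice fear cats and dogs",
]

# Raw term counts (integer-valued matrix).
count_X = CountVectorizer().fit_transform(docs)
# Tf-idf weights (real-valued, length-normalized rows by default).
tfidf_X = TfidfVectorizer().fit_transform(docs)

print(count_X.toarray())
print(tfidf_X.toarray().round(2))
# Same vocabulary and shape, but different weighting of each word.
```

Either sparse matrix can be passed to a CorEx topic model's `fit`; the CorEx example notebook linked above uses a count-based matrix, so comparing it against a tf-idf input is a reasonable diagnostic for the saturation issue.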
I hope this helps.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | JordanB |
