Assigning Topic Probabilities to each Document in Anchored Topic Model using Python

I was interested in running the Anchored Topic Model (a.k.a. the CorEx topic model) and successfully ran it on my data set. But when I assigned topic probabilities to each document, I found that the probabilities were almost all either 1 or 0. For example, see this simplified output:

Doc Topic0 Topic1 Topic2 Topic3
A   0.9999 0.0001 0.0103 0.9999
B   0.9999 0.0001 0.9999 0.9999
C   0.0025 0.9999 0.2033 0.9999 
... ...    ...    ...    ...

I was wondering whether this result is to be expected. I understand that LDA and CorEx are based on different models: LDA is a generative model, while CorEx is a discriminative model, which means that the topic probabilities for each document do not have to sum to 1.

My question is not about the sum of the probabilities exceeding 1, but about whether such extreme probabilities are normal when running CorEx topic modeling. I searched for example code, related papers, and other materials, but couldn't find any examples that showed the derived topic probabilities for each document.

Instead, I found the code here, where the author converted these probabilities to binary labels, so there is no way to tell what the original probabilities looked like prior to the binary conversion.
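For reference, here is a minimal sketch of the kind of pipeline that produces these numbers (the toy corpus, anchor words, and parameters are placeholders, and it assumes the standard corextopic API, where p_y_given_x holds the per-document topic probabilities and labels holds their binarized form):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from corextopic import corextopic as ct

# Toy corpus and anchor words, for illustration only.
docs = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "stock markets fell sharply today",
    "investors sold shares amid market fears",
]
anchors = [["cat", "cats"], ["market", "markets"]]

# CorEx is canonically fit on a binary document-word matrix.
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)
words = list(vectorizer.get_feature_names_out())  # get_feature_names() on older scikit-learn

model = ct.Corex(n_hidden=2, seed=42)
model.fit(X, words=words, anchors=anchors, anchor_strength=3)

# Per-document topic probabilities vs. their binarized counterpart.
probs = pd.DataFrame(model.p_y_given_x, columns=["Topic0", "Topic1"])
labels = pd.DataFrame(model.labels, columns=["Topic0", "Topic1"])
print(probs.round(4))
print(labels.astype(int))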



Solution 1:[1]

I have experienced the same issue while using CorEx. One way I found to counter this problem is to try different vectorizers. For example, I used TfidfVectorizer (from sklearn.feature_extraction.text import TfidfVectorizer) and CountVectorizer (from sklearn.feature_extraction.text import CountVectorizer), which produced different results.

https://notebook.community/gregversteeg/corex_topic/corextopic/example/corex_topic_example

The notebook linked above demonstrates essentially the same way I used CountVectorizer, while the link you provided shows how to use the tf-idf approach.
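As a rough sketch of that experiment (the corpus, anchors, and parameters below are placeholders; the point is simply to fit the same anchored model on two different document-term matrices and compare the spread of the per-document probabilities):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from corextopic import corextopic as ct

# Placeholder corpus and anchors; substitute your own.
docs = [
    "the cat sat on the mat",
    "dogs and cats are common pets",
    "stock markets fell sharply today",
    "investors sold shares amid market fears",
]
anchors = [["cat", "cats"], ["market", "markets"]]

for name, vec in [("CountVectorizer (binary)", CountVectorizer(binary=True)),
                  ("TfidfVectorizer", TfidfVectorizer())]:
    X = vec.fit_transform(docs)
    words = list(vec.get_feature_names_out())
    model = ct.Corex(n_hidden=2, seed=42)
    # Note: CorEx's canonical input is binary; depending on the corextopic
    # version, non-binary tf-idf weights may be binarized internally.
    model.fit(X, words=words, anchors=anchors, anchor_strength=3)
    spread = np.percentile(model.p_y_given_x, [5, 50, 95])
    print(name, np.round(spread, 4))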

I hope this helps.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution 1: JordanB