Entropy Weighting a DFM in R and quanteda
I am trying to weight a dfm built in quanteda using entropy weighting and I am a little confused. (I've edited my question after I realized where I was going completely wrong).
According to https://en.wikipedia.org/wiki/Latent_semantic_analysis#Mathematics_of_LSI:
$$g_i = 1 + \sum_j \frac{p_{ij} \log_2 p_{ij}}{\log_2 n}$$
$$p_{ij} = \frac{\mathrm{tf}_{ij}}{\mathrm{gf}_i}$$
$$a_{ij} = g_i \log(\mathrm{tf}_{ij} + 1)$$
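For context, the summation term is just a rescaled Shannon entropy: writing $H_i$ for the entropy (in bits) of term $i$'s distribution over the $n$ documents, the global weight can be rewritten as

$$g_i = 1 + \sum_j \frac{p_{ij} \log_2 p_{ij}}{\log_2 n} = 1 - \frac{H_i}{\log_2 n}, \qquad H_i = -\sum_j p_{ij} \log_2 p_{ij}$$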
And according to *Term document entropy calculation*, the recommendation is to use `quanteda.textstats::textstat_entropy()`.
This function returns global entropies (by document or feature). For example:
```r
head(textstat_entropy(data_dfm_lbgexample))
  document  entropy
1       R1 3.386943
2       R2 3.386943
3       R3 3.386943
4       R4 3.386943
5       R5 3.386943
6       V1 3.386943
```
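To check what quantity this is, here is a minimal sketch I would run, assuming `textstat_entropy()` computes plain Shannon entropy in bits over each document's relative frequencies (`data_dfm_lbgexample` ships with quanteda):

```r
library(quanteda)
library(quanteda.textstats)

# recompute the entropy of document R1 by hand and compare it with the
# textstat_entropy() value above (~3.386943)
counts <- quanteda::convert(data_dfm_lbgexample, to = "matrix")["R1", ]
p <- counts[counts > 0] / sum(counts)   # relative frequencies within R1
-sum(p * log2(p))                       # Shannon entropy in bits
```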
This suggests the function gives me $g_i$ without the $+1$.
And if I want to weight the dfm by this global weight to get $a_{ij}$, then:
```r
library(quanteda)
library(quanteda.textstats)
library(Matrix)  # for Diagonal()

x <- data_corpus_inaugural %>%
  quanteda::corpus() %>%
  quanteda::tokens() %>%
  quanteda::dfm()
# per-document entropies on the diagonal of an ndoc x ndoc sparse matrix
ent <- quanteda.textstats::textstat_entropy(x, margin = "documents")
ent <- Diagonal(x = ent$entropy)
# local weight 1 + log(tf_ij), then scale each document's row by its
# entropy; the product is a plain Matrix, no longer a dfm
x <- dfm_weight(x, scheme = "logcount")
x <- ent %*% x
```
I recognize there is a slight difference between the dfm_weight "logcount" scheme [1 + log(tf_ij)] and the Wikipedia formula [log(tf_ij + 1)].
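If that difference matters, log(tf_ij + 1) can be applied exactly by transforming only the non-zero entries of the sparse matrix. A sketch; it relies on a quanteda dfm extending Matrix's dgCMatrix class, and on log(0 + 1) = 0 preserving sparsity:

```r
library(Matrix)

# replace the dfm_weight(x, scheme = "logcount") step above with an exact
# log(tf_ij + 1) local weight, where x is the raw (unweighted) count dfm
m <- as(x, "dgCMatrix")   # a dfm extends dgCMatrix, so this coercion works
m@x <- log(m@x + 1)       # the @x slot holds only the non-zero counts
x <- ent %*% m            # same entropy scaling as before
```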
Is this the correct way to go about it?
I'd appreciate any guidance.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow