Entropy Weighting DFM in R and Quanteda

I am trying to apply entropy weighting to a dfm built in quanteda, and I am a little confused. (I've edited my question after realizing where I was going completely wrong.)

According to https://en.wikipedia.org/wiki/Latent_semantic_analysis#Mathematics_of_LSI:

g_i = 1 + Σ_j ( p_ij log2(p_ij) / log2(n) )

p_ij = tf_ij / gf_i

a_ij = g_i log(tf_ij + 1)

where tf_ij is the count of term i in document j, gf_i is the total count of term i across the whole collection, and n is the number of documents.
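
To make sure I am reading the formulas correctly, here is a from-scratch sketch on a toy term-document matrix (my own example; all object names are mine, and I use log base 2 throughout):

# toy term-document matrix: rows = terms i, columns = documents j
tf <- matrix(c(2, 0, 1,
               1, 3, 1),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("term1", "term2"), c("d1", "d2", "d3")))

n  <- ncol(tf)      # n: number of documents
gf <- rowSums(tf)   # gf_i: global frequency of each term
p  <- tf / gf       # p_ij = tf_ij / gf_i

# g_i = 1 + sum_j p_ij * log2(p_ij) / log2(n), treating 0 * log2(0) as 0
plogp <- ifelse(p > 0, p * log2(p), 0)
g <- 1 + rowSums(plogp) / log2(n)

# a_ij = g_i * log2(tf_ij + 1); g recycles down the rows, scaling term i by g_i
a <- g * log2(tf + 1)
a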

And according to the answer to Term document entropy calculation, the recommendation is to use:

quanteda.textstats::textstat_entropy()

This function returns entropies by document or by feature. For example:

head(textstat_entropy(data_dfm_lbgexample))
  document  entropy
1       R1 3.386943
2       R2 3.386943
3       R3 3.386943
4       R4 3.386943
5       R5 3.386943
6       V1 3.386943

So textstat_entropy() gives the Shannon entropy −Σ_j p_ij log2(p_ij). As far as I can tell, that is the summation term of g_i up to the sign and the 1/log2(n) normalisation, i.e. g_i = 1 − entropy / log2(n).
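
To convince myself of that relationship, here is a minimal sanity check (my own toy example; the objects toy, tf, p, and H are mine, and I use margin = "features" here because g_i in the wiki formula is indexed by term):

library(quanteda)
library(quanteda.textstats)

# toy dfm; feature "b" occurs in every document, so no 0 * log2(0) terms
toy <- dfm(tokens(c(d1 = "a a b", d2 = "a b b b", d3 = "b c")))

tf <- as.matrix(toy)[, "b"]  # tf_ij for feature "b" across documents
p  <- tf / sum(tf)           # p_ij = tf_ij / gf_i
H  <- -sum(p * log2(p))      # Shannon entropy, base 2

H                                           # hand-rolled entropy
textstat_entropy(toy, margin = "features")  # the row for "b" should match H
1 - H / log2(ndoc(toy))                     # the corresponding g_i for "b"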

And if I want to weight the dfm by this global weight to get a_ij, then:

library(quanteda)
library(quanteda.textstats)
library(Matrix)  # for Diagonal()

# data_corpus_inaugural is already a corpus, so no corpus() call is needed
x <- data_corpus_inaugural %>%
  quanteda::tokens() %>%
  quanteda::dfm()

# entropy per document, as a diagonal matrix
ent <- quanteda.textstats::textstat_entropy(x, margin = "documents")
ent <- Diagonal(x = ent$entropy)

# local weighting: 1 + log(tf_ij) for tf_ij > 0
x <- dfm_weight(x, scheme = "logcount")

# scale each document (row) by its entropy; note the result is a plain
# sparse Matrix, no longer a dfm
x <- ent %*% x

I recognize there is a slight difference between the dfm_weight "logcount" scheme, 1 + log(tf_ij), and the wiki local weight, log(tf_ij + 1).
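
For what it's worth, if I wanted the exact wiki local weight log2(tf_ij + 1) instead of the "logcount" scheme, something like this seems to work (my own workaround; x_raw and x_wiki are names I made up, and the dense as.matrix() detour is fine at this scale because log2(0 + 1) = 0 keeps the zero cells at zero):

library(quanteda)

# fresh counts, before any weighting
x_raw <- dfm(tokens(data_corpus_inaugural))

# "logcount" with base 2: 1 + log2(tf_ij) for tf_ij > 0
x_logcount <- dfm_weight(x_raw, scheme = "logcount", base = 2)

# wiki-style local weight: log2(tf_ij + 1)
x_wiki <- as.dfm(log2(as.matrix(x_raw) + 1))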

Is this the correct way to go about it?

I'd appreciate any guidance.


