How to properly format Twitter data for analysis in tidytext
I'm using the R tidytext package to analyze the tf-idf content of Twitter data that I retrieved with the academictwitteR package (via its get_all_tweets function). My code looks like this:
library(dplyr)
library(tidytext)

tot_tf_idf <- v_tweets_total %>%
  unnest_tokens(word, text) %>%
  count(id, word, sort = TRUE)

tot_tf_idf <- tot_tf_idf %>%
  bind_tf_idf(word, id, n)

tot_tf_idf %>%
  arrange(desc(tf_idf)) %>%
  head()
I used the Tweet ID as the document argument in bind_tf_idf, but is this correct, or is there a better way to define the "document"? The tf-idf values look okay for the most part, although stop words have oddly high idf and tf-idf values, probably because of the large number of documents. The reason I ask is that I am trying to calculate cosine similarity between tweets and create a plot that clusters the tweets by text content using Louvain modularity. However, creating the plot is taking a very long time, and I am wondering whether that is because I set it up with the wrong kind of "document".
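To deal with the stop-word issue, would dropping them with tidytext's stop_words lexicon before bind_tf_idf be the right approach? A minimal sketch of what I mean (assuming the default English stop_words data shipped with tidytext):

tot_tf_idf <- v_tweets_total %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%   # remove common English stop words
  count(id, word, sort = TRUE) %>%
  bind_tf_idf(word, id, n)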
My code for creating cosine similarity and the Louvain cluster plot:
library(proxy)
library(igraph)

# Create a sparse document-term matrix weighted by tf-idf
dtm <- cast_sparse(tot_tf_idf, row = id, column = word, value = tf_idf)

# Function for calculating the log-likelihood term of each word in the corpus
log_likelihood_terms <- function(dtm) {
  b <- colSums(dtm)
  b <- ifelse(b == 0, 1e-12, b)
  LLs <- c()
  for (i in 1:nrow(dtm)) {
    a <- dtm[i, ]
    a <- ifelse(a == 0, 1e-12, a)
    c <- sum(a)
    d <- sum(b)
    E1 <- c * (a + b) / (c + d)
    E2 <- d * (a + b) / (c + d)
    LL <- 2 * ((a * log(a / E1)) + (b * log(b / E2)))
    LL <- sum(LL)
    LLs <- c(LLs, LL)
    cat(length(LLs) / nrow(dtm), "\r")  # progress indicator
  }
  names(LLs) <- rownames(dtm)
  LLs <- LLs[order(LLs, decreasing = TRUE)]
  return(LLs)
}
dtm <- as.matrix(dtm)

# Restrict to the 1000 words with the highest log-likelihood
ll_terms <- log_likelihood_terms(t(dtm))
ll_terms <- ll_terms[order(ll_terms, decreasing = TRUE)]
ll_terms <- names(ll_terms[1:1000])

# Cosine similarity between tweets, then an undirected weighted graph
cosine_dtm <- simil(dtm[, ll_terms], method = "cosine")
cosine_dtm <- as.matrix(cosine_dtm)
cosine_net_dtm <- graph_from_adjacency_matrix(cosine_dtm, weighted = TRUE, mode = "undirected")

# Keep only the largest connected component
comps <- components(cosine_net_dtm)$membership
largest_component <- which.max(table(comps))
cosine_net_dtm <- delete_vertices(cosine_net_dtm, which(comps != largest_component))

plot(cosine_net_dtm,
     vertex.size = 5,
     mark.groups = cluster_louvain(cosine_net_dtm),
     vertex.color = "grey80",
     vertex.frame.color = "grey60",
     vertex.label.cex = 0.5,
     vertex.label.color = "black")
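I have also wondered whether the plot is slow simply because the cosine network is nearly fully connected, i.e. there is a weighted edge for almost every pair of tweets. A minimal sketch of what I mean by pruning weak edges before the Louvain step (the 0.1 cutoff is just an arbitrary placeholder, not a recommended value):

# Drop weak edges so the layout and clustering have far fewer edges to process
cosine_net_dtm <- delete_edges(cosine_net_dtm, E(cosine_net_dtm)[weight < 0.1])

Would that be a reasonable workaround, or does the slowness point back to the choice of "document"?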
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow