How to properly format Twitter data for analysis in tidytext
I'm using the R tidytext package to analyze the tf-idf content of Twitter data that I retrieved with the academictwitteR package (via its get_all_tweets function). My code looks like this:
library(dplyr)
library(tidytext)

tot_tf_idf <- v_tweets_total %>%
  unnest_tokens(word, text) %>%
  count(id, word, sort = TRUE)

tot_tf_idf <- tot_tf_idf %>%
  bind_tf_idf(word, id, n)

tot_tf_idf %>%
  arrange(desc(tf_idf)) %>%
  head()
I used the Tweet ID as the document argument in bind_tf_idf, but is this correct, or is there a better way to define the "document"? The tf-idf values look okay for the most part, although stop words have oddly high idf and tf-idf values, probably because of the large number of documents. The reason I ask is that I am trying to calculate cosine similarity between tweets and create a plot that clusters the tweets by text content using Louvain modularity. However, creating the plot is taking a very long time, and I am wondering whether that is because I set it up with the wrong kind of "document".
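To deal with the stop-word issue, would dropping them with tidytext's stop_words lexicon before bind_tf_idf be the right approach? A minimal sketch of what I mean (assuming the default English stop_words data shipped with tidytext):

tot_tf_idf <- v_tweets_total %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%   # remove common English stop words
  count(id, word, sort = TRUE) %>%
  bind_tf_idf(word, id, n)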
My code for creating cosine similarity and the Louvain cluster plot:
library(proxy)
library(igraph)

# Create a sparse document-term matrix weighted by tf-idf
dtm <- cast_sparse(tot_tf_idf, row = id, column = word, value = tf_idf)

# Function for calculating the log-likelihood term of each word in the corpus
log_likelihood_terms <- function(dtm) {
  b <- colSums(dtm)
  b <- ifelse(b == 0, 1e-12, b)
  LLs <- c()
  for (i in 1:nrow(dtm)) {
    a <- dtm[i, ]
    a <- ifelse(a == 0, 1e-12, a)
    c <- sum(a)
    d <- sum(b)
    E1 <- c * (a + b) / (c + d)
    E2 <- d * (a + b) / (c + d)
    LL <- 2 * ((a * log(a / E1)) + (b * log(b / E2)))
    LL <- sum(LL)
    LLs <- c(LLs, LL)
    cat(length(LLs) / nrow(dtm), "\r")  # progress indicator
  }
  names(LLs) <- rownames(dtm)
  LLs <- LLs[order(LLs, decreasing = TRUE)]
  return(LLs)
}
dtm <- as.matrix(dtm)

# Restrict to the 1000 words with the highest log-likelihood
ll_terms <- log_likelihood_terms(t(dtm))
ll_terms <- ll_terms[order(ll_terms, decreasing = TRUE)]
ll_terms <- names(ll_terms[1:1000])

# Cosine similarity between tweets, then an undirected weighted graph
cosine_dtm <- simil(dtm[, ll_terms], method = "cosine")
cosine_dtm <- as.matrix(cosine_dtm)
cosine_net_dtm <- graph_from_adjacency_matrix(cosine_dtm, weighted = TRUE, mode = "undirected")

# Keep only the largest connected component
comps <- components(cosine_net_dtm)$membership
largest_component <- which.max(table(comps))
cosine_net_dtm <- delete_vertices(cosine_net_dtm, which(comps != largest_component))

plot(cosine_net_dtm,
     vertex.size = 5,
     mark.groups = cluster_louvain(cosine_net_dtm),
     vertex.color = "grey80",
     vertex.frame.color = "grey60",
     vertex.label.cex = 0.5,
     vertex.label.color = "black")
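I have also wondered whether the plot is slow simply because the cosine network is nearly fully connected, i.e. there is a weighted edge for almost every pair of tweets. A minimal sketch of what I mean by pruning weak edges before the Louvain step (the 0.1 cutoff is just an arbitrary placeholder, not a recommended value):

# Drop weak edges so the layout and clustering have far fewer edges to process
cosine_net_dtm <- delete_edges(cosine_net_dtm, E(cosine_net_dtm)[weight < 0.1])

Would that be a reasonable workaround, or does the slowness point back to the choice of "document"?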
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow