Parallelized TermDocumentMatrix
I have a lot of spare hardware sitting idle when running this code:
# Assumes library(tm) for VCorpus/tm_map/TermDocumentMatrix and
# library(textstem) for lemmatize_strings; remove_doublespace, remove_urls,
# remove_www and remove_emoticons are user-defined helper functions.
get_cleaned_tdm <- function(datatable, min_freq, max_freq_perc) {
  dt <- datatable
  corpus <- VCorpus(DataframeSource(dt))
  # normalize case, lemmatize, and strip noise from each document
  corpus_clean <- tm_map(corpus, content_transformer(tolower))
  corpus_clean <- tm_map(corpus_clean, content_transformer(lemmatize_strings))
  corpus_clean <- tm_map(corpus_clean, content_transformer(remove_doublespace))
  corpus_clean <- tm_map(corpus_clean, content_transformer(remove_urls))
  corpus_clean <- tm_map(corpus_clean, content_transformer(remove_www))
  corpus_clean <- tm_map(corpus_clean, content_transformer(remove_emoticons))
  corpus_clean <- tm_map(corpus_clean, removeWords, stopwords("en"))
  # keep only terms occurring in at least min_freq documents and in at most
  # max_freq_perc (a proportion, e.g. 1 = 100%) of all documents
  tdm <- TermDocumentMatrix(corpus_clean,
                            control = list(removeNumbers = TRUE,
                                           removePunctuation = TRUE,
                                           bounds = list(global = c(min_freq,
                                             floor(length(corpus_clean) * max_freq_perc)))))
  #tdm <- weightBin(tdm)
  return(tdm)
}
#### final tdm
tdm <- get_cleaned_tdm(dt1, 30, 1)
Since it takes a long time, I wonder whether it could be parallelized, at least in part.
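For reference, one possible direction (an untested sketch, not part of the original question): each `tm_map` call is embarrassingly parallel over documents, so on Unix-alikes it can be fanned out with `parallel::mclapply`. The helper below, `par_tm_map`, is a hypothetical name; it assumes the corpus is a `VCorpus`, which stores its documents as a plain list in `$content`.

```r
library(tm)
library(parallel)

# Sketch: apply a tm transformation to every document in parallel.
# par_tm_map is a hypothetical helper, not part of tm; mclapply forks
# workers, so mc.cores > 1 only has an effect on Unix-alikes (on Windows
# it falls back to sequential execution with mc.cores = 1).
par_tm_map <- function(corpus, fun, ..., cores = max(1, detectCores() - 1)) {
  corpus$content <- mclapply(corpus$content, fun, ..., mc.cores = cores)
  corpus
}

# Drop-in usage mirroring the cleaning steps above, e.g.:
# corpus_clean <- par_tm_map(corpus, content_transformer(tolower))
# corpus_clean <- par_tm_map(corpus_clean, content_transformer(lemmatize_strings))
# corpus_clean <- par_tm_map(corpus_clean, removeWords, stopwords("en"))
```

The final `TermDocumentMatrix` call itself is a single pass in tm and is not parallelized by this approach; the cleaning steps are usually where most of the time goes.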
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
