Parallelized TermDocumentMatrix
I have a lot of spare hardware sitting idle when running this code:
# Assumes library(tm) for VCorpus/tm_map/TermDocumentMatrix and
# library(textstem) for lemmatize_strings; remove_doublespace, remove_urls,
# remove_www and remove_emoticons are user-defined helper functions.
get_cleaned_tdm <- function(datatable, min_freq, max_freq_perc) {
  dt <- datatable
  corpus <- VCorpus(DataframeSource(dt))
  # normalize case, lemmatize, and strip noise from each document
  corpus_clean <- tm_map(corpus, content_transformer(tolower))
  corpus_clean <- tm_map(corpus_clean, content_transformer(lemmatize_strings))
  corpus_clean <- tm_map(corpus_clean, content_transformer(remove_doublespace))
  corpus_clean <- tm_map(corpus_clean, content_transformer(remove_urls))
  corpus_clean <- tm_map(corpus_clean, content_transformer(remove_www))
  corpus_clean <- tm_map(corpus_clean, content_transformer(remove_emoticons))
  corpus_clean <- tm_map(corpus_clean, removeWords, stopwords("en"))
  # keep only terms occurring in at least min_freq documents and in at most
  # max_freq_perc (a proportion, e.g. 1 = 100%) of all documents
  tdm <- TermDocumentMatrix(corpus_clean,
                            control = list(removeNumbers = TRUE,
                                           removePunctuation = TRUE,
                                           bounds = list(global = c(min_freq,
                                             floor(length(corpus_clean) * max_freq_perc)))))
  #tdm <- weightBin(tdm)
  return(tdm)
}
#### final tdm
tdm <- get_cleaned_tdm(dt1, 30, 1)
Since it takes a long time, I wonder whether it could be parallelized, at least in part.
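For reference, one possible direction (an untested sketch, not part of the original question): each `tm_map` call is embarrassingly parallel over documents, so on Unix-alikes it can be fanned out with `parallel::mclapply`. The helper below, `par_tm_map`, is a hypothetical name; it assumes the corpus is a `VCorpus`, which stores its documents as a plain list in `$content`.

```r
library(tm)
library(parallel)

# Sketch: apply a tm transformation to every document in parallel.
# par_tm_map is a hypothetical helper, not part of tm; mclapply forks
# workers, so mc.cores > 1 only has an effect on Unix-alikes (on Windows
# it falls back to sequential execution with mc.cores = 1).
par_tm_map <- function(corpus, fun, ..., cores = max(1, detectCores() - 1)) {
  corpus$content <- mclapply(corpus$content, fun, ..., mc.cores = cores)
  corpus
}

# Drop-in usage mirroring the cleaning steps above, e.g.:
# corpus_clean <- par_tm_map(corpus, content_transformer(tolower))
# corpus_clean <- par_tm_map(corpus_clean, content_transformer(lemmatize_strings))
# corpus_clean <- par_tm_map(corpus_clean, removeWords, stopwords("en"))
```

The final `TermDocumentMatrix` call itself is a single pass in tm and is not parallelized by this approach; the cleaning steps are usually where most of the time goes.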
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
