Regarding TDM of specific keywords

I am working with multiple annual report PDF files and want to extract certain keywords, for example "Finance", "CSR", etc. I wrote the following code to count these keywords in each document:

library(pdftools)
library(stringr)  # also provides the %>% pipe

# One row per file, one column per keyword
word_count <- matrix(0L, nrow = length(files), ncol = length(keywords))

for (j in seq_along(files)) {
  # pdf_text() returns one string per page; clean every page the same way
  P1 <- pdftools::pdf_text(pdf = files[j]) %>%
    str_to_lower() %>%
    str_replace_all("[[:digit:]]", "") %>%
    str_replace_all("[[:punct:]]", "") %>%
    str_squish()  # collapse tabs, newlines and repeated spaces, then trim

  for (i in seq_along(keywords)) {
    # The text is lower-cased above, so lower-case the keyword as well,
    # then sum the matches over all pages of the file
    word_count[j, i] <- P1 %>% str_count(str_to_lower(keywords[i])) %>% sum()
  }
}

word_count <- as.data.frame(word_count)
rownames(word_count) <- files
colnames(word_count) <- keywords

Now my question: I have extracted the required words and their counts, but I want to do further analysis. For example, I want to create a term-document matrix (TDM) restricted to the specific keywords, and to extract only the phrases related to those keywords. Is this possible in R?
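
To make the goal concrete, here is a minimal sketch of what such an analysis might look like, using only stringr. The `docs` vector is a hypothetical stand-in for the text of each PDF, and `keyword_sentences()` is an illustrative helper, not part of any package. It assumes punctuation is still present in the text, since sentence splitting relies on it, so it would have to run before the punctuation-stripping step above.

```r
library(stringr)

# Hypothetical stand-in for the text of each PDF (one string per file)
docs <- c(
  report_a = "Finance improved this year. Our CSR programme expanded. Staff numbers were stable.",
  report_b = "The CSR committee met twice. Finance and CSR budgets were merged."
)
keywords <- c("finance", "csr")

# Term-document matrix restricted to the keywords:
# rows are terms, columns are documents
tdm <- matrix(0L, nrow = length(keywords), ncol = length(docs),
              dimnames = list(keywords, names(docs)))
for (k in keywords) {
  # \\b word boundaries keep "finance" from matching inside "refinance"
  tdm[k, ] <- str_count(str_to_lower(docs), paste0("\\b", k, "\\b"))
}

# Illustrative helper: return the sentences of one text that mention a keyword
keyword_sentences <- function(text, word) {
  # Split after sentence-ending punctuation followed by whitespace
  sentences <- str_trim(unlist(str_split(text, "(?<=[.!?])\\s+")))
  sentences[str_detect(str_to_lower(sentences), paste0("\\b", word, "\\b"))]
}

# One character vector of matching sentences per document
csr_phrases <- lapply(docs, keyword_sentences, word = "csr")

print(tdm)
print(csr_phrases)
```

If the tm package is already in use, its `TermDocumentMatrix()` accepts a `dictionary` entry in its `control` list that restricts the matrix to a given set of terms; that route is not shown here.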



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
