Regarding TDM of specific keywords
I am working with multiple annual report PDF files and want to extract certain keywords, for example "Finance", "CSR", etc. I wrote the following code to count these keywords in each document:
library(stringr)   # str_* helpers
library(magrittr)  # %>% pipe

# One row per file, one column per keyword
word_count <- matrix(0L, nrow = length(files), ncol = length(keywords))
for (j in seq_along(files)) {
  # Read every page of the PDF and normalise the text
  P1 <- pdftools::pdf_text(pdf = files[j]) %>%
    str_to_lower() %>%
    str_replace_all("\\t", "") %>%
    str_replace_all("\n", " ") %>%
    str_replace_all("[:digit:]", "") %>%
    str_replace_all("[:punct:]", "") %>%
    str_squish()  # collapse repeated spaces and trim both ends
  for (i in seq_along(keywords)) {
    # The text is lower-cased above, so the keyword must be lower-cased too
    word_count[j, i] <- P1 %>% str_count(str_to_lower(keywords[i])) %>% sum()
  }
}
word_count <- as.data.frame(word_count)
rownames(word_count) <- files
colnames(word_count) <- keywords
Now my question is this: I have extracted the required words and their counts, but I want to do further analysis. For example, I want to create a term-document matrix (TDM) restricted to specific keywords, and I want to extract only those phrases that are related to the keywords. Is this possible in R?
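To make the two requests concrete, here is a minimal base R sketch of what a keyword-restricted TDM and keyword-related phrases could look like. It is an illustration under stated assumptions, not a definitive approach: `docs` is a hypothetical stand-in for the cleaned text of each PDF, "phrases" is interpreted as sentences that mention a keyword, and matching is on whole words, which differs from the substring counting that `str_count` does in the code above.

```r
# Hypothetical stand-ins for the text of each PDF (one string per file)
docs <- c(
  report_a = "Finance costs rose this year. Our CSR programme funded two schools, and the CSR budget grew. Revenue was flat.",
  report_b = "The board reviewed CSR policy. Finance and audit teams merged."
)
keywords <- c("finance", "csr")

# Tokenise each document into lower-case words
tokens <- lapply(docs, function(d) {
  words <- strsplit(tolower(d), "[^a-z]+")[[1]]
  words[nzchar(words)]
})

# Term-document matrix restricted to the keywords:
# keywords in rows, documents in columns
keyword_tdm <- vapply(
  tokens,
  function(w) as.integer(table(factor(w, levels = keywords))),
  integer(length(keywords))
)
rownames(keyword_tdm) <- keywords

# Sentences that mention at least one keyword, per document
pattern <- paste0("\\b(", paste(keywords, collapse = "|"), ")\\b")
keyword_sentences <- lapply(docs, function(d) {
  sentences <- trimws(strsplit(d, "(?<=[.!?])\\s+", perl = TRUE)[[1]])
  sentences[grepl(pattern, sentences, ignore.case = TRUE, perl = TRUE)]
})

print(keyword_tdm)
print(keyword_sentences)
```

Note that sentence splitting needs the punctuation that the cleaning step above removes, so this sketch assumes the text is split into sentences before punctuation is stripped. It also assumes at least two keywords, since `vapply` returns a plain vector rather than a matrix when there is only one.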
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
