How to iteratively apply a function (pdf_text()) across pdf files in a folder in R?
I have a large folder of pdf documents. I am trying to extract the text from each document iteratively, so that the only input is the folder path. It seems one can approach this with imap/map or with a for loop. Below is an attempt that maps a function over a vector holding the names of all files in the folder.
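library(pdftools)
library(purrr)
# full.names = TRUE so pdf_text() receives a path it can open
files <- list.files(path = "foldername", pattern = "\\.pdf$", full.names = TRUE)
# map() returns a list with one character vector per file (one element per page)
text_list <- files %>% map(function(x) {
  pdf_text(x)
})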
I welcome alternative methods to the same end of extracting the text across all files in a folder.
Solution 1:[1]
You could use sapply followed by rbind to join the results together.
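library(pdftools)
# Match only file names ending in .pdf and return full paths
pdfs <- list.files('foldername', pattern = '\\.pdf$', full.names = TRUE)
# simplify = FALSE keeps a list: one character vector (one element per page) per file
text <- sapply(pdfs, pdf_text, simplify = FALSE)
# Paste each file's pages together so rbind receives equal-length rows
all_text <- do.call(rbind, lapply(text, paste, collapse = "\n"))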
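If the page structure matters, one option is to keep one row per page rather than joining pages together. The following is a minimal sketch, not part of the original answer: pdf_pages_df and its extract argument are hypothetical names, and extract defaults to pdftools::pdf_text, which returns one string per page.

```r
# Hypothetical helper: build a data frame with one row per page of every PDF.
# `extract` is any function that takes a file path and returns a character
# vector with one element per page (pdftools::pdf_text by default).
pdf_pages_df <- function(folder, extract = pdftools::pdf_text) {
  pdfs <- list.files(folder, pattern = "\\.pdf$", full.names = TRUE,
                     ignore.case = TRUE)
  # One character vector per file, one element per page
  pages <- lapply(pdfs, extract)
  n_pages <- lengths(pages)
  data.frame(
    file = rep(basename(pdfs), times = n_pages),  # repeat name once per page
    page = sequence(n_pages),                     # 1..n within each file
    text = as.character(unlist(pages, use.names = FALSE)),
    stringsAsFactors = FALSE
  )
}
```

Calling pdf_pages_df("foldername") would then give columns file, page and text, which can be filtered or re-joined per file later.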
Solution 2:[2]
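Here's a more concise way of assigning your pdf text to a single vector using map_chr. Because map_chr expects exactly one string per file and pdf_text returns one string per page, the pages of each file are pasted together first:
library(pdftools)
library(purrr)
files <- list.files(path = "foldername", pattern = "\\.pdf$", full.names = TRUE)
text_vector <- map_chr(files, ~ paste(pdf_text(.x), collapse = "\n"))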
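With a large folder, an error raised by the extractor on a single file would stop the whole run. A possible guard, sketched here with base R's tryCatch, records NA for that file and carries on. read_pdfs_safely and its extract argument are hypothetical names introduced for illustration, and extract is assumed to default to pdftools::pdf_text.

```r
# Hypothetical helper: one string per file, NA where extraction failed.
# `extract` is any function that takes a path and returns one string per page.
read_pdfs_safely <- function(files, extract = pdftools::pdf_text) {
  vapply(files, function(f) {
    tryCatch(
      # Join the pages of one file into a single string
      paste(extract(f), collapse = "\n"),
      error = function(e) {
        # Report the failure but keep processing the remaining files
        warning("Could not read ", f, ": ", conditionMessage(e), call. = FALSE)
        NA_character_
      }
    )
  }, character(1))
}
```

The result is a character vector named by file path, so is.na() on it shows which files failed.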
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Andrew Chisholm |
| Solution 2 | Rory S |
