How to iteratively apply a function (pdf_text()) across PDF files in a folder in R?

I have a large folder of PDF documents. I am trying to extract the text from each document iteratively, so that the only input is the folder path. It seems one can approach this with either map/imap or a for loop. Below is an attempt at mapping a function over a vector that holds all the files in the folder.

files <- list.files(path = "foldername", pattern = "*.pdf")

text_vector = c()

df <- files %>% map(function(x) {
    text <- pdf_text(x))
    text_vector <- append(text)})

I welcome alternative methods to the same end of extracting the text across all files in a folder.


Solution 1:[1]

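You could use sapply followed by rbind to join the results together. Passing simplify = FALSE makes sapply always return a list (named by file path), which is what do.call expects. Note that rbind gives a clean matrix (one row per file, one column per page) only when every file has the same number of pages; otherwise shorter files are recycled to fill the row.

library(pdftools)
# full.names = TRUE returns paths that pdf_text() can open from any working directory
pdfs <- list.files('foldername', pattern = '\\.pdf$', full.names = TRUE)
# pdf_text() returns one string per page
text <- sapply(pdfs, pdf_text, simplify = FALSE)
all_text <- do.call(rbind, text)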
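
If the PDFs have different page counts, a long table with one row per page may be easier to work with than a matrix. The following is only a sketch of that idea in base R: the helper name pdf_pages_df and its reader argument are made up for illustration. reader defaults to pdftools::pdf_text, and exists so the function can be tried with a stand-in reader when no PDFs are at hand.

```r
# Hypothetical helper: one row per page for every PDF in `folder`.
# `reader` is an assumed hook; by default it is pdftools::pdf_text,
# which returns a character vector with one element per page.
pdf_pages_df <- function(folder, reader = pdftools::pdf_text) {
  paths <- list.files(folder, pattern = "\\.pdf$", full.names = TRUE,
                      ignore.case = TRUE)
  # lapply always returns a list, whatever the page counts are
  pages <- lapply(paths, reader)
  n_pages <- lengths(pages)
  data.frame(
    file = rep(basename(paths), times = n_pages),  # repeat each name once per page
    page = sequence(n_pages),                      # 1..n within each file
    text = as.character(unlist(pages)),
    stringsAsFactors = FALSE
  )
}
```

Calling pdf_pages_df('foldername') would then give columns file, page and text, which can be filtered or collapsed per file as needed.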

Solution 2:[2]

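Here's a more concise way of assigning your pdf text to a single vector using map_chr. Because map_chr needs exactly one string per file and pdf_text returns one string per page, each file's pages are pasted together first:

library(purrr)
library(pdftools)
files <- list.files(path = "foldername", pattern = "\\.pdf$", full.names = TRUE)

text_vector <- map_chr(files, ~ paste(pdf_text(.x), collapse = "\n"))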
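
With a large folder, a single unreadable PDF would otherwise stop the whole run. Below is a rough base R sketch of the same one-vector result, using vapply in place of map_chr and tryCatch to record NA for files that fail. The helper name pdf_text_per_file and its reader argument are hypothetical. reader defaults to pdftools::pdf_text and can be swapped for a stand-in when experimenting.

```r
# Hypothetical helper: one string per PDF, NA for files that cannot be read.
# `reader` is an assumed hook; by default it is pdftools::pdf_text.
pdf_text_per_file <- function(folder, reader = pdftools::pdf_text) {
  paths <- list.files(folder, pattern = "\\.pdf$", full.names = TRUE,
                      ignore.case = TRUE)
  read_one <- function(path) {
    tryCatch(
      # join the per-page strings into a single string for this file
      paste(reader(path), collapse = "\n"),
      error = function(e) {
        warning("Could not read ", basename(path), ": ", conditionMessage(e),
                call. = FALSE)
        NA_character_
      }
    )
  }
  # vapply enforces exactly one character value per file
  out <- vapply(paths, read_one, character(1), USE.NAMES = FALSE)
  names(out) <- basename(paths)
  out
}
```

The result is a character vector named by file, so text_vector[["report.pdf"]] style lookups work (the file name here is only an example), and is.na() shows which files failed.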

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Andrew Chisholm
Solution 2 Rory S