Merge PDF pages to 1 file without generating single page files
The goal is to take a set of JPG/TIF images and convert them into a single text-searchable PDF. I am using Python's PyPDF2 and pytesseract to accomplish this, but I cannot find a way to combine the pages without first saving each page as its own PDF. Some of these sets run to 1,000-10,000 pages, so saving each page individually is no longer feasible. Here's what I have so far:
# Convert each image to a searchable PDF
for fileset in filesets:
    merger = PdfFileMerger()
    page_path = fr".\output\pages"
    for file in fileset:
        # Load image, read with pytesseract
        path = os.path.join(download_location, file)
        img = cv2.imread(path, 1)
        result = pytesseract.image_to_pdf_or_hocr(img, lang="eng", config=tessdata_dir_config)
        # Save result as PDF
        with open(os.path.join(path_out, getfilename.findall(file)[0]) + ".pdf", "w+b") as f:
            f.write(bytearray(result))
This works fine for single pages, and from here I could merge the pages and save them as one document, like this:
# pdfs is a list of all the single-page PDFs
for page in pdfs:
    merger.append(page)
merger.write(fr".\output\{FILE}.pdf")
merger.close()
del merger
# Get rid of single-page files
for page in pdfs:
    os.remove(page)
This produces the text-searchable PDFs as desired, but those individual page files are going to destroy my memory. I've tried appending the result objects directly to merger, which raises AttributeError: 'bytearray' object has no attribute 'seek'. I've also tried reading the result objects as PDFs with PyPDF2.PdfFileReader() and got a similar error. Any ideas? My gut feeling is that there is a quick solution involving some sort of type conversion, but I rarely work with PDFs.
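One possible direction, offered as a sketch rather than a confirmed fix: the 'seek' error suggests the merger wants a file-like object, and io.BytesIO wraps raw bytes in exactly that interface. The helper names append_pdf_bytes and build_searchable_pdf below are hypothetical, as are the image_paths and out_path parameters. The sketch assumes the older PyPDF2 API used in the question; in PyPDF2 3.x the class is, as far as I know, named PdfMerger instead of PdfFileMerger.

```python
import io


def append_pdf_bytes(merger, pdf_bytes):
    """Append one in-memory PDF to `merger` without writing it to disk.

    `merger` is anything with an append(fileobj) method, such as
    PyPDF2's PdfFileMerger. BytesIO gives the raw bytes the
    seek()/read() interface that a plain bytearray lacks.
    """
    stream = io.BytesIO(pdf_bytes)
    merger.append(stream)
    return stream


def build_searchable_pdf(image_paths, out_path, lang="eng", config=""):
    """OCR each image and merge the resulting pages into one PDF.

    Hypothetical wrapper around the loop from the question.
    """
    # Imported lazily so append_pdf_bytes can be used on its own.
    import cv2
    import pytesseract
    from PyPDF2 import PdfFileMerger

    merger = PdfFileMerger()
    for path in image_paths:
        img = cv2.imread(path, 1)
        # Returns the single-page PDF as bytes.
        page = pytesseract.image_to_pdf_or_hocr(img, lang=lang, config=config)
        append_pdf_bytes(merger, page)
    with open(out_path, "wb") as f:
        merger.write(f)
    merger.close()
```

Note that every page stays in RAM until merger.write() runs, so for sets of several thousand pages it may be worth merging in batches and watching memory use.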
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
