Write specific pages from multiple pdf files to a new pdf file
I have multiple PDF files, and I want to extract a specific group of pages from each one; the set of pages is different for each file. I have created a dictionary whose keys are the PDF file names and whose values are the lists of pages to extract from each file. I intend to extract the given pages from each PDF and write them all to one new PDF file so that I can run data extraction on that final file. I have tried PyPDF4 as well as FPDF, but with no success so far: I get either a large PDF with blank pages, a PDF with just 1 or 2 pages extracted, or an error that the PDF object cannot be found. I am hoping for some guidance on where my approach goes wrong. Below is my code:
import PyPDF4
from PyPDF4 import PdfFileReader, PdfFileWriter
for pdf,pgs in dic_11_1.items():
    pdf=list(dic_11_1.keys())
    pgs=list(dic_11_1.values())
    for i in range(0,len(pdf)):
        pages = pgs[i]
        object = open(pdf[i],'rb')
        pdfinput=PyPDF4.PdfFileReader(object,'rb')
        if pdfinput.isEncrypted:
            pdfinput.decrypt('')
        else:
            pdfinput
        for p in pages:
            page=pdfinput.getPage(p)
            pdf_writer=PyPDF4.PdfFileWriter()
            pdf_writer.addPage(page)
            with open('F111.pdf',mode='wb') as output:
                pdf_writer.write(output)
The error that I get is 'PdfReadError: Could not find object.'
When I try FPDF with the following code, it runs a long time and gives me a large empty pdf file:
from fpdf import FPDF
import os
for pdf,pgs in dic_11_1.items():
    pdf_in=open(pdf,'rb')
    inputpdf=PdfFileReader(pdf_in,'rb')
    if inputpdf.isEncrypted:
        inputpdf.decrypt('')
    else:
        inputpdf
    for p in pgs:
        content=inputpdf.getPage(p).extractText()
        pdf = FPDF('P','mm','A4')
        pdf.add_page()
        pdf.set_font("arial", size = 10)
        for text in content:
            text2=text.encode('latin-1', 'replace').decode('latin-1')
            pdf.write(10,text2)
            pdf.ln(8)
        pdf.close()
        return_byte_string=pdf.output('F_11_1.pdf','S').encode('latin-1')
        pdf_file=open('F_11_1.pdf','wb')
        pdf_file.write(return_byte_string)
        pdf_file.close()
Any guidance would be greatly appreciated. Thank you in advance.
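Two patterns in the attempts above are worth checking, and both can be shown without any PDF library. First, if a writer is created inside the page loop, as the PyPDF4 attempt appears to do, each pass starts from an empty writer, so the output can only ever hold the last page. Second, `extractText()` returns a single string, so `for text in content` visits one character at a time rather than one line at a time, which would explain a long run and an oversized file. In this minimal sketch, `FakeWriter` is a hypothetical stand-in that only collects labels; it is not a PyPDF4 or FPDF class.

```python
class FakeWriter:
    """Hypothetical stand-in for a PDF writer: it only collects page labels."""

    def __init__(self):
        self.pages = []

    def add_page(self, page):
        self.pages.append(page)


pages = ["page-0", "page-1", "page-2"]

# Writer created inside the loop: every pass starts from an empty writer,
# so the writer that survives the loop holds a single page.
for page in pages:
    writer_inside = FakeWriter()
    writer_inside.add_page(page)

# Writer created once, before the loop: pages accumulate.
writer_outside = FakeWriter()
for page in pages:
    writer_outside.add_page(page)

print(writer_inside.pages)   # ['page-2']
print(writer_outside.pages)  # ['page-0', 'page-1', 'page-2']

# Iterating over a string yields single characters, not lines.
content = "Total 42\nNet 40"
per_character = [piece for piece in content]
per_line = content.splitlines()
print(len(per_character), len(per_line))  # 15 2
```

The same reasoning applies to the output file: opening it with mode `'wb'` inside the loop truncates it on every pass.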
The solution provided by @SUTerliakov was great, but it only wrote the last page or last document from the dictionary's lists of pages. A minor indentation change in the code resolved that and got me all my data. Thanks again @SUTerliakov for starting me on the correct path! Here is your adjusted code:
from PyPDF4 import PdfFileReader, PdfFileWriter

# One writer for the whole run, created before the loop so pages accumulate.
pdf_writer = PdfFileWriter()
open_files = []
try:
    for filename, pgs in dic_11_1.items():
        # Source files must stay open until the writer has written the output.
        src = open(filename, 'rb')
        open_files.append(src)
        pdfinput = PdfFileReader(src)
        if pdfinput.isEncrypted:
            pdfinput.decrypt('')
        print(f'Extracting relevant pages from {filename} to central repository')
        for p in pgs:
            print(f'{filename} pg{p}')
            pdf_writer.addPage(pdfinput.getPage(p))  # p is a zero-based index
    # Write once, after every file has been processed.
    print('Writing all collected pages to central file')
    with open('F_11_1.pdf', 'wb') as stream:
        pdf_writer.write(stream)
finally:
    print('Closing source files...')
    for f in open_files:
        f.close()
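The same approach can be wrapped in a reusable function. The sketch below is a variant under stated assumptions, not the code from the original answer: `merge_selected_pages`, `to_zero_based`, and the `one_based` flag are hypothetical names introduced for illustration. It relies on `getPage` taking zero-based indices, so a dictionary that stores page numbers as shown in a PDF viewer (starting at 1) would need converting first. `contextlib.ExitStack` keeps every source file open until the output has been written, then closes them all.

```python
from contextlib import ExitStack


def to_zero_based(page_numbers, page_count):
    """Convert 1-based page numbers to 0-based indices, rejecting out-of-range values."""
    indices = []
    for number in page_numbers:
        if not 1 <= number <= page_count:
            raise ValueError(f"page {number} is outside 1..{page_count}")
        indices.append(number - 1)
    return indices


def merge_selected_pages(page_map, output_path, one_based=False):
    """Copy the pages listed in page_map ({filename: [pages]}) into output_path.

    Returns the number of pages added to the output.
    """
    # Imported here so that to_zero_based stays usable without PyPDF4 installed.
    from PyPDF4 import PdfFileReader, PdfFileWriter

    writer = PdfFileWriter()
    added = 0
    with ExitStack() as stack:
        for filename, pages in page_map.items():
            src = stack.enter_context(open(filename, "rb"))
            reader = PdfFileReader(src)
            if reader.isEncrypted:
                reader.decrypt("")  # assumes an empty user password, as in the question
            if one_based:
                pages = to_zero_based(pages, reader.getNumPages())
            for index in pages:
                writer.addPage(reader.getPage(index))
                added += 1
        # Write while the sources are still open: page content is read from
        # the source streams when the output is written.
        with open(output_path, "wb") as out:
            writer.write(out)
    return added
```

With the dictionary from the question, a call would look like `merge_selected_pages(dic_11_1, 'F_11_1.pdf')`, or with `one_based=True` if the stored page numbers start at 1.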
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
