Concatenating PDF files in memory with PyPDF2
I wish to concatenate (append) a bunch of small PDFs together efficiently, in memory and in pure Python. A typical case is 500 single-page PDFs, each about 400 kB in size, to be merged into one. Let's say the PDFs are available as an iterable in memory, for example a list:
my_pdfs = [pdf1_fileobj, pdf2_fileobj, ..., pdfn_fileobj] # type is BytesIO
Where each pdf_fileobj is of type BytesIO. The base memory usage is then about 200 MB (500 PDFs × 400 kB).
Ideally, I would want the following code to concatenate using no more than 400-500 MB of memory in total (including my_pdfs). However, that does not seem to be the case: the debugging statement on the last line reports a maximum memory usage of almost 700 MB, and the macOS resource monitor shows about 600 MB of allocated memory when the last line is reached.
Running gc.collect() reduces this to 350 MB (almost too good?). Why do I have to run garbage collection manually to get rid of the merging garbage in this case? I have seen this (probably) cause memory build-up in a slightly different scenario that I'll skip for now.
import io
import resource  # For debugging
from PyPDF2 import PdfFileMerger

def merge_pdfs(iterable):
    """Merge pdfs in memory"""
    merger = PdfFileMerger()
    for pdf_fileobj in iterable:
        merger.append(pdf_fileobj)
    myio = io.BytesIO()
    merger.write(myio)
    merger.close()
    myio.seek(0)
    return myio

my_concatenated_pdf = merge_pdfs(my_pdfs)

# Print the maximum memory usage
print("Memory usage: %s (kB)" % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
Question summary
- Why does the code above need almost 700 MB of memory to merge 200 MB worth of PDFs? Shouldn't 400 MB plus some overhead be enough? How do I optimize it?
- Why do I need to run garbage collection manually to get rid of PyPDF2 merging junk when the variables in question should already be out of scope?
- What about this general approach? Is BytesIO suitable to use in this case? merger.write(myio) does seem to run rather slowly, given that everything happens in RAM.
Thank you!
Solution 1:[1]
Q: Why does the code above need almost 700 MB of memory to merge 200 MB worth of pdfs? Shouldn't 400 MB + overhead be enough? How do I optimise it?
A: Because merger.append() buffers the appended documents in a new internal stream, and merger.write(myio) then serialises the merged result into yet another in-memory stream, while the original 200 MB of input PDFs is still held in my_pdfs. That adds up to roughly 3 × 200 MB.
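If you want to see roughly where the extra memory goes, the sketch below (my own addition, not part of the original answer) uses tracemalloc to compare readings after the append phase and after the write phase, with the same legacy PdfFileMerger API as the question:

import io
import tracemalloc
from PyPDF2 import PdfFileMerger

def merge_pdfs_traced(pdf_streams):
    """Merge PDFs in memory and report traced allocations per phase."""
    tracemalloc.start()
    merger = PdfFileMerger()
    for stream in pdf_streams:
        merger.append(stream)
    after_append, _ = tracemalloc.get_traced_memory()

    myio = io.BytesIO()
    merger.write(myio)           # serialises the merged document into myio
    merger.close()
    after_write, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    print("after append: %.0f MB" % (after_append / 1e6))
    print("after write:  %.0f MB (peak: %.0f MB)" % (after_write / 1e6, peak / 1e6))
    myio.seek(0)
    return myio

Note that tracemalloc only tracks allocations made through Python's allocator, so treat the numbers as an indication rather than an exact accounting.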
Q: Why do I need to run garbage collection manually to get rid of PyPDF2 merging junk when the variables in question should already be out of scope?
A: It is a known issue in PyPDF2.
Q: What about this general approach? Is BytesIO suitable to use in this case?
A: Considering the memory issues, you might want to try a different approach: merge the PDFs one by one, temporarily saving the running result to disk and clearing the already merged sources from memory (see the sketch below).
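As a concrete illustration of that suggestion, here is a minimal sketch (mine, not from the original answer) that keeps the running result on disk and folds one in-memory PDF in at a time; the temporary file names and the choice to close each source stream after merging are assumptions:

import os
import tempfile
from PyPDF2 import PdfFileMerger

def merge_incrementally(pdf_streams):
    """Keep the running result on disk so that only one source PDF plus the
    current result needs to be held at any time."""
    workdir = tempfile.mkdtemp()
    result_path = os.path.join(workdir, "merged.pdf")
    have_result = False
    for stream in pdf_streams:
        merger = PdfFileMerger()
        if have_result:
            merger.append(result_path)   # previous result, read back from disk
        merger.append(stream)            # next in-memory PDF
        tmp_path = result_path + ".tmp"
        with open(tmp_path, "wb") as fh:
            merger.write(fh)
        merger.close()
        os.replace(tmp_path, result_path)
        stream.close()                   # release the source BytesIO
        have_result = True
    return result_path

The trade-off is repeated disk I/O, since the running result is re-read on every iteration, so this favours low memory usage over speed.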
Solution 2:[2]
The PyMuPDF library may also be a good alternative to the performance issues of PdfFileMerger from PyPDF2.
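For reference, the same merge with PyMuPDF might look roughly like the sketch below; the method names (insert_pdf, tobytes) follow recent PyMuPDF releases and may differ in older versions, so treat this as an assumption rather than a drop-in replacement:

import io
import fitz  # PyMuPDF

def merge_with_pymupdf(pdf_streams):
    """Append every in-memory PDF to a new, empty PyMuPDF document."""
    merged = fitz.open()                                       # empty PDF
    for stream in pdf_streams:
        src = fitz.open(stream=stream.getvalue(), filetype="pdf")
        merged.insert_pdf(src)                                 # copy all pages
        src.close()
    data = merged.tobytes()                                    # serialised result
    merged.close()
    return io.BytesIO(data)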
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | spedy |
| Solution 2 | Guibod |
