Concatenating PDF files in memory with PyPDF2

I wish to concatenate (append) a bunch of small PDFs together efficiently in memory, in pure Python. Specifically, a typical case is 500 single-page PDFs, each about 400 kB in size, merged into one. Let's say the PDFs are available as an iterable in memory, say a list:

my_pdfs = [pdf1_fileobj, pdf2_fileobj, ..., pdfn_fileobj]  # type is BytesIO

Here each pdf_fileobj is of type BytesIO, so the base memory usage is about 200 MB (500 PDFs at 400 kB each).
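For reference, such a list could be built by reading the files from disk into BytesIO buffers, for example along these lines (the pdfs/ directory is just a placeholder):

import io
from pathlib import Path

# Hypothetical setup: read each single-page PDF into an in-memory buffer.
pdf_dir = Path("pdfs")  # placeholder directory
my_pdfs = [io.BytesIO(path.read_bytes()) for path in sorted(pdf_dir.glob("*.pdf"))]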

Ideally, I would want the following code to do the concatenation using no more than 400-500 MB of memory in total (including my_pdfs). However, that doesn't seem to be the case: the debugging statement on the last line reports a maximum memory usage of almost 700 MB, and the Mac OS X resource monitor shows about 600 MB allocated when the last line is reached.

Running gc.collect() reduces this to 350 MB (almost too good?). Why do I have to run garbage collection manually to get rid of the merging garbage in this case? I have seen this (probably) cause memory build-up in a slightly different scenario that I'll skip for now.

import io
import resource  # For debugging

from PyPDF2 import PdfFileMerger


def merge_pdfs(iterable):
    """Merge pdfs in memory"""
    merger = PdfFileMerger()
    for pdf_fileobj in iterable:
        merger.append(pdf_fileobj)

    myio = io.BytesIO()
    merger.write(myio)
    merger.close()

    myio.seek(0)
    return myio


my_concatenated_pdf = merge_pdfs(my_pdfs)

# Print the maximum memory usage
print("Memory usage: %s (kB)" % resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)

Question summary

  • Why does the code above need almost 700 MB of memory to merge 200 MB worth of pdfs? Shouldn't 400 MB + overhead be enough? How do I optimize it?
  • Why do I need to run garbage collection manually to get rid of PyPDF2 merging junk when the variables in question should already be out of scope?
  • What about this general approach? Is BytesIO suitable to use in this case? merger.write(myio) does seem to run rather slowly, given that everything happens in RAM.

Thank you!



Solution 1:[1]

Q: Why does the code above need almost 700 MB of memory to merge 200 MB worth of pdfs? Shouldn't 400 MB + overhead be enough? How do I optimize it?

A: Because merger.append() creates a new stream object for each input, and merger.write(myio) then creates yet another stream for the output. Since the original ~200 MB of PDF data is still held in my_pdfs, you end up with roughly 3 × 200 MB in memory at once.


Q: Why do I need to run garbage collection manually to get rid of PyPDF2 merging junk when the variables in question should already be out of scope?

A: It is a known issue in PyPDF2.


Q: What about this general approach? Is BytesIO suitable to use in this case?

A: Considering the memory issues, you might want to try a different approach: merge the files one by one, temporarily saving the result to disk, and clear the already-merged ones from memory as you go (see the sketch below).
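One way to cut the in-memory footprint along these lines is to write the merged result to a temporary file instead of a BytesIO and release the source buffers once writing is done. A rough sketch, using the same legacy PdfFileMerger API as the question (newer pypdf releases have since renamed the class):

import tempfile

from PyPDF2 import PdfFileMerger


def merge_pdfs_to_disk(pdf_fileobjs):
    """Merge PDFs, writing the result to a temporary file instead of RAM."""
    merger = PdfFileMerger()
    for pdf_fileobj in pdf_fileobjs:
        merger.append(pdf_fileobj)

    # Writing straight to disk avoids holding a third ~200 MB copy (the output) in memory.
    out = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    merger.write(out)
    merger.close()
    out.close()

    # Release the source buffers only after write(), since PyPDF2 may still
    # resolve page objects lazily from the original streams until then.
    for pdf_fileobj in pdf_fileobjs:
        pdf_fileobj.close()

    return out.name

The merged file can then be streamed back into memory only if it is actually needed as a BytesIO.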

Solution 2:[2]

The PyMuPDF library may also be a good alternative, given the performance issues of PyPDF2's PdfFileMerger.
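A rough sketch of the same merge with PyMuPDF (imported as fitz); this assumes a reasonably recent release where the snake_case API (insert_pdf, tobytes) is available, while older versions spell these insertPDF and write:

import io

import fitz  # PyMuPDF


def merge_pdfs_pymupdf(pdf_fileobjs):
    """Merge in-memory PDFs with PyMuPDF and return the result as a BytesIO."""
    out = fitz.open()  # new, empty PDF
    for pdf_fileobj in pdf_fileobjs:
        src = fitz.open(stream=pdf_fileobj.getvalue(), filetype="pdf")
        out.insert_pdf(src)  # append all pages of src
        src.close()
    merged = io.BytesIO(out.tobytes())
    out.close()
    return merged

PyMuPDF is generally reported to be much faster at this kind of page copying, but it is worth checking the merged output against your own documents before switching.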

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1: spedy
Solution 2: Guibod