How to join multiple PDF pages into a single page

I have a PDF with 4 pages. I want to create another PDF in which the pages are placed one after another (vertical alignment) on a single page. Which command-line tool can be used for that?



Solution 1:[1]

If you use a Unix-like operating system, there is pdfjam, which wraps its LaTeX backend in a simple command:

pdfjam --nup 1x4,landscape input.pdf

EDIT: That exact command recently gave me trouble with pdfjam. This one worked:

cat input.pdf | pdfjam --nup 1x4 --landscape --outfile out.pdf
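
The 1x4 grid in the commands above is tied to a four-page input. A minimal sketch of deriving the grid from the page count instead, assuming pdfinfo (from poppler-utils) is installed alongside pdfjam. The function names nup_for_pages and stack_pdf_pages are made up for this illustration and are not part of pdfjam; the other flags simply mirror the command above.

```shell
# Hypothetical helpers (illustration only, not part of pdfjam).

# Print the --nup value for N pages in one column: 1 column, N rows.
nup_for_pages() {
    printf '1x%d\n' "$1"
}

# Put every page of the PDF in $1 onto a single page written to $2.
# Assumes pdfinfo and pdfjam are on PATH.
stack_pdf_pages() {
    pages=$(pdfinfo "$1" | awk '/^Pages:/ {print $2}')
    pdfjam --nup "$(nup_for_pages "$pages")" --landscape --outfile "$2" "$1"
}
```

With these definitions loaded, stack_pdf_pages input.pdf out.pdf should behave like the four-page command for any page count, although very long documents will shrink each page considerably.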

Solution 2:[2]

I recently had a similar problem. I needed to develop a solution to preprocess a PDF to better identify the tables using the tabula-py package. Here's the step-by-step solution to my problem:

  1. Remove header and footer from all n pages;
  2. Split the PDF into n files containing 1 single page each;
  3. Crop the single page from the n files according to its bounding box;
  4. Merge the n files into 1 single PDF containing 1 page, keeping the order;
  5. Read and preprocess tables from the text-based PDF using tabula-py.

In my case, step 3 of the process can generate files with different dimensions. When using the pdfjam command I had problems aligning the pages in step 4, even using the --pagetemplate parameter. For me the vertical alignment was the worst.

Fortunately, I was able to solve this page alignment problem using a LaTeX based approach. Here is the base source code I used -- put all files in the same directory.

The answer to this post's question is in the "Shell Script" section, starting from the line "Merging the 'n' pages into a single one using LaTeX...".

Requirements:

I tested this solution on Debian Linux. On Windows, you can install Debian via WSL, for example. On Debian or Ubuntu, install the dependencies with:

sudo apt update
sudo apt install texlive-extra-utils texlive-latex-extra poppler-utils ghostscript default-jdk -y
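
Before running anything, it can help to confirm that the installed packages actually put the expected commands on the PATH. The following is a sketch, not part of the original answer: missing_tools and REQUIRED_TOOLS are hypothetical names, and the list contains the external commands that the shell script in this answer invokes.

```shell
# Hypothetical helper: print each given command name that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# External commands used by the preprocessing script in this answer.
REQUIRED_TOOLS="pdfcrop pdftocairo pdfinfo pdfjam gs latex dvipdfm"
```

Running missing_tools $REQUIRED_TOOLS (unquoted on purpose, so the list is split into words) prints nothing when every tool is available.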

As a result of PDF preprocessing, the final file may have reduced dimensions, especially in width. Even though it is vectorized, reducing the size of the PDF has a negative impact on the quality of extracting tables using the tabula-py tool.

To enlarge the final PDF while keeping its aspect ratio, we can use the cpdf tool. On Linux, copy the cpdf binary into the same directory as the shell script, then make it executable:

chmod +x cpdf

LaTeX template:

Create a file named pdf_merge_template.tex with the following content:

\documentclass[dvipdfmx]{article}

\usepackage[margin=0in]{geometry}
\usepackage{pdfpages}

\begin{document}

\centering<PDF-PAGES>

\end{document}
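
The shell script replaces the <PDF-PAGES> placeholder with one \includegraphics line per cropped page. As an illustration of what gets generated, here is a sketch that prints one such line; include_line is a hypothetical helper, and the width and file name in the usage note are example values only.

```shell
# Hypothetical helper: print the LaTeX line that places one cropped page.
# $1 is the width as a fraction of \textwidth, $2 is the path to the page PDF.
# In a printf format string, \\ prints a single backslash.
include_line() {
    printf '    \\includegraphics[width=%s\\textwidth]{%s} \\\\ \\vspace{-0.03cm}\n' "$1" "$2"
}
```

For a four-page input the script computes a relative width of 0.25, so include_line 0.25 page_1.pdf prints the \includegraphics line for page_1.pdf at width=0.25\textwidth, followed by a LaTeX line break and a small negative \vspace.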

Shell script:

Create a file named pdf-preprocessing.sh with the following content:

#!/bin/bash

# References
# ----------
# [PDF preprocessing] https://stackoverflow.com/a/71802078/16109419

PDF_MERGE_TEMPLATE_FILENAME="pdf_merge_template.tex"
PDF_MERGE_TEMPLATE_PAGES_MACRO="<PDF-PAGES>"
CPDF_FILENAME="cpdf"  # Binary file of "Coherent PDF" tool.

INPUT_FILEPATH=$1
OUTPUT_FILEPATH=$2

if [ "$#" -eq  "2" ]; then
    echo "Starting PDF preprocessing..."
else
    echo "ERROR: Wrong arguments ('INPUT_FILEPATH', 'OUTPUT_FILEPATH')!" >&2
    exit 1
fi

TEMP_DIR_PATH=$(mktemp -d)
echo "Setting up working directory: ${TEMP_DIR_PATH}"

THIS_FILENAME=$(readlink -f "$0")
THIS_DIR=$(dirname "$THIS_FILENAME")
PDF_MERGE_TEMPLATE_PATH="${THIS_DIR}/${PDF_MERGE_TEMPLATE_FILENAME}"
CPDF_BIN_PATH="${THIS_DIR}/${CPDF_FILENAME}"

echo "Removing header and footer from all pages..."

# Removing header and footer from PDF pages based on margin settings (e.g., '5 -75 5 -25'):
pdfcrop --margins '5 -75 5 -25' "${INPUT_FILEPATH}" "${TEMP_DIR_PATH}/input-tmp.pdf"

# Deleting the textual content of the cut parts:
pdftocairo "${TEMP_DIR_PATH}/input-tmp.pdf" "${TEMP_DIR_PATH}/input.pdf" -pdf

# Counting the number of pages from PDF:
NUM_PAGES=$(pdfinfo "${TEMP_DIR_PATH}/input.pdf" | awk '/^Pages:/ {print $2}')
REL_PAGE_WIDTH=$(python3 -c "print(1 / ${NUM_PAGES})")
PAGES_LIST=""

echo "Cropping each page based on its specific bounding box..."
for i in $(seq 1 $NUM_PAGES)
do
    # Splitting current page:
    pdfjam "${TEMP_DIR_PATH}/input.pdf" $i --outfile "${TEMP_DIR_PATH}/page_${i}-tmp.pdf"

    # Fetching bounding box of current page:
    PAGE_WIDTH=$(pdfinfo "${TEMP_DIR_PATH}/page_${i}-tmp.pdf" | awk '/^Page size:/ {print $3}')
    BBOX=$(gs -dBATCH -dNOPAUSE -q -sDEVICE=bbox "${TEMP_DIR_PATH}/page_${i}-tmp.pdf" 2>&1 | awk '/^%%HiResBoundingBox:/ {print 0 " " $3 " " '${PAGE_WIDTH}' " " $5}')

    # Removing content out of current page's bounding box:
    PAGE_FILENAME="page_${i}"
    PDF_FILEPATH="${TEMP_DIR_PATH}/${PAGE_FILENAME}.pdf"
    PS_FILEPATH="${TEMP_DIR_PATH}/${PAGE_FILENAME}.ps"
    gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.5 -sOutputFile="${PDF_FILEPATH}" -c "[/CropBox [${BBOX}] /PAGES pdfmark" -f "${TEMP_DIR_PATH}/page_${i}-tmp.pdf"

    PAGES_LIST="${PAGES_LIST}\n    \\\includegraphics[width=${REL_PAGE_WIDTH}\\\textwidth]{${PDF_FILEPATH}} \\\\\\\ \\\vspace{-0.03cm}"
done

echo "Merging the 'n' pages into a single one using LaTeX..."
cp "${PDF_MERGE_TEMPLATE_PATH}" "${TEMP_DIR_PATH}/merged.tex"
sed -i "s#${PDF_MERGE_TEMPLATE_PAGES_MACRO}#${PAGES_LIST}#g" "${TEMP_DIR_PATH}/merged.tex"
latex -halt-on-error -output-directory "${TEMP_DIR_PATH}" "${TEMP_DIR_PATH}/merged.tex"
dvipdfm "${TEMP_DIR_PATH}/merged.dvi" -o "${TEMP_DIR_PATH}/merged.pdf"

echo "Finalizing the PDF and sending a copy to the destination directory..."

# Adding margins to the final PDF:
pdfcrop --margins 5 "${TEMP_DIR_PATH}/merged.pdf" "${TEMP_DIR_PATH}/output-tmp.pdf"

# Enlargement of the final PDF file to extract tables correctly using the "tabula-py" tool:
"${CPDF_BIN_PATH}" -scale-page "10 10" "${TEMP_DIR_PATH}/output-tmp.pdf" -o "${TEMP_DIR_PATH}/output.pdf"

# Copying the pre-processed PDF file to the output path:
cp "${TEMP_DIR_PATH}/output.pdf" "${OUTPUT_FILEPATH}"

echo "Deleting temp files..."
rm -rf "${TEMP_DIR_PATH}"
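
As written, the script deletes its temporary directory only if it reaches its last line. One possible variation, sketched here as a suggestion rather than as part of the tested script, is to register the cleanup with trap right after creating the directory, so it also runs when an earlier step fails or the script exits early. The cleanup function name is an assumption.

```shell
# Create the working directory as the script does.
TEMP_DIR_PATH=$(mktemp -d)

# Hypothetical cleanup function: remove the working directory.
cleanup() {
    rm -rf "${TEMP_DIR_PATH}"
}

# Run cleanup whenever the shell exits, whether normally or after an error.
trap cleanup EXIT
```

With this in place, the explicit rm -rf at the end of the script becomes redundant but harmless.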

Give execution permission to the script using the following command:

chmod +x pdf-preprocessing.sh

Usage examples:

To preprocess a file called example.pdf, simply run the following command:

./pdf-preprocessing.sh example.pdf preprocessed-example.pdf

Edit the script parameters as per your case. The header and footer coordinates, for example, will depend on the template of your files. Feel free to suggest improvements and simplifications.
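
One way to avoid editing the script for every document template is to read the crop margins from an environment variable. This is a sketch under assumptions: CROP_MARGINS and remove_header_footer are hypothetical names, and the default is the value hard-coded in the script (pdfcrop expects the margins as "left top right bottom").

```shell
# Hypothetical variable: use the caller's margins if set, else the script's default.
CROP_MARGINS="${CROP_MARGINS:-5 -75 5 -25}"

# Hypothetical wrapper around the header/footer removal step.
# $1 is the input PDF, $2 is the cropped output PDF.
remove_header_footer() {
    pdfcrop --margins "${CROP_MARGINS}" "$1" "$2"
}
```

If the script used this variable, a call such as CROP_MARGINS='5 -40 5 -40' ./pdf-preprocessing.sh example.pdf preprocessed-example.pdf would override the default; the values shown are placeholders, not recommendations.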

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2