How do I replace a string in a PDF file using NodeJS?

I have a template PDF file, and I want to replace some marker strings to generate new PDF files and save them. What's the best/simplest way to do this? I don't need to add graphics or anything fancy, just a simple text replacement, so I don't want anything too complicated.

Thanks!

Edit: Just found HummusJS, I'll see if I can make progress and post it here.

Solution 1:^[1]

I found this question by searching, so I think it deserves an answer. The approach below is based on BrighTide's comment here: https://github.com/galkahana/HummusJS/issues/71#issuecomment-275956347

Basically, there is a very powerful package called Hummus, which wraps a cross-platform library written in C++. I think the answer given in that GitHub comment can be turned into a function like this:

var hummus = require('hummus');

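/**
 * Converts a string to an array of byte values
 *
 * @param {string} str - input string
 * @returns {number[]} the bytes of the UTF-8 encoded string
 */
function strToByteArray(str) {
  var myBuffer = [];
  var buffer = Buffer.from(str); // new Buffer() is deprecated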
  for (var i = 0; i < buffer.length; i++) {
      myBuffer.push(buffer[i]);
  }
  return myBuffer;
}

function replaceText(sourceFile, targetFile, pageNumber, findText, replaceText) {  
    var writer = hummus.createWriterToModify(sourceFile, {
        modifiedFilePath: targetFile
    });
    var sourceParser = writer.createPDFCopyingContextForModifiedFile().getSourceDocumentParser();
    var pageObject = sourceParser.parsePage(pageNumber);
    var textObjectId = pageObject.getDictionary().toJSObject().Contents.getObjectID();
    var textStream = sourceParser.queryDictionaryObject(pageObject.getDictionary(), 'Contents');
    //read the original block of text data
    var data = [];
    var readStream = sourceParser.startReadingFromStream(textStream);
    while(readStream.notEnded()){
        Array.prototype.push.apply(data, readStream.read(10000));
    }
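    // Buffer.from replaces the deprecated new Buffer(); a string findText only replaces the first match
    var string = Buffer.from(data).toString().replace(findText, replaceText);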

    //Create and write our new text object
    var objectsContext = writer.getObjectsContext();
    objectsContext.startModifiedIndirectObject(textObjectId);

    var stream = objectsContext.startUnfilteredPDFStream();
    stream.getWriteStream().write(strToByteArray(string));
    objectsContext.endPDFStream(stream);

    objectsContext.endIndirectObject();

    writer.end();
}

// replaceText('source.pdf', 'output.pdf', 0, /REPLACEME/g, 'My New Custom Text');
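
One thing the function above does not handle: the replacement ends up inside a PDF literal string, which is delimited by parentheses, so a replacement containing `(`, `)` or `\` can break the content stream. Below is a minimal sketch of escaping the replacement first. The helper names `escapePdfLiteral` and `replaceMarker` are made up for illustration and are not part of Hummus; the sketch also assumes the marker appears as plain text in the stream.

```javascript
// Escape the characters that are special inside a PDF literal string "( ... )".
// Escaping balanced parentheses is not strictly required, but it is always allowed.
function escapePdfLiteral(text) {
  return text.replace(/[\\()]/g, '\\$&');
}

// Replace every occurrence of a plain-text marker (hypothetical helper).
// split/join treats both the marker and the replacement literally,
// so "$&" and similar sequences in the replacement are not interpreted.
function replaceMarker(content, marker, text) {
  return content.split(marker).join(escapePdfLiteral(text));
}

var content = 'BT (Dear REPLACEME) Tj ET';
console.log(replaceMarker(content, 'REPLACEME', 'Smith (Jr.)'));
// BT (Dear Smith \(Jr.\)) Tj ET
```

The result of `replaceMarker` could be used in place of the `.replace(findText, replaceText)` call when the marker is a plain string.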

UPDATE:
The example was written against version 1.0.83, so things may have changed since then.

UPDATE 2: I recently ran into a problem with another PDF file that used a different font. For some reason the text was split into small chunks, e.g. the string QWERTYUIOPASDFGHJKLZXCVBNM1234567890- was represented as -286(Q)9(WER)24(T)-8(YUIOP)116(ASDF)19(GHJKLZX)15(CVBNM1234567890-). I couldn't think of anything better than making up a regex. So instead of this one line:

var string = Buffer.from(data).toString().replace(findText, replaceText);

I have something like this now:

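var string = Buffer.from(data).toString();

// REPLACE_ME and REPLACE_WITH_THIS are placeholders for the marker string and its replacement
var characters = REPLACE_ME;
var match = [];
for (var a = 0; a < characters.length; a++) {
    // each character may be preceded by a number and wrapped in parentheses
    match.push('(-?[0-9]+)?(\\()?' + characters[a] + '(\\))?');
}

string = string.replace(new RegExp(match.join('')), function(m, m1) {
    // m1 holds the number before the first character (a spacing adjustment); it is undefined when absent
    return (m1 || '') + '( ' + REPLACE_WITH_THIS + ')';
});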
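
To make the regex idea above easier to try out, here is a self-contained sketch that runs it against the chunked string from this update. The function names are made up for illustration, and the character escaping is an addition of mine so that markers containing regex metacharacters still work. It assumes the marker fills the whole text run; if the marker sits in the middle of a longer string, the parentheses in the output can end up unbalanced.

```javascript
// Escape a character so it is matched literally inside a RegExp.
function escapeRegExp(ch) {
  return ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build a pattern that tolerates an optional number and optional
// parentheses around every character of the marker.
function buildChunkedPattern(marker) {
  var parts = [];
  for (var i = 0; i < marker.length; i++) {
    parts.push('(-?[0-9]+)?(\\()?' + escapeRegExp(marker[i]) + '(\\))?');
  }
  return new RegExp(parts.join(''));
}

// Replace the first chunked occurrence of the marker (hypothetical helper).
function replaceChunkedText(content, marker, replacement) {
  return content.replace(buildChunkedPattern(marker), function (m, m1) {
    // m1 is the number before the first character, or undefined if there is none
    return (m1 || '') + '(' + replacement + ')';
  });
}

var chunked = '[-286(Q)9(WER)24(T)-8(YUIOP)116(ASDF)19(GHJKLZX)15(CVBNM1234567890-)] TJ';
console.log(replaceChunkedText(chunked, 'QWERTYUIOPASDFGHJKLZXCVBNM1234567890-', 'Hello'));
// [-286(Hello)] TJ
```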

Solution 2:^[2]

Building on Alex's (and others') solution, I noticed that some non-text data was getting corrupted. I tracked this down to encoding/decoding the PDF text as UTF-8 instead of as a binary string. Here's a modified solution that:

- Avoids corrupting non-text data
- Uses streams instead of files
- Allows multiple patterns/replacements
- Uses the MuhammaraJS package, which is a maintained fork of HummusJS (you should be able to swap in HummusJS just fine as well)
- Is written in TypeScript (feel free to remove the types for JS)

import muhammara from "muhammara";

interface Pattern {
  searchValue: RegExp | string;
  replaceValue: string;
}

/**
 * Modify a PDF by replacing text in it
 */
const modifyPdf = ({
  sourceStream,
  targetStream,
  patterns,
}: {
  sourceStream: muhammara.ReadStream;
  targetStream: muhammara.WriteStream;
  patterns: Pattern[];
}): void => {
  const modPdfWriter = muhammara.createWriterToModify(sourceStream, targetStream, { compress: false });
  const numPages = modPdfWriter
    .createPDFCopyingContextForModifiedFile()
    .getSourceDocumentParser()
    .getPagesCount();

  for (let page = 0; page < numPages; page++) {
    const copyingContext = modPdfWriter.createPDFCopyingContextForModifiedFile();
    const objectsContext = modPdfWriter.getObjectsContext();

    const pageObject = copyingContext.getSourceDocumentParser().parsePage(page);
    const textStream = copyingContext
      .getSourceDocumentParser()
      .queryDictionaryObject(pageObject.getDictionary(), "Contents");
    const textObjectID = pageObject.getDictionary().toJSObject().Contents.getObjectID();

    let data: number[] = [];
    const readStream = copyingContext.getSourceDocumentParser().startReadingFromStream(textStream);
    while (readStream.notEnded()) {
      const readData = readStream.read(10000);
      data = data.concat(readData);
    }

    const pdfPageAsString = Buffer.from(data).toString("binary"); // key change 1

    let modifiedPdfPageAsString = pdfPageAsString;
    for (const pattern of patterns) {
      // Note: replaceAll throws a TypeError if searchValue is a RegExp without the "g" flag
      modifiedPdfPageAsString = modifiedPdfPageAsString.replaceAll(pattern.searchValue, pattern.replaceValue);
    }

    // Create what will become our new text object
    objectsContext.startModifiedIndirectObject(textObjectID);

    const stream = objectsContext.startUnfilteredPDFStream();
    stream.getWriteStream().write(strToByteArray(modifiedPdfPageAsString));
    objectsContext.endPDFStream(stream);

    objectsContext.endIndirectObject();
  }

  modPdfWriter.end();
};

/**
 * Create a byte array from a string, as muhammara expects
 */
const strToByteArray = (str: string): number[] => {
  const myBuffer = [];
  const buffer = Buffer.from(str, "binary"); // key change 2
  for (let i = 0; i < buffer.length; i++) {
    myBuffer.push(buffer[i]);
  }
  return myBuffer;
};

And then to use it:

/**
 * Fill a PDF with template data
 */
export const fillPdf = async (sourceBuffer: Buffer): Promise<Buffer> => {
  const sourceStream = new muhammara.PDFRStreamForBuffer(sourceBuffer);
  const targetStream = new muhammara.PDFWStreamForBuffer();

  modifyPdf({
    sourceStream,
    targetStream,
    patterns: [{ searchValue: "home", replaceValue: "emoh" }], // TODO use actual patterns
  });

  return targetStream.buffer;
};
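
The effect of the two "key change" lines can be reproduced without any PDF library. The sketch below is plain JavaScript rather than TypeScript, and `replaceInBytes` is a made-up helper, not part of MuhammaraJS. It runs the same replacement once through a "binary" (latin1) round trip and once through UTF-8, using two non-ASCII bytes as a stand-in for non-text data.

```javascript
// Decode bytes with the given encoding, apply string replacements, and encode again.
function replaceInBytes(bytes, patterns, encoding) {
  var text = Buffer.from(bytes).toString(encoding);
  patterns.forEach(function (pattern) {
    text = text.replaceAll(pattern.searchValue, pattern.replaceValue);
  });
  return Array.from(Buffer.from(text, encoding));
}

// "(home)" followed by two non-ASCII bytes standing in for non-text data.
var input = [0x28, 0x68, 0x6f, 0x6d, 0x65, 0x29, 0xe9, 0xff];
var patterns = [{ searchValue: 'home', replaceValue: 'emoh' }];

var viaBinary = replaceInBytes(input, patterns, 'binary');
var viaUtf8 = replaceInBytes(input, patterns, 'utf8');

console.log(viaBinary.length); // 8: the trailing 0xe9 0xff bytes survive unchanged
console.log(viaUtf8.length);   // 12: each invalid byte became the 3-byte U+FFFD sequence
```

With "binary", every byte maps to exactly one character and back, so anything that is not touched by a pattern comes out byte-for-byte identical.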

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	miguelmorin
Solution 2
