How to delete pages based on phrases in PDF using Adobe XI Pro?

This is my first time using Actions in Adobe Pro. I would like to:

  1. Remove all pages in a PDF that contain any of the following strings: Total, Word Document, Excel Spreadsheet.
  2. Remove common strings (CSI, Export, Import) from all pages throughout the PDF.

The following code, found online, addresses #1 but extracts pages based on a single string. I was not able to get it to work, and I would prefer to loop through multiple strings and delete the matching pages.

// Iterates over all pages and find a given string and extracts all
// pages on which that string is found to a new file.

var pageArray = [];

var stringToSearchFor = "Total";

for (var p = 0; p < this.numPages; p++) {
    // iterate over all words
    for (var n = 0; n < this.getPageNumWords(p); n++) {
        if (this.getPageNthWord(p, n) == stringToSearchFor) {
            pageArray.push(p);
            break;
        }
    }
}

if (pageArray.length > 0) {
    // extract all pages that contain the string into a new document
    var d = app.newDoc(); // this will add a blank page - we need to remove that once we are done
    for (var n = 0; n < pageArray.length; n++) {
        d.insertPages( {
            nPage: d.numPages-1,
            cPath: this.path,
            nStart: pageArray[n],
            nEnd: pageArray[n],
        } );
    }

    // remove the first page
    d.deletePages(0);
}


Solution 1:[1]

  1. One-word and two-word phrase options:

one-word:

for (var p=this.numPages-1; p>=0; p--) {  
    if (this.numPages==1) break;  
    for (var n=0; n<this.getPageNumWords(p); n++) {
        if (this.getPageNthWord(p, n) == "one-word") {  
            this.deletePages(p);  
            break;  
        }  
    }  
}  

two-word:

for (var p=this.numPages-1; p>=0; p--) {  
    if (this.numPages==1) break;  
    for (var n=0; n<this.getPageNumWords(p)-1; n++) {  
        if (this.getPageNthWord(p, n) == "1st-word" && this.getPageNthWord(p, n+1) == "2nd-word") {  
            this.deletePages(p);  
            break;  
        }  
    }  
}  
  2. Within Adobe XI Pro: Tools --> Protection --> Search & Remove Text

Solution 2:[2]

I was facing a similar need: delete pages from a PDF when a word exists on that page. I had 35000 documents and 80000-230000 pages.

Running JavaScript was really, really slow.

I also tried the Autobookmark plugin for Adobe Acrobat from Evermap. It could handle files with tens of pages, but with 20000 pages the process did not finish, and the plugin probably ran into RAM problems at around 80000 pages.

So, then I looked at things I know from earlier:

PowerShell - PDF editing and handling modules do not seem to be there, or are old, or ...

Python - worked! My code is below (much of it copied from others on the web!). I use the Anaconda package, which is quite easy to set up. Before running the script you have to install some of the modules, and the code could use some tidying:

# Import modules
import re
import PyPDF2

# Open the PDF file. PdfReader/PdfWriter are the PyPDF2 2.x+ names;
# older releases call them PdfFileReader/PdfFileWriter.
reader = PyPDF2.PdfReader("C:\\folder\\file.pdf")  # Python-style path

# Define the key term to search for (re.search treats it as a regular expression)
search_term = "word"

# Extract the text of each page and record the pages that match
pages_to_delete = []
for i, page in enumerate(reader.pages):
    text = page.extract_text() or ""
    if re.search(search_term, text):
        pages_to_delete.append(i)
        # print("Pattern found on page: " + str(i))

# Copy every page that is not marked for deletion into a new file
output = PyPDF2.PdfWriter()
for i, page in enumerate(reader.pages):
    if i not in pages_to_delete:
        output.add_page(page)

with open("C:\\folder\\newfile.pdf", 'wb') as f:
    output.write(f)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 t.breeze
Solution 2 PDFs suck