How to delete pages based on phrases in PDF using Adobe XI Pro?
This is my first time using Actions in Adobe Pro. I would like to:
- Remove all pages in a PDF that contain any of the following strings: Total, Word Document, Excel Spreadsheet.
- Remove common strings (CSI, Export, Import) from all pages throughout the PDF.
I found the following code online. It addresses the first task, but it extracts pages based on a single string rather than deleting them, and I was not able to get it to work. I would prefer to run through multiple strings and delete the matching pages.
// Iterates over all pages, finds a given string, and extracts all
// pages on which that string is found to a new file.
var pageArray = [];
var stringToSearchFor = "Total";
for (var p = 0; p < this.numPages; p++) {
    // iterate over all words
    for (var n = 0; n < this.getPageNumWords(p); n++) {
        if (this.getPageNthWord(p, n) == stringToSearchFor) {
            pageArray.push(p);
            break;
        }
    }
}
if (pageArray.length > 0) {
    // extract all pages that contain the string into a new document
    var d = app.newDoc(); // this will add a blank page - we need to remove that once we are done
    for (var n = 0; n < pageArray.length; n++) {
        d.insertPages( {
            nPage: d.numPages-1,
            cPath: this.path,
            nStart: pageArray[n],
            nEnd: pageArray[n],
        } );
    }
    // remove the first page
    d.deletePages(0);
}
Solution 1:[1]
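- One-word and two-word phrase options.
one-word:
// walk the pages from last to first so deletions do not shift unvisited pages
for (var p=this.numPages-1; p>=0; p--) {
    if (this.numPages==1) break; // never delete the only remaining page
    // compare each word on the page with the search word
    for (var n=0; n<this.getPageNumWords(p); n++) {
        if (this.getPageNthWord(p, n) == "one-word") {
            this.deletePages(p);
            break;
        }
    }
}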
two-word:
for (var p=this.numPages-1; p>=0; p--) {
    if (this.numPages==1) break;
    for (var n=0; n<this.getPageNumWords(p)-1; n++) {
        if (this.getPageNthWord(p, n) == "1st-word" && this.getPageNthWord(p, n+1) == "2nd-word") {
            this.deletePages(p);
            break;
        }
    }
}
- To remove strings from all pages, use Tools > Protection > Search & Remove Text in Adobe XI Pro.
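The one-word and two-word loops above can be generalized to a list of phrases of any length, which is what the question asks for. The following is a minimal sketch, not a tested Action: `findPagesWithPhrases`, `pageHasAnyPhrase`, and `deleteMatchingPages` are hypothetical helper names, and it assumes each phrase can be split on single spaces and matched word by word against `getPageNthWord`. Only `numPages`, `getPageNumWords`, `getPageNthWord`, and `deletePages` are taken from the code above.

```javascript
// Sketch only: the three function names below are hypothetical helpers.
// "doc" is the Acrobat document object ("this" in an Action or the console).

// Return true if any target (an array of words) occurs as a run of
// consecutive words in "words".
function pageHasAnyPhrase(words, targets) {
    for (var t = 0; t < targets.length; t++) {
        var target = targets[t];
        for (var n = 0; n + target.length <= words.length; n++) {
            var k = 0;
            while (k < target.length && words[n + k] === target[k]) k++;
            if (k === target.length) return true;
        }
    }
    return false;
}

// Return the 0-based numbers of all pages that contain any of the phrases.
function findPagesWithPhrases(doc, phrases) {
    var targets = [];
    for (var i = 0; i < phrases.length; i++) {
        targets.push(phrases[i].split(" "));
    }
    var hits = [];
    for (var p = 0; p < doc.numPages; p++) {
        var numWords = doc.getPageNumWords(p); // read the word count once per page
        var words = [];
        for (var n = 0; n < numWords; n++) {
            words.push(doc.getPageNthWord(p, n));
        }
        if (pageHasAnyPhrase(words, targets)) hits.push(p);
    }
    return hits;
}

// Delete matching pages from last to first so earlier page numbers stay valid.
// Stops when only one page is left. Returns the number of pages deleted.
function deleteMatchingPages(doc, phrases) {
    var hits = findPagesWithPhrases(doc, phrases);
    var deleted = 0;
    for (var i = hits.length - 1; i >= 0; i--) {
        if (doc.numPages === 1) break;
        doc.deletePages(hits[i]);
        deleted++;
    }
    return deleted;
}

// Possible call inside Acrobat:
// deleteMatchingPages(this, ["Total", "Word Document", "Excel Spreadsheet"]);
```

The comparison is exact and case-sensitive, so "total" would not match "Total". Collecting the page numbers first and deleting afterwards keeps the search separate from the deletion, which makes it easy to inspect the list of hits before anything is removed.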
Solution 2:[2]
I faced a similar need: deleting pages from a PDF when a word exists on that page. I had 35,000 documents and 80,000 to 230,000 pages.
Running JavaScript was really slow.
I also tried the AutoBookmark plug-in for Adobe Acrobat from EverMap. It could handle files with tens of pages, but a run on 20,000 pages did not finish, and the plug-in probably ran into RAM problems at around 80,000 pages.
So I then looked at tools I already knew:
PowerShell: PDF editing and handling modules do not seem to be there, or are old, or ...
Python: worked! My code is below (much of it copied from others on the web). I use the Anaconda package, which is quite easy to set up. Before running the script you have to install some of the modules, and the code could use some tidying:
# Import modules
# Note: this uses the legacy PyPDF2 API (before 3.0). PyPDF2 3.0 removed these
# names in favour of PdfReader, reader.pages, extract_text(), PdfWriter and add_page().
import re
import PyPDF2

# open the pdf file
reader = PyPDF2.PdfFileReader("C:\\folder\\file.pdf")  # Python-style path
# get number of pages
num_pages = reader.getNumPages()
# define key term to search for (re.search treats it as a regular expression)
search_term = "word"
pages_to_delete = []
# extract text and do the search on each page
for i in range(num_pages):
    text = reader.getPage(i).extractText()
    if re.search(search_term, text):
        pages_to_delete.append(i)
        # print("Pattern found on page: " + str(i))
# copy every page that is not marked for deletion into a new file
output = PyPDF2.PdfFileWriter()
for i in range(num_pages):
    if i not in pages_to_delete:
        output.addPage(reader.getPage(i))
with open("C:\\folder\\newfile.pdf", 'wb') as f:
    output.write(f)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | t.breeze |
| Solution 2 | PDFs suck |
