Case-sensitive PDF highlighting using PyMuPDF and re
The goal is a program that takes a PDF of a script and the name of a character, and outputs a script with only that character's lines (or at least their name) highlighted. Given the way these scripts are typically formatted, I would want just "MISHA" highlighted, but not "Misha" in the italic stage directions.
I was able to get a version of this working with PyMuPDF, but it would highlight every instance of the character's name.
The case-insensitive version was:
import fitz  # PyMuPDF

doc = fitz.open("HighlightTest.pdf")
character = input("Character name? ").upper()
for page in doc:
    ### SEARCH
    text = character
    text_instances = page.searchFor(character)  # case-insensitive; named search_for in newer PyMuPDF
    ### HIGHLIGHT
    for inst in text_instances:
        highlight = page.addHighlightAnnot(inst)
        highlight.update()
This produced a PDF with every instance of "character" highlighted, as expected.
I found the following bit about case-sensitive searching from the PyMuPDF documentation:
"Note A feature repeatedly asked for is supporting regular expressions when specifying the "needle" string: There is no way to do this. If you need something in that direction, first extract text in the desired format and then subselect the result by matching with some regex pattern. Here is an example for matching words:"
pattern = re.compile(r"...") # the regex pattern
words = page.get_text("words") # extract words on page
matches = [w for w in words if pattern.search(w[4])]
So I'm trying to figure out how to implement this as follows:
doc = fitz.open("HighlightTest.pdf")
character = input("Character name? ").upper()
for page in doc:
    text = character
    words = page.get_text(character) # extract words on page
    matches = [w for w in words if pattern.search(w[4])]
    for inst in matches:
        highlight = page.addHighlightAnnot(inst)
        highlight.update()
where pattern = re.compile("^"+character).
This gives the following error:
  File "C:\Users\me\Desktop\Python Projects\highlighter.py", line 45, in <module>
    matches = [w for w in words if pattern.search(w[4])]
  File "C:\Users\me\Desktop\Python Projects\highlighter.py", line 45, in <listcomp>
    matches = [w for w in words if pattern.search(w[4])]
IndexError: string index out of range
Unsure how to proceed from here and would welcome any advice! I'm certain that what I have above is jank in many ways, so no proposed solution is too basic. Thanks!
Solution 1:[1]
I came across the exact same issue and was able to solve it with another PyMuPDF function, get_text("words", sort=False).
See the documentation for more information: https://pymupdf.readthedocs.io/en/latest/textpage.html#TextPage.extractWORDS
This function returns a list of items, one per word, each holding the four rectangle coordinates of the word followed by its exact text, in this form: (x0, y0, x1, y1, "word", block_no, line_no, word_no)
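To make the tuple layout concrete, here is a minimal sketch that uses hand-written sample tuples (made-up values, not extracted from a real PDF) in place of the output of get_text("words"). It also illustrates what is most likely behind the IndexError in the question: if get_text hands back a plain string rather than a list of word tuples, iterating over it yields single characters, and indexing position 4 of a one-character string fails. The helper names are hypothetical.

```python
# Sample data shaped like the documented word tuples:
# (x0, y0, x1, y1, "word", block_no, line_no, word_no)
# The coordinates are made-up values for illustration only.
sample_words = [
    (72.0, 100.0, 110.0, 112.0, "MISHA", 0, 0, 0),
    (72.0, 120.0, 105.0, 132.0, "Misha", 1, 0, 0),
    (110.0, 120.0, 140.0, 132.0, "enters.", 1, 0, 1),
]


def exact_matches(words, target):
    """Return the (x0, y0, x1, y1) boxes of words equal to target (case-sensitive)."""
    return [w[:4] for w in words if w[4] == target]


def index_error_on_plain_text(text):
    """Return True if indexing [4] fails while iterating over a plain string."""
    try:
        # Iterating over a string yields one-character strings,
        # so ch[4] is always out of range.
        [ch for ch in text if ch[4]]
    except IndexError:
        return True
    return False


boxes = exact_matches(sample_words, "MISHA")
fails_on_string = index_error_on_plain_text("MISHA enters.")
```

Only the all-caps entry is selected, because the comparison on the fifth element (index 4) is an ordinary case-sensitive string comparison.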
You can then go through the returned items one by one and look for an exact (case-sensitive) match on the word. If you need to match whole sentences against the PDF content, you can keep the words in their original PDF order by setting the "sort" argument to False, then check each word of your sentence sequentially against the word list to see whether the sequence occurs there.
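The sequential check described above could look roughly like the following. This is a sketch under assumptions: find_phrase is a hypothetical helper, the sample tuples are made-up values standing in for the output of get_text("words", sort=False), and the phrase is split on whitespace only.

```python
def find_phrase(words, phrase):
    """Return one list of (x0, y0, x1, y1) boxes per occurrence of phrase.

    words is a list of word tuples in reading order; matching is
    case-sensitive and compares whole words.
    """
    targets = phrase.split()
    hits = []
    if not targets:
        return hits
    # Slide a window of len(targets) words over the word list.
    for start in range(len(words) - len(targets) + 1):
        window = words[start:start + len(targets)]
        if all(w[4] == t for w, t in zip(window, targets)):
            hits.append([w[:4] for w in window])
    return hits


# Made-up sample data in the documented tuple layout.
sample_words = [
    (72.0, 100.0, 110.0, 112.0, "MISHA", 0, 0, 0),
    (72.0, 120.0, 105.0, 132.0, "Misha", 1, 0, 0),
    (110.0, 120.0, 140.0, 132.0, "enters", 1, 0, 1),
    (145.0, 120.0, 160.0, 132.0, "the", 1, 0, 2),
    (165.0, 120.0, 190.0, 132.0, "room.", 1, 0, 3),
]

phrase_hits = find_phrase(sample_words, "Misha enters")
```

Note that punctuation stays attached to a word in this layout ("room." above), so a phrase ending in "room" would not match unless the punctuation is stripped first.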
For each match found, pass its rectangle coordinates to the highlighter as follows:
- Convert the collected coordinates (the first four elements) to a Rect object with fitz.Rect(x0, y0, x1, y1).
- Pass this object to page_obj.add_highlight_annot.
import fitz  # PyMuPDF library

pdf_file = fitz.open("<file_name>.pdf")  # create the PDF file object (replace the placeholder)
pdf_page_count = pdf_file.page_count  # number of pages
match_word = "MONTANA"  # exact, case-sensitive word to highlight
for page in range(pdf_page_count):  # page indices start at 0
    page_obj = pdf_file[page]  # create the page object
    # each entry is (x0, y0, x1, y1, "word", block_no, line_no, word_no)
    content_of_page = page_obj.get_text("words", sort=False)
    for word in content_of_page:
        if word[4] == match_word:
            rect_comp = fitz.Rect(word[0], word[1], word[2], word[3])
            highlight = page_obj.add_highlight_annot(rect_comp)
            highlight.set_colors(stroke=[0, 1, 0.8])
            highlight.update()
# Assumption, not in the original answer: write the result to a new file.
pdf_file.save("highlighted.pdf")  # hypothetical output name
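The regex idea from the question can also be combined with the word list. The sketch below is one possible approach, not part of the original answer: it escapes the name with re.escape and uses fullmatch with one optional trailing punctuation mark, so that "MISHA" and "MISHA:" match but "Misha" and "MISHAP" do not. The function names, the sample tuples, and the punctuation set are assumptions for illustration.

```python
import re


def name_pattern(character):
    """Compile a case-sensitive pattern: the name plus one optional '.', ':' or ','."""
    return re.compile(re.escape(character) + r"[.:,]?")


def match_name(words, character):
    """Return the (x0, y0, x1, y1) boxes of word tuples whose text is the name."""
    pattern = name_pattern(character)
    # fullmatch requires the whole word to match, unlike search.
    return [w[:4] for w in words if pattern.fullmatch(w[4])]


# Made-up sample data in the (x0, y0, x1, y1, "word", ...) layout.
sample_words = [
    (72.0, 100.0, 115.0, 112.0, "MISHA:", 0, 0, 0),
    (72.0, 120.0, 105.0, 132.0, "Misha", 1, 0, 0),
    (72.0, 140.0, 120.0, 152.0, "MISHAP", 2, 0, 0),
    (72.0, 160.0, 110.0, 172.0, "MISHA", 3, 0, 0),
]

name_boxes = match_name(sample_words, "MISHA")
```

Each returned box can then be turned into a fitz.Rect and passed to add_highlight_annot, as in the loop above.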
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
