Case-sensitive PDF highlighting using PyMuPDF and re

The goal is a program that takes a PDF of a script and the name of a character, and outputs a script with only that character's lines (or at least their name) highlighted. In the way these scripts are typically formatted, I would want just "MISHA" highlighted, but not "Misha" in the italic stage directions.

I was able to get a version of this working with PyMuPDF, but it would highlight every instance of the character's name.

The case-insensitive version was:

doc = fitz.open("HighlightTest.pdf")

character = input("Character name? ").upper()

for page in doc:
    ### SEARCH
    text = character
    text_instances = page.searchFor(character)
    

    ### HIGHLIGHT
    for inst in text_instances: 
        highlight = page.addHighlightAnnot(inst)
        highlight.update()

This then spat out a PDF with every instance of "character" highlighted, as expected.

I found the following bit about case-sensitive searching from the PyMuPDF documentation:

"Note A feature repeatedly asked for is supporting regular expressions when specifying the "needle" string: There is no way to do this. If you need something in that direction, first extract text in the desired format and then subselect the result by matching with some regex pattern. Here is an example for matching words:"

pattern = re.compile(r"...")  # the regex pattern
words = page.get_text("words")  # extract words on page
matches = [w for w in words if pattern.search(w[4])]

So I'm trying to figure out how to implement this as follows:

doc = fitz.open("HighlightTest.pdf")
    
character = input("Character name? ").upper()

for page in doc:
    text = character
    words = page.get_text(character)  # extract words on page
    matches = [w for w in words if pattern.search(w[4])]

    for inst in matches:
        highlight = page.addHighlightAnnot(inst)
        highlight.update()

where pattern = re.compile("^"+character).

This gives the following error:

File "C:\Users\me\Desktop\Python Projects\highlighter.py", line 45, in matches = [w for w in words if pattern.search(w[4])]

IndexError: string index out of range

Unsure how to proceed from here and would welcome any advice! I'm certain that what I have above is jank in many ways, so no proposed solution is too basic. Thanks!

Solution 1:^[1]

I came across the exact same issue and was able to solve it with another PyMuPDF function, get_text("words", sort=False).

See the documentation for more information: [1]: https://pymupdf.readthedocs.io/en/latest/textpage.html#TextPage.extractWORDS

This function returns a list of items, each holding the four coordinates of the word's rectangle followed by the exact text of the word, like this: (x0, y0, x1, y1, "word", block_no, line_no, word_no)
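To make the matching step concrete without needing a PDF, here is a minimal sketch that filters items of that shape with a case-sensitive regex. The sample tuples and the helper name match_name are made up for illustration, and allowing a trailing ".", ":" or "," after the name is an assumption about how scripts are punctuated.

```python
import re

# Made-up sample items in the shape described above:
# (x0, y0, x1, y1, "word", block_no, line_no, word_no)
words = [
    (72.0, 100.0, 110.0, 112.0, "MISHA", 0, 0, 0),
    (72.0, 115.0, 100.0, 127.0, "(Misha", 1, 0, 0),
    (104.0, 115.0, 140.0, 127.0, "enters.)", 1, 0, 1),
    (72.0, 130.0, 112.0, 142.0, "MISHA:", 2, 0, 0),
]

def match_name(words, name):
    """Return the items whose text is exactly `name` (case-sensitive),
    optionally followed by one punctuation character (an assumption)."""
    pattern = re.compile(re.escape(name) + r"[.:,]?")
    return [w for w in words if pattern.fullmatch(w[4])]

matches = match_name(words, "MISHA")
print([w[4] for w in matches])  # ['MISHA', 'MISHA:']
```

With a real page, the same helper could be fed the result of page_obj.get_text("words") instead of the sample list.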

You can then go through the returned items one by one and check each word for an exact (case-sensitive) match. If you already have whole sentences to match against the PDF content, you can keep the words in their original PDF order by setting the "sort" argument to False, then compare the words of your sentence sequentially against the word list.
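The sequential check for a whole sentence could look roughly like the sketch below. It is not part of PyMuPDF: find_phrase is a hypothetical helper, the sample items are made up, and it assumes the sentence is split on whitespace the same way the word list is.

```python
def find_phrase(words, phrase):
    """Return, for each case-sensitive occurrence of `phrase`, the list of
    (x0, y0, x1, y1) tuples of its words. `words` must be in reading order."""
    target = phrase.split()
    texts = [w[4] for w in words]
    hits = []
    if not target:
        return hits
    # Slide a window of len(target) words over the page's word list.
    for i in range(len(texts) - len(target) + 1):
        if texts[i:i + len(target)] == target:
            hits.append([w[:4] for w in words[i:i + len(target)]])
    return hits

# Made-up sample items: (x0, y0, x1, y1, "word", block_no, line_no, word_no)
sample = [
    (10, 10, 40, 20, "MISHA", 0, 0, 0),
    (10, 25, 15, 35, "I", 1, 0, 0),
    (18, 25, 30, 35, "am", 1, 0, 1),
    (33, 25, 60, 35, "here.", 1, 0, 2),
    (10, 40, 15, 50, "i", 2, 0, 0),
    (18, 40, 30, 50, "am", 2, 0, 1),
]

print(find_phrase(sample, "I am"))  # [[(10, 25, 15, 35), (18, 25, 30, 35)]]
```

Each inner tuple can be turned into a rectangle for highlighting, one per word of the matched sentence.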

For each match found, pass its rectangle coordinates to the highlighter as follows:

1. Convert the collected coordinates (the first four elements) to a Rect object with fitz.Rect(x0, y0, x1, y1).
2. Pass this object to page_obj.add_highlight_annot.

    import fitz  # PyMuPDF

    match_word = "MONTANA"  # exact, case-sensitive word to highlight

    pdf_file = fitz.open("<file_name>.pdf")  # open the PDF document
    for page_obj in pdf_file:  # iterate over the pages
        # Each item is (x0, y0, x1, y1, "word", block_no, line_no, word_no)
        words_on_page = page_obj.get_text("words", sort=False)
        for word in words_on_page:
            if word[4] == match_word:  # == compares case-sensitively
                rect_comp = fitz.Rect(word[0], word[1], word[2], word[3])
                highlight = page_obj.add_highlight_annot(rect_comp)
                highlight.set_colors(stroke=[0, 1, 0.8])
                highlight.update()

    # Added so the highlights are written out; the output name is an assumption
    pdf_file.save("<file_name>_highlighted.pdf")
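As for the IndexError in the question: it most likely comes from calling page.get_text(character) instead of page.get_text("words"). As far as I can tell, PyMuPDF falls back to plain text for an option it does not recognize, so "words" ends up being one long string, and iterating over a string yields single characters. The sketch below reproduces the error with a stand-in string and no PDF; the sample text is made up.

```python
import re

pattern = re.compile("^MISHA")
page_text = "MISHA\nHello.\n"  # stand-in for plain page text (assumed)

error = None
try:
    # Each w is a single character here, so w[4] is out of range.
    matches = [w for w in page_text if pattern.search(w[4])]
except IndexError as exc:
    error = exc

print(type(error).__name__, error)  # IndexError string index out of range
```

Requesting "words" gives items where index 4 is the word text, which is what the list comprehension from the documentation expects.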

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
