How can I make this Python match function faster?

I need to test around 100 million HTML documents to see if they meet certain criteria. I am doing this by checking whether certain strings exist in the text after stripping the HTML tags.

I have made a list of these strings as follows:

simpleToParse = []
simpleToParse.append('"Subject" "act=dispBoardWrite"'.lower())
simpleToParse.append('"dispMemberSignUpForm"'.lower())
simpleToParse.append('"Help get things started by asking a question"'.lower())
simpleToParse.append('"Recent questions and answers"'.lower())
...
...
...

I have around 200 of these strings. Take this entry, for example:

simpleToParse.append('"Subject" "act=dispBoardWrite"'.lower())

If the document contains both the words subject and act=dispBoardWrite, it is considered a match. A match on ANY of the entries in the simpleToParse list counts as an overall match. In essence, the criteria look like this:

sometext AND sometext

OR

sometext

OR

sometext AND sometext AND sometext

...

...

...

Here is the function I am using

import re

def check(strippedHTML):
    # a document matches if, for ANY rule, ALL of its quoted terms
    # are found in the lowercased text
    if any(all(x in strippedHTML.lower() for x in re.findall(r'"(.+?)"', y)) for y in simpleToParse):
        return True
    return False
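
For a single entry, the re.findall call pulls out the quoted terms, which are then tested as plain substrings. For example (the document string below is just an illustration, not real data):

import re

entry = '"Subject" "act=dispBoardWrite"'.lower()
terms = re.findall(r'"(.+?)"', entry)
print(terms)  # ['subject', 'act=dispboardwrite']

# the entry matches only if every term is present in the lowercased text
document = 'board subject: hello ... href="?act=dispboardwrite" ...'
print(all(term in document for term in terms))  # True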

The problem is that this function takes anywhere from 2 to 30 seconds per document, depending on the document's length.

I have an 8C/16T Ryzen 5800X, but it would still take me weeks to process everything. Any help would be appreciated. Thanks.



Solution 1:[1]

I assume the problem is CPU-bound, so to benefit from parallelism you have to use multiprocessing.
Also, do not use a regex when you are not actually matching regular expressions; plain substring tests are enough here.
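
In particular, the quoted terms only need to be extracted once, at startup, and each document only needs to be lowercased once; the per-document work then reduces to plain substring tests. A minimal sketch of that idea, reusing the original simpleToParse list (parsed_rules is just an illustrative name):

import re

# run the regex once per rule at startup (about 200 times in total),
# instead of once per rule for every document
parsed_rules = [tuple(re.findall(r'"(.+?)"', rule)) for rule in simpleToParse]

def check(strippedHTML):
    text = strippedHTML.lower()  # lowercase the document once per call
    return any(all(term in text for term in rule) for rule in parsed_rules)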

Here is a proposal:

import concurrent.futures

rules = tuple(
    tuple(term.lower() for term in rule)
    for rule in (
        ["subject", "act=dispBoardWrite"],
        ["dispMemberSignUpForm"],
        ["Help get things started by asking a question"],
        ["Recent questions and answers"],
        # ["is creating a new copy of the input"],
    )
)


def check_rule(rule, text):
    return all(term in text for term in rule)


def check_all_rules(filepath):
    with open(filepath, "rt") as file:
        # lowercase once so the lowercased terms in `rules` match case-insensitively
        file_content = file.read().lower()

    with concurrent.futures.ProcessPoolExecutor(max_workers=None) as executor:
        # create all futures, one per rule, and get their result as soon as available
        for future in concurrent.futures.as_completed(executor.submit(check_rule, rule, file_content)
                                                      for rule in rules):
            # cf https://stackoverflow.com/q/16276423/11384184
            if future.result() is True:
                # we have one match, no need to search for more
                executor.shutdown(wait=False, cancel_futures=True)  # Python>=3.9
                break
            else:
                # continue searching for a match
                continue
        else:  # the iterator is exhausted, no rule matched
            return False
        # the iterator was interrupted, there was a match
        return True


def check_all_files():
    filepaths = (
        "so70674556.html",
    )
    for filepath in filepaths:
        print(filepath, check_all_rules(filepath))


check_all_files()

It will iterate over all the files (in my example, just an HTML dump of this page); for each one it creates a ProcessPoolExecutor to handle the parallelism, feeds it the tasks (one per rule to match), and gets the results as they complete. As soon as there is a match, all remaining tasks get cancelled and it moves on to the next file.

I think it will be much faster than your single-threaded regex matching.
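
A possible variation, if creating a new process pool for every one of the ~100 million files adds too much overhead, is to parallelize across documents instead of across rules, keeping the per-document check sequential. A minimal sketch, assuming the file paths are available as a list (check_file and check_many_files are illustrative names, and rules is the tuple defined above):

import concurrent.futures

def check_file(filepath):
    # sequential rule check over one whole document
    with open(filepath, "rt") as file:
        text = file.read().lower()
    return any(all(term in text for term in rule) for rule in rules)

def check_many_files(filepaths):
    # one long-lived pool; each worker handles whole documents
    with concurrent.futures.ProcessPoolExecutor() as executor:
        results = executor.map(check_file, filepaths, chunksize=100)
        for filepath, matched in zip(filepaths, results):
            print(filepath, matched)
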
If that is still not sufficient, you should add to your question an example of a real file to process and a sample of the 200 rules to match against.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1: Lenormju