How can I make this Python Match function faster?
I need to test around 100 million HTML documents to see if they meet certain criteria. I am doing this by checking whether certain strings exist in the text after stripping the HTML tags.
I have made a list of these strings as follows:
simpleToParse = []
simpleToParse.append('"Subject" "act=dispBoardWrite"'.lower())
simpleToParse.append('"dispMemberSignUpForm"'.lower())
simpleToParse.append('"Help get things started by asking a question"'.lower())
simpleToParse.append('"Recent questions and answers"'.lower())
...
...
...
I have around 200 of these strings. So, for this entry:
simpleToParse.append('"Subject" "act=dispBoardWrite"'.lower())
If the document contains both subject and act=dispBoardWrite, it is considered a match. Matching ANY of the items in the main simpleToParse list makes the document a match. In essence:
sometext AND sometext
OR
sometext
OR
sometext AND sometext AND sometext
...
...
...
Here is the function I am using:

import re

def check(strippedHTML):
    if any(all(x in strippedHTML.lower() for x in re.findall(r'"(.+?)"', y)) for y in simpleToParse):
        return True
    return False
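For reference, the re.findall(r'"(.+?)"', y) call is what splits each rule string into its quoted terms, for example:

import re

rule = '"Subject" "act=dispBoardWrite"'.lower()
print(re.findall(r'"(.+?)"', rule))  # ['subject', 'act=dispboardwrite']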
The problem is that this function takes anywhere from 2 to 30 seconds per document, depending on the length of the document.
I have an 8C/16T Ryzen 5800X, but at that rate it would still take me weeks to get through them all. Any help would be appreciated. Thanks.
Solution 1:[1]
I assume the problem is CPU-bound, so to benefit from parallelism you have to use multiprocessing.
Also, do not use a regex if you are not matching regular expressions.
Here is a proposal:
import concurrent.futures
rules = tuple(
    tuple(term.lower() for term in rule)
    for rule in (
        ["subject", "act=dispBoardWrite"],
        ["dispMemberSignUpForm"],
        ["Help get things started by asking a question"],
        ["Recent questions and answers"],
        # ["is creating a new copy of the input"],
    )
)
def check_rule(rule, text):
    return all(term in text for term in rule)
def check_all_rules(filepath):
    with open(filepath, "rt") as file:
        file_content = file.read().lower()  # lowercase the text once, since the rule terms were lowercased
    with concurrent.futures.ProcessPoolExecutor(max_workers=None) as executor:
        # create all futures, one per rule, and get their result as soon as available
        for future in concurrent.futures.as_completed(executor.submit(check_rule, rule, file_content)
                                                      for rule in rules):
            # cf https://stackoverflow.com/q/16276423/11384184
            if future.result() is True:
                # we have one match, no need to search for more
                executor.shutdown(wait=False, cancel_futures=True)  # Python>=3.9
                break
            else:
                # continue searching for a match
                continue
        else:  # the iterator is exhausted, no rule matched
            return False
        # the iterator was interrupted, there was a match
        return True
def check_all_files():
    filepaths = (
        "so70674556.html",
    )
    for filepath in filepaths:
        print(filepath, check_all_rules(filepath))
check_all_files()
It iterates over all the files (in my example, just an HTML dump of this page); for each one it creates a ProcessPoolExecutor to handle the parallelism, feeds it the tasks (one per rule to match), and collects the results as they complete. As soon as there is a match, all remaining tasks get cancelled and it moves on to the next file.
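The for/else in check_all_rules is what separates the two outcomes: the else branch of a for loop only runs when the loop finishes without hitting break, i.e. when no rule matched. A minimal illustration of that construct:

for rule_matched in (False, False, False):
    if rule_matched:
        break
else:
    print("loop finished without break, so no match")  # runs only because the loop was never broken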
I think it will be much faster than your single-threaded regex matching.
If that is not sufficient though, you should add to your question an example of a real file to process and a sample of the 200 rules to match against.
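If the per-file pool creation ever becomes a bottleneck (there are 100 million documents to get through), a variation on the same idea is to keep a single pool alive and parallelize over the documents instead, with each worker checking all the rules against one file. This is only a rough sketch, not tested at that scale; check_document and check_all_files_parallel are illustrative names, and it assumes the rules tuple from above is defined at module level so the worker processes can see it:

import concurrent.futures

def check_document(filepath):
    # read and lowercase the document once, then test every rule sequentially
    with open(filepath, "rt", errors="ignore") as file:
        text = file.read().lower()
    return filepath, any(all(term in text for term in rule) for rule in rules)

def check_all_files_parallel(filepaths):
    # one long-lived pool for the whole run; chunksize reduces inter-process overhead on a long file list
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for filepath, matched in executor.map(check_document, filepaths, chunksize=256):
            print(filepath, matched)

With 200 plain substring tests per document, each worker does a meaningful amount of work per file, so the cost of starting the processes is paid once instead of once per document.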
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Lenormju |
