How to speed up the filtering process in Python?
I have written the code below to filter rows of a DataFrame whose product description shares at least one word with an input product description. This filtering takes a long time to complete. Could you please help me speed up the filtering logic?
df1["PRODSPLIT"] = df1["PRODUCTDESC"].str.split()
df1["INTERSECT"] = df1["PRODSPLIT"].apply(lambda x:list(set(x).intersection("input product description".split())))
df1["PRODSPLITLEN"] = df1["INTERSECT"].str.len()
df1 = df1[df1["PRODSPLITLEN"] > 0]
Edit:
def intersects_inp_prod_or_desc(self, product_desc, input_product_desc):
    product_desc_lst = product_desc.split()
    input_product_desc_lst = input_product_desc.split()
    # True if any word of the row's description appears in the input description
    return any([prod in input_product_desc_lst for prod in product_desc_lst])

df["INTERSECT"] = df["PRODUCTDESC"].apply(lambda x: self.intersects_inp_prod_or_desc(x, query[0]))
Solution 1:[1]
Is the check always supposed to just test whether any word within a row of df1["PRODUCTDESC"] is "input", "product" or "description"? In that case you could use:
def intersects_inp_prod_or_desc(string_):
    return any([spl in ["input", "product", "description"] for spl in string_.split()])

df1["INTERSECT"] = df1["PRODUCTDESC"].apply(intersects_inp_prod_or_desc)
df1 = df1[df1["INTERSECT"]]
This creates a boolean column containing True if it does and False if it doesn't, and then uses that column to filter the DataFrame.
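For illustration, here is a minimal sketch on a made-up two-row DataFrame (assuming intersects_inp_prod_or_desc from above is in scope; the rows are invented):

import pandas as pd

df1 = pd.DataFrame({"PRODUCTDESC": ["red input cable", "blue widget"]})
df1["INTERSECT"] = df1["PRODUCTDESC"].apply(intersects_inp_prod_or_desc)
print(df1["INTERSECT"].tolist())  # [True, False]
# indexing with the boolean column keeps only the "red input cable" row
df1 = df1[df1["INTERSECT"]]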
Edit: About the runtime
We can do some basic timing like this:
import time

import pandas

def run(rows):
    start = time.time()
    productdesc = ["something with input", "something with product",
                   "something with description", "something else"]
    productdesc *= rows // len(productdesc)
    df1 = pandas.DataFrame(productdesc, columns=["PRODUCTDESC"])
    df1["INTERSECT"] = df1["PRODUCTDESC"].apply(intersects_inp_prod_or_desc)
    df1 = df1[df1["INTERSECT"]]
    finish = time.time()
    print(f"{rows:,} rows: {round(finish - start, 5)} seconds")
run(10)
run(100)
run(1000)
run(10000)
run(100000)
run(1000000)
run(10000000)
First with the unchanged method:
def intersects_inp_prod_or_desc(string_):
    return any([spl in ["input", "product", "description"] for spl in string_.split()])
# 10 rows: 0.0 seconds
# 100 rows: 0.0 seconds
# 1,000 rows: 0.01012 seconds
# 10,000 rows: 0.01013 seconds
# 100,000 rows: 0.12163 seconds
# 1,000,000 rows: 1.21986 seconds
# 10,000,000 rows: 12.12009 seconds
Because any() over a list comprehension builds the whole list before checking it, we can write the early returns by hand to stop at the first match and save a bit of time:
def intersects_inp_prod_or_desc(string_):
    split_string = string_.split()
    for split_ in split_string:
        if split_ == "input":
            return True
        if split_ == "product":
            return True
        if split_ == "description":
            return True
    return False
# 10 rows: 0.0 seconds
# 100 rows: 0.00812 seconds
# 1,000 rows: 0.00215 seconds
# 10,000 rows: 0.0102 seconds
# 100,000 rows: 0.09158 seconds
# 1,000,000 rows: 0.94362 seconds
# 10,000,000 rows: 9.40814 seconds
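As an aside not from the original answer: the same early-exit behaviour can be written more compactly with any() over a generator expression, which also stops at the first match, and a set of keywords, whose hash-based membership test avoids scanning a three-element list for every word. Whether this beats the hand-written loop in practice is worth measuring with the harness above:

KEYWORDS = {"input", "product", "description"}

def intersects_inp_prod_or_desc(string_):
    # a generator (no square brackets) lets any() stop at the first hit;
    # set membership is a hash lookup instead of a linear list scan
    return any(spl in KEYWORDS for spl in string_.split())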
Regardless of which variant we use, as soon as the DataFrame contains some very long descriptions, like this (the last entry is 250 words), it gets much slower:
productdesc = ["something with input", "something with product",
"something with description", "something else",
"something long without key words "*50]
# 10 rows: 0.0 seconds
# 100 rows: 0.00813 seconds
# 1,000 rows: 0.01014 seconds
# 10,000 rows: 0.09424 seconds
# 100,000 rows: 0.93761 seconds
# 1,000,000 rows: 9.35188 seconds
# 10,000,000 rows: 73.0588 seconds
But we can modify the method to first check whether any of the keywords appears as a substring of the description at all, and only if one does, do the splitting and iterate over the words:
def intersects_inp_prod_or_desc(string_):
    if "input" in string_ or "product" in string_ or "description" in string_:
        split_string = string_.split()
        for split_ in split_string:
            if split_ == "input":
                return True
            if split_ == "product":
                return True
            if split_ == "description":
                return True
    return False
# 10 rows: 0.00814 seconds
# 100 rows: 0.00209 seconds
# 1,000 rows: 0.0 seconds
# 10,000 rows: 0.0102 seconds
# 100,000 rows: 0.12225 seconds
# 1,000,000 rows: 1.15786 seconds
# 10,000,000 rows: 11.52258 seconds
Be aware, though, that this extra check costs a little time in every case. If there are no long descriptions, the pre-check can cost more than the skipped splitting saves; for example, calling the modified version on the list without the long descriptions again:
productdesc = ["something with input", "something with product",
"something with description", "something else"]
# 10 rows: 0.0 seconds
# 100 rows: 0.0 seconds
# 1,000 rows: 0.0 seconds
# 10,000 rows: 0.01017 seconds
# 100,000 rows: 0.10158 seconds
# 1,000,000 rows: 0.95558 seconds
# 10,000,000 rows: 9.60593 seconds
I don't know how large your DataFrame is; even the unchanged version looks fast to me. The likely reason it is slow for you is that there are many rows and quite a lot of them are very long. In that case, use the modified version of the method, which pre-checks whether any keyword is a substring of the description and returns False immediately if none is.
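A vectorized alternative, not benchmarked in this answer but often worth trying, is to push the whole check into pandas with str.contains and a word-boundary regex, which avoids calling a Python function per row. A sketch under the same three-keyword assumption (note that \b word boundaries treat punctuation slightly differently than str.split does):

# \b(?:...)\b matches the keywords as whole words only
pattern = r"\b(?:input|product|description)\b"
df1 = df1[df1["PRODUCTDESC"].str.contains(pattern, regex=True)]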
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Stack Overflow |
