Improve performance when fuzzy matching values in one list with values in another in Python
I need to take a list from an input, compare it to a list of relevant values, and place each relevant value in a new list to be output. Because the data is not 100% accurate, I have to fuzzy match each value from the input list against each value in the desired list, check the score, and append the matched value to the output list if the score is high enough.
So far I have created this function (I've added comments for clarity):
from fuzzywuzzy import fuzz  # assumed import, not shown in the original post (thefuzz and rapidfuzz expose the same name)


def refine_attributes(row, policy_list):
    """
    Function to identify the relevant attributes and refine the reported attributes to only the relevant ones.
    :param row:
    :param policy_list:
    :return:
    """
    all_attributes = []  # this forms the desired list of relevant values
    for p in policy_list:
        value = next(iter(p))
        attributes_list = p.get(value, {}).get('attributes')
        for a in attributes_list:
            all_attributes.append(a)  # This retrieves a nested list field in a list of dictionaries and adds each value separately to the new desired list.
    # Further refinement and tests performed on the desired list.
    all_attributes_processed = []
    for i in all_attributes:
        i = i.replace('"', '').strip()
        # Test the attribute is legitimate
        if len(i) > 1:
            all_attributes_processed.append(i)
        else:
            continue
    ######### The steps above here will be moved to a separate function to create an object to refer to, instead of doing so for each row. ############
    new_attributes = []  # this is the output list
    current_attributes = row['attribute_original_name']  # this is the input list
    current_attributes = current_attributes.replace('[', '').replace(']', '').replace("'", "").split(',')  # this is a bit of preprocessing on the input list, as it is given as a string in the input
    ### This is the section where each string in the lists is compared and scored
    for a in current_attributes:
        for attr in all_attributes_processed:
            ratio = fuzz.partial_ratio(a, attr)
            if ratio > 90:
                new_attributes.append(attr)
    return new_attributes
The issue with the above is that it is not very performant. I'm sure I can work in a lambda function here but I'm unable to see how best to do it. Any suggestions to speed this up would be greatly appreciated.
PS: The lists are usually only up to 20 strings long at most but this needs to occur for every row in a data frame that is hundreds of thousands in length.
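For reference, the move described in the comment inside the function (building the desired list once instead of once per row) might look roughly like the sketch below. The helper name `build_relevant_attributes` and the sample `p_list` contents are hypothetical, and the sketch assumes each policy dictionary has a single top-level key, as the `next(iter(p))` call in the function implies.

```python
def build_relevant_attributes(policy_list):
    """Flatten the policy dictionaries into one cleaned list of attribute names."""
    relevant = []
    for policy in policy_list:
        key = next(iter(policy))  # assumes one top-level key per policy dict
        # "or []" guards against a missing 'attributes' entry (an added safety check)
        for attr in policy.get(key, {}).get('attributes') or []:
            cleaned = attr.replace('"', '').strip()
            if len(cleaned) > 1:  # same legitimacy test as in the function above
                relevant.append(cleaned)
    return relevant


# Hypothetical policy data in the shape the function above expects.
p_list = [
    {'policy_a': {'attributes': ['"colour"', ' size ', 'x']}},
    {'policy_b': {'attributes': ['weight']}},
]
relevant_attributes = build_relevant_attributes(p_list)
print(relevant_attributes)
```

The result would be computed once before the `apply` call and passed in, so the per-row work is reduced to the string comparison loop.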
PPS: This function is called in a lambda function as follows:
df['attribute_original_name'] = df.apply(
    lambda row: refine_attributes(row, p_list), axis=1
)
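Since the same raw attribute string may well appear in many of the rows, one possible way to cut the work is to cache the result per distinct input string. The following is only a sketch under that assumption: `make_refiner` and `stand_in_ratio` are hypothetical names, and `stand_in_ratio` is a `difflib`-based placeholder used so the example runs without third-party packages. It is not equivalent to `fuzz.partial_ratio`, which would be passed in as `scorer` in real use.

```python
from difflib import SequenceMatcher
from functools import lru_cache


def stand_in_ratio(a, b):
    """Placeholder scorer on a 0-100 scale; NOT the same as fuzz.partial_ratio."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def make_refiner(relevant_attributes, scorer, threshold=90):
    """Return a function that refines one raw attribute string, cached per distinct input."""
    relevant = tuple(relevant_attributes)

    @lru_cache(maxsize=None)
    def refine(raw):
        # Same clean-up as in the question, plus strip() on each item
        # (an addition, because the placeholder scorer is sensitive to padding).
        items = raw.replace('[', '').replace(']', '').replace("'", '').split(',')
        matches = []
        for a in (item.strip() for item in items):
            for attr in relevant:
                if scorer(a, attr) > threshold:
                    matches.append(attr)
        return tuple(matches)  # immutable, so callers cannot alter cached results

    return refine


refine = make_refiner(['colour', 'size', 'weight'], stand_in_ratio)
rows = ["['colour', 'size']", "['weight']", "['colour', 'size']"]
results = [list(refine(raw)) for raw in rows]
print(results)
print(refine.cache_info())  # the repeated third string is served from the cache
```

With pandas, something like `df['attribute_original_name'].map(refine)` could replace the row-wise `apply`, though how much this helps depends on how often the input strings actually repeat.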
I've seen this thread and wonder if I need to create a data frame here too: "Python Fuzzy matching strings in list performance". Would this need one row for each comparison of every string in the input list with every string in the desired list?
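To make the "one row per comparison" idea concrete, the pairing of every input string with every desired string is a cross product, which can be sketched without a data frame. This is an illustration only: the sample lists are made up, and `stand_in_ratio` is a hypothetical `difflib`-based placeholder rather than `fuzz.partial_ratio`.

```python
from difflib import SequenceMatcher
from itertools import product


def stand_in_ratio(a, b):
    """Placeholder scorer on a 0-100 scale; NOT the same as fuzz.partial_ratio."""
    return 100 * SequenceMatcher(None, a, b).ratio()


input_attributes = ['colour', 'sise']             # made-up input list
relevant_attributes = ['colour', 'size', 'weight']  # made-up desired list

# One (input, candidate, score) tuple per comparison: 2 x 3 = 6 "rows".
scored_pairs = [
    (a, attr, stand_in_ratio(a, attr))
    for a, attr in product(input_attributes, relevant_attributes)
]
matches = [attr for _, attr, score in scored_pairs if score > 90]
print(len(scored_pairs), matches)
```

With lists of at most 20 strings each, this is no more than 400 comparisons per row, so the long form is small for a single row but grows quickly across hundreds of thousands of rows.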
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
