Optimize the traversal of a column of a dataframe
I want to check for fuzzy duplicates in a column of a dataframe using fuzzywuzzy. As things stand, I have to compare the rows pairwise using two nested for loops.
```python
from fuzzywuzzy import fuzz

for i in df['col']:
    for j in df['col']:
        ratio = fuzz.ratio(i, j)
        if ratio > 90:
            print("row duplicates")
```
The problem is that my dataframe contains 600,000 rows, and this code has O(n²) complexity.
Is there a lighter way of doing this?
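For context, two easy reductions of the pairwise work are to compare only unique values and to visit each unordered pair once (`j > i`), which skips exact duplicates and halves the comparisons, though the approach stays quadratic in the number of unique values. A minimal sketch using only the standard library's `difflib` (on which `fuzz.ratio` is based) as a stand-in for fuzzywuzzy; `find_fuzzy_duplicates` is a hypothetical helper, not part of the original question:

```python
from difflib import SequenceMatcher

def fuzzy_ratio(a, b):
    # difflib's similarity scaled to 0-100, comparable to fuzz.ratio
    return int(round(SequenceMatcher(None, a, b).ratio() * 100))

def find_fuzzy_duplicates(values, threshold=90):
    # Compare unique values only, and each unordered pair exactly once:
    # this removes exact duplicates up front and halves the comparisons.
    unique = sorted(set(values))
    pairs = []
    for i in range(len(unique)):
        for j in range(i + 1, len(unique)):
            if fuzzy_ratio(unique[i], unique[j]) > threshold:
                pairs.append((unique[i], unique[j]))
    return pairs

print(find_fuzzy_duplicates(["apple", "apples", "banana", "apple"]))
# → [('apple', 'apples')]
```

In practice, libraries such as rapidfuzz offer vectorized batch comparisons (e.g. `process.cdist`) that are considerably faster than a Python-level double loop, which may be closer to the "lighter way" being asked for.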
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
