Comparing strings in a list using Pandas and selecting the ones that look the most similar to each other
The goal is to compare the strings in a list of hashtags against each other and select the ones that look alike. The similarity ratio is between 0 and 1: strings with a ratio of 0.8 or more are selected, and anything below 0.8 is discarded. A sample of the hashtag list was attached as an image. I started writing the function but couldn't translate my idea into code.

import difflib
def getratio(string1, string2):
    # similarity ratio between 0 and 1
    return difflib.SequenceMatcher(None, string1, string2).ratio()
Solution 1:[1]
Here's how you can loop through the lists of strings in one column (January) and compare each string with the lists of strings in another column (February). Please avoid sharing data as an image, since it cannot be copied and pasted.
# imports
import pandas as pd
from difflib import SequenceMatcher
# define lists of strings and lists of targets
list_of_strings = [['celtics', 'boston'], ['celtics', 'sample']]
list_of_targets = [['Celtics', 'sample1'], ['sample1', 'sample2', 'sample3']]
# create sample dataframe
df = pd.DataFrame({'January':list_of_strings, 'February':list_of_targets})
This is the sample input dataframe:
             January                     February
0  [celtics, boston]           [Celtics, sample1]
1  [celtics, sample]  [sample1, sample2, sample3]
This is the function to compare one list of strings to another list.
# define function to compare a list of strings to a test list of strings
def get_ratio(word_list, test_list, threshold):
    # create empty list to store results
    matched_words = []
    # loop through each string in word_list
    for word in word_list:
        # compare string in word_list to list of test strings and keep those whose ratio is more than the defined threshold
        matched_words.append([test for test in test_list if SequenceMatcher(None, word, test).ratio() > threshold])
    # flatten list of lists
    matched_words = [ele for sublist in matched_words for ele in sublist]
    # return results
    return matched_words
# apply function and store results in new column
df['matches'] = df.apply(lambda x: get_ratio(x['January'], x['February'], 0.8), axis=1)
This outputs:
             January                     February                      matches
0  [celtics, boston]           [Celtics, sample1]                    [Celtics]
1  [celtics, sample]  [sample1, sample2, sample3]  [sample1, sample2, sample3]
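If the strings all live in a single list, as the question describes, the same SequenceMatcher check can be applied to every pair within that list. The following is only a minimal sketch under assumptions: the function name similar_pairs and the sample hashtags are hypothetical, and it keeps pairs at 0.8 or more (>=) to match the wording of the question rather than the strict > used above.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(words, threshold=0.8):
    """Return (word_a, word_b, ratio) for each pair scoring at or above threshold."""
    results = []
    # combinations() yields each unordered pair once, so no word is compared with itself
    for a, b in combinations(words, 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            results.append((a, b, ratio))
    return results


# hypothetical sample data standing in for the hashtag list
hashtags = ["celtics", "Celtics", "boston", "sample1", "sample2"]
pairs = similar_pairs(hashtags)
for a, b, ratio in pairs:
    print(a, b, round(ratio, 3))
```

The comparison is case-sensitive, so lowercasing the strings first (for example with str.lower()) may be worth considering. The list of tuples can also be passed to pd.DataFrame with a columns argument if a dataframe is preferred. Note that the number of comparisons grows quadratically with the length of the list.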
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | greco |
