Comparing strings in a list using Pandas and selecting the ones that look the most similar to each other

The goal is to compare the strings in a list with each other and select the ones that look alike. The similarity ratio should be between 0 and 1: strings are selected if the ratio is 0.8 or more, and anything less than 0.8 is discarded. Here is a sample of the list of hashtags. I started the function but couldn't translate my idea into code.

Here is a sample of the list of the words.

import difflib

def getratio(string1, string2):
    return difflib.SequenceMatcher(None, string1, string2).ratio()  # similarity between 0 and 1


Solution 1:[1]

Here's how you can loop through the lists of strings in one column (January) and compare each of them with the list of strings in another column (February). You should avoid using an image to share data, as we are unable to copy and paste it.

# imports
import pandas as pd
from difflib import SequenceMatcher

# define lists of strings and lists of targets (one inner list per row)
list_of_strings = [['celtics', 'boston'], ['celtics', 'sample']]
list_of_targets = [['Celtics', 'sample1'], ['sample1', 'sample2', 'sample3']]
# create sample dataframe
df = pd.DataFrame({'January':list_of_strings, 'February':list_of_targets})

This is the sample input dataframe:

             January                     February
0  [celtics, boston]           [Celtics, sample1]
1  [celtics, sample]  [sample1, sample2, sample3]

This is the function to compare one list of strings to another list.

# define function to compare a list of strings to a test list of strings
def get_ratio(word_list, test_list, threshold):
    # create empty list to store results
    matched_words = []
    # loop through each string in word_list
    for word in word_list:
        # compare string in word_list to each test string and keep those whose ratio is at least the defined threshold
        matched_words.append([test for test in test_list if SequenceMatcher(None, word, test).ratio() >= threshold])
    # flatten list of lists
    matched_words = [ele for sublist in matched_words for ele in sublist]
    # return results
    return matched_words

# apply function and store results in new column
df['matches'] = df.apply(lambda x: get_ratio(x['January'], x['February'], 0.8), axis=1)

This outputs:

             January                     February                      matches
0  [celtics, boston]           [Celtics, sample1]                    [Celtics]
1  [celtics, sample]  [sample1, sample2, sample3]  [sample1, sample2, sample3]
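
If the strings live in a single list, as the question describes, the same ratio check can be applied to every pair inside that list. The following is only a minimal sketch under assumptions: the function name similar_pairs and the hashtag sample are made up for illustration and are not taken from the question's data.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_pairs(words, threshold=0.8):
    """Return (word_a, word_b, ratio) for every pair scoring at or above threshold."""
    pairs = []
    # combinations() yields each unordered pair once, so a word is never compared with itself
    for a, b in combinations(words, 2):
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 3)))
    return pairs


# hypothetical sample list, not the data from the question
hashtags = ['celtics', 'Celtics', 'boston', 'sample1', 'sample2']
for a, b, ratio in similar_pairs(hashtags):
    print(a, b, ratio)
```

With this sample, only celtics/Celtics and sample1/sample2 pass the threshold, each with a ratio of about 0.857 (six matching characters out of fourteen in total, doubled). Note that the comparison is case-sensitive, so lowercasing both strings first would score celtics and Celtics as identical.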

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 greco