'How to do fuzzy matching within in the same dataset with multiple columns

I have a student rank dataset in which a few values are missing and I want to do fuzzy logic on names and rank columns within the same dataset, find the best matching values, update null values for the rest of the columns, and add a matched name column, matched rank column, and score. I'm a beginner that would be great if someone help me. Thank You.

data:

    Name  School Marks Location Rank
0   JACK   TML    90    AU       3
1   JHON   SSP    85    NULL     NULL
2   NULL   TML    NULL  AU       3
3   BECK   NTC    NULL  EU       2
4   JHON   SSP    NULL  JP       1
5   SEON   NTC    80    RS       5



Expected Data Output:
data:

    Name  School Marks Location Rank Matched_Name Matched_Rank Score
0   JACK   TML    90    AU       3     Jack           3         100
1   JHON   SSP    85    JP       1     JHON           1         100
2   BECK   NTC    NULL  EU       2      -             -          -
3   SEON   NTC    80    RS       5      -             -          -

I how to do it with fuzzy logic ?

here is my code

ds1 = pd.read_csv(dataset.csv)
ds2 = pd.read_csv(dataset.csv)


# Columns to match on from df_left
left_on = ["Name", "Rank"]

# Columns to match on from df_right
right_on = ["Name", "Rank"]

# Now perform the match
#Start the time
a = datetime.datetime.now()
print('started at :',a)
# It will take several minutes to run on this data set
matched_results = fuzzymatcher.fuzzy_left_join(ds1,
                                               ds2,
                                               left_on,
                                               right_on)
b = datetime.datetime.now()
print('end at :', b)
print("Time taken: ", b-a)

print(matched_results)
try:
    print(matched_results.columns)
    cols = matched_results.columns
except:
    pass
print(matched_results.to_csv('matched_results.csv',index=False))

# Let's see the best matches
try:
    matched_results[cols].sort_values(by=['best_match_score'], ascending=False).head(5)
except:
    pass


Solution 1:[1]

Using fuzzywuzzy is usually when names are not exact matches. I can't see this in your case. However, if your names aren't exact matches, you may do the following:

  1. Create a list of all school names using
df['school_name'].tolist()
  1. Find null values in your data frame.
  2. Use
process.extractOne(current_name, school_names_list, scorer=fuzz.partial_ratio)

Just remember that you should never use Fuzzy if you have exact names. You'll only need to filter the data frame like this:

filtered = df[df['school_name'] == x]

and use it to replace values in the original data frame.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Esraa Abdelmaksoud