'Pandas Merge not working correctly, there are lots of NaNs

I have 2 datasets (See sample DFs below)

SKU

SKU Description
CDH21A Screw
EG1 Bolt
00TTA Drill
BBHRS Screw
434TFR34F Screwdriver
3QFF5Q4Y Nail
AAA54Q3F Timber

Sales

SKU
02cdh21a
00eg11
0100tta
bbhrs22
ggffg
fgjhty
rtrtrtrt

What I'd like to do is merge the description column from the SKU DF to the Sales DF. The issue is that the SKUs in the SKU DF can be substrings in the Sales DF.

The code I currently have is outlined below, I strip any special characters and Whitespace from both DFs first. I then create a list called key which finds the common SKUs in both DFs and then I try to merge. Unfortunately when I merge I'm left with the following (The Description column is all NaN):

SKU Description
02cdh21a NaN
00eg11 NaN
0100tta NaN
bbhrs22 NaN
ggffg NaN
fgjhty NaN
rtrtrtrt NaN

Code

SKU2= pd.read_excel(r'C:\Desktop\DF1.xlsx') 
Sales2= pd.read_excel(r'C:\Desktop\DF2.xlsx')



 # Change D.Types
Sales2['SKU'] = Sales2['SKU'].astype(str)
SKU2['SKU'] = SKU2['SKU'].astype(str)

# Strip whitespace
SKU2['SKU'] = SKU2['SKU'].str.strip()
Sales2['SKU'] = Sales2['SKU'].str.strip()

# Remove Special characters
SKU2['SKU'] = SKU2['SKU'].str.replace(r'\s+', ' ')
Sales2['SKU'] = Sales2['SKU'].str.replace(r'\s+', ' ')

import re
Sales2['SKU'] = Sales2['SKU'].str.replace("[^A-Za-z0-9]","")
SKU2['SKU'] = SKU2['SKU'].str.replace("[^A-Za-z0-9]","")


Sales2['SKU'] = Sales2['SKU'].str.lower()
SKU2['SKU'] = SKU2['SKU'].str.lower()

# Create a list
key = Sales2['SKU'].str.extract(f"({'|'.join(sorted(SKU2['SKU'].values, key=len,reverse=True))})", expand=False)

# Merge SKU column and TF column to the sales DF
Sales2 = pd.merge(Sales2, SKU2[['FRUIT']], left_on=['SKU'],right_on=key, how='left').drop_duplicates()


Solution 1:[1]

Change to lower case first:

sku_df['SKU_to_merge'] = sku_df['SKU'].str.lower()

Extract mergable strings:

pat = r'(%s)'%'|'.join(sku_df['SKU'].str.lower().unique())
sales_df['SKU_to_merge'] = sales_df['SKU'].str.lower().str.extract(pat)

And merge:

sales_df.merge(sku_df[['SKU_to_merge', 'Description']], on='SKU_to_merge', how='left')


        SKU SKU_to_merge Description
0  02cdh21a       cdh21a       Screw
1    00eg11          eg1        Bolt
2   0100tta        00tta       Drill
3   bbhrs22        bbhrs       Screw
4     ggffg          NaN         NaN
5    fgjhty          NaN         NaN
6  rtrtrtrt          NaN         NaN

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Raymond Kwok