'Pandas Merge not working correctly, there are lots of NaNs
I have 2 datasets (See sample DFs below)
SKU
| SKU | Description |
|---|---|
| CDH21A | Screw |
| EG1 | Bolt |
| 00TTA | Drill |
| BBHRS | Screw |
| 434TFR34F | Screwdriver |
| 3QFF5Q4Y | Nail |
| AAA54Q3F | Timber |
Sales
| SKU |
|---|
| 02cdh21a |
| 00eg11 |
| 0100tta |
| bbhrs22 |
| ggffg |
| fgjhty |
| rtrtrtrt |
What I'd like to do is merge the description column from the SKU DF to the Sales DF. The issue is that the SKUs in the SKU DF can be substrings in the Sales DF.
The code I currently have is outlined below, I strip any special characters and Whitespace from both DFs first. I then create a list called key which finds the common SKUs in both DFs and then I try to merge. Unfortunately when I merge I'm left with the following (The Description column is all NaN):
| SKU | Description |
|---|---|
| 02cdh21a | NaN |
| 00eg11 | NaN |
| 0100tta | NaN |
| bbhrs22 | NaN |
| ggffg | NaN |
| fgjhty | NaN |
| rtrtrtrt | NaN |
Code
SKU2= pd.read_excel(r'C:\Desktop\DF1.xlsx')
Sales2= pd.read_excel(r'C:\Desktop\DF2.xlsx')
# Change D.Types
Sales2['SKU'] = Sales2['SKU'].astype(str)
SKU2['SKU'] = SKU2['SKU'].astype(str)
# Strip whitespace
SKU2['SKU'] = SKU2['SKU'].str.strip()
Sales2['SKU'] = Sales2['SKU'].str.strip()
# Remove Special characters
SKU2['SKU'] = SKU2['SKU'].str.replace(r'\s+', ' ')
Sales2['SKU'] = Sales2['SKU'].str.replace(r'\s+', ' ')
import re
Sales2['SKU'] = Sales2['SKU'].str.replace("[^A-Za-z0-9]","")
SKU2['SKU'] = SKU2['SKU'].str.replace("[^A-Za-z0-9]","")
Sales2['SKU'] = Sales2['SKU'].str.lower()
SKU2['SKU'] = SKU2['SKU'].str.lower()
# Create a list
key = Sales2['SKU'].str.extract(f"({'|'.join(sorted(SKU2['SKU'].values, key=len,reverse=True))})", expand=False)
# Merge SKU column and TF column to the sales DF
Sales2 = pd.merge(Sales2, SKU2[['FRUIT']], left_on=['SKU'],right_on=key, how='left').drop_duplicates()
Solution 1:[1]
Change to lower case first:
sku_df['SKU_to_merge'] = sku_df['SKU'].str.lower()
Extract mergable strings:
pat = r'(%s)'%'|'.join(sku_df['SKU'].str.lower().unique())
sales_df['SKU_to_merge'] = sales_df['SKU'].str.lower().str.extract(pat)
And merge:
sales_df.merge(sku_df[['SKU_to_merge', 'Description']], on='SKU_to_merge', how='left')
SKU SKU_to_merge Description
0 02cdh21a cdh21a Screw
1 00eg11 eg1 Bolt
2 0100tta 00tta Drill
3 bbhrs22 bbhrs Screw
4 ggffg NaN NaN
5 fgjhty NaN NaN
6 rtrtrtrt NaN NaN
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Raymond Kwok |
