'merge two datasets based on a string match
I have two datasets;
DF1 contains a column that has a short word.
DF2 has a column with a series of words, where in some cases they start with the short word from DF1
I would like to create a new column in DF2, and everytime the DF1 word appears at the beginning of DF2's word, to then place that DF1 word into the newly created column at that location.
The idea being that once completed, I can then merge the two datasets using the matching word.
How would I do this?
DF1
| ref |
|---|
| ABC |
| DEF |
| GHI |
DF2
| word |
|---|
| ABC123 |
| DEF456 |
| GHI789 |
DF2 - Desired output
| word | new column |
|---|---|
| ABC123 | ABC |
| DEF456 | DEF |
| GHI789 | GHI |
Solution 1:[1]
You can create a list of pattern from ref column then try to extract it from word column:
import re
pattern = fr"({'|'.join(re.escape(x) for x in df1['ref'])})"
df2['new column'] = df2['word'].str.extract(pattern)
print(df2)
# Output
word new column
0 ABC123 ABC
1 DEF456 DEF
2 GHI789 GHI
If it's too simple, you have to use fuzzy logic. You can check this Question & Answer
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
