'merge two datasets based on a string match

I have two datasets;

DF1 contains a column that has a short word.

DF2 has a column with a series of words, where in some cases they start with the short word from DF1

I would like to create a new column in DF2, and everytime the DF1 word appears at the beginning of DF2's word, to then place that DF1 word into the newly created column at that location.

The idea being that once completed, I can then merge the two datasets using the matching word.

How would I do this?

DF1

ref
ABC
DEF
GHI

DF2

word
ABC123
DEF456
GHI789

DF2 - Desired output

word new column
ABC123 ABC
DEF456 DEF
GHI789 GHI


Solution 1:[1]

You can create a list of pattern from ref column then try to extract it from word column:

import re

pattern = fr"({'|'.join(re.escape(x) for x in df1['ref'])})"
df2['new column'] = df2['word'].str.extract(pattern)
print(df2)

# Output
     word new column
0  ABC123        ABC
1  DEF456        DEF
2  GHI789        GHI

If it's too simple, you have to use fuzzy logic. You can check this Question & Answer

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1