'Pandas, merge two dataframes based on one column that contains the other
I have the following dataframe B:
|index|cond2 |
|-----|-----------------------------------------------------|
|0 | ((age <= 23), ((age <= 31) & (23 < age)))|
|1 | (((age <= 38) & (31 < age)))|
|2 | (((age <= 50) & (43 < age)), ((age <= 43) & (3...|
|3 | (((age <= 54) & (50 < age)))|
|4 | (((age <= 64) & (54 < age)))|
|5 | ((64 < age))|
which contains thousands of lines. Note that cond2 values are strings.
the column cond1 (its values are also strings) was created from the following table A:
|cond1 |note|
|------------------------|----|
|(age <= 23) |20 |
|(age <= 31) & (23 < age)|30 |
|(age <= 38) & (31 < age)|40 |
How may I merge both tables to add the column 'note' to table B if 'cond1' is contained in 'cond2'. The expected output is the following table:
| index | cond2 | cond1 | note |
|---|---|---|---|
| 0 | ((age <= 23), ((age <= 31) & (23 < age))) | (age <= 23) | 20 |
| 1 | (((age <= 38) & (31 < age))) | ((age <= 38) & (31 < age)) | 40 |
| 2 | (((age <= 50) & (43 < age)), ((age <= 43) & (3... | ((age <= 50) & (43 < age)) | 60 |
As far, I have tried to used the function I found in a similar question (Python Pandas DataFrame str contains merge if) but unfortunately, It doesn't work in this specific case:
def str_merge(part_string_df,full_string_df, merge_column):
merge_column_lower = 'merge_column_lower'
part_string_df[merge_column_lower] = part_string_df[merge_column].str.lower()
full_string_df[merge_column_lower] = full_string_df[merge_column].str.lower()
pat = '|'.join(r"{}".format(x) for x in part_string_df[merge_column_lower])
full_string_df['Test3'] = full_string_df[merge_column_lower].str.extract('('+ pat + ')', expand=True)
DF = pd.merge(part_string_df, full_string_df, left_on= merge_column_lower, right_on='Test3').drop([merge_column_lower + '_x',merge_column_lower + '_y','Test3'],axis=1)
return DF
Solution 1:[1]
Try:
dfB["cond1"] = dfB["cond2"].apply(lambda x: [c for c in dfA["cond1"] if c in x])
dfB = dfB.explode("cond1").drop_duplicates("cond2")
output = dfB.merge(dfA)
>>> output
cond2 ... note
0 ((age <= 23), ((age <= 31) & (23 < age))) ... 20
1 (((age <= 38) & (31 < age))) ... 40
2 (((age <= 50) & (43 < age)), ((age <= 43) & (3... ... 60
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | not_speshal |
