'PYTHON Left Join Based on Partial Matches of Column
I have two dateframes:
df1 = pd.DataFrame([
['ABC, TX', 'P1']
, ['NA BCD, CT', 'P2']
, ['CDE, MN', 'P3']
, ['CDE, MN', 'P4']
, ['CDF, MN', 'P5']
, ['GDE, MA', 'P6']
]
,columns = ['task_name', 'pipeline_name']
)
df2 = pd.DataFrame([
['AAA', 'C1']
, ['BCD', 'C2']
, ['CDE', 'C3']
, ['ABC', 'C4']
]
,columns = ['partial_task_name', 'extra_value']
)
print(df1)
print(df2)
task_name pipeline_name
0 ABC, TX P1
1 NA BCD, CT P2
2 CDE, MN P3
3 CDE, MN P4
4 CDF, MN P5
5 GDE, MA P6
partial_task_name extra_value
0 AAA C1
1 BCD C2
2 CDE C3
3 ABC C4
I would like to join them by matching task_name in df1 and partial_name in df2.
the desired output:
task_name pipeline_name extra_value
0 ABC, TX P1 C4
1 NA BCD, CT P2 C2
2 CDE, MN P3 C3
3 CDE, MN P4 C3
4 CDF, MN P5
5 GDE, MA P6
Does anyone know how to do that?
I am trying to install the the fuzz package:
conda install -c conda-forge fuzzywuzzy
conda install -c conda-forge/label/gcc7 fuzzywuzzy
conda install -c conda-forge/label/cf201901 fuzzywuzzy
conda install -c conda-forge/label/cf202003 fuzzywuzzy
NONE of these work.
and I also tried use: pip install thefuzz in the anaconda environment.
It does not work either.
Can I join these dataframes without fuzz package?
Thank you so much!
Update:
if my dataset like:
import pandas as pd
df1 = pd.DataFrame([
['ABC ef , TX', 'P1']
, ['NaA BCD, CT', 'P2']
, ['a CDE MN', 'P3']
, ['aa. CDE, MN', 'P4']
, ['bbbb xxx CDF df, MN', 'P5']
, ['bcDEff, MA', 'P6']
]
,columns = ['task_name', 'pipeline_name']
)
df2 = pd.DataFrame([
['AAA', 'C1']
, ['BCDe', 'C2']
, ['CDE', 'C3']
, ['AB', 'C4']
]
,columns = ['partial_task_name', 'extra_value']
)
print(df1)
print(df2)
task_name pipeline_name
0 ABC ef , TX P1
1 NaA BCD, CT P2
2 a CDE MN P3
3 aa. CDE, MN P4
4 bbbb xxx CDF df, MN P5
5 bcDEff, MA P6
partial_task_name extra_value
0 AAA C1
1 BCDe C2
2 CDE C3
3 ABC C4
Can I get the output like:
task_name pipeline_name extra_value
0 ABC ef , TX P1 C4
1 NaA BCD, CT P2 C2
2 a CDE MN P3 C3
3 aa CDE, MN P4 C3
4 bbbb xxx CDF df, MN P5
5 bcDEff, MA P6 C2
Thank you so much!!!
Solution 1:[1]
Extract the partial_task_name from task_name as a new column and join on that column.
df1['partial_task_name'] = df1['task_name'].str.extract('(\w+),')
df = pd.merge(df1, df2, how='left').drop('partial_task_name', axis=1)
print(df)
Output:
task_name pipeline_name extra_value
0 ABC, TX P1 C4
1 NA BCD, CT P2 C2
2 CDE, MN P3 C3
3 CDE, MN P4 C3
4 CDF, MN P5 NaN
5 GDE, MA P6 NaN
More complicated regex or list and data correction...
parts_reg = '('+'|'.join([f'{x[0:3]}' for x in df2.partial_task_name.str.upper().tolist()])+')'
df1['partial_task_name'] = df1.task_name.str.upper().str.extract(parts_reg, expand=False)
df2.partial_task_name = df2.partial_task_name.str[0:3]
df = pd.merge(df1, df2, how='left').drop('partial_task_name', axis=1)
print(df)
Output:
task_name pipeline_name extra_value
0 ABC ef , TX P1 C4
1 NaA BCD, CT P2 C2
2 a CDE MN P3 C3
3 aa. CDE, MN P4 C3
4 bbbb xxx CDF df, MN P5 NaN
5 bcDEff, MA P6 C2
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
