'PYTHON Left Join Based on Partial Matches of Column

I have two dateframes:

df1 = pd.DataFrame([  
          ['ABC, TX', 'P1']
        , ['NA BCD, CT', 'P2']
        , ['CDE, MN', 'P3']
        , ['CDE, MN', 'P4']
        , ['CDF, MN', 'P5']
        , ['GDE, MA', 'P6']
       ]
    ,columns = ['task_name', 'pipeline_name']
)

df2 = pd.DataFrame([  
          ['AAA', 'C1']
        , ['BCD', 'C2']
        , ['CDE', 'C3']
        , ['ABC', 'C4']
       ]
    ,columns = ['partial_task_name', 'extra_value']
)

print(df1)
print(df2)

    task_name pipeline_name
0     ABC, TX            P1
1  NA BCD, CT            P2
2     CDE, MN            P3
3     CDE, MN            P4
4     CDF, MN            P5
5     GDE, MA            P6
  partial_task_name extra_value
0               AAA          C1
1               BCD          C2
2               CDE          C3
3               ABC          C4

I would like to join them by matching task_name in df1 and partial_name in df2.

the desired output:

    task_name pipeline_name extra_value
0     ABC, TX            P1    C4 
1  NA BCD, CT            P2    C2
2     CDE, MN            P3    C3
3     CDE, MN            P4    C3
4     CDF, MN            P5    
5     GDE, MA            P6

Does anyone know how to do that?

I am trying to install the the fuzz package:

conda install -c conda-forge fuzzywuzzy
conda install -c conda-forge/label/gcc7 fuzzywuzzy
conda install -c conda-forge/label/cf201901 fuzzywuzzy
conda install -c conda-forge/label/cf202003 fuzzywuzzy

NONE of these work.

and I also tried use: pip install thefuzz in the anaconda environment. It does not work either.

Can I join these dataframes without fuzz package?

Thank you so much!

Update:

if my dataset like:

import pandas as pd
df1 = pd.DataFrame([  
      ['ABC ef , TX', 'P1']
    , ['NaA BCD, CT', 'P2']
    , ['a CDE MN', 'P3']
    , ['aa. CDE, MN', 'P4']
    , ['bbbb  xxx CDF df, MN', 'P5']
    , ['bcDEff, MA', 'P6']
   ]
    ,columns = ['task_name', 'pipeline_name']
)

df2 = pd.DataFrame([  
          ['AAA', 'C1']
        , ['BCDe', 'C2']
        , ['CDE', 'C3']
        , ['AB', 'C4']
       ]
    ,columns = ['partial_task_name', 'extra_value']
)
print(df1)
print(df2)

              task_name pipeline_name
0           ABC ef , TX            P1
1           NaA BCD, CT            P2
2             a CDE MN            P3
3            aa. CDE, MN            P4
4  bbbb  xxx CDF df, MN            P5
5            bcDEff, MA            P6
      partial_task_name extra_value
    0               AAA          C1
    1              BCDe          C2
    2               CDE          C3
    3               ABC          C4

Can I get the output like:

              task_name pipeline_name  extra_value
0           ABC ef , TX            P1      C4
1           NaA BCD, CT            P2      C2
2             a CDE  MN            P3      C3
3            aa CDE, MN            P4      C3
4  bbbb  xxx CDF df, MN            P5     
5            bcDEff, MA            P6      C2

Thank you so much!!!



Solution 1:[1]

Extract the partial_task_name from task_name as a new column and join on that column.

df1['partial_task_name'] = df1['task_name'].str.extract('(\w+),')
df = pd.merge(df1, df2, how='left').drop('partial_task_name', axis=1)
print(df)

Output:

    task_name pipeline_name extra_value
0     ABC, TX            P1          C4
1  NA BCD, CT            P2          C2
2     CDE, MN            P3          C3
3     CDE, MN            P4          C3
4     CDF, MN            P5         NaN
5     GDE, MA            P6         NaN

More complicated regex or list and data correction...

parts_reg = '('+'|'.join([f'{x[0:3]}' for x in df2.partial_task_name.str.upper().tolist()])+')'
df1['partial_task_name'] = df1.task_name.str.upper().str.extract(parts_reg, expand=False)
df2.partial_task_name = df2.partial_task_name.str[0:3]
df = pd.merge(df1, df2, how='left').drop('partial_task_name', axis=1)
print(df)

Output:

              task_name pipeline_name extra_value
0           ABC ef , TX            P1          C4
1           NaA BCD, CT            P2          C2
2              a CDE MN            P3          C3
3           aa. CDE, MN            P4          C3
4  bbbb  xxx CDF df, MN            P5         NaN
5            bcDEff, MA            P6          C2

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1