'merging two dataframes only if both values match and belong to the same ID

I have a question about merging pandas dataframes. I have two dataframes that look like:

df1:

chain   cdr3    raw_clonotype_id
TRB CASSARGASGERTDTQYF  clonotype1
TRA CAVSVVRNNNARLMF clonotype1
TRA CAVRGNARLMF clonotype4
TRB CASSLYGTGGSGANVLTF  clonotype4

df2:

cdr3    chain
CAVSVVRNNNARLMF TRA
CAVRGNARLMF TRA
CASSLYGTGGSGANVLTF  TRB

I would like to to merge them on cdr3 but only keep the lines where the clonotype ID for both the chains (TRB and TRA) is a match.

To get a result like this:

chain   cdr3    raw_clonotype_id
TRA CAVRGNARLMF clonotype4
TRB CASSLYGTGGSGANVLTF  clonotype4

I tried this:

df3 = df.merge(df2.groupby('cdr3',as_index=False)[['cdr3']].agg(','.join),how='left')

but the output is:

cdr3    chain
CAVSVVRNNNARLMF TRA
CAVRGNARLMF TRA
CASSLYGTGGSGANVLTF  TRB

I think I just need to add something to the above line but I don't know what.



Solution 1:[1]

IIUC, you can first merge, then filter, here using GroupBy.transform('nunique') to ensure having 2 unique values per clonotype (2 receptors):

df3 = df1.merge(df2, on=['cdr3', 'chain'])
df3 = df3[df3.groupby('raw_clonotype_id')['chain'].transform('nunique').eq(2)]

output:

  chain                cdr3 raw_clonotype_id
1   TRA         CAVRGNARLMF       clonotype4
2   TRB  CASSLYGTGGSGANVLTF       clonotype4

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 mozway