'merging two dataframes only if both values match and belong to the same ID
I have a question about merging pandas dataframes. I have two dataframes that look like:
df1:
chain cdr3 raw_clonotype_id
TRB CASSARGASGERTDTQYF clonotype1
TRA CAVSVVRNNNARLMF clonotype1
TRA CAVRGNARLMF clonotype4
TRB CASSLYGTGGSGANVLTF clonotype4
df2:
cdr3 chain
CAVSVVRNNNARLMF TRA
CAVRGNARLMF TRA
CASSLYGTGGSGANVLTF TRB
I would like to to merge them on cdr3 but only keep the lines where the clonotype ID for both the chains (TRB and TRA) is a match.
To get a result like this:
chain cdr3 raw_clonotype_id
TRA CAVRGNARLMF clonotype4
TRB CASSLYGTGGSGANVLTF clonotype4
I tried this:
df3 = df.merge(df2.groupby('cdr3',as_index=False)[['cdr3']].agg(','.join),how='left')
but the output is:
cdr3 chain
CAVSVVRNNNARLMF TRA
CAVRGNARLMF TRA
CASSLYGTGGSGANVLTF TRB
I think I just need to add something to the above line but I don't know what.
Solution 1:[1]
IIUC, you can first merge, then filter, here using GroupBy.transform('nunique') to ensure having 2 unique values per clonotype (2 receptors):
df3 = df1.merge(df2, on=['cdr3', 'chain'])
df3 = df3[df3.groupby('raw_clonotype_id')['chain'].transform('nunique').eq(2)]
output:
chain cdr3 raw_clonotype_id
1 TRA CAVRGNARLMF clonotype4
2 TRB CASSLYGTGGSGANVLTF clonotype4
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | mozway |
