'Comparing Multi-indexed Dataframes with Multiple Comparison Criteria
I have two sets of data in form of two DataFrames, df1 and df2, which I want to compare with each other. Each DataFrame contains multiple entries with the hierarchical structure Lang-Code 🠖 Node_Type 🠖 Node_ID 🠖 Entry_Type, each entry consists out of two values, Attribute A and Attribute B.
I've multi-indexed the DataFrames according the hierarchical structure:
DataFrame df1:
Attribute A Attribute B
Lang-Code Node_Type Node_ID Entry_Type
de N1 918 A 2 True
en N1 421 A 8 True
442 A 3 False
A 1 False
B 5 False
N2 732 A 9 False
fr N1 978 D 2 True
DataFrame df2:
Attribute A Attribute B
Lang-Code Node_Type Node_ID Entry_Type
de N1 479 A 92 True # Different value for Attribute A in df1
en N1 734 A 8 True
B 1 False # Entry not conained in df1
887 A 3 False
A 1 False
B 5 False
N2 888 A 9 False
fr N4 264 D 2 True # Different Node_Type in df1
I would be interested in which Nodes are identically regarding their contained values. However, while the Node_Type must match, Node_ID is an unique value used for grouping only (the actual value for Node_ID won't match, but both the number of entries as well as their values shall be considered).
In the example above, only certain nodes do match between df1 and df2, so the result I try to achieve would be as follows:
Attribute A Attribute B
Lang-Code Node_Type Entry_Type
en N1 A 3 False
A 1 False
B 5 False
N2 A 9 False
I already tried looping through the dataframe with df1.iterrows() and df.itertuples() or using df.merge(), but especially the fact that the unique Node_ID fulfills a grouping purpose gives me a headache!
I am fairly new to working with pandas, so I am not fully familiar with every aspect yet, so any tips on how to perform such a comparison more efficiently would be very welcome! :)
Would it be e.g. more useful to transpose the DataFrame, to have Attribute A and Attribute B as an index?
TIA!
Solution 1:[1]
It seems to me that you don't really bother about Node_ID in your output.
In that case you could first droplevel(level='Node_ID') and then reset_index() on both dataframes.
And then just do an inner merge on both dataframes and then you can set the index back using set_index(['Lang_Code', 'Node_Type', 'Entry_Type'])
so something like (not tested):
df1 = df1.droplevel(level=['Node_ID'])
df2 = df2.droplevel(level=['Node_ID'])
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)
df3 = df1.merge(df2, on=df1.columns.tolist(),how='inner')
df3.set_index(['Lang_Code','Node_Type', 'Entry_Type'], inplace=True)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | SomeDude |
