'Comparing Multi-indexed Dataframes with Multiple Comparison Criteria

I have two sets of data in form of two DataFrames, df1 and df2, which I want to compare with each other. Each DataFrame contains multiple entries with the hierarchical structure Lang-Code 🠖 Node_Type 🠖 Node_ID 🠖 Entry_Type, each entry consists out of two values, Attribute A and Attribute B.

I've multi-indexed the DataFrames according the hierarchical structure:

DataFrame df1:

                                        Attribute A  Attribute B
Lang-Code Node_Type Node_ID Entry_Type                          
de        N1        918     A                     2         True
en        N1        421     A                     8         True
                    442     A                     3        False
                            A                     1        False
                            B                     5        False
          N2        732     A                     9        False
fr        N1        978     D                     2         True

DataFrame df2:

                                        Attribute A  Attribute B
Lang-Code Node_Type Node_ID Entry_Type                          
de        N1        479     A                    92         True # Different value for Attribute A in df1
en        N1        734     A                     8         True
                            B                     1        False # Entry not conained in df1
                    887     A                     3        False
                            A                     1        False
                            B                     5        False
          N2        888     A                     9        False
fr        N4        264     D                     2         True # Different Node_Type in df1

I would be interested in which Nodes are identically regarding their contained values. However, while the Node_Type must match, Node_ID is an unique value used for grouping only (the actual value for Node_ID won't match, but both the number of entries as well as their values shall be considered).

In the example above, only certain nodes do match between df1 and df2, so the result I try to achieve would be as follows:

                                Attribute A  Attribute B
Lang-Code Node_Type Entry_Type                          
en        N1        A                     3        False
                    A                     1        False
                    B                     5        False
          N2        A                     9        False

I already tried looping through the dataframe with df1.iterrows() and df.itertuples() or using df.merge(), but especially the fact that the unique Node_ID fulfills a grouping purpose gives me a headache!

I am fairly new to working with pandas, so I am not fully familiar with every aspect yet, so any tips on how to perform such a comparison more efficiently would be very welcome! :) Would it be e.g. more useful to transpose the DataFrame, to have Attribute A and Attribute B as an index?

TIA!

Solution 1:^[1]

It seems to me that you don't really bother about Node_ID in your output. In that case you could first droplevel(level='Node_ID') and then reset_index() on both dataframes. And then just do an inner merge on both dataframes and then you can set the index back using set_index(['Lang_Code', 'Node_Type', 'Entry_Type'])

so something like (not tested):

df1 = df1.droplevel(level=['Node_ID'])
df2 = df2.droplevel(level=['Node_ID'])
df1.reset_index(inplace=True)
df2.reset_index(inplace=True)
df3 = df1.merge(df2, on=df1.columns.tolist(),how='inner')
df3.set_index(['Lang_Code','Node_Type', 'Entry_Type'], inplace=True)

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	SomeDude

'Comparing Multi-indexed Dataframes with Multiple Comparison Criteria

Solution 1:[1]

Sources

Related Questions

Solution 1:^[1]