'Merging dataframes in in Python

I am looking for a more efficient way to solve the below problem. I have 1 dataframe which consists of Account Numbers and its parent account. I have another dataframe (taken from a different database) which contains Account Numbers, Parent ID and Account Owner.

The idea is to join the Account Owner from the second dataframe to the first dataframe. While this should be straight forward i.e. doing a left join between both tables, I need to do 2 merges which seems very inefficient way of doing it.

The first join matches on the account number and maps the account owner, then the 2nd join matches on the parent account and populates the account owner. In this example there are 2 rows containing 0 values, they will be left as is.

The reason for 2 checks - if an account doesn't have an owner, it then checks its parent account if it has an owner.

Here is my workings:

Dataframe 1:

df = pd.DataFrame([[5551455514,555145],
                    [9631475400,963147],
                    [1111111111,111111],
                    [2222222222,222222],
                    [8529875405,852987]],
columns = ['Account','ParentAccount'])
df

Dataframe 1 Output:

Account         ParentAccount
0   5551455514  555145
1   9631475400  963147
2   1111111111  111111          
3   2222222222  222222          
4   8529875405  852987

Dataframe 2:

df2 = pd.DataFrame([[5551455514,555145,'Name_1'],
                    [9631475400,963147,'Name_2'],
                    [8529871478,852987,'Name_3']],
columns = ['Account','ParentAccount','Account Owner'])
df2

Dataframe 2 Output:

Account         ParentAccount   Account Owner
0   5551455514  555145          Name_1
1   9631475400  963147          Name_2
2   8529871478  852987          Name_3

Merging on the Account number, then merging on the parent account:

# left join on account as primary key between both dataframes
df = pd.merge(df, df2, left_on=['Account'], right_on=['Account'],how='left')
# left join on Parent ID as primary key between both dataframes
df = pd.merge(df, df2, left_on=['ParentAccount_x'], right_on=['ParentAccount'],how='left')

# Re-formatting.....
# drop Account column from dataframe 2 as it doesn't need to be include since account already exists in df1 
df = df.drop(['ParentAccount_y','Account Owner_x','ParentAccount','Account_y'], axis = 1)
# rename some headers
df.rename(columns = {'Account_x':'Account','ParentAccount_x':'ParentAccount','Account Owner_y':'Account Owner'}, inplace = True)

Output:

Account         ParentAccount   Account Owner
0   5551455514  555145          Name_1
1   9631475400  963147          Name_2
2   1111111111  111111          NaN
3   2222222222  222222          NaN
4   8529875405  852987          Name_3

There must be a better way to achieve the above output.



Solution 1:[1]

Given your two data frames named df and df2 you can get the desired output by:

dk = pd.merge(df, df2, how='left', on='ParentAccount').drop(['Account_y'], axis= 1)
dk.rename(columns={'Account_x':'Account'}, inplace=True)

This yields:

    Account ParentAccount   Account Owner
0   5551455514  555145  Name_1
1   9631475400  963147  Name_2
2   1111111111  111111  NaN
3   2222222222  222222  NaN
4   8529875405  852987  Name_3

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 itprorh66