'Merge different length dfs and preserve all values from "master" df

Working on what I think should be a simple merge but I can't find quite the right solution.

I have two dfs of Fortune 500 companies. df1 is 2 columns (Company and CIK), 117 rows long. df2 is 2 columns (Rank, and Company) and 225 rows long. The company order is different between the dfs.

I want a dataframe that matches the company names (even if the company names aren't spelled 100% the same), preserves the order of df2, and fills in the CIK values from df1 to df2, and fills in Na in the remaining values (108 will be NA).

This merge only gives 52 values (probably the ones that are exactly matching).

df3 = df2.merge(df1, on="Company")

This merge fills a lot of NaN values where things don't exactly match:

df3 = df2.merge(df1, on="Company", how = "outer")

df1:

{'Company': {0: 'Amazon',
  1: 'Verizon Communications',
  2: 'AT&T',
  3: 'Alphabet',
  4: 'Intel',
  5: 'Facebook',
  6: 'Exxon Mobil',
  7: 'Microsoft',
  8: 'Duke Energy',
  9: 'Comcast',
  10: 'Exelon'},
 'CIK': {0: '0001018724',
  1: '0000732712',
  2: '0001018724',
  3: '0001652044',
  4: '0000050863',
  5: '0001326801',
  6: '0000034088',
  7: '0000789019',
  8: '0001326160',
  9: '0001166691',
  10: '0001109357'}}

df2:

{'Rank': {0: 1,
  1: 2,
  2: 3,
  3: 4,
  4: 5,
  5: 6,
  6: 7,
  7: 8,
  8: 9,
  9: 10,
  10: 11},
 'Company': {0: 'Walmart',
  1: 'Amazon',
  2: 'Apple',
  3: 'CVS Health',
  4: 'United Health Group',
  5: 'Berkshire Hathaway',
  6: 'McKesson',
  7: 'AmerisourceBergen',
  8: 'Alphabet',
  9: 'Exxon Mobil',
  10: 'AT&T'}}

I have also tried various .concat and .join but have had similar problems.

Would a loop be more appropriate?

Thank you!

UPDATE: using a left join, the CIK row fills in with NaN (picture attached). Also, should mention that the CIK column is dtype=str because dtype=int64 eliminates the leading zeros and I need the zeros for the API call.



Solution 1:[1]

What you want to do is LEFT JOIN. I made my own sample data considering the attached images.

You can refer this code and understand how it works. If all CIK values are NaN in your result, it might be because 'Company' values are not correctly matched between df1 and df2 by some reasons, such as empty spaces in the values.

import pandas as pd

df1 = pd.DataFrame({
    'Company': ['Amazon', 'Intel', 'Alphabet', 'Tech Data'],
    'CIK': ['0001018724', '000050863', '0001652044', '0000790703'],
})

df2 = pd.DataFrame({
    'Rank': [1, 2, 3, 4, 5],
    'Company': ['Walmart', 'Amazon', 'Apple', 'CVS Health', 'Tech Data'],
})

df3 = df2.merge(df1, on='Company', how='left')

print(df3)
#   Rank     Company         CIK
#0     1     Walmart         NaN
#1     2      Amazon  0001018724
#2     3       Apple         NaN
#3     4  CVS Health         NaN
#4     5   Tech Data  0000790703

There are several kinds of JOINs, such as LEFT JOIN, RIGHT JOIN, OUTER JOIN, INNER JOIN, and so on.

You can see various JOINs with example codes here:

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1