'part of data become nan after merging

I am trying merge these 2 tables and here is the code First Table code and result

import pandas as pd
dfpopulation = pd.read_csv("BoroughPopulation.csv")

dfpopulation = dfpopulation.filter(["Borough","2020"], axis = 1)
dfpopulation["2020"]=dfpopulation['2020'].astype(int)
dfpopulation.head

<bound method NDFrame.head of             Borough     2020
0         NYC Total  8550971
1             Bronx  1446788
2          Brooklyn  2648452
3         Manhattan  1638281
4            Queens  2330295
5     Staten Island   487155>

Second Table code and result

Homeless = [["NYC Total",totalNYC],["Bronx",totalBronx],["Brooklyn",totalBrooklyn],["Manhattan",totalManhattan],["Queens",totalQueens],["Staten Island",totalStatenIsland]]
dfhomeless2 = pd.DataFrame(Homeless, columns= ['Borough','Total'])
dfhomeless2["Total"]=dfhomeless2["Total"].astype(int)
dfhomeless2.head

<bound method NDFrame.head of          Borough  Total
0      NYC Total  97104
1          Bronx  16875
2       Brooklyn  33832
3      Manhattan  36821
4         Queens   9576
5  Staten Island      0>

Code I used and result

dfCompare = pd.merge(dfpopulation,dfhomeless2[["Borough","Total"]],on="Borough",how="left")
dfCompare.head

<bound method NDFrame.head of             Borough     2020    Total
0         NYC Total  8550971  97104.0
1             Bronx  1446788      NaN
2          Brooklyn  2648452      NaN
3         Manhattan  1638281      NaN
4            Queens  2330295      NaN
5     Staten Island   487155      NaN>

Info of the result

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Borough  6 non-null      object 
 1   2020     6 non-null      int32  
 2   Total    1 non-null      float64
dtypes: float64(1), int32(1), object(1)
memory usage: 168.0+ bytes

Anyone know why part of data become nan and how to solve it?



Solution 1:[1]

I entered your values/code and it works fine. I assume there is a discrepancy in the names of the boroughs in your dataframes. Make sure they match. Because you are using a left join, the new column will include all values from the left dataframe and only the right dataframe ones IF there is a matching borough - otherwise, it returns a NaN, which is what you are seeing. See that the following works on your computer. You could also try how='outer' which will still give you a bunch of NaNs but may help you figure out where your issues are.

A = pd.DataFrame({
    'borough': ['NYC Total', 'Bronx', 'Brooklyn', 'Manhattan', 'Queens','Staten Island'],
    'total' : [8550971,1446788,2648452,1638281,330295, 487155]
})

B = pd.DataFrame({
    'borough': ['NYC Total', 'Bronx', 'Brooklyn', 'Manhattan', 'Queens','Staten Island'],
    '2020' : [97104,16875,33832,36821, 9576,0]
})

pd.merge(left=A, right=B, on='borough', how='left')

    borough         total   2020
0   NYC Total      8550971  97104
1   Bronx          1446788  16875
2   Brooklyn       2648452  33832
3   Manhattan      1638281  36821
4   Queens         330295   9576
5   Staten Island  487155   0

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Coup