'How to Merge two datasets with different indexes but one common ID factor?
I am working with two distinct datasets: one regarding COVID-19 statistics and one with demographic characteristics of a city.
The covid19 one, namely covid.df looks as follows:
Note: Date, City ID, City, and State are all indexes
| Date | City ID | City | State | Population mean | Population_2019 mean | Confirmed_rate_100k mean | Confirmed_rate_100k std | death_rate mean | death_rate std | new_confirmed | new_deaths |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-02 | 120385 | Los Angeles | CA | 9559699 | 45959669 | 0.653 | 0.556 | 0.6 | 0.01 | 33 | 5 |
| 2020-02 | 120054 | Houtson | Texas | 3304040 | 3343560 | 0.543 | 0.043 | 22.34 | 1.6 | 60 | 9 |
| ... | ... | .... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2022-05 | 120385 | Los Angeles | CA | 9559483 | 45966549 | 0.672 | 0.032 | 2.3 | 0.5 | 22 | 12 |
The one with demographic information, demo.df includes the following
| City ID | HDI | Education | Mobility | Poverty |
|---|---|---|---|---|
| 120385 | 0.54 | 72.5 | 55.522 | 33.21 |
| 120054 | 0.33 | 66.2 | 76.433 | 12.504 |
I would like to include the information from demo.df on covid.df, however, given the different indexes for the two datasets, the concat() function has been giving me a hard time.
How can I merge the two such datasets such that covid.df looks like this:
| Date | City ID | City | State | HDI | Education | Mobility | Poverty | Population mean | Population_2019 mean | Confirmed_rate_100k mean | Confirmed_rate_100k std | death_rate mean | death_rate std | new_confirmed | new_deaths |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-02 | 120385 | Los Angeles | CA | 0.54 | 72.5 | 55.522 | 33.21 | 9559699 | 45959669 | 0.653 | 0.556 | 0.6 | 0.01 | 33 | 5 |
| 2020-02 | 120054 | Houston | TX | 0.33 | 66.2 | 76.433 | 12.504 | 3304040 | 3343560 | 0.543 | 0.043 | 22.34 | 1.6 | 60 | 9 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2022-05 | 120385 | Los Angeles | CA | 0.54 | 72.5 | 55.522 | 33.21 | 9559483 | 45966549 | 0.672 | 0.032 | 2.3 | 0.5 | 22 | 12 |
Thank you!
Solution 1:[1]
You only need this:
covid = covid.merge(demo, how='left', on='City ID')
For example, suppose we have this input (notice the different indexes of 88, 99 and 'fish', 'fowl'):
covid.df:
Date City ID City State Population mean Population_2019 mean Confirmed_rate_100k mean Confirmed_rate_100k std death_rate mean death_rate std new_confirmed new_deaths
88 2020-02 120385 Los Angeles CA 9559699 45959669 0.653 0.556 0.60 0.01 33 5
99 2020-02 120054 Houtson Texas 3304040 3343560 0.543 0.043 22.34 1.60 60 9
demo.df:
City ID HDI Education Mobility Poverty
fish 120385 0.54 72.5 55.522 33.210
fowl 120054 0.33 66.2 76.433 12.50
The output will be
Date City ID City State Population mean Population_2019 mean Confirmed_rate_100k mean ... death_rate std new_confirmed new_deaths HDI Education Mobility Poverty
0 2020-02 120385 Los Angeles CA 9559699 45959669 0.653 ... 0.01 33 5 0.54 72.5 55.522 33.210
1 2020-02 120054 Houtson Texas 3304040 3343560 0.543 ... 1.60 60 9 0.33 66.2 76.433 12.504
[2 rows x 16 columns]
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | constantstranger |
