'merge two datasets to find a mean
I have two similar looking tables: df1:
country type mean count last_checked_date
Brazil Weather x 2 2022-02-13
Brazil Corona y 3 2022-02-13
China Corona z 1 2022-02-13
China Fruits s 2 2022-02-13
df2
country type mean count last_checked_date
Ghana Weather a 2 2022-02-13
Brazil Corona b 5 2022-02-13
China Corona c 1 2022-02-13
Germany Fruits d 2 2022-02-13
I want to join df2 with df1 such that no combination of country, type is lost. For each combination of country and type, I want to calculate a mean value with this formula:
df find_new_values(old_mean, new_mean, old_count, new_count):
mean = (old_mean + new_mean)/(old_count+new_count)
count = old_count+new_count
return mean, count
For example, in df2, China, Corona is present in df1 as well so the mean would be (c+z)/(1+1)
However, Ghana, Weather is present in df2 but not in df1 so in this case, I want to simply add a row to df1 as it is without the formula calculation.
How can I achieve this? What's the correct join/merge type to use here?
Solution 1:[1]
We may consider the problem this way, we combine them into one table,
df = pd.concat([df1, df2])
then use groupby to apply aggregations on each group of the rows that share the same country and type.
df.groupby(['country', 'type']).agg({'mean': 'mean', 'count': 'sum'})
For country-type combination that only occur once in one of the dataframe, the corresponding group will only discover one row and the aggregation functions won't change anything.
You may add 'last_checked_date': 'last' to the list of agg if needed.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Raymond Kwok |
