'merge two datasets to find a mean

I have two similar looking tables: df1:

country    type       mean     count     last_checked_date
Brazil     Weather     x       2         2022-02-13    
Brazil     Corona      y       3         2022-02-13
China      Corona      z       1         2022-02-13
China      Fruits      s       2         2022-02-13

df2

country    type       mean     count     last_checked_date
Ghana      Weather     a       2         2022-02-13    
Brazil     Corona      b       5          2022-02-13
China      Corona      c       1         2022-02-13
Germany    Fruits      d       2         2022-02-13

I want to join df2 with df1 such that no combination of country, type is lost. For each combination of country and type, I want to calculate a mean value with this formula:

df find_new_values(old_mean, new_mean, old_count, new_count):
    mean = (old_mean + new_mean)/(old_count+new_count)
    count = old_count+new_count
    return mean, count 

For example, in df2, China, Corona is present in df1 as well so the mean would be (c+z)/(1+1)

However, Ghana, Weather is present in df2 but not in df1 so in this case, I want to simply add a row to df1 as it is without the formula calculation.

How can I achieve this? What's the correct join/merge type to use here?



Solution 1:[1]

We may consider the problem this way, we combine them into one table,

df = pd.concat([df1, df2])

then use groupby to apply aggregations on each group of the rows that share the same country and type.

df.groupby(['country', 'type']).agg({'mean': 'mean', 'count': 'sum'})

For country-type combination that only occur once in one of the dataframe, the corresponding group will only discover one row and the aggregation functions won't change anything.

You may add 'last_checked_date': 'last' to the list of agg if needed.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Raymond Kwok