'Pandas: Aggregate mean ("totals") for each combination of dimensions

I have a table like this:

gender city age time value
male newyork 10_20y 2010 10.5
female newyork 10_20y 2010 11

I'd like to add all possible means for combinations of dimensions 1-3 to the same table (or a new dataframe that I can concatenate with the original).

The time dimension should not be aggregated. Example of means added for different combinations:

gender city age time value
total total total 2010 (mean)
male total total 2010 (mean)
male newyork total 2010 (mean)
female total total 2010 (mean)
female newyork total 2010 (mean)
... total ... total 10_20y 2010 (mean)


Solution 1:[1]

Using groupby on multiple columns will groupby with all combinations of these columns. So a simple df.groupby(["city", "age"]).mean() will achieve the mean for the "total", "city", "age" combination. The problem here is you want all combinations of all size for the list ["gender", "city", "age"]. It's not straightforward and I think a more pythonic way can be found but here is my proposal :

## create artificial data
cols = ["gender", "city", "age"]
genders = ["male", "female"]
cities = ["newyork", "losangeles", "chicago"]
ages = ["10_20y", "20_30y"]
n = 20
df = pd.DataFrame({"gender" : np.random.choice(genders, n),
                   "city" : np.random.choice(cities, n),
                   "age" : np.random.choice(ages, n),
                   "value": np.random.randint(1, 20, n)})

## the dataframe we will append during the process
new = pd.DataFrame(columns = cols)

## itertools contains the function combinations
import itertools

## list all size combinations possible
for n in range(0, len(cols)+1):
    for i in itertools.combinations(cols, n):

        ## if n > 0, the combinations is not empty, 
        ## so we can directly groupby this sublist and take the mean
        if n != 0:
            agg_df = df.groupby(list(i)).mean().reset_index() 
            ## reset index since for multiple columns, the result will be multiindex

        ## if n=0, we just want to take the mean of the whole dataframe
        else:
            agg_df = pd.DataFrame(columns = cols)
            agg_df.loc[0, "value"] = df.loc[:, "value"].mean() 
            ## a bit ugly since this mean is an integer not a dataframe

        ## from here agg_df will have n+1 columns, 
        ## for instance for ["gender"] we will have only "gender" and "value" columns
        ## "city" and "age" are missing since we aggregate on it
        for j in cols:
            if j not in i:
                agg_df.loc[:, j] = "total" ## adding total as you asked for
        new = new.append(agg_df)

For instance, I find :

new
   gender        city     age      value
0   total       total   total   9.750000
0  female       total   total   8.083333
1    male       total   total  12.250000
0   total     chicago   total   8.000000
1   total  losangeles   total  11.428571
2   total     newyork   total  11.666667
0   total       total  10_20y   8.100000
1   total       total  20_30y  11.400000
0  female     chicago   total   7.333333
1  female  losangeles   total   9.250000
2  female     newyork   total   8.000000
3    male     chicago   total   9.000000
4    male  losangeles   total  14.333333
5    male     newyork   total  19.000000
0  female       total  10_20y   7.333333
1  female       total  20_30y   8.833333
2    male       total  10_20y   9.250000
3    male       total  20_30y  15.250000
0   total     chicago  10_20y   7.285714
1   total     chicago  20_30y   9.666667
2   total  losangeles  10_20y  10.000000
3   total  losangeles  20_30y  12.500000
4   total     newyork  20_30y  11.666667
0  female     chicago  10_20y   7.250000
1  female     chicago  20_30y   7.500000
2  female  losangeles  10_20y   7.500000
3  female  losangeles  20_30y  11.000000
4  female     newyork  20_30y   8.000000
5    male     chicago  10_20y   7.333333
6    male     chicago  20_30y  14.000000
7    male  losangeles  10_20y  15.000000
8    male  losangeles  20_30y  14.000000
9    male     newyork  20_30y  19.000000

It's a base work, I think you can work around

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1