'Pandas: Aggregate mean ("totals") for each combination of dimensions

I have a table like this:

gender	city	age	time	value
male	newyork	10_20y	2010	10.5
female	newyork	10_20y	2010	11

I'd like to add all possible means for combinations of dimensions 1-3 to the same table (or a new dataframe that I can concatenate with the original).

The time dimension should not be aggregated. Example of means added for different combinations:

gender	city	age	time	value
total	total	total	2010	(mean)
male	total	total	2010	(mean)
male	newyork	total	2010	(mean)
female	total	total	2010	(mean)
female	newyork	total	2010	(mean)
... total	... total	10_20y	2010	(mean)

Solution 1:^[1]

Using groupby on multiple columns will groupby with all combinations of these columns. So a simple df.groupby(["city", "age"]).mean() will achieve the mean for the "total", "city", "age" combination. The problem here is you want all combinations of all size for the list ["gender", "city", "age"]. It's not straightforward and I think a more pythonic way can be found but here is my proposal :

## create artificial data
cols = ["gender", "city", "age"]
genders = ["male", "female"]
cities = ["newyork", "losangeles", "chicago"]
ages = ["10_20y", "20_30y"]
n = 20
df = pd.DataFrame({"gender" : np.random.choice(genders, n),
                   "city" : np.random.choice(cities, n),
                   "age" : np.random.choice(ages, n),
                   "value": np.random.randint(1, 20, n)})

## the dataframe we will append during the process
new = pd.DataFrame(columns = cols)

## itertools contains the function combinations
import itertools

## list all size combinations possible
for n in range(0, len(cols)+1):
    for i in itertools.combinations(cols, n):

        ## if n > 0, the combinations is not empty, 
        ## so we can directly groupby this sublist and take the mean
        if n != 0:
            agg_df = df.groupby(list(i)).mean().reset_index() 
            ## reset index since for multiple columns, the result will be multiindex

        ## if n=0, we just want to take the mean of the whole dataframe
        else:
            agg_df = pd.DataFrame(columns = cols)
            agg_df.loc[0, "value"] = df.loc[:, "value"].mean() 
            ## a bit ugly since this mean is an integer not a dataframe

        ## from here agg_df will have n+1 columns, 
        ## for instance for ["gender"] we will have only "gender" and "value" columns
        ## "city" and "age" are missing since we aggregate on it
        for j in cols:
            if j not in i:
                agg_df.loc[:, j] = "total" ## adding total as you asked for
        new = new.append(agg_df)

For instance, I find :

new
   gender        city     age      value
0   total       total   total   9.750000
0  female       total   total   8.083333
1    male       total   total  12.250000
0   total     chicago   total   8.000000
1   total  losangeles   total  11.428571
2   total     newyork   total  11.666667
0   total       total  10_20y   8.100000
1   total       total  20_30y  11.400000
0  female     chicago   total   7.333333
1  female  losangeles   total   9.250000
2  female     newyork   total   8.000000
3    male     chicago   total   9.000000
4    male  losangeles   total  14.333333
5    male     newyork   total  19.000000
0  female       total  10_20y   7.333333
1  female       total  20_30y   8.833333
2    male       total  10_20y   9.250000
3    male       total  20_30y  15.250000
0   total     chicago  10_20y   7.285714
1   total     chicago  20_30y   9.666667
2   total  losangeles  10_20y  10.000000
3   total  losangeles  20_30y  12.500000
4   total     newyork  20_30y  11.666667
0  female     chicago  10_20y   7.250000
1  female     chicago  20_30y   7.500000
2  female  losangeles  10_20y   7.500000
3  female  losangeles  20_30y  11.000000
4  female     newyork  20_30y   8.000000
5    male     chicago  10_20y   7.333333
6    male     chicago  20_30y  14.000000
7    male  losangeles  10_20y  15.000000
8    male  losangeles  20_30y  14.000000
9    male     newyork  20_30y  19.000000

It's a base work, I think you can work around

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1

'Pandas: Aggregate mean ("totals") for each combination of dimensions

Solution 1:[1]

Sources

Related Questions

Solution 1:^[1]