'How to get mean and std from a dictionary of dataframe per each key
Here is my dilemma:
I got a dictionary of dataframes like this:
dict_df[key]
m1 m2 m3 m4 m5 m6
10410 5 10 21 33 11
15387 3 10 33 45 13
19026 4 16 27 40 11
26083 5 21 16 29 9
27806 4 17 23 31 7
43820 2 12 27 40 18
49199 7 22 30 38 11
50094 4 9 13 18 4
Per each key, it returns a DF with the same column names.
For each key, I need to store the mean and standard deviation of a set of features (let's take for example m2, m3, m4).
In the end, I want to obtain something like this df below (the numbers are totally random):
key m2_mean m2_std m3_mean m3_std m4_mean m4_std
key1 12 55 793 438 44 95
key2 14 442 21 43 14 442
key3 44 1 66 11 42 42
key4 42 42 2 23 98 70
The dataset is not that big, so even if the code is slow should be fine.
Thanks for the help and have a good one!
Solution 1:[1]
Start with creating a dictionary for the aggregates, in the form of new_col_name: (dataframe_column_name, aggregare_function):
required_aggs = {f'{col_name}_{agg_name}': (col_name, agg_name)
for agg_name in ['mean', 'std']
for col_name in df}
Above gives following dictionary of the required aggregates, and the column names.
{
'm1_mean': ('m1', 'mean'),
'm1_std': ('m1', 'std'),
'm2_mean': ('m2', 'mean'),
'm2_std': ('m2', 'std'),
'm3_mean': ('m3', 'mean'),
'm3_std': ('m3', 'std'),
'm4_mean': ('m4', 'mean'),
'm4_std': ('m4', 'std'),
'm5_mean': ('m5', 'mean'),
'm5_std': ('m5', 'std'),
'm6_mean': ('m6', 'mean'),
'm6_std': ('m6', 'std')
}
Now, for each dataframe, assign the key column with the value of the key, then group by this column, and call .agg on groupby object passing unpacked version of above aggregate dictionary:
>>> df.assign(key=123).groupby('key').agg(**required_aggs)
m1_mean m2_mean m3_mean ... m4_std m5_std m6_std
key ...
123 30228.125 4.25 14.625 ... 6.860862 8.447316 4.140393
PS: In above snippet, df is a single dataframe in dict_df[key], and value of this key is assumed to be 123. You need to do the same for each dataframes in the dictionary. Also, if you need to calculate aggregate of specific columns only, you just need to replace df in for col_name in df by list of columns e.g.: for col_name in ['m1', 'm2']
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | ThePyGuy |
