How to compute a groupby average with multiple conditions in Python Pandas
The objective is to calculate the average of a subset of columns based on multiple conditions in a multi-index dataframe.
The first condition is to average by the first level of the multi-index.
The second condition is to average according to the dict_ref mapping below:
dict_ref = dict(occ=['F2', 'F4'], gr=['Fp1', 'Fpx'])
For example, for the key occ, compute the average of the F2 and F4 columns.
The code below does the job:
import re

import numpy as np
import pandas as pd

np.random.seed(0)

dict_ref = dict(occ=['F2', 'F4'], gr=['Fp1', 'Fpx'])
_names = ['pow_fr'] * 8 + ['hjor_com'] * 4
_idx = ['Fp1_band0', 'Fp1_band1', 'Fpx_band0', 'Fpx_band1',
        'F2_band0', 'F2_band1', 'F4_band0', 'F4_band1',
        'Fp1', 'Fpx', 'F2', 'F4']
X = np.random.rand(4, len(_names))
columns = pd.MultiIndex.from_arrays([_names, _idx])
df = pd.DataFrame(data=X, columns=columns)

# Split the second column level on '_' and pad the tuples to equal length
remove_nan = [(e[0], *re.split('_', e[1])) for e in df.columns]
remove_nan = [t + ('',) * (len(max(remove_nan, key=len)) - len(t)) for t in remove_nan]
df.columns = pd.MultiIndex.from_tuples(remove_nan)
df = df.T.reset_index().rename(columns={"level_0": "group_feature",
                                        "level_1": "ch",
                                        "level_2": "feature",
                                        "level_3": "region"})

# Average each group of channels defined in dict_ref
all_df = []
for nref in dict_ref:
    df_ch = df[df.ch.isin(dict_ref[nref])].groupby(["group_feature", "feature"]).mean().reset_index()
    df_ch['ch'] = nref
    all_df.append(df_ch)

df1 = pd.concat([df, *all_df]).pivot_table(index=['group_feature', 'ch', 'feature']).transpose()
df1.columns = [(gf[0], f'{gf[1]}' if not gf[-1] else f'{gf[1]}_{gf[-1]}')
               for gf in df1.columns.values.tolist()]
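For reference, the last line above flattens the three-level column tuples: when the last level is empty it is dropped, otherwise it is joined to the second level with an underscore. A minimal stand-alone sketch of just that step (with two sample tuples, not the full frame):

```python
# Stand-alone illustration of the column-flattening step above:
# a tuple with an empty last level collapses to two levels, otherwise
# the second and last levels are joined with '_'.
cols = [('hjor_com', 'F2', ''), ('pow_fr', 'occ', 'band0')]
flat = [(c[0], c[1] if not c[-1] else f'{c[1]}_{c[-1]}') for c in cols]
print(flat)  # [('hjor_com', 'F2'), ('pow_fr', 'occ_band0')]
```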
However, I wonder whether there is a way to avoid the for-loop:
all_df = []
for nref in dict_ref:
    df_ch = df[df.ch.isin(dict_ref[nref])].groupby(["group_feature", "feature"]).mean().reset_index()
    df_ch['ch'] = nref
    all_df.append(df_ch)
Not as important as the question above, but it would be a bonus if there were a way to avoid the following lines altogether:
remove_nan = [(e[0], *re.split('_', e[1])) for e in df.columns]
remove_nan = [t + ('',) * (len(max(remove_nan, key=len)) - len(t)) for t in remove_nan]
df.columns = pd.MultiIndex.from_tuples(remove_nan)
df = df.T.reset_index().rename(columns={"level_0": "group_feature",
                                        "level_1": "ch",
                                        "level_2": "feature",
                                        "level_3": "region"})
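These lines split the second column level on '_' and then right-pad the shorter tuples with empty strings so every tuple has the length of the longest one. A minimal stand-alone sketch of that transformation on two sample tuples:

```python
import re

# Split the second element of each tuple on '_', then right-pad with ''
# so all tuples share the length of the longest one.
cols = [('pow_fr', 'Fp1_band0'), ('hjor_com', 'Fp1')]
split_cols = [(c[0], *re.split('_', c[1])) for c in cols]
width = len(max(split_cols, key=len))
padded = [t + ('',) * (width - len(t)) for t in split_cols]
print(padded)  # [('pow_fr', 'Fp1', 'band0'), ('hjor_com', 'Fp1', '')]
```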
Expected Output
(hjor_com, F2) (hjor_com, F4) ... (pow_fr, occ_band0) (pow_fr, occ_band1)
0 0.791725 0.528895 ... 0.430621 0.768834
1 0.461479 0.780529 ... 0.399188 0.851316
2 0.018790 0.617635 ... 0.393202 0.594448
3 0.210383 0.128926 ... 0.528570 0.248629
[4 rows x 18 columns]
Solution 1:[1]
You can flip dict_ref so each item in the value list becomes a key, perform a replacement on the ch column, and group by the new ch:
# Invert dict_ref: each channel name maps to its group key
mapping = {v: key for key, value in dict_ref.items() for v in value}
all_df = df.replace({"ch": mapping}).groupby(["group_feature", "feature", "ch"]).mean().reset_index()
df1 = pd.concat([df, all_df])...
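To make the inversion concrete, here is what the flipped mapping looks like as a stand-alone sketch (the trailing `...` in the answer elides the final concat/pivot step):

```python
dict_ref = dict(occ=['F2', 'F4'], gr=['Fp1', 'Fpx'])
# Each channel name now maps back to its group key
mapping = {v: key for key, value in dict_ref.items() for v in value}
print(mapping)  # {'F2': 'occ', 'F4': 'occ', 'Fp1': 'gr', 'Fpx': 'gr'}
```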
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Code Different |
