Pandas Groupby with Aggregate and Quantiles

I am attempting to group a pandas DataFrame and calculate quantiles and other aggregates from its columns.

Here's a sample DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({
                   'id': [1, 1, 1, 2],
                   'cat': ['p','p','p','n'],
                   'num': [5, 10, 10, 5],
                   'v': [np.nan, np.nan, np.nan, 'v2'],
                   'p': [1000, 1300, 1400, 1100]
                 })


I am looking for a solution that can scale to any number of categorical and numeric columns. For the numeric and categorical columns, aggregate using the mode.

From p, create two new columns: one with the range between the .25 and .75 quantiles, and one with the min and max.

Expected output:

id  cat  num  v    pquantile    min-max
1   p    10   NaN  1075 - 1325  1000 - 1400
2   n    5    v2   1100         1100

Also, the mode aggregation needs to be able to handle ties.



Solution 1:[1]

As outlined in your question, first group by "cat" and use the agg method with pd.Series.mode to select the most common value in each remaining column.

df_grouped_by = df.groupby('cat').agg(pd.Series.mode)
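
Note that pd.Series.mode already handles ties in one sense: when two values are tied it returns all of them (sorted ascending) as an array, and it returns an empty array for an all-NaN group such as v. If you would rather get a single value per cell, a small wrapper is one option. The sketch below is not part of the original answer; first_mode and agg_spec are illustrative names, and p is deliberately aggregated with list so the quantile and min-max steps that follow still see every value in the group.

def first_mode(s):
    # pd.Series.mode returns every tied value sorted ascending, and an
    # empty Series when the group is all NaN; reduce that to a single value.
    m = s.mode()
    return m.iloc[0] if not m.empty else np.nan

# Tie-breaking mode for every column except 'p'; keep all 'p' values as a list.
agg_spec = {col: first_mode for col in df.columns if col not in ('cat', 'p')}
agg_spec['p'] = list
df_grouped_by = df.groupby('cat').agg(agg_spec)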

Then compute the 0.25 and 0.75 quantiles of the collected p values in each row:

df_grouped_by['pquantile'] = df_grouped_by.apply(lambda row: np.quantile(row['p'], [0.25, 0.75]), axis=1)

Finally, compute the min and max values with similar logic:

df_grouped_by['min-max'] = df_grouped_by.apply(lambda row: [np.min(row['p']), np.max(row['p'])], axis=1)

This yields:

cat id  num v   p                   pquantile           min-max                         
n   2   5   v2  1100                [1100.0, 1100.0]    [1100, 1100]
p   1   10  []  [1000, 1300, 1400]  [1150.0, 1350.0]    [1000, 1400]

You may then reindex as you see fit and drop the p column.
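
If you want the frame shaped like the expected output in the question (indexed by id, with the two range columns rendered as "low - high" strings), something along these lines would do it. This is an illustrative sketch, not part of the original answer; as_range and result are made-up names.

def as_range(pair):
    # Render a (low, high) pair as "low - high", collapsing to a single
    # number when both bounds are equal (as for the one-row 'n' group).
    lo, hi = pair
    return f"{lo:g}" if lo == hi else f"{lo:g} - {hi:g}"

result = (
    df_grouped_by
    .assign(pquantile=df_grouped_by['pquantile'].map(as_range),
            **{'min-max': df_grouped_by['min-max'].map(as_range)})
    .drop(columns='p')      # drop the raw list of prices
    .reset_index()          # move 'cat' back to a regular column
    .set_index('id')        # index by 'id' as in the expected output
)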

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Sheldon