'Pandas groupby and filtering groups with rows greater than N rows
I have a pandas df of the following format
STOCK YR MONTH DAY PRICE
AAA 2022 1 1 10
AAA 2022 1 2 11
AAA 2022 1 3 10
AAA 2022 1 4 15
AAA 2022 1 5 10
BBB 2022 1 1 5
BBB 2022 1 2 10
BBB 2022 2 1 10
BBB 2022 2 2 15
What I am looking to do is to filter this df such that I am grouping by STOCK and YR and MONTH and selecting the groups with 3 or more entries.
So the resulting df looks like
STOCK YR MONTH DAY PRICE
AAA 2022 1 1 10
AAA 2022 1 2 11
AAA 2022 1 3 10
AAA 2022 1 4 15
AAA 2022 1 5 10
Note that BBB is eliminated as it had only 2 rows in each group, when grouped by STOCK, YR and MONTH
I have tried df.groupby(['STOCK','YR','MONTH']).filter(lambda x: x.STOCK.nunique() > 5) but this resulted in an empty frame.
Also tried df.groupby(['STOCK','YR','MONTH']).filter(lambda x: x['STOCK','YR','MONTH'].nunique() > 5) but this resulted in a KeyError: ('STOCK', 'YR', 'MONTH')
Thanks!
Solution 1:[1]
Use GroupBy.transform('count'):
df[df.groupby(['STOCK', 'YR', 'MONTH'])['STOCK'].transform('count').ge(3)]
or 'size':
df[df.groupby(['STOCK', 'YR', 'MONTH'])['STOCK'].transform('size').ge(3)]
output:
STOCK YR MONTH DAY PRICE
0 AAA 2022 1 1 10
1 AAA 2022 1 2 11
2 AAA 2022 1 3 10
3 AAA 2022 1 4 15
4 AAA 2022 1 5 10
Solution 2:[2]
Use GroupBy.transform:
If need counts (not exclude possible NaNs):
#if need test number of unique values
df[df.groupby(['STOCK', 'YR', 'MONTH'])['STOCK'].transform('size').gt(3)]
Or:
#in large df should be slow
df.groupby(['STOCK','YR','MONTH']).filter(lambda x: len(x) > 3)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | mozway |
| Solution 2 |
