'check to see if first row of a group consist of said groups greatest value in pandas and subset data as such
So let's say I have the following df
pd.DataFrame(
{
'store':['a','a','a','a','b','b','b','b','c','c','c','c'],
'avg_sales':[100,20,30,25,50,60,70,80,90,60,40,50],
'month':[0,1,2,3,0,1,2,3,0,1,2,3]
})
store avg_sales month
0 a 100 0
1 a 20 1
2 a 30 2
3 a 25 3
4 b 50 0
5 b 60 1
6 b 70 2
7 b 80 3
8 c 90 0
9 c 60 1
10 c 40 2
11 c 50 3
What I am trying to do is find all stores where their first month (month==0) is the greatest avg_sales value. Thus, my desired output is to just keep store a and c since their stores had the greatest avg_sale on month==0:
store avg_sales month
0 a 100 0
1 a 20 1
2 a 30 2
3 a 25 3
8 c 90 0
9 c 60 1
10 c 40 2
11 c 50 3
I actually started out with a datetime for each month and used the following code to get the cumcounts:
df['month'] = df.sort_values(by=['store','date']).groupby('store').cumcount()
Now I just need to figure out the logic that only keeps stores where their month==0 is their greatest avg_sales number. Obviously you can do something simple like .loc[], but it seems like you need to do another subset within the loc to find out if month==0 is actually the greatest sales number for each group? I'm a bit confused. I could just use np.where() but that wouldn't account for the multiple groups out-of-the-box...
I can use the following code to make a column that consists of the max avg_sales number for each store:
df['max_sales'] = df.groupby('store').avg_sales.transform('max')
store avg_sales month max_sales
0 a 100 0 100
1 a 20 1 100
2 a 30 2 100
3 a 25 3 100
4 b 50 0 80
5 b 60 1 80
6 b 70 2 80
7 b 80 3 80
8 c 90 0 90
9 c 60 1 90
10 c 40 2 90
11 c 50 3 90
Now I am stuck on the last step of the logic.
EDIT: just fixing grammar
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
