'Groupby by sum of revenue and the corresponding highest contributing month - Pandas
I have a bill details data set and I want to do a groupby of the products based on the sum of their Total value, additionally i want a column which indicates the month which has produced the most revenue for the corresponding product
Data set:
Bill_Id Month Product_Id Net_Value
1 1 20 100
2 1 20 100
3 2 20 100
4 1 30 200
5 2 30 200
6 2 30 200
Desired_Result
Product_Id Total_revenue Top_Month
20 300 1
30 600 2
This just a sample dataset I have the transaction data of the entire year
Solution 1:[1]
Assuming that your only 1 Top month value is needed based on the maximum sum of Net_Revenue, Below is the code that might work for you.
We can achieve this in 3 stages as mentioned below:
1. Extracting the sum of net revenue based on product id
df_1 = df.groupby(['Product_Id']).agg({'Net_Value' : sum}).reset_index()
df_1 = df_1.rename(columns={'Net_Value' : 'Total_revenue'})
print(df_1)
Product_Id Total_revenue
0 20 300
1 30 600
2. Extracting the best contibuting month based on max sum net revenue for each product id
df_2 = df.groupby(['Product_Id', 'Month']).agg({'Net_Value' : sum}).sort_values('Net_Value', ascending=False).reset_index()
df_2 = df_2.drop_duplicates(subset=['Product_Id'])[['Product_Id', 'Month']]
print(df_2)
Product_Id Month
0 30 2
1 20 1
3. Final step is to merge this both dataframes into single based on product id
final_df = df_1.merge(df_2)
print(final_df)
Product_Id Total_revenue Month
0 20 300 1
1 30 600 2
Please do upvote the solution if it helps :)
Solution 2:[2]
Small modification over @Shubham's approach
result = (
df.pivot_table("Net_Value", "Product_Id", "Month", aggfunc="sum")
.agg(["sum", "idxmax"], axis=1)
.set_axis(["Total_revenue", "Top_Month"], axis=1)
)
Solution 3:[3]
As multiple columns are being interacted, I have used the apply function in addition to groupby:
Net_value is calucated using basic aggregate function sum
Top_month required interaction between columns so, first get the index of max Net_value using idxmax then using loc to find the month
The resultant Pandas Series object has the groupby column (Product_id) as index, so it make it a column I have used reset_index
def f(x):
d = {}
d['Net_Value'] = x['Net_Value'].sum()
d['Top_month'] = df.loc[x['Net_Value'].idxmax(), "Month"]
return pd.Series(d, index=['Net_Value', 'Top_month'])
df.groupby('Product_Id').apply(f).reset_index()
# Output
Product_Id Net_Value Top_month
0 20 300 1
1 30 600 2
Check out this amazing answer which helped me and can help you in the future as well.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | Mark Wang |
| Solution 3 | Uchiha012 |
