'Pulling start date, end date, and mean quantity for unbalanced dataset
I have a dataset (seen in the image) that consists of cities (column "IBGE"), dates, and quantities (column "QTD"). I am trying to extract three things into a new column: start date per "IBGE", end date per "IBGE", and mean per "code".
Also, before doing so, should I change the index of my dataset?
The panel data is unbalanced, so different "IBGE" values have different start and end dates, and mean. How could I go about creating a new data frame with the following information separated in columns? I want the dataframe to look like this:
| CODE | Start | End | Mean QTD |
|---|---|---|---|
| 10001 | 2020-01-01 | 2022-01-01 | 604 |
| 10002 | 2019-09-01 | 2021-10-01 | 1008 |
| 10003 | 2019-02-01 | 2020-12-01 | 568 |
| 10004 | 2020-03-01 | 2021-05-01 | 223 |
| ... | ... | ... | ... |
| 99999 | 2020-02-01 | 2022-04-01 | 9394 |

I am thinking that maybe a for while loop could potentially take that info, but I am not sure how to write the code.
Solution 1:[1]
Try with groupby and named aggregations:
#convert DATE column to datetime if needed
df["DATE"] = pd.to_datetime(df["DATE"])
output = df.groupby("IBGE").agg(Start=("DATE","min"),
End=("DATE","max"),
Mean_QTD=("QTD","mean"))
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | not_speshal |
