How do I calculate the daily average across various classes of data in Pandas?

I have data of the following format:

   station_number                date  river_height  river_flow
0               1 2005-01-01 08:09:00      0.285233    0.782065
1               1 2005-01-01 11:28:12      0.129994    0.386652
2               4 2005-01-01 17:33:36      0.457168    0.167025
3               2 2005-01-01 23:21:00      0.359086    0.851716
4               4 2005-01-02 04:18:36      0.332998    0.830749
5               1 2005-01-02 09:28:12      0.867262    0.855507
6               3 2005-01-02 13:15:36      0.352409    0.023737
7               2 2005-01-02 17:31:12      0.696562    0.846762
8               1 2005-01-02 21:15:36      0.910944    0.096999
9               4 2005-01-03 02:13:12      0.981430    0.152109

I need to calculate a daily average of the river height and river flow per unique station number, so as a result something like this:

   station_number       date      river_height  river_flow
0               1 2005-01-01      0.285         0.782
1               1 2005-01-02      0.233         0.753
2               2 2005-01-01      0.129         0.386
3               2 2005-01-02      0.994         0.386
4               3 2005-01-01      0.457         0.167
5               3 2005-01-02      0.168         0.134
6               4 2005-01-01      0.356         0.321
7               4 2005-01-02      0.086         0.716

Keep in mind that the above numbers are random, and not actually the averages I'm looking for. I need an entry for each day for each station. I hope I have clarified what I need!

I have tried aggregating using groupby such as below:

monthly_flow_data_mean = df.groupby(pd.PeriodIndex(df['date'], freq="M"))['river_flow'].mean()

But this averages all river_flow measurements per month without taking the station numbers into account. I have had trouble finding the combination of groupby and aggregations that does what I need.

I tried this as well:

daily_flow_df = df.groupby(pd.PeriodIndex(df['date'], freq="D")).agg({"river_flow": "mean", "river_height": "mean", "station_number": "first"})

But I am pretty sure this doesn't work either: station_number is not used as a grouping key here. It is just another aggregated column, so the means are still computed over the measurements of all stations for each day.

I could also split the dataframe into 4 classes, aggregate each one separately, and merge the results back together. But I am wondering if there is some smart little groupby trick that achieves this in fewer lines, as it will be useful later in my project(s) as well, where I might have many more classes in the data.



Solution 1:[1]

You can group by 'station_number' together with the calendar day of the 'date' column, using either pd.Grouper or dt.normalize:

df.groupby(['station_number', pd.Grouper(key='date', freq='D')]).mean()  

or

df.groupby(['station_number', df['date'].dt.normalize()]).mean() 
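As a minimal sketch of how this plays out on the sample rows from the question: the DataFrame construction, the explicit selection of the two value columns, and the reset_index() call are additions for illustration (reset_index() turns the group keys back into ordinary columns, matching the layout asked for in the question). Note that only station/day combinations that actually occur in the data appear in the result.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "station_number": [1, 1, 4, 2, 4, 1, 3, 2, 1, 4],
    "date": pd.to_datetime([
        "2005-01-01 08:09:00", "2005-01-01 11:28:12", "2005-01-01 17:33:36",
        "2005-01-01 23:21:00", "2005-01-02 04:18:36", "2005-01-02 09:28:12",
        "2005-01-02 13:15:36", "2005-01-02 17:31:12", "2005-01-02 21:15:36",
        "2005-01-03 02:13:12",
    ]),
    "river_height": [0.285233, 0.129994, 0.457168, 0.359086, 0.332998,
                     0.867262, 0.352409, 0.696562, 0.910944, 0.981430],
    "river_flow": [0.782065, 0.386652, 0.167025, 0.851716, 0.830749,
                   0.855507, 0.023737, 0.846762, 0.096999, 0.152109],
})

value_cols = ["river_height", "river_flow"]

# Group by station and by calendar day, then average the measurements.
daily = (
    df.groupby(["station_number", pd.Grouper(key="date", freq="D")])[value_cols]
      .mean()
      .reset_index()  # turn the group keys back into regular columns
)

# Same grouping, using timestamps truncated to midnight as the second key.
daily_alt = (
    df.groupby(["station_number", df["date"].dt.normalize()])[value_cols]
      .mean()
      .reset_index()
)

print(daily)
```

Changing freq to "M" (or "ME" in recent pandas versions) in the pd.Grouper variant should give the monthly averages per station in the same way.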

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Scott Boston