'Groupby hours +/- some integer of additional hours
I have a data frame consisting of some columns, where the index is datetime
, i.e. it looks something like:
df =
col1 col2 col3 col4 col5
date
2021-01-01 00:00:00+01:00 55.646134 1.00 -3890.12 2437.47 4.716594
2021-01-01 01:00:00+01:00 55.646134 2.00 -3786.65 2908.26 4.716594
2021-01-01 02:00:00+01:00 55.646134 3.00 -3694.92 3123.63 4.716594
2021-01-01 03:00:00+01:00 55.646134 4.00 -3619.17 3251.61 4.716594
2021-01-02 00:00:00+01:00 179.775434 -743.27 1352.90 1647.35 81.397542
2021-01-02 01:00:00+01:00 179.775434 -1184.25 752.35 2081.33 81.397542
2021-01-02 02:00:00+01:00 179.775434 -1448.93 -76.22 2306.36 81.397542
2021-01-02 03:00:00+01:00 179.775434 -1465.47 -1016.00 2091.12 81.397542
....
Now if I do something like:
df_grouped = list(df.groupby(df.index.hour))
Then I get a list of data frames where each group has either hour 00:00:00+01:00
, 01:00:00+01:00
, 02:00:00+01:00
etc. Hence 24 different groups in my case.
However, what I would like is to get them grouped by the hour, but ALSO including some integer of the previous and next hour, i.e. the resulting groups for the df
above would instead of:
df_group1 =
col1 col2 col3 col4 col5
date
2021-01-01 00:00:00+01:00 55.646134 1.00 -3890.12 2437.47 4.716594
2021-01-02 00:00:00+01:00 179.775434 -743.27 1352.90 1647.35 81.397542
df_group2 =
col1 col2 col3 col4 col5
date
2021-01-01 01:00:00+01:00 55.646134 2.00 -3786.65 2908.26 4.716594
2021-01-02 01:00:00+01:00 179.775434 -1184.25 752.35 2081.33 81.397542
df_group3 =
col1 col2 col3 col4 col5
date
2021-01-01 02:00:00+01:00 55.646134 3.00 -3694.92 3123.63 4.716594
2021-01-02 02:00:00+01:00 179.775434 -1448.93 -76.22 2306.36 81.397542
df_group4 =
col1 col2 col3 col4 col5
date
2021-01-01 03:00:00+01:00 55.646134 4.00 -3619.17 3251.61 4.716594
2021-01-02 03:00:00+01:00 179.775434 -1465.47 -1016.00 2091.12 81.397542
Become:
df_group1_true =
col1 col2 col3 col4 col5
date
2021-01-01 00:00:00+01:00 55.646134 1.00 -3890.12 2437.47 4.716594
2021-01-01 01:00:00+01:00 55.646134 2.00 -3786.65 2908.26 4.716594
2021-01-02 00:00:00+01:00 179.775434 -743.27 1352.90 1647.35 81.397542
2021-01-02 01:00:00+01:00 179.775434 -1184.25 752.35 2081.33 81.397542
df_group2_true =
col1 col2 col3 col4 col5
date
2021-01-01 00:00:00+01:00 55.646134 1.00 -3890.12 2437.47 4.716594
2021-01-01 01:00:00+01:00 55.646134 2.00 -3786.65 2908.26 4.716594
2021-01-01 02:00:00+01:00 55.646134 3.00 -3694.92 3123.63 4.716594
2021-01-02 00:00:00+01:00 179.775434 -743.27 1352.90 1647.35 81.397542
2021-01-02 01:00:00+01:00 179.775434 -1184.25 752.35 2081.33 81.397542
2021-01-02 02:00:00+01:00 179.775434 -1448.93 -76.22 2306.36 81.397542
df_group3_true =
col1 col2 col3 col4 col5
date
2021-01-01 01:00:00+01:00 55.646134 2.00 -3786.65 2908.26 4.716594
2021-01-01 02:00:00+01:00 55.646134 3.00 -3694.92 3123.63 4.716594
2021-01-01 03:00:00+01:00 55.646134 4.00 -3619.17 3251.61 4.716594
2021-01-02 01:00:00+01:00 179.775434 -1184.25 752.35 2081.33 81.397542
2021-01-02 02:00:00+01:00 179.775434 -1448.93 -76.22 2306.36 81.397542
2021-01-02 03:00:00+01:00 179.775434 -1465.47 -1016.00 2091.12 81.397542
df_group4_true =
col1 col2 col3 col4 col5
date
2021-01-01 02:00:00+01:00 55.646134 3.00 -3694.92 3123.63 4.716594
2021-01-01 03:00:00+01:00 55.646134 4.00 -3619.17 3251.61 4.716594
2021-01-02 02:00:00+01:00 179.775434 -1448.93 -76.22 2306.36 81.397542
2021-01-02 03:00:00+01:00 179.775434 -1465.47 -1016.00 2091.12 81.397542
In this example data set here there are no 23:00:00+01:00
, but obviously if you were on 00:00:00+01:00
, the previous datetime to include should be 23:00:00+01:00
and so forth.
The way I do it now is actually by doing df_grouped = list(df.groupby(df.index.hour))
which will get me 24 data frames (one for each hour), and then something like:
additional_hours = 1 # How many +/- hours I want to expand each group with.
df_list_final = [
pd.concat(
[df[(i + j) % 24][1] for j in range(-additional_hours, additional_hours + 1)]
).sort_index()
for i in range(len(df))
]
And this works. However, I was wondering if there were a more elegant way of doing it with just some kind of groupby
instead of having to concatenating a bunch of data frames all the time.
Solution 1:[1]
When I try to run your code, It raises an exception, so I can be sure of the output you expect, but here is my take on your question anyway, in hope it might be helpful.
With the following toy dataframe:
import pandas as pd
df = pd.DataFrame(
[
("2021-01-01 00:00:00", 55.646134, 1.00000e00, -3890.12, 2437.47, 4.716594),
("2021-01-01 01:00:00", 55.646134, 2.00000e00, -3786.65, 2908.26, 4.716594),
("2021-01-01 02:00:00", 55.646134, 3.00000e00, -3694.92, 3123.63, 4.716594),
("2021-01-01 03:00:00", 55.646134, 4.00000e00, -3619.17, 3251.61, 4.716594),
("2021-01-02 00:00:00", 179.775434, -7.43270e02, 1352.9, 1647.35, 81.397542),
("2021-01-02 01:00:00", 179.775434, -1.18425e03, 752.35, 2081.33, 81.397542),
("2021-01-02 02:00:00", 179.775434, -1.44893e03, -76.22, 2306.36, 81.397542),
("2021-01-02 03:00:00", 179.775434, -1.46547e03, -1016.0, 2091.12, 81.397542),
],
columns=["index", "col1", "col2", "col3", "col4", "col5"],
).set_index("index")
df.index = pd.to_datetime(df.index, format="%Y-%m-%d %H:%M:%S")
print(df)
# Output
col1 col2 col3 col4 col5
index
2021-01-01 00:00:00 55.646134 1.00 -3890.12 2437.47 4.716594
2021-01-01 01:00:00 55.646134 2.00 -3786.65 2908.26 4.716594
2021-01-01 02:00:00 55.646134 3.00 -3694.92 3123.63 4.716594
2021-01-01 03:00:00 55.646134 4.00 -3619.17 3251.61 4.716594
2021-01-02 00:00:00 179.775434 -743.27 1352.90 1647.35 81.397542
2021-01-02 01:00:00 179.775434 -1184.25 752.35 2081.33 81.397542
2021-01-02 02:00:00 179.775434 -1448.93 -76.22 2306.36 81.397542
2021-01-02 03:00:00 179.775434 -1465.47 -1016.00 2091.12 81.397542
Here is one way to do it:
additional_hours = 1
# Create new index with additional values
new_index = []
for i in df.index:
for j in range(additional_hours, 0, -1):
new_index.append(i - pd.Timedelta(j, unit="H"))
new_index.append(i)
for j in range(additional_hours):
new_index.append(i + pd.Timedelta(j + 1, unit="H"))
# Create new dataframe by merging with original df
new_df = (
pd.DataFrame(index=new_index)
.reset_index()
.reset_index()
.set_index("index")
.merge(df, how="left", left_index=True, right_index=True, sort=False)
.fillna(method="ffill")
.fillna(method="bfill")
.sort_values("level_0")
.drop(columns="level_0")
)
print(new_df)
# Output
col1 col2 col3 col4 col5
index
2020-12-31 23:00:00 55.646134 1.00 -3890.12 2437.47 4.716594
2021-01-01 00:00:00 55.646134 1.00 -3890.12 2437.47 4.716594
2021-01-01 01:00:00 55.646134 2.00 -3786.65 2908.26 4.716594
2021-01-01 00:00:00 55.646134 1.00 -3890.12 2437.47 4.716594
2021-01-01 01:00:00 55.646134 2.00 -3786.65 2908.26 4.716594
2021-01-01 02:00:00 55.646134 3.00 -3694.92 3123.63 4.716594
2021-01-01 01:00:00 55.646134 2.00 -3786.65 2908.26 4.716594
2021-01-01 02:00:00 55.646134 3.00 -3694.92 3123.63 4.716594
2021-01-01 03:00:00 55.646134 4.00 -3619.17 3251.61 4.716594
2021-01-01 02:00:00 55.646134 3.00 -3694.92 3123.63 4.716594
2021-01-01 03:00:00 55.646134 4.00 -3619.17 3251.61 4.716594
2021-01-01 04:00:00 55.646134 4.00 -3619.17 3251.61 4.716594
2021-01-01 23:00:00 55.646134 4.00 -3619.17 3251.61 4.716594
2021-01-02 00:00:00 179.775434 -743.27 1352.90 1647.35 81.397542
2021-01-02 01:00:00 179.775434 -1184.25 752.35 2081.33 81.397542
2021-01-02 00:00:00 179.775434 -743.27 1352.90 1647.35 81.397542
2021-01-02 01:00:00 179.775434 -1184.25 752.35 2081.33 81.397542
2021-01-02 02:00:00 179.775434 -1448.93 -76.22 2306.36 81.397542
2021-01-02 01:00:00 179.775434 -1184.25 752.35 2081.33 81.397542
2021-01-02 02:00:00 179.775434 -1448.93 -76.22 2306.36 81.397542
2021-01-02 03:00:00 179.775434 -1465.47 -1016.00 2091.12 81.397542
2021-01-02 02:00:00 179.775434 -1448.93 -76.22 2306.36 81.397542
2021-01-02 03:00:00 179.775434 -1465.47 -1016.00 2091.12 81.397542
2021-01-02 04:00:00 179.775434 -1465.47 -1016.00 2091.12 81.397542
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Laurent |