'Groupby hours +/- some integer of additional hours

I have a data frame consisting of some columns, where the index is datetime, i.e. it looks something like:

df = 
                            col1        col2      col3        col4        col5
date                    
2021-01-01 00:00:00+01:00   55.646134   1.00      -3890.12    2437.47     4.716594
2021-01-01 01:00:00+01:00   55.646134   2.00      -3786.65    2908.26     4.716594
2021-01-01 02:00:00+01:00   55.646134   3.00      -3694.92    3123.63     4.716594
2021-01-01 03:00:00+01:00   55.646134   4.00      -3619.17    3251.61     4.716594
2021-01-02 00:00:00+01:00   179.775434  -743.27   1352.90     1647.35     81.397542
2021-01-02 01:00:00+01:00   179.775434  -1184.25  752.35      2081.33     81.397542
2021-01-02 02:00:00+01:00   179.775434  -1448.93  -76.22      2306.36     81.397542
2021-01-02 03:00:00+01:00   179.775434  -1465.47  -1016.00    2091.12     81.397542
....

Now if I do something like:

df_grouped = list(df.groupby(df.index.hour))

Then I get a list of data frames where each group has either hour 00:00:00+01:00, 01:00:00+01:00, 02:00:00+01:00 etc. Hence 24 different groups in my case.

However, what I would like is to get them grouped by the hour, but ALSO including some integer of the previous and next hour, i.e. the resulting groups for the df above would instead of:

df_group1 =
                            col1        col2     col3        col4      col5
date                    
2021-01-01 00:00:00+01:00   55.646134   1.00     -3890.12    2437.47   4.716594
2021-01-02 00:00:00+01:00   179.775434  -743.27  1352.90     1647.35   81.397542


df_group2 =
                            col1        col2     col3        col4      col5
date                    
2021-01-01 01:00:00+01:00   55.646134   2.00    -3786.65     2908.26   4.716594
2021-01-02 01:00:00+01:00   179.775434  -1184.25  752.35     2081.33   81.397542


df_group3 =
                            col1        col2     col3        col4      col5
date                    
2021-01-01 02:00:00+01:00   55.646134   3.00     -3694.92    3123.63   4.716594
2021-01-02 02:00:00+01:00   179.775434  -1448.93 -76.22      2306.36   81.397542


df_group4 =
                            col1        col2     col3        col4      col5
date                    
2021-01-01 03:00:00+01:00   55.646134   4.00     -3619.17    3251.61   4.716594
2021-01-02 03:00:00+01:00   179.775434  -1465.47 -1016.00    2091.12   81.397542

Become:

df_group1_true =
                            col1        col2     col3        col4      col5
date
2021-01-01 00:00:00+01:00   55.646134   1.00     -3890.12    2437.47   4.716594
2021-01-01 01:00:00+01:00   55.646134   2.00     -3786.65    2908.26   4.716594
2021-01-02 00:00:00+01:00   179.775434  -743.27  1352.90     1647.35   81.397542
2021-01-02 01:00:00+01:00   179.775434  -1184.25  752.35     2081.33   81.397542

df_group2_true = 
                            col1        col2     col3        col4      col5
date                    
2021-01-01 00:00:00+01:00   55.646134   1.00     -3890.12    2437.47   4.716594
2021-01-01 01:00:00+01:00   55.646134   2.00     -3786.65    2908.26   4.716594
2021-01-01 02:00:00+01:00   55.646134   3.00     -3694.92    3123.63   4.716594
2021-01-02 00:00:00+01:00   179.775434  -743.27  1352.90     1647.35   81.397542
2021-01-02 01:00:00+01:00   179.775434  -1184.25  752.35     2081.33   81.397542
2021-01-02 02:00:00+01:00   179.775434  -1448.93 -76.22      2306.36   81.397542

df_group3_true = 
                            col1        col2     col3        col4      col5
date                    
2021-01-01 01:00:00+01:00   55.646134   2.00     -3786.65    2908.26   4.716594
2021-01-01 02:00:00+01:00   55.646134   3.00     -3694.92    3123.63   4.716594
2021-01-01 03:00:00+01:00   55.646134   4.00     -3619.17    3251.61   4.716594
2021-01-02 01:00:00+01:00   179.775434  -1184.25  752.35     2081.33   81.397542
2021-01-02 02:00:00+01:00   179.775434  -1448.93 -76.22      2306.36   81.397542
2021-01-02 03:00:00+01:00   179.775434  -1465.47 -1016.00    2091.12   81.397542

df_group4_true = 
                            col1        col2     col3        col4      col5
date                    
2021-01-01 02:00:00+01:00   55.646134   3.00     -3694.92    3123.63   4.716594
2021-01-01 03:00:00+01:00   55.646134   4.00     -3619.17    3251.61   4.716594
2021-01-02 02:00:00+01:00   179.775434  -1448.93 -76.22      2306.36   81.397542
2021-01-02 03:00:00+01:00   179.775434  -1465.47 -1016.00    2091.12   81.397542

In this example data set here there are no 23:00:00+01:00, but obviously if you were on 00:00:00+01:00, the previous datetime to include should be 23:00:00+01:00 and so forth.

The way I do it now is actually by doing df_grouped = list(df.groupby(df.index.hour)) which will get me 24 data frames (one for each hour), and then something like:

additional_hours = 1  # How many +/- hours I want to expand each group with.

df_list_final = [
    pd.concat(
        [df[(i + j) % 24][1] for j in range(-additional_hours, additional_hours + 1)]
    ).sort_index()
    for i in range(len(df))
]

And this works. However, I was wondering if there were a more elegant way of doing it with just some kind of groupby instead of having to concatenating a bunch of data frames all the time.



Solution 1:[1]

When I try to run your code, It raises an exception, so I can be sure of the output you expect, but here is my take on your question anyway, in hope it might be helpful.

With the following toy dataframe:

import pandas as pd

df = pd.DataFrame(
    [
        ("2021-01-01 00:00:00", 55.646134, 1.00000e00, -3890.12, 2437.47, 4.716594),
        ("2021-01-01 01:00:00", 55.646134, 2.00000e00, -3786.65, 2908.26, 4.716594),
        ("2021-01-01 02:00:00", 55.646134, 3.00000e00, -3694.92, 3123.63, 4.716594),
        ("2021-01-01 03:00:00", 55.646134, 4.00000e00, -3619.17, 3251.61, 4.716594),
        ("2021-01-02 00:00:00", 179.775434, -7.43270e02, 1352.9, 1647.35, 81.397542),
        ("2021-01-02 01:00:00", 179.775434, -1.18425e03, 752.35, 2081.33, 81.397542),
        ("2021-01-02 02:00:00", 179.775434, -1.44893e03, -76.22, 2306.36, 81.397542),
        ("2021-01-02 03:00:00", 179.775434, -1.46547e03, -1016.0, 2091.12, 81.397542),
    ],
    columns=["index", "col1", "col2", "col3", "col4", "col5"],
).set_index("index")
df.index = pd.to_datetime(df.index, format="%Y-%m-%d %H:%M:%S")

print(df)
# Output
                           col1     col2     col3     col4       col5
index
2021-01-01 00:00:00   55.646134     1.00 -3890.12  2437.47   4.716594
2021-01-01 01:00:00   55.646134     2.00 -3786.65  2908.26   4.716594
2021-01-01 02:00:00   55.646134     3.00 -3694.92  3123.63   4.716594
2021-01-01 03:00:00   55.646134     4.00 -3619.17  3251.61   4.716594
2021-01-02 00:00:00  179.775434  -743.27  1352.90  1647.35  81.397542
2021-01-02 01:00:00  179.775434 -1184.25   752.35  2081.33  81.397542
2021-01-02 02:00:00  179.775434 -1448.93   -76.22  2306.36  81.397542
2021-01-02 03:00:00  179.775434 -1465.47 -1016.00  2091.12  81.397542

Here is one way to do it:

additional_hours = 1

# Create new index with additional values
new_index = []
for i in df.index:
    for j in range(additional_hours, 0, -1):
        new_index.append(i - pd.Timedelta(j, unit="H"))
    new_index.append(i)
    for j in range(additional_hours):
        new_index.append(i + pd.Timedelta(j + 1, unit="H"))

# Create new dataframe by merging with original df
new_df = (
    pd.DataFrame(index=new_index)
    .reset_index()
    .reset_index()
    .set_index("index")
    .merge(df, how="left", left_index=True, right_index=True, sort=False)
    .fillna(method="ffill")
    .fillna(method="bfill")
    .sort_values("level_0")
    .drop(columns="level_0")
)
print(new_df)
# Output
                           col1     col2     col3     col4       col5
index
2020-12-31 23:00:00   55.646134     1.00 -3890.12  2437.47   4.716594
2021-01-01 00:00:00   55.646134     1.00 -3890.12  2437.47   4.716594
2021-01-01 01:00:00   55.646134     2.00 -3786.65  2908.26   4.716594
2021-01-01 00:00:00   55.646134     1.00 -3890.12  2437.47   4.716594
2021-01-01 01:00:00   55.646134     2.00 -3786.65  2908.26   4.716594
2021-01-01 02:00:00   55.646134     3.00 -3694.92  3123.63   4.716594
2021-01-01 01:00:00   55.646134     2.00 -3786.65  2908.26   4.716594
2021-01-01 02:00:00   55.646134     3.00 -3694.92  3123.63   4.716594
2021-01-01 03:00:00   55.646134     4.00 -3619.17  3251.61   4.716594
2021-01-01 02:00:00   55.646134     3.00 -3694.92  3123.63   4.716594
2021-01-01 03:00:00   55.646134     4.00 -3619.17  3251.61   4.716594
2021-01-01 04:00:00   55.646134     4.00 -3619.17  3251.61   4.716594
2021-01-01 23:00:00   55.646134     4.00 -3619.17  3251.61   4.716594
2021-01-02 00:00:00  179.775434  -743.27  1352.90  1647.35  81.397542
2021-01-02 01:00:00  179.775434 -1184.25   752.35  2081.33  81.397542
2021-01-02 00:00:00  179.775434  -743.27  1352.90  1647.35  81.397542
2021-01-02 01:00:00  179.775434 -1184.25   752.35  2081.33  81.397542
2021-01-02 02:00:00  179.775434 -1448.93   -76.22  2306.36  81.397542
2021-01-02 01:00:00  179.775434 -1184.25   752.35  2081.33  81.397542
2021-01-02 02:00:00  179.775434 -1448.93   -76.22  2306.36  81.397542
2021-01-02 03:00:00  179.775434 -1465.47 -1016.00  2091.12  81.397542
2021-01-02 02:00:00  179.775434 -1448.93   -76.22  2306.36  81.397542
2021-01-02 03:00:00  179.775434 -1465.47 -1016.00  2091.12  81.397542
2021-01-02 04:00:00  179.775434 -1465.47 -1016.00  2091.12  81.397542

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Laurent