Resampling in pandas is thrown off by daylight saving time

I have a dataset that marks the occurrences of an event every minute for four years. Here's a sample:

In [547]: result
Out[547]: 
      uuid                 timestamp  col1  col2   col3
0      100 2016-03-30 00:00:00+02:00   NaN   NaN    NaN
1      100 2016-03-30 00:01:00+02:00   NaN   NaN    NaN
2      100 2016-03-30 00:02:00+02:00   NaN   NaN    NaN
3      100 2016-03-30 00:03:00+02:00  1.49  1.79  0.979
4      100 2016-03-30 00:04:00+02:00   NaN   NaN    NaN
...    ...                       ...   ...   ...    ...
1435   100 2016-03-30 23:55:00+02:00   NaN   NaN    NaN
1436   100 2016-03-30 23:56:00+02:00  1.39  2.19   1.09
1437   100 2016-03-30 23:57:00+02:00   NaN   NaN    NaN
1438   100 2016-03-30 23:58:00+02:00   NaN   NaN    NaN
1439   100 2016-03-30 23:59:00+02:00   NaN   NaN    NaN

[1440 rows x 5 columns]
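For reference, a frame like the one above can be reconstructed as follows. The two non-blank rows and their values come from the sample; everything else (uniform NaNs, the single `uuid`) is an assumption for illustration:

```python
import numpy as np
import pandas as pd

# One day of minute-level, tz-aware timestamps in Europe/Berlin.
idx = pd.date_range("2016-03-30", periods=1440, freq="min", tz="Europe/Berlin")

df = pd.DataFrame({
    "uuid": 100,
    "timestamp": idx,
    "col1": np.nan,
    "col2": np.nan,
    "col3": np.nan,
})

# Fill in the two observed rows from the sample.
df.loc[3, ["col1", "col2", "col3"]] = [1.49, 1.79, 0.979]
df.loc[1436, ["col1", "col2", "col3"]] = [1.39, 2.19, 1.09]
```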

I am trying to compute summary statistics over the non-blank rows, aggregated into six-hour windows. The resample() function works great for this. Here's a sample:

In [548]: result = result.set_index('timestamp').tz_convert('Europe/Berlin').resample(
     ...:     '6h', label='right', closed='right', origin='start_day'
     ...: ).agg(['mean', 'last', 'count']).iloc[:, -9:]

In [549]: result
Out[549]:
                           col1_mean  col1_last  ...  col3_last  times_changed
timestamp                                            ...                         
2016-03-30 00:00:00+02:00          NaN          NaN  ...       NaN              0
2016-03-30 07:00:00+02:00       1.0690        1.069  ...     1.279              1
2016-03-30 13:00:00+02:00       1.0365        1.009  ...     1.239              4
2016-03-30 19:00:00+02:00       1.0150        0.989  ...     1.209              5
2016-03-30 01:00:00+02:00       1.1290        1.129  ...     1.329              1

[5 rows x 7 columns]

This looks great and is the format I'd like to work with. However, when I run my code on all data (spanning many years), here's an excerpt of what the output looks like:

In [550]: result

Out[550]:
                           col1_mean  col1_last  ...  col3_last  times_changed
timestamp                                            ...                         
2016-03-27 00:00:00+01:00          NaN          NaN  ...       NaN              0
2016-03-27 07:00:00+02:00       1.0690        1.069  ...     1.279              1
2016-03-27 13:00:00+02:00       1.0365        1.009  ...     1.239              4
2016-03-27 19:00:00+02:00       1.0150        0.989  ...     1.209              5
2016-03-28 01:00:00+02:00       1.1290        1.129  ...     1.329              1

[5 rows x 7 columns]

The new index takes DST into account and shifts everything by an hour. I would like the new times to still fall into 0–6, 6–12, etc. windows.

Is there a way to coerce my dataset to adhere to a 0–6, 6–12 format? If there's an extra hour, maybe the aggregations from that hour could still be tucked into the 0–6 range?

The timezone I'm working with is Europe/Berlin, and I tried converting everything to UTC. However, values then land at the wrong local date or time — for example, an occurrence at 00:15hrs would appear at 23:15hrs on the previous day, which throws off those summary statistics.
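The mismatch described above can be seen directly with a single (hypothetical) winter timestamp, when Berlin is at UTC+01:

```python
import pandas as pd

# An occurrence at 00:15 local time (CET, UTC+01)...
local = pd.Timestamp("2016-01-15 00:15", tz="Europe/Berlin")
utc = local.tz_convert("UTC")

# ...lands on the previous calendar day in UTC, so bins computed
# in UTC no longer line up with local days.
print(utc)  # 2016-01-14 23:15:00+00:00
```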

Are there any creative solutions to fix this?
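One workaround worth sketching (an assumption on my part, not part of the original question): strip the timezone before resampling so the bins are computed on local wall-clock time, then re-attach it afterwards. The toy data and variable names below are hypothetical; the date range deliberately spans the spring-forward transition of 2016-03-27:

```python
import numpy as np
import pandas as pd

# Three days of minute data crossing the 2016-03-27 DST jump.
idx = pd.date_range("2016-03-26", periods=4320, freq="min", tz="Europe/Berlin")
df = pd.DataFrame({"col1": np.arange(4320.0)}, index=idx)

# Drop the timezone so resample() bins on naive wall-clock time;
# the DST jump then no longer shifts the bin edges.
naive = df.tz_localize(None)
out = naive.resample("6h", label="right", closed="right",
                     origin="start_day").agg(["mean", "last", "count"])

# Re-attach the timezone. The 6-hour labels (00/06/12/18) never fall
# inside Berlin's 02:00-03:00 transition window, so this is safe here.
out.index = out.index.tz_localize("Europe/Berlin")
```

The trade-off is that during the skipped (or repeated) hour the bins cover five (or seven) real hours, which matches the ask of tucking the extra hour into the adjacent range.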



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
