'Resampling in Pandas is thrown off by daylight savings time
I have a dataset that marks the occurrences of an event every minute for four years. Here's a sample:
In [547]: result
Out[547]:
uuid timestamp col1 col2 col3
0 100 2016-03-30 00:00:00+02:00 NaN NaN NaN
1 100 2016-03-30 00:01:00+02:00 NaN NaN NaN
2 100 2016-03-30 00:02:00+02:00 NaN NaN NaN
3 100 2016-03-30 00:03:00+02:00 1.49 1.79 0.979
4 100 2016-03-30 00:04:00+02:00 NaN NaN NaN
... ... ... .. ...
1435 100 2016-03-30 23:55:00+02:00 NaN NaN NaN
1436 100 2016-03-30 23:56:00+02:00 1.39 2.19 1.09
1437 100 2016-03-30 23:57:00+02:00 NaN NaN NaN
1438 100 2016-03-30 23:58:00+02:00 NaN NaN NaN
1439 100 2016-03-30 23:59:00+02:00 NaN NaN NaN
[1440 rows x 5 columns]
I am trying to get summary statistics every time there is a non-blank row and get these statistics for every six hours. To do this, the resample() function works great. Here's a sample:
In [548]: result = result.set_index('timestamp').tz_convert('Europe/Berlin').resample('6h', label='right', closed='right', origin='start_day').agg(['mean', 'last', 'count']).iloc[:,-9:]
Out[548]:
col1_mean col1_last ... col3_last times_changed
timestamp ...
2016-03-30 00:00:00+02:00 NaN NaN ... NaN 0
2016-03-30 07:00:00+02:00 1.0690 1.069 ... 1.279 1
2016-03-30 13:00:00+02:00 1.0365 1.009 ... 1.239 4
2016-03-30 19:00:00+02:00 1.0150 0.989 ... 1.209 5
2016-03-30 01:00:00+02:00 1.1290 1.129 ... 1.329 1
[5 rows x 7 columns]
This looks great and is the format I'd like to work with. However, when I run my code on all data (spanning many years), here's an excerpt of what the output looks like:
In [549]: result
Out[549]:
col1_mean col1_last ... col3_last times_changed
timestamp ...
2016-03-27 00:00:00+01:00 NaN NaN ... NaN 0
2016-03-27 07:00:00+02:00 1.0690 1.069 ... 1.279 1
2016-03-27 13:00:00+02:00 1.0365 1.009 ... 1.239 4
2016-03-27 19:00:00+02:00 1.0150 0.989 ... 1.209 5
2016-03-28 01:00:00+02:00 1.1290 1.129 ... 1.329 1
[5 rows x 7 columns]
The new index takes DST into consideration and throws everything off by an hour. I would like the new times to still be between 0–6, 6–12 etc.
Is there a way to coerce my dataset to adhere to a 0–6, 6–12 format? If there's an extra hour, maybe the aggregations from that could still be tucked into the 0–6 range?
The timezone I'm working with is Europe/Berlin and I tried converting everything to UTC. However, values are not at their right date or time — for example, an occurrence at 00:15hrs would be 23:15hrs the previous day, which throws off those summary statistics.
Are there any creative solutions to fix this?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|
