'Ignoring count reset at specified point in time series
I have a dataframe like this (edited; adding a grouping variable measurement_type):
data <- data.frame(ID = as.factor(c(rep(1, 10),
rep(2, 10))),
measurement_type = as.factor(c(rep("type_1", 5),
rep("type_2", 5),
rep("type_1", 5),
rep("type_2", 5))),
measurement_time = as.POSIXct(c("2014-06-17 04:00:00",
"2014-06-17 11:52:00",
"2014-06-17 18:58:00",
"2014-06-18 02:05:00",
"2014-06-18 08:00:00",
"2014-06-17 05:27:00",
"2014-06-17 11:10:00",
"2014-06-17 17:02:00",
"2014-06-17 23:56:00",
"2014-06-18 07:01:00",
"2014-07-03 16:01:00",
"2014-07-03 19:19:00",
"2014-07-03 23:55:00",
"2014-07-04 08:08:00",
"2014-07-04 13:55:00",
"2014-07-03 22:12:00",
"2014-07-04 08:59:00",
"2014-07-04 14:10:00",
"2014-07-04 17:00:00",
"2014-07-04 23:00:00")),
amount = c(350,470,310,470,650,
175,275,45,255,395,
130,460,540,790,69,
80,210,58,147,326),
entry_time = as.POSIXct(c(rep("2014-06-17 01:53:00", 10),
rep("2014-07-03 14:35:00", 10))))
The Subjects with ID 1 and ID 2 enter at a specified entry_time and thereafter, cumulative amounts are measured at specific measurement_times. However, each day at noon, the amount is set back to zero again and counting starts again (from zero). What I would like to achieve is that once a break at noon happens (and therefore a reset to zero), it keeps adding the new newly starting cumulative amount to that already accumulated before noon (grouped by the grouping variable measurement_type).
UPDATE
Thanks to @Istrel I almost got the right output using the provided answer:
data %>% as_tibble() %>%
# Check 12 hours passed --> `pm` column
mutate(pm = format(measurement_time, "%H") >= 12) %>%
mutate(date_fct = format(measurement_time, "%Y_%d")) %>%
# Group by ID and `pm`
group_by(ID, measurement_type, date_fct, pm) %>%
# Turn cumsum into actual values
mutate(amount_act = amount - lag(amount, default = 0)) %>%
# Cumsum over ID
ungroup() %>%
group_by(ID, measurement_type) %>%
mutate(amount_cums = cumsum(amount_act)) %>%
ungroup() %>%
select(-c(pm, date_fct, amount_act))
Output
# A tibble: 20 x 6
ID measurement_type measurement_time amount entry_time amount_cums
<fct> <fct> <dttm> <dbl> <dttm> <dbl>
1 1 type_1 2014-06-17 04:00:00 350 2014-06-17 01:53:00 350
2 1 type_1 2014-06-17 11:52:00 470 2014-06-17 01:53:00 470
3 1 type_1 2014-06-17 18:58:00 310 2014-06-17 01:53:00 780
4 1 type_1 2014-06-18 02:05:00 470 2014-06-17 01:53:00 1250
5 1 type_1 2014-06-18 08:00:00 650 2014-06-17 01:53:00 1430
6 1 type_2 2014-06-17 05:27:00 175 2014-06-17 01:53:00 175
7 1 type_2 2014-06-17 11:10:00 275 2014-06-17 01:53:00 275
8 1 type_2 2014-06-17 17:02:00 45 2014-06-17 01:53:00 320
9 1 type_2 2014-06-17 23:56:00 255 2014-06-17 01:53:00 530
10 1 type_2 2014-06-18 07:01:00 395 2014-06-17 01:53:00 925
11 2 type_1 2014-07-03 16:01:00 130 2014-07-03 14:35:00 130
12 2 type_1 2014-07-03 19:19:00 460 2014-07-03 14:35:00 460
13 2 type_1 2014-07-03 23:55:00 540 2014-07-03 14:35:00 540
14 2 type_1 2014-07-04 08:08:00 790 2014-07-03 14:35:00 1330
15 2 type_1 2014-07-04 13:55:00 69 2014-07-03 14:35:00 1399
16 2 type_2 2014-07-03 22:12:00 80 2014-07-03 14:35:00 80
17 2 type_2 2014-07-04 08:59:00 210 2014-07-03 14:35:00 290
18 2 type_2 2014-07-04 14:10:00 58 2014-07-03 14:35:00 348
19 2 type_2 2014-07-04 17:00:00 147 2014-07-03 14:35:00 437
20 2 type_2 2014-07-04 23:00:00 326 2014-07-03 14:35:00 616
As you can see, the cumulative sums for noon breaks get updated correctly. However, in case of overnight cases, the code adds the amount of after midnight to the total count of before midnight. However, at midnight there is no count reset, and the amount simply should be keeping cumulating to the before midnight amount (no count reset). However, in the output shown above, the cumulative amount is added to the before midnight value: for example line 10, where it adds the value of 395 to the amount_cums of 530 (line 9), whereas it should simply be adding the difference to the last value (395 - 255 = 140), the correct amount_cums being 670 for line 10.
Any idea how I could adapt your code?
Solution 1:[1]
I can suggest this strategy. First, group the data by ID, date (year_month_day) and AM/PM time tag. Then convert the cumulative sum to the original values in each group. Then recalculate the cumulative sum in groups by ID and date.
The solution could look like this:
library(tidyverse)
dat_alt <- data %>% as_tibble() %>%
# Check 12 hours passed --> `pm` column
mutate(pm = format(measurement_time, "%H") >= 12) %>%
mutate(date_fct = format(measurement_time, "%Y_%d")) %>%
# Group by ID and `pm`
group_by(ID, measurement_type, date_fct, pm) %>%
# Turn cumsum into actual values
mutate(amount_act = amount - lag(amount, default = 0)) %>%
# Cumsum over ID
ungroup() %>%
group_by(ID, measurement_type) %>%
mutate(amount_cums = cumsum(amount_act)) %>%
ungroup() %>%
select(-c(pm, date_fct, amount_act))
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
