'R: Remove the rows in domain 'condition' within 270 days before and after the date of rows in domain 'procedure'
I have a question regarding how to clean a dataset using R.
Now I have a dataset that looks like this:
Columns: ID, date, domain;
domain includes: procedure, condition
date is YYYY-MM-DD;
ID <- c(1,1,1,2,2,2,2,3,4,4,4,5,5,5,5,5,5,5,5,5)
date <- c('2018-06-29','2018-08-29','2018-08-29','2020-12-11','2020-12-12','2021-01-21','2021-12-21','2020-11-29','2019-07-12','2018-05-13','2021-06-08','2017-11-14','2018-07-15','2019-01-15','2019-02-19','2021-11-09','2021-11-10','2021-10-09','2021-07-13','2022-12-01')
domain <- c('condition','condition','procedure','condition','procedure','condition','condition','procedure','condition','procedure','procedure','procedure','condition','procedure','condition','procedure','condition','condition','condition','condition')
df <- data.frame(ID, date, domain)
Example dateset:
> df
ID date domain
1 1 2018-06-29 condition
2 1 2018-08-29 condition
3 1 2018-08-29 procedure
4 2 2020-12-11 condition
5 2 2020-12-12 procedure
6 2 2021-01-21 condition
7 2 2021-12-21 condition
8 3 2020-11-29 procedure
9 4 2019-07-12 condition
10 4 2018-05-13 procedure
11 4 2021-06-08 procedure
12 5 2017-11-14 procedure
13 5 2018-07-15 condition
14 5 2019-01-15 procedure
15 5 2019-02-19 condition
16 5 2021-11-09 procedure
17 5 2021-11-10 condition
18 5 2021-10-09 condition
19 5 2021-07-13 condition
20 5 2022-12-01 condition
I would like to remove all the rows in domain 'condition' within 270 days before and after a certain row in domain 'procedure' for each ID.
In other words, for each ID, use the rows in domain 'procedure' as anchors, remove all the rows in domain 'condition' within 270 days before and after the anchors. That is, for each ID, remove the rows in domain 'condition' within the range: date of procedure ± 270 days
For example, the desired output would be:
ID date domain decision
1 1 2018-06-29 condition drop
2 1 2018-08-29 condition drop
3 1 2018-08-29 procedure keep
4 2 2020-12-11 condition drop
5 2 2020-12-12 procedure keep
6 2 2021-01-21 condition drop
7 2 2021-12-21 condition keep
8 3 2020-11-29 procedure keep
9 4 2019-07-12 condition keep
10 4 2018-05-13 procedure keep
11 4 2021-06-08 procedure keep
12 5 2017-11-14 procedure keep
13 5 2018-07-15 condition drop
14 5 2019-01-15 procedure keep
15 5 2019-02-19 condition drop
16 5 2021-11-09 procedure keep
17 5 2021-11-10 condition drop
18 5 2021-10-09 condition drop
19 5 2021-07-13 condition drop
20 5 2022-12-01 condition keep
Thank you for your help!
Solution 1:[1]
We can convert the 'date' to Date (with as.Date), then grouped by 'ID', get the difference between 'date' and 'date' where domain is 'procedure', check if the absolute value is greater than or equal to 270 or (|) if the domain is "procedure"
library(dplyr)
library(purrr)
df1 <- df %>%
mutate(date = as.Date(date)) %>%
group_by(ID) %>%
mutate(decision = case_when(map(date[domain== "procedure"],
~abs(as.numeric(difftime(date, .x, units = "days"))) >=
270|domain == "procedure" ) %>%
reduce(`&`) ~ "keep", TRUE ~ "drop")) %>%
ungroup
-output
> df1
# A tibble: 20 × 4
ID date domain decision
<dbl> <date> <chr> <chr>
1 1 2018-06-29 condition drop
2 1 2018-08-29 condition drop
3 1 2018-08-29 procedure keep
4 2 2020-12-11 condition drop
5 2 2020-12-12 procedure keep
6 2 2021-01-21 condition drop
7 2 2021-12-21 condition keep
8 3 2020-11-29 procedure keep
9 4 2019-07-12 condition keep
10 4 2018-05-13 procedure keep
11 4 2021-06-08 procedure keep
12 5 2017-11-14 procedure keep
13 5 2018-07-15 condition drop
14 5 2019-01-15 procedure keep
15 5 2019-02-19 condition drop
16 5 2021-11-09 procedure keep
17 5 2021-11-10 condition drop
18 5 2021-10-09 condition drop
19 5 2021-07-13 condition drop
20 5 2022-12-01 condition keep
NOTE: This creates a column in original data as 'decision'. Instead if we want to remove the rows, use filter on the logical output
df1 <- df %>%
mutate(date = as.Date(date)) %>%
group_by(ID) %>%
filter(map(date[domain== "procedure"],
~abs(as.numeric(difftime(date, .x, units = "days"))) >=
270|domain == "procedure" ) %>%
reduce(`&`) ) %>%
ungroup
-output
> df1
# A tibble: 11 × 3
ID date domain
<dbl> <date> <chr>
1 1 2018-08-29 procedure
2 2 2020-12-12 procedure
3 2 2021-12-21 condition
4 3 2020-11-29 procedure
5 4 2019-07-12 condition
6 4 2018-05-13 procedure
7 4 2021-06-08 procedure
8 5 2017-11-14 procedure
9 5 2019-01-15 procedure
10 5 2021-11-09 procedure
11 5 2022-12-01 condition
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
