'R: Remove the rows in domain 'condition' within 270 days before and after the date of rows in domain 'procedure'

I have a question regarding how to clean a dataset using R.

Now I have a dataset that looks like this:

Columns: ID, date, domain;

domain includes: procedure, condition

date is YYYY-MM-DD;

ID <- c(1,1,1,2,2,2,2,3,4,4,4,5,5,5,5,5,5,5,5,5)

date <- c('2018-06-29','2018-08-29','2018-08-29','2020-12-11','2020-12-12','2021-01-21','2021-12-21','2020-11-29','2019-07-12','2018-05-13','2021-06-08','2017-11-14','2018-07-15','2019-01-15','2019-02-19','2021-11-09','2021-11-10','2021-10-09','2021-07-13','2022-12-01')

domain <- c('condition','condition','procedure','condition','procedure','condition','condition','procedure','condition','procedure','procedure','procedure','condition','procedure','condition','procedure','condition','condition','condition','condition')

df <- data.frame(ID, date, domain)

Example dateset:

> df
   ID       date    domain
1   1 2018-06-29 condition
2   1 2018-08-29 condition
3   1 2018-08-29 procedure
4   2 2020-12-11 condition
5   2 2020-12-12 procedure
6   2 2021-01-21 condition
7   2 2021-12-21 condition
8   3 2020-11-29 procedure
9   4 2019-07-12 condition
10  4 2018-05-13 procedure
11  4 2021-06-08 procedure
12  5 2017-11-14 procedure
13  5 2018-07-15 condition
14  5 2019-01-15 procedure
15  5 2019-02-19 condition
16  5 2021-11-09 procedure
17  5 2021-11-10 condition
18  5 2021-10-09 condition
19  5 2021-07-13 condition
20  5 2022-12-01 condition

I would like to remove all the rows in domain 'condition' within 270 days before and after a certain row in domain 'procedure' for each ID.

In other words, for each ID, use the rows in domain 'procedure' as anchors, remove all the rows in domain 'condition' within 270 days before and after the anchors. That is, for each ID, remove the rows in domain 'condition' within the range: date of procedure ± 270 days

For example, the desired output would be:

   ID date      domain      decision
1   1 2018-06-29 condition  drop
2   1 2018-08-29 condition  drop
3   1 2018-08-29 procedure  keep

4   2 2020-12-11 condition  drop
5   2 2020-12-12 procedure  keep
6   2 2021-01-21 condition  drop
7   2 2021-12-21 condition  keep

8   3 2020-11-29 procedure  keep

9   4 2019-07-12 condition  keep
10  4 2018-05-13 procedure  keep
11  4 2021-06-08 procedure  keep

12  5 2017-11-14 procedure  keep
13  5 2018-07-15 condition  drop
14  5 2019-01-15 procedure  keep
15  5 2019-02-19 condition  drop
16  5 2021-11-09 procedure  keep
17  5 2021-11-10 condition  drop
18  5 2021-10-09 condition  drop
19  5 2021-07-13 condition  drop
20  5 2022-12-01 condition  keep

Thank you for your help!



Solution 1:[1]

We can convert the 'date' to Date (with as.Date), then grouped by 'ID', get the difference between 'date' and 'date' where domain is 'procedure', check if the absolute value is greater than or equal to 270 or (|) if the domain is "procedure"

library(dplyr)
library(purrr)
df1 <- df %>%
   mutate(date = as.Date(date)) %>%
   group_by(ID) %>%
   mutate(decision = case_when(map(date[domain== "procedure"], 
   ~abs(as.numeric(difftime(date, .x, units = "days"))) >= 
     270|domain == "procedure" ) %>% 
       reduce(`&`) ~ "keep", TRUE ~ "drop")) %>%
  ungroup

-output

> df1
# A tibble: 20 × 4
      ID date       domain    decision
   <dbl> <date>     <chr>     <chr>   
 1     1 2018-06-29 condition drop    
 2     1 2018-08-29 condition drop    
 3     1 2018-08-29 procedure keep    
 4     2 2020-12-11 condition drop    
 5     2 2020-12-12 procedure keep    
 6     2 2021-01-21 condition drop    
 7     2 2021-12-21 condition keep    
 8     3 2020-11-29 procedure keep    
 9     4 2019-07-12 condition keep    
10     4 2018-05-13 procedure keep    
11     4 2021-06-08 procedure keep    
12     5 2017-11-14 procedure keep    
13     5 2018-07-15 condition drop    
14     5 2019-01-15 procedure keep    
15     5 2019-02-19 condition drop    
16     5 2021-11-09 procedure keep    
17     5 2021-11-10 condition drop    
18     5 2021-10-09 condition drop    
19     5 2021-07-13 condition drop    
20     5 2022-12-01 condition keep   

NOTE: This creates a column in original data as 'decision'. Instead if we want to remove the rows, use filter on the logical output

df1 <- df %>%
   mutate(date = as.Date(date)) %>%
   group_by(ID) %>%
   filter(map(date[domain== "procedure"], 
   ~abs(as.numeric(difftime(date, .x, units = "days"))) >= 
     270|domain == "procedure" ) %>% 
       reduce(`&`) ) %>%
  ungroup

-output

> df1
# A tibble: 11 × 3
      ID date       domain   
   <dbl> <date>     <chr>    
 1     1 2018-08-29 procedure
 2     2 2020-12-12 procedure
 3     2 2021-12-21 condition
 4     3 2020-11-29 procedure
 5     4 2019-07-12 condition
 6     4 2018-05-13 procedure
 7     4 2021-06-08 procedure
 8     5 2017-11-14 procedure
 9     5 2019-01-15 procedure
10     5 2021-11-09 procedure
11     5 2022-12-01 condition

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1