'R - How to condition across two dataframes based on date
Sorry if this is a duplicate but could not quite find what was right for me.
I have a Dataframe 1 - where Episode_ID is unique but patient_ID is not
Episode_ID Patient_ID Drug Commenced_date
1 1 0 A 2018-01-02
2 2 0 B 2020-11-26
3 3 1 B 2017-09-26
4 4 2 B 2020-11-26
and Dataframe 2 where Visit_ID is unique but Patient_ID is not
Patient_ID Visit_ID Visit_date Visit_Score
1 0 1 2018-01-22 2.3
2 0 2 2019-06-01 1.5
3 0 3 2020-11-26 1.6
4 1 B 2020-11-26 1.4
What I want to achieve is for each episode (from df1) to be matched to that patients closest visit in (df2) by looping through all the visit_dates (matched by patient_ID) in df 2 and have the associated visit_score appended. As each row in Df1 is iterated, the patient ID can be matched to patient ID in df2. So the first row, should be a 1 to 3 match. datediff with which.min can then be applied to find the closest match. Its the coding syntax that I really can't get down right.
For example, based on the above. The result would be:
Episode_ID Patient_ID Drug Commenced_date Closest_Visit Visit_score
1 1 0 A 2018-01-02 2018-01-22 2.3
2 2 0 B 2020-11-26 2020-11-26 1.6
3 3 1 B 2017-09-26 2018-01-22 2.3
4 4 2 B 2020-11-26 2020-11-26 1.4
This should be quite simple but I've becoming really rusty in R and out of practice. I tried to use a nested for loop but it kept breaking. Then tried using sapply but couldn't quite get it to work.
Any help would be greatly appreciated. Happy for a solution in python as well but prefer R for now given the rest of the project at the moment.
Solution 1:[1]
There are a few options to consider for approaches with R that do not require loops.
I might consider data.table and join the data.frames with roll:
library(data.table)
setDT(df1)
setDT(df2)
df2[df1, on = c('Patient_ID', 'Visit_date' = 'Commenced_date'), roll = 'nearest']
Output
Patient_ID Visit_ID Visit_date Visit_Score Episode_ID Drug
1: 0 1 2018-01-02 2.3 1 A
2: 0 3 2020-11-26 1.6 2 B
3: 1 B 2017-09-26 1.4 3 B
4: 2 <NA> 2020-11-26 NA 4 B
Or use sqldf with a join of the two data.frames:
library(sqldf)
sqldf("select df1.*, df2.*, min(abs(df1.Commenced_date - df2.Visit_date)) diff
from df1
left join df2 using(Patient_ID)
group by df1.rowid")
Output
Episode_ID Patient_ID Drug Commenced_date Patient_ID Visit_ID Visit_date Visit_Score diff
1 1 0 A 2018-01-02 0 1 2018-01-22 2.3 20
2 2 0 B 2020-11-26 0 3 2020-11-26 1.6 0
3 3 1 B 2017-09-26 1 B 2020-11-26 1.4 1157
4 4 2 B 2020-11-26 NA <NA> <NA> NA NA
In this case, leaving in a calculated difference in dates.
If you did want a tidyverse approach, you could a join and then filter or slice to keep the row with minimum difference in dates:
library(tidyverse)
left_join(df1, df2, by = "Patient_ID") %>%
group_by(Patient_ID, Episode_ID) %>%
slice_min(abs(Visit_date - Commenced_date))
Output
Episode_ID Patient_ID Drug Commenced_date Visit_ID Visit_date Visit_Score
<int> <int> <chr> <date> <chr> <date> <dbl>
1 1 0 A 2018-01-02 1 2018-01-22 2.3
2 2 0 B 2020-11-26 3 2020-11-26 1.6
3 3 1 B 2017-09-26 B 2020-11-26 1.4
In this case, used slice_min which will omit rows in df1 that did not match by Patient_ID on df2.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Ben |
