'How to join on the minimum date when using rolling joins from data.table

I have two datasets- sessions and ratings. I want to join the sessions and ratings datasets based on the artist_id (in both datasets) and the recording_date (from the sessions dataset) with the RatingDate (from the ratings datasets). However, I only want to join the two datasets if there is a RatingDate that is either on or up to 7 days before the very first recording_date of the course_number.

I am using data.table's rolling join function to do this, and it works pretty well. The issue I'm having is that I want to take the earliest RatingDate that is within 7 days before the minimum recording_date, however my current join appears to be taking the RatingDate that is closest to the recording_date.

Here is a reproducible example:

library(tidyverse)
library(data.table)
library(readxl)
library(writexl)
library(scales)
library(lubridate)
library(sqldf)

sessions <- structure(list(artist_id = c(313, 313, 313, 313, 313, 313, 313, 
                                         313, 313, 313, 313, 313, 313, 313, 313, 313, 313, 313, 313, 313, 
                                         313, 313, 313, 313, 313, 313, 313, 313, 313, 313, 313, 313, 313, 
                                         313, 313, 313, 356, 356, 356, 356, 356, 356, 356, 356, 356, 356, 
                                         356, 356, 356, 356, 356, 356, 356, 356, 356, 356, 356, 356, 356, 
                                         356, 356, 356, 356, 356, 356, 356, 356, 356, 356, 356, 356, 356
), session_number_total = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 
                            10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 
                            23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 
                            36L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 
                            14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 
                            27L, 28L, 29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L), recording_date = structure(c(16279, 
                                                                                                            16280, 16281, 16282, 16283, 16284, 16287, 16288, 16289, 16292, 
                                                                                                            16293, 16294, 16295, 16296, 16297, 16298, 16299, 16300, 16302, 
                                                                                                            16303, 16304, 16305, 16306, 16307, 16308, 16309, 16310, 16311, 
                                                                                                            16314, 16328, 16329, 16330, 16331, 16332, 16336, 16341, 16972, 
                                                                                                            16973, 16974, 16975, 16976, 16981, 16982, 16983, 16987, 16988, 
                                                                                                            16989, 16990, 16993, 16994, 16995, 16996, 17000, 17001, 17002, 
                                                                                                            17003, 17004, 17007, 17008, 17009, 17010, 17014, 17015, 17018, 
                                                                                                            17021, 17023, 17028, 17030, 17036, 17044, 17051, 17058), class = "Date"), 
days_between = c(0, 1, 1, 1, 1, 1, 3, 1, 1, 3, 1, 1, 1, 1, 
                 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 14, 1, 1, 1, 
                 1, 4, 5, 0, 1, 1, 1, 1, 5, 1, 1, 4, 1, 1, 1, 3, 1, 1, 1, 
                 4, 1, 1, 1, 1, 3, 1, 1, 1, 4, 1, 3, 3, 2, 5, 2, 6, 8, 7, 
                 7), course_number = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                       1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                       1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 
                                       1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                       1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                                       2L, 2L, 2L), session_in_course = c(1L, 2L, 3L, 4L, 5L, 6L, 
                                                                          7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 
                                                                          19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 1L, 
                                                                          2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 
                                                                          10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 
                                                                          22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L, 32L, 33L, 
                                                                          1L, 2L, 3L), total_sessions_in_course = c(29L, 29L, 29L, 
                                                                                                                    29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 
                                                                                                                    29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 
                                                                                                                    29L, 29L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 33L, 33L, 33L, 33L, 
                                                                                                                    33L, 33L, 33L, 33L, 33L, 33L, 33L, 33L, 33L, 33L, 33L, 33L, 
                                                                                                                    33L, 33L, 33L, 33L, 33L, 33L, 33L, 33L, 33L, 33L, 33L, 33L, 
                                                                                                                    33L, 33L, 33L, 33L, 33L, 3L, 3L, 3L)), row.names = c(NA, 
                                                                                                                                                                         -72L), class = c("data.table", "data.frame"))

ratings <- structure(list(artist_id = c(313, 313, 313, 313, 313, 313, 313, 
                                        313, 356, 356, 356, 356, 356, 356, 356, 356, 356, 356, 356, 356, 
                                        356), RatingDate = structure(c(16276, 16279, 16282, 16292, 16297, 
                                                                       16304, 16314, 16341, 16972, 16976, 16987, 16990, 17000, 17004, 
                                                                       17014, 17018, 17023, 17030, 17044, 17051, 17058), class = "Date"), 
                          Name = c("PHQ-9", "PHQ-9", "PHQ-9", "PHQ-9", "PHQ-9", "PHQ-9", 
                                   "PHQ-9", "PHQ-9", "PHQ-9", "PHQ-9", "PHQ-9", "PHQ-9", "PHQ-9", 
                                   "PHQ-9", "PHQ-9", "PHQ-9", "PHQ-9", "PHQ-9", "PHQ-9", "PHQ-9", 
                                   "PHQ-9"), RatingValue = c(13L, 13L, 11L, 8L, 5L, 5L, 9L, 
                                                             2L, 20L, 21L, 17L, 16L, 17L, 12L, 13L, 12L, 10L, 9L, 11L, 
                                                             8L, 8L), total_questionnaires_completed = c(8L, 8L, 8L, 8L, 
                                                                                                         8L, 8L, 8L, 8L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 13L, 
                                                                                                         13L, 13L, 13L, 13L)), row.names = c(NA, -21L), class = c("data.table", 
                                                                                                                                                                  "data.frame"))

ratings_min <- ratings

#create min_session_totals dataset
min_session_totals <- sessions %>%
  group_by(artist_id, course_number, total_sessions_in_course) %>%
  summarize(
            min_recording_date = min(recording_date),
            max_recording_date = max(recording_date)) %>%
  select(artist_id, course_number, total_sessions_in_course, min_recording_date, max_recording_date) %>%
  arrange(artist_id, course_number, min_recording_date) 

min_session_totals <- as.data.frame(min_session_totals)

setDT(min_session_totals)
setDT(ratings_min)

min_session_totals[, join_time_min:=min_recording_date]
ratings_min[, join_time_min:=RatingDate]

setkey(min_session_totals, artist_id, join_time_min)
setkey(ratings_min, artist_id, join_time_min)

first_rating <- ratings_min[min_session_totals, roll = 7, nomatch = NA
][j = .(RatingDate, RatingValue, join_time_min, course_number, total_sessions_in_course,
        min_recording_date, max_recording_date),
  by = .(artist_id, Name)][order(artist_id, min_recording_date, RatingDate, Name),]

first_rating <- first_rating %>%
  filter(!is.na(Name))

first_rating

If you look at the resulting first_rating datatable, you will see that artist_id 356 is joined properly- their earliest recording_date for course_number 1 is 6/20/2016 and their earliest rating is also on 6/20/2016. Similarly, course_number 2 has a minimum recording_date of 8/31/2016 and it is joined with the RatingDate of 8/31/2016.

The issue is with artist_id 313. The first_rating datatable shows their min_recording_date as 7/28/2014. However, artist_id 313 actually had a RatingDate on 7/25/2014- which is within 7 days from their minimum recording date for course_number 1 (7/28/2014).

Does anybody know how I can restructure the join such that the minimum RatingDate is joined, so long as it is within 7 days from the minimum recording_date? As you can see, it looks like my current join is just joining the RatingDate that is closest to the recording_date, so long as it is within 7 days rolling backwards.

I am open to all ideas, including non-data.table workarounds. Any and all help is appreciated. Thank you!



Solution 1:[1]

I believe you can rectify this with a non-equi join. Instead of creating the join_time_min col on both min_session_totals and ratings_min, you can instead create it only on the latter, as the RatingDate+7, like this:

ratings_min[, join_time_min:=RatingDate+7]

Then, you can do a non-equi join like this:

ratings_min[min_session_totals, on=.(artist_id, join_time_min>min_recording_date)] %>% 
  ## Now limit to valid rows, and take the first row in each group
  .[RatingDate<=join_time_min, .SD[1], by=.(artist_id, course_number)] %>% 
  ## Now select the columns and the order you want
  .[,.(artist_id, Name, RatingDate, RatingValue, join_time_min, course_number, total_sessions_in_course, max_recording_date)]

Output:

   artist_id   Name RatingDate RatingValue join_time_min course_number total_sessions_in_course max_recording_date
       <num> <char>     <Date>       <int>        <Date>         <int>                    <int>             <Date>
1:       313  PHQ-9 2014-07-25          13    2014-07-28             1                       29         2014-09-01
2:       356  PHQ-9 2016-06-20          20    2016-06-20             1                       33         2016-08-23
3:       356  PHQ-9 2016-08-31          11    2016-08-31             2                        3         2016-09-14

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 langtang