'Sum values if date falls within range (two dataframes)

I have two data frames that look like below:

id1 <- c("1", "2", "3", "4", "4")
start <- c("2019-07-09", "2021-03-22", "2019-02-25", "2019-01-27", "2019-02-14")
end <- c("2019-07-10", "2021-03-27", "2019-02-27", "2019-01-28", "2019-02-16")

df1 <- data.frame(id1, start, end)
id2 <- c( "1","1", "2", "2" ,"2" ,"2" , "2", "2", "3","3", "3", "4", "4", "4" )
date <- c("2019-07-09","2019-07-10", "2021-03-22", "2021-03-23", "2021-03-24", "2021-03-25", "2021-03-26", "2021-03-27", "2019-02-25", "2019-02-26", "2019-02-27", "2019-01-27", "2019-02-14", "2019-02-15")
value <- c("0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5")

df2 <- data.frame(id2, date, value)

I am trying to match by id and sum the values in df2 if the dates fall with in the start and end date in df1.

The output should look like below. I've been trying to work this through dplyr and haven't had any luck.

###EXPECTED OUTPUT####

id1 <- c("1", "2", "3", "4", "4")
start <- c("2019-07-09", "2021-03-22", "2019-02-25", "2019-01-27", "2019-02-14")
end <- c("2019-07-10", "2021-03-27", "2019-02-27", "2019-01-28", "2019-02-16")
output_value <- c("1", "3", "1.5", "0.5", "1")

output <- data.frame(id1, start, end, output_value)

Thanks for your help!



Solution 1:[1]

You'll need to first combine the two dataframe using left_join, than make sure the date, start, and end columns are type of "Date", and make sure the column value is numeric. Then remove the date that are not within start and end. Finally group_by the relevant columns and sum up the value.

library(tidyverse)

left_join(df2, df1, by = c("id2" = "id1")) %>% 
  mutate(across(c(date, start, end), as.Date),
         value = as.numeric(value)) %>% 
  filter(date >= start & date <= end) %>% 
  group_by(id2, start, end) %>% 
  summarize(value = sum(value), .groups = "drop")


# A tibble: 5 x 4
  id2   start      end        value
  <chr> <date>     <date>     <dbl>
1 1     2019-07-09 2019-07-10   1  
2 2     2021-03-22 2021-03-27   3  
3 3     2019-02-25 2019-02-27   1.5
4 4     2019-01-27 2019-01-28   0.5
5 4     2019-02-14 2019-02-16   1  

Solution 2:[2]

The lubridate package's %within% operator and interval function is useful for this kind of thing. You could do something like this:

library(dplyr)
library(lubridate)

df1 %>% 
  inner_join(df2, by = c("id1" = "id2")) %>% 
  mutate(across(c(date, start, end), as.Date),
         value = as.numeric(value)) %>% 
  group_by(id1, start, end) %>% 
  summarise(output_value =  sum(
    ifelse(
      date %within% interval(start, end),
      value,
      0
    )
  ))

# A tibble: 5 x 4
# Groups:   id1, start [5]
  id1   start      end             output_value
  <chr> <date>     <date>                 <dbl>
1 1     2019-07-09 2019-07-10               1  
2 2     2021-03-22 2021-03-27               3  
3 3     2019-02-25 2019-02-27               1.5
4 4     2019-01-27 2019-01-28               0.5
5 4     2019-02-14 2019-02-16               1  

Solution 3:[3]

Another option could be to use the data.table package. Below is a quick way to get your desired output. You do need to make sure to convert your data.frames to data.tables.

library(data.table)
df1 <- data.table(id1 = c("1", "2", "3", "4", "4"), 
                  start = c("2019-07-09", "2021-03-22", "2019-02-25", "2019-01-27", "2019-02-14"), 
                  end = c("2019-07-10", "2021-03-27", "2019-02-27", "2019-01-28", "2019-02-16"))

df2 <- data.table(id2 = c( "1","1", "2", "2" ,"2" ,"2" , "2", "2", "3","3", "3", "4", "4", "4" ), 
                  date = c("2019-07-09","2019-07-10", "2021-03-22", "2021-03-23", "2021-03-24", "2021-03-25", 
                           "2021-03-26", "2021-03-27", "2019-02-25", "2019-02-26", "2019-02-27", "2019-01-27", "2019-02-14", "2019-02-15"), 
                  value = c("0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5","0.5"))

# convert value from chr to numeric
df2[,value := as.numeric(value)]

# merge tables by ids
new_dt = merge(df1, df2, by.x = "id1", by.y = "id2", all.y = TRUE)

# sum value when date is between start and end and group by id1,start,end
new_dt2 = new_dt[date %between% .(start, end),.(output_value = sum(value)), 
                 by = .(id1,start,end)]

# result
new_dt2
   id1      start        end output_value
1:   1 2019-07-09 2019-07-10          1.0
2:   2 2021-03-22 2021-03-27          3.0
3:   3 2019-02-25 2019-02-27          1.5
4:   4 2019-01-27 2019-01-28          0.5
5:   4 2019-02-14 2019-02-16          1.0

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 benson23
Solution 2 henryn
Solution 3 Jamie