'Dynamic offset based on condition

My data.table looks like below.

library(data.table)

my_dt <- data.table(time = c(1000, 1111, 2000, 3000, 4500, 4600, 5000, 7000, 8500, 9000),
                    var1 = 1:10,
                    var2 = c(0.5, 0.6, 2, 0.8, 2.1, 1, 7, 8, 0.1, 0.3))
#printed table:

   time var1 var2
 1: 1000    1  0.5
 2: 1111    2  0.6
 3: 2000    3  2.0
 4: 3000    4  0.8
 5: 4500    5  2.1
 6: 4600    6  1.0
 7: 5000    7  7.0
 8: 7000    8  8.0
 9: 8500    9  0.1
10: 9000   10  0.3

For each value in var1 I need to find the row index of the first value in var2 which is equal to/greater then the value in var1. I then need to compute the difference in time by substracting the corresponding values in the time col. For example: The first value in var1 is 1 (also row index 1). The first value in var2 which is equal to/greater then 1 is 2.0 and at row index 3. Hence the corresponding difference in time is 2000 - 1000 = 1000:

my_dt$time[3] - my_dt$time[1]

The desired result looks like below. I actually don't need the idx column, but it might help as an intermediary result.

my_result <- data.table(time = c(1000, 1111, 2000, 3000, 4500, 4600, 5000, 7000, 8500, 9000),
                        var1 = 1:10,
                        var2 = c(0.5, 0.6, 2, 0.8, 2.1, 1, 7, 8, 0.1, 0.3),
                        idx = c(3, 3, 7, 7, 7, 7, 7, 8, 0, 0),
                        time.diff = c(1000, 889, 3000, 2000, 500, 400, 0, 0, NA, NA))
#printed result:
    time var1 var2 idx time.diff
 1: 1000    1  0.5   3      1000
 2: 1111    2  0.6   3       889
 3: 2000    3  2.0   7      3000
 4: 3000    4  0.8   7      2000
 5: 4500    5  2.1   7       500
 6: 4600    6  1.0   7       400
 7: 5000    7  7.0   7         0
 8: 7000    8  8.0   8         0
 9: 8500    9  0.1   0        NA
10: 9000   10  0.3   0        NA

I can achieve this result with a combination of a for- and a while-loop. However, the execution is painfully slow. I wonder if there is a better solution that a) creates col idx with data.table syntax and b) uses the value from idx as offset parameter (n) in shift() ? Many thanks in advance!



Solution 1:[1]

Converting my comment to an answer so the quesiton will be answered

my_dt[, time.diff := .SD[my_dt, time - i.time, on = .(var2 >= var1), mult = "first"]]

Basically,

  • my_dt[my_dt, on = .(var2 >= var1)] is a self non-equi join.

  • mult = "first" is for picking only the first match, otherwise it will join each var1 to all <= var2 values

  • Because it's a self-join, both tables have same columns, hence the i. prefix stands for the column in the table in the i-th place (as in DT[i, j, by])

  • Re the .SD part, it is just because the assignment operator (:=) messes up the order, meaning, the following line works (without assigment), but get messed up with it. Probably has something to do with the internal order of operation within data.table[i, j, by]

    my_dt[my_dt, time - i.time, on = .(var2 >= var1), mult = "first"]
    # [1] 1000  889 3000 2000  500  400    0    0   NA   NA
    

Solution 2:[2]

my_dt[,idx := max.col(outer(var1, var2, '<='), 'first')][,
       idx := idx * NA^(var2[idx]<var1)][,
       time_diff := time[idx] - time][]

    time var1 var2 idx time_diff
 1: 1000    1  0.5   3      1000
 2: 1111    2  0.6   3       889
 3: 2000    3  2.0   7      3000
 4: 3000    4  0.8   7      2000
 5: 4500    5  2.1   7       500
 6: 4600    6  1.0   7       400
 7: 5000    7  7.0   7         0
 8: 7000    8  8.0   8         0
 9: 8500    9  0.1  NA        NA
10: 9000   10  0.3  NA        NA

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 David Arenburg
Solution 2 onyambu