How to implement a rolling join "without replacement" (a row in either source table should map to 0 or 1 rows in the result)

Consider these tables of purchases and sales:

library(data.table)

purchases <- data.table(
  purchase_id = c(10,20,30,40,50,60),
  store = c("a", "a", "a", "b", "b", "b"),
  date = c(1,1,2,3,3,3)
)

sales <- data.table(
  sale_id = c(1,2,3,4,5,6),
  store = c("a", "a", "a", "b", "b", "b"),
  date = c(1,1,1,3,3,4)
)

> purchases
    purchase_id store date
1:           10     a    1
2:           20     a    1
3:           30     a    2
4:           40     b    3
5:           50     b    3
6:           60     b    3
> sales
   sale_id store date
1:       1     a    1
2:       2     a    1
3:       3     a    1
4:       4     b    3
5:       5     b    3
6:       6     b    4

I would like to map each purchase to the sale that occurred at the same time or later (and at the same store). The catch is that each purchase should be mapped to exactly one sale or none, and vice versa.

There are multiple mappings that would satisfy my requirements, but a simple one follows this algorithm:

For each purchase:
  Subset sales where sale store matches purchase store and sale date >= purchase date
  Select the first sale in the subset and map it to this purchase
  REMOVE THIS SALE FROM THE sales TABLE!

which would produce a mapping like

    purchase_id sale_id
1:           10       1
2:           20       2
3:           30      NA
4:           40       4
5:           50       5
6:           60       6
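
For reference, a direct (and deliberately naive) translation of that algorithm into R could look like the sketch below; greedy_map is just an illustrative name, and the loop is O(n^2), so it only pins down the intended semantics.

library(data.table)

# Naive sketch of the greedy algorithm: for each purchase, claim the
# first eligible sale and remove it from the pool ("without replacement").
greedy_map <- function(purchases, sales) {
  pool <- copy(sales)                         # sales not yet claimed
  res  <- purchases[, .(purchase_id, sale_id = NA_real_)]
  for (i in seq_len(nrow(purchases))) {
    eligible <- which(pool$store == purchases$store[i] &
                      pool$date  >= purchases$date[i])
    if (length(eligible) > 0L) {
      k <- eligible[1L]                       # first eligible sale wins
      set(res, i, "sale_id", pool$sale_id[k])
      pool <- pool[-k]                        # drop the claimed sale
    }
  }
  res[]
}

greedy_map(purchases, sales)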

Is there an elegant way to do this with data.table?


Dirty Solution

Here's a dirty but working solution I developed.

rolling_join_without_replacement <- function(x, i, on, roll, allow.cartesian = FALSE){
  # Dirty implementation of a rolling join matching algo without replacement
  # Each row in i maps to exactly one row in the result
  # Each row in x maps to exactly zero or one rows in the result
  
  # Copy x and i
  x2 <- copy(x)
  i2 <- copy(i)
  
  # Create row id fields for each table
  x2[, x_row := .I]
  i2[, i_row := .I]
  
  allmatches <- list()
  while(TRUE){
    
    # Execute the rolling join
    matches <- x2[i2, on = on, roll = roll, allow.cartesian = allow.cartesian, nomatch = 0L]
    
    # If no matches, break
    if(nrow(matches) == 0) break
    
    # Keep the first match per i row, then the first match per x row.
    # An i row whose x row is claimed by an earlier i row is retried in
    # the next iteration of the loop.
    matches <- matches[matches[, .I[1L], by = i_row]$V1]
    matches <- matches[matches[, .I[1L], by = x_row]$V1]
    
    # Save these matches
    allmatches <- c(allmatches, list(matches))
    
    # Exclude these x and i from future matches
    x2 <- x2[!matches, on = "x_row"]
    i2 <- i2[!matches, on = "i_row"]
  }
  
  # Combine matches
  allmatches <- rbindlist(allmatches, use.names = TRUE)
  
  # Include unmatched i rows
  unmatched <- i2[!allmatches, on = "i_row"]
  allmatches <- rbind(allmatches, unmatched, use.names = TRUE, fill = TRUE)
  
  return(allmatches[])
}

Usage

rolling_join_without_replacement(
  x = sales, 
  i = purchases, 
  on = c("store", "date"), 
  roll = -Inf, 
  allow.cartesian = TRUE
)[order(purchase_id), .(purchase_id, sale_id)]

    purchase_id sale_id
1:           10       1
2:           20       2
3:           30      NA
4:           40       4
5:           50       5
6:           60       6


Solution 1:[1]

According to the OP, the goal is

to map each purchase to the sale that occurred at the same time or later (and at the same store). The catch is one purchase should be mapped to exactly one or none sales, and vice-versa.

If I understand correctly, the OP is looking to align the vector of purchase ids with the vector of sale ids after removing those sales events which occurred before the purchase (for each store).

Here is an approach which uses a non-equi join and rowid() to pick aligned rows:

library(data.table)
sales[purchases, on = c("store", "date>=date"), 
  .(store, purchase_id, sale_id = sale_id[x.date >= i.date])][
    rowid(store, purchase_id) == rowid(store, sale_id)]

Result for a modified use case (in order to cover more edge cases, e.g., more stores):

   store purchase_id sale_id
1:     a          10       1
2:     a          20       2
3:     a          30      NA
4:     b          40       5
5:     b          50       6
6:     b          60       7
7:     d          70      NA

Please note that store is included for safety and completeness, as purchase_id and sale_id may not be unique across all stores.

Also note that the result depends strongly on the order of rows in purchases and sales.

Data

Sample data modified to cover more edge cases:

purchases <- data.table(
  purchase_id = c(10,20,30,40,50,60,70),
  store = c("a", "a", "a", "b", "b", "b", "d"),
  date = c(1,1,2,3,3,3,3)
)

sales <- data.table(
  sale_id = c(1,2,3,4,5,6,7,8),
  store = c("a", "a", "a", "b", "b", "b", "b", "c"),
  date = c(1,1,1,2,3,3,4,5)
)

purchases
   purchase_id store date
1:          10     a    1
2:          20     a    1
3:          30     a    2
4:          40     b    3
5:          50     b    3
6:          60     b    3
7:          70     d    3

includes an additional purchase in store d.

sales
   sale_id store date
1:       1     a    1
2:       2     a    1
3:       3     a    1
4:       4     b    2
5:       5     b    3
6:       6     b    3
7:       7     b    4
8:       8     c    5

includes 2 additional sales (rows 4 and 8) and an additional store c.

Explanation

The first expression

sales[purchases, on = c("store", "date>=date"), 
  .(store, purchase_id, sale_id = sale_id[x.date >= i.date])]

returns all valid combinations of purchase_id and sale_id, i.e., only those sale_ids are included where the sale date x.date is on or after the purchase date i.date (within each store):

    store purchase_id sale_id
 1:     a          10       1
 2:     a          10       2
 3:     a          10       3
 4:     a          20       1
 5:     a          20       2
 6:     a          20       3
 7:     a          30      NA
 8:     b          40       5
 9:     b          40       6
10:     b          40       7
11:     b          50       5
12:     b          50       6
13:     b          50       7
14:     b          60       5
15:     b          60       6
16:     b          60       7
17:     d          70      NA

The second expression

[rowid(store, purchase_id) == rowid(store, sale_id)]

creates sequence numbers for the repeated values of purchase_id and, likewise, for the repeated values of sale_id, then keeps only the rows where the two sequence numbers match.
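
For a minimal illustration of rowid() on its own (hypothetical vectors, not the join result): it numbers successive occurrences of each unique value combination, so equating the two row ids aligns the k-th occurrence of a purchase_id with the k-th distinct sale_id within each store.

library(data.table)
rowid(c("a", "a", "a", "b"), c(10, 10, 20, 10))
# [1] 1 2 1 1   (the second (a,10) pair gets 2; every first occurrence gets 1)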

Solution 2:[2]

You could try this:

x <- sales[purchases, on = .(store, date >= date)][order(store, date, sale_id, purchase_id)]

res <- x[0]

# among a purchase's candidates, pick the smallest sale_id not yet assigned
f <- function(df, res) df[!sale_id %in% res$sale_id][sale_id == min(sale_id)]

# greedily assign a sale to each purchase, one purchase at a time
for (p in unique(x$purchase_id)) res <- rbind(res, f(x[purchase_id == p], res))

# append purchases that had no candidate sale at all
res <- rbind(res, x[is.na(sale_id)])

Output:

   sale_id  store  date purchase_id
     <num> <char> <num>       <num>
1:       1      a     1          10
2:       2      a     1          20
3:       4      b     3          40
4:       5      b     3          50
5:       6      b     3          60
6:      NA      a     2          30

Solution 3:[3]

I will leave my other answer as it is, but I realized that you could use similar logic to avoid the join entirely. However, for that to work, you do need to sort the tables beforehand. If that's acceptable, it could potentially save a lot of memory if your tables are big.

setkey(purchases, store, date, purchase_id)
setkey(sales, store, date, sale_id)

purchases[, sale_id := {
  # j is the index for sales, make sure we start at the same store as purchases
  j <- 1L
  while (j <= nrow(sales) && sales$store[j] != store[1L]) {
    j <- j + 1L
  }
  if (j > nrow(sales)) {
    return(NA_real_)
  }
  
  # pre-fill with NA so purchases left over when sales run out stay unmatched
  sale_id <- rep(NA_real_, length(purchase_id))
  i <- 1L
  
  while (i <= length(purchase_id) && j <= nrow(sales)) {
    # if stores no longer match, add NA until they do
    while (i <= length(purchase_id) && sales$store[j] != store[i]) {
      sale_id[i] <- NA_real_
      i <- i + 1L
    }
    
    if (i > length(purchase_id)) {
      break
    }
    
    # move sales' cursor until we find a date that's at or after current purchase
    while (j <= nrow(sales) && sales$store[j] == store[i] && sales$date[j] < date[i]) {
      j <- j + 1L
    }
    
    # if j is still valid and stores still match, we found a valid sale date
    # otherwise no valid sale was found, so enter NA
    if (j <= nrow(sales) && sales$store[j] == store[i]) {
      sale_id[i] <- sales$sale_id[j]
      i <- i + 1L
      j <- j + 1L
    }
    else {
      sale_id[i] <- NA_real_
      i <- i + 1L
    }
  }
  
  sale_id
}]

Essentially, you just keep a cursor for each table and leverage the fact that the columns are sorted. You advance both cursors when a matching sale is found, but shift only one of them when that's not the case, depending on whether the store or the date is no longer valid for a match.
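
Assuming the question's original six-row sample data, the in-place update above should reproduce the expected mapping (a sketch; the row order follows the keys set beforehand):

purchases[, .(purchase_id, sale_id)]
#    purchase_id sale_id
# 1:          10       1
# 2:          20       2
# 3:          30      NA
# 4:          40       4
# 5:          50       5
# 6:          60       6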

Solution 4:[4]

One option would be to do a left non-equi join, letting data.table find all matches for each purchase (using more memory), and then filter afterwards by sorting the result by sale_id and leveraging that order:

sales[purchases, on = .(store, date >= date), allow.cartesian = TRUE][
    , by = store, .SDcols = c("purchase_id", "sale_id"), {
        ordered_sd <- .SD[order(sale_id, purchase_id)]
        purchase_id <- ordered_sd$purchase_id
        sale_id <- ordered_sd$sale_id

        purchase_buffer <- sale_buffer <- purchases[.BY$store, purchase_id, on = "store"]
        buffer_index <- 1L
        #cat(address(purchase_buffer), address(sale_buffer), "first", sep = "\n")

        current_pid <- NULL
        max_sid <- 0L

        do_updates <- function(i, sid) {
            current_pid <<- purchase_id[i]

            if (!is.na(sid)) {
                max_sid <<- sid
            }

            purchase_buffer[buffer_index] <<- current_pid
            sale_buffer[buffer_index] <<- sid
            buffer_index <<- buffer_index + 1L
            #cat(address(purchase_buffer), address(sale_buffer), "update", sep = "\n")
        }

        for (i in seq_along(purchase_id)) {
            if (is.null(current_pid)) {
                do_updates(i, sale_id[i])
            }

            pid <- purchase_id[i]

            if (pid > current_pid) {
                if (purchase_buffer[buffer_index - 1L] != current_pid) {
                    do_updates(i - 1L, NA_integer_)
                }

                sid <- sale_id[i]

                if (!is.na(sid) && sid > max_sid) {
                    do_updates(i, sid)
                }
            }
        }

        if (purchase_buffer[buffer_index - 1L] != pid) {
            do_updates(i, NA_integer_)
        }

        list(
            purchase_id = purchase_buffer,
            sale_id = sale_buffer
        )
    }]

A breakdown of what this does:

  1. The left non-equi join.
  2. Sorting primarily by sale_id within each store group in case the ids can be repeated across all stores.
  3. Create a buffer. Since you want one line per purchase id, we can start with the purchase_id column from the original table: purchases[.BY$store, purchase_id, on = "store"]. See secondary indices; a short illustration follows this list.
  4. Since we now have an order in sale_id, we can iterate through the values and extract a single id for each purchase by keeping track of the max id seen in the loop (max_sid). If we don't find any sale id for a purchase, we enter an NA.
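
As a quick illustration of the secondary-index lookup used in step 3, subsetting by an unkeyed column via on works like this (using the question's original purchases table):

purchases["a", purchase_id, on = "store"]
# [1] 10 20 30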

Some notes:

  • Your code doesn't produce integer ids, but I assumed they were meant to be integers, which is why I used NA_integer_. If your ids are not integers (see storage.mode and the short check after these notes), use NA_real_ to avoid coercions.
  • You can see some commented calls to cat that I used to check if copies were being made for the buffer. It seems that it is correctly copied only once and then modified in place.
  • The additional if branch after the loop is for the last purchase in the case it doesn't have any matching sales, since in that case pid > current_pid && purchase_buffer[buffer_index - 1L] != current_pid will never be true.
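
For example, with the sample data from the question, the ids are in fact stored as doubles, so NA_real_ would be the safer choice there:

storage.mode(purchases$purchase_id)
# [1] "double"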

Now, if the additional memory utilization is not acceptable for your actual data set, you could materialize smaller joins by joining with .SD inside a frame that also specifies by:

sales[, by = store, .SD[purchases[.BY$store, on = "store"], .(purchase_id, sale_id), on = .(date >= date), allow.cartesian = TRUE][
    order(sale_id, purchase_id)][
        , {
            ...
        }] # the whole 'j' expression for 'sales[...]' ends here
]

The code between {} is almost the same as before but without the creation of ordered_sd and the 2 lines below that.

In this case, in each call, .SD will contain a subset of sales for a given store, which is also why we manually create a subset of purchases with purchases[.BY$store, on = "store"] so we only join with the same store's rows.

However, even if you use the second approach, each row from purchases could potentially match many rows from sales, so you could consider applying a time window so that the matches don't include sales that are very far away in time. For that, you could add something like this to sales:

sales[, match_limit_inclusive := date - 2]

and change the join condition to:

on = .(date >= date, match_limit_inclusive <= date)
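
Putting the two fragments together, a windowed version of the join could look like this sketch (the 2-unit window and the column name match_limit_inclusive are illustrative choices, not a fixed API):

# only consider sales at most 2 time units after the purchase,
# which bounds how many sales rows each purchase can match
sales[, match_limit_inclusive := date - 2]
sales[purchases,
      on = .(store, date >= date, match_limit_inclusive <= date),
      allow.cartesian = TRUE]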

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 langtang
Solution 3 Alexis
Solution 4