'data.table non-join on one column
I'm trying to do a join in data.table where within a set of columns that match another table I want all values in a third column from that table that don't exist in the initial table. Here's an example:
library(foreach)
library(data.table)
tmp <- data.table(mgr=c('1','1','1','1',
'2','2',
'3','3'),
year=rep(2022,8),
stocks=c('A','B','C','D',
'A','B',
'C','D'))
The input table looks like this:
mgr year stocks
1: 1 2022 A
2: 1 2022 B
3: 1 2022 C
4: 1 2022 D
5: 2 2022 A
6: 2 2022 B
7: 3 2022 C
8: 3 2022 D
I then get all unique stock-year combinations. Then for each mgr-year I want the stocks that are not in tmp for that particular mgr-year pair:
allstocks <- unique(tmp[,.(year,stocks)])
myloop <- unique(tmp[,.(mgr,year)])
out <- foreach(myidx=1:nrow(myloop)) %do% {
tmp1 <- tmp[myloop[myidx],on=.NATURAL]
tmp1 <- allstocks[!tmp1,on=.(year,stocks)]
tmp1 <- merge(myloop[myidx],tmp1,by=c('year'))
tmp1
}
finalout <- rbindlist(out)
finalout
This gives:
year mgr stocks
1: 2022 2 C
2: 2022 2 D
3: 2022 3 A
4: 2022 3 B
Since mgr 1 has all stocks there is no row for that manager. I'm also okay with there being a row with NA in stocks for that particular mgr-year.
Effectively what I want is for each stock-year to obtain a list of the stocks that each mgr does not hold. This example works, but the actual data is ~100mm rows so I'm wondering is there is a way to do this with pure data.table in an efficient manner.
Thanks.
Solution 1:[1]
In data.table there is a ! operator for exclusions.
setkey(tmp, mgr, year, stocks)
all.stocks <- tmp[, .(stocks=unique(tmp$stocks)), by=.(mgr, year)]
setkey(all.stocks, mgr, year, stocks)
missing <- all.stocks[!tmp]
This should be pretty fast.
Solution 2:[2]
For the sake of completeness, here is a "one-liner" which uses
- cross join
CJ()to create all unique possible combinations ofmgr,year, andstocks, - the exclusive join operator
!, - the
on =clause to indicate the columns to join on, - data.table chaining:
tmp[, CJ(mgr, year, stocks, unique = TRUE)][!tmp, on = .(mgr, year, stocks)]
Key: <mgr, year, stocks> mgr year stocks <char> <num> <char> 1: 2 2022 C 2: 2 2022 D 3: 3 2022 A 4: 3 2022 B
Note that it is not necessary to call setkey() explicitely as the default parameter sorted = TRUE of CJ() already sets keys and on uses these.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | jlhoward |
| Solution 2 |
