Strange behavior of data.table with 'by' argument?
I want a function that sums the value column of a data.table within groups, using the 'by' argument, and overwrites the old values with the group sums. I would expect every row in the same group to end up with the same result. I have created two examples. The only difference between them is that the first drops the leading three digits of the take values used in the second. The first example works as expected; the second shows some unexpected behavior. I would be glad to get any hint about what I'm doing wrong.
R version: 4.0.4
data.table version: 1.14.2
library(data.table)
# my expected function
superpose <- function(DT){
  DT <- copy(DT)                        # work on a copy so the input is not modified by reference
  DT[, value := sum(value), by = take]  # overwrite value with its group sum
}
v1a = c( 55: 59, 33: 37, 54: 56, 32: 34, 58: 60, 36: 38)
v1b = c(25555:25559, 20533:20537, 25554:25556, 20532:20534, 25558:25560, 20536:20538)
all.equal(as.integer(factor(v1a)), as.integer(factor(v1b)))
# [1] TRUE
v2 = 1:22
data1 <- data.table(take = v1a, value = v2) # 1st data - expected behavior
data2 <- data.table(take = v1b, value = v2) # 2nd data - unexpected behavior
res1 <- superpose(data1)
res2 <- superpose(data2)
cbind(res1, res2)
which(res1[, value] != res2[, value])
# [1] 8 11 15 16 19 20 21 22
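As a cross-check that does not depend on data.table, the expected group sums can be computed in base R. This is only a sketch of what the result should look like: ave() replaces each element with the result of FUN over its group, and the names expected1 and expected2 are introduced here for illustration.

```r
# Base R cross-check: ave() replaces each value with the sum of its group.
v1a <- c(55:59, 33:37, 54:56, 32:34, 58:60, 36:38)
v1b <- c(25555:25559, 20533:20537, 25554:25556, 20532:20534, 25558:25560, 20536:20538)
v2  <- 1:22

expected1 <- ave(v2, v1a, FUN = sum)  # group sums keyed by the short take values
expected2 <- ave(v2, v1b, FUN = sum)  # group sums keyed by the long take values

# The two groupings have the same structure, so the sums should agree.
identical(expected1, expected2)
```

Comparing res1[, value] and res2[, value] against these vectors shows which of the two data.table results deviates from the expected sums.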
Solution 1:
There was already an open issue on GitHub relating to this bug in data.table 1.14.3. It has now been fixed in the latest development version, which can be installed using:
update.dev.pkg()
This is a cautionary tale about why only the brave of heart should use development code, and why you should expect issues to arise if you do.
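Before and after updating, it can help to confirm which build is actually installed. A minimal sketch, assuming data.table may or may not be present on the machine; the variable name has_dt is introduced here for illustration.

```r
# Check whether data.table is installed without attaching it.
has_dt <- requireNamespace("data.table", quietly = TRUE)

if (has_dt) {
  # Development builds carry a different version number than the CRAN release.
  print(packageVersion("data.table"))
} else {
  message("data.table is not installed")
}

# To go back to the CRAN release instead of a development build:
# install.packages("data.table")
```

Re-running the example from the question after switching versions shows whether the mismatch between res1 and res2 is still present.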
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
