Performance of Grouped Operations with data.table
I will start by describing the task that I am performing. I have to repeatedly calculate a grouped sum, typically between 5 and 10 times. The values in the column that I am performing the grouped sum on change with each iteration, but the columns that I am grouping by do not. Below is an example table, in which w, x, and y together constitute the grouping and z is the column whose values will be summed.
DT <- data.table(w = sample(1:10, size = 1000000, replace = TRUE),
                 x = sample(1:100, size = 1000000, replace = TRUE),
                 y = sample(1:1000, size = 1000000, replace = TRUE),
                 z = runif(n = 1000000, min = 0, max = 1))
setkey(DT, w, x, y)
My initial solution was, I think, the most obvious:
DT[, Group_Total := sum(z), keyby = list(w, x, y)]
microbenchmark(DT[, Group_Total := sum(z), keyby = list(w, x, y)])
Unit: milliseconds
min lq mean median uq max neval
447.8674 467.3366 481.1989 479.3057 490.5499 691 100
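For context, the overall workflow described above looks roughly like the sketch below; the per-iteration update of z (a fresh runif draw here) is only a placeholder for the real update, which is application-specific.
for (iter in 1:5) {
    DT[, z := runif(.N)]                                # placeholder for the real per-iteration update of z
    DT[, Group_Total := sum(z), keyby = list(w, x, y)]  # the grouped sum recomputed each iteration
}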
I found that this solution was not nearly as fast as I had hoped, especially since I had to perform this calculation several times. However, as far as I could tell, there was no method within data.table that would provide better performance than this. Recently, I began experimenting with Rcpp, and I wrote the function below to perform a grouped sum:
cppFunction('NumericVector Group_Total(NumericVector Grouping, NumericVector x) {
    int n = Grouping.size();
    NumericVector out(n);
    int curr_start_index = 0;   // first row of the current group
    double grp_total = x[0];    // running sum for the current group
    for(int i = 1; i < (n-1); ++i) {
        if(Grouping[i] == 1) {
            // a new group starts here: write the finished total to every row of the previous group
            for(int j = curr_start_index; j < i; ++j) {
                out[j] = grp_total;
            }
            curr_start_index = i;
            grp_total = x[i];
        } else {
            grp_total = grp_total + x[i];
        }
    }
    // handle the last element separately
    if(Grouping[(n-1)] == 1) {
        // the last row is a group of its own
        for(int j = curr_start_index; j < (n-1); ++j) {
            out[j] = grp_total;
        }
        out[(n-1)] = x[(n-1)];
    } else {
        grp_total = grp_total + x[(n-1)];
        for(int j = curr_start_index; j < n; ++j) {
            out[j] = grp_total;
        }
    }
    return out;
}')
To use this function, I calculate a within-group index column once the data is ordered:
DT[, Within_Group_Index := 1][, Within_Group_Index := cumsum(Within_Group_Index), keyby = list(w, x, y)]
Since I only need to calculate this column once, the time added is not significant. Once this column is created, the grouped sum can be calculated via:
DT[, Group_Total := Group_Total(Within_Group_Index, z)]
microbenchmark(DT[, Group_Total := Group_Total(Within_Group_Index, z)])
Unit: milliseconds
min lq mean median uq max neval
5.3286 6.9879 9.078267 7.4814 8.0171 161.6406 100
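As a quick sanity check (the temporary column Check is just for illustration), the two approaches can be compared directly:
DT[, Check := sum(z), keyby = list(w, x, y)]
all.equal(DT$Group_Total, DT$Check)   # should return TRUE if Group_Total() is correct
DT[, Check := NULL]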
This is faster by a significant margin, and now we come to my question: why? My understanding is that grouped operations are very efficient in data.table. Is there something that I can do to speed up the data.table grouped operation, or is this an example of a fundamental speed vs. flexibility tradeoff, i.e. the flexibility of data.table's grouped operations comes at the cost of speed when considering specific use cases?
Solution 1:[1]
I think your suspicion is correct. As to why, we could make an educated guess.

EDIT: the information below ignores what data.table does with GForce optimizations; the functions GForce supports probably avoid copies of data similar to your C++ code but, as Frank mentioned, := didn't try to detect GForce-optimizable expressions before 1.14.3, and addrs below is certainly not optimized by GForce.

The C++ code you wrote doesn't modify anything from its input data; it only has to allocate out. As you correctly said, data.table aims to be more flexible and must support any valid R code the users provide. That makes me think that, for grouped operations, it would have to allocate new R vectors for each subset of z according to the group indices, effectively copying some of the input multiple times.
I wanted to try to validate my assumption, so I wrote some C++ code to look at memory addresses:
set.seed(31L)
DT <- data.table(w = sample(1:3, size = 20, replace = TRUE),
                 x = sample(1:3, size = 20, replace = TRUE),
                 y = sample(1:3, size = 20, replace = TRUE),
                 z = runif(n = 20, min = 0, max = 1))
setkey(DT, w, x, y)
cppFunction(plugins = "cpp11", includes = "#include <sstream>", '
StringVector addrs(NumericVector z, IntegerVector indices, IntegerVector group_ids) {
    StringVector ans(indices.size());
    for (auto i = 0; i < indices.size(); ++i) {
        std::ostringstream oss;
        oss << "Group" << group_ids[i] << " " << &z[indices[i]];
        ans[i] = oss.str();
    }
    return ans;
}
')
idx <- DT[, .(indices = .I[1L] - 1L, group_ids = .GRP), keyby = list(w, x, y)]
DT[, list(addrs(z, idx$indices, idx$group_ids))]
# V1
# 1: Group1 0x55b7733361f8
# 2: Group2 0x55b773336200
# 3: Group3 0x55b773336210
# 4: Group4 0x55b773336220
# 5: Group5 0x55b773336228
# 6: Group6 0x55b773336230
# 7: Group7 0x55b773336238
# 8: Group8 0x55b773336248
# 9: Group9 0x55b773336250
# 10: Group10 0x55b773336258
# 11: Group11 0x55b773336260
# 12: Group12 0x55b773336270
# 13: Group13 0x55b773336280
# 14: Group14 0x55b773336288
# 15: Group15 0x55b773336290
As expected here, if we look at z as a whole, no copy takes place and the addresses of the different elements are close to each other, for example 0x55b773336200 = 0x55b7733361f8 + 0x8. You can execute the last line from the previous code multiple times and it will always show the same addresses.

What I partially didn't expect is this:
DT[, list(addrs(z, 0L, .GRP)), keyby = list(w, x, y)]
# w x y V1
# 1: 1 1 2 Group1 0x55b771b03440
# 2: 1 1 3 Group2 0x55b771b03440
# 3: 1 2 1 Group3 0x55b771b03440
# 4: 1 2 2 Group4 0x55b771b03440
# 5: 1 3 2 Group5 0x55b771b03440
# 6: 1 3 3 Group6 0x55b771b03440
# 7: 2 1 1 Group7 0x55b771b03440
# 8: 2 2 1 Group8 0x55b771b03440
# 9: 2 2 2 Group9 0x55b771b03440
# 10: 2 2 3 Group10 0x55b771b03440
# 11: 2 3 1 Group11 0x55b771b03440
# 12: 3 2 1 Group12 0x55b771b03440
# 13: 3 2 2 Group13 0x55b771b03440
# 14: 3 2 3 Group14 0x55b771b03440
# 15: 3 3 3 Group15 0x55b771b03440
On the one hand, the address in memory did change, so something was copied, and if you run this multiple times you will see different addresses each time. However, it seems like data.table somehow reuses a buffer with the same address, maybe allocating one array with the length of the biggest group-subset and copying the different group values to it? I wonder how they manage that. Or maybe my code is wrong ¯\_(ツ)_/¯
EDIT: I added the following as the first line of addrs (double escape because of R parsing):
Rcout << "input length = " << z.size() << "\\n";
and if I run the last DT[...] code from above, it does print different lengths even though the address is the same.
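For completeness, the modified addrs with that diagnostic line in place would look like this (identical to the function above apart from the added print):
cppFunction(plugins = "cpp11", includes = "#include <sstream>", '
StringVector addrs(NumericVector z, IntegerVector indices, IntegerVector group_ids) {
    Rcout << "input length = " << z.size() << "\\n";  // prints the length of the z subset passed in
    StringVector ans(indices.size());
    for (auto i = 0; i < indices.size(); ++i) {
        std::ostringstream oss;
        oss << "Group" << group_ids[i] << " " << &z[indices[i]];
        ans[i] = oss.str();
    }
    return ans;
}
')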
Solution 2:[2]
I found that the initial code was faster on dev branch version 1.14.3. From the NEWS:
- := is now optimized by group
I believe this applies wherever GForce works. See ?GForce. (Thanks for pointing this out @Alexis)
I saw a similar timing to yours, but after upgrading I see that the original code is pretty fast:
microbenchmark(
    direct   = DT[, Group_Total := sum(z), keyby = list(w, x, y)],
    indirect = DT[, idx := rowid(w, x, y)][, gt := Group_Total(idx, z)]
)
Unit: milliseconds
expr min lq mean median uq max neval
direct 42.92247 44.61097 48.77803 45.48076 46.93751 72.29237 100
indirect 30.49422 31.95343 34.69138 33.18616 35.32934 58.37454 100
Notes:
- rowid constructs the within-group index (a small check follows these notes).
- DT[, Group_Total := sum(z), keyby = list(w, x, y), verbose=TRUE] will display info on whether GForce optimization is being used.
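The first note can be checked directly; a small sketch (the comparison column names here are only illustrative) shows that rowid matches the cumsum-based index from the question when DT is keyed by w, x, y:
DT[, idx := rowid(w, x, y)]
DT[, Within_Group_Index := 1][, Within_Group_Index := cumsum(Within_Group_Index), keyby = list(w, x, y)]
all(DT$idx == DT$Within_Group_Index)   # should be TRUE when DT is keyed by w, x, y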
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | |
