Fastest way to count all-TRUE rows of a LogicalMatrix (R / C++ / Rcpp)
I need to count the number of rows in a LogicalMatrix that are all TRUE.
Because I need to do this 1–2500 million times on a relatively regular basis, speed really matters:
My current best:
The most efficient / fastest single-process way I've found to do this is the Rcpp "how many" function (hm2) below.
My limited profiling abilities show me that the vast majority of the time is spent in the if(r_ttl == xcols){ ... check. I can't think of a different algorithm that would be faster (I have tried breaking out of the loop as soon as a FALSE is found, and it is much slower).
Details that can be assumed:
- The matrix will always have fewer than 10 million rows.
- All the output matrices from upstream will have the same number of cols (for a given session/process/thread).
- There will never be more than 2326 cols per matrix.
minimal example:
m <- matrix(sample(c(TRUE, FALSE), 50000 * 10, replace = TRUE), ncol = 10L)
head(m)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
#> [2,] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
#> [3,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
#> [4,] TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
#> [5,] TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE
#> [6,] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
// [[Rcpp::export]]
int hm(const LogicalMatrix& x) {
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  for (size_t row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for (size_t col = 0; col < xcols; col++) {
      r_ttl += x(row, col);
    }
    if (r_ttl == xcols) {
      n_all_true++;
    }
  }
  return n_all_true;
}
I don't understand why, but on my machine it is faster if I bake in the number of cols (if someone could explain why this is, that would be great too):
// [[Rcpp::export]]
int hm2(const LogicalMatrix& x) {
  const int xrows = x.nrow();
  // const int xcols = x.ncol();
  int n_all_true = 0;
  for (size_t row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for (size_t col = 0; col < 10; col++) {
      r_ttl += x(row, col);
    }
    if (r_ttl == 10) {
      n_all_true += 1;
    }
  }
  return n_all_true;
}
timing:
microbenchmark(hm(m), hm2(m), times = 1000)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> hm(m) 597.828 599.0995 683.3482 605.397 643.8655 1659.711 1000
#> hm2(m) 236.847 237.6565 267.8787 238.748 253.5280 683.221 1000
Solution 1:[1]
You can go another 30% faster with OpenMP (which, I now see, goes against the question's request for single-process solutions) and minimal code changes, at least on my 4-core Xeon. I have a feeling that a logical AND reduction may do better, but I will leave that for another day:
library(Rcpp)
library(microbenchmark)
m_rows <- 10L
m_cols <- 50000L
rebuild = FALSE
cppFunction('int hm(const LogicalMatrix& x) {
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  for (size_t row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for (size_t col = 0; col < xcols; col++) {
      r_ttl += x(row, col);
    }
    if (r_ttl == xcols) {
      n_all_true++;
    }
  }
  return n_all_true;
}', rebuild = rebuild)
hm3 <- function(m) {
nc <- ncol(m)
sum(rowSums(m) == nc)
}
cppFunction('int hm_jmu(const LogicalMatrix& x) {
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  for (int row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for (int col = 0; col < xcols; col++) {
      r_ttl += x(row, col);
    }
    if (r_ttl == xcols) {
      n_all_true++;
    }
  }
  return n_all_true;
}', rebuild = rebuild)
macroExpand <- function(NCOL) {
  paste0('int hm_npjc(const LogicalMatrix& x) {
    const int xrows = x.nrow();
    int n_all_true = 0;
    for (int row = 0; row < xrows; row++) {
      int r_ttl = 0;
      for (int col = 0; col < ', NCOL, '; col++) {
        r_ttl += x(row, col);
      }
      if (r_ttl == ', NCOL, ') {
        n_all_true++;
      }
    }
    return n_all_true;
  }')
}
macroExpand_omp <- function(NCOL) {
  paste0('int hm_npjc_omp(const LogicalMatrix& x) {
    const int xrows = x.nrow();
    int n_all_true = 0;
    #pragma omp parallel for reduction(+:n_all_true)
    for (int row = 0; row < xrows; row++) {
      int r_ttl = 0;
      for (int col = 0; col < ', NCOL, '; col++) {
        r_ttl += x(row, col);
      }
      if (r_ttl == ', NCOL, ') {
        n_all_true++;
      }
    }
    return n_all_true;
  }')
}
cppFunction(macroExpand(m_rows), rebuild = rebuild)
cppFunction(macroExpand_omp(m_rows), plugins = "openmp", rebuild = rebuild)
cppFunction('int hm_omp(const LogicalMatrix& x) {
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  #pragma omp parallel for reduction(+:n_all_true) schedule(static)
  for (size_t row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for (size_t col = 0; col < xcols; col++) {
      r_ttl += x(row, col);
    }
    if (r_ttl == xcols) {
      n_all_true++;
    }
  }
  return n_all_true;
}', plugins = "openmp", rebuild = rebuild)
# Also tried: using != as inner-loop control (no difference); pre-increment for
# n_all_true (no difference); static vs. dynamic OpenMP scheduling (no difference);
# directing clang and gcc to unroll loops (didn't seem to work).
set.seed(21)
m <- matrix(sample(c(TRUE, FALSE), m_cols * m_rows, replace = T), ncol = m_rows)
print(microbenchmark(hm(m), hm3(m), hm_jmu(m), hm_npjc(m),
hm_omp(m), hm_npjc_omp(m),
times = 1000))
I used GCC 4.9; Clang 3.7 gave similar results.
Giving:
Unit: microseconds
expr min lq mean median uq max neval
hm(m) 614.074 640.9840 643.24836 641.462 642.9920 976.694 1000
hm3(m) 2705.066 2768.3080 2948.39388 2775.992 2786.8625 43424.060 1000
hm_jmu(m) 591.179 612.3590 625.84484 612.881 613.8825 6874.428 1000
hm_npjc(m) 62.958 63.8965 64.89338 64.346 65.0550 144.487 1000
hm_omp(m) 91.892 92.6050 165.21507 93.758 98.8230 10026.583 1000
hm_npjc_omp(m) 43.129 43.6820 129.15842 44.458 47.0860 17636.875 1000
The OpenMP magic is just the inclusion of the -fopenmp at compile and link time (taken care of by Rcpp, plugin="openmp"), and
#pragma omp parallel for reduction(+:n_all_true) schedule(static)
In this case, the outer loop is parallelized, and the result is a sum, so the reduction clause tells the compiler to split the problem up and combine each part's sum into a single total. schedule(static) describes how the compiler and/or runtime will divide the loop iterations between threads. Here the width of both inner and outer loops is known, so static is preferred; if, say, the inner loop size varied a lot, then dynamic might balance the work better between threads.
It is possible to explicitly tell OpenMP how many loop iterations you would like per thread, but it is often better to let the compiler decide.
On a different note, I tried hard to use compiler flags such as -funroll-loops to replace the ugly-but-fast hard-coding of the inner loop width (which is not a generalized solution to the question). I tested these to no avail: see https://github.com/jackwasey/optimization-comparison
Solution 2:[2]
I was very curious why 'baking in' a value already declared const would make any difference, so I played around with the idea.
Previously:
library(Rcpp)
library(microbenchmark)
cppFunction('int hm(const LogicalMatrix& x) {
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  for (size_t row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for (size_t col = 0; col < xcols; col++) {
      r_ttl += x(row, col);
    }
    if (r_ttl == xcols) {
      n_all_true++;
    }
  }
  return n_all_true;
}')
hm3 <- function(m) {
nc <- ncol(m)
sum(rowSums(m) == nc)
}
cppFunction('int hm_jmu(const LogicalMatrix& x) {
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  for (int row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for (int col = 0; col < xcols; col++) {
      r_ttl += x(row, col);
    }
    if (r_ttl == xcols) {
      n_all_true++;
    }
  }
  return n_all_true;
}')
Baking in the number of cols
I'm just taking Joshua's solution here, but generating the tailored function by code generation works well on my machine. This seems hacky to me, but I thought I would post it anyway:
macroExpand <- function(NCOL) {
  paste0('int hm_npjc(const LogicalMatrix& x) {
    const int xrows = x.nrow();
    int n_all_true = 0;
    for (int row = 0; row < xrows; row++) {
      int r_ttl = 0;
      for (int col = 0; col < ', NCOL, '; col++) {
        r_ttl += x(row, col);
      }
      if (r_ttl == ', NCOL, ') {
        n_all_true++;
      }
    }
    return n_all_true;
  }')
}
cppFunction(macroExpand(10L))
Results:
set.seed(21)
m <- matrix(sample(c(TRUE, FALSE), 50000 * 10, replace = TRUE), ncol = 10L)
microbenchmark(hm(m), hm3(m), hm_jmu(m), hm_npjc(m), times=1000)
#> Unit: microseconds
#> expr min lq mean median uq max
#> hm(m) 596.808 600.1870 722.5140 629.1750 709.3875 1680.379
#> hm3(m) 2189.164 2353.6700 2972.1463 2509.4630 2956.7675 49930.471
#> hm_jmu(m) 574.137 576.5160 678.6475 600.4775 665.2800 2240.988
#> hm_npjc(m) 81.978 83.1855 102.7646 89.2160 101.0400 380.884
#> neval
#> 1000
#> 1000
#> 1000
#> 1000
I'd like to note that I don't really understand why the compiler doesn't optimize to the same solution here; if anyone has insight on this, that would be awesome.
Provenance
devtools::session_info()
#> Session info --------------------------------------------------------------
#> setting value
#> version R version 3.2.2 (2015-08-14)
#> system x86_64, darwin13.4.0
#> ui RStudio (0.99.691)
#> language (EN)
#> collate en_CA.UTF-8
#> tz America/Los_Angeles
#> date 2015-09-27
#> Packages ------------------------------------------------------------------
#> package * version date source
#> clipr 0.1.1 2015-09-04 CRAN (R 3.2.0)
#> colorspace 1.2-6 2015-03-11 CRAN (R 3.2.0)
#> devtools 1.9.1 2015-09-11 CRAN (R 3.2.0)
#> digest 0.6.8 2014-12-31 CRAN (R 3.2.0)
#> evaluate 0.8 2015-09-18 CRAN (R 3.2.0)
#> formatR 1.2.1 2015-09-18 CRAN (R 3.2.0)
#> ggplot2 1.0.1 2015-03-17 CRAN (R 3.2.0)
#> gtable 0.1.2 2012-12-05 CRAN (R 3.2.0)
#> htmltools 0.2.6 2014-09-08 CRAN (R 3.2.0)
#> knitr 1.10.5 2015-05-06 CRAN (R 3.2.0)
#> magrittr 1.5 2014-11-22 CRAN (R 3.2.0)
#> MASS 7.3-43 2015-07-16 CRAN (R 3.2.2)
#> memoise 0.2.1 2014-04-22 CRAN (R 3.2.0)
#> microbenchmark * 1.4-2 2014-09-28 CRAN (R 3.2.0)
#> munsell 0.4.2 2013-07-11 CRAN (R 3.2.0)
#> plyr 1.8.3 2015-06-12 CRAN (R 3.2.0)
#> proto 0.3-10 2012-12-22 CRAN (R 3.2.0)
#> Rcpp * 0.12.1 2015-09-10 CRAN (R 3.2.0)
#> reprex 0.0.0.9001 2015-09-26 Github (jennybc/reprex@1d6584a)
#> reshape2 1.4.1 2014-12-06 CRAN (R 3.2.0)
#> rmarkdown 0.7 2015-06-13 CRAN (R 3.2.0)
#> rstudioapi 0.3.1 2015-04-07 CRAN (R 3.2.0)
#> scales 0.3.0 2015-08-25 CRAN (R 3.2.0)
#> stringi 0.5-5 2015-06-29 CRAN (R 3.2.0)
#> stringr 1.0.0 2015-04-30 CRAN (R 3.2.0)
Solution 3:[3]
How about exploiting the fact that TRUE is coerced to 1 by many numeric operators, so the whole computation is vectorized through functions that are already written in C? E.g.
set.seed(100)
m <- matrix(sample(c(TRUE, FALSE), 50000*10, replace = TRUE), ncol = 10L)
sum(rowSums(m) == ncol(m))
## [1] 47
microbenchmark::microbenchmark(sum(rowSums(m) == ncol(m)))
## Unit: milliseconds
## expr min lq mean median uq max neval
## sum(rowSums(m) == ncol(m)) 1.715399 1.840763 1.873422 1.861552 1.905841 2.02524 100
See R Inferno Chapter 3.
Edited answer with direct comparison:
(Here I pasted the two C++ functions into a file called test.cpp on my Desktop, with the usual Rcpp header info.)
require(Rcpp)
sourceCpp("~/Desktop/test.cpp")
set.seed(100)
m <- matrix(sample(c(TRUE, FALSE), 50000*10, replace = TRUE), ncol = 10L)
hm3 <- function(m) {
nc <- ncol(m)
sum(rowSums(m) == nc)
}
microbenchmark::microbenchmark(hm(m), hm2(m), hm3(m), times = 1000)
## Unit: milliseconds
## expr min lq mean median uq max neval
## hm(m) 4.996005 5.036732 5.169672 5.089707 5.194580 9.961581 1000
## hm2(m) 5.031222 5.074990 5.228239 5.128106 5.242909 10.109776 1000
## hm3(m) 1.626933 1.878014 2.205195 1.922608 2.014012 226.894190 1000
I'll note that the reference to the R Inferno is not really appropriate here, since it does not apply to C++, but it's still a mantra to live by. :-)
Solution 4:[4]
I stumbled across this while working on a similar problem. We can squeeze out a little more performance by initializing r_ttl from the first column and by eliminating the if(r_ttl == xcols) check:
# initialize on the first column
cppFunction('int hm2(const LogicalMatrix& x) {
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  for (int row = 0; row < xrows; row++) {
    int r_ttl = x(row, 0);
    for (int col = 1; col < xcols; col++) {
      r_ttl += x(row, col);
    }
    if (r_ttl == xcols) {
      n_all_true++;
    }
  }
  return n_all_true;
}')
# use *= to eliminate the if statement
cppFunction('int hm3(const LogicalMatrix& x) {
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  for (int row = 0; row < xrows; row++) {
    int r_ttl = 1;
    for (int col = 0; col < xcols; col++) {
      r_ttl *= x(row, col);
    }
    n_all_true += r_ttl;
  }
  return n_all_true;
}')
# both modifications
cppFunction('int hm4(const LogicalMatrix& x) {
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  for (int row = 0; row < xrows; row++) {
    int r_ttl = x(row, 0);
    for (int col = 1; col < xcols; col++) {
      r_ttl *= x(row, col);
    }
    n_all_true += r_ttl;
  }
  return n_all_true;
}')
m <- matrix(sample(c(TRUE, FALSE), 50000 * 10, replace = TRUE), ncol = 10L)
microbenchmark::microbenchmark(hm_jmu = hm_jmu(m),
hm2 = hm2(m),
hm3 = hm3(m),
hm4 = hm4(m),
check = "equal",
times = 1e4)
# Unit: microseconds
# expr min lq mean median uq max neval
# hm_jmu 198.3 200.8 218.5362 208.3 212.7 5855.9 10000
# hm2 169.3 170.9 184.8722 171.4 180.0 5775.4 10000
# hm3 192.7 196.1 209.7465 196.8 206.3 1056.2 10000
# hm4 161.9 163.1 176.2370 163.5 171.9 1119.2 10000
That's about a 20% improvement over Joshua Ulrich's already much-improved hm_jmu.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | npjc |
| Solution 3 | |
| Solution 4 | jblood94 |
