Fastest way to count all-TRUE rows of a LogicalMatrix (R / C++ / Rcpp)
I need to count the number of rows in a LogicalMatrix that are all TRUE.
Because I need to do this 1–2500 million times on a relatively regular basis, speed really matters:
My current best:
The most efficient / fastest single-process way I've found to do this is the Rcpp "how many" function (hm2) below.
My limited profiling abilities show me that the vast majority of the time is spent in the if(r_ttl == xcols){ ... check. I can't think of a different algorithm that would be faster (I have tried breaking out of the loop as soon as a FALSE is found, and it is much slower).
Details that can be assumed:
- The matrix will always have fewer than 10 million rows.
- All the output matrices from upstream will have the same number of cols (for a given session/process/thread).
- There will never be more than 2326 cols per matrix.
minimal example:
m <- matrix(sample(c(TRUE, FALSE), 50000 * 10, replace = TRUE), ncol = 10L)
head(m)
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE TRUE FALSE
#> [2,] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE
#> [3,] FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
#> [4,] TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
#> [5,] TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE TRUE TRUE
#> [6,] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
// [[Rcpp::export]]
int hm(const LogicalMatrix& x) {
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  for (size_t row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for (size_t col = 0; col < xcols; col++) {
      r_ttl += x(row, col);
    }
    if (r_ttl == xcols) {
      n_all_true++;
    }
  }
  return n_all_true;
}
I don't understand why, but on my machine it is faster if I bake in the number of cols (if someone could explain why this is, that would be great too):
// [[Rcpp::export]]
int hm2(const LogicalMatrix& x) {
  const int xrows = x.nrow();
  // const int xcols = x.ncol();
  int n_all_true = 0;
  for (size_t row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for (size_t col = 0; col < 10; col++) {
      r_ttl += x(row, col);
    }
    if (r_ttl == 10) {
      n_all_true += 1;
    }
  }
  return n_all_true;
}
timing:
microbenchmark(hm(m), hm2(m), times = 1000)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> hm(m) 597.828 599.0995 683.3482 605.397 643.8655 1659.711 1000
#> hm2(m) 236.847 237.6565 267.8787 238.748 253.5280 683.221 1000
Solution 1:[1]
You can go another 30% faster with OpenMP (which, I now see, goes against the question's request for single-process solutions) and minimal code changes, at least on my 4-core Xeon. I have a feeling that a logical AND reduction may do better, but I will leave that for another day:
library(Rcpp)
library(microbenchmark)
m_rows <- 10L
m_cols <- 50000L
rebuild = FALSE
cppFunction('int hm(const LogicalMatrix& x) {
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  for (size_t row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for (size_t col = 0; col < xcols; col++) {
      r_ttl += x(row, col);
    }
    if (r_ttl == xcols) {
      n_all_true++;
    }
  }
  return n_all_true;
}', rebuild = rebuild)
hm3 <- function(m) {
nc <- ncol(m)
sum(rowSums(m) == nc)
}
cppFunction('int hm_jmu(const LogicalMatrix& x) {
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  for (int row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for (int col = 0; col < xcols; col++) {
      r_ttl += x(row, col);
    }
    if (r_ttl == xcols) {
      n_all_true++;
    }
  }
  return n_all_true;
}', rebuild = rebuild)
macroExpand <- function(NCOL) {
  paste0('int hm_npjc(const LogicalMatrix& x) {
    const int xrows = x.nrow();
    int n_all_true = 0;
    for (int row = 0; row < xrows; row++) {
      int r_ttl = 0;
      for (int col = 0; col < ', NCOL, '; col++) {
        r_ttl += x(row, col);
      }
      if (r_ttl == ', NCOL, ') {
        n_all_true++;
      }
    }
    return n_all_true;
  }')
}
macroExpand_omp <- function(NCOL) {
  paste0('int hm_npjc_omp(const LogicalMatrix& x) {
    const int xrows = x.nrow();
    int n_all_true = 0;
    #pragma omp parallel for reduction(+:n_all_true)
    for (int row = 0; row < xrows; row++) {
      int r_ttl = 0;
      for (int col = 0; col < ', NCOL, '; col++) {
        r_ttl += x(row, col);
      }
      if (r_ttl == ', NCOL, ') {
        n_all_true++;
      }
    }
    return n_all_true;
  }')
}
cppFunction(macroExpand(m_rows), rebuild = rebuild)
cppFunction(macroExpand_omp(m_rows), plugins = "openmp", rebuild = rebuild)
cppFunction('int hm_omp(const LogicalMatrix& x) {
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  #pragma omp parallel for reduction(+:n_all_true) schedule(static)
  for (size_t row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for (size_t col = 0; col < xcols; col++) {
      r_ttl += x(row, col);
    }
    if (r_ttl == xcols) {
      n_all_true++;
    }
  }
  return n_all_true;
}', plugins = "openmp", rebuild = rebuild)
# Also tried: using != as inner-loop control (no difference); pre-increment for
# n_all_true (no difference); static vs. dynamic OpenMP scheduling (no difference);
# directing clang and gcc to unroll loops (didn't seem to work).
set.seed(21)
m <- matrix(sample(c(TRUE, FALSE), m_cols * m_rows, replace = T), ncol = m_rows)
print(microbenchmark(hm(m), hm3(m), hm_jmu(m), hm_npjc(m),
hm_omp(m), hm_npjc_omp(m),
times = 1000))
I used GCC 4.9; Clang 3.7 gave similar results.
Giving:
Unit: microseconds
expr min lq mean median uq max neval
hm(m) 614.074 640.9840 643.24836 641.462 642.9920 976.694 1000
hm3(m) 2705.066 2768.3080 2948.39388 2775.992 2786.8625 43424.060 1000
hm_jmu(m) 591.179 612.3590 625.84484 612.881 613.8825 6874.428 1000
hm_npjc(m) 62.958 63.8965 64.89338 64.346 65.0550 144.487 1000
hm_omp(m) 91.892 92.6050 165.21507 93.758 98.8230 10026.583 1000
hm_npjc_omp(m) 43.129 43.6820 129.15842 44.458 47.0860 17636.875 1000
The OpenMP magic is just the inclusion of the -fopenmp at compile and link time (taken care of by Rcpp, plugin="openmp"), and
#pragma omp parallel for reduction(+:n_all_true) schedule(static)
In this case, the outer loop is parallelized, and the result is a sum, so the reduction clause tells the compiler to split the problem up and combine each part's sum into a single total. schedule(static) describes how the compiler and/or runtime will divide the loop iterations between threads. Here the width of both inner and outer loops is known, so static is preferred; if, say, the inner loop size varied a lot, then dynamic might balance the work better between threads.
It is possible to explicitly tell OpenMP how many loop iterations you would like per thread, but it is often better to let the compiler decide.
On a different note, I tried hard to use compiler flags such as -funroll-loops to replace the ugly-but-fast hard-coding of the inner loop width (which is not a generalized solution to the question). I tested these to no avail: see https://github.com/jackwasey/optimization-comparison
Solution 2:[2]
I was very curious why 'baking in' a value already declared const would make any difference, so I played around with the idea.
Previously:
library(Rcpp)
library(microbenchmark)
cppFunction('int hm(const LogicalMatrix& x) {
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  for (size_t row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for (size_t col = 0; col < xcols; col++) {
      r_ttl += x(row, col);
    }
    if (r_ttl == xcols) {
      n_all_true++;
    }
  }
  return n_all_true;
}')
hm3 <- function(m) {
nc <- ncol(m)
sum(rowSums(m) == nc)
}
cppFunction('int hm_jmu(const LogicalMatrix& x) {
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  for (int row = 0; row < xrows; row++) {
    int r_ttl = 0;
    for (int col = 0; col < xcols; col++) {
      r_ttl += x(row, col);
    }
    if (r_ttl == xcols) {
      n_all_true++;
    }
  }
  return n_all_true;
}')
Baking in the number of cols
I'm just taking Joshua's solution here, but generating the tailored function by code generation works well on my machine. This seems hacky to me, but I thought I would post it anyway:
macroExpand <- function(NCOL) {
  paste0('int hm_npjc(const LogicalMatrix& x) {
    const int xrows = x.nrow();
    int n_all_true = 0;
    for (int row = 0; row < xrows; row++) {
      int r_ttl = 0;
      for (int col = 0; col < ', NCOL, '; col++) {
        r_ttl += x(row, col);
      }
      if (r_ttl == ', NCOL, ') {
        n_all_true++;
      }
    }
    return n_all_true;
  }')
}
cppFunction(macroExpand(10L))
Results:
set.seed(21)
m <- matrix(sample(c(TRUE, FALSE), 50000 * 10, replace = TRUE), ncol = 10L)
microbenchmark(hm(m), hm3(m), hm_jmu(m), hm_npjc(m), times=1000)
#> Unit: microseconds
#> expr min lq mean median uq max
#> hm(m) 596.808 600.1870 722.5140 629.1750 709.3875 1680.379
#> hm3(m) 2189.164 2353.6700 2972.1463 2509.4630 2956.7675 49930.471
#> hm_jmu(m) 574.137 576.5160 678.6475 600.4775 665.2800 2240.988
#> hm_npjc(m) 81.978 83.1855 102.7646 89.2160 101.0400 380.884
#> neval
#> 1000
#> 1000
#> 1000
#> 1000
I'd like to note that I don't really understand why the compiler doesn't optimize to the same solution here; if anyone has insight on this, that would be awesome.
Provenance
devtools::session_info()
#> Session info --------------------------------------------------------------
#> setting value
#> version R version 3.2.2 (2015-08-14)
#> system x86_64, darwin13.4.0
#> ui RStudio (0.99.691)
#> language (EN)
#> collate en_CA.UTF-8
#> tz America/Los_Angeles
#> date 2015-09-27
#> Packages ------------------------------------------------------------------
#> package * version date source
#> clipr 0.1.1 2015-09-04 CRAN (R 3.2.0)
#> colorspace 1.2-6 2015-03-11 CRAN (R 3.2.0)
#> devtools 1.9.1 2015-09-11 CRAN (R 3.2.0)
#> digest 0.6.8 2014-12-31 CRAN (R 3.2.0)
#> evaluate 0.8 2015-09-18 CRAN (R 3.2.0)
#> formatR 1.2.1 2015-09-18 CRAN (R 3.2.0)
#> ggplot2 1.0.1 2015-03-17 CRAN (R 3.2.0)
#> gtable 0.1.2 2012-12-05 CRAN (R 3.2.0)
#> htmltools 0.2.6 2014-09-08 CRAN (R 3.2.0)
#> knitr 1.10.5 2015-05-06 CRAN (R 3.2.0)
#> magrittr 1.5 2014-11-22 CRAN (R 3.2.0)
#> MASS 7.3-43 2015-07-16 CRAN (R 3.2.2)
#> memoise 0.2.1 2014-04-22 CRAN (R 3.2.0)
#> microbenchmark * 1.4-2 2014-09-28 CRAN (R 3.2.0)
#> munsell 0.4.2 2013-07-11 CRAN (R 3.2.0)
#> plyr 1.8.3 2015-06-12 CRAN (R 3.2.0)
#> proto 0.3-10 2012-12-22 CRAN (R 3.2.0)
#> Rcpp * 0.12.1 2015-09-10 CRAN (R 3.2.0)
#> reprex 0.0.0.9001 2015-09-26 Github (jennybc/reprex@1d6584a)
#> reshape2 1.4.1 2014-12-06 CRAN (R 3.2.0)
#> rmarkdown 0.7 2015-06-13 CRAN (R 3.2.0)
#> rstudioapi 0.3.1 2015-04-07 CRAN (R 3.2.0)
#> scales 0.3.0 2015-08-25 CRAN (R 3.2.0)
#> stringi 0.5-5 2015-06-29 CRAN (R 3.2.0)
#> stringr 1.0.0 2015-04-30 CRAN (R 3.2.0)
Solution 3:[3]
How about exploiting the fact that TRUE is coerced to 1 by many numeric operators, so the whole computation is vectorized through functions that are already written in C? E.g.
set.seed(100)
m <- matrix(sample(c(TRUE, FALSE), 50000*10, replace = TRUE), ncol = 10L)
sum(rowSums(m) == ncol(m))
## [1] 47
microbenchmark::microbenchmark(sum(rowSums(m) == ncol(m)))
## Unit: milliseconds
## expr min lq mean median uq max neval
## sum(rowSums(m) == ncol(m)) 1.715399 1.840763 1.873422 1.861552 1.905841 2.02524 100
See R Inferno Chapter 3.
Edited answer with direct comparison:
(Here I pasted the two C++ functions into a file called test.cpp on my Desktop, with the usual Rcpp header info.)
require(Rcpp)
sourceCpp("~/Desktop/test.cpp")
set.seed(100)
m <- matrix(sample(c(TRUE, FALSE), 50000*10, replace = TRUE), ncol = 10L)
hm3 <- function(m) {
nc <- ncol(m)
sum(rowSums(m) == nc)
}
microbenchmark::microbenchmark(hm(m), hm2(m), hm3(m), times = 1000)
## Unit: milliseconds
## expr min lq mean median uq max neval
## hm(m) 4.996005 5.036732 5.169672 5.089707 5.194580 9.961581 1000
## hm2(m) 5.031222 5.074990 5.228239 5.128106 5.242909 10.109776 1000
## hm3(m) 1.626933 1.878014 2.205195 1.922608 2.014012 226.894190 1000
I'll note that the reference to the R Inferno is not really appropriate here, since it does not apply to C++, but it's still a mantra to live by. :-)
Solution 4:[4]
I stumbled across this while working on a similar problem. We can squeeze out a little more performance by initializing r_ttl from the first column and by eliminating the if(r_ttl == xcols) check:
# initialize on the first column
cppFunction('int hm2(const LogicalMatrix& x) {
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  for (int row = 0; row < xrows; row++) {
    int r_ttl = x(row, 0);
    for (int col = 1; col < xcols; col++) {
      r_ttl += x(row, col);
    }
    if (r_ttl == xcols) {
      n_all_true++;
    }
  }
  return n_all_true;
}')
# use *= to eliminate the if statement
cppFunction('int hm3(const LogicalMatrix& x) {
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  for (int row = 0; row < xrows; row++) {
    int r_ttl = 1;
    for (int col = 0; col < xcols; col++) {
      r_ttl *= x(row, col);
    }
    n_all_true += r_ttl;
  }
  return n_all_true;
}')
# both modifications
cppFunction('int hm4(const LogicalMatrix& x) {
  const int xrows = x.nrow();
  const int xcols = x.ncol();
  int n_all_true = 0;
  for (int row = 0; row < xrows; row++) {
    int r_ttl = x(row, 0);
    for (int col = 1; col < xcols; col++) {
      r_ttl *= x(row, col);
    }
    n_all_true += r_ttl;
  }
  return n_all_true;
}')
m <- matrix(sample(c(TRUE, FALSE), 50000 * 10, replace = TRUE), ncol = 10L)
microbenchmark::microbenchmark(hm_jmu = hm_jmu(m),
hm2 = hm2(m),
hm3 = hm3(m),
hm4 = hm4(m),
check = "equal",
times = 1e4)
# Unit: microseconds
# expr min lq mean median uq max neval
# hm_jmu 198.3 200.8 218.5362 208.3 212.7 5855.9 10000
# hm2 169.3 170.9 184.8722 171.4 180.0 5775.4 10000
# hm3 192.7 196.1 209.7465 196.8 206.3 1056.2 10000
# hm4 161.9 163.1 176.2370 163.5 171.9 1119.2 10000
That's about a 20% improvement over Joshua Ulrich's already much-improved hm_jmu.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | npjc |
| Solution 3 | |
| Solution 4 | jblood94 |
