Why does converting to `list` improve the performance of `lapply`?

I am surprised to see that the first line below runs much slower than the second one, which is suspiciously close in performance to the vectorized version. If processing a list is so much faster than processing a numeric(n) vector, why doesn't R convert its input to a list automatically?

> system.time(lapply(1:10^7, sqrt))
   user  system elapsed
  4.445   0.204   4.692
> system.time(lapply(list(1:10^7), sqrt))
   user  system elapsed
  0.048   0.015   0.062
> system.time(sqrt(1:10^7))
   user  system elapsed
   0.04    0.00    0.04

Here is the version information:

$ R --version
R version 4.1.3 (2022-03-10) -- "One Push-Up"
Copyright (C) 2022 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin21.4.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
https://www.gnu.org/licenses/.

$ sw_vers
ProductName:    macOS
ProductVersion: 12.3.1
BuildVersion:   21E258


Solution 1:[1]

The reason is that the second expression is just a list of length 1:

> length(list(1:10^7))
[1] 1

which is basically the same as applying sqrt directly to the whole vector. If we instead want to loop over each element as a separate list element, we need as.list instead of list, i.e.

> length(as.list(1:10^7))
[1] 10000000

Converting a vector to a list is unnecessary if the intention is to loop over each element of the vector. In a vector, each element is a unit (the same holds for a matrix, which is just a vector with a dim attribute), but in a data.frame/tibble/data.table, each unit is a column. Thus, lapply loops over each column of a data.frame, but over each single element of a vector. When we wrap a vector with list, the whole vector is encapsulated as a single list element:

> list(1:3)
[[1]]
[1] 1 2 3

> as.list(1:3)
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

As sqrt is a vectorized function, when we apply it by looping over the first list, it loops only once (a single vectorized call on the whole vector), but over the second list, it loops once per element.
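To make the difference in the number of iterations visible, here is a minimal sketch that counts how often lapply calls its function for each input shape. The helper count_calls and the data frame df are hypothetical names introduced only for this illustration; they are not part of the original question.

```r
# Count how many times lapply() invokes the function for a given input.
count_calls <- function(x) {
  n_calls <- 0L
  lapply(x, function(el) {
    n_calls <<- n_calls + 1L  # update the counter in the enclosing function
    sqrt(el)
  })
  n_calls
}

x  <- 1:5
df <- data.frame(a = 1:5, b = 6:10)  # hypothetical two-column data frame

calls_vector  <- count_calls(x)           # one call per element
calls_list    <- count_calls(list(x))     # one call for the whole vector
calls_as_list <- count_calls(as.list(x))  # one call per element
calls_df      <- count_calls(df)          # one call per column

print(c(vector = calls_vector, list = calls_list,
        as_list = calls_as_list, data_frame = calls_df))
```

With list(x) the function runs once on the whole vector, which is why that timing looks like the vectorized call; the other vector-based inputs run it once per element, and the data frame once per column.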


Thus, looping over as.list(1:10^7) gives timings similar to looping over the vector itself (the extra time is spent converting the vector to a list with as.list):

>  system.time(lapply(1:10^7, sqrt))
   user  system elapsed 
  4.364   0.220   4.748 
> system.time(lapply(as.list(1:10^7), sqrt))
   user  system elapsed 
  4.882   0.367   5.518 

A faster option would be to use vapply (useful when we have to apply a non-vectorized function in a loop):

> system.time(vapply(1:10^7, sqrt, numeric(1)))
   user  system elapsed 
  2.464   0.172   2.633 
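As a rough sketch of where vapply fits, the example below uses a function that cannot be called on the whole vector at once. The function safe_sqrt and the input values are hypothetical and chosen only for illustration. The third argument, numeric(1), is the template each result must match, so vapply returns a plain numeric vector where lapply returns a list.

```r
# Hypothetical non-vectorized function: if () expects a single condition,
# so it has to be applied element by element.
safe_sqrt <- function(v) {
  if (v < 0) NA_real_ else sqrt(v)
}

x <- c(4, -1, 9)

res_list <- lapply(x, safe_sqrt)              # list of length 3
res_vec  <- vapply(x, safe_sqrt, numeric(1))  # numeric vector of length 3

str(res_list)
str(res_vec)
```

If any call returned something other than a single numeric value, vapply would stop with an error instead of silently returning a differently shaped result.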

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1