'Efficiently sum across multiple columns in R
I have the following condensed data set:
a<-as.data.frame(c(2000:2005))
a$Col1<-c(1:6)
a$Col2<-seq(2,12,2)
colnames(a)<-c("year","Col1","Col2")
for (i in 1:2){
a[[paste("Var_", i, sep="")]]<-i*a[[paste("Col", i, sep="")]]
}
I would like to sum the columns Var1 and Var2, which I use:
a$sum<-a$Var_1 + a$Var_2
In reality my data set is much larger - I would like to sum from Var_1 to Var_n (n can be upto 20). There must be a more efficient way to do this than:
a$sum<-a$Var_1 + ... + a$Var_n
Solution 1:[1]
Here's a solution using the tidyverse. You can extend it to as many columns as you like using the select() function to select the appropriate columns within a mutate().
library(tidyverse)
a<-as.data.frame(c(2000:2005))
a$Col1<-c(1:6)
a$Col2<-seq(2,12,2)
colnames(a)<-c("year","Col1","Col2")
for (i in 1:2){
a[[paste("Var_", i, sep="")]]<-i*a[[paste("Col", i, sep="")]]
}
a
#> year Col1 Col2 Var_1 Var_2
#> 1 2000 1 2 1 4
#> 2 2001 2 4 2 8
#> 3 2002 3 6 3 12
#> 4 2003 4 8 4 16
#> 5 2004 5 10 5 20
#> 6 2005 6 12 6 24
# Tidyverse solution
a %>%
mutate(Total = select(., Var_1:Var_2) %>% rowSums(na.rm = TRUE))
#> year Col1 Col2 Var_1 Var_2 Total
#> 1 2000 1 2 1 4 5
#> 2 2001 2 4 2 8 10
#> 3 2002 3 6 3 12 15
#> 4 2003 4 8 4 16 20
#> 5 2004 5 10 5 20 25
#> 6 2005 6 12 6 24 30
Created on 2019-01-01 by the reprex package (v0.2.1)
Solution 2:[2]
You can use colSums(a[,c("Var1", "Var2")]) or rowSums(a[,c("Var_1", "Var_2")]). In your case you want the latter.
Solution 3:[3]
with dplyr you can use
a %>%
rowwise() %>%
mutate(sum = sum(Col1,Col1, na.rm = T))
or more efficiently
a %>%
rowwise() %>%
mutate(sum = sum(across(starts_with("Col")), na.rm = T))
Solution 4:[4]
If you're working with a very large dataset, rowSums can be slow.
An alternative is the rowsums function from the Rfast package. This requires you to convert your data to a matrix in the process and use column indices rather than names. Here's an example based on your code:
## load Rfast
library(Rfast)
## create dataset
a <- as.data.frame(c(2000:2005))
a$Col1 <- c(1:6)
a$Col2 <- seq(2,12,2)
colnames(a) <- c("year","Col1","Col2")
for (i in 1:2){
a[[paste("Var_", i, sep="")]] <- i*a[[paste("Col", i, sep="")]]
}
## get column indices based on names
col_st <- grep("Var_1", colnames(a)) # index of "Var_1" col
col_en <- grep("Var_2", colnames(a)) # index of "Var_2" col
cols <- c(col_st:col_en) # indices of all cols from "Var_1" to "Var_2"
## sum rows 4 to 5
a$Total <- rowsums(as.matrix(a[,cols]))
Solution 5:[5]
You can use this:
library(dplyr)
a$Sum <- apply(a[,select(a, starts_with("Var_"))], 1, sum)
Solution 6:[6]
In Base R:
You could simply just use sapply:
sapply(unique(sub(".$", "", colnames(a))), function(x) rowSums(a[startsWith(colnames(a), x)]))
This is very reliable, it works for anything.
Solution 7:[7]
Benchmarking seems to show that plain Reduce('+', ...) is the fastest. Libraries just make it (at least slightly) slower, at least for mtcars, even if I expand it to be huge.
Unit: milliseconds
expr min lq mean median uq max
rowSums 8.672061 9.014344 13.708022 9.602312 10.672726 148.47183
Reduce 2.994240 3.157500 6.331503 3.223612 3.616555 99.49181
apply 524.488376 651.549401 771.095002 743.286441 857.993418 1235.53153
Rfast 5.649006 5.901787 11.110896 6.387990 9.727408 66.03151
DT_rowSums 9.209539 9.566574 20.955033 10.131163 12.967030 294.32911
DT_Reduce 3.590719 3.774761 10.595256 3.924592 4.259343 340.52855
tidy_rowSums 15.532917 15.997649 33.736883 17.316108 27.072343 343.21254
tidy_Reduce 8.627810 8.960008 12.271105 9.603124 11.089334 79.98853
Code:
library('data.table')
library('tidyverse')
library('Rfast')
DFcars = data.table::copy(mtcars)
DFcars = do.call("rbind", replicate(10000, DFcars, simplify = FALSE))
DT_cars = data.table::copy(DFcars)
DFcars2 = data.table::copy(DFcars)
setDT(DT_cars)
colnms = c("mpg", "cyl", "disp", "hp", "drat")
microbenchmark::microbenchmark(
rowSums =
{
DFcars$new_col = rowSums(DFcars[, colnms])
(as.numeric(DFcars$new_col))
},
Reduce =
{
DFcars$new_col = Reduce('+', DFcars[, colnms])
(as.numeric(DFcars$new_col))
},
apply =
{
DFcars$new_col = apply(DFcars[, 1:5], 1, sum)
(as.numeric(DFcars$new_col))
},
Rfast =
{
DFcars$new_col = rowsums(as.matrix(DFcars[, colnms]))
(as.numeric(DFcars$new_col))
},
DT_rowSums =
{
DT_cars[, new_col := rowSums(.SD), .SDcols = colnms]
(as.numeric(DT_cars$new_col))
},
DT_Reduce =
{
DT_cars[, new_col := Reduce('+', .SD), .SDcols = colnms]
(as.numeric(DT_cars$new_col))
},
tidy_rowSums =
{
DFcars2 = DFcars2 %>% mutate(new_col = select(., colnms) %>% rowSums())
(as.numeric(DFcars2$new_col))
},
tidy_Reduce =
{
DFcars2 = DFcars2 %>% mutate(new_col = select(., colnms) %>% Reduce('+', .))
(as.numeric(DFcars2$new_col))
},
check = 'equivalent'
)
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Matt Dancho |
| Solution 2 | psoares |
| Solution 3 | Jose Victor Zambrana |
| Solution 4 | Lino Ferreira |
| Solution 5 | Mohamed Rahouma |
| Solution 6 | U12-Forward |
| Solution 7 | gaspar |
