R sum row values based on column name
I have a dataset with over 10,000 columns and 10,000 rows. Within each row, I am trying to sum the values of columns that share the same base name.
The dataset looks something like this:
library(tibble)

data <- tibble(
  date = c('1/1/2018', '2/1/2018', '3/1/2018'),
  x1   = c(1, 11, 111),
  x2   = c(2, 22, 222),
  x1_1 = c(3, 333, 333),
  x2_1 = c(4, 44, 44),
  x1_2 = c(5, 55, 555),
  x2_2 = c(6, 66, 666)
)
I am trying to create a new table with the date column, an x1 column and an x2 column, where each value is the row-wise sum of the columns sharing that base name: x1 for row 1 = 1 + 3 + 5 = 9, x2 for row 2 = 22 + 44 + 66 = 132, etc.
Any help would be much appreciated.
Solution 1:[1]
Here's a for loop approach. I use stringr but we could just as easily use base regex functions to keep it dependency-free.
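library(stringr)

# Stems are the column names with any "_suffix" removed: "x1_2" -> "x1"
name_stems = unique(str_replace(names(data)[-1], "_.*", ""))
result = data[, "date", drop = FALSE]
for (stem in name_stems) {
  # Match the stem itself ("x1") and its suffixed variants ("x1_1"),
  # but not longer names such as "x10"
  is_match = str_detect(names(data), paste0("^", stem, "(_|$)"))
  result[[stem]] = rowSums(data[is_match])
}
result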
# # A tibble: 3 × 3
# date x1 x2
# <chr> <dbl> <dbl>
# 1 1/1/2018 9 12
# 2 2/1/2018 399 132
# 3 3/1/2018 999 932
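The answer above notes that base regex functions would work just as well. A minimal dependency-free sketch of that idea follows. It rebuilds the example data as a plain data.frame so it runs on its own, and the names value_cols, stems, and sums are illustrative rather than taken from the original answer. It assumes every column other than date is numeric and that the stem is everything before the first underscore.

```r
# Base R only; the example data is rebuilt as a plain data.frame
data <- data.frame(
  date = c("1/1/2018", "2/1/2018", "3/1/2018"),
  x1   = c(1, 11, 111),
  x2   = c(2, 22, 222),
  x1_1 = c(3, 333, 333),
  x2_1 = c(4, 44, 44),
  x1_2 = c(5, 55, 555),
  x2_2 = c(6, 66, 666)
)

# All columns except the date are treated as values to sum (assumption)
value_cols <- setdiff(names(data), "date")

# Drop everything from the first underscore onward: "x1_2" -> "x1"
stems <- sub("_.*$", "", value_cols)

# For each stem, sum the columns that share it, row by row
sums <- sapply(
  split(value_cols, stems),
  function(cols) rowSums(data[, cols, drop = FALSE])
)

result <- data.frame(date = data$date, sums)
result
```

Because the columns are grouped by the whole stem rather than by a prefix match, a column such as x10 would form its own group instead of being added to x1.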
Solution 2:[2]
Using data.table:
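library(data.table)
baseCols <- paste0('x', 1:2)
# Stack the columns matching each pattern into one long column per stem (setDT() converts data by reference)
result <- setDT(data) |> melt(measure.vars = patterns(baseCols), value.name = baseCols)
# Sum the stacked values within each date
result[, lapply(.SD, sum), by=.(date), .SDcols=baseCols]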
## date x1 x2
## 1: 1/1/2018 9 12
## 2: 2/1/2018 399 132
## 3: 3/1/2018 999 932
Solution 3:[3]
Your data is in wide format. One way to get the result is to pivot it to long format, group the rows by date and by base name (x1 or x2), sum each group, and then pivot back to wide format so that each base name becomes a column.
library(tidyverse)

data |>
  pivot_longer(cols = starts_with("x"), values_to = "x.values") |>
  # The first two characters of the column name identify the group ("x1" or "x2")
  mutate(xgroup = substr(name, 1, 2)) |>
  group_by(date, xgroup) |>
  summarise(xsums = sum(x.values), .groups = "drop") |>
  pivot_wider(values_from = xsums, names_from = xgroup)
# date x1 x2
# <chr> <dbl> <dbl>
#1 1/1/2018 9 12
#2 2/1/2018 399 132
#3 3/1/2018 999 932
Updates
To include only the columns x1 and x1_*, and to exclude any other column that merely starts with x1 (such as x10 or x100_1), use the regular expression "x1$|(x1_).*". A similar pattern selects only x2 and x2_*. For example:
s <- c("x100_1", "x10", "x1", "x1_1", "x1_2", "x2", "x2_1", "x2_2", "x20", "x20_1")
s
#[1] "x100_1" "x10" "x1" "x1_1" "x1_2" "x2" "x2_1" "x2_2" "x20"
#[10] "x20_1"
s |> str_extract("x1$|(x1_).*")
#[1] NA NA "x1" "x1_1" "x1_2" NA NA NA NA NA
s |> str_extract("x2$|(x2_).*")
#[1] NA NA NA NA NA "x2" "x2_1" "x2_2" NA NA
This pattern can then be used to create a group that consists of x1 and x1_ columns only and another group that consists of x2 and x2_ columns only.
Here is the full code:
data |>
  pivot_longer(cols = starts_with("x"), values_to = "x.values") |>
  mutate(xgroup = case_when(str_detect(name, "x1$|(x1_).*") ~ "x1",
                            str_detect(name, "x2$|(x2_).*") ~ "x2")) |>
  group_by(date, xgroup) |>
  summarise(xsums = sum(x.values)) |>
  pivot_wider(values_from = xsums, names_from = xgroup)
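With thousands of columns, writing one case_when() branch per base name may become impractical. As a hedged sketch, the group label can instead be derived from the column name by dropping everything from the first underscore onward. This assumes every name follows the stem or stem_suffix convention shown in the question; the variable name result is illustrative.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Example data from the question (tibble() is re-exported by dplyr)
data <- tibble(
  date = c("1/1/2018", "2/1/2018", "3/1/2018"),
  x1   = c(1, 11, 111),
  x2   = c(2, 22, 222),
  x1_1 = c(3, 333, 333),
  x2_1 = c(4, 44, 44),
  x1_2 = c(5, 55, 555),
  x2_2 = c(6, 66, 666)
)

result <- data |>
  pivot_longer(cols = -date, values_to = "x.values") |>
  # Derive the group from the name itself: "x1_2" -> "x1", "x1" -> "x1"
  mutate(xgroup = str_remove(name, "_.*$")) |>
  group_by(date, xgroup) |>
  summarise(xsums = sum(x.values), .groups = "drop") |>
  pivot_wider(names_from = xgroup, values_from = xsums)

result
```

Because the whole stem is compared, columns such as x10 or x100_1 would end up in their own groups (x10, x100) rather than being added to x1.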
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | |
| Solution 2 | jlhoward |
| Solution 3 | |
