'R sum row values based on column name

I have a dataset with over 10,000 columns and 10,000 rows. I am trying to add values of rows based on column names.

The dataset looks something like this

data <- tibble(date = c('1/1/2018','2/1/2018','3/1/2018'),
              x1 = c(1, 11, 111),
              x2 = c(2, 22, 222),
              x1_1 = c(3, 333, 333),
              x2_1 = c(4, 44, 44),
              x1_2 = c(5, 55, 555),
              x2_2 = c(6, 66, 666),)

I am trying to create a new table which includes the date column, an x1 column and an x2 column where the value of x1 for row 1 = 1+3+5, value of x2 for row 2 = 22+44+66, etc.

Any help would be much appreciated.

r


Solution 1:[1]

Here's a for loop approach. I use stringr but we could just as easily use base regex functions to keep it dependency-free.

library(stringr)
name_stems = unique(str_replace(names(data)[-1], "_.*", ""))
result = data[, "date", drop = FALSE]
for(i in seq_along(name_stems)) {
  result[[name_stems[i]]] = 
    rowSums(data[
      str_detect(
        names(data),
        pattern = paste0(name_stems[i], "_")
      )
    ])
}

result
# # A tibble: 3 × 3
#   date        x1    x2
#   <chr>    <dbl> <dbl>
# 1 1/1/2018     9    12
# 2 2/1/2018   399   132
# 3 3/1/2018   999   932

Solution 2:[2]

Using data.table:

baseCols <- paste0('x', 1:2)
result <- setDT(data) |> melt(measure.vars = patterns(baseCols), value.name = baseCols)
result[, lapply(.SD, sum), by=.(date), .SDcols=baseCols]
##        date  x1  x2
## 1: 1/1/2018   9  12
## 2: 2/1/2018 399 132
## 3: 3/1/2018 999 932

Solution 3:[3]

Your data is in the wide format. One way of achieving your goal is transforming the data into the long format, then grouping them based on indices (x1 and x2), compute the sums for each group for each date, and finally transform the results back to the wide formats to create columns based on the indices.

library(tidyverse)

data |> 
    pivot_longer(cols = starts_with("x"), values_to = "x.values") |>
    mutate(xgroup = substr(name, 1,2)) |> 
    group_by(date,xgroup) |>
    summarise(xsums = sum(x.values)) |> 
    pivot_wider(values_from = xsums, names_from = xgroup )

#  date        x1    x2
#  <chr>    <dbl> <dbl>
#1 1/1/2018     9    12
#2 2/1/2018   399   132
#3 3/1/2018   999   932

Updates

In order to include only columns x1 and x1_, and exclude any other column even though it starts with x1, the following regular expression pattern can be used : "x1$|(x1_).*". The similar pattern can be used to include only columns x2 and x2_. For example:

s <- c("x100_1", "x10", "x1", "x1_1", "x1_2", "x2", "x2_1", "x2_2", "x20", "x20_1")
s
#[1] "x100_1" "x10"    "x1"     "x1_1"   "x1_2"   "x2"     "x2_1"   "x2_2"   "x20"   
#[10] "x20_1" 

s |> str_extract("x1$|(x1_).*")
#[1] NA     NA     "x1"   "x1_1" "x1_2" NA     NA     NA     NA     NA

s |> str_extract("x2$|(x2_).*")
#[1] NA     NA     NA     NA     NA     "x2"   "x2_1" "x2_2" NA     NA   

This pattern can then be used to create a group that consists of x1 and x1_ columns only and another group that consists of x2 and x2_ columns only.

Here is the full code:

data |> 
    pivot_longer(cols = starts_with("x"), values_to = "x.values") |>
    mutate(xgroup = case_when(str_detect(name, "x1$|(x1_).*")~"x1",
                              str_detect(name, "x2$|(x2_).*")~"x2")) |>
    group_by(date,xgroup) |>
    summarise(xsums = sum(x.values)) |> 
    pivot_wider(values_from = xsums, names_from = xgroup )

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1
Solution 2 jlhoward
Solution 3