'Mean columns with almost the same name
I have a data frame containing only one row with named columns. The data frame looks somewhat like this:
poms_tat1 poms_tat2 poms_tat3 tens1 tens2 tens3 ...
1 0.3708821 0.4915922 0.3958195 -0.1139606 -0.1462545 -0.4411494 ...
I need to calculate the mean of all the columns with similar names. The result should look somewhat like this:
poms_tat tens ...
1 0.4194551 -0.2337881667 ...
My first approach was to use a for loop and a nested while loop to find the indices of the relevant columns and then mean those, but unfortunately I couldn't make it work.
I also found this stackoverflow post which seemed promising but the agrep function seems to match columns in my data frame that should not be matched. I wasn't able to fix that using the max.distance parameter. For example it matches "threat1-3" with "reat1-3". I know those variable names are terrible, but unfortunately that's what I have to work with. What makes this even more complicated is that the number of columns in each category isn't always 3.
I hope I was able to articulate my problem well enough. Thank you.
Edit: Here is a reproducible piece of data:
structure(list(poms_tat1 = 0.370882118644872, poms_tat2 = 0.491592168116328,
poms_tat3 = 0.395819547420188, tens1 = -0.113960576459638,
tens2 = -0.146254484825426, tens3 = -0.44114940169153, bat_ratio1 = 1,
isi1 = 0.0944068640061701, isi2 = 0.597785124823513, isi3 = 0.676617801589949,
isi4 = 0.143940321201716, sleepqual = 0.378902118888194,
se1 = 0.393610946830482, se2 = 0.0991899501072693, se3 = 0.501745206004254,
challenge1 = 0.417855447018672, challenge2 = 0.393610946830482,
challenge3 = 0.417855447018672, threat1 = -0.13014390184863,
threat2 = -0.34027852368936, threat3 = -0.269679944985297,
reat1 = 0.565825152115738, reat2 = 0.571605347479646, reat3 = 0.497468338163091,
reat4 = 0.484881137876427, reat5 = 0.494727444918154, selfman1 = 0.389249472080761,
selfman2 = 0.40609787800914, selfman3 = 0.418121005003545,
selfman4 = 0.467099366496914, selfman5 = 0.205356548067582,
selfman6 = 0.464385939554693, selfman7 = 0.379071252751718,
eli1 = 0.250872603002127, eli2 = 0, eli3 = 0.265908011739155), row.names = 1L, class = "data.frame")
Solution 1:[1]
You could either do this by tidyr::pivot_longer
, dplyr::mutate
, stringr::str_remove
, dplyr::group_by
, and dplyr::summarise
.
This would be done like this:
ex_data <- structure(list(poms_tat1 = 0.370882118644872, poms_tat2 = 0.491592168116328,
poms_tat3 = 0.395819547420188, tens1 = -0.113960576459638,
tens2 = -0.146254484825426, tens3 = -0.44114940169153, bat_ratio1 = 1,
isi1 = 0.0944068640061701, isi2 = 0.597785124823513, isi3 = 0.676617801589949,
isi4 = 0.143940321201716, sleepqual = 0.378902118888194,
se1 = 0.393610946830482, se2 = 0.0991899501072693, se3 = 0.501745206004254,
challenge1 = 0.417855447018672, challenge2 = 0.393610946830482,
challenge3 = 0.417855447018672, threat1 = -0.13014390184863,
threat2 = -0.34027852368936, threat3 = -0.269679944985297,
reat1 = 0.565825152115738, reat2 = 0.571605347479646, reat3 = 0.497468338163091,
reat4 = 0.484881137876427, reat5 = 0.494727444918154, selfman1 = 0.389249472080761,
selfman2 = 0.40609787800914, selfman3 = 0.418121005003545,
selfman4 = 0.467099366496914, selfman5 = 0.205356548067582,
selfman6 = 0.464385939554693, selfman7 = 0.379071252751718,
eli1 = 0.250872603002127, eli2 = 0, eli3 = 0.265908011739155), row.names = 1L, class = "data.frame")
ex_data %>%
tidyr::pivot_longer(everything()) %>%
dplyr::mutate(
name = stringr::str_remove(name, '[0-9]$')
) %>%
dplyr::group_by(name) %>%
dplyr::summarise(
mean = mean(value)
)
# A tibble: 11 x 2
name mean
<chr> <dbl>
1 bat_ratio 1
2 challenge 0.410
3 eli 0.172
4 isi 0.378
5 poms_tat 0.419
6 reat 0.523
7 se 0.332
8 selfman 0.390
9 sleepqual 0.379
10 tens -0.234
11 threat -0.247
Alternatively, you could use split.default
, together with stringr::str_remove
, purrr::map
, unlist
, purrr::map_dbl
, and tibble::enframe
as follows:
ex_data %>%
split.default(stringr::str_remove(names(.), '[0-9]$')) %>%
purrr::map(unlist) %>%
purrr::map_dbl(mean) %>%
tibble::enframe()
# A tibble: 11 x 2
name value
<chr> <dbl>
1 bat_ratio 1
2 challenge 0.410
3 eli 0.172
4 isi 0.378
5 poms_tat 0.419
6 reat 0.523
7 se 0.332
8 selfman 0.390
9 sleepqual 0.379
10 tens -0.234
11 threat -0.247
Solution 2:[2]
We could use split.default
to split based on the substring of column namesinto a list
and then loop over the list
with sapply
, get the rowMeans
in base R
sapply(split.default(df1, sub("\\d+$", "", names(df1))), rowMeans, na.rm = TRUE)
Solution 3:[3]
Here's one with tidyr
. I only just saw the acceptance of Baraliuh's answer, so my answer here is more for the sake of closure.
library(tidyr)
my_summary <- as.data.frame(sapply(
X = pivot_longer(
data = df,
# Desired columns (all) to summarize.
cols = everything(),
# Take each group of columns, which share a common name before different numeric
# suffixes, and pivot them into multiple rows under a common column by that name.
names_to = c(
".value",
# Discard anything after the prefix.
NA
),
# Identify the (optional) numeric suffix.
names_sep = "\\d*$"
),
# Take the mean of each column; ignore missing values.
FUN = mean, na.rm = TRUE,
# Keep as a list, to convert into a data.frame.
simplify = FALSE
))
In contrast to some alternatives, I do believe my consolidated use of pivoting makes the process cleaner, and the output does achieve the precise format you desire.
poms_tat tens bat_ratio isi sleepqual se challenge threat reat selfman eli
1 0.4194313 -0.2337882 1 0.3781875 0.3789021 0.3315154 0.4097739 -0.2467008 0.5229015 0.3899116 0.1722602
With dplyr
involved too, the summarization (into a tibble
) is even cleaner:
library(tidyr)
library(dplyr)
my_summary <- df %>%
pivot_longer(
cols = everything(),
names_to = c(".value", NA),
names_sep = "\\d*$"
) %>%
summarize(across(everything(), mean, na.rm = TRUE))
Solution 4:[4]
You can also use the following solution:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(everything()) %>%
group_by(grp = sub("\\d+$", "", name)) %>%
summarise(Avg = mean(value, na.rm = TRUE))
grp Avg
<chr> <dbl>
1 bat_ratio 1
2 challenge 0.410
3 eli 0.172
4 isi 0.378
5 poms_tat 0.419
6 reat 0.523
7 se 0.332
8 selfman 0.390
9 sleepqual 0.379
10 tens -0.234
11 threat -0.247
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | Baraliuh |
Solution 2 | akrun |
Solution 3 | |
Solution 4 |