'Getting summary by group and overall using tidyverse

I am trying to find a way to get summary stats such as means by group and overall in one step using dplyr

#Data set-up
sex <- sample(c("M", "F"), size=100, replace=TRUE)
age <- rnorm(n=100, mean=20 + 4*(sex=="F"), sd=0.1)
dsn <- data.frame(sex, age)


library("tidyverse")

#Using dplyr to get means by group and overall
mean_by_sex <- dsn %>% 
  group_by(sex) %>% 
  summarise(mean_age = mean(age))

mean_all <- dsn %>% 
  summarise(mean_age = mean(age)) %>% 
  add_column(sex = "All")

#combining the results by groups and overall
final_result <- rbind(mean_by_sex, mean_all)
final_result  
#> # A tibble: 3 x 2
#>   sex   mean_age
#>   <fct>    <dbl>
#> 1 F         24.0
#> 2 M         20.0
#> 3 All       21.9
#This is the table I want but I wonder if is the only way to do this

Is there a way this in shorter step using group_by_at or group_by_all or a similar functions using tidyverse and dplyr Any help would be greatly appreciated



Solution 1:[1]

A little switching around can do it, too.

final_result <- dsn %>% 
  add_row(sex = "All", age = mean(age)) %>% 
  group_by(sex) %>% 
  summarise(mean_age = mean(age))

Solution 2:[2]

These answers are great if you have one variable to summarize by. What about two? I want to summarize across one but leave the other as is. The above solutions do not work in this case because the data frame still needs to be grouped.

#Data set up 
set.seed(3243242)
dsn <- tibble(
  obese = sample(c(TRUE, FALSE), size=100, replace = TRUE),
  sex = sample(c("M", "F"), size=100, replace=TRUE),
                  age = rnorm(n=100, mean=20 + 4*(sex=="F"), sd=0.1)
                    )
library("tidyverse")

I restated the original problem using 2 group_by variables.

#Extend to 2 group_by variables?
df1 <- dsn %>%
  group_by(sex, obese) %>% 
  summarise(mean_age = mean(age)) %>%
  ungroup() 

#Also across sex
df2 <- dsn %>%
  group_by(obese) %>% 
  summarise(mean_age = mean(age)) %>%
  ungroup() 

#Final_result:
bind_rows(df1, df2)

Way to do this in one step? You can add mean with add_row() but not with a grouped df. Another option is to create a function that does all the things on the group dataset. If there are other things you want to do, say sort or create new variables, you can do it in the function. Then, you can apply the function to each grouped dataset. After combining via dplyr::bind_rows(), you can change the missing group variable to all via tidyr::replace_na().

  #'@param df_group A grouped tibble
find_summary <- function(df_group){
  df_group %>% 
summarize(mean_age = mean(age))  #add other dplyr verbs here as needed like arrange or mutate
}

bind_rows(
    find_summary(group_by(dsn, sex, obese)),
    find_summary(group_by(dsn, obese))
    ) %>%
     replace_na(list(sex = "all"))
sex   obese mean_age
  <chr> <lgl>    <dbl>
1 F     FALSE     24.0
2 F     TRUE      24.0
3 M     FALSE     20.0
4 M     TRUE      20.0
5 all   FALSE     21.7
6 all   TRUE      22.3

You can extend the idea if you want a summary of all variables, by one variable, and by two variables.

bind_rows(
    find_summary(group_by(dsn, sex, obese)),
    find_summary(group_by(dsn, obese)),
    find_summary(dsn)
    ) %>%
     replace_na(list(sex = "all", obese = "all"))
  sex   obese mean_age
  <chr> <chr>    <dbl>
1 F     FALSE     24.0
2 F     TRUE      24.0
3 M     FALSE     20.0
4 M     TRUE      20.0
5 all   FALSE     21.7
6 all   TRUE      22.3
7 all   all       22.0

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 vanao veneri
Solution 2