'In R, how to write function that runs t-test on list of dataframes

Using iris dataset as an example, I want to write a user defined function that

  1. run pairwise t-test on all 4 columns exempting Species columns for each data split

  2. export the results as 3 worksheets of a csv file

See below for my attempt:

library(tidyr)
library(reshape) # for melting /stacking the data
library(multcomp) # for pairwise test 
library(xlsx) # export excel file with worksheet

options(scipen = 100)

# dataset
iris

data_stats <- function(data){
    # melt the dataframe
     df <- melt(data, id.vars=c('Species'),var='group')
    
# split the dataframe into three list of dataframe
    dfsplit<-split(df,df$column)
    
    #  pairwise t-test
    results <- pairwise.t.test(dfsplit$value, dfsplit$group,p.adjust.method = "BH")
    
    # export each result as a worksheet of an excel file
    write.xlsx(results, file="Results.xlsx", sheetName="versicolor_stats", row.names=FALSE)
    write.xlsx(results, file="Results.xlsx", sheetName="virginica_stats", append=TRUE, row.names=FALSE)
    write.xlsx(results, file="Results.xlsx", sheetName="setosa_stats", append=TRUE, row.names=FALSE)
    
}
# testing the code on iris data
data_stats(iris)



Please comment and share your code. Thanks



Solution 1:[1]

Here is an option with tidyverse - reshape to 'long' format with pivot_longer, then use group_modify to do the pairwise.t.test , tidy the output and unnest the list output

library(dplyr)
library(tidyr)
library(broom)
ttest_out <- iris %>% 
  pivot_longer(cols = -Species) %>% 
  group_by(Species) %>%
  group_modify(~ .x %>%
     summarise(out = list(pairwise.t.test(value, name) %>% 
       tidy))) %>% 
   ungroup %>% 
   unnest(out)

-output

ttest_out
# A tibble: 18 × 4
   Species    group1       group2         p.value
   <fct>      <chr>        <chr>            <dbl>
 1 setosa     Petal.Width  Petal.Length 1.77e- 54
 2 setosa     Sepal.Length Petal.Length 2.77e-132
 3 setosa     Sepal.Length Petal.Width  1.95e-156
 4 setosa     Sepal.Width  Petal.Length 1.61e- 86
 5 setosa     Sepal.Width  Petal.Width  1.13e-123
 6 setosa     Sepal.Width  Sepal.Length 4.88e- 71
 7 versicolor Petal.Width  Petal.Length 5.35e- 90
 8 versicolor Sepal.Length Petal.Length 3.78e- 52
 9 versicolor Sepal.Length Petal.Width  5.02e-125
10 versicolor Sepal.Width  Petal.Length 1.36e- 45
11 versicolor Sepal.Width  Petal.Width  3.46e- 44
12 versicolor Sepal.Width  Sepal.Length 1.25e- 95
13 virginica  Petal.Width  Petal.Length 1.39e- 90
14 virginica  Sepal.Length Petal.Length 6.67e- 22
15 virginica  Sepal.Length Petal.Width  3.47e-110
16 virginica  Sepal.Width  Petal.Length 2.35e- 68
17 virginica  Sepal.Width  Petal.Width  1.87e- 19
18 virginica  Sepal.Width  Sepal.Length 2.47e- 92

Solution 2:[2]

Update: The statistical part of this question (applying pairwise.t.test to the iris dataset) has been answered previously on SO. And here is another solution.

The accepted solution runs a series of pairwise t-tests and produces a column of p-values (exactly as the question asks) but it's a not very meaningful set of t-tests. You might suspect that from the fact that we see p-values like 2.77e-132 and that group1 and group2 are continuous variables, not the levels of a factor.

The hypotheses that these tests evaluate is whether, for each species separately, sepal is the same as petal and length is the same as width. The pairwise t-test procedure is designed to compare a single continuous variable (say sepal length) across all the levels a factor (say species).

To begin with, let's apply pairwise.t.test to the column Sepal.Length, so that we can check later on that we get the right p-values.

library("broom")
library("tidyverse")

pairwise.t.test(iris$Sepal.Width, iris$Species)
#> 
#>  Pairwise comparisons using t tests with pooled SD 
#> 
#> data:  iris$Sepal.Width and iris$Species 
#> 
#>            setosa  versicolor
#> versicolor < 2e-16 -         
#> virginica  9.1e-10 0.0031    
#> 
#> P value adjustment method: holm

If you've ever seen the iris dataset, you know these p-values "make sense": Virginica & Versicolor are more similar to each other than to Setosa.

So now let's apply the tests in a tidy way to the four numeric columns.

t_pvals <- iris %>%
  pivot_longer(
    -Species,
    names_to = "Variable",
    values_to = "x"
  ) %>%
  # The trick to performing the right tests is to group the tibble by Variable,
  # not by Species because Species is the grouping variable for the t-tests.
  group_by(
    Variable
  ) %>%
  group_modify(
    ~ tidy(pairwise.t.test(.x$x, .x$Species))
  ) %>%
  ungroup()
t_pvals
#> # A tibble: 12 × 4
#>    Variable     group1     group2      p.value
#>    <chr>        <chr>      <chr>         <dbl>
#>  1 Petal.Length versicolor setosa     1.05e-68
#>  2 Petal.Length virginica  setosa     1.23e-90
#>  3 Petal.Length virginica  versicolor 1.81e-31
#>  4 Petal.Width  versicolor setosa     2.51e-57
#>  5 Petal.Width  virginica  setosa     2.39e-85
#>  6 Petal.Width  virginica  versicolor 8.82e-37
#>  7 Sepal.Length versicolor setosa     1.75e-15
#>  8 Sepal.Length virginica  setosa     6.64e-32
#>  9 Sepal.Length virginica  versicolor 2.77e- 9
#> 10 Sepal.Width  versicolor setosa     5.50e-17
#> 11 Sepal.Width  virginica  setosa     9.08e-10
#> 12 Sepal.Width  virginica  versicolor 3.15e- 3

The p-values for the Sepal.Width comparisons are at the bottom. We got the p-values we expected!

Next we format the p-values so that they are easier on the eyes.

t_pvals <- t_pvals %>%
  mutate(
    across(
      p.value, rstatix::p_format,
      accuracy = 0.05
    )
  )
t_pvals
#> # A tibble: 12 × 4
#>    Variable     group1     group2     p.value
#>    <chr>        <chr>      <chr>      <chr>  
#>  1 Petal.Length versicolor setosa     <0.05  
#>  2 Petal.Length virginica  setosa     <0.05  
#>  3 Petal.Length virginica  versicolor <0.05  
#>  4 Petal.Width  versicolor setosa     <0.05  
#>  5 Petal.Width  virginica  setosa     <0.05  
#>  6 Petal.Width  virginica  versicolor <0.05  
#>  7 Sepal.Length versicolor setosa     <0.05  
#>  8 Sepal.Length virginica  setosa     <0.05  
#>  9 Sepal.Length virginica  versicolor <0.05  
#> 10 Sepal.Width  versicolor setosa     <0.05  
#> 11 Sepal.Width  virginica  setosa     <0.05  
#> 12 Sepal.Width  virginica  versicolor <0.05

And finally we save the results to a file.

t_pvals %>%
  write_csv("pairwse-t-tests-on-iris-data.csv")

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 akrun
Solution 2