'In R, how to write function that runs t-test on list of dataframes
Using iris dataset as an example, I want to write a user defined function that
run
pairwise t-teston all 4 columns exemptingSpeciescolumns for each data splitexport the results as 3 worksheets of a csv file
See below for my attempt:
library(tidyr)
library(reshape) # for melting /stacking the data
library(multcomp) # for pairwise test
library(xlsx) # export excel file with worksheet
options(scipen = 100)
# dataset
iris
data_stats <- function(data){
# melt the dataframe
df <- melt(data, id.vars=c('Species'),var='group')
# split the dataframe into three list of dataframe
dfsplit<-split(df,df$column)
# pairwise t-test
results <- pairwise.t.test(dfsplit$value, dfsplit$group,p.adjust.method = "BH")
# export each result as a worksheet of an excel file
write.xlsx(results, file="Results.xlsx", sheetName="versicolor_stats", row.names=FALSE)
write.xlsx(results, file="Results.xlsx", sheetName="virginica_stats", append=TRUE, row.names=FALSE)
write.xlsx(results, file="Results.xlsx", sheetName="setosa_stats", append=TRUE, row.names=FALSE)
}
# testing the code on iris data
data_stats(iris)
Please comment and share your code. Thanks
Solution 1:[1]
Here is an option with tidyverse - reshape to 'long' format with pivot_longer, then use group_modify to do the pairwise.t.test , tidy the output and unnest the list output
library(dplyr)
library(tidyr)
library(broom)
ttest_out <- iris %>%
pivot_longer(cols = -Species) %>%
group_by(Species) %>%
group_modify(~ .x %>%
summarise(out = list(pairwise.t.test(value, name) %>%
tidy))) %>%
ungroup %>%
unnest(out)
-output
ttest_out
# A tibble: 18 × 4
Species group1 group2 p.value
<fct> <chr> <chr> <dbl>
1 setosa Petal.Width Petal.Length 1.77e- 54
2 setosa Sepal.Length Petal.Length 2.77e-132
3 setosa Sepal.Length Petal.Width 1.95e-156
4 setosa Sepal.Width Petal.Length 1.61e- 86
5 setosa Sepal.Width Petal.Width 1.13e-123
6 setosa Sepal.Width Sepal.Length 4.88e- 71
7 versicolor Petal.Width Petal.Length 5.35e- 90
8 versicolor Sepal.Length Petal.Length 3.78e- 52
9 versicolor Sepal.Length Petal.Width 5.02e-125
10 versicolor Sepal.Width Petal.Length 1.36e- 45
11 versicolor Sepal.Width Petal.Width 3.46e- 44
12 versicolor Sepal.Width Sepal.Length 1.25e- 95
13 virginica Petal.Width Petal.Length 1.39e- 90
14 virginica Sepal.Length Petal.Length 6.67e- 22
15 virginica Sepal.Length Petal.Width 3.47e-110
16 virginica Sepal.Width Petal.Length 2.35e- 68
17 virginica Sepal.Width Petal.Width 1.87e- 19
18 virginica Sepal.Width Sepal.Length 2.47e- 92
Solution 2:[2]
Update: The statistical part of this question (applying pairwise.t.test to the iris dataset) has been answered previously on SO. And here is another solution.
The accepted solution runs a series of pairwise t-tests and produces a column of p-values (exactly as the question asks) but it's a not very meaningful set of t-tests. You might suspect that from the fact that we see p-values like 2.77e-132 and that group1 and group2 are continuous variables, not the levels of a factor.
The hypotheses that these tests evaluate is whether, for each species separately, sepal is the same as petal and length is the same as width. The pairwise t-test procedure is designed to compare a single continuous variable (say sepal length) across all the levels a factor (say species).
To begin with, let's apply pairwise.t.test to the column Sepal.Length, so that we can check later on that we get the right p-values.
library("broom")
library("tidyverse")
pairwise.t.test(iris$Sepal.Width, iris$Species)
#>
#> Pairwise comparisons using t tests with pooled SD
#>
#> data: iris$Sepal.Width and iris$Species
#>
#> setosa versicolor
#> versicolor < 2e-16 -
#> virginica 9.1e-10 0.0031
#>
#> P value adjustment method: holm
If you've ever seen the iris dataset, you know these p-values "make sense": Virginica & Versicolor are more similar to each other than to Setosa.
So now let's apply the tests in a tidy way to the four numeric columns.
t_pvals <- iris %>%
pivot_longer(
-Species,
names_to = "Variable",
values_to = "x"
) %>%
# The trick to performing the right tests is to group the tibble by Variable,
# not by Species because Species is the grouping variable for the t-tests.
group_by(
Variable
) %>%
group_modify(
~ tidy(pairwise.t.test(.x$x, .x$Species))
) %>%
ungroup()
t_pvals
#> # A tibble: 12 × 4
#> Variable group1 group2 p.value
#> <chr> <chr> <chr> <dbl>
#> 1 Petal.Length versicolor setosa 1.05e-68
#> 2 Petal.Length virginica setosa 1.23e-90
#> 3 Petal.Length virginica versicolor 1.81e-31
#> 4 Petal.Width versicolor setosa 2.51e-57
#> 5 Petal.Width virginica setosa 2.39e-85
#> 6 Petal.Width virginica versicolor 8.82e-37
#> 7 Sepal.Length versicolor setosa 1.75e-15
#> 8 Sepal.Length virginica setosa 6.64e-32
#> 9 Sepal.Length virginica versicolor 2.77e- 9
#> 10 Sepal.Width versicolor setosa 5.50e-17
#> 11 Sepal.Width virginica setosa 9.08e-10
#> 12 Sepal.Width virginica versicolor 3.15e- 3
The p-values for the Sepal.Width comparisons are at the bottom. We got the p-values we expected!
Next we format the p-values so that they are easier on the eyes.
t_pvals <- t_pvals %>%
mutate(
across(
p.value, rstatix::p_format,
accuracy = 0.05
)
)
t_pvals
#> # A tibble: 12 × 4
#> Variable group1 group2 p.value
#> <chr> <chr> <chr> <chr>
#> 1 Petal.Length versicolor setosa <0.05
#> 2 Petal.Length virginica setosa <0.05
#> 3 Petal.Length virginica versicolor <0.05
#> 4 Petal.Width versicolor setosa <0.05
#> 5 Petal.Width virginica setosa <0.05
#> 6 Petal.Width virginica versicolor <0.05
#> 7 Sepal.Length versicolor setosa <0.05
#> 8 Sepal.Length virginica setosa <0.05
#> 9 Sepal.Length virginica versicolor <0.05
#> 10 Sepal.Width versicolor setosa <0.05
#> 11 Sepal.Width virginica setosa <0.05
#> 12 Sepal.Width virginica versicolor <0.05
And finally we save the results to a file.
t_pvals %>%
write_csv("pairwse-t-tests-on-iris-data.csv")
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | akrun |
| Solution 2 |
