R - how to apply pairwise_t_test by subgroups across multiple columns

I'm attempting to determine if there's a significant difference between any of 40 measured variables in a dichotomous classification within 4 different subgroups.
The data have a Y/N factor column 'class', a factor column 'subgroup' with levels A, B, C and D, and then 40 numeric columns.
So far I can do the t_test for each variable using purrr::map.
ttest_list <- purrr::map(names(Project_Data)[3:42], ~pairwise_t_test(reformulate('class', response = .x), data = Project_Data))
I get a list with 40 tibbles like below:
[[1]]
# A tibble: 1 x 9
.y. group1 group2 n1 n2 p p.signif p.adj p.adj.signif
* <chr> <chr> <chr> <int> <int> <dbl> <chr> <dbl> <chr>
1 valine_NMR Y N 220 382 0.00155 ** 0.00155 **
Going one at a time I can use the group_by and get:
pwc_valine <- Project_Data %>%
group_by(subgroup) %>%
pairwise_t_test(valine_NMR ~ class, p.adjust.method = "bonferroni")
pwc_valine
subgroup .y. group1 group2 n1 n2 p p.signif p.adj p.adj.signif
* <fct> <chr> <chr> <chr> <int> <int> <dbl> <chr> <dbl> <chr>
1 A valine_NMR Y N 17 28 0.00619 ** 0.00619 **
2 B valine_NMR Y N 105 111 0.346 ns 0.346 ns
3 C valine_NMR Y N 86 126 0.000124 *** 0.000124 ***
4 D valine_NMR Y N 12 117 0.772 ns 0.772 ns
How do I apply the pairwise_t_test across all the columns while keeping subgroups?
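One possible approach, sketched here under assumptions rather than taken from the thread, is to reshape the measured columns into long format and group by both the subgroup and the variable name before calling pairwise_t_test. The toy Project_Data below (three made-up variables var1 to var3) only stands in for the real data, and the column names variable and value are illustrative choices.

```r
library(dplyr)
library(tidyr)
library(rstatix)

# Toy stand-in for the real data: class, subgroup, then numeric columns.
set.seed(1)
Project_Data <- data.frame(
  class    = factor(rep(c("Y", "N"), times = 40)),
  subgroup = factor(rep(c("A", "B", "C", "D"), each = 20)),
  var1 = rnorm(80),
  var2 = rnorm(80),
  var3 = rnorm(80)
)

pwc_all <- Project_Data %>%
  # Stack every measured column into variable/value pairs.
  pivot_longer(cols = -c(class, subgroup),
               names_to = "variable", values_to = "value") %>%
  # One test per subgroup and per measured variable.
  group_by(subgroup, variable) %>%
  pairwise_t_test(value ~ class, p.adjust.method = "bonferroni")

pwc_all
```

The result has one row per subgroup and variable. With only two classes there is a single comparison in each group, so p.adj equals p here; correcting across variables would need a separate step.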

Related

How to transform multiple comparison results (*) into letter markings

I am running Kruskal-Wallis rank tests with the packages rstatix and agricolae, and I am unsure how to convert the results of one package into the format of the other.
Here is my sample code:
data("iris")
#### agricolae
library(agricolae)
print( kruskal(iris[, 1], factor(iris[, 5]), group=TRUE, p.adj="bonferroni") )
##
> $groups
iris[, 1] groups
virginica 114.21 a
versicolor 82.65 b
setosa 29.64 c
library(rstatix)
iris %>% dunn_test(Sepal.Length ~ Species)
> # A tibble: 3 x 9
.y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
* <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
1 Sepal.~ setosa versi~ 50 50 6.11 1.02e- 9 2.04e- 9 ****
2 Sepal.~ setosa virgi~ 50 50 9.74 2.00e-22 6.00e-22 ****
3 Sepal.~ versic~ virgi~ 50 50 3.64 2.77e- 4 2.77e- 4 ***
I want to convert the significance stars (e.g. '***') into grouping letters ('a', 'b', 'c'). Are there any better packages or approaches for this kind of nonparametric test? Please show the code. Thanks!
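One way this conversion might be done, offered as a sketch and not as an answer from the thread, is to pass the adjusted p-values to multcompLetters from the multcompView package. That function expects a vector named with hyphenated pairs such as "setosa-versicolor". The sketch assumes multcompView is installed and uses the Bonferroni adjustment to mirror the agricolae call.

```r
library(rstatix)
library(multcompView)

# Pairwise Dunn tests on the iris data used in the question.
res <- dunn_test(iris, Sepal.Length ~ Species, p.adjust.method = "bonferroni")

# Name each adjusted p-value by the pair it compares, e.g. "setosa-versicolor".
p_adj <- setNames(res$p.adj, paste(res$group1, res$group2, sep = "-"))

# Groups that do not differ at the 0.05 threshold share a letter.
group_letters <- multcompLetters(p_adj, threshold = 0.05)$Letters
group_letters
```

Because all three species differ from one another here, each species gets its own letter. Group names that themselves contain a hyphen would need to be renamed first.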

Adjust p.value for each term from list of ANOVA results/tidy DF's

I have data with 2 independent variables and thousands of dependent variables. I've performed multiple two-way ANOVA tests and now I have a list containing the result for each dependent variable. Let's say that the list looks like this (example data):
> l
$a
# A tibble: 2 x 6
term df sumsq meansq statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Species 2 63.2 31.6 119. 1.67e-31
2 Residuals 147 39.0 0.265 NA NA
$b
# A tibble: 2 x 6
term df sumsq meansq statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Species 2 437. 219. 1180. 2.86e-91
2 Residuals 147 27.2 0.185 NA NA
Now I would like to use the p.adjust method for each term. So what I want to do is to retrieve p.value for Species, ..., Residuals from all dataframes in this list, then use the p.adjust on a vector of p.values from specific term and add each adjusted p.value to respective dataframe (to new column in respective term). Is there any way to do this in a simple (tidyverse?) way? Key here is to use the p.adjust method.
I've managed to find an answer to this, although the method is not the "tidyverse" way. Let the data look like this:
> a = aov(Sepal.Length ~ Species, data = iris)
> b = aov(Petal.Length ~ Species, data = iris)
> l = list(a = broom::tidy(a), b = broom::tidy(b))
> n_terms = nrow(l[[1]])
> n_terms
[1] 2
> for(i in seq_along(l)){
+ l[[i]]$q.value = 0
+ }
> l
$a
# A tibble: 2 x 7
term df sumsq meansq statistic p.value q.value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Species 2 63.2 31.6 119. 1.67e-31 0
2 Residuals 147 39.0 0.265 NA NA 0
$b
# A tibble: 2 x 7
term df sumsq meansq statistic p.value q.value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Species 2 437. 219. 1180. 2.86e-91 0
2 Residuals 147 27.2 0.185 NA NA 0
Then we can create a for loop over the terms. Inside the loop we retrieve the p.value of the given term from every data frame using purrr::map_dbl and adjust that vector of p-values with p.adjust using the desired method. The next step is to loop over the placeholder q.value entries for that term and set each one to the adjusted value just calculated.
> for(term in 1:n_terms){
+ p.vals = purrr::map_dbl(l, ~.x[term, ]$p.value)
+ adjusted = as.vector(p.adjust(p.vals, method = "BY"))
+ for(i in seq_along(adjusted)){
+ l[[i]]$q.value[term] = adjusted[i]
+ }
+ }
> l
$a
# A tibble: 2 x 7
term df sumsq meansq statistic p.value q.value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Species 2 63.2 31.6 119. 1.67e-31 2.50e-31
2 Residuals 147 39.0 0.265 NA NA NA
$b
# A tibble: 2 x 7
term df sumsq meansq statistic p.value q.value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Species 2 437. 219. 1180. 2.86e-91 8.57e-91
2 Residuals 147 27.2 0.185 NA NA NA
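For comparison, a tidyverse-style version of the same adjustment could look like the sketch below. It assumes dplyr and broom are available; the names response, adjusted and l_adj are illustrative and not from the original answer. The list is stacked into one data frame, p.adjust is applied within each term, and the result is split back into a list.

```r
library(dplyr)

a <- aov(Sepal.Length ~ Species, data = iris)
b <- aov(Petal.Length ~ Species, data = iris)
l <- list(a = broom::tidy(a), b = broom::tidy(b))

adjusted <- bind_rows(l, .id = "response") %>%  # stack, remembering the source
  group_by(term) %>%                            # adjust within each term
  mutate(q.value = p.adjust(p.value, method = "BY")) %>%
  ungroup()

# Back to a list with one data frame per dependent variable.
l_adj <- split(select(adjusted, -response), adjusted$response)
l_adj
```

Rows whose p.value is NA, such as Residuals, keep NA in q.value.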

Pass column names as function arguments in formula

I want to create a re-usable function for a repeating t-test such that the column names can be passed into a formula. However, I cannot find a way to make it work. So the following code is the idea:
library(dplyr)
library(rstatix)
do.function <- function(table, column, category) {
column = sym(column)
category = sym(category)
stat.test <- table %>%
group_by(subset) %>%
t_test(column ~ category)
return(stat.test)
}
tmp = data.frame(id=seq(1:100), value = rnorm(100), subset = rep(c("Set1", "Set2"),each=50,2),categorical_value= rep(c("A", "B"),each=25,4))
do.function(table= tmp, column = "value", category = "categorical_value")
The current error that I get is the following:
Error: Can't extract columns that don't exist.
x Column `category` doesn't exist.
Run `rlang::last_error()` to see where the error occurred.
The question is whether somebody knows how to solve this?
Just make a formula instead of wrapping them in sym:
library(dplyr)
library(rstatix)
do.function <- function(table, column, category) {
formula <- paste0(column, '~', category) %>%
as.formula()
table %>%
group_by(subset) %>%
t_test(formula)
}
tmp = data.frame(id=seq(1:100), value = rnorm(100), subset = rep(c("Set1", "Set2"),each=50,2),categorical_value= rep(c("A", "B"),each=25,4))
do.function(table= tmp, column = "value", category = "categorical_value")
# A tibble: 2 x 9
subset .y. group1 group2 n1 n2 statistic df p
* <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
1 Set1 value A B 50 50 0.484 94.3 0.63
2 Set2 value A B 50 50 -2.15 97.1 0.034
As we are passing string values, we can simply use reformulate to build the formula:
do.function <- function(table, column, category) {
stat.test <- table %>%
group_by(subset) %>%
t_test(reformulate(category, response = column ))
return(stat.test)
}
Testing:
> do.function(table= tmp, column = "value", category = "categorical_value")
# A tibble: 2 × 9
subset .y. group1 group2 n1 n2 statistic df p
* <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
1 Set1 value A B 50 50 1.66 97.5 0.0993
2 Set2 value A B 50 50 0.448 92.0 0.655
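rstatix::t_test already takes a formula, so we only need to get the variables by their names. Note that this version drops group_by(subset), so it tests the whole table at once, and the .y. column shows the placeholder name column: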
do.function <- function(table, column, category) {
stat.test <- table %>%
mutate(column=get(column),
category=get(category)) %>%
rstatix::t_test(column ~ category)
return(stat.test)
}
do.function(table=tmp, column="value", category="categorical_value")
# # A tibble: 1 × 8
# .y. group1 group2 n1 n2 statistic df p
# * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
# 1 column A B 100 100 0.996 197. 0.32
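The functions above all hardcode subset as the grouping column. If that column should also be passed as a string, one option is dplyr's across(all_of(...)) inside group_by, as sketched below. The function name grouped_t_test, its group argument, and the simplified toy data are hypothetical and not from the answers.

```r
library(dplyr)
library(rstatix)

# Hypothetical helper: all three column names are passed as strings.
grouped_t_test <- function(table, column, category, group) {
  table %>%
    group_by(across(all_of(group))) %>%               # group by a column named in a string
    t_test(reformulate(category, response = column))  # builds column ~ category
}

set.seed(1)
tmp <- data.frame(
  value = rnorm(200),
  subset = rep(c("Set1", "Set2"), each = 100),
  categorical_value = rep(c("A", "B"), times = 100)
)

res <- grouped_t_test(tmp, column = "value",
                      category = "categorical_value", group = "subset")
res
```

Since the formula is built with reformulate, the .y. column keeps the real variable name.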

T-test with column number instead of column name

I am trying to perform a series of T-tests using RStatix's t_test(), where the dependent variable is the same in every test and the grouping variable changes. I am doing these tests inside a loop, so I would like to select the grouping variable with the column number instead of the column name. I have tried to do this with colnames(dataframe)[[columnnumber]], but I get the following error: "Can't extract columns that don't exist". How can I select the grouping variable with the column number instead of the column name?
Below is a minimal reproducible example with a fictitious data frame; the test works correctly when the grouping variable's name (gender) is given, but not when the column number is given instead.
library(tidyverse)
library(rstatix)
dat<-data.frame(gender=rep(c("Male", "Female"), 1000),
age=rep(c("Young","Young", "Old", "Old"),500),
tot= round(runif(2000, min=0, max=1),0))
dat %>% t_test(tot ~ gender,detailed=T) ##Works
dat %>% t_test(tot ~ colnames(dat)[[1]],detailed=T) ##Doesn't work
colnames(dat)[1] is a string, but t_test requires a formula object, so you need to convert the string to a formula before passing it to t_test. This can be done using reformulate or as.formula.
library(rstatix)
dat %>% t_test(reformulate(colnames(dat)[1], 'tot'),detailed=T)
# A tibble: 1 x 15
# estimate estimate1 estimate2 .y. group1 group2 n1 n2 statistic
#* <dbl> <dbl> <dbl> <chr> <chr> <chr> <int> <int> <dbl>
#1 0.011 0.505 0.494 tot Female Male 1000 1000 0.492
# … with 6 more variables: p <dbl>, df <dbl>, conf.low <dbl>,
# conf.high <dbl>, method <chr>, alternative <chr>
If we want a tidyverse-style construction, build the formula inside rlang::expr and evaluate it:
library(rstatix)
dat %>%
t_test(formula = eval(rlang::expr(tot ~ !! rlang::sym(names(.)[1]))),
detailed = TRUE)
# A tibble: 1 x 15
# estimate estimate1 estimate2 .y. group1 group2 n1 n2 statistic p df conf.low conf.high method alternative
#* <dbl> <dbl> <dbl> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#1 -0.02 0.497 0.517 tot Female Male 1000 1000 -0.894 0.371 1998. -0.0639 0.0239 T-test two.sided
NOTE: the values differ because the data were generated with runif without calling set.seed.
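Since the question mentions running these tests in a loop, the reformulate idea could be wrapped in lapply over column numbers, as in this sketch. The vector group_cols and the fixed seed are assumptions made for illustration.

```r
library(rstatix)

set.seed(123)  # fixed seed so that repeated runs give the same numbers
dat <- data.frame(
  gender = rep(c("Male", "Female"), 1000),
  age    = rep(c("Young", "Young", "Old", "Old"), 500),
  tot    = round(runif(2000, min = 0, max = 1), 0)
)

group_cols <- 1:2  # column numbers of the grouping variables (assumed)

results <- lapply(group_cols, function(i) {
  # Turn the i-th column name into the formula tot ~ <column name>.
  fml <- reformulate(colnames(dat)[i], response = "tot")
  t_test(dat, fml, detailed = TRUE)
})
names(results) <- colnames(dat)[group_cols]

results$gender
```

Each list element holds the t_test result for one grouping column and is named after that column.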

R ggpubr use global environment objects?

I want to use the ggpubr package referring to data frame column names that are listed in character strings in my global environment, but ggpubr doesn't seem to take variables, only hardcoded column names. Is there a way I can make any changes so it can do this?
vars = c('var1', 'var2')
controls = c('a', 'w')
df = data.frame(subject = 1:100,
value = rnorm(100, 100, 10),
var1 = rep(c('a', 'b'), 50),
var2 = rep(c('w', 'x', 'y', 'z'), 25))
library(ggpubr)
compare_means(value ~ vars, df, ref.group = 'a')
But I want to be able to replace 'vars' with vars[1], vars[2], etc., and likewise use ref.group = controls[1], controls[2]. Can I get ggpubr to refer to global environment objects instead of taking the input directly as column names?
We can use reformulate
library(ggpubr)
fml <- reformulate(vars[1], 'value')
compare_means(fml , df, ref.group = controls[1])
# A tibble: 1 x 8
# .y. group1 group2 p p.adj p.format p.signif method
# <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
#1 value a b 0.537 0.54 0.54 ns Wilcoxon
For multiple elements paired with their corresponding reference groups, use Map from base R:
Map(function(x, y) compare_means(reformulate(x, 'value'), df,
ref.group = y), vars, controls)
Or with map2 from purrr
library(purrr)
map2(vars, controls, ~ compare_means(reformulate(.x, 'value'), df,
ref.group = .y))
#[[1]]
# A tibble: 1 x 8
# .y. group1 group2 p p.adj p.format p.signif method
# <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
#1 value a b 0.537 0.54 0.54 ns Wilcoxon
#[[2]]
# A tibble: 3 x 8
# .y. group1 group2 p p.adj p.format p.signif method
# <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
#1 value w x 0.126 0.38 0.13 ns Wilcoxon
#2 value w y 0.985 1 0.98 ns Wilcoxon
#3 value w z 0.969 1 0.97 ns Wilcoxon
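If one table is more convenient than a list, the per-variable results could be stacked with dplyr::bind_rows, as in the sketch below. It relies on Map naming its output after the character vector vars; the column name variable is an illustrative choice and dplyr is an extra dependency not used in the answer.

```r
library(ggpubr)

set.seed(1)
vars <- c("var1", "var2")
controls <- c("a", "w")
df <- data.frame(subject = 1:100,
                 value = rnorm(100, 100, 10),
                 var1 = rep(c("a", "b"), 50),
                 var2 = rep(c("w", "x", "y", "z"), 25))

# Map names the list elements after 'vars' because it is a character vector.
res <- Map(function(x, y) compare_means(reformulate(x, "value"), df,
                                        ref.group = y),
           vars, controls)

# Stack the results, recording the grouping variable behind each row.
combined <- dplyr::bind_rows(res, .id = "variable")
combined
```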
