T-test with column number instead of column name - r

I am trying to perform a series of T-tests using RStatix's t_test(), where the dependent variable is the same in every test and the grouping variable changes. I am doing these tests inside a loop, so I would like to select the grouping variable with the column number instead of the column name. I have tried to do this with colnames(dataframe)[[columnnumber]], but I get the following error: "Can't extract columns that don't exist". How can I select the grouping variable with the column number instead of the column name?
Below is a minimal reproducible example with a fictitious data frame; the test works when the grouping variable's name (gender) is given, but not when the column number is used instead.
library(tidyverse)
library(rstatix)
dat <- data.frame(gender = rep(c("Male", "Female"), 1000),
                  age = rep(c("Young", "Young", "Old", "Old"), 500),
                  tot = round(runif(2000, min = 0, max = 1), 0))
dat %>% t_test(tot ~ gender, detailed = TRUE)              ## Works
dat %>% t_test(tot ~ colnames(dat)[[1]], detailed = TRUE)  ## Doesn't work

colnames(dat)[1] is a string, but t_test() requires a formula object, so you need to convert the string to a formula before passing it to t_test(). This can be done with reformulate() or as.formula().
library(rstatix)
dat %>% t_test(reformulate(colnames(dat)[1], 'tot'),detailed=T)
# A tibble: 1 x 15
# estimate estimate1 estimate2 .y. group1 group2 n1 n2 statistic
#* <dbl> <dbl> <dbl> <chr> <chr> <chr> <int> <int> <dbl>
#1 0.011 0.505 0.494 tot Female Male 1000 1000 0.492
# … with 6 more variables: p <dbl>, df <dbl>, conf.low <dbl>,
# conf.high <dbl>, method <chr>, alternative <chr>
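To tie this back to the loop in the question, here is a minimal sketch that builds one formula per column number with reformulate() and, for comparison, with as.formula(). It uses base R's t.test() as a stand-in so that it runs without extra packages (an assumption on my part; the same formula objects can be passed to rstatix::t_test() as shown above). The seed and the group_cols vector are illustrative and not part of the original question.

```r
# Sketch only: base t.test() stands in for rstatix::t_test()
set.seed(1)  # illustrative seed, not in the original example
dat <- data.frame(gender = rep(c("Male", "Female"), 1000),
                  age = rep(c("Young", "Young", "Old", "Old"), 500),
                  tot = round(runif(2000, min = 0, max = 1), 0))

group_cols <- 1:2  # column numbers of the grouping variables (assumed)

# Build "tot ~ <column name>" for each column number
formulas <- lapply(group_cols, function(i) {
  reformulate(colnames(dat)[i], response = "tot")
})

# Equivalent construction with as.formula() for the first column
alt <- as.formula(paste("tot ~", colnames(dat)[1]))

# Run one test per formula and label the results by column name
results <- lapply(formulas, function(f) t.test(f, data = dat))
names(results) <- colnames(dat)[group_cols]
```

Each element of results is an ordinary htest object, so results$gender$p.value and similar fields are available afterwards.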

If we want to construct the formula the tidyverse way, we can build it inside an expr() and evaluate it:
library(rstatix)
dat %>%
t_test(formula = eval(rlang::expr(tot ~ !! rlang::sym(names(.)[1]))),
detailed = TRUE)
# A tibble: 1 x 15
# estimate estimate1 estimate2 .y. group1 group2 n1 n2 statistic p df conf.low conf.high method alternative
#* <dbl> <dbl> <dbl> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#1 -0.02 0.497 0.517 tot Female Male 1000 1000 -0.894 0.371 1998. -0.0639 0.0239 T-test two.sided
NOTE: the values differ between the two outputs because the data were generated without a set.seed() call (the example draws tot with runif()).
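For readers who want to see what the expr()/!! step does without loading rlang, here is a rough base R analogue using bquote() and as.name(). This is a sketch of the same idea (splice a column name into an unevaluated formula, then evaluate it), not the rlang code from the answer; the three-column dat below is a tiny hypothetical stand-in.

```r
# Tiny hypothetical data frame with the same column layout as the question
dat <- data.frame(gender = c("Male", "Female"),
                  age = c("Young", "Old"),
                  tot = c(1, 0))

# Turn the first column name into a symbol (base counterpart of rlang::sym)
col_sym <- as.name(names(dat)[1])

# Splice the symbol into an unevaluated call (base counterpart of expr + !!)
fml_call <- bquote(tot ~ .(col_sym))

# Evaluating the call produces an actual formula object
fml <- eval(fml_call)
```

Before eval() the object is a plain call; after eval() it has class "formula", which is what the test functions expect.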

Related

R -how to apply Pairwise_t_test by subgroups across multiple columns

I'm attempting to determine if there's a significant difference between any of 40 measured variables in a dichotomous classification within 4 different subgroups.
The data are such that a Y/N factor column contains 'class', a 'subgroup' factor column has "A,B,C,D" and then 40 columns with numbers.
So far I can do the t_test for each variable using purrr::map.
ttest_list<- purrr::map(names(Project_Data)[3:40], ~pairwise_t_test(reformulate('class', response = .x), data = Project_Data))
I get a list with 40 tibbles like below:
[[1]]
# A tibble: 1 x 9
.y. group1 group2 n1 n2 p p.signif p.adj p.adj.signif
* <chr> <chr> <chr> <int> <int> <dbl> <chr> <dbl> <chr>
1 valine_NMR Y N 220 382 0.00155 ** 0.00155 **
Going one at a time I can use the group_by and get:
pwc_valine <- Project_Data %>%
group_by(subgroup) %>%
pairwise_t_test(valine_NMR ~ class, p.adjust.method = "bonferroni")
pwc_valine
subgroup .y. group1 group2 n1 n2 p p.signif p.adj p.adj.signif
* <fct> <chr> <chr> <chr> <int> <int> <dbl> <chr> <dbl> <chr>
1 A valine_NMR Y N 17 28 0.00619 ** 0.00619 **
2 B valine_NMR Y N 105 111 0.346 ns 0.346 ns
3 C valine_NMR Y N 86 126 0.000124 *** 0.000124 ***
4 D valine_NMR Y N 12 117 0.772 ns 0.772 ns
How do I apply the pairwise_t_test across all the columns while keeping subgroups?
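One possible approach, sketched here with base R only: loop over the measurement columns, split the data by subgroup, and run the pairwise test within each piece. The toy data frame, its column names m1 and m2, and the use of stats::pairwise.t.test() in place of rstatix::pairwise_t_test() are all assumptions made so the example is self-contained; Project_Data itself is not reproduced here.

```r
# Hypothetical stand-in for Project_Data: class, subgroup, then numeric columns
set.seed(42)
toy <- data.frame(class = factor(rep(c("Y", "N"), 60)),
                  subgroup = factor(rep(c("A", "B", "C", "D"), each = 30)),
                  m1 = rnorm(120),
                  m2 = rnorm(120))

measure_cols <- names(toy)[3:4]  # columns holding the measured variables

# For every measured variable, test Y vs N separately inside each subgroup
res <- do.call(rbind, lapply(measure_cols, function(v) {
  do.call(rbind, lapply(split(toy, toy$subgroup), function(d) {
    pw <- pairwise.t.test(d[[v]], d$class, p.adjust.method = "bonferroni")
    data.frame(subgroup = as.character(d$subgroup[1]),
               variable = v,
               p.adj = pw$p.value[1, 1])  # 1 x 1 matrix when there are two classes
  }))
}))
rownames(res) <- NULL
```

The result is one long data frame with a row per subgroup and variable, which is usually easier to filter or adjust further than a nested list.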

Adjust p.value for each term from list of ANOVA results/tidy DF's

I have data with 2 independent variables and thousands of dependent variables. I've performed multiple two-way ANOVA tests, and now I have a list containing the result for each dependent variable. Let's say that the list looks like this (example data):
> l
$a
# A tibble: 2 x 6
term df sumsq meansq statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Species 2 63.2 31.6 119. 1.67e-31
2 Residuals 147 39.0 0.265 NA NA
$b
# A tibble: 2 x 6
term df sumsq meansq statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Species 2 437. 219. 1180. 2.86e-91
2 Residuals 147 27.2 0.185 NA NA
Now I would like to apply p.adjust to each term. That is, I want to retrieve the p.value for Species, ..., Residuals from all data frames in this list, run p.adjust on the vector of p.values for each term, and add each adjusted p.value to the respective data frame (as a new column, in the row of that term). Is there a simple (tidyverse?) way to do this? The key is to use p.adjust.
I've managed to find an answer to this, although it is not a "tidyverse" way. Let the data look like this:
> a = aov(Sepal.Length ~ Species, data = iris)
> b = aov(Petal.Length ~ Species, data = iris)
> l = list(a = broom::tidy(a), b = broom::tidy(b))
> n_terms = nrow(l[[1]])
> n_terms
[1] 2
> for(i in seq_along(l)){
+ l[[i]]$q.value = 0
+ }
> l
$a
# A tibble: 2 x 7
term df sumsq meansq statistic p.value q.value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Species 2 63.2 31.6 119. 1.67e-31 0
2 Residuals 147 39.0 0.265 NA NA 0
$b
# A tibble: 2 x 7
term df sumsq meansq statistic p.value q.value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Species 2 437. 219. 1180. 2.86e-91 0
2 Residuals 147 27.2 0.185 NA NA 0
Then we can loop over each term. Inside the loop we retrieve the p.value of the given term from every data frame using purrr::map_dbl, and adjust that vector with p.adjust using the desired method. The last step is to loop over the placeholder q.value entries for that term and set each one to the adjusted value.
> for(term in 1:n_terms){
+ p.vals = purrr::map_dbl(l, ~.x[term, ]$p.value)
+ adjusted = as.vector(p.adjust(p.vals, method = "BY"))
+ for(i in seq_along(adjusted)){
+ l[[i]]$q.value[term] = adjusted[i]
+ }
+ }
> l
$a
# A tibble: 2 x 7
term df sumsq meansq statistic p.value q.value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Species 2 63.2 31.6 119. 1.67e-31 2.50e-31
2 Residuals 147 39.0 0.265 NA NA NA
$b
# A tibble: 2 x 7
term df sumsq meansq statistic p.value q.value
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Species 2 437. 219. 1180. 2.86e-91 8.57e-91
2 Residuals 147 27.2 0.185 NA NA NA
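The same per-term adjustment can also be written without the nested loops by stacking the tables and adjusting within each term. The sketch below is a base R variant, not the method above: tidy_aov() is a hypothetical helper standing in for broom::tidy() that keeps only the term and p.value columns, and ave() does the grouping.

```r
a <- aov(Sepal.Length ~ Species, data = iris)
b <- aov(Petal.Length ~ Species, data = iris)

# Hypothetical minimal stand-in for broom::tidy(): term and p.value only
tidy_aov <- function(fit) {
  s <- summary(fit)[[1]]
  data.frame(term = trimws(rownames(s)), p.value = s[["Pr(>F)"]])
}
l <- list(a = tidy_aov(a), b = tidy_aov(b))

# Stack the tables, remembering which model each row came from
long <- do.call(rbind, Map(function(d, nm) cbind(model = nm, d), l, names(l)))

# Adjust p-values within each term, across all models
long$q.value <- ave(long$p.value, long$term,
                    FUN = function(p) p.adjust(p, method = "BY"))

# Split back into one table per model, dropping the helper column
l2 <- split(long[, -1], long$model)
```

Rows with a missing p.value (the Residuals term) stay NA, which matches the output shown above.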

Is it possible to extract the full output for the statistical tests performed in gtsummary?

Background/aim
I am using gtsummary to present tables that also include statistical tests performed by the package. I want to supplement the table with additional information about the statistical test in my manuscript text. For example, gtsummary has tested mean differences between groups with ANOVA, and I want to report the F value and degrees of freedom of the test in the text. Is there any way to extract this information from the gtsummary object, or do I have to run the statistical tests again separately and extract it from there, i.e., run an aov() and copy and paste from the output?
Example situation with anova as the statistical test
library(tidyverse)
library(gtsummary)
theme_gtsummary_mean_sd()
gtTable <- mtcars %>%
select(cyl, mpg) %>%
tbl_summary(by = cyl) %>%
add_p()
oneWay <- aov(mpg ~ cyl, data = mtcars)
summary(oneWay)
#> Df Sum Sq Mean Sq F value Pr(>F)
#> cyl 1 817.7 817.7 79.56 6.11e-10 ***
#> Residuals 30 308.3 10.3
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This should be possible, but unfortunately it is not currently possible with the aov() results. I will make it possible in the next release, and you can follow this GitHub issue to track the progress: https://github.com/ddsjoberg/gtsummary/issues/956
Here is an example using a t-test, where it currently is possible.
library(gtsummary)
packageVersion("gtsummary")
#> [1] '1.4.2'
tbl <-
trial %>%
select(age, trt) %>%
tbl_summary(by = trt,
missing = "no") %>%
add_p(all_continuous() ~ "t.test")
# report any statistics in `tbl$table_body` with `inline_text()`
tbl$table_body
#> # A tibble: 1 x 18
#> variable test_name var_type var_label row_type label stat_1 stat_2 test_result
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <list>
#> 1 age t.test continuous Age label Age 46 (3~ 48 (3~ <named lis~
#> # ... with 9 more variables: estimate <dbl>, statistic <dbl>, parameter <dbl>,
#> # conf.low <dbl>, conf.high <dbl>, p.value <dbl>, estimate1 <dbl>,
#> # estimate2 <dbl>, alternative <chr>
# the columns about the t-test come from `t.test(...) %>% broom::tidy()`
t.test(age ~ trt, data = trial) %>% broom::tidy()
#> # A tibble: 1 x 10
#> estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.438 47.0 47.4 -0.209 0.834 184. -4.57 3.69
#> # ... with 2 more variables: method <chr>, alternative <chr>
# t-statistic
inline_text(tbl, variable = "age", column = "statistic") %>% style_sigfig()
#> t
#> "-0.21"
# degrees of freedom
inline_text(tbl, variable = "age", column = "parameter") %>% style_number()
#> df
#> "184"
Created on 2021-08-10 by the reprex package (v2.0.1)
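Until the aov() results are exposed, the F statistic and degrees of freedom can be pulled from a separately fitted model. The following is a sketch using base R only. It assumes cyl should be treated as a factor so that the test compares the three cylinder groups; the aov(mpg ~ cyl) call in the question instead treats cyl as numeric, which is why its output shows 1 degree of freedom.

```r
# Fit the one-way ANOVA with cyl as a grouping factor (assumption, see above)
fit <- aov(mpg ~ factor(cyl), data = mtcars)
tab <- summary(fit)[[1]]  # ANOVA table as a data frame

f_value    <- tab[["F value"]][1]
df_between <- tab[["Df"]][1]
df_within  <- tab[["Df"]][2]
p_value    <- tab[["Pr(>F)"]][1]

# Assemble a string for the manuscript text
report <- sprintf("F(%d, %d) = %.2f, p = %.3g",
                  as.integer(df_between), as.integer(df_within),
                  f_value, p_value)
print(report)
```

Building the string programmatically avoids copying numbers by hand, so the text stays in sync if the data change.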

renaming column names with dplyr using tidyselect functions

I am trying to rename a few columns using dplyr::rename and tidyselect helpers to do so using some patterns.
How can I get this to work?
library(tidyverse)
# tidy output from broom (using development version)
(df <- broom::tidy(stats::oneway.test(formula = wt ~ cyl, data = mtcars)))
#> # A tibble: 1 x 5
#> num.df den.df statistic p.value method
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 2 19.0 20.2 0.0000196 One-way analysis of means (not assuming equ~
# renaming
df %>%
dplyr::rename(
.data = .,
parameter1 = dplyr::matches("^num"),
parameter2 = dplyr::matches("^denom")
)
#> Error: Column positions must be scalar
Created on 2020-01-12 by the reprex package (v0.3.0.9001)
Your code works fine for me; however, here are some shorter alternatives that you can try:
library(tidyverse)
# tidy output from broom (using development version)
(df <- broom::tidy(stats::oneway.test(formula = wt ~ cyl, data = mtcars)))
#> # A tibble: 1 x 5
#> num.df den.df statistic p.value method
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 2 19.0 20.2 0.0000196 One-way analysis of means (not assuming equ~
# renaming
# Note: the patterns below use "den" so that they match the den.df column
# shown above; a "denom" pattern would not match that name.
df %>%
  rename(parameter1 = matches("^num"),
         parameter2 = matches("^den"))
# # A tibble: 1 x 5
# parameter1 parameter2 statistic p.value method
# <dbl> <dbl> <dbl> <dbl> <chr>
# 1 2 19.0 20.2 0.0000196 One-way analysis of means (not assuming..
df %>%
  rename(parameter1 = contains("num"),
         parameter2 = contains("den"))
# # A tibble: 1 x 5
# parameter1 parameter2 statistic p.value method
# <dbl> <dbl> <dbl> <dbl> <chr>
# 1 2 19.0 20.2 0.0000196 One-way analysis of means (not assuming..
df %>%
  rename(parameter1 = starts_with("num"),
         parameter2 = starts_with("den"))
# # A tibble: 1 x 5
# parameter1 parameter2 statistic p.value method
# <dbl> <dbl> <dbl> <dbl> <chr>
# 1 2 19.0 20.2 0.0000196 One-way analysis of means (not assuming..
We can also rename from a named vector
library(dplyr)
library(stringr)
df %>%
rename(!!!set_names(names(df)[1:2], str_c('parameter', 1:2)))
# A tibble: 1 x 5
# parameter1 parameter2 statistic p.value method
# <dbl> <dbl> <dbl> <dbl> <chr>
#1 2 19.0 20.2 0.0000196 One-way analysis of means (not assuming equal variances)
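For comparison, pattern-based renaming can also be sketched in base R. The data frame below is rebuilt by hand from the rounded values printed above (so it is only an illustration, not the broom output itself), and the lookup vector is a hypothetical helper. The explicit check that each pattern matches exactly one column makes the failure mode visible: a pattern that matches no column, or several, cannot be used for a one-to-one rename.

```r
# Hand-built stand-in for the tidy oneway.test output (rounded display values)
df <- data.frame(num.df = 2, den.df = 19.0,
                 statistic = 20.2, p.value = 1.96e-05)

# New name -> regular expression identifying the column to rename (hypothetical)
lookup <- c(parameter1 = "^num", parameter2 = "^den")

for (new_name in names(lookup)) {
  hit <- grep(lookup[[new_name]], names(df))
  # Each pattern must identify exactly one column
  stopifnot(length(hit) == 1)
  names(df)[hit] <- new_name
}

names(df)
```

With the pattern "^denom" the check would fail for this data frame, because no column name starts with "denom".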

R ggpubr use global environment objects?

I want to use the ggpubr package referring to data frame column names that are listed in character strings in my global environment, but ggpubr doesn't seem to take variables, only hardcoded column names. Is there a way I can make any changes so it can do this?
vars = c('var1', 'var2')
controls = c('a', 'w')
df = data.frame(subject = 1:100,
value = rnorm(100, 100, 10),
var1 = rep(c('a', 'b'), 50),
var2 = rep(c('w', 'x', 'y', 'z'), 25))
library(ggpubr)
compare_means(value ~ vars, df, ref.group = 'a')
But I want to be able to replace 'vars' with vars[1], vars[2], etc., and likewise use ref.group = controls[1], controls[2]. Can I get ggpubr to refer to global environment objects instead of taking the input directly as column names?
We can use reformulate
library(ggpubr)
fml <- reformulate(vars[1], 'value')
compare_means(fml , df, ref.group = controls[1])
# A tibble: 1 x 8
# .y. group1 group2 p p.adj p.format p.signif method
# <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
#1 value a b 0.537 0.54 0.54 ns Wilcoxon
and for multiple elements using corresponding values, use Map from base R
Map(function(x, y) compare_means(reformulate(x, 'value'), df,
ref.group = y), vars, controls)
Or with map2 from purrr
library(purrr)
map2(vars, controls, ~ compare_means(reformulate(.x, 'value'), df,
ref.group = .y))
#[[1]]
# A tibble: 1 x 8
# .y. group1 group2 p p.adj p.format p.signif method
# <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
#1 value a b 0.537 0.54 0.54 ns Wilcoxon
#[[2]]
# A tibble: 3 x 8
# .y. group1 group2 p p.adj p.format p.signif method
# <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
#1 value w x 0.126 0.38 0.13 ns Wilcoxon
#2 value w y 0.985 1 0.98 ns Wilcoxon
#3 value w z 0.969 1 0.97 ns Wilcoxon
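To make the pairing done by Map()/map2() explicit, here is a base R sketch that walks vars and controls in parallel and compares the reference group against every other level with wilcox.test(). The helper compare_to_ref() is hypothetical and only imitates the shape of the compare_means() output (no p-value adjustment or formatting); the seed is illustrative.

```r
set.seed(123)  # illustrative seed, not in the original example
vars <- c("var1", "var2")
controls <- c("a", "w")
df <- data.frame(subject = 1:100,
                 value = rnorm(100, 100, 10),
                 var1 = rep(c("a", "b"), 50),
                 var2 = rep(c("w", "x", "y", "z"), 25))

# Hypothetical helper: reference level vs. each remaining level of one column
compare_to_ref <- function(var, ref) {
  others <- setdiff(unique(df[[var]]), ref)
  p <- vapply(others, function(g) {
    wilcox.test(df$value[df[[var]] == ref],
                df$value[df[[var]] == g])$p.value
  }, numeric(1))
  data.frame(.y. = "value", group1 = ref, group2 = others, p = unname(p))
}

# vars[1] is paired with controls[1], vars[2] with controls[2]
out <- Map(compare_to_ref, vars, controls)
```

Because Map() iterates over both vectors in step, the two must have the same length and the same order.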
