I want to use the ggpubr package referring to data frame column names that are listed in character strings in my global environment, but ggpubr doesn't seem to take variables, only hardcoded column names. Is there a way I can make any changes so it can do this?
vars = c('var1', 'var2')
controls = c('a', 'w')
df = data.frame(subject = 1:100,
value = rnorm(100, 100, 10),
var1 = rep(c('a', 'b'), 50),
var2 = rep(c('w', 'x', 'y', 'z'), 25))
library(ggpubr)
compare_means(value ~ var1, df, ref.group = 'a')
But I want to be able to replace the hardcoded var1 with vars[1], vars[2], etc., and likewise use ref.group = controls[1], controls[2]. Can I get ggpubr to refer to global environment objects instead of taking the input directly as column names?
We can use reformulate to build the formula from the character strings
library(ggpubr)
fml <- reformulate(vars[1], 'value')
compare_means(fml , df, ref.group = controls[1])
# A tibble: 1 x 8
# .y. group1 group2 p p.adj p.format p.signif method
# <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
#1 value a b 0.537 0.54 0.54 ns Wilcoxon
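To see why this works, it can help to inspect what reformulate returns. A minimal base R sketch, reusing the vars vector from the question (no ggpubr needed):

```r
vars <- c("var1", "var2")

# Build the formula value ~ var1 from two character strings
fml <- reformulate(vars[1], response = "value")

class(fml)      # "formula"
all.vars(fml)   # "value" "var1"
```

The result is an ordinary formula object, so any function that accepts a formula can take it in place of a hardcoded value ~ var1.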
For multiple elements with corresponding values, use Map from base R
Map(function(x, y) compare_means(reformulate(x, 'value'), df,
ref.group = y), vars, controls)
Or with map2 from purrr
library(purrr)
map2(vars, controls, ~ compare_means(reformulate(.x, 'value'), df,
ref.group = .y))
#[[1]]
# A tibble: 1 x 8
# .y. group1 group2 p p.adj p.format p.signif method
# <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
#1 value a b 0.537 0.54 0.54 ns Wilcoxon
#[[2]]
# A tibble: 3 x 8
# .y. group1 group2 p p.adj p.format p.signif method
# <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
#1 value w x 0.126 0.38 0.13 ns Wilcoxon
#2 value w y 0.985 1 0.98 ns Wilcoxon
#3 value w z 0.969 1 0.97 ns Wilcoxon
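The pairing done by Map can be checked without running any test. In this sketch, describe_call is a hypothetical stand-in for compare_means that only reports which formula and reference group it would receive:

```r
vars <- c("var1", "var2")
controls <- c("a", "w")

# Hypothetical stand-in for compare_means(): returns a description
# of the formula and reference group instead of running a test
describe_call <- function(x, y) {
  paste(deparse(reformulate(x, response = "value")), "with ref.group =", y)
}

# Map walks both vectors in parallel: (var1, a), then (var2, w)
paired <- Map(describe_call, vars, controls)

names(paired)   # "var1" "var2"
paired$var2     # "value ~ var2 with ref.group = w"
```

Because vars is a character vector, Map names each result after the element it came from, whereas the map2 output above is indexed by position.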
I'm attempting to determine if there's a significant difference between any of 40 measured variables in a dichotomous classification within 4 different subgroups.
The data are such that a Y/N factor column contains 'class', a 'subgroup' factor column has "A,B,C,D" and then 40 columns with numbers.
So far I can do the t_test for each variable using purrr::map.
ttest_list<- purrr::map(names(Project_Data)[3:40], ~pairwise_t_test(reformulate('class', response = .x), data = Project_Data))
I get a list with 40 tibbles like below:
[[1]]
# A tibble: 1 x 9
.y. group1 group2 n1 n2 p p.signif p.adj p.adj.signif
* <chr> <chr> <chr> <int> <int> <dbl> <chr> <dbl> <chr>
1 valine_NMR Y N 220 382 0.00155 ** 0.00155 **
Going one at a time I can use the group_by and get:
pwc_valine <- Project_Data %>%
group_by(subgroup) %>%
pairwise_t_test(valine_NMR ~ class, p.adjust.method = "bonferroni")
pwc_valine
subgroup .y. group1 group2 n1 n2 p p.signif p.adj p.adj.signif
* <fct> <chr> <chr> <chr> <int> <int> <dbl> <chr> <dbl> <chr>
1 A valine_NMR Y N 17 28 0.00619 ** 0.00619 **
2 B valine_NMR Y N 105 111 0.346 ns 0.346 ns
3 C valine_NMR Y N 86 126 0.000124 *** 0.000124 ***
4 D valine_NMR Y N 12 117 0.772 ns 0.772 ns
How do I apply the pairwise_t_test across all the columns while keeping subgroups?
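One possible approach is to combine the two snippets above: keep the purrr::map over column names and move the group_by inside the mapped function. This is only a sketch under assumptions. toy is a made-up stand-in for Project_Data with three measured columns, and it relies on pairwise_t_test accepting a formula object on grouped data in the same way it accepts the literal formula above:

```r
library(dplyr)
library(purrr)
library(rstatix)

set.seed(1)
# Hypothetical stand-in for Project_Data: class, subgroup, then numeric columns
toy <- data.frame(
  class    = factor(rep(c("Y", "N"), 40)),
  subgroup = factor(rep(c("A", "B", "C", "D"), each = 20)),
  m1 = rnorm(80), m2 = rnorm(80), m3 = rnorm(80)
)

measure_cols <- names(toy)[3:5]

# One grouped pairwise t-test per measured column, named by column
pwc_list <- map(
  set_names(measure_cols),
  ~ toy %>%
    group_by(subgroup) %>%
    pairwise_t_test(reformulate("class", response = .x),
                    p.adjust.method = "bonferroni")
)

pwc_list$m1
```

Each list element should then hold one row per subgroup, as in the single-variable output above.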
I want to create a re-usable function for a repeating t-test such that the column names can be passed into a formula. However, I cannot find a way to make it work. So the following code is the idea:
library(dplyr)
library(rstatix)
do.function <- function(table, column, category) {
column = sym(column)
category = sym(category)
stat.test <- table %>%
group_by(subset) %>%
t_test(column ~ category)
return(stat.test)
}
tmp = data.frame(id=seq(1:100), value = rnorm(100), subset = rep(c("Set1", "Set2"),each=50,2),categorical_value= rep(c("A", "B"),each=25,4))
do.function(table= tmp, column = "value", category = "categorical_value")
The current error that I get is the following:
Error: Can't extract columns that don't exist.
x Column `category` doesn't exist.
Run `rlang::last_error()` to see where the error occurred.
The question is whether somebody knows how to solve this?
Just make a formula instead of wrapping them in sym:
library(dplyr)
library(rstatix)
do.function <- function(table, column, category) {
formula <- paste0(column, '~', category) %>%
as.formula()
table %>%
group_by(subset) %>%
t_test(formula)
}
tmp = data.frame(id=seq(1:100), value = rnorm(100), subset = rep(c("Set1", "Set2"),each=50,2),categorical_value= rep(c("A", "B"),each=25,4))
do.function(table= tmp, column = "value", category = "categorical_value")
# A tibble: 2 x 9
subset .y. group1 group2 n1 n2 statistic df p
* <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
1 Set1 value A B 50 50 0.484 94.3 0.63
2 Set2 value A B 50 50 -2.15 97.1 0.034
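The reason the original attempt fails is that a formula quotes its terms: column ~ category always refers to variables literally named column and category, whatever those objects hold. A small base R sketch of the difference:

```r
column   <- "value"
category <- "categorical_value"

# Written literally, the formula keeps the argument names themselves
literal <- column ~ category
all.vars(literal)   # "column" "category"

# Built from the strings, it refers to the intended columns
built <- as.formula(paste0(column, " ~ ", category))
all.vars(built)     # "value" "categorical_value"
```

This matches the error message, which complains about a column called category rather than categorical_value.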
Since we are passing string values, we can use reformulate to build the formula directly
do.function <- function(table, column, category) {
stat.test <- table %>%
group_by(subset) %>%
t_test(reformulate(category, response = column ))
return(stat.test)
}
-testing
> do.function(table= tmp, column = "value", category = "categorical_value")
# A tibble: 2 × 9
subset .y. group1 group2 n1 n2 statistic df p
* <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
1 Set1 value A B 50 50 1.66 97.5 0.0993
2 Set2 value A B 50 50 0.448 92.0 0.655
rstatix::t_test already takes a formula, so we need to get the variables by their names. Note that this version omits group_by(subset), and .y. is reported as column rather than value.
do.function <- function(table, column, category) {
stat.test <- table %>%
mutate(column=get(column),
category=get(category)) %>%
rstatix::t_test(column ~ category)
return(stat.test)
}
do.function(table=tmp, column="value", category="categorical_value")
# # A tibble: 1 × 8
# .y. group1 group2 n1 n2 statistic df p
# * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
# 1 column A B 100 100 0.996 197. 0.32
I am trying to loop through all the cols in my df and run a prop test on each of them.
library(infer)  # gss is a data set that ships with infer
To run on just one variable I can use--
infer::prop_test(gss,
college ~ sex,
order = c("female", "male"))
But now I want to run this for each variable in my df like this:
cols <- gss %>% select(-sex) %>% names(.)
for (i in cols){
# print(i)
prop_test(gss,
i~sex)
}
But this loop does not recognize the i;
Error: The response variable `i` cannot be found in this dataframe.
Any suggestions please??
We need to build the formula from the string, because i ~ sex is taken literally as a column named i. Either use reformulate
library(infer)  # provides prop_test() and the gss data set
out <- vector('list', length(cols))
names(out) <- cols
for(i in cols) {
out[[i]] <- prop_test(gss, reformulate("sex", response = i))
}
-output
> out
$college
# A tibble: 1 × 6
statistic chisq_df p_value alternative lower_ci upper_ci
<dbl> <dbl> <dbl> <chr> <dbl> <dbl>
1 0.0000204 1 0.996 two.sided -0.0917 0.101
$partyid
# A tibble: 1 × 3
statistic chisq_df p_value
<dbl> <dbl> <dbl>
1 12.9 3 0.00484
$class
# A tibble: 1 × 3
statistic chisq_df p_value
<dbl> <dbl> <dbl>
1 2.54 3 0.467
$finrela
# A tibble: 1 × 3
statistic chisq_df p_value
<dbl> <dbl> <dbl>
1 9.11 5 0.105
or paste
for(i in cols) {
out[[i]] <- prop_test(gss, as.formula(paste0(i, " ~ sex")))
}
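The same loop pattern can be tried with base R alone. This sketch swaps in t.test and the built-in mtcars data (am stands in for sex, and num_cols and tests are names chosen here for illustration); it is not the infer workflow itself, only the formula-building part:

```r
# mtcars ships with base R; am (0/1) is the two-level grouping variable
num_cols <- c("mpg", "hp", "wt")

# Pre-allocate a named list so each result can be stored under its column
tests <- vector("list", length(num_cols))
names(tests) <- num_cols

for (i in num_cols) {
  # reformulate("am", response = i) builds e.g. mpg ~ am from the string in i
  tests[[i]] <- t.test(reformulate("am", response = i), data = mtcars)
}

# Collect one p-value per column
p_values <- sapply(tests, function(res) res$p.value)
p_values
```

Results are assigned into the list because values computed inside a for loop are not printed automatically.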
data
library(dplyr)
data(gss)
cols <- gss %>%
select(where(is.factor), -sex, -income) %>%
names(.)
I am trying to perform a series of T-tests using RStatix's t_test(), where the dependent variable is the same in every test and the grouping variable changes. I am doing these tests inside a loop, so I would like to select the grouping variable with the column number instead of the column name. I have tried to do this with colnames(dataframe)[[columnnumber]], but I get the following error: "Can't extract columns that don't exist". How can I select the grouping variable with the column number instead of the column name?
Below is a minimal reproducible example with a fictitious data frame; the test works correctly when the grouping variable's name (gender) is indicated, but not when the column number is indicated instead.
library(tidyverse)
library(rstatix)
dat<-data.frame(gender=rep(c("Male", "Female"), 1000),
age=rep(c("Young","Young", "Old", "Old"),500),
tot= round(runif(2000, min=0, max=1),0))
dat %>% t_test(tot ~ gender,detailed=T) ##Works
dat %>% t_test(tot ~ colnames(dat)[[1]],detailed=T) ##Doesn't work
colnames(dat)[1] is a string, but t_test requires a formula object, so you need to convert the string to a formula before passing it to t_test. This can be done using reformulate or as.formula.
library(rstatix)
dat %>% t_test(reformulate(colnames(dat)[1], 'tot'),detailed=T)
# A tibble: 1 x 15
# estimate estimate1 estimate2 .y. group1 group2 n1 n2 statistic
#* <dbl> <dbl> <dbl> <chr> <chr> <chr> <int> <int> <dbl>
#1 0.011 0.505 0.494 tot Female Male 1000 1000 0.492
# … with 6 more variables: p <dbl>, df <dbl>, conf.low <dbl>,
# conf.high <dbl>, method <chr>, alternative <chr>
If we want to construct it the tidyverse way, build the formula inside an expr and evaluate it
library(rstatix)
dat %>%
t_test(formula = eval(rlang::expr(tot ~ !! rlang::sym(names(.)[1]))),
detailed = TRUE)
# A tibble: 1 x 15
# estimate estimate1 estimate2 .y. group1 group2 n1 n2 statistic p df conf.low conf.high method alternative
#* <dbl> <dbl> <dbl> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#1 -0.02 0.497 0.517 tot Female Male 1000 1000 -0.894 0.371 1998. -0.0639 0.0239 T-test two.sided
NOTE: the values differ because the data was constructed without set.seed (tot comes from runif)
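For comparison, base R's bquote can splice a column name into a formula in much the same way as expr and !!. A minimal sketch on a small hand-made version of dat (the tot values here are made up):

```r
dat <- data.frame(gender = rep(c("Male", "Female"), 4),
                  age    = rep(c("Young", "Young", "Old", "Old"), 2),
                  tot    = c(0, 1, 1, 0, 1, 1, 0, 1))

# Turn the first column name into a symbol, splice it in with .( ),
# then evaluate the resulting call to get a formula object
grp <- as.name(names(dat)[1])
fml <- eval(bquote(tot ~ .(grp)))

all.vars(fml)   # "tot" "gender"
```

The resulting fml can be passed wherever the reformulate result is used above.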
I have a dataframe that I would like to rename several columns with similar name conventions (e.g., starts with "X") and/or column positions (e.g., 4:7). The new names of the columns are stored in a vector. How do I rename this columns in a dplyr chain?
# data
library(dplyr)
df <- tibble(RID = 1,Var1 = "A", Var2 = "B",old_name1 =4, old_name2 = 8, old_name3=20)
new_names <- c("new_name1","new_name2","new_name3")
# pseudocode
df %>%
rename_if(starts_with('old_name'), new_names)
An option with rename_at would be
df %>%
rename_at(vars(starts_with('old_name')), ~ new_names)
# A tibble: 1 x 6
# RID Var1 Var2 new_name1 new_name2 new_name3
# <dbl> <chr> <chr> <dbl> <dbl> <dbl>
#1 1.00 A B 4.00 8.00 20.0
But, it is possible to make a function that works with rename_if by creating a logical index on the column names
df %>%
rename_if(grepl("^old_name", names(.)), ~ new_names)
# A tibble: 1 x 6
# RID Var1 Var2 new_name1 new_name2 new_name3
# <dbl> <chr> <chr> <dbl> <dbl> <dbl>
#1 1.00 A B 4.00 8.00 20.0
In general, rename_if checks the values of the columns rather than the column names, i.e.
new_names2 <- c('var1', 'var2')
df %>%
rename_if(is.character, ~ new_names2)
# A tibble: 1 x 6
# RID var1 var2 old_name1 old_name2 old_name3
# <dbl> <chr> <chr> <dbl> <dbl> <dbl>
#1 1.00 A B 4.00 8.00 20.0
Update dplyr 1.0.0
dplyr 1.0.0 adds rename_with(), which takes a function as input. That function can be function(x) new_names; in other words, you can pass the purrr-style shorthand ~ new_names as the renaming function.
This is arguably the most elegant dplyr expression.
# shortest & most elegant expression
df %>% rename_with(~ new_names, starts_with('old_name'))
# A tibble: 1 x 6
RID Var1 Var2 new_name1 new_name2 new_name3
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 1 A B 4 8 20
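The question also asks about selecting by column position. As a sketch, rename_with accepts positions in .cols as well, and the function can derive the new names from the old ones instead of returning a fixed vector (the sub pattern is just an illustration for this df):

```r
library(dplyr)

df <- tibble(RID = 1, Var1 = "A", Var2 = "B",
             old_name1 = 4, old_name2 = 8, old_name3 = 20)
new_names <- c("new_name1", "new_name2", "new_name3")

# Select by position; new_names must have one entry per selected column
by_position <- df %>% rename_with(~ new_names, .cols = 4:6)

# Derive each new name from the old one, so no lookup vector is needed
by_pattern <- df %>%
  rename_with(~ sub("^old_", "new_", .x), starts_with("old_name"))

names(by_position)
names(by_pattern)
```

Deriving the names avoids depending on the order and length of new_names matching the selected columns.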