I want to create a reusable function for a repeated t-test such that the column names can be passed into a formula. However, I cannot find a way to make it work. The following code shows the idea:
library(dplyr)
library(rstatix)
do.function <- function(table, column, category) {
column = sym(column)
category = sym(category)
stat.test <- table %>%
group_by(subset) %>%
t_test(column ~ category)
return(stat.test)
}
tmp = data.frame(id=seq(1:100), value = rnorm(100), subset = rep(c("Set1", "Set2"),each=50,2),categorical_value= rep(c("A", "B"),each=25,4))
do.function(table= tmp, column = "value", category = "categorical_value")
The current error that I get is the following:
Error: Can't extract columns that don't exist.
x Column `category` doesn't exist.
Run `rlang::last_error()` to see where the error occurred.
The question is whether somebody knows how to solve this?
Just make a formula instead of wrapping them in sym:
library(dplyr)
library(rstatix)
do.function <- function(table, column, category) {
formula <- paste0(column, '~', category) %>%
as.formula()
table %>%
group_by(subset) %>%
t_test(formula)
}
tmp = data.frame(id=seq(1:100), value = rnorm(100), subset = rep(c("Set1", "Set2"),each=50,2),categorical_value= rep(c("A", "B"),each=25,4))
do.function(table= tmp, column = "value", category = "categorical_value")
# A tibble: 2 x 9
subset .y. group1 group2 n1 n2 statistic df p
* <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
1 Set1 value A B 50 50 0.484 94.3 0.63
2 Set2 value A B 50 50 -2.15 97.1 0.034
Because the column names are passed as strings, we can use reformulate to build the formula
do.function <- function(table, column, category) {
stat.test <- table %>%
group_by(subset) %>%
t_test(reformulate(category, response = column ))
return(stat.test)
}
-testing
> do.function(table= tmp, column = "value", category = "categorical_value")
# A tibble: 2 × 9
subset .y. group1 group2 n1 n2 statistic df p
* <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
1 Set1 value A B 50 50 1.66 97.5 0.0993
2 Set2 value A B 50 50 0.448 92.0 0.655
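To make the two formula-building approaches concrete, here is a minimal base R sketch (no rstatix needed) showing that reformulate() and as.formula(paste0(...)) build the same formula from the strings used above. The names f_reformulate and f_paste are illustrative only and do not appear in the answers.

```r
column <- "value"
category <- "categorical_value"

# reformulate() takes the right-hand-side terms first, then the response
f_reformulate <- reformulate(category, response = column)

# Pasting the pieces together and converting gives the same formula
f_paste <- as.formula(paste0(column, " ~ ", category))

print(f_reformulate)
print(all.vars(f_reformulate))  # variable names referenced by the formula
```

Either object can then be handed to a function that expects a formula, as t_test() does in the answers above.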
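rstatix::t_test() already takes a formula, so we only need to get the variables by their names. Note that this version drops the group_by(subset) step, and the response is labelled column in the output.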
do.function <- function(table, column, category) {
stat.test <- table %>%
mutate(column=get(column),
category=get(category)) %>%
rstatix::t_test(column ~ category)
return(stat.test)
}
do.function(table=tmp, column="value", category="categorical_value")
# # A tibble: 1 × 8
# .y. group1 group2 n1 n2 statistic df p
# * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
# 1 column A B 100 100 0.996 197. 0.32
I am trying to map a function to each row in a tibble. Please see code below. My desired workflow is as follows -
Convert a list with sub lists to a tibble
Map each row the tibble to a function
My desired output should be a list with a tibble as output for each row mapped to the function. See full code below -
# Packages
library(tidyverse)
library(purrr)
# Function i want to map
sample_func <- function(tib){
a <- tib$name
b <- tib$qty
c <- tib$price
d <- tib$add
e <- b+c+d
t <- tibble(e = c(e), stock = c(a))
return(t)
}
# Define the list with multiple sublists
lst <- list(c( "CHR1", 15, 222.14, 6), c( "CHR2", 10, 119.20, 10))
# Convert each sublist to a tibble and bind the rows
tib <- bind_rows(lapply(lst, function(x) {
tibble(name = x[1], qty = x[2] %>% as.numeric(), price = x[3] %>% as.numeric(),
add = x[4] %>% as.numeric())
}))
# Apply the function to each row in the tibble using map()
result <- tib %>%
rowwise() %>%
mutate(temp = map(list(name, qty, price, add), sample_func)) %>%
unnest(temp)
My desired output should be -
[[1]]
# A tibble: 1 × 2
e name
<dbl> <chr>
1 243. CHR1
[[2]]
# A tibble: 1 × 2
e name
<dbl> <chr>
1 139. CHR2
However, at the final rowwise mapping step, I get the following error -
Error in `mutate()`:
! Problem while computing `temp = map(list(name, qty, price, add), sample_func)`.
ℹ The error occurred in row 1.
Caused by error in `map()`:
ℹ In index: 1.
Caused by error in `tib$name`:
! $ operator is invalid for atomic vectors
What am I doing wrong here?
An alternative approach is to change the inputs of the sample_func function to be the names of the columns instead of the tibble, then you can do this with pmap():
# Function i want to map
sample_func <- function(name, qty, price, add){
a <- name
b <- qty
c <- price
d <- add
e <- b+c+d
t <- tibble(e = c(e), stock = c(a))
return(t)
}
# Define the list with multiple sublists
lst <- list(c( "CHR1", 15, 222.14, 6), c( "CHR2", 10, 119.20, 10))
# Convert each sublist to a tibble and bind the rows
tib <- bind_rows(lapply(lst, function(x) {
tibble(name = x[1], qty = x[2] %>% as.numeric(), price = x[3] %>% as.numeric(),
add = x[4] %>% as.numeric())
}))
# Apply the function to each row in the tibble using pmap()
pmap(tib, sample_func)
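For reference, the original error arises because map() over list(name, qty, price, add) calls the function once per list element, so tib is a bare value rather than a tibble. Below is a minimal base R sketch of the same failure, using lapply() as a stand-in for map(); the helper needs_tibble and the object row_values are hypothetical names introduced only for this illustration.

```r
# Stand-in for one row's values, as list(name, qty, price, add) would supply them
row_values <- list("CHR1", 15, 222.14, 6)

# A function that, like the original sample_func, expects something with $name
needs_tibble <- function(tib) tib$name

# lapply() (like map()) calls the function once per list element,
# so the first call is needs_tibble("CHR1") and `$` fails on the atomic vector
msg <- tryCatch(
  lapply(row_values, needs_tibble),
  error = function(e) conditionMessage(e)
)
print(msg)
```

pmap() avoids this because it passes the elements of each row as separate arguments in a single call.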
Instead of passing a tibble to the function, you may pass the columns of the tibble as vectors.
library(dplyr)
library(purrr)
sample_func <- function(name, qty, price, add){
res <- tibble(e = qty + price + add, stock = name)
return(res)
}
You may then use pmap -
out <- tib %>%
mutate(res = pmap(list(name, qty, price, add), sample_func))
out
# A tibble: 2 × 5
# name qty price add res
# <chr> <dbl> <dbl> <dbl> <list>
#1 CHR1 15 222. 6 <tibble [1 × 2]>
#2 CHR2 10 119. 10 <tibble [1 × 2]>
out$res
#[[1]]
# A tibble: 1 × 2
# e stock
# <dbl> <chr>
#1 243. CHR1
#[[2]]
# A tibble: 1 × 2
# e stock
# <dbl> <chr>
#1 139. CHR2
You may use unnest to get separate columns.
out %>% unnest(res)
# name qty price add e stock
# <chr> <dbl> <dbl> <dbl> <dbl> <chr>
#1 CHR1 15 222. 6 243. CHR1
#2 CHR2 10 119. 10 139. CHR2
We could just apply sample_func to the whole dataset, selected with pick(), and unnest the result
library(dplyr)
library(tidyr)
library(tibble) # for is_tibble()
tib %>%
transmute(temp = sample_func(pick(everything()))) %>%
unnest(where(is_tibble))
-output
# A tibble: 2 × 2
e stock
<dbl> <chr>
1 243. CHR1
2 139. CHR2
If we want it as a list of tibbles
tib %>%
rowwise %>%
reframe(temp = list(sample_func(pick(everything())))) %>%
pull(temp)
-output
[[1]]
# A tibble: 1 × 2
e stock
<dbl> <chr>
1 243. CHR1
[[2]]
# A tibble: 1 × 2
e stock
<dbl> <chr>
1 139. CHR2
To get your desired output and without changing your function or tibble we can use dplyr::rowwise() and dplyr::group_map().
With rowwise we tell 'dplyr' to treat each row as a group. With group_map we apply a function to each group (in our case row) and the function takes the data.frame of each group as input .x which fits your sample_func() perfectly.
library(dplyr)
tib %>%
rowwise() %>%
group_map(~ sample_func(.x))
#> [[1]]
#> # A tibble: 1 × 2
#> e stock
#> <dbl> <chr>
#> 1 243. CHR1
#>
#> [[2]]
#> # A tibble: 1 × 2
#> e stock
#> <dbl> <chr>
#> 1 139. CHR2
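A roughly equivalent base R route, sketched here as an alternative rather than as part of the answer above, is to split the tibble into one-row pieces and lapply() over them. The block recreates tib directly and uses a condensed version of sample_func so that it runs on its own; both are assumptions made for the sake of a self-contained example.

```r
library(tibble)

# Condensed version of the question's function: still takes a tibble as input
sample_func <- function(tib) {
  tibble(e = tib$qty + tib$price + tib$add, stock = tib$name)
}

tib <- tibble(
  name  = c("CHR1", "CHR2"),
  qty   = c(15, 10),
  price = c(222.14, 119.20),
  add   = c(6, 10)
)

# split() by row index yields a list of one-row tibbles
rows <- split(tib, seq_len(nrow(tib)))

# Apply the function to each one-row tibble; unname() drops the "1", "2" labels
result <- unname(lapply(rows, sample_func))
result
```

This keeps the function signature unchanged, as group_map() does, at the cost of building the intermediate list explicitly.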
Data from OP
library(tidyverse)
# Function i want to map
sample_func <- function(tib){
a <- tib$name
b <- tib$qty
c <- tib$price
d <- tib$add
e <- b+c+d
t <- tibble(e = c(e), stock = c(a))
return(t)
}
# Define the list with multiple sublists
lst <- list(c( "CHR1", 15, 222.14, 6), c( "CHR2", 10, 119.20, 10))
# Convert each sublist to a tibble and bind the rows
tib <- bind_rows(lapply(lst, function(x) {
tibble(name = x[1], qty = x[2] %>% as.numeric(), price = x[3] %>% as.numeric(),
add = x[4] %>% as.numeric())
}))
Created on 2023-02-12 with reprex v2.0.2
I have a function that generates a dataframe with 2 cols (X and Y).
I want to use map_dfc, but I would like to change the suffixes "...1", "...2" and so on that appear because the column names are the same.
I would like something like (X_df1, Y_df1, X_df2, Y_df2, ...). Is there a suffix parameter? I've read the documentation and couldn't find one.
I don't want to use map_dfr because I need the dataframe to be wide.
example_function <- function(n1,n2){
tibble(X = n1+n2,
Y = n1*n2)
}
values <- tibble(n1 = c(1,2),
n2 = c(5,6))
map2_dfc(values$n1, values$n2, example_function)
gives me
A tibble: 1 x 4
X...1 Y...2 X...3 Y...4
<dbl> <dbl> <dbl> <dbl>
1 6 5 8 12
And I want
A tibble: 1 x 4
X_df1 Y_df1 X_df2 Y_df2
<dbl> <dbl> <dbl> <dbl>
1 6 5 8 12
Thanks!
If we don't want to change the function, we can rename before binding the columns: use pmap to loop over the rows of the data and apply the function (example_function), then loop over the resulting list of tibbles with imap, rename all columns of each tibble with the list index, and finally use bind_cols
library(dplyr)
library(purrr)
library(stringr)
pmap(values, example_function) %>%
imap(~ {nm1 <- str_c('_df', .y)
rename_with(.x, ~ str_c(., nm1), everything())
}) %>%
bind_cols
-output
# A tibble: 1 × 4
X_df1 Y_df1 X_df2 Y_df2
<dbl> <dbl> <dbl> <dbl>
1 6 5 8 12
Or you could just build the new names first and apply them after you call map2_dfc():
library(purrr)
library(tibble)
example_function <- function(n1,n2){
tibble(X = n1+n2,
Y = n1*n2)
}
values <- tibble(n1 = c(1,2),
n2 = c(5,6))
# one set of names per call, i.e. per row of values
new_names <- lapply(seq_len(nrow(values)), function(x) paste0(c("X", "Y"), "_df", x)) %>%
unlist()
map2_dfc(values$n1, values$n2, example_function) %>%
setNames(new_names)
#> New names:
#> * X -> X...1
#> * Y -> Y...2
#> * X -> X...3
#> * Y -> Y...4
#> # A tibble: 1 x 4
#> X_df1 Y_df1 X_df2 Y_df2
#> <dbl> <dbl> <dbl> <dbl>
#> 1 6 5 8 12
Created on 2022-04-08 by the reprex package (v2.0.1)
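Another way to avoid the "New names" repair messages is to keep the pieces as a list, rename each piece, and only then bind. The sketch below assumes purrr 1.0.0 or later, where list_cbind() is available; pieces, renamed and wide are illustrative names.

```r
library(purrr)
library(tibble)

example_function <- function(n1, n2) {
  tibble(X = n1 + n2, Y = n1 * n2)
}
values <- tibble(n1 = c(1, 2), n2 = c(5, 6))

# One tibble per pair of inputs, kept as a list
pieces <- map2(values$n1, values$n2, example_function)

# Suffix each piece's columns with its position before binding
renamed <- imap(pieces, function(piece, i) {
  set_names(piece, paste0(names(piece), "_df", i))
})

# Bind column-wise; names are already unique, so no repair is needed
wide <- list_cbind(renamed)
names(wide)
```

Because the names are unique before binding, the suffixes do not depend on the order in which name repair would have numbered the columns.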
I am trying to loop through all the cols in my df and run a prop test on each of them.
library(gss)
To run on just one variable I can use--
infer::prop_test(gss,
college ~ sex,
order = c("female", "male"))
But now I want to run this for each variable in my df like this:
cols <- gss %>% select(-sex) %>% names(.)
for (i in cols){
# print(i)
prop_test(gss,
i~sex)
}
But this loop does not recognize the i;
Error: The response variable `i` cannot be found in this dataframe.
Any suggestions please??
We need to create the formula. Either use reformulate
library(infer) # the gss data set ships with infer; no separate gss package is needed
out <- vector('list', length(cols))
names(out) <- cols
for(i in cols) {
out[[i]] <- prop_test(gss, reformulate("sex", response = i))
}
-output
> out
$college
# A tibble: 1 × 6
statistic chisq_df p_value alternative lower_ci upper_ci
<dbl> <dbl> <dbl> <chr> <dbl> <dbl>
1 0.0000204 1 0.996 two.sided -0.0917 0.101
$partyid
# A tibble: 1 × 3
statistic chisq_df p_value
<dbl> <dbl> <dbl>
1 12.9 3 0.00484
$class
# A tibble: 1 × 3
statistic chisq_df p_value
<dbl> <dbl> <dbl>
1 2.54 3 0.467
$finrela
# A tibble: 1 × 3
statistic chisq_df p_value
<dbl> <dbl> <dbl>
1 9.11 5 0.105
or paste
for(i in cols) {
out[[i]] <- prop_test(gss, as.formula(paste0(i, " ~ sex")))
}
data
library(dplyr)
data(gss)
cols <- gss %>%
select(where(is.factor), -sex, -income) %>%
names(.)
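The loop can also be split into two steps: first build a named list of formulas, then apply the test to each. The base R sketch below covers only the first step, so it runs without infer; the cols vector is hard-coded here as an assumption, whereas in the thread it is derived from names(gss).

```r
# Hypothetical response columns; in the thread these come from names(gss)
cols <- c("college", "partyid", "class", "finrela")

# setNames() makes lapply() return a list named by column
formulas <- lapply(setNames(cols, cols), function(v) {
  reformulate("sex", response = v)
})

formulas$college
```

Each element could then be passed to the test in the same way as in the loop above, for example prop_test(gss, formulas[["college"]]).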
I want to use the ggpubr package referring to data frame column names that are listed in character strings in my global environment, but ggpubr doesn't seem to take variables, only hardcoded column names. Is there a way I can make any changes so it can do this?
vars = c('var1', 'var2')
controls = c('a', 'w')
df = data.frame(subject = 1:100,
value = rnorm(100, 100, 10),
var1 = rep(c('a', 'b'), 50),
var2 = rep(c('w', 'x', 'y', 'z'), 25))
library(ggpubr)
compare_means(value ~ vars, df, ref.group = 'a')
But I want to be able to replace 'vars' with vars[1], vars[2], etc., and likewise use ref.group = controls[1], controls[2]. Can I get ggpubr to refer to global environment objects instead of taking the input directly as column names?
We can use reformulate
library(ggpubr)
fml <- reformulate(vars[1], 'value')
compare_means(fml , df, ref.group = controls[1])
# A tibble: 1 x 8
# .y. group1 group2 p p.adj p.format p.signif method
# <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
#1 value a b 0.537 0.54 0.54 ns Wilcoxon
For multiple variables, each paired with its corresponding reference group, use Map from base R
Map(function(x, y) compare_means(reformulate(x, 'value'), df,
ref.group = y), vars, controls)
Or with map2 from purrr
library(purrr)
map2(vars, controls, ~ compare_means(reformulate(.x, 'value'), df,
ref.group = .y))
#[[1]]
# A tibble: 1 x 8
# .y. group1 group2 p p.adj p.format p.signif method
# <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
#1 value a b 0.537 0.54 0.54 ns Wilcoxon
#[[2]]
# A tibble: 3 x 8
# .y. group1 group2 p p.adj p.format p.signif method
# <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
#1 value w x 0.126 0.38 0.13 ns Wilcoxon
#2 value w y 0.985 1 0.98 ns Wilcoxon
#3 value w z 0.969 1 0.97 ns Wilcoxon
I have a dataframe in which I would like to rename several columns that share a naming convention (e.g., start with "X") and/or column positions (e.g., 4:7). The new names of the columns are stored in a vector. How do I rename these columns in a dplyr chain?
# data
df <- tibble(RID = 1,Var1 = "A", Var2 = "B",old_name1 =4, old_name2 = 8, old_name3=20)
new_names <- c("new_name1","new_name2","new_name3")
# pseudo code
df %>%
rename_if(starts_with('old_name'), new_names)
An option with rename_at would be
df %>%
rename_at(vars(starts_with('old_name')), ~ new_names)
# A tibble: 1 x 6
# RID Var1 Var2 new_name1 new_name2 new_name3
# <dbl> <chr> <chr> <dbl> <dbl> <dbl>
#1 1.00 A B 4.00 8.00 20.0
But, it is possible to make a function that works with rename_if by creating a logical index on the column names
df %>%
rename_if(grepl("^old_name", names(.)), ~ new_names)
# A tibble: 1 x 6
# RID Var1 Var2 new_name1 new_name2 new_name3
# <dbl> <chr> <chr> <dbl> <dbl> <dbl>
#1 1.00 A B 4.00 8.00 20.0
In general, rename_if checks the values of the columns rather than the column names, i.e.
new_names2 <- c('var1', 'var2')
df %>%
rename_if(is.character, ~ new_names2)
# A tibble: 1 x 6
# RID var1 var2 old_name1 old_name2 old_name3
# <dbl> <chr> <chr> <dbl> <dbl> <dbl>
#1 1.00 A B 4.00 8.00 20.0
Update dplyr 1.0.0
dplyr 1.0.0 added rename_with(), a companion to rename() that takes a function as input. That function can be function(x) new_names; in other words, you can use the purrr-style shorthand ~ new_names as the renaming function.
This gives, imho, the most elegant dplyr expression.
# shortest & most elegant expression
df %>% rename_with(~ new_names, starts_with('old_name'))
# A tibble: 1 x 6
RID Var1 Var2 new_name1 new_name2 new_name3
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 1 A B 4 8 20
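For comparison, the same renaming can be sketched in base R by assigning into names() through a logical index. This is an alternative outside a dplyr chain, not part of the answers above; it uses a plain data.frame so it runs without extra packages, and the length check is an added safeguard of my own.

```r
df <- data.frame(RID = 1, Var1 = "A", Var2 = "B",
                 old_name1 = 4, old_name2 = 8, old_name3 = 20)
new_names <- c("new_name1", "new_name2", "new_name3")

# Logical index of the columns to rename, selected by prefix
target <- startsWith(names(df), "old_name")

# Guard against a silent mismatch between selection and replacement names
stopifnot(sum(target) == length(new_names))

names(df)[target] <- new_names
names(df)
```

Selecting by position works the same way, for example names(df)[4:6] <- new_names. In both cases the new names are matched to columns purely by order.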