I am trying to loop through all the columns in my data frame and run a proportion test on each of them.
library(infer)
To run it on just one variable I can use:
infer::prop_test(gss,
college ~ sex,
order = c("female", "male"))
But now I want to run this for each variable in my df like this:
cols <- gss %>% select(-sex) %>% names(.)
for (i in cols){
# print(i)
prop_test(gss,
i~sex)
}
But the loop does not substitute the value of i into the formula:
Error: The response variable `i` cannot be found in this dataframe.
Any suggestions please??
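We need to build the formula from the column name, because i ~ sex is taken literally, so prop_test() looks for a column named i. One option is reformulate():
library(infer)  # provides prop_test() and the gss dataset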
out <- vector('list', length(cols))
names(out) <- cols
for(i in cols) {
out[[i]] <- prop_test(gss, reformulate("sex", response = i))
}
-output
> out
$college
# A tibble: 1 × 6
statistic chisq_df p_value alternative lower_ci upper_ci
<dbl> <dbl> <dbl> <chr> <dbl> <dbl>
1 0.0000204 1 0.996 two.sided -0.0917 0.101
$partyid
# A tibble: 1 × 3
statistic chisq_df p_value
<dbl> <dbl> <dbl>
1 12.9 3 0.00484
$class
# A tibble: 1 × 3
statistic chisq_df p_value
<dbl> <dbl> <dbl>
1 2.54 3 0.467
$finrela
# A tibble: 1 × 3
statistic chisq_df p_value
<dbl> <dbl> <dbl>
1 9.11 5 0.105
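Or build the formula with paste0() and as.formula(), again storing each result (a bare call inside a for loop is not printed):
for(i in cols) {
out[[i]] <- prop_test(gss, as.formula(paste0(i, " ~ sex")))
}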
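Both approaches end up with the same formula. A quick base R check (no infer needed; "college" is used only as an example column name) might look like this:

```r
# Build the same formula two ways from a column name stored as a string.
resp <- "college"                                    # example response column
f_reformulate <- reformulate("sex", response = resp)
f_paste <- as.formula(paste0(resp, " ~ sex"))

print(f_reformulate)
print(all.vars(f_reformulate))

# Compare the deparsed text of the two formulas.
same_text <- identical(deparse(f_reformulate), deparse(f_paste))
print(same_text)
```

Either object can be passed wherever a literal formula such as college ~ sex would go.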
data
library(dplyr)
data(gss)
cols <- gss %>%
select(where(is.factor), -sex, -income) %>%
names(.)
I have a function that generates a data frame with two columns (X and Y).
I want to use map_dfc, but I would like to change the suffixes "...1", "...2" and so on, which appear because the column names are the same.
I would like something like (X_df1, Y_df1, X_df2, Y_df2, ...). Is there a suffix parameter? I've read the documentation and couldn't find one.
I don't want to use map_dfr because I need the data frame to be wide.
example_function <- function(n1,n2){
tibble(X = n1+n2,
Y = n1*n2)
}
values <- tibble(n1 = c(1,2),
n2 = c(5,6))
map2_dfc(values$n1, values$n2, example_function)
gives me
A tibble: 1 x 4
X...1 Y...2 X...3 Y...4
<dbl> <dbl> <dbl> <dbl>
1 6 5 8 12
And I want
A tibble: 1 x 4
X_df1 Y_df1 X_df2 Y_df2
<dbl> <dbl> <dbl> <dbl>
1 6 5 8 12
Thanks!
If we don't want to change the function, we can rename the columns before binding them: use pmap to loop over the rows of the data and apply example_function, loop over the resulting list with imap, append the list index to every column name of each tibble, and then use bind_cols.
library(dplyr)
library(purrr)
library(stringr)
pmap(values, example_function) %>%
imap(~ {nm1 <- str_c('_df', .y)
rename_with(.x, ~ str_c(., nm1), everything())
}) %>%
bind_cols
-output
# A tibble: 1 × 4
X_df1 Y_df1 X_df2 Y_df2
<dbl> <dbl> <dbl> <dbl>
1 6 5 8 12
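The same rename-then-bind idea can be sketched in base R, with Map() and cbind() standing in for pmap(), imap() and bind_cols(). This is only an illustration of the technique, using data.frame instead of tibble so that it has no package dependencies:

```r
# Same toy function as in the question, returning a base data.frame.
example_function <- function(n1, n2) {
  data.frame(X = n1 + n2, Y = n1 * n2)
}
values <- data.frame(n1 = c(1, 2), n2 = c(5, 6))

# One data frame per row of values.
pieces <- Map(example_function, values$n1, values$n2)

# Append "_df<index>" to every column name of each piece.
pieces <- Map(
  function(d, i) setNames(d, paste0(names(d), "_df", i)),
  pieces, seq_along(pieces)
)

# Bind the renamed pieces side by side.
wide <- do.call(cbind, pieces)
print(wide)
```

Because the names are already unique when the pieces are bound, no "...1"-style repair is triggered.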
Or you could just build the new names first and apply them after you call map2_dfc():
library(purrr)
library(tibble)
example_function <- function(n1,n2){
tibble(X = n1+n2,
Y = n1*n2)
}
values <- tibble(n1 = c(1,2),
n2 = c(5,6))
# one suffix per row of values, because each row yields one data frame
new_names <- lapply(seq_len(nrow(values)), function(x) paste0(c("X", "Y"), "_df", x)) %>%
unlist()
map2_dfc(values$n1, values$n2, example_function) %>%
setNames(new_names)
#> New names:
#> * X -> X...1
#> * Y -> Y...2
#> * X -> X...3
#> * Y -> Y...4
#> # A tibble: 1 x 4
#> X_df1 Y_df1 X_df2 Y_df2
#> <dbl> <dbl> <dbl> <dbl>
#> 1 6 5 8 12
Created on 2022-04-08 by the reprex package (v2.0.1)
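The name vector can also be built in a single vectorised call. In this sketch n_frames and base_names are hypothetical names; n_frames stands for the number of data frames being bound, which is the number of rows of values in the example above:

```r
n_frames <- 3                      # hypothetical: number of data frames bound together
base_names <- c("X", "Y")          # column names returned by the function

# base_names is recycled, while each index is repeated once per base name.
new_names <- paste0(
  base_names, "_df",
  rep(seq_len(n_frames), each = length(base_names))
)
print(new_names)
```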
I want to create a re-usable function for a repeating t-test such that the column names can be passed into a formula. However, I cannot find a way to make it work. So the following code is the idea:
library(dplyr)
library(rstatix)
do.function <- function(table, column, category) {
column = sym(column)
category = sym(category)
stat.test <- table %>%
group_by(subset) %>%
t_test(column ~ category)
return(stat.test)
}
tmp = data.frame(id=seq(1:100), value = rnorm(100), subset = rep(c("Set1", "Set2"),each=50,2),categorical_value= rep(c("A", "B"),each=25,4))
do.function(table= tmp, column = "value", category = "categorical_value")
The current error that I get is the following:
Error: Can't extract columns that don't exist.
x Column `category` doesn't exist.
Run `rlang::last_error()` to see where the error occurred.
Does anybody know how to solve this?
Just build a formula from the strings instead of wrapping them in sym(); t_test(column ~ category) uses the names column and category literally:
library(dplyr)
library(rstatix)
do.function <- function(table, column, category) {
formula <- paste0(column, '~', category) %>%
as.formula()
table %>%
group_by(subset) %>%
t_test(formula)
}
tmp = data.frame(id=seq(1:100), value = rnorm(100), subset = rep(c("Set1", "Set2"),each=50,2),categorical_value= rep(c("A", "B"),each=25,4))
do.function(table= tmp, column = "value", category = "categorical_value")
# A tibble: 2 x 9
subset .y. group1 group2 n1 n2 statistic df p
* <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
1 Set1 value A B 50 50 0.484 94.3 0.63
2 Set2 value A B 50 50 -2.15 97.1 0.034
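To see why the original attempt failed, it may help to compare the variables each formula actually refers to. This is a base R sketch that needs no packages:

```r
column <- "value"
category <- "categorical_value"

# A literal formula keeps the symbols as written; it does not look up their values.
f_literal <- column ~ category

# Building the formula from the strings inserts the intended column names.
f_built <- as.formula(paste0(column, " ~ ", category))

print(all.vars(f_literal))
print(all.vars(f_built))
```

The first formula refers to columns named column and category, which is consistent with the "Column `category` doesn't exist" error.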
As we are passing string values, we can use reformulate to build the formula from them:
do.function <- function(table, column, category) {
stat.test <- table %>%
group_by(subset) %>%
t_test(reformulate(category, response = column ))
return(stat.test)
}
-testing
> do.function(table= tmp, column = "value", category = "categorical_value")
# A tibble: 2 × 9
subset .y. group1 group2 n1 n2 statistic df p
* <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
1 Set1 value A B 50 50 1.66 97.5 0.0993
2 Set2 value A B 50 50 0.448 92.0 0.655
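The same pattern works outside rstatix. Here is a minimal base R sketch that uses stats::t.test in place of rstatix::t_test and split() in place of group_by(); the function name run_t_tests and its group argument are made up for this illustration:

```r
set.seed(1)
tmp <- data.frame(
  value = rnorm(200),
  subset = rep(c("Set1", "Set2"), each = 50, times = 2),
  categorical_value = rep(c("A", "B"), each = 25, times = 4)
)

# Hypothetical helper: one t-test per level of the grouping column.
run_t_tests <- function(table, column, category, group = "subset") {
  f <- reformulate(category, response = column)   # e.g. value ~ categorical_value
  lapply(split(table, table[[group]]), function(d) t.test(f, data = d))
}

res <- run_t_tests(tmp, column = "value", category = "categorical_value")
print(names(res))
print(res$Set1$p.value)
```

Passing the grouping column as a string as well keeps the helper from depending on a column that happens to be called subset.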
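rstatix::t_test already takes a formula, so we just need to get the variables by their names. Note that this version leaves out group_by(subset), so it returns a single row and .y. is reported as column.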
do.function <- function(table, column, category) {
stat.test <- table %>%
mutate(column=get(column),
category=get(category)) %>%
rstatix::t_test(column ~ category)
return(stat.test)
}
do.function(table=tmp, column="value", category="categorical_value")
# # A tibble: 1 × 8
# .y. group1 group2 n1 n2 statistic df p
# * <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl>
# 1 column A B 100 100 0.996 197. 0.32
I have a dataframe (this is just a subset of the full frame):
library(dplyr)  # for as_tibble() and the verbs used below
Depth <- seq(0, 2, 0.2)
cps <- sample(48000:52000, 11)
Al <- rnorm(11)
Si <- rnorm(11)
Fe <- rnorm(11)
df <- as_tibble(cbind(Depth, cps, Al, Si, Fe))
When I use mutate_at to perform a function for only chosen variables the final df still contains the variables I chose to exclude. So,
df_norm <- df %>%
mutate_at(vars(-c(Depth, cps)), ~abs(log(./df$cps)))
performs the function on Al, Si and Fe, and df_norm is still an 11x5 tibble with Depth and cps unchanged from df. However, when I do something similar with summarise_at:
df_mean <- df %>%
summarise_at(vars(-c(Depth, cps)), mean)
the resulting data frame is only 1x3 instead of 1x5, i.e. it removed Depth and cps instead of just ignoring them for the averaging. Is there a different way I should write the vars argument to keep them?
EDIT
I would like my output to be a single observation (a 1x5 row) with all 5 variables, taken at the median Depth value (in this case 1).
With dplyr 1.0.0 or later (the development version at the time of writing), we can use summarise with across. It is not yet clear which values are wanted for 'Depth' and 'cps', so here they are kept as list columns:
library(dplyr)
df %>%
summarise(across(Al:Fe, mean), across(Depth:cps, list))
# A tibble: 1 x 5
# Al Si Fe Depth cps
# <dbl> <dbl> <dbl> <list> <list>
#1 -0.438 -0.118 -0.590 <dbl [11]> <dbl [11]>
Or to get the first row
df %>%
summarise(across(Al:Fe, mean), across(Depth:cps, first))
# A tibble: 1 x 5
# Al Si Fe Depth cps
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 -0.438 -0.118 -0.590 0 51432
Or to subset the median element of 'Depth'
df %>%
summarise(across(Al:Fe, mean), across(Depth:cps, ~ .[Depth == median(Depth)]))
# A tibble: 1 x 5
# Al Si Fe Depth cps
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 -0.438 -0.118 -0.590 1 51753
If we need the first row, then mutate and slice the first row
df %>%
mutate_at(vars(-c(Depth, cps)), mean) %>%
slice(1)
# A tibble: 1 x 5
# Depth cps Al Si Fe
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 0 51432 -0.438 -0.118 -0.590
Or if it needs to be the median row
df %>%
mutate_at(vars(-c(Depth, cps)), mean) %>%
filter(Depth == median(Depth))
# A tibble: 1 x 5
# Depth cps Al Si Fe
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 51753 -0.438 -0.118 -0.590
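With dplyr 1.0.0 or later, the mutate_at() step in the last two snippets can presumably be written with across() instead. This is a sketch on freshly simulated data, so the numbers will differ from the output shown above:

```r
library(dplyr)

set.seed(1)
df <- tibble(
  Depth = seq(0, 2, 0.2),
  cps   = sample(48000:52000, 11),
  Al    = rnorm(11),
  Si    = rnorm(11),
  Fe    = rnorm(11)
)

# Replace every column except Depth and cps by its mean,
# then keep only the row at the median depth.
df_mean <- df %>%
  mutate(across(-c(Depth, cps), mean)) %>%
  filter(Depth == median(Depth))

print(df_mean)
```

The filter relies on an exact match, so it assumes an odd number of rows; with an even number, median(Depth) falls between two rows and nothing is returned.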
I am trying to rename a few columns using dplyr::rename and tidyselect helpers to do so using some patterns.
How can I get this to work?
library(tidyverse)
# tidy output from broom (using development version)
(df <- broom::tidy(stats::oneway.test(formula = wt ~ cyl, data = mtcars)))
#> # A tibble: 1 x 5
#> num.df den.df statistic p.value method
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 2 19.0 20.2 0.0000196 One-way analysis of means (not assuming equ~
# renaming
df %>%
dplyr::rename(
.data = .,
parameter1 = dplyr::matches("^num"),
parameter2 = dplyr::matches("^denom")
)
#> Error: Column positions must be scalar
Created on 2020-01-12 by the reprex package (v0.3.0.9001)
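Your code works fine for me. Note, though, that in the output printed in the question the second column is den.df, which the pattern "^denom" does not match; if that is the name in your version, use "^den" instead. Here are some shorter alternatives you can try: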
library(tidyverse)
# tidy output from broom (using development version)
(df <- broom::tidy(stats::oneway.test(formula = wt ~ cyl, data = mtcars)))
#> # A tibble: 1 x 5
#> num.df den.df statistic p.value method
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 2 19.0 20.2 0.0000196 One-way analysis of means (not assuming equ~
# renaming
df %>%
rename(parameter1 = matches("^num"),
parameter2 = matches("^denom"))
# # A tibble: 1 x 5
# parameter1 parameter2 statistic p.value method
# <dbl> <dbl> <dbl> <dbl> <chr>
# 1 2 19.0 20.2 0.0000196 One-way analysis of means (not assuming..
df %>%
rename(parameter1 = contains("num"),
parameter2 = contains("denom"))
# # A tibble: 1 x 5
# parameter1 parameter2 statistic p.value method
# <dbl> <dbl> <dbl> <dbl> <chr>
# 1 2 19.0 20.2 0.0000196 One-way analysis of means (not assuming..
df %>%
rename(parameter1 = starts_with("num"),
parameter2 = starts_with("denom"))
# # A tibble: 1 x 5
# parameter1 parameter2 statistic p.value method
# <dbl> <dbl> <dbl> <dbl> <chr>
# 1 2 19.0 20.2 0.0000196 One-way analysis of means (not assuming..
We can also rename from a named vector
library(dplyr)
library(purrr)    # for set_names()
library(stringr)
df %>%
rename(!!!set_names(names(df)[1:2], str_c('parameter', 1:2)))
# A tibble: 1 x 5
# parameter1 parameter2 statistic p.value method
# <dbl> <dbl> <dbl> <dbl> <chr>
#1 2 19.0 20.2 0.0000196 One-way analysis of means (not assuming equal variances)
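Before renaming by pattern, it can be worth checking what each pattern matches. The sketch below uses base R grep() on the column names printed in the question; a pattern that matches nothing is one possible cause of the error, though that is an assumption about the asker's setup:

```r
# Column names as printed in the question's output.
cols <- c("num.df", "den.df", "statistic", "p.value", "method")

hits_num   <- grep("^num",   cols, value = TRUE)
hits_denom <- grep("^denom", cols, value = TRUE)
hits_den   <- grep("^den",   cols, value = TRUE)

print(hits_num)
print(hits_denom)   # empty: no column starts with "denom"
print(hits_den)
```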
I have a data frame in which I would like to rename several columns that share a naming convention (e.g., start with "X") and/or sit at known positions (e.g., 4:7). The new names are stored in a vector. How do I rename these columns in a dplyr chain?
# data
df <- tibble(RID = 1,Var1 = "A", Var2 = "B",old_name1 =4, old_name2 = 8, old_name3=20)
new_names <- c("new_name1","new_name2","new_name3")
# pseudocode
df %>%
rename_if(starts_with('old_name'), new_names)
An option with rename_at would be
df %>%
rename_at(vars(starts_with('old_name')), ~ new_names)
# A tibble: 1 x 6
# RID Var1 Var2 new_name1 new_name2 new_name3
# <dbl> <chr> <chr> <dbl> <dbl> <dbl>
#1 1.00 A B 4.00 8.00 20.0
It is also possible to use rename_if by passing a logical index built from the column names:
df %>%
rename_if(grepl("^old_name", names(.)), ~ new_names)
# A tibble: 1 x 6
# RID Var1 Var2 new_name1 new_name2 new_name3
# <dbl> <chr> <chr> <dbl> <dbl> <dbl>
#1 1.00 A B 4.00 8.00 20.0
In general, rename_if applies its predicate to the values of the columns rather than to the column names, e.g.
new_names2 <- c('var1', 'var2')
df %>%
rename_if(is.character, ~ new_names2)
# A tibble: 1 x 6
# RID var1 var2 old_name1 old_name2 old_name3
# <dbl> <chr> <chr> <dbl> <dbl> <dbl>
#1 1.00 A B 4.00 8.00 20.0
Update dplyr 1.0.0
rename() is now complemented by rename_with(), which takes a function as input. That function can be function(x) new_names; in other words, you can use the purrr-style short form ~ new_names as the renaming function. The vector must contain as many names as there are selected columns, in the same order.
This makes, imho, the most elegant dplyr expression.
# shortest & most elegant expression
df %>% rename_with(~ new_names, starts_with('old_name'))
# A tibble: 1 x 6
RID Var1 Var2 new_name1 new_name2 new_name3
<dbl> <chr> <chr> <dbl> <dbl> <dbl>
1 1 A B 4 8 20
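When the new names can be derived from the old ones, rename_with() can also be given a function that transforms each name, so the result does not depend on the order or length of a separate vector. A sketch, assuming dplyr 1.0.0 or later and using sub() for the transformation:

```r
library(dplyr)

df <- tibble(RID = 1, Var1 = "A", Var2 = "B",
             old_name1 = 4, old_name2 = 8, old_name3 = 20)

# Replace the "old_name" prefix with "new_name" in the selected columns only.
renamed <- df %>%
  rename_with(~ sub("^old_name", "new_name", .x), starts_with("old_name"))

print(names(renamed))
```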