R package "infer" - Iterative bootstrapping / looping over column names

I'm bootstrapping with the infer package.
The statistic of interest is the mean. The example data is a tibble with 3 columns and 5 rows; my real tibble has 86 rows and 40 columns. For every column I want to run a bootstrap simulation, as shown below for the column "x" in the tibble "test_tibble".
library(infer)
library(tidyverse)
test_tibble <- tibble(x = 1:5, y = 6:10, z = 11:15)
# A tibble: 5 x 3
x y z
<int> <int> <int>
1 1 6 11
2 2 7 12
3 3 8 13
4 4 9 14
5 5 10 15
specify(test_tibble, response = x) %>%
generate(reps = 100, type = "bootstrap") %>%
calculate(stat = "mean") %>%
summarise(
lower_CI = quantile(probs = 0.025, stat),
upper_CI = quantile(probs = 0.975, stat)
)
# A tibble: 1 x 2
lower_CI upper_CI
<dbl> <dbl>
1 2.10 4
I am now looking for a way of doing the same thing for the other columns in my tibble. I have tried a for-loop like this:
for (i in 1:ncol(test_tibble)){
var_name <- names(test_tibble)[i]
specify(test_tibble, response = var_name) %>%
generate(reps = 100, type = "bootstrap") %>%
calculate(stat = "mean") %>%
summarise(
lower_CI = quantile(probs = 0.025, stat),
upper_CI = quantile(probs = 0.975, stat)
)
}
Unfortunately, this returns the following error:
Error: The response variable `var_name` cannot be found in this dataframe.
Is there any way of iterating over the columns x, y and z without entering them manually as arguments for "response"? That'd be quite tedious for 40 columns.

This is a tricky question with a tricky answer.
Take a look at the response argument of the specify function in documentation:
The variable name in x that will serve as the response. This is alternative to using the formula argument.
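With this in mind, I modified the code to automate the process. The response argument expects a bare column name, so passing var_name makes specify() look for a column literally called var_name. Instead, I added one more column to the original data frame, a column of ones named w, and used it as the explanatory variable in the formula argument to obtain the same result.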
library(infer)
library(tidyverse)
test_tibble <- tibble(x = 1:5, y = 6:10, z = 11:15, w = rep(1, 5))
# collect one confidence interval per column, skipping the helper column w
results <- list()
for (var_name in setdiff(names(test_tibble), "w")) {
  results[[var_name]] <- specify(test_tibble, formula = eval(parse(text = paste0(var_name, "~", "w"))))[, 1] %>%
    generate(reps = 100, type = "bootstrap") %>%
    calculate(stat = "mean") %>%
    summarise(
      lower_CI = quantile(probs = 0.025, stat),
      upper_CI = quantile(probs = 0.975, stat)
    )
}
# a for loop does not print its results, so inspect the list afterwards
results
Hope it helps
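For comparison, here is a minimal sketch of the same idea in base R only, without the infer package. It is a plain percentile bootstrap, not the infer workflow from the question; the names boot_ci and ci_table are made up for this illustration, and a data frame stands in for the tibble.

```r
set.seed(42)  # only to make the sketch reproducible

test_df <- data.frame(x = 1:5, y = 6:10, z = 11:15)

# Hypothetical helper: resample one column with replacement `reps` times,
# take the mean of each resample, and return the 2.5% and 97.5% quantiles.
boot_ci <- function(values, reps = 100) {
  boot_means <- replicate(reps, mean(sample(values, replace = TRUE)))
  quantile(boot_means, probs = c(0.025, 0.975))
}

# sapply() visits every column; t() puts one row per column
ci_table <- t(sapply(test_df, boot_ci))
colnames(ci_table) <- c("lower_CI", "upper_CI")
ci_table
```

Because the column names never have to be passed as arguments, this scales to 40 columns without any change.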

Related

Calculate mean and standard deviation for subgroups

I want to calculate the mean and standard deviation for subgroups of every column in my dataset.
The membership of the subgroups is based on the values in the column of interest and these subgroups are specific to each column of interest.
# Example data
set.seed(1)
library(data.table)
df <- data.frame(baseline = runif(100), `Week0_12` = runif(100), `Week12_24` = runif(100))
So for column baseline, a row may be assigned to a different subgroup than for column Week0_12.
I can of course create these 'subgroup columns' manually for each column and then calculate the statistics for each column by column subgroup:
df$baseline_subgroup <- ifelse(df$baseline < 0.2, "subgroup_1", "subgroup_2")
df <- as.data.table(df)
df[, .(mean = mean(baseline), sd = sd(baseline)), by = baseline_subgroup]
Giving this output:
baseline_subgroup mean sd
1: subgroup_2 0.58059314 0.22670071
2: subgroup_1 0.09793105 0.05317809
Doing this for every column separately is too much repetition, especially given that I have many columns in my actual data.
df$Week0_12_subgroup <- ifelse(df$Week0_12 < 0.2, "subgroup_1", "subgroup_2")
df[, .(mean = mean(Week0_12), sd = sd(Week0_12)), by = Week0_12_subgroup]
df$Week12_24_subgroup <- ifelse(df$Week12_24 < 0.2, "subgroup_1", "subgroup_2")
df[, .(mean = mean(Week12_24), sd = sd(Week12_24)), by = Week12_24_subgroup]
What is a more elegant approach to do this?
Here's a tidyverse method that gives an easy-to-read and easy-to-plot output:
library(tidyverse)
set.seed(1)
df <- data.frame(baseline = runif(100),
`Week0_12` = runif(100),
`Week12_24` = runif(100))
df2 <- df %>%
summarize(across(everything(), list(mean_subgroup1 = ~mean(.x[.x < 0.2]),
sd_subgroup1 = ~sd(.x[.x < 0.2]),
mean_subgroup2 = ~mean(.x[.x >= 0.2]),
sd_subgroup2 = ~sd(.x[.x >= 0.2])))) %>%
pivot_longer(everything(), names_pattern = '^(.*)_(.*)_(.*$)',
names_to = c('time', 'measure', 'subgroup')) %>%
pivot_wider(names_from = measure, values_from = value)
df2
#> # A tibble: 6 x 4
#> time subgroup mean sd
#> <chr> <chr> <dbl> <dbl>
#> 1 baseline subgroup1 0.0979 0.0532
#> 2 baseline subgroup2 0.581 0.227
#> 3 Week0_12 subgroup1 0.117 0.0558
#> 4 Week0_12 subgroup2 0.594 0.225
#> 5 Week12_24 subgroup1 0.121 0.0472
#> 6 Week12_24 subgroup2 0.545 0.239
ggplot(df2, aes(time, mean, group = subgroup)) +
geom_errorbar(aes(ymin = mean - sd, ymax = mean + sd, color = subgroup),
width = 0.1) +
geom_point() +
theme_minimal(base_size = 16)
Created on 2022-07-14 by the reprex package (v2.0.1)
You could use lapply to apply a subgroup function to each column, collecting the summary tables in a list, i.e.
library(data.table)
subgroup <- function(x) {
  # x is the column that we are interested in
  current <- data.table(value = x, group = ifelse(x < 0.2, "subgroup_1", "subgroup_2"))
  current[, .(mean = mean(value), sd = sd(value)), by = group]
}
# lapply iterates over the selected columns and returns a named list of summary tables
value_cols <- c("baseline", "Week0_12", "Week12_24")
summaries <- lapply(as.data.frame(df)[value_cols], subgroup)
You can create a custom function and apply it using .SD, i.e.
library(data.table)
f1 <- function(x){
i_mean <- mean(x);
i_sd <- sd(x);
list(Avg = i_mean, standard_dev = i_sd)
}
setDT(df)[, unlist(lapply(.SD, f1), recursive = FALSE), by = baseline_subgroup][]
baseline_subgroup baseline.Avg baseline.standard_dev Week0.12.Avg Week0.12.standard_dev Week12.24.Avg Week12.24.standard_dev
1: subgroup_2 0.5950020 0.22556590 0.5332555 0.2651810 0.5467046 0.2912027
2: subgroup_1 0.1006693 0.04957005 0.5947161 0.2645519 0.5137543 0.3213723
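Another way to avoid one subgroup column per variable is to reshape to long format first, so that a single subgroup column is derived from the values themselves. The following is only a sketch under that assumption; the names dt, long and stats are chosen for this illustration.

```r
library(data.table)

set.seed(1)
dt <- data.table(baseline = runif(100),
                 Week0_12 = runif(100),
                 Week12_24 = runif(100))

# One row per (column, value) pair
long <- melt(dt, measure.vars = names(dt),
             variable.name = "column", value.name = "value")

# The subgroup depends only on the value, so one rule covers every column
long[, subgroup := ifelse(value < 0.2, "subgroup_1", "subgroup_2")]

# Mean and standard deviation per column and subgroup
stats <- long[, .(mean = mean(value), sd = sd(value)), by = .(column, subgroup)]
stats
```

The threshold of 0.2 is taken from the question; a per-column threshold would need a small lookup table joined onto long.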

I want to write a custom function with tidyverse verbs/syntax that accepts the grouping parameters of my function as strings

I want to write a function that takes as parameters a data set, a variable to group by, and a value to filter on. I want to write the function in such a way that I can afterwards apply map() to it and pass the grouping variables to map() as a character vector. However, I don't know how to make my custom function rating() accept the grouping variable as a string. This is what I have tried.
data = tibble(a = seq.int(1:10),
g1 = c(rep("blue", 3), rep("green", 3), rep("red", 4)),
g2 = c(rep("pink", 2), rep("hotpink", 6), rep("firebrick", 2)),
na = NA,
stat=c(23,43,53,2,43,18,54,94,43,87))
rating = function(data, by, no){
data %>%
select(a, {{by}}, stat) %>%
group_by({{by}}) %>%
mutate(rank = rank(stat)) %>%
ungroup() %>%
filter(a == no)
}
rating(data = data, by = g2, no = 5) # this works
And this is the way I want to use my function:
map(.x = c("g1", "g2"), .f = ~rating(data = data, by = .x, no = 1))
... but I get
Error: Must group by variables found in `.data`.
* Column `.x` is not found.
As we are passing character elements, it would be better to convert to symbol and evaluate (!!)
library(dplyr)
library(purrr)
rating <- function(data, by, no){
by <- rlang::ensym(by)
data %>%
select(a, !! by, stat) %>%
group_by(!!by) %>%
mutate(rank = rank(stat)) %>%
ungroup() %>%
filter(a == no)
}
Testing:
> map(.x = c("g1", "g2"), .f = ~rating(data = data, by = !!.x, no = 1))
[[1]]
# A tibble: 1 × 4
a g1 stat rank
<int> <chr> <dbl> <dbl>
1 1 blue 23 1
[[2]]
# A tibble: 1 × 4
a g2 stat rank
<int> <chr> <dbl> <dbl>
1 1 pink 23 1
It also works with unquoted input
> rating(data, by = g2, no = 5)
# A tibble: 1 × 4
a g2 stat rank
<int> <chr> <dbl> <dbl>
1 5 hotpink 43 3
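If the function only ever needs to accept strings, a sketch without any unquoting is possible using the tidyselect helper all_of(). The function name rating_chr is made up for this illustration, and unlike the version above it does not accept unquoted column names.

```r
library(dplyr)
library(purrr)

data <- tibble(a = 1:10,
               g1 = c(rep("blue", 3), rep("green", 3), rep("red", 4)),
               g2 = c(rep("pink", 2), rep("hotpink", 6), rep("firebrick", 2)),
               na = NA,
               stat = c(23, 43, 53, 2, 43, 18, 54, 94, 43, 87))

# Hypothetical string-only variant: `by` must be a character vector
rating_chr <- function(data, by, no) {
  data %>%
    select(a, all_of(by), stat) %>%
    group_by(across(all_of(by))) %>%
    mutate(rank = rank(stat)) %>%
    ungroup() %>%
    filter(a == no)
}

results <- map(c("g1", "g2"), ~ rating_chr(data, by = .x, no = 1))
results
```

Since the caller passes plain strings, the map() call needs no !! at the call site.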

R Quasiquotation & tidyeval for dynamic variable references in R in own functions

I'm trying to get my head around using quasiquotation from the tidyverse in R in my own functions. I've read this one here: Passing a list of arguments to a function with quasiquotation and the whole thing here: https://tidyeval.tidyverse.org/
But I still don't get it to work.
Assume I have the following data:
dat <- data.frame(time = runif(20),
group1 = rep(1:2, times = 10),
group2 = rep(1:2, each = 10),
group3 = rep(3:4, each = 10))
What I want to do now is to write a function that does the following:
take a data set
specify the variable that contains the time (note, in another data set this might be called "hours" or "qtime" or whatever)
specify by which groups I want to do operations/statistics on
So what I want the user to do is to use a function like:
test_function(data = dat, time_var = "time", group_vars = c("group1", "group3"))
Note, I might choose different grouping variables or none next time.
Let's say within the function I want to:
calculate certain statistics on the time variable, e.g. the quantiles. Note: I want to split this up by my grouping variables
Here's one of my latest tries:
test_function <- function(data, time_var = NULL, group_vars = NULL)
{
# Note I initialize the variables with NULL, since e.g. the user might not specify a grouping
# (and I want to check for that in my function at some point)
time_var <- enquo(time_var)
group_vars <- enquos(group_vars)
# Here I try to group by my grouping variables
temp_data <- data %>%
group_by_at(group_vars) %>%
mutate(!!sym(time_var) := !!sym(time_var) / 60)
# Here I'm calculating some stats
time_stats <- temp_data %>%
summarize_at(vars(!!time_var), list(p0.1_time = ~quantile(., probs = 0.1, na.rm = T),
p0.2_time = ~quantile(., probs = 0.2, na.rm = T),
p0.3_time = ~quantile(., probs = 0.3, na.rm = T),
p0.4_time = ~quantile(., probs = 0.4, na.rm = T),
p0.5_time = ~quantile(., probs = 0.5, na.rm = T),
p0.6_time = ~quantile(., probs = 0.6, na.rm = T),
p0.7_time = ~quantile(., probs = 0.7, na.rm = T),
p0.8_time = ~quantile(., probs = 0.8, na.rm = T),
p0.9_time = ~quantile(., probs = 0.9, na.rm = T),
p0.95_time = ~quantile(., probs = 0.95, na.rm = T)))
}
What is wrong with my code? Specifically, I struggle with !!, !!!, sym, enquo and enquos. Why doesn't group_by_at need !!, whereas my summarize and mutate calls do?
Make these changes:
use sym and syms rather than enquo and enquos
use !! and !!! respectively.
create po as a list and then use unnest_wider to expand it into columns
quantile is already vectorized so we don't need map
the mutate can be incorporated right into the quantile call eliminating it
consolidate the pipelines into a single pipeline
use TRUE rather than T since the latter can be masked by a variable of that name whereas no variable may be called TRUE.
we can use plain group_by and summarize
there is no group3 in the sample data so we used group2 instead
this does not make sense without time_var so remove the default of NULL
This gives the following code
library(dplyr)
library(tidyr)
test_function <- function(data, time_var, group_vars = NULL) {
  p <- c(1:9/10, 0.95)
  # turn the strings into symbols so they can be unquoted below
  time_var <- sym(time_var)
  group_vars <- syms(group_vars)
  data %>%
    group_by(!!!group_vars) %>%
    summarize(po = list(quantile(!!time_var / 60, p, na.rm = TRUE))) %>%
    ungroup() %>%
    unnest_wider(po)
}
test_function(data = dat, time_var = "time", group_vars = c("group1", "group2"))
giving:
# A tibble: 4 x 12
group1 group2 `10%` `20%` `30%` `40%` `50%` `60%` `70%` `80%` `90%` `95%`
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 0.00237 0.00432 0.00654 0.00903 0.0115 0.0120 0.0124 0.0133 0.0147 0.0154
2 1 2 0.00244 0.00251 0.00281 0.00335 0.00388 0.00410 0.00432 0.00493 0.00591 0.00640
3 2 1 0.00371 0.00381 0.00468 0.00632 0.00796 0.0101 0.0122 0.0136 0.0143 0.0147
4 2 2 0.00385 0.00538 0.00630 0.00660 0.00691 0.00725 0.00759 0.00907 0.0117 0.0130
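To see the sym/!! and syms/!!! pairing in isolation, here is a smaller sketch that computes only a mean. The function name mean_time_by is made up for this illustration. It also shows the case where no grouping variables are given: syms(NULL) is an empty list, so group_by() receives no arguments and the summary covers the whole data set.

```r
library(dplyr)

set.seed(1)
dat <- data.frame(time = runif(20),
                  group1 = rep(1:2, times = 10),
                  group2 = rep(1:2, each = 10))

# Hypothetical reduced version of test_function
mean_time_by <- function(data, time_var, group_vars = NULL) {
  data %>%
    group_by(!!!syms(group_vars)) %>%                            # many strings -> syms + !!!
    summarize(mean_time = mean(!!sym(time_var)), .groups = "drop")  # one string -> sym + !!
}

grouped <- mean_time_by(dat, "time", c("group1", "group2"))
overall <- mean_time_by(dat, "time")
grouped
overall
```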

Conduct Multiple T-Tests in R, Condensed

I wish to conduct multiple t-tests in R without having to copy and paste each test. Each test checks whether "Level_#" differs between the two values of "Type" ("Left" or "Right"). Currently, I might have:
t.test(Level_1 ~ Type, alternative="two.sided", conf.level=0.99)
t.test(Level_2 ~ Type, alternative="two.sided", conf.level=0.99)
Type Level_1 Level_2 Level_3
Left 17 50 98
Right 18 65 65
Left 23 7 19
Left 65 7 100
Right 9 13 17
The issue is that I have hundreds of "Level_#" and would like to know how to automate this process and output a data frame of the results. My thought is to somehow incorporate an apply function.
You can do this with a tidyverse approach, using the purrr and broom packages.
library(tidyverse)
library(broom)
df %>%
  # one row per (type, variable, value), then one nested data frame per variable
  pivot_longer(-type, names_to = "var", values_to = "level") %>%
  nest(data = -var) %>%
  mutate(model = map(data, function(x) {
           t.test(level ~ type, alternative = "two.sided", conf.level = 0.99,
                  data = x)}),
         value = map(model, tidy),
         conf.low = map_dbl(value, "conf.low"),
         conf.high = map_dbl(value, "conf.high"),
         pvalue = map_dbl(value, "p.value")) %>%
  select(-data, -model, -value)
Output:
var conf.low conf.high pvalue
1 level1 -3.025393 4.070641 0.6941518
2 level2 -3.597754 3.356125 0.9260015
3 level3 -3.955293 3.673493 0.9210724
Sample data:
set.seed(123)
df <- data.frame(type = rep(c("left", "right"), 25),
level1 = rnorm(50, mean = 85, sd = 5),
level2 = rnorm(50, mean = 75, sd = 5),
level3 = rnorm(50, mean = 65, sd = 5))
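Since the question mentions an apply-style solution, here is a rough base R sketch of the same loop. It builds each formula from the column name with reformulate() and collects the results in a data frame; the names level_cols and results are chosen for this illustration, and the sample data is the one defined above.

```r
set.seed(123)
df <- data.frame(type = rep(c("left", "right"), 25),
                 level1 = rnorm(50, mean = 85, sd = 5),
                 level2 = rnorm(50, mean = 75, sd = 5),
                 level3 = rnorm(50, mean = 65, sd = 5))

# Every column except the grouping column gets its own t-test
level_cols <- setdiff(names(df), "type")

results <- do.call(rbind, lapply(level_cols, function(col) {
  # reformulate("type", response = col) builds e.g. level1 ~ type
  fit <- t.test(reformulate("type", response = col), data = df,
                alternative = "two.sided", conf.level = 0.99)
  data.frame(var = col,
             conf.low = fit$conf.int[1],
             conf.high = fit$conf.int[2],
             p.value = fit$p.value)
}))
results
```

With hundreds of columns, only level_cols changes; the loop body stays the same.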

Unnest fitted glm models

I have a tibble with nested glm models. I nest over a variable (region) and run a function region_model that fits the model.
# toy data
test_data = data.frame(region = sample(letters[1:3], 1000, replace = TRUE),
x = sample(0:1, 1000, replace = TRUE),
y = sample(1:100, 1000, replace = TRUE),
z = sample(0:1, 1000, replace = TRUE)) %>% arrange(region)
# nest
by_region = test_data %>%
group_by(region) %>%
nest()
# glm function
region_model <- function(df) {
glm(x ~ y + z, data = df, family = "binomial")
}
# run the model
by_region = by_region %>% mutate(mod_rat = data %>% map(region_model))
The resulting tibble looks like this:
> by_region
# A tibble: 3 x 3
region data mod_rat
<fctr> <list> <list>
1 a <tibble [352 x 3]> <S3: glm>
2 b <tibble [329 x 3]> <S3: glm>
3 c <tibble [319 x 3]> <S3: glm>
My purpose is to unnest the models to calculate marginal effects. I have tried it and I have got this error:
> unnest(by_region, mod_rat)
Error: Each column must either be a list of vectors or a list of data frames [mod_rat]
I wonder whether it is possible to use unnest on this type of object (<S3: glm>) and, if not, whether there is an alternative way to get these estimates.
As it happens, the margins package has had some recent updates which will help you do this in a tidy fashion. In particular a margins_summary() function has been added that can be mapped onto nested model objects.
This issue on GitHub has the details.
Here is some code that works with your example
Using data from above
library(tidyverse)
library(magrittr)
library(margins)
# toy data
test_data <- data.frame(region = sample(letters[1:3], 1000, replace = TRUE),
x = sample(0:1, 1000, replace = TRUE),
y = sample(1:100, 1000, replace = TRUE),
z = sample(0:1, 1000, replace = TRUE)) %>%
arrange(region)
# nest
by_region <-
test_data %>%
group_by(region) %>%
nest()
# glm function
region_model <- function(df) {
glm(x ~ y + z, data = df, family = "binomial")
}
# run the model
by_region %<>%
mutate(mod_rat = map(data, region_model))
Using the margins_summary() function via purrr::map2() to compute marginal effects (I have included both methods for calculating the marginal effects with logistic regression as described in the package vignette):
by_region %<>%
mutate(marginals = map2(mod_rat, data, ~margins_summary(.x, data = .y)),
marginals_link = map2(mod_rat, data, ~margins_summary(.x, data = .y, type = "link")))
We can now unnest either of the created list columns with the marginal effect data
by_region %>%
unnest(marginals) -> region_marginals
region_marginals
# A tibble: 6 x 8
region factor AME SE z p
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 a y -9.38e-4 9.71e-4 -0.966 0.334
2 a z 3.59e-2 5.55e-2 0.647 0.517
3 b y 1.14e-3 9.19e-4 1.24 0.215
4 b z -2.93e-2 5.38e-2 -0.545 0.586
5 c y 4.67e-4 9.77e-4 0.478 0.633
6 c z -3.32e-2 5.49e-2 -0.604 0.546
# ... with 2 more variables: lower <dbl>,
# upper <dbl>
And plot nicely
region_marginals %>%
ggplot(aes(reorder(factor, AME), AME, ymin = lower, ymax = upper)) +
geom_hline(yintercept = 0, colour = "#AAAAAA") +
geom_pointrange() +
facet_wrap(~region) +
coord_flip()
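The unnest() call succeeds here because each list element is now a data frame rather than a glm object. The same pattern applies to any function that turns a model into a data frame. As a sketch, broom::tidy() can be mapped over the models to unnest the raw coefficients; note that these are coefficients on the link scale, not marginal effects, and the names coefs and region_coefs are chosen for this illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(1)
test_data <- data.frame(region = sample(letters[1:3], 1000, replace = TRUE),
                        x = sample(0:1, 1000, replace = TRUE),
                        y = sample(1:100, 1000, replace = TRUE),
                        z = sample(0:1, 1000, replace = TRUE))

by_region <- test_data %>%
  nest(data = -region) %>%
  mutate(mod_rat = map(data, ~ glm(x ~ y + z, data = .x, family = "binomial")),
         coefs = map(mod_rat, tidy))   # one data frame of coefficients per model

# A list of data frames can be unnested, unlike a list of glm objects
region_coefs <- by_region %>%
  select(region, coefs) %>%
  unnest(coefs)
region_coefs
```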
