How to create confidence intervals for multiple columns using purrr and dplyr packages in R? - dictionary

I would love to have your inputs about how to use the purrr package for multiple columns. In specific, I want to do some basic operations to create the confidence interval (lower and upper) for each of the variables mass and birth_year by skin_color, from the starwars database.
What I have done so far is:
Calculate the number of observations different to NA foreach of the columns of my interest (mass and birth_year) by skin_color.
pacman::p_load("tidyr","purrr")
data <- starwars
data_obs <-
data %>%
dplyr::select(mass,birth_year,skin_color) %>%
dplyr::group_by(skin_color)%>%
dplyr::summarise_all(funs(sum(!is.na(.))))%>%
dplyr::ungroup()
I created a database that estimates the mean and standard deviation foreach variable of interest by skin_color.
data_stats <-
data %>%
dplyr::select(mass,birth_year,skin_color)%>%
dplyr::group_by(skin_color) %>%
dplyr::summarise_all(., list(mean,sd)
, na.rm=T
)%>%
dplyr::ungroup()
I merged both databases and in that way I have the number of observations different from NA, the mean, and the sd foreach of the columns.
data_complete <-
dplyr::inner_join(data_obs,data_stats, by="skin_color")
From here, it is easy to estimate the standard error foreach variable manually by:
data_complete <-
dplyr::mutate(mass_se = mass_sd/sqrt(mass_n),
mass_ci_upper = mass_mean + qt(1 - (0.05 / 2), mass_n - 1)*mass_se,
mass_ci_lower = mass_mean - qt(1 - (0.05 / 2), mass_n - 1)*mass_se)
However, since this is a lot of work for my real dataset (with more than 50 variables), I would like to use the purrr package. I have tried by doing:
list_vectors <- list(data$mass,data$birth_year)
list_ready <- map(list_vectors,
~ data %>%
group_by(skin_color)%>%
dplyr::summarise_all(funs(sum(!is.na(.))))%>%
dplyr::summarise_all(., list(mean,sd), na.rm=T) %>%
dplyr::ungroup()%>%
dplyr::mutate(var_se=var_sd/sqrt(var_n)))
vector_1 <- list_ready[[1]]
But this doesn't work. Any help is really really appreciated! (I really want to use the purrr package).

Simplest way might be to a) put your calculation steps into a function processing a vector and returning a list of a tibble with the values you need and b) passing this into across instead (using iris as an example instead):
library(tidyverse)
mean_ci <- function(vars) {
vars <- vars[!is.na(vars)]
mn <- mean(vars)
se <- sd(vars) / sqrt(length(vars))
tibble(
mean = mn,
lower = mn - qt(1 - (0.05 / 2), length(vars) - 1) * se,
upper = mn + qt(1 - (0.05 / 2), length(vars) - 1) * se
)
}
iris |>
group_by(Species) |>
summarise(across(where(is.numeric), mean_ci)) |>
unnest(where(is_tibble), names_sep = "_")
#> # A tibble: 3 × 13
#> Species Sepal.Len…¹ Sepal…² Sepal…³ Sepal…⁴ Sepal…⁵ Sepal…⁶ Petal…⁷ Petal…⁸
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 5.01 4.91 5.11 3.43 3.32 3.54 1.46 1.41
#> 2 versicolor 5.94 5.79 6.08 2.77 2.68 2.86 4.26 4.13
#> 3 virginica 6.59 6.41 6.77 2.97 2.88 3.07 5.55 5.40
#> # … with 4 more variables: Petal.Length_upper <dbl>, Petal.Width_mean <dbl>,
#> # Petal.Width_lower <dbl>, Petal.Width_upper <dbl>, and abbreviated variable
#> # names ¹​Sepal.Length_mean, ²​Sepal.Length_lower, ³​Sepal.Length_upper,
#> # ⁴​Sepal.Width_mean, ⁵​Sepal.Width_lower, ⁶​Sepal.Width_upper,
#> # ⁷​Petal.Length_mean, ⁸​Petal.Length_lower
A more purrr-y way to do it might be to map the function to nested dataframes to create a slightly longer-form data output:
iris |>
nest(data = -Species) |>
mutate(data = map(data, ~tibble(metric = names(.x), map_df(.x, mean_ci)))) |>
unnest(data)
#> # A tibble: 12 × 5
#> Species metric mean lower upper
#> <fct> <chr> <dbl> <dbl> <dbl>
#> 1 setosa Sepal.Length 5.01 4.91 5.11
#> 2 setosa Sepal.Width 3.43 3.32 3.54
#> 3 setosa Petal.Length 1.46 1.41 1.51
#> 4 setosa Petal.Width 0.246 0.216 0.276
#> 5 versicolor Sepal.Length 5.94 5.79 6.08
#> 6 versicolor Sepal.Width 2.77 2.68 2.86
#> 7 versicolor Petal.Length 4.26 4.13 4.39
#> 8 versicolor Petal.Width 1.33 1.27 1.38
#> 9 virginica Sepal.Length 6.59 6.41 6.77
#> 10 virginica Sepal.Width 2.97 2.88 3.07
#> 11 virginica Petal.Length 5.55 5.40 5.71
#> 12 virginica Petal.Width 2.03 1.95 2.10

Related

generate confidence interval for binned data

I am trying to calculate binomial proportion confidence interval for my data. Here I just use the iris dataset.
# break the length into different size
iris$Slen <- cut(iris$Sepal.Length,
breaks = c(4,5,6,8))
#need to compute confidence interval for binomial proportion within each group and size bin
alpha <- as.numeric(0.05)
#function to calculate conf.int.
CI <-function(n,r) {
f1=qf(1-alpha/2, 2*r, 2*(n-r+1), lower.tail=FALSE)
f2=qf(alpha/2, 2*(r+1), 2*(n-r), lower.tail=FALSE)
pl=(1+(n-r+1)/(r*f1))^(-1)
pu=(1+(n-r)/((r+1)*f2))^(-1)
}
First I calculate the number of species in each of the size bin and then calculate the relative proportion using the code below:
library(dplyr)
iris_count <- iris %>%
group_by(Slen, Species) %>%
summarise(TotalParticle = n())%>%
mutate(RelAbund = TotalParticle/sum(TotalParticle))
Now I want to calculate the CI for the iris_count data. n is the total number in each bin and r is the RelAbund*n in the CI function. Ex. for Slen (4,5] setosa n is 28+3+1 and r is 0.875*32.
How can I compute this directly and not manually for each case?
iris$Slen <- cut(iris$Sepal.Length,
breaks = c(4,5,6,8))
alpha <- as.numeric(0.05)
CI <-function(n,r) {
f1=qf(1-alpha/2, 2*r, 2*(n-r+1), lower.tail=FALSE)
f2=qf(alpha/2, 2*(r+1), 2*(n-r), lower.tail=FALSE)
pl=(1+(n-r+1)/(r*f1))^(-1)
pu=(1+(n-r)/((r+1)*f2))^(-1)
c(pl, pu) # note how I changed your function here so it returns the output you're looking for
}
library(dplyr)
iris_count <- iris %>%
group_by(Slen, Species) %>%
summarise(TotalParticle = n())%>%
mutate(RelAbund = TotalParticle/sum(TotalParticle))
totals <- aggregate(iris_count$TotalParticle, list(iris_count$Slen), sum)
colnames(totals)[which(colnames(totals) == 'Group.1')] <- 'Slen'
new_df <- as.data.frame(merge(iris_count, totals, 'Slen'))
new_df$x <- as.numeric(new_df$x)
colnames(new_df)
confint <- list(NULL)
for (i in seq_len(nrow(new_df))) {
confint[[i]] <- CI(new_df[i, which(colnames(new_df) == "x")], new_df[i, which(colnames(new_df) == "RelAbund")] * new_df[i, which(colnames(new_df) == "x")])
}
new_df$lower_CI <- sapply(confint, function (x) {
x[1]
})
new_df$upper_CI <- sapply(confint, function (x) {
x[2]
})
new_df
# Slen Species TotalParticle RelAbund x lower_CI upper_CI
# 1 (4,5] setosa 28 0.8750000 32 0.7100515798 0.9648693
# 2 (4,5] versicolor 3 0.0937500 32 0.0197671802 0.2502270
# 3 (4,5] virginica 1 0.0312500 32 0.0007908686 0.1621710
# 4 (5,6] setosa 22 0.3859649 57 0.2599546810 0.5242516
# 5 (5,6] versicolor 27 0.4736842 57 0.3398483226 0.6103478
# 6 (5,6] virginica 8 0.1403509 57 0.0625948901 0.2579454
# 7 (6,8] versicolor 20 0.3278689 61 0.2130663358 0.4600191
# 8 (6,8] virginica 41 0.6721311 61 0.5399808516 0.7869337
You may be reinventing the wheel here. The built in function prop.test will give you binomial confidence intervals. Here's a full reprex:
library(tidyverse)
iris$Slen <- cut(iris$Sepal.Length,
breaks = c(4,5,6,8))
iris %>%
group_by(Slen, Species) %>%
summarise(TotalParticle = n()) %>%
mutate(RelAbund = TotalParticle/sum(TotalParticle),
lower = sapply(TotalParticle,
function(x) prop.test(x, sum(TotalParticle))$conf.int[1]),
upper = sapply(TotalParticle,
function(x) prop.test(x, sum(TotalParticle))$conf.int[2]))
#> `summarise()` has grouped output by 'Slen'. You can override using the
#> `.groups` argument.
#> # A tibble: 8 x 6
#> # Groups: Slen [3]
#> Slen Species TotalParticle RelAbund lower upper
#> <fct> <fct> <int> <dbl> <dbl> <dbl>
#> 1 (4,5] setosa 28 0.875 0.701 0.959
#> 2 (4,5] versicolor 3 0.0938 0.0245 0.262
#> 3 (4,5] virginica 1 0.0312 0.00163 0.180
#> 4 (5,6] setosa 22 0.386 0.263 0.524
#> 5 (5,6] versicolor 27 0.474 0.342 0.609
#> 6 (5,6] virginica 8 0.140 0.0668 0.263
#> 7 (6,8] versicolor 20 0.328 0.216 0.461
#> 8 (6,8] virginica 41 0.672 0.539 0.784
Created on 2022-06-20 by the reprex package (v2.0.1)

R: paired t-test on multiple columns

I am trying to run a t-test on multiple columns. Basically trying to find the change from baseline to year 1 for a number of joint angles. I only want to conduct this on the study side. Below is an image with the first few rows and columns of the data. Sample Data
I have tried using both of these functions without success:
Code 1:
res <- FAI_SLS %>%
filter(study_side == "Study")%>%
select(-id,-subject,-activity,-side,-study_side,-year) %>%
map_df(~ broom::tidy(t.test(. ~ year)), .id = 'var')
I get the following error:
Error in eval(predvars, data, env) : object 'year' not found
I tried taking out -year but I still have the same issue.
Code 2:
t(sapply(FAI_SLS%>%filter(study_side == "Study")%>%select(-id,-subject,-activity,-side,-study_side,-year), function(x)
unlist(t.test(x~FAI_SLS$year)[c("estimate","p.value","statistic","conf.int")])))
I get the following error:
Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 't': variable lengths differ (found for 'FAI_SLS$year')
Again I tried taking -year out without success.
Any suggestions on how I can fix this? Thanks
Try fitting the t-test within summarise() on all the columns you want to test (selected in across()). Here's an example with a different dataset:
library(dplyr)
library(tidyr)
data("storms")
storms %>%
filter(year %in% c(2019, 2020)) %>%
summarise(across(-c(name, year, status, category),
~broom::tidy(t.test(. ~ year)))) %>%
pivot_longer(everything(), names_to = "variable") %>%
unnest(value)
#> # A tibble: 9 × 11
#> variable estimate estimate1 estimate2 statistic p.value parameter conf.low
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 month 0.0917 8.93 8.84 1.15 2.52e- 1 892. -0.0654
#> 2 day 4.29 18.2 13.9 7.49 2.34e-13 641. 3.17
#> 3 hour -0.0596 9.13 9.19 -0.128 8.99e- 1 687. -0.978
#> 4 lat 2.14 25.9 23.7 3.75 1.94e- 4 668. 1.02
#> 5 long 6.06 -60.7 -66.8 4.27 2.25e- 5 736. 3.27
#> 6 wind 8.42 58.8 50.4 4.42 1.18e- 5 529. 4.68
#> 7 pressure -4.46 989. 993. -3.03 2.59e- 3 537. -7.35
#> 8 tropicalst… 7.39 153. 145. 0.810 4.18e- 1 701. -10.5
#> 9 hurricane_… 10.9 24.1 13.2 3.92 1.02e- 4 508. 5.45
#> # … with 3 more variables: conf.high <dbl>, method <chr>, alternative <chr>
Created on 2022-06-02 by the reprex package (v2.0.1)

broom::tidy fails on multinomial regression

I'm trying to run a multinomial logistic regression in R using tidymodels but I can't convert my results to a tidy object. Here's a sample using the iris data set.
# Multinomial -----------------------------------------------------------------
# recipe
multinom_recipe <-
recipe(Species ~ Sepal.Length + Sepal.Width + Sepal.Length + Petal.Width, data = iris) %>%
step_relevel(Species, ref_level = "setosa")
# model
multinom_model <- multinom_reg() %>%
set_engine("nnet")
# make workflow
multinom_wf <-
workflow() %>%
add_model(multinom_model) %>%
add_recipe(multinom_recipe) %>%
fit(data = iris) %>%
tidy()
multinom_wf
The last step throws the following error:
Error in eval(predvars, data, env) : object '..y' not found
I thought it was bc the output of the fit(data = iris) is a workflow object, but this code seems to work fine when I don't use workflow (which is the whole point of using tidymodels) or if I fit a linear model.
# recipe
linear_recipe <-
recipe(Sepal.Length ~ Sepal.Width + Sepal.Length + Petal.Width, data = iris)
# model
linear_model <- linear_reg() %>%
set_engine("lm")
# make workflow
linear_wf <-
workflow() %>%
add_model(linear_model) %>%
add_recipe(linear_recipe) %>%
fit(data = iris) %>%
tidy()
linear_wf
Anyone have an idea as to what I'm missing or is this a bug?
It could be a clash with the call. We could change it to
multinom_wf$fit$fit$fit$call <- quote(nnet::multinom(formula = Species ~ ., data = iris, trace = FALSE))
multinom_wf %>%
tidy
-output
# A tibble: 8 x 6
y.level term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 versicolor (Intercept) 4.17 12.0 0.348 0.728
2 versicolor Sepal.Length 1.08 42.0 0.0258 0.979
3 versicolor Sepal.Width -9.13 81.5 -0.112 0.911
4 versicolor Petal.Width 20.9 14.0 1.49 0.136
5 virginica (Intercept) -16.0 12.1 -1.33 0.185
6 virginica Sepal.Length 2.37 42.0 0.0563 0.955
7 virginica Sepal.Width -13.9 81.5 -0.171 0.864
8 virginica Petal.Width 36.8 14.1 2.61 0.00916
where
multinom_wf <-
workflow() %>%
add_model(multinom_model) %>%
add_recipe(multinom_recipe) %>%
fit(data = iris)
We have a function repair_call() in parsnip to fix up the call objects for packages that don't play nicely with "typical" norms; read more about it here.
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
multinom_model <- multinom_reg() %>%
set_engine("nnet")
nnet_fit <-
multinom_model %>%
fit(Species ~ Sepal.Length + Sepal.Width + Sepal.Length + Petal.Width, data = iris)
tidy(nnet_fit)
#> Error in model.frame.default(formula = Species ~ Sepal.Length + Sepal.Width + : 'data' must be a data.frame, environment, or list
nnet_fixed <- repair_call(nnet_fit, data = iris)
tidy(nnet_fixed)
#> # A tibble: 8 × 6
#> y.level term estimate std.error statistic p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 versicolor (Intercept) 4.17 260. 0.0160 0.987
#> 2 versicolor Sepal.Length 1.08 64.8 0.0167 0.987
#> 3 versicolor Sepal.Width -9.13 80.4 -0.114 0.910
#> 4 versicolor Petal.Width 20.9 98.1 0.213 0.831
#> 5 virginica (Intercept) -16.0 261. -0.0616 0.951
#> 6 virginica Sepal.Length 2.37 64.8 0.0365 0.971
#> 7 virginica Sepal.Width -13.9 80.4 -0.173 0.862
#> 8 virginica Petal.Width 36.8 98.2 0.375 0.708
Created on 2021-08-01 by the reprex package (v2.0.0)

Unnesting tibble columns: "Wide" data summaries with dplyr v1.0.0

I'd like to produce "wide" summary tables of data in this sort of format:
---- Centiles ----
Param Group Mean SD 25% 50% 75%
Height 1 x.xx x.xxx x.xx x.xx x.xx
2 x.xx x.xxx x.xx x.xx x.xx
3 x.xx x.xxx x.xx x.xx x.xx
Weight 1 x.xx x.xxx x.xx x.xx x.xx
2 x.xx x.xxx x.xx x.xx x.xx
3 x.xx x.xxx x.xx x.xx x.xx
I can do that in dplyr 0.8.x. I can do it generically, with a function that can handle arbitrary grouping variables with arbitrary numbers of levels and arbitrary statistics summarising arbitrary numbers of variables with arbitrary names. I get that level of flexibility by making my data tidy. That's not what this question is about.
First, some toy data:
set.seed(123456)
toy <- tibble(
Group=rep(1:3, each=5),
Height=1.65 + rnorm(15, 0, 0.1),
Weight= 75 + rnorm(15, 0, 10)
) %>%
pivot_longer(
values_to="Value",
names_to="Parameter",
cols=c(Height, Weight)
)
Now, a simple summary function, and a helper:
quibble2 <- function(x, q = c(0.25, 0.5, 0.75)) {
tibble(Value := quantile(x, q), "Quantile" := q)
}
mySummary <- function(data, ...) {
data %>%
group_by(Parameter, Group) %>%
summarise(..., .groups="drop")
}
So I can say things like
summary <- mySummary(toy, Q=quibble2(Value), Mean=mean(Value, na.rm=TRUE), SD=sd(Value, na.rm=TRUE))
summary %>% head()
Giving
# A tibble: 6 x 5
Parameter Group Q$Value $Quantile Mean SD
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 Height 1 1.45 0.25 1.54 0.141
2 Height 1 1.49 0.5 1.54 0.141
3 Height 1 1.59 0.75 1.54 0.141
4 Height 2 1.64 0.25 1.66 0.0649
5 Height 2 1.68 0.5 1.66 0.0649
6 Height 2 1.68 0.75 1.66 0.0649
So that's the summary I need, but it's in long format. And Q is a df-col. It's a tibble:
is_tibble(summary$Q)
[1] TRUE
So pivot_wider doesn't seem to work. I can use nest_by() to get to a one-row-per-group format:
toySummary <- summary %>% nest_by(Group, Mean, SD)
toySummary
# Rowwise: Group, Mean, SD
Group Mean SD data
<int> <dbl> <dbl> <list<tbl_df[,2]>>
1 1 1.54 0.141 [3 × 2]
2 1 78.8 10.2 [3 × 2]
3 2 1.66 0.0649 [3 × 2]
4 2 82.9 9.09 [3 × 2]
5 3 1.63 0.100 [3 × 2]
6 3 71.0 10.8 [3 × 2]
But now the format of the centiles is even more complicated:
> toySummary$data[1]
<list_of<
tbl_df<
Parameter: character
Q :
tbl_df<
Value : double
Quantile: double
>
>
>[1]>
[[1]]
# A tibble: 3 x 2
Parameter Q$Value $Quantile
<chr> <dbl> <dbl>
1 Height 1.45 0.25
2 Height 1.49 0.5
3 Height 1.59 0.75
It looks like a list, so I guess some form of lapply would probably work, but is there a neater, tidy, solution that I've not spotted yet? I've discovered several new verbs that I didn't know abou whilst researching this question (chop, pack, rowwise(), nest_by and such) but none seem to give me what I want: ideally, a tibble with 6 rows (defined by unique Group and Parameter combinations) and columns for Mean, SD, Q25, Q50 and Q75.
To clarify in response to the first two proposed answers: getting the exact numbers that my toy example generates is less important than finding a generic technique for moving from the df-col(s) that summarise returns in dplyr v1.0.0 to a wide data summary of the general form that my example illustrates.
revised answer
Here is my revised answer. This time, I rewrote your quibble2 function with enframe and pivot_wider so that it returns a tibble with three rows.
This will again lead to a df-col in your summary tibble, and now we can use unpack directly, without using pivot_wider to get the expected outcome.
This should generalize on centiles etc. as well.
library(tidyverse)
set.seed(123456)
toy <- tibble(
Group=rep(1:3, each=5),
Height=1.65 + rnorm(15, 0, 0.1),
Weight= 75 + rnorm(15, 0, 10)
) %>%
pivot_longer(
values_to="Value",
names_to="Parameter",
cols=c(Height, Weight)
)
quibble2 <- function(x, q = c(0.25, 0.5, 0.75)) {
pivot_wider(enframe(quantile(x, q)),
names_from = name,
values_from = value)
}
mySummary <- function(data, ...) {
data %>%
group_by(Parameter, Group) %>%
summarise(..., .groups="drop")
}
summary <- mySummary(toy, Q=quibble2(Value), Mean=mean(Value, na.rm=TRUE), SD=sd(Value, na.rm=TRUE))
summary %>%
unpack(Q)
#> # A tibble: 6 x 7
#> Parameter Group `25%` `50%` `75%` Mean SD
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Height 1 1.62 1.66 1.73 1.70 0.108
#> 2 Height 2 1.73 1.77 1.78 1.76 0.105
#> 3 Height 3 1.55 1.64 1.76 1.65 0.109
#> 4 Weight 1 75.6 80.6 84.3 80.0 9.05
#> 5 Weight 2 75.4 76.9 79.6 77.4 7.27
#> 6 Weight 3 70.7 75.2 82.0 76.3 6.94
Created on 2020-06-13 by the reprex package (v0.3.0)
Second approach
without changing quibble2, we would need to first call unpack and then pivot_wider. This should scale as well.
library(tidyverse)
set.seed(123456)
toy <- tibble(
Group=rep(1:3, each=5),
Height=1.65 + rnorm(15, 0, 0.1),
Weight= 75 + rnorm(15, 0, 10)
) %>%
pivot_longer(
values_to="Value",
names_to="Parameter",
cols=c(Height, Weight)
)
quibble2 <- function(x, q = c(0.25, 0.5, 0.75)) {
tibble(Value := quantile(x, q), "Quantile" := q)
}
mySummary <- function(data, ...) {
data %>%
group_by(Parameter, Group) %>%
summarise(..., .groups="drop")
}
summary <- mySummary(toy, Q=quibble2(Value), Mean=mean(Value, na.rm=TRUE), SD=sd(Value, na.rm=TRUE))
summary %>%
unpack(Q) %>%
pivot_wider(names_from = Quantile, values_from = Value)
#> # A tibble: 6 x 7
#> Parameter Group Mean SD `0.25` `0.5` `0.75`
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Height 1 1.70 0.108 1.62 1.66 1.73
#> 2 Height 2 1.76 0.105 1.73 1.77 1.78
#> 3 Height 3 1.65 0.109 1.55 1.64 1.76
#> 4 Weight 1 80.0 9.05 75.6 80.6 84.3
#> 5 Weight 2 77.4 7.27 75.4 76.9 79.6
#> 6 Weight 3 76.3 6.94 70.7 75.2 82.0
Created on 2020-06-13 by the reprex package (v0.3.0)
generalized approach
I tried to figure out a more general approach by rewriting the mySummary function. Now it will convert automatically those outputs to df-cols which return a vector or a named vector. It will also wrap list automatically around expressions if necessary.
Then, I defined a function widen which will widen the df as much as possible, by preserving rows, including calling broom::tidy on supported list-columns.
The approach is not perfect, and could be extended by including unnest_wider in the widen function.
Note, that I changed the grouping in the example to be able to use t.test as another example output.
library(tidyverse)
set.seed(123456)
toy <- tibble(
Group=rep(1:3, each=5),
Height=1.65 + rnorm(15, 0, 0.1),
Weight= 75 + rnorm(15, 0, 10)
) %>%
pivot_longer(
values_to="Value",
names_to="Parameter",
cols=c(Height, Weight)
)
# modified summary function
mySummary <- function(data, ...) {
fns <- rlang::enquos(...)
fns <- map(fns, function(x) {
res <- rlang::eval_tidy(x, data = data)
if ( ((is.vector(res) || is.factor(res)) && length(res) == 1) ||
("list" %in% class(res) && is.list(res)) ||
rlang::call_name(rlang::quo_get_expr(x)) == "list") {
x
}
else if ((is.vector(res) || is.factor(res)) && length(res) > 1) {
x_expr <- as.character(list(rlang::quo_get_expr(x)))
x_expr <- paste0(
"pivot_wider(enframe(",
x_expr,
"), names_from = name, values_from = value)"
)
x <- rlang::quo_set_expr(x, str2lang(x_expr))
x
} else {
x_expr <- as.character(list(rlang::quo_get_expr(x)))
x_expr <- paste0("list(", x_expr,")")
x <- rlang::quo_set_expr(x, str2lang(x_expr))
x
}
})
data %>%
group_by(Parameter) %>%
summarise(!!! fns, .groups="drop")
}
# A function to automatically widen the df as much as possible while preserving rows
widen <- function(df) {
df_cols <- names(df)[map_lgl(df, is.data.frame)]
df <- unpack(df, all_of(df_cols), names_sep = "_")
try_tidy <- function(x) {
tryCatch({
broom::tidy(x)
}, error = function(e) {
x
})
}
df <- df %>% rowwise() %>% mutate(across(where(is.list), try_tidy))
ungroup(df)
}
# if you want to specify function arguments for convenience use purrr::partial
quantile3 <- partial(quantile, x = , q = c(.25, .5, .75))
summary <- mySummary(toy,
Q = quantile3(Value),
R = range(Value),
T_test = t.test(Value),
Mean = mean(Value, na.rm=TRUE),
SD = sd(Value, na.rm=TRUE)
)
summary
#> # A tibble: 2 x 6
#> Parameter Q$`0%` $`25%` $`50%` $`75%` $`100%` R$`1` $`2` T_test Mean SD
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <dbl> <dbl>
#> 1 Height 1.54 1.62 1.73 1.77 1.90 1.54 1.90 <htest> 1.70 0.109
#> 2 Weight 67.5 72.9 76.9 83.2 91.7 67.5 91.7 <htest> 77.9 7.40
widen(summary)
#> # A tibble: 2 x 11
#> Parameter `Q_0%` `Q_25%` `Q_50%` `Q_75%` `Q_100%` R_1 R_2 T_test$estimate
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Height 1.54 1.62 1.73 1.77 1.90 1.54 1.90 1.70
#> 2 Weight 67.5 72.9 76.9 83.2 91.7 67.5 91.7 77.9
#> # … with 9 more variables: $statistic <dbl>, $p.value <dbl>, $parameter <dbl>,
#> # $conf.low <dbl>, $conf.high <dbl>, $method <chr>, $alternative <chr>,
#> # Mean <dbl>, SD <dbl>
Created on 2020-06-14 by the reprex package (v0.3.0)
What if you change quibble2 to return a list, and then use unnest_wider?
quibble2 <- function(x, q = c(0.25, 0.5, 0.75)) {
list(quantile(x, q))
}
mySummary(toy, Q=quibble2(Value), Mean=mean(Value, na.rm=TRUE), SD=sd(Value, na.rm=TRUE)) %>%
unnest_wider(Q)
# A tibble: 6 x 7
Parameter Group `25%` `50%` `75%` Mean SD
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Height 1 1.62 1.66 1.73 1.70 0.108
2 Height 2 1.73 1.77 1.78 1.76 0.105
3 Height 3 1.55 1.64 1.76 1.65 0.109
4 Weight 1 75.6 80.6 84.3 80.0 9.05
5 Weight 2 75.4 76.9 79.6 77.4 7.27
6 Weight 3 70.7 75.2 82.0 76.3 6.94

How to iterate over columns with vectorization together with group_by function from dplyr

As explained by Fitting several regression models with dplyr, we can use the tidy function from broom package to run the regression across groups. For instance, a demo code for iris dataset is listed below, but what if, in a simultaneous manner, we intend to loop over the multiple columns and run the regression with different dependent variables (Sepal.Length,Sepal.Width,Petal.Length) together with this group_by manipulation, how can I integrate the (s)apply function into such a situation and get the results for these regression models(3*3=9)?
library(dplyr);library(broom)
res1=iris%>%
group_by(Species)%>%
do(res=lm(Sepal.Length~Petal.Width,data=.))
tidy(res1, res)%>%
filter(term!="(Intercept)")
You can do this using lme4::lmList and broom.mixed::tidy. You may be able to adapt it to a pipe, but this should get you started. Here, lmList essentially performs the same function as group_by in the dplyr pipe, but it is easier for me to conceptualize how to pipe through several DVs using lapply. Good luck!!
library(lme4)
library(broom.mixed)
# Selecting DVs
dvs <- names(iris)[1:3]
# Making formula objects
formula_text <- paste0(dvs, "~ Petal.Width | Species")
formulas <- lapply(formula_text, formula)
# Running grouped analyses and looping through DVs
results <- lapply(formulas, function(x) {
res <- broom.mixed::tidy(lmList(x, iris))
res[res$terms != "(Intercept)",]
})
# Renaming and viewing results
names(results) <- formula_text
And, viewing the results:
results
$`Sepal.Length~ Petal.Width | Species`
# A tibble: 3 x 6
group terms estimate p.value std.error statistic
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 setosa Petal.Width 0.930 0.154 0.649 1.43
2 versicolor Petal.Width 1.43 0.0000629 0.346 4.12
3 virginica Petal.Width 0.651 0.00993 0.249 2.61
$`Sepal.Width~ Petal.Width | Species`
# A tibble: 3 x 6
group terms estimate p.value std.error statistic
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 setosa Petal.Width 0.837 0.0415 0.407 2.06
2 versicolor Petal.Width 1.05 0.00000306 0.217 4.86
3 virginica Petal.Width 0.631 0.0000855 0.156 4.04
$`Petal.Length~ Petal.Width | Species`
# A tibble: 3 x 6
group terms estimate p.value std.error statistic
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 setosa Petal.Width 0.546 2.67e- 1 0.490 1.12
2 versicolor Petal.Width 1.87 3.84e-11 0.261 7.16
3 virginica Petal.Width 0.647 7.55e- 4 0.188 3.44

Resources