I'm trying to figure out if there is a straightforward way to create a table of paired t-tests using tidyverse packages. There are already Q&As addressing this topic (e.g., here), but the existing answers all seem pretty convoluted.
Here's a reproducible example showing what I'm trying to accomplish -- a column of variable names, columns with the means for both items in the pair for each variable, and a column of p-values:
library(dplyr)
library(infer)
library(tidyr)
df <- mtcars %>%
mutate(engine = if_else(vs == 0, "V-shaped", "straight"))
v_shaped <- df %>%
filter(engine == "V-shaped") %>%
summarise(across(c(mpg, disp), mean)) %>%
pivot_longer(cols = everything()) %>%
rename(V_shaped = value)
straight <- df %>%
filter(engine == "straight") %>%
summarise(across(c(mpg, disp), mean)) %>%
pivot_longer(cols = everything()) %>%
rename(straight = value)
mpg <- df %>%
t_test(formula = mpg ~ engine, alternative = "two-sided") %>%
select(p_value) %>%
mutate(name = "mpg")
disp <- df %>%
t_test(formula = disp ~ engine, alternative = "two-sided") %>%
select(p_value) %>%
mutate(name = "disp")
p_values <- bind_rows(mpg, disp)
table <- v_shaped %>%
full_join(straight, by = "name") %>%
full_join(p_values, by = "name")
table
#> # A tibble: 2 × 4
#> name V_shaped straight p_value
#> <chr> <dbl> <dbl> <dbl>
#> 1 mpg 16.6 24.6 0.000110
#> 2 disp 307. 132. 0.00000248
Obviously, this is not a good way to address this problem even for two variables, and it certainly does not scale well. But it does illustrate the intended outcome. Is there a way to do this in one pipeline? My actual use case involves many more variables, so -- ideally -- I'd be able to feed a vector of variable names into the pipe.
Here's one way of doing it in a single pipe -
library(tidyverse)
library(infer)
df %>%
#select the columns you are interested in
select(mpg, disp, engine) %>%
#get them in long format
pivot_longer(cols = -engine) %>%
#Divide the data in a list of dataframes
split(.$name) %>%
#For each dataframe
map_df(~{
#Get the mean value for each engine
.x %>%
group_by(engine) %>%
summarise(value = mean(value)) %>%
#get the data in wide format
pivot_wider(names_from = engine) %>%
#Combine it with t.test result
bind_cols(t_test(.x, formula = value ~ engine, alternative = "two-sided"))
}, .id = "name")
# name `V-shaped` straight statistic t_df p_value alternative lower_ci upper_ci
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
#1 disp 307. 132. -5.94 27.0 0.00000248 two.sided -235. -114.
#2 mpg 16.6 24.6 4.67 22.7 0.000110 two.sided 4.42 11.5
Obviously, you can keep only the columns (using select) that is relevant to you.
This can be achieved with a call to summarise() and a flattening of the results. If we use stats::t.test() the group means are returned as part of the test and don't need to be calculated separately. broom::tidy() ensures that each result is returned in a 1-row tibble.
library(dplyr)
library(purrr)
library(broom)
mtcars %>%
summarise(across(-vs, \(x)
list(tidy(
t.test(x ~ vs, data = ., alternative = "two.sided")
)))) %>%
flatten_dfr(.id = "names") %>%
rename("V-shaped" = estimate1, straight = estimate2)
# A tibble: 10 × 11
names estimate `V-shaped` straight statistic p.value parameter conf.low conf.high method alternative
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 mpg -7.94 16.6 24.6 -4.67 0.000110 22.7 -11.5 -4.42 Welch Two Sample t-test two.sided
2 cyl 2.87 7.44 4.57 7.79 0.0000000112 29.9 2.12 3.63 Welch Two Sample t-test two.sided
3 disp 175. 307. 132. 5.94 0.00000248 27.0 114. 235. Welch Two Sample t-test two.sided
4 hp 98.4 190. 91.4 6.29 0.00000182 23.6 66.1 131. Welch Two Sample t-test two.sided
5 drat -0.467 3.39 3.86 -2.66 0.0129 27.1 -0.827 -0.107 Welch Two Sample t-test two.sided
6 wt 1.08 3.69 2.61 3.76 0.000728 30.0 0.493 1.66 Welch Two Sample t-test two.sided
7 qsec -2.64 16.7 19.3 -5.94 0.00000352 24.6 -3.56 -1.72 Welch Two Sample t-test two.sided
8 am -0.167 0.333 0.5 -0.927 0.362 27.1 -0.535 0.202 Welch Two Sample t-test two.sided
9 gear -0.302 3.56 3.86 -1.22 0.232 28.8 -0.807 0.204 Welch Two Sample t-test two.sided
10 carb 1.83 3.61 1.79 3.98 0.000413 29.6 0.888 2.76 Welch Two Sample t-test two.sided
Related
I am running a mixed effects model on my dataset ,
library(lme4)
data(cake)
each dataset is a subset of a larger datsaet
subset(cake, recipe=="A")
subset(cake, recipe=="B")
subset(cake, recipe=="C")
I am using dlply to run my mixed effects model on each subset
MxM1 <- plyr::dlply(cake,
"recipe",
function(x)
lmer(angle ~ 1+ (1|replicate)+ temperature,
data=x))
This gives me a list of summaries based on each subset of data.
I know how to display the summaries one at a time using gt_summary package
lm_cake$A %>%
tbl_regression() %>%
modify_column_hide(columns = ci) %>%
modify_column_unhide(columns = std.error)
lm_cake$B %>%
tbl_regression() %>%
modify_column_hide(columns = ci) %>%
modify_column_unhide(columns = std.error)
lm_cake$B %>%
tbl_regression() %>%
modify_column_hide(columns = ci) %>%
modify_column_unhide(columns = std.error)
I am not sure how to combine the results from all 3 objects (lm_cake$A, lm_cake$B, lm_cake$C) to display them as one summary table.
Model: A Model: B Model: C
Temperature Beta SE Beta SE Beta SE
Temperature. L
Temperature. Q
Temperature. C
Temperature^4
Temperature^5
Any suggestions or help is much apricated. Thanks.
You can also merge two or more gtsummary tables using the gtsummary::tbl_merge() function. Example below!
library(gtsummary)
#> #StandWithUkraine
library(lme4)
#> Loading required package: Matrix
data(cake)
MxM1 <-
plyr::dlply(
cake,
"recipe",
function(x) {
lmer(angle ~ 1+ (1|replicate)+ temperature, data=x) %>%
tbl_regression() %>%
modify_column_hide(columns = ci) %>%
modify_column_unhide(columns = std.error)
}
)
# Merge all model summaries together with `tbl_merge()`
tbl <-
MxM1 %>%
tbl_merge(
tab_spanner = c("**A**", "**B**", "**C**")
)
Created on 2022-12-17 with reprex v2.0.2
Update:
While the answer by #Daniel D. Sjoberg is perfect and the desired one. Here is the answer to OP's question in the comments:
"How can i convert the final results from long format to wide, by each recipe?"
After filtering temperature we could use pivot_wider and some tweaking thereafter:
Note we have to use broom.mixed package for our lmer
library(lme4)
library(tidyverse)
#library(broom)
library(broom.mixed)
cake %>%
mutate(recipe = as_factor(recipe)) %>%
group_by(recipe) %>%
group_split() %>%
map_dfr(.f = function(df){
lmer(angle ~ 1 + (1|replicate) + temperature,
data=df) %>%
tidy() %>%
add_column(recipe = unique(df$recipe), .before = 1)
}) %>%
filter(str_detect(term, "temperature")) %>%
select(recipe, term, Beta=estimate, SE = std.error) %>%
pivot_wider(names_from = recipe,
values_from = c(Beta, SE)) %>%
rename_with(~ str_replace(., "(.*)_(.*)", "\\2_\\1"), -1) %>%
select(term, sort(colnames(.)))
term A_Beta A_SE B_Beta B_SE C_Beta C_SE
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 temperature.L 6.43 1.22 6.88 1.16 6.52 1.12
2 temperature.Q -0.713 1.22 -0.946 1.16 0.502 1.12
3 temperature.C -2.33 1.22 0.368 1.16 0.313 1.12
4 temperature^4 -3.35 1.22 -0.328 1.16 -0.214 1.12
5 temperature^5 -0.151 1.22 -0.815 1.16 -1.78 1.12
First answer:
You need something like this?:
library(lme4)
data(cake)
library(dplyr)
library(broom)
library(broom.mixed)
cake %>%
mutate(recipe = as_factor(recipe)) %>%
group_by(recipe) %>%
group_split() %>%
map_dfr(.f = function(df){
lmer(angle ~ 1 + (1|replicate) + temperature,
data=df) %>%
tidy() %>%
add_column(recipe = unique(df$recipe), .before = 1)
})
A tibble: 24 × 7
recipe effect group term estimate std.error statistic
<fct> <chr> <chr> <chr> <dbl> <dbl> <dbl>
1 A fixed NA (Intercept) 33.1 1.42 23.3
2 A fixed NA temperature.L 6.43 1.22 5.26
3 A fixed NA temperature.Q -0.713 1.22 -0.583
4 A fixed NA temperature.C -2.33 1.22 -1.90
5 A fixed NA temperature^4 -3.35 1.22 -2.74
6 A fixed NA temperature^5 -0.151 1.22 -0.124
7 A ran_pars replicate sd__(Intercept) 5.16 NA NA
8 A ran_pars Residual sd__Observation 4.73 NA NA
9 B fixed NA (Intercept) 31.6 1.81 17.5
10 B fixed NA temperature.L 6.88 1.16 5.93
# … with 14 more rows
# ℹ Use `print(n = ...)` to see more rows
I'm fitting many models like this example:
model<-lm(vl~sex+race+gene1+gene2)
anova(model)
model<-lm(vl~sex+race+gene3+gene4)
anova(model)
model<-lm(vl~sex+race+gene5+gene6)
anova(model)
model<-lm(vl~sex+race+gene7+gene8)
anova(model)
model<-lm(vl~sex+race+gene9+gene10)
anova(model)
I want a function or R package that can extract all the p values from those models and put them in one table. I have so many models that I cannot copy and past every p value. Can you help me please?
Here is an example with the mtcars dataset:
library(tidyquant)
library(tidyverse)
library(broom)
table <- mtcars %>%
mutate(cyl = as_factor(cyl)) %>%
group_by(cyl) %>%
group_split() %>%
map_dfr(.f = function(df) {
lm(mpg ~ am+disp+gear, data = df) %>%
glance() %>%
add_column(cyl = unique(df$cyl), .before = 1)
})
table
Output:
cyl r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 4 0.866 0.808 1.98 15.0 0.00196 3 -20.6 51.2 53.2 27.3 7 11
2 6 0.609 0.218 1.29 1.56 0.362 3 -8.72 27.4 27.2 4.96 3 7
3 8 0.272 0.139 2.38 2.05 0.175 2 -30.3 68.6 71.1 62.1 11 14
I have a data frame with 13 columns. the 13th column shows the group number each row belongs to. I want to take the standard deviation of all values in all rows of columns 3 to 12, for rows of each group.
for(i in 1: groupnumber) {
sd.vect[i] <- sd(as.vector(df[df$group==i,][,-c(1,2,13)]))}
I get error
Error in is.data.frame(x) :
'list' object cannot be coerced to type 'double'
how can I get sd of all values in each group?
You can use dplyr. You can group_by(grouping_column), then summarise() with the sd() function:
library(dplyr)
data_frame%>%>group_by(thirteenth_column)%>%summarise(across(-c(1,2), sd))
As an example with mtcars:
library(dplyr)
mtcars%>%group_by(cyl)%>%summarise(across(everything(), sd))
# A tibble: 3 x 11
cyl mpg disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 4.51 26.9 20.9 0.365 0.570 1.68 0.302 0.467 0.539 0.522
2 6 1.45 41.6 24.3 0.476 0.356 1.71 0.535 0.535 0.690 1.81
3 8 2.56 67.8 51.0 0.372 0.759 1.20 0 0.363 0.726 1.56
That said, with some corrections, your for loop works:
You just have to make sure to return a dataframe, as your desired outcome is actually a two-dimension object. See an example with the iris dataset:
sd.df<-data.frame()
for(i in unique(iris$Species)) {
for (j in 1:ncol(iris[-5])){
sd.df[i,j] <- sd(iris[iris$Species==i,j])
}
}
sd.df
V1 V2 V3 V4
setosa 0.3524897 0.3790644 0.1736640 0.1053856
versicolor 0.5161711 0.3137983 0.4699110 0.1977527
virginica 0.6358796 0.3224966 0.5518947 0.2746501
update
I understand from your comments you may want something quite odd, which would be to group your data by the grouping variable, than get the standard deviation for an aggregate of all values in each sub-dataframe. You probably would be better off working with matrices here.
For that, you may neet to group your data by unique() values in the grouping column, then call sd() on the rest of the dataframe (all values), which can be done if you coerce the dataframe into a matrix:
library(dplyr)
library(purrr)
map_dbl(unique(mtcars$cyl), ~as.matrix(mtcars%>%
filter(cyl==.x)%>%
select(-cyl))%>%
sd())%>%
set_names(., unique(mtcars$cyl))
6 4 8
62.47655 37.54494 118.18945
With your data:
map(unique(df[[13]]), ~as.matrix(df%>%
filter(df[[13]]==.x)%>%
select(-c(1,2,13)))%>%
sd()%>%
set_names(., unique(df[[13]]))
And a much simpler answer with base subsetting and split():
map_dbl(split(mtcars[-c(1,2, 10)], mtcars[10]), ~sd(as.matrix(.x)))
3 4 5
119.47824 47.97490 98.71733
You can subset the columns from cur_data() -
library(dplyr)
result <- df %>%
group_by(group) %>%
summarise(sd_value = sd(unlist(select(cur_data(), -(1:2)))))
Example using mtcars -
mtcars %>%
group_by(gear) %>%
summarise(sd_value = sd(unlist(select(cur_data(), -(1:2)))))
# gear sd_value
# <dbl> <dbl>
#1 3 119.
#2 4 48.0
#3 5 98.7
Using mtcars as an example,
aggregate(.~cyl, data=mtcars, FUN=sd)
# cyl mpg disp hp drat wt qsec vs am gear carb
# 1 4 4.509828 26.87159 20.93453 0.3654711 0.5695637 1.682445 0.3015113 0.4670994 0.5393599 0.522233
# 2 6 1.453567 41.56246 24.26049 0.4760552 0.3563455 1.706866 0.5345225 0.5345225 0.6900656 1.812654
# 3 8 2.560048 67.77132 50.97689 0.3723618 0.7594047 1.196014 0.0000000 0.3631365 0.7262730 1.556624
I would like to find a better way to bind together the results of any number of regressions after adding an identifier for each model. The code below is my current solution but is too manual for a large number of regressions. This is part of a larger tidy workflow so a solution inside of the tidyverse is preferred but whatever works is fine. Thanks
library(tidyverse)
library(broom)
model_dat=mtcars %>%
do(lm_1 = tidy(lm(disp~ wt*vs, data = .),conf.int=T),
lm_2=tidy(lm(cyl ~ wt*vs, data = .),conf.int=T ),
lm_3=tidy(lm(mpg ~ wt*vs, data = .),conf.int=T ))
df=model_dat %>%
select(lm_1) %>%
unnest(c(lm_1)) %>%
mutate(model="one") %>%
select(model,term,estimate,p.value:conf.high) %>%
bind_rows(
model_dat %>%
select(lm_2) %>%
unnest(c(lm_2)) %>%
mutate(model="two") %>%
select(model,term,estimate,p.value:conf.high)) %>%
bind_rows(
model_dat %>%
select(lm_3) %>%
unnest(c(lm_3)) %>%
mutate(model="three") %>%
select(model,term,estimate,p.value:conf.high))
It may be easier with map2 i.e. loop across the columns and the corresponding english word for the sequence of columns, pluck the list element, create the 'model' column with second argument i.e. engish words (.y), select the columns of interest, and create a single dataset by specifying _dfr in map
library(purrr)
library(english)
library(dplyr)
library(broom)
map2_dfr(model_dat, as.character(english(seq_along(model_dat))),
~ .x %>%
pluck(1) %>%
mutate(model = .y) %>%
select(model, term, estimate, p.value:conf.high) )
-output
# A tibble: 12 x 6
# model term estimate p.value conf.low conf.high
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 one (Intercept) -70.0 1.55e- 1 -168. 28.2
# 2 one wt 102. 8.20e- 9 76.4 128.
# 3 one vs 31.2 6.54e- 1 -110. 172.
# 4 one wt:vs -36.7 1.10e- 1 -82.2 8.82
# 5 two (Intercept) 4.31 1.28e- 5 2.64 5.99
# 6 two wt 0.849 4.90e- 4 0.408 1.29
# 7 two vs -2.19 7.28e- 2 -4.59 0.216
# 8 two wt:vs 0.0869 8.20e- 1 -0.689 0.862
# 9 three (Intercept) 29.5 6.55e-12 24.2 34.9
#10 three wt -3.50 2.33e- 5 -4.92 -2.08
#11 three vs 11.8 4.10e- 3 4.06 19.5
#12 three wt:vs -2.91 2.36e- 2 -5.40 -0.419
Or use summarise with across, unclass and then bind with bind_rows
model_dat %>%
summarise(across(everything(), ~ {
# // get the column name
nm1 <- cur_column()
# // extract the list element (.[[1]])
list(.[[1]] %>%
# // create new column by extracting the numeric part
mutate(model = english(readr::parse_number(nm1))) %>%
# // select the subset of columns, wrap in a list
select(model, term, estimate, p.value:conf.high))
}
)) %>%
# // unclass to list
unclass %>%
# // bind the list elements
bind_rows
-output
# A tibble: 12 x 6
# model term estimate p.value conf.low conf.high
# <english> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 one (Intercept) -70.0 1.55e- 1 -168. 28.2
# 2 one wt 102. 8.20e- 9 76.4 128.
# 3 one vs 31.2 6.54e- 1 -110. 172.
# 4 one wt:vs -36.7 1.10e- 1 -82.2 8.82
# 5 two (Intercept) 4.31 1.28e- 5 2.64 5.99
# 6 two wt 0.849 4.90e- 4 0.408 1.29
# 7 two vs -2.19 7.28e- 2 -4.59 0.216
# 8 two wt:vs 0.0869 8.20e- 1 -0.689 0.862
# 9 three (Intercept) 29.5 6.55e-12 24.2 34.9
#10 three wt -3.50 2.33e- 5 -4.92 -2.08
#11 three vs 11.8 4.10e- 3 4.06 19.5
#12 three wt:vs -2.91 2.36e- 2 -5.40 -0.419
I am currently struggling to run weighted regression models on multiple variables in R.
When using (non-weighted) glm, I was successful by running the following:
mtcars_1 <- mtcars %>%
nest(-gear)%>%
mutate(model_0 = map(data, ~ glm(vs ~ drat, family = "binomial", data = .)))%>%
mutate(model_0_tidy = map(model_0, tidy))%>%
select(gear, model_0_tidy)%>%
ungroup()%>%
unnest(model_0_tidy)
That is I receive the following:
# A tibble: 6 x 6
gear term estimate std.error statistic p.value
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 4 (Intercept) -15.3 22.6 -0.677 0.499
2 4 drat 4.26 5.76 0.740 0.459
3 3 (Intercept) -3.91 7.39 -0.529 0.597
4 3 drat 0.801 2.32 0.345 0.730
5 5 (Intercept) 5.20 14.4 0.362 0.718
6 5 drat -1.71 3.77 -0.453 0.651
However, when I would like to weight my observations and thus use svyglm from the survey-package, nesting does not work.
This was my approach:
design_0 <- svydesign(ids=~0, data = mtcars, weights = mtaars$wt)
mtcars_2 <- mtcars%>%
nest(-gear)%>%
mutate(model_1 = map(data, ~ svyglm(vs ~ drat, family = quasibinomial(logit), design = design_0, data = .)))%>%
mutate(model_1_tidy = map(model_1, tidy))%>%
select(gear, model_1_tidy)%>%
ungroup()%>%
unnest(model_1_tidy)
# If suggested that wt serves as frequency weight
# Outcome
gear term estimate std.error statistic p.value
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 4 (Intercept) -8.12 3.88 -2.09 0.0451
2 4 drat 2.12 1.07 1.99 0.0554
3 3 (Intercept) -8.12 3.88 -2.09 0.0451
4 3 drat 2.12 1.07 1.99 0.0554
5 5 (Intercept) -8.12 3.88 -2.09 0.0451
6 5 drat 2.12 1.07 1.99 0.0554
Estimates for each type of gear (that is 3,4,5) turns out to be the same.
It appears as if nesting was essentially ignored here.
Are there any solutions for combining svyglm with nest-map-unnest? Or will I have to look for other, less comfortable ways?
Thank you!
try to do it this way
mtcars%>%
nest(-gear) %>%
mutate(design = map(data, ~ svydesign(ids=~0, data = .x, weights = ~ wt)),
model = map(.x = design,
.f = ~ svyglm(vs ~ drat,
family = quasibinomial(logit),
design = .x))) %>%
mutate(model_tidy = map(model, tidy)) %>%
select(gear, model_tidy)%>%
ungroup()%>%
unnest(model_tidy)