I have a simple question about creating custom lm functions within the tidyverse framework.
I basically want a function that runs a custom model for me with one free variable.
model <- function(x){
lmer(paste("cyl ~", x, "+ (1|disp)"), data = .)
}
And then I want to use this in dplyr's do
mtcars %>%
do(x = model("hp"))
How should I approach this problem?
You could pass the data to the function:
library(dplyr)
library(lme4)
model <- function(data, x){
lmer(paste("cyl ~", x, "+", "(1|disp)"), data = data)
}
and then call it like this:
mtcars %>% model('hp')
#Linear mixed model fit by REML ['lmerMod']
#Formula: cyl ~ hp + (1 | disp)
# Data: data
#REML criterion at convergence: 96.2
#Random effects:
# Groups Name Std.Dev.
# disp (Intercept) 0.927
# Residual 0.441
#Number of obs: 32, groups: disp, 27
#Fixed Effects:
#(Intercept) hp
# 3.1866 0.0196
Or
mtcars %>% summarise(mod = list(model(., 'hp')))
# mod
#1 <S4 class ‘lmerMod’ [package “lme4”] with 13 slots>
I am trying to obtain the summary statistics (summary()) for the linear model below, which is fitted to each of 1000 random datasets generated by permuting the original dataset (stored as a large matrix).
random_model <- rep(NA,1000)
for (i in c(1:1000)) {
random_data <- final_data
random_data$weighted_degree <- rowSums(node.perm_1000[i,,],na.rm=T)
random_model[i] <- coef(lm(weighted_degree ~ age + sex + age*sex, data=random_data))
}
I am not simply trying to compare the models to get an overall p-value but I want to get a t-value for each of the variables in the model that uses the random permutations as well.
Try tidy() from the broom package. It returns the estimates, standard errors, t statistics and p-values as a tibble, like this (example):
# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 6.53 0.479 13.6 6.47e-28
2 iris$Sepal.Width -0.223 0.155 -1.44 1.52e- 1
In your case, one such tibble is stored per iteration. Note that the container must be a list and must be filled with [[i]], because a tibble cannot be stored in a single element of an atomic vector:
library(broom)
# Container: one list element per permutation
random_model <- vector("list", 1000)
# Loop
for (i in seq_len(1000)) {
  random_data <- final_data
  random_data$weighted_degree <- rowSums(node.perm_1000[i,,], na.rm = TRUE)
  random_model[[i]] <- broom::tidy(lm(weighted_degree ~ age + sex + age*sex, data = random_data))
}
You should store the results of interest (estimated coefficients and t-values) in a list.
Here is a reproducible example using 10 replications on the mtcars dataset which is sampled at 50% rate for each replication.
The results of interest are retrieved from the $coefficients component of the summary() output for each lm object.
# The data
data(mtcars)
# Define sample size of each replication
N <- nrow(mtcars)
sample_size <- floor(N/2)
# Number of replications (model fits) and initialization of the list to store the results
set.seed(1717)
replications <- 10
random_model <- vector( "list", length=replications )
for (i in seq_along(random_model)) {
shuffle = sample(N, sample_size)
mtcars_shuffle = mtcars[shuffle, ]
random_model[[i]] <- summary(lm(mpg ~ cyl + disp + cyl*disp, data=mtcars_shuffle))$coefficients
}
For example, the coefficient tables for replications 1 and 10 are:
> random_model[[1]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 48.26285335 8.219065181 5.872061 7.573836e-05
cyl -3.33999161 1.366231326 -2.444675 3.089262e-02
disp -0.12941685 0.063269362 -2.045490 6.337414e-02
cyl:disp 0.01394436 0.007877833 1.770076 1.020931e-01
> random_model[[10]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 54.27312267 7.662593317 7.082866 1.277746e-05
cyl -4.40545653 1.586392001 -2.777029 1.674235e-02
disp -0.15330770 0.047932153 -3.198431 7.654790e-03
cyl:disp 0.01792561 0.006707396 2.672514 2.031615e-02
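Since the question asks for a t-value per variable across all replications, the stored coefficient tables can be collapsed into a single matrix. The following is only a sketch built on the mtcars example above; the names t_values and rep_1, rep_2, ... are illustrative and not part of the original answer.

```r
# Rebuild the replication example, then collect the t-values.
data(mtcars)
set.seed(1717)
N <- nrow(mtcars)
replications <- 10
random_model <- vector("list", length = replications)

for (i in seq_along(random_model)) {
  rows <- sample(N, floor(N / 2))                      # 50% subsample
  fit <- lm(mpg ~ cyl + disp + cyl * disp, data = mtcars[rows, ])
  random_model[[i]] <- summary(fit)$coefficients       # one table per replication
}

# One row per model term, one column per replication.
t_values <- sapply(random_model, function(m) m[, "t value"])
colnames(t_values) <- paste0("rep_", seq_along(random_model))
t_values
```

Each row of t_values then holds the distribution of one term's t statistic over the replications, which can be passed to quantile() or hist() as needed.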
Suppose in R I have multiple GLM objects from multiple glm() function calls.
glm_01
glm_02
...
glm_nn
...and suppose that I want to do all possible pairwise comparisons using a chi-squared or F ANOVA test.
anova(glm_01, glm_02, test = "F")
anova(glm_01, glm_03, test = "F")
anova(glm_01, glm_04, test = "F")
...
I don't want to do this manually because the list of models is quite long. Instead I'd like to grab a list of relevant model objects (anything starting with "glm_") and do all pairwise comparisons automatically. However I'm unsure how to pass the model objects (rather than their names in string form) to the anova() function.
As a simple example:
data(mtcars)
# create some models
glm_01 <- glm(mpg ~ cyl , mtcars, family = gaussian())
glm_02 <- glm(mpg ~ cyl + disp , mtcars, family = gaussian())
glm_03 <- glm(mpg ~ cyl + disp + hp , mtcars, family = gaussian())
glm_04 <- glm(mpg ~ cyl + disp + hp + wt, mtcars, family = gaussian())
# get list of relevant model objects from the R environment
model_list <- ls()
model_list <- model_list[substr(model_list, 1, 4) == "glm_"]
# create a table to store the pairwise ANOVA results
n_models <- length(model_list)
anova_table <- matrix(0, nrow = n_models, ncol = n_models)
# loop through twice and do pairwise comparisons
for(row_index in 1:n_models) {
for(col_index in 1:n_models) {
anova_table[row_index, col_index] <- anova(model_list[row_index], model_list[col_index], test = "F")$'Pr(>F)'[2]
}
}
...but of course this loop at the end doesn't work because I'm not passing model objects to anova(), I'm passing the names of the objects as strings instead. How do I tell anova() to use the object that the string refers to, instead of the string itself?
Thank you.
======================
Possible solution:
data(mtcars)
glm_list <- list()
glm_list$glm_01 <- glm(mpg ~ cyl , mtcars, family = gaussian())
glm_list$glm_02 <- glm(mpg ~ cyl + disp , mtcars, family = gaussian())
glm_list$glm_03 <- glm(mpg ~ cyl + disp + hp , mtcars, family = gaussian())
glm_list$glm_04 <- glm(mpg ~ cyl + disp + hp + wt, mtcars, family = gaussian())
# create a table to store the pairwise ANOVA results
n_models <- length(glm_list)
anova_table <- matrix(0, nrow = n_models, ncol = n_models)
# loop through twice and do pairwise comparisons
row_idx <- 0
col_idx <- 0
for(row_glm in glm_list)
{
row_idx <- row_idx + 1
for(col_glm in glm_list)
{
col_idx <- col_idx + 1
anova_table[row_idx, col_idx] <- anova(row_glm, col_glm, test = "F")$'Pr(>F)'[2]
}
col_idx <- 0
}
row_idx <- 0
The easiest way to do this would be to keep all your models in a list. This makes it simple to iterate over them. For example, you can create all of your models and do a pairwise comparison between all of them like this:
data(mtcars)
f_list <- list(mpg ~ cyl,
mpg ~ cyl + disp,
mpg ~ cyl + disp + hp,
mpg ~ cyl + disp + hp + wt)
all_glms <- lapply(f_list, glm, data = mtcars, family = gaussian)
all_pairs <- as.data.frame(combn(length(all_glms), 2))
result <- lapply(all_pairs, function(i) anova(all_glms[[i[1]]], all_glms[[i[2]]]))
Which gives you:
result
#> $V1
#> Analysis of Deviance Table
#>
#> Model 1: mpg ~ cyl
#> Model 2: mpg ~ cyl + disp
#> Resid. Df Resid. Dev Df Deviance
#> 1 30 308.33
#> 2 29 270.74 1 37.594
#>
#> $V2
#> Analysis of Deviance Table
#>
#> Model 1: mpg ~ cyl
#> Model 2: mpg ~ cyl + disp + hp
#> Resid. Df Resid. Dev Df Deviance
#> 1 30 308.33
#> 2 28 261.37 2 46.965
#>
#> $V3
#> Analysis of Deviance Table
#>
#> Model 1: mpg ~ cyl
#> Model 2: mpg ~ cyl + disp + hp + wt
#> Resid. Df Resid. Dev Df Deviance
#> 1 30 308.33
#> 2 27 170.44 3 137.89
#>
#> $V4
#> Analysis of Deviance Table
#>
#> Model 1: mpg ~ cyl + disp
#> Model 2: mpg ~ cyl + disp + hp
#> Resid. Df Resid. Dev Df Deviance
#> 1 29 270.74
#> 2 28 261.37 1 9.3709
#>
#> $V5
#> Analysis of Deviance Table
#>
#> Model 1: mpg ~ cyl + disp
#> Model 2: mpg ~ cyl + disp + hp + wt
#> Resid. Df Resid. Dev Df Deviance
#> 1 29 270.74
#> 2 27 170.44 2 100.3
#>
#> $V6
#> Analysis of Deviance Table
#>
#> Model 1: mpg ~ cyl + disp + hp
#> Model 2: mpg ~ cyl + disp + hp + wt
#> Resid. Df Resid. Dev Df Deviance
#> 1 28 261.37
#> 2 27 170.44 1 90.925
Created on 2020-08-25 by the reprex package (v0.3.0)
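The tables above carry no p-values because anova() was called without a test argument. If the goal is the pairwise p-value table from the question, one possible variation (a sketch, with pair_idx and pairwise as made-up names) passes test = "F" and keeps only the second-row p-value of each comparison:

```r
data(mtcars)
f_list <- list(mpg ~ cyl,
               mpg ~ cyl + disp,
               mpg ~ cyl + disp + hp,
               mpg ~ cyl + disp + hp + wt)
all_glms <- lapply(f_list, glm, data = mtcars, family = gaussian)

# Each column of pair_idx holds the indices of one pair of models.
pair_idx <- combn(length(all_glms), 2)

# Run the F test for every pair and keep the p-value of the comparison.
p_values <- apply(pair_idx, 2, function(i) {
  anova(all_glms[[i[1]]], all_glms[[i[2]]], test = "F")[["Pr(>F)"]][2]
})

pairwise <- data.frame(model_a = pair_idx[1, ],
                       model_b = pair_idx[2, ],
                       p_value = p_values)
pairwise
```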
If you want to reference arbitrary objects in an accessible environment by symbol without putting them into a list object, the standard way to return the top object on the search list whose symbol is equal to a string is get(), or the vector equivalent mget(). I.e. get("glm_01") gets you the top object on the search list that has the symbol glm_01. The most minimal modification to your approach would be to wrap your calls to model_list[row_index] and model_list[col_index] in get().
You can be more precise about where to look for objects by assigning the models in a named environment and only getting from that environment (using the envir parameter to get()).
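As a rough illustration of both points, the sketch below stores three models in a dedicated environment and retrieves them by name with get() and mget(). The names model_env and not_a_model are invented for the example.

```r
data(mtcars)

# A dedicated environment that holds only the objects we care about.
model_env <- new.env()
model_env$glm_01 <- glm(mpg ~ cyl,             data = mtcars, family = gaussian())
model_env$glm_02 <- glm(mpg ~ cyl + disp,      data = mtcars, family = gaussian())
model_env$glm_03 <- glm(mpg ~ cyl + disp + hp, data = mtcars, family = gaussian())
model_env$not_a_model <- "ignored"   # does not match the pattern below

# Names (strings) of the objects starting with "glm_".
model_names <- ls(envir = model_env, pattern = "^glm_")

# get() turns one name into the object; mget() does so for a whole vector
# and returns a named list.
single   <- get(model_names[1], envir = model_env)
glm_list <- mget(model_names, envir = model_env)

# The retrieved objects can be passed to anova() directly.
cmp <- anova(glm_list[["glm_01"]], glm_list[["glm_02"]], test = "F")
p_value <- cmp[["Pr(>F)"]][2]
```

Because mget() returns a list, the result can also be fed straight into a list-based pairwise loop.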
I've been using map() to calculate and extract certain statistics from multiple lm() models.
To give a reproducible example, using the mtcars dataset, I start with an input vector of formulae to be estimated using lm() models:
library(tidyverse)
df <- mtcars
input_char <- c("mpg ~ disp",
"mpg ~ disp + hp")
input_formula <- map(input_char, formula)
I've then got a function that calculates and extracts the relevant statistics for each model. For simplicity and reproducibility, here's a simplified function that just extracts the R-squared of the model.
get_rsquared <- function(a_formula) {
model1 <- lm(a_formula, data = df)
rsquared <- summary(model1)$r.squared
c(model = a_formula, rsquared = rsquared)
}
I've then used map to iterate through the formulae and extract the R-squared from each model.
models <- map(input_formula, get_rsquared)
models
which gives the output:
[[1]]
[[1]]$model
mpg ~ disp
<environment: 0x7f98987f4000>
[[1]]$rsquared
[1] 0.7183433
[[2]]
[[2]]$model
mpg ~ disp + hp
<environment: 0x7f98987f4000>
[[2]]$rsquared
[1] 0.7482402
My question is regarding the output being a list.
Is there a simple way to make the output a dataframe?
My desired output is:
#> model rsquared
#> 1 mpg ~ disp 0.7183433
#> 2 mpg ~ disp + hp 0.7482402
Keep the formulas as character strings and call as.formula() inside the get_rsquared() function, since character strings are easier to work with than formula objects.
library(purrr)
library(dplyr)
df <- mtcars
input_char <- c("mpg ~ disp",
"mpg ~ disp + hp")
get_rsquared <- function(a_formula) {
model1 <- lm(as.formula(a_formula), data = df)
rsquared <- summary(model1)$r.squared
list(model = a_formula, rsquared = rsquared)
}
map_df(input_char, get_rsquared)
# A tibble: 2 x 2
model rsquared
<chr> <dbl>
1 mpg ~ disp 0.718
2 mpg ~ disp + hp 0.748
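If purrr is not available, roughly the same result can be had in base R by returning a one-row data frame per model and binding the rows. This is a sketch of an alternative, not the answer's method; the data argument is an addition made here so the function does not rely on a global df.

```r
data(mtcars)
input_char <- c("mpg ~ disp",
                "mpg ~ disp + hp")

# Fit one model and return a one-row data frame with its R-squared.
get_rsquared <- function(a_formula, data = mtcars) {
  fit <- lm(as.formula(a_formula), data = data)
  data.frame(model = a_formula,
             rsquared = summary(fit)$r.squared,
             stringsAsFactors = FALSE)
}

# Stack the one-row data frames into a single data frame.
result <- do.call(rbind, lapply(input_char, get_rsquared))
result
```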
I know there are several ways to compare regression models. One way is to create nested models (from simple to multiple regression) and compare R2, adjusted R2, etc.:
Mod1: y = b0 + b1*x1
Mod2: y = b0 + b1*x1 + b2*x2
Mod3: y = b0 + b1*x1 + b2*x2 + b3*x3 (etc.)
I'm aware that some packages can perform stepwise regression, but I'm trying to do this with purrr. I can already create several simple linear models (thanks to this post here), and now I want to know how to create regression models that add one specific IV to the equation at a time:
Reproducible code:
data(mtcars)
library(tidyverse)
library(purrr)
library(broom)
iv_vars <- c("cyl", "disp", "hp")
make_model <- function(nm) lm(mtcars[c("mpg", nm)])
fits <- Map(make_model, iv_vars)
glance_tidy <- function(x) c(unlist(glance(x)), unlist(tidy(x)[, -1]))
t(iv_vars %>% Map(f = make_model) %>% sapply(glance_tidy))
What I want:
Mod1: mpg ~cyl
Mod2: mpg ~ cyl + disp
Mod3: mpg ~ cyl + disp + hp
Thanks much.
I would begin by creating a tibble with a list column that stores your formulae. Then map lm() over the formulae, and map glance() over the fitted models.
library(tidyverse)
library(broom)
mtcars %>% as_tibble()
formula <- c(mpg ~ cyl, mpg ~ cyl + disp)
output <-
tibble(formula) %>%
mutate(model = map(formula, ~lm(formula = .x, data = mtcars)),
glance = map(model, glance))
output$glance
output %>% unnest(glance)
You could cumulatively paste over your vector of iv_vars to get the combinations you want. I used the code in this answer to do this.
I use the plus sign as the separator between variables to get ready for the formula notation in lm.
cumpaste = function(x, .sep = " ") {
Reduce(function(x1, x2) paste(x1, x2, sep = .sep), x, accumulate = TRUE)
}
( iv_vars_cum = cumpaste(iv_vars, " + ") )
[1] "cyl" "cyl + disp" "cyl + disp + hp"
Then switch the make_model function to use a formula and a dataset. The explanatory variables, separated by the plus sign, go after the tilde in the formula. Everything is pasted together and converted with as.formula() (lm would also coerce a bare character string to a formula).
make_model = function(nm) {
  lm(as.formula(paste0("mpg ~", nm)), data = mtcars)
}
Which we can see works as desired, returning a model with both explanatory variables.
make_model("cyl + disp")
Call:
lm(formula = as.formula(paste0("mpg ~", nm)), data = mtcars)
Coefficients:
(Intercept) cyl disp
34.66099 -1.58728 -0.02058
You'll likely need to rethink how you want to combine the info, as you will now have differing numbers of columns due to the increased number of coefficients.
A possible option is to add dplyr::bind_rows to your glance_tidy function and then use map_dfr from purrr for the final output.
glance_tidy = function(x) {
dplyr::bind_rows( c( unlist(glance(x)), unlist(tidy(x)[, -1]) ) )
}
iv_vars_cum %>%
Map(f = make_model) %>%
map_dfr(glance_tidy, .id = "model")
# A tibble: 3 x 28
model r.squared adj.r.squared sigma statistic p.value df logLik AIC
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cyl 0.7261800 0.7170527 3.205902 79.56103 6.112687e-10 2 -81.65321 169.3064
2 cyl + disp 0.7595658 0.7429841 3.055466 45.80755 1.057904e-09 3 -79.57282 167.1456
3 cyl + disp + hp 0.7678877 0.7430186 3.055261 30.87710 5.053802e-09 4 -79.00921 168.0184 ...
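As an alternative to pasting strings, the cumulative formulas can also be built with base R's reformulate(). The sketch below is one possible variant that only collects the adjusted R-squared, which sidesteps the differing-column problem. The names formulas, fits and adj_r2 are illustrative.

```r
data(mtcars)
iv_vars <- c("cyl", "disp", "hp")

# Formula k uses the first k predictors: mpg ~ cyl, mpg ~ cyl + disp, ...
formulas <- lapply(seq_along(iv_vars), function(k) {
  reformulate(iv_vars[seq_len(k)], response = "mpg")
})

# Fit one model per formula and label each fit with its formula text.
fits <- lapply(formulas, lm, data = mtcars)
names(fits) <- vapply(formulas,
                      function(f) paste(deparse(f), collapse = " "),
                      character(1))

# One comparable statistic per model, as a named numeric vector.
adj_r2 <- vapply(fits, function(m) summary(m)$adj.r.squared, numeric(1))
adj_r2
```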
I am running a series of models and storing them in a list:
fm0 <- list()
for(i in 1:3){
m <- formula(mpg ~ disp)
if(i > 1)
m <- update.formula(m, ~ . + gear)
if(i > 2)
m <- update.formula(m, ~ . + qsec)
fm1 <- lm(m, data = mtcars)
fm0[[i]] <- fm1
names(fm0)[i] <- paste0("m",i)
}
I want to run anova on the sequence of models like this:
anova(fm0$m1, fm0$m2, fm0$m3)
# Analysis of Variance Table
#
# Model 1: mpg ~ disp
# Model 2: mpg ~ disp + gear
# Model 3: mpg ~ disp + gear + qsec
# Res.Df RSS Df Sum of Sq F Pr(>F)
# 1 30 317.16
# 2 29 317.01 1 0.1443 0.0130 0.9099
# 3 28 309.83 1 7.1839 0.6492 0.4272
but I want something generic where I do not need to type out each named component of the list as the number of models is varying (depending on the data, which is set up in another loop, in which the loop above sits).
I tried lapply(fm0, anova), but it runs anova on each model on its own, which is not what I am after.
Here is an absolutely inelegant solution:
eval(parse(text=paste("anova(",paste("fm0[[",1:length(fm0),"]]",sep="",collapse=","),")")))
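A possibly tidier route is do.call(), which calls a function with its arguments taken from a list, so the number of models never has to be spelled out. The sketch below rebuilds the three models directly rather than in the loop. unname() is used as a precaution so that the list names (m1, m2, ...) are not treated as argument names by anova().

```r
data(mtcars)

# The same three nested models as in the loop above.
fm0 <- list(
  m1 = lm(mpg ~ disp, data = mtcars),
  m2 = lm(mpg ~ disp + gear, data = mtcars),
  m3 = lm(mpg ~ disp + gear + qsec, data = mtcars)
)

# Equivalent to anova(fm0[[1]], fm0[[2]], fm0[[3]]) for any list length.
seq_anova <- do.call(anova, unname(fm0))
seq_anova
```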