Problem extracting model covariates for model summary table - r

I'm a graduate student using a linear regression (count) model to understand drivers of fish movement into and out of tidal wetlands. I am currently trying to generate a publication-worthy model summary table in R. I've been using the model.sel function from the MuMIn package, which has been working well for this purpose.
However, I've been unable to generate a column that contains the individual model formulas. Below is my code, which is based on some nice instructions for using the MuMIn package: https://sites.google.com/site/rforfishandwildlifegrads/home/mumin_usage_examples
To recap, my question pertains to the last line of the code below:
How can I insert model formulas into a model selection table?
install.packages("MuMIn")
library(MuMIn)
data = mtcars
models = list(
model1 <- lm(mpg ~ cyl, data = data),
model2 <- lm(mpg ~ cyl + hp, data = data),
model3 <- lm(mpg ~ cyl * hp, data = data)
)
#create an object “out.put” that contains all of the model selection information
out.put <- model.sel(models)
#coerce the object out.put into a data frame
sel.table <-as.data.frame(out.put)[6:10]
#add a column for model names
sel.table$Model <- rownames(sel.table)
#replace model name with formulas
for(i in 1:nrow(sel.table)) sel.table$Model[i]<- as.character(formula(paste(sel.table$Model[i])))[3]
Any help on this topic would be greatly appreciated!
UPDATED CODE
My method of pulling out model names is pretty clunky but otherwise this code seems to generate what I intended (a complete model selection table). Also, I'm not sure if the model coefficients are displayed properly but I hope to follow up on this for my final answer.
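library(MuMIn)
library(dplyr)     # for %>%, select() and mutate()
library(flextable) # for flextable() and autofit()
data = mtcars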
#write linear models
models = list(
model1 <- lm(mpg ~ cyl, data = data),
model2 <- lm(mpg ~ cyl + hp, data = data),
model3 <- lm(mpg ~ cyl * hp + disp, data = data),
model4 <- lm(mpg ~ cyl * hp + disp + wt + drat, data = data)
)
#create an object “out.put” that contains all of the model selection information
out.put <- model.sel(models)
#coerce the object out.put into a data frame
sel.table <-as.data.frame(out.put)
#slightly rename intercept column
names(sel.table)[1]="Intercept"
#select variables to display in model summary table
sel.table <- sel.table %>%
select(Intercept,cyl,hp,disp,wt,drat,df,logLik,AICc,delta)
#round numerical columns
sel.table[,1:6]<- round(sel.table[,1:6],2)
sel.table[,8:10]<-round(sel.table[,8:10],2)
#add a column for model (row) names
sel.table$Model <- rownames(sel.table)
#extract model formulas
form <- data.frame(name = as.character(lapply(models, `[[`, c(10,2))))
#generate a column with model (row) numbers (beside associated model formulas)
form <- form %>%
mutate(Model=(1:4))
#merge model table and model formulas
sum_table <- merge (form,sel.table,by="Model")
#rename model equation column
names(sum_table)[2]="Formula"
print <- flextable(head(sum_table))
print <- autofit(print)
print
6/1/20 UPDATE:
Below is an image that describes two issues that I'm having with the code. I've found a workaround to the first question but I'm still investigating the second.
Models end up being misnumbered
Model formula columns are being generated for each model

I believe part of the code is missing from the examples you followed, which is why your code does not work.
The easiest way to generate formula-like strings is simply to deparse the right-hand side of each model formula (i.e. its third element):
sapply(get.models(out.put, TRUE), function(mo) deparse(formula(mo)[[3]], width.cutoff = 500))
or, if you want A*B's expanded into A + B + A:B:
sapply(get.models(out.put, TRUE), function(mo) deparse(terms(formula(mo), simplify = TRUE)[[3]], width.cutoff = 500))
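The same two variants can be tried without MuMIn. The following is only a minimal base-R sketch, assuming a plain named list of lm fits; `rhs_string` and `fits` are hypothetical names that do not appear in the answer above.

```r
# Hypothetical helper: right-hand side of a fitted model's formula as one string.
# With expand = TRUE, terms(..., simplify = TRUE) expands A*B into A + B + A:B.
rhs_string <- function(mo, expand = FALSE) {
  fo <- formula(mo)
  if (expand) fo <- terms(fo, simplify = TRUE)
  # paste(..., collapse = "") guards against formulas longer than width.cutoff
  paste(deparse(fo[[3]], width.cutoff = 500), collapse = "")
}

fits <- list(
  m1 = lm(mpg ~ cyl, data = mtcars),
  m2 = lm(mpg ~ cyl * hp, data = mtcars)
)

sapply(fits, rhs_string)                 # right-hand sides as written
sapply(fits, rhs_string, expand = TRUE)  # interactions expanded
```

Because the helper always returns a single string, `sapply` gives back a named character vector that can be bound to a selection table as a column.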
Update: the original example code improved and simplified:
library(MuMIn)
data <- mtcars
#! Feed the models directly to `model.sel`. No need to create a separate list of
#! models.
gm <- lm(mpg ~ cyl, data = data)
out.put <- model.sel(
model1 = gm,
model2 = update(gm, . ~. + hp),
model3 = update(gm, . ~ . * hp + disp),
model4 = update(gm, . ~ . * hp + disp + wt + drat)
)
sel.table <- out.put
sel.table$family <- NULL
sel.table <- round(sel.table, 2)
#! Use `get.models` to get the list of models in the same order as in the
#! selection table
sel.table <- cbind(
Model =
#! Update (2): model number according to their original order, use:
attr(out.put, "order"),
#! otherwise: seq(nrow(sel.table)),
#!
#! Update (2): add a large `width.cutoff` to `deparse` so that the result is
#! always a single string and `sapply` returns a character vector
#! rather than a list.
#! For oversize formulas, use `paste0(deparse(...), collapse = "")`
formula = sapply(get.models(out.put, TRUE),
function(mo) deparse(formula(mo)[[3]], width.cutoff = 500)),
#!
sel.table
)

library(MuMIn)
library(flextable) # for flextable() and autofit()
data <- mtcars
#! Feed the models directly to `model.sel`. No need to create a separate list of
#! models.
gm <- lm(mpg ~ cyl, data = data)
out.put <- model.sel(
model1 = gm,
model2 = update(gm, . ~. + hp),
model3 = update(gm, . ~ . * hp + disp),
model4 = update(gm, . ~ . * hp + disp + wt + drat)
)
sel.table <- out.put
sel.table$family <- NULL
sel.table <- round(sel.table, 2)
#! Use `get.models` to get the list of models in the same order as in the
#! selection table
sel.table <- cbind(
Model =
#! Update (2): model number according to their original order, use:
attr(out.put, "order"),
#! otherwise: seq(nrow(sel.table)),
#!
#! Update (2): add a large `width.cutoff` to `deparse` so that the result is
#! always a single string and `sapply` returns a character vector
#! rather than a list.
#! For oversize formulas, use `paste0(deparse(...), collapse = "")`
formula = sapply(get.models(out.put, TRUE),
function(mo) deparse(formula(mo)[[3]], width.cutoff = 500)),
#!
sel.table
)
#slightly rename intercept column
colnames(sel.table)[3] <- 'Intercept'
# #select summary columns for model selection table
# sel.table <- sel.table %>%
# select(Model,formula,Intercept,df,logLik,AICc,delta,weight)
print <- flextable(head(sel.table))
print <- autofit(print)
print

Since your question isn't reproducible, I'll try with something else, and maybe that's what you're looking for:
data = mtcars
models = list(
model1 = lm(mpg ~ cyl, data = data),
model2 = lm(mpg ~ cyl + hp, data = data)
)
data.frame(name = as.character(lapply(models, `[[`, c(10,2))),
other.column = NA)
#> name other.column
#> 1 mpg ~ cyl NA
#> 2 mpg ~ cyl + hp NA
Created on 2020-05-28 by the reprex package (v0.3.0)
The call of an lm object is at position 10 of the list, and the formula is the second element of that call. You can actually count the positions when you type model1$. You can use rownames() instead of a column, but that's not recommended.
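Positional indexing depends on the internal layout of the lm object, so it may be worth comparing it with name-based access and with the `formula()` extractor. This is a minimal sketch; the variable names `by_position`, `by_name` and `by_extractor` are made up for illustration.

```r
model1 <- lm(mpg ~ cyl + hp, data = mtcars)

# Positional access, as above: element 10 is the call, its 2nd element the formula
by_position <- model1[[c(10, 2)]]

# Name-based access: does not depend on the order of the list elements
by_name <- model1$call$formula

# Extractor function: also works when the formula was passed in as a variable
by_extractor <- formula(model1)

identical(by_position, by_name)
deparse(by_extractor)
```

`formula()` is probably the safest of the three, because `model1$call$formula` only holds the variable name if the model was fitted as `lm(f, data = ...)`.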
EDIT AFTER REPRODUCIBLE EXAMPLE
library(MuMIn)
data = mtcars
models = list(
model1 <- lm(mpg ~ cyl, data = data),
model2 <- lm(mpg ~ cyl + hp, data = data),
model3 <- lm(mpg ~ cyl * hp, data = data)
)
# create an object that contains all of the model selection information
out.put <- model.sel(models)
#coerce the object out.put into a data frame
sel.table <-as.data.frame(out.put)[6:10]
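# formulas as names; model.sel() sorts the rows by AICc, so index the formulas by
# the row names (the models' original positions) to keep each formula with its row
sel.table$name = as.character(lapply(models, `[[`, c(10,2)))[as.integer(rownames(sel.table))]
# reordering
sel.table = sel.table[, c(6,1,2,3,4,5)]
sel.table
#> name df logLik AICc delta weight
#> 3 mpg ~ cyl * hp 5 -78.14329 168.5943 0.000000 0.5713716
#> 1 mpg ~ cyl 3 -81.65321 170.1636 1.569298 0.2607054
#> 2 mpg ~ cyl + hp 4 -80.78092 171.0433 2.449068 0.1679230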
Created on 2020-05-31 by the reprex package (v0.3.0)

Related

How to avoid the adding of the column "ModelLik" when tab_df() is used to export the AIC table

As an example, consider two LMMs.
library(nlme)
lme1 <- lme(mpg ~ cyl + disp,
random = ~1|disp,
method = "ML",
data = mtcars)
lme2 <- lme(mpg ~ cyl * disp,
random = ~1|disp,
method = "ML",
data = mtcars)
AIC table to select the best model.
library(AICcmodavg)
Cand.models <- list( )
Cand.models[[1]] <- lme1
Cand.models[[2]] <- lme2
aictab(Cand.models, sort = TRUE)
Model selection based on AICc:
K AICc Delta_AICc AICcWt Cum.Wt LL
Mod2 6 164.40 0.00 0.94 0.94 -74.52
Mod1 5 169.87 5.46 0.06 1.00 -78.78
To export the AIC table I like to use the tab_df() function.
library(sjPlot)
tab_df(aictab(Cand.models, sort = TRUE))
The tab_df() function unnecessarily adds the column "ModelLik" to the table. How can I avoid this?
The reason is that the object returned by aictab has more columns than are printed by its print method. In the following, I assign the returned table to a variable tb and inspect it with str(). If you use RStudio, you can also see it in the Environment pane.
The function tab_df just formats the data frame, so we can select, remove or even rename columns according to our needs. The following shows an example. As a small bonus, I use user-defined names for the models:
library("nlme")
library("AICcmodavg")
library("sjPlot")
lme1 <- lme(mpg ~ cyl + disp, random = ~1|disp, method = "ML", data = mtcars)
lme2 <- lme(mpg ~ cyl * disp, random = ~1|disp, method = "ML", data = mtcars)
# alternative way to produce the list, can optionally provide speaking names
Cand.models <- list(
'model 1' = lme1,
'model 2' = lme2
)
# assign the table to a variable
tb <- aictab(Cand.models, sort = TRUE)
## look what is in
str(tb)
which_columns <- c("Modnames", "K", "AICc", "Delta_AICc", "AICcWt", "Cum.Wt", "LL")
tab_df(aictab(Cand.models, sort = TRUE)[which_columns])

Remove Inf values from formula before lm in R

Let's say I use the mtcars dataset to set up an arbitrary formula:
data(mtcars)
myFormula <- as.formula("mpg ~ cyl + I(disp / hp) + I(wt^2) + I((qsec + vs) / gear)")
I would like to use that formula inside the lm function, but before that, I would like to remove any rows that contain Inf, NaN or NA. For example, if disp / hp results in any Inf values, I would like to remove the rows that contain them. I know I can do that by generating a new variable first, removing the Inf values and then running lm with the formula, but I would like to do it using the formula terms, since this is part of a Shiny application and the formula is user input.
My try:
formulaTerms <- terms(myFormula)
formulaTerms <- gsub("I", "", labels(formulaTerms))
formulaTermsRatio <- formulaTerms[grep("/", formulaTerms)]
mtcarsDT <- setDT(mtcars)
mtcarsDT <- mtcarsDT[, formulaTermsRatio[1] := sym(formulaTermsRatio[1])]
Use drop.terms. Assuming that each term is represented by a single column in the model matrix (i.e. no factors with > 2 levels) we compute the model matrix mm and find the column numbers, wx, of the bad columns. Then use drop.terms to drop those columns from the terms object and extract the formula from the revised terms object.
mtcars[1, 3] <- Inf
# is.na is TRUE for NA or NaN; is.infinite is TRUE for Inf or -Inf
is.bad <- function(x) any(is.na(x) | is.infinite(x))
fo_terms <- terms(myFormula) # myFormula is taken from question
mm <- model.matrix(myFormula, mtcars)
wx <- which(apply(mm[, -1], 2, is.bad))
fo_terms2 <- drop.terms(fo_terms, wx, keep.response = TRUE)
fo2 <- formula(fo_terms2)
myFormula
## mpg ~ cyl + I(disp/hp) + I(wt^2) + I((qsec + vs)/gear)
fo2
## mpg ~ cyl + I(wt^2) + I((qsec + vs)/gear)
Update
If you want to remove bad rows rather than terms from the formula then:
lm(myFormula, mtcars, subset = !apply(mm, 1, is.bad))
Note that lm will automatically remove rows with NAs and NaNs (depending on the na.action argument), so in this case you could simplify is.bad to only check for Inf and -Inf.
Another approach would be to replace Inf and -Inf with NA.
# is.infinite() has no data frame method, so apply it column by column
mtcars[sapply(mtcars, is.infinite)] <- NA
and then perform lm normally.
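Replacing values in the raw data does not catch an Inf that only appears once a term such as I(disp / hp) is evaluated. One possible way to handle that, sketched here under the assumption that every term in the formula is numeric, is to evaluate the terms with model.frame and keep only rows where all of them are finite. The names `dat`, `fo` and `keep` are illustrative, and the zero in hp is planted on purpose.

```r
dat <- mtcars
dat$hp[1] <- 0   # planted value: makes disp / hp infinite in row 1
fo <- mpg ~ cyl + I(disp / hp) + I(wt^2)

# Evaluate every term of the formula without dropping any rows
mf <- model.frame(fo, data = dat, na.action = na.pass)

# TRUE for rows where every evaluated term is finite (not NA, NaN, Inf or -Inf)
keep <- Reduce(`&`, lapply(mf, is.finite))

fit <- lm(fo, data = dat[keep, ])
nobs(fit)
```

Since the check runs on the evaluated terms rather than on named columns, it should work for whatever formula the user supplies, as long as its terms are numeric.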
You can remove these values from the data you're regressing on. Inf will occur where hp==0 or gear==0.
data(mtcars)
df <- mtcars
myFormula <- as.formula("mpg ~ cyl + I(disp / hp) + I(wt^2) + I((qsec + vs) / gear)")
df <- df[!(df$hp==0 | df$gear==0),]
lm(myFormula,df)
Call:
lm(formula = myFormula, data = df)
Coefficients:
(Intercept) cyl I(disp/hp) I(wt^2) I((qsec + vs)/gear)
35.5847 -1.9639 1.0707 -0.3671 -0.1699

Purrr and several multiple regressions in R

I know there are several ways to compare regression models. One way is to create models (from simple to multiple regression) and compare R2, adjusted R2, etc.:
Mod1: y = b0 + b1*x1
Mod2: y = b0 + b1*x1 + b2*x2
Mod3: y = b0 + b1*x1 + b2*x2 + b3*x3 (etc.)
I'm aware that some packages can perform stepwise regression, but I'm trying to do this with purrr. I was able to create several simple linear models (thanks to this post here), and now I want to know how I can create regression models by adding one specific IV to the equation at a time:
reproducible code
data(mtcars)
library(tidyverse)
library(purrr)
library(broom)
iv_vars <- c("cyl", "disp", "hp")
make_model <- function(nm) lm(mtcars[c("mpg", nm)])
fits <- Map(make_model, iv_vars)
glance_tidy <- function(x) c(unlist(glance(x)), unlist(tidy(x)[, -1]))
t(iv_vars %>% Map(f = make_model) %>% sapply(glance_tidy))
Output
What I want:
Mod1: mpg ~cyl
Mod2: mpg ~ cyl + disp
Mod3: mpg ~ cyl + disp + hp
Thanks much.
I would begin by creating a tibble with a list column that stores your formulae. Then map lm over the formulae, and map glance over the models.
library(tidyverse)
library(broom)
mtcars %>% as_tibble()
formula <- c(mpg ~ cyl, mpg ~ cyl + disp)
output <-
tibble(formula) %>%
mutate(model = map(formula, ~lm(formula = .x, data = mtcars)),
glance = map(model, glance))
output$glance
output %>% unnest(glance)
You could cumulatively paste over your vector of iv_vars to get the combinations you want. I used the code in this answer to do this.
I use the plus sign as the separator between variables to get ready for the formula notation in lm.
cumpaste = function(x, .sep = " ") {
Reduce(function(x1, x2) paste(x1, x2, sep = .sep), x, accumulate = TRUE)
}
( iv_vars_cum = cumpaste(iv_vars, " + ") )
[1] "cyl" "cyl + disp" "cyl + disp + hp"
Then switch the make_model function to use a formula and a dataset. The explanatory variables, separated by the plus sign, get passed to the function after the tilde in the formula. Everything is pasted together, which lm conveniently interprets as a formula.
make_model = function(nm) {
lm(as.formula(paste0("mpg ~", nm)), data = mtcars)
}
Which we can see works as desired, returning a model with both explanatory variables.
make_model("cyl + disp")
Call:
lm(formula = as.formula(paste0("mpg ~", nm)), data = mtcars)
Coefficients:
(Intercept) cyl disp
34.66099 -1.58728 -0.02058
You'll likely need to rethink how you want to combine the info together, as you will now have differing numbers of columns due to the increased number of coefficients.
A possible option is to add dplyr::bind_rows to your glance_tidy function and then use map_dfr from purrr for the final output.
glance_tidy = function(x) {
dplyr::bind_rows( c( unlist(glance(x)), unlist(tidy(x)[, -1]) ) )
}
iv_vars_cum %>%
Map(f = make_model) %>%
map_dfr(glance_tidy, .id = "model")
# A tibble: 3 x 28
model r.squared adj.r.squared sigma statistic p.value df logLik AIC
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 cyl 0.7261800 0.7170527 3.205902 79.56103 6.112687e-10 2 -81.65321 169.3064
2 cyl + disp 0.7595658 0.7429841 3.055466 45.80755 1.057904e-09 3 -79.57282 167.1456
3 cyl + disp + hp 0.7678877 0.7430186 3.055261 30.87710 5.053802e-09 4 -79.00921 168.0184 ...
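As a base-R alternative to pasting strings, the cumulative formulas can also be built with reformulate. The sketch below is illustrative only: `formulas` and `fits` are made-up names, and it reports just R2 and adjusted R2 so that every model yields a row of the same shape.

```r
iv_vars <- c("cyl", "disp", "hp")

# Build mpg ~ cyl, mpg ~ cyl + disp, mpg ~ cyl + disp + hp
formulas <- lapply(seq_along(iv_vars), function(k) {
  reformulate(iv_vars[seq_len(k)], response = "mpg")
})

fits <- lapply(formulas, lm, data = mtcars)
names(fits) <- vapply(formulas, function(f) deparse(f[[3]]), character(1))

# One comparable row per model, regardless of the number of coefficients
data.frame(
  model = names(fits),
  r.squared = vapply(fits, function(m) summary(m)$r.squared, numeric(1)),
  adj.r.squared = vapply(fits, function(m) summary(m)$adj.r.squared, numeric(1)),
  row.names = NULL
)
```

Keeping only model-level statistics sidesteps the differing-column problem; per-coefficient output would need a long format instead.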

Combining cbind and paste in linear model

I would like to know how can I come up with a lm formula syntax that would enable me to use paste together with cbind for multiple multivariate regression.
Example
In my model I have a set of variables, which corresponds to the primitive example below:
data(mtcars)
depVars <- paste("mpg", "disp")
indepVars <- paste("qsec", "wt", "drat")
Problem
I would like to create a model with my depVars and indepVars. The model, typed by hand, would look like that:
modExample <- lm(formula = cbind(mpg, disp) ~ qsec + wt + drat, data = mtcars)
I'm interested in generating the same formula without referring to variable names and only using depVars and indepVars vectors defined above.
Attempt 1
For example, what I had on mind would correspond to:
mod1 <- lm(formula = formula(paste(cbind(paste(depVars, collapse = ",")), " ~ ",
indepVars)), data = mtcars)
Attempt 2
I tried this as well:
mod2 <- lm(formula = formula(cbind(depVars), paste(" ~ ",
paste(indepVars,
collapse = " + "))),
data = mtcars)
Side notes
I found a number of good examples on how to use paste with formula but I would like to know how I can combine with cbind.
This is mostly a syntax question; in my real data I have a number of variables I would like to introduce to the model, and making use of the previously generated vectors is more parsimonious and makes the code more presentable. In effect, I'm only interested in creating a formula object that would contain cbind with variable names corresponding to one vector and the remaining variables corresponding to another vector.
In a word, I want to arrive at the formula in modExample without having to type variable names.
Think it works.
data(mtcars)
depVars <- c("mpg", "disp")
indepVars <- c("qsec", "wt", "drat")
lm(formula(paste('cbind(',
paste(depVars, collapse = ','),
') ~ ',
paste(indepVars, collapse = '+'))), data = mtcars)
All the solutions below use these definitions:
depVars <- c("mpg", "disp")
indepVars <- c("qsec", "wt", "drat")
1) character string formula Create a character string representing the formula and then run lm using do.call. Note that the formula shown in the output displays correctly and is written out.
fo <- sprintf("cbind(%s) ~ %s", toString(depVars), paste(indepVars, collapse = "+"))
do.call("lm", list(fo, quote(mtcars)))
giving:
Call:
lm(formula = "cbind(mpg, disp) ~ qsec+wt+drat", data = mtcars)
Coefficients:
mpg disp
(Intercept) 11.3945 452.3407
qsec 0.9462 -20.3504
wt -4.3978 89.9782
drat 1.6561 -41.1148
1a) This would also work:
fo <- sprintf("cbind(%s) ~.", toString(depVars))
do.call("lm", list(fo, quote(mtcars[c(depVars, indepVars)])))
giving:
Call:
lm(formula = cbind(mpg, disp) ~ qsec + wt + drat, data = mtcars[c(depVars,
indepVars)])
Coefficients:
mpg disp
(Intercept) 11.3945 452.3407
qsec 0.9462 -20.3504
wt -4.3978 89.9782
drat 1.6561 -41.1148
2) reformulate @akrun and @Konrad, in comments below the question, suggest using reformulate. This approach produces a "formula" object whereas the ones above produce a character string as the formula. (If this were desired for the prior solutions above it would be possible using fo <- formula(fo) .) Note that it is important that the response argument to reformulate be a call object and not a character string or else reformulate will interpret the character string as the name of a single variable.
fo <- reformulate(indepVars, parse(text = sprintf("cbind(%s)", toString(depVars)))[[1]])
do.call("lm", list(fo, quote(mtcars)))
giving:
Call:
lm(formula = cbind(mpg, disp) ~ qsec + wt + drat, data = mtcars)
Coefficients:
mpg disp
(Intercept) 11.3945 452.3407
qsec 0.9462 -20.3504
wt -4.3978 89.9782
drat 1.6561 -41.1148
3) lm.fit Another way that does not use a formula at all is:
m <- as.matrix(mtcars)
fit <- lm.fit(cbind(1, m[, indepVars]), m[, depVars])
The output is a list with these components:
> names(fit)
[1] "coefficients" "residuals" "effects" "rank"
[5] "fitted.values" "assign" "qr" "df.residual"

Easily performing the same regression on different datasets

I'm performing the same regression on several different datasets (same dependent and independent variables). However, there are many independent variables, and I often want to test adding/removing different variables. I'd like to avoid making all these changes to different lines of code, just because they use different datasets. Can I instead just copy the formula that was used to create some object, and then create a new object using a different dataset? For example, something like:
fit1 <- lm(y ~ x1 + x2 + x3 + ..., data = dataset1)
fit2 <- lm(fit1$call, data = dataset2) # this doesn't work
fit3 <- lm(fit1$call, data = dataset3) # this doesn't work
This way, if I want to update numerous regressions, I just update the first one and then rerun them all.
Can this be done? Preferably without using a loop or paste().
Thanks!
Or use update
(fit <- lm(mpg ~ wt, data = mtcars))
# Call:
# lm(formula = mpg ~ wt, data = mtcars)
#
# Coefficients:
# (Intercept) wt
# 37.285 -5.344
update(fit, data = mtcars[mtcars$hp < 100, ])
# Call:
# lm(formula = mpg ~ wt, data = mtcars[mtcars$hp < 100, ])
#
# Coefficients:
# (Intercept) wt
# 39.295 -5.379
update(fit, data = mtcars[1:10, ])
# Call:
# lm(formula = mpg ~ wt, data = mtcars[1:10, ])
#
# Coefficients:
# (Intercept) wt
# 33.774 -4.285
Collect your datasets into a list and then use lapply. E.g.:
dsets <- list(dataset1,dataset2,dataset3)
lapply(dsets, function(x) lm(y ~ x1 + x2, data=x) )
I'm not entirely sure that this is what you want, but you can do this as follows:
formula <- y ~ x1 + x2 + x3 + ...
fit1 <- lm(formula, data = dataset1)
fit2 <- lm(formula, data = dataset2)
fit3 <- lm(formula, data = dataset3)
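If the first model has already been fitted, its formula can also be pulled back out with formula() rather than stored separately. The sketch below uses mtcars subsets as hypothetical stand-ins for dataset2 and dataset3, since the question's datasets are not available.

```r
fit1 <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical stand-ins for dataset2 and dataset3
dataset2 <- mtcars[mtcars$cyl == 4, ]
dataset3 <- mtcars[mtcars$cyl == 8, ]

# formula() extracts the model formula; fit1$call is the whole call, not a formula
fit2 <- lm(formula(fit1), data = dataset2)
fit3 <- lm(formula(fit1), data = dataset3)

# Same terms in every fit, different data
identical(names(coef(fit1)), names(coef(fit2)))
```

With this pattern, changing the formula in fit1 and rerunning the script updates all downstream fits, which is close to what the question's `lm(fit1$call, ...)` attempt was aiming for.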
