Easily performing the same regression on different datasets - r

I'm performing the same regression on several different datasets (same dependent and independe variables). However, there are many independent variables, and I often want to test adding/removing different variables. I'd like to avoid making all these changes to different lines of code, just because they use different datasets. Can I instead just copy the formula that was used to create some object, and then create a new object using a different dataset? For example, something like:
fit1 <- lm(y ~ x1 + x2 + x3 + ..., data = dataset1)
fit2 <- lm(fit1$call, data = dataset2) # this doesn't work
fit3 <- lm(fit1$call, data = dataset3) # this doesn't work
This way, if I want to update numerous regressions, I just update the first one and then rerun them all.
Can this be done? Preferably without using a loop or paste().
Thanks!

Or use update
(fit <- lm(mpg ~ wt, data = mtcars))
# Call:
# lm(formula = mpg ~ wt, data = mtcars)
#
# Coefficients:
# (Intercept) wt
# 37.285 -5.344
update(fit, data = mtcars[mtcars$hp < 100, ])
# Call:
# lm(formula = mpg ~ wt, data = mtcars[mtcars$hp < 100, ])
#
# Coefficients:
# (Intercept) wt
# 39.295 -5.379
update(fit, data = mtcars[1:10, ])
# Call:
# lm(formula = mpg ~ wt, data = mtcars[1:10, ])
#
# Coefficients:
# (Intercept) wt
# 33.774 -4.285

Collect your datasets into a list and then use lapply. E.g.:
dsets <- list(dataset1,dataset2,dataset3)
lapply(dsets, function(x) lm(y ~ x1 + x2, data=x) )

Not sure entirely that this what you want but you can do this as follows:
formula <- y ~ x1 + x2 + x3 + ...
fit1 <- lm(formula, data = dataset1)
fit2 <- lm(formula, data = dataset2)
fit3 <- lm(formula, data = dataset3)

Related

How to dynamically reference datasets in function call of linear regression

Let's say I have a function like this:
data("mtcars")
ncol(mtcars)
test <- function(string){
fit <- lm(mpg ~ cyl,
data = string)
return(fit)
}
I'd like to be able to have the "string" variable evaluated as the dataset for a linear regression like so:
test("mtcars")
However, I get an error:
Error in eval(predvars, data, env) : invalid 'envir' argument of
type 'character'
I've tried using combinations of eval and parse, but to no avail. Any ideas?
You can use get() to search by name for an object.
test <- function(string){
fit <- lm(mpg ~ cyl, data = get(string))
return(fit)
}
test("mtcars")
# Call:
# lm(formula = mpg ~ cyl, data = get(string))
#
# Coefficients:
# (Intercept) cyl
# 37.885 -2.876
You can add one more line to make the output look better. Notice the change of the Call part in the output. It turns from data = get(string) to data = mtcars.
test <- function(string){
fit <- lm(mpg ~ cyl, data = get(string))
fit$call$data <- as.name(string)
return(fit)
}
test("mtcars")
# Call:
# lm(formula = mpg ~ cyl, data = mtcars)
#
# Coefficients:
# (Intercept) cyl
# 37.885 -2.876
Try this slight change to your code:
#Code
test <- function(string){
fit <- lm(mpg ~ cyl,
data = eval(parse(text=string)))
return(fit)
}
#Apply
test("mtcars")
Output:
Call:
lm(formula = mpg ~ cyl, data = eval(parse(text = string)))
Coefficients:
(Intercept) cyl
37.885 -2.876

Problem extracting model covariates for model summary table

I'm a graduate student using a linear regression (count) model to understand drivers of fish movement into and out of tidal wetlands. I am currently trying to generate a publication-worthy model summary table in r. I've been using the sel.table function which has been working well for this purpose.
However, I've been unable to generate a column that contains the individual model formulas. Below is my code which is based off of some nice instructions for using the MuMIn package. https://sites.google.com/site/rforfishandwildlifegrads/home/mumin_usage_examples
So to recap, my question pertains to the last line of code below,
How can I insert model formulas into a model selection table.**
install.packages("MuMIn")
library(MuMIn)
data = mtcars
models = list(
model1 <- lm(mpg ~ cyl, data = data),
model2 <- lm(mpg ~ cyl + hp, data = data),
model3 <- lm(mpg ~ cyl * hp, data = data)
)
#create an object “out.put” that contains all of the model selection information
out.put <- model.sel(models)
#coerce the object out.put into a data frame
sel.table <-as.data.frame(out.put)[6:10]
#add a column for model names
sel.table$Model <- rownames(sel.table)
#replace model name with formulas
for(i in 1:nrow(sel.table)) sel.table$Model[i]<- as.character(formula(paste(sel.table$Model[i])))[3]
#Any help on this topic would be greatly appreciated!
UPDATED CODE
My method of pulling out model names is pretty clunky but otherwise this code seems to generate what I intended (a complete model selection table). Also, I'm not sure if the model coefficients are displayed properly but I hope to follow up on this for my final answer.
data = mtcars
#write linear models
models = list(
model1 <- lm(mpg ~ cyl, data = data),
model2 <- lm(mpg ~ cyl + hp, data = data),
model3 <- lm(mpg ~ cyl * hp + disp, data = data),
model4 <- lm(mpg ~ cyl * hp + disp + wt + drat, data = data)
)
#create an object “out.put” that contains all of the model selection information
out.put <- model.sel(models)
#coerce the object out.put into a data frame
sel.table <-as.data.frame(out.put)
#slightly rename intercept column
names(sel.table)[1]="Intercept"
#select variables to display in model summary table
sel.table <- sel.table %>%
select(Intercept,cyl,hp,disp,wt,drat,df,logLik,AICc,delta)
#round numerical coumns
sel.table[,1:6]<- round(sel.table[,1:6],2)
sel.table[,8:10]<-round(sel.table[,8:10],2)
#add a column for model (row) names
sel.table$Model <- rownames(sel.table)
#extract model formulas
form <- data.frame(name = as.character(lapply(models, `[[`, c(10,2))))
#generate a column with model (row) numbers (beside associated model formulas)
form <- form %>%
mutate(Model=(1:4))
#merge model table and model formulas
sum_table <- merge (form,sel.table,by="Model")
#rename model equation column
names(sum_table)[2]="Formula"
print <- flextable(head(sum_table))
print <- autofit(print)
print
6/1/20 UPDATE:
Below is an image that describes two issues that I'm having with the code. I've found a workaround to the first question but I'm still investigating the second.
see details here
Models end up being misnumbered
Model formula columns are being generated for each model
I believe there is a part of the code missing in the examples you followed, that is why your code does not work.
The easiest way to generate formula-like strings is simply to deparse the right hand side of the model formulas (i.e. 3-rd element):
sapply(get.models(out.put, TRUE), function(mo) deparse(formula(mo)[[3]], width.cutoff = 500))
or, if you want A*B's expanded into A + B + A:B:
sapply(get.models(out.put, TRUE), function(mo) deparse(terms(formula(mo), simplify = TRUE)[[3]], width.cutoff = 500))
Update: the original example code improved and simplified:
library(MuMIn)
data <- mtcars
#! Feed the models directly to `model.sel`. No need to create a separate list of
#! models.
gm <- lm(mpg ~ cyl, data = data)
out.put <- model.sel(
model1 = gm,
model2 = update(gm, . ~. + hp),
model3 = update(gm, . ~ . * hp + disp),
model4 = update(gm, . ~ . * hp + disp + wt + drat)
)
sel.table <- out.put
sel.table$family <- NULL
sel.table <- round(sel.table, 2)
#! Use `get.models` to get the list of models in the same order as in the
#! selection table
sel.table <- cbind(
Model =
#! Update (2): model number according to their original order, use:
attr(out.put, "order"),
#! otherwise: seq(nrow(sel.table)),
#!
#! Update (2): add a large `width.cutoff` to `deparse` so that the result is
#! always a single string and `sapply` returns a character vector
#! rather than a list.
#! For oversize formulas, use `paste0(deparse(...), collapse = "")`
formula = sapply(get.models(out.put, TRUE),
function(mo) deparse(formula(mo)[[3]], width.cutoff = 500)),
#!
sel.table
)
library(MuMIn)
data <- mtcars
#! Feed the models directly to `model.sel`. No need to create a separate list of
#! models.
gm <- lm(mpg ~ cyl, data = data)
out.put <- model.sel(
model1 = gm,
model2 = update(gm, . ~. + hp),
model3 = update(gm, . ~ . * hp + disp),
model4 = update(gm, . ~ . * hp + disp + wt + drat)
)
sel.table <- out.put
sel.table$family <- NULL
sel.table <- round(sel.table, 2)
#! Use `get.models` to get the list of models in the same order as in the
sel.table <- cbind(
Model =
#! Update (2): model number according to their original order, use:
attr(out.put, "order"),
#! otherwise: seq(nrow(sel.table)),
#!
#! Update (2): add a large `width.cutoff` to `deparse` so that the result is
#! always a single string and `sapply` returns a character vector
#! rather than a list.
#! For oversize formulas, use `paste0(deparse(...), collapse = "")`
formula = sapply(get.models(out.put, TRUE),
function(mo) deparse(formula(mo)[[3]], width.cutoff = 500)),
#!
sel.table
)
#slightly rename intercept column
colnames(sel.table)[3] <- 'Intercept'
# #select summary columns for model selection table
# sel.table <- sel.table %>%
# select(Model,formula,Intercept,df,logLik,AICc,delta,weight)
print <- flextable(head(sel.table))
print <- autofit(print)
print
Since your question isn't reproducible, i'll try with something else and maybe that's what you're looking for:
data = mtcars
models = list(
model1 = lm(mpg ~ cyl, data = data),
model2 = lm(mpg ~ cyl + hp, data = data)
)
data.frame(name = as.character(lapply(models, `[[`, c(10,2))),
other.column = NA)
#> name other.column
#> 1 mpg ~ cyl NA
#> 2 mpg ~ cyl + hp NA
Created on 2020-05-28 by the reprex package (v0.3.0)
The formula (call) of a lm object is on position 10 of the list. You can actually count when you type model1$. You can use rownames() instead of a column, but that's not recommended.
EDIT AFTER REPRODUCIBLE EXAMPLE
library(MuMIn)
data = mtcars
models = list(
model1 <- lm(mpg ~ cyl, data = data),
model2 <- lm(mpg ~ cyl + hp, data = data),
model3 <- lm(mpg ~ cyl * hp, data = data)
)
# create an object that contains all of the model selection information
out.put <- model.sel(models)
#coerce the object out.put into a data frame
sel.table <-as.data.frame(out.put)[6:10]
# formulas as names
sel.table$name = as.character(lapply(models, `[[`, c(10,2)))
# reordering
sel.table = sel.table[, c(6,1,2,3,4,5)]
sel.table
#> name df logLik AICc delta weight
#> 3 mpg ~ cyl 5 -78.14329 168.5943 0.000000 0.5713716
#> 1 mpg ~ cyl + hp 3 -81.65321 170.1636 1.569298 0.2607054
#> 2 mpg ~ cyl * hp 4 -80.78092 171.0433 2.449068 0.1679230
Created on 2020-05-31 by the reprex package (v0.3.0)

Estimate equation with varying numbers of parameters in R (lm) with demeaned variables

I want to estimate an equation such as:
(where the bar denotes the mean of a variable.... Meaning, I want to automatically have interactions between Z and a demeaned version of X. So far I just demean the variables manually beforehand and estimate:
lm(Y ~ .*Z, data= sdata)
This seems to be working, but I would rather use a solution that does not require manual demeaning beforehand because I would also like to include the means of more complex terms, such as:
Edit:
As requested, a working code-sample, note that in the actual thing I have large (and varying) numbers of X- variables, so that I dont want to use a hard-coded variant:
x1 <- runif(100)
x2 <- runif(100)
Z <- runif(100)
Y <- exp(x1) + exp(x2) + exp(z)
##current way of estimating the first equation:
sdata <- data.frame(Y=Y,Z=Z,x1=x1-mean(x1),x2=x2-mean(x2))
lm(Y ~ .*Z, data= sdata)
##basically what I want is that the following terms, and their interactions with Z are also used:
# X1^2 - mean(X1^2)
# X2^2 - mean(X2^2)
# X1*X2 - mean(X1*X2)
Edit 2:
Now, what I want to achieve is basically what
lm(Y ~ .^2*Z, data= sdata)
would do. However, given prior demeaing expressions in there, such as: Z:X1:X2 would correspond to: (x1-mean(x1))*(x2-mean(x2)), while what I want to have is x1*x2-mean(x1*x2)
To show that scale works inside a formula:
lm(mpg ~ cyl + scale(disp*hp, scale=F), data=mtcars)
Call:
lm(formula = mpg ~ cyl + scale(disp * hp, scale = F), data = mtcars)
Coefficients:
(Intercept) cyl scale(disp * hp, scale = F)
3.312e+01 -2.105e+00 -4.642e-05
Now for comparison let's scale the interaction outside the formula:
mtcars$scaled_interaction <- with(mtcars, scale(disp*hp, scale=F))
lm(mpg ~ cyl + scaled_interaction, data=mtcars)
Call:
lm(formula = mpg ~ cyl + scaled_interaction, data = mtcars)
Coefficients:
(Intercept) cyl scaled_interaction
3.312e+01 -2.105e+00 -4.642e-05
At least in these examples, it seems as if scale inside formulae is working.
To provide a solution to your specific issue:
Alternative 1: Use formulae
# fit without Z
mod <- lm(Y ~ (.)^2, data= sdata[, names(sdata) != "Z" ])
vars <- attr(mod$terms, "term.labels")
vars <- gsub(":", "*", vars) # needed so that scale works later
vars <- paste0("scale(", vars, ", scale=F)")
newf <- as.formula(paste0("Y ~ ", paste0(vars, collapse = "+")))
# now interact with Z
f2 <- update.formula(newf, . ~ .*Z)
# This fives the following formula:
f2
Y ~ scale(x1, scale = F) + scale(x2, scale = F) + scale(x1*x2, scale = F) +
Z + scale(x1, scale = F):Z + scale(x2, scale = F):Z + scale(x1*x2, scale = F):Z
Alternative 2: Use Model Matrices
# again fit without Z and get model matrix
mod <- lm(Y ~ (.)^2, data= sdata[, names(sdata) != "Z" ])
modmat <- apply(model.matrix(mod), 2, function(x) scale(x, scale=F))
Here, all x's and the interactions are demeaned:
> head(modmat)
(Intercept) x1 x2 x1:x2
[1,] 0 0.1042908 -0.08989091 -0.01095459
[2,] 0 0.1611867 -0.32677059 -0.05425087
[3,] 0 0.2206845 0.29820499 0.06422944
[4,] 0 0.3462069 -0.15636463 -0.05571430
[5,] 0 0.3194451 -0.38668844 -0.12510551
[6,] 0 -0.4708222 -0.32502269 0.15144812
> round(colMeans(modmat), 2)
(Intercept) x1 x2 x1:x2
0 0 0 0
You can use the model matrix as follows:
modmat <- modmat[, -1] # remove intercept
lm(sdata$Y ~ modmat*sdata$Z)
It is not beautiful, but should do the work with any number of explanatory variables. You can also add Y and Z to the matrix so that the output looks prettier if this is a concern. Note that you can also create the model matrix directly without fitting the model. I took it from the fitted model directly since it have already fitted it for the first approach.
As a sidenote, it may be that this is not implemented in a more straight forward fashion because it is difficult to imagine situations in which demeaning the interaction is more desirable compared to the interaction of demeaned variables.
Comparing both approaches:
Here the output of both approaches for comparison. As you can see, apart from the coefficient names everything is identical.
> lm(sdata$Y ~ modmat*sdata$Z)
Call:
lm(formula = sdata$Y ~ modmat * sdata$Z)
Coefficients:
(Intercept) modmatx1 modmatx2 modmatx1:x2 sdata$Z
4.33105 1.56455 1.43979 -0.09206 1.72901
modmatx1:sdata$Z modmatx2:sdata$Z modmatx1:x2:sdata$Z
0.25332 0.38155 -0.66292
> lm(f2, data=sdata)
Call:
lm(formula = f2, data = sdata)
Coefficients:
(Intercept) scale(x1, scale = F) scale(x2, scale = F)
4.33105 1.56455 1.43979
scale(x1 * x2, scale = F) Z scale(x1, scale = F):Z
-0.09206 1.72901 0.25332
scale(x2, scale = F):Z scale(x1 * x2, scale = F):Z
0.38155 -0.66292

R lm: Create regressions dynamically

I have a set of dependent variables y1, y2, ...., a set of independent variables x1,x2,..., and a set of controls d1,d2,.... These are all inside a data.table, lets call it data.
I need to do something along the lines of
out1 <- lm(y1 ~ x1, data=data)
out2 <- lm(y1 ~ x1 + d1 + d2, data=data)
....
This is of course not very nice, so I was thinking about writing a list containing all these regressions, and than just iterate through that. Something along the lines of
myRegressions <- list('out1' = y1 ~ x1, 'out2' = y1 ~ x1 + d1 + d2)
output <- NULL
for (reg in myRegressions)
{
output[reg] <- lm(myRegressions[[reg]])
}
This of course won't work: I cannot construct the list as the syntax is invalid outside of lm(). What's a good approach here instead?
You can use paste0 and as.formula to generate formulas and then simply put them into lm(), e. g.
regressors <- c("x1", "x1 + x2", "x1 + x2 + x3")
for (i in 1:length(regressors)) {
print(as.formula(paste0("y1", "~", regressors[i])))
}
This gives you the formulas (printed). Just store them in a list and iterate over that list with lapply like
lapply(stored_formulas, function(x) { lm(x, data=yourData) })
Using the built in anscombe data frame try this:
formulas = list(y1 ~ x1, y2 ~ x2)
lapply(formulas, function(fo) do.call("lm", list(fo, data = quote(anscombe))))
giving:
[[1]]
Call:
lm(formula = y1 ~ x1, data = anscombe)
Coefficients:
(Intercept) x1
3.0001 0.5001
[[2]]
Call:
lm(formula = y2 ~ x2, data = anscombe)
Coefficients:
(Intercept) x2
3.001 0.500
Note that the Call: portion of the output is accurately produced which will be useful if there are many components to the output list.
Formulas can be quoted :
myReg <- list('out1' = "mpg ~ cyl")
lm(myReg[[1]],data=mtcars)
Call:
lm(formula = myReg[[1]], data = mtcars)
Coefficients:
(Intercept) cyl
37.885 -2.876

How can I pass an argument as a character to a function within a function?

I'm trying to create a series of models based on subsets of different categories in my data. Instead of creating a bunch of individual model objects, I'm using lapply() to create a list of models based on subsets of every level of my category factor, like so:
test.data <- data.frame(y=rnorm(100), x1=rnorm(100), x2=rnorm(100), category=rep(c("A", "B"), 2))
run.individual.models <- function(x) {
lm(y ~ x1 + x2, data=test.data, subset=(category==x))
}
individual.models <- lapply(levels(test.data$category), FUN=run.individual.models)
individual.models
# [[1]]
# Call:
# lm(formula = y ~ x1 + x2, data = test.data, subset = (category ==
# x))
# Coefficients:
# (Intercept) x1 x2
# 0.10852 -0.09329 0.11365
# ....
This works fantastically, except the model call shows subset = (category == x) instead of category == "A", etc. This makes it more difficult to use both for diagnostic purposes (it's hard to remember which model in the list corresponds to which category) and for functions like predict().
Is there a way to substitute the actual character value of x into the lm() call so that the model doesn't use the raw x in the call?
Along the lines of Explicit formula used in linear regression
Use bquote to construct the call
run.individual.models <- function(x) {
lmc <- bquote(lm(y ~ x1 + x2, data=test.data, subset=(category==.(x))))
eval(lmc)
}
individual.models <- lapply(levels(test.data$category), FUN=run.individual.models)
individual.models
[[1]]
Call:
lm(formula = y ~ x1 + x2, data = test.data, subset = (category ==
"A"))
Coefficients:
(Intercept) x1 x2
-0.08434 0.05881 0.07695
[[2]]
Call:
lm(formula = y ~ x1 + x2, data = test.data, subset = (category ==
"B"))
Coefficients:
(Intercept) x1 x2
0.1251 -0.1854 -0.1609

Resources