Use paste within lm to conserve variable names - r

I am running several linear regressions, looping through the variables that I want to use in the model. I want to present the output from R, and I want my variable names to appear in the summary of lm.
If I have:
var1 <- "nice_name_var1"
var2 <- "nice_name_var2"
depvar <- "nice_name_dep_var"
That I know are present in my data frame my.df
I cannot do this:
lm(paste(depvar,sep="") ~ paste(var1,sep="") + paste(var2,sep=""),
data=my.df)
I know that I could do this, and this works, but then the summary output doesn't have the names of the variables that I want:
lm(my.df[,paste(depvar,sep="")] ~ my.df[,paste(var1,sep="")] + my.df[,paste(var2,sep="")],
data=my.df)

1) paste Using the built in anscombe data frame as an example:
depvar <- "y1"
var1 <- "x1"
var2 <- "x2"
fo <- as.formula(paste(depvar, "~", var1, "+", var2))
do.call("lm", list(fo, quote(anscombe)))
giving this output which does show the variable names x1, x2 and y1:
Call:
lm(formula = y1 ~ x1 + x2, data = anscombe)
Coefficients:
(Intercept) x1 x2
3.0001 0.5001 NA
lm will accept a character string in place of a formula so as.formula can be omitted if it is ok to have quotes shown around it in the output.
2) model.frame/terms Another approach is:
mf <- model.frame(anscombe[c(depvar, var1, var2)])
do.call("lm", list(terms(mf), quote(anscombe)))
giving similar output.
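The remark under approach 1 about passing a character string can be sketched as follows. This is only a minimal illustration on the same anscombe example; the names fo_string and fit are introduced here and are not part of the original answer.

```r
depvar <- "y1"
var1 <- "x1"
var2 <- "x2"

# Build the formula as plain text; lm() coerces it to a formula internally.
fo_string <- paste(depvar, "~", var1, "+", var2)

# do.call() with quote(anscombe) keeps the data name readable in the Call.
fit <- do.call("lm", list(fo_string, quote(anscombe)))

print(fit$call)          # the formula is shown in quotes here
print(names(coef(fit)))  # coefficient labels still use the variable names
```

The only visible difference from the as.formula version should be the quotes around the formula in the Call: line.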

Related

Create a loop for model selection in R

I am trying to test a whole bunch of different models easily and compare AIC / R-sq values to select the right one. I am having some trouble saving things how I want to between lists and data frames.
data frame I am going to model:
set.seed(1)
df <- data.frame(response=runif(50,min=50,max=100),
var1 = sample(1:20,50,replace=T),
var2 = sample(40:60,50,replace = T))
list of formulas to test:
formulas <- list( response ~ NULL,
response ~ var1,
response ~ var2,
response ~ var1 + var2,
response ~ var1 * var2)
So, what I want to do is create a loop that will model all of these formulas, extract the formula, AIC, and R-sq values into a table, and let me sort it to find the best one. The problem I'm having is that I can't extract the formula as the single string "response ~ var1"; instead, it keeps coming out as "response" "~" "var1" if I try to extract it as a character object. Or, if I extract as a list (like below), then it comes out like this:
[[1]]
response ~ NULL
[[2]]
[1] 415.89
[[3]]
[1] 0
And I can't easily plug those list elements into a data frame. Here is what I tried:
selection <- matrix(ncol=3)
colnames(selection) <- c("formula","AIC","R2") # create a df to store results in
for ( i in 1:length(formulas)){
mod <- lm( formula = formulas[[i]], data= df)
mod_vals <- c(extract(formulas[[i]]),
round(AIC(mod),2),
round(summary(mod)$adj.r.squared,2)
)
selection[i,] <- mod_vals[]
}
Any ideas? I don't have to keep it as a for loop either, I just want a way to test a long list of models together.
Thanks.
You could use lapply to loop over the formulas, extract the relevant statistics from each model, and bind the resulting data frames together.
do.call(rbind, lapply(formulas, function(x) {
mod <- lm(x, data= df)
data.frame(formula = format(x),
AIC = round(AIC(mod),2),
r_square = round(summary(mod)$adj.r.squared,2))
}))
# formula AIC r_square
#1 response ~ NULL 405.98 0.00
#2 response ~ var1 407.54 -0.01
#3 response ~ var2 407.90 -0.02
#4 response ~ var1 + var2 409.50 -0.03
#5 response ~ var1 * var2 410.36 -0.03
Or with purrr
purrr::map_df(formulas, ~{
mod <- lm(.x, data= df)
data.frame(formula = format(.x),
AIC = round(AIC(mod),2),
r_square = round(summary(mod)$adj.r.squared,2))
})
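Since the question also asks for a way to sort the table, here is a hedged base R sketch that builds on the lapply approach and orders the rows by AIC. The intercept-only model is written as response ~ 1 here, and the names selection and adj_r2 are assumptions made for this illustration.

```r
set.seed(1)
df <- data.frame(response = runif(50, min = 50, max = 100),
                 var1 = sample(1:20, 50, replace = TRUE),
                 var2 = sample(40:60, 50, replace = TRUE))

formulas <- list(response ~ 1,
                 response ~ var1,
                 response ~ var2,
                 response ~ var1 + var2,
                 response ~ var1 * var2)

# One row per model: formula as a single string, AIC, adjusted R-squared.
selection <- do.call(rbind, lapply(formulas, function(f) {
  mod <- lm(f, data = df)
  data.frame(formula = paste(deparse(f), collapse = " "),
             AIC = round(AIC(mod), 2),
             adj_r2 = round(summary(mod)$adj.r.squared, 2),
             stringsAsFactors = FALSE)
}))

# Lowest AIC first.
selection <- selection[order(selection$AIC), ]
print(selection)
```

Collapsing the deparse() result guards against long formulas being split over several strings.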

Using a function parameter and passing it in to lm formula

I am trying to create a function that passes a parameter in as the dependent variable with the independent variables staying the same.
I have tried to use {{}}, but what I am after is something like the code below, if select(contains()) were possible inside a formula.
test_func <- function(dataframe, dependent){
model <- tidy(lm({{ dependent }} ~ . - select(contains("x")), data = dataframe))
return(model)
}
test_func(datasets::anscombe, x1)
The function should be callable as function(dataframe, dependent) and return a single model.
Use reformulate().
f <- function(d, y) lm(reformulate(names(d)[grep("x", names(d))], response=y), data=d)
f(datasets::anscombe, "y1")
# Call:
# lm(formula = reformulate(names(d)[grep("x", names(d))], response = y),
# data = d)
#
# Coefficients:
# (Intercept) x1 x2 x3 x4
# 4.33291 0.45073 NA NA -0.09873
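In the output above the Call: line shows the reformulate(...) expression rather than the formula it produced. If the expanded formula should be visible, one possible variation is to splice the formula into the call with bquote before evaluating it. This is a sketch under that assumption; f_explicit is a hypothetical name, not part of the original answer.

```r
f_explicit <- function(d, y) {
  # Predictors are all columns whose name contains "x".
  predictors <- grep("x", names(d), value = TRUE)
  fo <- reformulate(predictors, response = y)
  # .(fo) inserts the formula itself, so the stored call shows it in full.
  eval(bquote(lm(.(fo), data = d)))
}

fit <- f_explicit(datasets::anscombe, "y1")
print(fit$call)
```

The data argument still prints as d, because that is the name of the data frame inside the function.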

R lm: Create regressions dynamically

I have a set of dependent variables y1, y2, ...., a set of independent variables x1,x2,..., and a set of controls d1,d2,.... These are all inside a data.table, let's call it data.
I need to do something along the lines of
out1 <- lm(y1 ~ x1, data=data)
out2 <- lm(y1 ~ x1 + d1 + d2, data=data)
....
This is of course not very nice, so I was thinking about writing a list containing all these regressions, and then just iterating through that. Something along the lines of
myRegressions <- list('out1' = y1 ~ x1, 'out2' = y1 ~ x1 + d1 + d2)
output <- NULL
for (reg in myRegressions)
{
output[reg] <- lm(myRegressions[[reg]])
}
This of course won't work: I cannot construct the list as the syntax is invalid outside of lm(). What's a good approach here instead?
You can use paste0 and as.formula to generate formulas and then simply pass them to lm(), e.g.
regressors <- c("x1", "x1 + x2", "x1 + x2 + x3")
for (i in 1:length(regressors)) {
print(as.formula(paste0("y1", "~", regressors[i])))
}
This gives you the formulas (printed). Just store them in a list and iterate over that list with lapply like
lapply(stored_formulas, function(x) { lm(x, data=yourData) })
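Put together, the two steps above might look like the following sketch. It uses the built-in anscombe data in place of yourData, and the names stored_formulas, out1, out2, ... are chosen here purely for illustration.

```r
regressors <- c("x1", "x1 + x2", "x1 + x2 + x3")

# Step 1: build one formula per right-hand side and keep them in a named list.
stored_formulas <- lapply(regressors, function(rhs) {
  as.formula(paste0("y1 ~ ", rhs))
})
names(stored_formulas) <- paste0("out", seq_along(stored_formulas))

# Step 2: fit every stored formula against the same data frame.
fits <- lapply(stored_formulas, function(fo) lm(fo, data = anscombe))

print(names(fits))
print(formula(fits$out2))
```

Because lapply preserves names, each fitted model can be retrieved by the label given to its formula.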
Using the built in anscombe data frame try this:
formulas = list(y1 ~ x1, y2 ~ x2)
lapply(formulas, function(fo) do.call("lm", list(fo, data = quote(anscombe))))
giving:
[[1]]
Call:
lm(formula = y1 ~ x1, data = anscombe)
Coefficients:
(Intercept) x1
3.0001 0.5001
[[2]]
Call:
lm(formula = y2 ~ x2, data = anscombe)
Coefficients:
(Intercept) x2
3.001 0.500
Note that the Call: portion of the output is accurately produced which will be useful if there are many components to the output list.
Formulas can be quoted, i.e. given as character strings:
myReg <- list('out1' = "mpg ~ cyl")
lm(myReg[[1]],data=mtcars)
Call:
lm(formula = myReg[[1]], data = mtcars)
Coefficients:
(Intercept) cyl
37.885 -2.876

How can I pass an argument as a character to a function within a function?

I'm trying to create a series of models based on subsets of different categories in my data. Instead of creating a bunch of individual model objects, I'm using lapply() to create a list of models based on subsets of every level of my category factor, like so:
test.data <- data.frame(y=rnorm(100), x1=rnorm(100), x2=rnorm(100), category=factor(rep(c("A", "B"), 2)))
run.individual.models <- function(x) {
lm(y ~ x1 + x2, data=test.data, subset=(category==x))
}
individual.models <- lapply(levels(test.data$category), FUN=run.individual.models)
individual.models
# [[1]]
# Call:
# lm(formula = y ~ x1 + x2, data = test.data, subset = (category ==
# x))
# Coefficients:
# (Intercept) x1 x2
# 0.10852 -0.09329 0.11365
# ....
This works fantastically, except the model call shows subset = (category == x) instead of category == "A", etc. This makes it more difficult to use both for diagnostic purposes (it's hard to remember which model in the list corresponds to which category) and for functions like predict().
Is there a way to substitute the actual character value of x into the lm() call so that the model doesn't use the raw x in the call?
Along the lines of Explicit formula used in linear regression
Use bquote to construct the call
run.individual.models <- function(x) {
lmc <- bquote(lm(y ~ x1 + x2, data=test.data, subset=(category==.(x))))
eval(lmc)
}
individual.models <- lapply(levels(test.data$category), FUN=run.individual.models)
individual.models
[[1]]
Call:
lm(formula = y ~ x1 + x2, data = test.data, subset = (category ==
"A"))
Coefficients:
(Intercept) x1 x2
-0.08434 0.05881 0.07695
[[2]]
Call:
lm(formula = y ~ x1 + x2, data = test.data, subset = (category ==
"B"))
Coefficients:
(Intercept) x1 x2
0.1251 -0.1854 -0.1609
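To also make it easy to find which model belongs to which category, the list itself can be named. The sketch below repeats the bquote approach with that addition. It assumes category is stored as a factor (since R 4.0.0, data.frame() no longer converts strings to factors by default), and the seed and the name lv are introduced only for this example.

```r
set.seed(42)
test.data <- data.frame(y = rnorm(100), x1 = rnorm(100), x2 = rnorm(100),
                        category = factor(rep(c("A", "B"), 50)))

run.individual.models <- function(x) {
  # .(x) is replaced by the value of x, so the call records "A" or "B".
  lmc <- bquote(lm(y ~ x1 + x2, data = test.data, subset = (category == .(x))))
  eval(lmc)
}

lv <- levels(test.data$category)
# setNames() labels each list element with its category level.
individual.models <- setNames(lapply(lv, run.individual.models), lv)

print(individual.models$A$call)
```

With named elements, individual.models$B or individual.models[["B"]] retrieves a model directly.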

linear model when all occurrences of independent variables are NA

I'm looking for suggestions on how to deal with NA's in linear regressions when all occurrences of an independent/explanatory variable are NA (i.e. x3 below).
I know the obvious solution would be to exclude the independent/explanatory variable in question from the model, but I am looping through multiple regions and would prefer not to have a different functional form for each region.
Below is some sample data:
set.seed(23409)
n <- 100
time <- seq(1,n, 1)
x1 <- cumsum(runif(n))
y <- .8*x1 + rnorm(n, mean=0, sd=2)
x2 <- seq(1,n, 1)
x3 <- rep(NA, n)
df <- data.frame(y=y, time=time, x1=x1, x2=x2, x3=x3)
# Quick plot of data
library(ggplot2)
library(reshape2)
df.melt <-melt(df, id=c("time"))
p <- ggplot(df.melt, aes(x=time, y=value)) +
geom_line() + facet_grid(variable ~ .)
p
I have read the documentation for lm and tried various na.action settings without success:
lm(y~x1+x2+x3, data=df, singular.ok=TRUE)
lm(y~x1+x2+x3, data=df, na.action=na.omit)
lm(y~x1+x2+x3, data=df, na.action=na.exclude)
lm(y~x1+x2+x3, data=df, singular.ok=TRUE, na.exclude=na.omit)
lm(y~x1+x2+x3, data=df, singular.ok=TRUE, na.exclude=na.exclude)
Is there a way to get lm to run without error and simply return a coefficient for the explanatory variable in question that reflects its lack of explanatory power (i.e. either zero or NA)?
Here's one idea:
set.seed(23409)
n <- 100
time <- seq(1,n, 1)
x1 <- cumsum(runif(n))
y <- .8*x1 + rnorm(n, mean=0, sd=2)
x2 <- seq(1,n, 1)
x3 <- rep(NA, n)
df <- data.frame(y=y, time=time, x1=x1, x2=x2, x3=x3)
replaceNA<-function(x){
if(all(is.na(x))){
rep(0,length(x))
} else x
}
lm(y~x1+x2+x3, data= data.frame(lapply(df,replaceNA)))
Call:
lm(formula = y ~ x1 + x2 + x3, data = data.frame(lapply(df, replaceNA)))
Coefficients:
(Intercept) x1 x2 x3
0.05467 1.01133 -0.10613 NA
lm(y~x1+x2, data=df)
Call:
lm(formula = y ~ x1 + x2, data = df)
Coefficients:
(Intercept) x1 x2
0.05467 1.01133 -0.10613
So you replace the variables which contain only NAs with variables which contain only 0s. You get the coefficient value NA, but all the relevant parts of the model fits are the same (except the qr decomposition, but if information about that is needed, it can be easily modified). Note that the component summary(fit)$alias (see ?alias) might be useful.
This seems to relate to your other question: Replace lm coefficients in [r]
You won't be able to include a column with all NA values. It does strange things to model.matrix
x1 <- 1:5
x2 <- rep(NA,5)
model.matrix(~x1+x2)
(Intercept) x1 x2TRUE
attr(,"assign")
[1] 0 1 2
attr(,"contrasts")
attr(,"contrasts")$x2
[1] "contr.treatment"
So your alternative is to programmatically create the model formula based on the data.
Something like...
make_formula <- function(variables, data, response = 'y'){
if(missing(data)){stop('data not specified')}
using <- Filter(variables,f= function(i) !all(is.na(data[[i]])))
deparse(reformulate(using, response))
}
variables <- c('x1','x2','x3')
make_formula(variables, data =df)
[1] "y ~ x1 + x2"
I've used deparse to return a character string so that there are no environment issues from creating the formula within the function. lm can happily take a character string which is a valid formula.
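As a rough end-to-end check, the generated string can be passed straight to lm. The sketch below restates make_formula with Filter's arguments in their usual order and rebuilds data similar to the question's. The variable names keep, fo and fit are assumptions for this example.

```r
make_formula <- function(variables, data, response = "y") {
  if (missing(data)) stop("data not specified")
  # Keep only the variables that have at least one non-missing value.
  keep <- Filter(function(v) !all(is.na(data[[v]])), variables)
  # Note: deparse() may split very long formulas into several strings.
  deparse(reformulate(keep, response))
}

set.seed(23409)
n <- 100
x1 <- cumsum(runif(n))
df <- data.frame(y = 0.8 * x1 + rnorm(n, mean = 0, sd = 2),
                 x1 = x1, x2 = seq(1, n, 1), x3 = rep(NA, n))

fo <- make_formula(c("x1", "x2", "x3"), data = df)
fit <- lm(fo, data = df)   # x3 is all NA, so it never enters the model

print(fo)
print(names(coef(fit)))
```

In a loop over regions, the same call can be repeated per region so that each model only includes the variables that actually have data there.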

Resources