Create a loop for model selection in R

I am trying to test a whole bunch of different models easily and compare AIC / R-sq values to select the right one. I am having some trouble saving things how I want to between lists and data frames.
data frame I am going to model:
set.seed(1)
df <- data.frame(response=runif(50,min=50,max=100),
var1 = sample(1:20,50,replace=T),
var2 = sample(40:60,50,replace = T))
list of formulas to test:
formulas <- list( response ~ NULL,
response ~ var1,
response ~ var2,
response ~ var1 + var2,
response ~ var1 * var2)
So, what I want to do is create a loop that will model all of these formulas, extract Formula, AIC, and R-sq values into a table, and let me sort it to find the best one. The problem I'm having is I can't extract the formula name as "Response ~ var1", instead, it keeps coming out as "Response" "~" "var1" if I try to extract as a character object. Or, if I extract as a list (like below), then it comes out like this:
[[1]]
response ~ NULL
[[2]]
[1] 415.89
[[3]]
[1] 0
And I can't easily plug those list elements into a data frame. Here is what I tried:
selection <- matrix(ncol=3)
colnames(selection) <- c("formula","AIC","R2") # create a df to store results in
for ( i in 1:length(formulas)){
mod <- lm( formula = formulas[[i]], data= df)
mod_vals <- c(extract(formulas[[i]]),
round(AIC(mod),2),
round(summary(mod)$adj.r.squared,2)
)
selection[i,] <- mod_vals[]
}
Any ideas? I don't have to keep it as a for loop either, I just want a way to test a long list of models together.
Thanks.

You could use lapply to loop over the formulas, extract the relevant statistics from each model into a one-row data frame, and bind those rows together. format() turns each formula into a single string.
do.call(rbind, lapply(formulas, function(x) {
mod <- lm(x, data= df)
data.frame(formula = format(x),
AIC = round(AIC(mod),2),
r_square = round(summary(mod)$adj.r.squared,2))
}))
# formula AIC r_square
#1 response ~ NULL 405.98 0.00
#2 response ~ var1 407.54 -0.01
#3 response ~ var2 407.90 -0.02
#4 response ~ var1 + var2 409.50 -0.03
#5 response ~ var1 * var2 410.36 -0.03
Or with purrr
purrr::map_df(formulas, ~{
mod <- lm(.x, data= df)
data.frame(formula = format(.x),
AIC = round(AIC(mod),2),
r_square = round(summary(mod)$adj.r.squared,2))
})
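The question also asks for a way to sort the table to find the best model. A minimal base-R sketch of that step follows, assuming lower AIC is the selection criterion. The names fits, formula_text and selection are illustrative, and the intercept-only model is written as response ~ 1 here rather than response ~ NULL. Note that as.character() on a formula returns its parts separately ("~", "response", "var1"), which is why the formula is deparsed instead.

```r
set.seed(1)
df <- data.frame(response = runif(50, min = 50, max = 100),
                 var1 = sample(1:20, 50, replace = TRUE),
                 var2 = sample(40:60, 50, replace = TRUE))

formulas <- list(response ~ 1,            # intercept-only model
                 response ~ var1,
                 response ~ var2,
                 response ~ var1 + var2,
                 response ~ var1 * var2)

# Fit every model once and keep the fits for later inspection
fits <- lapply(formulas, function(f) lm(f, data = df))

# Collapse each formula to one string; deparse() can return several
# pieces for a long formula, so paste them back together
formula_text <- vapply(formulas,
                       function(f) paste(deparse(f), collapse = " "),
                       character(1))

selection <- data.frame(
  formula = formula_text,
  AIC     = vapply(fits, AIC, numeric(1)),
  adj_r2  = vapply(fits, function(m) summary(m)$adj.r.squared, numeric(1))
)

# Lowest AIC first
selection <- selection[order(selection$AIC), ]
print(selection)
```

Keeping the fitted models in fits means the best one can be inspected afterwards without refitting.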

Related

R: subsetting within a function

Suppose I have a data frame in the environment, mydata, with three columns, A, B, C.
mydata = data.frame(A=c(1,2,3),
B=c(4,5,6),
C=c(7,8,9))
I can create a linear model with
lm(C ~ A, data=mydata)
I want a function to generalize this, to regress B or C on A, given just the name of the column, i.e.,
f = function(x){
lm(x ~ A, data=mydata)
}
f(B)
f(C)
or
g = function(x){
lm(mydata$x ~ mydata$A)
}
g(B)
g(C)
These solutions don't work. I know there is something wrong with the evaluation, and I have tried permutations of quo() and enquo() and !!, but no success.
This is a simplified example, but the idea is, when I have dozens of similar models to build, each fairly complicated, with only one variable changing, I want to do so without repeating the entire formula each time.
If we want to pass an unquoted column name, an option is {{}} from the tidyverse. With select, the function can take both a string and an unquoted name.
library(dplyr)
printcol2 <- function(data, x) {
data %>%
select({{x}})
}
printcol2(mydata, A)
# A
#1 1
#2 2
#3 3
printcol2(mydata, 'A')
# A
#1 1
#2 2
#3 3
If the OP wants to pass an unquoted column name through to lm, build the formula from the captured name:
f1 <- function(x){
rsp <- deparse(substitute(x))
fmla <- reformulate("A", response = rsp)
out <- lm(fmla, data=mydata)
out$call <- as.symbol(paste0("lm(", deparse(fmla), ", data = mydata)"))
out
}
f1(B)
#Call:
#lm(B ~ A, data = mydata)
#Coefficients:
#(Intercept) A
# 3 1
f1(C)
#Call:
#lm(C ~ A, data = mydata)
#Coefficients:
#(Intercept) A
# 6 1
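As an alternative to overwriting out$call by hand, the formula can be spliced into the call before it is evaluated. This is only a sketch under the same assumptions as f1 (the predictor is always A and the data frame is a global called mydata); the name fit_on_A is made up for illustration.

```r
mydata <- data.frame(A = c(1, 2, 3),
                     B = c(4, 5, 6),
                     C = c(7, 8, 9))

fit_on_A <- function(x) {
  # Capture the unquoted argument as a string, e.g. "B"
  rsp <- deparse(substitute(x))
  # Build the formula B ~ A from strings
  fmla <- reformulate("A", response = rsp)
  # bquote() inserts the formula itself into the lm() call, so the
  # stored call records the actual formula rather than the name fmla
  eval(bquote(lm(.(fmla), data = mydata)))
}

fit_B <- fit_on_A(B)
fit_C <- fit_on_A(C)
print(fit_B)
```

Because the call is built first and evaluated afterwards, functions that reuse the stored call, such as update(), see the real formula.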
Maybe you are looking for deparse(substitute(.)), which converts an unquoted argument such as a column name into a string.
f = function(x, data = mydata){
y <- deparse(substitute(x))
fmla <- paste(y, 'Species', sep = '~')
lm(as.formula(fmla), data = data)
}
mydata <- iris
f(Sepal.Length)
#
#Call:
#lm(formula = as.formula(fmla), data = data)
#
#Coefficients:
# (Intercept) Speciesversicolor Speciesvirginica
# 5.006 0.930 1.582
f(Petal.Width)
#
#Call:
#lm(formula = as.formula(fmla), data = data)
#
#Coefficients:
# (Intercept) Speciesversicolor Speciesvirginica
# 0.246 1.080 1.780
I think generally, you might be looking for:
printcol <- function(x){
print(x)
}
printcol(mydata$A)
This doesn't involve any fancy evaluation, you just need to specify the variable you'd like to subset in your function call.
This gives us:
[1] 1 2 3
Note that you're only printing the vector A, and not actually subsetting column A from mydata.

Using a function parameter and passing it in to lm formula

I am trying to create a function that passes a parameter in as the dependent variable with the independent variables staying the same.
I have tried to use {{}}. What I am after is something like the code below, if select(contains()) were possible inside a formula.
test_func <- function(dataframe, dependent){
model <- tidy(lm({{ dependent }} ~ . - select(contains("x")), data = dataframe))
return(model)
}
test_func(datasets::anscombe, x1)
The function should have the signature function(dataframe, dependent) and return a single model.
Use reformulate().
f <- function(d, y) lm(reformulate(names(d)[grep("x", names(d))], response=y), data=d)
f(datasets::anscombe, "y1")
# Call:
# lm(formula = reformulate(names(d)[grep("x", names(d))], response = y),
# data = d)
#
# Coefficients:
# (Intercept) x1 x2 x3 x4
# 4.33291 0.45073 NA NA -0.09873
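To see what reformulate() builds before it reaches lm, it can help to construct the formula as a separate step. The sketch below assumes the predictors are the columns whose names start with "x"; the anchored pattern "^x" and the names predictors and fmla are illustrative choices, not part of the answer above.

```r
d <- datasets::anscombe

# Columns whose names start with "x" become the predictors
predictors <- grep("^x", names(d), value = TRUE)

# reformulate() assembles response ~ term1 + term2 + ... from strings
fmla <- reformulate(predictors, response = "y1")
print(fmla)

fit <- lm(fmla, data = d)

# Base-R coefficient table, similar in spirit to what broom::tidy() tabulates
coef_table <- coef(summary(fit))
print(coef_table)
```

Since x1, x2 and x3 are identical in anscombe, lm reports NA for the redundant terms, as in the output above.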

Is there a way to check the regression on all variables(dependent) in the data

Is there a way to make every variable the target variable in turn and check the regression results against the other variables as independent variables? For example:
df
Date Var1 Var2 Var3
27/9/2019 12 45 59
28/9/2019 34 43 54
29/9/2019 45 23 40
Usually, if I want to see the relationship between Var1 and Var2, I use the code below:
lm(Var1 ~ Var2, data = myData)
Now suppose I want results for all variables (Var1, Var2 and Var3): in the first instance Var1 is the dependent variable and the rest (Var2 and Var3) are independent; in the second instance Var2 is the dependent variable and the rest (Var1 and Var3) are independent, and so on. Is there a way to do this?
You could use something like this to get the formulas you need:
vars <- names(df)[-1] # we can eliminate the dates
forms <- lapply(1:length(vars),
function(i) formula(paste(vars[i], "~", paste(vars[-i], collapse = "+")))
)
Output:
[[1]]
Var1 ~ Var2 + Var3
<environment: 0x7fdaaa63abd0>
[[2]]
Var2 ~ Var1 + Var3
<environment: 0x7fdaaa63c508>
[[3]]
Var3 ~ Var1 + Var2
<environment: 0x7fdaaec0d2a8>
Then you just need to pass each formula into lm in lapply:
mods <- lapply(forms, lm, data = df)
Output:
[[1]]
Call:
FUN(formula = X[[i]], data = ..1)
Coefficients:
(Intercept) Var2 Var3
196.403 3.514 -5.806
[[2]]
Call:
FUN(formula = X[[i]], data = ..1)
Coefficients:
(Intercept) Var1 Var3
-55.8933 0.2846 1.6522
[[3]]
Call:
FUN(formula = X[[i]], data = ..1)
Coefficients:
(Intercept) Var1 Var2
33.8301 -0.1722 0.6053
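The list above is unnamed and every Call line reads FUN(formula = X[[i]], ...), so it is hard to tell the models apart. A small sketch of one way around that is to name each fit after its response and pull out a summary statistic per model. It uses simulated data because the three-row example cannot support a three-parameter fit; the column names and the choice of R-squared are assumptions for illustration.

```r
set.seed(1234)
# Simulated stand-in for the real data: three numeric columns
df1 <- as.data.frame(replicate(3, sample(10:99, 100, TRUE)))
names(df1) <- paste0("Var", 1:3)

vars <- names(df1)

# One formula per column: that column as response, the rest as predictors
forms <- lapply(vars, function(v) reformulate(setdiff(vars, v), response = v))

# Fit each model and name the list after the response variable
mods <- setNames(lapply(forms, function(f) lm(f, data = df1)), vars)

# R-squared of each model, labelled by response
r2 <- sapply(mods, function(m) summary(m)$r.squared)
print(r2)
```

Individual models can then be retrieved by name, for example mods$Var2.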
If you want to regress Var1 against all other variables you can do the following :
lm(Var1 ~. , data = myData)
If you just want to select more than one variable, then you can also use:
lm(Var1 ~ Var2 + Var3, data = myData)
The following is based on the answers to these questions: 1, 2 and 3. See the explanations therein.
The main difference is that it loops (lapply) through the columns of the input data set and constructs a full model with each column as the response and all others as predictors. It then dredges each full model fit.
library(MuMIn)
model_list <- lapply(names(df1), function(resp){
fmla <- as.formula(paste(resp, "~ ."))
print(fmla)
full <- lm(fmla, data = df1, na.action = na.fail)
dredge(full)
})
model_list
Test data creation code.
set.seed(1234)
df1 <- replicate(3, sample(10:99, 100, TRUE))
df1 <- as.data.frame(df1)
names(df1) <- paste0("Var", 1:3)

create function to name objects in R using existing character vector

I am trying to master building functions in R. Say I have a data frame or data.table,
dummy <- df(y, x, a, b, who)
Where the vector "who" is like so,
who <- c("Joseph", "Kim", "Billy")
I would like to use the character vector to perform various regression models and name the outputs and their summary statistics. So for the entry, "Billy" in the vector above, I would like something like this:
function() {
ols.reg.Billy <- lm(y ~ x + a + b, data = dummy[dummy$who == "Billy"])
dw.Billy <- dwtest(ols.reg.Billy)
output.Billy <- list(ols.reg.Billy, dw.Billy)
return(output.Billy)
}
But for 500 different entries of the who vector above.
Is there some way to do this? What's the most efficient way? I keep getting errors and I feel I am seriously missing something. Is there some way to use paste?
If this doesn't solve it, please provide a reproducible example. It makes it easier to help you.
library(lmtest)
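outputs <- lapply(who, function(name) {
# the comma selects rows; without it a data frame would be indexed by column
ols.reg <- lm(y ~ x + a + b, data = dummy[dummy$who == name, ])
dw <- dwtest(ols.reg)
# return the objects themselves, named e.g. ols.reg_Billy and dw_Billy
setNames(list(ols.reg, dw), paste(c("ols.reg", "dw"), name, sep = "_"))
})
names(outputs) <- who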
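Since the question has no reproducible data, here is a self-contained sketch of the same per-person idea using split() in base R. The simulated dummy data frame, its coefficients and the names fits and coefs are all made up for illustration, and the Durbin-Watson step is left out so the block runs without lmtest.

```r
set.seed(42)
who <- c("Joseph", "Kim", "Billy")

# Simulated data: 20 rows per person
dummy <- data.frame(who = rep(who, each = 20),
                    x = rnorm(60),
                    a = rnorm(60),
                    b = rnorm(60))
dummy$y <- 1 + 2 * dummy$x - dummy$a + 0.5 * dummy$b + rnorm(60)

# split() gives one data frame per person, named by person,
# so the list of fits inherits those names
fits <- lapply(split(dummy, dummy$who),
               function(d) lm(y ~ x + a + b, data = d))

names(fits)  # groups come back in alphabetical order

# One row of coefficients per person
coefs <- t(sapply(fits, coef))
print(coefs)
```

A named list like this scales to 500 entries without creating 500 separate objects; fits[["Billy"]] plays the role of ols.reg.Billy.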
1) Map Using the built in CO2 data set suppose we wish to regress uptake on conc separately for each Type. Note that this names the components by the Type.
Map(function(x) lm(uptake ~ conc, CO2, subset = Type == x), levels(CO2$Type))
giving this two component list (one component for each level of Type -- Quebec and Mississippi) -- continued after output.
$Quebec
Call:
lm(formula = uptake ~ conc, data = CO2, subset = Type == x)
Coefficients:
(Intercept) conc
23.50304 0.02308
$Mississippi
Call:
lm(formula = uptake ~ conc, data = CO2, subset = Type == x)
Coefficients:
(Intercept) conc
15.49754 0.01238
2) Map/do.call We may wish to not only name the components using the Type but also have x substituted with the actual Type in the Call: line of the output. In that case use do.call to invoke lm and use quote to ensure that the name of the data frame rather than its value is displayed and use bquote to perform the substitution for x.
reg <- function(x) {
do.call("lm", list(uptake ~ conc, quote(CO2), subset = bquote(Type == .(x))))
}
Map(reg, levels(CO2$Type))
giving:
$Quebec
Call:
lm(formula = uptake ~ conc, data = CO2, subset = Type == "Quebec")
Coefficients:
(Intercept) conc
23.50304 0.02308
$Mississippi
Call:
lm(formula = uptake ~ conc, data = CO2, subset = Type == "Mississippi")
Coefficients:
(Intercept) conc
15.49754 0.01238
3) lmList The nlme package has lmList for doing this:
library(nlme)
lmList(uptake ~ conc | Type, CO2, pool = FALSE)
giving:
Call:
Model: uptake ~ conc | Type
Data: CO2
Coefficients:
(Intercept) conc
Quebec 23.50304 0.02308005
Mississippi 15.49754 0.01238113

Use paste within lm to conserve variable names

I am doing several linear regressions and I am looping through the variables that I want to use in the model. I want to present the output from R and I want my variables names to appear in the summary of lm.
If I have:
var1 <- "nice_name_var1"
var2 <- "nice_name_var2"
depvar <- "nice_name_dep_var"
That I know are present in my data frame my.df
I cannot do this:
lm(paste(depvar,sep="") ~ paste(var1,sep="") + paste(var2,sep=""),
data=my.df)
I know that I could do this, and this works, but then the summary output doesn't have the names of the variables that I want:
lm(my.df[,paste(depvar,sep="")] ~ my.df[,paste(var1,sep="")] + my.df[,paste(var2,sep="")])
1) paste Using the built in anscombe data frame as an example:
depvar <- "y1"
var1 <- "x1"
var2 <- "x2"
fo <- as.formula(paste(depvar, "~", var1, "+", var2))
do.call("lm", list(fo, quote(anscombe)))
giving this output which does show the variable names x1, x2 and y1:
Call:
lm(formula = y1 ~ x1 + x2, data = anscombe)
Coefficients:
(Intercept) x1 x2
3.0001 0.5001 NA
lm will accept a character string in place of a formula so as.formula can be omitted if it is ok to have quotes shown around it in the output.
2) model.frame/terms Another approach is:
mf <- model.frame(anscombe[c(depvar, var1, var2)])
do.call("lm", list(terms(mf), quote(anscombe)))
giving similar output.
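Since the question mentions looping through variables, the first approach can be wrapped in lapply. This is a sketch only: the choice of y1 and y2 as dependent variables, x1 and x4 as predictors, and the name fits are assumptions for illustration, and reformulate() stands in for the paste/as.formula step.

```r
depvars <- c("y1", "y2")

fits <- lapply(depvars, function(dv) {
  # Build e.g. y1 ~ x1 + x4 from the variable names
  fo <- reformulate(c("x1", "x4"), response = dv)
  # do.call with quote(anscombe) keeps the data frame's name in the call
  do.call("lm", list(fo, quote(anscombe)))
})
names(fits) <- depvars

# Each summary shows the real variable names in its Call and coefficients
print(fits$y1)
print(formula(fits$y2))
```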

Resources