How to test all subsets of predictor variables in R

I would like to programmatically build GLMs in R, similar to what's described here (How to build and test multiple models in R), except testing all possible subsets of predictor variables instead.
So, for a dataset like this, with outcome variable z:
data <- data.frame("z" = rnorm(20, 15, 3),
"a" = rnorm(20, 20, 3),
"b" = rnorm(20, 25, 3),
"c" = rnorm(20, 5, 1))
is there a way to automate building the models:
m1 <- glm(z ~ a, data = data)
m2 <- glm(z ~ b, data = data)
m3 <- glm(z ~ c, data = data)
m4 <- glm(z ~ a + b, data = data)
m5 <- glm(z ~ a + c, data = data)
m6 <- glm(z ~ b + c, data = data)
m7 <- glm(z ~ a + b + c, data = data)
I know the dredge function of the MuMIn package can do this, but I got an error saying that I was including too many variables, so I'm looking for ways to do this independently of dredge. I've tried expand.grid(), combn(), map() and lapply() variants of answers I've found on Stack Overflow and can't seem to piece this together. Ideally, model output, including BIC, would be stored in a sortable data frame.
Any help would be greatly appreciated!!

Assuming you have taken note of @Maurits Evers' comment, you can achieve what you want by combining lapply and combn:
cols <- names(data)[-1]
lapply(seq_along(cols), function(x) combn(cols, x, function(y)
glm(reformulate(y, "z"), data = data), simplify = FALSE))
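The question also asks for the BIC of each model in a sortable data frame. A minimal sketch of one way to get there, assuming the same toy data as in the question (the seed and the names formulas, fits and results are only illustrative):

```r
set.seed(1)  # hypothetical seed, only so the sketch is reproducible
data <- data.frame(z = rnorm(20, 15, 3),
                   a = rnorm(20, 20, 3),
                   b = rnorm(20, 25, 3),
                   c = rnorm(20, 5, 1))

cols <- names(data)[-1]

# One formula per non-empty subset of the predictors,
# flattened into a single list
formulas <- do.call(c, lapply(seq_along(cols), function(k)
  combn(cols, k, function(v) reformulate(v, response = "z"),
        simplify = FALSE)))

# Fit every model, then collect the formula text and BIC
fits <- lapply(formulas, glm, data = data)
results <- data.frame(
  model = vapply(formulas, function(f) paste(deparse(f), collapse = " "),
                 character(1)),
  BIC   = vapply(fits, BIC, numeric(1)),
  stringsAsFactors = FALSE)

# Sort so the lowest-BIC model comes first
results <- results[order(results$BIC), ]
results
```

Keep in mind that p predictors give 2^p - 1 non-empty subsets, so this gets expensive quickly as the number of variables grows.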

Related

Iterate over list and append in order to do a regression in R

I'm sure this kind of question already exists somewhere, but I couldn't find it. I have the variables a, b, c, d and I want to write a loop that adds one variable at a time and runs the regression again after each addition:
lm(Y ~ a, data = data), then
lm(Y ~ a + b, data = data), then
lm(Y ~ a + b + c, data = data) etc.
How would you do this?
Using paste and as.formula, example using mtcars dataset:
myFits <- lapply(2:ncol(mtcars), function(i){
x <- as.formula(paste("mpg",
paste(colnames(mtcars)[2:i], collapse = "+"),
sep = "~"))
lm(formula = x, data = mtcars)
})
Note: this looks like a duplicate post. I have seen a better solution for this type of question but cannot find it at the moment.
You could do this with a lapply / reformulate approach.
formulae <- lapply(ivars, function(x) reformulate(x, response="Y"))
lapply(formulae, function(x) summary(do.call("lm", list(x, quote(dat)))))
Data
set.seed(42)
dat <- data.frame(matrix(rnorm(80), 20, 4, dimnames=list(NULL, c("Y", letters[1:3]))))
ivars <- sapply(1:3, function(x) letters[1:x]) # create an example list of growing sets of independent variables
vars = c('a', 'b', 'c', 'd')
# might want to use a subset of names(data) instead of
# manually typing the names
reg_list = list()
for (i in seq_along(vars)) {
my_formula = as.formula(sprintf('Y ~ %s', paste(vars[1:i], collapse = " + ")))
reg_list[[i]] = lm(my_formula, data = data)
}
You can then inspect an individual result with, e.g., summary(reg_list[[2]]) (for the 2nd one).
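For what it's worth, the growing sets of predictors can also be built without an explicit loop by using Reduce with accumulate = TRUE. A rough sketch on mtcars, where the choice of four predictors and the names nested, fits and adj_r2 are just for illustration:

```r
vars <- c("cyl", "disp", "hp", "wt")  # hypothetical choice of predictors

# Growing predictor sets: "cyl", c("cyl", "disp"), ...
nested <- Reduce(c, vars, accumulate = TRUE)

# One model per set, each adding the next variable
fits <- lapply(nested, function(v) {
  lm(reformulate(v, response = "mpg"), data = mtcars)
})

# Compare the nested models, e.g. by adjusted R-squared
adj_r2 <- vapply(fits, function(m) summary(m)$adj.r.squared, numeric(1))
adj_r2
```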

returning summaries of linear models using lapply

Can anyone explain why the following lapply call does not work, and suggest a possible alternative that does not involve a for-loop?
DV1 <- rnorm(20, 10, 3)
DV2 <- rnorm(20, 8, 3)
DV3 <- rnorm(20, 9, 3)
group <- rep(c("A", "B"), each = 2, length.out = 20)
df <- data.frame(group, DV1, DV2, DV3)
I would like to perform analyses on multiple outcome variables. In this example I created a list of outcome variables to pass into the lm function. My understanding is that lapply applies a function to a list and returns a list. So why can't it give me a list of three summary(lm()) objects, each of which is a list? Is there any way to do what I am trying to do with one of the apply family of functions?
cols <- list("DV1", "DV2", "DV3")
lapply(cols, function (x) summary(lm(x ~ group, data = df)))
There is no need for a loop. lm handles multiple dependent variables elegantly and (more important) efficiently:
summary(lm(cbind(DV1, DV2, DV3) ~ group, data = df))
summary(lm(sprintf("cbind(%s) ~ group", paste(cols, collapse = ",")), data = df))
Try this:
forms <- paste(cols, ' ~ group')
lapply(forms, lm, data = df)
Or if you want to just print summaries rather than save output:
lapply(forms, function(x) summary(lm(x, data = df)))
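As for why the original call fails: as far as I can tell, writing x ~ group does not substitute the value of x into the formula. The formula literally refers to a variable called x, and when lm looks it up it finds the length-one character string (e.g. "DV1") rather than the column of that name. Building a proper formula object avoids this. A small sketch using reformulate, with a seed and a named result list (summaries) added only for illustration:

```r
set.seed(1)  # hypothetical seed for reproducibility
df <- data.frame(group = rep(c("A", "B"), each = 2, length.out = 20),
                 DV1 = rnorm(20, 10, 3),
                 DV2 = rnorm(20, 8, 3),
                 DV3 = rnorm(20, 9, 3))

cols <- c("DV1", "DV2", "DV3")

# Build a real formula for each outcome, then fit and summarise;
# setNames() makes the result list carry the outcome names
summaries <- lapply(setNames(cols, cols), function(dv) {
  summary(lm(reformulate("group", response = dv), data = df))
})

# Look up one outcome by name
coef(summaries$DV2)
```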

Extracting residual values from lavaan list matrices in R

I am using lavaan package and my intention is to get my model residuals as dataframes for further use. I run several models that have grouping variables. Here's the basic workflow:
require(lavaan)
df <- data.frame(
y1 = sample(1:100),
y2 = sample(1:100),
x1 = sample(1:100),
x2 = sample(1:100),
x3 = sample(1:100),
grpvar = sample(c("grp1","grp2"), 100, replace = TRUE))
semModel <- list(length = 2)
semModel[1] <- 'y1 ~ c(a,b)*x1 + c(a,b)*x2'
semModel[2] <- 'y1 ~ c(a,b)*x1
y2 ~ c(a,b)*x2 + c(a,b)*x3'
funEstim <- function(model){
sem(model, data = df, group = "grpvar", estimator = "MLM")}
fits <- lapply(semModel, funEstim)
residuals <- lapply(fits, function(x) resid(x, "obs"))
Now the resulting residuals object bugs me. It is a list of matrices nested a few levels deep. How do I get each of the matrices as a separate data frame without any hardcoding? I don't want to unlist them completely, as that would lose some information.
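You can use list2env along with unlist to make grp1, grp2, length.grp1, and length.grp2 directly available in the global environment. (The length. prefix appears because list(length = 2) creates a list with one element named length, not an empty list of length two, so the first model inherits that name.)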
list2env(unlist(residuals, recursive=FALSE), envir=.GlobalEnv)
ls()
#[1] "df" "fits" "funEstim" "grp1" "grp2"
#[6] "length.grp1" "length.grp2" "residuals" "semModel"
But they won't be data frames. For that you could convert them to data frames before calling list2env:
df.list <- lapply(unlist(residuals, recursive=FALSE), data.frame)
list2env(df.list, envir=.GlobalEnv)
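If you would rather not fill the global environment, the same idea works with named models and a separate environment. A sketch that does not need lavaan: residuals_demo is a hypothetical stand-in for the nested residuals list, and the names m1, m2 and res_env are made up for the example.

```r
# Hypothetical stand-in for the lavaan output: two models, two groups each
residuals_demo <- list(
  m1 = list(grp1 = matrix(1:4, 2),  grp2 = matrix(5:8, 2)),
  m2 = list(grp1 = matrix(9:12, 2), grp2 = matrix(13:16, 2)))

# Flatten one level only, so each matrix stays intact;
# names are combined as "<model>.<group>"
flat <- unlist(residuals_demo, recursive = FALSE)
names(flat)

# Convert each matrix to a data frame and store them in their own environment
df.list <- lapply(flat, as.data.frame)
res_env <- list2env(df.list, envir = new.env())
ls(res_env)
```

Naming the models up front means the flattened names say which model and group each data frame came from.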

List Indexing in R over a loop

I'm new to using lists in R and am trying to run a loop over various data frames that stores multiple models for each frame. I would like the models that correspond to a given data frame within the first index of the list; e.g. [[i]][1], [[i]][2]. The following example overwrites the list:
f1 <- data.frame(x = 1:6, y = sample(1:100, 6, replace = TRUE), z = rnorm(6))
f2 <- data.frame(x = seq(6,11), y = sample(1:100, 6, replace = TRUE), z = rnorm(6))
data.frames <- list(f1,f2)
fit <- list()
for(i in 1:length(data.frames)){
fit[[i]] <- lm(y ~ x, data = data.frames[[i]])
fit[[i]] <- lm(y ~ x + z, data = data.frames[[i]])
}
Any idea how to set up the list or the indexing in the loop such that it generates an output that has the two models for the first frame referenced as [[1]][1] and [[1]][2] and the second frame as [[2]][1] and [[2]][2]? Thanks for any and all help.
Calculate both models in a single lapply call applied to each part of the data.frames list:
lapply(data.frames, function(i) {
list(lm(y ~ x, data = i),
lm(y ~ x + z, data=i))
})
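To show how the result is indexed, here is a self-contained sketch of the same approach. The list names (f1, f2, simple, full) and the seed are assumptions added for readability. Note that [[1]][2] returns a one-element list, whereas [[1]][[2]] returns the model itself.

```r
set.seed(1)  # hypothetical seed for reproducibility
f1 <- data.frame(x = 1:6,  y = sample(1:100, 6, replace = TRUE), z = rnorm(6))
f2 <- data.frame(x = 6:11, y = sample(1:100, 6, replace = TRUE), z = rnorm(6))
data.frames <- list(f1 = f1, f2 = f2)

# One inner list of two models per data frame
fit <- lapply(data.frames, function(d) {
  list(simple = lm(y ~ x, data = d),
       full   = lm(y ~ x + z, data = d))
})

# Second model for the first data frame, by position and by name
coef(fit[[1]][[2]])
coef(fit$f1$full)
```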

How to construct a big regular formula for a model in R?

I am trying to create a model to predict "y" from data "D" that contains predictors x1 to x100 and 200 other variables. Since the x columns are not stored consecutively, I can't select them by column position.
I can't use ctree(y ~ ., data = D) because of the other variables. Is there a way to refer to x1 through x100 in the model
instead of writing out a very long formula?
ctree(y ~ x1 + x2 + x..... x100)
Some recommendations would be appreciated.
Two more. The simplest in my mind is to subset the data:
ctree(y ~ ., data = D[, c("y", paste0("x", 1:100))])
Or a more functional approach to building dynamic formulas:
ctree(reformulate(paste0("x", 1:100), "y"), data = D)
Construct your formula as a text string, and convert it with as.formula.
vars <- names(D)[1:100] # or wherever your desired predictors are
fm <- paste("y ~", paste(vars, collapse="+"))
fm <- as.formula(fm)
ctree(fm, data=D, ...)
You can use this:
fmla = as.formula(paste("y", paste0("x", 1:100, collapse=" + "), sep=" ~ "))
ctree(fmla, data = D)
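If the predictor names follow a pattern but the columns are scattered, they can also be picked up by name rather than typed out. A small sketch on made-up data: D here is a hypothetical toy data frame with five x columns mixed in with unwanted w columns, and lm is used only as a stand-in so the example runs without the package that provides ctree.

```r
set.seed(1)  # hypothetical seed for reproducibility
# Toy data: x1..x5 are the wanted predictors, w1..w3 are not
D <- data.frame(y = rnorm(30),
                x1 = rnorm(30), w1 = rnorm(30), x2 = rnorm(30),
                x3 = rnorm(30), w2 = rnorm(30), x4 = rnorm(30),
                w3 = rnorm(30), x5 = rnorm(30))

# Select predictors by name pattern instead of by position
preds <- grep("^x[0-9]+$", names(D), value = TRUE)
fmla  <- reformulate(preds, response = "y")
fmla

# Stand-in for ctree(fmla, data = D)
fit <- lm(fmla, data = D)
```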
