returning summaries of linear models using lapply - r

Can anyone explain why the following lapply call does not work, and suggest a possible alternative that does not involve a for loop?
DV1 <- rnorm(20, 10, 3)
DV2 <- rnorm(20, 8, 3)
DV3 <- rnorm(20, 9, 3)
group <- rep(c("A", "B"), each = 2, length.out = 20)
df <- data.frame(group, DV1, DV2, DV3)
I would like to perform analyses on multiple outcome variables. In this example I created a list of outcome variables to pass into an lm call. My understanding is that lapply applies a function to each element of a list and returns a list. So why can't it give me a list of three summary(lm()) objects, each of which is a list? Is there any way to do what I am trying to do with one of the apply family of functions?
cols <- list("DV1", "DV2", "DV3")
lapply(cols, function (x) summary(lm(x ~ group, data = df)))

There is no need for a loop. lm handles multiple dependent variables elegantly and (more importantly) efficiently:
summary(lm(cbind(DV1, DV2, DV3) ~ group, data = df))
summary(lm(sprintf("cbind(%s) ~ group", paste(cols, collapse = ",")), data = df))
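As a minimal sketch of what the multivariate call returns (the seed and the object names fit and s are illustrative, not from the original post): coef() gives one column of coefficients per outcome, and summary() gives a list with one entry per outcome.

```r
set.seed(123)  # illustrative seed, only for reproducibility
df <- data.frame(
  group = rep(c("A", "B"), each = 2, length.out = 20),
  DV1 = rnorm(20, 10, 3),
  DV2 = rnorm(20, 8, 3),
  DV3 = rnorm(20, 9, 3)
)

# One call fits all three outcomes against the same predictor
fit <- lm(cbind(DV1, DV2, DV3) ~ group, data = df)

coef(fit)   # matrix: rows are model terms, columns are outcomes
s <- summary(fit)
names(s)    # one entry per outcome
s[["Response DV2"]]$coefficients  # coefficient table for DV2 only
```

This only fits when every outcome shares the same right-hand side; outcomes that need different predictors still need separate calls.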

Try this:
forms <- paste(cols, ' ~ group')
lapply(forms, lm, data = df)
Or if you want to just print summaries rather than save output:
lapply(forms, function(x) summary(lm(x, data = df)))
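The original call fails because the formula x ~ group is taken literally: lm looks for a variable called x, finds the length-one string (for example "DV1") in the anonymous function's environment, and never looks up the column of that name. Building the formula from the string avoids this. A minimal sketch using reformulate from base R (the seed and the names fits and summaries are illustrative):

```r
set.seed(123)  # illustrative seed, only for reproducibility
df <- data.frame(
  group = rep(c("A", "B"), each = 2, length.out = 20),
  DV1 = rnorm(20, 10, 3),
  DV2 = rnorm(20, 8, 3),
  DV3 = rnorm(20, 9, 3)
)
cols <- c("DV1", "DV2", "DV3")

# reformulate("group", response = "DV1") builds the formula DV1 ~ group
fits <- lapply(cols, function(x) {
  lm(reformulate("group", response = x), data = df)
})
names(fits) <- cols  # keep track of which fit belongs to which outcome

summaries <- lapply(fits, summary)
summaries$DV2$coefficients
```

Naming the list by outcome means each summary can be pulled out by name rather than by position.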

Related

Performing a large number of 2-sample t-tests in R

So I am creating a function which allows me to take a data.frame and get a dataframe of p.values for each variable tested.
# data and labels
my_data <- data.frame(matrix(data = rnorm(100 * 10000), nrow = 100, ncol = 10000))
labels <- sample(0:1, 100, replace = TRUE)
# append the labels to the data, then filter
my_data$labels <- labels
sample_1 <- dplyr::filter(.data = my_data, labels == 0)
sample_2 <- dplyr::filter(.data = my_data, labels == 1)
#perform a t-test on each column
p_vals <- data.frame()
for(i in c(1:10000)) {
p_vals <- rbind(p_vals, t.test(x = sample_1[,i], y = sample_2[,i])$p.value)
}
return(p_vals)
This is functional, but I think/hope there would be a more efficient way to do this without the for loop. The data should be in rows because for later functions it will be important to keep track of which variable had which p value.
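Instead of splitting the samples, you can use the formula interface to t.test and sapply over the columns of my_data to run the tests. Drop the labels column first, since it was appended to my_data and cannot be tested against itself:
p_vals <- sapply(my_data[names(my_data) != "labels"], function(x) t.test(x ~ labels)$p.value)
This returns a named vector of p-values in the same order as the columns of my_data, so each p-value stays tied to its variable.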
You can also use the package genefilter:
library(genefilter)
colttests(as.matrix(my_data[,-ncol(my_data)]),factor(my_data$labels))
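Since the question asks for the p-values in rows alongside the variable they belong to, here is a minimal base R sketch of that shape. It uses a smaller simulated data set than the question, and the names n_obs, n_vars and p_table are illustrative assumptions.

```r
set.seed(42)  # illustrative seed, only for reproducibility
n_obs  <- 100
n_vars <- 50   # far fewer columns than the question, to keep the sketch quick

my_data <- data.frame(matrix(rnorm(n_obs * n_vars), nrow = n_obs))
labels  <- sample(0:1, n_obs, replace = TRUE)  # kept separate from my_data

# One Welch two-sample t-test per column; vapply guarantees a numeric vector
p_vals <- vapply(my_data, function(x) t.test(x ~ labels)$p.value, numeric(1))

# One row per variable, so each p-value stays attached to its name
p_table <- data.frame(variable = names(p_vals), p_value = unname(p_vals))
head(p_table[order(p_table$p_value), ])
```

Keeping labels out of the data frame avoids having to exclude that column before looping over the variables.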

How to test all subsets of predictor variables in R

I would like to programmatically build glms in r, similarly to what's described here (How to build and test multiple models in R), except testing all possible subsets of predictor variables instead.
So, for a dataset like this, with outcome variable z:
data <- data.frame("z" = rnorm(20, 15, 3),
"a" = rnorm(20, 20, 3),
"b" = rnorm(20, 25, 3),
"c" = rnorm(20, 5, 1))
is there a way to automate building the models:
m1 <- glm(z ~ a, data = data)
m2 <- glm(z ~ b, data = data)
m3 <- glm(z ~ c, data = data)
m4 <- glm(z ~ a + b, data = data)
m5 <- glm(z ~ a + c, data = data)
m6 <- glm(z ~ b + c, data = data)
m7 <- glm(z ~ a + b + c, data = data)
I know the dredge function of the MuMIn package can do this, but I got an error saying that I was including too many variables, so I'm looking for ways to do this independently of dredge. I've tried expand.grid() and combn(), map() and lapply() variants of answers I've found on StackOverflow and can't seem to piece this together. Ideally, model output, including BIC, would be stored in a sortable dataframe.
Any help would be greatly appreciated!!
Assuming you have taken note of @Maurits Evers' comment, you can achieve what you want with a combination of lapply and combn:
cols <- names(data)[-1]
lapply(seq_along(cols), function(x) combn(cols, x, function(y)
glm(reformulate(y, "z"), data = data), simplify = FALSE))
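The question also asks for the BIC of each model in a sortable data frame. A minimal sketch of one way to get there, flattening the subsets into a single list first (the names subsets, models and results, and the seed, are illustrative assumptions):

```r
set.seed(1)  # illustrative seed, only for reproducibility
data <- data.frame(z = rnorm(20, 15, 3),
                   a = rnorm(20, 20, 3),
                   b = rnorm(20, 25, 3),
                   c = rnorm(20, 5, 1))
cols <- names(data)[-1]

# All non-empty subsets of the predictors, as a flat list of character vectors
subsets <- unlist(
  lapply(seq_along(cols), function(k) combn(cols, k, simplify = FALSE)),
  recursive = FALSE
)

# One glm per subset (default gaussian family, as in the question)
models <- lapply(subsets, function(v) {
  glm(reformulate(v, response = "z"), data = data)
})

# One row per model, sorted so the lowest BIC comes first
results <- data.frame(
  formula = vapply(subsets, function(v) paste("z ~", paste(v, collapse = " + ")),
                   character(1)),
  BIC = vapply(models, BIC, numeric(1))
)
results <- results[order(results$BIC), ]
results
```

The number of subsets doubles with each added predictor, so this brute-force approach is only practical for a modest number of variables.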

Iterate over list and append in order to do a regression in R

I know that somewhere there will exist this kind of question, but I couldn't find it. I have the variables a, b, c, d and I want to write a loop, such that I regress and append the variables and regress again with the additional variable
lm(Y ~ a, data = data), then
lm(Y ~ a + b, data = data), then
lm(Y ~ a + b + c, data = data) etc.
How would you do this?
Using paste and as.formula, example using mtcars dataset:
myFits <- lapply(2:ncol(mtcars), function(i){
x <- as.formula(paste("mpg",
paste(colnames(mtcars)[2:i], collapse = "+"),
sep = "~"))
lm(formula = x, data = mtcars)
})
Note: this looks like a duplicate post. I have seen a better solution for this type of question but cannot find it at the moment.
You could do this with a lapply / reformulate approach.
formulae <- lapply(ivars, function(x) reformulate(x, response="Y"))
lapply(formulae, function(x) summary(do.call("lm", list(x, quote(dat)))))
Data
set.seed(42)
dat <- data.frame(matrix(rnorm(80), 20, 4, dimnames=list(NULL, c("Y", letters[1:3]))))
ivars <- sapply(1:3, function(x) letters[1:x]) # create an example list of indep. variable sets
vars = c('a', 'b', 'c', 'd')
# might want to use a subset of names(data) instead of
# manually typing the names
reg_list = list()
for (i in seq_along(vars)) {
my_formula = as.formula(sprintf('Y ~ %s', paste(vars[1:i], collapse = " + ")))
reg_list[[i]] = lm(my_formula, data = data)
}
You can then inspect an individual result with, e.g., summary(reg_list[[2]]) (for the 2nd one).

Extracting residual values from lavaan list matrices in R

I am using lavaan package and my intention is to get my model residuals as dataframes for further use. I run several models that have grouping variables. Here's the basic workflow:
require(lavaan)
df <- data.frame(
y1 = sample(1:100),
y2 = sample(1:100),
x1 = sample(1:100),
x2 = sample(1:100),
x3 = sample(1:100),
grpvar = sample(c("grp1","grp2"), 100, replace = T))
semModel <- list(length = 2)
semModel[1] <- 'y1 ~ c(a,b)*x1 + c(a,b)*x2'
semModel[2] <- 'y1 ~ c(a,b)*x1
y2 ~ c(a,b)*x2 + c(a,b)*x3'
funEstim <- function(model){
sem(model, data = df, group = "grpvar", estimator = "MLM")}
fits <- lapply(semModel, funEstim)
residuals <- lapply(fits, function(x) resid(x, "obs"))
Now the resulting residuals object bugs me. It is a list of matrices that is nested a few times. How do I get each of the matrices as a separate dataframe without any hardcoding? I don't want to unlist them as that would lose some information.
You can use list2env along with unlist to make the grp1, grp2, length.grp1, and length.grp2 directly available in the global environment.
list2env(unlist(residuals, recursive=FALSE), envir=.GlobalEnv)
ls()
#[1] "df" "fits" "funEstim" "grp1" "grp2"
#[6] "length.grp1" "length.grp2" "residuals" "semModel"
But they won't be data frames. For that you could convert them to data frames before calling list2env:
df.list <- lapply(unlist(residuals, recursive=FALSE), data.frame)
list2env(df.list, envir=.GlobalEnv)
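The length. prefix in the names above comes from list(length = 2), which creates a one-element list whose element is named length rather than an empty list of length two (vector("list", 2) is the usual way to preallocate). The sketch below shows the same flattening on a toy nested list in base R, keeping the result as a named list of data frames instead of writing into the global environment. The toy matrices and the names model1, model2, nested and df_list are assumptions standing in for the lavaan output, whose exact nesting may differ.

```r
# Helper that builds a small labelled matrix standing in for a residual matrix
m <- function(v) matrix(v, 2, 2, dimnames = list(c("y1", "x1"), c("y1", "x1")))

# Two models, each with one matrix per group
nested <- list(
  model1 = list(grp1 = m(1:4),  grp2 = m(5:8)),
  model2 = list(grp1 = m(9:12), grp2 = m(13:16))
)

# Remove one level of nesting; outer and inner names are joined with "."
flat <- unlist(nested, recursive = FALSE)

# Convert every matrix to a data frame, keeping the combined names
df_list <- lapply(flat, as.data.frame)

names(df_list)
df_list[["model2.grp1"]]
```

Naming the outer list explicitly keeps the flattened names unique, which matters if the elements are later copied into an environment with list2env.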

List Indexing in R over a loop

I'm new to using lists in R and am trying to run a loop over various data frames that stores multiple models for each frame. I would like the models that correspond to a given data frame within the first index of the list; e.g. [[i]][1], [[i]][2]. The following example overwrites the list:
f1 <- data.frame(x = seq(1:6), y = sample(1:100, 6, replace = TRUE), z = rnorm(6))
f2 <- data.frame(x = seq(6,11), y = sample(1:100, 6, replace = TRUE), z = rnorm(6))
data.frames <- list(f1,f2)
fit <- list()
for(i in 1:length(data.frames)){
fit[[i]] <- lm(y ~ x, data = data.frames[[i]])
fit[[i]] <- lm(y ~ x + z, data = data.frames[[i]])
}
Any idea how to set up the list or the indexing in the loop such that it generates an output that has the two models for the first frame referenced as [[1]][1] and [[1]][2] and the second frame as [[2]][1] and [[2]][2]? Thanks for any and all help.
Calculate both models in a single lapply call applied to each part of the data.frames list:
lapply(data.frames, function(i) {
list(lm(y ~ x, data = i),
lm(y ~ x + z, data=i))
})
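A self-contained sketch of the same idea with named elements, showing how the nested result is indexed. The names simple and full, the list names f1 and f2, and the seed are illustrative assumptions.

```r
set.seed(1)  # illustrative seed, only for reproducibility
f1 <- data.frame(x = 1:6,  y = sample(1:100, 6, replace = TRUE), z = rnorm(6))
f2 <- data.frame(x = 6:11, y = sample(1:100, 6, replace = TRUE), z = rnorm(6))
data.frames <- list(f1 = f1, f2 = f2)

# Each data frame yields a two-element list of models
fit <- lapply(data.frames, function(d) {
  list(simple = lm(y ~ x, data = d),
       full   = lm(y ~ x + z, data = d))
})

class(fit[[1]][[2]])  # "lm": second model for the first frame
class(fit[[1]][2])    # "list": single brackets keep the list wrapper
coef(fit$f2$simple)   # the same element can be reached by name
```

Double brackets return the model itself, while single brackets return a length-one list containing it, so [[i]][[j]] is usually what is wanted when passing a fit to summary or coef.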
