make a list of lm objects, retain their class - r

Apologies for such a rudimentary question--I must be missing something obvious.
I want to build a list of lm objects, which I'm then going to use in an llply call to perform mediation analysis on this list. But this is immaterial--I just first want to make a list of length m (where m is the number of model sets), in which each element will itself contain n lm objects.
So in this simple example
d1 <- data.frame(x1 = runif(100, 0, 1),
x2 = runif(100, 0, 1),
x3 = runif(100, 0, 1),
y1 = runif(100, 0, 1),
y2 = runif(100, 0, 1),
y3 = runif(100, 0, 1))
m1 <- lm(y1 ~ x1 + x2 + x3, data = d1)
m2 <- lm(x1 ~ x2 + x3, data = d1)
m3 <- lm(y2 ~ x1 + x2 + x3, data = d1)
m4 <- lm(x2 ~ x1 + x3, data = d1)
m5 <- lm(y3 ~ y1 + y2 + x3, data = d1)
m6 <- lm(x3 ~ x1 + x2, data = d1)
I want a list containing 3 elements, and the first element will contain m1 and m2, the second will contain m3 and m4, etc. My initial attempt is sort of right, but the lm objects don't retain their class.
mlist <- list(c(m1,m2),
c(m3,m4),
c(m5,m6))
It has the right length (ie length(mlist) equals 3), but I thought I could access the lm object itself with
class(mlist[1][[1]])
but this element is apparently a list.
Am I screwing up how I build the list in the first step, or is this something more fundamental regarding lm objects?

No, you're just getting confused with c and list indexing. Try this:
mlist <- list(list(m1,m2),
list(m3,m4),
list(m5,m6))
> class(mlist[[1]][[1]])
[1] "lm"
So c will concatenate lists by flattening them. In the case of an lm object, that basically means it's flattening each lm object into a list of its components, and then concatenating all those lists together. c is more intuitively used on atomic vectors.
The indexing of lists often trips people up. The thing to remember is that [ will always return a sub-list, while [[ selects an element.
In my example above, this means that mlist[1] will return a list of length one. That first element is still a list. So you'd have to do something like mlist[1][[1]][[1]] to get all the way down to the lm object that way.
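To make the difference concrete, here is a minimal sketch that compares c() with list() and walks through the indexing. The data and the names flat and nested are made up for illustration and are not from the question.

```r
set.seed(1)
d1 <- data.frame(x1 = runif(100), x2 = runif(100), y1 = runif(100))
m1 <- lm(y1 ~ x1 + x2, data = d1)
m2 <- lm(x1 ~ x2, data = d1)

# c() flattens both models into one long list of their components
flat <- c(m1, m2)
class(flat)               # "list"

# list() keeps each model intact as a single element
nested <- list(list(m1, m2))
class(nested[1])          # "list": [ returns a sub-list
class(nested[[1]])        # "list": the inner list holding m1 and m2
class(nested[[1]][[1]])   # "lm":   [[ twice reaches the model itself
```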


How to test all subsets of predictor variables in R

I would like to programmatically build glms in r, similarly to what's described here (How to build and test multiple models in R), except testing all possible subsets of predictor variables instead.
So, for a dataset like this, with outcome variable z:
data <- data.frame("z" = rnorm(20, 15, 3),
"a" = rnorm(20, 20, 3),
"b" = rnorm(20, 25, 3),
"c" = rnorm(20, 5, 1))
is there a way to automate building the models:
m1 <- glm(z ~ a, data = data)
m2 <- glm(z ~ b, data = data)
m3 <- glm(z ~ c, data = data)
m4 <- glm(z ~ a + b, data = data)
m5 <- glm(z ~ a + c, data = data)
m6 <- glm(z ~ b + c, data = data)
m7 <- glm(z ~ a + b + c, data = data)
I know the dredge function of the MuMIn package can do this, but I got an error saying that I was including too many variables, so I'm looking for ways to do this independently of dredge. I've tried expand.grid(), combn(), map() and lapply() variants of answers I've found on StackOverflow and can't seem to piece this together. Ideally, model output, including BIC, would be stored in a sortable dataframe.
Any help would be greatly appreciated!!
Assuming you have taken note of @Maurits Evers' comment, you can do what you want with a combination of lapply and combn:
cols <- names(data)[-1]
lapply(seq_along(cols), function(x) combn(cols, x, function(y)
glm(reformulate(y, "z"), data = data), simplify = FALSE))
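The question also asks for the results in a sortable data frame with BIC. One way to get there, as a sketch only: flatten the subsets into a single list, fit each model, and collect the formula text and BIC. The names subsets, models and results are illustrative, and the seed is added just to make the example reproducible.

```r
set.seed(1)
data <- data.frame(z = rnorm(20, 15, 3),
                   a = rnorm(20, 20, 3),
                   b = rnorm(20, 25, 3),
                   c = rnorm(20, 5, 1))
cols <- names(data)[-1]

# All non-empty subsets of the predictors, as one flat list of character vectors
subsets <- unlist(
  lapply(seq_along(cols), function(k) combn(cols, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit one glm per subset
models <- lapply(subsets, function(v) glm(reformulate(v, response = "z"), data = data))

# One row per model, sorted by BIC (smallest first)
results <- data.frame(
  formula = vapply(subsets, function(v) paste("z ~", paste(v, collapse = " + ")), character(1)),
  BIC = vapply(models, BIC, numeric(1))
)
results <- results[order(results$BIC), ]
results
```

With p predictors this fits 2^p - 1 models, so the approach only stays practical for a modest number of variables.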

How to use a dataframe in a function in r

I need to pass the variables of a dataframe to a function in R. The function in question is "y = 1 - (x1 - x2) / x3". When I enter the values manually it works; however, I need to use the random numbers from the dataframe.
# Original function
f <- function(x1, x2, x3) {
  return(1 - (x1 - x2) / x3)
}
f(0.9, 0.5, 0.5)
# Dataframe function
f <- function(x1, x2, x3) {
  return(1 - (x1 - x2) / x3)
}
f(x1 = x1, x2 = x2, x3 = x3, DATA = DF)
The first call works, but the second produces an error message: Error in f(VMB = VMB, VMR = VMR, DATA = DATA1) : unused argument (DATA = DATA1). I know I'm not passing the dataframe to the function properly, but I'm going in circles. Can anyone help me?
As the comments suggest, your problem is that the function doesn't have a data argument. R doesn't know where x1, x2 and x3 come from, so it only searches the calling environment (here, the global environment) for them. If they are columns of a data frame, R doesn't know that it should take them from there, and the call fails.
For example
f <- function(x, y, z)
  1 - (x - y) / z
f(0.9, 0.5, 0.5)
will work, because it knows where to retrieve the values. So will
x1 <- 0.9
x2 <- 0.5
x3 <- 0.5
f(x1, x2, x3)
because R finds x1, x2 and x3 in the global environment, but
df <- data.frame(x = 0.9, y = 0.5, z = 0.5)
f(x, y, z) #fails
fails, because it doesn't look for them in df. Instead you can use
f(df$x, df$y, df$z)
with(df, f(x, y, z)) #same
which lets R know where to get the variables. (Here I used x, y and z to avoid name conflicts.)
If this function should always take a data.frame and use columns x1, x2, x3, you could rewrite it to incorporate this, as below.
f <- function(df) {
  with(df, 1 - (x1 - x2) / x3)
}
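As a usage sketch, the data-frame version can be applied to every row at once, because the arithmetic is vectorised over the columns. The data frame DF and its values below are made up for illustration.

```r
# Data-frame version of the function, using the formula from the question
f <- function(df) {
  with(df, 1 - (x1 - x2) / x3)
}

# Hypothetical data: one row per observation
DF <- data.frame(x1 = c(0.9, 0.8),
                 x2 = c(0.5, 0.2),
                 x3 = c(0.5, 0.3))

# One result per row, stored as a new column
DF$y <- f(DF)
DF
```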

How to do linear regression with this particular data set?

I have a response variable y.
I also have a list of 5 independent (predictor) variables
x <- list(x1, x2, x3, x4, x5)
Lastly I have a Logical Vector z of length 5. E.g.
z <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
Given this I want R to automatically do linear Regression
lm(y ~ x1 + x2 + x5)
Basically, each TRUE/FALSE says whether or not to include the corresponding predictor.
I am unable to do this.
I tried doing lm(y ~x[z]) but it does not work.
You may do
lm(y ~ do.call(cbind, x[z]))
do.call(cbind, x[z]) will convert x[z] into a matrix, which is an acceptable input format for lm. One problem with this is that the names of the regressors (assuming that x is a named list) in the output are a little messy. So, instead you may do
lm(y ~ ., data = data.frame(y = y, do.call(cbind, x[z])))
that would give nice names in the output (again, assuming that x is a named list).
Try binding your y and the selected predictors into a data.frame (or a matrix, with cbind) before you do your linear regression. You can filter your predictors by doing something like this:
x <- list(x1 = 1:5, x2 = 1:5, x3 = 1:10, x4 = 1:5, x5 = 1:5)
z <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
b <- data.frame(x[which(z == TRUE)])
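Putting the pieces together, a minimal sketch (assuming x is a named list of equal-length numeric vectors; the simulated data and the names dat and fit are illustrative) builds the data frame from the selected elements and the formula from their names:

```r
set.seed(42)
n <- 20
x <- list(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n),
          x4 = rnorm(n), x5 = rnorm(n))
y <- rnorm(n)
z <- c(TRUE, TRUE, FALSE, FALSE, TRUE)

# Keep only the selected predictors; a logical vector can index the list directly
dat <- data.frame(y = y, x[z])

# Build y ~ x1 + x2 + x5 from the names of the selected elements
fit <- lm(reformulate(names(x)[z], response = "y"), data = dat)
names(coef(fit))
```

Because the formula is built from names, the coefficients are labelled x1, x2 and x5 in the output.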

How to paste formula into model.matrix function in R?

By way of simplified example, say you have the following data:
n <- 10
df <- data.frame(x1 = rnorm(n, 3, 1), x2 = rnorm(n, 0, 1))
And you wish to create a model matrix of the following form:
model.matrix(~ df$x1 + df$x2)
or more preferably:
model.matrix(~ x1 + x2, data = df)
but instead by pasting the formula into model.matrix. I have experimented with the following but encounter errors with all of them:
form1 <- "df$x1 + df$x2"
model.matrix(~ as.formula(form1))
model.matrix(~ eval(parse(text = form1)))
model.matrix(~ paste(form1))
model.matrix(~ form1)
I've also tried the same with the more preferable structure:
form2 <- "x1 + x2, data = df"
Is there a direct solution to this problem? Or is the model.matrix function not conducive to this approach?
Do you mean something like this?
expr <- "~ x1 + x2"
model.matrix(as.formula(expr), df)
You need to give df as the data argument outside of as.formula, as the data argument defines the environment within which to evaluate the formula.
If you don't want to specify the data argument you can do
model.matrix(as.formula("~ df$x1 + df$x2"))
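If the formula has to be assembled from column names rather than typed out, a sketch along these lines may help. The names rhs, form and mm are illustrative, and the seed is only there for reproducibility.

```r
set.seed(1)
n <- 10
df <- data.frame(x1 = rnorm(n, 3, 1), x2 = rnorm(n, 0, 1))

# Paste the right-hand side together from the column names
rhs <- paste(names(df), collapse = " + ")   # "x1 + x2"
form <- as.formula(paste("~", rhs))

# The data argument is passed separately from the formula
mm <- model.matrix(form, data = df)
colnames(mm)
```

This is also why form2 in the question cannot work: "data = df" is an argument to model.matrix, not part of the formula, so it cannot be pasted into the formula string.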

Programming 50000 regressions in R using parallel programming

I have the following homework problem, which I have finished but seems to take an exceptionally long time to complete:
Assume that Y , X1, · · · , X1000 are all normal random variables with mean 0 and standard deviation 1, and they are independent with each other. Generate 30 samples of Y, X1, ···, X1000. Now repeat the following 50000 times: Randomly pickup ten variables from X1, . . ., X1000, run a linear regression of Y on these ten variables and record the R2. Compute the maximum value of the 50000 R2’s.
And here is my code, which works for 8000 regressions (1000 regression on each core of my macbook pro), but can't seem to finish for 6250 regressions (50000 regressions total) on each core. Here is my code:
library(snow)
cl <- makeCluster(8, type = "SOCK")
invisible(clusterEvalQ(cl, reg_cluster <- function(rep, samples, n) {
X <- list()
R <- rep(0, rep)
for (k in 1:rep) {
Y <- rnorm(samples)
for (j in 1:n) {
X[[j]] <- rnorm(samples)
}
X_1 <- sample(X, 10, replace = FALSE)
X_1_unlist <- unlist(X_1)
X.1 <- matrix(X_1_unlist[1:30], ncol = 1)
X.2 <- matrix(X_1_unlist[31:60], ncol = 1)
X.3 <- matrix(X_1_unlist[61:90], ncol = 1)
X.4 <- matrix(X_1_unlist[91:120], ncol = 1)
X.5 <- matrix(X_1_unlist[121:150], ncol = 1)
X.6 <- matrix(X_1_unlist[151:180], ncol = 1)
X.7 <- matrix(X_1_unlist[181:210], ncol = 1)
X.8 <- matrix(X_1_unlist[211:240], ncol = 1)
X.9 <- matrix(X_1_unlist[241:270], ncol = 1)
X.10 <- matrix(X_1_unlist[271:300], ncol = 1)
X_data <- cbind(X.1, X.2, X.3, X.4, X.5, X.6, X.7, X.8, X.9, X.10)
X_data <- as.data.frame (X_data)
names(X_data) <- c("X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8", "X9", "X10")
attach(X_data)
reg <- lm(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10)
R[k] <- summary(reg)$r.squared
}
return(max(R))
}))
results <- clusterEvalQ(cl, reg_cluster(1000, 30, 1000))
results <-clusterEvalQ(cl, reg_cluster(6250, 30, 1000))
stopCluster(cl)
max_results <- c(results[[1]], results[[2]], results[[3]], results[[4]],
results[[5]], results[[6]], results[[7]], results[[8]])
max(max_results)
Something else should be noted here. Each time I run a new regression, the Y and all the X's are generated again. No random variables carry over from one regression to the next.
So my question is, how can I make this run faster?
Also, can anyone tell me why it finished after 12 minutes for 8000 regressions, but still has not finished, after 2.5 hours, for 50000 regressions?
Edit: The following procedure has been confirmed by the professor:
1) Generate 30 random standard normal variables of each of Y, X1, ..., X1000. I would have a total of 30 random normal variables for Y, and a total of 30 x 1,000 = 30,000 random normal variables for all the X's (30 for each one)
2) Randomly select ten of the 1000 choices for X (for example X726, X325, X722, X410, X46, X635, X822, X518, X773, X187)
3) Run a linear regression Y ~ 10 X's using the lm function in R. The Y would have 30 observations, while each X would also have 30 observations. Essentially we'd try to be fitting Y = B0 + B1 * X1 + B2 * X2 + ... + B10 * X10, where each of the X's represents one of the randomly selected in part 2.
4) Record the R2 value in a vector
5) Repeat steps 1-4 50,000 times
6) Find the maximum R2 of the 50,000 recorded
Here's an alternative code that seems to solve your problem.
ns <- 30
rvals <- replicate(50000, {
y <- rnorm(ns)
xvals <- replicate(1000, rnorm(ns))
selecteds <- xvals[,sample(1:1000, 10)]
df <- data.frame(y = y, selecteds)
summary(lm(paste("y ~", paste0("X", 1:10, collapse = "+")), data = df))$r.squared
})
I'm not very experienced with clustering, but here are a few reasons why your code might be too slow:
You have nested for loops to create X, and I used replicate, which could be slightly faster than using a list.
You're growing an empty list, X, which is very bad. (Check The R Inferno, Circle 2.)
You unlisted several list elements just to make them 1-column matrices, then bound them all together and finally named the columns. Though these steps may seem necessary, I think doing them one at a time is probably slow. The column names, for example, are automatically set to X1 through X10.
Using attach isn't necessary and probably slows things down.
If you open/close too many clusters, that consumes a lot of processing and can make things slower than non-parallel code. That doesn't seem to be the case here, though.
As a final note, just make sure I'm doing the same as you, since the problem was still a bit confusing for me.
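If speed is still a concern, much of the per-regression cost is the formula and data-frame handling inside lm. The sketch below swaps in stats::lm.fit, which works directly on a design matrix, and computes R-squared by hand. Note that this deviates from the stated procedure, which asks for lm; the function name r2_once and the reduced repetition count of 200 are assumptions made for illustration.

```r
set.seed(123)

# One repetition: fresh data, 10 random predictors, R-squared of the fit
r2_once <- function(ns = 30, n_vars = 1000, n_pick = 10) {
  y <- rnorm(ns)
  X <- matrix(rnorm(ns * n_vars), nrow = ns)           # ns x n_vars, generated in one call
  Xsel <- cbind(1, X[, sample.int(n_vars, n_pick)])    # intercept column + chosen predictors
  fit <- lm.fit(Xsel, y)                               # low-level fit, no formula parsing
  1 - sum(fit$residuals^2) / sum((y - mean(y))^2)      # R-squared = 1 - RSS / TSS
}

# Small run for illustration; raise the count to 50000 for the full problem
r2 <- replicate(200, r2_once())
max(r2)
```

Since every repetition is independent, the same function could also be handed to a parallel apply over the repetitions if a single core is still too slow.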
