I'm looking to loop a number of independent variables through a mixed effect model. There are a couple of similar questions but nothing that quite works for me. An example using mtcars:
data(mtcars)
mtcars <- mtcars
t <- as.data.frame(seq(from = 10, to = 1000, by = 100))
names(t)[1] <- "return"
t <- as.data.frame(t(t))
#create some new variables to loop through
new <- cbind(mtcars$drat, t)
new2 <- 1-exp(-mtcars$drat/new[, 2:10])
new3 <- cbind(mtcars, new2)
xnam <- paste(colnames(mtcars)[c(3:4)], sep="")
xnam2 <- paste(colnames(reference)[c(12:20)], sep="")
#basic model (works fine)
fmla <- paste(xnam, collapse= "+")
fla <- paste("cyl ~", paste(fmla))
f <- paste0(paste(fla), " +(carb|gear)")
mtcarsmodel <- lmer(f, data= mtcars)
mtcarsmodel
So with my 'basic' model, I now want iteratively run each of the variables in xnam2 through the model as a fixed effect, but can't get it working with lapply and paste method:
f2 <- " +(carb|gear)"
newmodels <- lapply(xnam2, function(x) {
lmer(substitute(paste(fla), i + (paste(f2)), list(i = as.name(x))), data = mtcars)
})
So cyl ~ disp+hp + looping variable + (carb|gear) is what I'm after.
Hopefully that's clear with what I'm trying to accomplish. I'm getting a bit confused with the multiple pastes, but seems like the best way to approach dealing with many variables. Any suggestions?
If I've understood your question, I think you can just create the model formula with paste and use lapply to iterate through each new variable.
library(lme4)
vars = names(mtcars)[c(1,5,6,7)]
models = lapply(setNames(vars, vars), function(var) {
form = paste("cyl ~ disp + hp + ", var, "+ (carb|gear)")
lmer(form, data=mtcars)
})
A slight variant on #eipi10's solution:
library(lme4)
vars = names(mtcars)[c(1,5,6,7)]
otherVars <- c("disp","hp","(carb|gear)")
formList <- lapply(vars,function(x) {
reformulate(c(otherVars,x),response="cyl")
})
modList <- lapply(formList,lmer,data=mtcars)
Related
I know that somewhere there will exist this kind of question, but I couldn't find it. I have the variables a, b, c, d and I want to write a loop, such that I regress and append the variables and regress again with the additional variable
lm(Y ~ a, data = data), then
lm(Y ~ a + b, data = data), then
lm(Y ~ a + b + c, data = data) etc.
How would you do this?
Using paste and as.formula, example using mtcars dataset:
myFits <- lapply(2:ncol(mtcars), function(i){
x <- as.formula(paste("mpg",
paste(colnames(mtcars)[2:i], collapse = "+"),
sep = "~"))
lm(formula = x, data = mtcars)
})
Note: looks like a duplicate post, I have seen a better solution for this type of questions, cannot find at the moment.
You could do this with a lapply / reformulate approach.
formulae <- lapply(ivars, function(x) reformulate(x, response="Y"))
lapply(formulae, function(x) summary(do.call("lm", list(x, quote(dat)))))
Data
set.seed(42)
dat <- data.frame(matrix(rnorm(80), 20, 4, dimnames=list(NULL, c("Y", letters[1:3]))))
ivars <- sapply(1:3, function(x) letters[1:x]) # create an example vector ov indep. variables
vars = c('a', 'b', 'c', 'd')
# might want to use a subset of names(data) instead of
# manually typing the names
reg_list = list()
for (i in seq_along(vars)) {
my_formula = as.formula(sprintf('Y ~ %s', paste(vars[1:i], collapse = " + ")))
reg_list[[i]] = lm(my_formula, data = data)
}
You can then inspect an individual result with, e.g., summary(reg_list[[2]]) (for the 2nd one).
I am looking for an elegant (and safe!) way to evaluate an amended call in the parent frame. By "amended" I mean I modified the call in such a way that it refers to something not included in parent frame but in another frame. I guess one could also say: "send something up but only for evaluation".
It is clarified what I want by the example below which works in some circumstances, but not all. The update function (stats:::update.default) uses eval and I added the weights argument with something (res) that is not in the same environment as the evaluation takes place. So I used get("res", pos = -1L) and I hope it is a safe way to refer to the environment res lives in. For models estimated with a variable as formula, both defined methods fail:
mod <- lm(mpg ~ cyl, data = mtcars)
form <- mpg ~ cyl
mod2 <- lm(form, data = mtcars)
wls1 <- function(x) {
res <- residuals(x)^2 # example
result <- update(x, weights = 1/get("res", pos = -1L))
return(result)
}
wls2 <- function(x) {
res <- residuals(x)^2 # example
result <- update(x, weights = 1/res)
return(result)
}
wls3 <- function(x) {
data(ChickWeight)
ChickWeight$cyl <- ChickWeight$weight
ChickWeight$mpg <- ChickWeight$Time
result <- update(x, data = ChickWeight)
return(result)
}
wls1(mod) # works
wls1(mod2) # errors
wls2(mod) # works
wls2(mod2) # erros
wls3(mod) # works
wls3(mod2) # works
How can this be solved in general in a safe way?
I was looking for a function that gives the current environment (something like a fictious this.environment() function) so avoid the pos argument and use the envir of get (I know I can create my own temporary environment and have res associated to it to use something like envir = my.eny).
We can do this by creating a quoted 'language' object for the formula and then update the call of the model
form <- quote(mpg ~ cyl)
wlsN <- function(x, formula) {
x$call$formula <- formula
res <- residuals(x)^2
update(x, weights = 1/res) # it is in the same environment. No need for get
}
wlsN(mod2, form)
#Call:
#lm(formula = mpg ~ cyl, data = mtcars, weights = 1/res)
#Coefficients:
#(Intercept) cyl
# 37.705 -2.841
-checking with other formula
form1 <- quote(disp ~ cyl + vs)
form2 <- quote(mpq ~ gear + carb)
mod1 <- lm(form1, data = mtcars)
mod2 <- lm(form2, data = mtcars)
wlsN(mod1, form1) # works
wlsN(mod2, form2) # works
It's hard to work around the fact that R looks for the value of weights in either data or the environment of the formula - which in the case of the variable named form in your example, is the global environment.
An alternative that riffs on the same theme as akrun's answer:
wls3 <- function(x) {
environment(x$call$formula) <- environment()
res <- residuals(x)^2
result <- update(x, weights=1/res)
}
I can see how this could get ugly in less trivial uses of this workaround such as when the formula of x already has an environment that does not enclose (potentially wrong use of the term) the environment in the call to wls3().
Another alternative (not recommended) is to use assign, e.g.
wls4 <- function(x) {
assign('res', residuals(x)^2, envir=environment(formula(x)))
result <- update(x, weights=1/res)
}
however this has the unintended consequence of leaving the variable res in the global environment.
I'm trying to combine multiple lm outputs into a data frame, for further calculations. I have a dataset of 1000 observations and 62 variables. The project is to randomly split the dataset 63/37, train the model, repeat this 1000 times and save the coefficients, the fitted values, and the r2 for all 1000 runs. So I'm doing most of that here (using mtcars):
data("mtcars")
f <- function () {
fit <- lm(mpg ~ ., data = mtcars, subset = sample <- sample.int(n = nrow(mtcars), size = floor(.63*nrow(mtcars)), replace = F))
coef(fit)
}
output <- t(replicate(1000, f()))
I know I can get the rsq values with summary(fit)$r.squared and I can use predict() to get the fitted values. I'm just struggling with how to get them into the data frame with the saved coefficients.
The below should do
get_model <- function (input_data) {
fit <- lm(mpg ~ .,
data = mtcars,
subset = sample <- sample.int(n = nrow(mtcars),
size = floor(.63*nrow(mtcars)), replace = F)
)
return(fit)
}
get_results <- function(lm_model){
data <- data.frame()
data <- rbind(data, coef(lm_model))
data <- cbind(data, summary(lm_model)$r.squared)
colnames(data) <- c(names(mtcars), "rsquared")
return(data)
}
# running the above
input_data <- mtcars
general_df <- data.frame()
for(i in 1:1000){
my_model <- get_model(input_data)
final_data <- get_results(my_model)
general_df <- rbind(general_df, final_data)
}
You are very close:
library(tidyverse)
library(modelr)
data("mtcars")
get_data_lm <- function(data_df, testPCT = 0.37){
data_resample <- modelr::crossv_mc(data_df, n = 1, test = testPCT)
fit <- lm(mpg ~ ., data = as.data.frame(data_resample$train))
stats <- c(coef(fit),
"R2" = summary(fit)$r.squared,
"AdjR2" = summary(fit)$adj.r.squared)
pred_vals <- predict(fit, newdata = as.data.frame(data_resample$test))
c(stats, pred_vals)
}
output <- t(replicate(1000, get_data_lm(mtcars)))
The only thing you needed to do is concatenate the other statistics and predicted values you want. Alternatively, you could use a parallel sapply() variant to make your simulation considerably faster.
Another comment: I use the crossv_mc() function from the modelr:: package to create one testing and training partition. However, I could have used n = 1000 outside the function instead; this would have created a resample data frame in my working environment for me to apply() a function over. See the modelr:: GitHub page for more info.
I have a list of model summaries (let's say it is a linear model; but this could apply to any model summary).
Currently, I am extracting a certain coefficient from this list of summaries using the following:
coef <- sapply(modelsummaries, function(x) x[[4]][[4,1]])
How could I do this by calling the variable name instead of relying on row position?
For each of the model summaries within the list, only one variable differs, which is named as V_01, V_02, V_03 etc. This is the variable coefficient I would like to extract.
I was thinking of using the grep function, something like:
coef <- sapply(modelsummaries, function(x) x[[4]][[grep("^[V]"),1]])
...but haven't got it working. Any suggestions?
Here's a reproducible example (only the last line needs to be tweaked):
newdata <- as.data.frame(seq(from = 0.1, to = 0.9, by = 0.1))
newdata <- as.data.frame(t(newdata))
colnames(newdata) = newdata[1, ]
colnames(newdata) <- paste("V", colnames(newdata), sep = "_")
mtcars <- mtcars
mtcarsmodel <- data.frame(mtcars, newdata)
mtcarsmodel[c(12:20)] <- sample(1:100, 288, replace=TRUE)
xnam <- paste(colnames(mtcarsmodel)[c(4:5)], sep="")
xnam2 <- paste(colnames(mtcarsmodel)[c(12:20)], sep="")
fmla <- paste(xnam, collapse= "+")
fmla2 <- paste(paste(fmla), "+")
fla <- paste("mpg ~", paste(fmla2))
models <- lapply(setNames(xnam2, xnam2), function(var) {
form = paste(fla, var)
lm(form, data=mtcarsmodel)
})
modelsummaries <-lapply(models, summary)
coef <- sapply(modelsummaries, function(x) x[[4]][[4,1]])
You were quite close, you just needed to tell grep what to search on, which is the rownames of the coefficient matrix returned by coef() (which is a better way to get them than [[4]]). Also so as not to reuse that name, I suggest saving the result in something different, like coefs.
coefs <- sapply(modelsummaries, function(x) {
coef(x)[grep("^V", rownames(coef(x))),"Estimate"]
})
V_0.1 V_0.2 V_0.3 V_0.4 V_0.5 V_0.6 V_0.7 V_0.8
0.030927774 -0.053437459 0.009335911 -0.011009187 -0.010303494 -0.001705420 -0.036297492 0.021838044
V_0.9
0.005457086
Also, check out the new broom package which can make it easier to extract certain information from models in a tidy way.
After struggling with a grep solution, I committed blasphemy and used an sql solution instead:
library('sqldf')
new <- lapply(modelsummaries, function(x) setDT(data.frame(x[[4]]), keep.rownames = TRUE)[])
values <- sapply(new, function(x) sqldf("SELECT x.estimate, x.'Pr...t..' FROM x WHERE rn like '%V_%'"))
data <- as.data.frame(t(rbind(values)))
I've also come up with a (somewhat ugly) grep based solution:
coef <- sapply(modelsummaries, function(x) as.numeric(unlist(strsplit(grep("^V_", capture.output(x), value = TRUE), "\\s+"))[[2]]))
If I have 100 variables with a common name, such as year_1951, year_1952, year_1953 etc, is there a way to do a linear regression that includes all variables that start with year_ ? In Stata this is easy by using the *, but in R, I'm not sure how to go about this.
THanks.
Stata Example :
regress y year_*
Is there an equivalence in R, such as
ols.lm <- lm(y ~ year_*, data = d)
I don't think R support that kind of expansion inside formula. It do support y ~ . kind of expansion.
Here is how you can do it
variables <- colnames(d)
depVar <- 'y'
indepVars <- variables[grepl('^year_',variables)]
myformulae <- as.formula(paste(depVar,paste(indepVars,collapse=' + '),sep = ' ~ '))
modelfit <-lm(myformulae,data=d)
Edit
: Solving the problem mentioned in the comment (Adding constants in the formulae)
variables <- colnames(d)
depVar <- 'y'
indepVars <- variables[grepl('^year_',variables)]
indepVarsCollapse <- paste(paste(indepVars,collapse=' + '), '-1')
myformulae <- as.formula(paste(depVar,indepVarsCollapse,sep = ' ~ '))
modelfit <-lm(myformulae,data=d)
Rather than selecting the columns in the formula, select them in the data argument:
nms <- c("y", grep("year_", names(d), value = TRUE))
lm(y ~., d[nms])
Alternately, select all the desired columns in the grep
ix <- grep("^(y$|year_)", names(d))
lm(y ~., d[ix])
or if we knew that the unwanted columns do not start with y:
ix <- grep("^y", names(d))
lm(y ~., d[ix])