I have the following regression model:
models <- lapply(1:25, function(x) lm(Y_df[,x] ~ X1))
Which runs 25 regressions on 25 columns in the Y_df dataframe.
One of the outputs can be shown as:
models[15] # Gives me the coefficients for model 15
Call:
lm(formula = Y_df[, x] ~ X1)
Coefficients:
(Intercept) X1
0.1296812 1.0585835
Which I can store in a separate df. The problem I am running into concerns the Std. Error, R2, residuals, etc.
I would like to store these also into a separate dataframe.
I can run individual regressions and extract the summaries as a normal R regression output would look like.
ls_1 <- summary(models[[1]])
ls_1
ls_1$sigma
However I am hoping to take the values directly from the line of code which runs the 25 regressions.
This code works
> (models[[15]]$coefficients)
(Intercept) X1
-0.3643446787 1.0789369642
However, this code does not:
> (models[[15]]$sigma)
NULL
I have tried a variety of different combinations to try and extract these results with no luck.
The following did exactly what I wanted perfectly. I had hoped there was a way to replace the word coef with Std Error or R2 etc. but this does not work.
models <- lapply(1:25, function(x) lm(Y_df[,x] ~ X1))
# extract just coefficients
coefficients <- sapply(models, coef)
Ideally I would like to store the Std Error from the above model
If a model is named mod, you can get the residuals in the same way as the coefficients:
mod$residuals
There are also functions that extract the coefficients and residuals:
coef(mod)
resid(mod)
You can extract the other outputs via summary:
summary(mod)$coef[,"Std. Error"] # standard errors
summary(mod)$r.squared # r squared
summary(mod)$adj.r.squared # adjusted r squared
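To see why models[[15]]$sigma returned NULL in the question, it can help to compare what the fitted model stores with what its summary stores. The following is a minimal sketch on the built-in mtcars data (a stand-in, not the asker's Y_df); the model mpg ~ wt is only an illustration.

```r
# Stand-in model on a built-in dataset (assumption: not the asker's data)
mod <- lm(mpg ~ wt, data = mtcars)

names(mod)            # components of the lm object: no "sigma" here
names(summary(mod))   # components of the summary: includes "sigma", "r.squared", ...

mod$sigma             # NULL, because lm objects do not store it
summary(mod)$sigma    # residual standard error, computed by summary()
sigma(mod)            # stats::sigma() returns the same value without building the full summary
```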
So you can either create a list containing each of these results for each model:
outputList <- lapply(models, function(mod){
coefs <- coef(mod)
stdErr <- summary(mod)$coef[,"Std. Error"]
rsq <- summary(mod)$r.squared
rsq_adj <- summary(mod)$adj.r.squared
rsd <- resid(mod)
list(coefs = coefs,
stdErr = stdErr,
rsq = rsq,
rsq_adj = rsq_adj,
rsd = rsd)
})
You can then get to the rsq for the first model via outputList[[1]]$rsq, for example (or by name, e.g. outputList$mod1$rsq, if the models list is named).
Or you can create separate dataframes for each:
library(tidyverse)
# coefficients
coefs <- lapply(models, coef) %>%
do.call(rbind, .) %>%
as.data.frame() %>% # convert from matrix to dataframe
rownames_to_column("model") # add original model name as a column in the dataframe
# standard errors
stdErr <- lapply(models, function(mod){
summary(mod)$coef[,"Std. Error"]
}) %>%
do.call(rbind, .) %>%
as.data.frame() %>%
rownames_to_column("model")
# r squareds
rsq <- sapply(models, function(mod){
summary(mod)$r.squared
}) %>%
as.data.frame() %>%
rownames_to_column("model")
# adjusted r squareds
rsq_adj <- sapply(models, function(mod){
summary(mod)$adj.r.squared
})%>%
as.data.frame() %>%
rownames_to_column("model")
# residuals
rsd <- lapply(models, resid) %>%
do.call(rbind, .) %>%
as.data.frame() %>%
rownames_to_column("model")
Worth noting that, if you're in RStudio and you assign the summary to something (i.e. temp <- summary(mod)), you can type the name of the object, then "$", and a dropdown of all the other objects that can be extracted from the summary appears.
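If a single table with one row per model is more convenient than separate data frames, something like the following base R sketch may work. The data are simulated stand-ins for the question's X1 and Y_df (assumed here to be a numeric vector and a 25-column data frame), and the column names slope, slope_se, sigma and r_squared are made up for illustration.

```r
# Simulated stand-ins for the question's objects (assumption)
set.seed(1)
X1 <- rnorm(50)
Y_df <- as.data.frame(sapply(1:25, function(j) 0.1 + X1 + rnorm(50)))

models <- lapply(1:25, function(x) lm(Y_df[, x] ~ X1))

# One row per model: slope, its standard error, residual SE and R squared
stats <- do.call(rbind, lapply(models, function(mod) {
  s <- summary(mod)  # compute the summary once per model
  data.frame(
    slope     = s$coefficients["X1", "Estimate"],
    slope_se  = s$coefficients["X1", "Std. Error"],
    sigma     = s$sigma,
    r_squared = s$r.squared
  )
}))
stats$model <- seq_along(models)  # model index as an ID column
head(stats)
```

Calling summary() once per model and reusing the result avoids recomputing it for every statistic.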
I am new to R and am trying to loop a mixed model across 90 columns in a dataset.
My dataset looks like the following one but has 90 predictors instead of 7 that I need to evaluate as fixed effects in consecutive models.
I then need to store the model output (coefficients and p-values) to construct a figure summarizing the effect sizes of each predictor. I am aware of the debate around p-value estimates from lme4 mixed models.
For example:
library(dplyr) # provides tibble(), %>% and arrange()
set.seed(101)
mydata <- tibble(id = rep(1:32, times=25),
time = sample(1:800),
experiment = rep(1:4, times=200),
Y = sample(1:800),
predictor_1 = runif(800),
predictor_2 = rnorm(800),
predictor_3 = sample(1:800),
predictor_4 = sample(1:800),
predictor_5 = seq(1:800),
predictor_6 = sample(1:800),
predictor_7 = runif(800)) %>% arrange (id, time)
The model to iterate across the N predictors is:
library(lme4)
library(lmerTest) # To obtain p-values
mixed.model <- lmer(Y ~ predictor_1 + time + (1|id) + (1|experiment), data = mydata)
summary(mixed.model)
My coding skills are far from being able to set a loop to repeat the model across the N predictors in my dataset and store the coefficients and P values in a dataframe.
I have been able to iterate across all the predictors fitting linear models instead of mixed models using lapply. But I have failed to apply this strategy with mixed models.
varlist <- names(mydata)[5:11]
lm_models <- lapply(varlist, function(x) {
lm(substitute(Y ~ i, list(i = as.name(x))), data = mydata)
})
One option is to update the formula of a restricted model (without the predictor) in an lapply loop over the predictors. Then summarize the resulting list and subset the coefficient matrix using a Vectorized function.
library(lmerTest)
mixed.model <- lmer(Y ~ time + (1|id) + (1|experiment), data = mydata)
preds <- grep('pred', names(mydata), value=TRUE)
fits <- lapply(preds, \(x) update(mixed.model, paste('. ~ . + ', x)))
extract_coef_p <- Vectorize(\(x) x |> summary() |> coef() |> {\(.) .[3, c(1, 5)]}())
res <- `rownames<-`(t(extract_coef_p(fits)), preds)
res
# Estimate Pr(>|t|)
# predictor_1 -7.177579138 0.8002737
# predictor_2 -5.010342111 0.5377551
# predictor_3 -0.013030513 0.7126500
# predictor_4 -0.041702039 0.2383835
# predictor_5 -0.001437124 0.9676346
# predictor_6 0.005259293 0.8818644
# predictor_7 31.304496255 0.2511275
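The answer above picks the predictor's row by position (row 3), which depends on the order of terms in the formula. A possibly more robust variant looks the row up by predictor name. The sketch below shows the same update-then-extract pattern with plain lm() on mtcars so that it runs with base R only; the dataset and predictor names are stand-ins, and the assumption is that the coefficient matrix of an lmerTest fit can be indexed by term name in the same way.

```r
# Restricted model without the predictor of interest (mtcars is a stand-in dataset)
base_fit <- lm(mpg ~ wt, data = mtcars)
preds <- c("hp", "disp", "qsec")  # hypothetical predictor names

# Refit once per predictor by adding it to the restricted formula
fits <- lapply(preds, function(p) {
  update(base_fit, as.formula(paste(". ~ . +", p)))
})
names(fits) <- preds

# Look each row up by predictor name rather than by position
res <- t(sapply(preds, function(p) {
  coef(summary(fits[[p]]))[p, c("Estimate", "Pr(>|t|)")]
}))
res
```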
I have a dataset with a rather large amount of variables. In the dataset I have a predictor and an outcome variable I want to investigate. I want to find covariates with either a significant effect on the outcome variable, or a significant interaction effect between the predictor and the covariate on the outcome variable.
It would therefore be convenient to be able to regress all the covariates in turn with the desired predictor on the dependent variable and create a table over the effects and interaction effects of the covariates with their respective p-values.
I want to do something like this:
library(dplyr)
# Generating sample data
set.seed(5)
df <- data.frame(matrix(round(abs(2*rnorm(100*100)), digits = 0), ncol=100))
# Selecting covariates
covar <- names(df)[! names(df) %in% c("X1", "X2")]
# Running the lm function over the list of covariates. I should get the covariate coefficients from each regression, but I get an error when I try to run this step.
coeff <- lapply(covar, function(x){
# Retrieve coefficient matrix
summary(lm(X1 ~ X2 + x + X2*x, df))$coefficients %>%
# Coerce into dataframe and filter for covariates and interaction effects
as.data.frame(.) %>%
filter(row.names(.) %in% grep(x, rownames(.), value =
TRUE))}) %>%
# Finally I want to join all data frames into one
bind_rows(.)
I could use some help with the syntax. I get the following error when I try to run the function:
Error in h(simpleError(msg, call)): error in evaluating the argument 'object' in selecting a method for function 'summary': variable lengths differ (found for 'x')
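Inside the function, x is a character string (the column name), so lm() treats it as a length-one variable rather than as a column of df; that is why the variable lengths differ. Build the model formula as a string with paste0 instead of writing x directly in the formula.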
lapply(covar, function(x){
modd <- paste0("X1 ~ X2 +", x, "+ X2 *", x)
summary(lm(modd, df))$coefficients %>%
as.data.frame(.) %>%
filter(row.names(.) %in% grep(x, rownames(.), value =
TRUE))}) %>%
bind_rows(.)
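As an alternative to pasting the formula together by hand, base R's reformulate() builds a formula object from character term labels. The sketch below reuses the question's simulated df and, to keep the output short, fits only the first three covariates; selecting rows by exact term name (rather than with grep) is a design choice of this sketch, not part of the original answer.

```r
# Sample data as generated in the question
set.seed(5)
df <- data.frame(matrix(round(abs(2 * rnorm(100 * 100)), digits = 0), ncol = 100))
covar <- setdiff(names(df), c("X1", "X2"))

coef_rows <- lapply(covar[1:3], function(x) {
  # Builds X1 ~ X2 + x + X2:x from character term labels
  f <- reformulate(c("X2", x, paste0("X2:", x)), response = "X1")
  cm <- coef(summary(lm(f, data = df)))
  keep <- c(x, paste0("X2:", x))  # covariate main effect and its interaction
  data.frame(term = keep, cm[keep, , drop = FALSE], check.names = FALSE)
})
result <- do.call(rbind, coef_rows)
rownames(result) <- NULL
result
```

Matching on exact names avoids the case where a pattern such as "X3" would also match "X30" if both appeared in the same model.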
I want to fit regression models using a single predictor variable at a time. In total I have 7 predictors and 1 response variable. I want to write a chunk of code that picks a predictor variable from the data frame and fits a model. I would also like to extract the regression coefficient (not the intercept) and its sign, and store them in 2 vectors. Here's my code:
for (x in (1:7))
{
fit <- lm(distance ~ FAA_unique_with_duration_filtered[x] , data=FAA_unique_with_duration_filtered)
coeff_values<-summary(fit)$coefficients[,1]
coeff_value<-coeff_values[2]
append(coeff_value_vector,coeff_value , after = length(coeff_value_vector))
append(RCs_sign_vector ,sign(coeff_values[2]) , after = length(RCs_sign_vector))
}
Here x should select the first column, then the 2nd, and so on. However, I am getting the following error.
Error in model.frame.default(formula = distance ~ FAA_unique_with_duration_filtered[x], :
invalid type (list) for variable 'FAA_unique_with_duration_filtered[x]'
Is there a way to do this using loops?
You don't really need loops for this.
Suppose we want to regress y1, the 5th column of the built-in anscombe dataset, separately on each of the first 4 columns.
Then:
a <- anscombe
reg <- function(i) coef(lm(y1 ~., a[c(5, i)]))[[2]] # use lm
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
# or
a <- anscombe
reg <- function(i) cov(a$y1, a[[i]]) / var(a[[i]]) # use formula for slope
coefs <- sapply(1:4, reg)
signs <- sign(coefs)
Alternatively, use the following loop, where reg is either of the reg definitions above.
a <- anscombe
coefs <- numeric(4)
for(i in 1:4) coefs[i] <- reg(i)
signs <- sign(coefs)
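For completeness, here is one way the loop from the question might be repaired. Two things appear to go wrong there: indexing a data frame with single brackets returns a one-column data frame (hence the "invalid type (list)" error), and the value returned by append() is never assigned back. The sketch below builds each formula from the column name and fills a preallocated vector; mtcars and its columns are stand-ins for the asker's FAA data.

```r
dat <- mtcars                                     # stand-in for the asker's data frame
predictors <- setdiff(names(dat), "mpg")[1:7]     # 7 hypothetical predictor columns

# Preallocate a named vector instead of growing it with append()
coef_values <- numeric(length(predictors))
names(coef_values) <- predictors

for (p in predictors) {
  fit <- lm(reformulate(p, response = "mpg"), data = dat)  # e.g. mpg ~ cyl
  coef_values[p] <- coef(fit)[[p]]                         # slope only, not the intercept
}
coef_signs <- sign(coef_values)
coef_values
coef_signs
```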
I have several variables in my dataframe (e.g.: a, b, c, d) and I'm obtaining by season linear model parameters (Intercept, Slope and rSquared) through this code (Example for variable a):
lm_results_season_a<- ddply(dataframe1, "Season", function(x) {
model <- summary(lm(y ~ a, data = x))
Intercept<- model$coefficients[1,1]
Slope<- model$coefficients[2,1]
rSquared <- model$r.squared
data.frame(Intercept, Slope, rSquared)
})
My problem is that I have too many variables, and repeating this code for each variable takes a lot of space.
For example, I would have to write the same code for variable b
lm_results_season_b<- ddply(dataframe1, "Season", function(x) {
model <- summary(lm(y ~ b, data = x))
Intercept<- model$coefficients[1,1]
Slope<- model$coefficients[2,1]
rSquared <- model$r.squared
data.frame(Intercept, Slope, rSquared)
})
and keep repeating the same code for the rest of the variables. So I tried to create a function in which I don't have to repeat all this code again, but just to call a function that can make all the calculations and give me the dataframe I am looking for.
I tried this code in which I define the variables before, and then just add them to the function:
variable1 <- dataframe1$y
variable2 <- dataframe1$a
LM_coef <- function(data, variable1, variable2){
lm_results_season<- ddply(data, "Season", function(x) {
model <- summary(lm(variable1 ~ variable2, data = x))
Intercept<- model$coefficients[1,1]
Slope<- model$coefficients[2,1]
rSquared <- model$r.squared
data.frame(Intercept,Slope, rSquared)
})
return(lm_results_season)
}
But this is not working as I wanted. Instead of giving me the linear regression parameters by Season for the variable "a", it is just giving me the linear regression parameters just for the variable "a" as a whole, and not by season.
Any idea on what's happening in the function or how to modify this function?
Are you bound to the plyr package? If not, you can use the more up-to-date purrr package, also from the tidyverse.
Here we can create a function where we insert the dataframe data, the two variables for the linear model variable1 and variable2, and the splitting column split_var (in your case "Season").
LM_coef <- function(data, variable1, variable2, split_var){
require(purrr)
data %>%
split(.[[split_var]]) %>%
map(~summary(lm(eval(as.name(variable1)) ~ eval(as.name(variable2)), data = .x))) %>%
map_dfr(~cbind(as.data.frame(t(as.matrix(coef(.)[1:2,1]))), .$r.squared), .id = split_var) %>%
setNames(c(split_var, "Intercept", "Slope", "rSquared"))
}
Example
Using the mtcars dataset, we can do
LM_coef(mtcars, "hp", "mpg", "cyl")
in order to obtain
# cyl Intercept Slope rSquared
# 1 4 147.4315 -2.430092 0.27405583
# 2 6 164.1564 -2.120802 0.01614624
# 3 8 294.4974 -5.647887 0.08044919
which is equal to what you would obtain from your initial function lm_results_season_a.
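As for why the function in the question ignored Season: variable1 and variable2 were full-length vectors taken from the whole data frame, and lm() only looks in data = x for names that appear in the formula. Since x has no columns called variable1 or variable2, the full vectors are used for every season. Passing column names as strings avoids this. Below is a base R sketch of the same idea using reformulate(); the function name lm_by_group and the column name group are made up for illustration, and mtcars stands in for dataframe1.

```r
# Hypothetical helper: fit response ~ predictor within each level of split_var
lm_by_group <- function(data, response, predictor, split_var) {
  pieces <- split(data, data[[split_var]])
  rows <- lapply(names(pieces), function(g) {
    s <- summary(lm(reformulate(predictor, response = response), data = pieces[[g]]))
    data.frame(group = g, predictor = predictor,
               Intercept = s$coefficients[1, 1],
               Slope     = s$coefficients[2, 1],
               rSquared  = s$r.squared)
  })
  do.call(rbind, rows)
}

# Loop over several predictors instead of copying the code for each one
all_results <- do.call(rbind, lapply(c("mpg", "wt"), function(v) {
  lm_by_group(mtcars, "hp", v, "cyl")
}))
all_results
```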
I have a list of dataframes that contains many subsets of data (470ish). I am trying to run a regression on each of them and add the regression coefficients to a dataframe. The dataframe will contain the coefficients for all dependent variables on each subgroup. I tried iterating with a for loop but obviously that is not the right way. I think the solution has something to do with lapply?
for (i in ListOfTraining){
lm(JOB_VOLUME ~ FEB+MAR+APR+MAY+JUN+JUL+AUG+SEP+OCT+NOV+DEC, data=ListOfTraining[[i]])
}
Thanks for any advice!
The function tidy from package broom handles this nicely.
library(dplyr) # bind_rows is more efficient than do.call(rbind, ...)
library(broom) # put statistics into data.frame
bind_rows(lapply(ListOfTraining, function(dat)
tidy(lm(JOB_VOLUME ~ FEB+MAR+APR+MAY+JUN+JUL+AUG+SEP+OCT+NOV+DEC, data=dat))))
Example
dataList <- split(mtcars, mtcars$cyl) # list of data.frames by number of cylinders
lapply(dataList, function(dat) tidy(lm(mpg ~ disp + hp, data=dat))) %>% # fit models
bind_rows() %>% # combine into one data.frame
mutate(model=rep(1:length(dataList), each=3)) # add a model ID column
# term estimate std.error statistic p.value model
# 1 (Intercept) 43.040057552 4.235724713 10.16120274 7.531962e-06 1
# 2 disp -0.119536016 0.036945788 -3.23544366 1.195900e-02 1
# 3 hp -0.046091563 0.047423668 -0.97191054 3.595602e-01 1
# 4 (Intercept) 20.151209478 6.938235241 2.90437104 4.392508e-02 2
# 5 disp 0.001796527 0.020195109 0.08895852 9.333909e-01 2
# 6 hp -0.006032441 0.034597750 -0.17435935 8.700522e-01 2
# 7 (Intercept) 24.044775630 4.045729006 5.94324919 9.686231e-05 3
# 8 disp -0.018627566 0.009456903 -1.96973225 7.456584e-02 3
# 9 hp -0.011315585 0.012572498 -0.90002676 3.873854e-01 3
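Two small variations on the example above may be useful. bind_rows() accepts an .id argument that turns the list names into an ID column, which avoids building the model column by hand, and broom's glance() returns one row of model-level statistics (such as r.squared and sigma) per fit. This is a sketch on the same mtcars split, assuming dplyr and broom are installed; the object names fits, coef_table and fit_table are made up.

```r
library(dplyr)
library(broom)

dataList <- split(mtcars, mtcars$cyl)  # named list: "4", "6", "8"
fits <- lapply(dataList, function(dat) lm(mpg ~ disp + hp, data = dat))

# Coefficient-level table; list names become the "cyl" column
coef_table <- bind_rows(lapply(fits, tidy), .id = "cyl")

# Model-level table: one row per fit with r.squared, sigma, etc.
fit_table <- bind_rows(lapply(fits, glance), .id = "cyl")

coef_table
fit_table
```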
Alternatively, you could bind the data.frames beforehand, assuming they have the same columns. Then fit the models using lmList from the nlme package.
## Combine list of data.frames into one data.frame with a factor variable
lengths <- sapply(dataList, nrow) # in case data.frames have different num. rows
dat <- dataList %>% bind_rows() %>%
mutate(group=rep(1:length(dataList), times=lengths)) # group id column
library(nlme) # lmList()
models <- lmList(mpg ~ disp + hp | group, data=dat) # make models, grouped by group
coef(models) # one row of coefficients per group
# (Intercept) disp hp
# 1 43.04006 -0.119536016 -0.046091563
# 2 20.15121 0.001796527 -0.006032441
# 3 24.04478 -0.018627566 -0.011315585
You can solve this using the for loop, if you prefer. Your problem is that the results aren't being saved to an object as the loop progresses. You can see the below for an example using the built-in mtcars dataframe.
(This first example is revised based on OP's request for an example of how to also extract the R squared value.)
ListOfTraining <- list(mtcars, mtcars)
results <- list()
for (i in seq_along(ListOfTraining)) {
lm_obj <- lm(disp ~ qsec, data = ListOfTraining[[i]])
tmp <- c(lm_obj$coefficients, summary(lm_obj)$r.squared)
names(tmp)[length(tmp)] <- "r.squared"
results[[i]] <- tmp
}
results <- do.call(rbind, results)
results
You can also rewrite the for loop using lapply as demoed below.
ListOfTraining <- list(mtcars, mtcars)
results <- list()
results <- lapply(ListOfTraining, function(x) {
lm(disp ~ qsec, data = x)$coefficients
})
results <- do.call(rbind, results)
results
Finally, you can use the plyr package's ldply function, which will convert the list-applied output into a dataframe automatically (if possible).
ListOfTraining <- list(mtcars, mtcars)
results <- plyr::ldply(ListOfTraining, function(x) {
lm(disp ~ qsec, data = x)$coefficients
})
results
Your current code runs the regression, but does not do anything with the results (inside of a loop they are not even autoprinted), so they are just discarded. You need to have some structure to save the results into.
The following code will create a matrix of coefficients (assuming that all the regressions run without error and the number of final coefficients is the same):
my.coef <- sapply( ListOfTraining, function(dat) {
coef(lm( JOB_VOLUME ~ FEB+MAR+APR+MAY+JUN+JUL+AUG+SEP+OCT+NOV+DEC,
data=dat) )
})
The matrix can then be converted to a data frame (you could also use lapply and convert to a data frame, but I think the sapply option is probably a little simpler).
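Note that sapply returns the coefficients in rows and the models in columns, so a transpose is usually wanted before converting. A minimal sketch, using a split of mtcars as a stand-in for ListOfTraining and an illustrative model formula:

```r
# Stand-in for the asker's list of data frames
ListOfTraining <- split(mtcars, mtcars$cyl)

my.coef <- sapply(ListOfTraining, function(dat) {
  coef(lm(mpg ~ disp + hp, data = dat))
})

# Transpose so that each row is one model, then convert to a data frame
coef_df <- as.data.frame(t(my.coef))
coef_df$subset <- rownames(coef_df)  # keep the list names as an ID column
coef_df
```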