R adding regression coefficients to data frame - r

I have a list of dataframes that contains many subsets of data (470ish). I am trying to run a regression on each of them and add the regression coefficients to a dataframe. The dataframe will contain the coefficients for all dependent variables on each subgroup. I tried iterating with a for loop but obviously that is not the right way. I think the solution has something to do with lapply?
for (i in ListOfTraining){
lm(JOB_VOLUME ~ FEB+MAR+APR+MAY+JUN+JUL+AUG+SEP+OCT+NOV+DEC, data=ListOfTraining[[i]])
}
Thanks for any advice!

The function tidy from package broom handles this nicely.
library(dplyr) # bind_rows is more efficient than do.call(rbind, ...)
library(broom) # put statistics into data.frame
bind_rows(lapply(ListOfTraining, function(dat)
tidy(lm(JOB_VOLUME ~ FEB+MAR+APR+MAY+JUN+JUL+AUG+SEP+OCT+NOV+DEC, data=dat))))
Example
dataList <- split(mtcars, mtcars$cyl) # list of data.frames by number of cylinders
lapply(dataList, function(dat) tidy(lm(mpg ~ disp + hp, data=dat))) %>% # fit models
bind_rows() %>% # combine into one data.frame
mutate(model=rep(1:length(dataList), each=3)) # add a model ID column
# term estimate std.error statistic p.value model
# 1 (Intercept) 43.040057552 4.235724713 10.16120274 7.531962e-06 1
# 2 disp -0.119536016 0.036945788 -3.23544366 1.195900e-02 1
# 3 hp -0.046091563 0.047423668 -0.97191054 3.595602e-01 1
# 4 (Intercept) 20.151209478 6.938235241 2.90437104 4.392508e-02 2
# 5 disp 0.001796527 0.020195109 0.08895852 9.333909e-01 2
# 6 hp -0.006032441 0.034597750 -0.17435935 8.700522e-01 2
# 7 (Intercept) 24.044775630 4.045729006 5.94324919 9.686231e-05 3
# 8 disp -0.018627566 0.009456903 -1.96973225 7.456584e-02 3
# 9 hp -0.011315585 0.012572498 -0.90002676 3.873854e-01 3
Alternatively, you could bind the data.frames beforehand, assuming they have the same columns. Then fit the models using lmList from the nlme package.
## Combine list of data.frames into one data.frame with a factor variable
lengths <- sapply(dataList, nrow) # in case data.frames have different num. rows
dat <- dataList %>% bind_rows() %>%
mutate(group=rep(1:length(dataList), times=lengths)) # group id column
library(nlme) # lmList()
models <- lmList(mpg ~ disp + hp | group, data=dat) # make models, grouped by group
coef(models) # one row of coefficients per group
# (Intercept) disp hp
# 1 43.04006 -0.119536016 -0.046091563
# 2 20.15121 0.001796527 -0.006032441
# 3 24.04478 -0.018627566 -0.011315585

You can solve this using the for loop, if you prefer. Your problem is that the results aren't being saved to an object as the loop progresses. You can see the below for an example using the built-in mtcars dataframe.
(This first example is revised based on OP's request for an example of how to also extract the R squared value.)
ListOfTraining <- list(mtcars, mtcars)
results <- list()
for (i in seq_along(ListOfTraining)) {
lm_obj <- lm(disp ~ qsec, data = ListOfTraining[[i]])
tmp <- c(lm_obj$coefficients, summary(lm_obj)$r.squared)
names(tmp)[length(tmp)] <- "r.squared"
results[[i]] <- tmp
}
results <- do.call(rbind, results)
results
You can also rewrite the for loop using lapply as demoed below.
ListOfTraining <- list(mtcars, mtcars)
results <- list()
results <- lapply(ListOfTraining, function(x) {
lm(disp ~ qsec, data = x)$coefficients
})
results <- do.call(rbind, results)
results
Finally, you can use the plyr package's ldply function which will convert the list applied outputs into a dataframe automatically (if possible).
ListOfTraining <- list(mtcars, mtcars)
results <- plyr::ldply(ListOfTraining, function(x) {
lm(disp ~ qsec, data = x)$coefficients
})
results

Your current code runs the regression, but does not do anything with the results (inside of a loop they are not even autoprinted), so they are just discarded. You need to have some structure to save the results into.
The following code will create a matrix of coefficients (assuming that all the regressions run without error and the number of final coefficients is the same):
my.coef <- sapply( ListOfTraining, function(dat) {
coef(lm( JOB_VOLUME ~ FEB+MAR+APR+MAY+JUN+JUL+AUG+SEP+OCT+NOV+DEC,
data=dat) )
})
The matrix can then be converted to a data frame (you could also use lapply and convert to a data frame, but I think the sapply option is probably a little simpler).
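As a minimal sketch of that conversion, assuming every model yields the same set of coefficients: sapply() returns one column per model, so transposing first gives one row per model. The mtcars split below is only a stand-in for ListOfTraining, and the names coef.df and subset are made up for illustration.

```r
# Stand-in for ListOfTraining: one data frame per cylinder count
ListOfTraining <- split(mtcars, mtcars$cyl)

# One column of coefficients per model
my.coef <- sapply(ListOfTraining, function(dat) {
  coef(lm(mpg ~ disp + hp, data = dat))
})

# Transpose so each row is a model, then convert to a data frame
coef.df <- as.data.frame(t(my.coef))
coef.df$subset <- rownames(coef.df)  # keep the list names as an ID column
rownames(coef.df) <- NULL
coef.df
```

If the list has no names, the row names fall back to the column positions, so it may be worth naming the list before fitting.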

Related

R function to extract glm ORs, CIs, p for looped models [duplicate]

I am looking to run the same logistic regression model for a list of variables. Only one of the predictors matter to me and the covariates will all be the same from model to model. I have been able to run the models but have trouble exporting the results in a useful format.
data(mtcars)
d<-mtcars
d$am<-as.factor(d$am) #three outcome variables
d$gear<-as.factor(d$gear)
d$cyl<-as.factor(d$cyl)
d$vs<-as.factor(d$vs) #predictor of interest
vars<-c("am","gear","cyl")
glm.func<-function(x) {
mod<-glm(get(x)~vs+wt+hp, data=d, family="binomial") #the variable of interest is vs
mod.sum<-(summary(mod))
mod.exp<-(exp(cbind(OR=coef(mod), confint(mod))))
}
output<-lapply(vars, glm.func)
output
I am able to generate the models but 1) need to add the p value and a column for FDR and 2) reduce it to just the predictor of interest, in this case "vs". I'd like my data to be numeric and to look like this:
var OR 2.50% 97.50% P
am 4.3759E-02 7.35E-06 4.70E+00 0.2855
gear 5.5834E+00 3.46E-01 1.33E+02 0.2353
cyl 4.2062E-25 0.00E+00 Inf 0.998
See my answer to a similar question on how to do this with broom::tidy() and purrr::map_dfr(). Here's a base R solution, building on your glm.func() function:
glm.func <- function(x) {
mod <- glm(get(x) ~ vs + wt + hp, data = d, family = "binomial")
mod.sum <- summary(mod)
mod.exp <- cbind(
OR = exp(coef(mod)),
exp(confint(mod)),
p = mod.sum$coefficients[, "Pr(>|z|)"]
)
cbind(var = x, as.data.frame(mod.exp)["vs1", ])
}
vars <- c("am", "gear", "cyl")
output <- lapply(vars, glm.func)
output <- do.call(rbind, output)
rownames(output) <- NULL
output
var OR 2.5 % 97.5 % p
1 am 4.375911e-02 7.354903e-06 4.704322 0.2854794
2 gear 5.583369e+00 3.458290e-01 132.815557 0.2353346
3 cyl 4.206157e-25 0.000000e+00 Inf 0.9993028
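The question also asked for an FDR column, which the function above does not add. One possible sketch uses p.adjust() with method = "fdr" (Benjamini-Hochberg) on the collected p values. To keep it self-contained, the data frame below re-enters the p values printed above instead of refitting the models; in practice you would apply the last step to the real output.

```r
# p values copied from the printed output above (stand-in for `output`)
output <- data.frame(
  var = c("am", "gear", "cyl"),
  p   = c(0.2854794, 0.2353346, 0.9993028)
)

# Benjamini-Hochberg adjustment across the tested outcomes
output$FDR <- p.adjust(output$p, method = "fdr")
output
```

The adjustment has to be done after all models are combined, because it depends on how many p values are being compared.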

A loop to create multiple data frames from a population data frame

Suppose I have a data frame called pop, and I wish to split it by a categorical variable called replicate, which has 110 categories. I wish to run an analysis on each resulting data frame and then combine the outputs into a new data frame. In other words, for replicate i, I want to create data frame i, fit a logistic regression on it, and save beta 0 for i. All the beta 0 values will then be combined into a table covering replicates 1-110.
I know that's a mouthful, but thanks in advance.
Since you didn't give any sample data, I will use mtcars. You can use split to split a data.frame on a categorical value. Combining this with map and tidy from the purrr and broom packages, you can create a dataframe with all the betas in one go.
So what happens is 1: split data.frame, 2: run regression model 3: tidy data to get the coefficients out and create a data.frame of the data.
You will need to adjust this to your data.frame and replicate variable. Broom can handle logistic regression so everything should work out.
library(purrr)
library(broom)
my_lms <- mtcars %>%
split(.$cyl) %>%
map(~ lm(mpg ~ wt, data = .x)) %>%
map_dfr(~ tidy(.))
my_lms
term estimate std.error statistic p.value
1 (Intercept) 39.571196 4.3465820 9.103980 7.771511e-06
2 wt -5.647025 1.8501185 -3.052251 1.374278e-02
3 (Intercept) 28.408845 4.1843688 6.789278 1.054844e-03
4 wt -2.780106 1.3349173 -2.082605 9.175766e-02
5 (Intercept) 23.868029 3.0054619 7.941551 4.052705e-06
6 wt -2.192438 0.7392393 -2.965803 1.179281e-02
EDIT
my_lms <- lapply(split(mtcars, mtcars$cyl), function(x) lm(mpg ~ wt, data = x))
my_coefs <- as.data.frame(sapply(my_lms, coef))
my_coefs
4 6 8
(Intercept) 39.571196 28.408845 23.868029
wt -5.647025 -2.780106 -2.192438
# Or transpose the coefficients if you want column results.
t(my_coefs)
(Intercept) wt
4 39.57120 -5.647025
6 28.40884 -2.780106
8 23.86803 -2.192438
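For the logistic case described in the question, a base R sketch might look like the following. The pop data frame here is simulated purely for illustration (10 replicates instead of 110, with made-up columns x and y), so the variable names and the model formula are assumptions to adapt to the real data.

```r
set.seed(42)

# Hypothetical stand-in for `pop`: 10 replicates of 200 rows each
pop <- data.frame(
  replicate = factor(rep(1:10, each = 200)),
  x = rnorm(2000)
)
pop$y <- rbinom(2000, size = 1, prob = plogis(-0.5 + 1.2 * pop$x))

# Fit one logistic regression per replicate and keep only the intercept
beta0 <- sapply(split(pop, pop$replicate), function(d) {
  coef(glm(y ~ x, data = d, family = binomial))[["(Intercept)"]]
})

# One row per replicate
beta0_table <- data.frame(replicate = names(beta0), beta0 = unname(beta0))
beta0_table
```

Because split() names the list by the levels of replicate, the resulting table keeps track of which intercept belongs to which replicate.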

Subset variables by significant P value

I'm trying to subset variables by significant P-values, and I attempted with the following code, but it only selects all variables instead of selecting by condition. Could anyone help me to correct the problem?
myvars <- names(summary(backward_lm)$coefficients[,4] < 0.05)
happiness_reduced <- happiness_nomis[myvars]
Thanks!
An alternative solution to Martin's great answer (in the comments section) using the broom package. Unfortunately, you haven't posted any data, so I'm using the mtcars dataset as a demo:
library(broom)
# build model
m = lm(disp ~ ., data = mtcars)
# create a dataframe from the model's output
tm = tidy(m)
# visualise dataframe of the model
# (using non scientific notation of numbers)
options(scipen = 999)
tm
# term estimate std.error statistic p.value
# 1 (Intercept) -5.8119829 228.0609389 -0.02548434 0.97990925639
# 2 mpg 1.9398052 2.5976340 0.74675849 0.46348865035
# 3 cyl 15.3889587 12.1518291 1.26639032 0.21924091701
# 4 hp 0.6649525 0.2259928 2.94236093 0.00777972543
# 5 drat 8.8116809 19.7390767 0.44640796 0.65987184728
# 6 wt 86.7111730 16.1127236 5.38153418 0.00002448671
# 7 qsec -12.9742622 8.6227190 -1.50466021 0.14730421493
# 8 vs -12.1152075 25.2579953 -0.47965832 0.63642812949
# 9 am -7.9135864 25.6183932 -0.30890253 0.76043942893
# 10 gear 5.1265224 18.0578153 0.28389494 0.77927112074
# 11 carb -30.1067073 7.5513212 -3.98694566 0.00067029676
# get variables with p value less than 0.05
tm$term[tm$p.value < 0.05]
# [1] "hp" "wt" "carb"
The main advantage is that by obtaining the model's output as a dataframe you can use variable names, and not variable positions and row names, to manipulate the data.
I'm using options(scipen = 999) to make it easier to check that filtering works (i.e. not using the scientific notation of numbers in the dataframe).
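For reference, the original attempt selects everything because names() applied to a logical vector returns every name, whether the value is TRUE or FALSE. A base R sketch of the fix, again on mtcars (the names p and mtcars_reduced are just for illustration), could be:

```r
m <- lm(disp ~ ., data = mtcars)
p <- summary(m)$coefficients[, 4]          # named vector of p-values

# names(p < 0.05) would return every name; which() keeps only the TRUE entries
myvars <- names(which(p < 0.05))
myvars <- setdiff(myvars, "(Intercept)")   # the intercept is not a data column

mtcars_reduced <- mtcars[myvars]
names(mtcars_reduced)
```

Dropping "(Intercept)" matters when the intercept happens to be significant, since indexing the data frame by a name that is not a column raises an error.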

Extracting Coefficients, Std Errors, R2 etc from multiple regressions

I have the following regression model;
models <- lapply(1:25, function(x) lm(Y_df[,x] ~ X1))
Which runs 25 regressions on 25 columns in the Y_df dataframe.
One of the outputs can be shown as;
models[15] # Gives me the coefficients for model 15
Call:
lm(formula = Y_df[, x] ~ X1)
Coefficients:
(Intercept) X1
0.1296812 1.0585835
Which I can store in a separate df. The problem I am running into is regarding Std. Error, R2, residuals etc.
I would like to store these also into a separate dataframe.
I can run individual regressions and extract the summaries as a normal R regression output would look like.
ls_1 <- summary(models[[1]])
ls_1
ls_1$sigma
However I am hoping to take the values directly from the line of code which runs the 25 regressions.
This code works
> (models[[15]]$coefficients)
(Intercept) X1
-0.3643446787 1.0789369642
However; this code does not.
> (models[[15]]$sigma)
NULL
I have tried a variety of different combinations to try and extract these results with no luck.
The following did exactly what I wanted perfectly. I had hoped there was a way to replace the word coef with Std Error or R2 etc. but this does not work.
models <- lapply(1:25, function(x) lm(Y_df[,x] ~ X1))
# extract just coefficients
coefficients <- sapply(models, coef)
Ideally I would like to store the Std Error from the above model
If a model is named mod, you can get to all of the residuals in the same way as the coefficients:
mod$residuals
There are also functions that extract the coefficients and residuals:
coef(mod)
resid(mod)
The other outputs, you can extract via summary:
summary(mod)$coef[,"Std. Error"] # standard errors
summary(mod)$r.squared # r squared
summary(mod)$adj.r.squared # adjusted r squared
So you can either create a list containing each of these results for each model:
outputList <- lapply(models, function(x){
  # x is the current model; summarise it once and reuse the summary
  s <- summary(x)
  coefs <- coef(x)
  stdErr <- s$coefficients[, "Std. Error"]
  rsq <- s$r.squared
  rsq_adj <- s$adj.r.squared
  rsd <- resid(x)
  list(coefs = coefs,
       stdErr = stdErr,
       rsq = rsq,
       rsq_adj = rsq_adj,
       rsd = rsd)
})
You can then get the rsq for the first model via outputList[[1]]$rsq, for example (or by name, e.g. outputList$mod1$rsq, if the list of models is named).
Or you can create separate dataframes for each:
library(tidyverse)
# coefficients
coefs <- lapply(models, coef) %>%
do.call(rbind, .) %>%
as.data.frame() %>% # convert from matrix to dataframe
rownames_to_column("model") # add original model name as a column in the dataframe
# standard errors
stdErr <- lapply(models, function(x){
  summary(x)$coef[,"Std. Error"]
}) %>%
do.call(rbind, .) %>%
as.data.frame() %>%
rownames_to_column("model")
# r squareds
rsq <- sapply(models, function(x){
  summary(x)$r.squared
}) %>%
as.data.frame() %>%
rownames_to_column("model")
# adjusted r squareds
rsq_adj <- sapply(models, function(x){
  summary(x)$adj.r.squared
}) %>%
as.data.frame() %>%
rownames_to_column("model")
# residuals
rsd <- lapply(models, resid) %>%
do.call(rbind, .) %>%
as.data.frame() %>%
rownames_to_column("model")
It is worth noting that, if you're in RStudio and you assign the summary to something (i.e. temp <- summary(mod)), you can type the name of the object, then "$", and a dropdown of all the other objects that can be extracted from the summary appears.
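This also explains why models[[15]]$sigma returns NULL in the question: the residual standard error is stored in the summary object, not in the lm object itself. If a single table is more convenient than several data frames, a base R sketch along these lines may help. Y_df and X1 are simulated stand-ins here (5 columns instead of 25), and model_stats and its column names are made up for illustration.

```r
set.seed(1)

# Hypothetical stand-ins for Y_df and X1
X1   <- rnorm(40)
Y_df <- as.data.frame(replicate(5, 0.1 + X1 + rnorm(40)))

models <- lapply(seq_along(Y_df), function(i) lm(Y_df[, i] ~ X1))
names(models) <- names(Y_df)

# One row per model: estimates, standard error, R squared, residual SE
model_stats <- do.call(rbind, lapply(names(models), function(nm) {
  s <- summary(models[[nm]])
  data.frame(
    model     = nm,
    intercept = s$coefficients["(Intercept)", "Estimate"],
    slope     = s$coefficients["X1", "Estimate"],
    se_slope  = s$coefficients["X1", "Std. Error"],
    r.squared = s$r.squared,
    sigma     = s$sigma
  )
}))
model_stats
```

Naming the list of models up front means each row of the result can be traced back to its column in Y_df.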

How to Loop/Repeat a Linear Regression in R

I have figured out how to make a table in R with 4 variables, which I am using for multiple linear regressions. The dependent variable (Lung) for each regression is taken from one column of a csv table of 22,000 columns. One of the independent variables (Blood) is taken from a corresponding column of a similar table.
Each column represents the levels of a particular gene, which is why there are so many of them. There are also two additional variables (Age and Gender of each patient). When I enter in the linear regression equation, I use lm(Lung[,1] ~ Blood[,1] + Age + Gender), which works for one gene.
I am looking for a way to input this equation and have R calculate all of the remaining columns for Lung and Blood, and hopefully output the coefficients into a table.
Any help would be appreciated!
You want to run 22,000 linear regressions and extract the coefficients? That's simple to do from a coding standpoint.
set.seed(1)
# number of columns in the Lung and Blood data.frames. 22,000 for you?
n <- 5
# dummy data
obs <- 50 # observations
Lung <- data.frame(matrix(rnorm(obs*n), ncol=n))
Blood <- data.frame(matrix(rnorm(obs*n), ncol=n))
Age <- sample(20:80, obs)
Gender <- factor(rbinom(obs, 1, .5))
# run n regressions
my_lms <- lapply(1:n, function(x) lm(Lung[,x] ~ Blood[,x] + Age + Gender))
# extract just coefficients
sapply(my_lms, coef)
# if you need more info, get full summary call. now you can get whatever, like:
summaries <- lapply(my_lms, summary)
# ...coefficients with p values:
lapply(summaries, function(x) x$coefficients[, c(1,4)])
# ...or r-squared values
sapply(summaries, function(x) c(r_sq = x$r.squared,
adj_r_sq = x$adj.r.squared))
The models are stored in a list, where model 3 (with DV Lung[, 3] and IVs Blood[,3] + Age + Gender) is in my_lms[[3]] and so on. You can use apply functions on the list to perform summaries, from which you can extract the numbers you want.
The question seems to be about how to call regression functions with formulas which are modified inside a loop.
Here is how you can do it (using the diamonds dataset from ggplot2):
attach(ggplot2::diamonds)
strCols = names(ggplot2::diamonds)
formula <- list(); model <- list()
for (i in 1:1) {
formula[[i]] = paste0(strCols[7], " ~ ", strCols[7+i])
model[[i]] = glm(formula[[i]])
#then you can plot or do anything else with the result ...
png(filename = sprintf("diamonds_price=glm(%s).png", strCols[7+i]))
par(mfrow = c(2, 2))
plot(model[[i]])
dev.off()
}
Sensible or not, to make the loop at least somehow work you need:
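y <- c(1,5,6,2,5,10) # response
x1 <- c(2,12,8,1,16,17) # predictor
x2 <- c(2,14,5,1,17,17)
df <- data.frame(y, x1, x2) # lm() needs a data frame to pass as data
predictorlist <- list("x1","x2")
for (i in predictorlist){
  # build the formula "y ~ x1", "y ~ x2", ... from the predictor name
  model <- lm(as.formula(paste("y ~", i)), data=df)
  print(summary(model))
}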
The paste function will solve the problem.
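As an alternative to pasting strings, base R's reformulate() builds the formula object directly from the predictor name. The sketch below reuses the toy vectors from this answer; the names dat, fits and coef_table are made up for illustration.

```r
y  <- c(1, 5, 6, 2, 5, 10)      # response
x1 <- c(2, 12, 8, 1, 16, 17)    # predictors
x2 <- c(2, 14, 5, 1, 17, 17)
dat <- data.frame(y, x1, x2)

predictors <- c("x1", "x2")

# Build y ~ x1, y ~ x2 with reformulate() and fit each model
fits <- lapply(predictors, function(p) {
  lm(reformulate(p, response = "y"), data = dat)
})
names(fits) <- predictors

# Collect the coefficients into one matrix, one row per predictor
coef_table <- t(sapply(fits, function(f) unname(coef(f))))
colnames(coef_table) <- c("intercept", "slope")
coef_table
```

Storing the fitted models in a list, instead of overwriting model on each pass, keeps every result available after the loop finishes.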
A tidyverse addition - with map()
Another way - using map2() from the purrr package:
library(purrr)
xs <- anscombe[,1:3] # Select variables of interest
ys <- anscombe[,5:7]
map2_df(ys, xs,
function(i,j){
m <- lm(i ~j + x4 , data = anscombe)
coef(m)
})
The output is a dataframe (tibble) of all coefficients:
`(Intercept)` j x4
1 4.33 0.451 -0.0987
2 6.42 0.373 -0.253
3 2.30 0.526 0.0518
If more than two variables change between models, the same approach works with the pmap() family of functions.

Resources