extracting linear model coefficients into a vector within a loop - r

I am trying to create sample of 200 linear model coefficients using a loop in R. As an end result, I want a vector containing the coefficients.
for (i in 1:200) {
smpl_5 <- population[sample(1:1000, 5), ]
model_5 <- summary(lm(y~x, data=smpl_5))
}
I can extract the coefficients easy enough, but I am having trouble outputting them into a vector within the loop. Any Suggestions?

You can use replicate for this if you like. In your case, because the number of coefficients is identical for all models, it'll return an array as shown in the example below:
d <- data.frame(x=runif(1000))
d$y <- d$x * 0.123 + rnorm(1000, 0, 0.01)
coefs <- replicate(3, {
xy <- d[sample(nrow(d), 100), ]
coef(summary(lm(y~x, data=xy)))
})
coefs
# , , 1
#
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.001361961 0.002091297 0.6512516 5.164083e-01
# x 0.121142447 0.003624717 33.4212114 2.235307e-55
#
# , , 2
#
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.003213314 0.001967050 1.63357 1.055579e-01
# x 0.118026828 0.003332906 35.41259 1.182027e-57
#
# , , 3
#
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.003366678 0.001990226 1.691606 9.389883e-02
# x 0.119408470 0.003370190 35.430783 1.128070e-57
Access particular elements with normal array indexing, e.g.:
coefs[, , 1] # return the coefs for the first model
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.001361961 0.002091297 0.6512516 5.164083e-01
# x 0.121142447 0.003624717 33.4212114 2.235307e-55
So, for your problem, you could use:
replicate(200, {
smpl_5 <- population[sample(1:1000, 5), ]
coef(summary(lm(y~x, data=smpl_5)))
})

Related

Updating model factor levels within function not working properly

I'm running some GLM models in R on a some data related to feeding trials I am doing. I'm regressing my variables of interest on a two predictors: one factor with three levels and one continuous variable. I want to compare the intercepts for each level of the factor to one another to determine if they're different. To do this, I wrote a function (called interceptCompare in the reproducible code below) which relevels the factor and updates the model and then saves the results of each model. It's my quick way of doing all the pair-wise comparisons of the intercepts.
The problem is that when I run the function, it doesn't appear to properly update the model. Each item of the list returned is the same, when they should be changing so that each item has a different level of the factor as the "(Intercept)" that the other levels are being compared against. I suspect it has something to do with the environment of the function, but I'm not sure. I haven't been able to find a similar example on stackoverflow or google.
Here's what should be a reproducible example:
food <- as.factor(rep(c("a", "b", "c"), each = 20))
variable <- rbinom(60, 1, 0.7)
movement <- rgamma(60, 10, 2)
binomial.model <- glm(variable ~ food,
family = "binomial")
gamma.model <- glm(movement ~ food,
family = Gamma)
interceptCompare <- function(model, factor) {
results <- list() # empty list to store results
for (i in unique(factor)) {
factor <- relevel(factor, ref = i)
model <- update(model)
results[[i]] <- summary(model)$coefficients[1:3, ]
}
results <- lapply(results, function(x) round(x, 4))
return(results)
}
interceptCompare(binomial.model, food)
interceptCompare(gamma.model, food)
You will need to add one line, in order to change the data, and use it within the update:
interceptCompare <- function(model, factor) {
results <- list() # empty list to store results
s <- deparse(substitute(factor))#ADD THIS LINE
for (i in unique(factor)) {
factor <- relevel(factor, ref = i)
model[["model"]][[s]] <- factor #CHANGE THE DATA IN THE MODEL
model <- update(model,data=model[["model"]])# UPDATE THE MODEL
results[[i]] <- summary(model)$coefficients[1:3, ]
}
results <- lapply(results, function(x) round(x, 4))
return(results)
}
interceptCompare(binomial.model, food)
$a
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.3863 0.5590 2.4799 0.0131
foodb -0.7673 0.7296 -1.0516 0.2930
foodc -0.2877 0.7610 -0.3780 0.7054
$b
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.6190 0.4688 1.3205 0.1867
fooda 0.7673 0.7296 1.0516 0.2930
foodc 0.4796 0.6975 0.6876 0.4917
$c
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.0986 0.5164 2.1275 0.0334
foodb -0.4796 0.6975 -0.6876 0.4917
fooda 0.2877 0.7610 0.3780 0.7054
interceptCompare(gamma.model, food)
$a
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2246 0.0156 14.3919 0.0000
foodb -0.0170 0.0213 -0.8022 0.4257
foodc -0.0057 0.0218 -0.2608 0.7952
$b
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2076 0.0144 14.3919 0.0000
fooda 0.0170 0.0213 0.8022 0.4257
foodc 0.0114 0.0210 0.5421 0.5898
$c
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2189 0.0152 14.3919 0.0000
foodb -0.0114 0.0210 -0.5421 0.5898
fooda 0.0057 0.0218 0.2608 0.7952
You need to be much more careful when trying to swap out symbols from formulas which is what you are doing. You need to put it in terms that the R language can understand. You want to pass in the name "food" not the values stored in the "food" vector as you are now. Here's an update that seems to do what you were trying to do
interceptCompare <- function(model, factor) {
sym <- substitute(factor)
results <- list() # empty list to store results
for (i in unique(factor)) {
change <- eval(bquote(~.-.(sym)+relevel(.(sym), ref=.(i))))
new_model <- update(model, change)
results[[i]] <- summary(new_model)$coefficients[1:3, ]
}
results <- lapply(results, function(x) round(x, 4))
return(results)
}
Here we capture the name "food" with the substitute. Then we use bquote() to build a new formula that will remove the original value, and relevel the factor variable with a particular reference. Then we save this to a new object so we don't keep updating the same model. For the binomial.model, this returns
$`a`
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.8473 0.4879 1.7364 0.0825
relevel(food, ref = "a")b 0.0000 0.6901 0.0000 1.0000
relevel(food, ref = "a")c -0.8473 0.6619 -1.2801 0.2005
$b
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.8473 0.4879 1.7364 0.0825
relevel(food, ref = "b")a 0.0000 0.6901 0.0000 1.0000
relevel(food, ref = "b")c -0.8473 0.6619 -1.2801 0.2005
$c
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.0000 0.4472 0.0000 1.0000
relevel(food, ref = "c")a 0.8473 0.6619 1.2801 0.2005
relevel(food, ref = "c")b 0.8473 0.6619 1.2801 0.2005
You can see how it changed the ref= at each iteration

extract summary from matrix lm object

I'd like to get the coefficients from the summary section of an lm object, except I inputted a matrix and I am getting null for the summary part. Here is my code:
n=12
y=rnorm(n,23,1)
x1=rnorm(n,23,1)
x2=rnorm(n,15.5,1)
lm1=lm(y~x1+x2)
n2=10
b0=4;b1=2;b2=3
sim1<-function(){
randmat=matrix(rnorm(n*n2,0,8),n,n2)
x1mat=matrix(x1,n,n2)
x2mat=matrix(x2,n,n2)
return(b0+b1*x1mat+b2*x2mat+randmat)
}
sim1=sim1()
lm1=lm(sim1~x1+x2)
c2=summary(lm1)$coefficients
> c2
NULL
what I want is this (but repeated):
lm2=lm(sim1[,1]~x1+x2)
summary(lm2)$coefficients
Does anyone know how to extract these? Thanks
-Rik
Another way is to do the following after the end of the following line of your code.
lm1=lm(sim1~x1+x2) #this runs 10 models
All the coefficients will be stored in the list summary(lm1) as Response Y1 ... to Response Y10 (i.e. 10 models as many as ncol(sim1)).
In order to get the coefficients from each model back you could do:
all_coef <- lapply( paste0('Response Y', 1:ncol(sim1)),
function(x) summary(lm1)[[x]]$coefficients)
Or as #Rik mentions in the comment it will be faster if summary(lm1) is not repeated in the lapply loop in case you have a big matrix.
the_sum <- summary(lm1)
all_coef <- lapply( paste0('Response Y', 1:ncol(sim1)),
function(x) the_sum[[x]]$coefficients)
And the output would be:
> all_coef
[[1]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 135.242552 80.136427 1.687654 0.1257496
x1 -4.777486 2.953534 -1.617549 0.1402142
x2 4.464435 3.891641 1.147186 0.2808857
[[2]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 119.1772823 111.603046 1.06786765 0.3133851
x1 -0.1376013 4.113277 -0.03345297 0.9740435
x2 -1.2946027 5.419744 -0.23886785 0.8165585
[[3]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) -51.329923 63.495202 -0.8084063 0.4397018
x1 3.721227 2.340199 1.5901325 0.1462682
x2 3.793981 3.083498 1.2304147 0.2497304
[[4]]
Estimate Std. Error t value Pr(>|t|)
(Intercept) 124.8606014 57.669842 2.16509352 0.05857967
x1 -1.2517705 2.125498 -0.58893044 0.57039201
x2 -0.1159803 2.800603 -0.04141263 0.96787111
#...and so on until 10
To get the individual coefficients for a model just do:
all_coef[[<the_number_you_want>]]

Extracting coefficients from a regression 1 model with 1 predictor

I currently have the following regression model:
> print(summary(step1))
Call:
lm(formula = model1, data = newdat1)
Residuals:
Min 1Q Median 3Q Max
-2.53654 -0.02423 -0.02423 -0.02423 1.71962
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3962 0.0532 7.446 2.76e-12 ***
i2 0.6281 0.0339 18.528 < 2e-16 ***
I would like just the following returned as a data frame:
Estimate Std. Error t value Pr(>|t|)
i2 0.6281 0.0339 18.528 < 2e-16
I currently have the following code:
> results1<-as.data.frame(summary(step1)$coefficients[-1,drop=FALSE])
Which yields:
> results1
summary(step1)$coefficients[-1, drop = FALSE]
1 6.280769e-01
2 5.320108e-02
3 3.389873e-02
4 7.446350e+00
5 1.852804e+01
6 2.764836e-12
7 2.339089e-45
Thus is not what I want; however, it does work when there's more than 1 predictor.
It would be nice if you gave a reproducible example. I think you're looking for
cc <- coef(summary(step1))[2,,drop=FALSE]
as.data.frame(cc)
Using accessors such as coef(summary(.)) rather than summary(.)$coefficients is both prettier and more robust (there is no guarantee that the internal structure of summary() will stay the same -- although admittedly it's unlikely that this basic a part of R will change any time soon, especially as many users probably have used constructions like $coefficients).
Indexing the row by name, i.e.
coef(summary(step1))["i2",,drop=FALSE]
would probably be even better.
summary(step1)$coefficients is a matrix. When you take out the first element with [-1, drop=FALSE] it is converted to a vector, which is why you get 7 numbers instead of the row you want.
> set.seed(123)
> x <- rnorm(100)
> y <- -1 + 0.2*x + rnorm(100)
> step1 <- lm(y ~ x)
> class(summary(step1)$coefficients)
[1] "matrix"
> class(summary(step1)$coefficients[-1, drop=FALSE])
[1] "numeric"
The solution is to change the subsetting with [ so that you specify you wan to keep all columns (see ?`[`):
> summary(step1)$coefficients[-1, , drop=FALSE]
Estimate Std. Error t value Pr(>|t|)
x 0.1475284 0.1068786 1.380336 0.1706238

Obtain standard errors of regression coefficients for an "mlm" object returned by `lm()`

I'd like to run 10 regressions against the same regressor, then pull all the standard errors without using a loop.
depVars <- as.matrix(data[,1:10]) # multiple dependent variables
regressor <- as.matrix([,11]) # independent variable
allModels <- lm(depVars ~ regressor) # multiple, single variable regressions
summary(allModels)[1] # Can "view" the standard error for 1st regression, but can't extract...
allModels is stored as an "mlm" object, which is really tough to work with. It'd be great if I could store a list of lm objects or a matrix with statistics of interest.
Again, the objective is to NOT use a loop. Here is a loop equivalent:
regressor <- as.matrix([,11]) # independent variable
for(i in 1:10) {
tempObject <- lm(data[,i] ~ regressor) # single regressions
table1Data[i,1] <- summary(tempObject)$coefficients[2,2] # assign std error
rm(tempObject)
}
If you put your data in long format it's very easy to get a bunch of regression results using lmList from the nlme or lme4 packages. The output is a list of regression results and the summary can give you a matrix of coefficients, just like you wanted.
library(lme4)
m <- lmList( y ~ x | group, data = dat)
summary(m)$coefficients
Those coefficients are in a simple 3 dimensional array so the standard errors are at [,2,2].
Given an "mlm" model object model, you can use the below function written by me to get standard errors of coefficients. This is very efficient: no loop, and no access to summary.mlm().
std_mlm <- function (model) {
Rinv <- with(model$qr, backsolve(qr, diag(rank)))
## unscaled standard error
std_unscaled <- sqrt(rowSums(Rinv ^ 2)[order(model$qr$pivot)])
## residual standard error
sigma <- sqrt(colSums(model$residuals ^ 2) / model$df.residual)
## return final standard error
## each column corresponds to a model
"dimnames<-"(outer(std_unscaled, sigma), list = dimnames(model$coefficients))
}
A simple, reproducible example
set.seed(0)
Y <- matrix(rnorm(50 * 5), 50) ## assume there are 5 responses
X <- rnorm(50) ## covariate
fit <- lm(Y ~ X)
We all know that it is simple to extract estimated coefficients via:
fit$coefficients ## or `coef(fit)`
# [,1] [,2] [,3] [,4] [,5]
#(Intercept) -0.21013925 0.1162145 0.04470235 0.08785647 0.02146662
#X 0.04110489 -0.1954611 -0.07979964 -0.02325163 -0.17854525
Now let's apply our std_mlm:
std_mlm(fit)
# [,1] [,2] [,3] [,4] [,5]
#(Intercept) 0.1297150 0.1400600 0.1558927 0.1456127 0.1186233
#X 0.1259283 0.1359712 0.1513418 0.1413618 0.1151603
We can of course, call summary.mlm just to check our result is correct:
coef(summary(fit))
#Response Y1 :
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.21013925 0.1297150 -1.6200072 0.1117830
#X 0.04110489 0.1259283 0.3264151 0.7455293
#
#Response Y2 :
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.1162145 0.1400600 0.8297485 0.4107887
#X -0.1954611 0.1359712 -1.4375183 0.1570583
#
#Response Y3 :
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.04470235 0.1558927 0.2867508 0.7755373
#X -0.07979964 0.1513418 -0.5272811 0.6004272
#
#Response Y4 :
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.08785647 0.1456127 0.6033574 0.5491116
#X -0.02325163 0.1413618 -0.1644831 0.8700415
#
#Response Y5 :
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.02146662 0.1186233 0.1809646 0.8571573
#X -0.17854525 0.1151603 -1.5504057 0.1276132
Yes, all correct!
Here an option:
put your data in the long format using regressor as an id key.
do your regression against value by group of variable.
For example , using mtcars data set:
library(reshape2)
dat.m <- melt(mtcars,id.vars='mpg') ## mpg is my regressor
library(plyr)
ddply(dat.m,.(variable),function(x)coef(lm(variable~value,data=x)))
variable (Intercept) value
1 cyl 1 8.336774e-18
2 disp 1 6.529223e-19
3 hp 1 1.106781e-18
4 drat 1 -1.505237e-16
5 wt 1 8.846955e-17
6 qsec 1 6.167713e-17
7 vs 1 2.442366e-16
8 am 1 -3.381738e-16
9 gear 1 -8.141220e-17
10 carb 1 -6.455094e-17

handling outputs with different lengths using ldply

Just a quick question on how to handle outputs of different lengths using ldply from the plyr package. Here is a simple version of the code I am using and the error I am getting:
# function to collect the coefficients from the regression models:
> SecreatWeapon <- dlply(merged1,~country.x, function(df) {
+ lm(log(child_mortality) ~ log(IHME_usd_gdppc)+ hiv_prev,data=df)
+ })
>
# functions to extract the output of interest
> extract.coefs <- function(mod) c(extract.coefs = summary(mod)$coefficients[,1])
> extract.se.coefs <- function(mod) c(extract.se.coefs = summary(mod)$coefficients[,2])
>
# function to combine the extracted output
> res <- ldply(SecreatWeapon, extract.coefs)
Error in list_to_dataframe(res, attr(.data, "split_labels")) :
Results do not have equal lengths
Here the error is due to the fact that some models will contain NA values so that:
> SecreatWeapon[[1]]
Call:
lm(formula = log(child_mortality) ~ log(IHME_usd_gdppc) + hiv_prev,
data = df)
Coefficients:
(Intercept) log(IHME_usd_gdppc) hiv_prev
-4.6811 0.5195 NA
and therefore the following output won't have the same length; for example:
> summary(SecreatWeapon[[1]])$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.6811000 0.6954918 -6.730633 6.494799e-08
log(IHME_usd_gdppc) 0.5194643 0.1224292 4.242977 1.417349e-04
but for the other one I get
> summary(SecreatWeapon[[10]])$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.612698 1.7505236 10.632646 1.176347e-12
log(IHME_usd_gdppc) -2.256465 0.1773498 -12.723244 6.919009e-15
hiv_prev -272.558951 160.3704493 -1.699558 9.784053e-02
Any easy fixes? Thank you very much,
Antonio Pedro.
The summary.lm( . ) function accessed with $coefficients gives different output than the coef would with an lm argument for any lm-object with an NA "coefficient". Would you be satisfied with using something like this:
coef.se <- function(mod) {
extract.coefs <- function(mod) coef(mod) # lengths all the same
extract.se.coefs <- function(mod) { summary(mod)$coefficients[,2]}
return( merge( extract.coefs(mod), extract.se.coefs(mod), by='row.names', all=TRUE) )
}
With Roland's example it gives:
> coef.se(fit)
Row.names x y
1 (Intercept) -0.3606557 0.1602034
2 x1 2.2131148 0.1419714
3 x2 NA NA
You could rename the x as coef and the y as se.coef
y <- c(1,2,3)
x1 <- c(0.6,1.1,1.5)
x2 <- c(1,1,1)
fit <- lm(y~x1+x2)
summary(fit)$coef
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.3606557 0.1602034 -2.251236 0.26612016
#x1 2.2131148 0.1419714 15.588457 0.04078329
#function for full matrix, adjusted from getAnywhere(print.summary.lm)
full_coeffs <- function (fit) {
fit_sum <- summary(fit)
cn <- names(fit_sum$aliased)
coefs <- matrix(NA, length(fit_sum$aliased), 4,
dimnames = list(cn, colnames(fit_sum$coefficients)))
coefs[!fit_sum$aliased, ] <- fit_sum$coefficients
coefs
}
full_coeffs(fit)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.3606557 0.1602034 -2.251236 0.26612016
#x1 2.2131148 0.1419714 15.588457 0.04078329
#x2 NA NA NA NA

Resources