handling outputs with different lengths using ldply - r

Just a quick question on how to handle outputs of different lengths using ldply from the plyr package. Here is a simple version of the code I am using and the error I am getting:
# function to collect the coefficients from the regression models:
> SecreatWeapon <- dlply(merged1,~country.x, function(df) {
+ lm(log(child_mortality) ~ log(IHME_usd_gdppc)+ hiv_prev,data=df)
+ })
>
# functions to extract the output of interest
> extract.coefs <- function(mod) c(extract.coefs = summary(mod)$coefficients[,1])
> extract.se.coefs <- function(mod) c(extract.se.coefs = summary(mod)$coefficients[,2])
>
# function to combine the extracted output
> res <- ldply(SecreatWeapon, extract.coefs)
Error in list_to_dataframe(res, attr(.data, "split_labels")) :
Results do not have equal lengths
Here the error is due to the fact that some models will contain NA values so that:
> SecreatWeapon[[1]]
Call:
lm(formula = log(child_mortality) ~ log(IHME_usd_gdppc) + hiv_prev,
data = df)
Coefficients:
(Intercept) log(IHME_usd_gdppc) hiv_prev
-4.6811 0.5195 NA
and therefore the following output won't have the same length; for example:
> summary(SecreatWeapon[[1]])$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.6811000 0.6954918 -6.730633 6.494799e-08
log(IHME_usd_gdppc) 0.5194643 0.1224292 4.242977 1.417349e-04
but for the other one I get
> summary(SecreatWeapon[[10]])$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.612698 1.7505236 10.632646 1.176347e-12
log(IHME_usd_gdppc) -2.256465 0.1773498 -12.723244 6.919009e-15
hiv_prev -272.558951 160.3704493 -1.699558 9.784053e-02
Any easy fixes? Thank you very much,
Antonio Pedro.

The summary.lm( . ) function accessed with $coefficients gives different output than the coef would with an lm argument for any lm-object with an NA "coefficient". Would you be satisfied with using something like this:
coef.se <- function(mod) {
extract.coefs <- function(mod) coef(mod) # lengths all the same
extract.se.coefs <- function(mod) { summary(mod)$coefficients[,2]}
return( merge( extract.coefs(mod), extract.se.coefs(mod), by='row.names', all=TRUE) )
}
With Roland's example it gives:
> coef.se(fit)
Row.names x y
1 (Intercept) -0.3606557 0.1602034
2 x1 2.2131148 0.1419714
3 x2 NA NA
You could rename the x as coef and the y as se.coef

y <- c(1,2,3)
x1 <- c(0.6,1.1,1.5)
x2 <- c(1,1,1)
fit <- lm(y~x1+x2)
summary(fit)$coef
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.3606557 0.1602034 -2.251236 0.26612016
#x1 2.2131148 0.1419714 15.588457 0.04078329
#function for full matrix, adjusted from getAnywhere(print.summary.lm)
full_coeffs <- function (fit) {
fit_sum <- summary(fit)
cn <- names(fit_sum$aliased)
coefs <- matrix(NA, length(fit_sum$aliased), 4,
dimnames = list(cn, colnames(fit_sum$coefficients)))
coefs[!fit_sum$aliased, ] <- fit_sum$coefficients
coefs
}
full_coeffs(fit)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.3606557 0.1602034 -2.251236 0.26612016
#x1 2.2131148 0.1419714 15.588457 0.04078329
#x2 NA NA NA NA

Related

how to select variables in a linear regression based on values in a matrix with R

I am working with R.
I have a matrix called combination:
comb <- matrix( c(1,2,1,3,2,3) , nrow = 3 , ncol = 2)
n_comb<-3
I have a one column dataframe called y with the values of my y variable.
I have a 3 column dataframe called reg with 3 regressors.
I want to do a loop which regresses y on all possible combinations of reg, selecting each time two variables. Hopefully, I can store the values of the regression somewhere so that I can easily access them afterwards. For instance, I would like to store the R square of each regression, as well as the x variables employed associated with the R square value.
So far I have tried:
for (i in 1:n_comb){
*reg_simple <- select only the variables I need*
all<-cbind (y,reg_simple)
colnames(all)[1] <- "y"
regression <-lm(y~.,all)
summary (regression)
*store the R square and the regressors somewhere*
}
`
If we wanted to use the predictors based on each row of the 'comb', loop over the rows of the 'comb' matrix (either with apply/MARGIN = 1 or split by row (asplit- MARGIN = 1) and loop with sapply), create the formula using reformulate, apply the lm, and extract the r.squared values
rsquare_out <- sapply(asplit(comb, 1),
function(i) summary(lm(reformulate(names(reg)[i], response = 'y'),
data = cbind(reg, y)))$r.squared)
Using loops:
Dummy data:
n = 100
y = rnorm(n)
x = data.frame(x1=1*y+rnorm(n),
x2=2*y+rnorm(n),
x3=3*y+rnorm(n))
comb = gtools::combinations(3, 2)
Code:
regs = list()
for(i in 1:nrow(comb)){
mod = summary(lm(y ~ ., x[,comb[i,]]))
regs[[i]] = list(call=mod$terms[[3]],
coefs=mod$coefficients,
RS=mod$r.squared)}
You can include anything else you want in the list(). Output:
> regs
[[1]]
[[1]]$call
x1 + x2
[[1]]$coefs
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.03686327 0.04218032 0.8739449 3.843069e-01
x1 0.13359822 0.04037758 3.3087228 1.316050e-03
x2 0.36019362 0.02384050 15.1084733 3.143002e-27
[[1]]$RS
[1] 0.8384476
[[2]]
[[2]]$call
x1 + x3
[[2]]$coefs
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.03390277 0.02660885 1.274116 2.056664e-01
x1 0.04295226 0.02654442 1.618128 1.088823e-01
x3 0.28556167 0.01064096 26.836090 1.110231e-46
[[2]]$RS
[1] 0.9356962
[[3]]
[[3]]$call
x2 + x3
[[3]]$coefs
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0291651 0.02391629 1.219466 2.256244e-01
x2 0.1116096 0.02205835 5.059746 1.989407e-06
x3 0.2304448 0.01497633 15.387271 8.944792e-28
[[3]]$RS
[1] 0.9477506
Or you can use this to name the lists with the call:
regs = list()
for(i in 1:nrow(comb)){
names = colnames(x)[comb[i,]]
mod = summary(lm(y ~ ., x[,names]))
regs[[paste(names, collapse=" + ")]] = list(coefs=mod$coefficients,
RS=mod$r.squared)}
Output:
> regs
$`x1 + x2`
$`x1 + x2`$coefs
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.03686327 0.04218032 0.8739449 3.843069e-01
x1 0.13359822 0.04037758 3.3087228 1.316050e-03
x2 0.36019362 0.02384050 15.1084733 3.143002e-27
$`x1 + x2`$RS
[1] 0.8384476
$`x1 + x3`
$`x1 + x3`$coefs
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.03390277 0.02660885 1.274116 2.056664e-01
x1 0.04295226 0.02654442 1.618128 1.088823e-01
x3 0.28556167 0.01064096 26.836090 1.110231e-46
$`x1 + x3`$RS
[1] 0.9356962
$`x2 + x3`
$`x2 + x3`$coefs
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0291651 0.02391629 1.219466 2.256244e-01
x2 0.1116096 0.02205835 5.059746 1.989407e-06
x3 0.2304448 0.01497633 15.387271 8.944792e-28
$`x2 + x3`$RS
[1] 0.9477506

Replace underlying standard errors in lm() OLS model in R

I would like to run an ols model using lm() in R and replace the standard errors in the model. In the following example, I would like to replace each standard error with "2":
set.seed(123)
x <- rnorm(100)
y <- rnorm(100)
mod <- lm(y ~x)
ses <- c(2,2)
coef(summary(mod))[,2] <- ses
sqrt(diag(vcov(mod))) <- ses
Any thoughts on how to do this? Thanks.
Those assignments are not going to succeed. coef, sqrt and vcov are not going to pass those values "upstream". You could do this:
> false.summ <- coef(summary(mod))
> false.sqrt.vcov <- sqrt(diag(vcov(mod)))
> false.summ
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.10280305 0.09755118 -1.0538371 0.2945488
x -0.05247161 0.10687862 -0.4909459 0.6245623
> false.summ[ , 2] <- ses
> false.sqrt.vcov
(Intercept) x
0.09755118 0.10687862
> false.sqrt.vcov <- ses
You could also modify a summary-object at least the coef-matrix, but there is no "vcov" element in summary despite the fact that vcov does return a value.
> summ <- summary(mod)
> summ$coefficients[ , 2] <- ses
> coef(summ)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.10280305 2 -1.0538371 0.2945488
x -0.05247161 2 -0.4909459 0.6245623
> summ$vcov
NULL
> vcov(summ)
(Intercept) x
(Intercept) 0.009516233 -0.00103271
x -0.001032710 0.01142304:
If you wanted to change the output of vcov when applied to a summary object you would need to distort the unscaled cov-matrix. This is the code that vcov uses for that object-class:
> getAnywhere(vcov.summary.lm)
A single object matching ‘vcov.summary.lm’ was found
It was found in the following places
registered S3 method for vcov from namespace stats
namespace:stats
with value
function (object, ...)
object$sigma^2 * object$cov.unscaled
<bytecode: 0x7fb63c784068>
<environment: namespace:stats>

extracting linear model coefficients into a vector within a loop

I am trying to create sample of 200 linear model coefficients using a loop in R. As an end result, I want a vector containing the coefficients.
for (i in 1:200) {
smpl_5 <- population[sample(1:1000, 5), ]
model_5 <- summary(lm(y~x, data=smpl_5))
}
I can extract the coefficients easy enough, but I am having trouble outputting them into a vector within the loop. Any Suggestions?
You can use replicate for this if you like. In your case, because the number of coefficients is identical for all models, it'll return an array as shown in the example below:
d <- data.frame(x=runif(1000))
d$y <- d$x * 0.123 + rnorm(1000, 0, 0.01)
coefs <- replicate(3, {
xy <- d[sample(nrow(d), 100), ]
coef(summary(lm(y~x, data=xy)))
})
coefs
# , , 1
#
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.001361961 0.002091297 0.6512516 5.164083e-01
# x 0.121142447 0.003624717 33.4212114 2.235307e-55
#
# , , 2
#
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.003213314 0.001967050 1.63357 1.055579e-01
# x 0.118026828 0.003332906 35.41259 1.182027e-57
#
# , , 3
#
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.003366678 0.001990226 1.691606 9.389883e-02
# x 0.119408470 0.003370190 35.430783 1.128070e-57
Access particular elements with normal array indexing, e.g.:
coefs[, , 1] # return the coefs for the first model
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.001361961 0.002091297 0.6512516 5.164083e-01
# x 0.121142447 0.003624717 33.4212114 2.235307e-55
So, for your problem, you could use:
replicate(200, {
smpl_5 <- population[sample(1:1000, 5), ]
coef(summary(lm(y~x, data=smpl_5)))
})

Custom Bootstrapped Standard Error: numeric 'envir' arg not of length one

I am writing a custom script to bootstrap standard errors in a GLM in R and receive the following error:
Error in eval(predvars, data, env) : numeric 'envir' arg not of length one
Can someone explain what I am doing wrong? My code:
#Number of simulations
sims<-numbersimsdesired
#Set up place to store data
saved.se<-matrix(NA,sims,numberofcolumnsdesired)
y<-matrix(NA,realdata.rownumber)
x1<-matrix(NA,realdata.rownumber)
x2<-matrix(NA,realdata.rownumber)
#Resample entire dataset with replacement
for (sim in 1:sims) {
fake.data<-sample(1:nrow(data5),nrow(data5),replace=TRUE)
#Define variables for GLM using fake data
y<-realdata$y[fake.data]
x1<-realdata$x1[fake.data]
x2<-realdata$x2[fake.data]
#Run GLM on fake data, extract SEs, save SE into matrix
glm.output<-glm(y ~ x1 + x2, family = "poisson", data = fake.data)
saved.se[sim,]<-summary(glm.output)$coefficients[0,2]
}
An example: if we suppose sims = 1000 and we want 10 columns (suppose instead of x1 and x2, we have x1...x10) the goal is a dataset with 1,000 rows and 10 columns containing each explanatory variable's SEs.
There isn't a reason to reinvent the wheel. Here is an example of bootstrapping the standard error of the intercept with the boot package:
set.seed(42)
counts <- c(18,17,15,20,10,20,25,13,12)
x1 <- 1:9
x2 <- sample(9)
DF <- data.frame(counts, x1, x2)
glm1 <- glm(counts ~ x1 + x2, family = poisson(), data=DF)
summary(glm1)$coef
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) 2.08416378 0.42561333 4.896848 9.738611e-07
#x1 0.04838210 0.04370521 1.107010 2.682897e-01
#x2 0.09418791 0.04446747 2.118131 3.416400e-02
library(boot)
intercept.se <- function(d, i) {
glm1.b <- glm(counts ~ x1 + x2, family = poisson(), data=d[i,])
summary(glm1.b)$coef[1,2]
}
set.seed(42)
boot.intercept.se <- boot(DF, intercept.se, R=999)
#ORDINARY NONPARAMETRIC BOOTSTRAP
#
#
#Call:
#boot(data = DF, statistic = intercept.se, R = 999)
#
#
#Bootstrap Statistics :
# original bias std. error
#t1* 0.4256133 0.103114 0.2994377
Edit:
If you prefer doing it without a package:
n <- 999
set.seed(42)
ind <- matrix(sample(nrow(DF), nrow(DF)*n, replace=TRUE), nrow=n)
boot.values <- apply(ind, 1, function(...) {
i <- c(...)
intercept.se(DF, i)
})
sd(boot.values)
#[1] 0.2994377

Obtain standard errors of regression coefficients for an "mlm" object returned by `lm()`

I'd like to run 10 regressions against the same regressor, then pull all the standard errors without using a loop.
depVars <- as.matrix(data[,1:10]) # multiple dependent variables
regressor <- as.matrix([,11]) # independent variable
allModels <- lm(depVars ~ regressor) # multiple, single variable regressions
summary(allModels)[1] # Can "view" the standard error for 1st regression, but can't extract...
allModels is stored as an "mlm" object, which is really tough to work with. It'd be great if I could store a list of lm objects or a matrix with statistics of interest.
Again, the objective is to NOT use a loop. Here is a loop equivalent:
regressor <- as.matrix([,11]) # independent variable
for(i in 1:10) {
tempObject <- lm(data[,i] ~ regressor) # single regressions
table1Data[i,1] <- summary(tempObject)$coefficients[2,2] # assign std error
rm(tempObject)
}
If you put your data in long format it's very easy to get a bunch of regression results using lmList from the nlme or lme4 packages. The output is a list of regression results and the summary can give you a matrix of coefficients, just like you wanted.
library(lme4)
m <- lmList( y ~ x | group, data = dat)
summary(m)$coefficients
Those coefficients are in a simple 3 dimensional array so the standard errors are at [,2,2].
Given an "mlm" model object model, you can use the below function written by me to get standard errors of coefficients. This is very efficient: no loop, and no access to summary.mlm().
std_mlm <- function (model) {
Rinv <- with(model$qr, backsolve(qr, diag(rank)))
## unscaled standard error
std_unscaled <- sqrt(rowSums(Rinv ^ 2)[order(model$qr$pivot)])
## residual standard error
sigma <- sqrt(colSums(model$residuals ^ 2) / model$df.residual)
## return final standard error
## each column corresponds to a model
"dimnames<-"(outer(std_unscaled, sigma), list = dimnames(model$coefficients))
}
A simple, reproducible example
set.seed(0)
Y <- matrix(rnorm(50 * 5), 50) ## assume there are 5 responses
X <- rnorm(50) ## covariate
fit <- lm(Y ~ X)
We all know that it is simple to extract estimated coefficients via:
fit$coefficients ## or `coef(fit)`
# [,1] [,2] [,3] [,4] [,5]
#(Intercept) -0.21013925 0.1162145 0.04470235 0.08785647 0.02146662
#X 0.04110489 -0.1954611 -0.07979964 -0.02325163 -0.17854525
Now let's apply our std_mlm:
std_mlm(fit)
# [,1] [,2] [,3] [,4] [,5]
#(Intercept) 0.1297150 0.1400600 0.1558927 0.1456127 0.1186233
#X 0.1259283 0.1359712 0.1513418 0.1413618 0.1151603
We can of course, call summary.mlm just to check our result is correct:
coef(summary(fit))
#Response Y1 :
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -0.21013925 0.1297150 -1.6200072 0.1117830
#X 0.04110489 0.1259283 0.3264151 0.7455293
#
#Response Y2 :
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.1162145 0.1400600 0.8297485 0.4107887
#X -0.1954611 0.1359712 -1.4375183 0.1570583
#
#Response Y3 :
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.04470235 0.1558927 0.2867508 0.7755373
#X -0.07979964 0.1513418 -0.5272811 0.6004272
#
#Response Y4 :
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.08785647 0.1456127 0.6033574 0.5491116
#X -0.02325163 0.1413618 -0.1644831 0.8700415
#
#Response Y5 :
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.02146662 0.1186233 0.1809646 0.8571573
#X -0.17854525 0.1151603 -1.5504057 0.1276132
Yes, all correct!
Here an option:
put your data in the long format using regressor as an id key.
do your regression against value by group of variable.
For example , using mtcars data set:
library(reshape2)
dat.m <- melt(mtcars,id.vars='mpg') ## mpg is my regressor
library(plyr)
ddply(dat.m,.(variable),function(x)coef(lm(variable~value,data=x)))
variable (Intercept) value
1 cyl 1 8.336774e-18
2 disp 1 6.529223e-19
3 hp 1 1.106781e-18
4 drat 1 -1.505237e-16
5 wt 1 8.846955e-17
6 qsec 1 6.167713e-17
7 vs 1 2.442366e-16
8 am 1 -3.381738e-16
9 gear 1 -8.141220e-17
10 carb 1 -6.455094e-17

Resources