bootstrap standard errors of a linear regression in R - r

I have a lm object and I would like to bootstrap only its standard errors. In practice I want to use only part of the sample (with replacement) at each replication and get a distribution of standard erros. Then, if possible, I would like to display the summary of the original linear regression but with the bootstrapped standard errors and the corresponding p-values (in other words same beta coefficients but different standard errors).
Edited: In summary I want to "modify" my lm object by having the same beta coefficients of the original lm object that I ran on the original data, but having the bootstrapped standard errors (and associated t-stats and p-values) obtained by computing this lm regression several times on different subsamples (with replacement).
So my lm object looks like
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.812793 0.095282 40.016 < 2e-16 ***
x -0.904729 0.284243 -3.183 0.00147 **
z 0.599258 0.009593 62.466 < 2e-16 ***
x*z 0.091511 0.029704 3.081 0.00208 **
but the associated standard errors are wrong, and I would like to estimate them by replicating this linear regression 1000 times (replications) on different subsample (with replacement).
Is there a way to do this? can anyone help me?
Thank you for your time.
Marco

What you ask can be done following the line of the code below.
Since you have not posted an example dataset nor the model to fit, I will use the built in dataset mtcars an a simple formula with two continuous predictors.
library(boot)
boot_function <- function(data, indices, formula){
d <- data[indices, ]
obj <- lm(formula, d)
coefs <- summary(obj)$coefficients
coefs[, "Std. Error"]
}
set.seed(8527)
fmla <- as.formula("mpg ~ hp * cyl")
seboot <- boot(mtcars, boot_function, R = 1000, formula = fmla)
colMeans(seboot$t)
##[1] 6.511530646 0.068694001 1.000101450 0.008804784
I believe that it is possible to use the code above for most needs with numeric response and predictors.

Related

retrieve formula used by predict function in exponential equation in R

I can't figure out how to reconstruct the results nor the formula from the predict function of a linear model. I get the same results also when using this data in ggplot geom_smooth(method='lm',formula,y ~ exp(x)).
Here's some sample data
x=c(1,10,100,1000,10000,100000,1000000,3000000)
y=c(1,1,10,15,20,30,40,60)
I would like to use an exponential function so (ignore for the moment that I log the x value, because exp() fails for very large values):
model = lm( y ~ exp(log10(x)))
mypred = predict(model)
plot(log(x),mypred)
I have tried
lm_coef <- coef(model)
plot(log10(x),lm_coef[1]*exp(-lm_coef[2]*x))
However this is giving me a decreasing exponential instead of the increasing. My goal is to extract the equation of the exponential function so I can reuse the coefficients in another context.. What equation is predict() using and is there a way to see it?
I did something along the lines of:
Df<-data.frame(x=c(1,10,100,1000,10000,100000,1000000,3000000),
y=c(1,1,10,15,20,30,40,60))
model<-lm(data = Df, formula = y~log(x))
predict(model)
plot(log(Df$x),predict(model))
summary(model)
The relevant output you get is:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6.0700 4.7262 -1.284 0.246386
log(x) 3.5651 0.5035 7.081 0.000398 ***
---
Your equation therefore is 3.5651*log(x)-6.0700

Is there a way to both include PCSE and Prais-Winsten correction in a fixed effects model in R (similar to the xtpcse function in Stata)?

I want to estimate a fixed effects model while using panel-corrected standard errors as well as Prais-Winsten (AR1) transformation in order to solve panel heteroscedasticity, contemporaneous spatial correlation and autocorrelation.
I have time-series cross-section data and want to perform regression analysis. I was able to estimate a fixed effects model, panel corrected standard errors and Prais-winsten estimates individually. And I was able to include panel corrected standard errors in a fixed effects model. But I want them all at once.
# Basic ols model
ols1 <- lm(y ~ x1 + x2, data = data)
summary(ols1)
# Fixed effects model
library('plm')
plm1 <- plm(y ~ x1 + x2, data = data, model = 'within')
summary(plm1)
# Panel Corrected Standard Errors
library(pcse)
lm.pcse1 <- pcse(ols1, groupN = Country, groupT = Time)
summary(lm.pcse1)
# Prais-Winsten estimates
library(prais)
prais1 <- prais_winsten(y ~ x1 + x2, data = data)
summary(prais1)
# Combination of Fixed effects and Panel Corrected Standard Errors
ols.fe <- lm(y ~ x1 + x2 + factor(Country) - 1, data = data)
pcse.fe <- pcse(ols.fe, groupN = Country, groupT = Time)
summary(pcse.fe)
In the Stata command: xtpcse it is possible to include both panel corrected standard errors and Prais-Winsten corrected estimates, with something allong the following code:
xtpcse y x x x i.cc, c(ar1)
I would like to achieve this in R as well.
I am not sure that my answer will completely address your concern, these days I've been trying to deal with the same problem that you mention.
In my case, I ran the Prais-Winsten function from the package prais where I included my model with the fixed effects. Afterwards, I correct for heteroskedasticity using the function vcovHC.prais which is analogous to vcovHC function from the package sandwich.
This basically will give you White's/sandwich heteroskedasticity-consistent covariance matrix which, if you later fit into the function coeftest from the package lmtest, it will give you the table output with the corrected standard errors. Taking your posted example, see below the code that I have used:
# Prais-Winsten estimates with Fixed Effects
library(prais)
prais.fe <- prais_winsten(y ~ x1 + x2 + factor(Country), data = data)
library(lmtest)
prais.fe.w <- coeftest(prais.fe, vcov = vcovHC.prais(prais.fe, "HC1")
h.m1 # run the object to see the output with the corrected standard errors.
Alas, I am aware that the sandwhich heteroskedasticity-consistent standard errors are not exactly the same as the Beck and Katz's PCSEs because PCSE deals with panel heteroskedasticity while sandwhich SEs addresses overall heteroskedasticity. I am not totally sure in how much these two differ in practice, but something is something.
I hope my answer was somehow helpful, this is actually my very first answer :D

Comparing nested mice models with interaction terms

R's mice contains a function, pool.compare, to compare nested models fit to imputed objects. If I try to include an interaction term:
library(mice)
imput = mice(nhanes2)
mi1 <- with(data=imput, expr=lm(bmi~age*hyp))
mi0 <- with(data=imput, expr=lm(bmi~age+hyp))
pc <- pool.compare(mi1, mi0, method="Wald")
then it returns the following error:
Error in pool(fit1) :
Different number of parameters: coef(fit): 6, vcov(fit): 5
It sounds like the variance-covariance matrix doesn't include the interaction term as its own variable. What's the best way around this?
The problem appears to be that some of your parameters are un-estimatable in some of your imputed data.sets. When I run the code, I see
( fit1<-mi1$analyses[[1]] )
# lm(formula = bmi ~ age * hyp)
#
# Coefficients:
# (Intercept) age2 age3 hyp2 age2:hyp2
# 28.425 -5.425 -3.758 1.200 3.300
# age3:hyp2
# NA
In this set, it was not possible to estimate age3*hyp2 (presumably because there were no observations in this group).
This causes the discrepancy in coef(fit1) and vcov(fit1) since the covariance cannot be estimated for that term.
What to do in this case is more of a statistical problem than a programming problem. If you are unsure of what would be appropriate for your data, I suggest you consult with the statisticians over at Cross Validated.

Exporting Linear Regression Results Including Confidence Intervals

Hey out there how can I can I export a table of the results used to make the chart I generated for this linear regression model below.
d <- data.frame(x=c(200110,86933,104429,240752,255332,75998,
204302,97321,342812,220522,110990,259706,65733),
y=c(200000,110000,165363,225362,313284,113972,
137449,113106,409020,261733,171300,344437,89000))
lm1 <- lm(y~x,data=d)
p_conf1 <- predict(lm1,interval="confidence")
nd <- data.frame(x=seq(0,80000,length=510000))
p_conf2 <- predict(lm1,interval="confidence",newdata=nd)
plot(y~x,data=d,ylim=c(-21750,600000),xlim=c(0,600000)) ## data
abline(lm1) ## fit
matlines(d$x,p_conf1[,c("lwr","upr")],col=2,lty=1,type="b",pch="+")
matlines(nd$x,p_conf2[,c("lwr","upr")],col=4,lty=1,type="b",pch="+")
Still not entirely sure what you want but this would seem to be reasonable:
dat1 <- data.frame(d,p_conf1)
dat2 <- data.frame(nd,y=NA,p_conf2)
write.csv(rbind(dat1,dat2),file="linpredout.csv")
It includes x, y (equal to the observation or NA for non-observed points), the predicted value fit, and lwr/upr bounds.
edit: fix typo.
This will return a matrix that has some of the information needed to construct the confidence intervals:
> coef(summary(lm1))
Estimate Std. Error t value Pr(>|t|)
(Intercept) 21749.037058 2.665203e+04 0.8160369 4.317954e-01
x 1.046954 1.374353e-01 7.6177997 1.037175e-05
Any text on linear regression should have the formula for the confidence interval. You may need to calculate some ancillary quantities dependent on which formula you're using. The code for predict is visible ... just type at the console :
predict.lm
And don't forget that confidence intervals are different than prediction intervals.

Stepwise regression using p-values to drop variables with nonsignificant p-values

I want to perform a stepwise linear Regression using p-values as a selection criterion, e.g.: at each step dropping variables that have the highest i.e. the most insignificant p-values, stopping when all values are significant defined by some threshold alpha.
I am totally aware that I should use the AIC (e.g. command step or stepAIC) or some other criterion instead, but my boss has no grasp of statistics and insist on using p-values.
If necessary, I could program my own routine, but I am wondering if there is an already implemented version of this.
Show your boss the following :
set.seed(100)
x1 <- runif(100,0,1)
x2 <- as.factor(sample(letters[1:3],100,replace=T))
y <- x1+x1*(x2=="a")+2*(x2=="b")+rnorm(100)
summary(lm(y~x1*x2))
Which gives :
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1525 0.3066 -0.498 0.61995
x1 1.8693 0.6045 3.092 0.00261 **
x2b 2.5149 0.4334 5.802 8.77e-08 ***
x2c 0.3089 0.4475 0.690 0.49180
x1:x2b -1.1239 0.8022 -1.401 0.16451
x1:x2c -1.0497 0.7873 -1.333 0.18566
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Now, based on the p-values you would exclude which one? x2 is most significant and most non-significant at the same time.
Edit : To clarify : This exaxmple is not the best, as indicated in the comments. The procedure in Stata and SPSS is AFAIK also not based on the p-values of the T-test on the coefficients, but on the F-test after removal of one of the variables.
I have a function that does exactly that. This is a selection on "the p-value", but not of the T-test on the coefficients or on the anova results. Well, feel free to use it if it looks useful to you.
#####################################
# Automated model selection
# Author : Joris Meys
# version : 0.2
# date : 12/01/09
#####################################
#CHANGE LOG
# 0.2 : check for empty scopevar vector
#####################################
# Function has.interaction checks whether x is part of a term in terms
# terms is a vector with names of terms from a model
has.interaction <- function(x,terms){
out <- sapply(terms,function(i){
sum(1-(strsplit(x,":")[[1]] %in% strsplit(i,":")[[1]]))==0
})
return(sum(out)>0)
}
# Function Model.select
# model is the lm object of the full model
# keep is a list of model terms to keep in the model at all times
# sig gives the significance for removal of a variable. Can be 0.1 too (see SPSS)
# verbose=T gives the F-tests, dropped var and resulting model after
model.select <- function(model,keep,sig=0.05,verbose=F){
counter=1
# check input
if(!is(model,"lm")) stop(paste(deparse(substitute(model)),"is not an lm object\n"))
# calculate scope for drop1 function
terms <- attr(model$terms,"term.labels")
if(missing(keep)){ # set scopevars to all terms
scopevars <- terms
} else{ # select the scopevars if keep is used
index <- match(keep,terms)
# check if all is specified correctly
if(sum(is.na(index))>0){
novar <- keep[is.na(index)]
warning(paste(
c(novar,"cannot be found in the model",
"\nThese terms are ignored in the model selection."),
collapse=" "))
index <- as.vector(na.omit(index))
}
scopevars <- terms[-index]
}
# Backward model selection :
while(T){
# extract the test statistics from drop.
test <- drop1(model, scope=scopevars,test="F")
if(verbose){
cat("-------------STEP ",counter,"-------------\n",
"The drop statistics : \n")
print(test)
}
pval <- test[,dim(test)[2]]
names(pval) <- rownames(test)
pval <- sort(pval,decreasing=T)
if(sum(is.na(pval))>0) stop(paste("Model",
deparse(substitute(model)),"is invalid. Check if all coefficients are estimated."))
# check if all significant
if(pval[1]<sig) break # stops the loop if all remaining vars are sign.
# select var to drop
i=1
while(T){
dropvar <- names(pval)[i]
check.terms <- terms[-match(dropvar,terms)]
x <- has.interaction(dropvar,check.terms)
if(x){i=i+1;next} else {break}
} # end while(T) drop var
if(pval[i]<sig) break # stops the loop if var to remove is significant
if(verbose){
cat("\n--------\nTerm dropped in step",counter,":",dropvar,"\n--------\n\n")
}
#update terms, scopevars and model
scopevars <- scopevars[-match(dropvar,scopevars)]
terms <- terms[-match(dropvar,terms)]
formul <- as.formula(paste(".~.-",dropvar))
model <- update(model,formul)
if(length(scopevars)==0) {
warning("All variables are thrown out of the model.\n",
"No model could be specified.")
return()
}
counter=counter+1
} # end while(T) main loop
return(model)
}
Why not try using the step() function specifying your testing method?
For example, for backward elimination, you type only a command:
step(FullModel, direction = "backward", test = "F")
and for stepwise selection, simply:
step(FullModel, direction = "both", test = "F")
This can display both the AIC values as well as the F and P values.
Here is an example. Start with the most complicated model: this includes interactions between all three explanatory variables.
model1 <-lm (ozone~temp*wind*rad)
summary(model1)
Coefficients:
Estimate Std.Error t value Pr(>t)
(Intercept) 5.683e+02 2.073e+02 2.741 0.00725 **
temp -1.076e+01 4.303e+00 -2.501 0.01401 *
wind -3.237e+01 1.173e+01 -2.760 0.00687 **
rad -3.117e-01 5.585e-01 -0.558 0.57799
temp:wind 2.377e-01 1.367e-01 1.739 0.08519
temp:rad 8.402e-03 7.512e-03 1.119 0.26602
wind:rad 2.054e-02 4.892e-02 0.420 0.47552
temp:wind:rad -4.324e-04 6.595e-04 -0.656 0.51358
The three-way interaction is clearly not significant. This is how you remove it, to begin the process of model simplification:
model2 <- update(model1,~. - temp:wind:rad)
summary(model2)
Depending on the results, you can continue simplifying your model:
model3 <- update(model2,~. - temp:rad)
summary(model3)
...
Alternatively you can use the automatic model simplification function step, to see
how well it does:
model_step <- step(model1)
Package rms: Regression Modeling Strategies has fastbw() that does exactly what you need. There is even a parameter to flip from AIC to p-value based elimination.
If you are just trying to get the best predictive model, then perhaps it doesn't matter too much, but for anything else, don't bother with this sort of model selection. It is wrong.
Use a shrinkage methods such as ridge regression (in lm.ridge() in package MASS for example), or the lasso, or the elasticnet (a combination of ridge and lasso constraints). Of these, only the lasso and elastic net will do some form of model selection, i.e. force the coefficients of some covariates to zero.
See the Regularization and Shrinkage section of the Machine Learning task view on CRAN.
As mentioned by Gavin Simpson the function fastbw from rms package can be used to select variables using the p-value. Bellow is an example using the example given by George Dontas. Use the option rule='p' to select p-value criteria.
require(rms)
model1 <- ols(Ozone ~ Temp * Wind * Solar.R, data=airquality)
model2 <- fastbw(fit=model1, rule="p", sls=0.05)
model2
olsrr package could be useful.
You can define pent (p-value to enter the model) and prem (p-value to remove)
The output gives all the metrics you would need, and beyond.

Resources