Imprecise p-values in stargazer - R

I want the significance stars in stargazer's regression output to match those in the "normal" output.
I generate data:
library("stargazer"); library("lmtest"); library("sandwich")
set.seed(1234)
df <- data.frame(y=1001:1100)
df$x <- c(1:70,-100:-71) + rnorm(100, 0, 74.8)
model <- lm(log(y) ~ x, data=df)
and get model estimates where the coefficient on x has a p-value of 0.1023:
coeftest(model, vcov = vcovHC(model, type="HC3"))
I want to have these results in LaTeX. Using the same function, I calculate heteroskedasticity-consistent standard errors and let stargazer use them.
stderr_HC3_model <- sqrt(diag(vcovHC(model, type = "HC3")))
stargazer(model, se=list(stderr_HC3_model))
The stargazer output has a star at the coefficient, indicating significance at alpha = 10%. I want stargazer to give the same result as coeftest. (This matters for comparability with Stata, where reg L_y x, vce(hc3) reproduces the coeftest results exactly.)
I played around with the stargazer options p.auto and t.auto, which did not help. When I execute "stargazer" I cannot view the underlying code, as is possible in other cases. What should I do?
Richard's answer helped me. Here are the steps I used to output more than one regression (say, ols_a and ols_b):
ses <- list(coeftest(ols_a, vcov = vcovHC(ols_a, type="HC3"))[,2],
            coeftest(ols_b, vcov = vcovHC(ols_b, type="HC3"))[,2])
pvals <- list(coeftest(ols_a, vcov = vcovHC(ols_a, type="HC3"))[,4],
              coeftest(ols_b, vcov = vcovHC(ols_b, type="HC3"))[,4])
stargazer(ols_a, ols_b, type="text", p=pvals, se=ses)

You need to provide the p-values associated with your coeftest. From the man page:
p a list of numeric vectors that will replace the default p-values for
each model. Matched by element names. These will form the basis of
decisions about significance stars
The following should work.
test <- coeftest(model, vcov = vcovHC(model, type="HC3"))
ses <- test[, 2]
pvals <- test[, 4]
stargazer(model, type="text", p=pvals, se=ses)
This provides the following.
===============================================
                        Dependent variable:    
                    ---------------------------
                              log(y)           
-----------------------------------------------
x                            -0.00005          
                                               
Constant                     6.956***          
                             (0.003)           
-----------------------------------------------
Observations                   100             
R2                            0.026            
Adjusted R2                   0.016            
Residual Std. Error      0.027 (df = 98)       
F Statistic           2.620 (df = 1; 98)       
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01

It may be a minor issue, but Richard's answer is not entirely correct:
his stargazer output reports neither standard errors nor potential significance stars for the variable x.
Also, when reporting only a single model, manually supplied coefficients, se, p, and t values still have to be wrapped in a list; otherwise stargazer will report an empty table.
The (slightly) corrected example:
test <- coeftest(model, vcov = vcovHC(model, type="HC3"))
ses <- list(test[, 2])
pvals <- list(test[, 4])
stargazer(model, type="text", p=pvals, se=ses)
Output (taken from a different model on the answerer's own data, but produced by the same corrected call pattern):
=======================================================================
                                            Dependent variable:        
                                    -----------------------------------
                                         Daily added investors         
                                               negative                
                                               binomial                
-----------------------------------------------------------------------
log(lag_raised_amount + 1)                    -0.466***                
                                               (0.124)                 
lag_target1                                   -0.661***                
                                               (0.134)                 
Constant                                      -3.480**                 
                                               (1.290)                 
-----------------------------------------------------------------------
Observations                                    6,513                  
Log Likelihood                                 -8,834                  
theta                                      1.840*** (0.081)            
Akaike Inf. Crit.                              17,924                  
=======================================================================
Note:                         + p<0.1; * p<0.05; ** p<0.01; *** p<0.001
Note: + p<0.1; * p<0.05; ** p<0.01; *** p<0.001

There are inherent dangers associated with the se argument.
When using this approach, be cautious with the arguments t.auto and p.auto, both of which default to TRUE. It would be prudent to set both to FALSE and supply the t and p values manually.
If you fail to do so, you risk getting significance stars that are out of sync with the displayed p-values. (I suspect that stargazer simply reuses the supplied se, which now differ from the default ones, and recomputes the displayed stars from that input, which naturally yields unexpected results.)
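A minimal sketch of that advice, reusing the model from the question: disable both auto flags and take se, t, and p from the same coeftest call, so everything stargazer displays is internally consistent:
test <- coeftest(model, vcov = vcovHC(model, type = "HC3"))
stargazer(model, type = "text",
          se = list(test[, 2]),  # HC3 standard errors
          t  = list(test[, 3]),  # matching t statistics
          p  = list(test[, 4]),  # matching p-values
          t.auto = FALSE, p.auto = FALSE)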
See also:
Displaying p-values instead of SEs in parenthesis

Delta method and clustered standard errors

I have a question regarding how to apply the delta method when I have clustered standard errors. Consider the following dataset and (simple) regression. (Please note that this question is not necessarily about whether it makes sense to cluster around "vs", nor about the correctness or usefulness of this regression.)
# Use packages
library(multiwayvcov)
library(sandwich)
library(lmtest)
library(msm)
# Load the data
data(mtcars)
# Run the regression
model1 <- lm(mpg ~ cyl + gear + drat, data = mtcars)
# Calculate the variance-covariance matrix for standard errors clustered on vs
# (note that this masks the base function vcov())
vcov <- cluster.vcov(model1, mtcars$vs)
coeftest(model1, vcov)
# Applying the delta method results in an error
g <- model1$coefficients[2] / model1$coefficients[1]
deltamethod(g, mean, cov = vcov, ses = TRUE)
# Error I get is this one: "Error in deltamethod(g, mean = g, cov = vcov, ses = TRUE) :
#   Covariances should be a 1 by 1 matrix"
Now I want to calculate the standard error for the coefficient on cyl divided by the intercept, using my matrix of standard errors clustered on vs (i.e. the vcov matrix). Does anyone know how to do this? I looked at this website, but for some reason I got an error when applying it (https://rdrr.io/rforge/msm/man/deltamethod.html). I appreciate any help.
Just editing the deltamethod call so that it outputs an answer; I don't know whether this answer actually makes sense for what you want to do. In deltamethod's formula interface, x1, x2, ... refer to the elements of mean in order, so ~ x2/x1 is the coefficient on cyl divided by the intercept.
deltamethod(
    g = ~ x2/x1,
    mean = model1$coefficients,
    cov = vcov,
    ses = TRUE)
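As a cross-check, the same quantity can be computed by hand: for g(b) = b2/b1 the gradient is (-b2/b1^2, 1/b1, 0, 0), and the delta-method variance is grad' V grad. A sketch reusing model1 and the clustered vcov from above:
b    <- model1$coefficients
grad <- c(-b[2] / b[1]^2, 1 / b[1], 0, 0)  # gradient of b2/b1 at the estimates
sqrt(t(grad) %*% vcov %*% grad)            # should match the deltamethod() result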

bootstrap standard errors of a linear regression in R

I have an lm object and I would like to bootstrap only its standard errors. In practice, I want to use only part of the sample (with replacement) at each replication and get a distribution of standard errors. Then, if possible, I would like to display the summary of the original linear regression but with the bootstrapped standard errors and the corresponding p-values (in other words, the same beta coefficients but different standard errors).
Edit: In summary, I want to "modify" my lm object so that it keeps the beta coefficients from the original lm object fitted on the original data, but uses the bootstrapped standard errors (and associated t-stats and p-values) obtained by running the regression many times on different subsamples (with replacement).
So my lm object looks like
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.812793   0.095282  40.016  < 2e-16 ***
x           -0.904729   0.284243  -3.183  0.00147 ** 
z            0.599258   0.009593  62.466  < 2e-16 ***
x:z          0.091511   0.029704   3.081  0.00208 ** 
but the associated standard errors are wrong, and I would like to estimate them by replicating this linear regression 1000 times (replications) on different subsample (with replacement).
Is there a way to do this? can anyone help me?
Thank you for your time.
Marco
What you ask can be done along the lines of the code below.
Since you have not posted an example dataset or the model to fit, I will use the built-in dataset mtcars and a simple formula with two continuous predictors.
library(boot)
boot_function <- function(data, indices, formula){
    # Refit the model on the resampled rows and return the standard errors
    d <- data[indices, ]
    obj <- lm(formula, d)
    coefs <- summary(obj)$coefficients
    coefs[, "Std. Error"]
}
set.seed(8527)
fmla <- as.formula("mpg ~ hp * cyl")
seboot <- boot(mtcars, boot_function, R = 1000, formula = fmla)
colMeans(seboot$t)
##[1] 6.511530646 0.068694001 1.000101450 0.008804784
I believe that it is possible to use the code above for most needs with numeric response and predictors.
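To get to the asker's stated goal, a summary-style table that keeps the original coefficients but swaps in the bootstrapped standard errors, a sketch along these lines should work (reusing seboot and fmla from above; the t and p values are recomputed from the new SEs):
fit  <- lm(fmla, data = mtcars)
est  <- coef(fit)
se_b <- colMeans(seboot$t)   # bootstrapped standard errors, one per coefficient
tval <- est / se_b
pval <- 2 * pt(abs(tval), df = fit$df.residual, lower.tail = FALSE)
cbind(Estimate = est, `Std. Error` = se_b, `t value` = tval, `Pr(>|t|)` = pval)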

Categorical Regression with Centered Levels

R's standard way of doing regression on categorical variables is to select one factor level as a reference level and constrain the effect of that level to be zero. Instead of constraining a single level's effect to be zero, I'd like to constrain the sum of the coefficients to be zero.
I can hack together coefficient estimates for this manually after fitting the model the standard way:
x <- lm(data = mtcars, mpg ~ factor(cyl))
z <- c(coef(x), "factor(cyl)4" = 0)
y <- mean(z[-1])
z[-1] <- z[-1] - y
z[1] <- z[1] + y
z
## (Intercept) factor(cyl)6 factor(cyl)8 factor(cyl)4
## 20.5021645 -0.7593074 -5.4021645 6.1614719
But that leaves me without standard error estimates for the former reference level that I just added as an explicit effect, and I need to have those as well.
I did some searching and found the contrasts functions, and tried
lm(data = mtcars, mpg ~ C(factor(cyl), contr = contr.sum))
but this still only produces two effect estimates. Is there a proper way to change which constraint R uses for linear regression on categorical variables?
Think I've figured it out. Using contrasts actually is the right way to go about it; you just need to do a little work to get the results into a convenient-looking form. Here's the fit:
fit <- lm(data = mtcars, mpg ~ C(factor(cyl), contr = contr.sum))
Then the contrast matrix cs <- contr.sum(levels(factor(mtcars$cyl))) is used to get the effect estimates and the standard errors.
The effect estimates just come from multiplying the contrast matrix by the effect estimates lm spits out, like so:
cs %*% coef(fit)[-1]
The standard errors can be calculated from the contrast matrix and the variance-covariance matrix of the coefficients; the diagonal gives the variances, so take the square root:
sqrt(diag(cs %*% vcov(fit)[-1,-1] %*% t(cs)))
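Putting the pieces together, a short consolidated sketch (same objects as above) that reports all three level effects with their standard errors under the sum-to-zero constraint:
# Fit with sum-to-zero contrasts, then recover one effect per cyl level
fit <- lm(mpg ~ C(factor(cyl), contr = contr.sum), data = mtcars)
cs  <- contr.sum(levels(factor(mtcars$cyl)))   # 3 x 2 contrast matrix
eff <- drop(cs %*% coef(fit)[-1])              # level effects, summing to zero
se  <- sqrt(diag(cs %*% vcov(fit)[-1, -1] %*% t(cs)))
cbind(effect = eff, se = se)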

Performing Anova on Bootstrapped Estimates from Quantile Regression

So I'm using the quantreg package in R to conduct quantile regression analyses to test how the effects of my predictors vary across the distribution of my outcome.
FML <- as.formula(outcome ~ VAR + c1 + c2 + c3)
quantiles <- c(0.25, 0.5, 0.75)
q.Result <- list()
for (i in quantiles){
    i.no <- which(quantiles == i)
    q.Result[[i.no]] <- rq(FML, tau = i, data, method = "fn", na.action = na.omit)
}
Then I call anova.rq, which runs a Wald test on all the models and outputs a p-value for each covariate, telling me whether the effects of each covariate vary significantly across the distribution of my outcome.
anova.Result <- anova(q.Result[[1]], q.Result[[2]], q.Result[[3]], joint=FALSE)
That works just fine. However, for my particular data (and in general?), bootstrapping my estimates and their errors is preferable, which I do with a slight modification of the code above.
Q.mod <- rq(FML, tau=quantiles, data, method="fn", na.action=na.omit)
q.Summary <- summary(Q.mod, se="boot", R=10000, bsmethod="mcmb",
                     covariance=TRUE)
Here's where I get stuck. quantreg currently cannot perform the anova (Wald) test on bootstrapped estimates. The package documentation specifically states that "extensions of the methods to be used in anova.rq should be made" regarding the bootstrapping method.
Looking at the details of the anova.rq method, I can see that it requires two components not present in the quantile model when bootstrapping:
1) Hinv (the inverse Hessian matrix). The documentation specifically states: "note that for se = "boot" there is no way to split the estimated covariance matrix into its sandwich constituent parts."
2) J, which, according to the documentation, is the "unscaled outer product of gradient matrix returned if cov=TRUE and se != "iid". The Huber sandwich is cov = tau (1-tau) Hinv %*% J %*% Hinv. as for the Hinv component, there is no J component when se == "boot". (Note that to make the Huber sandwich you need to add the tau (1-tau) mayonnaise yourself.)"
Can I calculate or estimate Hinv and J from the bootstrapped estimates? If not, what is the best way to proceed?
Any help on this is much appreciated. This is my first time posting a question here, though I've greatly benefited from the answers to other people's questions in the past.
For question 2: you can use R = for resampling. For example:
anova(object, ..., test = "Wald", joint = TRUE, score = "tau", se = "nid",
      R = 10000, trim = NULL)
where R is the number of resampling replications for the anowar form of the test, used to estimate the reference distribution for the test statistic.
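Applied to the models from the loop above, a hedged usage sketch (the resampled "anowar" form of the test; exact option support may vary across quantreg versions):
anova(q.Result[[1]], q.Result[[2]], q.Result[[3]],
      test = "anowar", R = 10000)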
Just a heads up, you'll probably get a better response to your questions if you only include 1 question per post.
Consulted with a colleague, and he confirmed that it was unlikely that Hinv and J could be 'reverse'-computed from bootstrapped estimates. However, we resolved that estimates from different taus could be compared using a Wald test, as follows.
From the rqs object summarized by
q.Summary <- summary(Q.mod, se="boot", R=10000, bsmethod="mcmb", covariance=TRUE)
you extract the bootstrapped beta values for the variable of interest (in this case VAR, the first covariate in FML) for each tau:
boot.Bs <- sapply(q.Summary, function (x) x[["B"]][,2])
B0 <- coef(summary(lm(FML, data)))[2,1] # extract the linear (OLS) estimate
Then compute the Wald statistic and get the p-value, using the number of quantiles as the degrees of freedom:
Wald <- sum(apply(boot.Bs, 2, function (x) ((mean(x)-B0)^2)/var(x)))
Pvalue <- pchisq(Wald, ncol(boot.Bs), lower=FALSE)
You also want to verify that the bootstrapped betas are normally distributed; if you're running many taus it can be cumbersome to check all those QQ plots, so just sum them by row:
qqnorm(apply(boot.Bs, 1, sum))
qqline(apply(boot.Bs, 1, sum), col = 2)
This seems to be working; if anyone can think of anything wrong with my solution, please share.

Stepwise regression using p-values to drop variables with nonsignificant p-values

I want to perform a stepwise linear regression using p-values as a selection criterion, e.g. at each step dropping the variable with the highest (i.e. most insignificant) p-value, stopping when all remaining variables are significant at some threshold alpha.
I am totally aware that I should use the AIC (e.g. command step or stepAIC) or some other criterion instead, but my boss has no grasp of statistics and insists on using p-values.
If necessary, I could program my own routine, but I am wondering if there is an already-implemented version of this.
Show your boss the following:
set.seed(100)
x1 <- runif(100,0,1)
x2 <- as.factor(sample(letters[1:3],100,replace=T))
y <- x1+x1*(x2=="a")+2*(x2=="b")+rnorm(100)
summary(lm(y~x1*x2))
Which gives:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -0.1525     0.3066  -0.498  0.61995    
x1             1.8693     0.6045   3.092  0.00261 ** 
x2b            2.5149     0.4334   5.802 8.77e-08 ***
x2c            0.3089     0.4475   0.690  0.49180    
x1:x2b        -1.1239     0.8022  -1.401  0.16451    
x1:x2c        -1.0497     0.7873  -1.333  0.18566    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Now, based on the p-values, which one would you exclude? x2 is the most significant and the most non-significant at the same time.
Edit: To clarify, this example is not the best, as indicated in the comments. The procedure in Stata and SPSS is, AFAIK, also not based on the p-values of the t-test on the coefficients, but on the F-test after removal of one of the variables. A drop1() call illustrates that term-level F-test, as shown below.
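For a concrete look at that term-level test, drop1() with test = "F" reports one F-test p-value per droppable term rather than per coefficient. A minimal sketch using the toy data above (by marginality, only the interaction is droppable here):
drop1(lm(y ~ x1*x2), test = "F")  # one p-value for the whole x1:x2 term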
I have a function that does exactly that. It is a selection on "the p-value", but not of the t-test on the coefficients: it uses the F-tests from drop1(). Well, feel free to use it if it looks useful to you.
#####################################
# Automated model selection
# Author : Joris Meys
# version : 0.2
# date : 12/01/09
#####################################
# CHANGE LOG
# 0.2 : check for empty scopevar vector
#####################################

# Function has.interaction checks whether x is part of a term in terms
# terms is a vector with names of terms from a model
has.interaction <- function(x, terms){
    out <- sapply(terms, function(i){
        sum(1 - (strsplit(x, ":")[[1]] %in% strsplit(i, ":")[[1]])) == 0
    })
    return(sum(out) > 0)
}

# Function model.select
# model is the lm object of the full model
# keep is a list of model terms to keep in the model at all times
# sig gives the significance for removal of a variable. Can be 0.1 too (see SPSS)
# verbose=T gives the F-tests, dropped var and resulting model after each step
model.select <- function(model, keep, sig=0.05, verbose=F){
    counter <- 1
    # check input
    if(!is(model, "lm")) stop(paste(deparse(substitute(model)), "is not an lm object\n"))
    # calculate scope for drop1 function
    terms <- attr(model$terms, "term.labels")
    if(missing(keep)){ # set scopevars to all terms
        scopevars <- terms
    } else { # select the scopevars if keep is used
        index <- match(keep, terms)
        # check if all is specified correctly
        if(sum(is.na(index)) > 0){
            novar <- keep[is.na(index)]
            warning(paste(
                c(novar, "cannot be found in the model",
                  "\nThese terms are ignored in the model selection."),
                collapse=" "))
            index <- as.vector(na.omit(index))
        }
        scopevars <- terms[-index]
    }
    # Backward model selection:
    while(TRUE){
        # extract the test statistics from drop1
        test <- drop1(model, scope=scopevars, test="F")
        if(verbose){
            cat("-------------STEP ", counter, "-------------\n",
                "The drop statistics : \n")
            print(test)
        }
        pval <- test[, dim(test)[2]]
        names(pval) <- rownames(test)
        pval <- sort(pval, decreasing=T)
        if(sum(is.na(pval)) > 0) stop(paste("Model", deparse(substitute(model)),
            "is invalid. Check if all coefficients are estimated."))
        # check if all significant
        if(pval[1] < sig) break # stops the loop if all remaining vars are sign.
        # select var to drop
        i <- 1
        while(TRUE){
            dropvar <- names(pval)[i]
            check.terms <- terms[-match(dropvar, terms)]
            x <- has.interaction(dropvar, check.terms)
            if(x){ i <- i + 1; next } else { break }
        } # end while(TRUE) drop var
        if(pval[i] < sig) break # stops the loop if var to remove is significant
        if(verbose){
            cat("\n--------\nTerm dropped in step", counter, ":", dropvar, "\n--------\n\n")
        }
        # update terms, scopevars and model
        scopevars <- scopevars[-match(dropvar, scopevars)]
        terms <- terms[-match(dropvar, terms)]
        formul <- as.formula(paste(".~.-", dropvar))
        model <- update(model, formul)
        if(length(scopevars) == 0){
            warning("All variables are thrown out of the model.\n",
                    "No model could be specified.")
            return()
        }
        counter <- counter + 1
    } # end while(TRUE) main loop
    return(model)
}
Why not try using the step() function, specifying your testing method?
For example, for backward elimination, you type only a command:
step(FullModel, direction = "backward", test = "F")
and for stepwise selection, simply:
step(FullModel, direction = "both", test = "F")
This displays the F and p values alongside the AIC values (note that the selection itself is still driven by AIC).
Here is an example. Start with the most complicated model: this includes interactions between all three explanatory variables.
model1 <- lm(ozone ~ temp*wind*rad)
summary(model1)
Coefficients:
                Estimate Std. Error t value Pr(>|t|)   
(Intercept)    5.683e+02  2.073e+02   2.741  0.00725 **
temp          -1.076e+01  4.303e+00  -2.501  0.01401 * 
wind          -3.237e+01  1.173e+01  -2.760  0.00687 **
rad           -3.117e-01  5.585e-01  -0.558  0.57799   
temp:wind      2.377e-01  1.367e-01   1.739  0.08519   
temp:rad       8.402e-03  7.512e-03   1.119  0.26602   
wind:rad       2.054e-02  4.892e-02   0.420  0.47552   
temp:wind:rad -4.324e-04  6.595e-04  -0.656  0.51358   
The three-way interaction is clearly not significant. This is how you remove it, to begin the process of model simplification:
model2 <- update(model1,~. - temp:wind:rad)
summary(model2)
Depending on the results, you can continue simplifying your model:
model3 <- update(model2,~. - temp:rad)
summary(model3)
...
Alternatively, you can use the automatic model simplification function step to see how well it does:
model_step <- step(model1)
Package rms: Regression Modeling Strategies has fastbw() that does exactly what you need. There is even a parameter to flip from AIC to p-value based elimination.
If you are just trying to get the best predictive model, then perhaps it doesn't matter too much, but for anything else, don't bother with this sort of model selection. It is wrong.
Use shrinkage methods such as ridge regression (lm.ridge() in package MASS, for example), the lasso, or the elastic net (a combination of ridge and lasso constraints). Of these, only the lasso and elastic net do some form of model selection, i.e. force the coefficients of some covariates to zero.
See the Regularization and Shrinkage section of the Machine Learning task view on CRAN.
As mentioned by Gavin Simpson, the function fastbw from the rms package can be used to select variables using the p-value. Below is an example using the example given by George Dontas. Use the option rule="p" to select the p-value criterion.
require(rms)
model1 <- ols(Ozone ~ Temp * Wind * Solar.R, data=airquality)
model2 <- fastbw(fit=model1, rule="p", sls=0.05)
model2
The olsrr package could be useful.
You can define pent (the p-value needed for a variable to enter the model) and prem (the p-value above which a variable is removed).
The output gives all the metrics you would need, and more.
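A hedged sketch of that approach (function and argument names as in older olsrr releases; recent versions renamed the arguments, so check your installed version):
library(olsrr)
model <- lm(mpg ~ disp + hp + wt + qsec, data = mtcars)
# Backward elimination: drop variables whose p-value exceeds prem
ols_step_backward_p(model, prem = 0.05)
# Stepwise in both directions: enter at pent, remove at prem
ols_step_both_p(model, pent = 0.05, prem = 0.1)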
