Stepwise regression using p-values to drop variables with nonsignificant p-values - r

I want to perform a stepwise linear Regression using p-values as a selection criterion, e.g.: at each step dropping variables that have the highest i.e. the most insignificant p-values, stopping when all values are significant defined by some threshold alpha.
I am totally aware that I should use the AIC (e.g. command step or stepAIC) or some other criterion instead, but my boss has no grasp of statistics and insist on using p-values.
If necessary, I could program my own routine, but I am wondering if there is an already implemented version of this.

Show your boss the following :
set.seed(100)
x1 <- runif(100,0,1)
x2 <- as.factor(sample(letters[1:3],100,replace=T))
y <- x1+x1*(x2=="a")+2*(x2=="b")+rnorm(100)
summary(lm(y~x1*x2))
Which gives :
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.1525 0.3066 -0.498 0.61995
x1 1.8693 0.6045 3.092 0.00261 **
x2b 2.5149 0.4334 5.802 8.77e-08 ***
x2c 0.3089 0.4475 0.690 0.49180
x1:x2b -1.1239 0.8022 -1.401 0.16451
x1:x2c -1.0497 0.7873 -1.333 0.18566
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Now, based on the p-values you would exclude which one? x2 is most significant and most non-significant at the same time.
Edit : To clarify : This exaxmple is not the best, as indicated in the comments. The procedure in Stata and SPSS is AFAIK also not based on the p-values of the T-test on the coefficients, but on the F-test after removal of one of the variables.
I have a function that does exactly that. This is a selection on "the p-value", but not of the T-test on the coefficients or on the anova results. Well, feel free to use it if it looks useful to you.
#####################################
# Automated model selection
# Author : Joris Meys
# version : 0.2
# date : 12/01/09
#####################################
#CHANGE LOG
# 0.2 : check for empty scopevar vector
#####################################
# Function has.interaction checks whether x is part of a term in terms
# terms is a vector with names of terms from a model
has.interaction <- function(x,terms){
out <- sapply(terms,function(i){
sum(1-(strsplit(x,":")[[1]] %in% strsplit(i,":")[[1]]))==0
})
return(sum(out)>0)
}
# Function Model.select
# model is the lm object of the full model
# keep is a list of model terms to keep in the model at all times
# sig gives the significance for removal of a variable. Can be 0.1 too (see SPSS)
# verbose=T gives the F-tests, dropped var and resulting model after
model.select <- function(model,keep,sig=0.05,verbose=F){
counter=1
# check input
if(!is(model,"lm")) stop(paste(deparse(substitute(model)),"is not an lm object\n"))
# calculate scope for drop1 function
terms <- attr(model$terms,"term.labels")
if(missing(keep)){ # set scopevars to all terms
scopevars <- terms
} else{ # select the scopevars if keep is used
index <- match(keep,terms)
# check if all is specified correctly
if(sum(is.na(index))>0){
novar <- keep[is.na(index)]
warning(paste(
c(novar,"cannot be found in the model",
"\nThese terms are ignored in the model selection."),
collapse=" "))
index <- as.vector(na.omit(index))
}
scopevars <- terms[-index]
}
# Backward model selection :
while(T){
# extract the test statistics from drop.
test <- drop1(model, scope=scopevars,test="F")
if(verbose){
cat("-------------STEP ",counter,"-------------\n",
"The drop statistics : \n")
print(test)
}
pval <- test[,dim(test)[2]]
names(pval) <- rownames(test)
pval <- sort(pval,decreasing=T)
if(sum(is.na(pval))>0) stop(paste("Model",
deparse(substitute(model)),"is invalid. Check if all coefficients are estimated."))
# check if all significant
if(pval[1]<sig) break # stops the loop if all remaining vars are sign.
# select var to drop
i=1
while(T){
dropvar <- names(pval)[i]
check.terms <- terms[-match(dropvar,terms)]
x <- has.interaction(dropvar,check.terms)
if(x){i=i+1;next} else {break}
} # end while(T) drop var
if(pval[i]<sig) break # stops the loop if var to remove is significant
if(verbose){
cat("\n--------\nTerm dropped in step",counter,":",dropvar,"\n--------\n\n")
}
#update terms, scopevars and model
scopevars <- scopevars[-match(dropvar,scopevars)]
terms <- terms[-match(dropvar,terms)]
formul <- as.formula(paste(".~.-",dropvar))
model <- update(model,formul)
if(length(scopevars)==0) {
warning("All variables are thrown out of the model.\n",
"No model could be specified.")
return()
}
counter=counter+1
} # end while(T) main loop
return(model)
}

Why not try using the step() function specifying your testing method?
For example, for backward elimination, you type only a command:
step(FullModel, direction = "backward", test = "F")
and for stepwise selection, simply:
step(FullModel, direction = "both", test = "F")
This can display both the AIC values as well as the F and P values.

Here is an example. Start with the most complicated model: this includes interactions between all three explanatory variables.
model1 <-lm (ozone~temp*wind*rad)
summary(model1)
Coefficients:
Estimate Std.Error t value Pr(>t)
(Intercept) 5.683e+02 2.073e+02 2.741 0.00725 **
temp -1.076e+01 4.303e+00 -2.501 0.01401 *
wind -3.237e+01 1.173e+01 -2.760 0.00687 **
rad -3.117e-01 5.585e-01 -0.558 0.57799
temp:wind 2.377e-01 1.367e-01 1.739 0.08519
temp:rad 8.402e-03 7.512e-03 1.119 0.26602
wind:rad 2.054e-02 4.892e-02 0.420 0.47552
temp:wind:rad -4.324e-04 6.595e-04 -0.656 0.51358
The three-way interaction is clearly not significant. This is how you remove it, to begin the process of model simplification:
model2 <- update(model1,~. - temp:wind:rad)
summary(model2)
Depending on the results, you can continue simplifying your model:
model3 <- update(model2,~. - temp:rad)
summary(model3)
...
Alternatively you can use the automatic model simplification function step, to see
how well it does:
model_step <- step(model1)

Package rms: Regression Modeling Strategies has fastbw() that does exactly what you need. There is even a parameter to flip from AIC to p-value based elimination.

If you are just trying to get the best predictive model, then perhaps it doesn't matter too much, but for anything else, don't bother with this sort of model selection. It is wrong.
Use a shrinkage methods such as ridge regression (in lm.ridge() in package MASS for example), or the lasso, or the elasticnet (a combination of ridge and lasso constraints). Of these, only the lasso and elastic net will do some form of model selection, i.e. force the coefficients of some covariates to zero.
See the Regularization and Shrinkage section of the Machine Learning task view on CRAN.

As mentioned by Gavin Simpson the function fastbw from rms package can be used to select variables using the p-value. Bellow is an example using the example given by George Dontas. Use the option rule='p' to select p-value criteria.
require(rms)
model1 <- ols(Ozone ~ Temp * Wind * Solar.R, data=airquality)
model2 <- fastbw(fit=model1, rule="p", sls=0.05)
model2

olsrr package could be useful.
You can define pent (p-value to enter the model) and prem (p-value to remove)
The output gives all the metrics you would need, and beyond.

Related

Quasi-Poisson mixed-effect model on overdispersed count data from multiple imputed datasets in R

I'm dealing with problems of three parts that I can solve separately, but now I need to solve them together:
extremely skewed, over-dispersed dependent count variable (the number of incidents while doing something),
necessity to include random effects,
lots of missing values -> multiple imputation -> 10 imputed datasets.
To solve the first two parts, I chose a quasi-Poisson mixed-effect model. Since stats::glm isn't able to include random effects properly (or I haven't figured it out) and lme4::glmer doesn't support the quasi-families, I worked with glmer(family = "poisson") and then adjusted the std. errors, z statistics and p-values as recommended here and discussed here. So I basically turn Poisson mixed-effect regression into quasi-Poisson mixed-effect regression "by hand".
This is all good with one dataset. But I have 10 of them.
I roughly understand the procedure of analyzing multiple imputed datasets – 1. imputation, 2. model fitting, 3. pooling results (I'm using mice library). I can do these steps for a Poisson regression but not for a quasi-Poisson mixed-effect regression. Is it even possible to A) pool across models based on a quasi-distribution, B) get residuals from a pooled object (class "mipo")? I'm not sure. Also I'm not sure how to understand the pooled results for mixed models (I miss random effects in the pooled output; although I've found this page which I'm currently trying to go through).
Can I get some help, please? Any suggestions on how to complete the analysis (addressing all three issues above) would be highly appreciated.
Example of data is here (repre_d_v1 and repre_all_data are stored in there) and below is a crucial part of my code.
library(dplyr); library(tidyr); library(tidyverse); library(lme4); library(broom.mixed); library(mice)
# please download "qP_data.RData" from the last link above and load them
## ===========================================================================================
# quasi-Poisson mixed model from single data set (this is OK)
# first run Poisson regression on df "repre_d_v1", then turn it into quasi-Poisson
modelSingle = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
family = "poisson",
data = repre_d_v1)
# I know there are some warnings but it's because I share only a modified subset of data with you (:
printCoefmat(coef(summary(modelSingle))) # unadjusted coefficient table
# define quasi-likelihood adjustment function
quasi_table = function(model, ctab = coef(summary(model))) {
phi = sum(residuals(model, type = "pearson")^2) / df.residual(model)
qctab = within(as.data.frame(ctab),
{`Std. Error` = `Std. Error`*sqrt(phi)
`z value` = Estimate/`Std. Error`
`Pr(>|z|)` = 2*pnorm(abs(`z value`), lower.tail = FALSE)
})
return(qctab)
}
printCoefmat(quasi_table(modelSingle)) # done, makes sense
## ===========================================================================================
# now let's work with more than one data set
# object "repre_all_data" of class "mids" contains 10 imputed data sets
# fit model using with() function, then pool()
modelMultiple = with(data = repre_all_data,
expr = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
family = "poisson"))
summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
# this has quite similar structure as coef(summary(someGLM))
# but I don't see where are the random effects?
# and more importantly, I wanted a quasi-Poisson model, not just Poisson model...
# ...but here it is not possible to use quasi_table function (defined earlier)...
# ...and that's because I can't compute "phi"
This seems reasonable, with the caveat that I'm only thinking about the computation, not whether this makes statistical sense. What I'm doing here is computing the dispersion for each of the individual fits and then applying it to the summary table, using a variant of the machinery that you posted above.
## compute dispersion values
phivec <- vapply(modelMultiple$analyses,
function(model) sum(residuals(model, type = "pearson")^2) / df.residual(model),
FUN.VALUE = numeric(1))
phi_mean <- mean(phivec)
ss <- summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
## adjust
qctab <- within(as.data.frame(ss),
{ std.error <- std.error*sqrt(phi_mean)
statistic <- estimate/std.error
p.value <- 2*pnorm(abs(statistic), lower.tail = FALSE)
})
The results look weird (dispersion < 1, all model results identical), but I'm assuming that's because you gave us a weird subset as a reproducible example ...

bootstrap standard errors of a linear regression in R

I have a lm object and I would like to bootstrap only its standard errors. In practice I want to use only part of the sample (with replacement) at each replication and get a distribution of standard erros. Then, if possible, I would like to display the summary of the original linear regression but with the bootstrapped standard errors and the corresponding p-values (in other words same beta coefficients but different standard errors).
Edited: In summary I want to "modify" my lm object by having the same beta coefficients of the original lm object that I ran on the original data, but having the bootstrapped standard errors (and associated t-stats and p-values) obtained by computing this lm regression several times on different subsamples (with replacement).
So my lm object looks like
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.812793 0.095282 40.016 < 2e-16 ***
x -0.904729 0.284243 -3.183 0.00147 **
z 0.599258 0.009593 62.466 < 2e-16 ***
x*z 0.091511 0.029704 3.081 0.00208 **
but the associated standard errors are wrong, and I would like to estimate them by replicating this linear regression 1000 times (replications) on different subsample (with replacement).
Is there a way to do this? can anyone help me?
Thank you for your time.
Marco
What you ask can be done following the line of the code below.
Since you have not posted an example dataset nor the model to fit, I will use the built in dataset mtcars an a simple formula with two continuous predictors.
library(boot)
boot_function <- function(data, indices, formula){
d <- data[indices, ]
obj <- lm(formula, d)
coefs <- summary(obj)$coefficients
coefs[, "Std. Error"]
}
set.seed(8527)
fmla <- as.formula("mpg ~ hp * cyl")
seboot <- boot(mtcars, boot_function, R = 1000, formula = fmla)
colMeans(seboot$t)
##[1] 6.511530646 0.068694001 1.000101450 0.008804784
I believe that it is possible to use the code above for most needs with numeric response and predictors.

Comparing nested mice models with interaction terms

R's mice contains a function, pool.compare, to compare nested models fit to imputed objects. If I try to include an interaction term:
library(mice)
imput = mice(nhanes2)
mi1 <- with(data=imput, expr=lm(bmi~age*hyp))
mi0 <- with(data=imput, expr=lm(bmi~age+hyp))
pc <- pool.compare(mi1, mi0, method="Wald")
then it returns the following error:
Error in pool(fit1) :
Different number of parameters: coef(fit): 6, vcov(fit): 5
It sounds like the variance-covariance matrix doesn't include the interaction term as its own variable. What's the best way around this?
The problem appears to be that some of your parameters are un-estimatable in some of your imputed data.sets. When I run the code, I see
( fit1<-mi1$analyses[[1]] )
# lm(formula = bmi ~ age * hyp)
#
# Coefficients:
# (Intercept) age2 age3 hyp2 age2:hyp2
# 28.425 -5.425 -3.758 1.200 3.300
# age3:hyp2
# NA
In this set, it was not possible to estimate age3*hyp2 (presumably because there were no observations in this group).
This causes the discrepancy in coef(fit1) and vcov(fit1) since the covariance cannot be estimated for that term.
What to do in this case is more of a statistical problem than a programming problem. If you are unsure of what would be appropriate for your data, I suggest you consult with the statisticians over at Cross Validated.

obtaining average coefficients and adj. R^2 from multiple pooled regressions using lapply

I have performed multiple Pooled regressions with a loop function and stored the regression output in a list (myregression). What I would like to do now is to efficiently perform the coeftest function in the lmtest package over all my regressions (i.e., myregression list) to adjust standard errors and t-statistics. Finally I would like to obtain the mean of the coefficients, standard errors and t-values.
Here is what I came up so far:
library(plm)
data("Grunfeld", package="plm")
# Store each subset regression in myregression
myregression <- list()
count <- 1
# Regression on six-year subsets of Grunfeld
for(t in 1940:1950){
myregression[[count]] <- plm(inv ~ value + capital,
subset(Grunfeld, year<=t & year>=t-5),
index=c("firm","year"))
# Name each regression based on the year range included in the data subset
names(myregression)[[count]] = paste0("Year_",t)
count <- count+1
}
Here is where my problems kick in: Although i'm able to perform the coeftest function to invidiual components of the list, I'm unable to code the lapply function accordingly.
## Apply coeftest function to all plm-objects
library(lmtest)
coeftest(myregression$Year_1940, vcov=function(x) vcovSCC(x, type="HC3", maxlag=4))
coeftest(myregression$Year_1941, vcov=function(x) vcovSCC(x, type="HC3", maxlag=4))
COEFTEST<-lapply(myregression, coeftest(x, vcov=function(x) vcovSCC(x, type="HC3", maxlag=4)))
## obtaining average coefficients, se's,and t values over all regressions
lapply(COEFTEST, mean)
I hope there is only a minor mistake that I'm unable to see.
I further noticed that the plm regression output is smaller than regular lm output is there another way to obtain mean adj. R^2?
Try
COEFTEST<-lapply(myregression, function(x) coeftest(x, vcov=vcovSCC(x, type="HC3", maxlag=4)))
Thsi will give you a list of coeftest-outputs for each regression, which you can then use in whatever way you want.
As a sidenote, make sure that whatever you do with this output makes sense. Taking the mean of an coeftest-output is not obvously sensible to me. If you want to have the average over all coefficients, try something like
mean(sapply(COEFTEST, function(x) x["capital", "Estimate"]))
Here, sapply retrieves all estimates for the variable capital from the COEFTEST outputs and puts them into a vector.
To access other elements, it is helpful to look at the structure of the objects using str(). For instance, str(summary(myregression[[1]])) reveals that the r-squares are saved under the name r.squared. These you can access using e.g. summary(myregression[[1]])$r.squared for the first regression output. Automating this, you can again use a construct as above, e.g.
sapply(myregression, function(x) summary(x)$r.squared)
There is a nicer way to obtain average coefficients (technically known as Mean Group estimators) using pmg() in plm. And from the looks of it, you are trying to estimate Fama-MacBeth regressions and SEs.
require(foreign)
require(plm)
require(lmtest)
test <- read.dta("http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.dta")
fpmg <- pmg(y~x, test, index=c("year","firmid")) ##Fama-MacBeth
> ##Fama-MacBeth
> coeftest(fpmg)
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.031278 0.023356 1.3392 0.1806
x 1.035586 0.033342 31.0599 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
For further details see Fama-MacBeth and Cluster-Robust (by Firm and Time) Standard Errors in R.
See also:
Fama MacBeth standard errors in R

Transforming data to normality. What is the best function for a given case?

Is there a function or a package that allows to look for the best (or one of the best) variable transformation in order to make model's residuals as normal as possible?
For example:
frml = formula(some_tranformation(A) ~ B+I(B^2)+B:C+C)
model = aov(formula, data=data)
shapiro.test(residuals(model))
Is there a function that tells what is the function some_transformation() that optimizes the normality of the residuals?
You mean like the Box-Cox transformation?
library(car)
m0 <- lm(cycles ~ len + amp + load, Wool)
plot(m0, which=2)
# Box Cox Method, univariate
summary(p1 <- powerTransform(m0))
# bcPower Transformation to Normality
#
# Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
# Y1 -0.0592 0.0611 -0.1789 0.0606
#
# Likelihood ratio tests about transformation parameters
# LRT df pval
# LR test, lambda = (0) 0.9213384 1 0.3371238
# LR test, lambda = (1) 84.0756559 1 0.0000000
# fit linear model with transformed response:
coef(p1, round=TRUE)
summary(m1 <- lm(bcPower(cycles, p1$roundlam) ~ len + amp + load, Wool))
plot(m1, which=2)
Unfortunately this is not a solved problem in statistics. What user #statquant has suggested is pretty much the best you can do, however it is not without its own pitfalls.
One important thing to note is that tests for normality, like shapiro.test are very sensitive to changes once you get reasonable sample sizes (i.e. in the hundreds), so you should not blindly rely on them.
Myself, i've thrown the problem in the too hard basket. If the data doesn't look at least normally distributed, then I would try to find a non-parametric version of the statistics you want to run on the data.

Resources