Command for finding the best linear model in R - r

Is there a way to get R to run all possible models (with all combinations of variables in a dataset) to produce the best/most accurate linear model and then output that model?
I feel like there is a way to do this, but I am having a hard time finding the information.

There are numerous ways this could be achieved, but for a simple way of doing this I would suggest that you have a look at the glmulti package, which is described in detail in this paper:
glmulti: An R Package for Easy Automated Model Selection with (Generalized) Linear Models
Alternatively, very simple example of the model selection as available on the Quick-R website:
# Stepwise Regression
library(MASS)
fit <- lm(y~x1+x2+x3,data=mydata)
step <- stepAIC(fit, direction="both")
step$anova # display results
Or to simplify even more, you can do more manual model comparison:
fit1 <- lm(y ~ x1 + x2 + x3 + x4, data=mydata)
fit2 <- lm(y ~ x1 + x2, data=mydata)
anova(fit1, fit2)

This should get you started. Although you should read my comment from above. This should build you a model based on all the data in your dataset and then compare all of the models with AIC and BIC.
# create a NULL vector called model so we have something to add our layers to
model=NULL
# create a vector of the dataframe column names used to build the formula
vars = names(data)
# remove variable names you don’t want to use (at least
# the response variable (if its in the first column)
vars = vars[-1]
# the combn function will run every different combination of variables and then run the glm
for(i in 1:length(vars)){
xx = combn(vars,i)
if(is.null(dim(xx))){
fla = paste("y ~", paste(xx, collapse="+"))
model[[length(model)+1]]=glm(as.formula(fla),data=data)
} else {
for(j in 1:dim(xx)[2]){
fla = paste("y ~", paste(xx[1:dim(xx)[1],j], collapse="+"))
model[[length(model)+1]]=glm(as.formula(fla),data=data)
}
}
}
# see how many models were build using the loop above
length(model)
# create a vector to extract AIC and BIC values from the model variable
AICs = NULL
BICs = NULL
for(i in 1:length(model)){
AICs[i] = AIC(model[[i]])
BICs[i] = BIC(model[[i]])
}
#see which models were chosen as best by both methods
which(AICs==min(AICs))
which(BICs==min(BICs))

I ended up running forwards, backwards, and stepwise procedures on data to select models and then comparing them based on AIC, BIC, and adj. R-sq. This method seemed most efficient. However, when I received the actual data to be used (the program I was writing was for business purposes), I was told to only model each explanatory variable against the response, so I was able to just call lm(response ~ explanatory) for each variable in question, since the analysis we ended up using it for wasn't worried about how they interacted with each other.

This is a very old question, but for those who are still encountering this discussion - the package olsrr and specifically the function ols_step_all_possible exhaustively produces an ols model for all possible subsets of variables, based on an lm object (such that by feeding it with a full model you will get all possible combinations), and returns a dataframe with R squared, adjusted R squared, aic, bic, etc. for all the models. This is very helpful in finding the best predictors but it is also very much time consuming.
see https://olsrr.rsquaredacademy.com/reference/ols_step_all_possible.html
I do not recommend just "cherry picking" the best performing model, rather I would actually look at the output and choose carefully for the most reasonable outcome. In case you would want to immediately get the best performing model (by some criteria, say number of predictors and R2) you may write a function that saves the dataframe, arranges it by number of predictors and orders it by descending R2 and spits out the top result.

The dredge() function in R also accomplishes this.

Related

R: MICE and backwards stepwise regression

I've imputed my data using the following code:
data_imp <- mice(data, m=5, maxit=50, meth='pmm', seed=500, printFlag=FALSE)
data.impute <- complete(data_imp, action = 1)
I wanted to perform backwards stepwise regression using the stepAIC function in order to find the most parsimonious model. How can I do this using all 5 of my imputed datasets, rather than just 1?
Thank you very much!
You'd have to apply it to each dataset separately; see below for some example code.
However, let me also give you two MASSIVE disclaimers here:
Backwards stepwise regression is really really not recommended for variable selection. In addition, there are better ways to do this for imputed datasets.
From the code below, you would still have to decide on HOW to pool your results into one interpretable set. One way would be to simply count how often each variable ends up in the final model. However, this procedure implicitly carries a loss of information.
A more extensive discussion of these points can be found here:
https://stats.stackexchange.com/questions/110585/stepwise-regression-modeling-using-multiply-imputed-data-sets
The author of mice also has a subchapter on variable selection in his book:
https://stefvanbuuren.name/fimd/sec-stepwise.html
I would thus consider whether there are better options out there for you.
Example code
## I am using `mtcars`
## Let's ampute it, then impute it
data_imp <- mice(ampute(mtcars, prop = 0.001)$amp)
## Next, we loop over all imputed datasets
out <- lapply(seq_len(data_imp$m), function(i) {
## We create a dataset
data.i <- complete(data_imp, i)
## We run our model
fit <- lm(mpg ~ ., data = data.i)
## We apply `stepAIC`
stepAIC(fit, trace = FALSE)
})

R: Regressions with group fixed effects and clustered standard errors with imputed dataset

I am trying to run regressions in R (multiple models - poisson, binomial and continuous) that include fixed effects of groups (e.g. schools) to adjust for general group-level differences (essentially demeaning by group) and that cluster standard errors to account for the nesting of participants in the groups. I am also running these over imputed data frames (created with mice). It seems that different disciplines use the phrase ‘fixed effects’ differently so I am having a hard time searching to troubleshoot.
I have fit random intercept models (with lme4) but they do not account for the school fixed effects (and the random effects are not of interest to my research questions). Putting the groups in as dummies slows the run down tremendously. I could also run a single level glm/lm with group dummies but I have not been able to find a strategy to cluster the standard errors with the imputed data (tried the clusterSE package). I could hand calculate the demeaning but there seems like there should be a more direct way to achieve this.

I have also looked at the lfe package but that does not seem to have glm options and the demeanlist function does not seem to be compatible with the imputed data frames.
In Stata, the command would be xtreg, fe vce (Cluster Variable), (fe = fixed effects, vce = clustered standard errors, with mi added to run over imputed dataframes). I could switch to Stata for the modeling but would definitely prefer to stay with R if possible!
Please let me know if this is better posted in cross-validated - I was on the fence but went with this one since it seemed to be more a coding question.
Thank you!
I would block bootstrap. The "block" handles the clustering and "bootstrap" handles the generated regressors.
There is probably a more elegant way to make this extensible to other estimators, but this should get you started.
# junk data
x <- rnorm(100)
y <- 1 + 2*x + rnorm(100)
dat1 <- data.frame(y, x, id=seq_along(y))
summary(lm(y ~ x, data=dat1))
# same point estimates, but lower SEs
dat2 <- dat1[rep(seq_along(y), each=10), ]
summary(lm(y ~ x, data=dat2))
# block boostrap helper function
require(boot)
myStatistic <- function(ids, i) {
myData <- do.call(rbind, lapply(i, function(i) dat2[dat2$id==ids[i], ]))
myLm <- lm(y ~ x, data=myData)
myLm$coefficients
}
# same point estimates from helper function if original data
myStatistic(unique(dat2$id), 1:100)
# block bootstrap recovers correct SEs
boot(unique(dat2$id), myStatistic, 500)

Predicting with plm function in R

I was wondering if it is possible to predict with the plm function from the plm package in R for a new dataset of predicting variables. I have create a model object using:
model <- plm(formula, data, index, model = 'pooling')
Now I'm hoping to predict a dependent variable from a new dataset which has not been used in the estimation of the model. I can do it through using the coefficients from the model object like this:
col_idx <- c(...)
df <- cbind(rep(1, nrow(df)), df[(1:ncol(df))[-col_idx]])
fitted_values <- as.matrix(df) %*% as.matrix(model_object$coefficients)
Such that I first define index columns used in the model and dropped columns due to collinearity in col_idx and subsequently construct a matrix of data which needs to be multiplied by the coefficients from the model. However, I can see errors occuring much easier with the manual dropping of columns.
A function designed to do this would make the code a lot more readable I guess. I have also found the pmodel.response() function but I can only get this to work for the dataset which has been used in predicting the actual model object.
Any help would be appreciated!
I wrote a function (predict.out.plm) to do out of sample predictions after estimating First Differences or Fixed Effects models with plm.
The function is posted here:
https://stackoverflow.com/a/44185441/2409896

Logistic Regression Using R

I am running logistic regressions using R right now, but I cannot seem to get many useful model fit statistics. I am looking for metrics similar to SAS:
http://www.ats.ucla.edu/stat/sas/output/sas_logit_output.htm
Does anyone know how (or what packages) I can use to extract these stats?
Thanks
Here's a Poisson regression example:
## from ?glm:
d.AD <- data.frame(counts=c(18,17,15,20,10,20,25,13,12),
outcome=gl(3,1,9),
treatment=gl(3,3))
glm.D93 <- glm(counts ~ outcome + treatment,data = d.AD, family=poisson())
Now define a function to fit an intercept-only model with the same response, family, etc., compute summary statistics, and combine them into a table (matrix). The formula .~1 in the update command below means "refit the model with the same response variable [denoted by the dot on the LHS of the tilde] but with only an intercept term [denoted by the 1 on the RHS of the tilde]"
glmsumfun <- function(model) {
glm0 <- update(model,.~1) ## refit with intercept only
## apply built-in logLik (log-likelihood), AIC,
## BIC (Bayesian/Schwarz Information Criterion) functions
## to models with and without intercept ('model' and 'glm0');
## combine the results in a two-column matrix with appropriate
## row and column names
matrix(c(logLik(glm.D93),BIC(glm.D93),AIC(glm.D93),
logLik(glm0),BIC(glm0),AIC(glm0)),ncol=2,
dimnames=list(c("logLik","SC","AIC"),c("full","intercept_only")))
}
Now apply the function:
glmsumfun(glm.D93)
The results:
full intercept_only
logLik -23.38066 -26.10681
SC 57.74744 54.41085
AIC 56.76132 54.21362
EDIT:
anova(glm.D93,test="Chisq") gives a sequential analysis of deviance table containing df, deviance (=-2 log likelihood), residual df, residual deviance, and the likelihood ratio test (chi-squared test) p-value.
drop1(glm.D93) gives a table with the AIC values (df, deviances, etc.) for each single-term deletion; drop1(glm.D93,test="Chisq") additionally gives the LRT test p value.
Certainly glm with a family="binomial" argument is the function most commonly used for logistic regression. The default handling of contrasts of factors is different. R uses treatment contrasts and SAS (I think) uses sum contrasts. You can look these technical issues up on R-help. They have been discussed many, many times over the last ten+ years.
I see Greg Snow mentioned lrm in 'rms'. It has the advantage of being supported by several other functions in the 'rms' suite of methods. I would use it , too, but learning the rms package may take some additional time. I didn't see an option that would create SAS-like output.
If you want to compare the packages on similar problems that UCLA StatComputing pages have another resource: http://www.ats.ucla.edu/stat/r/dae/default.htm , where a large number of methods are exemplified in SPSS, SAS, Stata and R.
Using the lrm function in the rms package may give you the output that you are looking for.

what is the equivalent of SAS-parameterestimates in R

Please how can this be written in R
proc glm data=DataTX;
class DAG;
by HID;
model Bwt = DAG/ss3 solution;
ods output parameterestimates =TX_BW_corrFact;
run;
The equivalent to proc glm for most purposes in R is lm, which fits linear models. It looks like you want the estimated coefficients from the model(s), which can be obtained by coef(mod) where mod is the object returned by lm.
The most complicated bit is replicating the by statement, which fits separate models for each level of the by variable (HID in this case). Try something like this. I assume you've already got your dataset imported into R.
grps <- split(DataTX, DataTX$HID)
mods <- lapply(grps, function(x) lm(Bwt ~ DAG, data=x))
sapply(mods, coef)
This splits DataTX into separate groups based on HID. For each group, it then fits the model lm(Bwt ~ DAG). The last line then extracts the fitted coefficients for each model.
This can be concatenated into a single line, but leaving it as 3 separate statements probably makes it easier to follow.
Note that the coefficients won't be the same as those from SAS, because of differences in how the two systems parametrise the model. In particular, SAS by default treats the last level of a class/factor variable as the reference, while R uses the first.
Have a look at lmList() from the lme4 or nlme package
library(lme4)
lmList(Reaction ~ Days | Subject, sleepstudy)
That is shorter than Hong's solution.
grps <- split(sleepstudy, sleepstudy$Subject)
mods <- lapply(grps, function(x) lm(Reaction ~ Days, data=x))
sapply(mods, coef)

Resources