Is exhaustive model selection in R with high interaction terms and inclusion of main effects possible with regsubsets() or other functions?

I would like to perform automated, exhaustive model selection on a dataset with 7 predictors (5 continuous and 2 categorical) in R. I would like all continuous predictors to have the potential for interaction (at least up to 3-way interactions) and also to have non-interacting squared terms.
I have been using regsubsets() from the leaps package and have gotten good results; however, many of the models contain interaction terms without the corresponding main effects (e.g., g*h is an included model predictor but g is not). Since inclusion of the main effect will affect the model score (Cp, BIC, etc.), it is important to include them in comparisons with the other models even if they are not strong predictors.
I could manually weed through the results and cross off models that include interactions without main effects, but I'd prefer an automated way to exclude those. I'm fairly certain this isn't possible with regsubsets() or leaps(), and probably not with glmulti either. Does anyone know of another exhaustive model selection function that allows such a specification, or have a suggestion for a script that will sort the model output and find only the models that fit my specs?
Below is simplified output from my model searches with regsubsets(). You can see that models 3 and 4 include interaction terms without including all the related main effects. If no other functions are known for running a search with my specs, then suggestions on easily subsetting this output to exclude models without the necessary main effects would be helpful.
Model  adjR2        BIC           CP           n_pred  X.Intercept.  x1     x2     x3     x1.x2  x1.x3  x2.x3  x1.x2.x3
1      0.470344346  -41.26794246   94.82406866  1      TRUE          FALSE  TRUE   FALSE  FALSE  FALSE  FALSE  FALSE
2      0.437034361  -36.5715963   105.3785057   1      TRUE          FALSE  FALSE  TRUE   FALSE  FALSE  FALSE  FALSE
3      0.366989617  -27.54194252  127.5725366   1      TRUE          FALSE  FALSE  FALSE  TRUE   FALSE  FALSE  FALSE
4      0.625478214  -64.64414719   46.08686422  2      TRUE          TRUE   FALSE  FALSE  FALSE  FALSE  FALSE  TRUE

You can use the dredge() function from the MuMIn package.
See also: Subsetting in dredge (MuMIn) - must include interaction if main effects are present.
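For reference, a minimal sketch of how dredge() is typically invoked (the names dat, y, and x1-x3 are placeholders, not from the question; by default dredge() respects marginality, so interactions only appear together with their main effects):
library(MuMIn)
options(na.action = "na.fail")   # dredge() refuses to run with the default na.omit
## Hypothetical global model: all 3-way interactions plus non-interacting squared terms
global_mod <- lm(y ~ (x1 + x2 + x3)^3 + I(x1^2) + I(x2^2) + I(x3^2), data = dat)
model_set <- dredge(global_mod, rank = "BIC")   # exhaustive search over submodels
head(model_set)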

After working with dredge I found that my models have too many predictors and interactions to run dredge in a reasonable time (I calculated that, with 40+ potential predictors, it might take 300,000 hours to complete the search on my computer). But it does exclude models where interactions don't match their main effects, so I imagine it might still be a good solution for many people.
For my needs I've moved back to regsubsets and have written some code to parse the search output and exclude models that contain interaction terms whose components are not also included as main effects. This code seems to work well, so I'll share it here. Warning: it was written with human expediency in mind, not computational efficiency, so it could probably be re-coded to be faster. If you've got hundreds of thousands of models to test you might want to make it sleeker. (I've been working on searches with ~50,000 models and up to 40 factors, which take my 2.4 GHz i5 a few hours to process.)
reg.output.search.with.test <- function(search_object) { ## input an object from a regsubsets search
  ## First build a df listing model components and metrics of interest
  search_comp <- data.frame(R2 = summary(search_object)$rsq,
                            adjR2 = summary(search_object)$adjr2,
                            BIC = summary(search_object)$bic,
                            CP = summary(search_object)$cp,
                            n_predictors = row.names(summary(search_object)$which),
                            summary(search_object)$which)
  ## Categorize predictors as main effects or higher-order terms based on whether '.' is present
  predictors <- colnames(search_comp)[(match("X.Intercept.", names(search_comp)) + 1):dim(search_comp)[2]]
  main_pred <- predictors[grep(pattern = ".", x = predictors, invert = TRUE, fixed = TRUE)]
  higher_pred <- predictors[grep(pattern = ".", x = predictors, fixed = TRUE)]
  ## Define a variable that indicates whether a model should be rejected; FALSE for all models initially
  search_comp$reject_model <- FALSE
  for (main_eff_n in 1:length(main_pred)) { ## iterate through main effects
    ## Find column numbers of higher-level terms containing the main effect
    search_cols <- grep(pattern = main_pred[main_eff_n], x = higher_pred)
    ## Subset models that are not yet flagged for rejection; only test these
    valid_model_subs <- search_comp[search_comp$reject_model == FALSE, ]
    ## Subset dfs with only main or higher-level predictor columns
    main_pred_df <- valid_model_subs[, colnames(valid_model_subs) %in% main_pred]
    higher_pred_df <- valid_model_subs[, colnames(valid_model_subs) %in% higher_pred]
    if (length(search_cols) > 0) { ## If there are higher-level predictors, test each one
      for (high_eff_n in search_cols) { ## iterate through higher-level predictors
        ## Test if the interaction term is present without the main effect (working with whole column of models)
        test_responses <- (main_pred_df[, main_eff_n] == FALSE) & (higher_pred_df[, high_eff_n] == TRUE)
        valid_model_subs[test_responses, "reject_model"] <- TRUE ## Set reject to TRUE where appropriate
      } ## End high_eff for
      ## Transfer changes in reject to primary df:
      search_comp[row.names(valid_model_subs), "reject_model"] <- valid_model_subs[, "reject_model"]
    } ## End if
  } ## End main_eff for
  ## Output the table of all models, named for the original search object and the current time/date,
  ## into the folder "model_search_reg"
  current_time_date <- format(Sys.time(), "%m_%d_%y at %H_%M_%S")
  write.table(search_comp,
              file = paste("./model_search_reg/",
                           paste(current_time_date, deparse(substitute(search_object)),
                                 "regSS_model_search.csv", sep = "_"), sep = ""),
              row.names = FALSE, col.names = TRUE, sep = ",")
} ## End reg.output.search.with.test fn
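A hypothetical usage sketch (the formula, the data frame dat, and the tuning values are placeholders, not taken from the question): run the exhaustive regsubsets search, make sure the output folder exists, then filter the results with the function above.
library(leaps)
## Hypothetical search: all 3-way interactions plus squared terms, keeping the 10 best models per size
search <- regsubsets(y ~ (x1 + x2 + x3)^3 + I(x1^2) + I(x2^2) + I(x3^2),
                     data = dat, nbest = 10, nvmax = 10, really.big = TRUE)
dir.create("model_search_reg", showWarnings = FALSE)  # the function writes its CSV here
reg.output.search.with.test(search)
The resulting CSV has a reject_model column, so the marginality-violating models can be dropped with a simple subset on reject_model == FALSE.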

Related

Negative Binomial model offset seems to be creating a 2 level factor

I am trying to fit some data to a negative binomial model and run a pairwise comparison using emmeans. The data has two different sample sizes, 15 and 20 (num_sample in the example below).
I have set up two data frames: good.data, which produces the expected result of offset() using random sample sizes between 15 and 20, and bad.data, which uses a sample size of either 15 or 20 and seems to produce a factor with the two levels 15 and 20. The bad.data pairwise comparison produces far more comparisons than good.data, even though they should produce the same number.
set.seed(1)
library(dplyr)
library(emmeans)
library(MASS)
# make data that works
data.frame(site = c(rep("A", 24),
                    rep("B", 24),
                    rep("C", 24),
                    rep("D", 24),
                    rep("E", 24)),
           trt_time = rep(rep(c(10, 20, 30), 8), 5),
           pre_trt = rep(rep(c(rep("N", 3), rep("Y", 3)), 4), 5),
           storage_time = rep(c(rep(0, 6), rep(30, 6), rep(60, 6), rep(90, 6)), 5),
           num_sample = sample(c(15, 17, 20), 24 * 5, T),  # more than 2 sample sizes...
           bad = sample(c(1:7), 24 * 5, T, c(0.6, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05))) -> good.data
# make data that doesn't work
data.frame(site = c(rep("A", 24),
                    rep("B", 24),
                    rep("C", 24),
                    rep("D", 24),
                    rep("E", 24)),
           trt_time = rep(rep(c(10, 20, 30), 8), 5),
           pre_trt = rep(rep(c(rep("N", 3), rep("Y", 3)), 4), 5),
           storage_time = rep(c(rep(0, 6), rep(30, 6), rep(60, 6), rep(90, 6)), 5),
           num_sample = sample(c(15, 20), 24 * 5, T),  # only 2 sample sizes...
           bad = sample(c(1:7), 24 * 5, T, c(0.6, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05))) -> bad.data
# fit models
good.data %>%
  mutate(trt_time = factor(trt_time),
         pre_trt = factor(pre_trt),
         storage_time = factor(storage_time)) %>%
  MASS::glm.nb(bad ~ trt_time:pre_trt:storage_time + offset(log(num_sample)),
               data = .) -> mod.good
bad.data %>%
  mutate(trt_time = factor(trt_time),
         pre_trt = factor(pre_trt),
         storage_time = factor(storage_time)) %>%
  MASS::glm.nb(bad ~ trt_time:pre_trt:storage_time + offset(log(num_sample)),
               data = .) -> mod.bad
# pairwise comparison
emmeans::emmeans(mod.good, pairwise ~ trt_time:pre_trt:storage_time + offset(log(num_sample)))$contrasts %>% as.data.frame()
emmeans::emmeans(mod.bad, pairwise ~ trt_time:pre_trt:storage_time + offset(log(num_sample)))$contrasts %>% as.data.frame()
First, I think you should look up how to use emmeans. The intent is not to give a duplicate of the model formula, but rather to specify which factors you want the marginal means of.
However, that is not the issue here. What emmeans does first is to set up a reference grid that consists of all combinations of:
- the levels of each factor
- the average of each numeric predictor; except that if a numeric predictor has just two different values, then both its values are included.
It is that exception you have run up against. Since num_sample has just two values, 15 and 20, both levels are kept separate rather than averaged. If you want them averaged, add cov.keep = 1 to the emmeans call. It has nothing to do with the offsets you specify in emmeans-related functions; it has to do with the fact that num_sample is a predictor in your model.
The reason for the exception is that a lot of people specify models with indicator variables (e.g., female having values of 1 if true and 0 if false) in place of factors. We generally want those treated like factors rather than numeric predictors.
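A minimal sketch of that suggestion, using the mod.bad object from the question (the expected count of 276 is an assumption based on the discussion below, not output I have verified):
contr <- emmeans::emmeans(mod.bad,
                          pairwise ~ trt_time:pre_trt:storage_time,
                          cov.keep = 1)$contrasts   # average over num_sample instead of keeping both values
nrow(as.data.frame(contr))   # should now be 276 rather than 1128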
To be honest I'm not exactly sure what's going on with the expansion (276, the 'correct' number of contrasts, is choose(24,2); the 'incorrect' number of contrasts is 1128 = choose(48,2)), but I would say that you should probably be following the guidance in the "offsets" section of one of the emmeans vignettes, where it says:
If a model is fitted and its formula includes an offset() term, then by default, the offset is computed and included in the reference grid. ...
However, many users would like to ignore the offset for this kind of model, because then the estimates we obtain are rates per unit value of the (logged) offset. This may be accomplished by specifying an offset parameter in the call ...
The most natural choice is to set the offset to 0 (i.e., make predictions etc. for a sample size of 1), but in this case I don't think it matters.
library(tibble)  # for as_tibble()
get_contr <- function(x) as_tibble(x$contrasts)
cfun <- function(m) {
  emmeans::emmeans(m,
                   pairwise ~ trt_time:pre_trt:storage_time, offset = 0) |>
    get_contr()
}
nrow(cfun(mod.good))  ## 276
nrow(cfun(mod.bad))   ## 276
From a statistical point of view I question the wisdom of looking at 276 pairwise comparisons, but that's a different issue ...

Different results using Random Forest prediction in R

When I run my random forest model over my test data I get different results for the same data set and model.
Here are the results where you can see the difference over the first column:
> table((predict(rfModelsL[[1]], newdata = a)), a$earlyR)
        FALSE TRUE
  FALSE    14    7
  TRUE     13   66
> table((predict(rfModelsL[[1]], newdata = a)), a$earlyR)
        FALSE TRUE
  FALSE    15    7
  TRUE     12   66
Although the difference is very small, I'm trying to understand what caused it. I'm guessing that predict has a "flexible" classification threshold, although I couldn't find that in the documentation; am I right?
Thank you in advance
I will assume that you did not refit the model here, and that it is simply the predict call that is producing these results. The answer is probably this, from ?predict.randomForest:
Any ties are broken at random, so if this is undesirable, avoid it by
using odd number ntree in randomForest()
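A minimal sketch of that fix, assuming the data frame a and factor response earlyR from the question (rf_odd and the value 501 are hypothetical): refitting with an odd ntree removes the random tie-breaking, so repeated predict() calls agree.
library(randomForest)
set.seed(123)                                               # only for a reproducible fit
rf_odd <- randomForest(earlyR ~ ., data = a, ntree = 501)   # odd number of trees: no ties to break
table(predict(rf_odd, newdata = a), a$earlyR)               # identical on repeated calls
table(predict(rf_odd, newdata = a), a$earlyR)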

Using randomForest package in R, how to get probabilities from classification model?

TL;DR :
Is there something I can flag in the original randomForest call to avoid having to re-run the predict function to get predicted categorical probabilities, instead of just the likely category?
Details:
I am using the randomForest package.
I have a model something like:
model <- randomForest(x = out.data[train.rows, feature.cols],
                      y = out.data[train.rows, response.col],
                      xtest = out.data[test.rows, feature.cols],
                      ytest = out.data[test.rows, response.col],
                      importance = TRUE)
where out.data is a data frame, with feature.cols a mixture of numeric and categorical features, while response.col is a TRUE/FALSE binary variable that I forced into a factor so that the randomForest model will properly treat it as categorical.
All runs well, and the variable model is returned to me properly. However, I cannot seem to find a flag or parameter to pass to the randomForest function so that model is returned to me with the probabilities of TRUE or FALSE. Instead, I get simply predicted values. That is, if I look at model$predicted, I'll see something like:
FALSE
FALSE
TRUE
TRUE
FALSE
.
.
.
Instead, I want to see something like:
   FALSE  TRUE
1   0.84  0.16
2   0.66  0.34
3   0.11  0.89
4   0.17  0.83
5   0.92  0.08
.      .     .
.      .     .
.      .     .
I can get the above, but in order to do so, I need to do something like:
tmp <- predict(model, out.data[test.rows, feature.cols], "prob")
[test.rows captures the row numbers for those that were used during the model testing. The details are not shown here, but are simple since the test row IDs are output into model.]
Then everything works fine. The problem is that the model is big and takes a very long time to run, and even the prediction itself takes a while. Since the prediction should be entirely unnecessary (I am simply looking to calculate the ROC curve on the test data set, predictions for which should already have been calculated), I was hoping to skip this step. Is there something I can flag in the original randomForest call to avoid having to re-run the predict function?
model$predicted is NOT the same thing as what is returned by predict(). If you want the probability of the TRUE or FALSE class then you must run predict(), or pass x, y, xtest, ytest as in
randomForest(x, y, xtest = x, ytest = y),
where x = out.data[, feature.cols] and y = out.data[, response.col].
model$predicted returns the class based on which class had the larger value in model$votes for each record. votes, as #joran pointed out, is the proportion of OOB (out-of-bag) 'votes' from the random forest, a vote only counting when the record was selected in an OOB sample. On the other hand, predict() returns the true probability for each class based on the votes of all the trees.
Using randomForest(x,y,xtest=x,ytest=y) functions a little differently than when passing a formula or simply randomForest(x,y), as in the example given above. randomForest(x,y,xtest=x,ytest=y) WILL return the probability for each class, this may sound a little weird, but it is found under model$test$votes, and the predicted class under model$test$predicted, which simply selects the class based on which class had the larger value in model$test$votes. Also, when using randomForest(x,y,xtest=x,ytest=y), model$predicted and model$votes have the same definition as above.
Finally, just to note, if randomForest(x, y, xtest = x, ytest = y) is used, then, in order to use the predict() function, the keep.forest flag should be set to TRUE:
model = randomForest(x, y, xtest = x, ytest = y, keep.forest = TRUE)
prob=predict(model,x,type="prob")
prob WILL be equivalent to model$test$votes since the test data input are both x.
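To tie this back to the question's original call, a hedged sketch reusing out.data, feature.cols, response.col, train.rows, and test.rows from the question: when xtest/ytest are supplied, the test-set class probabilities are already available in model$test$votes, so no second predict() pass is needed.
library(randomForest)
model <- randomForest(x = out.data[train.rows, feature.cols],
                      y = out.data[train.rows, response.col],
                      xtest = out.data[test.rows, feature.cols],
                      ytest = out.data[test.rows, response.col],
                      importance = TRUE, keep.forest = TRUE)
head(model$test$votes)       # per-class probabilities for the test rows
head(model$test$predicted)   # the class with the larger vote share per row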

How to Find False Positive Prediction Count using R Script

Assuming "test" and "train" are two data frames for testing and traininig respectively, and "model" is a classifier that was generated using training data. I can find the number of misclassified examples like this:
n = sum(test$class_label != predict(model, test))
How can I find the number of examples that are predicted as negative but are actually positive (i.e., false positives)?
NOTE: The above example assumes a binary classification problem whose classes are, say, "yes" (positive class) and "no". Additionally, predict is a function from the caret package.
This will get you a 2x2 table showing true positives, false positives, false negatives and true negatives.
> table(Truth = test$class_label, Prediction = predict(model, test))
       Prediction
Truth   yes  no
  yes    32   3
  no      8  27
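A small follow-up sketch, assuming (as in the NOTE) that the positive class is labelled "yes" and the negative class "no": the off-diagonal cells of that table give the two error counts directly.
cm <- table(Truth = test$class_label, Prediction = predict(model, test))
cm["yes", "no"]   # actually positive, predicted negative (the count the question asks about)
cm["no", "yes"]   # actually negative, predicted positive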

Random Forest by R package party overfits on random data

I am working on Random Forest classification.
I found that cforest in the "party" package usually performs better than "randomForest".
However, it seemed that cforest easily overfitted.
A toy example
Here is a random data set that includes a binary factor response and 10 numerical variables generated from rnorm().
# Sorry for redundant preparation.
data <- data.frame(response=rnorm(100))
data$response <- factor(data$response < 0)
data <- cbind(data, matrix(rnorm(1000), ncol=10))
colnames(data)[-1] <- paste("V",1:10,sep="")
Perform cforest, employing the unbiased parameter set (which may be the recommended setting).
library(party)
cf <- cforest(response ~ ., data = data, controls = cforest_unbiased())
table(predict(cf), data$response)
#         FALSE TRUE
#   FALSE    45    7
#   TRUE      6   42
Fairly good prediction performance on meaningless data.
On the other hand, randomForest behaves honestly.
library(randomForest)
rf <- randomForest(response ~ ., data = data)
table(predict(rf), data$response)
#         FALSE TRUE
#   FALSE    25   27
#   TRUE     26   22
Where do these differences come from?
I am afraid that I am using cforest in the wrong way.
Let me add some extra observations about cforest:
The number of variables did not much affect the result.
Variable importance values (computed by varimp(cf)) were rather low, compared to those using some realistic explanatory variables.
AUC of ROC curve was nearly 1.
I would appreciate your advice.
Additional note
Some wondered why a training data set was applied to predict().
I did not prepare a separate test data set because I assumed the prediction would be made on OOB samples, which turned out not to be the case for cforest.
c.f. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
You cannot learn anything about the true performance of a classifier by studying its performance on the training set. Moreover, since there is no true pattern to find, you can't really tell whether it is worse to overfit like cforest did or to guess randomly like randomForest did. All you can tell is that the two algorithms followed different strategies, but if you tested them on new unseen data both would probably fail.
The only way to estimate the performance of a classifier is to test it on external data, that has not been part of the training, in a situation you do know there is a pattern to find.
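As an illustration of that point, a minimal sketch reusing the toy data frame from the question (the 50/50 split and the seed are arbitrary assumptions): fit both forests on half the rows and evaluate on the untouched half; on pure noise, both should hover around chance.
library(party)
library(randomForest)
set.seed(42)
train_idx <- sample(nrow(data), nrow(data) / 2)   # arbitrary 50/50 split
cf2 <- cforest(response ~ ., data = data[train_idx, ], controls = cforest_unbiased())
rf2 <- randomForest(response ~ ., data = data[train_idx, ])
table(predict(cf2, newdata = data[-train_idx, ]), data$response[-train_idx])  # held-out confusion table
table(predict(rf2, newdata = data[-train_idx, ]), data$response[-train_idx])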
Some comments:
The number of variables shouldn't matter if none contain any useful information.
Nice to see that the variable importance is lower for meaningless data than meaningful data. This could serve as a sanity check for the method, but probably not much more.
AUC (or any other performance measure) doesn't matter on the training set, since it is trivial to obtain perfect classification results.
The predict methods have different defaults for cforest and randomForest models, respectively. party:::predict.RandomForest gets you
function (object, OOB = FALSE, ...)
{
    RandomForest@predict(object, OOB = OOB, ...)
}
so
table(predict(cf), data$response)
gets me
        FALSE TRUE
  FALSE    45   13
  TRUE      7   35
whereas
table(predict(cf, OOB=TRUE), data$response)
gets me
        FALSE TRUE
  FALSE    31   24
  TRUE     21   24
which is a respectably dismal result.
