Random Forest by R package party overfits on random data

I am working on Random Forest classification.
I found that cforest from the "party" package usually performs better than randomForest.
However, cforest seems to overfit easily.
A toy example
Here is a random data set with a binary factor response and 10 numerical variables generated from rnorm().
# Sorry for redundant preparation.
data <- data.frame(response=rnorm(100))
data$response <- factor(data$response < 0)
data <- cbind(data, matrix(rnorm(1000), ncol=10))
colnames(data)[-1] <- paste("V",1:10,sep="")
Fit cforest with the unbiased parameter set (which, as far as I can tell, is the recommended one).
cf <- cforest(response ~ ., data=data, controls=cforest_unbiased())
table(predict(cf), data$response)
# FALSE TRUE
# FALSE 45 7
# TRUE 6 42
Fairly good prediction performance on meaningless data.
randomForest, on the other hand, behaves honestly:
rf <- randomForest(response ~., data=data)
table(predict(rf),data$response)
# FALSE TRUE
# FALSE 25 27
# TRUE 26 22
Where do these differences come from?
I am afraid that I am using cforest in the wrong way.
Here are some extra observations on cforest:
The number of variables did not affect the result much.
Variable importance values (computed by varimp(cf)) were rather low, compared with those obtained using realistic explanatory variables.
The AUC of the ROC curve was nearly 1.
I would appreciate your advice.
Additional note
Some wondered why the training data set was passed to predict().
I did not prepare a separate test set because the prediction was supposed to be made on the OOB samples, which turned out not to be the case for cforest.
c.f. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm

You cannot learn anything about the true performance of a classifier by studying its performance on the training set. Moreover, since there is no true pattern to find, you cannot really tell whether it is worse to overfit, as cforest did, or to guess randomly, as randomForest did. All you can tell is that the two algorithms followed different strategies, but if you tested them on new unseen data both would probably fail.
The only way to estimate the performance of a classifier is to test it on external data, that has not been part of the training, in a situation you do know there is a pattern to find.
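For example, a quick sanity check along these lines (just a sketch: the 50/50 split, the seed, and the cf2 name are arbitrary choices layered on top of the toy data above) should show cforest performing at roughly chance level on held-out noise:
library(party)
set.seed(1)
idx   <- sample(nrow(data), 50)   # arbitrary 50/50 split of the toy data
train <- data[idx, ]
test  <- data[-idx, ]
cf2 <- cforest(response ~ ., data = train, controls = cforest_unbiased())
table(predict(cf2, newdata = test), test$response)   # expect roughly 50% accuracy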
Some comments:
The number of variables shouldn't matter if none contain any useful information.
Nice to see that the variable importance is lower for meaningless data than for meaningful data. This could serve as a sanity check for the method, but probably not much more.
AUC (or any other performance measure) doesn't matter on the training set, since it is trivial to obtain perfect classification results.

The predict methods have different defaults for cforest and randomForest models, respectively. party:::predict.RandomForest gets you
function (object, OOB = FALSE, ...)
{
    RandomForest@predict(object, OOB = OOB, ...)
}
so
table(predict(cf), data$response)
gets me
FALSE TRUE
FALSE 45 13
TRUE 7 35
whereas
table(predict(cf, OOB=TRUE), data$response)
gets me
FALSE TRUE
FALSE 31 24
TRUE 21 24
which is a respectably dismal result.


number of trees in h2o.gbm

In traditional gbm, we can use
predict.gbm(model, newdata=..., n.trees=...)
so that I can compare results for different numbers of trees on the test data.
In h2o.gbm, although there is an n.tree argument to set, it seems to have no effect on the result; everything is the same as the default model:
h2o.test.pred <- as.vector(h2o.predict(h2o.gbm.model, newdata=test.frame, n.tree=100))
R2(h2o.test.pred, test.mat$y)
[1] -0.00714109
h2o.test.pred <- as.vector(h2o.predict(h2o.gbm.model, newdata=test.frame, n.tree=10))
R2(h2o.test.pred, test.mat$y)
[1] -0.00714109
Does anybody have a similar problem? How can it be solved? h2o.gbm is much faster than gbm, so if it could give a detailed result for each number of trees that would be great.
I don't think H2O supports what you are describing.
BUT, if what you are after is to get the performance against the number of trees used, that can be done at model building time.
library(h2o)
h2o.init()
iris <- as.h2o(iris)
parts <- h2o.splitFrame(iris,c(0.8,0.1))
train <- parts[[1]]
valid <- parts[[2]]
test <- parts[[3]]
m <- h2o.gbm(1:4, 5, train,
             validation_frame = valid,
             ntrees = 100,            # max desired
             score_tree_interval = 1)
h2o.scoreHistory(m)
plot(m)
The score history will show the evaluation after adding each new tree. plot(m) will show a chart of this. Looks like 20 is plenty for iris!
BTW, if your real purpose was to find out the optimum number of trees to use, then switch early stopping on, and it will do that automatically for you. (Just make sure you are using both validation and test data frames.)
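A minimal sketch of what that could look like, reusing the frames from above (the metric, tolerance and stopping_rounds values here are illustrative choices, not recommendations):
# Stop once 5 consecutive scoring rounds fail to improve validation logloss
# by at least 0.1%; ntrees is just a generous upper bound.
m_es <- h2o.gbm(1:4, 5, train,
                validation_frame = valid,
                ntrees = 1000,
                score_tree_interval = 1,
                stopping_rounds = 5,
                stopping_metric = "logloss",
                stopping_tolerance = 0.001)
h2o.scoreHistory(m_es)   # shows where training actually stopped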
As of 3.20.0.6 H2O does support this. The method you are looking for is
staged_predict_proba. For classification models it produces the predicted class probabilities after each iteration (tree), for every observation in your testing frame. For regression models (i.e. when the response is numerical), although not really documented, it produces the actual prediction for every observation in your testing frame.
From these predictions it is also easy to compute various performance metrics (AUC, r2 etc), assuming that's what you're after.
Python API:
staged_predict_proba = model.staged_predict_proba(test)
R API:
staged_predict_proba <- h2o.staged_predict_proba(model, prostate.test)
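From there, getting a metric per number of trees is mostly bookkeeping. A rough sketch (the column layout of the staged frame, the names model, test and y, and the 0/1 label coding are all assumptions here, not verified API details):
# Assumes each column of `staged` holds the positive-class probability after
# the corresponding number of trees, and that the true 0/1 labels are in test$y.
staged <- as.data.frame(h2o.staged_predict_proba(model, test))
truth  <- as.vector(test$y)
# Rank-based (Mann-Whitney) AUC helper
auc_simple <- function(p, y) {
  r  <- rank(p)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
auc_by_ntrees <- sapply(staged, auc_simple, y = truth)
plot(auc_by_ntrees, type = "b", xlab = "number of trees", ylab = "test AUC")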

Why does h2o.randomForest in R make much better predictions than the randomForest package?

setwd("D:/Santander")
## import train dataset
train<-read.csv("train.csv",header=T)
dim(train)
summary(train)
str(train)
prop.table(table(train$TARGET))   # class balance of the target
stats<-function(x){
length<-length(x)
nmiss<-sum(is.na(x))
y<-x[!is.na(x)]
freq<-as.data.frame(table(y))
max_freq<-max(freq[,2])/length
min<-min(y)
median<-median(y)
max<-max(y)
mean<-mean(y)
freq<-length(unique(y))
return(c(nmiss=nmiss,min=min,median=median,mean=mean,max=max,freq=freq,max_freq=max_freq))
}
var_stats<-sapply(train,stats)
var_stats_1<-t(var_stats)
### Drop every variable whose most frequent value accounts for more than 0.9999 of observations (all other values below 1/10000)
exclude_var<-rownames(var_stats_1)[var_stats_1[,7]>0.9999]
train2<-train[,! colnames(train) %in% c(exclude_var,"ID")]
rm(list=setdiff(ls(),"train2"))
train2<-train2[1:10000,]
write.csv(train2,"example data.csv",row.names = F)
## Randomly split the data into training and test sets
set.seed(1)
ind<-sample(c(1,2),size=nrow(train2),replace=T,prob=c(0.8,0.2))
train2$TARGET<-factor(train2$TARGET)
train_set<-train2[ind==1,]
test_set<-train2[ind==2,]
rm(train2)
## 1. Build a prediction model with R randomForest
library(randomForest)
library(ROCR)   # provides prediction() and performance(), used below for AUC
memory.limit(4000)
random<-randomForest(TARGET~.,data=train_set,ntree=50)
print(random)
random.importance<-importance(random)
p_train<-predict(random,train_set,type="prob")
pred.auc<-prediction(p_train[,2],train_set$TARGET)
performance(pred.auc,"auc")
##train_set auc=0.8177
## predict test_set
p_test<-predict(random,newdata = test_set,type="prob")
pred.auc<-prediction(p_test[,2],test_set$TARGET)
performance(pred.auc,"auc")
##test_set auc=0.60
#________________________________________________#
##_________h2o.randomForest_______________
library(h2o)
h2o.init()
train.h2o<-as.h2o(train_set)
test.h2o<-as.h2o(test_set)
random.h2o<-h2o.randomForest(y="TARGET",training_frame=train.h2o,ntrees=50)
importance.h2o<-h2o.varimp(random.h2o)
p_train.h2o<-as.data.frame(h2o.predict(random.h2o,train.h2o))
pred.auc<-prediction(p_train.h2o$p1,train_set$TARGET)
performance(pred.auc,"auc")
##auc=0.9388, bigger than previous one
###test_set prediction
p_test.h2o<-as.data.frame(h2o.predict(random.h2o,test.h2o))
pred.auc<-prediction(p_test.h2o$p1,test_set$TARGET)
performance(pred.auc,"auc")
###auc=0.775
I tried to make predictions for the Kaggle competition Santander Customer Satisfaction: https://www.kaggle.com/c/santander-customer-satisfaction
When I use the randomForest package in R, I get a final test-set AUC of 0.57, but when I use h2o.randomForest I get a final test-set AUC of 0.81. The parameters of both functions are the same; I only used the default parameters with ntree=100.
So why does h2o.randomForest make much better predictions than the randomForest package itself?
Firstly, as user1808924 noted, there are differences in the algorithms and their default hyperparameters. For example, R's randomForest splits based on the Gini criterion and H2O trees are split based on reduction in Squared Error (even for classification). H2O also uses histograms for splitting and can handle splitting on categorical variables without dummy (or one-hot) encoding (although I don't think that matters here since the Santander dataset is entirely numeric). Other information on H2O's splitting can be found here (this is in the GBM section but the splitting for both algos is the same).
If you look at the predictions from your R randomForest model you will see that they are all in increments of 0.02. R's randomForest builds really deep trees, resulting in pure leaf nodes. This means the predicted outcome for an observation is either going to be 0 or 1 in each tree, and since you've set ntree=50 the predictions will all be in increments of 0.02. The reason you get bad AUC scores is that with AUC it is the order of the predictions that matters, and since all of your predictions are [0.00, 0.02, 0.04, ...] there are a lot of ties. The trees in H2O's random forest aren't quite as deep and therefore aren't as pure, allowing for predictions with more granularity that can be better sorted for a better AUC score.
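If you want to see this on your own predictions, a quick check along these lines (reusing p_test and p_test.h2o from the question's code; the 1/ntree arithmetic is the only assumption) makes the difference obvious:
# With 50 deep trees, every R randomForest probability is a multiple of 1/50 = 0.02,
# so the test-set predictions collapse onto at most 51 distinct values (lots of ties).
length(unique(p_test[, 2]))
table(p_test[, 2])
# H2O's shallower, non-pure trees produce far more distinct probabilities,
# which sort more informatively and therefore give a better AUC.
length(unique(p_test.h2o$p1))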

Different results using Random Forest prediction in R

When I run my random forest model over my test data I get different results for the same data set and model.
Here are the results, where you can see the difference in the first column:
> table((predict(rfModelsL[[1]],newdata = a)) ,a$earlyR)
FALSE TRUE
FALSE 14 7
TRUE 13 66
> table((predict(rfModelsL[[1]],newdata = a)) ,a$earlyR)
FALSE TRUE
FALSE 15 7
TRUE 12 66
Although the difference is very small, I'm trying to understand what caused it. I'm guessing that predict has a "flexible" classification threshold, although I couldn't find that in the documentation. Am I right?
Thank you in advance
I will assume that you did not refit the model here, but that it is simply the predict call that is producing these results. The answer is probably this, from ?predict.randomForest:
Any ties are broken at random, so if this is undesirable, avoid it by
using odd number ntree in randomForest()
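Two practical ways to act on that advice (a sketch only: the training data for rfModelsL[[1]] is not shown in the question, so train_a below is a placeholder name, and the seed and ntree values are arbitrary):
library(randomForest)
# Refit with an odd number of trees so the majority vote can never tie,
# which makes predict() deterministic for a fixed model.
rf_odd <- randomForest(earlyR ~ ., data = train_a, ntree = 501)
table(predict(rf_odd, newdata = a), a$earlyR)
# Or keep the existing model and fix the RNG before each predict() call,
# which should at least make the random tie-breaking reproducible.
set.seed(1)
table(predict(rfModelsL[[1]], newdata = a), a$earlyR)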

Is exhaustive model selection in R with high interaction terms and inclusion of main effects possible with regsubsets() or other functions?

I would like to perform automated, exhaustive model selection on a dataset with 7 predictors (5 continuous and 2 categorical) in R. I would like all continuous predictors to have the potential for interaction (at least up to 3-way interactions) and also to have non-interacting squared terms.
I have been using regsubsets() from the leaps package and have gotten good results; however, many of the models contain interaction terms without the corresponding main effects (e.g., g*h is an included predictor but g is not). Since including the main effect will affect the model score (Cp, BIC, etc.), it is important to include it when comparing with the other models, even if it is not a strong predictor.
I could manually weed through the results and cross off models that include interactions without main effects, but I'd prefer an automated way to exclude those. I'm fairly certain this isn't possible with regsubsets() or leaps(), and probably not with glmulti either. Does anyone know of another exhaustive model selection function that allows for such a specification, or have a suggestion for a script that will sort the model output and keep only models that fit my specs?
Below is simplified output from my model searches with regsubsets(). You can see that models 3 and 4 do include interaction terms without including all the related main effects. If no other functions are known for running a search with my specs, then suggestions on easily subsetting this output to exclude models without the necessary main effects would be helpful.
Model adjR2 BIC CP n_pred X.Intercept. x1 x2 x3 x1.x2 x1.x3 x2.x3 x1.x2.x3
1 0.470344346 -41.26794246 94.82406866 1 TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
2 0.437034361 -36.5715963 105.3785057 1 TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
3 0.366989617 -27.54194252 127.5725366 1 TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
4 0.625478214 -64.64414719 46.08686422 2 TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
You can use the dredge() function from the MuMIn package.
See also Subsetting in dredge (MuMIn) - must include interaction if main effects are present.
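A minimal sketch of how that might look for this kind of model (dat, y and the predictor names are placeholders; by default dredge respects marginality, so interactions are only kept together with their main effects):
library(MuMIn)
options(na.action = "na.fail")   # dredge requires a global model fitted with na.fail
# Global model: all 2- and 3-way interactions among continuous predictors,
# plus non-interacting squared terms.
global <- lm(y ~ (x1 + x2 + x3)^3 + I(x1^2) + I(x2^2) + I(x3^2), data = dat)
ms <- dredge(global, rank = "BIC")   # exhaustive search over sub-models
head(ms)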
After working with dredge I found that my models have too many predictors and interactions to run dredge in a reasonable time (I calculated that with 40+ potential predictors it might take 300k hours to complete the search on my computer). But it does exclude models where interactions don't match main effects, so I imagine it might still be a good solution for many people.
For my needs I've moved back to regsubsets and have written some code to parse through the search output in order to exclude models that contain interaction terms whose main effects are not also included. This code seems to work well, so I'll share it here. Warning: it was written with human expediency in mind, not computational, so it could probably be re-coded to be faster. If you've got hundreds of thousands of models to test you might want to make it sleeker. (I've been working on searches with ~50,000 models and up to 40 factors, which take my 2.4 GHz i5 core a few hours to process.)
reg.output.search.with.test <- function(search_object) { ## input an object from a regsubsets search
  ## First build a df listing model components and metrics of interest
  search_comp <- data.frame(R2 = summary(search_object)$rsq,
                            adjR2 = summary(search_object)$adjr2,
                            BIC = summary(search_object)$bic,
                            CP = summary(search_object)$cp,
                            n_predictors = row.names(summary(search_object)$which),
                            summary(search_object)$which)
  ## Categorize predictors as main effects or higher-level terms based on whether '.' is present
  predictors <- colnames(search_comp)[(match("X.Intercept.", names(search_comp)) + 1):dim(search_comp)[2]]
  main_pred <- predictors[grep(pattern = ".", x = predictors, invert = TRUE, fixed = TRUE)]
  higher_pred <- predictors[grep(pattern = ".", x = predictors, fixed = TRUE)]
  ## Define a variable indicating whether a model should be rejected; initially FALSE for all models
  search_comp$reject_model <- FALSE
  for (main_eff_n in 1:length(main_pred)) { ## iterate through main effects
    ## Find column numbers of higher-level terms containing the main effect
    search_cols <- grep(pattern = main_pred[main_eff_n], x = higher_pred)
    ## Subset models that are not yet flagged for rejection; only test these
    valid_model_subs <- search_comp[search_comp$reject_model == FALSE, ]
    ## Subset dfs with only main or higher-level predictor columns
    main_pred_df <- valid_model_subs[, colnames(valid_model_subs) %in% main_pred]
    higher_pred_df <- valid_model_subs[, colnames(valid_model_subs) %in% higher_pred]
    if (length(search_cols) > 0) { ## If there are higher-level predictors, test each one
      for (high_eff_n in search_cols) { ## iterate through higher-level predictors
        ## Flag models where the interaction is present without the main effect (whole column of models at once)
        test_responses <- ((main_pred_df[, main_eff_n] == FALSE) & (higher_pred_df[, high_eff_n] == TRUE))
        valid_model_subs[test_responses, "reject_model"] <- TRUE ## Set reject to TRUE where appropriate
      } ## End high_eff for
      ## Transfer changes in reject to primary df:
      search_comp[row.names(valid_model_subs), "reject_model"] <- valid_model_subs[, "reject_model"]
    } ## End if
  } ## End main_eff for
  ## Output resulting table of all models, named for the original search object and the current time/date,
  ## into the folder "model_search_reg"
  current_time_date <- format(Sys.time(), "%m_%d_%y at %H_%M_%S")
  write.table(search_comp,
              file = paste("./model_search_reg/",
                           paste(current_time_date, deparse(substitute(search_object)),
                                 "regSS_model_search.csv", sep = "_"), sep = ""),
              row.names = FALSE, col.names = TRUE, sep = ",")
} ## End reg.output.search.with.test fn

Error with gls function in nlme package in R

I keep getting an error like this:
Error in `coef<-.corARMA`(`*tmp*`, value = c(18.3113452983211, -1.56626248550284, :
Coefficient matrix not invertible
or like this:
Error in gls(archlogfl ~ co2, correlation = corARMA(p = 3)) : false convergence (8)
with the gls function in nlme.
The former example was with the model gls(archlogflfornma~nma,correlation=corARMA(p=3)) where archlogflfornma is
[1] 2.611840 2.618454 2.503317 2.305531 2.180464 2.185764 2.221760 2.211320
and nma is
[1] 138 139 142 148 150 134 137 135
You can see the model in the latter, and archlogfl is
[1] 2.611840 2.618454 2.503317 2.305531 2.180464 2.185764 2.221760 2.211320
[9] 2.105556 2.176747
and co2 is
[1] 597.5778 917.9308 1101.0430 679.7803 886.5347 597.0668 873.4995
[8] 816.3483 1427.0190 423.8917
I have R 2.13.1.
Roland
@GavinSimpson's comment above, that trying to estimate a model with 5 parameters from 10 observations is very hopeful, is correct. The general rule of thumb is that you should have at least 10 times as many data points as parameters, and that's for standard fixed effect/regression parameters. (Generally variance structure parameters such as AR parameters are even a bit harder/require a bit more data than regression parameters to estimate.)
That said, in a perfect world one could hope to estimate parameters even from overfitted models. Let's just explore what happens though:
archlogfl <- c(2.611840,2.618454,2.503317,
2.305531,2.180464,2.185764,2.221760,2.211320,
2.105556,2.176747)
co2 <- c(597.5778,917.9308,1101.0430,679.7803,
886.5347,597.0668,873.4995,
816.3483,1427.0190,423.8917)
Take a look at the data,
plot(archlogfl~co2,type="b")
library(nlme)
g0 <- gls(archlogfl~co2)
plot(ACF(g0),alpha=0.05)
This is an autocorrelation function of the residuals, with 95% confidence intervals (note that these are pointwise confidence intervals, so we would expect about 1/20 points to fall outside these boundaries in any case).
So there is indeed some (graphical) evidence for autocorrelation here. We'll fit an AR(1) model, with verbose output (to understand the scale on which these parameters are estimated, you'll probably have to dig around in Pinheiro and Bates 2000: what's presented in the printout are the unconstrained values of the parameters, while what's printed in the summaries are the constrained values ...).
g1 <- gls(archlogfl ~co2,correlation=corARMA(p=1),
control=glsControl(msVerbose=TRUE))
Let's see what's left after we fit AR1:
plot(ACF(g1,resType="normalized"),alpha=0.05)
Now fit AR(2):
g2 <- gls(archlogfl ~co2,correlation=corARMA(p=2),
control=glsControl(msVerbose=TRUE))
plot(ACF(g2,resType="normalized"),alpha=0.05)
As you correctly state, trying to go to AR(3) fails.
gls(archlogfl ~co2,correlation=corARMA(p=3))
You can play with tolerances, starting conditions, etc., but I don't think it's going to help much.
gls(archlogfl ~co2,correlation=corARMA(p=3,value=c(0.9,-0.5,0)),
control=glsControl(tolerance=1e-4,msVerbose=TRUE),verbose=TRUE)
If I were absolutely desperate to get these values I would code my own generalized least-squares function, constructing the AR(3) correlation matrix from scratch, and try to run it with some slightly more robust optimizer, but I would really have to have a good reason to work that hard ...
Another alternative would be to use arima to fit to the residuals from a gls or lm fit without autocorrelation: arima(residuals(g0),c(3,0,0)). (You can see that if you do this with arima(residuals(g0),c(2,0,0)) the answers are close to (but not quite equal to) the results from gls with corARMA(p=2).)
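For concreteness, that comparison can be run directly from the objects fitted above:
# AR(3) fit to the residuals of the uncorrelated model
arima(residuals(g0), order = c(3, 0, 0))
# Cross-check: AR(2) on the residuals lands close to (but not exactly on)
# the corARMA(p = 2) estimates reported for g2
arima(residuals(g0), order = c(2, 0, 0))
summary(g2)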
