Speed up cross validated random forest approach by parallel computing - r

I'm trying to speed up my random forest approach by parallel computing. My dataset contains of 20.000 rows and 10 columns. Dependent variable, which could be predicted, is a numerical and there are two factors between independent variables (one has 2 levels and second one has 504 levels).
I think the function train does coding all the factor variables into dummy variables, so decoding is not needed in this case.
Please, could you give me some useful advice, how to speed up the following code, I would appreciate any of advice. The solution below is never ending. Thanks a lot in advance.
library(doParallel); library(caret)
set.seed(975)
forTraining <- createDataPartition(DATA$NumVar,
p = 3/4)[[1]]
trainingSet <- DATA[forTraining,]
testSet <- DATA[-forTraining,]
controlObject <- trainControl(method = "repeatedcv",
repeats = 5,
number = 10)
#run model in parallel
cl <- makeCluster(detectCores())
registerDoParallel(cl)
set.seed(669)
rfModel <- train(NumVar ~ .,
data = trainingSet,
method = "rf",
tuneLength = 10,
ntrees = 1000,
importance = TRUE,
trControl = controlObject)
stopCluster(cl)

My response is too verbose for a comment so hopefully I can help guide you. Here is a summary of the points in the comments above.
Our primary concern - Computation Time
One major limitation on randomforest computation - Number of Trees
The general idea is that as you increase the number trees, your randomforest model will improve (i.e. lower error). However, this increase in performance will diminish as you continue to add trees whereas computation time will continue to increase. As such, you reach a point of diminishing returns. So, how to do we determine how many trees to use?
Well, naively we could simply fit the randomForest model with the call you provide. Another option is to do cross-validation on ntree but that isn't implemented by default in caret and Max Kuhn (the author) really knows his stuff when it comes to predictive models. So, to get started you are on the correct track with your call you provided above:
randomForest(dependentVariable ~ ., data = dataSet, mtry = 522, ntree=3000, importance=TRUE, do.trace=100)
But let's make this reproducible and use the mlbench Sonar dataset.
library(mlbench)
data(Sonar)
But we currently don't care about variable importance at the moment so let's remove that. Also, your ntree is way too high to start. I would be surprised if you need it that high in the end. Starting at a lower level we have the following:
set.seed(825)
rf1 <- randomForest(Class~., data=Sonar, mtry=3, ntree=200, do.trace=25)
> rf1 <- randomForest(Class~., data=Sonar, mtry=3, ntree=200, do.trace=25)
ntree OOB 1 2
25: 16.83% 11.71% 22.68%
50: 18.27% 12.61% 24.74%
75: 17.31% 17.12% 17.53%
100: 15.38% 12.61% 18.56%
125: 15.38% 10.81% 20.62%
150: 16.35% 13.51% 19.59%
175: 15.87% 10.81% 21.65%
200: 14.42% 8.11% 21.65%
As you can see, the OOB is bottoming out at approximately 100 trees. However, if I am uncertain or if the OOB is still dropping significantly I could run the call again with a larger number of trees. Once you have a working number of trees, then you can tune your mtry with caret::training.
If you do end up needing to use lots of trees (i.e. 1000's) then your code is likely going to be slow. If you have access to machines that have many processors and large amounts of RAM then your parallel implementation can help but on more common machines it will be slow going.

Related

number of trees in h2o.gbm

in traditional gbm, we can use
predict.gbm(model, newsdata=..., n.tree=...)
So that I can compare result with different number of trees for the test data.
In h2o.gbm, although it has n.tree to set, it seems it doesn't have any effect on the result. It's all the same as the default model:
h2o.test.pred <- as.vector(h2o.predict(h2o.gbm.model, newdata=test.frame, n.tree=100))
R2(h2o.test.pred, test.mat$y)
[1] -0.00714109
h2o.test.pred <- as.vector(h2o.predict(h2o.gbm.model, newdata=test.frame, n.tree=10))
> R2(h2o.test.pred, test.mat$y)
[1] -0.00714109
Does anybod have similar problem? How to solve it? h2o.gbm is much faster than gbm, so if it can get detailed result of each tree that would be great.
I don't think H2O supports what you are describing.
BUT, if what you are after is to get the performance against the number of trees used, that can be done at model building time.
library(h2o)
h2o.init()
iris <- as.h2o(iris)
parts <- h2o.splitFrame(iris,c(0.8,0.1))
train <- parts[[1]]
valid <- parts[[2]]
test <- parts[[3]]
m <- h2o.gbm(1:4, 5, train,
validation_frame = valid,
ntrees = 100, #Max desired
score_tree_interval = 1)
h2o.scoreHistory(m)
plot(m)
The score history will show the evaluation after adding each new tree. plot(m) will show a chart of this. Looks like 20 is plenty for iris!
BTW, if your real purpose was to find out the optimum number of trees to use, then switch early stopping on, and it will do that automatically for you. (Just make sure you are using both validation and test data frames.)
As of 3.20.0.6 H2O does support this. The method you are looking for is
staged_predict_proba. For classification models it produces predicted class probabilities after each iteration (tree), for every observation in your testing frame. For regression models (i.e. when response is numerical), although not really documented, it produces the actual prediction for every observation in your testing frame.
From these predictions it is also easy to compute various performance metrics (AUC, r2 etc), assuming that's what you're after.
Python API:
staged_predict_proba = model.staged_predict_proba(test)
R API:
staged_predict_proba <- h2o.staged_predict_proba(model, prostate.test)

how to resample and compare the resutls when I just want to predict the last row of the data using surv. functions in mlr package, R?

I just start trying the R package mlr, I am wondering if I can customize training set and test set. For example, all the data of a time sequence are the training set except for the last,and the last one is the test set.
Here is my example:
library(mlr)
library(survival)
data(lung)
myData2 <- lung %>%
select(time,status,age)
myData2$status = (myData2$status == 2)
myTrain <- c(1:(nrow(myData2)-1))
myTest <- nrow(myData2)
Lung data is from survival package. I just use three dimensions: time, status and age. Now, let's suppose they do not mean the patients' ages and how long they can survive. Let's say this is a ink purchase history of one customer.
age=74 means this customer bought 74 bottles of ink on that day and time=306 means the customer run out the ink after 306 days. So, I want to build up a survival model using all the data except for the last row. Then, when I have the data of the last row, which is age=58 implying the customer bought 58 bottles of ink on that day, I can make a prediction on time. A number close to 177 will be a good estimation. So, my training set and test set are fixed, which does not need to be resampled.
In addition, I need to change the hyperparameters for a comparison. Here is my code:
surv.task <- makeSurvTask(data=myData2,target=c('time','status'))
surv.lrn <- makeLearner("surv.cforest")
ps <- makeParamSet(
makeDiscreteParam('mincriterion',values=c(1.281552,2,3)),
makeDiscreteParam('ntree',values=c(100,200,300))
)
ctrl <- makeTuneControlGrid()
rdesc <- makeResampleDesc('Holdout',split=1,predict='train')
lrn = makeTuneWrapper(surv.lrn,control=ctrl,resampling=rdesc,par.set=ps,
measures = list(setAggregation(cindex,train.mean)))
mod <- train(learner=lrn,task=surv.task,subset=myTrain)
surv.pred <- predict(mod,task=surv.task,subset=myTest)
surv.pred
You can see that I use split=1 in makeResampleDesc because I have fixed training set which does not need to be resampled. measures in makeTuneWrapper is currently not meaningful to me as I need to customize my own measures. Because of fixed data split, I can not use the functions like resample or tuneParams to get an evaluation on test data when using different hyperparameters.
So, my question is: when the training set and test set are fixed, can mlr provide a comprehensive compare for every hyperparameter? If so, how to do it?
Incidentally, looks like there is function makeFixedHoldoutInstance which might can do this, just do not know how to use it. For example, I use makeFixedHoldoutInstance in this way and I have got such error information:
> f <- makeFixedHoldoutInstance(train.inds=myTrain,test.inds=myTest,size=length(myTrain)+1)
> lrn = makeTuneWrapper(surv.lrn,control=ctrl,resampling=f,par.set=ps)
> resample(learner=lrn,task=surv.task,resampling=f)
[Resample] holdout iter 1: [Tune] Started tuning learner surv.cforest for parameter set:
Type len Def Constr Req Tunable Trafo
mincriterion discrete - - 1.281552,2,3 - TRUE -
ntree discrete - - 100,200,300 - TRUE -
With control class: TuneControlGrid
Imputation value: -0
[Tune-x] 1: mincriterion=1.281552; ntree=100
Error in resample.fun(learner2, task, resampling, measures = measures, :
Size of data set: 227 and resampling instance: 228 differ!
With makeFixedHoldoutInstance you get the resampling you asked for.
But you can not use the same fixed resampling indices for the tuning inside the tuning wrapper and the resampling.
This is because first resample will split the data according to the fixed holdout instance f. Then the tuning inside the tuning wrapper will also need a resampling method to calculate the performance for a given configuration. As the tuning only sees the data after the split done by resample it can not apply the same fixed resampling.
From reading your question I guess you don't want to use the tuneWrapper but you want to directly tune your learner. So you should use simply tuneParams:
tr = tuneParams(learner = surv.lrn, task = surv.task, resampling = cv2, par.set = ps, control = ctrl)
Note: This does not work on the given example because the cindex needs at least one uncensored observation and even then it does not make sense because the cindex is only meaningful for a bigger test set.

h2o.GBM taking too long on a small sized dataset

I've got a rather small dataset (162,000 observations with 13 attributes)
that I'm trying to use for modelling, using h2o.GBM. The response variable is categorical with large number of levels (~ 20,000 levels)
The model doesn't run out of memory or give any errors, but it's been going for nearly 24 hours without any progress (says 0% on H2o.GBM reporting)
I finally gave in and stopped it.
I'm wondering if there's anything wrong with my hyperparameters, as data is not particularly large.
here's my code:
library(h2o)
localH2O <- h2o.init(nthreads = -1, max_mem_size = "12g")
train.h20 <- as.h2o(analdata_train)
gbm1 <- h2o.gbm(
y = response_var
, x = independ_vars
, training_frame = train.h20
, ntrees = 3
, max_depth = 5
, min_rows = 10
, stopping_tolerance = 0.001
, learn_rate = 0.1
, distribution = "multinomial"
)
The way H2O GBM multinomial classification works is, when you ask for 1 tree as a parameter, it actually builds a tree for each level in the response column underneath the hood.
So 1 tree really means 20,000 trees in your case.
2 trees would really mean 40,000, and so on...
(Note the binomial classification case takes a shortcut and builds only one tree for both classes.)
So... it will probably finish but it could take quite a long time!
It's probably not a good idea to train a classifier with 20,000 classes -- most GBM implementations won't even let you do that. Can you group/cluster the classes into a smaller number of groups so that you can train a model with a smaller number of classes? If so, then you could perform your training in a two-stage process -- the first model would have K classes (assuming you clustered your classes into K groups). Then you can train secondary models that further classify the observations into your original classes.
This type of two-stage process may make sense if your classes represent groups that naturally clusters into a hierarchy of groups -- such as zip codes or ICD-10 medical diagnostic codes, for example.
If your use-case really demands that you train a 20,000 class GBM (and there's no way around it), then you should get a bigger cluster of machines to use in your H2O cluster (it's unclear how many CPUs you are using currently). H2O GBM should be able to finish training, assuming it has enough memory and CPUs, but it may take a while.

Tune glmnet hyperparameters and evaluate performance using nested cross-validation in mlr?

I'm trying to use the R package mlr to train a glmnet model on a binary classification problem with a large dataset (about 850000 rows and about 100 features) on very modest hardware (my laptop with 4GB RAM --- I don't have access to more CPU muscle). I decided to use mlr because I need to use nested cross-validation to tune the hyperparameters of my classifier and evaluate the expected performance of the final model. To the best of my knowledge, neither caret or h2o offer nested cross-validation at present, but mlr provides provides the infrastructure to do this. However, I find the huge number of functions provided by mlr extremely overwhelming, and it's difficult to know how to slot everything together to achieve my goal. What goes where? How do they fit together? I've read through the entire documentation here: https://mlr-org.github.io/mlr-tutorial/release/html/ and I'm still confused. There are code snippets that show how to do specific things, but it's unclear (to me) how to stitch these together. What's the big picture? I looked for a complete worked example to use as a template and only found this: https://www.bioconductor.org/help/course-materials/2015/CSAMA2015/lab/classification.html which I have been using as my start point. Can anyone help fill in the gaps?
Here's what I want to do:
Tune the hyperparameters (l1 and l2 regularisation parameters) of a glmnet model using grid search or random grid search (or anything faster if it exists -- iterated F-racing? Adaptive resampling?) and stratified k-fold cross-validation inner loop, with an outer cross-validation loop to assess the expected final performance. I want to include a feature preprocessing step in the inner loop with centering, scaling, and Yeo-Johnson transformation, and fast filter-based feature selection (the latter is a necessity because I have very modest hardware and I need to slim the feature space to decrease training time). I have imbalanced classes (positive class is about 20%) so I have opted to use AUC as my optimisation objective, but this is only a surrogate for the real metric of interest, with is the false positive rate for a small number of true positive fixed points (i.e., I want to know the FPR for TPR = 0.6, 0.7, 0.8). I'd like to tune the probability thresholds to achieve those TPRs, and note that this is possible in nested CV, but it's not clear exactly what is being optimised here:
https://github.com/mlr-org/mlr/issues/856
I'd like to know where the cut should be without incurring information leakage, so I want to pick this using CV.
I'm using glmnet because I'd rather spend my CPU cycles on building a robust model than a fancy model that produces over-optimistic results. GBM or Random Forest can be done later if I find it can be done fast enough, but I don't expect the features in my data to be informative enough to bother investing much time in training anything particularly complex.
Finally, after I've obtained an estimate of what performance I can expect from the final model, I want to actually build the final model and obtain the coefficients of the glmnet model --- including which ones are zero, so I know which features have been selected by the LASSO penalty.
Hope all this makes sense!
Here's what I've got so far:
df <- as.data.frame(DT)
task <- makeClassifTask(id = "glmnet",
data = df,
target = "Flavour",
positive = "quark")
task
lrn <- makeLearner("classif.glmnet", predict.type = "prob")
lrn
# Feature preprocessing -- want to do this as part of CV:
lrn <- makePreprocWrapperCaret(lrn,
ppc.center = TRUE,
ppc.scale = TRUE,
ppc.YeoJohnson = TRUE)
lrn
# I want to use the implementation of info gain in CORElearn, not Weka:
infGain = makeFilter(
name = "InfGain",
desc = "Information gain ",
pkg = "CORElearn",
supported.tasks = c("classif", "regr"),
supported.features = c("numerics", "factors"),
fun = function(task, nselect, ...) {
CORElearn::attrEval(
getTaskFormula(task),
data = getTaskData(task), estimator = "InfGain", ...)
}
)
infGain
# Take top 20 features:
lrn <- makeFilterWrapper(lrn, fw.method = "InfGain", fw.abs = 20)
lrn
# Now things start to get foggy...
tuningLrn <- makeTuneWrapper(
lrn,
resampling = makeResampleDesc("CV", iters = 2, stratify = TRUE),
par.set = makeParamSet(
makeNumericParam("s", lower = 0.001, upper = 0.1),
makeNumericParam("alpha", lower = 0.0, upper = 1.0)
),
control = makeTuneControlGrid(resolution = 2)
)
r2 <- resample(learner = tuningLrn,
task = task,
resampling = rdesc,
measures = auc)
# Now what...?

setting values for ntree and mtry for random forest regression model

I'm using R package randomForest to do a regression on some biological data. My training data size is 38772 X 201.
I just wondered---what would be a good value for the number of trees ntree and the number of variable per level mtry? Is there an approximate formula to find such parameter values?
Each row in my input data is a 200 character representing the amino acid sequence, and I want to build a regression model to use such sequence in order to predict the distances between the proteins.
The default for mtry is quite sensible so there is not really a need to muck with it. There is a function tuneRF for optimizing this parameter. However, be aware that it may cause bias.
There is no optimization for the number of bootstrap replicates. I often start with ntree=501 and then plot the random forest object. This will show you the error convergence based on the OOB error. You want enough trees to stabilize the error but not so many that you over correlate the ensemble, which leads to overfit.
Here is the caveat: variable interactions stabilize at a slower rate than error so, if you have a large number of independent variables you need more replicates. I would keep the ntree an odd number so ties can be broken.
For the dimensions of you problem I would start ntree=1501. I would also recommended looking onto one of the published variable selection approaches to reduce the number of your independent variables.
The short answer is no.
The randomForest function of course has default values for both ntree and mtry. The default for mtry is often (but not always) sensible, while generally people will want to increase ntree from it's default of 500 quite a bit.
The "correct" value for ntree generally isn't much of a concern, as it will be quite apparent with a little tinkering that the predictions from the model won't change much after a certain number of trees.
You can spend (read: waste) a lot of time tinkering with things like mtry (and sampsize and maxnodes and nodesize etc.), probably to some benefit, but in my experience not a lot. However, every data set will be different. Sometimes you may see a big difference, sometimes none at all.
The caret package has a very general function train that allows you to do a simple grid search over parameter values like mtry for a wide variety of models. My only caution would be that doing this with fairly large data sets is likely to get time consuming fairly quickly, so watch out for that.
Also, somehow I forgot that the ranfomForest package itself has a tuneRF function that is specifically for searching for the "optimal" value for mtry.
Could this paper help ?
Limiting the Number of Trees in Random Forests
Abstract. The aim of this paper is to propose a simple procedure that
a priori determines a minimum number of classifiers to combine in order
to obtain a prediction accuracy level similar to the one obtained with the
combination of larger ensembles. The procedure is based on the McNemar
non-parametric test of significance. Knowing a priori the minimum
size of the classifier ensemble giving the best prediction accuracy, constitutes
a gain for time and memory costs especially for huge data bases
and real-time applications. Here we applied this procedure to four multiple
classifier systems with C4.5 decision tree (Breiman’s Bagging, Ho’s
Random subspaces, their combination we labeled ‘Bagfs’, and Breiman’s
Random forests) and five large benchmark data bases. It is worth noticing
that the proposed procedure may easily be extended to other base
learning algorithms than a decision tree as well. The experimental results
showed that it is possible to limit significantly the number of trees. We
also showed that the minimum number of trees required for obtaining
the best prediction accuracy may vary from one classifier combination
method to another
They never use more than 200 trees.
One nice trick that I use is to initially start with first taking square root of the number of predictors and plug that value for "mtry". It is usually around the same value that tunerf funtion in random forest would pick.
I use the code below to check for accuracy as I play around with ntree and mtry (change the parameters):
results_df <- data.frame(matrix(ncol = 8))
colnames(results_df)[1]="No. of trees"
colnames(results_df)[2]="No. of variables"
colnames(results_df)[3]="Dev_AUC"
colnames(results_df)[4]="Dev_Hit_rate"
colnames(results_df)[5]="Dev_Coverage_rate"
colnames(results_df)[6]="Val_AUC"
colnames(results_df)[7]="Val_Hit_rate"
colnames(results_df)[8]="Val_Coverage_rate"
trees = c(50,100,150,250)
variables = c(8,10,15,20)
for(i in 1:length(trees))
{
ntree = trees[i]
for(j in 1:length(variables))
{
mtry = variables[j]
rf<-randomForest(x,y,ntree=ntree,mtry=mtry)
pred<-as.data.frame(predict(rf,type="class"))
class_rf<-cbind(dev$Target,pred)
colnames(class_rf)[1]<-"actual_values"
colnames(class_rf)[2]<-"predicted_values"
dev_hit_rate = nrow(subset(class_rf, actual_values ==1&predicted_values==1))/nrow(subset(class_rf, predicted_values ==1))
dev_coverage_rate = nrow(subset(class_rf, actual_values ==1&predicted_values==1))/nrow(subset(class_rf, actual_values ==1))
pred_prob<-as.data.frame(predict(rf,type="prob"))
prob_rf<-cbind(dev$Target,pred_prob)
colnames(prob_rf)[1]<-"target"
colnames(prob_rf)[2]<-"prob_0"
colnames(prob_rf)[3]<-"prob_1"
pred<-prediction(prob_rf$prob_1,prob_rf$target)
auc <- performance(pred,"auc")
dev_auc<-as.numeric(auc#y.values)
pred<-as.data.frame(predict(rf,val,type="class"))
class_rf<-cbind(val$Target,pred)
colnames(class_rf)[1]<-"actual_values"
colnames(class_rf)[2]<-"predicted_values"
val_hit_rate = nrow(subset(class_rf, actual_values ==1&predicted_values==1))/nrow(subset(class_rf, predicted_values ==1))
val_coverage_rate = nrow(subset(class_rf, actual_values ==1&predicted_values==1))/nrow(subset(class_rf, actual_values ==1))
pred_prob<-as.data.frame(predict(rf,val,type="prob"))
prob_rf<-cbind(val$Target,pred_prob)
colnames(prob_rf)[1]<-"target"
colnames(prob_rf)[2]<-"prob_0"
colnames(prob_rf)[3]<-"prob_1"
pred<-prediction(prob_rf$prob_1,prob_rf$target)
auc <- performance(pred,"auc")
val_auc<-as.numeric(auc#y.values)
results_df = rbind(results_df,c(ntree,mtry,dev_auc,dev_hit_rate,dev_coverage_rate,val_auc,val_hit_rate,val_coverage_rate))
}
}

Resources