A question about the parallelism in h2o.grid() function - r

I try to use the h2o.grid() function from the h2o package to do some tuning using R, when I set the parameter parallelism larger then 1, it always shows the warning
Some models were not built due to a failure, for more details run `summary(grid_object, show_stack_traces = TRUE)
And the model_ids in the final grid object includes many models end with _cv_1, _cv_2 etc, and the number of the models is not equal to the setting of my max_models in search_criteria list, I think they are just the models in the cv process, not the final model.
When I set parallelism larger than 1:
When I leave the parallelism default or set it to 1, the result is normal, all models end with _model_1, _model_2 etc.
When I leave the "parallelism" default or set it to 1:
Here is my code:
# set the grid
rf_h2o_grid <- list(mtries = seq(3, ncol(train_h2o), 4),
max_depth = c(5, 10, 15, 20))
# set the search_criteria
sc <- list(strategy = "RandomDiscrete",
seed = 100,
max_models = 5
)
# random grid tuning
rf_h2o_grid_tune_random <- h2o.grid(
algorithm = "randomForest",
x = x,
y = y,
training_frame = train_h2o,
nfolds = 5, # use cv to validate the parameters
fold_assignment = "Stratified",
ntrees = 100,
seed = 100,
hyper_params = rf_h2o_grid,
search_criteria = sc
# parallelism = 6 # when I set it larger than 1, the result always includes some "cv_" models
)
So how can I use the parallelism correctly in h2o.grid()? Thanks for helping!

This is an actual issue with parallelism in grid search, previously noticed but not reported correctly.
Thanks for raising this, we'll try to fix it soon: see https://h2oai.atlassian.net/browse/PUBDEV-7886 if you want to track progress.
Until proper fix, you must avoid using both CV and parallelism in your grids.
Regarding the following error:
Some models were not built due to a failure, for more details run `summary(grid_object, show_stack_traces = TRUE)
if the error is reproducible, you should be getting more details by running the grid with verbose=True.
Adding the entire error message to the ticket above might also help.

This is because you set max_models = 5, your grid will only make 5 models then stop.
There are three ways to set up early stopping criteria:
"max_models": max number of models created
"max_runtime_secs": max running time in seconds
metric-based early stopping by setting up for "stopping_rounds", "stopping_metric", and "stopping_tolerance"

Related

extract_inner_fselect_results is NULL with mlr3 Nested Resampling

This question is an extension of the following question: No Model Stored with Mlr3.
I have been performing nested resampling to get an unbiased metric of model performance. If I don't specify store_models=TRUE then I get Error: No model stored at the end of the run. However, if I specify store_models=TRUE in both the at and resample calls then RStudio crashes due to RAM consumption.
I have now tried the following code in which I specified store_models=TRUE for just the at call:
MSvCon<-read.csv("MS v Control Proteomics Final.csv", row.names=1)
MSvCon$Status<-as.factor(MSvCon$Status)
MSvCon[,2:4399]<-scale(MSvCon[,2:4399], center=TRUE, scale=TRUE)
set.seed(123, "L'Ecuyer")
task = as_task_classif(MSvCon, target = "Status")
learner = lrn("classif.ranger", importance = "impurity", num.trees=10000)
set_threads(learner, n = 8)
measure = msr("classif.fbeta", beta=1, average="micro")
terminator = trm("none")
resampling_inner = rsmp("repeated_cv", folds = 10, repeats = 10)
at = AutoFSelector$new(
learner = learner,
resampling = resampling_inner,
measure = measure,
terminator = terminator,
fselect = fs("rfe", n_features = 1, feature_fraction = 0.5, recursive = FALSE),
store_models=TRUE)
resampling_outer = rsmp("repeated_cv", folds = 10, repeats = 10)
rr = resample(task, at, resampling_outer)
After finishing, I am able to extract performance measures successfully. However, I tried to use extract_inner_fselect_results and extract_inner_fselect_archives to check what features were selected and importance measures but received a NULL result.
Do you have any suggestions on what I would need to adjust in my code to see this information? I anticipate that adding store_models=TRUE to the resample call would but the RAM consumption issue (even using 128GB on Rstudio Workbench) prevents that. Is there a way around this?
The archives of the inner resampling are stored in the model slot of the AutoFSelectors i.e. without store_models = TRUE in resample() you cannot access the inner results and archives. I will write a workaround for you and answer in the other question.

Is there a way to know how far R has gotten on a random forest model?

I am currently learning about random forests and how to create them in R. However as I have discovered, it can be quite the time consuming activity creating these trees, and sometimes I do not know how far R has gotten or if it has crashed, and so I close R in panic. I use the randomForest package, and my code is as follows:
model <- randomForest(def ~ .,
data = mydataset,
mtry = 4,
ntree = 200,
importance = TRUE)
Is there a way to make R show me how far it has gotten at any time, or when it is finished with one tree and is continuing to the next?
In situation such as these, you are typically looking for an argument that makes the function more verbose. This is typically something like verbose = TRUE but it varies and some functions do not offer any kind of verbosity settings.
In your case, you just have to look up the help of randomForest (with ?randomForest::randomForest) to find the argument do.trace.
do.trace
If set to TRUE, give a more verbose output as randomForest is run. If set to some integer, then running output is printed for every do.trace trees.
In other words, you can enable verbosity with:
model <- randomForest(def ~ ., data = mydataset, mtry = 4,
ntree = 200, importance = TRUE, do.trace = TRUE)
or, to print some feedback every 100 trees:
model <- randomForest(def ~ ., data = mydataset, mtry = 4,
ntree = 200, importance = TRUE, do.trace = 100)
It is always a good reflex to check the manual of the function as a first step. If you use rstudio, you can use the help pane instead of using ? or ??.

R h2o server CURL error, kind of repeatable

At first I thought it was a random issue, but re-running the script it happens again.
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = urlSuffix, :
Unexpected CURL error: Recv failure: Connection reset by peer
I'm doing a grid search on a medium-size dataset (roughly 40000 x 30) with a Gradient Boosting Machine model. The largest tree in the grid is 1000. This usually happens after running for a couple of hours. I set max_mem_size to 30Gb.
for ( k in 1:nrow(par.grid)) {
hg = h2o.gbm(training_frame = Xtr.hf,
validation_frame = Xt.hf,
distribution="huber",
huber_alpha = HuberAlpha,
x=2:ncol(Xtr.hf),
y=1,
ntrees = par.grid[k,"ntree"],
max_depth = depth,
learn_rate = par.grid[k,"shrink"],
min_rows = par.grid[k,"min_leaf"],
sample_rate = samp_rate,
col_sample_rate = c_samp_rate,
nfolds = 5,
model_id = p(iname, "_gbm_CV")
)
cv_result[k,1] = h2o.mse(hg, train=TRUE)
cv_result[k,2] = h2o.mse(hg, valid=TRUE)
}
Try adding gc() in your innermost loop. Even better would be to explicitly use h2o.rm().
So, it would become something like:
for ( k in 1:nrow(par.grid)) {
hg = h2o.gbm(...stuff...,
model_id = p(iname, "_gbm_CV")
)
cv_result[k,1] = h2o.mse(hg, train=TRUE)
cv_result[k,2] = h2o.mse(hg, valid=TRUE)
h2o.rm(hg);rm(hg);gc()
}
Theoretically this shouldn't matter, but if R holds on to the reference, then H2O will too.
If you think you might want to investigate any models further, and you have plenty of local disk space, you could do h2o.saveModel() before your h2o.mse() calls. (You'll need to specify a filename that somehow summarizes all your parameters, of course...)
UPDATE based on comment: If you do not need to keep any models or data, then using h2o.removeAll() is another way to rapidly reclaim all the memory. (This approach is also worth considering if any data or models you do need preserved are quick and easy to re-load.)

How to find optimal number of epoch in general while using h2o.deeplearning()?

As i have just started studying about Neural networks modelling , I wanted to know that, Suppose i have a dataset for classification problem and i am using
churn_model <- h2o.deeplearning(x = setdiff(names(churn), names(churn)[10]),
y = names(churn)[10],
training_frame = churnTrain,
validation_frame = churnValidation,
distribution = "multinomial",
activation = "RectifierWithDropout",
hidden = c(200,200,200),
hidden_dropout_ratio = c(0.1, 0.1, 0.1),
epochs = 50 , stopping_rounds = 0,
l1 = 1e-5)
So how can i determine through any function or something what will be the number of epoch that i can use?
Usually, you estimate it empirically. Start training with an initially large number of iterations/epochs and stop it whenever you observe overfitting, i.e. you validation error goes up while train error diminishes. This is called early stopping.
If you do not observe overfitting, increase the number of iterations, until validation error becomes stable.

using caret package to find optimal parameters of GBM

I'm using the R GBM package for boosting to do regression on some biological data of dimensions 10,000 X 932 and I want to know what are the best parameters settings for GBM package especially (n.trees, shrinkage, interaction.depth and n.minobsinnode) when I searched online I found that CARET package on R can find such parameter settings. However, I have difficulty on using the Caret package with GBM package, so I just want to know how to use caret to find the optimal combinations of the previously mentioned parameters ? I know this might seem very typical question, but I read the caret manual and still have difficulty in integrating caret with gbm, especially cause I'm very new to both of these packages
Not sure if you found what you were looking for, but I find some of these sheets less than helpful.
If you are using the caret package, the following describes the required parameters: > getModelInfo()$gbm$parameters
He are some rules of thumb for running GBM:
The interaction.depth is 1, and on most data sets that seems
adequate, but on a few I have found that testing the results against
odd multiples up the max has given better results. The max value I
have seen for this parameter is floor(sqrt(NCOL(training))).
Shrinkage: the smaller the number, the better the predictive value,
the more trees required, and the more computational cost. Testing
the values on a small subset of data with something like shrinkage =
shrinkage = seq(.0005, .05,.0005) can be helpful in defining the
ideal value.
n.minobsinnode: default is 10, and generally I don't mess with that.
I have tried c(5,10,15,20) on small sets of data, and didn't really
see an adequate return for computational cost.
n.trees: the smaller the shrinkage, the more trees you should have.
Start with n.trees = (0:50)*50 and adjust accordingly.
Example setup using the caret package:
getModelInfo()$gbm$parameters
library(parallel)
library(doMC)
registerDoMC(cores = 20)
# Max shrinkage for gbm
nl = nrow(training)
max(0.01, 0.1*min(1, nl/10000))
# Max Value for interaction.depth
floor(sqrt(NCOL(training)))
gbmGrid <- expand.grid(interaction.depth = c(1, 3, 6, 9, 10),
n.trees = (0:50)*50,
shrinkage = seq(.0005, .05,.0005),
n.minobsinnode = 10) # you can also put something like c(5, 10, 15, 20)
fitControl <- trainControl(method = "repeatedcv",
repeats = 5,
preProcOptions = list(thresh = 0.95),
## Estimate class probabilities
classProbs = TRUE,
## Evaluate performance using
## the following function
summaryFunction = twoClassSummary)
# Method + Date + distribution
set.seed(1)
system.time(GBM0604ada <- train(Outcome ~ ., data = training,
distribution = "adaboost",
method = "gbm", bag.fraction = 0.5,
nTrain = round(nrow(training) *.75),
trControl = fitControl,
verbose = TRUE,
tuneGrid = gbmGrid,
## Specify which metric to optimize
metric = "ROC"))
Things can change depending on your data (like distribution), but I have found the key being to play with gbmgrid until you get the outcome you are looking for. The settings as they are now would take a long time to run, so modify as your machine, and time will allow.
To give you a ballpark of computation, I run on a Mac PRO 12 core with 64GB of ram.
This link has a concrete example (page 10) -
http://www.jstatsoft.org/v28/i05/paper
Basically, one should first create a grid of candidate values for hyper parameters (like n.trees, interaction.depth and shrinkage). Then call the generic train function as usual.

Resources