h2o GBM early stopping - r

I'm trying to overfit a GBM with h2o (I know it's weird, but I need this to make a point). So I increased the max_depth of my trees and the shrinkage, and disabled the stopping criterion:
overfit <- h2o.gbm(y = response
                   , training_frame = tapp.hex
                   , ntrees = 100
                   , max_depth = 30
                   , learn_rate = 0.1
                   , distribution = "gaussian"
                   , stopping_rounds = 0
                   )
The overfitting works great, but I've noticed that the training error does not improve after the 64th tree. Do you know why? If I understand the concept of boosting well enough, the training error should converge to 0 as the number of trees increases.
Information on my data:
Around 1 million observations
10 variables
Response variable is quantitative.
Have a good day !

Did you try lowering the min_split_improvement parameter? The default of 1e-5 is already microscopic, but it becomes relevant when you have a million rows.
My guess is that all trees after the 64th (in your example) are trivial?

If the 0.1 learning rate isn't working for you, I'd recommend decreasing the learning rate to something like 0.01 or 0.001. Although you state that the training error stops decreasing after tree 64, I'd still recommend adding more trees, at least 1000-5000, especially if you try a slower learning rate.
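Pulling the two suggestions above together, here is a rough sketch of a more aggressively overfitting call. The exact parameter values (and whether min_rows = 1 is practical on a million rows) are illustrative assumptions, not tested settings:
overfit <- h2o.gbm(y = response
                   , training_frame = tapp.hex
                   , ntrees = 1000                  # many more trees
                   , max_depth = 30
                   , learn_rate = 0.01              # slower shrinkage
                   , min_rows = 1                   # allow leaves covering a single row
                   , min_split_improvement = 0      # never reject a split as "too small"
                   , stopping_rounds = 0
                   , distribution = "gaussian"
                   )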

Related

Is there a way to get the training scoring history based on in-bag samples (instead of out-of-bag samples) in h2o.randomForest()?

I would like to know the in-bag training metrics of my random forest fit in h2o R version:
rf_cv = h2o.randomForest(x = x, y = y,
training_frame = cali,
ntrees = 800,
nfolds = 5,
mtries = 3,
seed = 98)
When I print the scoring history I get the following training metrics:
> rf_cv@model$scoring_history
number_of_trees training_rmse training_mae training_deviance
1 0.70767 0.45476 0.50080
...
800 0.47283 0.30862 0.22357
But those metrics are from the out-of-bag samples, as shown in the performance summary:
> h2o.performance(rf_cv)
H2ORegressionMetrics: drf
** Reported on training data. **
** Metrics reported on Out-Of-Bag training samples **
MSE: 0.2235729
RMSE: 0.4728349
MAE: 0.3086151
RMSLE: 0.1403068
Mean Residual Deviance : 0.2235729
I know I could just get the overall in-bag training performance with h2o.performance(rf_cv, data = train) but I need the scoring history. I've gone through the documentation and looked for similar questions but I've found nothing so far. Any help would be appreciated.
Not an expert, but I took some time to look into it, and as far as I can tell there is no way to retrieve an in-bag score; arguably it wouldn't make much sense to calculate one for a random forest anyway. I also didn't find anything related to the out-of-bag or in-bag samples themselves, so I would assume they don't get stored either. However, if you did manage to find them somewhere, you could probably use h2o.getModelTree() to produce some kind of score.
You could also look into the source code and see if you get any deeper insights.
I also found this question, which might help you. However, it does not use h2o but the randomForest R package, if that is an option for you. There you can retrieve the OOB samples, so you know which samples are in-bag and can then score them yourself, though I have not done this myself.
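For what it's worth, here is a minimal sketch of that randomForest-based workaround. It assumes a plain data.frame called train_df with a numeric response column y (both names are made up), and it produces a single in-bag RMSE rather than a per-tree scoring history:
library(randomForest)

set.seed(98)
rf <- randomForest(y ~ ., data = train_df, ntree = 800, mtry = 3,
                   keep.inbag = TRUE)   # store which rows each tree saw

# Per-tree predictions for every training row (n x ntree matrix)
pred_all <- predict(rf, newdata = train_df, predict.all = TRUE)$individual

# For each row, average only over the trees where that row was in-bag
inbag <- rf$inbag > 0
inbag_pred <- rowSums(pred_all * inbag) / rowSums(inbag)

inbag_rmse <- sqrt(mean((train_df$y - inbag_pred)^2))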

h2o.GBM taking too long on a small sized dataset

I've got a rather small dataset (162,000 observations with 13 attributes) that I'm trying to use for modelling with h2o.gbm. The response variable is categorical with a large number of levels (~20,000).
The model doesn't run out of memory or give any errors, but it had been going for nearly 24 hours without any progress (the H2O GBM progress reporting stayed at 0%).
I finally gave in and stopped it.
I'm wondering if there's anything wrong with my hyperparameters, as the data is not particularly large.
Here's my code:
library(h2o)
localH2O <- h2o.init(nthreads = -1, max_mem_size = "12g")
train.h20 <- as.h2o(analdata_train)
gbm1 <- h2o.gbm(
y = response_var
, x = independ_vars
, training_frame = train.h20
, ntrees = 3
, max_depth = 5
, min_rows = 10
, stopping_tolerance = 0.001
, learn_rate = 0.1
, distribution = "multinomial"
)
The way H2O GBM multinomial classification works is that when you ask for 1 tree as a parameter, it actually builds a tree for each level in the response column under the hood.
So 1 tree really means 20,000 trees in your case.
2 trees would really mean 40,000, and so on...
(Note the binomial classification case takes a shortcut and builds only one tree for both classes.)
So... it will probably finish but it could take quite a long time!
It's probably not a good idea to train a classifier with 20,000 classes -- most GBM implementations won't even let you do that. Can you group/cluster the classes into a smaller number of groups so that you can train a model with a smaller number of classes? If so, then you could perform your training in a two-stage process -- the first model would have K classes (assuming you clustered your classes into K groups). Then you can train secondary models that further classify the observations into your original classes.
This type of two-stage process may make sense if your classes represent groups that naturally cluster into a hierarchy of groups -- such as zip codes or ICD-10 medical diagnostic codes, for example.
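To make the idea concrete, here is a rough sketch of that two-stage setup, reusing the names from the question (analdata_train, response_var, independ_vars) and assuming a hypothetical named vector class_to_group that maps each original class label to one of a much smaller number of group labels:
# Stage 0: add the (hypothetical) coarse group label to the training data
analdata_train$group <- factor(class_to_group[as.character(analdata_train[[response_var]])])
train.h20 <- as.h2o(analdata_train)

# Stage 1: predict the coarse group (K levels instead of ~20,000)
stage1 <- h2o.gbm(y = "group", x = independ_vars,
                  training_frame = train.h20,
                  ntrees = 3, max_depth = 5,
                  distribution = "multinomial")

# Stage 2: one model per group, predicting the original class within that group
groups <- levels(analdata_train$group)
stage2 <- lapply(groups, function(g) {
  sub <- droplevels(analdata_train[analdata_train$group == g, ])  # keep only this group's classes
  h2o.gbm(y = response_var, x = independ_vars,
          training_frame = as.h2o(sub),
          ntrees = 3, max_depth = 5,
          distribution = "multinomial")
})
names(stage2) <- groups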
If your use-case really demands that you train a 20,000 class GBM (and there's no way around it), then you should get a bigger cluster of machines to use in your H2O cluster (it's unclear how many CPUs you are using currently). H2O GBM should be able to finish training, assuming it has enough memory and CPUs, but it may take a while.

Parallelism in XGBoost machine learning technique

This has to do with the parallelism implementation of XGBoost.
I am trying to optimize XGBoost execution by giving it the parameter nthread = 16, where my system has 24 cores. But when I train my model, CPU utilization never seems to exceed roughly 20% at any point during training.
The code snippet is as follows:
param_30 <- list("objective" = "reg:linear", # linear
"subsample"= subsample_30,
"colsample_bytree" = colsample_bytree_30,
"max_depth" = max_depth_30, # maximum depth of tree
"min_child_weight" = min_child_weight_30,
"max_delta_step" = max_delta_step_30,
"eta" = eta_30, # step size shrinkage
"gamma" = gamma_30, # minimum loss reduction
"nthread" = nthreads_30, # number of threads to be used
"scale_pos_weight" = 1.0
)
model <- xgboost(data = training.matrix[,-5],
label = training.matrix[,5],
verbose = 1, nrounds=nrounds_30, params = param_30,
maximize = FALSE, early_stopping_rounds = searchGrid$early_stopping_rounds_30[x])
Please explain (if possible) how I can increase CPU utilization and speed up model training. Example code in R would be helpful for further understanding.
Assumption: this is about execution via the XGBoost R package.
This is a guess... but I have had this happen to me.
You are spending too much time communicating during the parallelism and never getting CPU-bound. https://en.wikipedia.org/wiki/CPU-bound
The bottom line is that your data isn't large enough (rows and columns), and/or your trees aren't deep enough (max_depth), to warrant that many cores: there is too much overhead. xgboost parallelizes split evaluations, so deep trees on big data can keep the CPU humming at max.
I have trained many models where single-threaded outperforms 8/16 cores. Too much time switching and not enough work.
**MORE DATA, DEEPER TREES, OR FEWER CORES :)**
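One rough way to check whether threading overhead really is the culprit is to time the same small training run with a few different nthread values. A sketch (with synthetic data and arbitrary settings, not your actual parameters):
library(xgboost)

set.seed(1)
X <- matrix(rnorm(20000 * 10), ncol = 10)   # small synthetic data, just for timing
y <- rnorm(20000)
dtrain <- xgb.DMatrix(data = X, label = y)

for (nt in c(1, 4, 16)) {
  elapsed <- system.time(
    xgb.train(params = list(objective = "reg:squarederror",   # newer name for reg:linear
                            max_depth = 6, nthread = nt),
              data = dtrain, nrounds = 50, verbose = 0)
  )["elapsed"]
  cat("nthread =", nt, "->", round(elapsed, 2), "seconds\n")
}
If 1 or 4 threads is no slower than 16 on data of your size, the overhead explanation above is likely the right one.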
I tried to answer this question but my post was deleted by a moderator. Please see https://stackoverflow.com/a/67188355/5452057, which I believe could also help you; it relates to missing MPI support in the xgboost R package for Windows available from CRAN.

Tune glmnet hyperparameters and evaluate performance using nested cross-validation in mlr?

I'm trying to use the R package mlr to train a glmnet model on a binary classification problem with a large dataset (about 850,000 rows and about 100 features) on very modest hardware (my laptop with 4GB RAM --- I don't have access to more CPU muscle). I decided to use mlr because I need nested cross-validation to tune the hyperparameters of my classifier and evaluate the expected performance of the final model. To the best of my knowledge, neither caret nor h2o offer nested cross-validation at present, but mlr provides the infrastructure to do this.
However, I find the huge number of functions provided by mlr extremely overwhelming, and it's difficult to know how to slot everything together to achieve my goal. What goes where? How do they fit together? I've read through the entire documentation here: https://mlr-org.github.io/mlr-tutorial/release/html/ and I'm still confused. There are code snippets that show how to do specific things, but it's unclear (to me) how to stitch these together. What's the big picture? I looked for a complete worked example to use as a template and only found this: https://www.bioconductor.org/help/course-materials/2015/CSAMA2015/lab/classification.html which I have been using as my starting point. Can anyone help fill in the gaps?
Here's what I want to do:
Tune the hyperparameters (the l1 and l2 regularisation parameters) of a glmnet model using grid search or random grid search (or anything faster if it exists -- iterated F-racing? Adaptive resampling?) with a stratified k-fold cross-validation inner loop, and an outer cross-validation loop to assess the expected final performance.
I want to include a feature preprocessing step in the inner loop with centering, scaling, and the Yeo-Johnson transformation, plus fast filter-based feature selection (the latter is a necessity because I have very modest hardware and I need to slim the feature space to decrease training time).
I have imbalanced classes (the positive class is about 20%), so I have opted to use AUC as my optimisation objective, but this is only a surrogate for the real metric of interest, which is the false positive rate at a small number of fixed true positive rates (i.e., I want to know the FPR for TPR = 0.6, 0.7, 0.8). I'd like to tune the probability thresholds to achieve those TPRs; note that this is possible in nested CV, but it's not clear exactly what is being optimised here:
https://github.com/mlr-org/mlr/issues/856
I'd like to know where the cut should be without incurring information leakage, so I want to pick this using CV.
I'm using glmnet because I'd rather spend my CPU cycles on building a robust model than a fancy model that produces over-optimistic results. GBM or Random Forest can be done later if I find it can be done fast enough, but I don't expect the features in my data to be informative enough to bother investing much time in training anything particularly complex.
Finally, after I've obtained an estimate of what performance I can expect from the final model, I want to actually build the final model and obtain the coefficients of the glmnet model --- including which ones are zero, so I know which features have been selected by the LASSO penalty.
Hope all this makes sense!
Here's what I've got so far:
library(mlr)

df <- as.data.frame(DT)
task <- makeClassifTask(id = "glmnet",
data = df,
target = "Flavour",
positive = "quark")
task
lrn <- makeLearner("classif.glmnet", predict.type = "prob")
lrn
# Feature preprocessing -- want to do this as part of CV:
lrn <- makePreprocWrapperCaret(lrn,
ppc.center = TRUE,
ppc.scale = TRUE,
ppc.YeoJohnson = TRUE)
lrn
# I want to use the implementation of info gain in CORElearn, not Weka:
infGain = makeFilter(
name = "InfGain",
desc = "Information gain ",
pkg = "CORElearn",
supported.tasks = c("classif", "regr"),
supported.features = c("numerics", "factors"),
fun = function(task, nselect, ...) {
CORElearn::attrEval(
getTaskFormula(task),
data = getTaskData(task), estimator = "InfGain", ...)
}
)
infGain
# Take top 20 features:
lrn <- makeFilterWrapper(lrn, fw.method = "InfGain", fw.abs = 20)
lrn
# Now things start to get foggy...
tuningLrn <- makeTuneWrapper(
lrn,
resampling = makeResampleDesc("CV", iters = 2, stratify = TRUE),
par.set = makeParamSet(
makeNumericParam("s", lower = 0.001, upper = 0.1),
makeNumericParam("alpha", lower = 0.0, upper = 1.0)
),
control = makeTuneControlGrid(resolution = 2)
)
# Outer resampling loop (rdesc was missing from the snippet above):
rdesc <- makeResampleDesc("CV", iters = 3, stratify = TRUE)
r2 <- resample(learner = tuningLrn,
               task = task,
               resampling = rdesc,
               measures = auc)
# Now what...?
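One possible continuation -- a sketch under the setup above, not a definitive recipe -- is to read the FPR at the fixed TPRs off the pooled outer-CV predictions, then retrain the tuned pipeline on the full task and extract the glmnet coefficients. getRRPredictions(), getPredictionProbabilities(), getPredictionTruth(), getTuneResult() and getLearnerModel() are standard mlr accessors; the positive class "quark" comes from the task definition, and s = 0.01 below is only a placeholder for whatever penalty value the tuning actually selects.
# FPR at fixed TPRs from the pooled outer-CV predictions (manual ROC sweep)
pred  <- getRRPredictions(r2)
prob  <- getPredictionProbabilities(pred)       # P(class == "quark")
truth <- getPredictionTruth(pred) == "quark"

ord <- order(prob, decreasing = TRUE)
tpr <- cumsum(truth[ord]) / sum(truth)
fpr <- cumsum(!truth[ord]) / sum(!truth)
sapply(c(0.6, 0.7, 0.8), function(tp) fpr[which(tpr >= tp)[1]])

# Final model: re-run the tuning on the full task, then unwrap down to the glmnet fit
final <- train(tuningLrn, task)
getTuneResult(final)$x                          # the selected s and alpha
glmnet_fit <- getLearnerModel(final, more.unwrap = TRUE)
coef(glmnet_fit, s = 0.01)                      # zero coefficients = dropped by the LASSO penalty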

Speed up cross validated random forest approach by parallel computing

I'm trying to speed up my random forest approach by parallel computing. My dataset contains 20,000 rows and 10 columns. The dependent variable to be predicted is numeric, and there are two factors among the independent variables (one has 2 levels and the other has 504 levels).
I think the train function encodes all the factor variables as dummy variables, so no manual encoding is needed in this case.
Could you please give me some advice on how to speed up the following code? I would appreciate any suggestions. The solution below never finishes. Thanks a lot in advance.
library(doParallel); library(caret)
set.seed(975)
forTraining <- createDataPartition(DATA$NumVar,
p = 3/4)[[1]]
trainingSet <- DATA[forTraining,]
testSet <- DATA[-forTraining,]
controlObject <- trainControl(method = "repeatedcv",
repeats = 5,
number = 10)
#run model in parallel
cl <- makeCluster(detectCores())
registerDoParallel(cl)
set.seed(669)
rfModel <- train(NumVar ~ .,
data = trainingSet,
method = "rf",
tuneLength = 10,
ntree = 1000,
importance = TRUE,
trControl = controlObject)
stopCluster(cl)
My response is too verbose for a comment so hopefully I can help guide you. Here is a summary of the points in the comments above.
Our primary concern - Computation Time
One major limitation on random forest computation - Number of Trees
The general idea is that as you increase the number of trees, your random forest model will improve (i.e. lower error). However, this increase in performance will diminish as you continue to add trees, whereas computation time will continue to increase. As such, you reach a point of diminishing returns. So how do we determine how many trees to use?
Well, naively we could simply fit the randomForest model with the call you provide. Another option is to do cross-validation on ntree, but that isn't implemented by default in caret, and Max Kuhn (the author) really knows his stuff when it comes to predictive models. So, to get started, you are on the correct track with the call you provided above:
randomForest(dependentVariable ~ ., data = dataSet, mtry = 522, ntree=3000, importance=TRUE, do.trace=100)
But let's make this reproducible and use the mlbench Sonar dataset.
library(mlbench)
data(Sonar)
But we don't care about variable importance at the moment, so let's remove that. Also, your ntree is way too high to start. I would be surprised if you need it that high in the end. Starting at a lower level we have the following:
set.seed(825)
rf1 <- randomForest(Class~., data=Sonar, mtry=3, ntree=200, do.trace=25)
ntree OOB 1 2
25: 16.83% 11.71% 22.68%
50: 18.27% 12.61% 24.74%
75: 17.31% 17.12% 17.53%
100: 15.38% 12.61% 18.56%
125: 15.38% 10.81% 20.62%
150: 16.35% 13.51% 19.59%
175: 15.87% 10.81% 21.65%
200: 14.42% 8.11% 21.65%
As you can see, the OOB error is bottoming out at approximately 100 trees. However, if I am uncertain or if the OOB error is still dropping significantly, I could run the call again with a larger number of trees. Once you have a working number of trees, then you can tune your mtry with caret::train.
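For instance, a minimal sketch of that mtry tuning step with caret::train on the same Sonar data (the 5-fold CV and the mtry grid are just illustrative choices, not recommendations):
library(caret)

set.seed(825)
rf_tuned <- train(Class ~ ., data = Sonar,
                  method = "rf",
                  ntree = 100,                                   # the "working" tree count found above
                  tuneGrid = expand.grid(mtry = c(2, 3, 5, 8)),
                  trControl = trainControl(method = "cv", number = 5))
rf_tuned$bestTune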
If you do end up needing to use lots of trees (i.e. 1000's) then your code is likely going to be slow. If you have access to machines that have many processors and large amounts of RAM then your parallel implementation can help but on more common machines it will be slow going.
