Release H2O Grid From Memory? - r

I am struggling to find the correct API for releasing memory for an object created by the H2O grid. This code was pre-written by someone else and I am currently maintaining it.
#train grid search
gbm_grid1 <- h2o.grid(algorithm = "gbm" #specifies gbm algorithm is used
,grid_id = paste("gbm_grid1",current_date,sep="_") #defines a grid identification
,x = predictors #defines column variables to use as predictors
,y = y #specifies the response variable
,training_frame = train1 #specifies the training frame
#gbm parameters to remain fixed
,nfolds = 5 #specify 5 folds for cross-validation (acceptable here to reduce training time)
,distribution = "bernoulli" #specify that we are predicting a binary dependent variable
,ntrees = 1000 #specify the number of trees to build (1000 is essentially the maximum number of trees that can be built; the early stopping parameters defined later make it unlikely the model will reach 1000 trees)
,learn_rate = 0.1 #specify the learn rate used for the gradient descent optimization (goal is to use as small a learn rate as possible)
,learn_rate_annealing = 0.995 #specify that the learn rate decreases by a factor of 0.995 after each tree (this can help speed up training for our grid search)
,max_depth = tuned_max_depth
,min_rows = tuned_min_rows
,sample_rate = 0.8 #specify the fraction of rows sampled for each tree
,col_sample_rate = 0.8 #specify the fraction of columns sampled for each split decision
,stopping_metric = "logloss" #specify loss function
,stopping_tolerance = 0.001 #specify the minimum improvement in the stopping metric required for an individual model to continue training
,stopping_rounds = 5 #stop training a model if the stopping metric fails to improve by the tolerance over 5 consecutive scoring rounds
#specifies hyperparameters to fluctuate during model building in the grid search
,hyper_params = gbm_hp2
#specifies the search criteria, including early-stopping settings to speed up model building
,search_criteria = search_criteria2
#sets a reproducible seed
,seed = 123456
)
h2o.rm(gbm_grid1)
The problem is I believe this code was written a while ago and has since been deprecated. h2o.rm(gbm_grid1) fails and RStudio tells me that I require a hex identifier. So I assigned my object an identifier and tried h2o.rm(gbm_grid1, "identifier.hex"), and it tells me I cannot release this type of object.
The issue is that I run out of memory if I move on to the next steps of the script. What should I do?
This is what I get with h2o.ls():

Yes, you can remove objects with h2o.rm(). You can use the variable name or key.
h2o.rm(your_object)
h2o.rm("your_key")
You can use h2o.ls() to check what objects are in memory. Also, you can pass the argument cascade = TRUE to h2o.rm() to remove a grid's sub-models as well.
See more here
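For example, a minimal sketch of the clean-up flow (the grid id string below is a placeholder for whatever grid_id your script generates):
h2o.ls()                                # list all keys currently held in H2O memory
h2o.rm(gbm_grid1, cascade = TRUE)       # remove the grid and its sub-models by R object
# h2o.rm("gbm_grid1_2018_01_01")        # ...or remove it by its key / grid_id string instead
h2o.ls()                                # confirm the keys are gone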

Related

Meaning of alpha and beta parameters in function makeFeatSelControlSequential (MLR library in R)

For deterministic forward or backward search, I am used to giving thresholds for p-values linked to the coefficients of individual features. In the documentation of makeFeatSelControlSequential in R/mlr https://www.rdocumentation.org/packages/mlr/versions/2.13/topics/FeatSelControl, the alpha and beta parameters are described as follows:
alpha
(numeric(1)): Parameter of the sequential feature selection. Minimal required value of improvement difference for a forward / adding step. Default is 0.01.
beta
(numeric(1)): Parameter of the sequential feature selection. Minimal required value of improvement difference for a backward / removing step. Negative values imply that you allow a slight decrease for the removal of a feature. Default is -0.001.
It is however not clear what "improvement difference" means here. In the example below, I gave 0 as the threshold for a backward selection (the beta parameter). If this parameter related to a threshold on the p-value, I would expect to end up with a model with no features, but that is not the case, as I get an AUC of 0.9886302 instead of 0.5.
# 1. Find a synthetic dataset for supervised learning (two classes)
###################################################################
library(mlbench)
data(BreastCancer)
# generate 1000 rows, 21 quantitative candidate predictors and 1 target variable
p<-mlbench.waveform(1000)
# convert list into dataframe
dataset<-as.data.frame(p)
# drop third class to get 2 classes
dataset2 = subset(dataset, classes != 3)
dataset2 <- droplevels(dataset2)
# 2. Perform cross validation with embedded feature selection using logistic regression
##########################################################################################
library(BBmisc)
library(mlr)
set.seed(123, "L'Ecuyer")
set.seed(21)
# Choice of data
mCT <- makeClassifTask(data =dataset2, target = "classes")
# Choice of algorithm
mL <- makeLearner("classif.logreg", predict.type = "prob")
# Choice of cross-validations for folds
outer = makeResampleDesc("CV", iters = 10,stratify = TRUE)
# Choice of feature selection method
ctrl = makeFeatSelControlSequential(method = "sbs", maxit = NA,beta = 0)
# Choice of sampling between training and test within the fold
inner = makeResampleDesc("Holdout",stratify = TRUE)
lrn = makeFeatSelWrapper(mL, resampling = inner, control = ctrl)
r = resample(lrn, mCT, outer, extract = getFeatSelResult,measures = list(mlr::auc,mlr::acc,mlr::brier),models=TRUE)
The parameters control what difference in performance (for whatever performance measure you choose) is acceptable to proceed with a step along a forward or backward search. mlr doesn't compute any p-values, and no p-values are used in this process.
As the parameters only control what happens in a step, they also don't directly control the final outcome. What happens under the hood is that, e.g. for forward search, mlr computes the performances of all feature sets that expand the current one by a single feature and chooses the best one, as long as it provides at least the improvement specified in alpha (or beta for a removal step). This procedure repeats until either all features (forward search) or no features (backward search) are present, or until no minimum improvement as specified by the parameters can be achieved.
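As a hedged illustration of how the step threshold is used in practice (reusing the task mCT and learner mL from the question; the 3-fold CV here is an arbitrary choice), a forward search can be run directly with selectFeatures:
ctrl_fwd = makeFeatSelControlSequential(method = "sfs", alpha = 0.01)
res = selectFeatures(learner = mL, task = mCT,
                     resampling = makeResampleDesc("CV", iters = 3),
                     control = ctrl_fwd, measures = mlr::auc)
res$x # features kept once no single addition improved AUC by at least alpha
res$y # cross-validated performance of that feature set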

How to resample and compare the results when I just want to predict the last row of the data using surv. functions in the mlr package, R?

I have just started trying the R package mlr, and I am wondering if I can customize the training set and test set. For example, all the data in a time sequence are the training set except for the last row, and the last row is the test set.
Here is my example:
library(mlr)
library(survival)
library(dplyr) # needed for %>% and select()
data(lung)
myData2 <- lung %>%
select(time,status,age)
myData2$status = (myData2$status == 2)
myTrain <- c(1:(nrow(myData2)-1))
myTest <- nrow(myData2)
The lung data is from the survival package. I just use three variables: time, status and age. Now, let's suppose they do not mean the patients' ages and how long they survived. Let's say this is the ink purchase history of one customer.
age=74 means this customer bought 74 bottles of ink on that day, and time=306 means the customer ran out of ink after 306 days. So, I want to build a survival model using all the data except for the last row. Then, when I have the data of the last row, which is age=58, implying the customer bought 58 bottles of ink on that day, I can make a prediction on time. A number close to 177 would be a good estimate. So, my training set and test set are fixed and do not need to be resampled.
In addition, I need to change the hyperparameters for a comparison. Here is my code:
surv.task <- makeSurvTask(data=myData2,target=c('time','status'))
surv.lrn <- makeLearner("surv.cforest")
ps <- makeParamSet(
makeDiscreteParam('mincriterion',values=c(1.281552,2,3)),
makeDiscreteParam('ntree',values=c(100,200,300))
)
ctrl <- makeTuneControlGrid()
rdesc <- makeResampleDesc('Holdout',split=1,predict='train')
lrn = makeTuneWrapper(surv.lrn,control=ctrl,resampling=rdesc,par.set=ps,
measures = list(setAggregation(cindex,train.mean)))
mod <- train(learner=lrn,task=surv.task,subset=myTrain)
surv.pred <- predict(mod,task=surv.task,subset=myTest)
surv.pred
You can see that I use split=1 in makeResampleDesc because I have a fixed training set that does not need to be resampled. The measures argument in makeTuneWrapper is currently not meaningful to me, as I need to define my own measures. Because of the fixed data split, I cannot use functions like resample or tuneParams to get an evaluation on the test data when using different hyperparameters.
So, my question is: when the training set and test set are fixed, can mlr provide a comprehensive compare for every hyperparameter? If so, how to do it?
Incidentally, it looks like the function makeFixedHoldoutInstance might be able to do this, but I do not know how to use it. For example, when I use makeFixedHoldoutInstance in this way, I get the following error:
> f <- makeFixedHoldoutInstance(train.inds=myTrain,test.inds=myTest,size=length(myTrain)+1)
> lrn = makeTuneWrapper(surv.lrn,control=ctrl,resampling=f,par.set=ps)
> resample(learner=lrn,task=surv.task,resampling=f)
[Resample] holdout iter 1: [Tune] Started tuning learner surv.cforest for parameter set:
Type len Def Constr Req Tunable Trafo
mincriterion discrete - - 1.281552,2,3 - TRUE -
ntree discrete - - 100,200,300 - TRUE -
With control class: TuneControlGrid
Imputation value: -0
[Tune-x] 1: mincriterion=1.281552; ntree=100
Error in resample.fun(learner2, task, resampling, measures = measures, :
Size of data set: 227 and resampling instance: 228 differ!
With makeFixedHoldoutInstance you get the resampling you asked for.
But you cannot use the same fixed resampling indices both for the tuning inside the tuning wrapper and for the outer resampling.
This is because resample first splits the data according to the fixed holdout instance f. The tuning inside the tuning wrapper then also needs a resampling method to calculate the performance for a given configuration. As the tuning only sees the data after the split done by resample, it cannot apply the same fixed resampling.
From reading your question I guess you don't want to use the tuning wrapper but rather tune your learner directly. In that case you should simply use tuneParams:
tr = tuneParams(learner = surv.lrn, task = surv.task, resampling = cv2, par.set = ps, control = ctrl)
Note: This does not work on the given example because the cindex needs at least one uncensored observation and even then it does not make sense because the cindex is only meaningful for a bigger test set.
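If the tuning does run on your real data, here is a small follow-up sketch for the "comprehensive comparison" part of the question: every evaluated hyperparameter combination and its score is kept in the optimization path of the result (tr is assumed to be the object returned by the tuneParams call above):
as.data.frame(tr$opt.path) # one row per mincriterion/ntree combination with its measured performance
tr$x                       # the best configuration found
tr$y                       # its performance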

How to reproduce the H2o GBM class probability calculation

I've been using h2o.gbm for a classification problem, and wanted to understand a bit more about how it calculates the class probabilities. As a starting point, I tried to recalculate the class probability of a gbm with only 1 tree (by looking at the observations in the leafs), but the results are very confusing.
Let's assume my positive class variable is "buy" and negative class variable "not_buy" and I have a training set called "dt.train" and a separate test-set called "dt.test".
In a normal decision tree, the class probability for "buy" P(has_bought="buy") for a new data row (test-data) is calculated by dividing all observations in the leaf with class "buy" by the total number of observations in the leaf (based on the training data used to grow the tree).
However, h2o.gbm seems to do something different, even when I simulate a 'normal' decision tree (setting ntrees to 1, and all sample.rate parameters to 1). I think the best way to illustrate this confusion is by explaining what I did step by step.
Step 1: Training the model
I do not care about overfitting or model performance. I want to make my life as easy as possible, so I've set ntrees to 1 and made sure all training data (rows and columns) are used for each tree and split by setting all sample.rate parameters to 1. Below is the code to train the model.
base.gbm.model <- h2o.gbm(
x = predictors,
y = "has_bought",
training_frame = dt.train,
model_id = "2",
nfolds = 0,
ntrees = 1,
learn_rate = 0.001,
max_depth = 15,
sample_rate = 1,
col_sample_rate = 1,
col_sample_rate_per_tree = 1,
seed = 123456,
keep_cross_validation_predictions = TRUE,
stopping_rounds = 10,
stopping_tolerance = 0,
stopping_metric = "AUC",
score_tree_interval = 0
)
Step 2: Getting the leaf assignments of the training set
What I want to do, is use the same data that is used to train the model, and understand in which leaf they ended up in. H2o offers a function for this, which is shown below.
train.leafs <- h2o.predict_leaf_node_assignment(base.gbm.model, dt.train)
This will return the leaf node assignment (e.g. "LLRRLL") for each row in the training data. As we only have 1 tree, this column is called "T1.C1" which I renamed to "leaf_node", which I cbind with the target variable "has_bought" of the training data. This results in the output below (from here on referred to as "train.leafs").
Step 3: Making predictions on the test set
For the test set, I want to predict two things:
The prediction of the model itself P(has_bought="buy")
The leaf node assignment according to the model.
test.leafs <- h2o.predict_leaf_node_assignment(base.gbm.model, dt.test)
test.pred <- h2o.predict(base.gbm.model, dt.test)
After finding this, I've used cbind to combine these two predictions with the target variable of the test-set.
test.total <- h2o.cbind(dt.test[, c("has_bought")], test.pred, test.leafs)
The result of this, is the table below, from here on referred to as "test.total"
Unfortunately, I do not have enough reputation points to post more than 2 links. But if you click on "table 'test.total' combined with manual probability calculation" in step 5, it's basically the same table without the column "manual_prob_buy".
Step 4: Manually predicting probabilities
Theoretically, I should be able to predict the probabilities now myself. I did this by writing a loop, that loops over each row in "test.total". For each row, I take the leaf node assignment.
I then use that leaf-node assignment to filter the table "train.leafs", and check how many observations have a positive class (has_bought == 1) (posN) and how many observations are there in total (totalN) within the leaf associated with the test-row.
I perform the (standard) calculation posN / totalN, and store this in the test-row as a new column called "manual_prob_buy", which should be the probability of P(has_bought="buy") for that leaf. Thus, each test-row that falls in this leaf should get this probability.
This for-loop is shown below.
for(i in 1:nrow(dt.test)){
leaf <- test.total[i, leaf_node]
totalN <- nrow(train.leafs[train.leafs$leaf_node == leaf])
posN <- nrow(train.leafs[train.leafs$leaf_node == leaf & train.leafs$has_bought == "buy",])
test.total[i, manual_prob_buy := posN / totalN]
}
Step 5: Comparing the probabilities
This is where I get confused. Below is the updated "test.total" table, in which "buy" represents the probability P(has_bought="buy") according to the model and "manual_prob_buy" represents the manually calculated probability from step 4. As far as I know, these probabilities should be identical, given that I only used 1 tree and set the sample.rate parameters to 1.
Table "test.total" combined with manual probability calculation
The Question
I just don't understand why these two probabilities are not the same. As far as I know, I've set the parameters in such a way that it should just be like a 'normal' classification tree.
So the question: does anyone know why I find differences in these probabilities?
I hope someone could point me to where I might have made wrong assumptions. I just really hope I did something stupid, as this is driving me crazy.
Thanks!
Rather than compare the results from R's h2o.predict() with your own handwritten code, I recommend you compare with an H2O MOJO, which should match.
See an example here:
http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/overview-summary.html#quickstartmojo
You can run that simple example yourself, and then modify it according to your own model and new row of data to predict on.
Once you can do that, you can look at the code and debug/single-step it in a java environment to see exactly how the prediction gets calculated.
You can find the MOJO prediction code on github here:
https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/main/java/hex/genmodel/easy/EasyPredictModelWrapper.java
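If it helps, a small sketch of exporting the MOJO from R so it can be fed into that quick-start example (the output path is a placeholder):
mojo_path <- h2o.download_mojo(base.gbm.model, path = "/tmp",
                               get_genmodel_jar = TRUE)
mojo_path # the zip file to load with the genmodel EasyPredictModelWrapper in Java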
The main cause of the large difference between your observed probabilities and the predictions of h2o is your learning rate. As you have learn_rate = 0.001 the gbm is adjusting the probabilities by a relatively small amount from the overall rate. If you adjust this to learn_rate = 1 you will have something much closer to a decision tree, and h2o's predicted probabilities will come much closer to the rates in each leaf node.
There is a secondary difference which will then become apparent, as your probabilities will still not match exactly. This is because the leaf values are fitted by gradient descent (the G in GBM) on the logistic loss function, rather than taken directly from the proportion of observations in each leaf node.
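A minimal sketch of that check, reusing the objects from the question and changing only the learning rate (treat it as an assumption about your setup rather than a verified fix):
check.model <- h2o.gbm(
  x = predictors,
  y = "has_bought",
  training_frame = dt.train,
  ntrees = 1,
  learn_rate = 1,  # no shrinkage, so the single tree behaves much more like a plain decision tree
  max_depth = 15,
  sample_rate = 1,
  col_sample_rate = 1,
  col_sample_rate_per_tree = 1,
  seed = 123456
)
check.pred <- h2o.predict(check.model, dt.test)
# check.pred$buy should now track the manual leaf rates far more closely,
# with small remaining differences from the gradient-based leaf values.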

h2o.GBM taking too long on a small sized dataset

I've got a rather small dataset (162,000 observations with 13 attributes)
that I'm trying to use for modelling with h2o.gbm. The response variable is categorical with a large number of levels (~20,000).
The model doesn't run out of memory or give any errors, but it had been running for nearly 24 hours without any progress (the H2O GBM progress report said 0%),
so I finally gave in and stopped it.
I'm wondering if there's anything wrong with my hyperparameters, as the data is not particularly large.
here's my code:
library(h2o)
localH2O <- h2o.init(nthreads = -1, max_mem_size = "12g")
train.h20 <- as.h2o(analdata_train)
gbm1 <- h2o.gbm(
y = response_var
, x = independ_vars
, training_frame = train.h20
, ntrees = 3
, max_depth = 5
, min_rows = 10
, stopping_tolerance = 0.001
, learn_rate = 0.1
, distribution = "multinomial"
)
The way H2O GBM multinomial classification works is, when you ask for 1 tree as a parameter, it actually builds a tree for each level in the response column underneath the hood.
So 1 tree really means 20,000 trees in your case.
2 trees would really mean 40,000, and so on...
(Note the binomial classification case takes a shortcut and builds only one tree for both classes.)
So... it will probably finish but it could take quite a long time!
It's probably not a good idea to train a classifier with 20,000 classes -- most GBM implementations won't even let you do that. Can you group/cluster the classes into a smaller number of groups so that you can train a model with a smaller number of classes? If so, then you could perform your training in a two-stage process -- the first model would have K classes (assuming you clustered your classes into K groups). Then you can train secondary models that further classify the observations into your original classes.
This type of two-stage process may make sense if your classes represent groups that naturally cluster into a hierarchy, such as zip codes or ICD-10 medical diagnostic codes, for example.
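A rough sketch of the two-stage idea, under the assumption that you can add a column class_group mapping the ~20,000 original classes into K coarse groups (the names here are placeholders, not working code for your data):
# Stage 1: predict the coarse group (K classes instead of ~20,000).
stage1 <- h2o.gbm(x = independ_vars, y = "class_group",
                  training_frame = train.h20,
                  ntrees = 50, distribution = "multinomial")
# Stage 2: for each group g, train a model only on that group's rows
# to separate the original classes within it, e.g.:
# sub_g    <- train.h20[train.h20$class_group == g, ]
# stage2_g <- h2o.gbm(x = independ_vars, y = response_var,
#                     training_frame = sub_g, distribution = "multinomial")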
If your use-case really demands that you train a 20,000 class GBM (and there's no way around it), then you should get a bigger cluster of machines to use in your H2O cluster (it's unclear how many CPUs you are using currently). H2O GBM should be able to finish training, assuming it has enough memory and CPUs, but it may take a while.
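To see how many CPUs and how much memory the current H2O cluster actually has (a quick check, not part of the original answer):
h2o.clusterInfo() # prints the number of allowed/used cores and the total cluster memory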

R - Random Forest - importance / varImpPlot

I have an issue with random forest's importance / varImpPlot functions that I hope someone could help me with.
I tried two versions of the code but I am confused by the (different) results:
1.)
rffit = randomForest(price~.,data=train,mtry=x,ntree=500)
rfvalpred = predict(rffit,newdata=test)
varImpPlot(rffit)
importance(rffit)
This shows the plot and the "importance" data, however only "IncNodePurity". Also, the values in the plot and in the printed data differ; I tried the scale argument but it did not work.
2.)
rf.analyzed_data = randomForest(price~.,data=train,mtry=x,ntree=500,importance=TRUE)
yhat.rf = predict(rf.analyzed_data,newdata=test)
varImpPlot(rf.analyzed_data)
importance(rf.analyzed_data)
In that case it no longer produces a plot, and the importance data shows both "%IncMSE" and "IncNodePurity", but the "IncNodePurity" values are different from those in the first version?
Questions:
1.) Any idea why data is different for “IncNodePurity”?
2.) Any idea why no “%IncMSE” is shown in the first version?
3.) Why no plot is shown in the second version?
Many thanks!!
Ed
1) IncNodePurity is derived from the loss function, and you get that measure for free just by training the model. On the downside it is a more unstable estimate, as results may vary from one model run to the next. It is also more biased, as it favors variables with many levels. I guess the differences you found are due to this randomness.
2) The permutation variable importance, %IncMSE, takes a little extra time to compute and is therefore optional. Roughly speaking, the values of each variable in the data set need to be shuffled, and every OOB sample needs to be predicted once per tree for every variable. The way the randomForest package is designed, you have to compute this importance during training: importance must be set to TRUE, otherwise varImpPlot cannot plot it because it has not been computed.
3) Not sure. In this code example I see both plots at least.
library(randomForest)
#data
X = data.frame(replicate(6,rnorm(1000)))
y = with(X, X1^2 + sin(X2*pi) + X3*X4)
train = data.frame(y=y,X=X)
#training
rf1=randomForest(y~.,data=train,importance=F)
rf2=randomForest(y~.,data=train, importance=T)
#plotting importance
varImpPlot(rf1) #plot only with IncNodePurity
varImpPlot(rf2) #bi-plot also with %IncMSE
