Use h2o.grid to fine-tune gbm model: weight column issue - r

I am using the h2o.grid hyperparameter search function to fine-tune a GBM model. h2o GBM allows adding a weight column to specify the weight of each observation. However, when I tried to add that in h2o.grid, it always errors out with an illegal argument / missing value message, even though the weight column is populated.
Has anyone had a similar experience? Thanks
Hyper-parameter: max_depth, 20
[2017-04-12 13:10:05] failure_details: Illegal argument(s) for GBM model: depth_grid_model_11. Details: ERRR on field: _weights_columns: Weights cannot have missing values.
ERRR on field: _weights_columns: Weights cannot have missing values.
============================
hyper_params = list( max_depth = c(4,6,8,12,16,20) ) ##faster for larger datasets
grid <- h2o.grid(
## hyper parameters
hyper_params = hyper_params,
## full Cartesian hyper-parameter search
search_criteria = list(strategy = "Cartesian"), ## default is Cartesian
## which algorithm to run
algorithm="gbm",
## identifier for the grid, to later retrieve it
grid_id="depth_grid",
## standard model parameters
x = X, #predictors,
y = Y, #response,
training_frame = datadev, #train,
validation_frame = dataval, #valid,
weights_column = "Adj_Bias_correction",
## more trees is better if the learning rate is small enough
## here, use "more than enough" trees - we have early stopping
ntrees = 10000,
## smaller learning rate is better
## since we have learning_rate_annealing, we can afford to start with a bigger learning rate
learn_rate = 0.05,
## learning rate annealing: learning_rate shrinks by 1% after every tree
## (use 1.00 to disable, but then lower the learning_rate)
learn_rate_annealing = 0.99,
## sample 80% of rows per tree
sample_rate = 0.8,
## sample 80% of columns per split
col_sample_rate = 0.8,
## fix a random number generator seed for reproducibility
seed = 1234,
## early stopping once the validation AUC doesn't improve by at least 0.01% for 5 consecutive scoring events
stopping_rounds = 5, stopping_tolerance = 1e-4, stopping_metric = "AUC",
## score every 10 trees to make early stopping reproducible (it depends on the scoring interval)
score_tree_interval = 10
)
## by default, display the grid search results sorted by increasing logloss (since this is a classification task)
grid
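The error message indicates that H2O is finding missing values in the weights column of at least one frame it checks (the training or validation frame), even if the source data looks complete (for example if the split or a prior transformation introduced NAs). A minimal debugging sketch, assuming the datadev/dataval frames and the "Adj_Bias_correction" column from the code above:
h2o.nacnt(datadev[["Adj_Bias_correction"]])  # NA count in the training frame's weight column
h2o.nacnt(dataval[["Adj_Bias_correction"]])  # NA count in the validation frame's weight column
## if either count is non-zero, drop (or impute) those rows before running the grid
datadev <- datadev[!is.na(datadev[["Adj_Bias_correction"]]), ]
dataval <- dataval[!is.na(dataval[["Adj_Bias_correction"]]), ]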

Related

How do I tune a posterior probability threshold value for a binary classifier using more than one performance measure with the mlr package in R?

The following link provided me with a greater understanding of incorporating ordinary cost in my binary classification model: https://mlr.mlr-org.com/articles/tutorial/cost_sensitive_classif.html
With a standard classifier, the default threshold is usually 0.5, and the aim is to minimize the total number of misclassification errors as much as possible (i.e., obtain the maximum accuracy). However, all misclassification errors are treated equally. This is typically not the case in a real-world setting, since the cost of a false negative may be much greater than that of a false positive.
Using empirical thresholding, I was able to obtain the optimal threshold value for classifying the instance into good or bad while minimizing the average cost. On the other hand, this comes at the price of reducing the accuracy and other performance measures. This is illustrated in the following figure:
In the figure above, the red line denotes the standard threshold of 0.5 which maximizes accuracy but gives a sub-optimal average credit cost. The blue line denotes the desired threshold that minimizes the cost, but now the accuracy is drastically reduced.
Generally, I would not be concerned about the reduced accuracy. Suppose, however, there is also an incentive not only to minimize the cost but also to maximize the precision. Note that the precision is the positive predictive value, ppv = TP/(TP+FP). Then the green line might be a good trade-off that gives a relatively low cost and a relatively high ppv. Here, I plotted the green line as the average of the red and blue lines (both the credit cost and ppv functions seem to have about the same gradient between these regions, so calculating the optimal threshold this way probably provides a good estimate), but is there a way to calculate this threshold exactly?
My thoughts are to create a new performance measure as a function of both the costs and the ppv, and then minimize this performance measure.
Example: measure = credit.costs*(-ppv)
But I'm not sure how to code this in R. Any advice on what should be done would be greatly appreciated.
My R code is as follows:
library(mlr)
library(ggplot2) # needed for the geom_vline annotations below
## Load dataset
data(GermanCredit, package = "caret")
credit.task = makeClassifTask(data = GermanCredit, target = "Class")
## Removing 2 columns: Purpose.Vacation,Personal.Female.Single
credit.task = removeConstantFeatures(credit.task)
## Generate cost matrix
costs = matrix(c(0, 1, 5, 0), 2)
colnames(costs) = rownames(costs) = getTaskClassLevels(credit.task)
## Make cost measure
credit.costs = makeCostMeasure(id = "credit.costs", name = "Credit costs", costs = costs, best = 0, worst = 5)
## Set training scheme with repeated 10-fold cross-validation
set.seed(100)
rin = makeResampleInstance("RepCV", folds = 10, reps = 3, task = credit.task)
## Fit a logistic regression model (nnet::multinom())
lrn = makeLearner("classif.multinom", predict.type = "prob", trace = FALSE)
r = resample(lrn, credit.task, resampling = rin, measures = list(credit.costs, mmce), show.info = FALSE)
r
# Tune the threshold using average costs based on the predicted probabilities on the 3 test data sets
cost_tune.res = tuneThreshold(pred = r$pred, measure = credit.costs)
# Tune the threshold using precision based on the predicted probabilities on the 3 test data sets
ppv_tune.res = tuneThreshold(pred = r$pred, measure = ppv)
d = generateThreshVsPerfData(r, measures = list(credit.costs, ppv, acc))
plt = plotThreshVsPerf(d)
plt + geom_vline(xintercept=cost_tune.res$th, colour = "blue") + geom_vline(xintercept=0.5, colour = "red") +
geom_vline(xintercept=1/2*(cost_tune.res$th + 0.5), colour = "green")
calculateConfusionMatrix(r$pred)
performance(r$pred, measures = list(acc, ppv, credit.costs))
Finally, I'm also a bit confused about my ppv value. Looking at my confusion matrix, I calculate the ppv as 442/(442+289) = 0.6046512, but the reported value is slightly different (0.6053531). Is there something wrong with my calculation?
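Following up on the custom-measure idea above, here is a hedged sketch of how it might be coded with mlr::makeMeasure, reusing the credit.costs and r objects from the code above. Combining the two criteria as a simple difference (credit cost minus ppv, lower is better) and the best/worst bounds are assumptions to adjust:
## custom measure: credit cost minus precision (smaller is better)
cost.minus.ppv = makeMeasure(
  id = "cost.minus.ppv", name = "Credit cost minus precision",
  properties = c("classif", "req.pred", "req.truth", "req.prob"),
  minimize = TRUE, best = -1, worst = 5,  # rough bounds: costs in [0, 5], ppv in [0, 1]
  fun = function(task, model, pred, feats, extra.args) {
    unname(performance(pred, measures = credit.costs) - performance(pred, measures = ppv))
  }
)
## tune the threshold on the combined measure, as done for credit.costs above
combined_tune.res = tuneThreshold(pred = r$pred, measure = cost.minus.ppv)
combined_tune.res$th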

stop xgboost based on eval_metric

I am trying to run xgboost for a problem with very noisy features and am interested in stopping the number of rounds based on a custom eval_metric that I have defined.
Based on domain knowledge I know that when the eval_metric (evaluated on the training data) goes above a certain value, xgboost is overfitting, and I would like to just take the fitted model at that specific number of rounds and not proceed further.
What would be the best way to achieve this?
It would be somewhat in line with the early stopping criteria, but not exactly.
Alternatively, is there a way to get the model from an intermediate round?
Here is an example to better explain my question (using the toy example that comes with the xgboost help docs and the default eval_metric):
library(xgboost)
data(agaricus.train, package='xgboost')
train <- agaricus.train
bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, nthread = 2, nround = 5, objective = "binary:logistic")
Here is the output
[0] train-error:0.046522
[1] train-error:0.022263
[2] train-error:0.007063
[3] train-error:0.015200
[4] train-error:0.007063
Now let's say that from domain knowledge I know that once the train error goes below 0.015 (the third round in this case), any further rounds only lead to overfitting. How would I stop the training process after the third round and get hold of the trained model to use it for prediction on a different dataset?
I need to run the training process over many different datasets and have no sense of how many rounds it might take to get the error below a fixed number, so I can't set the nrounds argument to a predetermined value. The only intuition I have is that once the training error goes below a number, I should stop further training rounds.
In the absence of any code you have tried or any data you are using, try something like this:
require(xgboost)
library(Metrics) # for rmse to calculate errors
# Assume you have a training set db.train and have some
# feature indices of interest and a test set db.test
predz <- c(2, 4, 6, 8, 10, 12)
predictors <- names(db.train[, predz])
# you have some response you are interested in
outcomeName <- "myLabel"
# you may like to include for testing some other parameters like:
# eta, gamma, colsample_bytree, min_child_weight
# here we look at depths from 1 to 4 and rounds 1 to 100 but set your own values
smallestError <- 100 # set to some sensible value depending on your eval metric
for (depth in seq(1, 4, 1)) {
  for (rounds in seq(1, 100, 1)) {
    # train
    bst <- xgboost(data = as.matrix(db.train[, predictors]),
                   label = db.train[, outcomeName],
                   max.depth = depth,
                   nround = rounds,
                   eval_metric = "logloss",
                   objective = "binary:logistic",
                   verbose = TRUE)
    gc()
    # predict
    predictions <- as.numeric(predict(bst, as.matrix(db.test[, predictors]),
                                      outputmargin = TRUE))
    err <- rmse(as.numeric(db.test[, outcomeName]), as.numeric(predictions))
    if (err < smallestError) {
      smallestError <- err
      print(paste(depth, rounds, err))
    }
  }
}
You could adapt this code for your particular evaluation metric and print the output to suit your situation. Similarly, you could introduce a break in the loop once a specified number of rounds is reached or some condition you seek to achieve is satisfied.
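As a further hedged alternative (not part of the answer above; argument names follow the classic R xgboost API): train once for plenty of rounds while logging the per-round training error, find the first round whose error crosses the domain threshold (0.015 in the question), and then predict using only that many trees via the ntreelimit argument of predict():
library(xgboost)
data(agaricus.train, package = "xgboost")
data(agaricus.test, package = "xgboost")
dtrain <- xgb.DMatrix(agaricus.train$data, label = agaricus.train$label)
## train once with a generous number of rounds, logging train-error per round
bst <- xgb.train(params = list(objective = "binary:logistic", eval_metric = "error",
                               max_depth = 2, eta = 1, nthread = 2),
                 data = dtrain, nrounds = 50,
                 watchlist = list(train = dtrain), verbose = 0)
## first round whose training error drops below the domain threshold
stop_round <- which(bst$evaluation_log$train_error < 0.015)[1]
## predict with only the trees built up to that round
## (newer xgboost versions replace ntreelimit with iterationrange)
preds <- predict(bst, agaricus.test$data, ntreelimit = stop_round)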

weight calculation of individual tree in XGBoost when using "binary:logistic"

Taking a cue from How to access weighting of individual decision trees in xgboost?.
How does one calculate the weights when objective = "binary:logistic" and eta = 0.1?
My tree dump is:
booster[0]
0:[WEIGHT<3267.5] yes=1,no=2,missing=1,gain=133.327,cover=58.75
1:[CYLINDERS<5.5] yes=3,no=4,missing=3,gain=9.61229,cover=33.25
3:leaf=0.872727,cover=26.5
4:leaf=0.0967742,cover=6.75
2:[WEIGHT<3431] yes=5,no=6,missing=5,gain=4.82912,cover=25.5
5:leaf=-0.0526316,cover=3.75
6:leaf=-0.846154,cover=21.75
booster[1]
0:[DISPLACEMENT<231.5] yes=1,no=2,missing=1,gain=60.9437,cover=52.0159
1:[WEIGHT<2974.5] yes=3,no=4,missing=3,gain=6.59775,cover=31.3195
3:leaf=0.582471,cover=25.5236
4:leaf=-0,cover=5.79593
2:[MODELYEAR<78.5] yes=5,no=6,missing=5,gain=1.96045,cover=20.6964
5:leaf=-0.643141,cover=19.3965
6:leaf=-0,cover=1.2999
Actually this turned out to be practical, something I had overlooked earlier.
Using the above tree structure one can find the probability for each training example.
The parameter list was:
param <- list("objective" = "binary:logistic",
              "eval_metric" = "logloss",
              "eta" = 0.5,
              "max_depth" = 2,
              "colsample_bytree" = 0.8,
              "subsample" = 0.8,
              "alpha" = 1)
For the instance set in leaf booster[0], leaf: 0-3;
the probability will be exp(0.872727)/(1+exp(0.872727)).
And for booster[0], leaf: 0-3 + booster[1], leaf: 0-3;
the probability will be exp(0.872727+ 0.582471)/(1+exp(0.872727+ 0.582471)).
And so on as one goes on increasing number of iterations.
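A small sketch of the arithmetic just described, using the two leaf values quoted from the tree dump (booster[0] leaf 3 and booster[1] leaf 3):
leaf0 <- 0.872727  # booster[0], leaf 0-3
leaf1 <- 0.582471  # booster[1], leaf 0-3
sigmoid <- function(x) 1 / (1 + exp(-x))  # equivalent to exp(x) / (1 + exp(x))
sigmoid(leaf0)          # probability after the first tree, about 0.705
sigmoid(leaf0 + leaf1)  # probability after the first two trees, about 0.811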
I matched these values against R's predicted probabilities; they differ by about 10^(-7), probably due to floating-point truncation of the leaf quality scores.
This might not be the answer to finding the weights, but it can give a production-level solution when R's trained boosted trees are used in a different environment for prediction.
Any comment on this will be highly appreciated.

R tuneRF unstable, how to optimize?

The Short
I'm trying to use tuneRF to find the optimal mtry value for my randomForest function, but I'm finding that the answer is extremely unstable and changes from run to run / with different seeds. I would run a loop to see how it changes over a large number of runs, but am unable to extract which mtry has the lowest OOB error.
The Long
I have a data.frame that has eight features, but two of the features are inclusive, meaning all the information in one is a subset of the other. As an example, one feature could be a factor A ~ c("animal", "fish") and another feature a factor B ~ c("dog", "cat", "salmon", "trout"). Hence all dogs and cats are animals and all salmon and trout are fish. These two variables are by far more significant than any of the other six. Hence if I run 3 forests, one that uses A, one that uses B and one that uses A & B, the last one seems to do the best. I suspect this is because A and/or B are so significant that by including both I have double the chance of them being selected randomly as the initial feature. I further suspect that I shouldn't allow this to happen and that I should throw out A as a factor, but I cannot find any literature that actually says that.
Anyway getting back on track. I have two datasets tRFx and tRFx2 the first of which contains 7 features including B but not A and the second which contains 8 features with both A and B. I'm trying to see what the optimal mtry is for these two separate models, and then how they perform relative to each other. The problem is tuneRF seems, at least in this case, to be very unstable.
For the first dataset, (includes Feature B but not A)
> set.seed(1)
> tuneRF(x = tRFx, y = tRFy, nTreeTry = 250, stepFactor = 1.5, improve = 0.01)
mtry = 2 OOB error = 17.73%
Searching left ...
Searching right ...
mtry = 3 OOB error = 17.28%
0.02531646 0.01
mtry = 4 OOB error = 18.41%
-0.06493506 0.01
mtry OOBError
2.OOB 2 0.1773288
3.OOB 3 0.1728395
4.OOB 4 0.1840629
> set.seed(3)
> tuneRF(x = tRFx, y = tRFy, nTreeTry = 250, stepFactor = 1.5, improve = 0.01)
mtry = 2 OOB error = 18.07%
Searching left ...
Searching right ...
mtry = 3 OOB error = 18.18%
-0.00621118 0.01
mtry OOBError
2.OOB 2 0.1806958
3.OOB 3 0.1818182
i.e. for seed 1, mtry = 3, but for seed 3, mtry = 2
And for the second dataset (includes both Features A & B)
> set.seed(1)
> tuneRF(x = tRFx2, y = tRFy, nTreeTry = 250, stepFactor = 1.5, improve = 0.01)
mtry = 3 OOB error = 17.51%
Searching left ...
mtry = 2 OOB error = 16.61%
0.05128205 0.01
Searching right ...
mtry = 4 OOB error = 16.72%
-0.006756757 0.01
mtry OOBError
2.OOB 2 0.1661055
3.OOB 3 0.1750842
4.OOB 4 0.1672278
> set.seed(3)
> tuneRF(x = tRFx2, y = tRFy, nTreeTry = 250, stepFactor = 1.5, improve = 0.01)
mtry = 3 OOB error = 17.4%
Searching left ...
mtry = 2 OOB error = 18.74%
-0.07741935 0.01
Searching right ...
mtry = 4 OOB error = 17.51%
-0.006451613 0.01
mtry OOBError
2.OOB 2 0.1874299
3.OOB 3 0.1739618
4.OOB 4 0.1750842
i.e. for seed 1, mtry = 2, but for seed 3, mtry = 3
I was going to run a loop to see which mtry is optimal over a large number of simulations but don't know how to capture the optimal mtry from each iteration.
I know that I can use
> set.seed(3)
> min(tuneRF(x = tRFx2, y = tRFy, nTreeTry = 250, stepFactor = 1.5, improve = 0.01))
mtry = 3 OOB error = 17.4%
Searching left ...
mtry = 2 OOB error = 18.74%
-0.07741935 0.01
Searching right ...
mtry = 4 OOB error = 17.51%
-0.006451613 0.01
[1] 0.1739618
but I don't want to capture the OOB error (0.1739618); I want the optimal mtry (3).
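A hedged sketch of one way to capture the winning mtry rather than the error: when doBest = FALSE (the default), tuneRF returns a two-column matrix of mtry and OOBError, so the row with the smallest OOBError gives the mtry (tRFx2/tRFy are the objects from above):
set.seed(3)
res <- tuneRF(x = tRFx2, y = tRFy, nTreeTry = 250, stepFactor = 1.5, improve = 0.01)
best.mtry <- res[which.min(res[, "OOBError"]), "mtry"]
best.mtry  # the mtry with the lowest OOB error in this run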
Any help (or even general comments on anything related to tuneRF) is greatly appreciated. For anybody else who happens to stumble upon this looking for tuneRF help, I also found this post helpful:
R: unclear behaviour of tuneRF function (randomForest package)
For what it's worth it seems that the optimal mtry for the smaller feature set (with non-inclusive features) is 3 and for the larger feature set is only 2, which initially is counter intuitive but when you consider the inclusive nature of A and B it does/may make sense.
There's not a big difference in performance in this case (and in others) based on which mtry you choose. It only matters if you want to win Kaggle contests where the winner takes all, and even then you would probably also be blending many other learning algorithms into one huge ensemble. In practice you get almost the same predictions.
You don't need stepwise optimization when you test so few combinations of parameters. Just try them all and repeat many times to figure out which mtry is slightly better.
All the times I have used tuneRF, I have been disappointed. Every time I ended up writing my own stepwise optimization or simply trying all combinations many times.
The mtry vs. OOB error relationship does not have to be a smooth curve with a single minimum, though a general trend should be observable. It can be difficult to tell whether a minimum is due to noise or a general tendency.
I wrote an example of how to do a solid mtry screening. The conclusion from this screening would be that there's not much difference: mtry = 2 seems best, and it would be slightly faster to compute. The randomForest default for classification, mtry = floor(sqrt(ncol(X))), gives 2 here anyway.
library(mlbench)
library(randomForest)
data(PimaIndiansDiabetes)
y = PimaIndiansDiabetes$diabetes
X = PimaIndiansDiabetes
X = X[,!names(X)%in%"diabetes"]
nvar = ncol(X)
nrep = 25
rf.list = lapply(1:nvar, function(i.mtry) {
  oob.errs = replicate(nrep, {
    oob.err = tail(randomForest(X, y, mtry = i.mtry, ntree = 2000)$err.rate[, 1], 1)
  })
})
plot(replicate(nrep,1:nvar),do.call(rbind,rf.list),col="#12345678",
xlab="mtry",ylab="oob.err",main="tuning mtry by oob.err")
rep.mean = sapply(rf.list,mean)
rep.sd = sapply(rf.list,sd)
points(1:nvar,rep.mean,type="l",col=3)
points(1:nvar,rep.mean+rep.sd,type="l",col=2)
points(1:nvar,rep.mean-rep.sd,type="l",col=2)

Topic models: cross validation with loglikelihood or perplexity

I'm clustering documents using topic modeling. I need to come up with the optimal number of topics, so I decided to do ten-fold cross-validation with 10, 20, ..., 60 topics.
I have divided my corpus into ten batches and set aside one batch as a holdout set. I have run latent Dirichlet allocation (LDA) using nine batches (180 documents in total) with 10 to 60 topics. Now, I have to calculate the perplexity or log likelihood for the holdout set.
I found this code in one of CV's discussion threads. I really don't understand several lines of the code below. I have a document-term matrix (dtm) for the holdout set (20 documents), but I don't know how to calculate the perplexity or log likelihood of this holdout set.
Questions:
Can anybody explain to me what seq(2, 100, by = 1) means here? Also, what does AssociatedPress[21:30] mean? What is function(k) doing here?
best.model <- lapply(seq(2, 100, by=1), function(k){ LDA(AssociatedPress[21:30,], k) })
If I want to calculate the perplexity or log likelihood of the holdout set called dtm, is there better code? I know there are perplexity() and logLik() functions, but since I'm new I cannot figure out how to apply them to my holdout matrix, called dtm.
How can I do ten-fold cross-validation with my corpus of 200 documents? Is there existing code that I can invoke? I found caret for this purpose, but again cannot figure that out either.
The accepted answer to this question is good as far as it goes, but it doesn't actually address how to estimate perplexity on a validation dataset and how to use cross-validation.
Using perplexity for simple validation
Perplexity is a measure of how well a probability model fits a new set of data. In the topicmodels R package it is simple to fit with the perplexity function, which takes as arguments a previously fit topic model and a new set of data, and returns a single number. The lower the better.
For example, splitting the AssociatedPress data into a training set (75% of the rows) and a validation set (25% of the rows):
# load up some R packages including a few we'll need later
library(topicmodels)
library(doParallel)
library(ggplot2)
library(scales)
data("AssociatedPress", package = "topicmodels")
burnin = 1000
iter = 1000
keep = 50
full_data <- AssociatedPress
n <- nrow(full_data)
#-----------validation--------
k <- 5
splitter <- sample(1:n, round(n * 0.75))
train_set <- full_data[splitter, ]
valid_set <- full_data[-splitter, ]
fitted <- LDA(train_set, k = k, method = "Gibbs",
control = list(burnin = burnin, iter = iter, keep = keep) )
perplexity(fitted, newdata = train_set) # about 2700
perplexity(fitted, newdata = valid_set) # about 4300
The perplexity is higher for the validation set than the training set, because the topics have been optimised based on the training set.
Using perplexity and cross-validation to determine a good number of topics
The extension of this idea to cross-validation is straightforward. Divide the data into different subsets (say 5), and each subset gets one turn as the validation set and four turns as part of the training set. However, it's really computationally intensive, particularly when trying out the larger numbers of topics.
You might be able to use caret to do this, but I suspect it doesn't handle topic modelling yet. In any case, it's the sort of thing I prefer to do myself to be sure I understand what's going on.
The code below, even with parallel processing on 7 logical CPUs, took 3.5 hours to run on my laptop:
#----------------5-fold cross-validation, different numbers of topics----------------
# set up a cluster for parallel processing
cluster <- makeCluster(detectCores(logical = TRUE) - 1) # leave one CPU spare...
registerDoParallel(cluster)
# load up the needed R package on all the parallel sessions
clusterEvalQ(cluster, {
library(topicmodels)
})
folds <- 5
splitfolds <- sample(1:folds, n, replace = TRUE)
candidate_k <- c(2, 3, 4, 5, 10, 20, 30, 40, 50, 75, 100, 200, 300) # candidates for how many topics
# export all the needed R objects to the parallel sessions
clusterExport(cluster, c("full_data", "burnin", "iter", "keep", "splitfolds", "folds", "candidate_k"))
# we parallelize by the different number of topics. A processor is allocated a value
# of k, and does the cross-validation serially. This is because it is assumed there
# are more candidate values of k than there are cross-validation folds, hence it
# will be more efficient to parallelise
system.time({
  results <- foreach(j = 1:length(candidate_k), .combine = rbind) %dopar% {
    k <- candidate_k[j]
    results_1k <- matrix(0, nrow = folds, ncol = 2)
    colnames(results_1k) <- c("k", "perplexity")
    for (i in 1:folds) {
      train_set <- full_data[splitfolds != i, ]
      valid_set <- full_data[splitfolds == i, ]
      fitted <- LDA(train_set, k = k, method = "Gibbs",
                    control = list(burnin = burnin, iter = iter, keep = keep))
      results_1k[i, ] <- c(k, perplexity(fitted, newdata = valid_set))
    }
    return(results_1k)
  }
})
stopCluster(cluster)
results_df <- as.data.frame(results)
ggplot(results_df, aes(x = k, y = perplexity)) +
geom_point() +
geom_smooth(se = FALSE) +
ggtitle("5-fold cross-validation of topic modelling with the 'Associated Press' dataset",
"(ie five different models fit for each candidate number of topics)") +
labs(x = "Candidate number of topics", y = "Perplexity when fitting the trained model to the hold-out set")
We see in the results that 200 topics is too many and has some over-fitting, and 50 is too few. Of the numbers of topics tried, 100 is the best, with the lowest average perplexity on the five different hold-out sets.
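To read the best candidate off programmatically rather than from the plot, a small sketch using the results_df produced above: average the perplexity over the five folds and take the k with the lowest mean.
mean_perplexity <- aggregate(perplexity ~ k, data = results_df, FUN = mean)
mean_perplexity[which.min(mean_perplexity$perplexity), "k"]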
I wrote the answer on CV that you refer to; here's a bit more detail:
seq(2, 100, by =1) simply creates a number sequence from 2 to 100 by ones, so 2, 3, 4, 5, ... 100. Those are the numbers of topics that I want to use in the models. One model with 2 topics, another with 3 topics, another with 4 topics and so on to 100 topics.
AssociatedPress[21:30] is simply a subset of the built-in data in the topicmodels package. I just used a subset in that example so that it would run faster.
Regarding the general question of the optimal number of topics, I now follow the example of Martin Ponweiser on Model Selection by Harmonic Mean (section 4.3.3 of his thesis, which is here: http://epub.wu.ac.at/3558/1/main.pdf). Here's how I do it at the moment:
library(topicmodels)
#
# get some of the example data that's bundled with the package
#
data("AssociatedPress", package = "topicmodels")
harmonicMean <- function(logLikelihoods, precision=2000L) {
library("Rmpfr")
llMed <- median(logLikelihoods)
as.double(llMed - log(mean(exp(-mpfr(logLikelihoods,
prec = precision) + llMed))))
}
# The log-likelihood values are then determined by first fitting the model using for example
k = 20
burnin = 1000
iter = 1000
keep = 50
fitted <- LDA(AssociatedPress[21:30,], k = k, method = "Gibbs",control = list(burnin = burnin, iter = iter, keep = keep) )
# where keep indicates that every keep iteration the log-likelihood is evaluated and stored. This returns all log-likelihood values including burnin, i.e., these need to be omitted before calculating the harmonic mean:
logLiks <- fitted@logLiks[-c(1:(burnin/keep))]
# assuming that burnin is a multiple of keep and
harmonicMean(logLiks)
So to do this over a sequence of topic models with different numbers of topics...
# generate numerous topic models with different numbers of topics
sequ <- seq(2, 50, 1) # in this case a sequence of numbers from 2 to 50, by ones.
fitted_many <- lapply(sequ, function(k) LDA(AssociatedPress[21:30,], k = k, method = "Gibbs",control = list(burnin = burnin, iter = iter, keep = keep) ))
# extract logliks from each topic
logLiks_many <- lapply(fitted_many, function(L) L@logLiks[-c(1:(burnin/keep))])
# compute harmonic means
hm_many <- sapply(logLiks_many, function(h) harmonicMean(h))
# inspect
plot(sequ, hm_many, type = "l")
# compute optimum number of topics
sequ[which.max(hm_many)]
## 6
Here's the output, with numbers of topics along the x-axis, indicating that 6 topics is optimum.
Cross-validation of topic models is pretty well documented in the docs that come with the package; see for example http://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf. Give that a try and then come back with a more specific question about coding CV with topic models.
