weight calculation of individual tree in XGBoost when using "binary:logistic" - r

Taking a cue from How to access weighting of individual decision trees in xgboost?
How does one calculate the weights when objective = "binary:logistic" and eta = 0.1?
My tree dump is:
booster[0]
0:[WEIGHT<3267.5] yes=1,no=2,missing=1,gain=133.327,cover=58.75
1:[CYLINDERS<5.5] yes=3,no=4,missing=3,gain=9.61229,cover=33.25
3:leaf=0.872727,cover=26.5
4:leaf=0.0967742,cover=6.75
2:[WEIGHT<3431] yes=5,no=6,missing=5,gain=4.82912,cover=25.5
5:leaf=-0.0526316,cover=3.75
6:leaf=-0.846154,cover=21.75
booster[1]
0:[DISPLACEMENT<231.5] yes=1,no=2,missing=1,gain=60.9437,cover=52.0159
1:[WEIGHT<2974.5] yes=3,no=4,missing=3,gain=6.59775,cover=31.3195
3:leaf=0.582471,cover=25.5236
4:leaf=-0,cover=5.79593
2:[MODELYEAR<78.5] yes=5,no=6,missing=5,gain=1.96045,cover=20.6964
5:leaf=-0.643141,cover=19.3965
6:leaf=-0,cover=1.2999

Actually, this turned out to be practical, something I had overlooked earlier.
Using the above tree structure one can find the probability for each training example.
The parameter list was:
param <- list("objective" = "binary:logistic",
"eval_metric" = "logloss",
"eta" = 0.5,
"max_depth" = 2,
"colsample_bytree" = .8,
"subsample" = 0.8,
"alpha" = 1)
For the instances that end up in booster[0], leaf: 0-3,
the probability after one round is exp(0.872727)/(1 + exp(0.872727)).
For instances that end up in booster[0], leaf: 0-3 and booster[1], leaf: 0-3,
the probability is exp(0.872727 + 0.582471)/(1 + exp(0.872727 + 0.582471)).
And so on as one increases the number of iterations.
I matched these values with R's predicted probabilities; they differ by about 10^(-7), probably because the leaf quality scores in the dump are truncated.
This might not be the answer to finding the weights, but it can give a production-level solution when R's trained boosted trees are used in a different environment for prediction.
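To make that check concrete, here is a minimal sketch. It assumes a fitted booster `bst` and the training matrix `X` used to produce the dump above (both names are placeholders, not from the original post):
library(xgboost)
# With binary:logistic and the default base_score = 0.5, the initial logit is
# log(0.5/0.5) = 0, so the raw margin is simply the sum of the leaf values an
# observation falls into.
manual_margin <- 0.872727 + 0.582471   # booster[0] leaf 3 + booster[1] leaf 3
manual_prob   <- exp(manual_margin) / (1 + exp(manual_margin))
# xgboost's own numbers, restricted to the first two trees
xgb_margin <- predict(bst, X, outputmargin = TRUE, ntreelimit = 2)
xgb_prob   <- predict(bst, X, ntreelimit = 2)
# rows that land in those two leaves should agree with manual_prob up to the
# rounding of the leaf values printed in the dump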
Any comment on this will be highly appreciated.

Related

What is the difference between budgeted and non-budgeted parameters for the Hyperband algorithm?

I struggle to understand how the Hyperband algorithm should be set up.
In the book (https://mlr3book.mlr-org.com/optimization.html#hyperband), I find e.g. the following sample code:
set.seed(123)
# extend "classif.rpart" with "subsampling" as preprocessing step
ll = po("subsample") %>>% lrn("classif.rpart")
# extend hyperparameters of "classif.rpart" with subsampling fraction as budget
search_space = ps(
  classif.rpart.cp = p_dbl(lower = 0.001, upper = 0.1),
  classif.rpart.minsplit = p_int(lower = 1, upper = 10),
  subsample.frac = p_dbl(lower = 0.1, upper = 1, tags = "budget")
)
The way I understand the algorithm from the respective paper is that it will:
discard some of the hyperparameter configurations (sampled from the search_space) early, which makes the search much more efficient than pure random search, and
apply this over the whole parameter space.
I may be misunderstanding this, since some of the hyperparameters are "on a budget" (here: subsample.frac) and some are not (here: classif.rpart.cp and classif.rpart.minsplit). So what happens with the hyperparameters that do not have the "budget" tag? Are they not considered? Or does standard random search apply to them (so they may consume as many resources as they need)?

How do I tune a posterior probability threshold value for a binary classifier using more than one performance measure with the mlr package in R?

The following link provided me with a greater understanding of incorporating ordinary cost in my binary classification model: https://mlr.mlr-org.com/articles/tutorial/cost_sensitive_classif.html
With a standard classifier, the default threshold is usually 0.5, and the aim is to minimize the total number of misclassification errors as much as possible (obtain the maximum accuracy). However, all misclassification errors are treated equally. This is not typically the case in a real-world setting, since the cost of a false negative may be much greater than that of a false positive.
Using empirical thresholding, I was able to obtain the optimal threshold value for classifying the instance into good or bad while minimizing the average cost. On the other hand, this comes at the price of reducing the accuracy and other performance measures. This is illustrated in the following figure:
In the figure above, the red line denotes the standard threshold of 0.5 which maximizes accuracy but gives a sub-optimal average credit cost. The blue line denotes the desired threshold that minimizes the cost, but now the accuracy is drastically reduced.
Generally, I would not be concerned about the reduced accuracy. Suppose, however, there is also an incentive not only to minimize the cost but also to maximize the precision. Note that the precision is the positive predictive value, ppv = TP/(TP+FP). Then the green line might be a good trade-off that gives a relatively low cost and a relatively high ppv. Here, I plotted the green line as the average of the red and blue lines (both the credit cost and ppv functions seem to have about the same gradient between these regions, so calculating the optimal threshold this way probably provides a good estimate), but is there a way to calculate this threshold exactly?
My thought is to create a new performance measure as a function of both the costs and the ppv, and then minimize this performance measure.
Example: measure = credit.costs*(-ppv)
But I'm not sure how to code this in R (a rough sketch is given at the end of this post). Any advice on what should be done would be greatly appreciated.
My R code is as follows:
library(mlr)
## Load dataset
data(GermanCredit, package = "caret")
credit.task = makeClassifTask(data = GermanCredit, target = "Class")
## Removing 2 columns: Purpose.Vacation,Personal.Female.Single
credit.task = removeConstantFeatures(credit.task)
## Generate cost matrix
costs = matrix(c(0, 1, 5, 0), 2)
colnames(costs) = rownames(costs) = getTaskClassLevels(credit.task)
## Make cost measure
credit.costs = makeCostMeasure(id = "credit.costs", name = "Credit costs", costs = costs, best = 0, worst = 5)
## Set training scheme with repeated 10-fold cross-validation
set.seed(100)
rin = makeResampleInstance("RepCV", folds = 10, reps = 3, task = credit.task)
## Fit a logistic regression model (nnet::multinom())
lrn = makeLearner("classif.multinom", predict.type = "prob", trace = FALSE)
r = resample(lrn, credit.task, resampling = rin, measures = list(credit.costs, mmce), show.info = FALSE)
r
# Tune the threshold using average costs based on the predicted probabilities on the 3 test data sets
cost_tune.res = tuneThreshold(pred = r$pred, measure = credit.costs)
# Tune the threshold using precision based on the predicted probabilities on the 3 test data sets
ppv_tune.res = tuneThreshold(pred = r$pred, measure = ppv)
d = generateThreshVsPerfData(r, measures = list(credit.costs, ppv, acc))
plt = plotThreshVsPerf(d)
plt + geom_vline(xintercept=cost_tune.res$th, colour = "blue") + geom_vline(xintercept=0.5, colour = "red") +
geom_vline(xintercept=1/2*(cost_tune.res$th + 0.5), colour = "green")
calculateConfusionMatrix(r$pred)
performance(r$pred, measures = list(acc, ppv, credit.costs))
Finally, I'm also a bit confused about my ppv value. When I look at my confusion matrix, I calculate my ppv as 442/(442+289) = 0.6046512, but the reported value is slightly different (0.6053531). Is there something wrong with my calculation?
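Regarding the custom measure idea above, here is a rough, untested sketch that wraps the cost and the ppv into one mlr measure with makeMeasure() and hands it to tuneThreshold(). The way the two terms are combined mirrors the credit.costs*(-ppv) example and is only an assumption; a weighted sum such as credit.costs - w*ppv may be easier to control:
## custom measure: average credit cost times negative precision (to be minimized)
cost.negppv = makeMeasure(
  id = "cost.negppv", name = "Credit costs times negative ppv",
  minimize = TRUE, properties = c("classif", "req.pred", "req.truth"),
  fun = function(task, model, pred, feats, extra.args) {
    performance(pred, measures = credit.costs) * (-performance(pred, measures = ppv))
  }
)
## tune the threshold on the combined measure, as done for credit.costs and ppv above
combined_tune.res = tuneThreshold(pred = r$pred, measure = cost.negppv)
combined_tune.res$th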

Use h2o.grid fine tune gbm model weight column issue

I am using the h2o.grid hyperparameter search function to fine-tune a gbm model. h2o gbm allows adding a weight column to specify the weight of each observation. However, when I try to add that in h2o.grid, it always errors out saying illegal argument/missing value, even though the weight column is populated.
Has anyone had a similar experience? Thanks.
Hyper-parameter: max_depth, 20
[2017-04-12 13:10:05] failure_details: Illegal argument(s) for GBM model: depth_grid_model_11. Details: ERRR on field: _weights_columns: Weights cannot have missing values.
ERRR on field: _weights_columns: Weights cannot have missing values.
============================
hyper_params = list( max_depth = c(4,6,8,12,16,20) ) ##faster for larger datasets
grid <- h2o.grid(
## hyper parameters
hyper_params = hyper_params,
## full Cartesian hyper-parameter search
search_criteria = list(strategy = "Cartesian"), ## default is Cartesian
## which algorithm to run
algorithm="gbm",
## identifier for the grid, to later retrieve it
grid_id="depth_grid",
## standard model parameters
x = X, #predictors,
y = Y, #response,
training_frame = datadev, #train,
validation_frame = dataval, #valid,
**weights_column = "Adj_Bias_correction",**
## more trees is better if the learning rate is small enough
## here, use "more than enough" trees - we have early stopping
ntrees = 10000,
## smaller learning rate is better
## since we have learning_rate_annealing, we can afford to start with a bigger learning rate
learn_rate = 0.05,
## learning rate annealing: learning_rate shrinks by 1% after every tree
## (use 1.00 to disable, but then lower the learning_rate)
learn_rate_annealing = 0.99,
## sample 80% of rows per tree
sample_rate = 0.8,
## sample 80% of columns per split
col_sample_rate = 0.8,
## fix a random number generator seed for reproducibility
seed = 1234,
## early stopping once the validation AUC doesn't improve by at least 0.01% for 5 consecutive scoring events
stopping_rounds = 5, stopping_tolerance = 1e-4, stopping_metric = "AUC",
## score every 10 trees to make early stopping reproducible (it depends on the scoring interval)
score_tree_interval = 10
)
## by default, display the grid search results sorted by increasing logloss (since this is a classification task)
grid
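For what it's worth, the error message itself points at missing values in the weight column of the training and/or validation frame, so that is the first thing to check. A minimal sketch (frame and column names taken from the question; replacing missing weights with 1 is only one possible choice, not something h2o prescribes):
library(h2o)
## count missing weights in both frames
sum(is.na(as.data.frame(datadev$Adj_Bias_correction)[[1]]))
sum(is.na(as.data.frame(dataval$Adj_Bias_correction)[[1]]))
## option 1: drop the rows with a missing weight
datadev <- datadev[!is.na(datadev$Adj_Bias_correction), ]
dataval <- dataval[!is.na(dataval$Adj_Bias_correction), ]
## option 2 (assumption): treat a missing weight as a neutral weight of 1
datadev$Adj_Bias_correction <- h2o.ifelse(is.na(datadev$Adj_Bias_correction),
                                          1, datadev$Adj_Bias_correction)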

stop xgboost based on eval_metric

I am trying to run xgboost for a problem with very noisy features and interested in stopping the number of rounds based on a custom eval_metric that I have defined.
Based on domain knowledge I know that when the eval_metric (evaluated on the training data) goes above a certain value xgboost is overfitting. And I would like to just take the fitted model at that specific number of rounds and not proceed further.
What would be the best way to achieve this ?
It would be somewhat in line with the early stopping criteria but not exactly.
Alternately, if there is a possibility to get the model from an intermediate round ?
Here is an example to better explain my question (using the toy example that comes with the xgboost help docs and the default eval_metric).
library(xgboost)
data(agaricus.train, package='xgboost')
train <- agaricus.train
bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, nthread = 2, nround = 5, objective = "binary:logistic")
Here is the output
[0] train-error:0.046522
[1] train-error:0.022263
[2] train-error:0.007063
[3] train-error:0.015200
[4] train-error:0.007063
Now let's say from domain knowledge I know that once the train error goes below 0.015 (the third round in this case), any further rounds only lead to overfitting. How would I stop the training process after the third round and get hold of the trained model to use it for prediction on a different dataset?
I need to run the training process over many different datasets, and I have no sense of how many rounds it might take to get the error below a fixed number, hence I can't set the nrounds argument to a predetermined value. The only intuition I have is that once the training error goes below a number, I need to stop further training rounds.
In the absence of any code you have tried or any data you are using, try something like this:
require(xgboost)
library(Metrics) # for rmse to calculate errors
# Assume you have a training set db.train and have some
# feature indices of interest and a test set db.test
predz <- c(2, 4, 6, 8, 10, 12)
predictors <- names(db.train[, predz])
# you have some response you are interested in
outcomeName <- "myLabel"
# you may like to include for testing some other parameters like:
# eta, gamma, colsample_bytree, min_child_weight
# here we look at depths from 1 to 4 and rounds 1 to 100 but set your own values
smallestError <- 100 # set to some sensible value depending on your eval metric
for (depth in seq(1, 4, 1)) {
  for (rounds in seq(1, 100, 1)) {
    # train
    bst <- xgboost(data = as.matrix(db.train[, predictors]),
                   label = db.train[, outcomeName],
                   max.depth = depth,
                   nround = rounds,
                   eval_metric = "logloss",
                   objective = "binary:logistic",
                   verbose = TRUE)
    gc()
    # predict
    predictions <- as.numeric(predict(bst, as.matrix(db.test[, predictors]),
                                      outputmargin = TRUE))
    err <- rmse(as.numeric(db.test[, outcomeName]), as.numeric(predictions))
    if (err < smallestError) {
      smallestError <- err
      print(paste(depth, rounds, err))
    }
  }
}
You could adapt this code to your particular evaluation metric and print the results to suit your situation. Similarly, you could break out of the loop once a round is reached that satisfies the condition you are looking for.
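On the "get the model from an intermediate round" part of the question: predict() for a trained booster accepts an ntreelimit argument, so another option is to train once with a generous nround and then predict with only the first k trees. A small sketch using the toy data from above (choosing k = 3 by reading the printed per-round train error; depending on your xgboost version the log is also stored in bst$evaluation_log):
library(xgboost)
data(agaricus.train, package = 'xgboost')
data(agaricus.test, package = 'xgboost')
train <- agaricus.train
test <- agaricus.test
# train once with more rounds than you expect to need
bst <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1,
               nthread = 2, nround = 10, objective = "binary:logistic")
# first round whose train error is below the domain cut-off (0.015), read off
# the printed log; round 3 in the example output above
k <- 3
# predict on a different dataset using only the first k boosting rounds
pred <- predict(bst, test$data, ntreelimit = k)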

xgboost Random Forest with sparse matrix data and multinomial Y

I'm not sure if xgboost's many nice features can be combined in the way that I need (?), but what I'm trying to do is to run a Random Forest with sparse data predictors on a multi-class dependent variable.
I know that xgboost can do any 1 of those things:
Random Forest via tweaking xgboost parameters:
bst <- xgboost(data = train$data, label = train$label, max.depth = 4, num_parallel_tree = 1000, subsample = 0.5, colsample_bytree =0.5, nround = 1, objective = "binary:logistic")
Sparse matrix predictors
bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 4,
eta = 1, nthread = 2, nround = 10,objective = "binary:logistic")
Multinomial (multiclass) dependent variable models via multi:softmax or multi:softprob
xgboost(data = data, label = multinomial_vector, max.depth = 4,
eta = 1, nthread = 2, nround = 10,objective = "multi:softmax")
However, I run into an error regarding non-conforming length when I try to do all of them at once:
sparse_matrix <- sparse.model.matrix(TripType~.-1, data = train)
Y <- train$TripType
bst <- xgboost(data = sparse_matrix, label = Y, max.depth = 4, num_parallel_tree = 100, subsample = 0.5, colsample_bytree =0.5, nround = 1, objective = "multi:softmax")
Error in xgb.setinfo(dmat, names(p), p[[1]]) :
The length of labels must equal to the number of rows in the input data
length(Y)
[1] 647054
length(sparse_matrix)
[1] 66210988200
nrow(sparse_matrix)
[1] 642925
The length error I'm getting is comparing the length of my single multi-class dependent vector (let's call it n) to the length of the sparse matrix index, which I believe is j * n for j predictors.
The specific use case here is the Kaggle.com Walmart competition (the data is public, but very large by default -- about 650,000 rows and several thousand candidate features). I've been running multinomial RF models on it via H2O, but it sounds like a lot of other folks have been using xgboost, so I wonder if this is possible.
If it's not possible, then I wonder if one could/should estimate each level of the dependent variable separately and try to combine the results?
Here is what is happening:
When you do this:
sparse_matrix <- sparse.model.matrix(TripType~.-1, data = train)
you are losing rows from your data
sparse.model.matrix cannot deal with NA's by default; when it sees one, it drops the row.
As it happens, there are exactly 4129 rows that contain NA's in the original data.
This is the difference between these two numbers:
length(Y)
[1] 647054
nrow(sparse_matrix)
[1] 642925
The reason this works on the previous examples is as follows
In the binomial case :
it is recycling the Y vector and completing the missing labels. (this is BAD)
In the random forest case:
(I think) it's because random forest never uses the predictions from previous trees, so this error goes unseen. (this is BAD)
Takeaway:
Neither of the previous examples that "work" will train well.
sparse.model.matrix drops NA's, so you are losing rows from your training data; this is a big problem and needs to be addressed (one possible fix is sketched below).
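One possible way to line the lengths up again, sketched under the assumption that simply dropping the incomplete rows is acceptable for a first model (imputing the NA's beforehand would keep all rows), is to subset Y to the same complete cases and to pass the num_class that multi:softmax requires:
library(Matrix)
library(xgboost)
# keep only complete rows, which should match what sparse.model.matrix keeps,
# so that the labels stay aligned with the design matrix
cc <- complete.cases(train)
sparse_matrix <- sparse.model.matrix(TripType ~ . - 1, data = train[cc, ])
# multi:softmax expects integer class labels 0 .. num_class-1
Y <- as.integer(factor(train$TripType[cc])) - 1
stopifnot(length(Y) == nrow(sparse_matrix))
bst <- xgboost(data = sparse_matrix, label = Y, max.depth = 4,
               num_parallel_tree = 100, subsample = 0.5, colsample_bytree = 0.5,
               nround = 1, objective = "multi:softmax",
               num_class = length(unique(Y)))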
Good luck!
