I have a sample code here.
data(agaricus.train, package='xgboost')
train <- agaricus.train
bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
eta = 1, nthread = 2, nround = 2,objective = "binary:logistic")
xgb.dump(bst, 'xgb.model.dump', with.stats = TRUE)
After building the model, I print it out as
booster[0]
0:[f28<-1.00136e-05] yes=1,no=2,missing=1,gain=4000.53,cover=1628.25
1:[f55<-1.00136e-05] yes=3,no=4,missing=3,gain=1158.21,cover=924.5
3:leaf=1.71218,cover=812
4:leaf=-1.70044,cover=112.5
2:[f108<-1.00136e-05] yes=5,no=6,missing=5,gain=198.174,cover=703.75
5:leaf=-1.94071,cover=690.5
6:leaf=1.85965,cover=13.25
booster[1]
0:[f59<-1.00136e-05] yes=1,no=2,missing=1,gain=832.545,cover=788.852
1:[f28<-1.00136e-05] yes=3,no=4,missing=3,gain=569.725,cover=768.39
3:leaf=0.784718,cover=458.937
4:leaf=-0.96853,cover=309.453
2:leaf=-6.23624,cover=20.4624
I have questions:
1. I understand that a gradient-boosted tree model combines the results from these trees using weighted coefficients. How can I get those coefficients?
2. Just to clarify: the value predicted by each tree is the leaf = x value, isn't it?
Thank you.
Combined answer for Q1 and Q2:
The coefficient for every tree's leaf score in xgboost is 1, so simply sum the leaf scores that a row reaches, one per tree. Let the sum be S.
Then apply the logistic function (for the two-class case) to it:
Pr(label=1) = 1/(1+exp(-S))
I have verified this and used it in production systems.
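As a minimal sketch of that arithmetic, take a row that lands in leaf 3 of booster[0] and leaf 3 of booster[1] in the dump from the question. The variable names are my own, and this assumes a base_score of 0.5, which adds nothing on the log-odds scale.

```r
# Leaf scores copied from the dump in the question
leaf_scores <- c(1.71218, 0.784718)  # booster[0] leaf 3, booster[1] leaf 3

S <- sum(leaf_scores)        # every tree has weight 1
prob <- 1 / (1 + exp(-S))    # equivalent to plogis(S)
prob
```

If the model was trained with a different base_score, its log-odds would presumably have to be added to S before applying the logistic function.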
I am trying to run xgboost on a problem with very noisy features, and I am interested in stopping the number of rounds based on a custom eval_metric that I have defined.
Based on domain knowledge, I know that when the eval_metric (evaluated on the training data) goes above a certain value, xgboost is overfitting. I would like to take the fitted model at that specific number of rounds and not proceed further.
What would be the best way to achieve this?
It would be somewhat in line with the early stopping criteria, but not exactly.
Alternatively, is there a way to get the model from an intermediate round?
Here is an example to better explain my question. (It uses the toy example that comes with the xgboost help docs and the default eval_metric.)
library(xgboost)
data(agaricus.train, package='xgboost')
train <- agaricus.train
bstSparse <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, nthread = 2, nround = 5, objective = "binary:logistic")
Here is the output
[0] train-error:0.046522
[1] train-error:0.022263
[2] train-error:0.007063
[3] train-error:0.015200
[4] train-error:0.007063
Now let's say that from domain knowledge I know that once the train error goes below 0.015 (the third round in this case), any further rounds only lead to overfitting. How would I stop the training process after the third round and get hold of the trained model to use it for prediction on a different dataset?
I need to run the training process over many different datasets, and I have no sense of how many rounds it might take to get the error below a fixed number, so I can't set the nrounds argument to a predetermined value. The only intuition I have is that once the training error goes below a certain number, I need to stop further training rounds.
In the absence of any code you have tried or data you are using, try something like this:
require(xgboost)
library(Metrics) # for rmse to calculate errors
# Assume you have a training set db.train and have some
# feature indices of interest and a test set db.test
predz <- c(2, 4, 6, 8, 10, 12)
predictors <- names(db.train[, predz])
# you have some response you are interested in
outcomeName <- "myLabel"
# you may like to include for testing some other parameters like:
# eta, gamma, colsample_bytree, min_child_weight
# here we look at depths from 1 to 4 and rounds 1 to 100 but set your own values
smallestError <- 100 # set to some sensible value depending on your eval metric
for (depth in seq(1, 4, 1)) {
  for (rounds in seq(1, 100, 1)) {
    # train
    bst <- xgboost(data = as.matrix(db.train[, predictors]),
                   label = db.train[, outcomeName],
                   max.depth = depth,
                   nround = rounds,
                   eval_metric = "logloss",
                   objective = "binary:logistic",
                   verbose = TRUE)
    gc()
    # predict probabilities (outputmargin = TRUE would return log-odds,
    # which are not comparable to 0/1 labels)
    predictions <- as.numeric(predict(bst, as.matrix(db.test[, predictors])))
    err <- rmse(as.numeric(db.test[, outcomeName]), predictions)
    if (err < smallestError) {
      smallestError <- err
      print(paste(depth, rounds, err))
    }
  }
}
You could adapt this code for your particular evaluation metric and print this out to suit your situation. Similarly you could introduce a break in the code when some specified number of rounds is reached that satisfies some condition you seek to achieve.
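One way to read the "introduce a break" suggestion, as a sketch only: evaluate the metric round by round and stop as soon as it crosses the threshold. Here `train_error_after()` is a hypothetical stand-in for whatever trains or evaluates the model at a given round; it simply replays the train-error values printed in the question.

```r
# Hypothetical stand-in: replays the train-error log from the question.
# In practice this would train (or continue training) and return the metric.
logged_error <- c(0.046522, 0.022263, 0.007063, 0.015200, 0.007063)
train_error_after <- function(i) logged_error[i]

threshold <- 0.015
max_rounds <- length(logged_error)
stop_round <- NA_integer_   # stays NA if the threshold is never crossed

for (i in seq_len(max_rounds)) {
  err <- train_error_after(i)
  if (err < threshold) {
    stop_round <- i   # first round whose error is below the threshold
    break
  }
}
stop_round
```

The round count found this way is 1-based, so it corresponds to the line labelled [2] in the log. It could then be used as the number of rounds for a final fit on that dataset.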
Taking a cue from "How to access weighting of indiviual decision trees in xgboost?":
How does one calculate the weights when objective = "binary:logistic" and eta = 0.1?
My tree dump is:
booster[0]
0:[WEIGHT<3267.5] yes=1,no=2,missing=1,gain=133.327,cover=58.75
1:[CYLINDERS<5.5] yes=3,no=4,missing=3,gain=9.61229,cover=33.25
3:leaf=0.872727,cover=26.5
4:leaf=0.0967742,cover=6.75
2:[WEIGHT<3431] yes=5,no=6,missing=5,gain=4.82912,cover=25.5
5:leaf=-0.0526316,cover=3.75
6:leaf=-0.846154,cover=21.75
booster[1]
0:[DISPLACEMENT<231.5] yes=1,no=2,missing=1,gain=60.9437,cover=52.0159
1:[WEIGHT<2974.5] yes=3,no=4,missing=3,gain=6.59775,cover=31.3195
3:leaf=0.582471,cover=25.5236
4:leaf=-0,cover=5.79593
2:[MODELYEAR<78.5] yes=5,no=6,missing=5,gain=1.96045,cover=20.6964
5:leaf=-0.643141,cover=19.3965
6:leaf=-0,cover=1.2999
Actually, this turned out to be straightforward; I had overlooked it earlier.
Using the above tree structure, one can find the probability for each training example.
The parameter list was:
param <- list("objective" = "binary:logistic",
"eval_metric" = "logloss",
"eta" = 0.5,
"max_depth" = 2,
"colsample_bytree" = .8,
"subsample" = 0.8,
"alpha" = 1)
For the set of instances that fall in leaf 3 of booster[0],
the probability is exp(0.872727)/(1+exp(0.872727)).
For instances that fall in leaf 3 of booster[0] and leaf 3 of booster[1],
the probability is exp(0.872727+0.582471)/(1+exp(0.872727+0.582471)).
And so on as the number of iterations increases.
I compared these values with R's predicted probabilities; they differ on the order of 10^(-7), probably because the leaf scores in the dump are rounded.
This might not answer the question of finding the weights, but it gives a production-level solution when boosted trees trained in R are used for prediction in a different environment.
Any comment on this will be highly appreciated.
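For using the dump in another environment, the remaining piece is routing a record to its leaf in each tree. A rough sketch of that, with the two trees above transcribed by hand into data frames; the column names, the `leaf_score()` helper, and the example record are all my own assumptions, not part of xgboost:

```r
# Trees transcribed from the dump above (node ids, split feature, threshold,
# child ids, leaf score). Missing values follow "yes" in this dump.
tree0 <- data.frame(
  node    = 0:6,
  feature = c("WEIGHT", "CYLINDERS", "WEIGHT", NA, NA, NA, NA),
  split   = c(3267.5, 5.5, 3431, NA, NA, NA, NA),
  yes     = c(1, 3, 5, NA, NA, NA, NA),
  no      = c(2, 4, 6, NA, NA, NA, NA),
  leaf    = c(NA, NA, NA, 0.872727, 0.0967742, -0.0526316, -0.846154),
  stringsAsFactors = FALSE
)
tree1 <- data.frame(
  node    = 0:6,
  feature = c("DISPLACEMENT", "WEIGHT", "MODELYEAR", NA, NA, NA, NA),
  split   = c(231.5, 2974.5, 78.5, NA, NA, NA, NA),
  yes     = c(1, 3, 5, NA, NA, NA, NA),
  no      = c(2, 4, 6, NA, NA, NA, NA),
  leaf    = c(NA, NA, NA, 0.582471, 0, -0.643141, 0),
  stringsAsFactors = FALSE
)

# Walk from the root (node 0) until a leaf is reached
leaf_score <- function(tree, x) {
  id <- 0
  repeat {
    row <- tree[tree$node == id, ]
    if (!is.na(row$leaf)) return(row$leaf)
    value <- x[[row$feature]]
    go_yes <- is.na(value) || value < row$split
    id <- if (go_yes) row$yes else row$no
  }
}

# Made-up record for illustration
car <- list(WEIGHT = 2500, CYLINDERS = 4, DISPLACEMENT = 120, MODELYEAR = 76)
S <- leaf_score(tree0, car) + leaf_score(tree1, car)
prob <- plogis(S)   # exp(S) / (1 + exp(S))
prob
```

This record reaches leaf 3 in both trees, so the result matches the second expression above.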
I'm not sure if xgboost's many nice features can be combined in the way that I need (?), but what I'm trying to do is to run a Random Forest with sparse data predictors on a multi-class dependent variable.
I know that xgboost can do each of these things individually:
Random Forest via tweaking xgboost parameters:
bst <- xgboost(data = train$data, label = train$label, max.depth = 4, num_parallel_tree = 1000, subsample = 0.5, colsample_bytree =0.5, nround = 1, objective = "binary:logistic")
Sparse matrix predictors
bst <- xgboost(data = sparse_matrix, label = output_vector, max.depth = 4,
eta = 1, nthread = 2, nround = 10,objective = "binary:logistic")
Multinomial (multiclass) dependent variable models via multi:softmax or multi:softprob
xgboost(data = data, label = multinomial_vector, max.depth = 4,
eta = 1, nthread = 2, nround = 10,objective = "multi:softmax")
However, I run into an error regarding non-conforming length when I try to do all of them at once:
sparse_matrix <- sparse.model.matrix(TripType~.-1, data = train)
Y <- train$TripType
bst <- xgboost(data = sparse_matrix, label = Y, max.depth = 4, num_parallel_tree = 100, subsample = 0.5, colsample_bytree =0.5, nround = 1, objective = "multi:softmax")
Error in xgb.setinfo(dmat, names(p), p[[1]]) :
The length of labels must equal to the number of rows in the input data
length(Y)
[1] 647054
length(sparse_matrix)
[1] 66210988200
nrow(sparse_matrix)
[1] 642925
The length error I'm getting is comparing the length of my single multi-class dependent vector (let's call it n) to the length of the sparse matrix index, which I believe is j * n for j predictors.
The specific use case here is the Kaggle.com Walmart competition (the data is public, but very large by default -- about 650,000 rows and several thousand candidate features). I've been running multinomial RF models on it via H2O, but it sounds like a lot of other folks have been using xgboost, so I wonder if this is possible.
If it's not possible, then I wonder if one could/should estimate each level of the dependent variable separately and try to combine the results?
Here is what is happening:
When you do this:
sparse_matrix <- sparse.model.matrix(TripType~.-1, data = train)
you are losing rows from your data.
sparse.model.matrix cannot deal with NAs by default; when it sees one, it drops the row.
As it happens, there are exactly 4129 rows that contain NAs in the original data.
This is the difference between these two numbers:
length(Y)
[1] 647054
nrow(sparse_matrix)
[1] 642925
The reason this works in the previous examples is as follows.
In the binomial case:
it is recycling the Y vector and completing the missing labels. (this is BAD)
In the random forest case:
(I think) it's because a random forest never uses the predictions from previous trees, so this error goes unseen. (this is BAD)
Takeaway:
Neither of the previous examples that appear to work will train well.
Because sparse.model.matrix drops rows with NAs, you are losing rows in your training data. This is a big problem and needs to be addressed.
Good luck!
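To make the row-dropping concrete, here is a small sketch on made-up data. It uses base R's model.matrix rather than sparse.model.matrix, on the assumption that both go through model.frame and therefore share the same na.action behaviour; the data frame and variable names are invented for illustration.

```r
# Toy data (made up): two of the five rows have an NA in a predictor
df <- data.frame(
  y  = c(0, 1, 2, 1, 0),
  x1 = c(1.5, NA, 3.0, 4.2, 5.1),
  x2 = c(10, 20, NA, 40, 50)
)

# With the default na.action (na.omit), incomplete rows are dropped silently
mm_default <- model.matrix(y ~ . - 1, data = df)
nrow(mm_default)   # fewer rows than nrow(df)

# Option 1: drop the same rows from the label vector so lengths agree
keep <- complete.cases(df)
y_aligned <- df$y[keep]

# Option 2: keep every row by letting NAs pass through the model frame
mf <- model.frame(y ~ . - 1, data = df, na.action = na.pass)
mm_keep <- model.matrix(y ~ . - 1, data = mf)
nrow(mm_keep)      # same as nrow(df); NAs remain in the matrix
```

Either way, the check worth running before training is that the label vector has exactly as many elements as the matrix has rows.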
Could someone explain how the Quality column in the xgboost R package is calculated in the xgb.model.dt.tree function?
In the documentation it says that Quality "is the gain related to the split in this specific node".
When you run the following code, given in the xgboost documentation for this function, Quality for node 0 of tree 0 is 4000.53, yet I calculate the Gain as 2002.848
data(agaricus.train, package='xgboost')
train <- agaricus.train
X = train$data
y = train$label
bst <- xgboost(data = train$data, label = train$label, max.depth = 2,
eta = 1, nthread = 2, nround = 2,objective = "binary:logistic")
xgb.model.dt.tree(agaricus.train$data@Dimnames[[2]], model = bst)
p = rep(0.5,nrow(X))
L = which(X[,'odor=none']==0)
R = which(X[,'odor=none']==1)
pL = p[L]
pR = p[R]
yL = y[L]
yR = y[R]
GL = sum(pL-yL)
GR = sum(pR-yR)
G = sum(p-y)
HL = sum(pL*(1-pL))
HR = sum(pR*(1-pR))
H = sum(p*(1-p))
gain = 0.5 * (GL^2/HL+GR^2/HR-G^2/H)
gain
I understand that Gain is given by the following formula:
Gain = 1/2 * [ GL^2/(HL + lambda) + GR^2/(HR + lambda) - (GL + GR)^2/(HL + HR + lambda) ] - gamma
Since we are using log loss, G is the sum of p-y and H is the sum of p(1-p); gamma and lambda in this instance are both zero.
Can anyone identify where I am going wrong?
OK, I think I've worked it out. The value for reg_lambda is not 0 by default as given in the documentation, but is actually 1 (from param.h)
Also, it appears that the factor of a half is not applied when calculating the gain, so the Quality column is double what you would expect. Lastly, I also don't think gamma (also called min_split_loss) is applied to this calculation either (from update_hitmaker-inl.hpp)
Instead, gamma is used to determine whether to invoke pruning, but is not reflected in the gain calculation itself, as the documentation suggests.
If you apply these changes, you do indeed get 4000.53 as the Quality for node 0 of tree 0, as in the original question. I'll raise this as an issue to the xgboost guys, so the documentation can be changed accordingly.
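A sketch of the adjusted calculation described above, so the two versions can be compared side by side. The `split_gain()` helper and the toy node are my own illustration (not xgboost code), and the numbers are made up rather than taken from the agaricus data.

```r
# Gain as described above: lambda added to each Hessian sum,
# no factor of 1/2, and gamma not subtracted.
split_gain <- function(GL, HL, GR, HR, lambda = 1) {
  GL^2 / (HL + lambda) + GR^2 / (HR + lambda) -
    (GL + GR)^2 / (HL + HR + lambda)
}

# Made-up node: 8 positives go left, 8 negatives go right, p = 0.5 everywhere
y <- c(rep(1, 8), rep(0, 8))
go_left <- c(rep(TRUE, 8), rep(FALSE, 8))
p <- rep(0.5, length(y))
g <- p - y          # gradient of log loss
h <- p * (1 - p)    # Hessian of log loss

GL <- sum(g[go_left]);  HL <- sum(h[go_left])
GR <- sum(g[!go_left]); HR <- sum(h[!go_left])

quality  <- split_gain(GL, HL, GR, HR)                     # lambda = 1, no 1/2
textbook <- 0.5 * split_gain(GL, HL, GR, HR, lambda = 0)   # the question's version
c(quality = quality, textbook = textbook)
```

Feeding the GL, HL, GR, HR values from the question's own code into `split_gain()` should, by the reasoning above, reproduce the Quality column.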
I am trying to permute (column-wise only) my data matrix 1000 times and then do hierarchical clustering in R, so that I have the final tree on my data after 1000 randomizations.
This is where I am lost. I have this loop
for(i in 1:1000)
{
permuted <- test2_matrix[,sample(ncol(test2_matrix), 12, replace=TRUE)]; # this permutes my columns
d = dist(permuted, method = "euclidean", diag = FALSE, upper = FALSE, p = 2);
clust = hclust(d, method = "complete", members=NULL);
}
png (filename="cluster_dendrogram_bootstrap.png", width=1024, height=1024, pointsize=10)
plot(clust)
I am not sure if the final tree is a product of the 1000 randomizations or just the last tree calculated in the loop. Also, if I want to display the bootstrap values on the tree, how should I go about it?
Many thanks!!
The value of clust in your example is indeed the final tree calculated in the loop. Here's a way of making and saving 1000 permutations of your matrix
make.permuted.clust <- function(i){ # this argument is not used
permuted <- test2_matrix[,sample(ncol(test2_matrix), 12, replace=TRUE)] # test2_matrix is the matrix from the question
d <- dist(permuted, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
clust <- hclust(d, method = "complete", members=NULL)
clust # return value
}
all.clust <- lapply(1:1000, make.permuted.clust) # 1000 hclust trees
The second part of your question should be answered here.
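As a rough base-R illustration of what a bootstrap value could mean here (not a replacement for a dedicated package): count how often the top-level split of the original tree reappears when columns are resampled. The data, the `co_membership()` helper, and the choice of k = 2 are all assumptions made for this sketch.

```r
set.seed(42)
# Made-up data: 6 rows, 12 columns, rows 1-3 shifted to form a clear group
m <- matrix(rnorm(6 * 12), nrow = 6)
m[1:3, ] <- m[1:3, ] + 5

# TRUE where two rows share a cluster; does not depend on cluster labels
co_membership <- function(x, k) {
  cl <- cutree(hclust(dist(x), method = "complete"), k = k)
  outer(cl, cl, "==")
}

reference <- co_membership(m, k = 2)   # split found on the original data

same_split <- function(i) {  # i is unused, as in the function above
  boot <- m[, sample(ncol(m), replace = TRUE)]
  all(co_membership(boot, k = 2) == reference)
}

# Fraction of resampled trees that reproduce the reference split
support <- mean(vapply(1:200, same_split, logical(1)))
support
```

The same idea could be applied to the trees stored in all.clust by comparing cutree() results against a tree built on the unresampled matrix.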
You may be interested in the RandomForest method implemented in the randomForest package, which implements both bootstrapping of the data and of the splitting variables and allows you to save trees and get a consensus tree.
library(randomForest)
The original random forest (in FORTRAN 77) developers site
The package manual