Understanding num_class for xgboost in R

I'm having a lot of trouble figuring out how to correctly set num_class for xgboost.
I've got an example using the iris data:
df <- iris
y <- df$Species
num.class = length(levels(y))
levels(y) = 1:num.class
head(y)
df <- df[,1:4]
y <- as.matrix(y)
df <- as.matrix(df)
param <- list("objective" = "multi:softprob",
"num_class" = 3,
"eval_metric" = "mlogloss",
"nthread" = 8,
"max_depth" = 16,
"eta" = 0.3,
"gamma" = 0,
"subsample" = 1,
"colsample_bytree" = 1,
"min_child_weight" = 12)
model <- xgboost(param=param, data=df, label=y, nrounds=20)
This returns an error
Error in xgb.iter.update(bst$handle, dtrain, i - 1, obj) :
SoftmaxMultiClassObj: label must be in [0, num_class), num_class=3 but found 3 in label
If I change the num_class to 2 I get the same error. If I increase the num_class to 4 then the model runs, but I get 600 predicted probabilities back, which makes sense for 4 classes.
I'm not sure if I'm making an error or whether I'm failing to understand how xgboost works. Any help would be appreciated.

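The error message states the requirement: label must be in [0, num_class).
In your script, make the labels numeric and 0-based before model <- ..., e.g. y <- as.integer(iris$Species) - 1. (y <- y - 1 on its own only works once y is numeric; as.matrix() on a factor gives a character matrix.)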
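A minimal sketch of that re-encoding on the iris data from the question, using base R only (the xgboost call itself is left out). as.integer() on a factor yields the codes 1..K, so subtracting 1 gives the 0..K-1 labels that the error message asks for:

```r
# Encode the Species factor as 0-based integer labels
y <- as.integer(iris$Species) - 1L
num_class <- nlevels(iris$Species)

range(y)                      # smallest and largest label
all(y >= 0 & y < num_class)   # TRUE when every label lies in [0, num_class)
```

With labels in this form, num_class = 3 matches the data and y can be passed as the label unchanged.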

I ran into this rather weird problem as well. In my case it seemed to be a result of not properly encoding the labels.
First, using a string vector with N classes as the labels, I could only get the algorithm to run by setting num_class = N + 1. However, this result was useless, because I only had N actual classes and N+1 buckets of predicted probabilities.
I re-encoded the labels as integers and then num_class worked fine when set to N.
# Convert classes to integers for xgboost
class <- data.table(interest_level=c("low", "medium", "high"), class=c(0,1,2))
t1 <- merge(t1, class, by="interest_level", all.x=TRUE, sort=F)
and
param <- list(booster="gbtree",
objective="multi:softprob",
eval_metric="mlogloss",
#nthread=13,
num_class=3,
eta_decay = .99,
eta = .005,
gamma = 1,
max_depth = 4,
min_child_weight = .9,#1,
subsample = .7,
colsample_bytree = .5
)
For example.
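A merge is not the only way to do the re-encoding. As a hedged alternative sketch, match() against a fixed vector of class names gives the same 0-based codes without reordering rows. The interest_level values and the chosen class order below are made up for illustration:

```r
# Hypothetical label column; the class order below is an assumption
interest_level <- c("high", "low", "medium", "low", "high")
class_order    <- c("low", "medium", "high")

# match() returns 1-based positions in class_order; subtract 1 for xgboost
label <- match(interest_level, class_order) - 1L
label
```

Any value missing from class_order comes back as NA, which makes typos in the labels easy to spot before training.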

I was seeing the same error. My issue was that I was using an eval_metric meant only for multiclass labels when my data had binary labels. See eval_metric in the Learning Task Parameters section of the XGBoost docs for a list of all the options.

I had this problem, and it turned out that I was subtracting 1 from my response, which was already coded as 0 and 1. Probably a novice mistake, but worth noting for anyone else running into this with a binary response variable that is already 0/1.
Tutorial said:
label = as.integer(iris$Species)-1
What worked for me (response is high_end):
label = as.integer(high_end)
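One way to catch both mistakes (labels starting at 1, or labels shifted below 0) is to check the range before training. This is a sketch only; check_labels is a hypothetical helper and not part of xgboost:

```r
# Hypothetical helper: stop early if labels fall outside [0, num_class)
check_labels <- function(label, num_class) {
  ok <- !is.na(label) & label == round(label) & label >= 0 & label < num_class
  if (!all(ok)) {
    stop("labels must be integers in [0, ", num_class, "); found: ",
         paste(sort(unique(label[!ok])), collapse = ", "))
  }
  invisible(TRUE)
}

check_labels(c(0, 1, 1, 0), num_class = 2)   # already 0/1: passes silently

# Subtracting 1 from a 0/1 response gives -1/0, which the check rejects
res <- try(check_labels(c(0, 1, 1, 0) - 1, num_class = 2), silent = TRUE)
inherits(res, "try-error")
```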

Related

MXNET softmax output: label shape confusion

I do not have a clear idea of how labels for the softmax classifier should be shaped.
What I could understand from my experiments is that a scalar label giving the index of the class probability output is one option, while another is a 2D label whose rows are class probabilities, or a one-hot encoded variable like c(1, 0, 0).
What puzzles me, though, is the following:
I can use scalar label values that go beyond the valid index range, like 4 in my example below, without any warning or error. Why is that?
When my label is a negative scalar or an array with a negative value, the model converges to a uniform probability distribution over the classes. For example, is it expected that actor_train.y = matrix(c(0, -1, 0), ncol = 1) results in equal probabilities in the softmax output?
I am trying to use the MXNET softmax classifier for policy-gradient reinforcement learning, and my negative rewards lead to the issue above: uniform probability. Is that expected?
require(mxnet)
actor_initializer <- mx.init.Xavier(rnd_type = "gaussian",
factor_type = "avg",
magnitude = 0.0001)
actor_nn_data <- mx.symbol.Variable('data')
actor_nn_label <- mx.symbol.Variable('label')
device.cpu <- mx.cpu()
# NN architecture
actor_fc3 <- mx.symbol.FullyConnected(
data = actor_nn_data
, num_hidden = 3 )
actor_output <- mx.symbol.SoftmaxOutput(
data = actor_fc3
, label = actor_nn_label
, name = 'actor' )
crossentfunc <- function(label, pred)
{
- sum(label * log(pred)) }
actor_loss <- mx.metric.custom(
feval = crossentfunc
, name = "log-loss"
)
# initialize NN
actor_train.x <- matrix(rnorm(11), nrow = 1)
actor_train.y = 0 #1 #2 #3 #-3 # matrix(c(0, 0, -1), ncol = 1)
rm(actor_model)
actor_model <- mx.model.FeedForward.create(
symbol = actor_output,
X = actor_train.x,
y = actor_train.y,
ctx = device.cpu,
num.round = 100,
array.batch.size = 1,
optimizer = 'adam',
eval.metric = actor_loss,
clip_gradient = 1,
wd = 0.01,
initializer = actor_initializer,
array.layout = "rowmajor" )
predict(actor_model, actor_train.x, array.layout = "rowmajor")
It is quite strange to me, but I found a solution.
I changed the optimizer from optimizer = 'adam' to optimizer = 'rmsprop', and the NN started to converge as expected with negative targets. I ran simulations in R using a simple NN and the optim function and got the same result.
It looks like adam or SGD may misbehave for multinomial classification in this setup. I also used to get stuck because those optimizers did not converge to a perfect solution on just 1 example, while rmsprop does. Be aware!

xgboost multinomial classification error: "label and prediction size not match"

Foreword: I'm fairly new to both xgboost and R.
I am using xgboost in R to perform a multinomial classification on my data dtrain. The label I am using has six levels, so my code looks like this:
param1 <- list(objective = "multi:softprob"
, num_class = 6
, booster = "gbtree"
, eta = 0.5
, max.depth = 7
, min_child_weight = 10
, max_delta_step = 5
, subsample = 0.8
, colsample_bytree = 0.8
, lambda = 3 # L2
, alpha = 5 # L1
)
set.seed(2016)
xgbcv1 <- xgb.cv(params = param1, data = dtrain, nround = 3000, nfold = 3,
metrics = list("error", "auc"), maximize = T,
print_every_n = 10, early_stopping_rounds = 10)
This throws me the following error:
Error in xgb.iter.update(fd$bst, fd$dtrain, iteration - 1, obj) :
amalgamation/../src/objective/multiclass_obj.cc:75: Check failed:
label_error >= 0 && label_error < nclass SoftmaxMultiClassObj: label must be in [0, num_class), num_class=6 but found 6 in label.
So I tried setting num_class = 7, which throws this error:
Error in xgb.iter.eval(fd$bst, fd$watchlist, iteration - 1, feval) :
amalgamation/../src/metric/elementwise_metric.cc:28: Check failed:
(preds.size()) == (info.labels.size()) label and prediction size not match, hint: use merror or mlogloss for multi-class classification
What's going on here? Does num_class need to be greater than label_error or equal to it?
The XGBoost algorithm requires that class labels start at 0 and increase sequentially up to num_class - 1.
This is a bit of an inconvenience, as you need to keep track of which class name goes with which label.
Convert your class target variable to numeric and subtract 1 from it.
library(dplyr)  # provides %>% and mutate()
df$class_numeric <- as.numeric(df$class_target)
df <- df %>% mutate(class_numeric = class_numeric - 1)
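To keep track of which class name goes with which label, the factor levels can be stored and used to translate predictions back. The sketch below uses base R only; the target values and the probability vector are made up, and it assumes that multi:softprob predictions arrive as one flat vector stored row by row:

```r
class_target <- factor(c("b", "a", "c", "a"))   # hypothetical target
class_names  <- levels(class_target)            # label k corresponds to class_names[k + 1]
num_class    <- length(class_names)

# Stand-in for predict() output under multi:softprob: one probability per
# row and class, assumed to be stored row by row (made-up numbers)
pred <- c(0.1, 0.7, 0.2,
          0.8, 0.1, 0.1,
          0.2, 0.2, 0.6,
          0.6, 0.3, 0.1)

# One row per observation, one column per class
prob <- matrix(pred, ncol = num_class, byrow = TRUE)

# Most probable class per row, translated back to the original names
predicted <- class_names[max.col(prob, ties.method = "first")]
predicted
```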
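Another suggestion: if the dependent variable has 6 levels, set num_class = 7, i.e. num_class = nlevels(dependent variable) + 1. Note that this only accommodates labels coded 1 to 6 and leaves an unused class 0 in the predictions; re-encoding the labels as 0 through 5 and keeping num_class = 6 avoids that.
Also try:
set metrics = list("mlogloss")
As the second error message hints, use merror or mlogloss for multi-class classification.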

human readable rules from xgboost in R

I am trying to use xgboost in R to get rules (gbtree) from my data, so that I can use the rules in another system (rather than predicting data with 'predict'). The input data has approx. 1,500 columns and 40 million rows of binary, sparse data, and the label is a binary column, too.
library(xgboost)
library(Matrix)
labels <- data.frame(labels = sample.int(2, 100, TRUE) - 1L)
observations <- Matrix(as.matrix(data.frame(feat_01=sample.int(2, size=100, T) -1,
feat_02=sample.int(2, size=100, T) -1,
feat_03=sample.int(2, size=100, T) -1,
feat_04=sample.int(2, size=100, T) -1,
feat_05=sample.int(2, size=100, T) -1)), sparse=T)
dtrain <- xgb.DMatrix(data = observations, label = labels$labels)
bstResult <- xgb.train(data = dtrain,
nthread = 1,
nround = 4,
max_depth = 3,
verbose = T,
objective = "binary:logistic",
booster='gbtree')
xgb.dump(bstResult)
xgb.plot.tree(model = bstResult, n_first_tree = 2)
I visualize the data as xgb.dump or xgb.plot.tree. But I need the data in a form like:
rule1: feat_01 == 1 & feat_02==1 & feat_03== 0 --> Label = 1
rule2: feat_01== 0 & feat_03==1 & feat_04== 1 --> Label = 0
Is this possible or am I on the wrong track?
Regards
Heiko
edit: added example and tried to make the question better
On one hand, I think you can use the importance matrix (xgb.importance) to obtain the coverage and ranking of each feature. On the other hand, xgboost builds an ensemble of weak learners by boosting (not bagging), and each prediction is a sum over all trees, so clean standalone rules should be 'rare'.
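For a single tree, root-to-leaf paths can still be written out as rules. The sketch below is an illustration under assumptions: the data frame is hand-written, its column layout only mimics the kind of table xgb.model.dt.tree() returns, the Value column name and all numbers are hypothetical, and extract_rules is not an xgboost function.

```r
# Hand-written stand-in for one tree (hypothetical values and column names)
tree <- data.frame(
  ID      = c("0-0", "0-1", "0-2", "0-3", "0-4"),
  Feature = c("feat_01", "feat_02", "Leaf", "Leaf", "Leaf"),
  Split   = c(0.5, 0.5, NA, NA, NA),
  Yes     = c("0-1", "0-3", NA, NA, NA),   # branch taken when Feature < Split
  No      = c("0-2", "0-4", NA, NA, NA),   # branch taken otherwise
  Value   = c(NA, NA, 0.4, -0.3, 0.1),     # leaf weight
  stringsAsFactors = FALSE
)

# Walk the tree recursively, collecting the split conditions along each path
extract_rules <- function(tree, id = tree$ID[1], conds = character(0)) {
  node <- tree[tree$ID == id, ]
  if (node$Feature == "Leaf") {
    lhs <- if (length(conds)) paste(conds, collapse = " & ") else "TRUE"
    return(sprintf("%s --> leaf weight %s", lhs, format(node$Value)))
  }
  yes_cond <- sprintf("%s < %s", node$Feature, format(node$Split))
  no_cond  <- sprintf("%s >= %s", node$Feature, format(node$Split))
  c(extract_rules(tree, node$Yes, c(conds, yes_cond)),
    extract_rules(tree, node$No,  c(conds, no_cond)))
}

rules <- extract_rules(tree)
print(rules)
```

Each path ends in a leaf weight rather than a label. With binary:logistic the weights of all trees are summed before the logistic transform, so a path from one tree is only part of the final decision.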

Using XGBoost in R for regression based model

I'm trying to use XGBoost as a replacement for gbm.
The scores I'm getting are rather odd, so I think I may be doing something wrong in my code.
My data contains several factor variables; all others are numeric.
The response variable is continuous and indicates a house price.
I understand that in order to use XGBoost, I need to one-hot encode the factor variables. I'm doing so with the following code:
library(caret)  # dummyVars() and postResample() below come from caret
Xtest <- test.data
Xtrain <- train.data
XSalePrice <- Xtrain$SalePrice
Xtrain$SalePrice <- NULL
# Combine data
Xall <- data.frame(rbind(Xtrain, Xtest))
# Get categorical features names
ohe_vars <- names(Xall)[which(sapply(Xall, is.factor))]
# Convert them
dummies <- dummyVars(~., data = Xall)
Xall_ohe <- as.data.frame(predict(dummies, newdata = Xall))
# Replace factor variables in data with OHE
Xall <- cbind(Xall[, -c(which(colnames(Xall) %in% ohe_vars))], Xall_ohe)
After that, I'm splitting the data back to the test & train set:
Xtrain <- Xall[1:nrow(train.data), ]
Xtest <- Xall[-(1:nrow(train.data)), ]
And then building a model, and printing the RMSE & Rsquared:
# Model
xgb.fit <- xgboost(data = data.matrix(Xtrain), label = XSalePrice,
booster = "gbtree", objective = "reg:linear",
colsample_bytree = 0.2, gamma = 0.0,
learning_rate = 0.05, max_depth = 6,
min_child_weight = 1.5, n_estimators = 7300,
reg_alpha = 0.9, reg_lambda = 0.5,
subsample = 0.2, seed = 42,
silent = 1, nrounds = 25)
xgb.pred <- predict(xgb.fit, data.matrix(Xtrain))
postResample(xgb.pred, XSalePrice)
The problem is that I'm getting a very poor RMSE and Rsquared:
RMSE Rsquared
1.877639e+05 5.308910e-01
These are VERY far from the results I get when using GBM.
I think I'm doing something wrong. My best guess is that the problem is in the one-hot encoding phase, which I'm unfamiliar with, so I adapted code I found online to my data.
Can someone point out what I am doing wrong and how to fix it?
UPDATE:
After reviewing @Codutie's answer, my code has some errors:
Xtrain <- sparse.model.matrix(SalePrice ~. , data = train.data)
XDtrain <- xgb.DMatrix(data = Xtrain, label = "SalePrice")
xgb.DMatrix produces:
Error in setinfo.xgb.DMatrix(dmat, names(p), p[[1]]) :
The length of labels must equal to the number of rows in the input data
train.data is data frame, and it has 1453 rows. Label SalePrice also contains 1453 values (No missing values)
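The message is consistent with label = "SalePrice" being a single string of length 1, not the column of prices. A base-R sketch of the length check, with a tiny made-up train.data and model.matrix() standing in for sparse.model.matrix():

```r
# Tiny stand-in for train.data (made-up values)
train.data <- data.frame(
  SalePrice = c(208500, 181500, 223500, 140000),
  LotArea   = c(8450, 9600, 11250, 9550),
  Street    = factor(c("Pave", "Pave", "Grvl", "Pave"))
)

# Dense base-R analogue of sparse.model.matrix(); the response is not a column
Xtrain <- model.matrix(SalePrice ~ ., data = train.data)

bad_label  <- "SalePrice"            # a single string: length 1
good_label <- train.data$SalePrice   # one value per row

length(bad_label)  == nrow(Xtrain)   # lengths differ, hence the error
length(good_label) == nrow(Xtrain)   # lengths agree
```

Passing the numeric vector train.data$SalePrice as the label gives xgb.DMatrix one label per row.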
Thanks
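train <- dat[train_ind, ]
# Assumes the response is the last column of train
train.y <- train[, ncol(train)]
xgboost(data = data.matrix(train[, -ncol(train)]),  # features without the response
label = train.y,
objective = "reg:linear",
eval_metric = "rmse",
max.depth = 15,
eta = 0.1,
nround = 15,
subsample = 0.5,
colsample_bytree = 0.5,
num_class = 12,  # only meaningful for multi-class objectives; not needed for regression
nthread = 3
)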
Two clues for controlling XGBoost in regression:
1) eta: a large eta makes the model more prone to overfitting; a small eta is more conservative but needs more rounds to fit.
2) eval_metric: RMSE is not very informative when the quantitative dependent variable contains outliers. Check whether XGBoost supports a Huber loss function, or whether you can supply your own evaluation function.
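To see how eta and the number of rounds interact, here is a back-of-the-envelope sketch. It assumes squared-error loss and an idealised learner that removes exactly a fraction eta of the remaining residual each round; real trees will not behave this cleanly. The values eta = 0.05 and nrounds = 25 are taken from the question.

```r
eta     <- 0.05
nrounds <- 25

# Idealised case: each round removes a fraction eta of the remaining residual
remaining <- (1 - eta)^nrounds
remaining   # share of the initial residual still left after nrounds

# Rounds needed to bring the remaining share under 1% at this eta
rounds_needed <- ceiling(log(0.01) / log(1 - eta))
rounds_needed
```

Under these assumptions more than a quarter of the initial residual is still unexplained after 25 rounds, which would be one possible reason for a large training RMSE when a small eta is combined with few rounds.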

Evaluate stm Model

I'm working on an STM model (topic modelling) and I'd like to evaluate and verify the model, but I'm not sure how to do it. My code is:
Corpus.STM <- readCorpus(dtm, type = "slam")
Model choice:
BestM1. <- searchK(Corpus.STM$documents, Corpus.STM$vocab, K=c(10,20, 30, 40, 50, 60), proportion = .4, heldout.seed = 1, prevalence=~ cvJahr+ cvDienstgrad+ cvLand, data=Jahr.Land )
BestM2. <- searchK(Corpus.STM$documents, Corpus.STM$vocab, K=c(85,110), proportion = .4, heldout.seed = 1, prevalence=~ cvJahr+ cvDienstgrad+ cvLand, data=Jahr.Land )
BestM3. <- searchK(Corpus.STM$documents, Corpus.STM$vocab, K=c(20,21,22,23,24,25,26,27,28,29,30), proportion = .4, heldout.seed = 1, prevalence=~ cvJahr+ cvDienstgrad+ cvLand, data=Jahr.Land )
str(BestM1.)
plot.searchK(BestM1.)
plot.searchK(BestM2.)
plot.searchK(BestM3.)
#27 seems to be a good choice
#Heldout
set.seed(1)
heldout<- make.heldout(Corpus.STM$documents, Corpus.STM$vocab, proportion = .5,seed = 1)
stm.mod1 <- stm(heldout$documents, heldout$vocab, K =27, seed = 1, init.type = "Spectral", max.em.its = 100 )
heldout.evaluation <- eval.heldout(stm.mod1, heldout$missing)
heldout.evaluation
#evaluation heldout
labelTopics(stm.mod1)
plot.STM(stm.mod1, type="labels", n=5, frexweight = 0.25)
cloud(stm.mod1, topic=5)
plot.STM(stm.mod1, type="summary", labeltype="frex", topics=c(1:5), n=8)
I'm not sure how to interpret the output of "eval.heldout". Additionally, I want to make sure that the model doesn't overfit, but I'm not sure how to check that.
eval.heldout() calculates the held-out log-likelihood using document completion. The number you want is the heldout.evaluation$expected.heldout which is the average of the held-out log-likelihood values for each document. Unfortunately there is no unambiguous measure of whether or not the model is "overfit." The plot.searchK() call you have will give you a plot of the held-out log-likelihood over different values of K and certainly if that number is decreasing as K goes up one explanation is overfitting.
Sorry to not have a clearer answer but unfortunately there are no hard and fast rules here.
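As a rough illustration of reading held-out likelihood across K, the sketch below uses made-up numbers (not output from searchK) and simply locates the peak and the values of K after which the held-out log-likelihood declines:

```r
# Made-up held-out log-likelihoods, one per candidate K (illustration only)
K       <- c(10, 20, 30, 40, 50, 60)
heldout <- c(-7.45, -7.31, -7.26, -7.28, -7.33, -7.40)

# K with the highest held-out log-likelihood
best_K <- K[which.max(heldout)]
best_K

# TRUE where the held-out likelihood drops relative to the previous K
declining <- c(FALSE, diff(heldout) < 0)
K[declining]
```

A sustained decline past the peak is the pattern described above as one possible sign of overfitting; it is a heuristic, not a test.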
