I'm trying to use XGBoost as a replacement for gbm.
The scores I'm getting are rather odd, so I'm thinking maybe I'm doing something wrong in my code.
My data contains several factor variables; all the others are numeric.
The response variable is a continuous variable indicating house price.
I understand that in order to use XGBoost, I need to apply One Hot Encoding to the factor variables. I'm doing so with the following code:
Xtest <- test.data
Xtrain <- train.data
XSalePrice <- Xtrain$SalePrice
Xtrain$SalePrice <- NULL
# Combine data
Xall <- data.frame(rbind(Xtrain, Xtest))
# Get categorical features names
ohe_vars <- names(Xall)[which(sapply(Xall, is.factor))]
# Convert them
dummies <- dummyVars(~., data = Xall)
Xall_ohe <- as.data.frame(predict(dummies, newdata = Xall))
# Replace factor variables in data with OHE
Xall <- cbind(Xall[, -c(which(colnames(Xall) %in% ohe_vars))], Xall_ohe)
After that, I split the data back into the train and test sets:
Xtrain <- Xall[1:nrow(train.data), ]
Xtest <- Xall[-(1:nrow(train.data)), ]
Then I build a model and print the RMSE and R-squared:
# Model
xgb.fit <- xgboost(data = data.matrix(Xtrain), label = XSalePrice,
booster = "gbtree", objective = "reg:linear",
colsample_bytree = 0.2, gamma = 0.0,
learning_rate = 0.05, max_depth = 6,
min_child_weight = 1.5, n_estimators = 7300,
reg_alpha = 0.9, reg_lambda = 0.5,
subsample = 0.2, seed = 42,
silent = 1, nrounds = 25)
xgb.pred <- predict(xgb.fit, data.matrix(Xtrain))
postResample(xgb.pred, XSalePrice)
The problem is that I'm getting RMSE and R-squared values that are very far off:
RMSE Rsquared
1.877639e+05 5.308910e-01
These are VERY far from the results I get when using gbm.
I think I'm doing something wrong; my best guess is that it's in the One Hot Encoding phase, which I'm unfamiliar with, so I used code I found online and adjusted it to my data.
Can someone point out what I'm doing wrong and how to fix it?
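For reference, here is a minimal sketch (not part of my original run; the parameter values are just the ones used above) of getting a cross-validated RMSE with xgb.cv, which is easier to compare with gbm's cross-validation error than a training-set fit:
dtrain_cv <- xgb.DMatrix(data = data.matrix(Xtrain), label = XSalePrice)
cv_res <- xgb.cv(params = list(objective = "reg:linear", eta = 0.05,
                               max_depth = 6, min_child_weight = 1.5,
                               subsample = 0.2, colsample_bytree = 0.2),
                 data = dtrain_cv, nrounds = 500, nfold = 5,
                 early_stopping_rounds = 20, verbose = 0)
# held-out RMSE at the best round
cv_res$evaluation_log[cv_res$best_iteration, ]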
UPDATE:
After reviewing #Codutie's answer, my code still has some errors:
Xtrain <- sparse.model.matrix(SalePrice ~. , data = train.data)
XDtrain <- xgb.DMatrix(data = Xtrain, label = "SalePrice")
xgb.DMatrix produces:
Error in setinfo.xgb.DMatrix(dmat, names(p), p[[1]]) :
The length of labels must equal to the number of rows in the input data
train.data is a data frame and it has 1453 rows. The label SalePrice also contains 1453 values (no missing values).
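One likely cause (a guess on my part): xgb.DMatrix expects label to be the numeric response vector itself, not the column name as a string, e.g.:
XDtrain <- xgb.DMatrix(data = Xtrain, label = train.data$SalePrice)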
Thanks
train <- dat[train_ind, ]
train.y <- train[, ncol(train)]   # assuming the response is in the last column
xgboost(data = data.matrix(train[, -ncol(train)]),
        label = train.y,
        objective = "reg:linear",
        eval_metric = "rmse",
        max.depth = 15,
        eta = 0.1,
        nround = 15,
        subsample = 0.5,
        colsample_bytree = 0.5,
        nthread = 3
)
Two clues for controlling XGBoost for regression:
1) eta: if eta is small, the model tends to overfit.
2) eval_metric: I'm not sure whether xgb allows users to supply their own eval_metric, but RMSE is not useful when the quantitative dependent variable contains outliers. Check whether XGBoost supports the Huber loss function.
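On the second point: the R package does let you pass a custom objective (obj) and a custom evaluation metric (feval) to xgb.train(). Below is a minimal sketch (my own illustration, assuming a DMatrix called dtrain already exists) using the pseudo-Huber loss (delta = 1) as a more outlier-robust alternative to squared error:
# pseudo-Huber objective (delta = 1): less sensitive to outliers than squared error
pseudo_huber_obj <- function(preds, dtrain) {
  d <- preds - getinfo(dtrain, "label")
  scale <- 1 + d^2
  list(grad = d / sqrt(scale), hess = 1 / (scale * sqrt(scale)))
}
# mean absolute error as a custom evaluation metric
mae_feval <- function(preds, dtrain) {
  list(metric = "mae", value = mean(abs(preds - getinfo(dtrain, "label"))))
}
fit <- xgb.train(params = list(eta = 0.1, max_depth = 6),
                 data = dtrain, nrounds = 100,
                 obj = pseudo_huber_obj, feval = mae_feval)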
UPDATE
To help anyone looking for a similar answer to this question: I was able to increase the AUC by balancing the dataset. I did this by making the following edit to the code:
history <- model %>%
  fit(train_nn,
      train_target, # when using OHE this becomes train_label
      epochs = 100,
      batch_size = 32,
      validation_split = 0.10,
      class_weight = list("0" = nrow(dataset[dataset[, 134] == 1, ]) /
                                nrow(dataset[dataset[, 134] == 0, ]),
                          "1" = 1))
End of Update
I am currently studying biases in the predictions of neural network models. Using data from the fintech company Bondora, I am attempting to create an MLP model to predict loan acceptance. The dataset contains multiple categorical and numerical variables. I created a categorical variable called "reject_loan" (which serves as my target variable) that is 1 if a loan defaults within 1 year of origination and 0 otherwise. Now I am attempting to create an MLP model to predict "reject_loan".
Problem: even though accuracy and validation accuracy are both high (around 83% and around 90% respectively), predictions on the test data are very poor. The model usually predicts only one class for all observations OR is able to make only very few correct predictions of the other class. The AUC always hovers close to 50%.
I have tried a variety of approaches in pre-processing and in model parameters. Some of the major approaches are below:
Using OHE for all categorical variables (including the target), normalizing the numerical vars, and then using relu activation for the hidden layers, softmax for the output, and categorical cross-entropy as the loss function
No OHE, normalizing the numerical vars, and then using relu activation for the hidden layers, sigmoid for the output, and binary cross-entropy as the loss function
Using elu activation for the hidden layers to avoid the dying-ReLU problem
Using multiple hidden layers, with and without regularizers (l1 and l2)
Using dropout
Using SGD and Adam as optimizers (i.e. either SGD or Adam)
Decreasing the learning rate (the lowest used is 0.000001)
Nothing has worked to improve the predictions. I should also mention that I have trained an XGBoost model on the same dataset with an AUC of around 90% (ROC curve and AUC from one of the runs not shown here).
I would very much appreciate it if someone could help me with this issue.
My model code is below:
#divide into train and test
set.seed(1234)
#dividing cons into an 80:20 train:test sample
sample <- sample(2, nrow(dataset), replace = TRUE, prob = c(0.80, 0.20))
train <- dataset[sample == 1, 1:(ncol(dataset) - 1)]
test <- dataset[sample == 2, 1:(ncol(dataset) - 1)]
train_target <- dataset[sample == 1, ncol(dataset)]
test_target <- dataset[sample == 2, ncol(dataset)]
#One hot encoding
train_label <- to_categorical(train_target)
test_label <- to_categorical(test_target)
#Create sequential model
model <- keras_model_sequential()
model %>%
layer_dense(units = 16,
activation = 'elu',
input_shape = c(ncol(train_nn)),
kernel_regularizer = regularizer_l1_l2(l1 = 0.2, l2 = 0.2)) %>%
layer_dropout(0.2) %>%
layer_dense(units = 8,
activation = 'elu',
kernel_regularizer = regularizer_l1_l2(l1 = 0.2, l2 = 0.2)) %>%
layer_dropout(0.4) %>%
layer_dense(units = 8,
activation = 'elu') %>%
layer_dense(units = 1,
activation = 'sigmoid') # sigmoid in this iteration; I have also used softmax with an OHE target and units = 2
#compile
opt = optimizer_sgd(lr = 0.001,
momentum = 0,
decay = 0,
nesterov = FALSE,
clipnorm = NULL,
clipvalue = NULL
)
opt2 = optimizer_adam(lr = 0.000001,
beta_1 = 0.9,
beta_2 = 0.999,
epsilon = NULL,
decay = 0,
amsgrad = FALSE,
clipnorm = NULL,
clipvalue = NULL
)
model %>%
compile(loss = 'binary_crossentropy', # also used categorical_crossentropy in some iterations
optimizer = opt2,
metrics = 'accuracy')
#Fit model
clbck = callback_reduce_lr_on_plateau(monitor='val_loss', factor=0.1, patience=2)
history <- model %>%
fit(train_nn,
train_target, #when using OHE becomes train_label
epochs = 100,
batch_size = 32,
validation_split = 0.10)
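# Note (an assumption, not part of the original run): the reduce-LR callback defined
# above only takes effect if it is passed to fit(), e.g.
#   fit(train_nn, train_target, epochs = 100, batch_size = 32,
#       validation_split = 0.10, callbacks = list(clbck))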
#Evaluate model with test data
nn_model_3 <- model %>% evaluate(test, test_target) #When using OHE this becomes test_label
#Prediction & confusion matrix - test data
prob <- model %>%
predict_proba(test)
pred <- model %>%
predict_classes(test)
nn_conf_table_3 <- table(Predicted = pred, Actual = test_target)
nn_probability_table_3 <- cbind (prob, pred, test_target)
#auc AND roc
par(pty = "s")
nn_roc_3 <- roc(test_target, pred, plot=T, percent=T, lwd = 3, print.auc=T)
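As a side note (my own suggestion, not something from the original analysis): computing the ROC curve on the predicted probabilities rather than on the hard class predictions usually gives a more informative AUC, since 0/1 predictions collapse the curve to a single point:
nn_roc_prob_3 <- roc(test_target, as.numeric(prob), plot = TRUE, percent = TRUE,
                     lwd = 3, print.auc = TRUE)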
This is with reference to this answer on the implementation of Bayesian Optimization. I am unable to understand the following R code, which defines a function xgb.cv.bayes():
xgb.cv.bayes <- function(max.depth, min_child_weight, subsample, colsample_bytree, gamma) {
  cv <- xgb.cv(params = list(booster = 'gbtree', eta = 0.05,
                             max_depth = max.depth,
                             min_child_weight = min_child_weight,
                             subsample = subsample,
                             colsample_bytree = colsample_bytree,
                             gamma = gamma,
                             lambda = 1, alpha = 0,
                             objective = 'binary:logistic',
                             eval_metric = 'auc'),
               data = data.matrix(df.train[, -target.var]),
               label = as.matrix(df.train[, target.var]),
               nround = 500, folds = cv_folds, prediction = TRUE,
               showsd = TRUE, early.stop.round = 5, maximize = TRUE,
               verbose = 0)
  list(Score = cv$dt[, max(test.auc.mean)],
       Pred = cv$pred)
}
I am unable to understand the following part of the code, which comes after the closing parenthesis of xgb.cv():
list(Score = cv$dt[, max(test.auc.mean)],
Pred = cv$pred)
Or very briefly, I do not understand the following syntax:
xgb.cv.bayes <- function(max.depth, min_child_weight, subsample, colsample_bytree, gamma){
cv <- xgb.cv(...)list(...)
}
I would be grateful for help understanding this R syntax, and for pointers to where I can find more examples of it.
In R, the value of the last expression in a function is automatically the return value of that function. So the function you presented has exactly two steps:
1) compute the result of xgb.cv(...) and store the result in a variable cv
2) create a list with two entries (Score and Pred) whose values are extracted from cv
Since the expression that creates the list is the last expression in the function, the list is automatically the return value. So if you execute test <- xgb.cv.bayes(...), you can then access test$Score and test$Pred.
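A tiny standalone example of the same pattern, just for illustration:
make_stats <- function(x) {
  m <- mean(x)                    # step 1: compute something and store it
  list(Score = max(x), Pred = m)  # step 2: last expression, so this list is returned
}
out <- make_stats(c(1, 5, 3))
out$Score  # 5
out$Pred   # 3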
Does this answer your question?
I encountered the known issue of not being able to save an xgboost model and load it later to obtain predictions; this was supposedly fixed in h2o 3.18 (the problem was in 3.16). I updated the package from h2o's website (downloadable zip), and now the model that previously had no problem gives the following error:
Error in .h2o.doSafeREST(h2oRestApiVersion = h2oRestApiVersion, urlSuffix = urlSuffix, :
Unexpected CURL error: Failed to connect to localhost port 54321: Connection refused
This only happens in the case of xgboost (binary classification); other models I use work fine. Of course h2o is initialised, and a previous model estimates without problems. Does anyone have any idea what the issue could be?
EDIT: Here is a reproducible example (based on Erin's answer) that produces the error:
library(h2o)
library(caret)
library(dplyr) # needed for %>% and mutate() in "version 1" below
h2o.init()
# Import a sample binary outcome train set into H2O
train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)
# Assigning fold column
set.seed(1)
cv_folds <- createFolds(as.data.frame(train)$response,
k = 5,
list = FALSE,
returnTrain = FALSE)
# version 1
train <- train %>%
as.data.frame() %>%
mutate(fold_assignment = cv_folds) %>%
as.h2o()
# version 2
train <- h2o.cbind(train, as.h2o(cv_folds))
names(train)[dim(train)[2]] <- c("fold_assignment")
# For binary classification, response should be a factor
train[,y] <- as.factor(train[,y])
xgb <- h2o.xgboost(x = x,
y = y,
seed = 1,
training_frame = train,
fold_column = "fold_assignment",
keep_cross_validation_predictions = TRUE,
eta = 0.01,
max_depth = 3,
sample_rate = 0.8,
col_sample_rate = 0.6,
ntrees = 500,
reg_lambda = 0,
reg_alpha = 1000,
distribution = 'bernoulli')
Both versions of creating the train data.frame result in the same error.
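For completeness, here is a minimal sketch (assumed usage, not code from my original session) of the save/load round trip involved; h2o.saveModel() writes the model to disk and h2o.loadModel() reads it back into a running cluster:
model_path <- h2o.saveModel(xgb, path = "saved_models", force = TRUE)
# ... later, possibly after h2o.init() in a new session ...
xgb_reloaded <- h2o.loadModel(model_path)
preds <- h2o.predict(xgb_reloaded, newdata = train)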
You didn't say whether you have re-trained the models using 3.18. In general, H2O only guarantees model compatibility between major version of H2O. If you have not retrained the models, that's probably the reason that XGBoost is not working properly. If you have re-trained the models with 3.18 and XGBoost is still not working, then please post a reproducible example and we will check it out further.
EDIT:
I am adding a reproducible example (the only difference between your code and this code is that I am not using fold_column here). This runs fine on 3.18.0.2. Without a reproducible example that produces an error, I can't help you any further.
library(h2o)
h2o.init()
# Import a sample binary outcome train set into H2O
train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)
# For binary classification, response should be a factor
train[,y] <- as.factor(train[,y])
xgb <- h2o.xgboost(x = x,
y = y,
seed = 1,
training_frame = train,
keep_cross_validation_predictions = TRUE,
eta = 0.01,
max_depth = 3,
sample_rate = 0.8,
col_sample_rate = 0.6,
ntrees = 500,
reg_lambda = 0,
reg_alpha = 1000,
distribution = 'bernoulli')
Taking a cue from the xgboost xgb.dump tree coefficient question.
I specifically want to know: if eta = 0.1 or 0.01, how will the probability calculation differ from the answer provided?
I want to do predictions using the tree dump.
My code is
#Define train label and feature frames/matrix
y <- train_data$esc_ind
train_data = as.matrix(train_data)
trainX <- as.matrix(train_data[,-1])
param <- list("objective" = "binary:logistic",
"eval_metric" = "logloss",
"eta" = 0.5,
"max_depth" = 2,
"colsample_bytree" = .8,
"subsample" = 0.8, #0.75
"alpha" = 1
)
#Train XGBoost
bst = xgboost(param=param, data = trainX, label = y, nrounds=2)
trainX1 = data.frame(trainX)
mpg.fmap = genFMap(trainX1, "xgboost.fmap")
xgb.save(bst, "xgboost.model")
xgb.dump(bst, "xgboost.model_6.txt",with.stats = TRUE, fmap = "xgboost.fmap")
The tree looks like:
booster[0]
0:[order.1<12.2496] yes=1,no=2,missing=2,gain=1359.61,cover=7215.25
1:[access.1<0.196687] yes=3,no=4,missing=4,gain=3.19685,cover=103.25
3:leaf=-0,cover=1
4:leaf=0.898305,cover=102.25
2:[team<6.46722] yes=5,no=6,missing=6,gain=753.317,cover=7112
5:leaf=0.893333,cover=55.25
6:leaf=-0.943396,cover=7056.75
booster[1]
0:[issu.1<6.4512] yes=1,no=2,missing=2,gain=794.308,cover=5836.81
1:[team<3.23361] yes=3,no=4,missing=4,gain=18.6294,cover=67.9586
3:leaf=0.609363,cover=21.4575
4:leaf=1.28181,cover=46.5012
2:[case<6.74709] yes=5,no=6,missing=6,gain=508.34,cover=5768.85
5:leaf=1.15253,cover=39.2126
6:leaf=-0.629773,cover=5729.64
Will the coefficient for all tree leaf scores for xgboost be 1 when eta is chosen less than 1?
Actually this was something practical that I had overlooked earlier.
Using the above tree structure, one can find the probability for each training example.
The parameter list was:
param <- list("objective" = "binary:logistic",
"eval_metric" = "logloss",
"eta" = 0.5,
"max_depth" = 2,
"colsample_bytree" = .8,
"subsample" = 0.8,
"alpha" = 1)
For an instance that ends up in booster[0], leaf 3, the probability will be exp(-0)/(1 + exp(-0)).
For an instance that ends up in booster[0], leaf 3 plus booster[1], leaf 3, the probability will be exp(0 + 0.609363)/(1 + exp(0 + 0.609363)).
And so on, as one increases the number of iterations.
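For concreteness, a minimal sketch of that calculation (illustrative only, using the leaf values from the dump above):
leaf_scores <- c(0, 0.609363)            # booster[0] leaf 3 (-0) and booster[1] leaf 3
margin <- sum(leaf_scores)
prob <- exp(margin) / (1 + exp(margin))  # same as 1 / (1 + exp(-margin))
prob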
I matched these values against R's predicted probabilities; they differ by about 10^(-7), probably due to floating-point truncation of the leaf quality scores.
This approach can give a production-level solution when boosted trees trained in R are used for prediction in a different environment.
Any comment on this will be highly appreciated.
I'm having a lot of trouble figuring out how to correctly set the num_classes for xgboost.
I've got an example using the Iris data
df <- iris
y <- df$Species
num.class = length(levels(y))
levels(y) = 1:num.class
head(y)
df <- df[,1:4]
y <- as.matrix(y)
df <- as.matrix(df)
param <- list("objective" = "multi:softprob",
"num_class" = 3,
"eval_metric" = "mlogloss",
"nthread" = 8,
"max_depth" = 16,
"eta" = 0.3,
"gamma" = 0,
"subsample" = 1,
"colsample_bytree" = 1,
"min_child_weight" = 12)
model <- xgboost(param=param, data=df, label=y, nrounds=20)
This returns an error
Error in xgb.iter.update(bst$handle, dtrain, i - 1, obj) :
SoftmaxMultiClassObj: label must be in [0, num_class), num_class=3 but found 3 in label
If I change the num_class to 2 I get the same error. If I increase the num_class to 4 then the model runs, but I get 600 predicted probabilities back, which makes sense for 4 classes.
I'm not sure if I'm making an error or whether I'm failing to understand how xgboost works. Any help would be appreciated.
label must be in [0, num_class)
In your script, add y <- y - 1 before model <- ...
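Applied to the iris example above, a sketch of one way to do it (the key point being that multi:softprob expects integer labels in 0 .. num_class - 1):
y <- as.integer(iris$Species) - 1   # labels become 0, 1, 2
model <- xgboost(param = param, data = df, label = y, nrounds = 20)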
I ran into this rather weird problem as well. In my case it seemed to be the result of not properly encoding the labels.
First, using a string vector with N classes as the labels, I could only get the algorithm to run by setting num_class = N + 1. However, this result was useless, because I only had N actual classes and N+1 buckets of predicted probabilities.
I re-encoded the labels as integers and then num_class worked fine when set to N.
# Convert classes to integers for xgboost
library(data.table) # for data.table()
class <- data.table(interest_level = c("low", "medium", "high"), class = c(0, 1, 2))
t1 <- merge(t1, class, by = "interest_level", all.x = TRUE, sort = FALSE)
and then the parameter list, for example:
param <- list(booster="gbtree",
objective="multi:softprob",
eval_metric="mlogloss",
#nthread=13,
num_class=3,
eta_decay = .99,
eta = .005,
gamma = 1,
max_depth = 4,
min_child_weight = .9,#1,
subsample = .7,
colsample_bytree = .5
)
I was seeing the same error; my issue was that I was using an eval_metric that is only meant to be used for multiclass labels when my data had binary labels. See eval_metric in the Learning Task Parameters section of the XGBoost docs for a list of all of the options.
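A small illustration (my own example, not taken from the docs) of matching eval_metric to the label type:
# binary labels (0/1)
params_binary <- list(objective = "binary:logistic", eval_metric = "auc")
# multiclass labels (0 .. num_class - 1)
params_multi <- list(objective = "multi:softprob", eval_metric = "mlogloss",
                     num_class = 3)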
I had this problem, and it turned out that I was trying to subtract 1 from my response variable, which was already coded as 0 and 1. Probably a novice mistake, but in case anyone else runs into this with a binary response variable that is already 0 and 1, it is something to make note of.
The tutorial said:
label = as.integer(iris$Species) - 1
What worked for me (the response is high_end):
label = as.integer(high_end)