Can't pass xgb.DMatrix to caret - r

I am trying to tune the hyperparameters of xgboost for a classification problem using the caret library. Because my data set contains many factors and xgboost expects numerical data, I created dummy columns using feature hashing. But when I run caret's train, I get an error.
#Using Feature hashing to convert all the factor variables to dummies
objTrain_hashed = hashed.model.matrix(~., data=train1[,-27], hash.size=2^15, transpose=FALSE)
#created a dense matrix which is normally accepted by xgboost method in R
#Hoping I could pass it caret as well
dmodel <- xgb.DMatrix(objTrain_hashed[, ], label = train1$Walc)
xgb_grid_1 = expand.grid(
nrounds = 500,
max_depth = c(5, 10, 15),
eta = c(0.01, 0.001, 0.0001),
gamma = c(1, 2, 3),
colsample_bytree = c(0.4, 0.7, 1.0),
min_child_weight = c(0.5, 1, 1.5)
)
xgb_trcontrol_1 = trainControl(
method = "cv",
number = 3,
verboseIter = TRUE,
returnData = FALSE,
returnResamp = "all", # save losses across all models
classProbs = TRUE, # set to TRUE for AUC to be computed
summaryFunction = twoClassSummary,
allowParallel = TRUE
)
xgb_train1 <- train(Walc ~.,dmodel,method = 'xgbTree',trControl = xgb_trcontrol_1,
metric = 'accuracy',tunegrid = xgb_grid_1)
I am getting the following error
Error in as.data.frame.default(data) :
cannot coerce class "xgb.DMatrix" to a data.frame
Any suggestions on how I can proceed?

This happens because you pass dmodel, which is an xgb.DMatrix, to train() in the last part of your code. Try passing objTrain_hashed instead, which is a matrix and not an xgb.DMatrix.

How about using sparse.model.matrix() instead of hashed.model.matrix()? It works on my PC.
Don't convert the result to an xgb.DMatrix; pass the sparse.model.matrix() output to train() as it is, like this:
model_data <- sparse.model.matrix(Y~., raw_data)
and
xgb_train1 <- train(Y ~.,model_data, <bla bla> ...)
Hope it works. Thank you.
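As a rough illustration of what sparse.model.matrix() produces, here is a minimal sketch on a made-up data frame. The names raw_data, Y, colour and size are hypothetical and not the asker's data. It uses the Matrix package, which is where sparse.model.matrix() lives. Note that the response column is not part of the resulting matrix, so the labels probably need to be supplied separately, for example through train's x/y interface. That call is left as a comment because it was not run here.

```r
library(Matrix)

# Hypothetical toy data: one response, one factor, one numeric predictor
raw_data <- data.frame(
  Y      = factor(c("no", "yes", "no", "yes", "no", "yes")),
  colour = factor(c("red", "green", "blue", "red", "green", "blue")),
  size   = c(1.5, 2.0, 0.5, 3.0, 2.5, 1.0)
)

# Expand the factor into dummy columns; the result is a sparse matrix
model_data <- sparse.model.matrix(Y ~ ., data = raw_data)

class(model_data)      # a sparse dgCMatrix, not an xgb.DMatrix
colnames(model_data)   # intercept, two colour dummies, size (no Y column)

# Assumed follow-up (not run here): pass predictors and labels separately
# xgb_train1 <- caret::train(x = model_data, y = raw_data$Y, method = "xgbTree")
```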

Related

Multiclass Classification using expand.grid

So far I have built many classification models using the "caret" package. This library allows me to find the best parameters for XGBoost by using expand.grid and trying all possible combinations of some parameters, as shown in the example below.
trControl = trainControl(
method = 'cv',
number = 3,
returnData=F,
classProbs = TRUE,
verboseIter = TRUE,
allowParallel = TRUE)
tuneGridXGB <- expand.grid(
nrounds=c(10, 50, 100, 200, 350, 500),
max_depth = c(2,4),
eta = c(0.005, 0.01, 0.05, 0.1),
gamma = c(0,2,4),
colsample_bytree = c(0.75),
subsample = c(0.50),
min_child_weight = c(0,2,4))
xgbmod_classif_bin <- train(
x=eg_Train_mat,
y= y_train_target,
method = "xgbTree",
metric = "auc",
reg_lambda=0.7,
scale_pos_weight=1.6,
nthread = 4,
trControl = trControl,
tuneGrid = tuneGridXGB,
verbose=T)
For the first time I have a multiclass classification problem (with 9 classes) to deal with, but I don't seem to be able to use anything like "multi:softprob" (as I would do with the xgboost package - see below).
param=list(objective="multi:softprob",
num_class=9,
eta=0.005,
max.depth=4,
min_child_weight=2,
gamma=6,
eval_metric ="merror",
nthread=4,
booster = "gbtree",
lambda=1.8,
subsample=0.8,
alpha=6,
colsample_bytree=0.5,
scale_pos_weight=1.6,
verbosity=3
)
bst=xgboost(params = param,
data = eg_Train_mat,
nrounds = 15)
Any idea of how to try many parameters using a grid, possibly using the caret package, for a multiclass classification problem?
Thanks

Tuning XGboost parameters Using Caret - Error: The tuning parameter grid should have columns

I am using caret for modeling using "xgboost"
1- However, I get the following error:
"Error: The tuning parameter grid should have columns nrounds,
max_depth, eta, gamma, colsample_bytree, min_child_weight, subsample"
The code:
library(caret)
library(doParallel)
library(dplyr)
library(pROC)
library(xgboost)
## Create train/test indexes
## preserve class indices
set.seed(42)
my_folds <- createFolds(train_churn$churn, k = 10)
# Compare class distribution
i <- my_folds$Fold1
table(train_churn$churn[i]) / length(i)
my_control <- trainControl(
summaryFunction = twoClassSummary,
classProbs = TRUE,
verboseIter = TRUE,
savePredictions = TRUE,
index = my_folds
)
my_grid <- expand.grid(nrounds = 500,
max_depth = 7,
eta = 0.1,
gammma = 1,
colsample_bytree = 1,
min_child_weight = 100,
subsample = 1)
set.seed(42)
model_xgb <- train(
class ~ ., data = train_churn,
metric = "ROC",
method = "xgbTree",
trControl = my_control,
tuneGrid = my_grid)
2- I also want to get a prediction made by averaging the predictions made by using the model fitted for each fold.
I know it's a tad late, but check the spelling of gamma in your grid of tuning parameters. You misspelled it as gammma (with three m's).
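A quick way to spot this kind of typo is to compare the grid's column names against the names listed in the error message. The following is a minimal base R sketch; the vector required is simply copied from the error text in the question, and missing_cols and extra_cols are illustrative names.

```r
# Columns named in the error message from the question
required <- c("nrounds", "max_depth", "eta", "gamma",
              "colsample_bytree", "min_child_weight", "subsample")

# The grid from the question, including the misspelled "gammma"
my_grid <- expand.grid(nrounds = 500, max_depth = 7, eta = 0.1, gammma = 1,
                       colsample_bytree = 1, min_child_weight = 100,
                       subsample = 1)

missing_cols <- setdiff(required, names(my_grid))  # required but absent
extra_cols   <- setdiff(names(my_grid), required)  # present but not expected

print(missing_cols)  # "gamma"
print(extra_cols)    # "gammma"
```

Renaming the gammma column to gamma makes both sets empty.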

Combining train + test data and running cross validation in R

I have the following R code that runs a simple xgboost model on a set of training and test data with the intention of predicting a binary outcome.
We start by
1) Reading in the relevant libraries.
library(xgboost)
library(readr)
library(caret)
2) Cleaning up the training and test data
train.raw = read.csv("train_data", header = TRUE, sep = ",")
drop = c('column')
train.df = train.raw[, !(names(train.raw) %in% drop)]
train.df[,'outcome'] = as.factor(train.df[,'outcome'])
test.raw = read.csv("test_data", header = TRUE, sep = ",")
drop = c('column')
test.df = test.raw[, !(names(test.raw) %in% drop)]
test.df[,'outcome'] = as.factor(test.df[,'outcome'])
train.c1 = subset(train.df , outcome == 1)
train.c0 = subset(train.df , outcome == 0)
3) Running XGBoost on the properly formatted data.
train_xgb = xgb.DMatrix(data.matrix(train.df [,1:124]), label = train.raw[, "outcome"])
test_xgb = xgb.DMatrix(data.matrix(test.df[,1:124]))
4) Running the model
model_xgb = xgboost(data = train_xgb, nrounds = 8, max_depth = 5, eta = .1, eval_metric = "logloss", objective = "binary:logistic", verbose = 5)
5) Making predictions
pred_xgb <- predict(model_xgb, newdata = test_xgb)
My question is: How can I adjust this process so that I'm just pulling in / adjusting a single 'training' data set, and getting predictions on the hold-out sets of the cross-validated file?
To run k-fold CV with xgboost, call xgb.cv with nfold set to an integer. To save the predictions for each resample, pass prediction = TRUE. For instance:
xgboostModelCV <- xgb.cv(data = dtrain,
nrounds = 1688,
nfold = 5,
objective = "binary:logistic",
eval_metric= "auc",
metrics = "auc",
verbose = 1,
print_every_n = 50,
stratified = T,
scale_pos_weight = 2,
max_depth = 6,
eta = 0.01,
gamma=0,
colsample_bytree = 1 ,
min_child_weight = 1,
subsample= 0.5 ,
prediction = T)
xgboostModelCV$pred #contains predictions in the same order as in dtrain.
xgboostModelCV$folds #contains k-fold samples
Here's a decent function to pick hyperparameters:
function(train, seed){
require(xgboost)
ntrees=2000
searchGridSubCol <- expand.grid(subsample = c(0.5, 0.75, 1),
colsample_bytree = c(0.6, 0.8, 1),
gamma=c(0, 1, 2),
eta=c(0.01, 0.03),
max_depth=c(4,6,8,10))
aucErrorsHyperparameters <- apply(searchGridSubCol, 1, function(parameterList){
#Extract Parameters to test
currentSubsampleRate <- parameterList[["subsample"]]
currentColsampleRate <- parameterList[["colsample_bytree"]]
currentGamma <- parameterList[["gamma"]]
currentEta =parameterList[["eta"]]
currentMaxDepth =parameterList[["max_depth"]]
set.seed(seed)
xgboostModelCV <- xgb.cv(data = train,
nrounds = ntrees,
nfold = 5,
objective = "binary:logistic",
eval_metric= "auc",
metrics = "auc",
verbose = 1,
print_every_n = 50,
early_stopping_rounds = 200,
stratified = T,
scale_pos_weight=sum(all_data_nobad[index_no_bad,1]==0)/sum(all_data_nobad[index_no_bad,1]==1),
max_depth = currentMaxDepth,
eta = currentEta,
gamma=currentGamma,
colsample_bytree = currentColsampleRate,
min_child_weight = 1,
subsample= currentSubsampleRate)
xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)
#Save the test AUC at the best iteration
auc=xvalidationScores[xvalidationScores$iter==xgboostModelCV$best_iteration,c(1,4,5)]
auc=cbind(auc, currentSubsampleRate, currentColsampleRate, currentGamma, currentEta, currentMaxDepth)
names(auc)=c("iter", "test.auc.mean", "test.auc.std", "subsample", "colsample", "gamma", "eta", "max.depth")
print(auc)
return(auc)
})
return(aucErrorsHyperparameters)
}
You can change the grid values and the parameters in the grid, as well as the loss/evaluation metric. This is similar to what caret's grid search provides, but caret does not let you include hyperparameters such as alpha, lambda, colsample_bylevel or num_parallel_tree in the grid search unless you define a custom function, which I found cumbersome. caret does have the advantage of automatic preprocessing, automatic up/down sampling within CV, etc.
Setting the seed outside the xgb.cv call will pick the same folds for CV but not the same trees at each round, so you will end up with a different model. Even if you set the seed inside the xgb.cv function call, there is no guarantee you will end up with the same model, but there is a much higher chance (it depends on threads, type of model, and so on). I for one like the uncertainty and found it to have little impact on the result.
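To make the grid-search pattern easier to follow, here is a minimal base R sketch of the same loop: one evaluation per row of expand.grid, results collected into a single data frame, and the best row selected at the end. The function toy_score is a hypothetical stand-in for the cross-validated AUC that xgb.cv would return in the function above; it is not part of xgboost.

```r
# Hypothetical stand-in for the cross-validated score; in the function above
# this value would come from the xgb.cv evaluation log instead.
toy_score <- function(subsample, eta, max_depth) {
  1 - abs(subsample - 0.75) - abs(eta - 0.03) - abs(max_depth - 6) / 100
}

grid <- expand.grid(subsample = c(0.5, 0.75, 1),
                    eta = c(0.01, 0.03),
                    max_depth = c(4, 6, 8))

# Evaluate every row and keep one result row per parameter combination
results <- do.call(rbind, lapply(seq_len(nrow(grid)), function(i) {
  p <- grid[i, ]
  cbind(p, score = toy_score(p$subsample, p$eta, p$max_depth))
}))

# Row with the highest score
best <- results[which.max(results$score), ]
print(best)
```

Collecting the rows with do.call(rbind, ...) keeps the parameter names as columns, which makes sorting or filtering the results straightforward.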
You can use xgb.cv and set prediction = TRUE.

R not valid variable name for caret function

I want to use caret's train function to investigate xgboost results
#open file with train data
trainy <- read.csv('')
# open file with test data
test <- read.csv('')
# we dont need ID column
##### Removing IDs
trainy$ID <- NULL
test.id <- test$ID
test$ID <- NULL
##### Extracting TARGET
trainy.y <- trainy$TARGET
trainy$TARGET <- NULL
# set up the cross-validated hyper-parameter search
xgb_grid_1 = expand.grid(
nrounds = 1000,
eta = c(0.01, 0.001, 0.0001),
max_depth = c(2, 4, 6, 8, 10),
gamma = 1
)
# pack the training control parameters
xgb_trcontrol_1 = trainControl(
method = "cv",
number = 5,
verboseIter = TRUE,
returnData = FALSE,
returnResamp = "all", # save losses across all models
classProbs = TRUE, # set to TRUE for AUC to be computed
summaryFunction = twoClassSummary,
allowParallel = TRUE
)
# train the model for each parameter combination in the grid,
# using CV to evaluate
xgb_train_1 = train(
x = as.matrix(trainy),
y = as.factor(trainy.y),
trControl = xgb_trcontrol_1,
tuneGrid = xgb_grid_1,
method = "xgbTree"
)
I see this error
Error in train.default(x = as.matrix(trainy), y = as.factor(trainy.y), trControl = xgb_trcontrol_1, :
At least one of the class levels is not a valid R variable name;
I have looked at other cases but still can't understand what I should change. R is still quite different from Python for me.
As far as I can see, I should do something with the y class variable, but what and how exactly? Why didn't the as.factor function work?
I solved this issue; I hope it helps other novices.
I needed to transform all the data to factor type, like this:
trainy[] <- lapply(trainy, factor)
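Since the error message refers to the class levels, another possible approach is to rename the levels of the outcome rather than converting the predictors. The sketch below assumes TARGET is coded as 0/1, which the question does not show, and uses base R's make.names() to turn such levels into valid variable names.

```r
# Assumed 0/1 outcome, standing in for trainy.y from the question
trainy.y <- c(0, 1, 1, 0, 1)
y <- as.factor(trainy.y)

levels(y)               # "0" "1": not valid R variable names
make.names(levels(y))   # "X0" "X1"

# Rename the levels so they are syntactically valid
levels(y) <- make.names(levels(y))
print(levels(y))
```

The renamed factor could then be passed as y to train() in place of as.factor(trainy.y).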

How to instruct xgboost in caret to use mlogloss for optimization

I have a multiclass problem. For example, we can take the mtcars dataset and try to predict the number of cylinders, cyl.
data(mtcars)
I want to use xgboost and fit it using the caret package. For that, I create a grid of hyperparameters using
xgb_grid_param = expand.grid(
nrounds = 1000,
eta = c(0.01, 0.001, 0.0001),
max_depth = c(2, 4),
gamma = 0,
colsample_bytree =1,
min_child_weight =1
)
I can create training control parameters as
xgb_tr_ctrl = trainControl(
method = "cv",
number = 5,
repeats =2,
verboseIter = TRUE,
returnData = FALSE,
returnResamp = "all",
allowParallel = TRUE
)
When I then try to run the train function in caret using:
model <- train(factor(cyl)~., data = mtcars, method = "xgbTree",
trControl = xgb_grid_param, tuneGrid=xgb_grid_param)
I get the error:
Error in trControl$classProbs && any(classLevels != make.names(classLevels)) :
invalid 'x' type in 'x && y'
How do I fix this error, and how do I instruct xgbTree to use mlogloss to optimize the learning?
For another method, I was able to solve "invalid 'x' type in 'x && y'" by setting the label attribute as the last column of the data frame / matrix.
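One thing that stands out in the question's call is that trControl is given xgb_grid_param (the tuning grid) rather than xgb_tr_ctrl. A grid built with expand.grid is a data frame with no classProbs element, so trControl$classProbs would be NULL, which may explain the message. The following minimal base R sketch reproduces the same message without caret; msg is an illustrative name.

```r
# The grid from the question is a plain data frame
xgb_grid_param <- expand.grid(nrounds = 1000, eta = c(0.01, 0.001, 0.0001),
                              max_depth = c(2, 4), gamma = 0,
                              colsample_bytree = 1, min_child_weight = 1)

# A data frame has no classProbs column, so this is NULL
is.null(xgb_grid_param$classProbs)

# Using NULL on the left of && fails in the same way as in the question
msg <- tryCatch(xgb_grid_param$classProbs && TRUE,
                error = function(e) conditionMessage(e))
print(msg)
```

If that is the cause, passing trControl = xgb_tr_ctrl in the train() call would be the first thing to try.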
