{caret} xgbTree model not running when weights are included, runs fine without them

I have a dataset from which I have no problem building an xgbTree model without weights, but once I include weights, even if the weights are all just 1, the model doesn't converge. I get the error

Something is wrong; all the RMSE metric values are missing:

and when I print the warnings, the last message is

In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, ... : There were missing values in resampled performance measures.

This is a drive link to the RData file containing the data; it was too big to paste here, and smaller samples didn't always reproduce the error. It contains 3 objects: input_x, input_y, and wts. The last one is just a vector of 1s for now, but ideally it should eventually accept numbers on the interval (0,1). The code I used is shown below; note the comment next to the weights argument that produces the error.
nrounds <- 1000
tune_grid <- expand.grid(
  nrounds = seq(from = 200, to = nrounds, by = 50),
  eta = c(0.025, 0.05, 0.1, 0.3),
  max_depth = c(2, 3, 4, 5),
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1
)
tune_control <- caret::trainControl(
  method = "cv",
  number = 3,
  verboseIter = FALSE,
  allowParallel = TRUE
)
xgb_tune <- caret::train(
  x = input_x,
  y = input_y,
  weights = wts, # If I remove this line, the code works fine. When included, even if the weights are just 1s, it throws an error.
  trControl = tune_control,
  tuneGrid = tune_grid,
  method = "xgbTree",
  verbose = TRUE
)

EDIT 13.10.2021, thanks to @waterpolo:
The correct way to specify weights is via the weights argument to caret::train:
xgb_tune <- caret::train(
  x = input_x,
  y = input_y,
  weights = wts,
  trControl = tune_control,
  tuneGrid = tune_grid,
  method = "xgbTree",
  verbose = TRUE
)
See a more verbose answer here: Non-tree model error when using xgbTree method with Caret and weights to target variable when applying the varImp function.
Old incorrect answer below:
According to the function source, the weights argument is called wts. The relevant line:

if (!is.null(wts))
  xgboost::setinfo(x, 'weight', wts)
Running
xgb_tune <- caret::train(
  x = input_x,
  y = input_y,
  wts = wts,
  trControl = tune_control,
  tuneGrid = tune_grid,
  method = "xgbTree",
  verbose = TRUE
)
should produce the desired result.

Just wanted to add @missuse's response from another post (Non-tree model error when using xgbTree method with Caret and weights to target variable when applying the varImp function). The correct argument is weights.
Code:
xgb_tune <- caret::train(x = input_x,
                         y = input_y,
                         weights = wts,
                         trControl = tune_control,
                         tuneGrid = tune_grid,
                         method = "xgbTree",
                         verbose = TRUE)
The other thing I found was that I needed to use weights > 1, or I would receive the same error message as you; for example, inverse weighting (which gives weights below 1) produced the same message. A rescaling sketch is shown below. Hope this helps.
Thanks @missuse for the lovely response in the other thread!
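A minimal sketch of that rescaling workaround, not from the original answers; it reuses the question's wts, input_x, input_y, tune_grid, and tune_control objects:

# Hypothetical rescaling: divide by the minimum so every weight is >= 1
# while the relative proportions between observations are preserved.
wts_scaled <- wts / min(wts)

xgb_tune <- caret::train(
  x = input_x,
  y = input_y,
  weights = wts_scaled,
  trControl = tune_control,
  tuneGrid = tune_grid,
  method = "xgbTree",
  verbose = TRUE
)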

Related

Caret train() mlp missing values

I'm trying to train a multilayer perceptron for nonlinear regression on a dataset, but it keeps giving me this error:

Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.

I tried doing it again with a built-in R dataset to see if my data was the problem, but I keep getting the error and I have no idea why. I already tried adding or removing tuneGrid, trControl, even linOut. It always gives me constant results for some reason.
library(datasets)
library(MASS)
library(caret)
DP <- caret::createDataPartition(Boston$medv, p = 0.75, list = FALSE)
train <- Boston[DP, ]
test <- Boston[-DP, ]
colnames(train) <- colnames(Boston)
colnames(test) <- colnames(Boston)
mlp <- caret::train(medv ~ ., data = Boston, method = "mlp",
                    trControl = trainControl(method = "cv", number = 3),
                    tuneGrid = expand.grid(size = 1:3), linOut = TRUE, metric = "RMSE")
Yp <- caret::predict.train(mlp, test[, 1:13])
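As a hedged sketch of a possible fix, not from the original thread: the call above fits on the full Boston data rather than the train partition, and RSNNS::mlp is sensitive to unscaled inputs, which can drive it to near-constant predictions and NA resampled metrics. Centering and scaling via caret's preProcess is one plausible remedy:

# Hedged sketch: fit on the train split and let caret scale the
# predictors; unscaled inputs can make RSNNS::mlp output nearly
# constant values, yielding NA Rsquared in resampling.
mlp <- caret::train(medv ~ ., data = train, method = "mlp",
                    preProcess = c("center", "scale"),
                    trControl = trainControl(method = "cv", number = 3),
                    tuneGrid = expand.grid(size = 1:3),
                    linOut = TRUE, metric = "RMSE")
Yp <- predict(mlp, test[, 1:13])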

Error "variable lengths differ" when assigning weights parameter in caret (R)

I want to apply observation weights in caret using the code below:
model_weights <- ifelse(train$y == 0,
                        (1 / table(train$y)[1]) * 0.5,
                        (1 / table(train$y)[2]) * 0.5)
xgbT <- train(x = as.matrix(train[, -21]), y = make.names(as.factor(train$y)),
              method = "xgbTree",
              trControl = cctrl1,
              metric = "MCC",
              maximize = TRUE,
              weights = model_weights,
              preProc = c("center", "scale"),
              tuneGrid = expand.grid(nrounds = c(150), # number of trees
                                     max_depth = c(7), # max tree depth
                                     eta = c(0.03), # learning rate
                                     gamma = c(0.3), # min split loss
                                     colsample_bytree = c(0.7),
                                     min_child_weight = c(10, 1, 5), # min number of instances in a leaf
                                     subsample = c(0.6)), # subsample ratio of the training instances
              early_stop_round = c(3), # stop if no improvement over this many rounds
              objective = c("binary:logistic"),
              silent = 0)
However, it gives me this error:

Error in model.frame.default(formula = .outcome ~ ., data = dat, weights = wts) :
variable lengths differ (found for '(weights)')
Though I have checked that their lengths are the same with the code below:
> table(model_weights)
model_weights
0.0000277654375832963  0.000231481481481481
                18008                  2160
> table(train$y)
    0     1
18008  2160
Any idea how to fix this?
NOTE: I can run the train function without the weights parameter.
After further debugging, I found that the problem is caused by applying sampling in cctrl1: the weights end up with a different length because I generate them before caret applies its re-sampling.
So you can fix this by simply removing sampling from your trControl. If you still want re-sampling, you have to re-sample the data yourself before running the code below; a sketch of that re-sampling follows after it:
model_weights <- ifelse(train$y == 0,
                        (1 / table(train$y)[1]) * 0.5,
                        (1 / table(train$y)[2]) * 0.5)
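A hedged sketch of that re-sample-first approach; caret::downSample is just one example of manual re-sampling, and the column index and names are carried over from the question's code:

set.seed(42)
# Balance the classes manually first, then compute the weights on the
# re-sampled data and drop `sampling` from cctrl1.
train_bal <- caret::downSample(x = train[, -21],
                               y = as.factor(train$y),
                               yname = "y")
model_weights <- ifelse(train_bal$y == 0,
                        (1 / table(train_bal$y)[1]) * 0.5,
                        (1 / table(train_bal$y)[2]) * 0.5)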

How to instruct xgboost in caret to use mlogloss for optimization

I have a multiclass problem. For example, take the mtcars dataset, where we want to predict the number of cylinders cyl.
data(mtcars)
I want to use xgboost and fit it using the caret package. For that I create a grid of hyperparameters using
xgb_grid_param <- expand.grid(
  nrounds = 1000,
  eta = c(0.01, 0.001, 0.0001),
  max_depth = c(2, 4),
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1
)
I can create training control parameters as
xgb_tr_ctrl <- trainControl(
  method = "cv",
  number = 5,
  repeats = 2,
  verboseIter = TRUE,
  returnData = FALSE,
  returnResamp = "all",
  allowParallel = TRUE
)
When I then try to run the train function in caret using:
model <- train(factor(cyl) ~ ., data = mtcars, method = "xgbTree",
               trControl = xgb_grid_param, tuneGrid = xgb_grid_param)
I get the error:

Error in trControl$classProbs && any(classLevels != make.names(classLevels)) :
invalid 'x' type in 'x && y'

How do I fix this error, and how do I instruct xgbTree to use mlogloss to optimize the learning?
For another method I could solve "invalid 'x' type in 'x && y'" by setting the label attribute as the last column of the data frame / matrix.
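A hedged sketch of one way to get there, not from the original thread. Note that the quoted call passes the tuning grid to trControl, which is what triggers the 'x && y' error; the sketch passes the control object instead, makes the class levels valid R names, adds the subsample column that xgbTree's tuning grid requires, and selects multiclass log loss via caret's mnLogLoss summary function:

library(caret)
data(mtcars)
mtcars$cyl <- factor(paste0("cyl", mtcars$cyl))  # valid R names for classProbs

xgb_grid_param <- expand.grid(
  nrounds = 1000,
  eta = c(0.01, 0.001, 0.0001),
  max_depth = c(2, 4),
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1,
  subsample = 1                  # required column in xgbTree's tuning grid
)

xgb_tr_ctrl <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,             # class probabilities are needed for log loss
  summaryFunction = mnLogLoss    # caret's multiclass log loss
)

model <- train(cyl ~ ., data = mtcars, method = "xgbTree",
               trControl = xgb_tr_ctrl,   # the control object, not the grid
               tuneGrid = xgb_grid_param,
               metric = "logLoss", maximize = FALSE)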

Why does caret's "parRF" lead to tuning and missing value errors not present with "rf"

I have a tidy dataset with no missing values and only numeric columns.
The dataset is both large and contains sensitive information, so I won't be able to provide a copy of it here, unfortunately.
I partition this data into training and testing sets with caret's createDataPartition:
idx <- createDataPartition(y = model_final$y, p = 0.6, list = FALSE)
training <- model_final[idx, ]
testing <- model_final[-idx, ]
x <- training[-ncol(training)]
y <- training$y
x1 <- testing[-ncol(testing)]
y1 <- testing$y
row.names(training) <- NULL
row.names(testing) <- NULL
row.names(x) <- NULL
row.names(y) <- NULL
row.names(x1) <- NULL
row.names(y1) <- NULL
I've been fitting and refitting Random Forest models on this data via randomForest on a regular basis:
rf <- randomForest(x = x, y = y, mtry = ncol(x), ntree = 1000,
                   corr.bias = TRUE, do.trace = TRUE, nPerm = 3)
I decided to see if I could get better or faster results with train; the following model ran fine, but took about 2 hours:
rf_train <- train(y = y, x = x,
                  method = 'rf', tuneLength = 3,
                  trControl = trainControl(method = 'cv', number = 10,
                                           classProbs = TRUE))
I need to take an HPC approach to make this logistically feasible, so I tried
require(doParallel)
registerDoParallel(cores = 8)
rf_train <- train(y = y, x = x,
                  method = 'parRF', tuneGrid = data.frame(mtry = 3), na.action = na.omit,
                  trControl = trainControl(method = 'cv', number = 10,
                                           classProbs = TRUE, allowParallel = TRUE))
but regardless of whether I use tuneLength or tuneGrid, this leads to strange errors about missing values and tuning parameters:
Error in train.default(y = y, x = x, method = "parRF", tuneGrid = data.frame(mtry = 3), :
final tuning parameters could not be determined
In addition: Warning messages:
1: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
2: In train.default(y = y, x = x, method = "parRF", tuneGrid = data.frame(mtry = 3), :
missing values found in aggregated results
I say this is weird both because there were no errors with method = "rf" and because I triple-checked to ensure there are no missing values.
I get the same errors when completely omitting the tuning options. I also tried toggling the na.action option on and off and changing "cv" to "repeatedcv".
I even get the same error with this ultra-simplified version:
rf_train <- train(y = y, x = x, method = 'parRF')
This seems to be caused by a bug in caret. See the answer to: parRF on caret not working for more than one core.
I just dealt with this same issue; loading foreach on each new cluster manually seems to work.
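A hedged sketch of that workaround, assuming an explicit doParallel cluster; clusterEvalQ (from the parallel package, which doParallel attaches) runs the library calls on every worker:

library(doParallel)
cl <- makeCluster(8)
registerDoParallel(cl)

# Load the packages parRF needs on each worker node explicitly.
clusterEvalQ(cl, {
  library(foreach)
  library(randomForest)
})

rf_train <- train(y = y, x = x, method = 'parRF')
stopCluster(cl)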

GBM and Caret package: invalid number of intervals

Though I am defining target <- factor(train$target, levels = c(0, 1)), the code below produces this error:

Error in cut.default(y, unique(quantile(y, probs = seq(0, 1, length = cuts))), :
invalid number of intervals
In addition: Warning messages:
1: In train.default(x, y, weights = w, ...) :
cannnot compute class probabilities for regression

What does it mean and how do I fix it?
gbmGrid <- expand.grid(n.trees = (1:30) * 10,
                       interaction.depth = c(1, 5, 9),
                       shrinkage = 0.1)
fitControl <- trainControl(method = "repeatedcv",
                           number = 5,
                           repeats = 5,
                           verboseIter = FALSE,
                           returnResamp = "all",
                           classProbs = TRUE)
target <- factor(train$target, levels = c(0, 1))
gbm <- caret::train(target ~ .,
                    data = train,
                    # distribution = "gaussian",
                    method = "gbm",
                    trControl = fitControl,
                    tuneGrid = gbmGrid)
prob <- predict(gbm, newdata = testing, type = 'prob')[, 2]
First, don't do this:

target <- factor(train$target, levels = c(0, 1))

You will get a warning:

At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to: X0, X1

Second, you created a separate object called target. Using the formula method means that train will use the column called target in the data frame train, and those are different data. Modify the column itself instead; a sketch follows.
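A minimal sketch of that fix, not from the original answer: recode the column in train itself, using the valid level names (X0, X1) that the warning mentions. The n.minobsinnode column is an added assumption, since caret's gbm tuning grid expects it:

# Recode the outcome column in place with valid R level names.
train$target <- factor(train$target, levels = c(0, 1),
                       labels = c("X0", "X1"))

gbmGrid <- expand.grid(n.trees = (1:30) * 10,
                       interaction.depth = c(1, 5, 9),
                       shrinkage = 0.1,
                       n.minobsinnode = 10)  # assumed value; required column

gbm <- caret::train(target ~ ., data = train,
                    method = "gbm",
                    trControl = fitControl,
                    tuneGrid = gbmGrid)

# Probability of class X1 on the test set:
prob <- predict(gbm, newdata = testing, type = "prob")[, "X1"]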
