Error "variable lengths differ" when assigning weights parameter in caret (R) - r

I want to apply observation weights in caret using the code below:
model_weights <- ifelse(train$y == 0,
                        (1/table(train$y)[1]) * 0.5,
                        (1/table(train$y)[2]) * 0.5)
xgbT <- train(x = as.matrix(train[,-21]), y = make.names(as.factor(train$y)),
              method = "xgbTree",
              trControl = cctrl1,
              metric = "MCC",
              maximize = TRUE,
              weights = model_weights,
              preProc = c("center", "scale"),
              tuneGrid = expand.grid(nrounds = c(150), #number of trees
                                     max_depth = c(7), #max tree depth
                                     eta = c(0.03), #learning rate
                                     gamma = c(0.3), #min split loss
                                     colsample_bytree = c(0.7),
                                     min_child_weight = c(10, 1, 5), #min number of instances in the leaf
                                     subsample = c(0.6)), #subsample ratio of the training instance
              early_stop_round = c(3), #if no improvements over specified rounds
              objective = c("binary:logistic"),
              silent = 0)
However, it gives me this error:
Error in model.frame.default(formula = .outcome ~ ., data = dat, weights = wts) :
  variable lengths differ (found for '(weights)')
Yet I have checked that the weights and the outcome have the same length, using the code below:
> table(model_weights)
model_weights
0.0000277654375832963  0.000231481481481481
                18008                  2160
> table(train$y)
    0     1
18008  2160
Any idea how to fix this?
NOTE: I can run the train function without weights parameter.

After further debugging, I found that the problem occurs because I apply sampling in cctrl1. The weights are generated before caret resamples the data, so after resampling the length of the weights vector no longer matches the number of rows.
So you can fix this by simply removing sampling from your trControl. If you still want to resample, you have to resample the data yourself before running the code below:
model_weights <- ifelse(train$y == 0,
                        (1/table(train$y)[1]) * 0.5,
                        (1/table(train$y)[2]) * 0.5)
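A minimal sketch of the "resample first, then compute weights" order, using base R only. The data frame `toy`, its class counts, and the manual down-sampling step are assumptions made for illustration; they are not the asker's data or caret's own sampling code.

```r
set.seed(1)
# Toy stand-in for the training data (hypothetical: 90 zeros, 10 ones)
toy <- data.frame(x = rnorm(100), y = rep(c(0, 1), times = c(90, 10)))

# Down-sample the majority class by hand, before any weights exist
idx0 <- which(toy$y == 0)
idx1 <- which(toy$y == 1)
keep <- c(sample(idx0, length(idx1)), idx1)
toy_rs <- toy[keep, ]

# Compute the weights from the resampled data so the lengths agree
n0 <- sum(toy_rs$y == 0)
n1 <- sum(toy_rs$y == 1)
w_rs <- ifelse(toy_rs$y == 0, (1 / n0) * 0.5, (1 / n1) * 0.5)

# The weights vector now has one entry per remaining row
length(w_rs) == nrow(toy_rs)
```

The resampled data and `w_rs` would then be passed to train together, with no sampling option set in trControl.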

Related

There were missing values in resampled performance measures

I need to do a classification task on this dataset. As the following code shows, I tried to implement xgboost using caret package. Since my dataset is imbalanced, I prefer to use Fscore as performance measure. Furthermore, I need to use the first 700000 instances as the train set and the remaining 150000 instances as the test set. As the commented part of my code shows, I read this post and other related posts. However, I could not solve the issue.
mytrainvalid <- read.csv("mytrainvalid.csv")
library(xgboost)
library(dplyr)
library(caret)
mytrainvalid$DEFAULT <- ifelse(mytrainvalid$DEFAULT != 0,
"one",
"zero")
mytrainvalid$DEFAULT <- as.factor(mytrainvalid$DEFAULT)
input_x <- as.matrix(select(mytrainvalid, -DEFAULT))
## Use the validation index in the trainControl
ind=as.integer(rownames(mytrainvalid))
vi=c(700001:850000)
# modelling
grid_default <- expand.grid(
nrounds = c(100,200),
max_depth = 6,
eta = 0.1,
gamma = 0,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 1
)
## use F-score because the data is imbalanced: 20:1
f1 <- function (data, lev = NULL, model = NULL) {
  # the positive class must be an existing factor level ("one" or "zero" here)
  precision <- posPredValue(data$pred, data$obs, positive = "one")
  recall <- sensitivity(data$pred, data$obs, positive = "one")
  f1_val <- (2 * precision * recall) / (precision + recall)
  names(f1_val) <- c("F1")
  f1_val
}
##
data.ctrl <- trainControl(method = "cv",
number = 1,
allowParallel=TRUE,
returnData = FALSE,
index = list(Fold1=(1:ind)[-vi]),
sampling = "smote",
classProbs = TRUE,
summaryFunction = f1,
savePredictions = "final",
verboseIter=TRUE,
search = "random",
#savePred=T
)
xgb_model <-caret::train (input_x,
mytrainvalid$DEFAULT,
method="xgbTree",
trControl=data.ctrl,
#tuneGrid=grid_default,
verbose=FALSE,
metric = "F1",
classProbs=TRUE,
#linout=FALSE,
#threshold = 0.3,
#scale_pos_weight = sum(input_y$DEFAULT == "no")/sum(input_y$DEFAULT == "yes"),
#maximize = FALSE,
tuneLength = 2,
)
Unfortunately, the following error is produced:
Something is wrong; all the F1 metric values are missing:
F1
Min. : NA
1st Qu.: NA
Median : NA
Mean :NaN
3rd Qu.: NA
Max. : NA
NA's :2
Error: Stopping
In addition: Warning messages:
1: model fit failed for Fold1: eta=0.09121, max_depth=8, gamma=7.227, colsample_bytree=0.6533, min_child_weight=15, subsample=0.9783, nrounds=800 Error in createModel(x = subset_x(x, modelIndex), y = y[modelIndex], wts = wts[modelIndex], :
formal argument "classProbs" matched by multiple actual arguments
2: model fit failed for Fold1: eta=0.15119, max_depth=8, gamma=8.877, colsample_bytree=0.4655, min_child_weight= 3, subsample=0.9515, nrounds=536 Error in createModel(x = subset_x(x, modelIndex), y = y[modelIndex], wts = wts[modelIndex], :
formal argument "classProbs" matched by multiple actual arguments
3: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
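The warnings above say that classProbs was matched by multiple actual arguments. One plausible reading, which the thread itself does not confirm, is that classProbs = TRUE was passed to train() as well as to trainControl(), so it reached the fitting function twice. The base R sketch below reproduces that kind of clash with two hypothetical functions, `inner_fit` and `outer_train`; they are not caret functions.

```r
# Hypothetical inner function standing in for a model-fitting routine
inner_fit <- function(x, classProbs, ...) classProbs

# Hypothetical wrapper that already sets classProbs and forwards `...`
outer_train <- function(x, ...) inner_fit(x, classProbs = TRUE, ...)

# Supplying classProbs again through `...` triggers the clash
msg <- tryCatch(outer_train(1, classProbs = TRUE),
                error = function(e) conditionMessage(e))
print(msg)

# Without the extra argument the call goes through
ok <- outer_train(1)
```

If that reading is right, dropping classProbs = TRUE from the train() call and keeping it only in trainControl() would be the first thing to try.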

{caret}xgbTree model not running when weights included, runs fine without them

I have a dataset from which I can build an xgbTree model without weights with no problem, but once I include weights, even if the weights are all 1, the model does not run. I get the error
Something is wrong; all the RMSE metric values are missing:
and when I print the warnings, the last message is
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, ... : There were missing values in resampled performance measures.
This is a drive link to the RData file containing the data; it was too big to print, and smaller samples didn't always reproduce the error.
It contains 3 objects: input_x, input_y, and wts. The last one is just a vector of 1s, but eventually it should be able to accept numbers on the interval (0,1), ideally. The code I used is shown below. Note the comment next to the weights argument that produces the error.
nrounds<-1000
tune_grid <- expand.grid(
nrounds = seq(from = 200, to = nrounds, by = 50),
eta = c(0.025, 0.05, 0.1, 0.3),
max_depth = c(2, 3, 4, 5),
gamma = 0,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 1
)
tune_control <- caret::trainControl(
method = "cv",
number = 3,
verboseIter = FALSE,
allowParallel = TRUE
)
xgb_tune <- caret::train(
x = input_x,
y = input_y,
weights = wts, # If I remove this line, the code works fine. When included, even if just 1s, it throws an error.
trControl = tune_control,
tuneGrid = tune_grid,
method = "xgbTree",
verbose = TRUE
)
EDIT 13.10.2021, thanks to @waterpolo:
The correct way to specify weights is via the weights argument to caret::train:
xgb_tune <- caret::train(
x = input_x,
y = input_y,
weights = wts,
trControl = tune_control,
tuneGrid = tune_grid,
method = "xgbTree",
verbose = TRUE
)
See a more detailed answer here: Non-tree model error when using xgbTree method with Caret and weights to target variable when applying the varImp function
Old incorrect answer below:
According to the function source, the weights argument is called wts.
The relevant lines:
if (!is.null(wts))
xgboost::setinfo(x, 'weight', wts)
Running
xgb_tune <- caret::train(
x = input_x,
y = input_y,
wts = wts,
trControl = tune_control,
tuneGrid = tune_grid,
method = "xgbTree",
verbose = TRUE
)
should produce the desired result.
Just wanted to add @missuse's response from another post (Non-tree model error when using xgbTree method with Caret and weights to target variable when applying the varImp function). The correct argument is weights.
Code:
xgb_tune <- caret::train(x = input_x,
y = input_y,
weights = wts,
trControl = tune_control,
tuneGrid = tune_grid,
method = "xgbTree",
verbose = TRUE
)
The other thing I found was that I needed to use weights > 1, or I would receive the same error message as you. For example, inverse weighting, which produces weights well below 1, gave me the same message. Hope this helps.
Thanks @missuse for the lovely response in the other thread!
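As a sketch of the observation above, inverse class weights can be rescaled so that none falls below 1 while the ratio between classes stays the same. The class counts are borrowed from the first question on this page purely for illustration, and whether rescaling resolves the error in a given setup is an assumption to test, not something this answer guarantees.

```r
# Hypothetical outcome vector with the class counts from the first question
y <- rep(c(0, 1), times = c(18008, 2160))

# Inverse class weights: all of them are far below 1
inv_w <- ifelse(y == 0, 0.5 / sum(y == 0), 0.5 / sum(y == 1))

# Rescale so the smallest weight is exactly 1; class ratios are unchanged
w_scaled <- inv_w / min(inv_w)
range(w_scaled)
```

If strictly greater than 1 turns out to be required, multiplying `w_scaled` by any constant above 1 keeps the same ratios.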

How to prevent "algorithm did not converge" errors in neuralnet / Caret / R?

I am trying to train a neural network to predict the times table, using the train function with neuralnet as the method parameter.
I am scaling my training data set as well.
Even though I've tried different values of learningrate, stepmax, and threshold for neuralnet, every time I train the network with the train function, one of the k folds fails with
1: Algorithm did not converge in 1 of 1 repetition(s) within the stepmax.
2: predictions failed for Fold05.Rep1: layer1=8, layer2=0, layer3=0 Error in cbind(1, pred) %*% weights[[num_hidden_layers + 1]] :
requires numeric/complex matrix/vector arguments
I am guessing that this happens because the initial weights are random, so each time I happen to get some starting weights that do not converge.
Is there any way of preventing this? Maybe by re-training the particular fold that failed using different weights?
Here is my code:
library(caret)
library(neuralnet)
# Create the dataset
tt = data.frame(multiplier = rep(1:10, times = 10), multiplicand = rep(1:10, each = 10))
tt = cbind(tt, data.frame(product = tt$multiplier * tt$multiplicand))
# Splitting
indexes = createDataPartition(tt$product,
times = 1,
p = 0.7,
list = FALSE)
tt.train = tt[indexes,]
tt.test = tt[-indexes,]
# Pre-process
preProc <- preProcess(tt, method = c('center', 'scale'))
tt.preProcessed <- predict(preProc, tt)
tt.preProcessed.train <- tt.preProcessed[indexes,]
tt.preProcessed.test <- tt.preProcessed[-indexes,]
# Train
train.control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3)
tune.grid <- expand.grid(layer1 = 8,
layer2 = 0,
layer3 = 0)
tt.cv <- train(product ~ .,
data = tt.preProcessed.train,
method = 'neuralnet',
tuneGrid = tune.grid,
trControl = train.control,
linear.output = TRUE,
algorithm = 'backprop',
learningrate = 0.01,
stepmax = 500000,
lifesign = 'minimal',
threshold = 0.01)
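The retry idea raised in the question can be pictured with a generic base R sketch. `fit_once` is a hypothetical stand-in for a single training attempt that fails for some random starts; it is not part of caret or neuralnet, and caret does not expose such a hook as far as this sketch assumes.

```r
# Hypothetical stand-in for one training attempt: fails for "unlucky" random starts
fit_once <- function(seed) {
  set.seed(seed)
  if (runif(1) < 0.5) stop("algorithm did not converge")
  list(seed = seed)
}

# Try successive seeds until one attempt succeeds or the budget runs out
fit_with_retry <- function(seeds) {
  for (s in seeds) {
    fit <- tryCatch(fit_once(s), error = function(e) NULL)
    if (!is.null(fit)) return(fit)
  }
  stop("no attempt converged")
}

fit <- fit_with_retry(1:20)
```

In a real setting the body of `fit_once` would be replaced by the actual model call, with the seed controlling the random starting weights.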

Some questions about rpart & gbm in R

I am trying to model claim frequency with the rpart and gbm packages, and I have a few questions about them.
In the rpart model, what is the purpose of the shrink parameter?
In the gbm model, do I use the weights correctly? I get an output (no errors), but I want to be sure I have understood it correctly.
In the gbm model, I know that the parameter n.minobsinnode lets me require at least 10 observations in each node. But is there a way to require that each node has at least 1 claim? I don't want a model that predicts a claim frequency of 0 for some observations.
In a random forest, d variables are randomly picked from the n variables for each split. But in the gbm model, are all n variables considered for each split?
In tree-based models, is it possible to offset one variable (e.g. deductible)?
# Regression tree
Model_tree <- rpart(cbind(duration, nclaims) ~ Var_1 + … + Var_n ,
data = data ,
method = "poisson",
parms = list(shrink = 1),control=rpart.control(minbucket = 10, cp = 0.00005 , maxdepth = 5))
# Gradient Boosting Model
Model_gbm <- gbm(nclaims ~ Var_1 + … + Var_n,
data = data,
weights = duration,
distribution = "poisson",
cv.folds = 0,
shrinkage = 0.01,
interaction.depth = 5,
n.trees = 5000,
n.minobsinnode = 10,
bag.fraction = 1,
train.fraction = 1)
# Predict with a gbm
predict.gbm(object = Model_gbm, n.trees = 1000, newdata = testdata, type = "response")

GBM and Caret package: invalid number of intervals

Although I define target <- factor(train$target, levels = c(0, 1)), the code below produces this error:
Error in cut.default(y, unique(quantile(y, probs = seq(0, 1, length = cuts))), : invalid number of intervals
In addition: Warning messages:
1: In train.default(x, y, weights = w, ...) : cannnot compute class probabilities for regression
What does it mean, and how can I fix it?
gbmGrid <- expand.grid(n.trees = (1:30)*10,
interaction.depth = c(1, 5, 9),
shrinkage = 0.1)
fitControl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 5,
verboseIter = FALSE,
returnResamp = "all",
classProbs = TRUE)
target <- factor(train$target, levels = c(0, 1))
gbm <- caret::train(target ~ .,
data = train,
#distribution="gaussian",
method = "gbm",
trControl = fitControl,
tuneGrid = gbmGrid)
prob = predict(gbm, newdata=testing, type='prob')[,2]
First, don't do this:
target <- factor(train$target, levels = c(0, 1))
You will get a warning:
At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to: X0, X1
Second, you created a separate object called target. With the formula method, train uses the column called target inside the data frame train, not that object, and the two hold different data: the column is still numeric, which is why the model is treated as regression. Modify the column itself.
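A small base R sketch of both points, converting the column in place and giving the levels valid R names. The data frame `train_df` and the labels "no" and "yes" are assumptions chosen for illustration.

```r
# Toy stand-in for the `train` data frame (hypothetical values)
train_df <- data.frame(x = c(0.2, 1.5, 0.7, 2.1), target = c(0, 1, 0, 1))

# Levels "0" and "1" are not valid R names; make.names() would rewrite them
make.names(c("0", "1"))

# Convert the column itself, using labels that are valid R names
train_df$target <- factor(train_df$target, levels = c(0, 1),
                          labels = c("no", "yes"))

levels(train_df$target)
# Valid names are left unchanged by make.names()
all(make.names(levels(train_df$target)) == levels(train_df$target))
```

With the column stored as a factor, a formula such as target ~ . picks up the factor rather than the original numeric values.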
