GBM and Caret package: invalid number of intervals - r

Although I define target <- factor(train$target, levels = c(0, 1)), the code below produces this error:
Error in cut.default(y, unique(quantile(y, probs = seq(0, 1, length =
cuts))), : invalid number of intervals In addition: Warning
messages: 1: In train.default(x, y, weights = w, ...) : cannnot
compute class probabilities for regression
What does it mean, and how do I fix it?
gbmGrid <- expand.grid(n.trees = (1:30)*10,
interaction.depth = c(1, 5, 9),
shrinkage = 0.1)
fitControl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 5,
verboseIter = FALSE,
returnResamp = "all",
classProbs = TRUE)
target <- factor(train$target, levels = c(0, 1))
gbm <- caret::train(target ~ .,
data = train,
#distribution="gaussian",
method = "gbm",
trControl = fitControl,
tuneGrid = gbmGrid)
prob = predict(gbm, newdata=testing, type='prob')[,2]

First, don't do this:
target <- factor(train$target, levels = c(0, 1))
You will get a warning:
At least one of the class levels are not valid R variables names; This may cause errors if class probabilities are generated because the variables names will be converted to: X0, X1
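Second, you created a separate object called target. With the formula method, train uses the column called target inside the data frame train, not that standalone object, and the two hold different data. If the column is still numeric, train fits a regression, which is what the warning reports. Modify the column itself.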
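A minimal sketch of that fix, assuming the outcome is stored as 0/1 in train$target. The toy data frame and the labels "no" and "yes" are illustrative and not taken from the question:

```r
# Toy stand-in for the question's `train` data frame (hypothetical data).
train <- data.frame(
  x1     = c(0.2, 1.5, 3.1, 0.7, 2.4, 1.9),
  target = c(0, 1, 1, 0, 1, 0)
)

# Overwrite the column itself so the formula `target ~ .` sees a factor,
# and give the levels labels that are valid R names (needed for classProbs = TRUE).
train$target <- factor(train$target, levels = c(0, 1), labels = c("no", "yes"))

str(train$target)

# Valid names are left unchanged by make.names(), so no X0/X1 renaming occurs.
stopifnot(identical(levels(train$target), make.names(levels(train$target))))
```

With the column converted this way, caret::train(target ~ ., data = train, ...) should treat the problem as classification, and predict(..., type = "prob") should return one column per label.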

Related

Prediction on new data with GLMNET and CARET - The number of variables in newx must be X

I have a dataset on which I am doing k-fold cross-validation.
In each fold, I have split the data into a train and test dataset.
For the training on the dataset X, I run the following code:
cv_glmnet <- caret::train(x = as.data.frame(X[curtrainfoldi, ]), y = y[curtrainfoldi, ],
method = "glmnet",
preProcess = NULL,
trControl = trainControl(method = "cv", number = 10),
tuneLength = 10)
When I check the class of cv_glmnet, 'train' is returned.
I then want to use this model to predict values in the test dataset, which is a matrix with the same number of variables (columns):
# predicting on test data
yhat <- predict.train(cv_glmnet, newdata = X[curtestfoldi, ])
However, I keep running into the following error:
Error in predict.glmnet(modelFit, newdata, s = modelFit$lambdaOpt, type = "response") :
The number of variables in newx must be 210
I noticed in the caret.predict documentation, it states the following:
newdata an optional set of data to predict on. If NULL, then the
original training data are used but, if the train model used a recipe,
an error will occur.
I am confused as to why I am running into this error. Is it related to how I am defining newdata? My data has the right number of variables/columns (the same as the training dataset), so I have no idea what is causing the error.
You get the error because the column names change when you call as.data.frame(X). If the matrix has no column names, as.data.frame() creates them (V1, V2, ...), and the model then expects those names at prediction time. If it does have column names, some of them could still be changed:
library(caret)
library(tibble)
X = matrix(runif(50*20),ncol=20)
y = rnorm(50)
cv_glmnet <- caret::train(x = as.data.frame(X), y = y,
method = "glmnet",
preProcess = NULL,
trControl = trainControl(method = "cv", number = 10),
tuneLength = 10)
yhat <- predict.train(cv_glmnet, newdata = X)
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
Error in predict.glmnet(modelFit, newdata, s = modelFit$lambdaOpt) :
The number of variables in newx must be 20
If the matrix has column names, it works:
colnames(X) = paste0("column",1:ncol(X))
cv_glmnet <- caret::train(x = as.data.frame(X), y = y,
method = "glmnet",
preProcess = NULL,
trControl = trainControl(method = "cv", number = 10),
tuneLength = 10)
yhat <- predict.train(cv_glmnet, newdata = X)
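The name mismatch described above can be checked in base R without fitting anything. The sketch below only inspects names. The commented-out predict call assumes a cv_glmnet object fitted on train_df, and passing newdata through the same as.data.frame() conversion is offered as an alternative that should work, not as the answer's own method:

```r
set.seed(42)
X <- matrix(runif(50 * 20), ncol = 20)   # matrix without column names

is.null(colnames(X))                     # TRUE: the raw matrix has no names
train_df <- as.data.frame(X)
head(names(train_df), 3)                 # "V1" "V2" "V3": names created on conversion

# Convert the new data the same way so its names match the training data.
new_df <- as.data.frame(X[1:5, , drop = FALSE])
stopifnot(identical(names(new_df), names(train_df)))

# yhat <- predict(cv_glmnet, newdata = new_df)  # assumes cv_glmnet was fit on train_df
```

Either approach (naming the matrix columns up front, or converting newdata exactly as the training data was converted) keeps the predictor names consistent between training and prediction.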

Trying to extract coefficients from glmnet model returns NULL or "type must be either "raw" or "prob"" error

I am running a glmnet model in caret on the built-in infert dataset, e.g.,
infert_y <- factor(infert$case) %>% plyr::revalue(c("0"="control", "1"="case"))
infert_x <- subset(infert, select=-case)
new.x <- model.matrix(~., infert_x)
# Create cross-validation folds:
myFolds <- createFolds(infert_y, k = 10)
# Create reusable trainControl object:
myControl_categorical <- trainControl(
summaryFunction = twoClassSummary,
classProbs = TRUE, # IMPORTANT!
verboseIter = TRUE,
savePredictions = TRUE,
index = myFolds
)
model_glmnet_pca <- train(
x = new.x,
y = infert_y,
metric = "ROC",
method = "glmnet",
preProcess=c("zv", "nzv","medianImpute", "center", "scale", "pca"),
trControl = myControl_categorical,
tuneGrid= expand.grid(alpha= seq(0, 1, length = 20),
lambda = seq(0.0001, 1, length = 100))
)
But when I try to get the coefficients:
bestlambda <- model_glmnet_pca$results$lambda[model_glmnet_pca$results$ROC == max(model_glmnet_pca$results$ROC)]
coef(model_glmnet_pca, s=bestlambda)
returns:
NULL
I tried:
coef.glmnet(model_glmnet_pca, s=bestlambda)
which returns:
Error in predict.train(object, s = s, type = "coefficients", exact = exact, :
type must be either "raw" or "prob"
But surely when I'm calling coef() my "type" argument is set to "coefficients"? If I try
coef.glmnet(model_glmnet_pca, s=bestlambda, type="prob")
it returns:
Error in predict.train(object, s = s, type = "coefficients", exact = exact, :
formal argument "type" matched by multiple actual arguments
I am very confused, can anyone point out what I'm doing wrong?
To get the coefficients from the best model, you can use:
coef(model_glmnet_pca$finalModel, model_glmnet_pca$finalModel$lambdaOpt)
See e.g. this link on using regularised regression models with caret.
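A likely reason for the NULL is that coef() falls back to its default method, which simply returns object$coefficients, and the caret wrapper has no such element. The second error shows coef.glmnet calling predict() with type = "coefficients", which then reaches predict.train and is rejected there. The base R sketch below imitates the first situation with a hypothetical wrapper class (toy_train) around an lm fit; it does not use caret or glmnet:

```r
# A plain fitted model stands in for the glmnet fit.
fit <- lm(mpg ~ wt, data = mtcars)

# Hypothetical wrapper that stores the fit in `finalModel`, like a caret train object.
wrapper <- structure(list(finalModel = fit), class = "toy_train")

coef(wrapper)              # NULL: the wrapper has no `coefficients` element
coef(wrapper$finalModel)   # the coefficients live on the inner model

stopifnot(is.null(coef(wrapper)))
```

For the caret model in the question, the same idea is what the answer's call does: it asks the inner glmnet fit for its coefficients at the selected lambda. Because that model was trained with "pca" in preProcess, the coefficients should refer to the principal components and not to the original columns.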

{caret}xgbTree model not running when weights included, runs fine without them

I have a dataset on which I can build an xgbTree model without weights with no problem, but once I include weights, even if they are all 1, training fails. I get the error
Something is wrong; all the RMSE metric values are missing:
and when I print the warnings, the last message is
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, ... : There were missing values in resampled performance measures.
This is a drive link to the RData file containing the info -- it was too big to print, and smaller samples didn't always reproduce the error.
It contains 3 objects: input_x, input_y, and wts. The last one is just a vector of 1s, but eventually it should ideally accept numbers in the interval (0, 1). The code I used is shown below. Note the comment next to the weights argument that produces the error.
nrounds<-1000
tune_grid <- expand.grid(
nrounds = seq(from = 200, to = nrounds, by = 50),
eta = c(0.025, 0.05, 0.1, 0.3),
max_depth = c(2, 3, 4, 5),
gamma = 0,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 1
)
tune_control <- caret::trainControl(
method = "cv",
number = 3,
verboseIter = FALSE,
allowParallel = TRUE
)
xgb_tune <- caret::train(
x = input_x,
y = input_y,
weights = wts, # If I remove this line, the code works fine. When included, even if just 1s, it throws an error.
trControl = tune_control,
tuneGrid = tune_grid,
method = "xgbTree",
verbose = TRUE
)
EDIT 13.10.2021, thanks to @waterpolo:
The correct way to specify weights is via the weights argument to caret::train:
xgb_tune <- caret::train(
x = input_x,
y = input_y,
weights = wts,
trControl = tune_control,
tuneGrid = tune_grid,
method = "xgbTree",
verbose = TRUE
)
See a more detailed answer here: Non-tree model error when using xgbTree method with Caret and weights to target variable when applying the varImp function
Old incorrect answer below:
According to the function source, the weights argument is called wts.
Line:
if (!is.null(wts))
xgboost::setinfo(x, 'weight', wts)
Running
xgb_tune <- caret::train(
x = input_x,
y = input_y,
wts = wts,
trControl = tune_control,
tuneGrid = tune_grid,
method = "xgbTree",
verbose = TRUE
)
should produce the desired result.
I just wanted to add @missuse's response from another post (Non-tree model error when using xgbTree method with Caret and weights to target variable when applying the varImp function). The correct argument is weights.
Code:
xgb_tune <- caret::train(x = input_x,
y = input_y,
weights = wts,
trControl = tune_control,
tuneGrid = tune_grid,
method = "xgbTree",
verbose = TRUE
)
I also found that I needed to use weights > 1, or I would receive the same error message as you. For example, inverse weighting produced that error. Hope this helps.
Thanks @missuse for the lovely response in the other thread!
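The remark about small weights is one user's observation, not documented behaviour. If it applies to your data, one possible workaround is to rescale the weights so that none falls below 1, which keeps their ratios intact. A sketch with made-up labels; y, inv_wts and wts here are hypothetical and not the objects from the RData file:

```r
# Made-up binary outcome: six observations of class 0, two of class 1.
y <- c(0, 0, 0, 0, 0, 0, 1, 1)

# Inverse class-frequency weights, one per observation (all below 1).
class_counts <- table(y)
inv_wts <- 1 / as.numeric(class_counts[as.character(y)])

# Divide by the smallest weight: the minimum becomes 1, ratios are unchanged.
wts <- inv_wts / min(inv_wts)

stopifnot(length(wts) == length(y), min(wts) >= 1)
print(wts)
```

The rescaled vector could then be passed as weights = wts to caret::train. Whether this avoids the error in a given setup is something to verify, since the cause of the failure with small weights is not established in this thread.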

How to prevent "algorithm did not converge" errors in neuralnet / Caret / R?

I am trying to train a neural network with caret's train function, using neuralnet as the method parameter, to predict the times table.
I am scaling my training data set as well.
Although I have tried different values of learningrate, stepmax, and threshold for neuralnet, every time I train the network with train, one of the k folds fails with:
1: Algorithm did not converge in 1 of 1 repetition(s) within the stepmax.
2: predictions failed for Fold05.Rep1: layer1=8, layer2=0, layer3=0 Error in cbind(1, pred) %*% weights[[num_hidden_layers + 1]] :
requires numeric/complex matrix/vector arguments
My guess is that, because the initial weights are random, I sometimes get a set of weights that does not converge.
Is there any way to prevent this? For example, could the fold that failed be retrained with different weights?
Here is my code:
library(caret)
library(neuralnet)
# Create the dataset
tt = data.frame(multiplier = rep(1:10, times = 10), multiplicand = rep(1:10, each = 10))
tt = cbind(tt, data.frame(product = tt$multiplier * tt$multiplicand))
# Splitting
indexes = createDataPartition(tt$product,
times = 1,
p = 0.7,
list = FALSE)
tt.train = tt[indexes,]
tt.test = tt[-indexes,]
# Pre-process
preProc <- preProcess(tt, method = c('center', 'scale'))
tt.preProcessed <- predict(preProc, tt)
tt.preProcessed.train <- tt.preProcessed[indexes,]
tt.preProcessed.test <- tt.preProcessed[-indexes,]
# Train
train.control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3)
tune.grid <- expand.grid(layer1 = 8,
layer2 = 0,
layer3 = 0)
tt.cv <- train(product ~ .,
data = tt.preProcessed.train,
method = 'neuralnet',
tuneGrid = tune.grid,
trControl = train.control,
linear.output = TRUE,
algorithm = 'backprop',
learningrate = 0.01,
stepmax = 500000,
lifesign = 'minimal',
threshold = 0.01)

Error "variable lengths differ" when assigning weights parameter in caret (R)

I want to apply observation weights in caret using the code below:
model_weights <- ifelse(train$y == 0,
(1/table(train$y)[1]) * 0.5,
(1/table(train$y)[2]) * 0.5)
xgbT <- train(x = as.matrix(train[,-21]), y = make.names(as.factor(train$y)),
method = "xgbTree",
trControl = cctrl1,
metric = "MCC",
maximize = TRUE,
weights = model_weights,
preProc = c("center", "scale"),
tuneGrid = expand.grid(nrounds = c(150), #number of trees
max_depth = c(7), #max tree depth
eta = c(0.03), #learning rate
gamma = c(0.3), #min split loss
colsample_bytree = c(0.7),
min_child_weight = c(10, 1, 5), #min number of instances in the leaf
subsample = c(0.6)), #subsample ratio of the training instance
early_stop_round = c(3), #if no improvements over specified rounds
objective = c("binary:logistic"),
silent = 0)
However, it gives me this error: Error in model.frame.default(formula = .outcome ~ ., data = dat, weights = wts) :
variable lengths differ (found for '(weights)')
Though I have checked that their lengths are the same with code below:
> table(model_weights)
model_weights
0.0000277654375832963 0.000231481481481481
18008 2160
> table(train$y)
0 1
18008 2160
Any idea how to fix this?
NOTE: I can run the train function without weights parameter.
After further debugging, I found that the problem is caused by the sampling I apply in cctrl1. The weights are generated before the data are resampled, so their length no longer matches the resampled data.
You can fix this by removing sampling from your trControl. If you still want resampling, resample the data first and then run the code below:
model_weights <- ifelse(train$y == 0,
(1/table(train$y)[1]) * 0.5,
(1/table(train$y)[2]) * 0.5)
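The order of operations can be sketched in base R. The manual down-sampling below is only a stand-in for whatever sampling cctrl1 applies, and the data frame is simulated; train_bal and the toy data are hypothetical names and values:

```r
set.seed(1)

# Simulated imbalanced data: 90 observations of class 0, 10 of class 1.
train <- data.frame(x = rnorm(100), y = rep(c(0, 1), times = c(90, 10)))

# Step 1: resample first (here, keep all of class 1 and an equal number of class 0).
minority  <- which(train$y == 1)
majority  <- sample(which(train$y == 0), length(minority))
train_bal <- train[c(minority, majority), ]

# Step 2: compute the weights from the resampled data, one weight per row.
class_counts  <- table(train_bal$y)
model_weights <- 0.5 / as.numeric(class_counts[as.character(train_bal$y)])

# The weights now line up with the data that will actually be passed to train().
stopifnot(length(model_weights) == nrow(train_bal))
```

train_bal and model_weights can then be passed to train together, with sampling removed from the trainControl object so the data are not resampled a second time.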
