Caret train() mlp missing values - r

I'm trying to train a multilayer perceptron for nonlinear regression on a dataset, but it keeps giving me this warning:
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
I tried doing it again with an R dataset to see if my data was the problem, but I keep getting the same warning and I have no idea why.
I already tried adding or removing tuneGrid, trControl, and even linOut.
It always gives me constant predictions for some reason.
library(datasets)
library(MASS)
library(caret)
DP = caret::createDataPartition(Boston$medv, p=0.75, list = F)
train = Boston[DP,]
test = Boston[-DP,]
colnames(train) = colnames(Boston)
colnames(test) = colnames(Boston)
mlp = caret::train(medv ~., data = Boston, method = "mlp", trControl = trainControl(method = "cv", number = 3),
tuneGrid = expand.grid(size = 1:3), linOut = T, metric = "RMSE")
Yp = caret::predict.train(mlp, test[,1:13])
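One plausible reading of the warning, offered as an assumption rather than a confirmed diagnosis: if the network predicts the same value for every row of a held-out fold, the correlation between predictions and observations is undefined, so the R-squared for that fold comes back as NA. The base R sketch below reproduces only that effect with made-up numbers; obs and pred_constant are hypothetical names, not objects from the code above.

```r
# Hypothetical held-out fold: observed values and a model that
# returns the same prediction for every row.
set.seed(1)
obs <- rnorm(20, mean = 22, sd = 9)
pred_constant <- rep(mean(obs), length(obs))

# RMSE is still well defined for constant predictions ...
rmse <- sqrt(mean((obs - pred_constant)^2))

# ... but the correlation is not, because the predictions have zero variance.
# cor() returns NA (with a warning, silenced here), so R-squared is NA too.
r_squared <- suppressWarnings(cor(pred_constant, obs))^2

print(c(RMSE = rmse, Rsquared = r_squared))
```

If this is what is happening, centering and scaling the predictors (for example through the preProcess argument of train) is a commonly suggested thing to try with neural networks, though whether it removes the constant predictions in this case is untested here.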

Related

Prediction on new data with GLMNET and CARET - The number of variables in newx must be X

I have a dataset on which I am doing k-fold cross-validation.
In each fold, I have split the data into a train and test dataset.
For the training on the dataset X, I run the following code:
cv_glmnet <- caret::train(x = as.data.frame(X[curtrainfoldi, ]), y = y[curtrainfoldi, ],
method = "glmnet",
preProcess = NULL,
trControl = trainControl(method = "cv", number = 10),
tuneLength = 10)
I check the class of 'cv_glmnet', and 'train' is returned.
I then want to use this model to predict values in the test dataset, which is a matrix that has the same number of variables (columns)
# predicting on test data
yhat <- predict.train(cv_glmnet, newdata = X[curtestfoldi, ])
However, I keep running into the following error:
Error in predict.glmnet(modelFit, newdata, s = modelFit$lambdaOpt, type = "response") :
The number of variables in newx must be 210
I noticed in the caret.predict documentation, it states the following:
newdata an optional set of data to predict on. If NULL, then the
original training data are used but, if the train model used a recipe,
an error will occur.
I am confused as to why I am running into this error. Is it related to how I am defining newdata? My data has the right number of variables/columns (same as the train dataset), so I have no idea what is causing the error.
You get the error because your column names change when you pass as.data.frame(X). If your matrix doesn't have column names, as.data.frame() creates them, and the model then expects those names when it tries to predict. If it already has column names, some of them could be changed:
library(caret)
library(tibble)
X = matrix(runif(50*20),ncol=20)
y = rnorm(50)
cv_glmnet <- caret::train(x = as.data.frame(X), y = y,
method = "glmnet",
preProcess = NULL,
trControl = trainControl(method = "cv", number = 10),
tuneLength = 10)
yhat <- predict.train(cv_glmnet, newdata = X)
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
Error in predict.glmnet(modelFit, newdata, s = modelFit$lambdaOpt) :
The number of variables in newx must be 20
If the matrix has column names, it works:
colnames(X) = paste0("column",1:ncol(X))
cv_glmnet <- caret::train(x = as.data.frame(X), y = y,
method = "glmnet",
preProcess = NULL,
trControl = trainControl(method = "cv", number = 10),
tuneLength = 10)
yhat <- predict.train(cv_glmnet, newdata = X)
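The name mismatch described in this answer can be checked without fitting any model. The base R sketch below (variable names such as train_names are illustrative, not from the original code) shows that an unnamed matrix has no column names, while the data frame built from it gets generated names, and that naming the matrix first keeps the two in agreement.

```r
set.seed(42)
X <- matrix(runif(50 * 20), ncol = 20)

# The raw matrix carries no column names at all.
had_names_before <- !is.null(colnames(X))
print(had_names_before)

# as.data.frame() invents names for an unnamed matrix: V1, V2, ...
train_names <- names(as.data.frame(X))
print(head(train_names))

# Naming the matrix up front means the data frame used for training and
# the matrix passed as newdata share the same column names.
colnames(X) <- paste0("column", seq_len(ncol(X)))
names_match <- identical(names(as.data.frame(X)), colnames(X))
print(names_match)
```

An alternative that should follow from the same reasoning, though it is not tested here, is to pass newdata through as.data.frame() as well, so that training and prediction data are converted in the same way.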

How to get test data ROC plot from MLeval

I am trying to return the ROC curves for a test dataset using the MLeval package.
# Load data
train <- readRDS(paste0("Data/train.rds"))
test <- readRDS(paste0("Data/test.rds"))
# Create factor class
train$class <- ifelse(train$class == 1, 'yes', 'no')
# Set up control function for training
ctrl <- trainControl(method = "cv",
number = 5,
returnResamp = 'none',
summaryFunction = twoClassSummary,
classProbs = T,
savePredictions = T,
verboseIter = F)
gbmGrid <- expand.grid(interaction.depth = 10,
n.trees = 18000,
shrinkage = 0.01,
n.minobsinnode = 4)
# Build using a gradient boosted machine
set.seed(5627)
gbm <- train(class ~ .,
data = train,
method = "gbm",
metric = "ROC",
tuneGrid = gbmGrid,
verbose = FALSE,
trControl = ctrl)
# Predict results -
pred <- predict(gbm, newdata = test, type = "prob")[,"yes"]
roc <- evalm(data.frame(pred, test$class))
I have used the following post, ROC curve for the testing set using Caret package,
to try and plot the ROC from test data using MLeval and yet I get the following error message:
MLeval: Machine Learning Model Evaluation
Input: data frame of probabilities of observed labels
Error in names(x) <- value :
'names' attribute [3] must be the same length as the vector [2]
Can anyone please help? Thanks.
Please provide a reproducible example with sample data so we can replicate the error and test for solutions (i.e., we cannot access train.rds or test.rds).
Nevertheless, the below may fix your issue.
pred <- predict(gbm, newdata = test, type = "prob")
roc <- evalm(data.frame(pred, test$class))
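The error message says that three names are being assigned to an object of length two, which suggests that evalm expects three columns here: one probability column per class plus the observed labels. The sketch below uses synthetic data to compare the two shapes without calling MLeval; probs and obs are hypothetical stand-ins for the output of predict(..., type = "prob") and for test$class.

```r
set.seed(5627)
n <- 6
p_yes <- runif(n)

# Stand-in for predict(gbm, newdata = test, type = "prob"):
# one probability column per class level.
probs <- data.frame(no = 1 - p_yes, yes = p_yes)

# Stand-in for the observed labels in the test set.
obs <- factor(sample(c("no", "yes"), n, replace = TRUE),
              levels = c("no", "yes"))

# Shape built in the question: a single probability column plus labels.
one_class <- data.frame(pred = probs[, "yes"], obs = obs)

# Shape built in the answer: both probability columns plus labels.
both_classes <- data.frame(probs, obs = obs)

print(ncol(one_class))
print(names(both_classes))
```

Note also that the question recodes only train$class to 'yes'/'no'. If test$class still holds 0/1, its labels would not match the probability column names, which may be worth checking before calling evalm.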

{caret}xgbTree model not running when weights included, runs fine without them

I have a dataset from which I can build an xgbTree model without weights with no problem, but once I include weights, even if the weights are all just 1, the model doesn't converge. I get the error
Something is wrong; all the RMSE metric values are missing:
and when I print the warnings, the last message is
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, ... : There were missing values in resampled performance measures.
This is a drive link to the RData file containing the info -- it was too big to print, and smaller samples didn't always reproduce the error.
It contains 3 objects: input_x, input_y, and wts. The last one is just a vector of 1s, but eventually it should be able to accept numbers on the interval (0,1), ideally. The code I used is shown below. Note the comment next to the weights argument that produces the error.
nrounds<-1000
tune_grid <- expand.grid(
nrounds = seq(from = 200, to = nrounds, by = 50),
eta = c(0.025, 0.05, 0.1, 0.3),
max_depth = c(2, 3, 4, 5),
gamma = 0,
colsample_bytree = 1,
min_child_weight = 1,
subsample = 1
)
tune_control <- caret::trainControl(
method = "cv",
number = 3,
verboseIter = FALSE,
allowParallel = TRUE
)
xgb_tune <- caret::train(
x = input_x,
y = input_y,
weights = wts, # If I remove this line, the code works fine. When included, even if just 1s, it throws an error.
trControl = tune_control,
tuneGrid = tune_grid,
method = "xgbTree",
verbose = TRUE
)
EDIT 13.10.2021. thanks to #waterpolo
The correct way to specify weights is via the weights argument to caret::train
xgb_tune <- caret::train(
x = input_x,
y = input_y,
weights = wts,
trControl = tune_control,
tuneGrid = tune_grid,
method = "xgbTree",
verbose = TRUE
)
see a more verbose answer here: Non-tree model error when using xgbTree method with Caret and weights to target variable when applying the varImp function
Old incorrect answer below:
According to the function source weights argument is called wts.
Line:
if (!is.null(wts))
xgboost::setinfo(x, 'weight', wts)
Running
xgb_tune <- caret::train(
x = input_x,
y = input_y,
wts = wts,
trControl = tune_control,
tuneGrid = tune_grid,
method = "xgbTree",
verbose = TRUE
)
should produce the desired result.
Just wanted to add #missuse response from another post (Non-tree model error when using xgbTree method with Caret and weights to target variable when applying the varImp function). The correct argument is weights .
Code:
xgb_tune <- caret::train(x = input_x,
y = input_y,
weights = wts,
trControl = tune_control,
tuneGrid = tune_grid,
method = "xgbTree",
verbose = TRUE
)
The other thing that I found was that I needed to use weights > 1 or I would receive the same error message as you. For example, if I used inverse weighting I would receive the same message as you. Hope this helps.
Thanks #missuse for the lovely response in the other thread!

Error with caret and summaryFunction mnLogLoss: columns consistent with 'lev'

I'm trying to use log loss as loss function for training with Caret, using the data from the Kobe Bryant shot selection competition of Kaggle.
This is my script:
library(caret)
data <- read.csv("./data.csv")
data$shot_made_flag <- factor(data$shot_made_flag)
data$team_id <- NULL
data$team_name <- NULL
train_data_kaggle <- data[!is.na(data$shot_made_flag),]
test_data_kaggle <- data[is.na(data$shot_made_flag),]
inTrain <- createDataPartition(y=train_data_kaggle$shot_made_flag,p=.8,list=FALSE)
train <- train_data_kaggle[inTrain,]
test <- train_data_kaggle[-inTrain,]
folds <- createFolds(train$shot_made_flag, k = 10)
ctrl <- trainControl(method = "repeatedcv", index = folds, repeats = 3, summaryFunction = mnLogLoss)
res <- train(shot_made_flag~., data = train, method = "gbm", preProc = c("zv", "center", "scale"), trControl = ctrl, metric = "logLoss", verbose = FALSE)
And this is the traceback of the error:
7: stop("'data' should have columns consistent with 'lev'")
6: ctrl$summaryFunction(testOutput, lev, method)
5: evalSummaryFunction(y, wts = weights, ctrl = trControl, lev = classLevels,
metric = metric, method = method)
4: train.default(x, y, weights = w, ...)
3: train(x, y, weights = w, ...)
2: train.formula(shot_made_flag ~ ., data = train, method = "gbm",
preProc = c("zv", "center", "scale"), trControl = ctrl, metric = "logLoss",
verbose = FALSE)
1: train(shot_made_flag ~ ., data = train, method = "gbm", preProc = c("zv",
"center", "scale"), trControl = ctrl, metric = "logLoss",
verbose = FALSE)
When I use defaultSummary as summaryFunction and specify no metric in train, it works, but it doesn't with mnLogLoss. I'm guessing it is expecting the data in a different format than what I am passing, but I can't find where the error is.
From the help file for defaultSummary:
To use twoClassSummary and/or mnLogLoss, the classProbs argument of trainControl should be TRUE. multiClassSummary can be used without class probabilities but some statistics (e.g. overall log loss and the average of per-class area under the ROC curves) will not be in the result set.
Therefore, I think you need to change your trainControl() to the following:
ctrl <- trainControl(method = "repeatedcv", index = folds, repeats = 3, summaryFunction = mnLogLoss, classProbs = TRUE)
If you do this and run your code you will get the following error:
Error: At least one of the class levels is not a valid R variable name; This will cause errors when class probabilities are generated because the variables names will be converted to X0, X1 . Please use factor levels that can be used as valid R variable names (see ?make.names for help).
You just need to change the 0/1 levels of shot_made_flag to something that can be a valid R variable name:
data$shot_made_flag <- ifelse(data$shot_made_flag == 0, "miss", "made")
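The renaming that the error message warns about can be reproduced with base R's make.names(). The short sketch below checks whether a set of factor levels would survive unchanged; is_valid_name is a hypothetical helper written for this illustration, not a caret function.

```r
levels_numeric <- c("0", "1")
levels_named <- c("miss", "made")

# A level is usable as a column name if make.names() leaves it untouched.
is_valid_name <- function(x) x == make.names(x)

# Numeric-looking levels get an "X" prefix, as in the error message.
print(make.names(levels_numeric))  # "X0" "X1"

print(is_valid_name(levels_numeric))
print(is_valid_name(levels_named))
```

Running this kind of check on levels(data$shot_made_flag) before calling train is one way to catch the problem early when classProbs = TRUE.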
With the above changes your code will look like this:
library(caret)
data <- read.csv("./data.csv")
data$shot_made_flag <- ifelse(data$shot_made_flag == 0, "miss", "made")
data$shot_made_flag <- factor(data$shot_made_flag)
data$team_id <- NULL
data$team_name <- NULL
train_data_kaggle <- data[!is.na(data$shot_made_flag),]
test_data_kaggle <- data[is.na(data$shot_made_flag),]
inTrain <- createDataPartition(y=train_data_kaggle$shot_made_flag,p=.8,list=FALSE)
train <- train_data_kaggle[inTrain,]
test <- train_data_kaggle[-inTrain,]
folds <- createFolds(train$shot_made_flag, k = 3)
ctrl <- trainControl(method = "repeatedcv", classProbs = TRUE, index = folds, repeats = 3, summaryFunction = mnLogLoss)
res <- train(shot_made_flag~., data = train, method = "gbm", preProc = c("zv", "center", "scale"), trControl = ctrl, metric = "logLoss", verbose = FALSE)

Why does caret's "parRF" lead to tuning and missing value errors not present with "rf"

I have a tidy dataset with no missing values and only numeric columns.
The dataset is both large and contains sensitive information, so I won't be able to provide a copy of it here, unfortunately.
I partition this data into training and testing sets with caret's createDataPartition:
idx <- createDataPartition(y = model_final$y, p = 0.6, list = FALSE )
training <- model_final[idx,]
testing <- model_final[-idx,]
x <- training[-ncol(training)]
y <- training$y
x1 <- testing[-ncol(testing)]
y1 <- testing$y
row.names(training) <- NULL
row.names(testing) <- NULL
row.names(x) <- NULL
row.names(y) <- NULL
row.names(x1) <- NULL
row.names(y1) <- NULL
I've been fitting and refitting Random Forest models on this data via randomForest on a regular basis:
rf <- randomForest(x = x, y = y, mtry = ncol(x), ntree = 1000,
corr.bias = T, do.trace = T, nPerm = 3)
I decided to see if I could get any better or faster results with train and the following model ran fine, but took about 2 hours:
rf_train <- train(y=y, x=x,
method='rf', tuneLength = 3,
trControl=trainControl(method='cv',number=10,
classProbs = TRUE)
)
I need to take an HPC approach to make this logistically feasible, so I tried
require(doParallel)
registerDoParallel(cores = 8)
rf_train <- train(y=y, x=x,
method='parRF', tuneGrid = data.frame(mtry = 3), na.action = na.omit,
trControl=trainControl(method='cv',number=10,
classProbs = TRUE, allowParallel = TRUE)
)
but regardless of whether I use tuneLength or tuneGrid, this leads to strange errors about missing values and tuning parameters:
Error in train.default(y = y, x = x, method = "parRF", tuneGrid = data.frame(mtry = 3), :
final tuning parameters could not be determined
In addition: Warning messages:
1: In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
2: In train.default(y = y, x = x, method = "parRF", tuneGrid = data.frame(mtry = 3), :
missing values found in aggregated results
I say this is weird both because there were no errors with method = "rf" and because I triple-checked to ensure there are no missing values.
I even get the same errors when completely omitting tuning options. I also tried toggling the na.action option on and off and changing "cv" to "repeatedcv".
I even get the same error with this ultra-simplified version:
rf_train <- train(y=y, x=x, method='parRF')
This seems to be caused by a bug in caret. See the answer to:
parRF on caret not working for more than one core
I just dealt with this same issue; loading foreach on each new cluster manually seems to work.
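A minimal sketch of what "loading foreach on each new cluster manually" might look like, assuming an explicit cluster is created with parallel::makeCluster and registered with doParallel. The worker count of 2 and the variable names cl and foreach_loaded are arbitrary choices for illustration, and this has not been verified against the parRF bug itself.

```r
library(parallel)
library(foreach)
library(doParallel)

# Create an explicit cluster and register it as the foreach backend.
cl <- makeCluster(2)
registerDoParallel(cl)

# Load foreach on every worker, as suggested in the answer above.
invisible(clusterEvalQ(cl, library(foreach)))

# Confirm that each worker now has the foreach namespace loaded.
foreach_loaded <- unlist(clusterEvalQ(cl, "foreach" %in% loadedNamespaces()))
print(foreach_loaded)

# The caret call from the question would go here, for example:
# rf_train <- train(y = y, x = x, method = "parRF")

# Release the workers when finished.
stopCluster(cl)
```

Creating the cluster object explicitly, rather than calling registerDoParallel(cores = 8), is what makes it possible to run code on the workers before training starts.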
