Error when using (caret) to train an mlp model - r

I used caret to train an mlp model with this code.
library(datasets)
library(MASS)
library(caret)
DP = caret::createDataPartition(Boston$medv, p=0.75, list = F)
train = Boston[DP,]
test = Boston[-DP,]
colnames(train) = colnames(Boston)
colnames(test) = colnames(Boston)
mlp = caret::train(medv ~., data = Boston, method = "mlp", trControl = trainControl(method = "cv", number = 3),
tuneGrid = expand.grid(size = 1:3), linOut = T, metric = "RMSE")
Yp = caret::predict.train(mlp, test[,1:13])
I got this error message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures.
Can anyone help me understand why I get this error?

The R-squared value is NA for some of the resamples; you can check the output:
set.seed(111)
mlp = caret::train(medv ~., data = Boston, method = "mlp",
trControl = trainControl(method = "cv", number = 3),
tuneGrid = expand.grid(size = 1:3), linOut = T, metric = "RMSE")
mlp$results
size RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 1 9.152376 NaN 6.640184 0.9213123 NA 0.6877405
2 2 14.353732 0.000965434 12.274448 8.3227673 NA 8.6994894
3 3 12.701064 NaN 10.988850 3.2658958 NA 3.6549478
Note that even for the configurations that do produce a value, the R-squared is very low. There are two problems with the model: 1) the network size may be too small, and 2) the data is not scaled, so the predictions collapse to a single value and R-squared becomes meaningless:
Yp = caret::predict.train(mlp, test[,1:13])
table(Yp)
Yp
20.0358009338379
125
Try something like this:
mlp = caret::train(medv ~., data = Boston, method = "mlp",
trControl = trainControl(method = "cv", number = 3),
preProcess = c("center","scale"),
tuneGrid = expand.grid(size = 3:5), linOut = T, metric = "RMSE")
mlp
Multi-Layer Perceptron
506 samples
13 predictor
Pre-processing: centered (13), scaled (13)
Resampling: Cross-Validated (3 fold)
Summary of sample sizes: 337, 338, 337
Resampling results across tuning parameters:
size RMSE Rsquared MAE
3 7.926669 0.3291762 5.619198
4 6.976707 0.4913297 5.130273
5 6.894459 0.5188481 5.040821
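As a follow-up check, here is a minimal sketch that refits the same model on the training split and evaluates it on the held-out test set (the exact numbers will vary with the random partition from createDataPartition):
set.seed(111)
mlp = caret::train(medv ~., data = train, method = "mlp",
                   trControl = trainControl(method = "cv", number = 3),
                   preProcess = c("center","scale"),
                   tuneGrid = expand.grid(size = 3:5), linOut = T, metric = "RMSE")
# medv is column 14 of Boston, so columns 1:13 are the predictors
Yp = predict(mlp, test[,1:13])
sqrt(mean((test$medv - Yp)^2))  # test-set RMSE
table(round(Yp))                # predictions now vary instead of collapsing to one value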

Related

Hyperparameters not changing results from random forest regression trees

I am trying to tune the hyperparameters of a random forest regression model and all of the accuracy measures are exactly the same, regardless of changes to hyperparameters. I've tested the same code on the "diamonds" dataset and have been able to reproduce the problem. Here is my code:
train = diamonds[,c(1, 5, 8:10)]
x = c(1:6)
folds = sample(x,size = nrow(diamonds), replace = T)
rf_grid = expand.grid(.mtry = c(2:4),
.splitrule = "variance",
.min.node.size = 20)
set.seed(105)
model <- train(train[, c(2:5)],
train$carat,
method="ranger",
importance = "impurity",
metric = "RMSE",
tuneGrid = rf_grid,
trControl = trainControl(method="cv",
index=folds,
search = "random"),
num.trees = 10,
tuneLength = 10)
results1 <- as.data.frame(model$results)
results1$ntree <- 10
results1$sample.size <- nrow(train)
saveRDS(model, "sample_model.rds")
write.csv(results1, "sample_model.csv", row.names = FALSE)
Here's what I get for the results (all three mtry values give identical RMSE, Rsquared, and MAE):
What the heck?
UPDATE:
I reduced the sample size to 1,000 to allow for faster processing and got different results, though still all identical to each other. Code:
train = diamonds[,c(1, 5, 8:10)]
train = train[c(1:1000),]
x = c(1:6)
folds = sample(x,size = nrow(train), replace = T)
rf_grid = expand.grid(.mtry = c(2:4),
.splitrule = "variance",
.min.node.size = 20)
set.seed(105)
model <- train(train[, c(2:5)],
train$carat,
method="ranger",
importance = "impurity",
metric = "RMSE",
tuneGrid = rf_grid,
trControl = trainControl(method="cv",
index=folds,
search = "random"),
num.trees = 10,
tuneLength = 10)
results1 <- as.data.frame(model$results)
results1$ntree <- 10
results1$sample.size <- nrow(train)
saveRDS(model, "sample_model2.rds")
write.csv(results1, "sample_model2.csv", row.names = FALSE)
Results (again identical for every mtry value):
This seems to be an issue with your cross-validation folds. When I run your code and look at the output of model, it says:
Summary of sample sizes: 1, 1, 1, 1, 1, 1, ...
indicating that each fold only has a training-set size of 1. This happens because the index argument of trainControl expects a list of training-row indices, one element per resampling iteration, not a vector of fold labels.
I think if you define folds like this, it will work more like you're expecting it to:
folds <- createFolds(train$carat, k = 6, returnTrain = TRUE)
The results then look like this:
Random Forest
1000 samples
4 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 832, 833, 835, 834, 834, 832, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
2 0.01582362 0.9933839 0.00985451
3 0.01601980 0.9932625 0.00994588
4 0.01567161 0.9935624 0.01018242
Tuning parameter 'splitrule' was held constant at a value
of variance
Tuning parameter 'min.node.size' was held constant
at a value of 20
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 4, splitrule
= variance and min.node.size = 20.
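For completeness, a minimal sketch of how the corrected folds plug back into the original call (everything except trControl is taken from the question; caret treats each element of the index list as the training rows for one resample):
folds <- createFolds(train$carat, k = 6, returnTrain = TRUE)  # list of 6 training-row index vectors
set.seed(105)
model <- train(train[, c(2:5)],
               train$carat,
               method = "ranger",
               importance = "impurity",
               metric = "RMSE",
               tuneGrid = rf_grid,
               trControl = trainControl(method = "cv", index = folds),
               num.trees = 10)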

error with mnLogloss for multinomial classifier using caret/gbm

I am trying to build a multinomial classifier. It seems to work, and I am able to generate a plot of minimized logLoss vs. boosting iterations; however, I am having trouble extracting the error value. This is the error I get when I run the mnLogLoss function.
Error in mnLogLoss(predicted, lev = predicted$label) :
'data' should have columns consistent with 'lev'
The data has been partitioned into:
- training
- testing
In both, the column "label" contains the ground truth.
library(MLmetrics)
fitControl <- trainControl(method = "repeatedcv", number=10, repeats=3, verboseIter = FALSE,
savePredictions = TRUE, classProbs = TRUE, summaryFunction= mnLogLoss)
gbmGrid1 <- expand.grid(.interaction.depth = (1:3), .n.trees = (1:10)*20, .shrinkage = 0.01, .n.minobsinnode = 3)
system.time(
gbmFit1 <- train(label~., data = training, method = "gbm", trControl=fitControl,
verbose = 1, metric = "logLoss", tuneGrid = gbmGrid1)
)
gbmPredictions <- predict(gbmFit1, testing)
predicted <- cbind(gbmPredictions, testing)
mnLogLoss(predicted, lev = levels(predicted$label))
For mnLogLoss, the documentation says:
data: a data frame with columns ‘obs’ and ‘pred’ for the observed
and predicted outcomes. For metrics that rely on class
probabilities, such as ‘twoClassSummary’, columns should also
include predicted probabilities for each class. See the
‘classProbs’ argument to ‘trainControl’.
So it's not asking for the training data; the data argument is just the input described above, so I use some simulated data:
library(caret)
df = data.frame(label=factor(sample(c("a","b"),100,replace=TRUE)),
matrix(runif(500),ncol=50))
training = df[1:50,]
testing = df[1:50,]
fitControl <- trainControl(method = "repeatedcv", number=10, repeats=3, verboseIter = FALSE,
savePredictions = TRUE, classProbs = TRUE, summaryFunction= mnLogLoss)
gbmGrid1 <- expand.grid(.interaction.depth = (1:3), .n.trees = (1:10)*20, .shrinkage = 0.01, .n.minobsinnode = 3)
gbmFit1 <- train(label~., data = training, method = "gbm", trControl = fitControl,
verbose = 1, metric = "logLoss", tuneGrid = gbmGrid1)
Then we put obs and pred together, with the last two columns being the predicted probabilities of each class:
predicted <- data.frame(obs=testing$label,
pred=predict(gbmFit1, testing),
predict(gbmFit1, testing,type="prob"))
head(predicted)
obs pred a b
1 b a 0.5506054 0.4493946
2 b a 0.5107631 0.4892369
3 a b 0.4859799 0.5140201
4 b a 0.5090264 0.4909736
5 b b 0.4545746 0.5454254
6 a a 0.6211514 0.3788486
mnLogLoss(predicted, lev = levels(predicted$obs))
logLoss
0.6377392
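Since summaryFunction = mnLogLoss was set in trainControl, the resampled log loss for every tuning combination is also stored on the fitted object, which may be the easiest way to extract the error value (a quick sketch using gbmFit1 from above):
# cross-validated logLoss per tuning combination
gbmFit1$results[, c("n.trees", "interaction.depth", "shrinkage", "logLoss")]
# logLoss of the selected model only
merge(gbmFit1$bestTune, gbmFit1$results)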

Identical results from random forest hyperparameter tuning in R

I am trying out hyperparameter tuning of random forest in R using library(randomForest). Why am I getting identical results for different values of the hyperparameter maxnodes?
I am using the Titanic dataset that was obtained from Kaggle. I applied grid search only on the hyperparameter 'mtry' and the results gave me different values of Accuracy and Kappa for each 'mtry'.
However, when I tried to search for the best 'maxnode' value, all of them return the same value of Accuracy and Kappa.
This is for tuning 'mtry'
library(randomForest)
control <- trainControl(
method = "cv",
number = 10,
search = "grid"
)
tuneGrid <- expand.grid(.mtry = c(1:10))
rf_mtry <- train(train_X,
train_Y,
method = "rf",
metric = "Accuracy",
tuneGrid = tuneGrid,
trControl = control,
importance = TRUE,
nodesize = 14,
ntree = 300
)
This is for tuning 'maxnodes'
mtry_best <- rf_mtry$bestTune$mtry
store_maxnode <- list()
tuneGrid <- expand.grid(.mtry = mtry_best)
for (maxnodes in c(2:20)) {
set.seed(1234)
rf_maxnode <- train(train_X,
train_Y,
method = "rf",
metric = "Accuracy",
tuneGrid = tuneGrid,
trControl = control,
importance = TRUE,
nodesize = 5,
maxnodes = maxnodes,
ntree = 300
)
current_iteration <- toString(maxnodes)
store_maxnode[[current_iteration]] <- rf_maxnode
}
results_mtry <- resamples(store_maxnode)
summary(results_mtry)
I expected to see different values of Accuracy and Kappa for different maxnode, but they were identical.

Using F1 score metric in KNN through caret package

I am attempting to use the F1 score to determine which k value gives the best model for its given purpose. The model is built through the train function in the caret package.
Example dataset: https://www.kaggle.com/lachster/churndata
My current code includes the following (as the function for f1 score):
f1 <- function(data, lev = NULL, model = NULL) {
precision <- posPredValue(data$pred, data$obs, positive = "pass")
recall <- sensitivity(data$pred, data$obs, positive = "pass")
f1_val <- (2*precision*recall) / (precision + recall)
names(f1_val) <- c("F1")
f1_val
}
The following as train control:
train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
summaryFunction = f1, search = "grid")
And the following as my final execution of the train command:
x <- train(CHURN ~. ,
data = experiment,
method = "knn",
tuneGrid = expand.grid(.k=1:30),
metric = "F1",
trControl = train.control)
Please note that the model is attempting to predict the churn rate from a set of telco customers.
The execution returns the following result:
Something is wrong; all the F1 metric values are missing:
F1
Min. : NA
1st Qu.: NA
Median : NA
Mean :NaN
3rd Qu.: NA
Max. : NA
NA's :30
Error in train.default(x, y, weights = w, ...) : Stopping
In addition: Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
EDIT: Thanks to help from missuse, my code now looks like the following, but it returns an error.
levels(exp2$CHURN) <- make.names(levels(factor(exp2$CHURN)))
library(mlbench)
train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
summaryFunction = prSummary, classProbs = TRUE)
knn_fit <- train(CHURN ~., data = exp2, method = "knn", trControl =
train.control, preProcess = c("center", "scale"), tuneLength = 15, metric = "F")
The error:
Error in trainControl(method = "repeatedcv", number = 10, repeats = 3, :
object 'prSummary' not found
Caret contains a summary function, prSummary, that provides the F1 score. Full example:
library(caret)
library(mlbench)
data(Sonar)
train.control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
summaryFunction = prSummary, classProbs = TRUE)
knn_fit <- train(Class ~., data = Sonar, method = "knn",
trControl=train.control ,
preProcess = c("center", "scale"),
tuneLength = 15,
metric = "F")
knn_fit
#output
k-Nearest Neighbors
208 samples
60 predictor
2 classes: 'M', 'R'
Pre-processing: centered (60), scaled (60)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 187, 188, 187, 188, 187, 187, ...
Resampling results across tuning parameters:
k AUC Precision Recall F
5 0.3582687 0.7936713 0.9065657 0.8414592
7 0.4985709 0.7758271 0.8883838 0.8239438
9 0.6632328 0.7484092 0.8853535 0.8089210
11 0.7426320 0.7151175 0.8676768 0.7814297
13 0.7388742 0.6883105 0.8646465 0.7641392
15 0.7594436 0.6787983 0.8467172 0.7520524
17 0.7583071 0.6909693 0.8527778 0.7616448
19 0.7702208 0.6913001 0.8585859 0.7644433
21 0.7642698 0.6962528 0.8707071 0.7719442
23 0.7652370 0.6945755 0.8707071 0.7696863
25 0.7606508 0.6929364 0.8707071 0.7683987
27 0.7454728 0.6916762 0.8676768 0.7669464
29 0.7551679 0.6900416 0.8707071 0.7676640
31 0.7603099 0.6935720 0.8828283 0.7749490
33 0.7614621 0.6938805 0.8770202 0.7728923
F was used to select the optimal model using the largest value.
The final value used for the model was k = 5.
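If you also want precision, recall, and F1 for predictions on new data (rather than the resampled values above), confusionMatrix with mode = "prec_recall" reports them. A minimal sketch reusing the Sonar fit; in practice you would predict on a held-out set, not the data used for training:
preds <- predict(knn_fit, newdata = Sonar)
confusionMatrix(preds, Sonar$Class, mode = "prec_recall")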

Caret train() mlp missing values

I'm trying to train a multi-layer perceptron for non-linear regression on a dataset, but it keeps giving me this warning:
Warning message:
In nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
There were missing values in resampled performance measures.
I tried doing it again with a built-in R dataset to see if my data was the problem, but I keep getting the warning and I have no idea why.
I already tried adding or removing tuneGrid, trControl, and even linOut.
It always gives me constant predictions for some reason.
library(datasets)
library(MASS)
library(caret)
DP = caret::createDataPartition(Boston$medv, p=0.75, list = F)
train = Boston[DP,]
test = Boston[-DP,]
colnames(train) = colnames(Boston)
colnames(test) = colnames(Boston)
mlp = caret::train(medv ~., data = Boston, method = "mlp", trControl = trainControl(method = "cv", number = 3),
tuneGrid = expand.grid(size = 1:3), linOut = T, metric = "RMSE")
Yp = caret::predict.train(mlp, test[,1:13])
