In the code below I train a neural network with cross-validation on the first 20,000 records of the dataset. The dataset contains 8 predictors.
First I split my data into 2 parts:
the first 20,000 rows (training set)
and the last 4,003 rows (out-of-sample test set).
I have done 2 runs:
run 1) a run with 3 predictors
run 2) a run with all 8 predictors (see code below).
Based on cross-validation within the 20,000 rows of the training set, the RMSE (for the optimal parameter setting) improves from 2.30 (run 1) to 2.11 (run 2).
However, when I test both models on the 4,003 rows of the out-of-sample test set, the RMSE improves only negligibly, from 2.64 (run 1) to 2.63 (run 2).
What can be concluded from this contradiction in the results?
Thanks!
### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
### Chapter 7: Non-Linear Regression Models
### Required packages: AppliedPredictiveModeling, caret, doMC (optional),
### earth, kernlab, lattice, nnet
################################################################################
library(caret)
### Load the data
mydata <- read.csv(file="data.csv", header=TRUE, sep=",")
validatiex <- mydata[20001:24003,c(1:8)]
validatiey <- mydata[20001:24003,9]
mydata <- mydata[1:20000,]
x <- mydata[,c(1:8)]
y <- mydata[,9]
parti <- createDataPartition(y, times = 1, p=0.8, list = FALSE)
x_train <- x[parti,]
x_test <- x[-parti,]
y_train <- y[parti]
y_test <- y[-parti]
set.seed(100)
indx <- createFolds(y_train, returnTrain = TRUE)
ctrl <- trainControl(method = "cv", index = indx)
## train neural net:
nnetGrid <- expand.grid(decay = c(.1),
                        size = c(5, 15, 30),
                        bag = FALSE)
set.seed(100)
nnetTune <- train(x = x_train, y = y_train,
                  method = "avNNet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 30 * (ncol(x_train) + 1) + 30 + 1,
                  maxit = 1000,
                  repeats = 25,
                  allowParallel = FALSE)
nnetTune
plot(nnetTune)
predictions <- predict(nnetTune, validatiex, type = "raw")
rmse <- sqrt(mean((validatiey - predictions)^2))  # out-of-sample RMSE
print(rmse)
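A sketch, not part of the original code: if the run-1 fit was produced with the same ctrl/folds and is stored in a separate object, say nnetTune3 (a hypothetical name), caret::resamples() puts the fold-level RMSEs of both runs side by side, which helps judge whether the cross-validation improvement from 2.30 to 2.11 is larger than the fold-to-fold noise.
resamps <- resamples(list(run1_3preds = nnetTune3, run2_8preds = nnetTune))
summary(resamps)        # per-fold RMSE distributions for both runs
summary(diff(resamps))  # paired differences across the shared CV folds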
I am trying to use the DEoptim package to tune gradient boosting (gbm) with caret. When I launch the job, I get an error: Error in checkForRemoteErrors(val): 4 nodes produced errors; first error: Stopping
I am working with R version 4.2.0 on Linux CentOS 7.
Maybe I am doing something wrong in my code or with the cores + R.
Tuning parameters:
interaction.depth, n.trees, shrinkage, n.minobsinnode
I provide a few lines of my code:
library(caret)
library(DEoptim)
library(parallel)
fitControl <- trainControl(method = "repeatedcv",number = 5,repeats = 3)
#Differential
# Set parameter settings for search algorithm
max_iter <- 5 # maximum number of iterations
pop_size <- 10 # population size
# Create custom function for assessing solutions
eval_function_XGBoost_Linear <- function(x, data, train_settings) {
  x1 <- x[1]; x2 <- x[2]; x3 <- x[3]; x4 <- x[4]
  suppressWarnings(
    XGBoost_Linear_model <- caret::train(isMut ~ .,
                                         data = data,
                                         method = "gbm",
                                         trControl = train_settings,
                                         verbose = FALSE,
                                         silent = 1,
                                         tuneGrid = expand.grid(
                                           interaction.depth = x1,
                                           n.trees = x2,
                                           shrinkage = x3,
                                           n.minobsinnode = x4
                                         )
    )
  )
  return(XGBoost_Linear_model$results$Accuracy) # Accuracy
}
I define minimum and maximum values for each input:
interaction_min_max <- c(1,9)
ntrees_min_max <- c(1500,2000)
shrinkage <- c(0.1,0.1)
minobsinnode <- c(20,20)
When I run the differential evolution algorithm, I have the error described above.
set.seed(1)
n_cores <- detectCores()-1
DE_T0 <- Sys.time()
# Run differential evolution algorithm
DE_model_XGBoost_Linear <- DEoptim::DEoptim(
  fn = eval_function_XGBoost_Linear,
  lower = c(interaction_min_max[1], ntrees_min_max[1], shrinkage[1], minobsinnode[1]),
  upper = c(interaction_min_max[2], ntrees_min_max[2], shrinkage[2], minobsinnode[2]),
  control = DEoptim.control(
    NP = pop_size,          # population size
    itermax = max_iter,     # maximum number of iterations
    CR = 0.5,               # probability of crossover
    storepopfreq = 1,       # store every population
    parallelType = 1        # run parallel processing
  ),
  data = rose,
  train_settings = fitControl
)
DE_T1 <- Sys.time()
DE_T1-DE_T0
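A debugging sketch, based on assumptions rather than a confirmed fix: running DEoptim serially (parallelType = 0) lets the real error from caret/gbm surface instead of being hidden behind checkForRemoteErrors; inside the objective, gbm's integer-valued parameters should be rounded, and the returned value should be something to minimize, since DEoptim minimizes fn. The object and function names below reuse those from the question.
eval_function_debug <- function(x, data, train_settings) {
  fit <- caret::train(isMut ~ .,
                      data = data,
                      method = "gbm",
                      trControl = train_settings,
                      verbose = FALSE,
                      tuneGrid = expand.grid(interaction.depth = round(x[1]),
                                             n.trees = round(x[2]),
                                             shrinkage = x[3],
                                             n.minobsinnode = round(x[4])))
  1 - max(fit$results$Accuracy)   # DEoptim minimizes, so return an error rate
}
DE_debug <- DEoptim::DEoptim(
  fn = eval_function_debug,
  lower = c(interaction_min_max[1], ntrees_min_max[1], shrinkage[1], minobsinnode[1]),
  upper = c(interaction_min_max[2], ntrees_min_max[2], shrinkage[2], minobsinnode[2]),
  control = DEoptim.control(NP = pop_size, itermax = max_iter, CR = 0.5,
                            parallelType = 0),  # serial run: the real error is not masked
  data = rose,
  train_settings = fitControl)
If parallelType = 1 is restored later, DEoptim.control() also accepts packages and parVar arguments (for example packages = c("caret", "gbm")) so the worker nodes have what they need; whether that alone resolves the error is an assumption that would need checking against the actual data.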
I am working with the R programming language. I am trying to learn how to make a "confusion matrix" for multiclass variables (e.g. How to construct the confusion matrix for a multi class variable).
Suppose I generate some data and fit a decision tree model:
#load libraries
library(rpart)
library(caret)
#generate data
a <- rnorm(1000, 10, 10)
b <- rnorm(1000, 10, 5)
d <- rnorm(1000, 5, 10)
group_1 <- sample( LETTERS[1:3], 1000, replace=TRUE, prob=c(0.33,0.33,0.34) )
e <- data.frame(a, b, d, group_1)
e$group_1 <- as.factor(e$group_1)
#split data into train and test set
trainIndex <- createDataPartition(e$group_1, p = .8,
                                  list = FALSE,
                                  times = 1)
training <- e[trainIndex,]
test <- e[-trainIndex,]
fitControl <- trainControl(## 5-fold CV
                           method = "repeatedcv",
                           number = 5,
                           ## repeated once
                           repeats = 1)
#fit decision tree model
TreeFit <- train(group_1 ~ ., data = training,
                 method = "rpart2",
                 trControl = fitControl)
From here, I am able to store the results into a "confusion matrix":
pred <- predict(TreeFit,test)
table_example <- table(pred,test$group_1)
This satisfies my requirements - but this "table" requires me to manually calculate the different accuracy metrics of "A", "B" and "C" (as well as the total accuracy).
My question: Is it possible to use the caret::confusionMatrix() command for this problem?
e.g.
pred <- predict(TreeFit, test, type = "prob")
labels_example <- as.factor(ifelse(pred[,2]>0.5, "1", "0"))
con <- confusionMatrix(labels_example, test$group_1)
This way, I would be able to directly access the accuracy measurements from the confusion matrix. E.g. metric = con$overall[1]
Thanks
Is this what you're looking for?
pred <- predict(TreeFit, test)
con <- confusionMatrix(test$group_1, pred)
con
con$overall[1]
Same output as in:
table(test$group_1, pred)
Plus accuracy metrics.
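As a possible follow-up to the answer above (a small sketch using the con object just created): the multiclass confusionMatrix already exposes one-vs-all statistics per class, so the per-class metrics for A, B and C do not have to be computed by hand.
con$overall["Accuracy"]              # overall accuracy
con$byClass                          # one-vs-all statistics for each class
con$byClass[, "Balanced Accuracy"]   # one value per class (A, B, C)
con$table                            # the underlying contingency table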
I am trying to train a couple of ML models using the trainControl and train functions in caret, but I always get the same error, which just says:
Error: Stopping
without giving any more details.
The problem is the same for gbm and ranger, so I am wondering if it has something to do with a conflict of the packages I am also using in my code.
library(ggplot2)
library(lattice)
library(caret)
library(rlang)
library(tidyverse)
library(Matrix)
library(glmnet)
library(iterators)
library(parallel)
library(doParallel) # parallel processing.
registerDoParallel(cores=16)
library(randomForest)
library(gbm)
library(ranger)
library(data.table)
library(smooth)
data<- data.frame(A=seq(as.Date("2019-01-01"), by=1, len=100),B=as.numeric(runif(100, 50, 150)),C=as.numeric(runif(100, 50, 150)))
# define data sets
data_training<-data[1:60,]
data_test<-data[(60+1):nrow(data),]
# creating sampling seeds
set.seed(123)
n=nrow(data_training)
tuneLength.num <- 5
seeds <- vector(mode = "list", length = n) # creates an empty list of length n
for(i in 1:(n-1)){ # choose tuneLength.num random samples from 1 to 1000
seeds[[i]] <- sample.int(1000, tuneLength.num)
}
# For the last model:
seeds[[n]] <- sample.int(1000, 10)
# Define TimeControl for training and fitting:
trainingTimeControl <- trainControl(method = "timeslice",
                                    initialWindow = 25,
                                    horizon = 1,
                                    fixedWindow = TRUE,
                                    returnResamp = "all",
                                    allowParallel = TRUE,
                                    seeds = seeds,
                                    savePredictions = TRUE)
gbm.mod <- caret::train(B ~ . - A,
                        data = data_training,
                        method = "gbm",
                        distribution = "gaussian",
                        trControl = trainingTimeControl,
                        tuneLength = tuneLength.num,
                        metric = "RMSE")
EDIT: The following code works just fine:
gbm <- gbm(formula = B ~ . - A,
           distribution = "gaussian",
           data = data_training,
           keep.data = TRUE)
It would be great if anyone has an idea what is going on here. The code is working fine with svmRadial.
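Two things worth checking, sketched below under stated assumptions (this is not a confirmed diagnosis of the "Error: Stopping"): with method = "timeslice" the number of resamples is set by initialWindow and horizon rather than by nrow(data), and every element of seeds except the last must be at least as long as the number of tuning combinations train evaluates (if I recall caret's default gbm grid correctly, tuneLength = 5 varies interaction.depth and n.trees, i.e. 25 combinations, while the vectors above have length 5). Rerunning with allowParallel = FALSE also tends to surface the real error message.
slices <- createTimeSlices(seq_len(nrow(data_training)),
                           initialWindow = 25, horizon = 1, fixedWindow = TRUE)
B <- length(slices$train)              # actual number of timeslice resamples
n_models <- tuneLength.num^2           # assumed size of the default gbm grid (depth x n.trees)
set.seed(123)
seeds <- c(lapply(seq_len(B), function(i) sample.int(1000, n_models)),
           list(sample.int(1000, 1)))  # last element: a single seed for the final fit
trainingTimeControl <- trainControl(method = "timeslice",
                                    initialWindow = 25,
                                    horizon = 1,
                                    fixedWindow = TRUE,
                                    returnResamp = "all",
                                    allowParallel = FALSE,  # makes the underlying error visible
                                    seeds = seeds,
                                    savePredictions = TRUE)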
In the summary output the MSE for cross-validation data is 0.1641124; however, it is 0.14977892 in the detailed Cross-Validation Metrics Summary. Are these not the same metric?
library(h2o)
h <- h2o.init()
data <- as.h2o(iris)
part <- h2o.splitFrame(data, 0.7, seed = 123)
train <- part[[1]]
test <- part[[2]]
m <- h2o.glm(x=2:5,y=1,train, nfolds = 10, seed = 123)
summary(m)
#...
#H2ORegressionMetrics: glm
#** Reported on cross-validation data. **
#** 10-fold cross-validation on training data (Metrics computed for combined
#holdout predictions) **
#MSE: ***0.1641124***
#RMSE: 0.4051079
#...
#Cross-Validation Metrics Summary:
# mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid cv_6_valid cv_7_valid cv_8_valid cv_9_valid
#...
# mse ***0.14977892*** 0.053578787 0.14102486 0.14244498 0.05266633 0.19028585 0.043878503 0.12635022 0.13820939 0.15831167 0.33359975
These two MSE values are calculated differently.
The first one (0.1641124) is calculated using all the predictions on the hold out sets during cross validation:
Create the model:
m <- h2o.glm(x = 2:5,
             y = 1,
             train,
             nfolds = 10,
             seed = 123,
             keep_cross_validation_predictions = TRUE,
             keep_cross_validation_fold_assignment = TRUE)
Extract the hold-out predictions:
preds <- as.data.frame(h2o.cross_validation_holdout_predictions(m))
Calculate the MSE:
mean((preds$predict - as.data.frame(train)$Sepal.Length)^2)
#output
0.1641125
whereas the lower MSE (0.14977892) is the average of the MSEs of the individual hold-out sets:
folds <- as.data.frame(h2o.cross_validation_fold_assignment(m))
library(tidyverse)
data.frame(preds = preds$predict,                        # hold-out predictions
           folds = folds$fold_assignment,                # fold assignment
           true = as.data.frame(train)$Sepal.Length) %>% # true values
  group_by(folds) %>%                                    # group by fold
  summarise(mse = mean((preds - true)^2)) %>%            # calculate MSE for each fold
  ungroup() %>%
  summarise(mse = mean(mse)) %>%                         # average them
  as.numeric
#output
0.1497789
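As a further check on this explanation (a small sketch reusing the preds and folds objects from above): the pooled MSE is the fold-size-weighted average of the per-fold MSEs, which is why it differs from their unweighted mean whenever the folds are of unequal size.
data.frame(preds = preds$predict,
           folds = folds$fold_assignment,
           true = as.data.frame(train)$Sepal.Length) %>%
  group_by(folds) %>%
  summarise(mse = mean((preds - true)^2), n = n()) %>%   # per-fold MSE and fold size
  summarise(weighted.mean(mse, n))                       # should recover the pooled value (~0.164)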
To reproduce, first run:
library(h2o)
h <- h2o.init()
data <- as.h2o(iris)
part <- h2o.splitFrame(data, 0.7, seed = 123)
train <- part[[1]]
test <- part[[2]]
I'm building a random forest on some data from work (which means I can't share the data; there are 15k observations), using the caret train function for cross-validation. The accuracy of the model is very low: 0.9%.
Here's the code I used:
set.seed(512)
n <- nrow(my_data)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k=5)
model <- train(ICNumber ~ ., tuneGrid = data.frame(mtry = c(32), min.node.size = 1, splitrule = "gini"),
               data = my_data, method = "ranger",
               trControl = trainControl(verboseIter = TRUE, savePredictions = T, index = my_folds))
print(model$resample)
--Edit
As Gilles noticed, the fold indices are wrongly constructed and training is done on 20% of the observations, but even if I fix this by adding returnTrain = T, I'm still getting near-zero accuracy.
--Edit
model$resample produces this:
Accuracy    Kappa        Resample
0.026823683 0.0260175246 Fold1
0.002615234 0.0019433907 Fold2
0.002301118 0.0017644472 Fold3
0.001637733 0.0007026352 Fold4
0.010187315 0.0094986595 Fold5
Now if I do the cross validation by hand like this:
set.seed(512)
n <- nrow(my_data)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k=5)
for (fold in my_folds) {
  train_data <- my_data[-fold,]
  test_data <- my_data[fold,]
  model <- train(ICNumber ~ ., tuneGrid = data.frame(mtry = c(32), min.node.size = 1, splitrule = "gini"),
                 data = train_data, method = "ranger",
                 trControl = trainControl(method = "none"))
  p <- predict(model, test_data)
  e <- ifelse(p == test_data$ICNumber, T, F)
  print(sum(e) / nrow(test_data))
}
I get the following accuracy:
[1] 0.743871
[1] 0.7566957
[1] 0.7380645
[1] 0.7390181
[1] 0.7311168
I was expecting to get about the same accuracy values. What am I doing wrong in train? Or is the manual prediction code wrong?
--Edit
Furthermore, this code works well on the Soybean data and I can reproduce the results from Gilles below.
--Edit
--Edit2
Here are some details about my data:
15493 obs. of 17 variables:
ICNumber is a string with 1531 different values; these are the classes
the other 16 variables are factors with 33 levels
--Edit2
--Edit3
My last experiment was to drop the observations for all the classes occurring fewer than 10 times; 12k observations of 396 classes remained. For this dataset, the manual and automatic cross-validation accuracies match...
--Edit3
It was a tricky one! ;-)
The error comes from a misuse of the index option in trainControl.
According to the help page, index should be:
a list with elements for each resampling iteration. Each list element is a vector of integers corresponding to the rows used for training at that iteration.
In your code you provided the integers corresponding to the rows that should be removed
from the training dataset instead of providing the integers corresponding to the
rows that should be used...
You can change that by using createFolds(train_indices, k=5, returnTrain = T) instead
of createFolds(train_indices, k=5).
Note also that internally, afaik, caret creates folds that are balanced relative
to the classes that you want to predict. So the code should ideally be more like:
createFolds(my_data[train_indices, "Class"], k=5, returnTrain = T), particularly
if the classes are not balanced...
Here is a reproducible example with the Soybean dataset
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
data(Soybean, package = "mlbench")
my_data <- droplevels(na.omit(Soybean))
Your code (the training data is here much smaller than expected; you use only 20% of the data, hence the lower accuracy).
You also get some warnings due to the absence of some classes in the training datasets (because of the class imbalance and the reduced training set).
set.seed(512)
n <- nrow(my_data)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k=5)
model <- train(Class ~ ., tuneGrid = data.frame(mtry = c(32), min.node.size = 1, splitrule = "gini"),
               data = my_data, method = "ranger",
               trControl = trainControl(verboseIter = F, savePredictions = T,
                                        index = my_folds))
#> Warning: Dropped unused factor level(s) in dependent variable: rhizoctonia-
#> root-rot.
#> Warning: Dropped unused factor level(s) in dependent variable: downy-
#> mildew.
print(model$resample)
#> Accuracy Kappa Resample
#> 1 0.7951002 0.7700909 Fold1
#> 2 0.5846868 0.5400131 Fold2
#> 3 0.8440980 0.8251373 Fold3
#> 4 0.8822222 0.8679453 Fold4
#> 5 0.8444444 0.8263563 Fold5
Corrected code, just with returnTrain = T (here you really use 80% of the data for training...)
set.seed(512)
n <- nrow(my_data)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k=5, returnTrain = T)
model <- train(Class ~ ., tuneGrid = data.frame(mtry = c(32), min.node.size = 1, splitrule = "gini"),
               data = my_data, method = "ranger",
               trControl = trainControl(verboseIter = F, savePredictions = T,
                                        index = my_folds))
print(model$resample)
#> Accuracy Kappa Resample
#> 1 0.9380531 0.9293371 Fold1
#> 2 0.8750000 0.8583687 Fold2
#> 3 0.9115044 0.9009814 Fold3
#> 4 0.8660714 0.8505205 Fold4
#> 5 0.9107143 0.9003825 Fold5
To be compared with your loop. There are still some small differences, so maybe there is still something that I don't understand.
set.seed(512)
n <- nrow(my_data)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k=5)
for (fold in my_folds) {
  train_data <- my_data[-fold,]
  test_data <- my_data[fold,]
  model <- train(Class ~ ., tuneGrid = data.frame(mtry = c(32), min.node.size = 1, splitrule = "gini"),
                 data = train_data, method = "ranger",
                 trControl = trainControl(method = "none"))
  p <- predict(model, test_data)
  e <- ifelse(p == test_data$Class, T, F)
  print(sum(e) / nrow(test_data))
}
#> [1] 0.9380531
#> [1] 0.875
#> [1] 0.9115044
#> [1] 0.875
#> [1] 0.9196429
Created on 2018-03-09 by the reprex package (v0.2.0).
To expand on the excellent answer by Gilles: apart from the mistake in specifying the indexes used for testing and training, to get a fully reproducible model for algorithms that involve some stochastic process, like random forest, you should specify the seeds argument in trainControl. The length of this argument should equal the number of resamples + 1 (for the final model):
library(caret)
library(mlbench)
data(Sonar)
set.seed(512)
n <- nrow(Sonar)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k = 5, returnTrain = T)
model <- train(Class ~ .,
               tuneGrid = data.frame(mtry = c(32),
                                     min.node.size = 1,
                                     splitrule = "gini"),
               data = Sonar,
               method = "ranger",
               trControl = trainControl(verboseIter = F,
                                        savePredictions = T,
                                        index = my_folds,
                                        seeds = rep(512, 6))) # this is the important part
model$resample
#output
Accuracy Kappa Resample
1 0.8536585 0.6955446 Fold1
2 0.8095238 0.6190476 Fold2
3 0.8536585 0.6992665 Fold3
4 0.7317073 0.4786127 Fold4
5 0.8372093 0.6681367 Fold5
Now let's do the resampling manually:
for (fold in my_folds) {
  train_data <- Sonar[fold,]
  test_data <- Sonar[-fold,]
  model <- train(Class ~ .,
                 tuneGrid = data.frame(mtry = c(32),
                                       min.node.size = 1,
                                       splitrule = "gini"),
                 data = train_data,
                 method = "ranger",
                 trControl = trainControl(method = "none",
                                          seeds = 512)) # use the same seeds as above
  p <- predict(model, test_data)
  e <- ifelse(p == test_data$Class, T, F)
  print(sum(e) / nrow(test_data))
}
#output
[1] 0.8536585
[1] 0.8095238
[1] 0.8536585
[1] 0.7317073
[1] 0.8372093
@semicolo, if you can reproduce this example on the Sonar data set but not with your own data, then the problem is in the data set, and any further insight will require investigating the data in question.
It looks like the train function transforms the class column into a factor; in my dataset a lot (about 20%) of the classes have fewer than 4 observations. When splitting the set by hand, the factor is constructed after the split, so each factor level has at least one observation.
But during the automatic cross-validation, the factor is constructed on the full dataset, and once the splits are done some factor levels have no observations. This seems to somehow mess up the accuracy. This probably calls for a new, different question; thanks to Gilles and missuse for their help.
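A minimal sketch of the filtering described in Edit3 (the object name my_data and the class column ICNumber are assumed from the question): drop the classes that occur fewer than 10 times and rebuild the factor before calling train, so every fold can contain every remaining level.
class_counts <- table(my_data$ICNumber)                         # observations per class
keep <- names(class_counts)[class_counts >= 10]                 # classes with at least 10 obs
my_data_filtered <- my_data[my_data$ICNumber %in% keep, ]
my_data_filtered$ICNumber <- factor(my_data_filtered$ICNumber)  # levels rebuilt after filtering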