Good day to all
I have a problem in my code when I use RF hyperparameter tuning. The algorithm (simulated annealing) gives me an RMSE value of around 4000. I am not sure where it performed this calculation, because in the code I did not specify any grid or values. The code is below; it was originally for SVM, but I edited it for RF.
svm_obj <- function(param, maximize = FALSE) {
  mod <- train(Effort ~ ., data = tr,
               method = "rf",
               preProc = c("center", "scale", "zv"),
               metric = "MAE",
               trControl = ctrl,
               tuneGrid = data.frame(mtry = 10^(param[1])))
               ##, sigma = 10^(param[2])))
  if(maximize)
    -getTrainPerf(mod)[, "TrainRMSE"] else
    getTrainPerf(mod)[, "TrainRMSE"]
}
## Simulated annealing from base R
set.seed(45642)
san_res <- optim(par = c(0), fn = svm_obj, method = "SANN",
                 control = list(maxit = 10))
The answer I get is:

$value
[1] 4487.821

$counts
function gradient 
      10       NA 

$convergence
[1] 0

$message
NULL
mtry is the number of variables randomForest tries at each split, and it cannot be larger than the number of predictor columns.
Let's fit a model that doesn't work:
mod <- train(Effort ~ ., data = tr,
             method = "rf",
             preProc = c("center", "scale", "zv"),
             metric = "RMSE",
             trControl = ctrl,
             tuneGrid = data.frame(mtry = ncol(tr)+1))
You see a warning:
There were 11 warnings (use warnings() to see them)
And the results and the final model disagree:
mod$results
mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 12 2.203626 0.9159377 1.880211 0.979291 0.1025424 0.7854203
mod$finalModel
Call:
randomForest(x = x, y = y, mtry = param$mtry)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 10
Mean of squared residuals: 6.088637
% Var explained: 82.7
So although you specified mtry = 12, the default randomForest function brings it down to 10, which is sensible. But if you run this through optim, you are never going to get something that makes sense once you go over ncol(tr) - 1.
If you do not have that many variables, it's much easier to use tuneLength or to specify the mtry values to try. Let's start with the results you expect from just specifying mtry:
library(caret)
library(randomForest)
# note: repeats is only used with method = "repeatedcv"
ctrl = trainControl(method = "cv", repeats = 3)
# use mtcars
tr = mtcars
# rename mpg to Effort so your function works
colnames(tr)[1] = "Effort"
TG = data.frame(mtry = 1:10)
mod <- train(Effort ~ ., data = tr,
             method = "rf",
             preProc = c("center", "scale", "zv"),
             metric = "RMSE",
             trControl = ctrl,
             tuneGrid = TG)
mod$results
mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 1 2.725944 0.8895202 2.384232 1.350958 0.1592133 1.183400
2 2 2.498627 0.9012830 2.192391 1.276950 0.1375281 1.200895
3 3 2.506250 0.8849148 2.168141 1.229709 0.1562686 1.173904
4 4 2.503700 0.8891134 2.170633 1.249049 0.1478276 1.168831
5 5 2.480846 0.8837597 2.148329 1.250889 0.1540574 1.191068
6 6 2.459317 0.8872104 2.126315 1.196187 0.1554423 1.128351
7 7 2.493736 0.8736399 2.165258 1.158384 0.1766644 1.082568
8 8 2.530672 0.8768546 2.199941 1.224193 0.1681286 1.127467
9 9 2.547422 0.8757422 2.196878 1.222921 0.1704655 1.130261
10 10 2.514791 0.8720315 2.184602 1.224944 0.1740556 1.093184
Maybe something like 6 is the best mtry you can get.
Well, I don't know what values you are calling your function with, so it's hard to spot the error.
However, mtry is a value that needs to be between 1 and the number of predictor columns, while it looks to me like you might be setting it to 10 to the power of something, which is most likely out of bounds :)
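As a minimal sketch of one way to stay in bounds (reusing the `tr` and `ctrl` names from the question's code, which are assumptions here): clamp and round the continuous value that optim proposes before handing it to train().

```r
# Sketch (assumes tr and ctrl from the question): map optim's continuous
# parameter onto a valid integer mtry in [1, number of predictors].
rf_obj <- function(param, maximize = FALSE) {
  p <- ncol(tr) - 1                            # predictors (Effort is the outcome)
  mtry <- min(max(round(10^param[1]), 1), p)   # clamp into [1, p]
  mod <- train(Effort ~ ., data = tr,
               method = "rf",
               metric = "RMSE",
               trControl = ctrl,
               tuneGrid = data.frame(mtry = mtry))
  if (maximize) -getTrainPerf(mod)[, "TrainRMSE"] else
    getTrainPerf(mod)[, "TrainRMSE"]
}
```

This keeps the simulated-annealing search unchanged while guaranteeing every evaluated mtry is legal.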
@Javed @Wolf
Please mind that it DOES make sense to tune mtry.
mtry affects the correlation between the trees you grow (and therefore the variance of the model), and it is very problem-specific, so the optimal value can change with the number of features and the correlations between them.
It is, however, quite useless to tune bias-related hyperparameters (max depth and other stopping/pruning rules). It takes a lot of time and the effects are often not significant.
Related
I have built a glm model using R package "caret" and I'd like to assess its performance using RMSE. I notice that the two RMSEs are different and I wonder which one is the real RMSE?
Also, how can I extract each fold (5*5=25 in total) of the training data, test data, and predicted data (based on the optimal tuned parameter) from the model?
library(caret)
data("mtcars")
set.seed(100)
mydata = mtcars[, -c(8, 9)]
model_glm <- train(
  hp ~ .,
  data = mydata,
  method = "glm",
  metric = "RMSE",
  preProcess = c('center', 'scale'),
  trControl = trainControl(
    method = "repeatedcv",
    number = 5,
    repeats = 5,
    verboseIter = TRUE
  )
)
GLM.pred = predict(model_glm, subset(mydata, select = -hp))
RMSE(pred = GLM.pred, obs = mydata$hp)  # 21.89
model_glm$results$RMSE                  # 32.16
With the following code, I get:

sqrt(mean((mydata$hp - predict(model_glm)) ^ 2))
[1] 21.89127

This shows that the 21.89 figure is what RMSE(pred = GLM.pred, obs = mydata$hp) computes: the RMSE from predicting back onto the original data.
Also, you have:

model_glm$resample$RMSE
 [1] 28.30254 34.69966 25.55273 25.29981 40.78493 31.91056 25.05311 41.83223 26.68105 23.64629 27.98388 25.98827 45.26982 37.28214
[15] 38.13617 31.14513 23.35353 42.05274 34.04761 35.17733 28.28838 35.89639 21.42580 45.17860 29.13998

which is the RMSE for each of the 25 CV folds. Also, we have:

mean(model_glm$resample$RMSE)
[1] 32.16515
So the 32.16 is the average of the RMSEs over the 25 CV folds, while the 21.89 is the RMSE on the original dataset.
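To get at the per-fold data the question asks about, one option (a sketch, assuming the same mydata as above) is to re-fit with savePredictions = "all" and read model$pred, which stores the held-out prediction, observation, row index, and fold label for every resample:

```r
# Sketch: re-fit, saving all held-out predictions (assumes mydata from above).
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5,
                     savePredictions = "all")
model2 <- train(hp ~ ., data = mydata, method = "glm", metric = "RMSE",
                preProcess = c("center", "scale"), trControl = ctrl)
head(model2$pred)                     # columns: pred, obs, rowIndex, Resample
f1 <- subset(model2$pred, Resample == "Fold1.Rep1")
test_data  <- mydata[f1$rowIndex, ]   # held-out rows of that fold
train_data <- mydata[-f1$rowIndex, ]  # rows used for training in that fold
```

Repeating the subset over the 25 Resample labels recovers all 5*5 train/test/prediction splits.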
I am trying to build a machine-learning model in R. Training the model does not work well in general.
# Logistic regression, multiclass
for (i in 1:30) {
  # split data into training/test
  trainPhyIndex <- createDataPartition(subs_phy$Methane, p = 10/17, list = FALSE)
  trainingPhy <- subs_phy[trainPhyIndex,]
  testingPhy  <- subs_phy[-trainPhyIndex,]
  # pre-process predictor values
  trainXphy <- trainingPhy[, names(trainingPhy) != "Methane"]
  preProcValuesPhy <- preProcess(x = trainXphy, method = c("center", "scale"))
  # using repeated CV to avoid over-fitting
  fitControlPhyGLMNET <- trainControl(method = "repeatedcv",
                                      number = 10,
                                      repeats = 4,
                                      savePredictions = "final",
                                      classProbs = TRUE)
  fit_glmnet_phy <- train(Methane ~ .,
                          trainingPhy,
                          method = "glmnet",
                          tuneGrid = expand.grid(
                            .alpha  = 0.1,
                            .lambda = 0.00023),
                          metric = "Accuracy",
                          trControl = fitControlPhyGLMNET)
  pred_glmnet_phy <- predict(fit_glmnet_phy, testingPhy)
  # get the confusion matrix to see the accuracy value
  u <- union(pred_glmnet_phy, testingPhy$Methane)
  t <- table(factor(pred_glmnet_phy, u), factor(testingPhy$Methane, u))
  accu_glmnet_phy <- confusionMatrix(t)
  # accu_glmnet_phy <- confusionMatrix(pred_glmnet_phy, testingPhy$Methane)
  glmnetstatsPhy[(nrow(glmnetstatsPhy)+1),] = accu_glmnet_phy$overall
}
glmnetstatsPhy
glmnetstatsPhy
The program always stops on the fit_glmnet_phy <- train(Methane ~ ., ... command and shows:

Metric Accuracy not applicable for regression models

I have no idea about this error. I have also attached the type of Methane.
Try normalizing the input columns and mapping the output column to a factor. This helped me resolve a similar issue.
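As a sketch of the second suggestion (reusing the subs_phy name from the question's code): caret runs classification only when the outcome is a factor, which is why a numeric or character Methane column triggers the regression error above.

```r
# Sketch (assumes the question's subs_phy data frame): a numeric or
# character outcome makes train() fit a regression, hence the error.
subs_phy$Methane <- factor(subs_phy$Methane)
# classProbs = TRUE additionally requires syntactically valid level names:
levels(subs_phy$Methane) <- make.names(levels(subs_phy$Methane))
```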
I'm building a random forest on some data from work (which means I can't share the data; there are 15k observations), using the caret train function for cross-validation. The accuracy of the model is very low: 0.9%.
Here's the code I used:
set.seed(512)
n <- nrow(my_data)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k = 5)
model <- train(ICNumber ~ .,
               tuneGrid = data.frame(mtry = c(32), min.node.size = 1, splitrule = "gini"),
               data = my_data, method = "ranger",
               trControl = trainControl(verboseIter = TRUE, savePredictions = T, index = my_folds))
print(model$resample)
print(model$resample)
--Edit
As Gilles noticed, the fold indices are wrongly constructed and training is done on 20% of the observations. But even if I fix this by adding returnTrain = T, I am still getting near-zero accuracy.
--Edit
model$resample produces this:
Accuracy    Kappa        Resample
0.026823683 0.0260175246 Fold1
0.002615234 0.0019433907 Fold2
0.002301118 0.0017644472 Fold3
0.001637733 0.0007026352 Fold4
0.010187315 0.0094986595 Fold5
Now if I do the cross validation by hand like this:
set.seed(512)
n <- nrow(my_data)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k=5)
for (fold in my_folds) {
  train_data <- my_data[-fold,]
  test_data  <- my_data[fold,]
  model <- train(ICNumber ~ .,
                 tuneGrid = data.frame(mtry = c(32), min.node.size = 1, splitrule = "gini"),
                 data = train_data, method = "ranger",
                 trControl = trainControl(method = "none"))
  p <- predict(model, test_data)
  e <- ifelse(p == test_data$ICNumber, T, F)
  print(sum(e) / nrow(test_data))
}
I get the following accuracy:
[1] 0.743871
[1] 0.7566957
[1] 0.7380645
[1] 0.7390181
[1] 0.7311168
I was expecting to get about the same accuracy values, what am I doing wrong in train? Or is the manual prediction code wrong?
--Edit
Furthermore, this code works well on the Soybean data and I can reproduce the results from Gilles below
--Edit
--Edit2
Here are some details about my data:
15493 obs. of 17 variables:
ICNumber is a string with 1531 different values; these are the classes
the other 16 variables are factors with 33 levels
--Edit2
--Edit3
My last experiment was to drop the observations for all classes occurring fewer than 10 times; 12k observations of 396 classes remained. For this dataset, the manual and automatic cross-validation accuracies match...
--Edit3
It was a tricky one ! ;-)
The error comes from a misuse of the index option in trainControl.
According to the help page, index should be:

a list with elements for each resampling iteration. Each list element is a vector of integers corresponding to the rows used for training at that iteration.

In your code you provided the integers corresponding to the rows that should be removed from the training dataset, instead of the integers corresponding to the rows that should be used...
You can change that by using createFolds(train_indices, k=5, returnTrain = T) instead of createFolds(train_indices, k=5).
Note also that internally, AFAIK, caret creates folds that are balanced relative to the classes you want to predict. So the code should ideally be more like createFolds(my_data[train_indices, "Class"], k=5, returnTrain = T), particularly if the classes are not balanced...
Here is a reproducible example with the Soybean dataset
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
data(Soybean, package = "mlbench")
my_data <- droplevels(na.omit(Soybean))
Your code (the training data here is much smaller than expected; you use only 20% of the data, hence the lower accuracy).
You also get some warnings due to the absence of some classes in the training datasets (because of the class imbalance and the reduced training set).
set.seed(512)
n <- nrow(my_data)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k = 5)
model <- train(Class ~ .,
               tuneGrid = data.frame(mtry = c(32), min.node.size = 1, splitrule = "gini"),
               data = my_data, method = "ranger",
               trControl = trainControl(verboseIter = F, savePredictions = T,
                                        index = my_folds))
#> Warning: Dropped unused factor level(s) in dependent variable: rhizoctonia-
#> root-rot.
#> Warning: Dropped unused factor level(s) in dependent variable: downy-
#> mildew.
print(model$resample)
#> Accuracy Kappa Resample
#> 1 0.7951002 0.7700909 Fold1
#> 2 0.5846868 0.5400131 Fold2
#> 3 0.8440980 0.8251373 Fold3
#> 4 0.8822222 0.8679453 Fold4
#> 5 0.8444444 0.8263563 Fold5
Corrected code, just with returnTrain = T (here you really use 80% of the data for training...)
set.seed(512)
n <- nrow(my_data)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k = 5, returnTrain = T)
model <- train(Class ~ .,
               tuneGrid = data.frame(mtry = c(32), min.node.size = 1, splitrule = "gini"),
               data = my_data, method = "ranger",
               trControl = trainControl(verboseIter = F, savePredictions = T,
                                        index = my_folds))
print(model$resample)
#> Accuracy Kappa Resample
#> 1 0.9380531 0.9293371 Fold1
#> 2 0.8750000 0.8583687 Fold2
#> 3 0.9115044 0.9009814 Fold3
#> 4 0.8660714 0.8505205 Fold4
#> 5 0.9107143 0.9003825 Fold5
To be compared to your loop. There are still some small differences so maybe there is still something that I don't understand.
set.seed(512)
n <- nrow(my_data)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k = 5)
for (fold in my_folds) {
  train_data <- my_data[-fold,]
  test_data  <- my_data[fold,]
  model <- train(Class ~ .,
                 tuneGrid = data.frame(mtry = c(32), min.node.size = 1, splitrule = "gini"),
                 data = train_data, method = "ranger",
                 trControl = trainControl(method = "none"))
  p <- predict(model, test_data)
  e <- ifelse(p == test_data$Class, T, F)
  print(sum(e) / nrow(test_data))
}
#> [1] 0.9380531
#> [1] 0.875
#> [1] 0.9115044
#> [1] 0.875
#> [1] 0.9196429
Created on 2018-03-09 by the reprex package (v0.2.0).
To expand on the excellent answer by Gilles: apart from the mistake in specifying the indexes used for testing and training, to get a fully reproducible model for algorithms that involve a stochastic process, like random forest, you should specify the seeds argument in trainControl. The length of this argument should equal the number of resamples + 1 (for the final model):
library(caret)
library(mlbench)
data(Sonar)

set.seed(512)
n <- nrow(Sonar)
train_indices <- sample(1:n)
my_folds <- createFolds(train_indices, k = 5, returnTrain = T)
model <- train(Class ~ .,
               tuneGrid = data.frame(mtry = c(32),
                                     min.node.size = 1,
                                     splitrule = "gini"),
               data = Sonar,
               method = "ranger",
               trControl = trainControl(verboseIter = F,
                                        savePredictions = T,
                                        index = my_folds,
                                        seeds = rep(512, 6))) # this is the important part
model$resample
#output
Accuracy Kappa Resample
1 0.8536585 0.6955446 Fold1
2 0.8095238 0.6190476 Fold2
3 0.8536585 0.6992665 Fold3
4 0.7317073 0.4786127 Fold4
5 0.8372093 0.6681367 Fold5
Now let's do the resampling manually:
for (fold in my_folds) {
  train_data <- Sonar[fold,]
  test_data  <- Sonar[-fold,]
  model <- train(Class ~ .,
                 tuneGrid = data.frame(mtry = c(32),
                                       min.node.size = 1,
                                       splitrule = "gini"),
                 data = train_data,
                 method = "ranger",
                 trControl = trainControl(method = "none",
                                          seeds = 512)) # use the same seed as above
  p <- predict(model, test_data)
  e <- ifelse(p == test_data$Class, T, F)
  print(sum(e) / nrow(test_data))
}
#output
[1] 0.8536585
[1] 0.8095238
[1] 0.8536585
[1] 0.7317073
[1] 0.8372093
@semicolo, if you can reproduce this example on the Sonar data set, but not with your own data, then the problem is in the data set, and any further insight will require investigating the data in question.
It looks like the train function transforms the class column into a factor. In my dataset there are a lot (about 20%) of classes that have fewer than 4 observations. When splitting the set by hand, the factor is constructed after the split, so every factor value has at least one observation.
But during the automatic cross-validation, the factor is constructed on the full dataset, and when the splits are done, some values of the factor have no observations. This seems to somehow mess up the accuracy. This probably calls for a new, different question; thanks to Gilles and missuse for their help.
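The factor-level effect described above can be reproduced with a tiny hypothetical example:

```r
# A rare class gets a level on the full data, but no observations in a split.
y <- factor(c(rep("a", 10), "rare"))   # "rare" occurs once
fold <- seq_along(y) != 11             # a split that drops the rare row
table(y[fold])                         # level "rare": 0 observations
table(droplevels(y[fold]))             # dropping empty levels after the split
```

Splitting by hand and then building the factor is equivalent to the droplevels() call; passing the full-data factor into train() keeps the empty levels.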
I have a data frame containing 499 observations and 1412 variables. I split the data frame into train and test sets and run the train set through caret's 5-fold cross-validation with the random forest method. My question is: how does the cross-validation with the random forest method choose values of mtry? If you look at the plot, for example, why doesn't the procedure choose 30 as the starting value of mtry?
To answer this one needs to check the train code for the rf model.
From the linked code it is clear that if grid search is specified caret will use caret::var_seq function to generate mtry.
mtry = caret::var_seq(p = ncol(x),
                      classification = is.factor(y),
                      len = len)
From the help for the function it can be seen that if the number of predictors is less than 500, a simple sequence of values of length len is generated between 2 and p. For larger numbers of predictors, the sequence is created using log2 steps.
so for example:
caret::var_seq(p = 1412,
               classification = T,
               len = 3)
#output
[1]    2   53 1412
If len = 1 is specified the defaults from the randomForest package are used:
mtry = if (!is.null(y) && !is.factor(y))
  max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x)))
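Evaluating those defaults for the dimensions in the question (1412 predictors) gives:

```r
# The randomForest defaults for p = 1412 predictors:
p <- 1412
floor(sqrt(p))        # classification default: 37
max(floor(p / 3), 1)  # regression default: 470
```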
If a random search is specified, then caret calculates mtry as:

unique(sample(1:ncol(x), size = len, replace = TRUE))

In other words, for your case:
unique(sample(1:1412 , size = 3, replace = TRUE))
#output
[1] 857 181 64
here is an example:
library(caret)
# some data
z <- matrix(rnorm(100000), ncol = 1000)
colnames(z) = paste0("V", 1:1000)
# specify model evaluation
ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 1)
# train
fit_rf <- train(V1 ~ .,
                data = z,
                method = "rf",
                tuneLength = 3,
                trControl = ctrl)
fit_rf$results
#output
mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 2 0.8030665 0.11101385 0.5889436 0.2824439 0.09644324 0.1650381
2 44 0.8146023 0.09481331 0.6014367 0.2821711 0.10082099 0.1665926
3 998 0.8420705 0.03190199 0.6375570 0.2503089 0.03205335 0.1550021
same mtry values as one would obtain by doing:
caret::var_seq(p = 999,
               classification = F,
               len = 3)
When random search is specified:
ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 1,
                     search = "random")
fit_rf <- train(V1 ~ .,
                data = z,
                method = "rf",
                tuneLength = 3,
                trControl = ctrl)
fit_rf$results
#output
mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 350 0.8571330 0.10195986 0.6214896 0.1637944 0.1385415 0.09904165
2 826 0.8644918 0.07775553 0.6286101 0.1725390 0.1264605 0.10587076
3 855 0.8636692 0.07025535 0.6232729 0.1754164 0.1332580 0.10438083
or some other random numbers obtained by:
unique(sample(1:999 , size = 3, replace = TRUE))
To fix the mtry to desired values it is best to provide your own search grid. A tutorial on how to do that and much more can be found here.
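A minimal sketch of such a grid (the specific values are arbitrary, and the z and ctrl objects from the example above are assumed):

```r
# Sketch: pin mtry to chosen values via tuneGrid (assumes z and ctrl above).
grid <- expand.grid(mtry = c(2, 30, 100))
fit_rf <- train(V1 ~ ., data = z, method = "rf",
                tuneGrid = grid, trControl = ctrl)
fit_rf$results$mtry   # evaluates exactly the values supplied
```

When tuneGrid is given, caret skips var_seq and the random search entirely.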
I am using the randomForest package for R to train a model for classification.
To compare it to other classifiers, I need a way to display all the information given by the rather verbose cross-validation method in Weka. Therefore, the R script should output something like [a] from Weka.
Is there a way to validate an R model via RWeka to produce those measures?
If not, how is a cross-validation on a random forest done purely in R?
Is it possible to use rfcv from the randomForest package here? I could not get it to work.
I do know that the out-of-bag error (OOB) used in randomForest is some kind of a cross-validation. But I need the full information for a suited comparison.
What I tried so far using R is [b]. However, the code also produces an error on my setup [c] due to missing values.
So, can you help me with the cross-validation?
Appendix
[a] Weka:
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 3059 96.712 %
Incorrectly Classified Instances 104 3.288 %
Kappa statistic 0.8199
Mean absolute error 0.1017
Root mean squared error 0.1771
Relative absolute error 60.4205 %
Root relative squared error 61.103 %
Coverage of cases (0.95 level) 99.6206 %
Mean rel. region size (0.95 level) 78.043 %
Total Number of Instances 3163
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0,918 0,028 0,771 0,918 0,838 0,824 0,985 0,901 sick-euthyroid
0,972 0,082 0,991 0,972 0,982 0,824 0,985 0,998 negative
Weighted Avg. 0,967 0,077 0,971 0,967 0,968 0,824 0,985 0,989
=== Confusion Matrix ===
a b <-- classified as
269 24 | a = sick-euthyroid
80 2790 | b = negative
[b] Code so far:
library(randomForest) # randomForest() and rfImpute()
library(foreign)      # read.arff()
library(caret)        # train() and trainControl()

nTrees <- 2 # 200
myDataset <- 'D:\\your\\directory\\SE.arff' # http://hakank.org/weka/SE.arff
mydb = read.arff(myDataset)
mydb.imputed <- rfImpute(class ~ ., data = mydb, ntree = nTrees, importance = TRUE)
myres.rf <- randomForest(class ~ ., data = mydb.imputed, ntree = nTrees, importance = TRUE)
summary(myres.rf)

# specify type of resampling: 10-fold CV
fitControl <- trainControl(method = "rf", number = 10, repeats = 10)
set.seed(825)
# deal with NA | NULL values in categorical variables
#mydb.imputed[is.na(mydb.imputed)] <- 1
#mydb.imputed[is.null(mydb.imputed)] <- 1
rfFit <- train(class ~ ., data = mydb.imputed,
               method = "rf",
               trControl = fitControl,
               ## this last option is actually one
               ## for randomForest() that passes through
               ntree = nTrees, importance = TRUE, na.action = na.omit)
rfFit
The error is:
Error in names(resamples) <- gsub("^\\.", "", names(resamples)) : 
  attempt to set an attribute on NULL

Using traceback():

5: nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, 
       method = models, ppOpts = preProcess, ctrl = trControl, lev = classLevels, 
       ...)
4: train.default(x, y, weights = w, ...)
3: train(x, y, weights = w, ...)
2: train.formula(class ~ ., data = mydb.imputed, method = "rf", 
       trControl = fitControl, ntree = nTrees, importance = TRUE, 
       sampsize = rep(minorityClassNum, 2), na.action = na.omit)
1: train(class ~ ., data = mydb.imputed, method = "rf", trControl = fitControl, 
       ntree = nTrees, importance = TRUE, sampsize = rep(minorityClassNum, 
       2), na.action = na.omit) at #39
[c] R version information via sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: i386-w64-mingw32/i386 (32-bit)
[...]
other attached packages:
[1] e1071_1.6-3 caret_6.0-30 ggplot2_1.0.0 foreign_0.8-61 randomForest_4.6-7 DMwR_0.4.1
[7] lattice_0.20-29 JGR_1.7-16 iplots_1.1-7 JavaGD_0.6-1 rJava_0.9-6
I don't know about Weka, but I have done randomForest modelling in R, and I have always used the predict function for this.
Try using this function:

predict(Model, data)

Bind the output to the original values and use the table command to get the confusion matrix.
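A sketch of that, reusing the myres.rf model and mydb.imputed data from the question's code (those names are assumptions taken from the question):

```r
# Sketch (assumes myres.rf and mydb.imputed from the question's code).
pred <- predict(myres.rf, mydb.imputed)
table(predicted = pred, actual = mydb.imputed$class)   # confusion matrix
# caret::confusionMatrix(pred, mydb.imputed$class) additionally reports
# accuracy, kappa, sensitivity/specificity, etc., comparable to Weka's summary.
```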