Perform cross-validation on randomForest with R

I am using the randomForest package for R to train a model for classification.
To compare it to other classifiers, I need a way to display all the information given by the rather verbose cross-validation method in Weka. Therefore, the R script should output something like [a] from Weka.
Is there a way to validate an R model via RWeka to produce those measures?
If not, how is a cross-validation on a random forest done purely in R?
Is it possible to use rfcv from the randomForest package here? I could not get it to work.
I do know that the out-of-bag (OOB) error used in randomForest is a kind of cross-validation, but I need the full set of measures for a proper comparison.
What I have tried so far in R is shown in [b]. However, the code also produces an error on my setup (see [c]), apparently due to missing values.
So, can you help me with the cross-validation?
Appendix
[a] Weka:
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 3059 96.712 %
Incorrectly Classified Instances 104 3.288 %
Kappa statistic 0.8199
Mean absolute error 0.1017
Root mean squared error 0.1771
Relative absolute error 60.4205 %
Root relative squared error 61.103 %
Coverage of cases (0.95 level) 99.6206 %
Mean rel. region size (0.95 level) 78.043 %
Total Number of Instances 3163
=== Detailed Accuracy By Class ===
TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
0,918 0,028 0,771 0,918 0,838 0,824 0,985 0,901 sick-euthyroid
0,972 0,082 0,991 0,972 0,982 0,824 0,985 0,998 negative
Weighted Avg. 0,967 0,077 0,971 0,967 0,968 0,824 0,985 0,989
=== Confusion Matrix ===
a b <-- classified as
269 24 | a = sick-euthyroid
80 2790 | b = negative
[b] Code so far:
library(randomForest) #randomForest() and rfImpute()
library(foreign) # read.arff()
library(caret) # train() and trainControl()
nTrees <- 2 # 200
myDataset <- 'D:\\your\\directory\\SE.arff' # http://hakank.org/weka/SE.arff
mydb = read.arff(myDataset)
mydb.imputed <- rfImpute(class ~ ., data=mydb, ntree = nTrees, importance = TRUE)
myres.rf <- randomForest(class ~ ., data=mydb.imputed, ntree = nTrees, importance = TRUE)
summary(myres.rf)
# specify type of resampling to 10-fold CV
fitControl <- trainControl(method = "rf",number = 10,repeats = 10)
set.seed(825)
# deal with NA | NULL values in categorical variables
#mydb.imputed[is.na(mydb.imputed)] <- 1
#mydb.imputed[is.null(mydb.imputed)] <- 1
rfFit <- train(class ~ ., data = mydb.imputed,
               method = "rf",
               trControl = fitControl,
               ## these last options are passed through
               ## to the underlying randomForest() call
               ntree = nTrees, importance = TRUE, na.action = na.omit)
rfFit
The error is:
Error in names(resamples) <- gsub("^\\.", "", names(resamples)) :
attempt to set an attribute on NULL
Using traceback()
5: nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo,
method = models, ppOpts = preProcess, ctrl = trControl, lev = classLevels,
...)
4: train.default(x, y, weights = w, ...)
3: train(x, y, weights = w, ...)
2: train.formula(class~ ., data = mydb.imputed, method = "rf",
trControl = fitControl, ntree = nTrees, importance = TRUE,
sampsize = rep(minorityClassNum, 2), na.action = na.omit)
1: train(class~ ., data = mydb.imputed, method = "rf", trControl = fitControl,
ntree = nTrees, importance = TRUE, sampsize = rep(minorityClassNum,
2), na.action = na.omit) at #39
[c] R version information via sessionInfo()
R version 3.1.0 (2014-04-10)
Platform: i386-w64-mingw32/i386 (32-bit)
[...]
other attached packages:
[1] e1071_1.6-3 caret_6.0-30 ggplot2_1.0.0 foreign_0.8-61 randomForest_4.6-7 DMwR_0.4.1
[7] lattice_0.20-29 JGR_1.7-16 iplots_1.1-7 JavaGD_0.6-1 rJava_0.9-6

I don't know about Weka, but I have done randomForest modelling in R, and I have always used the predict() function for this.
Try using this function:
predict(Model, data)
Bind the output to the original class values and use the table() function to get the confusion matrix.
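For example, here is a minimal sketch of a manual stratified 10-fold cross-validation with randomForest along those lines, assuming the imputed data frame mydb.imputed and the class column from the question (the fold handling and the ntree value are illustrative, not part of the original code):
library(randomForest)  # randomForest()
library(caret)         # createFolds(), confusionMatrix()

set.seed(825)
k <- 10
# stratified fold assignment based on the class labels
folds <- createFolds(mydb.imputed$class, k = k)

# collect the held-out predictions from every fold
cv_pred <- factor(rep(NA, nrow(mydb.imputed)),
                  levels = levels(mydb.imputed$class))
for (i in seq_len(k)) {
  test_idx <- folds[[i]]
  fit <- randomForest(class ~ ., data = mydb.imputed[-test_idx, ],
                      ntree = 200)  # illustrative ntree value
  cv_pred[test_idx] <- predict(fit, newdata = mydb.imputed[test_idx, ])
}

# confusion matrix plus accuracy, Kappa, sensitivity, specificity, ...
confusionMatrix(cv_pred, mydb.imputed$class)
confusionMatrix() from caret reports accuracy, Kappa, sensitivity and specificity alongside the confusion matrix, which correspond to several lines of the Weka summary.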

Related

How to extract RMSE from models built using caret?

I have built a glm model using the R package caret and I'd like to assess its performance using RMSE. I notice that the two RMSEs computed below are different, and I wonder which one is the real RMSE.
Also, how can I extract each fold (5*5 = 25 in total) of the training data, test data, and predicted data (based on the optimal tuned parameter) from the model?
library(caret)
data("mtcars")
set.seed(100)
mydata = mtcars[, -c(8,9)]
model_glm <- train(
  hp ~ .,
  data = mydata,
  method = "glm",
  metric = "RMSE",
  preProcess = c('center', 'scale'),
  trControl = trainControl(
    method = "repeatedcv",
    number = 5,
    repeats = 5,
    verboseIter = TRUE
  )
)
GLM.pred = predict(model_glm, subset(mydata, select = -hp))
RMSE(pred = GLM.pred, obs = mydata$hp) # 21.89
model_glm$results$RMSE # 32.16
With the following code, I get:
sqrt(mean((mydata$hp - predict(model_glm)) ^ 2))
[1] 21.89127
This shows that RMSE(pred = GLM.pred, obs = mydata$hp) is simply the RMSE of the in-sample predictions on the original data.
Also, you have
model_glm$resample$RMSE
[1] 28.30254 34.69966 25.55273 25.29981 40.78493 31.91056 25.05311 41.83223 26.68105 23.64629 27.98388 25.98827 45.26982 37.28214
[15] 38.13617 31.14513 23.35353 42.05274 34.04761 35.17733 28.28838 35.89639 21.42580 45.17860 29.13998
which is the RMSE for each of the 25 CV folds. Also, we have
mean(model_glm$resample$RMSE)
32.16515
So, 32.16 is the average RMSE over the 25 CV folds (the cross-validated estimate), while 21.89 is the RMSE on the original (training) dataset.
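To also get at the fold-level data the question asks about, one option (a sketch, re-fitting the same model with savePredictions switched on) is to keep the hold-out predictions and split them by resample:
library(caret)

set.seed(100)
model_glm <- train(
  hp ~ .,
  data = mydata,
  method = "glm",
  metric = "RMSE",
  preProcess = c("center", "scale"),
  trControl = trainControl(
    method = "repeatedcv",
    number = 5,
    repeats = 5,
    savePredictions = "final"  # keep the hold-out predictions
  )
)

# one row per held-out observation: obs, pred, rowIndex and Resample (e.g. "Fold1.Rep1")
head(model_glm$pred)

# held-out (test) rows and their predictions, split per fold
fold_preds <- split(model_glm$pred, model_glm$pred$Resample)

# the training-row indices used in each resample
str(model_glm$control$index, max.level = 1)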

R: Caret package: Brier Score

I want to perform a logistic regression with the train() function from the caret package. My model looks something like this:
model <- train(Y ~ .,
               data = train_data,
               family = "binomial",
               method = "glmnet")
With the resulting model, I want to make predictions:
pred <- predict(model, newdata = test_data, s = "lambda.min", type = "prob")
Now, I want to evaluate how good the model predictions are in comparison with the actual test data. For this I know how to obtain the ROC and AUC. However, I am also interested in the Brier score. The formula for the Brier score is almost identical to the MSE.
The problem I am facing is that the type argument in predict only allows "prob" (or "class", which I am not interested in), which gives the probability of a prediction being a ONE (e.g. 0.64) and the complementary probability of it being a ZERO (e.g. 0.36). For the Brier score, however, I need one probability estimate per prediction that carries the information of both (e.g. a value above 0.5 would indicate a 1 and a value below 0.5 would indicate a 0).
I have not found any way to obtain the Brier score within the caret package. I am aware that with cv.glmnet (from the glmnet package) the predict function allows type = "response", which would solve my problem. However, as a personal preference I would like to stay with the caret package.
Thanks for the help!
If we go by the Wikipedia definition of the Brier score:
The most common formulation of the Brier score is
BS = (1/N) * Σ_{t=1}^{N} (f_t - o_t)^2
where f_t is the probability that was forecast, o_t the actual outcome of the event (0 or 1), and N is the number of forecasting instances.
In R, if your label is a factor, then the logistic regression will always predict with respect to the 2nd level, meaning you just calculate the probability and 0/1 with respect to that. For example:
library(caret)
idx = sample(nrow(iris),100)
data = iris
data$Species = factor(ifelse(data$Species=="versicolor","v","o"))
levels(data$Species)
[1] "o" "v"
In this case, o is 0 and v is 1.
train_data = data[idx,]
test_data = data[-idx,]
model <- train(Species ~.,data = train_data,family = "binomial",method = "glmnet")
pred <- predict(model, newdata = test_data, type = "prob")
So we can see the class probabilities:
head(pred)
o v
1 0.8367885 0.16321154
2 0.7970508 0.20294924
3 0.6383656 0.36163437
4 0.9510763 0.04892370
5 0.9370721 0.06292789
To calculate the score:
f_t = pred[,2]
o_t = as.numeric(test_data$Species)-1
mean((f_t - o_t)^2)
[1] 0.32
I use the Brier score to tune my models in caret for binary classification. I make sure that the "positive" class is the second factor level, which is what you get by default when the response is coded 0/1. Then I created this master summary function, based on caret's own suite of summary functions, to return all the metrics I want to see:
BigSummary <- function (data, lev = NULL, model = NULL) {
  pr_auc <- try(MLmetrics::PRAUC(data[, lev[2]],
                                 ifelse(data$obs == lev[2], 1, 0)),
                silent = TRUE)
  brscore <- try(mean((data[, lev[2]] - ifelse(data$obs == lev[2], 1, 0)) ^ 2),
                 silent = TRUE)
  rocObject <- try(pROC::roc(ifelse(data$obs == lev[2], 1, 0), data[, lev[2]],
                             direction = "<", quiet = TRUE), silent = TRUE)
  if (inherits(pr_auc, "try-error")) pr_auc <- NA
  if (inherits(brscore, "try-error")) brscore <- NA
  rocAUC <- if (inherits(rocObject, "try-error")) {
    NA
  } else {
    rocObject$auc
  }
  tmp <- unlist(e1071::classAgreement(table(data$obs,
                                            data$pred)))[c("diag", "kappa")]
  out <- c(Acc = tmp[[1]],
           Kappa = tmp[[2]],
           AUCROC = rocAUC,
           AUCPR = pr_auc,
           Brier = brscore,
           Precision = caret:::precision.default(data = data$pred,
                                                 reference = data$obs,
                                                 relevant = lev[2]),
           Recall = caret:::recall.default(data = data$pred,
                                           reference = data$obs,
                                           relevant = lev[2]),
           F = caret:::F_meas.default(data = data$pred, reference = data$obs,
                                      relevant = lev[2]))
  out
}
Now I can simply pass summaryFunction = BigSummary in trainControl and then metric = "Brier", maximize = FALSE in the train call.
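A minimal usage sketch, reusing the iris-based data frame from the earlier answer; the tuning settings here are only illustrative. Note that classProbs = TRUE is needed so the class-probability columns BigSummary expects are present:
library(caret)
# BigSummary also needs MLmetrics, pROC and e1071 installed

ctrl <- trainControl(method = "repeatedcv",
                     number = 5,
                     repeats = 2,
                     classProbs = TRUE,           # required for the AUC/Brier columns
                     summaryFunction = BigSummary)

set.seed(42)
fit <- train(Species ~ ., data = data,
             method = "glmnet",
             family = "binomial",
             metric = "Brier",                    # optimize the Brier score ...
             maximize = FALSE,                    # ... where smaller is better
             trControl = ctrl)

fit$results[, c("alpha", "lambda", "Brier", "AUCROC", "Acc")]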

unused argument in train function

Good day to all
I have a problem in my code when doing RF hyperparameter tuning. The algorithm (simulated annealing) gives me an RMSE value of around 4000. I am not sure where this value comes from, because I did not specify any grid or values in the code. The code is below; it was originally written for SVM, but I edited it for RF.
svm_obj <- function(param, maximize = FALSE) {
  mod <- train(Effort ~ ., data = tr,
               method = "rf",
               preProc = c("center", "scale", "zv"),
               metric = "MAE",
               trControl = ctrl,
               tuneGrid = data.frame(mtry = 10^(param[1])))
               ##, sigma = 10^(param[2])))
  if (maximize)
    -getTrainPerf(mod)[, "TrainRMSE"]
  else
    getTrainPerf(mod)[, "TrainRMSE"]
}
## Simulated annealing from base R
set.seed(45642)
san_res <- optim(par = c(0), fn = svm_obj, method = "SANN",
                 control = list(maxit = 10))
The answer I get is:
$value
[1] 4487.821
$counts
function gradient
10 NA
$convergence
[1] 0
$message
NULL
mtry is the number of variables randomForest considers at each split, and it cannot be more than the number of predictor columns.
Let's do a model that doesn't work:
mod <- train(Effort ~ ., data = tr,
             method = "rf",
             preProc = c("center", "scale", "zv"),
             metric = "RMSE",
             trControl = ctrl,
             tuneGrid = data.frame(mtry = ncol(tr) + 1))
You see a warning:
There were 11 warnings (use warnings() to see them)
And the results and the final model disagree:
mod$results
mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 12 2.203626 0.9159377 1.880211 0.979291 0.1025424 0.7854203
mod$finalModel
Call:
randomForest(x = x, y = y, mtry = param$mtry)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 10
Mean of squared residuals: 6.088637
% Var explained: 82.7
So although you specified mtry = 12, randomForest itself brings it down to 10 (the number of predictors), which is sensible. But if you run this through optim, you are never going to get something that makes sense once you go above ncol(tr) - 1.
If you do not have that many variables, it's much easier to use tuneLength or to specify the mtry values yourself. Let's start with the results you get by just specifying mtry:
library(caret)
library(randomForest)
ctrl = trainControl(method = "cv", repeats = 3)
# use mtcars
tr = mtcars
# set mpg to be Effort so your function works
colnames(tr)[1] = "Effort"
TG = data.frame(mtry = 1:10)
mod <- train(Effort ~ ., data = tr,
             method = "rf",
             preProc = c("center", "scale", "zv"),
             metric = "RMSE",
             trControl = ctrl,
             tuneGrid = TG)
mod$results
mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 1 2.725944 0.8895202 2.384232 1.350958 0.1592133 1.183400
2 2 2.498627 0.9012830 2.192391 1.276950 0.1375281 1.200895
3 3 2.506250 0.8849148 2.168141 1.229709 0.1562686 1.173904
4 4 2.503700 0.8891134 2.170633 1.249049 0.1478276 1.168831
5 5 2.480846 0.8837597 2.148329 1.250889 0.1540574 1.191068
6 6 2.459317 0.8872104 2.126315 1.196187 0.1554423 1.128351
7 7 2.493736 0.8736399 2.165258 1.158384 0.1766644 1.082568
8 8 2.530672 0.8768546 2.199941 1.224193 0.1681286 1.127467
9 9 2.547422 0.8757422 2.196878 1.222921 0.1704655 1.130261
10 10 2.514791 0.8720315 2.184602 1.224944 0.1740556 1.093184
Maybe something like 6 is the best mtry you can get.
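Alternatively, here is a short sketch using tuneLength instead of an explicit grid (same mtcars-based tr and ctrl as above), so caret picks a set of mtry values within the valid range by itself:
set.seed(45642)
mod_tl <- train(Effort ~ ., data = tr,
                method = "rf",
                preProc = c("center", "scale", "zv"),
                metric = "RMSE",
                trControl = ctrl,
                tuneLength = 5)  # caret chooses 5 candidate mtry values

mod_tl$bestTune  # mtry with the lowest cross-validated RMSE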
Well, I don't know what values you are calling your function with, so it's hard to spot the error.
However, mtry needs to be a value between 1 and the number of predictor columns, while it looks to me like you might be setting it to 10 to the power of something, which is most likely out of bounds :)
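If you do want to stay with optim, one option (a sketch, not the original code) is to clamp the parameter to the valid integer range inside the objective before handing it to train; the name rf_obj and the starting value are hypothetical:
rf_obj <- function(param, maximize = FALSE) {
  # map the unconstrained optim parameter onto a valid mtry:
  # an integer between 1 and the number of predictors (ncol(tr) - 1)
  mtry_val <- min(max(round(param[1]), 1), ncol(tr) - 1)
  mod <- train(Effort ~ ., data = tr,
               method = "rf",
               preProc = c("center", "scale", "zv"),
               metric = "MAE",
               trControl = ctrl,
               tuneGrid = data.frame(mtry = mtry_val))
  if (maximize) -getTrainPerf(mod)[, "TrainRMSE"] else getTrainPerf(mod)[, "TrainRMSE"]
}

set.seed(45642)
san_res <- optim(par = 5, fn = rf_obj, method = "SANN",
                 control = list(maxit = 10))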
@Javed @Wolf
Please mind that it DOES make sense to tune mtry.
mtry affects the correlation between the trees you grow (and therefore the variance of the model), and it is very problem specific, so the optimal value can change with the number of features you have and the correlations between them.
It is, however, usually not worth tuning bias-related hyperparameters (max depth and other stopping/pruning rules): it takes a lot of time and the effects are often not significant.
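For completeness, here is a small sketch of tuning mtry outside caret with randomForest's own tuneRF() helper, applied to the mtcars-based tr from the earlier answer (the stepFactor/improve settings are illustrative):
library(randomForest)

set.seed(45642)
x <- tr[, setdiff(names(tr), "Effort")]
y <- tr$Effort

# searches mtry up/down from the default by stepFactor until the
# OOB error stops improving by at least `improve`
tuned <- tuneRF(x, y,
                ntreeTry = 500,
                stepFactor = 1.5,
                improve = 0.01,
                trace = TRUE,
                plot = FALSE)
tuned  # matrix of mtry values and their OOB error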

Metric Accuracy not applicable for regression models

I am trying to investigate my model in R with machine learning. Training the model in general does not work well.
# Logistic regression, multiclass
for (i in 1:30) {
  # split data into training/test
  trainPhyIndex <- createDataPartition(subs_phy$Methane, p = 10/17, list = FALSE)
  trainingPhy <- subs_phy[trainPhyIndex,]
  testingPhy <- subs_phy[-trainPhyIndex,]
  # Pre-process predictor values
  trainXphy <- trainingPhy[, names(trainingPhy) != "Methane"]
  preProcValuesPhy <- preProcess(x = trainXphy, method = c("center", "scale"))
  # using repeated CV to avoid over-fitting
  fitControlPhyGLMNET <- trainControl(method = "repeatedcv",
                                      number = 10,
                                      repeats = 4,
                                      savePredictions = "final",
                                      classProbs = TRUE)
  fit_glmnet_phy <- train(Methane ~ .,
                          trainingPhy,
                          method = "glmnet",
                          tuneGrid = expand.grid(
                            .alpha = 0.1,
                            .lambda = 0.00023),
                          metric = "Accuracy",
                          trControl = fitControlPhyGLMNET)
  pred_glmnet_phy <- predict(fit_glmnet_phy, testingPhy)
  # Get the confusion matrix to see the accuracy value
  u <- union(pred_glmnet_phy, testingPhy$Methane)
  t <- table(factor(pred_glmnet_phy, u), factor(testingPhy$Methane, u))
  accu_glmnet_phy <- confusionMatrix(t)
  # accu_glmnet_phy <- confusionMatrix(pred_glmnet_phy, testingPhy$Methane)
  glmnetstatsPhy[(nrow(glmnetstatsPhy)+1),] = accu_glmnet_phy$overall
}
glmnetstatsPhy
The program always stops at the fit_glmnet_phy <- train(Methane ~ ., ...) command and shows:
Metric Accuracy not applicable for regression models
I have no idea what this error means. I have also attached a screenshot showing the type of Methane.
Try normalizing the input columns and mapping the output column to a factor. This helped me resolve a similar issue.
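A sketch of what that mapping could look like for the question's variables, assuming Methane actually takes a small set of discrete values; the factor levels must be valid R names for classProbs = TRUE to work (make.names handles that). If Methane is genuinely continuous, keep it numeric and use a regression metric instead; fitControlPhyReg below is a new control object introduced only for that case:
# Option 1: classification -> the response must be a factor with valid level names
subs_phy$Methane <- factor(make.names(subs_phy$Methane))
levels(subs_phy$Methane)

# Option 2: regression -> keep Methane numeric, use RMSE and drop classProbs
fitControlPhyReg <- trainControl(method = "repeatedcv",
                                 number = 10,
                                 repeats = 4,
                                 savePredictions = "final")
fit_glmnet_phy <- train(Methane ~ .,
                        data = trainingPhy,
                        method = "glmnet",
                        tuneGrid = expand.grid(.alpha = 0.1, .lambda = 0.00023),
                        metric = "RMSE",
                        trControl = fitControlPhyReg)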

R Caret's rfe [Error in { : task 1 failed - "rfe is expecting 184 importance values but only has 2"]

I am using Caret's rfe for a regression application. My data (in data.table) has 176 predictors (including 49 factor predictors). When I run the function, I get this error:
Error in { : task 1 failed - "rfe is expecting 176 importance values but only has 2"
Then, I used model.matrix( ~ . - 1, data = as.data.frame(train_model_sell_single_bid)) to convert the factor predictors to dummy variables. However, I got a similar error:
Error in { : task 1 failed - "rfe is expecting 184 importance values but only has 2"
I'm using R version 3.1.1 on Windows 7 (64-bit), Caret version 6.0-41. I also have Revolution R Enterprise version 7.3 (64-bit) installed.
But the same error was reproduced on Amazon EC2 (c3.8xlarge) Linux instance with R version 3.0.1 and Caret version 6.0-24.
Datasets used (to reproduce my error):
https://www.dropbox.com/s/utuk9bpxl2996dy/train_model_sell_single_bid.RData?dl=0
https://www.dropbox.com/s/s9xcgfit3iqjffp/train_model_bid_outcomes_sell_single.RData?dl=0
My code:
library(caret)
library(data.table)
library(bit64)
library(doMC)
load("train_model_sell_single_bid.RData")
load("train_model_bid_outcomes_sell_single.RData")
subsets <- seq(from = 4, to = 184, by= 4)
registerDoMC(cores = 32)
set.seed(1015498)
ctrl <- rfeControl(functions = lmFuncs,
                   method = "repeatedcv",
                   repeats = 1,
                   #saveDetails = TRUE,
                   verbose = FALSE)
x <- as.data.frame(train_model_sell_single_bid[, !"security_id", with = FALSE])
y <- train_model_bid_outcomes_sell_single[, bid100]
lmProfile_single_bid100 <- rfe(x, y,
                               sizes = subsets,
                               preProc = c("center", "scale"),
                               rfeControl = ctrl)
It seems that you might have highly correlated predictors.
Prior to feature selection you should compute the predictor correlations (cor() only works on the numeric columns) and drop the highly correlated ones:
correlations <- cor(x[, sapply(x, is.numeric)])
crrltn <- findCorrelation(correlations, cutoff = .90)
if (length(crrltn) != 0)
  x <- x[, !(names(x) %in% colnames(correlations)[crrltn])]
If the problem persists after this, it might be related to high correlation among the predictors within the automatically generated folds. You can try to control the generated folds with:
set.seed(12213)
index <- createFolds(y, k = 10, returnTrain = T)
and then give these as arguments to the rfeControl function:
lmctrl <- rfeControl(functions = lmFuncs,
                     method = "repeatedcv",
                     index = index,
                     verbose = TRUE)
set.seed(111333)
lrprofile <- rfe(x, y,
                 sizes = subsets,
                 rfeControl = lmctrl)
If you keep having the same problem, check whether there are highly correlated predictors within each fold:
for (i in 1:length(index)) {
  crrltn = cor(x[index[[i]], sapply(x, is.numeric)])
  print(findCorrelation(crrltn, cutoff = .90, names = TRUE, verbose = TRUE))
}
