Caret: Unable to adjust performance metric when using rfe function

I'm trying to perform recursive feature elimination using the rfe function but I'm having a bit of trouble trying to change the performance measure to output the ROC:
newFunc <- caretFuncs
newFunc$summary <- twoClassSummary
ctrl <- rfeControl(functions = newFunc,
method = 'cv',
returnResamp = TRUE,
number = 2,
verbose = TRUE)
profiler <- rfe(predictors, response,
sizes = c(1),
method = 'nnet',
tuneGrid = expand.grid(size=c(4), decay=c(0.1)),
maxit = 20,
metric = 'ROC',
rfeControl = ctrl)
Trying to run this code is giving me the following error:
Error in { : task 1 failed - "undefined columns selected"
If I remove the custom newFunc, set the functions parameter inside rfeControl to use caretFuncs and remove the metric parameter from rfe, the model works fine. This makes me think there's something wrong with the summary.
caretFuncs$summary:
function (data, lev = NULL, model = NULL)
{
if (is.character(data$obs))
data$obs <- factor(data$obs, levels = lev)
postResample(data[, "pred"], data[, "obs"])
}
twoClassSummary
function (data, lev = NULL, model = NULL)
{
lvls <- levels(data$obs)
if (length(lvls) > 2)
stop(paste("Your outcome has", length(lvls), "levels. The twoClassSummary() function isn't appropriate."))
requireNamespaceQuietStop("ModelMetrics")
if (!all(levels(data[, "pred"]) == lvls))
stop("levels of observed and predicted data do not match")
data$y = as.numeric(data$obs == lvls[2])
rocAUC <- ModelMetrics::auc(ifelse(data$obs == lev[2], 0,
1), data[, lvls[1]])
out <- c(rocAUC, sensitivity(data[, "pred"], data[, "obs"],
lev[1]), specificity(data[, "pred"], data[, "obs"], lev[2]))
names(out) <- c("ROC", "Sens", "Spec")
out
}
The outputs of postResample and twoClassSummary are identical in structure, so I'm a little lost as to what the problem is. Am I doing something inherently wrong here, or is this a bug that I need to flag to the devs?
I'm actually interested in obtaining the logLoss so I could write my own function:
logLoss = function(data, lev = NULL, model = NULL) {
-1*mean(log(data[, 'pred'][model.matrix(~ as.numeric(data[, 'obs'], levels = lev) + 0) - data[, 'pred'] > 0]))
}
But, I'm a little unsure how to convert the factor levels into the correct [0,1] from my [yes, no] factor?
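One plausible reading of the error, offered as a sketch rather than a confirmed diagnosis: twoClassSummary indexes a column named after the first class level (data[, lvls[1]] in the listing above), while caretFuncs$summary only touches the pred and obs columns. If the data frame handed to the summary function has no class probability columns, base R raises exactly this message. The data frame resample_data below is hypothetical and only imitates that situation.

```r
# Hypothetical resample output with only the columns postResample() needs
resample_data <- data.frame(
  obs  = factor(c("yes", "no", "yes"), levels = c("yes", "no")),
  pred = factor(c("yes", "no", "no"),  levels = c("yes", "no"))
)
lvls <- levels(resample_data$obs)

# twoClassSummary-style access to the probability column of the first level
msg <- tryCatch({
  resample_data[, lvls[1]]
  "no error"
}, error = function(e) conditionMessage(e))
print(msg)

# Once a probability column named after the level exists, the access works
resample_data$yes <- c(0.8, 0.3, 0.4)
ok <- is.numeric(resample_data[, lvls[1]])
```

If this is the cause, the fix is to make sure class probabilities reach the summary function, which is what the answers below arrange in different ways.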

First of all here is a viable logloss function for use with caret:
LogLoss <- function (data, lev = NULL, model = NULL)
{
obs <- data[, "obs"]
cls <- levels(obs) #find class names
probs <- data[, cls[2]] #use second class name
probs <- pmax(pmin(as.numeric(probs), 1 - 1e-15), 1e-15) #bound probability
logPreds <- log(probs)
log1Preds <- log(1 - probs)
real <- (as.numeric(data$obs) - 1)
out <- c(mean(real * logPreds + (1 - real) * log1Preds)) * -1
names(out) <- c("LogLoss")
out
}
To answer the question of how to convert the factor levels into the correct [0,1] from my [yes, no] factor:
real <- (as.numeric(data$obs) - 1)
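As a quick check of that conversion, here is a minimal sketch on made-up data (obs and p_yes are hypothetical). as.numeric() on a factor returns the level index, so which class becomes 1 depends on the level order; here the levels are set explicitly so that "yes" is the second level.

```r
# Hypothetical observed classes and predicted probabilities of the second level
obs   <- factor(c("yes", "no", "yes", "no"), levels = c("no", "yes"))
p_yes <- c(0.9, 0.2, 0.6, 0.4)

# Level index minus one: "no" -> 0, "yes" -> 1
real <- as.numeric(obs) - 1

# Bound the probabilities away from 0 and 1, then average the log loss
eps <- 1e-15
p <- pmax(pmin(p_yes, 1 - eps), eps)
log_loss <- -mean(real * log(p) + (1 - real) * log(1 - p))
print(log_loss)
```

If the factor had been created with the default alphabetical levels the result would be the same here, but with other labels it is worth checking levels(obs) before trusting the 0/1 coding.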
To get rfe to work, you can use rfFuncs instead of caretFuncs. Example:
rfFuncs$summary <- twoClassSummary
ctrl <- rfeControl(functions = rfFuncs,
method = 'cv',
returnResamp = TRUE,
number = 2,
verbose = TRUE)
profiler <- rfe(Sonar[,1:60], Sonar$Class,
sizes = c(1, 5, 20, 40, 60),
method = 'nnet',
tuneGrid = expand.grid(size=c(4), decay=c(0.1)),
maxit = 20,
metric = 'ROC',
rfeControl = ctrl)
profiler$results
Variables ROC Sens Spec ROCSD SensSD SpecSD
1 1 0.6460027 0.6387987 0.5155187 0.08735968 0.132008571 0.007516016
2 5 0.7563971 0.6847403 0.7013180 0.03751483 0.008724045 0.039383924
3 20 0.8633511 0.8462662 0.7017432 0.08460677 0.091143309 0.097708207
4 40 0.8841540 0.8642857 0.7429847 0.08096697 0.090913729 0.098309489
5 60 0.8945351 0.9004870 0.7431973 0.05707867 0.064971175 0.127471631
Or with the LogLoss function I provided:
rfFuncs$summary <- LogLoss
ctrl <- rfeControl(functions = rfFuncs,
method = 'cv',
returnResamp = TRUE,
number = 2,
verbose = TRUE)
profiler <- rfe(Sonar[,1:60], Sonar$Class,
sizes = c(1, 5, 20, 40, 60),
method = 'nnet',
tuneGrid = expand.grid(size=c(4), decay=c(0.1)),
maxit = 20,
metric = 'LogLoss',
rfeControl = ctrl,
maximize = FALSE) # added after the answer of Дмитрий Пасько
profiler$results
Variables LogLoss LogLossSD
1 1 1.8237372 1.030120134
2 5 0.5548774 0.128704686
3 20 0.4226522 0.021547998
4 40 0.4167819 0.013587892
5 60 0.4328718 0.008000892
EDIT: Дмитрий Пасько raises a valid concern in his answer - LogLoss should be minimized. One way to achieve this is to provide the logical argument maximize, which tells caret whether the metric should be minimized or maximized.
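To illustrate why the direction matters, here is a small sketch using the LogLoss values from the results table above. It assumes the best subset size is simply the one with the best metric value, which is a simplification of what rfe does internally. Minimizing the loss and maximizing the negated loss pick the same size, while maximizing the raw loss picks the worst one.

```r
# LogLoss per subset size, copied from profiler$results above
results <- data.frame(
  Variables = c(1, 5, 20, 40, 60),
  LogLoss   = c(1.8237372, 0.5548774, 0.4226522, 0.4167819, 0.4328718)
)

# Direction as with maximize = FALSE
best_min <- results$Variables[which.min(results$LogLoss)]

# Same choice when the summary function returns the negated loss
best_max_neg <- results$Variables[which.max(-results$LogLoss)]

# What happens if the raw loss is maximized by mistake
worst <- results$Variables[which.max(results$LogLoss)]

print(c(best_min = best_min, best_max_neg = best_max_neg, worst = worst))
```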

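But you should minimize logLoss, so use the code below, which returns the negative log loss so that caret's default maximization ends up minimizing the loss (example with logistic regression: https://www.kaggle.com/demetrypascal/rfe-logreg-with-pca-and-feature-importance):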
LogLoss <- function (data, lev = NULL, model = NULL)
{
obs <- data[, "obs"]
cls <- levels(obs) #find class names
probs <- data[, cls[2]] #use second class name
probs <- pmax(pmin(as.numeric(probs), 1 - 1e-15), 1e-15) #bound probability
logPreds <- log(probs)
log1Preds <- log(1 - probs)
real <- (as.numeric(data$obs) - 1)
out <- c(mean(real * logPreds + (1 - real) * log1Preds)) * -1
names(out) <- c("LogLossNegative")
-out
}
lrFuncs$summary <- LogLoss
rfec = rfeControl(method = "cv",
number = 2,
functions = lrFuncs)

Related

Training Model in Caret Using F1 Metric

I am trying to fit a random forest model to my dataset and I would like to select the best model based off of the F1 score. I saw a post here describing the code necessary. I attempted to copy the code but I am getting the error
"Error in { : task 1 failed - "could not find function "F1_Score"
while I run the train function. (FYI the variable I am trying to predict ("pass") is a two class factor "Fail" and "Pass")
See Code Below:
library(MLmetrics)
library(caret)
library(doSNOW)
f1 <- function(data, lev = NULL, model = NULL) {
f1_val <- F1_Score(y_pred = data$pred, y_true = data$obs, positive = lev[1])
c(F1 = f1_val)
}
train.control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
classProbs = TRUE,
summaryFunction = f1,
search = "grid")
tune.grid <- expand.grid(.mtry = seq(from = 1, to = 10, by = 1))
cl <- makeCluster(3, type = "SOCK")
registerDoSNOW(cl)
random.forest.orig <- train(pass ~ manufacturer+meter.type+premise+size+age+avg.winter+totalizer,
data = meter.train,
method = "rf",
tuneGrid = tune.grid,
metric = "F1",
weights = model_weights,
trControl = train.control)
stopCluster(cl)
I've rewritten the f1 function without the MLmetrics library and it seems to work. See below for working code that computes the F1 score:
f1 <- function (data, lev = NULL, model = NULL) {
# "Pass" must match a level of the outcome factor exactly (levels here: "Fail", "Pass")
precision <- posPredValue(data$pred, data$obs, positive = "Pass")
recall <- sensitivity(data$pred, data$obs, positive = "Pass")
f1_val <- (2 * precision * recall) / (precision + recall)
names(f1_val) <- c("F1")
f1_val
}
train.control <- trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
classProbs = TRUE,
#sampling = "smote",
summaryFunction = f1,
search = "grid")
tune.grid <- expand.grid(.mtry = seq(from = 1, to = 10, by = 1))
cl <- makeCluster(3, type = "SOCK")
registerDoSNOW(cl)
random.forest.orig <- train(pass ~ manufacturer+meter.type+premise+size+age+avg.winter+totalizer,
data = meter.train,
method = "rf",
tuneGrid = tune.grid,
metric = "F1",
trControl = train.control)
stopCluster(cl)
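For comparison, a summary function can also be written with base R only, which sidesteps the question of which packages are visible inside the function. This is a minimal sketch, not the approach from the posts above: f1_base and the toy data frame are hypothetical, and treating lev[1] as the positive class is an assumption that mirrors the original f1 function.

```r
# Sketch of a caret-style summary function using only base R
f1_base <- function(data, lev = NULL, model = NULL) {
  positive <- lev[1]  # assumption: first level is the positive class
  tp <- sum(data$pred == positive & data$obs == positive)
  fp <- sum(data$pred == positive & data$obs != positive)
  fn <- sum(data$pred != positive & data$obs == positive)
  precision <- if (tp + fp == 0) NA_real_ else tp / (tp + fp)
  recall    <- if (tp + fn == 0) NA_real_ else tp / (tp + fn)
  c(F1 = 2 * precision * recall / (precision + recall))
}

# Hypothetical resample data with the obs/pred columns a summary function receives
toy <- data.frame(
  obs  = factor(c("Fail", "Fail", "Pass", "Pass", "Fail"), levels = c("Fail", "Pass")),
  pred = factor(c("Fail", "Pass", "Pass", "Fail", "Fail"), levels = c("Fail", "Pass"))
)
res <- f1_base(toy, lev = levels(toy$obs))
print(res)
```

It would be passed to trainControl(summaryFunction = f1_base) together with metric = "F1", in the same way as the functions above.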
I had exactly the same error. The error also happened when I used other functions from the MLmetrics package, e.g., Precision function.
I solved it by accessing the F1_Score function using double colons ::.
f1 <- function(data, lev = NULL, model = NULL) {
f1_val <- MLmetrics::F1_Score(y_pred = data$pred,
y_true = data$obs,
positive = lev[1])
c(F1 = f1_val)
}
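Using MLmetrics::F1_Score you unequivocally work with the F1_Score from the MLmetrics package. A likely reason the unqualified call fails is that, with a parallel backend such as doSNOW, the worker processes do not have MLmetrics attached, so F1_Score is not on their search path.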
One advantage of MLmetrics package is that its functions work with variables that have more than 2 levels.

Create own model in train for multiple logit regressions

I want to mimic a multinomial logit model using separate logistic regressions and cross validating them via caret. In the non-CV world, I want to achieve the following:
# Create Data-Set
library(data.table)
library(dplyr)
N <- 1000
X1 <- rnorm(N, 175, 7)
X2 <- rnorm(N, 30, 8)
X3 <- rnorm(N,0,1)
length <- sample(0:5,N,T)
Ycont <- 0.5*X1 - 0.3*X2 +0.01*X3 + 10 + rnorm(N, 0, 6)
# create 3 categories
Ycateg <- ntile(Ycont,3)
df <- data.frame(id=1:N,length,X1, X2,X3, Ycateg)
df=setDT(df)[,.SD[rep(1L,length)],by = id]
df=df[ , time := 1:.N , by=id]
df=df[,-c("length")]
df$Ycateg=ifelse(df$Ycateg==1,"type1",ifelse(df$Ycateg==2,"type2","type0"))
head(df)
# aim of the model without CV - combine logit regressions
y_1=ifelse(df$Ycateg=="type1",1,0)
y_2=ifelse(df$Ycateg=="type2",1,0)
#drop the ID column
dat_model=df[,-1]
# fit the models
fit_1=glm(formula=y_1~. ,family=binomial(link='logit'),data=dat_model,control = list(maxit = 50))
fit_2=glm(formula=y_2~. ,family=binomial(link='logit'),data=dat_model,control = list(maxit = 50))
# predict
p_1=predict(fit_1,type = "response")
p_2=predict(fit_2,type = "response")
p_0=1-p_1-p_2
head(cbind(p_0,p_1,p_2))
p_0 p_1 p_2
1 1.000000e+00 2.220446e-16 2.220446e-16
2 0.000000e+00 2.220446e-16 1.000000e+00
3 4.930381e-32 1.000000e+00 2.220446e-16
4 4.930381e-32 1.000000e+00 2.220446e-16
So what I need is to build my own model in the caret framework to replicate the model above. What I have done so far is:
#Extend Caret
customLogit <- list(type = "Classification", library = "stats", loop = NULL)
customLogit$parameters =data.frame(parameter = c("decay"), class = c("numeric"), label = c("decay"))
customLogit$grid = function(x, y, len = NULL, search = "grid") { }
customLogit$fit <- function(x,y, ...) {
y_1=ifelse(df$Ycateg=="type1",1,0)
y_2=ifelse(df$Ycateg=="type2",1,0)
fit_1=glm(formula=y_1~. ,family=binomial(link='logit'),control = list(maxit = 50),...)
fit_2=glm(formula=y_2~. ,family=binomial(link='logit'),control = list(maxit = 50),...)
out = vector("list",2)
out[[1]]=fit_1
out[[2]]=fit_2
return(out)
}
customLogit$predict <- function(modelFit_all, newdata, preProc = NULL, submodels = NULL,...) {
p_1=predict(modelFit_all[[1]],newdata=newdata,...)
p_2=predict(modelFit_all[[2]],newdata=newdata,...)
p_0=ifelse(p_1==0 & p_2==0,1,0)
out=cbind(p_0,p_1,p_2)
return(out)
}
customLogit$prob <- function(modelFit_all, newdata, preProc = NULL, submodels = NULL) {
p_1=predict(modelFit_all[[1]],newdata=newdata,type="response",...)
p_2=predict(modelFit_all[[2]],newdata=newdata,type="response",...)
p_0=1- p_1-p_0
out=cbind(p_cur,p_def,p_pre)
return(out)
}
customLogit$sort <- NULL
customLogit$levels <- function(x) x$classes
# which type of cross validation to do
fitControl <- trainControl(method = 'cv',number=5,classProbs=TRUE,summaryFunction=defaultSummary, selectionFunction = "best", savePredictions = TRUE)
# tuning parameters
grid <- expand.grid(decay = 0 )
cv=train(as.factor(Ycateg)~.,
data = dat_model,
method = customLogit,
trControl = fitControl,
tuneGrid = grid,
)
Sadly, I could not get the code to work; it throws the error:
Error in train.default(x, y, weights = w, ...) :
argument is missing, with no default
I assume that the problem is the decay parameter, but as I understand it, one cannot "tune" a logistic regression model fitted with glm, so I don't want to introduce any tuning parameters.
Many thanks in advance!

Custom classification threshold for GBM

I'm trying to create a custom GBM model that tunes the classification threshold for a binary classification problem. There is a nice example provided on the caret website here, but when I try to apply something similar to GBM I receive the following error:
Error in { : task 1 failed - "argument 1 is not a vector"
Unfortunately, I have no idea where the error is and the error isn't very helpful.
Here's an example, with the code that I've used for defining the custom GBM
library(caret)
library(gbm)
library(pROC)
#### DEFINE A CUSTOM GBM MODEL FOR PROBABILITY THRESHOLD TUNING ####
## Get the model code for the original gbm method from caret
customGBM <- getModelInfo("gbm", regex = FALSE)[[1]]
customGBM$type <- c("Classification")
## Add the threshold (i.e. class cutoff) as another tuning parameter
customGBM$parameters <- data.frame(parameter = c("n.trees", "interaction.depth", "shrinkage",
"n.minobsinnode", "threshold"),
class = rep("numeric", 5),
label = c("# Boosting Iterations", "Max Tree Depth", "Shrinkage",
"Min. Terminal Node Size", "Probability Cutoff"))
## Customise the tuning grid:
## Some parameters are fixed. Will give a tuning grid of 2,500 values if len = 100
customGBM$grid <- function(x, y, len = NULL, search = "grid") {
if (search == "grid") {
grid <- expand.grid(n.trees = seq(50, 250, 50),
interaction.depth = 2, ### fix interaction depth at 2
shrinkage = 0.0001, ### fix learning rate at 0.0001
n.minobsinnode = seq(2, 10, 2),
threshold = seq(.01, .99, length = len))
} else {
grid <- expand.grid(n.trees = floor(runif(len, min = 1, max = 5000)),
interaction.depth = sample(1:10, replace = TRUE, size = len),
shrinkage = runif(len, min = .001, max = .6),
n.minobsinnode = sample(5:25, replace = TRUE, size = len),
threshold = runif(len, min = 0, max = 1)) ### runif() has no size argument
grid <- grid[!duplicated(grid),] ### remove any duplicated rows in the training grid
}
grid
}
## Here we fit a single gbm model and loop over the threshold values to get predictions from the
## same gbm model.
customGBM$loop = function(grid) {
library(plyr)
loop <- ddply(grid, c("n.trees", "shrinkage", "interaction.depth", "n.minobsinnode"),
function(x) c(threshold = max(x$threshold)))
submodels <- vector(mode = "list", length = nrow(loop))
for (i in seq(along = loop$threshold)) {
index <- which(grid$n.trees == loop$n.trees[i] &
grid$interaction.depth == loop$interaction.depth[i] &
grid$shrinkage == loop$shrinkage[i] &
grid$n.minobsinnode == loop$n.minobsinnode[i])
cuts <- grid[index, "threshold"]
submodels[[i]] <- data.frame(threshold = cuts[cuts != loop$threshold[i]])
}
list(loop = loop, submodels = submodels)
}
## Fit the model independent of the threshold parameter
customGBM$fit = function(x, y, wts, param, lev, last, classProbs, ...) {
theDots <- list(...)
if (any(names(theDots) == "distribution")) {
modDist <- theDots$distribution
theDots$distribution <- NULL
} else {
if (is.numeric(y)) {
stop("This works only for 2-class classification problems")
} else modDist <- if (length(lev) == 2) "bernoulli" else
stop("This works only for 2-class classification problems")
}
# if (length(levels(y)) != 2)
# stop("This works only for 2-class problems")
## check to see if weights were passed in (and available)
if (!is.null(wts)) theDots$w <- wts
if (is.factor(y) && length(lev) == 2) y <- ifelse(y == lev[1], 1, 0)
modArgs <- list(x = x,
y = y,
interaction.depth = param$interaction.depth,
n.trees = param$n.trees,
shrinkage = param$shrinkage,
n.minobsinnode = param$n.minobsinnode,
distribution = modDist)
do.call("gbm.fit", modArgs)
}
## Now get a probability prediction and use different thresholds to
## get the predicted class
customGBM$predict = function(modelFit, newdata, submodels = NULL) {
out <- predict(modelFit, newdata, n.trees = modelFit$tuneValue$n.trees,
type = "response")#[, modelFit$obsLevels[1]]
out[is.nan(out)] <- NA
class1Prob <- ifelse(out >= modelFit$tuneValue$threshold,
modelFit$obsLevels[1],
modelFit$obsLevels[2])
## Raise the threshold for class #1 and a higher level of
## evidence is needed to call it class 1 so it should
## decrease sensitivity and increase specificity
out <- ifelse(class1Prob >= modelFit$tuneValue$threshold,
modelFit$obsLevels[1],
modelFit$obsLevels[2])
if (!is.null(submodels)) {
tmp2 <- out
out <- vector(mode = "list", length = length(submodels$threshold))
out[[1]] <- tmp2
for (i in seq(along = submodels$threshold)) {
out[[i + 1]] <- ifelse(class1Prob >= submodels$threshold[[i]],
modelFit$obsLevels[1],
modelFit$obsLevels[2])
}
}
out
}
## The probabilities are always the same but we have to create
## multiple versions of the probs to evaluate the data across
## thresholds
customGBM$prob = function(modelFit, newdata, submodels = NULL) {
out <- predict(modelFit, newdata, type = "response",
n.trees = modelFit$tuneValue$n.trees)
out[is.nan(out)] <- NA
out <- cbind(out, 1 - out)
colnames(out) <- modelFit$obsLevels
if (!is.null(submodels)) {
tmp <- predict(modelFit, newdata, type = "response", n.trees = submodels$n.trees)
tmp <- as.list(as.data.frame(tmp))
lapply(tmp, function(x, lvl) {
x <- cbind(x, 1 - x)
colnames(x) <- lvl
x}, lvl = modelFit$obsLevels)
out <- c(list(out), tmp)
}
out
}
fourStats <- function (data, lev = levels(data$obs), model = NULL) {
## This code will use the area under the ROC curve and the
## sensitivity and specificity values using the current candidate
## value of the probability threshold.
out <- c(twoClassSummary(data, lev = levels(data$obs), model = NULL))
## The best possible model has sensitivity of 1 and specificity of 1.
## How far are we from that value?
coords <- matrix(c(1, 1, out["Spec"], out["Sens"]),
ncol = 2,
byrow = TRUE)
colnames(coords) <- c("Spec", "Sens")
rownames(coords) <- c("Best", "Current")
c(out, Dist = dist(coords)[1])
}
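The Dist value returned by fourStats is the Euclidean distance between the current (Spec, Sens) pair and the ideal point (1, 1). A minimal check of that, with made-up sensitivity and specificity values (sens and spec are hypothetical):

```r
# Hypothetical sensitivity and specificity at some threshold
sens <- 0.85
spec <- 0.70

# Same construction as in fourStats: first row is the ideal point
coords <- matrix(c(1, 1, spec, sens), ncol = 2, byrow = TRUE)
colnames(coords) <- c("Spec", "Sens")
rownames(coords) <- c("Best", "Current")
d_dist <- dist(coords)[1]

# Written out by hand
d_manual <- sqrt((1 - spec)^2 + (1 - sens)^2)
print(c(d_dist = d_dist, d_manual = d_manual))
```

Smaller is better, which is why the train() call for this model uses maximize = FALSE with metric = "Dist".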
And then some code showing how to use the custom model
set.seed(949)
trainingSet <- twoClassSim(500, -9)
mod1 <- train(Class ~ ., data = trainingSet,
method = customGBM, metric = "Dist",
maximize = FALSE, tuneLength = 10,
trControl = trainControl(method = "cv", number = 5,
classProbs = TRUE,
summaryFunction = fourStats))
The model appears to run, but finishes with the error from above. If someone could please help me with customising the GBM model to tune the GBM parameters, and the probability threshold for the classes that would be great.
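The thresholding step described in the comments of customGBM$predict can be sketched in isolation. The helper classify_at and the inputs are hypothetical; the point is that the comparison with the threshold should be made on the numeric probability of the first class, not on a vector that has already been converted to class labels.

```r
# Turn numeric class-1 probabilities into class labels at a given cutoff
classify_at <- function(prob_class1, threshold, lev) {
  factor(ifelse(prob_class1 >= threshold, lev[1], lev[2]), levels = lev)
}

# Hypothetical probabilities and level names
probs <- c(0.10, 0.45, 0.80)
lev   <- c("Class1", "Class2")

low  <- classify_at(probs, 0.3, lev)  # lenient cutoff: more Class1 calls
high <- classify_at(probs, 0.6, lev)  # strict cutoff: fewer Class1 calls
print(data.frame(probs, low, high))
```

Raising the cutoff moves observations from Class1 to Class2, which matches the comment about decreasing sensitivity and increasing specificity.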

Caret package Custom metric

I'm using the caret function "train()" in one of my projects and I'd like to add a custom metric, the F1 score. I looked at the caret package page, but I cannot understand how to build this score with the parameters available.
There is an example of custom metric which is the following:
## Example with a custom metric
madSummary <- function (data,
lev = NULL,
model = NULL) {
out <- mad(data$obs - data$pred,
na.rm = TRUE)
names(out) <- "MAD"
out
}
robustControl <- trainControl(summaryFunction = madSummary)
marsGrid <- expand.grid(degree = 1, nprune = (1:10) * 2)
earthFit <- train(medv ~ .,
data = BostonHousing,
method = "earth",
tuneGrid = marsGrid,
metric = "MAD",
maximize = FALSE,
trControl = robustControl)
Update:
I tried your code, but the problem is that it doesn't work with multiple classes, as in the code below. (The F1 score is displayed, but it looks weird.) I'm not sure, but I think the function F1_Score works only on binary classes.
library(caret)
library(MLmetrics)
set.seed(346)
dat <- iris
## See http://topepo.github.io/caret/training.html#metrics
f1 <- function(data, lev = NULL, model = NULL) {
print(data)
f1_val <- F1_Score(y_pred = data$pred, y_true = data$obs)
c(F1 = f1_val)
}
# Split the data: 70% for training, the rest for testing
in_train <- createDataPartition(dat$Species, p = .70, list = FALSE)
trainClass <- dat[in_train,]
testClass <- dat[-in_train,]
set.seed(35)
mod <- train(Species ~ ., data = trainClass ,
method = "rpart",
metric = "F1",
trControl = trainControl(summaryFunction = f1,
classProbs = TRUE))
print(mod)
I also coded a manual F1 score that takes the confusion matrix as its input (I'm not sure if we can have a confusion matrix in "summaryFunction"):
F1_score <- function(mat, algoName){
##
## Compute F1-score
##
# Remark: rows = predictions // columns = real values
recall <- matrix(1:nrow(mat), ncol = nrow(mat))
precision <- matrix(1:nrow(mat), ncol = nrow(mat))
F1_score <- matrix(1:nrow(mat), ncol = nrow(mat))
for(i in 1:nrow(mat)){
# recall: correct predictions over all real members of class i (column sum)
recall[i] <- mat[i,i]/colSums(mat)[i]
# precision: correct predictions over all predictions of class i (row sum)
precision[i] <- mat[i,i]/rowSums(mat)[i]
}
for(i in 1:ncol(recall)){
F1_score[i] <- 2 * ( precision[i] * recall[i] ) / ( precision[i] + recall[i])
}
# We display the matrix labels
colnames(F1_score) <- colnames(mat)
rownames(F1_score) <- algoName
# Display the F1_score for each class
print(F1_score)
# Return the average F1_score
mean(F1_score[1,])
}
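On the question of whether a confusion matrix can be used inside summaryFunction: the function receives a data frame with obs and pred columns, so the matrix can be built there with table(). Below is a minimal sketch of a macro-averaged F1 for any number of classes. The name macro_f1, the toy data, and the choice to count an undefined per-class F1 as 0 are all assumptions, not part of caret.

```r
# Sketch of a multi-class summary function: macro-averaged F1
macro_f1 <- function(data, lev = NULL, model = NULL) {
  # rows = predictions, columns = observed classes
  # (pred and obs are factors with the same levels, so the table is square)
  cm <- table(pred = data$pred, obs = data$obs)
  tp <- diag(cm)
  precision <- tp / rowSums(cm)
  recall    <- tp / colSums(cm)
  f1 <- 2 * precision * recall / (precision + recall)
  f1[is.nan(f1)] <- 0  # assumption: undefined per-class F1 counts as 0
  c(F1 = mean(f1))
}

# Hypothetical three-class resample data
lv  <- c("a", "b", "c")
toy <- data.frame(
  obs  = factor(c("a", "a", "b", "b", "c", "c"), levels = lv),
  pred = factor(c("a", "b", "b", "b", "c", "a"), levels = lv)
)
res <- macro_f1(toy, lev = lv)
print(res)
```

It would be used as trainControl(summaryFunction = macro_f1) with metric = "F1", for example with the iris model above.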
You should look at The caret Package - Alternate Performance Metrics for details. A working example:
library(caret)
library(MLmetrics)
set.seed(346)
dat <- twoClassSim(200)
## See https://topepo.github.io/caret/model-training-and-tuning.html#metrics
f1 <- function(data, lev = NULL, model = NULL) {
f1_val <- F1_Score(y_pred = data$pred, y_true = data$obs, positive = lev[1])
c(F1 = f1_val)
}
set.seed(35)
mod <- train(Class ~ ., data = dat,
method = "rpart",
tuneLength = 5,
metric = "F1",
trControl = trainControl(summaryFunction = f1,
classProbs = TRUE))
For the two-class case, you can try the following:
mod <- train(Class ~ .,
data = dat,
method = "rpart",
tuneLength = 5,
metric = "F",
trControl = trainControl(summaryFunction = prSummary,
classProbs = TRUE))
Or define a custom summary function that combines both twoClassSummary and prSummary (my current favorite), which provides the following evaluation metrics: AUROC, Spec, Sens, AUPRC, Precision, Recall, F. Any of these can be used as the metric argument. It also handles the special case I mentioned in my comment on the accepted answer (F is NA).
comboSummary <- function(data, lev = NULL, model = NULL) {
out <- c(twoClassSummary(data, lev, model), prSummary(data, lev, model))
# special case: replace a missing value for F with 0
# (out is a named numeric vector, so index by name rather than with $)
out["F"] <- ifelse(is.na(out["F"]), 0, out["F"])
names(out) <- gsub("AUC", "AUPRC", names(out))
names(out) <- gsub("ROC", "AUROC", names(out))
return(out)
}
mod <- train(Class ~ .,
data = dat,
method = "rpart",
tuneLength = 5,
metric = "F",
trControl = trainControl(summaryFunction = comboSummary,
classProbs = TRUE))

How to change metrics using the library(caret)?

I would like to change the metric from RMSE to RMSLE using the caret library.
Given some sample data:
ivar1<-rnorm(500, mean = 3, sd = 1)
ivar2<-rnorm(500, mean = 4, sd = 1)
ivar3<-rnorm(500, mean = 5, sd = 1)
ivar4<-rnorm(500, mean = 4, sd = 1)
dvar<-rpois(500, exp(3+ 0.1*ivar1 - 0.25*ivar2))
data<-data.frame(dvar,ivar4,ivar3,ivar2,ivar1)
ctrl <- rfeControl(functions=rfFuncs,
method="cv",
repeats = 5,
verbose = FALSE,
number=5)
model <- rfe(data[,2:4], data[,1], sizes=c(1:4), rfeControl=ctrl)
Here I would like to change to RMSLE while keeping the same kind of plot:
plot <-ggplot(model,type=c("g", "o"), metric="RMSE")+ scale_x_continuous(breaks = 2:4, labels = names(data)[2:4])
I'm not sure how, or if, you can easily convert RMSE to RMSLE, so you can try changing the control function instead.
Look at rfFuncs$summary: it calls a function named postResample. This is where the RMSE is calculated; look at the section
mse <- mean((pred - obs)^2)
n <- length(obs)
out <- c(sqrt(mse), resamplCor^2)
So you can amend this function to calculate the RMSLE instead:
msle <- mean((log(pred) - log(obs))^2)
out <- sqrt(msle)
names(out) <- "RMSLE"
Then if this amended function has been saved in a function called mypostResample, you then need to update the rfFuncs$summary.
So altogether:
First update the summary function - this will call the new function with RMSLE
newSumm <- function (data, lev = NULL, model = NULL)
{
if (is.character(data$obs))
data$obs <- factor(data$obs, levels = lev)
mypostResample(data[, "pred"], data[, "obs"])
}
Then define the new function that calculates RMSLE
mypostResample <- function (pred, obs)
{
isNA <- is.na(pred)
pred <- pred[!isNA]
obs <- obs[!isNA]
msle <- mean((log(pred) - log(obs))^2)
out <- sqrt(msle)
names(out) <- "RMSLE"
if (any(is.nan(out)))
out[is.nan(out)] <- NA
out
}
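One caveat with log(pred) - log(obs): the sample outcome is a Poisson count, so an observed 0 would give log(0) = -Inf. A common variant of RMSLE uses log(1 + x) instead. The sketch below is that variant, not the function from this answer; rmsle_log1p and the toy vectors are hypothetical.

```r
# Variant of the RMSLE helper that tolerates zeros by using log1p()
rmsle_log1p <- function(pred, obs) {
  keep <- !is.na(pred) & !is.na(obs)
  out <- sqrt(mean((log1p(pred[keep]) - log1p(obs[keep]))^2))
  c(RMSLE = out)
}

# Hypothetical counts, including an observed zero
obs  <- c(0, 3, 10)
pred <- c(1, 3, 8)
res <- rmsle_log1p(pred, obs)
print(res)

# The plain log version is not finite on the same data
plain <- sqrt(mean((log(pred) - log(obs))^2))
print(plain)
```

Swapping it in would only require calling rmsle_log1p instead of mypostResample inside newSumm.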
Update rfFuncs
# keep old settings for future use
oldSumm <- rfFuncs$summary
# update with new function
rfFuncs$summary <- newSumm
ctrl <- rfeControl(functions=rfFuncs,
method="cv",
repeats = 5,
verbose = FALSE,
number=5)
set.seed(1)
model <- rfe(data[,2:4], data[,1], sizes=c(1:4), rfeControl=ctrl, metric="RMSLE")
# plot
ggplot(model,type=c("g", "o"), metric="RMSLE")+ scale_x_continuous(breaks = 2:4, labels = names(data)[2:4])
