Manual Summary Function in Caret - Make all predictions = Fail - r

I am having some issues understanding how a manual summary function in caret works. I have created a simple summary function to maximize the number of "fail" predictions, but for some reason the model doesn't seem to predict all instances as fail (even on the training dataset).
See below for the code:
The summary function that rewards predicting "fail":
BS <- function(data, lev = NULL, model = NULL) {
  negpredictions <- sum(data$pred == "fail")
  names(negpredictions) <- c("Min_Precision")
  negpredictions
}
Training Script:
library(caret)
library(doSNOW)   # also attaches snow for makeCluster()

train.control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 3,
                              classProbs = TRUE,
                              #sampling = "smote",
                              summaryFunction = BS,
                              search = "grid")
tune.grid <- expand.grid(.mtry = seq(from = 1, to = 10, by = 1))

cl <- makeCluster(3, type = "SOCK")
registerDoSNOW(cl)

random.forest.orig <- train(pass ~ manufacturer + meter.type + premise + size +
                              age + avg.winter + totalizer,
                            data = meter.train,
                            method = "rf",
                            tuneGrid = tune.grid,
                            metric = "Min_Precision",
                            maximize = TRUE,
                            trControl = train.control)

stopCluster(cl)

The metric specified in caret is not a loss function but rather the metric used to choose the optimal model (in most cases the optimal combination of hyperparameters). So by specifying the BS function you are merely selecting the mtry that maximizes the number of "fail" predictions.
From the function help:
metric
A string that specifies what summary metric will be used to select the optimal model. By default, possible values are "RMSE" and "Rsquared" for regression and "Accuracy" and "Kappa" for classification. If custom performance metrics are used (via the summaryFunction argument in trainControl), the value of metric should match one of the arguments. If it does not, a warning is issued and the first metric given by the summaryFunction is used. (NOTE: If given, this argument must be named.)
If you check
random.forest.orig$bestTune
you will see that the best tune is the one that maximized the BS function. However, this does not change the native model's loss function.
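For example (a hedged illustration of where that selection shows up; the "Min_Precision" column name comes from the BS function above), you can compare the custom metric across the candidate mtry values and confirm which one was picked:

# Per-mtry value of the custom metric and the selected tuning parameter
random.forest.orig$results[, c("mtry", "Min_Precision")]
random.forest.orig$bestTune   # the mtry that maximized Min_Precision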

Related

Using the type = "raw" option for the predict() function after repeated cross validation for logistic lasso regression returns empty vector

I used the caret and glmnet packages to run a lasso logistic regression with repeated cross-validation to select the optimal lambda:
glmnet.obj <- train(outcome ~ .,
                    data = df.train,
                    method = "glmnet",
                    metric = "ROC",
                    family = "binomial",
                    trControl = trainControl(
                      method = "repeatedcv",
                      repeats = 10,
                      number = 10,
                      summaryFunction = twoClassSummary,
                      classProbs = TRUE,
                      savePredictions = "all",
                      selectionFunction = "best"))
After that, I get the best lambda and alpha:
best_lambda<- get_best_result(glmnet.obj)$lambda
best_alpha<- get_best_result(glmnet.obj)$alpha
Then I obtain the predicted probabilities for the test set:
pred_prob<- predict(glmnet.obj,s=best_lambda, alpha=best_alpha, type="prob", newx = x.test)
and then to get the predicted classes, which I intend to use in ConfusionMatrix:
pred_class<-predict(glmnet.obj,s=best_lambda, alpha=best_alpha, type="raw",newx=x.test)
But when I just run pred_class it returns NULL.
What could I be missing here?
You need to use newdata = as opposed to newx =, because when you call predict(glmnet.obj) you are calling predict.train on the caret object.
You did not provide the get_best_result() function, but I suppose it is from this source:
get_best_result = function(caret_fit) {
  best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
  best_result = caret_fit$results[best, ]
  rownames(best_result) = NULL
  best_result
}
Using some example data:
set.seed(111)
df = data.frame(outcome = factor(sample(c("y", "n"), 100, replace = TRUE)),
                matrix(rnorm(1000), ncol = 10))
colnames(df)[-1] = paste0("col", 1:10)   # rename columns on df before splitting
df.train = df[1:70, ]
x.test = df[71:100, ]
We run your model as above, and then you can predict using:
pred_class <- predict(glmnet.obj, type = "raw", newdata = x.test)
confusionMatrix(table(pred_class, x.test$outcome))
Confusion Matrix and Statistics

pred_class  n  y
         n  1  5
         y 11 13
The arguments s = (the lambda) and newx = come from glmnet; you can potentially use them on glmnet.obj$finalModel, but you need to convert the data into a matrix, for example:
predict(glmnet.obj$finalModel, s = best_lambda, alpha = best_alpha,
        type = "class", newx = as.matrix(x.test[,-1]))

Use F1 Score instead of Accuracy to Optimize SVM Parameters

I am using the e1071 tune function to optimize an SVM model. I would like to use F1 instead of Accuracy as the value to optimize for. I found in this post: Optimize F-score in e1071 package that I need to define a new error.fun. The problem I am having is that the function shown in that post was never confirmed to be the solution, and it does not work for me. If I knew the variable names for the predictions from each iteration of tune, I could write a function to calculate F1, but I don't know how to get those values. How can I calculate F1 and use it to optimize the model parameters using tune in e1071? My code is as follows:
tuned = tune.svm(PriYN ~ ., data = dataset, kernel = "radial", probability = TRUE,
                 gamma = 10^(-5:-1), cost = 10^(-3:1),
                 tunecontrol = tune.control(cross = 10))
Using {caret}:
ctrl <- trainControl(method = "repeatedcv",       # choose your CV method
                     number = 5,                  # according to CV method
                     repeats = 2,                 # according to CV method
                     summaryFunction = prSummary, # TO TUNE ON F1 SCORE
                     classProbs = TRUE,
                     verboseIter = TRUE
                     #sampling = "smote"          # you can try the 'smote' resampling method
                     )
Then tune your model:
set.seed(2202)
svm_model <- train(target ~ ., data = training,
                   method = "svmRadial",
                   #preProcess = c("center", "scale"),
                   tuneLength = 10,
                   metric = "F",   # the metric used for tuning is the F1 SCORE
                   trControl = ctrl)
svm_model
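A hedged follow-up (not in the original answer): prSummary relies on the MLmetrics package being installed, and you can inspect the resampled results to confirm that the F column drove the parameter selection. The column names below assume caret's standard output for svmRadial (sigma and C):

library(MLmetrics)   # required by caret::prSummary for precision/recall/F
svm_model$results[, c("sigma", "C", "AUC", "Precision", "Recall", "F")]
svm_model$bestTune   # the sigma/C pair with the highest resampled F1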

How tunelength parameter works in caret

I'm using the following code to implement elastic net in R:
model <- train(
  Sales ~ ., data = train_data, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
)
I'm confused about the tuneLength parameter. In the CRAN documentation I see that:
To change the candidate values of the tuning parameter, either of the tuneLength or tuneGrid arguments can be used. The train function can generate a candidate set of parameter values and the tuneLength argument controls how many are evaluated. In the case of PLS, the function uses a sequence of integers from 1 to tuneLength. If we want to evaluate all integers between 1 and 15, setting tuneLength = 15 would achieve this.
But the train function takes the dependent and independent variables from my data, so how does it use the tuneLength parameter? Can you please help me understand?
In caret the train() function has a number of arguments to help select the "optimal" tuning parameters for your chosen model.
Model tuning is explained in detail in package documentation here.
Users can customize the tuning process by specifying a grid of possible parameter values that the model will use when training the model.
For some models, the use of tuneLength is an alternative to specifying a tuneGrid.
For example, one method of searching for the 'optimal' model parameters is using random selection. In this case the tuneLength argument is used to control the number of combinations generated by this random tuning parameter search.
To use random search, there is another option in trainControl called search; possible values of this argument are "grid" and "random". The built-in models in caret contain code to generate random tuning parameter combinations, and the total number of unique combinations is specified by the tuneLength option to train.
It is covered in more detail here:
http://topepo.github.io/caret/random-hyperparameter-search.html
It is important to check the model you are using in the train function and look at which tuning parameters are used for that model. It will then be easier to understand how to correctly customize the model fitting process.
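For instance (a hedged aside, not part of the original answer), caret's modelLookup() lists exactly which parameters are tuned for a given method, which makes it easier to see what tuneLength is generating candidate values for:

library(caret)
modelLookup("glmnet")   # alpha and lambda are the tuned parameters
modelLookup("pls")      # ncomp is the single tuned parameter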
For your example of using method = 'glmnet', here is a comparison of using tuneGrid and tuneLength (taken from the package tests):
cctrl1 <- trainControl(method = "cv", number = 3, returnResamp = "all",
                       classProbs = TRUE, summaryFunction = twoClassSummary)

test_class_cv_model <- train(trainX, trainY,
                             method = "glmnet",
                             trControl = cctrl1,
                             metric = "ROC",
                             preProc = c("center", "scale"),
                             tuneGrid = expand.grid(.alpha = seq(.05, 1, length = 15),
                                                    .lambda = c((1:5)/10)))
cctrlR <- trainControl(method = "cv", number = 3, returnResamp = "all", search = "random")

test_class_rand <- train(trainX, trainY,
                         method = "glmnet",
                         trControl = cctrlR,
                         tuneLength = 4)

Plot ROC curve for bootstrapped caret model

I have a model like the following:
library(mlbench)
data(Sonar)
library(caret)
set.seed(998)
my_data <- Sonar
fitControl <-
  trainControl(
    method = "boot632",
    number = 10,
    classProbs = TRUE,
    savePredictions = TRUE,
    summaryFunction = twoClassSummary
  )
model <- train(
  Class ~ .,
  data = my_data,
  method = "xgbTree",
  trControl = fitControl,
  metric = "ROC"
)
How do I plot the ROC curve for this model? As I understand it, the probabilities must be saved (which I did in trainControl), but because of the random sampling that bootstrapping uses to generate a 'test' set, I am not sure how caret calculates the ROC value or how to generate a curve.
To isolate the class probabilities for the best performing parameters, I am doing:
for (a in 1:length(model$bestTune)) {
  model$pred <-
    model$pred[model$pred[, paste(colnames(model$bestTune)[a])] == model$bestTune[1, a], ]
}
Please advise.
Thanks!
First an explanation:
If you are not going to check how each possible hyperparameter combination predicted on each sample in each re-sample, you can set savePredictions = "final" in trainControl to save space:
fitControl <-
  trainControl(
    method = "boot632",
    number = 10,
    classProbs = TRUE,
    savePredictions = "final",
    summaryFunction = twoClassSummary
  )
After running the model:
model <- train(
  Class ~ .,
  data = my_data,
  method = "xgbTree",
  trControl = fitControl,
  metric = "ROC"
)
The results of interest are in model$pred.
Here you can check how many samples were tested in each re-sample (in my run I set 25 repetitions):
nrow(model$pred[model$pred$Resample == "Resample01",])
#83
caret always provides prediction from rows not used in the model build.
nrow(my_data) #208
83/208 makes sense for the held-out test samples with boot632 (a bootstrap re-sample leaves roughly 37% of the rows out-of-bag on average).
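As a quick (hedged) sanity check, you can compute the average held-out fraction across all re-samples directly from model$pred:

# each held-out row appears once per re-sample when savePredictions = "final"
mean(table(model$pred$Resample)) / nrow(my_data)   # about 0.37 expected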
Now to build the ROC curve. You may opt for several options here:
- average the probability for each sample and use that (this is usual for CV, since you have all samples repeated the same number of times, but it can be done with boot also);
- plot all as is, without averaging;
- plot a ROC curve for each re-sample.
I will show you the second approach:
Create a data frame of class probabilities and true outcomes:
for_lift = data.frame(Class = model$pred$obs, xgbTree = model$pred$R)
Plot the ROC curve:
pROC::plot.roc(pROC::roc(response = for_lift$Class,
                         predictor = for_lift$xgbTree,
                         levels = c("M", "R")),
               lwd = 1.5)
You can also do this with ggplot2. To do so, I find it easiest to make a lift object using the caret function lift:
lift_obj = lift(Class ~ xgbTree, data = for_lift, class = "R")
The class argument specifies which class the probability column refers to.
library(ggplot2)
ggplot(lift_obj$data) +
  geom_line(aes(1 - Sp, Sn, color = liftModelVar)) +
  scale_color_discrete(guide = guide_legend(title = "method"))
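For completeness, here is a minimal sketch of the first option (averaging the predicted probability for each row across re-samples); it assumes model$pred contains caret's usual rowIndex, obs and class-probability columns:

avg_pred <- aggregate(R ~ rowIndex + obs, data = model$pred, FUN = mean)
pROC::plot.roc(pROC::roc(response = avg_pred$obs,
                         predictor = avg_pred$R,
                         levels = c("M", "R")),
               lwd = 1.5)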

Selecting a Different ROC Set Point in Caret

Is it possible to select a different ROC set point in the caret train function instead of using metric = "ROC" (which I believe maximizes the AUC)?
For example:
random.forest.orig <- train(pass ~ x + y,
                            data = meter.train,
                            method = "rf",
                            tuneGrid = tune.grid,
                            metric = "ROC",
                            trControl = train.control)
Specifically, I have a two-class problem (fail or pass) and I want to maximize the fail predictions while still maintaining a fail accuracy (i.e. negative predictive value) of >80%; in other words, for every 10 fails I predict, at least 8 of them should be correct.
You can customize the caret::trainControl() object to use AUC, instead of accuracy, to tune the parameters of your models. Please check the caret documentation for details. (The built-in function, twoClassSummary, will compute the sensitivity, specificity and area under the ROC curve).
Note: in order to compute class probabilities, the pass feature must be a factor.
Below is an example using 5-fold CV:
fitControl <- caret::trainControl(
  method = "cv",
  number = 5,
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE
)
So your code will be adjusted a bit:
random.forest.orig <- train(pass ~ x + y,
                            data = meter.train,
                            method = "rf",
                            tuneGrid = tune.grid,
                            metric = "ROC",
                            trControl = fitControl)

# Print model to console to examine the output
random.forest.orig
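If you specifically want to tune towards the ">80% negative predictive value for fail" target rather than overall AUC, one option (a hedged sketch, not from the original answer) is a custom summaryFunction built on caret::negPredValue(), which you could then pass to trainControl() and select with metric = "NPV":

npvSummary <- function(data, lev = NULL, model = NULL) {
  # data$obs and data$pred follow caret's summary-function interface
  npv <- caret::negPredValue(data = data$pred, reference = data$obs,
                             negative = "fail")
  c(NPV = npv)
}

npv.control <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                            classProbs = TRUE, summaryFunction = npvSummary)
# then: train(..., metric = "NPV", maximize = TRUE, trControl = npv.control)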
