Is it possible to select a different ROC set point in the caret train() function instead of using metric = "ROC" (which I believe maximizes the AUC)?
For example:
random.forest.orig <- train(pass ~ x + y,
                            data = meter.train,
                            method = "rf",
                            tuneGrid = tune.grid,
                            metric = "ROC",
                            trControl = train.control)
Specifically, I have a two-class problem (fail or pass) and I want to maximize the fail predictions while still maintaining a fail accuracy (or negative predictive value) of >80%, i.e. for every 10 fails I predict, at least 8 of them are correct.
You can customize the caret::trainControl() object to use AUC, instead of accuracy, to tune the parameters of your models. Please check the caret documentation for details. (The built-in function, twoClassSummary, will compute the sensitivity, specificity and area under the ROC curve).
Note: in order to compute class probabilities, the pass outcome must be a factor.
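For example (a minimal sketch; the column and level names are assumed from the question):
# classProbs = TRUE requires a factor outcome whose levels are valid R variable names
meter.train$pass <- factor(meter.train$pass, levels = c("fail", "pass"))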
Below is an example using 5-fold CV:
fitControl <- caret::trainControl(
  method = "cv",
  number = 5,
  summaryFunction = twoClassSummary,
  classProbs = TRUE,
  verboseIter = TRUE
)
So your code will be adjusted a bit:
random.forest.orig <- train(pass ~ x + y,
                            data = meter.train,
                            method = "rf",
                            tuneGrid = tune.grid,
                            metric = "ROC",
                            trControl = fitControl)
# Print model to console to examine the output
random.forest.orig
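Regarding the >80% constraint from the question: metric = "ROC" tunes for overall AUC rather than a specific operating point. One option (a sketch, not part of the original answer) is to also set savePredictions = "final" in fitControl and then sweep the probability cutoff over the held-out predictions, keeping the cutoff that predicts the most fails while the fraction of correct fail calls stays above 0.8:
# Sketch: assumes savePredictions = "final" was added to fitControl and that
# "fail" is a factor level, so model$pred has a "fail" probability column
preds <- random.forest.orig$pred
cutoffs <- seq(0.05, 0.95, by = 0.05)
cutoff.stats <- t(sapply(cutoffs, function(cut) {
  called.fail <- preds$fail >= cut
  c(cutoff = cut,
    fails.predicted = sum(called.fail),
    fail.precision = mean(preds$obs[called.fail] == "fail"))
}))
cutoff.stats  # pick the cutoff with the most predicted fails whose fail.precision > 0.8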
How do I choose the optimal df(degrees of freedom) for my splines?
I used Poisson regression and splines, which help me adjust for non-linear changes. Using the caret package, I used the train function with method = "gamSpline" to test only 3 df.
model <- train(
  RBC ~ elapsed,
  obgyn_aleph,
  method = "gamSpline",
  trControl = trainControl(
    method = "cv",
    number = 10,
    verboseIter = TRUE
  )
)
Aggregating results
Selecting tuning parameters
Fitting df = 3 on full training set
Is this the default? If so, how can I change it?
Thanks,
Daniel
The tuneGrid argument allows the user to specify a custom grid of tuning parameters, in this case df:
model <- train(
  RBC ~ elapsed,
  obgyn_aleph,
  method = "gamSpline",
  trControl = trainControl(
    method = "cv",
    number = 10,
    verboseIter = TRUE
  ),
  tuneGrid = data.frame(df = seq(2, 20, by = 2))
)
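After training, you can check which df the resampling selected and how the error varies across the grid (a short sketch using standard train accessors):
model$bestTune   # the df value chosen by cross-validation
plot(model)      # resampled error (e.g. RMSE) as a function of df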
I am trying to run a logistic regression on a classification problem.
The dependent variable SUBSCRIBEDYN is a factor with 2 levels ("Yes" and "No").
train.control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 10,
                              verboseIter = F,
                              classProbs = T,
                              summaryFunction = prSummary)
set.seed(13)
simple.logistic.regression <- caret::train(SUBSCRIBEDYN ~ .,
                                           data = train_data,
                                           method = "glm",
                                           metric = "Accuracy",
                                           trControl = train.control)
simple.logistic.regression
However, it does not accept "Accuracy" as a metric:
"The metric "Accuracy" was not in the result set. AUC will be used instead"
The metric must be one of the statistics computed by your summaryFunction. Since you specified summaryFunction = prSummary, the result set contains AUC, Precision, Recall and F, so "Accuracy" is not available; use metric = "AUC", or switch to summaryFunction = twoClassSummary with metric = "ROC". However, after training the model, you can still retrieve the accuracy from the confusion matrix, for example using the function confusionMatrix().
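As a sketch of the last point (test_data here is an assumed held-out set with the same columns as train_data):
# 1. Confusion matrix averaged over the resampling hold-outs
#    (may require savePredictions = TRUE or "final" in trainControl)
caret::confusionMatrix(simple.logistic.regression)
# 2. Confusion matrix, including accuracy, on a separate test set
pred <- predict(simple.logistic.regression, newdata = test_data)
caret::confusionMatrix(pred, reference = test_data$SUBSCRIBEDYN)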
I'm using the following code to implement elastic net in R:
model <- train(
  Sales ~ ., data = train_data, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
)
I'm confused about the tuneLength parameter. On CRAN I see that:
To change the candidate values of the tuning parameter, either of the
tuneLength or tuneGrid arguments can be used. The train function can
generate a candidate set of parameter values and the tuneLength
argument controls how many are evaluated. In the case of PLS, the
function uses a sequence of integers from 1 to tuneLength. If we want
to evaluate all integers between 1 and 15, setting tuneLength = 15
would achieve this
But the train function takes the dependent and independent variables from my data, so how does it use the tuneLength parameter? Can you please help me understand?
In caret the train() function has a number of arguments to help select the "optimal" tuning parameters for your chosen model.
Model tuning is explained in detail in package documentation here.
Users can customize the tuning process by specifying a grid of possible parameter values that the model will use when training the model.
For some models, the use of tuneLength is an alternative to specifying a tuneGrid.
For example, one method of searching for the 'optimal' model parameters is using random selection. In this case the tuneLength argument is used to control the number of combinations generated by this random tuning parameter search.
To use random search, another option is available in trainControl called search. Possible values of this argument are "grid" and "random". The built-in models contained in caret contain code to generate random tuning parameter combinations. The total number of unique combinations is specified by the tuneLength option to train.
It is covered in more detail here:
http://topepo.github.io/caret/random-hyperparameter-search.html
It is important to check the model you are using in the train function and look at which tuning parameters are used for that model. It will then be easier to understand how to correctly customize the model fitting process.
For your example of using method = 'glmnet' here is a comparison using tuneGrid and tuneLength (taken from package tests):
cctrl1 <- trainControl(method = "cv", number = 3, returnResamp = "all",
                       classProbs = TRUE, summaryFunction = twoClassSummary)

test_class_cv_model <- train(trainX, trainY,
                             method = "glmnet",
                             trControl = cctrl1,
                             metric = "ROC",
                             preProc = c("center", "scale"),
                             tuneGrid = expand.grid(.alpha = seq(.05, 1, length = 15),
                                                    .lambda = c((1:5)/10)))
cctrlR <- trainControl(method = "cv", number = 3, returnResamp = "all", search = "random")

test_class_rand <- train(trainX, trainY,
                         method = "glmnet",
                         trControl = cctrlR,
                         tuneLength = 4)
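Either way, the parameter combinations that were actually evaluated (and their resampled performance) end up in the results element of the fitted object, so you can compare the two approaches directly (a short sketch):
head(test_class_cv_model$results[, c("alpha", "lambda", "ROC")])  # the 15 x 5 grid from tuneGrid
test_class_rand$results[, c("alpha", "lambda")]                   # 4 random pairs from tuneLength = 4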
I have a model like the following:
library(mlbench)
data(Sonar)
library(caret)
set.seed(998)
my_data <- Sonar
fitControl <-
  trainControl(
    method = "boot632",
    number = 10,
    classProbs = T,
    savePredictions = T,
    summaryFunction = twoClassSummary
  )
model <- train(
  Class ~ .,
  data = my_data,
  method = "xgbTree",
  trControl = fitControl,
  metric = "ROC"
)
How do I plot the ROC curve for this model? As I understand it, the probabilities must be saved (which I did in trainControl), but because of the random sampling which bootstrapping uses to generate a 'test' set, I am not sure how caret calculates the ROC value and how to generate a curve.
To isolate the class probabilities for the best performing parameters, I am doing:
for (a in 1:length(model$bestTune)) {
  model$pred <-
    model$pred[model$pred[, paste(colnames(model$bestTune)[a])] == model$bestTune[1, a], ]
}
Please advise.
Thanks!
First, an explanation:
If you are not going to check how each possible hyperparameter combination predicted on each sample in each re-sample, you can set savePredictions = "final" in trainControl to save space:
fitControl <-
  trainControl(
    method = "boot632",
    number = 10,
    classProbs = T,
    savePredictions = "final",
    summaryFunction = twoClassSummary
  )
After running the model:
model <- train(
  Class ~ .,
  data = my_data,
  method = "xgbTree",
  trControl = fitControl,
  metric = "ROC"
)
The results of interest are in model$pred.
Here you can check how many samples were tested in each re-sample (the trainControl above uses 10 bootstrap re-samples):
nrow(model$pred[model$pred$Resample == "Resample01",])
#83
caret always provides predictions from the rows not used to build the model.
nrow(my_data) #208
83 held-out rows out of 208 makes sense for the test samples under boot632.
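As a rough sanity check (my own reasoning, not from the answer above): with bootstrap resampling about 1 - 1/e, roughly 37%, of the rows are out-of-bag in each re-sample on average, i.e. around 77 of 208, so 83 held-out rows is in the expected range. You can see the per-re-sample counts directly:
table(model$pred$Resample)  # held-out row counts per re-sample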
Now to build the ROC curve. You have several options here:
- average the probability for each sample and use that (this is usual for CV, since every sample is held out the same number of times, but it can be done with bootstrap as well); a sketch of this approach appears after the ggplot example below
- plot everything as is, without averaging
- plot a separate ROC curve for each re-sample
I will show you the second approach:
Create a data frame of class probabilities and true outcomes:
for_lift = data.frame(Class = model$pred$obs, xgbTree = model$pred$R)
Plot the ROC curve:
pROC::plot.roc(pROC::roc(response = for_lift$Class,
                         predictor = for_lift$xgbTree,
                         levels = c("M", "R")),
               lwd = 1.5)
You can also do this with ggplot2. To do so, I find it easiest to make a lift object using the caret function lift:
lift_obj = lift(Class ~ xgbTree, data = for_lift, class = "R")
The class argument specifies which class the probability column refers to (here "R").
library(ggplot2)
ggplot(lift_obj$data) +
  geom_line(aes(1 - Sp, Sn, color = liftModelVar)) +
  scale_color_discrete(guide = guide_legend(title = "method"))
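For completeness, here is a sketch of the first option mentioned above: average the class probability per original row across re-samples, then build a single ROC curve from the averages. It relies on the rowIndex column that caret stores in model$pred; dplyr is just one convenient way to do the aggregation.
library(dplyr)

avg_pred <- model$pred %>%
  group_by(rowIndex) %>%
  summarise(obs = first(obs), R = mean(R))

pROC::plot.roc(pROC::roc(response = avg_pred$obs,
                         predictor = avg_pred$R,
                         levels = c("M", "R")),
               lwd = 1.5)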
I've been dealing with some extremely imbalanced data and I would like to use stratified sampling to create more balanced random forests.
Right now I'm using the caret package, mainly for tuning the random forests.
So I tried to set up a tuneGrid to pass the mtry and sampsize parameters into caret's train method as follows:
mtryGrid <- data.frame(.mtry = 100),.sampsize=80)
rfTune <- train(x = trainX,
                y = trainY,
                method = "rf",
                trControl = ctrl,
                metric = "Kappa",
                ntree = 1000,
                tuneGrid = mtryGrid,
                importance = TRUE)
When I run this example, I get the following error:
The tuning parameter grid should have columns mtry
I've come across discussions like this suggesting that passing these parameters in should be possible.
On the other hand, this page suggests that the only parameter that can be passed in is mtry.
Can I even pass sampsize into the random forests via caret?
It looks like there is a bracket issue with your mtryGrid. Alternatively, you can also use expand.grid to give the different values of mtry you want to try.
By default, the only parameter you can tune for a random forest is mtry. However, you can still pass other parameters to train; they will have a fixed value and so won't be tuned by train, but you can still ask train to use a stratified sample. Below is how I would do it, assuming that trainY is a boolean variable according to which you want to stratify your samples, and that you want samples of size 80 for each category:
mtryGrid <- expand.grid(mtry = 100) # you can put different values for mtry
rfTune <- train(x = trainX,
                y = trainY,
                method = "rf",
                trControl = ctrl,
                metric = "Kappa",
                ntree = 1000,
                tuneGrid = mtryGrid,
                strata = factor(trainY),
                sampsize = c(80, 80),
                importance = TRUE)
I doubt one can directly pass sampsize and strata to train. But from here I believe the solution is to use the sampling argument of trainControl(). That is,
mtryGrid <- data.frame(mtry = 100)
rfTune <- train(x = trainX,
                y = trainY,
                method = "rf",
                trControl = trainControl(sampling = X),
                metric = "Kappa",
                ntree = 1000,
                tuneGrid = mtryGrid,
                importance = TRUE)
where X can be one of c("up","down","smote","rose").
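For instance, a minimal sketch using down-sampling (the method = "cv" and number = 5 settings are assumptions; "smote" and "rose" additionally require their supporting packages):
ctrl.down <- trainControl(method = "cv", number = 5, sampling = "down")
# then call train() exactly as above, with trControl = ctrl.down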