This question already has an answer here:
Optimising caret for sensitivity still seems to optimise for ROC
(1 answer)
Closed 2 years ago.
Let's consider data
set.seed(20)
y <- sample(0:1, 100, replace = T)
x <- data.frame(rnorm(100), rexp(100))
I want to perform cross validation and output sensitivity and specificity. I found out that I can provide additional input to train function 'metric' to specify which metric I want to have. SO :
# train the model on training set
library(caret)
cross <- train(as.factor(y) ~ .,
data = cbind(y,x),
metric = 'Sensitivity',
trControl = trainControl(method = "cv", number = 5)
,
method = "glm",
family = binomial()
)
However I see the problem :
The metric "Sensitivity" was not in the result set. Accuracy will be used instead.
Is there any solution how Sensitivity and specificity can be used in cross validation ?
You can subset confusionMatrix() with $ or [] and this will probably give you what you need.
You can also use function like negPredValue() to get Sensitivity and Specificity.
The 'Sensitivity' metric does not exist for train() in caret package.
Since you are using caret, you can find some of the answer in the documentation of this package. It states that the metric parameter is ...
a string that specifies what summary metric will be used to select the
optimal model. By default, possible values are "RMSE" and "Rsquared"
for regression and "Accuracy" and "Kappa" for classification. If
custom performance metrics are used (via the summaryFunction argument
in trainControl, the value of metric should match one of the
arguments. If it does not, a warning is issued and the first metric
given by the summaryFunction is used. (NOTE: If given, this argument
must be named.)
So by default, a 'Sensitivity' metric does not exist. But you can define such a metric yourself. One approach is to use the trainControl function to pass a custom function that calculates sensitivity. See Optimising caret for sensitivity still seems to optimise for ROC for instance.
Related
I want to do feature selection of my random forrest model following the approach of rfe of the caret package. As my data set contains only about 100 labeled samples and as it is highly unbalanced (which reflects real life balance), I need/want to do stratified cross validation. However, I did not find any documentation about the rfeControl function regarding stratified cross validation.
Does anybody know if the rfeControl function does create stratified folds if I use
ctrl <- rfeControl(functions = rfFuncs,
method = "cv",
verbose = FALSE)
with method ="cv", rfe() should use createFolds() to create your folds, and these will be balanced based on your output variable.
You can see ?createFolds for details on how this is implemented.
I am quite new to the neural network world so I ask for your understanding. I am generating some tests and thus I have a question about the parameters size and decay. I use the caret package and the method nnet. Example dataset:
require(mlbench)
require(caret)
require (nnet)
data(Sonar)
mydata=Sonar[,1:12]
set.seed(54878)
ctrl = trainControl(method="cv", number=10,returnResamp = "all")
for_train= createDataPartition(mydata$V12, p=.70, list=FALSE)
my_train=mydata[for_train,]
my_test=mydata[-for_train,]
t.grid=expand.grid(size=5,decay=0.2)
mymodel = train(V12~ .,data=my_train,method="nnet",metric="Rsquared",trControl=ctrl,tuneGrid=t.grid)
So, two are my questions. First, is this the best way with caret to use the nnet method?Second, I have read about the size and the decay (eg. Purpose of decay parameter in nnet function in R?) but I cannot understand how to use them in practice here. Can anyone help?
Brief Caret explanation
The Caret package lets you train different models and tuning hyper-parameters using Cross Validation (Hold-Out or K-fold) or Bootstrap.
There are two different ways to tune the hyper-parameters using Caret: Grid Search and Random Search. If you use Grid Search (Brute Force) you need to define the grid for every parameter according to your prior knowledge or you can fix some parameters and iterate on the remain ones. If you use Random Search you need to specify a tuning length (maximum number of iterations) and Caret is going to use random values for hyper-parameters until the stop criteria holds.
No matter what method you choose Caret is going to use each combination of hyper-parameters to train the model and compute performance metrics as follows:
Split the initial Training samples into two different sets: Training and Validation (For bootstrap or Cross validation) and into k sets (For k-fold Cross Validation).
Train the model using the training set and to predict on validation set (For Cross Validation Hold-Out and Bootstrap). Or using k-1 training sets and to predict using the k-th training set (For K-fold Cross Validation).
On the validation set Caret computes some performance metrics as ROC, Accuracy...
Once the Grid Search has finished or the Tune Length is completed Caret uses the performance metrics to select the best model according to the criteria previously defined (You can use ROC, Accuracy, Sensibility, RSquared, RMSE....)
You can create some plot to understand the resampling profile and to pick the best model (Keep in mind performance and complexity)
if you need more information about Caret you can check the Caret web page
Neural Network Training Process using Caret
When you train a neural network (nnet) using Caret you need to specify two hyper-parameters: size and decay. Size is the number of units in hidden layer (nnet fit a single hidden layer neural network) and decay is the regularization parameter to avoid over-fitting. Keep in mind that for each R package the name of the hyper-parameters can change.
An example of training a Neural Network using Caret for classification:
fitControl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 5,
classProbs = TRUE,
summaryFunction = twoClassSummary)
nnetGrid <- expand.grid(size = seq(from = 1, to = 10, by = 1),
decay = seq(from = 0.1, to = 0.5, by = 0.1))
nnetFit <- train(Label ~ .,
data = Training[, ],
method = "nnet",
metric = "ROC",
trControl = fitControl,
tuneGrid = nnetGrid,
verbose = FALSE)
Finally, you can make some plots to understand the resampling results. The following plot was generated from a GBM training process
GBM Training Process using Caret
I'm having a dataset of 90 stations with a variety of different covariates which I would like to take for prediction by using a step-wise forward multiple regression. Therefore I would like to use Monte Carlo Cross Validation to estimate the performance of my linear model by splitting into test- and training tests for many times.
How can I implement the MCCV in R to test my model for certain iterations? I found the package WilcoxCV which gives me the observation number for each iteration. I also found the CMA-package which doesn't helps me a lot so far.
I checked all threads about MCCV but didn't find the answer.
You can use the caret package. The MCCV is called 'LGOCV' in this package (i.e Leave Group Out CV). It randomly selects splits between training and test sets.
Here is an example use training a L1-regularized regression model (you should look into regularization instead of step-wise btw), validating the selection of the penalizing lambda parameter using MCCV:
library(caret)
library(glmnet)
n <- 1000 # nbr of observations
m <- 20 # nbr of features
# Generate example data
x <- matrix(rnorm(m*n),n,m)
colnames(x) <- paste0("var",1:m)
y <- rnorm(n)
dat <- as.data.frame(cbind(y,x))
# Set up training settings object
trControl <- trainControl(method = "LGOCV", # Leave Group Out CV (MCCV)
number = 10) # Number of folds/iterations
# Set up grid of parameters to test
params = expand.grid(alpha=c(0,0.5,1), # L1 & L2 mixing parameter
lambda=2^seq(1,-10, by=-0.3)) # regularization parameter
# Run training over tuneGrid and select best model
glmnet.obj <- train(y ~ ., # model formula (. means all features)
data = dat, # data.frame containing training set
method = "glmnet", # model to use
trControl = trControl, # set training settings
tuneGrid = params) # set grid of params to test over
# Plot performance for different params
plot(glmnet.obj, xTrans=log, xlab="log(lambda)")
# Plot regularization paths for the best model
plot(glmnet.obj$finalModel, xvar="lambda", label=T)
You can use glmnet to train linear models. If you want to use step-wise caret supports that too using e.g method = 'glmStepAIC' or similar.
a list of the feature selection wrappers can be found here: http://topepo.github.io/caret/Feature_Selection_Wrapper.html
Edit
alphaand lambda arguments in the expand.grid function are glmnet specific parameters. If you use another model it will have a different set of parameters to optimize over.
lambda is the amount of regularization, i.e the amount of penalization on the beta values. Larger values will give "simpler" models, less prone to overfit, and smaller values more complex models that will tend to overfit if not enough data is available. The lambda values I supplied are just an example. Supply the grid you are interested in. But in general it is nice to supply an exponentially decreasing sequence for lambda.
alpha is the mixing parameter between L1 and L2 regularization. alpha=1 is L1 and alpha=0 is L2 regularization. I only supplied one value in the grid for this parameter. It is of course possible to supply several, like e.g alpha=c(0,0.5,1) which would test L1, L2 and an even mix of the two.
expand.grid creates a grid of potential parameter values we want to run the MCCV procedure over. Essentially, the MCCV procedure will evaluate performance for each of the different values in the grid and select the best one for you.
You can read more about glmnet, caret and parameter tuning here:
An Introduction to Glmnet
glmnet documentation
Model Training and Parameter Tuning with Caret
I am currently wondering on the way to set 10 trees using the random forest algorithm from the Caret package, and hope an assistance could be obtained:
below is my syntax:
tr <- trainControl(method = "repeatedcv",number = 20)
fit<-train(y ~.,method="rf",data=example, trControl=tr)
Following researches on http://www.inside-r.org/packages/cran/randomForest/docs/randomForest
Setting either n=10
as argument in randomForest() or n.trees in case of using gbm could have merely helped, but I am interested in the Caret package.
Any feedback would be very appreciated.
Thanks
Caret's train() uses the randomForest() function when you specify method = "rf" in the train call.
You simply need to pass ntree = 10 to train which then will be passed on to randomForest().
Therefore, your call would look like this:
fit <- train(y ~., method="rf",data=example, trControl=tr, ntree = 10)
For interest to anyone in my position who landed here while using ranger method of random forrest (Google still directed me here when specifying "ranger" in my search term) use num.trees.
num.trees = 20
I think ntree is a parameter you are looking for
When using caret's train function to fit GBM classification models, the function predictionFunction converts probabilistic predictions into factors based on a probability threshold of 0.5.
out <- ifelse(gbmProb >= .5, modelFit$obsLevels[1], modelFit$obsLevels[2])
## to correspond to gbmClasses definition above
This conversion seems premature if a user is trying to maximize the area under the ROC curve (AUROC). While sensitivity and specificity correspond to a single probability threshold (and therefore require factor predictions), I'd prefer AUROC be calculated using the raw probability output from gbmPredict. In my experience, I've rarely cared about the calibration of a classification model; I want the most informative model possible, regardless of the probability threshold over which the model predicts a '1' vs. '0'. Is it possible to force raw probabilities into the AUROC calculation? This seems tricky, since whatever summary function is used gets passed predictions that are already binary.
"since whatever summary function is used gets passed predictions that are already binary"
That's definitely not the case.
It cannot use the classes to compute the ROC curve (unless you go out of your way to do so). See the note below.
train can predict the classes as factors (using the internal code that you show) and/or the class probabilities.
For example, this code will compute the class probabilities and use them to get the area under the ROC curve:
library(caret)
library(mlbench)
data(Sonar)
ctrl <- trainControl(method = "cv",
summaryFunction = twoClassSummary,
classProbs = TRUE)
set.seed(1)
gbmTune <- train(Class ~ ., data = Sonar,
method = "gbm",
metric = "ROC",
verbose = FALSE,
trControl = ctrl)
In fact, if you omit the classProbs = TRUE bit, you will get the error:
train()'s use of ROC codes requires class probabilities. See the classProbs option of trainControl()
Max