R caret "besttune" for CV & repeatedCV - r

I"m trying to understand how caret is coming to the decision it's making on the best-tuned model. I have looked through the documentation and I have not found (which could easily be my fault) a place to adjust how this decision is made. I'm using something similar to :
train(
  y ~ .,
  data = X,
  num.trees = 1000,
  method = "ranger",
  trControl = trainControl(
    method = "repeatedcv",
    number = 100,
    repeats = 100,
    verboseIter = TRUE
  )
)
I'm trying to use caret more often, and I'm sure there is a smart way it makes this decision. I'm just trying to understand how, and whether I can adjust it.

There is a lot of documentation, but the best place to look for your question is here.
Basically, for grid search, multiple combinations of tuning parameters are evaluated using resampling. Each combination gets an associated resampling estimate of performance (let's say it is accuracy).
train() knows that accuracy should be maximized, so by default it picks the parameter combination with the largest value and uses it to fit one final model on the entire training set.
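If you want to change that rule rather than accept the default, trainControl() takes a selectionFunction argument; "oneSE" and "tolerance" are the built-in alternatives to the default "best". A minimal sketch, assuming the X/y objects and ranger setup from the question (the fold and repeat counts here are placeholders):

library(caret)

# Minimal sketch: selectionFunction = "oneSE" applies the one-standard-error
# rule instead of simply taking the combination with the best resampled metric.
ctrl <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 5,
  selectionFunction = "oneSE",  # alternatives: "best" (default), "tolerance"
  verboseIter = TRUE
)

fit <- train(
  y ~ .,
  data = X,
  method = "ranger",
  num.trees = 1000,
  trControl = ctrl
)
fit$bestTune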

Related

Hyperparameter tuning for neural net (nnet) in caret in R

I am constructing a neural net model in R using the caret package, and my code is as follows:
model <- train(RS_LAI ~ S2REP_LF + PSRI_ES + IRECI + NDVIRE + B11_ES + B7 +
                 TCARI_LF + MCARI + WDRVI,
               data = Data,
               method = "nnet",
               trControl = controlparameters,
               linout = TRUE)
When the model finishes running, the result I get is the final value of size and decay. I suppose size is the number of hidden layers, but I am confused about the number of nodes it's using in each layer. How can I get that? I think the number of nodes is also an important parameter to tune, but caret doesn't give that option.
You are using nnet; if you read the help page:
Fit single-hidden-layer neural network, possibly with skip-layer connections.
So it is 1 layer, and the size parameter is the number of nodes or units, as you can see from the same help page:
size: number of units in the hidden layer. Can be zero if there are skip-layer units.
You can try neuralnet instead; it supports up to 3 hidden layers, and your hyperparameters would be the number of nodes in each layer.
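If I remember correctly, caret wraps neuralnet as method = "neuralnet" (regression only), with layer1, layer2 and layer3 as the number of units in the first three hidden layers. A rough sketch under that assumption, reusing the Data object and formula from the question (the grid values and fold count are arbitrary):

library(caret)

# Rough sketch: tuning the number of nodes per hidden layer via caret's
# "neuralnet" method; a value of 0 is assumed to drop that layer.
grid <- expand.grid(layer1 = c(2, 4, 8),
                    layer2 = c(0, 2),
                    layer3 = 0)

ctrl <- trainControl(method = "cv", number = 5)

model <- train(RS_LAI ~ S2REP_LF + PSRI_ES + IRECI + NDVIRE + B11_ES + B7 +
                 TCARI_LF + MCARI + WDRVI,
               data = Data,
               method = "neuralnet",
               tuneGrid = grid,
               trControl = ctrl)
model$bestTune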

What is the proper way to use glmnet with caret?

I was reading the glmnet documentation and I found this:
Note also that the results of cv.glmnet are random, since the folds are selected at random. Users can reduce this randomness by running cv.glmnet many times, and averaging the error curves.
The following code uses caret with repeated CV.
library(caret)
ctrl <- trainControl(verboseIter = TRUE, classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     method = "repeatedcv", repeats = 10)
fit <- train(x, y, method = "glmnet", metric = "ROC", trControl = ctrl)
Is that the best way to run glmnet with cross-validation through caret, or is it better to run glmnet directly?
You need to define "best way." Do you want to use:
A regularized regression alone on a dataset for feature selection? In that case, use glmnet: Max Kuhn has implied that you may be better off using models with built-in CV features, as they have been optimized for both predictor selection and minimizing error. See below.
"In many cases, using these models with built-in feature selection will be more efficient than algorithms where the search routine for
the right predictors is external to the model. Built-in feature
selection typically couples the predictor search algorithm with the
parameter estimation and are usually optimized with a single
objective function (e.g. error rates or likelihood)." (Kuhn, caret
package documentation: caret feature selection overview)
Or are you comparing different models, one of which is glmnet? In that case, caret may be a great choice.
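If you go the glmnet-only route, the quoted advice (run cv.glmnet several times and average the error curves) can be done directly. A rough sketch, assuming x is a predictor matrix and y a binary outcome as in the question; the seed and number of repeats are arbitrary:

library(glmnet)

# Rough sketch: average cv.glmnet error curves over a fixed lambda sequence.
set.seed(42)
lambda_seq <- glmnet(x, y, family = "binomial")$lambda

cv_runs <- replicate(10, cv.glmnet(x, y, family = "binomial",
                                   lambda = lambda_seq)$cvm)

mean_cvm    <- rowMeans(cv_runs)          # averaged error curve
best_lambda <- lambda_seq[which.min(mean_cvm)]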

R caret randomforest

Using the defaults of train in the caret package, I am trying to train a random forest model for the dataset xtr2 (dim(xtr2): 765 9408). The problem is that it takes unbelievably long (more than one day for one training run) to fit. As far as I know, train by default uses bootstrap sampling (25 times) and three random selections of mtry, so why should it take so long?
Please note that I need to train the rf three times in each run (because I need to average the results of different random forest models on the same data), which takes about three days, and I need to run the code for 10 different samples, so it would take me 30 days to get the results.
My question is: how can I make it faster?
Can changing the defaults of train reduce the run time, for example by using CV for training?
Can parallel processing with the caret package help? If yes, how can it be done?
Can tuneRF from the randomForest package change the run time?
This is the code:
rffit <- train(xtr2, ytr2, method = "rf", ntree = 500)
rf.mdl <- randomForest(x = xtr2, y = as.factor(ytr2), ntree = 500,
                       keep.forest = TRUE, importance = TRUE, oob.prox = FALSE,
                       mtry = rffit$bestTune$mtry)
Thank you,
My thoughts on your questions:
Yes! But don't forget you also have control over the search grid caret uses for the tuning parameters; in this case, mtry. I'm not sure what the default search grid is for mtry, but try the following:
ctrl <- trainControl("cv", number = 5, verboseIter = TRUE)
set.seed(101) # for reproducibility
rffit <- train(xtr2, ytr2, method = "rf", trControl = ctrl, tuneLength = 5)
Yes! See the caret website: http://topepo.github.io/caret/parallel-processing.html (a minimal sketch follows this answer).
Yes and No! tuneRF simply uses the OOB error to find an optimal value of mtry (the only tuning parameter in randomForest). Using cross-validation tends to work better and produce a more honest estimate of model performance. tuneRF can take a long time but should be quicker than k-fold cross-validation.
Overall, the online manual for caret is quite good: http://topepo.github.io/caret/index.html.
Good luck!
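As mentioned above, here is a minimal sketch of parallel processing with doParallel; the core count and 5-fold CV are arbitrary choices, and xtr2/ytr2 are the question's objects:

library(caret)
library(doParallel)

# Start a parallel backend; caret's resampling loops then run in parallel
# because allowParallel = TRUE (the trainControl default).
cl <- makePSOCKcluster(4)
registerDoParallel(cl)

ctrl  <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
rffit <- train(xtr2, ytr2, method = "rf", trControl = ctrl, tuneLength = 5)

stopCluster(cl)
registerDoSEQ()  # return to sequential processing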
You use train only to determine mtry. I would skip the train step and stay with the default mtry:
rf.mdl <- randomForest(x = xtr2, y = as.factor(ytr2), ntree = 500,
                       keep.forest = TRUE, importance = TRUE, oob.prox = FALSE)
I strongly doubt that 3 different runs are a good idea.
If you do 10-fold cross-validation (I am not sure it is needed anyway, since validation is built into the random forest via the OOB error), 10 folds is too many if you are short on time; 5 folds would be enough.
Finally, the running time of randomForest is proportional to ntree. Set ntree = 100, and your program will run 5 times faster.
I would also add that, if the main issue is speed, there are several other random forest implementations available through caret, and many of them are much faster than the original randomForest, which is notoriously slow. I've found ranger to be a nice alternative that suited my very simple needs.
Here is a nice summary of the random forest packages in R. Many of these are in caret already.
Also for consideration, here's an interesting study of the performance of ranger vs rborist, where you can see how performance is affected by the tradeoff between sample size and features.
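If you want to try ranger without leaving caret, a minimal sketch under the question's setup (xtr2/ytr2; the tree count, fold number and tuneLength are arbitrary):

library(caret)

# Minimal sketch: swap method = "rf" for the faster "ranger" backend;
# num.trees is passed straight through to ranger.
ctrl  <- trainControl(method = "cv", number = 5, verboseIter = TRUE)
rffit <- train(xtr2, ytr2, method = "ranger",
               num.trees = 100, trControl = ctrl, tuneLength = 3)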

How can you reduce the default ntree=500 parameter passed to RF from caret?

I believe the "rf" (randomForest) method in caret sets the default number of trees at 500. Unfortunately, this causes the time complexity to grow out of control for larger datasets. Is there any quick way to reduce the number of trees without creating a custom method? I know that the only tuneable parameter for rf is mtry.
Just to clarify: I'm not looking to tune on number of trees. I simply want to fix it to a lower value so that I can run rf in a reasonable amount of time.
You can specify the ntree parameter when you call train like so:
rf <- train(X, y, method="rf", preProcess=c("center","scale"), ntree=100, trControl=fitControl)
One suggestion would be to use the randomForest library. I have always found that one simpler to use than the one in caret, and it has a parameter to set the number of trees.

Tuning ksvm from kernlab

I want to use an SVM implementation in R to do some regression. I already tried svm from e1071, but I am limited by the kernel functions there, so I moved on to ksvm from kernlab. However, a major drawback is that kernlab does not provide a tuning function (like tune.svm in e1071). Can someone explain how I can tune the parameters for the different kernels there?
PS. I particularly want to use the rbfdot kernel, so if someone can at least help me understand how to tune sigma, I'd be extremely grateful.
PPS. I'm completely aware that the "automatic" value for kpar can be used "to calculate a good sigma". But I need something more tangible and more along the lines of tune.svm.
Either you write your own wrapper (it wouldn't be that hard, to be honest) or you can try proven, already-implemented solutions such as mlr and caret.
The mlr tutorial has an example of this.
ps = makeParamSet(
  makeDiscreteParam("C", values = 2^(-2:2)),
  makeDiscreteParam("sigma", values = 2^(-2:2))
)
ctrl = makeTuneControlGrid()
rdesc = makeResampleDesc("CV", iters = 3L)
res = tuneParams("classif.ksvm", task = iris.task, resampling = rdesc,
                 par.set = ps, control = ctrl)
This will perform 3-fold cross-validation to select parameters from the grid and evaluate accuracy on the iris dataset. You can, of course, change the resampling strategy (leave-one-out, Monte Carlo CV, CV, repeated CV, bootstrap validation and holdout are all implemented), the search strategy (grid search, random search, generalized simulated annealing and iterated F-race are all supported) and the evaluation metrics.
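Since caret was also mentioned: a hedged sketch of the equivalent caret route, where method = "svmRadial" wraps kernlab's ksvm with the rbfdot kernel and tunes sigma and C. The data frame Data and response y are placeholders for your regression data, and the grid and fold count are arbitrary:

library(caret)
library(kernlab)

# Hedged sketch: grid-search sigma and C for an RBF-kernel ksvm through caret.
grid <- expand.grid(sigma = 2^(-2:2), C = 2^(-2:2))
ctrl <- trainControl(method = "cv", number = 3)

svm_fit <- train(y ~ ., data = Data,
                 method = "svmRadial",
                 tuneGrid = grid,
                 trControl = ctrl)
svm_fit$bestTune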
