The caret package in R has an argument, 'selectionFunction', inside trainControl().
It is used to guard against over-fitting by picking a simpler model, for example with Breiman's one-standard-error rule or a tolerance rule.
Does mlr have an equivalent? If so, which function is it within?
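For readers unfamiliar with the rule mentioned above, here is a minimal base-R sketch of the one-standard-error idea. The results table, its column names, and the helper one_se_pick are hypothetical and invented for illustration; this is not caret's implementation.

```r
# Hypothetical resampling results, ordered from simplest to most complex model.
results <- data.frame(
  complexity  = 1:5,
  accuracy    = c(0.80, 0.86, 0.88, 0.89, 0.885),
  accuracy_sd = c(0.04, 0.04, 0.04, 0.04, 0.04)
)
n_resamples <- 10

# Pick the simplest candidate whose mean accuracy is within one
# standard error of the best candidate's mean accuracy.
one_se_pick <- function(res, n) {
  best      <- which.max(res$accuracy)
  threshold <- res$accuracy[best] - res$accuracy_sd[best] / sqrt(n)
  min(which(res$accuracy >= threshold))
}

picked <- one_se_pick(results, n_resamples)
print(results[picked, ])
```

With these made-up numbers the best mean accuracy belongs to candidate 4, but candidate 3 is within one standard error of it, so the simpler candidate 3 is chosen.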
mlr only lets you choose optimal hyperparameters by optimizing certain measures/metrics.
However, essentially each "measure" in mlr is just a function that specifies how a certain performance is handled.
You can try to write your own custom measure as outlined in this vignette.
Other than that, it might be worth opening this as a feature request in the new mlr3 framework, specifically in mlr3measures, since mlr itself is deprecated.
Posting an answer to my own question: I found this.
Estimate relative overfitting.
Source: R/relativeOverfitting.R
Estimates the relative overfitting of a model as the ratio of the difference in test and train performance to the difference of test performance in the no-information case and train performance. In the no-information case the features carry no information with respect to the prediction. This is simulated by permuting features and predictions.
estimateRelativeOverfitting(
  predish,
  measures,
  task,
  learner = NULL,
  pred.train = NULL,
  iter = 1
)
Arguments
predish - (ResampleDesc | ResamplePrediction | Prediction) Resampling strategy or resampling prediction or test predictions.
measures - (Measure | list of Measure) Performance measure(s) to evaluate. Default is the default measure for the task, see getDefaultMeasure.
task - (Task) The task.
learner - (Learner | character(1)) The learner. If you pass a string the learner will be created via makeLearner.
pred.train - (Prediction) Training predictions. Only needed if test predictions are passed.
iter - (integer) Iteration number. Default 1, usually you don't need to specify this. Only needed if test predictions are passed.
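The ratio described in the documentation above can be worked through by hand. The sketch below uses made-up error rates and plain arithmetic rather than calling mlr, so the three input values are assumptions chosen only to show how the number is formed.

```r
# Hypothetical error rates (lower is better); none of these come from a real model.
train_err  <- 0.05   # performance on the training data
test_err   <- 0.15   # performance on held-out data
noinfo_err <- 0.50   # performance when the features carry no information

# (test - train) / (no-information - train)
relative_overfit <- (test_err - train_err) / (noinfo_err - train_err)
print(round(relative_overfit, 3))
```

A value near 0 means test performance is close to training performance; a value near 1 means the model does about as badly on test data as it would with uninformative features.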
Related
I'm using the R-package randomForest version 4.6-14. The function randomForest takes a parameter localImp and if that parameter is set to true the function computes local explanations for the predictions. However, these explanations are for the provided training set. I want to fit a random forest model on a training set and use that model to compute local explanations for a separate test set. As far as I can tell the predict.randomForest function in the same package provides no such functionality. Any ideas?
Can you explain more about what it means to have some local explanation on a test set?
According to this answer along with the package document, the variable importance (or, the casewise importance implied by localImp) evaluates how the variable may affect the prediction accuracy. On the other hand, for the test set where there is no label to assess the prediction accuracy, the variable importance should be unavailable.
Using the defaults of train in the caret package, I am trying to train a random forest model on the dataset xtr2 (dim(xtr2): 765 9408). The problem is that fitting takes far too long (more than one day for a single training run). As far as I know, train by default uses bootstrap resampling (25 repetitions) and three candidate values of mtry, so why should it take so long?
Please note that I need to train the random forest three times in each run (because I need to average the results of different random forest models on the same data), which takes about three days, and I need to run the code for 10 different samples, so it would take me 30 days to get the results.
My question is: how can I make it faster?
Can changing the defaults of train reduce the running time, for example by using CV for training?
Can parallel processing with the caret package help? If yes, how can it be done?
Can tuneRF from the randomForest package make any difference to the time?
This is the code:
library(caret)
library(randomForest)
rffit <- train(xtr2, ytr2, method = "rf", ntree = 500)
rf.mdl <- randomForest(x = xtr2, y = as.factor(ytr2), ntree = 500,
                       keep.forest = TRUE, importance = TRUE, oob.prox = FALSE,
                       mtry = rffit$bestTune$mtry)
Thank you,
My thoughts on your questions:
On changing the defaults of train: yes! But don't forget you also have control over the search grid caret uses for the tuning parameters; in this case, mtry. I'm not sure what the default search grid is for mtry, but try the following:
ctrl <- trainControl("cv", number = 5, verboseIter = TRUE)
set.seed(101) # for reproducibility
rffit <- train(xtr2, ytr2, method = "rf", trControl = ctrl, tuneLength = 5)
On parallel processing: yes! See the caret website: http://topepo.github.io/caret/parallel-processing.html
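As a rough sketch of what that page describes, the snippet below registers a parallel backend with doParallel before calling train. It assumes the doParallel and randomForest packages are installed; the iris data, the two workers, and the small ntree are stand-ins chosen so the example runs quickly, not recommendations for xtr2.

```r
library(caret)
library(doParallel)

# Start two worker processes and register them as the foreach backend.
cl <- parallel::makePSOCKcluster(2)
registerDoParallel(cl)

ctrl <- trainControl(method = "cv", number = 3, allowParallel = TRUE)
set.seed(101)
fit <- train(Species ~ ., data = iris, method = "rf",
             trControl = ctrl, tuneLength = 2, ntree = 50)

# Always shut the workers down and fall back to sequential execution.
parallel::stopCluster(cl)
foreach::registerDoSEQ()

print(fit$bestTune)
```

The resampling iterations are what get distributed across workers, so the benefit grows with the number of folds or bootstrap repetitions.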
On tuneRF: yes and no! tuneRF simply uses the OOB error to find an optimal value of mtry (the only tuning parameter in randomForest). Using cross-validation tends to work better and produce a more honest estimate of model performance. tuneRF can take a long time but should be quicker than k-fold cross-validation.
Overall, the online manual for caret is quite good: http://topepo.github.io/caret/index.html.
Good luck!
You use train for determining mtry only. I would skip the train step, and stay with default mtry:
rf.mdl <- randomForest(x = xtr2, y = as.factor(ytr2), ntree = 500,
                       keep.forest = TRUE, importance = TRUE, oob.prox = FALSE)
I strongly doubt that 3 different runs is a good idea.
If you do 10-fold cross-validation (I am not sure it should be done anyway, as validation is built into the random forest through the out-of-bag estimate), 10 folds is too many if you are short on time. 5 folds would be enough.
Finally, the running time of randomForest is roughly proportional to ntree. Set ntree=100, and your program will run about 5 times faster.
I would also just add that, if the main issue is speed, there are several other random forest implementations in caret, and many of them are much faster than the original randomForest, which is notoriously slow. I've found ranger to be a nice alternative that suited my very simple needs.
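A minimal sketch of calling ranger directly, assuming the ranger package is installed. The iris data and the parameter values are placeholders; num.threads is set low here only to keep the example modest.

```r
library(ranger)

set.seed(101)
# Fit a forest with 100 trees; importance = "impurity" stores variable importance.
rg <- ranger(Species ~ ., data = iris, num.trees = 100,
             importance = "impurity", num.threads = 2)

print(rg$prediction.error)      # out-of-bag error estimate
print(rg$variable.importance)   # one value per predictor
```

ranger is multi-threaded, which is where most of its speed advantage on wide data tends to come from.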
Here is a nice summary of the random forest packages in R. Many of these are in caret already.
Also for consideration, here's an interesting study of the performance of ranger vs Rborist, where you can see how performance is affected by the tradeoff between sample size and number of features.
Let me start by saying that I have read many posts on Cross Validation and it seems there is much confusion out there. My understanding is simply this:
Perform k-fold Cross Validation, e.g. 10 folds, to understand the average error across the 10 folds.
If acceptable then train the model on the complete data set.
I am attempting to build a decision tree using rpart in R and taking advantage of the caret package. Below is the code I am using.
# load libraries
library(caret)
library(rpart)
# define training control: 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)
# train the model
model <- train(resp ~ ., data = mydat, trControl = train_control, method = "rpart")
# make predictions on the training data
predictions <- predict(model, mydat)
# append predictions
mydat <- cbind(mydat, predictions)
# summarize results (stored as cm so caret's confusionMatrix() is not masked)
cm <- confusionMatrix(mydat$predictions, mydat$resp)
I have one question regarding the use of caret's train. I have read the train section of A Short Introduction to the caret Package, which states that during the resampling process the "optimal parameter set" is determined.
In my example, have I coded it up correctly? Do I need to define the rpart parameters within my code or is my code sufficient?
When you perform k-fold cross validation you are already making a prediction for each sample, just over 10 different models (presuming k = 10).
There is no need to make a prediction on the complete data, as you already have its predictions from the k different models.
What you can do is the following:
train_control <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
Then
model <- train(resp ~ ., data = mydat, trControl = train_control, method = "rpart")
If you want to see the observed and predicted values in a nice format, simply type:
model$pred
As for the second part of your question, caret should handle all the parameter tuning. You can try to tune the parameters manually if you desire.
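A small sketch of using the held-out predictions, assuming caret and rpart are installed. It uses iris in place of mydat, and savePredictions = "final" so that only the predictions for the selected tuning value are kept.

```r
library(caret)

set.seed(42)
ctrl  <- trainControl(method = "cv", number = 10, savePredictions = "final")
model <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)

# One held-out prediction per row, each made by a model that never saw that row.
oof <- model$pred
cv_table    <- table(observed = oof$obs, predicted = oof$pred)
cv_accuracy <- mean(oof$obs == oof$pred)

print(cv_table)
print(cv_accuracy)
```

A confusion table built this way reflects out-of-fold performance, unlike one built from predict(model, mydat) on the training data.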
An important thing to note here is not to confuse model selection with model error estimation.
You can use cross-validation to choose the model hyper-parameters (a regularization parameter, for example).
Usually that is done with 10-fold cross validation, because it is a good choice for the bias-variance trade-off (2-fold could give estimates with high bias; leave-one-out CV can give estimates with high variance/over-fitting).
After that, if you don't have an independent test set, you could estimate an empirical distribution of some performance metric using cross validation: once you have found the best hyper-parameters, you can use them to estimate the CV error.
Note that in this step the hyperparameters are fixed, but the model parameters may differ across the cross-validation models.
On the first page of the short introduction document for the caret package, it is mentioned that the optimal model is chosen across the parameters.
As a starting point, one must understand that cross-validation is a procedure for selecting the best modeling approach rather than the model itself (see "CV - Final model selection"). caret provides a grid search option through tuneGrid, where you can provide the candidate parameter values to test. The final model will have the optimized parameter after training is done.
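A minimal sketch of the grid search just described, assuming caret and rpart are installed. The cp values are arbitrary candidates and iris stands in for the real data.

```r
library(caret)

set.seed(42)
# tuneGrid takes a data frame with one column per tuning parameter (cp for rpart).
cp_grid <- data.frame(cp = c(0.001, 0.01, 0.05, 0.1))
ctrl    <- trainControl(method = "cv", number = 10)

model <- train(Species ~ ., data = iris, method = "rpart",
               trControl = ctrl, tuneGrid = cp_grid)

print(model$results)    # resampled performance for every candidate cp
print(model$bestTune)   # the cp value that was selected
final_tree <- model$finalModel   # refit on all rows with the selected cp
```

The cross-validation is used only to compare the candidates; the object in finalModel is a single tree trained on the complete data set.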
I used an ML Pipeline to run logistic regression models, but for some reason I got worse results than in R. I have done some research, and the only post I found that is related to this issue is this one. It seems that Spark Logistic Regression returns models that minimize a loss function while the R glm function uses maximum likelihood. The Spark model only got 71.3% of the records right while R can predict 95.55% of the cases correctly. I was wondering if I did something wrong in the setup and if there's a way to improve the prediction. Below are my Spark code and R code.
Spark code
partial model_input
label,AGE,GENDER,Q1,Q2,Q3,Q4,Q5,DET_AGE_SQ
1.0,39,0,0,1,0,0,1,31.55709342560551
1.0,54,0,0,0,0,0,0,83.38062283737028
0.0,51,0,1,1,1,0,0,35.61591695501733
def trainModel(df: DataFrame): PipelineModel = {
  val lr = new LogisticRegression().setMaxIter(100000).setTol(0.0000000000000001)
  val pipeline = new Pipeline().setStages(Array(lr))
  pipeline.fit(df)
}
val meta = NominalAttribute.defaultAttr.withName("label").withValues(Array("a", "b")).toMetadata
val assembler = new VectorAssembler().
  setInputCols(Array("AGE", "GENDER", "DET_AGE_SQ",
    "QA1", "QA2", "QA3", "QA4", "QA5")).
  setOutputCol("features")
val model = trainModel(model_input)
val pred = model.transform(model_input)
pred.filter("label!=prediction").count
R code
library(magrittr)  # provides %>%
lr <- model_input %>% glm(data = ., formula = label ~ AGE + GENDER + Q1 + Q2 + Q3 + Q4 + Q5 + DET_AGE_SQ,
                          family = binomial)
pred <- data.frame(y = model_input$label, p = fitted(lr))
table(pred$y, pred$p > 0.5)
Feel free to let me know if you need any other information. Thank you!
Edit 9/18/2015 I have tried increasing the maximum iteration and decreasing the tolerance dramatically. Unfortunately, it didn't improve the prediction. It seems the model converged to a local minimum instead of the global minimum.
"It seems that Spark Logistic Regression returns models that minimize loss function while R glm function uses maximum likelihood."
Minimization of a loss function is pretty much the definition of a linear model, and glm and ml.classification.LogisticRegression are no different here. The fundamental difference between the two is how that minimization is achieved.
All linear models from ML/MLlib are based on some variant of gradient descent. The quality of the model generated using this approach varies on a case-by-case basis and depends on the gradient descent and regularization parameters.
R, on the other hand, computes an exact solution which, given its time complexity, is not well suited for large datasets.
As I've mentioned above, the quality of the model generated using GD depends on the input parameters, so the typical way to improve it is to perform hyperparameter optimization. Unfortunately the ML version is rather limited here compared to MLlib, but for starters you can increase the number of iterations.
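To illustrate how an iterative optimizer's result depends on its settings, here is a base-R sketch that fits the same logistic regression with glm and with plain batch gradient descent. It is not Spark's optimizer; the simulated data, the step size, and the helper gd_logistic are assumptions made for the illustration.

```r
set.seed(1)
n <- 500
x <- cbind(1, rnorm(n), rnorm(n))            # intercept plus two features
beta_true <- c(-0.5, 2, -1)
y <- rbinom(n, 1, plogis(x %*% beta_true))

# Reference fit from glm().
beta_glm <- coef(glm(y ~ x[, 2] + x[, 3], family = binomial))

# Plain batch gradient descent on the mean negative log-likelihood.
gd_logistic <- function(x, y, step, iters) {
  beta <- rep(0, ncol(x))
  for (i in seq_len(iters)) {
    grad <- crossprod(x, plogis(x %*% beta) - y) / nrow(x)
    beta <- beta - step * as.vector(grad)
  }
  beta
}

beta_few  <- gd_logistic(x, y, step = 0.5, iters = 10)
beta_many <- gd_logistic(x, y, step = 0.5, iters = 5000)

# Largest coefficient difference from the glm() solution.
gap_few  <- max(abs(beta_few  - beta_glm))
gap_many <- max(abs(beta_many - beta_glm))
print(c(few = gap_few, many = gap_many))
```

Both procedures minimize the same loss; the gradient descent estimate only approaches the glm coefficients once it has been given enough iterations for the chosen step size.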
I am currently trying to optimize a random forest classifier for a very high-dimensional dataset (p > 200k) using recursive feature elimination (RFE). The caret package has a nice implementation for doing this (the rfe() function). However, I am also thinking about optimizing RAM and CPU usage. That's why I wonder whether it is possible to use a different (larger) number of trees to train the first forest (without feature elimination) and to use its importances to build the remaining ones (with RFE) using, for example, 500 trees with 10- or 5-fold cross-validation. I know that this option is available in varSelRF, but how about caret? I didn't manage to find anything regarding this in the manual.
You can do that. The rfFuncs list has an object called fit that defines how the model is fit. One argument to this function is called 'first' which is TRUE on the first fit (there is also a 'last' arg). You can set ntree based on this.
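A sketch of that suggestion, assuming caret and randomForest are installed. The copy named customRF, the tree counts, and the use of iris are illustrative choices, not values taken from the answer.

```r
library(caret)
library(randomForest)

# Copy the built-in helper list and override only the fit function.
customRF <- rfFuncs
customRF$fit <- function(x, y, first, last, ...) {
  # More trees for the first fit (all predictors), fewer for the later subsets.
  n_trees <- if (first) 1000 else 500
  randomForest::randomForest(x, y, importance = TRUE, ntree = n_trees, ...)
}

ctrl <- rfeControl(functions = customRF, method = "cv", number = 5)
set.seed(1)
profile <- rfe(x = iris[, 1:4], y = iris$Species,
               sizes = c(2, 3), rfeControl = ctrl)

print(predictors(profile))
```

Do not also pass ntree through rfe's extra arguments with this override, since it would then reach randomForest twice.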
See the feature selection vignette for more details.
Max