I have been trying to implement the mlpKerasDropout method when training using the caret R package.
My code seems to cycle through 10 of 10 epochs continuously and doesn't seem to converge. I have studied mlpKerasDropout.R but am struggling to understand how this function works.
Has anyone out there got a minimal example that they can share of how to use this function?
Many thanks,
For each resample (driven by trainControl()), train() runs a separate fit. That is what you're seeing with the 10 of 10 epochs cycling continuously: each cycle is a resample/fold being fit. You can change the number of epochs used while you're hyperparameter tuning by setting the epochs argument to train(), which is passed to the training method "mlpKerasDropout" via the dots (...).
See the code for mlpKerasDropout here: https://github.com/topepo/caret/blob/master/models/files/mlpKerasDropout.R
By default, the search argument for hyperparameters is set to 'grid', but you may want to set it to 'random' so that it tries activation functions other than relu, or provide your own tuning grid.
Here's a code sample showing the usage of tuneLength with search = 'random', using early stopping as well as the epochs argument, both passed through to keras.
tune_model <- train(x, y,
                    method = "mlpKerasDropout",
                    preProc = c('center', 'scale', 'spatialSign'),
                    trControl = trainControl(search = 'random', classProbs = TRUE,
                                             summaryFunction = mnLogLoss, allowParallel = TRUE),
                    metric = 'logLoss',
                    tuneLength = 20,
                    # keras arguments following
                    validation_split = 0.25,
                    callbacks = list(
                      keras::callback_early_stopping(monitor = "val_loss", mode = "auto",
                                                     patience = 20, restore_best_weights = TRUE)
                    ),
                    epochs = 500)
Keep in mind you want to re-fit your model on the training data after completing the hyperparameter tuning CV.
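For example, a minimal sketch of that refit, re-using x, y and tune_model from above (trainControl(method = 'none') needs a single-row tuneGrid, which bestTune provides):

final_model <- train(x, y,
                     method = "mlpKerasDropout",
                     preProc = c('center', 'scale', 'spatialSign'),
                     trControl = trainControl(method = 'none', classProbs = TRUE),
                     tuneGrid = tune_model$bestTune,
                     # keras arguments passed through as before
                     validation_split = 0.25,
                     epochs = 500)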
I am new to R and trying to generate predictions of a "yes"/"no" variable using an ensemble model. To do so, I am using caret to generate predictions from a random forest (ranger), a LASSO (glmnet) and a gradient boosted regression tree (xgbLinear) model. My dataset contains around 600k observations and 500 variables (a mix of continuous and binary variables, weighing 322MB), of which I am using 30% to train the models.
Here is the code I am using for this:
train_control_final <- trainControl(method = "none", savePredictions = TRUE,
                                    allowParallel = TRUE, classProbs = TRUE,
                                    summaryFunction = twoClassSummary)

rff_final <- train(training_y ~ ., data = training_final, method = "ranger",
                   tuneGrid = rfgrid_final, num.trees = 250, metric = "ROC",
                   sample.fraction = 0.1, replace = TRUE,
                   trControl = train_control_final, maximize = FALSE, na.action = na.omit)

rboost_final <- train(training_y ~ ., data = training_final, method = "xgbLinear",
                      tuneGrid = boostgrid_final, metric = "ROC", subsample = 0.1,
                      trControl = train_control_final, maximize = FALSE, na.action = na.omit)

rlasso_final <- train(training_y ~ ., data = training_final, method = "glmnet",
                      tuneGrid = lassogrid_final, metric = "ROC",
                      trControl = train_control_final, maximize = FALSE, na.action = na.omit)
In a first step, I use a different 10% of the sample to tune the parameters for each model (using 3-fold CV), with the results stored in rf/lasso/boostgrid_final (for this, I use the same code structure).
The problem is that every time I run this code to generate the predictions, R crashes because of memory issues. The code works without problems when tuning the parameters and when including fewer variables (around 60). So my question is: what can I try to make this work with the full dataset? Or is there an alternative way of accomplishing what I want (generating predictions using these three different algorithms) without running into memory issues?
Thanks a lot in advance!
I'm using the following code to implement elastic net in R:
model <- train(
  Sales ~ ., data = train_data, method = "glmnet",
  trControl = trainControl("cv", number = 10),
  tuneLength = 10
)
I'm confused about the tuneLength parameter. In the CRAN documentation I see that:
To change the candidate values of the tuning parameter, either of the
tuneLength or tuneGrid arguments can be used. The train function can
generate a candidate set of parameter values and the tuneLength
argument controls how many are evaluated. In the case of PLS, the
function uses a sequence of integers from 1 to tuneLength. If we want
to evaluate all integers between 1 and 15, setting tuneLength = 15
would achieve this.
But the train function takes the dependent and independent variables from my data, so how does it use the tuneLength parameter? Can you please help me understand?
In caret the train() function has a number of arguments to help select the "optimal" tuning parameters for your chosen model.
Model tuning is explained in detail in package documentation here.
Users can customize the tuning process by specifying a grid of possible parameter values that the model will use when training the model.
For some models, the use of tuneLength is an alternative to specifying a tuneGrid.
For example, one method of searching for the 'optimal' model parameters is using random selection. In this case the tuneLength argument is used to control the number of combinations generated by this random tuning parameter search.
To use random search, another option is available in trainControl called search. Possible values of this argument are "grid" and "random". The built-in models contained in caret contain code to generate random tuning parameter combinations. The total number of unique combinations is specified by the tuneLength option to train.
It is covered in more detail here:
http://topepo.github.io/caret/random-hyperparameter-search.html
It is important to check the model you are using in the train function and look at which tuning parameters are used for that model. It will then be easier to understand how to correctly customize the model fitting process.
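For instance, modelLookup() in caret shows which tuning parameters a given method exposes (a quick sketch, assuming caret is loaded):

library(caret)
modelLookup("glmnet")  # alpha and lambda are the tunable parameters
modelLookup("rf")      # mtry is the only tunable parameter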
For your example of using method = 'glmnet' here is a comparison using tuneGrid and tuneLength (taken from package tests):
cctrl1 <- trainControl(method = "cv", number = 3, returnResamp = "all",
                       classProbs = TRUE, summaryFunction = twoClassSummary)

test_class_cv_model <- train(trainX, trainY,
                             method = "glmnet",
                             trControl = cctrl1,
                             metric = "ROC",
                             preProc = c("center", "scale"),
                             tuneGrid = expand.grid(.alpha = seq(.05, 1, length = 15),
                                                    .lambda = c((1:5)/10)))

cctrlR <- trainControl(method = "cv", number = 3, returnResamp = "all", search = "random")

test_class_rand <- train(trainX, trainY,
                         method = "glmnet",
                         trControl = cctrlR,
                         tuneLength = 4)
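If you want to see exactly which candidate values train generates from your data for a given tuneLength, you can call the model's grid function directly (a sketch, assuming caret and glmnet are installed; mtcars is just a stand-in for your predictors and outcome):

library(caret)
glmnet_info <- getModelInfo("glmnet", regex = FALSE)[[1]]
# the same grid() function train() uses internally; len corresponds to tuneLength
glmnet_info$grid(x = mtcars[, -1], y = mtcars$mpg, len = 5, search = "grid")
glmnet_info$grid(x = mtcars[, -1], y = mtcars$mpg, len = 5, search = "random")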
I am trying to fit an xgboost model on a multiclass prediction problem, and wanted to use caret to do the hyperparameter search.
To test the package, I used the following code, and it takes 20 seconds when I do not supply the train call with a trainControl object:
# just use one parameter combination
xgb_grid_1 <- expand.grid(
  nrounds = 1,
  eta = 0.3,
  max_depth = 5,
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1
)

# train
xgb_train_1 <- train(
  x = as.matrix(sparse_train),
  y = conversion_tbl$y_train_c,
  trControl = trainControl(method = "none", classProbs = TRUE,
                           summaryFunction = multiClassSummary),
  metric = "logLoss",
  tuneGrid = xgb_grid_1,
  method = "xgbTree"
)
However, when I supply train with a trainControl object, the code never finishes, or takes a very long time (at least it didn't finish within 15 minutes):
xgb_trcontrol_1 <- trainControl(
  method = "cv",
  number = 2,
  verboseIter = TRUE,
  returnData = FALSE,
  returnResamp = "none",
  classProbs = TRUE,
  summaryFunction = multiClassSummary
)

xgb_train_1 <- train(
  x = as.matrix(sparse_train),
  y = conversion_tbl$y_train_c,
  trControl = xgb_trcontrol_1,
  metric = "logLoss",
  tuneGrid = xgb_grid_1,
  method = "xgbTree"
)
Why is this?
FYI, my data size is
dim(sparse_train)
[1] 702402 36
Your trainControl objects are different.
In the first trainControl object, the method is method="none".
In the second trainControl object, the method is method="cv" and number=2. So, in the second case, you are running a two-fold cross-validation, which takes longer than not running cross-validation at all: with method = "none", train fits a single model for the supplied parameter combination, whereas with 2-fold CV it fits one model per fold plus a final model on the full training set.
Another thing you can try is adding nthread = 1 to the caret::train() call.
Both XGBoost and Caret try to use parallel/multicore processing where possible, and in the past I have found this to (silently) cause too many threads to spawn, throttling your machine.
Telling caret to process models in sequence minimizes the problem and should mean that only xgboost will be spawning threads.
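For example (a sketch built on the objects from the question above):

xgb_train_1 <- train(
  x = as.matrix(sparse_train),
  y = conversion_tbl$y_train_c,
  trControl = xgb_trcontrol_1,
  metric = "logLoss",
  tuneGrid = xgb_grid_1,
  method = "xgbTree",
  nthread = 1  # passed through to xgboost so it stays single-threaded while caret handles any parallelism
)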
I am using the caret package to fine-tune the random forest mtry parameter. In the package, the tuneLength parameter can be used to automate the search for the best mtry value. But the problem is that tuneLength only works when I set at least 2 folds in cross-validation. It does not work when I do not want cross-validation.
ctrl <- trainControl(method = "cv", number = 2, classProbs = TRUE,
                     summaryFunction = twoClassSummary)
set.seed(2)
trained <- train(Y ~ ., data = mydata, method = "rf", ntree = 500,
                 tuneLength = 10, metric = "ROC", trControl = ctrl,
                 importance = TRUE)
And does anyone know the default setting of tuneLength? I mean, which value of mtry would it start with?
I think you don't understand what parameter tuning means. You want to select the combination of parameters that improves some quality measure. The thing is that this quality measure can't be computed on the training set itself, because that would lead to overfitting. Cross-validation precisely gives you an unbiased estimate of that quality measure.
But the problem is that tuneLength only works when I set at least 2 folds in cross-validation. It does not work when I do not want cross-validation.
I'm not sure what "does not work" means. If you are not resampling, there are not many ways of determining mtry. You could use method = "oob" in trainControl to rely on the random forest's internal out-of-bag estimate, and set tuneLength the same way you did before (see these two pages for more details).
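For example, a minimal sketch of that, re-using the data and settings from your question (the ROC/twoClassSummary options are dropped here because the OOB estimate for random forests reports accuracy and Kappa rather than resampled class probabilities):

ctrl_oob <- trainControl(method = "oob")
set.seed(2)
trained_oob <- train(Y ~ ., data = mydata, method = "rf",
                     ntree = 500, tuneLength = 10,
                     trControl = ctrl_oob, importance = TRUE)
# trained_oob$results then contains the OOB performance for each candidate mtry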
Again, I'm not sure if this answers your question.
Max
I've been dealing with some extremely imbalanced data and I would like to use stratified sampling to create more balanced random forests.
Right now, I'm using the caret package, mainly for tuning the random forests.
So I tried to set up a tuneGrid to pass the mtry and sampsize parameters into caret's train method as follows.
mtryGrid <- data.frame(.mtry = 100),.sampsize=80)
rfTune <- train(x = trainX,
                y = trainY,
                method = "rf",
                trControl = ctrl,
                metric = "Kappa",
                ntree = 1000,
                tuneGrid = mtryGrid,
                importance = TRUE)
When I run this example, I get the following error
The tuning parameter grid should have columns mtry
I've come across discussions like this suggesting that passing these parameters in should be possible.
On the other hand, this page suggests that the only parameter that can be passed in is mtry.
Can I even pass sampsize into the random forest via caret?
It looks like there is a bracket issue with your mtryGrid. Alternatively, you can also use expand.grid to give the different values of mtry you want to try.
By default the only parameter you can tune for a random forest is mtry. However, you can still pass the other parameters to train. But those will have a fixed value and so won't be tuned by train. You can still ask to use a stratified sample in train. Below is how I would do it, assuming that trainY is a boolean variable according to which you want to stratify your samples, and that you want samples of size 80 for each category:
mtryGrid <- expand.grid(mtry = 100) # you can put different values for mtry
rfTune <- train(x = trainX,
                y = trainY,
                method = "rf",
                trControl = ctrl,
                metric = "Kappa",
                ntree = 1000,
                tuneGrid = mtryGrid,
                strata = factor(trainY),
                sampsize = c(80, 80),
                importance = TRUE)
I doubt one can directly pass sampsize and strata to train. But from here I believe the solution is to use the sampling option of trainControl(). That is,
mtryGrid <- data.frame(.mtry = 100)
rfTune <- train(x = trainX,
                y = trainY,
                method = "rf",
                trControl = trainControl(sampling = X),
                metric = "Kappa",
                ntree = 1000,
                tuneGrid = mtryGrid,
                importance = TRUE)
where X can be one of c("up","down","smote","rose").
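For example, a minimal sketch using down-sampling (the cross-validation settings here are placeholders; "up", "smote" or "rose" can be substituted for "down" in the same way):

ctrl_down <- trainControl(method = "cv", number = 5,
                          sampling = "down")  # down-sample the majority class within each resample
rfTune_down <- train(x = trainX,
                     y = trainY,
                     method = "rf",
                     trControl = ctrl_down,
                     metric = "Kappa",
                     ntree = 1000,
                     tuneGrid = data.frame(.mtry = 100),
                     importance = TRUE)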