I am trying to fit an xgboost model on a multiclass prediction problem, and wanted to use caret for the hyperparameter search.
To test the package, I ran the following code, which takes about 20 seconds when I do not ask trainControl to resample (method = "none"):
# just use one parameter combination
xgb_grid_1 <- expand.grid(
  nrounds = 1,
  eta = 0.3,
  max_depth = 5,
  gamma = 0,
  colsample_bytree = 1,
  min_child_weight = 1
)
# train
xgb_train_1 <- train(
  x = as.matrix(sparse_train),
  y = conversion_tbl$y_train_c,
  trControl = trainControl(method = "none", classProbs = TRUE, summaryFunction = multiClassSummary),
  metric = "logLoss",
  tuneGrid = xgb_grid_1,
  method = "xgbTree"
)
However, when I supply train with a cross-validation trainControl object, the code never finishes, or at least takes a very long time (it had not finished after 15 minutes):
xgb_trcontrol_1 <- trainControl(
  method = "cv",
  number = 2,
  verboseIter = TRUE,
  returnData = FALSE,
  returnResamp = "none",
  classProbs = TRUE,
  summaryFunction = multiClassSummary
)
xgb_train_1 <- train(
  x = as.matrix(sparse_train),
  y = conversion_tbl$y_train_c,
  trControl = xgb_trcontrol_1,
  metric = "logLoss",
  tuneGrid = xgb_grid_1,
  method = "xgbTree"
)
Why is this?
FYI, my data size is
dim(sparse_train)
[1] 702402 36
Your trainControl objects are different.
In the first trainControl object, the method is method="none".
In the second trainControl object, the method is method="cv" with number=2, so you are running a two-fold cross-validation: the model is fit once on each fold and then once more on the full training set, which takes considerably longer than a single fit on data of this size.
Another thing you can try is adding nthread = 1 to the caret::train() call.
Both XGBoost and Caret try to use parallel/multicore processing where possible, and in the past I have found this to (silently) cause too many threads to spawn, throttling your machine.
Telling caret to process models in sequence minimizes the problem and should mean that only xgboost will be spawning threads.
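For example, a minimal sketch of the second call with nthread = 1 added (the argument is forwarded through train's ... arguments to xgboost):
# same call as before, but restrict xgboost to a single thread;
# nthread is passed through caret's ... arguments to the xgboost fit
xgb_train_1 <- train(
  x = as.matrix(sparse_train),
  y = conversion_tbl$y_train_c,
  trControl = xgb_trcontrol_1,
  metric = "logLoss",
  tuneGrid = xgb_grid_1,
  method = "xgbTree",
  nthread = 1
)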
Related
How do I choose the optimal df (degrees of freedom) for my splines?
I used Poisson regression with splines to adjust for non-linear changes. Using the caret package, I called the train function with method = "gamSpline", and it tested only 3 df.
model <- train(
  RBC ~ elapsed,
  obgyn_aleph,
  method = "gamSpline",
  trControl = trainControl(
    method = "cv",
    number = 10,
    verboseIter = TRUE
  )
)
Aggregating results
Selecting tuning parameters
Fitting df = 3 on full training set
Is this the default? If so, how can I change it?
Tnx,
Daniel
The tuneGrid argument allows the user to specify a custom grid of tuning parameters, in this case df:
model <- train(
  RBC ~ elapsed,
  obgyn_aleph,
  method = "gamSpline",
  trControl = trainControl(
    method = "cv",
    number = 10,
    verboseIter = TRUE
  ),
  tuneGrid = data.frame(df = seq(2, 20, by = 2))
)
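After training, something like the following should show which df the cross-validation selected (a small sketch using the model object above):
model$bestTune   # the df value chosen by cross-validation
plot(model)      # resampled performance across the candidate df values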
I'm fitting a model using k-fold cross-validation with caret:
library(caret)
library(dplyr)   # for select()

## tuning & parameters
set.seed(123)
train_control <- trainControl(
  method = "cv",
  number = 5,
  savePredictions = TRUE,
  verboseIter = TRUE,
  classProbs = TRUE,
  summaryFunction = my_summary
)
linear_model <- train(
  x = select(training_data, Avg_Load_Time),
  y = target,
  trControl = train_control,
  method = "glm",        # logistic regression
  family = "binomial",
  metric = "ROC"
)
The trouble is that out of ~5K rows I have only ~120 true cases. This throws a warning when fitting the GLM via caret: "glm.fit: fitted probabilities numerically 0 or 1 occurred".
Is there a parameter I can set or some approach to ensuring each fold has some of the true cases?
It's easier when you shuffle the data and have enough examples of each class.
If you don't have enough examples, you can increase the size of the minority class using SMOTE (Synthetic Minority Oversampling Technique); see the smotefamily package in R.
Then you will be able to do 5- or 10-fold cross-validation without raising any issues.
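A minimal sketch of that idea with smotefamily, assuming the predictors are numeric and the target is a two-level factor (the object names here are placeholders):
library(smotefamily)

# oversample the minority class before cross-validation;
# SMOTE() expects a numeric predictor data frame and a target vector
sm <- SMOTE(X = predictors, target = target, K = 5)

balanced <- sm$data                        # original rows plus synthetic minority rows
balanced$class <- factor(balanced$class)   # SMOTE returns the target in a column named "class"

# then train on the balanced data as before, e.g.
# train(x = balanced[, names(predictors)], y = balanced$class, ...)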
I have been trying to implement the mlpKerasDropout method when training using the caret R package.
My code seems to cycle through 10 of 10 epochs continuously and doesn't seem to converge. I have studied mlpKerasDropout.R but am struggling to understand how this function works.
Has anyone out there got a minimal example that they can share of how to use this function?
Many thanks,
For each resample (driven by trainControl()), caret runs a separate fit. That is what you're seeing with the 10 of 10 epochs cycling continuously: each cycle is one resample/fold being fit. You can change the number of epochs used during hyperparameter tuning by setting the epochs argument to train, which is passed to the training method "mlpKerasDropout" via the dot arguments (...).
See the code for mlpKerasDropout here: https://github.com/topepo/caret/blob/master/models/files/mlpKerasDropout.R
By default, the search argument for hyperparameters is set to 'grid' but you may want to set it to 'random' so that it tries different activation functions other than relu, or provide your own tuning grid.
Here's a code sample showing the use of tuneLength with search = 'random', along with early stopping and the epochs argument passed through to keras:
tune_model <- train(x, y,
                    method = "mlpKerasDropout",
                    preProc = c('center', 'scale', 'spatialSign'),
                    trControl = trainControl(search = 'random', classProbs = TRUE,
                                             summaryFunction = mnLogLoss, allowParallel = TRUE),
                    metric = 'logLoss',
                    tuneLength = 20,
                    # keras arguments following
                    validation_split = 0.25,
                    callbacks = list(
                      keras::callback_early_stopping(monitor = "val_loss", mode = "auto",
                                                     patience = 20, restore_best_weights = TRUE)
                    ),
                    epochs = 500)
Keep in mind that you want to re-fit your model on the full training data after completing the hyperparameter-tuning CV.
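One way to do that final re-fit, sketched under the assumption that the tune_model object above is available: pass the winning hyperparameters back in via tuneGrid with resampling switched off.
# single fit on the full training set using the tuning winner;
# trainControl(method = "none") skips resampling entirely
final_model <- train(x, y,
                     method = "mlpKerasDropout",
                     preProc = c('center', 'scale', 'spatialSign'),
                     trControl = trainControl(method = "none", classProbs = TRUE),
                     tuneGrid = tune_model$bestTune,
                     validation_split = 0.25,
                     epochs = 500)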
I am having some issues understanding how a custom summary function in caret works. I have written a simple function that rewards predicting everything as "fail", but for some reason the model doesn't predict all instances as fail (on the training dataset).
See below for the code:
The summary function that rewards "fail" predictions:
BS <- function(data, lev = NULL, model = NULL) {
  # count how many resampled predictions are "fail"
  negpredictions <- sum(data$pred == "fail")
  names(negpredictions) <- c("Min_Precision")
  negpredictions
}
Training Script:
library(caret)
library(doSNOW)   # for makeCluster() / registerDoSNOW()

train.control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 3,
                              classProbs = TRUE,
                              #sampling = "smote",
                              summaryFunction = BS,
                              search = "grid")

tune.grid <- expand.grid(.mtry = seq(from = 1, to = 10, by = 1))

cl <- makeCluster(3, type = "SOCK")
registerDoSNOW(cl)

random.forest.orig <- train(pass ~ manufacturer + meter.type + premise + size + age + avg.winter + totalizer,
                            data = meter.train,
                            method = "rf",
                            tuneGrid = tune.grid,
                            metric = "Min_Precision",
                            maximize = TRUE,
                            trControl = train.control)

stopCluster(cl)
The metric specified in caret is not a loss function but rather the metric used to choose the optimal model (in most cases the optimal combination of hyperparameters). So by specifying the BS function you are merely selecting the mtry that maximizes the count of "fail" predictions.
From the function help:
metric
A string that specifies what summary metric will be used to select the optimal model. By default, possible values are "RMSE" and "Rsquared" for regression and "Accuracy" and "Kappa" for classification. If custom performance metrics are used (via the summaryFunction argument in trainControl), the value of metric should match one of the arguments. If it does not, a warning is issued and the first metric given by the summaryFunction is used. (NOTE: If given, this argument must be named.)
If you check
random.forest.orig$bestTune
you will see that the best tune is the one that maximized the BS function. However, this does not change the model's native loss function.
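For example, inspecting the fitted object shows which mtry was selected (using the object from the question):
random.forest.orig$results    # Min_Precision averaged over resamples, one row per mtry
random.forest.orig$bestTune   # the mtry value that maximized Min_Precision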
I'd like to downsample my data given that I have a significant class imbalance. Without downsampling, my GBM model performs reasonably well; however, with r-caret's downSample, accuracy = 0.5. I applied the same downsampling to another GBM model and got exactly the same results. What gives?
set.seed(1914)
down_train_my_gbm <- downSample(x = combined_features,
                                y = combined_features$label)
down_train_my_gbm$label <- NULL

my_gbm_combined_downsampled <- train(Class ~ .,
                                     data = down_train_my_gbm,
                                     method = "gbm",
                                     trControl = trainControl(method = "repeatedcv",
                                                              number = 10, repeats = 3,
                                                              classProbs = TRUE),
                                     preProcess = c("range"),
                                     verbose = FALSE)
I suspected that the issue might have to do with classProbs=TRUE. Changing this to FALSE skyrockets the accuracy to >0.95...but I get the exact same results for multiple models (which do not result in the same accuracy without downsampling). I'm baffled by this. What am I doing wrong here?
caret's train function allows you to downsample, upsample, and more via the trainControl options; following the guide Subsampling During Resampling, in your case it would be:
ctrl <- trainControl(method = "repeatedcv", repeats = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     ## new option here:
                     sampling = "down")

model_with_down_sample <- train(Class ~ ., data = imbal_train,
                                method = "gbm",
                                preProcess = c("range"),
                                verbose = FALSE,
                                trControl = ctrl)
As a side note, avoid the formula interface (e.g. Class ~ .) and pass the predictor columns directly: the formula interface has been shown to have memory and speed issues when many predictors are used (https://github.com/topepo/caret/issues/263).
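A sketch of the non-formula version of the same call, assuming imbal_train contains the predictors plus the Class column:
# pass predictors and outcome separately instead of Class ~ .
predictors <- imbal_train[, setdiff(names(imbal_train), "Class")]

model_with_down_sample <- train(x = predictors,
                                y = imbal_train$Class,
                                method = "gbm",
                                preProcess = c("range"),
                                verbose = FALSE,
                                trControl = ctrl)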
Hope it helps.