r caret train and predict after tuning - r

I'm currently working on a research paper.
My dataset consists of daily stock index data with a predictor that says UP or DOWN.
In total I use 20% of my dataset to tune the parameters. Based on the code below I found that I should use sigma = 0.05 and C = 25.
Now, I want to train 50% on my dataset based on these settings and make a prediction on the other 50% of the set.
Could anyone explain how I should do this? I can't find how I should proceed in caret's documentation.
### MODELS
## SetUp Training
trainControl <- trainControl(method="cv", index = fit_on, indexOut = pred_on, savePredictions = TRUE)
metric <- "Accuracy"
## Parameter tuning
# SVM
set.seed(1908)
grid_rf <- expand.grid(C = c(0.1, 0.25, 0.5, 1, 5, 10, 25, 50), sigma = c(0,025, 0.05, 0.1, 0.25, 0.5, 1, 1.5, 2, 2.5, 3, 3.5,4,4.5,5,7,5,10,15))
fit.svm <- train(class~., data=dataCombined, method="svmRadial", metric=metric, preProc=c("range"),trControl=trainControl, tuneGrid = grid_rf)
Many thanks!
Best,
Mathijs

Related

Why does my code take so long to process?

I try to run code from this web site in my computer.
I use data set from kaggle competition
In my training data 1022 rows and 81 variables.
I run this code:
hyper_grid <- expand.grid(
shrinkage = c(.01, .1, .3),
interaction.depth = c(1, 3, 5),
n.minobsinnode = c(5, 10, 15),
bag.fraction = c(.65, .8, 1),
optimal_trees = 0, # a place to dump results
min_RMSE = 0 # a place to dump results
)
random_index <- sample(1:nrow(ames_train), nrow(ames_train))
random_ames_train <- ames_train[random_index, ]
# grid search
for(i in 1:nrow(hyper_grid)) {
# reproducibility
set.seed(123)
# train model
gbm.tune <- gbm(
formula = SalePrice ~ .,
distribution = "gaussian",
data = random_ames_train,
n.trees = 5000,
interaction.depth = hyper_grid$interaction.depth[i],
shrinkage = hyper_grid$shrinkage[i],
n.minobsinnode = hyper_grid$n.minobsinnode[i],
bag.fraction = hyper_grid$bag.fraction[i],
train.fraction = .75,
n.cores = NULL, # will use all cores by default
verbose = FALSE
)
I'm waiting more than 1 hour.
I think it's bacause my laptop is not powerful.
On the picture you can see parameters of my computer.
Please, answer: can my computer perform this operation?
If yes, how long should I wait?
It's taking a long time because you're training 81 GBM models, and GBM's are complex. To get a rough estimate of training time, you could train one model and then multiply that time by 81.

Predict in R gives probabilities or some score?

I am new to R and trying to build xgboost model , I am able to get prediction in pred variable
X_train <- train_list$X
X_test <- test_list$X
X_Train <- as(X_train, "dgCMatrix")
xgb <- xgboost(data = X_train,
label = y_train,
eta = 0.9,
max_depth = 3,
nround=10,
subsample = 0.3,
colsample_bytree = 0.1,
seed = 1,
eval_metric = "auc",
objective = "binary:logistic",
#num_class = 2,
booster = 'gbtree',
max_delta_step = 1,
scale_pos_weight=2,
gamma=5 ,
min_child_weight = 2
)
pred <- predict(xgb, X_test)
I would like to know pred is probability or score, I feel the prediction are probabilities but not sure.
If some one can clear by doubt on this and if it is not probabilities then what it is ?
In a binary / 2-class problem, if one of the classes (positive) = 1, and the other class (negative) = 0, your predictions that you calculate using your trained model (via predict()) will fall somewhere between 0 and 1. The probability that a predicted value is 1 (positive class) or 0 (negative class) is the absolute difference from 0.5, and you typically transform the predictions so that <0.5 becomes '0' (negative) and >0.5 becomes '1' (positive). This is all explained in detail in the docs: https://xgboost.readthedocs.io/en/latest/R-package/xgboostPresentation.html#perform-the-prediction

Issues with neural net number of rounds with caret ensemble

I'm creating a simple ensemble of two xgboost and mxnet models. The data frame is A3n.df with the classification variable at A3n.df[,1]. Both the models run fine on their own and get believable accuracy. All data is normalized 0-1, shuffled and the class variable converted to a factor (for caret). I have already run grid search for the best hyperparameters, but need to include a grid for caretEnsemble.
#training grid for xgboost
xgb_grid_A3 = expand.grid(
nrounds = 1200,
eta = 0.01,
max_depth = 20,
gamma = 1,
colsample_bytree = 0.6,
min_child_weight = 2,
subsample = 0.8)
#training grid for mxnet
mxnet_grid_A3 = expand.grid(layer1 = 12,
layer2 = 2,
layer3 = 0,
learningrate = 0.001,
dropout = 0
beta1 = .9,
beta2 = 0.999,
activation = 'relu')
Ensemble_control_A4 <- trainControl(
method = "cv",
number = 5,
verboseIter = TRUE,
returnData = TRUE,
returnResamp = "all",
classProbs = TRUE,
summaryFunction = twoClassSummary,
allowParallel = TRUE,
sampling = "up",
index=createResample(yEf, 20))
yE = A4n.df[,1]
xE = data.matrix(A4n.df[,-1])
yf <- yE
yEf <- ifelse(yE == 0, "no", "yes")
yEf <- factor(yEf)
Ensemble_list_A4 <- caretList(
x=xE,
y=yEf,
trControl=Ensemble_control_A4,
metric="ROC",
methodList=c("glm", "rpart"),
tuneList=list(
xgbA4=caretModelSpec(method="xgbTree", tuneGrid=xgb_grid_A4),
mxA4=caretModelSpec(method="mxnetAdam", tuneGrid=mxnet_grid_A4)))
XGboost seems to train fine:
+ Resample01: eta=0.01, max_depth=20, gamma=1, colsample_bytree=0.6, min_child_weight=2, subsample=0.8, nrounds=1200
....
+ Resample20: eta=0.01, max_depth=20, gamma=1, colsample_bytree=0.6, min_child_weight=2, subsample=0.8, nrounds=1200
- Resample20: eta=0.01, max_depth=20, gamma=1, colsample_bytree=0.6, min_child_weight=2, subsample=0.8, nrounds=1200
Aggregating results
Selecting tuning parameters
Fitting nrounds = 1200, max_depth = 20, eta = 0.01, gamma = 1, colsample_bytree = 0.6, min_child_weight = 2, subsample = 0.8 on full training set
However, mxnet seems to only run for 10 rounds, when 1 or 2 thousand makes more sense, and there seems to be missing parameters:
+ Resample01: layer1=12, layer2=2, layer3=0, learningrate=0.001, dropout=0, beta1=0.9, beta2=0.999, activation=relu
Start training with 1 devices
[1] Train-accuracy=0.487651209677419
[2] Train-accuracy=0.624751984126984
[3] Train-accuracy=0.599082341269841
[4] Train-accuracy=0.651909722222222
[5] Train-accuracy=0.662202380952381
[6] Train-accuracy=0.671006944444444
[7] Train-accuracy=0.676463293650794
[8] Train-accuracy=0.683407738095238
[9] Train-accuracy=0.691964285714286
[10] Train-accuracy=0.698660714285714
- Resample01: layer1=12, layer2=2, layer3=0, learningrate=0.001, dropout=0, beta1=0.9, beta2=0.999, activation=relu
+ Resample01: parameter=none
- Resample01: parameter=none
+ Resample02: parameter=none
Aggregating results
Selecting tuning parameters
Fitting cp = 0.0243 on full training set
There were 40 warnings (use warnings() to see them)
Warnings (1-40):
1: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == ... :
prediction from a rank-deficient fit may be misleading
I expect mxnet to train for thousands of rounds, and the training accuracy to end up like the pre-ensemble model, 60-70%
*On second thought, some of the 20 mxnet runs reach 60-70%, but it seems inconsistent. Perhaps it is functioning normally?
There's a note in the caret documentation that num.round needs to be set by the user outside the tune_grid: http://topepo.github.io/caret/train-models-by-tag.html
Ensemble_list_A2 <- caretList(
x=xE,
y=yEf,
trControl=Ensemble_control_A2,
metric="ROC",
methodList=c("glm", "rpart", "bayesglm"),
tuneList=list(
xgbA2=caretModelSpec(method="xgbTree", tuneGrid=xgb_grid_A2),
mxA2=caretModelSpec(method="mxnetAdam", tuneGrid=mxnet_grid_A2, num.round=1500, ctx=mx.gpu()),
svmA2=caretModelSpec(method="svmLinear2", tuneGrid=svm_grid_A2),
rfA2=caretModelSpec(method="rf", tuneGrid=rf_grid_A2)))

xgboost always predict 1 level with imbalance dataset

I'm using xgboost to build a model. The dataset only has 200 rows and 10000 cols.
I tried chi-2 to get 100 cols, but my confusion matrix looks like this:
1 0
1 190 0
0 10 0
I tried to use 10000 attributes, randomly select 100 attributes, select 100 attributes according to chi-2, but I never get 0 case predicted. Is it because of the dataset, or because the way I use xgboost?
My factor(pred.cv) is always showing only 1 level, while factor(y+1) has 1 or 2 as levels.
param <- list("objective" = "binary:logistic",
"eval_metric" = "error",
"nthread" = 2,
"max_depth" = 5,
"eta" = 0.3,
"gamma" = 0,
"subsample" = 0.8,
"colsample_bytree" = 0.8,
"min_child_weight" = 1,
"max_delta_step"= 5,
"learning_rate" =0.1,
"n_estimators" = 1000,
"seed"=27,
"scale_pos_weight" = 1
)
nfold=3
nrounds=200
pred.cv = matrix(bst.cv$pred, nrow=length(bst.cv$pred)/1, ncol=1)
pred.cv = max.col(pred.cv, "last")
factor(y+1) # this is the target in train, level 1 and 2
factor(pred.cv) # this is the issue, it is always only 1 level
I found caret to be slow and it is not able to tune all the parameters of xgboost models without building a custom model which is quite more complicated than using ones own custom function for evaluation.
However if you are doing some up/down sampling or smote/rose caret is the way to go since it incorporates them correctly in the model evaluating phase (during re-sampling). See: https://topepo.github.io/caret/subsampling-for-class-imbalances.html
However I found these techniques have a very small impact on the results and usually for the worse, at least in the models I trained.
scale_pos_weight gives a higher weight to a certain class, if the minority class is at 10% abundance then playing with scale_pos_weight around 5 - 10 should be beneficial.
Tuning regularization parameters can be quite beneficial for xgboost: here one has several parameters: alpha, beta and gamma - I found valid values to be 0 - 3. Other useful parameters that add direct regularization (by adding uncertainty) are subsample, colsample_bytree and colsample_bylevel. I found that playing with colsample_bylevel can also have a positive outcome on the model. subsample and colsample_bytree you are already utilizing.
I would test a much smaller eta and more trees to see if the model benefits. early_stopping_rounds rounds can speed up the process in that case.
Other eval_metric are probably going to be more beneficial than accuracy. Try logloss or auc and even map and ndcg
Here is a function for grid search of hyper-parameters. It uses auc as evaluation metric but one can change that easily
xgb.par.opt=function(train, seed){
require(xgboost)
ntrees=2000
searchGridSubCol <- expand.grid(subsample = c(0.5, 0.75, 1),
colsample_bytree = c(0.6, 0.8, 1),
gamma = c(0, 1, 2),
eta = c(0.01, 0.03),
max_depth = c(4,6,8,10))
aucErrorsHyperparameters <- apply(searchGridSubCol, 1, function(parameterList){
#Extract Parameters to test
currentSubsampleRate <- parameterList[["subsample"]]
currentColsampleRate <- parameterList[["colsample_bytree"]]
currentGamma <- parameterList[["gamma"]]
currentEta =parameterList[["eta"]]
currentMaxDepth =parameterList[["max_depth"]]
set.seed(seed)
xgboostModelCV <- xgb.cv(data = train,
nrounds = ntrees,
nfold = 5,
objective = "binary:logistic",
eval_metric= "auc",
metrics = "auc",
verbose = 1,
print_every_n = 50,
early_stopping_rounds = 200,
stratified = T,
scale_pos_weight=sum(all_data[train,1]==0)/sum(all_data[train,1]==1),
max_depth = currentMaxDepth,
eta = currentEta,
gamma = currentGamma,
colsample_bytree = currentColsampleRate,
min_child_weight = 1,
subsample = currentSubsampleRate
seed = seed)
xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)
auc = xvalidationScores[xvalidationScores$iter==xgboostModelCV$best_iteration,c(1,4,5)]
auc = cbind(auc, currentSubsampleRate, currentColsampleRate, currentGamma, currentEta, currentMaxDepth)
names(auc) = c("iter", "test.auc.mean", "test.auc.std", "subsample", "colsample", "gamma", "eta", "max.depth")
print(auc)
return(auc)
})
return(aucErrorsHyperparameters)
}
One can add other parameters to the expand.grid call.
I usually train hyper-parameters on one CV repetition and evaluate them on additional repetitions with other seeds or on the validation set (but doing it on validation set should be used with caution to avoid over-fitting)
test
param <- list("objective" = "binary:logistic",
"eval_metric" = "error",
"nthread" = 2,
"max_depth" = 5,
"eta" = 0.3,
"gamma" = 0,
"subsample" = 0.8,
"colsample_bytree" = 0.8,
"min_child_weight" = 1,
"max_delta_step"= 5,
"learning_rate" =0.1,
"n_estimators" = 1000,
"seed"=27,
"scale_pos_weight" = 1
)
nfold=3
nrounds=200
pred.cv = matrix(bst.cv$pred, nrow=length(bst.cv$pred)/1, ncol=1)
pred.cv = max.col(pred.cv, "last")
factor(y+1) # this is the target in train, level 1 and 2
factor(pred.cv) # this is the issue, it is always only 1 level

Combining train + test data and running cross validation in R

I have the following R code that runs a simple xgboost model on a set of training and test data with the intention of predicting a binary outcome.
We start by
1) Reading in the relevant libraries.
library(xgboost)
library(readr)
library(caret)
2) Cleaning up the training and test data
train.raw = read.csv("train_data", header = TRUE, sep = ",")
drop = c('column')
train.df = train.raw[, !(names(train.raw) %in% drop)]
train.df[,'outcome'] = as.factor(train.df[,'outcome'])
test.raw = read.csv("test_data", header = TRUE, sep = ",")
drop = c('column')
test.df = test.raw[, !(names(test.raw) %in% drop)]
test.df[,'outcome'] = as.factor(test.df[,'outcome'])
train.c1 = subset(train.df , outcome == 1)
train.c0 = subset(train.df , outcome == 0)
3) Running XGBoost on the properly formatted data.
train_xgb = xgb.DMatrix(data.matrix(train.df [,1:124]), label = train.raw[, "outcome"])
test_xgb = xgb.DMatrix(data.matrix(test.df[,1:124]))
4) Running the model
model_xgb = xgboost(data = train_xgb, nrounds = 8, max_depth = 5, eta = .1, eval_metric = "logloss", objective = "binary:logistic", verbose = 5)
5) Making predicitions
pred_xgb <- predict(model_xgb, newdata = test_xgb)
My question is: How can I adjust this process so that I'm just pulling in / adjusting a single 'training' data set, and getting predictions on the hold-out sets of the cross-validated file?
To specify k-fold CV in the xgboost call one needs to call xgb.cv with nfold = some integer argument, to save the predictions for each resample use prediction = TRUE argument. For instance:
xgboostModelCV <- xgb.cv(data = dtrain,
nrounds = 1688,
nfold = 5,
objective = "binary:logistic",
eval_metric= "auc",
metrics = "auc",
verbose = 1,
print_every_n = 50,
stratified = T,
scale_pos_weight = 2
max_depth = 6,
eta = 0.01,
gamma=0,
colsample_bytree = 1 ,
min_child_weight = 1,
subsample= 0.5 ,
prediction = T)
xgboostModelCV$pred #contains predictions in the same order as in dtrain.
xgboostModelCV$folds #contains k-fold samples
Here's a decent function to pick hyperparams
function(train, seed){
require(xgboost)
ntrees=2000
searchGridSubCol <- expand.grid(subsample = c(0.5, 0.75, 1),
colsample_bytree = c(0.6, 0.8, 1),
gamma=c(0, 1, 2),
eta=c(0.01, 0.03),
max_depth=c(4,6,8,10))
aucErrorsHyperparameters <- apply(searchGridSubCol, 1, function(parameterList){
#Extract Parameters to test
currentSubsampleRate <- parameterList[["subsample"]]
currentColsampleRate <- parameterList[["colsample_bytree"]]
currentGamma <- parameterList[["gamma"]]
currentEta =parameterList[["eta"]]
currentMaxDepth =parameterList[["max_depth"]]
set.seed(seed)
xgboostModelCV <- xgb.cv(data = train,
nrounds = ntrees,
nfold = 5,
objective = "binary:logistic",
eval_metric= "auc",
metrics = "auc",
verbose = 1,
print_every_n = 50,
early_stopping_rounds = 200,
stratified = T,
scale_pos_weight=sum(all_data_nobad[index_no_bad,1]==0)/sum(all_data_nobad[index_no_bad,1]==1),
max_depth = currentMaxDepth,
eta = currentEta,
gamma=currentGamma,
colsample_bytree = currentColsampleRate,
min_child_weight = 1,
subsample= currentSubsampleRate)
xvalidationScores <- as.data.frame(xgboostModelCV$evaluation_log)
#Save rmse of the last iteration
auc=xvalidationScores[xvalidationScores$iter==xgboostModelCV$best_iteration,c(1,4,5)]
auc=cbind(auc, currentSubsampleRate, currentColsampleRate, currentGamma, currentEta, currentMaxDepth)
names(auc)=c("iter", "test.auc.mean", "test.auc.std", "subsample", "colsample", "gamma", "eta", "max.depth")
print(auc)
return(auc)
})
return(aucErrorsHyperparameters)
}
You can change the grid values and the params in the grid, as well as loss/evaluation metric. It is similar as provided by caret grid search, but caret does not provide the possibility to define alpha, lambda, colsample_bylevel, num_parallel_tree... hyper parameters in the grid search apart defining a custom function which I found cumbersome. Caret has the advantage of automatic preprocessing, automatic up/down sampling within CV etc.
setting the seed outside the xgb.cv call will pick the same folds for CV but not the same trees at each round so you will end up with a different model. Even if you set the seed inside the xgb.cv function call there is no guarantee you will end up with the same model but there's a much higher chance (depends on threads, type of model.. - I for one like the uncertainty and found it to have little impact on the result).
You can use xgb.cv and set prediction = TRUE.

Resources