Error while running varImp(gbm_model) - r

I'm running a gbm model for a classification problem. Below is my code and output:
library(gbm)
library(caret)
set.seed(123)
train <- read.csv("train.csv")
gbm_model <- gbm(DV ~ .,
                 data = train,
                 distribution = "bernoulli",
                 n.trees = 9,
                 interaction.depth = 9,
                 n.minobsinnode = 1,
                 shrinkage = 0.2,
                 bag.fraction = 0.9)
Output of print(gbm_model):
gbm(formula = DV ~ ., distribution = "bernoulli",
    data = train, n.trees = 9, interaction.depth = 9, n.minobsinnode = 1,
    shrinkage = 0.2, bag.fraction = 0.9)
A gradient boosted model with bernoulli loss function.
9 iterations were performed.
There were 100 predictors of which 67 had non-zero influence.
When I try to print the top variables, it throws an error:
varImp(gbm_model)
Error in 1:n.trees : argument of length 0
Any suggestions on how to rectify this error?

I got the error rectified after researching the caret package a bit more. First I needed to train the model through caret's train(), and then use varImp() on the result.
gbm1 <- train(as.factor(DV) ~ ., data = train, method = "gbm",
              distribution = "bernoulli",
              trControl = trainControl(number = 200),
              tuneGrid = expand.grid(interaction.depth = 9,
                                     n.trees = 9,
                                     shrinkage = 0.1,
                                     n.minobsinnode = 1),  # recent caret versions expect n.minobsinnode inside the grid
              bag.fraction = 0.9)
Then run
plot(varImp(gbm1), top = 20)
to get the top 20 variables.
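As an aside, the gbm package can report importance directly, without caret. A minimal sketch using summary.gbm on the model fitted above (this gives gbm's relative influence rather than caret's varImp scaling):

# Relative influence straight from the gbm object; plotit = FALSE
# returns the table without drawing the bar chart.
infl <- summary(gbm_model, n.trees = 9, plotit = FALSE)
head(infl, 20)  # top 20 variables by relative influence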

Related

Why does my code take so long to process?

I am trying to run code from this web site on my computer.
I use a data set from a Kaggle competition.
My training data has 1022 rows and 81 variables.
I run this code:
hyper_grid <- expand.grid(
  shrinkage = c(.01, .1, .3),
  interaction.depth = c(1, 3, 5),
  n.minobsinnode = c(5, 10, 15),
  bag.fraction = c(.65, .8, 1),
  optimal_trees = 0,  # a place to dump results
  min_RMSE = 0        # a place to dump results
)
random_index <- sample(1:nrow(ames_train), nrow(ames_train))
random_ames_train <- ames_train[random_index, ]
# grid search
for(i in 1:nrow(hyper_grid)) {
  # reproducibility
  set.seed(123)
  # train model
  gbm.tune <- gbm(
    formula = SalePrice ~ .,
    distribution = "gaussian",
    data = random_ames_train,
    n.trees = 5000,
    interaction.depth = hyper_grid$interaction.depth[i],
    shrinkage = hyper_grid$shrinkage[i],
    n.minobsinnode = hyper_grid$n.minobsinnode[i],
    bag.fraction = hyper_grid$bag.fraction[i],
    train.fraction = .75,
    n.cores = NULL,  # will use all cores by default
    verbose = FALSE
  )
  # dump the results: the iteration with the lowest validation error
  # and the corresponding RMSE
  hyper_grid$optimal_trees[i] <- which.min(gbm.tune$valid.error)
  hyper_grid$min_RMSE[i] <- sqrt(min(gbm.tune$valid.error))
}
I've been waiting for more than an hour.
I think it's because my laptop is not powerful.
The picture shows my computer's specifications.
Please answer: can my computer perform this operation?
If yes, how long should I wait?
It's taking a long time because you're training 81 GBM models (the grid has 3 × 3 × 3 × 3 = 81 hyperparameter combinations), each with up to 5,000 trees, and GBMs are complex. To get a rough estimate of total training time, you could train one model, time it, and multiply that time by 81.
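A sketch of that estimate, assuming the hyper_grid and random_ames_train objects from the question; the fit below uses one representative mid-grid configuration, and actual times vary by configuration since more and deeper trees take longer:

# Time one fit, then extrapolate to the full grid.
one_fit <- system.time(
  gbm(SalePrice ~ ., distribution = "gaussian", data = random_ames_train,
      n.trees = 5000, interaction.depth = 3, shrinkage = 0.1,
      n.minobsinnode = 10, bag.fraction = 0.8, train.fraction = 0.75,
      verbose = FALSE)
)
# rough total for all 81 grid rows, in minutes
one_fit["elapsed"] * nrow(hyper_grid) / 60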

Some questions about rpart & gbm in R

I am trying to model claim frequency with the rpart and gbm packages, and I have a few questions regarding them.
1. In the rpart model, what is the purpose/function of the shrink parameter?
2. In the gbm model, am I using the weights correctly? I get output (no errors), but I just want to be sure I have understood it correctly.
3. In the gbm model, I know that the parameter n.minobsinnode lets me require at least 10 observations in each node. But is there a way to require that each node contains at least 1 claim? I don't want a model that predicts a claim frequency of 0 for some observations.
4. In RandomForest, d variables are randomly picked from the n variables for each split. But in the gbm model, are all n variables considered for each split?
5. In tree-based models, is it possible to offset one variable (e.g. deductible)?
# Regression tree
Model_tree <- rpart(cbind(duration, nclaims) ~ Var_1 + … + Var_n,
                    data = data,
                    method = "poisson",
                    parms = list(shrink = 1),
                    control = rpart.control(minbucket = 10, cp = 0.00005, maxdepth = 5))
# Gradient Boosting Model
Model_gbm <- gbm(nclaims ~ Var_1 + … + Var_n,
                 data = data,
                 weights = duration,
                 distribution = "poisson",
                 cv.folds = 0,
                 shrinkage = 0.01,
                 interaction.depth = 5,
                 n.trees = 5000,
                 n.minobsinnode = 10,
                 bag.fraction = 1,
                 train.fraction = 1)
# Predict with a gbm
predict.gbm(object = Model_gbm, n.trees = 1000, newdata = testdata, type = "response")
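On question 5, for what it's worth: the gbm formula interface accepts offset() terms (they are passed through to gbm.fit), so exposure can enter as an offset rather than a weight. A sketch under that assumption, using log(duration) as the usual offset for a poisson rate model; Var_1 and Var_2 are placeholders:

# Hypothetical offset formulation: model the claim rate with exposure
# as an offset instead of a case weight.
Model_gbm_offset <- gbm(nclaims ~ Var_1 + Var_2 + offset(log(duration)),
                        data = data,
                        distribution = "poisson",
                        shrinkage = 0.01,
                        interaction.depth = 5,
                        n.trees = 5000,
                        n.minobsinnode = 10)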

Issues with neural net number of rounds with caret ensemble

I'm creating a simple ensemble of an xgboost and an mxnet model. The data frame is A4n.df, with the classification variable at A4n.df[,1]. Both models run fine on their own and reach believable accuracy. All data is normalized to 0-1, shuffled, and the class variable is converted to a factor (for caret). I have already run a grid search for the best hyperparameters, but I need to include a grid for caretEnsemble.
# training grid for xgboost
xgb_grid_A4 = expand.grid(
  nrounds = 1200,
  eta = 0.01,
  max_depth = 20,
  gamma = 1,
  colsample_bytree = 0.6,
  min_child_weight = 2,
  subsample = 0.8)
# training grid for mxnet
mxnet_grid_A4 = expand.grid(
  layer1 = 12,
  layer2 = 2,
  layer3 = 0,
  learningrate = 0.001,
  dropout = 0,
  beta1 = 0.9,
  beta2 = 0.999,
  activation = 'relu')
yE <- A4n.df[, 1]
xE <- data.matrix(A4n.df[, -1])
yEf <- ifelse(yE == 0, "no", "yes")
yEf <- factor(yEf)
# yEf must be defined before it is used in createResample() below
Ensemble_control_A4 <- trainControl(
  method = "cv",
  number = 5,
  verboseIter = TRUE,
  returnData = TRUE,
  returnResamp = "all",
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  allowParallel = TRUE,
  sampling = "up",
  index = createResample(yEf, 20))
Ensemble_list_A4 <- caretList(
  x = xE,
  y = yEf,
  trControl = Ensemble_control_A4,
  metric = "ROC",
  methodList = c("glm", "rpart"),
  tuneList = list(
    xgbA4 = caretModelSpec(method = "xgbTree", tuneGrid = xgb_grid_A4),
    mxA4 = caretModelSpec(method = "mxnetAdam", tuneGrid = mxnet_grid_A4)))
XGBoost seems to train fine:
+ Resample01: eta=0.01, max_depth=20, gamma=1, colsample_bytree=0.6, min_child_weight=2, subsample=0.8, nrounds=1200
....
+ Resample20: eta=0.01, max_depth=20, gamma=1, colsample_bytree=0.6, min_child_weight=2, subsample=0.8, nrounds=1200
- Resample20: eta=0.01, max_depth=20, gamma=1, colsample_bytree=0.6, min_child_weight=2, subsample=0.8, nrounds=1200
Aggregating results
Selecting tuning parameters
Fitting nrounds = 1200, max_depth = 20, eta = 0.01, gamma = 1, colsample_bytree = 0.6, min_child_weight = 2, subsample = 0.8 on full training set
However, mxnet seems to run for only 10 rounds, when one or two thousand would make more sense, and some parameters seem to be missing:
+ Resample01: layer1=12, layer2=2, layer3=0, learningrate=0.001, dropout=0, beta1=0.9, beta2=0.999, activation=relu
Start training with 1 devices
[1] Train-accuracy=0.487651209677419
[2] Train-accuracy=0.624751984126984
[3] Train-accuracy=0.599082341269841
[4] Train-accuracy=0.651909722222222
[5] Train-accuracy=0.662202380952381
[6] Train-accuracy=0.671006944444444
[7] Train-accuracy=0.676463293650794
[8] Train-accuracy=0.683407738095238
[9] Train-accuracy=0.691964285714286
[10] Train-accuracy=0.698660714285714
- Resample01: layer1=12, layer2=2, layer3=0, learningrate=0.001, dropout=0, beta1=0.9, beta2=0.999, activation=relu
+ Resample01: parameter=none
- Resample01: parameter=none
+ Resample02: parameter=none
Aggregating results
Selecting tuning parameters
Fitting cp = 0.0243 on full training set
There were 40 warnings (use warnings() to see them)
Warnings (1-40):
1: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == ... :
prediction from a rank-deficient fit may be misleading
I expect mxnet to train for thousands of rounds, and the training accuracy to end up like the pre-ensemble model's, 60-70%.
*On second thought, some of the 20 mxnet runs do reach 60-70%, but it seems inconsistent. Perhaps it is functioning normally?
There's a note in the caret documentation that num.round needs to be set by the user outside the tuning grid: http://topepo.github.io/caret/train-models-by-tag.html
Ensemble_list_A2 <- caretList(
  x = xE,
  y = yEf,
  trControl = Ensemble_control_A2,
  metric = "ROC",
  methodList = c("glm", "rpart", "bayesglm"),
  tuneList = list(
    xgbA2 = caretModelSpec(method = "xgbTree", tuneGrid = xgb_grid_A2),
    mxA2 = caretModelSpec(method = "mxnetAdam", tuneGrid = mxnet_grid_A2,
                          num.round = 1500, ctx = mx.gpu()),
    svmA2 = caretModelSpec(method = "svmLinear2", tuneGrid = svm_grid_A2),
    rfA2 = caretModelSpec(method = "rf", tuneGrid = rf_grid_A2)))

Sample error in xgboost (RStudio)

I'm training a model with caret and the xgboost algorithm, and training stops with an error.
Grid setup:
expand.grid(nrounds = c(12, 15, 17, 20, 22, 24, 26, 28),
            max_depth = c(3, 4, 5, 6, 7, 8, 9, 10),
            eta = c(.001, .05, .06, .07, .08, .1, .2, .3, .4),
            gamma = c(0, .1, .2, .3, .4, .5, .6, .7),
            colsample_bytree = c(.5, .6, .7, .8, .9, 1),
            min_child_weight = c(1, 2, 3),
            subsample = c(.6, .7, .8, .9, 1))
Error in sample.int(n = 1000000L, size = num_rs * nrow(trainInfo$loop) + :
  cannot take a sample larger than the population when 'replace = FALSE'
My data set has 2500 rows and 50 parameters. How can I fix this error and train the model?
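For context: that grid has 8 × 8 × 9 × 8 × 6 × 3 × 5 = 414,720 combinations, and the error message shows caret trying to pre-draw one RNG seed per resample per grid row from 1:1000000 without replacement, which needs more seeds than the population allows. One likely workaround is a random search over the same space instead of the exhaustive grid; a sketch, where ctrl, x_train, and y_train are hypothetical names, not from the question:

# Let caret sample a capped number of random hyperparameter
# combinations rather than enumerating the full grid.
ctrl <- trainControl(method = "cv", number = 5, search = "random")
fit <- train(x = x_train, y = y_train,
             method = "xgbTree",
             trControl = ctrl,
             tuneLength = 50)  # try 50 random combinations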

r caret train and predict after tuning

I'm currently working on a research paper.
My dataset consists of daily stock index data with a predictor that says UP or DOWN.
In total I use 20% of my dataset to tune the parameters. Based on the code below, I found that I should use sigma = 0.05 and C = 25.
Now I want to train on 50% of my dataset with these settings and make a prediction on the other 50% of the set.
Could anyone explain how I should do this? I can't find how to proceed in caret's documentation.
### MODELS
## Set up training
trainControl <- trainControl(method = "cv", index = fit_on, indexOut = pred_on, savePredictions = TRUE)
metric <- "Accuracy"
## Parameter tuning
# SVM
set.seed(1908)
grid_rf <- expand.grid(C = c(0.1, 0.25, 0.5, 1, 5, 10, 25, 50),
                       sigma = c(0.025, 0.05, 0.1, 0.25, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 7.5, 10, 15))
fit.svm <- train(class ~ ., data = dataCombined, method = "svmRadial", metric = metric,
                 preProc = c("range"), trControl = trainControl, tuneGrid = grid_rf)
Many thanks!
Best,
Mathijs
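One way to do the second step (a sketch, not from the original thread; dataCombined and its class column are taken from the question): fix the tuned values in a one-row tuneGrid with trainControl(method = "none"), fit on a 50% split, and predict on the other half.

set.seed(1908)
# 50/50 split, stratified on the outcome
idx <- createDataPartition(dataCombined$class, p = 0.5, list = FALSE)
train_set <- dataCombined[idx, ]
test_set  <- dataCombined[-idx, ]

# refit with the tuned hyperparameters only, no resampling
fit.final <- train(class ~ ., data = train_set, method = "svmRadial",
                   metric = "Accuracy", preProc = c("range"),
                   trControl = trainControl(method = "none"),
                   tuneGrid = data.frame(sigma = 0.05, C = 25))

preds <- predict(fit.final, newdata = test_set)
confusionMatrix(preds, test_set$class)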
