Caret and GBM Errors - r

I am attempting to use the caret package in R for several nested cross-validation processes with user-defined performance metrics. I have run into all kinds of problems, so I pulled back to see if there were issues with a more out-of-the-box use of caret, and it seems I have run into one.
If I run the following:
install.packages("caret")
install.packages("gbm")
library(caret)
library(gbm)
data(GermanCredit)
GermanCredit$Class<-ifelse(GermanCredit$Class=='Bad',1,0)
gbmGrid <- expand.grid(.interaction.depth = 1,
.n.trees = 150,
.shrinkage = 0.1)
gbmMOD <- train(Class~., data=GermanCredit
,method = "gbm",
tuneGrid= gbmGrid,
distribution="bernoulli",
bag.fraction = 0.5,
train.fraction = 0.5,
n.minobsinnode = 10,
cv.folds = 1,
keep.data=TRUE,
verbose=TRUE
)
I get the error (or similar):
Error in { :
task 1 failed - "arguments imply differing number of rows: 619, 381"
with warnings:
1: In eval(expr, envir, enclos) :
model fit failed for Resample01: interaction.depth=1, n.trees=150, shrinkage=0.1
But, if I run just the gbm routine everything finishes fine.
gbm1 <- gbm(Class ~ ., data = GermanCredit,
            distribution = "bernoulli",
            n.trees = 150,          # number of trees
            shrinkage = 0.10,
            interaction.depth = 1,
            bag.fraction = 0.5,
            train.fraction = 0.5,
            n.minobsinnode = 10,
            cv.folds = 1,
            keep.data = TRUE,
            verbose = TRUE)

There were two issues: passing cv.folds to train caused a problem. Also, you don't need to convert the outcome to a binary number; doing so makes train think it is a regression problem. The idea behind the train function is to smooth out the inconsistencies between the modeling functions, so we use factors for classification and numbers for regression.
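For illustration, here is a minimal sketch of a corrected call along those lines, written for a current caret (on the older version used in the question the grid columns took a leading dot and n.minobsinnode was passed straight to gbm): Class is left as a factor and cv.folds is not passed, since train does the resampling itself.

library(caret)
library(gbm)
data(GermanCredit)

# Class stays a factor ("Bad"/"Good"), so train treats this as classification
gbmGrid <- expand.grid(interaction.depth = 1,
                       n.trees = 150,
                       shrinkage = 0.1,
                       n.minobsinnode = 10)

gbmMOD <- train(Class ~ ., data = GermanCredit,
                method = "gbm",
                tuneGrid = gbmGrid,
                bag.fraction = 0.5,
                verbose = FALSE)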

Just a note: although this particular issue was caused by the reason described in the answer, the error message below can also occur with older versions of caret and gbm. I encountered this error and, after spending a lot of time trying to figure out what the issue was, it turned out that I had to upgrade to the most recent versions of caret (5.17-7) and gbm (2.1-0.1). These are the most recent versions on CRAN as of this writing.
Error in { :
task 1 failed - "arguments imply differing number of rows: ...

Related

partial dependency plot using pdp package for classification xgboost

I use the pdp package to compute partial dependence for a linear regression fitted with the xgboost package, and it works perfectly without any warning. But when I switch to a classification (logistic) label for xgboost, I get a warning saying that the partial dependence is being computed as if the model were a regression (shown below). May I ask whether the code has to be revised somehow to feed the classification object from xgboost correctly so that the partial dependence is right, or whether I can ignore the warning because it is already correct? I know randomForest is straightforward and gives no warning messages.
# Load required packages
library(pdp)
library(xgboost)

# Simulate training data with ten million records
set.seed(101)
trn <- as.data.frame(mlbench::mlbench.friedman1(n = 1e+07, sd = 1))
trn <- trn[sample(nrow(trn), 500), ]
trn$y <- ifelse(trn$y > 16, 1, 0)

# Fit an XGBoost classification (logistic) model
set.seed(102)
bst <- xgboost(data = data.matrix(subset(trn, select = -y)),
               label = trn$y,
               objective = "reg:logistic",
               nrounds = 100,
               max_depth = 2,
               eta = 0.1)

# Partial dependence plot
pd <- partial(bst$handle,
              pred.var = c("x.1"),
              grid.resolution = 10,
              train = data.matrix(subset(trn, select = -y)),
              prob = TRUE,
              plot = FALSE,
              .progress = "text")
Warning message:
In superType.default(object) :
`type` could not be determined; assuming `type = "regression"`
In this case, you can safely ignore the warning; however, it did lead me to a small bug in the pdp package for which I will push a fix shortly. Thanks for reporting!
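Purely as an illustration (this sketch is not part of the original answer), one way to make the intent unambiguous is to pass the full booster object rather than its $handle slot and supply the prediction function yourself via pred.fun:

# Hedged sketch: pred_prob returns the average predicted probability over
# the training data for each grid value of the predictor
pred_prob <- function(object, newdata) {
  mean(predict(object, newdata))
}

pd <- partial(bst,
              pred.var = "x.1",
              grid.resolution = 10,
              pred.fun = pred_prob,
              train = data.matrix(subset(trn, select = -y)),
              plot = FALSE,
              .progress = "text")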

xgboost with tree_method = 'hist' in R

According to a benchmarking of GBM vs. xgboost vs. LightGBM (https://www.kaggle.com/nschneider/gbm-vs-xgboost-vs-lightgbm) it is possible to implement xgboost with the argument
tree_method = 'hist'
in R.
However, doing so always gives me an error:
Error in xgb.iter.update(bst$handle, dtrain, iteration - 1, obj) :
Invalid Input: 'hist', valid values are: {'approx', 'auto', 'exact'}
What am I missing?
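Not part of the original question, but for illustration, here is a minimal sketch of such a call on a build of xgboost recent enough to include the histogram method (the error above lists only approx/auto/exact, which is what older builds expose):

library(xgboost)

# Small built-in example data set shipped with xgboost
data(agaricus.train, package = "xgboost")

bst <- xgboost(data = agaricus.train$data,
               label = agaricus.train$label,
               objective = "binary:logistic",
               nrounds = 10,
               max_depth = 2,
               eta = 0.1,
               tree_method = "hist")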

R - Decreasing memory usage of using caret to train a random forest

I am trying to create a random forest given ~100 thousand inputs. To accomplish this, I am using train from the caret package with method = "parRF". Unfortunately, my machine, with 128 GB of memory, still runs out. Therefore, I need to cut down on how much memory I use.
Right now, the training method I am running is:
> trControl <- trainControl(method = "LGOCV", p = 0.9, savePredictions = T)
> model_parrf <- train(x = data_preds, y = data_resp, method = "parRF",
                       trControl = trControl)
However, because each forest is kept, the system quickly runs out of memory. If my understanding of train and randomForest is correct, each random forest that is built stores about 500 * 100,000 doubles at the very least. Therefore, I would like to throw away the random forests I no longer need. I tried passing keep.forest = FALSE to randomForest using
> model_parrf <- train(x = data_preds, y = data_resp, method = "parRF",
                       trControl = trControl, keep.forest = FALSE)
Error in train.default(x = data_preds, y = data_resp, method = "parRF", :
  final tuning parameters could not be determined
In addition, this warning was thrown repeatedly:
In eval(expr, envir, enclos) :
predictions failed for Resample01: mtry=2 Error in predict.randomForest(modelFit, newdata) :
No forest component in the object
It seems that for some reason, caret requires the forests to be kept in order to compare models. Is there any way I can use caret with less memory?
Keep in mind that, if you use M cores, you need to store the data set M+1 times. Try using fewer workers.
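As an illustration (a sketch, not from the original answer), registering a smaller cluster before calling train keeps fewer copies of the data in memory; data_preds and data_resp are the objects from the question.

library(caret)
library(doParallel)

# Two workers instead of all available cores: roughly three copies of the
# data set (workers + master) instead of M+1 for a large M
cl <- makeCluster(2)
registerDoParallel(cl)

trControl <- trainControl(method = "LGOCV", p = 0.9, savePredictions = TRUE)
model_parrf <- train(x = data_preds, y = data_resp, method = "parRF",
                     trControl = trControl)

stopCluster(cl)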

Fatal error with train() in caret on Windows 7, R 3.0.2, caret 6.0-21

I am trying to use train() in caret to fit a classification model, but I'm hitting some kind of unhandled exception and my R session crashes before outputting any error information in the R console.
Windows error:
R for Windows terminal front-end has stopped working
I am running Windows 7, R 3.0.2, caret 6.0-21, and have tried running this on both 32/64 versions of R, in R Studio and also directly in the R console, and am getting the same results each time.
Here is my call to train:
library("AppliedPredictiveModeling")
library("caret")
data("AlzheimerDisease")
data <- data.frame(predictors, diagnosis)
tuneGrid <- expand.grid(interaction.depth = 1:2, n.trees = 100, shrinkage = 0.1)
trainControl <- trainControl(method = "cv", number = 5, verboseIter = TRUE)
gbmFit <- train(diagnosis ~ ., data = data, method = "gbm", trControl = trainControl, tuneGrid = tuneGrid)
There are no more errors using this parameter grid instead:
tuneGrid <- expand.grid(interaction.depth = 1, n.trees = 100:101, shrinkage = 0.1)
However, I am still getting all NaNs in the ValidDeviance column. Is this normal?
Note: My original problem is resolved, and this is a continuation from the comments section. Formatting blocks of code in the comments section is unreadable so I'm posting it up here. This is no longer a question regarding caret, but gbm instead.
I am still having issues, however, with direct calls to gbm using a single predictor with cv.folds specified. Here is the code:
library("AppliedPredictiveModeling")
library("caret")
data("AlzheimerDisease")
diagnosis <- as.numeric(diagnosis)
diagnosis[diagnosis == 1] <- 0
diagnosis[diagnosis == 2] <- 1
data <- data.frame(diagnosis, predictors[, 1])
gbmFit <- gbm(diagnosis ~ ., data = data, cv.folds = 5)
Again, this works without specifying cv.folds but with it, returns an error:
Error in checkForRemoteErrors(val) : 5 nodes produced errors; first error: incorrect number of dimensions
It is a bug that occurs when method = 'gbm' is used with a single model (i.e. nrow(tuneGrid) == 1). I'm about to release a new version, so I will fix this in that version.
One side note... it looks like you want to do classification. In that case, y should be a factor (and you shouldn't use only integers as the classes); otherwise it will be doing regression. These changes will work for now:
y <- factor(paste("Class", y, sep = ""))
and
tuneGrid <- expand.grid(interaction.depth = 1,
                        n.trees = 100:101,
                        shrinkage = 0.1)
Thanks,
Max
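For completeness, a hedged sketch of how the two suggested changes might fit together on the caret version discussed in this thread (on newer caret versions the grid also needs an n.minobsinnode column):

library("AppliedPredictiveModeling")
library("caret")
data("AlzheimerDisease")

# Recode the two-level outcome into a factor with valid R level names
y <- factor(paste("Class", as.numeric(diagnosis) - 1, sep = ""))

tuneGrid <- expand.grid(interaction.depth = 1,
                        n.trees = 100:101,   # temporary workaround for the single-model bug
                        shrinkage = 0.1)

gbmFit <- train(x = predictors, y = y, method = "gbm",
                trControl = trainControl(method = "cv", number = 5),
                tuneGrid = tuneGrid,
                verbose = FALSE)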

using caret package to find optimal parameters of GBM

I'm using the R GBM package for boosted regression on some biological data of dimensions 10,000 x 932, and I want to know the best parameter settings for the GBM package, especially n.trees, shrinkage, interaction.depth and n.minobsinnode. Searching online, I found that the caret package in R can find such parameter settings. However, I have difficulty using caret with the GBM package, so I just want to know how to use caret to find the optimal combinations of the parameters mentioned above. I know this might seem like a very typical question, but I have read the caret manual and still have difficulty integrating caret with gbm, especially since I'm very new to both of these packages.
Not sure if you found what you were looking for, but I find some of these sheets less than helpful.
If you are using the caret package, the following describes the required parameters: > getModelInfo()$gbm$parameters
Here are some rules of thumb for running GBM:
interaction.depth: the default is 1, and on most data sets that seems adequate, but on a few I have found that testing results against odd multiples up to the max has given better results. The max value I have seen used for this parameter is floor(sqrt(NCOL(training))).
shrinkage: the smaller the number, the better the predictive value, the more trees required, and the higher the computational cost. Testing values on a small subset of data with something like shrinkage = seq(.0005, .05, .0005) can be helpful in finding the ideal value.
n.minobsinnode: the default is 10, and generally I don't mess with it. I have tried c(5, 10, 15, 20) on small sets of data and didn't really see an adequate return for the computational cost.
n.trees: the smaller the shrinkage, the more trees you should have. Start with n.trees = (0:50)*50 and adjust accordingly.
Example setup using the caret package:
getModelInfo()$gbm$parameters
library(parallel)
library(doMC)
registerDoMC(cores = 20)
# Max shrinkage for gbm
nl = nrow(training)
max(0.01, 0.1*min(1, nl/10000))
# Max Value for interaction.depth
floor(sqrt(NCOL(training)))
gbmGrid <- expand.grid(interaction.depth = c(1, 3, 6, 9, 10),
n.trees = (0:50)*50,
shrinkage = seq(.0005, .05,.0005),
n.minobsinnode = 10) # you can also put something like c(5, 10, 15, 20)
fitControl <- trainControl(method = "repeatedcv",
repeats = 5,
preProcOptions = list(thresh = 0.95),
## Estimate class probabilities
classProbs = TRUE,
## Evaluate performance using
## the following function
summaryFunction = twoClassSummary)
# Method + Date + distribution
set.seed(1)
system.time(GBM0604ada <- train(Outcome ~ ., data = training,
distribution = "adaboost",
method = "gbm", bag.fraction = 0.5,
nTrain = round(nrow(training) *.75),
trControl = fitControl,
verbose = TRUE,
tuneGrid = gbmGrid,
## Specify which metric to optimize
metric = "ROC"))
Things can change depending on your data (like the distribution), but I have found the key is to play with gbmGrid until you get the outcome you are looking for. The settings as they are now would take a long time to run, so modify them as your machine and time allow.
To give you a ballpark of the computation involved, I run on a 12-core Mac Pro with 64 GB of RAM.
This link has a concrete example (page 10) -
http://www.jstatsoft.org/v28/i05/paper
Basically, one should first create a grid of candidate values for the hyperparameters (like n.trees, interaction.depth and shrinkage), then call the generic train function as usual.
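A compact, hedged sketch of that two-step recipe (candidate grid, then train), assuming a data frame training with a factor outcome column Outcome:

library(caret)

# Step 1: grid of candidate hyperparameter values
gbmGrid <- expand.grid(n.trees = c(100, 500, 1000),
                       interaction.depth = c(1, 3, 5),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)

# Step 2: let train() evaluate every combination by cross-validation
gbmFit <- train(Outcome ~ ., data = training,
                method = "gbm",
                trControl = trainControl(method = "cv", number = 5),
                tuneGrid = gbmGrid,
                verbose = FALSE)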
