debugging caret with SMOTE in R - r

I'm trying to use SMOTE in R within the trainControl function in caret. Following the author's example I do as follows:
#first, create an imbalanced data set
set.seed(2969)
imbal_train <- twoClassSim(10000, intercept = -20, linearVars = 20)
imbal_test <- twoClassSim(10000, intercept = -20, linearVars = 20)
table(imbal_train$Class)
Class1 Class2
9411 589
I want to use the SMOTE algorithm to oversample my minority class. However, this has to be done carefully. For instance, we shouldn't oversample before doing cross validation. This would lead us to optimistic generalization error.
#create my folds (5 in this case)
folds <- createFolds(factor(imbal_train$Class), k = 5, list = TRUE,returnTrain=TRUE)
#trainControl to set up my training phase.
ctrl <- trainControl(method = "cv", index = folds,
classProbs = TRUE,
summaryFunction = twoClassSummary,
savePredictions = "all",
## new option here:
sampling = "smote")
#train the model
set.seed(5627)
smote_inside <- train(Class ~ ., data = imbal_train,
method = "treebag",
nbagg = 50,
metric = "ROC",
trControl = ctrl)
It runs without error. I now want to see the training and testing set used in each iteration. I need to make sure that before oversampling the training folders, one folder was hold out and no new synthetic records were created inside of it.
Looking into the objects output by train, I could see that smote_inside$control may have some information. Concretely, it has the index and index_out: these are the row indexes for the training and testing in each cv iteration. However, when I do :
lista=smote_inside$control
dd=imbal_train[lista$index$Fold1,] #training data first cv iteration
table(dd$Class)
Class1 Class2
7529 471
You can see that it is still imbalanced. SMOTE is supposed to create some synthetic records from the minority class. Maybe this information is saved in another place?
Questions:
How can I see the new training records that were created using smote to balance the data?
How can I be sure that the testing folder wasn't contaminated with the oversampling?
Where can I find what caret is doing with SMOTE? pointers to a source code.

Some answers:
It does not retain that information
It is designed not to contaminate the holdout data. If you want proof (beyond what is shown in the link that you reference), look at createModel to see how it does the sampling and predictionFunction for how the data are handled prior to prediction.
The package sources are available basically everywhere. The two functions above (along with probFunction) to the work.

Related

What's the difference between lgb.train() and lightgbm() in r?

I'm trying to build a regression model with R using lightGBM,
and i'm getting a bit confused with some functions and when/how to use them.
First one is what i've written in the title, what's the difference between lgb.train() and lightgbm()?
The description in the documentation(https://cran.r-project.org/web/packages/lightgbm/lightgbm.pdf) says that lgb.train is 'Logic to train with LightGBM' and lightgbm is 'Simple interface for training a LightGBM model', while both their outcome value is lgb.Booster, a trained model.
One difference I've found is that lgb.train() does not work with valids = , while lightgbm() does.
Second one is about a function lgb.cv(), regarding a cross validation in lightGBM. How do you apply the output of lgb.cv() to a model?
As I understood from the documentation i've linked above, it seems like the output of both lgb.cv and lgb.train is a model.
Is it correct to use it like the example below?
lgbcv <- lgb.cv(params,
lgbtrain,
nrounds = 1000,
nfold = 5,
early_stopping_rounds = 100,
learning_rate = 1.0)
lgbcv <- lightgbm(params,
lgbtrain,
nrounds = 1000,
early_stopping_rounds = 100,
learning_rate = 1.0)
Thank you in advance!
what's the difference between lgb.train() and lightgbm()?
These functions both train a LightGBM model, they're just slightly different interfaces. The biggest difference is in how training data are prepared. LightGBM training requires a special LightGBM-specific representation of the training data, called a Dataset. To use lgb.train(), you have to construct one of these beforehand with lgb.Dataset(). lightgbm(), on the other hand, can accept a data frame, data.table, or matrix and will create the Dataset object for you.
Choose whichever method you feel has a more friendly interface...both will produce a single trained LightGBM model (class "lgb.Booster").
that lgb.train() does not work with valids = , while lightgbm() does.
This is not correct. Both functions accept the keyword argument valids. Run ?lgb.train and ?lightgbm for documentation on those methods.
How do you apply the output of lgb.cv() to a model?
I'm not sure what you mean, but you can find an example of how to use lgb.cv() in the docs that show up when you run ?lgb.cv.
library(lightgbm)
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
params <- list(objective = "regression", metric = "l2")
model <- lgb.cv(
params = params
, data = dtrain
, nrounds = 5L
, nfold = 3L
, min_data = 1L
, learning_rate = 1.0
)
This returns an object of class "lgb.CVBooster". That object has multiple "lgb.Booster" objects in it (the trained models that lightgbm() or lgb.train() produce).
You can extract any one of these from model$boosters. However, in practice I don't recommend using the models from lgb.cv() directly. The goal of cross-validation is to get an estimate of the generalization error for a model. So you can use lgb.cv() to figure out the expected error for a given dataset + set of parameters (by looking at model$record_evals and model$best_score).

How to read and get the C5.0 Model out of R

I've searched and search but can't find the answer. I have a c5_model trained and ready but I needed to do 100 trails to get it working to the level I want it to. But I'm stuck on trying to get it out of the model in R. I have done a summary but how do I get the decision tree out. Which trial do I want to use?
Update:
I'm building the model by doing the following
control <- trainControl(method = "repeatedcv",
number = 5,
repeats = 3,
classProbs = TRUE,
summaryFunction = twoClassSummary)
grid <- expand.grid( .winnow = c(FALSE),
.trials=100,
.model="tree" )
c5_model <- train(HasFraud ~ .,data = train, method = "C5.0",trControl = control,metric = "ROC",tuneGrid = grid,verbose = FALSE)
Is this the wrong method to train the model?
An object of class C5.0 has a number of elements, as described in the help file you can pull up with ?C50::C5.0.default. One of those elements is tree. If you've assigned the output of a call to C5.0() to a value, say model, you can extract any of its elements using the $ operator. For example:
model <- C5.0(<the call you made that generated the model>)
model$tree

The xgboost package and the random forests regression

The xgboost package allows to build a random forest (in fact, it chooses a random subset of columns to choose a variable for a split for the whole tree, not for a nod, as it is in a classical version of the algorithm, but it can be tolerated). But it seems that for regression only one tree from the forest (maybe, the last one built) is used.
To ensure that, consider just a standard toy example.
library(xgboost)
library(randomForest)
data(agaricus.train, package = 'xgboost')
dtrain = xgb.DMatrix(agaricus.train$data,
label = agaricus.train$label)
bst = xgb.train(data = dtrain,
nround = 1,
subsample = 0.8,
colsample_bytree = 0.5,
num_parallel_tree = 100,
verbose = 2,
max_depth = 12)
answer1 = predict(bst, dtrain);
(answer1 - agaricus.train$label) %*% (answer1 - agaricus.train$label)
forest = randomForest(x = as.matrix(agaricus.train$data), y = agaricus.train$label, ntree = 50)
answer2 = predict(forest, as.matrix(agaricus.train$data))
(answer2 - agaricus.train$label) %*% (answer2 - agaricus.train$label)
Yes, of course, the default version of the xgboost random forest uses not a Gini score function but just the MSE; it can be changed easily. Also it is not correct to do such a validation and so on, so on. It does not affect a main problem. Regardless of which sets of parameters are being tried results are suprisingly bad compared with the randomForest implementation. This holds for another data sets as well.
Could anybody provide a hint on such strange behaviour? When it comes to the classification task the algorithm does work as expected.
#
Well, all trees are grown and all are used to make a prediction. You may check that using the parameter 'ntreelimit' for the 'predict' function.
The main problem remains: is the specific form of the Random Forest algorithm that is produced by the xgbbost package valid?
Cross-validation, parameter tunning and other crap have nothing to do with that -- every one may add necessary corrections to the code and see what happens.
You may specify the 'objective' option like this:
mse = function(predict, dtrain)
{
real = getinfo(dtrain, 'label')
return(list(grad = 2 * (predict - real),
hess = rep(2, length(real))))
}
This provides that you use the MSE when choosing a variable for the split. Even after that, results are suprisingly bad compared to those of randomForest.
Maybe, the problem is of academical nature and concerns the way how a random subset of features to make a split is chosen. The classical implementation chooses a subset of features (the size is specified with 'mtry' for the randomForest package) for EVERY split separately and the xgboost implementation chooses one subset for a tree (specified with 'colsample_bytree').
So this fine difference appears to be of great importance, at least for some types of datasets. It is interesting, indeed.
xgboost(random forest style) does use more than one tree to predict. But there are many other differences to explore.
I myself am new to xgboost, but curious. So I wrote the code below to visualize the trees. You can run the code yourself to verify or explore other differences.
Your data set of choice is a classification problem as labels are either 0 or 1. I like to switch to a simple regression problem to visualize what xgboost does.
true model: $y = x_1 * x_2$ + noise
If you train a single tree or multiple tree, with the code examples below you observe that the learned model structure does contain more trees. You cannot argue alone from the prediction accuracy how many trees are trained.
Maybe the predictions are different because the implementations are different. None of the ~5 RF implementations I know of are exactly alike, and this xgboost(rf style) is as closest a distant "cousin".
I observe the colsample_bytree is not equal to mtry, as the former uses the same subset of variable/columns for the entire tree. My regression problem is one big interaction only, which cannot be learned if trees only uses either x1 or x2. Thus in this case colsample_bytree must be set to 1 to use both variables in all trees. Regular RF could model this problem with mtry=1, as each node would use either X1 or X2
I see your randomForest predictions are not out-of-bag cross-validated. If drawing any conclusions on predictions you must cross-validate, especially for fully grown trees.
NB You need to fix the function vec.plot as does not support xgboost out of the box, because xgboost out of some other box do not take data.frame as an valid input. The instruction in the code should be clear
library(xgboost)
library(rgl)
library(forestFloor)
Data = data.frame(replicate(2,rnorm(5000)))
Data$y = Data$X1*Data$X2 + rnorm(5000)*.5
gradientByTarget =fcol(Data,3)
plot3d(Data,col=gradientByTarget) #true data structure
fix(vec.plot) #change these two line in the function, as xgboost do not support data.frame
#16# yhat.vec = predict(model, as.matrix(Xtest.vec))
#21# yhat.obs = predict(model, as.matrix(Xtest.obs))
#1 single deep tree
xgb.model = xgboost(data = as.matrix(Data[,1:2]),label=Data$y,
nrounds=1,params = list(max.depth=250))
vec.plot(xgb.model,as.matrix(Data[,1:2]),1:2,col=gradientByTarget,grid=200)
plot(Data$y,predict(xgb.model,as.matrix(Data[,1:2])),col=gradientByTarget)
#clearly just one tree
#100 trees (gbm boosting)
xgb.model = xgboost(data = as.matrix(Data[,1:2]),label=Data$y,
nrounds=100,params = list(max.depth=16,eta=.5,subsample=.6))
vec.plot(xgb.model,as.matrix(Data[,1:2]),1:2,col=gradientByTarget)
plot(Data$y,predict(xgb.model,as.matrix(Data[,1:2])),col=gradientByTarget) ##predictions are not OOB cross-validated!
#20 shallow trees (bagging)
xgb.model = xgboost(data = as.matrix(Data[,1:2]),label=Data$y,
nrounds=1,params = list(max.depth=250,
num_parallel_tree=20,colsample_bytree = .5, subsample = .5))
vec.plot(xgb.model,as.matrix(Data[,1:2]),1:2,col=gradientByTarget) #bagged mix of trees
plot(Data$y,predict(xgb.model,as.matrix(Data[,1:2]))) #terrible fit!!
#problem, colsample_bytree is NOT mtry as columns are only sampled once
# (this could be raised as an issue on their github page, that this does not mimic RF)
#20 deep tree (bagging), no column limitation
xgb.model = xgboost(data = as.matrix(Data[,1:2]),label=Data$y,
nrounds=1,params = list(max.depth=500,
num_parallel_tree=200,colsample_bytree = 1, subsample = .5))
vec.plot(xgb.model,as.matrix(Data[,1:2]),1:2,col=gradientByTarget) #boosted mix of trees
plot(Data$y,predict(xgb.model,as.matrix(Data[,1:2])))
#voila model can fit data

When to use index and seeds arguments of train() in caret package in R

Primary Question:
After reading the documentation and google searching, I am still stumped as to what the situations are where it is advisable to pre-define resampling indices such as:
resamples <- createResample(classVector_training, times = 500, list=TRUE)
or predefine seeds such as:
seeds <- vector(mode = "list", length = 501) #length is = (n_repeats*nresampling)+1
for(i in 1:501) seeds[[i]]<- sample.int(n=1000, 1)
My plan is to train a bunch of different reproducible models using parallel processing via the doParallel package. Is predefining resamples unnecessary due to the seeds already being set? Do I need to predefine seeds in the way above instead of setting seeds=NULL in the trainControl object because I intend to use parallel processing? Is there any reason to pre-define both index and seeds as I've seen at least once via searching google? And what is a reason to ever use indexOut?
Side Question:
So far, I've managed to run train fine for RF:
rfControl <- trainControl(method="oob", number = 500, p = 0.7, returnData=TRUE, returnResamp = "all", savePredictions=TRUE, classProbs = TRUE, summaryFunction = twoClassSummary, allowParallel=TRUE)
mtryGrid <- expand.grid(mtry = 9480^0.5) #set mtry parameter to the square root of the number of variables
rfTrain <- train(x = training, y = classVector_training, method = "rf", trControl = rfControl, tuneGrid = mtryGrid)
But when I try to run train() with method = "baruta" as such:
borutaControl <- trainControl(method="bootstrap", number = 500, p = 0.7, returnData=TRUE, returnResamp = "all", savePredictions=TRUE, classProbs = TRUE, summaryFunction = twoClassSummary, allowParallel=TRUE)
borutaTrain <- train(x = training, y = classVector_training, method = "Boruta", trControl = borutaControl, tuneGrid = mtryGrid)
I end up getting the following error:
Error in names(trControl$indexOut) <- prettySeq(trControl$indexOut) : 'names' attribute [1] must be the same length as the vector [0]
Anyone know why?
There are a few different times random numbers are used here, so I'll try to be specific about which seeds.
Is predefining resamples unnecessary due to the seeds already being set?
If you do not provide your own resampling indices, the first things that train, rfe, sbf, gafs, and safs do is to create them. So, setting the overall seed prior to calling these controls the randomness of creating resamples. So, you can call these functions repeatedly and use the same samples of you set the main seed beforehand:
set.seed(2346)
mod1 <- train(y ~ x, data = dat, method = "a", ...)
set.seed(2346)
mod2 <- train(y ~ x, data = dat, method = "b", ...)
set.seed(2346)
mod3 <- rfe(x, y, ...)
You can use createResamples or createFolds if you like and give those to trainControl's index argument too.
One other note about this: if indexOut is missing, the holdouts are defined as whatever samples were not used to train the model. There are cases when this is bad (see the exception below) and that is why indexOut exists.
Do I need to predefine seeds in the way above instead of setting seeds=NULL in the trainControl object because I intend to use parallel processing?
That was the main intent. When the worker processes startup, there was no way to control the randomness inside the model fit prior to our addition of the seeds argument. You don't have to use it, but it will lead to reproducible models.
Note that, like resamples, train will create seeds for you if you do not supply them. They are found in the control$seeds element in the train object.
Note that trainControl(seeds) has nothing to do with creating the resamples.
Is there any reason to pre-define both index and seeds as I've seen at least once via searching google?
If you want to pre-define the resamples and control any potential randomness in the worker processes that build the models, then yes.
And what is a reason to ever use indexOut?
There are always specialized situations. The reason it is there is for time series data where you might have data splits that do not involve all the samples passed to train (this is the exception mentioned above). See the white space in this graphic.
tl/dr
trainControl(seeds) only controls the randomness of the model fits
setting the seed prior to calling train is one way to control the randomness of data splitting
Max

Cannot extractPrediction using caret in R

I'm totally stucked on a random forest classification model since I cannot extract predictions. And I'm really out of clues since:
predict(forest.model1, titanic.final.test)
works like a charm, while
extractPrediction(list(forest.model1), testX=titanic.final.test[,-2], testY=titanic.final.test[,2])
which should be equivalent, gives me this error:
Error in predict.randomForest(modelFit, newdata) :
variables in the training data missing in newdata
Here's my trainControl:
forest.fitControl <- trainControl( method = "repeatedcv", repeats = 5,
summaryFunction = twoClassSummary, classProbs=TRUE,
returnData=TRUE, seeds=NULL, savePredictions=TRUE, returnResamp="all")
any idea?
Test and Train need to have the same structure (i.e. all the same columns). So my only guess is that negating the second column is resulting in a different structure that the data used to train the model. Hard to know without seeing the structire of the training vs. test data.frames.
Edit After Looking at Code:
Recreated this from your repo... Sure it shouldn't be the first column you pull out for testX and use for testY. Something like:
extractPrediction(list(forest.model1), testX=titanic.final.test[,-1], testY=titanic.final.test[,1])

Resources