rpart giving same results for cross-validation and no CV - r

Like the title says, I'm trying to run a decision tree both with and without cross-validation using the rpart package in R. I'm doing this using the xval parameter, as described in the vignette (https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf)
Unfortunately, I'm getting the same tree with and without CV. I've compared the calculation time for each and the CV model looks like it takes about 10 times as long, so its apparently doing something, I just can't figure out what.
I've also redone the model a number of times with different complexity parameters, but it hasn't made any difference.
Here's sample code that shows my problem, the printcp's show the same results and the predictions from both on the training and a hold-out set are the same.
library(rpart)
library(caret)
abalone <- read.csv(file = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data',header = FALSE)
names(abalone) <- c("sex", "length", "diameter", "height", "whole_weight", "shucked_weight", "viscera_weight", "shell_weight", "rings")
train_set <- createDataPartition(abalone$sex, times = 1, p = 0.8, list = FALSE)
abalone_train <- slice(abalone, train_set)
abalone_test <- slice(abalone, -train_set)
abalone_fit_noCV <- rpart(sex ~ .,
data = abalone_train,
method = "class",
parms = list(split = 'information'),
control = rpart.control(xval = 0,
cp = 0.005))
abalone_fit_CV <- rpart(sex ~ .,
data = abalone_train,
method = "class",
parms = list(split = 'information'),
control = rpart.control(xval = 10,
cp = 0.005))
printcp(abalone_fit_noCV)
printcp(abalone_fit_CV)
CV_pred <- predict(abalone_fit_CV, type = "class")
noCV_pred <- predict(abalone_fit_noCV, type = "class")
confusionMatrix(CV_pred, noCV_pred)
CV_pred <- predict(abalone_fit_CV, abalone_test, type = "class")
noCV_pred <- predict(abalone_fit_noCV, abalone_test, type = "class")
confusionMatrix(CV_pred, noCV_pred)

In true beginner fashion, I figured this out shortly after posting.
For anybody else coming upon this issue, it is basically answered on Cross Validated :
The final tree that is returned is still the initial tree. You must use the prune function using the cross-validation plot to choose the best subtree.
This is clear if you read the full Pruning the tree section of the vignette, rather than just the cross-validation section.

Related

MLR - calculating feature importance for bagged, boosted trees (XGBoost)

Good morning,
I have a question about calculating feature importance for bagged and boosted regression tree models with MLR package in R. I am using XGBOOST to make predictions and i'm using bagging to estimate prediction uncertainty. My data set is relatively large; approximately 10k features and observations. The predictions work perfectly (see code below), but I can't seem to calculate feature importance (the last line in the code below). The importance function crashes with no errors... and freezes the R session. I saw some related python code, where people seem to calculate the importance for each of the bagged models here and here. I haven't been able to get that to work properly in R either. Specifically, i'm not sure how to access individual models within the objected produced by MLR (mb object in the code below). In python, this seems to be trivial. In R, i can't seem to extract mb$learner.model, which seems logically closest to what i need. So i'm wondering if anyone had any experience with this issues?
Please see the code below
learn1 <- makeRegrTask(data = train.all , target= "resp", weights = weights1)
lrn.xgb <- makeLearner("regr.xgboost", predict.type = "response")
lrn.xgb$par.vals <- list( objective="reg:squarederror", eval_metric="error", nrounds=300, gamma=0, booster="gbtree", max.depth=6)
lrn.xgb.bag = makeBaggingWrapper(lrn.xgb, bw.iters = 50, bw.replace = TRUE, bw.size = 0.85, bw.feats = 1)
lrn.xgb.bag <- setPredictType(lrn.xgb.bag, predict.type="se")
mb = mlr::train(lrn.xgb.bag, learn1)
fimp1 <- getFeatureImportance(mb)
If you set bw.feats = 1 it might be feasible to average the feature importance values.
Basically you just have to apply over all single models that are stored in the HomogeneousEnsembleModel. Some extra care is necessary because the order of the features gets mixed up because of the sampling - although we set it to 100%.
library(mlr)
data = data.frame(x1 = runif(100), x2 = runif(100), x3 = runif(100))
data$y = with(data, x1 + 2 * x2 + 0.1 * x3 + rnorm(100))
task = makeRegrTask(data = data, target = "y")
lrn.xgb = makeLearner("regr.xgboost", predict.type = "response")
lrn.xgb$par.vals = list( objective="reg:squarederror", eval_metric="error", nrounds=50, gamma=0, booster="gbtree", max.depth=6)
lrn.xgb.bag = makeBaggingWrapper(lrn.xgb, bw.iters = 10, bw.replace = TRUE, bw.size = 0.85, bw.feats = 1)
lrn.xgb.bag = setPredictType(lrn.xgb.bag, predict.type="se")
mb = mlr::train(lrn.xgb.bag, task)
fimps = lapply(mb$learner.model$next.model, function(x) getFeatureImportance(x)$res)
fimp = fimps[[1]]
# we have to take extra care because the results are not ordered
for (i in 2:length(fimps)) {
fimp = merge(fimp, fimps[[i]], by = "variable")
}
rowMeans(fimp[,-1]) # only makes sense with bw.feats = 1
# [1] 0.2787052 0.4853880 0.2359068

Custom model (R6 class) with caret giving binding error

I have created a custom model using R6 class. Due to the proprietary nature of work, I cant share the code but can share the anonymized structure. So basically, you would instantiate the model as follows
mod = my_model$new(formula = form, data = data, hyper1 = hyper1, hyper2 = hyper2)
Arguments:
formula: Standard R formula
data: Data Frame containing X and y values
hyper1 and hyper2: hyper-parameters of this model
Now, I am trying to integrate this with the caret package and having issues during the training process. This is the fit function that I am using
fitFunc <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
# Custom algorithm takes dataframe as input so need to convert 'x' and 'y' to a dataframe
# https://stats.stackexchange.com/questions/89171/help-requested-with-using-custom-model-in-caret-package
data = as.data.frame(x)
data$.outcome = y
data = as.data.frame(data)
# Define formula for model
form = formula(.outcome ~ .)
mod = my_model$new(formula = form, data = data, hyper1 = param$hyper1, hyper2 = param$hyper2, ...)
return(mod)
}
This is the code snippet for integrating the custom model with caret and running it.
library(caret)
set.seed(998)
inTraining <- createDataPartition(mtcars$mpg, p = .75, list = FALSE)
training <- mtcars[ inTraining,]
testing <- mtcars[-inTraining,]
fitControl <- trainControl(method = "cv", number = 3)
set.seed(825)
mdl_builder = train(mpg ~ ., data = training,
method = custom_model_list,
tuneLength = 8,
trControl = fitControl)
However, this leads to the following error messages (just a small snippet, it actually fails on every fold)
model fit failed for Fold2: hyper1=1, hyper2=4 Error in modelFit$xNames <- colnames(x) :
cannot add bindings to a locked environment
I think this is coming from the fact that the caret code internally is trying to assign the xNames to the R6 class object but the R6 class is not allowing this. I dont understand how to fix this (if at all it is possible). Any help would be appreciated.
Thanks!!
I figured it out right after typing this question (+ a little more research)
The key was to set lock_objects = FALSE in the R6 class to that new attributes can be added from outside.

mlr: Tune model parameters with validation set

Just switched to mlr for my machine learning workflow. I am wondering if it is possible to tune hyperparameters using a separate validation set. From my minimum understanding, makeResampleDesc and makeResampleInstance accepts only resampling from training data.
My goal is to tune parameters with a validation set and test the final model with the test set. This is to prevent overfitting and knowledge leak.
Here is what I did code-wise:
## Create training, validation and test tasks
train_task <- makeClassifTask(data = train_data, target = "y", positive = 1)
validation_task <- makeClassifTask(data = validation_data, target = "y")
test_task <- makeClassifTask(data = test_data, target = "y")
## Attempt to tune parameters with separate validation data
tuned_params <- tuneParams(
task = train_task,
resampling = makeResampleInstance("Holdout", task = validation_task),
...
)
From the error message, it looks like evaluation is still trying to resample from the training set:
00001: Error in resample.fun(learner2, task, resampling, measures =
measures, : Size of data set: 19454 and resampling instance:
1666333 differ!
Does anyone know what I should do? Am I setting up everything the right way?
[Update as of 2019/03/27]
Following #jakob-r's comment, and finally understanding #LarsKotthoff's suggestion, here is what I did:
## Create combined training data
train_task_data <- rbind(train_data, validation_data)
## Create learner, training task, etc.
xgb_learner <- makeLearner("classif.xgboost", predict.type = "prob")
train_task <- makeClassifTask(data = train_task_data, target = "y", positive = 1)
## Tune hyperparameters
tune_wrapper <- makeTuneWrapper(
learner = xgb_learner,
resampling = makeResampleDesc("Holdout"),
measures = ...,
par.set = ...,
control = ...
)
model_xgb <- train(tune_wrapper, train_task)
Here is what I did following #LarsKotthoff 's comment. Assume you have two separate datasets for training (train_data) and validation (validation_data):
## Create combined training data
train_task_data <- rbind(train_data, validation_data)
size <- nrow(train_task_data)
train_ind <- seq_len(nrow(train_data))
validation_ind <- seq.int(max(train_ind) + 1, size)
## Create training task
train_task <- makeClassifTask(data = train_task_data, target = "y", positive = 1)
## Tune hyperparameters
tuned_params <- tuneParams(
task = train_task,
resampling = makeFixedHoldoutInstance(train_ind, validation_ind, size),
...
)
After optimizing the hyperparameter set, you can build a final model and test against your test dataset.
Note: I have to install the latest development version (as of 2018/08/06) from GitHub. Current CRAN version (2.12.1) throws an error when I call makeFixedHoldoutInstance(), i.e.,
Assertion on 'discrete.names' failed: Must be of type 'logical flag',
not 'NULL'.

rpart Complexity Parameter values

rpart parameters can be found using getModelInfo
getModelInfo("rpart")[[1]]$grid
function(x, y, len = NULL, search = "grid"){
dat <- if(is.data.frame(x)) x else as.data.frame(x)
dat$.outcome <- y
initialFit <- rpart(.outcome ~ .,
data = dat,
control = rpart.control(cp = 0))$cptable
initialFit <- initialFit[order(-initialFit[,"CP"]), , drop = FALSE]
if(search == "grid") {
if(nrow(initialFit) < len) {
tuneSeq <- data.frame(cp = seq(min(initialFit[, "CP"]),
max(initialFit[, "CP"]),
length = len))
} else tuneSeq <- data.frame(cp = initialFit[1:len,"CP"])
colnames(tuneSeq) <- "cp"
} else {
tuneSeq <- data.frame(cp = unique(sample(initialFit[, "CP"], size = len, replace = TRUE)))
}
tuneSeq
}
the only parameter is
cp = seq(min(initialFit[, "CP"]), max(initialFit[, "CP"]),length = len)
But how can I get the initialFit and the len?
Searching elsewhere I found that cp can usually take 10 values from 0.18 to 0.01. But still couldn't find out where those values come from
If you're unsure about appropriate values for a parameter, you can make caret choose for you and use default values. Here is an example that works end-to-end without explicitly specifying cp:
library(tidyverse)
library(caret)
library(forcats)
# Take mtcars data for example
df <- mtcars %>%
# Which cars are automatic, which ones are manual?
mutate(am = as.factor(am),
am = fct_recode(am, 'automatic' = '1', 'manual' = '0'))
set.seed(123456)
fitControl <- trainControl(method = 'repeatedcv',
number = 10,
repeats = 10,
classProbs = TRUE,
summaryFunction = twoClassSummary)
# Run rpart
# Tuning grid is left unspecified, so caret uses the default
tree1 <- train(am ~ .,
df,
method = 'rpart',
tuneLength = 20,
metric = 'ROC',
trControl = fitControl)
Alternatively, if you want to explicitly specify cp, do so using the tuning grid:
tuneGrid <- expand.grid(cp = seq(0, 0.05, 0.005))
tree2 <- train(am ~ .,
df,
method = 'rpart',
tuneLength = 20,
metric = 'ROC',
trControl = fitControl,
tuneGrid = tuneGrid)
A question on why you should select which values for cp is probably better posted on CrossValidated.
Update:
To answer your follow-on question about the default values and the values I chose in my example, I recommend going back to the primary source of the modelling function. caret is a great package for convenience reasons, but all it does is making lots of algorithms more accessible through a shared syntax. If you have a technical question about rpart, consult the package manual here.
As mentioned above, this type of question is better placed on CrossValidated, where the focus is on maths, stats, and machine learning.
However, to give you a tldr here:
The choice of tuning grid parameters is always going to be arbitrary to some extent. The objective is to find the value that produces the best results for your specific problem, which in turn depends on your data, your algorithm, and your evaluation metric. Some common "rules of thumb" include to start with a wide range, identify the area with a likely maximum and then use finer intervals around that region. In your case it is relatively easy as you only have one parameter to optimise over. Just try a couple of values and see what happens. You can plot the fitted tree object (plot(tree1)) to see how your model improves as a function of the complexity parameter cp. Eventually you will start developing a "feel" and "intuition" for what might work.

Using Cost Sensitive C50 in caret

I am using train in caret package to train some c50 models. I manage to do fine with the method C5.0 but when I want to use the cost sensitive C50 method I struggle understanding how to tune the cost parameter. What I am trying to do is to introduce a cost when predicting wrong one of my classes. I've try searching in the caret package website (http://topepo.github.io/caret/index.html) and reading several manuals/tutorials found here and there. I didn't find any information about how to handle the cost parameter. So this is what I tried on my own:
Run the train with the default settings to see what I get. In the output, the train function tried with cost from 0 to 2 and gave the best model for cost=2.
Try to add in the expand.grid function the cost as a matrix, the same way you'd do using the package C5.0. The code is below (trials is pushed to 1 cause I just want one tree/set of rules in my output)
c50Grid <- expand.grid(.trials=1, .model=c("tree", "rules"), .winnow=c("TRUE", "FALSE"), .cost=matrix(c(0,1,2,0), ncol=2))
However when I execute the train function, although I don't get any errors (but I get 50 warnings), the train tried again cost from 0 to 2. What am I doing wrong? Which format has the cost parameter? What's the meaning here? How would I interpret the results? Which class is the one getting the cost as "Predicting class 0 wrong cost double than class 1"? Also, what I tried was using one matrix, but although it didn't work with this format, how would I add the different costs that I want to test?
Thanks! Any help would be really welcome!
Edit:
So, trying to find an answer on my own about the meaning of the cost parameter for the C5.0Cost, I went to the C5.0Cost.R (https://r-forge.r-project.org/scm/viewvc.php/models/files/C5.0Cost.R?view=markup&root=caret&pathrev=761) and looked up the code.
This line:
cmat <-matrix(c(0, param$cost, 1, 0), ncol = 2)
I guess, it's passing the cost parameter to the cost matrix. So, I think now I can understand how it works. If I have class = {0,1} and my positive class is 0, this matrix says that "Predicting class 0 wrong costs double than class 1", right?
My question now is, how could I do the opposite? How could I set that "Predicting class 1 wrong costs double than class 0", which would be:
cmat <- matrix(c(0, 1, param$cost, 0), ncol=2)
Could I just set the cost to 0.5? And if want to train with different values, just use values less than 1 { 0.5, 0.6, 0.7, etc}.
Note: the way my data is, when I used C50 or other trees before, it takes as "Positive class = 0", so I had to invert the cost matrix when I used C50 so if I use caret method C5.0Cost, I'd need to do the same or find another way to do it...
I'd really appreciate any help here.
Thanks!
There is a cost-senstivite model code for train and C5.0 (use method = "C5.0Cost"). For example:
library(caret)
set.seed(1)
dat1 <- twoClassSim(1000, intercept = -12)
dat2 <- twoClassSim(1000, intercept = -12)
stats <- function (data, lev = NULL, model = NULL) {
c(postResample(data[, "pred"], data[, "obs"]),
Sens = sensitivity(data[, "pred"], data[, "obs"]),
Spec = specificity(data[, "pred"], data[, "obs"]))
}
ctrl <- trainControl(method = "repeatedcv", repeats = 5,
summaryFunction = stats)
set.seed(2)
mod1 <- train(Class ~ ., data = dat1,
method = "C5.0",
tuneGrid = expand.grid(model = "tree", winnow = FALSE,
trials = c(1:10, (1:5)*10)),
trControl = ctrl)
xyplot(Sens + Spec ~ trials, data = mod1$results,
type = "l",
auto.key = list(columns = 2,
lines = TRUE,
points = FALSE))
set.seed(2)
mod2 <- train(Class ~ ., data = dat1,
method = "C5.0Cost",
tuneGrid = expand.grid(model = "tree", winnow = FALSE,
trials = c(1:10, (1:5)*10),
cost = 1:10),
trControl = ctrl)
xyplot(Sens + Spec ~ trials|format(cost), data = mod2$results,
type = "l",
auto.key = list(columns = 2,
lines = TRUE,
points = FALSE))
Max
If I have class = {0,1} and my positive class is 0, this matrix says that "Predicting class 0 wrong costs double than class 1", right? My question now is, how could I do the opposite? How could I set that "Predicting class 1 wrong costs double than class 0" [...]?
Unfortunately, you can't change the costs for the false positives in caret at the moment. This appears to be a bug! See this post for further information about this issue.

Resources