rpart Complexity Parameter values - r

rpart parameters can be found using getModelInfo
getModelInfo("rpart")[[1]]$grid
function(x, y, len = NULL, search = "grid"){
  dat <- if(is.data.frame(x)) x else as.data.frame(x)
  dat$.outcome <- y
  initialFit <- rpart(.outcome ~ .,
                      data = dat,
                      control = rpart.control(cp = 0))$cptable
  initialFit <- initialFit[order(-initialFit[,"CP"]), , drop = FALSE]
  if(search == "grid") {
    if(nrow(initialFit) < len) {
      tuneSeq <- data.frame(cp = seq(min(initialFit[, "CP"]),
                                     max(initialFit[, "CP"]),
                                     length = len))
    } else tuneSeq <- data.frame(cp = initialFit[1:len,"CP"])
    colnames(tuneSeq) <- "cp"
  } else {
    tuneSeq <- data.frame(cp = unique(sample(initialFit[, "CP"], size = len, replace = TRUE)))
  }
  tuneSeq
}
The only parameter is
cp = seq(min(initialFit[, "CP"]), max(initialFit[, "CP"]),length = len)
But how can I get initialFit and len?
Searching elsewhere, I found that cp usually takes 10 values from 0.18 to 0.01, but I still couldn't find out where those values come from.

If you're unsure about appropriate values for a parameter, you can make caret choose for you and use default values. Here is an example that works end-to-end without explicitly specifying cp:
library(tidyverse)
library(caret)
library(forcats)
# Take mtcars data for example
df <- mtcars %>%
  # Which cars are automatic, which ones are manual?
  # In mtcars, am is coded 0 = automatic, 1 = manual
  mutate(am = as.factor(am),
         am = fct_recode(am, 'automatic' = '0', 'manual' = '1'))
set.seed(123456)
fitControl <- trainControl(method = 'repeatedcv',
                           number = 10,
                           repeats = 10,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)
# Run rpart
# Tuning grid is left unspecified, so caret uses the default
tree1 <- train(am ~ .,
               df,
               method = 'rpart',
               tuneLength = 20,
               metric = 'ROC',
               trControl = fitControl)
Alternatively, if you want to explicitly specify cp, do so using the tuning grid (tuneLength is dropped here because it only applies when caret builds the grid itself):
tuneGrid <- expand.grid(cp = seq(0, 0.05, 0.005))
tree2 <- train(am ~ .,
               df,
               method = 'rpart',
               metric = 'ROC',
               trControl = fitControl,
               tuneGrid = tuneGrid)
A question about which values of cp to select, and why, is probably better posted on CrossValidated.
Update:
To answer your follow-on question about the default values and the values I chose in my example, I recommend going back to the primary source of the modelling function. caret is a great package for convenience, but all it does is make lots of algorithms more accessible through a shared syntax. If you have a technical question about rpart, consult the rpart package manual.
As mentioned above, this type of question is better placed on CrossValidated, where the focus is on maths, stats, and machine learning.
However, to give you a tldr here:
The choice of tuning grid parameters is always going to be arbitrary to some extent. The objective is to find the value that produces the best results for your specific problem, which in turn depends on your data, your algorithm, and your evaluation metric. A common rule of thumb is to start with a wide range, identify the region that likely contains the maximum, and then use finer intervals around that region. In your case it is relatively easy, as you only have one parameter to optimise over. Just try a couple of values and see what happens. You can plot the fitted train object (plot(tree1)) to see how the resampled performance changes as a function of the complexity parameter cp. Eventually you will start developing a feel for what might work.
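To make the origin of the default cp values concrete, the grid function quoted in the question can be replayed by hand. This is a minimal sketch, assuming the built-in iris data as a stand-in and a hypothetical len of 3 (in train, len is whatever is passed as tuneLength); the names initial_fit and cp_grid are illustrative and not part of caret.

```r
library(rpart)

len <- 3  # hypothetical; in train() this comes from the tuneLength argument

# Grow the full tree (cp = 0) and read the cp value of every nested subtree
initial_fit <- rpart(Species ~ ., data = iris,
                     control = rpart.control(cp = 0))$cptable
# Sort from the largest cp (smallest tree) to the smallest cp (largest tree)
initial_fit <- initial_fit[order(-initial_fit[, "CP"]), , drop = FALSE]

if (nrow(initial_fit) < len) {
  # Fewer subtrees than requested values: interpolate between min and max cp
  cp_grid <- seq(min(initial_fit[, "CP"]), max(initial_fit[, "CP"]),
                 length.out = len)
} else {
  # Otherwise take the len largest cp values straight from the table
  cp_grid <- unname(initial_fit[1:len, "CP"])
}

print(initial_fit)
print(cp_grid)
```

So the candidate values are not fixed constants: they depend on the data through the cptable of an initial, unpruned fit, which is why numbers quoted elsewhere will generally differ from yours.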


MLR - calculating feature importance for bagged, boosted trees (XGBoost)

Good morning,
I have a question about calculating feature importance for bagged and boosted regression tree models with the mlr package in R. I am using XGBoost to make predictions, and I'm using bagging to estimate prediction uncertainty. My data set is relatively large: approximately 10k features and observations. The predictions work perfectly (see code below), but I can't seem to calculate feature importance (the last line in the code below). The importance function crashes with no errors and freezes the R session. I saw some related Python code where people seem to calculate the importance for each of the bagged models. I haven't been able to get that to work properly in R either. Specifically, I'm not sure how to access the individual models within the object produced by mlr (the mb object in the code below). In Python, this seems to be trivial. In R, I can't seem to extract mb$learner.model, which seems logically closest to what I need. Has anyone had experience with this issue?
Please see the code below:
learn1 <- makeRegrTask(data = train.all , target= "resp", weights = weights1)
lrn.xgb <- makeLearner("regr.xgboost", predict.type = "response")
lrn.xgb$par.vals <- list( objective="reg:squarederror", eval_metric="error", nrounds=300, gamma=0, booster="gbtree", max.depth=6)
lrn.xgb.bag = makeBaggingWrapper(lrn.xgb, bw.iters = 50, bw.replace = TRUE, bw.size = 0.85, bw.feats = 1)
lrn.xgb.bag <- setPredictType(lrn.xgb.bag, predict.type="se")
mb = mlr::train(lrn.xgb.bag, learn1)
fimp1 <- getFeatureImportance(mb)
If you set bw.feats = 1, it should be feasible to average the feature importance values across the bagged models.
Basically, you just have to apply getFeatureImportance over all the single models stored in the HomogeneousEnsembleModel. Some extra care is necessary because the sampling shuffles the order of the features, even though the feature fraction is set to 100%.
library(mlr)
data = data.frame(x1 = runif(100), x2 = runif(100), x3 = runif(100))
data$y = with(data, x1 + 2 * x2 + 0.1 * x3 + rnorm(100))
task = makeRegrTask(data = data, target = "y")
lrn.xgb = makeLearner("regr.xgboost", predict.type = "response")
lrn.xgb$par.vals = list( objective="reg:squarederror", eval_metric="error", nrounds=50, gamma=0, booster="gbtree", max.depth=6)
lrn.xgb.bag = makeBaggingWrapper(lrn.xgb, bw.iters = 10, bw.replace = TRUE, bw.size = 0.85, bw.feats = 1)
lrn.xgb.bag = setPredictType(lrn.xgb.bag, predict.type="se")
mb = mlr::train(lrn.xgb.bag, task)
fimps = lapply(mb$learner.model$next.model, function(x) getFeatureImportance(x)$res)
fimp = fimps[[1]]
# we have to take extra care because the results are not ordered
for (i in 2:length(fimps)) {
  fimp = merge(fimp, fimps[[i]], by = "variable")
}
rowMeans(fimp[,-1]) # only makes sense with bw.feats = 1
# [1] 0.2787052 0.4853880 0.2359068
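The merge-by-name step can be looked at in isolation with base R. The sketch below uses three made-up importance tables (hypothetical numbers, not output from mlr or xgboost) whose rows are deliberately in different orders, and assumes each table has a variable column and one importance column, as the merge above implies.

```r
# Hypothetical per-model importance tables with shuffled row order
fimps <- list(
  data.frame(variable = c("x1", "x2", "x3"), importance = c(0.3, 0.5, 0.2)),
  data.frame(variable = c("x3", "x1", "x2"), importance = c(0.1, 0.3, 0.6)),
  data.frame(variable = c("x2", "x3", "x1"), importance = c(0.4, 0.3, 0.3))
)

# Give each importance column a unique name so merge() does not
# have to invent suffixes for duplicated column names
for (i in seq_along(fimps)) {
  names(fimps[[i]])[2] <- paste0("model_", i)
}

# Join on the feature name, so row order within each table no longer matters
merged <- Reduce(function(a, b) merge(a, b, by = "variable"), fimps)

# Average across models, keeping the feature names attached
avg <- setNames(rowMeans(merged[, -1]), as.character(merged$variable))
print(avg)
```

Keeping the names attached to the averaged vector avoids having to remember which position corresponds to which feature.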

Custom model (R6 class) with caret giving binding error

I have created a custom model using an R6 class. Due to the proprietary nature of the work, I can't share the code, but I can share the anonymized structure. Basically, you would instantiate the model as follows:
mod = my_model$new(formula = form, data = data, hyper1 = hyper1, hyper2 = hyper2)
Arguments:
formula: Standard R formula
data: Data Frame containing X and y values
hyper1 and hyper2: hyper-parameters of this model
Now, I am trying to integrate this with the caret package and having issues during the training process. This is the fit function that I am using
fitFunc <- function(x, y, wts, param, lev, last, weights, classProbs, ...) {
  # Custom algorithm takes dataframe as input so need to convert 'x' and 'y' to a dataframe
  # https://stats.stackexchange.com/questions/89171/help-requested-with-using-custom-model-in-caret-package
  data = as.data.frame(x)
  data$.outcome = y
  data = as.data.frame(data)
  # Define formula for model
  form = formula(.outcome ~ .)
  mod = my_model$new(formula = form, data = data, hyper1 = param$hyper1, hyper2 = param$hyper2, ...)
  return(mod)
}
This is the code snippet for integrating the custom model with caret and running it.
library(caret)
set.seed(998)
inTraining <- createDataPartition(mtcars$mpg, p = .75, list = FALSE)
training <- mtcars[ inTraining,]
testing <- mtcars[-inTraining,]
fitControl <- trainControl(method = "cv", number = 3)
set.seed(825)
mdl_builder = train(mpg ~ ., data = training,
                    method = custom_model_list,
                    tuneLength = 8,
                    trControl = fitControl)
However, this leads to the following error messages (just a small snippet, it actually fails on every fold)
model fit failed for Fold2: hyper1=1, hyper2=4 Error in modelFit$xNames <- colnames(x) :
cannot add bindings to a locked environment
I think this comes from the caret code internally trying to assign xNames to the R6 class object, which the R6 class does not allow. I don't understand how to fix this (if it is possible at all). Any help would be appreciated.
Thanks!!
I figured it out right after typing this question (plus a little more research).
The key was to set lock_objects = FALSE in the R6 class so that new attributes can be added from outside.
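The mechanism behind the error can be reproduced without the proprietary class. The sketch below is illustrative only: it uses a plain locked environment from base R to mimic a default R6 object, and a hypothetical toy_model class (run only if R6 is installed) to show the lock_objects = FALSE setting.

```r
# An R6 object is an environment; with the default lock_objects = TRUE
# it is locked after creation. lockEnvironment() mimics that in base R.
locked_obj <- new.env()
locked_obj$coef <- c(1, 2)
lockEnvironment(locked_obj)

# Adding a new field to the locked environment fails, which is what
# happens when caret runs modelFit$xNames <- colnames(x)
result <- tryCatch({
  locked_obj$xNames <- c("cyl", "disp")
  "assigned"
}, error = function(e) "failed")
print(result)

# Without the lock, the same assignment succeeds
open_obj <- new.env()
open_obj$xNames <- c("cyl", "disp")

# Same idea with R6 itself (toy_model is a hypothetical stand-in class)
if (requireNamespace("R6", quietly = TRUE)) {
  toy_model <- R6::R6Class("toy_model",
    lock_objects = FALSE,
    public = list(coef = NULL)
  )
  m <- toy_model$new()
  m$xNames <- c("cyl", "disp")  # allowed because objects are not locked
}
```

Unlocking the objects trades some safety for compatibility: typos in field names will silently create new fields instead of raising an error.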

rpart giving same results for cross-validation and no CV

Like the title says, I'm trying to run a decision tree both with and without cross-validation using the rpart package in R. I'm doing this using the xval parameter, as described in the vignette (https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf)
Unfortunately, I'm getting the same tree with and without CV. I've compared the calculation time for each, and the CV model takes about 10 times as long, so it's apparently doing something; I just can't figure out what.
I've also refit the model a number of times with different complexity parameters, but it hasn't made any difference.
Here's sample code that shows my problem. The printcp calls show the same results, and the predictions from both models are the same on the training set and on a hold-out set.
library(rpart)
library(caret)
abalone <- read.csv(file = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data',header = FALSE)
names(abalone) <- c("sex", "length", "diameter", "height", "whole_weight", "shucked_weight", "viscera_weight", "shell_weight", "rings")
train_set <- createDataPartition(abalone$sex, times = 1, p = 0.8, list = FALSE)
# Base indexing is used instead of slice(), since dplyr is never loaded
abalone_train <- abalone[train_set, ]
abalone_test <- abalone[-train_set, ]
abalone_fit_noCV <- rpart(sex ~ .,
                          data = abalone_train,
                          method = "class",
                          parms = list(split = 'information'),
                          control = rpart.control(xval = 0,
                                                  cp = 0.005))
abalone_fit_CV <- rpart(sex ~ .,
                        data = abalone_train,
                        method = "class",
                        parms = list(split = 'information'),
                        control = rpart.control(xval = 10,
                                                cp = 0.005))
printcp(abalone_fit_noCV)
printcp(abalone_fit_CV)
# Compare predictions on the training set
CV_pred <- predict(abalone_fit_CV, type = "class")
noCV_pred <- predict(abalone_fit_noCV, type = "class")
confusionMatrix(CV_pred, noCV_pred)
# Compare predictions on the hold-out set
CV_pred <- predict(abalone_fit_CV, abalone_test, type = "class")
noCV_pred <- predict(abalone_fit_noCV, abalone_test, type = "class")
confusionMatrix(CV_pred, noCV_pred)
In true beginner fashion, I figured this out shortly after posting.
For anybody else coming upon this issue, it is basically answered on Cross Validated :
The final tree that is returned is still the initial tree. You must use the prune function using the cross-validation plot to choose the best subtree.
This is clear if you read the full Pruning the tree section of the vignette, rather than just the cross-validation section.
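A minimal sketch of that pruning step, assuming the kyphosis data that ships with rpart as a stand-in for the abalone data. Picking the cp with the lowest cross-validated error is one common choice, not the only one (the vignette also discusses a one-standard-error rule); the variable names are illustrative.

```r
library(rpart)

set.seed(42)  # cross-validation folds are random

# Same tree-growing settings, with and without cross-validation
fit_nocv <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                  method = "class", control = rpart.control(xval = 0, cp = 0))
fit_cv <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                method = "class", control = rpart.control(xval = 10, cp = 0))

# xval only adds cross-validated error columns to the cp table;
# the tree that is returned is the same unpruned tree in both cases
print(colnames(fit_nocv$cptable))
print(colnames(fit_cv$cptable))

# Use the cross-validated error to choose a cp, then prune explicitly
cp_table <- fit_cv$cptable
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned <- prune(fit_cv, cp = best_cp)

print(best_cp)
print(c(full = nrow(fit_cv$frame), pruned = nrow(pruned$frame)))
```

Predictions should then be made from the pruned object; comparing the two unpruned fits will always show identical results.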

How to retrieve elastic net coefficients?

I am using the caret package to train an elastic net model on my dataset modDat. I take a grid search approach paired with repeated cross validation to select the optimal values of the lambda and fraction parameters required by the elastic net function. My code is shown below.
library(caret)
library(elasticnet)
grid <- expand.grid(
  lambda = seq(0.5, 0.7, by=0.1),
  fraction = seq(0, 1, by=0.1)
)
ctrl <- trainControl(
  method = 'repeatedcv',
  number = 5, #folds
  repeats = 10, #repeats
  classProbs = FALSE
)
set.seed(1)
enetTune <- train(
  y ~ .,
  data = modDat,
  method = 'enet',
  metric = 'RMSE',
  tuneGrid = grid,
  verbose = FALSE,
  trControl = ctrl
)
I can get predictions using y_hat <- predict(enetTune, modDat), but I cannot view the coefficients underlying the predictions.
I have tried coef(enetTune$finalModel), but it only returns NULL. I suspect that I have to give coef() more information, but I am not sure how to do this.
In addition, I would like to produce a box plot of the 50 sets of coefficients (10 repeats of 5 folds) associated with the optimal lambda and fraction parameters.
To see the coefficients, use predict:
predict(enetTune$finalModel, type = "coefficients")
See ?predict.enet for more information on how to get specific coefficients.
Following on from the answer by @Weihuang Wong, you can get the coefficients from the final model using the following code:
predict.enet(enetTune$finalModel, s=enetTune$bestTune[1, "fraction"], type="coef", mode="fraction")$coefficients
What works best for me is stats::predict, as in @Weihuang Wong's answer. However, as the OP pointed out in a comment, that returns a set of coefficients for every step of the solution path rather than a single set.
The important thing to understand is that predict is the general accessor for the fitted path here: with type = "coefficients" it returns coefficients instead of fitted values, so you should be aware of that and explore the options available.
In this case, you can use the same function with the argument s to pick one point on the path. Note that s is not the quadratic penalty lambda, which is fixed when the final model is fit; with mode = "fraction", s is the fraction parameter tuned by caret. Remember that you are still calling predict, but this time you will get the coefficients you are looking for.
stats::predict(enetTune$finalModel, type = "coefficients", s = enetTune$bestTune$fraction, mode = "fraction")

Using Cost Sensitive C50 in caret

I am using train in the caret package to train some C5.0 models. The C5.0 method works fine, but with the cost-sensitive C5.0 method I struggle to understand how to tune the cost parameter. What I am trying to do is introduce a cost for wrongly predicting one of my classes. I've tried searching the caret package website (http://topepo.github.io/caret/index.html) and reading several manuals and tutorials, but I didn't find any information about how to handle the cost parameter. So this is what I tried on my own:
1. Run train with the default settings to see what I get. In the output, train tried cost values from 0 to 2 and gave the best model for cost=2.
2. Pass the cost as a matrix in expand.grid, the same way you'd do with the C50 package. The code is below (trials is set to 1 because I just want one tree/set of rules in my output):
c50Grid <- expand.grid(.trials=1, .model=c("tree", "rules"), .winnow=c("TRUE", "FALSE"), .cost=matrix(c(0,1,2,0), ncol=2))
However, when I execute the train function, I don't get any errors (though I do get 50 warnings), and train again tries cost values from 0 to 2. What am I doing wrong? What format does the cost parameter take? What does it mean here? How would I interpret the results? Which class is the one getting the cost, as in "Predicting class 0 wrong costs double than class 1"? Also, I tried a single matrix and it didn't work in this format, but how would I add the different costs that I want to test?
Thanks! Any help would be really welcome!
Edit:
So, trying to find an answer on my own about the meaning of the cost parameter for the C5.0Cost, I went to the C5.0Cost.R (https://r-forge.r-project.org/scm/viewvc.php/models/files/C5.0Cost.R?view=markup&root=caret&pathrev=761) and looked up the code.
This line:
cmat <-matrix(c(0, param$cost, 1, 0), ncol = 2)
I guess, it's passing the cost parameter to the cost matrix. So, I think now I can understand how it works. If I have class = {0,1} and my positive class is 0, this matrix says that "Predicting class 0 wrong costs double than class 1", right?
My question now is, how could I do the opposite? How could I set that "Predicting class 1 wrong costs double than class 0", which would be:
cmat <- matrix(c(0, 1, param$cost, 0), ncol=2)
Could I just set the cost to 0.5? And if I want to train with different values, just use values less than 1 (0.5, 0.6, 0.7, etc.)?
Note: with my data, C50 and the other trees I used before take "Positive class = 0", so I had to invert the cost matrix when I used C50. If I use the caret method C5.0Cost, I'd need to do the same or find another way to do it.
I'd really appreciate any help here.
Thanks!
There is cost-sensitive model code for train and C5.0 (use method = "C5.0Cost"). For example:
library(caret)
set.seed(1)
dat1 <- twoClassSim(1000, intercept = -12)
dat2 <- twoClassSim(1000, intercept = -12)
stats <- function (data, lev = NULL, model = NULL) {
  c(postResample(data[, "pred"], data[, "obs"]),
    Sens = sensitivity(data[, "pred"], data[, "obs"]),
    Spec = specificity(data[, "pred"], data[, "obs"]))
}
ctrl <- trainControl(method = "repeatedcv", repeats = 5,
                     summaryFunction = stats)
set.seed(2)
mod1 <- train(Class ~ ., data = dat1,
              method = "C5.0",
              tuneGrid = expand.grid(model = "tree", winnow = FALSE,
                                     trials = c(1:10, (1:5)*10)),
              trControl = ctrl)
xyplot(Sens + Spec ~ trials, data = mod1$results,
       type = "l",
       auto.key = list(columns = 2,
                       lines = TRUE,
                       points = FALSE))
set.seed(2)
mod2 <- train(Class ~ ., data = dat1,
              method = "C5.0Cost",
              tuneGrid = expand.grid(model = "tree", winnow = FALSE,
                                     trials = c(1:10, (1:5)*10),
                                     cost = 1:10),
              trControl = ctrl)
xyplot(Sens + Spec ~ trials|format(cost), data = mod2$results,
       type = "l",
       auto.key = list(columns = 2,
                       lines = TRUE,
                       points = FALSE))
Max
If I have class = {0,1} and my positive class is 0, this matrix says that "Predicting class 0 wrong costs double than class 1", right? My question now is, how could I do the opposite? How could I set that "Predicting class 1 wrong costs double than class 0" [...]?
Unfortunately, you can't change the costs for the false positives in caret at the moment. This appears to be a bug! See this post for further information about this issue.
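For reference, the two cost matrices discussed in the question differ only in which off-diagonal cell carries the cost, which is easy to see by building them in base R. This is only a sketch of the matrix layout: the class labels and the value cost = 2 are hypothetical, and which of rows or columns C5.0 reads as the predicted class should be checked against the C50 documentation rather than taken from here.

```r
cost <- 2                 # hypothetical value of the tuning parameter
classes <- c("0", "1")    # hypothetical class labels

# matrix() fills column by column, so c(0, cost, 1, 0) puts cost in
# row 2, column 1 and the unit cost in row 1, column 2
cmat_default <- matrix(c(0, cost, 1, 0), ncol = 2,
                       dimnames = list(classes, classes))

# Swapping the two off-diagonal entries gives the "opposite" weighting
cmat_flipped <- matrix(c(0, 1, cost, 0), ncol = 2,
                       dimnames = list(classes, classes))

print(cmat_default)
print(cmat_flipped)

# Using cost = 0.5 in the original layout gives a matrix proportional
# to the flipped one with cost = 2
cmat_half <- matrix(c(0, 0.5, 1, 0), ncol = 2,
                    dimnames = list(classes, classes))
print(all.equal(cmat_half, 0.5 * cmat_flipped))
```

The last comparison only shows that the two matrices are proportional; whether C5.0 treats proportional cost matrices as equivalent is an assumption that is not verified here.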
