How to choose the number of nodes in rpart? - r

In the tree package we can use the following code to choose the number of terminal nodes:
tree.model = tree(...)
tree.prune = prune.tree(tree.model, best = 20)
This code returns a new tree with 20 terminal nodes.
In the rpart package the following code can be used for this:
rpart.model = rpart(...)
rpart.prune = prune.rpart(rpart.model, cp = ?)
Here cp is the cost-complexity parameter, but I want an argument similar to best in prune.tree.

The rpart package doesn't have an argument similar to best in the tree package. (The tree package was developed to cover functionality that rpart was missing.)
To choose an appropriate number of nodes, you can tune other parameters in rpart. For example:
minsplit <- 20  # defined first so it can be reused in the control object
prune.control <- rpart.control(minsplit = minsplit, minbucket = round(minsplit/3), xval = 10)
rpart(formula, data, method, control = prune.control)
Then evaluate the cross-validated error against cp to choose a cp value. You can also tune cp automatically with the caret package. For example:
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
model <- train(x = train_data,
               y = labels,
               method = "rpart",
               trControl = ctrl)
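If you specifically want the analogue of best, you can emulate it from the fitted model's cptable: the number of terminal nodes in each candidate subtree is nsplit + 1, so pick the cp of the largest subtree that fits your target. A minimal sketch (the kyphosis data shipped with rpart and the target of 5 leaves are just for illustration):
library(rpart)
# Grow a deliberately large tree first
rpart.model <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                     control = rpart.control(cp = 0))
cpt <- rpart.model$cptable
target <- 5                            # desired number of terminal nodes
ok <- cpt[, "nsplit"] + 1 <= target    # subtrees no larger than the target
cp.best <- cpt[max(which(ok)), "CP"]   # cp of the largest such subtree
rpart.prune <- prune(rpart.model, cp = cp.best)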

Related

extract_inner_fselect_results is NULL with mlr3 Nested Resampling

This question is an extension of the following question: No Model Stored with Mlr3.
I have been performing nested resampling to get an unbiased metric of model performance. If I don't specify store_models=TRUE then I get Error: No model stored at the end of the run. However, if I specify store_models=TRUE in both the at and resample calls then RStudio crashes due to RAM consumption.
I have now tried the following code in which I specified store_models=TRUE for just the at call:
library(mlr3verse)  # loads mlr3, mlr3fselect, mlr3learners, etc.
MSvCon <- read.csv("MS v Control Proteomics Final.csv", row.names = 1)
MSvCon$Status <- as.factor(MSvCon$Status)
MSvCon[, 2:4399] <- scale(MSvCon[, 2:4399], center = TRUE, scale = TRUE)
set.seed(123, "L'Ecuyer")
task = as_task_classif(MSvCon, target = "Status")
learner = lrn("classif.ranger", importance = "impurity", num.trees = 10000)
set_threads(learner, n = 8)
measure = msr("classif.fbeta", beta = 1, average = "micro")
terminator = trm("none")
resampling_inner = rsmp("repeated_cv", folds = 10, repeats = 10)
at = AutoFSelector$new(
  learner = learner,
  resampling = resampling_inner,
  measure = measure,
  terminator = terminator,
  fselect = fs("rfe", n_features = 1, feature_fraction = 0.5, recursive = FALSE),
  store_models = TRUE)
resampling_outer = rsmp("repeated_cv", folds = 10, repeats = 10)
rr = resample(task, at, resampling_outer)
After finishing, I am able to extract performance measures successfully. However, when I tried to use extract_inner_fselect_results and extract_inner_fselect_archives to check which features were selected and their importance measures, I received a NULL result.
Do you have any suggestions on what I would need to adjust in my code to see this information? I anticipate that adding store_models=TRUE to the resample call would fix it, but the RAM consumption issue (even with 128GB on RStudio Workbench) prevents that. Is there a way around this?
The archives of the inner resampling are stored in the model slot of the AutoFSelector objects, i.e. without store_models = TRUE in resample() you cannot access the inner results and archives. I will write a workaround for you and answer in the other question.
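For reference, the call pattern being described looks like this (a sketch; it is exactly the variant the question rules out because of RAM):
rr = resample(task, at, resampling_outer, store_models = TRUE)
extract_inner_fselect_results(rr)   # selected feature sets per outer fold
extract_inner_fselect_archives(rr)  # full inner optimization archives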

rpart giving same results for cross-validation and no CV

Like the title says, I'm trying to run a decision tree both with and without cross-validation using the rpart package in R. I'm doing this using the xval parameter, as described in the vignette (https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf)
Unfortunately, I'm getting the same tree with and without CV. I've compared the calculation time for each and the CV model takes about 10 times as long, so it's apparently doing something; I just can't figure out what.
I've also redone the model a number of times with different complexity parameters, but it hasn't made any difference.
Here's sample code that shows my problem: the printcp outputs show the same results, and the predictions from both on the training and a hold-out set are the same.
library(rpart)
library(caret)
library(dplyr)  # for slice()
abalone <- read.csv(file = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data', header = FALSE)
names(abalone) <- c("sex", "length", "diameter", "height", "whole_weight", "shucked_weight", "viscera_weight", "shell_weight", "rings")
train_set <- createDataPartition(abalone$sex, times = 1, p = 0.8, list = FALSE)
abalone_train <- slice(abalone, train_set)
abalone_test <- slice(abalone, -train_set)
abalone_fit_noCV <- rpart(sex ~ .,
                          data = abalone_train,
                          method = "class",
                          parms = list(split = 'information'),
                          control = rpart.control(xval = 0,
                                                  cp = 0.005))
abalone_fit_CV <- rpart(sex ~ .,
                        data = abalone_train,
                        method = "class",
                        parms = list(split = 'information'),
                        control = rpart.control(xval = 10,
                                                cp = 0.005))
printcp(abalone_fit_noCV)
printcp(abalone_fit_CV)
CV_pred <- predict(abalone_fit_CV, type = "class")
noCV_pred <- predict(abalone_fit_noCV, type = "class")
confusionMatrix(CV_pred, noCV_pred)
CV_pred <- predict(abalone_fit_CV, abalone_test, type = "class")
noCV_pred <- predict(abalone_fit_noCV, abalone_test, type = "class")
confusionMatrix(CV_pred, noCV_pred)
In true beginner fashion, I figured this out shortly after posting.
For anybody else coming upon this issue, it is basically answered on Cross Validated:
The final tree that is returned is still the initial tree. You must use the prune function, using the cross-validation results to choose the best subtree.
This is clear if you read the full Pruning the tree section of the vignette, rather than just the cross-validation section.
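In code, the pruning step looks something like this sketch, using the CV fit from above and the common rule of taking the cp with the lowest cross-validated error:
cpt <- abalone_fit_CV$cptable
best_cp <- cpt[which.min(cpt[, "xerror"]), "CP"]  # cp minimizing xerror
abalone_fit_pruned <- prune(abalone_fit_CV, cp = best_cp)
printcp(abalone_fit_pruned)  # now differs from the unpruned tree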

How to read and get the C5.0 Model out of R

I've searched and searched but can't find the answer. I have a c5_model trained and ready, but I needed to do 100 trials to get it working at the level I want. Now I'm stuck on trying to get the tree out of the model in R. I have run summary, but how do I get the decision tree out? Which trial do I want to use?
Update:
I'm building the model by doing the following
control <- trainControl(method = "repeatedcv",
                        number = 5,
                        repeats = 3,
                        classProbs = TRUE,
                        summaryFunction = twoClassSummary)
grid <- expand.grid(.winnow = c(FALSE),
                    .trials = 100,
                    .model = "tree")
c5_model <- train(HasFraud ~ .,
                  data = train,
                  method = "C5.0",
                  trControl = control,
                  metric = "ROC",
                  tuneGrid = grid,
                  verbose = FALSE)
Is this the wrong method to train the model?
An object of class C5.0 has a number of elements, as described in the help file you can pull up with ?C50::C5.0.default. One of those elements is tree. If you've assigned the output of a call to C5.0() to a value, say model, you can extract any of its elements using the $ operator. For example:
model <- C5.0(<the call you made that generated the model>)
model$tree
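Since the question builds the model through caret rather than calling C5.0() directly, note that train() keeps the fitted C5.0 object in its finalModel slot, so the same extraction applies. A sketch, assuming the c5_model from above:
c5_fit <- c5_model$finalModel  # the underlying C5.0 object
summary(c5_fit)                # human-readable tree/rules over all 100 trials
c5_fit$tree                    # raw tree element, as described in ?C50::C5.0.default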

rpart Complexity Parameter values

The rpart tuning parameters used by caret can be found with getModelInfo:
getModelInfo("rpart")[[1]]$grid
function(x, y, len = NULL, search = "grid"){
dat <- if(is.data.frame(x)) x else as.data.frame(x)
dat$.outcome <- y
initialFit <- rpart(.outcome ~ .,
data = dat,
control = rpart.control(cp = 0))$cptable
initialFit <- initialFit[order(-initialFit[,"CP"]), , drop = FALSE]
if(search == "grid") {
if(nrow(initialFit) < len) {
tuneSeq <- data.frame(cp = seq(min(initialFit[, "CP"]),
max(initialFit[, "CP"]),
length = len))
} else tuneSeq <- data.frame(cp = initialFit[1:len,"CP"])
colnames(tuneSeq) <- "cp"
} else {
tuneSeq <- data.frame(cp = unique(sample(initialFit[, "CP"], size = len, replace = TRUE)))
}
tuneSeq
}
The only parameter is
cp = seq(min(initialFit[, "CP"]), max(initialFit[, "CP"]), length = len)
But how can I get initialFit and len?
Searching elsewhere, I found that cp can usually take 10 values from 0.18 to 0.01, but I still couldn't find out where those values come from.
In the grid function above, len is the tuneLength argument you pass to train(), and initialFit is just the cptable of an rpart model grown on your own data with cp = 0, so both come from your data and your call. If you're unsure about appropriate values for the parameter, you can let caret choose for you and use default values. Here is an example that works end-to-end without explicitly specifying cp:
library(tidyverse)
library(caret)
library(forcats)
# Take mtcars data for example
df <- mtcars %>%
  # Which cars are automatic, which ones are manual?
  mutate(am = as.factor(am),
         am = fct_recode(am, 'automatic' = '1', 'manual' = '0'))
set.seed(123456)
fitControl <- trainControl(method = 'repeatedcv',
                           number = 10,
                           repeats = 10,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)
# Run rpart
# Tuning grid is left unspecified, so caret uses the default
tree1 <- train(am ~ .,
               df,
               method = 'rpart',
               tuneLength = 20,
               metric = 'ROC',
               trControl = fitControl)
Alternatively, if you want to explicitly specify cp, do so using the tuning grid:
tuneGrid <- expand.grid(cp = seq(0, 0.05, 0.005))
tree2 <- train(am ~ .,
               df,
               method = 'rpart',
               tuneLength = 20,
               metric = 'ROC',
               trControl = fitControl,
               tuneGrid = tuneGrid)
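Either way, you can then inspect which cp value won and how performance varied across the grid (a sketch, assuming the fits above):
tree2$bestTune  # the cp value with the best cross-validated ROC
tree2$results   # resampled performance for every cp that was tried
plot(tree2)     # ROC as a function of cp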
The question of why you should select particular values for cp is probably better posted on CrossValidated.
Update:
To answer your follow-on question about the default values and the values I chose in my example, I recommend going back to the primary source of the modelling function. caret is a great package for convenience reasons, but all it does is make lots of algorithms more accessible through a shared syntax. If you have a technical question about rpart, consult the package manual here.
As mentioned above, this type of question is better placed on CrossValidated, where the focus is on maths, stats, and machine learning.
However, to give you a tldr here:
The choice of tuning grid parameters is always going to be arbitrary to some extent. The objective is to find the value that produces the best results for your specific problem, which in turn depends on your data, your algorithm, and your evaluation metric. Some common "rules of thumb" include starting with a wide range, identifying the area with a likely maximum, and then using finer intervals around that region. In your case it is relatively easy, as you only have one parameter to optimise over. Just try a couple of values and see what happens. You can plot the fitted tree object (plot(tree1)) to see how your model improves as a function of the complexity parameter cp. Eventually you will start developing a "feel" and "intuition" for what might work.
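For instance, the wide-then-fine idea might look like this sketch (the grid boundaries are arbitrary placeholders):
# First pass: a wide, coarse grid
tuneGrid <- expand.grid(cp = seq(0, 0.2, 0.02))
# Second pass: zoom into the promising region found in the first pass
tuneGrid <- expand.grid(cp = seq(0, 0.02, 0.001))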

using caret package to find optimal parameters of GBM

I'm using the R gbm package for boosting, doing regression on some biological data of dimensions 10,000 x 932, and I want to know the best parameter settings for the gbm package, especially n.trees, shrinkage, interaction.depth and n.minobsinnode. When I searched online, I found that the caret package in R can find such parameter settings. However, I'm having difficulty using caret with gbm, so I just want to know how to use caret to find the optimal combinations of the previously mentioned parameters. I know this might seem like a very typical question, but I've read the caret manual and still have difficulty integrating caret with gbm, especially since I'm very new to both packages.
Not sure if you found what you were looking for, but I find some of these sheets less than helpful.
If you are using the caret package, the following describes the required parameters:
getModelInfo()$gbm$parameters
Here are some rules of thumb for running GBM:
interaction.depth: the default is 1, and on most data sets that seems adequate, but on a few I have found that testing the results against odd multiples up to the max has given better results. The max value I have seen for this parameter is floor(sqrt(NCOL(training))).
shrinkage: the smaller the number, the better the predictive value, the more trees required, and the more computational cost. Testing the values on a small subset of data with something like shrinkage = seq(.0005, .05, .0005) can be helpful in defining the ideal value.
n.minobsinnode: the default is 10, and generally I don't mess with that. I have tried c(5, 10, 15, 20) on small sets of data and didn't really see an adequate return for the computational cost.
n.trees: the smaller the shrinkage, the more trees you should have. Start with n.trees = (0:50)*50 and adjust accordingly.
Example setup using the caret package:
getModelInfo()$gbm$parameters
library(parallel)
library(doMC)
registerDoMC(cores = 20)
# Max shrinkage for gbm
nl = nrow(training)
max(0.01, 0.1 * min(1, nl/10000))
# Max value for interaction.depth
floor(sqrt(NCOL(training)))
gbmGrid <- expand.grid(interaction.depth = c(1, 3, 6, 9, 10),
                       n.trees = (0:50)*50,
                       shrinkage = seq(.0005, .05, .0005),
                       n.minobsinnode = 10) # you can also put something like c(5, 10, 15, 20)
fitControl <- trainControl(method = "repeatedcv",
                           repeats = 5,
                           preProcOptions = list(thresh = 0.95),
                           ## Estimate class probabilities
                           classProbs = TRUE,
                           ## Evaluate performance using
                           ## the following function
                           summaryFunction = twoClassSummary)
# Method + date + distribution
set.seed(1)
system.time(GBM0604ada <- train(Outcome ~ ., data = training,
                                distribution = "adaboost",
                                method = "gbm", bag.fraction = 0.5,
                                nTrain = round(nrow(training) * .75),
                                trControl = fitControl,
                                verbose = TRUE,
                                tuneGrid = gbmGrid,
                                ## Specify which metric to optimize
                                metric = "ROC"))
Things can change depending on your data (like the distribution), but I have found the key is to play with gbmGrid until you get the outcome you are looking for. The settings as they are now would take a long time to run, so modify as your machine and time allow.
To give you a ballpark of computation, I run on a Mac Pro with 12 cores and 64GB of RAM.
This link has a concrete example (page 10): http://www.jstatsoft.org/v28/i05/paper
Basically, one should first create a grid of candidate values for the hyperparameters (like n.trees, interaction.depth and shrinkage), then call the generic train function as usual.
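A minimal sketch of that two-step pattern (the grid values are arbitrary starting points; training and Outcome are the same assumed objects as in the answer above):
library(caret)
# Step 1: a grid of candidate values for the four gbm tuning parameters
gbmGrid <- expand.grid(n.trees = c(100, 500, 1000),
                       interaction.depth = c(1, 3, 5),
                       shrinkage = c(0.01, 0.1),
                       n.minobsinnode = 10)
# Step 2: train() cross-validates every combination and keeps the best
gbmFit <- train(Outcome ~ ., data = training, method = "gbm",
                trControl = trainControl(method = "cv", number = 5),
                tuneGrid = gbmGrid, verbose = FALSE)
gbmFit$bestTune  # the winning combination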
