I am using train from the caret package to train some C5.0 models. I manage fine with the C5.0 method, but when I want to use the cost-sensitive C5.0 method I struggle to understand how to tune the cost parameter. What I am trying to do is to introduce a cost for predicting one of my classes wrong. I've tried searching the caret package website (http://topepo.github.io/caret/index.html) and reading several manuals/tutorials found here and there, but I didn't find any information about how to handle the cost parameter. So this is what I tried on my own:
Run train with the default settings to see what I get. In the output, train tried cost values from 0 to 2 and gave the best model for cost = 2.
Try to pass the cost to expand.grid as a matrix, the same way you would when using the C5.0 package directly. The code is below (trials is set to 1 because I just want one tree/set of rules in my output):
c50Grid <- expand.grid(.trials=1, .model=c("tree", "rules"), .winnow=c("TRUE", "FALSE"), .cost=matrix(c(0,1,2,0), ncol=2))
However, when I execute the train function, although I don't get any errors (I do get 50 warnings), train again tried cost values from 0 to 2. What am I doing wrong? What format does the cost parameter expect, and what does it mean here? How would I interpret the results? Which class is the one getting the cost, as in "predicting class 0 wrong costs twice as much as predicting class 1 wrong"? Also, what I tried was a single matrix; since it didn't work with this format, how would I add the different costs that I want to test?
Thanks! Any help would be really welcome!
Edit:
So, trying to find an answer on my own about the meaning of the cost parameter for C5.0Cost, I went to C5.0Cost.R (https://r-forge.r-project.org/scm/viewvc.php/models/files/C5.0Cost.R?view=markup&root=caret&pathrev=761) and looked at the code.
This line:
cmat <-matrix(c(0, param$cost, 1, 0), ncol = 2)
I guess it's passing the cost parameter into the cost matrix, so I think I now understand how it works. If I have classes {0, 1} and my positive class is 0, this matrix says "predicting class 0 wrong costs twice as much as predicting class 1 wrong", right?
My question now is, how could I do the opposite? How could I specify that "predicting class 1 wrong costs twice as much as predicting class 0 wrong", which would be:
cmat <- matrix(c(0, 1, param$cost, 0), ncol=2)
Could I just set the cost to 0.5? And if I want to train with different values, just use values less than 1 (0.5, 0.6, 0.7, etc.)?
Note: given the way my data is set up, when I used C50 or other trees before, class 0 was taken as the positive class, so I had to invert the cost matrix when using C50 directly. If I use the caret method C5.0Cost, I'd need to do the same or find another way to do it...
I'd really appreciate any help here.
Thanks!
There is cost-sensitive model code for train and C5.0 (use method = "C5.0Cost"). For example:
library(caret)
set.seed(1)
dat1 <- twoClassSim(1000, intercept = -12)
dat2 <- twoClassSim(1000, intercept = -12)
stats <- function(data, lev = NULL, model = NULL) {
  c(postResample(data[, "pred"], data[, "obs"]),
    Sens = sensitivity(data[, "pred"], data[, "obs"]),
    Spec = specificity(data[, "pred"], data[, "obs"]))
}
ctrl <- trainControl(method = "repeatedcv", repeats = 5,
                     summaryFunction = stats)
set.seed(2)
mod1 <- train(Class ~ ., data = dat1,
              method = "C5.0",
              tuneGrid = expand.grid(model = "tree", winnow = FALSE,
                                     trials = c(1:10, (1:5)*10)),
              trControl = ctrl)
xyplot(Sens + Spec ~ trials, data = mod1$results,
       type = "l",
       auto.key = list(columns = 2,
                       lines = TRUE,
                       points = FALSE))
set.seed(2)
mod2 <- train(Class ~ ., data = dat1,
              method = "C5.0Cost",
              tuneGrid = expand.grid(model = "tree", winnow = FALSE,
                                     trials = c(1:10, (1:5)*10),
                                     cost = 1:10),
              trControl = ctrl)
xyplot(Sens + Spec ~ trials | format(cost), data = mod2$results,
       type = "l",
       auto.key = list(columns = 2,
                       lines = TRUE,
                       points = FALSE))
Max
If I have class = {0,1} and my positive class is 0, this matrix says that "Predicting class 0 wrong costs double than class 1", right? My question now is, how could I do the opposite? How could I set that "Predicting class 1 wrong costs double than class 0" [...]?
Unfortunately, you can't change the costs for the false positives in caret at the moment. This appears to be a bug! See this post for further information about this issue.
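A possible workaround, not tested against every caret version, is to clone the built-in C5.0Cost definition with getModelInfo() and swap the cost matrix inside its fit function so the off-diagonal penalty lands on the other class. The sketch below is simplified (the real fit function also handles a user-supplied control argument, which is dropped here), C5CostFlipped is just an illustrative name, and it is worth double-checking the matrix orientation against ?C5.0 before relying on it:
library(caret)
library(C50)

# start from the stock C5.0Cost model definition
C5CostFlipped <- getModelInfo("C5.0Cost", regex = FALSE)[[1]]

# replace its fit function with one that flips the off-diagonal cost,
# mirroring cmat <- matrix(c(0, 1, param$cost, 0), ncol = 2) from the question
C5CostFlipped$fit <- function(x, y, wts, param, lev, last, classProbs, ...) {
  cmat <- matrix(c(0, 1, param$cost, 0), ncol = 2)
  rownames(cmat) <- colnames(cmat) <- levels(y)
  C5.0(x, y,
       trials  = param$trials,
       rules   = param$model == "rules",
       control = C5.0Control(winnow = param$winnow),
       costs   = cmat)
}

# then pass the list itself as the method, e.g. reusing dat1 and ctrl from above:
# modFlipped <- train(Class ~ ., data = dat1, method = C5CostFlipped,
#                     tuneGrid = expand.grid(trials = 1, model = "tree",
#                                            winnow = FALSE, cost = 1:5),
#                     trControl = ctrl)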
Good morning,
I have a question about calculating feature importance for bagged and boosted regression tree models with the mlr package in R. I am using xgboost to make predictions and I'm using bagging to estimate prediction uncertainty. My data set is relatively large: approximately 10k features and observations. The predictions work perfectly (see the code below), but I can't seem to calculate feature importance (the last line in the code below). The importance function crashes with no errors and freezes the R session. I saw some related Python code where people seem to calculate the importance for each of the bagged models here and here. I haven't been able to get that to work properly in R either. Specifically, I'm not sure how to access the individual models within the object produced by mlr (the mb object in the code below). In Python this seems to be trivial, but in R I can't seem to extract mb$learner.model, which seems logically closest to what I need. So I'm wondering if anyone has any experience with this issue?
Please see the code below:
learn1 <- makeRegrTask(data = train.all , target= "resp", weights = weights1)
lrn.xgb <- makeLearner("regr.xgboost", predict.type = "response")
lrn.xgb$par.vals <- list( objective="reg:squarederror", eval_metric="error", nrounds=300, gamma=0, booster="gbtree", max.depth=6)
lrn.xgb.bag = makeBaggingWrapper(lrn.xgb, bw.iters = 50, bw.replace = TRUE, bw.size = 0.85, bw.feats = 1)
lrn.xgb.bag <- setPredictType(lrn.xgb.bag, predict.type="se")
mb = mlr::train(lrn.xgb.bag, learn1)
fimp1 <- getFeatureImportance(mb)
If you set bw.feats = 1, it might be feasible to average the feature importance values.
Basically, you just have to apply over all the single models stored in the HomogeneousEnsembleModel. Some extra care is necessary because the order of the features gets mixed up by the sampling, even though we set it to 100%.
library(mlr)
data = data.frame(x1 = runif(100), x2 = runif(100), x3 = runif(100))
data$y = with(data, x1 + 2 * x2 + 0.1 * x3 + rnorm(100))
task = makeRegrTask(data = data, target = "y")
lrn.xgb = makeLearner("regr.xgboost", predict.type = "response")
lrn.xgb$par.vals = list( objective="reg:squarederror", eval_metric="error", nrounds=50, gamma=0, booster="gbtree", max.depth=6)
lrn.xgb.bag = makeBaggingWrapper(lrn.xgb, bw.iters = 10, bw.replace = TRUE, bw.size = 0.85, bw.feats = 1)
lrn.xgb.bag = setPredictType(lrn.xgb.bag, predict.type="se")
mb = mlr::train(lrn.xgb.bag, task)
fimps = lapply(mb$learner.model$next.model, function(x) getFeatureImportance(x)$res)
fimp = fimps[[1]]
# we have to take extra care because the results are not ordered
for (i in 2:length(fimps)) {
  fimp = merge(fimp, fimps[[i]], by = "variable")
}
rowMeans(fimp[,-1]) # only makes sense with bw.feats = 1
# [1] 0.2787052 0.4853880 0.2359068
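To make the averaged values easier to read, you can attach the feature names (assuming, as the merge by "variable" above implies, that each res data frame has a variable column):
# name the averaged importances by feature
setNames(rowMeans(fimp[, -1]), fimp$variable)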
Like the title says, I'm trying to run a decision tree both with and without cross-validation using the rpart package in R. I'm doing this using the xval parameter, as described in the vignette (https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf).
Unfortunately, I'm getting the same tree with and without CV. I've compared the calculation time for each, and the CV model takes about 10 times as long, so it's apparently doing something; I just can't figure out what.
I've also rerun the model a number of times with different complexity parameters, but it hasn't made any difference.
Here's sample code that shows my problem: the printcp outputs show the same results, and the predictions from both models on the training set and on a hold-out set are the same.
library(rpart)
library(caret)
library(dplyr)  # slice() used below comes from dplyr
abalone <- read.csv(file = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data',header = FALSE)
names(abalone) <- c("sex", "length", "diameter", "height", "whole_weight", "shucked_weight", "viscera_weight", "shell_weight", "rings")
train_set <- createDataPartition(abalone$sex, times = 1, p = 0.8, list = FALSE)
abalone_train <- slice(abalone, train_set)
abalone_test <- slice(abalone, -train_set)
abalone_fit_noCV <- rpart(sex ~ .,
                          data = abalone_train,
                          method = "class",
                          parms = list(split = 'information'),
                          control = rpart.control(xval = 0,
                                                  cp = 0.005))
abalone_fit_CV <- rpart(sex ~ .,
                        data = abalone_train,
                        method = "class",
                        parms = list(split = 'information'),
                        control = rpart.control(xval = 10,
                                                cp = 0.005))
printcp(abalone_fit_noCV)
printcp(abalone_fit_CV)
CV_pred <- predict(abalone_fit_CV, type = "class")
noCV_pred <- predict(abalone_fit_noCV, type = "class")
confusionMatrix(CV_pred, noCV_pred)
CV_pred <- predict(abalone_fit_CV, abalone_test, type = "class")
noCV_pred <- predict(abalone_fit_noCV, abalone_test, type = "class")
confusionMatrix(CV_pred, noCV_pred)
In true beginner fashion, I figured this out shortly after posting.
For anybody else coming upon this issue, it is basically answered on Cross Validated:
The final tree that is returned is still the initial tree. You must use the prune function using the cross-validation plot to choose the best subtree.
This is clear if you read the full Pruning the tree section of the vignette, rather than just the cross-validation section.
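For completeness, here is a rough sketch of that pruning step, reusing the objects from the question: pick the cp value with the lowest cross-validated error from the cptable and prune back to it.
# cross-validated error (xerror) is only populated when xval > 0
cp_tab  <- abalone_fit_CV$cptable
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]

# prune the full tree back to the subtree selected by cross-validation
abalone_fit_pruned <- prune(abalone_fit_CV, cp = best_cp)
printcp(abalone_fit_pruned)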
The rpart tuning parameters used by caret can be found using getModelInfo:
getModelInfo("rpart")[[1]]$grid
function(x, y, len = NULL, search = "grid") {
  dat <- if(is.data.frame(x)) x else as.data.frame(x)
  dat$.outcome <- y
  initialFit <- rpart(.outcome ~ .,
                      data = dat,
                      control = rpart.control(cp = 0))$cptable
  initialFit <- initialFit[order(-initialFit[,"CP"]), , drop = FALSE]
  if(search == "grid") {
    if(nrow(initialFit) < len) {
      tuneSeq <- data.frame(cp = seq(min(initialFit[, "CP"]),
                                     max(initialFit[, "CP"]),
                                     length = len))
    } else tuneSeq <- data.frame(cp = initialFit[1:len,"CP"])
    colnames(tuneSeq) <- "cp"
  } else {
    tuneSeq <- data.frame(cp = unique(sample(initialFit[, "CP"], size = len, replace = TRUE)))
  }
  tuneSeq
}
The only parameter is:
cp = seq(min(initialFit[, "CP"]), max(initialFit[, "CP"]),length = len)
But how can I get the initialFit and the len?
Searching elsewhere, I found that cp can usually take 10 values from 0.18 to 0.01, but I still couldn't find out where those values come from.
If you're unsure about appropriate values for a parameter, you can make caret choose for you and use default values. Here is an example that works end-to-end without explicitly specifying cp:
library(tidyverse)
library(caret)
library(forcats)
# Take mtcars data for example
df <- mtcars %>%
  # Which cars are automatic, which ones are manual?
  mutate(am = as.factor(am),
         am = fct_recode(am, 'automatic' = '1', 'manual' = '0'))
set.seed(123456)
fitControl <- trainControl(method = 'repeatedcv',
                           number = 10,
                           repeats = 10,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)
# Run rpart
# Tuning grid is left unspecified, so caret uses the default
tree1 <- train(am ~ .,
               df,
               method = 'rpart',
               tuneLength = 20,
               metric = 'ROC',
               trControl = fitControl)
Alternatively, if you want to explicitly specify cp, do so using the tuning grid:
tuneGrid <- expand.grid(cp = seq(0, 0.05, 0.005))
tree2 <- train(am ~ .,
               df,
               method = 'rpart',
               tuneLength = 20,
               metric = 'ROC',
               trControl = fitControl,
               tuneGrid = tuneGrid)
A question on why you should select which values for cp is probably better posted on CrossValidated.
Update:
To answer your follow-on question about the default values and the values I chose in my example, I recommend going back to the primary source of the modelling function. caret is a great package for convenience reasons, but all it does is make lots of algorithms more accessible through a shared syntax. If you have a technical question about rpart, consult the package manual here.
As mentioned above, this type of question is better placed on CrossValidated, where the focus is on maths, stats, and machine learning.
However, to give you a tldr here:
The choice of tuning grid parameters is always going to be somewhat arbitrary. The objective is to find the values that produce the best results for your specific problem, which in turn depends on your data, your algorithm, and your evaluation metric. A common rule of thumb is to start with a wide range, identify the region with a likely maximum, and then use finer intervals around that region. In your case it is relatively easy, as you only have one parameter to optimise over: just try a couple of values and see what happens. You can plot the fitted model object (plot(tree1)) to see how performance changes as a function of the complexity parameter cp. Eventually you will develop a feel and an intuition for what might work.
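If you want to see concretely where caret's default cp candidates come from, you can reproduce the logic of the grid function shown in the question by hand. A small sketch, using iris purely as a stand-in dataset and len in place of train()'s tuneLength:
library(rpart)

# grow a full tree with cp = 0, exactly as caret's grid function does
initialFit <- rpart(Species ~ ., data = iris,
                    control = rpart.control(cp = 0))$cptable
initialFit <- initialFit[order(-initialFit[, "CP"]), , drop = FALSE]

len <- 10  # this is what tuneLength supplies inside train()
if (nrow(initialFit) < len) {
  cp_candidates <- seq(min(initialFit[, "CP"]), max(initialFit[, "CP"]),
                       length = len)
} else {
  cp_candidates <- initialFit[1:len, "CP"]
}
cp_candidates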
The vegan package includes the ordiR2step() function for model building, which can be used to identify the most important variables using the R2 and the p-value as goodness-of-fit measures. However, for the dataset I was recently working with, the function doesn't provide the best-fit model.
# data
library(vegan)  # needed for rda() and ordiR2step() below
RIKZ <- read.table("http://www.uni-koblenz-landau.de/en/campus-landau/faculty7/environmental-sciences/landscape-ecology/Teaching/RIKZ_data/at_download/file", header = TRUE)
# data preparation
Species <- RIKZ[ ,2:5]
ExplVar <- RIKZ[ , 9:15]
Species_fin <- Species[ rowSums(Species) > 0, ]
ExplVar_fin <- ExplVar[ rowSums(Species) > 0, ]
# rda
RIKZ_rda <- rda(Species_fin ~ . , data = ExplVar_fin, scale = TRUE)
# stepwise model building: ordiR2step()
step_both_R2 <- ordiR2step(rda(Species_fin ~ salinity, data = ExplVar_fin, scale = TRUE),
                           scope = formula(RIKZ_rda),
                           direction = "both", R2scope = TRUE, Pin = 0.05,
                           steps = 1000)
Why does ordiR2step() not add the variable exposure to the model, although it would increase the explained variance?
If R2scope is set to FALSE and the p-value criterion is increased (Pin = 0.15), it adds the variable exposure correctly but throws the following error:
Error in terms.formula(tmp, simplify = TRUE) :
invalid model formula in ExtractVars
If R2scope is set to TRUE (Pin = 0.15), exposure is not added.
Note: this might seem more like a statistics question and therefore more suitable for CV, but I think the problem is rather technical and better off here on SO.
Please read the ordiR2step documentation: it will tell you why exposure is not added to the model. The help page says that ordiR2step has three stopping criteria. The second criterion is that "the adjusted R2 of the ‘scope’ is exceeded". This happens with exposure, and therefore it was not added. This second criterion is ignored if you set R2scope = FALSE (also documented). So the function works as documented.
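If you want to check this yourself, you can compare the adjusted R2 values directly (as far as I understand, ordiR2step() bases this criterion on RsquareAdj()); a short sketch using the objects from the question:
# upper limit used when R2scope = TRUE: the adjusted R2 of the full scope model
RsquareAdj(RIKZ_rda)$adj.r.squared

# adjusted R2 of the model that ordiR2step() actually selected
RsquareAdj(step_both_R2)$adj.r.squared

# adjusted R2 if exposure were added by hand; if this exceeds the scope value,
# the step is rejected under the second stopping criterion
RsquareAdj(rda(Species_fin ~ salinity + exposure, data = ExplVar_fin,
               scale = TRUE))$adj.r.squared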
I am using the caret package to train an elastic net model on my dataset modDat. I take a grid-search approach paired with repeated cross-validation to select the optimal values of the lambda and fraction parameters required by the elastic net function. My code is shown below.
library(caret)
library(elasticnet)
grid <- expand.grid(
  lambda = seq(0.5, 0.7, by = 0.1),
  fraction = seq(0, 1, by = 0.1)
)
ctrl <- trainControl(
  method = 'repeatedcv',
  number = 5,    # folds
  repeats = 10,  # repeats
  classProbs = FALSE
)
set.seed(1)
enetTune <- train(
  y ~ .,
  data = modDat,
  method = 'enet',
  metric = 'RMSE',
  tuneGrid = grid,
  verbose = FALSE,
  trControl = ctrl
)
I can get predictions using y_hat <- predict(enetTune, modDat), but I cannot view the coefficients underlying the predictions.
I have tried coef(enetTune$finalModel), but the only thing returned is NULL. I suspect that I have to give the coef() function more information, but I am not sure how to do this.
In addition, I would like to produce a box plot of the 50 sets of coefficients (10 repeats of 5 folds) associated with the optimal lambda and fraction parameters.
To see the coefficients, use predict:
predict(enetTune$finalModel, type = "coefficients")
See ?predict.enet for more information on how to get specific coefficients.
Following on from the answer by @Weihuang Wong, you can get the coefficients from the final model using the following code:
predict.enet(enetTune$finalModel, s=enetTune$bestTune[1, "fraction"], type="coef", mode="fraction")$coefficients
To me, what works best is stats::predict, as in @Weihuang Wong's answer. However, as the OP pointed out in a comment, that call returns the coefficients at every step along the path.
The important thing to understand here is that when you are using predict, your intention is precisely to predict the value of the parameters, not really to retrieve them. You should be aware of that and explore the options available.
In this case, you can use the same function with the s and mode arguments to pick the point on the path that corresponds to the tuned fraction. Remember that you are still predicting, but this time you will get the coefficients you are looking for:
stats::predict(enetTune$finalModel, type = "coefficients", s = enetTune$bestTune$fraction, mode = "fraction")
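As a small usage note on the call above: the returned object is a list and the coefficient vector sits in its coefficients element, so you can, for example, keep only the non-zero entries:
cf <- stats::predict(enetTune$finalModel, type = "coefficients",
                     s = enetTune$bestTune$fraction, mode = "fraction")$coefficients
cf[cf != 0]  # predictors retained at the tuned fraction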