Suppose I'm doing several runs of the same model, differing only in the complexity parameter, on the same (fixed-seed) cross-validation with the caret package, for example:
# controls are the same for every model
c = trainControl(method = "cv",number=10,verboseIter = TRUE)
d = iris # data is also the same
f = Species ~ . # formula is also the same
m = "rpart" # method is also the same
model1 <- train(form = f, data = d, trControl = c, method = m,
tuneGrid = expand.grid(cp = c(0,0.5)))
model2 <- train(form = f, data = d, trControl = c, method = m,
tuneGrid = expand.grid(cp = c(0.1,0.2)))
model3 <- train(form = f, data = d, trControl = c, method = m,
tuneGrid = expand.grid(cp = c(0,0.5,0.1,0.2)))
Is there a way I could "build up" the model3 train object from model1 and model2 alone?
The calculations take a long time, and I didn't run all my different tunings in the same caret call. But having every run in the same train object would make comparing them much easier (via the plot function, the update function, the resamples function, etc.).
I'm particularly looking for a way to do what plot.train does, but for all of them together.

I perfectly understand your concern, because my computational resources are also very limited. However, instead of "building up" the model3 object, I would approach it as follows.
Suppose what you wish to achieve is the highest accuracy. Then you simply need to check which of model1 and model2 reaches the highest accuracy, and keep the tuning parameter that produced it. For example, we see the following:
> model1$bestTune$cp
[1] 0
> model2$bestTune$cp
[1] 0.2
> model1$results$Accuracy ## Respectively for cp = 0.0 and cp = 0.5
[1] 0.9333 0.3333
> model2$results$Accuracy ## Respectively for cp = 0.1 and cp = 0.2
[1] 0.9267 0.9267
We would choose cp = 0.
Suppose you have broken things down into model1, model2, model3, ... and wish to compare all the manually entered parameter values across them.
k = 2 ## Here we only have model1 and model2 to compare
evaluate <- list(cp = numeric(0), accuracy = numeric(0))
for (i in 1:k) {
  model = get(paste0("model", i)) ## look up model1, model2, ... by name
  evaluate[["cp"]][[paste0("model", i)]] <- model$bestTune$cp
  evaluate[["accuracy"]][[paste0("model", i)]] <-
    model$results$Accuracy[[which(model$results$cp == model$bestTune$cp)]]
}
Then in our evaluate list, we have the following:
> evaluate
$cp
model1 model2
   0.0    0.2

$accuracy
model1 model2
0.9333 0.9267
Upon this, we can do
> which(evaluate$accuracy == max(evaluate$accuracy))
> evaluate$cp[[which(evaluate$accuracy == max(evaluate$accuracy))]]
[1] 0
Now we can happily choose cp = 0 and we also know that the result from the optimal cp is stored in model1.
If you wish to still "build up" the model3, you can simply substitute some of the components (e.g. results in which AccuracySD, KappaSD, and such metrics would be stored) after having chosen what we evaluated as the best model---model1 in this case.
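As a rough sketch of that idea, the per-run results tables can be stacked into one table and inspected together. The two data frames below are stand-ins for model1$results and model2$results, filled with the accuracies quoted above; the names res1, res2 and combined are made up for illustration. This is only meaningful if every run used exactly the same resampling folds, and it does not produce a real train object.

```r
# Stand-ins for model1$results and model2$results (values taken from the output above).
res1 <- data.frame(cp = c(0, 0.5), Accuracy = c(0.9333, 0.3333))
res2 <- data.frame(cp = c(0.1, 0.2), Accuracy = c(0.9267, 0.9267))

# Stack the per-run tables and sort by the tuning parameter.
combined <- rbind(res1, res2)
combined <- combined[order(combined$cp), ]
rownames(combined) <- NULL

# Row with the highest accuracy across all runs.
best <- combined[which.max(combined$Accuracy), ]
print(combined)
print(best)

# A base-graphics stand-in for the accuracy-vs-parameter view of a single run.
plot(combined$cp, combined$Accuracy, type = "b",
     xlab = "Complexity parameter (cp)", ylab = "Accuracy (cross-validation)")
```

With real models you would replace res1 and res2 by model1$results and model2$results, which also carry the Kappa and SD columns.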


R: Caret package: Brier Score

I want to perform a logistic regression with the train() function from the caret package. My model looks something like that:
model <- train(Y ~.,
data = train_data,
family = "binomial",
method = "glmnet")
With the resulting model, I want to make predictions:
pred <- predict(model, newdata = test_data, s = "lambda.min", type = "prob")
Now, I want to evaluate how good the model predictions are in comparison with the actual test data. For this I know how to obtain the ROC and AUC. However, I am also interested in obtaining the Brier score. The formula for the Brier score is almost identical to the MSE.
The problem I am facing is that the type argument in predict only allows "prob" (or "class", which I am not interested in), which gives the probability of a prediction being a ONE (e.g. 0.64) and the complementary probability of it being a ZERO (e.g. 0.36). For the Brier score, however, I need one probability estimate for each prediction that contains the information of both (e.g. a value above 0.5 would indicate a 1 and a value below 0.5 would indicate a 0).
I have not found any solution for obtaining the Brier score in the caret package. I am aware that with cv.glmnet from the glmnet package the predict function allows the argument type = "response", which would solve my problem. However, as a matter of personal preference I would like to stay with the caret package.
Thanks for the help!
If we go by the Wikipedia definition of the Brier score:
The most common formulation of the Brier score is
BS = \frac{1}{N} \sum_{t=1}^{N} (f_t - o_t)^2
where f_t is the probability that was forecast, o_t the actual outcome of the event at instance t (0 or 1) and N is the number of forecasting instances.
In R, if your label is a factor, then the logistic regression will always predict with respect to the 2nd level, meaning you just calculate the probability and 0/1 with respect to that. For example:
library(caret)
idx = sample(nrow(iris),100)
data = iris
data$Species = factor(ifelse(data$Species=="versicolor","v","o"))
levels(data$Species)
[1] "o" "v"
In this case, o is 0 and v is 1.
train_data = data[idx,]
test_data = data[-idx,]
model <- train(Species ~.,data = train_data,family = "binomial",method = "glmnet")
pred <- predict(model, newdata = test_data, type = "prob")
So we can see the probability of the class:
o v
1 0.8367885 0.16321154
2 0.7970508 0.20294924
3 0.6383656 0.36163437
4 0.9510763 0.04892370
5 0.9370721 0.06292789
To calculate the score:
f_t = pred[,2]
o_t = as.numeric(test_data$Species)-1
mean((f_t - o_t)^2)
[1] 0.32
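The same calculation can be wrapped in a small helper that names the positive level explicitly instead of relying on as.numeric(...) - 1. This is only a sketch: brier_score and its arguments are hypothetical names, not part of caret, and the toy vectors are made up rather than taken from the model above.

```r
# Hypothetical helper: Brier score for a binary outcome.
# prob_pos : predicted probability of the positive class
# obs      : factor of observed classes
# positive : name of the level treated as the event (coded 1)
brier_score <- function(prob_pos, obs, positive = levels(obs)[2]) {
  stopifnot(is.factor(obs), length(prob_pos) == length(obs))
  outcome <- as.numeric(obs == positive)  # 1 for the positive level, 0 otherwise
  mean((prob_pos - outcome)^2)
}

# Toy data (made up for illustration).
obs  <- factor(c("v", "o", "v", "o"), levels = c("o", "v"))
prob <- c(0.9, 0.2, 0.6, 0.4)
bs <- brier_score(prob, obs)
print(bs)
```

With the objects from the example above, the call would be brier_score(pred[, "v"], test_data$Species).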
I use the Brier score to tune my models in caret for binary classification. I ensure that the "positive" class is the second class, which is the default when you label your response "0:1". Then I created this master summary function, based on caret's own suite of summary functions, to return all the metrics I want to see:
BigSummary <- function (data, lev = NULL, model = NULL) {
  # Area under the precision-recall curve for the second (positive) level
  pr_auc <- try(MLmetrics::PRAUC(data[, lev[2]],
                                 ifelse(data$obs == lev[2], 1, 0)),
                silent = TRUE)
  # Brier score: mean squared difference between predicted probability and 0/1 outcome
  brscore <- try(mean((data[, lev[2]] - ifelse(data$obs == lev[2], 1, 0)) ^ 2),
                 silent = TRUE)
  rocObject <- try(pROC::roc(ifelse(data$obs == lev[2], 1, 0), data[, lev[2]],
                             direction = "<", quiet = TRUE), silent = TRUE)
  if (inherits(pr_auc, "try-error")) pr_auc <- NA
  if (inherits(brscore, "try-error")) brscore <- NA
  rocAUC <- if (inherits(rocObject, "try-error")) {
    NA
  } else {
    rocObject$auc
  }
  tmp <- unlist(e1071::classAgreement(table(data$obs,
                                            data$pred)))[c("diag", "kappa")]
  out <- c(Acc = tmp[[1]],
           Kappa = tmp[[2]],
           AUCROC = rocAUC,
           AUCPR = pr_auc,
           Brier = brscore,
           Precision = caret:::precision.default(data = data$pred,
                                                 reference = data$obs,
                                                 relevant = lev[2]),
           Recall = caret:::recall.default(data = data$pred,
                                           reference = data$obs,
                                           relevant = lev[2]),
           F = caret:::F_meas.default(data = data$pred, reference = data$obs,
                                      relevant = lev[2]))
  out
}
Now I can simply pass summaryFunction = BigSummary in trainControl and then metric = "Brier", maximize = FALSE in the train call.

Metric Accuracy not applicable for regression models

I am trying to investigate my model with machine learning in R. Training the model generally does not work well.
# # Logistic regression multiclass
for (i in 1:30) {
# split data into training/test
trainPhyIndex <- createDataPartition(subs_phy$Methane, p=10/17,list = FALSE)
trainingPhy <- subs_phy[trainPhyIndex,]
testingPhy <- subs_phy[-trainPhyIndex,]
# Pre-process predictor values
trainXphy <- trainingPhy[,names(trainingPhy)!= "Methane"]
preProcValuesPhy <- preProcess(x= trainXphy,method = c("center","scale"))
# using repeated cross-validation to avoid over-fitting
fitControlPhyGLMNET <- trainControl(method = "repeatedcv",
number = 10,
repeats = 4,
classProbs = TRUE)
fit_glmnet_phy <- train (Methane~.,
data = trainingPhy,
method = "glmnet",
tuneGrid = expand.grid(
.alpha =0.1,
.lambda = 0.00023),
metric = "Accuracy",
trControl = fitControlPhyGLMNET)
pred_glmnet_phy <- predict(fit_glmnet_phy, testingPhy)
# Get the confusion matrix to see accuracy value
u <- union(pred_glmnet_phy,testingPhy$Methane)
t <- table(factor(pred_glmnet_phy, u), factor(testingPhy$Methane, u))
accu_glmnet_phy <- confusionMatrix(t)
# accu_glmnet_phy<-confusionMatrix(pred_glmnet_phy,testingPhy$Methane)
glmnetstatsPhy[(nrow(glmnetstatsPhy)+1),] = accu_glmnet_phy$overall
}
The program always stops at the fit_glmnet_phy <- train(Methane ~ ., ...) command and shows
Metric Accuracy not applicable for regression models
I have no idea what causes this error.
I also attached the type of Methane:
[image: type of the Methane column]
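Try normalizing the input columns and mapping the output column to a factor. caret treats a numeric outcome as a regression problem, for which Accuracy is not defined, so Methane has to be a factor for classification. This helped me resolve a similar issue.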
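A minimal sketch of the conversion, assuming Methane is stored as a number. The small data frame below is a made-up stand-in for subs_phy, since the real data is not shown in the question.

```r
# Toy stand-in for subs_phy; the real data is not shown in the question.
subs_phy <- data.frame(Methane = c(0, 1, 2, 1, 0, 2),
                       x1 = c(5.1, 4.9, 6.3, 5.8, 5.0, 6.7))

# A numeric outcome is what makes train() choose regression.
print(is.factor(subs_phy$Methane))

# Convert to a factor; make.names() yields syntactically valid level names,
# which caret requires when classProbs = TRUE.
subs_phy$Methane <- factor(make.names(subs_phy$Methane))
print(levels(subs_phy$Methane))
```

The conversion has to happen before createDataPartition and train are called, so that both the training and the test set carry the factor.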

How do I save models as a vector with a loop?

So I have this assignment where I have to create 3 different models (R). I can do them individually without a problem. However, I want to take it a step further and create a function that trains all of them with a for loop. (I know I could create a function that trains the 3 models each time. I am not looking for other solutions to the problem; I want to do it this way (or in a similar fashion), because now I have 3 models, but imagine if I wanted to train 20!)
I tried creating a list to store all three models, but I keep getting some warnings.
TrainingFunction <- function(method,formula,data,tune) {
  fitcontrol <- trainControl(method = "repeatedcv", repeats = 4)
  if(method == "rf") {Model <- train(formula, data = data,method = method, trControl = fitcontrol , tuneLength = tune)}
  else if (method == "knn"){
    preObj <- preProcess(data[, c(13,14,15)], method=c("center", "scale"))
    data <- predict(preObj, data)
    Model <- train(formula, data = data,method = method, trControl = fitcontrol , tuneLength = tune)
  }
  else if (method == "svm"){Model <- svm(formula, data = data,cost=1000 , gamma = 0.001)}
  Model
}
So this is a training function I created, and it works, but now I want to train all three at once !
So I tried this:
methods <- c("rf","knn","svm")
Models <- vector(mode = "list" , length = length(methods))
for(i in 1:length(methods))
{Models[i] <- TrainingFunction(methods[i],Volume~.,List$trainingSet,5)}
These are the warnings:
Warning messages:
1: In Models[i] <- TrainingFunction(methods[i], Volume ~ ., List$trainingSet, :
number of items to replace is not a multiple of replacement length
2: In Models[i] <- TrainingFunction(methods[i], Volume ~ ., List$trainingSet, :
number of items to replace is not a multiple of replacement length
3: In svm.default(x, y, scale = scale, ..., na.action = na.action) :
Variable(s) ‘ProductType.GameConsole’ constant. Cannot scale data.
4: In Models[i] <- TrainingFunction(methods[i], Volume ~ ., List$trainingSet, :
number of items to replace is not a multiple of replacement length
When I do Models the output is this :
[1] "rf"
[1] "knn"
svm(formula = formula, data = data, cost = 1000, gamma = 0.001)
Consider switch to avoid the many if and else especially if extending to 20 models. Then use lapply to build a list without initialization or iterative assignment:
TrainingFunction <- function(method, formula, data, tune) {
  fitcontrol <- trainControl(method = "repeatedcv", repeats = 4)
  Model <- switch(method,
    "rf" = train(formula, data = data, method = method,
                 trControl = fitcontrol, tuneLength = tune),
    "knn" = {
      preObj <- preProcess(data[,c(13,14,15)],
                           method=c("center", "scale"))
      data <- predict(preObj, data)
      train(formula, data = data, method = method,
            trControl = fitcontrol, tuneLength = tune)
    },
    "svm" = svm(formula, data = data, cost = 1000, gamma = 0.001)
  )
  return(Model)
}
methods <- c("rf","knn","svm")
Model_list <-lapply(methods, function(m)
TrainingFunction(m, Volume~., List$trainingSet, 5))
I think the problem comes from this line:
{Models[i] <- TrainingFunction(methods[i],Volume~.,List$trainingSet,5)}
If you want to assign your model to the i-th place of the list, you should do it with a double bracket, like this:
{Models[[i]] <- TrainingFunction(methods[i],Volume~.,List$trainingSet,5)}
Another alternative would be to use lapply instead of an explicit loop, so you avoid that problem altogether:
train_from_method <- function(method) {TrainingFunction(method,Volume~.,List$trainingSet,5)}
Models <- lapply(methods, train_from_method)
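The difference between the single and the double bracket can be seen with any fitted model. The sketch below uses two small lm fits on mtcars as stand-ins for the caret and e1071 models; the names fits, by_single and by_double are made up for illustration.

```r
# Two small fitted models stand in for the caret/e1071 models.
fits <- list(lm(mpg ~ wt, data = mtcars), lm(mpg ~ hp, data = mtcars))

# Single bracket: tries to put a multi-component lm object into one slot,
# so R warns ("number of items to replace ...") and keeps only its first component.
by_single <- vector(mode = "list", length = 2)
suppressWarnings(by_single[1] <- fits[[1]])
print(class(by_single[[1]]))

# Double bracket: stores the whole model object in slot i.
by_double <- vector(mode = "list", length = 2)
for (i in seq_along(fits)) {
  by_double[[i]] <- fits[[i]]
}
print(class(by_double[[1]]))
```

This mirrors the output in the question, where each list slot held only the first component of the returned object rather than the model itself.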

Interpreting dummy variables created in caret train

I understand from reading various answers 1,2,3, that the train function from caret will create dummy variables to deal with factors that have multiple levels.
Here is an example using mtcars (model is useless other than to show point):
df <- mtcars
df$cyl <- factor(df$cyl)
df$mpg_bound <- ifelse(df$mpg > 20, "good", "bad")
tc <- trainControl(classProbs = TRUE, summaryFunction = twoClassSummary)
mod <- as.formula(mpg_bound ~ cyl)
m1 <- train(mod, data = df,
method = "rpart",
preProcess = c("center", "scale"),
trControl = tc)
m1$finalModel
n= 32
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 32 14 bad (0.5625000 0.4375000)
2) cyl8>=0.124004 14 0 bad (1.0000000 0.0000000) *
3) cyl8< 0.124004 18 4 good (0.2222222 0.7777778) *
I don't understand this part cyl8>=0.124004. I get that cyl8 is the dummy variable for the factor but what does it mean that cyl8>=0.124004?
I'd like to extend the existing answer, because I don't think the conclusion reached in the comments is true. As you say, when using the formula interface, caret's train function will transform factor variables into dummy variables that only take the values 0 or 1, e.g. cyl8 == 1 means 'the car has 8 cylinders'. Each dummy variable makes a statement about a characteristic that is either true or false for the observation.
Rpart will nevertheless output a numeric value as the split criterion, so that cyl8 >= 0.5, cyl8 >= 0.2 and cyl8 == 1 all mean the same thing: "This car has exactly 8 cylinders". By default, rpart will choose the split value cyl8 >= 0.5 for binary dummies to indicate that the dummy is true. The interpretation of cyl8 >= 0.5 is then "Does the car have 8 cylinders?" (and not "Does the car have more than 8 cylinders?").
df <- mtcars
df$cyl <- factor(df$cyl)
df$mpg_bound <- ifelse(df$mpg > 20, "good", "bad")
tc <- trainControl(classProbs = TRUE, summaryFunction = twoClassSummary)
m1 <- train(mod, data = df,
method = "rpart",
#preProcess = c("center", "scale"),
trControl = tc,
metric = "ROC")
#1) root 32 14 bad (0.5625000 0.4375000)
#2) cyl8>=0.5 14 0 bad (1.0000000 0.0000000) *
#3) cyl8< 0.5 18 4 good (0.2222222 0.7777778) *
The confusing value in your example arises because caret apparently applies the pre-processing to the expanded dataset, in which the dummies are numeric variables. The interpretation stays the same, but the (arbitrary) split value is transformed.
# Transform to dummies
mm <- model.matrix(mpg_bound ~ .-1, data = df)
# Do pre-processing
pp <- preProcess(mm, method = c("center", "scale"))
mm.pp <- as.matrix(predict(pp, mm))
# Dummy-Split in the middle
(max(mm.pp[,"cyl8"]) + min(mm.pp[,"cyl8"]) ) / 2
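The same midpoint can be reproduced in base R without caret, as a quick check of the arithmetic. This sketch assumes that centering and scaling mean subtracting the column mean and dividing by the sample standard deviation; cyl8_scaled and split_point are illustrative names.

```r
# 0/1 dummy for "car has 8 cylinders" in mtcars.
cyl8 <- as.numeric(mtcars$cyl == 8)

# Center and scale: subtract the mean, divide by the sample standard deviation.
cyl8_scaled <- (cyl8 - mean(cyl8)) / sd(cyl8)

# The scaled dummy still takes only two values; the split sits halfway between them.
split_point <- (min(cyl8_scaled) + max(cyl8_scaled)) / 2
print(round(split_point, 6))
```

The result agrees with the 0.124004 shown in the tree from the question, which supports reading cyl8>=0.124004 as "the car has 8 cylinders".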
I think this value represents the split point based on the dummy var scale (0-1). This code produces the same outcome:
df = mtcars
df$cyl <- factor(df$cyl)
df$mpg_bound <- ifelse(df$mpg > 20, "good", "bad")
tc <- trainControl(classProbs = TRUE, summaryFunction = twoClassSummary)
data = cbind(df,model.matrix(~cyl+mpg_bound,df)) # binds the dummy transf to the data
mod <- as.formula(mpg_bound ~ cyl8)
m1 <- train(mod, data = data,
method = "rpart",
preProcess = c("center", "scale"),
trControl = tc)
It might be easier running the rpart code directly (incl original scale), although this might not allow you to specify the features you used. e.g.

Cross-validating a CART model

In an assignment, we are asked to perform a cross-validation on a CART model. I have tried using the cvFit function from cvTools but got a strange error message. Here's a minimal example:
cvFit(rpart(formula=Species~., data=iris))
The error I'm seeing is:
Error in nobs(y) : argument "y" is missing, with no default
And the traceback():
5: nobs(y)
4: cvFit.call(call, data = data, x = x, y = y, cost = cost, K = K,
R = R, foldType = foldType, folds = folds, names = names,
predictArgs = predictArgs, costArgs = costArgs, envir = envir,
seed = seed)
3: cvFit(call, data = data, x = x, y = y, cost = cost, K = K, R = R,
foldType = foldType, folds = folds, names = names, predictArgs = predictArgs,
costArgs = costArgs, envir = envir, seed = seed)
2: cvFit.default(rpart(formula = Species ~ ., data = iris))
1: cvFit(rpart(formula = Species ~ ., data = iris))
It looks like y is mandatory for cvFit.default. But:
> cvFit(rpart(formula=Species~., data=iris), y=iris$Species)
Error in cvFit.call(call, data = data, x = x, y = y, cost = cost, K = K, :
'x' must have 0 observations
What am I doing wrong? Which package would allow me to do a cross-validation with a CART tree without having to code it myself? (I am sooo lazy...)
The caret package makes cross validation a snap:
> library(caret)
> data(iris)
> tc <- trainControl("cv",10)
> rpart.grid <- expand.grid(.cp=0.2)
> (train.rpart <- train(Species ~., data=iris, method="rpart",trControl=tc,tuneGrid=rpart.grid))
150 samples
4 predictors
3 classes: 'setosa', 'versicolor', 'virginica'
No pre-processing
Resampling: Cross-Validation (10 fold)
Summary of sample sizes: 135, 135, 135, 135, 135, 135, ...
Resampling results
Accuracy Kappa Accuracy SD Kappa SD
0.94 0.91 0.0798 0.12
Tuning parameter 'cp' was held constant at a value of 0.2
Finally, I was able to get it to work. As Joran noted, the cost parameter needs to be adapted. In my case I am using 0/1 loss, which means that I use a simple function that evaluates != instead of - between y and yHat. Also, predictArgs must include c(type='class'), otherwise the predict call used internally will return a vector of probabilities instead of the most probable classification. To sum up:
cvFit(rpart, formula=Species~., data=iris,
cost=function(y, yHat) (y != yHat) + 0, predictArgs=c(type='class'))
(This uses another variant of cvFit. Additional args to rpart can be passed by setting the args= parameter.)
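To see what that cost function computes, here is a small sketch on made-up labels (not cvFit output). The name zero_one is illustrative; the function body is the one passed as cost above.

```r
# 0/1 loss: 1 for a misclassified observation, 0 for a correct one.
zero_one <- function(y, yHat) (y != yHat) + 0

# Made-up observed and predicted classes; both factors share the same levels.
y    <- factor(c("setosa", "virginica", "versicolor", "setosa"))
yHat <- factor(c("setosa", "versicolor", "versicolor", "setosa"),
               levels = levels(y))

loss <- zero_one(y, yHat)
print(loss)        # per-observation loss
print(mean(loss))  # misclassification rate
```

Adding 0 turns the logical comparison into numbers, so averaging the loss gives the misclassification rate.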
