Custom Performance Function in caret Package using predicted Probability - r

This SO post is about using a custom performance measurement function in the caret package. You want to find the best prediction model, so you build several and compare them by calculating a single metric derived from comparing the observed and the predicted values. There are default functions to calculate this metric, but you can also define your own metric function. This custom function must take the observed and predicted values as input.
In classification problems (let's say with only two classes) the predicted value is 0 or 1. However, I also need to evaluate the probability calculated in the model. Is there any way to achieve this?
The reason is that there are applications where you need to know whether a prediction of 1 was made with 99% probability or with only 51% probability - not just whether the prediction is 1 or 0.
Can anyone help?
Edit
OK, so let me try to explain a little better. In the documentation of the caret package, under 5.5.5 (Alternate Performance Metrics), there is a description of how to use your own custom performance function, like so:
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10,
                           ## Estimate class probabilities
                           classProbs = TRUE,
                           ## Evaluate performance using
                           ## the following function
                           summaryFunction = twoClassSummary)
twoClassSummary is the custom performance function in this example. The function provided here needs to take as input a data frame or matrix with obs and pred columns. And here's the point - I want to use a function that takes not the observed and predicted class, but the observed class and the predicted probability.
One more thing:
Solutions from other packages are also welcome. The only thing I am not looking for is "This is how you write your own cross-validation function."

caret does support passing class probabilities to custom summary functions when you specify classProbs = TRUE in trainControl. In that case the data argument of your custom summary function will have two additional columns, named after the classes, containing the probability of each class. The names of these classes are available in the lev argument, which is a vector of length 2.
See the example:
library(caret)
library(mlbench)
data(Sonar)
Custom summary LogLoss:
LogLoss <- function(data, lev = NULL, model = NULL) {
  obs <- data[, "obs"]           # truth
  cls <- levels(obs)             # find class names
  probs <- data[, cls[2]]        # use second class name to extract probabilities for the 2nd class
  probs <- pmax(pmin(as.numeric(probs), 1 - 1e-15), 1e-15)  # bound the probability; this line and below are just the log loss calculation, irrelevant for your question
  logPreds <- log(probs)
  log1Preds <- log(1 - probs)
  real <- as.numeric(data$obs) - 1
  out <- -mean(real * logPreds + (1 - real) * log1Preds)
  names(out) <- "LogLoss"        # important since this name is specified in the call to train; the output can be a named vector of multiple values
  out
}
fitControl <- trainControl(method = "cv",
                           number = 5,
                           classProbs = TRUE,
                           summaryFunction = LogLoss)
fit <- train(Class ~ .,
             data = Sonar,
             method = "rpart",
             metric = "LogLoss",
             tuneLength = 5,
             trControl = fitControl,
             maximize = FALSE)  # important, depending on the calculated performance measure
fit
#output
CART
208 samples
60 predictor
2 classes: 'M', 'R'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 166, 166, 166, 167, 167
Resampling results across tuning parameters:
cp LogLoss
0.00000000 1.1220902
0.01030928 1.1220902
0.05154639 1.1017268
0.06701031 1.0694052
0.48453608 0.6405134
LogLoss was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.4845361.
Alternatively, use the lev argument, which contains the class levels, and add some error checking:
LogLoss <- function(data, lev = NULL, model = NULL) {
  if (length(lev) > 2) {
    stop(paste("Your outcome has", length(lev), "levels. The LogLoss() function isn't appropriate."))
  }
  obs <- data[, "obs"]           # truth
  probs <- data[, lev[2]]        # use the second class name
  probs <- pmax(pmin(as.numeric(probs), 1 - 1e-15), 1e-15)  # bound the probability
  logPreds <- log(probs)
  log1Preds <- log(1 - probs)
  real <- as.numeric(data$obs) - 1
  out <- -mean(real * logPreds + (1 - real) * log1Preds)
  names(out) <- "LogLoss"
  out
}
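Since a summary function may return several named metrics at once, you can also report LogLoss alongside caret's standard ROC/Sens/Spec by wrapping twoClassSummary together with the LogLoss defined above. A minimal sketch (combinedSummary and fitControlMulti are just illustrative names; twoClassSummary needs the pROC package installed):
combinedSummary <- function(data, lev = NULL, model = NULL) {
  c(twoClassSummary(data, lev, model),  # ROC, Sens, Spec
    LogLoss(data, lev, model))          # the custom metric from above
}
fitControlMulti <- trainControl(method = "cv",
                                number = 5,
                                classProbs = TRUE,
                                summaryFunction = combinedSummary)
# then choose which metric drives tuning, e.g.
# train(..., metric = "LogLoss", maximize = FALSE, trControl = fitControlMulti)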
Check out this section of the caret book for additional info: https://topepo.github.io/caret/model-training-and-tuning.html#metrics
It is a great book to read if you plan on using caret, and even if you are not, it's a good read.

Sadly, I just found the answer to my question. There is this one little sentence in the caret documentation...
"...If none of these parameters are satisfactory, the user can also compute custom performance metrics. The trainControl function has a argument called summaryFunction that specifies a function for computing performance. The function should have these arguments:
data is a reference for a data frame or matrix with columns called obs and pred for the observed and predicted outcome values (either numeric data for regression or character values for classification). Currently, class probabilities are not passed to the function. The values in data are the held-out predictions (and their associated reference values) for a single combination of tuning..."
For the sake of documentation: This is written on 2020-07-03 with caret package documentation from 2019-03-27.

I am not really sure I understand your question correctly:
To receive predicted probabilities from a model mdl, you can use predict(mdl, type = "prob").
I.e.,
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
df <- iris
df$isSetosa <- factor(df$Species == "setosa", levels = c(FALSE, TRUE), labels = c("not-setosa", "is-setosa"))
df$Species <- NULL
mdl <- train(isSetosa ~ ., data = df, method = "glm",
             family = "binomial",
             trControl = trainControl(method = "cv"))
preds <- predict(mdl, newdata = df, type = "prob")
head(preds)
#> not-setosa is-setosa
#> 1 2.220446e-16 1
#> 2 2.220446e-16 1
#> 3 2.220446e-16 1
#> 4 1.875722e-12 1
#> 5 2.220446e-16 1
#> 6 2.220446e-16 1
Created on 2020-07-02 by the reprex package (v0.3.0)
I.e., we see that case 4 is predicted to be a setosa with ~100% (tbh, this toy model is way too good to be true)...
Now we can create a custom function that collapses the values to a single metric.
true <- df$isSetosa
# very basic model metric that just sums the absolute differences between truth and probability
custom_model_metric <- function(preds, true) {
  d <- data.frame(true = true)
  tt <- predict(dummyVars(~ true, d), d)
  colnames(tt) <- c("not-setosa", "is-setosa")
  sum(abs(tt - preds))
}
custom_model_metric(preds, true)
#> [1] 3.294029e-09
custom_model_metric(preds, true)
#> [1] 3.294029e-09
Created on 2020-07-02 by the reprex package (v0.3.0)
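If the goal is for caret itself to evaluate models on a probability-based metric during resampling (rather than computing it by hand afterwards, as above), the same idea can be written as a summaryFunction. A minimal sketch, assuming classProbs = TRUE; note that caret then needs syntactically valid class names, so the levels are renamed first, and brierSummary/df2 are just illustrative names (the metric here is the Brier score):
df2 <- df
levels(df2$isSetosa) <- c("notSetosa", "isSetosa")  # valid R names, required when classProbs = TRUE

brierSummary <- function(data, lev = NULL, model = NULL) {
  probs <- data[, lev[2]]                   # predicted probability of the second class
  truth <- as.numeric(data$obs == lev[2])   # 1 if the held-out observation is that class
  out <- mean((truth - probs)^2)            # Brier score: lower is better
  names(out) <- "Brier"
  out
}

mdl2 <- train(isSetosa ~ ., data = df2, method = "glm", family = "binomial",
              metric = "Brier", maximize = FALSE,
              trControl = trainControl(method = "cv", classProbs = TRUE,
                                       summaryFunction = brierSummary))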

Related

Linear SVM and extracting the weights

I am practicing SVM in R using the iris dataset and I want to get the feature weights/coefficients from my model, but I think I may have misinterpreted something given that my output gives me 32 support vectors. I was under the assumption I would get four given I have four variables being analyzed. I know there is a way to do it when using the svm() function, but I am trying to use the train() function from caret to produce my SVM.
library(caret)
# Define fitControl
fitControl <- trainControl(## 5-fold CV
                           method = "cv",
                           number = 5,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)
# Define Tune
grid <- expand.grid(C = c(2^-5, 2^-3, 2^-1))
##########
df <- iris
head(df)
df <- df[df$Species != 'setosa', ]
df$Species <- as.character(df$Species)
df$Species <- as.factor(df$Species)
# set random seed and run the model
set.seed(321)
svmFit1 <- train(x = df[-5],
                 y = df$Species,
                 method = "svmLinear",
                 trControl = fitControl,
                 preProc = c("center", "scale"),
                 metric = "ROC",
                 tuneGrid = grid)
svmFit1
I thought it was simply svmFit1$finalModel@coef, but I get 32 vectors when I believe I should get 4. Why is that?
So coef is not the weight W of the support vectors. Here's the relevant section of the ksvm class in the docs:
coef The corresponding coefficients times the training labels.
To get what you are looking for, you'll need to do the following:
coefs <- svmFit1$finalModel@coef[[1]]
mat <- svmFit1$finalModel@xmatrix[[1]]
coefs %*% mat
See below for a reproducible example.
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
#> Warning: package 'ggplot2' was built under R version 3.5.2
# Define fitControl
fitControl <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)
# Define Tune
grid <- expand.grid(C = c(2^-5, 2^-3, 2^-1))
##########
df <- iris
df <- df[df$Species != 'setosa', ]
df$Species <- as.character(df$Species)
df$Species <- as.factor(df$Species)
# set random seed and run the model
set.seed(321)
svmFit1 <- train(x = df[-5],
                 y = df$Species,
                 method = "svmLinear",
                 trControl = fitControl,
                 preProc = c("center", "scale"),
                 metric = "ROC",
                 tuneGrid = grid)
coefs <- svmFit1$finalModel@coef[[1]]
mat <- svmFit1$finalModel@xmatrix[[1]]
coefs %*% mat
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,] -0.1338791 -0.2726322 0.9497457 1.027411
Created on 2019-06-11 by the reprex package (v0.2.1.9000)
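As a follow-up, once the weight vector is available you can rank the (centered and scaled) predictors by the magnitude of their weights; the sketch below also pulls the intercept, assuming the b slot of kernlab's ksvm object holds the (negative) offset of the decision function as described in ?ksvm-class:
w <- coefs %*% mat                    # weight vector in the scaled predictor space
b <- svmFit1$finalModel@b             # offset of the decision function (see ?ksvm-class)
sort(abs(w[1, ]), decreasing = TRUE)  # predictors ordered by influence on the margin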
Sources
https://www.researchgate.net/post/How_can_I_find_the_w_coefficients_of_SVM
http://r.789695.n4.nabble.com/SVM-coefficients-td903591.html
https://stackoverflow.com/a/1901200/6637133
As more folks start moving from caret to tidymodels, I thought I'd add a tidymodels version of the above solution (as of Aug 2020), because I don't see many discussions about this so far and it isn't that straightforward to do.
I'm outlining the main steps here, but please review the links at the end for details on why it was done this way.
1. Get Your Final Model
set.seed(2020)
# Assuming kernlab linear SVM
# Grid search parameters
tune_rs <- tune_grid(
  model_wf,
  train_folds,
  grid = param_grid,
  metrics = classification_measure,
  control = control_grid(save_pred = TRUE)
)
# Finalise workflow with the parameters for best accuracy
best_accuracy <- select_best(tune_rs, "accuracy")
svm_wf_final <- finalize_workflow(
  model_wf,
  best_accuracy
)
# Fit your final model on all available data at the end of the experiment
final_model <- fit(svm_wf_final, data)
# fit takes a model spec and executes the model fit routine (parsnip)
# model_spec, formula and data to fit upon
2. Extract the KSVM Object, Pull Required Info, Calculate Variable Importance
ksvm_obj <- pull_workflow_fit(final_model)$fit
# pull_workflow_fit returns the parsnip model fit object
# $fit returns the object produced by the fitting fn (which is what we need! and is dependent on the engine)
coefs <- ksvm_obj@coef[[1]]
# first bit of info we need: the coefficients from the linear fit
mat <- ksvm_obj@xmatrix[[1]]
# the xmatrix that we need to matrix-multiply against
var_impt <- coefs %*% mat
# variable importance
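As a small follow-up (plain base R, nothing tidymodels-specific is assumed): var_impt is a 1 x p matrix with one weight per predictor, so it can be reshaped into a sorted vector or data frame for reporting or plotting.
var_impt_sorted <- sort(abs(var_impt[1, ]), decreasing = TRUE)  # rank predictors by |weight|
head(var_impt_sorted)
var_impt_df <- data.frame(variable = names(var_impt_sorted),    # tidy form, e.g. for plotting
                          weight   = var_impt_sorted,
                          row.names = NULL)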
Ref:
Extracting the Weights of Support Vectors using Caret: Linear SVM and extracting the weights
Variable Importance (Last Section of this post): http://www.rebeccabarter.com/blog/2020-03-25_machine_learning/#finalize-the-workflow

R: obtain coefficients & CIs from bootstrapping mixed-effect model results

The working data looks like:
set.seed(1234)
df <- data.frame(y = rnorm(1:30),
                 fac1 = as.factor(sample(c("A", "B", "C", "D", "E"), 30, replace = T)),
                 fac2 = as.factor(sample(c("NY", "NC", "CA"), 30, replace = T)),
                 x = rnorm(1:30))
The lme model is fitted as:
library(lme4)
mixed <- lmer(y ~ x + (1|fac1) + (1|fac2), data = df)
I used bootMer to run the parametric bootstrapping and I can successfully obtain the coefficients (intercept) and SEs for fixed & random effects:
mixed_boot_sum <- function(data) {
  s <- sigma(data)
  c(beta = getME(data, "fixef"), theta = getME(data, "theta"), sigma = s)
}
mixed_boot <- bootMer(mixed, FUN = mixed_boot_sum, nsim = 100, type = "parametric", use.u = FALSE)
My first question is how to obtain the coefficients (slope) of each individual level of the two random effects from the bootstrapping results mixed_boot?
I have no problem extracting the coefficients (slope) from the mixed model by using the augment function from the broom package, see below:
library(broom)
mixed.coef <- augment(mixed, df)
However, it seems like broom can't deal with an object of class boot, so I can't use the above functions directly on mixed_boot.
I also tried to modify mixed_boot_sum by adding mmList (I thought this would be what I am looking for), but R complains:
Error in bootMer(mixed, FUN = mixed_boot_sum, nsim = 100, type = "parametric", :
bootMer currently only handles functions that return numeric vectors
Furthermore, is it possible to obtain CIs for both fixed & random effects by specifying FUN as well?
Now, I am very confused about the correct specification of FUN in order to achieve my needs. Any help regarding my question would be greatly appreciated!
My first question is how to obtain the coefficients (slope) of each individual level of the two random effects from the bootstrapping results mixed_boot?
I'm not sure what you mean by "coefficients (slope) of each individual level". broom::augment(mixed, df) gives the predictions (residuals, etc.) for every observation. If you want the predicted coefficients at each level I would try
mixed_boot_coefs <- function(fit){
  unlist(coef(fit))
}
which for the original model gives
mixed_boot_coefs(mixed)
## fac1.(Intercept)1 fac1.(Intercept)2 fac1.(Intercept)3 fac1.(Intercept)4
## -0.4973925 -0.1210432 -0.3260958 0.2645979
## fac1.(Intercept)5 fac1.x1 fac1.x2 fac1.x3
## -0.6288728 0.2187408 0.2187408 0.2187408
## fac1.x4 fac1.x5 fac2.(Intercept)1 fac2.(Intercept)2
## 0.2187408 0.2187408 -0.2617613 -0.2617613
## ...
If you want the resulting object to be more clearly named you can use:
flatten <- function(cc) setNames(unlist(cc),
                                 outer(rownames(cc), colnames(cc),
                                       function(x, y) paste0(y, x)))
mixed_boot_coefs <- function(fit){
  unlist(lapply(coef(fit), flatten))
}
When run through bootMer/confint/boot::boot.ci these functions will give confidence intervals for each of these values (note that all of the slopes facW.xZ are identical across groups because the model assumes random variation in the intercept only). In other words, whatever information you know how to extract from a fitted model (conditional modes/BLUPs [ranef], predicted intercepts and slopes for each level of the grouping variable [coef], parameter estimates [fixef, getME], random-effects variances [VarCorr], predictions under specific conditions [predict] ...) can be used in bootMer's FUN argument, as long as you can flatten its structure into a simple numeric vector.
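For the confidence-interval part of the question, here is a minimal sketch (my own construction, not from the original answer) of turning the bootstrap output into percentile CIs with boot::boot.ci; bootMer returns a boot-class object, and index picks which element of the vector returned by FUN you want:
set.seed(1234)
mixed_boot2 <- bootMer(mixed, FUN = mixed_boot_coefs, nsim = 100,
                       type = "parametric", use.u = FALSE)

# percentile CI for the first element returned by FUN
boot::boot.ci(mixed_boot2, index = 1, type = "perc")

# or loop over all returned quantities at once
ci_list <- lapply(seq_along(mixed_boot2$t0),
                  function(i) boot::boot.ci(mixed_boot2, index = i, type = "perc"))
names(ci_list) <- names(mixed_boot2$t0)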

R - Top x Important Variable Each Individual Sample Data in Classification

I'm building a churn model using the C5.0 algorithm in R. After finishing the model and successfully predicting the data, how do I know the top 3 important predictors for each customer that will churn? That way I would know why the model classifies - for example - customers A, B, D, F as positive and the others as negative. Is that possible?
Thanks.
Many models have built-in approaches for measuring the aggregate effect of the predictors on the model. The caret package contains a general class for calculating or returning these values including C5.0, JRip, PART, RRF, RandomForest, bagEarth, classbagg, cubist, dsa, earth, fda, gam, gbm, glm, glmnet, lm, multinom, mvr, nnet, pamrtrained, plsda, randomForest, regbagg, rfe, rpart, sbf, and train.
For example,
> library(caret)
> set.seed(1401)
> ctrl <- trainControl(method = 'repeatedcv', number = 6, repeats = 2, classProbs = TRUE)
> C5fit <- train(x = iris[, 1:4], y = iris$Species, method = "C5.0", metric = "ROC", trControl = ctrl)
> varImp(C5fit, scale = FALSE)
C5.0 variable importance

             Overall
Petal.Width      100
Sepal.Width        0
Petal.Length       0
Sepal.Length       0
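These importances can also be plotted directly; caret ships a plot method for varImp() output (the top argument just limits how many predictors are shown):
plot(varImp(C5fit, scale = FALSE), top = 4)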
You can plot the trees within the model. If you use a single C5.0 tree, this gives you an easy way to provide the exact reasoning of the tree.
library(C50)
set.seed(1401)
C5tree <- C5.0(x = iris[, 1:4], y = iris$Species, trials = 1) # A single C50 tree
C5imp(C5tree)
plot(C5tree, trial = 0)
If you use boosting (i.e. trials > 1 when you train the trees), then this approach is likely too complicated due to the number of trees.
C5boosted <- C5.0(x = iris[, 1:4], y = iris$Species, trials = 3) # Boost three trees
C5imp(C5boosted)
# Plot each of the trees
for(i in 0:2){ # trials starts counting at 0, see ?plot.C5.0
  plot(C5boosted, trial = i)
}
Instead, you can rely on the variable importance for a general report of important variables or use partial dependence plots that show the (non-linear) effect of one variable relative to all other variables. I suggest having a look at package pdp on CRAN.
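A hedged sketch of what a partial dependence plot could look like here, assuming pdp's partial() accepts the caret train object from above (it supports many model types; argument details such as which.class may differ between pdp versions):
library(pdp)
pd <- partial(C5fit, pred.var = "Petal.Width",
              which.class = 1,       # focus class, by column index of the probability matrix
              prob = TRUE,           # report on the probability scale
              train = iris[, 1:4])   # predictors used to build the grid
plotPartial(pd)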

Is there a discrepancy between createMultiFolds behavior and the resampling summary of a caret object?

I encountered a strange issue using custom folds for the cross-validation with caret.
A MWE (in which the use of createMultiFolds doesn't really make sense):
library(caret) #version 6.0-47
data(iris)
set.seed(1)
train.idx <- createDataPartition(iris$Species, p = .75,
                                 list = FALSE,
                                 times = 1)
train_1 <- iris[train.idx, ]
#I create specific folds
set.seed(1)
id_1 <- createMultiFolds(train_1$Species, k=10, times = 10)
# And use them in my cross validation
cvCtrl_2 <- trainControl(method = "repeatedcv",
                         index = id_1,
                         classProbs = TRUE)
trainX <- train_1[, names(train_1) != "Species"]
# I fit my model
set.seed(1111)
rfTune2 <- train(trainX, train_1$Species,
                 method = "rf",
                 trControl = cvCtrl_2)
rfTune2
And my model summary is the following:
##Random Forest
...
##Resampling: Cross-Validated (10 fold, repeated 1 times)
id_1 is a list of 100 index vectors, for a 10-fold cross-validation repeated 10 times, and I ask trainControl to do the resampling using this list.
So why does my model summary describe the resampling as
(10 fold, repeated 1 times)
whereas length(rfTune2$control$index) is equal to 100, so I assume my model was correctly trained using all the folds?
Should I post an issue on GitHub, or did I miss something obvious about how trainControl works?
The defaults of trainControl are
number = ifelse(grepl("cv", method), 10, 25),
repeats = ifelse(grepl("cv", method), 1, number)
If you supply index, the code has no idea what type of resampling is used. You will have to specify these arguments along with repeats to get the label correct, as sketched below.
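A minimal sketch of what that could look like with the folds from the question - the same index list, but with number and repeats set so that caret labels the resampling correctly (cvCtrl_3 is just an illustrative name):
cvCtrl_3 <- trainControl(method = "repeatedcv",
                         number = 10,     # folds per repeat, matches createMultiFolds(k = 10)
                         repeats = 10,    # matches times = 10
                         index = id_1,    # the custom folds are still the ones actually used
                         classProbs = TRUE)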

How to customize a model in caret to perform a PLS-[Classifier] two-step classification model?

This question is a continuation of the same thread here. Below is a minimal working example taken from this book:
Wehrens R. Chemometrics with R multivariate data analysis in the
natural sciences and life sciences. 1st edition. Heidelberg; New York:
Springer. 2011. (page 250).
The example was taken from this book and its package ChemometricsWithR. It highlighted some pitfalls when modeling using cross-validation techniques.
The Aim:
A cross-validated methodology using the same set of repeated CV folds to perform the known strategy of PLS followed typically by LDA, or cousins like logistic regression, SVM, C5.0, and CART, in the spirit of the caret package. So PLS would need to be run every time before calling the waiting classifier, in order to classify the PLS score space instead of the observations themselves. The nearest approach in the caret package is doing PCA as a pre-processing step before modeling with any classifier. Below is a PLS-LDA procedure with only a single cross-validation split to test the performance of the classifier; there was no 10-fold CV or any repetition. The code below was taken from the mentioned book, with some corrections, as it otherwise throws an error:
library(ChemometricsWithR)
data(prostate)
prostate.clmat <- classvec2classmat(prostate.type) # convert Y to a dummy var
odd <- seq(1, length(prostate.type), by = 2) # training
even <- seq(2, length(prostate.type), by = 2) # holdout test
prostate.pls <- plsr(prostate.clmat ~ prostate, ncomp = 16, validation = "CV", subset=odd)
Xtst <- scale(prostate[even,], center = colMeans(prostate[odd,]), scale = apply(prostate[odd,],2,sd))
tst.scores <- Xtst %*% prostate.pls$projection # scores for the waiting trained LDA to test
prostate.ldapls <- lda(scores(prostate.pls)[,1:16],prostate.type[odd]) # LDA for scores
table(predict(prostate.ldapls, new = tst.scores[,1:16])$class, prostate.type[even])
predictionTest <- predict(prostate.ldapls, new = tst.scores[, 1:16])$class
library(caret)
confusionMatrix(data = predictionTest, reference= prostate.type[even]) # from caret
Output:
Confusion Matrix and Statistics

          Reference
Prediction bph control pca
   bph       4       1   9
   control   1      35   7
   pca      34       4  68

Overall Statistics

               Accuracy : 0.6564
                 95% CI : (0.5781, 0.7289)
    No Information Rate : 0.5153
    P-Value [Acc > NIR] : 0.0001874
                  Kappa : 0.4072
 Mcnemar's Test P-Value : 0.0015385

Statistics by Class:

                     Class: bph Class: control Class: pca
Sensitivity             0.10256         0.8750     0.8095
Specificity             0.91935         0.9350     0.5190
Pos Pred Value          0.28571         0.8140     0.6415
Neg Pred Value          0.76510         0.9583     0.7193
Prevalence              0.23926         0.2454     0.5153
Detection Rate          0.02454         0.2147     0.4172
Detection Prevalence    0.08589         0.2638     0.6503
Balanced Accuracy       0.51096         0.9050     0.6643
However, the confusion matrix didn't match the one in the book; in any case the code in the book did break, but this version worked for me!
Notes:
Although this was only a single CV, the intention is to agree on this methodology first: the sd and mean of the training set are applied to the test set, which is then transformed into PLS scores based on a specific number of components (ncomp). I want this to occur in every round of the CV in caret. If the methodology as coded is correct here, then it can serve, maybe, as a good starting point for a minimal working example when modifying the code of the caret package.
Side Notes:
Scaling and centering can get very messy. I think some of the PLS functions in R do scaling internally, with or without centering; I am not sure, so building a custom model in caret should be handled with care to avoid either missing or repeated scaling and centering (I am on my guard with these things).
Perils of multiple centering/scaling
The code below is just to show how multiple centering/scaling can change the data; only centering is shown here, but the same problem applies to scaling too.
set.seed(1)
x <- rnorm(200, 2, 1)
xCentered1 <- scale(x, center=TRUE, scale=FALSE)
xCentered2 <- scale(xCentered1, center=TRUE, scale=FALSE)
xCentered3 <- scale(xCentered2, center=TRUE, scale=FALSE)
sapply(list(xNotCentered = x, xCentered1 = xCentered1, xCentered2 = xCentered2, xCentered3 = xCentered3), mean)
Output:
xNotCentered xCentered1 xCentered2 xCentered3
2.035540e+00 1.897798e-16 -5.603699e-18 -5.332377e-18
Please drop a comment if I am missing something somewhere in this course. Thanks.
If you want to fit these types of models with caret, you would need to use the latest version on CRAN. The last update was created so that people can use non-standard models as they see fit.
My approach below is to jointly fit the PLS and other model (I used random forest in the example below) and tune them at the same time. So for each fold, a 2D grid of ncomp and mtry is used.
The "trick" is to attached the PLS loadings to the random forest object so that they can be used during prediction time. Here is the code that defines the model (classification only):
modelInfo <- list(label = "PLS-RF",
library = c("pls", "randomForest"),
type = "Classification",
parameters = data.frame(parameter = c('ncomp', 'mtry'),
class = c("numeric", 'numeric'),
label = c('#Components',
'#Randomly Selected Predictors')),
grid = function(x, y, len = NULL) {
grid <- expand.grid(ncomp = seq(1, min(ncol(x) - 1, len), by = 1),
mtry = 1:len)
grid <- subset(grid, mtry <= ncomp)
},
loop = NULL,
fit = function(x, y, wts, param, lev, last, classProbs, ...) {
## First fit the pls model, generate the training set scores,
## then attach what is needed to the random forest object to
## be used later
pre <- plsda(x, y, ncomp = param$ncomp)
scores <- pls:::predict.mvr(pre, x, type = "scores")
mod <- randomForest(scores, y, mtry = param$mtry, ...)
mod$projection <- pre$projection
mod
},
predict = function(modelFit, newdata, submodels = NULL) {
scores <- as.matrix(newdata) %*% modelFit$projection
predict(modelFit, scores)
},
prob = NULL,
varImp = NULL,
predictors = function(x, ...) rownames(x$projection),
levels = function(x) x$obsLevels,
sort = function(x) x[order(x[,1]),])
and here is the call to train:
library(ChemometricsWithR)
data(prostate)
set.seed(1)
inTrain <- createDataPartition(prostate.type, p = .90)
trainX <- prostate[inTrain[[1]], ]
trainY <- prostate.type[inTrain[[1]]]
testX <- prostate[-inTrain[[1]], ]
testY <- prostate.type[-inTrain[[1]]]
## These will take a while for these data
set.seed(2)
plsrf <- train(trainX, trainY, method = modelInfo,
               preProc = c("center", "scale"),
               tuneLength = 10,
               trControl = trainControl(method = "repeatedcv",
                                        repeats = 5))
## How does random forest do on its own?
set.seed(2)
rfOnly <- train(trainX, trainY, method = "rf",
                tuneLength = 10,
                trControl = trainControl(method = "repeatedcv",
                                         repeats = 5))
Just for kicks, I got:
> getTrainPerf(plsrf)
TrainAccuracy TrainKappa method
1 0.7940423 0.65879 custom
> getTrainPerf(rfOnly)
TrainAccuracy TrainKappa method
1 0.7794082 0.6205322 rf
and
> postResample(predict(plsrf, testX), testY)
Accuracy Kappa
0.7741935 0.6226087
> postResample(predict(rfOnly, testX), testY)
Accuracy Kappa
0.9032258 0.8353982
Max
Based on Max's valuable comments, I felt the need to bring in the iris data set as a referee: it is famous for classification and, more importantly, its Species outcome has more than two classes, which makes it a good data set to test the PLS-LDA custom model in caret:
data(iris)
names(iris)
head(iris)
dim(iris) # 150x5
set.seed(1)
inTrain <- createDataPartition(y = iris$Species,
                               ## the outcome data are needed
                               p = .75,
                               ## The percentage of data in the
                               ## training set
                               list = FALSE)
## The format of the results
## The output is a set of integers for the rows of iris
## that belong in the training set.
training <- iris[ inTrain, ] # 114
testing  <- iris[-inTrain, ] # 36
ctrl <- trainControl(method = "repeatedcv",
                     repeats = 5,
                     classProbs = TRUE)
set.seed(2)
plsFitIris <- train(Species ~ .,
                    data = training,
                    method = "pls",
                    tuneLength = 4,
                    trControl = ctrl,
                    preProc = c("center", "scale"))
plsFitIris
plot(plsFitIris)
set.seed(2)
plsldaFitIris <- train(Species ~ .,
                       data = training,
                       method = modelInfo,
                       tuneLength = 4,
                       trControl = ctrl,
                       preProc = c("center", "scale"))
plsldaFitIris
plot(plsldaFitIris)
Now comparing the two models:
getTrainPerf(plsFitIris)
TrainAccuracy TrainKappa method
1 0.8574242 0.7852462 pls
getTrainPerf(plsldaFitIris)
TrainAccuracy TrainKappa method
1 0.975303 0.9628179 custom
postResample(predict(plsFitIris, testing), testing$Species)
Accuracy Kappa
0.750 0.625
postResample(predict(plsldaFitIris, testing), testing$Species)
Accuracy Kappa
0.9444444 0.9166667
So, finally there was the EXPECTED difference, and an improvement in the metrics. This would support Max's notion that, in two-class problems, the Bayesian probabilistic approach of the plsda function and PLS-LDA lead to the same results.
You need to wrap the CV around both PLS and LDA.
Yes, both plsr and lda center the data in their own way.
I had a closer look at caret::preProcess(): as it is defined now, you will not be able to use PLS as a preprocessing method because it is supervised, while caret::preProcess() uses unsupervised methods only (there is no way to hand over the dependent variable). This would probably make patching rather difficult.
So inside the caret framework, you'll need to go for a custom model.
If the scenario is to customize a model of the PLS-LDA type, following the code kindly provided by Max (maintainer of caret), something is not correct in this code, but I couldn't figure out what: I used the Sonar data set, the same as in the caret vignette, and tried to reproduce the result once using method = "pls" and once using the custom PLS-LDA model below, and the results were exactly identical, even to the last digit, which is nonsensical. For benchmarking, one needs a known data set (I think a cross-validated PLS-LDA on the iris data set would fit here, as it is famous for this type of analysis and there should be a cross-validated treatment of it somewhere). Everything should be kept the same (the set.seed(xxx) and the number of K-CV repetitions) except the code in question, so as to fairly compare and judge the code below:
modelInfo <- list(label = "PLS-LDA",
library = c("pls", "MASS"),
type = "Classification",
parameters = data.frame(parameter = c("ncomp"),
class = c("numeric"),
label = c("#Components")),
grid = function(x, y, len = NULL) {
grid <- expand.grid(ncomp = seq(1, min(ncol(x) - 1, len), by = 1))
},
loop = NULL,
fit = function(x, y, wts, param, lev, last, classProbs, ...) {
## First fit the pls model, generate the training set scores,
## then attach what is needed to the lda object to
## be used later
pre <- plsda(x, y, ncomp = param$ncomp)
scores <- pls:::predict.mvr(pre, x, type = "scores")
mod <- lda(scores, y, ...)
mod$projection <- pre$projection
mod
},
predict = function(modelFit, newdata, submodels = NULL) {
scores <- as.matrix(newdata) %*% modelFit$projection
predict(modelFit, scores)$class
},
prob = function(modelFit, newdata, submodels = NULL) {
scores <- as.matrix(newdata) %*% modelFit$projection
predict(modelFit, scores)$posterior
},
varImp = NULL,
predictors = function(x, ...) rownames(x$projection),
levels = function(x) x$obsLevels,
sort = function(x) x[order(x[,1]),])
Based on Zach's request, the code below is for method = "pls" in caret, exactly the same concrete example as in the caret vignette on CRAN:
library(mlbench) # data set from here
data(Sonar)
dim(Sonar) # 208x60
set.seed(107)
inTrain <- createDataPartition(y = Sonar$Class,
                               ## the outcome data are needed
                               p = .75,
                               ## The percentage of data in the
                               ## training set
                               list = FALSE)
## The format of the results
## The output is a set of integers for the rows of Sonar
## that belong in the training set.
training <- Sonar[ inTrain, ] # 157
testing  <- Sonar[-inTrain, ] # 51
ctrl <- trainControl(method = "repeatedcv",
                     repeats = 3,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
set.seed(108)
plsFitSon <- train(Class ~ .,
                   data = training,
                   method = "pls",
                   tuneLength = 15,
                   trControl = ctrl,
                   metric = "ROC",
                   preProc = c("center", "scale"))
plsFitSon
plot(plsFitSon) # might be slightly different from the vignette due to randomness
Now, the code below is a pilot run classifying the Sonar data with the custom PLS-LDA model in question; it is expected to come up with numbers that differ from those obtained using PLS alone:
set.seed(108)
plsldaFitSon <- train(Class ~ .,
                      data = training,
                      method = modelInfo,
                      tuneLength = 15,
                      trControl = ctrl,
                      metric = "ROC",
                      preProc = c("center", "scale"))
Now comparing the results between the two models:
getTrainPerf(plsFitSon)
TrainROC TrainSens TrainSpec method
1 0.8741154 0.7638889 0.8452381 pls
getTrainPerf(plsldaFitSon)
TrainROC TrainSens TrainSpec method
1 0.8741154 0.7638889 0.8452381 custom
postResample(predict(plsFitSon, testing), testing$Class)
Accuracy Kappa
0.745098 0.491954
postResample(predict(plsldaFitSon, testing), testing$Class)
Accuracy Kappa
0.745098 0.491954
So, the results are exactly the same, which cannot be right. It is as if the lda model were never added?
