How train data manually per fold with k-fold CV in R? - r

I have the following code segment which works for me and I get the model result:
library(base)
library(caret)
library(tidyverse)
dataset <- read_csv("https://gist.githubusercontent.com/dmpe/bfe07a29c7fc1e3a70d0522956d8e4a9/raw/7ea71f7432302bb78e58348fede926142ade6992/pima-indians-diabetes.csv", col_names=FALSE)
X = dataset[, 1:8]
Y = as.factor(ifelse(dataset$X9 == 1, 'diabetes', 'nondiabetes'))
set.seed(88)
nfolds <- 3
cvIndex <- createFolds(Y, nfolds, returnTrain = T)
fit.control <- trainControl(method="cv",
index=cvIndex,
number=nfolds,
classProbs=TRUE,
savePredictions=TRUE,
verboseIter=TRUE,
summaryFunction=twoClassSummary,
allowParallel=FALSE)
model <- caret::train(X, Y,
method = "svmLinear",
trControl = fit.control,
preProcess=c("center","scale"),
tuneLength=10)
Using this I can access the final model as model$finalModel, however, in this case instead of having one final model, I actually want to have 3 models as I have 3-fold. So, I want to get the trained model after first fold, then after second fold and lastly after the third fold, which corresponds to the actual final model. Any ideas how to achieve this in R? Please note that usage of caret is not strict, if you can do it with mlr that's also welcomed.

The train function in caret streamlines model evaluation and training
https://cran.r-project.org/web/packages/caret/vignettes/caret.html
"evaluate, using resampling, the effect of model tuning parameters on performance
choose the ``optimal’’ model across these parameters
estimate model performance from a training set"
So, the model that it gives is the optimal final model.
There is no reason to use the models trained on each fold. I'm not aware of how to do this in R

Here is an approach using mlr package:
library(mlr)
library(base)
library(tidyverse)
dataset <- read_csv("https://gist.githubusercontent.com/dmpe/bfe07a29c7fc1e3a70d0522956d8e4a9/raw/7ea71f7432302bb78e58348fede926142ade6992/pima-indians-diabetes.csv", col_names=FALSE)
X = dataset[, 1:8]
Y = as.factor(ifelse(dataset$X9 == 1, 'diabetes', 'nondiabetes'))
create a mlr task:
mlr_task <- makeClassifTask(data = data.frame(X, Y),
target = "Y",
positive = "diabetes")
define the resampling:
set.seed(7)
cv3 <- makeResampleInstance(makeResampleDesc("CV", iters = 3),
task = mlr_task)
define the type of hyper parameter search
ctrl <- makeTuneControlRandom(maxit = 10L)
define a learner
lrn <- makeLearner("classif.ksvm", predict.type = "prob")
optionally check learner parameters to see which ones to tune
mlr::getLearnerParamSet(lrn)
define search space (vanilladot is linear kernel in kernlab package which is called internally for "classif.ksvm"). More info on integrated learners in mlr: https://mlr.mlr-org.com/articles/tutorial/integrated_learners.html
ps <- makeParamSet(makeDiscreteParam("kernel", "vanilladot"),
makeNumericParam("C", lower = 2e-6, upper = 2e-6))
tune hyper parameters. I just set some random measures, the first one listed is used to evaluate the performance, the others are there just for show.
res <- tuneParams(lrn,
mlr_task,
cv3,
measures = list(auc, bac, f1),
par.set = ps,
control = ctrl)
set optimal hyper parameters to a learner
lrn <- setHyperPars(lrn, par.vals = res$x)
resample with models = TRUE
rsmpls <- resample(lrn,
mlr_task,
cv3,
measures = list(auc, bac, f1),
models = TRUE)
models are in
rsmpls$models[[1]]$learner.model
rsmpls$models[[2]]$learner.model
rsmpls$models[[3]]$learner.model
What this does is it first tunes the hyper parameters and then performs another set of cross validation with tuned parameters on the same folds.
an alternative and in my opinion a better approach is to pick hyper parameters in the inner folds of nested cross validation and evaluate on the outer folds keeping outer fold models to fiddle with.
lrn <- makeLearner("classif.ksvm", predict.type = "prob")
define an inner resampling strategy
cv3_inner <- makeResampleDesc("CV", iters = 3)
create a tune wrapper - define what happens in inner cross validation loop
lrn <- makeTuneWrapper(lrn,
resampling = cv3_inner,
measures = list(auc, bac, f1),
par.set = ps,
control = ctrl)
perform outer cross validation
rsmpls <- resample(lrn,
mlr_task,
cv3,
measures = list(auc, bac, f1),
models = TRUE)
This performs three fold CV in the outer loop, in each training instance another, three fold CV is performed to tune the hyper parameters and a model is fit on the whole training instance with optimal hyper parameters, these models are evaluated on the outer loop test instances. This is done to reduce evaluation bias. See also: https://mlr.mlr-org.com/articles/tutorial/nested_resampling.html

Not a caret nor machine learning expert, but why not just train the model on a random sample and store the result in a list?
data <- read_csv("https://gist.githubusercontent.com/dmpe/bfe07a29c7fc1e3a70d0522956d8e4a9/raw/7ea71f7432302bb78e58348fede926142ade6992/pima-indians-diabetes.csv", col_names=FALSE)
train_multiple_models <- function(data, kfolds) {
resultlist <- list()
for(i in 1:kfolds) {
sample <- sample.int(n = nrow(data), size = floor(.75*nrow(data)), replace = F)
train <- data[sample, ]
X = train[, 1:8]
Y = as.factor(ifelse(train$X9 == 1, 'diabetes', 'nondiabetes'))
model <- caret::train(X, Y,
method = "svmLinear",
preProcess=c("center","scale"),
tuneLength=10)
resultlist[[i]] <- model
}
return(resultlist)
}
result <- train_multiple_models(data, kfolds = 3)
> result[[1]]$finalModel
Support Vector Machine object of class "ksvm"
SV type: C-svc (classification)
parameter : cost C = 1
Linear (vanilla) kernel function.
Number of Support Vectors : 307
Objective Function Value : -302.065
Training error : 0.230903

Related

Rpart vs Caret Cross Validation for Complexity Parameter

I have a few questions about the difference between Rpart and Caret (using Rpart):
When using Rpart to fit a decision tree, calling dt$cptable displays a table of complexity parameters and their associated cross-validation errors. When pruning a tree, we would want to select the CP with the lowest cross-validation error. How are these cross-validation errors calculated? In reading Rpart's vignette, it seems like RPart does the following:
a) Fits the full tree based on the user-specified parameters. As the tree is being built, the algorithm calculates the complexity parameter at each split
b) The algorithm then splits the data into k folds, and for each CP, basically just performs cross-validation using these folds. Then it calculates the average error across all of the folds to get the 'xerror' output we see in CP$table
If we were to use caret with cross validation to find the optimal tree, how is it running? Basically, is the algorithm splitting the dataset into k folds, then calling the Rpart function, and for each call of the Rpart function doing the same thing described in point 1 above? In other words, is it using cross-validation within cross-validation, whereas Rpart is just using cross-validation once?
Below is some code, even though I'm asking more about how the algorithm functions, maybe it will be useful:
library(rpart)
library(rpart.plot)
library(caret)
set.seed(100)
data.class <- data.termlife[, 2:ncol(data.termlife)]
data.class$TERM_FLAG <- as.factor(data.class$TERM_FLAG)
train.indices <- createDataPartition(data.class$TERM_FLAG, p = .8, list = FALSE)
data.class.t <- data.class[train.indices, ]
data.class.v <- data.class[-train.indices, ]
#Using Rpart
rpart.ctrl <- rpart.control(minsplit = 5, minbucket = 5, cp = .01)
f <- as.formula(paste0("TERM_FLAG ~ ", paste0(names(data.class.t)[2:9], collapse = "+")))
dt <- rpart(formula = f, data = data.class.t, control = rpart.ctrl, parms = list(split = "gini"))
cp.best.rpart <- dt$cptable[which.min(dt$cptable[, "xerror"]), "CP"]
#Using Caret
train.ctrl <- trainControl(method = "cv", number = 10)
tGrid <- expand.grid(cp = seq(0, .02, .0001))
dt.caret <- train(form = f, data = data.class.t, method = "rpart", metric = "Accuracy", trControl = train.ctrl, tuneGrid = tGrid)
cp.best.caret <- dt.caret$bestTune$cp
print(paste("Rpart's best CP: ", cp.best.rpart))
print(paste("Caret's best CP: ", cp.best.caret))
[1] "Rpart's best CP: 0.0194444444444444"
[1] "Caret's best CP: 0.02"
The results are very similar, so when would you ever want to use Caret with Rpart? Thank you!!!

Retrieve models from resample function in mlr

I would like to retrieve the binary classification models (i.e. selected features and coefficients) generated by resample function in MLR. Below, you can find my code sample. It seems to be located within the attribute models of the resulting object (here r$models), but I don't find it.
# 1. Find a synthetic dataset for supervised learning (two classes)
###################################################################
library(mlbench)
data(BreastCancer)
# generate 1000 rows, 21 quantitative candidate predictors and 1 target variable
p<-mlbench.waveform(1000)
# convert list into dataframe
dataset<-as.data.frame(p)
# drop thrid class to get 2 classes
dataset2 = subset(dataset, classes != 3)
dataset2 <- droplevels(dataset2 )
# 2. Perform cross validation with embedded feature selection using logistic regression
##########################################################################################
library(BBmisc)
library(mlr)
set.seed(123, "L'Ecuyer")
set.seed(21)
# Choice of data
mCT <- makeClassifTask(data =dataset2, target = "classes")
# Choice of algorithm
mL <- makeLearner("classif.logreg", predict.type = "prob")
# Choice of cross-validations for folds
outer = makeResampleDesc("CV", iters = 10,stratify = TRUE)
# Choice of feature selection method
ctrl = makeFeatSelControlSequential(method = "sbs", maxit = NA,beta = 0.001)
# Choice of sampling between training and test within the fold
inner = makeResampleDesc("Holdout",stratify = TRUE)
lrn = makeFeatSelWrapper(mL, resampling = inner, control = ctrl)
r = resample(lrn, mCT, outer, extract = getFeatSelResult,measures = list(mlr::auc,mlr::acc,mlr::brier),models=TRUE)
You have to dig a bit deeper in the list. For the first model, for example:
r$models[[1]]$learner.model$opt.result
r$models[[1]]$learner.model$next.model$learner.model

R: Feature Selection with Cross Validation using Caret on Logistic Regression

I am currently learning how to implement logistical Regression in R
I have taken a data set and split it into a training and test set and wish to implement forward selection, backward selection and best subset selection using cross validation to select the best features.
I am using caret to implement cross-validation on the training data set and then testing the predictions on the test data.
I have seen the rfe control in caret and had also had a look at the documentation on the caret website as well as following the links on the question How to use wrapper feature selection with algorithms in R?. It isn't apparent to me how to change the type of feature selection as it seems to default to backward selection. Can anyone help me with my workflow. Below is a reproducible example
library("caret")
# Create an Example Dataset from German Credit Card Dataset
mydf <- GermanCredit
# Create Train and Test Sets 80/20 split
trainIndex <- createDataPartition(mydf$Class, p = .8,
list = FALSE,
times = 1)
train <- mydf[ trainIndex,]
test <- mydf[-trainIndex,]
ctrl <- trainControl(method = "repeatedcv",
number = 10,
savePredictions = TRUE)
mod_fit <- train(Class~., data=train,
method="glm",
family="binomial",
trControl = ctrl,
tuneLength = 5)
# Check out Variable Importance
varImp(mod_fit)
summary(mod_fit)
# Test the new model on new and unseen Data for reproducibility
pred = predict(mod_fit, newdata=test)
accuracy <- table(pred, test$Class)
sum(diag(accuracy))/sum(accuracy)
You can simply call it in mod_fit. When it comes to backward stepwise the code below is sufficient
trControl <- trainControl(method="cv",
number = 5,
savePredictions = T,
classProbs = T,
summaryFunction = twoClassSummary)
caret_model <- train(Class~.,
train,
method="glmStepAIC", # This method fits best model stepwise.
family="binomial",
direction="backward", # Direction
trControl=trControl)
Note that in trControl
method= "cv", # No need to call repeated here, the number defined afterward defines the k-fold.
classProbs = T,
summaryFunction = twoClassSummary # Gives back ROC, sensitivity and specifity of the chosen model.

Hyper-parameter tuning using pure ranger package in R

Love the speed of the ranger package for random forest model creation, but can't see how to tune mtry or number of trees. I realize I can do this via caret's train() syntax, but I prefer the speed increase that comes from using pure ranger.
Here's my example of basic model creation using ranger (which works great):
library(ranger)
data(iris)
fit.rf = ranger(
Species ~ .,
training_data = iris,
num.trees = 200
)
print(fit.rf)
Looking at the official documentation for tuning options, it seems like the csrf() function may provide the ability to tune hyper-parameters, but I can't get the syntax right:
library(ranger)
data(iris)
fit.rf.tune = csrf(
Species ~ .,
training_data = iris,
params1 = list(num.trees = 25, mtry=4),
params2 = list(num.trees = 50, mtry=4)
)
print(fit.rf.tune)
Results in:
Error in ranger(Species ~ ., training_data = iris, num.trees = 200) :
unused argument (training_data = iris)
And I'd prefer to tune with the regular (read: non-csrf) rf algorithm ranger provides. Any idea as to a hyper-parameter tuning solution for either path in ranger? Thank you!
To answer my (unclear) question, apparently ranger has no built-in CV/GridSearch functionality. However, here's how you do hyper-parameter tuning with ranger (via a grid search) outside of caret. Thanks goes to Marvin Wright (the maintainer of ranger) for the code. Turns out caret CV with ranger was slow for me because I was using the formula interface (which should be avoided).
ptm <- proc.time()
library(ranger)
library(mlr)
# Define task and learner
task <- makeClassifTask(id = "iris",
data = iris,
target = "Species")
learner <- makeLearner("classif.ranger")
# Choose resampling strategy and define grid
rdesc <- makeResampleDesc("CV", iters = 5)
ps <- makeParamSet(makeIntegerParam("mtry", 3, 4),
makeDiscreteParam("num.trees", 200))
# Tune
res = tuneParams(learner, task, rdesc, par.set = ps,
control = makeTuneControlGrid())
# Train on entire dataset (using best hyperparameters)
lrn = setHyperPars(makeLearner("classif.ranger"), par.vals = res$x)
m = train(lrn, iris.task)
print(m)
print(proc.time() - ptm) # ~6 seconds
For the curious, the caret equivalent is
ptm <- proc.time()
library(caret)
data(iris)
grid <- expand.grid(mtry = c(3,4))
fitControl <- trainControl(method = "CV",
number = 5,
verboseIter = TRUE)
fit = train(
x = iris[ , names(iris) != 'Species'],
y = iris[ , names(iris) == 'Species'],
method = 'ranger',
num.trees = 200,
tuneGrid = grid,
trControl = fitControl
)
print(fit)
print(proc.time() - ptm) # ~2.4 seconds
Overall, caret is the fastest way to do a grid search with ranger if one uses the non-formula interface.
I think there are at least two errors:
First, the function ranger does not have a parameter called training_data. Your error message Error in ranger(Species ~ ., training_data = iris, num.trees = 200) : unused argument (training_data = iris) refers to that. You can see that when you look at ?ranger or args(ranger).
Second, the function csrf, on the other hand, has training_data as input, but also requires test_data. Most importantly, these two arguments do not have any defaults, implying that you must provide them. The following works without problems:
fit.rf = ranger(
Species ~ ., data = iris,
num.trees = 200
)
fit.rf.tune = csrf(
Species ~ .,
training_data = iris,
test_data = iris,
params1 = list(num.trees = 25, mtry=4),
params2 = list(num.trees = 50, mtry=4)
)
Here, I have just provided iris as both training and test dataset. You would obviously not want to do that in your real application. Moreover, note that ranger also take num.trees and mtry as input, so you could try tuning it there.
Note that mlr per default disables the internal parallelization of ranger. Set hyperparameter num.threads to the number of cores available to speed mlr up:
learner <- makeLearner("classif.ranger", num.threads = 4)
Alternatively, start a parallel backend via
parallelStartMulticore(4) # linux/osx
parallelStartSocket(4) # windows
before calling tuneParams to parallelize the tuning.
Another way to tune the model is to create a manual grid, maybe there are better ways to train the model but this may be a different option.
hyper_grid <- expand.grid(
mtry = 1:4,
node_size = 1:3,
num.trees = seq(50,500,50),
OOB_RMSE = 0
)
system.time(
for(i in 1:nrow(hyper_grid)) {
# train model
rf <- ranger(
formula = Species ~ .,
data = iris,
num.trees = hyper_grid$num.trees[i],
mtry = hyper_grid$mtry[i],
min.node.size = hyper_grid$node_size[i],
importance = 'impurity')
# add OOB error to grid
hyper_grid$OOB_RMSE[i] <- sqrt(rf$prediction.error)
})
user system elapsed
3.17 0.19 1.36
nrow(hyper_grid) # 120 models
position = which.min(hyper_grid$OOB_RMSE)
head(hyper_grid[order(hyper_grid$OOB_RMSE),],5)
mtry node_size num.trees OOB_RMSE
6 2 2 50 0.1825741858
23 3 3 100 0.1825741858
3 3 1 50 0.2000000000
11 3 3 50 0.2000000000
14 2 1 100 0.2000000000
# fit best model
rf.model <- ranger(Species ~ .,data = iris, num.trees = hyper_grid$num.trees[position], importance = 'impurity', probability = FALSE, min.node.size = hyper_grid$node_size[position], mtry = hyper_grid$mtry[position])
rf.model
Ranger result
Call:
ranger(Species ~ ., data = iris, num.trees = hyper_grid$num.trees[position], importance = "impurity", probability = FALSE, min.node.size = hyper_grid$node_size[position], mtry = hyper_grid$mtry[position])
Type: Classification
Number of trees: 50
Sample size: 150
Number of independent variables: 4
Mtry: 2
Target node size: 2
Variable importance mode: impurity
Splitrule: gini
OOB prediction error: 5.33 %
I hope it serves you.
There is also the tuneRanger R package, which is specifically designed for tuning ranger and uses predefined tuning parameters, hyperparameter spaces and intelligent tuning by using the out-of-bag observations.
Note, that random forest is not an algorithm were tuning makes a big difference, usually. But it can usually improve the performance a bit.

How to custom a model in CARET to perform PLS-[Classifer] two-step classificaton model?

This question is a continuation of the same thread here. Below is a minimal working example taken from this book:
Wehrens R. Chemometrics with R multivariate data analysis in the
natural sciences and life sciences. 1st edition. Heidelberg; New York:
Springer. 2011. (page 250).
The example was taken from this book and its package ChemometricsWithR. It highlighted some pitfalls when modeling using cross-validation techniques.
The Aim:
A cross-validated methodology using the same set of repeated CV to perform a known strategy of PLS followed typically by LDA or cousins like logistic regression, SVM, C5.0, CART, with the spirit of caret package. So PLS would be needed every time before calling the waiting classifier in order to classify PLS score space instead of the observations themselves. The nearest approach in the caret package is doing PCA as a pre-processing step before modeling with any classifier. Below is a PLS-LDA procedure with only one cross-validation to test performance of the classifier, there was no 10-fold CV or any repetition. The code below was taken from the mentioned book but with some corrections otherwise throws error:
library(ChemometricsWithR)
data(prostate)
prostate.clmat <- classvec2classmat(prostate.type) # convert Y to a dummy var
odd <- seq(1, length(prostate.type), by = 2) # training
even <- seq(2, length(prostate.type), by = 2) # holdout test
prostate.pls <- plsr(prostate.clmat ~ prostate, ncomp = 16, validation = "CV", subset=odd)
Xtst <- scale(prostate[even,], center = colMeans(prostate[odd,]), scale = apply(prostate[odd,],2,sd))
tst.scores <- Xtst %*% prostate.pls$projection # scores for the waiting trained LDA to test
prostate.ldapls <- lda(scores(prostate.pls)[,1:16],prostate.type[odd]) # LDA for scores
table(predict(prostate.ldapls, new = tst.scores[,1:16])$class, prostate.type[even])
predictionTest <- predict(prostate.ldapls, new = tst.scores[,1:16])$class)
library(caret)
confusionMatrix(data = predictionTest, reference= prostate.type[even]) # from caret
Output:
Confusion Matrix and Statistics
Reference
Prediction bph control pca
bph 4 1 9
control 1 35 7
pca 34 4 68
Overall Statistics
Accuracy : 0.6564
95% CI : (0.5781, 0.7289)
No Information Rate : 0.5153
P-Value [Acc > NIR] : 0.0001874
Kappa : 0.4072
Mcnemar's Test P-Value : 0.0015385
Statistics by Class:
Class: bph Class: control Class: pca
Sensitivity 0.10256 0.8750 0.8095
Specificity 0.91935 0.9350 0.5190
Pos Pred Value 0.28571 0.8140 0.6415
Neg Pred Value 0.76510 0.9583 0.7193
Prevalence 0.23926 0.2454 0.5153
Detection Rate 0.02454 0.2147 0.4172
Detection Prevalence 0.08589 0.2638 0.6503
Balanced Accuracy 0.51096 0.9050 0.6643
However, the confusion matrix didn't match that in the book, anyway the code in the book did break, but this one here worked with me!
Notes:
Although this was only one CV, but the intention is to agree on this methodology first, sd and mean of the train set were applied on the test set, PLUS transformed into PLS scores based a specific number of PC ncomp. I want this to occur every round of the CV in the caret. If the methodology as code is correct here, then it can serve, may be, as a good start for a minimal work example while modifying the code of the caret package.
Side Notes:
It can be very messy with scaling and centering, I think some of the PLS functions in R do scaling internally, with or without centering, I am not sure, so building a custom model in caret should be handled with care to avoid both lack or multiple scalings or centerings (I am on my guards with these things).
Perils of multiple centering/scaling
The code below is just to show how multliple centering/scaling can change the data, only centering is shown here but the same problem with scaling applies too.
set.seed(1)
x <- rnorm(200, 2, 1)
xCentered1 <- scale(x, center=TRUE, scale=FALSE)
xCentered2 <- scale(xCentered1, center=TRUE, scale=FALSE)
xCentered3 <- scale(xCentered2, center=TRUE, scale=FALSE)
sapply (list(xNotCentered= x, xCentered1 = xCentered1, xCentered2 = xCentered2, xCentered3 = xCentered3), mean)
Output:
xNotCentered xCentered1 xCentered2 xCentered3
2.035540e+00 1.897798e-16 -5.603699e-18 -5.332377e-18
Please drop a comment if I am missing something somewhere in this course. Thanks.
If you want to fit these types of models with caret, you would need to use the latest version on CRAN. The last update was created so that people can use non-standard models as they see fit.
My approach below is to jointly fit the PLS and other model (I used random forest in the example below) and tune them at the same time. So for each fold, a 2D grid of ncomp and mtry is used.
The "trick" is to attached the PLS loadings to the random forest object so that they can be used during prediction time. Here is the code that defines the model (classification only):
modelInfo <- list(label = "PLS-RF",
library = c("pls", "randomForest"),
type = "Classification",
parameters = data.frame(parameter = c('ncomp', 'mtry'),
class = c("numeric", 'numeric'),
label = c('#Components',
'#Randomly Selected Predictors')),
grid = function(x, y, len = NULL) {
grid <- expand.grid(ncomp = seq(1, min(ncol(x) - 1, len), by = 1),
mtry = 1:len)
grid <- subset(grid, mtry <= ncomp)
},
loop = NULL,
fit = function(x, y, wts, param, lev, last, classProbs, ...) {
## First fit the pls model, generate the training set scores,
## then attach what is needed to the random forest object to
## be used later
pre <- plsda(x, y, ncomp = param$ncomp)
scores <- pls:::predict.mvr(pre, x, type = "scores")
mod <- randomForest(scores, y, mtry = param$mtry, ...)
mod$projection <- pre$projection
mod
},
predict = function(modelFit, newdata, submodels = NULL) {
scores <- as.matrix(newdata) %*% modelFit$projection
predict(modelFit, scores)
},
prob = NULL,
varImp = NULL,
predictors = function(x, ...) rownames(x$projection),
levels = function(x) x$obsLevels,
sort = function(x) x[order(x[,1]),])
and here is the call to train:
library(ChemometricsWithR)
data(prostate)
set.seed(1)
inTrain <- createDataPartition(prostate.type, p = .90)
trainX <-prostate[inTrain[[1]], ]
trainY <- prostate.type[inTrain[[1]]]
testX <-prostate[-inTrain[[1]], ]
testY <- prostate.type[-inTrain[[1]]]
## These will take a while for these data
set.seed(2)
plsrf <- train(trainX, trainY, method = modelInfo,
preProc = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "repeatedcv",
repeats = 5))
## How does random forest do on its own?
set.seed(2)
rfOnly <- train(trainX, trainY, method = "rf",
tuneLength = 10,
trControl = trainControl(method = "repeatedcv",
repeats = 5))
Just for kicks, I got:
> getTrainPerf(plsrf)
TrainAccuracy TrainKappa method
1 0.7940423 0.65879 custom
> getTrainPerf(rfOnly)
TrainAccuracy TrainKappa method
1 0.7794082 0.6205322 rf
and
> postResample(predict(plsrf, testX), testY)
Accuracy Kappa
0.7741935 0.6226087
> postResample(predict(rfOnly, testX), testY)
Accuracy Kappa
0.9032258 0.8353982
Max
Based on Max's valuable comments, I felt the need to have IRIS referee, which is famous for classification, and more importantly the Species outcome has more than two classes, which would be a good data set to test the PLS-LDA custom model in caret:
data(iris)
names(iris)
head(iris)
dim(iris) # 150x5
set.seed(1)
inTrain <- createDataPartition(y = iris$Species,
## the outcome data are needed
p = .75,
## The percentage of data in the
## training set
list = FALSE)
## The format of the results
## The output is a set of integers for the rows of Iris
## that belong in the training set.
training <- iris[ inTrain,] # 114
testing <- iris[-inTrain,] # 36
ctrl <- trainControl(method = "repeatedcv",
repeats = 5,
classProbs = TRUE)
set.seed(2)
plsFitIris <- train(Species ~ .,
data = training,
method = "pls",
tuneLength = 4,
trControl = ctrl,
preProc = c("center", "scale"))
plsFitIris
plot(plsFitIris)
set.seed(2)
plsldaFitIris <- train(Species ~ .,
data = training,
method = modelInfo,
tuneLength = 4,
trControl = ctrl,
preProc = c("center", "scale"))
plsldaFitIris
plot(plsldaFitIris)
Now comparing the two models:
getTrainPerf(plsFitIris)
TrainAccuracy TrainKappa method
1 0.8574242 0.7852462 pls
getTrainPerf(plsldaFitIris)
TrainAccuracy TrainKappa method
1 0.975303 0.9628179 custom
postResample(predict(plsFitIris, testing), testing$Species)
Accuracy Kappa
0.750 0.625
postResample(predict(plsldaFitIris, testing), testing$Species)
Accuracy Kappa
0.9444444 0.9166667
So, finally there was the EXPECTED difference, and improvement in the metrics. So this would support Max's notion, that two-class problems because of Bayes' probabilistic approach of plsda function both lead to the same results.
You need to wrap the CV around both PLS and LDA.
Yes, both plsr and lda center the data their own way
I had a closer look at caret::preProcess (): as it is defined now, you will not be able to use PLS as preprocessing method because it is supervised but caret::preProcess () uses unsupervised methods only (there is no way to hand over the dependent variable). This would probably make patching rather difficult.
So inside the caret framework, you'll need to go for a custom model.
If the scenario were to custom a model of PLS-LDA type, according to the code kindly provided by Max (maintainer of CARET), something is not corect in this code, but I didn't figure it out, because I used the Sonar data set the same in caret vignette and tried to reproduce the result one time using method="pls" and another time using the below custom model for PLS-LDA, the results were exactly identical even to the last digit, which was nonsensical. For benchmarking, one need a known data set (I think a cross-validated PLS-LDA for iris data set would fit here as it is famous for this type of analysis and there should be somewhere a cross-validated treatment of it), everything should be the same (the set.seed(xxx) and the no of K-CV repitition) except the code in question so as to rightly compare and to judge the code below:
modelInfo <- list(label = "PLS-LDA",
library = c("pls", "MASS"),
type = "Classification",
parameters = data.frame(parameter = c("ncomp"),
class = c("numeric"),
label = c("#Components")),
grid = function(x, y, len = NULL) {
grid <- expand.grid(ncomp = seq(1, min(ncol(x) - 1, len), by = 1))
},
loop = NULL,
fit = function(x, y, wts, param, lev, last, classProbs, ...) {
## First fit the pls model, generate the training set scores,
## then attach what is needed to the lda object to
## be used later
pre <- plsda(x, y, ncomp = param$ncomp)
scores <- pls:::predict.mvr(pre, x, type = "scores")
mod <- lda(scores, y, ...)
mod$projection <- pre$projection
mod
},
predict = function(modelFit, newdata, submodels = NULL) {
scores <- as.matrix(newdata) %*% modelFit$projection
predict(modelFit, scores)$class
},
prob = function(modelFit, newdata, submodels = NULL) {
scores <- as.matrix(newdata) %*% modelFit$projection
predict(modelFit, scores)$posterior
},
varImp = NULL,
predictors = function(x, ...) rownames(x$projection),
levels = function(x) x$obsLevels,
sort = function(x) x[order(x[,1]),])
Based on Zach's request, the code below is for method="pls" in caret, exactly the same concrete example in caret vigenette on CRAN:
library(mlbench) # data set from here
data(Sonar)
dim(Sonar) # 208x60
set.seed(107)
inTrain <- createDataPartition(y = Sonar$Class,
## the outcome data are needed
p = .75,
## The percentage of data in the
## training set
list = FALSE)
## The format of the results
## The output is a set of integers for the rows of Sonar
## that belong in the training set.
training <- Sonar[ inTrain,] #157
testing <- Sonar[-inTrain,] # 51
ctrl <- trainControl(method = "repeatedcv",
repeats = 3,
classProbs = TRUE,
summaryFunction = twoClassSummary)
set.seed(108)
plsFitSon <- train(Class ~ .,
data = training,
method = "pls",
tuneLength = 15,
trControl = ctrl,
metric = "ROC",
preProc = c("center", "scale"))
plsFitSon
plot(plsFitSon) # might be slightly difference than what in the vignette due to radnomness
Now, the code below is a pilot run to classify Sonar data using the custom model PLS-LDA which is under question, it is expected to come up with any numbers apart from identical with those using PLS only:
set.seed(108)
plsldaFitSon <- train(Class ~ .,
data = training,
method = modelInfo,
tuneLength = 15,
trControl = ctrl,
metric = "ROC",
preProc = c("center", "scale"))
Now comparing the results between the two models:
getTrainPerf(plsFitSon)
TrainROC TrainSens TrainSpec method
1 0.8741154 0.7638889 0.8452381 pls
getTrainPerf(plsldaFitSon)
TrainROC TrainSens TrainSpec method
1 0.8741154 0.7638889 0.8452381 custom
postResample(predict(plsFitSon, testing), testing$Class)
Accuracy Kappa
0.745098 0.491954
postResample(predict(plsldaFitSon, testing), testing$Class)
Accuracy Kappa
0.745098 0.491954
So, the results are exactly the same which cannot be. As if the lda model were not added?

Resources