How to pass glm control argument for earth using caret (maxit) - r

When fitting earth with a glm model, one can pass arguments to the glm call. For example:
mars_fit <- earth(formula = response ~ x1 + x2, data = sim_dat,
glm = list(family=binomial, control = list(maxit = 50)))
Using caret looks like
fit_control <- trainControl(method = "cv", number = 10)
mars_grid <- expand.grid(degree=1:2, nprune=2:10)
mars_fit <- train(factor(response)~x1+x2, method='earth', trControl = fit_control,
data=sim_dat, tuneGrid=mars_grid,
glm = list(control = list(maxit = 50)))
but the glm list is not passed. Any advice?
Edit 1:
Reading https://github.com/topepo/caret/issues/554, caret's author says the argument is either caught in the ... or should be passed in the tuning grid. When it is passed through the tuning grid, since glm is a list, train complains that degree and nprune do not belong to the method, which is not true.
Edit 2:
Opened https://github.com/topepo/caret/issues/1018

The issue is solved in this commit:
https://github.com/topepo/caret/commit/2ce2cf4c5889791b7dbca5d8896fcc6dc0d0bcfc
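For reference, this is roughly what the control list in the question is meant to reach once it is forwarded to glm. The sketch below uses plain glm() on simulated data (sim_dat here is made up for illustration, not the asker's data) and checks that maxit was picked up; it does not involve earth or caret at all.

```r
# Simulated stand-in for sim_dat (hypothetical data, for illustration only)
set.seed(42)
sim_dat <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
sim_dat$response <- rbinom(200, 1, plogis(0.5 * sim_dat$x1 - 0.25 * sim_dat$x2))

# The control list is turned into glm.control() settings by glm()
fit <- glm(response ~ x1 + x2, data = sim_dat,
           family = binomial, control = list(maxit = 50))

# The fitted object stores the control settings that were actually used
fit$control$maxit
```

Inspecting the control component of the glm object is one way to confirm whether the list was really passed along; whether earth exposes that glm object under a particular name is not something this sketch assumes.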

Related

Using the type = "raw" option for the predict() function after repeated cross validation for logistic lasso regression returns empty vector

I used the caret and glmnet packages to run a lasso logistic regression using repeated cross-validation to select the optimal lambda.
glmnet.obj <- train(outcome ~ .,
data = df.train,
method = "glmnet",
metric = "ROC",
family = "binomial",
trControl = trainControl(
method = "repeatedcv",
repeats = 10,
number = 10,
summaryFunction = twoClassSummary,
classProbs = TRUE,
savePredictions = "all",
selectionFunction = "best"))
After that, I get the best lambda and alpha:
best_lambda<- get_best_result(glmnet.obj)$lambda
best_alpha<- get_best_result(glmnet.obj)$alpha
Then I obtain the predicted probabilities for the test set:
pred_prob<- predict(glmnet.obj,s=best_lambda, alpha=best_alpha, type="prob", newx = x.test)
and then to get the predicted classes, which I intend to use in ConfusionMatrix:
pred_class<-predict(glmnet.obj,s=best_lambda, alpha=best_alpha, type="raw",newx=x.test)
But when I just run pred_class it returns NULL.
What could I be missing here?
You need to use newdata = as opposed to newx= because when you do predict(glmnet.obj), it is calling predict.train on the caret object.
You did not provide one function, but I suppose it is from this source:
get_best_result = function(caret_fit) {
best = which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
best_result = caret_fit$results[best, ]
rownames(best_result) = NULL
best_result
}
Using an example data
set.seed(111)
df = data.frame(outcome = factor(sample(c("y","n"),100,replace=TRUE)),
matrix(rnorm(1000),ncol=10))
colnames(df)[-1] = paste0("col",1:10) # rename on df, since df.train is only created on the next line
df.train = df[1:70,]
x.test = df[71:100,]
And we run your model, then you can predict using the function:
pred_class<-predict(glmnet.obj,type="raw",newdata=x.test)
confusionMatrix(table(pred_class,x.test$outcome))
Confusion Matrix and Statistics
pred_class  n  y
         n  1  5
         y 11 13
The arguments s = and newx = come from glmnet. You can potentially use them on glmnet.obj$finalModel, but you need to convert the data into a matrix, for example:
predict(glmnet.obj$finalModel,s=best_lambda, alpha=best_alpha,
type="class",newx=as.matrix(x.test[,-1]))
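To see what get_best_result does without fitting anything, here is a small sketch on a mocked object. The list mock_fit and its numbers are invented; it only imitates the two pieces the helper reads (results and bestTune, where the row name of bestTune matches a row name in results).

```r
get_best_result <- function(caret_fit) {
  # Find the row of the results table whose row name matches bestTune
  best <- which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
  best_result <- caret_fit$results[best, ]
  rownames(best_result) <- NULL
  best_result
}

# Mocked stand-in for a train object (hypothetical values)
mock_fit <- list(
  results  = data.frame(alpha  = c(0.10, 0.55, 1.00),
                        lambda = c(0.01, 0.02, 0.03),
                        ROC    = c(0.70, 0.82, 0.78)),
  bestTune = data.frame(alpha = 0.55, lambda = 0.02, row.names = "2")
)

best <- get_best_result(mock_fit)
best$lambda
best$alpha
```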

Use F1 Score instead of Accuracy to Optimize SVM Parameters

I am using the e1071 'tune' function to optimize an SVM model. I would like to use F1 instead of Accuracy as the value to optimize for. I have found on this post: Optimize F-score in e1071 package that I need to define a new error.fun. The problem that I am having is that the function that is shown in that post was not shown to ultimately be the solution and it does not work for me. If I knew the variable names for the predictions from each iteration of tune I could write a function to calculate F1 but I don't know how to get those values. How can I calculate F1 and use it to optimize model parameters using 'tune' in e1071? My code is as follows:
tuned = tune.svm(PriYN~., data = dataset, kernel = "radial", probability=TRUE, gamma = 10^(-5:-1), cost = 10^(-3:1), tunecontrol=tune.control(cross=10))
Using {caret} :
ctrl <- trainControl(method = "repeatedcv", # choose your CV method
number = 5, # according to CV method
repeats = 2, # according to CV method
summaryFunction = prSummary, # TO TUNE ON F1 SCORE
classProbs = T,
verboseIter = T
#sampling = "smote" # you can try 'smote' resampling method
)
Then tune your model
set.seed(2202)
svm_model <- train(target ~., data = training,
method = "svmRadial",
#preProcess = c("center", "scale"),
tuneLength = 10,
metric = "F", # The metric used for tuning is the F1 SCORE
trControl = ctrl)
svm_model
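If you want to check what the tuned metric means, here is a minimal base R sketch of the F1 computation from predicted and observed classes. The function name f1_score and the toy vectors are made up for illustration. It treats the first factor level as the positive class, which I believe is also what prSummary does, but please verify that against your caret version.

```r
# F1 = harmonic mean of precision and recall for one "positive" class
f1_score <- function(pred, obs, positive = levels(obs)[1]) {
  tp <- sum(pred == positive & obs == positive)  # true positives
  fp <- sum(pred == positive & obs != positive)  # false positives
  fn <- sum(pred != positive & obs == positive)  # false negatives
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}

# Toy example (hypothetical labels)
obs  <- factor(c("yes", "yes", "yes", "no", "no", "no", "no", "yes"),
               levels = c("yes", "no"))
pred <- factor(c("yes", "no", "yes", "no", "yes", "no", "no", "yes"),
               levels = c("yes", "no"))

f1 <- f1_score(pred, obs)
f1
```

A function of this shape, returning 1 - F1 so that smaller is better, is the kind of thing the error.fun route in the original question would need; I have not tested that route with tune here.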

Automate variable selection based on varimp in R

In R, I have a logistic regression model as follows
train_control <- trainControl(method = "cv", number = 3)
logit_Model <- train(result~., data=df,
trControl = train_control,
method = "glm",
family=binomial(link="logit"))
calculatedVarImp <- varImp(logit_Model, scale = FALSE)
I use multiple datasets that run through the same code, so the variable importance changes for each dataset. Is there a way to get the names of the variables that are less than n (e.g. 1) in the overall importance, so I can automate the removal of those variables and rerun the model.
I was unable to get the information from 'calculatedVarImp' variable by subsetting 'overall' value
lowVarImp <- subset(calculatedVarImp , importance$Overall <1)
Also, is there a better way of doing variable selection?
Thanks in advance
You're using the caret package. Not sure if you're aware of this, but caret has a method for stepwise logistic regression using the Akaike Information Criterion: glmStepAIC.
It adds or drops predictors step by step (it does not fit every possible subset) and stops once no single step lowers the AIC.
train_control <- trainControl(method = "cv", number = 3)
logit_Model <- train(y~., data= train_data,
trControl = train_control,
method = "glmStepAIC",
family=binomial(link="logit"),
na.action = na.omit)
logit_Model$finalModel
This is as automated as it gets but it may be worth reading this answer about the downsides to this method:
See Also.
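As for the literal question of pulling out the low-importance names: the object returned by varImp() on a train model keeps its scores in an importance data frame, with one row per term. A minimal sketch on a mocked object follows; the variable names and scores are invented, and the threshold of 1 is taken from the question.

```r
# Mocked stand-in for the varImp() result (hypothetical names and scores)
mock_varimp <- list(
  importance = data.frame(Overall = c(3.2, 0.4, 1.7, 0.9),
                          row.names = c("age", "height", "weight", "income"))
)

imp <- mock_varimp$importance
threshold <- 1

# Row names carry the term names, so subset them by the Overall column
low_vars  <- rownames(imp)[imp$Overall < threshold]
keep_vars <- setdiff(rownames(imp), low_vars)

# Rebuild a formula with only the retained predictors
new_formula <- reformulate(keep_vars, response = "result")
low_vars
new_formula
```

Note that for factor predictors the row names are likely to be coefficient names (dummy columns) rather than the original column names, so they may need mapping back before being used in a formula.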

R - how to set a specific number of PCA components to train a prediction model

Using train() and preProcess() I want to build a predictive model using PCA with the first 7 principal components as my predictors.
The below works but I'm not able to specify the number of PCs:
predModel2 <- train(diagnosis~., data=training2, method = "glm", preProcess = "pca")
I've tried this to specify the number of PCs but I don't know how to incorporate it into train():
training_pre<-preProcess(training[,ILcols],method = c("center", "scale", "pca"),pcaComp= 7)
I've tried using:
predModel2 <- train(diagnosis~., data=training2, method = "glm", preProcess = "pca", pcaComp=7)
Error in train.default(x, y, weights = w, ...) : Stopping
UPDATE:
It seems I get around this by using predict() first:
training2_pca<-predict(training_pre,training2) # apply the fitted pre-processing to the original data
train(diagnosis~., data=training2_pca, method = "glm")
All preprocessing should be done within the training folds or, in this case, resamples. That prevents 'data leaks', so the first of the above approaches should be preferred, see e.g. this question.
The pcaComp argument goes into the preProcOptions list of trainControl(). Using the iris data, KNN and the first two principal components as an example:
predModel2 <- train(Species~., data=iris, method = "knn", preProcess = "pca",
trControl = trainControl(preProcOptions = list(pcaComp = 2)))
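To make the data-leak point concrete outside caret, here is a minimal base R sketch with prcomp(): the centering, scaling and rotation are estimated on the training rows only, and the held-out rows are projected with those same values. The odd/even split and the object names are just for illustration.

```r
# Hypothetical odd/even split of iris, for illustration only
train_idx <- seq(1, nrow(iris), by = 2)
train_x <- iris[train_idx, 1:4]
test_x  <- iris[-train_idx, 1:4]

# Estimate centering, scaling and the PCA rotation on the training rows only
pca <- prcomp(train_x, center = TRUE, scale. = TRUE)

# Keep the first two components (the analogue of pcaComp = 2)
n_comp <- 2
train_scores <- pca$x[, 1:n_comp]

# predict() reuses the training centers, scales and rotation on new rows
test_scores <- predict(pca, newdata = test_x)[, 1:n_comp]
dim(test_scores)
```

Inside train() the same thing happens once per resample, which is why passing preProcess to train() is safer than pre-computing the scores on the full training set.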

How to customize a model in caret to perform a PLS-[Classifier] two-step classification model?

This question is a continuation of the same thread here. Below is a minimal working example taken from this book:
Wehrens R. Chemometrics with R multivariate data analysis in the
natural sciences and life sciences. 1st edition. Heidelberg; New York:
Springer. 2011. (page 250).
The example was taken from this book and its package ChemometricsWithR. It highlighted some pitfalls when modeling using cross-validation techniques.
The Aim:
A cross-validated methodology that uses the same set of repeated CV folds to perform a known strategy: PLS followed typically by LDA or cousins such as logistic regression, SVM, C5.0, or CART, in the spirit of the caret package. PLS would be needed every time before calling the downstream classifier, so that the classifier works on the PLS score space instead of the observations themselves. The nearest approach in the caret package is doing PCA as a pre-processing step before modeling with any classifier. Below is a PLS-LDA procedure with only one cross-validation to test the performance of the classifier; there is no 10-fold CV or any repetition. The code below was taken from the mentioned book, with some corrections because it otherwise throws an error:
library(ChemometricsWithR)
data(prostate)
prostate.clmat <- classvec2classmat(prostate.type) # convert Y to a dummy var
odd <- seq(1, length(prostate.type), by = 2) # training
even <- seq(2, length(prostate.type), by = 2) # holdout test
prostate.pls <- plsr(prostate.clmat ~ prostate, ncomp = 16, validation = "CV", subset=odd)
Xtst <- scale(prostate[even,], center = colMeans(prostate[odd,]), scale = apply(prostate[odd,],2,sd))
tst.scores <- Xtst %*% prostate.pls$projection # scores for the waiting trained LDA to test
prostate.ldapls <- lda(scores(prostate.pls)[,1:16],prostate.type[odd]) # LDA for scores
table(predict(prostate.ldapls, new = tst.scores[,1:16])$class, prostate.type[even])
predictionTest <- predict(prostate.ldapls, new = tst.scores[,1:16])$class # predicted classes for the holdout set
library(caret)
confusionMatrix(data = predictionTest, reference= prostate.type[even]) # from caret
Output:
Confusion Matrix and Statistics
          Reference
Prediction bph control pca
   bph       4       1   9
   control   1      35   7
   pca      34       4  68
Overall Statistics
               Accuracy : 0.6564
                 95% CI : (0.5781, 0.7289)
    No Information Rate : 0.5153
    P-Value [Acc > NIR] : 0.0001874
                  Kappa : 0.4072
 Mcnemar's Test P-Value : 0.0015385
Statistics by Class:
                     Class: bph Class: control Class: pca
Sensitivity             0.10256         0.8750     0.8095
Specificity             0.91935         0.9350     0.5190
Pos Pred Value          0.28571         0.8140     0.6415
Neg Pred Value          0.76510         0.9583     0.7193
Prevalence              0.23926         0.2454     0.5153
Detection Rate          0.02454         0.2147     0.4172
Detection Prevalence    0.08589         0.2638     0.6503
Balanced Accuracy       0.51096         0.9050     0.6643
However, the confusion matrix didn't match the one in the book. In any case, the code in the book broke, while this version worked for me!
Notes:
Although this was only one CV, the intention is to agree on the methodology first: the sd and mean of the training set were applied to the test set, which was then transformed into PLS scores based on a specific number of components, ncomp. I want this to occur in every round of the CV in caret. If the methodology as coded here is correct, it can perhaps serve as a good start for a minimal working example while modifying the code of the caret package.
Side Notes:
Scaling and centering can get very messy. I think some of the PLS functions in R do scaling internally, with or without centering (I am not sure), so building a custom model in caret should be handled with care to avoid both missing and multiple scalings or centerings (I am on my guard with these things).
Perils of multiple centering/scaling
The code below just shows how multiple centering/scaling can change the data; only centering is shown here, but the same problem applies to scaling too.
set.seed(1)
x <- rnorm(200, 2, 1)
xCentered1 <- scale(x, center=TRUE, scale=FALSE)
xCentered2 <- scale(xCentered1, center=TRUE, scale=FALSE)
xCentered3 <- scale(xCentered2, center=TRUE, scale=FALSE)
sapply (list(xNotCentered= x, xCentered1 = xCentered1, xCentered2 = xCentered2, xCentered3 = xCentered3), mean)
Output:
xNotCentered xCentered1 xCentered2 xCentered3
2.035540e+00 1.897798e-16 -5.603699e-18 -5.332377e-18
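A small follow-up check, as a sketch with made-up train/test vectors: standardizing the same data twice appears to change it only at the level of floating-point error, which is in line with the means of order 1e-16 and smaller above. What does change the numbers is which statistics are used, for example scaling a test set with its own mean and sd instead of the training ones.

```r
set.seed(1)
train <- rnorm(200, mean = 2, sd = 1)  # hypothetical training values
test  <- rnorm(50,  mean = 2, sd = 1)  # hypothetical test values

# Standardize once, then standardize the result again
once  <- scale(train)
twice <- scale(once)
max_diff <- max(abs(as.numeric(once) - as.numeric(twice)))
max_diff  # largest change caused by the second pass

# Test data scaled with the training statistics (what a CV fold should do)
test_with_train_stats <- scale(test,
                               center = attr(once, "scaled:center"),
                               scale  = attr(once, "scaled:scale"))

# Test data scaled with its own statistics (not comparable to the training scale)
test_with_own_stats <- scale(test)

# The two versions of the test set differ
max(abs(as.numeric(test_with_train_stats) - as.numeric(test_with_own_stats)))
```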
Please drop a comment if I am missing something somewhere in this course. Thanks.
If you want to fit these types of models with caret, you would need to use the latest version on CRAN. The last update was created so that people can use non-standard models as they see fit.
My approach below is to jointly fit the PLS and other model (I used random forest in the example below) and tune them at the same time. So for each fold, a 2D grid of ncomp and mtry is used.
The "trick" is to attach the PLS loadings to the random forest object so that they can be used during prediction time. Here is the code that defines the model (classification only):
modelInfo <- list(label = "PLS-RF",
library = c("pls", "randomForest"),
type = "Classification",
parameters = data.frame(parameter = c('ncomp', 'mtry'),
class = c("numeric", 'numeric'),
label = c('#Components',
'#Randomly Selected Predictors')),
grid = function(x, y, len = NULL) {
grid <- expand.grid(ncomp = seq(1, min(ncol(x) - 1, len), by = 1),
mtry = 1:len)
grid <- subset(grid, mtry <= ncomp)
},
loop = NULL,
fit = function(x, y, wts, param, lev, last, classProbs, ...) {
## First fit the pls model, generate the training set scores,
## then attach what is needed to the random forest object to
## be used later
pre <- plsda(x, y, ncomp = param$ncomp)
scores <- pls:::predict.mvr(pre, x, type = "scores")
mod <- randomForest(scores, y, mtry = param$mtry, ...)
mod$projection <- pre$projection
mod
},
predict = function(modelFit, newdata, submodels = NULL) {
scores <- as.matrix(newdata) %*% modelFit$projection
predict(modelFit, scores)
},
prob = NULL,
varImp = NULL,
predictors = function(x, ...) rownames(x$projection),
levels = function(x) x$obsLevels,
sort = function(x) x[order(x[,1]),])
and here is the call to train:
library(ChemometricsWithR)
data(prostate)
set.seed(1)
inTrain <- createDataPartition(prostate.type, p = .90)
trainX <-prostate[inTrain[[1]], ]
trainY <- prostate.type[inTrain[[1]]]
testX <-prostate[-inTrain[[1]], ]
testY <- prostate.type[-inTrain[[1]]]
## These will take a while for these data
set.seed(2)
plsrf <- train(trainX, trainY, method = modelInfo,
preProc = c("center", "scale"),
tuneLength = 10,
trControl = trainControl(method = "repeatedcv",
repeats = 5))
## How does random forest do on its own?
set.seed(2)
rfOnly <- train(trainX, trainY, method = "rf",
tuneLength = 10,
trControl = trainControl(method = "repeatedcv",
repeats = 5))
Just for kicks, I got:
> getTrainPerf(plsrf)
TrainAccuracy TrainKappa method
1 0.7940423 0.65879 custom
> getTrainPerf(rfOnly)
TrainAccuracy TrainKappa method
1 0.7794082 0.6205322 rf
and
> postResample(predict(plsrf, testX), testY)
Accuracy Kappa
0.7741935 0.6226087
> postResample(predict(rfOnly, testX), testY)
Accuracy Kappa
0.9032258 0.8353982
Max
Based on Max's valuable comments, I felt the need to have the iris data set as a referee: it is famous for classification and, more importantly, the Species outcome has more than two classes, which makes it a good data set to test the PLS-LDA custom model in caret:
data(iris)
names(iris)
head(iris)
dim(iris) # 150x5
set.seed(1)
inTrain <- createDataPartition(y = iris$Species,
## the outcome data are needed
p = .75,
## The percentage of data in the
## training set
list = FALSE)
## The format of the results
## The output is a set of integers for the rows of Iris
## that belong in the training set.
training <- iris[ inTrain,] # 114
testing <- iris[-inTrain,] # 36
ctrl <- trainControl(method = "repeatedcv",
repeats = 5,
classProbs = TRUE)
set.seed(2)
plsFitIris <- train(Species ~ .,
data = training,
method = "pls",
tuneLength = 4,
trControl = ctrl,
preProc = c("center", "scale"))
plsFitIris
plot(plsFitIris)
set.seed(2)
plsldaFitIris <- train(Species ~ .,
data = training,
method = modelInfo,
tuneLength = 4,
trControl = ctrl,
preProc = c("center", "scale"))
plsldaFitIris
plot(plsldaFitIris)
Now comparing the two models:
getTrainPerf(plsFitIris)
TrainAccuracy TrainKappa method
1 0.8574242 0.7852462 pls
getTrainPerf(plsldaFitIris)
TrainAccuracy TrainKappa method
1 0.975303 0.9628179 custom
postResample(predict(plsFitIris, testing), testing$Species)
Accuracy Kappa
0.750 0.625
postResample(predict(plsldaFitIris, testing), testing$Species)
Accuracy Kappa
0.9444444 0.9166667
So, finally there was the EXPECTED difference, and an improvement in the metrics. This would support Max's notion that, for two-class problems, both approaches lead to the same results because of the Bayes probabilistic approach of the plsda function.
You need to wrap the CV around both PLS and LDA.
Yes, both plsr and lda center the data their own way
I had a closer look at caret::preProcess (): as it is defined now, you will not be able to use PLS as preprocessing method because it is supervised but caret::preProcess () uses unsupervised methods only (there is no way to hand over the dependent variable). This would probably make patching rather difficult.
So inside the caret framework, you'll need to go for a custom model.
If the scenario is to build a custom model of the PLS-LDA type, following the code kindly provided by Max (maintainer of caret), then something is not correct in this code, but I couldn't figure out what. I used the Sonar data set, the same as in the caret vignette, and tried to reproduce the result once using method="pls" and once using the custom PLS-LDA model below; the results were exactly identical, even to the last digit, which was nonsensical. For benchmarking, one needs a known data set (I think a cross-validated PLS-LDA for the iris data set would fit here, as it is famous for this type of analysis and there should be a cross-validated treatment of it somewhere). Everything should be the same (the set.seed(xxx) and the number of K-fold CV repetitions) except the code in question, so as to compare fairly and to judge the code below:
modelInfo <- list(label = "PLS-LDA",
library = c("pls", "MASS"),
type = "Classification",
parameters = data.frame(parameter = c("ncomp"),
class = c("numeric"),
label = c("#Components")),
grid = function(x, y, len = NULL) {
grid <- expand.grid(ncomp = seq(1, min(ncol(x) - 1, len), by = 1))
},
loop = NULL,
fit = function(x, y, wts, param, lev, last, classProbs, ...) {
## First fit the pls model, generate the training set scores,
## then attach what is needed to the lda object to
## be used later
pre <- plsda(x, y, ncomp = param$ncomp)
scores <- pls:::predict.mvr(pre, x, type = "scores")
mod <- lda(scores, y, ...)
mod$projection <- pre$projection
mod
},
predict = function(modelFit, newdata, submodels = NULL) {
scores <- as.matrix(newdata) %*% modelFit$projection
predict(modelFit, scores)$class
},
prob = function(modelFit, newdata, submodels = NULL) {
scores <- as.matrix(newdata) %*% modelFit$projection
predict(modelFit, scores)$posterior
},
varImp = NULL,
predictors = function(x, ...) rownames(x$projection),
levels = function(x) x$obsLevels,
sort = function(x) x[order(x[,1]),])
Based on Zach's request, the code below is for method="pls" in caret, exactly the same concrete example as in the caret vignette on CRAN:
library(mlbench) # data set from here
data(Sonar)
dim(Sonar) # 208x60
set.seed(107)
inTrain <- createDataPartition(y = Sonar$Class,
## the outcome data are needed
p = .75,
## The percentage of data in the
## training set
list = FALSE)
## The format of the results
## The output is a set of integers for the rows of Sonar
## that belong in the training set.
training <- Sonar[ inTrain,] #157
testing <- Sonar[-inTrain,] # 51
ctrl <- trainControl(method = "repeatedcv",
repeats = 3,
classProbs = TRUE,
summaryFunction = twoClassSummary)
set.seed(108)
plsFitSon <- train(Class ~ .,
data = training,
method = "pls",
tuneLength = 15,
trControl = ctrl,
metric = "ROC",
preProc = c("center", "scale"))
plsFitSon
plot(plsFitSon) # might be slightly different from the vignette due to randomness
Now, the code below is a pilot run to classify the Sonar data using the custom PLS-LDA model in question; it is expected to produce numbers that are not identical to those from PLS alone:
set.seed(108)
plsldaFitSon <- train(Class ~ .,
data = training,
method = modelInfo,
tuneLength = 15,
trControl = ctrl,
metric = "ROC",
preProc = c("center", "scale"))
Now comparing the results between the two models:
getTrainPerf(plsFitSon)
TrainROC TrainSens TrainSpec method
1 0.8741154 0.7638889 0.8452381 pls
getTrainPerf(plsldaFitSon)
TrainROC TrainSens TrainSpec method
1 0.8741154 0.7638889 0.8452381 custom
postResample(predict(plsFitSon, testing), testing$Class)
Accuracy Kappa
0.745098 0.491954
postResample(predict(plsldaFitSon, testing), testing$Class)
Accuracy Kappa
0.745098 0.491954
So, the results are exactly the same which cannot be. As if the lda model were not added?
