predict.train vs predict using recipe objects

After specifying a recipe to use in caret::train(), I am trying to predict new samples. I have a couple of questions around this, as I cannot find answers in the caret/recipes documentation.
Should I use predict() or predict.train()? What's the difference?
Should I bake the test data with the prepared recipe before using predict()? When using preProcess directly in train(), you are advised not to preprocess new data, since the train object will do that automatically. Is it the same when using recipes?
Below is a reproducible example illustrating my process and the difference in predictions when using predict vs predict.train.
library(recipes)
library(caret)
# Data ----
data("credit_data")
credit_train <- credit_data[1:3500,]
credit_test <- credit_data[-(1:3500),]
# Set up recipe ----
set.seed(0)
Rec.Obj = recipe(Status ~ ., data = credit_train) %>%
  step_knnimpute(all_predictors()) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric())
# Control parameters ----
set.seed(0)
TC = trainControl("cv", number = 10, savePredictions = "final", classProbs = TRUE, returnResamp = "final")
set.seed(0)
Model.Output = train(Rec.Obj,
                     credit_train,
                     trControl = TC,
                     tuneLength = 1,
                     metric = "Accuracy",
                     method = "glm")
# Preped recipe ----
set.seed(0)
prep.rec <- prep(Rec.Obj, training = credit_train)
# Baked data for observation ----
set.seed(0)
bake.train <- bake(prep.rec, new_data = credit_train)
bake.test <- bake(prep.rec, new_data = credit_test)
# investigation of prediction methods ----
# no application of recipe to newdata
set.seed(0)
predict.norm = predict(Model.Output, credit_test, type = "raw")
predict.train = predict.train(Model.Output, credit_test, type = "raw")
identical(predict.norm,predict.train)
# evaluates to FALSE
# Apply recipe to new data (bake.test)
predict.norm.baked = predict(Model.Output, bake.test, type = "raw")
predict.train.baked = predict.train(Model.Output, bake.test, type = "raw")
identical(predict.norm.baked, predict.train.baked)
# evaluates to FALSE
# Comparison of both predict() funcs
identical(predict.norm, predict.norm.baked)
# evaluates to FALSE

The recipe is embedded in the train object. The answers are different for two reasons:
You are giving the recipe (inside of Model.Output) already-processed data, which it then re-processes. You should not give predict() baked data; just use predict() and give it the original test set.
Let S3 do its thing: predict.train is for the x/y interface and predict.train.recipe is for the recipe interface. Plain predict() will dispatch to the appropriate method.
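In other words, using the objects from the question, the correct call is simply:
# The recipe embedded in Model.Output bakes new data internally,
# so pass the raw test set straight to predict():
predict(Model.Output, credit_test, type = "raw")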


Linear SVM and extracting the weights

I am practicing SVM in R using the iris dataset and I want to get the feature weights/coefficients from my model, but I think I may have misinterpreted something given that my output gives me 32 support vectors. I was under the assumption I would get four given I have four variables being analyzed. I know there is a way to do it when using the svm() function, but I am trying to use the train() function from caret to produce my SVM.
library(caret)
# Define fitControl
fitControl <- trainControl(## 5-fold CV
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary)
# Define Tune
grid <- expand.grid(C = c(2^-5, 2^-3, 2^-1))
##########
df <- iris
head(df)
df <- df[df$Species != 'setosa', ]
df$Species <- as.character(df$Species)
df$Species <- as.factor(df$Species)
# set random seed and run the model
set.seed(321)
svmFit1 <- train(x = df[-5],
                 y = df$Species,
                 method = "svmLinear",
                 trControl = fitControl,
                 preProc = c("center", "scale"),
                 metric = "ROC",
                 tuneGrid = grid)
svmFit1
I thought it was simply svmFit1$finalModel@coef, but I get 32 values when I believe I should get 4. Why is that?
So coef is not the weight vector w of the SVM. Here's the relevant section on the ksvm class in the docs:
coef: The corresponding coefficients times the training labels.
To get what you are looking for, you'll need to do the following:
coefs <- svmFit1$finalModel@coef[[1]]
mat <- svmFit1$finalModel@xmatrix[[1]]
coefs %*% mat  # w = sum over support vectors of alpha_i * y_i * x_i
See below for a reproducible example.
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
#> Warning: package 'ggplot2' was built under R version 3.5.2
# Define fitControl
fitControl <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)
# Define Tune
grid <- expand.grid(C = c(2^-5, 2^-3, 2^-1))
##########
df <- iris
df <- df[df$Species != 'setosa', ]
df$Species <- as.character(df$Species)
df$Species <- as.factor(df$Species)
# set random seed and run the model
set.seed(321)
svmFit1 <- train(x = df[-5],
                 y = df$Species,
                 method = "svmLinear",
                 trControl = fitControl,
                 preProc = c("center", "scale"),
                 metric = "ROC",
                 tuneGrid = grid)
coefs <- svmFit1$finalModel@coef[[1]]
mat <- svmFit1$finalModel@xmatrix[[1]]
coefs %*% mat
#> Sepal.Length Sepal.Width Petal.Length Petal.Width
#> [1,] -0.1338791 -0.2726322 0.9497457 1.027411
Created on 2019-06-11 by the reprex package (v0.2.1.9000)
Sources
https://www.researchgate.net/post/How_can_I_find_the_w_coefficients_of_SVM
http://r.789695.n4.nabble.com/SVM-coefficients-td903591.html
https://stackoverflow.com/a/1901200/6637133
As more folks move from caret to tidymodels, I thought I'd add a tidymodels version of the above solution (as of Aug 2020), since I don't see many discussions of this so far and it isn't that straightforward to do.
I outline the main steps here, but please review the links at the end for details on why it was done this way.
1. Get Your Final Model
set.seed(2020)
# Assuming a kernlab linear SVM and pre-existing tidymodels objects:
# a workflow (model_wf), resamples (train_folds), a tuning grid (param_grid)
# and a metric set (classification_measure)
# Grid search parameters
tune_rs <- tune_grid(
  model_wf,
  train_folds,
  grid = param_grid,
  metrics = classification_measure,
  control = control_grid(save_pred = TRUE)
)
# Finalise the workflow with the parameters giving the best accuracy
best_accuracy <- select_best(tune_rs, "accuracy")
svm_wf_final <- finalize_workflow(
  model_wf,
  best_accuracy
)
# Fit your final model on all available data at the end of the experiment
final_model <- fit(svm_wf_final, data)
# fit() takes a workflow (model spec, formula) and data, and executes
# the model fitting routine (parsnip)
2. Extract the KSVM Object, Pull Required Info, Calculate Variable Importance
ksvm_obj <- pull_workflow_fit(final_model)$fit
# pull_workflow_fit() returns the parsnip model fit object;
# $fit returns the object produced by the fitting function
# (which is what we need, and is engine-dependent)
coefs <- ksvm_obj@coef[[1]]
# first bit of info we need: the coefficients from the linear fit
mat <- ksvm_obj@xmatrix[[1]]
# the xmatrix that we matrix-multiply against
var_impt <- coefs %*% mat
# variable importance
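To inspect the result, here is a minimal sketch (my own addition, assuming var_impt from step 2 above; it is a one-row matrix whose column names are the predictor names) that ranks predictors by absolute weight:
# Hypothetical helper, not part of the original post:
var_impt_df <- data.frame(variable = colnames(var_impt),
                          weight = as.numeric(var_impt))
var_impt_df[order(abs(var_impt_df$weight), decreasing = TRUE), ]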
Ref:
Extracting the Weights of Support Vectors using Caret: Linear SVM and extracting the weights
Variable Importance (Last Section of this post): http://www.rebeccabarter.com/blog/2020-03-25_machine_learning/#finalize-the-workflow

Plotting variable importance from ensemble of models with for loop

I keep running into an error while attempting to plot variable importance from an ensemble of models.
I have an ensemble of fitted models and I am trying to create a variable importance plot for each fitted algorithm. I am using the varImp() function from caret to extract variable importance and then plot() it. To fit the ensemble of models, I am using the caretEnsemble package.
Thank you for any help; please see the example code below.
# Caret ensemble is needed to produce a list of models
library(caret)
library(caretEnsemble)
# Set algorithms I wish to fit
my_algorithms <- c("glmnet", "svmRadial", "rf", "nnet", "knn", "rpart")
# Define controls
my_controls <- trainControl(
  method = "cv",
  savePredictions = "final",
  number = 3
)
# Run the models all at once with caretEnsemble
my_list_of_models <- caretEnsemble::caretList(Species ~ .,
                                              data = iris,
                                              trControl = my_controls,
                                              methodList = my_algorithms)
# Subset models
list_of_algorithms <- my_list_of_models[my_algorithms]
# Create first for loop to extract variable importance via caret::varImp()
importance <- list()
for (algo in seq_along(list_of_algorithms)) {
  importance[[algo]] <- varImp(list_of_algorithms[[algo]])
}
# Create second loop to go over the extracted importance and plot it using plot()
importance_plots <- list()
for (imp in seq_along(importance)) {
  importance_plots[[imp]] <- plot(importance[[imp]])
}
# Error occurs during the second for loop:
Error in data.frame(values = unlist(unname(x)), ind, stringsAsFactors = FALSE) :
  arguments imply differing number of rows: 16,
I've come up with a solution to the problem above and decided to post it as my own answer. I've written a small function to plot variable importance without relying on caret helper functions to create the plots. I used dotplot and levelplot because varImp() returns a data.frame whose shape differs depending on the algorithm, so this may not work for algorithms and models I didn't fit here.
# Libraries ---------------------------------------------------------------
library(caret)         # To train ML algorithms
library(dplyr)         # Required for %>% operators in custom function below
library(caretEnsemble) # To train multiple caret models
library(lattice)       # Required for plotting, should be loaded alongside caret
library(gridExtra)     # Required for plotting multiple plots
# Custom function ---------------------------------------------------------
# The function requires a list of models as input and is used in a for loop
plot_importance <- function(importance_list, imp, algo_names) {
  importance <- importance_list[[imp]]$importance
  model_title <- algo_names[[imp]]
  if (ncol(importance) < 2) {  # Plot dotplot if ncol < 2
    importance %>%
      as.matrix() %>%
      dotplot(main = model_title)
  } else {                     # Plot heatmap if ncol >= 2
    importance %>%
      as.matrix() %>%
      levelplot(xlab = NULL, ylab = NULL, main = model_title,
                scales = list(x = list(rot = 45)))
  }
}
# Tuning parameters -------------------------------------------------------
# Set algorithms I wish to fit.
# Rather than using methodList as above, I've switched to tuneList because
# I need to control the tuning parameters of the random forest algorithm.
my_algorithms <- list(
  glmnet = caretModelSpec(method = "glmnet"),
  rpart = caretModelSpec(method = "rpart"),
  svmRadial = caretModelSpec(method = "svmRadial"),
  rf = caretModelSpec(method = "rf", importance = TRUE), # Importance is not computed for "rf" by default
  nnet = caretModelSpec(method = "nnet"),
  knn = caretModelSpec(method = "knn")
)
# Define controls
my_controls <- trainControl(
  method = "cv",
  savePredictions = "final",
  number = 3
)
# Run the models all at once with caretEnsemble
my_list_of_models <- caretList(Species ~ .,
                               data = iris,
                               tuneList = my_algorithms,
                               trControl = my_controls)
# Extract variable importance ---------------------------------------------
importance <- lapply(my_list_of_models, varImp)
# Plotting variable importance --------------------------------------------
# Loop over the extracted importance and plot it with the custom function
importance_plots <- list()
for (imp in seq_along(importance)) {
  importance_plots[[imp]] <- plot_importance(importance_list = importance,
                                             imp = imp,
                                             algo_names = names(my_list_of_models))
}
# Multiple plots at once
do.call("grid.arrange", c(importance_plots))

Retrieve models from resample function in mlr

I would like to retrieve the binary classification models (i.e. selected features and coefficients) generated by the resample function in mlr. Below you can find my code sample. The models seem to be located in the models attribute of the resulting object (here r$models), but I can't find them.
# 1. Find a synthetic dataset for supervised learning (two classes)
###################################################################
library(mlbench)
data(BreastCancer)
# generate 1000 rows, 21 quantitative candidate predictors and 1 target variable
p <- mlbench.waveform(1000)
# convert list into data frame
dataset <- as.data.frame(p)
# drop third class to get 2 classes
dataset2 <- subset(dataset, classes != 3)
dataset2 <- droplevels(dataset2)
# 2. Perform cross-validation with embedded feature selection using logistic regression
########################################################################################
library(BBmisc)
library(mlr)
set.seed(123, "L'Ecuyer")
set.seed(21)
# Choice of data
mCT <- makeClassifTask(data = dataset2, target = "classes")
# Choice of algorithm
mL <- makeLearner("classif.logreg", predict.type = "prob")
# Choice of cross-validation for folds
outer = makeResampleDesc("CV", iters = 10, stratify = TRUE)
# Choice of feature selection method
ctrl = makeFeatSelControlSequential(method = "sbs", maxit = NA, beta = 0.001)
# Choice of sampling between training and test within the fold
inner = makeResampleDesc("Holdout", stratify = TRUE)
lrn = makeFeatSelWrapper(mL, resampling = inner, control = ctrl)
r = resample(lrn, mCT, outer, extract = getFeatSelResult,
             measures = list(mlr::auc, mlr::acc, mlr::brier), models = TRUE)
You have to dig a bit deeper into the list. For the first model, for example:
r$models[[1]]$learner.model$opt.result                # feature selection result
r$models[[1]]$learner.model$next.model$learner.model  # logistic regression refitted on the selected features
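If you want all folds at once, here is a minimal sketch (my own addition, assuming the r object from the question; opt.result holds the FeatSelResult and next.model the logistic regression refitted on the selected features):
# Selected features per fold (assumption: opt.result$x holds the feature names)
sel_features <- lapply(r$models, function(m) m$learner.model$opt.result$x)
# Coefficients of the refitted logistic regression per fold
fold_coefs <- lapply(r$models, function(m) coef(m$learner.model$next.model$learner.model))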

caret: `predict` fails when `train` formula has deleted variables

TL;DR ANSWER: specify the training data in the newdata argument.
How do I consistently extract class probabilities from trained models with caret's predict? Currently I get an error when the model passed to predict was trained with formula notation and a variable was excluded with -variable.
This can be reproduced with:
fit.lda <- train(Species ~ . -Petal.Length,
                 data = iris,
                 preProcess = c("center", "scale"),
                 trControl = trainControl(method = "repeatedcv",
                                          number = 10,
                                          repeats = 3,
                                          classProbs = TRUE,
                                          savePredictions = "final",
                                          selectionFunction = "best",
                                          summaryFunction = multiClassSummary),
                 method = "lda",
                 metric = "Mean_F1")
and then the following line will fail:
predict(fit.lda, type = "prob")
Error in predict.lda(modelFit, newdata) : wrong number of variables
If the -Petal.Length is omitted in the train formula, there is no error. Am I doing something wrong with the formula statement?
I suppose I could dig into the model's pred slot and grab the columns corresponding to the class types (see EDIT2), but this seems hackish. Is there a way to get predict to work as expected?
=====EDIT=====
I trained a number of different models (using formula notation) with caretList from the caretEnsemble package, and I got various errors when trying to use predict:
knn:
Error in knn3Train(train = c(....) : dims of 'test' and 'train' differ
svmRadial:
Warning message:
In method$prob(modelFit = modelFit, newdata = newdata, submodels = param) :
  kernlab class probability calculations failed; returning NAs
mlpML:
Error in myFunc[[1]](x, ...) :
  number of input data columns 28 does not match number of input neurons 20
Methods that worked without errors were nnet and tree-based methods (rf, xgbTree).
=====EDIT2=====
The following doesn't take repeated resampling into account. The selected answer is much simpler.
Here's a self-fashioned solution for extracting probabilities from the trained model, but for standardization I'd prefer to get predict to behave.
grabProbs <- function(model) model$pred[, colnames(model$pred) %in% model$levels]
grabProbs(fit.lda)
Just use the newdata parameter and it will work:
predict(fit.lda, newdata = iris, type = "prob")
[EDITED]
As we can see, for lda the prediction result is identical with or without newdata; for randomForest it is not, because predict() without newdata returns the out-of-bag predictions:
library(MASS)
fit.lda <- lda(Species ~ . -Petal.Length, data = iris)
identical(predict(fit.lda), predict(fit.lda, newdata = iris))
# [1] TRUE
library(randomForest)
fit.rf <- randomForest(Species ~ . -Petal.Length, data = iris)
identical(predict(fit.rf), predict(fit.rf, newdata = iris))
# [1] FALSE

Bagging in R: How to use train with the bag function

In R, I am trying to use the bag function together with train. I start by using train with rpart for a classification tree model on the simple iris data set. Now I want to create a bag of 10 such trees with the bag function. The documentation says that the aggregate parameter must be a function that chooses a value from all the bagged models, so I created one called agg, which picks the most frequent string. However, the bag function gives the following error:
Error in fitter(btSamples[[iter]], x = x, y = y, ctrl = bagControl, v = vars, :
task 1 failed - "attempt to apply non-function"
Here is my complete code:
# Use bagging to create a bagged classification tree from 10 classification trees created with rpart.
library(caret)
library(plyr)  # for count() used in agg below
data(iris)
# Create training and testing data sets:
inTrain = createDataPartition(y = iris$Species, p = 0.7, list = F)
train = iris[inTrain, ]
test = iris[-inTrain, ]
# Create regressor and outcome datasets for the bag function:
regressors = train[, -5]
species = train[, 5]
# Create aggregate function:
agg = function(x, type) {
  y = count(x)
  y = y[order(y$freq, decreasing = T), ]
  as.character(y$x[1])
}
# Create bagged trees with the bag function:
treebag = bag(regressors, species, B = 10,
              bagControl = bagControl(fit = train(data = train, species ~ ., method = "rpart"),
                                      predict = predict,
                                      aggregate = agg))
This gives the error message stated above. I don't understand why it rejects the agg function.
From ?bag:
When using bag with train, classification models should use type = "prob" inside of the predict function so that predict.train(object, newdata, type = "prob") will work.
So I guess you might want to try:
bagControl = bagControl(fit = train(data = train, species ~ .,
                                    method = "rpart", type = "prob"),
                        predict = predict,
                        aggregate = agg)
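One more thing worth noting (my own sketch, not from ?bag): the "attempt to apply non-function" error typically means bagControl received the result of a train() call rather than a fitting function, since fit, predict and aggregate must all be functions. A hedged sketch along those lines, reusing the question's regressors and species with a majority-vote aggregator:
# Sketch only: wrap the train() call in a function instead of calling it inline
treebag <- bag(regressors, species, B = 10,
               bagControl = bagControl(
                 fit = function(x, y, ...) caret::train(x, y, method = "rpart"),
                 predict = function(object, x) predict(object, newdata = x),
                 # aggregate receives a list of per-model predictions; majority vote
                 aggregate = function(x, type = "class") {
                   votes <- do.call(cbind, lapply(x, as.character))
                   factor(apply(votes, 1, function(r) names(which.max(table(r)))))
                 }
               ))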
