All values of AUC ROC Curve 1 using tidymodels - r

I am trying to fit a LASSO model with a binary outcome using tidymodels. I have essentially copied the case study from the tidymodels website (https://www.tidymodels.org/start/case-study/) (the hotel stays dataset) and applied it to my own data, but for some reason all of the values on my area under the ROC curve are coming out at 1 (as you can see from the graph below). The only thing I have changed is the recipe (to try and suit my data):
recipe(outcome ~ ., data = df_train) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_predictors()) %>%
  step_medianimpute(all_predictors())
So I don't know whether my recipe is incorrect or my data is unsuitable for some reason. As mentioned, I have a binary outcome and 68 predictors (59 factors and 9 numeric); some do have missing data, but I thought step_medianimpute would deal with that. Many thanks for any help anyone can offer.
My AUC ROC Curve

Without seeing the data it is hard to know for sure, but your results indicate a couple of things.
Firstly, the AUC ROC of 1. An AUC ROC of 1 for a binary classification model indicates that the model is able to separate the two classes perfectly. This could either be a case of overfitting, or it could simply mean that you have linearly separable classes.
Secondly, the constant metric value across different values of the penalty. For a LASSO model, as the penalty increases, more and more variables are shrunk to zero. In your case, for all the values of the penalty (if you are following the post it will be 10^(-4) through 10^(-1)) you are seeing the same performance. That means that even with a penalty of 10^(-1) you still haven't shrunk enough predictors to hurt/change the performance. Reprex below:
set.seed(1234)
library(tidymodels)

response <- rep(c(0, 10), length.out = 1000)

data <- bind_cols(
  response = factor(response),
  map_dfc(seq_len(50), ~ rnorm(1000, response))
)

data_split <- initial_split(data)
data_train <- training(data_split)
data_test <- testing(data_split)

lasso_spec <- logistic_reg(mixture = 1, penalty = tune()) %>%
  set_engine("glmnet")

lasso_wf <- workflow() %>%
  add_model(lasso_spec) %>%
  add_formula(response ~ .)

data_folds <- vfold_cv(data_train)

param_grid <- tibble(penalty = 10^seq(-4, -1, length.out = 30))

tune_res <- tune_grid(
  lasso_wf,
  resamples = data_folds,
  grid = param_grid
)

autoplot(tune_res)
What you can do is expand the range of penalties until the performance changes. Below we see that once the penalty is high enough, the last important predictors get shrunk to zero and we lose performance.
param_grid <- tibble(penalty = 10^seq(-1, 0, length.out = 30))

tune_res <- tune_grid(
  lasso_wf,
  resamples = data_folds,
  grid = param_grid
)

autoplot(tune_res)
To verify, we fit the model using one of the well-performing penalties, and we get perfect predictions.
lasso_final <- finalize_workflow(lasso_wf, select_best(tune_res))
lasso_final_fit <- fit(lasso_final, data = data_train)

augment(lasso_final_fit, new_data = data_train) %>%
  conf_mat(truth = response, estimate = .pred_class)
#>           Truth
#> Prediction   0  10
#>         0  375   0
#>         10   0 375
Created on 2021-05-08 by the reprex package (v2.0.0)
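As an aside, here is a minimal sketch (assuming the lasso_final_fit and tune_res objects above, and the older pull_workflow_fit(); newer workflows versions call it extract_fit_parsnip()) of how you could count how many coefficients glmnet has actually shrunk to zero at the selected penalty:
best_penalty <- select_best(tune_res)$penalty

glmnet_fit <- pull_workflow_fit(lasso_final_fit)$fit   # underlying glmnet object
coefs <- as.matrix(coef(glmnet_fit, s = best_penalty)) # coefficients at that lambda
sum(coefs != 0)                                        # terms (incl. intercept) still in the model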

Related

tidymodels roc auc results in multiple classification are affected by first level of factor

Using the iris dataset, a knn classifier was tuned with iterative search and roc_auc as the metric, for the purpose of multiclass classification.
One AUC result per potential model was calculated as expected; nevertheless, this value is not stable, but is affected by:
the order of levels ("setosa", "virginica", "versicolor") in the Species column in the initial dataset
the order of columns in the roc_auc(truth = Species, .pred_setosa, .pred_virginica,.pred_versicolor)
Does this indicate that the AUC may be calculated by treating the first level of the Species column as the positive event (which is expected in binary classification, whereas in multiclass classification a single AUC based on e.g. a one-vs-all comparison would be appropriate)?
If so, is there a way to select a potential model based on e.g. the average of all the AUC values produced by the one-vs-all comparisons?
Could it also be implemented in the metric_set during the iterative search?
Thank you in advance for your support!
library(tidyverse)
library(tidymodels)
tidymodels_prefer()

df <- iris %>%
  mutate(Species = factor(Species, levels = c("virginica", "versicolor", "setosa")))

set.seed(2023)
splits <- initial_split(df, strata = Species, prop = 4/5)
df_train <- training(splits)
df_test <- testing(splits)

df_rec <- recipe(Species ~ ., data = df_train)

knn_model <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

df_wflow <- workflow() %>%
  add_model(knn_model) %>%
  add_recipe(df_rec)

set.seed(2023)
knn_cv <-
  df_wflow %>%
  tune_bayes(
    metrics = metric_set(roc_auc),
    resamples = vfold_cv(df_train, strata = "Species", v = 2),
    control = control_bayes(verbose = TRUE, save_pred = TRUE)
  )

cv_train_metrics <- knn_cv %>%
  collect_predictions() %>%
  group_by(.config, id) %>%
  roc_auc(truth = Species, .pred_setosa, .pred_virginica, .pred_versicolor)
roc_auc() expects that the columns holding the probability estimates are in the same order as the factor levels. We'll make the documentation better for that.
By default, we use the method of Hand and Till to compute the area under a single multiclass ROC curve, so this is not computing multiple ROC curves by default. You can change the estimator argument to use different averaging methods, though I would not suggest it for this metric.
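For illustration, a minimal sketch of choosing the estimator explicitly (reusing the knn_cv object above; remember to pass the probability columns in the same order as the factor levels):
knn_cv %>%
  collect_predictions() %>%
  group_by(.config, id) %>%
  roc_auc(truth = Species,
          .pred_virginica, .pred_versicolor, .pred_setosa, # same order as the levels
          estimator = "macro") # averages one-vs-all AUCs; "hand_till" is the default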

SVM performance not consistent with AUC score

I have a dataset that contains information about patients. It includes several variables and their clinical status (0 if they are healthy, 1 if they are sick).
I have tried to implement an SVM model to predict patient status based on these variables.
library(e1071)
library(ROCR)   # for prediction() and performance()

Index <- order(Ytrain, decreasing = FALSE)

SVMfit_Var <- svm(Xtrain[Index, ], Ytrain[Index],
                  type = "C-classification", gamma = 0.005,
                  probability = TRUE, cost = 0.001, epsilon = 0.1)

preds1 <- predict(SVMfit_Var, Xtest, probability = TRUE)
preds1 <- attr(preds1, "probabilities")[, 1]

samples <- !is.na(Ytest)
pred <- prediction(preds1[samples], Ytest[samples])
AUC <- performance(pred, "auc")@y.values[[1]]

prediction <- predict(SVMfit_Var, Xtest)
xtab <- table(Ytest, prediction)
To test the performance of the model, I have calculated the ROC AUC, and with the validation set I obtain an AUC = 0.997.
But when I view the predictions, all the patients have been assigned as healthy.
AUC = 0.997
> xtab
     prediction
Ytest  0  1
    0 72  0
    1 52  0
Can anyone help me with this problem?
Did you look at the probabilities versus the fitted values? You can read about how probability works with SVM here.
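As a minimal sketch (assuming the SVMfit_Var and Xtest objects from your code, and that your positive class is labelled "1"), check which column of the probabilities attribute you are actually indexing before taking [, 1], because the columns are ordered by the model's class levels, not by which class you consider positive:
pb <- attr(predict(SVMfit_Var, Xtest, probability = TRUE), "probabilities")
colnames(pb)        # confirm which column corresponds to the "sick" class
preds1 <- pb[, "1"] # assumed label "1" for sick; adjust to your actual level name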
If you want to look at the performance, you can use the DescTools library with the function Conf, or the caret library with the function confusionMatrix (they provide the same output).
library(DescTools)
library(caret)
# for the training performance with DescTools
Conf(table(SVMfit_Var$fitted, Ytrain[Index]))
# svm.model$fitted, y-values for training
# training performance with caret
confusionMatrix(SVMfit_Var$fitted, as.factor(Ytrain[Index]))
# svm.model$fitted, y-values
# if y.values aren't factors, use as.factor()
# for testing performance with DescTools
# with `table()` in your question, you must flip the order:
# predicted first, then actual values
Conf(table(prediction, Ytest))
# and for caret
confusionMatrix(prediction, as.factor(Ytest))
Your question isn't reproducible, so I went through this with iris data. The probability was the same for every observation. I included this, so you can see this with another data set.
library(e1071)
library(ROCR)
library(caret)
library(dplyr)   # for %>% and filter()

data("iris")

# make it binary
df1 <- iris %>% filter(Species != "setosa") %>% droplevels()

# check the subset
summary(df1)

set.seed(395) # keep the sample repeatable
tr <- sample(1:nrow(df1), size = 70, # 70%
             replace = F)

# create the model
svm.fit <- svm(df1[tr, -5], df1[tr, ]$Species,
               type = "C-classification",
               gamma = .005, probability = T,
               cost = .001, epsilon = .1)

# look at probabilities
pb.fit <- predict(svm.fit, df1[-tr, -5], probability = T)

# this shows EVERY row has the same outcome probability distro
pb.fit <- attr(pb.fit, "probabilities")[, 1]

# look at performance
performance(prediction(pb.fit, df1[-tr, ]$Species), "auc")@y.values[[1]]
# [1] 0.03555556 that's abysmal!!

# test the model
p.fit = predict(svm.fit, df1[-tr, -5])
confusionMatrix(p.fit, df1[-tr, ]$Species)
# 93% accuracy with NIR at 50%... the AUC score was not useful

# check the trained model performance
confusionMatrix(svm.fit$fitted, df1[tr, ]$Species)
# 87%, with NIR at 50%... that's really good

ROC with cross-validation for linear regression in R

A two-part question: I'm trying to figure out (1) how to generate a ROC curve for a linear regression using lm() (properly, if that is even right?), and (2) how to do it with k-fold cross-validation so I can get the mean ROC curve (and AUC).
If the outcome is a continuous variable, it has to be converted into a binary variable, right? Normally I would fit a logistic regression model using glm(..., family = 'binomial') instead, but is that the most appropriate way? (It seems like I'm just fitting a different model.)
I would like something like this plot below from the cvAUC package's rdrr.io website (red line is the mean ROC curve, dotted lines are k-fold ROC curves), but I'm not sure how to get there with my data.
Example with data(USArrests):
library(dplyr)
library(pROC)
data(USArrests)
# create train and test sets
set.seed(2021)
dat <- mutate(USArrests, index=1:nrow(USArrests))
train.dat <- sample_frac(dat, 0.5) # splits `dat` in half
test.dat <- subset(dat, !dat$index %in% train.dat$index) # uses other half to test
# trying to build predictions with lm()
fit <- lm(Murder ~ Assault, data = train.dat)
predicted <- predict(fit, test.dat, type = "response")
# roc curve
roc(test.dat$Murder ~ predicted, plot = TRUE, print.auc = TRUE) # AUC = 1.000
The code above gets results, but gives a warning:
Warning message:
In roc.default(response, m[[predictors]], ...) :
'response' has more than two levels. Consider setting 'levels' explicitly or using 'multiclass.roc' instead
I don't know what to do from its suggestion. It also got an AUC = 1.000 -- is this approach wrong, and why?
Moreover, it's only working with one train/test set. I'm not sure how to train with k-fold sets. I think I have to combine it with caret::train() somehow. I tried the ROC solutions for random forest models from "ROC curve from training data in caret", but they are not working with my code.
Example:
library(caret)
library(MLeval)
train_control <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
rfFit <- train(Murder ~ Assault, data = USArrests, trControl = train_control, method = "lm")
rfFit$pred$mtry # NULL
res <- MLeval::evalm(rfFit) # error with error message below
MLeval: Machine Learning Model Evaluation
Input: caret train function object
Not averaging probs.
Group 1 type: cv
Error in [.data.frame(preds, c(G1, G2, "obs")) :
undefined columns selected
You could do the cross-validation like this if you switched it to a 0/1 variable:
USArrests <- USArrests %>%
  mutate(Murder01 = as.numeric(Murder > mean(Murder, na.rm = TRUE)))

# create train and test sets
set.seed(2021)

cvfun <- function(split, ...){
  mod <- glm(Murder01 ~ Assault, data = analysis(split), family = binomial)
  fit <- predict(mod, newdata = assessment(split), type = "response")
  data.frame(fit = fit,
             y = model.response(model.frame(formula(mod), data = assessment(split))))
}

library(rsample)
library(purrr)
library(tidyverse)

cv_out <- vfold_cv(USArrests, v = 10, repeats = 5) %>%
  mutate(fit = map(splits, cvfun)) %>%
  unnest(fit) %>%
  group_by(id) %>%
  summarise(auc = roc(y, fit, plot = FALSE)$auc[1])

cv_out
# # A tibble: 5 x 2
#   id          auc
# * <chr>     <dbl>
# 1 Repeat1   0.936
# 2 Repeat2   0.928
# 3 Repeat3   0.937
# 4 Repeat4   0.918
# 5 Repeat5   0.942
That said, I'm not sure this is better than using something like the R-squared or MSE of the linear model. And I'm not super confident that the algorithm in the tutorial is actually doing something that makes sense statistically. I could definitely be wrong and would defer to someone with more expertise, but I can't see how it makes a lot of sense, and it certainly doesn't produce something meaningful in this case: you would expect an AUC of 1 only with perfect prediction.
Further, I'm not sure what probative value these numbers have. Generally you would want to use this sort of analysis to tune the model specification, often by finding nearly optimal values of hyper-parameters. You could imagine doing this with a different model specification. For example, you could evaluate the relative predictive power of a model with a second-degree polynomial in Assault versus one that is linear, as below.
cvfun2 <- function(split, ...){
  mod <- glm(Murder01 ~ poly(Assault, 2), data = analysis(split), family = binomial)
  fit <- predict(mod, newdata = assessment(split), type = "response")
  data.frame(fit = fit,
             y = model.response(model.frame(formula(mod), data = assessment(split))))
}

cv_out2 <- vfold_cv(USArrests, v = 10, repeats = 5) %>%
  mutate(fit = map(splits, cvfun2)) %>%
  unnest(fit) %>%
  group_by(id) %>%
  summarise(auc = roc(y, fit, plot = FALSE)$auc[1])

mean(cv_out2$auc)
# [1] 0.9123994

mean(cv_out$auc)
# [1] 0.9320451
Edit - Making the ROC plot
cv_out_plot <- vfold_cv(USArrests, v = 10, repeats = 5) %>%
  mutate(fit = map(splits, cvfun)) %>%
  unnest(fit) %>%
  group_by(id) %>%
  summarise(sens = roc(y, fit, plot = FALSE)$sensitivities,
            spec = roc(y, fit, plot = FALSE)$specificities,
            obs = 1:length(sens))

ave <- cv_out_plot %>%
  ungroup %>%
  group_by(obs) %>%
  summarise(sens = mean(sens),
            spec = mean(spec),
            id = "Average")

cv_out_plot <- bind_rows(cv_out_plot, ave) %>%
  mutate(col = factor(ifelse(id == "Average", "Average", "Individual"),
                      levels = c("Individual", "Average")))

ggplot(cv_out_plot, aes(x = 1 - sens, y = spec, group = id, colour = col)) +
  geom_line(aes(size = col, alpha = col)) +
  scale_colour_manual(values = c("black", "red")) +
  scale_size_manual(values = c(.5, 1.25)) +
  scale_alpha_manual(values = c(.3, 1)) +
  theme_classic() +
  theme(legend.position = c(.75, .15)) +
  labs(x = "1-Sensitivity", y = "Specificity", colour = "", alpha = "", size = "")

How to incorporate tidy models PCA into the workflow of a model and make predictions

I am trying to incorporate tidy models PCA into the workflow of a model. I want to have a predictive model that uses PCA as a preprocessing step and then make predictions with that model.
I have tried the following approach:
diamonds <- diamonds %>%
  select(-clarity, -cut, -color)

diamonds_split <- initial_split(diamonds, prop = 4/5)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)
diamonds_test <- vfold_cv(diamonds_train)

diamonds_recipe <-
  # The basic formula and all the data (outcome ~ predictors)
  recipe(price ~ ., data = diamonds_train) %>%
  step_log(all_outcomes(), skip = T) %>%
  step_normalize(all_predictors(), -all_nominal()) %>%
  step_pca(all_predictors())

preprocesados <- prep(diamonds_recipe)

linear_model <-
  linear_reg() %>%
  set_engine("glmnet") %>%
  set_mode("regression")

pca_workflow <- workflow() %>%
  add_recipe(diamonds_recipe) %>%
  add_model(linear_model)

lr_fitted_workflow <- pca_workflow %>%  # option A: workflow, full dataset
  last_fit(diamonds_split)

performance <- lr_fitted_workflow %>% collect_metrics()
test_predictions <- lr_fitted_workflow %>% collect_predictions()
But I get this error:
x Resample1: model (predictions): Error: penalty should be a single numeric value. ...
Warning message:
“All models failed in [fit_resamples()]. See the .notes column.”
Following other tutorials, I tried to use this other approach, but I don't know how to use the model to make new predictions, because the new data comes in the original (non-PCA) form. So I tried this:
pca_fit <- juice(preprocesados) %>%  # option C: no workflow at all
  lm(price ~ ., data = .)

prep_test <- prep(diamonds_recipe, new_data = diamonds_test)
truths <- juice(prep_test) %>%
  select(price)

ans <- predict(pca_fit, new_data = prep_test)
tib <- tibble(row = 1:length(ans), ans, truths)

ggplot(data = tib) +
  geom_smooth(mapping = aes(x = row, y = ans, colour = "predicted")) +
  geom_smooth(mapping = aes(x = row, y = price, colour = "true"))
And it prints something that seems reasonable, but by this point I have lost confidence and some guidance would be much appreciated. :D
The problem is not in your recipe or the workflow. As described in chapter 7 of Tidy Modeling with R, the function for fitting your model is fit(), and for it to work you have to provide the data for the fitting process (here diamonds_train). The tradeoff is that you don't have to prep() your recipe, as the workflow will take care of this itself.
So reducing your code slightly, the example below will work.
library(tidymodels)
data(diamonds)

diamonds <- diamonds %>%
  select(-clarity, -cut, -color)

diamonds_split <- initial_split(diamonds, prop = 4/5)
diamonds_train <- training(diamonds_split)
diamonds_test <- testing(diamonds_split)

diamonds_recipe <-
  # The basic formula and all the data (outcome ~ predictors)
  recipe(price ~ ., data = diamonds_train) %>%
  step_log(all_outcomes(), skip = T) %>%
  step_normalize(all_predictors(), -all_nominal()) %>%
  step_pca(all_predictors())

linear_model <-
  linear_reg() %>%
  set_engine("glmnet") %>%
  set_mode("regression")

pca_workflow <- workflow() %>%
  add_recipe(diamonds_recipe) %>%
  add_model(linear_model)

pca_fit <- fit(pca_workflow, data = diamonds_train)
As for cross-validation, one has to use fit_resamples() and should split the training set, not the testing set. But here I am currently getting the same error (my answer will be updated if I figure out why).
Edit
Now I've done a bit of digging, and the problem with cross-validation stems from the engine being glmnet. I am guessing that, among the many different aspects, this one has somehow been missed. I've added a possible issue to the workflows package GitHub site. The answers there are often quick in coming, so one of the developers will likely reply soon.
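If you want to stay with glmnet in the meantime, a possible workaround (a sketch on my part, with an arbitrary penalty value) is to give the spec a single penalty, since the glmnet engine needs one to produce predictions during resampling:
linear_model_glmnet <-
  linear_reg(penalty = 0.001) %>% # 0.001 is an arbitrary illustrative value
  set_engine("glmnet") %>%
  set_mode("regression")

pca_workflow_glmnet <- update_model(pca_workflow, linear_model_glmnet)
folds_glmnet <- vfold_cv(diamonds_train, 10)
fit_resamples(pca_workflow_glmnet, resamples = folds_glmnet)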
As for cross-validation, assume you instead fit using any of the other engines described in ?linear_reg; then we could do it as follows:
linear_model_base <-
  linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

pca_workflow <- update_model(pca_workflow, linear_model_base)

folds <- vfold_cv(diamonds_train, 10)
pca_folds_fit <- fit_resamples(pca_workflow, resamples = folds)
In the case where metrics are of interest, these can indeed be collected as you did, using collect_metrics():
pca_folds_fit %>% collect_metrics()
If we are interested in the predictions, you'll have to tell the model to save these during the fitting process and then use collect_predictions():
pca_folds_fit <- fit_resamples(pca_workflow, resamples = folds,
                               control = control_resamples(save_pred = TRUE))
collect_predictions(pca_folds_fit)
Note, however, that the output from this is the predictions from each fold, as you are literally fitting 10 models.
Usually cross-validation is used to compare multiple models or tuning parameters (e.g. random forest vs. linear model). The model with the best cross-validation performance (collect_metrics) would then be selected for use, and the test dataset would be used to evaluate this model's accuracy.
This is all described in TMwR, chapters 10 & 11.
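For that final test-set evaluation, a minimal sketch (reusing the lm-based pca_workflow and diamonds_split from above) would be to use last_fit(), which fits on the training data and scores the held-out test data in one step:
final_res <- last_fit(pca_workflow, split = diamonds_split)
collect_metrics(final_res)     # test-set metrics
collect_predictions(final_res) # test-set predictions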

tidymodels: ranger with cross validation

The dataset can be found here:
https://www.kaggle.com/mlg-ulb/creditcardfraud
I am trying to use tidymodels to run ranger with 5-fold cross-validation on this dataset.
I have two code blocks. The first code block is the original code with the full data. The second code block is almost identical to the first, except I have subset a portion of the data so the code runs faster. The second block is just to make sure my code works before I run it on the original dataset.
Here is the first code block with the full data:
#load packages
library(tidyverse)
library(tidymodels)
library(tune)
library(workflows)
#load data
df <- read.csv("~creditcard.csv")
#check for NAs and convert Class to factor
anyNA(df)
df$Class <- as.factor(df$Class)
#set seed and split data into training and testing
set.seed(123)
df_split <- initial_split(df)
df_train <- training(df_split)
df_test <- testing(df_split)
#in the training and testing datasets, how many are fraudulent transactions?
df_train %>% count(Class)
df_test %>% count(Class)
#ranger model with 5-fold cross validation
rf_spec <-
  rand_forest() %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

all_wf <-
  workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(rf_spec)

cv_folds <- vfold_cv(df_train, v = 5)
cv_folds

rf_results <-
  all_wf %>%
  fit_resamples(resamples = cv_folds)

rf_results %>%
  collect_metrics()
Here is the second code block with 1,000 rows:
#load packages
library(tidyverse)
library(tidymodels)
library(tune)
library(workflows)
#load data
df <- read.csv("~creditcard.csv")
###################################################################################
#Testing area#
df <- df %>% arrange(-Class) %>% head(1000)
###################################################################################
#check for NAs and convert Class to factor
anyNA(df)
df$Class <- as.factor(df$Class)
#set seed and split data into training and testing
set.seed(123)
df_split <- initial_split(df)
df_train <- training(df_split)
df_test <- testing(df_split)
#in the training and testing datasets, how many are fraudulent transactions?
df_train %>% count(Class)
df_test %>% count(Class)
#ranger model with 5-fold cross validation
rf_spec <-
  rand_forest() %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

all_wf <-
  workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(rf_spec)

cv_folds <- vfold_cv(df_train, v = 5)
cv_folds

rf_results <-
  all_wf %>%
  fit_resamples(resamples = cv_folds)

rf_results %>%
  collect_metrics()
1) With the first code block, I can assign and print cv_folds in the console. The Global Environment pane says cv_folds has 5 obs. of 2 variables. When I View(cv_folds), I have columns labeled splits and id, but there are no rows and no data. When I use str(cv_folds), I get the blank line that means R is "thinking", but there is no red STOP icon I can push. The only thing I can do is force quit RStudio. Maybe I just need to wait longer? I am not sure. When I do the same thing with the smaller second code block, str() works fine.
2) My overall goal for this project is to split the dataset into training and testing sets, then partition the training data with 5-fold cross-validation and train a ranger model on it. Next, I want to examine the metrics of my model on the training data. Then I want to test my model on the testing set and view the metrics. Eventually, I want to swap out ranger for something like xgboost. Please give me advice on what parts of my code I can add/modify to improve. I am still missing the portion of testing my model on the testing set.
I think the Predictions portion of this article might be what I'm aiming for.
https://rviews.rstudio.com/2019/06/19/a-gentle-intro-to-tidymodels/
3) When I use rf_results %>% collect_metrics(), it only shows accuracy and roc_auc. How do I get sensitivity, specificity, precision, and recall?
4) How do I plot importance? Would I use something like this?
rf_fit <- get_tree_fit(all_wf)
vip::vip(rf_fit, geom = "point")
5) How can I drastically reduce the amount of time it takes to train the model? Last time I ran ranger with 5-fold cross-validation using caret on this dataset, it took 8+ hours (6 cores, 4.0 GHz, 16 GB RAM, SSD, GTX 1060). I am open to anything (e.g. restructuring code, AWS computing, parallelization, etc.).
Edit: This is another way I have tried to set this up
#ranger model with 5-fold cross validation
rf_recipe <- recipe(Class ~ ., data = df_train)

rf_engine <-
  rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

rf_grid <- grid_random(
  mtry() %>% range_set(c(1, 20)),
  trees() %>% range_set(c(500, 1000)),
  min_n() %>% range_set(c(2, 10)),
  size = 30)

all_wf <-
  workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(rf_engine)

cv_folds <- vfold_cv(df_train, v = 5)
cv_folds

#####
rf_fit <- tune_grid(
  all_wf,
  resamples = cv_folds,
  grid = rf_grid,
  metrics = metric_set(roc_auc),
  control = control_grid(save_pred = TRUE)
)

collect_metrics(rf_fit)

rf_fit_best <- select_best(rf_fit)
(wf_rf_best <- finalize_workflow(all_wf, rf_fit_best))
I started with your last block of code and made some edits to get a functional workflow. I have answered your questions along the way in the code, and I have taken the liberty of giving you some advice and reformatting your code.
## Packages, seed and data
library(tidyverse)
library(tidymodels)

set.seed(123)
df <- read_csv("creditcard.csv")

df <-
  df %>%
  arrange(-Class) %>%
  head(1000) %>%
  mutate(Class = as_factor(Class))

## Modelisation
# Initial split
df_split <- initial_split(df)
df_train <- training(df_split)
df_test <- testing(df_split)
You can see that df_split returns <750/250/1000> (see below).
2) To tune the xgboost model, there is very little to change.
# Models
model_rf <-
  rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("classification")

model_xgboost <-
  boost_tree(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("xgboost", importance = "impurity") %>%
  set_mode("classification")
Here you choose your hyperparameter grid. I advise you to use a non-random grid in order to cover the space of hyperparameters in an optimal way.
# Grid of hyperparameters
grid_rf <-
  grid_max_entropy(
    mtry(range = c(1, 20)),
    trees(range = c(500, 1000)),
    min_n(range = c(2, 10)),
    size = 30)
These are your workflows; as you can see, there is virtually nothing to change.
# Workflows
wkfl_rf <-
  workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(model_rf)

wkfl_xgboost <-
  workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(model_xgboost)
1) <600/150/750> means that you have 600 observations in your training set, 150 in your validation set, and a total of 750 observations in the original dataset. Please note that, here, 600 + 150 = 750, but this is not always the case (e.g. with bootstrap resampling methods).
# Cross validation method
cv_folds <- vfold_cv(df_train, v = 5)
cv_folds
3) Here you choose which metrics you want to collect during tuning, using the yardstick package.
# Choose metrics
my_metrics <- metric_set(roc_auc, accuracy, sens, spec, precision, recall)
Then you can fit the different models over the grid. For the control parameters, I'd print the progress and not save the predictions (imho).
# Tuning
rf_fit <- tune_grid(
  wkfl_rf,
  resamples = cv_folds,
  grid = grid_rf,
  metrics = my_metrics,
  control = control_grid(verbose = TRUE) # don't save prediction (imho)
)
These are some useful functions for dealing with the rf_fit object.
# Inspect tuning
rf_fit
collect_metrics(rf_fit)
autoplot(rf_fit, metric = "accuracy")
show_best(rf_fit, metric = "accuracy", maximize = TRUE)
select_best(rf_fit, metric = "accuracy", maximize = TRUE)
Finally, you can fit your model with the best parameters.
# Fit best model
tuned_model <-
  wkfl_rf %>%
  finalize_workflow(select_best(rf_fit, metric = "accuracy", maximize = TRUE)) %>%
  fit(data = df_train)

predict(tuned_model, df_train)
predict(tuned_model, df_test)
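To cover the test-set step you said you were missing (point 2), a minimal sketch reusing wkfl_rf, rf_fit, my_metrics and df_split from above would be last_fit(), which refits on the training data and scores the held-out test data:
final_res <-
  wkfl_rf %>%
  finalize_workflow(select_best(rf_fit, metric = "accuracy")) %>%
  last_fit(df_split, metrics = my_metrics)

collect_metrics(final_res)     # test-set metrics
collect_predictions(final_res) # test-set predictions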
4) Unfortunately, methods for dealing with random forest objects are usually not available for parsnip outputs.
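That said, a workaround you can try (a sketch assuming the vip package and the tuned_model object above) is to pull the fitted ranger engine out of the workflow and hand it to vip, since importance = "impurity" was set on the engine:
library(vip)

tuned_model %>%
  pull_workflow_fit() %>% # extract_fit_parsnip() in newer workflows versions
  vip(geom = "point")     # plots the impurity-based importance scores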
5) You can have a look at the vignette about parallelization.
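As a rough sketch of one common setup (assuming the doParallel package; tune picks up a registered foreach backend automatically):
library(doParallel)

cl <- parallel::makePSOCKcluster(parallel::detectCores() - 1)
registerDoParallel(cl)

rf_fit <- tune_grid(
  wkfl_rf,
  resamples = cv_folds,
  grid = grid_rf,
  metrics = my_metrics,
  control = control_grid(verbose = TRUE)
)

parallel::stopCluster(cl)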
