Creating Custom Folds For Caret CV - r

I'm using the caret package to fit and cross-validate a model:
model <- caret::train(mpg ~ wt
+ drat
+ disp
+ qsec
+ as.factor(am),
data = mtcars,
method = "lm",
trControl = caret::trainControl(method = "cv",
repeats=5,
returnData =FALSE))
However, I'd like to pass trainControl a custom set of row indices for my folds. This can be done via the index and indexOut arguments.
model <- caret::train(wt ~ + disp + drat,
data = mtcars,
method = "lm",
trControl = caret::trainControl(method = "cv",
returnData =FALSE,
index = indicies$train,
indexOut = indicies$test))
What I'm struggling with is that I only want to test on rows of mtcars where am == 0. createFolds won't work for this on its own, because it has no way to add a criterion. Does anyone know of another function that splits rows into K folds and lets me apply the criterion am == 0 when creating indicies$test?

I think this should work. Just feed the index with the desired row index.
index = list(which(mtcars$am == 0))
model <- caret::train(
wt ~ +disp + drat,
data = mtcars,
method = "lm",
trControl = caret::trainControl(
method = "cv",
returnData = FALSE,
index = index
)
)
The index argument is a list with one element per resampling iteration, so you can supply as many iterations as you want by adding more elements to that list.

Thanks for your help. I got there in the end by modifying the output from createFolds. mtcars is not the best example because it's such a small dataset, but you get the idea:
# createFolds expects the outcome vector, not the whole data frame
folds <- caret::createFolds(mtcars$wt, k = 2)
indicies <- list()
# Training rows: everything outside the fold
indicies$train <- lapply(folds, function(x) which(!seq_len(nrow(mtcars)) %in% x))
# Test rows: rows inside the fold that also meet the criterion am == 0
indicies$test <- lapply(folds, function(x) which(seq_len(nrow(mtcars)) %in% x & mtcars[, "am"] == 0))
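The same idea can be sketched without caret, which makes it easy to check that the held-out rows really satisfy the criterion. This is only an illustration under stated assumptions: fold_id and indices are hypothetical names, and the random fold assignment is a base-R stand-in for caret::createFolds, not a reimplementation of it.

```r
# Base-R sketch; fold_id and indices are illustrative names, not caret objects.
set.seed(42)
n <- nrow(mtcars)
k <- 2

# Randomly assign every row to one of k equally sized folds
# (a stand-in for caret::createFolds, so no package is needed here).
fold_id <- sample(rep(seq_len(k), length.out = n))
folds <- split(seq_len(n), fold_id)
names(folds) <- paste0("Fold", seq_len(k))

indices <- list(
  # Training rows: everything outside the fold.
  train = lapply(folds, function(x) setdiff(seq_len(n), x)),
  # Held-out rows: rows inside the fold that also satisfy am == 0.
  test = lapply(folds, function(x) x[mtcars$am[x] == 0])
)

# Sanity checks: every held-out row has am == 0,
# and no row is both trained on and held out within the same fold.
stopifnot(all(mtcars$am[unlist(indices$test)] == 0))
overlap <- mapply(function(tr, te) length(intersect(tr, te)),
                  indices$train, indices$test)
stopifnot(all(overlap == 0))
```

The two lists can then be passed to caret::trainControl as index = indices$train and indexOut = indices$test, as in the question.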

Related

ggparty and tidymodels, cannot plot final node graphs, no data attached to model(?)

I am experimenting with a tidymodels workflow for ctree using the new bonsai package, an extension for modeling with partykit. Here is my code:
pacman::p_load(tidymodels, bonsai, modeldata, finetune)
data(penguins)
doParallel::registerDoParallel()
split <- initial_split(penguins, strata = species)
df_train <- training(split)
df_test <- testing(split)
folds <-
# vfold_cv(train, strata = penguins)
bootstraps(df_train, strata = species, times = 5) # if small number of records
tree_recipe <-
recipe(formula = species ~ flipper_length_mm + island, data = df_train)
tree_spec <-
decision_tree(
tree_depth = tune(),
min_n = tune()
) %>%
set_engine("partykit") %>%
set_mode("classification")
tree_workflow <-
workflow() %>%
add_recipe(tree_recipe) %>%
add_model(tree_spec)
set.seed(8833)
tree_tune <-
tune_sim_anneal(
tree_workflow,
resamples = folds,
iter = 30,
initial = 4,
metrics = metric_set(roc_auc, pr_auc, accuracy))
final_workflow <- finalize_workflow(tree_workflow, select_best(tree_tune, "roc_auc"))
final_fit <- last_fit(final_workflow, split = split)
I understand that to extract a final fit model I need to:
final_model <- extract_fit_parsnip(final_fit)
And then I can plot the tree.
plot(final_model$fit)
I would like to try a different plotting library that works with partykit:
library(ggparty)
ggparty(final_model$fit)+
geom_edge() +
geom_edge_label() +
geom_node_splitvar() +
geom_node_plot(
gglist = list(geom_bar(x = "", color = species),
xlab("species")),
# draw individual legend for each plot
shared_legend = FALSE
)
The ggparty code works up to the last layer, geom_node_plot(). Without it the tree looks good, but it prints without plots in the terminal nodes.
With it, the plot does not see the data inside the fitted model, namely the response variable species:
Error in layer(data = data, mapping = mapping, stat = stat, geom = GeomBar, :
object 'species' not found
How can I extract the final fit from tidymodels so that it contains the fitted values, as it would if I had built the model without a tidymodels workflow?

There are two problems in your code, only one of them related to tidymodels.
First, the arguments to geom_bar() need to be wrapped in aes(). This is necessary both for plain ctree() output and for the result from the tidymodels workflow.
Second, the dependent variable in the output from the tidymodels workflow is no longer called species but ..y (presumably a standardized placeholder employed in tidymodels). This can be seen by simply printing the object:
final_model$fit
## Model formula:
## ..y ~ flipper_length_mm + island
##
## Fitted party:
## [1] root
## ...
Addressing both of these (plus using the fill= instead of color= aesthetic) works as intended. (Bonus comment: autoplot(final_model$fit) also just works!)
ggparty(final_model$fit) +
geom_edge() +
geom_edge_label() +
geom_node_splitvar() +
geom_node_plot(gglist = list(
geom_bar(aes(x = "", fill = ..y)),
xlab("species")
))

R Caret's Train function quotation marks problem

I am using caret's train function inside my own function to train multiple models. Because caret cannot handle the quotation marks in the formula input, I tried removing the quotes with base R's noquote function. I cannot remove the quotation marks from the input values beforehand, because other parts of the function need the input as a quoted string. Thanks in advance!
Code:
i <- "Dax.Shepard"
celeg_lgr = train(noquote(i) ~ ., method = "glm",
family = binomial(link = "logit"), data = celeb_trn,
trControl = trainControl(method = 'cv', number = 5))
Running this code results in the following error:
Error in model.frame.default(form = op ~ ., data = celeb_trn, na.action = na.fail) :
variable lengths differ (found for 'Dax.Shepard')
PS.
Running the code like this does not result in any error:
celeg_lgr = train(Dax.Shepard ~ ., method = "glm",
family = binomial(link = "logit"), data = celeb_trn,
trControl = trainControl(method = 'cv', number = 5))
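A likely explanation, offered as a sketch rather than a confirmed diagnosis: noquote() only changes how a string is printed, so the left-hand side of the formula is still an expression that evaluates to a length-one character vector rather than the column, which would account for the "variable lengths differ" error. A common workaround is to build the formula from the string. The example below uses base R, with mtcars standing in for celeb_trn (which is not available here); resp is a hypothetical name, and the train() call appears only as a comment.

```r
i <- "Dax.Shepard"

# Build the formula from the string; the string itself stays usable elsewhere.
f <- as.formula(paste(i, "~ ."))
print(f)

# Same idea on a built-in dataset (resp is an illustrative name).
resp <- "mpg"
fit <- lm(as.formula(paste(resp, "~ wt + disp")), data = mtcars)
print(names(coef(fit)))

# With the objects from the question, the call would then look like:
# train(f, method = "glm", family = binomial(link = "logit"),
#       data = celeb_trn, trControl = trainControl(method = "cv", number = 5))
```

Column names that are not syntactically valid R names would need to be wrapped in backticks before being pasted into the formula.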

Error building partial dependence plots for RF using FinalModel output from caret's train() function

I am using the following code to fit and test a random forest classification model:
control <- trainControl(method='repeatedcv',
number=5,repeats = 3,
search='grid')
tunegrid <- expand.grid(.mtry = (1:12))
rf_gridsearch <- train(y = river$stat_bino,
x = river[,colnames(river) != "stat_bino"],
data = river,
method = 'rf',
metric = 'Accuracy',
ntree = 600,
importance = TRUE,
tuneGrid = tunegrid, trControl = control)
Note that I am using
train(y = river$stat_bino, x = river[,colnames(river) != "stat_bino"],...
rather than train(stat_bino ~ .,...
so that my categorical variables are not turned into dummy variables (solution from: variable encoding in K-fold validation of random forest using package 'caret').
I would like to extract the finalModel and use it to make partial dependence plots for my variables (using the code below), but I get an error message and don't know how to fix it.
model1 <- rf_gridsearch$finalModel
library(pdp)
partial(model1, pred.var = "MAXCL", type = "classification", which.class = "1", plot =TRUE)
Error in eval(stats::getCall(object)$data) :
..1 used in an incorrect context, no ... to look in
Thanks for any solutions here!
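A possible fix, not verified against the river data: the finalModel object does not seem to carry the training data in a form that pdp can recover, so partial() may need to be given it explicitly through its train argument. The sketch below uses iris as a stand-in dataset (an assumption, since river is not available) and only runs if randomForest and pdp are installed; x, rf and pd are illustrative names.

```r
pd <- NULL
if (requireNamespace("randomForest", quietly = TRUE) &&
    requireNamespace("pdp", quietly = TRUE)) {
  set.seed(1)
  # Fit with the x/y interface, as in the question, so no formula or data is stored.
  x <- iris[, 1:4]
  rf <- randomForest::randomForest(x = x, y = iris$Species, ntree = 50)

  # Supply the predictor data frame explicitly via train.
  pd <- pdp::partial(rf, pred.var = "Petal.Length", which.class = 1L, train = x)
  print(head(pd))
}

# With the objects from the question, the analogous call would be:
# partial(model1, pred.var = "MAXCL", type = "classification", which.class = "1",
#         plot = TRUE, train = river[, colnames(river) != "stat_bino"])
```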

How do I plot a single numerical covariate using emmeans (or other package) from a model?

After variable selection I usually end up with a model containing a numerical covariate (as a 2nd or 3rd degree polynomial). I would like to plot its fitted effect, preferably using the emmeans package. Is there a way of doing it?
I can do it using predict:
m1 <- lm(mpg ~ poly(disp,2), data = mtcars)
df <- cbind(disp = mtcars$disp, predict.lm(m1, interval = "confidence"))
df <- as.data.frame(df)
ggplot(data = df, aes(x = disp, y = fit)) +
geom_line() +
geom_ribbon(aes(ymin = lwr, ymax = upr, x = disp, y = fit),alpha = 0.2)
I haven't figured out a way of doing it using either emmip or emtrends.
For illustration purposes, how could I do it for a mixed model fitted via lme?
m1 <- lme(mpg ~ poly(disp,2), random = ~1|factor(am), data = mtcars)

I suspect that your issue is due to the fact that, by default, covariates are reduced to their means in emmeans. You can use the at or cov.reduce arguments to specify a larger number of values. See the documentation for ref_grid and vignette("basics", "emmeans"), or the index of vignette topics.
Using sjPlot:
plot_model(m1, terms = "disp [all]", type = "pred")
gives the same graphic.
Using emmeans:
em1 <- ref_grid(m1, at = list(disp = seq(min(mtcars$disp), max(mtcars$disp), 1)))
emmip(em1, ~disp, CIs = TRUE)
returns a graphic with a small difference in layout. An alternative is to store the plotting data in an object and then plot it however I want:
d1 <- emmip(em1, ~disp, CIs = TRUE, plotit = FALSE)

Using caret with recipes is leading to difficulties with resample

I've been using recipes to pipe into caret::train, which has been going well, but now I've tried some step_transforms, I'm getting the error:
Error in resamples.default(model_list) :
There are different numbers of resamples in each model
when I compare models with and without the transformations. The same code with step_center and step_scale works fine.
library(caret)
library(tidyverse)
library(tidymodels)
formula <- price ~ carat
model_recipe <- recipe(formula, data = diamonds)
quadratic_model_recipe <- recipe(formula, data = diamonds) %>%
step_poly(all_predictors())
model_list <- list(
linear_model = NULL,
quadratic = NULL
)
model_list$linear_model <-
model_recipe %>% train(
data = diamonds,
method = "lm",
trControl = trainControl(method = "cv"))
model_list$quadratic_model <-
quadratic_model_recipe %>% train(
data = diamonds,
method = "lm",
trControl = trainControl(method = "cv"))
resamp <- resamples(model_list)

The problem is a naming mismatch in model_list: quadratic = NULL should have been quadratic_model = NULL.
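To see why the mismatch matters, here is a small base-R sketch in which character strings stand in for the fitted train objects (an assumption made so that it runs without caret). Assigning to a name that is not already in the list adds a third element, and the original placeholder stays NULL, which is presumably what resamples() trips over.

```r
# Character strings stand in for fitted train objects.
model_list <- list(linear_model = NULL, quadratic = NULL)
model_list$linear_model <- "linear fit"
model_list$quadratic_model <- "quadratic fit"

# The list now has three elements, one of them still an empty placeholder.
print(names(model_list))
print(is.null(model_list[["quadratic"]]))

# Starting from an empty list avoids placeholder names that can drift.
fixed_list <- list()
fixed_list$linear_model <- "linear fit"
fixed_list$quadratic_model <- "quadratic fit"
print(names(fixed_list))
```

Starting from an empty list is a design choice rather than a requirement; correcting the placeholder name in the original list works just as well.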

Resources