Caret obtain train & cv predictions from model to plot - r

I've trained a simple model:
mySim <- train(Event ~ .,
               method = 'nnet',
               data = train,
               tuneGrid = tg)
I'm optimising the two nnet parameters: decay (weight decay) and size (the number of hidden units). I'm new to caret, so what I would usually do is plot the train error and CV error for each model built. To do this, I'd need the predicted values from both the train and validation passes.
This is the first time I've used cross-validation, so I'm a little unsure how to go about getting the predictions from the train and hold-out sets at each tuneGrid iteration.
If I have a grid search of length 3 (3 models to build) and 5-fold cross-validation, I assume I'm going to end up with 15 sets of train & hold-out predictions (5 folds for each of the 3 models).
The plot I'm essentially looking to build has a performance metric on the y-axis (say, entropy loss, for the sake of classification with nnet) and the size grid-search values on the x-axis, increasing from 0 to the maximum.
Is there a way I can extract the predicted values for the train / hold-out sets during trainControl cross-validation?
I've looked through some of the attributes train returns, but I'm not sure if I'm missing something.
I know this question lacks code, but hopefully I've explained myself.
Update
Am I correct in assuming that setting the following parameters in trainControl will return the predictions, allowing me to create this plot?
returnResamp
savePredictions

caret::train keeps only the hold-out predictions. If you specify savePredictions = "all", it will save the hold-out predictions for all hyper-parameter combinations. However, it does not save the train-set predictions. You could generate them afterwards with the knowledge of which indexes were used for the hold-outs; this info is in the model$pred slot of the object returned by train (see the sketch at the end of this answer). The mlr package has an option to keep both the hold-out and train predictions and metrics.
Here is an example of how to perform the requested operation with the mlr library:
library(mlr)
library(mlbench) #for the data set
I will use the Sonar data set:
data(Sonar)
create a task:
task <- makeClassifTask(data = Sonar, target = "Class")
create a learner:
lrn <- makeLearner("classif.nnet", predict.type = "prob")
get all tunable parameters for the learner:
getParamSet("classif.nnet")
set which ones you would like to tune and the range:
ps <- makeParamSet(
  makeIntegerParam("size", lower = 3, upper = 5),
  makeNumericParam("decay", lower = 0.1, upper = 0.2))
define resampling:
cross_val <- makeResampleDesc("RepCV",
                              reps = 2, folds = 5,
                              stratify = TRUE, predict = "both")
define how the search will be performed (grid in this case):
ctrl <- mlr::makeTuneControlGrid(resolution = 4L)
get everything together:
res.mbo <- tuneParams(lrn, task, cross_val, par.set = ps, control = ctrl,
                      show.info = FALSE,
                      measures = list(auc,
                                      setAggregation(auc, test.sd),
                                      setAggregation(auc, train.mean),
                                      setAggregation(auc, train.sd)))
You can define many measures in a list (the first one is used to select the hyper-parameters; all the others are just reported).
extract the results:
res <- mlr::generateHyperParsEffectData(res.mbo)$data
plot:
library(tidyverse)
res %>%
  gather(key, value, c(3, 5)) %>%  # columns 3 and 5 hold the test and train mean AUC
  mutate(key = as.factor(key)) %>%
  ggplot() +
  geom_point(aes(x = size, y = value, color = key)) +
  geom_smooth(aes(x = size, y = value, color = key)) +
  facet_wrap(~decay)
You will get a bunch of warnings from geom_smooth since there are only 3 points per fit.
And here is an example of how to do it in caret, on just the hold-out samples:
library(caret)
create a train control:
ctrl <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 2,
  classProbs = TRUE,
  savePredictions = "all",
  returnResamp = "all",
  summaryFunction = twoClassSummary
)
create a grid of hyper-parameters:
grid <- expand.grid(size = c(4, 5, 6),
                    decay = seq(from = 0.1, to = 0.2, length.out = 4))
tune:
fit <- caret::train(Sonar[, 1:60], Sonar$Class,
                    method = 'nnet',
                    tuneGrid = grid,
                    metric = 'ROC',
                    trControl = ctrl)
plot:
fit$results %>%
  ggplot() +
  geom_point(aes(x = size, y = ROC)) +
  geom_smooth(aes(x = size, y = ROC)) +
  facet_wrap(~decay)
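To revisit the earlier point about train-set predictions, here is where that information lives in the fitted object (a minimal sketch, assuming fit was built with savePredictions = "all" as above):
head(fit$pred)          # hold-out predictions: one row per sample / fold / parameter combination
str(fit$control$index)  # row indexes used for *training* in each resample
# caret does not store the per-fold models, so true train-set predictions would
# have to be regenerated by refitting on fit$control$index[[i]] for each fold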

Related

Error when predicting partial effects using new data for gamlss model

I'm here re-raising the issue of predicting CIs for gamlss models using the newdata argument. A further complication is that I'm interested in partial effects as well.
A closely related issue (without partial effects) was left unresolved in 2018: Error when predicting new fitted values from R gamlss object.
I'm wondering if there have been updates that also extend to partial effects. The example below reproduces the error (notice the type = "terms" argument specifying that I'm interested in the effects of each model term).
library(gamlss)
library(tidyverse)
# example data
test_df <- tibble(x = rnorm(1e4),
                  x2 = rnorm(n = 1e4),
                  y = x2^2 + rnorm(1e4, sd = 0.5))
# fitting gamlss model
gam_test = gamlss(formula = y ~ pb(x2) + x,
                  sigma.fo = y ~ pb(x2) + x,
                  data = test_df)
# data I want predictions for
pred_df <- tibble(x = seq(-0.5, 0.5, length.out = 300),
                  x2 = seq(-0.5, 0.5, length.out = 300))
# returns error when se.fit = TRUE
pred <- predictAll(object = gam_test,
                   type = "terms",
                   se.fit = TRUE, # works if se.fit = FALSE
                   newdata = pred_df)
Many thanks in advance!
I talked to the main developer of the gamlss software (who is responsible for this function). He says that the option se.fit = TRUE with type = "terms" has not yet been implemented, and unfortunately he is too busy at present. One idea is to bootstrap the original data, predict the terms for each bootstrap sample, and then use the results to obtain CIs.
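A minimal sketch of that bootstrap idea, reusing test_df, pred_df, and the model formula from the question (B = 200 replicates and the 95% level are arbitrary choices, and the term is picked by column position, which may need adjusting):
set.seed(1)
B <- 200
boot_term <- replicate(B, {
  idx <- sample(nrow(test_df), replace = TRUE)
  fit_b <- gamlss(formula = y ~ pb(x2) + x,
                  sigma.fo = y ~ pb(x2) + x,
                  data = test_df[idx, ],
                  control = gamlss.control(trace = FALSE))
  # se.fit = FALSE works, per the question; take the first mu term
  predictAll(fit_b, newdata = pred_df, type = "terms")$mu[, 1]
})
# pointwise 95% CI for that term across the bootstrap fits
ci <- apply(boot_term, 1, quantile, probs = c(0.025, 0.975))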

preprocessing (center and scale) only specific variables (numeric variables)

I have a dataframe that consists of numerical and non-numerical variables. I am trying to fit a logistic regression model predicting my variable "risk" based on all other variables, optimising AUC using 6-fold cross-validation.
However, I want to center and scale all numerical explanatory variables. My code raises no errors or warnings, but I fail to figure out how to tell train(), through preProcess (or in some other way), to center and scale just my numerical variables.
Here is the code:
test <- train(risk ~ .,
              method = "glm",
              data = df,
              family = binomial(link = "logit"),
              preProcess = c("center", "scale"),
              trControl = trainControl(method = "cv",
                                       number = 6,
                                       classProbs = TRUE,
                                       summaryFunction = prSummary),
              metric = "AUC")
You could preprocess all numerical variables in the original df first and then apply the train function to the scaled df:
library(dplyr)
library(caret)
df <- df %>%
  dplyr::mutate_if(is.numeric, scale)

test <- train(risk ~ .,
              method = "glm",
              data = df,
              family = binomial(link = "logit"),
              trControl = trainControl(method = "cv",
                                       number = 6,
                                       classProbs = TRUE,
                                       summaryFunction = prSummary),
              metric = "AUC")
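A quick sanity check (a sketch): after the mutate_if call, the numeric columns should have mean ~0 and sd ~1. Note that scale() leaves one-column matrices behind, hence the as.numeric():
df %>%
  dplyr::summarise_if(is.numeric, list(mean = ~ mean(.), sd = ~ sd(as.numeric(.))))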

Error building partial dependence plots for RF using FinalModel output from caret's train() function

I am using the following code to fit and test a random forest classification model:
control <- trainControl(method = 'repeatedcv',
                        number = 5, repeats = 3,
                        search = 'grid')
tunegrid <- expand.grid(.mtry = (1:12))
rf_gridsearch <- train(y = river$stat_bino,
                       x = river[, colnames(river) != "stat_bino"],
                       data = river,
                       method = 'rf',
                       metric = 'Accuracy',
                       ntree = 600,
                       importance = TRUE,
                       tuneGrid = tunegrid, trControl = control)
Note, I am using
train(y = river$stat_bino, x = river[, colnames(river) != "stat_bino"], ...
rather than train(stat_bino ~ ., ...
so that my categorical variables will not be turned into dummy variables (solution here: variable encoding in K-fold validation of random forest using package 'caret').
I would like to extract the finalModel and use it to make partial dependence plots for my variables (using the code below), but I get an error message and don't know how to fix it.
model1 <- rf_gridsearch$finalModel
library(pdp)
partial(model1, pred.var = "MAXCL", type = "classification", which.class = "1", plot = TRUE)
Error in eval(stats::getCall(object)$data) :
  ..1 used in an incorrect context, no ... to look in
Thanks for any solutions here!
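One possible workaround (a sketch, not verified against this data set): the error comes from pdp trying to re-evaluate the data stored in the model's call, which points at caret internals. pdp::partial accepts a train argument, so you can hand it the training predictors explicitly:
library(pdp)
partial(rf_gridsearch$finalModel,
        pred.var = "MAXCL",
        type = "classification",
        which.class = "1",
        train = river[, colnames(river) != "stat_bino"], # supply the predictors directly
        plot = TRUE)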

r caret: train ONE model once the hyper-parameters are already known

I am using caret to train a ridge regression:
library(ISLR)
Hitters = na.omit(Hitters)
x = model.matrix(Salary ~ ., Hitters)[, -1] #Dropping the intercept column.
y = Hitters$Salary
set.seed(0)
train = sample(1:nrow(x), 7*nrow(x)/10)
library(caret)
set.seed(0)
# Values of lambda over which to check:
grid = 10 ^ seq(5, -2, length = 100)
train_control = trainControl(method = 'cv', number = 10)
tune.grid = expand.grid(lambda = grid, alpha = 0)
ridge.caret = train(x[train, ], y[train],
                    method = 'glmnet',
                    trControl = train_control,
                    tuneGrid = tune.grid)
ridge.caret$bestTune
# alpha is 0 and best lambda is 242.0128
So, I found my optimal lambda and alpha. (What they are isn't really important for my question.)
Now, how could I run just ONE ridge regression (using caret) with alpha = 0 and lambda = 242.0128 on the whole data set?
I discovered that I can specify the trainControl method as 'none'. See the code below. Did I correctly specify the tuneGrid (with just one row)? Is this how it should be done?
Thank you very much!
set.seed(12345)
ridge_full <- train(x, y,
                    method = 'glmnet',
                    trControl = trainControl(method = 'none'),
                    tuneGrid = expand.grid(lambda = ridge.caret$bestTune$lambda, alpha = 0))
coef(ridge_full$finalModel, s = ridge_full$bestTune$lambda)
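As a sanity check (a sketch, assuming the x, y, and ridge.caret objects from above), fitting directly with glmnet should give near-identical coefficients; note glmnet's docs advise fitting a full path rather than a single lambda, so small numerical differences are possible:
library(glmnet)
fit_glmnet <- glmnet(x, y, alpha = 0, lambda = ridge.caret$bestTune$lambda)
coef(fit_glmnet)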

R - How to let glmnet select lambda, while providing an alpha range in caret?

This question appears to have been asked before here but was correctly closed as off-topic. I'm now experiencing the same issue and figured that Stack Overflow is a better place for it.
I want to use glmnet's warm start for selecting lambda to speed up the model-building process, but I want to keep using tuneGrid from caret in order to supply a large sequence of alphas (glmnet's default alpha range is too narrow). The following attempt returns the error: Error: The tuning parameter grid should have columns alpha, lambda
fitControl <- trainControl(method = 'cv',
                           number = 10,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)
tuneGridb <- expand.grid(.alpha = seq(0, 1, 0.05))
model.caretb <- caret::train(y ~ x1 + x2 + x3, data = train, method = "glmnet",
                             family = "binomial", trControl = fitControl,
                             tuneGrid = tuneGridb, metric = "ROC")
How can I supply a range of values for alpha via caret whilst using the glmnet default lambda selection process?
If you check the default grid-search method for the glmnet model in caret, you will notice that if a grid search is specified, but without an actual grid, caret will provide alpha values with:
alpha = seq(0.1, 1, length = len)
while lambda values will be provided by the glmnet "warm start" at alpha = 0.5:
init <- glmnet::glmnet(Matrix::as.matrix(x), y,
                       family = fam,
                       nlambda = len + 2,
                       alpha = .5)
lambda <- unique(init$lambda)
lambda <- lambda[-c(1, length(lambda))]
lambda <- lambda[1:min(length(lambda), len)]
so if you do:
library(caret)
library(mlbench)
data(Sonar)

fitControl <- trainControl(method = 'cv',
                           number = 10,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary,
                           search = "grid")
model.caret <- caret::train(Class ~ .,
                            data = Sonar,
                            method = "glmnet",
                            family = "binomial",
                            trControl = fitControl,
                            tuneLength = 20,
                            metric = "ROC")
you will not get a grid of 20 combinations but a grid of 400: 20 lambda values for each alpha.
nrow(model.caret$results)
#output
400
I understand this is not exactly what you are after, but it is pretty close without resorting to a custom train function.
To get closer to the desired result you can manually get the range of lambda values from glmnet for each desired alpha:
lambda <- unique(unlist(lapply(seq(0, 1, 0.05), function(x) {
  init <- glmnet::glmnet(Matrix::as.matrix(Sonar[, 1:60]), Sonar$Class,
                         family = "binomial",
                         nlambda = 100,
                         alpha = x)
  c(min(init$lambda), max(init$lambda))
})))
create a grid of many lambda:
tuneGridb <- expand.grid(.alpha = seq(0, 1, 0.05),
                         .lambda = seq(min(lambda), max(lambda), length.out = 100))
caret is smart enough to just pass the lambda values to glmnet and not fit all the models separately:
model.caret <- caret::train(Class ~ .,
                            data = Sonar,
                            method = "glmnet",
                            family = "binomial",
                            trControl = fitControl,
                            tuneGrid = tuneGridb,
                            metric = "ROC")
model.caret$bestTune
#output
  alpha       lambda
1     0 2.159367e-05
Ridge is the way to go in this case. Since this best lambda was in fact the lowest lambda tested:
min(lambda)
#output
2.159367e-05
perhaps it would be wise to explore lambda values in the grid lower than those the glmnet "warm start" suggested.
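A sketch of one way to do that, extending the grid two decades below the warm-start minimum (an arbitrary choice) on a log scale:
tuneGridb_low <- expand.grid(.alpha = seq(0, 1, 0.05),
                             .lambda = 10 ^ seq(log10(min(lambda)) - 2,
                                                log10(max(lambda)),
                                                length.out = 100))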
