preprocessing (center and scale) only specific variables (numeric variables) - r

I have a dataframe that consists of numerical and non-numerical variables. I am trying to fit a logistic regression model predicting my variable "risk" from all other variables, optimizing AUC with 6-fold cross-validation.
However, I want to center and scale all numerical explanatory variables. My code raises no errors or warnings, but I can't figure out how to tell train(), through preProcess (or in some other way), to center and scale only my numerical variables.
Here is the code:
test <- train(risk ~ .,
              method = "glm",
              data = df,
              family = binomial(link = "logit"),
              preProcess = c("center", "scale"),
              trControl = trainControl(method = "cv",
                                       number = 6,
                                       classProbs = TRUE,
                                       summaryFunction = prSummary),
              metric = "AUC")

You could preprocess all numeric variables in the original df first and then apply train() to the scaled data frame. (With the formula interface, factors are expanded to dummy variables before preProcess is applied, which is why "center"/"scale" inside train would otherwise touch those columns as well.)
library(dplyr)
library(caret)
# scale() returns a one-column matrix, so wrap it in as.numeric()
# to keep plain numeric columns in the data frame
df <- df %>%
  dplyr::mutate_if(is.numeric, ~ as.numeric(scale(.)))
test <- train(risk ~ .,
              method = "glm",
              data = df,
              family = binomial(link = "logit"),
              trControl = trainControl(method = "cv",
                                       number = 6,
                                       classProbs = TRUE,
                                       summaryFunction = prSummary),
              metric = "AUC")

Related

R Caret's Train function quotation marks problem

I am using caret's train function inside my own function to train multiple models. Because train cannot handle quoted strings in the formula input, I tried removing the quotes with base R's noquote function. Because other parts of my function need the input with quotation marks, I cannot strip the quotes beforehand. Thanks in advance!
Code:
i <- "Dax.Shepard"
celeg_lgr = train(noquote(i) ~ ., method = "glm",
family = binomial(link = "logit"), data = celeb_trn,
trControl = trainControl(method = 'cv', number = 5))
Running this code results in the following error:
Error in model.frame.default(form = op ~ ., data = celeb_trn, na.action = na.fail) :
variable lengths differ (found for 'Dax.Shepard')
PS.
Running the code like this does not result in any error:
celeg_lgr = train(Dax.Shepard ~ ., method = "glm",
                  family = binomial(link = "logit"), data = celeb_trn,
                  trControl = trainControl(method = 'cv', number = 5))
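For what it's worth: noquote() only changes how a string prints; the result is still character data, not a formula, so model.frame() treats it as a variable of a different length. A hedged sketch of the usual workaround, building an actual formula from the string:
i <- "Dax.Shepard"
f <- as.formula(paste(i, "~ ."))  # or: reformulate(".", response = i)
celeg_lgr = train(f, method = "glm",
                  family = binomial(link = "logit"), data = celeb_trn,
                  trControl = trainControl(method = 'cv', number = 5))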

Caret obtain train & cv predictions from model to plot

I've trained a simple model:
mySim <- train(Event ~ .,
               method = 'nnet',
               data = train,
               tuneGrid = tg)
optimising the two nnet parameters, decay (weight decay) and size (of the hidden layer). I'm new to caret, so what I would usually do is plot the train error and the CV error for each model built. To do this, I need the predicted values from the train and validation passes.
This is the first time I've used cross-validation, so I'm a little unsure how to go about getting the predictions from the train and hold-out sets at each tuneGrid iteration.
If I have a grid search of length 3 (3 models to build) and 5-fold cross-validation, I assume I'm going to end up with 3 × 5 = 15 sets of train & hold-out predictions in total.
The plot I'm essentially looking to build has a performance metric on the y-axis (let's say entropy loss, for the sake of classification with nnet) and the size grid-search values on the x-axis, increasing from 0 to the maximum.
Is there a way I can extract the predicted values from the train / hold-out sets during trainControl cross-validation?
I've looked through some of the attributes train returns, but I'm not sure if I'm missing something.
I know this question lacks code, but hopefully I've explained myself.
Update
Am I correct in assuming that setting the following parameters in trainControl will return the predictions, allowing me to create this plot?
returnResamp
savePredictions
caret::train keeps only the hold-out predictions. If you specify savePredictions = "all", it will save the hold-out predictions for all hyperparameter combinations; however, it does not save the train-set predictions. You could generate them afterwards with the knowledge of which indexes were used for the hold-outs: this info is in the model$pred slot of the object returned by train. The mlr package has an option to keep both hold-out and train predictions and metrics.
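A quick sketch of inspecting those slots (assuming a fitted caret object named fit, trained with savePredictions = "all"):
# hold-out predictions for every hyperparameter combination and fold
head(fit$pred)
# row indexes of each training fold, which can be used to
# regenerate train-set predictions afterwards
str(fit$control$index)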
Here is an example of how to perform the requested operation with the mlr library:
library(mlr)
library(mlbench) #for the data set
I will use the Sonar data set:
data(Sonar)
create a task:
task <- makeClassifTask(data = Sonar, target = "Class")
create a learner:
lrn <- makeLearner("classif.nnet", predict.type = "prob")
get all tune-able parameters for a learner:
getParamSet("classif.nnet")
set which ones you would like to tune and the range:
ps <- makeParamSet(
  makeIntegerParam("size", lower = 3, upper = 5),
  makeNumericParam("decay", lower = 0.1, upper = 0.2))
define resampling:
cross_val <- makeResampleDesc("RepCV",
                              reps = 2, folds = 5, stratify = TRUE, predict = "both")
how the search will be performed (grid in this case):
ctrl <- mlr::makeTuneControlGrid(resolution = 4L)
get everything together:
res.mbo <- tuneParams(lrn, task, cross_val, par.set = ps, control = ctrl,
                      show.info = FALSE,
                      measures = list(auc,
                                      setAggregation(auc, test.sd),
                                      setAggregation(auc, train.mean),
                                      setAggregation(auc, train.sd)))
You can define many measures in a list (the first one is used to select hyperparameters; all the others are just reported).
extract the results:
res <- mlr::generateHyperParsEffectData(res.mbo)$data
plot:
library(tidyverse)
res %>%
  gather(key, value, c(3, 5)) %>%
  mutate(key = as.factor(key)) %>%
  ggplot() +
  geom_point(aes(x = size, y = value, color = key)) +
  geom_smooth(aes(x = size, y = value, color = key)) +
  facet_wrap(~decay)
This produces a bunch of warnings from geom_smooth, since there are only three points per fit.
And here is an example of how to do it in caret, on just the hold-out samples:
library(caret)
create a train control:
ctrl <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 2,
  classProbs = TRUE,
  savePredictions = "all",
  returnResamp = "all",
  summaryFunction = twoClassSummary
)
create a grid of hyperparameters:
grid <- expand.grid(size = c(4, 5, 6),
                    decay = seq(from = 0.1, to = 0.2, length.out = 4))
tune:
fit <- caret::train(Sonar[, 1:60], Sonar$Class,
                    method = 'nnet',
                    tuneGrid = grid,
                    metric = 'ROC',
                    trControl = ctrl)
plot:
fit$results %>%
  ggplot() +
  geom_point(aes(x = size, y = ROC)) +
  geom_smooth(aes(x = size, y = ROC)) +
  facet_wrap(~decay)

r caret: train ONE model once the hyper-parameters are already known

I am using caret to train a ridge regression:
library(ISLR)
Hitters = na.omit(Hitters)
x = model.matrix(Salary ~ ., Hitters)[, -1] #Dropping the intercept column.
y = Hitters$Salary
set.seed(0)
train = sample(1:nrow(x), 7*nrow(x)/10)
library(caret)
set.seed(0)
# Values of lambda over which to check:
grid = 10 ^ seq(5, -2, length = 100)
train_control = trainControl(method = 'cv', number = 10)
tune.grid = expand.grid(lambda = grid, alpha = 0)
ridge.caret = train(x[train, ], y[train],
                    method = 'glmnet',
                    trControl = train_control,
                    tuneGrid = tune.grid)
ridge.caret$bestTune
# alpha is 0 and best lambda is 242.0128
So I found my optimal lambda and alpha; what exactly they are isn't really important for my question.
Now, how could I run just ONE ridge regression (using caret) with alpha = 0 and lambda = 242.0128 on the whole data set?
I discovered that I can specify the trainControl method as 'none'; see the code below. Did I correctly specify the tuneGrid (with just one row)? Is this how it should be done?
Thank you very much!
set.seed(12345)
ridge_full <- train(x, y,
                    method = 'glmnet',
                    trControl = trainControl(method = 'none'),
                    tuneGrid = expand.grid(lambda = ridge.caret$bestTune$lambda,
                                           alpha = 0))
coef(ridge_full$finalModel, s = ridge_full$bestTune$lambda)
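A minimal usage sketch (not from the original post) showing that the refitted model predicts in the usual way:
# predictions from the single ridge model fit on the full data set
pred <- predict(ridge_full, newdata = x)
head(pred)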

Plotting ROC curve from two different algorithms using lift in caret

I have two models, built like the following:
library(mlbench)
data(Sonar)
library(caret)
set.seed(998)
my_data <- Sonar
fitControl <-
  trainControl(
    method = "boot632",
    number = 10,
    classProbs = TRUE,
    savePredictions = "final",
    summaryFunction = twoClassSummary
  )
modelxgb <- train(
  Class ~ .,
  data = my_data,
  method = "xgbTree",
  trControl = fitControl,
  metric = "ROC"
)
modelsvm <- train(
  Class ~ .,
  data = my_data,
  method = "svmLinear2",
  trControl = fitControl,
  metric = "ROC"
)
I want to plot the ROC curves for both models on one ggplot.
I am doing the following to generate the points for the curve:
for_lift_xgb = data.frame(Class = modelxgb$pred$obs, xgbTree = modelxgb$pred$R)
for_lift_svm = data.frame(Class = modelsvm$pred$obs, svmLinear2 = modelsvm$pred$R)
lift_obj_xgb = lift(Class ~ xgbTree, data = for_lift_xgb, class = "R")
lift_obj_svm = lift(Class ~ svmLinear2, data = for_lift_svm, class = "R")
What would be the easiest way to plot both of these curves on a single plot, in different colors? I would also like to annotate the individual AUC values on the plot.
After building the models you can combine the predictions in a single data frame:
for_lift = data.frame(Class = modelxgb$pred$obs,
                      xgbTree = modelxgb$pred$R,
                      svmLinear2 = modelsvm$pred$R)
and use it to build the lift object:
lift = lift(Class ~ xgbTree + svmLinear2, data = for_lift, class = "R")
and plot with ggplot:
library(ggplot2)
ggplot(lift$data) +
  geom_line(aes(1 - Sp, Sn, color = liftModelVar)) +
  scale_color_discrete(guide = guide_legend(title = "method"))
You can combine and compare many models this way.
To add the AUC values to the plot, you can create a data frame with the model names, the corresponding AUCs, and the coordinates for plotting:
auc_ano <- data.frame(model = c("xgbTree", "svmLinear2"),
                      auc = c(pROC::roc(response = for_lift$Class,
                                        predictor = for_lift$xgbTree,
                                        levels = c("M", "R"))$auc,
                              pROC::roc(response = for_lift$Class,
                                        predictor = for_lift$svmLinear2,
                                        levels = c("M", "R"))$auc),
                      y = c(0.95, 0.9))
auc_ano
#output
       model       auc    y
1    xgbTree 0.9000756 0.95
2 svmLinear2 0.5041086 0.90
and pass it to geom_text:
ggplot(lift$data) +
  geom_line(aes(1 - Sp, Sn, color = liftModelVar)) +
  scale_color_discrete(guide = guide_legend(title = "method")) +
  geom_text(data = auc_ano, aes(label = round(auc, 4), color = model, y = y), x = 0.1)
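As an aside, since pROC is already used above for the AUC values, the two curves can also be drawn directly from roc objects with pROC::ggroc; a sketch:
library(pROC)
roc_xgb <- roc(for_lift$Class, for_lift$xgbTree, levels = c("M", "R"))
roc_svm <- roc(for_lift$Class, for_lift$svmLinear2, levels = c("M", "R"))
# a named list sets the legend labels
ggroc(list(xgbTree = roc_xgb, svmLinear2 = roc_svm))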

R - How to let glmnet select lambda, while providing an alpha range in caret?

This question appears to have been asked before here but was correctly closed as off-topic. I'm now experiencing the same issue and figured that Stack Overflow is a better place for it.
I want to use glmnet's warm start for selecting lambda to speed up the model-building process, but I want to keep using tuneGrid from caret in order to supply a large sequence of alphas (glmnet's default alpha range is too narrow). The following attempt returns the error: Error: The tuning parameter grid should have columns alpha, lambda
fitControl <- trainControl(method = 'cv',
                           number = 10,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)
tuneGridb <- expand.grid(.alpha = seq(0, 1, 0.05))
model.caretb <- caret::train(y ~ x1 + x2 + x3, data = train, method = "glmnet",
                             family = "binomial", trControl = fitControl,
                             tuneGrid = tuneGridb, metric = "ROC")
How can I supply a range of values for alpha via caret whilst using the glmnet default lambda selection process?
If you check the default grid-search method for the glmnet model in caret, you will notice that if a grid search is specified, but without an actual grid, caret will provide alpha values with:
alpha = seq(0.1, 1, length = len)
while lambda values will be provided by the glmnet "warm start" at alpha = 0.5:
init <- glmnet::glmnet(Matrix::as.matrix(x), y,
                       family = fam,
                       nlambda = len + 2,
                       alpha = .5)
lambda <- unique(init$lambda)
lambda <- lambda[-c(1, length(lambda))]
lambda <- lambda[1:min(length(lambda), len)]
so if you do:
library(caret)
library(mlbench)
data(Sonar)
fitControl <- trainControl(method = 'cv',
                           number = 10,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary,
                           search = "grid")
model.caret <- caret::train(Class ~ .,
                            data = Sonar,
                            method = "glmnet",
                            family = "binomial",
                            trControl = fitControl,
                            tuneLength = 20,
                            metric = "ROC")
you will not get a grid of 20 combinations but a grid of 400 combinations: 20 lambda values for each of the 20 alpha values:
nrow(model.caret$results)
#output
400
I understand this is not exactly what you are after but it is pretty close without resorting to a custom train function.
To get closer to the desired result you can manually get the range of lambda values from glmnet for each desired alpha:
lambda <- unique(unlist(lapply(seq(0, 1, 0.05), function(x) {
  init <- glmnet::glmnet(Matrix::as.matrix(Sonar[, 1:60]), Sonar$Class,
                         family = "binomial",
                         nlambda = 100,
                         alpha = x)
  lambda <- c(min(init$lambda), max(init$lambda))
})))
create a grid of many lambda values:
tuneGridb <- expand.grid(.alpha = seq(0, 1, 0.05),
                         .lambda = seq(min(lambda), max(lambda), length.out = 100))
caret is smart enough to pass the whole lambda sequence to glmnet, which fits one path per alpha rather than one model per alpha/lambda combination:
model.caret <- caret::train(Class ~ .,
                            data = Sonar,
                            method = "glmnet",
                            family = "binomial",
                            trControl = fitControl,
                            tuneGrid = tuneGridb,
                            metric = "ROC")
model.caret$bestTune
#output
  alpha       lambda
1     0 2.159367e-05
Ridge is the way to go in this case. Since this best lambda was in fact the lowest lambda tested:
min(lambda)
#output
2.159367e-05
perhaps it would be wise to explore lambda values in the grid lower than the glmnet "warm start" suggested.
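For instance, a hedged sketch of such an extended grid, padding below the warm-start minimum:
tuneGridc <- expand.grid(.alpha = seq(0, 1, 0.05),
                         .lambda = c(seq(min(lambda) / 100, min(lambda), length.out = 20),
                                     seq(min(lambda), max(lambda), length.out = 100)))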
