I'm using caret with a custom fitting metric, but I need to maximize not just this metric but the lower bound of its confidence interval. So I'd like to maximize something like mean(metric) - k * stddev(metric). I know how to do this manually, but is there a way to tell caret to automatically select the best parameters using this function?
Yes, you can define your own selection metric through the "summaryFunction" parameter of your "trainControl" object and then with the "metric" parameter of your call to train(). Details on this are pretty well documented in the "Alternate Performance Metrics" section on caret's model tuning page: http://caret.r-forge.r-project.org/training.html
I don't think you gave enough information for anyone to write exactly what you're looking for, but here is an example using the code from the twoClassSummary function:
> library(caret)
> data(Titanic)
> #an example custom function
> roc <- function (data, lev = NULL, model = NULL) {
+ require(pROC)
+ if (!all(levels(data[, "pred"]) == levels(data[, "obs"])))
+ stop("levels of observed and predicted data do not match")
+ rocObject <- try(pROC:::roc(data$obs, data[, lev[1]]), silent = TRUE)
+ rocAUC <- if (class(rocObject)[1] == "try-error")
+ NA
+ else rocObject$auc
+ out <- c(rocAUC, sensitivity(data[, "pred"], data[, "obs"], lev[1]), specificity(data[, "pred"], data[, "obs"], lev[2]))
+ names(out) <- c("ROC", "Sens", "Spec")
+ out
+ }
> #your train control specs
> tc <- trainControl(method="cv",classProbs=TRUE,summaryFunction=roc)
> #your model with selection metric specified
> train(Survived~.,data=data.frame(Titanic),method="rf",trControl=tc,metric="ROC")
32 samples
4 predictors
2 classes: 'No', 'Yes'
No pre-processing
Resampling: Cross-Validation (10 fold)
Summary of sample sizes: 28, 29, 30, 30, 28, 28, ...
Resampling results across tuning parameters:
mtry ROC Sens Spec ROC SD Sens SD Spec SD
2 0.9 0.2 0.25 0.175 0.35 0.425
4 0.85 0.4 0.6 0.211 0.459 0.459
6 0.875 0.35 0.6 0.212 0.412 0.459
ROC was used to select the optimal model using the largest value.
The final value used for the model was mtry = 2.
There is a more basic example in caret's help page for the train function:
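library(mlbench)   # provides the BostonHousing data used below
data(BostonHousing)
madSummary <- function (data,
                        lev = NULL,
                        model = NULL) {
  # median absolute deviation of the residuals
  out <- mad(data$obs - data$pred,
             na.rm = TRUE)
  names(out) <- "MAD"
  out
}
robustControl <- trainControl(summaryFunction = madSummary)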
marsGrid <- expand.grid(degree = 1, nprune = (1:10) * 2)
earthFit <- train(medv ~ .,
data = BostonHousing,
method = "earth",
tuneGrid = marsGrid,
metric = "MAD",
maximize = FALSE,
trControl = robustControl)
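For the original goal of maximizing mean(metric) - k * stddev(metric), one possible route is the selectionFunction argument of trainControl, which accepts a function of (x, metric, maximize) and returns the index of the chosen row. The sketch below is an illustration, not caret's own code: lowerBoundBest and k are hypothetical names, and it assumes the results table holds the resampling standard deviation in a column named "<metric>SD" (as caret's oneSE rule does). The toy table reuses the ROC values printed above.

```r
# Hypothetical helper: choose the tuning row with the best lower bound,
# mean(metric) - k * sd(metric). Assumes the results table stores the
# resampling SD in a column named "<metric>SD".
lowerBoundBest <- function(x, metric, maximize = TRUE, k = 1) {
  sd_col <- paste0(metric, "SD")
  if (maximize) {
    which.max(x[[metric]] - k * x[[sd_col]])
  } else {
    # for metrics that are minimized, penalize in the other direction
    which.min(x[[metric]] + k * x[[sd_col]])
  }
}

# Toy results table built from the ROC means and SDs printed above
results <- data.frame(mtry  = c(2, 4, 6),
                      ROC   = c(0.9, 0.85, 0.875),
                      ROCSD = c(0.175, 0.211, 0.212))
best_row <- lowerBoundBest(results, "ROC", maximize = TRUE, k = 1)
results[best_row, ]
```

If the column-name assumption holds for your caret version, passing trainControl(..., selectionFunction = lowerBoundBest) should make train() apply this rule instead of the default "best".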
I would like to use the fastshap package to obtain SHAP value plots for every category of my outcome in a multi-class classification problem using a random forest classifier. I could only find chunks of code around, but no explanation of how to proceed from the beginning to obtain the SHAP values in this case. Here is the code I have so far (my y has 5 classes; here I am trying to obtain SHAP values for class 3):
sample <- sample.int(n = nrow(ITA), size = floor(.75*nrow(ITA)), replace=F)
train <- ITA [sample,]
test <- ITA [-sample,]
rftrain <-randomForest(y ~ ., data=train, ntree=500, importance = TRUE)
p_function_3<- function(object, newdata)
newdata = newdata,
type = "prob")[,3]
shap_values_G <- fastshap::explain(rftrain,
X = train,
pred_wrapper = p_function_3,
nsim = 50,
Now, I took the code largely from an example I found online and tried to adapt it (I am not an expert R user), but it does not work. Can you please help me correct it? Thanks!
Here is a working example (with a different dataset), but I think the logic is the same.
library(randomForest)
library(fastshap)
ix <- sample(nrow(iris), 0.75 * nrow(iris))
train <- iris[ix, ]
test <- iris[-ix, ]
xvars <- c("Sepal.Width", "Sepal.Length")
yvar <- "Species"
fit <- randomForest(reformulate(xvars, yvar), data = train, ntree = 500)
# prediction wrapper: probability of one class ("virginica", the third level)
pred_3 <- function(model, newdata) {
  predict(model, newdata = newdata, type = "prob")[, "virginica"]
}
shap_values_3 <- fastshap::explain(
  fit,
  X = train, # Reference data
  feature_names = xvars,
  pred_wrapper = pred_3,
  nsim = 50,
  newdata = train[train$Species == "virginica", ] # For these rows, you will calculate explanations
)
head(shap_values_3)
# Sepal.Width Sepal.Length
# <dbl> <dbl>
# 1 0.101 0.381
# 2 0.159 -0.0109
# 3 0.0736 -0.0285
# 4 0.0564 0.161
# 5 0.0649 0.594
# 6 0.232 0.0305
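To get explanations for every class rather than just one, the wrapper can be generated per class. The following is only a sketch of that idea: make_pred_wrapper is a hypothetical helper, and the stub model with its predict method is a stand-in so the example runs without randomForest. With a real fit, each wrapper would be passed as pred_wrapper in its own explain() call.

```r
# Hypothetical factory: returns a prediction wrapper for one class name
make_pred_wrapper <- function(class_name) {
  force(class_name)
  function(object, newdata) {
    predict(object, newdata = newdata, type = "prob")[, class_name]
  }
}

# Stand-in model (assumption, for illustration only): its predict method
# returns a fixed class-probability matrix with one column per class
predict.stub_model <- function(object, newdata, type = "prob", ...) {
  n <- nrow(newdata)
  matrix(rep(c(0.2, 0.3, 0.5), each = n), nrow = n,
         dimnames = list(NULL, c("setosa", "versicolor", "virginica")))
}
stub <- structure(list(), class = "stub_model")

# One wrapper per class, stored in a named list
classes  <- c("setosa", "versicolor", "virginica")
wrappers <- lapply(setNames(classes, classes), make_pred_wrapper)

# Each wrapper returns one probability per row of newdata
wrappers$virginica(stub, head(iris, 3))
```

Looping over names(wrappers) and calling explain() once per wrapper would then give one set of SHAP values per class.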
I tried computing confusion-matrix for my glm model but I keep getting:
Error: data and reference should be factors with the same levels.
Below is my model:
model3 <- glm(winner ~ srs.1 + srs.2, data = train_set, family = binomial)
confusionMatrix(table(predict(model3, newdata=test_set, type="response")) >= 0.5,
train_set$winner == 1)
winner variable contains team1 and team2.
srs.1 and srs.2 are numerical values.
What is my problem here?
I suppose your winner label is a binary of 0,1. So let's use the example below:
data = data.frame(
  srs.1 = rnorm(200),
  srs.2 = rnorm(200)
)
data$winner = ifelse(data$srs.1*data$srs.2 > 0,1,0)
idx = sample(nrow(data),150)
train_set = data[idx,]
test_set = data[-idx,]
model3 <- glm(winner ~ srs.1 + srs.2, data = train_set, family = binomial)
As you did, we predict 1 if the fitted probability is > 0.5 and 0 otherwise. You got the table() part about right, but note that the predictions and the reference must both come from the same set (both from test_set, or both from train_set):
pred = as.numeric(predict(model3, newdata=test_set, type="response")>0.5)
ref = test_set$winner
confusionMatrix(table(pred,ref))
Confusion Matrix and Statistics
pred 0 1
0 12 5
1 19 14
Accuracy : 0.52
95% CI : (0.3742, 0.6634)
No Information Rate : 0.62
P-Value [Acc > NIR] : 0.943973
Kappa : 0.1085
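The error message in the question is about factor levels, so another way to avoid it is to build both vectors as factors with an explicitly shared set of levels. The snippet below is a small base R sketch with made-up probabilities and labels (not the asker's data); it shows that the table stays 2 x 2 even when one class is never predicted.

```r
# Made-up predicted probabilities and true 0/1 labels, for illustration only
pred_prob <- c(0.2, 0.3, 0.4, 0.1)
ref_raw   <- c(0, 1, 1, 0)

# Give both vectors the same, explicitly stated levels
lv   <- c(0, 1)
pred <- factor(as.numeric(pred_prob > 0.5), levels = lv)
ref  <- factor(ref_raw, levels = lv)

# Class 1 is never predicted here, yet its row is kept in the table
cm <- table(pred, ref)
cm
```

Factors built this way can also be passed directly as caret::confusionMatrix(pred, ref), since both arguments then carry identical levels.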
Using cross validation in model tuning, I get different error rates from caret::train's results object and calculating the error myself on its pred object. I'd like to understand why they differ, and ideally how to use out-of-fold error rates for model selection, plotting model performance, etc.
The pred object contains out-of-fold predictions. The docs are pretty clear that trainControl(..., savePredictions = "final") saves out-of-fold predictions for the best hyperparameter values: "an indicator of how much of the hold-out predictions for each resample should be saved... "final" saves the predictions for the optimal tuning parameters." (Keeping "all" predictions and then filtering to the best tuning values doesn't resolve the issue.)
The train docs say that the results object is "a data frame the training error rate..." I'm not sure what that means, but the values for the best row are consistently different from the metrics calculated on pred. Why do they differ and how can I make them line up?
d <- data.frame(y = rnorm(50))
d$x1 <- rnorm(50, d$y)
d$x2 <- rnorm(50, d$y)
train_control <- caret::trainControl(method = "cv",
number = 4,
search = "random",
savePredictions = "final")
m <- caret::train(x = d[, -1],
y = d$y,
method = "ranger",
trControl = train_control,
tuneLength = 3)
#> Loading required package: lattice
#> Loading required package: ggplot2
#> Random Forest
#> 50 samples
#> 2 predictor
#> No pre-processing
#> Resampling: Cross-Validated (4 fold)
#> Summary of sample sizes: 38, 36, 38, 38
#> Resampling results across tuning parameters:
#> min.node.size mtry splitrule RMSE Rsquared MAE
#> 1 2 maxstat 0.5981673 0.6724245 0.4993722
#> 3 1 extratrees 0.5861116 0.7010012 0.4938035
#> 4 2 maxstat 0.6017491 0.6661093 0.4999057
#> RMSE was used to select the optimal model using the smallest value.
#> The final values used for the model were mtry = 1, splitrule =
#> extratrees and min.node.size = 3.
MLmetrics::RMSE(m$pred$pred, m$pred$obs)
#> [1] 0.609202
MLmetrics::R2_Score(m$pred$pred, m$pred$obs)
#> [1] 0.642394
The RMSE for cross validation is not calculated the way you show, but rather for each fold and then averaged. Full example:
d <- data.frame(y = rnorm(50))
d$x1 <- rnorm(50, d$y)
d$x2 <- rnorm(50, d$y)
train_control <- caret::trainControl(method = "cv",
number = 4,
search = "random",
savePredictions = "final")
m <- caret::train(x = d[, -1],
y = d$y,
method = "ranger",
trControl = train_control,
tuneLength = 3)
Random Forest
50 samples
2 predictor
No pre-processing
Resampling: Cross-Validated (4 fold)
Summary of sample sizes: 37, 38, 37, 38
Resampling results across tuning parameters:
min.node.size mtry splitrule RMSE Rsquared MAE
8 1 extratrees 0.6106390 0.4360609 0.4926629
12 2 extratrees 0.6156636 0.4294237 0.4954481
19 2 variance 0.6472539 0.3889372 0.5217369
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 1, splitrule = extratrees and min.node.size = 8.
RMSE for the best model is 0.6106390.
Now calculate the RMSE for each fold and average:
library(dplyr)
m$pred %>%
  group_by(Resample) %>%
  mutate(rmse = caret::RMSE(pred, obs)) %>%
  summarise(mean = mean(rmse)) %>%
  pull(mean) %>%
  mean()
# same calculation with MLmetrics::RMSE
m$pred %>%
  group_by(Resample) %>%
  mutate(rmse = MLmetrics::RMSE(pred, obs)) %>%
  summarise(mean = mean(rmse)) %>%
  pull(mean) %>%
  mean()
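To see why the two numbers differ in general, here is a tiny base R illustration with made-up values (not the model above): the mean of per-fold RMSEs is not the same quantity as the RMSE computed over all pooled out-of-fold predictions, because the square root is taken at a different stage.

```r
# Root mean squared error of predictions against observations
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# Made-up out-of-fold predictions: errors of 1 in Fold1, errors of 3 in Fold2
toy <- data.frame(
  Resample = c("Fold1", "Fold1", "Fold2", "Fold2"),
  obs      = c(0, 0, 0, 0),
  pred     = c(1, 1, 3, 3)
)

# Per-fold RMSE, then averaged (the approach described above)
per_fold      <- sapply(split(toy, toy$Resample), function(f) rmse(f$pred, f$obs))
mean_of_folds <- mean(per_fold)           # (1 + 3) / 2 = 2

# RMSE over all pooled predictions (what the question computed)
pooled <- rmse(toy$pred, toy$obs)         # sqrt((1 + 1 + 9 + 9) / 4) = sqrt(5)

c(mean_of_folds = mean_of_folds, pooled = pooled)
```

The pooled value is larger here because squaring before averaging gives the high-error fold more weight.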
I get different results. This is apparently a random process.
MLmetrics::RMSE(m$pred$pred, m$pred$obs)
[1] 0.5824464
MLmetrics::R2_Score(m$pred$pred, m$pred$obs)
[1] 0.5271595
If you want a random (more accurately, a pseudo-random) process to be reproducible, then use set.seed immediately prior to the call.
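A minimal base R illustration of that advice, using rnorm as a stand-in for any call that draws random numbers (the seed value 42 is arbitrary):

```r
# Same seed immediately before each call gives identical draws
set.seed(42)
a <- rnorm(5)
set.seed(42)
b <- rnorm(5)
identical(a, b)

# Without resetting the seed, the next draw continues the stream and differs
c_draw <- rnorm(5)
identical(a, c_draw)
```

The same pattern applies to data generation and to train(): set the seed right before each call whose result you want to reproduce.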
I'm relatively new to survival analysis and have been using a standard telco churn data example, with a sample below called 'telco':
telco <- read.csv(text = "State,Account_Length,Area_Code,Intl_Plan,Day_Mins,Day_Calls,Day_Charge,Eve_Mins,Eve_Calls,Eve_Charge,Night_Mins,Night_Calls,Night_Charge,Intl_Mins,Intl_Calls,Intl_Charge,CustServ_Calls,Churn
I've run:
dependentvars = Surv(telco$Account_Length, telco$Churn)
telcosurvreg = survreg(dependentvars ~ -Churn -Account_Length, dist="gaussian",data=telco)
telcopred = predict(telcosurvreg, newdata=telco, type="quantile", p=.5)
...to get the predicted lifetime of each customer.
What I'm struggling with is how to visualise a survival curve for this. Is there a way (preferably in ggplot2) to do this from the data I have?
Here is a base R version that plots the predicted survival curves. I have changed the formula so that the curves differ for each row:
> # change setup so we have one covariate
> telcosurvreg = survreg(
+ Surv(Account_Length, Churn) ~ Eve_Charge, dist = "gaussian", data = telco)
> telcosurvreg # has more than an intercept
survreg(formula = Surv(Account_Length, Churn) ~ Eve_Charge, data = telco,
dist = "gaussian")
(Intercept) Eve_Charge
227.274695 -3.586121
Scale= 56.9418
Loglik(model)= -12.1 Loglik(intercept only)= -12.4
Chisq= 0.54 on 1 degrees of freedom, p= 0.46
n= 6
> # find linear predictors
> vals <- predict(telcosurvreg, newdata = telco, type = "lp")
> # use the survreg.distributions object. See ?survreg.distributions
> x_grid <- 1:400
> sur_curves <- sapply(
+ vals, function(x)
+ survreg.distributions[[telcosurvreg$dist]]$density(
+ (x - x_grid) / telcosurvreg$scale)[, 1])
> # plot with base R
> matplot(x_grid, sur_curves, type = "l", lty = 1)
Here is the result
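For a Gaussian survreg fit, the same curve can also be written with pnorm: the survival at time t is the upper tail of a normal distribution centered at the linear predictor, with the fitted scale as its standard deviation. The sketch below reuses the coefficients and scale printed above; the Eve_Charge value of 20 is a hypothetical input chosen only for illustration. It builds a long data frame that could be handed to a plotting package such as ggplot2.

```r
# Coefficients and scale copied from the survreg output above
intercept <- 227.274695
slope     <- -3.586121
scale     <- 56.9418

# Hypothetical covariate value, for illustration only
eve_charge <- 20
lp <- intercept + slope * eve_charge   # linear predictor for this customer

# Gaussian survival function: S(t) = P(T > t) = 1 - pnorm((t - lp) / scale)
t_grid <- 1:400
surv   <- pnorm((t_grid - lp) / scale, lower.tail = FALSE)

# Data frame with one row per time point, ready for plotting
curve_df <- data.frame(time = t_grid, survival = surv)

# The predicted median lifetime is where the curve crosses 0.5, i.e. lp itself
median_life <- qnorm(0.5, mean = lp, sd = scale)
head(curve_df)
```

This also explains why predict(..., type = "quantile", p = .5) returns the linear predictor for a Gaussian model: the median of a normal distribution is its mean.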
I have a glm model that works. Since I'd like to add (ridge) regularization I thought I'd switch to glmnet. For some reason I cannot get glmnet to work. It seems to always predict the first class, never the second, which results in low accuracy and kappa = 0.
Below is some code to reproduce the problem. What am I doing wrong?
The test data it generates looks like this:
Since the data cannot be linearly separated, two polynomial terms, A^2 and B^2, are added.
A glm model predicts the data correctly (with accuracy = 1 and kappa = 1). Here is its prediction boundary:
While a glmnet model always has kappa = 0, no matter what lambda it tries:
lambda Accuracy Kappa Accuracy SD Kappa SD
0 0.746 0 0.0295 0
1e-04 0.746 0 0.0295 0
0.01 0.746 0 0.0295 0
0.1 0.746 0 0.0295 0
1 0.746 0 0.0295 0
10 0.746 0 0.0295 0
Code to reproduce the problem:
# generate test data
n <- 500; m <- 100
data <- data.frame(A=runif(n, 98, 102), B=runif(n, 98, 102), Type="foo")
data <- subset(data, sqrt((A-100)^2 + (B-100)^2) > 1.5)
data <- rbind(data, data.frame(A=rnorm(m, 100, 0.25), B=rnorm(m, 100, 0.25), Type="bar"))
# add a few polynomial features to match ellipses
polymap <- function(data) cbind(data, A2=data$A^2, B2=data$B^2)
data <- polymap(data)
plot(x=data$A, y=data$B, pch=21, bg=data$Type, xlab="A", ylab="B")
# train a binomial glm model
model.glm <- train(Type ~ ., data=data, method="glm", family="binomial",
preProcess=c("center", "scale"))
# train a binomial glmnet model with ridge regularization (alpha = 0)
model.glmnet <- train(Type ~ ., data=data, method="glmnet", family="binomial",
preProcess=c("center", "scale"),
tuneGrid=expand.grid(alpha=0, lambda=c(0, 0.0001, 0.01, 0.1, 1, 10)))
print(model.glm) # <- Accuracy = 1, Kappa = 1 - good!
print(model.glmnet) # <- Accuracy = low, Kappa = 0 - bad!
Calling glmnet directly (without caret) results in the same problem:
x <- as.matrix(subset(data, select=-c(Type)))
y <- data$Type
model.glmnet2 <- cv.glmnet(x=x, y=y, family="binomial", type.measure="class")
preds <- predict(model.glmnet2, x, type="class", s="lambda.min")
# all predictions are class 1...
EDIT: Plot of the scaled data and the decision boundary found by glm:
Model: -37 + 6317*A + 6059*B - 6316*A2 - 6059*B2
You should center and scale data prior to making polynomial versions of the predictor. Numerically, things work better that way:
n <- 500; m <- 100
data <- data.frame(A=runif(n, 98, 102), B=runif(n, 98, 102), Type="foo")
data <- subset(data, sqrt((A-100)^2 + (B-100)^2) > 1.5)
data <- rbind(data, data.frame(A=rnorm(m, 100, 0.25), B=rnorm(m, 100, 0.25), Type="bar"))
data2 <- data
data2$A <- scale(data2$A, scale = TRUE)
data2$B <- scale(data2$B, scale = TRUE)
data2$A2 <- data2$A^2
data2$B2 <- data2$B^2
# train a binomial glm model
model.glm2 <- train(Type ~ ., data=data2, method="glm")
# train a binomial glmnet model with ridge regularization (alpha = 0)
model.glmnet2 <- train(Type ~ ., data=data2, method="glmnet",
                       tuneGrid=expand.grid(alpha=0,
                                            lambda=c(0, 0.0001, 0.01, 0.1, 1, 10)))
From these:
> getTrainPerf(model.glm2)
TrainAccuracy TrainKappa method
1 1 1 glm
> getTrainPerf(model.glmnet2)
TrainAccuracy TrainKappa method
1 1 1 glmnet
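One way to see why the order of operations matters numerically: when a predictor lives in a narrow range far from zero (here roughly 98 to 102), its square is almost perfectly collinear with it, so a penalized fit can hardly tell the two apart. Centering first removes that collinearity. This is a small base R sketch on an evenly spaced grid, not the simulated data above.

```r
# Evenly spaced values over the same range as A and B in the question
A <- seq(98, 102, length.out = 101)

# Squaring the raw values: A and A^2 are almost perfectly correlated
cor_raw <- cor(A, A^2)

# Centering first: the squared term now carries separate information
A_centered   <- A - mean(A)
cor_centered <- cor(A_centered, A_centered^2)

c(raw = cor_raw, centered = cor_centered)
```

On this symmetric grid the centered correlation is zero up to floating-point error, while the raw correlation is indistinguishable from 1 for practical purposes.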