Is there a way to combine multiple predictions from different models in mlr into a single average prediction so that it can be used to calculate performance measures etc.?
library(mlr)
data(iris)
iris2 <- iris
iris2$Species <- ifelse(iris$Species=="setosa", "ja", "nein")
task = makeClassifTask(data = iris2, target = "Species")
lrn = makeLearner("classif.h2o.deeplearning", predict.type="prob")
model1 = train(lrn, task)
model2 = train(lrn, task)
pred1 = predict(model1, newdata=iris2)
pred2 = predict(model2, newdata=iris2)
performance(pred1, measures = auc)
g = generateThreshVsPerfData(pred1)
plotThreshVsPerf(g)
A workaround to show what I mean could be something like this:
pred_avg = pred1
pred_avg$data[,c("prob.ja","prob.nein")] = (pred1$data[,c("prob.ja","prob.nein")] +
pred2$data[,c("prob.ja","prob.nein")])/2
performance(pred_avg, measures = auc)
g_avg = generateThreshVsPerfData(pred_avg)
plotThreshVsPerf(g_avg)
Is there a way to do this without a workaround and could this workaround have any unwanted side effects?
It sounds like you are looking for a stacking learner, which is mlr's method of performing ensembles.
From the docs:
# Regression
data(BostonHousing, package = "mlbench")
tsk = makeRegrTask(data = BostonHousing, target = "medv")
base = c("regr.rpart", "regr.svm")
lrns = lapply(base, makeLearner)
m = makeStackedLearner(base.learners = lrns,
predict.type = "response", method = "average")
tmp = train(m, tsk)
res = predict(tmp, tsk)
# Prediction: 506 observations
# predict.type: response
# threshold:
# time: 0.02
# id truth response
# 1 1 24.0 27.33742
# 2 2 21.6 22.08853
# 3 3 34.7 33.52007
# 4 4 33.4 32.49923
# 5 5 36.2 32.67973
# 6 6 28.7 22.99323
# ... (506 rows, 3 cols)
performance(res, rmse)
# rmse
# 3.138981
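For the classification setting in the question, the same idea applies with method = "average" and probability predictions. Here is a minimal sketch, using two lightweight base learners as stand-ins for the h2o learner (any probability-capable learners should work):
# classification stacking sketch: average the predicted probabilities of the base learners
base = c("classif.rpart", "classif.lda")
lrns = lapply(base, makeLearner, predict.type = "prob")
stk = makeStackedLearner(base.learners = lrns,
                         predict.type = "prob", method = "average")
stk_model = train(stk, task)                 # 'task' and 'iris2' from the question above
pred_stk = predict(stk_model, newdata = iris2)
performance(pred_stk, measures = auc)
With method = "average" the base learners' predicted probabilities are averaged, which mirrors what the manual workaround above does by hand.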
I would like to use the fastshap package to obtain SHAP value plots for every category of my outcome in a multi-classification problem using a random forest classifier. I could only find chunks of code around, but no explanation of how to proceed from the beginning to obtain the SHAP values in this case. Here is the code I have so far (my y has 5 classes; here I am trying to obtain SHAP values for class 3):
library(randomForest)
library(fastshap)
set.seed(42)
sample <- sample.int(n = nrow(ITA), size = floor(.75*nrow(ITA)), replace=F)
train <- ITA [sample,]
test <- ITA [-sample,]
set.seed(42)
rftrain <-randomForest(y ~ ., data=train, ntree=500, importance = TRUE)
p_function_3<- function(object, newdata)
caret::predict.train(object,
newdata = newdata,
type = "prob")[,3]
shap_values_G <- fastshap::explain(rftrain,
X = train,
pred_wrapper = p_function_3,
nsim = 50,
newdata=train[which(y==3),])
Now, I took the code largely from an example I found online and tried to adapt it (I am not an expert R user), but it does not work. Can you please help me correct it? Thanks!
Here is a working example (with a different dataset), but I think the logic is the same.
library(randomForest)
library(fastshap)
set.seed(42)
ix <- sample(nrow(iris), 0.75 * nrow(iris))
train <- iris[ix, ]
test <- iris[-ix, ]
xvars <- c("Sepal.Width", "Sepal.Length")
yvar <- "Species"
fit <- randomForest(reformulate(xvars, yvar), data = train, ntree = 500)
pred_3 <- function(model, newdata) {
predict(model, newdata = newdata, type = "prob")[, "virginica"]
}
shap_values_3 <- fastshap::explain(
fit,
X = train, # Reference data
feature_names = xvars,
pred_wrapper = pred_3,
nsim = 50,
newdata = train[train$Species == "virginica", ] # For these rows, you will calculate explanations
)
head(shap_values_3)
# Sepal.Width Sepal.Length
# <dbl> <dbl>
# 1 0.101 0.381
# 2 0.159 -0.0109
# 3 0.0736 -0.0285
# 4 0.0564 0.161
# 5 0.0649 0.594
# 6 0.232 0.0305
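If you then want a simple importance-style plot of these values (the question mentioned plots), one option is to aggregate the mean absolute SHAP value per feature with base R. This is only a minimal sketch; newer fastshap versions also hand the explain object to the shapviz package for richer plots:
# mean absolute SHAP value per feature for the "virginica" class
shap_mat <- as.matrix(as.data.frame(shap_values_3))
imp <- sort(colMeans(abs(shap_mat)), decreasing = TRUE)
barplot(imp, horiz = TRUE, las = 1, main = "Mean |SHAP|, class 'virginica'")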
Could someone share how to train, tune (hyperparameters), cross-validate, and test a ranger quantile regression model, along with error evaluation, using the iris or Boston housing dataset?
The reason I ask is that I have not been able to find many examples or walkthroughs of quantile regression on Kaggle, blogs, or YouTube; most of the problems I encountered are classification problems.
I am currently using a quantile regression model, but I am hoping to see other examples, in particular with hyperparameter tuning.
There are a lot of parameters for this function. Since this isn't a forum for what it all means, I really suggest that you hit up Cross Validated with questions on the how and why. (Or look for questions that may already be answered.)
library(tidyverse)
library(ranger)
library(caret)
library(funModeling)
data(iris)
#----------- setup data -----------
# this doesn't include exploration or cleaning which are both necessary
summary(iris)
df_status(iris)
#----------------- create training sample ----------------
set.seed(395280469) # for replicability
# create training sample partition (80/20 split)
tr <- createDataPartition(iris$Species,
p = .8,
list = F)
There are a lot of ways to split the data, but I tend to prefer caret, because it works to even out factors if that's what you feed it.
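If you want to verify that the partition kept the factor balanced, a quick sanity check could be:
# compare Species proportions inside and outside the training partition
round(prop.table(table(iris$Species[tr])), 2)
round(prop.table(table(iris$Species[-tr])), 2)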
#--------- First model ---------
fit.r <- ranger(Sepal.Length ~ .,
data = iris[tr, ],
write.forest = TRUE,
importance = 'permutation',
quantreg = TRUE,
keep.inbag = TRUE,
replace = FALSE)
fit.r
# Ranger result
#
# Call:
# ranger(Sepal.Length ~ ., data = iris[tr, ], write.forest = TRUE,
# importance = "permutation", quantreg = TRUE, keep.inbag = TRUE,
# replace = FALSE)
#
# Type: Regression
# Number of trees: 500
# Sample size: 120
# Number of independent variables: 4
# Mtry: 2
# Target node size: 5
# Variable importance mode: permutation
# Splitrule: variance
# OOB prediction error (MSE): 0.1199364
# R squared (OOB): 0.8336928
p.r <- predict(fit.r, iris[-tr, -1],
type = 'quantiles')
The quantiles default to .1, .5, and .9:
postResample(p.r$predictions[, 1], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.5165946 0.7659124 0.4036667
postResample(p.r$predictions[, 2], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.3750556 0.7587326 0.3133333
postResample(p.r$predictions[, 3], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.6488991 0.7461830 0.5703333
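If you need quantiles other than the defaults, predict() accepts a quantiles argument; a small sketch:
# request custom quantiles instead of the default c(.1, .5, .9)
p.custom <- predict(fit.r, iris[-tr, -1],
                    type = 'quantiles',
                    quantiles = c(0.05, 0.5, 0.95))
head(p.custom$predictions)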
To see what this looks like in practice:
# this performance is the best so far, let's see what it looks like visually
ggplot(data.frame(p.Q1 = p.r$predictions[, 1],
p.Q5 = p.r$predictions[, 2],
p.Q9 = p.r$predictions[, 3],
Actual = iris[-tr, 1])) +
geom_point(aes(x = Actual, y = p.Q1, color = "P.Q1")) +
geom_point(aes(x = Actual, y = p.Q5, color = "P.Q5")) +
geom_point(aes(x = Actual, y = p.Q9, color = "P.Q9")) +
geom_line(aes(Actual, Actual, color = "Actual")) +
scale_color_viridis_d(end = .8, "Error",
direction = -1)+
theme_bw()
# zoom in on one quantile (here .9) with error segments
ggplot(data.frame(p.Q9 = p.r$predictions[, 3],
Actual = iris[-tr, 1])) +
geom_point(aes(x = Actual, y = p.Q9, color = "P.Q9")) +
geom_segment(aes(x = Actual, xend = Actual,
y = Actual, yend = p.Q9)) +
geom_line(aes(Actual, Actual, color = "Actual")) +
scale_color_viridis_d(end = .8, "Error",
direction = -1)+
theme_bw()
#------------ ranger model with options --------------
# last call used default
# splitrule: variance, use "extratrees" (only 2 for this one)
# mtry = 2, use 3 this time
# min.node.size = 5, using 6 this time
# using num.threads = 15 ** this is the number of cores on YOUR device
# change accordingly --- if you don't know, drop this one
set.seed(326)
fit.r2 <- ranger(Sepal.Length ~ .,
data = iris[tr, ],
write.forest = TRUE,
importance = 'permutation',
quantreg = TRUE,
keep.inbag = TRUE,
replace = FALSE,
splitrule = "extratrees",
mtry = 3,
min.node.size = 6,
num.threads = 15)
fit.r2
# Ranger result
# Type: Regression
# Number of trees: 500
# Sample size: 120
# Number of independent variables: 4
# Mtry: 3
# Target node size: 6
# Variable importance mode: permutation
# Splitrule: extratrees
# Number of random splits: 1
# OOB prediction error (MSE): 0.1107299
# R squared (OOB): 0.8464588
This model performed similarly.
p.r2 <- predict(fit.r2, iris[-tr, -1],
type = 'quantiles')
postResample(p.r2$predictions[, 1], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.4932883 0.8144309 0.4000000
postResample(p.r2$predictions[, 2], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.3610171 0.7643744 0.3100000
postResample(p.r2$predictions[, 3], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.6555939 0.8141144 0.5603333
The prediction was pretty similar overall, as well.
This isn't a very large set of data, with few predictors.
How much do they contribute?
importance(fit.r2)
# Sepal.Width Petal.Length Petal.Width Species
# 0.06138883 0.71052453 0.22956522 0.18082998
#------------ ranger model with options --------------
# drop a predictor, lower mtry, min.node.size
set.seed(326)
fit.r3 <- ranger(Sepal.Length ~ .,
data = iris[tr, -4], # dropped Petal.Width
write.forest = TRUE,
importance = 'permutation',
quantreg = TRUE,
keep.inbag = TRUE,
replace = FALSE,
splitrule = "extratrees",
mtry = 2, # has to change (var count lower)
min.node.size = 4, # lowered
num.threads = 15)
fit.r3
# Ranger result
# Type: Regression
# Number of trees: 500
# Sample size: 120
# Number of independent variables: 3
# Mtry: 2
# Target node size: 4
# Variable importance mode: permutation
# Splitrule: extratrees
# Number of random splits: 1
# OOB prediction error (MSE): 0.1050143
# R squared (OOB): 0.8543842
The second most important predictor was removed and it improved.
p.r3 <- predict(fit.r3, iris[-tr, -c(1, 4)],
type = 'quantiles')
postResample(p.r3$predictions[, 1], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.4760952 0.8089810 0.3800000
postResample(p.r3$predictions[, 2], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.3738315 0.7769388 0.3250000
postResample(p.r3$predictions[, 3], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.6085584 0.8032592 0.5170000
importance(fit.r3)
# almost everything relies on Petal.Length
# Sepal.Width Petal.Length Species
# 0.08008264 0.95440333 0.32570147
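Since the question also asked about hyperparameter tuning with cross-validation, here is a sketch of how caret's train() could wrap ranger and grid-search mtry, splitrule, and min.node.size (this tunes the conditional-mean forest by RMSE; you would then refit ranger with quantreg = TRUE at the chosen settings to get quantile predictions). The object names here are just illustrative:
# cross-validated grid search over ranger's tuning parameters via caret
tune_grid <- expand.grid(mtry = 2:4,
                         splitrule = c("variance", "extratrees"),
                         min.node.size = c(3, 5, 7))
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(326)
cv_fit <- train(Sepal.Length ~ .,
                data = iris[tr, ],
                method = "ranger",
                trControl = ctrl,
                tuneGrid = tune_grid)
cv_fit$bestTune # best mtry / splitrule / min.node.size by RMSE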
I have a beautiful mlr3 ensemble model (combined glmnet and glm) for binary prediction; see the details below.
library("mlr3verse")
library("dplyr")
# get example data
data(PimaIndiansDiabetes, package="mlbench")
data <- PimaIndiansDiabetes
# add an additional predictor "superdoc" which is not entered in the glmnet but in the final glm
set.seed(2323)
data %>%
rowwise() %>%
mutate(superdoc=case_when(diabetes=="pos" ~ as.numeric(sample(0:2,1)), TRUE~ 0)) %>%
ungroup -> data
# make a rather small train set
set.seed(23)
test.data <- sample_n(data,70,replace=FALSE)
# create elastic net regression learner
glmnet_lrn = lrn("classif.cv_glmnet", predict_type = "prob")
# create the learner out-of-bag predictions
glmnet_cv1 = po("learner_cv", glmnet_lrn, id = "glmnet")
# PipeOp that drops 'superdoc', i.e. selects all except 'superdoc'
# (ID given to avoid ID clash with other selector)
drop_superdoc = po("select", id = "drop.superdoc",
selector = selector_invert(selector_name("superdoc")))
# PipeOp that selects 'superdoc' (and drops all other columns)
select_superdoc = po("select", id = "select.superdoc",
selector = selector_name("superdoc"))
# superdoc along one path, the fitted model along the other
stacking_layer = gunion(list(
select_superdoc,
drop_superdoc %>>% glmnet_cv1
)) %>>% po("featureunion", id = "union1")
# final logistic regression
log_reg_lrn = lrn("classif.log_reg", predict_type = "prob")
# combine ensemble model
ensemble = stacking_layer %>>% log_reg_lrn
# define the task
train.task <- TaskClassif$new("test.data", test.data, target = "diabetes")
# make ensemble learner
elearner = as_learner(ensemble)
ensemble$plot(html = FALSE)
If I train it with different set.seed values, I get different coefficients.
I think this is mainly caused by the rather small amount of training data that enters the glmnet model, and it could be mitigated by repeated cross-validation.
# Train the Learner:
# seed 1
elearner = as_learner(ensemble)
set.seed(22521136)
elearner$train(train.task) -> seed1
# seed 2
elearner = as_learner(ensemble)
set.seed(12354)
elearner$train(train.task) -> seed2
# different coefficients of the glmnet model
coef(seed1$model$glmnet$model, s ="lambda.min")
#> 9 x 1 sparse Matrix of class "dgCMatrix"
#> 1
#> (Intercept) -6.238598277
#> age .
#> glucose 0.023462376
#> insulin -0.001007037
#> mass 0.055587740
#> pedigree 0.322911217
#> pregnant 0.137419564
#> pressure .
#> triceps .
coef(seed2$model$glmnet$model, s ="lambda.min")
#> 9 x 1 sparse Matrix of class "dgCMatrix"
#> 1
#> (Intercept) -6.876802620
#> age .
#> glucose 0.025601712
#> insulin -0.001500856
#> mass 0.063029550
#> pedigree 0.464369417
#> pregnant 0.155971123
#> pressure .
#> triceps .
# different coefficients of the final regression model
seed1$model$classif.log_reg$model$coefficients
#> (Intercept) superdoc glmnet.prob.neg glmnet.prob.pos
#> -9.438452 23.710923 8.726956 NA
seed2$model$classif.log_reg$model$coefficients
#> (Intercept) superdoc glmnet.prob.neg glmnet.prob.pos
#> 0.3698143 23.5362542 -5.5514365 NA
Question:
Where and how could repeated cross-validation be added to my mlr3 ensemble model to mitigate these varying results? Any help is very much appreciated.
Thanks to missuse's comments, his marvellous tutorial (Tuning a stacked learner), and mb706's comments, I think I was able to solve my question.
Replace "classif.cv_glmnet" with "classif.glmnet"
# Add tuning
resampling = rsmp("repeated_cv")
resampling$param_set$values = list(repeats = 10, folds=5)
ps_ens = ParamSet$new(
list(
ParamDbl$new("glmnet.alpha", 0, 1),
ParamDbl$new("glmnet.s", 0, 1)))
auto1 = AutoTuner$new(
learner = elearner,
resampling = resampling,
measure = msr("classif.auc"),
search_space = ps_ens,
terminator = trm("evals", n_evals = 5), # to limit running time
tuner = tnr("random_search")
)
Train with different set.seed values and get the same coefficients:
# Train with different set.seed
#first
set.seed(22521136)
at1= auto1
at1$train(train.task) -> seed1
# second
set.seed(12354)
at2= auto1
at2$train(train.task) -> seed2
# Compare coefficients of the learners
# classif.log_reg
seed1$model$learner$model$classif.log_reg$model$coefficients
# (Intercept) superdoc glmnet.prob.neg glmnet.prob.pos
# 2.467855 21.570766 -6.966693 NA
seed2$model$learner$model$classif.log_reg$model$coefficients
# (Intercept) superdoc glmnet.prob.neg glmnet.prob.pos
# 2.467855 21.570766 -6.966693 NA
#classif.glmnet
coef(at1$learner$model$glmnet$model, alpha=at1$tuning_result$glmnet.alpha,s=at1$tuning_result$glmnet.s)
# 9 x 1 sparse Matrix of class "dgCMatrix"
# 1
# (Intercept) -3.3066981659
# age 0.0076392198
# glucose 0.0077516975
# insulin 0.0003389759
# mass 0.0133955320
# pedigree 0.3256754612
# pregnant 0.0686746156
# pressure 0.0081338885
# triceps -0.0054976030
coef(at2$learner$model$glmnet$model, alpha=at2$tuning_result$glmnet.alpha,s=at2$tuning_result$glmnet.s)
# 9 x 1 sparse Matrix of class "dgCMatrix"
# 1
# (Intercept) -3.3066981659
# age 0.0076392198
# glucose 0.0077516975
# insulin 0.0003389759
# mass 0.0133955320
# pedigree 0.3256754612
# pregnant 0.0686746156
# pressure 0.0081338885
# triceps -0.0054976030
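To confirm that both runs selected the same hyperparameters, you can also inspect the tuning results directly (a quick check):
# selected hyperparameters (should be identical across the two seeds)
at1$tuning_result
at2$tuning_result
# evaluated configurations and their AUC
head(as.data.table(at1$archive))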
I was performing feature selection using rfe from package caret for a linear regression.
One of my regressors is a logical variable; when I do feature selection with this variable, I always
get Error in { : task 1 failed - "undefined columns selected".
How can I do feature selection with logical variables using rfe?
Is it necessary to convert the variable to a 0/1 dummy?
Below is a reproducible example:
library(caret)
x <- mtcars[-1]
y <- mtcars$mpg
set.seed(2017)
ctrl <- rfeControl(functions = lmFuncs,
method = "repeatedcv",
repeats = 5,
verbose = FALSE)
lmProfile1 <- rfe(x, y, sizes = 1:5, rfeControl = ctrl)
# > lmProfile1
#
# Recursive feature selection
#
# Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
#
# Resampling performance over subset size:
#
# Variables RMSE Rsquared RMSESD RsquaredSD Selected
# 1 3.503 0.8338 1.627 0.2393
# 2 3.197 0.8841 1.347 0.1783
# 3 3.214 0.8788 1.327 0.1815
# 4 3.050 0.8861 1.341 0.1603 *
# 5 3.063 0.8842 1.254 0.1670
# 10 3.332 0.8638 1.404 0.1926
#
# The top 4 variables (out of 4):
# wt, am, qsec, hp
# am is one of the best features; now I turn it into a logical variable
x <- mtcars[-1]
x$am <- x$am == 1
y <- mtcars$mpg
set.seed(2017)
ctrl <- rfeControl(functions = lmFuncs,
method = "repeatedcv",
repeats = 5,
verbose = FALSE)
lmProfile2 <- rfe(x, y, sizes = 1:5, rfeControl = ctrl)
# Error in { : task 1 failed - "undefined columns selected"
# > packageVersion('caret')
# [1] ‘6.0.73’
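For what it's worth, one workaround sketch, assuming the error comes from lmFuncs expanding the logical column into a dummy named amTRUE that no longer matches the original column name, is to store the same information as numeric 0/1 (lmProfile3 is just an illustrative name):
# workaround sketch: keep the logical information but store it as numeric 0/1
x$am <- as.numeric(x$am)
set.seed(2017)
lmProfile3 <- rfe(x, y, sizes = 1:5, rfeControl = ctrl)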
I have used the caret package's train function with 10-fold cross-validation. I also got class probabilities for the predicted classes by setting classProbs = TRUE in trControl, as follows:
myTrainingControl <- trainControl(method = "cv",
number = 10,
savePredictions = TRUE,
classProbs = TRUE,
verboseIter = TRUE)
randomForestFit = train(x = input[3:154],
y = as.factor(input$Target),
method = "rf",
trControl = myTrainingControl,
preProcess = c("center","scale"),
ntree = 50)
The output predictions I am getting are as follows:
pred obs 0 1 rowIndex mtry Resample
1 0 1 0.52 0.48 28 12 Fold01
2 0 0 0.58 0.42 43 12 Fold01
3 0 1 0.58 0.42 51 12 Fold01
4 0 0 0.68 0.32 55 12 Fold01
5 0 0 0.62 0.38 59 12 Fold01
6 0 1 0.92 0.08 71 12 Fold01
Now I want to calculate the ROC curve and the area under it (AUC) using this data. How can I achieve this?
A sample example for AUC:
rf_output=randomForest(x=predictor_data, y=target, importance = TRUE, ntree = 10001, proximity=TRUE, sampsize=sampsizes)
library(ROCR)
predictions=as.vector(rf_output$votes[,2])
pred=prediction(predictions,target)
perf_AUC=performance(pred,"auc") #Calculate the AUC value
AUC=perf_AUC@y.values[[1]] # extract the numeric AUC value
perf_ROC=performance(pred,"tpr","fpr") #plot the actual ROC curve
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(AUC, digits=5, scientific=FALSE)))
or using pROC and caret
library(caret)
library(pROC)
data(iris)
iris <- iris[iris$Species == "virginica" | iris$Species == "versicolor", ]
iris$Species <- factor(iris$Species) # setosa should be removed from factor
samples <- sample(NROW(iris), NROW(iris) * .5)
data.train <- iris[samples, ]
data.test <- iris[-samples, ]
forest.model <- train(Species ~., data.train)
result.predicted.prob <- predict(forest.model, data.test, type="prob") # Prediction
result.roc <- roc(data.test$Species, result.predicted.prob$versicolor) # Draw ROC curve.
plot(result.roc, print.thres="best", print.thres.best.method="closest.topleft")
result.coords <- coords(result.roc, "best", best.method="closest.topleft", ret=c("threshold", "accuracy"))
print(result.coords)#to get threshold and accuracy
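You can also build the ROC from the cross-validated hold-out predictions that train() stored via savePredictions = TRUE. This is only a sketch: it assumes the positive-class probability column is named "1" as in the output shown above, and it filters to the finally selected mtry:
library(pROC)
cv_preds <- randomForestFit$pred
# keep only the rows belonging to the final tuning parameters
cv_preds <- merge(cv_preds, randomForestFit$bestTune)
roc_cv <- roc(response = cv_preds$obs, predictor = cv_preds[["1"]])
auc(roc_cv)
plot(roc_cv)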
Update 2019: this is what MLeval was written for (https://cran.r-project.org/web/packages/MLeval/index.html). It works with the caret train output object to make ROC curves, PR curves, and calibration curves, and to calculate metrics such as ROC-AUC, sensitivity, and specificity. It does all of this in one line, which is helpful for my analyses and may be of interest.
library(caret)
library(MLeval)
myTrainingControl <- trainControl(method = "cv",
number = 10,
savePredictions = TRUE,
classProbs = TRUE,
verboseIter = TRUE)
randomForestFit = train(x = Sonar[,1:60],
y = as.factor(Sonar$Class),
method = "rf",
trControl = myTrainingControl,
preProcess = c("center","scale"),
ntree = 50)
##
x <- evalm(randomForestFit)
## get roc curve plotted in ggplot2
x$roc
## get AUC and other metrics
x$stdres