I was performing feature selection for a linear regression using rfe from the caret package.
One of my regressors is a logical variable. Whenever I include this variable in the feature selection, I
get Error in { : task 1 failed - "undefined columns selected".
How can I do feature selection with logical variables using rfe?
Is it necessary to convert the variable to a 0/1 dummy variable?
Below is a reproducible example:
library(caret)
x <- mtcars[-1]
y <- mtcars$mpg
set.seed(2017)
ctrl <- rfeControl(functions = lmFuncs,
                   method = "repeatedcv",
                   repeats = 5,
                   verbose = FALSE)
lmProfile1 <- rfe(x, y, sizes = 1:5, rfeControl = ctrl)
# > lmProfile1
#
# Recursive feature selection
#
# Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
#
# Resampling performance over subset size:
#
# Variables RMSE Rsquared RMSESD RsquaredSD Selected
# 1 3.503 0.8338 1.627 0.2393
# 2 3.197 0.8841 1.347 0.1783
# 3 3.214 0.8788 1.327 0.1815
# 4 3.050 0.8861 1.341 0.1603 *
# 5 3.063 0.8842 1.254 0.1670
# 10 3.332 0.8638 1.404 0.1926
#
# The top 4 variables (out of 4):
# wt, am, qsec, hp
# am is one of the top features; now I turn it into a logical variable
x <- mtcars[-1]
x$am <- x$am == 1
y <- mtcars$mpg
set.seed(2017)
ctrl <- rfeControl(functions = lmFuncs,
                   method = "repeatedcv",
                   repeats = 5,
                   verbose = FALSE)
lmProfile2 <- rfe(x, y, sizes = 1:5, rfeControl = ctrl)
# Error in { : task 1 failed - "undefined columns selected"
# > packageVersion('caret')
# [1] ‘6.0.73’
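One hedged workaround (not from the original post, only a sketch): convert the logical column back to a 0/1 numeric, or to a factor, before calling rfe. With a purely numeric predictor matrix the call above may run without the error.
# minimal sketch, assuming a 0/1 numeric encoding is acceptable
x$am <- as.integer(x$am)   # TRUE/FALSE back to 1/0
set.seed(2017)
lmProfile3 <- rfe(x, y, sizes = 1:5, rfeControl = ctrl)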
I would like to use the fastshap package to obtain SHAP value plots for every category of my outcome in a multi-class classification problem using a random forest classifier. I could only find chunks of code here and there, but no explanation of how to proceed from the beginning to obtain the SHAP values in this case. Here is the code I have so far (my y has 5 classes; here I am trying to obtain SHAP values for class 3):
library(randomForest)
library(fastshap)
set.seed(42)
sample <- sample.int(n = nrow(ITA), size = floor(.75 * nrow(ITA)), replace = FALSE)
train <- ITA[sample, ]
test <- ITA[-sample, ]
set.seed(42)
rftrain <- randomForest(y ~ ., data = train, ntree = 500, importance = TRUE)
p_function_3 <- function(object, newdata)
  caret::predict.train(object,
                       newdata = newdata,
                       type = "prob")[, 3]
shap_values_G <- fastshap::explain(rftrain,
                                   X = train,
                                   pred_wrapper = p_function_3,
                                   nsim = 50,
                                   newdata = train[which(y == 3), ])
Now, I took the code largely from an example I found online and tried to adapt it (I am not an expert R user), but it does not work. Can you please help me correct it? Thanks!
Here is a working example (with a different dataset), but I think the logic is the same.
library(randomForest)
library(fastshap)
set.seed(42)
ix <- sample(nrow(iris), 0.75 * nrow(iris))
train <- iris[ix, ]
test <- iris[-ix, ]
xvars <- c("Sepal.Width", "Sepal.Length")
yvar <- "Species"
fit <- randomForest(reformulate(xvars, yvar), data = train, ntree = 500)
pred_3 <- function(model, newdata) {
  predict(model, newdata = newdata, type = "prob")[, "virginica"]
}
shap_values_3 <- fastshap::explain(
  fit,
  X = train,               # reference data
  feature_names = xvars,
  pred_wrapper = pred_3,
  nsim = 50,
  newdata = train[train$Species == "virginica", ]  # for these rows you will calculate explanations
)
head(shap_values_3)
# Sepal.Width Sepal.Length
# <dbl> <dbl>
# 1 0.101 0.381
# 2 0.159 -0.0109
# 3 0.0736 -0.0285
# 4 0.0564 0.161
# 5 0.0649 0.594
# 6 0.232 0.0305
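To get SHAP values for every category, as asked, one hedged extension of the same example is to loop over the class levels and build one prediction wrapper per class. This reuses fit, train, and xvars from above; shap_by_class is just an illustrative name.
classes <- levels(train$Species)
shap_by_class <- lapply(classes, function(cl) {
  # wrapper returning the predicted probability of the current class
  pred_cl <- function(model, newdata) {
    predict(model, newdata = newdata, type = "prob")[, cl]
  }
  fastshap::explain(
    fit,
    X = train,
    feature_names = xvars,
    pred_wrapper = pred_cl,
    nsim = 50,
    newdata = train[train$Species == cl, ]  # explain the rows of that class
  )
})
names(shap_by_class) <- classes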
Could someone share how to train, tune (hyperparameters), cross-validate, and test a ranger quantile regression model, along with error evaluation? With the iris or Boston housing dataset?
The reason I ask is that I have not been able to find many examples or walkthroughs using quantile regression on Kaggle, random blogs, or YouTube. Most problems I encountered are classification problems.
I am currently using a quantile regression model, but I am hoping to see other examples, in particular with hyperparameter tuning.
There are a lot of parameters for this function. Since this isn't a forum for explaining what they all mean, I really suggest that you take your how-and-why questions to Cross Validated. (Or look for questions that may already be answered.)
library(tidyverse)
library(ranger)
library(caret)
library(funModeling)
data(iris)
#----------- setup data -----------
# this doesn't include exploration or cleaning which are both necessary
summary(iris)
df_status(iris)
#----------------- create training sample ----------------
set.seed(395280469) # for replicability
# create training sample partition (80/20 split)
tr <- createDataPartition(iris$Species,
                          p = .8,
                          list = FALSE)
There are a lot of ways to split the data, but I tend to prefer caret, because it works to even out the factor levels if that's what you feed it.
#--------- First model ---------
fit.r <- ranger(Sepal.Length ~ .,
                data = iris[tr, ],
                write.forest = TRUE,
                importance = 'permutation',
                quantreg = TRUE,
                keep.inbag = TRUE,
                replace = FALSE)
fit.r
# Ranger result
#
# Call:
# ranger(Sepal.Length ~ ., data = iris[tr, ], write.forest = TRUE,
# importance = "permutation", quantreg = TRUE, keep.inbag = TRUE,
# replace = FALSE)
#
# Type: Regression
# Number of trees: 500
# Sample size: 120
# Number of independent variables: 4
# Mtry: 2
# Target node size: 5
# Variable importance mode: permutation
# Splitrule: variance
# OOB prediction error (MSE): 0.1199364
# R squared (OOB): 0.8336928
p.r <- predict(fit.r, iris[-tr, -1],
               type = 'quantiles')
The quantiles default to .1, .5, and .9:
postResample(p.r$predictions[, 1], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.5165946 0.7659124 0.4036667
postResample(p.r$predictions[, 2], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.3750556 0.7587326 0.3133333
postResample(p.r$predictions[, 3], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.6488991 0.7461830 0.5703333
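If you want other quantiles, predict.ranger takes a quantiles argument; this is a hedged sketch, so check ?predict.ranger for your version.
# request custom quantiles instead of the .1/.5/.9 defaults
p.custom <- predict(fit.r, iris[-tr, -1],
                    type = 'quantiles',
                    quantiles = c(0.25, 0.5, 0.75))
head(p.custom$predictions)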
To see what this looks like in practice:
# this performance is the best so far, let's see what it looks like visually
ggplot(data.frame(p.Q1 = p.r$predictions[, 1],
                  p.Q5 = p.r$predictions[, 2],
                  p.Q9 = p.r$predictions[, 3],
                  Actual = iris[-tr, 1])) +
  geom_point(aes(x = Actual, y = p.Q1, color = "P.Q1")) +
  geom_point(aes(x = Actual, y = p.Q5, color = "P.Q5")) +
  geom_point(aes(x = Actual, y = p.Q9, color = "P.Q9")) +
  geom_line(aes(Actual, Actual, color = "Actual")) +
  scale_color_viridis_d(end = .8, "Error",
                        direction = -1) +
  theme_bw()
# looking at a single quantile (here .9) with the residual segments drawn in
ggplot(data.frame(p.Q9 = p.r$predictions[, 3],
                  Actual = iris[-tr, 1])) +
  geom_point(aes(x = Actual, y = p.Q9, color = "P.Q9")) +
  geom_segment(aes(x = Actual, xend = Actual,
                   y = Actual, yend = p.Q9)) +
  geom_line(aes(Actual, Actual, color = "Actual")) +
  scale_color_viridis_d(end = .8, "Error",
                        direction = -1) +
  theme_bw()
#------------ ranger model with options --------------
# last call used default
# splitrule: variance, use "extratrees" (only 2 for this one)
# mtry = 2, use 3 this time
# min.node.size = 5, using 6 this time
# using num.threads = 15 ** this is the number of cores on YOUR device
# change accordingly --- if you don't know, drop this one
set.seed(326)
fit.r2 <- ranger(Sepal.Length ~ .,
                 data = iris[tr, ],
                 write.forest = TRUE,
                 importance = 'permutation',
                 quantreg = TRUE,
                 keep.inbag = TRUE,
                 replace = FALSE,
                 splitrule = "extratrees",
                 mtry = 3,
                 min.node.size = 6,
                 num.threads = 15)
fit.r2
# Ranger result
# Type: Regression
# Number of trees: 500
# Sample size: 120
# Number of independent variables: 4
# Mtry: 3
# Target node size: 6
# Variable importance mode: permutation
# Splitrule: extratrees
# Number of random splits: 1
# OOB prediction error (MSE): 0.1107299
# R squared (OOB): 0.8464588
This model performed similarly.
p.r2 <- predict(fit.r2, iris[-tr, -1],
                type = 'quantiles')
postResample(p.r2$predictions[, 1], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.4932883 0.8144309 0.4000000
postResample(p.r2$predictions[, 2], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.3610171 0.7643744 0.3100000
postResample(p.r2$predictions[, 3], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.6555939 0.8141144 0.5603333
The predictions were pretty similar overall as well.
This isn't a very large dataset, and it has few predictors.
How much does each of them contribute?
importance(fit.r2)
# Sepal.Width Petal.Length Petal.Width Species
# 0.06138883 0.71052453 0.22956522 0.18082998
#------------ ranger model with options --------------
# drop a predictor, lower mtry, min.node.size
set.seed(326)
fit.r3 <- ranger(Sepal.Length ~ .,
                 data = iris[tr, -4], # dropped Petal.Width
                 write.forest = TRUE,
                 importance = 'permutation',
                 quantreg = TRUE,
                 keep.inbag = TRUE,
                 replace = FALSE,
                 splitrule = "extratrees",
                 mtry = 2, # has to change (var count lower)
                 min.node.size = 4, # lowered
                 num.threads = 15)
fit.r3
# Ranger result
# Type: Regression
# Number of trees: 500
# Sample size: 120
# Number of independent variables: 3
# Mtry: 2
# Target node size: 6
# Variable importance mode: permutation
# Splitrule: extratrees
# Number of random splits: 1
# OOB prediction error (MSE): 0.1050143
# R squared (OOB): 0.8543842
The second most important predictor was removed, and the fit improved.
p.r3 <- predict(fit.r3, iris[-tr, -c(1, 4)],
                type = 'quantiles')
postResample(p.r3$predictions[, 1], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.4760952 0.8089810 0.3800000
postResample(p.r3$predictions[, 2], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.3738315 0.7769388 0.3250000
postResample(p.r3$predictions[, 3], iris[-tr, 1])
# RMSE Rsquared MAE
# 0.6085584 0.8032592 0.5170000
importance(fit.r3)
# almost everything relies on Petal.Length
# Sepal.Width Petal.Length Species
# 0.08008264 0.95440333 0.32570147
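Since the question also asked about cross-validated hyperparameter tuning, here is a hedged sketch using caret's built-in "ranger" method, which tunes mtry, splitrule, and min.node.size over a grid; quantreg = TRUE and importance = "permutation" are assumed to be passed through to ranger() via train()'s dots.
# hedged sketch: grid search with repeated 10-fold CV
grid <- expand.grid(mtry = 2:4,
                    splitrule = c("variance", "extratrees"),
                    min.node.size = c(3, 5, 7))
cv_ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(326)
fit.cv <- train(Sepal.Length ~ .,
                data = iris[tr, ],
                method = "ranger",
                trControl = cv_ctrl,
                tuneGrid = grid,
                quantreg = TRUE,
                importance = "permutation")
fit.cv$bestTune   # best combination by RMSE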
I am studying this website about the bagging method: https://bradleyboehmke.github.io/HOML/bagging.html
I am going to use the train() function with cross-validation for bagging, something like below.
As far as I understand, nbagg = 200 tells R to try 200 trees, calculate the RMSE for each, and return the number of trees (here 80) for which the best RMSE is achieved.
Now, how can I see what RMSE other nbagg values have produced in this model, like the RMSE vs. number of trees plot on that website (before introducing the CV method and the train() function)?
ames_bag2 <- train(
  Sale_Price ~ .,
  data = ames_train,
  method = "treebag",
  trControl = trainControl(method = "cv", number = 10),
  nbagg = 200,
  control = rpart.control(minsplit = 2, cp = 0)
)
ames_bag2
## Bagged CART
##
## 2054 samples
## 80 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 1849, 1848, 1848, 1849, 1849, 1847, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 26957.06 0.8900689 16713.14
As the example you shared is not completely reproducible, I have taken a different example from the mtcars dataset to illustrate how you can do it. You can extend that for your data.
Note: The RMSE shown here is the average of 10 RMSEs, since the number of CV folds is 10, so we will store only that. I am also adding the relevant libraries to the example and setting the maximum number of trees to 15, just for the example.
library(ipred)
library(caret)
library(rpart)
library(dplyr)
data("mtcars")
n_trees <- 1
error_df <- data.frame()
while (n_trees <= 15) {
  ames_bag2 <- train(
    mpg ~ .,
    data = mtcars,
    method = "treebag",
    trControl = trainControl(method = "cv", number = 10),
    nbagg = n_trees,
    control = rpart.control(minsplit = 2, cp = 0)
  )
  error_df <- error_df %>%
    bind_rows(data.frame(trees = n_trees,
                         rmse = mean(ames_bag2[["resample"]]$RMSE)))
  n_trees <- n_trees + 1
}
error_df will show the output.
> error_df
trees rmse
1 1 2.493117
2 2 3.052958
3 3 2.052801
4 4 2.239841
5 5 2.500279
6 6 2.700347
7 7 2.642525
8 8 2.497162
9 9 2.263527
10 10 2.379366
11 11 2.447560
12 12 2.314433
13 13 2.423648
14 14 2.192112
15 15 2.256778
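To get the RMSE vs. number-of-trees picture you described, one hedged option (not in the original answer) is to plot the stored error_df:
library(ggplot2)
# RMSE against the number of bagged trees, one point per nbagg value tried
ggplot(error_df, aes(x = trees, y = rmse)) +
  geom_line() +
  geom_point() +
  labs(x = "Number of bagged trees (nbagg)", y = "Mean CV RMSE")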
Using cross-validation in model tuning, I get different error rates from caret::train's results object than from calculating the error myself on its pred object. I'd like to understand why they differ, and ideally how to use out-of-fold error rates for model selection, plotting model performance, etc.
The pred object contains out-of-fold predictions. The docs are pretty clear that trainControl(..., savePredictions = "final") saves out-of-fold predictions for the best hyperparameter values: "an indicator of how much of the hold-out predictions for each resample should be saved... "final" saves the predictions for the optimal tuning parameters." (Keeping "all" predictions and then filtering to the best tuning values doesn't resolve the issue.)
The train docs say that the results object is "a data frame the training error rate..." I'm not sure what that means, but the values for the best row are consistently different from the metrics calculated on pred. Why do they differ and how can I make them line up?
d <- data.frame(y = rnorm(50))
d$x1 <- rnorm(50, d$y)
d$x2 <- rnorm(50, d$y)
train_control <- caret::trainControl(method = "cv",
                                     number = 4,
                                     search = "random",
                                     savePredictions = "final")
m <- caret::train(x = d[, -1],
                  y = d$y,
                  method = "ranger",
                  trControl = train_control,
                  tuneLength = 3)
#> Loading required package: lattice
#> Loading required package: ggplot2
m
#> Random Forest
#>
#> 50 samples
#> 2 predictor
#>
#> No pre-processing
#> Resampling: Cross-Validated (4 fold)
#> Summary of sample sizes: 38, 36, 38, 38
#> Resampling results across tuning parameters:
#>
#> min.node.size mtry splitrule RMSE Rsquared MAE
#> 1 2 maxstat 0.5981673 0.6724245 0.4993722
#> 3 1 extratrees 0.5861116 0.7010012 0.4938035
#> 4 2 maxstat 0.6017491 0.6661093 0.4999057
#>
#> RMSE was used to select the optimal model using the smallest value.
#> The final values used for the model were mtry = 1, splitrule =
#> extratrees and min.node.size = 3.
MLmetrics::RMSE(m$pred$pred, m$pred$obs)
#> [1] 0.609202
MLmetrics::R2_Score(m$pred$pred, m$pred$obs)
#> [1] 0.642394
Created on 2018-04-09 by the reprex package (v0.2.0).
The RMSE for cross-validation is not calculated the way you show; rather, it is calculated for each fold and then averaged. Full example:
set.seed(1)
d <- data.frame(y = rnorm(50))
d$x1 <- rnorm(50, d$y)
d$x2 <- rnorm(50, d$y)
train_control <- caret::trainControl(method = "cv",
                                     number = 4,
                                     search = "random",
                                     savePredictions = "final")
set.seed(1)
m <- caret::train(x = d[, -1],
                  y = d$y,
                  method = "ranger",
                  trControl = train_control,
                  tuneLength = 3)
#output
Random Forest
50 samples
2 predictor
No pre-processing
Resampling: Cross-Validated (4 fold)
Summary of sample sizes: 37, 38, 37, 38
Resampling results across tuning parameters:
min.node.size mtry splitrule RMSE Rsquared MAE
8 1 extratrees 0.6106390 0.4360609 0.4926629
12 2 extratrees 0.6156636 0.4294237 0.4954481
19 2 variance 0.6472539 0.3889372 0.5217369
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 1, splitrule = extratrees and min.node.size = 8.
RMSE for best model is 0.6106390
Now calculate the RMSE for each fold and average:
library(dplyr)  # for the pipe and group_by/summarise

m$pred %>%
  group_by(Resample) %>%
  mutate(rmse = caret::RMSE(pred, obs)) %>%
  summarise(mean = mean(rmse)) %>%
  pull(mean) %>%
  mean
#output
0.610639
m$pred %>%
  group_by(Resample) %>%
  mutate(rmse = MLmetrics::RMSE(pred, obs)) %>%
  summarise(mean = mean(rmse)) %>%
  pull(mean) %>%
  mean
#output
0.610639
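A hedged, slightly more direct version of the same calculation computes one RMSE per fold with summarise() and then averages across folds:
m$pred %>%
  group_by(Resample) %>%
  summarise(rmse = caret::RMSE(pred, obs)) %>%  # one RMSE per fold
  summarise(mean_rmse = mean(rmse))             # average across folds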
I get different results. This is apparently a random process.
MLmetrics::RMSE(m$pred$pred, m$pred$obs)
[1] 0.5824464
> MLmetrics::R2_Score(m$pred$pred, m$pred$obs)
[1] 0.5271595
If you want a random (more accurately, pseudo-random) process to be reproducible, then use set.seed immediately prior to the call.
Is there a way to combine multiple predictions from different models in mlr into a single average prediction so that it can be used to calculate performance measures etc.?
library(mlr)
data(iris)
iris2 <- iris
iris2$Species <- ifelse(iris$Species=="setosa", "ja", "nein")
task = makeClassifTask(data = iris2, target = "Species")
lrn = makeLearner("classif.h2o.deeplearning", predict.type="prob")
model1 = train(lrn, task)
model2 = train(lrn, task)
pred1 = predict(model1, newdata=iris2)
pred2 = predict(model2, newdata=iris2)
performance(pred1, measures = auc)
g = generateThreshVsPerfData(pred1)
plotThreshVsPerf(g)
A workaround to show what I mean could be maybe
pred_avg = pred1
pred_avg$data[, c("prob.ja", "prob.nein")] = (pred1$data[, c("prob.ja", "prob.nein")] +
                                              pred2$data[, c("prob.ja", "prob.nein")]) / 2
performance(pred_avg, measures = auc)
g_avg = generateThreshVsPerfData(pred_avg)
plotThreshVsPerf(g_avg)
Is there a way to do this without a workaround and could this workaround have any unwanted side effects?
It sounds like you are looking for a stacking learner, which is mlr's method of performing ensembles.
From the docs:
# Regression
data(BostonHousing, package = "mlbench")
tsk = makeRegrTask(data = BostonHousing, target = "medv")
base = c("regr.rpart", "regr.svm")
lrns = lapply(base, makeLearner)
m = makeStackedLearner(base.learners = lrns,
                       predict.type = "response", method = "average")
tmp = train(m, tsk)
res = predict(tmp, tsk)
# Prediction: 506 observations
# predict.type: response
# threshold:
# time: 0.02
# id truth response
# 1 1 24.0 27.33742
# 2 2 21.6 22.08853
# 3 3 34.7 33.52007
# 4 4 33.4 32.49923
# 5 5 36.2 32.67973
# 6 6 28.7 22.99323
# ... (506 rows, 3 cols)
performance(res, rmse)
# rmse
# 3.138981
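Adapting this to the binary task from the question should be a matter of using classification base learners with predict.type = "prob" and method = "average". This is a hedged sketch: the learner list and names like stack_cl are mine, and it reuses task and iris2 from the question; check ?makeStackedLearner for the exact behavior.
base_cl = c("classif.rpart", "classif.h2o.deeplearning")
lrns_cl = lapply(base_cl, makeLearner, predict.type = "prob")
stack_cl = makeStackedLearner(base.learners = lrns_cl,
                              predict.type = "prob",
                              method = "average")   # average the base probabilities
model_avg = train(stack_cl, task)
pred_avg2 = predict(model_avg, newdata = iris2)
performance(pred_avg2, measures = auc)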