Bootstrap resampling for stacked/ensemble learner with mlr3 in R

So I'm trying to generate bootstrapped resamples for an ensemble model, which throws an error. The error seems to result from duplicated row_ids; duplicate rows are to be expected when resampling with replacement, but I'm not sure why this fails in the ensemble context and not with a single learner.
Please see below for a reprex.
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(progressr)
lgr::get_logger("mlr3")$set_threshold("warn")

# Stack two cross-validated base learners, union their predictions,
# and feed them into a final linear model
ens.lrnr <- gunion(list(
  po("learner_cv", lrn("regr.lm")),
  po("learner_cv", lrn("regr.rpart")))) %>>%
  po("featureunion") %>>%
  lrn("regr.lm", id = "master") |>
  as_learner()

task <- tsk("boston_housing")
task$select(task$feature_names[!task$feature_names %in% "town"])

boot_res <- function(.lrnr) {
  progressr::with_progress(expr = {
    mlr3::resample(
      task = task,
      learner = .lrnr,
      resampling = rsmp("bootstrap", repeats = 100, ratio = 1),
      store_models = FALSE
    )
  })
}
# single learner works
rpart_boot <- boot_res(lrn("regr.rpart"))
# ensemble learner fails
ens_boot <- boot_res(ens.lrnr)
#> Error in as_data_backend.data.frame(data, primary_key = row_ids):
#>   Assertion on 'data[[primary_key]]' failed: Contains duplicated
#>   values, position 8.
#> This happened PipeOp regr.lm's $train()
Created on 2023-02-17 with reprex v2.0.2
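For what it's worth, you can confirm that the duplicated ids come straight from the bootstrap itself; a minimal sketch using the task defined above:

# Each bootstrap training set samples rows with replacement, so its
# row_ids contain duplicates; inside po("learner_cv") those ids are then
# used as a primary key by as_data_backend(), which is what errors above.
rs <- rsmp("bootstrap", repeats = 1, ratio = 1)
rs$instantiate(task)
any(duplicated(rs$train_set(1)))
#> TRUE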

Related

Create a multivariate matrix in tidymodels recipes::recipe()

I am trying to do k-fold cross-validation on a model that predicts the joint distribution of the proportion of tree species basal area from satellite imagery. This requires the DirichletReg::DirichReg() function, which in turn requires that the response variables be prepared as a matrix using the DirichletReg::DR_data() function. I originally tried to accomplish this with the caret:: package, but found out that caret:: does not support multivariate responses. I have since tried to implement this in the tidymodels:: suite of packages.

Following the documentation on how to register a new model in the parsnip:: package (I appreciate Max Kuhn's vegetable humor), I created a "DREG" model and a "DR" engine. My registered model works when I simply call it on a single training dataset, but my goal is k-fold cross-validation, implementing vfold_cv(), a workflow(), and the fit_resamples() function. With the code I currently have, I get a warning message stating:
Warning message:
All models failed. See the `.notes` column.
Those notes state:
Error in get(resp_char, environment(oformula)): object 'cbind(PSME, TSHE, ALRU2)' not found
This, I believe, is due to the use of DR_data() to preprocess the response variables into the format necessary for DirichletReg::DirichReg() to run properly. I think the solution involves getting this pre-processing to happen in either the recipe() call or in the set_fit() call when I register the model with parsnip::. I have tried to use the step_mutate() function when specifying the recipe, but that applies a function to each column individually rather than taking the columns together as inputs. This leads to the following error in the "notes" from the output of fit_resamples():
Must subset columns with a valid subscript vector.
Subscript has the wrong type `quosures`.
It must be numeric or character.
Is there a way to get the recipe to transform several columns into a DirichletRegData class, either with a step_*() function wrapping DR_data() or via the pre= argument of set_fit() and set_pred()?
Below is my reproducible example:
##Loading Necessary Packages##
library(tidymodels)
library(DirichletReg)

##Creating Fake Data##
set.seed(88) # For reproducibility
#Response variables#
PSME_BA <- rnorm(100, 50, 15)
TSHE_BA <- rnorm(100, 40, 12)
ALRU2_BA <- rnorm(100, 20, 0.5)
Total_BA <- PSME_BA + TSHE_BA + ALRU2_BA
#Predictor variables#
B1 <- runif(100, 0, 2000)
B2 <- runif(100, 0, 1800)
B3 <- runif(100, 0, 3000)
#Dataset for modeling#
DF <- data.frame(PSME = PSME_BA/Total_BA, TSHE = TSHE_BA/Total_BA, ALRU2 = ALRU2_BA/Total_BA,
                 B1 = B1, B2 = B2, B3 = B3)
##Modeling the data using Dirichlet regression with repeated k-fold cross-validation##
#Registering the model with parsnip::#
set_new_model("DREG")
set_model_mode(model = "DREG", mode = "regression")
set_model_engine("DREG", mode = "regression", eng = "DR")
set_dependency("DREG", eng = "DR", pkg = "DirichletReg")
set_model_arg(
  model = "DREG",
  eng = "DR",
  parsnip = "param",
  original = "model",
  func = list(pkg = "DirichletReg", fun = "DirichReg"),
  has_submodel = FALSE
)
DREG <-
  function(mode = "regression", param = NULL) {
    # Check for correct mode
    if (mode != "regression") {
      rlang::abort("`mode` should be 'regression'")
    }
    # Capture the arguments in quosures
    args <- list(sub_classes = rlang::enquo(param))
    # Save some empty slots for future parts of the specification
    new_model_spec(
      "DREG",
      args = args,
      eng_args = NULL,
      mode = mode,
      method = NULL,
      engine = NULL
    )
  }
set_fit(
  model = "DREG",
  eng = "DR",
  mode = "regression",
  value = list(
    interface = "formula",
    protect = NULL,
    func = c(pkg = "DirichletReg", fun = "DirichReg"),
    defaults = list()
  )
)
set_encoding(
  model = "DREG",
  eng = "DR",
  mode = "regression",
  options = list(
    predictor_indicators = "none",
    compute_intercept = TRUE,
    remove_intercept = TRUE,
    allow_sparse_x = FALSE
  )
)
set_pred(
  model = "DREG",
  eng = "DR",
  mode = "regression",
  type = "numeric",
  value = list(
    pre = NULL,
    post = NULL,
    func = c(fun = "predict.DirichletRegModel"),
    args =
      list(
        object = expr(object$fit),
        newdata = expr(new_data),
        type = "response"
      )
  )
)
##Running the Model##
DF$Y <- DR_data(DF[, c(1:3)]) # Preparing the response variables
dreg_spec <- DREG(param = "alternative") %>%
  set_engine("DR")
dreg_mod <- dreg_spec %>%
  fit(Y ~ B1 + B2 + B3, data = DF) # Model works when simply run on a single dataset

##Attempting Cross-validation##
#First attempt - simply call Y as the response variable in the recipe#
kfolds <- vfold_cv(DF, v = 10, repeats = 2)
rcp <- recipe(Y ~ B1 + B2 + B3, data = DF)
dreg_fit <- workflow() %>%
  add_model(dreg_spec) %>%
  add_recipe(rcp)
dreg_rsmpl <- dreg_fit %>%
  fit_resamples(kfolds) # Throws warning about all models failing

#Second attempt - use step_mutate_at()#
rcp <- recipe(~ B1 + B2 + B3, data = DF) %>%
  step_mutate_at(fn = DR_data, var = vars(PSME, TSHE, ALRU2))
dreg_fit <- workflow() %>%
  add_model(dreg_spec) %>%
  add_recipe(rcp)
dreg_rsmpl <- dreg_fit %>%
  fit_resamples(kfolds) # Throws warning about all models failing
This works, but I'm not sure if it's what you were expecting.
First: getting the data set up for CV and DR_data()
I don't know of any package that has built what would essentially be a translation layer between CV and DirichletReg, so that part is done manually. You might be surprised to find it's not all that complicated.
Using the data you created and the modeling objects you registered for tidymodels (those prefixed with set_), I created the CV structure that you were trying to use.
df1 <- data.frame(PSME = PSME_BA/Total_BA, TSHE = TSHE_BA/Total_BA,
                  ALRU2 = ALRU2_BA/Total_BA, B1, B2, B3)
set.seed(88)
kDf2 <- kDf1 <- vfold_cv(df1, v = 10, repeats = 2)
For each of the 20 analysis sets in kDf1 (10 folds × 2 repeats), I used DR_data to set the data up for the models.
# convert to DR_data (each fold and repeat)
# Note: map() is used here for its side effect; the `<<-` overwrites each
# split's data inside kDf1 with a version carrying the DR_data response Y
df2 <- map(1:20,
           .f = function(x){
             in_ids = kDf1$splits[[x]]$in_id
             dd <- kDf1$splits[[x]]$data[in_ids, ] # filter rows BEFORE DR_data
             dd$Y <- DR_data(dd[, 1:3])
             kDf1$splits[[x]]$data <<- dd
           })
Because I'm not all that familiar with tidymodels, I next conducted the modeling using DirichReg directly. I then did it again with tidymodels and compared the two. (The output is identical.)
DirichReg Models and summaries of the fits
set.seed(88)
# perform crossfold validation on Dirichlet Model
df2.fit <- map(1:20,
               .f = function(x){
                 Rpt = kDf1$splits[[x]]$id$id
                 Fld = kDf1$splits[[x]]$id$id2
                 daf = kDf1$splits[[x]]$data
                 fit = DirichReg(Y ~ B1 + B2, daf)
                 list(Rept = Rpt, Fold = Fld, fit = fit)
               })
# summary of each fitted model
fit.a <- map(1:20,
             .f = function(x){
               summary(df2.fit[[x]]$fit)
             })
tidymodels models and summaries of the fits (the code looks nearly the same, but there are a few differences; the output is the same, though)
# I'm not sure what 'alternative' is supposed to do here?
dreg_spec <- DREG(param = "alternative") %>% # this is not model = alternative
  set_engine("DR")
set.seed(88)
dfa.fit <- map(1:20,
               .f = function(x){
                 Rpt = kDf1$splits[[x]]$id$id
                 Fld = kDf1$splits[[x]]$id$id2
                 daf = kDf1$splits[[x]]$data
                 fit = dreg_spec %>%
                   fit(Y ~ B1 + B2, data = daf)
                 list(Rept = Rpt, Fold = Fld, fit = fit)
               })
afit.a <- map(1:20,
              .f = function(x){
                summary(dfa.fit[[x]]$fit$fit) # extra nest for parsnip
              })
If you wanted to see the first model?
fit.a[[1]]
afit.a[[1]]
If you wanted the model with the lowest AIC?
# compare AIC, BIC, and likelihood?
# which do you perceive as the best fit?
fmin = min(unlist(map(1:20, ~fit.a[[.x]]$aic))) # minimum AIC among the DirichReg fits
# find the model number with the minimum AIC
paste0((map(1:20, ~ifelse(fit.a[[.x]]$aic == fmin, .x, ""))), collapse = "")
fit.a[[19]]
afit.a[[19]]
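As an aside, the min-AIC lookup above can be done in one step with which.min(); a small sketch, using the fit.a list built earlier:

# extract each model's AIC and return the index of the smallest one
aics <- vapply(fit.a, function(s) s$aic, numeric(1))
which.min(aics)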

mlr - unable to use parameter 'importance' in ranger_permutation filter

When I try to use the filter 'ranger_permutation' and pass some parameters to the filter in makeFilterWrapper, I find that I cannot use the parameter 'importance' because that name clashes with another parameter, though I cannot tell where or why. I get the following error message when I run the code below:
Error in setHyperPars(learner = wl, ..., par.vals = par.vals):
Assertion on 'parameter settings' failed: Must have unique names, but element 4 is duplicated
I get this error when using other base learners as well. If I remove the parameter 'importance', the problem goes away, but not if I remove other parameters. Also, I can run ranger on its own, outside of a filter, with the same parameters.
library(survival)
#> Warning: package 'survival' was built under R version 3.5.3
library(mlr)
#> Loading required package: ParamHelpers
data(veteran)
set.seed(24601)

task_id = "VET"
vet.task <- makeSurvTask(id = task_id, data = veteran, target = c("time", "status"))
vet.task <- createDummyFeatures(vet.task)
outer = makeResampleDesc("CV", iters = 2, stratify = TRUE)

set.seed(24601, "L'Ecuyer")
cox.lrn <- makeLearner(cl = "surv.coxph", id = "Ranger_Cox", predict.type = "response")
filt <- makeFilterWrapper(cox.lrn, fw.method = "ranger_permutation", fw.abs = 5, cache = TRUE,
                          num.trees = 1000, splitrule = "maxstat", importance = "permutation")
bmr = benchmark(filt, vet.task, outer, list(cindex), show.info = TRUE, models = TRUE, keep.extract = TRUE)
#> Task: VET, Learner: Ranger_Cox.filtered
#> Resampling: cross-validation
#> Measures: cindex
#> Error in setHyperPars(learner = wl, ..., par.vals = par.vals): Assertion on 'parameter settings' failed: Must have unique names, but element 4 is duplicated.
Created on 2019-09-25 by the reprex package (v0.3.0)
When you set fw.method = "ranger_permutation", importance = "permutation" is already set internally. When you then pass importance = "permutation" yourself, the parameter is defined twice (duplicated), which is what the assertion complains about.
This can be seen in the mlr implementation: https://github.com/mlr-org/mlr/blob/master/R/Filter.R, where makeFilter(name = "ranger_permutation", ...) already defines importance = "permutation".
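The fix is therefore to drop the redundant argument from the wrapper; a sketch of the corrected call:

# importance = "permutation" is already supplied internally by the
# ranger_permutation filter, so it must not be passed again here
filt <- makeFilterWrapper(cox.lrn, fw.method = "ranger_permutation", fw.abs = 5,
                          cache = TRUE, num.trees = 1000, splitrule = "maxstat")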

mlr - parameter name clash with randomForestSRC_var.select filter using method argument

When I use the randomForestSRC_var.select filter and pass a method parameter to it (e.g. method = "vh" for variable hunting), I get a name clash because an internal function also uses a parameter called method. This was raised as an issue on GitHub and was said to have been resolved: https://github.com/mlr-org/mlr/issues/1066. I have also opened an issue on GitHub: https://github.com/mlr-org/mlr/issues/2639, but thought this might be a more appropriate forum in case it is not a bug but a fault on my part.
Here is my code:
library(survival)
#> Warning: package 'survival' was built under R version 3.5.3
library(mlr)
#> Loading required package: ParamHelpers
data(veteran)
set.seed(24601)

task_id = "VET"
vet.task <- makeSurvTask(id = task_id, data = veteran, target = c("time", "status"))
vet.task <- createDummyFeatures(vet.task)
tuning = makeResampleDesc("CV", iters = 2, stratify = TRUE)
outer = makeResampleDesc("CV", iters = 2, stratify = TRUE)
filt = makeFilterWrapper(
  makeLearner(cl = "surv.coxph", id = "cox.filt.rfsrc", predict.type = "response"),
  fw.method = "randomForestSRC_var.select",
  fw.abs = 4,
  cache = TRUE,
  ntree = 500,
  method = "vh"
)
bmr = benchmark(filt, vet.task, outer, list(cindex), show.info = TRUE, models = TRUE, keep.extract = FALSE)
#> Task: VET, Learner: cox.filt.rfsrc.filtered
#> Resampling: cross-validation
#> Measures: cindex
#> Error in (function (task, method = "randomForestSRC_importance", fval = NULL, : formal argument "method" matched by multiple actual arguments
Created on 2019-09-25 by the reprex package (v0.3.0)
If I rename the argument from method to "metho" to try to avoid the clash, I get a different error:
library(survival)
#> Warning: package 'survival' was built under R version 3.5.3
library(mlr)
#> Loading required package: ParamHelpers
data(veteran)
set.seed(24601)

task_id = "VET"
vet.task <- makeSurvTask(id = task_id, data = veteran, target = c("time", "status"))
vet.task <- createDummyFeatures(vet.task)
tuning = makeResampleDesc("CV", iters = 2, stratify = TRUE)
outer = makeResampleDesc("CV", iters = 2, stratify = TRUE)
filt = makeFilterWrapper(
  makeLearner(cl = "surv.coxph", id = "cox.filt.rfsrc", predict.type = "response"),
  fw.method = "randomForestSRC_var.select",
  fw.abs = 4,
  cache = TRUE,
  ntree = 500,
  metho = "vh"
)
bmr = benchmark(filt, vet.task, outer, list(cindex), show.info = TRUE, models = TRUE, keep.extract = FALSE)
#> Task: VET, Learner: cox.filt.rfsrc.filtered
#> Resampling: cross-validation
#> Measures: cindex
#> Error in -im[, 1L]: invalid argument to unary operator
Created on 2019-09-25 by the reprex package (v0.3.0)
It seems that this error comes from the line:
setNames(-im[, 1L], rownames(im))
in the RF min depth filter, and I assume it implies that the variable im, the result of the filter, is NULL (although I am not sure why).
Is there any way around this problem? Apologies for posting both here and on GH.
Fixed upstream in this Pull Request.

XGBoost - predict not exported in namespace

I am trying to tune an xgboost model with a multiclass dependent variable in R. I am using mlr to do this; however, I run into an error where xgboost doesn't have predict within its namespace, which I assume mlr wants to use. I have had a look online and see that other people have encountered similar issues (e.g. https://github.com/mlr-org/mlr/issues/935), but I can't entirely understand the answers that have been provided, and when I try to implement them the issue persists. My code is as follows:
# Tune parameters
# create tasks
train$result <- as.factor(train$result) # Needs to be a factor variable for makeClassifTask to work
test$result <- as.factor(test$result)
traintask <- makeClassifTask(data = train, target = "result")
testtask <- makeClassifTask(data = test, target = "result")
lrn <- makeLearner("classif.xgboost", predict.type = "response")
# Set learner values and number of rounds etc.
lrn$par.vals <- list(
  objective = "multi:softprob", # predict a probability for each class
  num_class = 3, # There are three outcome categories
  eval_metric = "merror",
  nrounds = 100L,
  eta = 0.1
)
# Set parameters to be tuned
params <- makeParamSet(
  makeDiscreteParam("booster", values = c("gbtree", "gblinear")),
  makeIntegerParam("max_depth", lower = 3L, upper = 10L),
  makeNumericParam("min_child_weight", lower = 1L, upper = 10L),
  makeNumericParam("subsample", lower = 0.5, upper = 1),
  makeNumericParam("colsample_bytree", lower = 0.5, upper = 1)
)
# Set resampling strategy
rdesc <- makeResampleDesc("CV", stratify = TRUE, iters = 5L)
# Search strategy
ctrl <- makeTuneControlRandom(maxit = 10L)
# parallelStartSocket(cpus = detectCores()) # Enable parallel processing
mytune <- tuneParams(learner = lrn
                     , task = traintask
                     , resampling = rdesc
                     , measures = acc
                     , par.set = params
                     , control = ctrl
                     , show.info = TRUE)
The specific error I get is:
Error: 'predict' is not an exported object from 'namespace:xgboost'
My package versions are:
packageVersion("xgboost")
[1] ‘0.6.4’
packageVersion("mlr")
[1] ‘2.8’
Would anyone know what I should do here?
Thanks in advance.
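One hedged pointer (an assumption based on the error message, not a verified fix): the error says predict is not exported from xgboost's namespace, and both mlr 2.8 and xgboost 0.6.4 are several releases old, so updating the two packages is a reasonable first step before anything else:

# update both packages, then confirm the new versions before re-running
install.packages(c("mlr", "xgboost"))
packageVersion("mlr")
packageVersion("xgboost")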

Using parallelMap Package with Custom Filter in mlr

I am working with mlr on a text classification task. I have written a custom filter as described here:
Create Custom Filters
The filter works as intended; however, when I try to utilise parallelization I receive the following error:
Exporting objects to slaves for mode socket: .mlr.slave.options
Mapping in parallel: mode = socket; cpus = 4; elements = 2.
Error in stopWithJobErrorMessages(inds, vcapply(result.list[inds], as.character)) :
Errors occurred in 2 slave jobs, displaying at most 10 of them:
00001: Error in parallel:::.slaveRSOCK() :
Assertion on 'method' failed: Must be element of set {'anova.test','carscore','cforest.importance','chi.squared','gain.ratio','information.gain','kruskal.test','linear.correlation','mrmr','oneR','permutation.importance','randomForest.importance','randomForestSRC.rfsrc','randomForestSRC.var.select','rank.correlation','relief','rf.importance','rf.min.depth','symmetrical.uncertainty','univariate','univariate.model.score','variance'}.
I'm assuming from the error that my custom filter needs to be an element of that set to stand a chance of working in parallel, but I haven't managed to work out (a) whether this is possible, and (b) if it is, how to go about it.
Thanks in advance for any help,
Azam
Added: Test Script
I can't let you see the actual script/data I'm working with due to sensitivity, but this example reproduces the error I see. Apart from the custom feature selection and dataset, the steps to set up the learner and evaluate it are as in my 'real' script. As in my real case, if you remove the parallelStartSocket() command the script runs as expected.
I should also add that I have successfully used (or at least received no errors from) parallel processing when tuning the hyper-parameters of an SVM with RBF kernel, the script being identical apart from the makeParamSet() definition.
library(parallelMap)
library(mlr)
library(kernlab)

makeFilter(
  name = "nonsense.filter",
  desc = "Calculates scores according to alphabetical order of features",
  pkg = "mlr",
  supported.tasks = c("classif", "regr", "surv"),
  supported.features = c("numerics", "factors", "ordered"),
  fun = function(task, nselect, decreasing = TRUE, ...) {
    feats = getTaskFeatureNames(task)
    imp = order(feats, decreasing = decreasing)
    names(imp) = feats
    imp
  }
)

# set up svm with rbf kernel
svm.lrn <- makeLearner("classif.ksvm", predict.type = "response")
# wrap learner with filter
svm.lrn <- makeFilterWrapper(svm.lrn, fw.method = "nonsense.filter")
# define feature selection parameters
ps.svm = makeParamSet(
  makeDiscreteParam("fw.abs", values = seq(2, 3, 1))
)
# define inner search and evaluation strategy
ctrl.svm = makeTuneControlGrid()
inner.svm = makeResampleDesc("CV", iters = 5, stratify = TRUE)
svm.lrn <- makeTuneWrapper(svm.lrn, resampling = inner.svm, par.set = ps.svm,
                           control = ctrl.svm)
# set up outer resampling
outer.svm <- makeResampleDesc("CV", iters = 10, stratify = TRUE)
# run it...
parallelStartSocket(2)
run.svm <- resample(svm.lrn, iris.task,
                    resampling = outer.svm, extract = getTuneResult)
parallelStop()
The problem is that makeFilter registers S3 methods, which are not available in separate R processes. You have two options to make this work: either simply use parallelStartMulticore(2) so that everything runs in the same R process, or tell parallelMap about the pieces that need to be present in the other R processes.
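The first option is a one-line change; a sketch (note that multicore mode is not available on Windows):

# Option 1: multicore workers inherit the current session, so the
# filter registered by makeFilter() is visible without extra exports
parallelStartMulticore(2)
run.svm <- resample(svm.lrn, iris.task,
                    resampling = outer.svm, extract = getTuneResult)
parallelStop()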
There are two parts to the latter: use parallelLibrary("mlr") to load mlr everywhere, and pull the definition of the filter out into a separate file that can be loaded using parallelSource(). For example:
filter.R:
makeFilter(
  name = "nonsense.filter",
  desc = "Calculates scores according to alphabetical order of features",
  pkg = "mlr",
  supported.tasks = c("classif", "regr", "surv"),
  supported.features = c("numerics", "factors", "ordered"),
  fun = function(task, nselect, decreasing = TRUE, ...) {
    feats = getTaskFeatureNames(task)
    imp = order(feats, decreasing = decreasing)
    names(imp) = feats
    imp
  }
)
main.R:
library(parallelMap)
library(mlr)
library(kernlab)

parallelStartSocket(2)
parallelLibrary("mlr")
parallelSource("filter.R")

# set up svm with rbf kernel
svm.lrn = makeLearner("classif.ksvm", predict.type = "response")
# wrap learner with filter
svm.lrn = makeFilterWrapper(svm.lrn, fw.method = "nonsense.filter")
# define feature selection parameters
ps.svm = makeParamSet(
  makeDiscreteParam("fw.abs", values = seq(2, 3, 1))
)
# define inner search and evaluation strategy
ctrl.svm = makeTuneControlGrid()
inner.svm = makeResampleDesc("CV", iters = 5, stratify = TRUE)
svm.lrn = makeTuneWrapper(svm.lrn, resampling = inner.svm, par.set = ps.svm,
                          control = ctrl.svm)
# set up outer resampling
outer.svm = makeResampleDesc("CV", iters = 10, stratify = TRUE)
# run it...
run.svm = resample(svm.lrn, iris.task, resampling = outer.svm, extract = getTuneResult)
parallelStop()
