How to use the multiclass.au1p measure in mlr (R)

I am trying to use multiclass.au1p measure in mlr package. It gave me an error saying
Error in FUN(X[[i]], ...) : Measure multiclass.au1p requires
predict type to be: 'prob'!
When I tried to set the predict type to prob, it gave me an error similar to the following for any classifier I used:
Error in setPredictType.Learner(learner, predict.type) : Trying to
predict probs, but classif.xgboost.multiclass does not support that!
How can I resolve this?
Following is my code:
trainTask <- makeClassifTask(data = no_out_pso, target = "response_grade")
Clslearn = makeLearner("classif.xgboost", predict.type = "prob")
Clslearn = makeMulticlassWrapper(Clslearn, mcw.method = "onevsrest")
Clslearn = setPredictType(Clslearn, "prob")
rdesc = makeResampleDesc("CV", iters = 3)
r = resample(Clslearn, trainTask, rdesc, measures = list(mlr::acc, mlr::multiclass.au1p, mlr::multiclass.au1u))
print(r)

It does not work with makeMulticlassWrapper because that wrapper does not support probability prediction (at the moment). I also get an error when I try to set the predict type to prob in your code.
Code that works:
Clslearn = makeLearner("classif.xgboost", predict.type = "prob")
rdesc = makeResampleDesc("CV", iters = 3)
r = resample(Clslearn, iris.task, rdesc, measures = list(mlr::acc, mlr::multiclass.au1p, mlr::multiclass.au1u))

You need to use a classifier that supports predicting probabilities. You can get a list with the listLearners() function:
listLearners(properties = "prob")
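For instance, to narrow the listing down to classification learners that handle multiclass tasks and can predict probabilities (a sketch; property names as used by current mlr releases):
listLearners("classif", properties = c("prob", "multiclass"))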

Related

R Xgboost validation error as stopping metric

I am training an xgboost binary classification model with a training and a validation dataset.
params5 <- list(booster = "gbtree", objective = "binary:logistic",
                eta = 0.0001, gamma = 0.5, max_depth = 15, min_child_weight = 1,
                subsample = 0.6, colsample_bytree = 0.4, seed = 2222)
xgb_MOD5 <- xgb.train(params = params5, data = dtrain, nrounds = 4000,
                      watchlist = list(validation = dvalid, train = dtrain),
                      print_every_n = 30, early_stopping_rounds = 100,
                      maximize = F, serialize = TRUE)
It automatically picks the train error as the stopping metric, which results in the model continuing to train while overfitting.
Multiple eval metrics are present. Will use train_error for early stopping.
Will train until train_error hasn't improved in 100 rounds.
How do I assign the validation error as the stopping metric?
I do not use the R binding of xgboost and the R package documentation is not specific about this. However, the Python API documentation (see the early_stopping_rounds argument) has a relevant clarification on this issue:
Requires at least one item in evals. If there’s more than one, will use the last.
Here, evals is the list of samples on which the metrics will be evaluated, i.e. it is analogous to your watchlist argument. So I would guess that you just need to swap the order of the items in the list provided as that argument.
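In terms of the question's call, that would mean putting the validation sample last in the watchlist, e.g. (untested sketch):
watchlist = list(train = dtrain, validation = dvalid)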
Thanks @abhiieor for the solution. Adding to that, from what I observed, when we use only the validation set in the watchlist:
xgb_MOD5 <- xgb.train(params = params5, data = dtrain, nrounds = 400,
                      watchlist = list(validation = dvalid),
                      print_every_n = 30, early_stopping_rounds = 100,
                      maximize = F, serialize = TRUE)
log results while it runs:
[1] validation-error:0.222037
Will train until validation_error hasn't improved in 100 rounds.
[31] validation-error:0.201712
[61] validation-error:0.201635
And if we want to see both the train error and the validation error while it runs, adding the validation set as the second element of the watchlist does this while still using the validation error as the stopping metric.
xgb_MOD5 <- xgb.train(params = params5, data = dtrain, nrounds = 400,
                      watchlist = list(train = dtrain, validation = dvalid),
                      print_every_n = 30, early_stopping_rounds = 100,
                      maximize = F, serialize = TRUE)
[1] train-error:0.202131 validation-error:0.232341
Multiple eval metrics are present. Will use validation_error for early stopping.
Will train until validation_error hasn't improved in 100 rounds.
[31] train-error:0.174278 validation-error:0.202871
[61] train-error:0.173909 validation-error:0.202288

Error when predicting with mlr

I tried to train an h2o model using the following code and make a prediction for new data, but it leads to an error. How can I avoid this error?
library(mlr)
a <- data.frame(y = factor(c(1,1,1,1,1,1,1,1,0,0,1,0)),
                x1 = rep(c("a","b","c"), times = c(6,3,3)))
aTask <- makeClassifTask(data = a, target = "y", positive = "1")
h2oLearner <- makeLearner("classif.h2o.deeplearning", predict.type = "prob")
model <- train(h2oLearner, aTask)
b <- data.frame(x1 = rep(c("a","b","c"), times = c(3,5,4)))
pred <- predict(model, newdata = b)
This leads to the following error:
Error in checkPredictLearnerOutput(.learner, .model, p) :
predictLearner for classif.h2o.deeplearning has returned not the class
levels as column names: p0,p1
If I change predict.type to "response" it works. So how can I predict probabilities?
This bug was fixed in this commit and will be in the next release. Until then, you can install the GitHub version:
devtools::install_github("mlr-org/mlr")
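Once the updated version is installed, the probability prediction should go through, and the class probabilities can be pulled from the prediction object, e.g. (a sketch using mlr's getPredictionProbabilities; the actual fix is just the package update):
pred <- predict(model, newdata = b)
head(getPredictionProbabilities(pred))  # probabilities of the positive class "1"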

XGBoost - predict not exported in namespace

I am trying to tune an xgboost model with a multiclass dependent variable in R. I am using mlr to do this; however, I run into an error saying that predict is not exported from the xgboost namespace, which I assume mlr wants to use. I have had a look online and see that other people have encountered similar issues (e.g. https://github.com/mlr-org/mlr/issues/935), but I can't entirely understand the answers that have been provided, and when I try to implement them the issue persists. My code is as follows:
# Tune parameters
# create tasks
train$result <- as.factor(train$result) # needs to be a factor for makeClassifTask to work
test$result <- as.factor(test$result)
traintask <- makeClassifTask(data = train, target = "result")
testtask <- makeClassifTask(data = test, target = "result")
lrn <- makeLearner("classif.xgboost", predict.type = "response")
# Set learner values, number of rounds etc.
lrn$par.vals <- list(
  objective = "multi:softprob", # return a predicted probability for each class
  num_class = 3,                # there are three outcome categories
  eval_metric = "merror",
  nrounds = 100L,
  eta = 0.1
)
# Set parameters to be tuned
params <- makeParamSet(
  makeDiscreteParam("booster", values = c("gbtree","gblinear")),
  makeIntegerParam("max_depth", lower = 3L, upper = 10L),
  makeNumericParam("min_child_weight", lower = 1L, upper = 10L),
  makeNumericParam("subsample", lower = 0.5, upper = 1),
  makeNumericParam("colsample_bytree", lower = 0.5, upper = 1)
)
# Set resampling strategy
rdesc <- makeResampleDesc("CV",stratify = T,iters=5L)
# search strategy
ctrl <- makeTuneControlRandom(maxit = 10L)
#parallelStartSocket(cpus = detectCores()) # Enable parallel processing
mytune <- tuneParams(learner = lrn
,task = traintask
,resampling = rdesc
,measures = acc
,par.set = params
,control = ctrl
,show.info = T)
The specific error I get is:
Error: 'predict' is not an exported object from 'namespace:xgboost'
My package versions are:
packageVersion("xgboost")
[1] ‘0.6.4’
packageVersion("mlr")
[1] ‘2.8’
Would anyone know what I should do here?
Thanks in advance.

Using parallelMap Package with Custom Filter in mlr

I am working with mlr on a text classification task. I have written a custom filter as described here:
Create Custom Filters
The filter works as intended; however, when I try to utilise parallelization I receive the following error:
Exporting objects to slaves for mode socket: .mlr.slave.options
Mapping in parallel: mode = socket; cpus = 4; elements = 2.
Error in stopWithJobErrorMessages(inds, vcapply(result.list[inds], as.character)) :
Errors occurred in 2 slave jobs, displaying at most 10 of them:
00001: Error in parallel:::.slaveRSOCK() :
Assertion on 'method' failed: Must be element of set {'anova.test','carscore','cforest.importance','chi.squared','gain.ratio','information.gain','kruskal.test','linear.correlation','mrmr','oneR','permutation.importance','randomForest.importance','randomForestSRC.rfsrc','randomForestSRC.var.select','rank.correlation','relief','rf.importance','rf.min.depth','symmetrical.uncertainty','univariate','univariate.model.score','variance'}.
I'm assuming from the error that my custom filter needs to be an element of that set to stand a chance of working in parallel, but I haven't managed to work out (a) whether this is possible, and (b) if it is, how to go about it.
Thanks in advance for any help,
Azam
Added: Test Script
I can't let you see the actual script/data I'm working with due to sensitivity, but this example reproduces the error I see. Apart from the custom feature selection and data set, the steps to set up the learner and evaluate it are the same as in my 'real' script. As in my real case, if you remove the parallelStartSocket() command, the script runs as expected.
I should also add that I have successfully used parallel processing (or at least I received no errors) when tuning the hyper-parameters of an SVM with RBF kernel; the script was identical apart from the makeParamSet() definition.
library(parallelMap)
library(mlr)
library(kernlab)
makeFilter(
  name = "nonsense.filter",
  desc = "Calculates scores according to alphabetical order of features",
  pkg = "mlr",
  supported.tasks = c("classif", "regr", "surv"),
  supported.features = c("numerics", "factors", "ordered"),
  fun = function(task, nselect, decreasing = TRUE, ...) {
    feats = getTaskFeatureNames(task)
    imp = order(feats, decreasing = decreasing)
    names(imp) = feats
    imp
  }
)
# set up svm with rbf kernel
svm.lrn <- makeLearner("classif.ksvm",predict.type = "response")
# wrap learner with filter
svm.lrn <- makeFilterWrapper(svm.lrn, fw.method = "nonsense.filter")
# define feature selection parameters
ps.svm = makeParamSet(
makeDiscreteParam("fw.abs", values = seq(2, 3, 1))
)
# define inner search and evaluation strategy
ctrl.svm = makeTuneControlGrid()
inner.svm = makeResampleDesc("CV", iters = 5, stratify = TRUE)
svm.lrn <- makeTuneWrapper(svm.lrn, resampling = inner.svm, par.set = ps.svm,
control = ctrl.svm)
# set up outer resampling
outer.svm <- makeResampleDesc("CV", iters = 10, stratify = TRUE)
# run it...
parallelStartSocket(2)
run.svm <- resample(svm.lrn, iris.task,
resampling = outer.svm, extract = getTuneResult)
parallelStop()
The problem is that makeFilter registers S3 methods, which are not available in separate R processes. You have two options to make this work: either simply use parallelStartMulticore(2) so that everything runs in the same R process, or tell parallelMap about the pieces that need to be present in the other R processes.
There are two parts to the latter: use parallelLibrary("mlr") to load mlr everywhere, and pull the definition of the filter out into a separate file that can be loaded using parallelSource(). For example:
filter.R:
makeFilter(
  name = "nonsense.filter",
  desc = "Calculates scores according to alphabetical order of features",
  pkg = "mlr",
  supported.tasks = c("classif", "regr", "surv"),
  supported.features = c("numerics", "factors", "ordered"),
  fun = function(task, nselect, decreasing = TRUE, ...) {
    feats = getTaskFeatureNames(task)
    imp = order(feats, decreasing = decreasing)
    names(imp) = feats
    imp
  }
)
main.R:
library(parallelMap)
library(mlr)
library(kernlab)
parallelStartSocket(2)
parallelLibrary("mlr")
parallelSource("filter.R")
# set up svm with rbf kernel
svm.lrn = makeLearner("classif.ksvm",predict.type = "response")
# wrap learner with filter
svm.lrn = makeFilterWrapper(svm.lrn, fw.method = "nonsense.filter")
# define feature selection parameters
ps.svm = makeParamSet(
makeDiscreteParam("fw.abs", values = seq(2, 3, 1))
)
# define inner search and evaluation strategy
ctrl.svm = makeTuneControlGrid()
inner.svm = makeResampleDesc("CV", iters = 5, stratify = TRUE)
svm.lrn = makeTuneWrapper(svm.lrn, resampling = inner.svm, par.set = ps.svm,
control = ctrl.svm)
# set up outer resampling
outer.svm = makeResampleDesc("CV", iters = 10, stratify = TRUE)
# run it...
run.svm = resample(svm.lrn, iris.task, resampling = outer.svm, extract = getTuneResult)
parallelStop()
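If you are not on Windows, the first option from the answer is even simpler: start forked workers instead of socket workers, so the S3 methods registered by makeFilter stay visible to them. A sketch, reusing the objects from the question's script:
# forked workers share the master R process's environment
parallelStartMulticore(2)
run.svm = resample(svm.lrn, iris.task, resampling = outer.svm, extract = getTuneResult)
parallelStop()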

How to incorporate logLoss in caret

I'm attempting to incorporate logLoss as the performance measure used when tuning randomForest (and other classifiers) by way of caret (instead of the default options of Accuracy or Kappa).
The first R script executes without error using defaults. However, I get:
Error in { : task 1 failed - "unused argument (model = method)"
when using the second script.
The function logLoss(predict(rfModel, test[,-c(1,95)], type = "prob"), test[,95]) works when applied to a separately trained randomForest model.
The dataframe has 100+ columns and 10,000+ rows. All elements are numeric outside of the 9-level categorical "target" at col=95. A row id is located in col=1.
Unfortunately, I'm not correctly grasping the guidance provided by http://topepo.github.io/caret/training.html, nor having much luck via Google searches.
Your help is greatly appreciated.
Working R script:
fitControl = trainControl(method = "repeatedcv", number = 10, repeats = 10)
rfGrid = expand.grid(mtry = c(1,9))
rfFit = train(target ~ ., data = train[,-1], method = "rf", trControl = fitControl,
              verbose = FALSE, tuneGrid = rfGrid)
Not working R script:
logLoss = function(data, lev = NULL, method = NULL) {
  lLoss = 0
  epp = 10^-15
  for (i in 1:nrow(data)) {
    index = as.numeric(lev[i])
    p = max(min(data[i, index], 1 - epp), epp)
    lLoss = lLoss - log(p)
  }
  lLoss = lLoss / nrow(data)
  names(lLoss) = c("logLoss")
  lLoss
}
fitControl = trainControl(method = "repeatedcv", number = 10, repeats = 10, summaryFunction = logLoss)
rfGrid = expand.grid(mtry = c(1,9))
rfFit = train(target ~ ., data = trainBal[,-1], method = "rf", trControl = fitControl,
              verbose = FALSE, tuneGrid = rfGrid)
I think you should set summaryFunction=mnLogLoss in trainControl and metric="logLoss" in train (I found it here). Like this:
# load libraries
library(caret)
# load the dataset
data(iris)
# prepare resampling method
control <- trainControl(method="cv", number=5, classProbs=TRUE, summaryFunction=mnLogLoss)
set.seed(7)
fit <- train(Species~., data=iris, method="rf", metric="logLoss", trControl=control)
# display results
print(fit)
Your argument name is not correct (hence the "unused argument (model = method)" error): the webpage says that the last function argument should be called model, not method.
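A minimal, hypothetical rewrite of the question's summary function with the renamed argument might look like this (it looks up the probability column by the observed class in data$obs, which I assume is the intended behaviour, and it requires classProbs = TRUE; untested sketch):
logLossSummary = function(data, lev = NULL, model = NULL) {  # last argument must be named 'model'
  eps = 1e-15
  # probability the model assigned to the observed class of each row
  p = sapply(seq_len(nrow(data)), function(i) data[i, as.character(data$obs[i])])
  p = pmax(pmin(p, 1 - eps), eps)
  out = -mean(log(p))
  names(out) = "logLoss"
  out
}
fitControl = trainControl(method = "repeatedcv", number = 10, repeats = 10,
                          classProbs = TRUE, summaryFunction = logLossSummary)
train() would then need metric = "logLoss" and maximize = FALSE so that caret minimises the measure.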

Resources