mlr3 - Apply pre-processing to new data - r

I am using the mlr3verse package. Let's say I applied the following pre-processing to the training set used to train Learner:
preprocess <- po("scale", param_vals = list(center = TRUE, scale = TRUE)) %>>%
po("encode",param_vals = list(method = "one-hot"))
I would like to predict the label of new observations contained in a data frame pred (with the original variables) using the command predict(Learner, newdata = pred, predict_type="prob"). This won't work, since Learner was trained on centered, scaled, and one-hot encoded variables.
How can I apply the same pre-processing used on the training set to new data (features only, no response) in order to make predictions?

I am not 100% sure, but it seems you can wrap the new data in a new task and pass that task to predict. This page shows an example of combining mlr_pipeops and learner objects.
library(dplyr)
library(mlr3verse)
df_iris <- iris
df_iris$Petal.Width = df_iris$Petal.Width %>% cut( breaks = c(0,0.5,1,1.5,2,Inf))
task = TaskClassif$new(id = "my_iris",
backend = df_iris,
target = "Species")
train_set = sample(task$nrow, 0.8 * task$nrow)
test_set = setdiff(seq_len(task$nrow), train_set)
task_train = TaskClassif$new(id = "my_iris",
backend = df_iris[train_set,], # use train_set
target = "Species")
graph = po("scale", param_vals = list(center = TRUE, scale = TRUE)) %>>%
po("encode", param_vals = list(method = "one-hot")) %>>%
mlr_pipeops$get("learner",
learner = mlr_learners$get("classif.rpart"))
graph$train(task_train)
graph$pipeops$encode$state$outtasklayout # inspect model input types
graph$pipeops$classif.rpart$predict_type = "prob"
task_test = TaskClassif$new(id = "my_iris_test",
backend = df_iris[test_set,], # use test_set
target = "Species")
pred = graph$predict(task_test)
pred$classif.rpart.output$prob
# when you don't have a target variable, just make up one
df_test2 <- df_iris[test_set,]
df_test2$Species = sample(df_iris$Species, length(test_set)) # made-up target
task_test2 = TaskClassif$new(id = "my_iris_test",
backend = df_test2, # use test_set
target = "Species")
pred2= graph$predict(task_test2)
pred2$classif.rpart.output$prob

As suggested by @missuse, after building graph <- preprocess %>>% Learner and wrapping it with graph_learner <- GraphLearner$new(graph), I could predict directly on a raw data.frame with predict(TunedLearner, newdata = pred, predict_type="prob").
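The GraphLearner approach can be sketched roughly as follows. This is only a minimal sketch: iris and classif.rpart are stand-ins for the asker's data and Learner, and new_obs is a hypothetical name for the unlabeled data frame.

```r
library(mlr3verse)

# Stand-in data: iris with one factor feature, so that "encode" has work to do.
df = iris
df$Petal.Width = cut(df$Petal.Width, breaks = c(0, 0.5, 1, 1.5, 2, Inf))

set.seed(1)
train_rows = sample(nrow(df), 120)
task_train = TaskClassif$new(id = "iris_train",
  backend = df[train_rows, ],
  target = "Species")

# Pre-processing and learner live in one graph ...
graph = po("scale") %>>%
  po("encode", method = "one-hot") %>>%
  lrn("classif.rpart")

# ... which is wrapped so that it behaves like an ordinary learner.
graph_learner = GraphLearner$new(graph)
graph_learner$predict_type = "prob"
graph_learner$train(task_train)

# New observations: original features only, no target column.
new_obs = df[-train_rows, setdiff(names(df), "Species")]
pred = graph_learner$predict_newdata(new_obs)
head(pred$prob)
```

Because the scaling and encoding steps are trained together with the learner, their stored state is reused at prediction time, so the new data never has to be transformed by hand.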

Related

mlrCPO - Task conversion TOCPO

I would like to build a CPO for the mlr::makeClassificationViaRegression wrapper. The wrapper builds regression models that predict for the positive class whether a particular example belongs to it (1) or not (-1). It also calculates predicted probabilities using a softmax.
After reading the documentation and vignettes for makeCPOTargetOp, my attempt is as follows:
cpoClassifViaRegr = makeCPOTargetOp(
cpo.name = 'ClassifViaRegr',
dataformat = 'task', #Not sure - will this work if input is df with unknown target values?
# properties.data = c('numerics', 'factors', 'ordered', 'missings'), #Is this needed?
properties.adding = 'twoclass', #See https://mlrcpo.mlr-org.com/articles/a_4_custom_CPOs.html#task-type-and-conversion
properties.needed = character(0),
properties.target = c('classif', 'twoclass'),
task.type.out = 'regr',
predict.type.map = c(response = 'response', prob = 'response'),
constant.invert = TRUE,
cpo.train = function(data, target) {
getTaskDesc(data)
},
cpo.retrafo = function(data, target, control) {
cat(class(target))
td = getTaskData(target, target.extra = T)
target.name = paste0(control$positive, ".prob")
data = td$data
data[[target.name]] = ifelse(td$target == control$positive, 1, -1) # 'pos' was undefined; use the positive class from the task description
makeRegrTask(id = paste0(getTaskId(target), control$positive, '.'),
data = data,
target = target.name,
weights = target$weights,
blocking = target$blocking)
},
cpo.train.invert = NULL, #Since constant.invert = T
cpo.invert = function(target, control.invert, predict.type) {
if(predict.type == 'response') {
factor(ifelse(target > 0, control.invert$positive, control.invert$negative))
} else {
levs = c(control.invert$positive, control.invert$negative)
propVectorToMatrix(vnapply(target, function(x) exp(x) / sum(exp(x))), levs)
}
})
It seems to work as expected. The demo below shows that the inverted prediction is identical to the prediction obtained using the makeClassificationViaRegressionWrapper wrapper:
lrn = makeLearner("regr.lm")
# Wrapper -----------------------------------------------------------------
lrn2 = makeClassificationViaRegressionWrapper(lrn)
model = train(lrn2, sonar.task, subset = 1:140)
predictions = predict(model, newdata = getTaskData(sonar.task)[141:208, 1:60])
# CPO ---------------------------------------------------------------------
sonar.train = subsetTask(sonar.task, 1:140)
sonar.test = subsetTask(sonar.task, 141:208)
trafd = sonar.train %>>% cpoClassifViaRegr()
mod = train(lrn, trafd)
retr = sonar.test %>>% retrafo(trafd)
pred = predict(mod, retr)
invpred = invert(inverter(retr), pred)
identical(predictions$data$response, invpred$data$response)
The problem is that after the CPO has converted the task from twoclass to regr, there is no way for me to specify predict.type = 'prob'. In the case of the wrapper, the properties of the base regr learner are modified to accept predict.type = prob (see here). But the CPO is unable to modify the learner in this way, so how can I tell my model to return predicted probabilities instead of the predicted response?
I was thinking I could specify an include.prob parameter, i.e. cpoClassifViaRegr(include.prob = T). If set to TRUE, cpo.invert would return the predicted probabilities in addition to the predicted response. Would something like this work?

error : argument "x" is missing, with no default?

As I'm very new to XGBoost, I am trying to tune its parameters using the mlr package. After calling setHyperPars(), training with train() throws an error (specifically when I run the xgmodel line): Error in colnames(x) : argument "x" is missing, with no default. I can't work out what this error means. The code is below:
library(mlr)
library(dplyr)
library(caret)
library(xgboost)
set.seed(12345)
n=dim(mydata)[1]
id=sample(1:n, floor(n*0.6))
train=mydata[id,]
test=mydata[-id,]
traintask = makeClassifTask (data = train,target = "label")
testtask = makeClassifTask (data = test,target = "label")
#create learner
lrn = makeLearner("classif.xgboost",
predict.type = "response")
lrn$par.vals = list( objective="multi:softprob",
eval_metric="merror")
#set parameter space
params = makeParamSet( makeIntegerParam("max_depth",lower = 3L,upper = 10L),
makeIntegerParam("nrounds",lower = 20L,upper = 100L),
makeNumericParam("eta",lower = 0.1, upper = 0.3),
makeNumericParam("min_child_weight",lower = 1L,upper = 10L),
makeNumericParam("subsample",lower = 0.5,upper = 1),
makeNumericParam("colsample_bytree",lower = 0.5,upper = 1))
#set resampling strategy
configureMlr(show.learner.output = FALSE, show.info = FALSE)
rdesc = makeResampleDesc("CV",stratify = T,iters=5L)
# set the search optimization strategy
ctrl = makeTuneControlRandom(maxit = 10L)
# parameter tuning
set.seed(12345)
mytune = tuneParams(learner = lrn, task = traintask,
resampling = rdesc, measures = acc,
par.set = params, control = ctrl,
show.info = FALSE)
# build model using the tuned parameters
#set hyperparameters
lrn_tune = setHyperPars(lrn,par.vals = mytune$x)
#train model
xgmodel = train(learner = lrn_tune,task = traintask)
Could anyone tell me what's wrong?

You have to be very careful when loading multiple packages that may export functions with the same name. Here both caret and mlr include a train function. Moreover, the order of the library statements is significant: because caret is loaded after mlr, it masks functions with the same name from mlr (and possibly from every other package loaded previously), such as train.
In your case, where you obviously want to use the train function from mlr (and not from caret), you should state this explicitly in your code:
xgmodel = mlr::train(learner = lrn_tune,task = traintask)
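To check which package a bare function name currently resolves to, something like the following sketch can help. It assumes both mlr and caret are installed and attached in the same order as in the question.

```r
library(mlr)
library(caret)  # attached last, so its exports come first on the search path

# Every attached package that provides an object called `train`, in search order.
train_sources = find("train")
print(train_sources)

# Namespace that the bare name `train` resolves to right now.
resolved = environmentName(environment(train))
print(resolved)

# Namespace-qualified access is unaffected by the attach order.
qualified = environmentName(environment(mlr::train))
print(qualified)
```

If the bare name does not resolve to the package you intend, either qualify the call with `::` or change the order of the library statements.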

Tuning GLMNET using mlr3

MLR3 is really cool. I am trying to tune the regularisation parameter with the following search space
searchspace_glmnet_trafo = ParamSet$new(list(
ParamDbl$new("regr.glmnet.lambda", log(0.01), log(10))
))
searchspace_glmnet_trafo$trafo = function(x, param_set) {
x$regr.glmnet.lambda = (exp(x$regr.glmnet.lambda))
x
}
but get the error
Error in glmnet::cv.glmnet(x = data, y = target, family = "gaussian", :
Need more than one value of lambda for cv.glmnet
A minimum non-working example is below. Any help is greatly appreciated.
library(mlr3verse)
data("kc_housing", package = "mlr3data")
library(anytime)
dates = anytime(kc_housing$date)
kc_housing$date = as.numeric(difftime(dates, min(dates), units = "days"))
kc_housing$zipcode = as.factor(kc_housing$zipcode)
kc_housing$renovated = as.numeric(!is.na(kc_housing$yr_renovated))
kc_housing$has_basement = as.numeric(!is.na(kc_housing$sqft_basement))
kc_housing$id = NULL
kc_housing$price = kc_housing$price / 1000
kc_housing$yr_renovated = NULL
kc_housing$sqft_basement = NULL
lrnglm=lrn("regr.glmnet")
kc_housing
tsk = TaskRegr$new("sales", kc_housing, target = "price")
fencoder = po("encode", method = "treatment",
affect_columns = selector_type("factor"))
pipe = fencoder %>>% lrnglm
glearner = GraphLearner$new(pipe)
glearner$train(tsk)
searchspace_glmnet_trafo = ParamSet$new(list(
ParamDbl$new("regr.glmnet.lambda", log(0.01), log(10))
))
searchspace_glmnet_trafo$trafo = function(x, param_set) {
x$regr.glmnet.lambda = (exp(x$regr.glmnet.lambda))
x
}
inst = TuningInstance$new(
tsk, glearner,
rsmp("cv"), msr("regr.mse"),
searchspace_glmnet_trafo, term("evals", n_evals = 100)
)
gsearch = tnr("grid_search", resolution = 100)
gsearch$tune(inst)

As the error message says, cv.glmnet needs lambda to be a vector of values, not a single value.
I suggest not tuning cv.glmnet.
This algorithm does an internal 10-fold CV optimization and relies on its own sequence for lambda.
Consult the help page of the learner for more information.
You can apply your own tuning (tuning of param s, not lambda) on glmnet::glmnet(). However, this algorithm is not (yet) available for use with {mlr3}.
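Outside of mlr3, the difference between lambda and s can be sketched with glmnet directly. This is only an illustration under assumptions: mtcars stands in for the housing data, and s_grid is a hypothetical grid spanning the same range as the search space in the question (0.01 to 10).

```r
library(glmnet)

# Stand-in regression data: predict mpg from the remaining mtcars columns.
x = as.matrix(mtcars[, -1])
y = mtcars$mpg

# glmnet() fits the whole regularisation path, using its own lambda sequence.
fit = glmnet(x, y, family = "gaussian")
length(fit$lambda)

# `s` picks the penalty at prediction time. Values between the fitted lambdas
# are interpolated by default, so no refit is needed for each candidate.
s_grid = exp(seq(log(0.01), log(10), length.out = 5))
preds = predict(fit, newx = x, s = s_grid)
dim(preds)  # one row per observation, one column per value of s
```

A tuning loop would then compare held-out error across the columns of such a prediction matrix instead of refitting the model with a single lambda each time.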

MLR: How can I wrap the selection of specified features around the learner?

I would like to compare simple logistic regression models where each model considers only a specified set of features. I would like to compare these regression models on resamples of the data.
The R package mlr allows me to select columns at the task level using dropFeatures. The code would be something like:
full_task = makeClassifTask(id = "full task", data = my_data, target = "target")
reduced_task = dropFeatures(full_task, setdiff( getTaskFeatureNames(full_task), list_feat_keep))
Then I can do benchmark experiments where I have a list of tasks.
lrn = makeLearner("classif.logreg", predict.type = "prob")
rdesc = makeResampleDesc(method = "Bootstrap", iters = 50, stratify = TRUE)
bmr = benchmark(lrn, list(full_task, reduced_task), rdesc, measures = auc, show.info = FALSE)
How can I generate a learner that only considers a specified set of features? As far as I know, the filter and selection methods always apply some statistical procedure and do not allow selecting the features directly. Thank you!

The first solution is lazy and also not optimal, because the filter calculation is still carried out:
library(mlr)
task = sonar.task
sel.feats = c("V1", "V10")
lrn = makeLearner("classif.logreg", predict.type = "prob")
lrn.reduced = makeFilterWrapper(learner = lrn, fw.method = "variance", fw.abs = 2, fw.mandatory.feat = sel.feats)
bmr = benchmark(list(lrn, lrn.reduced), task, cv3, measures = auc, show.info = FALSE)
The second one uses the preprocessing wrapper to filter the data and should be the fastest solution and is also more flexible:
lrn.reduced.2 = makePreprocWrapper(
learner = lrn,
train = function(data, target, args) list(data = data[, c(sel.feats, target)], control = list()),
predict = function(data, target, args, control) data[, sel.feats]
)
bmr = benchmark(list(lrn, lrn.reduced.2), task, cv3, measures = auc, show.info = FALSE)

R package mlr Multilabel Text Classification: how to classify new data

I found this code in a tutorial about multilabel classification with package mlr.
library("mlr")
yeast = getTaskData(yeast.task)
labels = colnames(yeast)[1:14]
yeast.task = makeMultilabelTask(id = "multi", data = yeast, target = labels)
lrn.br = makeLearner("classif.rpart", predict.type = "prob")
lrn.br = makeMultilabelBinaryRelevanceWrapper(lrn.br)
mod = train(lrn.br, yeast.task, subset = 1:1500, weights = rep(1/1500, 1500))
pred = predict(mod, task = yeast.task, subset = 1:10)
pred = predict(mod, newdata = yeast[1501:1600,])
I understand the structure of the dataset yeast, but I do not understand how to use the code when I have new data that I want to classify, because then I wouldn't have any TRUE or FALSE values for the labels. I would have some training data with the same structure as yeast, but for my new data the columns 1:14 would be missing.
Am I misunderstanding something? If not, how can I use the code correctly?
Edit:
Here's sample code showing how I would use it:
library("tm")
train.data = data.frame("id" = c(1,1,2,3,4,4), "text" = c("Monday is nice weather.", "Monday is nice weather.", "Dogs are cute.", "It is very rainy.", "My teacher is angry.", "My teacher is angry."), "label" = c("label1", "label2", "label3", "label1", "label4", "label5"))
test.data = data.frame("id" = c(5,6), "text" = c("Next Monday I will meet my teacher.", "Dogs do not like rain."))
train.data$text = as.character(train.data$text)
train.data$id = as.character(train.data$id)
train.data$label = as.character(train.data$label)
test.data$text = as.character(test.data$text)
test.data$id = as.character(test.data$id)
### Bring training data into structure
train.data$label = make.names(train.data$label)
labels = unique(train.data$label)
# DocumentTermMatrix for all texts
texts = unique(c(train.data$text, test.data$text))
docs <- Corpus(VectorSource(unique(texts)))
terms <-DocumentTermMatrix(docs)
m <- as.data.frame(as.matrix(terms))
# Logical columns for labels
test = data.frame("id" = train.data$id, "topic"=train.data$label)
test2 = as.data.frame(unclass(table(test)))
test2[,c(1:ncol(test2))] = as.logical(unlist(test2[,c(1:ncol(test2))]))
rownames(test2) = unique(test$id)
# Bind columns from dtm
termsDf = cbind(test2, m[1:nrow(test2),])
names(termsDf) = make.names(names(termsDf))
### Create Multilabel Task
classify.task = makeMultilabelTask(id = "multi", data = termsDf, target = labels)
### Now the model
lrn.br = makeLearner("classif.rpart", predict.type = "prob")
lrn.br = makeMultilabelBinaryRelevanceWrapper(lrn.br)
mod = train(lrn.br, classify.task)
### How can I predict for test.data?
So the problem is that I do not have any labels for test.data, because they are exactly what I would like to compute.
Edit2:
When I simply use
names(m) = make.names(names(m))
pred = predict(mod, newdata = m[(nrow(termsDf)+1):(nrow(termsDf)+nrow(test.data)),])
the result is the same for both texts and not at all what I would expect.
