I followed the mlr3 documentation on imputing data with pipelines. However, the model that I have trained does not allow predictions if a column is NA.
Do you have any idea why it doesn't work?
train step
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
data("mtcars", package = "datasets")
data = mtcars[, 1:3]
str(data)
task_mtcars = TaskRegr$new(id="cars", backend = data, target = "mpg")
imp_missind = po("missind")
imp_num = po("imputehist", param_vals =list(affect_columns = selector_type("numeric")))
scale = po("scale")
learner = lrn('regr.ranger')
graph = po("copy", 2) %>>%
gunion(list(imp_num %>>% scale,imp_missind)) %>>%
po("featureunion") %>>%
po(learner)
graph$plot()
graphlearner = GraphLearner$new(graph)
# train the pipeline before predicting
graphlearner$train(task_mtcars)
predict step
data = task_mtcars$data()[12:12,]
data[1:1, cyl:=NA]
predict(graphlearner, data)
The error is
Error: Missing data in columns: cyl.
The example in the mlr3gallery seems to work for your case, so you basically
have to switch the order of imputehist and missind.
Another approach would be to set the missind's which hyperparameter to "all" in order to enforce the creation of an indicator for every column.
This is actually a bug, where missind returns the full task if trained on data
with no missings (which in turn then overwrites the imputed values).
Thanks a lot for spotting it. I am trying to fix it here PR
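The second workaround mentioned above (setting missind's which hyperparameter to "all") could look roughly like the sketch below. It reuses the pipeline from the question; whether it avoids the error depends on the installed mlr3pipelines version, so treat it as an illustration rather than a verified fix.

```r
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)

task_mtcars = TaskRegr$new(id = "cars", backend = mtcars[, 1:3], target = "mpg")

# which = "all" asks for an indicator column for every feature,
# not only for features that had missing values during training.
imp_missind = po("missind", param_vals = list(which = "all"))
imp_num = po("imputehist", param_vals = list(affect_columns = selector_type("numeric")))

graph = po("copy", 2) %>>%
  gunion(list(imp_num %>>% po("scale"), imp_missind)) %>>%
  po("featureunion") %>>%
  po(lrn("regr.ranger"))

graphlearner = GraphLearner$new(graph)
graphlearner$train(task_mtcars)

# One row with a missing value in cyl, as in the question.
newdata = as.data.frame(task_mtcars$data(rows = 12))
newdata$cyl = NA_real_
prediction = graphlearner$predict_newdata(newdata)
print(prediction)
```

The other suggestion, putting the imputation and the indicator in the order used by the mlr3gallery example, changes only how the graph is wired, not the learner.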
Related
I have two datasets, og.data and newdata.df. I have matched their features, and I want to use a feature from og.data to train a model so that I can identify cases of this class in newdata.df. I am using the randomForest package in R; its documentation is here: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
split <- sample.split(og.data$class_label, SplitRatio = 0.7)
training_set = subset(og.data$class_label, split == TRUE)
test_set = subset(og.data$class_label, split == FALSE)
rf.classifier.object = randomForest(x = training_set[-1],
y = training_set$Engramcell,
ntree = 500)
I then use the test set to calculate the AUC, visualize ROC, precision, recall etc etc.
I do that using prediction probability generated like so...
predictions.df <- as.data.frame(predict(rf.classifier.object,
test_set,
type = "prob")
)
All is good so far. I then try to use the classifier I've trained on new data, and now I encounter a problem because the new data does not contain the class label feature. This is annoying, as the entire purpose of training the classifier is to label this new data.
predictions.df <- as.data.frame(predict(rf.classifier.object,
newdata.df,
type = "prob")
)
Please note the error has different variable names simply because I changed the code to make it more general for readability.
Error in predict.randomForest(rf.classifier.object, newdata.df, :
variables in the training data missing in newdata
As per this stack post, predict.randomForest(), called here as predict(), uses the row names of the feature importance to make its predictions. When I checked with a search of the feature names, I found that it is in fact the class label that prevents me from making the prediction, as I show below.
# > rownames(rf.classifier.object$importance)[!(rownames(rf.classifier.object$importance) %in% colnames(newdata) )]
# [1] "class_label"
It is not clear to me what I should change in my script so that the classifier can be used on data other than the test set. I have followed the instructions exactly, and it seems like a bad design choice to have made the function this way. The class label should not be used for calculating feature importance at all and should not even be considered a feature, in my opinion.
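One possible reading of the error is that the label column ended up among the predictors passed as x, for example because training_set[-1] drops a column by position rather than by name. The sketch below, which uses small hypothetical stand-ins for og.data and newdata.df, selects the predictors by name and checks that the new data contains all of them. The randomForest call runs only if the package is installed.

```r
set.seed(42)

# Hypothetical stand-ins for og.data and newdata.df; only og.data has the label.
og.data <- data.frame(
  f1 = rnorm(20),
  class_label = factor(rep(c("a", "b"), 10)),
  f2 = rnorm(20)
)
newdata.df <- data.frame(f1 = rnorm(5), f2 = rnorm(5))

label_col <- "class_label"
feature_cols <- setdiff(names(og.data), label_col)

# Select predictors by name so the label can never end up in x.
x_train <- og.data[, feature_cols, drop = FALSE]
y_train <- og.data[[label_col]]

# Predictors the model would expect but newdata.df lacks (ideally none).
missing_cols <- setdiff(feature_cols, names(newdata.df))
print(missing_cols)

if (requireNamespace("randomForest", quietly = TRUE)) {
  rf <- randomForest::randomForest(x = x_train, y = y_train, ntree = 50)
  probs <- predict(rf, newdata.df[, feature_cols, drop = FALSE], type = "prob")
  print(dim(probs))
}
```

If missing_cols is empty and the label is not in x_train, predict() has no reason to look for the label in the new data.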
I am using mlr3 for a simple classification model. But I encounter errors with several different models which mlr3 gives access to. Here I provide one reprex to illustrate the problem:
library(data.table)
library(mlr3extralearners)
library(mlr3)
library(mlr3learners)
library(mlr3tuning)
library(mlr3pipelines)
library(mlr3filters)
#Make example data
DT = data.table(target = c(0,0,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0),pred = c(0.05767878,0.05761652,0.06508700,0.06531820,0.07050699,0.07098812,0.07150984,0.07845767,0.07891081,0.07873572,0.08035471,0.08039300,0.08040480,0.08040480,0.08472619,0.08489135,0.08517742,0.08612768,0.08728675,0.08790671,0.08913434,0.08911522,0.09036788,0.09147726,0.09154964,0.09236259,0.09299088,0.09499589,0.09748171,0.09756818,0.09756818,0.09861013,0.10193147,0.10211796,0.10277547,0.10379659,0.10393602,0.10397469,0.10364373,0.10368016,0.10362235,0.10387504,0.10385431,0.10387288,0.10423139,0.10483475,0.10570517,0.10573617,0.10569312,0.10572714,0.10597040,0.10573924,0.10551367,0.10573499,0.10602269,0.10765947,0.10721005,0.10703524,0.10824609,0.10933141,0.10936178,0.10957693,0.10874663,0.10875077))
DT[, target := as.factor(target)] #Target Variable as factor is required
task <- TaskClassif$new(id='pizza', backend = DT, target = "target", positive = '1')
#Select an algo and a filter
randF = lrn("classif.randomForest", predict_type = "prob")
filter1 = mlr_pipeops$get("filter", filter = mlr3filters::FilterVariance$new(),param_vals = list(filter.cutoff = 0.05))
#Construct a simple graph
graph = filter1 %>>%
PipeOpLearner$new(lrn("classif.randomForest"), id = "randF")
#graph$plot()
#Construct a learner and train it
learner = GraphLearner$new(graph)
learner$train(task)
This gives the error:
'Error in reformulate(attributes(Terms)$term.labels) :
'termlabels' must be a character vector of length at least one'
I have the impression that the task object of mlr3 somehow doesn't interact well with the graph. The error then comes from the randomForest classifier, but to me it seems like the data was not properly handed over to it. That's just a theory of mine, though. I may alter the question if it's not clear enough.
Your filter is removing the only feature, and feature filtering is not necessary if there is only a single feature.
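To see why the filter removes the feature, it can help to compute the variance by hand. The sketch below uses a handful of the pred values from the question as a stand-in for the full column and assumes that the variance filter keeps only features whose score exceeds filter.cutoff.

```r
# A few of the `pred` values from the question, standing in for the full column.
pred <- c(0.05767878, 0.06508700, 0.07845767, 0.08472619,
          0.09499589, 0.10379659, 0.10957693)

# Same cutoff as in the question's filter.cutoff.
cutoff <- 0.05

# Variance of the single feature; values this close together give a tiny variance.
score <- var(pred)
print(score)

# FALSE here means the feature would not pass the cutoff.
print(score > cutoff)
```

With no features left, the learner receives an empty formula, which is consistent with the reformulate() error.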
I am facing a difficulty with filtering out the least important variables in my model. I received a set of data with more than 4,000 variables, and I have been asked to reduce the number of variables getting into the model.
I did try already two approaches, but I have failed twice.
The first thing I tried was to manually check variable importance after the modelling and based on that removing non significant variables.
# reproducible example
library(dplyr) # provides %>% and mutate()
data <- iris
# artificial class imbalancing
data <- iris %>%
mutate(Species = as.factor(ifelse(Species == "virginica", "1", "0")))
Everything works fine while using simple Learner:
# creating Task
task <- TaskClassif$new(id = "score", backend = data, target = "Species", positive = "1")
# creating Learner
lrn <- lrn("classif.xgboost")
# setting scoring as prediction type
lrn$predict_type = "prob"
lrn$train(task)
lrn$importance()
Petal.Width Petal.Length
0.90606304 0.09393696
The issue is that the data is highly imbalanced, so I decided to use GraphLearner with PipeOp operator to undersample majority group which is then passed to AutoTuner:
I did skip some part of the code which I believe is not important for this case, things like search space, terminator, tuner etc.
# undersampling
po_under <- po("classbalancing",
id = "undersample", adjust = "major",
reference = "major", shuffle = FALSE, ratio = 1 / 2)
# combine learner with pipeline graph
lrn_under <- GraphLearner$new(po_under %>>% lrn)
# setting the autoTuner
at <- AutoTuner$new(
learner = lrn_under,
resampling = resample,
measure = measure,
search_space = ps_under,
terminator = terminator,
tuner = tuner
)
at$train(task)
The problem right now is that, despite the importance property still being visible within at, $importance() is unavailable.
> at
<AutoTuner:undersample.classif.xgboost.tuned>
* Model: list
* Parameters: list()
* Packages: -
* Predict Type: prob
* Feature types: logical, integer, numeric, character, factor, ordered, POSIXct
* Properties: featureless, importance, missings, multiclass, oob_error, selected_features, twoclass, weights
So I decided to change my approach and try to add filtering into a Learner. And that's where I've failed even more. I started by looking into this mlr3book chapter: https://mlr3book.mlr-org.com/fs.html. I tried to add importance = "impurity" to the Learner just like in the book, but it yielded an error.
> lrn <- lrn("classif.xgboost", importance = "impurity")
Błąd w poleceniu 'instance[[nn]] <- dots[[i]]':
nie można zmienić wartości zablokowanego połączenia dla 'importance'
In English, this reads:
Error in instance[[nn]] <- dots[[i]] : cannot change value of locked binding for 'importance'
I also tried to work around this with PipeOp filtering, but it also failed miserably. I believe I won't be able to do it without importance = "impurity".
So my question is, is there a way to achieve what I am aiming for?
In addition, I would be very grateful if someone could explain why filtering by importance is possible before modeling. Shouldn't it be based on the model result?
The reason why you can't access $importance of the at variable is that it is an AutoTuner, which does not directly offer variable importance and only "wraps" around the actual Learner being tuned.
The trained GraphLearner is saved inside your AutoTuner under $learner:
# get the trained GraphLearner, with tuned hyperparameters
graphlearner <- at$learner
This object also does not have $importance(). (Theoretically, a GraphLearner could contain more than one Learner and then it wouldn't even know which importance to give!).
Getting the actual LearnerClassifXgboost object is a bit tedious, unfortunately, because of shortcomings in the "R6" object system used by mlr3:
1. Get the untrained Learner object.
2. Get the trained state of the Learner and put it into that object.
# get the untrained Learner
xgboostlearner <- graphlearner$graph$pipeops$classif.xgboost$learner
# put the trained model into the Learner
xgboostlearner$state <- graphlearner$model$classif.xgboost
Now the importance can be queried
xgboostlearner$importance()
The example from the book that you link to does not work in your case because the book uses the ranger Learner, while you are using xgboost. importance = "impurity" is specific to ranger.
Background
I'm modeling and predicting with the mlr3 package in R. I'm working with one big data set that consists of test and train sets. Test and train sets are indicated by an indicator column (in code: test_or_train).
Goal
Batch train all learners with the train rows indicated by the test_or_train column in the data set.
Batch predict the rows designated by the 'test' in the test_or_train column with the respective trained learner.
Code
Placeholder data set with test-train indicator column. (In the actual data the train-test split is not artificial.)
Two tasks (in the actual code tasks are distinct and there are more.)
library(readr)
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(reprex)
library(caret)
# Data
urlfile = 'https://raw.githubusercontent.com/shudras/office_data/master/office_data.csv'
data = read_csv(url(urlfile))[-1]
## Create artificial partition to test and train sets
art_part = createDataPartition(data$imdb_rating, list=FALSE)
train = data[art_part,]
test = data[-art_part,]
## Add test-train indicators
train$test_or_train = 'train'
test$test_or_train = 'test'
## Data set that I want to work / am working with
data = rbind(test, train)
# Create two tasks (Here the tasks are the same but in my data set they differ.)
task1 =
TaskRegr$new(
id = 'office1',
backend = data,
target = 'imdb_rating'
)
task2 =
TaskRegr$new(
id = 'office2',
backend = data,
target = 'imdb_rating'
)
# Model specification
graph =
po('scale') %>>%
lrn('regr.cv_glmnet',
id = 'rp',
alpha = 1,
family = 'gaussian'
)
# Learner creation
learner = GraphLearner$new(graph)
# Goal
## 1. Batch train all learners with the train rows indicated by the test_or_train column in the data set
## 2. Batch predict the rows designated by the 'test' in the test_or_train column with the respective trained learner
Created on 2020-06-22 by the reprex package (v0.3.0)
Note
I tried using benchmark_grid with row_ids to train the learner only with the train rows, but this did not work. It was also not possible to work with the column designator, which is much easier than working with row indices. With the test-train column designator one can work with one rule (for the split), whereas working with row indices only works as long as the tasks contain the same rows.
benchmark_grid(
tasks = list(task1, task2),
learners = learner,
row_ids = train_rows # Not an argument and not favorable to work with indices
)
You can use benchmark with a custom design.
The following should do the job (note that I instantiate a custom Resampling for each Task separately).
library(data.table)
design = data.table(
task = list(task1, task2),
learner = list(learner)
)
library(mlr3misc)
design$resampling = map(design$task, function(x) {
# get train/test split
split = x$data()[["test_or_train"]]
# remove train-test split column from the task
x$select(setdiff(x$feature_names, "test_or_train"))
# instantiate a custom resampling with the given split
rsmp("custom")$instantiate(x,
train_sets = list(which(split == "train")),
test_sets = list(which(split == "test"))
)
})
benchmark(design)
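The core of the split above is turning the indicator column into two sets of row positions. A minimal base R sketch of just that step, using a small made-up data frame (the imdb_rating values are illustrative only):

```r
# Toy data with the same indicator column name as in the question.
df <- data.frame(
  imdb_rating = c(8.1, 7.9, 8.4, 8.8, 7.5, 8.0),
  test_or_train = c("test", "train", "train", "test", "train", "train")
)

split <- df$test_or_train

# Row positions for each part; these are what train_sets and test_sets receive above.
train_rows <- which(split == "train")
test_rows <- which(split == "test")

# The two sets must not overlap, otherwise test rows leak into training.
stopifnot(length(intersect(train_rows, test_rows)) == 0)

print(train_rows)
print(test_rows)
```

Because the rule is expressed on the column rather than on fixed indices, the same function can be mapped over tasks that contain different rows.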
Could you specify what you mean by batch-processing more clearly or does this answer your question?
I have been stumped on this problem for a very long time and cannot figure it out. I believe the issue stems from subsets of data.frame objects retaining information of the parent but I also feel it's causing issues when training h2o.deeplearning models on what I think is just my training set (though this may not be true). See below for sample code. I included comments to clarify what I'm doing but it's fairly short code:
dataset = read.csv("dataset.csv")[,-1] # Read dataset in but omit the first column (it's just an index from the original data)
y = dataset[,1] # Create response
X = dataset[,-1] # Create regressors
X = model.matrix(y~.,data=dataset) # Automatically create dummy variables
y=as.factor(y) # Ensure y has factor data type
dataset = data.frame(y,X) # Create final data.frame dataset
train = sample(length(y),length(y)/1.66) # Create training indices (integer row positions, not a boolean mask)
test = (-train) # Create testing indices
h2o.init(nthreads=2) # Initiate h2o
# BELOW: Create h2o.deeplearning model with subset of dataset.
mlModel = h2o.deeplearning(y='y',training_frame=as.h2o(dataset[train,,drop=TRUE]),activation="Rectifier",
hidden=c(6,6),epochs=10,train_samples_per_iteration = -2)
predictions = h2o.predict(mlModel,newdata=as.h2o(dataset[test,-1])) # Predict using mlModel
predictions = as.data.frame(predictions) # Convert predictions to dataframe object. as.vector() caused issues for me
predictions = predictions[,1] # Extract predictions
mean(predictions!=y[test])
The problem is that if I evaluate this against my test subset I get almost 0% error:
[1] 0.0007531255
Has anyone encountered this issue? Have an idea of how to alleviate this problem?
It will be more efficient to use the H2O functions to load the data and split it.
library(h2o)
h2o.init(nthreads=2) # Start or connect to a local H2O instance
data = h2o.importFile("dataset.csv")
y = 2 #Response is 2nd column, first is an index
x = 3:(ncol(data)) #Learn from all the other columns
data[,y] = as.factor(data[,y])
parts = h2o.splitFrame(data, 0.8) #Split 80/20
train = parts[[1]]
test = parts[[2]]
# BELOW: Create h2o.deeplearning model with subset of dataset.
mlModel = h2o.deeplearning(x=x, y=y, training_frame=train,activation="Rectifier",
hidden=c(6,6),epochs=10,train_samples_per_iteration = -2)
h2o.performance(mlModel, test)
It is hard to say what the problem with your original code is, without seeing the contents of dataset.csv and being able to try it. My guess is that train and test are not being split, and it is actually being trained on the test data.
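Separately from the guess above, one thing that may be worth checking in the original code is the model.matrix(y~.,data=dataset) call. Since y is a free-standing vector and not a column of dataset, the dot expands to every column of dataset, which would include the response column itself. The sketch below shows this with a small hypothetical data frame (the column names label, f1 and f2 are made up).

```r
set.seed(1)

# Hypothetical stand-in for dataset.csv: response in column 1.
dataset <- data.frame(
  label = rep(c(0, 1), each = 5),
  f1 = rnorm(10),
  f2 = rnorm(10)
)
y <- dataset[, 1]

# `y` is not a column of `dataset`, so `.` expands to all columns, including `label`.
X_leaky <- model.matrix(y ~ ., data = dataset)

# Naming the response column on the left-hand side keeps it out of the design matrix.
X_clean <- model.matrix(label ~ ., data = dataset)

print(colnames(X_leaky))
print(colnames(X_clean))
```

If the response is present among the regressors, a near-zero test error would be expected even with a correct train/test split.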