I am facing a difficulty with filtering out the least important variables in my model. I received a set of data with more than 4,000 variables, and I have been asked to reduce the number of variables getting into the model.
I have already tried two approaches, and both failed.
First, I tried manually checking variable importance after fitting the model and removing the non-significant variables based on that.
# reproducible example
library(dplyr)
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
library(mlr3tuning)
# artificial class imbalance
data <- iris %>%
  mutate(Species = as.factor(ifelse(Species == "virginica", "1", "0")))
Everything works fine when using a simple Learner:
# creating Task
task <- TaskClassif$new(id = "score", backend = data, target = "Species", positive = "1")
# creating Learner
lrn <- lrn("classif.xgboost")
# setting scoring as prediction type
lrn$predict_type = "prob"
lrn$train(task)
lrn$importance()
Petal.Width Petal.Length
0.90606304 0.09393696
The issue is that the data is highly imbalanced, so I decided to use a GraphLearner with a PipeOp that undersamples the majority class, and then pass it to an AutoTuner.
I have left out the parts of the code that I believe are not relevant here, such as the search space, terminator, and tuner.
# undersampling
po_under <- po("classbalancing",
id = "undersample", adjust = "major",
reference = "major", shuffle = FALSE, ratio = 1 / 2)
# combine learner with pipeline graph
lrn_under <- GraphLearner$new(po_under %>>% lrn)
# setting the autoTuner
at <- AutoTuner$new(
learner = lrn_under,
resampling = resample,
measure = measure,
search_space = ps_under,
terminator = terminator,
tuner = tuner
)
at$train(task)
The problem now is that, although the importance property is still visible on at, the $importance() method is unavailable.
> at
<AutoTuner:undersample.classif.xgboost.tuned>
* Model: list
* Parameters: list()
* Packages: -
* Predict Type: prob
* Feature types: logical, integer, numeric, character, factor, ordered, POSIXct
* Properties: featureless, importance, missings, multiclass, oob_error, selected_features, twoclass, weights
So I decided to change my approach and add filtering to the Learner instead, and that is where I failed even more. I started with the feature selection chapter of the mlr3book: https://mlr3book.mlr-org.com/fs.html. I tried to add importance = "impurity" to the Learner, just like in the book, but it yielded an error.
> lrn <- lrn("classif.xgboost", importance = "impurity")
Błąd w poleceniu 'instance[[nn]] <- dots[[i]]':
nie można zmienić wartości zablokowanego połączenia dla 'importance'
In English, the message reads:
Error in instance[[nn]] <- dots[[i]] : cannot change value of locked binding for 'importance'
I also tried a workaround using a filter PipeOp, but that failed as well. I believe I won't be able to do it without importance = "impurity".
So my question is, is there a way to achieve what I am aiming for?
In addition, I would be very grateful for an explanation of why filtering by importance is possible before modelling. Shouldn't it be based on the model result?
The reason why you can't access $importance of the at variable is that it is an AutoTuner, which does not directly offer variable importance and only "wraps" around the actual Learner being tuned.
The trained GraphLearner is saved inside your AutoTuner under $learner:
# get the trained GraphLearner, with tuned hyperparameters
graphlearner <- at$learner
This object also does not have $importance(). (Theoretically, a GraphLearner could contain more than one Learner and then it wouldn't even know which importance to give!).
Getting the actual LearnerClassifXgboost object is a bit tedious, unfortunately, because of shortcomings in the "R6" object system used by mlr3:
1. Get the untrained Learner object.
2. Get the trained state of the Learner and put it into that object.
# get the untrained Learner
xgboostlearner <- graphlearner$graph$pipeops$classif.xgboost$learner
# put the trained model into the Learner
xgboostlearner$state <- graphlearner$model$classif.xgboost
Now the importance can be queried:
xgboostlearner$importance()
The example from the book that you link to does not work in your case because the book uses the ranger Learner, while you are using xgboost. importance = "impurity" is specific to ranger.
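As a rough sketch, the two steps above could be wrapped in a small helper. The function name extract_importance and its learner_id argument are made up for illustration and are not part of mlr3. The helper only chains the accessors already shown above, and it assumes the GraphLearner contains a PipeOp with the given id.

```r
# Hypothetical helper (not part of mlr3): pull the variable importance of one
# Learner out of a trained AutoTuner that wraps a GraphLearner.
extract_importance <- function(at, learner_id = "classif.xgboost") {
  # trained GraphLearner with the tuned hyperparameters
  graphlearner <- at$learner
  # untrained Learner object stored inside the graph
  inner <- graphlearner$graph$pipeops[[learner_id]]$learner
  # copy the trained state of that Learner into the object
  inner$state <- graphlearner$model[[learner_id]]
  inner$importance()
}

# Usage, assuming `at` has been trained as in the question:
# extract_importance(at)
```

If the graph contained several Learners, the id would have to be passed explicitly, which is also why the GraphLearner cannot pick one by itself.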
So I have two datasets, og.data and newdata.df. I have matched their features and I want to use a feature from og.data to train a model so I can identify cases of this class in newdata.df. I am using the randomForest package in R; its documentation is here: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
split <- sample.split(og.data$class_label, SplitRatio = 0.7)
training_set = subset(og.data$class_label, split == TRUE)
test_set = subset(og.data$class_label, split == FALSE)
rf.classifier.object = randomForest(x = training_set[-1],
y = training_set$Engramcell,
ntree = 500)
I then use the test set to calculate the AUC and to visualize the ROC curve, precision, recall, etc.
I do that using prediction probabilities generated like so:
predictions.df <- as.data.frame(predict(rf.classifier.object,
test_set,
type = "prob")
)
All is good so far. I then try to use the trained classifier on new data, and now I run into a problem because the new data does not contain the class label feature. This is annoying, as the entire purpose of training the classifier is to label this new data.
predictions.df <- as.data.frame(predict(rf.classifier.object,
newdata.df,
type = "prob")
)
Please note the error has different variable names simply because I changed the code to make it more general for readability.
Error in predict.randomForest(rf.classifier.object, newdata.df, :
variables in the training data missing in newdata
As per this stack post, predict.randomForest(), called here as predict(), uses the row names of the feature importance to make its predictions. When I searched the feature names, I found that it is in fact the class label that prevents me from predicting, as I show below.
# > rownames(rf.classifier.object$importance)[!(rownames(rf.classifier.object$importance) %in% colnames(newdata) )]
# [1] "class_label"
It is not clear to me what I should change in my script so that the classifier can be used on data other than the test set. I have followed the instructions exactly; it seems like a bad design choice to have made the function this way. The class label should not be used for calculating feature importance at all and should not even be considered a feature, in my opinion.
I am using mlr3 for a simple classification model. But I encounter errors with several different models which mlr3 gives access to. Here I provide one reprex to illustrate the problem:
library(data.table)
library(mlr3extralearners)
library(mlr3)
library(mlr3learners)
library(mlr3tuning)
library(mlr3pipelines)
library(mlr3filters)
#Make example data
DT = data.table(target = c(0,0,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0),pred = c(0.05767878,0.05761652,0.06508700,0.06531820,0.07050699,0.07098812,0.07150984,0.07845767,0.07891081,0.07873572,0.08035471,0.08039300,0.08040480,0.08040480,0.08472619,0.08489135,0.08517742,0.08612768,0.08728675,0.08790671,0.08913434,0.08911522,0.09036788,0.09147726,0.09154964,0.09236259,0.09299088,0.09499589,0.09748171,0.09756818,0.09756818,0.09861013,0.10193147,0.10211796,0.10277547,0.10379659,0.10393602,0.10397469,0.10364373,0.10368016,0.10362235,0.10387504,0.10385431,0.10387288,0.10423139,0.10483475,0.10570517,0.10573617,0.10569312,0.10572714,0.10597040,0.10573924,0.10551367,0.10573499,0.10602269,0.10765947,0.10721005,0.10703524,0.10824609,0.10933141,0.10936178,0.10957693,0.10874663,0.10875077))
DT[, target := as.factor(target)] #Target Variable as factor is required
task <- TaskClassif$new(id='pizza', backend = DT, target = "target", positive = '1')
#Select an algo and a filter
randF = lrn("classif.randomForest", predict_type = "prob")#
filter1 = mlr_pipeops$get("filter", filter = mlr3filters::FilterVariance$new(),param_vals = list(filter.cutoff = 0.05))
#Construct a simple graph
graph = filter1 %>>%
PipeOpLearner$new(lrn("classif.randomForest"), id = "randF")
#graph$plot()
#Construct a learner and train it
learner = GraphLearner$new(graph)
learner$train(task)
This gives the error:
'Error in reformulate(attributes(Terms)$term.labels) :
'termlabels' must be a character vector of length at least one'
I have the impression that the task object of mlr3 somehow doesn't interact well with the graph. The error then comes from the randomForest classifier, but to me it seems like the data was not properly handed over to it. But that's just a theory of mine. I may alter the question if it's not clear enough.
Your filter is removing the only feature, and feature filtering is not necessary if there is only a single feature.
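A quick way to see this, sketched in base R with a handful of the pred values from the question: the variance of the feature is far below the cutoff of 0.05. The snippet assumes that the variance filter scores a numeric feature by its plain sample variance and drops features whose score falls below filter.cutoff; the variable names are chosen for illustration only.

```r
# A subset of the `pred` values from the question, spanning their full range
pred <- c(0.05767878, 0.06508700, 0.07050699, 0.07845767, 0.08472619,
          0.09036788, 0.09748171, 0.10393602, 0.10570517, 0.10957693)

cutoff <- 0.05        # filter.cutoff used in the question
score <- var(pred)    # assumed filter score: the sample variance

# TRUE would mean the feature survives the filter
keep <- score >= cutoff
print(c(score = score, cutoff = cutoff))
print(keep)
```

With the single feature filtered out, the learner receives a task without any predictors, which is consistent with the reformulate() error about empty term labels.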
I've been trying to compute the variable importance for a model with mixed scale features using the varImp function in the caret package. I've tried a number of approaches, including renaming and coding my levels numerically. In each case, I am getting the following error:
Error in auc3_(actual, predicted, ranks) :
Not compatible with requested type: [type=character; target=double].
The following dummy example should illustrate my point (edited to reflect @StupidWolf's correction):
library(caret)
#create small dummy dataset
set.seed(124)
dummy_data = data.frame(Label = factor(sample(c("a","b"),40, replace = TRUE)))
dummy_data$pred1 = ifelse(dummy_data$Label=="a",rnorm(40,-.5,2),rnorm(40,.5,2))
dummy_data$pred2 = factor(ifelse(dummy_data$Label=="a",rbinom(40,1,0.3),rbinom(40,1,0.7)))
# check varImp
control.lvq <- caret::trainControl(method="repeatedcv", number=10, repeats=3)
model.lvq <- caret::train(Label~., data=dummy_data,
method="lvq", preProcess="scale", trControl=control.lvq)
varImp.lvq <- caret::varImp(model.lvq, scale=FALSE)
The issue persists when using different models (like randomForest and SVM).
If anyone knows a solution or can tell me what is going wrong, I would highly appreciate that.
Thanks!
When you call varImp on lvq, it defaults to filterVarImp() because there is no specific variable importance for this model. Now if you check the help page:
For two class problems, a series of cutoffs is applied to the
predictor data to predict the class. The sensitivity and specificity
are computed for each cutoff and the ROC curve is computed.
Now if you read the source code of varImp.train(), the data it feeds into filterVarImp() is the original data frame and not whatever comes out of the preprocessing.
This means that if the original data contains a variable that is a factor, it cannot be cut at a threshold, and an error like this is thrown:
filterVarImp(data.frame(dummy_data$pred2),dummy_data$Label)
Error in auc3_(actual, predicted, ranks) :
Not compatible with requested type: [type=character; target=double].
So, using my example and as you have pointed out, you need to one-hot encode it:
set.seed(111)
dummy_data = data.frame(Label = rep(c("a","b"),each=20))
dummy_data$pred1 = rnorm(40,rep(c(-0.5,0.5),each=20),2)
dummy_data$pred2 = rbinom(40,1,rep(c(0.3,0.7),each=20))
dummy_data$pred2 = factor(dummy_data$pred2)
control.lvq <- caret::trainControl(method="repeatedcv", number=10, repeats=3)
ohe_data = data.frame(
Label = dummy_data$Label,
model.matrix(Label ~ 0+.,data=dummy_data))
model.lvq <- caret::train(Label~., data=ohe_data,
method="lvq", preProcess="scale",
trControl=control.lvq)
caret::varImp(model.lvq, scale=FALSE)
ROC curve variable importance
Importance
pred1 0.6575
pred20 0.6000
pred21 0.6000
If you use a model that doesn't have a specific variable importance method, one option is to calculate the variable importance first and run the model after that.
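To illustrate what such a model-free score can look like, here is a minimal base R sketch of a ROC-based filter for a two-class label. It is not caret's implementation; the function name auc_importance and the simulated data are assumptions made for this example. The AUC is computed from ranks, and the score is the larger of AUC and 1 - AUC so that the direction of the effect does not matter.

```r
# Hypothetical helper: rank-based AUC of one numeric predictor against a
# two-level factor, folded so that the score lies between 0.5 and 1.
auc_importance <- function(x, y) {
  stopifnot(is.numeric(x), is.factor(y), nlevels(y) == 2)
  pos <- y == levels(y)[2]
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(x)  # mid-ranks, so ties are handled
  auc <- (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
  max(auc, 1 - auc)
}

# Simulated data: one predictor related to the label, one pure noise
set.seed(1)
y <- factor(rep(c("a", "b"), each = 20))
x_signal <- rnorm(40, mean = rep(c(-1, 1), each = 20))
x_noise <- rnorm(40)

scores <- c(signal = auc_importance(x_signal, y),
            noise = auc_importance(x_noise, y))
print(sort(scores, decreasing = TRUE))
```

Because the score only needs the predictor and the label, it can be computed before any model is fitted. The stopifnot() on is.numeric(x) also mirrors why a factor predictor has to be encoded numerically first.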
Note that this problem can be circumvented by replacing each ordinal feature (with d levels) by its (d-1)-dimensional indicator encoding:
model.matrix(~dummy_data$pred2-1)[, 1:(length(levels(dummy_data$pred2))-1), drop = FALSE]
However, why does varImp not handle this automatically? Further, this has the drawback that it yields an importance score for each of the d-1 indicators, not one unified importance score for the original feature.
I am using the mlr3 package and I want to plot ROC curves for different models. If I use cross-validation as explained in the documentation, it works perfectly well, but if I use "holdout" for the resampling, I get the following error: "Error: Invalid show_cb. Inconsistent with calc_avg of evalmod."
Here is the code:
library("mlr3")
library("mlr3learners")
library("mlr3viz")
# one task only
tasks = lapply(c("german_credit"), tsk)
# get some learners and for all learners ...
# * predict probabilities
# * predict also on the training set
learners = c("classif.featureless", "classif.rpart", "classif.ranger", "classif.kknn")
learners = lapply(learners, lrn,
predict_type = "prob")
# compare via holdout (80% training, 20% test) rather than 3-fold cross-validation
resamplings = rsmp("holdout", ratio = .8) # holdout instead of cv
# create a BenchmarkDesign object
design = benchmark_grid(tasks, learners, resamplings)
print(design)
bmr = benchmark(design)
autoplot(bmr, type = "roc")
Thanks for your help,
Mathieu
In case someone else is having the same problem, here is a solution. The problem occurs because the argument calc_avg is set to TRUE by default in precrec::evalmod(), and mlr3viz::autoplot() uses the function as is. With cross-validation, as_precrec() returns an object with a different dsid for each fold, but with holdout there is only one element. precrec therefore cannot average the curves, hence the error (although theoretically it could).
Here is a piece of code that can be used to plot ROC curves with holdout (or any other types of resampling). Using the code in the answer we can do the following:
library(precrec)   # evalmod()
library(ggplot2)   # fortify(), ggplot()
library(magrittr)  # %>%
roc_data <- evalmod(as_precrec(bmr), mode = "rocprc", calc_avg = FALSE) %>% # setting calc_avg to FALSE is critical
fortify() %>% # precrec objects have a fortify generic function
.[.$curvetype == "ROC", ] # both roc and prc are returned
# Plot the curves
ggplot(
data = roc_data,
mapping = aes(x = x, y = y, color = modname)
) +
geom_line()
This code also has the advantage of being a ggplot object so it can be modified easily with ggplot2 which is not the case for precrec::autoplot().
I followed the documentation of mlr3 regarding the imputation of data with pipelines. However, the model that I have trained does not allow predictions if a column is NA.
Do you have any idea why it doesn't work?
train step
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
data("mtcars", package = "datasets")
data = mtcars[, 1:3]
str(data)
task_mtcars = TaskRegr$new(id="cars", backend = data, target = "mpg")
imp_missind = po("missind")
imp_num = po("imputehist", param_vals =list(affect_columns = selector_type("numeric")))
scale = po("scale")
learner = lrn('regr.ranger')
graph = po("copy", 2) %>>%
gunion(list(imp_num %>>% scale,imp_missind)) %>>%
po("featureunion") %>>%
po(learner)
graph$plot()
graphlearner = GraphLearner$new(graph)
graphlearner$train(task_mtcars)
predict step
data = task_mtcars$data()[12:12,]
data[1:1, cyl:=NA]
predict(graphlearner, data)
The error is
Error: Missing data in columns: cyl.
The example in the mlr3gallery seems to work for your case, so you basically have to switch the order of imputehist and missind.
Another approach would be to set the missind's which hyperparameter to "all" in order to enforce the creation of an indicator for every column.
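A minimal sketch of that second option, assuming mlr3pipelines is installed (the guard only keeps the snippet runnable when it is not). It sets the which hyperparameter at construction time and reuses the variable name imp_missind from the question.

```r
# Create the missing-indicator PipeOp so that it adds an indicator column for
# every feature, not only for those with missing values in the training data.
if (requireNamespace("mlr3pipelines", quietly = TRUE)) {
  library(mlr3pipelines)
  imp_missind <- po("missind", which = "all")
  print(imp_missind$param_set$values$which)
}
```

This object would then take the place of po("missind") in the graph from the question; the rest of the pipeline stays unchanged.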
This is actually a bug: missind returns the full task if trained on data with no missing values (which in turn overwrites the imputed values).
Thanks a lot for spotting it. I am trying to fix it in a PR.