randomForest in R is including the class label as a feature prevents classifier from predicting on new dataset - r

So I have two datasets, og.data and newdata.df. I have matched their features and I want to use a feature from og.data to train a model so I can identify cases of this class in newdata.df. I am using the randomForest package in R documentation for it is here: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
split <- sample.split(og.data$class_label, SplitRatio = 0.7)
training_set = subset(og.data$class_label, split == TRUE)
test_set = subset(og.data$class_label, split == FALSE)
rf.classifier.object = randomForest(x = training_set[-1],
y = training_set$Engramcell,
ntree = 500)
I then use the test set to calculate the AUC, visualize ROC, precision, recall etc etc.
I do that using prediction probability generated like so...
predictions.df <- as.data.frame(predict(rf.classifier.object,
test_set,
type = "prob")
)
All is good I proceed to try to use the classifier I've trained on new data and now I am encountering a problem because the new data does not contain the feature class label. Whihc is annoying as the entire purpose of training the classifier to to label this newdata.
predictions.df <- as.data.frame(predict(rf.classifier.object,
newdata.df,
type = "prob")
)
Please note the error has different variable names simply because I changed the code to make it more general for readability.
Error in predict.randomForest(rf.classifier.object, newdata.df, :
variables in the training data missing in newdata
As per this stack post predict.randomForest(), called here as predict(), uses rownames of feature importance to make its precitions. And when I checked with a search of the feature names I find that it is infact the class label preventing me from making the test as I show bellow.
# > rownames(rf.classifier.object$importance)[!(rownames(rf.classifier.object$importance) %in% colnames(newdata) )]
# [1] "class_label"
It is not clear to me what I should change in my script so that the classifier can be used on other data than the testing set. I have followed the instructions exactly this seems like a bad design choice to have made the function this way. The class label should not be used for calculating feature importance at all and should not even be considered a feature imo.

Related

R mlr3 TaskClassif 'termlabels' must be a character vector of length at least one

I am using mlr3 for a simple classification model. But I encounter errors with several different models which mlr3 gives access to. Here I provide one reprex to illustrate the problem:
library(data.table)
library(mlr3extralearners)
library(mlr3)
library(mlr3learners)
library(mlr3tuning)
library(mlr3pipelines)
library(mlr3filters)
#Make example data
DT = data.table(target = c(0,0,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0),pred = c(0.05767878,0.05761652,0.06508700,0.06531820,0.07050699,0.07098812,0.07150984,0.07845767,0.07891081,0.07873572,0.08035471,0.08039300,0.08040480,0.08040480,0.08472619,0.08489135,0.08517742,0.08612768,0.08728675,0.08790671,0.08913434,0.08911522,0.09036788,0.09147726,0.09154964,0.09236259,0.09299088,0.09499589,0.09748171,0.09756818,0.09756818,0.09861013,0.10193147,0.10211796,0.10277547,0.10379659,0.10393602,0.10397469,0.10364373,0.10368016,0.10362235,0.10387504,0.10385431,0.10387288,0.10423139,0.10483475,0.10570517,0.10573617,0.10569312,0.10572714,0.10597040,0.10573924,0.10551367,0.10573499,0.10602269,0.10765947,0.10721005,0.10703524,0.10824609,0.10933141,0.10936178,0.10957693,0.10874663,0.10875077))
DT[, target := as.factor(target)] #Target Variable as factor is required
task <- TaskClassif$new(id='pizza', backend = DT, target = "target", positive = '1')
#Select an algo and a filter
randF = lrn("classif.randomForest", predict_type = "prob")#
filter1 = mlr_pipeops$get("filter", filter = mlr3filters::FilterVariance$new(),param_vals = list(filter.cutoff = 0.05))
#Construct a simple graph
graph = filter1 %>>%
PipeOpLearner$new(lrn("classif.randomForest"), id = "randF")
#graph$plot()
#Construct a learner and train it
learner = GraphLearner$new(graph)
learner$train(task)
This give the error:
'Error in reformulate(attributes(Terms)$term.labels) :
'termlabels' must be a character vector of length at least one'
I have the impression, that the task- object of mlr3 somehow doesnt interact well with the graph. The error then comes from the randomForest classifier, but to me it seems like the data was not properly handed over to it. But thats just a theory of mine. I may alter the question if its not clear enough.
Your filter is removing the only feature, and feature filtering is not necessary if there is only a single feature.

Importance based variable reduction

I am facing a difficulty with filtering out the least important variables in my model. I received a set of data with more than 4,000 variables, and I have been asked to reduce the number of variables getting into the model.
I did try already two approaches, but I have failed twice.
The first thing I tried was to manually check variable importance after the modelling and based on that removing non significant variables.
# reproducible example
data <- iris
# artificial class imbalancing
data <- iris %>%
mutate(Species = as.factor(ifelse(Species == "virginica", "1", "0")))
Everything works fine while using simple Learner:
# creating Task
task <- TaskClassif$new(id = "score", backend = data, target = "Species", positive = "1")
# creating Learner
lrn <- lrn("classif.xgboost")
# setting scoring as prediction type
lrn$predict_type = "prob"
lrn$train(task)
lrn$importance()
Petal.Width Petal.Length
0.90606304 0.09393696
The issue is that the data is highly imbalanced, so I decided to use GraphLearner with PipeOp operator to undersample majority group which is then passed to AutoTuner:
I did skip some part of the code which I believe is not important for this case, things like search space, terminator, tuner etc.
# undersampling
po_under <- po("classbalancing",
id = "undersample", adjust = "major",
reference = "major", shuffle = FALSE, ratio = 1 / 2)
# combine learner with pipeline graph
lrn_under <- GraphLearner$new(po_under %>>% lrn)
# setting the autoTuner
at <- AutoTuner$new(
learner = lrn_under,
resampling = resample,
measure = measure,
search_space = ps_under,
terminator = terminator,
tuner = tuner
)
at$train(task)
The problem right know is that despite the importance property being still visable within at the $importance() in unavailable.
> at
<AutoTuner:undersample.classif.xgboost.tuned>
* Model: list
* Parameters: list()
* Packages: -
* Predict Type: prob
* Feature types: logical, integer, numeric, character, factor, ordered, POSIXct
* Properties: featureless, importance, missings, multiclass, oob_error, selected_features, twoclass, weights
So I decided to change my approach and try to add filtering into a Learner. And that's where I've failed even more. I have started by looking into this mlr3book blog - https://mlr3book.mlr-org.com/fs.html. I tried to add importance = "impurity" into Learner just like in the blog but id did yield an error.
> lrn <- lrn("classif.xgboost", importance = "impurity")
Błąd w poleceniu 'instance[[nn]] <- dots[[i]]':
nie można zmienić wartości zablokowanego połączenia dla 'importance'
Which basically means something like this:
Error in 'instance[[nn]] <- dots[[i]]': can't change value of blocked connection for 'importance'
I did also try to workaround with PipeOp filtering but it also failed miserably. I believe I won't be able to do it without importance = "impurity".
So my question is, is there a way to achieve what I am aiming for?
In addition I would be greatly thankful for explaining why is filtering by importance possible before modeling? Shouldn't it be based on the model result?
The reason why you can't access $importance of the at variable is that it is an AutoTuner, which does not directly offer variable importance and only "wraps" around the actual Learner being tuned.
The trained GraphLearner is saved inside your AutoTuner under $learner:
# get the trained GraphLearner, with tuned hyperparameters
graphlearner <- at$learner
This object also does not have $importance(). (Theoretically, a GraphLearner could contain more than one Learner and then it wouldn't even know which importance to give!).
Getting the actual LearnerClassifXgboost object is a bit tedious, unfortunately, because of shortcomings in the "R6" object system used by mlr3:
Get the untrained Learner object
get the trained state of the Learner and put it into that object
# get the untrained Learner
xgboostlearner <- graphlearner$graph$pipeops$classif.xgboost$learner
# put the trained model into the Learner
xgboostlearner$state <- graphlearner$model$classif.xgboost
Now the importance can be queried
xgboostlearner$importance()
The example from the book that you link to does not work in your case because the book uses the ranger Learner, while are using xgboost. importance = "impurity" is specific to ranger.

Error when calculating variable importance with categorical variables using the caret package (varImp)

I've been trying to compute the variable importance for a model with mixed scale features using the varImp function in the caret package. I've tried a number of approaches, including renaming and coding my levels numerically. In each case, I am getting the following error:
Error in auc3_(actual, predicted, ranks) :
Not compatible with requested type: [type=character; target=double].
The following dummy example should illustrate my point (edited to reflect #StupidWolf's correction):
library(caret)
#create small dummy dataset
set.seed(124)
dummy_data = data.frame(Label = factor(sample(c("a","b"),40, replace = TRUE)))
dummy_data$pred1 = ifelse(dummy_data$Label=="a",rnorm(40,-.5,2),rnorm(40,.5,2))
dummy_data$pred2 = factor(ifelse(dummy_data$Label=="a",rbinom(40,1,0.3),rbinom(40,1,0.7)))
# check varImp
control.lvq <- caret::trainControl(method="repeatedcv", number=10, repeats=3)
model.lvq <- caret::train(Label~., data=dummy_data,
method="lvq", preProcess="scale", trControl=control.lvq)
varImp.lvq <- caret::varImp(model.lvq, scale=FALSE)
The issue persists when using different models (like randomForest and SVM).
If anyone knows a solution or can tell me what is going wrong, I would highly appreciate that.
Thanks!
When you call varImp on lvq , it defaults to filterVarImp() because there is no specific variable importance for this model. Now if you check the help page:
For two class problems, a series of cutoffs is applied to the
predictor data to predict the class. The sensitivity and specificity
are computed for each cutoff and the ROC curve is computed.
Now if you read the source code of varImp.train() that feeds the data into filterVarImp(), it is the original dataframe and not whatever comes out of the preprocess.
This means in the original data, if you have a variable that is a factor, it cannot cut the variable, it will throw and error like this:
filterVarImp(data.frame(dummy_data$pred2),dummy_data$Label)
Error in auc3_(actual, predicted, ranks) :
Not compatible with requested type: [type=character; target=double].
So using my example and like you have pointed out, you need to onehot encode it:
set.seed(111)
dummy_data = data.frame(Label = rep(c("a","b"),each=20))
dummy_data$pred1 = rnorm(40,rep(c(-0.5,0.5),each=20),2)
dummy_data$pred2 = rbinom(40,1,rep(c(0.3,0.7),each=20))
dummy_data$pred2 = factor(dummy_data$pred2)
control.lvq <- caret::trainControl(method="repeatedcv", number=10, repeats=3)
ohe_data = data.frame(
Label = dummy_data$Label,
model.matrix(Label ~ 0+.,data=dummy_data))
model.lvq <- caret::train(Label~., data=ohe_data,
method="lvq", preProcess="scale",
trControl=control.lvq)
caret::varImp(model.lvq, scale=FALSE)
ROC curve variable importance
Importance
pred1 0.6575
pred20 0.6000
pred21 0.6000
If you use a model that doesn't have a specific variable importance method, then one option is that you can already calculate the variable importance first, and run the model after that.
Note that this problem can be circumvented by replacing ordinal features (with d levels) by its (d-1)-dimensional indicator encoding:
model.matrix(~dummy_data$pred2-1)[,1:(length(levels(dummy_data$pred2)-1)]
However, why does varImp not handle this automatically? Further, this has the drawback that it yields an importance score for each of the d-1 indicators, not one unified importance score for the original feature.

What's the difference between lgb.train() and lightgbm() in r?

I'm trying to build a regression model with R using lightGBM,
and i'm getting a bit confused with some functions and when/how to use them.
First one is what i've written in the title, what's the difference between lgb.train() and lightgbm()?
The description in the documentation(https://cran.r-project.org/web/packages/lightgbm/lightgbm.pdf) says that lgb.train is 'Logic to train with LightGBM' and lightgbm is 'Simple interface for training a LightGBM model', while both their outcome value is lgb.Booster, a trained model.
One difference I've found is that lgb.train() does not work with valids = , while lightgbm() does.
Second one is about a function lgb.cv(), regarding a cross validation in lightGBM. How do you apply the output of lgb.cv() to a model?
As I understood from the documentation i've linked above, it seems like the output of both lgb.cv and lgb.train is a model.
Is it correct to use it like the example below?
lgbcv <- lgb.cv(params,
lgbtrain,
nrounds = 1000,
nfold = 5,
early_stopping_rounds = 100,
learning_rate = 1.0)
lgbcv <- lightgbm(params,
lgbtrain,
nrounds = 1000,
early_stopping_rounds = 100,
learning_rate = 1.0)
Thank you in advance!
what's the difference between lgb.train() and lightgbm()?
These functions both train a LightGBM model, they're just slightly different interfaces. The biggest difference is in how training data are prepared. LightGBM training requires a special LightGBM-specific representation of the training data, called a Dataset. To use lgb.train(), you have to construct one of these beforehand with lgb.Dataset(). lightgbm(), on the other hand, can accept a data frame, data.table, or matrix and will create the Dataset object for you.
Choose whichever method you feel has a more friendly interface...both will produce a single trained LightGBM model (class "lgb.Booster").
that lgb.train() does not work with valids = , while lightgbm() does.
This is not correct. Both functions accept the keyword argument valids. Run ?lgb.train and ?lightgbm for documentation on those methods.
How do you apply the output of lgb.cv() to a model?
I'm not sure what you mean, but you can find an example of how to use lgb.cv() in the docs that show up when you run ?lgb.cv.
library(lightgbm)
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
params <- list(objective = "regression", metric = "l2")
model <- lgb.cv(
params = params
, data = dtrain
, nrounds = 5L
, nfold = 3L
, min_data = 1L
, learning_rate = 1.0
)
This returns an object of class "lgb.CVBooster". That object has multiple "lgb.Booster" objects in it (the trained models that lightgbm() or lgb.train() produce).
You can extract any one of these from model$boosters. However, in practice I don't recommend using the models from lgb.cv() directly. The goal of cross-validation is to get an estimate of the generalization error for a model. So you can use lgb.cv() to figure out the expected error for a given dataset + set of parameters (by looking at model$record_evals and model$best_score).

R xgb.importance shows error - "feature_names has less elements than there are features used in the model"

I'm exploring the XGBoost in R.
After training the model, I wanted to see the feature-importance data.
xgb.importance(model = bst)
The above call shows the following error. What might be wrong?
Error in xgb.model.dt.tree(feature_names = feature_names, text = model_text_dump, : feature_names has less elements than there are features used in the model
PN - I checked the following section of the xgboost lib code, but still couldn't figure out the actual issue.
# assign feature_names when available
if (!is.null(feature_names)) {
if (length(feature_names) <= max(as.numeric(td$Feature), na.rm = TRUE))
stop("feature_names has less elements than there are features used in the model")
td[isLeaf == FALSE, Feature := feature_names[as.numeric(Feature) + 1] ]
}
Ref - https://github.com/dmlc/xgboost/blob/master/R-package/R/xgb.model.dt.tree.R
I see that the nfeatures variable of the trained model is same as the number of features passed to this model.
Dose your model have feature_names function?
Perhaps try xgb.importance(feature_names=colnames(bst$feature_names), model = bst). Works for me.

Resources