R - partykit - custom model tree nodes and splits

In the partykit package there is a way to build custom trees by specifying predictors and splits. For example:
data("WeatherPlay", package = "partykit")
#create a split
sp_o <- partysplit(3L, breaks = 75)
#create a node
n1 <- partynode(id = 1L, split = sp_o, kids = lapply(2L:3L, partynode))
#and make a "tree" out of it
t2 <- party(
  n1,
  data = WeatherPlay,
  fitted = data.frame(
    "(fitted)" = fitted_node(n1, data = WeatherPlay),
    "(response)" = WeatherPlay$play,
    check.names = FALSE
  ),
  terms = terms(play ~ ., data = WeatherPlay)
)
t2 <- as.constparty(t2)
t2
plot(t2)
Is this possible for model trees (returned by mob())? Can I build the tree node by node, and then fit a specified model function in the terminal nodes?

In principle, it is possible to build a "modelparty" object (as returned by mob()) by hand. However, there is no simple coercion function like as.constparty() in the example you cite. The reason is that for building the model trees, and for printing and especially predicting with them, much more detailed information about the model is needed.
I would recommend building the tree structure ("partynode") first and then filling the overall $info slot (with call, formula, Formula, terms, fit, control, dots, and nreg). Then it should be possible to call refit.modelparty() to refit the models in all the terminal nodes, and the results can be used to fill each node's $info (with coefficients, objfun, nobs, ...).
Disclaimer: All of this is completely untested. But instead of mimicking "modelparty" you could, of course, also create your own way of storing the models in the "party" object and just use more basic building blocks provided by partykit.
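To make the recipe slightly more concrete, below is a deliberately minimal sketch along those lines. It is as untested as the recipe itself: the intercept-only fitting function playfit is invented for illustration, and the $info element names mirror the list above, but their exact contents may need checking against partykit's internals.
library("partykit")
library("Formula")
data("WeatherPlay", package = "partykit")

## 1. tree structure, exactly as in the question
sp_o <- partysplit(3L, breaks = 75)
n1 <- partynode(id = 1L, split = sp_o, kids = lapply(2L:3L, partynode))

## 2. a mob()-style fitting function: intercept-only logistic regression
playfit <- function(y, x = NULL, start = NULL, weights = NULL,
                    offset = NULL, ...) {
  glm(y ~ 1, family = binomial())
}

## 3. assemble the "party" and fill the overall $info slot by hand
f <- Formula(play ~ 1 | humidity)
mp <- party(n1,
  data = WeatherPlay,
  fitted = data.frame(
    "(fitted)" = fitted_node(n1, data = WeatherPlay),
    "(response)" = WeatherPlay$play,
    check.names = FALSE
  ),
  terms = terms(play ~ ., data = WeatherPlay),
  info = list(
    call = NULL, formula = play ~ 1 | humidity, Formula = f,
    terms = list(response = terms(play ~ 1), partitioning = terms(~ humidity)),
    fit = playfit, control = mob_control(), dots = list(),
    nreg = 0L  # number of regressors; the exact value may need checking
  )
)
class(mp) <- c("modelparty", class(mp))

## 4. refit the node models; coefficients, objfun, nobs, ... from these
##    fits are what would go into each terminal node's $info
refit.modelparty(mp)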

Related

randomForest in R including the class label as a feature prevents the classifier from predicting on a new dataset

So I have two datasets, og.data and newdata.df. I have matched their features, and I want to use a feature from og.data to train a model so I can identify cases of this class in newdata.df. I am using the randomForest package in R; the documentation is here: https://cran.r-project.org/web/packages/randomForest/randomForest.pdf
library(caTools)       # for sample.split()
library(randomForest)

split <- sample.split(og.data$class_label, SplitRatio = 0.7)
training_set <- subset(og.data, split == TRUE)
test_set <- subset(og.data, split == FALSE)
rf.classifier.object <- randomForest(x = training_set[-1],
                                     y = training_set$Engramcell,
                                     ntree = 500)
I then use the test set to calculate the AUC and to visualize the ROC curve, precision, recall, etc. I do that using prediction probabilities generated like so:
predictions.df <- as.data.frame(predict(rf.classifier.object,
                                        test_set,
                                        type = "prob"))
All is good. I proceed to use the classifier I've trained on new data, and now I encounter a problem because the new data does not contain the class label feature. Which is annoying, as the entire purpose of training the classifier is to label this new data.
predictions.df <- as.data.frame(predict(rf.classifier.object,
                                        newdata.df,
                                        type = "prob"))
Please note the error has different variable names simply because I changed the code to make it more general for readability.
Error in predict.randomForest(rf.classifier.object, newdata.df, :
variables in the training data missing in newdata
As per this Stack post, predict.randomForest(), called here as predict(), uses the rownames of the feature importance to make its predictions. And when I checked with a search of the feature names, I found that it is in fact the class label preventing me from making the prediction, as I show below.
# > rownames(rf.classifier.object$importance)[!(rownames(rf.classifier.object$importance) %in% colnames(newdata) )]
# [1] "class_label"
It is not clear to me what I should change in my script so that the classifier can be used on data other than the test set. I have followed the instructions exactly; this seems like a bad design choice to have made the function this way. The class label should not be used for calculating feature importance at all, and should not even be considered a feature, in my opinion.
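A hedged guess at a fix (my own suggestion, assuming the label column is named class_label as above): training_set[-1] drops a column by position, which only removes the label if it happens to be column 1, so selecting the predictors by name guarantees the label can never leak into the features.
# Hedged sketch: drop the label by name, not by position, so it can
# never end up among the training features ("class_label" assumed).
predictors <- setdiff(colnames(training_set), "class_label")
rf.classifier.object <- randomForest(x = training_set[, predictors],
                                     y = training_set$class_label,
                                     ntree = 500)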

How does parsnip know how to match `fit` arguments to function arguments for a model?

I am trying to create a new model for the parsnip package from an existing modeling function foo.
I have followed the tutorial on building new models in parsnip and the README on GitHub, but I still cannot figure out some things.
How does the fit function in parsnip know how to assign its input data (e.g. a matrix) to my idiosyncratic function call?
Imagine there was an idiosyncratic model function foo, where the conventional roles of the x and y arguments were reversed: i.e. foo(x, y) where x should be an outcome vector and y should be a predictor matrix, bizarrely.
For example: suppose a is a matrix of predictors and b is a vector of outcomes, and I call fit_xy(object = my_model, x = a, y = b). Internally, how does fit_xy() know to call foo(x = b, y = a)?
The function that validates the input is check_final_param(), which requires that each argument be named. That is why order is not important.
https://github.com/tidymodels/parsnip/blob/f7ba069671684f61af0ca1eadb1927fedec8a9c6/R/misc.R#L235
The README file you linked points out:
"To create the model fit call, the protect arguments are populated with the appropriate objects (usually from the data set), and rlang::call2 is used to create a call that can be executed. "
An example is randomForest, which uses ntree instead of the default trees argument.
The parsnip authors created translation calls that are used during evaluation.
https://github.com/tidymodels/parsnip/blob/228a6dc6975fc91562b63d191e43d2164cc78e3d/R/rand_forest_data.R#L339
If we use call2() and unpack the named args, the order does not matter. And we know the args will be properly named because of the additional translation step.
args <- list(na.rm = TRUE, trim = 0)
rlang::call2("mean", 1:10, !!!args)
#> mean(1:10, na.rm = TRUE, trim = 0)
The way we do this is through the set_fit() function. Most models are pretty sensible and we can use default mappings (for example, from data argument to data argument or x to x), but you are right that some models use different norms. An example of this is the Spark models, which use x to mean what we might normally call data with a formula method.
The random forest set_fit() function for Spark looks like this:
set_fit(
  model = "rand_forest",
  eng = "spark",
  mode = "classification",
  value = list(
    interface = "formula",
    data = c(formula = "formula", data = "x"),
    protect = c("x", "formula", "type"),
    func = c(pkg = "sparklyr", fun = "ml_random_forest"),
    defaults = list(seed = expr(sample.int(10 ^ 5, 1)))
  )
)
Notice especially the data element of the value argument. You can read a bit more here.
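Applied to the hypothetical foo(x, y) from the question, the registration might look roughly like this. This is a sketch, not working code: the model name my_model, the engine name, and the package mypkg are invented, and in the data element the names are parsnip's canonical argument names while the values are foo's.
library(parsnip)

set_fit(
  model = "my_model",
  eng = "foo_engine",
  mode = "regression",
  value = list(
    interface = "matrix",
    ## parsnip's x (predictors) -> foo's y; parsnip's y (outcome) -> foo's x
    data = c(x = "y", y = "x"),
    protect = c("x", "y"),
    func = c(pkg = "mypkg", fun = "foo"),
    defaults = list()
  )
)
With a mapping like that in place, fit_xy(x = a, y = b) should, in principle, build the call foo(x = b, y = a) via rlang::call2().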

Importance-based variable reduction

I am facing a difficulty with filtering out the least important variables in my model. I received a set of data with more than 4,000 variables, and I have been asked to reduce the number of variables getting into the model.
I have already tried two approaches, and I have failed twice.
The first thing I tried was to manually check variable importance after the modelling and, based on that, remove non-significant variables.
# reproducible example
library(mlr3)
library(dplyr)

data <- iris
# artificial class imbalance
data <- iris %>%
  mutate(Species = as.factor(ifelse(Species == "virginica", "1", "0")))
Everything works fine while using a simple Learner:
# creating the Task
library(mlr3learners)  # provides classif.xgboost
task <- TaskClassif$new(id = "score", backend = data, target = "Species", positive = "1")
# creating the Learner
lrn <- lrn("classif.xgboost")
# setting probability scoring as the prediction type
lrn$predict_type = "prob"
lrn$train(task)
lrn$importance()
 Petal.Width Petal.Length 
  0.90606304   0.09393696 
The issue is that the data is highly imbalanced, so I decided to use a GraphLearner with a PipeOp operator to undersample the majority group, which is then passed to an AutoTuner.
I skipped some parts of the code which I believe are not important for this case: things like the search space, terminator, tuner, etc.
library(mlr3pipelines)  # po(), %>>%, GraphLearner
library(mlr3tuning)     # AutoTuner

# undersampling
po_under <- po("classbalancing",
               id = "undersample", adjust = "major",
               reference = "major", shuffle = FALSE, ratio = 1 / 2)
# combine the learner with the pipeline graph
lrn_under <- GraphLearner$new(po_under %>>% lrn)
# setting up the AutoTuner
at <- AutoTuner$new(
  learner = lrn_under,
  resampling = resample,
  measure = measure,
  search_space = ps_under,
  terminator = terminator,
  tuner = tuner
)
at$train(task)
The problem right now is that, despite the importance property still being visible within at, $importance() is unavailable.
> at
<AutoTuner:undersample.classif.xgboost.tuned>
* Model: list
* Parameters: list()
* Packages: -
* Predict Type: prob
* Feature types: logical, integer, numeric, character, factor, ordered, POSIXct
* Properties: featureless, importance, missings, multiclass, oob_error, selected_features, twoclass, weights
So I decided to change my approach and try to add filtering into the Learner, and that's where I failed even more. I started by looking into the mlr3 book chapter on feature selection - https://mlr3book.mlr-org.com/fs.html. I tried to add importance = "impurity" to the Learner, just like in the book, but it yielded an error.
> lrn <- lrn("classif.xgboost", importance = "impurity")
Błąd w poleceniu 'instance[[nn]] <- dots[[i]]':
nie można zmienić wartości zablokowanego połączenia dla 'importance'
Which basically means something like this:
Error in 'instance[[nn]] <- dots[[i]]': can't change value of blocked connection for 'importance'
I also tried to work around it with PipeOp filtering, but that failed miserably too. I believe I won't be able to do it without importance = "impurity".
So my question is: is there a way to achieve what I am aiming for?
In addition, I would be greatly thankful for an explanation of why filtering by importance is possible before modeling. Shouldn't it be based on the model result?
The reason why you can't access $importance of the at variable is that it is an AutoTuner, which does not directly offer variable importance and only "wraps" around the actual Learner being tuned.
The trained GraphLearner is saved inside your AutoTuner under $learner:
# get the trained GraphLearner, with tuned hyperparameters
graphlearner <- at$learner
This object also does not have $importance(). (Theoretically, a GraphLearner could contain more than one Learner and then it wouldn't even know which importance to give!).
Getting the actual LearnerClassifXgboost object is a bit tedious, unfortunately, because of shortcomings in the "R6" object system used by mlr3:
1. Get the untrained Learner object.
2. Get the trained state of the Learner and put it into that object.
# get the untrained Learner
xgboostlearner <- graphlearner$graph$pipeops$classif.xgboost$learner
# put the trained model into the Learner
xgboostlearner$state <- graphlearner$model$classif.xgboost
Now the importance can be queried
xgboostlearner$importance()
The example from the book that you link to does not work in your case because the book uses the ranger Learner, while you are using xgboost. importance = "impurity" is specific to ranger.
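For comparison, a minimal sketch of the book's approach with the ranger learner (assuming the mlr3learners package and the task from above), where "impurity" is a valid importance mode:
library(mlr3learners)  # provides classif.ranger

lrn_ranger <- lrn("classif.ranger", importance = "impurity")
lrn_ranger$train(task)
lrn_ranger$importance()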

Plot ctree using rpart.plot functionality

I've been trying to use the rpart.plot package to plot a ctree from the partykit library, the reason being that the default plot method is terrible when the tree is deep. In my case, max_depth = 5.
I really like rpart.plot's output, as it allows deep trees to display better visually. Here is how the output looks for a simple example:
rpart
library(partykit)
library(rpart)
library(rpart.plot)
df_test <- cu.summary[complete.cases(cu.summary),]
multi.class.model <- rpart(Reliability~., data = df_test)
rpart.plot(multi.class.model)
I would like to get this output from the partykit model using ctree
ctree
multi.class.model <- ctree(Reliability~., data = df_test)
rpart.plot(multi.class.model)
Error: the object passed to prp is not an rpart object
Is there some way one could coerce the ctree object to rpart so this would run?
To the best of my knowledge all the other packages for visualizing rpart trees are really rpart-specific and not based on the agnostic party class for representing trees/recursive partitions. Also, we haven't tried to implement an as.rpart() method for party objects because the rpart class is really not well-suited for this.
But you can try to tweak the partykit visualizations, which are customizable through panel functions for almost all aspects of the tree. One thing that might be helpful is to compute a simpleparty object, which has all sorts of simple summary information in the $info of each node. This can then be used in the node_terminal() panel function for printing information in the tree display. Consider the following simple example for predicting one of three school types in the German Socio-Economic Panel. To achieve the desired depth I switch significance testing essentially off:
library("partykit")
data("GSOEP9402", package = "AER")
ct <- ctree(school ~ ., data = GSOEP9402, maxdepth = 5, alpha = 0.5)
The default plot(ct) on a sufficiently big device gives you the full tree display.
When turning the tree into a simpleparty you get a textual summary by default:
st <- as.simpleparty(ct)
plot(st)
This still has overlapping labels, so we can set up a small convenience function that extracts the interesting bits from the $info of each node and puts them into a longer character vector with narrower entries:
myfun <- function(i) c(
  as.character(i$prediction),
  paste("n =", i$n),
  format(round(i$distribution / i$n, digits = 3), nsmall = 3)
)
plot(st, tp_args = list(FUN = myfun), ep_args = list(justmin = 20))
In addition to the arguments of the terminal panel function (tp_args) I have tweaked the arguments of the edge panel function (ep_args) to avoid some of the overplotting in the edges.
Of course, you could also change the entire panel function and roll your own...

Using R and Weka: how can I use meta-algorithms along with an n-fold evaluation method?

Here is an example of my problem
library(RWeka)
iris <- read.arff("iris.arff")
Perform n-fold cross-validation to obtain the proper accuracy of the classifier:
m<-J48(class~., data=iris)
e<-evaluate_Weka_classifier(m,numFolds = 5)
summary(e)
The results provided here are obtained by building the model with one part of the dataset and testing it with another part, and therefore give a realistic estimate of precision.
Now I perform AdaBoost to optimize the parameters of the classifier:
m2 <- AdaBoostM1(class ~ ., data = iris,
                 control = Weka_control(W = list(J48, M = 30)))
summary(m2)
The results provided here are obtained by using the same dataset both for building the model and for evaluating it, so the accuracy is not representative of the real-life precision where other instances are evaluated by the model. Nevertheless, this procedure is helpful for optimizing the model that is built.
The main problem is that I cannot optimize the model and, at the same time, test it with data that was not used to build it, or just use an n-fold validation method to obtain the proper accuracy.
I guess you misinterpret the function of evaluate_Weka_classifier(). In both cases, evaluate_Weka_classifier() only does cross-validation based on the training data; it doesn't change the model itself. Compare the confusion matrices of the following code:
m <- J48(Species ~ ., data = iris)
e <- evaluate_Weka_classifier(m, numFolds = 5)
summary(m)
e

m2 <- AdaBoostM1(Species ~ ., data = iris,
                 control = Weka_control(W = list(J48, M = 30)))
e2 <- evaluate_Weka_classifier(m2, numFolds = 5)
summary(m2)
e2
In both cases, summary() gives you the evaluation based on the training data, while evaluate_Weka_classifier() gives you the correct cross-validation. For neither J48 nor AdaBoostM1 does the model itself get updated based on the cross-validation.
Now regarding the AdaBoost algorithm itself: it does in fact use some kind of "weighted cross-validation" to come to the final classifier. Wrongly classified items are given more weight in the next building step, but the evaluation is done using equal weight for all observations. So using cross-validation to optimize the result doesn't really fit into the general idea behind the adaptive boosting algorithm.
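To make that reweighting concrete, here is a toy sketch of the AdaBoost.M1 weight update in plain R (not RWeka code; y and pred are stand-ins for the true and predicted labels of one boosting round):
set.seed(1)
y    <- factor(sample(c("a", "b"), 10, replace = TRUE))  # true labels
pred <- factor(sample(c("a", "b"), 10, replace = TRUE))  # one round's predictions
w    <- rep(1 / 10, 10)                                  # initial equal weights

err   <- sum(w * (pred != y))           # weighted training error
alpha <- log((1 - err) / err)           # this classifier's voting weight
w     <- w * exp(alpha * (pred != y))   # upweight the misclassified items
w     <- w / sum(w)                     # renormalize for the next round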
If you want a true cross-validation using a training set and an evaluation set, you could do the following:
id <- sample(1:length(iris$Species), length(iris$Species) * 0.5)
m3 <- AdaBoostM1(Species ~ ., data = iris[id, ],
                 control = Weka_control(W = list(J48, M = 5)))
e3 <- evaluate_Weka_classifier(m3, numFolds = 5)
# true cross-validation
e4 <- evaluate_Weka_classifier(m3, newdata = iris[-id, ])
summary(m3)
e3
e4
If you want a model that gets updated based on cross-validation, you'll have to go to a different algorithm, e.g. randomForest() from the randomForest package. That grows an ensemble of trees, each assessed on the observations left out of its bootstrap sample (out-of-bag evaluation) rather than an explicit cross-validation. It can be used in combination with the RWeka package as well.
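A minimal sketch of the out-of-bag evaluation mentioned above (assuming the randomForest package):
library(randomForest)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
rf$err.rate[500, "OOB"]  # out-of-bag error estimate after all 500 trees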
Edit: corrected the code for a true cross-validation. Using the subset argument has an effect in evaluate_Weka_classifier() as well.
