I have used the RFE (Recursive Feature Elimination) method to find the best model out of 39 variables. I used the following code:
set.seed(10)
ctrl <- rfeControl(functions = caretFuncs, method = "repeatedcv",repeats = 5,number= 5,allowParallel = TRUE)
RF_39 <- rfe(X, Y,sizes = c(1:39),method ='rf',rfeControl = ctrl,tuneGrid = data.frame(mtry=6))
The best model is made using 36 variables.
If I want to see those 36 variables I can use
predictors(RF_39) or RF_39$optVariables.
However, I need to see the variables used for making the other models rather than the best model.
For example, what are the variables used for making model number 12?
How can I see the variables of a specific model made by RFE method?
Thanks for your help.
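One way to approach this, sketched under assumptions: with resampling, rfe() selects variables separately in every resample, so there is usually not one single "model number 12" but one variable set per resample at that size. The sketch assumes that the fitted object exposes a data frame RF_39$variables with the columns var, Variables (the subset size) and Resample. The toy data frame vars_df and the helper vars_for_size() are hypothetical names used only for illustration.

```r
# Toy stand-in for RF_39$variables (assumed layout: one row per variable,
# subset size and resample). Replace it with the real data frame.
vars_df <- data.frame(
  Overall   = c(9, 7, 8, 6, 9, 5),
  var       = c("x1", "x2", "x1", "x3", "x1", "x2"),
  Variables = c(2, 2, 2, 2, 1, 1),
  Resample  = c("Fold1.Rep1", "Fold1.Rep1", "Fold2.Rep1",
                "Fold2.Rep1", "Fold1.Rep1", "Fold2.Rep1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count how often each variable was kept
# across resamples at a given subset size.
vars_for_size <- function(vars_df, size) {
  at_size <- vars_df[vars_df$Variables == size, ]
  sort(table(at_size$var), decreasing = TRUE)
}

sel <- vars_for_size(vars_df, 2)
print(sel)
```

With the real object the call would be vars_for_size(RF_39$variables, 12). Variables that appear in most resamples at that size are the most stable candidates.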
I would like to divide my working pipeline into two parts:
One place (internal) where I benchmark and auto-tune the algorithms to select the final model.
Another place (external) where I apply the selected models to new datasets.
For the second part, I need to save the resulting model object so that I can later call
model$predict_newdata(), and transport it without having to re-train the algorithm or carry the original training data along.
The idea is illustrated by the following example:
library("mlr3")
task = tsk("iris")
learner = lrn("classif.rpart")
learner$train(task, row_ids = 1:120)
predictions = learner$predict(task, row_ids = 121:150)
predictions
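So far so good, but now I have to save this model outside the R session. Of course, this won't work (save() needs the file name passed as file =, and learner$model is only the fitted rpart object, not the mlr3 learner):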
store_model = learner$model
save(store_model, 'model_rpart.RData')
The solution is to save the whole learner object as an .rds file, as Brian suggested.
saveRDS(learner, 'learner_rpart.rds')
model <- readRDS('learner_rpart.rds')
predictions = model$predict(task, row_ids = 121:150)
predictions$confusion
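To check that a reloaded object predicts exactly like the original, a round trip through a temporary file can be compared directly. This is a minimal sketch that uses a base R lm() fit as a stand-in for the learner, so it runs without mlr3. The variable names are illustrative only.

```r
# Stand-in model: a linear model trained on the first 120 rows of iris.
fit <- lm(Sepal.Length ~ Petal.Length, data = iris[1:120, ])
before <- predict(fit, newdata = iris[121:150, ])

# Serialize to disk and read back, as with saveRDS(learner, ...).
path <- tempfile(fileext = ".rds")
saveRDS(fit, path)
restored <- readRDS(path)

# Predictions from the restored object should match the original ones.
after <- predict(restored, newdata = iris[121:150, ])
print(all.equal(before, after))
```

The same comparison can be made with the mlr3 learner by comparing the predictions before saving with those of the reloaded object.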
In package partykit there is a way to build custom trees by specifying predictor and split. For example:
data("WeatherPlay", package = "partykit")
#create a split
sp_o <- partysplit(3L, breaks = 75)
#create a node
n1 <- partynode(id = 1L, split = sp_o, kids = lapply(2L:3L, partynode))
#and make a "tree" out of it
t2 <- party(
n1,
data = WeatherPlay,
fitted = data.frame(
"(fitted)" = fitted_node(n1, data = WeatherPlay),
"(response)" = WeatherPlay$play,
check.names = FALSE
),
terms = terms(play ~ ., data = WeatherPlay),
)
t2 <- as.constparty(t2)
t2
plot(t2)
Is this possible for model trees (as returned by mob())? Can I build the tree node by node, and then fit a specified function in the terminal nodes?
In principle, it is possible to build a "modelparty" object (as returned by mob()) by hand. However, there is not a simple coercion function like as.constparty() in the example you cite. The reason is that for building the model trees, and for printing and especially predicting with them, much more detailed information about the model is needed.
I would recommend building the tree structure ("partynode") first, then filling the overall $info slot (with call, formula, Formula, terms, fit, control, dots, and nreg). It should then be possible to call refit.modelparty() to refit the models in all the terminal nodes. The result can in turn be used to fill $node$info (with coefficients, objfun, nobs, ...).
Disclaimer: All of this is completely untested. But instead of mimicking "modelparty" you could, of course, also create your own way of storing the models in the "party" object and just use more basic building blocks provided by partykit.
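The last suggestion, storing the models in your own way using basic building blocks, could look roughly like the sketch below. It is equally untested against the "modelparty" methods and does not create a "modelparty" object: it only fits one model per terminal node and keeps the fits in the info slot of a plain "party". The per-node formula temperature ~ windy and the element name models are arbitrary choices for illustration.

```r
library(partykit)
data("WeatherPlay", package = "partykit")

# Same hand-built structure as in the question: split on column 3 at 75.
sp_h <- partysplit(3L, breaks = 75)
n1 <- partynode(id = 1L, split = sp_h, kids = lapply(2L:3L, partynode))

# Terminal node id for every observation.
node_id <- fitted_node(n1, data = WeatherPlay)

# Fit one model per terminal node (illustrative formula).
node_models <- lapply(split(WeatherPlay, node_id),
                      function(d) lm(temperature ~ windy, data = d))

# Keep the fits in the info slot of an ordinary "party" object.
py <- party(n1, data = WeatherPlay,
            fitted = data.frame("(fitted)" = node_id, check.names = FALSE),
            info = list(models = node_models))

# Coefficients per terminal node, named by node id.
print(lapply(py$info$models, coef))
```

Predicting for new data would then mean finding the node with fitted_node() and calling predict() on the corresponding stored model.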
The xgboost package allows building a random forest (in fact, it chooses a random subset of columns once for the whole tree rather than for each node, as the classical version of the algorithm does, but that can be tolerated). But it seems that for regression only one tree from the forest (maybe the last one built) is used.
To check this, consider a standard toy example.
library(xgboost)
library(randomForest)
data(agaricus.train, package = 'xgboost')
dtrain = xgb.DMatrix(agaricus.train$data,
label = agaricus.train$label)
bst = xgb.train(data = dtrain,
nround = 1,
subsample = 0.8,
colsample_bytree = 0.5,
num_parallel_tree = 100,
verbose = 2,
max_depth = 12)
answer1 = predict(bst, dtrain);
(answer1 - agaricus.train$label) %*% (answer1 - agaricus.train$label)
forest = randomForest(x = as.matrix(agaricus.train$data), y = agaricus.train$label, ntree = 50)
answer2 = predict(forest, as.matrix(agaricus.train$data))
(answer2 - agaricus.train$label) %*% (answer2 - agaricus.train$label)
Yes, of course, by default the xgboost random forest uses the MSE rather than the Gini score as its split criterion; this can be changed easily. It is also not correct to validate on the training data like this, and so on. None of that affects the main problem: regardless of which parameter sets are tried, the results are surprisingly bad compared with the randomForest implementation. This holds for other data sets as well.
Could anybody provide a hint about such strange behaviour? For classification tasks the algorithm does work as expected.
#
Well, all trees are grown and all are used to make a prediction. You may check that using the 'ntreelimit' parameter of the 'predict' function.
The main problem remains: is the specific form of the Random Forest algorithm produced by the xgboost package valid?
Cross-validation, parameter tuning and other crap have nothing to do with that -- anyone may add the necessary corrections to the code and see what happens.
You may specify the 'objective' option like this:
mse = function(predict, dtrain)
{
real = getinfo(dtrain, 'label')
return(list(grad = 2 * (predict - real),
hess = rep(2, length(real))))
}
This ensures that the MSE is used when choosing a variable for the split. Even after that, the results are surprisingly bad compared to those of randomForest.
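The gradient and Hessian returned by the custom objective can be sanity-checked numerically. The sketch below assumes the per-observation loss is the squared error (pred - real)^2 and compares the analytic derivatives 2 * (pred - real) and 2 with central finite differences. The names sq_loss, pred_toy and real_toy are illustrative and not part of xgboost.

```r
# Per-observation squared-error loss and its analytic derivatives.
sq_loss <- function(pred, real) (pred - real)^2
analytic_grad <- function(pred, real) 2 * (pred - real)
analytic_hess <- function(pred, real) rep(2, length(real))

pred_toy <- c(0.2, 0.9, -1.5)
real_toy <- c(0, 1, 1)
h <- 1e-3

# Central finite differences for the first and second derivative.
num_grad <- (sq_loss(pred_toy + h, real_toy) -
             sq_loss(pred_toy - h, real_toy)) / (2 * h)
num_hess <- (sq_loss(pred_toy + h, real_toy) -
             2 * sq_loss(pred_toy, real_toy) +
             sq_loss(pred_toy - h, real_toy)) / h^2

print(all.equal(num_grad, analytic_grad(pred_toy, real_toy), tolerance = 1e-6))
print(all.equal(num_hess, analytic_hess(pred_toy, real_toy), tolerance = 1e-6))
```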
Maybe the problem is of an academic nature and concerns how the random subset of features for a split is chosen. The classical implementation chooses a subset of features (the size is specified with 'mtry' in the randomForest package) for EVERY split separately, whereas the xgboost implementation chooses one subset per tree (specified with 'colsample_bytree').
So this fine difference appears to be of great importance, at least for some types of datasets. It is interesting, indeed.
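The difference between the two sampling schemes can be illustrated without either package. The following is a minimal sketch in base R, not a description of either implementation's internals: it draws m of p feature indices once per tree, and then freshly at each of n_splits splits, and counts how many distinct features a single tree gets to consider. All names and numbers are made up for illustration.

```r
set.seed(1)
p <- 10         # total number of features
m <- 3          # features offered per draw (like mtry, or colsample_bytree * p)
n_splits <- 20  # number of splits in one tree

# Per-tree sampling: one draw, reused at every split of the tree.
per_tree <- sample(p, m)

# Per-split sampling: a fresh draw at every split (m x n_splits matrix).
per_split <- replicate(n_splits, sample(p, m))

n_seen_per_tree  <- length(unique(per_tree))
n_seen_per_split <- length(unique(as.vector(per_split)))
print(c(per_tree = n_seen_per_tree, per_split = n_seen_per_split))
```

With per-tree sampling a tree never sees more than m features, so an interaction between two features can only be learned by trees that happened to draw both. Newer xgboost releases also document a colsample_bynode parameter that samples columns at every split and may be closer to mtry; whether it is available depends on the installed version.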
xgboost (random forest style) does use more than one tree to predict. But there are many other differences to explore.
I am new to xgboost myself, but curious. So I wrote the code below to visualize the trees. You can run the code yourself to verify or explore other differences.
Your data set of choice is a classification problem, as the labels are either 0 or 1. I prefer to switch to a simple regression problem to visualize what xgboost does.
true model: $y = x_1 \cdot x_2 + \text{noise}$
Whether you train a single tree or multiple trees, the code examples below show that the learned model structure does contain more than one tree. You cannot infer from prediction accuracy alone how many trees were trained.
Maybe the predictions differ because the implementations differ. None of the ~5 RF implementations I know of are exactly alike, and this xgboost (RF style) is at best a distant "cousin".
Note that colsample_bytree is not equivalent to mtry, as the former uses the same subset of variables/columns for the entire tree. My regression problem is one big interaction only, which cannot be learned if each tree uses only x1 or only x2. Thus in this case colsample_bytree must be set to 1 so that both variables are used in all trees. A regular RF could model this problem with mtry=1, as each node would use either x1 or x2.
Your randomForest predictions are not out-of-bag cross-validated. Before drawing any conclusions from predictions you must cross-validate, especially for fully grown trees.
NB: You need to edit the function vec.plot, as it does not support xgboost out of the box: xgboost does not accept a data.frame as valid input. The instructions in the code should be clear.
library(xgboost)
library(rgl)
library(forestFloor)
Data = data.frame(replicate(2,rnorm(5000)))
Data$y = Data$X1*Data$X2 + rnorm(5000)*.5
gradientByTarget =fcol(Data,3)
plot3d(Data,col=gradientByTarget) #true data structure
fix(vec.plot) #change these two lines in the function, as xgboost does not support data.frame
#16# yhat.vec = predict(model, as.matrix(Xtest.vec))
#21# yhat.obs = predict(model, as.matrix(Xtest.obs))
#1 single deep tree
xgb.model = xgboost(data = as.matrix(Data[,1:2]),label=Data$y,
nrounds=1,params = list(max.depth=250))
vec.plot(xgb.model,as.matrix(Data[,1:2]),1:2,col=gradientByTarget,grid=200)
plot(Data$y,predict(xgb.model,as.matrix(Data[,1:2])),col=gradientByTarget)
#clearly just one tree
#100 trees (gbm boosting)
xgb.model = xgboost(data = as.matrix(Data[,1:2]),label=Data$y,
nrounds=100,params = list(max.depth=16,eta=.5,subsample=.6))
vec.plot(xgb.model,as.matrix(Data[,1:2]),1:2,col=gradientByTarget)
plot(Data$y,predict(xgb.model,as.matrix(Data[,1:2])),col=gradientByTarget) ##predictions are not OOB cross-validated!
#20 deep trees (bagging), columns sampled once per tree
xgb.model = xgboost(data = as.matrix(Data[,1:2]),label=Data$y,
nrounds=1,params = list(max.depth=250,
num_parallel_tree=20,colsample_bytree = .5, subsample = .5))
vec.plot(xgb.model,as.matrix(Data[,1:2]),1:2,col=gradientByTarget) #bagged mix of trees
plot(Data$y,predict(xgb.model,as.matrix(Data[,1:2]))) #terrible fit!!
#problem, colsample_bytree is NOT mtry as columns are only sampled once
# (this could be raised as an issue on their github page, that this does not mimic RF)
#200 deep trees (bagging), no column limitation
xgb.model = xgboost(data = as.matrix(Data[,1:2]),label=Data$y,
nrounds=1,params = list(max.depth=500,
num_parallel_tree=200,colsample_bytree = 1, subsample = .5))
vec.plot(xgb.model,as.matrix(Data[,1:2]),1:2,col=gradientByTarget) #bagged mix of trees
plot(Data$y,predict(xgb.model,as.matrix(Data[,1:2])))
#voila model can fit data
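The earlier point about predictions that are not out-of-bag cross-validated can be demonstrated in base R. This sketch swaps in a 1-nearest-neighbour predictor as a stand-in for fully grown trees, because both memorize the training data; it is not a random forest. It uses the same kind of simulated interaction data as above and compares the training error with a 5-fold cross-validated error. The helper predict_1nn() is a hypothetical name.

```r
set.seed(42)
n <- 300
X <- matrix(rnorm(2 * n), ncol = 2)
y <- X[, 1] * X[, 2] + rnorm(n) * 0.5

# 1-nearest-neighbour prediction: return the response of the closest
# training point (squared Euclidean distance).
predict_1nn <- function(X_train, y_train, X_new) {
  apply(X_new, 1, function(x) {
    d2 <- rowSums(sweep(X_train, 2, x)^2)
    y_train[which.min(d2)]
  })
}

# Error on the training data itself: every point is its own neighbour.
train_mse <- mean((predict_1nn(X, y, X) - y)^2)

# 5-fold cross-validation: each point is predicted by a model
# that never saw it.
folds <- sample(rep(1:5, length.out = n))
cv_pred <- numeric(n)
for (k in 1:5) {
  test <- folds == k
  cv_pred[test] <- predict_1nn(X[!test, , drop = FALSE], y[!test],
                               X[test, , drop = FALSE])
}
cv_mse <- mean((cv_pred - y)^2)

print(c(train_mse = train_mse, cv_mse = cv_mse))
```

The training error looks perfect while the cross-validated error does not, which is why training-set plots of deep trees say little about how well the model generalizes.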
Here is an example of my problem
library(RWeka)
iris <- read.arff("iris.arff")
Perform n-fold cross-validation to obtain a proper accuracy estimate for the classifier.
m<-J48(class~., data=iris)
e<-evaluate_Weka_classifier(m,numFolds = 5)
summary(e)
The results provided here are obtained by building the model with one part of the dataset and testing it with another part, and therefore give a realistic accuracy estimate.
Now I perform AdaBoost to optimize the parameters of the classifier.
m2 <- AdaBoostM1(class ~. , data = iris ,control = Weka_control(W = list(J48, M = 30)))
summary(m2)
The results provided here are obtained by using the same dataset both for building the model and for evaluating it, so the accuracy is not representative of real-life performance, where the model evaluates instances it has not seen. Nevertheless, this procedure is helpful for optimizing the model that is built.
The main problem is that I cannot optimize the model and, at the same time, test it with data that was not used to build it, or just use an n-fold validation method to obtain a proper accuracy estimate.
I think you misinterpret the function evaluate_Weka_classifier. In both cases, evaluate_Weka_classifier only performs the cross-validation based on the training data. It doesn't change the model itself. Compare the confusion matrices produced by the following code:
m<-J48(Species~., data=iris)
e<-evaluate_Weka_classifier(m,numFolds = 5)
summary(m)
e
m2 <- AdaBoostM1(Species ~. , data = iris ,
control = Weka_control(W = list(J48, M = 30)))
e2 <- evaluate_Weka_classifier(m2,numFolds = 5)
summary(m2)
e2
In both cases, summary() gives you the evaluation based on the training data, while evaluate_Weka_classifier() gives you the correct cross-validation. Neither for J48 nor for AdaBoostM1 does the model itself get updated based on the cross-validation.
Now regarding the AdaBoost algorithm itself: in fact, it does use some kind of "weighted cross-validation" to arrive at the final classifier. Wrongly classified items are given more weight in the next building step, but the evaluation is done using equal weight for all observations. So using cross-validation to optimize the result doesn't really fit the general idea behind the adaptive boosting algorithm.
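The reweighting step can be made concrete with a small sketch. It follows the textbook AdaBoost.M1 update (weights of correctly classified items are multiplied by err / (1 - err) and then renormalized), which is assumed here to correspond to what the Weka implementation does. The function name adaboost_m1_reweight() is hypothetical.

```r
# One AdaBoost.M1 reweighting step.
# w:    current observation weights
# miss: logical vector, TRUE where the current classifier was wrong
adaboost_m1_reweight <- function(w, miss) {
  err <- sum(w[miss]) / sum(w)       # weighted error of the current classifier
  beta <- err / (1 - err)            # shrink factor for correct items
  w_new <- w * ifelse(miss, 1, beta) # only correctly classified items shrink
  w_new / sum(w_new)                 # renormalize to sum to 1
}

w <- rep(1 / 10, 10)                    # start with equal weights
miss <- c(TRUE, TRUE, rep(FALSE, 8))    # two of ten items misclassified
w_new <- adaboost_m1_reweight(w, miss)
print(round(w_new, 4))
```

After the update the misclassified items carry half of the total weight, so the next base classifier is pushed towards getting them right.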
If you want a true cross-validation using a training set and an evaluation set, you could do the following:
id <- sample(1:length(iris$Species),length(iris$Species)*0.5)
m3 <- AdaBoostM1(Species ~. , data = iris[id,] ,
control = Weka_control(W = list(J48, M=5)))
e3 <- evaluate_Weka_classifier(m3,numFolds = 5)
# true crossvalidation
e4 <- evaluate_Weka_classifier(m3,newdata=iris[-id,])
summary(m3)
e3
e4
If you want a model whose construction already involves resampling, you'll have to go to a different algorithm, e.g. randomForest() from the randomForest package. It builds its ensemble of trees on bootstrap samples and evaluates each tree on the out-of-bag observations, a procedure similar in spirit to cross-validation. It can be used in combination with the RWeka package as well.
edit: corrected the code for a true cross-validation. Using the subset argument has an effect in evaluate_Weka_classifier() as well.