feature importance in mlr3 with iml for classification forests - r

I calculate feature importance for 2 different types of machine learning models (SVM and Classification Forest). I cannot post the data here, but I describe what I do:
My (classification) task has about 400 observations of 70 variables. Some of them are highly, but not perfectly, correlated.
I fit the models with
learner_1$train(task)
learner_2$train(task)
where learner_1 is an SVM and learner_2 is a classification forest.
Now, I want to calculate feature importance with iml, so for each of the learners I use the following code (here the code for learner_1)
model_analyzed <- Predictor$new(
  learner_1,
  data = dplyr::select(task$data(), task$feature_names),
  y    = dplyr::select(task$data(), task$target_names)
)
used_features <- task$feature_names
effect <- FeatureImp$new(model_analyzed, loss = "ce", n.repetitions = 10, compare = "ratio")
print(effect$plot(features = used_features))
My results are the following
a) For the SVM
b) For the classification forest
I do not understand the second picture:
a) Should the "anchor" point not be around 1, as I observe for the SVM? If the ce is not made worse by shuffling any feature, then the graph should show a 1 and not a 0?
b) If all features show a value very close to zero, as I see in the second graph, does that mean the classification error is zero when the feature is shuffled? So for each single feature, I would get a perfect model if just that one feature were omitted or shuffled?
I am really confused here; can someone help me understand what is happening?
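For reference (as I understand it, this is what FeatureImp computes), compare = "ratio" reports, for each feature, the loss after permuting that feature divided by the original loss. A rough sketch of that computation using the objects above, with manual_importance_ratio as a hypothetical helper that is not part of iml:
manual_importance_ratio <- function(learner, task, feature) {
  dat   <- as.data.frame(task$data())
  truth <- dat[[task$target_names]]
  ce    <- function(d) mean(learner$predict_newdata(d[task$feature_names])$response != truth)
  shuffled <- dat
  shuffled[[feature]] <- sample(shuffled[[feature]])   # permute a single feature
  ce(shuffled) / ce(dat)                               # the "ratio" comparison: near 1 for an uninformative feature
}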

Related

How to get classification probabilities of each tree in the random forest using R

I want to get the classification probabilities for each class from each tree in the randomForest.
(1) This outputs the individual predictions, but their type is response, not probabilities:
predict(rf_cl, newdata, predict.all=TRUE)$individual
(2) This outputs probabilities, but they belong to the whole forest, not to the individual trees:
predict(rf_cl, newdata, type="prob")
(3) When I tried this, I got the same output as the first one.
predict(rf_cl, newdata, predict.all=TRUE, type="prob")$individual
I have been searching the net for a long time, but to no avail. Please help or give some ideas on how to achieve this. Thanks in advance.
The member decision trees of a randomForest decision tree ensemble make "pure" predictions. That is, the probability of the winning category is 1.0, and the probabilities of all other categories are 0.0.
The random forest computes the aggregate probability using the voting mechanism - the number of "pure" predictions (aka votes) for each class, divided by the total number of member decision trees. Knowing this will help you choose the number of decision trees in order to achieve the desired "precision" of aggregate probabilities, and avoid any ties. For example, when modeling a binary target, then you should choose an odd number of member decision trees to avoid a 0.5 vs. 0.5 tie.
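A small sketch (reusing rf_cl and newdata from the question) of how each tree's "pure" vote can be expanded into a 0/1 probability matrix; averaging these matrices across trees reproduces the forest's aggregate vote shares, i.e. what predict(rf_cl, newdata, type = "prob") returns:
library(randomForest)
# n x ntree matrix of class labels, one column per tree
votes   <- predict(rf_cl, newdata, predict.all = TRUE)$individual
classes <- rf_cl$classes
# per-tree "probabilities": 1 for the voted class, 0 for everything else
tree_probs <- lapply(seq_len(ncol(votes)), function(i) {
  m <- outer(votes[, i], classes, FUN = "==") * 1
  colnames(m) <- classes
  m
})
# averaging the per-tree votes gives the forest's aggregate probabilities
agg_probs <- Reduce(`+`, tree_probs) / length(tree_probs)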

How to calculate Bias and Variance for SVM and Random Forest Model

I'm working on a classification problem (predicting three classes) and I'm comparing SVM against Random Forest in R.
For evaluation and comparison I want to calculate the bias and variance of the models. I've looked up the two terms in many machine learning books, and I'd say I understand the idea of variance and bias (the easiest explanation being the bullseye diagram). But I can't really figure out how to apply it in my case.
Let's say I predict the results for a test set with 4 SVM models that were trained on 4 different training sets. Each time I get a total error (meaning all wrong predictions divided by all predictions).
Do I then get the bias for the SVM by taking the mean of these four errors, i.e. is the bias more or less the mean of the errors?
I hope you can help me with a formula that is not too complicated, because I've already seen many of them.
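As an illustration of the procedure described above (a rough sketch of the asker's proposal, not a statement about the proper bias-variance decomposition), assuming e1071 for the SVM, a list training_sets of 4 training data frames, a held-out data frame test_set, and a placeholder factor target column named class:
library(e1071)
errors <- sapply(training_sets, function(train) {
  fit  <- svm(class ~ ., data = train)          # target assumed to be a factor, so svm() does classification
  pred <- predict(fit, newdata = test_set)
  mean(pred != test_set$class)                  # total error = wrong predictions / all predictions
})
mean(errors)   # the quantity the question calls "bias" (mean of the errors)
var(errors)    # spread of the errors across the 4 training sets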

Inconsistency between confusion matrix and classified image

Due to computational limitations with my GIS software, I am trying to implement random forests in R for image classification purposes. My input is a multi-band TIFF image, and the model is trained on points from an ArcGIS shapefile (target values 0 and 1). The code technically works and produces a valid output. When I view the confusion matrix I get the following:
   0  1 class.error
0 11  3 0.214285714
1  1 13 0.071428571
This is sensible for my data. However, when I plot the output of the image classification in my GIS software (the binary reclassified TIFF with values 0 and 1), it predicts the training data with a 100% success rate. In other words, there is no classification error in the output image. How can this be the case when the confusion matrix indicates there are classification errors?
Am I missing something really obvious here? Code snippet below.
rf.mdl <- randomForest(x = samples@data[, names(PredMaps)], y = samples@data[, ValueFld],
                       ntree = 501, proximity = TRUE, importance = TRUE,
                       keep.forest = TRUE, keep.inbag = TRUE)
ConfMat <- rf.mdl$confusion
write.csv(ConfMat, file = "ConfMat1.csv")
predict(PredMaps, rf.mdl, filename = classifiedPath, type = "response",
        na.rm = TRUE, overwrite = TRUE, progress = "text")
I expected the output classified image to misclassify 1 of the Value=1 training points and misclassify 3 of the Value=0 training points based on what is indicated in the confusion matrix.
The Random Forest algorithm is a bagging method. This means it creates numerous weak classifiers, then has each weak classifier "vote" to create the end prediction. In RF, each weak classifier is one decision tree that is trained on a random sample of observations in the training set. Think of the random samples each decision tree is trained on as a "bag" of data.
What is being shown in the confusion matrix is something called "out-of-bag error" (OOB error). This OOB error is an accurate estimate of how your model would generalize to data it has never seen before (this estimate is usually achieved by testing your model on a withheld testing set). Since each decision tree is trained on only one bag from your training data, the rest of the data (data that's "outside the bag") can stand in for this withheld data.
OOB error is calculated by making a prediction for each observation in the training set. However, when predicting each individual observation, only decision trees whose bags did not include that observation are allowed to participate in the voting process. The result is the confusion matrix available after training a RF model.
When you predict the observations in the training set using the complete model, decision trees whose bags did include each observation are now involved in the voting process. Since these decision trees "remember" the observation they were trained on, they skew the prediction toward the correct answer. This is why you achieve 100% accuracy.
Essentially, you should trust the confusion matrix that uses OOB error. It's a robust estimate of how the model will generalize to unseen data.
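To see the difference concretely, here is a small sketch (reusing the objects from the question) that contrasts the OOB predictions behind rf.mdl$confusion with "resubstitution" predictions, where every tree, including those trained on a given observation, gets to vote:
train_x <- samples@data[, names(PredMaps)]
train_y <- samples@data[, ValueFld]
oob_pred   <- predict(rf.mdl)                     # no newdata: out-of-bag votes only
resub_pred <- predict(rf.mdl, newdata = train_x)  # newdata supplied: every tree votes
table(OOB = oob_pred, truth = train_y)                # matches rf.mdl$confusion
table(resubstitution = resub_pred, truth = train_y)   # typically (near) perfect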

Multivariate Analysis on random forest results

Apologies in advance for no data samples:
I built a random forest of 128 trees with no tuning, with 1 binary outcome and 4 continuous explanatory variables. I then compared the AUC of this forest against a forest that was already built and is predicting on cases. What I want to figure out is how to determine what exactly is lending predictive power to this new forest. Univariate analysis with the outcome variable led to no significant findings. Any technique recommendations would be greatly appreciated.
EDIT: To summarize, I want to perform multivariate analysis on these 4 explanatory variables to identify what interactions are taking place that may explain the forest's predictive power.
Random Forest is what's known as a "black box" learning algorithm, because there is no good way to interpret the relationship between input and outcome variables. You can however use something like the variable importance plot or partial dependence plot to give you a sense of what variables are contributing the most in making predictions.
There are several good discussions of variable importance plots on this site. Variable importance is implemented in the randomForest package as varImpPlot() and in the caret package as varImp(). The interpretation of the plot depends on the metric you use to assess variable importance. For example, with MeanDecreaseAccuracy, a high value for a variable means that, on average, permuting that variable reduces the model's accuracy considerably, i.e. the model relies heavily on it for its predictions.
There are also discussions of partial dependence plots for predictive models. Partial dependence is implemented in the randomForest package as partialPlot().
In practice, 4 explanatory variables is not many, so you can easily run a binary logistic regression (possibly with L2 regularization) for a more interpretable model and compare its performance against the random forest. See also the discussions about variable selection; the regularized regression is implemented in the glmnet package. L2 regularization, also known as ridge, is a penalty term added to your loss function that shrinks your coefficients to reduce variance, at the expense of increased bias. This effectively reduces prediction error if the amount of reduced variance more than compensates for the bias (which is often the case). Since you only have 4 input variables, I suggest L2 instead of L1 (also known as lasso, which additionally performs automatic feature selection). See this answer for ridge and lasso shrinkage parameter tuning using cv.glmnet: How to estimate shrinkage parameter in Lasso or ridge regression with >50K variables?
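A hedged sketch of the tools mentioned above, assuming a data frame train with a binary outcome y and predictors x1 through x4 (all names are placeholders):
library(randomForest)
library(glmnet)
rf_fit <- randomForest(as.factor(y) ~ ., data = train, ntree = 128, importance = TRUE)
varImpPlot(rf_fit)                                    # variable importance plot
partialPlot(rf_fit, pred.data = train, x.var = "x1")  # partial dependence for one predictor
# ridge (alpha = 0) logistic regression, with the penalty chosen by cross-validation
x <- as.matrix(train[, c("x1", "x2", "x3", "x4")])
ridge_fit <- cv.glmnet(x, as.factor(train$y), family = "binomial", alpha = 0)
coef(ridge_fit, s = "lambda.min")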

Weight response with sampsize for unbalanced data in randomForest

I am new to machine learning and R.
I have tried to fit several models in R, including trees, boosted trees, random forests, AdaBoost, SVM, and logistic regression.
In my case, the probability that the rare event (class 1) occurs in the training data is 0.0075.
When training the trees and boosted trees, I added a weight parameter to the model, i.e. weighting class 0 with 1 and class 1 with sqrt(1/0.0075). Is that a correct way to do this?
I have an issue with random forest. I have read about using sampsize in order to deal with unbalanced data like this.
However, I am not quite sure how to give proper weight to each class.
I have seen the suggestion to reduce the imbalance ratio. How do I choose a proper value?
Also, I have no idea how to include weights in ada boosting and logistic regression.
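As an illustration of the sampsize idea (a rough sketch, not a definitive recipe), assume a training data frame train with target column y (0/1) and a vector feature_cols of predictor names; stratified sampling then lets each tree draw the same number of rows from both classes:
library(randomForest)
n_rare <- sum(train$y == 1)                            # number of rare-class rows
rf_bal <- randomForest(x = train[, feature_cols],
                       y = as.factor(train$y),
                       strata   = as.factor(train$y),  # stratify each tree's bootstrap sample by class
                       sampsize = c(n_rare, n_rare),   # rows drawn per stratum, in order of the factor levels (0, 1)
                       ntree    = 500)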
