Due to computational limitations with my GIS software, I am trying to implement random forests in R for image classification. My input is a multi-band TIFF image; the model is trained on sample points from an ArcGIS shapefile (target values 0 and 1). The code technically works and produces a valid output. When I view the confusion matrix I get the following:
   0  1 class.error
0 11  3 0.214285714
1  1 13 0.071428571
This is sensible for my data. However, when I plot the output of the image classification (the binary reclassified TIFF with values 0 and 1) in my GIS software, it predicts the training data with a 100% success rate: there is no classification error in the output image. How can this be when the confusion matrix indicates there are classification errors?
Am I missing something really obvious here? Code snippet below.
rf.mdl <- randomForest(x = samples@data[, names(PredMaps)],
                       y = samples@data[, ValueFld],
                       ntree = 501, proximity = TRUE, importance = TRUE,
                       keep.forest = TRUE, keep.inbag = TRUE)
ConfMat = rf.mdl$confusion
write.csv(ConfMat,file = "ConfMat1.csv")
predict(PredMaps, rf.mdl, filename=classifiedPath, type="response", na.rm=T, overwrite=T, progress="text")
I expected the output classified image to misclassify 1 of the Value=1 training points and misclassify 3 of the Value=0 training points based on what is indicated in the confusion matrix.
The Random Forest algorithm is a bagging method. This means it creates numerous weak classifiers, then has each weak classifier "vote" to create the end prediction. In RF, each weak classifier is one decision tree that is trained on a random sample of observations in the training set. Think of the random samples each decision tree is trained on as a "bag" of data.
What is being shown in the confusion matrix is something called "out-of-bag error" (OOB error). This OOB error is an accurate estimate of how your model would generalize to data it has never seen before (this estimate is usually achieved by testing your model on a withheld testing set). Since each decision tree is trained on only one bag from your training data, the rest of the data (data that's "outside the bag") can stand in for this withheld data.
OOB error is calculated by making a prediction for each observation in the training set. However, when predicting each individual observation, only decision trees whose bags did not include that observation are allowed to participate in the voting process. The result is the confusion matrix available after training a RF model.
When you predict the observations in the training set using the complete model, decision trees whose bags did include each observation are now involved in the voting process. Since these decision trees "remember" the observation they were trained on, they skew the prediction toward the correct answer. This is why you achieve 100% accuracy.
Essentially, you should trust the confusion matrix that uses OOB error. It's a robust estimate of how the model will generalize to unseen data.
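For intuition, here is a minimal sketch (using a two-class subset of the built-in iris data rather than your raster samples) showing the OOB confusion matrix next to resubstitution predictions, where every tree, including those trained on each observation, gets to vote:
library(randomForest)
set.seed(42)
# Two-class toy problem standing in for the 0/1 training samples
dat <- iris[iris$Species != "setosa", ]
dat$Species <- droplevels(dat$Species)
rf <- randomForest(Species ~ ., data = dat, ntree = 501)
# OOB confusion matrix: each observation is predicted only by trees
# that did NOT have it in their bag
rf$confusion
# Resubstitution: all trees vote, including those trained on the
# observation, so accuracy is inflated (often 100%)
table(observed = dat$Species, predicted = predict(rf, dat))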
I calculate feature importance for two different types of machine learning models (an SVM and a classification forest). I cannot post the data here, but I can describe what I do:
My (classification) task has about 400 observations of 70 variables. Some of them are highly, but not perfectly, correlated.
I fit the models with
learner_1$train(task)
learner_2$train(task)
where learner_1 is an SVM and learner_2 is a classification forest.
Now, I want to calculate feature importance with iml, so for each of the learners I use the following code (here the code for learner_1)
model_analyzed = Predictor$new(learner_1,
                               data = dplyr::select(task$data(), task$feature_names),
                               y = dplyr::select(task$data(), task$target_names))
used_features <- task$feature_names
effect = FeatureImp$new(model_analyzed, loss="ce", n.repetitions=10, compare="ratio")
print(effect$plot(features=used_features))
My results are the following (the two importance plots are not reproduced here):
a) for the SVM
b) for the classification forest
I do not understand the second picture:
a) Should the "anchor" point not be around 1, as I observe for the SVM? If the ce is not made worse by shuffling any feature, then the graph should show values around 1, not 0?
b) If all features show a value very close to zero, as in the second graph, does that mean the classification error is zero when the feature is shuffled? So for each single feature, I would get a perfect model if just that one feature were omitted or shuffled?
I am really confused here; can someone help me understand what is happening?
I know that when random forest (RF) is used for classification, the AUC is normally used to assess the quality of the classification on test data. However, I have no idea which metric to use to assess the quality of regression with RF. I now want to use RF for regression analysis, e.g. using a matrix with several hundred samples and features to predict the (numerical) concentration of chemicals.
The first step is to run randomForest to build the regression model, with y as a continuous numeric. How can I tell whether the model is good or not, based on the mean of squared residuals and % Var explained? Sometimes my % Var explained is negative.
Afterwards, if the model is fine, I apply it to the test data and get the predicted values. How can I assess whether the predicted values are good or not? I read online that some people calculate an accuracy (formula: 1 - abs(predicted - actual)/actual), which also makes sense to me. However, I have many zero values in my actual dataset, for which this formula divides by zero; are there other ways to assess the accuracy of the predicted values?
Looking forward to any suggestions and thanks in advance.
The randomForest R package comes with an importance function, which can be used to measure how much each variable contributes to the accuracy of a model. From the documentation:
importance(x, type=NULL, class=NULL, scale=TRUE, ...), where x is the output from your initial call to randomForest.
There are two types of importance measures. One uses a permutation of out-of-bag data to test the accuracy of the model. The other uses the Gini index. Again, from the documentation:
Here are the definitions of the variable importance measures. The first measure is computed from permuting OOB data: For each tree, the prediction error on the out-of-bag portion of the data is recorded (error rate for classification, MSE for regression). Then the same is done after permuting each predictor variable. The difference between the two are then averaged over all trees, and normalized by the standard deviation of the differences. If the standard deviation of the differences is equal to 0 for a variable, the division is not done (but the average is almost always equal to 0 in that case).
The second measure is the total decrease in node impurities from splitting on the variable, averaged over all trees. For classification, the node impurity is measured by the Gini index. For regression, it is measured by residual sum of squares.
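As a usage sketch (assuming a fitted forest rf.mdl, trained with importance=TRUE so that the permutation measure is available):
# Permutation importance: mean decrease in accuracy (classification)
# or increase in MSE (regression)
importance(rf.mdl, type = 1, scale = TRUE)
# Impurity-based importance: mean decrease in Gini (classification)
# or in residual sum of squares (regression)
importance(rf.mdl, type = 2)
# Both measures plotted side by side
varImpPlot(rf.mdl)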
One more simple check you may do, really more of a sanity check than anything else, is to compare against something called the best constant model. The best constant model always outputs a constant value: the mean of all responses in the test data set. It can be taken as the crudest model possible. You may compare the performance of your random forest model against the best constant model on a given set of test data. If the forest does not outperform the constant model by at least a factor of, say, 3-5, then your RF model is not very good.
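A hedged sketch of that comparison for a regression forest, assuming rf.mdl was fit on a training set and test is a held-out data frame whose response column is y:
# Best constant model: always predict the mean response of the test set
mse_const <- mean((test$y - mean(test$y))^2)
# Random forest predictions on the same test data
mse_rf <- mean((test$y - predict(rf.mdl, newdata = test))^2)
# Ratio > 1 means the forest beats the constant baseline; the rule of
# thumb above looks for a factor of roughly 3-5
mse_const / mse_rf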
I'm working with random forest models in R as part of an independent research project. I have fit my random forest model and generated the overall importance of each predictor to the model's accuracy. However, in order to interpret my results in a research paper, I need to understand whether the variables have a positive or negative impact on the response variable.
Is there a way to produce this information from a random forest model? E.g., I expect age to have a positive impact on the likelihood that a surgical complication occurs, but the existence of osteoarthritis not so much.
Code:
surgery.bagComp = randomForest(complication ~ ahrq_ccs + age + asa_status + bmi +
                               baseline_cancer + baseline_cvd + baseline_dementia +
                               baseline_diabetes + baseline_digestive +
                               baseline_osteoart + baseline_psych + baseline_pulmonary,
                               data = surgery, mtry = 2, importance = TRUE,
                               cutoff = c(0.90, 0.10))
# The cutoff gives the voting threshold for each class: probabilities of 10%
# or higher are classified as 'Complication' occurring
surgery.bagComp #Get stats for random forest model
imp=as.data.frame(importance(surgery.bagComp)) #Analyze the importance of each variable in the model
imp = cbind(vars=rownames(imp), imp)
imp = imp[order(imp$MeanDecreaseAccuracy),]
imp$vars = factor(imp$vars, levels=imp$vars)
dotchart(imp$MeanDecreaseAccuracy, imp$vars,
         xlim = c(0, max(imp$MeanDecreaseAccuracy)), pch = 16,
         xlab = "Mean Decrease Accuracy",
         main = "Complications - Variable Importance Plot", color = "black")
Importance Plot: (the resulting variable importance dot chart is not reproduced here)
Any suggestions/areas of research anyone can suggest would be greatly appreciated.
In order to interpret my results in a research paper, I need to understand whether the variables have a positive or negative impact on the response variable.
You need to perform "feature impact" analysis, not "feature importance" analysis.
Algorithmically, it's about traversing decision tree data structures and observing the impact of each split on the prediction outcome. For example, consider the split "age <= 40". Does the left branch (condition evaluates to true) carry a lower likelihood than the right branch (condition evaluates to false)?
Feature importances may give you a hint about which features to look at, but they cannot be "transformed" into feature impacts.
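One practical way to read off the direction of an effect in R (not mentioned in the question, and only a rough marginal view) is a partial dependence plot via randomForest::partialPlot, assuming the positive class is labelled 'Complication' as in the cutoff comment above:
# An upward-sloping curve suggests age pushes predictions toward
# 'Complication'; a downward-sloping one suggests the opposite.
# (For classification the y-axis is on a logit-like vote scale.)
partialPlot(surgery.bagComp, pred.data = surgery, x.var = "age",
            which.class = "Complication")
partialPlot(surgery.bagComp, pred.data = surgery, x.var = "baseline_osteoart",
            which.class = "Complication")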
You might find the following articles helpful: WHY did your model predict THAT? (Part 1 of 2) and WHY did your model predict THAT? (Part 2 of 2).
The explanation given for the predicted value of a random forest object is 'the predicted values of the input data based on out-of-bag samples'.
English is not my native language and I'm having trouble understanding this sentence. I'm currently working on a simulated regression problem using the random forest technique. The goal is to find the out-of-bag error for each sample in the simulation. After searching for a bit, I found this predicted component.
From what I understand of the sentence, for each tree, predicted returns the predicted values for the subset of the data that was not used to build that particular tree. Supposing I have N trees in the random forest, how many predicted values will I get back?
Can the results of predicted be used as OOB prediction errors? Suppose that I have the value of predicted for the ith tree (rf$predicted[i]). Is the OOB error for the ith tree given by (rf$predicted[i] - response_of_tree_i)?
Thank you very much.
The i-th predicted value is the mean of the predicted values across all trees for which the i-th observation was out of bag (OOB).
For example:
library(randomForest)
library(MASS)
set.seed(111)
rf = randomForest(medv ~ .,data=Boston,keep.forest=TRUE,keep.inbag=TRUE)
dim(Boston)
[1] 506 14
dim(rf$inbag)
[1] 506 500
So for the first observation, it is out of the bag in 198 trees:
table(rf$inbag[1,]==0)
FALSE TRUE
302 198
If you would like to get the predictions of all the trees, you can use predict with predict.all=TRUE and then verify that the predicted value you see is the mean of the i-th observation's predictions over the trees that have it OOB:
allpred = predict(rf,newdata=Boston,predict.all=TRUE)$individual
rf$predicted[1]
1
28.70521
mean(allpred[1,rf$inbag[1,]==0])
[1] 28.70521
Hence the predicted values can be used to compute the OOB error of the whole model, not of individual trees; per-tree error is usually not of interest in a random forest. You can also see this in the object, where rf$mse[i] is the OOB mean squared error using the first i trees, so at 500 trees you have the final MSE of the model:
rf$mse[length(rf$mse)]
[1] 9.902396
mean((rf$predicted-rf$y)^2)
[1] 9.902396
If you want to calculate an OOB error for each tree, bear in mind that this is not the conventional OOB error associated with a random forest, and you have to define it properly yourself; one possible definition is sketched below. You can also read more about random forests in this introductory article.
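Continuing the Boston example above, one possible (non-standard) definition is the MSE of each tree on its own out-of-bag observations:
# For tree j, keep only observations that were OOB for that tree and
# compare that single tree's predictions with the true response
per_tree_oob_mse <- sapply(seq_len(rf$ntree), function(j) {
  oob <- rf$inbag[, j] == 0
  mean((allpred[oob, j] - Boston$medv[oob])^2)
})
summary(per_tree_oob_mse)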
I'm using the randomForest package in R on a classification problem (outcome is binary).
I want to get the probability output of each one of the trees (to get a prediction interval).
I've set the predict.all=TRUE argument in the predictions, but it gives me a matrix of 800 columns (= the number of trees in my forest) in which each entry is a 1 or a 0. How do I get probability output rather than the class?
PS: my node size is 1, which means this should make sense. However, when I changed the node size to 50, I still got all 0's and 1's, no probabilities.
Here's what I'm doing:
#build model (node size=1)
rf <- randomForest(y ~ ., data = train, ntree = 800, replace = TRUE,
                   proximity = TRUE, keep.inbag = TRUE)
#get the predictions
#store the predictions from all the trees
all_tree_train <- predict(rf, test, type = "prob", predict.all = TRUE)$individual
This gives a matrix of 0's and 1's rather than probabilities.
I realise this question is old, but it might help anyone with a similar question.
If you query the individual trees for their results, you'll always get the final classifications, which are deterministic given an initialised forest. You can recover probabilities by setting predict.all to TRUE, as you've done, and then averaging the votes.
In general, the forest classifies an item 'm' as class 'x' with probability
(number of trees which classify m as x) / (total number of trees)
As you have a binary classification, the row means of the prediction matrix (one row per observation, one column per tree) give you the proportion of trees voting "1", i.e. the probability of being in class 1.
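A sketch of that calculation, assuming all_tree_train is the observations-by-trees character matrix returned above and "1" is the positive class:
# Each row is one observation, each column one tree's vote ("0" or "1");
# the proportion of trees voting "1" is the predicted probability of class 1
prob_class1 <- rowMeans(all_tree_train == "1")
head(prob_class1)
For the aggregate probability in one step (without the per-tree votes), predict(rf, test, type="prob") returns the vote fractions directly.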
So the documentation for predict.randomForest states:
If predict.all=TRUE, then the individual component of the returned object is a character matrix where each column contains the predicted class by a tree in the forest.
...so it does not appear that it is possible to have a probability returned for each individual tree.
If you want something like a prediction interval for classification, you might try fitting a random forest with many more trees and then generating predictions from many different (random?) subsets of the forest.
One thing you need to be careful of, though, is that you appear to be feeding your training data to predict.randomForest. This will of course give you biased predictions, unless you use the information in the inbag component of the random forest object to select, for each observation, only the trees for which that observation was out of bag; a sketch of that filtering is below.
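A sketch of that filtering, assuming the forest was fit with keep.inbag=TRUE (as in your call) and that train holds the training data in its original row order:
# Per-tree votes for the training data
votes <- predict(rf, train, predict.all = TRUE)$individual
# For observation i, average only the votes of trees where i was OOB
oob_prob_class1 <- sapply(seq_len(nrow(train)), function(i) {
  oob_trees <- rf$inbag[i, ] == 0
  mean(votes[i, oob_trees] == "1")
})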