Get the accuracy of a random forest in R

I have created a random forest out of my data:
fit=randomForest(churn~., data=data_churn[3:17], ntree=1,
importance=TRUE, proximity=TRUE)
I can easily see my confusion matrix:
conf <- fit$confusion
> conf
     No Yes class.error
No  945  80  0.07804878
Yes  84 101  0.45405405
Now I need to know the accuracy of the random forest. I searched around and realized that the caret library has a confusionMatrix method that takes a confusion matrix and returns the accuracy (along with many other values). However, the method needs another parameter called "reference". My question is: how can I provide a reference for the method to get the accuracy of my random forest?
And... is it the correct way to get the accuracy of a random forest?

Use randomForest(..., do.trace=T) to see the OOB error during training, broken down by class and by number of trees.
(FYI: you chose ntree=1, so you'll get just a single decision tree, not a forest. That rather defeats the purpose of using RF, which relies on randomly choosing subsets of both features and samples across many trees. You probably want to try a range of ntree values.)
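For instance, a minimal sketch building on the call above (the larger ntree and the do.trace interval are illustrative choices; an integer do.trace prints running output every that many trees):
fit <- randomForest(churn ~ ., data = data_churn[3:17],
                    ntree = 500, do.trace = 25,
                    importance = TRUE, proximity = TRUE)
# prints OOB error, overall and per class, every 25 trees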
And after training, you can get per-class error from the rightmost column of the confusion matrix as you already found:
> fit$confusion[, 'class.error']
        No        Yes
0.07804878 0.45405405
(Also, you probably want to set options(digits=3) to avoid the excessive decimal places.)
As for converting that list of class errors (accuracies = 1 - errors) into one overall accuracy number, that's easy to do. You could use the mean, a class-weighted mean, the harmonic mean (of accuracies, not of errors), etc. It depends on your application and on the relative penalty for misclassifying each class. Your example is simple: it's only two-class.
(or e.g. there are more complicated measures of inter-rater agreement)
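For instance, a minimal sketch of the plain (unweighted) overall accuracy, computed directly from the confusion matrix you already have:
counts <- fit$confusion[, c('No', 'Yes')]    # drop the class.error column
accuracy <- sum(diag(counts)) / sum(counts)  # correct predictions / total
accuracy
# (945 + 101) / 1210 = 0.864 for the counts shown above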

Related

How to obtain Brier Score in Random Forest in R?

I am having trouble getting the Brier Score for my Machine Learning Predictive models. The outcome "y" was categorical (1 or 0). Predictors are a mix of continuous and categorical variables.
I have created four models with different predictors, which I will call "model_1"-"model_4" here (apart from the predictors, the other parameters are the same). Example code for my model is:
Model_1=rfsrc(y~ ., data=TrainTest, ntree=1000,
mtry=30, nodesize=1, nsplit=1,
na.action="na.impute", nimpute=3,seed=10,
importance=T)
When I run the "Model_1" function in R, I got the results:
My question was how can I get the predicted possibility for those 412 people? And how to find the observed probability for each person? Do I need to calculate by hand? I found the function BrierScore() in "DescTools" package.
But I tried "BrierScore(Model_1)", it gives me no results.
codes I added:
library(scoring)
library(DescTools)
BrierScore(Raw_SB)
class(TrainTest$VL_supress03)
TrainTest$VL_supress03_nu<-as.numeric(as.character(TrainTest$VL_supress03))
class(TrainTest$VL_supress03_nu)
prediction_Raw_SB = predict(Raw_SB, TrainTest)
BrierScore(prediction_Raw_SB, as.numeric(TrainTest$VL_supress03) - 1)
BrierScore(prediction_Raw_SB, as.numeric(as.character(TrainTest$VL_supress03)) - 1)
BrierScore(prediction_Raw_SB, TrainTest$VL_supress03_nu - 1)
These attempts all produce error messages.
One assumption I am making about your approach is that you want to compute the Brier score on the data you trained your model on, which is usually not the correct approach (google "train-test split" if you need more background there). In general, you should therefore reflect on whether your approach is correct.
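For instance, a minimal sketch of a random train-test split (assuming TrainTest is your full data frame; the 80/20 ratio is illustrative):
set.seed(10)
idx <- sample(nrow(TrainTest), size = floor(0.8 * nrow(TrainTest)))
train_set <- TrainTest[idx, ]   # fit the model on this part
test_set  <- TrainTest[-idx, ]  # evaluate the Brier score on this part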
The BrierScore method in DescTools only has a dedicated method for glm models; otherwise it expects as input a vector of predicted probabilities and a vector of true values (see ?BrierScore).
What you would need to do, though, is to predict on your data using:
prediction = predict(Model_1, TrainTest, na.action = "na.impute")
and then compute the Brier score using:
BrierScore(as.numeric(TrainTest$y) - 1, prediction$predicted[, 1L])
(Note that we transform TrainTest$y into a numeric vector of 0s and 1s in order to compute the Brier score.)
Note: the randomForestSRC package also prints a normalized Brier score when you call print(prediction).
In general, using one of the available machine learning frameworks in R (mlr3, tidymodels, caret) might simplify this workflow for you and prevent a lot of errors of this kind, especially if you are less experienced in ML.
See e.g. this chapter in the mlr3 book for more information.
For reference, here is some very similar code using the mlr3 package, automatically also taking care of train-test splits.
data(breast, package = "randomForestSRC") # with target variable "status"
library(mlr3)
library(mlr3extralearners)
task = TaskClassif$new(id = "breast", backend = breast, target = "status")
algo = lrn("classif.rfsrc", na.action = "na.impute", predict_type = "prob")
resample(task, algo, rsmp("holdout", ratio = 0.8))$score(msr("classif.bbrier"))

Why can a Strauss-hardcore model have a gamma bigger than 1?

The spatstat book clearly states that a Strauss model with gamma greater than 1 is invalid, and indeed:
multiple.Strauss<-ppm(P1a4.multiple~1, Strauss(r=51),method='ho')
#Warning message:
#Fitted model is invalid - cannot be simulated
As the L(r) function has a trough first, I refit the data as a Strauss-hardcore model:
Mo.hybrid<-Hybrid(H=Hardcore(),S=Strauss(51))
multiple.hybrid<-ppm(P1a4.multiple~1,Mo.hybrid,method='ho')
#Hard core distance: 12.65963
#Fitted S interaction parameter gamma: 2.7466492
Interestingly, the model fits successfully, with a gamma > 1!
I want to know whether gamma in the Strauss-hardcore model has the same meaning as in the Strauss model, and can therefore be used as an indicator of aggregation.
Yes, the interpretation is similar and indicates some aggregation behaviour. The model with gamma>1 may be less intuitive to understand: Say the hardcore distance is r=12 and the Strauss interaction distance is R=50. Then you say that pairs of points within distance 12 of each other are heavily penalized (not permitted at all) while pairs of points separated by between 12 and 50 are encouraged (have a higher probability of occurring than at random). Pairs of points separated by more than 50 do not change the baseline probability (complete randomness).
Simulations from the Strauss-hardcore model often show strange aggregation behaviour, but it may be suitable for your data.
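As a minimal sketch, you can draw simulations from the fitted object to eyeball that behaviour (this assumes the multiple.hybrid fit from above; simulate.ppm samples from the fitted model via Metropolis-Hastings):
sim <- simulate(multiple.hybrid, nsim = 4)  # four simulated point patterns
plot(sim, main = "Simulations from the fitted Strauss-hardcore model")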

bnlearn::bn.fit difference and calculation of methods "mle" and "bayes"

I am trying to understand the differences between the two methods bayes and mle in the bn.fit function of the bnlearn package.
I know about the debate between the frequentist and the Bayesian approach to understanding probabilities. On a theoretical level, I suppose the maximum likelihood estimate mle is a simple frequentist approach that sets the relative frequencies as the probabilities. But what calculations are done to get the bayes estimate? I have already checked the bnlearn documentation, the description of the bn.fit function, and some application examples, but nowhere is there a real description of what is happening.
I also tried to understand the function in R by first checking out bnlearn::bn.fit, leading to bnlearn:::bn.fit.backend, leading to bnlearn:::smartSapply but then I got stuck.
Some help would be really appreciated as I use the package for academic work and therefore I should be able to explain what happens.
Bayesian parameter estimation in bnlearn::bn.fit applies to discrete variables. The key is the optional iss argument: "the imaginary sample size used by the bayes method to estimate the conditional probability tables (CPTs) associated with discrete nodes".
So, for a binary root node X in some network, the bayes option in bnlearn::bn.fit returns (Nx + iss / cptsize) / (N + iss) as the probability of X = x, where N is your number of samples, Nx the number of samples with X = x, and cptsize the size of the CPT of X; in this case cptsize = 2. The relevant code is in the bnlearn:::bn.fit.backend.discrete function, in particular the line: tab = tab + extra.args$iss/prod(dim(tab))
Thus, iss / cptsize is the number of imaginary observations for each entry in a CPT, as opposed to N, the number of 'real' observations. With iss = 0 you would be getting a maximum likelihood estimate, as you would have no prior imaginary observations.
The higher iss with respect to N, the stronger the effect of the prior on your posterior parameter estimates. With a fixed iss and a growing N, the Bayesian estimator and the maximum likelihood estimator converge to the same value.
A common rule of thumb is to use a small non-zero iss so that you avoid zero entries in the CPTs, corresponding to combinations that were not observed in the data. Such zero entries could then result in a network which generalizes poorly, such as some early versions of the Pathfinder system.
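For illustration, a minimal sketch comparing the two estimators on bnlearn's built-in learning.test data (the empty graph and iss = 10 are arbitrary choices for this demonstration):
library(bnlearn)
data(learning.test)                       # six discrete variables, A to F
dag <- empty.graph(names(learning.test))  # no arcs, so every node is a root
fit.mle   <- bn.fit(dag, learning.test, method = "mle")
fit.bayes <- bn.fit(dag, learning.test, method = "bayes", iss = 10)
# For the 3-level root node A: mle gives Na / N,
# bayes gives (Na + iss/3) / (N + iss)
coef(fit.mle$A)
coef(fit.bayes$A)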
For more details on Bayesian parameter estimation you can have a look at the book by Koller and Friedman. I suppose many other Bayesian network books also cover the topic.

how to use classwt in randomForest of R?

I have a highly imbalanced data set with target class instances in the following ratio 60000:1000:1000:50 (i.e. a total of 4 classes). I want to use randomForest for making predictions of the target class.
So, to reduce the class imbalance, I played with the sampsize parameter, setting it to c(5000, 1000, 1000, 50) and some other values, but it did not help much. In fact, the accuracy of the 1st class decreased while I played with sampsize, while the improvement in the other class predictions was marginal.
While digging through the archives, I came across two more features of randomForest(), which are strata and classwt that are used to offset class imbalance issue.
All the documents on classwt I found were old (generally from around 2007-2008), and they all suggested not to use the classwt feature of the randomForest package in R, as it did not implement the full functionality of the Fortran version. So the first question is:
Is classwt now fully implemented in the randomForest package of R? If yes, what does passing c(1, 10, 10, 10) to the classwt argument represent? (Assuming the above case of 4 classes in the target variable.)
Another argument said to offset the class imbalance issue is stratified sampling, which is always used in conjunction with sampsize. I understand what sampsize is from the documentation, but there are not enough documentation or examples giving clear insight into using strata to overcome the class imbalance issue. So the second question is:
What type of arguments have to be passed to strata in randomForest, and what do they represent?
I guess the word weight, which I have not explicitly mentioned in the question, should play a major role in the answer.
classwt is correctly passed on to randomForest, check this example:
library(randomForest)
rf = randomForest(Species~., data = iris, classwt = c(1E-5,1E-5,1E5))
rf
#Call:
# randomForest(formula = Species ~ ., data = iris, classwt = c(1e-05, 1e-05, 1e+05))
# Type of random forest: classification
# Number of trees: 500
#No. of variables tried at each split: 2
#
# OOB estimate of error rate: 66.67%
#Confusion matrix:
# setosa versicolor virginica class.error
#setosa 0 0 50 1
#versicolor 0 0 50 1
#virginica 0 0 50 0
Class weights are the priors on the outcomes. You need to balance them to achieve the results you want.
On strata and sampsize this answer might be of help: https://stackoverflow.com/a/20151341/2874779
In general, sampsize with the same size for all classes seems reasonable. strata is a factor that's going to be used for stratified resampling, in your case you don't need to input anything.
You can pass a named vector to classwt.
But how the weight is calculated is very tricky.
For example, if your target variable y has two classes "Y" and "N", and you want to set balanced weights, you should do:
wn = sum(y == "N") / length(y)
wy = 1
Then set classwt = c("N" = wn, "Y" = wy).
Alternatively, you may want to use the ranger package. It offers flexible builds of random forests, and specifying class/sample weights is easy. ranger is also supported by the caret package.
Random forests are probably not the right classifier for your problem as they are extremely sensitive to class imbalance.
When I have an unbalanced problem, I usually deal with it using sampsize, as you tried. However, I make all the strata equal in size and I use sampling without replacement.
Sampling without replacement is important here: otherwise samples from the smaller classes will contain many more repetitions, and the class will still be underrepresented. It may be necessary to increase mtry if this approach leads to small samples, sometimes even setting it to the total number of features.
This works quite well when there are enough items in the smallest class. However, your smallest class has only 50 items; I doubt you would get useful results with sampsize=c(50,50,50,50).
Also classwt has never worked for me.
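For reference, a minimal sketch of the equal-strata, no-replacement approach described above (df and its 4-class factor target class are hypothetical placeholders for your data):
library(randomForest)
rf <- randomForest(class ~ ., data = df,
                   strata = df$class,
                   sampsize = c(50, 50, 50, 50),  # one entry per class
                   replace = FALSE)               # sample without replacement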

setting values for ntree and mtry for random forest regression model

I'm using R package randomForest to do a regression on some biological data. My training data size is 38772 X 201.
I just wondered: what would be a good value for the number of trees, ntree, and the number of variables tried at each split, mtry? Is there an approximate formula for finding such parameter values?
Each row in my input data is a 200-character string representing an amino acid sequence, and I want to build a regression model that uses such sequences to predict the distances between proteins.
The default for mtry is quite sensible, so there is not really a need to fiddle with it. There is a function tuneRF for optimizing this parameter. However, be aware that it may cause bias.
There is no optimization for the number of bootstrap replicates. I often start with ntree=501 and then plot the random forest object. This will show you the error convergence based on the OOB error. You want enough trees to stabilize the error, but not so many that you over-correlate the ensemble, which leads to overfitting.
Here is the caveat: variable interactions stabilize at a slower rate than error, so if you have a large number of independent variables you need more replicates. I keep ntree an odd number so ties can be broken.
For the dimensions of your problem I would start with ntree=1501. I would also recommend looking into one of the published variable-selection approaches to reduce the number of independent variables.
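A minimal sketch of the plot-and-inspect step (train and y are hypothetical stand-ins for your data):
library(randomForest)
rf <- randomForest(y ~ ., data = train, ntree = 1501)
plot(rf)  # OOB error (MSE for regression) versus number of trees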
The short answer is no.
The randomForest function of course has default values for both ntree and mtry. The default for mtry is often (but not always) sensible, while people will generally want to increase ntree from its default of 500 quite a bit.
The "correct" value for ntree generally isn't much of a concern, as it will be quite apparent with a little tinkering that the predictions from the model won't change much after a certain number of trees.
You can spend (read: waste) a lot of time tinkering with things like mtry (and sampsize and maxnodes and nodesize, etc.), probably to some benefit, but in my experience not a lot. However, every data set is different; sometimes you may see a big difference, sometimes none at all.
The caret package has a very general function train that allows you to do a simple grid search over parameter values like mtry for a wide variety of models. My only caution would be that doing this with fairly large data sets is likely to get time consuming fairly quickly, so watch out for that.
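A minimal sketch of that grid search (the data objects x and y and the candidate mtry values are hypothetical; caret's "rf" method tunes mtry and passes ntree through to randomForest):
library(caret)
ctrl <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation
grid <- expand.grid(mtry = c(10, 33, 66, 133))   # candidates around p/3 for p = 200
fit  <- train(x = x, y = y, method = "rf",
              trControl = ctrl, tuneGrid = grid, ntree = 1000)
fit$bestTune                                     # best mtry found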
Also, somehow I forgot that the randomForest package itself has a tuneRF function specifically for searching for the "optimal" value of mtry.
Could this paper help?
Limiting the Number of Trees in Random Forests
Abstract. The aim of this paper is to propose a simple procedure that a priori determines a minimum number of classifiers to combine in order to obtain a prediction accuracy level similar to the one obtained with the combination of larger ensembles. The procedure is based on the McNemar non-parametric test of significance. Knowing a priori the minimum size of the classifier ensemble giving the best prediction accuracy constitutes a gain for time and memory costs, especially for huge databases and real-time applications. Here we applied this procedure to four multiple classifier systems with C4.5 decision trees (Breiman's Bagging, Ho's Random subspaces, their combination we labeled 'Bagfs', and Breiman's Random forests) and five large benchmark databases. It is worth noticing that the proposed procedure may easily be extended to other base learning algorithms than a decision tree as well. The experimental results showed that it is possible to limit significantly the number of trees. We also showed that the minimum number of trees required for obtaining the best prediction accuracy may vary from one classifier combination method to another.
They never use more than 200 trees.
One nice trick I use is to start by taking the square root of the number of predictors and plug that value in for mtry. It is usually around the same value that the tuneRF function in randomForest would pick.
I use the code below to check accuracy as I play around with ntree and mtry (change the parameters):
library(randomForest)
library(ROCR)  # for prediction() and performance()

results_df <- data.frame(matrix(ncol = 8))
colnames(results_df) <- c("No. of trees", "No. of variables",
                          "Dev_AUC", "Dev_Hit_rate", "Dev_Coverage_rate",
                          "Val_AUC", "Val_Hit_rate", "Val_Coverage_rate")

trees <- c(50, 100, 150, 250)
variables <- c(8, 10, 15, 20)

for (i in 1:length(trees)) {
  ntree <- trees[i]
  for (j in 1:length(variables)) {
    mtry <- variables[j]
    rf <- randomForest(x, y, ntree = ntree, mtry = mtry)

    # development data: OOB class predictions
    pred <- as.data.frame(predict(rf, type = "class"))
    class_rf <- cbind(dev$Target, pred)
    colnames(class_rf) <- c("actual_values", "predicted_values")
    dev_hit_rate <- nrow(subset(class_rf, actual_values == 1 & predicted_values == 1)) /
      nrow(subset(class_rf, predicted_values == 1))
    dev_coverage_rate <- nrow(subset(class_rf, actual_values == 1 & predicted_values == 1)) /
      nrow(subset(class_rf, actual_values == 1))

    # development data: OOB probabilities and AUC
    pred_prob <- as.data.frame(predict(rf, type = "prob"))
    prob_rf <- cbind(dev$Target, pred_prob)
    colnames(prob_rf) <- c("target", "prob_0", "prob_1")
    pred <- prediction(prob_rf$prob_1, prob_rf$target)
    auc <- performance(pred, "auc")
    dev_auc <- as.numeric(auc@y.values)

    # validation data: class predictions
    pred <- as.data.frame(predict(rf, val, type = "class"))
    class_rf <- cbind(val$Target, pred)
    colnames(class_rf) <- c("actual_values", "predicted_values")
    val_hit_rate <- nrow(subset(class_rf, actual_values == 1 & predicted_values == 1)) /
      nrow(subset(class_rf, predicted_values == 1))
    val_coverage_rate <- nrow(subset(class_rf, actual_values == 1 & predicted_values == 1)) /
      nrow(subset(class_rf, actual_values == 1))

    # validation data: probabilities and AUC
    pred_prob <- as.data.frame(predict(rf, val, type = "prob"))
    prob_rf <- cbind(val$Target, pred_prob)
    colnames(prob_rf) <- c("target", "prob_0", "prob_1")
    pred <- prediction(prob_rf$prob_1, prob_rf$target)
    auc <- performance(pred, "auc")
    val_auc <- as.numeric(auc@y.values)

    results_df <- rbind(results_df,
                        c(ntree, mtry, dev_auc, dev_hit_rate, dev_coverage_rate,
                          val_auc, val_hit_rate, val_coverage_rate))
  }
}
