Choose the number of trees and mtry (without causing bias)? - r

I have this (example) code and I am trying to understand some of its characteristics. There are many questions about random forests, and the issue of the number of trees and mtry always comes up. This data frame is just an example, but how can I interpret the model's error plot in order to set the number of trees without introducing bias? Also, why does the number of variables tried at each split equal 1 here?
I think tuneRF and train may introduce bias, so I want to try to find the best number of trees and mtry (default p/3) based on the error.
# an example of a data frame and the model
library(randomForest)
clin=data.frame(1:500)            # single column, named X1.500 by default
clin$k=clin$X1.500*0.2            # second predictor, perfectly correlated with the first
clin$z=clin$X1.500*14.1/6         # response, an exact linear function of the first column
names(clin)=c("pr1","pr2","res")
rf=randomForest(res~pr1+pr2,data=clin,ntree=1000,importance=TRUE,keep.inbag=T)
plot(rf)
rf
Call:
randomForest(formula = res ~ pr1 + pr2, data = clin, ntree = 1000, importance = TRUE, keep.inbag = T)
Type of random forest: regression
Number of trees: 1000
No. of variables tried at each split: 1
Mean of squared residuals: 2.051658
% Var explained: 100

At each split, RF considers a random subset of the total number of predictors p; for regression the default size is floor(p/3), with a minimum of 1. In this example you only have 2 predictors to explain "res", so RF randomly selects just one of them at each split.
ntree and mtry should be defined so that your results are consistent.
If you set ntree too low and compute the RF multiple times, you'll see a huge variation in RMSEP between the different forests.
The same holds true for mtry.
See also a previous answer with a reference to Breiman's paper on the matter.
Edit regarding the predictor chosen for the split: when dealing with large numbers of predictors (2 is definitely too few to make good use of an RF), you may be interested in variable importance to see which ones are more meaningful than the others.
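To make the consistency check above concrete, here is a minimal sketch (reusing the toy clin data frame from the question; the helper name check_stability, the number of repeats, and the candidate ntree values are my own arbitrary choices):
library(randomForest)
# refit the forest several times for a given ntree/mtry and look at the spread
# of the OOB error (the last element of rf$mse is the full-forest OOB MSE)
check_stability <- function(data, ntree, mtry, repeats = 10) {
  oob <- replicate(repeats, {
    rf <- randomForest(res ~ pr1 + pr2, data = data, ntree = ntree, mtry = mtry)
    tail(rf$mse, 1)
  })
  c(ntree = ntree, mtry = mtry, mean_oob = mean(oob), sd_oob = sd(oob))
}
# a small sd across repeats suggests ntree (and mtry) are large enough
t(sapply(c(100, 500, 1000), function(nt) check_stability(clin, nt, mtry = 1)))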

Related

Determine the validity of feature importance for a regression

My goal is to determine the "Feature Importance" in a regression task (i.e. I want to know which feature influences the target variable the most).
Let's assume I use multiple regression.
There are different methods to determine feature importance: model-dependent measures like (standardized) beta coefficients and Shapley value regression, and model-independent measures like permutation importance.
All methods are made for specific situations (e.g. Shapley value regression is made for situations in which the predictors are highly correlated, whereas permutation importance takes interaction effects into account). However, these assumptions are the result of theoretical considerations and simulated data sets, which means that the validity (the difference between the "true" values and the measured values on the actual data) depends on the structure of the actual data set.
Is there a way in R to determine the difference between the measured and the "true" values of the feature importance?
I can use cross-validation to measure the difference between the predicted and the actual values. As an example, I use the Hitters data set.
library(ISLR)
library(caret)
fit = train(Salary~., data = Hitters, method = "lm", trControl = trainControl(method = "repeatedcv", repeats = 3), na.action = na.omit)
I can take a look at the tuning results and the difference between true and predicted values. Here I have an RMSE of ~ 334.
fit$results
intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 TRUE 333.9667 0.468769 239.7604 75.13388 0.1987425 40.58342
But what about the beta coefficients? I can look at them, but I can't figure out how they differ from the actual betas on the test set.
coef(fit$finalModel)
(Intercept) AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
163.1035878 -1.9798729 7.5007675 4.3308829 -2.3762100 -1.0449620 6.2312863 -3.4890543 -0.1713405 0.1339910 -0.1728611
CRuns CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1.4543049 0.8077088 -0.8115709 62.5994230 -116.8492456 0.2818925 0.3710692 -3.3607605 -24.7623251
Is there a way in R to get this information (also for other measures like permutation importance or Shapley value regression)?
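For reference, the permutation importance mentioned above can be computed by hand. The sketch below is only an illustration on the Hitters data, not an answer to the "true vs. measured" question; the helper name perm_importance, the RMSE loss, and the 20 permutations are my own assumptions:
library(ISLR)
# hand-rolled permutation importance: shuffle one predictor at a time and
# record the average increase in RMSE of the fitted model on the supplied data
perm_importance <- function(model, data, target, n_perm = 20) {
  base_rmse <- sqrt(mean((data[[target]] - predict(model, data))^2))
  preds <- setdiff(names(data), target)
  sapply(preds, function(p) {
    mean(replicate(n_perm, {
      shuffled <- data
      shuffled[[p]] <- sample(shuffled[[p]])   # break the association with the target
      sqrt(mean((data[[target]] - predict(model, shuffled))^2)) - base_rmse
    }))
  })
}
hitters <- na.omit(Hitters)
lm_fit <- lm(Salary ~ ., data = hitters)
sort(perm_importance(lm_fit, hitters, "Salary"), decreasing = TRUE)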

Random forest regression - cumulative MSE?

I am new to random forests and I have a question about regression. I am using the R package randomForest to calculate RF models.
My final goal is to select sets of variables important for prediction of a continuous trait, so I calculate a model, remove the variable with the lowest mean decrease in accuracy, calculate a new model, and so on. This worked with RF classification, where I compared the models using the OOB errors from prediction (training set) and from the development and validation data sets. Now with regression I want to compare the models based on % variation explained and MSE.
I was evaluating the results for MSE and % var explained, and I get exactly the same results when calculating them manually from the predictions in model$predicted. But when I look at model$mse, the value presented corresponds to the MSE for the last tree calculated, and the same happens for % var explained.
As an example you can try this code in R:
library(randomForest)
data("iris")
head(iris)
TrainingX<-iris[1:100,2:4] #creating training set - X matrix
TrainingY<-iris[1:100,1] #creating training set - Y vector
TestingX<-iris[101:150,2:4] #creating test set - X matrix
TestingY<-iris[101:150,1] #creating test set - Y vector
set.seed(2)
model<-randomForest(x=TrainingX, y= TrainingY, ntree=500, #calculating model
xtest = TestingX, ytest = TestingY)
#for prediction (training set)
pred<-model$predicted
meanY<-sum(TrainingY)/length(TrainingY)
varpY<-sum((TrainingY-meanY)^2)/length(TrainingY)
mseY<-sum((TrainingY-pred)^2)/length(TrainingY)
r2<-(1-(mseY/varpY))*100
#for testing (test set)
pred_2<-model$test$predicted
meanY_2<-sum(TestingY)/length(TestingY)
varpY_2<-sum((TestingY-meanY_2)^2)/length(TestingY)
mseY_2<-sum((TestingY-pred_2)^2)/length(TestingY)
r2_2<-(1-(mseY_2/varpY_2))*100
training_set_mse<-c(model$mse[500], mseY)
training_set_rsq<-c(model$rsq[500]*100, r2)
testing_set_mse<-c(model$test$mse[500],mseY_2)
testing_set_rsq<-c(model$test$rsq[500]*100, r2_2)
c<-cbind(training_set_mse,training_set_rsq,testing_set_mse, testing_set_rsq)
rownames(c)<-c("last tree", "by hand")
c
model
As a result of running this code, you will obtain a table containing values for MSE and % var explained (also called rsq). The first line, called "last tree", contains the values of MSE and % var explained for the 500th tree in the forest. The second line, called "by hand", contains the results calculated in R from the vectors model$predicted and model$test$predicted.
So, my questions are:
1- Are the predictions of the trees somehow cumulative? Or are they independent from each other? (I thought they were independent)
2- Is the last tree to be considered as an average of all the others?
3- Why are MSE and %var explained of the RF model (presented in the main board when you call model) the same as the ones from the 500th tree (see first line of table)? Do the vectors model$mse or model$rsq contain cumulative values?
After the last edit I found this post from Andy Liaw (one of the creators of the package) which says that MSE and % var explained are in fact cumulative: https://stat.ethz.ch/pipermail/r-help/2004-April/049943.html.
Not sure I understand what your issue is; I'll give it a try nevertheless...
1- Are the predictions of the trees somehow cumulative? Or are they
independent from each other? (I thought they were independent)
You thought correctly; the trees are fit independently of each other, hence their predictions are indeed independent. In fact, this is a crucial advantage of RF models, since it allows for parallel implementations.
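As an aside, that independence is exactly what the usual parallel-fitting pattern relies on. A minimal sketch, reusing TrainingX/TrainingY from the question (the use of foreach/doParallel, the 4 workers, and the 4 x 125 split of the trees are my own assumptions):
library(randomForest)
library(foreach)
library(doParallel)
registerDoParallel(cores = 4)
# grow four sub-forests of 125 trees each in parallel and merge them with
# randomForest::combine; this only works because the trees are fit independently
rf_par <- foreach(nt = rep(125, 4), .combine = randomForest::combine,
                  .packages = "randomForest") %dopar% {
  randomForest(x = TrainingX, y = TrainingY, ntree = nt)
}
rf_par$ntree   # 500 trees in total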
2- Is the last tree to be considered as an average of all the others?
No; as clarified above, all trees are independent.
3- If each tree gets a prediction, how can I get the matrix with all the trees, since what I need is the MSE and % var explained for the forest?
Here is where what you ask starts being really unclear, given your code above; the MSE and r2 you say you need are exactly what you are already computing in mseY and r2:
mseY
[1] 0.1232342
r2
[1] 81.90718
which, unsurprisingly, are the very same values reported by model:
model
# result:
Call:
randomForest(x = TrainingX, y = TrainingY, ntree = 500)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 1
Mean of squared residuals: 0.1232342
% Var explained: 81.91
so I'm not sure I can really see your issue, or what these values have to do with the "matrix with all the trees"...
But when I do model$mse, the value presented corresponds to the value
of MSE for the last tree calculated, and the same happens for % var
explained.
Most certainly not: model$mse is a vector of length equal to the number of trees (here 500), containing the MSE for each individual tree (but see the UPDATE below); I have never seen any use for this in practice (similarly for model$rsq):
length(model$mse)
[1] 500
length(model$rsq)
[1] 500
UPDATE: Kudos to the OP herself (see comments), who discovered that the quantities in model$mse and model$rsq are indeed cumulative (!); from an old (2004) thread by package maintainer Andy Liaw, Extracting the MSE and % Variance from RandomForest:
Several ways:
Read ?randomForest, especially the `Value' section.
Look at str(myforest.rf).
Look at print.randomForest.
If the forest has 100 trees, then the mse and rsq are vectors with 100
elements each, the i-th element being the mse (or rsq) of the forest
consisting of the first i trees. So the last element is the mse (or
rsq) of the whole forest.
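A quick way to see this cumulative behaviour with the objects already created above (nothing here beyond what randomForest returns; the plot labels are my own):
# model$mse[i] is the OOB MSE of the forest built from the first i trees,
# so its last element matches the "by hand" value computed from model$predicted
tail(model$mse, 1)
mean((TrainingY - model$predicted)^2)
# the OOB error typically drops sharply and then flattens as trees are added
plot(model$mse, type = "l", xlab = "number of trees", ylab = "OOB MSE")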

R - caret::train "random forest" parameters

I'm trying to build a classification model on 60 variables and ~20,000 observations using the train() function within the caret package. I'm using the random forest method and am getting 0.999 accuracy on my training set; however, when I use the model to predict, it classifies every test observation as the same class (i.e. each of the 20 observations is classified as a "1" out of 5 possible outcomes). I'm certain this is wrong (the test set is for a Coursera quiz, hence my not posting exact code), but I'm not sure what is happening.
My question is that when I call the final model of the fit (fit$finalModel), it says it made 500 total trees (default and expected); however, the number of variables tried at each split is 35. I know that with classification, the standard number of variables tried at each split is the square root of the total number of predictors (therefore it should be sqrt(60) = 7.7, call it 8). Could this be the problem?
I'm confused about whether there is something wrong with my model, my data cleaning, etc.
set.seed(10000)
fitControl <- trainControl(method = "cv", number = 5)
fit <- train(y ~ ., data = training, method = "rf", trControl = fitControl)
fit$finalModel
Call:
randomForest(x = x, y = y, mtry = param$mtry)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 41
OOB estimate of error rate: 0.01%
Using a random forest for the final project of the Johns Hopkins Practical Machine Learning course on Coursera will generate the same prediction for all 20 test cases in the quiz if students fail to remove independent variables that have more than 50% NA values.
SOLUTION: remove variables that have a high proportion of missing values from the model.
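A minimal sketch of that cleanup, reusing the training data frame and fitControl from the question (the 50% threshold follows the answer above; fixing mtry near sqrt(p) via tuneGrid is an extra assumption, not part of the original solution):
library(caret)
# drop predictors with more than 50% missing values before training
keep <- colMeans(is.na(training)) <= 0.5
training_clean <- training[, keep]
set.seed(10000)
fit <- train(y ~ ., data = training_clean, method = "rf",
             trControl = fitControl,
             tuneGrid = data.frame(mtry = floor(sqrt(ncol(training_clean) - 1))))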

Overcoming Multicollinearity in Random Forest Regression and still keeping all variables in the model

I am new to random forest regression. I have 300 continuous variables (299 predictors and 1 target) in prep1, where some predictors are highly correlated. The problem is that I still need to get the importance value for each one of the predictors, so eliminating some is not an option.
Here are my questions:
1) Is there a way to choose, for each tree, only variables that are not highly correlated? If yes, how should the code below be adjusted?
2) Assuming yes to 1), will this take care of the multi-collinearity problem?
library(randomForest)
bound <- floor(nrow(prep1)/2)              # 50/50 split point
df <- prep1[sample(nrow(prep1)), ]         # shuffle the rows
train <- df[1:bound, ]                     # training half
test <- df[(bound+1):nrow(df), ]           # test half
modelFit <- randomForest(continuous_target ~ ., data = train)
prediction <- predict(modelFit, test)
Random forest by its nature selects bootstrap samples with replacement as well as random subsets of features for each of those samples. In your scenario, given that you don't have skewness in the response variable, building a LARGE number of trees should give you an importance value for all of the variables, though this increases the computational cost, since for every bag you are capturing the same importances multiple times. Also, multicollinearity won't affect the predictive power.
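As a concrete illustration of that advice (a sketch only; ntree = 2000 is an arbitrary choice, and train/continuous_target come from the question's code):
library(randomForest)
# larger forest with permutation importance enabled, so every predictor
# (including correlated ones) gets an importance estimate
modelFit <- randomForest(continuous_target ~ ., data = train,
                         ntree = 2000, importance = TRUE)
# %IncMSE is the permutation-based importance (mean decrease in accuracy)
head(sort(importance(modelFit)[, "%IncMSE"], decreasing = TRUE), 20)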

Pruning rule based classification tree (PART algorithm)

I am using the PART algorithm in R (via the package RWeka) for multi-class classification. The target attribute is the time bucket in which an invoice will be paid by the customer (like 7-15 days, 15-30 days, etc.). I am using the following code for fitting and predicting from the model:
fit <- PART(DELAY_CLASS ~ AMT_TO_PAY + NUMBER_OF_CREDIT_DAYS + AVG_BASE_PRICE, data= trainingData)
predictedTrainingValues <- predict(fit, trainingData)
By using this model, I am getting around 82% accuracy on the training data, but accuracy on the test data comes out around 59%. I understand that I am over-fitting the model. I tried to reduce the number of predictor variables (the predictor variables in the above code are already a reduced set), but it is not helping much. Reducing the number of variables improves accuracy on the test data to around 61% and reduces the accuracy on the training data to around 79%.
Since the PART algorithm is based on partial decision trees, another option could be to prune the tree. But I am not aware of how to prune the tree for the PART algorithm. Searching the internet, I found that the FOIL criterion can be used for pruning rule-based algorithms, but I am not able to find an implementation of the FOIL criterion in R or RWeka.
My question is how to prune the tree for the PART algorithm; any other suggestions for improving accuracy on the test data are also welcome.
Thanks in advance!!
NOTE: I calculate accuracy as the number of correctly classified instances divided by the total number of instances.
In order to prune the tree with PART you need to specify it in the control argument of the function:
There is a complete list of the commands you can pass into the control argument here
I quote some of the options here which are relevant to pruning:
Valid options are:
-C confidence
Set confidence threshold for pruning. (Default: 0.25)
-M number
Set minimum number of instances per leaf. (Default: 2)
-R
Use reduced error pruning.
-N number
Set number of folds for reduced error pruning. One fold is used as the pruning set. (Default: 3)
It looks like the C argument above might be of help to you, and then maybe R, N, and M.
In order to use those in the function do:
fit <- PART(DELAY_CLASS ~ AMT_TO_PAY + NUMBER_OF_CREDIT_DAYS + AVG_BASE_PRICE,
data= trainingData,
control = Weka_control(R = TRUE, N = 5, M = 100)) #random choices
On a separate note, regarding the accuracy metric:
Comparing the accuracy between the training set and the test set to determine over-fitting is not optimal in my opinion. The model was trained on the training set, and therefore you expect it to work better there than on the test set. A better test is cross-validation. Try performing 10-fold cross-validation first (you could use caret's train function) and then compare the average cross-validation accuracy to your test set's accuracy; I think this will work better. If you do not know what cross-validation is: in general, it splits your training set into smaller training and test sets, trains on the training part, and tests on the test part. You can read more about it here.
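A minimal sketch of that check, staying within RWeka itself rather than caret (the pruning settings are the same arbitrary ones used above; evaluate_Weka_classifier with numFolds = 10 runs 10-fold cross-validation on the training data):
library(RWeka)
fit <- PART(DELAY_CLASS ~ AMT_TO_PAY + NUMBER_OF_CREDIT_DAYS + AVG_BASE_PRICE,
            data = trainingData,
            control = Weka_control(R = TRUE, N = 5, M = 100))
# reports CV accuracy, confusion matrix, etc.; compare this to the test-set accuracy
evaluate_Weka_classifier(fit, numFolds = 10)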

Resources