Pruning a rule-based classification tree (PART algorithm) - r

I am using the PART algorithm in R (via the package RWeka) for multi-class classification. The target attribute is the time bucket in which an invoice will be paid by a customer (e.g. 7-15 days, 15-30 days, etc.). I am using the following code for fitting and predicting from the model:
library(RWeka)  # provides the PART() wrapper
fit <- PART(DELAY_CLASS ~ AMT_TO_PAY + NUMBER_OF_CREDIT_DAYS + AVG_BASE_PRICE, data = trainingData)
predictedTrainingValues <- predict(fit, trainingData)
Using this model, I get around 82% accuracy on the training data, but accuracy on the test data comes out around 59%. I understand that I am over-fitting the model. I tried reducing the number of predictor variables (the predictors in the code above are already the reduced set), but it is not helping much: reducing the number of variables improves accuracy on the test data to around 61% and reduces accuracy on the training data to around 79%.
Since the PART algorithm is based on partial decision trees, another option could be to prune the tree, but I do not know how to prune the tree for the PART algorithm. Searching online, I found that the FOIL criterion can be used for pruning rule-based algorithms, but I could not find an implementation of the FOIL criterion in R or RWeka.
My question is: how can I prune the tree for the PART algorithm? Any other suggestions to improve accuracy on the test data are also welcome.
Thanks in advance!!
NOTE: I calculate accuracy as the number of correctly classified instances divided by the total number of instances.
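For reference, a one-line sketch of that calculation (using the predictedTrainingValues and trainingData objects from the code above):
accuracy <- mean(predictedTrainingValues == trainingData$DELAY_CLASS)  # proportion correctly classified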

In order to prune the tree with PART you need to specify it in the control argument of the function:
There is a complete list of the commands you can pass into the control argument here.
I quote some of the options which are relevant to pruning:
Valid options are:
-C confidence
    Set confidence threshold for pruning. (Default: 0.25)
-M number
    Set minimum number of instances per leaf. (Default: 2)
-R
    Use reduced-error pruning.
-N number
    Set number of folds for reduced-error pruning. One fold is used as the pruning set. (Default: 3)
It looks like the C argument above might be of help to you, and then perhaps also R, N and M.
In order to use those in the function do:
fit <- PART(DELAY_CLASS ~ AMT_TO_PAY + NUMBER_OF_CREDIT_DAYS + AVG_BASE_PRICE,
            data = trainingData,
            control = Weka_control(R = TRUE, N = 5, M = 100))  # random choices
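If you want to try confidence-based pruning instead of reduced-error pruning, a sketch along the same lines (the values are illustrative, not tuned) would be:
fit <- PART(DELAY_CLASS ~ AMT_TO_PAY + NUMBER_OF_CREDIT_DAYS + AVG_BASE_PRICE,
            data = trainingData,
            control = Weka_control(C = 0.1, M = 50))  # a lower C prunes more aggressively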
On a separate note for the accuracy metric:
Comparing the accuracy between the training set and the test set to determine over-fitting is not ideal in my opinion. The model was trained on the training set, and therefore you expect it to do better there than on the test set. A better test is cross-validation. Try performing a 10-fold cross-validation first (you could use caret's train function) and then compare the average cross-validation accuracy to your test set's accuracy; I think this will be a more reliable comparison. If you do not know what cross-validation is: in general, it splits your training set into smaller training and test sets, trains on the smaller training sets and evaluates on the held-out parts. You can read more about it here.
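For example, a hedged sketch of a 10-fold cross-validation with caret (assuming caret's RWeka-backed "PART" method is available, and using the same formula and trainingData as above):
library(caret)
set.seed(123)
cv_fit <- train(DELAY_CLASS ~ AMT_TO_PAY + NUMBER_OF_CREDIT_DAYS + AVG_BASE_PRICE,
                data = trainingData,
                method = "PART",
                trControl = trainControl(method = "cv", number = 10))
cv_fit$results  # average cross-validated accuracy per candidate tuning setting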

Related

Random forest regression - cumulative MSE?

I am new to Random Forests and I have a question about regression. I am using the R package randomForest to calculate RF models.
My final goal is to select sets of variables important for predicting a continuous trait, so I calculate a model, then remove the variable with the lowest mean decrease in accuracy, calculate a new model, and so on. This worked with RF classification, where I compared the models using the OOB errors from prediction (training set) and the development and validation data sets. Now with regression I want to compare the models based on % variation explained and MSE.
I was evaluating the results for MSE and % var explained, and I get exactly the same results when calculating them manually using the predictions from model$predicted. But when I look at model$mse, the value presented corresponds to the MSE for the last tree calculated, and the same happens for % var explained.
As an example you can try this code in R:
library(randomForest)
data("iris")
head(iris)
TrainingX<-iris[1:100,2:4] #creating training set - X matrix
TrainingY<-iris[1:100,1] #creating training set - Y vector
TestingX<-iris[101:150,2:4] #creating test set - X matrix
TestingY<-iris[101:150,1] #creating test set - Y vector
set.seed(2)
model<-randomForest(x=TrainingX, y= TrainingY, ntree=500, #calculating model
xtest = TestingX, ytest = TestingY)
#for prediction (training set)
pred<-model$predicted
meanY<-sum(TrainingY)/length(TrainingY)
varpY<-sum((TrainingY-meanY)^2)/length(TrainingY)
mseY<-sum((TrainingY-pred)^2)/length(TrainingY)
r2<-(1-(mseY/varpY))*100
#for testing (test set)
pred_2<-model$test$predicted
meanY_2<-sum(TestingY)/length(TestingY)
varpY_2<-sum((TestingY-meanY_2)^2)/length(TestingY)
mseY_2<-sum((TestingY-pred_2)^2)/length(TestingY)
r2_2<-(1-(mseY_2/varpY_2))*100
training_set_mse<-c(model$mse[500], mseY)
training_set_rsq<-c(model$rsq[500]*100, r2)
testing_set_mse<-c(model$test$mse[500],mseY_2)
testing_set_rsq<-c(model$test$rsq[500]*100, r2_2)
c<-cbind(training_set_mse,training_set_rsq,testing_set_mse, testing_set_rsq)
rownames(c)<-c("last tree", "by hand")
c
model
As a result of running this code, you will obtain a table containing values for MSE and % var explained (also called rsq). The first line is called "last tree" and contains the values of MSE and % var explained for the 500th tree in the forest. The second line is called "by hand" and contains the results calculated in R from the vectors model$predicted and model$test$predicted.
So, my questions are:
1- Are the predictions of the trees somehow cumulative? Or are they independent from each other? (I thought they were independent)
2- Is the last tree to be considered as an average of all the others?
3- Why are MSE and %var explained of the RF model (presented in the main board when you call model) the same as the ones from the 500th tree (see first line of table)? Do the vectors model$mse or model$rsq contain cumulative values?
After the last edit I found this post from Andy Liaw (one of the creators of the package) that says that MSE and %var explained are in fact cumulative!: https://stat.ethz.ch/pipermail/r-help/2004-April/049943.html.
Not sure I understand what your issue is; I'll give it a try nevertheless...
1- Are the predictions of the trees somehow cumulative? Or are they independent from each other? (I thought they were independent)
You thought correctly; the trees are fit independently of each other, hence their predictions are indeed independent. In fact, this is a crucial advantage of RF models, since it allows for parallel implementations.
2- Is the last tree to be considered as an average of all the others?
No; as clarified above, all trees are independent.
3- If each tree gets a prediction, how can I get the matrix with all the trees, since what I need is the MSE and % var explained for the forest?
Here is where what you ask starts being really unclear, given your code above; the MSE and r2 you say you need are exactly what you are already computing in mseY and r2:
mseY
[1] 0.1232342
r2
[1] 81.90718
which, unsurprisingly, are the very same values reported by model:
model
# result:
Call:
randomForest(x = TrainingX, y = TrainingY, ntree = 500)
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 1
Mean of squared residuals: 0.1232342
% Var explained: 81.91
so I'm not sure I can really see your issue, or what these values have to do with the "matrix with all the trees"...
But when I do model$mse, the value presented corresponds to the value of MSE for the last tree calculated, and the same happens for % var explained.
Most certainly not: model$mse is a vector of length equal to the number of trees (here 500), containing the MSE for each individual tree (but see the UPDATE below); I have never seen any use for this in practice (similarly for model$rsq):
length(model$mse)
[1] 500
length(model$rsq)
[1] 500
UPDATE: Kudos to the OP herself (see comments), who discovered that the quantities in model$mse and model$rsq are indeed cumulative (!); from an old (2004) thread by package maintainer Andy Liaw, Extracting the MSE and % Variance from RandomForest:
Several ways:
Read ?randomForest, especially the `Value' section.
Look at str(myforest.rf).
Look at print.randomForest.
If the forest has 100 trees, then the mse and rsq are vectors with 100 elements each, the i-th element being the mse (or rsq) of the forest consisting of the first i trees. So the last element is the mse (or rsq) of the whole forest.
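A quick sanity check of this (a sketch using the model, TrainingY and r2 objects defined above): the last element of model$mse should match the OOB MSE computed by hand, and likewise for rsq:
all.equal(model$mse[model$ntree], mean((TrainingY - model$predicted)^2))  # OOB MSE of the whole forest
all.equal(model$rsq[model$ntree] * 100, r2)                               # % var explained of the whole forest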

Possibly overfitted classification tree but with stable prediction error

I have a question regarding rpart and overfitting. My goal is only to do well on prediction. My dataset is large, almost 20,000 points. Using around 2.5% of these points for training I get a prediction error of around 50%, but using 97.5% of the data for training I get around 30%. Since I am using so much data for training, I guess there is a risk of overfitting.
I run this 1000 times with random training/test splits plus pruning of the tree, which is some sort of cross-validation if I have understood it correctly, and I get pretty much stable results (the same prediction error and variable importance).
Can overfitting still be a problem, even though I have run this 1000 times and the prediction error is stable?
I also have a question regarding correlation between my explanatory variables. Can that be a problem in CART (as it is in regression)? In regression I would maybe use the lasso to deal with the correlation. How can I handle the correlation in my classification tree?
When I plot the cptree I get this graph:
cptree plot
Here is the code I am running (I have repeated this 1000 times with different random splits each time).
set.seed(1) # For reproducibility
train_frac = 0.975
n = dim(beijing_data)[1]
# Split into training and testing data
ii = sample(seq(1,dim(beijing_data)[1]),n*train_frac)
data_train = beijing_data[ii,]
data_test = beijing_data[-ii,]
fit = rpart(as.factor(PM_Dongsi_levels)~DEWP+HUMI+PRES+TEMP+Iws+
precipitation+Iprec+wind_dir+tod+pom+weekend+month+
season+year+day,
data = data_train, minsplit = 0, cp = 0)
plotcp(fit)
# Find the split with minimum CP and prune the tree
cp_fit = fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"]
pfit = prune(fit, cp = cp_fit)
pp <- predict(pfit, newdata = data_test, type = "class")
err = sum(data_test[,"PM_Dongsi_levels"] != pp)/length(pp)
print(err)
Link to beijing_data (as a RData-file so you can reproduce my example)
https://www.dropbox.com/s/6t3lcj7f7bqfjnt/beijing_data.RData?dl=0
The question is quite complex and will be very hard to answer comprehensively. I will try to provide some insights and references for further reading.
Correlated features do not pose as severe a problem for tree-based methods as they do for models that use a hyperplane as the classification boundary. When there are multiple correlated features, the tree will just pick one and the rest will be ignored. However, correlated features often cloud the interpretability of such a model, mask interactions and so on. Tree-based models can also benefit from the removal of such variables, since they then have a smaller space to search. Here is a decent resource on trees. Also check these videos 1, 2 and 3 and the ISLR book.
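If you do want to remove highly correlated predictors before fitting the tree, one hedged sketch uses caret::findCorrelation (X here is a hypothetical numeric predictor matrix, not an object from your code):
library(caret)
high_cor <- findCorrelation(cor(X), cutoff = 0.9)             # indices of columns to drop
X_reduced <- if (length(high_cor) > 0) X[, -high_cor] else X  # keep one member of each correlated group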
Models based on a single tree tend not to perform as well as hyperplane-based methods. So if you are mainly interested in the quality of prediction, you should explore models based on ensembles of trees, such as bagging and boosting models. Popular implementations of bagging and boosting in R are randomForest and xgboost. Both can be used with little to no experience and can result in good predictions. Here is a resource on how to use the popular R machine learning library caret to tune a random forest. Another resource is the R mlr library, which provides great wrappers for many ML-related tasks; for instance, here is a short blog post on model-based optimization of xgboost.
The resampling strategy for model validation varies with the task and the available data. With 20k rows I would probably use 50-60% for training, 20% for validation and 20-30% as a test set. The 50-60% training set I would use to select a suitable ML method, features, hyperparameters and so on by repeated k-fold cross-validation (2-3 repeats of 4-5-fold CV or similar). The 20% validation set I would use to fine-tune things and to get a feel for how well my cross-validation on the training set generalizes. When I am satisfied with everything, I would use the test set as a final proof that I have a good model. Here are some resources on resampling: 1, 2, 3 and nested resampling.
In your situation I would use
z <- caret::createDataPartition(data$y, p = 0.6, list = FALSE)
train <- data[z,]
test <- data[-z,]
to split the data into train and test sets. I would then repeat the process to split the test set again with p = 0.5.
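For illustration, a sketch of that second split (continuing from the test object above, with y as the generic outcome name; the names are placeholders, not from your data):
z2 <- caret::createDataPartition(test$y, p = 0.5, list = FALSE)
validation <- test[z2, ]   # roughly 20% of the original data, for fine-tuning
final_test <- test[-z2, ]  # roughly 20% of the original data, kept untouched until the end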
On the train data I would use this tutorial on random forests to tune the mtry and ntree parameters (Extend Caret section) using 5-fold repeated cross-validation in caret and a grid search:
control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
tunegrid <- expand.grid(.mtry = c(1:15), .ntree = c(200, 500, 700, 1000, 1200, 1500))
and so on, as detailed in the mentioned link.
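For reference, caret's built-in "rf" method tunes only mtry; tuning ntree as in the grid above requires the custom method from the linked tutorial. With the built-in method, a rough sketch (y and train being the generic names used above) would be:
set.seed(1)
rf_fit <- train(y ~ ., data = train,
                method = "rf",            # randomForest backend; only mtry is tuned here
                metric = "Accuracy",
                tuneGrid = expand.grid(.mtry = 1:15),
                trControl = control)
rf_fit$bestTune  # the mtry value with the best cross-validated accuracy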
On a final note, the more data you have to train on, the less likely you are to over-fit.

R random forest feature selection based on AUC

For binary option prediction (rise, fall) I am trying a random forest in R, but the importance measures and OOB error are biased in my case.
I found this article but it is Python related.
Is there an R package / approach for automatic feature selection that
is based on AUC
maybe allows me to define my own evaluation function (money earned is a function of recall and precision rates)
maybe allows me to specify the cross-validation approach: randomly selecting training and test cases is biased, since this is time-series data where the test data must be later than the training data
I just came across this question and found a package that might help you:
i. It's called AUCRF; it performs feature selection in a random forest model based on optimizing the AUC.
https://cran.r-project.org/web/packages/AUCRF/AUCRF.pdf
ii. It does allow cross-validation of your AUC-based selection:
AUCRFcv(x, nCV = 5, M = 20)
where nCV is the number of folds and M is the number of repeats.
iii. Regarding allowing your own evaluation function, it does have an option where you can specify the formula using ~, but you will have to explore that more for your specific case, since you have not provided test code.
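For illustration, a minimal sketch of how AUCRF could be used (train_data and the binary response rise_fall are hypothetical placeholders; the response has to be a two-level factor coded as the package expects):
library(AUCRF)
fit <- AUCRF(rise_fall ~ ., data = train_data)  # AUC-based variable selection
summary(fit)                                    # selected variables and optimal AUC
fit_cv <- AUCRFcv(fit, nCV = 5, M = 20)         # cross-validated repetition of the selection
OptimalSet(fit)                                 # the selected variable set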
Hope this helps!

randomForest using R for regression, make sense?

I want to examine which variable impacts the outcome the most; in my data the outcome is the stock yield. My data is like below, and my code is also attached.
library(randomForest)
require(data.table)
data = fread("C:/stockcrazy.csv")
PEratio <- data$offeringPE/data$industryPE
data_update <- data.frame(data,PEratio)
train <- data_update[1:47,]
test <- data_update[48:57,]
For the train and test subsets above, I am not sure whether I need to do cross-validation on this data, and I don't know how to do it.
data.model <- randomForest(yield ~ offerings + offerprice + PEratio + count + bingo + purchase,
                           data = train, importance = TRUE)
par(mfrow = c(1, 1))
varImpPlot(data.model, n.var = 6, main = "Random Forests: Top 6 Important Variables")
importance(data.model)
plot(data.model)
model.pred <- predict(data.model, newdata=test)
model.pred
d <- data.frame(test,model.pred)
I am not sure whether the result for %IncMSE is good or bad. Can someone interpret this?
Additionally, I found that the predicted values for the test data are not a good prediction of the real data. How can I improve this?
Let's see. Let's start with %IncMSE:
I found this really good answer on cross validated about %IncMSE which I quote:
if a predictor is important in your current model, then assigning other values for that predictor randomly but 'realistically' (i.e.: permuting this predictor's values over your dataset), should have a negative influence on prediction, i.e.: using the same model to predict from data that is the same except for the one variable, should give worse predictions.
So, you take a predictive measure (MSE) with the original dataset and then with the 'permuted' dataset, and you compare them somehow. One way, particularly since we expect the original MSE to always be smaller, the difference can be taken. Finally, for making the values comparable over variables, these are scaled.
This means that in your case the most important variable is purchase: when the values of purchase were permuted (i.e. their order was randomly changed), the MSE of the resulting model was about 12% higher than with the variable in its original order, and this is what makes it the most important variable here. Variable importance is just a measure of how important your predictor variables were in the model you used. In your case purchase was the most important and P/E ratio the least important (of those 6 variables). This is not something you can interpret as good or bad, because it does not show you how well the model fits unseen data. I hope this is clear now.
For the cross-validation:
You do not need to do a separate cross-validation during the training phase because it happens automatically: approximately 2/3 of the records are used for the creation of each tree and the 1/3 that is left out (the out-of-bag data) is used to assess the tree afterwards (the R-squared is computed using the OOB data).
As for the improvement of the model:
By looking at just the first 10 lines of the predicted and actual values of yield, you cannot safely decide whether the model is good or bad. What you need is a measure of fit. The most common one is the R-squared. It is simplistic, but for comparing models and getting a first opinion about your model it does its job. It is calculated by the model for every tree that you grow and can be accessed via data.model$rsq (a short sketch of how to check it is shown after the list below). It ranges from 0 to 1, with 1 being a perfect model and 0 indicating a really poor fit (it can sometimes even take negative values, which indicates a bad fit). If your rsq is bad, then you can try the following to improve your model, although it is not certain that you will get the results you wish for:
Calibrate your trees in a different way: change the number of trees grown and prune the trees by specifying a larger nodesize. (Here you use the default 500 trees and a nodesize of 5, which might overfit your model.)
Increase the number of variables if possible.
Choose a different model. There are cases where a random forest would not work well.
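A short sketch of the fitness check mentioned above (using the data.model, test and model.pred objects from the question's code; treat it as illustrative):
tail(data.model$rsq, 1)  # OOB R-squared of the full forest
1 - sum((test$yield - model.pred)^2) / sum((test$yield - mean(test$yield))^2)  # R-squared on the test set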

Leave one out cross validation with lm function in R

I have a dataset of 506 rows on which I am performing leave-one-out cross-validation. Once I get the mean squared errors, I compute the mean of the mean squared errors I found. This changes every time I run it. Is this expected? If so, can someone please explain why it changes every time I run it?
To do leave-one-out CV, I shuffle the rows first (df is the data frame):
df <-df[sample.int(nrow(df)),]
Then I split the data frame into 506 data frames, send each one to lm(), and get the MSE for each data frame (in this case, each row):
fit <- lm(train[, lastcolumn] ~ ., data = train)
pred <- predict(fit, test)
pred <- mean((pred - test[, lastcolumn])^2)
And then I take the mean of all the MSEs I got.
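For reference, a minimal self-contained sketch of that loop (assuming df is the data frame from the question, with the response in its last column):
response <- names(df)[ncol(df)]
fmla <- reformulate(setdiff(names(df), response), response)
mses <- sapply(seq_len(nrow(df)), function(i) {
  fit  <- lm(fmla, data = df[-i, ])                      # fit on all rows except row i
  pred <- predict(fit, newdata = df[i, , drop = FALSE])  # predict the held-out row
  (pred - df[i, response])^2
})
mean(mses)  # mean of the leave-one-out squared errors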
Every time I run all this, I get a different mean. Is this expected?
Leave-one-out cross-validation is a validation paradigm. You have to state what algorithm you are using for your predictions, and you have to check whether there is some random initialization of the parameters in that prediction algorithm. If that initialization changes randomly, it could explain a different result every time the underlying algorithm is run. You have to mention which estimator / prediction algorithm you are using. If you use, for example, a Gaussian mixture model for classification with different initializations for the means and covariances, that would be an algorithm whose performance is not necessarily the same in every LOOCV run. Gaussian mixture models and k-means algorithms typically randomize the selection of data points used to initialize the means. Also, the number of Gaussians in the mixture could change with different initializations if an information-theoretic criterion is used to estimate the number of Gaussians.
