R random forest feature selection based on AUC - r

For binary option prediction (rise, fall) I am trying random forest in R but the importance measures and OOB are biased in my case
I found this article but it is Python related.
Is there an R package approach for automatic feature selection that
is based on AUC
maybe allows me to define my own evaluation function (money earned is function of recall and precision rates)
maybe allows me to specify the cross-validation approach: randomly selecting traing and test case is biased, as there are timeseries data, where test data must be later than training data

I just came across this question, I found this package that might help you:
i. It's called AUCRF, it performs feature selection in a random forest model based on optimizing AUC.
https://cran.r-project.org/web/packages/AUCRF/AUCRF.pdf
ii. It does allow cross-validation of your AUC based selection
AUCRFcv(x, nCV = 5, M = 20)
where nCV is number of folds, M = number of repeats.
iii. Regarding allowing your own evaluation, it does have an option where you can specify the formula using ~ but you will have to explore that more for your specific case, since you have not provided test code.
Hope this helps!

Related

Use glm to predict on fresh data

I'm relatively new to glm - so please bear with me.
I have created a glm (logistic regression) to predict whether an individual CONTINUES studies ("0") or does NOTCONTINUE ("1"). I am interested in predicting the latter. The glm uses seven factors in the dataset and the confusion matrices are very good for what I need and combining seven years' of data have also been done. Straight-forward.
However, I now need to apply the model to the current years' data, which of course does not have the NOTCONTINUE column in it. Lets say the glm model is "CombinedYears" and the new data is "Data2020"
How can I use the glm model to get predictions of who will ("0") or will NOT ("1") continue their studies? Do I need to insert a NOTCONTINUE column into the latest file ?? I have tried this structure
Predict2020 <- predict(CombinedYears, data.frame(Data2020), type = 'response')
but the output only holds values <0.5.
Any help very gratefully appreciated. Thank you in advance
You mentioned that you already created a prediction model to predict whether a particular student will continue studies or not. You used the glm package and your model name is CombinedYears.
Now, what you have to know is that your problem is a binary classification and you used logistic regression for this. The output of your model when you apply it on new data, or even the same data used to fit the model, is probabilities. These are values between zero and one. In the development phase of your model, you need to determine the cutoff threshold of these probabilities which you can use later on when you predict new data. For example, you may determine 0.5 as a cutoff, and every probability above that is considered NOTCONTINUE and below that is CONTINUE. However, the best threshold can be determined from your data as well by maximizing both specificity and sensitivity. This can be done by calculating the area under the receiver operating characteristic curve (AUC). There are many packages than can do this for you, such as pROC and AUC packages in R. The same packages can determine the best cutoff as well.
What you have to do is the following:
Determine the cutoff threshold after calculating the AUC
library(pROC)
roc_object = roc(your_fit_data$NOTCONTINUE ~ fitted(CombinedYears))
coords(roc.roc_object, "best", ret="threshold", transpose = FALSE)
Use your model to predict on your new data year (as you did)
Predict2020 = predict(CombinedYears, data.frame(Data2020), type = 'response')
Now, the content of Predict2020 is just probabilities for each
student. Use the cutoff you obtained from step (1) to classify your
students accordingly

MLR package: generateFilterValuesData chi.squared and information.gain

I am experimenting with the mlr package and would like to get chi-squared and information-gain values.
library(mlr)
library(FSelector)
data(PimaIndiansDiabetes)
indi <- sample(1:nrow(PimaIndiansDiabetes), 0.6 * nrow(PimaIndiansDiabetes))
train <- PimaIndiansDiabetes[indi,]
trainTask <- makeClassifTask(data = train, target = "diabetes", positive = "pos")
#Feature importance
im_feat <- generateFilterValuesData(trainTask, method = c("information.gain","chi.squared"))
plotFilterValues(im_feat)
im_feat
I am not sure about the consequences that there are two zeros in information.gain and chi.squared for the variables triceps and pressure. Does that indicate I should not use them for setting up a model (e.g. random forest)?
When I use
tbl <- table(train$triceps, train$diabetes)
chisq.test(tbl)
it gives me 60.473 for chi-squared. Why is it not 0? What's the difference between chisq and the chi-squared-method from mlr?
Regarding your first question, values of 0 generally indicate that the feature is not predictive wrt the variable that you're interested, based on the particular evaluation method that you applied. This does not necessarily mean that the same is true for a particular type of model, and hence it usually doesn't make sense to remove it. Apart from that, many models perform feature selection internally (one of these being random forests), so this kind of preprocessing doesn't make sense in general, unless you have so many features that a random forest takes too long to build a model, for example.
The chi.squared test in mlr and chi.sq are based on different implementations; not sure why they're not returning the same result.

How to build regression models and then compare their fits with data held out from the model training-testing?

I have been building a couple different regression models using the caret package in R in order to make predictions about how fluorescent certain genetic sequences will become under certain experimental conditions.
I have followed the basic protocol of splitting my data into two sets: one "training-testing set" (80%) and one "hold-out set" (20%), the former of which would be utilized to build the models, and the latter would be used to test them in order to compare and pick the final model, based on metrics such as their R-squared and RMSE values. One such guide of the many I followed can be found here (http://www.kimberlycoffey.com/blog/2016/7/16/compare-multiple-caret-run-machine-learning-models).
However, I run into a block in that I do not know how to test and compare the different models based on how well they can predict the scores in the hold-out set. In the guide I linked to above, the author uses a ConfusionMatrix in order to calculate the specificity and accuracy for each model after building a predict.train object that applied the recently built models on the hold-out set of data (which is referred to as test in the link). However, ConfusionMatrix can only be applied to classification models, wherein the outcome (or response) is a categorical value (as far as my research has indicated. Please correct me if this is incorrect, as I have not been able to conclude without any doubt that this is the case).
I have found that the resamples method is capable of comparing multiple models against each other (source: https://www.rdocumentation.org/packages/caret/versions/6.0-77/topics/resamples), but it cannot take into account how the new models fit with the data that I excluded from the training-testing sessions.
I tried to create predict objects using the recently built models and hold-out data, then calculate Rsquared and RMSE values using caret's R2 and RMSE methods. But I'm not sure if such an approach is best possible way for comparing and picking the best model.
At this point, I should note that all the model building methods I am using are based on linear regression, since I need to be able to extract the coefficients and apply them in a separate Python script.
Another option I considered was setting a threshold in my outcome, wherein any genetic sequence that had a fluorescence value over 100 was considered useful, while sequences scoring values under 100 were not. This would allow me utilize the ConfusionMatrix. But I'm not sure how I should implement this within my R code to make these two classes in my outcome variable. I'm further concerned that this approach might make it difficult to apply my regression models to other data and make predictions.
For what it's worth, each of the predictors is either an integer or a float, and have ranges that are not normally distributed.
Here is the code I thus far been using:
library(caret)
data <- read.table("mydata.csv")
sorted_Data<- data[order(data$fluorescence, decreasing= TRUE),]
splitprob <- 0.8
traintestindex <- createDataPartition(sorted_Data$fluorescence, p=splitprob, list=F)
holdoutset <- sorted_Data[-traintestindex,]
trainingset <- sorted_Data[traintestindex,]
traindata<- trainingset[c('x1', 'x2', 'x3', 'x4', 'x5', 'fluorescence')]
cvCtrl <- trainControl(method = "repeatedcv", number= 20, repeats = 20, verboseIter = FALSE)
modelglmStepAIC <- train(fluorescence~., traindata, method = "glmStepAIC", preProc = c("center","scale"), trControl = cvCtrl)
model_rlm <- train(fluorescence~., traindata, method = "rlm", preProc = c("center","scale"), trControl = cvCtrl)
pred_glmStepAIC<- predict.lm(modelglmStepAIC$finalModel, holdoutset)
pred_rlm<- predict.lm(model_rlm$finalModel, holdoutset)
glmStepAIC_r2<- R2(pred_glmStepAIC, holdoutset$fluorescence)
glmStepAIC_rmse<- RMSE(pred_glmStepAIC, holdoutset$fluorescence)
rlm_r2<- R2(pred_rlm, holdoutset$fluorescence)
rlm_rmse<- RMSE(pred_rlm, holdoutset$fluorescence)
The out-of-sample performance measures offered by Caret are RMSE, MAE and squared correlation between fitted and observed values (called R2). See more info here https://topepo.github.io/caret/measuring-performance.html
At least in time series regression context, RMSE is the standard measure for out-of-sample performance of regression models.
I would advise against discretising continuous outcome variable, because you are essentially throwing away information by discretising.

Pruning rule based classification tree (PART algorithm)

I am using PART algorithm in R (via package RWeka) for multi-class classification. Target attribute is time bucket in which an invoice will be paid by customer (like 7-15 days, 15-30 days etc). I am using following code for fitting and predicting from the model :
fit <- PART(DELAY_CLASS ~ AMT_TO_PAY + NUMBER_OF_CREDIT_DAYS + AVG_BASE_PRICE, data= trainingData)
predictedTrainingValues <- predict(fit, trainingData)
By using this model, I am getting around 82 % accuracy on training data. But accuracy on test data comes around 59 %. I understand that I am over-fitting the model. I tried to reduce the number of predictor variables (predictor variables in above code are reduced variables), but it is not helping much.Reducing the number of variables improves accuracy on test data to around 61 % and reduces the accuracy on training data to around 79 %.
Since PART algorithm is based on partial decision tree, another option can be to prune the tree. But I am not aware that how to prune tree for PART algorithm. On internet search, I found that FOIL criteria can be used for pruning rule based algorithm. But I am not able to find implementation of FOIL criterion in R or RWeka.
My question is that how to prune tree for PART algorithm, or any other suggestion to improve accuracy on test data are also welcome.
Thanks in advance!!
NOTE : I calculate accuracy as number of correctly classified instances divided by total number of instances.
In order to prune the tree with PART you need to specify it in the control argument of the function:
There is a complete list of the commands you can pass into the control argument here
I quote some of the options here which are relevant to pruning:
Valid options are:
-C confidence
Set confidence threshold for pruning. (Default: 0.25)
M number
Set minimum number of instances per leaf. (Default: 2)
-R
Use reduced error pruning.
-N number
Set number of folds for reduced error pruning. One fold is used as the pruning set. (Default: 3)
Looks like the C argument from above might be of help to you and then maybe R and N and M.
In order to use those in the function do:
fit <- PART(DELAY_CLASS ~ AMT_TO_PAY + NUMBER_OF_CREDIT_DAYS + AVG_BASE_PRICE,
data= trainingData,
control = Weka_control(R = TRUE, N = 5, M = 100)) #random choices
On a separate note for the accuracy metric:
Comparing the accuracy between the training set and the test set to determine over-fitting is not optimal in my opinion. The model was trained on the training set and therefore you expect it to work better there than the test set. A better test is cross-validation. Try performing a 10-fold cross-validation first (you could use caret's function train) and then compare the average cross-validation accuracy to your test set's accuracy. I think this will better. If you do not know what cross-validation is, in general it splits your training set into smaller training and tests sets and trains on the training and tests on the test set. Can read more about it here.

Feature Selection for Regression Models in R

I’m trying to find a Feature Selection Package in R that can be used for Regression most of the packages implement their methods for classification using a factor or class for the response variable. In particular I’m interested if there’s a method using Random Forest for that purpose. Also a good paper in this field would be helpfull.
IIRC the randomForest package also does regression trees. You could start with the Breiman paper and go from there.
There are many ways you can use randomforest for calculating variable importance.
I. Mean Decrease Impurity (MDI) / Gini Importance :
This makes use of a random forest model or a decision tree. When training a tree, it is measured by how much each feature decreases the weighted impurity in a tree. For a forest, the impurity decrease from each feature can be averaged and the features are ranked according to this measure. Here is an example of the same using R.
fit <- randomForest(Target ~.,importance = T,ntree = 500, data=training_data)
var.imp1 <- data.frame(importance(fit, type=2))
var.imp1$Variables <- row.names(var.imp1)
varimp1 <- var.imp1[order(var.imp1$MeanDecreaseGini,decreasing = T),]
par(mar=c(10,5,1,1))
giniplot <- barplot(t(varimp1[-2]/sum(varimp1[-2])),las=2,
cex.names=1,
main="Gini Impurity Index Plot")
And the output will look like this: Gini Importance Plot
II. Permutation Importance or Mean Decrease in Accuracy (MDA) : Permutation Importance or Mean Decrease in Accuracy (MDA) is assessed for each feature by removing the association between that feature and the target. This is achieved by randomly permuting the values of the feature and measuring the resulting increase in error. The influence of the correlated features is also removed. Example in R:
fit <- randomForest(Target ~.,importance = T,ntree = 500, data=training_data)
var.imp1 <- data.frame(importance(fit, type=1))
var.imp1$Variables <- row.names(var.imp1)
varimp1 <- var.imp1[order(var.imp1$MeanDecreaseGini,decreasing = T),]
par(mar=c(10,5,1,1))
giniplot <- barplot(t(varimp1[-2]/sum(varimp1[-2])),las=2,
cex.names=1,
main="Permutation Importance Plot")
This two are are the ones which use Random Forest directly. There are some more easy to use metrics for variable importance calculation purpose. 'Boruta' method and Weight of evidence (WOE) and Information Value (IV) might also be helpful.

Resources