Feature Selection for Regression Models in R

I'm trying to find a feature selection package in R that can be used for regression; most of the packages implement their methods for classification, using a factor or class for the response variable. In particular, I'm interested in whether there's a method that uses Random Forest for that purpose. A good paper in this field would also be helpful.

IIRC the randomForest package also does regression trees. You could start with the Breiman paper and go from there.
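As a minimal sketch of the regression case, the block below fits a regression forest and reads off both importance measures. The built-in mtcars data and the choice of mpg as the response are illustrative assumptions, not part of the question.

```r
library(randomForest)

set.seed(1)
# A numeric response makes randomForest() grow a regression forest
fit <- randomForest(mpg ~ ., data = mtcars, ntree = 500, importance = TRUE)

# For regression the importance matrix has columns %IncMSE (permutation-based)
# and IncNodePurity (impurity-based), not the Gini/accuracy columns of classification
imp <- importance(fit)
imp_sorted <- imp[order(imp[, "%IncMSE"], decreasing = TRUE), ]
print(round(imp_sorted, 2))
```

Predictors near the bottom of the %IncMSE ranking are the natural candidates to drop, although any cutoff is a judgment call.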

There are several ways you can use a random forest to calculate variable importance.
I. Mean Decrease Impurity (MDI) / Gini Importance :
This makes use of a random forest model or a decision tree. When a tree is trained, each feature is scored by how much its splits decrease the weighted impurity in the tree. For a forest, the impurity decrease from each feature is averaged over the trees, and the features are ranked according to this measure. Here is an example in R:
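library(randomForest)
# type = 2 is node-impurity importance: MeanDecreaseGini for a classification forest
# (a regression forest reports this column as IncNodePurity instead)
fit <- randomForest(Target ~ ., importance = TRUE, ntree = 500, data = training_data)
var.imp1 <- data.frame(importance(fit, type = 2))
var.imp1$Variables <- row.names(var.imp1)
varimp1 <- var.imp1[order(var.imp1$MeanDecreaseGini, decreasing = TRUE), ]
par(mar = c(10, 5, 1, 1))
# drop the Variables column and plot each importance as a share of the total
giniplot <- barplot(t(varimp1[-2] / sum(varimp1[-2])), las = 2,
                    cex.names = 1,
                    main = "Gini Impurity Index Plot")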
And the output will look like this: [Figure: Gini Importance Plot]
II. Permutation Importance or Mean Decrease in Accuracy (MDA) :
Permutation importance is assessed for each feature by removing the association between that feature and the target. This is achieved by randomly permuting the values of the feature and measuring the resulting increase in error. Because the permutation also breaks the feature's association with the other predictors, the influence of correlated features is removed as well. Example in R:
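# type = 1 is permutation importance: MeanDecreaseAccuracy for a classification forest
# (for a regression forest the column is %IncMSE, which data.frame() renames to X.IncMSE)
fit <- randomForest(Target ~ ., importance = TRUE, ntree = 500, data = training_data)
var.imp1 <- data.frame(importance(fit, type = 1))
var.imp1$Variables <- row.names(var.imp1)
# sort by the MeanDecreaseAccuracy column returned by type = 1
varimp1 <- var.imp1[order(var.imp1$MeanDecreaseAccuracy, decreasing = TRUE), ]
par(mar = c(10, 5, 1, 1))
permplot <- barplot(t(varimp1[-2] / sum(varimp1[-2])), las = 2,
                    cex.names = 1,
                    main = "Permutation Importance Plot")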
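To make the permutation idea concrete, here is a hand-rolled sketch in base R. It is not how randomForest computes the measure internally (the package permutes within each tree's out-of-bag samples); this version uses a plain lm() and a held-out set, and the simulated data, the 200/100 split, and the 50 shuffles are all assumptions chosen for illustration.

```r
set.seed(123)
n <- 300
# Simulated data: x1 matters a lot, x2 a little, noise not at all
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
d$y <- 3 * d$x1 + 1 * d$x2 + rnorm(n)

train <- d[1:200, ]
test <- d[201:300, ]
fit <- lm(y ~ x1 + x2 + noise, data = train)

# Mean squared error of a fitted model on a data set
mse <- function(model, data) mean((data$y - predict(model, newdata = data))^2)
baseline <- mse(fit, test)

# For each feature: shuffle its column, re-score, and average the increase in MSE
perm_importance <- sapply(c("x1", "x2", "noise"), function(v) {
  increases <- replicate(50, {
    shuffled <- test
    shuffled[[v]] <- sample(shuffled[[v]])
    mse(fit, shuffled) - baseline
  })
  mean(increases)
})
print(sort(perm_importance, decreasing = TRUE))
```

A feature whose shuffling barely changes the error contributes little to the predictions, which is why it ends up at the bottom of the ranking.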
These two are the ones that use Random Forest directly. There are other easy-to-use metrics for calculating variable importance: the 'Boruta' method, and Weight of Evidence (WOE) and Information Value (IV), might also be helpful.

Related

R - Interpreting Random Forest Importance

I'm working with random forest models in R as part of an independent research project. I have fit my random forest model and generated the overall importance of each predictor to the model's accuracy. However, in order to interpret my results in a research paper, I need to understand whether the variables have a positive or negative impact on the response variable.
Is there a way to produce this information from a random forest model? For example, I expect age to have a positive impact on the likelihood that a surgical complication occurs, but the existence of osteoarthritis not so much.
Code:
library(randomForest)
# cutoff: predicted probabilities of 10% or higher are classified as 'Complication' occurring
surgery.bagComp <- randomForest(
  complication ~ ahrq_ccs + age + asa_status + bmi + baseline_cancer + baseline_cvd +
    baseline_dementia + baseline_diabetes + baseline_digestive + baseline_osteoart +
    baseline_psych + baseline_pulmonary,
  data = surgery, mtry = 2, importance = TRUE, cutoff = c(0.90, 0.10))
surgery.bagComp  # Get stats for random forest model
imp <- as.data.frame(importance(surgery.bagComp))  # Importance of each variable in the model
imp <- cbind(vars = rownames(imp), imp)
imp <- imp[order(imp$MeanDecreaseAccuracy), ]
imp$vars <- factor(imp$vars, levels = imp$vars)
dotchart(imp$MeanDecreaseAccuracy, imp$vars,
         xlim = c(0, max(imp$MeanDecreaseAccuracy)), pch = 16,
         xlab = "Mean Decrease Accuracy",
         main = "Complications - Variable Importance Plot", color = "black")
Importance Plot:
Any suggestions/areas of research anyone can suggest would be greatly appreciated.
In order to interpret my results in a research paper, I need to understand whether the variables have a positive or negative impact on the response variable.
You need to perform "feature impact" analysis, not "feature importance" analysis.
Algorithmically, it's about traversing the decision tree data structures and observing the impact of each split on the prediction outcome. For example, consider the split "age <= 40". Does the left branch (condition evaluates to true) carry a lower likelihood than the right branch (condition evaluates to false)?
Feature importances may give you a hint about which features to look at, but they cannot be "transformed" into feature impacts.
You might find the following articles helpful: WHY did your model predict THAT? (Part 1 of 2) and WHY did your model predict THAT? (Part 2 of 2).
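One readily available way to see the direction of an effect is a partial dependence curve, which is a different technique from the tree traversal described above but answers a similar question. The sketch below uses randomForest's partialPlot() on simulated data; the variable names, the simulated relationship between age and complication, and the use of a rank correlation as a one-number summary of the trend are all assumptions for illustration.

```r
library(randomForest)

set.seed(42)
n <- 500
# Simulated patients: complication risk rises with age, bmi is irrelevant
sim <- data.frame(age = runif(n, 20, 90), bmi = rnorm(n, 27, 4))
p <- plogis(-4 + 0.06 * sim$age)
sim$complication <- factor(ifelse(runif(n) < p, "yes", "no"))

fit <- randomForest(complication ~ age + bmi, data = sim, ntree = 300)

# Partial dependence of the "yes" class on age; for classification the y values
# are on a centred log-probability scale, so read the shape rather than the units
pd <- partialPlot(fit, pred.data = sim, x.var = "age",
                  which.class = "yes", plot = FALSE)

# A crude summary of direction: positive means risk increases with age
trend <- cor(pd$x, pd$y, method = "spearman")
print(trend)
```

Calling partialPlot() with plot = TRUE draws the curve itself, which is usually more informative than a single number because random forest effects need not be monotone.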

R language, how to use bootstraps to generate maximum likelihood and AICc?

Sorry for what may be a basic question. I am doing multiple comparisons of morphological traits through correlations of bootstrapped data. I'm curious whether such multiple comparisons are affecting my level of inference, and what the effect of the potential multicollinearity in my data is. Perhaps a reasonable option would be to use my bootstraps to generate maximum likelihood estimates and then generate AICc values to compare all of my parameters, to see what comes out as most important. The problem is that although the approach is more or less clear to me, I don't know how to implement it in R. Can anybody shed some light on this for me?
So far, here is an example (using R, but not my data):
library(boot)
data(iris)
head(iris)
# The statistic: Pearson correlation plus the two medians, on a bootstrap resample
pearson <- function(data, indices){
  dt <- data[indices, ]
  c(
    cor(dt[,1], dt[,2], method='p'),
    median(dt[,1]),
    median(dt[,2])
  )
}
# One example: iris$Sepal.Length ~ iris$Sepal.Width
# I calculate Pearson's r (not r-squared) with 1000 replications
set.seed(12345)
dat <- iris[,c(1,2)]
dat <- na.omit(dat)
results <- boot(dat, statistic=pearson, R=1000)
# 95% CIs
boot.ci(results, type="bca")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = results, type = "bca")
Intervals :
Level BCa
95% (-0.2490, 0.0423 )
Calculations and Intervals on Original Scale
plot(results)
I have several more pairs of comparisons.
More of a Cross Validated question.
Multicollinearity shouldn't be a problem if you're just assessing the relationship between two variables (in your case correlation). Multicollinearity only becomes an issue when you fit a model, e.g. multiple regression, with several highly correlated predictors.
Multiple comparisons are always a problem, though, because they inflate your type-I error rate. The way to address that is to apply a multiple comparison correction, e.g. Bonferroni-Holm or the less conservative FDR. That can have its downsides, especially if you have a lot of predictors and few observations: it may lower your power so much that you won't be able to find any effect, no matter how big it is.
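As a small illustration of such a correction, the sketch below runs all pairwise correlation tests on the four numeric iris columns and adjusts the p-values with base R's p.adjust(). Using plain cor.test() p-values instead of bootstrap intervals is a simplifying assumption made to keep the example short.

```r
vars <- names(iris)[1:4]
pairs <- combn(vars, 2)  # the 6 pairwise comparisons, one per column

# Raw p-value of the Pearson correlation test for each pair
p_raw <- apply(pairs, 2, function(v) cor.test(iris[[v[1]]], iris[[v[2]]])$p.value)
names(p_raw) <- apply(pairs, 2, paste, collapse = " ~ ")

adjusted <- data.frame(
  raw = p_raw,
  holm = p.adjust(p_raw, method = "holm"),  # Bonferroni-Holm, controls family-wise error
  BH = p.adjust(p_raw, method = "BH")       # Benjamini-Hochberg, controls the FDR
)
print(signif(adjusted, 3))
```

The Holm-adjusted values are never smaller than the BH-adjusted ones, which is what "less conservative" means in practice.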
In a high-dimensional setting like this, your best bet may be some sort of regularized regression method. With regularization, you put all predictors into your model at once, similarly to multiple regression; the trick is that you constrain the model so that all of the regression slopes are pulled towards zero, so that only the ones with big effects "survive". The machine learning versions of regularized regression are called ridge, LASSO, and elastic net, and they can be fitted using the glmnet package. There are also Bayesian equivalents in so-called shrinkage priors, such as the horseshoe (see e.g. https://avehtari.github.io/modelselection/regularizedhorseshoe_slides.pdf). You can fit Bayesian regularized regression using the brms package.
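A minimal LASSO sketch with glmnet might look like the following. The simulated predictors (20 columns, only x1 and x2 truly related to the outcome) and the choice of the "lambda.1se" penalty are assumptions for illustration, not a recommendation for the morphological data in the question.

```r
library(glmnet)

set.seed(2024)
n <- 100
p <- 20
# glmnet wants a numeric matrix of predictors, not a data frame or formula
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- 3 * x[, "x1"] - 2 * x[, "x2"] + rnorm(n)

# alpha = 1 is the LASSO, alpha = 0 is ridge, values in between are elastic net
cv_fit <- cv.glmnet(x, y, alpha = 1)

# Coefficients at the more heavily penalised of the two standard lambda choices
b <- as.matrix(coef(cv_fit, s = "lambda.1se"))[-1, 1]  # drop the intercept
selected <- names(b)[b != 0]
print(selected)
```

The predictors whose slopes are not shrunk all the way to zero are the ones that "survive" in the sense described above.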

How to build regression models and then compare their fits with data held out from the model training-testing?

I have been building a couple different regression models using the caret package in R in order to make predictions about how fluorescent certain genetic sequences will become under certain experimental conditions.
I have followed the basic protocol of splitting my data into two sets: one "training-testing set" (80%) and one "hold-out set" (20%), the former of which would be utilized to build the models, and the latter would be used to test them in order to compare and pick the final model, based on metrics such as their R-squared and RMSE values. One such guide of the many I followed can be found here (http://www.kimberlycoffey.com/blog/2016/7/16/compare-multiple-caret-run-machine-learning-models).
However, I run into a block in that I do not know how to test and compare the different models based on how well they can predict the scores in the hold-out set. In the guide I linked to above, the author uses a ConfusionMatrix in order to calculate the specificity and accuracy for each model after building a predict.train object that applied the recently built models on the hold-out set of data (which is referred to as test in the link). However, ConfusionMatrix can only be applied to classification models, wherein the outcome (or response) is a categorical value (as far as my research has indicated. Please correct me if this is incorrect, as I have not been able to conclude without any doubt that this is the case).
I have found that the resamples method is capable of comparing multiple models against each other (source: https://www.rdocumentation.org/packages/caret/versions/6.0-77/topics/resamples), but it cannot take into account how the new models fit with the data that I excluded from the training-testing sessions.
I tried to create predict objects using the recently built models and the hold-out data, then calculate R-squared and RMSE values using caret's R2 and RMSE methods. But I'm not sure if such an approach is the best possible way of comparing and picking the best model.
At this point, I should note that all the model building methods I am using are based on linear regression, since I need to be able to extract the coefficients and apply them in a separate Python script.
Another option I considered was setting a threshold in my outcome, wherein any genetic sequence that had a fluorescence value over 100 was considered useful, while sequences scoring values under 100 were not. This would allow me to utilize the ConfusionMatrix. But I'm not sure how I should implement this within my R code to make these two classes in my outcome variable. I'm further concerned that this approach might make it difficult to apply my regression models to other data and make predictions.
For what it's worth, each of the predictors is either an integer or a float, and has a range that is not normally distributed.
Here is the code I have thus far been using:
library(caret)
# read.csv() assumes a comma-separated file with a header row
data <- read.csv("mydata.csv")
sorted_Data <- data[order(data$fluorescence, decreasing = TRUE), ]
splitprob <- 0.8
traintestindex <- createDataPartition(sorted_Data$fluorescence, p = splitprob, list = FALSE)
holdoutset <- sorted_Data[-traintestindex, ]
trainingset <- sorted_Data[traintestindex, ]
traindata <- trainingset[c('x1', 'x2', 'x3', 'x4', 'x5', 'fluorescence')]
cvCtrl <- trainControl(method = "repeatedcv", number = 20, repeats = 20, verboseIter = FALSE)
modelglmStepAIC <- train(fluorescence ~ ., traindata, method = "glmStepAIC", preProc = c("center", "scale"), trControl = cvCtrl)
model_rlm <- train(fluorescence ~ ., traindata, method = "rlm", preProc = c("center", "scale"), trControl = cvCtrl)
# predict() on the train object applies the same centering/scaling to the hold-out data;
# calling predict.lm() on $finalModel would feed it unscaled predictors
pred_glmStepAIC <- predict(modelglmStepAIC, newdata = holdoutset)
pred_rlm <- predict(model_rlm, newdata = holdoutset)
# hold-out R-squared and RMSE for each model
glmStepAIC_r2 <- R2(pred_glmStepAIC, holdoutset$fluorescence)
glmStepAIC_rmse <- RMSE(pred_glmStepAIC, holdoutset$fluorescence)
rlm_r2 <- R2(pred_rlm, holdoutset$fluorescence)
rlm_rmse <- RMSE(pred_rlm, holdoutset$fluorescence)
The out-of-sample performance measures offered by caret are RMSE, MAE, and the squared correlation between fitted and observed values (called R2). See more info here: https://topepo.github.io/caret/measuring-performance.html
At least in the time series regression context, RMSE is the standard measure of out-of-sample performance for regression models.
I would advise against discretising the continuous outcome variable, because you are essentially throwing away information by discretising.
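The comparison itself can be laid out as one row of hold-out metrics per model. The sketch below does this in base R with two lm() fits on simulated data; the data, the model names, and the helper holdout_metrics() are hypothetical and stand in for the caret models from the question.

```r
set.seed(1)
n <- 200
# Simulated stand-in for the fluorescence data: x3 is pure noise
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$fluorescence <- 50 + 10 * d$x1 - 5 * d$x2 + rnorm(n, sd = 3)

# 80/20 split into training and hold-out rows
idx <- sample(n, size = 0.8 * n)
train <- d[idx, ]
holdout <- d[-idx, ]

models <- list(
  full = lm(fluorescence ~ x1 + x2 + x3, data = train),
  x1_only = lm(fluorescence ~ x1, data = train)
)

# Hypothetical helper: RMSE, MAE and squared correlation on unseen data
holdout_metrics <- function(model, newdata, outcome) {
  pred <- predict(model, newdata = newdata)
  obs <- newdata[[outcome]]
  c(RMSE = sqrt(mean((obs - pred)^2)),
    MAE = mean(abs(obs - pred)),
    R2 = cor(obs, pred)^2)
}

results <- t(sapply(models, holdout_metrics,
                    newdata = holdout, outcome = "fluorescence"))
print(round(results, 3))
```

The model with the lowest hold-out RMSE is the usual pick, and the same table works for any model whose predict() method accepts new data.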

R random forest feature selection based on AUC

For binary option prediction (rise, fall) I am trying random forest in R but the importance measures and OOB are biased in my case
I found this article but it is Python related.
Is there an R package for automatic feature selection that:
- is based on AUC,
- maybe allows me to define my own evaluation function (money earned is a function of recall and precision rates),
- maybe allows me to specify the cross-validation approach? Randomly selecting training and test cases is biased because this is time-series data, where test data must be later than training data.
I just came across this question and found a package that might help you:
i. It's called AUCRF, it performs feature selection in a random forest model based on optimizing AUC.
https://cran.r-project.org/web/packages/AUCRF/AUCRF.pdf
ii. It does allow cross-validation of your AUC based selection
AUCRFcv(x, nCV = 5, M = 20)
where nCV is the number of folds and M is the number of repeats.
iii. Regarding allowing your own evaluation, it does have an option where you can specify the formula using ~ but you will have to explore that more for your specific case, since you have not provided test code.
Hope this helps!
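For the time-ordering concern raised in the question, a rolling-origin split can be written by hand in a few lines. The sketch below is independent of AUCRF: it uses a logistic glm() as a stand-in model, computes AUC with the rank-sum identity, and runs on simulated data. The window sizes and variable names are assumptions. caret also provides createTimeSlices() for building this kind of split, which may be worth checking against its documentation.

```r
# AUC via the rank-sum (Mann-Whitney) identity; label must be coded 0/1
auc <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(7)
n <- 600
# Simulated time-ordered data: rows are already sorted by time
d <- data.frame(t = seq_len(n), x1 = rnorm(n), x2 = rnorm(n))
d$rise <- rbinom(n, 1, plogis(1.5 * d$x1))

# Rolling origin: train on rows 1..end, test on the next `horizon` rows,
# so every test case is later in time than every training case
initial <- 300
horizon <- 100
ends <- seq(initial, n - horizon, by = horizon)

fold_auc <- sapply(ends, function(end) {
  train <- d[seq_len(end), ]
  test <- d[(end + 1):(end + horizon), ]
  fit <- glm(rise ~ x1 + x2, data = train, family = binomial)
  auc(predict(fit, newdata = test, type = "response"), test$rise)
})
print(round(fold_auc, 3))
```

Swapping the glm() call for a random forest, or the auc() function for a custom profit measure, leaves the splitting logic unchanged.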

Survey Weighted Regression Without FPC in R

I'm using the svydesign function from the survey package in R to run survey-weighted logit regressions as follows:
library(survey)
sdobj <- svydesign(id = ~0, weights = ~chweight, strata = ~strata, data = svdat)
model1 <- svyglm(formula=formula1,design=sdobj,family = quasibinomial)
However, the documentation states a caveat about regressions without specifying finite population corrections (FPC):
If fpc is not specified then sampling is assumed to be
with replacement at the top level and only the first stage of
cluster is used in computing variances.
Unfortunately, I do not have sufficient information to specify my populations at each level (of which I am sampling very little). Any information on how to specify survey weights without FPC information would be very helpful.
You're doing it right. "With replacement" is survey statistics jargon for what you want in this case.
If the sampling fraction is low, it is standard to use an approximation that would be exact if the sampling fraction were infinitesimal or sampling were with replacement. No-one actually does surveys with replacement, but the approximation is almost universal. With this approximation you don't need to supply fpc, and conversely, if you don't supply fpc, svydesign() assumes you want this approximation.
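To see how little the approximation costs when the sampling fraction is small, one can fit the same design with and without fpc on a data set where the population sizes are known. The sketch below uses the apistrat example data shipped with the survey package; that data set and its column names (stype, pw, fpc, api00) come from the package, not from the question.

```r
library(survey)
data(api)  # loads apistrat, a stratified sample of California schools

# Same stratified design twice: with population sizes (fpc) and without
with_fpc <- svydesign(id = ~1, strata = ~stype, weights = ~pw,
                      fpc = ~fpc, data = apistrat)
no_fpc <- svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat)

m_with <- svymean(~api00, with_fpc)
m_no <- svymean(~api00, no_fpc)

# The point estimate is identical; only the standard error changes
print(c(mean_with = as.numeric(coef(m_with)), mean_no = as.numeric(coef(m_no))))
print(c(se_with = as.numeric(SE(m_with)), se_no = as.numeric(SE(m_no))))
```

Omitting fpc gives a slightly larger, and therefore conservative, standard error, which is why the approximation is considered safe for small sampling fractions.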
