Determine the validity of feature importance for a regression - r

My goal is to determine the "feature importance" in a regression task (i.e. I want to know which feature influences the target variable the most).
Let's assume I use a multiple regression.
There are different methods to determine feature importance: model-dependent measures like (standardized) beta coefficients and Shapley value regression, and model-independent measures like permutation importance.
Each method is made for specific situations (e.g. Shapley value regression is made for situations in which the predictors are highly correlated, whereas permutation importance takes interaction effects into account). However, these recommendations are a result of theoretical considerations and simulated data sets, which means that the validity (the difference between the "true" values and the measured values on the actual data) depends on the structure of the actual data set.
Is there a way in R to determine the difference between the measured and the "true" values of the feature importance?
I can use cross-validation to measure the difference between the predicted and the actual values. As an example, I use the Hitters data set.
library(ISLR)
library(caret)
fit = train(Salary~., data = Hitters, method = "lm", trControl = trainControl(method = "repeatedcv", repeats = 3), na.action = na.omit)
I can take a look at the tuning results and the difference between true and predicted values. Here I have an RMSE of ~ 334.
fit$results
intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 TRUE 333.9667 0.468769 239.7604 75.13388 0.1987425 40.58342
But what about the beta coefficients? I can look at them, but I can't figure out how they differ from the actual betas on the test set.
coef(fit$finalModel)
(Intercept) AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
163.1035878 -1.9798729 7.5007675 4.3308829 -2.3762100 -1.0449620 6.2312863 -3.4890543 -0.1713405 0.1339910 -0.1728611
CRuns CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1.4543049 0.8077088 -0.8115709 62.5994230 -116.8492456 0.2818925 0.3710692 -3.3607605 -24.7623251
Is there a way in R to get this information (also for other measures like permutation importance or Shapley value regression)?
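Since real data do not come with "true" importances, one way to probe validity is to simulate data whose true effects are known and check whether each measure recovers their ranking. The following is only a minimal sketch under that assumption: the variables `x1`, `x2`, `x3`, the coefficients in `true_beta`, and the helper `rmse` are all made up for illustration and are not taken from the Hitters example.

```r
# Simulate data whose true effects are known (an assumption made for
# illustration; real data do not come with ground truth).
set.seed(42)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
true_beta <- c(x1 = 3, x2 = 1, x3 = 0)
y <- true_beta["x1"] * x1 + true_beta["x2"] * x2 + rnorm(n)
sim <- data.frame(y, x1, x2, x3)

fit_sim <- lm(y ~ ., data = sim)

# Standardized coefficients: beta * sd(x) / sd(y)
std_beta <- coef(fit_sim)[-1] * sapply(sim[-1], sd) / sd(sim$y)

# Permutation importance: increase in RMSE after shuffling one predictor
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))
base_rmse <- rmse(predict(fit_sim, sim), sim$y)
perm_importance <- sapply(names(sim)[-1], function(v) {
  shuffled <- sim
  shuffled[[v]] <- sample(shuffled[[v]])
  rmse(predict(fit_sim, shuffled), sim$y) - base_rmse
})

# Compare the ranking implied by each measure with the known truth
rank_table <- data.frame(
  true = rank(-abs(true_beta)),
  std_beta = rank(-abs(std_beta)),
  permutation = rank(-perm_importance)
)
print(rank_table)
```

Repeating this with a correlation structure that resembles the real data set would give a rough idea of how far each measure can be trusted there, but it remains a statement about the simulation, not about the real data.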

Related

Difference between fitted values and cross validation values from pls model in r

I only have a small dataset of 30 samples, so I only have a training data set but no test set. So I want to use cross-validation to assess the model. I have run pls models in R using cross-validation and LOO. The mvr output has the fitted values and validation$pred values, and these are different. For the final R2 and RMSE on just the training set, should I be using the final fitted values or the validation$pred values?
The short answer: if you want to know how good the model is at predicting, use validation$pred, because those predictions are made on unseen data. The values under $fitted.values are obtained by fitting the final model on all your training data, meaning the same training data is used both for constructing the model and for prediction. Values obtained from this final fit will therefore underestimate the error of your model on unseen data (that is, overestimate its performance).
You probably need to explain what you mean by "valid" (in your comments).
Cross-validation is used to find which is the best hyperparameter, in this case number of components for the model.
During cross-validation, one part of the data is not used for fitting and serves as a test set. This provides a rough estimate of how the model will work on unseen data. See this image from scikit-learn for how CV works.
LOO works in a similar way. After finding the best parameter, you would normally obtain a final model to be used on the test set. In this case, mvr trains all models from 2-6 PCs, but $fitted.values comes from a model trained on all the training data.
You can also see below how different they are, first I fit a model
library(pls)
library(mlbench)
data(BostonHousing)
set.seed(1010)
idx = sample(nrow(BostonHousing),400)
trainData = BostonHousing[idx,]
testData = BostonHousing[-idx,]
mdl <- mvr(medv ~ ., 4, data = trainData, validation = "CV",
method = "oscorespls")
Then we calculate the error in CV, on the full training model, and on the test data, using 4 PCs. Note that the helper returns the mean squared error (MSE); take the square root to get the RMSE:
calc_MSE = function(pred,actual){ mean((pred - actual)^2)}
# error in CV
calc_MSE(mdl$validation$pred[,,4],trainData$medv)
[1] 43.98548
# error on full training model , not very useful
calc_MSE(mdl$fitted.values[,,4],trainData$medv)
[1] 40.99985
# error on test data
calc_MSE(predict(mdl,testData,ncomp=4),testData$medv)
[1] 42.14615
You can see the error on cross-validation is closer to what you get if you have test data. Again this really depends on your data.
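To make the mechanics concrete, here is a minimal base R sketch of k-fold cross-validation written out by hand. It uses an ordinary linear model on the built-in `mtcars` data rather than `mvr`, and the formula, the number of folds and the helper `mse` are my own choices for illustration.

```r
# Manual 5-fold cross-validation for a linear model on built-in data
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))
cv_pred <- numeric(nrow(mtcars))

for (i in 1:k) {
  held_out <- folds == i
  # Fit on the other folds only, then predict the held-out rows
  fold_fit <- lm(mpg ~ wt + hp + disp, data = mtcars[!held_out, ])
  cv_pred[held_out] <- predict(fold_fit, mtcars[held_out, ])
}

# Model fitted on all rows; its fitted values are in-sample predictions
full_fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

mse <- function(pred, actual) mean((pred - actual)^2)
mse_cv <- mse(cv_pred, mtcars$mpg)
mse_fitted <- mse(fitted(full_fit), mtcars$mpg)
c(cv = mse_cv, fitted = mse_fitted)
```

Every value in `cv_pred` comes from a model that never saw that row, which is why the cross-validated error is the more honest estimate of the two.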

R - Interpreting Random Forest Importance

I'm working with random forest models in R as a part of an independent research project. I have fit my random forest model and generated the overall importance of each predictor to the model's accuracy. However, in order to interpret my results in a research paper, I need to understand whether the variables have a positive or negative impact on the response variable.
Is there a way to produce this information from a random forest model? I.e. I expect age to have a positive impact on the likelihood a surgical complication occurs, but existence of osteoarthritis not so much.
Code:
surgery.bagComp = randomForest(complication~ahrq_ccs+age+asa_status+bmi+baseline_cancer+baseline_cvd+baseline_dementia+baseline_diabetes+baseline_digestive+baseline_osteoart+baseline_psych+baseline_pulmonary,data=surgery,mtry=2,importance=T,cutoff=c(0.90,0.10)) #The cutoff is the probability for each group selection, probs of 10% or higher are classified as 'Complication' occurring
surgery.bagComp #Get stats for random forest model
imp=as.data.frame(importance(surgery.bagComp)) #Analyze the importance of each variable in the model
imp = cbind(vars=rownames(imp), imp)
imp = imp[order(imp$MeanDecreaseAccuracy),]
imp$vars = factor(imp$vars, levels=imp$vars)
dotchart(imp$MeanDecreaseAccuracy, imp$vars,
xlim=c(0,max(imp$MeanDecreaseAccuracy)), pch=16,xlab = "Mean Decrease Accuracy",main = "Complications - Variable Importance Plot",color="black")
Importance Plot:
Any suggestions/areas of research anyone can suggest would be greatly appreciated.
In order to interpret my results in a research paper, I need to understand whether the variables have a positive or negative impact on the response variable.
You need to perform "feature impact" analysis, not "feature importance" analysis.
Algorithmically, it's about traversing decision tree data structures and observing what was the impact of each split on the prediction outcome. For example, consider the split "age <= 40". Does the left branch (condition evaluates to true) carry lower likelihood than the right branch (condition evaluates to false)?
Feature importances may give you a hint about which features to look at, but they cannot be "transformed" into feature impacts.
You might find the following articles helpful: WHY did your model predict THAT? (Part 1 of 2) and WHY did your model predict THAT? (Part 2 of 2).
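A model-agnostic alternative to walking the trees by hand is a partial dependence curve: fix one feature at a grid of values for every row, average the predictions, and look at the direction of the resulting curve. The sketch below is only an illustration under stated assumptions: the data are simulated, a logistic regression stands in for the random forest, and `patients`, `partial_dependence` and `age_grid` are hypothetical names, not part of the question's code.

```r
# Simulated stand-in data (hypothetical): risk rises with age,
# osteoarthritis has no effect
set.seed(7)
n <- 1000
patients <- data.frame(
  age = runif(n, 20, 90),
  osteoart = rbinom(n, 1, 0.3)
)
patients$complication <- rbinom(n, 1, plogis(-4 + 0.05 * patients$age))

# Stand-in model; any model with a predict() method could be used instead
model <- glm(complication ~ age + osteoart, data = patients, family = binomial)

# Partial dependence: set one feature to each grid value for every row,
# then average the predicted probabilities
partial_dependence <- function(model, data, feature, grid) {
  sapply(grid, function(value) {
    modified <- data
    modified[[feature]] <- value
    mean(predict(model, newdata = modified, type = "response"))
  })
}

age_grid <- seq(20, 90, by = 10)
pd_age <- partial_dependence(model, patients, "age", age_grid)

# An increasing curve indicates a positive impact of age on predicted risk
data.frame(age = age_grid, mean_predicted_risk = pd_age)
```

With a random forest the curve need not be monotone, so it is worth plotting it rather than reducing it to a single sign.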

How to build regression models and then compare their fits with data held out from the model training-testing?

I have been building a couple different regression models using the caret package in R in order to make predictions about how fluorescent certain genetic sequences will become under certain experimental conditions.
I have followed the basic protocol of splitting my data into two sets: one "training-testing set" (80%) and one "hold-out set" (20%), the former of which would be utilized to build the models, and the latter would be used to test them in order to compare and pick the final model, based on metrics such as their R-squared and RMSE values. One such guide of the many I followed can be found here (http://www.kimberlycoffey.com/blog/2016/7/16/compare-multiple-caret-run-machine-learning-models).
However, I run into a block in that I do not know how to test and compare the different models based on how well they can predict the scores in the hold-out set. In the guide I linked to above, the author uses a ConfusionMatrix in order to calculate the specificity and accuracy for each model after building a predict.train object that applied the recently built models on the hold-out set of data (which is referred to as test in the link). However, ConfusionMatrix can only be applied to classification models, wherein the outcome (or response) is a categorical value (as far as my research has indicated. Please correct me if this is incorrect, as I have not been able to conclude without any doubt that this is the case).
I have found that the resamples method is capable of comparing multiple models against each other (source: https://www.rdocumentation.org/packages/caret/versions/6.0-77/topics/resamples), but it cannot take into account how the new models fit with the data that I excluded from the training-testing sessions.
I tried to create predict objects using the recently built models and hold-out data, then calculate R-squared and RMSE values using caret's R2 and RMSE methods. But I'm not sure if such an approach is the best possible way to compare the models and pick the best one.
At this point, I should note that all the model building methods I am using are based on linear regression, since I need to be able to extract the coefficients and apply them in a separate Python script.
Another option I considered was setting a threshold in my outcome, wherein any genetic sequence that had a fluorescence value over 100 was considered useful, while sequences scoring values under 100 were not. This would allow me to utilize the ConfusionMatrix. But I'm not sure how I should implement this within my R code to make these two classes in my outcome variable. I'm further concerned that this approach might make it difficult to apply my regression models to other data and make predictions.
For what it's worth, each of the predictors is either an integer or a float, and their values are not normally distributed.
Here is the code I have been using thus far:
library(caret)
data <- read.table("mydata.csv")
sorted_Data<- data[order(data$fluorescence, decreasing= TRUE),]
splitprob <- 0.8
traintestindex <- createDataPartition(sorted_Data$fluorescence, p=splitprob, list=F)
holdoutset <- sorted_Data[-traintestindex,]
trainingset <- sorted_Data[traintestindex,]
traindata<- trainingset[c('x1', 'x2', 'x3', 'x4', 'x5', 'fluorescence')]
cvCtrl <- trainControl(method = "repeatedcv", number= 20, repeats = 20, verboseIter = FALSE)
modelglmStepAIC <- train(fluorescence~., traindata, method = "glmStepAIC", preProc = c("center","scale"), trControl = cvCtrl)
model_rlm <- train(fluorescence~., traindata, method = "rlm", preProc = c("center","scale"), trControl = cvCtrl)
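# predict() on the train object applies the same centering/scaling as in training;
# calling predict.lm() on $finalModel with raw hold-out data would skip that step
pred_glmStepAIC<- predict(modelglmStepAIC, newdata = holdoutset)
pred_rlm<- predict(model_rlm, newdata = holdoutset)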
glmStepAIC_r2<- R2(pred_glmStepAIC, holdoutset$fluorescence)
glmStepAIC_rmse<- RMSE(pred_glmStepAIC, holdoutset$fluorescence)
rlm_r2<- R2(pred_rlm, holdoutset$fluorescence)
rlm_rmse<- RMSE(pred_rlm, holdoutset$fluorescence)
The out-of-sample performance measures offered by caret are RMSE, MAE and the squared correlation between predicted and observed values (called R2). See more info here https://topepo.github.io/caret/measuring-performance.html
At least in a time series regression context, RMSE is the standard measure for out-of-sample performance of regression models.
I would advise against discretising a continuous outcome variable, because you are essentially throwing away information by discretising.
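As a rough illustration of comparing regression models on a hold-out set with these measures, here is a base R sketch. Everything in it is assumed for the example: the data are simulated, plain `lm` fits stand in for the caret models, and `holdout_metrics` is a hypothetical helper, not a caret function.

```r
# Simulated stand-in data with a continuous outcome (hypothetical)
set.seed(123)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$fluorescence <- 50 + 10 * dat$x1 - 5 * dat$x2 + rnorm(n, sd = 3)

# 80/20 split into a training set and a hold-out set
train_idx <- sample(n, size = 0.8 * n)
train_set <- dat[train_idx, ]
holdout_set <- dat[-train_idx, ]

# Two candidate models, fitted on the training set only
candidates <- list(
  full = lm(fluorescence ~ x1 + x2 + x3, data = train_set),
  reduced = lm(fluorescence ~ x1, data = train_set)
)

# RMSE, MAE and squared correlation on data the models have not seen
holdout_metrics <- function(model, newdata, outcome) {
  pred <- predict(model, newdata = newdata)
  actual <- newdata[[outcome]]
  c(RMSE = sqrt(mean((pred - actual)^2)),
    MAE = mean(abs(pred - actual)),
    R2 = cor(pred, actual)^2)
}

comparison <- t(sapply(candidates, holdout_metrics,
                       newdata = holdout_set, outcome = "fluorescence"))
comparison
```

The model with the lower hold-out RMSE would be preferred here. With caret models the same table can be built from `predict()` on each train object.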

R - caret::train "random forest" parameters

I'm trying to build a classification model on 60 variables and ~20,000 observations using the train() fx within the caret package. I'm using the random forest method and am returning 0.999 Accuracy on my training set, however when I use the model to predict, it classifies each test observation as the same class (i.e. each of the 20 observations are classified as "1's" out of 5 possible outcomes). I'm certain this is wrong (the test set is for a Coursera quiz, hence my not posting exact code) but I'm not sure what is happening.
My question is that when I call the final model of fit (fit$finalModel), it says it made 500 total trees (default and expected), however the number of variables tried at each split is 35. I know that with classification, the standard number of variables tried at each split is the square root of the number of total predictors (therefore, should be sqrt(60) = 7.7, call it 8). Could this be the problem??
I'm confused on whether there is something wrong with my model or my data cleaning, etc.
set.seed(10000)
fitControl <- trainControl(method = "cv", number = 5)
fit <- train(y ~ ., data = training, method = "rf", trControl = fitControl)
fit$finalModel
Call:
randomForest(x = x, y = y, mtry = param$mtry)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 41
OOB estimate of error rate: 0.01%
Using random forest for the final project of the Johns Hopkins Practical Machine Learning course on Coursera will generate the same prediction for all 20 quiz test cases if students fail to remove independent variables that have more than 50% NA values.
SOLUTION: remove variables that have a high proportion of missing values from the model.
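A minimal sketch of that filtering step in base R, assuming a toy data frame in place of the real training set (the column names and the 50% cut-off are illustrative choices):

```r
# Toy data frame standing in for the training set (hypothetical columns)
training <- data.frame(
  y = factor(c("A", "B", "A", "B", "A", "B")),
  mostly_missing = c(NA, NA, NA, NA, 1.2, NA),
  some_missing = c(0.5, NA, 0.7, 0.1, 0.9, 0.3),
  complete = c(1, 2, 3, 4, 5, 6)
)

# Share of NA values in each column
na_share <- colMeans(is.na(training))

# Keep only columns with at most 50% missing values
training_clean <- training[, na_share <= 0.5, drop = FALSE]
names(training_clean)
```

The same set of columns should then be kept in the test data, so that the model sees identical predictors at prediction time.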

Choose the number of mtry (without cause bias)?

I have this (example) code and I am trying to understand some of its characteristics. There are many questions about random forests, and the issue of the number of trees and mtry always comes up. This data frame is just an example, but how can I interpret the plot of the model (error) in order to set the number of trees without causing bias? Also, the number of variables tried at each split equals 1 here (why?)
I think tuneR and train may cause bias, so I want to find the best number of trees and mtry (default p/3) based on the error.
#' an example of a data frame and the model
library(randomForest)
clin=data.frame(1:500)
clin$k=clin$X1.500*0.2
clin$z=clin$X1.500*14.1/6
names(clin)=c("pr1","pr2","res")
rf=randomForest(res~pr1+pr2,data=clin,ntree=1000,importance=TRUE,keep.inbag=T)
plot(rf)
rf
Call:
randomForest(formula = res ~ pr1 + pr2, data = clin, ntree = 1000, importance = TRUE, keep.inbag = T)
Type of random forest: regression
Number of trees: 1000
No. of variables tried at each split: 1
Mean of squared residuals: 2.051658
% Var explained: 100
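At each split, the RF considers only a random subset of the total number of predictors p; for regression the default size of that subset is p/3, rounded down but never below 1. In this example you only have 2 predictors to explain "res", so the default is 1 and RF will randomly select only one predictor per split.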
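The default can be reproduced with a few lines of base R. The helper `default_mtry` below is a hypothetical name; it mirrors the defaults described in the randomForest documentation (sqrt(p) for classification, p/3 for regression, both rounded down) rather than calling the package.

```r
# Default mtry as described for randomForest():
# classification: floor(sqrt(p)); regression: max(floor(p / 3), 1)
default_mtry <- function(p, type = c("regression", "classification")) {
  type <- match.arg(type)
  if (type == "regression") max(floor(p / 3), 1) else floor(sqrt(p))
}

default_mtry(2, "regression")       # 1, as in the output above
default_mtry(60, "classification")  # 7
```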
ntree and mtry should be defined so that your results are consistent.
If you set ntree too low and fit the RF multiple times, you'll see a huge variation in RMSEP between the different RFs.
The same holds true for mtry.
A previous answer with reference to Breiman's paper on the matter
Edit regarding the predictor chosen for the split: when dealing with large numbers of predictors (2 is definitely too low to make good use of a RF), you may be interested in variable importance to see which ones are more meaningful than the others.
