R - caret::train "random forest" parameters

I'm trying to build a classification model on 60 variables and ~20,000 observations using the train() function from the caret package. I'm using the random forest method and getting 0.999 accuracy on my training set, but when I use the model to predict, it assigns every test observation to the same class (i.e. each of the 20 observations is classified as "1" out of 5 possible outcomes). I'm certain this is wrong (the test set is for a Coursera quiz, hence my not posting the exact code), but I'm not sure what is happening.
My question is that when I call the final model of the fit (fit$finalModel), it says it made 500 trees in total (the default, as expected), but the number of variables tried at each split is 35. I know that for classification, the standard number of variables sampled at each split is the square root of the total number of predictors (so it should be sqrt(60) = 7.7, call it 8). Could this be the problem?
I'm confused on whether there is something wrong with my model or my data cleaning, etc.
library(caret)
set.seed(10000)
fitControl <- trainControl(method = "cv", number = 5)
fit <- train(y ~ ., data = training, method = "rf", trControl = fitControl)
fit$finalModel
Call:
randomForest(x = x, y = y, mtry = param$mtry)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 41
OOB estimate of error rate: 0.01%
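On the mtry value shown in this output: caret's train() does not fix mtry at sqrt(p). By default it tries a small grid of mtry values and keeps the one with the best cross-validated accuracy, which is how you can end up with 35 or 41 instead of 8. If you want the classical sqrt(p) value, you can pin it yourself; a minimal sketch reusing the objects from the question (training, fitControl):
# Force mtry = 8 (roughly sqrt(60)) instead of letting train() tune it over a grid.
rfGrid <- data.frame(mtry = 8)
fit_fixed <- train(y ~ ., data = training, method = "rf",
                   trControl = fitControl, tuneGrid = rfGrid)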

For the final project in the Johns Hopkins Practical Machine Learning course on Coursera, a random forest will generate the same prediction for all 20 quiz test cases if you fail to remove predictor variables that are mostly missing (more than about 50% NA values).
SOLUTION: remove variables with a high proportion of missing values from the model before training.
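A minimal sketch of that cleaning step, assuming the training and test data frames are called training and testing and using the 50% cutoff mentioned above:
# Keep only columns with at most 50% missing values, then apply the same
# selection to the test set so both data frames share the same predictors.
na_fraction <- colMeans(is.na(training))
keep <- names(training)[na_fraction <= 0.5]
training_clean <- training[, keep]
testing_clean  <- testing[, intersect(keep, names(testing))]

fit <- train(y ~ ., data = training_clean, method = "rf", trControl = fitControl)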

Related

All observations for some variables are zero; which regression model should I use?

I have an assignment to create a regression model using leave-one-out cross-validation.
Due to the specific biological nature of the data, the variables cannot be negatively correlated with the response variable.
There are 12 variables and 66 observations in total. One variable had observations that are either numeric or zero, while another 2 variables were zero for all observations. The remaining 9 variables were numeric.
I ran the following and ended up with the warning "In predict.lm(modelFit, newdata) : prediction from a rank-deficient fit may be misleading".
set.seed(2)
install.packages('caret')
library(caret)
#Define training control
train.control <- trainControl(method = 'LOOCV')
#Train the model
train.model <- train(Y ~., data = OSbiome, method = 'lm', trControl = train.control)
summary(train.model)
print(train.model)
warnings()
I'm not sure whether the warning is due to the fact that 2 of the variables are zero for all observations. Any suggestions on what model to use instead? Thanks!
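One thing worth checking before switching models (a sketch, not part of the original question; it assumes the data frame is called OSbiome with response Y, as above): predictors that are constant, e.g. all zero, add nothing to the fit and are a common cause of rank-deficiency warnings. caret's nearZeroVar() flags such columns:
# Drop zero- and near-zero-variance predictors before fitting.
nzv <- nearZeroVar(OSbiome)
OSbiome_clean <- if (length(nzv) > 0) OSbiome[, -nzv] else OSbiome

train.model <- train(Y ~ ., data = OSbiome_clean, method = 'lm',
                     trControl = train.control)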

Difference between fitted values and cross validation values from pls model in r

I have a small dataset of only 30 samples, so I have a training data set but no test set, and I want to use cross-validation to assess the model. I have run pls models in R using cross-validation and LOO. The mvr output has the fitted values and the validation$pred values, and these are different. For the final R2 and RMSE results on just the training set, should I be using the final fitted values or the validation$pred values?
The short answer: if you want to know how good the model is at predicting, use validation$pred, because those predictions are made on data the model did not see. The values under $fitted.values are obtained by fitting the final model on all of your training data, so the same observations are used both to build the model and to generate the predictions. Errors computed from that final fit will therefore be optimistic, i.e. they overestimate how well the model will perform on unseen data.
You probably need to explain what you mean by "valid" (in your comments).
Cross-validation is used to pick the best hyperparameter, in this case the number of components for the model.
During cross-validation, one part of the data is left out of the fitting and serves as a test set, which gives a rough estimate of how the model will perform on unseen data. See the scikit-learn documentation for an illustration of how CV works.
LOO works in a similar way. After finding the best parameter, you would normally refit a final model and use it on the test set. In this case, mvr fits models with every number of components up to the requested ncomp during cross-validation, but $fitted.values comes from the model trained on all of the training data.
You can also see below how different they are. First I fit a model:
library(pls)
library(mlbench)
data(BostonHousing)
set.seed(1010)
idx = sample(nrow(BostonHousing),400)
trainData = BostonHousing[idx,]
testData = BostonHousing[-idx,]
mdl <- mvr(medv ~ ., 4, data = trainData, validation = "CV",
method = "oscorespls")
Then we calculate the mean squared error in CV, on the full-training fit, and on the test data, all using 4 components (the helper below returns MSE rather than RMSE, since it never takes a square root):
calc_MSE = function(pred, actual){ mean((pred - actual)^2) }
# error in CV
calc_MSE(mdl$validation$pred[,,4], trainData$medv)
[1] 43.98548
# error on the full-training fit, not very useful on its own
calc_MSE(mdl$fitted.values[,,4], trainData$medv)
[1] 40.99985
# error on test data
calc_MSE(predict(mdl, testData, ncomp = 4), testData$medv)
[1] 42.14615
You can see that the cross-validation error is closer to what you get on actual test data. Again, this really depends on your data.
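For reference, the pls package reports the same comparison directly via RMSEP(); a quick check with the same mdl and testData as above (these values should correspond to the square roots of the MSE numbers above):
RMSEP(mdl, estimate = "CV")                        # cross-validated RMSEP per component
RMSEP(mdl, estimate = "test", newdata = testData)  # RMSEP on the held-out test set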

Determine the validity of feature importance for a regression

My goal is to determine the "Feature Importance" in a regression task (i.e. I want to know which feature influences the target variable the most).
Let's assume I use multiple regression.
There are different methods to determine feature importance: model-dependent measures such as (standardized) beta coefficients or Shapley value regression, and model-independent measures such as permutation importance.
All of these methods are designed for specific situations (e.g. Shapley value regression is intended for situations in which the predictors are highly correlated, whereas permutation importance takes interaction effects into account). However, those assumptions come from theoretical considerations and simulated data sets, which means that the validity (the difference between the "true" values and the values measured on the actual data) depends on the structure of the actual data set.
Is there a way in R to determine the difference between the measured and the "true" values of the feature importance?
I can use cross-validation to measure the difference between the predicted and the actual values. As an example, I use the Hitters data set.
library(ISLR)
library(caret)
fit = train(Salary~., data = Hitters, method = "lm", trControl = trainControl(method = "repeatedcv", repeats = 3), na.action = na.omit)
I can take a look at the tuning results and the difference between true and predicted values. Here I have an RMSE of ~ 334.
fit$results
intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
1 TRUE 333.9667 0.468769 239.7604 75.13388 0.1987425 40.58342
But what about the beta coefficients? I can look at them, but I can't figure out how they differ from the actual betas on the test set.
coef(fit$finalModel)
(Intercept) AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
163.1035878 -1.9798729 7.5007675 4.3308829 -2.3762100 -1.0449620 6.2312863 -3.4890543 -0.1713405 0.1339910 -0.1728611
CRuns CRBI CWalks LeagueN DivisionW PutOuts Assists Errors NewLeagueN
1.4543049 0.8077088 -0.8115709 62.5994230 -116.8492456 0.2818925 0.3710692 -3.3607605 -24.7623251
Is there a way in R to get this information (also for other measures such as permutation importance or Shapley value regression)?
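Not a full answer to the validity question, but permutation importance itself is easy to compute by hand, which at least makes the measure transparent. A rough sketch using the fit from above (the helper and object names here are illustrative; packages such as vip or iml do this more carefully):
# Manual permutation importance: shuffle one predictor at a time and record how
# much the RMSE increases relative to the unshuffled data.
hit <- na.omit(Hitters)                         # complete cases, as used by the fit
baseline <- RMSE(predict(fit, hit), hit$Salary)

perm_imp <- sapply(setdiff(names(hit), "Salary"), function(v) {
  shuffled <- hit
  shuffled[[v]] <- sample(shuffled[[v]])        # break the link between v and Salary
  RMSE(predict(fit, shuffled), hit$Salary) - baseline
})
sort(perm_imp, decreasing = TRUE)               # larger increase = more important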

caret package: how to extract the test data metrics

I have a dataset that is publicly available ("banknote_authentication"). It has four predictor variables (variance, skewness, entropy and kurtosis) and one target variable (class). The dataset contains 1372 records. I am using R version 3.3.2 and RStudio on a Windows machine.
I'm using the caret package to build a cross-validation approach for the following models: logistic regression, LDA, QDA, and KNN with k = 1, 2, 5, 10, 20, 50, 100. I need to obtain the test error as well as the sensitivity and specificity for each of these methods, present the results as boxplots, and compare test error/sensitivity/specificity across the methods. Here is an example of my logistic regression code:
library(caret)
ctrl <- trainControl(method = "repeatedcv", number = 10, savePredictions = TRUE)
mod_fit <- train(class_label~variance+skewness+kurtosis+entropy, data=SBank, method="glm", family="binomial", trControl = ctrl, tuneLength = 5)
pred = predict(mod_fit, newdata=SBank)
And here is how I am evaluating my models:
confusionMatrix(data=pred, SBank$class_label)
How do I extract the accuracy, sensitivity and specificity metrics for the test data so that I can create the boxplots? I do not need the summary statistics printed by the confusion matrix; I need a dataset of these metrics that I can plot.
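One way to get fold-level metrics rather than a single summary (a sketch, under the assumption that the two levels of class_label are valid R variable names, which twoClassSummary requires): ask caret for probability-based summaries, read mod_fit$resample, and compare models with resamples().
ctrl <- trainControl(method = "repeatedcv", number = 10, savePredictions = TRUE,
                     classProbs = TRUE, summaryFunction = twoClassSummary)
mod_fit <- train(class_label ~ variance + skewness + kurtosis + entropy,
                 data = SBank, method = "glm", family = "binomial",
                 trControl = ctrl, metric = "ROC")
head(mod_fit$resample)   # one row per resample: ROC, Sens, Spec

# With several fitted models (lda_fit and knn_fit are hypothetical here),
# collect their resampling results and draw boxplots:
# res <- resamples(list(GLM = mod_fit, LDA = lda_fit, KNN = knn_fit))
# bwplot(res, metric = c("Sens", "Spec"))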

Choose the value of mtry (without causing bias)?

I have this (example) code and I am trying to understand some of its characteristics. There are many questions about random forests, and the issues of the number of trees and of mtry always come up. This data frame is just an example, but how can I interpret the model's error plot in order to choose the number of trees without introducing bias? Also, why does the No. of variables tried at each split equal 1 here?
I think tuneRF and train may introduce bias, so I want to find the best number of trees and mtry (default p/3) based on the error.
# an example data frame and the model
library(randomForest)
clin=data.frame(1:500)
clin$k=clin$X1.500*0.2
clin$z=clin$X1.500*14.1/6
names(clin)=c("pr1","pr2","res")
rf=randomForest(res~pr1+pr2,data=clin,ntree=1000,importance=TRUE,keep.inbag=T)
plot(rf)
rf
Call:
randomForest(formula = res ~ pr1 + pr2, data = clin, ntree = 1000, importance = TRUE, keep.inbag = T)
Type of random forest: regression
Number of trees: 1000
No. of variables tried at each split: 1
Mean of squared residuals: 2.051658
% Var explained: 100
By default, a regression RF samples a subset of the p predictors at each split (p/3, rounded down, with a minimum of 1). In this example you only have 2 predictors to explain "res", and floor(2/3) = 0, so mtry falls back to 1: the forest randomly selects a single predictor at each split.
ntree and mtry should be chosen so that your results are consistent.
If you set ntree too low and fit the RF multiple times, you'll see a huge variation in RMSEP between the different forests.
The same holds true for mtry.
A previous answer, with a reference to Breiman's paper, covers this in more detail.
Edit, regarding the predictor chosen for the split: when dealing with large numbers of predictors (2 is definitely too few to make good use of an RF), you may be interested in the variable importance measures to see which predictors are more meaningful than the others.
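A rough way to act on that advice (a sketch, not from the original answer): refit the forest a few times at each candidate ntree and look at how much the OOB error varies; a stable choice is one where repeated fits give nearly the same error. The same loop can be wrapped around candidate mtry values.
# Spread of the OOB mean squared error across repeated fits, per ntree value.
set.seed(1)
ntrees <- c(25, 100, 500, 1000)
oob_sd <- sapply(ntrees, function(nt) {
  errs <- replicate(10, {
    rf_i <- randomForest(res ~ pr1 + pr2, data = clin, ntree = nt)
    tail(rf_i$mse, 1)          # OOB MSE after the last tree
  })
  sd(errs)                     # smaller spread = more stable ntree
})
data.frame(ntree = ntrees, oob_sd = oob_sd)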
