Issues when using randomForest in caret with ROC as optimization metric - r

I'm having an issue when constructing random forest models using caret. I have a dataset of about 46k rows and 10 columns (one of which is the optimization target). From this dataset, I'm trying to compare different classifiers. I did the following:
ctrl <- trainControl(method = "boot",
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
# GLM model:
model.glm <- train(x = d[, 2:10], y = d$CONV_BT, method = "glm",
                   trControl = ctrl, metric = "ROC", family = "binomial")
# Random forest model:
model.rf <- train(x = d[, 2:10], y = d$CONV_BT, method = "rf",
                  trControl = ctrl, metric = "ROC")
# Naive Bayes model:
model.nb <- train(x = d[, 2:10], y = d$CONV_BT, method = "nb",
                  trControl = ctrl, metric = "ROC")
Then, model.glm and model.nb both look pretty decent: I can look at the 25 bootstrap replications, and each has an ROC of around 0.7. However, something appears to be wrong with model.rf, because its reported ROC scores are all around 0.3. That suggests to me that something is being specified incorrectly, because I could just flip the rf model's predictions from p to 1 - p and my ROC would then be about 0.7, right?
I'm sorry that I can't provide the data (it's quite large to upload and it's proprietary). The other bizarre thing is that when I simulate data, I no longer have this issue. Any idea what could be causing this? Thanks for your help!
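For what it's worth, one thing I could check (this is an assumption on my part, not something I've confirmed): caret's twoClassSummary treats the first level of the outcome factor as the event of interest, so if the levels of d$CONV_BT are ordered the "wrong" way round, the reported ROC can come out below 0.5. A minimal sketch of that check, where the positive class name "yes" is purely hypothetical:
# twoClassSummary uses the first factor level as the event of interest
levels(d$CONV_BT)
# if needed, put the positive class first (the class name "yes" is hypothetical)
d$CONV_BT <- relevel(d$CONV_BT, ref = "yes")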

Related

Optimizing a GAM for Smoothness

I am currently trying to fit a generalized additive model (GAM) in R using a response variable and three predictor variables. One of the predictors is linear, and the dataset consists of 298 observations.
I have run the following code to generate a basic GAM:
GAM <- gam(response~ linearpredictor+ s(predictor2) + s(predictor3), data = data[2:5])
This produces a model with 18 degrees of freedom and seems to substantially overfit the data. I'm wondering how I might generate a GAM that maximizes smoothness and predictive error. I realize that each of these features is going to come at the expense of the other, but is there a good way to find the optimal model that doesn't overfit?
Additionally, I need to perform leave-one-out cross validation (LOOCV), and I am not sure how to make gam() do this in the mgcv package. Any help on either of these problems would be greatly appreciated. Thank you.
I have also generated 1,000,000 GAMs with varying combinations of smoothing parameters, ranging the maximum degrees of freedom allowed from 10 (as shown in the code below) to 19. The variable combinations2 is a list of all 1,000,000 combinations of smoothing parameters I selected. This code is designed to balance degrees of freedom against AIC. It does run, but I'm not sure I will actually be able to find the optimal model this way, and I also cannot tell how to make it use LOOCV.
BestGAM <- gam(response ~ linearpredictor + predictor2 + predictor3, data = data[2:5])  # unpenalized baseline
for (i in 1:100000) {
  PotentialGAM <- gam(response ~ linearpredictor + s(predictor2) + s(predictor3),
                      data = data[2:5],
                      sp = c(combinations2[i, ]$Var1, combinations2[i, ]$Var2))
  # keep the candidate if it uses at most 10 degrees of freedom and beats the current best AIC
  if (AIC(PotentialGAM, BestGAM)$df[1] <= 10 &&
      AIC(PotentialGAM, BestGAM)$AIC[1] < AIC(PotentialGAM, BestGAM)$AIC[2]) {
    BestGAM <- PotentialGAM
    listNumber <- i
  }
}
You are fitting your GAM using generalised cross-validation (GCV) smoothness selection. GCV is a way around the invariance problem of ordinary cross-validation (OCV; what you call LOOCV) when estimating GAMs. GCV is equivalent to OCV on a rotated version of the fitting problem (rotating y − Xβ by Q, an orthogonal matrix); when fitting with GCV, {mgcv} doesn't actually need to perform the rotation, and the expected GCV score isn't affected by it, but in essence GCV is just OCV (Wood, 2017, p. 260).
It has been shown that GCV can undersmooth (resulting in more wiggly models) because the objective function (the GCV profile) can become flat around the optimum. It is therefore preferable to estimate GAMs (with penalized smooths) using REML or ML smoothness selection; add method = "REML" (or "ML") to your gam() call.
If the REML or ML fit is as wiggly as the GCV one with your data, then I'd be inclined to presume gam() is not overfitting, but that there is something about your response data that hasn't been explained here (are the data ordered in time, for example?).
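For concreteness, a minimal sketch of that call using the model and variable names from the question:
library(mgcv)
# same model as in the question, but with REML smoothness selection instead of the GCV default
GAM_reml <- gam(response ~ linearpredictor + s(predictor2) + s(predictor3),
                data = data[2:5], method = "REML")
summary(GAM_reml)  # compare the effective degrees of freedom with the GCV fit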
As to your question
how I might generate a GAM that maximizes smoothness and [minimize?] predictive error,
you are already doing that with GCV smoothness selection, for a particular definition of "smoothness" (here, the squared second derivatives of the estimated smooths, integrated over the range of the covariates and summed over the smooths).
If you want GCV but smoother models, you can increase the gamma argument above 1; gamma = 1.4 is often used, for example, which means that each effective degree of freedom costs 40% more in the GCV criterion.
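As a sketch, keeping the default GCV selection but making wiggliness more expensive:
# GCV fit with each EDF costing 40% more; gamma = 1.4 is the commonly used value mentioned above
GAM_gcv <- gam(response ~ linearpredictor + s(predictor2) + s(predictor3),
               data = data[2:5], gamma = 1.4)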
FWIW, you can get the LOOCV (OCV) score for your model without actually fitting 298 GAMs, using the diagonal of the influence (hat) matrix A. Here's a reproducible example using my {gratia} package:
library("gratia")
library("mgcv")
df <- data_sim("eg1", seed = 1)
m <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = df, method = "REML")
A <- influence(m)
r <- residuals(m, type = "response")
ocv_score <- mean(r^2 / (1 - A))

Why does ranger predict give different numbers when re-applied to training data?

I am very new to machine learning. I am exploring how to fit random forests with the ranger package in R. My dependent variable is continuous, so this is a regression forest (not just classification). While trying out the functions, I noticed that there seems to be a discrepancy between ranger() and predict(). The following lines result in different predictions in results and results_alternative:
rf_reg <- ranger(formula = y ~ ., data = training_df)
results <- rf_reg$predictions
results_alternative <- predict(rf_reg, data = training_df)$predictions
Could anybody please explain why there is a discrepancy and what is causing it? Which one is correct? I have tried it with classification on iris data and that seemed to give the same results. Many thanks!
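For reference, a minimal sketch of the same comparison on a built-in dataset (mtcars, purely for illustration). The discrepancy is consistent with rf$predictions holding out-of-bag predictions while predict() runs every tree on every row:
library(ranger)
set.seed(42)
rf <- ranger(mpg ~ ., data = mtcars)
oob_preds <- rf$predictions                               # out-of-bag: each row predicted only by trees that did not see it
all_tree_preds <- predict(rf, data = mtcars)$predictions  # in-sample: all trees used for every row
head(cbind(oob = oob_preds, all_trees = all_tree_preds))  # the two columns differ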

Multinomial logit with random effects does not converge using mblogit

I would like to estimate a random effects (RE) multinomial logit model.
I have been applying mblogit from the mclogit package. However, once I introduce RE into my model, it fails to converge.
Is there a workaround for this?
For instance, I tried to adjust the fitting process of mblogit by increasing the maximal number of iterations (maxit), but did not succeed in writing the correct syntax for the control function. Would this be the right approach? If so, could you advise me how to implement it in my model, which so far looks as follows:
meta.mblogit <- mblogit(Migration ~ ClimateHazard4, weights = logNsquare,
                        data = meta.df, subset = Panel == 1,
                        random = ~ 1 | StudyID)
Here, both variables (Migration and ClimateHazard4) are factor variables.
Or is there an alternative approach you could recommend for estimating an RE multinomial logit?
Thank you very much!
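For reference, a hedged sketch of what passing a control object might look like. mclogit.control() is the control constructor in the mclogit package; recent versions may expect mmclogit.control() for models with random effects, so check ?mblogit for the constructor your version uses:
library(mclogit)
# assumption: raising maxit via the package's control constructor; check ?mblogit for the exact argument
meta.mblogit <- mblogit(Migration ~ ClimateHazard4, weights = logNsquare,
                        data = meta.df, subset = Panel == 1,
                        random = ~ 1 | StudyID,
                        control = mclogit.control(maxit = 100, trace = TRUE))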

Alternative performance measures for multiclass classification in caret

I want to tune a classification algorithm that predicts probabilities using caret.
Since my dataset is highly imbalanced, caret's default Accuracy metric does not seem very helpful, according to this post: https://stats.stackexchange.com/questions/68702/r-caret-difference-between-roc-curve-and-accuracy-for-classification.
In my specific case, I want to determine the optimal mtry parameter of a random forest that predicts probabilities. I have 3 classes with a balance ratio of 98.7% - 0.45% - 0.85%. A reproducible example - which sadly does not have an imbalanced dataset - is given by:
library(caret)
data(iris)
control = trainControl(method = "CV", number = 5, verboseIter = TRUE, classProbs = TRUE)
grid = expand.grid(mtry = 1:3)
rf_gridsearch = train(y = iris[, 5], x = iris[-5], method = "ranger", num.trees = 2000, tuneGrid = grid, trControl = control)
rf_gridsearch
So my two questions basically are:
What alternative summary metrics besides Accuracy do I have?
(Using multiROC is not my favourite, because of: https://stats.stackexchange.com/questions/68702/r-caret-difference-between-roc-curve-and-accuracy-for-classification. I am thinking of something like a Brier score.)
How do I implement them?
Many thanks!
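For what it's worth, a sketch of one way this could be implemented with a hand-rolled Brier-score summary function (brierSummary below is a hypothetical name, not a caret built-in; caret also ships mnLogLoss() and multiClassSummary() as ready-made alternatives). caret passes the summary function a data frame with columns obs, pred and one probability column per class level:
library(caret)
library(ranger)
# hypothetical multiclass Brier score summary function
brierSummary <- function(data, lev = NULL, model = NULL) {
  obs_mat <- model.matrix(~ obs - 1, data = data)   # one-hot encoding of the observed classes
  colnames(obs_mat) <- lev
  prob_mat <- as.matrix(data[, lev, drop = FALSE])  # predicted class probabilities
  c(Brier = mean(rowSums((prob_mat - obs_mat)^2)))
}
control <- trainControl(method = "CV", number = 5, classProbs = TRUE,
                        summaryFunction = brierSummary)
# note: with method = "ranger", newer caret versions expect mtry, splitrule and min.node.size in the grid
grid <- expand.grid(mtry = 1:3, splitrule = "gini", min.node.size = 1)
rf_gridsearch <- train(y = iris[, 5], x = iris[-5], method = "ranger",
                       num.trees = 2000, tuneGrid = grid,
                       metric = "Brier", maximize = FALSE,  # smaller Brier score is better
                       trControl = control)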

Random forest evaluation in R

I am a newbie in R and I am trying to do my best to create my first model. I am working on a two-class random forest project, and so far I have programmed the model as follows:
library(randomForest)
set.seed(2015)
randomforest <- randomForest(as.factor(goodkit) ~ ., data=training1, importance=TRUE,ntree=2000)
varImpPlot(randomforest)
prediction <- predict(randomforest, test,type='prob')
print(prediction)
I am not sure why I don't get the overall prediction for my model. I must be missing something in my code. I get the OOB estimate and the prediction for each case in the test set, but not the overall prediction of the model.
library(pROC)
auc <-roc(test$goodkit,prediction)
print(auc)
This doesn't work at all.
I have been through the pROC manual, but I cannot understand everything. It would be very helpful if anyone could help with the code or post a link to a good practical example.
Using the ROCR package, the following code should work for calculating the AUC:
library(ROCR)
predictedROC <- prediction(prediction[, 2], as.factor(test$goodkit))
as.numeric(performance(predictedROC, "auc")@y.values)
Your problem is that predict() on a randomForest object with type='prob' returns a matrix with one column per class: each column contains the probability of belonging to that class (two columns for binary prediction).
You have to decide which of these columns to use to build the ROC curve. Fortunately, for binary classification they give equivalent results (one is just the reverse of the other):
auc1 <- roc(test$goodkit, prediction[, 1])
print(auc1)
auc2 <- roc(test$goodkit, prediction[, 2])
print(auc2)
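As a usage note (a sketch, assuming the same objects as above), the AUC can also be read straight off the pROC objects:
library(pROC)
auc(auc1)   # numeric AUC for column 1
plot(auc1)  # ROC curve; auc2 gives the mirrored curve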
