SVM performance not consistent with AUC score - r

I have a dataset that contains information about patients. It includes several variables and their clinical status (0 if they are healthy, 1 if they are sick).
I have tried to implement an SVM model to predict patient status based on these variables.
library(e1071)
library(ROCR)

Index <- order(Ytrain, decreasing = FALSE)
SVMfit_Var <- svm(Xtrain[Index, ], Ytrain[Index],
                  type = "C-classification", gamma = 0.005,
                  probability = TRUE, cost = 0.001, epsilon = 0.1)

preds1 <- predict(SVMfit_Var, Xtest, probability = TRUE)
preds1 <- attr(preds1, "probabilities")[, 1]

samples <- !is.na(Ytest)
pred <- prediction(preds1[samples], Ytest[samples])
AUC <- performance(pred, "auc")@y.values[[1]]

prediction <- predict(SVMfit_Var, Xtest)
xtab <- table(Ytest, prediction)
To test the performance of the model I calculated the ROC AUC, and on the validation set I obtain an AUC = 0.997.
But when I look at the predictions, every patient has been classified as healthy.
AUC = 0.997

> xtab
     prediction
Ytest  0  1
    0 72  0
    1 52  0
Can anyone help me with this problem?

Did you look at the probabilities versus the fitted values? You can read about how probability works with SVM here.
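For example, here is a minimal sketch (reusing the object names from your code) that puts the hard class labels next to the Platt-scaled probabilities e1071 attaches when probability = TRUE; note the columns of the probability matrix are named after the class levels, so it is worth checking which class [, 1] actually refers to:
preds <- predict(SVMfit_Var, Xtest, probability = TRUE)
# one row per test case: predicted label plus the per-class probabilities
head(data.frame(label = as.character(preds), attr(preds, "probabilities")))
# check the column order before using attr(preds, "probabilities")[, 1]
colnames(attr(preds, "probabilities"))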
If you want to look at the performance, you can use the Conf function from the DescTools library or the confusionMatrix function from the caret library. (They provide the same output.)
library(DescTools)
library(caret)
# for the training performance with DescTools
Conf(table(SVMfit_Var$fitted, Ytrain[Index]))
# svm.model$fitted, y-values for training
# training performance with caret
confusionMatrix(SVMfit_Var$fitted, as.factor(Ytrain[Index]))
# svm.model$fitted, y-values
# if y.values aren't factors, use as.factor()
# for testing performance with DescTools
# with `table()` in your question, you must flip the order:
# predicted first, then actual values
Conf(table(prediction, Ytest))
# and for caret
confusionMatrix(prediction, as.factor(Ytest))
Your question isn't reproducible, so I worked through this with the iris data. The probability was the same for every observation. I've included it so you can see the same behaviour with another data set.
library(e1071)
library(ROCR)
library(caret)
library(dplyr)

data("iris")

# make it binary
df1 <- iris %>% filter(Species != "setosa") %>% droplevels()
# check the subset
summary(df1)

set.seed(395)  # keep the sample repeatable
tr <- sample(1:nrow(df1), size = 70,  # 70%
             replace = FALSE)

# create the model
svm.fit <- svm(df1[tr, -5], df1[tr, ]$Species,
               type = "C-classification",
               gamma = .005, probability = TRUE,
               cost = .001, epsilon = .1)
# look at probabilities
pb.fit <- predict(svm.fit, df1[-tr, -5], probability = TRUE)
# this shows EVERY row has the same outcome probability distribution
pb.fit <- attr(pb.fit, "probabilities")[, 1]

# look at performance
performance(prediction(pb.fit, df1[-tr, ]$Species), "auc")@y.values[[1]]
# [1] 0.03555556  that's abysmal!!

# test the model
p.fit <- predict(svm.fit, df1[-tr, -5])
confusionMatrix(p.fit, df1[-tr, ]$Species)
# 93% accuracy with NIR at 50%... the AUC score was not useful

# check the trained model performance
confusionMatrix(svm.fit$fitted, df1[tr, ]$Species)
# 87%, with NIR at 50%... that's really good
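A side note on that 0.0356: an AUC far below 0.5 usually means the scores passed to ROCR belong to the opposite class (scoring 1 - p flips the AUC around 0.5). This is a guess on my part, but with e1071 the column order of the probability matrix is not guaranteed to put the positive class first, so it is safer to select the column by name:
# the probability columns are named after the class levels
pb.mat <- attr(predict(svm.fit, df1[-tr, -5], probability = TRUE), "probabilities")
colnames(pb.mat)
# ROCR treats the second sorted level ("virginica") as the positive class,
# so score that column explicitly
performance(prediction(pb.mat[, "virginica"], df1[-tr, ]$Species), "auc")@y.values[[1]]
# if the 0.0356 came from the other class's column, this should be around 0.96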

Related

Why does 'ranger predict' predict actual values when actual data is fed through the trained model?

Here is a simple reproducible example of my dilemma.
My goal is very simple: train a random forest model using 'ranger', then run the training data back through the trained model to double-check it against the model's stored predicted values. The two sets of predictions do not match. In fact, running the training data through the trained model reproduces the actual training target values.
Others have posted related questions to the forum, though I have yet to find a definitive answer.
This makes no sense to me. Feeding a trained model the data it was trained on should give consistent results.
library(tidyverse)
library(ranger)

train <- tibble(target = ifelse(runif(1000) > 0.5, 1, 0),
                feature1 = runif(1000),
                feature2 = runif(1000))

# Train the model
rf <- ranger(
  target ~ .,
  data = train,
  classification = TRUE,
  num.trees = 500,
  seed = 123
)

# Obtain the in-sample model predictions
in_sample_predictions_2 <- rf$predictions

# Alternatively, run the train data through the trained model to ensure the
# results are the same as 'in_sample_predictions_2'
in_sample_predictions_1 <- predict(rf, data = train)$predictions

# The equivalency check fails; the sum of squared differences should equal 0
sum((in_sample_predictions_1 - in_sample_predictions_2)^2)
# [1] 506

# Yet the number of mismatches exactly equals the OOB error rate: 50.6% = 506/1000
# It turns out the predictions from the trained model on the train data are
# identical to the actual train 'target' values.
sum((train$target - in_sample_predictions_1)^2)
# [1] 0
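A likely explanation (worth verifying against the ranger documentation) is that rf$predictions holds out-of-bag predictions, whereas predict(rf, data = train) scores every row with all trees, including the trees that were fitted on that row, so the forest can memorise even a pure-noise target in-sample. A minimal check, reusing the objects above:
# out-of-bag predictions stored on the fitted object: each row is predicted
# only by trees that did not see it during training
oob_preds <- rf$predictions
# predict() on the training data uses all trees, including in-bag ones
refit_preds <- predict(rf, data = train)$predictions
mean(refit_preds == train$target)              # close to 1 in-sample
mean(oob_preds == train$target, na.rm = TRUE)  # about 0.5, as expected for random features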

How can I plot the average ROC of a bootstrap of cross-validated BRT models (gbm.step) in R with confidence intervals?

I would like to produce a ROC curve from 100 runs of a 10-fold cross validated model produced with gbm.step from the gbm and dismo packages, with the curve representing the average and also displaying the confidence interval, something like this (not my graph):
I am unsure how to do this - I have been able to plot the ROC of each model run as an individual line, but I would prefer the above.
My code:
df <- read.csv("data.csv")

library(gbm)
library(dismo)
library(dplyr)
library(ROCR)
library(mlbench)
library(colorspace)

Pal <- qualitative_hcl(10)

## Number of iterations
n.iter <- 100

plot(NULL, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "False positive rate", ylab = "True positive rate")

## Run bootstrapped BRT model
for (i in 1:n.iter) {
  ## Sample data
  train.num <- round(nrow(df) * 0.8)
  train.obs <- sample(nrow(df), train.num)

  ## Separate covariates and response
  df.x <- df[10:52]
  df.y <- df$Presence

  # X is training sample
  x.train <- df.x[train.obs, ]
  # Create a holdout set for evaluating model performance
  x.val <- df.x[-train.obs, ]
  # Subset outcome variable
  y.train <- df.y[train.obs]
  y.val <- df.y[-train.obs]

  ## Datasets
  train.df <- cbind(y.train, x.train)
  test.df <- cbind(y.val, x.val)

  ## Run model
  brt.model <- gbm.step(data = train.df, gbm.x = c(2:44), gbm.y = 1,
                        family = "bernoulli", tree.complexity = 5,
                        learning.rate = 0.001, bag.fraction = 0.6)
  brt.model

  ## Predictions from BRT
  x2 <- test.df[2:44]
  pred.brt <- predict(brt.model, newdata = x2,
                      n.trees = brt.model$gbm.call$best.trees, type = "response")

  ## Add predictions to data
  brt.df <- cbind(test.df, pred.brt)

  ## AUC
  predictions <- as.vector(pred.brt)
  pred <- prediction(predictions, test.df$y.val)

  ### ROC
  perf_ROC <- performance(pred, "tpr", "fpr")  # calculate the ROC curve
  ROC <- perf_ROC@y.values[[1]]
  ROC <- cbind(ROC, i)
  lines(perf_ROC@x.values[[1]], perf_ROC@y.values[[1]], col = Pal[i])  # add line to plot

  ### AUC
  perf_AUC <- performance(pred, "auc")  # calculate the AUC value
  AUC <- perf_AUC@y.values[[1]]
  AUC <- cbind(AUC, i)

  # AUC for each iteration
  if (exists("brt.auc")) {
    brt.auc <- rbind(brt.auc, AUC)
    rm(AUC)
  }
  if (!exists("brt.auc")) {
    brt.auc <- AUC
  }
}
This produces a plot of the individual ROC curves as in the image below (generated with a reduced number of iterations for speed), but I am unsure how to get something more like the first example.
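One possible route (a sketch only, not run against your data): interpolate each iteration's TPR onto a fixed grid of false positive rates, collect the interpolated curves, and then plot their pointwise mean together with quantile bands. Something along these lines, reusing perf_ROC and n.iter from the loop above:
## common FPR grid and a matrix to hold one interpolated curve per iteration
fpr.grid <- seq(0, 1, length.out = 101)
tpr.mat <- matrix(NA, nrow = n.iter, ncol = length(fpr.grid))
## inside the existing loop, right after perf_ROC is computed, add:
##   tpr.mat[i, ] <- approx(perf_ROC@x.values[[1]], perf_ROC@y.values[[1]],
##                          xout = fpr.grid, ties = "ordered")$y
## after the loop, plot the mean curve with a 95% pointwise band
tpr.mean <- colMeans(tpr.mat, na.rm = TRUE)
tpr.lo <- apply(tpr.mat, 2, quantile, 0.025, na.rm = TRUE)
tpr.hi <- apply(tpr.mat, 2, quantile, 0.975, na.rm = TRUE)
plot(fpr.grid, tpr.mean, type = "n",
     xlab = "False positive rate", ylab = "True positive rate")
polygon(c(fpr.grid, rev(fpr.grid)), c(tpr.lo, rev(tpr.hi)),
        col = adjustcolor("grey60", 0.5), border = NA)
lines(fpr.grid, tpr.mean, lwd = 2)
abline(0, 1, lty = 2)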

R: how to improve gradient boosting model fit

I tried fitting a gradient boosted model (weak learners are max.depth = 2 trees) to the iris data set using gbm in the gbm package. I set the number of iterations to M = 1000 with a learning rate of learning.rate = 0.001. I then compared the results to those of a regression tree (using rpart). However, it seems that the regression tree is outperforming the gradient boosted model. What's the reason behind this? And how can I improve the gradient boosted model's performance? I thought a learning rate of 0.001 should suffice with 1000 iterations/boosted trees.
library(rpart)
library(gbm)

data(iris)
train.dat <- iris[1:100, ]
test.dat <- iris[101:150, ]

learning.rate <- 0.001
M <- 1000

gbm.model <- gbm(Sepal.Length ~ ., data = train.dat, distribution = "gaussian",
                 n.trees = M, interaction.depth = 2, shrinkage = learning.rate,
                 bag.fraction = 1, train.fraction = 1)
yhats.gbm <- predict(gbm.model, newdata = test.dat, n.trees = M)

tree.mod <- rpart(Sepal.Length ~ ., data = train.dat)
yhats.tree <- predict(tree.mod, newdata = test.dat)

> sqrt(mean((test.dat$Sepal.Length - yhats.gbm)^2))
[1] 1.209446
> sqrt(mean((test.dat$Sepal.Length - yhats.tree)^2))
[1] 0.6345438
In the iris dataset there are three species: the first 50 rows are setosa, the next 50 are versicolor, and the last 50 are virginica. With your split, the model is trained on setosa and versicolor only and then tested entirely on virginica, a species it has never seen. So I think it's better if you shuffle the rows, and also make the Species column relevant.
library(ggplot2)
ggplot(iris,aes(x=Sepal.Width,y=Sepal.Length,col=Species)) + geom_point()
Secondly, you should do this over a few different replicates to see the uncertainty. For this we can use caret: we can define the training samples beforehand and also provide a fixed grid. What we are interested in is the error during training with cross-validation, which is similar to what you are doing:
library(caret)

set.seed(999)
idx <- split(sample(nrow(iris)), 1:nrow(iris) %% 3)
tr <- trainControl(method = "cv", index = idx)
this_grid <- data.frame(interaction.depth = 2, shrinkage = 0.001,
                        n.minobsinnode = 10, n.trees = 1000)
gbm_fit <- train(Sepal.Width ~ ., data = iris, method = "gbm",
                 distribution = "gaussian", tuneGrid = this_grid, trControl = tr)
Then we use the same samples to fit rpart:
# the default for rpart
this_grid <- data.frame(cp = 0.01)
rpart_fit <- train(Sepal.Width ~ ., data = iris, method = "rpart",
                   trControl = tr, tuneGrid = this_grid)
Finally we compare them, and they are very similar:
gbm_fit$resample
       RMSE  Rsquared       MAE Resample
1 0.3459311 0.5000575 0.2585884        0
2 0.3421506 0.4536114 0.2631338        1
3 0.3428588 0.5600722 0.2693837        2

rpart_fit$resample
       RMSE  Rsquared       MAE Resample
1 0.3492542 0.3791232 0.2695451        0
2 0.3320841 0.4276960 0.2550386        1
3 0.3284239 0.4343378 0.2570833        2
So I suspect there's something weird in the example above. It always depends on your data; for some data, such as iris, rpart might be good enough because there are very strong predictors. Also, for complex models like gbm, you most likely need to train with something like the above to find the optimal parameters.
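If you want to go further, caret can also search over a grid of gbm hyperparameters using the same resampling indices. A sketch with purely illustrative values (reusing tr from above; these are not tuned recommendations):
tune_grid <- expand.grid(interaction.depth = c(1, 2, 3),
                         shrinkage = c(0.001, 0.01, 0.1),
                         n.minobsinnode = 10,
                         n.trees = c(500, 1000, 2000))
gbm_tuned <- train(Sepal.Width ~ ., data = iris, method = "gbm",
                   distribution = "gaussian", verbose = FALSE,
                   tuneGrid = tune_grid, trControl = tr)
gbm_tuned$bestTune  # combination with the lowest cross-validated RMSE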

Use pROC package with h2o

I'm doing a binary classification with a GBM using the h2o package. I want to assess the predictive power of a certain variable, and if I'm correct I can do so by comparing the AUC of a model with the specific variable and a model without the specific variable.
I'm taking the titanic dataset as an example.
So my hypothesis is:
Age has significant predictive value for whether someone will survive.
library(h2o)
h2o.init()

df <- h2o.importFile(path = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")

response <- "survived"
df[[response]] <- as.factor(df[[response]])

## use all other columns (except for the name) as predictors
predictorsA <- setdiff(names(df), c(response, "name"))
predictorsB <- setdiff(names(df), c(response, "name", "age"))

splits <- h2o.splitFrame(
  data = df,
  ratios = c(0.6, 0.2),  ## only need to specify 2 fractions, the 3rd is implied
  destination_frames = c("train.hex", "valid.hex", "test.hex"), seed = 1234
)
train <- splits[[1]]
valid <- splits[[2]]
test <- splits[[3]]

gbmA <- h2o.gbm(x = predictorsA, y = response, distribution = "bernoulli", training_frame = train)
gbmB <- h2o.gbm(x = predictorsB, y = response, distribution = "bernoulli", training_frame = train)
## Get the AUC
h2o.auc(h2o.performance(gbmA, newdata = valid))
[1] 0.9631624
h2o.auc(h2o.performance(gbmB, newdata = test))
[1] 0.9603211
I know the pROC package has a roc.test function to compare AUC of two ROC curves and I would like to apply this function on the outcomes of my h2o model.
You can do something like this:
library(pROC)

valid_A <- as.data.frame(h2o.predict(gbmA, valid))
valid_B <- as.data.frame(h2o.predict(gbmB, valid))
valid_df <- as.data.frame(valid)

roc1 <- roc(valid_df$survived, valid_A$p1)
roc2 <- roc(valid_df$survived, valid_B$p1)

> roc.test(roc1, roc2)

        DeLong's test for two correlated ROC curves

data:  roc1 and roc2
Z = -0.087489, p-value = 0.9303
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
  0.9500141   0.9504367
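If you also want an interval around each AUC on its own, pROC's ci.auc can provide one (a small optional addition to the code above):
# confidence interval for each curve's AUC
ci.auc(roc1)
ci.auc(roc2)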

Generating confidence intervals for predicted probabilities after running mlogit function in R

I have been struggling with the following problem for some time and would be very grateful for any help. I am running a logit model in R using the mlogit function and am able to generate the predicted probability of choosing each alternative for a given value of the predictors as follows:
library(mlogit)

data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
Fish_fit <- Fish[-(1:4), ]
Fish_test <- Fish[1:4, ]

m <- mlogit(mode ~ price + catch | income, data = Fish_fit)
predict(m, newdata = Fish_test)
I cannot, however, work out how to add confidence intervals to the predicted probability estimates. I have already tried adding arguments to the predict function, but none seem to generate them. Any ideas on how it can be achieved would be much appreciated.
One approach here is Monte Carlo simulation. You'd simulate repeated draws from a multivariate-normal sampling distribution whose parameters are given by your model results.
For each simulation, estimate your predicted probabilities, and use their empirical distribution over simulations to get your confidence intervals.
library(MASS)

est_betas <- m$coefficients
est_preds <- predict(m, newdata = Fish_test)

# 1000 draws from the sampling distribution of the coefficients
sim_betas <- mvrnorm(1000, m$coefficients, vcov(m))
# predicted probabilities under each simulated coefficient vector
sim_preds <- apply(sim_betas, 1, function(x) {
  m$coefficients <- x
  predict(m, newdata = Fish_test)
})
# pointwise 95% interval for each alternative
sim_ci <- apply(sim_preds, 1, quantile, c(.025, .975))

cbind(prob = est_preds, t(sim_ci))
#              prob       2.5%     97.5%
# beach   0.1414336 0.10403634 0.1920795
# boat    0.3869535 0.33521346 0.4406527
# charter 0.3363766 0.28751240 0.3894717
# pier    0.1352363 0.09858375 0.1823240
