I'm doing binary classification with a GBM using the h2o package. I want to assess the predictive power of a particular variable, and if I understand correctly I can do so by comparing the AUC of a model that includes the variable with the AUC of a model that excludes it.
I'm taking the titanic dataset as an example.
So my hypothesis is:
Age has significant predictive value for whether someone will survive.
library(h2o)
h2o.init()
df <- h2o.importFile(path = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
response <- "survived"
df[[response]] <- as.factor(df[[response]])
## use all other columns (except for the name) as predictors
predictorsA <- setdiff(names(df), c(response, "name"))
predictorsB <- setdiff(names(df), c(response, "name", "age"))
splits <- h2o.splitFrame(
  data = df,
  ratios = c(0.6, 0.2),  ## only need to specify 2 fractions; the 3rd is implied
  destination_frames = c("train.hex", "valid.hex", "test.hex"),
  seed = 1234
)
train <- splits[[1]]
valid <- splits[[2]]
test <- splits[[3]]
gbmA <- h2o.gbm(x = predictorsA, y = response, distribution="bernoulli", training_frame = train)
gbmB <- h2o.gbm(x = predictorsB, y = response, distribution="bernoulli", training_frame = train)
## Get the AUC
h2o.auc(h2o.performance(gbmA, newdata = valid))
[1] 0.9631624
h2o.auc(h2o.performance(gbmB, newdata = test))
[1] 0.9603211
I know the pROC package has a roc.test function to compare the AUC of two ROC curves, and I would like to apply this function to the outcomes of my h2o models.
You can do something like this:
library(pROC)
valid_A <- as.data.frame(h2o.predict(gbmA, valid))
valid_B <- as.data.frame(h2o.predict(gbmB, valid))
valid_df <- as.data.frame(valid)
roc1 <- roc(valid_df$survived, valid_A$p1)
roc2 <- roc(valid_df$survived, valid_B$p1)
> roc.test(roc1,roc2)
DeLong's test for two correlated ROC curves
data: roc1 and roc2
Z = -0.087489, p-value = 0.9303
alternative hypothesis: true difference in AUC is not equal to 0
sample estimates:
AUC of roc1 AUC of roc2
0.9500141 0.9504367
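Note that roc.test() defaults to DeLong's test for paired (correlated) curves, which is appropriate here because both models are scored on the same validation frame; in the question the two AUCs were computed on different frames (valid and test), which would not be a fair comparison. If you prefer a resampling-based test, pROC can also compare the curves via the bootstrap. A minimal sketch reusing roc1 and roc2 from above (boot.n is just an illustrative choice):
# paired bootstrap comparison of the two correlated ROC curves
roc.test(roc1, roc2, method = "bootstrap", boot.n = 2000, paired = TRUE)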
I am undertaking an ordinal logistic regression using the R package MASS.
For example:
library(MASS)
house.plr <- polr(Sat ~ Infl + Type + Cont, weights = Freq, data = housing)
summary(house.plr, digits = 3)
I am using the S3 method predict() to obtain the predicted values:
test_dat <- data.frame(Infl = factor(rep("Low", 4)),
                       Cont = factor(rep("Low", 4)),
                       Type = unique(housing$Type))
predict(house.plr, test_dat, type = "p")
Low Medium High
1 0.3784493 0.2876752 0.3338755
2 0.5190445 0.2605077 0.2204478
3 0.4675584 0.2745383 0.2579033
4 0.6444840 0.2114256 0.1440905
The result is a table of predicted means for each level of Sat, given the variables defined in test_dat.
How might I extract the variation around each of these means in the form of a standard error or standard deviation?
First, your predicted values are the predicted probabilities of each outcome for each observation; they are not predicted means on the response scale.
Second, you can use the marginaleffects package to get the standard errors for the predicted probabilities and then calculate the confidence intervals yourself. Alternatively, you can implement a non-parametric bootstrap. I implement both below. Note that I shifted the order of the columns in the test data to match the training data.
# Packages
library(MASS)
library(marginaleffects)
library(dplyr)
# Create a test set
N <- 4
test_dat <- data.frame(
  Infl = factor(rep("Low", N)),
  Type = unique(housing$Type),
  Cont = factor(rep("Low", N))
)
# Fit ordered logistic regression model
house.plr <- polr(Sat ~ Infl + Type + Cont,
                  weights = Freq,
                  data = housing,
                  Hess = TRUE)
# Demonstrate that predict() doesn't provide any measure of variability
# for the predicted class probabilities, as shown in question
predict(house.plr, test_dat, type = "probs")
# Use the marginaleffects package to get delta method standard errors for
# each predicted probability
probs <- marginaleffects::predictions(house.plr,
                                      newdata = test_dat,
                                      type = "probs")
# Compute CIs from the standard error using normal approximation
probs$predicted - 1.96*probs$std.error
probs$predicted + 1.96*probs$std.error
# Alternatively, use non-parametric bootstrapped confidence intervals.
# note that this does not adjust the weights to a constant sum for
# each bootstrap, although it is easy to implement. You're free to
# determine how to handle the weights, including resampling based
# on the weights.
# Generate bootstrapped data.frames
set.seed(123)
sims <- 5
samples <- vector(mode = "list", length = sims)
samples <- lapply(samples, function(x) {
  slice_sample(housing, n = nrow(housing), replace = TRUE)
})
# Fit model on each bootstrapped data.frame
models <- lapply(samples, function(x) {
  polr(Sat ~ Infl + Type + Cont,
       weights = Freq,
       data = x,
       Hess = TRUE)
})
# Get test predictions into a data.frame
probs_boot <- lapply(models, function(x) {
  marginaleffects::predictions(x,
                               newdata = test_dat,
                               type = "probs")
})
probs_boot_df <- bind_rows(probs_boot)
# Compute CIs
probs_boot_df %>%
  group_by(group, Type.x, Infl, Type.y, Cont) %>%
  summarise(ci_low = quantile(predicted, probs = 0.025),
            ci_high = quantile(predicted, probs = 0.975))
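As a follow-up to the comment above about the bootstrap weights, here is a minimal sketch (not part of the original answer) of rescaling Freq so that every bootstrap replicate carries the same total weight as the original data; you would run this on samples before fitting the models:
# rescale Freq so each bootstrap replicate has the same total weight
total_freq <- sum(housing$Freq)
samples <- lapply(samples, function(x) {
  x$Freq <- x$Freq * total_freq / sum(x$Freq)
  x
})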
I am trying to calibrate the probabilities that I get from the predict function of a caret model in R.
In my case I have two classes and multiple predictors. I use the iris dataset as an example so you can reproduce it and help me out.
library(dplyr)
library(caret)
# reduce the data to two classes only
my_data <- iris %>%
  dplyr::filter(Species == "virginica" | Species == "versicolor") %>%
  dplyr::select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species)
my_data <- droplevels(my_data)
index <- createDataPartition(y = my_data$Species, p = 0.6, list = FALSE)
#creating train and test set for machine learning
Train <- my_data[index,]
Test <- my_data[-index,]
#machine learning based on Train data partition with glmnet method
classCtrl <- trainControl(method = "repeatedcv", number=10,repeats=5,classProbs = TRUE,savePredictions = "final")
set.seed(355)
glmnet_ML <- train(Species~., Train, method= "glmnet", trControl=classCtrl)
glmnet_ML
#probabilities to assign each row of data to one class or the other on Test
predTestprob <- predict(glmnet_ML, Test, type = "prob")
predTestprob
#trying out calibration following "Applied predictive modeling" book from Max Kuhn p266-273
predTrainprob <- predict(glmnet_ML,Train,type="prob")
predTest <- predict(glmnet_ML,Test)
predTestprob <- predict(glmnet_ML,Test,type="prob")
Test$PredProb <- predTestprob[,"versicolor"]
Test$Pred <- predTest
Train$PredProb <- predTrainprob[,"versicolor"]
#logistic regression to calibrate
sigmoidalCal <- glm(relevel(Species, ref = "virginica") ~ PredProb,
                    data = Train, family = binomial)
coef(summary(sigmoidalCal))
#predicting calibrated scores
sigmoidProbs <- predict(sigmoidalCal,newdata = Test[,"PredProb", drop = FALSE],type = "response")
Test$CalProb <- sigmoidProbs
#plotting to see if it works
calCurve2 <- calibration(Species ~ PredProb + CalProb, data = Test)
xyplot(calCurve2,auto.key = list(columns = 2))
In my opinion the result shown by the plot is not good, which indicates a mistake in the calibration: the CalProb curve should follow the diagonal, but it does not.
Has anyone done anything similar?
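One thing worth double-checking: the calibration GLM above is fit on resubstitution predictions, i.e. on the same Train rows the glmnet model was trained on, and those probabilities tend to be overconfident. Since savePredictions = "final" was used, a hedged alternative is to fit the calibration model on the held-out cross-validation predictions that caret stores (assuming, as with classProbs = TRUE, that the probability columns in glmnet_ML$pred are named after the class levels):
# fit the calibration model on caret's held-out CV predictions instead
cv_pred <- glmnet_ML$pred   # contains obs plus per-class probability columns
sigmoidalCal2 <- glm(relevel(obs, ref = "virginica") ~ versicolor,
                     data = cv_pred, family = binomial)
# apply it to the test-set probabilities (the column name must match the formula)
Test$CalProb2 <- predict(sigmoidalCal2,
                         newdata = data.frame(versicolor = Test$PredProb),
                         type = "response")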
I have a dataset that contains information about patients. It includes several variables and their clinical status (0 if they are healthy, 1 if they are sick).
I have tried to implement an SVM model to predict patient status based on these variables.
library(e1071)
library(ROCR)
Index <- order(Ytrain, decreasing = FALSE)
SVMfit_Var <- svm(Xtrain[Index, ], Ytrain[Index],
                  type = "C-classification", gamma = 0.005,
                  probability = TRUE, cost = 0.001, epsilon = 0.1)
preds1 <- predict(SVMfit_Var, Xtest, probability = TRUE)
preds1 <- attr(preds1, "probabilities")[, 1]
samples <- !is.na(Ytest)
pred <- prediction(preds1[samples], Ytest[samples])
AUC <- performance(pred, "auc")@y.values[[1]]
prediction <- predict(SVMfit_Var, Xtest)
xtab <- table(Ytest, prediction)
To test the performance of the model, I have calculated the ROC AUC, and with the validation set I obtain an AUC = 0.997.
But when I view the predictions, all the patients have been assigned as healthy.
AUC = 0.997
> xtab
prediction
Ytest 0 1
0 72 0
1 52 0
Can anyone help me with this problem?
Did you look at the probabilities versus the fitted values? You can read about how probability works with SVM here.
If you want to look at the performance, you can use the DescTools library with the function Conf, or the caret library with the function confusionMatrix. (They provide the same output.)
library(DescTools)
library(caret)
# for the training performance with DescTools
Conf(table(SVMfit_Var$fitted, Ytrain[Index]))
# svm.model$fitted, y-values for training
# training performance with caret
confusionMatrix(SVMfit_Var$fitted, as.factor(Ytrain[Index]))
# svm.model$fitted, y-values
# if y.values aren't factors, use as.factor()
# for testing performance with DescTools
# with `table()` in your question, you must flip the order:
# predicted first, then actual values
Conf(table(prediction, Ytest))
# and for caret
confusionMatrix(prediction, as.factor(Ytest))
Your question isn't reproducible, so I worked through this with the iris data. The probability was the same for every observation there as well; I include it so you can see this behaviour with another data set.
library(e1071)
library(ROCR)
library(caret)
library(dplyr)
data("iris")
# make it binary
df1 <- iris %>% filter(Species != "setosa") %>% droplevels()
# check the subset
summary(df1)
set.seed(395) # keep the sample repeatable
tr <- sample(1:nrow(df1), size = 70,  # 70%
             replace = FALSE)
# create the model
svm.fit <- svm(df1[tr, -5], df1[tr, ]$Species,
               type = "C-classification",
               gamma = .005, probability = TRUE,
               cost = .001, epsilon = .1)
# look at probabilities
pb.fit <- predict(svm.fit, df1[-tr, -5], probability = T)
# this shows EVERY row has the same outcome probability distro
pb.fit <- attr(pb.fit, "probabilities")[,1]
# look at performance
performance(prediction(pb.fit, df1[-tr, ]$Species), "auc")@y.values[[1]]
# [1] 0.03555556 that's abysmal!!
# test the model
p.fit = predict(svm.fit, df1[-tr, -5])
confusionMatrix(p.fit, df1[-tr, ]$Species)
# 93% accuracy with NIR at 50%... the AUC score was not useful
# check the trained model performance
confusionMatrix(svm.fit$fitted, df1[tr, ]$Species)
# 87%, with NIR at 50%... that's really good
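A quick sanity check of the constant-probability behaviour mentioned in the comments above, reusing the pb.fit vector from the block (a small sketch, not part of the original workflow):
# count the distinct probability values the SVM produced on the test rows
length(unique(round(pb.fit, 8)))
# a result of 1 means every test row received exactly the same probability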
I am very new to deep learning. I trained a neural net using the packages deepnet and caret. For this regression problem caret uses a sigmoid activation function and a linear output function.
I preprocessed the predictors using preProcess = "range" (which I thought normalizes the predictors).
library(caret)
library(deepnet)
set.seed(123, kind = "Mersenne-Twister", normal.kind = "Inversion")
# create data
dat <- as.data.frame(ChickWeight)
dat$vari <- sample(LETTERS, nrow(dat), replace = TRUE)
dat$Chick <- as.character(dat$Chick)
preds <- dat[1:100,2:5]
response <- dat[1:100,1]
vali <- dat[101:150,]
# change format of categorical predictors to one-hot encoded format
dmy <- dummyVars(" ~ .", data = preds)
preds_dummies <- data.frame(predict(dmy, newdata = preds))
# specify trainControl with repeated cross-validation and saved predictions
control <- caret::trainControl(search = "grid", method = "repeatedcv",
                               number = 3, repeats = 2,
                               savePredictions = TRUE)
# tune hyperparameters and build final model
tunegrid <- expand.grid(layer1 = c(5, 50),
                        layer2 = c(0, 5, 50),
                        layer3 = c(0, 5, 50),
                        hidden_dropout = c(0, 0.1),
                        visible_dropout = c(0, 0.1))
model <- caret::train(x = preds_dummies,
                      y = response,
                      method = "dnn",
                      metric = "RMSE",
                      tuneGrid = tunegrid,
                      trControl = control,
                      preProcess = "range")
When I predict on the validation set with the tuned neural network, it produces the same prediction value for every row, even though the input predictors vary.
# predict with validation set
# create dummies
dmy <- dummyVars(" ~ .", data = vali)
vali_dummies <- data.frame(predict(dmy, newdata = vali))
vali_dummies <- vali_dummies[,which(names(vali_dummies) %in% model$finalModel$xNames)]
# add empty columns for the categorical predictor levels used in the model
# but absent from the validation set (to obtain the same matrix)
not_included <- setdiff(model$finalModel$xNames, names(vali_dummies))
vali_add <- as.data.frame(matrix(rep(0, length(not_included) * nrow(vali_dummies)),
                                 nrow = nrow(vali_dummies),
                                 ncol = length(not_included)))
# change names
names(vali_add) <- not_included
# add to vali_dummies
vali_dummies <- cbind(vali_dummies, vali_add)
# put it in the same order as preds_dummies (sort the columns)
vali_dummies <- vali_dummies[names(preds_dummies)]
# normalize also the validation set
pp = preProcess(vali_dummies, method = c("range"))
vali_dummies <- predict(pp, vali_dummies)
# save obs and pred for predictions with the outer CV out-of-fold test set
temp <- data.frame(obs = vali[, 1],
                   pred = caret::predict.train(object = model, newdata = vali_dummies))
temp
When I use the Boston data set from the MASS package, which has no categorical predictors, I get slightly different prediction values across the rows of the validation set.
How can I fix this and build a neural network that produces different predictions when using numeric as well as categorical predictors? What else besides normalization should I try?
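One detail that may be worth checking in the prediction step (hedged, since it may not fully explain the constant predictions): caret stores the "range" preprocessing inside the fitted train object, and predict.train() applies it to newdata automatically, so running preProcess() again on the validation set scales the data twice, using min/max values taken from the validation set rather than from the training set. A minimal sketch of predicting on the dummy-encoded, column-matched validation data without the extra scaling step:
# skip the second preProcess()/predict(pp, ...) step; predict.train() applies
# the "range" transformation learned from the training data on its own
temp <- data.frame(obs = vali[, 1],
                   pred = caret::predict.train(object = model, newdata = vali_dummies))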
I am building two different classifiers to predict a binary outcome. I then want to compare the results of the two models using a ROC curve and the area under it (AUC).
I split the data set into a training and a testing set. On the training set I perform a form of cross-validation. From the held-out samples of the cross-validation I build a ROC curve per model. Then I apply the models to the testing set and build another set of ROC curves.
The results are contradictory, which is confusing me. I am not sure which result is the correct one, or whether I am doing something completely wrong. The held-out-sample ROC curves show that RF is the better model, while the testing-set ROC curves show that SVM is the better model.
Analysis
library(ggplot2)
library(caret)
library(pROC)
library(ggthemes)
library(plyr)
library(ROCR)
library(reshape2)
library(gridExtra)
my_data <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
str(my_data)
names(my_data)[1] <- "Class"
my_data$Class <- ifelse(my_data$Class == 1, "event", "noevent")
my_data$Class <- factor(my_data$Class, levels = c("noevent", "event"), ordered = TRUE)
set.seed(1732)
ind <- createDataPartition(my_data$Class, p = 2/3, list = FALSE)
train <- my_data[ ind,]
test <- my_data[-ind,]
Next I train two models: a random forest and an SVM. Here I also use Max Kuhn's function to get the averaged ROC curves from the held-out samples for both models, and I save those results into another data.frame along with the AUC of each curve.
#Train RF
ctrl <- trainControl(method = "repeatedcv",
                     number = 5,
                     repeats = 3,
                     classProbs = TRUE,
                     savePredictions = TRUE,
                     summaryFunction = twoClassSummary)
grid <- data.frame(mtry = seq(1,3,1))
set.seed(1537)
rf_mod <- train(Class ~ .,
                data = train,
                method = "rf",
                metric = "ROC",
                tuneGrid = grid,
                ntree = 1000,
                trControl = ctrl)
rfClasses <- predict(rf_mod, test)
#This is the ROC curve from held-out samples. Source: Max Kuhn's 2016 useR! code here: https://github.com/topepo/useR2016
roc_train <- function(object, best_only = TRUE, ...) {
  lvs <- object$modelInfo$levels(object$finalModel)
  if(best_only) {
    object$pred <- merge(object$pred, object$bestTune)
  }
  ## find tuning parameter names
  p_names <- as.character(object$modelInfo$parameters$parameter)
  p_combos <- object$pred[, p_names, drop = FALSE]
  ## average probabilities across resamples
  object$pred <- plyr::ddply(.data = object$pred,
                             .variables = c("obs", "rowIndex", p_names),
                             .fun = function(dat, lvls = lvs) {
                               out <- mean(dat[, lvls[1]])
                               names(out) <- lvls[1]
                               out
                             })
  make_roc <- function(x, lvls = lvs, nms = NULL, ...) {
    out <- pROC::roc(response = x$obs,
                     predictor = x[, lvls[1]],
                     levels = rev(lvls))
    out$model_param <- x[1, nms, drop = FALSE]
    out
  }
  out <- plyr::dlply(.data = object$pred,
                     .variables = p_names,
                     .fun = make_roc,
                     lvls = lvs,
                     nms = p_names)
  if(length(out) == 1) out <- out[[1]]
  out
}
temp <- roc_train(rf_mod)
plot_data_ROC <- data.frame(Model='Random Forest', sens = temp$sensitivities, spec=1-temp$specificities)
#This is the AUC of the held-out samples roc curve for RF
auc.1 <- abs(sum(diff(1-temp$specificities) * (head(temp$sensitivities,-1)+tail(temp$sensitivities,-1)))/2)
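# a hedged cross-check (not in the original post): temp is a pROC roc object,
# so pROC::auc(temp) should give the same area as the manual trapezoid sum above
auc.1_check <- as.numeric(pROC::auc(temp))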
#Build SVM
set.seed(1537)
svm_mod <- train(Class ~ .,
                 data = train,
                 method = "svmRadial",
                 metric = "ROC",
                 trControl = ctrl)
svmClasses <- predict(svm_mod, test)
#ROC curve into df
temp <- roc_train(svm_mod)
plot_data_ROC <- rbind(plot_data_ROC, data.frame(Model='Support Vector Machine', sens = temp$sensitivities, spec=1-temp$specificities))
#This is the AUC of the held-out samples roc curve for SVM
auc.2 <- abs(sum(diff(1-temp$specificities) * (head(temp$sensitivities,-1)+tail(temp$sensitivities,-1)))/2)
Next I will plot the results
#Plotting Final
#ROC of held-out samples
q <- ggplot(data=plot_data_ROC, aes(x=spec, y=sens, group = Model, colour = Model))
q <- q + geom_path() + geom_abline(intercept = 0, slope = 1) + xlab("False Positive Rate (1-Specificity)") + ylab("True Positive Rate (Sensitivity)")
q + theme(axis.line = element_line(), axis.text=element_text(color='black'),
axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text())
#ROC of testing set
rf.probs <- predict(rf_mod, test,type="prob")
pr <- prediction(rf.probs$event, factor(test$Class, levels = c("noevent", "event"), ordered = TRUE))
pe <- performance(pr, "tpr", "fpr")
roc.data <- data.frame(Model = 'Random Forest', fpr = unlist(pe@x.values), tpr = unlist(pe@y.values))
svm.probs <- predict(svm_mod, test,type="prob")
pr <- prediction(svm.probs$event, factor(test$Class, levels = c("noevent", "event"), ordered = TRUE))
pe <- performance(pr, "tpr", "fpr")
roc.data <- rbind(roc.data, data.frame(Model = 'Support Vector Machine', fpr = unlist(pe@x.values), tpr = unlist(pe@y.values)))
q <- ggplot(data=roc.data, aes(x=fpr, y=tpr, group = Model, colour = Model))
q <- q + geom_line() + geom_abline(intercept = 0, slope = 1) + xlab("False Positive Rate (1-Specificity)") + ylab("True Positive Rate (Sensitivity)")
q + theme(axis.line = element_line(), axis.text=element_text(color='black'),
axis.title = element_text(colour = 'black'), legend.text=element_text(), legend.title=element_text())
#AUC of hold out samples
data.frame(Rf = auc.1, Svm = auc.2)
#AUC of testing set. Source: Max Kuhn's 2016 useR! code here: https://github.com/topepo/useR2016
test_pred <- data.frame(Class = factor(test$Class, levels = c("noevent", "event"), ordered = TRUE))
test_pred$Rf <- predict(rf_mod, test, type = "prob")[, "event"]
test_pred$Svm <- predict(svm_mod, test, type = "prob")[, "event"]
get_auc <- function(pred, ref){
  auc(roc(ref, pred, levels = rev(levels(ref))))
}
apply(test_pred[, -1], 2, get_auc, ref = test_pred$Class)
The results from the held-out samples and from the testing set are totally different (I know they will be different but by this much?).
# AUC of held-out samples
       Rf       Svm
 0.656044 0.5983193
# AUC of testing set
       Rf       Svm
0.6326531 0.6453428
From the held-out samples one would choose the RF model but from the testing set one would pick the SVM model.
Which is the "correct" or "better" way to choose the model?
Am I making a big mistake somewhere or not understanding something correctly?
If I understand correctly, you have three labeled data sets:
Training
Hold-out CV sample from training
"Testing" CV sample
While, yes, under a hold-out-sample CV strategy you would normally choose your model based on the hold-out sample, you also don't normally have a larger validation data set.
Clearly, if both the hold-out and the Testing data sets are (a) labeled and (b) as close to orthogonal to the training data as possible, then you'd choose your model based on whichever has the larger sample size.
In your case it looks like what you're calling the hold-out sample is just the repeated CV resampling from training. That being the case, you have even more reason to prefer the results from the Testing data set validation. See Steffen's related note on repeated CV.
In theory, Random Forest's bagging has an inherent form of cross-validation through the OOB statistics, and the CV conducted within the training phase should give you some measure of validation. However, in practice it's common to observe a lack of orthogonality and an increased likelihood of overfitting, since the samples come from the training data itself and may reinforce the mistake of overfitting for accuracy.
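For reference, the OOB estimate mentioned above can be inspected directly from the randomForest object that caret stores (a small sketch, assuming rf_mod was fit with method = "rf" as in the question):
# OOB error rate tracked across trees; the final row is the estimate
# for the full forest
oob_err <- rf_mod$finalModel$err.rate[, "OOB"]
tail(oob_err, 1)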
I can explain that theoretically to some extent, as above; beyond that, I can only tell you that empirically I've found that performance results from the so-called CV and OOB error calculated on the training data can be highly misleading, and that a true hold-out (Testing) set never touched during training is the far better validation.
Your true hold-out sample is the Testing data set, since none of its data is used during training. Use those results.
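If it helps to quantify how different the two test-set AUCs really are, pROC's roc.test() can compare the two correlated curves built from test_pred (a hedged sketch reusing the objects created above):
# DeLong's test for the two correlated test-set ROC curves
library(pROC)
roc_rf  <- roc(test_pred$Class, test_pred$Rf,  levels = c("noevent", "event"))
roc_svm <- roc(test_pred$Class, test_pred$Svm, levels = c("noevent", "event"))
roc.test(roc_rf, roc_svm)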