My apologies if I'm missing something obvious. I've been thoroughly enjoying working with h2o in the last few days using R interface. I would like to evaluate my model, say a random forest, by plotting an ROC. The documentation seems to suggest that there is a straightforward way to do that:
Interpreting a DRF Model
By default, the following output displays:
Model parameters (hidden)
A graph of the scoring history (number of trees vs. training MSE)
A graph of the ROC curve (TPR vs. FPR)
A graph of the variable importances
...
I've also seen that in python you can apply roc function here. But I can't seem to be able to find the way to do the same in R interface. Currently I'm extracting predictions from the model using h2o.cross_validation_holdout_predictions and then use pROC package from R to plot the ROC. But I would like to be able to do it directly from the H2O model object, or, perhaps, a H2OModelMetrics object.
Many thanks!
A naive solution is to use plot() generic function to plot a H2OMetrics object:
logit_fit <- h2o.glm(colnames(training)[-1],'y',training_frame =
training.hex,validation_frame=validation.hex,family = 'binomial')
plot(h2o.performance(logit_fit),valid=T),type='roc')
This will give us a plot:
But it is hard to customize, especially to change the line type, since the type parameter is already taken as 'roc'. Also I have not found a way to plot multiple models' ROC curves together on one plot. I have come up with a method to extract true positive rate and false positive rate from the H2OMetrics object and use ggplot2 to plot the ROC curves on one plot by myself. Here is the example code(uses a lot of tidyverse syntax):
# for example I have 4 H2OModels
list(logit_fit,dt_fit,rf_fit,xgb_fit) %>%
# map a function to each element in the list
map(function(x) x %>% h2o.performance(valid=T) %>%
# from all these 'paths' in the object
.#metrics %>% .$thresholds_and_metric_scores %>%
# extracting true positive rate and false positive rate
.[c('tpr','fpr')] %>%
# add (0,0) and (1,1) for the start and end point of ROC curve
add_row(tpr=0,fpr=0,.before=T) %>%
add_row(tpr=0,fpr=0,.before=F)) %>%
# add a column of model name for future grouping in ggplot2
map2(c('Logistic Regression','Decision Tree','Random Forest','Gradient Boosting'),
function(x,y) x %>% add_column(model=y)) %>%
# reduce four data.frame to one
reduce(rbind) %>%
# plot fpr and tpr, map model to color as grouping
ggplot(aes(fpr,tpr,col=model))+
geom_line()+
geom_segment(aes(x=0,y=0,xend = 1, yend = 1),linetype = 2,col='grey')+
xlab('False Positive Rate')+
ylab('True Positive Rate')+
ggtitle('ROC Curve for Four Models')
Then the ROC curve is:
you can get the roc curve by passing the model performance metrics to H2O's plot function.
shortened code snippet which assumes you created a model, call it glm, and split your dataset into train and validation sets:
perf <- h2o.performance(glm, newdata = validation)
h2o.plot(perf)
full code snippet below:
h2o.init()
# Run GLM of CAPSULE ~ AGE + RACE + PSA + DCAPS
prostatePath = system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex = h2o.importFile(path = prostatePath, destination_frame = "prostate.hex")
glm = h2o.glm(y = "CAPSULE", x = c("AGE","RACE","PSA","DCAPS"), training_frame = prostate.hex, family = "binomial", nfolds = 0, alpha = 0.5, lambda_search = FALSE)
perf <- h2o.performance(glm, newdata = prostate.hex)
h2o.plot(perf)
and this will produce the following:
There is not currently a function in H2O R or Python client to plot the ROC curve directly. The roc method in Python returns the data neccessary to plot the ROC curve, but does not plot the curve itself. ROC curve plotting directly from R and Python seems like a useful thing to add, so I've created a JIRA ticket for it here: https://0xdata.atlassian.net/browse/PUBDEV-4449
The reference to the ROC curve in the docs refers to the H2O Flow GUI, which will automatically plot a ROC curve for any binary classification model in your H2O cluster. All the other items in that list are in fact available directly in R and Python, however.
If you train a model in R, you can visit the Flow interface (e.g. localhost:54321) and click on a binomial model to see it's ROC curves (training, validation and cross-validated versions). It will look like this:
Building off #Lauren's example, after you run model.performance you can extract all necessary information for ggplot from perf#metrics$thresholds_and_metric_scores. This code produces the ROC curve, but you can also add precision, recall to the selected variables for plotting the PR curve.
Here is some example code using the same model as above.
library(h2o)
library(dplyr)
library(ggplot2)
h2o.init()
# Run GLM of CAPSULE ~ AGE + RACE + PSA + DCAPS
prostatePath <- system.file("extdata", "prostate.csv", package = "h2o")
prostate.hex <- h2o.importFile(
path = prostatePath,
destination_frame = "prostate.hex"
)
glm <- h2o.glm(
y = "CAPSULE",
x = c("AGE", "RACE", "PSA", "DCAPS"),
training_frame = prostate.hex,
family = "binomial",
nfolds = 0,
alpha = 0.5,
lambda_search = FALSE
)
# Model performance
perf <- h2o.performance(glm, newdata = prostate.hex)
# Extract info for ROC curve
curve_dat <- data.frame(perf#metrics$thresholds_and_metric_scores) %>%
select(c(tpr, fpr))
# Plot ROC curve
ggplot(curve_dat, aes(x = fpr, y = tpr)) +
geom_point() +
geom_line() +
geom_segment(
aes(x = 0, y = 0, xend = 1, yend = 1),
linetype = "dotted",
color = "grey50"
) +
xlab("False Positive Rate") +
ylab("True Positive Rate") +
ggtitle("ROC Curve") +
theme_bw()
Which produces this plot:
roc_plot
Related
~Beginner in R~
I have the following code for a data set that has variables: price, mileage, and color. I have plotted a basic plot of x=mileage and y=price, and fitted a linear regression line to the plot.
cd = read.csv("https://bitbucket.org/remcc/rob-data-sets/downloads/susedcars.csv")
cd = cd[,c('price','mileage','color')]
n = nrow(cd)
set.seed(99)
pin = .75 #percent train (or percent in-sample)
ii = sample(1:n,floor(pin*n))
cdtr = cd[ii,]
cdte = cd[-ii,]
dim(cd)
plot(cd$mileage, cd$price, xlab="Mileage", ylab="Price", pch=16,cex=.8)
abline(lm(cd$price ~ cd$mileage), col="red", lwd=2)
## FITTING KNN
pred_knn=knn(data.frame(cdtr$mileage), data.frame(cdte$mileage), cl=cdtr$price, k=50)
I am trying to fit a line using pred_knn to the plot so that my plot looks like this:
However, I am not sure how to go about adding the kNN fit to my plot
As dcarlson mentions in the comments, it looks like you're using class::knn to predict a continuous variable, when it's meant to be used for classification (i.e. categorical responses).
The FNN package allows for kNN regression. Does this help:
library(FNN)
pred_knn = FNN::knn.reg(train = cdtr[2], test = cdte[2], y = cdtr[1], k = 50)
plot(cdte$mileage,cdte$price,, xlab="Mileage", ylab="Price", pch=16,cex=.8))
ORDER = order(cdte$mileage)
lines(cdte$mileage[ORDER],pred_knn$pred[ORDER])
I am using R. I am following this tutorial over here (https://rviews.rstudio.com/2017/09/25/survival-analysis-with-r/ ) and I am trying to adapt the code for a similar problem.
In this tutorial, a statistical model is developed on a dataset and then this statistical model is used to predict 3 news observations. We then plot the results for these 3 observations:
#load libraries
library(survival)
library(dplyr)
library(ranger)
library(data.table)
library(ggplot2)
#use the built in "lung" data set
#remove missing values (dataset is called "a")
a = na.omit(lung)
#create id variable
a$ID <- seq_along(a[,1])
#create test set with only the first 3 rows
new = a[1:3,]
#create a training set by removing first three rows
a = a[-c(1:3),]
#fit survival model (random survival forest)
r_fit <- ranger(Surv(time,status) ~ age + sex + ph.ecog + ph.karno + pat.karno + meal.cal + wt.loss, data = a, mtry = 4, importance = "permutation", splitrule = "extratrees", verbose = TRUE)
#create new intermediate variables required for the survival curves
death_times <- r_fit$unique.death.times
surv_prob <-data.frame(r_fit$survival)
avg_prob <- sapply(surv_prob, mean)
#use survival model to produce estimated survival curves for the first three observations
pred <- predict(r_fit, new, type = 'response')$survival
pred <- data.table(pred)
colnames(pred) <- as.character(r_fit$unique.death.times)
#plot the results for these 3 patients
plot(r_fit$unique.death.times, pred[1,], type = "l", col = "red")
lines(r_fit$unique.death.times, r_fit$survival[2,], type = "l", col = "green")
lines(r_fit$unique.death.times, r_fit$survival[3,], type = "l", col = "blue")
From here, I would like to try an add confidence interval (confidence regions) to each of these 3 curves, so that they look something like this:
I found a previous stackoverflow post (survfit() Shade 95% confidence interval survival plot ) that shows how to do something similar, but I am not sure how to extend the results from this post to each individual observation.
Does anyone know if there is a direct way to add these confidence intervals?
Thanks
If you create your plot using ggplot, you can use the geom_ribbon function to draw confidence intervals as follows:
ggplot(data=...)+
geom_line(aes(x=..., y=...),color=...)+
geom_ribbon(aes(x=.. ,ymin =.., ymax =..), fill=.. , alpha =.. )+
geom_line(aes(x=..., y=...),color=...)+
geom_ribbon(aes(x=.. ,ymin =.., ymax =..), fill=.. , alpha =.. )
You can put + after geom_line and repeat the same steps for each observation.
You can also check:
Having trouble plotting multiple data sets and their confidence intervals on the same GGplot. Data Frame included and
https://bookdown.org/ripberjt/labbook/appendix-guide-to-data-visualization.html
I am looking for a way to obtain the piecewise quantile linear regression with R. I have been able to compute the Quantile regression with the package quantreg. However, I don't want just 1 unique slope but want to check for breakpoints in my dataset. I have seen that the segmented package can do so. While it works good if the fit is carried out with lm or glm (as shown below in an example), it doesn't manage to work for quantile.
On the segmented package info I have read that there is a segmented.default which can be used for specific regression models, such as Quantiles. However, when I apply it for my quantile outcome it gives me the following errors:
Error in diag(vv) : invalid 'nrow' value (too large or NA)
In addition: Warning message:
cannot compute the covariance matrix
If instead of using K=2 I use for example psi I get other type of errors:
Error in rq.fit.br(x, y, tau = tau, ...) : Singular design matrix
I have created an example with the mtcars data so you can see the errors that I get.
library(quantreg)
library(segmented)
data(mtcars)
out.rq <- rq(mpg ~ wt, data= mtcars)
out.lm <- lm(mpg ~ wt, data= mtcars)
# Plotting the results
plot(mpg ~ wt, data = mtcars, pch = 1, main = "mpg ~ wt")
abline(out.lm, col = "red", lty = 2)
abline(out.rq, col = "blue", lty = 2)
legend("topright", legend = c("linear", "quantile"), col = c("red", "blue"), lty = 2)
#Generating segmented LM
o <- segmented(out.lm, seg.Z= ~wt, npsi=2, control=seg.control(display=FALSE))
plot(o, lwd=2, col=2:6, main="Segmented regression", res=FALSE) #lwd: line width #col: from 2 to 6 #RES: show datapoints
#Generating segmented Quantile
#using K=2
o.quantile <- segmented.default(out.rq, seg.Z= ~wt, control=seg.control(display=FALSE, K=2))
# using psi
o.quantile <- segmented.default(out.rq, seg.Z= ~wt, psi=list(wt=c(2,4)), control=seg.control(display=FALSE))
I came across this post after a long time because I have the same issue. Just in case others might be stuck with the problem in the future, I wanted to point out what the problem is.
I examined "segmented.default". There is a line in the source code as follows:
Cov <- try(vcov(objF), silent = TRUE)
vcov is used to calculate the covariance matrix but does not work for quantile regression object objF. To get the covariance matrix for quantile regression, you need:
summary(objF,se="boot",cov=TRUE)$cov
Here, I used bootstrap method to compute the covariance matrix by selecting se="boot" but you should choose the appropriate method for you. Check ?summary.rq then "se" section for different methods.
Additionally, you need to assign the row/column names as follows:
dimnames(Cov)[[1]] <- dimnames(Cov)[[2]] <- unlist(attributes(objF$coef))
After modifying the function, it worked for me.
Maybe the other answer isn't particularly clean, as you need to modify a package function.
Additionally, maybe boot isn't such a good idea for SEs, according to this answer.
To get it working a bit easier, add a function to your workspace:
vcov.rq <- function(object, ...) {
result = summary(object, se = "nid", covariance = TRUE)$cov
rownames(result) = colnames(result) = names(coef(object))
return(result)
}
Caveats from the Cross-Validated link apply.
The response variable for my dataset is comprised of observations Y[1], Y[2], ...., Y[49]. I came up with a Bayesian Hierarchical Model to make Bayesian predictions for Y[50]. I also have MCMC samples for Y[1],...,Y[49], which I can use to assess the overall fit of my Bayesian model by comparing them with the actual values of Y[1], Y[2], ...., Y[49].
Is there any way that I can draw the caterpillar plots of my Bayesian Predictions from the MCMC object of the Hierarchical Model along with the points that stands for actual observed Y's from my original dataset on R?
Thank you,
First you need to extract your confidence intervals for each $Y_i$ . (usually this is done with quantile function if you're not using a standard S3 object).
Then you create the following df:
df <- data_frame(
obs = seq(from = 1,
to = 49,
by = 1),
lower = q1,
upper = q2,
estimate = estimate,
actual = actual)
Then you go:
df %>% ggplot(aes(x = obs)) +
geom_line(aes(y = actual)) +
geom_pointrange(aes(ymin = lower, ymax = upper, y = estimate)) +
coord_flip()
If you're doing hierarchical models I really recommend using rstanarm package which is compatible with the tidybayes library (which produces automatic caterpillar plots).
I have a set of estimated parameters for an Ex-gaussian curve (i.e. mu, sigma, tau).
Currently I'm creating a visualization of that distribution by simulating data based on those parameters and plotting them in ggplot.
I would rather create a visualization that is effectively a smooth fitted ex-gaussian curve - i.e. an estimated curve for data that presents with the parameters I've estimated. The goal is to not have curves with the same parameters appear differently.
Here is the current simulation approach I'm utilizing:
library(retimes)
library(ggplot2)
g <- rexgauss(1000,mu=1,sigma = 1,tau =1)
g <- as.data.frame(g); colnames(g) <- "obs"
ggplot(g) + geom_density(aes(x = obs), size=1, alpha=.4)
You can use stat_function from ggplot2. It takes a function in fun, and parameters to pass to that function in args. It works well for situations like this where you want to compare a simulation to a calculated distribution, because the x values you supply to aes will be the ones automatically used in showing the function, without you having to do any work to match them up or calculate the range of x values in your simulation.
Here's an example with retimes::rexgauss. I also simplified your data frame creation, and put the parameters in a vector so you can use them in both the simulation and the calculated function.
My laptop is too slow to do all 1000 observations, so yours is probably smoother and closer to the calculated distribution than mine.
library(ggplot2)
exgauss_params <- c(mu = 1, sigma = 1, tau = 1)
exgauss_sim <- data.frame(obs = retimes::rexgauss(n = 100, exgauss_params))
ggplot(exgauss_sim, aes(x = obs)) +
geom_density(aes(color = "simulated")) +
stat_function(aes(color = "calculated"),
fun = retimes::dexgauss, args = exgauss_params)
Created on 2018-05-18 by the reprex package (v0.2.0).