Overlay two contours of bivariate gaussian distribution using ggplot2 - r

I want to overlay the contours of two bivariate Gaussian distributions on the same plot with ggplot2, using a different color for each. I looked at a previous post on plotting the contours of a bivariate Gaussian (Plot multivariate Gaussian contours with ggplot2), but it plots only one set of contours. I tried stat_density2d, without success. Here is my code with a reproducible example.
set.seed(13)
m1 <- c(.5, -.5)
sigma1 <- matrix(c(1,.5,.5,1), nrow=2)
m2 <- c(0, 0)
sigma2 <- matrix(c(140,67,67,42), nrow=2)
data.grid <- expand.grid(s.1 = seq(-25, 25, length.out=200), s.2 = seq(-25,
25, length.out=200))
q.samp <- cbind(data.grid, prob = mvtnorm::dmvnorm(data.grid, mean = m2,
sigma = sigma2))
ggplot(q.samp, aes(x=s.1, y=s.2, z=prob)) +
geom_contour() +
coord_fixed(xlim = c(-25, 25), ylim = c(-25, 25), ratio = 1)

If I follow your code and create q1.samp and q2.samp from your parameters:
q2.samp = cbind(data.grid, prob = mvtnorm::dmvnorm(data.grid, mean = m2, sigma=sigma2))
q1.samp = cbind(data.grid, prob = mvtnorm::dmvnorm(data.grid, mean = m1, sigma=sigma1))
then I can do this:
ggplot() +
geom_contour(data=q1.samp,aes(x=s.1,y=s.2,z=prob)) +
geom_contour(data=q2.samp,aes(x=s.1,y=s.2,z=prob),col="red")
then I get one set of contours in the default colour and one in red.
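The two densities here differ in peak height by a factor of roughly 40, so it can help to pick contour levels per distribution rather than rely on the defaults. A minimal sketch, assuming the parameters from the question: for a bivariate normal, the contour enclosing probability p sits at density peak * (1 - p), because the squared Mahalanobis distance follows a chi-squared distribution with 2 degrees of freedom. The helper names peak_density and hdr_levels are made up for this illustration; the levels are handed to geom_contour through its breaks argument.

```r
# Density of a bivariate normal at its mean: 1 / (2 * pi * sqrt(det(Sigma)))
peak_density <- function(sigma) 1 / (2 * pi * sqrt(det(sigma)))

# Density level of the contour that encloses probability p (bivariate normal only)
hdr_levels <- function(sigma, p = c(0.5, 0.8)) peak_density(sigma) * (1 - p)

m1 <- c(.5, -.5)
sigma1 <- matrix(c(1, .5, .5, 1), nrow = 2)
m2 <- c(0, 0)
sigma2 <- matrix(c(140, 67, 67, 42), nrow = 2)

lev1 <- hdr_levels(sigma1)
lev2 <- hdr_levels(sigma2)

# Plot only if the packages used in the question are installed
if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("mvtnorm", quietly = TRUE)) {
  library(ggplot2)
  data.grid <- expand.grid(s.1 = seq(-25, 25, length.out = 200),
                           s.2 = seq(-25, 25, length.out = 200))
  grid.mat <- as.matrix(data.grid)
  q1.samp <- cbind(data.grid, prob = mvtnorm::dmvnorm(grid.mat, mean = m1, sigma = sigma1))
  q2.samp <- cbind(data.grid, prob = mvtnorm::dmvnorm(grid.mat, mean = m2, sigma = sigma2))
  p <- ggplot(mapping = aes(x = s.1, y = s.2, z = prob)) +
    geom_contour(data = q1.samp, breaks = lev1, colour = "blue") +
    geom_contour(data = q2.samp, breaks = lev2, colour = "red") +
    coord_fixed(xlim = c(-25, 25), ylim = c(-25, 25), ratio = 1)
  print(p)
}
```

Each layer then shows the 50% and 80% regions of its own distribution, whatever the difference in scale.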

Another option would be to combine the data into one data.frame and map color to the "origin" of where the data came from. This gives you a handy legend, should you need one, and all its benefits (like mapping color).
q1.samp = cbind(data.grid, prob = mvtnorm::dmvnorm(data.grid, mean = m1, sigma=sigma1))
q1.samp$origin <- "q1"
q2.samp = cbind(data.grid, prob = mvtnorm::dmvnorm(data.grid, mean = m2, sigma=sigma2))
q2.samp$origin <- "q2"
q <- rbind(q1.samp, q2.samp)
ggplot(q, aes(x=s.1, y=s.2, z=prob, color = origin)) +
geom_contour() +
coord_fixed(xlim = c(-25, 25), ylim = c(-25, 25), ratio = 1)

Related

How to make beautiful ROC curves for two models in the same plot?

I've trained two xgboost models, say model1 and model2. I have the AUC scores for each model and I want them to appear in the plot. I want to make beautiful ROC curves for both models in the same plot. Something like this:
How can I do that?
I usually use the library pROC, and I know I need to extract the scores, and the truth from each model, right?
so something like this maybe:
roc1 = roc(model1$truth, model1$scores)
roc2 = roc(model2$truth, model2$scores)
I also need the fpr and tpr for each model:
D1 = data.frame(fpr = 1 - roc1$specificities, tpr = roc1$sensitivities)
D2 = data.frame(fpr = 1 - roc2$specificities, tpr = roc2$sensitivities)
Then I can maybe add arrows to point out which curve is which:
arrows = tibble(x1 = c(0.5, 0.13) , x2 = c(0.32, 0.2), y1 = c(0.52, 0.83), y2 = c(0.7,0.7) )
And finally ggplot: (this part is missing)
ggplot(data = D1, aes(x = fpr, y = tpr)) +
geom_smooth(se = FALSE) +
geom_smooth(data = D2, color = 'red', se = FALSE) +
annotate("text", x = 0.5, y = 0.475, label = 'score of model 1') +
annotate("text", x = 0.13, y = 0.9, label = 'scores of model 2') +
So I need help with two things:
How do I get the right information out from the models, to make ROC curves? How do I get the truth and the prediction scores? The truth are just the labels of the target feature in the training set maybe?
How do I continue the code? and is my code right so far?
You can get the sensitivity and specificity in a data frame using coords from pROC. Just rbind the results for the two models after first attaching a column labelling each set as model 1 or model 2. To get the smooth-looking ROC with automatic labels you can use geom_textsmooth from the geomtextpath package:
library(pROC)
library(geomtextpath)
roc1 <- roc(model1$truth, model1$scores)
roc2 <- roc(model2$truth, model2$scores)
df <- rbind(cbind(model = "Model 1", coords(roc1)),
cbind(model = "Model 2", coords(roc2)))
ggplot(df, aes(1 - specificity, sensitivity, color = model)) +
geom_textsmooth(aes(label = model), size = 7, se = FALSE, span = 0.2,
textcolour = "black", vjust = 1.5, linewidth = 1,
text_smoothing = 50) +
geom_abline() +
scale_color_brewer(palette = "Set1", guide = "none", direction = -1) +
scale_x_continuous("False Positive Rate", labels = scales::percent) +
scale_y_continuous("True Positive Rate", labels = scales::percent) +
coord_equal(expand = FALSE) +
theme_classic(base_size = 20) +
theme(plot.margin = margin(10, 30, 10, 10))
Data used
set.seed(2023)
model1 <- model2 <- data.frame(scores = rep(1:100, 50))
p1 <- model2$scores + rnorm(5000, 0, 20)
p2 <- model1$scores/100
model1$truth <- rbinom(5000, 1, (p1 - min(p1))/diff(range(p1)))
model2$truth <- rbinom(5000, 1, p2)
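The question also asks for the AUC of each model to appear in the plot. A minimal sketch of one way to build such labels, using a rank-based (Mann-Whitney) AUC written in base R rather than pROC. It assumes truth is coded 0/1 and that higher scores indicate class 1; auc_rank is a name introduced here for illustration.

```r
# Rank-based AUC: probability that a random positive outranks a random negative
auc_rank <- function(truth, scores) {
  r  <- rank(scores)                       # average ranks handle ties
  n1 <- as.numeric(sum(truth == 1))
  n0 <- as.numeric(sum(truth == 0))
  (sum(r[truth == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Same simulated data as in the answer above
set.seed(2023)
model1 <- model2 <- data.frame(scores = rep(1:100, 50))
p1 <- model2$scores + rnorm(5000, 0, 20)
p2 <- model1$scores / 100
model1$truth <- rbinom(5000, 1, (p1 - min(p1)) / diff(range(p1)))
model2$truth <- rbinom(5000, 1, p2)

auc1 <- auc_rank(model1$truth, model1$scores)
auc2 <- auc_rank(model2$truth, model2$scores)

# Label strings that include the AUC, rounded to two decimals
labels <- c(sprintf("Model 1 (AUC = %.2f)", auc1),
            sprintf("Model 2 (AUC = %.2f)", auc2))
print(labels)
```

These strings could be used in place of "Model 1" and "Model 2" in the model column, so that the curve labels carry the AUC. With pROC loaded, auc(roc1) should give a matching value when higher scores indicate the positive class.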

ggplot plot 2d probability density function on top of points on ggplot

I have the following example:
require(mvtnorm)
require(ggplot2)
set.seed(1234)
xx <- data.frame(rmvt(100, df = c(13, 13)))
ggplot(data = xx, aes(x = X1, y= X2)) + geom_point() + geom_density2d()
Here is what I get:
However, I would like to get the density contour from the multivariate t density given by the dmvt function. How do I tweak geom_density2d to do that?
This is not an easy question to answer, because the contours need to be calculated and the ellipses drawn using the ellipse package.
I use elliptical t densities here to illustrate the plotting better.
nu <- 5 ## this is the degrees of freedom of the multivariate t.
library(mvtnorm)
library(ggplot2)
sig <- matrix(c(1, 0.5, 0.5, 1), ncol = 2) ## this is the sigma parameter for the multivariate t
xx <- data.frame(rmvt(n = 100, df = nu, sigma = sig)) ## generating the original sample (df is a single value)
## Generating the sample for the ellipse quantiles. This is a cumbersome
## calculation because it is the sum of two independent t-squared random
## variables with the same degrees of freedom, so I am using simulation
## to get the quantiles. This is the sample from which I will create them.
rtsq <- rowSums(x = matrix(rt(n = 2e6, df = nu)^2, ncol = 2))
g <- ggplot( data = xx
, aes( x = X1
, y = X2
)
) + geom_point(colour = "red", size = 2) ## initial setup
library(ellipse)
for (i in seq(from = 0.01, to = 0.99, length.out = 20)) {
el.df <- data.frame(ellipse(x = sig, t = sqrt(quantile(rtsq, probs = i)))) ## create the data for the given quantile of the ellipse.
names(el.df) <- c("x", "y")
g <- g + geom_polygon(data=el.df, aes(x=x, y=y), fill = NA, linetype=1, colour = "blue") ## plot the ellipse
}
g + theme_bw()
This yields:
I still have a question: how does one reduce the thickness of the plotted ellipse lines?
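On that follow-up: in ggplot2 3.4.0 and later, line thickness is set with the linewidth argument (earlier versions used size for lines). A minimal sketch, assuming a ggplot2 version that supports linewidth. It draws a single ellipse with a hand-rolled helper (ellipse_points is hypothetical and not part of the ellipse package) so that it does not depend on the simulated quantiles above.

```r
# Points on the ellipse {x : x' solve(sigma) x = radius^2}, centred at the origin
ellipse_points <- function(sigma, radius, n = 100) {
  theta  <- seq(0, 2 * pi, length.out = n)
  circle <- cbind(cos(theta), sin(theta))
  pts    <- radius * circle %*% chol(sigma)   # chol() returns R with t(R) %*% R = sigma
  data.frame(x = pts[, 1], y = pts[, 2])
}

sig   <- matrix(c(1, 0.5, 0.5, 1), ncol = 2)
el.df <- ellipse_points(sig, radius = 2)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  g <- ggplot(el.df, aes(x = x, y = y)) +
    # linewidth controls the stroke; smaller values give thinner lines
    geom_polygon(fill = NA, colour = "blue", linewidth = 0.3) +
    coord_fixed() +
    theme_bw()
  print(g)
}
```

The same linewidth argument can be added to the geom_polygon call inside the loop above.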

dot plot different indicators, depending on the value, in R

I am visualising odds ratios.
You can find fake data and a plot below
Data <- data.frame(
odds = sample(0:9),
pvalue = c(0.1,0.04,0.02,0.03,0.2,0.5,0.03,
0.12,0.12,0.014),
Y = sample(c("a", "b"), 5, replace = TRUE),
letters = letters[1:10]
)
library(lattice)
dotplot(letters ~ odds| Y, data =Data,
aspect=0.5, layout = c(1,2), ylab=NULL)
I would like to show solid circles for p-values greater than 0.05, and empty circles if values are less than 0.05.
We could specify the pch with values 1/20 for empty/solid circles based on the 'pvalue' column.
dotplot(letters ~ odds| Y, data=Data, aspect= 0.5, layout= c(1,2),
ylab=NULL, pch= ifelse(Data$pvalue > 0.05, 20, 1))
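One thing to check with the conditioned plot: a pch vector passed to dotplot is handed to every panel as is, so it may not line up with the rows that each panel actually shows. A common lattice idiom is a panel function that indexes the vector with subscripts. A sketch under that assumption; pch_by_row is a name introduced here, and Y is fixed with rep() instead of sample() only to keep the example deterministic.

```r
library(lattice)

set.seed(1)
Data <- data.frame(
  odds    = sample(0:9),
  pvalue  = c(0.1, 0.04, 0.02, 0.03, 0.2, 0.5, 0.03, 0.12, 0.12, 0.014),
  Y       = factor(rep(c("a", "b"), 5)),
  letters = factor(letters[1:10])
)

# One plotting symbol per row of Data: 20 = solid (p > 0.05), 1 = empty
pch_by_row <- ifelse(Data$pvalue > 0.05, 20, 1)

p <- dotplot(letters ~ odds | Y, data = Data,
             aspect = 0.5, layout = c(1, 2), ylab = NULL,
             panel = function(x, y, subscripts, ...) {
               # subscripts holds the row indices of Data shown in this panel
               panel.dotplot(x, y, pch = pch_by_row[subscripts], ...)
             })
print(p)
```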
The groups argument together with pch should also do the job:
dotplot(letters ~ odds| Y, data =Data,
aspect=0.5, layout = c(1,2), ylab=NULL,
groups = pvalue <= 0.05,
pch = c(19, 21))
This is easy to create with ggplot2:
library(ggplot2)
Data$significant <- Data$pvalue > 0.05
ggplot(Data, aes(x=odds, y=letters, shape=significant)) +
geom_point(size=4) +
scale_x_continuous(breaks = seq(from=0, to= 8, by=2)) +
scale_shape_manual(values=c(1, 16)) +
ylab("") +
facet_wrap(~ Y, ncol = 1, nrow = 2) +
theme_bw()

R: How to add the noise cluster into DBSCAN plot

I'm trying to plot DBSCAN results. This is what I have done so far. My distance matrix is here.
dbs55_CR_EUCL = dbscan(writeCRToMatrix,eps=0.006, MinPts = 4, method = "dist")
plot(writeCRToMatrix[dbs55_CR_EUCL$cluster>0,],
col=dbs55_CR_EUCL$cluster[dbs55_CR_EUCL$cluster>0],
main="DBSCAN Clustering K = 4 \n (EPS=0.006, MinPts=4) without noise",
pch = 20)
This is the plot:
When I tried plotting all the clusters including the noise cluster I could only see 2 points in my plot.
What I'm looking for are
To add the points in the noise cluster to the plot but with a different symbol. Something similar to the following picture
Shade the cluster areas like in the following picture
Noise points have a cluster id of 0. R plots usually ignore a color of 0, so if you want to show the noise points (in black) you need to shift the cluster ids by one:
plot(writeCRToMatrix,
col=dbs55_CR_EUCL$cluster+1L,
main="DBSCAN Clustering K = 4 \n (EPS=0.006, MinPts=4) with noise",
pch = 20)
If you want a different symbol for noise then you could do the following (adapted from the man page):
library(dbscan)
n <- 100
x <- cbind(
x = runif(10, 0, 10) + rnorm(n, sd = 0.2),
y = runif(10, 0, 10) + rnorm(n, sd = 0.2)
)
res <- dbscan::dbscan(x, eps = .2, minPts = 4)
plot(x, col=res$cluster, pch = 20)
points(x[res$cluster == 0L, , drop = FALSE], col = "grey", pch = "+")
Here is code that will create a shaded convex hull for each cluster
library(ggplot2)
library(data.table)
library(dbscan)
dt <- data.table(x, level=as.factor(res$cluster), key = "level")
hulls <- dt[, .SD[chull(x, y)], by = level]
### get rid of hull for noise
hulls <- hulls[level != "0",]
cols <- c("0" = "grey", "1" = "red", "2" = "blue")
ggplot(dt, aes(x=x, y=y, color=level)) +
geom_point() +
geom_polygon(data = hulls, aes(fill = level, group = level),
alpha = 0.2, color = NA) +
scale_color_manual(values = cols) +
scale_fill_manual(values = cols)
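If you would rather stay in base graphics, the same idea can be sketched with chull() and polygon(). This is only an illustration: the points and cluster labels below are a made-up stand-in for DBSCAN output (0 marks noise), and cluster_hull is a hypothetical helper.

```r
# Rows of pts that form the convex hull, in drawing order
cluster_hull <- function(pts) pts[chull(pts), , drop = FALSE]

set.seed(1)
# Stand-in data: two blobs plus scattered noise points
x <- rbind(cbind(rnorm(30, 2, 0.3), rnorm(30, 2, 0.3)),
           cbind(rnorm(30, 6, 0.3), rnorm(30, 6, 0.3)),
           cbind(runif(8, 0, 8), runif(8, 0, 8)))
cluster <- rep(c(1L, 2L, 0L), times = c(30, 30, 8))

plot(x, type = "n", xlab = "x", ylab = "y")
for (k in setdiff(unique(cluster), 0L)) {
  hull <- cluster_hull(x[cluster == k, , drop = FALSE])
  # semi-transparent fill in the same colour as the cluster's points
  polygon(hull, col = adjustcolor(k + 1L, alpha.f = 0.2), border = NA)
}
points(x[cluster > 0, , drop = FALSE], col = cluster[cluster > 0] + 1L, pch = 20)
points(x[cluster == 0, , drop = FALSE], col = "grey", pch = 3)  # noise as "+"
```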
Hope this helps.

R Language - Sorting data into ranges; averaging; ignore outliers

I am analyzing data from a wind turbine. Normally this is the sort of thing I would do in Excel, but the quantity of data requires something heavy-duty. I have never used R before, so I am just looking for some pointers.
The data consists of two columns, WindSpeed and Power. So far I have imported the data from a CSV file and scatter-plotted the two against each other.
What I would like to do next is sort the data into ranges (for example, all data where WindSpeed is between x and y), then find the average power generated for each range and graph the resulting curve.
From this average I want to recalculate the average using only the data that falls within one or two standard deviations of it (basically ignoring outliers).
Any pointers are appreciated.
For those who are interested, I am trying to create a graph similar to this. It's a pretty standard type of graph, but like I said, the sheer quantity of data requires something heavier than Excel.
Since you're no longer in Excel, why not use a modern statistical methodology that doesn't require crude binning of the data and ad hoc methods to remove outliers: locally smooth regression, as implemented by loess.
Using a slight modification of csgillespie's sample data:
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
plot(w_sp, power)
x_grid <- seq(0, 100, length = 100)
lines(x_grid, predict(loess(power ~ w_sp), x_grid), col = "red", lwd = 3)
To throw another version into the mix, similar in motivation to @hadley's, here is an additive model with an adaptive smoother, using package mgcv:
Dummy data first, as used by @hadley
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
df <- data.frame(power = power, w_sp = w_sp)
Fit the additive model using gam(), using an adaptive smoother and smoothness selection via REML
require(mgcv)
mod <- gam(power ~ s(w_sp, bs = "ad", k = 20), data = df, method = "REML")
summary(mod)
Predict from our model and get standard errors of fit, use latter to generate an approximate 95% confidence interval
x_grid <- with(df, data.frame(w_sp = seq(min(w_sp), max(w_sp), length = 100)))
pred <- predict(mod, x_grid, se.fit = TRUE)
x_grid <- within(x_grid, fit <- pred$fit)
x_grid <- within(x_grid, upr <- fit + 2 * pred$se.fit)
x_grid <- within(x_grid, lwr <- fit - 2 * pred$se.fit)
Plot everything and the Loess fit for comparison
plot(power ~ w_sp, data = df, col = "grey")
lines(fit ~ w_sp, data = x_grid, col = "red", lwd = 3)
## upper and lower confidence intervals ~95%
lines(upr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
lines(lwr ~ w_sp, data = x_grid, col = "red", lwd = 2, lty = "dashed")
## add loess fit from @hadley's answer
lines(x_grid$w_sp, predict(loess(power ~ w_sp, data = df), x_grid), col = "blue",
lwd = 3)
First we will create some example data to make the problem concrete:
w_sp = sample(seq(0, 100, 0.01), 1000)
power = 1/(1+exp(-(rnorm(1000, mean=w_sp, sd=5) -40)/5))
Suppose we want to bin the power values between [0,5), [5,10), etc. Then
bin_incr = 5
bins = seq(0, 95, bin_incr)
y_mean = sapply(bins, function(x) mean(power[w_sp >= x & w_sp < (x+bin_incr)]))
We have now created the mean values between the ranges of interest. Note, if you wanted the median values, just change mean to median. All that's left to do, is to plot them:
plot(w_sp, power)
points(seq(2.5, 97.5, 5), y_mean, col=3, pch=16)
To get the average based on data that falls within two standard deviations of the average, we need to create a slightly more complicated function:
noOutliers = function(x, power, w_sp, bin_incr) {
d = power[w_sp >= x & w_sp < (x + bin_incr)]
m_d = mean(d)
d_trim = d[(d > m_d - 2*sd(d)) & (d < m_d + 2*sd(d))]
return(mean(d_trim))
}
y_no_outliers = sapply(bins, noOutliers, power, w_sp, bin_incr)
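The same binning can also be written with cut() and tapply(), which avoids looping over bin edges by hand. A minimal sketch, assuming 5-unit bins and simulated data like that used in the other answers; trimmed_mean_2sd is a name made up here.

```r
set.seed(42)
w_sp  <- sample(seq(0, 100, 0.01), 1000)
power <- 1 / (1 + exp(-(w_sp - 40) / 5)) + rnorm(1000, sd = 0.1)

# Assign each observation to a bin [0,5), [5,10), ..., [95,100]
bin <- cut(w_sp, breaks = seq(0, 100, by = 5), right = FALSE, include.lowest = TRUE)

# Mean of the values lying within two standard deviations of the bin mean
trimmed_mean_2sd <- function(d) {
  keep <- abs(d - mean(d)) < 2 * sd(d)
  mean(d[keep])
}

bin_mean <- tapply(power, bin, mean)              # plain average per bin
bin_trim <- tapply(power, bin, trimmed_mean_2sd)  # average ignoring outliers

plot(w_sp, power, col = "grey")
mids <- seq(2.5, 97.5, by = 5)                    # bin midpoints
lines(mids, bin_mean, col = "red", lwd = 2)
lines(mids, bin_trim, col = "blue", lwd = 2, lty = "dashed")
```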
Here are some examples of fitted curves (weibull analysis) for commercial turbines:
http://www.inl.gov/wind/software/
http://www.irec.cmerp.net/papers/WOE/Paper%20ID%20161.pdf
http://www.icaen.uiowa.edu/~ie_155/Lecture/Power_Curve.pdf
I'd recommend also playing around with Hadley's own ggplot2. His website is a great resource: http://had.co.nz/ggplot2/ .
# If you haven't already installed ggplot2:
install.packages("ggplot2", dependencies = TRUE)
# Load the ggplot2 package
require(ggplot2)
# csgillespie's example data
w_sp <- sample(seq(0, 100, 0.01), 1000)
power <- 1/(1+exp(-(w_sp -40)/5)) + rnorm(1000, sd = 0.1)
# Bind the two variables into a data frame, which ggplot prefers
wind <- data.frame(w_sp = w_sp, power = power)
# Take a look at how the first few rows look, just for fun
head(wind)
# Create a simple plot
ggplot(data = wind, aes(x = w_sp, y = power)) + geom_point() + geom_smooth()
# Create a slightly more complicated plot as an example of how to fine tune
# plots in ggplot
p1 <- ggplot(data = wind, aes(x = w_sp, y = power))
p2 <- p1 + geom_point(colour = "darkblue", size = 1, shape = 16)
p3 <- p2 + geom_smooth(method = "loess", se = TRUE, colour = "purple")
p3 + scale_x_continuous(name = "mph") +
scale_y_continuous(name = "power") +
ggtitle("Wind speed and power")
