How to produce overlapping QQ plots in R?

I would like to make an overlapping QQ-plot from the GWAS results similar to the attached figure. I have run two GWAS analyses and want to generate a figure where the QQ-plot from both GWAS are overlaid on one another. I am using the R-package "qqman" for that.
Can someone please tell me how to do that in R?
Thank you.
[Sample figure: two overlaid GWAS QQ plots]

You can use the R package CMplot for producing overlapping QQ plots in R.
library(CMplot)
data(pig60K)
# randomly set 80% of trait1 and 25% of trait2 to NA so the two traits have distinct p-value sets
pig60K$trait1[sample(1:nrow(pig60K), round(nrow(pig60K) * 0.80))] <- NA
pig60K$trait2[sample(1:nrow(pig60K), round(nrow(pig60K) * 0.25))] <- NA
CMplot(pig60K, plot.type = "q",
       col = c("dodgerblue1", "olivedrab3", "darkgoldenrod1"),
       threshold = 1e-6, ylab.pos = 2,
       signal.pch = c(19, 6, 4), signal.cex = 1.2, signal.col = "red",
       conf.int = TRUE, box = FALSE, multracks = TRUE, cex.axis = 2,
       file = "jpg", memo = "", dpi = 300, file.output = TRUE,
       verbose = TRUE, ylim = c(0, 8), width = 5, height = 5)
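Since the question specifically mentions qqman, here is a minimal sketch (my addition, with placeholder p-value vectors pvals1 and pvals2 standing in for your two GWAS result columns) of overlaying two QQ plots: draw the first with qqman::qq() and add the second set of points manually on the same axes.
library(qqman)
pvals1 <- runif(10000)                    # placeholder: p-values from the first GWAS
pvals2 <- runif(10000)                    # placeholder: p-values from the second GWAS
qq(pvals1, col = "dodgerblue")            # first QQ plot (expected vs observed -log10 p)
points(-log10(ppoints(length(pvals2))),   # expected quantiles for the second GWAS
       -log10(sort(pvals2)),              # observed quantiles for the second GWAS
       col = "darkorange", pch = 20)
legend("topleft", legend = c("GWAS 1", "GWAS 2"),
       col = c("dodgerblue", "darkorange"), pch = 20, bty = "n")
# if the second set has more extreme p-values, widen the axes in the qq() call, e.g. qq(pvals1, ylim = c(0, 10))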

This works:
library(dplyr)    # provides the %>% pipe
library(ggplot2)
(abd = data.frame(a = runif(100, 0, 3),
                  b = runif(100, 346, 455),
                  d = runif(100, 3952, 4903)) %>%
    ggplot() +
    geom_qq(aes(sample = a, color = "a")) +
    geom_qq(aes(sample = b, color = "b")) +
    geom_qq(aes(sample = d, color = "d")) +
    theme_minimal())
I'm not sure what your objective is exactly, but more than likely the data will overlap and hide important information. Perhaps what you really need is a bivariate or multivariate version of a QQ plot? A chi-square QQ plot based on the Mahalanobis distance might be a better bet, if that's the case.
MVN::mvn() will produce that plot when called with the arguments shown below:
abd = data.frame(a = runif(100, 0, 3),
b = runif(100, 346, 455),
d = runif(100, 3952, 4903))
MVN::mvn(abd, multivariatePlot = "qq", multivariateOutlierMethod = "quan")
You can also build the chi-square QQ plot manually:
(cvAbd = cov(abd))              # covariance matrix
(dif = scale(abd, scale = F))   # center each column (no scaling)
# solve() in the next call computes the inverse of the covariance matrix
(d = diag(dif %*%
            solve(cvAbd) %*%    # matrix multiplication
            t(dif)))            # squared Mahalanobis distances
(r = rank(d))                   # rank the distances
n = dim(abd)[1]                 # number of observations
p = dim(abd)[2]                 # number of variables
(ch <- qchisq((r - 0.5)/n, p))  # corresponding chi-square quantiles
ggplot(data = data.frame(d = d, ch = ch),
aes(d, ch)) +
geom_point() +
ggtitle("Chi-Square Q-Q Plot") +
xlab("Squared Mahalanobis Distance") +
ylab("Chi-Square Quantile") +
theme_minimal()
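As a quick cross-check (my addition, not part of the answer above), stats::mahalanobis() computes the same squared distances in a single call:
d2 <- mahalanobis(abd, center = colMeans(abd), cov = cov(abd))
all.equal(unname(d), unname(d2))  # should be TRUE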
If you were looking for something different, try this. You can replace runif(10000, .5, 1) with your own data:
o and e come from the first vector (the same data in both lines);
o2 and e2 come from a second vector or a second model (again, the same data in both lines).
This uses plotly:
# first model
o = -log10(sort(runif(10000,.5,1), decreasing = FALSE))
e = -log10(ppoints(length(runif(10000,.5,1))))
# second model
o2 = -log10(sort(runif(10000,0,1), decreasing = FALSE))
e2 = -log10(ppoints(length(runif(10000,0,1))))
library(plotly)   # needed for add_trace(), layout() and the %>% pipe
plotly::plot_ly(x = ~e, y = ~o, name = "p1",
                type = "scatter", mode = "markers") %>%
  add_trace(x = ~e2, y = ~o2, name = "p2") %>%
  add_trace(x = c(0, max(e, e2)), y = c(0, max(e, e2)),
            mode = "lines", name = "y = x (null)") %>%
  layout(xaxis = list(title = 'Expected -log[10](<i>p</i>)'),
         yaxis = list(title = 'Observed -log[10](<i>p</i>)'))
Here is the ggplot2 version you requested:
library(ggplot2)
ggplot(data.frame(o = o, e = e, o2 = o2, e2 = e2),
       aes(e, o, color = "First Model")) +
  geom_point() +
  geom_point(aes(e2, o2, color = "Second Model")) +  # expected on x, observed on y
  geom_abline(intercept = 0, slope = 1, color = "darkred") +
  scale_color_discrete("") +
  xlab(expression(paste('Expected -log[10](', italic('p'), ')'))) +
  ylab(expression(paste('Observed -log[10](', italic('p'), ')'))) +
  theme_minimal()

Related

Monte Carlo Sim in R plots STRAIGHTS

I am getting started with Monte Carlo simulations and used this basic code to simulate returns for a given portfolio. Somehow a portion of the simulated returns always shows up as straight lines, which are easy to see on the plotted graph. I decreased the number of simulations so you can see them more clearly, and I played around with some other factors, but they keep appearing. The rest of the output looks promising and "random".
I've added a link to the image (my account is new) as well as the code below; I'd appreciate any help:
library(quantmod)
library(ggplot2)
maxDate<- "2000-01-01"
tickers<-c("MSFT", "AAPL", "BRK-B")
getSymbols(tickers, from=maxDate)
Port.p<-na.omit(merge(Cl(AAPL),Cl(MSFT),Cl(`BRK-B`)))
Port.r<-ROC(Port.p, type = "discrete")[-1,]
stock_Price<- as.matrix(Port.p[,1:3])
stock_Returns <- as.matrix(Port.r[,1:3])
mc_rep = 50 # Number of Sims
training_days = 200
portfolio_Weights = c(0.5,0.3,0.2)
coVarMat = cov(stock_Returns)
miu = colMeans(stock_Returns)
Miu = matrix(rep(miu, training_days), nrow = 3)
portfolio_Returns_m = matrix(0, training_days, mc_rep)
set.seed(2000)
for (i in 1:mc_rep) {
Z = matrix ( rnorm( dim(stock_Returns)[2] * training_days ), ncol = training_days )
L = t( chol(coVarMat) )
daily_Returns = Miu + L %*% Z
portfolio_Returns_200 = cumprod( portfolio_Weights %*% daily_Returns + 1 )
portfolio_Returns_m[,i] = portfolio_Returns_200;
}
x_axis = rep(1:training_days, mc_rep)
y_axis = as.vector(portfolio_Returns_m-1)
plot_data = data.frame(x_axis, y_axis)
ggplot(data = plot_data, aes(x = x_axis, y = y_axis)) + geom_path(col = 'red', size = 0.1) +
xlab('Days') + ylab('Portfolio Returns') +
ggtitle('Simulated Portfolio Returns in 200 days')+
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
The lines are the 'return' from the end of each series to the beginning of the next. You can keep the lines separate by adding a grouping variable to your plotting data and using the group aesthetic to tell ggplot about it:
g <- rep(1:mc_rep, each = training_days)  # one group id per simulated path, matching the column-major layout of portfolio_Returns_m
plot_data = data.frame(x_axis, y_axis, g)
ggplot(data = plot_data, aes(x = x_axis, y = y_axis, group = g)) + ...
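Putting it together, a minimal sketch of the full corrected plotting call (reusing x_axis, y_axis and the g defined just above):
plot_data <- data.frame(x_axis, y_axis, g)
ggplot(plot_data, aes(x = x_axis, y = y_axis, group = g)) +
  geom_path(col = 'red', size = 0.1) +   # one separate path per simulation, no joining lines
  xlab('Days') + ylab('Portfolio Returns') +
  ggtitle('Simulated Portfolio Returns in 200 days') +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5))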

How to fit custom curves for each level of facet_wrap in ggplot2 R?

I have a scatterplot with x = xaxis and y = yaxis, and a two-level factor by which I need to split the data and analyse each subset separately. For each subset, I need to plot a custom function of the form y ~ (A*((B*x/C) - 1) - 1)*log(x). For each level, I need to specify A, B, and C. How do I do this in ggplot2?
Here is sample ggplot2 code. I have generated random data, so the smooth function may not make sense for these data (but it does for my original data):
xaxis = c(1:100)
yaxis = rnorm(100, 2, 0.1)
level = rep(c("A", "B"), 50)
df <- data.frame(xaxis, yaxis, level)
ggplot(df, aes(x = xaxis, y = yaxis, col = level)) +
  geom_point() +
  facet_wrap(. ~ level, scales = "free") +
  geom_smooth(method = "lm", formula = y ~ (A*((B*x/C) - 1) - 1)*log(x))
If you are trying to fit a model to find the values of A, B and C using your formula, you can't, for the simple reason that B and C only appear as a ratio in the formula. If the true value of B were 4 and the true value of C were 2, this would be indistinguishable from, say, a B of 12 and a C of 6: only the ratio of the two can be estimated by regression. A quick numerical check of this is shown below.
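To see the identifiability problem concretely, here is a quick check (a small sketch of my own, using a direct translation of the formula): two different (B, C) pairs with the same ratio produce identical curves.
f <- function(x, A, B, C) (A*((B*x/C) - 1) - 1)*log(x)
x <- 1:10
all.equal(f(x, A = 0.05, B = 4, C = 2),
          f(x, A = 0.05, B = 12, C = 6))  # TRUE: only the ratio B/C matters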
If you just want to plot the lines specified by your formula, where you pre-specify A, B and C, then that is possible. For example:
my_func <- function(x, A, B, C) (A*((B*x/C)-1) - 1)*log(x)
ggplot(df,aes(x= xaxis, y = yaxis, col = level)) +
geom_point() +
facet_wrap(.~level, scales = "free")+
geom_function(data = data.frame(xaxis = 0, yaxis = 0, level = "A"),
fun = my_func,
args = list(A = 0.05, B = 2, C = 5)) +
geom_function(data = data.frame(xaxis = 0, yaxis = 0, level = "B"),
fun = my_func,
args = list(A = 0.1, B = -0.0005, C = 1))

Underscore plot in R

Introduction and Current Work Done
[Note: For those interested, I have provided code at the end for reproducing my example.]
I have some data and I have conducted an ANOVA analysis and obtained Tukey's pairwise comparisons:
model1 = aov(trt ~ grp, data = df)
anova(model1)
> TukeyHSD(model1)
diff lwr upr p adj
B-A 0.03481504 -0.40533118 0.4749613 0.9968007
C-A 0.36140489 -0.07874134 0.8015511 0.1448379
D-A 1.53825179 1.09810556 1.9783980 0.0000000
C-B 0.32658985 -0.11355638 0.7667361 0.2166301
D-B 1.50343674 1.06329052 1.9435830 0.0000000
D-C 1.17684690 0.73670067 1.6169931 0.0000000
I can also plot Tukey's pairwise comparisons
> plot(TukeyHSD(model1))
We can see from Tukey's confidence intervals and the plot that A-B, B-C and A-C are not significantly different.
Problem
I have been asked to create something called an "underscore plot" which is described as follows:
We plot the group means on the real line and we draw a line segment between group means to indicate that there is no significant difference between those two particular groups.
Obtaining the means is not difficult:
> aggregate(df$trt ~ df$grp, FUN = mean)
df$grp df$trt
1 A 2.032086
2 B 2.066901
3 C 2.393491
4 D 3.570338
Desired Output
Using the data in this example, the desired plot should appear like the one below:
There is a line segment between the groups that are not significantly different (i.e. a line segment between A-B, B-C and A-C as indicated by Tukey's).
Note: the plot above is not to scale; it was created in Keynote for illustrative purposes only.
Is there a way to get the "underscore plot" described above using R (using either base R or a library such as ggplot2)?
Edit
Here is the code that I used to create the example above:
library(data.table)
set.seed(3)
A = runif(20, 1,3)
A = data.frame(A, rep("A", length(A)))
B = runif(20, 1.25,3.25)
B = data.frame(B, rep("B", length(B)))
C = runif(20, 1.5,3.5)
C = data.frame(C, rep("C", length(C)))
D = runif(20, 2.75,4.25)
D = data.frame(D, rep("D", length(D)))
df = list(A, B, C, D)
df = rbindlist(df)
colnames(df) = c("trt", "grp")
Here's a ggplot version of the underscore plot. We'll load the tidyverse package, which loads ggplot2, dplyr and a few other packages from the tidyverse. We create a data frame of coefficients to plot the group names, coefficient values, and vertical segments and a data frame of non-significant pairs for generating the horizontal underscores.
library(tidyverse)
model1 = aov(trt ~ grp, data=df)
# Get coefficients and label coefficients with names of levels
coefs = coef(model1)
coefs[2:4] = coefs[2:4] + coefs[1]
names(coefs) = levels(model1$model$grp)
# Get non-significant pairs
pairs = TukeyHSD(model1)$grp %>%
as.data.frame() %>%
rownames_to_column(var="pair") %>%
# Keep only non-significant pairs
filter(`p adj` > 0.05) %>%
# Add coefficients to TukeyHSD results
separate(pair, c("pair1","pair2"), sep="-", remove=FALSE) %>%
mutate(start = coefs[match(pair1, names(coefs))],
end = coefs[match(pair2, names(coefs))]) %>%
# Stagger vertical positions of segments
mutate(ypos = seq(-0.03, -0.04, length=3))
# Turn coefs into a data frame
coefs = enframe(coefs, name="grp", value="coef")
ggplot(coefs, aes(x=coef)) +
geom_hline(yintercept=0) +
geom_segment(aes(x=coef, xend=coef), y=0.008, yend=-0.008, colour="blue") +
geom_text(aes(label=grp, y=0.011), size=4, vjust=0) +
geom_text(aes(label=sprintf("%1.2f", coef)), y=-0.01, size=3, angle=-90, hjust=0) +
geom_segment(data=pairs, aes(group=pair, x=start, xend=end, y=ypos, yend=ypos),
colour="red", size=1) +
scale_y_continuous(limits=c(-0.05,0.04)) +
theme_void()
Base R
d1 = data.frame(TukeyHSD(model1)[[1]])
inds = which(sign(d1$lwr) * sign(d1$upr) <= 0)  # confidence intervals that cross zero, i.e. non-significant pairs
non_sig = lapply(strsplit(row.names(d1)[inds], "-"), sort)
d2 = aggregate(df$trt ~ df$grp, FUN = mean)
graphics.off()
windows(width = 400, height = 200)  # Windows-only device; use dev.new() on other platforms
par("mai" = c(0.2, 0.2, 0.2, 0.2))
plot(d2$`df$trt`, rep(1, NROW(d2)),
xlim = c(min(d2$`df$trt`) - 0.1, max(d2$`df$trt`) + 0.1), lwd = 2,
type = "l",
ann = FALSE, axes = FALSE)
segments(x0 = d2$`df$trt`,
y0 = rep(0.9, NROW(d2)),
x1 = d2$`df$trt`,
y1 = rep(1.1, NROW(d2)),
lwd = 2)
text(x = d2$`df$trt`, y = rep(0.8, NROW(d2)), labels = round(d2$`df$trt`, 2), srt = 90)
text(x = d2$`df$trt`, y = rep(0.75, NROW(d2)), labels = d2$`df$grp`)
lapply(seq_along(non_sig), function(i){
lines(cbind(d2$`df$trt`[match(non_sig[[i]], d2$`df$grp`)], rep(0.9 - 0.01 * i, 2)))
})

Draw line through 2d density plot

I have a large dataset of gene expression from ~10,000 patient samples (TCGA), and I'm plotting a predicted expression value (x) and the actual observed value (y) of a certain gene signature. For my downstream analysis, I need to draw a precise line through the plot and calculate different parameters in samples above/below the line.
No matter how I draw a line through the data (geom_smooth(method = 'lm', 'glm', 'gam', or 'loess')), the line always seems imperfect - it doesn't cut through the data to my liking (red line is lm in figure).
After playing around for a while, I realized that the 2d kernel density lines (geom_density2d) actually do a good job of showing the slope/trends of my data, so I manually drew a line that kind of cuts through the density lines (black line in figure).
My question: how can I automatically draw a line that cuts through the kernel density lines, as for the black line in the figure? (Rather than manually playing with different intercepts and slopes till something looks good).
The best approach I can think of is to somehow calculate intercept and slope of the longest diameter for each of the kernel lines, take an average of all those intercepts and slopes and plot that line, but that's a bit out of my league. Maybe someone here has experience with this and can help?
A more hacky approach may be getting the x,y coords of each kernel density line from ggplot_build, and going from there, but it feels too hacky (and is also out of my league).
Thanks!
EDIT: Changed a few details to make the figure/analysis easier. (Density lines are smoother now).
Reprex:
library(MASS)
library(ggplot2)
set.seed(123)
samples <- 10000
r <- 0.9
data <- mvrnorm(n=samples, mu=c(0, 0), Sigma=matrix(c(2, r, r, 2), nrow=2))
x <- data[, 1] # marginally normal: mean 0, variance 2
y <- data[, 2] # marginally normal: mean 0, variance 2
test.df <- data.frame(x = x, y = y)
lm(y ~ x, test.df)
ggplot(test.df, aes(x, y)) +
geom_point(color = 'grey') +
geom_density2d(color = 'red', lwd = 0.5, contour = T, h = c(2,2)) + ### EDIT: h = c(2,2)
geom_smooth(method = "glm", se = F, lwd = 1, color = 'red') +
geom_abline(intercept = 0, slope = 0.7, lwd = 1, col = 'black') ## EDIT: slope to 0.7
Figure:
I generally agree with @Hack-R.
However, it was kind of a fun problem and looking into ggplot_build is not such a big deal.
require(dplyr)
require(ggplot2)
p <- ggplot(test.df, aes(x, y)) +
geom_density2d(color = 'red', lwd = 0.5, contour = T, h = c(2,2))
#basic version of your plot
p_built <- ggplot_build(p)
p_data <- p_built$data[[1]]
p_maxring <- p_data[p_data[['level']] == min(p_data[['level']]),] %>%
select(x,y) # extracts the x/y coordinates of the points on the largest ellipse from your 2d-density contour
Now this answer helped me to find the points on this ellipse which are furthest apart.
coord_mean <- c(x = mean(p_maxring$x), y = mean(p_maxring$y))
p_maxring <- p_maxring %>%
mutate (mean_dev = sqrt((x - mean(x))^2 + (y - mean(y))^2)) #extra column specifying the distance of each point to the mean of those points
coord_farthest <- c('x' = p_maxring$x[which.max(p_maxring$mean_dev)], 'y' = p_maxring$y[which.max(p_maxring$mean_dev)])
# gives the coordinates of the point farthest away from the mean point
farthest_from_farthest <- sqrt((p_maxring$x - coord_farthest['x'])^2 + (p_maxring$y - coord_farthest['y'])^2)
#now this looks which of the points is the farthest from the point farthest from the mean point :D
coord_fff <- c('x' = p_maxring$x[which.max(farthest_from_farthest)], 'y' = p_maxring$y[which.max(farthest_from_farthest)])
ggplot(test.df, aes(x, y)) +
geom_density2d(color = 'red', lwd = 0.5, contour = T, h = c(2,2)) +
# geom_segment using the coordinates of the points farthest apart
geom_segment((aes(x = coord_farthest['x'], y = coord_farthest['y'],
xend = coord_fff['x'], yend = coord_fff['y']))) +
geom_smooth(method = "glm", se = F, lwd = 1, color = 'red') +
# as per your request with your geom_smooth line
coord_equal()
coord_equal() matters here: if the x and y scales are not equal, the segment will appear not to pass through the points that are actually furthest apart, which is confusing at first.
I leave it to you to build this into a function in order to automate it, and to calculate the y-intercept and slope from the two points; a minimal sketch of that calculation follows.
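Here is that last step as a small sketch (my addition, not part of the original answer), using the two points computed above:
# slope and intercept of the line through the two most distant contour points
slope     <- unname((coord_fff['y'] - coord_farthest['y']) /
                      (coord_fff['x'] - coord_farthest['x']))
intercept <- unname(coord_farthest['y'] - slope * coord_farthest['x'])
c(intercept = intercept, slope = slope)
# e.g. add to the plot with geom_abline(intercept = intercept, slope = slope)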
Tjebo's approach was a good start, but on closer inspection I found that it finds the longest distance between two points on the contour. While this is close to what I wanted, it fails when the contour has an irregular shape or when its points are sparse, because it measures the longest distance between two of the available points, whereas what I really want is the longest diameter of the fitted ellipse, i.e. its major axis. See the image below for examples/details.
Briefly:
To find/draw density contours of specific density/percentage:
R - How to find points within specific Contour
To get the longest diameter (major axis) of an ellipse:
https://stackoverflow.com/a/18278767/3579613
For a function that returns the intercept and slope (as requested in the OP), see the last piece of code.
The two pieces of code and the images below compare Tjebo's approach with my new approach based on the posts above.
#### Reprex from OP
require(dplyr)
require(ggplot2)
require(MASS)
set.seed(123)
samples <- 10000
r <- 0.9
data <- mvrnorm(n=samples, mu=c(0, 0), Sigma=matrix(c(2, r, r, 2), nrow=2))
x <- data[, 1] # marginally normal: mean 0, variance 2
y <- data[, 2] # marginally normal: mean 0, variance 2
test.df <- data.frame(x = x, y = y)
#### From Tjebo
p <- ggplot(test.df, aes(x, y)) +
geom_density2d(color = 'red', lwd = 0.5, contour = T, h = 2)
p_built <- ggplot_build(p)
p_data <- p_built$data[[1]]
p_maxring <- p_data[p_data[['level']] == min(p_data[['level']]),][,2:3]
coord_mean <- c(x = mean(p_maxring$x), y = mean(p_maxring$y))
p_maxring <- p_maxring %>%
mutate (mean_dev = sqrt((x - mean(x))^2 + (y - mean(y))^2)) #extra column specifying the distance of each point to the mean of those points
p_maxring = p_maxring[round(seq(1, nrow(p_maxring), nrow(p_maxring)/23)),] #### keep only ~23 points so the ring is sparse, to illustrate the flaw in the approach
coord_farthest <- c('x' = p_maxring$x[which.max(p_maxring$mean_dev)], 'y' = p_maxring$y[which.max(p_maxring$mean_dev)])
# gives the coordinates of the point farthest away from the mean point
farthest_from_farthest <- sqrt((p_maxring$x - coord_farthest['x'])^2 + (p_maxring$y - coord_farthest['y'])^2)
#now this looks which of the points is the farthest from the point farthest from the mean point :D
coord_fff <- c('x' = p_maxring$x[which.max(farthest_from_farthest)], 'y' = p_maxring$y[which.max(farthest_from_farthest)])
farthest_2_points = data.frame(t(cbind(coord_farthest, coord_fff)))
plot(p_maxring[,1:2], asp=1)
lines(farthest_2_points, col = 'blue', lwd = 2)
#### From the answer in the other post
library(cluster)  # provides ellipsoidhull()
d = cbind(p_maxring[,1], p_maxring[,2])
r = ellipsoidhull(d)
exy = predict(r) ## the ellipsoid boundary
lines(exy)
me = colMeans((exy))
dist2center = sqrt(rowSums((t(t(exy)-me))^2))
max(dist2center) ## semi-major axis (half of the longest diameter)
lines(exy[dist2center == max(dist2center),], col = 'red', lwd = 2)
#### The plot here is made from the data in the reprex in OP, but with h = 0.5
library(MASS)
set.seed(123)
samples <- 10000
r <- 0.9
data <- mvrnorm(n=samples, mu=c(0, 0), Sigma=matrix(c(2, r, r, 2), nrow=2))
x <- data[, 1] # marginally normal: mean 0, variance 2
y <- data[, 2] # marginally normal: mean 0, variance 2
test.df <- data.frame(x = x, y = y)
## MAKE BLUE LINE
p <- ggplot(test.df, aes(x, y)) +
geom_density2d(color = 'red', lwd = 0.5, contour = T, h = 0.5) ## NOTE h = 0.5
p_built <- ggplot_build(p)
p_data <- p_built$data[[1]]
p_maxring <- p_data[p_data[['level']] == min(p_data[['level']]),][,2:3]
coord_mean <- c(x = mean(p_maxring$x), y = mean(p_maxring$y))
p_maxring <- p_maxring %>%
mutate (mean_dev = sqrt((x - mean(x))^2 + (y - mean(y))^2))
coord_farthest <- c('x' = p_maxring$x[which.max(p_maxring$mean_dev)], 'y' = p_maxring$y[which.max(p_maxring$mean_dev)])
farthest_from_farthest <- sqrt((p_maxring$x - coord_farthest['x'])^2 + (p_maxring$y - coord_farthest['y'])^2)
coord_fff <- c('x' = p_maxring$x[which.max(farthest_from_farthest)], 'y' = p_maxring$y[which.max(farthest_from_farthest)])
## MAKE RED LINE
## h = 0.5
## Given the highly irregular shape of the contours, I will use only the largest contour line (0.95) for drawing the line.
## Thus, average = 1. See the function below for details.
ln = long.diam("x", "y", test.df, h = 0.5, average = 1) ## NOTE h = 0.5
## PLOT
ggplot(test.df, aes(x, y)) +
geom_density2d(color = 'red', lwd = 0.5, contour = T, h = 0.5) + ## NOTE h = 0.5
geom_segment((aes(x = coord_farthest['x'], y = coord_farthest['y'],
xend = coord_fff['x'], yend = coord_fff['y'])), col = 'blue', lwd = 2) +
geom_abline(intercept = ln[1], slope = ln[2], color = 'red', lwd = 2) +
coord_equal()
Finally, I came up with the following function to deal with all of this. Apologies for the limited comments/clarity.
#### This will return the intercept and slope of the longest diameter (major axis).
#### If average = TRUE, it will average the intercept and slope across the density contours.
#### Requires MASS (kde2d), cluster (ellipsoidhull) and stringr (str_sub) to be loaded.
long.diam = function(x, y, df, probs = c(0.95, 0.5, 0.1), average = T, h = 2) {
fun.df = data.frame(cbind(df[,x], df[,y]))
colnames(fun.df) = c("x", "y")
dens = kde2d(fun.df$x, fun.df$y, n = 200, h = h)
dx <- diff(dens$x[1:2])
dy <- diff(dens$y[1:2])
sz <- sort(dens$z)
c1 <- cumsum(sz) * dx * dy
levels <- sapply(probs, function(x) {
approx(c1, sz, xout = 1 - x)$y
})
names(levels) = paste0("L", str_sub(formatC(probs, 2, format = 'f'), -2))
#plot(fun.df$x,fun.df$y, asp = 1)
#contour(dens, levels = levels, labels=probs, add=T, col = c('red', 'blue', 'green'), lwd = 2)
#contour(dens, add = T, col = 'red', lwd = 2)
#abline(lm(fun.df$y~fun.df$x))
ls <- contourLines(dens, levels = levels)
names(ls) = names(levels)
lines.info = list()
for (i in 1:length(ls)) {
d = cbind(ls[[i]]$x, ls[[i]]$y)
exy = predict(ellipsoidhull(d))## the ellipsoid boundary
colnames(exy) = c("x", "y")
me = colMeans((exy)) ## center of the ellipse
dist2center = sqrt(rowSums((t(t(exy)-me))^2))
#plot(exy,type='l',asp=1)
#points(d,col='blue')
#lines(exy[order(dist2center)[1:2],])
#lines(exy[rev(order(dist2center))[1:2],])
max.dist = data.frame(exy[rev(order(dist2center))[1:2],])
line.fit = lm(max.dist$y ~ max.dist$x)
lines.info[[i]] = c(as.numeric(line.fit$coefficients[1]), as.numeric(line.fit$coefficients[2]))
}
names(lines.info) = names(ls)
#plot(fun.df$x,fun.df$y, asp = 1)
#contour(dens, levels = levels, labels=probs, add=T, col = c('red', 'blue', 'green'), lwd = 2)
#abline(lines.info[[1]], col = 'red', lwd = 2)
#abline(lines.info[[2]], col = 'blue', lwd = 2)
#abline(lines.info[[3]], col = 'green', lwd = 2)
#abline(apply(simplify2array(lines.info), 1, mean), col = 'black', lwd = 4)
if (isTRUE(average)) {
apply(simplify2array(lines.info), 1, mean)
} else {
lines.info[[average]]
}
}
And here is the full implementation combining the different answers:
library(MASS)
set.seed(123)
samples = 10000
r = 0.9
data = mvrnorm(n=samples, mu=c(0, 0), Sigma=matrix(c(2, r, r, 2), nrow=2))
x = data[, 1] # marginally normal: mean 0, variance 2
y = data[, 2] # marginally normal: mean 0, variance 2
#plot(x, y)
test.df = data.frame(x = x, y = y)
#### Find furthest two points of contour
## BLUE
p <- ggplot(test.df, aes(x, y)) +
geom_density2d(color = 'red', lwd = 2, contour = T, h = 2)
p_built <- ggplot_build(p)
p_data <- p_built$data[[1]]
p_maxring <- p_data[p_data[['level']] == min(p_data[['level']]),][,2:3]
coord_mean <- c(x = mean(p_maxring$x), y = mean(p_maxring$y))
p_maxring <- p_maxring %>%
mutate (mean_dev = sqrt((x - mean(x))^2 + (y - mean(y))^2))
coord_farthest <- c('x' = p_maxring$x[which.max(p_maxring$mean_dev)], 'y' = p_maxring$y[which.max(p_maxring$mean_dev)])
farthest_from_farthest <- sqrt((p_maxring$x - coord_farthest['x'])^2 + (p_maxring$y - coord_farthest['y'])^2)
coord_fff <- c('x' = p_maxring$x[which.max(farthest_from_farthest)], 'y' = p_maxring$y[which.max(farthest_from_farthest)])
#### Find the average intercept and slope of 3 contour lines (0.95, 0.5, 0.1), as in my long.diam function above.
## RED
ln = long.diam("x", "y", test.df)
#### Plot everything. Black line is GLM
ggplot(test.df, aes(x, y)) +
geom_point(color = 'grey') +
geom_density2d(color = 'red', lwd = 1, contour = T, h = 2) +
geom_smooth(method = "glm", se = F, lwd = 1, color = 'black') +
geom_abline(intercept = ln[1], slope = ln[2], col = 'red', lwd = 1) +
geom_segment((aes(x = coord_farthest['x'], y = coord_farthest['y'],
xend = coord_fff['x'], yend = coord_fff['y'])), col = 'blue', lwd = 1) +
coord_equal()
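As a final aside (my addition, not from any of the answers): for roughly elliptical, bivariate-normal-ish data like this reprex, the major axis of the density contours lines up with the first principal component, so a much shorter route to an intercept and slope is a plain PCA on the two columns.
# Hedged sketch: the PC1 direction approximates the contour major axis for elliptical data
pc <- prcomp(test.df)                                   # centred PCA on the x/y columns
slope     <- pc$rotation["y", "PC1"] / pc$rotation["x", "PC1"]
intercept <- mean(test.df$y) - slope * mean(test.df$x)  # the PC1 line passes through the centroid
c(intercept = intercept, slope = slope)
# e.g. add with geom_abline(intercept = intercept, slope = slope, colour = "purple")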

R: How to add the noise cluster into DBSCAN plot

I'm trying to plot DBSCAN results. This is what I have done so far. My distance matrix is here.
dbs55_CR_EUCL = dbscan(writeCRToMatrix,eps=0.006, MinPts = 4, method = "dist")
plot(writeCRToMatrix[dbs55_CR_EUCL$cluster>0,],
col=dbs55_CR_EUCL$cluster[dbs55_CR_EUCL$cluster>0],
main="DBSCAN Clustering K = 4 \n (EPS=0.006, MinPts=4) without noise",
pch = 20)
This is the plot:
When I tried plotting all the clusters including the noise cluster I could only see 2 points in my plot.
What I'm looking for is:
1. To add the points in the noise cluster to the plot, but with a different symbol, similar to the following picture.
2. To shade the cluster areas, as in the following picture.
Noise points have a cluster id of 0. Base R draws a colour of 0 in the device background colour, so those points are effectively invisible; if you want to show the noise points (as black), shift the cluster ids by one:
plot(writeCRToMatrix,
col=dbs55_CR_EUCL$cluster+1L,
main="DBSCAN Clustering K = 4 \n (EPS=0.006, MinPts=4) with noise",
pch = 20)
If you want a different symbol for noise then you could do the following (adapted from the man page):
library(dbscan)
n <- 100
x <- cbind(
x = runif(10, 0, 10) + rnorm(n, sd = 0.2),
y = runif(10, 0, 10) + rnorm(n, sd = 0.2)
)
res <- dbscan::dbscan(x, eps = .2, minPts = 4)
plot(x, col=res$cluster, pch = 20)
points(x[res$cluster == 0L, ], col = "grey", pch = "+")  # noise points only (note the comma: select rows of the matrix)
Here is code that will create a shaded convex hull for each cluster
library(ggplot2)
library(data.table)
library(dbscan)
dt <- data.table(x, level=as.factor(res$cluster), key = "level")
hulls <- dt[, .SD[chull(x, y)], by = level]
### get rid of hull for noise
hulls <- hulls[level != "0",]
cols <- c("0" = "grey", "1" = "red", "2" = "blue")
ggplot(dt, aes(x=x, y=y, color=level)) +
geom_point() +
geom_polygon(data = hulls, aes(fill = level, group = level),
alpha = 0.2, color = NA) +
scale_color_manual(values = cols) +
scale_fill_manual(values = cols)
Hope this helps.
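As an aside (not part of the original answer), the dbscan package also ships hullplot(), which plots the points together with a convex hull around each cluster in a single call and may be enough for a quick look:
dbscan::hullplot(x, res)  # base-graphics plot of the clustering, noise shown without a hull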
