R - create tomography plot with ggplot

I have the following graph:
set.seed(123456)
test1_1 <- round(rnorm(20,mean=40,sd=5),0)/100
test1_2 <- round(rnorm(20,mean=60,sd=5),0)/100
test.data <- as.data.frame(cbind(test1_1,test1_2))
test <- ggplot(test.data, aes(test1_1,test1_2))+
geom_point()+
scale_y_continuous(limits = c(0, 1)) +
scale_x_continuous(limits = c(0, 1)) + # this "+" was missing in the original post
abline(0.5,0.5)
test
Now I have points, which are created with the following formula:
line <- function(beta_2, test1_1,test1_2){
beta_1 = (test1_2/(1-test1_1))-(test1_1/(1-test1_1))*beta_2
return(beta_1)}
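(For context: the formula is the accounting identity test1_2 = beta_2*test1_1 + beta_1*(1 - test1_1) solved for beta_1, so each row of the data defines one line in (beta_2, beta_1) space. Those lines are what a tomography plot draws.)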
output1 <- as.data.frame(matrix(0,20,1))
beta_2 <- 1
for (i in 1:nrow(test.data)){
output1[i,] <- line(beta_2,test.data[i,1],test.data[i,2])
}
output2 <- as.data.frame(matrix(0,20,1))
beta_2 <- 0
for (i in 1:nrow(ei.data)){
output2[i,] <- line(beta_2,test.data[i,1],test.data[i,2])
}
output <- cbind(output1,output2)
I would like to add the data in the second data frame as lines to the plot created above (one line per row). However, using
abline(output[1,1],output[1,2])
does not work. How could I achieve this?

abline() is base R syntax; for ggplot2 you need geom_abline(), with the slope and intercept passed explicitly:
ggplot(test.data, aes(test1_1,test1_2)) +
geom_point()+
scale_y_continuous(limits = c(0, 1))+
scale_x_continuous(limits = c(0, 1)) +
geom_abline(slope = output[1,1],intercept = output[1,2])
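To draw all 20 lines at once (one per row of output) instead of indexing row by row, geom_abline() can also take a data frame. A sketch, assuming the corrected test.data loop above and the same slope/intercept column mapping as the call above (the columns are renamed for clarity, since cbind leaves them both named V1):
names(output) <- c("slope", "intercept")
ggplot(test.data, aes(test1_1, test1_2)) +
geom_point() +
scale_y_continuous(limits = c(0, 1)) +
scale_x_continuous(limits = c(0, 1)) +
geom_abline(data = output, aes(slope = slope, intercept = intercept))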
Note: your reproducible example had an error in which ei.data was used instead of test.data, and part of your ggplot call was missing a + sign. I only point this out so others (or you) don't get caught by a separate error without realizing it.

Related

Evenly-spaced 1/freq for R function spec.pgram()

The goal is to plot the coherency between two time series (i.e. the correlation coefficient with respect to frequency). How can I get 1/freq (i.e. the period) on the x-axis to be evenly spaced?
t <- 0:99
ts1 <- ts(2*cos((2*pi)/24*t))
ts2 <- ts(2*cos((2*pi)/48*t))
ts12 <- ts.intersect(ts1, ts2)
Coh <- spec.pgram(ts12, spans=3)
plot(Coh$freq, Coh$coh, type='l')
plot(1/Coh$freq, Coh$coh, type='l') # how to get 1/freq to be evenly spaced?
I have tried to modify the function spec.pgram() but without success. More specifically, I replace the line:
freq <- seq.int(from = xfreq/N, by = xfreq/N, length.out = Nspec)
with:
freq.tmp <- seq.int(from = xfreq/N, by = xfreq/N, length.out = Nspec)
freq <- rev(1/seq(from=1/max(freq.tmp), to=1/min(freq.tmp), length.out=Nspec))
Has anyone else had better luck? Thanks
Do you mean that you just want to relabel the x-axis with periods, rather than frequencies? That would maintain the spacing of the values on the x-axis, at the expense of a non-linear scaling for the x values. For example (using ggplot2):
library(ggplot2)
dat = as.data.frame(Coh[c("freq","spec","coh","phase")])
ggplot(dat, aes(freq, coh)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks=dat$freq[seq(1,nrow(dat),3)],
minor_breaks=dat$freq,
labels=round(1/dat$freq[seq(1,nrow(dat),3)],1)) +
labs(x="Period")
You could also set the x-value labels to fall on integer periods:
breaks = c(1:10,15,25,50,100)
ggplot(dat, aes(freq, coh)) +
geom_line() +
geom_point() +
scale_x_continuous(breaks=1/breaks,
minor_breaks = 1/(breaks[-1] - 0.5 * diff(breaks)),
labels=breaks) +
labs(x="Period")

geom_jitter removes a different number of points (due to missing values) from the plot each time it's run

geom_jitter in R ggplot seems to remove a different number of points each time I plot the data. I suspect this is due to overplotting (stacked points)? E.g., if I create the data frame once and then run the ggplot command multiple times, I get a varying number of points removed due to missing data (ranging from 0 to 1 or more). Is there a way to ensure a consistent number of missing points (or none)? I tried tinkering with the size and the jitter width/height, to no avail. Thanks!
d <- data.frame(a = rnorm(n = 100, mean = 0, sd = 1), b = rnorm(n = 100, mean = 0, sd = 1))
ggplot(d, aes(a,b)) + geom_point(position=position_jitter(width=0.3, height=.3), size=2) + theme(panel.background=element_blank()) + scale_x_continuous(limits=c(-3, 3)) + scale_y_continuous(limits=c(-3, 3))
The jitter pushes points outside the ranges you specify, and the random noise is recalculated on each run. Either apply the jitter yourself, so it doesn't change every time, or remove the range constraints.
set.seed(0)
d <- data.frame(a = rep(-2:2, each=20), b=rnorm(100))
## Specify your own jitter: 0.1 in width, 1 in height in this example
d <- d + rnorm(nrow(d)*2, 0, sd=rep(c(0.1, 1), each=nrow(d)))
## Always 4 rows removed, unless you rejitter
ggplot(d, aes(a, b)) +
geom_point(size=2) +
theme(panel.background=element_blank()) +
scale_x_continuous(limits=c(-3,3)) +
scale_y_continuous(limits=c(-3,3))
Edit
Actually, it's much simpler: just call set.seed prior to running what you have :)
set.seed(0)
ggplot(d, aes(a,b)) +
geom_point(position=position_jitter(width=0.3, height=.3), size=2) +
theme(panel.background=element_blank()) + scale_x_continuous(limits=c(-3, 3)) +
scale_y_continuous(limits=c(-3, 3))
Another option is not to use the limits argument of scale_x_continuous at all. Instead, use the xlim and ylim arguments of coord_cartesian, which is the mechanism meant for zooming into a portion of the plot. The limits argument of the x and y scales actually subsets the data to be plotted; usually this makes little difference, unless you're computing statistical summaries that involve data not visible on the plot.
Note: you won't get warnings when your data points fall outside the graph.
d <- data.frame(a = rnorm(n = 100, mean = 0, sd = 1),
b = rnorm(n = 100, mean = 0, sd = 1))
ggplot(d, aes(a,b)) +
geom_point(position=position_jitter(width=0.3, height=.3), size=2) +
theme(panel.background=element_blank()) +
coord_cartesian(xlim=c(-3,3), ylim=c(-3,3))
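To see the subsetting difference in action, here is a small illustrative sketch (not from the original answer): a loess smoother fitted under scale limits uses only the in-range points, while under coord_cartesian it uses all of them.
library(ggplot2)
set.seed(0)
d <- data.frame(a = rnorm(100), b = rnorm(100))
p <- ggplot(d, aes(a, b)) + geom_point() + geom_smooth(method = "loess")
p + scale_x_continuous(limits = c(-1, 1)) # fit computed from the subset only
p + coord_cartesian(xlim = c(-1, 1)) # fit computed from all points, view zoomed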
Another, lesser-known option is to change the way scales handle their bounds by setting the out-of-bounds (oob) argument.
This is not really my idea, but very much inspired by user axeman in this very similar thread.
library(ggplot2)
set.seed(0)
d <- data.frame(a = rnorm(n = 100, mean = 0, sd = 1), b = rnorm(n = 100, mean = 0, sd = 1))
ggplot(d, aes(a,b)) +
geom_point(position=position_jitter(width=0.3, height=.3), size=2) +
theme(panel.background=element_blank()) +
scale_x_continuous(limits=c(-3, 3), oob = scales::squish) +
scale_y_continuous(limits=c(-3, 3), oob = scales::squish)
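(scales::squish clamps out-of-range values to the limits instead of converting them to NA, so no points are dropped; jittered points that would land outside the limits pile up on the boundary instead.)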
Created on 2021-04-27 by the reprex package (v2.0.0)

How to plot a contour line showing the region within which 95% of values fall, in R and in ggplot2

Say we have:
x <- rnorm(1000)
y <- rnorm(1000)
How do I use ggplot2 to produce a plot containing the two following geoms:
The bivariate expectation of the two series of values
A contour line showing the region within which 95% of the estimates fall?
I know how to do the first part:
df <- data.frame(x=x, y=y)
p <- ggplot(df, aes(x=x, y=y))
p <- p + xlim(-10, 10) + ylim(-10, 10) # say
p <- p + geom_point(x=mean(x), y=mean(y))
And I also know about the stat_contour() and stat_density2d() functions within ggplot2.
And I also know that there are 'bins' options within stat_contour.
However, I guess what I need is something like the probs argument within quantile, but over two dimensions rather than one.
I have also seen a solution within the graphics package. However, I would like to do this within ggplot.
Help much appreciated,
Jon
Unfortunately, the accepted answer currently fails with Error: Unknown parameters: breaks on ggplot2 2.1.0. I cobbled together an alternative approach based on the code in this answer, which uses the ks package for computing the kernel density estimate:
library(ggplot2)
set.seed(1001)
d <- data.frame(x=rnorm(1000),y=rnorm(1000))
kd <- ks::kde(d, compute.cont=TRUE)
contour_95 <- with(kd, contourLines(x=eval.points[[1]], y=eval.points[[2]],
z=estimate, levels=cont["5%"])[[1]])
contour_95 <- data.frame(contour_95)
ggplot(data=d, aes(x, y)) +
geom_point() +
geom_path(aes(x, y), data=contour_95) +
theme_bw()
(Resulting plot: the scatter of points with the 95% density contour path overlaid.)
TIP: The ks package depends on the rgl package, which can be a pain to compile manually. Even if you're on Linux, it's much easier to get a precompiled version, e.g. sudo apt install r-cran-rgl on Ubuntu if you have the appropriate CRAN repositories set up.
Riffing off of Ben Bolker's answer, a solution that can handle multiple levels and works with ggplot 2.2.1:
library(ggplot2)
library(MASS)
library(reshape2)
# create data:
set.seed(8675309)
Sigma <- matrix(c(0.1,0.3,0.3,4),2,2)
mv <- data.frame(mvrnorm(4000,c(1.5,16),Sigma))
# get the kde2d information:
mv.kde <- kde2d(mv[,1], mv[,2], n = 400)
dx <- diff(mv.kde$x[1:2]) # lifted from emdbook::HPDregionplot()
dy <- diff(mv.kde$y[1:2])
sz <- sort(mv.kde$z)
c1 <- cumsum(sz) * dx * dy
# specify desired contour levels:
prob <- c(0.95,0.90,0.5)
# plot:
dimnames(mv.kde$z) <- list(mv.kde$x,mv.kde$y)
dc <- melt(mv.kde$z)
dc$prob <- approx(sz,1-c1,dc$value)$y
p <- ggplot(dc,aes(x=Var1,y=Var2))+
geom_contour(aes(z=prob,color=..level..),breaks=prob)+
geom_point(aes(x=X1,y=X2),data=mv,alpha=0.1,size=1)
print(p)
(Resulting plot: the 50%, 90%, and 95% contours, colored by level, over the scatter of points.)
This works, but is quite inefficient because you actually have to compute the kernel density estimate three times.
set.seed(1001)
d <- data.frame(x=rnorm(1000),y=rnorm(1000))
getLevel <- function(x,y,prob=0.95) {
kk <- MASS::kde2d(x,y)
dx <- diff(kk$x[1:2])
dy <- diff(kk$y[1:2])
sz <- sort(kk$z)
c1 <- cumsum(sz) * dx * dy
approx(c1, sz, xout = 1 - prob)$y
}
L95 <- getLevel(d$x,d$y)
library(ggplot2); theme_set(theme_bw())
ggplot(d,aes(x,y)) +
stat_density2d(geom="tile", aes(fill = ..density..),
contour = FALSE)+
stat_density2d(colour="red",breaks=L95)
(with help from http://comments.gmane.org/gmane.comp.lang.r.ggplot2/303)
update: with a recent version of ggplot2 (2.1.0) it doesn't seem possible to pass breaks to stat_density2d (or at least I don't know how), but the method below with geom_contour still seems to work ...
You can make things a little more efficient by computing the kernel density estimate once and plotting the tiles and contours from the same grid:
kk <- with(d, MASS::kde2d(x,y)) # note: the original had dd, but the data frame is named d
library(reshape2)
dimnames(kk$z) <- list(kk$x,kk$y)
dc <- melt(kk$z)
ggplot(dc,aes(x=Var1,y=Var2))+
geom_tile(aes(fill=value))+
geom_contour(aes(z=value),breaks=L95,colour="red")
Doing the 95% level computation from the kk grid (to reduce the number of kernel computations to one) is left as an exercise.
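For completeness, a sketch of that computation, reusing the cumulative-sum trick from getLevel on the already-computed kk grid:
dx <- diff(kk$x[1:2])
dy <- diff(kk$y[1:2])
sz <- sort(kk$z)
c1 <- cumsum(sz) * dx * dy
L95_grid <- approx(c1, sz, xout = 1 - 0.95)$y # density level enclosing ~95% of the mass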
I'm not sure why stat_density2d(geom="tile") and geom_tile give slightly different results (the former is smoothed)
I haven't added the bivariate mean, but something like annotate("point",x=mean(d$x),y=mean(d$y),colour="red") should work.
I had an example where the MASS::kde2d() bandwidth specifications were not flexible enough, so I ended up using the ks package and its ks::kde() function, with (for example) the ks::Hscv() function to estimate flexible bandwidths that captured the smoothness better. This computation can be a bit slow, but it performs much better in some situations. Here is a version of the above code for that example:
set.seed(1001)
d <- data.frame(x=rnorm(1000),y=rnorm(1000))
getLevel <- function(x,y,prob=0.95) {
kk <- MASS::kde2d(x,y)
dx <- diff(kk$x[1:2])
dy <- diff(kk$y[1:2])
sz <- sort(kk$z)
c1 <- cumsum(sz) * dx * dy
approx(c1, sz, xout = 1 - prob)$y
}
L95 <- getLevel(d$x,d$y)
library(ggplot2); theme_set(theme_bw())
ggplot(d,aes(x,y)) +
stat_density2d(geom="tile", aes(fill = ..density..),
contour = FALSE)+
stat_density2d(colour="red",breaks=L95)
## using ks::kde
hscv1 <- ks::Hscv(d)
fhat <- ks::kde(d, H=hscv1, compute.cont=TRUE)
dimnames(fhat[['estimate']]) <- list(fhat[["eval.points"]][[1]],
fhat[["eval.points"]][[2]])
library(reshape2)
aa <- melt(fhat[['estimate']])
ggplot(aa, aes(x=Var1, y=Var2)) +
geom_tile(aes(fill=value)) +
geom_contour(aes(z=value), breaks=fhat[["cont"]]["50%"], color="red") +
geom_contour(aes(z=value), breaks=fhat[["cont"]]["5%"], color="purple")
For this particular example the differences are minimal, but in an example where the bandwidth specification requires more flexibility, this modification may be important. Note that the 95% contour is specified using breaks=fhat[["cont"]]["5%"], which I found a little counter-intuitive because it is called the "5% contour" here.
Just mixing the answers from above, putting them in a more tidyverse-friendly way, and allowing for multiple contour levels. Here I use geom_path(group=probs) and add the labels manually with geom_text. Another approach is geom_path(colour=probs), which will automatically label the contours in a legend.
library(ks)
library(tidyverse)
set.seed(1001)
## data
d <- MASS::mvrnorm(1000, c(0, 0.2), matrix(c(1, 0.4, 1, 0.4), ncol=2)) %>%
magrittr::set_colnames(c("x", "y")) %>%
as_tibble()
## density function
kd <- ks::kde(d, compute.cont=TRUE, h=0.2)
## extract results
get_contour <- function(kd_out=kd, prob="5%") {
contour_95 <- with(kd_out, contourLines(x=eval.points[[1]], y=eval.points[[2]],
z=estimate, levels=cont[prob])[[1]])
as_tibble(contour_95) %>%
mutate(prob = prob)
}
dat_out <- map_dfr(c("10%", "20%","80%", "90%"), ~get_contour(kd, .)) %>%
group_by(prob) %>%
mutate(n_val = 1:n()) %>%
ungroup()
## clean kde output
kd_df <- expand_grid(x=kd$eval.points[[1]], y=kd$eval.points[[2]]) %>%
mutate(z = c(kd$estimate %>% t))
ggplot(data=kd_df, aes(x, y)) +
geom_tile(aes(fill=z)) +
geom_point(data = d, alpha = I(0.4), size = I(0.4), colour = I("yellow")) +
geom_path(aes(x, y, group = prob),
data=filter(dat_out, !n_val %in% 1:3), colour = I("white")) +
geom_text(aes(label = prob), data =
filter(dat_out, (prob%in% c("10%", "20%","80%") & n_val==1) | (prob%in% c("90%") & n_val==20)),
colour = I("black"), size =I(3))+
scale_fill_viridis_c()+
theme_bw() +
theme(legend.position = "none")
Created on 2019-06-25 by the reprex package (v0.3.0)

Create data with function inside shinyServer to feed ggplot

I have a Shiny app that calculates some power estimates for a type of genetic association study. The ui.R is pretty simple, and the server.R has a function that returns a data frame (I think I can't have this function as a reactive because it has some parameters).
The link to the Gist is here. To run it:
library(shiny)
shiny::runGist('5895082')
The app calculates correctly the estimates, but I have two questions regarding it:
Is it possible to have output$powTable actually represent all the values contained within the range of the first sliderInput (n.cases)? It only seems to represent the two extreme values of the range... What am I doing wrong?
There's an error when running the app:
Error: Reading objects from shinyoutput object not allowed.
How can I pass the data (via reactivity?) from the function f() to feed the ggplot? After much trial and error, I am very lost. Where could the error in my code be? Many thanks in advance!
The original code of the function works well: (EDITED)
f <- function(ncases, p0, OR.cas.ctrl, Nh, sig.level) {
num.cases <- ncases
p0 <- p0
Nh <- Nh
OR.cas.ctrl <- OR.cas.ctrl
sig.level <- sig.level
# Parameters related to sig.level, from [Table 2] of Samuels et al.
# For 90% power and alpha = .05, Nscaled = 8.5
if (sig.level == 0.05){
A <- -28 # Parameter A for alpha=.05
x0 <- 2.6 # Parameter x0 for alpha=.05
d <- 2.4 # Parameter d for alpha=.05
}
if (sig.level == 0.01){
A <- -13 # Parameter A for alpha=.01
x0 <- 5 # Parameter x0 for alpha=.01
d <- 2.5 # Parameter d for alpha=.01
}
if (sig.level == 0.001){
A <- -7 # Parameter A for alpha=.001
x0 <- 7.4 # Parameter x0 for alpha=.001
d <- 2.8 # Parameter d for alpha=.001
}
out.pow <- NULL # initialize vector
for(ncases in ncases){
OR.ctrl.cas <- 1 / OR.cas.ctrl # 1. CALCULATE P1 FROM A PREDEFINED P0, AND A DESIRED OR
OR <- OR.ctrl.cas
bracket.pw <- p0 / (OR - OR*p0) # obtained after isolating p1 in OR equation [3].
p1 <- bracket.pw / (1 + bracket.pw)
Nh037 <- Nh^0.37 # 2. CALCULATE NSCALED
num.n <- num.cases*((p1-p0)^2)
den.n <- (p1*(1-p1) + p0*(1-p0))*Nh037
Nscaled <- num.n/den.n
num.power <- A - 100 # 3. CALCULATE POWER
den.power <- 1 + exp((Nscaled - x0)/d)
power <- 100 + (num.power/den.power) # The power I have to detect a given OR with my data, at a given alpha
}
OR <- OR.cas.ctrl
out.pow <- data.frame(num.cases, Nh, Nscaled, p0, OR, sig.level, power)
out.pow
}
mydata <- f(ncases=seq(50,1000, by=50), 0.4, 2.25, 11, 0.05)
mydata
library(ggplot2)
print(ggplot(data = mydata, aes(num.cases, power)) +
theme_bw() +
theme(text=element_text(family="Helvetica", size=12)) +
labs(title = "Ad-hoc power for haplogroup") +
scale_color_brewer(palette = "Dark2", guide = guide_legend(reverse=TRUE)) +
xlab("number of cases/controls") +
ylab("power") +
scale_x_log10() +
geom_line(alpha=0.8, size=0.2) +
geom_point(aes(shape = factor(OR)), colour="black"))
First of all, you have n.cases named inconsistently I think. It's n.cases sometimes, and ncases other times. Is that a mistake?
Anyway, output$mydata() is incorrect. It isn't an output. It should be just:
mydata <- reactive(f(input$n.cases,
input$p0,
input$OR.cas.ctrl,
input$Nh,
input$sig.level))
And then, when using it in output$powHap, it should be:
output$powHap <- renderPlot(
{
print(ggplot(data = mydata(), aes(ncases, power)) +
theme_bw() +
theme(text=element_text(family="Helvetica", size=12)) +
labs(title = "Ad-hoc power for haplogroup") +
scale_color_brewer(palette = "Dark2", guide = guide_legend(reverse=TRUE)) +
xlab("number of cases/controls") +
ylab("power") +
scale_x_log10() +
geom_line(alpha=0.8, size=0.2) +
geom_point(aes(shape = factor(OR)), colour="black"))
})
The important part there is that you need to do:
data = mydata()
rather than
data = output$mydata
Because output$mydata is a (reactive) function.
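To make the pattern concrete, here is a minimal self-contained sketch (a hypothetical app, not the OP's gist) of a reactive feeding renderPlot:
library(shiny)
library(ggplot2)
ui <- fluidPage(
sliderInput("n", "Number of points:", min = 10, max = 200, value = 50),
plotOutput("scatter")
)
server <- function(input, output) {
mydata <- reactive(data.frame(x = rnorm(input$n), y = rnorm(input$n)))
output$scatter <- renderPlot({
# call the reactive as mydata(), never output$mydata
ggplot(mydata(), aes(x, y)) + geom_point()
})
}
shinyApp(ui, server)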
I would suggest reading the documentation on how reactives work. The whole thing should make a lot more sense afterwards. +1 for a very reproducible example by the way. This is how all questions should be posted.

Pairwise graphical comparison of several distributions

This is an edited version of a previous question.
We are given an m by n table of n observations (samples) over m variables (genes, etc.), and we want to study the behavior of the variables between each pair of observations, for instance the two observations having the highest positive or negative correlation. For this purpose I have seen a great chart in Stadler et al.'s Nature paper (2011).
Here is a sample dataset to use:
m <- 1000
samples <- data.frame(unif1 = runif(m), unif2 = runif(m, 1, 2), norm1 = rnorm(m),
norm2 = rnorm(m, 1), norm3 = rnorm(m, 0, 5))
I have already tested gpairs(samples) from the gpairs package. It's a good start, but it has no option to put correlation coefficients in the upper-right section, nor density plots in the lower corner.
Next I used ggpairs(samples, lower=list(continuous="density")) from the GGally package (thanks @LucianoSelzer for a comment below). Now we have correlations in the upper corner and densities in the lower corner, but we are missing the diagonal barplots, and the density plots are not heatmap-shaped.
Any ideas on how to get this closer to the desired picture (the first one)?
You could try several different plotting methods and combine the results. Here's an example, which can be tweaked accordingly:
cors <- round(cor(samples), 2) # correlations
# build the layout matrix for the panels
laymat <- diag(1:5) # histograms on the diagonal
laymat[upper.tri(laymat)] <- 6:15 # correlations in the upper triangle
laymat[lower.tri(laymat)] <- 16:25 # heatmaps in the lower triangle
layout(laymat) # apply the layout
par(mar=c(2,2,2,2)) # set margins
# Draw histograms, tweak arguments of hist to make nicer figures
for(i in 1:5)
hist(samples[,i],main=names(samples)[i])
# Write correlations to upper diagonal part of the graph
# Again, tweak accordingly
for(i in 1:4)
for(j in (i+1):5){
plot(-1:1,-1:1, type = "n",xlab="",ylab="",xaxt="n",yaxt="n")
text(x=0,y=0,labels=paste(cors[i,j]),cex=2)
}
# Plot heatmaps, here I use kde2d function for density estimation
# image function for generating heatmaps
library(MASS)
for(i in 2:5)
for(j in 1:(i-1)){
k <- kde2d(samples[,i],samples[,j])
image(k,col=heat.colors(1000))
}
edit: Corrected indexing on the last loop.
You can do something like this using three different packages and two custom functions, as below:
cor_fun computes and draws the correlation labels in the upper triangle.
my_fn draws the density heatmaps in the lower triangle.
You also need ggpairs from GGally.
library(GGally)
library(ggplot2)
library(RColorBrewer)
m <- 1000
samples <- data.frame(unif1 = runif(m), unif2 = runif(m, 1, 2), norm1 = rnorm(m),
norm2 = rnorm(m, 1), norm3 = rnorm(m, 0, 5))
cor_fun <- function(data, mapping, method="pearson", ndp=2, sz=5, stars=TRUE){ #ndp is to adjust the number of decimals
x <- eval_data_col(data, mapping$x)
y <- eval_data_col(data, mapping$y)
corr <- cor.test(x, y, method=method)
est <- corr$estimate
lb.size <- sz
if(stars){
stars <- c("***", "**", "*", "")[findInterval(corr$p.value, c(0, 0.001, 0.01, 0.05, 1))]
lbl <- paste0(round(est, ndp), stars)
}else{
lbl <- round(est, ndp)
}
ggplot(data=data, mapping=mapping) +
annotate("text", x=mean(x, na.rm=TRUE), y=mean(y, na.rm=TRUE), label=lbl, size=lb.size)+
theme(panel.grid = element_blank(), panel.background=element_rect(fill="snow1"))
}
colfunc<-colorRampPalette(c("darkblue","cyan","yellow","red"))
my_fn <- function(data, mapping){
p <- ggplot(data = data, mapping = mapping) +
stat_density2d(aes(fill=..density..), geom="tile", contour = FALSE) +
scale_fill_gradientn(colours = colfunc(100)) + theme_classic()
}
ggpairs(samples, columns = c(1,2,3,4,5),
lower=list(continuous=my_fn),
diag=list(continuous=wrap("densityDiag", fill="gray92")), #densityDiag is a function
upper=list(continuous=cor_fun)) + theme(panel.background=element_rect(fill="white")) +
theme(axis.text.x = element_text(angle = 0, vjust = 1, color = "black")) +
theme(axis.text.y = element_text(angle = 0, vjust = 1 , color = "black"))
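As a usage note (this goes beyond the original answer): GGally's wrap() helper can pass extra arguments to the custom panel functions, for example to switch the correlation method or the number of decimals:
ggpairs(samples, columns = 1:5,
lower = list(continuous = my_fn),
diag = list(continuous = wrap("densityDiag", fill = "gray92")),
upper = list(continuous = wrap(cor_fun, method = "spearman", ndp = 3)))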
