Can I use Annotate in ggplot2 to fill my violin graph? [duplicate] - r

ggplot2 can create a very attractive filled violin plot:
ggplot() + geom_violin(data=data.frame(x=1, y=rnorm(10 ^ 5)),
                       aes(x=x, y=y), fill='gray90', color='black') +
  theme_classic()
I'd like to restrict the fill to the central 95% of the distribution if possible, leaving the outline intact. Does anyone have suggestions on how to accomplish this?

Does this do what you want? It requires some data-processing and the drawing of two violins.
library(ggplot2)
set.seed(1)
dat <- data.frame(x=1, y=rnorm(10 ^ 5))
# flag each point as central (inside the 2.5%-97.5% quantile range) or not
dat_q <- quantile(dat$y, probs=c(0.025,0.975))
dat$central <- dat$y > dat_q[1] & dat$y < dat_q[2]
# plot: one '95%' violin (fill only) and one 'all' violin (outline only)
p1 <- ggplot(data=dat, aes(x=x,y=y)) +
  geom_violin(data=dat[dat$central,], color="transparent", fill="gray90") +
  geom_violin(color="black", fill="transparent") +
  theme_classic()
p1
Edit: the rounded edges bothered me, so here is a second approach. If I were doing this, I would want straight lines, so I experimented with the density estimate (which is what violin plots are based on).
d_y <- density(dat$y)
right_side <- data.frame(x=d_y$y, y=d_y$x) # note flip of x and y, prevents coord_flip later
right_side$central <- right_side$y > dat_q[1] & right_side$y < dat_q[2]
# add the 'left side': this entails reversing the order of the data
# for path and polygon, and making x negative
left_side <- right_side[nrow(right_side):1,]
left_side$x <- 0 - left_side$x
density_dat <- rbind(right_side,left_side)
p2 <- ggplot(density_dat, aes(x=x,y=y)) +
  geom_polygon(data=density_dat[density_dat$central,], fill="red") +
  geom_path()
p2
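As a rough sanity check on the density approach, the share of the estimated density that falls between the two quantiles can be integrated numerically. This is a minimal base-R sketch and not part of the original answer; the helper trapz is a hypothetical name for a simple trapezoid rule.

```r
set.seed(1)
y <- rnorm(10 ^ 5)
q <- quantile(y, probs = c(0.025, 0.975))
d <- density(y)

# hypothetical helper: trapezoid-rule integral of y over x
trapz <- function(x, y) sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)

# density grid points that lie strictly between the two quantiles
inside <- d$x > q[1] & d$x < q[2]

# fraction of the estimated density covered by the filled region
central_share <- trapz(d$x[inside], d$y[inside]) / trapz(d$x, d$y)
central_share
```

The share should come out close to 0.95. It is not exact, because kernel smoothing widens the distribution slightly and the grid points rarely land exactly on the quantiles.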

Just make a selection first. Proof of concept:
df1 <- data.frame(x=1, y=rnorm(10 ^ 5))
df2 <- subset(df1, y > quantile(df1$y, 0.025) & y < quantile(df1$y, 0.975))
ggplot(mapping = aes(x = x, y = y)) +
  geom_violin(data = df1, aes(fill = '100%'), color = NA) +
  geom_violin(data = df2, aes(fill = '95%'), color = 'black') +
  theme_classic() +
  scale_fill_grey(name = 'level')

@Heroka gave a great answer. Here is a more general function, based on that answer, that fills the violin plot according to arbitrary ranges (not just quantiles).
violincol <- function(x, from=-Inf, to=Inf, col='grey'){
  d <- density(x)
  right <- data.frame(x=d$y, y=d$x) #note flip of x and y, prevents coord_flip later
  #TRUE if x lies in the half-open range (r[1], r[2]]
  whichrange <- function(r,x){x <= r[2] & x > r[1]}
  ranges <- cbind(from,to)
  #colour of the range each density point falls in (NA if none)
  right$col <- sapply(right$y,function(y){
    id <- apply(ranges,1,whichrange,y)
    if(all(id==FALSE)) NA else col[which(id)]
  })
  left <- right[nrow(right):1,]
  left$x <- 0 - left$x
  dat <- rbind(right,left)
  p <- ggplot(dat, aes(x=x,y=y)) +
    geom_polygon(data=dat,aes(fill=col),show.legend = FALSE)+
    geom_path()+
    scale_fill_identity() #use the colours in 'col' as-is, so each range keeps its own colour
  return(p)
}
x <- rnorm(10^5)
violincol(x=x)
violincol(x=x,from=c(-Inf,0),to=c(0,Inf),col=c('green','red'))
r <- seq(-5,5,0.5)
violincol(x=x,from=r,to=r+0.5,col=rainbow(length(r)))
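The range lookup inside violincol can also be factored out into a small stand-alone helper, which makes the half-open (from, to] rule easy to check on its own. The sketch below is only one way this might be done; assign_col is a hypothetical name, and where ranges overlap it lets the last matching range win.

```r
# hypothetical helper: colour for each value of y, NA where no range matches
assign_col <- function(y, from, to, col) {
  out <- rep(NA_character_, length(y))
  for (i in seq_along(from)) {
    hit <- y > from[i] & y <= to[i]  # half-open range (from, to]
    out[hit] <- col[i]
  }
  out
}

cols <- assign_col(c(-1, 0, 0.5, 2),
                   from = c(-Inf, 0), to = c(0, Inf),
                   col = c("green", "red"))
cols
```

Because the upper bound is inclusive, the value 0 is assigned to the first range here.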

Related

add decile breaks to stat_function

I would like to add decile breaks to the following plot.
ggplot(data = data.frame(x = c(0, 1)), aes(x)) +
  stat_function(fun = dexp, n = 101,
                args = list(rate=3)
  )
As an example of what I would like to achieve, here is another plot. I would like to find a way to stick with stat_function, though, since it lets me easily change the underlying distribution and its parameters, and the resulting curve is always smoother.
vals <-rexp(1001,rate=3)
dens <- density(vals)
plot(dens)
df <- data.frame(x=dens$x, y=dens$y)
probs <-seq(0,1,0.1)
quantiles <- quantile(vals, prob=probs)
df$quant <- factor(findInterval(df$x,quantiles))
ggplot(df, aes(x,y)) + geom_line() + geom_ribbon(aes(ymin=0, ymax=y, fill=quant)) +
scale_x_continuous(breaks=quantiles) + scale_fill_brewer(guide="none")
This is a bit hacky: it uses pexp with some rounding to work out which decile each x location falls in. It doesn't use stat_function, but it has the same advantages you mention.
library(ggplot2)
# Set up range and number of x points
x <- seq(0, 1, length.out=1e4)
# Create x and y positions with dexp
df <- data.frame(x=x, y=dexp(x,rate=3))
# Identify the decile of each x position using pexp
df$quant <- scales::number(floor(10*pexp(df$x,rate=3))*10, suffix="%")
# Find x positions for breaks (smallest x in each decile)
xpos <- aggregate(x~quant, data=df, FUN=min)
# Plot
ggplot(df, aes(x,y)) +
  geom_line() +
  geom_ribbon(aes(ymin=0, ymax=y, fill=quant)) +
  scale_x_continuous(breaks=xpos$x, labels=xpos$quant)
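If only the break positions are needed, they can also be taken directly from the quantile function instead of being searched for on a grid. The following is a minimal sketch under the same assumption as above (an exponential distribution with rate 3); the variable names are illustrative.

```r
# decile probabilities 10%, 20%, ..., 90%
pct <- seq(10, 90, by = 10)
probs <- pct / 100

# exact x positions of the decile breaks for Exp(rate = 3)
breaks <- qexp(probs, rate = 3)
labels <- paste0(pct, "%")

data.frame(label = labels, x = breaks)
```

For a different distribution, qexp would be swapped for the matching quantile function (for example qnorm or qgamma), and the resulting breaks and labels could be passed to scale_x_continuous.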

Coloring lines in tanglegram based on position of nodes

I am creating tanglegrams with the following code:
library(ggtree)
library(ape)
library(dplyr) # for bind_rows(), filter() and %>%
tree1 <- read.tree(text='(((A:4.2,B:4.2):3.1,C:7.3):6.3,D:13.6);')
tree2 <- read.tree(text='(((B:4.2,A:4.2):3.1,C:7.3):6.3,D:13.6);')
p1 <- ggtree(tree1)
p2 <- ggtree(tree2)
d1 <- p1$data
d2 <- p2$data
# mirror the second tree and shift it to the right of the first
d2$x <- max(d2$x) - d2$x + max(d1$x) + 1
pp <- p1 + geom_tree(data=d2)
dd <- bind_rows(d1, d2) %>%
  filter(!is.na(label))
final_plot <- pp + geom_line(aes(x, y, group=label), data=dd, color='grey')
What I want to do is to color the lines based on the position of the nodes. In other words, if the line is straight, meaning that they have the same position in both trees, the color should be x, while if they have changed, it should be y.
Something like this:
It would also be nice to get a legend for this to explain the colors.
You can construct a column in dd that checks whether the line will be horizontal. Here I grouped by label and checked whether the number of unique node ids is 1. Then map that column to the color aesthetic in the aes() of the line.
dd <- dd %>% group_by(label) %>% mutate(is.horiz = n_distinct(node) == 1)
pp +
  geom_line(aes(x, y, group=label, color = is.horiz), data=dd) +
  scale_color_manual(values = c('TRUE' = "lightblue", 'FALSE' = "purple")) +
  theme(legend.position = c(.9,.9)) +
  labs(color = 'Horizontal Nodes')
You can play around with the colors of the lines and the names of everything.
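The check above compares node ids. Since a connecting line is horizontal exactly when both of its end points share the same y value, it may be more direct to compare y instead. Below is a minimal base-R sketch on a toy data frame; dd_toy is hypothetical and only mimics the label and y columns of dd for the two example trees.

```r
# toy stand-in for dd: tip labels and their y positions in each tree
dd_toy <- data.frame(
  label = c("A", "B", "C", "D", "A", "B", "C", "D"),
  y     = c(1, 2, 3, 4, 2, 1, 3, 4)
)

# per label: TRUE if the tip sits at the same height in both trees
same_y <- vapply(split(dd_toy$y, dd_toy$label),
                 function(v) length(unique(v)) == 1,
                 logical(1))

# copy the per-label result back onto every row
dd_toy$is.horiz <- unname(same_y[dd_toy$label])
dd_toy
```

With dplyr, the equivalent would presumably be mutate(is.horiz = n_distinct(y) == 1) after group_by(label).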

2D heatmap of mean values with R

I'm trying to plot something like this, where alpha is intended to be a mean of dat$c per bin:
library(ggplot2)
set.seed(1)
dat <- data.frame(a = rnorm(1000), b = rnorm(1000), c = 1/rnorm(1000),
                  d = as.factor(sample(c(0, 1), 1000, replace=TRUE)))
# plot
p <- ggplot(dat, environment = environment()) +
  geom_bin2d(aes(x=a, y=b, alpha=c, fill=d),
             binwidth = c(1.0/10, 1.0/10))
but it doesn't look like alpha is correct. Please help
I'm not sure what you're expecting to see, but this will calculate mean(dat$c) in each bin and plot the result.
library(ggplot2)
brks <- seq(-5,5,0.1)
lbls <- brks[-1]-0.05
gg.df <- aggregate(c~cut(a, brks, lbls)+cut(b,brks, lbls)+d,dat,FUN=mean)
names(gg.df)[1:2] <- c("a","b")
gg.df$a <- as.numeric(as.character(gg.df$a))
gg.df$b <- as.numeric(as.character(gg.df$b))
ggplot(gg.df, aes(x=a, y=b, alpha=c, fill=d)) + geom_raster() + coord_fixed()
Edit: Response to OP's comment.
You could try:
dat$c <- with(dat,1/(a^2+b^2))
This makes dat$c inversely proportional to the squared radius (the squared distance from (0,0) to the point). Now running the same code as above:
gg.df <- aggregate(c~cut(a, brks, lbls)+cut(b,brks, lbls)+d,dat,FUN=mean)
names(gg.df)[1:2] <- c("a","b")
gg.df$a <- as.numeric(as.character(gg.df$a))
gg.df$b <- as.numeric(as.character(gg.df$b))
ggplot(gg.df, aes(x=a, y=b, alpha=c, fill=d)) + geom_raster() + coord_fixed() +
  scale_alpha_continuous(trans="log",breaks=10^(0:3))
Produces this, as expected: a plot having tiles with higher alpha (less transparent) near the center.
I needed to use a log scale for alpha because the values range over several orders of magnitude.
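To see what the cut and aggregate step does, it can help to run it on a handful of values in one dimension. This is a toy sketch with made-up numbers, not the data from the question; toy and binned are hypothetical names.

```r
# four points along one axis, with a value c attached to each
toy <- data.frame(a = c(0.01, 0.04, 0.12, 0.18),
                  c = c(1, 3, 10, 20))

brks <- seq(0, 0.2, by = 0.1)  # bin edges: (0, 0.1], (0.1, 0.2]
lbls <- brks[-1] - 0.05        # bin centres, used as factor labels

# mean of c within each bin of a
binned <- aggregate(c ~ cut(a, brks, lbls), data = toy, FUN = mean)
names(binned)[1] <- "a"

# the labels come back as a factor; convert them to numeric bin centres
binned$a <- as.numeric(as.character(binned$a))
binned
```

The first two points fall in the first bin and the last two in the second, so each row of binned holds a bin centre and the mean of c for that bin.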

ggplot2 Scatter Plot Labels

I'm trying to use ggplot2 to create and label a scatterplot. Both variables are scaled, so the horizontal and vertical axes are in units of standard deviation (1, 2, 3, 4, etc. from the mean). I would like to label ONLY those points that lie beyond a certain number of standard deviations from the mean. Ideally, the label text would come from another column of the data.
Is there a way to do this?
I've looked through the online manual, but I haven't been able to find anything about defining labels for plotted data.
Help is appreciated!
Thanks!
BEB
Use subsetting:
library(ggplot2)
x <- data.frame(a=1:10, b=rnorm(10))
x$lab <- letters[1:10]
ggplot(data=x, aes(a, b, label=lab)) +
geom_point() +
geom_text(data = subset(x, abs(b) > 0.2), vjust=0)
The labeling can be done in the following way:
library("ggplot2")
x <- data.frame(a=1:10, b=rnorm(10))
x$lab <- rep("", 10) # create empty labels
x$lab[c(1,3,4,5)] <- LETTERS[1:4] # some labels
ggplot(data=x, aes(x=a, y=b, label=lab)) + geom_point() + geom_text(vjust=0)
Subsetting outside of the ggplot function:
library(ggplot2)
set.seed(1)
x <- data.frame(a = 1:10, b = rnorm(10))
x$lab <- letters[1:10]
x$lab[!(abs(x$b) > 0.5)] <- NA
ggplot(data = x, aes(a, b, label = lab)) +
geom_point() +
geom_text(vjust = 0)
Using qplot:
qplot(a, b, data = x, label = lab, geom = c('point','text'))
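The answers above use fixed cut-offs on the raw values. To tie the labels to standard deviations, as the question asks, the columns can be standardised first. The following is a minimal sketch; the data frame pts, the name column and the threshold k are all hypothetical.

```r
set.seed(1)
pts <- data.frame(name = letters[1:10], a = rnorm(10), b = rnorm(10))

# standardise both variables to z-scores (units of standard deviation)
pts$a_z <- as.numeric(scale(pts$a))
pts$b_z <- as.numeric(scale(pts$b))

k <- 1.5  # hypothetical threshold, in standard deviations

# points lying more than k standard deviations from the mean on either axis
far <- abs(pts$a_z) > k | abs(pts$b_z) > k

# label text comes from another column; everything else gets NA
pts$lab <- ifelse(far, pts$name, NA)
pts[far, c("name", "a_z", "b_z", "lab")]
```

The lab column can then be used as the label aesthetic in the same way as in the answers above, for example with geom_text(vjust = 0, na.rm = TRUE) so that the NA labels are dropped quietly.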
