Split violin plot with points on top to indicate info - r

This is a follow-up to the posts here and here.
I have successfully implemented the split violin plot in ggplot2 for my data (two median estimator densities, for two cases) that need to be compared. Now I would like to add confidence intervals, and I am following the code posted in the links above:
EDIT: A reproducible example
tmp <- rnorm(1000,0,1)
tmp.2 <- rnorm(1000,0,1)
x.1 <- density(tmp)
y.1 <- density(tmp.2)
Here, I am computing the densities and extracting the (x, y) pairs. Then I compute the quantiles:
# Make densities
densities <- as.data.frame(c(x.1$x,y.1$x))
colnames(densities) <- "loc"
densities$dens <- c(x.1$y,y.1$y)
densities$drop_case <- c(rep("B",512),rep("S",512))
densities$dens <- ifelse(densities$drop_case=="B",densities$dens*-1,densities$dens)
densities$dens <- ifelse(densities$drop_case=="S",densities$dens*1,densities$dens)
conf <- as.data.frame(c(quantile(tmp,c(0.025,0.975))[1],quantile(tmp,c(0.025,0.975))[2],quantile(tmp.2,c(0.025,0.975))[1],quantile(tmp.2,c(0.025,0.975))[2]))
colnames(conf) <- "intervals"
conf$drop_case <- c(rep("B",2),rep("S",2))
conf$length <- rep(1000,4)
Now I am trying to extract the values inside the densities, as was noted in the linked posts:
# Find data points in densities
val.tmp <- rep(0,4)
val.tmp.2 <- rep(0,4)
for (i in 1:4) {
x.here <- densities$loc
y.here <- densities$dens
your.number<- conf$intervals[i]
pos.tmp <- which(abs(x.here-your.number)==min(abs(x.here-your.number)))
val.tmp[i] <- x.here[pos.tmp]
val.tmp.2[i] <- y.here[pos.tmp]
}
conf$positions <- val.tmp
conf$length <- val.tmp.2
conf$length <- ifelse(conf$drop_case=="B",conf$length*-1,conf$length)
conf$length <- ifelse(conf$drop_case=="S",conf$length*1,conf$length)
ggplot(densities,aes(dens, loc, fill = factor(drop_case)))+
geom_polygon()+
scale_x_continuous(breaks = 0, name = info$Name)+
ylab('Estimator Density') +
theme(axis.title.x = element_blank())+
geom_point(data = conf, aes(x = positions, y = length, fill = factor(drop_case), group = factor(drop_case))
,shape = 21, colour = "black", show.legend = FALSE)
Unfortunately, the result is that the points are not mapped onto the densities but are instead scattered across the plane.

There are a few small mistakes in the code. First, within the for loop, you can't set x.here and y.here to all of the density and location values, since that includes both groups. Second, since the signs are already flipped in densities, there is no need for the ifelse statements on conf$length afterwards. Third, of each pair of ifelse statements you would only need the first anyway, since the second multiplies by 1 and does nothing. Finally, you had the x and y mappings in geom_point the wrong way around!
There are other things one could change to make the code more readable, but I'm short on time, so I'll leave them as they are.
Below is the full adjusted code:
tmp <- rnorm(1000,0,1)
tmp.2 <- rnorm(1000,0,1)
x.1 <- density(tmp)
y.1 <- density(tmp.2)
# Make densities
densities <- as.data.frame(c(x.1$x,y.1$x))
colnames(densities) <- "loc"
densities$dens <- c(x.1$y,y.1$y)
densities$drop_case <- c(rep("B",512),rep("S",512))
densities$dens <- ifelse(densities$drop_case=="B",densities$dens*-1,densities$dens)
conf <- as.data.frame(c(quantile(tmp,c(0.025,0.975)), quantile(tmp.2,c(0.025,0.975))))
colnames(conf) <- "intervals"
conf$drop_case <- c(rep("B",2),rep("S",2))
conf$length <- rep(1000,4)
val.tmp <- rep(0,4)
val.tmp.2 <- rep(0,4)
for (i in 1:4) {
x.here <- densities$loc[densities$drop_case == conf$drop_case[i]]
y.here <- densities$dens[densities$drop_case == conf$drop_case[i]]
your.number<- conf$intervals[i]
pos.tmp <- which(abs(x.here-your.number)==min(abs(x.here-your.number)))
val.tmp[i] <- x.here[pos.tmp]
val.tmp.2[i] <- y.here[pos.tmp]
}
conf$positions <- val.tmp
conf$length <- val.tmp.2
ggplot(densities, aes(dens, loc, fill = drop_case)) +
geom_polygon()+
ylab('Estimator Density') +
theme(axis.title.x = element_blank())+
geom_point(data = conf, aes(x = length, y = positions, fill = drop_case),
shape = 21, colour = "black", show.legend = FALSE)
This results in:
I would personally prefer a plot with line segments:
ggplot(densities, aes(dens, loc, fill = factor(drop_case)))+
geom_polygon()+
ylab('Estimator Density') +
theme(axis.title.x = element_blank())+
geom_segment(data = conf, aes(x = length, xend = 0, y = positions, yend = positions))
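As an aside, the nearest-grid-point search in the loop could probably be replaced by linear interpolation with approx(). This is only a sketch under my own assumptions (one group shown, and the names q, height and conf_b are made up for illustration), not part of the adjusted code above:

```r
set.seed(42)
tmp <- rnorm(1000, 0, 1)
d <- density(tmp)                      # 512 grid points by default
q <- quantile(tmp, c(0.025, 0.975))    # interval end points for this group

# Interpolate the density curve at the two quantiles instead of
# searching for the closest grid point
height <- approx(d$x, d$y, xout = q)$y

# Same layout as conf above; negate the height because the "B" half
# is drawn on the negative side
conf_b <- data.frame(positions = unname(q),
                     length    = -height,
                     drop_case = "B")
conf_b
```

The same call with tmp.2 (and without the sign flip) would give the rows for the "S" half.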

Related

How to highlight a single value on a continuous color scale in ggplot2

A question posted here shows how to declare some of the values missing. I have a similar problem, except I wish to highlight a single value with a different color, e.g. mpg = 20. Ideally, I would like it to show up on the legend as well.
To be clear, I wish to highlight a specific value on the gradient.
I am reusing the code that was used in the other post to seed the effort. This code specifies the lower limit of the data but does not allow for an arbitrarily chosen value.
I was wondering if people know how to do this with or without using something like scale_colour_gradientn.
library(ggplot2)
dat <- head(mtcars)
dat$model <- head(colnames(mtcars))
dat$is_low <- ifelse(dat$mpg < 20, TRUE, FALSE)
ggplot(dat, aes(x = model, y = mpg, fill = mpg)) +
geom_col() +
scale_fill_continuous(limits=c(20,max(dat$mpg)))
This is adapted from the answer I gave here, but it requires some messing around with the palette.
This is a custom palette function that replaces the colours for values between the target values with replace_colour, but it requires knowing the range of the data first. Note that the function isn't very user friendly, but it does the job.
library(ggplot2)
library(scales)
my_palette <- function(colours, target = c(20.5, 21.5),
range = base::range(target), values = NULL,  # base:: avoids a recursive default-argument error
replace_colour = "green") {
target <- (target - range[1]) / diff(range)
ramp <- scales::colour_ramp(colours)
force(values)
function(x) {
# Decide what values to replace
replace <- x > target[1] & x < target[2]
if (length(x) == 0)
return(character())
if (!is.null(values)) {
xs <- seq(0, 1, length.out = length(values))
f <- stats::approxfun(values, xs)
x <- f(x)
}
out <- ramp(x)
# Actually replace values
out[replace] <- replace_colour
out
}
}
You can then use that function with a custom scale as follows. I chose to highlight around 21 because 20 doesn't occur in dat$mpg.
dat <- head(mtcars)
dat$model <- head(colnames(mtcars))
dat$is_low <- ifelse(dat$mpg < 20, TRUE, FALSE)
colours <- seq_gradient_pal("#132B43", "#56B1F7")(seq(0, 1, length.out = 12))
ggplot(dat, aes(x = model, y = mpg, fill = mpg)) +
geom_col() +
continuous_scale(
"fill", "my_pal",
my_palette(colours, range = range(dat$mpg), target = c(20.9, 21.1)),
guide = guide_colourbar(nbin = 500) # Give guide plenty bins
)
Created on 2021-04-13 by the reprex package (v1.0.0)
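To make the rescaling step concrete, here is a small sketch of the arithmetic, assuming (as I understand it) that the scale hands the palette values already rescaled to [0, 1]. The names rng, target01 and mpg01 are mine and only for illustration:

```r
dat <- head(mtcars)
rng <- range(dat$mpg)                        # 18.1 22.8
target <- c(20.9, 21.1)

# Same rescaling that my_palette() applies to `target`
target01 <- (target - rng[1]) / diff(rng)

# What a value of mpg looks like after rescaling to [0, 1]
mpg01 <- (dat$mpg - rng[1]) / diff(rng)

# Values strictly inside the rescaled target window get replace_colour
highlighted <- mpg01 > target01[1] & mpg01 < target01[2]
dat$mpg[highlighted]                         # 21 21
```

Only the two cars with mpg = 21 fall inside the window, which is why the range of the data has to be passed in along with the target.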
Applying this to log scaled values requires you to log scale all the input data to my_palette too.
dat <- head(mtcars)
dat$model <- head(colnames(mtcars))
dat$mpg <- c(1e-6, 1e-4, 1e-2, 1e0, 1e2, 1e4)
colours <- seq_gradient_pal("#132B43", "#56B1F7")(seq(0, 1, length.out = 12))
ggplot(dat, aes(x = model, y = mpg, fill = mpg)) +
geom_col() +
scale_y_log10() +
continuous_scale(
"fill", "my_pal", trans = "log10",
my_palette(colours, range = log10(range(dat$mpg)),
target = log10(1e2) * c(0.9, 1.1)),
guide = guide_colourbar(nbin = 500) # Give guide plenty bins
)

Creating a nested CDF that doesn't reach 1

Here is a workable example of the data I wish to plot:
set.seed(123)
x <- rweibull(n = 2000, shape = 2, scale = 10)
x <- round(x, digits = 0)
x <- sort(x, decreasing = FALSE)
y <- c(rep(0.1, times = 500),rep(0.25, times = 500),rep(0.4, times = 500),rep(0.85, times = 500))
z <- rbinom(n=2000, size=1, prob=y)
df1 <- data.frame(x,z)
I want to plot the overall frequency of z across x.
Unlike a typical CDF, the function should not reach 1.0 but should instead top out at
sum(df1$z)/length(df1$z)
which gives a ymax of 0.36 (721/2000).
Using ggplot2, we can create a CDF of x with the following command:
library(ggplot2)
ggplot(df1, aes(x)) + stat_ecdf()
But I want to extend this plot to show the cumulative percentage of z (as a function of x).
The end result should look like this:
EDIT
With some very crude data manipulation I am able to generate something similar to a CDF plot, but there must be a cleaner and easier method using ggplot and other packages:
mytable <- table(df1$x, df1$z)
mydf <- as.data.frame.matrix(mytable)
colnames(mydf) <- c("z_no", "z_yes")
mydf$A <- 1:length(mydf$z_no)
mydf$sum <- cumsum(mydf$z_yes)
mydf$dis <- mydf$sum/length(z)
plot(mydf$A, mydf$dis)
You can use the package dplyr to process the data as follows:
library(dplyr)
plot_data <- group_by(df1, x) %>%
summarise(z_num = sum(z)) %>%
mutate(cum_perc_z = cumsum(z_num)/nrow(df1))
This gives the same result as the data processing that you describe in your edit. Note, however, that I get sum(df1$z) = 796 and the maximal y value is thus 796/2000 = 0.398.
For the plot, you can use geom_step() to have a step function and add the horizontal line with geom_hline():
ggplot(plot_data, aes(x = x, y = cum_perc_z)) +
geom_step(colour = "red", size = 0.8) +
geom_hline(yintercept = max(plot_data$cum_perc_z))
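If you would rather not depend on dplyr, the same cumulative quantity can be sketched in base R. This is just an illustration with names I made up (z_by_x, plateau); it also shows why the curve levels off at mean(df1$z) rather than at 1:

```r
set.seed(123)
x <- sort(round(rweibull(n = 2000, shape = 2, scale = 10)))
y <- rep(c(0.1, 0.25, 0.4, 0.85), each = 500)
z <- rbinom(n = 2000, size = 1, prob = y)
df1 <- data.frame(x, z)

# Number of z == 1 cases at each distinct x, in increasing order of x
z_by_x <- tapply(df1$z, df1$x, sum)

# Running total as a share of ALL rows, so the last value is sum(z)/n
cum_perc_z <- cumsum(z_by_x) / nrow(df1)
plateau <- unname(cum_perc_z[length(cum_perc_z)])

all.equal(plateau, mean(df1$z))
plot(as.numeric(names(cum_perc_z)), cum_perc_z, type = "s",
     xlab = "x", ylab = "cumulative share of z")
abline(h = plateau)
```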

Custom scatterplot matrix using facet_grid in ggplot2

I'm trying to write a custom scatterplot matrix function in ggplot2 using facet_grid. My data have two categorical variables and one numeric variable.
I'd like to facet (make the scatterplot rows/cols) according to one of the categorical variables and change the plotting symbol according to the other categorical.
I do so by first constructing a larger dataset that includes all combinations (combs) of the categorical variable from which I'm creating the scatterplot panels.
My questions are:
How to use geom_rect to white-out the diagonal and upper panels in facet_grid (I can only make the middle ones black so far)?
How can you move the titles of the facets to the bottom and left hand sides respectively?
How does one remove tick axes and labels for the top left and bottom right facets?
Thanks in advance.
require(ggplot2)
# Data
nC <- 5
nM <- 4
dat <- data.frame(
Control = rep(LETTERS[1:nC], nM),
measure = rep(letters[1:nM], each = nC),
value = runif(nC*nM))
# Change factors to characters
dat <- within(dat, {
Control <- as.character(Control)
measure <- as.character(measure)
})
# Check, lapply(dat, class)
# Define scatterplot() function
scatterplotmatrix <- function(data,...){
controls <- with(data, unique(Control))
measures <- with(data, unique(measure))
combs <- expand.grid(1:length(controls), 1:length(measures), 1:length(measures))
# Add columns for values
combs$value1 = 1
combs$value2 = 0
for ( i in 1:NROW(combs)){
combs[i, "value1"] <- subset(data, subset = Control==controls[combs[i,1]] & measure == measures[combs[i,2]], select = value)
combs[i, "value2"] <- subset(data, subset = Control==controls[combs[i,1]] & measure == measures[combs[i,3]], select = value)
}
for ( i in 1:NROW(combs)){
combs[i,"Control"] <- controls[combs[i,1]]
combs[i,"Measure1"] <- measures[combs[i,2]]
combs[i,"Measure2"] <- measures[combs[i,3]]
}
# Final pairs plot
plt <- ggplot(combs, aes(x = value1, y = value2, shape = Control)) +
geom_point(size = 8, colour = "#F8766D") +
facet_grid(Measure2 ~ Measure1) +
ylab("") +
xlab("") +
scale_x_continuous(breaks = c(0,0.5,1), labels = c("0", "0.5", "1"), limits = c(-0.05, 1.05)) +
scale_y_continuous(breaks = c(0,0.5,1), labels = c("0", "0.5", "1"), limits = c(-0.05, 1.05)) +
geom_rect(data = subset(combs, subset = Measure1 == Measure2), colour='white', xmin = -Inf, xmax = Inf,ymin = -Inf,ymax = Inf)
return(plt)
}
# Call
plt1 <- scatterplotmatrix(dat)
plt1
I'm not aware of a way to move the panel strips (the labels) to the bottom or left. Also, it's not possible to format the individual panels separately (e.g., turn off the tick marks for just one facet). So if you really need these features, you will probably have to use something other than, or in addition to ggplot. You should really look into GGally, although I've never had much success with it.
As far as leaving some of the panels blank, here is a way.
nC <- 5; nM <- 4
set.seed(1) # for reproducible example
dat <- data.frame(Control = rep(LETTERS[1:nC], nM),
measure = rep(letters[1:nM], each = nC),
value = runif(nC*nM))
scatterplotmatrix <- function(data,...){
require(ggplot2)
require(data.table)
require(plyr) # for .(...)
DT <- data.table(data,key="Control")
gg <- DT[DT,allow.cartesian=T]
setnames(gg,c("Control","H","x","V","y"))
fmt <- function(x) format(x,nsmall=1)
plt <- ggplot(gg, aes(x,y,shape = Control)) +
geom_point(subset=.(as.numeric(H)<as.numeric(V)),size=5, colour="#F8766D") +
facet_grid(V ~ H) +
ylab("") + xlab("") +
scale_x_continuous(breaks=c(0,0.5,1), labels=fmt, limits=c(-0.05, 1.05)) +
scale_y_continuous(breaks=c(0,0.5,1), labels=fmt, limits=c(-0.05, 1.05))
return(plt)
}
scatterplotmatrix(dat)
The main feature of this is the use of subset=.(as.numeric(H)<as.numeric(V)) in the call to geom_point(...). This subsets the dataset so you only get a point layer when the condition is met, i.e. in facets where as.numeric(H)<as.numeric(V). This works because I've left the H and V columns as factors, and as.numeric(...) operating on a factor returns the integer level codes, not the labels.
The rest is just a more compact (and much faster) way of creating what you called comb.
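A caveat for readers on newer versions: as far as I know, the subset= argument to layers was removed in ggplot2 2.0.0, and since R 4.0.0 data.frame() no longer converts strings to factors by default. Below is a minimal sketch of the same idea under those assumptions. It swaps in base merge() for the data.table self-join (my substitution, not the method above) and passes a function as the layer's data; the helper name lower_panels is made up:

```r
library(ggplot2)

nC <- 5; nM <- 4
set.seed(1)
dat <- data.frame(Control = rep(LETTERS[1:nC], nM),
                  measure = rep(letters[1:nM], each = nC),
                  value   = runif(nC * nM))

# Self-join on Control: every pair of measures within each control
gg <- merge(dat, dat, by = "Control", suffixes = c(".h", ".v"))
names(gg) <- c("Control", "H", "x", "V", "y")

# Make H and V factors explicitly so as.numeric() returns level codes
gg$H <- factor(gg$H, levels = letters[1:nM])
gg$V <- factor(gg$V, levels = letters[1:nM])

# Keep only rows belonging to panels below the diagonal
lower_panels <- function(d) d[as.numeric(d$H) < as.numeric(d$V), ]

fmt <- function(x) format(x, nsmall = 1)
plt <- ggplot(gg, aes(x, y, shape = Control)) +
  geom_point(data = lower_panels, size = 5, colour = "#F8766D") +
  facet_grid(V ~ H) +
  ylab("") + xlab("") +
  scale_x_continuous(breaks = c(0, 0.5, 1), labels = fmt, limits = c(-0.05, 1.05)) +
  scale_y_continuous(breaks = c(0, 0.5, 1), labels = fmt, limits = c(-0.05, 1.05))
plt
```

Because the plot-level data still contains every H/V combination, all panels are laid out, but only the lower ones receive points.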

ggplot axis order (factor) changes when using last_plot()

I've been able to successfully create a dotplot in ggplot for percentages across gender. But I want to highlight the significant differences. I thought I could do this with a combination of subsetting and the use of last_plot().
Here’s my data:
require(ggplot2)
require(reshape2)
prog <- c("Honors", "Academic", "Social", "Media")
m <- c(30,35,40,23)
f <- c(25,40,45,15)
s <- c(0.7, 0.4, 0.1, 0.03)
temp <- as.data.frame(cbind(prog, m, f, s), stringsAsFactors=FALSE)
first <- temp[,1:3]
first.melt <- melt(first, id.vars = 'prog', variable.name = 'Gender', value.name = 'Percent')
first.melt <- as.data.frame(cbind(first.melt,temp[,4]), , stringsAsFactors=FALSE)
names(first.melt) <- c("program", "Gender", "Percent", "sig")
first.melt$program <- as.factor(first.melt$program)
Here’s where I reverse the order of my Program variable, so that when graphed it will be alphabetical from top to bottom.
first.melt[,1] = with(first.melt, factor(first.melt[,1], levels = rev(levels(first.melt[,1]))))
first.melt$sig <- as.numeric(as.character(first.melt$sig))
first.melt$Percent <- as.numeric(as.character(first.melt$Percent))
Now, I subset...
first.melt.ns <- subset(first.melt,sig > 0.05)
first.melt.sig <- subset(first.melt,sig <= 0.05)
ggplot(first.melt.ns, aes(program, y=Percent, shape=Gender)) +
geom_point(size=3) +
coord_flip() +
scale_shape_manual(values=c("m"=1, "f"=5))
The first run at ggplot gets me my non-significant Program pairs, and they're in the right order, so I add the two new points for male and female (making them solid, to draw attention to the significant pair):
last_plot() +
geom_point(data=first.melt.sig, aes(program[Gender=="m"], y=Percent[Gender=="m"]), size=3, shape=19) +
geom_point(data=first.melt.sig, aes(program[Gender=="f"], y=Percent[Gender=="f"]),size=4, shape=18)
The points get added just fine – ggplot works. But notice my Program axis – it’s correct, but reversed now.
First, you really should avoid as.data.frame(cbind(...)). It is dramatically increasing the amount of work necessary to prepare your data. The function for creating data frames is (naturally) data.frame. Use it!
What you're doing here is basically trying to get around the limitation of only having one shape scale. It's probably easiest to just do this:
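temp <- data.frame(prog, m, f, s, stringsAsFactors = TRUE)  # prog must be a factor for levels() below (not the default since R 4.0.0)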
first <- temp[,1:3]
first.melt <- melt(first, id.vars = 'prog', variable.name = 'Gender', value.name = 'Percent')
first.melt$sig <- rep(temp$s,times = 2)
first.melt[,1] = with(first.melt, factor(first.melt[,1], levels = rev(levels(first.melt[,1]))))
first.melt.sig <- subset(first.melt,sig < 0.05)
first.melt$Percent[first.melt$sig < 0.05] <- NA
ggplot() +
geom_point(data = first.melt,aes(x = prog,y = Percent,shape = Gender),size = 3) +
geom_point(data = first.melt.sig[1,],aes(x = prog,y = Percent),shape = 19) +
geom_point(data = first.melt.sig[2,],aes(x = prog,y = Percent),shape = 18) +
coord_flip() +
scale_shape_manual(values=c("m"=1, "f"=5))
In general, work to structure your ggplot code so that you're subsetting data frames, not variables inside of aes. That gets both tricky and dangerous, because ggplot is assuming certain things about what you pass inside of aes in order for the evaluation to work properly.
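A small illustration of why subsetting the data frame is the safer route: subset() keeps the full set of factor levels in their existing order, so a layer built from the subset can share the axis order of the full data. This is only a sketch with made-up names (prog_rev, full, sig_rows), and it does not try to diagnose exactly why the axis flipped in the question:

```r
prog <- factor(c("Honors", "Academic", "Social", "Media"))

# Reverse the (alphabetical) level order, as in the question
prog_rev <- factor(prog, levels = rev(levels(prog)))
levels(prog_rev)
# "Social" "Media" "Honors" "Academic"

full <- data.frame(program = prog_rev, sig = c(0.7, 0.4, 0.1, 0.03))
sig_rows <- subset(full, sig <= 0.05)

# Only one row is left, but all four levels survive in the same order
as.character(sig_rows$program)   # "Media"
levels(sig_rows$program)
# "Social" "Media" "Honors" "Academic"
```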

How to produce a meaningful draftsman/correlation plot for discrete values

One of my favorite tools for exploratory analysis is pairs(), however in the case of a limited number of discrete values, it falls flat as the dots all align perfectly. Consider the following:
y <- t(rmultinom(n=1000,size=4,prob=rep(.25,4)))
pairs(y)
It doesn't really give a good sense of correlation. Is there an alternative plot style that would?
If you change y to a data.frame you can add some 'jitter' and with the col option you can set the transparency level (the 4th number in rgb):
y <- data.frame(y)
pairs(sapply(y,jitter), col = rgb(0,0,0,.2))
Or you could use ggplot2's plotmatrix:
library(ggplot2)
plotmatrix(y) + geom_jitter(alpha = .2)
Edit: Since plotmatrix in ggplot2 is deprecated, use ggpairs (from the GGally package mentioned in @hadley's comment above):
library(GGally)
ggpairs(y, lower = list(params = c(alpha = .2, position = "jitter")))
Here is an example using corrplot:
library(corrplot)
M <- cor(y)  # correlation matrix of the four count columns
corrplot.mixed(M)
You can find more examples in the intro
http://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html
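For a numeric check alongside the plot, the correlation matrix and a plain contingency table can be inspected directly. This is a sketch, not part of the corrplot example; the comparison value relies on the standard result that two components of a multinomial with equal probabilities p have correlation -p/(1 - p):

```r
set.seed(1)
y <- t(rmultinom(n = 1000, size = 4, prob = rep(.25, 4)))

# Sample correlations between the four count columns
M <- cor(y)
round(M, 2)

# Theoretical off-diagonal correlation for equal cell probabilities
p <- 0.25
-p / (1 - p)   # -1/3

# Counts of each (column 1, column 2) combination: the information
# that overplotting hides in pairs(y)
table(y[, 1], y[, 2])
```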
Here are a couple of options using ggplot2:
library(ggplot2)
## re-arrange data (copied from plotmatrix function)
prep.plot <- function(data) {
grid <- expand.grid(x = 1:ncol(data), y = 1:ncol(data))
grid <- subset(grid, x != y)
all <- do.call("rbind", lapply(1:nrow(grid), function(i) {
xcol <- grid[i, "x"]
ycol <- grid[i, "y"]
data.frame(xvar = names(data)[ycol], yvar = names(data)[xcol],
x = data[, xcol], y = data[, ycol], data)
}))
all$xvar <- factor(all$xvar, levels = names(data))
all$yvar <- factor(all$yvar, levels = names(data))
return(all)
}
dat <- prep.plot(data.frame(y))
## plot with transparent jittered points
ggplot(dat, aes(x = x, y=y)) +
geom_jitter(alpha=.125) +
facet_grid(xvar ~ yvar) +
theme_bw()
## plot with color representing density
ggplot(dat, aes(x = factor(x), y=factor(y))) +
geom_bin2d() +
facet_grid(xvar ~ yvar) +
theme_bw()
I don't have enough credits yet to comment on @Vincent's post. When running
library(GGally)
ggpairs(y, lower = list(params = c(alpha = .2, position = "jitter")))
I get
Error in stop_if_params_exist(obj$params) :
'params' is a deprecated argument. Please 'wrap' the function to supply arguments. help("wrap", package = "GGally")
Based on the indicated help page, it seems the call would need to be written like this:
ydf <- as.data.frame(y)
regularPlot <- ggpairs(ydf, lower = list(continuous = wrap(ggally_points, alpha = .2, position = "jitter")))
regularPlot
