How to add inbetween space in nested boxplots ggplot2 - r

I would like to add a marginal space between groups of box plots while using stat_summary.
Here is a small example of my problem:
library(ggplot2)
library(reshape2)
data1 <- (lapply(letters[1:5], function(l1) return(matrix(rt(5*3, 1), nrow = 5, ncol = 3, dimnames = list(cat2=letters[6:10], cat3=letters[11:13])))))
names(data1) <- letters[1:5]
data2 <- melt(data1)
customstats <- function(x) {
xs <- sort(x)
return(c(ymin=min(x), lower= mean(xs[xs < mean(x)]), middle = mean(x) , upper = mean(xs[xs > mean(x)]), ymax=max(x)))
}
ggplot(data2, aes(x=cat2, y=value, fill=cat3), width=2) +
stat_summary(fun.data = customstats, geom = "boxplot",
alpha = 0.5, position = position_dodge(1), mapping = aes(fill=cat3))
The result is the following picture.
I would like to achieve a visual separation for each "cat2" and add a "space" between the groups of boxplots (I'm restricted to using stat_summary since I have a custom statistic). How can I do it?
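For reference, here is a minimal check of what the custom summary returns. With geom = "boxplot", stat_summary needs the five values ymin, lower, middle, upper and ymax. The input vector below is made up for illustration.

```r
# Same summary function as in the question
customstats <- function(x) {
  xs <- sort(x)
  c(ymin = min(x),
    lower = mean(xs[xs < mean(x)]),   # mean of the values below the overall mean
    middle = mean(x),
    upper = mean(xs[xs > mean(x)]),   # mean of the values above the overall mean
    ymax = max(x))
}

x <- c(1, 2, 3, 4, 10)   # made-up data; the overall mean is 4
res <- customstats(x)
print(res)
```

Note that the "box" here runs between the two conditional means, not between quartiles, which is why a plain geom_boxplot() cannot be used.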

I have fixed a similar problem in an ugly (but effective for me) way: I create a data frame with the same plotting variables as the original data, but with x (or y) values chosen so that they fall between the two groups I want to separate, and with missing values for y (or x). For your problem, I added the following code and got an image with spatial separation of the clusters.
library(plyr)
empties <- data.frame(cat2_orig=unique(data2$cat2)[-length(unique(data2$cat2))])
#no extra space needed between last cluster and edge of plot
empties$cat2 <- paste0(empties$cat2_orig,empties$cat2_orig)
empties$value <- NA
data2_space <- rbind.fill(data2,empties)
ggplot(data2_space, aes(x=cat2, y=value, fill=cat3), width=2) +
stat_summary(fun.data = customstats, geom = "boxplot",
alpha = 0.5, position = position_dodge(1), mapping = aes(fill=cat3)) +
#remove tickmarks for non-interesting points on x-axis
scale_x_discrete(breaks=unique(data2$cat2))
Before & after
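The trick above relies on the doubled labels sorting between the real categories. A minimal base-R sketch of that ordering, using the same category letters as the example (the variable names are illustrative, and it assumes the discrete x scale orders character values alphabetically):

```r
cats <- letters[6:10]                        # the cat2 values: f, g, h, i, j
spacers <- paste0(cats[-length(cats)],       # ff, gg, hh, ii: one spacer after
                  cats[-length(cats)])       # every group except the last

# Alphabetical order interleaves each spacer after its group
x_levels <- sort(c(cats, spacers))
print(x_levels)
```

If the real category names do not sort this conveniently, setting the factor levels explicitly in the desired order should give the same effect.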

Related

R control jitter function - avoid overplotting / non-random jitter

My problem seems simple: I am using ggplot2 with geom_jitter() to plot a variable (take my picture as an example).
Jitter now adds some random noise to the variable (the variable is just called "1" in this example) to prevent overplotting. So I have now random noise in the y-direction and clearly what otherwise would be completely overplotted is now better visible.
But here is my question:
As you can see, there are still some points that overplot each other. In my example, this could easily be prevented if the offsets in the y-direction were placed more strategically instead of being random noise.
Can I somehow alter the geom_jitter() behavior or is there a similar function in ggplot2 that does exactly this?
Not really a minimal example, but also not too long:
library("imputeTS")
library("ggplot2")
data <- tsAirgap
# 2.1 Create required data
# Get all indices of the data that comes directly before and after an NA
na_indx_after <- which(is.na(data[1:(length(data) - 1)])) + 1
# starting from index 2 moves all indexes one in front, so no -1 needed for before
na_indx_before <- which(is.na(data[2:length(data)]))
# Get the actual values to the indices and put them in a data frame with a label
before <- data.frame(id = "1", type = "before", input = na_remove(data[na_indx_before]))
after <- data.frame(id = "1", type = "after", input = na_remove(data[na_indx_after]))
all <- data.frame(id = "1", type = "source", input = na_remove(data))
# Get n values for the plot labels
n_before <- length(before$input)
n_all <- length(all$input)
n_after <- length(after$input)
# 2.4 Create dataframe for ggplot2
# join the data together in one dataframe
df <- rbind(before, after, all)
# Create the plot
gg <- ggplot(data = df) +
geom_jitter(mapping = aes(x = id, y = input, color = type, alpha = type), width = 0.5 , height = 0.5)
gg <- gg + ggplot2::scale_color_manual(
values = c("before" = "skyblue1", "after" = "yellowgreen", "source" = "gray66")
)
gg <- gg + ggplot2::scale_alpha_manual(
values = c("before" = 1, "after" = 1, "source" = 0.3)
)
gg + ggplot2::theme_linedraw() + theme(aspect.ratio = 0.5) + ggplot2::coord_flip()
So many good suggestions... here is what Ben's suggestion would look like for my example:
I changed parts of my code to:
gg <- ggplot(data = df, aes(x = input, color = type, fill = type, alpha = type)) +
geom_dotplot(binwidth = 15)
This basically works as intended for me. ggbeeswarm, as suggested by Jon, also worked great for my purpose.
I thought of a hack I really like, using ggrepel. It's normally used for labels, but nothing prevents you from making the label into a point.
df <- data.frame(x = rnorm(200),
col = sample(LETTERS[1:3], 200, replace = TRUE),
y = 1)
ggplot(df, aes(x, y, label = "●", color = col)) + # using unicode black circle
ggrepel::geom_text_repel(segment.color = NA,
box.padding = 0.01, key_glyph = "point")
A downside of this method is that ggrepel can take a lot of time for a large number of points, and it will recalculate the positions differently each time you change the plot size. A faster alternative is ggbeeswarm::geom_quasirandom, which uses a deterministic process to define jitter that looks random.
ggplot(df, aes(x,y, color = col)) +
ggbeeswarm::geom_quasirandom(groupOnX = FALSE)
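To illustrate what "strategically placed offsets" can mean, here is a rough base-R sketch that stacks points sharing a bin instead of jittering them randomly. It is not how ggrepel or ggbeeswarm work internally; the bin width and variable names are assumptions chosen for the example.

```r
set.seed(42)
x <- rnorm(50)                      # made-up values to spread out

binwidth <- 0.25                    # arbitrary; tune to the point size
bin <- round(x / binwidth)          # points in the same bin would overlap

# Within each bin, give the k points the offsets -(k-1)/2, ..., (k-1)/2
offset <- ave(x, bin, FUN = function(v) seq_along(v) - (length(v) + 1) / 2)

spread <- data.frame(x = x, y = offset)
head(spread)
```

Plotting spread$x against spread$y gives a layout similar in spirit to a dot plot: no two points in the same bin share an offset, and the result is the same on every run.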

Custom scatterplot matrix using facet_grid in ggplot2

I'm trying to write a custom scatterplot matrix function in ggplot2 using facet_grid. My data have two categorical variables and one numeric variable.
I'd like to facet (make the scatterplot rows/cols) according to one of the categorical variables and change the plotting symbol according to the other categorical.
I do so by first constructing a larger dataset that includes all combinations (combs) of the categorical variable from which I'm creating the scatterplot panels.
My questions are:
How to use geom_rect to white-out the diagonal and upper panels in facet_grid (I can only make the middle ones black so far)?
How can you move the titles of the facets to the bottom and left hand sides respectively?
How does one remove tick axes and labels for the top left and bottom right facets?
Thanks in advance.
require(ggplot2)
# Data
nC <- 5
nM <- 4
dat <- data.frame(
Control = rep(LETTERS[1:nC], nM),
measure = rep(letters[1:nM], each = nC),
value = runif(nC*nM))
# Change factors to characters
dat <- within(dat, {
Control <- as.character(Control)
measure <- as.character(measure)
})
# Check, lapply(dat, class)
# Define scatterplot() function
scatterplotmatrix <- function(data,...){
controls <- with(data, unique(Control))
measures <- with(data, unique(measure))
combs <- expand.grid(1:length(controls), 1:length(measures), 1:length(measures))
# Add columns for values
combs$value1 = 1
combs$value2 = 0
for ( i in 1:NROW(combs)){
combs[i, "value1"] <- subset(data, subset = Control==controls[combs[i,1]] & measure == measures[combs[i,2]], select = value)
combs[i, "value2"] <- subset(data, subset = Control==controls[combs[i,1]] & measure == measures[combs[i,3]], select = value)
}
for ( i in 1:NROW(combs)){
combs[i,"Control"] <- controls[combs[i,1]]
combs[i,"Measure1"] <- measures[combs[i,2]]
combs[i,"Measure2"] <- measures[combs[i,3]]
}
# Final pairs plot
plt <- ggplot(combs, aes(x = value1, y = value2, shape = Control)) +
geom_point(size = 8, colour = "#F8766D") +
facet_grid(Measure2 ~ Measure1) +
ylab("") +
xlab("") +
scale_x_continuous(breaks = c(0,0.5,1), labels = c("0", "0.5", "1"), limits = c(-0.05, 1.05)) +
scale_y_continuous(breaks = c(0,0.5,1), labels = c("0", "0.5", "1"), limits = c(-0.05, 1.05)) +
geom_rect(data = subset(combs, subset = Measure1 == Measure2), colour='white', xmin = -Inf, xmax = Inf,ymin = -Inf,ymax = Inf)
return(plt)
}
# Call
plt1 <- scatterplotmatrix(dat)
plt1
I'm not aware of a way to move the panel strips (the labels) to the bottom or left. Also, it's not possible to format the individual panels separately (e.g., turn off the tick marks for just one facet). So if you really need these features, you will probably have to use something other than, or in addition to ggplot. You should really look into GGally, although I've never had much success with it.
As far as leaving some of the panels blank, here is a way.
nC <- 5; nM <- 4
set.seed(1) # for reproducible example
dat <- data.frame(Control = rep(LETTERS[1:nC], nM),
measure = rep(letters[1:nM], each = nC),
value = runif(nC*nM))
scatterplotmatrix <- function(data,...){
require(ggplot2)
require(data.table)
require(plyr) # for .(...)
DT <- data.table(data,key="Control")
gg <- DT[DT,allow.cartesian=T]
setnames(gg,c("Control","H","x","V","y"))
fmt <- function(x) format(x,nsmall=1)
plt <- ggplot(gg, aes(x,y,shape = Control)) +
geom_point(subset=.(as.numeric(H)<as.numeric(V)),size=5, colour="#F8766D") +
facet_grid(V ~ H) +
ylab("") + xlab("") +
scale_x_continuous(breaks=c(0,0.5,1), labels=fmt, limits=c(-0.05, 1.05)) +
scale_y_continuous(breaks=c(0,0.5,1), labels=fmt, limits=c(-0.05, 1.05))
return(plt)
}
scatterplotmatrix(dat)
The main feature of this is the use of subset=.(as.numeric(H)<as.numeric(V)) in the call to geom_point(...). This subsets the dataset so you only get a point layer when the condition is met, i.e. in facets where as.numeric(H)<as.numeric(V). This works because I've left the H and V columns as factors, and as.numeric(...) applied to a factor returns the integer codes of the levels, not the labels.
The rest is just a more compact (and much faster) way of creating what you called combs.
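The factor comparison can be checked in isolation. The sketch below also filters the data before plotting, since, as far as I know, newer ggplot2 versions no longer accept a subset= argument in a layer. The grid and its x/y values are made up, and the plotting part assumes a ggplot2 version in which facet_grid() has the switch argument for moving strips.

```r
H <- factor(c("a", "b", "c", "d"))
as.numeric(H)   # integer codes of the levels
is.numeric(H)   # a factor is not a numeric vector

# All facet combinations, with made-up coordinates
grid <- expand.grid(H = H, V = H)
grid$x <- runif(nrow(grid))
grid$y <- runif(nrow(grid))

# Keep only the panels below the diagonal
lower <- grid[as.numeric(grid$H) < as.numeric(grid$V), ]

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(lower, aes(x, y)) +
    geom_point() +
    # drop = FALSE keeps the empty panels; switch moves the strips
    facet_grid(V ~ H, drop = FALSE, switch = "both")
}
```

If the columns are character rather than factor (the default for data.frame() since R 4.0), they would need to be converted with factor() first for the comparison to work.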

ggplot axis order (factor) changes when using last_plot()

I've been able to successfully create a dot plot in ggplot for percentages across gender. But I want to highlight the significant differences. I thought I could do this with a combination of subsetting and the use of last_plot().
Here’s my data:
require(ggplot2)
require(reshape2)
prog <- c("Honors", "Academic", "Social", "Media")
m <- c(30,35,40,23)
f <- c(25,40,45,15)
s <- c(0.7, 0.4, 0.1, 0.03)
temp <- as.data.frame(cbind(prog, m, f, s), stringsAsFactors=FALSE)
first <- temp[,1:3]
first.melt <- melt(first, id.vars = 'prog', variable.name = 'Gender', value.name = 'Percent')
first.melt <- as.data.frame(cbind(first.melt,temp[,4]), stringsAsFactors=FALSE)
names(first.melt) <- c("program", "Gender", "Percent", "sig")
first.melt$program <- as.factor(first.melt$program)
Here’s where I reverse the order of my Program variable, so that when graphed it will be alphabetical from top to bottom.
first.melt[,1] = with(first.melt, factor(first.melt[,1], levels = rev(levels(first.melt[,1]))))
first.melt$sig <- as.numeric(as.character(first.melt$sig))
first.melt$Percent <- as.numeric(as.character(first.melt$Percent))
Now, I subset...
first.melt.ns <- subset(first.melt,sig > 0.05)
first.melt.sig <- subset(first.melt,sig <= 0.05)
ggplot(first.melt.ns, aes(program, y=Percent, shape=Gender)) +
geom_point(size=3) +
coord_flip() +
scale_shape_manual(values=c("m"=1, "f"=5))
The first run at ggplot gets me my non-significant Program pairs, and they are in the right order. So I add the two new points for male and female (making them solid, to draw attention to them as a significant pair):
last_plot() +
geom_point(data=first.melt.sig, aes(program[Gender=="m"], y=Percent[Gender=="m"]), size=3, shape=19) +
geom_point(data=first.melt.sig, aes(program[Gender=="f"], y=Percent[Gender=="f"]),size=4, shape=18)
The points get added just fine, so ggplot works. But notice my Program axis: the labels are correct, but the order is now reversed.
First, you really should avoid as.data.frame(cbind(...)). It is dramatically increasing the amount of work necessary to prepare your data. The function for creating data frames is (naturally) data.frame. Use it!
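A small illustration of why as.data.frame(cbind(...)) creates extra work: cbind() builds a matrix, and a matrix holds a single type, so numeric columns are coerced to character. The two-row data below is a cut-down version of the question's vectors.

```r
prog <- c("Honors", "Academic")
m <- c(30, 35)

# cbind() makes a character matrix, so m becomes text
bad <- as.data.frame(cbind(prog, m), stringsAsFactors = FALSE)
# data.frame() keeps each column's own type
good <- data.frame(prog, m)

class(bad$m)
class(good$m)
```

This is why the question needs the as.numeric(as.character(...)) conversions later on; with data.frame() they are unnecessary.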
What you're doing here is basically trying to get around the limitation of only having one shape scale. It's probably easiest to just do this:
temp <- data.frame(prog,m,f,s)
first <- temp[,1:3]
first.melt <- melt(first, id.vars = 'prog', variable.name = 'Gender', value.name = 'Percent')
first.melt$sig <- rep(temp$s,times = 2)
# reverse alphabetical levels; works whether prog is a factor or a character column
first.melt$prog <- factor(first.melt$prog, levels = rev(sort(unique(as.character(first.melt$prog)))))
first.melt.sig <- subset(first.melt,sig < 0.05)
first.melt$Percent[first.melt$sig < 0.05] <- NA
ggplot() +
geom_point(data = first.melt,aes(x = prog,y = Percent,shape = Gender),size = 3) +
geom_point(data = first.melt.sig[1,],aes(x = prog,y = Percent),shape = 19) +
geom_point(data = first.melt.sig[2,],aes(x = prog,y = Percent),shape = 18) +
coord_flip() +
scale_shape_manual(values=c("m"=1, "f"=5))
In general, work to structure your ggplot code so that you're subsetting data frames, not variables inside of aes. That gets both tricky and dangerous, because ggplot is assuming certain things about what you pass inside of aes in order for the evaluation to work properly.
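As a sketch of that advice, the significant rows can be split into their own data frames before plotting, so that each layer receives whole columns and nothing is indexed inside aes(). The data frame is rebuilt here from the question's values; the names sig_m and sig_f are illustrative.

```r
first.melt <- data.frame(
  prog    = rep(c("Honors", "Academic", "Social", "Media"), times = 2),
  Gender  = rep(c("m", "f"), each = 4),
  Percent = c(30, 35, 40, 23, 25, 40, 45, 15),
  sig     = rep(c(0.7, 0.4, 0.1, 0.03), times = 2)
)

# Subset the data frame, not the variables inside aes()
sig_m <- subset(first.melt, sig < 0.05 & Gender == "m")
sig_f <- subset(first.melt, sig < 0.05 & Gender == "f")

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(first.melt, aes(x = prog, y = Percent)) +
    geom_point(aes(shape = Gender), size = 3) +
    geom_point(data = sig_m, shape = 19, size = 3) +
    geom_point(data = sig_f, shape = 18, size = 4) +
    coord_flip() +
    scale_shape_manual(values = c("m" = 1, "f" = 5))
}
```

Unlike the answer's version, this sketch draws the hollow points for the significant pair as well and overplots them with solid ones; set their Percent to NA first if that is not wanted.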

R: Similar plot with big and small data frame

I am trying to find a way to plot data frames of different size using the same function. The data is quite similar to the dfs below. Order of xs is not important.
GetDf <- function(n)
data.frame(x = seq(1, n), y = rnorm(n, 3.5, 0.5), group = runif(n) > 0.5)
PlotIt <- function(df) {
p <- ggplot(df) + geom_point(aes(x = x, y = y, colour = group)) +
expand_limits(y = 1) + expand_limits(y = 5) +
geom_hline(yintercept = c(2.5, 4.5), linetype = "dotdash")
print(p)
}
df1 <- GetDf(1000)
df2 <- GetDf(10000)
df3 <- GetDf(100000)
df4 <- GetDf(1000000)
PlotIt(df1) looks ok, but PlotIt(df2) is already bad. Points overlap. I could set the point size smaller when n is large, but then the plots of df1 - df4 would look radically different. If the size is fixed, then the plot of df3 needs something like size = 0.75, and PlotIt(df1) is bad.
I know there is the library hexbin and geom_hex(), but it doesn't seem to produce what I want. I would like to have groups shown in different colors, hexbin is not good for plotting df1, etc.
What would be the best way to plot at least df1 - df3, preferably also df4, so that the plots would "feel" the same and look good? (I'm sorry about vagueness, but I don't know how to be more specific.)
I followed krlmlr's answer and wrote a function that calculates alpha from the row count of df. Choosing a better shape also made the plots nicer. override.aes is needed so that the legend stays readable at low alpha values.
PlotIt <- function(df) {
Alpha <- function(x) pmax(0.1, pmin(1, 2.05 - 0.152 * log(x)))
p <- ggplot(df) +
geom_point(aes(x = x, y = y, colour = group), size = 1.5,
shape = 1, alpha = Alpha(nrow(df))) +
expand_limits(y = 1) + expand_limits(y = 5) +
geom_hline(yintercept = c(2.5, 4.5), linetype = "dotdash") +
guides(colour = guide_legend(override.aes = list(alpha = 1)))
print(p)
}
Plots of df1 - df3 look ok to me (full screen). The question is somewhat similar to Scatterplot with too many points. Differences: same function should apply to big and small data frames, and the order of x's is not important.
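To see what the alpha rule does across the four data frame sizes, the function can be evaluated on its own. The constants are the ones from the code above; only the vector n is added for illustration.

```r
# Alpha rule from the answer: linear in log(n), clamped to [0.1, 1]
Alpha <- function(x) pmax(0.1, pmin(1, 2.05 - 0.152 * log(x)))

n <- c(1e3, 1e4, 1e5, 1e6)   # row counts of df1 to df4
a <- Alpha(n)
print(round(a, 2))
```

The rule leaves 1000 points fully opaque, drops by roughly 0.35 for every tenfold increase in rows, and hits the floor of 0.1 at a million rows.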
I suspect you don't want to trace individual points in a scatter plot of 1000 or more points. Why don't you use a sample?
PlotIt <- function(df) {
df <- sample.rows(df, 1000, replace=F)
...
}
(sample.rows is in my kimisc package).
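If you would rather not depend on an extra package, a rough base-R equivalent is sketched below. SampleRows is a hypothetical helper written for this example, not the kimisc function, and it simply caps the sample size at the number of rows.

```r
# Hypothetical helper: sample up to `size` rows without replacement
SampleRows <- function(df, size = 1000) {
  df[sample(nrow(df), min(size, nrow(df))), , drop = FALSE]
}

set.seed(1)
big <- data.frame(x = seq_len(10000), y = rnorm(10000))
small <- SampleRows(big)
nrow(small)
```

Because the sample is random, outliers may or may not be drawn, which matters if the reference lines in the plot are meant to flag them.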
If you really want to show all points, use an alpha value in geom_point. Be sure to export your plot as a raster image and not as a vector image; otherwise it will take ages to render:
geom_point(aes(...), alpha=get_reasonable_alpha_value(df))
You'll have to do some experimentation for implementing get_reasonable_alpha_value. It should return a value between 0 (fully transparent) and 1 (opaque).
Perhaps a two-dimensional density estimation will suit you better:
geom_density2d(...)

How to correctly interpret ggplot's stat_density2d

My initial goal was to plot a population of individual points and then draw a convex hull enclosing 80% of that population centered on the mass of the population.
After trying a number of ideas, the best solution I came up with was to use ggplot's stat_density2d. While this works great for a qualitative analysis, I still need to indicate an 80% boundary. I started out looking for a way to outline the 80th percentile population boundary, but I can work with an 80% probability density boundary instead.
Here's where I'm looking for help. The bin parameter for kde2d (used by stat_density2d) is not clearly documented. If I set bin = 4 in the example below, am I correct in interpreting the central (green) region as containing a 25% probability mass and the combined yellow, red, and green areas as representing a 75% probability mass? If so, by changing bin to 5, would the inscribed area then equal an 80% probability mass?
set.seed(1)
n=100
df <- data.frame(x=rnorm(n, 0, 1), y=rnorm(n, 0, 1))
TestData <- ggplot (data = df) +
stat_density2d(aes(x = x, y = y, fill = as.factor(..level..)),
bins=4, geom = "polygon") +
geom_point(aes(x = x, y = y)) +
scale_fill_manual(values = c("yellow","red","green","royalblue", "black"))
TestData
I repeated a number of test cases and manually counted the excluded points [would love to find a way to count them based on what ..level.. they were contained within] but given the random nature of the data (both my real data and the test data) the number of points outside of the stat_density2d area varied enough to warrant asking for help.
Summarizing, is there a practical means of drawing a polygon around the central 80% of the population of points in the data frame? Or, barring that, am I safe to use stat_density2d and set bin equal to 5 to produce an 80% probability mass?
Excellent answer from Bryan Hanson dispelling the fuzzy notion that I could pass an undocumented bin parameter in stat_density2d. The results looked close at values for bin around 4 to 6, but as he stated, the actual function is unknown and therefore not usable.
I used HPDregionplot, as provided in the accepted answer by DWin, to solve my problem. To that, I added a center of gravity (COGravity) and point in polygon (pnt.in.poly) from the SDMTools package to complete the analysis.
library(MASS)
library(coda)
library(SDMTools)
library(emdbook)
library(ggplot2)
theme_set(theme_bw(16))
set.seed(1)
n=100
df <- data.frame(x=rnorm(n, 0, 1), y=rnorm(n, 0, 1))
HPDregionplot(mcmc(data.matrix(df)), prob=0.8)
with(df, points(x,y))
ContourLines <- as.data.frame(HPDregionplot(mcmc(data.matrix(df)), prob=0.8))
df$inpoly <- pnt.in.poly(df, ContourLines[, c("x", "y")])$pip
dp <- df[df$inpoly == 1,]
COG100 <- as.data.frame(t(COGravity(df$x, df$y)))
COG80 <- as.data.frame(t(COGravity(dp$x, dp$y)))
TestData <- ggplot (data = df) +
stat_density2d(aes(x = x, y = y, fill = as.factor(..level..)),
bins=5, geom = "polygon") +
geom_point(aes(x = x, y = y, colour = as.factor(inpoly)), alpha = 1) +
geom_point(data=COG100, aes(COGx, COGy),colour="white",size=2, shape = 4) +
geom_point(data=COG80, aes(COGx, COGy),colour="green",size=4, shape = 3) +
geom_polygon(data = ContourLines, aes(x = x, y = y), color = "blue", fill = NA) +
scale_fill_manual(values = c("yellow","red","green","royalblue", "brown", "black", "white", "black", "white","black")) +
scale_colour_manual(values = c("red", "black"))
TestData
nrow(dp)/nrow(df) # actual fraction of population members inscribed within the 80% probability polygon
Alright, let me start by saying I'm not entirely sure of this answer, and it's only a partial answer! There is no bin parameter for MASS::kde2d, which is the function used by stat_density2d. Looking at the help page for kde2d and the code for it (seen simply by typing the function name in the console), I think the bin parameter is h (how these functions know to pass bin to h is not clear, however). Following the help page, we see that if h is not provided, it is computed by MASS::bandwidth.nrd. The help page for that function says this:
# The function is currently defined as
function(x)
{
r <- quantile(x, c(0.25, 0.75))
h <- (r[2] - r[1])/1.34
4 * 1.06 * min(sqrt(var(x)), h) * length(x)^(-1/5)
}
Based on this, I think the answer to your last question ("Am I safe...") is definitely no. r in the above function is what you need for your assumption to be safe, but it is clearly modified, so you are not safe. HTH.
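The rule quoted above can be reproduced by hand to confirm that the default bandwidth depends only on the data, so nothing in it ties a contour to a probability mass. This is a sketch on made-up normal data; the comparison against MASS::bandwidth.nrd only runs if MASS is installed.

```r
set.seed(1)
x <- rnorm(100)   # made-up sample

# Normal reference rule, written out as on the help page
r <- quantile(x, c(0.25, 0.75))
h <- (r[2] - r[1]) / 1.34
bw_manual <- unname(4 * 1.06 * min(sqrt(var(x)), h) * length(x)^(-1/5))
print(bw_manual)

if (requireNamespace("MASS", quietly = TRUE)) {
  # Should agree with the packaged implementation
  print(all.equal(bw_manual, MASS::bandwidth.nrd(x)))
}
```

As far as I can tell, the bins argument only controls how many contour levels are drawn between the lowest and highest estimated density, which would explain why the enclosed share of points varies from sample to sample.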
Additional thought: Do you have any evidence that your code is using your bins argument? I'm wondering if it is being ignored. If so, try passing h in place of bins and see if it listens.
HPDregionplot in package:emdbook is supposed to do that. It does use MASS::kde2d but it normalizes the result. It has the disadvantage to my mind that it requires an mcmc object.
library(MASS)
library(coda)
HPDregionplot(mcmc(data.matrix(df)), prob=0.8)
with(df, points(x,y))
Building on the answer by 42, I've simplified HPDregionplot() to reduce dependencies and remove the requirement to work with mcmc objects. The function works on a two-column data.frame and creates no intermediate plots. Note, however, that this approach breaks as soon as grDevices::contourLines() returns multiple contours.
hpd_contour <- function (x, n = 50, prob = 0.95, ...) {
post1 <- MASS::kde2d(x[[1]], x[[2]], n = n, ...)
dx <- diff(post1$x[1:2])
dy <- diff(post1$y[1:2])
sz <- sort(post1$z)
c1 <- cumsum(sz) * dx * dy
levels <- sapply(prob, function(x) {
approx(c1, sz, xout = 1 - x)$y
})
as.data.frame(grDevices::contourLines(post1$x, post1$y, post1$z, levels = levels))
}
theme_set(theme_bw(16))
set.seed(1)
n=100
df <- data.frame(x=rnorm(n, 0, 1), y=rnorm(n, 0, 1))
ContourLines <- hpd_contour(df, prob=0.8)
ggplot(df, aes(x = x, y = y)) +
stat_density2d(aes(fill = as.factor(..level..)), bins=5, geom = "polygon") +
geom_point() +
geom_polygon(data = ContourLines, color = "blue", fill = NA) +
scale_fill_manual(values = c("yellow","red","green","royalblue", "brown", "black", "white", "black", "white","black")) +
scale_colour_manual(values = c("red", "black"))
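The level-finding step inside hpd_contour() can be checked against a case with a known answer. The sketch below replaces the kernel density estimate with the exact standard bivariate normal density on a grid (an assumption made purely for the check); for that density, the contour enclosing a mass p lies at density (1 - p) / (2 * pi).

```r
# Exact bivariate standard normal density on a grid (stands in for kde2d output)
gx <- seq(-5, 5, length.out = 201)
gy <- gx
z <- outer(dnorm(gx), dnorm(gy))
dx <- diff(gx[1:2])
dy <- diff(gy[1:2])

# Same steps as in hpd_contour(): sort the densities, accumulate the mass,
# and find the density at which the accumulated mass reaches 1 - prob
prob <- 0.8
sz <- sort(z)
c1 <- cumsum(sz) * dx * dy
level <- approx(c1, sz, xout = 1 - prob)$y

mass_inside <- sum(z[z >= level]) * dx * dy   # mass of cells above the level
print(c(level = level, theory = (1 - prob) / (2 * pi), mass = mass_inside))
```

The cells at or above the computed level hold about 80% of the total mass, which is the property the fixed bins setting in stat_density2d does not provide.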
Moreover, the workflow now easily extends to grouped data.
library(dplyr) # provides %>%, group_by() and do()
ContourLines <- iris[, c("Species", "Sepal.Length", "Sepal.Width")] %>%
group_by(Species) %>%
do(hpd_contour(.[, c("Sepal.Length", "Sepal.Width")], prob=0.8))
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point(size = 3, alpha = 0.6) +
geom_polygon(data = ContourLines, fill = NA) +
guides(color = FALSE) +
theme(plot.margin = margin())
