Draw vertical quantile lines over histogram - r

I currently generate the following plot using ggplot in R:
The data is stored in a single dataframe with three columns: PDF (y-axis in the plot above), mids(x) and dataset name. This is created from histograms.
What I want to do is to plot a color-coded vertical line for each dataset representing the 95th quantile, like I manually painted below as an example:
I tried to use + geom_line(stat="vline", xintercept="mean") but of course I'm looking for the quantiles, not for the mean, and AFAIK ggplot does not allow that. Colors are fine.
I also tried + stat_quantile(quantiles = 0.95) but I'm not sure what it does exactly. Documentation is very scarce. Colors, again, are fine.
Please note that density values are very low, down to 1e-8. I don't know if the quantile() function likes that.
I understand that calculating the quantile of a histogram is not quite the same as calculating that of a list of numbers. I don't know how it would help, but the HistogramTools package contains an ApproxQuantile() function for histogram quantiles.
Minimum working example is included below. As you can see I obtain a data frame from each histogram, then bind the dataframes together and plot that.
library(ggplot2)
v <- c(1:30, 2:50, 1:20, 1:5, 1:100, 1, 2, 1, 1:5, 0, 0, 0, 5, 1, 3, 7, 24, 77)
h <- hist(v, breaks=c(0:100))
df1 <- data.frame(h$mids,h$density,rep("dataset1", 100))
colnames(df1) <- c('Bin','Pdf','Dataset')
df2 <- data.frame(h$mids*2,h$density*2,rep("dataset2", 100))
colnames(df2) <- c('Bin','Pdf','Dataset')
df_tot <- rbind(df1, df2)
ggplot(data=df_tot[which(df_tot$Pdf>0),], aes(x=Bin, y=Pdf, group=Dataset, colour=Dataset)) +
  geom_point(aes(color=Dataset), alpha = 0.7, size=1.5)

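Precomputing these values and plotting them separately seems like the simplest option. Doing so with dplyr requires minimal effort. Note that quantile(Bin, 0.95) is taken over the bin midpoints only and does not weight them by Pdf: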
library(dplyr)
q.95 <- df_tot %>%
  group_by(Dataset) %>%
  summarise(Bin_q.95 = quantile(Bin, 0.95))
ggplot(data=df_tot[which(df_tot$Pdf>0),],
       aes(x=Bin, y=Pdf, group=Dataset, colour=Dataset)) +
  geom_point(aes(color=Dataset), alpha = 0.7, size=1.5) +
  geom_vline(data = q.95, aes(xintercept = Bin_q.95, colour = Dataset))
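Because the data frame holds binned densities rather than raw observations, a quantile of the underlying distribution has to weight each bin by its Pdf. The sketch below is one way to do that in base R, assuming equal-width bins within each dataset. The names hist_quantile and q.95.w are hypothetical and not part of any package. The helper returns the midpoint of the first bin at which the normalised cumulative density reaches 0.95, so the result is only accurate to within one bin width.

```r
# Rebuild the example data from the question (plot = FALSE skips the base plot)
v <- c(1:30, 2:50, 1:20, 1:5, 1:100, 1, 2, 1, 1:5, 0, 0, 0, 5, 1, 3, 7, 24, 77)
h <- hist(v, breaks = 0:100, plot = FALSE)
df1 <- data.frame(Bin = h$mids, Pdf = h$density, Dataset = "dataset1")
df2 <- data.frame(Bin = h$mids * 2, Pdf = h$density * 2, Dataset = "dataset2")
df_tot <- rbind(df1, df2)

# Hypothetical helper: density-weighted quantile of a histogram.
# Assumes equal-width bins, so the cumulative Pdf is proportional to the CDF.
hist_quantile <- function(bin, pdf, p = 0.95) {
  o <- order(bin)
  cdf <- cumsum(pdf[o]) / sum(pdf[o])  # normalised cumulative density
  bin[o][which(cdf >= p)[1]]           # first bin midpoint reaching p
}

# One row per dataset, with the same column names as q.95 in the answer
q.95.w <- do.call(rbind, lapply(split(df_tot, df_tot$Dataset), function(d) {
  data.frame(Dataset = d$Dataset[1], Bin_q.95 = hist_quantile(d$Bin, d$Pdf))
}))
q.95.w
```

Since q.95.w has the same columns as q.95, it can be passed to the geom_vline() call above with data = q.95.w.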

Behavior of "fill" argument in geom_polygon in R

I am trying to understand the behavior of the "fill" argument in geom_polygon for ggplot.
I have a dataframe with multiple values of a measure of interest, obtained in different counties for each state. I have merged my database with the coordinates from the "maps" package and then I call the plot via ggplot. I don't understand how ggplot chooses what color to show for a state, considering that different numbers are provided in the fill variable (mean? median? interpolation?).
Reproducing a piece of my dataframe to explain what I mean:
state=rep("Alabama",3)
counties=c("Russell","Clay","Montgomery")
long=c(-87.46201,-87.48493,-87.52503)
lat=c(30.38968,30.37249,30.33239)
group=rep(1,3)
measure=c(22,28,17)
df=data.frame(state, counties, long,lat,group,measure)
Call to ggplot:
p <- ggplot()
p <- p + geom_polygon(data=df, aes(x=long, y=lat, group=group, fill=measure), colour="black")
print(p)
Using the full dataframe, I have hundreds of rows with iterations of 17 counties and all the set of coordinates for the Alabama polygon. How is it that ggplot provides the state fill with a single color?
Again, I would assume it is somehow interpolating the fill values provided at each set of coordinate, but I am not sure about it.
Thanks everyone for the help.
Through trial and error, it looks like the first value of the fill mapping is used for the fill of the polygon. The range of the fill scale takes all values into account. This makes sense because the documentation doesn't mention any aggregation. I agree that an aggregate function would also make sense, but I would assume that the aggregation function would be set via an argument if that were the implementation.
Instead, the documentation shows an example (and recommends) starting with two data frames, one of which has coordinates for each vertex, and one which has a single row (and single fill value) per polygon, and joining them based on an ID column.
Here's a demonstration:
long=c(1, 1, 2)
lat=c(1, 2, 2)
group=rep(1,3)
df=data.frame(long,lat,group,
m1 = c(1, 1, 1),
m2 = c(1, 2, 3),
m3 = c(3, 1, 2),
m4 = c(1, 10, 11),
m5 = c(1, 5, 11),
m6 = c(11, 1, 10))
library(ggplot2)
# one plot per candidate fill column; the title lists that column's values
plots = lapply(paste0("m", 1:6), function(f)
  ggplot(df, aes(x = long, y = lat, group = group)) +
    geom_polygon(aes_string(fill = f)) +
    labs(title = sprintf("%s: %s", f, toString(df[[f]])))
)
do.call(gridExtra::grid.arrange, plots)
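The two-data-frame approach that the documentation recommends could be sketched roughly as follows, using the data from the question. Summarising the county measures with the mean is an assumption made here for illustration; any other summary could be substituted. The names positions, values and datapoly are illustrative as well. The plotting step is wrapped in a check so that the data preparation still runs where ggplot2 is not installed.

```r
# Example data from the question: three vertices, each carrying a different measure
df <- data.frame(state = "Alabama",
                 counties = c("Russell", "Clay", "Montgomery"),
                 long = c(-87.46201, -87.48493, -87.52503),
                 lat = c(30.38968, 30.37249, 30.33239),
                 group = 1,
                 measure = c(22, 28, 17))

# One row per vertex; record the drawing order because merge() may reorder rows
positions <- df[, c("group", "long", "lat")]
positions$order <- seq_len(nrow(positions))

# One row per polygon; the mean is an assumed choice of summary
values <- aggregate(measure ~ group, data = df, FUN = mean)

# Join on the polygon id and restore the vertex order
datapoly <- merge(positions, values, by = "group")
datapoly <- datapoly[order(datapoly$group, datapoly$order), ]

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(datapoly, aes(x = long, y = lat, group = group)) +
    geom_polygon(aes(fill = measure), colour = "black")
  print(p)
}
```

With a single fill value per polygon, the colour no longer depends on which row happens to come first.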

Changing colour under particular threshold for geom_line [duplicate]

I have the following dataframe that I would like to plot. I was wondering if it is possible to color portions of the lines connecting my outcome variable(stackOne$y) in a different color, depending on whether it is less than a certain value or not. For example, I would like portions of the lines falling below 2.2 to be red in color.
set.seed(123)
stackOne = data.frame(id = rep(c(1, 2, 3), each = 3),
y = rnorm(9, 2, 1),
x = rep(c(1, 2, 3), 3))
ggplot(stackOne, aes(x = x, y = y)) +
geom_point() +
geom_line(aes(group = id))
Thanks!
You have at least a couple of options here. The first is quite simple, general (in that it's not limited to straight-line segments) and precise, but uses base plot rather than ggplot. The second uses ggplot, but is slightly more complicated, and colour transition will not be 100% precise (but near enough, as long as you specify an appropriate resolution... read on).
base:
If you're willing to use base plotting functions rather than ggplot, you could clip the plotting region to above the threshold (2.2), then plot the segments in your preferred colour, and subsequently clip to the region below the threshold, and plot again in red. While the first clip is strictly unnecessary, it prevents overplotting different colours, which can look a bit dud.
threshold <- 2.2
set.seed(123)
stackOne=data.frame(id=rep(c(1,2,3),each=3),
y=rnorm(9,2,1),
x=rep(c(1,2,3),3))
# create a second df to hold segment data
d <- stackOne
d$y2 <- c(d$y[-1], NA)
d$x2 <- c(d$x[-1], NA)
d <- d[-findInterval(unique(d$id), d$id), ] # remove last row for each group
plot(stackOne[, 3:2], pch=20)
# clip to region above the threshold
clip(min(stackOne$x), max(stackOne$x), threshold, max(stackOne$y))
segments(d$x, d$y, d$x2, d$y2, lwd=2)
# clip to region below the threshold
clip(min(stackOne$x), max(stackOne$x), min(stackOne$y), threshold)
segments(d$x, d$y, d$x2, d$y2, lwd=2, col='red')
points(stackOne[, 3:2], pch=20) # plot points again so they lie over lines
ggplot:
If you want or need to use ggplot, you can consider the following...
One solution is to use geom_line(aes(group=id, color = y < 2.2)), however this will assign colours based on the y-value of the point at the beginning of each segment. I believe you want to have the colour change not just at the nodes, but wherever a line crosses your given threshold of 2.2. I'm not all that familiar with ggplot, but one way to achieve this is to make a higher-resolution version of your data by creating new points along the lines that connect your existing points, and then use the color = y < 2.2 argument to achieve the desired effect.
For example:
threshold <- 2.2 # set colour-transition threshold
yres <- 0.01 # y-resolution (accuracy of colour change location)
d <- stackOne # for code simplification
# new cols for point coordinates of line end
d$y2 <- c(d$y[-1], NA)
d$x2 <- c(d$x[-1], NA)
d <- d[-findInterval(unique(d$id), d$id), ] # remove last row for each group
# new high-resolution y coordinates between each pair within each group
y.new <- apply(d, 1, function(x) {
seq(x['y'], x['y2'], yres*sign(x['y2'] - x['y']))
})
d$len <- sapply(y.new, length) # length of each series of points
# new high-resolution x coordinates corresponding with new y-coords
x.new <- apply(d, 1, function(x) {
seq(x['x'], x['x2'], length.out=x['len'])
})
id <- rep(seq_along(y.new), d$len) # new group id vector
y.new <- unlist(y.new)
x.new <- unlist(x.new)
d.new <- data.frame(id=id, x=x.new, y=y.new)
# refer to columns by name inside aes() rather than with d.new$...
p <- ggplot(d.new, aes(x=x,y=y)) +
  geom_line(aes(group=id, color=y < threshold))+
  geom_point(data=stackOne)+
  scale_color_discrete(sprintf('Below %s', threshold))
p
There may well be a way to do this through ggplot functions, but in the meantime I hope this helps. I couldn't work out how to draw a ggplotGrob into a clipped viewport (rather it seems to just scale the plot). If you want colour to be conditional on some x-value threshold instead, this would obviously need some tweaking.
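For straight-line segments, the crossing point can also be computed exactly by linear interpolation instead of sampling along the line. The following is a minimal sketch of that idea, not the method used above. The helper split_at_threshold and the data frame segs are hypothetical names. Each segment that crosses the threshold is cut in two at the crossing, and each piece is flagged as below or above. The plotting step only runs if ggplot2 is installed.

```r
# Hypothetical helper: split one segment at the point where it crosses the threshold
split_at_threshold <- function(x1, y1, x2, y2, threshold) {
  crosses <- (y1 - threshold) * (y2 - threshold) < 0
  if (!crosses) {
    # whole segment lies on one side; classify it by its midpoint
    return(data.frame(x = x1, y = y1, xend = x2, yend = y2,
                      below = (y1 + y2) / 2 < threshold))
  }
  # x position of the crossing, by linear interpolation
  xc <- x1 + (threshold - y1) * (x2 - x1) / (y2 - y1)
  data.frame(x = c(x1, xc), y = c(y1, threshold),
             xend = c(xc, x2), yend = c(threshold, y2),
             below = c(y1 < threshold, y2 < threshold))
}

set.seed(123)
stackOne <- data.frame(id = rep(c(1, 2, 3), each = 3),
                       y = rnorm(9, 2, 1),
                       x = rep(c(1, 2, 3), 3))

# Build the (possibly split) segments between consecutive points within each id
segs <- do.call(rbind, lapply(split(stackOne, stackOne$id), function(g) {
  n <- nrow(g)
  do.call(rbind, Map(split_at_threshold,
                     g$x[-n], g$y[-n], g$x[-1], g$y[-1], threshold = 2.2))
}))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(stackOne, aes(x = x, y = y)) +
    geom_segment(data = segs,
                 aes(xend = xend, yend = yend, colour = below)) +
    geom_point() +
    scale_colour_manual(values = c("FALSE" = "black", "TRUE" = "red"))
  print(p)
}
```

This only applies to straight segments. For curved lines, the sampling approach described above is still needed.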
Encouraged by people in my answer to a newer but related question, I'll also share an easier-to-use approximation to the problem here.
Instead of interpolating the correct values exactly, one can use ggforce::geom_link2() to interpolate lines and use after_stat() to assign the correct colours after interpolation. If you want more precision you can increase the n of that function.
library(ggplot2)
library(ggforce)
set.seed(123)
stackOne = data.frame(id = rep(c(1, 2, 3), each = 3),
y = rnorm(9, 2, 1),
x = rep(c(1, 2, 3), 3))
ggplot(stackOne, aes(x = x, y = y)) +
geom_point() +
geom_link2(
aes(group = id,
colour = after_stat(y < 2.2))
) +
scale_colour_manual(
values = c("black", "red")
)
Created on 2021-03-26 by the reprex package (v1.0.0)

I want to create the empirical cumulative distribution function for two samples and put the plots in the same plot [R]

I am using this code to generate the empirical cumulative distribution function for two samples (you can put any numerical values in them). I would like to put them in the same plot, but if you run the following commands everything overlaps badly [see picture 1]. Is there any way to do it like this [see picture 2]? I also want the symbols to disappear and the curves to be drawn as lines, as in picture 2.
plot(ecdf(sample[,1]),pch = 1)
par(new=TRUE)
plot(ecdf(sample[,2]),pch = 2)
picture 1:https://www.dropbox.com/s/sg1fr8jydsch4xp/vanboeren2.png?dl=0
picture 2:https://www.dropbox.com/s/erhgla34y5bxa58/vanboeren1.png?dl=0
Update: I am doing this
df1 <- data.frame(x = sample[,1])
df2 <- data.frame(x = sample[,2])
ggplot(df1, aes(x, colour = "g")) + stat_ecdf()
+geom_step(data = df2)
scale_x_continuous(limits = c(0, 5000))
which is very close (in terms of shape) but still can not put them at the same plot.
Try this with base plot:
df1 <- data.frame(x = runif(200,1,5))
df2 <- data.frame(x = runif(200,3,8))
plot(ecdf(df1[,1]),pch = 1, xlim=c(0,10), main=NULL)
par(new=TRUE)
plot(ecdf(df2[,1]),pch = 2, xlim=c(0,10), main=NULL)
Both graphs now have the same xlim (try removing it to see them superimposed incorrectly, each on its own x scale). Passing main=NULL removes the title.
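To get lines without symbols, as the question asks, one option is to use the do.points and verticals arguments of the ECDF plot method and to draw the second curve with add = TRUE rather than par(new=TRUE). This is a minimal sketch with simulated data. The seed, colours and legend labels are arbitrary choices.

```r
set.seed(1)  # arbitrary seed so the example is reproducible
x1 <- runif(200, 1, 5)
x2 <- runif(200, 3, 8)
F1 <- ecdf(x1)
F2 <- ecdf(x2)

# The first curve sets up the axes; do.points = FALSE drops the symbols and
# verticals = TRUE joins the steps into a continuous line
plot(F1, verticals = TRUE, do.points = FALSE, col = "blue",
     xlim = range(x1, x2), main = "", xlab = "x")
# The second curve is added to the existing axes instead of starting a new plot
plot(F2, verticals = TRUE, do.points = FALSE, col = "red", add = TRUE)
legend("bottomright", legend = c("sample 1", "sample 2"),
       col = c("blue", "red"), lty = 1)
```

Because both curves share one set of axes, there is no second, overlapping axis and title to suppress.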

Stacked histograms like in flow cytometry

I'm trying to use ggplot or base R to produce something like the following:
I know how to do histograms with ggplot2, and can easily separate them using facet_grid or facet_wrap. But I'd like to "stagger" them vertically, such that they have some overlap, as shown below. Sorry, I'm not allowed to post my own image, and it's quite difficult to find a simpler picture of what I want. If I could, I would only post the top-left panel.
I understand that this is not a particularly good way to display data -- but that decision does not rest with me.
A sample dataset would be as follows:
my.data <- as.data.frame(rbind( cbind( rnorm(1e3), 1) , cbind( rnorm(1e3)+2, 2), cbind( rnorm(1e3)+3, 3), cbind( rnorm(1e3)+4, 4)))
And I can plot it with geom_histogram as follows:
ggplot(my.data) + geom_histogram(aes(x=V1,fill=as.factor(V2))) + facet_grid( V2~.)
But I'd like the y-axes to overlap.
require(ggplot2)
require(plyr)
my.data <- as.data.frame(rbind( cbind( rnorm(1e3), 1) , cbind( rnorm(1e3)+2, 2), cbind( rnorm(1e3)+3, 3), cbind( rnorm(1e3)+4, 4)))
my.data$V2=as.factor(my.data$V2)
Calculate the density depending on V2:
res <- dlply(my.data, .(V2), function(x) density(x$V1))
dd <- ldply(res, function(z){
  data.frame(Values = z[["x"]],
             V1_density = z[["y"]],
             V1_count = z[["y"]]*z[["n"]])
})
Add an offset depending on V2:
dd$offset=-as.numeric(dd$V2)*0.2 # adapt the 0.2 value as you need
dd$V1_density_offset=dd$V1_density+dd$offset
And plot:
ggplot(dd, aes(Values, V1_density_offset, color=V2)) +
  geom_line()+
  geom_ribbon(aes(Values, ymin=offset,ymax=V1_density_offset, fill=V2),alpha=0.3)+
  scale_y_continuous(breaks=NULL)
densityplot() from the Bioconductor flowViz package is one option for stacked densities.
from: http://www.bioconductor.org/packages/release/bioc/manuals/flowViz/man/flowViz.pdf :
For flowSets the idea is to horizontally stack plots of density estimates for all frames in the
flowSet for one or several flow parameters. In the latter case, each parameter will be plotted
in a separate panel, i.e., we implicitely condition on parameters.
you can see example visuals here:
http://www.bioconductor.org/packages/release/bioc/vignettes/flowViz/inst/doc/filters.html
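# biocLite() has been replaced by BiocManager in current Bioconductor releases
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("flowViz")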
Using the ggridges package:
library(ggplot2)
library(ggridges)
ggplot(my.data, aes(x = V1, y = factor(V2), fill = factor(V2), color = factor(V2))) +
  geom_density_ridges(alpha = 0.5)
I think it's going to be difficult to get ggplot to offset the histograms like that. At least with faceting it makes new panels, and really, this transformation makes the y-axis meaningless. (The value is in the comparison from row to row). Here's one attempt at using base graphics to try to accomplish a similar thing.
#plotting function
plotoffsethists <- function(vals, groups, freq=F, overlap=.25, alpha=.75, colors=apply(floor(rbind(col2rgb(scales:::hue_pal(h = c(0, 360) + 15, c = 100, l = 65)(nlevels(groups))),alpha=alpha*255)),2,function(x) {paste0("#",paste(sprintf("%02X",x),collapse=""))}), ...) {
# convert groups to a factor before the default colors are evaluated
if (!is.factor(groups)) {
groups<-factor(groups)
}
offsethist <- function (x, col = NULL, offset=0, freq=F, ...) {
y <- if (freq) x$counts else x$density
nB <- length(x$breaks)
rect(x$breaks[-nB], 0+offset, x$breaks[-1L], y+offset, col = col, ...)
}
hh<-tapply(vals, groups, hist, plot=F)
ymax<-if(freq)
sapply(hh, function(x) max(x$counts))
else
sapply(hh, function(x) max(x$density))
offset<-(mean(ymax)*overlap) * (length(ymax)-1):0
ylim<-range(c(0,ymax+offset))
xlim<-range(sapply(hh, function(x) range(x$breaks)))
plot.new()
plot.window(xlim, ylim, "")
box()
axis(1)
Map(offsethist, hh, colors, offset, freq=freq, ...)
invisible(hh)
}
#sample call
par(mar=c(3,1,1,1)+.1)
plotoffsethists(my.data$V1, factor(my.data$V2), overlap=.25)
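One possible way to get offset histograms in ggplot2 after all is to precompute the bins with hist() and draw them as rectangles. This is only a sketch under stated assumptions: the shared breaks, the offset step of 0.2 and the name rects are illustrative choices, and the seed is arbitrary. The plotting step only runs if ggplot2 is installed.

```r
set.seed(42)  # arbitrary seed so the example is reproducible
my.data <- data.frame(
  V1 = c(rnorm(1e3), rnorm(1e3) + 2, rnorm(1e3) + 3, rnorm(1e3) + 4),
  V2 = factor(rep(1:4, each = 1e3))
)

breaks <- pretty(range(my.data$V1), n = 30)  # shared bins for all groups
step <- 0.2                                   # assumed vertical offset per group

# One rectangle per bin and group, shifted upwards by the group's offset
rects <- do.call(rbind, lapply(seq_along(levels(my.data$V2)), function(i) {
  lev <- levels(my.data$V2)[i]
  h <- hist(my.data$V1[my.data$V2 == lev], breaks = breaks, plot = FALSE)
  off <- (i - 1) * step
  data.frame(V2 = lev,
             xmin = h$breaks[-length(h$breaks)], xmax = h$breaks[-1],
             ymin = off, ymax = h$density + off)
}))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(rects) +
    geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = V2),
              alpha = 0.5) +
    scale_y_continuous(breaks = NULL)  # the shifted y-axis has no direct meaning
  print(p)
}
```

As with the base version, the y-axis values are not meaningful after the shift, so the axis breaks are hidden.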
Complementing Axeman's answer, you can add the option stat="binline" to the geom_density_ridges geom. This results in a histogram-like plot instead of a density line.
library(ggplot2)
library(ggridges)
my.data <- as.data.frame(rbind( cbind( rnorm(1e3), 1) ,
cbind( rnorm(1e3)+2, 2),
cbind( rnorm(1e3)+3, 3),
cbind( rnorm(1e3)+4, 4)))
my.data$V2 <- as.factor(my.data$V2)
ggplot(my.data, aes(x=V1, y=factor(V2), fill=factor(V2))) +
geom_density_ridges(alpha=0.6, stat="binline", bins=30)

adding spread data to dotplots in R

I have a table with averages and interquartile ranges. I would like to create a dotplot, where the dot would show this average, and a bar would stretch through the dot, to show the interquartile range. In other words, the dot would be at the midpoint of a bar, the length of which would equal my interquartile range data. I am working in R.
For example,
labels<-c('a','b','c','d')
averages<-c(10,40,20,30)
ranges<-c(5,8,4,10)
dotchart(averages,labels=labels)
where the ranges would then be added to this plot as bars.
Any ideas?
Thanks!
Yet another method, using base.
labels <- c('a', 'b', 'c', 'd')
averages <- c(10, 40, 20, 30)
ranges <- c(5, 8, 4, 10)
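# bars extend ranges/2 either side of the dot, so their length equals the interquartile range
dotchart(averages, labels=labels, xlab='average', pch=20,
         xlim=c(min(averages-ranges/2), max(averages+ranges/2)))
segments(averages-ranges/2, 1:4, averages+ranges/2, 1:4)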
For the record, here's a lattice solution, which uses a couple of functions from the Hmisc package:
library(lattice)
library(Hmisc)
labels<-c('a','b','c','d')
averages<-c(10,40,20,30)
ranges<-c(5,8,4,10)
low <- averages - ranges/2
high <- averages + ranges/2
d <- data.frame(labels, averages, low, high)
Dotplot(labels ~ Cbind(averages, low, high), data = d,
col = 1, # for black points
par.settings = list(plot.line = list(col = 1)), # for black bars
xlab = "Value")
ggplot2 has a good facility for doing this:
library(ggplot2)
labels<-c('a','b','c','d')
averages<-c(10,40,20,30)
ranges<-c(5,8,4,10)
x <- data.frame(labels,averages,ranges)
# bars extend ranges/2 either side of the dot, so their length equals the interquartile range
ggplot(x, aes(averages,labels)) +
  geom_point() +
  geom_errorbarh(aes(xmin=averages-ranges/2,xmax=averages+ranges/2))
