ggplot2: Reading maximum bar height from plot object containing geom_histogram - r

Like this previous poster, I am also using geom_text to annotate plots in ggplot2, and I want to position those annotations in relative coordinates (proportion of facet height and width) rather than data coordinates. That is easy enough for most plots, but in my case I'm dealing with histograms. I'm sure the relevant information about the y scale must be lurking in the plot object somewhere (after adding geom_histogram), but I don't see where.
My question: How do I read maximum bar height from a faceted ggplot2 object containing geom_histogram? Can anyone help?

Try this:
library(ggplot2)
library(plyr)
library(scales)
p <- ggplot(mtcars, aes(mpg)) + geom_histogram(aes(y = ..density..)) + facet_wrap(~am)
# ggplot_build() computes the layer data without drawing the plot
# (in current ggplot2, print(p) returns the plot object rather than the built data)
r <- ggplot_build(p)
# in data coordinates
(dc <- dlply(r$data[[1]], .(PANEL), function(x) max(x$density)))
(mx <- dlply(r$data[[1]], .(PANEL), function(x) x[which.max(x$density), ]$x))
# add annotation (see figure below)
p + geom_text(aes(x, y, label = text),
data = data.frame(x = unlist(mx), y = unlist(dc), text = LETTERS[1:2], am = 0:1),
colour = "red", vjust = 0)
# scale range (ggplot2 >= 3.0 keeps it in layout$panel_params; older versions used r$panel$ranges)
(yr <- llply(r$layout$panel_params, "[[", "y.range"))
# in relative coordinates
(rc <- mapply(function(d, y) rescale(d, from = y), dc, yr))
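The opposite direction, from a relative position inside each facet to data coordinates for geom_text, can be sketched as follows. This is only a sketch: it assumes a ggplot2 version (3.3 or later) whose built plot exposes layout$panel_params with x.range and y.range, and rel_to_data is a hypothetical helper name, not part of ggplot2.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(mpg)) +
  geom_histogram(bins = 15) +
  facet_wrap(~am)

built <- ggplot_build(p)
params <- built$layout$panel_params  # assumption: one entry per panel (ggplot2 >= 3.3)

# Hypothetical helper: map a relative position (0 = low end, 1 = high end)
# onto a data range c(min, max)
rel_to_data <- function(rel, range) range[1] + rel * diff(range)

labels <- data.frame(
  am = c(0, 1),  # same order as the panels
  x = sapply(params, function(pp) rel_to_data(0.5, pp$x.range)),
  y = sapply(params, function(pp) rel_to_data(0.9, pp$y.range)),
  text = c("A", "B")
)

# Labels sit at 50% of the facet width and 90% of the facet height
p_annotated <- p +
  geom_text(aes(x, y, label = text), data = labels, colour = "red")
```

Because the label positions lie inside the existing panel ranges, adding the text layer should not change the scales.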

Related

DBSCAN clustering plotting through ggplot2

I am trying to plot the dbscan clustering result through ggplot2. If I understand it correctly the current dbscan plots noise in black colour with base plot function. Some code first,
library(dbscan)
n <- 100
x <- cbind(
x = runif(5, 0, 10) + rnorm(n, sd = 0.2),
y = runif(5, 0, 10) + rnorm(n, sd = 0.2)
)
plot(x)
kNNdistplot(x, k = 5)
abline(h=.25, col = "red", lty=2)
res <- dbscan::dbscan(x, eps = .25, minPts = 4)
plot(res, x, main = "DBSCAN")
x <- data.frame(x)
library(ggplot2)
library(fpc) # provides clusym
ggplot(x, aes(x = x, y = y)) + geom_point(color = res$cluster + 1, pch = clusym[res$cluster + 1]) +
theme_grey() + ggtitle("(c)") + labs(x = "x", y = "y")
I want to do two things differently here. First, I am trying to plot the clustering output through ggplot(). The difficulty is that if I use res$cluster to plot the points, plot() ignores points with label 0 (the noise points), whereas ggplot() throws an error because the length of res$cluster is smaller than the data to plot; and if I use res$cluster + 1, the noise points get label 1, which I don't want. Second, if possible, I would like to do something like what clusym[] in package fpc does: it plots clusters with labels 1, 2, 3, ... and ignores 0 labels. It would be fine if the noise points kept label 0 and were drawn with a specific symbol, say "*", in a specific colour, say grey. I have seen a Stack Overflow post that does something similar for convex hull plotting, but I still couldn't figure out how to do this when I don't want to draw the hull and want a cluster number for each cluster.
One possibility I considered was to first plot the points without noise and then add the noise points to that plot with the desired colour and symbol.
But since the length of res$cluster is then not equal to the number of rows in x, it throws an error.
ggplot(x, aes(x = x, y = y)) + geom_point(color = res$cluster + 1, pch = clusym[res$cluster + 1]) +
theme_grey() + ggtitle("(c)") + labs(x = "x", y = "y") # + a layer adding the noise points
Error: Aesthetics must be either length 1 or the same as the data (100): shape, colour
You should first take the cluster assignments from the output of DBSCAN (res$cluster), tack them onto your original data as a new column (i.e. as cluster), and convert that column to a factor.
When you make the ggplot, you can map color or shape to cluster. As for ignoring the noise points, I would do it as follows.
data <- data.frame(x, cluster = res$cluster) # cluster column still in numeric form
data2 <- dplyr::filter(data, cluster > 0)
data2$cluster <- as.factor(data2$cluster)
ggplot(data2, aes(x = x, y = y)) +
geom_point(aes(color = cluster))
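If the noise points should stay visible rather than be dropped, one option is to split the data and draw two layers. The sketch below assumes the dbscan and ggplot2 packages; the names pts, clustered and noise are illustrative only. Noise points (cluster 0) are drawn as grey asterisks, and clustered points are drawn as their cluster number, loosely imitating what clusym does in base graphics.

```r
library(ggplot2)
library(dbscan)

set.seed(42)
n <- 100
pts <- data.frame(
  x = runif(5, 0, 10) + rnorm(n, sd = 0.2),
  y = runif(5, 0, 10) + rnorm(n, sd = 0.2)
)
res <- dbscan::dbscan(as.matrix(pts), eps = 0.25, minPts = 4)

pts$cluster <- res$cluster            # 0 marks noise points
clustered <- subset(pts, cluster > 0)
noise <- subset(pts, cluster == 0)
clustered$cluster <- factor(clustered$cluster)

p <- ggplot(mapping = aes(x = x, y = y)) +
  # noise: fixed grey asterisk (shape 8), no legend entry
  geom_point(data = noise, colour = "grey60", shape = 8) +
  # clusters: print the cluster number, coloured by cluster
  geom_text(data = clustered, aes(label = cluster, colour = cluster),
            show.legend = FALSE)
```

Keeping the two subsets in separate layers avoids the length mismatch, because each layer carries its own data frame.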

How to plot histograms of raw data on the margins of a plot of interpolated data

I would like to show, in the same plot, interpolated data and a histogram of the raw data for each predictor. In other threads, like this one, I have seen people explain how to draw marginal histograms of the same data shown in a scatter plot; in this case, however, the histogram is based on other data (the raw data).
Suppose we see how price is related to carat and table in the diamonds dataset:
library(ggplot2)
p = ggplot(diamonds, aes(x = carat, y = table, color = price)) + geom_point()
We can add a marginal frequency plot e.g. with ggMarginal
library(ggExtra)
ggMarginal(p)
How do we add something similar to a tile plot of predicted diamond prices?
library(mgcv)
model = gam(price ~ s(table, carat), data = diamonds)
newdat = expand.grid(seq(55,75, 5), c(1:4))
names(newdat) = c("table", "carat")
newdat$predicted_price = predict(model, newdat)
ggplot(newdat,aes(x = carat, y = table, fill = predicted_price)) +
geom_tile()
Ideally, the histograms go even beyond the margins of the tileplot, as these data points also influence the predictions. I would, however, be already very happy to know how to plot a histogram for the range that is shown in the tileplot. (Maybe the values that are outside the range could just be added to the extreme values in different color.)
PS. I managed to more or less align histograms to the sides of a tile plot, using the method of the accepted answer in the linked thread, but only if I removed all kinds of labels. It would be particularly good to keep the color legend, if possible.
EDIT:
eipi10 provided an excellent solution. I tried to modify it slightly to add the sample size in numbers and to graphically show values outside the plotted range, since they also affect the interpolated values.
I intended to include them in a different color in the histograms at the side, and attempted this by counting them toward the lower and upper ends of the plotted range. I also attempted to print the sample size in numbers somewhere on the plot. However, I failed with both.
This was my attempt to graphically illustrate the sample size beyond the plotted area:
plot_data = diamonds
plot_data <- transform(plot_data, carat_range = ifelse(carat < 1 | carat > 4, "outside", "within"))
plot_data <- within(plot_data, carat[carat < 1] <- 1)
plot_data <- within(plot_data, carat[carat > 4] <- 4)
plot_data$carat_range = as.factor(plot_data$carat_range)
p2 = ggplot(plot_data, aes(carat, fill = carat_range)) +
geom_histogram() +
thm +
coord_cartesian(xlim=xrng)
I tried to add the sample size in numbers with geom_text. I tried fitting it in the far right panel but it was difficult (/impossible for me) to adjust. I tried to put it on the main graph (which would anyway probably not be the best solution), but it didn’t work either (it removed the histogram and legend, on the right side and it did not plot all geom_texts). I also tried to add a third row of plots and writing it there. My attempt:
n_table_above = nrow(subset(diamonds, table > 75))
n_table_below = nrow(subset(diamonds, table < 55))
n_table_within = nrow(subset(diamonds, table >= 55 & table <= 75))
text_p = ggplot()+
geom_text(aes(x = 0.9, y = 2, label = paste0("N(>75) = ", n_table_above)))+
geom_text(aes(x = 1, y = 2, label = paste0("N = ", n_table_within)))+
geom_text(aes(x = 1.1, y = 2, label = paste0("N(<55) = ", n_table_below)))+
thm
library(egg)
pobj = ggarrange(p2, ggplot(), p1, p3,
ncol=2, widths=c(4,1), heights=c(1,4))
grid.arrange(pobj, leg, text_p, ggplot(), widths=c(6,1), heights =c(6,1))
I would be very happy to receive help on either or both tasks (adding sample size as text & adding values outside plotted range in a different color).
Based on your comment, maybe the best approach is to roll your own layout. Below is an example. We create the marginal plots as separate ggplot objects and lay them out with the main plot. We also extract the legend and put it outside the marginal plots.
Set-up
library(ggplot2)
library(cowplot)
# Function to extract legend
#https://github.com/hadley/ggplot2/wiki/Share-a-legend-between-two-ggplot2-graphs
g_legend<-function(a.gplot){
tmp <- ggplot_gtable(ggplot_build(a.gplot))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]
return(legend) }
thm = list(theme_void(),
guides(fill = "none"), # guides(fill = FALSE) is deprecated in recent ggplot2 versions
theme(plot.margin=unit(rep(0,4), "lines")))
xrng = c(0.6,4.4)
yrng = c(53,77)
Plots
p1 = ggplot(newdat, aes(x = carat, y = table, fill = predicted_price)) +
geom_tile() +
theme_classic() +
coord_cartesian(xlim=xrng, ylim=yrng)
leg = g_legend(p1)
p1 = p1 + thm[-1]
p2 = ggplot(diamonds, aes(carat)) +
geom_line(stat="density") +
thm +
coord_cartesian(xlim=xrng)
p3 = ggplot(diamonds, aes(table)) +
geom_line(stat="density") +
thm +
coord_flip(xlim=yrng)
plot_grid(
plot_grid(plotlist=list(p2, ggplot(), p1, p3), ncol=2,
rel_widths=c(4,1), rel_heights=c(1,4), align="hv", scale=1.1),
leg, rel_widths=c(5,1))
UPDATE: Regarding your comment about the space between the plots: This is an Achilles heel of plot_grid and I don't know if there's a way to fix it. Another option is ggarrange from the experimental egg package, which doesn't add so much space between plots. Also, you need to save the output of ggarrange first and then lay out the saved object with the legend. If you run ggarrange inside grid.arrange you get two overlapping copies of the plot:
# devtools::install_github('baptiste/egg')
library(egg)
pobj = ggarrange(p2, ggplot(), p1, p3,
ncol=2, widths=c(4,1), heights=c(1,4))
grid.arrange(pobj, leg, widths=c(6,1))
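For the two open tasks in the question's edit (showing out-of-range values in a different colour and reporting the sample sizes), a minimal sketch for the carat margin could look like this. It is an assumption-laden illustration rather than part of the layout above: the range c(1, 4), the colours, the bin width and the use of a subtitle for the counts are all arbitrary choices.

```r
library(ggplot2)

rng <- c(1, 4)  # plotted carat range (assumed)
plot_data <- diamonds

# Flag out-of-range rows first, then clamp them onto the range limits
outside <- plot_data$carat < rng[1] | plot_data$carat > rng[2]
plot_data$carat_range <- factor(ifelse(outside, "outside", "within"),
                                levels = c("within", "outside"))
plot_data$carat <- pmin(pmax(plot_data$carat, rng[1]), rng[2])

counts <- table(plot_data$carat_range)

p2 <- ggplot(plot_data, aes(carat, fill = carat_range)) +
  geom_histogram(binwidth = 0.1, boundary = rng[1]) +
  scale_fill_manual(values = c(within = "grey40", outside = "firebrick")) +
  labs(subtitle = paste0("N within = ", counts[["within"]],
                         ", N outside = ", counts[["outside"]]))
```

Putting the counts in a subtitle sidesteps positioning geom_text inside a narrow marginal panel; if the marginal plot uses theme_void(), the subtitle should still be drawn, but that is worth checking in the final layout.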

When using `scale_x_log10`, how can I map `geom_text` accurately to `geom_bin2d`?

A great answer on how to label the counts on geom_bin2d can be found here:
Getting counts on bins in a heat map using R
However, when modifying this to have a logarithmic X axis:
library(ggplot2)
set.seed(1)
dat <- data.frame(x = rnorm(1000), y = rnorm(1000))
# plot MODIFIED HERE TO BECOME log10
p <- ggplot(dat, aes(x = x, y = y)) + geom_bin2d() + scale_x_log10()
# Get data - this includes counts and x,y coordinates
newdat <- ggplot_build(p)$data[[1]]
# add in text labels
p + geom_text(data=newdat, aes((xmin + xmax)/2, (ymin + ymax)/2,
label=count), col="white")
This produces labels that are very poorly mapped to their respective points.
How can I correct the geom_text-based labels so that they map correctly to their respective points?
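Apply the logarithmic transformation directly to the x values, not to the scale. Change only one line of your code:
p <- ggplot(dat, aes(x = log10(x), y = y)) + geom_bin2d()
The axis then shows log10(x) itself, including its negative values, and the bin coordinates returned by ggplot_build() are on the same scale as the axis, so the labels line up with their bins. Note that log10() of a non-positive x is not finite, so those rows are still dropped with a warning, just as they are with scale_x_log10().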
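An alternative that keeps scale_x_log10() is to back-transform the bin midpoints before labelling. The reasoning: ggplot_build() returns the bin edges in transformed (log10) space, and the scale then applies log10 a second time to whatever x values geom_text receives. The sketch below assumes the built layer data has xmin, xmax, ymin and ymax columns, as in the question; x_mid and y_mid are illustrative names, and non-positive x values are dropped up front.

```r
library(ggplot2)

set.seed(1)
dat <- data.frame(x = rnorm(1000), y = rnorm(1000))
dat <- subset(dat, x > 0)  # log10 is only finite for positive x

p <- ggplot(dat, aes(x = x, y = y)) + geom_bin2d() + scale_x_log10()

# Bin edges here are in log10 space for x and in data space for y
newdat <- ggplot_build(p)$data[[1]]

# Undo the log10 for x so the scale can re-apply it to the label positions
newdat$x_mid <- 10^((newdat$xmin + newdat$xmax) / 2)
newdat$y_mid <- (newdat$ymin + newdat$ymax) / 2

p_labelled <- p +
  geom_text(data = newdat, aes(x_mid, y_mid, label = count), colour = "white")
```

This keeps the log-spaced axis ticks of scale_x_log10(), which the aes(x = log10(x)) approach replaces with plain log10 values.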

Label minimum and maximum of scale fill gradient legend with text: ggplot2

I have a plot created in ggplot2 that uses scale_fill_gradientn. I'd like to add text at the minimum and maximum of the scale legend. For example, at the legend minimum display "Minimum" and at the legend maximum display "Maximum". There are posts using discrete fills and adding labels with numbers instead of text (e.g. here), but I am unsure how to use the labels feature with scale_fill_gradientn to insert text only at the min and max. At present I keep getting errors:
Error in scale_labels.continuous(scale, breaks) :
Breaks and labels are different lengths
Is this text label possible within ggplot2 for this type of scale / fill?
# The example code here produces a plot for illustrative purposes only.
# create data frame, from ggplot2 documentation
df <- expand.grid(x = 0:5, y = 0:5)
df$z <- runif(nrow(df))
#plot
ggplot(df, aes(x, y, fill = z)) + geom_raster() +
scale_fill_gradientn(colours=topo.colors(7),na.value = "transparent")
For scale_fill_gradientn() you should provide both arguments: breaks= and labels= with the same length. With argument limits= you extend colorbar to minimum and maximum value you need.
ggplot(df, aes(x, y, fill = z)) + geom_raster() +
scale_fill_gradientn(colours=topo.colors(7),na.value = "transparent",
breaks=c(0,0.5,1),labels=c("Minimum",0.5,"Maximum"),
limits=c(0,1))
User Didzis Elfert's answer slightly lacks "automatism" in my opinion (but it of course points to the core of the problem, +1 :).
Here is an option to programmatically define the minimum and maximum of your data.
Advantages:
You no longer need to hard-code values (which is error-prone).
You no longer need to hard-code the limits (which is also error-prone).
Passing a named vector: you don't need the labels argument (manually mapping labels to values is also error-prone).
As a side effect, you avoid the "non-matching labels/breaks" problem.
library(ggplot2)
foo <- expand.grid(x = 0:5, y = 0:5)
foo$z <- runif(nrow(foo))
myfuns <- list(Minimum = min, Mean = mean, Maximum = max)
ls_val <- unlist(lapply(myfuns, function(f) f(foo$z)))
# you only need to set the breaks argument!
ggplot(foo, aes(x, y, fill = z)) +
geom_raster() +
scale_fill_gradientn(
colours = topo.colors(7),
breaks = ls_val
)
# You can obviously also replace the middle value with sth else
ls_val[2] <- 0.5
names(ls_val)[2] <- 0.5
ggplot(foo, aes(x, y, fill = z)) +
geom_raster() +
scale_fill_gradientn(
colours = topo.colors(7),
breaks = ls_val
)

How do you add a legend of geometries in ggplot2?

Let's say I've got a set of data and I want to add a legend to each geometry that I plot it with. For example:
x <- rnorm(100, 1)
qplot(x = x, y = 1:100, geom = c("point", "smooth"))
And it would look something like this:
Now, I want to add a legend so it would say something like:
Legend title
* points [in black]
--- smoothed [in blue]
Where I specify the "Legend title", "points", and "smoothed" names.
How would I go about that?
The easiest way to add extra information is with annotation rather than a legend.
(I know it's a toy example, but ggplot is being sensible by not including a legend when there is only one kind of point and one kind of line. You could make a legend, but it will by default take up more space and ink than necessary and be more work. When there is only one kind of point its meaning should be clear from labels on the x and y axes and from the general context of the graph. Lacking other information, the reader will then infer that the line is the result of fitting some function to the points. The only things they won't know are the specific function and the meaning of the grey error region. That can be a simple title, annotation, or text outside the plot.)
#Sample data in a dataframe since that works best with ggplot
library(ggplot2)
set.seed(13013)
# use = (not <-) inside data.frame() so the columns are actually named x and y
testdf <- data.frame(x = rnorm(100, 1), y = 1:100)
One option is a title:
ggplot(testdf, aes(x = x, y = y)) + geom_point() +
stat_smooth(method = "loess") +
xlab("buckshot hole distance (from sign edge)") +
ylab("speed of car (mph)") +
ggtitle("Individual Points fit with LOESS (95% confidence band)")
Another option is an annotation layer. Here I used the mean and max functions to guess a reasonable location for the text, but one could do a better job with real data and maybe use an argument like size=3 to make the text size smaller.
ggplot(testdf, aes(x = x, y = y)) + geom_point() +
stat_smooth(method = "loess") +
xlab("buckshot hole distance (from sign edge)") +
ylab("speed of car (mph)") +
annotate("text", x = max(testdf$x) - 1, y = mean(testdf$y),
label = "LOESS fit with 95% CI region", colour = "blue")
A fast way to annotate a ggplot is to use geom_text:
x <- rnorm(100, 1)
y <- 1:100
library(ggplot2)
dat <- data.frame(x = x, y = y)
bp <- ggplot(data = dat, aes(x = x, y = y)) +
geom_point() + geom_smooth(group = 1)
# with constant x and y, geom_text draws the label once per data row;
# annotate("text", ...) would draw it only once
bp <- bp + geom_text(x = -1, y = 3, label = "* points ", parse = FALSE)
bp <- bp + geom_text(x = -1, y = -1, label = "--- smoothed ", parse = FALSE, color = "blue")
bp
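If an actual legend is wanted, as the question asks, a common approach is to map the colour aesthetic to a constant string in each layer and then name the colours with a manual scale. The following is a sketch of that idea, not taken from the answers above; the labels "points" and "smoothed" and the title "Legend title" come from the question.

```r
library(ggplot2)

set.seed(13013)
dat <- data.frame(x = rnorm(100, 1), y = 1:100)

p <- ggplot(dat, aes(x = x, y = y)) +
  # a constant inside aes() creates one legend entry per layer
  geom_point(aes(colour = "points")) +
  geom_smooth(aes(colour = "smoothed"), method = "loess", formula = y ~ x) +
  scale_colour_manual(
    name = "Legend title",
    values = c(points = "black", smoothed = "blue")
  )
```

Depending on the ggplot2 version, each legend key may show both the point and the line glyph; guide_legend(override.aes = ...) can be used to tidy that up if needed.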

Resources