ggplot2 z clipping: remove unnecessary points in overlapping stacks - r

Consider the following example of plotting 100 overlapping points:
ggplot(data.frame(x=rnorm(100), y=rnorm(100)), aes(x=x, y=y)) +
geom_point(size=100) +
xlim(-10, 10) +
ylim(-10, 10)
I now want to save the image as vector graphics, e.g. in PDF. This is not a problem with the above example, but once I've got over a million points (e.g. from a volcano plot), the file size can exceed 100 MB for one page and it takes ages to display or edit.
In the above example the same shape could still be represented by either
converting the points to a shape outline, or
keeping a couple of points and discarding the rest.
Is there any way (or preferably a tool that already does this) to remove points from a plot that will never be visible? (ideally supporting transparency)
The best approach I have heard so far is to round the position of the dots and remove grid points that have > N points, then use the original positions of the remaining ones. Is there anything better?
Note that this should work with an arbitrary structure of points, and only remove those that are not visible.
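The grid-rounding idea described above can be sketched roughly as follows. This is only an approximation under stated assumptions: the helper name thin_points, the cell width, and the per-cell cap are all hypothetical, and the sketch does not check true visibility, so it may keep or drop a few points that an exact method would treat differently.

```r
# Hypothetical helper: keep at most `max_per_cell` points per grid cell.
# `cell` is the grid width in data units (an assumed tuning parameter).
thin_points <- function(df, cell = 0.1, max_per_cell = 5) {
  # Identify each point's grid cell by rounding its scaled coordinates.
  key <- paste(round(df$x / cell), round(df$y / cell))
  # Rank points within their cell in row order (1, 2, 3, ...).
  rank_in_cell <- ave(seq_along(key), key, FUN = seq_along)
  # Keep the original (unrounded) positions of the surviving points.
  df[rank_in_cell <= max_per_cell, , drop = FALSE]
}

set.seed(123)
big <- data.frame(x = rnorm(1e5), y = rnorm(1e5))
thinned <- thin_points(big, cell = 0.1, max_per_cell = 5)
nrow(thinned)  # far fewer rows than nrow(big) in the dense centre
```

The thinned data frame can then be passed to ggplot() in place of the full data. With transparency, max_per_cell would need to be large enough that stacked points already look fully opaque.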

You could do something with the convex hull, like this, filling in the polygon that makes up the convex hull:
library(ggplot2)
set.seed(123)
df <- data.frame(x = rnorm(100), y = rnorm(100))
idx <- chull(df)
ggplot(df, aes(x = x, y = y)) +
geom_point(size = 100,color="darkgrey") +
geom_polygon(data=df[idx,],color="blue") +
geom_point(size = 1, color = "red") +
xlim(-10, 10) +
ylim(-10, 10)
yielding:
(Note that I pulled this chull-idea out of Hadley's "Extending ggplot2" guide https://cran.r-project.org/web/packages/ggplot2/vignettes/extending-ggplot2.html.)
In your case you would drop the geom_point calls and set transparency on the geom_polygon. Also not sure how fast chull is for millions of points, though it will clearly be faster than plotting them all.
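A minimal sketch of that suggestion, assuming the same simulated data as above: only the hull vertices are kept and drawn as one semi-transparent polygon. The alpha value is an arbitrary choice for illustration.

```r
library(ggplot2)

set.seed(123)
df <- data.frame(x = rnorm(100), y = rnorm(100))

# chull() returns the row indices of the points on the convex hull.
hull <- df[chull(df$x, df$y), ]

# Draw only the hull as a filled, semi-transparent polygon (no points).
p_hull <- ggplot(hull, aes(x = x, y = y)) +
  geom_polygon(fill = "black", alpha = 0.5) +
  xlim(-10, 10) +
  ylim(-10, 10)
p_hull
```

Note that a convex hull only matches the outline of the point cloud when the cloud itself is roughly convex, so this would not preserve an arbitrary structure of points.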
And I am not really sure what you are after. If you really want the 100 pixel radius, then you could probably just do it for the ones on the convex hull, plus fill in the middle with geom_polygon.
So using this code:
ggplot(df[idx,], aes(x = x, y = y)) +
geom_point(size = 100, color = "black") +
geom_polygon(fill = "black") +
xlim(-10, 10) +
ylim(-10, 10)
to make this:

Related

How can I set the point size in a ggplot2 scatterplot to match the scale of the axes?

I am trying to make a scatterplot using ggplot2 in which the diameter of the points is of the same dimensions as the variables on the axes and should have the same scale. This problem is laid out well in this question as well as this one, which was resolved by drawing ellipses on the graph (nowadays done with geom_circle from ggforce). However, for my application I need to draw thousands of points, which is quick using geom_point but very slow using geom_circle. Is there a way to scale geom_point to the scale of the axes?
As an example of the problem, this graph shows the discrepancy in scales using scale_radius:
x <- runif(20, 0, 20)
y <- runif(20, 0, 20)
radius <- runif(20, 0, 4)
df <- data.frame(x = x, y = y, size = radius)
library(ggplot2)
p <- ggplot(
data = df,
mapping = aes(
x = x,
y = y,
size = radius
)
) +
geom_point() +
coord_fixed() + xlim(0, 20) + ylim(0, 20) +
scale_radius(range = c(min(radius), max(radius)))
p
I have tried using scale_radius and scale_continuous, but both use a scale that is arbitrary with relation to the axis scales (scale_radius also does not scale such that a point of size 0 displays with size 0). I had the idea of accessing the plot size using ggplot_build and scaling the point sizes accordingly. I can access the plot range using ggplot_build(p)$layout$get_scales(i=1) or layer_scales(p), but no variables appear to correspond to the size of the plot in the units that scale_radius uses.
Using 2000 circles and max radius of 1 (i.e. max diameter 2), I get a ~5-7x speedup using a lower poly resolution per circle. You might also look at your output device and try ragg, which is faster than cairo and still offers nice anti-aliasing.
ggplot(df, aes(x0 = x, y0 = y, r = radius)) +
ggforce::geom_circle(n = 20) + #5-7x faster than default
coord_fixed()
Still looks pretty good, with 1.5 sec render time on my system.
(You might also consider defining your plot window range using coord_fixed(xlim = c(0,20), ylim = c(0,20)), since that will have the effect of zooming in on that viewing window instead of cropping out all data points outside it, as your xlim() and ylim() (shortcuts for scale_x_continuous(limits = ...) and scale_y_continuous(limits = ...)) do. It's not an issue for geom_point but for geom_circle your approach will result in cut-off circles.)
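The difference between scale limits and coordinate limits can be seen with a small sketch. The three-row data frame d is made up for illustration, with one point deliberately outside the 0 to 20 window.

```r
library(ggplot2)

# Toy data: the third point lies outside the intended 0-20 window.
d <- data.frame(x = c(1, 5, 25), y = c(1, 5, 25))

# Scale limits: out-of-range values are dropped before drawing.
p_scale <- ggplot(d, aes(x, y)) +
  geom_point() +
  xlim(0, 20) +
  ylim(0, 20)

# Coordinate limits: all data are kept; the view is only zoomed.
p_coord <- ggplot(d, aes(x, y)) +
  geom_point() +
  coord_fixed(xlim = c(0, 20), ylim = c(0, 20))

# Count the points that still have usable x values in each built plot.
kept_scale <- sum(is.finite(layer_data(p_scale)$x))
kept_coord <- sum(is.finite(layer_data(p_coord)$x))
c(scale_limits = kept_scale, coord_limits = kept_coord)
```

For shapes built from many vertices, such as circles drawn as polygons, losing some vertices to the scale limits is what produces the cut-off outlines.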

Is it possible to define new shapes for plotting?

I see from here (http://sape.inf.usi.ch/quick-reference/ggplot2/shape) the set of possible shapes. If I wanted to define new shapes, is it possible? For example, suppose I wanted to use a 7-sided polygon with an optional fill aesthetic -- is there a way to tell ggplot about that shape?
I feel constrained by this set of possibilities:
library(tidyverse)
dat <- tibble(p = c(0:25, 32:127),
x = p %% 16,
y = p %/% 16)
ggplot(dat, aes(x, y)) +
geom_text(aes(label = p), size = 3, nudge_y = -.25) +
geom_point(aes(shape = p), size = 5, fill = "red") +
scale_shape_identity() +
theme_void()
Yes, it's possible to do this in one of several ways. Unless you have an svg file of a 7-sided polygon available, one quick solution would be to define this shape as a grob and plot it using geom_grob from package ggpmisc. This keeps things in vector format.
Creating the heptagon is the hard part:
library(ggplot2)
library(dplyr)
library(grid)
library(ggpmisc)
# Make heptagon
septs <- seq(0, 2 * pi, length.out = 8)
devratio <- dev.size()[2]/dev.size()[1]
heptagon <- linesGrob(x = unit(0.5 + 0.2 * devratio * sin(septs), "npc"),
y = unit(0.5 + 0.2 * cos(septs), "npc"),
gp = gpar(lwd = 2))
The plot itself is straightforward:
# Plot 10 random points with the heptagon
set.seed(69)
tibble(x = rnorm(10), y = rnorm(10), shape = list(heptagon)) %>%
ggplot() +
geom_grob(aes(x, y, label = shape))
As you can see from this example, custom shapes aren't necessarily all that easy to use, since the shape has to be defined pointwise by the user, it is difficult to match its size and lineweight to existing points, and the user would have to define what/where their fill is, etc. I don't think it's an omission from ggplot to not have a simple interface for creating custom shapes - ggplot has great extensibility for advanced users, and it's not clear that you could have a useful shape-creating interface for beginners. Meanwhile there are more than enough shapes to provide informative plots for all but the most niche applications.
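To illustrate what "defining the fill" would involve, here is a rough sketch that builds a closed, filled polygon grob instead of an outline. The helper name regular_polygon_grob and its arguments are hypothetical, and the device aspect-ratio correction used above is omitted for brevity, so the shape may look stretched on a non-square device.

```r
library(grid)

# Hypothetical helper: an n-sided regular polygon grob centred in its viewport.
regular_polygon_grob <- function(n = 7, radius = 0.2, fill = NA, col = "black") {
  # n distinct vertex angles; a polygon closes itself, so the repeated
  # end point needed for an outline drawn with lines is dropped here.
  theta <- seq(0, 2 * pi, length.out = n + 1)[-(n + 1)]
  polygonGrob(
    x = unit(0.5 + radius * sin(theta), "npc"),
    y = unit(0.5 + radius * cos(theta), "npc"),
    gp = gpar(fill = fill, col = col, lwd = 2)
  )
}

heptagon_filled <- regular_polygon_grob(7, fill = "red")

# Draw it once on a fresh page to check the shape.
grid.newpage()
grid.draw(heptagon_filled)
```

Such a grob could presumably be stored in a list column and plotted in the same way as the outline heptagon, but the fill would still be fixed per grob rather than mapped through a ggplot fill scale.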
Maybe not exactly what you're looking for but let me suggest three packages:
ggimage - allows you to use images, like a .png file, for your points. See https://mran.microsoft.com/snapshot/2018-04-02/web/packages/ggimage/vignettes/ggimage.html
ggpattern - allows you to add different fills to your graphics. See https://github.com/coolbutuseless/ggpattern
emoGG - uses emojis for your points. See https://github.com/dill/emoGG

Create standard color scale for several graphs

I am trying to create a custom color scale for several graphs. I would like it to be a standard color scheme so that the two graphs can be compared. The data for the first graph has a much smaller range (its maximum is just a bit above 3) while the other one goes to 9. Therefore, I need colors to match numbers 4-9 but do not want them to appear in the first graph. However, they always do and I do not understand why.
Here is the data for the first graph:
df <- data.frame(
x = runif(100),
y = runif(100),
z1 = rnorm(100),
z2 = abs(rnorm(100))
)
And here is the graph, with the custom color scale. However, as you can see all the colors appear in the graph even though only the first 5 colors should show up.
ggplot(df, aes(x, y)) +
geom_point(aes(colour = z2))+scale_colour_gradientn(colours = c('springgreen1', 'springgreen4', 'yellowgreen','yellow2','lightsalmon','orange','orange3','orange4','navajowhite3','white'),breaks=c(0,1,2,3,4,5,6,7,8,9))
The limits term of scale_colour_gradientn can help here:
ggplot(df, aes(x, y)) +
geom_point(aes(colour = z2))+
scale_colour_gradientn(colours = c('springgreen1', 'springgreen4', 'yellowgreen','yellow2',
'lightsalmon','orange','orange3','orange4','navajowhite3','white'),
breaks=c(0,1,2,3,4,5,6,7,8,9),
limits = c(0,9)) +
theme(legend.key.height = unit(1.5, "cm"))
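One way to reuse the same limits across several graphs is to wrap the scale in a small function and add it to each plot. This is a sketch only: shared_colour_scale and the two simulated data frames are made up for illustration.

```r
library(ggplot2)

# Hypothetical helper returning the common scale; fixed limits make a given
# value map to the same colour regardless of each data set's range.
shared_colour_scale <- function() {
  scale_colour_gradientn(
    colours = c("springgreen1", "springgreen4", "yellowgreen", "yellow2",
                "lightsalmon", "orange", "orange3", "orange4",
                "navajowhite3", "white"),
    breaks = 0:9,
    limits = c(0, 9)
  )
}

set.seed(1)
small <- data.frame(x = runif(50), y = runif(50), z = runif(50, 0, 3))
large <- data.frame(x = runif(50), y = runif(50), z = runif(50, 0, 9))
# Give both data sets one identical value so the colours can be compared.
small$z[1] <- 3
large$z[1] <- 3

p_small <- ggplot(small, aes(x, y, colour = z)) + geom_point() + shared_colour_scale()
p_large <- ggplot(large, aes(x, y, colour = z)) + geom_point() + shared_colour_scale()

# The shared value receives the same colour in both plots.
layer_data(p_small)$colour[1] == layer_data(p_large)$colour[1]
```

The legend of the first graph will still show the full 0 to 9 range, which is what makes the two graphs directly comparable.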

ggrepel: Repelling text in only one direction, and returning values of repelled text

I have a dataset, where each data point has an x-value that is constrained (represents an actual instance of a quantitative variable), y-value that is arbitrary (exists simply to provide a dimension to spread out text), and a label. My datasets can be very large, and there is often text overlap, even when I try to spread the data across the y-axis as much as possible.
Hence, I am trying to use the new ggrepel. However, I am trying to keep the text labels constrained at their x-value position, while only allowing them to repel from each other in the y-direction.
As an example, the below code produces a plot for 32 data points, where the x-values show the number of cylinders in a car, and the y-values are determined randomly (have no meaning but to provide a second dimension for text plotting purposes). Without using ggrepel, there is significant overlap in the text:
library(ggrepel)
library(ggplot2)
set.seed(1)
data = data.frame(x=runif(100, 1, 10),y=runif(100, 1, 10),label=paste0("label",seq(1:100)))
origPlot <- ggplot(data) +
geom_point(aes(x, y), color = 'red') +
geom_text(aes(x, y, label = label)) +
theme_classic(base_size = 16)
I can remedy the text overlap using ggrepel, as shown below. However, this changes not only the y-values, but also the x-values. I am trying to avoid changing the x-values, as they represent an actual physical meaning (the number of cylinders):
repelPlot <- ggplot(data) +
geom_point(aes(x, y), color = 'red') +
geom_text_repel(aes(x, y, label = label)) +
theme_classic(base_size = 16)
As a note, the reason I cannot allow the x-value of the text to change is because I am only plotting the text (not the points). Whereas, it seems that most examples in ggrepel keep the position of the points (so that their values remain true), and only repel the x and y values of the labels. Then, the points are connected to the labels with segments (you can see that in my second plot example).
I kept the points in the two examples above for demonstration purposes. However, I am only retaining the text (and hence will be removing the points and the segments), leaving me with something like this:
repelPlot2 <- ggplot(data) + geom_text_repel(aes(x, y, label = label), segment.size = 0) + theme_classic(base_size = 16)
My question is twofold:
1) Is it possible for me to repel the text labels only in the y-direction?
2) Is it possible for me to obtain a structure containing the new (repelled) y-values of the text?
Thank you for any advice!
ggrepel version 0.6.8 (install from GitHub using devtools::install_github) now supports a "direction" argument, which enables repelling of labels only in the "x" or "y" direction.
repelPlot2 <- ggplot(data) + geom_text_repel(aes(x, y, label = label), segment.size = 0, direction = "y") + theme_classic(base_size = 16)
Getting the y values is harder -- one approach can be to use the "repel_boxes" function from ggrepel first to get repelled values and then input those into ggplot with geom_text. For discussion and sample code of that approach, see https://github.com/slowkow/ggrepel/issues/24. Note that if using the latest version, the repel_boxes function now also has a "direction" argument, which takes in "both","x", or "y".
I don't think it is possible to repel text labels only in one direction with ggrepel.
I would approach this problem differently, by instead generating the arbitrary y-axis positions manually. For example, for the data set in your example, you could do this using the code below.
I have used the dplyr package to group the data set by the values of x, and then created a new column of data y containing the row numbers within each group. The row numbers are then used as the values for the y-axis.
library(ggplot2)
library(dplyr)
data <- data.frame(x = mtcars$cyl, label = paste0("label", seq(1:32)))
data <- data %>%
group_by(x) %>%
mutate(y = row_number())
ggplot(data, aes(x = x, y = y, label = label)) +
geom_text(size = 2) +
xlim(3.5, 8.5) +
theme_classic(base_size = 8)
ggsave("filename.png", width = 4, height = 2)

Line plot that changes color over "time"

I have a data frame that contains x and y coordinates for a random walk that moves in discrete steps (1 step up, down, left, or right). I'd like to plot the path---the points connected by a line. This is easy, of course. The difficulty is that the path crosses over itself and becomes difficult to interpret. I add jitter to the points to avoid overplotting, but it doesn't help distinguish the ordering of the walk.
I'd like to connect the points using a line that changes color over "time" (steps) according to a thermometer-like color scale.
My random walk is stored in its own class and I'm writing a specific plot method for it, so if you have suggestions for how I can do this using plot, that would be great. Thanks!
This is pretty easy to do in ggplot2:
so <- data.frame(x = 1:10,y = 1:10,col = 1:10)
ggplot(so,aes(x = x, y = y)) +
geom_line(aes(group = 1,colour = col))
If you prefer not to use ggplot, then ?segments will do what you want. -- I'm assuming here that x and y are both functions of time, as implied in your example.
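A minimal base-graphics sketch of the segments() idea, assuming a simulated walk of unit steps: each step is drawn as its own segment, coloured by its position in the sequence. The blue-to-red ramp and the variable names are illustrative choices, not part of the original question.

```r
set.seed(42)
n <- 200

# Simulate a discrete walk: each step moves 1 unit right, left, up, or down.
steps <- sample(1:4, n, replace = TRUE)
dx <- c(1, -1, 0, 0)[steps]
dy <- c(0, 0, 1, -1)[steps]
x <- c(0, cumsum(dx))
y <- c(0, cumsum(dy))

# One colour per step, running from "cold" to "hot".
cols <- colorRampPalette(c("blue", "red"))(n)

# Set up empty axes with equal scaling, then draw step i from point i to i + 1.
plot(x, y, type = "n", asp = 1, xlab = "x", ylab = "y")
segments(x[-(n + 1)], y[-(n + 1)], x[-1], y[-1], col = cols, lwd = 2)
```

Because segments() is vectorised over its coordinates and col, the same call could sit inside a plot method for a custom class.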
If you use ggplot, you can set the colour aesthetic:
library(ggplot2)
walk <-cumsum(rnorm(n=100, mean=0))
dat <- data.frame(x = seq_len(length(walk)), y = walk)
ggplot(dat, aes(x,y, colour = x)) + geom_line()
