I have some very dense data, so I would like to use geom_hex to visualize the distribution. However, the points are so concentrated that it would help considerably if I could jitter the underlying points before they are binned.
The following works fine, but has no jitter:
ggplot(data = data, aes(x = x, y = y)) +
  geom_hex()
I tried the following:
ggplot(data = data, aes(x = x, y = y)) +
  geom_hex(position = position_jitter(width = 0.1, height = 0.1))
However, this produced an empty graph (axis labels, but no content).
Any suggestions for how to apply jitter within geom_hex are much appreciated.
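One possible workaround, offered as a sketch rather than a confirmed fix: position adjustments are applied after the stat has binned the data, so position_jitter() acts on the hexagons rather than on the underlying points. Adding the noise to the columns before plotting lets geom_hex() bin the jittered values instead. The data frame dat below is made-up stand-in data, and geom_hex() needs the hexbin package installed to render.

```r
library(ggplot2)

set.seed(42)
# Hypothetical stand-in for the dense data: values rounded to one decimal
dat <- data.frame(x = round(rnorm(5000), 1), y = round(rnorm(5000), 1))

# Add uniform noise in [-0.1, 0.1] to each coordinate before binning
dat_jit <- transform(
  dat,
  x = jitter(x, amount = 0.1),
  y = jitter(y, amount = 0.1)
)

# The hexagons are now computed from the jittered values
p_hex <- ggplot(dat_jit, aes(x = x, y = y)) +
  geom_hex()
```

Printing p_hex draws the plot. The amount of 0.1 mirrors the width and height tried in the question and would need tuning for real data.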
Related
I found this on the Tidyverse GitHub:
https://github.com/tidyverse/ggplot2/issues/3716
but I can't find the resolution of yutannihilation's question.
For exploratory data analysis, I would like for the outline stroke to reach the x-axis as it does with base R, including facets with scales="free".
Is there a way to do this programmatically? The user may have multiple facets of data, on the same or different scales. Can I ensure the x-axis is wide enough to take the density to zero?
I have tried outline.type = "full" and "both", but neither seems to work.
The MRE shows the issue. The use case is within a Shiny app and can be facet_wrap-ed as well.
Thanks!
# base R
plot(density(diamonds$carat, adjust = 5))
# ggplot2
library(ggplot2)
ggplot(diamonds, aes(carat)) +
  geom_density(adjust = 5)
A straightforward solution would be to calculate the density yourself and plot that:
library(ggplot2)
ggplot(as.data.frame(density(diamonds$carat, adjust = 5)[1:2]), aes(x, y)) +
  geom_line()
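For the faceted case mentioned in the question, the same idea can be applied per group. This is only a sketch: it uses cut as an example faceting variable (an assumption, not something from the question) and relies on density() extending the curve past the data range by its default cut = 3 bandwidths, which is why the base R curve comes down towards zero.

```r
library(ggplot2)

# Split carat by the faceting variable (cut is used here only as an example)
groups <- split(diamonds$carat, diamonds$cut)

# Run density() per group; its default cut = 3 extends the curve past the data range
dens_by_cut <- do.call(rbind, lapply(names(groups), function(g) {
  d <- density(groups[[g]], adjust = 5)
  data.frame(cut = g, x = d$x, y = d$y)
}))

# Each facet gets its own axes, wide enough for that group's tails
p_dens <- ggplot(dens_by_cut, aes(x, y)) +
  geom_line() +
  facet_wrap(~cut, scales = "free")
```

Because each group's density is computed separately, the bandwidth is also chosen per group, which matches what geom_density() does within facets.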
I was told to use geom_jitter instead of geom_point, and the reason given in the help is that it handles overplotting better in smaller datasets. I am confused: what does overplotting mean, and why does it occur in smaller datasets?
Overplotting is when two or more points are in the same place (or close enough to the same place) that you can't look at the plot and tell how many points are there.
Two (not mutually exclusive) cases that often lead to overplotting:
Noncontinuous data - e.g., if x or y are integers, then it will be difficult to tell how many points there are.
Lots of data - if your data is dense (or has regions of high density), then points will often overlap even if x and y are continuous.
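To get a feel for how much is hidden in the noncontinuous case, you can count how many rows share exactly the same coordinates. A small sketch using the mpg data that ships with ggplot2 (the variable names n_hidden and n_visible are just illustrative):

```r
library(ggplot2)

# Keep only the two plotted columns of the mpg data set
xy <- as.data.frame(mpg[, c("cyl", "hwy")])

# Rows whose (cyl, hwy) pair already appeared earlier land on an existing point
n_hidden <- sum(duplicated(xy))

# Number of distinct positions that are actually visible in a plain scatterplot
n_visible <- nrow(unique(xy))
```

The two counts add up to nrow(mpg), so the ratio of n_hidden to the total gives a rough measure of the overplotting.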
Jittering is adding a small amount of random noise to data. It is often used to spread out points that would otherwise be overplotted. It is only effective in the non-continuous data case where overplotted points typically are surrounded by whitespace - jittering the data into the whitespace allows the individual points to be seen. It effectively un-discretizes the discrete data.
With high-density data, jittering doesn't help because there is not a reliable area of whitespace around overlapping points. Other common techniques for mitigating overplotting include:
using smaller points
using transparency
binning data (as in a heat map)
Example of jitter working on small data (adapted from ?geom_jitter):
p <- ggplot(mpg, aes(cyl, hwy))
gridExtra::grid.arrange(
  p + geom_point(),
  p + geom_jitter(width = 0.25, height = 0.5)
)
Here, moving the points just a little bit spreads them out. Now we can see how many points are "really there", without changing the data so much that it becomes hard to interpret.
And not working on bigger data:
p2 <- ggplot(diamonds, aes(carat, price))
gridExtra::grid.arrange(
  p2 + geom_point(),
  p2 + geom_jitter(),
  p2 + geom_point(alpha = 0.1, shape = 16)
)
In the resulting figure, the jittered plot (middle) is just as overplotted as the regular plot (top). There isn't open space around the points to spread them into. However, with a smaller point mark and transparency (bottom plot) we can get a feel for the density of the data.
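Binning, the third technique listed above, could look something like the following. This is a minimal sketch using rectangular bins via geom_bin2d(); the choice of 50 bins is arbitrary, and geom_hex() would be the hexagonal equivalent if the hexbin package is installed.

```r
library(ggplot2)

# Count points in a 50 x 50 grid of rectangles and map the count to fill
p_bin <- ggplot(diamonds, aes(carat, price)) +
  geom_bin2d(bins = 50)

# layer_data() returns the computed bins, including the per-bin count
bins <- layer_data(p_bin)
```

The fill colour then shows density directly, so it no longer matters how many points sit on top of each other.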
Do you know how to get the curved effect Jake Kaupp achieves on his plot?
Looks to be something along the lines of:
ggplot(full_data, aes(y = total_consumption_lbs, x = milk_production_lbs)) +
geom_xspline2(aes(s_open = TRUE, s_shape = 0.5))
Where geom_xspline2() comes from library(ggalt)
But don't ask me, here is his source code:
https://github.com/jkaupp/tidytuesdays/blob/master/2019/week5/R/analysis.R
This approach doesn't look quite as nice as your example, but it's a start, and some fiddling may get you the rest of the way.
First, some data to work with:
x <- 1:20
y <- jitter(x, amount = 1.5)
df <- data.frame(x, y)
The approach using ggplot2 is to draw a geom_smooth with a very small span (small enough to trigger lots of warnings, as you'll see), and then plot points with white borders over the top of that.
ggplot(df, aes(x, y)) +
  geom_smooth(se = FALSE, colour = "black", span = 0.15) +
  geom_point(fill = "black", colour = "white", shape = 21, size = 2.5) +
  theme_minimal()
The downsides: as noted above, you'll see many warnings about singularities in the loess fit, because the span is so small. Second, not all of the points are centred on the line, which makes sense since the line is a loess fit rather than an interpolation. Lastly, the white border around the points is quite thin by default; in recent ggplot2 versions the stroke aesthetic of geom_point() should let you widen it.
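If the off-centre points bother you, one alternative (a different technique from the loess approach, shown only as a sketch) is to interpolate with base R's spline(), which passes through every point. The stroke value below is an arbitrary choice for the border width, and the seed is only there to make the toy data reproducible.

```r
library(ggplot2)

set.seed(1)
# Same kind of toy data as above
x <- 1:20
y <- jitter(x, amount = 1.5)
df <- data.frame(x, y)

# Interpolating cubic spline evaluated on a fine grid; it passes through every (x, y)
curve_df <- as.data.frame(spline(df$x, df$y, n = 200))

# Draw the smooth curve first, then the white-bordered points on top
p_curve <- ggplot(df, aes(x, y)) +
  geom_line(data = curve_df, colour = "black") +
  geom_point(fill = "black", colour = "white", shape = 21, size = 2.5, stroke = 1) +
  theme_minimal()
```

An interpolating spline can overshoot between points when the data are noisy, so the curve may look wigglier than the loess version.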
Let's say I have this data.frame:
df <- data.frame(x = rep(1, 20), y = runif(20, 10, 20))
and I want to plot df$y vs. df$x.
Since the x values are constant, points that have identical or close y values will be plotted on top of each other in a simple scatterplot, which kind of hides the density of points at such y-values. One solution for that situation is of course to use a violin plot.
I'm looking for another solution: plotting clusters of points instead of the individual points, which would look similar to a bubble plot. In a bubble plot, however, a third dimension is required to make the bubbles meaningful, and I don't have one in my data. Does anyone know of an R function or package that takes points as input (and probably a defined radius), clusters them, and plots the clusters?
You can jitter the x values:
plot(jitter(df$x),df$y)
You could try a hexbin plot, using either the hexbin package or stat_binhex in ggplot2.
http://cran.r-project.org/web/packages/hexbin/
http://docs.ggplot2.org/0.9.3/stat_binhex.html
The other standard approach (vs. jitter) is to use a partially transparent color, so that overlapping points will appear darker than "lone" points.
De gustibus, etc.
Using transparency is another solution. E.g.:
ggplot(df, aes(x = x, y = y)) +
  geom_point(alpha = 0.2, size = 3)
When there is only one x value, a density plot:
ggplot(df, aes(x = y)) +
  stat_density(geom = "line")
or a violin plot:
ggplot(df, aes(x = x, y = y)) +
  geom_violin()
might also be options for displaying your data.
Look at the sunflowerplot function (and the xyTable function that it uses to count overlapping points).
You could also use the my.symbols function from the TeachingDemos package with the results of xyTable to use other shapes (polygrams, for example).
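A rough sketch of how those two base R functions fit together, using data shaped like the question's. Rounding y to whole numbers is a choice made here to force nearby values onto the same position; the seed and the name y_bin are likewise only for illustration.

```r
set.seed(1)
# Same shape of data as in the question
df <- data.frame(x = rep(1, 20), y = runif(20, 10, 20))

# Round y so that nearby values collapse onto the same position (a choice made here)
y_bin <- round(df$y)

# xyTable() returns each distinct (x, y) position and how many points share it
tab <- xyTable(df$x, y_bin)

# One "petal" per overlapping point
sunflowerplot(df$x, y_bin, xlab = "x", ylab = "y (rounded)")
```

The number component of tab could also be mapped to symbol size, for example with symbols() or my.symbols, to get the bubble-like look the question describes.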