I was told to use geom_jitter instead of geom_point, and the reason given in the help is that it handles overplotting better in smaller datasets. I'm confused: what does overplotting mean, and why does it occur in smaller datasets?
Overplotting is when two or more points are in the same place (or close enough to the same place) that you can't look at the plot and tell how many points are there.
Two (not mutually exclusive) cases that often lead to overplotting:
Noncontinuous data - e.g., if x or y are integers, many points will land on exactly the same coordinates, so it is difficult to tell how many points there are.
Lots of data - if your data is dense (or has regions of high density), then points will often overlap even if x and y are continuous.
Jittering is adding a small amount of random noise to data. It is often used to spread out points that would otherwise be overplotted. It is only effective in the non-continuous data case where overplotted points typically are surrounded by whitespace - jittering the data into the whitespace allows the individual points to be seen. It effectively un-discretizes the discrete data.
With high-density data, jittering doesn't help because there is no reliable area of whitespace around overlapping points. Other common techniques for mitigating overplotting include:
using smaller points
using transparency
binning data (as in a heat map)
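As a quick sketch of the last technique (using the diamonds dataset that ships with ggplot2), binning replaces individual points with per-cell counts:

```r
library(ggplot2)

# Bin the dense diamonds data into rectangular cells; fill color
# encodes how many points fall into each cell (a heat map of counts).
p_binned <- ggplot(diamonds, aes(carat, price)) +
  geom_bin2d(bins = 50)
p_binned
```

The `bins = 50` value is just a starting point; coarser bins smooth out more detail, finer bins approach the original scatter.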
Example of jitter working on small data (adapted from ?geom_jitter):
library(ggplot2)

p <- ggplot(mpg, aes(cyl, hwy))
gridExtra::grid.arrange(
  p + geom_point(),
  p + geom_jitter(width = 0.25, height = 0.5)
)
Above, moving the points just a little spreads them out. Now we can see how many points are really there, without distorting the data so much that we can no longer interpret it.
And not working on bigger data:
p2 <- ggplot(diamonds, aes(carat, price))
gridExtra::grid.arrange(
  p2 + geom_point(),
  p2 + geom_jitter(),
  p2 + geom_point(alpha = 0.1, shape = 16)
)
Here, the jittered plot (middle) is just as overplotted as the regular plot (top). There isn't open space around the points to spread them into. However, with a smaller point size and transparency (bottom plot), we can get a feel for the density of the data.
Related
Is there a parameter that can anchor the start and end points of a loess geom_smooth regression? If I increase the span (so that the regression isn't too wiggly), the starting and ending points of my lines look drastically different (I have multiple lines on one graph, grouped with as.factor), when in reality they are quite close together. I can't share my data, as it is for confidential academic research, and I'm not sure how to build a reproducible example for this... I'm just wondering whether this is possible with ggplot.
Here are some pictures that illustrate the problem, though...
With a low span (span = 0.1) and just the first 10 of the 750 points graphed, you can see the true starting points:
And with a high span (span = 1.0) and all 750 points, the starting and ending values are completely different. I'm not sure why this happens, but it is very misleading:
Basically, I want the smoothness of the second picture, but the specific and accurate starting points of the first when I graph all of the data (i.e., all 750 points). Let me know if there's any way to do this. Thanks for all your help.
Without seeing your code, I can already tell that you're setting your axis limits for the "span = 1.0" version using xlim(0,10) or scale_x_continuous(limits=c(0,10)) - is that correct? Change it to the following:
coord_cartesian(xlim = c(0, 10))
This is because xlim() (which is just a wrapper for scale_x_continuous(limits = ...)) does not merely zoom in on your data - it actually discards any data outside those limits before any calculations are performed. See the documentation for xlim() and for coord_cartesian() for more info.
It's easy to see how this is working using the following example:
# create dataset
set.seed(8675309)
df <- data.frame(x = 1:1000, y = rnorm(1000))

# basic plot
library(ggplot2)
p <- ggplot(df, aes(x, y)) + theme_bw() +
  geom_point(color = 'gray75', size = 1) +
  geom_smooth()
p
We get a basic plot and, as expected, the geom_smooth() line for this dataset is essentially flat, parallel to the x axis at y = 0.
If we use xlim() or scale_x_continuous(limits = ...) to see just the first 10 points, the geom_smooth() line is no longer the same:
p + xlim(0,10)
# or this one... results in the same plot
p + scale_x_continuous(limits=c(0,10))
The resulting line has a much wider confidence band and sits a bit above y = 0, since the first 10 points happen to be slightly above the average of the other 990 points. If you use coord_cartesian(xlim = ...), the zooming happens after the calculations are made and no points are discarded, giving you the same points plotted but a geom_smooth() line that matches the one for the full dataset:
p + coord_cartesian(xlim=c(0,10))
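To see the difference concretely, here is a sketch (self-contained, using the same simulated df) that builds both versions and counts how many rows survive to the point layer. xlim() censors out-of-range rows before any stats run, while coord_cartesian() keeps them all:

```r
library(ggplot2)

set.seed(8675309)
df <- data.frame(x = 1:1000, y = rnorm(1000))
p <- ggplot(df, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x)

# xlim() turns out-of-range values into NA, which are dropped before
# the smoother runs: only 10 points reach the layers.
built_xlim <- ggplot_build(p + xlim(0, 10))
nrow(built_xlim$data[[1]])   # 10

# coord_cartesian() zooms after the calculations: all 1000 points remain.
built_coord <- ggplot_build(p + coord_cartesian(xlim = c(0, 10)))
nrow(built_coord$data[[1]])  # 1000
```

ggplot_build() exposes the data each layer actually receives, which makes the censoring visible without eyeballing the plots.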
I have a dataset myData which contains x and y values for various Samples. I can create a line plot for a dataset which contains a few Samples with the following pseudocode, and it is a good way to represent this data:
library(ggplot2)
library(tidyr)    # gather()
library(dplyr)    # %>%

myData <- data.frame(x = 290:450, X52241 = c(..., ..., ...), X75123 = c(..., ..., ...))
myData <- myData %>% gather(Sample, y, -x)
ggplot(myData, aes(x, y)) + geom_line(aes(color = Sample))
Which generates:
This turns into a Spaghetti Plot when I have a lot more Samples added, which makes the information hard to understand, so I want to represent the "hills" of each sample in another way. Preferably, I would like to represent the data as a series of stacked bars, one for each myData$Sample, with transparency inversely related to what is in myData$y. I've tried to represent that data in photoshop (badly) here:
Is there a way to do this? Creating faceted plots with facet_wrap() or facet_grid() doesn't give me what I want (far too many Samples). I would also be open to stacked ridgeline plots using ggridges, but I don't understand how I would convert absolute values to the stat(density) value needed to plot those.
Any suggestions?
Thanks to u/Joris for the helpful suggestion! Since I did not find this question answered elsewhere, I'll go ahead and post the fairly simple solution here for others to find.
Basically, I needed to apply the alpha aesthetic via aes(alpha = y, ...). In theory, I could apply this to any geom. I tried geom_col(), which worked, but the best solution was geom_segment(), since all my "bars" were going to be the same length. Also note that I had to "slice" the segments up in order to avoid overplotting problems similar to those found here, here, and here.
ggplot(myData, aes(x, Sample)) +
  geom_segment(aes(x = x, xend = x - 1, y = Sample, yend = Sample, alpha = y),
               color = 'blue3', size = 14)
That gives us the nice gradient:
Since the max y values are not the same for both lines, I normalized the data (myDataNorm) when I wanted to "match" the intensities, and could then make the same plot. In my particular case, I actually preferred bars without a gradient, showing a hard edge at the maximum values of y. Here was one solution:
ggplot(myDataNorm, aes(x, Sample)) +
  geom_segment(aes(x = x, xend = x - 1, y = Sample, yend = Sample,
                   alpha = ifelse(y > 0.9, 1, 0)),
               color = 'blue3', size = 14) +
  theme(legend.position = 'none')
Better, but I did not like the faint-colored areas that were left. The final code gave me exactly what I was looking for: I simply moved the ifelse() into the x and xend aesthetics, so the only segments drawn are those with high enough y values. Note that my data "starts" at x = 290 here. There are probably more elegant ways to combine those x and xend terms, but this works:
ggplot(myDataNorm, aes(x, Sample)) +
  geom_segment(aes(
    x = ifelse(y > 0.9, x, 290), xend = ifelse(y > 0.9, x - 1, 290),
    y = Sample, yend = Sample), color = 'blue3', size = 14) +
  xlim(290, 400)  # needed to show the entire scale
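Since the full myData isn't shown above, here is a self-contained sketch of the gradient version with two made-up Gaussian "hills" (the sample names and curve parameters are invented for illustration; pivot_longer() is the modern replacement for gather()):

```r
library(ggplot2)
library(tidyr)

# Hypothetical stand-in for myData: two samples with Gaussian "hills"
x <- 290:450
wide <- data.frame(
  x = x,
  X52241 = dnorm(x, mean = 330, sd = 10),
  X75123 = dnorm(x, mean = 380, sd = 15)
)
long <- pivot_longer(wide, -x, names_to = "Sample", values_to = "y")

# One short segment per x value; alpha tracks the curve height,
# producing a horizontal intensity gradient per Sample.
p_grad <- ggplot(long, aes(x, Sample)) +
  geom_segment(aes(xend = x - 1, yend = Sample, alpha = y),
               color = "blue3", size = 14) +
  theme(legend.position = "none")
p_grad
```

Because each segment is only one x-unit wide, adjacent segments tile into a continuous bar whose opacity follows y.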
I've been searching for a while, and I've found a number of answers for problems similar to mine, but not quite working when I try to implement them.
I'm trying to make a series of radar plots for different observations of performance. The data has been normalized so that the mean is 0 and the standard deviation is 1, and the y axis on the plot has been set from -3 to 3 so as to make it visually comparable how well the subjects performed, with more extreme observations being worse. I would like to add colors tied to that fixed scale, preferably so that -1 to 1 is green, the bands between +/- 1 and 2 are yellow, and +/- 2 to 3 are red. All the examples I've been able to find relating to color fills are based directly on the data or on factors rather than a fixed scale, and anything I try seems not to display correctly. I'm not even sure whether setting a color scale this way is normally within ggplot's functionality...
Here's the toy data I've been working with while working out the plotting (after reshaping):
variable <- c("time", "distance", "turns")
value <- c(0.9536197, 0.5842319, -2.1814528)
df <- data.frame(variable, value)
and here's my most recent attempt as far as ggplot code goes (using ggiraphExtra):
ggplot(df, aes(x = variable, y = value, group = 1)) + geom_point() + geom_polygon() +
  ggiraphExtra:::coord_radar() + ylim(-3, 3) +
  scale_fill_gradient(low = "red", high = "green")
and this is the output:
radar plot with solid green geom_polygon fill
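One possible approach (a sketch only, using plain coord_polar() from ggplot2 rather than ggiraphExtra's coord_radar(), so the polygon edges curve) is to draw the fixed bands with annotate("rect") layers underneath the data. Annotations are tied to the scale, not the data, which is exactly the fixed-scale behavior being asked for:

```r
library(ggplot2)

variable <- c("time", "distance", "turns")
value <- c(0.9536197, 0.5842319, -2.1814528)
df <- data.frame(variable, value)

p_radar <- ggplot(df, aes(x = variable, y = value, group = 1)) +
  # Fixed background bands: green within +/-1, yellow to +/-2, red to +/-3
  annotate("rect", xmin = -Inf, xmax = Inf, ymin = -1, ymax = 1,
           fill = "green", alpha = 0.2) +
  annotate("rect", xmin = -Inf, xmax = Inf, ymin = 1, ymax = 2,
           fill = "yellow", alpha = 0.2) +
  annotate("rect", xmin = -Inf, xmax = Inf, ymin = -2, ymax = -1,
           fill = "yellow", alpha = 0.2) +
  annotate("rect", xmin = -Inf, xmax = Inf, ymin = 2, ymax = 3,
           fill = "red", alpha = 0.2) +
  annotate("rect", xmin = -Inf, xmax = Inf, ymin = -3, ymax = -2,
           fill = "red", alpha = 0.2) +
  geom_polygon(fill = NA, color = "black") +
  geom_point() +
  coord_polar() +
  ylim(-3, 3)
p_radar
```

The same annotate() layers should stack under coord_radar() as well, since annotations are drawn before the data layers regardless of the coordinate system.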
I'm having trouble creating a figure with ggplot2. I am using geom_dotplot with center stacking to display my data, which are discrete values in 4 categories.
For aesthetic reasons I want to customize the positions of the dots so that:
the empty space between dots along the y axis is reduced (i.e., each dot is 1 value large)
the distributions fit and don't overlap
I've adjusted the bin and dot size to achieve goal 1, but that requires me to fiddle with the ylim() parameter to make sure the groups fit in the plot. This results in a plot with more white space and fewer numbers on the y axis.
Question: Can anyone explain a way to resize the empty space on this plot?
My code is below:
plot <- ggplot(figdata, aes(y = Counts, x = category, col = strain)) +
  geom_dotplot(aes(fill = strain), dotsize = 1, binwidth = .7,
               binaxis = "y", stackdir = "centerwhole", stackratio = .7) +
  ylim(18, 59)

# stat = "hline" was removed in ggplot2 2.0; stat_summary() is the modern
# way to draw the per-category mean as a zero-height errorbar
plot + scale_color_manual(values = c("#E69F00", "#56B4E9")) +
  stat_summary(fun = mean, geom = "errorbar",
               aes(ymax = after_stat(y), ymin = after_stat(y), group = category),
               width = 0.5, color = "black")
Which produces:
EDIT: Incorporating jitter will allow all the data to fit, but I don't want to add noise to this data and would prefer to show it as discrete values.
Adjusting the binwidth and dotsize to 0.3, as suggested below, also fits all the data, but it leaves too much white space.
I think I might have to transform my data so that the values are in steps smaller than 1, in order to get everything to fit horizontally while keeping the dots large enough to reduce the white space.
I think the easiest way is using coord_cartesian:
plot + scale_color_manual(values = c("#E69F00", "#56B4E9")) +
  stat_summary(fun = mean, geom = "errorbar",
               aes(ymax = after_stat(y), ymin = after_stat(y), group = category),
               width = 0.5, color = "black") +
  coord_cartesian(ylim = c(17, 40))
Which gives me this plot (with fake data that are not as neatly distributed as yours):