I want to draw vertical boxplots of counts, and show the counts as points, overlaid over the boxplots. Because they are discrete values, there are going to be multiple points with the same value. To show the data in ggplot2, I could use geom_jitter() to spread the data and get a slightly better impression, but jitter screws up the values (the vertical component), and the randomness of the horizontal spread means that if the jitter height is set to 0, there is still a high chance of overlapping points.
Is there a way to spread all points which have the same value, evenly and horizontally? Something along the lines of this:
Here is some example data:
wine_votes <- melt(list(a=c(7,7,7,8,8,7,7,8,4,7,7,6,8,6),
b=c(5,8,6,4,3,4,4,9,5,8,4,5,4),
c=c(7.5,8,5,8,6,8,5,6,6.5,7,5,5,6),
d=c(4,4,5,5,6,8,5,8,5,6,3,6,5),
e=c(7,4,6,7,4,6,7,5,6.5,8.5,8,5)
))
names(wine_votes) <- c('vote', 'option')
# Example plot with jitter:
ggplot(wine_votes, aes(x=blend, y=vote)) +
geom_boxplot() + geom_jitter(position=position_jitter(height=0, width=0.2)) +
scale_y_continuous(breaks=seq(0,10,2))
While this isn't exactly what the image looks like it can be adjusted to get there. geom_dotplot (added in version 0.9.0-ish I think) does this:
ggplot(wine_votes, aes(x=option, y=vote)) +
geom_boxplot() +
scale_y_continuous(breaks=seq(0,10,2)) +
geom_dotplot(binaxis = "y", stackdir = "center", aes(fill=option))
Related
I was told to use geom_jitter over geom_points and reason given in help is it handle overplotting better in smaller dataset. I am confused what does overplotting mean and why it occurs in smaller datasets?
Overplotting is when one or more points are in the same place (or close enough to the same place) that you can't look at the plot and tell how many points are there.
Two (not mutually exclusive) cases that often lead to overplotting:
Noncontinuous data - e.g., if x or y are integers, then it will be difficult to tell how many points there are.
Lots of data - if your data is dense (or has regions of high density), then points will often overlap even if x and y are continuous.
Jittering is adding a small amount of random noise to data. It is often used to spread out points that would otherwise be overplotted. It is only effective in the non-continuous data case where overplotted points typically are surrounded by whitespace - jittering the data into the whitespace allows the individual points to be seen. It effectively un-discretizes the discrete data.
With high density data, jittering doesn't help because there is not a reliable area of whitespace around overlapping points. Other common techniques for mitigating overplotting include
using smaller points
using transparency
binning data (as in a heat map)
Example of jitter working on small data (adapted from ?geom_jitter):
p = ggplot(mpg, aes(cyl, hwy))
gridExtra::grid.arrange(
p + geom_point(),
p + geom_jitter(width = 0.25, height = 0.5)
)
Above, moving the points just a little bit spreads them out. Now we can see how many points are "really there", without changing the data too much that we don't understand it.
And not working on bigger data:
p2 = ggplot(diamonds, aes(carat, price))
gridExtra::grid.arrange(
p2 + geom_point(),
p2 + geom_jitter(),
p2 + geom_point(alpha = 0.1, shape = 16)
)
Below, the jittered plot (middle) is just as overplotted as the regular plot (top). There isn't open space around the points to spread them into. However, with a smaller point mark and transparency (bottom plot) we can get a feel for the density of the data.
I've been trying to standardise multiple bar plots so that the bars are all identical in width regardless of the number of bars. Note that this is over multiple distinct plots - faceting is not an option. It's easy enough to scale the plot area so that, for instance, a plot with 6 bars is 1.5* the width of a plot with 4 bars. This would work perfectly, except that each plot has an expanded x axis by default, which I would like to keep.
"The defaults are to expand the scale by 5% on each side for continuous variables, and by 0.6 units on each side for discrete variables."
https://ggplot2.tidyverse.org/reference/scale_discrete.html
My problem is that I can't for the life of me work out what '0.6 units' actually means. I've manually measured the distance between the bars and the y axis in various design tools and gotten inconsistent answers, so I can't factor '0.6 units' into my calculations when working out what size the panel windows should be. Additionally I can't find any answers on how many 'units' long a discrete x axis is - I assumed at first it would be 1 unit per category but that doesn't fit with the visuals at all. I've included an image that hopefully shows what I mean - the two graphs
In this image, the top graph has a plot area exactly 1.5* that of the bottom graph. Seeing as it has 6 bars compared with 4, that would mean each bar is the same width, except that that extra space between the axis and the first bar messes this up. Setting expand = expansion(add = c(0, 0)) clears this up but results in not-so-pretty graphs. What I'd like is for the bars to be identical in width between the two plots, accounting for this extra space. I'm specifically looking for a general solution that I can use for future plots, not for the individual solution for this sample. As such, what I'd really like to know is how many 'units' long are these two x axes? Many thanks for any and all help!
Instead of using expansion for the axis, I would probably use the fact that categorical variables are actually plotted on the positive integers on Cartesian co-ordinates. This means that, provided you know the maximum number of columns you are going to use in your plots, you can set this as the range in coord_cartesian. There is a little arithmetic involved to keep the bars centred, but it should give consistent results.
We start with some reproducible data:
library(ggplot2)
set.seed(1)
df <- data.frame(group = letters[1:6], value = 100 * runif(6))
Now we set the value for the maximum number of bars we will need:
MAX_BARS <- 6
And the only thing "funny" about the plot code is the calculation of the x axis limits in coord_cartesian:
ggplot(df, aes(group, value)) +
geom_col() +
coord_cartesian(xlim = c(1 -(MAX_BARS - length(unique(df$group)))/2,
MAX_BARS - (MAX_BARS - length(unique(df$group)))/2))
Now let us remove one factor level and run the exact same plot code:
df <- df[-1,]
ggplot(df, aes(group, value)) +
geom_col() +
coord_cartesian(xlim = c(1 -(MAX_BARS - length(unique(df$group)))/2,
MAX_BARS - (MAX_BARS - length(unique(df$group)))/2))
And again:
df <- df[-1,]
ggplot(df, aes(group, value)) +
geom_col() +
coord_cartesian(xlim = c(1 -(MAX_BARS - length(unique(df$group)))/2,
MAX_BARS - (MAX_BARS - length(unique(df$group)))/2))
And again:
df <- df[-1,]
ggplot(df, aes(group, value)) +
geom_col() +
coord_cartesian(xlim = c(1 -(MAX_BARS - length(unique(df$group)))/2,
MAX_BARS - (MAX_BARS - length(unique(df$group)))/2))
You will see the bars remain constant width and centralized, yet the panel size remains fixed.
Created on 2021-11-06 by the reprex package (v2.0.0)
Take a look at the following plotting code:
library(ggplot2)
library(cowplot)
a <- data.frame(a1=1:10, a2=1:10)
b <- data.frame(b1=1:5, b2=2*(1:5))
aplot <- ggplot(a, aes(x=a1, ymin=0, ymax=12)) +
geom_line(aes(y=a2))
bplot <- ggplot(b, aes(x=b1, ymin=0, ymax=12)) +
geom_line(aes(y=b2))
plot_grid(aplot,bplot, ncol=2)
It yields two side-by-side plots of identical dimensions showing similar lines. But the x-axis scales are rather different. In fact, the second line has twice the slope of the first.
I am looking for a way to plot this figure so that the width of a plot is scaled by the limits of its x-axis, so that the slopes can be compared visually. The real plots I am interested in visualizing are five in number and will lack y-axis labels except for the leftmost. I can use grid.arrange() to plot them all in a row with whatever widths I want, but the problem is that I don't know what width to assign to each panel to make sure they come out right (the panel width has to be large enough to accommodate the plot margins, the y-axis tick marks, and the y-axis text). I can set the margins myself and account for them in my panel widths, but I cannot find a good way to figure out how wide (e.g. in cm) the y-axis text is.
You can use the rel_widths option in plot_grid to achieve this. You need to calculate the relative size you want each plot to be using the ratio of the ranges xmax-xmin of each panel. But, there is an extra catch. rel_widths sets the relative width of the whole panel, including the margins. So we also need to account for the margins in calculating the relative size. In the following code, adding an offset value of 2 to the numerator and denominator of relative.size works for this. But note that this offset value may change if you alter the size of the margins.
aplot <- ggplot(a, aes(x=a1, ymin=0, ymax=12, xmin=0, xmax=max(a$a1))) +
geom_line(aes(y=a2))
bplot <- ggplot(b, aes(x=b1, ymin=0, ymax=12, xmin=0, xmax=max(b$b1))) +
geom_line(aes(y=b2))
relative.size <- (2+max(b$b1)) / (2+max(a$a1)) # the addition of 2 here is to account for the plot margins
plot_grid(aplot,bplot, ncol=2, rel_widths=c(1,relative.size), align = "h")
gives
I'm having trouble creating a figure with ggplot2. I am using geom_dotplot with center stacking to display my data which are discrete values for 4 categories.
For aesthetic reasons I want to customize the positions of the dots so that
reduce the empty space between dots along the y axis, (ie the dots are 1 value large)
The distributions fit and don't overlap
I've adjusted the bin and dotsize to achieve aesthetic goal 1, but that requires me to fiddle with the ylim() parameter to make sure that the groups fit in the plot. This results in a plot with more whitw space and few numbers on the y axis.
Question: Can anyone explain a way to resize the empty space on this plot?
My code is below:.
plot <- ggplot(figdata, aes(y=Counts, x=category, col=strain)) +
geom_dotplot(aes(fill=strain), dotsize=1, binwidth=.7,
binaxis= "y",stackdir ="centerwhole", stackratio=.7) +
ylim(18,59)
plot + scale_color_manual(values=c("#E69F00", "#56B4E9")) +
geom_errorbar(stat="hline", yintercept="mean",
aes( ymax=..y..,ymin=..y.., group = category, width = 0.5),
color="black")
Which produces:
EDIT: Incorporating jitter will allow the all the data to fit, but I don't want to add noise to this data and would prefer to show it as discreet data.
adjusting the binwidth and dotsize to 0.3 as suggested below also fits all the data, however it leaves too much white space.
I think that I might have to transform my data so that the values are steps smaller than 1, in order to get everything to fit horizontally and dot sizes to big large enough to reduce white space.
I think the easiest way is using coord_cartesian:
plot + scale_color_manual(values=c("#E69F00", "#56B4E9")) +
geom_errorbar(stat="hline", yintercept="mean",
aes( ymax=..y..,ymin=..y.., group = category, width = 0.5),
color="black") +
coord_cartesian(ylim=c(17,40))
Which gives me this plot (with fake data that are not as neatly distributed as yours):
I created some grouped boxplots, basically for each dimension on the x axis I am showing various groups. Because my dataset is quite large, I had to precalculate the values for the boxes as ggplot did not have enough memory (I used ddply and did it in pieces).
I believe this is beter than just bar charts of the averages as it shows some of the variability.
I want 2 modifications, one was to not show the whisker lines, and I have done that by setting ymin=lower and ymax=upper.
I also wanted to add the means as well, but they show all in the center of each X category, and of course I want them each aligned with its box.
to make it easier on anyone helping, I recreated the same chart using mtcars - I tried position = "dodge" and "identity" with no change
Anyone knows how to do this? I searched and did not find a way. I am also attaching a picture of my latest chart. Code is below
data(mtcars)
data <- as.data.frame(mtcars)
data$cyl <- factor(data$cyl)
data$gear <- factor(data$gear)
summ <- ddply(data, .(cyl, gear),summarize, lower=quantile(mpg,probs=0.25,na.rm=T), middle=quantile(mpg,probs=.5,na.rm=T),upper=quantile(mpg,probs=.75,na.rm=T),avg=mean(mpg,na.rm=T))
p2 <- ggplot(summ, aes(x = cyl, lower = lower, middle = middle, upper = upper,fill=gear,ymin=lower,ymax=upper))+geom_boxplot(stat = "identity")
p2 <- p2 + geom_point(aes(x = cyl, y=avg, color=gear),color="red",position="dodge")
p2
The problem is that the width of the points is not the same as the width of the box plots. In that case you need to tell position_dodge what width do use. ?position_dodge gives a simple example of this using points and error bars, but the principle is the same for points and box plots. In your example, replacing position="dodge" with position=position_dodge(width=0.9) will dodge the points by the same amount as the box plots.