Position-dodge warning with ggplot boxplot? - r

I'm trying to make a boxplot with ggplot2 using the following code:
p <- ggplot(
data,
aes(d$score, reorder(d$names d$scores, median))
) +
geom_boxplot()
I have factors called names and integers called scores.
My code produces a plot, but the graphic does not depict the boxes (only shows lines) and I get a warning message, "position_dodge requires non-overlapping x intervals." I've tried to adjust the height and width with geom_boxplot(width=5), but this does not seem to fix the problem. Can anyone suggest a possible solution to my problem?
I should point out that my boxplot is rather large, with about 200 name values on the y-axis. Perhaps this is the problem?

The number of groups is not the problem; I can see the same thing even when there are only 2 groups. The issue is that ggplot2 draws boxplots vertically (continuous along y, categorical along x) and you are trying to draw them horizontally (continuous along x, categorical along y).
Also, your example has several syntax errors and isn't reproducible because we don't have data/d.
Start with some mock data
dat <- data.frame(scores=rnorm(1000,sd=500),
names=sample(LETTERS, 1000, replace=TRUE))
Corrected version of your example code:
ggplot(dat, aes(scores, reorder(names, scores, median))) + geom_boxplot()
These are the horizontal lines you saw.
If you instead put the categorical on the x axis and the continuous on the y you get
ggplot(dat, aes(reorder(names, scores, median), scores)) + geom_boxplot()
Finally, if you want to flip the coordinate axes, you can use coord_flip(). There can be some additional problems with this if you are doing even more sophisticated things, but for basic boxplots it works.
ggplot(dat, aes(reorder(names, scores, median), scores)) +
geom_boxplot() + coord_flip()
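To see what reorder(names, scores, median) is doing in the calls above, here is a small sketch in base R. It rebuilds the same kind of mock data (the seed and the helper names ordered_names and group_medians are my own additions, not part of the original answer) and checks the level order that reorder() produces.

```r
set.seed(42)  # assumed seed, only so the sketch is repeatable
dat <- data.frame(scores = rnorm(1000, sd = 500),
                  names  = sample(LETTERS, 1000, replace = TRUE))

# reorder() returns a factor whose levels are sorted by a summary
# statistic, here the median of scores within each name.
ordered_names <- reorder(dat$names, dat$scores, median)

# Per-group medians, looked up in the new level order: they increase.
group_medians <- tapply(dat$scores, dat$names, median)
group_medians[levels(ordered_names)]
```

Because ggplot2 lays out a discrete axis in level order, this is what makes the boxes appear sorted by median.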

In case anyone else arrives here wondering why they're seeing
Warning message:
position_dodge requires non-overlapping x intervals
Why this happens
This happens because some of the boxplots or violins (or other plotted elements) may overlap along the x axis. In many cases you may not care, but in some cases it matters, which is why ggplot2 warns you.
How to fix it
You have two options. One is to suppress warnings when generating or printing the ggplot.
The other is to change the width of the elements so that they no longer overlap, which makes the warning go away. Try the width argument of the geom, e.g. geom_boxplot(width = 0.5) (the same works for geom_violin()).
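A minimal sketch of both options, assuming the mpg dataset that ships with ggplot2 as a stand-in for your own data (it will not necessarily trigger the warning itself; it only shows where each fix goes):

```r
library(ggplot2)

# mpg ships with ggplot2; it stands in for your own data here.
p <- ggplot(mpg, aes(class, hwy)) +
  geom_boxplot(width = 0.5)  # option 2: narrower boxes

# Option 1: the warning is raised when the plot is drawn, so wrap
# the print() call rather than the ggplot() call.
suppressWarnings(print(p))
```

Note that suppressWarnings() hides every warning raised while drawing, not just this one, so adjusting the width is usually the safer choice.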

In addition to @stevec's options, if you're seeing any of
position_stack requires non-overlapping x intervals
position_fill requires non-overlapping x intervals
position_dodge requires non-overlapping x intervals
position_dodge2 requires non-overlapping x intervals
and if your x variable is supposed to overlap for different aesthetics such as fill, you can try making the x_var into a factor:
geom_bar(aes(x = factor(x_var), fill = type))
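As a rough, self-contained illustration of the factor() approach (the data frame, its column n, and the use of geom_col() with pre-computed heights are assumptions made for this sketch):

```r
library(ggplot2)

# Hypothetical data: x_var is numeric with uneven spacing.
df <- data.frame(
  x_var = rep(c(1, 2, 2.5), each = 2),
  type  = rep(c("a", "b"), times = 3),
  n     = c(3, 5, 2, 4, 6, 1)
)

# factor() gives each distinct x value its own discrete slot, so the
# dodged bars are laid out per category rather than per numeric interval.
p <- ggplot(df, aes(x = factor(x_var), y = n, fill = type)) +
  geom_col(position = "dodge")
print(p)
```

The trade-off is that the x axis no longer reflects the numeric spacing between values.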

Related

How can I ensure consistent axis lengths between plots with discrete variables in ggplot2?

I've been trying to standardise multiple bar plots so that the bars are all identical in width regardless of the number of bars. Note that this is over multiple distinct plots - faceting is not an option. It's easy enough to scale the plot area so that, for instance, a plot with 6 bars is 1.5* the width of a plot with 4 bars. This would work perfectly, except that each plot has an expanded x axis by default, which I would like to keep.
"The defaults are to expand the scale by 5% on each side for continuous variables, and by 0.6 units on each side for discrete variables."
https://ggplot2.tidyverse.org/reference/scale_discrete.html
My problem is that I can't for the life of me work out what '0.6 units' actually means. I've manually measured the distance between the bars and the y axis in various design tools and gotten inconsistent answers, so I can't factor '0.6 units' into my calculations when working out what size the panel windows should be. Additionally I can't find any answers on how many 'units' long a discrete x axis is - I assumed at first it would be 1 unit per category but that doesn't fit with the visuals at all. I've included an image that hopefully shows what I mean - the two graphs
In this image, the top graph has a plot area exactly 1.5* that of the bottom graph. Seeing as it has 6 bars compared with 4, that would mean each bar is the same width, except that that extra space between the axis and the first bar messes this up. Setting expand = expansion(add = c(0, 0)) clears this up but results in not-so-pretty graphs. What I'd like is for the bars to be identical in width between the two plots, accounting for this extra space. I'm specifically looking for a general solution that I can use for future plots, not for the individual solution for this sample. As such, what I'd really like to know is how many 'units' long are these two x axes? Many thanks for any and all help!
Instead of using expansion for the axis, I would probably use the fact that categorical variables are actually plotted on the positive integers on Cartesian co-ordinates. This means that, provided you know the maximum number of columns you are going to use in your plots, you can set this as the range in coord_cartesian. There is a little arithmetic involved to keep the bars centred, but it should give consistent results.
We start with some reproducible data:
library(ggplot2)
set.seed(1)
df <- data.frame(group = letters[1:6], value = 100 * runif(6))
Now we set the value for the maximum number of bars we will need:
MAX_BARS <- 6
And the only thing "funny" about the plot code is the calculation of the x axis limits in coord_cartesian:
ggplot(df, aes(group, value)) +
geom_col() +
coord_cartesian(xlim = c(1 -(MAX_BARS - length(unique(df$group)))/2,
MAX_BARS - (MAX_BARS - length(unique(df$group)))/2))
Now let us remove one factor level and run the exact same plot code:
df <- df[-1,]
ggplot(df, aes(group, value)) +
geom_col() +
coord_cartesian(xlim = c(1 -(MAX_BARS - length(unique(df$group)))/2,
MAX_BARS - (MAX_BARS - length(unique(df$group)))/2))
And again:
df <- df[-1,]
ggplot(df, aes(group, value)) +
geom_col() +
coord_cartesian(xlim = c(1 -(MAX_BARS - length(unique(df$group)))/2,
MAX_BARS - (MAX_BARS - length(unique(df$group)))/2))
And again:
df <- df[-1,]
ggplot(df, aes(group, value)) +
geom_col() +
coord_cartesian(xlim = c(1 -(MAX_BARS - length(unique(df$group)))/2,
MAX_BARS - (MAX_BARS - length(unique(df$group)))/2))
You will see the bars remain constant width and centralized, yet the panel size remains fixed.
Created on 2021-11-06 by the reprex package (v2.0.0)
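The limit arithmetic repeated in each coord_cartesian() call can be pulled into a small helper. This is only a sketch; centred_xlim is a hypothetical name, not a ggplot2 function.

```r
# Hypothetical helper wrapping the limit arithmetic from the plot code.
# Discrete categories sit at x = 1, 2, ..., n_bars.
centred_xlim <- function(n_bars, max_bars) {
  pad <- (max_bars - n_bars) / 2   # empty slots to add on each side
  c(1 - pad, max_bars - pad)
}

centred_xlim(6, 6)  # 1 6
centred_xlim(4, 6)  # 0 5
centred_xlim(3, 6)  # -0.5 4.5
```

The span of the limits is always MAX_BARS - 1 and its midpoint is always the midpoint of the bars, which is why the bar width stays constant and the bars stay centred. It would be used as coord_cartesian(xlim = centred_xlim(length(unique(df$group)), MAX_BARS)).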

R - Bar Plot with transparency based on values?

I have a dataset myData which contains x and y values for various Samples. I can create a line plot for a dataset which contains a few Samples with the following pseudocode, and it is a good way to represent this data:
myData <- data.frame(x = 290:450, X52241 = c(..., ..., ...), X75123 = c(..., ..., ...))
myData <- myData %>% gather(Sample, y, -x)
ggplot(myData, aes(x, y)) + geom_line(aes(color=Sample))
Which generates:
This turns into a Spaghetti Plot when I have a lot more Samples added, which makes the information hard to understand, so I want to represent the "hills" of each sample in another way. Preferably, I would like to represent the data as a series of stacked bars, one for each myData$Sample, with transparency inversely related to what is in myData$y. I've tried to represent that data in photoshop (badly) here:
Is there a way to do this? Creating faceted plots using facet_wrap() or facet_grid() doesn't give me what I want (far too many Samples). I would also be open to stacked ridgeline plots using ggridges, but I am not understanding how I would be able to convert absolute values to a stat(density) value needed to plot those.
Any suggestions?
Thanks to u/Joris for the helpful suggestion! Since I did not find this question elsewhere, I'll go ahead and post the pretty simple solution to my question here for others to find.
Basically, I needed to apply the alpha aesthetic via aes(alpha=y, ...). In theory, I could apply this over any geom. I tried geom_col(), which worked, but the best solution was to use geom_segment(), since all my "bars" were going to be the same length. Also note that I had to "slice" up the segments in order to avoid the problem of overplotting similar to those found here, here, and here.
ggplot(myData, aes(x, Sample)) +
geom_segment(aes(x=x, xend=x-1, y=Sample, yend=Sample, alpha=y), color='blue3', size=14)
That gives us the nice gradient:
Since the max y values are not the same for both lines, if I wanted to "match" the intensity I normalized the data (myDataNorm) and could make the same plot. In my particular case, I kind of preferred bars that did not have a gradient, but which showed a hard edge for the maximum values of y. Here was one solution:
ggplot(myDataNorm, aes(x, Sample)) +
geom_segment(aes(x=x, xend=x-1, y=Sample, yend=Sample, alpha=ifelse(y>0.9,1,0))) +
theme(legend.position='none')
Better, but I did not like the faint-colored areas that were left. The final code is what gave me something that perfectly captured what I was looking for. I simply moved the ifelse() statement to apply to the x aesthetic, so the parts of the segment drawn were only those with high enough y values. Note my data "starts" at x=290 here. Probably more elegant ways to combine those x and xend terms, but whatever:
ggplot(myDataNorm, aes(x, Sample)) +
geom_segment(aes(
x=ifelse(y>0.9,x,290), xend=ifelse(y>0.9,x-1,290),
y=Sample, yend=Sample), color='blue3', size=14) +
xlim(290,400) # needed to show entire scale
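The normalization that produced myDataNorm is not shown above. One way to do it, assuming each Sample is scaled by its own maximum y (the toy values below are made up for illustration and are not the real data):

```r
# Toy long-format data in the same shape as myData after gather().
myData <- data.frame(
  x      = rep(290:294, times = 2),
  Sample = rep(c("X52241", "X75123"), each = 5),
  y      = c(1, 3, 5, 3, 1, 2, 4, 10, 4, 2)
)

# Divide each y by the maximum y within its own Sample, so every
# sample peaks at exactly 1.
myDataNorm <- myData
myDataNorm$y <- ave(myData$y, myData$Sample, FUN = function(v) v / max(v))

tapply(myDataNorm$y, myDataNorm$Sample, max)
```

With every sample peaking at 1, a fixed threshold such as y > 0.9 means the same thing for each bar.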

Varying dotsizes in ggplot2's geom_dotplot

I'm trying to make a dotplot where numerical y values are grouped according to character variables. That works fine, but I also want to change the sizes of the dots according to another variable, so that there are three different sizes of dots in the plot. I can change the dot sizes, it's just that R doesn't seem to be getting it right.
I couldn't find a good sample dataset, so I've made a quick example:
#Making some sufficient data:
y1 <- c(1,1,2,3,4,5,6,6)
x1 <- c('A','A','B','C','A','A','B','B')
size1 <- c(0.3,0.3,0.3,0.3,0.3,0.6,0.6,1.0)
data1 <- data.frame(x1,y1,size1)
data1
#define size as a vector: apparently it helps some problems
size2 <- data1$size1
#plot my dotplot!
ggplot(data1, aes(x=x1,y=y1)) +
geom_dotplot(binaxis="y", stackdir="center", dotsize=size2)
Overall, the dotplot works fine. The y variables are grouped according to their group of A, B, or C. However, the dotsizes are incorrect: The only dot in group C should be small (dotsize=0.3), the two dots at y=1 of group A should both be of equal size... and so on.
Dotplot with all sorts of dotsize inaccuracies
The question 'geom_dotplot dot sizes change when plotting different datasets in loop' said that the dotsize of geom_dotplot isn't exactly a dot size, but is relative to the bin width. That could explain why I'm having trouble. However, I'm unsure of how to fix this. Is there a way to reliably vary dot sizes in ggplot2's dotplots, or should I try making a dotplot with a more flexible tool than geom_dotplot? (Restarting R and my computer didn't help.)
Cheers!
The Stack Overflow thread you shared clarifies what you can do with geom_dotplot. If you add a binwidth parameter, you can see the effect of dotsize. Here is an example:
base <- ggplot(data1, aes(x=x1,y=y1))
base + geom_dotplot(binaxis="y", stackdir="center", dotsize=size1, binwidth = 1)
Output
Using geom_point instead of geom_dotplot should solve the problem:
ggplot(data1, aes(x=x1,y=y1)) +
geom_point(aes(size=size1))
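One caveat with geom_point is that points sharing the same x and y (such as the two at y = 1 in group A) are drawn on top of each other, since nothing stacks them the way geom_dotplot does. A possible workaround, sketched here with an assumed jitter width and size range, is to nudge the points horizontally:

```r
library(ggplot2)

data1 <- data.frame(
  x1    = c("A", "A", "B", "C", "A", "A", "B", "B"),
  y1    = c(1, 1, 2, 3, 4, 5, 6, 6),
  size1 = c(0.3, 0.3, 0.3, 0.3, 0.3, 0.6, 0.6, 1.0)
)

p <- ggplot(data1, aes(x = x1, y = y1)) +
  # Jitter only along x so the y values stay exact.
  geom_point(aes(size = size1),
             position = position_jitter(width = 0.1, height = 0)) +
  # Map size1 onto a visible range of point sizes (values assumed).
  scale_size_continuous(range = c(2, 6))
print(p)
```

Here size is a mapped aesthetic, so each row's size1 value follows its own point regardless of how the data are ordered.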

Plotting percent change for a large number of factors on same figure using ggplot by faceting or color-coding factors

Here is an example of the code I'm working with
x<-as.factor(rep(c("tree_mean","tree_qmean","tree_skew"),3))
factor<-c(rep("mfn2_burned_99",3),rep("mfna_burned_5_7",3),rep("mfna_burned_5_7_10_12",3))
y<-c(0.336457409,-0.347422910,-0.318945621,1.494109367, 0.003578698,-0.019985780,-0.484171146, 0.611589217,-0.322292664)
dat<-data.frame(x,factor,y) # data.frame() keeps y numeric; cbind() would coerce every column to character
head(dat)
x           factor                 y
tree_mean   mfn2_burned_99         -0.3364574
tree_qmean  mfn2_burned_99         -0.3474229
tree_skew   mfn2_burned_99         -0.3189456
tree_mean   mfna_burned_5_7        -0.8269814
tree_qmean  mfna_burned_5_7        -0.8088810
tree_skew   mfna_burned_5_7        -2.5429226
tree_mean   mfna_burned_5_7_10_12  -0.8601206
tree_qmean  mfna_burned_5_7_10_12  -0.8474920
tree_skew   mfna_burned_5_7_10_12  -2.9854178
I am trying to plot how much x deviates from 0, and facet it by each factor, as so:
ggplot(dat) +
geom_point(aes(x=x,y=y),shape=1,size=3)+
geom_linerange(aes(x=x,ymin=0,ymax=y))+
geom_hline(yintercept=0)+
facet_grid(factor~.)
This works fine when I have three factors (ignore the *: I had a significance column which I have since removed).
Example below:
However, I have 8 factors in total, and faceting obscures the plot such that the distance from zero for each x value gets very distorted.
Example below
So, my question is this: what would be a better way of coding/rendering this plot given my large number of x values and factors using faceting or color coding by factor in ggplot??
I would be very open to color-coding each distance for x by factor rather than faceting, but I have been beating my head against the wall trying to figure out how to even do that in ggplot (very new to ggplot), so I can't yet say if it would make the figure much more interpretable.
One option as you note is to color your point and/or linerange by a factor. You can then use position_dodge to move the points slightly on the x axis.
For example:
ggplot(dat, aes(color = factor)) +
geom_point(aes(x=x,y=y),shape=1,size=3, position = position_dodge(width = 0.5))+
geom_linerange(aes(x=x,ymin=0,ymax=y), position = position_dodge(width =0.5))+
geom_hline(yintercept=0)
I think this would still be difficult with many factors, but with 8 it might suit your purposes.

R graphic: Shifting values of different series so that error bars do not overlap

Here is the code:
set.seed (12)
library(ggplot2)
dat = data.frame(a=runif(40,0,1),b=c('a','b','c','d','e'),c=c('Hi','Hello'))
ggplot(dat,aes(x=b,y=a,shape=factor(c))) + stat_summary(fun.data=mean_cl_normal)
The graph it creates has error bars that overlap, so it is hard to distinguish their limits. I've often seen graphs where the different series (given by the factor c) are shifted slightly horizontally so that the error bars do not overlap. Is there a way to achieve this in R when using a categorical variable on x?
Thank you
You can use something like position_dodge():
ggplot(dat,aes(x=b,y=a,shape=factor(c))) +
stat_summary(fun.data=mean_cl_normal, position=position_dodge(width=0.2))
Example plot:

Resources