Grouped geom_boxplot from calculated values with mean - r

I created some grouped boxplots: for each dimension on the x axis I am showing several groups. Because my dataset is quite large, I had to precalculate the values for the boxes, as ggplot did not have enough memory (I used ddply and did it in pieces).
I believe this is better than just bar charts of the averages, as it shows some of the variability.
I want two modifications. One was to not show the whisker lines, which I have done by setting ymin=lower and ymax=upper.
I also wanted to add the means, but they all appear in the center of each x category, and of course I want each one aligned with its box.
To make it easier for anyone helping, I recreated the same chart using mtcars. I tried position = "dodge" and "identity" with no change.
Does anyone know how to do this? I searched and did not find a way. I am also attaching a picture of my latest chart. Code is below:
library(plyr)     # for ddply
library(ggplot2)
data(mtcars)
data <- as.data.frame(mtcars)
data$cyl <- factor(data$cyl)
data$gear <- factor(data$gear)
summ <- ddply(data, .(cyl, gear),summarize, lower=quantile(mpg,probs=0.25,na.rm=T), middle=quantile(mpg,probs=.5,na.rm=T),upper=quantile(mpg,probs=.75,na.rm=T),avg=mean(mpg,na.rm=T))
p2 <- ggplot(summ, aes(x = cyl, lower = lower, middle = middle, upper = upper,fill=gear,ymin=lower,ymax=upper))+geom_boxplot(stat = "identity")
p2 <- p2 + geom_point(aes(x = cyl, y=avg, color=gear),color="red",position="dodge")
p2

The problem is that the width of the points is not the same as the width of the box plots. In that case you need to tell position_dodge what width to use. ?position_dodge gives a simple example of this using points and error bars, but the principle is the same for points and box plots. In your example, replacing position="dodge" with position=position_dodge(width=0.9) will dodge the points by the same amount as the box plots.
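As a minimal sketch of the replacement described above, the mtcars example could look like the following. It assumes ggplot2 is installed; the summary step uses base R instead of ddply only to keep the sketch dependency-light, and passing one shared `dodge` object to both layers is a choice made here for illustration, not something the original code does.

```r
library(ggplot2)

d <- mtcars
d$cyl <- factor(d$cyl)
d$gear <- factor(d$gear)

# One row per cyl/gear combination holding the precomputed box statistics.
groups <- split(d, list(d$cyl, d$gear), drop = TRUE)
summ <- do.call(rbind, lapply(groups, function(g) {
  data.frame(
    cyl    = g$cyl[1],
    gear   = g$gear[1],
    lower  = quantile(g$mpg, probs = 0.25, names = FALSE),
    middle = quantile(g$mpg, probs = 0.50, names = FALSE),
    upper  = quantile(g$mpg, probs = 0.75, names = FALSE),
    avg    = mean(g$mpg)
  )
}))

# A single position object, so boxes and points are dodged by the same width.
dodge <- position_dodge(width = 0.9)

p2 <- ggplot(summ, aes(x = cyl, fill = gear)) +
  geom_boxplot(
    aes(lower = lower, middle = middle, upper = upper,
        ymin = lower, ymax = upper),   # ymin/ymax on the box edges hide the whiskers
    stat = "identity", position = dodge
  ) +
  geom_point(aes(y = avg), colour = "red", position = dodge)
```

Printing `p2` draws the plot. The points are grouped through the inherited `fill = gear` mapping, which is what lets the dodge separate them.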

Related

How can I ensure consistent axis lengths between plots with discrete variables in ggplot2?

I've been trying to standardise multiple bar plots so that the bars are all identical in width regardless of the number of bars. Note that this is over multiple distinct plots - faceting is not an option. It's easy enough to scale the plot area so that, for instance, a plot with 6 bars is 1.5* the width of a plot with 4 bars. This would work perfectly, except that each plot has an expanded x axis by default, which I would like to keep.
"The defaults are to expand the scale by 5% on each side for continuous variables, and by 0.6 units on each side for discrete variables."
https://ggplot2.tidyverse.org/reference/scale_discrete.html
My problem is that I can't for the life of me work out what '0.6 units' actually means. I've manually measured the distance between the bars and the y axis in various design tools and gotten inconsistent answers, so I can't factor '0.6 units' into my calculations when working out what size the panel windows should be. Additionally I can't find any answers on how many 'units' long a discrete x axis is - I assumed at first it would be 1 unit per category but that doesn't fit with the visuals at all. I've included an image that hopefully shows what I mean - the two graphs
In this image, the top graph has a plot area exactly 1.5* that of the bottom graph. Seeing as it has 6 bars compared with 4, that would mean each bar is the same width, except that that extra space between the axis and the first bar messes this up. Setting expand = expansion(add = c(0, 0)) clears this up but results in not-so-pretty graphs. What I'd like is for the bars to be identical in width between the two plots, accounting for this extra space. I'm specifically looking for a general solution that I can use for future plots, not for the individual solution for this sample. As such, what I'd really like to know is how many 'units' long are these two x axes? Many thanks for any and all help!
Instead of using expansion for the axis, I would probably use the fact that categorical variables are actually plotted on the positive integers on Cartesian co-ordinates. This means that, provided you know the maximum number of columns you are going to use in your plots, you can set this as the range in coord_cartesian. There is a little arithmetic involved to keep the bars centred, but it should give consistent results.
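On the "0.6 units" part of the question: if the categories sit at the integers 1 to n, the quoted default expansion of 0.6 on each side would put the panel limits at 0.4 and n + 0.6. The sketch below only does that arithmetic. The helper name `default_discrete_span` is made up for illustration, and it assumes the default expansion and no layer reaching beyond those limits.

```r
# Hypothetical helper: limits and length of a default discrete axis
# with n_categories levels placed at 1, 2, ..., n_categories.
default_discrete_span <- function(n_categories, add = 0.6) {
  limits <- c(1 - add, n_categories + add)
  c(lower = limits[1], upper = limits[2], span = diff(limits))
}

six  <- default_discrete_span(6)
four <- default_discrete_span(4)

# Panel-width ratio that would keep one axis unit the same physical size
width_ratio <- six[["span"]] / four[["span"]]
```

Under these assumptions the 6-bar axis is 6.2 units long and the 4-bar axis 4.2 units, so the panel widths would need a ratio of 6.2 / 4.2 rather than 1.5 for the bars to match.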
We start with some reproducible data:
library(ggplot2)
set.seed(1)
df <- data.frame(group = letters[1:6], value = 100 * runif(6))
Now we set the value for the maximum number of bars we will need:
MAX_BARS <- 6
And the only thing "funny" about the plot code is the calculation of the x axis limits in coord_cartesian:
ggplot(df, aes(group, value)) +
geom_col() +
coord_cartesian(xlim = c(1 -(MAX_BARS - length(unique(df$group)))/2,
MAX_BARS - (MAX_BARS - length(unique(df$group)))/2))
Now let us remove one factor level and run the exact same plot code:
df <- df[-1,]
ggplot(df, aes(group, value)) +
geom_col() +
coord_cartesian(xlim = c(1 -(MAX_BARS - length(unique(df$group)))/2,
MAX_BARS - (MAX_BARS - length(unique(df$group)))/2))
And again:
df <- df[-1,]
ggplot(df, aes(group, value)) +
geom_col() +
coord_cartesian(xlim = c(1 -(MAX_BARS - length(unique(df$group)))/2,
MAX_BARS - (MAX_BARS - length(unique(df$group)))/2))
And again:
df <- df[-1,]
ggplot(df, aes(group, value)) +
geom_col() +
coord_cartesian(xlim = c(1 -(MAX_BARS - length(unique(df$group)))/2,
MAX_BARS - (MAX_BARS - length(unique(df$group)))/2))
You will see the bars remain constant width and centralized, yet the panel size remains fixed.
Created on 2021-11-06 by the reprex package (v2.0.0)
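The limit arithmetic repeated in each plot call above can be pulled into a small function. This is just a sketch; the name `centred_xlim` is invented here and is not part of ggplot2.

```r
# Hypothetical helper: x limits that keep n_bars bars centred in a panel
# sized for max_bars bars (bars sit at the integers 1..n_bars).
centred_xlim <- function(n_bars, max_bars) {
  stopifnot(n_bars >= 1, n_bars <= max_bars)
  pad <- (max_bars - n_bars) / 2
  c(1 - pad, max_bars - pad)
}

lims_6 <- centred_xlim(6, 6)
lims_4 <- centred_xlim(4, 6)
lims_3 <- centred_xlim(3, 6)
```

The range always spans `max_bars - 1` units and its midpoint is always `(n_bars + 1) / 2`, the centre of the bars. In the plots above it would be used as `coord_cartesian(xlim = centred_xlim(length(unique(df$group)), MAX_BARS))`.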

Geom_area plot doesn't fill the area between the lines

I want to make an area plot with ggplot(mpg, aes(x=year,y=hwy, fill=manufacturer)) + geom_area(), but I get this:
I'm really new to the R world. Can anyone explain why it does not fill the area between the lines? Thanks!
First of all, there's nothing wrong with your code. It's working as intended and you are correct in the syntax required to do what you are looking to do.
Why doesn't the area geom plot correctly, then? The simple answer is that the data are not shaped for it: in mpg, year takes only two values (1999 and 2008), and most manufacturers have several hwy values at each of them, so there is no single line per manufacturer to fill under. Try the geom_point plot and you'll see what I mean:
ggplot(mpg, aes(x=year,y=hwy)) + geom_point(aes(color=manufacturer))
You need a different dataset. Here's a dummy one that is simply two lines with different slopes. It works as expected because each group (z) has exactly one y value at every x value:
# dummy dataset
df <- data.frame(
x=rep(1:10,2),
y=c(seq(1,10,length.out=10), seq(1,5,length.out=10)),
z=c(rep('A',10), rep('B', 10))
)
# plot
ggplot(df, aes(x,y)) + geom_area(aes(fill=z))
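If the original data has to be used, one option is to collapse it to a single value per x and group first. The sketch below uses made-up numbers that only mimic the structure of mpg (repeated hwy values per year and manufacturer), and it assumes the mean is an acceptable summary.

```r
# Made-up data: several hwy observations per year/manufacturer pair
raw <- data.frame(
  year         = rep(c(1999, 2008), each = 4),
  manufacturer = rep(c("audi", "ford"), times = 4),
  hwy          = c(29, 17, 27, 19, 31, 20, 28, 18)
)

# One mean hwy per year and manufacturer, so each group forms a single line
agg <- aggregate(hwy ~ year + manufacturer, data = raw, FUN = mean)
```

`agg` can then be passed to `ggplot(agg, aes(year, hwy, fill = manufacturer)) + geom_area()`. With only two distinct years, each area is still just a straight-edged band.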

How to make half-whiskers in a ggplot2 line graph?

I make very slow progress in R but now I'm able to do some stuff.
Right now I'm plotting the effects of 4 treatments on plant growth in one graph. As you can see, the errorbars overlap, which is why I made them different colors. I think that, to make the graph clearer, it's better to use the lower errorbars as "half whiskers" for the lower two lines, and the upper errorbars for the top two lines (like I have now); see the attached image for reference.
Is that doable with the way my script is set up now?
Here is part of my script of the plot, I have a lot more but this is where I specify the plot itself (leaving out the aesthetics and stuff), thanks in advance:
"soda1" is my altered dataframe, setup in a clear way, "sdtv" are my standard deviations for each timepoint/treatment, "oppervlak" is my y variable and "Measuring Date" is my x variable. "Tray ID" is the treatment, so my grouping variable.
p <- ggplot(soda1, aes(x=reorder(`Measuring Date`, oppervlak), y=`oppervlak`, group=`Tray ID`, fill=`Tray ID`, colour = `Tray ID` )) +
scale_fill_brewer(palette = "Spectral") +
geom_errorbar(data=soda1, mapping=aes(ymin=oppervlak, ymax=oppervlak+sdtv, group=`Tray ID`), width=0.1) +
geom_line(aes(linetype=`Tray ID`)) +
geom_point(mapping=aes(x=`Measuring Date`, y=oppervlak, shape=`Tray ID`))
print(p)
Showing only one side of errorbars can hide an overlap in the uncertainty between the distribution of two or more variables or measurements.
Instead of hiding this overlap, you could adjust the position of your errorbars horizontally very easily by adding position=position_dodge(width=) to your call to geom_errorbar().
For example:
library(ggplot2)
# some random data with two factors
df <- data.frame(a=rep(1:10, times=2),
b=runif(20),
treat=as.factor(rep(c(0,1), each=10)),
errormax=runif(20),
errormin=runif(20))
# plotting both sides of the errorbars, but dodging them horizontally
p <- ggplot(data=df, aes(x=a, y=b, colour=treat)) +
geom_line() +
geom_errorbar(data=df, aes(ymin=b-errormin, ymax=b+errormax),
position=position_dodge(width=0.25))
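If one-sided bars are still wanted despite the caveat above, the bounds can be computed per treatment before plotting. The sketch below uses invented data as a stand-in for soda1; the column names and the choice of which treatments point downwards are assumptions made for illustration.

```r
# Invented stand-in for soda1: 4 treatments measured on 3 dates
d <- data.frame(
  date = rep(1:3, times = 4),
  tray = rep(c("A", "B", "C", "D"), each = 3),
  area = c(10, 14, 19,  9, 12, 16,  5, 7, 9,  4, 6, 8),
  sd   = 1
)

# Treatments whose half-whisker should point downwards (an assumption)
lower_trays <- c("C", "D")
points_down <- d$tray %in% lower_trays

# Each bar starts at the measured value and extends one sd in a single direction
d$ymin <- ifelse(points_down, d$area - d$sd, d$area)
d$ymax <- ifelse(points_down, d$area, d$area + d$sd)
```

The new columns would then go into `geom_errorbar(aes(ymin = ymin, ymax = ymax))` in place of the fixed `ymin`/`ymax` expressions.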

Reordering data based on a column in [r] to order x-value items from lowest to highest y-values in ggplot

I have a dataframe that I want to reorder to make a ggplot so I can easily see which items have the highest and lowest values in them. In my case, I've grouped the data into two groups, and it'd be nice to have a visual representation of which group tends to score higher. Based on this question I came up with:
library(ggplot2)
cor.data<- read.csv("https://dl.dropbox.com/s/p4uy6uf1vhe8yzs/cor.data.csv?dl=0",stringsAsFactors = F)
cor.data.sorted = cor.data[with(cor.data,order(r.val,pic)),] #<-- line that doesn't seem to be working
ggplot(cor.data.sorted,aes(x=pic,y=r.val,size=df.val,color=exp)) + geom_point()
which produces this:
I've tried quite a few variants to reorder the data, and I feel like this should be pretty simple to achieve. To clarify, if I had successfully reorganised the data, then the y-values would go up as the plot moves along the x-axis. So maybe I'm focusing on the wrong part of the code to achieve this in a ggplot figure?
You could do something like this?
library(tidyverse);
cor.data %>%
mutate(pic = factor(pic, levels = as.character(pic)[order(r.val)])) %>%
ggplot(aes(x = pic, y = r.val, size = df.val, color = exp)) + geom_point()
This obviously still needs some polishing to deal with the x axis label clutter etc.
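Sorting the rows has no effect here because the axis order of a discrete x follows its factor levels (alphabetical for a character column), not the row order. A small base R sketch on a made-up data frame shows how `reorder()` sets the levels from a second variable:

```r
# Made-up data standing in for cor.data
toy <- data.frame(
  pic   = c("a", "b", "c", "d"),
  r.val = c(0.3, -0.1, 0.8, 0.2)
)

# Levels of pic are now ordered by increasing r.val
toy$pic_sorted <- reorder(toy$pic, toy$r.val)
sorted_levels <- levels(toy$pic_sorted)
```

Mapping `x = pic_sorted` (or `x = reorder(pic, r.val)` directly inside `aes()`) would then place the items from lowest to highest r.val.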
Rather than trying to order the data before creating the plot, you can reorder it inside the plot call:
cor.data<- read.csv("https://dl.dropbox.com/s/p4uy6uf1vhe8yzs/cor.data.csv?dl=0",stringsAsFactors = F)
cor.data.sorted = cor.data[with(cor.data,order(r.val,pic)),] #<-- This line only controls the order in which points are drawn, for a (slightly) more readable plot
ggplot(cor.data.sorted,aes(x=reorder(pic,r.val),y=r.val,size=df.val,color=exp)) + geom_point()
to create

Wrong density values in a histogram with `fill` option in `ggplot2`

I was creating histograms with ggplot2 in R whose bins are separated with colors and noticed one thing. When the bins of a histogram are separated by colors with fill option, the density value of the histogram turns funny.
Here is the data.
set.seed(42)
x <- rnorm(10000,0,1)
df <- data.frame(x=x, b=x>1)
This is a histogram without fill.
ggplot(df, aes(x = x)) +
geom_histogram(aes(y=..density..))
This is a histogram with fill.
ggplot(df, aes(x = x, fill=b)) +
geom_histogram(aes(y=..density..))
You can see the latter is pretty crazy. The left side of the bins is sticking out. The density values of the bins of each color are obviously wrong.
I thought over this issue for a while. The data can't be wrong for the first histogram was normal. It should be something in ggplot2 or geom_histogram function. I googled "geom_histogram density fill" and couldn't find much help.
I want the end product to look like:
Separated by colors as you see in the second histogram
Size and shape identical to the first histogram
The vertical axis being density
How would you deal with this issue?
I think what you may want is this:
ggplot(df, aes(x = x, fill=b)) +
geom_histogram()
rather than the density. As mentioned above, the density requires extra calculations.
One thing that is important (in my opinion) is that histograms are graphs of one variable. As soon as you start adding data from other variables, you start to turn them into bar charts or something similar.
You will want to work on setting the axis manually if you want it to range from 0 to 0.4.
The solution is to hand-compute density like this (instead of using the built-in ggplot2 version):
library(ggplot2)
# Generate test data
set.seed(42)
x <- rnorm(10000,0,1)
df <- data.frame(x=x, b=x>1)
ggplot(df, aes(x = x, fill=b)) +
geom_histogram(mapping = aes(y = ..count.. / (sum(..count..) * ..width..)))
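The difference between the two scalings can be checked outside ggplot2. The base R sketch below is only an illustration of the arithmetic: the bin width of 0.25 is an arbitrary choice, and `hist()` is used purely to count observations per bin. Dividing each group's counts by its own total gives a density per group, while dividing by the overall total makes the stacked groups add up to the single-histogram density.

```r
set.seed(42)
x <- rnorm(10000, 0, 1)
b <- x > 1

# Common bins for all histograms (width 0.25 is an arbitrary choice)
breaks <- seq(floor(min(x)), ceiling(max(x)), by = 0.25)
width  <- diff(breaks)[1]

count_all   <- hist(x,     breaks = breaks, plot = FALSE)$counts
count_false <- hist(x[!b], breaks = breaks, plot = FALSE)$counts
count_true  <- hist(x[b],  breaks = breaks, plot = FALSE)$counts

# Per-group scaling: each group integrates to 1 on its own
per_group_true <- count_true / (sum(count_true) * width)

# Shared scaling: every count divided by the overall total
shared_false <- count_false / (length(x) * width)
shared_true  <- count_true  / (length(x) * width)
overall      <- count_all   / (length(x) * width)
```

With the shared scaling, `shared_false + shared_true` reproduces `overall` bin by bin, which is the shape of the unfilled histogram.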
When you provide a column name for the fill aesthetic in ggplot, it groups the data by that variable and plots each group in its own color.
If you want a single color for the plot, just specify the color you want:
FIXED
ggplot(df, aes(x = x)) +
geom_histogram(aes(y=..density..),fill="Blue")
