Displaying separate means within fill groups in ggplot boxplot - r

I have a grouped boxplot using data with 3 categories. One category is set as the x-axis of the boxplots, the other is set as the fill, and the last one, as a faceting category. I want to display the means for each fill group, but using stat_summary only gives me the mean for the x-axis category, without separating the means for the fill:
Here is the current code:
demoplot<-ggplot(demo,aes(x=variable,y=value))
demoplot+geom_boxplot(aes(fill=category2),position=position_dodge(.9))+
stat_summary(fun.y=mean, colour="black", geom="point", shape=18, size=4,) +
facet_wrap(~category1)
Is there any way to display the mean for each category2 without having to manually compute and plot the points? Adjusting the position dodge doesn't really help, as it's just one computed mean. Would creating conditions within the mean() function be advisable?
For anyone interested, here's the data:
Advanced thanks for any enlightenment on this.

Ggplot needs to have explicit information on grouping here. You can do that either by using a aes(group=....) in the desired layer, or moving the fill=... to the main call to ggplot. Without explicit grouping for a layer, ggplot will group by the factor on the x-axis. Here's some sample code with fake data:
library(ggplot2)
set.seed(123)
nobs <- 1000
dat <- data.frame(var1=sample(LETTERS[1:3],nobs, T),
var2=sample(LETTERS[1:2],nobs,T),
var3=sample(LETTERS[1:3],nobs,T),
y=rnorm(nobs))
p1 <- ggplot(dat, aes(x=var1, y=y)) +
geom_boxplot(aes(fill=var2), position=position_dodge(.9)) +
facet_wrap(~var3) +
stat_summary(fun.y=mean, geom="point", aes(group=var2), position=position_dodge(.9),
color="black", size=4)

Related

How to set background color for each panel in grouped boxplot?

I plotted a grouped boxplot and trying to change the background color for each panel. I can use panel.background function to change whole plot background. But how this can be done for individual panel? I found a similar question here. But I failed to adopt the code to my plot.
Top few lines of my input data look like
Code
p<-ggplot(df, aes(x=Genotype, y=Length, fill=Treatment)) + scale_fill_manual(values=c("#69b3a2", "#CF7737"))+
geom_boxplot(width=2.5)+ theme(text = element_text(size=20),panel.spacing.x=unit(0.4, "lines"),
axis.title.x=element_blank(),axis.text.x=element_blank(),axis.ticks.x=element_blank(),axis.text.y = element_text(angle=90, hjust=1,colour="black")) +
labs(x = "Genotype", y = "Petal length (cm)")+
facet_grid(~divide,scales = "free", space = "free")
p+theme(panel.background = element_rect(fill = "#F6F8F9", colour = "#E7ECF1"))
Unfortunately, like the other theme elements, the fill aesthetic of element_rect() cannot be mapped to data. You cannot just send a vector of colors to fill either (create your own mapping of sorts). In the end, the simplest solution probably is going to be very similar to the answer you linked to in your question... with a bit of a twist here.
I'll use mtcars as an example. Note that I'm converting some of the continuous variables in the dataset to factors so that we can create some more discrete values.
It's important to note, the rect geom is drawn before the boxplot geom, to ensure the boxplot appears on top of the rect.
ggplot(mtcars, aes(factor(carb), disp)) +
geom_rect(
aes(fill=factor(carb)), alpha=0.5,
xmin=-Inf, xmax=Inf, ymin=-Inf, ymax=Inf) +
geom_boxplot() +
facet_grid(~factor(carb), scales='free_x') +
theme_bw()
All done... but not quite. Something is wrong and you might notice this if you pay attention to the boxes on the legend and the gridlines in the plot panels. It looks like the alpha value is incorrect for some facets and okay for others. What's going on here?
Well, this has to do with how geom_rect works. It's drawing a box on each plot panel, but just like the other geoms, it's mapped to the data. Even though the x and y aesthetics for the geom_rect are actually not used to draw the rectangle, they are used to indicate how many of each rectangle are drawn. This means that the number of rectangles drawn in each facet corresponds to the number of lines in the dataset which exist for that facet. If 3 observations exist, 3 rectangles are drawn. If 20 observations exist for one facet, 20 rectangles are drawn, etc.
So, the fix is to supply a dataframe that contains one observation each for every facet. We have to then make sure that we supply any and all other aesthetics (x and y here) that are included in the ggplot call, or we will get an error indicating ggplot cannot "find" that particular column. Remember, even if geom_rect doesn't use these for drawing, they are used to determine how many observations exist (and therefore how many to draw).
rect_df <- data.frame(carb=unique(mtcars$carb)) # supply one of each type of carb
# have to give something to disp
rect_df$disp <- 0
ggplot(mtcars, aes(factor(carb), disp)) +
geom_rect(
data=rect_df,
aes(fill=factor(carb)), alpha=0.5,
xmin=-Inf, xmax=Inf, ymin=-Inf, ymax=Inf) +
geom_boxplot() +
facet_grid(~factor(carb), scales='free_x') +
theme_bw()
That's better.

Show mean values in boxplots in R

time_pic <- ggplot(data_box, aes(x=Kind, y=TimeTotal, fill=Sitting_Position)) +
geom_boxplot()
print(time_pic)
time_pic+labs(title="", x="", y = "Time (Sec)")
I ran the above codes to get the following image. But I don't know how to add average value for each boxplot on this image.
updated.
I tried this.
means <- aggregate(TimeTotal ~ Sitting_Position*Kind, data_box, mean)
ggplot(data=data_box, aes(x=Kind, y=TimeTotal, fill=Sitting_Position)) +
geom_boxplot() +
stat_summary(fun=mean, colour="darkred", geom="point", shape=18, size=3,show_guide = FALSE) +
geom_text(data = means, aes(label = TimeTotal, y = TimeTotal + 0.08))
This is what it looks like now. Two dots are on the same line. And two values are overlapping with each other.
As others said, you can share your dataset for more specific help, but in this case I think the point can be made using a dummy dataset. I'm creating one that looks pretty similar to your own in terms of naming, so theoretically you can just plug in this code and it could work.
The biggest thing you need here is to control how ggplot2 is separating the separate boxplots for the data_box$Sitting_Position that share the same data_box$Kind. The process of separating and spreading the boxes around that x= axis value is called "dodging". When you supply a fill= or color= (or other) aesthetic in aes() for that geom, ggplot2 knows enough that it will assume you also want to group the data according to that value. So, your initial ggplot() call has in aes() that fill=Sitting_Position, which means that geom_boxplot() "works" - it creates the separate boxes that are colored differently and which are "dodged" properly.
When you create the points and the text, ggplot2 has no idea that you want to "dodge" this data, and even if you did want to dodge, on what basis to use for the dodge, since the fill= aesthetic doesn't make sense for a text or point geom. How to fix this? The answer is to:
Supply a group= aesthetic, which can override the grouping of a fill= or color= aesthetic, but which also can serve as a basis for the dodging for geoms that do not have a similar aesthetic.
Specify more clearly how you want to dodge. This will be important for accurate positioning of all things you want to dodge. Otherwise, you will have things dodged, but maybe not the same distance.
Here's how I combined all that:
# the datasets
set.seed(1234)
data_box <- data.frame(
Kind=c(rep('Model-free AR',100),rep('Real-world',100)),
TimeTotal=c(rnorm(50,5.5,1),rnorm(50,5.43,1.1),rnorm(50,4.9,1),rnorm(50,4.7,0.2)),
Sitting_Position=rep(c(rep('face to face',50),rep('side by side',50)),2)
)
means <- aggregate(TimeTotal ~ Sitting_Position*Kind, data_box, mean)
# the plot
ggplot(data_box, aes(x=Kind, y=TimeTotal)) + theme_bw() +
# specifying dodge here and width to avoid overlapping boxes
geom_boxplot(
aes(fill=Sitting_Position),
position=position_dodge(0.6), width=0.5
) +
# note group aesthetic and same dodge call for next two objects
stat_summary(
aes(group=Sitting_Position),
position=position_dodge(0.6),
fun=mean,
geom='point', color='darkred', shape=18, size=3,
show.legend = FALSE
) +
geom_text(
data=means,
aes(label=round(TimeTotal,2), y=TimeTotal + 0.18, group=Sitting_Position),
position=position_dodge(0.6)
)
Giving you this:

Problems with making scale_colour_manual work for my multi-factor bargraph on ggplot2 in R

I am trying to make a manual colour scale for my bar graph using plyr to summarize the data and ggplot2 to present the graph.
The data has two variables:
Region (displayed on the X-axis)
Genotype (displayed by the fill)
I have managed to do this already, however, I have not been able to find a way to personalize the colours - it simply gives me two randomly assigned colours.
Could someone please help me figure out what I am missing here?
I have included my code and an image of the graph below. The graph basically has the appearance I want it to, except that I can't personalize the colours.
ggplotdata <- summarySE(data, measurevar="Density", groupvars=c("Genotype", "Region"))
ggplotdata
#Plot the data
ggplotdata$Genotype <- factor(ggplotdata$Genotype, c("WT","KO"))
Mygraph <-ggplot(ggplotdata, aes(x=Region, y=Density, fill=Genotype)) +
geom_bar(position=position_dodge(), stat="identity",
colour="black",
size=.2) +
geom_errorbar(aes(ymin=Density-se, ymax=Density+se),
width=.2,
position=position_dodge(.9)) +
xlab(NULL) +
ylab("Density (cells/mm2)") +
scale_colour_manual(name=NULL,
breaks=c("KO", "WT"),
labels=c("KO", "WT"),
values=c("#FFFFFF", "#3366FF")) +
ggtitle("X") +
scale_y_continuous(breaks=0:17*500) +
theme_minimal()
Mygraph
The answer here was to use scale_fill_manual instead, thank you #dc37

Merge two stacked bar graphs into one plot R (ggplot2) [duplicate]

I'm having quite the time understanding geom_bar() and position="dodge". I was trying to make some bar graphs illustrating two groups. Originally the data was from two separate data frames. Per this question, I put my data in long format. My example:
test <- data.frame(names=rep(c("A","B","C"), 5), values=1:15)
test2 <- data.frame(names=c("A","B","C"), values=5:7)
df <- data.frame(names=c(paste(test$names), paste(test2$names)), num=c(rep(1,
nrow(test)), rep(2, nrow(test2))), values=c(test$values, test2$values))
I use that example as it's similar to the spend vs. budget example. Spending has many rows per names factor level whereas the budget only has one (one budget amount per category).
For a stacked bar plot, this works great:
ggplot(df, aes(x=factor(names), y=values, fill=factor(num))) +
geom_bar(stat="identity")
In particular, note the y value maxes. They are the sums of the data from test with the values of test2 shown on blue on top.
Based on other questions I've read, I simply need to add position="dodge" to make it a side-by-side plot vs. a stacked one:
ggplot(df, aes(x=factor(names), y=values, fill=factor(num))) +
geom_bar(stat="identity", position="dodge")
It looks great, but note the new max y values. It seems like it's just taking the max y value from each names factor level from test for the y value. It's no longer summing them.
Per some other questions (like this one and this one, I also tried adding the group= option without success (produces the same dodged plot as above):
ggplot(df, aes(x=factor(names), y=values, fill=factor(num), group=factor(num))) +
geom_bar(stat="identity", position="dodge")
I don't understand why the stacked works great and the dodged doesn't just put them side by side instead of on top.
ETA: I found a recent question about this on the ggplot google group with the suggestion to add alpha=0.5 to see what's going on. It isn't that ggplot is taking the max value from each grouping; it's actually over-plotting bars on top of one another for each value.
It seems that when using position="dodge", ggplot expects only one y per x. I contacted Winston Chang, a ggplot developer about this to confirm as well as to inquire if this can be changed as I don't see an advantage.
It seems that stat="identity" should tell ggplot to tally the y=val passed inside aes() instead of individual counts which happens without stat="identity" and when passing no y value.
For now, the workaround seems to be (for the original df above) to aggregate so there's only one y per x:
df2 <- aggregate(df$values, by=list(df$names, df$num), FUN=sum)
p <- ggplot(df2, aes(x=Group.1, y=x, fill=factor(Group.2)))
p <- p + geom_bar(stat="identity", position="dodge")
p
I think the problem is that you want to stack within values of the num group, and dodge between values of num.
It might help to look at what happens when you add an outline to the bars.
library(ggplot2)
set.seed(123)
df <- data.frame(
id = 1:18,
names = rep(LETTERS[1:3], 6),
num = c(rep(1, 15), rep(2, 3)),
values = sample(1:10, 18, replace=TRUE)
)
By default, there are a lot of bars stacked - you just don't see that they're separate unless you have an outline:
# Stacked bars
ggplot(df, aes(x=factor(names), y=values, fill=factor(num))) +
geom_bar(stat="identity", colour="black")
If you dodge, you get bars that are dodged between values of num, but there may be multiple bars within each value of num:
# Dodged on 'num', but some overplotted bars
ggplot(df, aes(x=factor(names), y=values, fill=factor(num))) +
geom_bar(stat="identity", colour="black", position="dodge", alpha=0.1)
If you also add id as a grouping var, it'll dodge all of them:
# Dodging with unique 'id' as the grouping var
ggplot(df, aes(x=factor(names), y=values, fill=factor(num), group=factor(id))) +
geom_bar(stat="identity", colour="black", position="dodge", alpha=0.1)
I think what you want is to both dodge and stack, but you can't do both.
So the best thing is to summarize the data yourself.
library(plyr)
df2 <- ddply(df, c("names", "num"), summarise, values = sum(values))
ggplot(df2, aes(x=factor(names), y=values, fill=factor(num))) +
geom_bar(stat="identity", colour="black", position="dodge")

Issue with ggplot2, geom_bar, and position="dodge": stacked has correct y values, dodged does not

I'm having quite the time understanding geom_bar() and position="dodge". I was trying to make some bar graphs illustrating two groups. Originally the data was from two separate data frames. Per this question, I put my data in long format. My example:
test <- data.frame(names=rep(c("A","B","C"), 5), values=1:15)
test2 <- data.frame(names=c("A","B","C"), values=5:7)
df <- data.frame(names=c(paste(test$names), paste(test2$names)), num=c(rep(1,
nrow(test)), rep(2, nrow(test2))), values=c(test$values, test2$values))
I use that example as it's similar to the spend vs. budget example. Spending has many rows per names factor level whereas the budget only has one (one budget amount per category).
For a stacked bar plot, this works great:
ggplot(df, aes(x=factor(names), y=values, fill=factor(num))) +
geom_bar(stat="identity")
In particular, note the y value maxes. They are the sums of the data from test with the values of test2 shown on blue on top.
Based on other questions I've read, I simply need to add position="dodge" to make it a side-by-side plot vs. a stacked one:
ggplot(df, aes(x=factor(names), y=values, fill=factor(num))) +
geom_bar(stat="identity", position="dodge")
It looks great, but note the new max y values. It seems like it's just taking the max y value from each names factor level from test for the y value. It's no longer summing them.
Per some other questions (like this one and this one, I also tried adding the group= option without success (produces the same dodged plot as above):
ggplot(df, aes(x=factor(names), y=values, fill=factor(num), group=factor(num))) +
geom_bar(stat="identity", position="dodge")
I don't understand why the stacked works great and the dodged doesn't just put them side by side instead of on top.
ETA: I found a recent question about this on the ggplot google group with the suggestion to add alpha=0.5 to see what's going on. It isn't that ggplot is taking the max value from each grouping; it's actually over-plotting bars on top of one another for each value.
It seems that when using position="dodge", ggplot expects only one y per x. I contacted Winston Chang, a ggplot developer about this to confirm as well as to inquire if this can be changed as I don't see an advantage.
It seems that stat="identity" should tell ggplot to tally the y=val passed inside aes() instead of individual counts which happens without stat="identity" and when passing no y value.
For now, the workaround seems to be (for the original df above) to aggregate so there's only one y per x:
df2 <- aggregate(df$values, by=list(df$names, df$num), FUN=sum)
p <- ggplot(df2, aes(x=Group.1, y=x, fill=factor(Group.2)))
p <- p + geom_bar(stat="identity", position="dodge")
p
I think the problem is that you want to stack within values of the num group, and dodge between values of num.
It might help to look at what happens when you add an outline to the bars.
library(ggplot2)
set.seed(123)
df <- data.frame(
id = 1:18,
names = rep(LETTERS[1:3], 6),
num = c(rep(1, 15), rep(2, 3)),
values = sample(1:10, 18, replace=TRUE)
)
By default, there are a lot of bars stacked - you just don't see that they're separate unless you have an outline:
# Stacked bars
ggplot(df, aes(x=factor(names), y=values, fill=factor(num))) +
geom_bar(stat="identity", colour="black")
If you dodge, you get bars that are dodged between values of num, but there may be multiple bars within each value of num:
# Dodged on 'num', but some overplotted bars
ggplot(df, aes(x=factor(names), y=values, fill=factor(num))) +
geom_bar(stat="identity", colour="black", position="dodge", alpha=0.1)
If you also add id as a grouping var, it'll dodge all of them:
# Dodging with unique 'id' as the grouping var
ggplot(df, aes(x=factor(names), y=values, fill=factor(num), group=factor(id))) +
geom_bar(stat="identity", colour="black", position="dodge", alpha=0.1)
I think what you want is to both dodge and stack, but you can't do both.
So the best thing is to summarize the data yourself.
library(plyr)
df2 <- ddply(df, c("names", "num"), summarise, values = sum(values))
ggplot(df2, aes(x=factor(names), y=values, fill=factor(num))) +
geom_bar(stat="identity", colour="black", position="dodge")

Resources