Overlay points (and error bars) over bar plot with position_dodge - r

I have been trying to look for an answer to my particular problem but I have not been successful, so I have just made a MWE to post here.
I tried the answers here with no success.
The task I want to do seems easy enough, but I cannot figure it out, and the results I get are making me have some fundamental questions...
I just want to overlay points and error bars on a bar plot, using ggplot2.
I have a long format data frame that looks like the following:
mydf <- data.frame(cell=paste0("cell", rep(1:3, each=12)),
                   scientist=paste0("scientist", rep(rep(rep(1:2, each=3), 2), 3)),
                   timepoint=paste0("time", rep(rep(1:2, each=6), 3)),
                   rep=paste0("rep", rep(1:3, 12)),
                   value=runif(36)*100)
I have attempted to get the plot I want the following way:
library(ggplot2)
library(RColorBrewer) # provides brewer.pal()
myPal <- brewer.pal(3, "Set2")[1:2]
myPal2 <- brewer.pal(3, "Set1")
outfile <- "test.pdf"
pdf(file=outfile, height=10, width=10)
print(#or ggsave()
ggplot(mydf, aes(cell, value, fill=scientist )) +
geom_bar(stat="identity", position=position_dodge(.9)) +
geom_point(aes(cell, color=rep), position=position_dodge(.9), size=5) +
facet_grid(timepoint~., scales="free_x", space="free_x") +
scale_y_continuous("% of total cells") +
scale_fill_manual(values=myPal) +
scale_color_manual(values=myPal2)
)
dev.off()
But I obtain this:
The problem is, there should be 3 "rep" values per "scientist" bar, but the values are ordered by "rep" instead (they should be 1,2,3,1,2,3, instead of 1,1,2,2,3,3).
Besides, I would like to add error bars with geom_errorbar but I didn't manage to get a working example...
Furthermore, overlaying the actual value points on the bars makes me wonder what is actually being plotted here: whether the values are taken properly for each bar, and why the max value (or so it seems) is plotted by default.
The way I think this should be properly plotted is with the median (or mean), adding the error bars like the whiskers in a boxplot (min and max value).
Any idea how to...
... have the "rep" value points appear in proper order?
... change the value shown by the bars from max to median?
... add error bars with max and min values?

I restructured your plotting code a little to make things easier.
The secret is to use proper grouping (which is otherwise inferred from fill and color). Also, since you're dodging on multiple levels, dodge2 has to be used.
When you are unsure about "what is plotted where" in bar/column charts, it always helps to add the option color="black", which reveals that things are still stacked on top of each other because of your use of dodge instead of dodge2.
p = ggplot(mydf, aes(x=cell, y=value, group=paste(scientist,rep))) +
geom_col(aes(fill=scientist), position=position_dodge2(.9)) +
geom_point(aes(cell, color=rep), position=position_dodge2(.9), size=5) +
facet_grid(timepoint~., scales="free_x", space="free_x") +
scale_y_continuous("% of total cells") +
scale_fill_brewer(palette = "Set2")+
scale_color_brewer(palette = "Set1")
ggsave(filename = outfile, plot=p, height = 10, width = 10)
gives:
Regarding error bars
Since there are only three replicates, I would show the original data points and maybe a violin plot. For completeness' sake I also added a geom_errorbar.
ggplot(mydf, aes(x=cell, y=value,group=paste(cell,scientist))) +
geom_violin(aes(fill=scientist),position=position_dodge(),color="black") +
geom_point(aes(cell, color=rep), position=position_dodge(0.9), size=5) +
geom_errorbar(stat="summary",position=position_dodge())+
facet_grid(timepoint~., scales="free_x", space="free_x") +
scale_y_continuous("% of total cells") +
scale_fill_brewer(palette = "Set2")+
scale_color_brewer(palette = "Set1")
gives
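The code above does not show how to make the bars display the median with min/max whiskers, as asked in the question. One possible sketch is to summarise the replicates first and plot the summary, with the raw points on top. This is only an illustration under stated assumptions: the names summ, med, lo, hi and dodge are made up here, and the seed is fixed only so that the random example is reproducible. It also suggests why the original plot seemed to show the max: the three replicate bars of each scientist are drawn at the same dodged position, so only the tallest one stays visible.

```r
library(ggplot2)

set.seed(42)  # assumption: fixed seed for a reproducible example
mydf <- data.frame(
  cell      = paste0("cell", rep(1:3, each = 12)),
  scientist = paste0("scientist", rep(rep(rep(1:2, each = 3), 2), 3)),
  timepoint = paste0("time", rep(rep(1:2, each = 6), 3)),
  rep       = paste0("rep", rep(1:3, 12)),
  value     = runif(36) * 100
)

# One row per bar: median, min and max over the three replicates
summ <- aggregate(value ~ cell + scientist + timepoint, data = mydf,
                  FUN = function(v) c(median(v), min(v), max(v)))
summ <- do.call(data.frame, summ)        # flatten the matrix column
names(summ)[4:6] <- c("med", "lo", "hi") # hypothetical column names

# Use the same dodge for every layer so bars, whiskers and points line up
dodge <- position_dodge(width = 0.9)

p <- ggplot(summ, aes(x = cell, fill = scientist)) +
  geom_col(aes(y = med), position = dodge, color = "black") +
  geom_errorbar(aes(ymin = lo, ymax = hi), position = dodge, width = 0.3) +
  geom_point(data = mydf, aes(y = value, group = scientist),
             position = dodge, size = 2) +
  facet_grid(timepoint ~ .) +
  scale_y_continuous("% of total cells") +
  scale_fill_brewer(palette = "Set2")
```

Printing p (or passing it to ggsave) draws one median bar per scientist and cell, with whiskers spanning the replicate range.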
Update after comment
As I mentioned in my comment below, the stacking of the percentages leads to an undesirable outcome.
ggplot(mydf, aes(x=paste(cell, scientist), y=value)) +
geom_bar(aes(fill=rep),stat="identity", position=position_stack(),color="black") +
geom_point(aes(color=rep), position=position_dodge(.9), size=3) +
facet_grid(timepoint~., scales="free_x", space="free_x") +
scale_y_continuous("% of total cells") +
scale_fill_brewer(palette = "Set2")+
scale_color_brewer(palette = "Set1")

Related

How I can correctly overlap bar and linechart together

I am using the code below:
p <- ggplot() +
geom_bar(data=filter(df, variable=="LA"), aes(x=Gen, y=Mean, fill=Leaf),
stat="identity", position="dodge")+
geom_point(data=filter(df, variable=="TT"),aes(x=Gen, y=Mean, colour=Leaf))+
geom_line(data=filter(df, variable=="TT"), aes(x=Gen, y=Mean, group=Leaf))+
ggtitle("G")+xlab("Genotypes")+ylab("Canopy temperature")+
scale_fill_hue(name="", labels=c("Leaf-1", "Leaf-2", "Leaf-3"))+
scale_y_continuous(sec.axis=sec_axis(~./20, name="2nd Y-axis"))+
theme(axis.text.x=element_text(angle=90, hjust=1), legend.position="top")
Graph produced by the code above:
I want a graph like this:
Data:
https://docs.google.com/spreadsheets/d/1Fjmg-l0WTL7jhEqwwtC4RXY_9VQV9GOBliFq_3G1f8I/edit#gid=0
From the data, I want the variable LA on the left axis and TT on the right axis.
The part above is resolved.
Now I am trying to put error bars on the bar graph with the code below, but it causes an error. Can someone have a look and suggest a solution?
p + geom_errorbar(aes(ymin=Mean-se, ymax=Mean+se), width=0.5,
position=position_dodge(0.9), colour="black", size=.7)
For this you need to understand that even though you have a second Y-axis, it is just a set of labels: everything drawn on the graph is still based on the main Y-axis (the left one).
So you need to do two things:
Convert anything that should refer to the second Y-axis to the scale of the left axis, which in this case is the bar scale (the LA variable), whose maximum is 15. So you need to divide the TT values by 20.
Label the second axis correctly, so that it shows the main Y-axis values multiplied by 20.
p <- ggplot() +
geom_bar(data=filter(df, variable=="LA"), aes(x=Gen, y=Mean, fill=Leaf),
stat="identity", position="dodge") +
# values are divided by 20 to be in the same value range of bar graph
geom_point(data=filter(df, variable=="TT"),aes(x=Gen, y=Mean/20, colour=Leaf))+
geom_line(data=filter(df, variable=="TT"), aes(x=Gen, y=Mean/20, group=Leaf))+
ggtitle("G")+xlab("Genotypes")+ylab("Canopy temperature")+
scale_fill_hue(name="", labels=c("Leaf-1", "Leaf-2", "Leaf-3"))+
# second axis is multiply by 20 to reflect the actual value of lines & points
scale_y_continuous(
sec.axis=sec_axis(trans = ~ . * 20, name="2nd Y-axis",
breaks = c(0, 100, 200, 300))) +
theme(axis.text.x=element_text(angle=90, hjust=1), legend.position="top")
The error bars below are very basic. You will need to adjust the theme and the graph to make them look good.
p + geom_errorbar(data = filter(df, variable=="TT"),
aes(x = Gen, y=Mean/20, ymin=(Mean-se)/20,
ymax=(Mean+se)/20), width=0.5,
position=position_dodge(0.9), colour="black", size=.7)
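The scaling logic can be checked with plain arithmetic. The sketch below uses made-up numbers (la_mean, tt_mean and tt_se are hypothetical stand-ins for the spreadsheet values) and derives the factor from the data instead of hard-coding 20. It is one way to pick the factor, not the only one.

```r
# Hypothetical values standing in for the spreadsheet data
la_mean <- c(12.1, 14.8, 9.6)   # bar variable, drawn on the primary axis
tt_mean <- c(250, 290, 205)     # line variable, meant for the secondary axis
tt_se   <- c(8, 12, 5)          # standard errors of the line variable

# Ratio of the two ranges; the code above hard-codes 20 for the same purpose
k <- max(tt_mean + tt_se) / max(la_mean)

# What is actually drawn: everything expressed in primary-axis units,
# including both ends of the error bars
tt_drawn      <- tt_mean / k
tt_drawn_low  <- (tt_mean - tt_se) / k
tt_drawn_high <- (tt_mean + tt_se) / k

# What the secondary axis prints: primary-axis positions multiplied back by k
tt_labelled <- tt_drawn * k
```

The point of the round trip is that dividing the data by k and multiplying the axis labels by k cancel out, so the right-hand axis reads in the original TT units.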
One final note: please consider reading the error message, understanding what it says, and consulting the help pages of the packages and functions in R, so you can learn how to write all the code yourself.

Percentages in faceted histogram with scale_y_continuous()

I am trying to use scale_y_continuous() with a faceted histogram and running into an issue. I am hoping to get each count to be a percentage instead. My code is:
ggplot(d, aes(x = likely_att)) +
geom_histogram(binwidth = 0.5, color = "black") +
facet_wrap(~married, scales = "free_y") +
theme_classic() +
scale_y_continuous(labels = percent_format())
It looks like the distributions themselves are accurate, but the scaling is off: the percentages are "200 000%", "5 000%", etc. and that seems wrong, but I'm not quite sure why it's happening.
There are many more "yes" than "no" or "separated" married values in my dataset, which is why I use scales = "free_y" and why I'm hoping to just have percentages shown and only need one axis value shown.
I can't share this exact data for privacy reasons, but the likely_att variable is just a 1-5 numeric var, and married is a character var with 3 values: yes, no, separated.
In case it's helpful, I basically want it to look just like this image, but with percentages instead of counts, so I can just have one single y axis on the far left with 0 - 100 %
The problem is that using the percent_format() function changes the way the labels are printed, but it doesn't actually rescale the numbers. To do that, you could use the computed density variable and multiply it by the bin width, which gives the proportion of observations in each bin, and then apply the percent formatting.
ggplot(d, aes(x = likely_att)) +
stat_bin(aes(y=..density..*.5, group = married),
binwidth = 0.5, color = "black") +
facet_wrap(~married, scales = "free_y") +
theme_classic() +
scale_y_continuous(labels = percent_format())
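Why density times bin width is a proportion can be verified outside ggplot2. The sketch below uses base R's hist() on made-up data (the data frame d, the group probabilities and the bin edges are assumptions, since the real data is private) and checks that the rescaled bars of each group add up to 1.

```r
set.seed(1)
# Hypothetical stand-in for the private data
d <- data.frame(
  likely_att = sample(1:5, 300, replace = TRUE),
  married    = sample(c("yes", "no", "separated"), 300, replace = TRUE,
                      prob = c(0.7, 0.2, 0.1))
)

binwidth <- 0.5
breaks <- seq(0.75, 5.25, by = binwidth)  # assumed bin edges covering 1..5

# Per-group proportion in each bin: density * binwidth
prop_by_group <- lapply(split(d$likely_att, d$married), function(v) {
  h <- hist(v, breaks = breaks, plot = FALSE)
  h$density * binwidth
})

# Sum of the rescaled bar heights within each group
group_totals <- sapply(prop_by_group, sum)
```

In newer ggplot2 releases the ..density.. notation is deprecated in favour of after_stat(density); the arithmetic is the same.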

ggplot histogram: present both overall count in addition to group count in each bin

I am trying to generate a histogram using ggplot which on the x axis has speeds and on the y axis has the counts. In addition, each bin shows how many of those were during the day and night.
I need to present the counts themselves on the plot. I managed to add the counts within each bar but now I would like to present another number, the total count, on top of each bar. Is that possible?
This is my code:
ggplot(aes(x = speedmh ) , data = GPSdataset1hDFDS48) +
geom_histogram(aes(fill=DayActv), bins=15, colour="grey20", lwd=0.2) + ylim(0, 400) +xlim(0,500)+
stat_bin(bins=15, geom="text", colour="white", size=3.5,
aes(label=..count.., group=DayActv), position=position_stack(vjust=0.5))
and this is the result I get:
How do I add the total count of speeds within each bin to the top of every bar?
Ideally I would like to make this histogram of proportions of speeds instead of counts, but I think that is too complicated for me at the moment.
Thank you!!
Mia
One way is to add another stat_bin command without the grouping:
library(ggplot2)
ggplot(aes(x = speedmh) , data = GPSdataset1hDFDS48) +
geom_histogram(aes(fill=DayActv), bins=15, colour="grey20", lwd=0.2) + ylim(0, 400) +
xlim(0,500) +
stat_bin(bins=15, geom="text", colour="white", size=3.5,
aes(label=..count.., group=DayActv), position=position_stack(vjust=0.5)) +
stat_bin(bins=15, geom="text", colour="black", size=3.5,
aes(label=..count..), vjust=-0.5)
Data:
GPSdataset1hDFDS48 <- data.frame(speedmh=rexp(1000, 0.015), DayActv=factor(sample(0:1, 1000,TRUE)))
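The relation between the two text layers can be checked by hand: the ungrouped count in a bin is the sum of the grouped counts. The sketch below bins the example data with cut(), using 15 equal-width bins from 0 to 500 as an assumption (stat_bin(bins = 15) chooses its own edges, so the numbers need not match the plot exactly). It also computes the proportions mentioned in the question.

```r
set.seed(1)  # assumption: fixed seed for a reproducible example
GPSdataset1hDFDS48 <- data.frame(speedmh = rexp(1000, 0.015),
                                 DayActv = factor(sample(0:1, 1000, TRUE)))

# Hypothetical fixed bin edges: 15 equal-width bins between 0 and 500
breaks  <- seq(0, 500, length.out = 16)
inrange <- subset(GPSdataset1hDFDS48, speedmh <= 500)  # mimic xlim(0, 500)

bin       <- cut(inrange$speedmh, breaks = breaks, include.lowest = TRUE)
per_group <- table(bin, inrange$DayActv)  # counts shown inside the bars
total     <- rowSums(per_group)           # counts shown on top of the bars

# Share of all observations that falls into each bin
prop_total <- total / sum(total)
```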

Scale geom_density to match geom_bar with percentage on y

Since I was confused about the math last time I tried asking this, here's another try. I want to combine a histogram with a smoothed distribution fit. And I want the y axis to be in percent.
I can't find a good way to get this result. Last time, I managed to find a way to scale the geom_bar to the same scale as geom_density, but that's the opposite of what I wanted.
My current code produces this output:
ggplot2::ggplot(iris, aes(Sepal.Length)) +
geom_bar(stat="bin", aes(y=..density..)) +
geom_density()
The density and bar y values match up, but the scaling is nonsensical. I want percentage on the y axes, not well, the density.
Some new attempts. We begin with a bar plot modified to show percentages instead of counts:
gg = ggplot2::ggplot(iris, aes(Sepal.Length)) +
geom_bar(aes(y = ..count../sum(..count..))) +
scale_y_continuous(name = "%", labels=scales::percent)
Then we try to add a geom_density to that and somehow get it to scale properly:
gg + geom_density()
gg + geom_density(aes(y=..count..))
gg + geom_density(aes(y=..scaled..))
gg + geom_density(aes(y=..density..))
Same as the first.
gg + geom_density(aes(y = ..count../sum(..count..)))
gg + geom_density(aes(y = ..count../n))
Seems to be off by about factor 10...
gg + geom_density(aes(y = ..count../n/10))
same as:
gg + geom_density(aes(y = ..density../10))
But ad hoc inserting numbers seems like a bad idea.
One useful trick is to inspect the calculated values of the plot. These are not normally saved in the object if one saves it. However, one can use:
gg_data = ggplot_build(gg + geom_density())
gg_data$data[[2]] %>% View
Since we know the density fit around x=6 should be about .04 (4%), we can look around for ggplot2-calculated values that get us there, and the only thing I see is density/10.
How do I get geom_density fit to scale to the same y axis as the modified geom_bar?
Bonus question: why are the grouping of the bars different? The current function does not have spaces in between bars.
Here is an easy solution:
library(scales) # ! important
library(ggplot2)
ggplot(iris, aes(Sepal.Length)) +
stat_bin(aes(y=..density..), breaks = seq(min(iris$Sepal.Length), max(iris$Sepal.Length), by = .1), color="white") +
geom_line(stat="density", size = 1) +
scale_y_continuous(labels = percent, name = "percent") +
theme_classic()
Output:
Try this
ggplot2::ggplot(iris, aes(x=Sepal.Length)) +
geom_histogram(stat="bin", binwidth = .1, aes(y=..density..)) +
geom_density()+
scale_y_continuous(breaks = c(0, .1, .2,.3,.4,.5,.6),
labels =c ("0", "1%", "2%", "3%", "4%", "5%", "6%") ) +
ylab("Percent of Irises") +
xlab("Sepal Length in Bins of .1 cm")
I think your first example is what you want; you just need to relabel the axis so that it reads as percentages, so do that rather than mess around.
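The factor of 10 that the question found by trial and error is the inverse of the bin width: with bins of width 0.1, a density of 0.4 corresponds to a proportion of 0.4 * 0.1 = 0.04, that is 4%. The base R sketch below checks this on iris; the bin edges are an assumption (bins of width 0.1 centred on the measured values), and the same factor is applied to a kernel density estimate to put the curve on the proportion scale.

```r
x <- iris$Sepal.Length
binwidth <- 0.1

# Assumed bin edges: width 0.1, centred on the recorded values
breaks <- seq(min(x) - binwidth / 2, max(x) + binwidth / 2, by = binwidth)
h <- hist(x, breaks = breaks, plot = FALSE)

# Bar heights as proportions of all observations
prop <- h$density * binwidth

# Kernel density estimate rescaled to the same proportion scale
d <- density(x)
curve_prop <- d$y * binwidth
```

In ggplot2 terms this would mean mapping y to the computed density multiplied by the bin width in both the histogram and the density layer, instead of dividing by an ad hoc 10.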

Duplicated xtick labels in ggplot facets

I have this data.frame which I want to plot in facets using ggplot + facet_wrap:
set.seed(1)
df <- data.frame(val=rnorm(36),
gt=c(sapply(c("wt","pd","md","bd"),function(x) rep(x,9))),
ts=rep(c(sapply(c("cb","hp","ac"),function(x) rep(x,3))),4),
col=c(sapply(c("darkgray","darkblue","darkred","darkmagenta"),function(x) rep(x,9))),
index=rep(1:9,4),
stringsAsFactors=F)
df$xlab <- paste(df$ts,df$index,sep=".")
df$gt <- factor(df$gt,levels=c("wt","pd","md","bd"))
Here's how I'm trying to plot:
require(ggplot2)
ggplot(df,aes(x=index,y=val,color=gt))+geom_point(size=3)+facet_wrap(~gt,ncol=4)+
scale_fill_manual(values=c("darkgray","darkblue","darkred","darkmagenta"),labels=levels(df$gt),name="gt",guide=F)+
scale_colour_manual(values=c("darkgray","darkblue","darkred","darkmagenta"),labels=levels(df$gt),name="gt",guide=F)+
labs(x="replicate",y="val")+scale_x_continuous(breaks=df$index,labels=df$xlab)+
theme_bw()+theme(axis.text=element_text(size=6),axis.title=element_text(size=7),legend.text=element_text(size=6),legend.key=element_blank(),panel.border=element_blank(),strip.background=element_blank())
Which gives:
The problem is that the x-axis tick labels repeat themselves, since I'm calling scale_x_continuous. How do I get it right with facet_wrap?
Use the actual x-values in xlab as the x aesthetic, along with scales="free_x" in facet_wrap and delete the call to scale_x_continuous. Note, however, that the axis labels are still the same in each panel, because they are the same for each level of gt in the data.
ggplot(df,aes(x=xlab, y=val, color=gt)) +
geom_point(size=3, show.legend=FALSE) +
facet_wrap(~gt, ncol=4, scales="free_x") +
# scale_fill_manual(values=c("darkgray","darkblue","darkred","darkmagenta"), labels=levels(df$gt), name="gt", guide=F) +
scale_colour_manual(values=c("darkgray","darkblue","darkred","darkmagenta")) +
labs(x="replicate", y="val") +
#scale_x_continuous(breaks=df$index, labels=df$xlab)+
theme_bw() +
theme(axis.text=element_text(size=8),
axis.title=element_text(size=7),
legend.text=element_text(size=6),
legend.key=element_blank(),
panel.border=element_blank(),
strip.background=element_blank())
Now let's change xlab, just to see how this works when different panels really do have different labels:
df$xlab[10:20] = LETTERS[1:11]
Now run the same plot code again to get the following:
One more contingency is the case where not all the panels have the same number of x-values. In that case, you can switch to facet_grid and add space="free_x" if you want the width of each panel to be proportional to the number of x-values in each panel.
ggplot(df[-c(1:5),], aes(x=xlab, y=val, color=gt)) +
geom_point(size=3, show.legend=FALSE) +
facet_grid(.~gt, space="free_x", scales="free_x") +
scale_colour_manual(values=c("darkgray","darkblue","darkred","darkmagenta")) +
labs(x="replicate", y="val") +
theme_bw() +
theme(axis.text=element_text(size=8),
axis.title=element_text(size=7),
legend.text=element_text(size=6),
legend.key=element_blank(),
panel.border=element_blank(),
strip.background=element_blank())
A few other things:
You don't need to add color names to your data frame. If you want to change the default colors, you can just set them using one of the scale_colour_*** functions (as you did in your code).
For future reference this c(sapply(c("darkgray","darkblue","darkred","darkmagenta"),function(x) rep(x,9))) can be changed to this rep(c("darkgray","darkblue","darkred","darkmagenta"), each=9).
You can remove the scale_fill_manual line, as you don't have a fill aesthetic in your graph.
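The rep(..., each=) shortcut can be checked directly. The sketch below compares both forms for the color column and applies the same idea to the ts column from the question; the variable names are made up for the comparison.

```r
cols <- c("darkgray", "darkblue", "darkred", "darkmagenta")

# Form used in the question vs. the shorter form suggested above
long_cols  <- c(sapply(cols, function(x) rep(x, 9)))
short_cols <- rep(cols, each = 9)
same_cols  <- identical(long_cols, short_cols)

# The same simplification applied to the ts column
ts_levels <- c("cb", "hp", "ac")
long_ts   <- rep(c(sapply(ts_levels, function(x) rep(x, 3))), 4)
short_ts  <- rep(rep(ts_levels, each = 3), times = 4)
same_ts   <- identical(long_ts, short_ts)
```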

Resources