I have two vectors. I want to make a barplot of the first vector (simple enough, right?). The twist is that each element of the second vector is the standard deviation of the corresponding element of the first vector (which is itself the average of 4 other values). How can I do that?
The vectors in question:
-4.6521175 0.145839723
1.1744100 0.342278694
-0.2581400 0.003776341
-0.3452675 0.073241199
-2.3823650 0.095008502
0.5625125 0.021627196
I.e., how can I add the elements of the second column vector as error bars to the corresponding elements in the first column vector?
Note: Before you ask, yes I did search extensively on this site and did a lot of googling, but my problem is a bit more specific, i.e. what I found didn't match what I needed.
I personally like arrows() best for this kind of graphic:
df <- data.frame(bar = c(-4.6521175, 1.1744100, -0.2581400, -0.3452675, -2.3823650, 0.5625125),
error = c(0.145839723, 0.342278694, 0.003776341, 0.073241199, 0.095008502, 0.021627196))
foo <- barplot(df$bar,ylim=c(-6,2),border=NA)
arrows(x0=foo,y0=df$bar+df$error,y1=df$bar-df$error,angle=90,code=3,length=0.1)
Two details:
border=NA in barplot() removes the borders around the bars, so you can actually see the error whiskers around the third bar. Since the third error is so small, the whisker lies pretty much on top of the bar border.
I used the length parameter in arrows() to reduce the width of the horizontal whiskers, which is especially relevant if we have larger numbers of bars. The default is length=0.25.
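The arrows() call works because barplot() returns the x positions of the bar midpoints, which is what foo holds. A minimal base-R sketch with the same numbers makes this visible (the names bar, err and mid are only illustrative):

```r
# Same values as in df above; 'bar', 'err' and 'mid' are illustrative names.
bar <- c(-4.6521175, 1.1744100, -0.2581400, -0.3452675, -2.3823650, 0.5625125)
err <- c(0.145839723, 0.342278694, 0.003776341, 0.073241199, 0.095008502, 0.021627196)

# barplot() draws the bars and returns the x coordinate of each bar's midpoint.
mid <- barplot(bar, ylim = c(-6, 2), border = NA)
print(as.vector(mid))

# One vertical segment per bar, from mean + sd down to mean - sd.
# angle = 90 and code = 3 turn both arrow heads into flat whisker caps.
arrows(x0 = mid, y0 = bar + err, x1 = mid, y1 = bar - err,
       angle = 90, code = 3, length = 0.1)
```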
However, note that "dynamite plots" have major disadvantages. You write that your data come from just four raw points for each bar. In such a case it would almost certainly be better to just plot a (jittered) dotplot of your raw data.
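The raw measurements are not given in the question, so the sketch below simulates four hypothetical values per bar from the reported means and standard deviations, purely to show what a jittered dotplot could look like with base R's stripchart(). Replace raw with your real data.

```r
set.seed(1)  # only so that the simulated example is reproducible
means <- c(-4.6521175, 1.1744100, -0.2581400, -0.3452675, -2.3823650, 0.5625125)
sds   <- c(0.145839723, 0.342278694, 0.003776341, 0.073241199, 0.095008502, 0.021627196)

# HYPOTHETICAL raw data: 4 simulated values per group
# (the real raw values are not part of the question).
raw <- data.frame(
  group = factor(rep(seq_along(means), each = 4)),
  value = rnorm(4 * length(means),
                mean = rep(means, each = 4),
                sd   = rep(sds, each = 4))
)

# Jittered dotplot of the raw values, one column of points per group.
stripchart(value ~ group, data = raw, vertical = TRUE, method = "jitter",
           jitter = 0.1, pch = 16, xlab = "group", ylab = "value")

# Mark each group's mean with a short horizontal line.
grp_means <- tapply(raw$value, raw$group, mean)
segments(x0 = seq_along(grp_means) - 0.2, y0 = grp_means,
         x1 = seq_along(grp_means) + 0.2, y1 = grp_means)
```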
An implementation with geom_bar and geom_errorbar of ggplot2:
library(ggplot2)
ggplot(df, aes(x=row.names(df), y=V1)) +
geom_bar(stat="identity", fill="grey") +
geom_errorbar(aes(ymin = V1 - V2, ymax = V1 + V2), width=0.6) +
theme_classic()
this results in:
If you want to remove the numbers on the x-axis, you can add:
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank())
to your ggplot code.
Used data:
df <- read.table(text="-4.6521175 0.145839723
1.1744100 0.342278694
-0.2581400 0.003776341
-0.3452675 0.073241199
-2.3823650 0.095008502
0.5625125 0.021627196", header=FALSE)
In response to your comment, two possible solutions when you want to plot such a large number of bars:
1: Only include a selection of the axis-labels:
ggplot(df2, aes(x=as.numeric(row.names(df2)), y=V1)) +
geom_bar(stat="identity", fill="grey", width=0.7) +
geom_errorbar(aes(ymin = V1 - V2, ymax = V1 + V2), width=0.5) +
scale_x_continuous(breaks=c(1,seq(10,200,10)), expand=c(0,0)) +
theme_classic() +
theme(axis.text.x=element_text(size = 6, angle = 90, vjust = 0.5))
this gives:
As can be seen, it is not ideal to cram so many bars into one plot; see alternative 2.
2: Create a grouping variable which you can use for creating facets:
df2$id <- rep(letters[1:20], each=10)
ggplot(df2, aes(x=as.numeric(row.names(df2)), y=V1)) +
geom_bar(stat="identity", fill="grey", width=0.7) +
geom_errorbar(aes(ymin = V1 - V2, ymax = V1 + V2), width=0.5) +
scale_x_continuous(breaks=as.numeric(row.names(df2))) +
facet_wrap(~ id, scales = "free_x") +
theme_bw() +
theme(axis.text.x=element_text(angle = 90, vjust = 0.5))
this gives:
Used data for the two last examples:
df2 <- data.frame(V1=sample(df$V1, 200, replace=TRUE),
V2=sample(df$V2, 200, replace=TRUE))
I am using the code below:
p <- ggplot() +
geom_bar(data=filter(df, variable=="LA"), aes(x=Gen, y=Mean, fill=Leaf),
stat="identity", position="dodge")+
geom_point(data=filter(df, variable=="TT"),aes(x=Gen, y=Mean, colour=Leaf))+
geom_line(data=filter(df, variable=="TT"), aes(x=Gen, y=Mean, group=Leaf))+
ggtitle("G")+xlab("Genotypes")+ylab("Canopy temperature")+
scale_fill_hue(name="", labels=c("Leaf-1", "Leaf-2", "Leaf-3"))+
scale_y_continuous(sec.axis=sec_axis(~./20, name="2nd Y-axis"))+
theme(axis.text.x=element_text(angle=90, hjust=1), legend.position="top")
This is the graph produced by the code above.
I want a graph like this:
data
https://docs.google.com/spreadsheets/d/1Fjmg-l0WTL7jhEqwwtC4RXY_9VQV9GOBliFq_3G1f8I/edit#gid=0
From the data, I want the variable LA on the left axis and TT on the right axis.
The above part is resolved.
Now I am trying to put error bars on the bar graph with the code below, but it causes an error. Can someone have a look and suggest a solution?
p + geom_errorbar(aes(ymin=Mean-se, ymax=Mean+se), width=0.5,
position=position_dodge(0.9), colour="black", size=.7)
For this you need to understand that even though you have a second y-axis, it is only a relabelling: everything drawn on the graph is still positioned on the main (left) y-axis.
So you need to do two things:
Convert anything that should be read against the second y-axis to the scale of the left one. Here that is the bar scale (the LA variable), whose maximum is 15, so you need to divide the TT values by 20.
Label the second axis correctly: it is the main y-axis multiplied by 20.
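As a small worked example of that rescaling (the numbers are made up for illustration and are not the data from the spreadsheet), the factor can be derived from the two ranges instead of being hard-coded:

```r
# Made-up values standing in for the two variables; not the real data.
la <- c(4.2, 9.8, 15.0)   # drawn as bars, read on the left axis
tt <- c(120, 250, 300)    # should be read on the right axis

# Scale factor that maps the TT range onto the LA range.
k <- max(tt) / max(la)    # 300 / 15

tt_plot <- tt / k         # what is actually drawn, in left-axis units
tt_back <- tt_plot * k    # what the secondary axis labels have to show

print(k)
print(tt_plot)
```

In the ggplot2 code this would correspond to y = Mean / k for the points and lines, and to a secondary axis that multiplies by the same k.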
p <- ggplot() +
geom_bar(data=filter(df, variable=="LA"), aes(x=Gen, y=Mean, fill=Leaf),
stat="identity", position="dodge") +
# values are divided by 20 to be in the same value range of bar graph
geom_point(data=filter(df, variable=="TT"),aes(x=Gen, y=Mean/20, colour=Leaf))+
geom_line(data=filter(df, variable=="TT"), aes(x=Gen, y=Mean/20, group=Leaf))+
ggtitle("G")+xlab("Genotypes")+ylab("Canopy temperature")+
scale_fill_hue(name="", labels=c("Leaf-1", "Leaf-2", "Leaf-3"))+
# second axis is multiply by 20 to reflect the actual value of lines & points
scale_y_continuous(
sec.axis=sec_axis(trans = ~ . * 20, name="2nd Y-axis",
breaks = c(0, 100, 200, 300))) +
theme(axis.text.x=element_text(angle=90, hjust=1), legend.position="top")
For the error bars, here is a very basic version. You will need to adjust the theme and the graph to get a good-looking result.
p + geom_errorbar(data = filter(df, variable=="TT"),
aes(x = Gen, y=Mean/20, ymin=(Mean-se)/20,
ymax=(Mean+se)/20), width=0.5,
position=position_dodge(0.9), colour="black", size=.7)
One final note: please consider reading the error message, understanding what it says, and consulting the help pages for the packages and functions in R, so you can learn to write all the code yourself.
I have been trying to look for an answer to my particular problem but I have not been successful, so I have just made a MWE to post here.
I tried the answers here with no success.
The task I want to do seems easy enough, but I cannot figure it out, and the results I get are making me have some fundamental questions...
I just want to overlay points and error bars on a bar plot, using ggplot2.
I have a long format data frame that looks like the following:
mydf <- data.frame(cell=paste0("cell", rep(1:3, each=12)),
scientist=paste0("scientist", rep(rep(rep(1:2, each=3), 2), 3)),
timepoint=paste0("time", rep(rep(1:2, each=6), 3)),
rep=paste0("rep", rep(1:3, 12)),
value=runif(36)*100)
I have attempted to get the plot I want the following way:
library(ggplot2)
library(RColorBrewer)  # provides brewer.pal()
myPal <- brewer.pal(3, "Set2")[1:2]
myPal2 <- brewer.pal(3, "Set1")
outfile <- "test.pdf"
pdf(file=outfile, height=10, width=10)
print(#or ggsave()
ggplot(mydf, aes(cell, value, fill=scientist )) +
geom_bar(stat="identity", position=position_dodge(.9)) +
geom_point(aes(cell, color=rep), position=position_dodge(.9), size=5) +
facet_grid(timepoint~., scales="free_x", space="free_x") +
scale_y_continuous("% of total cells") +
scale_fill_manual(values=myPal) +
scale_color_manual(values=myPal2)
)
dev.off()
But I obtain this:
The problem is, there should be 3 "rep" values per "scientist" bar, but the values are ordered by "rep" instead (they should be 1,2,3,1,2,3, instead of 1,1,2,2,3,3).
Besides, I would like to add error bars with geom_errorbar but I didn't manage to get a working example...
Furthermore, overlaying the actual value points on the bars makes me wonder what is actually being plotted here: whether the values are taken properly for each bar, and why the max value (or so it seems) is plotted by default.
The way I think this should be properly plotted is with the median (or mean), adding the error bars like the whiskers in a boxplot (min and max value).
Any idea how to...
... have the "rep" value points appear in proper order?
... change the value shown by the bars from max to median?
... add error bars with max and min values?
I restructured your plotting code a little to make things easier.
The secret is to use proper grouping (which is otherwise inferred from fill and color). Also, since you're dodging on multiple levels, dodge2 has to be used.
When you are unsure about "what is plotted where" in bar/column charts, it's always helpful to add the option color="black", which reveals that things are still stacked on top of each other because of your use of dodge instead of dodge2.
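To see why the grouping matters, it can help to count how many values end up in each group. A minimal base-R sketch using the structure of mydf from the question (the column name grp is only illustrative):

```r
set.seed(1)  # the question's data are random; the seed is arbitrary
mydf <- data.frame(cell = paste0("cell", rep(1:3, each = 12)),
                   scientist = paste0("scientist", rep(rep(rep(1:2, each = 3), 2), 3)),
                   timepoint = paste0("time", rep(rep(1:2, each = 6), 3)),
                   rep = paste0("rep", rep(1:3, 12)),
                   value = runif(36) * 100)

# Grouping by scientist alone: 3 replicate values share each bar position,
# so their bars are drawn on top of each other.
per_scientist <- table(mydf$cell, mydf$timepoint, mydf$scientist)
print(unique(as.vector(per_scientist)))   # 3 values per bar position

# Grouping by scientist AND rep: every value gets its own group.
mydf$grp <- paste(mydf$scientist, mydf$rep)
per_grp <- table(mydf$cell, mydf$timepoint, mydf$grp)
print(unique(as.vector(per_grp)))         # 1 value per bar position
```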
p = ggplot(mydf, aes(x=cell, y=value, group=paste(scientist,rep))) +
geom_col(aes(fill=scientist), position=position_dodge2(.9)) +
geom_point(aes(cell, color=rep), position=position_dodge2(.9), size=5) +
facet_grid(timepoint~., scales="free_x", space="free_x") +
scale_y_continuous("% of total cells") +
scale_fill_brewer(palette = "Set2")+
scale_color_brewer(palette = "Set1")
ggsave(filename = outfile, plot=p, height = 10, width = 10)
gives:
Regarding error bars
Since there are only three replicates, I would show the original data points and maybe a violin plot. For completeness' sake I also added a geom_errorbar.
ggplot(mydf, aes(x=cell, y=value,group=paste(cell,scientist))) +
geom_violin(aes(fill=scientist),position=position_dodge(),color="black") +
geom_point(aes(cell, color=rep), position=position_dodge(0.9), size=5) +
geom_errorbar(stat="summary",position=position_dodge())+
facet_grid(timepoint~., scales="free_x", space="free_x") +
scale_y_continuous("% of total cells") +
scale_fill_brewer(palette = "Set2")+
scale_color_brewer(palette = "Set1")
gives
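The question also asked for bars showing the median with whiskers at the minimum and maximum. One way to get there, sketched here under the assumption that summarising first is acceptable, is to compute the three statistics per bar with aggregate() and plot the summary table. The names summ, value.median, value.min and value.max are introduced for this example only, and the plotting part runs only if ggplot2 is installed.

```r
set.seed(42)  # arbitrary seed; the question's data are random
mydf <- data.frame(cell = paste0("cell", rep(1:3, each = 12)),
                   scientist = paste0("scientist", rep(rep(rep(1:2, each = 3), 2), 3)),
                   timepoint = paste0("time", rep(rep(1:2, each = 6), 3)),
                   rep = paste0("rep", rep(1:3, 12)),
                   value = runif(36) * 100)

# Median, minimum and maximum of the 3 replicates behind each bar.
summ <- aggregate(value ~ cell + scientist + timepoint, data = mydf,
                  FUN = function(v) c(median = median(v), min = min(v), max = max(v)))
# aggregate() returns a matrix column; flatten it into
# value.median, value.min and value.max.
summ <- do.call(data.frame, summ)
print(summ)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(summ, aes(x = cell, y = value.median, fill = scientist)) +
    geom_col(position = position_dodge(0.9)) +                  # bar height = median
    geom_errorbar(aes(ymin = value.min, ymax = value.max),      # whiskers = min / max
                  position = position_dodge(0.9), width = 0.3) +
    geom_point(data = mydf, aes(y = value),                     # raw replicates on top
               position = position_dodge(0.9)) +
    facet_grid(timepoint ~ .) +
    scale_y_continuous("% of total cells")
  print(p)
}
```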
Update after comment
As I mentioned in my comment below, the stacking of the percentages leads to an undesirable outcome.
ggplot(mydf, aes(x=paste(cell, scientist), y=value)) +
geom_bar(aes(fill=rep),stat="identity", position=position_stack(),color="black") +
geom_point(aes(color=rep), position=position_dodge(.9), size=3) +
facet_grid(timepoint~., scales="free_x", space="free_x") +
scale_y_continuous("% of total cells") +
scale_fill_brewer(palette = "Set2")+
scale_color_brewer(palette = "Set1")
I'm preparing an appendix plot for a revised manuscript where I need to give information of the within-year ranges (variability) of several variables between years and sites.
I figured the tidiest way to do this (I have 7 sites, 21 years, and 5 variables...) would be to use a rose plot using coord_polar. However, I stumbled upon something that has always frustrated me about ggplot - the default ordering assumptions. While factors are easily reordered based on some value, this seems to only work in a fixed fashion: as far as I've understood, the order needs to apply throughout the data frame.
In this plot, the ordering needs to depend on a value which changes between years, and therefore the colour and fill values need to change in plotting order within the panel.
To demonstrate, I've created a reproducible example coded below (pictured in the way it should not work)
Basically, I always need the Site with the minimum value within a given Year to be plotted first (in the centre), followed outwards by the increase in value of the other sites, in order of the original value (see order and diff columns of the data frame). In other words, some years Site a will be at the centre, some years Site c will be in the centre, etc.
Any help would be massively appreciated.
library('ggplot2')
library('reshape2')
library("plyr")
## reproducible example of problem: create dummy data
madeup <- data.frame(Year = rep(2000:2015, each=20), Site=rep(c("a","b","c","d"), each=5, times=16),
var1 = rnorm(n=16*20, mean=20, sd=5), var2= rnorm(n=16*20, mean=50, sd=1))
## create ranges of the data by Year and Site
myRange <- function(dat) max(dat, na.rm=TRUE) - min(dat, na.rm=TRUE)
vardf <- ddply(madeup, .(Site, Year), summarise, var1=myRange(var1),
var2=myRange(var2))
varmelt <- melt(vardf, id.vars = c("Site","Year"))
varmelt$Site <- as.character(varmelt$Site) # this to preserve the new order when rbind called
varmelt <- by(varmelt, list(varmelt$Year, varmelt$variable), function(x) {x <- x[order(x$value),]
x$order <- 1:nrow(x)
return(x)})
varmelt <- do.call(rbind, varmelt)
## create difference between these values so that each site gets plotted cumulatively on the rose plot
## (otherwise areas close to the centre become uninterpretable)
vartest <- by(varmelt, list(varmelt$Year, varmelt$variable), function(x) {
x$diff <- c(x$value[1], diff(x$value))
return(x)
})
vartest <- do.call(rbind,vartest)
## plot rose plot to display how ranges in variables vary by year and between sites
## for this test example we'll just take one variable, but the idea is to facet by variable
max1 <- max(vartest$value[vartest$variable=='var1'])
yearlength <- length(2000:2015)
ggplot(vartest[vartest$variable=="var1",], aes(x=factor(Year), y=diff)) +
theme_bw() +
geom_hline(yintercept = seq(0,max1, by=1), size=0.3, col="grey60",lty=3) +
geom_vline(xintercept=seq(1,yearlength,1), size=0.3, col='grey30', lty=2) +
geom_bar(stat='identity', width=1, size=0.5, aes(col=Site, fill=Site)) +
scale_x_discrete() +
coord_polar() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
As long as you don't use stacked bars (position = "stack", which is the default for geom_bar), ggplot2 will actually use the order of the rows in your data for the plotting order. So all you need to do is use the original values for the y-axis (rather than the cumulatively differenced ones) along with position = "identity", and order your data from largest to smallest value before plotting:
ordered_data <- vartest[order(-vartest$value), ]
ggplot(ordered_data, aes(factor(Year), value)) +
geom_col(aes(fill = Site), position = "identity", width = 1) +
coord_polar() +
facet_wrap(~ variable)
Created on 2018-02-17 by the reprex package (v0.2.0).
PS. When generating random data for an example, consider using set.seed so that your results can be reproduced exactly.
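A minimal illustration of that point (the seed value 2018 is arbitrary):

```r
set.seed(2018)   # fix the random number generator state
first <- rnorm(5)

set.seed(2018)   # same seed again
second <- rnorm(5)

# Both draws are the same numbers, so anyone running the example
# gets exactly the same data.
print(identical(first, second))
```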
You can start with a single plot of the largest site, and then layer smaller sites on top like so:
a <- ggplot(vartest[vartest$variable=="var1"& vartest$order==4,], aes(x=factor(Year), y=value,group=order)) +
theme_bw() +
geom_hline(yintercept = seq(0,max1, by=1), size=0.3, col="grey60",lty=3) +
geom_vline(xintercept=seq(1,yearlength,1), size=0.3, col='grey30', lty=2) +
geom_bar(stat='identity', width=1, size=0.5, aes(col=Site, fill=Site)) +
scale_x_discrete() +
coord_polar() +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
b <- a + geom_bar(data = vartest[vartest$variable=="var1"& vartest$order==3,],
stat='identity', width=1, size=0.5, aes(x=factor(Year), y=value,col=Site, fill=Site))
c <- b + geom_bar(data = vartest[vartest$variable=="var1"& vartest$order==2,],
stat='identity', width=1, size=0.5, aes(x=factor(Year), y=value,col=Site, fill=Site))
c + geom_bar(data = vartest[vartest$variable=="var1"& vartest$order==1,],
stat='identity', width=1, size=0.5, aes(x=factor(Year), y=value,col=Site, fill=Site))
This produces the following:
Is that what you wanted?
I am trying to improve the clarity and aspect of a histogram of discrete values which I need to represent with a log scale.
Please consider the following MWE
set.seed(99)
data <- data.frame(dist = as.integer(rlnorm(1000, sdlog = 2)))
class(data$dist)
ggplot(data, aes(x=dist)) + geom_histogram()
which produces
and then
ggplot(data, aes(x=dist)) + geom_histogram() + scale_x_log10(breaks=c(1,2,3,4,5,10,100))
which probably is even worse
since now it gives the impression that something is missing between "1" and "2", and it is also not totally clear which bar has value "1" (the bar is to the right of the tick) and which bar has value "2" (the bar is to the left of the tick).
I understand that technically ggplot provides the "right" visual answer for a log scale. Yet as an observer I have some trouble understanding it.
Is it possible to improve something?
EDIT:
This is what happened when I applied Jaap's solution to my real data:
Where do the dips between x=0 and x=1 and between x=1 and x=2 come from? My values are discrete, so why is the plot also mapping x=1.5 and x=2.5?
The first thing that comes to mind is playing with the binwidth. But that doesn't give a great solution either:
ggplot(data, aes(x=dist)) +
geom_histogram(binwidth=10) +
scale_x_continuous(expand=c(0,0)) +
scale_y_continuous(expand=c(0.015,0)) +
theme_bw()
gives:
In this case it is probably better to use a density plot. However, when you use scale_x_log10 you will get a warning message (Removed 524 rows containing non-finite values (stat_density)). This can be resolved by using a log plus one transformation.
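The reason for the warning is that the data contain zeros, and the logarithm of zero is not finite. A short base-R sketch of the difference between the two transformations (the vector x is made up for illustration):

```r
x <- c(0, 1, 2, 10, 100)   # example values, including a zero

print(log10(x))   # the first element is -Inf, so it cannot be placed on a log axis
print(log1p(x))   # log(1 + x): finite everywhere, and 0 is mapped to 0

# Number of zeros in the example data from the question.
set.seed(99)
dist <- as.integer(rlnorm(1000, sdlog = 2))
print(sum(dist == 0))
```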
The following code:
library(ggplot2)
library(scales)
ggplot(data, aes(x=dist)) +
stat_density(aes(y=..count..), color="black", fill="blue", alpha=0.3) +
scale_x_continuous(breaks=c(0,1,2,3,4,5,10,30,100,300,1000), trans="log1p", expand=c(0,0)) +
scale_y_continuous(breaks=c(0,125,250,375,500,625,750), expand=c(0,0)) +
theme_bw()
will give this result:
I am wondering: what if the y-axis is scaled instead of the x-axis? It will result in a few warnings wherever values are 0, but it may serve your purpose.
set.seed(99)
data <- data.frame(dist = as.integer(rlnorm(1000, sdlog = 2)))
class(data$dist)
ggplot(data, aes(x=dist)) + geom_histogram() + scale_y_log10()
Also you may want to display frequencies as data labels, since people might ignore the y-scale and it takes some time to realize that y scale is logarithmic.
ggplot(data, aes(x=dist)) + geom_histogram(fill = 'skyblue', color = 'grey30') + scale_y_log10() +
stat_bin(geom="text", size=3.5, aes(label=..count.., y=0.8*(..count..)))
A solution could be to convert your data to a factor:
library(ggplot2)
set.seed(99)
data <- data.frame(dist = as.integer(rlnorm(1000, sdlog = 2)))
ggplot(data, aes(x=factor(dist))) +
geom_histogram(stat = "count") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Resulting in:
I had the same issue and, inspired by @Jaap's answer, I fiddled with the histogram binwidth using the x-axis in log scale.
If you use binwidth = 0.201, the bars will be juxtaposed as expected. However, this means you can only have up to five bars between two x coordinates.
set.seed(99)
data <- data.frame(dist = as.integer(rlnorm(1000, sdlog = 2)))
class(data$dist)
ggplot(data, aes(x=dist)) +
geom_histogram(binwidth = 0.201, color = 'red') +
scale_x_log10()
Result:
Folks,
I am plotting histograms using geom_histogram and I would like to label each histogram with the mean value (I am using mean for the sake of this example). The issue is that I am drawing multiple histograms in one facet and I get labels overlapping. This is an example:
library(ggplot2)
df <- data.frame (type=rep(1:2, each=1000), subtype=rep(c("a","b"), each=500), value=rnorm(4000, 0,1))
plt <- ggplot(df, aes(x=value, fill=subtype)) + geom_histogram(position="identity", alpha=0.4)
plt <- plt + facet_grid(. ~ type)
plt + geom_text(aes(label = paste("mean=", mean(value)), colour=subtype, x=-Inf, y=Inf), data = df, size = 4, hjust=-0.1, vjust=2)
Result is:
The problem is that the labels for Subtypes a and b are overlapping. I would like to solve this.
I have tried the position, both dodge and stack, for example:
plt + geom_text(aes(label = paste("mean=", mean(value)), colour=subtype, x=-Inf, y=Inf), position="stack", data = df, size = 4, hjust=-0.1, vjust=2)
This did not help. In fact, it issued warning about the width.
Would you pls help ?
Thx,
Riad.
I think you could precalculate mean values before plotting in new data frame.
library(plyr)
df.text<-ddply(df,.(type,subtype),summarise,mean.value=mean(value))
df.text
  type subtype   mean.value
1    1       a -0.003138127
2    1       b  0.023252169
3    2       a  0.030831337
4    2       b -0.059001888
Then use this new data frame in geom_text(). To ensure that values do not overlap you can provide two values in vjust= (as there are two values in each facet).
ggplot(df, aes(x=value, fill=subtype)) +
geom_histogram(position="identity", alpha=0.4)+
facet_grid(. ~ type)+
geom_text(data=df.text,aes(label=paste("mean=",mean.value),
colour=subtype,x=-Inf,y=Inf), size = 4, hjust=-0.1, vjust=c(2,4))
Just to expand on @Didzis:
You actually have two problems here. First, the text overlaps, but more importantly, when you use aggregating functions in aes(...), as in:
geom_text(aes(label = paste("mean=", mean(value)), ...
ggplot does not respect the subsetting implied in the facets (or in the groups for that matter). So mean(value) is based on the full dataset regardless of faceting or grouping. As a result, you have to use an auxiliary table, as @Didzis shows.
BTW:
df.text <- aggregate(df$value,by=list(type=df$type,subtype=df$subtype),mean)
gets you the means and does not require plyr.
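To make the difference concrete, here is a small base-R sketch comparing the single overall mean with the per-group means. It uses the formula interface of aggregate(), an alternative spelling of the call above, and renames the result column to mean.value so that it matches the plyr version; the seed is arbitrary.

```r
set.seed(1)
df <- data.frame(type = rep(1:2, each = 1000),
                 subtype = rep(c("a", "b"), each = 500),
                 value = rnorm(4000, 0, 1))

# One number for all rows: this is what every label ends up showing
# when mean(value) is written inside aes().
overall <- mean(df$value)

# One mean per type/subtype combination: this is what the labels should show.
df.text <- aggregate(value ~ type + subtype, data = df, FUN = mean)
names(df.text)[names(df.text) == "value"] <- "mean.value"

print(overall)
print(df.text)
```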