ggplot2: geom_text() in geom_col() with POSIXt x axis [duplicate] - r

I would like to plot a time series using bar charts and have the bin width set to 0.9. I cannot seem to do that, however. I have searched around but could not find anything helpful so far. Is this a limitation of stat="identity"?
Here is a sample data and graph.
Cheers !
time <- c('2015-06-08 00:59:00','2015-06-08 02:48:00','2015-06-08 06:43:00','2015-06-08 08:59:00','2015-06-08 10:59:00','2015-06-08 12:59:00','2015-06-08 14:58:00','2015-06-08 16:58:00','2015-06-08 18:59:00','2015-06-08 20:59:00','2015-06-08 22:57:00','2015-06-09 00:59:00','2015-06-09 01:57:00','2015-06-09 03:22:00','2015-06-09 06:14:00','2015-06-09 08:59:00','2015-06-09 10:59:00','2015-06-09 12:59:00','2015-06-09 14:59:00','2015-06-09 16:59:00','2015-06-09 18:59:00','2015-06-09 20:59:00','2015-06-09 22:58:00','2015-06-10 00:57:00','2015-06-10 02:34:00','2015-06-10 04:45:00','2015-06-10 06:24:00','2015-06-10 08:59:00','2015-06-10 10:59:00','2015-06-10 12:59:00','2015-06-10 14:59:00','2015-06-10 16:59:00','2015-06-10 18:59:00','2015-06-10 20:58:00','2015-06-10 22:52:00','2015-06-11 00:59:00','2015-06-11 02:59:00','2015-06-11 04:59:00','2015-06-11 06:59:00','2015-06-11 08:59:00','2015-06-11 10:59:00','2015-06-11 12:59:00','2015-06-11 14:59:00','2015-06-11 16:58:00','2015-06-11 18:58:00','2015-06-11 20:56:00','2015-06-11 21:49:00','2015-06-12 00:59:00','2015-06-12 02:59:00','2015-06-12 04:20:00','2015-06-12 08:55:00','2015-06-12 10:55:00','2015-06-12 12:59:00','2015-06-12 14:59:00','2015-06-12 16:59:00','2015-06-12 18:59:00','2015-06-12 20:55:00','2015-06-12 22:50:00','2015-06-13 00:16:00','2015-06-13 12:59:00','2015-06-13 14:35:00','2015-06-13 16:56:00','2015-06-13 18:59:00','2015-06-13 20:59:00','2015-06-13 22:44:00','2015-06-13 23:19:00','2015-06-14 08:53:00','2015-06-14 10:14:00','2015-06-14 12:59:00','2015-06-14 14:59:00','2015-06-14 16:56:00','2015-06-14 18:58:00','2015-06-14 20:57:00','2015-06-14 22:31:00','2015-06-14 23:59:00')
count <- c(59,63,9,13,91,80,97,210,174,172,167,74,43,18,18,29,136,157,126,170,188,135,207,216,163,163,126,111,172,213,209,265,203,205,195,201,171,157,153,176,187,252,227,223,171,162,146,161,136,124,155,239,233,157,158,125,138,45,45,1,2,6,6,46,48,4,1,1,12,56,65,122,81,110,42)
level <- c('low','low','low','low','low','low','low','high','normal','normal','normal','low','low','low','low','low','low','normal','low','normal','normal','low','high','high','normal','normal','low','low','normal','high','high','high','high','high','normal','high','normal','normal','normal','normal','normal','high','high','high','normal','normal','low','normal','low','low','normal','high','high','normal','normal','low','low','low','low','low','low','low','low','low','low','low','low','low','low','low','low','low','low','low','low')
library(ggplot2)
library(scales) # for date_format() and date_breaks()
DF = data.frame(time, count, level)
DF$time = as.POSIXct(DF$time)
ggplot(DF, aes(x=time, y=count, fill=level), width=0.9) +
geom_bar(stat="identity") +
scale_x_datetime(labels = date_format("%D"), breaks = date_breaks("day")) +
xlab("myXlabel") +
ylab("myYlabel") +
ggtitle("myTitle")

Found it! The width is supported, but because the x axis is a POSIXct date-time, the width is measured in seconds. A width=0.9 therefore means the bars are 0.9 seconds wide. Since my bins are 2 hours each, a width of "1" is actually 7200. So here is the code that works.
ggplot(DF, aes(x=time, y=count, width=6000, fill=level)) +
geom_bar(stat="identity", position="identity", color="grey") +
scale_x_datetime(labels = date_format("%D"), breaks = date_breaks("day")) +
xlab("myXlabel") +
ylab("myYlabel") +
ggtitle("myTitle")
Results as below. There are some overlaps in the bars; I just need to align my data, say to the next hour.
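A minimal sketch of the alignment step mentioned above, assuming the timestamps are POSIXct and that snapping to the nearest whole hour in UTC is acceptable. The helper name `snap_to_bin` is hypothetical and not part of ggplot2 or base R. It also shows how a relative width of 0.9 translates into seconds for 2-hour bins.

```r
# Hypothetical helper: snap POSIXct timestamps to the nearest multiple of bin_secs.
snap_to_bin <- function(t, bin_secs = 3600, tz = "UTC") {
  # POSIXct stores seconds since the epoch, so rounding is plain arithmetic
  as.POSIXct(round(as.numeric(t) / bin_secs) * bin_secs,
             origin = "1970-01-01", tz = tz)
}

times <- as.POSIXct(c("2015-06-08 00:59:00", "2015-06-08 02:48:00",
                      "2015-06-08 06:43:00"), tz = "UTC")
snapped <- snap_to_bin(times)
format(snapped, "%H:%M")

# On a POSIXct axis the bar width is in seconds: 90% of a 2-hour bin
bin_secs  <- 2 * 60 * 60
bar_width <- 0.9 * bin_secs
```

Note that snapping can map two nearby observations (for example 22:44 and 23:19) onto the same hour, so those rows would need to be summed or otherwise combined before plotting.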

If what you are trying to achieve is widening the bars in the plot, ggplot doesn't seem to support that for geom_bar. However, it is pretty straightforward to implement a barplot using geom_rect.
Since many of the datapoints seem to be spaced roughly two hours apart, I am assuming here that the 0.9 width you want to achieve is 0.9 hours to either side of the given time (so basically smushing out most of the space between the bars).
If that's what you want, the following code should work:
library(lubridate)
ggplot(DF, aes(xmin=time-minutes(54), xmax=time+minutes(54), ymin=0, ymax=count,
fill=level)) +
geom_rect(color="#666666")

I am also trying to wrap my head around R.
I worked on this and found a solution. Along the way ggplot2 gave me a warning that points at the problem: overlapping x intervals. The warning disappears at width = 2000. By adding position = "dodge" ("places overlapping objects directly beside one another", https://r4ds.had.co.nz/data-visualisation.html) you can achieve a reasonable plot.
# Original file
# position is an argument of the geom; ggplot() silently ignores it
ggplot(DF, aes(x=time, y=count, fill=level, width=2000)) +
geom_bar(stat="identity", position = "dodge") +
scale_x_datetime(labels = date_format("%D"), breaks = date_breaks("day")) +
xlab("myXlabel") +
ylab("myYlabel") +
ggtitle("myTitle")
Previous version (not as good). Here is another solution:
ggplot(DF, aes(x=time, y=count, colour = level)) +
geom_bar(stat="identity") +
scale_x_datetime(labels = date_format("%D"), breaks = date_breaks(width = "day")) +
xlab("myXlabel") +
ylab("myYlabel") +
ggtitle("myTitle")
colour=level gives wider columns

Related

ggplot histogram: present both overall count in addition to group count in each bin

I am trying to generate a histogram using ggplot which on the x axis has speeds and on the y axis has the counts. In addition, each bin shows how many of those were during the day and night.
I need to present the counts themselves on the plot. I managed to add the counts within each bar but now I would like to present another number, the total count, on top of each bar. Is that possible?
This is my code:
ggplot(aes(x = speedmh ) , data = GPSdataset1hDFDS48) +
geom_histogram(aes(fill=DayActv), bins=15, colour="grey20", lwd=0.2) + ylim(0, 400) +xlim(0,500)+
stat_bin(bins=15, geom="text", colour="white", size=3.5,
aes(label=..count.., group=DayActv), position=position_stack(vjust=0.5))
and this is the result I get:
How do I add the total count of speeds within each bin to the top of every bar?
Ideally I would like to make this histogram of proportions of speeds instead of counts, but I think that is too complicated for me at the moment.
Thank you!!
Mia
One way is to add another stat_bin command without the grouping:
library(ggplot2)
ggplot(aes(x = speedmh) , data = GPSdataset1hDFDS48) +
geom_histogram(aes(fill=DayActv), bins=15, colour="grey20", lwd=0.2) + ylim(0, 400) +
xlim(0,500) +
stat_bin(bins=15, geom="text", colour="white", size=3.5,
aes(label=..count.., group=DayActv), position=position_stack(vjust=0.5)) +
stat_bin(bins=15, geom="text", colour="black", size=3.5,
aes(label=..count..), vjust=-0.5)
Data:
GPSdataset1hDFDS48 <- data.frame(speedmh=rexp(1000, 0.015), DayActv=factor(sample(0:1, 1000,TRUE)))

Overlay points (and error bars) over bar plot with position_dodge

I have been trying to look for an answer to my particular problem but I have not been successful, so I have just made a MWE to post here.
I tried the answers here with no success.
The task I want to do seems easy enough, but I cannot figure it out, and the results I get are making me have some fundamental questions...
I just want to overlay points and error bars on a bar plot, using ggplot2.
I have a long format data frame that looks like the following:
mydf <- data.frame(cell=paste0("cell", rep(1:3, each=12)),
scientist=paste0("scientist", rep(rep(rep(1:2, each=3), 2), 3)),
timepoint=paste0("time", rep(rep(1:2, each=6), 3)),
rep=paste0("rep", rep(1:3, 12)),
value=runif(36)*100)
I have attempted to get the plot I want the following way:
library(ggplot2)
library(RColorBrewer) # for brewer.pal()
myPal <- brewer.pal(3, "Set2")[1:2]
myPal2 <- brewer.pal(3, "Set1")
outfile <- "test.pdf"
pdf(file=outfile, height=10, width=10)
print(#or ggsave()
ggplot(mydf, aes(cell, value, fill=scientist )) +
geom_bar(stat="identity", position=position_dodge(.9)) +
geom_point(aes(cell, color=rep), position=position_dodge(.9), size=5) +
facet_grid(timepoint~., scales="free_x", space="free_x") +
scale_y_continuous("% of total cells") +
scale_fill_manual(values=myPal) +
scale_color_manual(values=myPal2)
)
dev.off()
But I obtain this:
The problem is, there should be 3 "rep" values per "scientist" bar, but the values are ordered by "rep" instead (they should be 1,2,3,1,2,3, instead of 1,1,2,2,3,3).
Besides, I would like to add error bars with geom_errorbar but I didn't manage to get a working example...
Furthermore, overlying actual value points to the bars, it is making me wonder what is actually being plotted here... if the values are taken properly for each bar, and why the max value (or so it seems) is plotted by default.
The way I think this should be properly plotted is with the median (or mean), adding the error bars like the whiskers in a boxplot (min and max value).
Any idea how to...
... have the "rep" value points appear in proper order?
... change the value shown by the bars from max to median?
... add error bars with max and min values?
I restructured your plotting code a little to make things easier.
The secret is to use proper grouping (which is otherwise inferred from fill and color). Also, since you're dodging on multiple levels, dodge2 has to be used.
When you are unsure about "what is plotted where" in bar/column charts, it's always helpful to add the option color="black", which reveals that things are still stacked on top of each other because of your use of dodge instead of dodge2.
p = ggplot(mydf, aes(x=cell, y=value, group=paste(scientist,rep))) +
geom_col(aes(fill=scientist), position=position_dodge2(.9)) +
geom_point(aes(cell, color=rep), position=position_dodge2(.9), size=5) +
facet_grid(timepoint~., scales="free_x", space="free_x") +
scale_y_continuous("% of total cells") +
scale_fill_brewer(palette = "Set2")+
scale_color_brewer(palette = "Set1")
ggsave(filename = outfile, plot=p, height = 10, width = 10)
gives:
Regarding error bars
Since there are only three replicates, I would show the original data points and maybe a violin plot. For completeness' sake I also added a geom_errorbar.
ggplot(mydf, aes(x=cell, y=value,group=paste(cell,scientist))) +
geom_violin(aes(fill=scientist),position=position_dodge(),color="black") +
geom_point(aes(cell, color=rep), position=position_dodge(0.9), size=5) +
geom_errorbar(stat="summary",position=position_dodge())+
facet_grid(timepoint~., scales="free_x", space="free_x") +
scale_y_continuous("% of total cells") +
scale_fill_brewer(palette = "Set2")+
scale_color_brewer(palette = "Set1")
gives
Update after comment
As I mentioned in my comment below, the stacking of the percentages leads to an undesirable outcome.
ggplot(mydf, aes(x=paste(cell, scientist), y=value)) +
geom_bar(aes(fill=rep),stat="identity", position=position_stack(),color="black") +
geom_point(aes(color=rep), position=position_dodge(.9), size=3) +
facet_grid(timepoint~., scales="free_x", space="free_x") +
scale_y_continuous("% of total cells") +
scale_fill_brewer(palette = "Set2")+
scale_color_brewer(palette = "Set1")
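For the variant the question asks about (bars at the median, with whiskers at the minimum and maximum), one possible sketch is to summarise the replicates first and plot the summary. This is an illustration rather than part of the answer above: the seed, the `summ` data frame and its column names are assumptions, and the plotting part only runs if ggplot2 is installed.

```r
set.seed(1)  # assumption: fixed seed so the random example is reproducible
mydf <- data.frame(cell = paste0("cell", rep(1:3, each = 12)),
                   scientist = paste0("scientist", rep(rep(rep(1:2, each = 3), 2), 3)),
                   timepoint = paste0("time", rep(rep(1:2, each = 6), 3)),
                   rep = paste0("rep", rep(1:3, 12)),
                   value = runif(36) * 100)

# One row per bar: median as bar height, min/max as whisker ends
key <- mydf[c("cell", "scientist", "timepoint")]
summ <- aggregate(mydf$value, by = key, FUN = median)
names(summ)[4] <- "median"
summ$min <- aggregate(mydf$value, by = key, FUN = min)$x
summ$max <- aggregate(mydf$value, by = key, FUN = max)$x

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(summ, aes(x = cell, y = median, fill = scientist)) +
    geom_col(position = position_dodge(0.9)) +
    geom_errorbar(aes(ymin = min, ymax = max),
                  position = position_dodge(0.9), width = 0.2) +
    facet_grid(timepoint ~ .)
}
```

Summarising before plotting keeps exactly one value per bar, so nothing is stacked or overplotted and the bar height is unambiguous.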

Scale geom_density to match geom_bar with percentage on y

Since I was confused about the math last time I tried asking this, here's another try. I want to combine a histogram with a smoothed distribution fit. And I want the y axis to be in percent.
I can't find a good way to get this result. Last time, I managed to find a way to scale the geom_bar to the same scale as geom_density, but that's the opposite of what I wanted.
My current code produces this output:
ggplot2::ggplot(iris, aes(Sepal.Length)) +
geom_bar(stat="bin", aes(y=..density..)) +
geom_density()
The density and bar y values match up, but the scaling is nonsensical. I want percentage on the y axis, not, well, the density.
Some new attempts. We begin with a bar plot modified to show percentages instead of counts:
gg = ggplot2::ggplot(iris, aes(Sepal.Length)) +
geom_bar(aes(y = ..count../sum(..count..))) +
scale_y_continuous(name = "%", labels=scales::percent)
Then we try to add a geom_density to that and somehow get it to scale properly:
gg + geom_density()
gg + geom_density(aes(y=..count..))
gg + geom_density(aes(y=..scaled..))
gg + geom_density(aes(y=..density..))
Same as the first.
gg + geom_density(aes(y = ..count../sum(..count..)))
gg + geom_density(aes(y = ..count../n))
Seems to be off by about factor 10...
gg + geom_density(aes(y = ..count../n/10))
same as:
gg + geom_density(aes(y = ..density../10))
But ad hoc inserting numbers seems like a bad idea.
One useful trick is to inspect the calculated values of the plot. These are not normally saved in the object if one saves it. However, one can use:
gg_data = ggplot_build(gg + geom_density())
gg_data$data[[2]] %>% View
Since we know the density fit around x=6 should be about .04 (4%), we can look around for ggplot2-calculated values that get us there, and the only thing I see is density/10.
How do I get geom_density fit to scale to the same y axis as the modified geom_bar?
Bonus question: why are the grouping of the bars different? The current function does not have spaces in between bars.
Here is an easy solution:
library(scales) # ! important
library(ggplot2)
ggplot(iris, aes(Sepal.Length)) +
stat_bin(aes(y=..density..), breaks = seq(min(iris$Sepal.Length), max(iris$Sepal.Length), by = .1), color="white") +
geom_line(stat="density", size = 1) +
scale_y_continuous(labels = percent, name = "percent") +
theme_classic()
Output:
Try this
ggplot2::ggplot(iris, aes(x=Sepal.Length)) +
geom_histogram(stat="bin", binwidth = .1, aes(y=..density..)) +
geom_density()+
scale_y_continuous(breaks = c(0, .1, .2,.3,.4,.5,.6),
labels =c ("0", "1%", "2%", "3%", "4%", "5%", "6%") ) +
ylab("Percent of Irises") +
xlab("Sepal Length in Bins of .1 cm")
I think your first example is what you want, you just want to change the labels to make it seem like it is percents, so just do that rather than mess around.
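The factor of 10 observed in the question can be explained rather than hard-coded: a density integrates to 1, so the share of observations in a bin is the density multiplied by the bin width, and with bins of width 0.1 that is the same as dividing by 10. The sketch below checks this with base R and then, if ggplot2 and scales are installed, builds a plot where bars and curve share a percentage axis. The bin width of 0.1 and the variable names are assumptions, and `after_stat()` requires ggplot2 3.3.0 or later.

```r
x <- iris$Sepal.Length
binwidth <- 0.1  # assumption: bins 0.1 cm wide, centred on the recorded values

breaks <- seq(min(x) - binwidth / 2, max(x) + binwidth / 2, by = binwidth)
h <- hist(x, breaks = breaks, plot = FALSE)

# density * binwidth is the share of rows falling in each bin
share_from_density <- h$density * binwidth
share_from_counts  <- h$counts / length(x)

if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("scales", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(iris, aes(Sepal.Length)) +
    geom_histogram(aes(y = after_stat(density * binwidth)), binwidth = binwidth) +
    geom_density(aes(y = after_stat(density * binwidth))) +
    scale_y_continuous(name = "%", labels = scales::percent)
}
```

Scaling both layers by the same bin width keeps them comparable if the bin width is changed later, which relabelling the axis by hand does not.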

How to improve the aspect of ggplot histograms with log scales and discrete values

I am trying to improve the clarity and aspect of a histogram of discrete values which I need to represent with a log scale.
Please consider the following MWE
set.seed(99)
data <- data.frame(dist = as.integer(rlnorm(1000, sdlog = 2)))
class(data$dist)
ggplot(data, aes(x=dist)) + geom_histogram()
which produces
and then
ggplot(data, aes(x=dist)) + geom_histogram() + scale_x_log10(breaks=c(1,2,3,4,5,10,100))
which probably is even worse
since now it gives the impression that something is missing between "1" and "2", and it is also not totally clear which bar has value "1" (the bar is on the right of the tick) and which bar has value "2" (the bar is on the left of the tick).
I understand that technically ggplot provides the "right" visual answer for a log scale. Yet as observer I have some problem in understanding it.
Is it possible to improve something?
EDIT:
This is what happens when I apply Jaap's solution to my real data.
Where do the dips between x=0 and x=1 and between x=1 and x=2 come from? My values are discrete, so why is the plot also mapping x=1.5 and x=2.5?
The first thing that comes to mind, is playing with the binwidth. But that doesn't give a great solution either:
ggplot(data, aes(x=dist)) +
geom_histogram(binwidth=10) +
scale_x_continuous(expand=c(0,0)) +
scale_y_continuous(expand=c(0.015,0)) +
theme_bw()
gives:
In this case it is probably better to use a density plot. However, when you use scale_x_log10 you will get a warning message (Removed 524 rows containing non-finite values (stat_density)). This can be resolved by using a log plus one transformation.
The following code:
library(ggplot2)
library(scales)
ggplot(data, aes(x=dist)) +
stat_density(aes(y=..count..), color="black", fill="blue", alpha=0.3) +
scale_x_continuous(breaks=c(0,1,2,3,4,5,10,30,100,300,1000), trans="log1p", expand=c(0,0)) +
scale_y_continuous(breaks=c(0,125,250,375,500,625,750), expand=c(0,0)) +
theme_bw()
will give this result:
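The reason the log-plus-one transformation avoids the warning can be seen without plotting: `as.integer()` truncates every value below 1 to 0, `log10(0)` is `-Inf`, and `log1p(0)` is 0. A small sketch using the question's data (the variable names are illustrative):

```r
set.seed(99)
dist <- as.integer(rlnorm(1000, sdlog = 2))

n_zero     <- sum(dist == 0)   # values truncated to 0 by as.integer()
log10_vals <- log10(dist)      # -Inf wherever dist is 0
log1p_vals <- log1p(dist)      # log(1 + dist), finite for every dist >= 0

n_zero
sum(is.infinite(log10_vals))
```

The zeros are the rows that a log10 scale has to drop as non-finite, while the log1p scale keeps them at position 0.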
I am wondering: what if the y-axis is scaled instead of the x-axis? It will result in a few warnings wherever values are 0, but may serve your purpose.
set.seed(99)
data <- data.frame(dist = as.integer(rlnorm(1000, sdlog = 2)))
class(data$dist)
ggplot(data, aes(x=dist)) + geom_histogram() + scale_y_log10()
Also you may want to display frequencies as data labels, since people might ignore the y-scale and it takes some time to realize that y scale is logarithmic.
ggplot(data, aes(x=dist)) + geom_histogram(fill = 'skyblue', color = 'grey30') + scale_y_log10() +
stat_bin(geom="text", size=3.5, aes(label=..count.., y=0.8*(..count..)))
A solution could be to convert your data to a factor:
library(ggplot2)
set.seed(99)
data <- data.frame(dist = as.integer(rlnorm(1000, sdlog = 2)))
ggplot(data, aes(x=factor(dist))) +
geom_histogram(stat = "count") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Resulting in:
I had the same issue and, inspired by @Jaap's answer, I fiddled with the histogram binwidth using the x-axis in log scale.
If you use binwidth = 0.201, the bars will be juxtaposed as expected. However, this means you can only have up to five bars between two x coordinates.
set.seed(99)
data <- data.frame(dist = as.integer(rlnorm(1000, sdlog = 2)))
class(data$dist)
ggplot(data, aes(x=dist)) +
geom_histogram(binwidth = 0.201, color = 'red') +
scale_x_log10()
Result:
