I am displaying time series data with ggplot2, but the tick labels show some strange behaviour. I am probably doing something wrong, but I could not find any help on the internet. Here's an example:
#just sample data
time <- as.Date(seq(as.Date("2004/1/1"), as.Date("2009/12/1"), by = "1 month"))
data <- rnorm(length(time)) + seq_along(time) # 'test' does not exist yet, so size from 'time'
test <- data.frame(time, data)
I plot with:
q1 <- ggplot(data=test) + geom_line(aes(x=time, y=data))
q1 <- q1 + scale_x_date(major="years", minor="3 months", format="%Y-%m", lim=c(as.Date("2004/1/1"),as.Date("2009/12/1")), name="")
q1
This produces the following graph:
But from my understanding the grid should end at 2009/12/1, right? Thanks a lot for your help!
The limits parameter to scale_x_date affects which data points are plotted, but does not directly change the axis tick labels nor the axis range. This behavior is well illustrated in the help page http://had.co.nz/ggplot2/scale_date.html (towards the bottom of the page.)
If you want to eliminate the empty areas to the left and right of your data, use coord_cartesian:
library(ggplot2)
x <- as.Date(seq(as.Date("2004/1/1"), as.Date("2009/12/1"), by = "1 month"))
y <- rnorm(length(x))+c(1:length(x))
test <- data.frame(time=x, data=y)
q2 <- ggplot(data=test) +
geom_line(aes(x=time, y=data)) +
scale_x_date(major="years", minor="3 months", format="%Y-%m", name="") +
coord_cartesian(xlim=c(as.Date("2004/1/1"),as.Date("2009/12/1")))
png("date_ticks_plot.png", height=600, width=600)
print(q2)
dev.off()
Your line does end at 2009/12/1, but perhaps you are using an older version of ggplot, and upgrading may help with x-axis labels.
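For readers on a current ggplot2 release, here is a minimal sketch of the same plot. It relies on the newer argument names date_breaks, date_minor_breaks and date_labels, which took over from major, minor and format; the object name q3 is only for illustration.

```r
library(ggplot2)

# Same kind of sample data as above: 72 monthly values, 2004-01 to 2009-12
set.seed(1)
x <- seq(as.Date("2004-01-01"), as.Date("2009-12-01"), by = "1 month")
test <- data.frame(time = x, data = rnorm(length(x)) + seq_along(x))

q3 <- ggplot(test, aes(x = time, y = data)) +
  geom_line() +
  # date_breaks / date_minor_breaks / date_labels replace major / minor / format
  scale_x_date(date_breaks = "1 year", date_minor_breaks = "3 months",
               date_labels = "%Y-%m", name = NULL) +
  # zoom the x axis to the data range without dropping any points
  coord_cartesian(xlim = range(test$time))

print(q3)
```

The zoom is done in coord_cartesian() rather than through scale limits, so every data point still takes part in drawing the line.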
My problem seems quite basic, but I couldn't find any relevant answer. I want to create line plots with the date on the x axis. The y axis will be Covid statistics (deaths, hospitalizations, you name it). I want to create a separate plot for the different waves of the pandemic which means that my charts cover different times. My problem is that R fixes the plot to the same size and thus the lines for the shorter time period are skewed in comparison to those of the longer time period. Ideally, I would want 1 month on the x axis to be fixed to a certain number of px or mm. But I can't find out how. My best idea so far is to assign both plots a different total width, but that doesn't give me an optimal result either.
Here's a reproducible example with a built-in dataset to explain:
library(ggplot2)
library(dplyr)
economics_1967 <- economics %>%
filter(date<"1968-01-01")
economics_1968 <- economics %>%
filter(date<"1969-01-01"&date>"1967-12-31")
#data is only available for six months in 1967, but for 12 in 1968
exampleplot1 <- ggplot(economics_1967)+
geom_line(aes(date, unemploy))+
scale_x_date(date_breaks="1 month", date_labels="%b")
#possible: ggsave("exampleplot1.png", width=2, height=1)
exampleplot2 <- ggplot(economics_1968)+
geom_line(aes(date, unemploy))+
scale_x_date(date_breaks="1 month", date_labels="%b")
ggsave("exampleplot2.png", width=4, height=1)
#possible: ggsave("exampleplot1.png", width=2, height=1)
Thank you!
EDIT: Thanks for the suggestions! Facet wrap would be a good idea but in the end I decided to just plot the whole time in one case. The background is that I classified countries differently for their policies in different times, so that's why I wanted to have a clear break in the visualization, but I just put a vertical line in there.
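A minimal sketch of that single-plot variant, assuming the break falls on 1968-01-01 (a hypothetical date chosen to match the example data; the name p_break is also just for illustration):

```r
library(ggplot2)

# economics ships with ggplot2; keep 1967 and 1968 in one data set
econ <- subset(economics, date <= as.Date("1968-12-31"))

p_break <- ggplot(econ, aes(date, unemploy)) +
  geom_line() +
  # dashed vertical line at the assumed break between the two periods
  geom_vline(xintercept = as.Date("1968-01-01"), linetype = "dashed") +
  scale_x_date(date_breaks = "1 month", date_labels = "%b")

print(p_break)
```

Because both periods share one x axis, one month takes up the same width everywhere.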
facet_grid is one approach, if you don't mind showing the two charts together.
library(dplyr); library(ggplot2)
bind_rows(e1967 = economics_1967,
e1968 = economics_1968, .id="source") %>%
ggplot(aes(date, unemploy)) +
geom_line() +
scale_x_date(date_breaks="1 month", date_labels="%b") +
facet_grid(~source, scales = "free_x", space = "free_x")
I like @Jon Spring's solution a lot. I want to present it a tad differently, to show that facet_grid() usually operates on a single dataset in which an existing variable is used to facet.
econ_subset <-
economics %>%
dplyr::filter(dplyr::between(date, as.Date("1967-09-01"), as.Date("1968-12-31"))) %>%
dplyr::mutate(
year = lubridate::year(date) # Used below to facet
)
ggplot(econ_subset, aes(date, unemploy)) +
geom_line() +
scale_x_date(date_breaks="1 month", date_labels="%b") +
facet_grid(~year, scales = "free_x", space = "free_x")
(In Jon's solution, bind_rows() is used to stack the two separate datasets back together.)
I am trying to plot a time series in ggplot2. Assume I am using the following data structure (2500 x 20 matrix):
set.seed(21)
n <- 2500
x <- matrix(replicate(20,cumsum(sample(c(-1, 1), n, TRUE))),nrow = 2500,ncol=20)
aa <- x
rnames <- seq(as.Date("2010-01-01"), length=dim(aa)[1], by="1 month") - 1
rownames(aa) <- format(as.POSIXlt(rnames, format = "%Y-%m-%d"), format = "%d.%m.%Y")
colnames(aa) <- paste0("aa", 1:ncol(aa)) # 'k' was undefined; use the column count
library("ggplot2")
library("reshape2")
library("scales")
aa <- melt(aa, id.vars = rownames(aa))
names(aa) <- c("time","id","value")
Now the following command to plot the time series produces a weird looking x axis:
ggplot(aa, aes(x=time,y=value,colour=id,group=id)) +
geom_line()
What I found out is that I can change the format to date:
aa$time <- as.Date(aa$time, "%d.%m.%Y")
ggplot(aa, aes(x=time,y=value,colour=id,group=id)) +
geom_line()
This looks better, but still not a good graph. My question is especially how to control the formatting of the x axis.
Does it have to be in Date format? How can I control the amount of breaks (i.e. years) shown in either case? It seems to be mandatory if Date is not used; otherwise ggplot2 uses some kind of useful default for the breaks I believe.
For example the following command does not work:
aa$time <- as.Date(aa$time, "%d.%m.%Y")
ggplot(aa, aes(x=time,y=value,colour=id,group=id)) +
geom_line() +
scale_x_continuous(breaks=pretty_breaks(n=10))
Also, if you have any hints on how to improve the overall look of the graph, feel free to add them (e.g. the lines look a bit imprecise imho).
You can format dates with scale_x_date as @Gopala mentioned. Here's an example using a shortened version of your data for illustration.
library(dplyr)
# Dates need to be in date format
aa$time <- as.Date(aa$time, "%d.%m.%Y")
# Shorten data to speed rendering
aa = aa %>% group_by(id) %>% slice(1:200)
In the code below, we get date breaks every six months with date_breaks="6 months". That's probably more breaks than you want in this case and is just for illustration. If you want to determine which months get the breaks (e.g., Jan/July, Feb/Aug, etc.) then you also need to use coord_cartesian and set the start date with xlim and expand=FALSE so that ggplot won't pad the start date. But when you set expand=FALSE you also don't get any padding on the y-axis, so you need to add the padding manually with scale_y_continuous (I'd prefer to be able to set expand separately for the x and y axes, but AFAIK it's not possible). Because the breaks are packed tightly, we use a theme statement to rotate the labels by 90 degrees.
ggplot(aa, aes(x=time,y=value,colour=id,group=id)) +
geom_line(show.legend=FALSE) +
scale_y_continuous(limits=c(min(aa$value) - 2, max(aa$value) + 1)) +
scale_x_date(date_breaks="6 months",
labels=function(d) format(d, "%b %Y")) +
coord_cartesian(xlim=c(as.Date("2009-07-01"), max(aa$time) + 182),
expand=FALSE) +
theme_bw() +
theme(axis.text.x=element_text(angle=-90, vjust=0.5))
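As an alternative to the manual y padding, each scale accepts its own expand argument, so the padding can be set per axis. The sketch below uses a small stand-in data set (hypothetical, not the data from the question) and assumes ggplot2 3.3.0 or later for expansion().

```r
library(ggplot2)

# Small stand-in data set: one random walk sampled monthly
set.seed(21)
d <- data.frame(
  time  = seq(as.Date("2010-01-01"), by = "1 month", length.out = 60),
  value = cumsum(sample(c(-1, 1), 60, replace = TRUE))
)

p_expand <- ggplot(d, aes(time, value)) +
  geom_line() +
  # no padding on the x axis
  scale_x_date(date_breaks = "6 months", date_labels = "%b %Y",
               expand = c(0, 0)) +
  # keep 5% padding on the y axis
  scale_y_continuous(expand = expansion(mult = 0.05)) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

print(p_expand)
```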
I have about 20 years of daily data in a time series. It has columns Date, rainfall and other data.
I am trying to plot rainfall vs. time. I want to get 20 line plots in different colours in one graph, with a legend that shows the years. I tried the following code, but it is not giving me the desired result. Any suggestion to fix my issue would be most welcome.
library(ggplot2)
library(scales) # provides date_format() and date_breaks()
library(seas)
data(mscdata)
p<-ggplot(data=mscdata,aes(x=date,y=precip,group=year,color=year))
p+geom_line()+scale_x_date(labels=date_format("%m"),breaks=date_breaks("1 months"))
It doesn't look great, but here's a method. We first coerce the data into dates in the same year:
mscdata$dayofyear <- as.Date(format(mscdata$date, "%j"), format = "%j")
Then we plot:
library(ggplot2)
library(scales)
p <- ggplot(data = mscdata, aes(x = dayofyear, y = precip, group = year, color = year))
p + geom_line() +
scale_x_date(labels = date_format("%m"), breaks = date_breaks("1 months"))
While I agree with @Jaap that this may not be the best way to depict these data, try the following:
mscdata$doy <- as.numeric(strftime(mscdata$date, format="%j"))
ggplot(data=mscdata,aes(x=doy,y=precip,group=year)) +
geom_line(aes(color=year))
Although the given answers are good answers to your question as it stands, I don't think they will solve your problem. I think you should be looking at a different way to present the data. @Jaap already suggested using facets. Take for example this approach:
#first add a month column to your dataframe
mscdata$month <- format(mscdata$date, "%m")
#then plot it using boxplot with year on the X-axis and month as facet.
p1 <- ggplot(data = mscdata, aes(x = year, y = precip, group=year))
p1 + geom_boxplot(outlier.shape = 3) + facet_wrap(~month)
This will give you one graph per month, showing the rainfall for each year next to each other. Because I use a boxplot, the peaks in rainfall show up as dots ('normal' rain events are inside the box).
Another possible approach would be to use stat_summary.
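A minimal sketch of what the stat_summary route could look like. It uses simulated daily values as a stand-in for mscdata (all numbers are made up), and assumes ggplot2 3.3.0 or later, where the summary function is passed as fun.

```r
library(ggplot2)

# Simulated stand-in for mscdata: 3 years of daily precipitation
set.seed(1)
dates <- seq(as.Date("2001-01-01"), as.Date("2003-12-31"), by = "day")
rain <- data.frame(
  date   = dates,
  year   = factor(format(dates, "%Y")),
  month  = as.integer(format(dates, "%m")),
  precip = rexp(length(dates), rate = 0.5)
)

# One line per year; each point is the mean daily precipitation of that month
p_summary <- ggplot(rain, aes(x = month, y = precip, colour = year, group = year)) +
  stat_summary(fun = mean, geom = "line") +
  scale_x_continuous(breaks = 1:12, labels = month.abb)

print(p_summary)
```

Summarising by month smooths out the daily spikes, which makes the yearly lines easier to compare.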
I am trying to plot a simple bar chart with labels in ggplot2. However, when I use position="dodge", it puts the wrong labels in the resulting graphic, e.g. 17.6% instead of 77.7% for Trucks. My data and code are below.
library(ggplot2)
mode <- factor(c("Truck", "Rail","Water","Air","Other"), levels=c("Truck", "Rail","Water","Air","Other"))
Year <- factor(c("2011","2011","2011","2011","2011","2040","2040","2040","2040","2040"))
share <- c(0.709946085, 0.175582806, 0.11392987, 0.000534132, 0.00000710797, 0.777162621, 0.133121584, 0.088818658, 0.000880041, 0.000017097)
modeshares <- data.frame(Year, mode, share)
theme_set(theme_grey(base_size = 18))
modeshares$lab <- as.character(round(100 * share,1))
modeshares$lab <- paste(modeshares$lab,"%",sep="")
ggplot(data=modeshares, aes(x=mode, y=share*100, fill=Year, ymax=(share*100))) + geom_bar(stat="identity", position="dodge") + labs(y="Percent",x="Mode") +geom_text(label=modeshares$lab,position=position_dodge(width=1),vjust=-0.5)
The resulting graph is shown below.
Any insights into how to ensure that the correct label values are displayed would be much appreciated.
Thanks!
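No answer to this question is included here. One likely cause is that the labels are passed as a plain vector outside aes(), so they stay in the original row order while ggplot2 regroups the rows for dodging. The sketch below is a commonly used fix, not taken from this thread: map the label inside aes() and use the same dodge width for bars and text. The name p_dodge is only for illustration.

```r
library(ggplot2)

mode  <- factor(c("Truck", "Rail", "Water", "Air", "Other"),
                levels = c("Truck", "Rail", "Water", "Air", "Other"))
Year  <- factor(rep(c("2011", "2040"), each = 5))
share <- c(0.709946085, 0.175582806, 0.11392987, 0.000534132, 0.00000710797,
           0.777162621, 0.133121584, 0.088818658, 0.000880041, 0.000017097)
modeshares <- data.frame(Year, mode, share)  # mode is recycled across both years
modeshares$lab <- paste0(round(100 * modeshares$share, 1), "%")

p_dodge <- ggplot(modeshares, aes(x = mode, y = share * 100, fill = Year)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9)) +
  # label is mapped inside aes(), so it follows the same grouping as the bars
  geom_text(aes(label = lab), position = position_dodge(width = 0.9), vjust = -0.5) +
  labs(y = "Percent", x = "Mode")

print(p_dodge)
```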
Main Question
I'm having issues with understanding why the handling of dates, labels and breaks is not working as I would have expected in R when trying to make a histogram with ggplot2.
I'm looking for:
A histogram of the frequency of my dates
Tick marks centered under the matching bars
Date labels in %Y-%b format
Appropriate limits; minimized empty space between edge of grid space and outermost bars
I've uploaded my data to pastebin to make this reproducible. I've created several columns as I wasn't sure the best way to do this:
> dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
> head(dates)
        YM       Date Year Month
1 2008-Apr 2008-04-01 2008     4
2 2009-Apr 2009-04-01 2009     4
3 2009-Apr 2009-04-01 2009     4
4 2009-Apr 2009-04-01 2009     4
5 2009-Apr 2009-04-01 2009     4
6 2009-Apr 2009-04-01 2009     4
Here's what I tried:
library(ggplot2)
library(scales)
dates$converted <- as.Date(dates$Date, format="%Y-%m-%d")
ggplot(dates, aes(x=converted)) + geom_histogram() +
opts(axis.text.x = theme_text(angle=90))
Which yields this graph. I wanted %Y-%b formatting, though, so I hunted around and tried the following, based on this SO:
ggplot(dates, aes(x=converted)) + geom_histogram() +
scale_x_date(labels=date_format("%Y-%b"),
breaks = "1 month") +
opts(axis.text.x = theme_text(angle=90))
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
That gives me this graph
Correct x axis label format
The frequency distribution has changed shape (binwidth issue?)
Tick marks don't appear centered under bars
The xlims have changed as well
I worked through the example in the ggplot2 documentation at the scale_x_date section and geom_line() appears to break, label, and center ticks correctly when I use it with my same x-axis data. I don't understand why the histogram is different.
Updates based on answers from edgester and gauden
I initially thought gauden's answer helped me solve my problem, but am now puzzled after looking more closely. Note the differences between the two answers' resulting graphs after the code.
Assume for both:
library(ggplot2)
library(scales)
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
Based on @edgester's answer below, I was able to do the following:
freqs <- aggregate(dates$Date, by=list(dates$Date), FUN=length)
freqs$names <- as.Date(freqs$Group.1, format="%Y-%m-%d")
ggplot(freqs, aes(x=names, y=x)) + geom_bar(stat="identity") +
scale_x_date(breaks="1 month", labels=date_format("%Y-%b"),
limits=c(as.Date("2008-04-30"),as.Date("2012-04-01"))) +
ylab("Frequency") + xlab("Year and Month") +
theme_bw() + opts(axis.text.x = theme_text(angle=90))
Here is my attempt based on gauden's answer:
dates$Date <- as.Date(dates$Date)
ggplot(dates, aes(x=Date)) + geom_histogram(binwidth=30, colour="white") +
scale_x_date(labels = date_format("%Y-%b"),
breaks = seq(min(dates$Date)-5, max(dates$Date)+5, 30),
limits = c(as.Date("2008-05-01"), as.Date("2012-04-01"))) +
ylab("Frequency") + xlab("Year and Month") +
theme_bw() + opts(axis.text.x = theme_text(angle=90))
Plot based on edgester's approach:
Plot based on gauden's approach:
Note the following:
gaps in gauden's plot for 2009-Dec and 2010-Mar; table(dates$Date) reveals that there are 19 instances of 2009-12-01 and 26 instances of 2010-03-01 in the data
edgester's plot starts at 2008-Apr and ends at 2012-May. This is correct based on a minimum value in the data of 2008-04-01 and a max date of 2012-05-01. For some reason gauden's plot starts in 2008-Mar and still somehow manages to end at 2012-May. After counting bins and reading along the month labels, for the life of me I can't figure out which plot has an extra or is missing a bin of the histogram!
Any thoughts on the differences here? edgester's method of creating a separate count
Related References
As an aside, here are other locations that have information about dates and ggplot2 for passers-by looking for help:
Started here at learnr.wordpress, a popular R blog. It stated that I needed to get my data into POSIXct format, which I now think is false and wasted my time.
Another learnr post recreates a time series in ggplot2, but wasn't really applicable to my situation.
r-bloggers has a post on this, but it appears outdated. The simple format= option did not work for me.
This SO question is playing with breaks and labels. I tried treating my Date vector as continuous and don't think it worked so well. It looked like it was overlaying the same label text over and over so the letters looked kind of odd. The distribution is sort of correct but there are odd breaks. My attempt based on the accepted answer was like so (result here).
UPDATE
Version 2: Using Date class
I update the example to demonstrate aligning the labels and setting limits on the plot. I also demonstrate that as.Date does indeed work when used consistently (actually it is probably a better fit for your data than my earlier example).
The Target Plot v2
The Code v2
And here is (somewhat excessively) commented code:
library("ggplot2")
library("scales")
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
dates$Date <- as.Date(dates$Date)
# convert the Date to its numeric equivalent
# Note that Dates are stored as number of days internally,
# hence it is easy to convert back and forth mentally
dates$num <- as.numeric(dates$Date)
bin <- 60 # used for aggregating the data and aligning the labels
p <- ggplot(dates, aes(num, ..count..))
p <- p + geom_histogram(binwidth = bin, colour="white")
# The numeric data is treated as a date,
# breaks are set to an interval equal to the binwidth,
# and a set of labels is generated and adjusted in order to align with bars
p <- p + scale_x_date(breaks = seq(min(dates$num)-20, # change -20 term to taste
max(dates$num),
bin),
labels = date_format("%Y-%b"),
limits = c(as.Date("2009-01-01"),
as.Date("2011-12-01")))
# from here, format at ease
p <- p + theme_bw() + xlab(NULL) + opts(axis.text.x = theme_text(angle=45,
hjust = 1,
vjust = 1))
p
Version 1: Using POSIXct
I try a solution that does everything in ggplot2, drawing without the aggregation, and setting the limits on the x-axis between the beginning of 2009 and the end of 2011.
The Target Plot v1
The Code v1
library("ggplot2")
library("scales")
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
dates$Date <- as.POSIXct(dates$Date)
p <- ggplot(dates, aes(Date, ..count..)) +
geom_histogram() +
theme_bw() + xlab(NULL) +
scale_x_datetime(breaks = date_breaks("3 months"),
labels = date_format("%Y-%b"),
limits = c(as.POSIXct("2009-01-01"),
as.POSIXct("2011-12-01")) )
p
Of course, it could do with playing with the label options on the axis, but this is to round off the plotting with a clean short routine in the plotting package.
I know this is an old question, but for anybody coming to this in 2021 (or later), this can be done much more easily using the breaks= argument of geom_histogram() and a little shortcut function that creates the required sequence.
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
dates$Date <- lubridate::ymd(dates$Date)
by_month <- function(x,n=1){
seq(min(x,na.rm=T),max(x,na.rm=T),by=paste0(n," months"))
}
ggplot(dates,aes(Date)) +
geom_histogram(breaks = by_month(dates$Date)) +
scale_x_date(labels = scales::date_format("%Y-%b"),
breaks = by_month(dates$Date,2)) +
theme(axis.text.x = element_text(angle=90))
I think the key thing is that you need to do the frequency calculation outside of ggplot. Use aggregate() with geom_bar(stat="identity") to get a histogram without the reordered factors. Here is some example code:
require(ggplot2)
# scales goes with ggplot and adds the needed scale* functions
require(scales)
# need the month() function for the extra plot
require(lubridate)
# original data
#df<-read.csv("http://pastebin.com/download.php?i=sDzXKFxJ", header=TRUE)
# simulated data
years=sample(seq(2008,2012),681,replace=TRUE,prob=c(0.0176211453744493,0.302496328928047,0.323054331864905,0.237885462555066,0.118942731277533))
months=sample(seq(1,12),681,replace=TRUE)
my.dates=as.Date(paste(years,months,01,sep="-"))
df=data.frame(YM=strftime(my.dates, format="%Y-%b"),Date=my.dates,Year=years,Month=months)
# end simulated data creation
# sort the list just to make it pretty. It makes no difference in the final results
df=df[do.call(order, df[c("Date")]), ]
# add a dummy column for clarity in processing
df$Count=1
# compute the frequencies ourselves
freqs=aggregate(Count ~ Year + Month, data=df, FUN=length)
# rebuild the Date column so that ggplot works
freqs$Date=as.Date(paste(freqs$Year,freqs$Month,"01",sep="-"))
# I set the breaks for 2 months to reduce clutter
g<-ggplot(data=freqs,aes(x=Date,y=Count))+ geom_bar(stat="identity") + scale_x_date(labels=date_format("%Y-%b"),breaks="2 months") + theme_bw() + opts(axis.text.x = theme_text(angle=90))
print(g)
# don't overwrite the previous graph
dev.new()
# just for grins, here is a faceted view by year
# Add the Month.name factor to have things work. month() keeps the factor levels in order
freqs$Month.name=month(freqs$Date,label=TRUE, abbr=TRUE)
g2<-ggplot(data=freqs,aes(x=Month.name,y=Count))+ geom_bar(stat="identity") + facet_grid(Year~.) + theme_bw()
print(g2)
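The monthly counts can also be cross-checked in base R before plotting. This is a small sketch on simulated first-of-month dates (a hypothetical stand-in for the pastebin data); cut() with breaks = "month" bins on calendar-month boundaries, so months of 28 to 31 days are never split or merged.

```r
# Simulated first-of-month dates covering 2008-04 to 2012-05
set.seed(42)
months_seq <- seq(as.Date("2008-04-01"), as.Date("2012-05-01"), by = "month")
sim_dates  <- sample(months_seq, 681, replace = TRUE)

# Count observations per calendar month
monthly <- table(cut(sim_dates, breaks = "month"))

head(monthly)
```

Comparing these counts with the bar heights is a quick way to see whether a histogram has dropped or merged a month.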
The error in the graph under the title "Plot based on gauden's approach" is due to the binwidth parameter:
... + geom_histogram(binwidth = 30, colour = "white") + ...
If we change the value of 30 to a value less than 20, such as 10, you will get all frequencies.
In statistics, the values are more important than the presentation: a bland but correct graphic is better than a very pretty picture with errors.