Editing axis tick labels with ggplot2 when x-axis is mm-dd - r

I have a dataset where the x-axis is a date, but it is only mm-dd (no year). I am using year as a group variable as I am trying to show a YOY change on the same plot. The x-axis labeling is too crowded and I'd like to limit the tick mark labels so that not every date is shown. This could be every other day, every third day, one day a week -- any of these would work.
I have tried a few solutions but cannot get them to work, I'm assuming because my x-axis is not a Date, but a character. (Previous to arriving at this mm-dd solution for the x-axis, I tried plotting the x-axis with a yyyy-mm-dd Date format, but was unsuccessful in figuring out how to get ggplot2 to ignore the "yyyy" part.)
An example:
myDF <- data.frame(
myDate = format(seq(as.Date("2014-02-01"),
length=28, by="1 day"), "%m-%d"),
myVar = sample(100,28),
myGroup = sample(2,28,TRUE)
)
head(myDF)
myDate myVar myGroup
02-01 87 1
02-02 34 1
02-03 48 2
02-04 59 1
02-05 98 1
02-06 18 2
ggplot(myDF, aes(myDate, myVar, group=myGroup, color=as.factor(myGroup))) +
geom_line()
I have tried:
ggplot(myDF, aes(myDate, myVar, group=myGroup, color=as.factor(myGroup))) +
geom_line() + scale_x_discrete(breaks = c(1,10,20))
This appears to confuse ggplot since the labels disappear completely. (Same result with a seq() attempt.)
I have also tried:
ggplot(myDF, aes(myDate, myVar, group=myGroup, color=as.factor(myGroup))) +
geom_line() + scale_x_date(breaks = "1 week")
This throws an error re: myDate not being a Date.
I've already switched the format of the tick labels to be vertical, but it is still too crowded on the plot.
Any tips would be very much appreciated. Thanks!

If you want to keep the myDate variable without the year (as a character), one solution is to use scale_x_discrete() and pass a subset of the myDF$myDate values to its breaks= argument, selecting the sequence of values you want to show. In this example I selected every 7th value.
ggplot(myDF, aes(myDate, myVar, group=myGroup, color=as.factor(myGroup))) +
geom_line() + scale_x_discrete(breaks = unique(myDF$myDate)[seq(1,28,7)])
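The earlier attempt with breaks = c(1,10,20) most likely failed because a discrete scale matches breaks against the values of the variable (strings such as "02-08"), not against positions. Below is a minimal sketch of the same idea, assuming the 28-day example data from the question. The names all_dates and weekly_breaks are introduced here for illustration only.

```r
library(ggplot2)

set.seed(1)  # only so the random example data are reproducible
myDF <- data.frame(
  myDate  = format(seq(as.Date("2014-02-01"), length.out = 28, by = "1 day"), "%m-%d"),
  myVar   = sample(100, 28),
  myGroup = sample(2, 28, TRUE)
)

# Breaks on a discrete axis must be actual values of the variable,
# so pick every 7th label instead of passing the positions 1, 8, 15, 22.
all_dates     <- unique(myDF$myDate)
weekly_breaks <- all_dates[seq(1, length(all_dates), by = 7)]
print(weekly_breaks)

p <- ggplot(myDF, aes(myDate, myVar, group = myGroup, color = factor(myGroup))) +
  geom_line() +
  scale_x_discrete(breaks = weekly_breaks)
p
```

Changing by = 7 to by = 2 or by = 3 would give every other or every third day.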

Related

Change number of digits on the x-axis

I am trying to plot price against year. Year is stored as an integer, but the axis labels are plotted with one decimal digit. How can I change that so the x-axis shows no decimals?
df looks like this:
now I use this code to plot:
ggplot(df, aes(Jahr, Energiepreis)) + geom_line()
plot looks like this
I tried scale_x_continuous(), but with no success so far.
By default ggplot2 chooses approximately five breaks for a continuous variable, which often works fine. Because your values are years, I would set the desired breaks explicitly with the breaks argument. For example, to add a break for every second year you could do:
df <- data.frame(
Jahr = 2020:2030,
Energiepreis = 1:11
)
library(ggplot2)
ggplot(df, aes(Jahr, Energiepreis)) +
geom_line() +
scale_x_continuous(breaks = seq(2020, 2030, 2))
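If the range of years is not known in advance, the breaks argument also accepts a function of the axis limits. The following is a sketch under that assumption; integer_breaks is a hypothetical helper written for this example, not part of ggplot2.

```r
library(ggplot2)

df <- data.frame(Jahr = 2020:2030, Energiepreis = 1:11)

# Hypothetical helper: returns a function that places breaks on whole
# numbers, `step` units apart, inside whatever limits the scale passes in.
integer_breaks <- function(step = 1) {
  function(limits) seq(ceiling(limits[1]), floor(limits[2]), by = step)
}

p <- ggplot(df, aes(Jahr, Energiepreis)) +
  geom_line() +
  scale_x_continuous(breaks = integer_breaks(2))
p
```

Because the breaks are always whole numbers, the labels never need a decimal digit.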

Specify limits to Date axis that crosses the new year

I have this dataset I am working with where I am plotting monthly summaries. A problem I have encountered in ggplot2 is getting the x axis to go from, say, month 10 to 12 and then continue onwards with months 1 to, say, 4. In the example below I show this with a 20-year dataset where I remove the months May to September and plot the rest.
library(lubridate)
library(ggplot2)
mon=seq.Date(from=as.Date("2000-01-01"),to=as.Date("2019-12-01"),by="month")
val=rnorm(length(mon))
dd=data.frame(mon,val)
ddsub=subset(dd,month(mon)<5 |month(mon) >9)
ggplot(data=ddsub,aes(month(mon),val,group=month(mon))) + geom_boxplot() +
xlab("Month") + scale_x_continuous(breaks=c(1:12))
What I would like is for the x axis to start in Oct and to continue past the end of year to Apr.
Since month(ddsub$mon) returns a numeric resulting in a continuous horizontal axis, I have not found any neat way of breaking the ascending numerical order.
My only solution is to define the months as factors that I then reorder in the right way:
mon_factor=as.factor(month(ddsub$mon))
ddsub$mon_ahead=reorder(mon_factor,rep(c(4,5,6,7,1,2,3),20))
ggplot(data=ddsub,aes(mon_ahead,val)) + geom_boxplot() + xlab("Month")
While this works, I don't find it an elegant solution. It is cumbersome to have to define a new month variable and then reorder it.
Does anyone know if there is a way of working with the Date objects directly and defining the limits of the axis so that it begins in Oct and ends in Apr?
I think using a factor will be simplest here, and you can automate the ordering using a helper column like mo_FY below, which makes October month 1 of the fiscal year. I like the syntax of forcats::fct_reorder to establish the ordering.
ddsub$mo_FY = (month(ddsub$mon) + 2) %% 12 + 1
ddsub$mon_fct = forcats::fct_reorder(factor(month(ddsub$mon)), ddsub$mo_FY)
ggplot(data=ddsub, aes(mon_fct, val)) +
geom_boxplot() +
xlab("Month")
If you want to avoid creating a factor, you can do it on the fly with the modulus operator and creative labels:
library(magrittr)  # provides %>% (also re-exported by dplyr)
ddsub %>%
  ggplot() +
  geom_boxplot(aes(x = (month(mon)+2) %% 12, y = val, group = month(mon))) +
  xlab("Month") +
  scale_x_continuous(breaks = c(0:6), labels = month(c(10:12,1:4), label = TRUE))
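The month arithmetic used in both answers can be checked in isolation. This is a small base-R sketch; months_kept, fy_position and month_levels are illustrative names that do not appear in the answers.

```r
# Calendar months kept in the question, in the desired display order (Oct..Apr)
months_kept <- c(10:12, 1:4)

# Shift the month numbers so that October becomes position 1
fy_position <- (months_kept + 2) %% 12 + 1
names(fy_position) <- month.abb[months_kept]
print(fy_position)

# The same ordering expressed as factor levels, usable on a discrete axis
month_levels <- month.abb[months_kept][order(fy_position)]
mon_fct <- factor(month.abb[c(1, 4, 10, 12)], levels = month_levels)
print(levels(mon_fct))
```

Dropping the final + 1 gives the zero-based positions 0 to 6 that the scale_x_continuous() variant uses as breaks.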

ggplot: Issue when converting axis values from number of days to months in a boxplot

When converting a numeric variable "number of days from 1st of January of 2015" to date, the boxplot only shows part of the range of y-values but not all.
In this example, I plotted "gender" vs "months". Months were obtained by transforming the original "days" variable (i.e. days starting from 2015/1/1). The range of numeric values should extend from the end of March to the beginning of April of the subsequent year, but ggplot() is only plotting values between Aug and Jan and showing only month labels within that range in the y-axis.
Any help to solve this issue is very welcome!
Here is the code and the corresponding plot:
gender <- c(rep("female",144), rep("male",144))
days <- c(274,285,302,330,117,230,271,207,235,249,268,NA,NA,NA,NA,210,255,290,267,252,257,268,288,220,264,270,277,303,222,252,296,323,369,NA,258,NA,240,245,310,271,272,282,314,345,214,211,258,268,145,176,244,273,249,257,277,284,272,273,272,282,290,297,260,266,277,213,247,244,269,349,268,NA,220,235,269,299,266,273,274,307,285,299,300,224,257,284,291,305,278,294,455,280,262,272,276,295,338,264,339,232,277,230,270,312,276,285,308,241,273,340,249,260,270,352,297,217,247,287,320,191,249,265,287,320,432,262,265,324,309,234,441,409,264,381,262,276,316,330,252,264,298,315,287,330,274,287,371,237,259,266,349,247,249,241,333,379,486,198,249,270,275,279,314,182,234,252,289,319,216,262,293,234,272,284,311,258,NA,299,314,290,292,296,300,274,289,359,267,319,NA,492,294,319,293,265,273,315,307,315,287,378,238,239,315,325,361,249,NA,192,224,226,204,208,234,263,283,294,430,267,273,307,327,460,240,307,319,492,300,311,485,348,297,348,317,317,318,338,316,316,336,255,284,316,249,302,307,308,301,265,273,316,281,326,272,283,NA,NA,243,254,271,191,259,324,287,265,310,337,287,326,304,399,337,295,313,228,288,307,270,347,290,245,NA,283,423,223,NA,264,314,283)
mytable <- data.frame(gender,days)
range(mytable$days, na.rm=T) # 117 to 492
mytable$months <- (as.Date(days,origin = "2015/1/1"))
ggplot(mytable, aes(x=gender, y=months,fill=gender)) +
geom_boxplot()
I am not sure about the intuition behind this plot, but this should give you what you want:
ggplot(mytable, aes(x=gender, y=months, fill=gender)) +
geom_boxplot() +
scale_y_date(date_labels="%b ", date_breaks ="1 month",
limits = c(as.Date("2015-3-1"), as.Date("2016-2-1")))
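One thing to check: the fixed limits above end on 2016-02-01, while day 492 after 2015-01-01 falls on 2016-05-07, and ggplot2 by default drops values that lie outside the scale limits. A hedged alternative is to take the limits from the data. The sketch below uses only a handful of the day counts from the question, and date_limits is a name introduced here.

```r
library(ggplot2)

# A few of the day counts from the question, including the extremes 117 and 492
mytable <- data.frame(
  gender = rep(c("female", "male"), each = 4),
  days   = c(117, 274, 330, 455, 249, 302, 441, 492)
)
mytable$months <- as.Date(mytable$days, origin = "2015-01-01")

# Derive the axis limits from the data instead of hard-coding them
date_limits <- range(mytable$months, na.rm = TRUE)
print(date_limits)

p <- ggplot(mytable, aes(x = gender, y = months, fill = gender)) +
  geom_boxplot() +
  scale_y_date(date_labels = "%b %Y", date_breaks = "1 month", limits = date_limits)
p
```

Including the year in date_labels helps here because the axis spans more than twelve months, so a bare month name would be ambiguous.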

Grouped bar chart with date on x-axis

I'm getting back to R, and I have some trouble plotting the data I want.
It's in this format :
date value1 value2
10/25/2016 50 60
12/16/2016 70 80
01/05/2017 35 45
And I would like to plot value1 and value2 next to each other, with the corresponding date on the x axis. So far I have this, I tried to plot only value1 first :
df$date <- as.Date(df$date, "%m/%d/%Y")
ggplot(data=df,aes(x=date,y=value1))
But the resulting plot doesn't show anything. The maximum values on the x and y axis seem to correspond to the ranges of my dataframe, but why is nothing showing up?
It works with plot(df$date,df$value1) though, so I don't get what I am doing wrong.
The ggplot() call alone does not create any layers on the plot; you need to add a geom.
For this you probably want geom_point() or geom_line():
ggplot(data=df,aes(x=date,y=value1)) +
geom_point()
or
ggplot(data=df,aes(x=date,y=value1)) +
geom_line()
or you could do both if you want points and lines
ggplot(data=df,aes(x=date,y=value1)) +
geom_point() +
geom_line()
If you want both values on the plot, I would recommend doing some data manipulation first with the tidyr package.
library(tidyr)  # provides gather() and re-exports %>%
df %>%
  gather(key = "group", value = "value", value1:value2) %>%
  ggplot(aes(date, value, color = group, group = group)) +
  geom_line()
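Since the title asks for a grouped bar chart, the same long-format data can also be drawn as side-by-side bars. This is a sketch, assuming the three rows shown in the question. It uses tidyr::pivot_longer(), the newer counterpart of gather(), and the object name long is introduced here.

```r
library(ggplot2)
library(tidyr)

df <- data.frame(
  date   = as.Date(c("10/25/2016", "12/16/2016", "01/05/2017"), "%m/%d/%Y"),
  value1 = c(50, 70, 35),
  value2 = c(60, 80, 45)
)

# One row per (date, group) pair
long <- pivot_longer(df, cols = c(value1, value2),
                     names_to = "group", values_to = "value")

# factor(date) puts the irregularly spaced dates on a discrete axis,
# so each date gets one evenly sized pair of bars.
p <- ggplot(long, aes(x = factor(date), y = value, fill = group)) +
  geom_col(position = "dodge") +
  xlab("date")
p
```

Mapping date directly (without factor()) keeps a true time axis, at the cost of bars whose width is measured in days.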

Understanding dates and plotting a histogram with ggplot2 in R

Main Question
I'm having issues with understanding why the handling of dates, labels and breaks is not working as I would have expected in R when trying to make a histogram with ggplot2.
I'm looking for:
A histogram of the frequency of my dates
Tick marks centered under the matching bars
Date labels in %Y-%b format
Appropriate limits; minimized empty space between edge of grid space and outermost bars
I've uploaded my data to pastebin to make this reproducible. I've created several columns as I wasn't sure the best way to do this:
> dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
> head(dates)
YM Date Year Month
1 2008-Apr 2008-04-01 2008 4
2 2009-Apr 2009-04-01 2009 4
3 2009-Apr 2009-04-01 2009 4
4 2009-Apr 2009-04-01 2009 4
5 2009-Apr 2009-04-01 2009 4
6 2009-Apr 2009-04-01 2009 4
Here's what I tried:
library(ggplot2)
library(scales)
dates$converted <- as.Date(dates$Date, format="%Y-%m-%d")
ggplot(dates, aes(x=converted)) + geom_histogram() +
  theme(axis.text.x = element_text(angle=90))  # originally opts()/theme_text(), since removed from ggplot2
Which yields this graph. I wanted %Y-%b formatting, though, so I hunted around and tried the following, based on this SO:
ggplot(dates, aes(x=converted)) + geom_histogram() +
  scale_x_date(labels=date_format("%Y-%b"),
               breaks = "1 month") +
  theme(axis.text.x = element_text(angle=90))  # originally opts()/theme_text(), since removed from ggplot2
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
That gives me this graph
Correct x axis label format
The frequency distribution has changed shape (binwidth issue?)
Tick marks don't appear centered under bars
The xlims have changed as well
I worked through the example in the ggplot2 documentation at the scale_x_date section and geom_line() appears to break, label, and center ticks correctly when I use it with my same x-axis data. I don't understand why the histogram is different.
Updates based on answers from edgester and gauden
I initially thought gauden's answer helped me solve my problem, but am now puzzled after looking more closely. Note the differences between the two answers' resulting graphs after the code.
Assume for both:
library(ggplot2)
library(scales)
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
Based on @edgester's answer below, I was able to do the following:
freqs <- aggregate(dates$Date, by=list(dates$Date), FUN=length)
freqs$names <- as.Date(freqs$Group.1, format="%Y-%m-%d")
ggplot(freqs, aes(x=names, y=x)) + geom_bar(stat="identity") +
scale_x_date(breaks="1 month", labels=date_format("%Y-%b"),
limits=c(as.Date("2008-04-30"),as.Date("2012-04-01"))) +
ylab("Frequency") + xlab("Year and Month") +
theme_bw() + theme(axis.text.x = element_text(angle=90))
Here is my attempt based on gauden's answer:
dates$Date <- as.Date(dates$Date)
ggplot(dates, aes(x=Date)) + geom_histogram(binwidth=30, colour="white") +
scale_x_date(labels = date_format("%Y-%b"),
breaks = seq(min(dates$Date)-5, max(dates$Date)+5, 30),
limits = c(as.Date("2008-05-01"), as.Date("2012-04-01"))) +
ylab("Frequency") + xlab("Year and Month") +
theme_bw() + theme(axis.text.x = element_text(angle=90))
Plot based on edgester's approach:
Plot based on gauden's approach:
Note the following:
gaps in gauden's plot for 2009-Dec and 2010-Mar; table(dates$Date) reveals that there are 19 instances of 2009-12-01 and 26 instances of 2010-03-01 in the data
edgester's plot starts at 2008-Apr and ends at 2012-May. This is correct based on a minimum value in the data of 2008-04-01 and a max date of 2012-05-01. For some reason gauden's plot starts in 2008-Mar and still somehow manages to end at 2012-May. After counting bins and reading along the month labels, for the life of me I can't figure out which plot has an extra or is missing a bin of the histogram!
Any thoughts on the differences here? edgester's method of creating a separate count
Related References
As an aside, here are other locations that have information about dates and ggplot2 for passers-by looking for help:
Started here at learnr.wordpress, a popular R blog. It stated that I needed to get my data into POSIXct format, which I now think is false and wasted my time.
Another learnr post recreates a time series in ggplot2, but wasn't really applicable to my situation.
r-bloggers has a post on this, but it appears outdated. The simple format= option did not work for me.
This SO question is playing with breaks and labels. I tried treating my Date vector as continuous and don't think it worked so well. It looked like it was overlaying the same label text over and over so the letters looked kind of odd. The distribution is sort of correct but there are odd breaks. My attempt based on the accepted answer was like so (result here).
UPDATE
Version 2: Using Date class
I update the example to demonstrate aligning the labels and setting limits on the plot. I also demonstrate that as.Date does indeed work when used consistently (actually it is probably a better fit for your data than my earlier example).
The Target Plot v2
The Code v2
And here is (somewhat excessively) commented code:
library("ggplot2")
library("scales")
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
dates$Date <- as.Date(dates$Date)
# convert the Date to its numeric equivalent
# Note that Dates are stored as number of days internally,
# hence it is easy to convert back and forth mentally
dates$num <- as.numeric(dates$Date)
bin <- 60 # used for aggregating the data and aligning the labels
p <- ggplot(dates, aes(num, ..count..))
p <- p + geom_histogram(binwidth = bin, colour="white")
# The numeric data is treated as a date,
# breaks are set to an interval equal to the binwidth,
# and a set of labels is generated and adjusted in order to align with bars
p <- p + scale_x_date(breaks = seq(min(dates$num)-20, # change -20 term to taste
max(dates$num),
bin),
labels = date_format("%Y-%b"),
limits = c(as.Date("2009-01-01"),
as.Date("2011-12-01")))
# from here, format at ease
p <- p + theme_bw() + xlab(NULL) + opts(axis.text.x = theme_text(angle=45,
hjust = 1,
vjust = 1))
p
Version 1: Using POSIXct
I try a solution that does everything in ggplot2, drawing without the aggregation, and setting the limits on the x-axis between the beginning of 2009 and the end of 2011.
The Target Plot v1
The Code v1
library("ggplot2")
library("scales")
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
dates$Date <- as.POSIXct(dates$Date)
p <- ggplot(dates, aes(Date, ..count..)) +
geom_histogram() +
theme_bw() + xlab(NULL) +
scale_x_datetime(breaks = date_breaks("3 months"),
labels = date_format("%Y-%b"),
limits = c(as.POSIXct("2009-01-01"),
as.POSIXct("2011-12-01")) )
p
Of course, it could do with playing with the label options on the axis, but this is to round off the plotting with a clean short routine in the plotting package.
I know this is an old question, but for anybody coming to this in 2021 (or later), this can be done much more easily by using the breaks= argument of geom_histogram() and a small helper function that generates the required sequence.
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
dates$Date <- lubridate::ymd(dates$Date)
by_month <- function(x,n=1){
seq(min(x,na.rm=T),max(x,na.rm=T),by=paste0(n," months"))
}
ggplot(dates,aes(Date)) +
geom_histogram(breaks = by_month(dates$Date)) +
scale_x_date(labels = scales::date_format("%Y-%b"),
breaks = by_month(dates$Date,2)) +
theme(axis.text.x = element_text(angle=90))
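To see what the helper returns, it can be run on a few dates without any plotting. The toy vector x below is made up for illustration; the function body follows the answer, with paste() in place of paste0().

```r
# Same helper as above: a sequence of dates from the earliest to the
# latest value, stepping n months at a time.
by_month <- function(x, n = 1) {
  seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), by = paste(n, "months"))
}

x <- as.Date(c("2008-04-01", "2008-06-01", NA, "2008-09-01"))

monthly_edges   <- by_month(x)     # used as histogram bin edges
bimonthly_ticks <- by_month(x, 2)  # used as axis breaks
print(monthly_edges)
print(bimonthly_ticks)
```

The sequence stops at or before the latest date, so if that date does not fall on a step, observations after the final edge may fall outside the bins. In the data from the question every date is the first of a month, so the last edge coincides with the maximum.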
I think the key thing is that you need to do the frequency calculation outside of ggplot. Use aggregate() with geom_bar(stat="identity") to get a histogram without the reordered factors. Here is some example code:
require(ggplot2)
# scales goes with ggplot and adds the needed scale* functions
require(scales)
# need the month() function for the extra plot
require(lubridate)
# original data
#df<-read.csv("http://pastebin.com/download.php?i=sDzXKFxJ", header=TRUE)
# simulated data
years=sample(seq(2008,2012),681,replace=TRUE,prob=c(0.0176211453744493,0.302496328928047,0.323054331864905,0.237885462555066,0.118942731277533))
months=sample(seq(1,12),681,replace=TRUE)
my.dates=as.Date(paste(years,months,01,sep="-"))
df=data.frame(YM=strftime(my.dates, format="%Y-%b"),Date=my.dates,Year=years,Month=months)
# end simulated data creation
# sort the list just to make it pretty. It makes no difference in the final results
df=df[do.call(order, df[c("Date")]), ]
# add a dummy column for clarity in processing
df$Count=1
# compute the frequencies ourselves
freqs=aggregate(Count ~ Year + Month, data=df, FUN=length)
# rebuild the Date column so that ggplot works
freqs$Date=as.Date(paste(freqs$Year,freqs$Month,"01",sep="-"))
# I set the breaks for 2 months to reduce clutter
g <- ggplot(data=freqs, aes(x=Date, y=Count)) +
  geom_bar(stat="identity") +
  scale_x_date(labels=date_format("%Y-%b"), breaks="2 months") +
  theme_bw() +
  theme(axis.text.x = element_text(angle=90))  # originally opts()/theme_text(), since removed from ggplot2
print(g)
# don't overwrite the previous graph
dev.new()
# just for grins, here is a faceted view by year
# Add the Month.name factor to have things work. month() keeps the factor levels in order
freqs$Month.name=month(freqs$Date,label=TRUE, abbr=TRUE)
g2<-ggplot(data=freqs,aes(x=Month.name,y=Count))+ geom_bar(stat="identity") + facet_grid(Year~.) + theme_bw()
print(g2)
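The core of this approach, counting per month first and then plotting the counts as bars, can be reduced to a few lines. The sketch below uses four made-up dates in place of the pastebin data; month_start and freqs are names chosen here, and geom_col() is used as the shorthand for geom_bar(stat="identity").

```r
library(ggplot2)

# Toy dates standing in for the pastebin data (illustrative values only)
dates <- as.Date(c("2009-04-01", "2009-04-01", "2009-05-01", "2009-07-01"))

# Collapse each date to the first day of its month, then count per month
month_start <- as.Date(format(dates, "%Y-%m-01"))
freqs <- as.data.frame(table(Date = month_start), stringsAsFactors = FALSE)
freqs$Date <- as.Date(freqs$Date)
print(freqs)

p <- ggplot(freqs, aes(x = Date, y = Freq)) +
  geom_col() +
  scale_x_date(date_breaks = "1 month", date_labels = "%Y-%b")
p
```

Because the counts are computed before plotting, no binwidth is involved, and a month without observations (June in this toy data) simply shows up as a gap.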
The error in the graph under the title "Plot based on gauden's approach" is due to the binwidth parameter:
... + geom_histogram(binwidth = 30, colour = "white") + ...
If we change the value from 30 to something less than 20, such as 10, you will get all the frequencies.
In statistics, the values matter more than the presentation: a bland but correct graphic is better than a very pretty picture with errors.
