Google Trends and Weeks, ggplot2 - r

When I download data from Google Trends, the dataset looks like this:
Week nuclear atomic nuclear.weapons unemployment
2004-01-04 - 2004-01-10 11 11 1 15
2004-01-11 - 2004-01-17 11 13 1 13
2004-01-18 - 2004-01-24 10 11 1 13
How can I change the dates in "Week" from this format "Y-m-d - Y-m-d" to a format like "Year-Week"?
Furthermore, how can I tell ggplot to print only the years on the x-axis instead of every value of x?
@Mattrition: Thank you. I followed your advice:
library(reshape2) # provides melt()
trends <- melt(trends, id = "Week",
measure = c("nuclear", "atomic", "nuclear.weapons", "unemployment"))
trends$Week<- gsub("^(\\d+-\\d+-\\d+).+", "\\1", trends$Week)
trends$Week <- as.Date(trends$Week)
ggplot(trends, aes(Week, value, colour = variable, group=variable)) +
geom_line() +
ylab("Trends") +
theme(legend.position="top", legend.title=element_blank(),
panel.background = element_rect(fill = "#FFFFFF", colour="#000000"))+
scale_color_manual(values=c("#999999", "#E69F00", "#56B4E9", "#009E73"))+
stat_smooth(method="loess")
Now only every second year is labeled (2004, 2006, ...) on the x-axis. How can I tell ggplot to label every year (2004, 2005, ...)?

ggplot will understand Date objects (see ?Date) and work out appropriate labelling if you can convert your dates to this format.
You can use something like gsub to extract the starting day of each week. It matches the regular expression given as the first argument against each string and, through the "\\1" replacement, keeps only the part captured inside the parentheses:
df$startingDay <- gsub("^(\\d+-\\d+-\\d+).+", "\\1", df$Week)
Then call as.Date() on the extracted day strings to convert to Date objects:
df$date <- as.Date(df$startingDay)
You can then use the date objects to plot whatever you wanted to plot:
g <- ggplot(df, aes(date, as.numeric(atomic))) + geom_line()
print(g)
EDIT:
To answer your additional question, add the following to your ggplot object:
library(scales)
g <- g + scale_x_date(breaks=date_breaks(width="1 year"),
labels=date_format("%Y"))
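For the "Year-Week" part of the question, here is a minimal sketch in base R. It assumes the weeks start on Sunday, as in the sample rows, and uses the %U week number (week of the year with Sunday as the first day). The names weeks, week_start and year_week are made up for this example.

```r
# Sample "Week" strings in the format from the question
weeks <- c("2004-01-04 - 2004-01-10",
           "2004-01-11 - 2004-01-17",
           "2004-01-18 - 2004-01-24")

# Keep only the first date of each range, then convert to Date
week_start <- as.Date(sub("^(\\d+-\\d+-\\d+).+", "\\1", weeks))

# "%U" is the week of the year, counting from the first Sunday
year_week <- format(week_start, "%Y-%U")
print(year_week)
```

For plotting, the Date column is still the better x variable. The "Year-Week" string is mostly useful as a label or a grouping key.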

Related

Plotting depth data

Working with a large chemistry dataset of samples collected at different depths. The data is in long format as:
<Date> <Depth> <Temp>
2015-06-11 4 m 15
2015-07-11 4 m 16
2015-08-11 4 m 17
2015-06-11 3 m 19
2015-07-11 3 m 20
2015-08-11 3 m 21
2015-06-11 2 m 25
2015-07-11 2 m 26
2015-08-11 2 m 27
Trying to graph it as such that I have temperature on my x-axis and depth on my y-axis and then color them by their dates. Currently when I add a geom_line to the function it just connects all the dots.
ggplot(df, aes(x = Temp, y = Depth, color = Date)) +
geom_point() +
geom_line()
Lines are drawn per group, and you normally get groups for free just by mapping an aesthetic as you did (color=). What actually happens is that ggplot2 applies the visible aesthetic (drawing the color) and also sets a somewhat hidden aesthetic, group=, from the discrete variables in the plot. This works fine unless the column you map to color= is continuous (like a date) rather than discrete (like a factor). If df$Date is stored as class "Date", it is continuous, no grouping is created, and you get the behavior you describe: a single line through all the points. The fix is either to set the group= aesthetic explicitly in addition to color=, or to convert df$Date to a factor (discrete).
The example below, using your dataset, should help explain. For illustration, I'm adding a column called df$Other, which holds discrete month labels.
df <- data.frame(
Date=rep(c('2015-06-11','2015-07-11','2015-08-11'),3),
Other=rep(c('Jun','July','Aug'),3),
Depth=c(4,4,4,3,3,3,2,2,2),
Temp=c(15,16,17,19,20,21,25,26,27)
)
df$Date <- as.Date(df$Date, format='%Y-%m-%d')
First, here's what your code posted gives you:
ggplot(df, aes(x=Temp, y=Depth, color=Date)) + geom_point() + geom_line()
Look familiar? We know that df$Date is continuous, because ggplot2 draws a legend which is continuous by default, and also because we know it is formatted as a Date class. Consider what happens if you swap out df$Other in place of df$Date:
ggplot(df, aes(x=Temp, y=Depth, color=Other)) + geom_point() + geom_line()
Now the issue should be very clear, but how can you solve it? Well, like I mentioned there are two approaches. One is to maintain df$Date as a continuous variable, but clarify to ggplot2 that you want to use this as a grouping variable. In order to do so, ggplot2 will basically convert it to a factor for purposes of connecting the lines, but keep it continuous to make the color scale:
ggplot(df, aes(x=Temp, y=Depth, color=Date)) +
geom_point() + geom_line(aes(group=Date))
One of the best options might be to set df$Date as a factor with ordered levels, since you're not actually using the "Date" class's continuous nature anyway. You can actually just use color=factor(Date) to fix it right in-line, but you'll notice that the levels are not going to be correct (in terms of the months in the correct order). In this case, I'd recommend changing the column prior to plotting using factor() and setting the levels there. Here's my solution:
# convert to character vector first
df$Date <- as.character(df$Date)
# it's already in the correct order, so just use the order of the df
df$Date <- factor(df$Date, levels=unique(df$Date))
ggplot(df, aes(x=Temp, y=Depth, color=Date)) + geom_point() + geom_line()
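If you want to check the grouping without looking at the picture, a rough sketch like the following may help. It rebuilds the sample data and uses ggplot_build() to count how many groups the line layer ends up with. The object names p_single, p_grouped and count_groups are only illustrative.

```r
library(ggplot2)

# Sample data from the question, with Date stored as a real Date
df <- data.frame(
  Date  = as.Date(rep(c("2015-06-11", "2015-07-11", "2015-08-11"), 3)),
  Depth = c(4, 4, 4, 3, 3, 3, 2, 2, 2),
  Temp  = c(15, 16, 17, 19, 20, 21, 25, 26, 27)
)

# Continuous colour only: no grouping is created
p_single <- ggplot(df, aes(x = Temp, y = Depth, colour = Date)) +
  geom_line()

# Explicit group: one line per sampling date
p_grouped <- ggplot(df, aes(x = Temp, y = Depth, colour = Date, group = Date)) +
  geom_line()

# Number of distinct groups in the first (and only) layer
count_groups <- function(p) length(unique(ggplot_build(p)$data[[1]]$group))
n_single  <- count_groups(p_single)
n_grouped <- count_groups(p_grouped)
```

With the sample data, n_single should come out as one group and n_grouped as three, which matches the "all dots connected" versus "one profile per date" pictures.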

Extracting Year from Date for ggplot2 to compare time series

I have the below data which I am trying to plot on the one chart so I can compare 2013 to 2014 data, with colour set by the 'year'.
I would like the output to look something like this:
My example CSV data looks like the below:
Date Data
1/01/2013 10
1/02/2013 20
1/03/2013 30
1/04/2013 20
1/01/2014 40
1/02/2014 70
1/03/2014 80
1/04/2014 90
I have the below code, but it doesn't extract the 'year' from the 'Date' data. I only know how to treat each 'date' with a different colour instead, but it's not really what I want.
p <- ggplot(d, aes(x=as.Date(Date, "%d/%m/%Y"), y=Data,
group=Date, color=Date)) +
geom_bar(stat="identity") +
scale_color_discrete(name="Year") +
labs(x="",y="Test Data") +
geom_smooth(aes(group=1))
p
Any help would be much appreciated.
Add an extra column Year to your data frame. Here is a simple example:
# create example data set
library("zoo")
library("strucchange")
d <- data.frame(Date=index(SP2001)+90, Data=SP2001$AAPL)
# add year column to data frame
d$Year <- format(d$Date, "%Y")
library("ggplot2")
p <- ggplot(d, aes(x=as.Date(Date, "%d/%m/%Y"), y=Data,
group=Year)) +
geom_bar(aes(fill=Year), stat="identity") +
labs(x="", y="Test Data") +
geom_smooth(aes(colour=Year))
p
Given a Date object, you can extract the year as follows:
format(date_series,'%Y')
%Y gives the four-digit year, %y just the last two digits.
You can add more elements to the format string. For example, %Y%m outputs values like 201401, 201402. I use this one frequently.
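Applied to data shaped like the CSV in the question, that could look like the sketch below. It assumes the dates are day/month/year strings. The data frame is a shortened hand-typed copy of the sample, and the YearMonth column is only there to show the combined format.

```r
# A few rows shaped like the sample CSV (day/month/year strings)
d <- data.frame(
  Date = c("1/01/2013", "1/02/2013", "1/01/2014", "1/02/2014"),
  Data = c(10, 20, 40, 70)
)

# Parse the strings once, then derive the year and year-month labels
d$Date      <- as.Date(d$Date, format = "%d/%m/%Y")
d$Year      <- format(d$Date, "%Y")    # "2013", "2014"
d$YearMonth <- format(d$Date, "%Y%m")  # "201301", "201302", ...
print(d)
```

Because format() returns character strings, Year is discrete and can be mapped directly to colour or fill.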

trouble getting Date field on X axis using ggplot2

head(bktst.plotdata)
date method product type actuals forecast residual Percent_error month
1 2012-12-31 bauwd CUSTM NET 194727.51 -8192.00 -202919.51 -104.21 Dec12
2 2013-01-31 bauwd CUSTM NET 470416.27 1272.01 -469144.26 -99.73 Jan13
3 2013-02-28 bauwd CUSTM NET 190943.57 -1892.45 -192836.02 -100.99 Feb13
4 2013-03-31 bauwd CUSTM NET -42908.91 2560.05 45468.96 -105.97 Mar13
5 2013-04-30 bauwd CUSTM NET -102401.68 358807.48 461209.16 -450.39 Apr13
6 2013-05-31 bauwd CUSTM NET -134869.73 337325.33 472195.06 -350.11 May13
I have been trying to plot my backtest results using ggplot2. A sample of the dataset is shown above. The dates range from Dec 2012 to Jul 2013, and there are 3 levels in 'method', 5 levels in 'product' and 2 levels in 'type'.
I tried the code below. The trouble is that R does not get the x-axis right: I get 'Jan, Feb, Mar, Apr, May, Jun, Jul, Aug' on the x-axis, whereas I expect it to run from Dec to Jul.
month.plot1 <- ggplot(data=bktst.plotdata, aes(x= date, y=Percent_error, colour=method))
facet4 <- facet_grid(product~type,scales="free_y")
title3 <- ggtitle("Percent Error - Month-over-Month")
xaxis2 <- xlab("Date")
yaxis3 <- ylab("Error (%)")
month.plot1+geom_line(stat="identity", size=1, position="identity")+facet4+title3+xaxis2+yaxis3
# Tried changing the code to this still not getting the X-axis right
month.plot1 <- ggplot(data=bktst.plotdata, aes(x= format(date,'%b%y'), y=Percent_error, colour=method))
month.plot1+geom_line(stat="identity", size=1, position="identity")+facet4+title3+xaxis2+yaxis3
Well, it looks like you are plotting the last day of each month, so it actually makes sense to me that December 31 is plotted very very close to January. If you look at the plotted points (with geom_point) you can see that each point is just to the left of the closest month axis.
It sounds like you want to plot years and months instead of actual dates. There are a variety of ways you might do this, but one thing you could do is change the day part of each date to the first of the month instead of the last. Here I show how you could do this using some functions from the lubridate package along with paste (I have assumed your variable date is already a Date object).
require(lubridate)
bktst.plotdata$date2 = as.Date(with(bktst.plotdata,
paste(year(date), month(date), "01", sep = "-")))
Then the plot axes start at December. You can change the format of the x axis if you load the scales package.
require(scales)
ggplot(data=bktst.plotdata, aes(x = date2, y=Percent_error, colour=method)) +
facet_grid(product~type,scales="free_y") +
ggtitle("Percent Error - Month-over-Month") +
xlab("Date") + ylab("Error (%)") +
geom_line() +
scale_x_date(labels=date_format(format = "%m-%Y"))
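If you would rather not load lubridate, the same first-of-month shift can be sketched in base R with format(). This assumes, as above, that the column is already a Date. The vector month_end is a hand-typed sample of the dates in the question.

```r
# Month-end dates like those in the question
month_end <- as.Date(c("2012-12-31", "2013-01-31", "2013-02-28"))

# Rewrite the day part as "01" and convert back to Date
month_start <- as.Date(format(month_end, "%Y-%m-01"))
print(month_start)
```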

Formatting dates in ggplot to highlight the start of financial years

I've got data referring to financial years, each starting on 1 April and ending on 31 March of the following calendar year.
df <- data.frame(date = seq(as.POSIXct("2008-04-01"), by="month", length.out=49),
var = rnorm(49))
head(df,3)
date var
1 2008-04-01 0.04265025
2 2008-05-01 -1.59671801
3 2008-06-01 0.4909673
Plotting df with library(ggplot2); ggplot(df) + geom_line(aes(date, var)) I get:
Now, what I'm interested in is having, say, the "2009" label positioned at "2009-04-01", as that is the actual start of FY 2009. I managed to get that with the following code:
ggplot(df) + geom_line(aes(date, var)) +
scale_x_datetime(breaks = df$date[months(df$date)=="April"],
labels = date_format("%Y"))
which correctly gives:
My question is (finally :-) ): does anyone have a better way of showing financial years, and ideally better code than the above?
You could use geom_rect to highlight the financial years. Assuming you save your original plot as p, try:
bgdf <- data.frame(xmin=as.POSIXct(paste0(2008:2011,"-04-01")),
xmax=as.POSIXct(paste0(2009:2012,"-04-01")),
ymin=min(df$var),ymax=max(df$var),alpha=((2008:2011)%%2)*0.1)
p + geom_rect(aes(xmin=xmin,xmax=xmax,ymin=ymin,ymax=ymax),
data=bgdf,alpha=bgdf$alpha,fill="blue")
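Two small pieces that might go with this, sketched in base R. months() returns month names in the current locale, so comparing against "April" can silently fail on a non-English system, while comparing the numeric month does not have that problem. The helper fy_start_year() is a hypothetical name, and the sketch uses Date rather than POSIXct to keep it short.

```r
# Monthly dates covering the same span as the question
dates <- seq(as.Date("2008-04-01"), by = "month", length.out = 49)

# Locale-independent way to pick the 1 April breaks
april_breaks <- dates[format(dates, "%m") == "04"]

# Financial year label: April to December belong to the current year,
# January to March to the previous one
fy_start_year <- function(x) {
  y <- as.integer(format(x, "%Y"))
  m <- as.integer(format(x, "%m"))
  ifelse(m >= 4, y, y - 1L)
}

fy <- fy_start_year(as.Date(c("2009-03-31", "2009-04-01")))
print(fy)
```

The fy column could then be used for faceting or colouring by financial year, and april_breaks passed as the breaks of the date scale.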

Understanding dates and plotting a histogram with ggplot2 in R

Main Question
I'm having issues with understanding why the handling of dates, labels and breaks is not working as I would have expected in R when trying to make a histogram with ggplot2.
I'm looking for:
A histogram of the frequency of my dates
Tick marks centered under the matching bars
Date labels in %Y-%b format
Appropriate limits; minimized empty space between edge of grid space and outermost bars
I've uploaded my data to pastebin to make this reproducible. I've created several columns as I wasn't sure the best way to do this:
> dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
> head(dates)
YM Date Year Month
1 2008-Apr 2008-04-01 2008 4
2 2009-Apr 2009-04-01 2009 4
3 2009-Apr 2009-04-01 2009 4
4 2009-Apr 2009-04-01 2009 4
5 2009-Apr 2009-04-01 2009 4
6 2009-Apr 2009-04-01 2009 4
Here's what I tried:
library(ggplot2)
library(scales)
dates$converted <- as.Date(dates$Date, format="%Y-%m-%d")
ggplot(dates, aes(x=converted)) + geom_histogram() +
opts(axis.text.x = theme_text(angle=90))
Which yields this graph. I wanted %Y-%b formatting, though, so I hunted around and tried the following, based on this SO:
ggplot(dates, aes(x=converted)) + geom_histogram() +
scale_x_date(labels=date_format("%Y-%b"),
breaks = "1 month") +
opts(axis.text.x = theme_text(angle=90))
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
That gives me this graph
Correct x axis label format
The frequency distribution has changed shape (binwidth issue?)
Tick marks don't appear centered under bars
The xlims have changed as well
I worked through the example in the ggplot2 documentation at the scale_x_date section and geom_line() appears to break, label, and center ticks correctly when I use it with my same x-axis data. I don't understand why the histogram is different.
Updates based on answers from edgester and gauden
I initially thought gauden's answer helped me solve my problem, but am now puzzled after looking more closely. Note the differences between the two answers' resulting graphs after the code.
Assume for both:
library(ggplot2)
library(scales)
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
Based on @edgester's answer below, I was able to do the following:
freqs <- aggregate(dates$Date, by=list(dates$Date), FUN=length)
freqs$names <- as.Date(freqs$Group.1, format="%Y-%m-%d")
ggplot(freqs, aes(x=names, y=x)) + geom_bar(stat="identity") +
scale_x_date(breaks="1 month", labels=date_format("%Y-%b"),
limits=c(as.Date("2008-04-30"),as.Date("2012-04-01"))) +
ylab("Frequency") + xlab("Year and Month") +
theme_bw() + opts(axis.text.x = theme_text(angle=90))
Here is my attempt based on gauden's answer:
dates$Date <- as.Date(dates$Date)
ggplot(dates, aes(x=Date)) + geom_histogram(binwidth=30, colour="white") +
scale_x_date(labels = date_format("%Y-%b"),
breaks = seq(min(dates$Date)-5, max(dates$Date)+5, 30),
limits = c(as.Date("2008-05-01"), as.Date("2012-04-01"))) +
ylab("Frequency") + xlab("Year and Month") +
theme_bw() + opts(axis.text.x = theme_text(angle=90))
Plot based on edgester's approach:
Plot based on gauden's approach:
Note the following:
gaps in gauden's plot for 2009-Dec and 2010-Mar; table(dates$Date) reveals that there are 19 instances of 2009-12-01 and 26 instances of 2010-03-01 in the data
edgester's plot starts at 2008-Apr and ends at 2012-May. This is correct based on a minimum value in the data of 2008-04-01 and a max date of 2012-05-01. For some reason gauden's plot starts in 2008-Mar and still somehow manages to end at 2012-May. After counting bins and reading along the month labels, for the life of me I can't figure out which plot has an extra or is missing a bin of the histogram!
Any thoughts on the differences here? edgester's method of creating a separate count
Related References
As an aside, here are other locations that have information about dates and ggplot2 for passers-by looking for help:
Started here at learnr.wordpress, a popular R blog. It stated that I needed to get my data into POSIXct format, which I now think is false and wasted my time.
Another learnr post recreates a time series in ggplot2, but wasn't really applicable to my situation.
r-bloggers has a post on this, but it appears outdated. The simple format= option did not work for me.
This SO question is playing with breaks and labels. I tried treating my Date vector as continuous and don't think it worked so well. It looked like it was overlaying the same label text over and over so the letters looked kind of odd. The distribution is sort of correct but there are odd breaks. My attempt based on the accepted answer was like so (result here).
UPDATE
Version 2: Using Date class
I update the example to demonstrate aligning the labels and setting limits on the plot. I also demonstrate that as.Date does indeed work when used consistently (actually it is probably a better fit for your data than my earlier example).
The Target Plot v2
The Code v2
And here is (somewhat excessively) commented code:
library("ggplot2")
library("scales")
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
dates$Date <- as.Date(dates$Date)
# convert the Date to its numeric equivalent
# Note that Dates are stored as number of days internally,
# hence it is easy to convert back and forth mentally
dates$num <- as.numeric(dates$Date)
bin <- 60 # used for aggregating the data and aligning the labels
p <- ggplot(dates, aes(num, ..count..))
p <- p + geom_histogram(binwidth = bin, colour="white")
# The numeric data is treated as a date,
# breaks are set to an interval equal to the binwidth,
# and a set of labels is generated and adjusted in order to align with bars
p <- p + scale_x_date(breaks = seq(min(dates$num)-20, # change -20 term to taste
max(dates$num),
bin),
labels = date_format("%Y-%b"),
limits = c(as.Date("2009-01-01"),
as.Date("2011-12-01")))
# from here, format at ease
p <- p + theme_bw() + xlab(NULL) + opts(axis.text.x = theme_text(angle=45,
hjust = 1,
vjust = 1))
p
Version 1: Using POSIXct
I try a solution that does everything in ggplot2, drawing without the aggregation, and setting the limits on the x-axis between the beginning of 2009 and the end of 2011.
The Target Plot v1
The Code v1
library("ggplot2")
library("scales")
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
dates$Date <- as.POSIXct(dates$Date)
p <- ggplot(dates, aes(Date, ..count..)) +
geom_histogram() +
theme_bw() + xlab(NULL) +
scale_x_datetime(breaks = date_breaks("3 months"),
labels = date_format("%Y-%b"),
limits = c(as.POSIXct("2009-01-01"),
as.POSIXct("2011-12-01")) )
p
Of course, it could do with playing with the label options on the axis, but this is to round off the plotting with a clean short routine in the plotting package.
I know this is an old question, but for anybody coming to this in 2021 (or later), this can be done much more easily by using the breaks= argument of geom_histogram() and creating a little shortcut function to build the required sequence.
dates <- read.csv("http://pastebin.com/raw.php?i=sDzXKFxJ", sep=",", header=T)
dates$Date <- lubridate::ymd(dates$Date)
by_month <- function(x,n=1){
seq(min(x,na.rm=T),max(x,na.rm=T),by=paste0(n," months"))
}
ggplot(dates,aes(Date)) +
geom_histogram(breaks = by_month(dates$Date)) +
scale_x_date(labels = scales::date_format("%Y-%b"),
breaks = by_month(dates$Date,2)) +
theme(axis.text.x = element_text(angle=90))
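To see what the helper produces without downloading the data, here is a small sketch on three hand-typed dates. The vector x and the result names are made up for this example. by_month() is the same function as above.

```r
# Same helper as above: a sequence of dates n months apart
by_month <- function(x, n = 1) {
  seq(min(x, na.rm = TRUE), max(x, na.rm = TRUE), by = paste0(n, " months"))
}

x <- as.Date(c("2008-04-01", "2008-06-01", "2008-09-01"))

bin_edges   <- by_month(x)     # one edge per month, used as histogram breaks
axis_breaks <- by_month(x, 2)  # every second month, used for axis labels
print(bin_edges)
print(axis_breaks)
```

Using calendar-month edges rather than a fixed number of days means each bin lines up with a month regardless of its length.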
I think the key thing is that you need to do the frequency calculation outside of ggplot. Use aggregate() with geom_bar(stat="identity") to get a histogram without the reordered factors. Here is some example code:
require(ggplot2)
# scales goes with ggplot and adds the needed scale* functions
require(scales)
# need the month() function for the extra plot
require(lubridate)
# original data
#df<-read.csv("http://pastebin.com/download.php?i=sDzXKFxJ", header=TRUE)
# simulated data
years=sample(seq(2008,2012),681,replace=TRUE,prob=c(0.0176211453744493,0.302496328928047,0.323054331864905,0.237885462555066,0.118942731277533))
months=sample(seq(1,12),681,replace=TRUE)
my.dates=as.Date(paste(years,months,01,sep="-"))
df=data.frame(YM=strftime(my.dates, format="%Y-%b"),Date=my.dates,Year=years,Month=months)
# end simulated data creation
# sort the list just to make it pretty. It makes no difference in the final results
df=df[do.call(order, df[c("Date")]), ]
# add a dummy column for clarity in processing
df$Count=1
# compute the frequencies ourselves
freqs=aggregate(Count ~ Year + Month, data=df, FUN=length)
# rebuild the Date column so that ggplot works
freqs$Date=as.Date(paste(freqs$Year,freqs$Month,"01",sep="-"))
# I set the breaks for 2 months to reduce clutter
g <- ggplot(data=freqs, aes(x=Date, y=Count)) +
geom_bar(stat="identity") +
scale_x_date(labels=date_format("%Y-%b"), breaks="2 months") +
theme_bw() +
opts(axis.text.x = theme_text(angle=90))
print(g)
# don't overwrite the previous graph
dev.new()
# just for grins, here is a faceted view by year
# Add the Month.name factor to have things work. month() keeps the factor levels in order
freqs$Month.name=month(freqs$Date,label=TRUE, abbr=TRUE)
g2<-ggplot(data=freqs,aes(x=Month.name,y=Count))+ geom_bar(stat="identity") + facet_grid(Year~.) + theme_bw()
print(g2)
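The counting step on its own can also be sketched with table() in base R, as an alternative to aggregate(). The dates vector below is a tiny made-up sample, not the pastebin data.

```r
# A tiny made-up sample of first-of-month dates
dates <- as.Date(c("2008-04-01", "2009-04-01", "2009-04-01", "2009-05-01"))

# Count occurrences per year-month outside of ggplot
counts <- as.data.frame(table(Month = format(dates, "%Y-%m")),
                        stringsAsFactors = FALSE)

# Rebuild a Date column so a date scale can be used for plotting
counts$Date <- as.Date(paste0(counts$Month, "-01"))
print(counts)
```

The resulting Freq column can be plotted with geom_bar(stat="identity") in the same way as Count above.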
The error in the graph under the title "Plot based on gauden's approach" is due to the binwidth parameter:
... + geom_histogram(binwidth = 30, colour = "white") + ...
If we change the value of 30 to a value less than 20, such as 10, you will get all frequencies.
In statistics, the values matter more than the presentation: a plain but correct graphic is better than a very pretty picture with errors.
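One plausible mechanism behind the gaps, sketched in base R: months are 28 to 31 days long, so fixed 30-day bins drift against first-of-month dates, and depending on where the bins start, one bin can catch two month starts while a neighbouring bin catches none. The bin origin below (25 days before the first date) is an arbitrary assumption chosen to show the effect. It is not the origin ggplot2 actually used for the plot in the question.

```r
# First-of-month dates over the range described in the question
month_starts <- seq(as.Date("2008-04-01"), as.Date("2012-05-01"), by = "month")

# Hypothetical bin origin: 25 days before the first date
origin <- min(month_starts) - 25

# Index of the 30-day bin each month start falls into
bin_id <- as.numeric(month_starts - origin) %/% 30

# Number of month starts in each bin
per_bin <- tabulate(bin_id + 1, nbins = max(bin_id) + 1)
print(per_bin)
```

With this origin, 2009-02-01 and 2009-03-01 land in the same bin and the next bin is empty, which looks like a gap in the histogram even though no month is missing from the data.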

Resources