Extracting Year from Date for ggplot2 to compare time series - r

I have the below data which I am trying to plot on the one chart so I can compare 2013 to 2014 data, with colour set by the 'year'.
I would like the output to look something like this:
My example CSV data looks like the below:
Date Data
1/01/2013 10
1/02/2013 20
1/03/2013 30
1/04/2013 20
1/01/2014 40
1/02/2014 70
1/03/2014 80
1/04/2014 90
I have the below code, but it doesn't extract the 'year' from the 'Date' data. I only know how to treat each 'date' with a different colour instead, but it's not really what I want.
p <- ggplot(d, aes(x=as.Date(Date, "%d/%m/%Y"), y=Data,
group=Date, color=Date)) +
geom_bar(stat="identity") +
scale_color_discrete(name="Year") +
labs(x="",y="Test Data") +
geom_smooth(aes(group=1))
p
Any help would be much appreciated.

Add an extra column Year to your data frame. Here is a simple example:
# create example data set
library("zoo")
library("strucchange")
d <- data.frame(Date=index(SP2001)+90, Data=SP2001$AAPL)
# add year column to data frame
d$Year <- format(d$Date, "%Y")
library("ggplot2")
p <- ggplot(d, aes(x=as.Date(Date, "%d/%m/%Y"), y=Data,
group=Year)) +
geom_bar(aes(fill=Year), stat="identity") +
labs(x="", y="Test Data") +
geom_smooth(aes(colour=Year))
p
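For the CSV data in the question, where Date arrives as text in day/month/year order, the same idea might look like the sketch below. The data frame is retyped by hand from the question. Putting the month on the x-axis and dodging the bars by year is an assumption about the desired chart, not something stated above. The plot is only drawn if ggplot2 is installed.

```r
# Data retyped from the question; Date is text in day/month/year order
d <- data.frame(
  Date = c("1/01/2013", "1/02/2013", "1/03/2013", "1/04/2013",
           "1/01/2014", "1/02/2014", "1/03/2014", "1/04/2014"),
  Data = c(10, 20, 30, 20, 40, 70, 80, 90),
  stringsAsFactors = FALSE
)

# Parse the text once, then derive Year and Month from the Date column
d$Date  <- as.Date(d$Date, format = "%d/%m/%Y")
d$Year  <- format(d$Date, "%Y")
d$Month <- format(d$Date, "%m")

# Assumed layout: months on the x-axis, one bar per year side by side
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(d, aes(x = Month, y = Data, fill = Year)) +
    geom_col(position = "dodge") +
    labs(x = "", y = "Test Data")
  print(p)
}
```

Because Year is a character column, ggplot2 treats it as discrete and gives each year its own fill colour.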

Given a Date object, you can extract the year as follows:
format(date_series, "%Y")
%Y gives the four-digit year; %y gives just the last two digits.
You can add more elements to the format string. For example, %Y%m outputs values like 201401 and 201402, which I use frequently.
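A short sketch of these format strings in use. The name date_series and its two values are made up for illustration.

```r
# Hypothetical example dates
date_series <- as.Date(c("2014-01-15", "2014-02-15"))

year4   <- format(date_series, "%Y")    # "2014" "2014"
year2   <- format(date_series, "%y")    # "14" "14"
yearmon <- format(date_series, "%Y%m")  # "201401" "201402"

# format() returns character; convert if a number is needed
year_num <- as.integer(year4)
```

Note that format() always returns text, so the result behaves as a discrete variable in ggplot2 unless it is converted back to a number.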

Related

ggplot: Plotting timeseries data with missing values

I have been trying to plot two columns of a data frame I created against each other. The first column, "Time", holds daily dates (format YYYY-MM-DD), and the second, "data1", holds the precipitation magnitude as a numeric value.
The data comes from an Excel file, "St Lucia3", which has a total of 11598 data points and stores daily precipitation data from 1981 to 2018 in two columns:
YearMonthDay (format- "YYYYMMDD", example "19810501")
Rainfall (mm)
The code for importing data into R:
library(readxl)  # provides read_excel()
StLucia <- read_excel("C:/Users/hp/Desktop/St Lucia3.xlsx")
The code for time data "Time" :
Time <- as.Date(as.character(StLucia$YearMonthDay), format= "%Y%m%d")
The code for precipitation data "data1" :
library("imputeTS")
data1 <- na_ma(StLucia$`Rainfall (mm)`, k = 4, weighting = "exponential")
The code for data frame "Precip1" :
Precip1 <- data.frame(Time, data1, check.rows=TRUE)
The code for ggplot is:
ggplot(data = Precip1, mapping= aes(x= Time, y= data1)) + geom_line()
Using ggplot for plotting the graph between "Time" and "data1" results as:
Can someone please explain why there is an unusual kink at the right end of the graph, even though there are no such values in the column "data1"?
The plot of "data1" data against its index is as shown:
The code for this plot is:
plot(data1, type = "l")
Any help would be highly appreciated. Thanks!
By using pad() we can fill in the missing dates and assign them NA values, so that
nothing is plotted in the region of missing data.
library(padr)
library(zoo)
YearMonthDay<-c(19810501,19810502,19810504,19810505)
Data<-c(1,2,3,4)
StLucia<-data.frame(YearMonthDay,Data)
StLucia$YearMonthDay <- as.Date(as.character(StLucia$YearMonthDay), format=
"%Y%m%d")
> StLucia
YearMonthDay Data
1 1981-05-01 1
2 1981-05-02 2
3 1981-05-04 3
4 1981-05-05 4
Note: 1981-05-03 is missing, but there is still no gap between positions 2 and 3, so a plot against the row index would not show a gap.
So let's add the missing date:
StLucia<-pad(StLucia,interval="day")
> StLucia
YearMonthDay Data
1 1981-05-01 1
2 1981-05-02 2
3 1981-05-03 NA
4 1981-05-04 3
5 1981-05-05 4
plot(StLucia, type = "l")
If you want to fill in those NA values, use na.locf() from the zoo package.
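As a minimal sketch of that last step, the vector below mirrors the Data column after padding. na.locf() carries the last observation forward into each NA. The call is guarded so the snippet does nothing if zoo is not installed.

```r
# Data column as it looks after pad(): the added day has NA
padded_data <- c(1, 2, NA, 3, 4)

if (requireNamespace("zoo", quietly = TRUE)) {
  # Replace each NA with the most recent non-NA value before it
  filled <- zoo::na.locf(padded_data)
}
```

Whether carrying the previous day's value forward is sensible depends on the data. For rainfall it invents a measurement, so leaving the NA in place may be the more honest choice.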
Here is a reproducible example - change the names to match your data.
# create sample data
set.seed(47)
dd = data.frame(t = Sys.Date() + c(0:5, 30:32), y = runif(9))
# demonstrate problem
ggplot(dd, aes(t, y)) +
geom_point() +
geom_line()
The easiest solution, as Tung points out, is to use a more appropriate geom, like geom_col:
ggplot(dd, aes(t, y)) +
geom_col()
If you really want to use lines, you should fill in the missing dates with NA for rainfall.
# calculate all days
all_days = data.frame(t = seq.Date(from = min(dd$t), to = max(dd$t), by = "day"))
# join to original data
library(dplyr)
dd_complete = left_join(all_days, dd, by = "t")
# ggplot won't connect lines across missing values
ggplot(dd_complete, aes(t, y)) +
geom_point() +
geom_line()
Alternately, you could replace the missing values with 0s to have the line just go along the axis, but I think it's nicer to not plot the line, which implies no data/missing data, rather than plot 0s which implies no rainfall.
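To check whether a date column has gaps in the first place, one option is to look at the differences between consecutive dates. This is a sketch in base R. The vector days is made-up data with the same shape as dd$t above, but with a fixed start date so the result is reproducible.

```r
# Hypothetical daily dates with a jump from day 5 to day 30
days <- as.Date("2020-01-01") + c(0:5, 30:32)

# Number of days between consecutive observations
step <- as.numeric(diff(days))

# Dates that are followed by a gap of more than one day
gap_after <- days[which(step > 1)]

# Total number of calendar days with no row at all
n_missing <- sum(step - 1)
```

If gap_after is empty, the series is complete and a line plot against the date will look the same as a plot against the row index.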

ggplot why are bars not stacked?

I would like to create a stacked bar graph however my output shows overlaid bars instead of stacked. How can I rectify this?
#Create data
date <- as.Date(rep(c("1/1/2016", "2/1/2016", "3/1/2016", "4/1/2016", "5/1/2016"),2))
sales <- c(23,52,73,82,12,67,34,23,45,43)*1000
geo <- c(rep("Western Territory",5), rep("Eastern Territory",5))
data <- data.frame(date, sales, geo)
#Plot
library(ggplot2)
ggplot(data=data, aes(x=date, y=sales, fill=geo))+
stat_summary(fun.y=sum, geom="bar") +
ggtitle("TITLE")
Plot output:
As you can see from the summarized table below, it confirms the bars are not stacked:
>#Verify plot is correct
>ddply(data, c("date"), summarize, total=sum(sales))
date total
1 0001-01-20 90000
2 0002-01-20 86000
3 0003-01-20 96000
4 0004-01-20 127000
5 0005-01-20 55000
Thanks!
You have to include position="stack" in your stat_summary call (in ggplot2 3.3.0 and later the argument is named fun rather than fun.y):
stat_summary(position="stack",fun.y=sum, geom="bar")
Alternatively, since your data are already summarized, you could use geom_col (the short hand for geom_bar(stat = "identity")):
ggplot(data=data, aes(x=date, y=sales, fill=geo))+
geom_col() +
scale_x_date(date_labels = "%b-%d")
Produces:
Note that I changed the date formatting (by adding format = "%m/%d/%Y" to the as.Date call) and explicitly set the axis label formatting.
If your actual data have more than one entry per period, you can always summarise first, then pass that into ggplot instead of the raw data.
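A sketch of the two points above: parsing the dates with an explicit format, and summarising before plotting. The data is copied from the question. The name sales_data and the use of base aggregate() for the summary are choices made for this illustration. The plot is drawn only if ggplot2 is installed.

```r
# Same data as in the question, but with an explicit month/day/year format
date  <- as.Date(rep(c("1/1/2016", "2/1/2016", "3/1/2016", "4/1/2016", "5/1/2016"), 2),
                 format = "%m/%d/%Y")
sales <- c(23, 52, 73, 82, 12, 67, 34, 23, 45, 43) * 1000
geo   <- c(rep("Western Territory", 5), rep("Eastern Territory", 5))
sales_data <- data.frame(date, sales, geo)

# One row per date and territory (a no-op here, useful with repeated entries)
by_geo <- aggregate(sales ~ date + geo, data = sales_data, FUN = sum)

# Totals per date, to check the height of each stacked bar
totals <- aggregate(sales ~ date, data = sales_data, FUN = sum)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(by_geo, aes(x = date, y = sales, fill = geo)) +
    geom_col() +                              # stacks by default
    scale_x_date(date_labels = "%b-%d")
  print(p)
}
```

The values in totals should match the heights of the stacked bars, which gives a quick way to confirm the stacking.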

Synchronous X-Axis For Multiple Years of Sales with ggplot

I have 1417 days of sale data from 2012-01-01 to present (2015-11-20). I can't figure out how to have a single-year (Jan 1 - Dec 31) axis and each year's sales on the same, one year-long window, even when using ggplot's color = as.factor(Year) option.
Total sales are type int
head(df$Total.Sales)
[1] 495 699 911 846 824 949
and I have used the lubridate package to pull Year out of the original Day variable.
df$Day <- as.Date(as.numeric(df$Day), origin="1899-12-30")
df$Year <- year(df$Day)
But because Day contains the year information
sample(df$Day, 1)
[1] "2012-05-05"
ggplot is still graphing three years instead of synchronizing them to the same period of time (one, full year):
g <- ggplot(df, aes(x = Day, y = Total.Sales, color = as.factor(Year))) +
geom_line()
I create some sample data as follows
set.seed(1234)
dates <- seq(as.Date("2012-01-01"), as.Date("2015-11-20"), by = "1 day")
values <- sample(1:6000, size = length(dates))
data <- data.frame(date = dates, value = values)
Providing something of the sort is, by the way, what is meant by a reproducible example.
Then I prepare some additional columns
library(lubridate)
data$year <- year(data$date)
data$day_of_year <- as.Date(paste("2012",
month(data$date),mday(data$date), sep = "-"))
The last line is almost certainly what Roland meant in his comment. And he was right to choose the leap year, because it contains all possible dates. A normal year would miss February 29th.
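The effect of choosing a leap year as the reference can be checked with a few dates in base R. This is only an illustration. The four dates and the variable names are made up.

```r
# Hypothetical dates, including a February 29th
dates <- as.Date(c("2012-02-28", "2012-02-29", "2013-03-01", "2015-11-20"))
month_day <- format(dates, "%m-%d")

# Mapped onto a leap year: every month-day combination is a valid date
in_leap <- as.Date(paste0("2012-", month_day), format = "%Y-%m-%d")

# Mapped onto a normal year: February 29th does not exist and becomes NA
in_normal <- as.Date(paste0("2013-", month_day), format = "%Y-%m-%d")
```

With 2013 as the reference year, the row for February 29th would be dropped from the plot with a warning about missing values.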
Now the plot is generated by
library(ggplot2)
library(scales)
g <- ggplot(data, aes(x = day_of_year, y = value, color = as.factor(year))) +
geom_line() + scale_x_date(labels = date_format("%m/%d"))
I call scale_x_date to define x-axis labels without the year. This relies on the function date_format from the package scales. The string "%m/%d" defines the date format. If you want to know more about these format strings, use ?strptime.
The figure looks as follows:
You can see immediately what might be the trouble with this representation. It is hard to distinguish anything on this plot. But of course this is also related to the fact that my sample data is wildly varying. Your data might look different. Otherwise, consider using faceting (see ?facet_grid or ?facet_wrap).
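A faceted version could look like the sketch below, which puts each year in its own panel over the same January to December axis. The sample data is regenerated here so the snippet runs on its own. The name sales_by_day and the single-column layout are assumptions for illustration. The plot is drawn only if ggplot2 is installed.

```r
# Sample data of the same shape as above
set.seed(1234)
dates <- seq(as.Date("2012-01-01"), as.Date("2015-11-20"), by = "1 day")
sales_by_day <- data.frame(date = dates,
                           value = sample(1:6000, size = length(dates)))

# Year as a label, and every date mapped onto the leap year 2012
sales_by_day$year <- format(sales_by_day$date, "%Y")
sales_by_day$day_of_year <- as.Date(paste0("2012-", format(sales_by_day$date, "%m-%d")))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  g <- ggplot(sales_by_day, aes(x = day_of_year, y = value)) +
    geom_line() +
    facet_wrap(~ year, ncol = 1) +          # one panel per year, stacked vertically
    scale_x_date(date_labels = "%m/%d")     # hide the reference year on the axis
  print(g)
}
```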

Cannot convert a time variable to plot it on ggplot

I have two problems handling my time variable in Gnu R!
Firstly, I cannot recode the time data (downloadable here) from factor (or character) with as.Posixlt or with as.Date without an error message like this:
character string is not in a standard unambiguous format
I have then tried to convert my time data with:
dates <- strptime(time, "%Y-%m-%j")
which only gives me:
NA
Secondly, the reason I wanted (had) to convert my time data is that I want to plot it with ggplot2 and adjust my scale_x_continuous (as described here) so that it only labels every 50 years (i.e. 1250-01-01, 1300-01-01, etc.) on the x-axis; otherwise the x-axis is too busy (see graph below).
This is the code I use:
library(ggplot2)
library(scales)
library(reshape)
df <- read.csv(file="https://dl.dropboxusercontent.com/u/109495328/time.csv")
attach(df)
dates <- as.character(time)
population <- factor(Number_Humans)
ggplot(df, aes(x = dates, y = population)) + geom_line(aes(group=1), colour="#000099") + theme(axis.text.x=element_text(angle=90)) + xlab("Time in Years (A.D.)")
You need to remove the quotation marks in the date column, then you can convert it to date format:
df <- read.csv(file="https://dl.dropboxusercontent.com/u/109495328/time.csv")
df$time <- gsub('\"', "", as.character(df$time), fixed=TRUE)
df$time <- as.Date(df$time, "%Y-%m-%j")
ggplot(df, aes(x = time, y = Number_Humans)) +
geom_line(colour="#000099") +
theme(axis.text.x=element_text(angle=90)) +
xlab("Time in Years (A.D.)")
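For the second part of the question (one label every 50 years), a possible approach once the column is a Date is scale_x_date() with date_breaks. The sketch below uses a few made-up rows instead of the linked file, which is not shown here. The column names time and Number_Humans come from the question. The values and the "%Y-%m-%d" format are assumptions. The plot is drawn only if ggplot2 is installed.

```r
# Hypothetical rows: dates wrapped in literal quotation marks, as described above
raw_time <- c('"1250-01-01"', '"1300-01-01"', '"1350-01-01"', '"1400-01-01"')

# Strip the quotation marks, then convert to Date (format is an assumption)
clean_time <- gsub('"', "", raw_time, fixed = TRUE)
df <- data.frame(time = as.Date(clean_time, format = "%Y-%m-%d"),
                 Number_Humans = c(100, 140, 90, 120))

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = time, y = Number_Humans)) +
    geom_line(colour = "#000099") +
    scale_x_date(date_breaks = "50 years", date_labels = "%Y") +
    xlab("Time in Years (A.D.)")
  print(p)
}
```

Leaving Number_Humans numeric, rather than turning it into a factor as in the question, also keeps the y-axis continuous.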

Google Trends and Weeks, ggplot2

When I am downloading data from Google Trend, the dataset looks like this:
Week nuclear atomic nuclear.weapons unemployment
2004-01-04 - 2004-01-10 11 11 1 15
2004-01-11 - 2004-01-17 11 13 1 13
2004-01-18 - 2004-01-24 10 11 1 13
How can I change the dates in "Week" from this format "Y-m-d - Y-m-d" to a format like "Year-Week"?
Furthermore, how can I tell ggplot to print only the years on the x-axis instead of all values of x?
@Mattrition: Thank you. I followed your advice:
trends <- melt(trends, id = "Week",
measure = c("nuclear", "atomic", "nuclear.weapons", "unemployment"))
trends$Week<- gsub("^(\\d+-\\d+-\\d+).+", "\\1", trends$Week)
trends$Week <- as.Date(trends$Week)
ggplot(trends, aes(Week, value, colour = variable, group=variable)) +
geom_line() +
ylab("Trends") +
theme(legend.position="top", legend.title=element_blank(),
panel.background = element_rect(fill = "#FFFFFF", colour="#000000"))+
scale_color_manual(values=c("#999999", "#E69F00", "#56B4E9", "#009E73"))+
stat_smooth(method="loess")
Now, every second year is labeled (2004, 2006, ...) in x-axis. How can I tell ggplot to label every year (2004, 2005, ...)?
ggplot will understand Date objects (see ?Date) and work out appropriate labelling if you can convert your dates to this format.
You can use something like gsub to extract the starting day of each week. This uses a regular expression to match the whole string and replace it with whatever was matched inside the parentheses (the capture group):
df$startingDay <- gsub("^(\\d+-\\d+-\\d+).+", "\\1", df$Week)
Then call as.Date() on the extracted day strings to convert to Date objects:
df$date <- as.Date(df$startingDay)
You can then use the date objects to plot whatever you wanted to plot:
g <- ggplot(df, aes(date, as.numeric(atomic))) + geom_line()
print(g)
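The extraction step can be tried on the three sample rows from the question without any plotting. The last line is one possible way to get the "Year-Week" label that was asked for, using the %U week-of-year code from ?strptime. The variable names are made up for this sketch.

```r
# Week column as it appears in the Google Trends export
week <- c("2004-01-04 - 2004-01-10",
          "2004-01-11 - 2004-01-17",
          "2004-01-18 - 2004-01-24")

# Keep only the first date of each range (the capture group)
starting_day <- gsub("^(\\d+-\\d+-\\d+).+", "\\1", week)

# Convert to Date so ggplot2 can build a date axis
week_start <- as.Date(starting_day)

# Optional text label: year plus week of the year (weeks starting on Sunday)
year_week <- format(week_start, "%Y-%U")
```

For plotting, the Date column is usually the better choice, since a text label like year_week is treated as a discrete value and loses the even spacing of the axis.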
EDIT:
To answer your additional question, add the following to your ggplot object:
library(scales)
g <- g + scale_x_date(breaks=date_breaks(width="1 year"),
labels=date_format("%Y"))

Resources