How to create a boxplot based on 5-year intervals in R

I have a continuous variable y measured on different dates. I need to make boxplots with a box showing the distribution of y for each 5 year interval.
Sample data:
rdob <- as.Date(dob, format= "%m/%d/%y")
ggplot(data = data, aes(x=rdob, y=ageyear)) + geom_boxplot()
#Warning message:
#Continuous x aesthetic -- did you forget aes(group=...)?
The plot from this code was my first attempt. What I want is a box for every five-year interval, instead of a box for every year.

Here is a way to pull out the year in base R:
format(as.Date("2008-11-03", format="%Y-%m-%d"), "%Y")
Simply wrap your date vector in a format() and add the "%Y". To get this to be integer, you can use as.integer.
You could also take a look at the year function in the lubridate package which will make this extraction a little bit more straightforward.
One method to get 5 year intervals is to use cut to create a factor variable that creates levels at selected break points. Unless you have dozens of years your best bet would be to set the break points manually:
df$myTimeInterval <- cut(df$years, breaks=c(1995, 2000, 2005, 2010, 2015))
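One detail worth checking with cut is which edge of each interval is closed. A minimal sketch with made-up years (the values below are only for illustration): by default the intervals are closed on the right, and right = FALSE closes them on the left instead.

```r
# Hypothetical years, chosen only to show how cut() assigns bins
years <- c(1996, 2000, 2001, 2005, 2010, 2014)
breaks <- c(1995, 2000, 2005, 2010, 2015)

# Default (right = TRUE): 2000 falls in the bin that ends at 2000
bins_default <- cut(years, breaks = breaks)

# right = FALSE: 2000 falls in the bin that starts at 2000
bins_left <- cut(years, breaks = breaks, right = FALSE)

print(data.frame(years, bins_default, bins_left))
```

Note that a value equal to the lowest break (default) or the highest break (right = FALSE) becomes NA unless include.lowest = TRUE is set.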

Here's an example taking Dave2e's suggestion of using cut on date intervals along with ggplot's group aesthetic mapping:
library(ggplot2)
n <- 1000
## Randomly sample birth dates and dummy up an effect that trends upward with DOB
dobs <- sample(seq(as.Date('1970/01/01'), Sys.Date(), by="day"), n)
effect <- rnorm(n) + as.numeric(as.POSIXct(dobs)) / as.numeric(as.POSIXct(Sys.Date()))
data <- data.frame(dob=dobs, effect=effect)
## boxplot w/ DOB binned to 5 year intervals
ggplot(data=data, aes(x=dob, y=effect)) + geom_boxplot(aes(group=cut(dob, "5 year")))
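As far as I can tell, cut(dob, "5 year") starts its first bin at the earliest year present in the data. If you would rather have bins aligned to calendar multiples of five (1970-1974, 1975-1979, ...), here is a sketch using integer arithmetic on the year. The dates and the names bin_start and bin_label are made up for this example.

```r
# Hypothetical birth dates, used only to show the arithmetic
dob <- as.Date(c("1970-01-01", "1974-12-31", "1975-01-01", "1983-06-15"))

yr <- as.integer(format(dob, "%Y"))   # calendar year as an integer
bin_start <- 5L * (yr %/% 5L)         # round down to a multiple of 5

# Human-readable label such as "1970-1974", stored as a factor
bin_label <- factor(paste(bin_start, bin_start + 4L, sep = "-"))

print(data.frame(dob, bin_label))
```

The resulting factor can be mapped to the x aesthetic directly, e.g. aes(x = bin_label, y = effect), which gives one discrete box per interval.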

library(lubridate)
year <- year(rdob)

Related

Create a boxplot per year out of a ts object in R

I have a ts object: 240 monthly observations starting from January 2000:
data <- runif(240)
data_ts <- ts(data,
start = c(2000, 1),
frequency = 12)
And I want to create a boxplot per year out of my data_ts.
I know how to create a boxplot per month:
boxplot(data_ts ~ cycle(data_ts))
But I don't know how to create a boxplot per year, that is, a boxplot of the observations of each year (a boxplot of year 2000, a boxplot of 2001, and so on).
Any idea?
Thanks!
The year is given as shown:
year <- as.integer(time(data_ts))
boxplot(data_ts ~ year)
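Since time() returns fractional years stored as floating point numbers, it may be worth checking that every year really gets 12 observations. A sketch of an alternative that derives the year with integer arithmetic from start() and frequency() instead (the names first, offset and year_int are mine):

```r
data_ts <- ts(runif(240), start = c(2000, 1), frequency = 12)

first <- start(data_ts)   # c(2000, 1): start year and start period

# Number of periods elapsed since period 1 of the start year
offset <- (first[2] - 1) + (seq_along(data_ts) - 1)

# Whole years elapsed, added to the start year
year_int <- first[1] + offset %/% frequency(data_ts)

obs_per_year <- table(year_int)
print(obs_per_year)
```

boxplot(as.numeric(data_ts) ~ year_int) then draws one box per year, the same way as above.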
I use the window() function to subset each year, and a for() loop to iterate over the years and create a boxplot() for each. The title() function adds the title to the plot, and png() and dev.off() work together to save the image to disk:
getwd() # print the directory the files will be saved to
for (i in 2010:2012) { # small loop for testing
  png(file = paste("boxplot_", i, ".png", sep = "")) # open a png device
  boxplot(window(x = data_ts, start = c(i, 1), end = c(i, 12))) # boxplot of that year's data
  title(i) # add the year as a title to the plot
  dev.off() # close the device, which writes the png
}
Maybe this also helps:
data <- runif(240)
data_ts <- ts(data,
start = c(2000, 1),
frequency = 12)
frame<-data.frame(values=as.matrix(data_ts), date=lubridate::year(zoo::as.Date(data_ts)))
library(ggplot2)
ggplot(frame,aes(y=values,x=date,group=date))+
geom_boxplot()
It is not the most elegant solution though as it uses both the zoo and lubridate packages to convert the date into a year that ggplot understands.

Synchronous X-Axis For Multiple Years of Sales with ggplot

I have 1417 days of sale data from 2012-01-01 to present (2015-11-20). I can't figure out how to have a single-year (Jan 1 - Dec 31) axis and each year's sales on the same, one year-long window, even when using ggplot's color = as.factor(Year) option.
Total sales are type int
head(df$Total.Sales)
[1] 495 699 911 846 824 949
and I have used the lubridate package to pull Year out of the original Day variable.
df$Day <- as.Date(as.numeric(df$Day), origin="1899-12-30")
df$Year <- year(df$Day)
But because Day contains the year information
sample(df$Day, 1)
[1] "2012-05-05"
ggplot is still graphing three years instead of synchronizing them to the same period of time (one, full year):
g <- ggplot(df, aes(x = Day, y = Total.Sales, color = as.factor(Year))) +
geom_line()
I create some sample data as follows
set.seed(1234)
dates <- seq(as.Date("2012-01-01"), as.Date("2015-11-20"), by = "1 day")
values <- sample(1:6000, size = length(dates))
data <- data.frame(date = dates, value = values)
Providing something of the sort is, by the way, what is meant by a reproducible example.
Then I prepare some additional columns
library(lubridate)
data$year <- year(data$date)
data$day_of_year <- as.Date(paste("2012",
month(data$date),mday(data$date), sep = "-"))
The last line is almost certainly what Roland meant in his comment. And he was right to choose the leap year, because it contains all possible dates. A normal year would miss February 29th.
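To see why the leap year matters, here is a small base R sketch (the two dates are made up) that maps the same month and day onto 2012 and onto 2013. With an explicit format, the impossible date comes back as NA:

```r
# Hypothetical dates: one of them is a leap day
d <- as.Date(c("2012-02-29", "2014-07-04"))
md <- format(d, "%m-%d")   # month and day only

# Mapped onto a leap year: every month/day combination is valid
in_leap <- as.Date(paste("2012", md, sep = "-"), format = "%Y-%m-%d")

# Mapped onto a normal year: February 29th cannot be parsed
in_normal <- as.Date(paste("2013", md, sep = "-"), format = "%Y-%m-%d")

print(data.frame(d, in_leap, in_normal))
```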
Now the plot is generated by
library(ggplot2)
library(scales)
g <- ggplot(data, aes(x = day_of_year, y = value, color = as.factor(year))) +
geom_line() + scale_x_date(labels = date_format("%m/%d"))
I call scale_x_date to define x-axis labels without the year. This relies on the function date_format from the package scales. The string "%m/%d" defines the date format. If you want to know more about these format strings, use ?strptime.
The figure looks as follows:
You can see immediately what might be the trouble with this representation. It is hard to distinguish anything on this plot. But of course this is also related to the fact that my sample data is wildly varying. Your data might look different. Otherwise, consider using faceting (see ?facet_grid or ?facet_wrap).

R growth rate calculation week over week on daily timeseries data

I'm trying to calculate w/w growth rates entirely in R. I could use excel, or preprocess with ruby, but that's not the point.
data.frame example
date gpv type
1 2013-04-01 12900 back office
2 2013-04-02 16232 back office
3 2013-04-03 10035 back office
I want to do this factored by 'type' and I need to wrap up the Date type column into weeks. And then calculate the week over week growth.
I think I need to do ddply to group by week - with a custom function that determines if a date is in a given week or not?
Then, after that, use diff and find the growth b/w weeks divided by the previous week.
Then I'll plot week/week growths, or use a data.frame to export it.
This was closed but had some useful ideas.
UPDATE: answer with ggplot:
All the same as below, just use this instead of plot
ggplot(data.frame(week=seq(length(gr)), gr), aes(x=week,y=gr*100)) + geom_point() + geom_smooth(method='loess') + coord_cartesian(xlim = c(.95, 10.05)) + scale_x_discrete() + ggtitle('week over week growth rate, from Apr 1') + ylab('growth rate %')
(old, correct answer but using only plot)
Well, I think this is it:
library(plyr)
df_net <- ddply(df_all, .(date), summarise, gpv=sum(gpv)) # df_all has my daily data.
df_net$week_num <- strftime(df_net$date, "%U") # get the week number to 'group by' in ddply
df_weekly <- ddply(df_net, .(week_num), summarise, gpv=sum(gpv))
gr <- diff(df_weekly$gpv)/df_weekly$gpv[-length(df_weekly$gpv)] #seems correct, but this I don't understand via: http://stackoverflow.com/questions/15356121/how-to-identify-the-virality-growth-rate-in-time-series-data-using-r
plot(gr, type='l', xlab='week #', ylab='growth rate percent', main='Week/Week Growth Rate')
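The same idea can be sketched per 'type' in base R. Two assumptions here: the data below is invented (including the second type, "front office"), and weeks are labelled with cut(date, "week") rather than "%U", because "%U" week numbers restart every January and would merge the same week of different years.

```r
# Hypothetical daily data: two types over four full weeks
set.seed(1)
df_all <- data.frame(
  date = rep(seq(as.Date("2013-04-01"), by = "day", length.out = 28), times = 2),
  type = rep(c("back office", "front office"), each = 28),
  gpv  = round(runif(56, 10000, 20000))
)

# Label each row with the first day (Monday) of its week
df_all$week <- cut(df_all$date, breaks = "week")

# Weekly totals per type, sorted so weeks are consecutive within a type
weekly <- aggregate(gpv ~ week + type, data = df_all, FUN = sum)
weekly <- weekly[order(weekly$type, weekly$week), ]

# Week-over-week growth within each type; the first week has no
# predecessor, so its growth is NA
weekly$growth <- ave(weekly$gpv, weekly$type,
                     FUN = function(x) c(NA, diff(x) / head(x, -1)))
print(weekly)
```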
Any better solutions out there?
For the last part, if you want to calculate the growth rate you can take logs and then use diff, with the default parameters lag = 1 (previous week) and differences = 1 (first difference):
df_weekly_log <- log(df_weekly$gpv)
gr <- diff(df_weekly_log, lag = 1, differences = 1)
The latter is an approximation, valid for small differences.
Hope it helps.
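To get a feel for how good the log approximation is, a quick comparison on invented weekly totals (the numbers are only for illustration):

```r
# Hypothetical weekly totals: +5% in the first step, +20% in the second
gpv <- c(100, 105, 126)

exact_gr <- diff(gpv) / head(gpv, -1)   # simple growth rate
log_gr   <- diff(log(gpv))              # log difference

print(cbind(exact_gr, log_gr))

# The log difference can be converted back to the simple rate exactly
recovered <- exp(log_gr) - 1
```

The two agree closely for the small step and drift apart for the larger one, so for big week-over-week swings the simple rate (or exp(log_gr) - 1) is the safer number to report.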

how to create a plot with continuous days of year of subsequent years as x axis in R?

I am trying to plot temperature data of two consecutive years (say 05 Nov, 2010 to 30 March, 2011) having days of year as x axis values. For example:
temp<-c(30.1:40.1) # y axis
doy<-c(360:365,1:5) # x axis
please help me out. thanks.
temp<-c(30.1:40.1) # y axis
doy<-c(360:365,1:5) # x axis
doy2 <- c(360:365,c(1:5)+365)
plot(temp ~ doy2, xaxt="n", xlab = "doy")
axis(1, at = doy2, labels = doy)
Alternatively:
The most rigorous way of approaching this would be to use the date-time objects within R. Then R will recognise the 'doy' values as dates, and will therefore give the right result when plotted. Date-time objects are complicated, but if you deal with them regularly, they are worth learning:
temp<-c(30.1:40.1) # y axis
doy<-c(360:365,1:5) # x axis
doy2 <- c(360:365,c(1:5)+365) # we still need this to place numbers 1:5 into a new year (i.e. 365 days later)
doy.date <- as.Date("2011-01-01") # set a base date (choose the year that you will start with)
doy.date <- doy.date + doy2 - 1 # add the days of year to the base date and subtract one (the base date was January 1st)
plot(temp ~ doy.date, xlab = "doy") # plot as usual
# see documentation on dates
?as.Date
#or for date with times:
?POSIXct
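A variation on the same idea, sketched under the assumption that you know which calendar year each day-of-year value belongs to (here 2010 for 360:365 and 2011 for 1:5, matching the question). This avoids adding 365 by hand, and format(..., "%j") converts back to day of year as a check:

```r
doy  <- c(360:365, 1:5)
year <- c(rep(2010, 6), rep(2011, 5))   # assumed year of each observation

# January 1st of each year plus (day of year - 1) gives the calendar date
dates <- as.Date(paste0(year, "-01-01")) + doy - 1

# Round trip: recover the day of year from the dates
doy_back <- as.integer(format(dates, "%j"))

print(data.frame(doy, dates, doy_back))
```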

How to create a time scatterplot with R?

The data are a series of dates and times.
date time
2010-01-01 09:04:43
2010-01-01 10:53:59
2010-01-01 10:57:18
2010-01-01 10:59:30
2010-01-01 11:00:44
…
My goal was to represent a scatterplot with the date on the horizontal axis (x) and the time on the vertical axis (y). I guess I could also add a color intensity if there are more than one time for the same date.
It was quite easy to create a histogram of dates.
mydata <- read.table("mydata.txt", header=TRUE, sep=" ")
mydatahist <- hist(as.Date(mydata$date), breaks = "weeks", freq=TRUE, plot=FALSE)
barplot(mydatahist$counts, border=NA, col="#ccaaaa")
I haven't figured out yet how to create a scatterplot where the axes are date and/or time.
I would also like to be able to have axes that are not necessarily linear dates (YYYY-MM-DD), but are based on month and day such as MM-DD (so different years accumulate), or even rotate on weeks.
Any help, RTFM URI slapping or hints is welcome.
The ggplot2 package handles dates and times quite easily.
Create some date and time data:
dates <- as.POSIXct(as.Date("2011/01/01") + sample(0:365, 100, replace=TRUE))
times <- as.POSIXct(runif(100, 0, 24*60*60), origin="2011/01/01")
df <- data.frame(
dates = dates,
times = times
)
Then get some ggplot2 magic. ggplot will automatically deal with dates, but to get the time axis formatted properly use scale_y_datetime():
library(ggplot2)
library(scales)
ggplot(df, aes(x=dates, y=times)) +
geom_point() +
scale_y_datetime(breaks=date_breaks("4 hour"), labels=date_format("%H:%M")) +
theme(axis.text.x=element_text(angle=90))
Regarding the last part of your question, on grouping by week, etc.: to achieve this you may have to pre-summarize the data into the buckets that you want. You can possibly use plyr for this and then pass the resulting data to ggplot.
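A minimal base R sketch of that pre-summarizing step, assuming weekly buckets and using random dates similar to the ones above (plyr is not needed for a simple count):

```r
# Hypothetical dates spread over 2011
set.seed(42)
dates <- as.Date("2011-01-01") + sample(0:364, 100, replace = TRUE)

# Assign each date to its week; levels are labelled by the week's first day
week <- cut(dates, breaks = "week")

# Count observations per week, keeping empty weeks as zero counts
counts <- as.data.frame(table(week), responseName = "n")
counts$week <- as.Date(as.character(counts$week))

print(head(counts))
```

The counts data frame can then be passed to ggplot, for example with aes(x = week, y = n).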
I'd start by reading about as.POSIXct, strptime, strftime, and difftime. These and related functions should allow you to extract the desired subsets of your data. The formatting is a little tricky, so play with the examples in the help files.
And, once your dates are converted to a POSIX class, as.numeric() will convert them all to numeric values, hence easy to sort, plot, etc.
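For example, here is a sketch that splits date-times like the ones in the question into a date and a numeric time of day. It assumes the timestamps are in UTC, so that the seconds since midnight can be taken as a remainder modulo 86400:

```r
# Two timestamps from the question, parsed as UTC (an assumption)
stamps <- as.POSIXct(c("2010-01-01 09:04:43", "2010-01-01 10:53:59"),
                     tz = "UTC")

day   <- as.Date(stamps)              # calendar date, for the x axis
secs  <- as.numeric(stamps) %% 86400  # seconds since midnight (UTC)
hours <- secs / 3600                  # time of day in hours, for the y axis

print(data.frame(day, secs, hours))
```

plot(day, hours) would then give the date on the horizontal axis and the time of day on the vertical axis.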
Edit: Andre's suggestion to play w/ ggplot to simplify your axis specifications is a good one.
