R - split data to hydrological quarters - r

I wish to split my data sets into quarters of the hydrological year. According to Wikipedia, "Due to meteorological and geographical factors, the definition of the water years varies". In the USA, the hydrological year runs from October 1st of one year to September 30th of the next.
I use the definition of the hydrological year for Poland (it starts on November 1st and ends on October 31st).
A sample data set looks as follows:
sampleData <- structure(list(date = structure(c(15946, 15947, 15875, 15910, 15869, 15888, 15823, 16059, 16068, 16067), class = "Date"),`example value` = c(-0.325806595888448, 0.116001346459147, 1.68884381116696, -0.480527505762716, -0.50307381813168,-1.12032214801472, -0.659699514672226, -0.547101497279717, 0.729148872679021,-0.769760735764215)), .Names = c("date", "example value"), row.names = c(NA, -10L), class = "data.frame")
For some reason, the "cut" function in my code complains that "breaks" and "labels" differ in length (but they don't). If I omit the "labels" option in cut (as below), the function works perfectly.
What is wrong with labels?
ToHydroQuarters <-function(df)
{
result <- df
yearStart <- as.numeric(format(min(df$date),'%Y'))-1
#Hydrological year in Poland starts at November 1st
DateStart <- as.Date(paste(yearStart,"-11-01",sep=""))
breaks <- seq(from=DateStart, to=max(df$date)+90, by="quarter")
breakYear <- format(breaks,'%Y')
#Please, do not create labels in such way.
#Please note that for November and December we have next hydrological year - since it started at 1st November. So, we need to check month to decide which year we have (?) or use cut function again as mentioned here: http://stackoverflow.com/questions/22073881/hydrological-year-time-series
labels <- c(paste("Winter",breakYear[1]),
paste("Spring",breakYear[2]),
paste("Summer",breakYear[3]),
paste("Autumn",breakYear[4]),
paste("Autumn",breakYear[5]))
######Here is problem - once I add labels parameter, function complains about different lengths
result$hydroYear <- cut(df$date, breaks)
result
}

Firstly, I think it is unwise to hardcode the labels inside the function, since it is impossible to check them without some kind of reproducible example. However, I can see what you're trying to achieve.
You claim that your breaks and labels have the correct lengths, but the function itself doesn't always work, even without the labels: when the breaks stop before the latest date, cut does not assign the last portion of the dates to any interval.
For example:
library(lubridate)
x <- ymd(c("09-01-01", "09-01-02", "11-09-03"))
df <- data.frame(date=as.Date(seq(from=min(x), to=max(x), by="day")))
a <- ToHydroQuarters(df)
tail(a)
returns:
date hydroYear
971 2011-08-29 <NA>
972 2011-08-30 <NA>
973 2011-08-31 <NA>
974 2011-09-01 <NA>
975 2011-09-02 <NA>
976 2011-09-03 <NA>
Doing something like breaks <- seq(from=DateStart, to=max(df$date)+90, by="quarter"), does resolve that issue, as it forces a break to actually exist. This might solve your labelling issue that you've had in your function, but it does not make the function "generic".
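As a minimal sketch of why the lengths can disagree: n breaks define n - 1 intervals, so cut needs exactly length(breaks) - 1 labels, and that number changes with the date range. The sample dates, the 92-day padding, and the season names below are illustrative assumptions, not the asker's data. Deriving the labels from the breaks keeps the two in step:

```r
# Illustrative dates (assumed, not from the question)
dates <- as.Date(c("2013-05-01", "2013-08-30", "2013-12-28", "2014-01-15"))

# Quarterly breaks starting on 1 November; 92 days of padding is assumed
# to be enough for a break to fall after the latest date
start <- as.Date("2012-11-01")
breaks <- seq(from = start, to = max(dates) + 92, by = "quarter")

# n breaks give n - 1 intervals, so build exactly length(breaks) - 1 labels
n_intervals <- length(breaks) - 1
season <- rep_len(c("Winter", "Spring", "Summer", "Autumn"), n_intervals)

# An interval that starts in November belongs to the next hydrological year
interval_start <- head(breaks, -1)
start_year <- as.integer(format(interval_start, "%Y"))
start_month <- as.integer(format(interval_start, "%m"))
labels <- paste(season, start_year + (start_month == 11L))

res <- cut(dates, breaks = breaks, labels = labels, right = FALSE)
print(data.frame(date = dates, hydroQuarter = res))
```

With this construction the label vector grows and shrinks with the breaks, so the length check inside cut cannot fail.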
Personally on the coding side I think it would be better to convert the month, and year parts separately, because it would be easier to understand. For example, you could use library(lubridate) to easily extract the month and specify the breaks and the labels as you normally would. I was thinking the function could look something like this:
thq <- function(date) {
mnth <- cut(month(date), breaks=c(1,4,7, 10, 12),
right=FALSE, include.lowest=TRUE,
labels=c("Spring", "Summer", "Autumn", "Winter"))
return(paste(mnth, ifelse(mnth == "Winter", year(date)+1, year(date))))
}
So then using some dummy data ...
library(lubridate)
x <- ymd(c("09-01-01", "09-01-02", "11-09-03"))
df <- data.frame(date=as.Date(seq(from=min(x), to=max(x), by="month")))
thq <- function(date) {
mnth <- cut(month(date), breaks=c(1,4,7, 10, 12),
right=FALSE, include.lowest=TRUE,
labels=c("Spring", "Summer", "Autumn", "Winter"))
return(paste(mnth, ifelse(mnth == "Winter", year(date)+1, year(date))))
}
df$newdate <- thq(df$date)
Which has the following output:
date newdate
1 2009-01-01 Spring 2009
2 2009-02-01 Spring 2009
3 2009-03-01 Spring 2009
4 2009-04-01 Summer 2009
5 2009-05-01 Summer 2009
6 2009-06-01 Summer 2009
7 2009-07-01 Autumn 2009
8 2009-08-01 Autumn 2009
9 2009-09-01 Autumn 2009
10 2009-10-01 Winter 2010
11 2009-11-01 Winter 2010
12 2009-12-01 Winter 2010
13 2010-01-01 Spring 2010
14 2010-02-01 Spring 2010
15 2010-03-01 Spring 2010
16 2010-04-01 Summer 2010
17 2010-05-01 Summer 2010
18 2010-06-01 Summer 2010
19 2010-07-01 Autumn 2010
20 2010-08-01 Autumn 2010
21 2010-09-01 Autumn 2010
22 2010-10-01 Winter 2011
23 2010-11-01 Winter 2011
24 2010-12-01 Winter 2011
25 2011-01-01 Spring 2011
26 2011-02-01 Spring 2011
27 2011-03-01 Spring 2011
28 2011-04-01 Summer 2011
29 2011-05-01 Summer 2011
30 2011-06-01 Summer 2011
31 2011-07-01 Autumn 2011
32 2011-08-01 Autumn 2011
33 2011-09-01 Autumn 2011
You can shift the months using the modulo operator if it is in a weird order...
thq <- function(date) {
mnth <- cut(((month(date)+1) %% 12), breaks=c(0, 3, 6, 9, 12),
right=FALSE, include.lowest=TRUE,
labels=c("Nov_Jan", "Feb_Apr", "May_Jul", "Aug_Oct")
)
# you will need to alter the return statement yourself, because
# I feel there is enough information for you to do it, rather than
# me changing it every time you change the question.
return(paste(mnth, ifelse(mnth == "Winter", year(date)+1, year(date))))
}
library(lubridate)
x <- ymd(c("09-01-01", "09-01-02", "11-09-03"))
df <- data.frame(date=as.Date(seq(from=min(x), to=max(x), by="day")))
df$new <- thq(df$date)
head(df)
output:
> head(df)
date new
1 2009-01-01 Nov_Jan 2009
2 2009-01-02 Nov_Jan 2009
3 2009-01-03 Nov_Jan 2009
4 2009-01-04 Nov_Jan 2009
5 2009-01-05 Nov_Jan 2009
6 2009-01-06 Nov_Jan 2009
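Putting the modulo shift and the year adjustment together, a possible base R sketch for the Polish definition (hydrological year starting on 1 November) is shown here. The function name hydro_quarter, the start_month parameter, and the season names are assumptions made for illustration; it uses format() instead of lubridate so it runs without extra packages.

```r
# Hypothetical helper: label each date with its hydrological quarter and year.
# start_month is the first month of the hydrological year (11 for Poland).
hydro_quarter <- function(date, start_month = 11L) {
  m <- as.integer(format(date, "%m"))
  y <- as.integer(format(date, "%Y"))

  # Months since the start of the hydrological year: 0 for start_month, ..., 11
  shifted <- (m - start_month) %% 12L

  # Three-month blocks: 1 = Nov-Jan, 2 = Feb-Apr, 3 = May-Jul, 4 = Aug-Oct
  quarter <- shifted %/% 3L + 1L
  season <- c("Winter", "Spring", "Summer", "Autumn")[quarter]

  # Months on or after start_month already count toward the next year
  hydro_year <- y + as.integer(start_month > 1L & m >= start_month)

  paste(season, hydro_year)
}

d <- as.Date(c("2013-10-31", "2013-11-01", "2014-01-31", "2014-02-01"))
print(data.frame(date = d, hydroQuarter = hydro_quarter(d)))
```

Because the label depends only on the month and year of each date, no break sequence is needed and no date can fall outside the intervals.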

Related

How to calculate the average year

I have a 20-year monthly XTS time series
Jan 1990 12.3
Feb 1990 45.6
Mar 1990 78.9
..
Jan 1991 34.5
..
Dec 2009 89.0
I would like to get the average (12-month) year, or
Jan xx
Feb yy
...
Dec kk
where xx is the average of every January, yy of every February, and so on.
I have tried apply.yearly and lapply, but these return a single value, which is the 20-year total average.
Would you have any suggestions? I appreciate it.
The lubridate package could be useful for you. I would use the functions year() and month() in conjunction with aggregate():
library(xts)
library(lubridate)
#set up some sample data
dates = seq(as.Date('2000/01/01'), as.Date('2005/01/01'), by="month")
df = data.frame(rand1 = runif(length(dates)), rand2 = runif(length(dates)))
my_xts = xts(df, dates)
#get the mean by year
aggregate(my_xts$rand1, by=year(index(my_xts)), FUN=mean)
This outputs something like:
2000 0.5947939
2001 0.4968154
2002 0.4941752
2003 0.5291211
2004 0.6631564
To find the mean for each month you can do:
#get the mean by month
aggregate(my_xts$rand1, by=month(index(my_xts)), FUN=mean)
which will output something like
1 0.5560279
2 0.6352220
3 0.3308571
4 0.6709439
5 0.6698147
6 0.7483192
7 0.5147294
8 0.3724472
9 0.3266859
10 0.5331233
11 0.5490693
12 0.4642588
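For comparison, the same "average year" can be sketched in base R with tapply, grouping by the month number of each date. The data here are synthetic random values, generated only to make the example runnable:

```r
# Synthetic 20-year monthly series (assumed data, for illustration only)
set.seed(1)
dates <- seq(as.Date("1990-01-01"), as.Date("2009-12-01"), by = "month")
values <- runif(length(dates))

# Group by month number (1-12) and average across all years
month_no <- as.integer(format(dates, "%m"))
monthly_mean <- tapply(values, month_no, mean)
names(monthly_mean) <- month.abb

print(round(monthly_mean, 3))
```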

How to find out how many trading days in each month in R?

I have a dataframe like this. The time span is 10 years. Because it's Chinese market data and China has lunar holidays, each year has different holiday dates in terms of the western calendar.
When it is a holiday, the stock market does not open, so it is a non-trading day. Weekends are non-trading days too.
I want to find out which month of which year has the least number of trading days, and most importantly, what number is that.
There are no repeated days.
date change open high low close volume
1 1995-01-03 -1.233 637.72 647.71 630.53 639.88 234518
2 1995-01-04 2.177 641.90 655.51 638.86 653.81 422220
3 1995-01-05 -1.058 656.20 657.45 645.81 646.89 430123
4 1995-01-06 -0.948 642.75 643.89 636.33 640.76 487482
5 1995-01-09 -2.308 637.52 637.55 625.04 625.97 509851
6 1995-01-10 -2.503 616.16 617.60 607.06 610.30 606925
If there are no repeated days, you can count the days per year and month with:
library(data.table)
library(lubridate)
dt <- as.data.table(dt)
dt_days <- dt[, .(count_day=.N), by=.(year(date), month(date))]
Then you only need to do this to get the min:
dt_days[count_day==min(count_day)]
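The same counting idea can be sketched in base R with table, assuming the date column is of class Date and each trading day appears once. The small data frame below is made up for illustration:

```r
# Made-up trading dates (one row per trading day)
trading <- data.frame(date = as.Date(c(
  "1995-01-03", "1995-01-04", "1995-01-05",
  "1995-02-01", "1995-02-02",
  "1995-03-01"
)))

# Count rows per year-month, then keep the month(s) with the fewest days
year_month <- format(trading$date, "%Y-%m")
counts <- table(year_month)
fewest <- counts[counts == min(counts)]
print(fewest)
```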
The chron and bizdays packages deal with business days but neither actually contains a usable calendar of holidays limiting their usefulness.
We will use chron below assuming you have defined the .Holidays vector of dates that are holidays. (If you run the code below without doing that only weekdays will be regarded as business days as the default .Holidays vector supplied by chron has very few dates in it.) DF has 120 rows (one row for each year/month) and the last line subsets that to just the month in each year having least business days.
library(chron)
library(zoo)
st <- as.yearmon("2001-01")
en <- as.yearmon("2010-12")
ym <- seq(st, en, 1/12) # sequence of year/months of interest
# no of business days in each yearmonth
busdays <- sapply(ym, function(x) {
s <- seq(as.Date(x), as.Date(x, frac = 1), "day")
sum(!is.weekend(s) & !is.holiday(s))
})
# data frame with one row per year/month
yr <- as.integer(ym)
DF <- data.frame(year = yr, month = cycle(ym), yearmon = ym, busdays)
# data frame with one row per year
wx.min <- ave(busdays, yr, FUN = function(x) which.min(x) == seq_along(x))
DF[wx.min == 1, ]
giving:
year month yearmon busdays
2 2001 2 Feb 2001 20
14 2002 2 Feb 2002 20
26 2003 2 Feb 2003 20
38 2004 2 Feb 2004 20
50 2005 2 Feb 2005 20
62 2006 2 Feb 2006 20
74 2007 2 Feb 2007 20
95 2008 11 Nov 2008 20
98 2009 2 Feb 2009 20
110 2010 2 Feb 2010 20
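To see how a holiday list changes the count, here is a base R sketch for a single month that does not depend on chron. The two holiday dates are hypothetical placeholders, not a real market calendar:

```r
# Hypothetical holidays (placeholders, not an actual exchange calendar)
holidays <- as.Date(c("2001-01-24", "2001-01-25"))

# All calendar days of January 2001
days <- seq(as.Date("2001-01-01"), as.Date("2001-01-31"), by = "day")

# %u gives the ISO weekday number: 6 = Saturday, 7 = Sunday
is_weekend <- format(days, "%u") %in% c("6", "7")
is_holiday <- days %in% holidays

n_business <- sum(!is_weekend & !is_holiday)
print(n_business)
```

Without the holiday vector the count is simply the number of weekdays, which is the behaviour described above when .Holidays is left at its default.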

Changing X-axis values in Time Series plot with R

I'm a newer R user and I need help with a time series plot. I created a time series plot, and cannot figure out how to change my x-axis values to correspond to my sample dates. My data is as follows:
Year Month Level
2009 8 350
2009 9 210
2009 10 173
2009 11 166
2009 12 153
2010 1 141
2010 2 129
2010 3 124
2010 4 103
2010 5 69
2010 6 51
2010 7 49
2010 8 51
2010 9 51
Let's say this data is saved as the name "data.csv"
data = read.table("data.csv", sep = ",", header = T)
data.ts = ts(data, frequency = 1)
plot(data.ts[, 3], ylab = "level", main = "main", axes = T)
I've also tried inputting start = c(2009, 8) into the ts function, but I still get wrong values.
When I plot this my x axis does not correlate to August 2009 through Sept. 2010. It will either increase by year or just by decimal. I've looked up many examples online and also through the ? help on R, but cannot find a way to relabel my axis values. Any help would be appreciated.
Using base coding, you can accomplish this in a few steps. As described in this SO answer, you can identify your "Month" and "Year" data as a date if you use as.Date and paste functions together and incorporate a day (i.e., first day of the month; "1"). For the purposes of this answer, I will simply refer to the data you provided as df:
df$date<-with(df,as.Date(paste(Year,Month,'1',sep='-'),format='%Y-%m-%d'))
df
Year Month Level date
1 2009 8 350 2009-08-01
2 2009 9 210 2009-09-01
3 2009 10 173 2009-10-01
4 2009 11 166 2009-11-01
5 2009 12 153 2009-12-01
6 2010 1 141 2010-01-01
7 2010 2 129 2010-02-01
8 2010 3 124 2010-03-01
9 2010 4 103 2010-04-01
10 2010 5 69 2010-05-01
11 2010 6 51 2010-06-01
12 2010 7 49 2010-07-01
13 2010 8 51 2010-08-01
14 2010 9 51 2010-09-01
Then you can use your basic plot, axis, and mtext functions to control how you want to visualize the data and your axes. For instance:
xmin<-min(df$date,na.rm=T);xmax<-max(df$date,na.rm=T) #ESTABLISH X-VALUES (MIN & MAX)
ymin<-min(df$Level,na.rm=T);ymax<-max(df$Level,na.rm=T) #ESTABLISH Y-VALUES (MIN & MAX)
xseq<-seq.Date(xmin,xmax,by='1 month') #CREATE DATE SEQUENCE THAT INCREASES BY MONTH FROM DATE MINIMUM TO MAXIMUM
yseq<-round(seq(0,ymax,by=50),0) # CREATE SEQUENCE FROM 0-350 BY 50
par(mar=c(1,1,0,0),oma=c(6,5,3,2)) #CONTROLS YOUR IMAGE MARGINS
plot(Level~date,data=df,type='b',ylim=c(0,ymax),axes=F,xlab='',ylab='');box() #PLOT LEVEL AS A FUNCTION OF DATE, REMOVE AXES FOR FUTURE CUSTOMIZATION
axis.Date(side=1,at=xseq,format='%Y-%m',labels=T,las=3) #ADD X-AXIS LABELS WITH "YEAR-MONTH" FORMAT
axis(side=2,at=yseq,las=2) #ADD Y-AXIS LABELS
mtext('Date (Year-Month)',side=1,line=5) #X-AXIS LABEL
mtext('Level',side=2,line=4) #Y-AXIS LABEL
library(data.table)
library(ggplot2)
library(scales)
data<-data.table(datetime=seq(as.POSIXct("2009/08/01",format="%Y/%m/%d"),
as.POSIXct("2010/09/01",format="%Y/%m/%d"),by="1 month"),
Level=c(350,210,173,166,153,141,129,124,103,69,51,49,51,51))
ggplot(data)+
geom_point(aes(x=datetime,y=Level),col="brown1",size=1)+
scale_x_datetime(labels = date_format("%Y/%m"),breaks = "1 month")+
theme(axis.text.x = element_text(angle = 90, hjust = 1,vjust=0.3))
Example using xts package:
library(xts)
ts1 <- xts(data$Level, as.POSIXct(sprintf("%d-%d-01", data$Year, data$Month)))
# or ts1 <- xts(data$Level, as.yearmon(data$Year + (data$Month-1)/12))
plot(ts1)
If you are using ggplot2:
library(ggplot2)
autoplot(ts1)
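If you would rather stay with a plain ts object, a likely cause of the odd axis in the question is frequency = 1, which tells ts the data are yearly. A minimal sketch, assuming the Level values are monthly and start in August 2009:

```r
# Level values from the question, in chronological order
level <- c(350, 210, 173, 166, 153, 141, 129, 124, 103, 69, 51, 49, 51, 51)

# frequency = 12 marks the series as monthly; start = c(year, month)
level_ts <- ts(level, start = c(2009, 8), frequency = 12)

print(start(level_ts))
print(end(level_ts))
print(level_ts)
```

Calling plot(level_ts) on this object then places the x-axis between 2009 and 2011 in fractional years; for month labels, the Date-based approaches above give more control.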

return final row of dataframe - recurring variable names

I want to return the final row for each subsection of a dataframe. I'm aware of the ddply and aggregate functions, but they are not giving the expected output in this case, as the column by which I split the data has recurring names.
For example, in df:
year <- rep(c(2011, 2012, 2013), each=12)
season <- rep(c("Spring", "Summer", "Autumn", "Winter"), each=3)
allseason <- rep(season, 3)
temp <- rnorm(36, mean = 61, sd = 10)
df <- data.frame(year, allseason, temp)
I want to return the final temp reading at the end of every season. When I run either
final1 <- aggregate(df, list(df$allseason), tail, 1)
or
final2 <- ddply(df, .(allseason), tail, 1)
I get only the final 4 seasons (i.e. those of 2013). The function seems to stop there and does not go back to previous years/seasons. My intended output is a data frame with 12 rows * 3 columns.
All help appreciated!
*I notice that in the df created here, the allseason column is designated as a factor with 4 levels, whereas this is not the case in my original dataframe.
In your ddply code, you only forgot to also group by year:
With plyr:
library(plyr)
ddply(df, .(year, allseason), tail, 1)
Or with dplyr
library(dplyr)
df %>%
group_by(year, allseason) %>%
do(tail(.,1))
Or if you want a base R alternative you can use ave:
df[with(df, ave(year, list(year, allseason), FUN = seq_along)) == 3,]
Result:
# year allseason temp
#1 2011 Autumn 63.40626
#2 2011 Spring 59.69441
#3 2011 Summer 42.33252
#4 2011 Winter 79.10926
#5 2012 Autumn 63.14974
#6 2012 Spring 60.32811
#7 2012 Summer 67.57364
#8 2012 Winter 61.39100
#9 2013 Autumn 50.30501
#10 2013 Spring 61.43044
#11 2013 Summer 55.16605
#12 2013 Winter 69.37070
Note that the output will contain the same rows in each case, only the ordering may differ.
And just to add to @beginneR's answer, your aggregate solution should look like:
aggregate(temp ~ allseason + year, data = df, tail, 1)
# or:
with(df, aggregate(temp, list(allseason, year), tail, 1))
Result:
allseason year temp
1 Autumn 2011 64.51539
2 Spring 2011 45.14341
3 Summer 2011 62.29240
4 Winter 2011 47.97461
5 Autumn 2012 43.16781
6 Spring 2012 80.02419
7 Summer 2012 72.31149
8 Winter 2012 45.58344
9 Autumn 2013 55.92607
10 Spring 2013 52.06778
11 Summer 2013 51.01308
12 Winter 2013 53.22452
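Another base R option, sketched here as an alternative rather than taken from the answers above, is duplicated with fromLast = TRUE: it flags every row that has a later row with the same year and season, so negating it keeps the last row of each group whatever the group size. The seed is an arbitrary choice for reproducibility:

```r
# Rebuild the example data from the question (seed chosen arbitrarily)
set.seed(42)
year <- rep(c(2011, 2012, 2013), each = 12)
allseason <- rep(rep(c("Spring", "Summer", "Autumn", "Winter"), each = 3), 3)
temp <- rnorm(36, mean = 61, sd = 10)
df <- data.frame(year, allseason, temp)

# A row is kept when no later row shares its year/season combination
keys <- df[c("year", "allseason")]
last_rows <- df[!duplicated(keys, fromLast = TRUE), ]
print(last_rows)
```

Unlike the ave version, this does not assume that every season has exactly three rows, and it preserves the original row order.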

How can I avoid having to loop through and search through this data frame?

I have a 1 million row data frame that contains monthly water usage data (HCF) for various accounts from 2003-2010:
> head(LeakyAccts)
ACCOUNT Date HCF
1 10114488 Oct 2010 25
2 10114488 Sep 2007 24
3 10114488 Nov 2006 11
4 10114488 Jun 2008 18
5 10114488 Aug 2003 6
6 10114488 Jan 2008 30
Dates are yearmon's. I want to know how much each account used every month compared to the same month in the previous year. So for each row, I'd like to find the difference between the usage in that month (Date) and the usage in the same month the previous year (Date - 1). In other words, I want this:
for(i in 1:nrow(LeakyAccts)) {
row <- which((LeakyAccts$ACCOUNT == LeakyAccts[i,]$ACCOUNT) & (LeakyAccts$Date == (LeakyAccts[i,]$Date - 1)))
if (length(row) == 1) { # no previous year for 2003
LeakyAccts[i,]$Difference <- LeakyAccts[i,]$HCF - LeakyAccts[row,]$HCF
}
}
Needless to say, this loop takes hours to run and seems very un-R-like. How can I avoid using an ugly for loop and speed up the computation? Is there perhaps a way to do this using an apply function or a data.table?
I've reconfigured your data a little to give a complete example:
library(zoo)
dat <- structure(list(ACCOUNT = c(10114488L, 10114488L, 10114488L, 20114488L, 20114488L, 20114488L), Date = structure(c(2010.75, 2009.75, 2008.75, 2008, 2007, 2006), class = "yearmon"), HCF = c(25L, 24L, 11L, 18L, 6L, 30L)), .Names = c("ACCOUNT", "Date", "HCF"), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")
Which looks like:
ACCOUNT Date HCF
1 10114488 Oct 2010 25
2 10114488 Oct 2009 24
3 10114488 Oct 2008 11
4 20114488 Jan 2008 18
5 20114488 Jan 2007 6
6 20114488 Jan 2006 30
Since yearmon is essentially just a numeric value where a difference of 1 is a year's difference, you can get the matching differences from a year ago like:
dat$HCF - dat$HCF[match(dat$Date-1,dat$Date)]
#[1] 1 13 NA 12 -24 NA
...which you can also apply within each group like:
do.call(c,by(dat,dat$ACCOUNT,function(x) x$HCF - x$HCF[match(x$Date-1,x$Date)]))
#101144881 101144882 101144883 201144881 201144882 201144883
# 1 13 NA 12 -24 NA
Or using data.table like:
library(data.table)
dat <- as.data.table(dat)
dat[, Difference := HCF - HCF[match(Date-1,Date)], by=ACCOUNT]
dat
# ACCOUNT Date HCF Difference
#1: 10114488 Oct 2010 25 1
#2: 10114488 Oct 2009 24 13
#3: 10114488 Oct 2008 11 NA
#4: 20114488 Jan 2008 18 12
#5: 20114488 Jan 2007 6 -24
#6: 20114488 Jan 2006 30 NA
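The match approach relies on exact equality of the numeric yearmon values. If that is a concern for some months, one possible variation is to match on a text key built from account, integer year, and integer month. The column names and data in this sketch are assumptions made for illustration:

```r
# Illustrative data with integer year and month columns (assumed layout)
usage <- data.frame(
  account = c(1L, 1L, 1L, 2L, 2L),
  year    = c(2010L, 2009L, 2008L, 2008L, 2007L),
  month   = c(10L, 10L, 10L, 1L, 1L),
  hcf     = c(25, 24, 11, 18, 6)
)

# Key for each row, and the key of the same month one year earlier
key_now  <- paste(usage$account, usage$year, usage$month)
key_prev <- paste(usage$account, usage$year - 1L, usage$month)

# match() returns NA when no previous-year row exists, which propagates to NA
usage$difference <- usage$hcf - usage$hcf[match(key_prev, key_now)]
print(usage)
```

Because the account is part of the key, a single vectorised match covers all accounts at once and no per-group step is needed.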
