A daily time series - calculating totals in R

I have a time series with 30 years of daily data (two columns labelled Date and Value):
Date Value
01-01-1975 0.051
02-01-1975 0.051
03-01-1975 0.051
04-01-1975 0.051
05-01-1975 0.051
06-01-1975 0.051
07-01-1975 0.051
08-01-1975 0.051
09-01-1975 0.051
10-01-1975 0.048
11-01-1975 0.048
12-01-1975 0.048
.........
I am trying to aggregate the data into 5-day totals, so that each year yields 73 values. In a leap year the last value would be a 6-day total rather than a 5-day total. In other words, I always want to start on 1 January and end on 31 December for each year, but I need to deal with the leap-year case somehow, e.g. by treating each year separately or by finding leap years and treating them differently. But I am having issues.
I did the following:
test <- read.csv("~/H/x.csv")
test$Date <- as.Date(test$Date, format = "%d-%m-%Y")
output <- aggregate(Flow ~ cut(Date, "5 days"), test, sum)
But it didn't quite give me the results I wanted: for each year I want 73 values computed.
This is my first go at programming and R, so your guidance would be most welcome.

Cut by 5 days, but use ave to do it by year so that the 5-day periods do not cross year boundaries. This gives Date5. Now aggregate over the cut values:
# test data
DF <- data.frame(Date = seq(as.Date("1975-01-01"), length = 2000, by = "day"),
Value = 1:2000)
to.yr <- function(x) as.numeric(format(x, "%Y"))
Date5 <- ave(DF$Date, to.yr(DF$Date), FUN = function(x) cut(x, "5 day"))
ag <- aggregate(Value ~ Date5, DF, sum)
To count the number of 5-day periods (full or partial) in each year:
> table(to.yr(ag$Date5))
1975 1976 1977 1978 1979 1980
73 74 73 73 73 35
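In the table above the leap year 1976 has 74 periods, because day 366 forms a one-day period of its own. The following is a minimal base-R sketch of one way to fold that day into period 73, so that every full year yields exactly 73 totals. The column names year, yday and period are illustrative, and the test data simply covers six full years.

```r
# Test data: six full years of daily values
DF <- data.frame(Date = seq(as.Date("1975-01-01"), as.Date("1980-12-31"), by = "day"))
DF$Value <- seq_len(nrow(DF))

DF$year <- as.integer(format(DF$Date, "%Y"))
DF$yday <- as.integer(format(DF$Date, "%j"))   # day of the year, 1..366

# Periods 1..73; pmin() moves day 366 of a leap year into period 73
DF$period <- pmin((DF$yday - 1L) %/% 5L + 1L, 73L)

# 5-day totals per year, plus the number of days behind each total
totals <- aggregate(Value ~ period + year, data = DF, FUN = sum)
n_days <- aggregate(Value ~ period + year, data = DF, FUN = length)

table(totals$year)   # number of periods in each year
```

Period 73 then holds 6 days in leap years and 5 days otherwise, which n_days makes easy to confirm.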

Some sample data to play with:
test = data.frame(Date=seq(as.Date("1975-01-01"),as.Date("2005-01-01"),1))
test$value = runif(nrow(test))
head(test)
Date value
1 1975-01-01 0.2929824
2 1975-01-02 0.2222665
3 1975-01-03 0.2659065
4 1975-01-04 0.5511573
Now use the lubridate package's yday function to get the day of the year, from 1 to 366:
> require(lubridate)
> test$yday = yday(test$Date)
Now integer-divide the day of the year minus 1 by five to give our grouping (from 0 to 73 in this case):
> test$grp = (test$yday-1) %/% 5
head(test,10)
Date value yday grp
1 1975-01-01 0.29298243 1 0
2 1975-01-02 0.22226646 2 0
3 1975-01-03 0.26590648 3 0
4 1975-01-04 0.55115730 4 0
5 1975-01-05 0.55990854 5 0
6 1975-01-06 0.70054357 6 1
7 1975-01-07 0.27184097 7 1
8 1975-01-08 0.47779337 8 1
9 1975-01-09 0.09127241 9 1
10 1975-01-10 0.65023465 10 1
So the odd day of each leap year ends up in group 73. Which ones?
test[test$grp==73,]
Date value yday grp
731 1976-12-31 0.6636329 366 73
2192 1980-12-31 0.4586537 366 73
3653 1984-12-31 0.3473794 366 73
5114 1988-12-31 0.9160449 366 73
6575 1992-12-31 0.3215585 366 73
8036 1996-12-31 0.1965876 366 73
9497 2000-12-31 0.6795412 366 73
10958 2004-12-31 0.3622685 366 73
We want to put these in group 72:
test$grp[test$grp==73]=72
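Now we can do an analysis based on that group variable, and we should only get 73 values (remember we started at zero). Grouping by grp alone pools the same period across all years; to get 73 values for each year, group by the year as well. I'll use dplyr because it's cool: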
require(dplyr)
test %>% group_by(grp) %>% summarise(mean=mean(value))
Source: local data frame [73 x 2]
grp mean
1 0 0.5052336
2 1 0.5178286
3 2 0.4844037
4 3 0.5368534
5 4 0.4900208
6 5 0.5078784
7 6 0.4754043
....
73 x 2 looks right!

Related

create an unique week variable NOT depending on the calendar in R

I have a daily revenue time series df from 01-01-2014 to 15-06-2017, and I want to aggregate the daily revenue to weekly revenue and make weekly predictions. Before I aggregate the revenue, I need to create a continuous week variable, which will NOT restart from week 1 when a new year starts. Since 01-01-2014 was not a Monday, I decided to start my first week from 06-01-2014.
My df now looks like this:
date year month total
7 2014-01-06 2014 1 1857679.4
8 2014-01-07 2014 1 1735488.0
9 2014-01-08 2014 1 1477269.9
10 2014-01-09 2014 1 1329882.9
11 2014-01-10 2014 1 1195215.7
...
709 2017-06-14 2017 6 1677476.9
710 2017-06-15 2017 6 1533083.4
I want to create a unique week variable starting from 2014-01-06 until the last row of my dataset (1257 rows in total), which is 2017-06-15.
I wrote a loop:
week = c()
for (i in 1:179) {
week = rep(i,7)
print(week)
}
However, the result of each iteration is not saved. When I type week, it just shows 179,179,179,179,179,179,179.
Where is the problem, and how can I add 180, 180, 180, 180 after the loop?
And if I add more data after 2017-06-15, how can I create the week variable automatically, based on the date in the last row? (In other words, I don't want to have to count my daily observations, divide by 7, and handle the leftover days by hand to build the week index.)
Thank you!
Does this work?
library(lubridate)
#DATA
x = data.frame(date = seq.Date(from = ymd("2014-01-06"),
to = ymd("2017-06-15"), length.out = 15))
#Add year and week for each date
x$week = year(x$date) + week(x$date)/100
#Convert the addition of year and week to factor and then to numeric
x$week_variable = as.numeric(as.factor(x$week))
#Another alternative
x$week_variable2 = floor(as.numeric(x$date - min(x$date))/7) + 1
x
# date week week_variable week_variable2
#1 2014-01-06 2014.01 1 1
#2 2014-04-05 2014.14 2 13
#3 2014-07-04 2014.27 3 26
#4 2014-10-02 2014.40 4 39
#5 2014-12-30 2014.52 5 52
#6 2015-03-30 2015.13 6 65
#7 2015-06-28 2015.26 7 77
#8 2015-09-26 2015.39 8 90
#9 2015-12-24 2015.52 9 103
#10 2016-03-23 2016.12 10 116
#11 2016-06-21 2016.25 11 129
#12 2016-09-18 2016.38 12 141
#13 2016-12-17 2016.51 13 154
#14 2017-03-17 2017.11 14 167
#15 2017-06-15 2017.24 15 180
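The second alternative above (days since the first date, integer-divided by 7) can be checked on the full daily range from the question. This is a minimal base-R sketch that assumes one row per calendar day; dates and week_index are illustrative names.

```r
# One entry per calendar day, as in the question's data
dates <- seq(as.Date("2014-01-06"), as.Date("2017-06-15"), by = "day")

# Week 1 starts at the first date; the index never resets at New Year
week_index <- as.numeric(dates - min(dates)) %/% 7 + 1

length(dates)          # number of daily rows
max(week_index)        # number of weeks, including a partial last week
tail(table(week_index), 2)
```

Because the index is computed from the dates themselves, new rows appended after 2017-06-15 get the right week number without recounting anything.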
Here is the answer:
week = c()
for (i in 1:184) {
for (j in 1:7) {
week[j+(i-1)*7] = i
}
}
week = as.data.frame(week)
I created a week variable running from week 1 to week 184 (the end of my dataset). Each week number is repeated 7 times because there are 7 days in a week. Later I assigned the week variable to my data frame.
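The loop in the question kept only the last week because week = rep(i, 7) overwrites week on every iteration instead of appending to it. A loop-free sketch of the same idea in base R is shown below. It assumes the rows are sorted by date with no missing days, and uses n = 1257, the row count given in the question; in practice n would be nrow() of the data frame.

```r
n <- 1257   # number of daily rows; nrow(df) in practice

# Repeat each week number 7 times, then cut the result to exactly n entries
week <- rep(seq_len(ceiling(n / 7)), each = 7, length.out = n)

head(week, 8)
tail(week, 4)
```

The final partial week is handled by length.out, so the vector always matches the number of rows.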

how to group sales data by 4 days from yesterday to start date in r?

Date Sales
3/11/2017 1
3/12/2017 0
3/13/2017 40
3/14/2017 47
3/15/2017 83
3/16/2017 62
3/17/2017 13
3/18/2017 58
3/19/2017 27
3/20/2017 17
3/21/2017 71
3/22/2017 76
3/23/2017 8
3/24/2017 13
3/25/2017 97
3/26/2017 58
3/27/2017 80
3/28/2017 77
3/29/2017 31
3/30/2017 78
3/31/2017 0
4/1/2017 40
4/2/2017 58
4/3/2017 32
4/4/2017 31
4/5/2017 90
4/6/2017 35
4/7/2017 88
4/8/2017 16
4/9/2017 72
4/10/2017 39
4/11/2017 8
4/12/2017 88
4/13/2017 93
4/14/2017 57
4/15/2017 23
4/16/2017 15
4/17/2017 6
4/18/2017 91
4/19/2017 87
4/20/2017 44
Here the current date is 20/04/2017. My question: how do I group the data from 19/04/2017 back to 11/03/2017 into 4 equal parts and sum the sales in R?
E.g.:
library("xts")
ep <- endpoints(data, on = 'days', k = 4)
period.apply(data,ep,sum)
It's not working. It takes the data from the start date to the current date, but I need to gather the data from yesterday (19/4/2017) back to the start date and split it into 4 equal parts.
Could anyone kindly guide me?
Thank you
Base R has a function cut.Date() which is built for the purpose.
However, the question is not fully clear on what the OP intends. My understanding of the requirements supplied in the Q and an additional comment is:
1. Take the sales data per day in Book1 but leave out the current day, i.e., use only completed days.
2. Group the data in four equal parts, i.e., four periods containing an equal number of days. (Note that the title of the Q and the attempt to use xts::endpoints() with k = 4 indicate that the OP might have a different intention: to group the data in periods of four days' length each.)
3. Summarize the sales figures by period.
For the sake of brevity, data.table is used here for data manipulation and aggregation, and lubridate for date manipulation:
library(data.table)
library(lubridate)
# coerce to data.table, convert Date column from character to class Date,
# exclude the actual date
temp <- setDT(Book1)[, Date := mdy(Book1$Date)][Date != today()]
# cut the date range in four parts
temp[, start_date_of_period := cut.Date(Date, 4)]
temp
# Date Sales start_date_of_period
# 1: 2017-03-11 1 2017-03-11
# 2: 2017-03-12 0 2017-03-11
# 3: 2017-03-13 40 2017-03-11
# ...
#38: 2017-04-17 6 2017-04-10
#39: 2017-04-18 91 2017-04-10
#40: 2017-04-19 87 2017-04-10
# Date Sales start_date_of_period
# aggregate sales by period
temp[, .(n_days = .N, total_sales = sum(Sales)), by = start_date_of_period]
# start_date_of_period n_days total_sales
#1: 2017-03-11 10 348
#2: 2017-03-21 10 589
#3: 2017-03-31 10 462
#4: 2017-04-10 10 507
Thanks to chaining, this can be put together in one statement without using a temporary variable:
setDT(Book1)[, Date := mdy(Book1$Date)][Date != today()][
, start_date_of_period := cut.Date(Date, 4)][
, .(n_days = .N, total_sales = sum(Sales)), by = start_date_of_period]
Note: If you want to reproduce the result in the future, you will have to replace the call to today(), which excludes the current day, by mdy("4/20/2017"), which is the last day in the sample data set supplied by the OP.
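If the OP instead meant periods of four days each, counted backwards from yesterday, the grouping can be sketched in base R as follows. This is only one reading of the question. The data frame name sales and the column block are illustrative, the current day is fixed to 2017-04-20 for reproducibility, and the sales figures are copied from the question.

```r
sales <- data.frame(
  Date  = seq(as.Date("2017-03-11"), as.Date("2017-04-20"), by = "day"),
  Sales = c(1, 0, 40, 47, 83, 62, 13, 58, 27, 17,
            71, 76, 8, 13, 97, 58, 80, 77, 31, 78,
            0, 40, 58, 32, 31, 90, 35, 88, 16, 72,
            39, 8, 88, 93, 57, 23, 15, 6, 91, 87, 44)
)

current_day <- as.Date("2017-04-20")          # stands in for Sys.Date()
done <- sales[sales$Date < current_day, ]     # completed days only

# Block 0 = the four most recent completed days, block 1 = the four before, ...
done$block <- as.numeric(max(done$Date) - done$Date) %/% 4

block_totals <- aggregate(Sales ~ block, data = done, FUN = sum)
block_totals
```

With 40 completed days this gives ten blocks of four days; if the number of days were not a multiple of four, the oldest block would be the short one.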

Subset data based on pentad dates with leap year

I'm trying to subset the following data by pentad dates. A pentad is a non-overlapping 5-day average. For leap years, pentad 12 includes February 29 (a 6-day average instead of a 5-day one):
Link to Data
Link to pentad dates
Here's my code:
library(stringr)
dat <- read.csv("tc_filt_1981-2007.csv",header = T,sep = ",")
dat$Date = paste(dat$Year, str_pad(dat$Month,2,'left','0'), str_pad(dat$Day,2,'left','0'), sep='-')
dat$yday = as.POSIXlt(dat$Date)$yday + 1
dat$pentad = ceiling(dat$yday/5)
df<-split(dat, dat$pentad)
Problem:
The dat$yday line only works for 365-day years. In a given year, there should be only 73 pentads, but my code above produces 74 pentads when I check dat$pentad. df contains the data frames for each pentad.
I did the following for checking:
test<-dat[which(dat$pentad == 74),]
Output:
SN CY Year Month Day Hour Lat Lon Cat Date yday pentad
200034 34 2000 12 31 0 12.7 128.2 TS 2000-12-31 366 74
200034 34 2000 12 31 6 13.3 128.8 TS 2000-12-31 366 74
200034 34 2000 12 31 12 13.9 129.7 TS 2000-12-31 366 74
200034 34 2000 12 31 18 14.4 130.6 TS 2000-12-31 366 74
Question:
How do I account for the leap year in my code?
Can anyone suggest how can I do this?
Many thanks,
Minor adjustment:
library(lubridate)
dat$pentad = ceiling( (dat$yday - leap_year(dat$Year)*(dat$yday > 59)) / 5 )
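The adjustment subtracts one from the day of the year for every date after February 28 in a leap year, so February 29 shares a day number with February 28 and the rest of the year lines up with a 365-day calendar. A small base-R check of this on one leap year and one ordinary year is sketched below; is_leap is a hypothetical stand-in for lubridate's leap_year(), written out so the example has no package dependency.

```r
# Hypothetical helper standing in for lubridate::leap_year()
is_leap <- function(y) (y %% 4 == 0 & y %% 100 != 0) | (y %% 400 == 0)

dates <- seq(as.Date("2000-01-01"), as.Date("2001-12-31"), by = "day")
year  <- as.integer(format(dates, "%Y"))
yday  <- as.POSIXlt(dates)$yday + 1            # day of the year, 1..366

# Same formula as above: shift days after Feb 28 back by one in leap years
pentad <- ceiling((yday - is_leap(year) * (yday > 59)) / 5)

counts <- table(year, pentad)                  # days per pentad, by year
counts[, c("11", "12", "13", "73")]
```

In the leap year only pentad 12 has six days; every other pentad, including 73, has five.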

Difftime for workdays according to holidayNYSE in R

I'm trying to compute difftime for working days only, according to the holidayNYSE calendar. When I use the difftime function, weekends and holidays are included in the result. My dataset contains only data from working days, but when using difftime I have to subtract the non-working days somehow.
A is a vector of 0s and 1s, and I want to find the duration, in days, of each run of 0s or 1s. The duration of run one is supposed to be 35, but I get 49 (working days from January 1990).
df <- data.frame(Date=(dates), A)
setDT(df)
df <- data.frame(Date=(dates), A)
DF1 <- df[, list(A = unique(A), duration = difftime(max(Date),min(Date), holidayNYSE
(year=setRmetricsOptions(start="1990-01-01", end="2015-31-12")))), by = run]
DF1
run A duration
1: 1 1 49 days
2: 2 0 22 days
3: 3 1 35 days
4: 4 0 27 days
5: 5 1 14 days
---
291: 291 1 6 days
292: 292 0 34 days
293: 293 1 10 days
294: 294 0 15 days
295: 295 1 29 days
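An answer to my question without using difftime:
library(data.table)
df <- data.frame(Date=(dates), Value1=bull01)
setDT(df)
# number the runs: a new run starts whenever Value1 changes
df[, run := cumsum(c(1, diff(Value1) !=0))]
# duration of each run = number of rows (working days) in it
duration <- numeric(295)
for (i in 1:295){
ind <- which(df$run==i)
a <- df$Date[ind]
duration[i] <- length(a)
}
# runs alternate between 1 and 0, starting with 1
type <- rep(c(1,0), length.out = 295)
df2 <- data.frame(duration, type)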
> df2
run duration type
1 35 1
2 17 0
3 25 1
4 20 0
5 10 1
---
291 5 1
292 25 0
293 9 1
294 11 0
295 21 1
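Because the data contain only working days, the duration of each run is simply the number of consecutive rows with the same value. Base R's rle() computes exactly that, without a loop or a hard-coded number of runs. A minimal sketch on a short made-up vector follows; A here is illustrative data, not the OP's series.

```r
# Illustrative 0/1 series with one entry per working day
A <- c(1, 1, 1, 0, 0, 1, 0, 0, 0, 0)

runs <- rle(A)   # lengths and values of consecutive runs

df2 <- data.frame(run      = seq_along(runs$lengths),
                  duration = runs$lengths,   # working days in each run
                  type     = runs$values)    # the 0/1 value of the run
df2
```

The type column comes straight from the data, so it stays correct even if the series does not start with 1.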

How to subset data.frame by weeks and then sum?

Let's say I have several years worth of data which look like the following
# load date package and set random seed
library(lubridate)
set.seed(42)
# create data.frame of dates and income
date <- seq(dmy("26-12-2010"), dmy("15-01-2011"), by = "days")
df <- data.frame(date = date,
wday = wday(date),
wday.name = wday(date, label = TRUE, abbr = TRUE),
income = round(runif(21, 0, 100)),
week = format(date, format="%Y-%U"),
stringsAsFactors = FALSE)
# date wday wday.name income week
# 1 2010-12-26 1 Sun 91 2010-52
# 2 2010-12-27 2 Mon 94 2010-52
# 3 2010-12-28 3 Tues 29 2010-52
# 4 2010-12-29 4 Wed 83 2010-52
# 5 2010-12-30 5 Thurs 64 2010-52
# 6 2010-12-31 6 Fri 52 2010-52
# 7 2011-01-01 7 Sat 74 2011-00
# 8 2011-01-02 1 Sun 13 2011-01
# 9 2011-01-03 2 Mon 66 2011-01
# 10 2011-01-04 3 Tues 71 2011-01
# 11 2011-01-05 4 Wed 46 2011-01
# 12 2011-01-06 5 Thurs 72 2011-01
# 13 2011-01-07 6 Fri 93 2011-01
# 14 2011-01-08 7 Sat 26 2011-01
# 15 2011-01-09 1 Sun 46 2011-02
# 16 2011-01-10 2 Mon 94 2011-02
# 17 2011-01-11 3 Tues 98 2011-02
# 18 2011-01-12 4 Wed 12 2011-02
# 19 2011-01-13 5 Thurs 47 2011-02
# 20 2011-01-14 6 Fri 56 2011-02
# 21 2011-01-15 7 Sat 90 2011-02
I would like to sum 'income' for each week (Sunday thru Saturday). Currently I do the following:
Weekending 2011-01-01 = sum(df$income[1:7]) = 487
Weekending 2011-01-08 = sum(df$income[8:14]) = 387
Weekending 2011-01-15 = sum(df$income[15:21]) = 443
However I would like a more robust approach which will automatically sum by week. I can't work out how to automatically subset the data into weeks. Any help would be much appreciated.
First use format to convert your dates to week numbers, then plyr::ddply() to calculate the summaries:
library(plyr)
df$week <- format(df$date, format="%Y-%U")
ddply(df, .(week), summarize, income=sum(income))
week income
1 2011-52 413
2 2012-01 435
3 2012-02 379
For more information on format.Date, see ?strptime, in particular the bit that defines %U as the week number.
EDIT:
Given the modified data and requirement, one way is to divide the date by 7 to get a number indicating the week. (Or more precisely, divide by the number of seconds in a week to get the number of weeks since the epoch, which is 1970-01-01 by default. This assumes df$date is stored as POSIXct, i.e. in seconds; for a Date column, as.numeric() returns days, so the divisor would be 7.)
In code:
df$week <- as.Date("1970-01-01")+7*trunc(as.numeric(df$date)/(3600*24*7))
library(plyr)
ddply(df, .(week), summarize, income=sum(income))
week income
1 2010-12-23 298
2 2010-12-30 392
3 2011-01-06 294
4 2011-01-13 152
I have not checked that the week boundaries are on Sunday. You will have to check this, and insert an appropriate offset into the formula.
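One way to choose the offset mentioned above is sketched here in base R, under the assumption that the date column is of class Date (so as.numeric() gives days since the epoch). 1970-01-01 was a Thursday, which makes 1970-01-04, day 3, the first Sunday; shifting by 3 before the integer division therefore aligns week starts to Sunday. The income values are copied from the question, and week_start is an illustrative column name.

```r
df <- data.frame(
  date   = seq(as.Date("2010-12-26"), as.Date("2011-01-15"), by = "day"),
  income = c(91, 94, 29, 83, 64, 52, 74,
             13, 66, 71, 46, 72, 93, 26,
             46, 94, 98, 12, 47, 56, 90)
)

days <- as.numeric(df$date)                    # days since 1970-01-01

# Day 3 (1970-01-04) was a Sunday, so shift by 3 to start weeks on Sunday
df$week_start <- as.Date("1970-01-01") + 3 + 7 * ((days - 3) %/% 7)

weekly <- aggregate(income ~ week_start, data = df, FUN = sum)
weekly
```

The three weekly sums match the 487, 387 and 443 that the question computes by hand.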
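This is now simple using dplyr. Also, I would suggest using cut(breaks = "week") rather than format() to cut the dates into weeks. Note that cut() starts weeks on Monday by default; start.on.monday = FALSE gives the Sunday-to-Saturday weeks asked for.
library(dplyr)
df %>% group_by(week = cut(date, "week", start.on.monday = FALSE)) %>% mutate(weekly_income = sum(income))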
I Googled "group week days into weeks R" and came across this SO question. You mention you have multiple years, so I think we need to keep track of both the week number and the year, so I modified the answers there to use format(date, format = "%y%U").
In use it looks like this:
library(plyr) #for aggregating
df <- transform(df, weeknum = format(date, format = "%y%U"))
ddply(df, "weeknum", summarize, suminc = sum(income))
#----
weeknum suminc
1 1152 413
2 1201 435
3 1202 379
See ?strptime for all the format abbreviations.
Try rollapply from the zoo package:
rollapply(df$income, width=7, FUN = sum, by = 7)
# [1] 487 387 443
Or, use period.sum from the xts package:
period.sum(xts(df$income, order.by=df$date), which(df$wday %in% 7))
# [,1]
# 2011-01-01 487
# 2011-01-08 387
# 2011-01-15 443
Or, to get the output in the format you want:
data.frame(income = period.sum(xts(df$income, order.by=df$date),
which(df$wday %in% 7)),
week = df$week[which(df$wday %in% 7)])
# income week
# 2011-01-01 487 2011-00
# 2011-01-08 387 2011-01
# 2011-01-15 443 2011-02
Note that the first week shows as 2011-00 because that's how it is entered in your data. You could also use week = df$week[which(df$wday %in% 1)] which would match your output.
This solution is influenced by @Andrie and @Chase.
# load plyr
library(plyr)
# format weeks as per requirement (replace "00" with "52" and adjust corresponding year)
tmp <- list()
tmp$y <- format(df$date, format="%Y")
tmp$w <- format(df$date, format="%U")
tmp$y[tmp$w=="00"] <- as.character(as.numeric(tmp$y[tmp$w=="00"]) - 1)
tmp$w[tmp$w=="00"] <- "52"
df$week <- paste(tmp$y, tmp$w, sep = "-")
# get summary
df2 <- ddply(df, .(week), summarize, income=sum(income))
# include week ending date
tmp$week.ending <- lapply(df2$week, function(x) rev(df[df$week==x, "date"])[[1]])
df2$week.ending <- sapply(tmp$week.ending, as.character)
# week income week.ending
# 1 2010-52 487 2011-01-01
# 2 2011-01 387 2011-01-08
# 3 2011-02 443 2011-01-15
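import pandas as pd  # this answer is Python (pandas), not R
df.index = pd.to_datetime(df['date'])  # use the date column as the index
df['income'].resample('W-SAT').sum()  # sum per week ending on Saturday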
With dplyr:
df %>%
arrange(date) %>%
mutate(week = as.numeric(date - date[1])%/%7) %>%
group_by(week) %>%
summarise(weekincome= sum(income))
Instead of date[1] you can have any date from when you want to start your weekly study.

Resources