Aggregation by time period in lubridate - r

This question asks about aggregation by time period in R (what pandas calls resampling). The most useful answer uses the xts package to group by a given time period and apply a function such as sum() or mean().
One of the comments suggested there was something similar in lubridate, but didn't elaborate. Can someone provide an idiomatic example using lubridate? I've read through the lubridate vignette a couple times and can imagine some combination of lubridate and plyr, however I want to make sure there isn't an easier way that I'm missing.
To make the example more real, let's say I want the daily sum of bicycles traveling northbound from this dataset:
library(lubridate)
library(reshape2)
bikecounts <- read.csv(url("http://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD"), header=TRUE, stringsAsFactors=FALSE)
names(bikecounts) <- c("Date", "Northbound", "Southbound")
Data looks like this:
> head(bikecounts)
Date Northbound Southbound
1 10/02/2012 12:00:00 AM 0 0
2 10/02/2012 01:00:00 AM 0 0
3 10/02/2012 02:00:00 AM 0 0
4 10/02/2012 03:00:00 AM 0 0
5 10/02/2012 04:00:00 AM 0 0
6 10/02/2012 05:00:00 AM 0 0

I don't know why you'd use lubridate for this. If you're just looking for something less awesome than xts, you could try this:
tapply(bikecounts$Northbound, as.Date(bikecounts$Date, format="%m/%d/%Y"), sum)
Basically, you just need to split by Date, then apply a function.
lubridate could be used for creating a grouping factor for split-apply problems. So, for example, if you want the sum for each month (ignoring year):
tapply(bikecounts$Northbound, month(mdy_hms(bikecounts$Date)), sum)
But, it's just using wrappers for base R functions, and in the case of the OP, I think the base R function as.Date is the easiest (as evidenced by the fact that the other Answers also ignored your request to use lubridate ;-) ).
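To spell out the split-then-apply steps as separate base R calls, here is a minimal sketch equivalent to the tapply call above (the groups and daily.sums names are just illustrative):
# split the Northbound counts by calendar day, then apply sum() to each group
groups <- split(bikecounts$Northbound, as.Date(bikecounts$Date, format = "%m/%d/%Y"))
daily.sums <- sapply(groups, sum)
head(daily.sums)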
Something that wasn't covered by the Answer to the other Question linked to in the OP is split.xts. period.apply splits an xts at endpoints and applies a function to each group. You can find endpoints that are useful for a given task with the endpoints function. For example, if you have an xts object, x, then endpoints(x, "months") would give you the row numbers that are the last row of each month. split.xts leverages that to split an xts object -- split(x, "months") would return a list of xts objects where each component was for a different month.
Although split.xts() and endpoints() are primarily intended for xts objects, they also work on some other objects, including plain time-based vectors. Even if you don't want to use xts objects, you may still find uses for endpoints() because of its convenience or its speed (it is implemented in C):
> split.xts(as.Date("1970-01-01") + 1:10, "weeks")
[[1]]
[1] "1970-01-02" "1970-01-03" "1970-01-04"
[[2]]
[1] "1970-01-05" "1970-01-06" "1970-01-07" "1970-01-08" "1970-01-09"
[6] "1970-01-10" "1970-01-11"
> endpoints(as.Date("1970-01-01") + 1:10, "weeks")
[1] 0 3 10
I think lubridate's best use in this problem is for parsing the "Date" strings into POSIXct objects. i.e. the mdy_hms function in this case.
Here's an xts solution that uses lubridate to parse the "Date" strings.
library(xts)
x <- xts(bikecounts[, -1], mdy_hms(bikecounts$Date))
period.apply(x, endpoints(x, "days"), sum)
apply.daily(x, sum) # identical to above
For this specific task, xts also has an optimized period.sum function (written in Fortran) that is very fast
period.sum(x, endpoints(x, "days"))

Using ddply from the plyr package:
library(plyr)
bikecounts$Date <- with(bikecounts, as.Date(Date, format = "%m/%d/%Y"))
x <- ddply(bikecounts, .(Date), summarise, sumnorth = sum(Northbound), sumsouth = sum(Southbound))
> head(x)
Date sumnorth sumsouth
1 2012-10-02 1165 773
2 2012-10-03 1761 1760
3 2012-10-04 1767 1708
4 2012-10-05 1590 1558
5 2012-10-06 926 1080
6 2012-10-07 951 1191
> tail(x)
Date sumnorth sumsouth
298 2013-07-26 1964 1999
299 2013-07-27 1212 1289
300 2013-07-28 902 1078
301 2013-07-29 2040 2048
302 2013-07-30 2314 2226
303 2013-07-31 2008 2076

Here is an option using data.table, after importing the csv:
library(data.table)
# convert the data.frame to data.table
bikecounts <- data.table(bikecounts)
# Calculate
bikecounts[, list(NB=sum(Northbound), SB=sum(Southbound)), by=as.Date(Date, format="%m/%d/%Y")]
as.Date NB SB
1: 2012-10-02 1165 773
2: 2012-10-03 1761 1760
3: 2012-10-04 1767 1708
4: 2012-10-05 1590 1558
5: 2012-10-06 926 1080
---
299: 2013-07-27 1212 1289
300: 2013-07-28 902 1078
301: 2013-07-29 2040 2048
302: 2013-07-30 2314 2226
303: 2013-07-31 2008 2076
Note, you can also use fread() ("fast read") from the data.table package to read in the CSV into a data.table in one step.
The only drawback is that you have to convert the date/time from a string manually, e.g.:
bikecounts <- fread("http://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD", header=TRUE, stringsAsFactors=FALSE)
setnames(bikecounts, c("Date", "Northbound", "Southbound"))
bikecounts[, Date := as.POSIXct(Date, format="%m/%d/%Y %I:%M:%S %p")]

Here is the requested lubridate solution, which I also added to the linked question. It uses a combination of lubridate and zoo aggregate() for these operations:
ts.month.sum <- aggregate(zoo.ts, month, sum)    # sum for each month of the year
ts.daily.mean <- aggregate(zoo.ts, day, mean)    # mean for each day of the month
ts.mins.mean <- aggregate(zoo.ts, minute, mean)  # mean for each minute of the hour
Obviously, you first need to convert your data to a zoo object, which is easy enough. You can also use as.yearmon() or as.yearqtr() from zoo, or custom functions, for both the split and the apply. This method is as syntactically sweet as the pandas approach.
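As a minimal sketch of that approach on the bikecounts data from the question (assuming only the zoo and lubridate packages; zoo.ts here is the converted series):
library(zoo)
library(lubridate)
# build a zoo series of Northbound counts indexed by the parsed timestamps
zoo.ts <- zoo(bikecounts$Northbound, mdy_hms(bikecounts$Date))
daily.sum <- aggregate(zoo.ts, as.Date, sum)  # daily sums of northbound bicycles
month.sum <- aggregate(zoo.ts, month, sum)    # sums by month of the year (ignores the year)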

Related

Creating a for loop to subset data on R

I have a huge data set that, in .csv format, has 2 columns (one is Date_Time and the other is Q.vanda).
This is what the head and tail of the data looks like,
> head(mdf.vanda)
Date_Time Q.vanda
1 1969-12-05 21:00:00 0
2 1969-12-05 21:01:00 4
3 1969-12-05 21:05:00 11
4 1969-12-05 21:20:00 17
5 1969-12-05 22:45:00 27
6 1969-12-05 22:55:00 23
> tail(mdf.vanda)
Date_Time Q.vanda
165738 2016-01-19 10:15:00 2995.25
165739 2016-01-19 10:30:00 2858.04
165740 2016-01-19 10:45:00 2956.94
165741 2016-01-19 11:00:00 2972.52
165742 2016-01-19 11:15:00 2776.99
165743 2016-01-19 11:30:00 3082.53
There are 48 years of data in between and I want to create a for loop to subset them by year (ex. from 1969/10/01 to 1970/10/01, 1970/10/01 to 1971/10/01 etc.)
I wrote some code, but it's giving me an error that I am not able to resolve. I am pretty new at R, so feel free to suggest other code that you think is more efficient for my purpose.
code:
cut <- as.POSIXct(strptime(as.character(c('1969/10/01','1970/10/01','1971/10/01','1972/10/01','1973/10/01','1974/10/01','1975/10/01','1976/10/01','1977/10/01','1978/10/01','1979/10/01','1980/10/01','1981/10/01','1982/10/01','1983/10/01','1984/10/01','1985/10/01','1986/10/01','1987/10/01','1988/10/01','1989/10/01','1990/10/01','1991/10/01','1992/10/01','1993/10/01','1994/10/01','1995/10/01','1996/10/01','1997/10/01','1998/10/01',
'1999/10/01','2000/10/01','2001/10/01','2002/10/01','2003/10/01','2004/10/01',
'2005/10/01','2006/10/01','2007/10/01','2008/10/01','2009/10/01','2010/10/01',
'2011/10/01','2012/10/01','2013/10/01','2014/10/01','2015/10/01','2016/10/01')),format = "%Y/%m/%d"))
df.sub <- as.data.frame(matrix(data=NA,nrow=14496, ncol=96)) #nrow = (31+30+31+31+28)*(4*24)[days * readings/day] , ncol = (48*2)[Seasons*cols]
i.odd <- seq(1,49, by=2)
for (i in 1:48) {df.sub[1:length(mdf.vanda$Date_Time[mdf.vanda$Date_Time >= cut[i] & mdf.vanda$Date_Time < cut[i+1]])
,i.odd[i]:(i.odd[i]+1)] <- subset(mdf.vanda,mdf.vanda$Date_Time > cut[i] & mdf.vanda$Date_Time < cut[i+1])}
Error:
Error in [<-.data.frame(*tmp*, 1:length(mdf.vanda$Date_Time[mdf.vanda$Date_Time >= :
replacement element 1 has 1595 rows, need 1596
You can split your data as shown:
split(mdf.vanda, findInterval(as.Date(mdf.vanda$Date_Time), seq(as.Date("1969-10-01"), as.Date("2016-10-01"), "1 year")))
There is no need for a loop here. Base R has the cut function to perform this very operation, and it is significantly faster than a loop, since you already have the break points defined in your "cut" variable.
#cut <- as.POSIXct(c('1969/10/01', ... ,'2016/10/01'),format = "%Y/%m/%d")
mytime <- cut(mdf.vanda$Date_Time, breaks = cut, include.lowest = TRUE)
The variable "mytime" is a factor, the same length as your data frame, whose levels label the bin each row falls into.
You could then use the split function to break your data frame into a list of data frames, or use the group_by function from the dplyr package for additional data processing; see the sketch below.
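For instance, a minimal sketch using the mytime factor built above (the yearly.list and yearly.mean names are just illustrative):
yearly.list <- split(mdf.vanda, mytime)                          # one data frame per break interval
yearly.mean <- sapply(yearly.list, function(d) mean(d$Q.vanda))  # e.g. mean flow per interval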
I suggest you have a look at the convenient quantmod package. Once you have time-series data, you can use the apply.yearly function (defined in xts, which quantmod loads) to apply any function to each year of data.
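A hedged sketch of that idea, assuming mdf.vanda from the question has been converted to an xts object; note that apply.yearly() splits on calendar-year ends, not the October-to-October periods asked about, so the endpoints would need adjusting for water years:
library(xts)
x <- xts(mdf.vanda$Q.vanda, order.by = as.POSIXct(mdf.vanda$Date_Time))
apply.yearly(x, mean)                        # one summary value per calendar year
period.apply(x, endpoints(x, "years"), sum)  # same idea with an arbitrary function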

sub-setting without reading the whole data based on a column

I have a big data set in which one of the columns is a Date stored in "character" class. Is there any argument in the various reading functions for reading only the rows that fall within, for example, 1/13/2017 - 1/13/2018? Or, at least, what is the command for subsetting without reading the whole data set?
Thank you for your response.
Suppose we have the test file generated in the Note at the end. We assume that the actual file is much larger, or else we could have just read it directly using read.csv. Instead we use read.csv.sql to read only the 2017 and 2018 rows into R. Then we cut this much smaller data frame down to precisely the dates we want:
library(sqldf)
sql <- "select * from file where dates like '%2017' or dates like '%2018'"
dd <- read.csv.sql("testfile.csv", sql)
dd$dates <- as.Date(dd$dates, "%d/%m/%Y")
dd_sub <- subset(dd, dates > '2017-01-13' & dates <= '2018-01-13')
To show that it worked we display the first few and last few rows:
> head(dd_sub)
dates value
14 2017-01-14 744
15 2017-01-15 745
16 2017-01-16 746
17 2017-01-17 747
18 2017-01-18 748
19 2017-01-19 749
> tail(dd_sub)
dates value
365 2017-12-31 1095
366 2018-01-01 1096
367 2018-01-02 1097
368 2018-01-03 1098
369 2018-01-04 1099
370 2018-01-05 1100
The actual file you have may have different format than testfile.csv so you will need to carefully read ?read.csv.sql and set whatever arguments you need.
Note
# test input in reproducible form
dates <- format(as.Date("2015-01-01") + 1:1100, "%d/%m/%Y")
d <- data.frame(dates, value = seq_along(dates))
write.csv(d, "testfile.csv", row.names = FALSE, quote = FALSE)

Average xts object with missing values to hourly endpoints

I am using xts to convert to hourly average data. I am starting with a year's worth of 10-minute data. Some hours have one 10-minute period (such as 'UTSP' in row 229) that is NA (missing).
For such hours, I would still like the average of the data that are available; however, in the output I get NA for that variable for that hour.
Other hours may have no data (all data are missing). I want these completely missing hours to return NA, but where some data exist for an hour, I want that data to be used.
Here is a reproducible example of what I've been trying:
Lines <- "date,time,UTSP,UPM10,UPM25,UPM1,UWS,UWDT,PTSP,PPM10,PPM25,PPM1,PWS,PWDT
218,2014/10/15,22:00,9.7,4.9,4.66,1.54,6,152.56,102,53.6,33.71,10.34,NA,NA
219,2014/10/15,22:10,9.3,5.1,4.57,1.61,6.4,147.56,106.4,55.1,33.92,10.47,NA,NA
220,2014/10/15,22:20,8.9,5,4.7,1.55,6.4,147.56,108.3,54.8,33.19,10.53,NA,NA
221,2014/10/15,22:30,9.7,5.3,4.93,1.62,6.8,152.56,110.3,57.4,34.97,11.14,NA,NA
222,2014/10/15,22:40,9.1,5.2,4.76,1.54,6.8,152.56,118.9,62.3,37.58,11.63,NA,NA
223,2014/10/15,22:50,9.8,5.5,5.07,1.62,6.7,152.56,120.5,61.8,36.24,11.9,NA,NA
224,2014/10/15,23:00,11.1,5.6,5.2,1.59,6.4,152.56,108.6,57.1,34.93,11.66,NA,NA
225,2014/10/15,23:10,9.8,5.4,4.89,1.63,7.3,152.56,116,59.6,35.08,11.14,NA,NA
226,2014/10/15,23:20,9.1,5,4.95,1.63,7.1,152.56,122.6,63.8,38.28,12.17,NA,NA
227,2014/10/15,23:30,9.7,5.2,4.88,1.58,7.3,147.56,88.1,46.7,29.59,9.78,NA,NA
228,2014/10/15,23:40,9.2,5.2,4.79,1.66,7.1,152.56,92.4,48.8,30.11,9.69,NA,NA
229,2014/10/15,23:50,NA,NA,NA,NA,NA,NA,89.7,48.1,30.53,9.89,NA,NA
230,2014/10/16,00:00,9.8,5.5,5.03,1.6,7,147.56,91.2,47.5,30.09,9.38,NA,NA
231,2014/10/16,00:10,9.7,5.1,4.81,1.57,7.1,152.56,91.2,47.6,29.44,9.4,NA,NA
232,2014/10/16,00:20,9.9,5.4,5.09,1.61,7.4,147.56,91.1,48.3,29.78,9.23,NA,NA
233,2014/10/16,00:30,9.8,5.4,4.82,1.62,6.9,152.56,95.7,48.6,29.47,9.8,NA,NA
234,2014/10/16,00:40,10.6,5.7,4.99,1.58,6.8,147.56,91.3,47.9,29.57,9.94,NA,NA
235,2014/10/16,00:50,10.1,5.4,4.93,1.65,7,147.56,86.3,44.9,27.9,8.93,NA,NA"
conn <- textConnection(Lines)
dframe <- read.csv(conn)
close(conn)
library(xts)
USP_TSP.xts <- xts(dframe$UTSP,
as.POSIXct(paste(dframe$date,dframe$time), format="%Y/%m/%d %H:%M"))
na.exclude(USP_TSP.xts)
ep <- endpoints(USP_TSP.xts,'hours')
period.apply(USP_TSP.xts,ep,mean)
I have also tried several variations of na.contiguous, na.omit, na.action.
My resultant output always seems to be the same (excerpt):
[,1]
2014-10-15 22:50:00 9.4166667
2014-10-15 23:50:00 NA
2014-10-16 00:50:00 9.9833333
... with the value for 2014-10-15 hr 23 being NA, even though there were 5 out of 6 non-missing values
Also, I am calculating all the columns separately, then combining them later. Is there an easier way - like calculating all the columns at once?
Calling na.exclude doesn't change the USP_TSP.xts object. You would need to assign the output of na.exclude to USP_TSP.xts to achieve that.
USP_TSP.xts <- na.exclude(USP_TSP.xts)
But if you want to process all the columns in the object at once, using na.exclude is going to remove all rows that have at least one column with a missing value.
xData <- xts(dframe[,-(1:2)],
as.POSIXct(paste(dframe$date,dframe$time), format="%Y/%m/%d %H:%M"))
na.exclude(xData)
# UTSP UPM10 UPM25 UPM1 UWS UWDT PTSP PPM10 PPM25 PPM1 PWS PWDT
str(na.exclude(xData))
# An 'xts' object of zero-width
Instead, you should supply na.rm=TRUE to the call to mean inside the period.apply call. If you want to process all columns at the same time, you can use colMeans.
xDataMeans <- period.apply(xData, endpoints(xData, "hours"), colMeans, na.rm=TRUE)
xDataMeans
# UTSP UPM10 UPM25 UPM1 UWS UWDT
# 2014-10-15 22:50:00 9.416667 5.166667 4.781667 1.580 6.516667 150.8933
# 2014-10-15 23:50:00 9.780000 5.280000 4.942000 1.618 7.040000 151.5600
# 2014-10-16 00:50:00 9.983333 5.416667 4.945000 1.605 7.033333 149.2267
# PTSP PPM10 PPM25 PPM1 PWS PWDT
# 2014-10-15 22:50:00 111.06667 57.50000 34.93500 11.001667 NaN NaN
# 2014-10-15 23:50:00 102.90000 54.01667 33.08667 10.721667 NaN NaN
# 2014-10-16 00:50:00 91.13333 47.46667 29.37500 9.446667 NaN NaN
Your code works fine. You just need to assign USP_TSP.xts <- na.exclude(USP_TSP.xts). If you merely call na.exclude(USP_TSP.xts), then the output without NAs is printed, but it is not stored in any variable.
USP_TSP.xts <- na.exclude(USP_TSP.xts)
ep <- endpoints(USP_TSP.xts,'hours')
period.apply(USP_TSP.xts,ep,mean)
# [,1]
#2014-10-15 22:50:00 9.416667
#2014-10-15 23:40:00 9.780000
#2014-10-16 00:50:00 9.983333
Alternatively, you can use period.apply(USP_TSP.xts, ep, mean, na.rm=TRUE) if you don't want to modify the original xts object.

Aggregating in R conditional on the date

I have some data that I aggregate on a per-minute basis with the code below, which works on a data set for one day.
I would, however, like to be able to run this code on a data file that combines multiple days. I have a date column in the data set, so I can use that as a unique identifier for each day. Is there a way to aggregate the data on a 1-minute basis, given that the dates aren't the same?
The problem is that the unique function extracts the unique minutes that occur on the first day, and then lumps in all the same minutes that happen on the following days. If I base it on the date too, I believe I can create unique 1-minute entries for each day in one long data set.
Below is the code that works for a single day's data.
novo <- read.csv("C:/Users/Morten/Desktop/data.csv", header = TRUE, stringsAsFactors=FALSE )
TimeStamp <- novo[,1]
price <- novo[, 2]
volume <- novo[,3]
nV <- sum(volume)
MinutesFloor <- unique(floor(TimeStamp))
nTradingMinutes <- length(MinutesFloor)
PriceMin <- rep(0, nTradingMinutes)
VolumeMin <- rep(0, nTradingMinutes)
for (j in 1:nTradingMinutes) {
  ThisMinutes <- (floor(TimeStamp) == MinutesFloor[j])
  PriceMin[j] <- mean(price[ThisMinutes])
  VolumeMin[j] <- sum(volume[ThisMinutes])
}
Thanks in advance
data format:
date,"ord","shares","finalprice","time","stock"
20100301,C,80,389,540.004,1158
20100301,C,77,389,540.004,1158
20100301,C,60,389,540.004,1158
20100301,C,28,389,540.004,1158
20100301,C,7,389,540.004,1158
20100302,C,25,394.7,540.00293333,1158
20100302,C,170,394.7,540.00293333,1158
20100302,C,40,394.7,540.00293333,1158
20100302,C,75,394.7,540.00293333,1158
20100302,C,100,394.7,540.00293333,1158
20100302,C,1,394.7,540.00293333,1158
I would like to suggest a radically simplified version of your code.
You are doing quite a few things rather inefficiently. R is made to compute summary statistics grouped by different data values.
We will use these methods heavily.
I assume your data to be of the form you provided. On my system, this looks like:
novo <- read.csv("test.csv", header = TRUE, stringsAsFactors=FALSE )
This gives us:
> str(novo)
'data.frame': 11 obs. of 6 variables:
$ date : int 20100301 20100301 20100301 20100301 20100301 20100302 20100302 20100302 20100302 20100302 ...
$ ord : chr "C" "C" "C" "C" ...
$ shares : int 80 77 60 28 7 25 170 40 75 100 ...
$ finalprice: num 389 389 389 389 389 ...
$ time : num 540 540 540 540 540 ...
$ stock : int 1158 1158 1158 1158 1158 1158 1158 1158 1158 1158 ...
Now, I assume that your date is ordered as year, month, day. If you have a different ordering, you would have to alter the format string below. Furthermore, your time is presumably in minutes since midnight.
Then we can create timestamps containing both the date and the time using the POSIXct datatype:
timestamps <- as.POSIXct(as.character(novo$date), format='%Y%m%d') + novo$time*60
Now we round the timestamps to whole minutes by creating a factor variable with the cut function:
timestampsByMinute <- droplevels(cut(timestamps, 'min'))
Note that the additional droplevels call just removes the minutes for which no data items are available.
Finally, we may compute the summary statistics you computed in the for-loop.
tapply takes its first argument, divides it into groups defined by the second argument, and applies the function given as the third argument to each group. Thus we can just throw tapply at your data. (I have the feeling that the column numbers you used in your code do not match the column names in your example data; feel free to adapt to different columns if I interpreted your meaning the wrong way.)
PriceMin <- tapply(novo$finalprice, timestampsByMinute, mean)
VolumeMin <- tapply(novo$shares, timestampsByMinute, sum)
This gives us
> PriceMin
2010-03-01 09:00:00 2010-03-02 09:00:00
389.0 394.7
> VolumeMin
2010-03-01 09:00:00 2010-03-02 09:00:00
252 411
which is probably what you want.
Note that tapply is much faster than the loop you used. If you have huge data files, this may be important.
I hope there are no errors left in my code - testing was not easy given the fact that you provided only data for one minute per day.
Edit:
As per request, here is a small modification that removes the time information from the results:
> unname(VolumeMin)
[1] 252 411
> unname(PriceMin)
[1] 389.0 394.7

rolling computations in xts by month

I am familiar with the zoo function rollapply, which allows you to do rolling computations on zoo or xts objects, and you can specify the rolling increment via the by parameter. I am specifically interested in applying a function every month but using all of the past daily data in the computation. For example, say my data set looks like this:
dte, val
1/01/2001, 10
1/02/2001, 11
...
1/31/2001, 2
2/01/2001, 54
2/02/2001, 34
...
2/30/2001, 29
I would like to select the end of each month and apply a function that uses all the daily data. This doesn't seem like it would work with rollapply since the by argument would be 30 sometimes, 29 other months, etc. My current idea is:
f <- function(xts_obj) { coef(lm(a ~ b, data=as.data.frame(xts_obj)))[1] }
month_end <- endpoints(my_xts, on="months", k=1)
rslt <- apply(month_end, 1, function(idx) { my_xts[paste0("/",idx)] })
Surely there is a better way to do this that would be quicker, no?
To clarify: I would like to use overlapping periods; it is just that the rolling should be done monthly.
If I understand correctly, you can get the dates of your endpoints, then for each endpoint (i.e. using lapply or for), call rollapply using data up to that point.
getSymbols("SPY", src='yahoo', from='2012-01-01', to='2012-08-01')
idx <- index(SPY)[endpoints(SPY, 'months')]
out <- lapply(idx, function(i) {
as.xts(rollapplyr(as.zoo(SPY[paste0("/", i)]), 5,
function(x) coef(lm(x[, 4] ~ x[, 1]))[2], by.column=FALSE))
})
sapply(out, NROW)
#[1] 16 36 58 78 100 121 142 143
I temporarily coerce to zoo for the rollapplyr call to make sure the rollapply.zoo method is being used (as opposed to the unexported rollapply.xts method), then coerce back to xts.
As an answer to "Is the zoo/xts conversion needed?": it isn't needed in this case, but rollapply won't work if you send it a data frame, as I recently discovered from this StackOverflow answer.
You want period.apply(), or its convenience helper apply.monthly(), both in xts.
Example:
R> foo <- xts(1:100, order.by=Sys.Date()+0:99)
R> apply.monthly(foo, sum)
[,1]
2012-08-31 105
2012-09-30 885
2012-10-31 1860
2012-11-25 2200
R>
or equally
R> apply.monthly(foo, quantile)
0% 25% 50% 75% 100%
2012-08-31 1 4.25 7.5 10.75 14
2012-09-30 15 22.25 29.5 36.75 44
2012-10-31 45 52.50 60.0 67.50 75
2012-11-25 76 82.00 88.0 94.00 100
R>
just to prove that functions returning more than one value can be used too.
