R: Efficiently subsetting a dataframe based on time of day

I have a large (150,000x7) dataframe that I intend to use for back-testing and real-time analysis of a financial market. The data represents the condition of an investment vehicle at 5 minute intervals (although holes do exist). It looks like this (but much longer):
pTime Time Price M1 M2 M3 M4
1 1212108300 20:45:00 1.5518 12.21849 -0.37125 4.50549 -31.00559
2 1212108900 20:55:00 1.5516 11.75350 -0.81792 -1.53846 -32.12291
3 1212109200 21:00:00 1.5512 10.75070 -1.47438 -8.24176 -34.35754
4 1212109500 21:05:00 1.5514 10.23529 -1.06044 -8.46154 -33.24022
5 1212109800 21:10:00 1.5514 9.74790 -1.02759 -10.21978 -33.24022
6 1212110100 21:15:00 1.5513 9.31092 -1.17076 -11.97802 -33.79888
7 1212110400 21:20:00 1.5512 8.84034 -1.28428 -13.62637 -34.35754
8 1212110700 21:25:00 1.5509 8.07843 -1.63715 -18.24176 -36.03352
9 1212111000 21:30:00 1.5509 7.39496 -1.49198 -20.65934 -36.03352
10 1212111300 21:35:00 1.5512 7.65266 -1.03717 -18.57143 -34.35754
The data is pre-loaded into R, but during my back-test I need to subset it by two criteria:
The first criterion is a sliding window to avoid peeking into the future. The window must be such that each new 5-minute interval in the back-test shifts the whole window 5 minutes into the future. This part I can do like this:
require(zoo)
zooser <- zoo(x = tser$Close, order.by = as.POSIXct(tser$pTime, origin = "1970-01-01"))
window(zooser, start = A, end = B)
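For example, a sketch with concrete (made-up) bounds:

A <- as.POSIXct("2008-05-29 21:00:00")
B <- A + 2 * 60 * 60  # two hours later
window(zooser, start = A, end = B)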
The second criterion is another sliding window, but one that slides through time of day and contains only those entries that are within N minutes of the input time on any given day.
Example: if the window extends 2 hours to either side of the input time, and the input time is 12:00 PM, then the window must contain all rows with Time between 10:00 AM and 2:00 PM.
This is the part that I am having trouble figuring out.
Edit: My data has holes in it: two consecutive rows can be MORE than 5 minutes apart. [The original post included a zoomed-in chart of the data illustrating these gaps.]
As the window moves through these gaps, the number of points inside the window should vary.
The following is my MySQL code that does what I want to do in R (same table structure):
SET @qTime = TIME(FROM_UNIXTIME(SAMP_endTime));
SET @inc = -1;

INSERT INTO MetIndListBuys (pTime, ArrayPos, M1, M2, M3, M4)
SELECT pTime, @inc := @inc + 1, M1, M2, M3, M4
FROM mergebuys USE INDEX (`y`)
WHERE pTime BETWEEN SAMP_startTime AND SAMP_endTime
  AND TIME_TO_SEC(TIMEDIFF(Time, @qTime)) / 3600 BETWEEN 0 - HourSpan AND HourSpan;

Say that you have your target time t0 on the same scale as pTime: seconds since the epoch. Then t0 - pTime = (difference in whole days) * 86400 + (difference in remaining clock seconds). Taking (pTime - t0) %% 86400 therefore leaves only the clock-time difference in seconds, wrapped around into [0, 86400) when the difference is negative. This suggests the following function:
SecondsPerDay <- 24 * 60 * 60

# Note: defining this masks base::within while it is on the search path.
within <- function(d, t0Sec, wMin) {
  diff <- (d$pTime - t0Sec) %% SecondsPerDay  # clock-time difference, in seconds
  wSec <- 60 * wMin
  return(d[diff < wSec | diff > (SecondsPerDay - wSec), ])
}
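For example, a minimal usage sketch (assuming the question's data frame is named DF and its pTime values are UTC epoch seconds; the target time and window width here are made up):

t0 <- as.numeric(as.POSIXct("2008-05-30 21:00:00", tz = "GMT"))  # any epoch time whose clock time is 21:00 UTC
within(DF, t0, wMin = 60)  # rows of DF whose time of day is within 60 minutes of 21:00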

1) If DF is the data frame shown in the question, create a zoo object from it as you have done and split it into days, giving zs. Then rollapply your function f to each successive set of w points in each component (i.e. in each day). For example, if you want to apply your function to 2 hours of data at a time and your data is regularly spaced 5-minute data, then w = 24 (since there are 24 five-minute periods in two hours). In that case f is passed 24 rows of data as a matrix each time it's called. Also, align has been set to "right" below, but it can alternately be set to align = "center" with the condition giving ix changed to a two-sided one, etc. For more on rollapply see: ?rollapply
library(zoo)

z <- zoo(DF[-2], as.POSIXct(DF[, 1], origin = "1970-01-01"))
zs <- split(z, as.Date(time(z)))  # one zoo object per day
w <- 3  # replace this with 24 to handle two hours at a time with five minute data
f <- function(x) {
  tt <- x[, 1]
  ix <- tt[w] - tt <= w * 5 * 60  # RHS converts w to seconds
  x <- x[ix, -1]
  sum(x)  # replace sum with your function
}
out <- lapply(zs, rollapply, w, f, by.column = FALSE, align = "right")
Using the data frame in the question we get this:
> out
$`2008-05-30`
2008-05-30 02:00:00 2008-05-30 02:05:00 2008-05-30 02:10:00 2008-05-30 02:15:00
-66.04703 -83.92148 -95.93558 -100.24924
2008-05-30 02:20:00 2008-05-30 02:25:00 2008-05-30 02:30:00 2008-05-30 02:35:00
-108.15038 -121.24519 -134.39873 -140.28436
2) This could alternately be done as follows, where w and f are as above:
n <- nrow(DF)
m <- as.matrix(DF[-2])
sapply(w:n, function(i) {
  m <- m[seq(length = w, to = i), ]
  f(m)
})
Replace the sapply with lapply if needed. Also, this may seem shorter than the first solution, but it's not much different once you add the code to define f and w (which appear in the first but not the second).
If there are no holes during the day and only holes between days then these solutions could be simplified.
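For instance, a sketch of that simplified, gap-free case (fixed-width windows within each day, no time check; assumes the zs, w and f setup above):

out2 <- lapply(zs, rollapply, w, function(x) sum(x[, -1]), by.column = FALSE, align = "right")  # drop pTime, then aggregate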

Related

How do I call a function using a specific time window?

Suppose I have a zoo object (or it could be a data.frame) that has an index on "time of day" and has some value (see sample data below):
val
...
2006-08-01 12:00 23
2006-08-01 12:01 24
2006-08-01 12:02 25
2006-08-01 12:03 26
2006-08-01 12:04 27
2006-08-01 12:05 28
2006-08-01 12:06 29
...
2006-08-02 12:00 123
2006-08-02 12:01 124
2006-08-02 12:02 125
2006-08-02 12:03 126
2006-08-02 12:04 127
...
I would like to call a custom function (call it custom.func(vals)) on the values from 12:01 - 12:03 (i.e. something similar to zoo::rollapply) every time that interval occurs, so in this example daily. How would I do that?
NOTES (for robustness it would also be great to handle the following edge cases, but this is not necessary):
Don't assume that I have values for 12:01 - 12:03 every day
Don't assume that the entire range 12:01 - 12:03 is present every day. Some days I might only have 12:01 and 12:02 but might be missing 12:03
What if I wanted my custom.func(vals) to be called across day boundaries, e.g. using val from 23:58 - 00:12?
Suppose our input is the POSIXct zoo object z given in the Note at the end.
Create a character vector times which has one element per element of z and is in the form HH:MM. Then create a logical vector ok which indicates which times are between the indicated boundary values; z[ok] is z reduced to those values. Finally, for each day, apply sum (or some other function if desired) using aggregate.zoo:
times <- format(time(z), "%H:%M")
ok <- times >= "12:01" & times <= "12:03"
aggregate(z[ok], as.Date, sum)
## 2006-08-01 2006-08-02
## 75 375
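To use a different statistic, just swap the function in; e.g. the daily mean over the same window:

aggregate(z[ok], as.Date, mean)
## 2006-08-01 2006-08-02
##         25        125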
Times straddle midnight
This version is for the case where the times straddle midnight. Note that the order of the values sent to the function is not the original order, but if the function is symmetric that does not matter.
times <- format(time(z), "%H:%M")
ok <- times >= "23:58" | times <= "00:12"
aggregate(z[ok], (as.Date(format(time(z))) + (times >= "23:58"))[ok], sum)
## 2006-08-02
## 41
Variation
The prior code chunk works if the function is symmetric in the components of its argument (which is the case for many functions such as mean and sum), but if the function were not symmetric we would need a slightly different approach. We define to.sec, which translates an HH:MM string to numeric seconds, and subtract to.sec("23:58") from each POSIXct datetime. The components of z to keep are then those whose shifted times, formatted as HH:MM strings, are at most "00:14".
to.sec <- function(x) with(read.table(text = x, sep = ":"), 3600 * V1 + 60 * V2)
times <- format(time(z) - to.sec("23:58"), "%H:%M")
ok <- times <= "00:14"
aggregate(z[ok], as.Date(time(z)[ok] - to.sec("23:58")), sum)
## 2006-08-01
## 41
Note
Lines <- "datetime val
2006-08-01T12:00 23
2006-08-01T12:01 24
2006-08-01T12:02 25
2006-08-01T12:03 26
2006-08-01T12:04 27
2006-08-01T12:05 28
2006-08-01T12:06 29
2006-08-01T23:58 20
2006-08-02T00:01 21
2006-08-02T12:00 123
2006-08-02T12:01 124
2006-08-02T12:02 125
2006-08-02T12:03 126
2006-08-02T12:04 127"
library(zoo)
z <- read.zoo(text = Lines, tz = "", header = TRUE, format = "%Y-%m-%dT%H:%M")
EDIT
Have revised the non-symmetric code and simplified all code chunks.
I recommend the runner package, which can compute any rolling function on irregular time series. The runner function is an equivalent of rollapply, with the distinction that the window can depend on dates: it applies any R function over windows of length k at index idx (a date-time or any integer). The example below calculates a regression on a 5-minute (5 * 60 sec) window span. The algorithm doesn't care whether a day change falls inside the window; it just takes the latest 5 minutes each time (for example 23:56-00:01).
Create data:
set.seed(1)
x <- cumsum(rnorm(1000))
y <- 3 * x + rnorm(1000)
time <- as.POSIXct(cumsum(sample(60:120, 1000, replace = TRUE)),
                   origin = Sys.Date())  # unequally spaced time series
data <- data.frame(time, y, x)
Custom function to be called on sliding windows:
library(runner)
running_regression <- function(idx) {
  # fit only on the rows inside the current window and return the
  # fitted value for the window's last observation
  fit <- lm(y ~ x, data = data[idx, ])
  unname(tail(fitted(fit), 1))
}
data$pred <- runner(seq_along(x),
                    k = 60 * 5,
                    idx = time,
                    f = running_regression)
Once we have created the dataset with a rolling 5-minute prediction, we can keep only particular windows: here, only the first minute of each hour. That means we always keep the {hh}:56 - {hh+1}:01 window.
library(dplyr)
library(lubridate)
filtered <- data %>%
  filter(minute(time) == 1)
plot(data$time, data$y, type = "l", col = "red")
points(filtered$time, filtered$pred, col = "blue")
There are some other examples in the runner vignette showing how to do this.

R: Combine two dataframes by the nearest time

I have two dataframes: one contains a year's worth of hourly temperatures and the other contains flight information. Below is an extract from the temperature dataframe:
Time <- c("2000-01-01 00:53:00","2000-01-01 06:53:00","2000-01-01 10:53:00")
Time <- as.POSIXct(Time)
Temp <- c(20,30,10)
Temperature <- data.frame(Time,Temp)
Temperature
Time Temp
1 2000-01-01 00:53:00 20
2 2000-01-01 06:53:00 30
3 2000-01-01 10:53:00 10
Below is an extract from the flight information dataframe:
DepartureTime <- c("2000-01-01 03:01:00","2000-01-01 10:00:00","2000-01-01 14:00:00")
DepartureTime <- as.POSIXct(DepartureTime)
FlightInformation <- data.frame(DepartureTime)
FlightInformation
DepartureTime
1 2000-01-01 03:01:00
2 2000-01-01 10:00:00
3 2000-01-01 14:00:00
My goal is to take each row of FlightInformation$DepartureTime and find the closest time in the whole column Temperature$Time. I then want to add the corresponding temperature to the FlightInformation dataframe. The desired output should look like this:
FlightInformation
DepartureTime Temp
1 2000-01-01 03:01:00   20
2 2000-01-01 10:00:00   10
3 2000-01-01 14:00:00   10
My attempts so far have come up with this:
i <- 1
j <- 1
while (i <= nrow(Temperature)) {
  while (j <= nrow(FlightInformation)) {
    if (Temperature$Time[i] == FlightInformation$DepartureTime[j]) {
      FlightInformation$Temp[j] <- Temperature$Temp[i]
    }
    j <- j + 1
  }
  i <- i + 1
}
This involves first rounding all times to the nearest hour. This method is not as accurate as I would like it to be and seems VERY inefficient! Is there an easy way to find the nearest POSIXct to give my desired output?
Some assumptions:
you have temperature data before and after all flight information; otherwise you'll see NA
temperature data is continuous-enough, meaning with the interpolation this presents, you don't grab something from 3 months prior (not useful)
temperature data is ordered (easy enough to fix if not)
We'll use cut, which finds the interval in which values fit within a series of breaks:
(ind <- cut(FlightInformation$DepartureTime, Temperature$Time, labels = FALSE))
# [1] 1 2 NA
These indicate rows within Temperature from which we should retrieve the $Temp. Unfortunately, cut always reports the interval a value falls in (its left endpoint) and does not account for being closer to the next value, so we can compensate for that:
(ind <- ind + (abs(Temperature$Time[ind] - FlightInformation$DepartureTime) >
                 abs(Temperature$Time[1 + ind] - FlightInformation$DepartureTime)))
# [1] 1 3 NA
Okay, now that NA: it indicates that the latest $DepartureTime is outside of the known times, a violation of my first assumption above, but it can be fixed. I use a magic constant of "6 hours" here to decide that the data is close enough to be usable; there are certainly many other heuristics which will be less wrong. With that, we can just assume the latest temperature:
(is_recoverable <- is.na(ind) & abs(FlightInformation$DepartureTime - max(Temperature$Time)) < 60*60*6)
# [1] FALSE FALSE TRUE
ind[is_recoverable] <- nrow(Temperature)
ind
# [1] 1 3 3
Then the results:
FlightInformation$Temp <- Temperature$Temp[ ind ]
FlightInformation
# DepartureTime Temp
# 1 2000-01-01 03:01:00 20
# 2 2000-01-01 10:00:00 10
# 3 2000-01-01 14:00:00 10
Though definitely quicker than the double while loop, this will be a problem if you have large gaps in your temperature data. That is, if you have a 3-year gap in your data, the most recent temperature will be used, which might be from 2.99 years ago. For a double-check, use this:
FlightInformation$TempTime <- Temperature$Time[ ind ]
FlightInformation$TimeDelta <- with(FlightInformation, abs(TempTime - DepartureTime))
FlightInformation
# DepartureTime Temp TempTime TimeDelta
# 1 2000-01-01 03:01:00 20 2000-01-01 00:53:00 128 mins
# 2 2000-01-01 10:00:00 10 2000-01-01 10:53:00 53 mins
# 3 2000-01-01 14:00:00 10 2000-01-01 10:53:00 187 mins
You can use different units for the time delta and check for problems with:
units(FlightInformation$TimeDelta) <- "secs"
which(FlightInformation$TimeDelta > 60*60*6)
# integer(0)
(where integer(0) says you have none that are outside of my magic window of 6 hours.)
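If data.table is an option, a nearest rolling join is a compact alternative. This is only a sketch, assuming the Temperature and FlightInformation frames from the question:

library(data.table)
TempDT <- as.data.table(Temperature)
FlightDT <- as.data.table(FlightInformation)
TempDT[, join_time := Time]              # keep the original columns intact
FlightDT[, join_time := DepartureTime]
# for each flight row, roll to the nearest Temperature time
TempDT[FlightDT, on = "join_time", roll = "nearest"][, .(DepartureTime, Temp)]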
Here's a way! Time is easiest to work with for this if you convert it to a numeric value. Then you can compare the numeric values to find the closest times before/after your reference time (FlightInformation$time_num in the below example). Once you have the closest time before and after your reference value, figure out which is really the closest to your reference. Use that time value to look up (index) the correct temperature value and add it to your data frame.
#convert time to numeric (seconds since origin of time)
Temperature$time_num <- as.numeric(Temperature$Time)
FlightInformation$time_num <- as.numeric(FlightInformation$DepartureTime)
#make sure time data is in correct order so that indexes for time are in correct order
Temperature <- Temperature[with(Temperature, order(time_num)), ] #sort data
for (i in 1:nrow(FlightInformation)) {  # for each row of flight data...
  # Find the time in Temp that is closest + prior to the flight time:
  # build a logical vector saying which Temperature$time_num are <= FlightInformation$time_num[i],
  # pull the max row index where the value is TRUE (the closest prior time),
  # and use that row index to look up the Temperature$time_num value.
  # Returns NA (with a warning) if no time in Temp is before the flight time.
  temptime_prior <- Temperature[max(which(Temperature$time_num <= FlightInformation$time_num[i])), "time_num"]
  # Find the time in Temp that is closest + after the flight time
  # (NA/warning if no time in Temp is after the flight time).
  temptime_after <- Temperature[min(which(Temperature$time_num > FlightInformation$time_num[i])), "time_num"]
  # Compare the times before and after to see which is closest to the flight time.
  # If no before/after time was found (i.e. NA was returned), always use the other time value.
  temptime_closest <- ifelse(is.na(temptime_prior), temptime_after,
                      ifelse(is.na(temptime_after), temptime_prior,
                      ifelse((FlightInformation$time_num[i] - temptime_prior) <=
                               (temptime_after - FlightInformation$time_num[i]),
                             temptime_prior, temptime_after)))
  # Look up the right temp via the row index of the matching Temp$time_num and add it to the flight info.
  FlightInformation$Temp[i] <- Temperature[which(Temperature$time_num == temptime_closest), "Temp"]
}
# Get rid of the numeric time column; you don't need it anymore.
FlightInformation <- FlightInformation[, !(names(FlightInformation) %in% c("time_num"))]
Output
DepartureTime Temp
1 2000-01-01 03:01:00 20
2 2000-01-01 10:00:00 10
3 2000-01-01 14:00:00 10
If you have subsets of data in each data frame you need to match up to (e.g., match df1$group1 time values only to df2$group1 time values), you can use survival::neardate. It's a nice function for this that does basically what the above code does, but has some additional parameters if you need them.
Hope this helps! The code's a lot shorter without all the comments =)

Create a variable based on time difference between dates on consecutive rows

I have a very large set of data driven off of an id and a date. The dataset has several hundred million rows and about 10 million id's. I am running in a non-windows environment with ample RAM and multiple processors available. I am doing this in parallel. At the moment, I'm working with multidplyr, though am considering all options.
For illustration:
> df[1:11,]
id date gap episode
1 100000019 2015-01-24 0 1
2 100000019 2015-02-20 27 1
3 100000019 2015-03-31 39 2
4 100000019 2015-04-29 29 2
5 100000019 2015-05-27 28 2
6 100000019 2015-06-24 28 2
7 100000019 2015-07-24 30 2
8 100000019 2015-08-23 30 2
9 100000019 2015-09-21 29 2
10 100000019 2015-10-22 31 3
11 100000019 2015-12-30 69 4
The data is sorted before the function call. The order is important. For each id, after the first date, I need to determine the number of days between each subsequent date. I call this a gap. So the first date for an id gets a gap of zero; the second date gets the value of the second date minus the date in the prior row; and so on.
I am splitting the data by id, then sending the data for each id to the following function.
assign_gap <- function(x) {
  # x$gap <- NA
  for (i in 1:nrow(x)) {
    x[i, ]$gap <- ifelse(i == 1, 0, x[i, ]$date - x[i - 1, ]$date)
  }
  return(x)
}
cluster <- create_cluster(8)
cluster_assign_value(cluster, 'assign_gap', assign_gap)
system.time(df <- df %>% partition(id, cluster = cluster) %>% do(assign_gap(.)) %>% collect())
I then apply another function that groups the sequence of gaps across dates into "episodes" based on allowable_gap (I am using a value of 30). So, each id will potentially have multiple episodes assigned based on the date sequence and the gap.
assign_episode <- function(x, allowable_gap) {
  ep <- 1
  for (i in 1:nrow(x)) {
    ifelse(x[i, ]$gap <= allowable_gap, ep <- ep, ep <- ep + 1)
    x[i, ]$episode <- ep
  }
  return(x)
}
cluster <- create_cluster(8)
cluster_assign_value(cluster, 'assign_episode', assign_episode)
cluster_assign_value(cluster, 'allowable_gap', allowable_gap)
system.time(df <- df %>% partition(id, cluster = cluster) %>% do(assign_episode(., allowable_gap)) %>% collect())
Given the amount of data I have, I'd really like to find a way to avoid these loops in the functions, which I expect will improve efficiency considerably. If anyone can think of an alternative that accomplishes the same thing, I would be grateful.
I would recommend using the data.table library. This library is extremely fast, particularly if one is working with large data sets like yours. Here is a partial solution, where I solve the first step of your question:
1. Calculate the gap between dates, making sure the first row of each id is 0
library(data.table)
setDT(df)
df[, gap := c(0L, diff(date)), by = id]  # days since the previous date, within each id
Even though this is not working in parallel, I would expect this code to be faster than the loop you're currently using.
2. Assign a group episode for consecutive observations when the gap is under 30 by id
I haven't found a solution for the second part of your question yet, but I would encourage others to complement this answer if they find a solution.
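In that spirit, here is a sketch for step 2 (assuming an allowable_gap of 30, as in the question): a new episode starts exactly where a gap exceeds the allowable gap, so a grouped cumulative sum of that indicator reproduces the episode numbering shown above.

allowable_gap <- 30
df[, episode := cumsum(gap > allowable_gap) + 1L, by = id]  # 1, 1, 2, ..., 2, 3, 4 for the sample id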

How to mark the observations with given information

Consider data collected at 5-minute intervals, with a numeric variable a and a discrete variable acc that indicates whether an incident happened (0 for no incident, 1 for an incident):
a <- c(1:(288 * 4))
t <- seq(as.POSIXct("2016-01-01 00:05:00"), as.POSIXct("2016-01-05 00:00:00"), by = "5 min")
acc <- rep(0, 288 * 4)
df <- data.frame(t, a, acc)
Now I have another data set which has the times (accurate to 1 second) at which incidents happened during the collection period:
T<-sample(seq(as.POSIXct("2016-01-01 00:05:00"), as.POSIXct("2016-01-05 00:00:00"), by = '1 sec'),size = 5)
I want to mark the nearest 2 prior observations' acc as 1 according to the times in T. For example, if an incident happened at 2016-01-02 07:13:23, the observations with t of 2016-01-02 07:05:00 and 2016-01-02 07:10:00 get acc marked as 1.
How could I manage to do this?
ind <- findInterval(T, df$t)   # index of the last observation at or before each incident
df$acc[c(ind - 1L, ind)] <- 1  # mark that observation and the one before it
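A quick sanity check against the example in the question (one fixed incident time):

T1 <- as.POSIXct("2016-01-02 07:13:23")
ind1 <- findInterval(T1, df$t)
df$t[c(ind1 - 1L, ind1)]
# "2016-01-02 07:05:00" "2016-01-02 07:10:00" (printed in the local time zone)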
One way could be:
library(lubridate)
df$acc <- apply(sapply(T, function(x)
  x %within% interval(df$t - minutes(4) - seconds(59),
                      df$t + minutes(4) + seconds(59))),
  1, sum)
lubridate allows for easy manipulation of dates; minutes(x) and seconds(x) add x minutes or seconds to a period object.
interval() is used to create a time interval bounded by the time in df$t ± 4 min 59 s.
sapply() is used to check whether each time in T is within the interval.
apply() is used to collapse the results of sapply() (it outputs one column for each element in T).
If T contains a value that is exactly equal to one in df$t, such as 2016-01-04 12:05:00 CET, this will only put a 1 for that observation.
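For comparison, a base R sketch of the question's "two prior observations" rule (assumes the df and T defined above; the 600 seconds cover the two 5-minute observations at or before each incident):

hit <- outer(df$t, T, function(obs, inc) {
  d <- as.numeric(difftime(inc, obs, units = "secs"))
  d >= 0 & d < 600  # observation falls within the 10 minutes up to the incident
})
df$acc <- as.integer(rowSums(hit) > 0)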

Data frame of departure and return dates, how do I get a list of all dates away?

I'm stuck on a problem calculating travel dates. I have a data frame of departure dates and return dates.
Departure Return
1 7/6/13 8/3/13
2 7/6/13 8/3/13
3 6/28/13 8/7/13
I want to create and pass a function that will take these dates and form a list of all the days away. I can do this individually by turning each column into dates.
## Turn the departure and return dates into a readable format
dept_dates <- as.Date(travelDates$Dept, format = "%m/%d/%y")
retn_dates <- as.Date(travelDates$Retn, format = "%m/%d/%y")
travel_dates <- na.omit(data.frame(dept_dates, retn_dates))
seq(from = travel_dates[1, 1], to = travel_dates[1, 2], by = 1)
This gives me [1] "2013-07-06" "2013-07-07" ... and so on. I want to scale this up to cover the whole data frame, but my attempts have failed.
Here's one that I thought might work.
days_abroad <- data.frame()
get_days <- function(x, y) {
  all_days <- seq(from = x, to = y, by = 1)
  c(days_abroad, all_days)
  return(days_abroad)
}
get_days(travel_dates$dept_dates, travel_dates$retn_dates)
I get this error:
Error in seq.Date(from = x, to = y, by = 1) : 'from' must be of length 1
There's probably a lot wrong with this, but what I would really like help on is how to run multiple dates through seq().
Sorry if this is simple (I'm still learning to think in R), and sorry too for any breaches in etiquette. Thank you.
EDIT: updated as per OP comment.
How about this:
travel_dates[] <- lapply(travel_dates, as.Date, format="%m/%d/%y")
dts <- with(travel_dates, mapply(seq, Departure, Return, by="1 day"))
This produces a list with as many items as you had rows in your initial table. You can then summarize it (this gives a data.frame with the number of times each date showed up):
data.frame(count=sort(table(Reduce(append, dts)), decreasing=T))
# count
# 2013-07-06 3
# 2013-07-07 3
# 2013-07-08 3
# 2013-07-09 3
# ...
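If a plain vector of the distinct days away is wanted instead, a small sketch (unlist drops the Date class, so it is restored via the epoch origin):

all_days <- sort(unique(as.Date(unlist(dts), origin = "1970-01-01")))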
OLD CODE:
The following gets the number of days of each trip, rather than a list with the dates.
transform(travel_dates, days_away=Return - Departure + 1)
Which produces:
# Departure Return days_away
# 1 2013-07-06 2013-08-03 29 days
# 2 2013-07-06 2013-08-03 29 days
# 3 2013-06-28 2013-08-07 41 days
If you want to put days_away in a separate list, that is trivial, though it seems more useful to have it as an additional column to your data frame.
