Hard times with split in XTS - r

Mr. Ulrich's xts package is always phenomenal. I have always used split for the ordinary 5-, 15- and 30-minute splits. No problems ever.
Now I'm stuck.
#Setup test data
my.time <- seq(from = as.POSIXct('2000-01-01 00:00:00'),
to = as.POSIXct('2000-01-01 1:00:00'),
by = '1 sec')
my.data <- rep(10, length = length(my.time))
my.xts <- as.xts(my.data, order.by = my.time)
#Now splitting and checking endtimes of first split
tail((split(my.xts, f="minutes", k=20))[[1]])
tail((split(my.xts, f="minutes", k=30))[[1]])
#2000-01-01 00:19:59 10 #All good
#2000-01-01 00:29:59 10 #All good
tail((split(my.xts, f="minutes", k=22))[[1]])
#2000-01-01 00:11:59 10 #Hmmm, what am I missing. Expecting 00:21:59
#As endpoints is used by split I also checked this behaviour
endpoints(my.xts, on="minutes", k=20)
#[1] 0 1200 2400 3600 3601 #All good
endpoints(my.xts, on="minutes", k=30)
#[1] 0 1800 3600 3601 #All good
endpoints(my.xts, on="minutes", k=22)
#[1] 0 720 2040 3360 3601 #Hmmm
Trying to understand this I dug further into the XTS code at https://github.com/joshuaulrich/xts/blob/master/src/endpoints.c
There I found that this is supposed to be more effective
c(0,which(diff(_x%/%on%/%k+1) != 0),NROW(_x))
So I tried this
which(diff(.index(my.xts) %/% 60 %/% 20 +1) != 0)
#[1] 1200 2400 3600 #As expected
which(diff(.index(my.xts) %/% 60 %/% 21 +1) != 0)
#[1] 1080 2340 3600 #Expecting 1260 2520...
which(diff(.index(my.xts) %/% 60 %/% 22 +1) != 0)
#[1] 720 2040 3360 #Expecting 1320 2640...
which(diff(.index(my.xts) %/% 60 %/% 23 +1) != 0)
#[1] 720 2100 3480 #Expecting 1380 2760...
which(diff(.index(my.xts) %/% 60 %/% 24 +1) != 0)
#[1] 1440 2880 #As expected
which(diff(.index(my.xts) %/% 60 %/% 30 +1) != 0)
#[1] 1800 3600 #As expected
This is where my brain overheated and I posted here instead. I'm sure there is something I'm simply missing, so I haven't posted this as a bug at Github. Please help to explain what is going on. Why am I not getting the expected results?
EDIT:
So, after a quick think, I'm guessing this happens because all of these functions count from the start of Unix time, and I'm using an interval that does not divide evenly into one hour. Is this the right lead?
EDIT2:
Posted my answer below after finally understanding how endpoints and split are supposed to work...

Of course, split (and endpoints) work exactly as they are supposed to within xts, i.e. the intervals are counted from 1970-01-01 00:00:00 when deciding where to split.
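To see why k = 22 puts the first break after only 12 minutes, it may help to redo the bucketing by hand. The sketch below is base R only and assumes the series starts at 2000-01-01 00:00:00 UTC (the time zone is an assumption; the question does not state it, but UTC reproduces the posted endpoints). The variable names are illustrative.

```r
# Seconds since the epoch for one hour of 1-second data, assuming UTC.
t0 <- as.numeric(as.POSIXct("2000-01-01 00:00:00", tz = "UTC"))
secs <- t0 + 0:3600

# Whole minutes since 1970-01-01, grouped into 22-minute buckets.
bucket <- secs %/% 60 %/% 22

# Positions where the bucket changes; same expression as in the question.
breaks <- which(diff(bucket) != 0)
breaks

# How many minutes into its 22-minute bucket the series starts.
offset <- (t0 %/% 60) %% 22
offset
```

Under that assumption offset comes out as 10, so the first bucket has only 12 minutes (720 seconds) left, which is where the first break lands.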
And yes, I was misusing split as an easy way to split at an arbitrary start point of an hour within my time series xts data.
Anyhow, I solved my little issue by writing this short function which "resets" the first timestamp to 1970-01-01 00:00:00.
## continuation from code snippet above
#So, I realised it was all about resetting the first time
#to 1970-01-01 00:00:00,
#so that I could easily do my "strange" splitting.
#Please note that this is not to be used for dates,
#only when working on hours, mins or secs
#and when you don't care that you break the index.
startMovedSplit <- function(x, ...) {
  startIndex <- head(.index(x), n=1)
  .index(x) <- .index(x) - startIndex
  split(x, ...)
}
#New Try
tail((startMovedSplit(my.xts, f="minutes", k=20))[[1]])
tail((startMovedSplit(my.xts, f="minutes", k=30))[[1]])
#1970-01-01 00:19:59 10 #Good enough for my purposes
#1970-01-01 00:29:59 10 #Good enough for my purposes
tail((startMovedSplit(my.xts, f="minutes", k=22))[[1]])
#1970-01-01 00:21:59 10 #Good enough for my purposes
And the best part of all my misunderstanding of the xts library? I now know how to handle the ellipses (...) within functions and calling of subfunctions!

Related

How do I find the mid time between two times in R, sometimes overnight, sometimes same day, but I have no date?

I would like to find the mid time between two times in column SLQ300 (sleep time) and SLQ310 (wake up time) in my data frame of 6000 participants.
Added: I have already the duration in column SLD012. So if we could add half of the duration to the sleep time, it would be great.
Eg. in the first row it should be the midpoint between 23:00 and 07:00, which is 03:00.
And in the 9th row, it should be between 01:00 and 06:00, which is 03:30.
Thank you in advance!
Data frame
Try this:
time2num <- function(x) {
vapply(strsplit(x, ':'), function(y) sum(as.numeric(y) * c(60, 1)),
numeric(1), USE.NAMES=FALSE)
}
# sample data
dat <- data.frame(id=1:2, tm1=c("23:00","01:00"), tm2=c("07:00","06:00"))
# the code
dat[,c("tm1n","tm2n")] <- lapply(dat[,c("tm1","tm2")], time2num)
dat
# id tm1 tm2 tm1n tm2n
# 1 1 23:00 07:00 1380 420
# 2 2 01:00 06:00 60 360
with(dat, ifelse(tm1n > tm2n, 24*60, 0) + tm2n - tm1n)
# [1] 480 300 ### minutes
Or you can use the modulus over a full day, which wraps negative differences around midnight:
with(dat, tm2n - tm1n) %% (24*60)
(though I haven't tested it in all sorts of combinations).
And the mid time:
format(as.POSIXct(paste(Sys.Date(), dat$tm1)) +
60*with(dat, ifelse(tm1n > tm2n, 24*60, 0) + tm2n - tm1n)/2,
format="%H:%M")
# [1] "03:00" "03:30"
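A variant that stays in minutes since midnight and never builds a POSIXct value could look like the sketch below. It is not part of the answer above: to_min and mid_time are made-up helper names, and it assumes "HH:MM" input and an episode shorter than 24 hours.

```r
# Convert "HH:MM" strings to minutes since midnight.
to_min <- function(x) {
  vapply(strsplit(x, ":", fixed = TRUE),
         function(p) sum(as.numeric(p) * c(60, 1)),
         numeric(1), USE.NAMES = FALSE)
}

# Midpoint between start and end on a 24-hour clock.
mid_time <- function(start, end) {
  s <- to_min(start)
  dur <- (to_min(end) - s) %% (24 * 60)       # wraps overnight spans
  mid <- floor((s + dur / 2) %% (24 * 60))    # drop any half minute
  sprintf("%02d:%02d", as.integer(mid %/% 60), as.integer(mid %% 60))
}

res <- mid_time(c("23:00", "01:00"), c("07:00", "06:00"))
res
```

Working on plain minutes avoids any dependence on the current date or on daylight-saving changes, at the cost of truncating half minutes.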

How do I call a function using a specific time window?

Suppose I have a zoo object (or it could be a data.frame) that has an index on "time of day" and has some value (see sample data below):
val
...
2006-08-01 12:00 23
2006-08-01 12:01 24
2006-08-01 12:02 25
2006-08-01 12:03 26
2006-08-01 12:04 27
2006-08-01 12:05 28
2006-08-01 12:06 29
...
2006-08-02 12:00 123
2006-08-02 12:01 124
2006-08-02 12:02 125
2006-08-02 12:03 126
2006-08-02 12:04 127
...
I would like to call a custom function (call it custom.func(vals)) from 12:01 - 12:03 (i.e. something similar to zoo::rollapply) every time that interval occurs so in this example, daily. How would I do that?
NOTES (for robustness, it would also be great to take into account the following edge cases but not necessary):
Don't assume that I have values for 12:01 - 12:03 every day
Don't assume that the entire range 12:01 - 12:03 is present every day. Some days I might only have 12:01 and 12:02 but might be missing 12:03
What if I wanted my custom.func(vals) to be called on day boundaries like using val from 23:58 - 00:12?
Suppose our input is the POSIXct zoo object z given in the Note at the end.
Create a character vector times which has one element per element of z and is in the form HH:MM. Then create a logical ok which indicates which times are between the indicated boundary values. z[ok] is then z reduced to those values. Finally for each day apply sum (can use some other function if desired) using aggregate.zoo :
times <- format(time(z), "%H:%M")
ok <- times >= "12:01" & times <= "12:03"
aggregate(z[ok], as.Date, sum)
## 2006-08-01 2006-08-02
## 75 375
times straddle midnight
This version is for the case where the times straddle midnight. Note that the order of values sent to the function is not the original order, but if the function is symmetric that does not matter.
times <- format(time(z), "%H:%M")
ok <- times >= "23:58" | times <= "00:12"
aggregate(z[ok], (as.Date(format(time(z))) + (times >= "23:58"))[ok], sum)
## 2006-08-02
## 41
Variation
The prior code chunk works if the function is symmetric in the components of its argument (which is the case for many functions such as mean and sum), but if the function were not symmetric we would need a slightly different approach. We define to.sec, which translates an HH:MM string to numeric seconds, and subtract to.sec("23:58") from each POSIXct datetime. The components of z to keep are then those whose shifted times, converted to HH:MM character strings, are at most "00:14".
to.sec <- function(x) with(read.table(text = x, sep = ":"), 3600 * V1 + 60 * V2)
times <- format(time(z) - to.sec("23:58"), "%H:%M")
ok <- times <= "00:14"
aggregate(z[ok], as.Date(time(z)[ok] - to.sec("23:58")), sum)
## 2006-08-01
## 41
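The same shifting idea can be tried without zoo. The following is a minimal base R sketch on three made-up observations; the UTC time zone and the variable names are assumptions for illustration only.

```r
# Three observations; the first two fall in the 23:58 - 00:12 window.
tt  <- as.POSIXct(c("2006-08-01 23:58", "2006-08-02 00:01", "2006-08-02 12:00"),
                  tz = "UTC")
val <- c(20, 21, 123)

# Shift every timestamp back by 23:58 so the window becomes 00:00 - 00:14.
shift   <- 23 * 3600 + 58 * 60
shifted <- tt - shift
keep    <- format(shifted, "%H:%M", tz = "UTC") <= "00:14"

# Group by the shifted date, so both sides of midnight land in one group.
res <- tapply(val[keep], format(shifted[keep], "%Y-%m-%d", tz = "UTC"), sum)
res
```

Because the shifted times keep their original order, the values reach the function in chronological order.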
Note
Lines <- "datetime val
2006-08-01T12:00 23
2006-08-01T12:01 24
2006-08-01T12:02 25
2006-08-01T12:03 26
2006-08-01T12:04 27
2006-08-01T12:05 28
2006-08-01T12:06 29
2006-08-01T23:58 20
2006-08-02T00:01 21
2006-08-02T12:00 123
2006-08-02T12:01 124
2006-08-02T12:02 125
2006-08-02T12:03 126
2006-08-02T12:04 127"
library(zoo)
z <- read.zoo(text = Lines, tz = "", header = TRUE, format = "%Y-%m-%dT%H:%M")
EDIT
Have revised the non-symmetric code and simplified all code chunks.
I recommend the runner package, which can compute any rolling function on an irregular time series. The runner function is the equivalent of rollApply, with the distinction that its windows can depend on dates. runner applies any R function on a window of length k measured against a date index idx (or any integer). The example below calculates a regression on a 5-minute (5*60 sec) window span. The algorithm doesn't care whether there is a day change; it just computes over 5 minutes each time (for example 23:56-00:01).
Create data:
set.seed(1)
x <- cumsum(rnorm(1000))
y <- 3 * x + rnorm(1000)
time <- as.POSIXct(cumsum(sample(60:120, 1000, replace = TRUE)),
origin = Sys.Date()) # unequally spaced time series
data <- data.frame(time, y, x)
Custom function to be called on sliding windows:
library(runner)
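running_regression <- function(idx) {
  # fit only on the rows inside the current window, then keep the last fitted value
  fit <- lm(y ~ x, data = data[idx, ])
  unname(tail(predict(fit), 1))
}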
data$pred <- runner(seq_along(x),
k = 60 * 5,
idx = time,
f = running_regression)
Once we have created a dataset with the rolling 5-minute prediction, we can filter for particular windows only; here, only the first minute of each hour. This means that we always keep {hh}:56 - {hh+1}:01.
library(dplyr)
library(lubridate)
filtered <-
data %>%
filter(minute(time) == 1)
plot(data$time, data$y, type = "l", col = "red")
points(filtered$time, filtered$pred, col = "blue")
There are some other examples in the runner vignette showing how to do this.

How to calculate the number of a specific weekday between two POSIXct date arrays and return another numerical array?

I wrote a formula based on a weekday-calculating algorithm (found on Stack Exchange as well, great job guys). Here is the code snippet:
countwd <- function(start, end, day){
x <- seq(start, end, by=1)
y <- weekdays(x, TRUE)
sum(y==day)
}
x$OFFDAY <- NULL
for(i in 1:nrow(x)){
x$OFFDAY[i] <- countwd(x$PICK_DATE[i], x$SHIP_DATE[i], "Mon")
}
This is way too slow (loop proceeds like 2-4 rows per second!!!!), and I have millions of entries for each month.
Here is the vectorisation of the function:
x$OFFDAY <- countwd(x$PICK_DATE, x$SHIP_DATE, "Mon")
Shows this error:
Error in seq.POSIXt(start, end, by = 1) : 'from' must be of length 1
I cannot understand how to apply the "apply" family functions in this case as I have two vectors to compare (yes, I am really new to this).
Sample Data:
PICK_DATE SHIP_DATE
01-APR-2017 00:51 02-APR-2017 06:55 AM
01-APR-2017 00:51 02-APR-2017 12:11 PM
01-APR-2017 00:51 02-APR-2017 12:11 PM
01-APR-2017 00:51 02-APR-2017 09:39 AM
I have converted these to POSIXct, and the formula works well for individual values (it returns the second value though, no idea why; however, I can work around that):
>countwd(x$PICK_DATE[1], x$SHIP_DATE[1], "Mon")
[1] 0
An easy way to vectorize a function of multiple varying inputs is to use mapply:
mapply(countwd, x$PICK_DATE, x$SHIP_DATE, "Mon")
Or, alternatively, you can use sapply and pass a sequence of indices as the first argument (this way the syntax is very similar to a for loop):
sapply(seq_len(nrow(x)), function(i) countwd(x$PICK_DATE[i], x$SHIP_DATE[i], "Mon"))
The main inefficiency in your case, however, stems from the countwd function. Notice that you are passing POSIXt vectors to the function. Thus, when seq is called in the first line of the function, the by argument is taken to be seconds instead of days! This leads to generating needlessly large vectors (see ?seq.POSIXt for details).
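The size difference is easy to check on a single row. This small sketch uses the first row of the sample data, with the time zone fixed to UTC as an assumption so that no daylight-saving shift is involved:

```r
s <- as.POSIXct("2017-04-01 00:51", tz = "UTC")
e <- as.POSIXct("2017-04-02 06:55", tz = "UTC")

# by = 1 on POSIXct means one second per step.
n_sec <- length(seq(s, e, by = 1))

# by = "day" steps one calendar day at a time.
n_day <- length(seq(s, e, by = "day"))

c(seconds = n_sec, days = n_day)
```

For this one row the per-second sequence has 108241 elements, while the per-day sequence has 2.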
Changing countwd in the following way should greatly improve performance:
countwd <- function(start, end, day) {
x <- seq(start, end, by="day")
y <- weekdays(x, TRUE)
sum(y==day)
}
Also note that weekdays is locale-specific and may not work as intended depending on your locale settings.
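If even the per-day seq is too slow for millions of rows, the count can also be computed arithmetically, with no sequence and no locale-dependent names. The sketch below is not from the answer above: count_weekday is a hypothetical helper, it takes the weekday as a number (0 = Sunday, ..., 6 = Saturday), and it assumes you want to count whole calendar dates from the start date to the end date inclusive. That can differ from the seq-based countwd when the end time of day is earlier than the start time of day.

```r
# Count how often a weekday occurs between two dates (inclusive), vectorised.
# 1970-01-01 (day number 0) was a Thursday, i.e. weekday 4 with 0 = Sunday.
count_weekday <- function(start, end, wday) {
  a <- as.integer(as.Date(start))
  b <- as.integer(as.Date(end))
  r <- (wday - 4) %% 7                     # day numbers with that weekday are r mod 7
  pmax((b - r) %/% 7 - (a - 1 - r) %/% 7, 0)
}

pick <- as.Date(rep("2017-04-01", 4))
ship <- as.Date(c("2017-04-05", "2017-04-09", "2017-04-30", "2017-05-02"))
mondays <- count_weekday(pick, ship, 1)
mondays
```

With POSIXct input, as.Date converts using UTC unless told otherwise, so passing Date values (or setting the time zone explicitly) avoids surprises near midnight.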
Based on @demirev's answer and my comments above, here is a worked example using the improved countwd function and mapply. I put in a few helper columns using lubridate to check the solution, and changed some of the dates so that df$off_days returns values that are not zero.
library(lubridate)
df <- data.frame(pick_date = c(rep("01-APR-2017 00:51", 4)), ship_date = c("05-APR-2017 06:55", "09-APR-2017 12:11", "30-APR-2017 12:11", "02-MAY-2017 12:11"))
df$pick_date <- lubridate::dmy_hm(df$pick_date)
df$ship_date <- lubridate::dmy_hm(df$ship_date)
df$pick_day <- wday(df$pick_date, label = TRUE)
df$ship_day <- wday(df$ship_date, label = TRUE)
df$days_between <- interval(df$pick_date, df$ship_date) %/% days()
countwd <- function(start, end, day) {
x <- seq(start, end, by="day")
y <- weekdays(x, TRUE)
sum(y==day)
}
df$off_days <- mapply(countwd, df$pick_date, df$ship_date, "Mon")
df
pick_date ship_date pick_day ship_day days_between off_days
1 2017-04-01 00:51:00 2017-04-05 06:55:00 Sat Wed 4 1
2 2017-04-01 00:51:00 2017-04-09 12:11:00 Sat Sun 8 1
3 2017-04-01 00:51:00 2017-04-30 12:11:00 Sat Sun 29 4
4 2017-04-01 00:51:00 2017-05-02 12:11:00 Sat Tues 31 5

R: Efficiently subsetting dataframe based on time of day

I have a large (150,000x7) dataframe that I intend to use for back-testing and real-time analysis of a financial market. The data represents the condition of an investment vehicle at 5 minute intervals (although holes do exist). It looks like this (but much longer):
pTime Time Price M1 M2 M3 M4
1 1212108300 20:45:00 1.5518 12.21849 -0.37125 4.50549 -31.00559
2 1212108900 20:55:00 1.5516 11.75350 -0.81792 -1.53846 -32.12291
3 1212109200 21:00:00 1.5512 10.75070 -1.47438 -8.24176 -34.35754
4 1212109500 21:05:00 1.5514 10.23529 -1.06044 -8.46154 -33.24022
5 1212109800 21:10:00 1.5514 9.74790 -1.02759 -10.21978 -33.24022
6 1212110100 21:15:00 1.5513 9.31092 -1.17076 -11.97802 -33.79888
7 1212110400 21:20:00 1.5512 8.84034 -1.28428 -13.62637 -34.35754
8 1212110700 21:25:00 1.5509 8.07843 -1.63715 -18.24176 -36.03352
9 1212111000 21:30:00 1.5509 7.39496 -1.49198 -20.65934 -36.03352
10 1212111300 21:35:00 1.5512 7.65266 -1.03717 -18.57143 -34.35754
The data is pre-loaded into R, but during my back-test I need to subset it by two criteria:
The first criterion is a sliding window to avoid peeking into the future. The window must be such that each new 5-minute interval in the back-test shifts the whole window into the future by 5 minutes. This part I can do like this:
require(zoo)
zooser <- zoo(x=tser$Close, order.by=as.POSIXct(tser$pTime, origin="1970-01-01"))
window(zooser, start=A, end=B)
The second criterion is another sliding window, but one that slides through time of day and contains only those entries that are within N minutes of the input time on any given day.
Example: If the window's size is 2 hours, and the input time is 12:00PM then the window must contain all rows with Time between 10:00AM and 2:00PM
This is the part that I am having trouble figuring out.
Edit: My data has holes in it, two consecutive rows could be MORE than 5 minutes apart. The data looks like this (very zoomed in)
As the window moves through these gaps the number of points inside the windows should vary.
The following is my MySQL code that does what I want to do in R (same table structure):
SET @qTime = Time(FROM_UNIXTIME(SAMP_endTime));
SET @inc = -1;
INSERT INTO MetIndListBuys (pTime,ArrayPos,M1,M2,M3,M4)
SELECT pTime,@inc:=@inc+1,M1,M2,M3,M4
FROM mergebuys USE INDEX (`y`) WHERE pTime BETWEEN SAMP_startTime AND SAMP_endTime
AND TIME_TO_SEC(TIMEDIFF(Time,@qTime))/3600 BETWEEN 0-HourSpan AND HourSpan
;
Say that you have your target time t0 on the same scale as pTime: seconds since epoch. Then pTime - t0 = (difference in the number of whole days since epoch between the two, in seconds) + (difference in the remaining seconds of the day). Taking (pTime - t0) %% (number of seconds per day) leaves us with the difference in seconds in clock arithmetic (wrapped around if the difference is negative). This suggests the following function:
SecondsPerDay <- 24 * 60 * 60
within <- function(d, t0Sec, wMin) {
diff <- (d$pTime - t0Sec) %% SecondsPerDay
wSec <- 60 * wMin
return(d[diff < wSec | diff > (SecondsPerDay - wSec), ])
}
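A quick way to convince yourself that the wrap-around behaves as intended is to run the same idea on a few hand-picked timestamps. In this sketch the function is renamed to near_time_of_day (a made-up name, chosen so that base::within is not masked), the window bounds are made inclusive, and the data frame is invented for illustration.

```r
seconds_per_day <- 24 * 60 * 60

# Keep rows whose time of day is within w_min minutes of t0_sec's time of day.
near_time_of_day <- function(d, t0_sec, w_min) {
  delta <- (d$pTime - t0_sec) %% seconds_per_day   # always in [0, 86400)
  w_sec <- 60 * w_min
  d[delta <= w_sec | delta >= seconds_per_day - w_sec, , drop = FALSE]
}

# Times of day: 00:00, 00:10, 12:00, 23:55 on day 0 and 00:05 on day 1.
d <- data.frame(pTime = c(0, 600, 43200, 86100, 86400 + 300))
res <- near_time_of_day(d, t0_sec = 0, w_min = 5)
res$pTime
```

The rows at 23:55 and at 00:05 on the following day are both kept for a target of 00:00, which is the behaviour across midnight that the modulus is meant to provide.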
1) If DF is the data frame shown in the question, then create a zoo object from it as you have done and split it into days, giving zs. Then lapply your function f to each successive set of w points in each component (i.e. in each day). For example, if you want to apply your function to 2 hours of data at a time and your data is regularly spaced 5-minute data, then w = 24 (since there are 24 five-minute periods in two hours). In such a case f would be passed 24 rows of data as a matrix each time it is called. Also, align has been set to "right" below, but it can alternatively be set to align="center" and the condition giving ix can be changed to double sided, etc. For more on rollapply see: ?rollapply
library(zoo)
z <- zoo(DF[-2], as.POSIXct(DF[,1], origin = "1970-01-01"))
zs <- split(z, as.Date(time(z))) # one zoo object per day
w <- 3 # replace this with 24 to handle two hours at a time with five min data
f <- function(x) {
  tt <- x[, 1]
  ix <- tt[w] - tt <= w * 5 * 60 # RHS converts w to seconds
  x <- x[ix, -1]
  sum(x) # replace sum with your function
}
out <- lapply(zs, rollapply, w, f, by.column = FALSE, align = "right")
Using the data frame in the question we get this:
> out
$`2008-05-30`
2008-05-30 02:00:00 2008-05-30 02:05:00 2008-05-30 02:10:00 2008-05-30 02:15:00
-66.04703 -83.92148 -95.93558 -100.24924
2008-05-30 02:20:00 2008-05-30 02:25:00 2008-05-30 02:30:00 2008-05-30 02:35:00
-108.15038 -121.24519 -134.39873 -140.28436
By the way, be sure to read this post .
2) This could alternately be done as the following where w and f are as above:
n <- nrow(DF)
m <- as.matrix(DF[-2])
sapply(w:n, function(i) { m <- m[seq(length = w, to = i), ]; f(m) })
Replace the sapply with lapply if needed. Also, this may seem shorter than the first solution, but it's not much different once you add the code to define f and w (which appear in the first but not the second).
If there are no holes during the day and only holes between days then these solutions could be simplified.

Split a duration of an interval in calendar days

I have the following data set which shows the start and the end of an episode (date and time)
ep <- data.frame(start=c("2009-07-13 23:45:00", "2009-08-14 08:30:00",
"2009-09-16 15:30:00"),
end=c("2009-07-14 00:03:00", "2009-08-15 08:35:00",
"2009-09-19 07:30:00"))
I need to convert it into a data frame which would show in each calendar day how many minutes of episodes there were. For the above example it would be:
2009-07-13 15
2009-07-14 3
2009-08-14 930
2009-08-15 515
2009-09-16 510
2009-09-17 1440
2009-09-18 1440
2009-09-19 450
I appreciate any help
This works, but seems slightly inelegant. First, create a vector that is a sequence of times between each start and end time by minutes:
tmp <- do.call(c, apply(ep, 1,
function(x) head(seq(from = as.POSIXct(x[1]),
to = as.POSIXct(x[2]),by = "mins"),
-1)))
We use head(...., -1) to remove the last minute from each sequence so that the minutes match what you wanted.
Next, split this vector into minutes occurring on individual days, and count how many minutes there are per day:
tmp <- sapply(split(tmp, format(tmp, format = "%Y-%m-%d")), length)
Note that, for some reason (probably time-zone related), we can't just use as.Date(tmp) to get a vector of dates; we need to explicitly format the times to show only the date parts.
The final step is to arrange the tmp object that contains everything we need into the format you requested:
mins <- data.frame(Date = names(tmp), Minutes = tmp, row.names = NULL)
This gives:
> mins
Date Minutes
1 2009-07-13 15
2 2009-07-14 3
3 2009-08-14 930
4 2009-08-15 515
5 2009-09-16 510
6 2009-09-17 1440
7 2009-09-18 1440
8 2009-09-19 450
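For long episodes, building one element per minute can get large. An alternative is to compute, for each calendar day an episode touches, the overlap between the episode and that day. The sketch below is not the method used above: minutes_per_day is a hypothetical helper, and it assumes all times are in UTC so that every day has exactly 86400 seconds.

```r
# Minutes of one episode falling on each calendar day it touches.
minutes_per_day <- function(start, end, tz = "UTC") {
  start <- as.POSIXct(start, tz = tz)
  end   <- as.POSIXct(end, tz = tz)
  days  <- seq(as.Date(start, tz = tz), as.Date(end, tz = tz), by = "day")
  day_start <- as.numeric(as.POSIXct(format(days), tz = tz))
  day_end   <- day_start + 86400            # assumes no daylight-saving shifts
  overlap <- pmin(as.numeric(end), day_end) - pmax(as.numeric(start), day_start)
  data.frame(Date = format(days), Minutes = overlap / 60)
}

ep <- data.frame(start = c("2009-07-13 23:45:00", "2009-08-14 08:30:00",
                           "2009-09-16 15:30:00"),
                 end   = c("2009-07-14 00:03:00", "2009-08-15 08:35:00",
                           "2009-09-19 07:30:00"),
                 stringsAsFactors = FALSE)

pieces <- do.call(rbind, unname(Map(minutes_per_day, ep$start, ep$end)))
# Sum per day in case several episodes touch the same date.
res <- aggregate(Minutes ~ Date, data = pieces, FUN = sum)
res
```

An episode that ends exactly at midnight would produce a row with 0 minutes for the following day, which can be dropped with res[res$Minutes > 0, ] if that matters.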
