Converting dates to hours in R

I have a start date and an end date for each individual, and I need to work out whether the time from start to end is within 2 days or is 3 or more days. These dates are assigned to record IDs. How can I filter the records that ended within 2 days of the start date and the ones that ended 3 or more days later?
Record_id <- c("2245","6728","5122","9287")
Start <- c("2021-01-13 CST" ,"2021-01-21 CST" ,"2021-01-17 CST","2021-01-13 CST")
End <- c("2021-01-21 18:00:00 CST", "2021-01-22 16:00:00 CST", "2021-01-22 13:00:00 CST","2021-01-25 15:00:00 CST")
I tried using
elapsed.time <- DF$Start %--% DF$End
time.duration <- as.duration(elapsed.time)
but I am getting an error because the End dates also contain a time of day. Thank you.

Here's a dplyr pipe that will include both constraints (2 and 3 days):
df %>%
  mutate(across(Start:End, as.POSIXct)) %>%
  mutate(d = difftime(End, Start, units = "days")) %>%
  filter(!between(difftime(End, Start, units = "days"), 2, 3))
# # A tibble: 4 x 4
# Record_id Start End d
# <chr> <dttm> <dttm> <drtn>
# 1 2245 2021-01-13 00:00:00 2021-01-21 18:00:00 8.750000 days
# 2 6728 2021-01-21 00:00:00 2021-01-22 16:00:00 1.666667 days
# 3 5122 2021-01-17 00:00:00 2021-01-22 13:00:00 5.541667 days
# 4 9287 2021-01-13 00:00:00 2021-01-25 15:00:00 12.625000 days
I included mutate(d = ...) so that we can see what the actual differences are. If you instead want to keep only the rows inside the 2-3 day window (i.e. remove the others), use filter(between(...)) (without the !).
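For example, a minimal variant of the pipe above that keeps only the 2-3 day rows (with this particular data it returns zero rows, as noted below):
df %>%
  mutate(across(Start:End, as.POSIXct)) %>%
  filter(between(difftime(End, Start, units = "days"), 2, 3))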
In the case of the data you provided, all observations are less than 2 or more than 3 days. I'll expand this range so that we can see it in effect:
df %>%
  mutate(across(Start:End, as.POSIXct)) %>%
  mutate(d = difftime(End, Start, units = "days")) %>%
  filter(!between(difftime(End, Start, units = "days"), 1, 6))
# # A tibble: 2 x 4
# Record_id Start End d
# <chr> <dttm> <dttm> <drtn>
# 1 2245 2021-01-13 00:00:00 2021-01-21 18:00:00 8.750 days
# 2 9287 2021-01-13 00:00:00 2021-01-25 15:00:00 12.625 days
Data
df <- structure(list(Record_id = c("2245", "6728", "5122", "9287"), Start = c("2021-01-13 CST", "2021-01-21 CST", "2021-01-17 CST", "2021-01-13 CST"), End = c("2021-01-21 18:00:00 CST", "2021-01-22 16:00:00 CST", "2021-01-22 13:00:00 CST", "2021-01-25 15:00:00 CST")), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))

I just converted the character columns to date-times with lubridate and then subtracted the dates. What you get back are days. I then filter for records that ended within 2 days.
Record_id <- c("2245", "6728", "5122", "9287")
Start <- c("2021-01-13 CST", "2021-01-21 CST", "2021-01-17 CST", "2021-01-13 CST")
End <- c("2021-01-21 18:00:00 CST", "2021-01-22 16:00:00 CST", "2021-01-22 13:00:00 CST", "2021-01-25 15:00:00 CST")
df <- dplyr::tibble(x = Record_id, y = Start, z = End)
df %>%
  dplyr::mutate_at(vars(y:z), ~ lubridate::as_datetime(.)) %>%
  dplyr::mutate(diff = as.numeric(z - y)) %>%
  dplyr::filter(diff <= 2)
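One caveat with as.numeric(z - y): the units of the resulting difftime depend on the size of the gap (days here, but possibly hours or minutes for shorter gaps). A hedged variant of the same pipe that pins the units explicitly:
df %>%
  dplyr::mutate_at(vars(y:z), ~ lubridate::as_datetime(.)) %>%
  dplyr::mutate(diff = as.numeric(difftime(z, y, units = "days"))) %>%
  dplyr::filter(diff <= 2)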


Find the overlap between two timestamps in R to assign shifts

Problem
Currently, I have a large flight crew schedule dataset, with a start and end time, where my goal is to identify whether an employee was working a night shift. A night shift is defined as any portion of the shift between 01:00:00 and 05:59:59. I have looked at functions such as %overlaps%, but these seem not to work for only timestamps. Some sample data (in UTC-tz):
library(lubridate)
df <- data.frame(start = ymd_hms(c("2018-09-19 23:30:00", "2018-09-19 17:00:00", "2018-09-22 04:30:00")),
                 end = ymd_hms(c("2018-09-20 07:05:00", "2018-09-19 21:00:00", "2018-09-22 12:00:00")))
Solution
Ideally, I would like to get the following output, with a Boolean variable indicating whether the employee worked a night shift:
start end night.shift
2018-09-19 23:30:00 | 2018-09-20 07:05:00 | TRUE
2018-09-19 17:00:00 | 2018-09-19 21:00:00 | FALSE
2018-09-22 04:30:00 | 2018-09-22 12:00:00 | TRUE
Thanks in advance!
Using seq.POSIXt
transform(df, night.shift = mapply(\(x, y) any(
  as.POSIXct(outer(as.Date(c(x, y)), c('01:00:00', '05:59:59'), paste), tz = 'GMT') %in%
    seq.POSIXt(x, y, by = 'sec')),
  start, end))
# start end night.shift
# 1 2018-09-19 23:30:00 2018-09-20 07:05:00 TRUE
# 2 2018-09-19 17:00:00 2018-09-19 21:00:00 FALSE
# 3 2018-09-22 04:30:00 2018-09-22 12:00:00 TRUE
or, almost twice as fast, %inrange% from the data.table package.
library(data.table)
transform(df, night.shift = mapply(\(x, y) any(
  as.POSIXct(outer(as.Date(c(x, y)), c('01:00:00', '05:59:59'), paste), tz = 'GMT') %inrange%
    c(x, y)),
  start, end))
# start end night.shift
# 1 2018-09-19 23:30:00 2018-09-20 07:05:00 TRUE
# 2 2018-09-19 17:00:00 2018-09-19 21:00:00 FALSE
# 3 2018-09-22 04:30:00 2018-09-22 12:00:00 TRUE
You can use interval() or %--% to create an Interval object and int_overlaps() to test if two intervals overlap.
library(dplyr)
library(lubridate)
df %>%
  mutate(
    night.shift = int_overlaps(
      (date(start) + hms("01:00:00")) %--% (date(start) + hms("05:59:59")),
      start %--% end
    ) | int_overlaps(
      (date(end) + hms("01:00:00")) %--% (date(end) + hms("05:59:59")),
      start %--% end
    )
  )
Another way is using %within% to check whether a date-time object falls within an interval.
df %>%
  rowwise() %>%
  mutate(
    night.shift = any(outer(date(c(start, end)), hms(c("01:00:00", "05:59:59")), `+`) %within% (start %--% end))
  ) %>%
  ungroup()
Output
# # A tibble: 4 × 3
# start end night.shift
# <dttm> <dttm> <lgl>
# 1 2018-09-19 23:30:00 2018-09-20 07:05:00 TRUE
# 2 2018-09-19 17:00:00 2018-09-19 21:00:00 FALSE
# 3 2018-09-22 04:30:00 2018-09-22 12:00:00 TRUE
# 4 2018-09-22 04:30:00 2018-09-23 00:30:00 TRUE
Reference
Utilities for creation and manipulation of Interval objects
This is super janky and not optimized, but it works (and was fun to figure out). You'll want to vectorize it if possible.
library(lubridate)
df <- data.frame(start = ymd_hms(c("2018-09-19 23:30:00", "2018-09-19 17:00:00", "2018-09-22 04:30:00")),
                 end = ymd_hms(c("2018-09-20 07:05:00", "2018-09-19 21:00:00", "2018-09-22 12:00:00")))
night <- interval(hms::as_hms(3600), hms::as_hms(21599), tz = "UTC")
print(night)
for (i in 1:3) {
  s = df$start[i]
  f = df$end[i]
  start_seconds = hms::as_hms(60 * 60 * hour(s) + 60 * minute(s) + second(s))
  end_seconds = hms::as_hms(60 * 60 * hour(f) + 60 * minute(f) + second(f))
  interval <- interval(start_seconds, end_seconds, tz = "UTC")
  t <- int_overlaps(night, interval)
  print(t)
}
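Since interval() and int_overlaps() are themselves vectorized, a hedged vectorized sketch of the same idea (assuming interval() accepts the hms clock values exactly as in the loop above):
start_clock <- hms::as_hms(60 * 60 * hour(df$start) + 60 * minute(df$start) + second(df$start))
end_clock <- hms::as_hms(60 * 60 * hour(df$end) + 60 * minute(df$end) + second(df$end))
df$night.shift <- int_overlaps(night, interval(start_clock, end_clock, tz = "UTC"))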

do() superseded! Alternative is to use across(), nest_by(), and summarise, how?

I'm doing something quite simple. Given a dataframe of start dates and end dates for specific periods I want to expand/create a full sequence for each period binned by week (with the factor for each row), then output this in a single large dataframe.
For instance:
library(tidyverse)
library(lubridate)
# Dataset
start_dates = ymd_hms(c("2019-05-08 00:00:00",
                        "2020-01-17 00:00:00",
                        "2020-03-03 00:00:00",
                        "2020-05-28 00:00:00",
                        "2020-12-10 00:00:00",
                        "2021-05-07 00:00:00",
                        "2022-01-04 00:00:00"), tz = "UTC")
end_dates = ymd_hms(c("2019-10-24 00:00:00",
                      "2020-03-03 00:00:00",
                      "2020-05-28 00:00:00",
                      "2020-12-10 00:00:00",
                      "2021-05-07 00:00:00",
                      "2022-01-04 00:00:00",
                      "2022-01-19 00:00:00"), tz = "UTC")
df1 = data.frame(studying = paste0("period", 1:7), start_dates, end_dates)
It was suggested to me to use do(), which currently works fine, but I hate relying on things that are superseded. I also have a way of doing it using map2(). However, the documentation (https://dplyr.tidyverse.org/reference/do.html) suggests you can use nest_by(), across() and summarise() to do the same job as do(); how would I go about getting the same result? I've tried a lot of things but I just can't seem to get it.
# do() way to do it
df1 %>%
  group_by(studying) %>%
  do(data.frame(week = seq(.$start_dates, .$end_dates, by = "1 week")))
# transmute() way to do it
df1 %>%
  transmute(weeks = map2(start_dates, end_dates, seq, by = "1 week"), studying) %>%
  unnest(cols = c(weeks))
As the documentation of ?do suggests, we can now use summarise and replace the . with across():
library(tidyverse)
library(lubridate)
df1 %>%
  group_by(studying) %>%
  summarise(week = seq(across()$start_dates,
                       across()$end_dates,
                       by = "1 week"))
#> `summarise()` has grouped output by 'studying'. You can override using the
#> `.groups` argument.
#> # A tibble: 134 x 2
#> # Groups: studying [7]
#> studying week
#> <chr> <dttm>
#> 1 period1 2019-05-08 00:00:00
#> 2 period1 2019-05-15 00:00:00
#> 3 period1 2019-05-22 00:00:00
#> 4 period1 2019-05-29 00:00:00
#> 5 period1 2019-06-05 00:00:00
#> 6 period1 2019-06-12 00:00:00
#> 7 period1 2019-06-19 00:00:00
#> 8 period1 2019-06-26 00:00:00
#> 9 period1 2019-07-03 00:00:00
#> 10 period1 2019-07-10 00:00:00
#> # … with 124 more rows
Created on 2022-01-19 by the reprex package (v0.3.0)
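Note that in dplyr >= 1.1.0, returning more than one row per group from summarise() is itself deprecated in favour of reframe(); a hedged equivalent of the pipe above:
df1 %>%
  reframe(week = seq(start_dates, end_dates, by = "1 week"), .by = studying)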
You can also use tidyr::complete:
df1 %>%
  group_by(studying) %>%
  complete(start_dates = seq(from = start_dates, to = end_dates, by = "1 week")) %>%
  select(-end_dates, weeks = start_dates)
# A tibble: 134 x 2
# Groups: studying [7]
studying weeks
<chr> <dttm>
1 period1 2019-05-08 00:00:00
2 period1 2019-05-15 00:00:00
3 period1 2019-05-22 00:00:00
4 period1 2019-05-29 00:00:00
5 period1 2019-06-05 00:00:00
6 period1 2019-06-12 00:00:00
7 period1 2019-06-19 00:00:00
8 period1 2019-06-26 00:00:00
9 period1 2019-07-03 00:00:00
10 period1 2019-07-10 00:00:00
# ... with 124 more rows
Although marked Experimental, the help file for group_modify does say that
‘group_modify()’ is an evolution of ‘do()’
and, in fact, the code for the example in the question using group_modify is nearly the same as with do.
# with group_modify
df2 <- df1 %>%
  group_by(studying) %>%
  group_modify(~ data.frame(week = seq(.$start_dates, .$end_dates, by = "1 week")))
# with do
df0 <- df1 %>%
  group_by(studying) %>%
  do(data.frame(week = seq(.$start_dates, .$end_dates, by = "1 week")))
identical(df2, df0)
## [1] TRUE
Not sure if this is exactly what you are looking for, but here is my attempt with rowwise() and unnest():
df1 %>%
  rowwise() %>%
  mutate(week = list(seq(start_dates, end_dates, by = "1 week"))) %>%
  select(studying, week) %>%
  unnest(cols = c(week))
Another approach:
library(tidyverse)
df1 %>%
  group_by(studying) %>%
  summarise(df = tibble(weeks = seq(start_dates, end_dates, by = 'week'))) %>%
  unnest(df)
#> `summarise()` has grouped output by 'studying'. You can override using the `.groups` argument.
#> # A tibble: 134 × 2
#> # Groups: studying [7]
#> studying weeks
#> <chr> <dttm>
#> 1 period1 2019-05-08 00:00:00
#> 2 period1 2019-05-15 00:00:00
#> 3 period1 2019-05-22 00:00:00
#> 4 period1 2019-05-29 00:00:00
#> 5 period1 2019-06-05 00:00:00
#> 6 period1 2019-06-12 00:00:00
#> 7 period1 2019-06-19 00:00:00
#> 8 period1 2019-06-26 00:00:00
#> 9 period1 2019-07-03 00:00:00
#> 10 period1 2019-07-10 00:00:00
#> # … with 124 more rows
Created on 2022-01-20 by the reprex package (v2.0.1)
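For completeness, a hedged sketch using nest_by(), which the question also asked about; it leans on the documented nest_by() pattern where the nested tibble is available as data inside rowwise verbs:
df1 %>%
  nest_by(studying) %>%
  mutate(week = list(seq(data$start_dates, data$end_dates, by = "1 week"))) %>%
  ungroup() %>%
  select(studying, week) %>%
  unnest(week)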

How to round datetime to nearest time of day, preferably vectorized?

Say I have a POSIXct vector like
timestamps = seq(as.POSIXct("2021-01-23"), as.POSIXct("2021-01-24"), length.out = 6)
I would like to round these times up to the nearest hour of the day in a vector:
hours_of_day = c(6, 14, 20)
i.e., the following result:
timestamps result
1 2021-01-23 00:00:00 2021-01-23 02:00:00
2 2021-01-23 04:48:00 2021-01-23 14:00:00
3 2021-01-23 09:36:00 2021-01-23 14:00:00
4 2021-01-23 14:24:00 2021-01-23 20:00:00
5 2021-01-23 19:12:00 2021-01-23 20:00:00
6 2021-01-24 00:00:00 2021-01-24 02:00:00
Is there a vectorized solution to this (or otherwise fast)? I have a few million timestamps and need to apply it for several hours_of_day.
One way to simplify this problem is to (1) find the next hours_of_day for each lubridate::hour(timestamps) and then (2) result = lubridate::floor_date(timestamps, "day") + next_hour_of_day * 3600. But how do I do step 1 vectorized?
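A sketch of one vectorized take on step 1 with findInterval(), assuming hours_of_day is sorted and "next" means the first cutoff strictly after the timestamp's clock time (DST ignored, since whole seconds are added):
h <- lubridate::hour(timestamps) + lubridate::minute(timestamps) / 60 +
  lubridate::second(timestamps) / 3600
idx <- findInterval(h, hours_of_day) + 1   # index of the first cutoff after h
wrap <- idx > length(hours_of_day)         # nothing later today: roll over to tomorrow
next_hour_of_day <- ifelse(wrap, hours_of_day[1] + 24,
                           hours_of_day[pmin(idx, length(hours_of_day))])
result <- lubridate::floor_date(timestamps, "day") + next_hour_of_day * 3600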
Convert with as.POSIXlt, which lets you extract hours, minutes and seconds and calculate decimal hours. In an lapply/sapply combination, first check where these are less than or equal to the hours-of-day vector and pick the first matching hour with which.max. Then create a new date-time using ISOdate, and add one day wherever the new date-time is earlier than the original time.
timestamps <- as.POSIXlt(timestamps)
h <- hours_of_day[sapply(lapply(with(timestamps, hour + min/60 + sec/3600),
                                `<=`, hours_of_day), which.max)]
r <- with(timestamps, ISOdate(1900 + year, mon + 1, mday, h,
                              tz = attr(timestamps, "tzone")[[1]]))
r[r < timestamps] <- r[r < timestamps] + 86400
Result
r
# [1] "2021-01-23 06:00:00 CET" "2021-01-23 06:00:00 CET"
# [3] "2021-01-23 14:00:00 CET" "2021-01-23 20:00:00 CET"
# [5] "2021-01-23 20:00:00 CET" "2021-01-24 06:00:00 CET"
# [7] "2021-01-25 06:00:00 CET" "2021-01-27 20:00:00 CET"
data.frame(timestamps, r)
# timestamps r
# 1 2021-01-23 00:00:00 2021-01-23 06:00:00
# 2 2021-01-23 04:48:00 2021-01-23 06:00:00
# 3 2021-01-23 09:36:00 2021-01-23 14:00:00
# 4 2021-01-23 14:24:00 2021-01-23 20:00:00
# 5 2021-01-23 19:12:00 2021-01-23 20:00:00
# 6 2021-01-24 00:00:00 2021-01-24 06:00:00
# 7 2021-01-24 23:59:00 2021-01-25 06:00:00
# 8 2021-01-27 20:00:00 2021-01-27 20:00:00
Note: I've added "2021-01-24 23:59:00 CET" to timestamps to demonstrate the date change.
Benchmark
Tested on a length 1.4e6 vector.
# Unit: seconds
# expr min lq mean median uq max neval cld
# POSIX() 32.96197 33.06495 33.32104 33.16793 33.50057 33.83321 3 a
# lubridate() 47.36412 47.57762 47.75280 47.79113 47.94715 48.10316 3 b
Data:
timestamps <- structure(c(1611356400, 1611373680, 1611390960, 1611408240, 1611425520,
1611442800, 1611529140, 1611774000), class = c("POSIXct", "POSIXt"
))
hours_of_day <- c(6, 14, 20)
I would extract the hour component, use cut to bin it, and assign the binned hours back to the original:
hours_of_day = c(2, 14, 20)
library(lubridate)
library(magrittr)  ## just for the pipe
new_hours = timestamps %>%
  hour %>%
  cut(breaks = c(0, hours_of_day), labels = hours_of_day, include.lowest = TRUE) %>%
  as.character() %>%
  as.integer()
result = floor_date(timestamps, "hour")
hour(result) = new_hours
result
# [1] "2021-01-23 02:00:00 EST" "2021-01-23 14:00:00 EST" "2021-01-23 14:00:00 EST"
# [4] "2021-01-23 14:00:00 EST" "2021-01-23 20:00:00 EST" "2021-01-24 02:00:00 EST"
Building on the approach by @jay.sf, I made a function for flooring as well, while adding support for NA values.
floor_date_to = function(timestamps, hours_of_day) {
  # Handle NA with a temporary filler so the code below doesn't break
  na_timestamps = is.na(timestamps)
  timestamps[na_timestamps] = as.POSIXct("9999-12-31")
  # Proceed as usual
  timestamps = as.POSIXlt(timestamps)
  hours_of_day = rev(hours_of_day)  # floor-specific: because which.max returns the first index by default
  nearest_hour = hours_of_day[sapply(lapply(with(timestamps, hour + min/60 + sec/3600), `<`, hours_of_day),
                                     function(x) which.max(-x))]  # floor-specific: negative which.max()
  rounded = with(timestamps, ISOdate(1900 + year, mon + 1, mday, nearest_hour, tz = attr(timestamps, "tzone")[1]))
  rounded[rounded > timestamps] = rounded[rounded > timestamps] - 86400  # floor: use minus
  rounded[na_timestamps] = NA  # overwrite the filler with NA again
  return(rounded)
}

Is there a function for checking whether time intervals in a single column overlap, by group, in R

I have a large dataset and I'm trying to find where time intervals overlap by group. To complicate things further I'm hoping that the code could be integrated with the 'dplyr' group_by function so the overlapping times don't get confused with other ids.
I've tried using the overlapping function "int_overlaps(int1, int2)" from "lubridate", but this doesn't work for one column. Any other overlapping functions appear to not work with time intervals.
library(lubridate)
id <- c(1, 1, 1, 2, 2)
start <- as.POSIXct(c("2017-06-27 09:30:00", "2017-06-27 15:30:00",
                      "2017-06-27 14:30:00", "2017-06-28 09:30:00", "2017-06-28 15:00:00"), tz = "UTC")
end <- as.POSIXct(c("2017-06-27 10:30:00", "2017-06-27 17:30:00",
                    "2017-06-27 18:30:00", "2017-06-28 10:30:00", "2017-06-28 16:00:00"), tz = "UTC")
inter1 <- interval(start, end, tz = "UTC")
df <- data.frame(id, inter1)
overlap <- c(FALSE, TRUE, TRUE, FALSE, FALSE)
new_df <- data.frame(id, inter1, overlap)
The sample data doesn't have any overlapping periods. The following change was made:
start <- as.POSIXct(c("2017-06-27 09:30:00", "2017-06-27 15:30:00",
                      "2017-06-27 14:30:00", "2017-06-28 09:30:00", "2017-06-28 15:00:00"), tz = "UTC")
Using lead will return NA if it is the last record in a group
library(dplyr)
new_df %>%
  group_by(id) %>%
  arrange(int_start(inter1), .by_group = TRUE) %>%
  mutate(overlap2 = lead(int_start(inter1)) < int_end(inter1))
# A tibble: 5 x 4
# Groups: id [2]
id inter1 overlap overlap2
<dbl> <Interval> <lgl> <lgl>
1 1 2017-06-27 09:30:00 UTC--2017-06-27 10:30:00 UTC FALSE FALSE
2 1 2017-06-27 14:30:00 UTC--2017-06-28 18:30:00 UTC TRUE TRUE
3 1 2017-06-27 15:30:00 UTC--2017-06-27 17:30:00 UTC TRUE NA
4 2 2017-06-28 09:30:00 UTC--2017-06-28 10:30:00 UTC FALSE FALSE
5 2 2017-06-28 15:00:00 UTC--2017-06-28 16:00:00 UTC FALSE NA
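If you prefer FALSE instead of NA for the last record in each group, a hedged tweak is to wrap the comparison in coalesce():
new_df %>%
  group_by(id) %>%
  arrange(int_start(inter1), .by_group = TRUE) %>%
  mutate(overlap2 = coalesce(lead(int_start(inter1)) < int_end(inter1), FALSE))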
If needing to compare each row to all rows within the group
library(tidyverse)
new_df %>%
  group_by(id) %>%
  arrange(int_start(inter1), .by_group = TRUE) %>%
  mutate(overlap2 = map_int(inter1, ~ sum(int_overlaps(.x, inter1))) > 1)
# A tibble: 5 x 4
# Groups: id [2]
id inter1 overlap overlap2
<dbl> <Interval> <lgl> <lgl>
1 1 2017-06-27 09:30:00 UTC--2017-06-27 10:30:00 UTC FALSE FALSE
2 1 2017-06-27 14:30:00 UTC--2017-06-28 18:30:00 UTC TRUE TRUE
3 1 2017-06-27 15:30:00 UTC--2017-06-27 17:30:00 UTC TRUE TRUE
4 2 2017-06-28 09:30:00 UTC--2017-06-28 10:30:00 UTC FALSE FALSE
5 2 2017-06-28 15:00:00 UTC--2017-06-28 16:00:00 UTC FALSE FALSE
1) sqldf Assuming you only want to overlap the times and not the dates, replace inter1 with start and end as well as the starting and ending times, time1 and time2, giving new_df1. Then do a self join on id and the overlap condition, grouping by rowid. overlap is TRUE if the count of matching rows exceeds 1 (each row always matches itself, so that one match does not count).
library(dplyr)
library(lubridate)
library(sqldf)
new_df1 <- new_df %>%
  mutate(
    start = int_start(inter1),
    end = int_end(inter1),
    time1 = sub(".* ", "", start),
    time2 = sub(".* ", "", end),
    inter1 = NULL
  )
sqldf("select a.id, a.start, a.end, count(*) > 1 as overlap
       from new_df1 a
       join new_df1 b on a.id = b.id and
         (a.time1 between b.time1 and b.time2 or b.time1 between a.time1 and a.time2)
       group by a.rowid")
giving:
id start end overlap
1 1 2017-06-27 05:30:00 2017-06-27 06:30:00 FALSE
2 1 2017-06-27 11:30:00 2017-06-27 13:30:00 TRUE
3 1 2017-06-28 10:30:00 2017-06-28 14:30:00 TRUE
4 2 2017-06-28 05:30:00 2017-06-28 06:30:00 FALSE
5 2 2017-06-28 11:00:00 2017-06-28 12:00:00 FALSE
2) This forms the full ni x ni join for each id i and then filters it down and groups it in a second and third step, whereas (1) does all of this at once, so depending on the SQL optimizations applied by the database software, (1) might be much more efficient. Anyway, this joins on id, then filters on the overlap condition, and finally does the counting. new_df1 is from (1).
new_df1 %>%
  mutate(rowid = 1:n()) %>%
  inner_join(new_df1, by = "id", suffix = c("", ".y")) %>%
  filter((time1 >= time1.y & time1 <= time2.y) |
           (time1.y >= time1 & time1.y <= time2)) %>%
  count(rowid, id, start, end) %>%
  mutate(overlap = n > 1) %>%
  select(id, start, end, overlap)
giving:
# A tibble: 5 x 4
rowid start end overlap
<int> <dttm> <dttm> <lgl>
1 1 2017-06-27 09:30:00 2017-06-27 10:30:00 FALSE
2 2 2017-06-27 15:30:00 2017-06-27 17:30:00 TRUE
3 3 2017-06-28 14:30:00 2017-06-28 18:30:00 TRUE
4 4 2017-06-28 09:30:00 2017-06-28 10:30:00 FALSE
5 5 2017-06-28 15:00:00 2017-06-28 16:00:00 FALSE
Note
The poster changed the question after it was already answered but in any case we used this as the input.
new_df <-
structure(list(id = c(1, 1, 1, 2, 2), inter1 = new("Interval",
.Data = c(3600, 7200, 14400, 3600, 3600), start = structure(c(1498555800,
1498577400, 1498660200, 1498642200, 1498662000), tzone = "UTC",
class = c("POSIXct",
"POSIXt")), tzone = "UTC"), overlap = c(FALSE, TRUE, TRUE,
FALSE, FALSE)), class = "data.frame", row.names = c(NA, -5L))

R convert hourly to daily data up to 0:00 instead of 23:00

How do you set 0:00 as the end of day instead of 23:00 in hourly data? I have this struggle while using period.apply or to.period, as both return days ending at 23:00. Here is an example:
x1 = xts(seq(as.POSIXct("2018-02-01 00:00:00"), as.POSIXct("2018-02-05 23:00:00"), by="hour"), x = rnorm(120))
The following functions show periods ending at 23:00:
to.period(x1, OHLC = FALSE, drop.date = FALSE, period = "days")
x1[endpoints(x1, 'days')]
So when I am aggregating the hourly data to daily, does someone have an idea how to set the end of day at 0:00?
As already pointed out by another answer here, to.period on days computes on the data with timestamps between 00:00:00 and 23:59:59.9999999 on the day in question. So 23:00:00 is seen as the last timestamp in your data, and 00:00:00 corresponds to a value in the next day's "bin".
What you can do is shift all the timestamps back 1 hour, use to.period get the daily data points from the hour points, and then using align.time to get the timestamps aligned correctly.
(More generally, to.period is useful for generating OHLCV-type data, so if you're, say, generating hourly bars from ticks, it makes sense to look at all the ticks between 23:00:00 and 23:59:59.99999 in the bar creation; then 00:00:00 to 00:59:59.9999... would form the next hourly bar, and so on.)
Here is an example:
> tail(x1["2018-02-01"])
# [,1]
# 2018-02-01 18:00:00 -1.2760349
# 2018-02-01 19:00:00 -0.1496041
# 2018-02-01 20:00:00 -0.5989614
# 2018-02-01 21:00:00 -0.9691905
# 2018-02-01 22:00:00 -0.2519618
# 2018-02-01 23:00:00 -1.6081656
> head(x1["2018-02-02"])
# [,1]
# 2018-02-02 00:00:00 -0.3373271
# 2018-02-02 01:00:00 0.8312698
# 2018-02-02 02:00:00 0.9321747
# 2018-02-02 03:00:00 0.6719425
# 2018-02-02 04:00:00 -0.5597391
# 2018-02-02 05:00:00 -0.9810128
> head(x1["2018-02-03"])
# [,1]
# 2018-02-03 00:00:00 2.3746424
# 2018-02-03 01:00:00 0.8536594
# 2018-02-03 02:00:00 -0.2467268
# 2018-02-03 03:00:00 -0.1316978
# 2018-02-03 04:00:00 0.3079848
# 2018-02-03 05:00:00 0.2445634
x2 <- x1
.index(x2) <- .index(x1) - 3600
> tail(x2["2018-02-01"])
# [,1]
# 2018-02-01 18:00:00 -0.1496041
# 2018-02-01 19:00:00 -0.5989614
# 2018-02-01 20:00:00 -0.9691905
# 2018-02-01 21:00:00 -0.2519618
# 2018-02-01 22:00:00 -1.6081656
# 2018-02-01 23:00:00 -0.3373271
x.d2 <- to.period(x2, OHLC = FALSE, drop.date = FALSE, period = "days")
> x.d2
# [,1]
# 2018-01-31 23:00:00 0.12516594
# 2018-02-01 23:00:00 -0.33732710
# 2018-02-02 23:00:00 2.37464235
# 2018-02-03 23:00:00 0.51797747
# 2018-02-04 23:00:00 0.08955208
# 2018-02-05 22:00:00 0.33067734
x.d2 <- align.time(x.d2, n = 86400)
> x.d2
# [,1]
# 2018-02-01 0.12516594
# 2018-02-02 -0.33732710
# 2018-02-03 2.37464235
# 2018-02-04 0.51797747
# 2018-02-05 0.08955208
# 2018-02-06 0.33067734
Want to convince yourself? Try something like this:
x3 <- rbind(x1, xts(x = matrix(c(1,2), nrow = 2), order.by = as.POSIXct(c("2018-02-01 23:59:59.999", "2018-02-02 00:00:00"))))
x3["2018-02-01 23/2018-02-02 01"]
# [,1]
# 2018-02-01 23:00:00.000 -1.6081656
# 2018-02-01 23:59:59.999 1.0000000
# 2018-02-02 00:00:00.000 -0.3373271
# 2018-02-02 00:00:00.000 2.0000000
# 2018-02-02 01:00:00.000 0.8312698
x3.d <- to.period(x3, OHLC = FALSE, drop.date = FALSE, period = "days")
> x3.d <- align.time(x3.d, 86400)
> x3.d
[,1]
2018-02-02 1.00000000
2018-02-03 -0.09832625
2018-02-04 -0.65075506
2018-02-05 -0.09423664
2018-02-06 0.33067734
See that the value of 2 on 00:00:00 did not form the last observation in the day for 2018-02-02 (00:00:00), which went from 2018-02-01 00:00:00 to 2018-02-01 23:59:59.9999.
Of course, if you want the daily timestamp to be the start of the day rather than the end of the day (which would make 2018-02-01 the stamp of the first row in x3.d above), you could shift the day back by one. You can do this relatively safely for most timezones, when your data doesn't involve weekend dates:
index(x3.d) = index(x3.d) - 86400
I say relatively safely, because there are corner cases when there are time shifts in a time zone, e.g. be careful with daylight saving. Simply subtracting 86400 can be a problem when going from Sunday to Saturday in time zones where daylight saving occurs:
#e.g. bad: day light savings occurs on this weekend for US EST
z <- xts(x = 9, order.by = as.POSIXct("2018-03-12", tz = "America/New_York"))
> index(z) - 86400
[1] "2018-03-10 23:00:00 EST"
i.e. the timestamp is off by one hour, when you really want the midnight timestamp (00:00:00).
You could get around this problem using something much safer like this:
library(lubridate)
# right
> index(z) - days(1)
[1] "2018-03-11 EST"
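Putting the earlier shift / to.period / align.time steps together, a hedged helper (just wrapping the approach above, not an xts built-in):
daily_to_midnight <- function(x) {
  x_shift <- x
  .index(x_shift) <- .index(x) - 3600  # shift back 1 hour so 00:00 joins the previous day's bin
  d <- to.period(x_shift, OHLC = FALSE, drop.date = FALSE, period = "days")
  align.time(d, n = 86400)             # snap the daily stamps to midnight
}
daily_to_midnight(x1)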
I don't think this is possible because 00:00 is the start of the day. From the manual:
These endpoints are aligned in POSIXct time to the zero second of the day at the beginning, and the 59.9999th second of the 59th minute of the 23rd hour of the final day
I think the solution here is to use minutes instead of hours. Using your example:
x1 = xts(seq(as.POSIXct("2018-02-01 00:00:00"), as.POSIXct("2018-02-05 23:59:00"), by = "min"), x = rnorm(7200))
to.period(x1, OHLC = FALSE, drop.date = FALSE, period = "day")
x1[endpoints(x1, 'day')]
