I have data on start time ('startTime', a date-time variable, POSIXct) and duration in minutes ('duration_minutes'):
df <- data.frame(id = c(1, 2, 3),
startTime = as.POSIXct(c("2018-01-01 12:15:31",
"2018-01-02 23:43:00",
"2018-01-03 11:00:11")),
duration_minutes = c(315, 120, 45))
I want to convert the start time and duration to elapsed time per hour, for each hour, from the hour of the start time to the last hour at the end of the duration:
df_result <- data.frame(id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3),
startTime = c("2018-01-01 12:15:31","2018-01-01 13:00:00",
"2018-01-01 14:00:00","2018-01-01 15:00:00",
"2018-01-01 16:00:00","2018-01-01 17:00:00",
"2018-01-02 23:43:00","2018-01-03 00:00:00",
"2018-01-03 01:00:00",
"2018-01-03 11:00:11"),
duration_minutes = c(44.48, 60, 60, 60, 60, 30.5, 17, 60, 43, 45))
Please advise on a possible solution.
Another possibility:
library(data.table)
library(lubridate)
setDT(df)
df[ , ceil_start := ceiling_date(start, "hour", change_on_boundary = TRUE)]
df[ , {
if(difftime(ceil_start, start, units = "min") > dur) {
.SD[ , .(start, dur)]
} else {
end <- start + dur * 60
time <- c(start,
seq(from = ceil_start,
to = floor_date(end, "hour"),
by = "hour"),
end)
.(start = head(time, -1), dur = `units<-`(diff(time), "mins"))
}
},
by = id]
# id start dur
# 1: 1 2018-01-01 12:15:31 44.48333 mins
# 2: 1 2018-01-01 13:00:00 60.00000 mins
# 3: 1 2018-01-01 14:00:00 60.00000 mins
# 4: 1 2018-01-01 15:00:00 60.00000 mins
# 5: 1 2018-01-01 16:00:00 60.00000 mins
# 6: 1 2018-01-01 17:00:00 30.51667 mins
# 7: 2 2018-01-02 23:43:00 17.00000 mins
# 8: 2 2018-01-03 00:00:00 60.00000 mins
# 9: 2 2018-01-03 01:00:00 43.00000 mins
# 10: 3 2018-01-03 11:00:11 45.00000 mins
# 11: 4 2018-01-03 11:35:00 25.00000 mins
# 12: 4 2018-01-03 12:00:00 10.00000 mins
# 13: 5 2018-01-03 00:00:00 60.00000 mins
# 14: 5 2018-01-03 01:00:00 0.00000 mins
Explanation
Convert the data.frame to a data.table (setDT). Round start times up to the next full hour (ceiling_date(start, "hour", ...)). Use change_on_boundary = TRUE for easier handling of times without minutes and seconds (not in the original data, but tested).
To handle cases where the end time (start + duration) falls in the same hour as the start time (e.g. id = 3), check whether the difference between the rounded time and the start time is larger than the duration (if(difftime(ceil_start, start, units = "min") > dur)). If so, just select the start and duration columns (.SD[ , .(start, dur)]).
For other cases (else), calculate the end time: end <- start + dur * 60. Create a sequence from the up-rounded start time ('ceil_start') to the down-rounded end time, with an hourly increment (seq(from = ceil_start, to = floor_date(end, "hour"), by = "hour")). Concatenate with the 'start' and 'end' times. Return all times except the last (head(time, -1)) and calculate the difference between time steps in minutes (`units<-`(diff(time), "mins")).
For start times with H:M:S = 00:00:00 and a duration that is a multiple of 60 min, like id = 5, the current solution gives a row with a duration of 0 minutes for the last hour. While waiting for a more elegant solution, a quick and dirty way is just to delete such rows with duration = 0.
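One way to avoid the zero-minute row altogether is to cut each interval only at the hour boundaries that fall strictly inside it. The following is a minimal base R sketch of that idea, not the data.table solution above. The function name split_by_hour is made up for illustration. It assumes a positive duration and a time zone whose offset is a whole number of hours, so that hour boundaries line up with multiples of 3600 epoch seconds.

```r
# Hypothetical helper: split one interval into per-hour pieces (base R only).
# Assumes dur_min > 0 and a whole-hour time zone offset.
split_by_hour <- function(start, dur_min) {
  s <- as.numeric(start)            # start in epoch seconds
  e <- s + dur_min * 60             # end in epoch seconds
  # All hour boundaries from the hour containing s up to the hour after e
  hours <- seq(floor(s / 3600) * 3600, ceiling(e / 3600) * 3600, by = 3600)
  # Keep only boundaries strictly inside the interval, so no empty piece arises
  cuts <- c(s, hours[hours > s & hours < e], e)
  data.frame(start = start + (head(cuts, -1) - s),
             dur   = diff(cuts) / 60)
}

# id = 5 from the data below: starts on the hour, lasts exactly 60 minutes
split_by_hour(as.POSIXct("2018-01-03 00:00:00", tz = "UTC"), 60)
# id = 4 from the data below: crosses one hour boundary
split_by_hour(as.POSIXct("2018-01-03 11:35:00", tz = "UTC"), 35)
```

Because the end time itself is never treated as an inner boundary, a start on the hour with a duration that is a multiple of 60 minutes should produce no trailing empty row.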
Data
Please note that I have added two cases not included in the original data: id = 4 (see also my comment above) and id = 5.
df <- data.frame(id = 1:5,
start = as.POSIXct(c("2018-01-01 12:15:31",
"2018-01-02 23:43:00",
"2018-01-03 11:00:11",
"2018-01-03 11:35:00",
"2018-01-03 00:00:00")),
dur = c(315, 120, 45, 35, 60))
Try this:
library(data.table)
library(lubridate)
library(magrittr)
df <-
setDT(df)[, start_ceiling := ceiling_date(startTime, "hour", change_on_boundary = TRUE)] %>%
.[, `:=` (
reps = ifelse(
startTime + (duration_minutes * 60) <= start_ceiling, 1, pmax(2, floor(duration_minutes / 60) + 1)
),
initial_diff = as.numeric(difftime(start_ceiling[1], startTime[1], units = "mins"))
), by = id] %>%
.[, df[df[, rep(.I, reps)]]] %>%
.[, startTime := pmax(startTime, floor_date(startTime, "hour") + hours(0:(.N - 1))), by = id] %>%
.[reps > 1, duration_minutes := c(initial_diff[.N],
rep(60, reps[.N] - 2),
(duration_minutes[.N] - initial_diff[.N]) %% 60), by = id] %>%
.[!(duration_minutes == 0 & reps > 1), ] %>%
.[, c("reps", "start_ceiling", "initial_diff") := NULL]
I've tested this with all the scenarios we've gathered so far, and this is the output:
id startTime duration_minutes
1: 1 2018-01-01 12:15:31 44.48333
2: 1 2018-01-01 13:00:00 60.00000
3: 1 2018-01-01 14:00:00 60.00000
4: 1 2018-01-01 15:00:00 60.00000
5: 1 2018-01-01 16:00:00 60.00000
6: 1 2018-01-01 17:00:00 30.51667
7: 2 2018-01-02 23:43:00 17.00000
8: 2 2018-01-03 00:00:00 60.00000
9: 2 2018-01-03 01:00:00 43.00000
10: 3 2018-01-03 11:00:11 45.00000
11: 4 2018-01-04 10:00:00 60.00000
12: 4 2018-01-04 11:00:00 5.00000
13: 5 2018-01-05 00:00:00 60.00000
14: 6 2018-01-06 11:35:00 25.00000
15: 6 2018-01-06 12:00:00 10.00000
16: 7 2018-01-07 00:00:00 60.00000
17: 7 2018-01-07 01:00:00 60.00000
Data used:
df <- data.frame(
id = c(1, 2, 3, 4, 5, 6, 7),
startTime = as.POSIXct(
c(
"2018-01-01 12:15:31",
"2018-01-02 23:43:00",
"2018-01-03 11:00:11",
"2018-01-04 10:00:00",
"2018-01-05 00:00:00",
"2018-01-06 11:35:00",
"2018-01-07 00:00:00"
)
),
duration_minutes = c(315, 120, 45, 65, 60, 35, 120)
)
df
id startTime duration_minutes
1 1 2018-01-01 12:15:31 315
2 2 2018-01-02 23:43:00 120
3 3 2018-01-03 11:00:11 45
4 4 2018-01-04 10:00:00 65
5 5 2018-01-05 00:00:00 60
6 6 2018-01-06 11:35:00 35
7 7 2018-01-07 00:00:00 120
Assume we have an interval spanning several days (interval "A" in Figure below).
library(lubridate)
int <- interval("2018-01-01 22:00:00", "2018-01-04 10:00:00")
In hours, I get
as.period(int, unit = "hours")
"60H 0M 0S"
Now, I want to subtract all non-working-hours, here 16:00-08:00 (greyed out) in that interval, i.e. only keep the blue parts (08:00-16:00) and, again, calculate the remaining hours (see "B" in Figure below), which would be 8 + 8 + 2 = 18 hours.
One approach would be to create a list of intervals I want to keep, which together span the entire interval, and then calculate intersections. (The code below could, of course, be set up programmatically using floor/ceiling/seq functions etc.)
int_keep <- list(
interval("2018-01-01 08:00:00", "2018-01-01 16:00:00"),
interval("2018-01-02 08:00:00", "2018-01-02 16:00:00"),
interval("2018-01-03 08:00:00", "2018-01-03 16:00:00"),
interval("2018-01-04 08:00:00", "2018-01-04 16:00:00"),
interval("2018-01-05 08:00:00", "2018-01-05 16:00:00")
)
l <- lapply(int_keep, function(x) intersect(x, int))
mns <- sapply(l, as.numeric) # returns seconds
sum(mns, na.rm = T) / 60 / 60 # sum of intersections in hours
[1] 18
While this works, it appears utterly clumsy to me. What would be a less tedious way to do this?
df <- data.frame(DateTime=seq.POSIXt(as.POSIXct("2018-01-01 22:00:00"), as.POSIXct("2018-01-04 10:00:00"), by = "1 hour"))
head(df)
#DateTime
#1 2018-01-01 22:00:00
#2 2018-01-01 23:00:00
#3 2018-01-02 00:00:00
#4 2018-01-02 01:00:00
#5 2018-01-02 02:00:00
#6 2018-01-02 03:00:00
#you want the hours worked between A and B
A <-format(strptime("8:00:00", "%H:%M:%S"),"%H:%M:%S")
B <-format(strptime("16:00:00", "%H:%M:%S"),"%H:%M:%S")
#a simple ifelse statement to assign a value of 1 to column "value" if the time is between 8 and 16 or a 0 if it's not:
df$value<-ifelse((format(df[1],"%H:%M:%S")>A & format(df[1],"%H:%M:%S")<=B),1,0)
tail(df)
#DateTime DateTime
#56 2018-01-04 05:00:00 0
#57 2018-01-04 06:00:00 0
#58 2018-01-04 07:00:00 0
#59 2018-01-04 08:00:00 0
#60 2018-01-04 09:00:00 1
#61 2018-01-04 10:00:00 1
#now taking the column sum of the value column will give you the total hours worked:
TotalHoursWorked<-colSums(df$value)
TotalHoursWorked
#DateTime
# 18
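The total can also be computed without expanding the interval into hourly rows, by clipping one working window per calendar day against the interval. This is only a sketch in base R under stated assumptions: the function name working_hours and its arguments are hypothetical, the working window is the same every day and starts and ends on whole hours, and everything is in UTC, so daylight saving shifts are not handled.

```r
# Hypothetical helper: hours of [from, to] that fall inside a daily open-close window
working_hours <- function(from, to, open = 8, close = 16, tz = "UTC") {
  # One working window per calendar day touched by the interval
  days <- seq(as.Date(from, tz = tz), as.Date(to, tz = tz), by = "day")
  win_start <- as.POSIXct(paste(days, sprintf("%02d:00:00", open)), tz = tz)
  win_end   <- as.POSIXct(paste(days, sprintf("%02d:00:00", close)), tz = tz)
  # Overlap of each window with the interval, in seconds (negative = no overlap)
  overlap <- pmin(as.numeric(win_end), as.numeric(to)) -
             pmax(as.numeric(win_start), as.numeric(from))
  sum(pmax(overlap, 0)) / 3600
}

from <- as.POSIXct("2018-01-01 22:00:00", tz = "UTC")
to   <- as.POSIXct("2018-01-04 10:00:00", tz = "UTC")
working_hours(from, to)   # 8 + 8 + 2 hours for the interval from the question
```

Unlike counting hourly rows, this handles start and end times that do not fall on a full hour.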
How do you set 0:00 as the end of day instead of 23:00 in hourly data? I struggle with this when using period.apply or to.period, as both return days ending at 23:00. Here is an example:
x1 = xts(seq(as.POSIXct("2018-02-01 00:00:00"), as.POSIXct("2018-02-05 23:00:00"), by="hour"), x = rnorm(120))
The following functions show that periods end at 23:00:
to.period(x1, OHLC = FALSE, drop.date = FALSE, period = "days")
x1[endpoints(x1, 'days')]
So when I am aggregating the hourly data to daily, does someone have an idea how to set the end of day at 0:00?
As already pointed out by another answer here, to.period on days computes on the data with timestamps between 00:00:00 and 23:59:59.9999999 on the day in question. So 23:00:00 is seen as the last timestamp of the day in your data, and 00:00:00 falls into the next day's "bin".
What you can do is shift all the timestamps back 1 hour, use to.period to get the daily data points from the hourly points, and then use align.time to get the timestamps aligned correctly.
(More generally, to.period is useful for generating OHLCV-type data, so if you're generating, say, hourly bars from ticks, it makes sense to include all the ticks between 23:00:00 and 23:59:59.99999 in one bar. Then 00:00:00 to 00:59:59.9999... would form the next hourly bar, and so on.)
Here is an example:
> tail(x1["2018-02-01"])
# [,1]
# 2018-02-01 18:00:00 -1.2760349
# 2018-02-01 19:00:00 -0.1496041
# 2018-02-01 20:00:00 -0.5989614
# 2018-02-01 21:00:00 -0.9691905
# 2018-02-01 22:00:00 -0.2519618
# 2018-02-01 23:00:00 -1.6081656
> head(x1["2018-02-02"])
# [,1]
# 2018-02-02 00:00:00 -0.3373271
# 2018-02-02 01:00:00 0.8312698
# 2018-02-02 02:00:00 0.9321747
# 2018-02-02 03:00:00 0.6719425
# 2018-02-02 04:00:00 -0.5597391
# 2018-02-02 05:00:00 -0.9810128
> head(x1["2018-02-03"])
# [,1]
# 2018-02-03 00:00:00 2.3746424
# 2018-02-03 01:00:00 0.8536594
# 2018-02-03 02:00:00 -0.2467268
# 2018-02-03 03:00:00 -0.1316978
# 2018-02-03 04:00:00 0.3079848
# 2018-02-03 05:00:00 0.2445634
x2 <- x1
.index(x2) <- .index(x1) - 3600
> tail(x2["2018-02-01"])
# [,1]
# 2018-02-01 18:00:00 -0.1496041
# 2018-02-01 19:00:00 -0.5989614
# 2018-02-01 20:00:00 -0.9691905
# 2018-02-01 21:00:00 -0.2519618
# 2018-02-01 22:00:00 -1.6081656
# 2018-02-01 23:00:00 -0.3373271
x.d2 <- to.period(x2, OHLC = FALSE, drop.date = FALSE, period = "days")
> x.d2
# [,1]
# 2018-01-31 23:00:00 0.12516594
# 2018-02-01 23:00:00 -0.33732710
# 2018-02-02 23:00:00 2.37464235
# 2018-02-03 23:00:00 0.51797747
# 2018-02-04 23:00:00 0.08955208
# 2018-02-05 22:00:00 0.33067734
x.d2 <- align.time(x.d2, n = 86400)
> x.d2
# [,1]
# 2018-02-01 0.12516594
# 2018-02-02 -0.33732710
# 2018-02-03 2.37464235
# 2018-02-04 0.51797747
# 2018-02-05 0.08955208
# 2018-02-06 0.33067734
Want to convince yourself? Try something like this:
x3 <- rbind(x1, xts(x = matrix(c(1,2), nrow = 2), order.by = as.POSIXct(c("2018-02-01 23:59:59.999", "2018-02-02 00:00:00"))))
x3["2018-02-01 23/2018-02-02 01"]
# [,1]
# 2018-02-01 23:00:00.000 -1.6081656
# 2018-02-01 23:59:59.999 1.0000000
# 2018-02-02 00:00:00.000 -0.3373271
# 2018-02-02 00:00:00.000 2.0000000
# 2018-02-02 01:00:00.000 0.8312698
x3.d <- to.period(x3, OHLC = FALSE, drop.date = FALSE, period = "days")
> x3.d <- align.time(x3.d, 86400)
> x3.d
[,1]
2018-02-02 1.00000000
2018-02-03 -0.09832625
2018-02-04 -0.65075506
2018-02-05 -0.09423664
2018-02-06 0.33067734
See that the value of 2 at 2018-02-02 00:00:00 did not become the last observation of the first daily bar (stamped 2018-02-02 after alignment), which ran from 2018-02-01 00:00:00 to 2018-02-01 23:59:59.9999. The value of 1 at 23:59:59.999 did.
Of course, if you want the daily timestamp to be the start of the day, not the end of the day, which would be 2018-02-01 as start of bar for the first row, in x3.d above, you could shift back the day by one. You could do this relatively safely for most timezones, when your data doesn't involve weekend dates:
index(x3.d) = index(x3.d) - 86400
I say relatively safely because there are corner cases when a time zone has clock shifts, so be careful with daylight saving time. Simply subtracting 86400 can be a problem when the step crosses a weekend on which daylight saving time starts or ends:
#e.g. bad: day light savings occurs on this weekend for US EST
z <- xts(x = 9, order.by = as.POSIXct("2018-03-12", tz = "America/New_York"))
> index(z) - 86400
[1] "2018-03-10 23:00:00 EST"
i.e. the timestamp is off by one hour, when you really want the midnight timestamp (00:00:00).
You could get around this problem using something much safer like this:
library(lubridate)
# right
> index(z) - days(1)
[1] "2018-03-11 EST"
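The same comparison can be sketched without lubridate. In base R, seq() on a POSIXct value accepts by = "DSTday", which is documented to keep the same clock time each day, whereas plain subtraction moves by a fixed number of seconds. The variable names below are only illustrative, and the example assumes the "America/New_York" time zone is available on the system.

```r
# Midnight on the Monday after the US spring-forward weekend (as in the example above)
t0 <- as.POSIXct("2018-03-12 00:00:00", tz = "America/New_York")

# Fixed 24-hour step: ends up one hour short of the previous midnight
fixed <- t0 - 86400

# Calendar-day step: "DSTday" keeps the clock time across the DST change
calendar <- seq(t0, by = "-1 DSTday", length.out = 2)[2]

format(fixed, "%Y-%m-%d %H:%M:%S")
format(calendar, "%Y-%m-%d %H:%M:%S")
```

Note that by = "day" (without the DST prefix) is a fixed 86400-second step, so it would show the same one-hour offset as the plain subtraction.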
I don't think this is possible because 00:00 is the start of the day. From the manual:
These endpoints are aligned in POSIXct time to the zero second of the day at the beginning, and the 59.9999th second of the 59th minute of the 23rd hour of the final day
I think the solution here is to use minutes instead of hours. Using your example:
x1 = xts(seq(as.POSIXct("2018-02-01 00:00:00"), as.POSIXct("2018-02-05 23:59:59"), by="min"), x = rnorm(7200))
to.period(x1, OHLC = FALSE, drop.date = FALSE, period = "day")
x1[endpoints(x1, 'day')]
I have a selection of scattered timestamp data based on requests to a particular service. This data covers approximately 3.5-4 years of requests against this service.
I am looking to turn this selection of variable-interval timestamps into a frequency-binned timeseries in R.
How would I go about converting these timestamps into a frequency-binned timeseries, such as "between 1 and 1:15PM on this day, there were 7 requests, and between 1:15 and 1:30PM there were 2, and between 1:30 and 1:45, there were 0", being sure to also have a bin where there is nothing?
The data is just a vector of timestamps from a database dump, all of the format "2014-02-17 13:10:46". Just a big ol' vector with ~2 million objects in it.
You could use tools for handling time series data from xts and zoo. Note that you will need some artificial 'data':
library(xts)
set.seed(42)
ts.index <- ISOdatetime(2018, 1, 8, 8:9, sample(60, 10), 0)
ts <- xts(rep(1, length(ts.index)), ts.index)
aggregate(ts, time(ts) - as.numeric(time(ts)) %% 900, length, regular = TRUE)
#>
#> 2018-01-08 08:15:00 1
#> 2018-01-08 08:30:00 3
#> 2018-01-08 08:45:00 1
#> 2018-01-08 09:00:00 1
#> 2018-01-08 09:15:00 1
#> 2018-01-08 09:45:00 3
Edit: If you want to include bins without observations, you can convert to a strictly regular ts object and replace the inserted NA values with zero:
raw <- aggregate(ts, time(ts) - as.numeric(time(ts)) %% 900, length, regular = TRUE)
as.xts(na.fill(as.ts(raw), 0), dateFormat = "POSIXct")
#> zoo(coredata(x), tt)
#> 2018-01-08 08:15:00 1
#> 2018-01-08 08:30:00 3
#> 2018-01-08 08:45:00 1
#> 2018-01-08 09:00:00 1
#> 2018-01-08 09:15:00 1
#> 2018-01-08 09:30:00 0
#> 2018-01-08 09:45:00 3
Edit 2: It also works for the provided sample data:
library(xts)
data <- c(1228917812, 1245038910, 1245986979, 1268750482, 1281615510, 1292561113)
class(data) = c("POSIXct", "POSIXt")
attr(data, "tzone") <- "UTC"
dput(data)
#> structure(c(1228917812, 1245038910, 1245986979, 1268750482, 1281615510,
#> 1292561113), class = c("POSIXct", "POSIXt"), tzone = "UTC")
ts <- xts(rep(1, length(data)), data)
raw <- aggregate(ts, time(ts) - as.numeric(time(ts)) %% 900, length, regular = TRUE)
head(as.xts(na.fill(as.ts(raw), 0), dateFormat = "POSIXct"))
#> zoo(coredata(x), tt)
#> 2008-12-10 15:00:00 1
#> 2008-12-10 15:15:00 0
#> 2008-12-10 15:30:00 0
#> 2008-12-10 15:45:00 0
#> 2008-12-10 16:00:00 0
#> 2008-12-10 16:15:00 0
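If pulling in xts is not an option, the same binning can be sketched in base R: compute a bin number for every timestamp and let tabulate() count all bins between the first and the last, so empty bins show up as zeros. The function name bin_counts is hypothetical, only the first sample timestamp comes from the question (the other two are made up), and the sketch assumes the timestamps are already parsed as POSIXct in UTC.

```r
# Hypothetical helper: count timestamps per fixed-width bin, keeping empty bins
bin_counts <- function(stamps, width = 900) {      # 900 s = 15 minutes
  bin_of <- floor(as.numeric(stamps) / width)      # bin number of each timestamp
  bins <- seq(min(bin_of), max(bin_of))            # every bin, including empty ones
  data.frame(bin_start = .POSIXct(bins * width, tz = "UTC"),
             n = tabulate(bin_of - min(bin_of) + 1, nbins = length(bins)))
}

stamps <- as.POSIXct(c("2014-02-17 13:10:46",      # from the question
                       "2014-02-17 13:12:00",      # made-up sample value
                       "2014-02-17 13:50:01"),     # made-up sample value
                     tz = "UTC")
bin_counts(stamps)
```

Each row is labelled with the start of its 15-minute bin; the two bins between 13:15 and 13:45 contain no timestamps and are reported with a count of zero.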
I have two data frames: A
y_m_d SNOW
1 2010-01-01 0.0
2 2010-01-02 0.0
3 2010-01-03 0.1
4 2010-01-04 0.0
5 2010-01-05 0.0
6 2010-01-06 2.3
B:
time temp
1 2010-01-01 00:00:00 20.00000
2 2010-01-01 01:00:00 18.33333
3 2010-01-01 02:00:00 17.00000
4 2010-01-01 03:00:00 25.33333
5 2010-01-01 04:00:00 23.33333
I want to combine the two data frames based on time. A is a daily record and B is an hourly record. I want to put each value from A at the beginning of its day (00:00:00) and leave the rest of the day blank.
The result should look like this:
time temp SNOW
1 2010-01-01 00:00:00 20.00000 0.0
2 2010-01-01 01:00:00 18.33333
3 2010-01-01 02:00:00 17.00000
4 2010-01-01 03:00:00 25.33333
5 2010-01-01 04:00:00 23.33333
6 2010-01-01 05:00:00 22.66667
Could you please give me some advice?
Thank you.
Here's a quick solution:
A$y_m_d <- as.Date(A$y_m_d)
B$SNOW <- sapply(as.Date(B$time), function(x) A[A$y_m_d==x, "SNOW"])
This might not be the most efficient way in the world to do this, but it is a solution. I attempted to create data with the exact same variable types and structure as yours.
# Create example data
y_m_d <- as.POSIXct(c("2010-01-01", "2010-01-02"), format="%Y-%m-%d")
SNOW <- c(0, 0.1)
time <- as.POSIXct(c("2010-01-01 00:00:00", "2010-01-01 01:00:00", "2010-01-01 02:00:00", "2010-01-02 00:00:00", "2010-01-02 01:00:00", "2010-01-02 02:00:00"), format="%Y-%m-%d %H:%M:%S")
temp <- rnorm(6, mean=20, sd=4)
A <- data.frame(y_m_d, SNOW)
B <- data.frame(time, temp)
# Check data
A
## y_m_d SNOW
## 1 2010-01-01 0.0
## 2 2010-01-02 0.1
B
## time temp
## 1 2010-01-01 00:00:00 17.52852
## 2 2010-01-01 01:00:00 12.42715
## 3 2010-01-01 02:00:00 21.79584
## 4 2010-01-02 00:00:00 19.90442
## 5 2010-01-02 01:00:00 16.40524
## 6 2010-01-02 02:00:00 16.86854
# Loop through days and construct new SNOW variable
days <- as.POSIXct(format(B$time, "%Y-%m-%d"), format="%Y-%m-%d")
SNOW_new <- c()
for (i in seq_len(nrow(A))) {
  # Append (rather than prepend) so the values stay in the same order as the rows of B
  SNOW_new <- c(SNOW_new, A[i, "SNOW"], rep(NA, sum(days==A[i, "y_m_d"])-1))
}
# Create new data frame
C <- data.frame(B, SNOW_new)
## time temp SNOW_new
## 1 2010-01-01 00:00:00 17.52852 0.0
## 2 2010-01-01 01:00:00 12.42715 NA
## 3 2010-01-01 02:00:00 21.79584 NA
## 4 2010-01-02 00:00:00 19.90442 0.1
## 5 2010-01-02 01:00:00 16.40524 NA
## 6 2010-01-02 02:00:00 16.86854 NA
I put NA rather than a blank space because I assume you want the SNOW_new variable to be numeric, not character. But if you do want a blank space, you can just replace the NA in the rep function with a "".
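A vectorised alternative to the loop is to look up the daily value with match() and assign it only to the rows that fall exactly on midnight. This is a sketch with made-up temperatures and an explicit UTC time zone, both of which are assumptions and not part of the original data. It also does not require B to be sorted or every day to have the same number of rows.

```r
# Example data (temp values are made up; UTC is assumed to keep dates unambiguous)
A <- data.frame(y_m_d = as.Date(c("2010-01-01", "2010-01-02")),
                SNOW  = c(0, 0.1))
B <- data.frame(time = as.POSIXct(c("2010-01-01 00:00:00", "2010-01-01 01:00:00",
                                    "2010-01-01 02:00:00", "2010-01-02 00:00:00",
                                    "2010-01-02 01:00:00", "2010-01-02 02:00:00"),
                                  tz = "UTC"),
                temp = c(20, 18, 17, 25, 23, 22))

# Rows of B that sit exactly at the start of a day
is_midnight <- format(B$time, "%H:%M:%S") == "00:00:00"

# Look up each midnight row's date in A; all other rows stay NA
B$SNOW <- NA_real_
B$SNOW[is_midnight] <- A$SNOW[match(format(B$time[is_midnight], "%Y-%m-%d"),
                                    format(A$y_m_d))]
B
```

Days that are missing from A simply leave NA in the midnight row, because match() returns NA for them.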
Making sure time variables are in the right format.
A$y_m_d <- as.POSIXct(A$y_m_d, format="%Y-%m-%d")
B$time <- as.POSIXct(B$time, format="%Y-%m-%d %H:%M:%S")
The xts package is well suited to merging time series data
#install.packages("xts")
library(xts)
A <- xts(A[,-1], order.by = A$y_m_d)
B <- xts(B[,-1], order.by = B$time)
merge.xts(A, B)