Probably very easy, but I'm struggling with it. I looked for answers on the web, but they usually deal with cut and snapshots, not with overlapping intervals.
require(data.table)
x = data.table(start=c("2017-04-18 18:05:00","2017-04-18 18:00:00",
"2017-04-18 21:05:00", "2017-04-18 16:05:00"),
end=c("2017-04-18 19:05:00","2017-04-18 21:30:00",
"2017-04-18 22:00:00", "2017-04-18 16:10:00"))
We have 4 observations and I need to allocate them to the corresponding hourly windows.
start end
1: 2017-04-18 18:05:00 2017-04-18 19:05:00
2: 2017-04-18 18:00:00 2017-04-18 21:30:00
3: 2017-04-18 21:05:00 2017-04-18 22:00:00
4: 2017-04-18 16:05:00 2017-04-18 16:10:00
The first one, for example, will have 55 min in the 18:00 slot and 5 min in the 19:00 slot; the next one 60 min in each of 18:00, 19:00 and 20:00 and 30 min in 21:00; the third one will have 55 min in 21:00; and the last one 5 min in 16:00.
The result should be as below (sorry if I got the basic manual additions wrong ;)
interval Q
1: 2017-04-18 16:00:00 5
2: 2017-04-18 17:00:00 0
3: 2017-04-18 18:00:00 115
4: 2017-04-18 19:00:00 65
5: 2017-04-18 20:00:00 60
6: 2017-04-18 21:00:00 85
Of course there is a straightforward way: cut the series into minutes and count by cut/interval. But I believe the problem is so common that there must be a direct method. Preferably I would have the 0-value windows as well, but I can just sequence them if required.
This is a solution using dplyr
First a helper function find_slots is defined to generate all the hours between start and end. Next the Q values are calculated.
Finally the data is summarized by grouping each slot.
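library(dplyr)
library(lubridate)  # provides minute() and second()
find_slots <- function(a, b){
  # hourly slot starts: a and b floored to the hour
  slots = seq(a-minute(a)*60-second(a),
              b-minute(b)*60-second(b),
              "hour")
  # interval boundaries: the true start, each later slot start, the true end
  dateseq = slots
  dateseq[1] = a
  r = c(dateseq, b)
  # minutes spent in each slot
  d = as.numeric(difftime(r[-1], r[-length(r)], units = 'mins'))
  data.frame(slot = slots, Q = d)
}
x %>%
  # start and end are stored as character, so parse them first
  mutate(start = as.POSIXct(start), end = as.POSIXct(end)) %>%
  rowwise() %>%
  do(find_slots(.$start, .$end)) %>%
  ungroup() %>%
  group_by(slot) %>%
  summarize(Q = sum(Q))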
Result (the 0 value for 17:00 is missing) :
slot Q
1 2017-04-18 16:00:00 5
2 2017-04-18 18:00:00 115
3 2017-04-18 19:00:00 65
4 2017-04-18 20:00:00 60
5 2017-04-18 21:00:00 85
6 2017-04-18 22:00:00 0
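To get the missing zero-value windows back (17:00 here), one option is to join the summary onto a complete hourly sequence. This is only a base R sketch: the names res, all_slots and full are made up for illustration, the values are typed in from the result above, and UTC is an assumed time zone.

```r
# Hypothetical copy of the summarized result shown above
res <- data.frame(
  slot = as.POSIXct(c("2017-04-18 16:00:00", "2017-04-18 18:00:00",
                      "2017-04-18 19:00:00", "2017-04-18 20:00:00",
                      "2017-04-18 21:00:00", "2017-04-18 22:00:00"),
                    tz = "UTC"),
  Q = c(5, 115, 65, 60, 85, 0)
)

# Every hourly slot between the first and the last one
all_slots <- data.frame(slot = seq(min(res$slot), max(res$slot), by = "hour"))

# Left join: slots without a match get NA, which is then set to 0
full <- merge(all_slots, res, by = "slot", all.x = TRUE)
full$Q[is.na(full$Q)] <- 0
full
```

The same left join can be written with dplyr or data.table; the key point is only that the complete sequence is the left-hand table.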
Edit: Using data.table
(Maybe faster, but I'm not too experienced with data.table.)
It also uses the fasttime package to speed up parsing of the datetimes.
library(fasttime)
library(data.table)
x = data.table(start=c("2017-04-18 18:05:00","2017-04-18 18:00:00",
"2017-04-18 21:05:00", "2017-04-18 16:05:00"),
end=c("2017-04-18 19:05:00","2017-04-18 21:30:00",
"2017-04-18 22:00:00", "2017-04-18 16:10:00"))
find_slots2 <- function(a, b){
  a = fasttime::fastPOSIXct(a)
  b = fasttime::fastPOSIXct(b)
  # floor a and b to the hour (seconds are subtracted as they are, not times 60)
  slots = seq(a-data.table::minute(a)*60-data.table::second(a),
              b-data.table::minute(b)*60-data.table::second(b),
              "hour")
  hourseq = c(a, slots[-1], b)
  d = difftime(hourseq[-1], hourseq[-length(hourseq)], units = 'mins')
  list(slot = slots, Q = d)
}
x[, find_slots2(start, end), by = 1:nrow(x)][order(slot), .(Q = as.numeric(sum(Q))), by = slot]
Lubridate has a function lubridate::interval() that could be useful here.
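The overlap between each observation and each hourly window can also be computed directly, without expanding to minutes and without extra packages. The sketch below is one possible base R version, not a built-in method: the overlap of [start, end] with a window [lo, hi] is max(0, min(end, hi) - max(start, lo)). The variable names and the UTC time zone are assumptions made for the example.

```r
starts <- as.POSIXct(c("2017-04-18 18:05:00", "2017-04-18 18:00:00",
                       "2017-04-18 21:05:00", "2017-04-18 16:05:00"), tz = "UTC")
ends   <- as.POSIXct(c("2017-04-18 19:05:00", "2017-04-18 21:30:00",
                       "2017-04-18 22:00:00", "2017-04-18 16:10:00"), tz = "UTC")

# Hourly window starts covering the whole range, empty hours included
slot_start <- seq(as.POSIXct(trunc(min(starts), "hours")), max(ends), by = "hour")

# Work in seconds since the epoch
s  <- as.numeric(starts)
e  <- as.numeric(ends)
lo <- as.numeric(slot_start)
hi <- lo + 3600

# overlap[i, j]: seconds of observation i that fall inside window j
overlap <- pmax(outer(e, hi, pmin) - outer(s, lo, pmax), 0)

result <- data.frame(interval = slot_start, Q = colSums(overlap) / 60)
result
```

Because every window is present in slot_start, the 0-value windows come out automatically. The overlap matrix has one cell per observation and window, so for very long series a join-based approach may scale better.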
I have a time series (xts) of rain gage data and I would like to sum all the rain amounts between a beginning and an end time point taken from a list, and then make a new data frame with StormNumber and TotalRain over that time.
> head(RainGage)
Rain_mm
2019-07-01 00:00:00 0
2019-07-01 00:15:00 0
2019-07-01 00:30:00 0
2019-07-01 00:45:00 0
2019-07-01 01:00:00 0
2019-07-01 01:15:00 0
head(StormTimes)
StormNumber RainStartTime RainEndTime
1 1 2019-07-21 20:00:00 2019-07-22 04:45:00
2 2 2019-07-22 11:30:00 2019-07-22 23:45:00
3 3 2019-07-11 09:15:00 2019-07-11 19:00:00
4 4 2019-05-29 17:00:00 2019-05-29 20:45:00
5 5 2019-06-27 14:30:00 2019-06-27 17:15:00
6 6 2019-07-11 06:15:00 2019-07-11 09:00:00
I have this code that I got from the SO community when I was trying to do something similar in the past (but extract data rather than sum it). However, I have no idea how it works so I am struggling to adapt it to this situation.
do.call(rbind, Map(function(x, y) RainGage[paste(x, y, sep="/")],
                   StormTimes$RainStartTime, StormTimes$RainEndTime))
In this case I would suggest just to write your own function and then use apply to achieve what you want, for example:
dates <- c('2019-07-01 00:00:00', '2019-07-01 00:15:00',
'2019-07-01 00:30:00', '2019-07-01 00:45:00',
'2019-07-01 01:00:00', '2019-07-01 01:15:00')
dates <- as.POSIXct(strptime(dates, '%Y-%m-%d %H:%M:%S'))
mm <- c(0, 10, 10, 20, 0, 0)
rain <- data.frame(dates, mm)
number <- c(1,2)
start <- c('2019-07-01 00:00:00','2019-07-01 00:18:00')
start <- as.POSIXct(strptime(start, '%Y-%m-%d %H:%M:%S'))
end <- c('2019-07-01 00:17:00','2019-07-01 01:20:00')
end <- as.POSIXct(strptime(end, '%Y-%m-%d %H:%M:%S'))
storms <- data.frame(number, start, end)
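# Sum of rain for one storm. apply() hands each row over as a character
# vector, so start and end arrive as strings and are parsed again here.
f <- function(x) {
  # Get starting moment
  start <- as.POSIXct(x[["start"]])
  # Get ending moment
  end <- as.POSIXct(x[["end"]])
  # Sum the rain recorded in the half-open window [start, end)
  sum(rain[rain$dates >= start & rain$dates < end, 'mm'])
}
# Apply function to each row of the dataframe
storms$rain <- apply(storms, 1, f)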
print(storms)
This yields:
number start end rain
1 1 2019-07-01 00:00:00 2019-07-01 00:17:00 10
2 2 2019-07-01 00:18:00 2019-07-01 01:20:00 30
So a column rain in storms now holds the sum of rain$mm, which is what you're after.
Hope that helps you out!
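A variant of the same idea that avoids apply() turning each row into character strings is to iterate over the start and end columns pairwise with mapply. This is only a sketch on the toy data from above; sum_rain is a hypothetical helper name and UTC is an assumed time zone.

```r
rain <- data.frame(
  dates = as.POSIXct(c("2019-07-01 00:00:00", "2019-07-01 00:15:00",
                       "2019-07-01 00:30:00", "2019-07-01 00:45:00",
                       "2019-07-01 01:00:00", "2019-07-01 01:15:00"), tz = "UTC"),
  mm = c(0, 10, 10, 20, 0, 0)
)
storms <- data.frame(
  number = c(1, 2),
  start = as.POSIXct(c("2019-07-01 00:00:00", "2019-07-01 00:18:00"), tz = "UTC"),
  end   = as.POSIXct(c("2019-07-01 00:17:00", "2019-07-01 01:20:00"), tz = "UTC")
)

# Total rain recorded in the half-open window [start, end)
sum_rain <- function(start, end) {
  sum(rain$mm[rain$dates >= start & rain$dates < end])
}

# One call per storm, taking start and end from the same row
storms$rain <- mapply(sum_rain, storms$start, storms$end)
storms
```

With the real data, rain$dates would presumably be index(RainGage) and rain$mm the Rain_mm column, but that mapping is an assumption about the asker's objects.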
I have a problem applying a function (min) to a specific repeating time period. Basically, my data looks like this sample:
library(xts)
set.seed(10)  # make the random temperatures reproducible
start <- as.POSIXct("2018-05-18 00:00")
tseq <- seq(from = start, length.out = 1440, by = "10 mins")
Measurings <- data.frame(
  Time = tseq,
  Temp = sample(10:37, 1440, replace = TRUE)
)
Measurings_xts <- xts(Measurings[,-1], Measurings$Time)
With much-appreciated help, I found out that the min and max functions (unlike mean, which works right away in period.apply) must be wrapped in a helper function. They can then be calculated over calendar periods (hours, days, years...) using this solution:
colMin <- function(x, na.rm = FALSE) {
apply(x, 2, min, na.rm = na.rm)
}
epHours <- endpoints(Measurings_xts, "hours")
Measurings_min <- period.apply(Measurings_xts, epHours, colMin)
For meteorological analyses I need to calculate further minima for a less intuitive timespan, crossing the calendar day, that I fail to define in code:
I need to output the minimum nighttime temperature from e.g. 2018-05-18 19:00 to 2018-05-19 7:00 in the morning for each night in my dataset.
I have tried to shift the timespan by moving the time column up or down, so that each night falls within one calendar day. But this solution is error-prone and doesn't work for my real data, where some observations are missing. How do I use POSIXct datetimes and/or xts functionality to calculate minima in this case?
You could solve this by creating your own "end points" when you use period.apply
# Choose the appropriate time ranges
z <- Measurings_xts["T19:00/T07:00"]
# Creating your own "endpoints":
epNights <- which(diff.xts(index(z), units = "mins") > 10) - 1
Subtract one off each index because the jumps are recorded at the start of the next "night interval" in the output from which().
Then add the last data point in the data set to your end points vector, and you can then use this in period.apply
epNights <- c(epNights, nrow(z))
Measurings_min <- period.apply(z, epNights, colMin)
Measurings_min
# [,1]
# 2018-05-18 07:00:00 10
# 2018-05-19 07:00:00 10
# 2018-05-20 07:00:00 10
# 2018-05-21 07:00:00 10
# 2018-05-22 07:00:00 10
# 2018-05-23 07:00:00 10
# 2018-05-24 07:00:00 11
# 2018-05-25 07:00:00 10
# 2018-05-26 07:00:00 10
# 2018-05-27 07:00:00 10
# 2018-05-27 23:50:00 12
Here is one approach that works by defining a new group for each night interval:
# define the time interval, e.g. from 19:00 to 7:00
from <- 19
to <- 7
hours <- as.numeric(strftime(index(Measurings_xts), format="%H"))
y <- rle(as.numeric(findInterval(hours, c(to,from)) != 1))
y$values[c(TRUE, FALSE)] <- cumsum(y$values[c(TRUE, FALSE)])
grp <- inverse.rle(y)
# grp is a grouping variable that is 0 for everything outside the
# defined interval , 1 for the first night, 2 for the second...
s <- split(Measurings_xts, grp); s$`0` <- NULL
# min_value will contain the minimum value for each night interval
min_value <- sapply(s, min)
# to see the date interval for each value
start <- sapply(s, function(x) as.character(index(x)[1]))
end <- sapply(s, function(x) as.character(index(x)[length(x)]))
data.frame(start, end, min_value)
# start end min_value
#1 2018-05-18 2018-05-18 06:50:00 10
#2 2018-05-18 19:00:00 2018-05-19 06:50:00 10
#3 2018-05-19 19:00:00 2018-05-20 06:50:00 10
#4 2018-05-20 19:00:00 2018-05-21 06:50:00 10
#5 2018-05-21 19:00:00 2018-05-22 06:50:00 10
#6 2018-05-22 19:00:00 2018-05-23 06:50:00 10
#7 2018-05-23 19:00:00 2018-05-24 06:50:00 11
#8 2018-05-24 19:00:00 2018-05-25 06:50:00 10
#9 2018-05-25 19:00:00 2018-05-26 06:50:00 10
#10 2018-05-26 19:00:00 2018-05-27 06:50:00 10
#11 2018-05-27 19:00:00 2018-05-27 23:50:00 12
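The grouping can also be done without run-length encoding by labelling every night-time reading with the date on which its night started. The sketch below uses plain vectors instead of xts to stay self-contained; the names times, temps, hrs and night_of are made up, and UTC is an assumed time zone. Because the label is derived from each timestamp on its own, missing observations do not shift the groups.

```r
set.seed(10)
times <- seq(as.POSIXct("2018-05-18 00:00", tz = "UTC"),
             length.out = 1440, by = "10 mins")
temps <- sample(10:37, 1440, replace = TRUE)

# Night-time readings: 19:00 up to (but not including) 07:00
hrs <- as.integer(format(times, "%H"))
is_night <- hrs >= 19 | hrs < 7

# Subtracting 7 hours moves the 00:00-06:59 readings onto the previous
# calendar day, so a whole night shares one date (the evening it began)
night_of <- as.Date(times - 7 * 3600, tz = "UTC")

# Minimum temperature per night
night_min <- tapply(temps[is_night], night_of[is_night], min)
night_min
```

The first and last groups are partial nights (the data starts at midnight and ends at 23:50), just as in the outputs above.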
How do you set 0:00 as the end of day instead of 23:00 in hourly data? I struggle with this when using period.apply or to.period, as both return days ending at 23:00. Here is an example:
x1 = xts(seq(as.POSIXct("2018-02-01 00:00:00"), as.POSIXct("2018-02-05 23:00:00"), by="hour"), x = rnorm(120))
The following calls show that periods end at 23:00:
to.period(x1, OHLC = FALSE, drop.date = FALSE, period = "days")
x1[endpoints(x1, 'days')]
So when I am aggregating the hourly data to daily, does someone have an idea how to set the end of day at 0:00?
As already pointed out by another answer here, to.period on days computes on the data with timestamps between 00:00:00 and 23:59:59.9999999 on the day in question. So 23:00:00 is seen as the last timestamp of the day in your data, and 00:00:00 corresponds to a value in the next day's "bin".
What you can do is shift all the timestamps back 1 hour, use to.period to get the daily data points from the hourly points, and then use align.time to get the timestamps aligned correctly.
(More generally, to.period is useful for generating OHLCV-type data, so if you're generating, say, hourly bars from ticks, it makes sense to look at all the ticks between 23:00:00 and 23:59:59.99999 when creating the bar. Then 00:00:00 to 00:59:59.9999... would form the next hourly bar, and so on.)
Here is an example:
> tail(x1["2018-02-01"])
# [,1]
# 2018-02-01 18:00:00 -1.2760349
# 2018-02-01 19:00:00 -0.1496041
# 2018-02-01 20:00:00 -0.5989614
# 2018-02-01 21:00:00 -0.9691905
# 2018-02-01 22:00:00 -0.2519618
# 2018-02-01 23:00:00 -1.6081656
> head(x1["2018-02-02"])
# [,1]
# 2018-02-02 00:00:00 -0.3373271
# 2018-02-02 01:00:00 0.8312698
# 2018-02-02 02:00:00 0.9321747
# 2018-02-02 03:00:00 0.6719425
# 2018-02-02 04:00:00 -0.5597391
# 2018-02-02 05:00:00 -0.9810128
> head(x1["2018-02-03"])
# [,1]
# 2018-02-03 00:00:00 2.3746424
# 2018-02-03 01:00:00 0.8536594
# 2018-02-03 02:00:00 -0.2467268
# 2018-02-03 03:00:00 -0.1316978
# 2018-02-03 04:00:00 0.3079848
# 2018-02-03 05:00:00 0.2445634
x2 <- x1
.index(x2) <- .index(x1) - 3600
> tail(x2["2018-02-01"])
# [,1]
# 2018-02-01 18:00:00 -0.1496041
# 2018-02-01 19:00:00 -0.5989614
# 2018-02-01 20:00:00 -0.9691905
# 2018-02-01 21:00:00 -0.2519618
# 2018-02-01 22:00:00 -1.6081656
# 2018-02-01 23:00:00 -0.3373271
x.d2 <- to.period(x2, OHLC = FALSE, drop.date = FALSE, period = "days")
> x.d2
# [,1]
# 2018-01-31 23:00:00 0.12516594
# 2018-02-01 23:00:00 -0.33732710
# 2018-02-02 23:00:00 2.37464235
# 2018-02-03 23:00:00 0.51797747
# 2018-02-04 23:00:00 0.08955208
# 2018-02-05 22:00:00 0.33067734
x.d2 <- align.time(x.d2, n = 86400)
> x.d2
# [,1]
# 2018-02-01 0.12516594
# 2018-02-02 -0.33732710
# 2018-02-03 2.37464235
# 2018-02-04 0.51797747
# 2018-02-05 0.08955208
# 2018-02-06 0.33067734
Want to convince yourself? Try something like this:
x3 <- rbind(x1, xts(x = matrix(c(1,2), nrow = 2), order.by = as.POSIXct(c("2018-02-01 23:59:59.999", "2018-02-02 00:00:00"))))
x3["2018-02-01 23/2018-02-02 01"]
# [,1]
# 2018-02-01 23:00:00.000 -1.6081656
# 2018-02-01 23:59:59.999 1.0000000
# 2018-02-02 00:00:00.000 -0.3373271
# 2018-02-02 00:00:00.000 2.0000000
# 2018-02-02 01:00:00.000 0.8312698
x3.d <- to.period(x3, OHLC = FALSE, drop.date = FALSE, period = "days")
> x3.d <- align.time(x3.d, 86400)
> x3.d
[,1]
2018-02-02 1.00000000
2018-02-03 -0.09832625
2018-02-04 -0.65075506
2018-02-05 -0.09423664
2018-02-06 0.33067734
See that the value of 2 on 00:00:00 did not form the last observation in the day for 2018-02-02 (00:00:00), which went from 2018-02-01 00:00:00 to 2018-02-01 23:59:59.9999.
Of course, if you want the daily timestamp to be the start of the day, not the end of the day, which would be 2018-02-01 as start of bar for the first row, in x3.d above, you could shift back the day by one. You could do this relatively safely for most timezones, when your data doesn't involve weekend dates:
index(x3.d) = index(x3.d) - 86400
I say relatively safely because there are corner cases when there are time shifts in a time zone, e.g. be careful with daylight saving time. Simply subtracting 86400 can be a problem when stepping back across a weekend on which the clocks change:
# e.g. bad: daylight saving time starts on this weekend in US Eastern time
z <- xts(x = 9, order.by = as.POSIXct("2018-03-12", tz = "America/New_York"))
> index(z) - 86400
[1] "2018-03-10 23:00:00 EST"
i.e. the timestamp is off by one hour, when you really want the midnight timestamp (00:00:00).
You could get around this problem using something much safer like this:
library(lubridate)
# right
> index(z) - days(1)
[1] "2018-03-11 EST"
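If lubridate is not available, base R can take the same calendar-day step: seq() on a POSIXct accepts by = "DSTday", which keeps the clock time fixed across a daylight saving change. A small sketch on the same timestamp (the names fixed_step and calendar_step are made up for the example, and the America/New_York time zone data is assumed to be installed):

```r
z <- as.POSIXct("2018-03-12", tz = "America/New_York")

# Fixed 24-hour step: crosses the clock change and misses midnight
fixed_step <- z - 86400

# Calendar-day step backwards: same clock time on the previous day
calendar_step <- seq(z, by = "-1 DSTday", length.out = 2)[2]

format(fixed_step, "%Y-%m-%d %H:%M:%S")     # "2018-03-10 23:00:00"
format(calendar_step, "%Y-%m-%d %H:%M:%S")  # "2018-03-11 00:00:00"
```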
I don't think this is possible because 00:00 is the start of the day. From the manual:
These endpoints are aligned in POSIXct time to the zero second of the day at the beginning, and the 59.9999th second of the 59th minute of the 23rd hour of the final day
I think the solution here is to use minutes instead of hours. Using your example:
x1 = xts(seq(as.POSIXct("2018-02-01 00:00:00"), as.POSIXct("2018-02-05 23:59:59"), by="min"), x = rnorm(7200))
to.period(x1, OHLC = FALSE, drop.date = FALSE, period = "day")
x1[endpoints(x1, 'day')]
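The shift-then-bin idea discussed above does not depend on xts. As a rough base R sketch with made-up data (the names times, value and day are assumptions, as is the UTC time zone): shift each timestamp back one hour before taking its date, so that the 00:00 reading is counted with the day that just ended.

```r
# 48 hourly readings: 01:00 on Feb 1 through 00:00 on Feb 3
times <- seq(as.POSIXct("2018-02-01 01:00:00", tz = "UTC"),
             as.POSIXct("2018-02-03 00:00:00", tz = "UTC"), by = "hour")
value <- seq_along(times)

# Day label after shifting back one hour: 00:00 joins the previous day
day <- as.Date(times - 3600, tz = "UTC")

# Last reading of each day (the 00:00 one) and the daily total
daily_last <- tapply(value, day, function(v) v[length(v)])
daily_sum  <- tapply(value, day, sum)
daily_last
```

Here daily_last holds readings 24 and 48, which are the two 00:00 observations, so each day runs from 01:00 to 00:00 inclusive.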
I have some data which is formatted in the following way:
time count
00:00 17
00:01 62
00:02 41
So I have data from 00:00 to 23:59, with a count per minute. I'd like to group the data into 15-minute intervals such that:
time count
00:00-00:15 148
00:16-00:30 284
I have tried to do it manually, but this is exhausting, so I am sure there has to be a function or something to do it easily. I haven't figured out yet how to do it.
I'd really appreciate some help!!
Thank you very much!
For data that's in POSIXct format, you can use the cut function to create 15-minute groupings, and then aggregate by those groups. The code below shows how to do this in base R and with the dplyr and data.table packages.
First, create some fake data:
set.seed(4984)
dat = data.frame(time=seq(as.POSIXct("2016-05-01"), as.POSIXct("2016-05-01") + 60*99, by=60),
count=sample(1:50, 100, replace=TRUE))
Base R
cut the data into 15 minute groups:
dat$by15 = cut(dat$time, breaks="15 min")
time count by15
1 2016-05-01 00:00:00 22 2016-05-01 00:00:00
2 2016-05-01 00:01:00 11 2016-05-01 00:00:00
3 2016-05-01 00:02:00 31 2016-05-01 00:00:00
...
98 2016-05-01 01:37:00 20 2016-05-01 01:30:00
99 2016-05-01 01:38:00 29 2016-05-01 01:30:00
100 2016-05-01 01:39:00 37 2016-05-01 01:30:00
Now aggregate by the new grouping column, using sum as the aggregation function:
dat.summary = aggregate(count ~ by15, FUN=sum, data=dat)
by15 count
1 2016-05-01 00:00:00 312
2 2016-05-01 00:15:00 395
3 2016-05-01 00:30:00 341
4 2016-05-01 00:45:00 318
5 2016-05-01 01:00:00 349
6 2016-05-01 01:15:00 397
7 2016-05-01 01:30:00 341
dplyr
library(dplyr)
dat.summary = dat %>% group_by(by15=cut(time, "15 min")) %>%
summarise(count=sum(count))
data.table
library(data.table)
dat.summary = setDT(dat)[ , list(count=sum(count)), by=cut(time, "15 min")]
UPDATE: To answer the comment, for this case the end point of each grouping interval is as.POSIXct(as.character(dat$by15)) + 60*15 - 1. In other words, the endpoint of the grouping interval is 15 minutes minus one second from the start of the interval. We add 60*15 - 1 because POSIXct is denominated in seconds. The as.POSIXct(as.character(...)) is because cut returns a factor and this just converts it back to date-time so that we can do math on it.
If you want the end point to the nearest minute before the next interval (instead of the nearest second), you could do as.POSIXct(as.character(dat$by15)) + 60*14.
If you don't know the break interval, for example, because you chose the number of breaks and let R pick the interval, you could find the number of seconds to add by doing max(unique(diff(as.POSIXct(as.character(dat$by15))))) - 1.
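If the times are plain "HH:MM" strings, as in the question, rather than POSIXct, the bins can be computed with integer arithmetic on minutes since midnight. The following is only a sketch on made-up data (two hours of one count per minute); the column name bin and the helper variables are assumptions.

```r
# Toy data: one reading per minute for 00:00 to 01:59, each with count 1
dat <- data.frame(
  time = sprintf("%02d:%02d", rep(0:1, each = 60), rep(0:59, times = 2)),
  count = 1L,
  stringsAsFactors = FALSE
)

# Minutes since midnight for each "HH:MM" string
parts <- strsplit(dat$time, ":", fixed = TRUE)
mins <- vapply(parts,
               function(p) as.integer(p[1]) * 60L + as.integer(p[2]),
               integer(1))

# Start of the 15-minute bin, formatted back to "HH:MM"
bin_start <- (mins %/% 15L) * 15L
dat$bin <- sprintf("%02d:%02d", bin_start %/% 60L, bin_start %% 60L)

# Sum the counts per bin
agg <- aggregate(count ~ bin, data = dat, FUN = sum)
agg
```

Each bin covers 15 readings (00:00 to 00:14, 00:15 to 00:29, and so on), so every count in this toy example is 15.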
The cut approach is handy but slow with large data frames. The following approach is approximately 1,000x faster than the cut approach (tested with 400k records.)
# Function: Truncate (floor) POSIXct to time interval (specified in seconds)
# Author: Stephen McDaniel # PowerTrip Analytics
# Date : 2017MAY
# Copyright: (C) 2017 by Freakalytics, LLC
# License: MIT
floor_datetime <- function(date_var, floor_seconds = 60,
                           origin = "1970-01-01") { # defaults to minute rounding
  if (!inherits(date_var, "POSIXct")) stop("Please pass in a POSIXct variable")
  # NA values stay NA through floor(), so no special case is needed and the
  # function works on whole vectors (if (is.na(x)) only handles length 1)
  as.POSIXct(floor(as.numeric(date_var) / floor_seconds) * floor_seconds,
             origin = origin)
}
Sample output:
test <- data.frame(good = as.POSIXct(Sys.time()),
bad1 = as.Date(Sys.time()),
bad2 = as.POSIXct(NA))
test$good_15 <- floor_datetime(test$good, 15 * 60)
test$bad1_15 <- floor_datetime(test$bad1, 15 * 60)
Error in floor_datetime(test$bad1, 15 * 60) :
Please pass in a POSIXct variable
test$bad2_15 <- floor_datetime(test$bad2, 15 * 60)
test
good bad1 bad2 good_15 bad2_15
1 2017-05-06 13:55:34.48 2017-05-06 <NA> 2017-05-06 13:45:00 <NA>
You can do it in one line by using the trs function from the foqat package, like this:
df_15mins=trs(df, "15 mins")
Below is a reproducible example:
library(foqat)
head(aqi[,c(1,2)])
# Time NO
#1 2017-05-01 01:00:00 0.0376578
#2 2017-05-01 01:01:00 0.0341483
#3 2017-05-01 01:02:00 0.0310285
#4 2017-05-01 01:03:00 0.0357016
#5 2017-05-01 01:04:00 0.0337507
#6 2017-05-01 01:05:00 0.0238120
#mean
aqi_15mins=trs(aqi[,c(1,2)], "15 mins")
head(aqi_15mins)
# Time NO
#1 2017-05-01 01:00:00 0.02736549
#2 2017-05-01 01:15:00 0.03244958
#3 2017-05-01 01:30:00 0.03743626
#4 2017-05-01 01:45:00 0.02769419
#5 2017-05-01 02:00:00 0.02901817
#6 2017-05-01 02:15:00 0.03439455
The problem:
I have two dataframes that I would like to merge depending on the date/time of one dataframe being in the interval of the other dataframe.
traffic: Date and Time (Posixct), Frequency
mydata: Interval, Sum of Frequency
I would now like to calculate if the Posixct time from traffic is within the interval of mydata and if this is TRUE I would like to count the frequency in the column "Sum of Frequencies" in mydata.
The two problems, that I encountered:
1. The traffic data frame has significantly more rows than mydata. I don't know how to tell R to loop through every observation in traffic to check it against one row in mydata.
2. There can be more than one observation falling within an interval of mydata. I want R to add up the frequencies of all matching traffic observations to get a total frequency. Also, the intervals are overlapping.
Here is the data:
DateTime <- c("2014-11-01 04:00:00", "2014-11-01 04:03:00", "2014-11-01 04:06:00", "2014-11-01 04:08:00", "2014-11-01 04:10:00", "2014-11-01 04:12:00", "2015-08-01 04:13:00", "2015-08-01 04:45:00", "2015-08-01 14:15:00", "2015-08-01 14:13:00")
DateTime <- as.POSIXct(DateTime)
Frequency <- c(1,2,3,5,12,1,2,2,1,1)
traffic <- data.frame(DateTime, Frequency)
library(lubridate)
DateTime1 <- c("2014-11-01 04:00:00", "2015-08-01 04:03:00", "2015-08-01 14:00:00")
DateTime2 <- c("2014-11-01 04:15:00", "2015-08-01 04:13:00", "2015-08-01 14:15:00")
DateTime1 <- as.POSIXct(DateTime1)
DateTime2 <- as.POSIXct(DateTime2)
mydata <- data.frame(DateTime1, DateTime2)
mydata$Interval <- as.interval(DateTime1, DateTime2)
mydata$SumFrequency <- NA
The expected outcome should be something like this:
mydata$SumFrequency <- c(24, 2, 2)
head(mydata)
I tried int_overlaps from package lubridate.
Any tips on how to solve this are highly appreciated!
A short solution with foverlaps from the data.table package:
mydata <- data.table(DateTime1, DateTime2, key = c("DateTime1", "DateTime2"))
traffic <- data.table(start = DateTime, end = DateTime, Frequency, key = c("start","end"))
foverlaps(traffic, mydata, type="within", nomatch=0L)[, .(sumFreq = sum(Frequency)),
by = .(DateTime1, DateTime2)]
which gives:
DateTime1 DateTime2 sumFreq
1: 2014-11-01 04:00:00 2014-11-01 04:15:00 24
2: 2015-08-01 04:03:00 2015-08-01 04:13:00 2
3: 2015-08-01 14:00:00 2015-08-01 14:15:00 2
Another data.table approach, using %between% to filter the traffic dataset on time:
setDT(traffic)
setDT(mydata)
mydata[,SumFrequency := as.numeric(SumFrequency)] # coerce logical to numeric for next step.
mydata[,SumFrequency := sum( traffic[ DateTime %between% c(DateTime1, DateTime2), Frequency] ), by=1:nrow(mydata)]
which gives:
DateTime1 DateTime2 Interval SumFrequency
1: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:00:00 CET--2014-11-01 04:15:00 CET 24
2: 2015-08-01 04:03:00 2015-08-01 04:13:00 2015-08-01 04:03:00 CEST--2015-08-01 04:13:00 CEST 2
3: 2015-08-01 14:00:00 2015-08-01 14:15:00 2015-08-01 14:00:00 CEST--2015-08-01 14:15:00 CEST 2
If there are many rows in mydata, it could be better to create an index column and use it in the by clause:
mydata[, idx := .I]
mydata[, SumFrequency := sum( traffic[DateTime %between% c(DateTime1, DateTime2),Frequency] ),by=idx]
And this gives:
DateTime1 DateTime2 Interval SumFrequency idx
1: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:00:00 CET--2014-11-01 04:15:00 CET 24 1
2: 2015-08-01 04:03:00 2015-08-01 04:13:00 2015-08-01 04:03:00 CEST--2015-08-01 04:13:00 CEST 2 2
3: 2015-08-01 14:00:00 2015-08-01 14:15:00 2015-08-01 14:00:00 CEST--2015-08-01 14:15:00 CEST 2 3
I see two solutions :
With data.frame and plyr
You could do it using %within% function in lubridate and with a for-loop or using plyr loop functions like dlply
DateTime <- c("2014-11-01 04:00:00", "2014-11-01 04:03:00", "2014-11-01 04:06:00", "2014-11-01 04:08:00", "2014-11-01 04:10:00", "2014-11-01 04:12:00", "2015-08-01 04:13:00", "2015-08-01 04:45:00", "2015-08-01 14:15:00", "2015-08-01 14:13:00")
DateTime <- as.POSIXct(DateTime)
Frequency <- c(1,2,3,5,12,1,2,2,1,1)
traffic <- data.frame(DateTime, Frequency)
library(lubridate)
DateTime1 <- c("2014-11-01 04:00:00", "2015-08-01 04:03:00", "2015-08-01 14:00:00")
DateTime2 <- c("2014-11-01 04:15:00", "2015-08-01 04:13:00", "2015-08-01 14:15:00")
DateTime1 <- as.POSIXct(DateTime1)
DateTime2 <- as.POSIXct(DateTime2)
mydata <- data.frame(DateTime1, DateTime2)
mydata$Interval <- as.interval(DateTime1, DateTime2)
library(plyr)
# Create a group-by variable
mydata$NumInt <- 1:nrow(mydata)
mydata$SumFrequency <- dlply(mydata, .(NumInt),
function(row){
sum(
traffic[traffic$DateTime %within% row$Interval, "Frequency"]
)
})
mydata
#> DateTime1 DateTime2
#> 1 2014-11-01 04:00:00 2014-11-01 04:15:00
#> 2 2015-08-01 04:03:00 2015-08-01 04:13:00
#> 3 2015-08-01 14:00:00 2015-08-01 14:15:00
#> Interval NumInt SumFrequency
#> 1 2014-11-01 04:00:00 CET--2014-11-01 04:15:00 CET 1 24
#> 2 2015-08-01 04:03:00 CEST--2015-08-01 04:13:00 CEST 2 2
#> 3 2015-08-01 14:00:00 CEST--2015-08-01 14:15:00 CEST 3 2
With data.table and functions foverlaps
data.table has implemented a function for overlapping joins that you could use in your case with a little trick.
This function is foverlaps (I use data.table 1.9.6 below)
(see How to perform join over date ranges using data.table? and this presentation)
Notice that you do not need to create an interval with lubridate
DateTime <- c("2014-11-01 04:00:00", "2014-11-01 04:03:00", "2014-11-01 04:06:00", "2014-11-01 04:08:00", "2014-11-01 04:10:00", "2014-11-01 04:12:00", "2015-08-01 04:13:00", "2015-08-01 04:45:00", "2015-08-01 14:15:00", "2015-08-01 14:13:00")
DateTime <- as.POSIXct(DateTime)
Frequency <- c(1,2,3,5,12,1,2,2,1,1)
library(data.table)
traffic <- data.table(DateTime, Frequency)
DateTime1 <- c("2014-11-01 04:00:00", "2015-08-01 04:03:00", "2015-08-01 14:00:00")
DateTime2 <- c("2014-11-01 04:15:00", "2015-08-01 04:13:00", "2015-08-01 14:15:00")
mydata <- data.table(DateTime1 = as.POSIXct(DateTime1), DateTime2 = as.POSIXct(DateTime2))
# Use function `foverlaps` for overlapping joins
# Here's the trick : create a dummy variable to artificially have an interval
traffic[, dummy:=DateTime]
setkey(mydata, DateTime1, DateTime2)
# do the join
mydata2 <- foverlaps(traffic, mydata, by.x=c("DateTime", "dummy"), type ="within", nomatch=0L)[, dummy := NULL][]
mydata2
#> DateTime1 DateTime2 DateTime Frequency
#> 1: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:00:00 1
#> 2: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:03:00 2
#> 3: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:06:00 3
#> 4: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:08:00 5
#> 5: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:10:00 12
#> 6: 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:12:00 1
#> 7: 2015-08-01 04:03:00 2015-08-01 04:13:00 2015-08-01 04:13:00 2
#> 8: 2015-08-01 14:00:00 2015-08-01 14:15:00 2015-08-01 14:15:00 1
#> 9: 2015-08-01 14:00:00 2015-08-01 14:15:00 2015-08-01 14:13:00 1
# summarise with a sum by grouping by each line of mydata
setkeyv(mydata2, key(mydata))
mydata2[mydata, .(SumFrequency = sum(Frequency)), by = .EACHI]
#> DateTime1 DateTime2 SumFrequency
#> 1: 2014-11-01 04:00:00 2014-11-01 04:15:00 24
#> 2: 2015-08-01 04:03:00 2015-08-01 04:13:00 2
#> 3: 2015-08-01 14:00:00 2015-08-01 14:15:00 2
As far as point 2 is concerned you can use aggregate for instance
aggData <- aggregate(traffic$Frequency~format(traffic$DateTime, "%Y%m%d %H:%M"), data=traffic, sum)
This sums all frequencies in minute intervals.
And for point 1. Wouldn't a merge work?
merge(x = mydata, y = aggData, by = "DateTime", all.x = TRUE)
The outer merge is explained here
Using a for loop, we could do something like this:
for(i in 1:nrow(mydata)) {
mydata$SumFrequency[i] <- sum(traffic$Frequency[traffic$DateTime %within% mydata$Interval[i]])
}
> mydata
# DateTime1 DateTime2 Interval SumFrequency
#1 2014-11-01 04:00:00 2014-11-01 04:15:00 2014-11-01 04:00:00 CET--2014-11-01 04:15:00 CET 24
#2 2015-08-01 04:03:00 2015-08-01 04:13:00 2015-08-01 04:03:00 CEST--2015-08-01 04:13:00 CEST 2
#3 2015-08-01 14:00:00 2015-08-01 14:15:00 2015-08-01 14:00:00 CEST--2015-08-01 14:15:00 CEST 2
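The same loop can be written without lubridate by comparing the timestamps against both interval ends directly. This is a sketch under the assumption that both ends are inclusive, which is how %within% treats them; UTC is an assumed time zone so the example does not depend on the local one.

```r
DateTime <- as.POSIXct(c("2014-11-01 04:00:00", "2014-11-01 04:03:00",
                         "2014-11-01 04:06:00", "2014-11-01 04:08:00",
                         "2014-11-01 04:10:00", "2014-11-01 04:12:00",
                         "2015-08-01 04:13:00", "2015-08-01 04:45:00",
                         "2015-08-01 14:15:00", "2015-08-01 14:13:00"),
                       tz = "UTC")
Frequency <- c(1, 2, 3, 5, 12, 1, 2, 2, 1, 1)

DateTime1 <- as.POSIXct(c("2014-11-01 04:00:00", "2015-08-01 04:03:00",
                          "2015-08-01 14:00:00"), tz = "UTC")
DateTime2 <- as.POSIXct(c("2014-11-01 04:15:00", "2015-08-01 04:13:00",
                          "2015-08-01 14:15:00"), tz = "UTC")

# For each interval, sum the frequencies of all observations inside it
SumFrequency <- vapply(seq_along(DateTime1), function(i) {
  sum(Frequency[DateTime >= DateTime1[i] & DateTime <= DateTime2[i]])
}, numeric(1))

data.frame(DateTime1, DateTime2, SumFrequency)
```

Like the for loop, this scans all of traffic once per interval, so for large inputs the foverlaps join shown earlier is likely the better choice.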