Build intervals, then calculate the maximum number of disjoint intervals - R

I want to know the maximum number of flights on the ground by station.
I have the time when each flight arrives at the station and departs from the station.
The problem is that my data frame is in this format:
REG DEP ARV STD STA
XYZ ZRH GVA 2021-08-01 07:20:00 2021-08-01 08:35:00
XYZ GVA ZRH 2021-08-01 09:20:00 2021-08-01 10:35:00
KLN MUC GVA 2021-08-01 06:00:00 2021-08-01 07:10:00
KLN GVA CGD 2021-08-01 08:45:00 2021-08-01 10:10:00
So in this example,
flight XYZ arrives in GVA at 08:35 (first line, STA) and then departs from GVA at 09:20 (line 2, STD), and
flight KLN arrives in GVA at 07:10 and departs at 08:45.
So from 08:35 to 08:45 there are 2 flights in GVA.
The output should be 2 if these are the only two flights that meet in GVA that day.
If other flights meet at another time of day, say 5 flights meet in GVA in the afternoon,
then the output should be the maximum, i.e. 5.
So I was thinking of building an interval per flight, [STA, STD] or [STD, STA], and then finding the maximal disjoint intervals...
I tried this code to build the intervals, but it is not working:
interval_sta_std <- function(i, j) {
  for (i in 1:length(df)) {
    key <- df$DEP[i]
    min_key <- min(df$STD[i])
    max_key <- max(df$STD[i])
    for (j in 1:length(df)) {
      value <- df$ARV[j]
      min_value <- min(df$STA[j])
      max_value <- max(df$STA[j])
      if (value == key) {
        test_inter <- interval(min(min_value, min_key),
                               max(max_key, max_value))
      }
    }
  }
  return(test_inter)
}
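For reference, the interval idea can also be done without nested loops, via a sweep line over arrival/departure events. A minimal sketch, assuming df is the data frame above (with STD/STA already POSIXct), GVA is the station of interest, and consecutive legs of the same REG describe one ground stay:
library(dplyr)

# Pair each inbound leg with the same aircraft's next outbound leg to get
# the on-ground interval [STA, next STD] at the station:
ground <- df %>%
  arrange(REG, STD) %>%
  group_by(REG) %>%
  mutate(next_STD = lead(STD), next_DEP = lead(DEP)) %>%
  ungroup() %>%
  filter(ARV == "GVA", next_DEP == "GVA") %>%
  transmute(on = STA, off = next_STD)

# Sweep line: +1 at each arrival, -1 at each departure; the maximum of the
# running sum is the peak number of aircraft on the ground at once.
events <- rbind(data.frame(t = ground$on, d = 1L),
                data.frame(t = ground$off, d = -1L))
events <- events[order(events$t, -events$d), ]  # arrivals first on ties
max(cumsum(events$d))
# [1] 2  (with the four sample rows above)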

Perhaps one way is to look at each minute spanned by your data and count how many flights are on deck for that minute. This doesn't always scale well depending on the breadth of your data, but if you limit the minutes to a reasonable scope, then it should be fine.
Sample data
quux <- structure(list(REG = c("XYZ", "XYZ", "KLN", "KLN"), DEP = c("ZRH", "GVA", "MUC", "GVA"), ARV = c("GVA", "ZRH", "GVA", "CGD"), STD = structure(c(1627816800, 1627824000, 1627812000, 1627821900), class = c("POSIXct", "POSIXt"), tzone = ""), STA = structure(c(1627821300, 1627828500, 1627816200, 1627827000), class = c("POSIXct", "POSIXt"), tzone = "")), row.names = c(NA, -4L), class = "data.frame")
quux[,c("STD","STA")] <- lapply(quux[,c("STD","STA")], as.POSIXct)
(Converting your STD and STA to POSIXt objects.)
base R with fuzzyjoin
minutes <- seq(min(quux$STD), max(quux$STA), by = "mins")
head(minutes)
# [1] "2021-08-01 06:00:00 EDT" "2021-08-01 06:01:00 EDT" "2021-08-01 06:02:00 EDT" "2021-08-01 06:03:00 EDT"
# [5] "2021-08-01 06:04:00 EDT" "2021-08-01 06:05:00 EDT"
length(minutes)
# [1] 276
range(minutes)
# [1] "2021-08-01 06:00:00 EDT" "2021-08-01 10:35:00 EDT"
Now the join and aggregation.
joined <- fuzzyjoin::fuzzy_left_join(data.frame(M = minutes), quux, by = c("M" = "STD", "M" = "STA"), match_fun = list(`>=`, `<=`))
head(joined)
# M REG DEP ARV STD STA
# 1 2021-08-01 06:00:00 KLN MUC GVA 2021-08-01 06:00:00 2021-08-01 07:10:00
# 2 2021-08-01 06:01:00 KLN MUC GVA 2021-08-01 06:00:00 2021-08-01 07:10:00
# 3 2021-08-01 06:02:00 KLN MUC GVA 2021-08-01 06:00:00 2021-08-01 07:10:00
# 4 2021-08-01 06:03:00 KLN MUC GVA 2021-08-01 06:00:00 2021-08-01 07:10:00
# 5 2021-08-01 06:04:00 KLN MUC GVA 2021-08-01 06:00:00 2021-08-01 07:10:00
# 6 2021-08-01 06:05:00 KLN MUC GVA 2021-08-01 06:00:00 2021-08-01 07:10:00
nrow(joined)
# [1] 327
Recall that minutes had 276 entries. Now we have 327 rows, which means that for 51 of those minutes there was more than one flight on deck at a time.
joined2 <- aggregate(REG ~ M, data = joined[complete.cases(joined),], FUN = length)
nrow(joined2)
# [1] 258
head(joined2)
# M REG
# 1 2021-08-01 06:00:00 1
# 2 2021-08-01 06:01:00 1
# 3 2021-08-01 06:02:00 1
# 4 2021-08-01 06:03:00 1
# 5 2021-08-01 06:04:00 1
# 6 2021-08-01 06:05:00 1
We've reduced a bit, indicating that 258 minutes during the day(s) in the data had at least one plane on deck; if you look at the rows where REG > 1, you'll find where there are two or more.
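For instance, to list those minutes:
subset(joined2, REG > 1)  # minutes with two or more flights on deck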
The final piece:
joined2$Date <- as.Date(joined2$M)
aggregate(REG ~ Date, data = joined2, FUN = max)
# Date REG
# 1 2021-08-01 2
Note: this might be subject to time zone issues; make sure you're confident they are all correct.
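If in doubt, you can inspect the zone attribute and pin everything to one zone explicitly; for example (UTC here is just an illustration):
attr(quux$STD, "tzone")
quux[c("STD", "STA")] <- lapply(quux[c("STD", "STA")], function(x) {
  attr(x, "tzone") <- "UTC"  # changes the display zone only; the instant is unchanged
  x
})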

Related

R Populate a datetime column starting from last row

I have a dataframe that I need to add a datetime column to. It records water levels every hour for 2 years. The original data frame has the wrong dates and times, i.e. the dates say 2015 instead of 2020, and the day and month are also wrong. I do not know the original start date and time; however, I know the date and time of the very last recording (28-03-2022 14:00:00). I need to calculate the column from the bottom to the top to figure out the original start date.
Current Code
I have this code, which populates the dates from a known start date (i.e. top down), but I want to populate the data from the bottom up. Is there a way to alter this, or another solution?
# recalculate date to correct date
# set start dates
startDate5 <- as.POSIXct("2020-03-05 17:00:00")
startDateMere <- as.POSIXct("2020-07-06 17:00:00")
# find length of dataframe to populate required rows.
len5 <- max(dataList$`HMB 5`$Rec)
lenMere <- max(dataList$`HM SSSI 4`$Rec)
# calculate new date column
dataList$`HMB 5`$DateTimeNew <- seq(startDate5, by='hour', length.out=len5)
dataList$`HM SSSI 4`$DateTimeNew <-seq(startDateMere, by='hour', length.out=lenMere)
Current dataframe - top 10 rows
structure(list(Rec = 1:10, DateTime = structure(c(1436202000,
1436205600, 1436209200, 1436212800, 1436216400, 1436220000, 1436223600,
1436227200, 1436230800, 1436234400), class = c("POSIXct", "POSIXt"
), tzone = "GMT"), Temperature = c(16.59, 16.49, 16.74, 17.14,
17.47, 17.71, 18.43, 18.78, 19.06, 19.18), Pressure = c(1050.64,
1050.86, 1051.28, 1051.56, 1051.48, 1051.2, 1051.12, 1050.83,
1050.83, 1050.76), DateTimeNew = structure(c(1594051200L, 1594054800L,
1594058400L, 1594062000L, 1594065600L, 1594069200L, 1594072800L,
1594076400L, 1594080000L, 1594083600L), class = c("POSIXct",
"POSIXt"), tzone = "")), row.names = c(NA, 10L), class = "data.frame")
Desired Output
This is what the desired output looks like: the date I know is correct is, for example, '2020-07-07 02:00:00' (the value in the 10th row, final column), and I need to figure out the rest of the column from this value.
NB: I do not actually know what the original start date (2020-07-06 17:00:00) should be. It's just illustrative.
Here's a sequence method:
startDateMere <- as.POSIXct("2020-07-06 17:00:00")
new_date = seq(startDateMere, length.out = nrow(data), by = "-1 hour")
data$result = rev(new_date)
data
# Rec DateTime Temperature Pressure DateTimeNew result
# 1 1 2015-07-06 17:00:00 16.59 1050.64 2020-07-06 12:00:00 2020-07-06 08:00:00
# 2 2 2015-07-06 18:00:00 16.49 1050.86 2020-07-06 13:00:00 2020-07-06 09:00:00
# 3 3 2015-07-06 19:00:00 16.74 1051.28 2020-07-06 14:00:00 2020-07-06 10:00:00
# 4 4 2015-07-06 20:00:00 17.14 1051.56 2020-07-06 15:00:00 2020-07-06 11:00:00
# 5 5 2015-07-06 21:00:00 17.47 1051.48 2020-07-06 16:00:00 2020-07-06 12:00:00
# 6 6 2015-07-06 22:00:00 17.71 1051.20 2020-07-06 17:00:00 2020-07-06 13:00:00
# 7 7 2015-07-06 23:00:00 18.43 1051.12 2020-07-06 18:00:00 2020-07-06 14:00:00
# 8 8 2015-07-07 00:00:00 18.78 1050.83 2020-07-06 19:00:00 2020-07-06 15:00:00
# 9 9 2015-07-07 01:00:00 19.06 1050.83 2020-07-06 20:00:00 2020-07-06 16:00:00
# 10 10 2015-07-07 02:00:00 19.18 1050.76 2020-07-06 21:00:00 2020-07-06 17:00:00
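Equivalently, you can derive the unknown start directly from the known last timestamp and then fill top-down as in your current code. A minimal sketch, assuming data is the data frame above and strictly hourly spacing:
known_last <- as.POSIXct("2022-03-28 14:00:00")  # the one value you trust
start <- known_last - (nrow(data) - 1) * 3600    # back up one hour per earlier row
data$DateTimeNew <- seq(start, by = "hour", length.out = nrow(data))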

How to round datetime to nearest time of day, preferably vectorized?

Say I have a POSIXct vector like
timestamps = seq(as.POSIXct("2021-01-23"), as.POSIXct("2021-01-24"), length.out = 6)
I would like to round these times up to the nearest hour of the day in a vector:
hours_of_day = c(2, 14, 20)
i.e., the following result:
timestamps result
1 2021-01-23 00:00:00 2021-01-23 02:00:00
2 2021-01-23 04:48:00 2021-01-23 14:00:00
3 2021-01-23 09:36:00 2021-01-23 14:00:00
4 2021-01-23 14:24:00 2021-01-23 20:00:00
5 2021-01-23 19:12:00 2021-01-23 20:00:00
6 2021-01-24 00:00:00 2021-01-24 02:00:00
Is there a vectorized solution to this (or otherwise fast)? I have a few million timestamps and need to apply it for several hours_of_day.
One way to simplify this problem is to (1) find the next hours_of_day for each lubridate::hour(timestamps) and then (2) result = lubridate::floor_date(timestamps, "day") + next_hour_of_day * 3600. But how can step 1 be vectorized?
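One possible fully vectorized take on step 1 uses findInterval(); a sketch, assuming hours_of_day is sorted ascending:
library(lubridate)

dec_hour <- hour(timestamps) + minute(timestamps) / 60 + second(timestamps) / 3600
idx <- findInterval(dec_hour, hours_of_day, left.open = TRUE) + 1
wrap <- idx > length(hours_of_day)               # later than the last slot of the day
next_hour <- hours_of_day[ifelse(wrap, 1, idx)]  # wrap around to the first slot
result <- floor_date(timestamps, "day") + next_hour * 3600 + wrap * 86400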
Convert with as.POSIXlt, which lets you extract hours and minutes and compute decimal hours. In an lapply/sapply combination, first test where these are less than or equal to the hours-of-day vector and pick the matching hour with which.max. Then create the new date-time using ISOdate, and add one day wherever that date-time is smaller than the original time.
timestamps <- as.POSIXlt(timestamps)
h <- hours_of_day[sapply(lapply(with(timestamps, hour + min/60 + sec/3600),
                                `<=`, hours_of_day), which.max)]
r <- with(timestamps, ISOdate(1900 + year, mon + 1, mday, h,
                              tz = attr(timestamps, "tzone")[[1]]))
r[r < timestamps] <- r[r < timestamps] + 86400
Result
r
# [1] "2021-01-23 06:00:00 CET" "2021-01-23 06:00:00 CET"
# [3] "2021-01-23 14:00:00 CET" "2021-01-23 20:00:00 CET"
# [5] "2021-01-23 20:00:00 CET" "2021-01-24 06:00:00 CET"
# [7] "2021-01-25 06:00:00 CET" "2021-01-27 20:00:00 CET"
data.frame(timestamps, r)
# timestamps r
# 1 2021-01-23 00:00:00 2021-01-23 06:00:00
# 2 2021-01-23 04:48:00 2021-01-23 06:00:00
# 3 2021-01-23 09:36:00 2021-01-23 14:00:00
# 4 2021-01-23 14:24:00 2021-01-23 20:00:00
# 5 2021-01-23 19:12:00 2021-01-23 20:00:00
# 6 2021-01-24 00:00:00 2021-01-24 06:00:00
# 7 2021-01-24 23:59:00 2021-01-25 06:00:00
# 8 2021-01-27 20:00:00 2021-01-27 20:00:00
Note: I've added "2021-01-24 23:59:00 CET" and "2021-01-27 20:00:00 CET" to timestamps to demonstrate the date change and an exact boundary match.
Benchmark
Tested on a length 1.4e6 vector.
# Unit: seconds
# expr min lq mean median uq max neval cld
# POSIX() 32.96197 33.06495 33.32104 33.16793 33.50057 33.83321 3 a
# lubridate() 47.36412 47.57762 47.75280 47.79113 47.94715 48.10316 3 b
Data:
timestamps <- structure(c(1611356400, 1611373680, 1611390960, 1611408240, 1611425520,
1611442800, 1611529140, 1611774000), class = c("POSIXct", "POSIXt"
))
hours_of_day <- c(6, 14, 20)
I would extract the hour component, use cut to bin it, and assign the binned hours back to the original:
hours_of_day = c(2, 14, 20)
library(lubridate)
library(magrittr) ## just for the pipe
new_hours = timestamps %>%
  hour %>%
  cut(breaks = c(0, hours_of_day), labels = hours_of_day, include.lowest = TRUE) %>%
  as.character() %>%
  as.integer()
result = floor_date(timestamps, "hour")
hour(result) = new_hours
result
# [1] "2021-01-23 02:00:00 EST" "2021-01-23 14:00:00 EST" "2021-01-23 14:00:00 EST"
# [4] "2021-01-23 14:00:00 EST" "2021-01-23 20:00:00 EST" "2021-01-24 02:00:00 EST"
Building on the approach by @jay.sf, I made a floor version of the function as well, adding support for NA values.
floor_date_to = function(timestamps, hours_of_day) {
  # Handle NA with a temporary filler so the code below doesn't break
  na_timestamps = is.na(timestamps)
  timestamps[na_timestamps] = as.POSIXct("9999-12-31")
  # Proceed as usual
  timestamps = as.POSIXlt(timestamps)
  hours_of_day = rev(hours_of_day) # floor-specific: because which.max returns the first index by default
  nearest_hour = hours_of_day[sapply(lapply(with(timestamps, hour + min/60 + sec/3600), `<`, hours_of_day), function(x) which.max(-x))] # floor-specific: negated which.max()
  rounded = with(timestamps, ISOdate(1900 + year, mon + 1, mday, nearest_hour, tz = attr(timestamps, "tzone")[1]))
  rounded[rounded > timestamps] = rounded[rounded > timestamps] - 86400 # floor: use minus
  rounded[na_timestamps] = NA # restore the NA values
  return(rounded)
}
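A quick usage check, with hypothetical values:
x <- as.POSIXct(c("2021-01-23 04:48:00", NA, "2021-01-23 21:30:00"))
floor_date_to(x, hours_of_day = c(6, 14, 20))
# expected: 2021-01-22 20:00:00, NA, 2021-01-23 20:00:00 (in the local time zone)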

How to extract data from a time series based on start and end dates from a different dataframe?

I am working with water quality data and I have a list of storm events I extracted from the streamflow time series.
head(Storms)
PeakNumber PeakTime PeakHeight PeakStartTime PeakEndTime DurationHours
1 1 2019-07-21 22:15:00 81.04667 2019-07-21 21:30:00 2019-07-22 04:45:00 7.25
2 2 2019-07-22 13:45:00 66.74048 2019-07-22 13:00:00 2019-07-22 23:45:00 10.75
3 3 2019-07-11 11:30:00 49.08663 2019-07-11 10:45:00 2019-07-11 19:00:00 8.25
4 4 2019-05-29 18:45:00 37.27926 2019-05-29 18:30:00 2019-05-29 20:45:00 2.25
5 5 2019-06-27 16:30:00 33.12268 2019-06-27 16:00:00 2019-06-27 17:15:00 1.25
6 6 2019-07-11 08:15:00 31.59931 2019-07-11 07:45:00 2019-07-11 09:00:00 1.25
I would like to use these PeakStartTime and PeakEndTime points to subset my other data. The other data is a 15-minute time series in xts or data.table format (I am constantly going back and forth between the two for various functions/plots).
> head(Nitrogen)
[,1]
2019-03-20 10:00:00 2.12306
2019-03-20 10:15:00 2.13538
2019-03-20 10:30:00 2.14180
2019-03-20 10:45:00 2.14704
2019-03-20 11:00:00 2.14464
2019-03-20 11:15:00 2.15548
So I would like to create a new dataframe for each storm that is just the Nitrogen data between those PeakStartTime and PeakEndTime points. And then hopefully loop this, so it will do so for each of the peaks in the Storms dataframe.
One option is to do the comparison on each corresponding PeakStartTime/PeakEndTime pair with Map and subset the data:
library(xts)
do.call(rbind, Map(function(x, y) Nitrogen[paste(x, y, sep = "/")],
                   Storms$PeakStartTime, Storms$PeakEndTime))
# [,1]
#2019-05-29 18:30:00 -0.07102752
#2019-05-29 18:45:00 -0.19454811
#2019-05-29 19:00:00 -1.69684540
#2019-05-29 19:15:00 1.09384970
#2019-05-29 19:30:00 0.20019572
#2019-05-29 19:45:00 -0.76086259
# ...
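Since the question asks for a new data frame per storm, the Map() output can also be kept as a named list instead of rbind-ing it together:
per_storm <- Map(function(x, y) Nitrogen[paste(x, y, sep = "/")],
                 Storms$PeakStartTime, Storms$PeakEndTime)
names(per_storm) <- Storms$PeakNumber
per_storm[["4"]]  # the Nitrogen subset for peak number 4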
data
set.seed(24)
Nitrogen <- xts(rnorm(20000), order.by = seq(as.POSIXct('2019-03-20 10:00:00'),
length.out = 20000, by = '15 min'))
Storms <- structure(list(PeakNumber = 1:6, PeakTime = structure(c(1563761700,
1563817500, 1562859000, 1559169900, 1561667400, 1562847300), class = c("POSIXct",
"POSIXt"), tzone = ""), PeakHeight = c(81.04667, 66.74048, 49.08663,
37.27926, 33.12268, 31.59931), PeakStartTime = structure(c(1563759000,
1563814800, 1562856300, 1559169000, 1561665600, 1562845500), class = c("POSIXct",
"POSIXt"), tzone = ""), PeakEndTime = structure(c(1563785100,
1563853500, 1562886000, 1559177100, 1561670100, 1562850000), class = c("POSIXct",
"POSIXt"), tzone = ""), DurationHours = c(7.25, 10.75, 8.25,
2.25, 1.25, 1.25)), row.names = c("1", "2", "3", "4", "5", "6"
), class = "data.frame")

Calculation of the maximum duration over threshold in R (timeseries)

I have an xts time series of temperature data at 5-minute resolution.
head(dataset)
Time Temp
2016-04-26 10:00:00 6.877
2016-04-26 10:05:00 6.877
2016-04-26 10:10:00 6.978
2016-04-26 10:15:00 6.978
2016-04-26 10:20:00 6.978
I want to calculate the longest duration for which the temperature exceeds a certain threshold (let's say 20 °C).
Ideally I want to find all the periods in which the temperature exceeds the threshold, together with their durations.
I create a data.frame from my xts-data:
df=data.frame(Time=index(dataset),coredata(dataset))
head(df)
Time Temp
1 2016-04-26 10:00:00 6.877
2 2016-04-26 10:05:00 6.877
3 2016-04-26 10:10:00 6.978
4 2016-04-26 10:15:00 6.978
5 2016-04-26 10:20:00 6.978
6 2016-04-26 10:25:00 7.079
Then I create a subset with only the data that exceeds the threshold:
sub=(subset(x=df,subset = df$Temp>20))
head(sub)
Time Temp
7514 2016-05-22 12:05:00 20.043
7515 2016-05-22 12:10:00 20.234
7516 2016-05-22 12:15:00 20.329
7517 2016-05-22 12:20:00 20.424
7518 2016-05-22 12:25:00 20.615
7519 2016-05-22 12:30:00 20.805
But now I'm having trouble calculating the duration of each event during which the temperature exceeds the threshold. How do I identify a connected period and calculate its duration?
I would be happy if you have a solution for this question (it's my first thread, so please excuse minor mistakes). If you need more information on my data, feel free to ask.
This may work. I'll take this data as an example:
df <- structure(list(Time = structure(c(1463911500, 1463911800, 1463912100,
1463912400, 1463912700, 1463913000), class = c("POSIXct", "POSIXt"
), tzone = ""), Temp = c(20.043, 20.234, 6.329, 20.424, 20.615,
20.805)), row.names = c(NA, -6L), class = "data.frame")
> df
Time Temp
1 2016-05-22 12:05:00 20.043
2 2016-05-22 12:10:00 20.234
3 2016-05-22 12:15:00 6.329
4 2016-05-22 12:20:00 20.424
5 2016-05-22 12:25:00 20.615
6 2016-05-22 12:30:00 20.805
library(dplyr)
library(data.table) # for rleid()

df %>%
  # add an id for the different periods/events
  mutate(tmp_Temp = Temp > 20, id = rleid(tmp_Temp)) %>%
  # keep only periods with high temperature
  filter(tmp_Temp) %>%
  # for each period/event, get its duration
  group_by(id) %>%
  summarise(event_duration = difftime(last(Time), first(Time)))
id event_duration
<int> <time>
1 1 5 mins
2 3 10 mins
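One nuance: difftime(last(Time), first(Time)) measures first observation to last, so a period with a single 5-minute sample has duration 0. If each row stands for a full 5-minute interval, counting rows may be closer to what you want; a sketch:
library(dplyr)
library(data.table)  # for rleid()

events <- df %>%
  mutate(tmp_Temp = Temp > 20, id = rleid(tmp_Temp)) %>%
  filter(tmp_Temp) %>%
  group_by(id) %>%
  summarise(event_duration = n() * 5)  # minutes, at 5-minute resolution

max(events$event_duration)  # the longest exceedance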

Convert timestamps to frequency-binned timeseries in R?

I have a selection of scattered timestamp data based on requests to a particular service. This data covers approximately 3.5-4 years of requests against this service.
I am looking to turn this selection of variable-interval timestamps into a frequency-binned timeseries in R.
How would I go about converting these timestamps into a frequency-binned timeseries, such as "between 1 and 1:15PM on this day, there were 7 requests, and between 1:15 and 1:30PM there were 2, and between 1:30 and 1:45, there were 0", being sure to also have a bin where there is nothing?
The data is just a vector of timestamps from a database dump, all of the format "2014-02-17 13:10:46". Just a big ol' vector with ~2 million entries in it.
You could use tools for handling time series data from xts and zoo. Note that you will need some artificial 'data':
library(xts)
set.seed(42)
ts.index <- ISOdatetime(2018, 1, 8, 8:9, sample(60, 10), 0)
ts <- xts(rep(1, length(ts.index)), ts.index)
aggregate(ts, time(ts) - as.numeric(time(ts)) %% 900, length, regular = TRUE)
#>
#> 2018-01-08 08:15:00 1
#> 2018-01-08 08:30:00 3
#> 2018-01-08 08:45:00 1
#> 2018-01-08 09:00:00 1
#> 2018-01-08 09:15:00 1
#> 2018-01-08 09:45:00 3
Edit: If you want to include bins without observations, you can convert to a strictly regular ts object and replace the inserted NA values with zero:
raw <- aggregate(ts, time(ts) - as.numeric(time(ts)) %% 900, length, regular = TRUE)
as.xts(na.fill(as.ts(raw), 0), dateFormat = "POSIXct")
#> zoo(coredata(x), tt)
#> 2018-01-08 08:15:00 1
#> 2018-01-08 08:30:00 3
#> 2018-01-08 08:45:00 1
#> 2018-01-08 09:00:00 1
#> 2018-01-08 09:15:00 1
#> 2018-01-08 09:30:00 0
#> 2018-01-08 09:45:00 3
Edit 2: It also works for the provided sample data:
library(xts)
data <- c(1228917812, 1245038910, 1245986979, 1268750482, 1281615510, 1292561113)
class(data) = c("POSIXct", "POSIXt")
attr(data, "tzone") <- "UTC"
dput(data)
#> structure(c(1228917812, 1245038910, 1245986979, 1268750482, 1281615510,
#> 1292561113), class = c("POSIXct", "POSIXt"), tzone = "UTC")
ts <- xts(rep(1, length(data)), data)
raw <- aggregate(ts, time(ts) - as.numeric(time(ts)) %% 900, length, regular = TRUE)
head(as.xts(na.fill(as.ts(raw), 0), dateFormat = "POSIXct"))
#> zoo(coredata(x), tt)
#> 2008-12-10 15:00:00 1
#> 2008-12-10 15:15:00 0
#> 2008-12-10 15:30:00 0
#> 2008-12-10 15:45:00 0
#> 2008-12-10 16:00:00 0
#> 2008-12-10 16:15:00 0
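If you would rather stay in base R for the raw vector, cut.POSIXt gives the same 15-minute binning, and table() reports zero for empty bins because all factor levels within the observed range are kept. A sketch, where stamps stands for your ~2-million-element timestamp vector:
bins <- cut(stamps, breaks = "15 mins")  # one factor level per 15-minute bin
counts <- table(bins)                    # zeros for empty bins within the range
head(counts)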
