Union of time intervals that are not necessarily contiguous - r

I am looking for an implementation of union for time intervals which is capable of dealing with unions that are not themselves intervals.
I have noticed lubridate includes a union function for time intervals, but it always returns a single interval even if the union is not an interval (i.e., it returns the interval defined by the minimum of both start dates and the maximum of both end dates, ignoring intervening periods not covered by either interval):
library(lubridate)
int1 <- new_interval(ymd("2001-01-01"), ymd("2002-01-01"))
int2 <- new_interval(ymd("2003-06-01"), ymd("2004-01-01"))
union(int1, int2)
# Union includes intervening time between intervals.
# [1] 2001-01-01 UTC--2004-01-01 UTC
I have also looked at the interval package, but its documentation makes no reference to union.
My end goal is to use the complex union with %within%:
my_int %within% Reduce(union, list_of_intervals)
So if we consider a concrete example, suppose the list_of_intervals is:
[[1]] 2000-01-01 -- 2001-01-02
[[2]] 2001-01-01 -- 2004-01-02
[[3]] 2005-01-01 -- 2006-01-02
Then my_int <- 2001-01-01 -- 2004-01-01 is not %within% the list_of_intervals so it should return FALSE and my_int <- 2003-01-01 -- 2006-01-01 is so it should be TRUE.
However, I suspect the complex union has more uses than this.

If I understand your question correctly, you'd like to start with a set of potentially overlapping intervals and obtain a list of intervals that represents the UNION of the input set, rather than just the single interval spanning the minimum and maximum of the input set. This is the same question I had.
A similar question was asked at: Union of intervals
... but the accepted response fails with overlapping intervals. However, hosolmaz (I am new to SO, so don't know how to link to this user) posted a modification (in Python) that fixes the issue, which I then converted to R as follows:
library(dplyr) # for %>%, arrange, bind_rows
interval_union <- function(input) {
  if (nrow(input) == 1) {
    return(input)
  }
  input <- input %>% arrange(start)
  output <- input[1, ]
  for (i in 2:nrow(input)) {
    x <- input[i, ]
    if (output$stop[nrow(output)] < x$start) {
      output <- bind_rows(output, x)
    } else if (output$stop[nrow(output)] == x$start) {
      output$stop[nrow(output)] <- x$stop
    }
    if (x$stop > output$stop[nrow(output)]) {
      output$stop[nrow(output)] <- x$stop
    }
  }
  return(output)
}
With your example with overlapping and non-contiguous intervals:
d <- as.data.frame(list(
start = c('2005-01-01', '2000-01-01', '2001-01-01'),
stop = c('2006-01-02', '2001-01-02', '2004-01-02')),
stringsAsFactors = FALSE)
This produces:
> d
       start       stop
1 2005-01-01 2006-01-02
2 2000-01-01 2001-01-02
3 2001-01-01 2004-01-02
> interval_union(d)
       start       stop
1 2000-01-01 2004-01-02
2 2005-01-01 2006-01-02
I am a relative novice to R programming, so if anyone could convert the interval_union() function above to accept as parameters not only the input data frame, but also the names of the 'start' and 'stop' columns to use so the function could be more easily re-usable, that'd be great.
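One possible way to do that is sketched below in base R, so it does not need dplyr. The parameter names start_col and stop_col are my own invention, and the sketch assumes the two columns hold values that sort and compare correctly (Date objects, or ISO-formatted strings as in the example above).

```r
# Sketch: merge overlapping or touching intervals held in a data frame.
# `start_col` and `stop_col` are hypothetical parameter names, not part of
# the original function.
interval_union <- function(input, start_col = "start", stop_col = "stop") {
  if (nrow(input) <= 1) {
    return(input)
  }
  # Sort by interval start so each row only has to be compared with the
  # most recently merged interval.
  input <- input[order(input[[start_col]]), , drop = FALSE]
  output <- input[1, , drop = FALSE]
  for (i in 2:nrow(input)) {
    x <- input[i, , drop = FALSE]
    last <- nrow(output)
    if (output[[stop_col]][last] < x[[start_col]]) {
      # There is a gap before x: start a new merged interval.
      output <- rbind(output, x)
    } else if (x[[stop_col]] > output[[stop_col]][last]) {
      # x overlaps or touches the last merged interval and extends it.
      output[[stop_col]][last] <- x[[stop_col]]
    }
  }
  rownames(output) <- NULL
  output
}

# Same intervals as above, but with different (made-up) column names.
d2 <- data.frame(
  begin = as.Date(c("2005-01-01", "2000-01-01", "2001-01-01")),
  end   = as.Date(c("2006-01-02", "2001-01-02", "2004-01-02"))
)
merged <- interval_union(d2, start_col = "begin", stop_col = "end")
print(merged)
```

Passing the column names as strings and indexing with [[ ]] keeps the function free of non-standard evaluation, which seemed the simplest choice for a reusable helper.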

Well, in the example you provided, the union of int1 and int2 could be seen simply as a vector containing the two intervals:
int1 <- new_interval(ymd("2001-01-01"), ymd("2002-01-01"))
int2 <- new_interval(ymd("2003-06-01"), ymd("2004-01-01"))
ints <- c(int1,int2)
%within% works on vectors, so you can do something like this:
my_int <- new_interval(ymd("2001-01-01"), ymd("2004-01-01"))
my_int %within% ints
# [1] TRUE FALSE
So you can check whether your interval is in one of the intervals of your list with any:
any(my_int %within% ints)
# [1] TRUE
Your comment is right: the results given by %within% don't seem consistent with the documentation, which says:
If a is an interval, both its start and end dates must fall within b
to return TRUE.
Looking at the source code of %within% when a and b are both intervals, it seems to be the following:
setMethod("%within%", signature(a = "Interval", b = "Interval"), function(a,b){
  as.numeric(a@start) - as.numeric(b@start) <= b@.Data & as.numeric(a@start) - as.numeric(b@start) >= 0
})
So it seems that only the starting point of a is tested against b, which is consistent with the results. Maybe this should be considered a bug and reported?
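In the meantime, a check that tests both endpoints can be written by hand. The following is only a sketch using plain Date vectors rather than lubridate Interval objects; within_any is a made-up helper name, not a lubridate function. For it to answer "is a inside the union?", the b intervals should already be merged so that none of them overlap.

```r
# Sketch: TRUE if [a_start, a_end] lies entirely inside at least one of the
# intervals [b_start[i], b_end[i]]. `within_any` is a hypothetical helper.
within_any <- function(a_start, a_end, b_start, b_end) {
  any(a_start >= b_start & a_end <= b_end)
}

# The two intervals from the example above.
b_start <- as.Date(c("2001-01-01", "2003-06-01"))
b_end   <- as.Date(c("2002-01-01", "2004-01-01"))

# Spans the gap between the two intervals, so it is not contained in either.
spans_gap <- within_any(as.Date("2001-01-01"), as.Date("2004-01-01"),
                        b_start, b_end)
# Lies entirely inside the second interval.
inside <- within_any(as.Date("2003-07-01"), as.Date("2003-12-01"),
                     b_start, b_end)
print(c(spans_gap, inside))
```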

Related

Calculating the time length of each binary/boolean column as reference

I have two columns. One holds True/False values for a series of data, and the dataset also has a timestamp column. I want to write code that detects when the Boolean column changes to true and measures the time, from the timestamp column, until the Boolean changes back to false. This should repeat for the entire series, with the durations collected in a data frame for a histogram. Apologies for the poor attempt, I really don't know where to start. Note that the running column is stored as characters; perhaps I need to convert it to Boolean for this to work?
running <- c("t","t","f","f","t","f","t","t")
time <- c("2022-01-01 00:00:10", "2022-01-01 00:00:20","2022-01-01 00:00:30","2022-01-01 00:00:40","2022-01-01 00:00:50","2022-01-01 00:01:00","2022-01-01 00:01:10","2022-01-01 00:01:20")
dataset <- data.frame(time, running)
datafinal <- data.frame()
for (i in dataset){
if running == f,
result <- sum(i:n)
datafinal <- c(datafinal, result)
}
Converting the running column to Boolean and using a for loop is one way. You can also operate directly on the data frame; you already have one! This solution uses the tidyverse library, plus some date operations from the lubridate library. I encourage you to learn these libraries for this kind of problem.
The rleid() function from the data.table library increments by 1 every time the value in the target column running changes.
running <- c("t","t","f","f","t","f","t","t")
time <- c("2022-01-01 00:00:10", "2022-01-01 00:00:20","2022-01-01 00:00:30","2022-01-01 00:00:40","2022-01-01 00:00:50","2022-01-01 00:01:00","2022-01-01 00:01:10","2022-01-01 00:01:20")
dataset <- data.frame(time, running)
# times to date time object
dataset$time = lubridate::ymd_hms(dataset$time,tz="UTC")
library(tidyverse)
solution = dataset %>%
mutate(Grp=data.table::rleid(running)) %>% # rows in the same state before change get same value
group_by(Grp) %>% # rows in the same state are grouped together
slice(1) %>% # keep first row
ungroup %>% # you don't need grouping anymore
mutate(timeLength = difftime(time, lag(time), units="secs"))
# calculate the differences between a row and previous one (lag(n=1))
Output:
# A tibble: 5 x 4
  time                running   Grp timeLength
  <dttm>              <chr>   <int> <drtn>
1 2022-01-01 00:00:10 t           1 NA secs
2 2022-01-01 00:00:30 f           2 20 secs
3 2022-01-01 00:00:50 t           3 20 secs
4 2022-01-01 00:01:00 f           4 10 secs
5 2022-01-01 00:01:10 t           5 10 secs
If you want to get rid of the first NA row, just add to the pipeline %>% filter(!is.na(timeLength)).
Update: here is how you could do it with a for loop and nested if-else. Note that the code is longer and harder to follow.
dataset$time = lubridate::ymd_hms(dataset$time, tz = "UTC")
# empty vector for tracking the state seen in each row
current = c()
# empty data frame for the results
datafinal = data.frame()
# iterate over the row indices
for (i in seq(nrow(dataset))) {
  # record the current value of running
  current = c(current, dataset[i, ]$running)
  if (i > 1) { # the first row has no previous row to compare with
    if (current[i] == current[i - 1]) {
      next # same state as the previous row: skip to the next iteration
    } else { # state changed: time elapsed since the previous state began
      result = difftime(dataset[i, ]$time, previous_time, units = "secs")
    }
    # (only reached when the state changed, because 'next' skips the rest)
    # build the output row for this iteration
    new_row = cbind(dataset[i, ], result)
    # append the row to the final data frame
    datafinal = rbind(datafinal, new_row)
  }
  # remember the time at which the current state began
  # (runs for the first row and after every state change)
  previous_time = dataset[i, ]$time
}
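For comparison, the same run durations can probably be obtained in base R with rle(), which is roughly what rleid() does under the hood. This is only a sketch: it assumes a run lasts until the first timestamp of the next run, and that the final run (which has no following row) is cut off at the last observation. The variable names are made up for illustration.

```r
running <- c("t", "t", "f", "f", "t", "f", "t", "t")
time <- as.POSIXct(c("2022-01-01 00:00:10", "2022-01-01 00:00:20",
                     "2022-01-01 00:00:30", "2022-01-01 00:00:40",
                     "2022-01-01 00:00:50", "2022-01-01 00:01:00",
                     "2022-01-01 00:01:10", "2022-01-01 00:01:20"),
                   tz = "UTC")

# Convert the character flags to logical and find the runs of equal values.
is_running <- running == "t"
runs <- rle(is_running)

# Row index where each run starts and ends.
run_end   <- cumsum(runs$lengths)
run_start <- run_end - runs$lengths + 1

# A run is taken to last until the first row of the next run; the last run
# is truncated at the final observation (an assumption of this sketch).
next_start <- pmin(run_end + 1, length(time))
durations  <- difftime(time[next_start], time[run_start], units = "secs")

# Keep only the durations of the TRUE ("t") runs, e.g. for hist().
true_durations <- durations[runs$values]
print(true_durations)
```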

Not all values storing in a loop

I want to store values in "yy" but my code below stores only one row (last value). Please see the output below. Can somebody help to store all the values in "yy"
Thanks in advance. I am a beginner to R.
arrPol <- as.matrix(unique(TN_97_Lau_Cot[,6]))
arrYear <- as.matrix(unique(TN_97_Lau_Cot[,1]))
for (ij in length(arrPol)){
for (ik in length(arrYear)) {
newPolicy <- subset(TN_97_Lau_Cot, POLICY == as.character(arrPol[ij]) & as.numeric(arrYear[ik]))
yy <- newPolicy[which.min(newPolicy$min_dist),]
}
}
Output:
YEAR DIVISION STATE COUNTY CROP POLICY STATE_ABB LRPP min_dist
1: 2016 8 41 97 21 699609 TN 0 2.6
Here is an image of the "TN_97_Lau_Cot" matrix.
No loops required. There could be an easier way to do it, but two set-based steps are better than two loops. These are the two ways I would try and do it:
base
# Perform an aggregate and merge it to your data.frame.
TN_97_Lau_Cot_Agg <- merge(
x = TN_97_Lau_Cot,
y = aggregate(min_dist ~ YEAR + POLICY, data = TN_97_Lau_Cot, min),
by = c("YEAR","POLICY"),
all.x = TRUE
)
# Subset the values that you want.
TN_97_Lau_Cot_Final <- unique(subset(TN_97_Lau_Cot_Agg, min_dist.x == min_dist.y))
data.table
library(data.table)
# Convert your data.frame to a data.table.
TN_97_Lau_Cot <- data.table(TN_97_Lau_Cot)
# Perform a "window" function that calculates the min value for each year and policy without reducing the rows.
TN_97_Lau_Cot[, minDistAggregate:=min(min_dist), by = c("YEAR","POLICY")]
# Find the policy numbers that match the minimum distance for that year.
TN_97_Lau_Cot_Final <- unique(TN_97_Lau_Cot[min_dist==minDistAggregate, -10, with=FALSE])
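If you do want to keep the loops, it may help to see why the original only stores one row. for (ij in length(arrPol)) iterates once, with ij set to the length, rather than over 1..length (seq_along() does the latter); the year condition is missing a YEAR == comparison; and yy is overwritten on every pass. A minimal sketch of a loop that accumulates its results, using a small made-up data frame in place of TN_97_Lau_Cot (which I don't have):

```r
# Hypothetical stand-in for TN_97_Lau_Cot with only the columns used here.
dat <- data.frame(
  YEAR     = c(2015, 2015, 2016, 2016, 2016),
  POLICY   = c("A", "A", "A", "B", "B"),
  min_dist = c(3.1, 2.4, 5.0, 2.6, 4.2),
  stringsAsFactors = FALSE
)

policies <- unique(dat$POLICY)
years    <- unique(dat$YEAR)

results <- list()  # one element per (policy, year) combination with rows
for (p in policies) {
  for (y in years) {
    rows <- dat[dat$POLICY == p & dat$YEAR == y, ]
    if (nrow(rows) > 0) {
      # Append instead of overwriting, so earlier iterations are kept.
      results[[length(results) + 1]] <- rows[which.min(rows$min_dist), ]
    }
  }
}

# Stack the collected one-row data frames into a single data frame.
yy <- do.call(rbind, results)
print(yy)
```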

Matrix Error- data length doesn't match, even though it should

I am trying to write a function that will return the number of business days between two dates (not just excluding weekends, but holidays as well). I'm approaching it by building a matrix with rownames corresponding to days of the week with the elements of the matrix either a 1 or a 0: a 0 if it is a holiday or the extra couple elements to fill the matrix.
I've checked the length of each vector in the code. It checks out. I've run the code manually in the console, one line at a time, and it works perfectly. BUT if I run the function, it displays this error message:
Warning message:
In matrix(da, nrow = 7, dimnames = list(n)) :
data length [132] is not a sub-multiple or multiple of the number of rows [7]
I'm using R 3.1.1, mostly working in Rstudio. The cal mentioned in the code can be found here.
Here's the code:
dte <- function(date) {
#Input a date and it tells you the number of business (not including holidays)
#days until that date
#Take the target date and turn it into a date
d <- strptime(date,format="%Y-%m-%d")
#Obtain current date
c <- strptime(Sys.Date(), format="%Y-%m-%d")
#Calculate the difference in days
diff <- d-c
#Extract the actual number difference
f <- diff[[1]]
#Get the list of holidays
cal <- dget("cal")
cal <- as.Date(cal)
#Get the full list of dates between now and the target date
b <- Sys.Date()+0:f
#Find which days in the range are holidays
if(any(b %in% cal)) {
bt <- b[b %in% cal]
#Return the position of the holidays within the range
bn <- which(b %in% bt)
} else {
#Set holidays present to 0
bn <- 0
}
#Build a vector of the weekdays starting with the current weekday
n <- weekdays(Sys.Date()+0:6)
#Create a vector as long as the difference with a 1 in each place
v <- rep(1,f)
#Set each holiday to 0
v[bn] <- v[bn]-1
#Extra steps to make sure that the matrix is full but only with 1s where we want them.
g <- ((trunc(f/7)+1)*7)-f
u <- rep(0,g)
da <- c(v,u)
#Create the matrix
m <- matrix(da,nrow=7,dimnames=list(n))
#Extract all of the workweeks and add them up
ww <- m[c("Monday","Tuesday","Wednesday","Thursday","Friday"),]
r <- sum(ww)
r
}
The problem is that your strptime calls return POSIXt objects, which have time components and are therefore affected by daylight saving time. Observe:
(d1<-strptime("2014-08-24",format="%Y-%m-%d"))
# [1] "2014-08-24 EDT"
(d2<-strptime("2014-12-31",format="%Y-%m-%d"))
# [1] "2014-12-31 EST"
d2-d1
# Time difference of 129.0417 days
So there is not a whole number of days between the two values, which causes complications later in your code. If you use as.Date rather than strptime, you won't have this problem, because Date objects don't care about time.
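As a quick check, here are the same two dates handled as Date objects. This is just a sketch; the variable names are made up for illustration.

```r
d1 <- as.Date("2014-08-24")
d2 <- as.Date("2014-12-31")

# Date arithmetic counts whole days, so no fractional part can appear.
n_days <- as.numeric(d2 - d1)
print(n_days == round(n_days))

# The sequence of days from d1 to d2 includes both endpoints.
all_days <- seq(d1, d2, by = "1 day")
print(length(all_days))
```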
But I'm not really sure why you're even bothering with the matrix at all. I think a simpler implementation would look like this:
dte <- function(date) {
  d <- as.Date(date, format = "%Y-%m-%d")
  c <- Sys.Date()
  cal <- dget("cal")
  cal <- as.Date(cal)
  # Get the full list of dates between now and the target date
  b <- seq(c, d, by = "1 day")
  return(sum(as.POSIXlt(b)$wday %in% 1:5 & (!b %in% cal)))
}

Subsetting zoo series by a time that is not in the series

Is there a good package in R that allows to sub-set (i.e. index into) timeseries by times that are not in the time series?
E.g. for financial applications, indexing a price series by a time stamp that is not in the database, should return the latest available price before the time stamp.
in code, this is what I would like
n =15
full.dates = seq(Sys.Date(), by = 'day', length = n)
series.dates = full.dates[c(1:10, 12, 15)]
require(zoo)
series=zoo(rep(1,length(series.dates)), series.dates)
series[full.dates[11]]
this returns
Data:
numeric(0)
Index:
character(0)
however, I would like this to return the value of the last existing date before full.dates[11], which is full.dates[10]:
series[full.dates[10]]
2014-01-03
1
Thanks
You can use index to extract index of the observations in your zoo object. The index can then be used for subsetting the object. Step by step to show the logic (you only need the last step, if I have understood you correctly):
# the index of the observations, here dates
index(series)
# are the dates smaller than your reference date?
index(series) < full.dates[11]
# subset observations: dates less than reference date
series[index(series) < full.dates[11]]
# select last observation before reference date:
tail(series[index(series) < full.dates[11]], 1)
# 2014-01-03
# 1
A possible alternative may be to expand your time series and "replac[e] each NA with the most recent non-NA" using na.locf and the xout argument (see also ?na.locf and ?approx and this answer)
# expand time series to the range of dates in 'full.dates'
series2 <- na.locf(series, xout = full.dates)
series2
# select observation at reference date
series2[full.dates[10]]
# 2014-01-03
# 1
If you would rather have missing values in your incomplete series replaced by "next observation carried backward", you need to merge your series with a 'dummy' zoo object which contains the desired range of consecutive dates.
series3 <- merge(series, zoo(, full.dates))
na.locf(series3, fromLast = TRUE)
na.locf(x, xout = newdate) seems not much worse than subscripting but at any rate here we define a subclass of "zoo" called "zoo2" in which [ uses na.locf. This is an untested minimal implementation but it could be extended:
as.zoo2 <- function(x) UseMethod("as.zoo2")
as.zoo2.zoo <- function(x) structure(x, class = c("zoo2", setdiff(class(x), "zoo2")))
"[.zoo2" <- function(x, i, ...) {
  if (!missing(i) && inherits(i, class(index(x)))) {
    zoo:::`[.zoo`(na.locf(x, xout = i),, ...)
  } else as.zoo2(zoo:::`[.zoo`(x, i, ...))
}
This gives:
> series2 <- as.zoo2(series)
> series2[full.dates[11]]
2014-01-04
1
I would strongly argue that subset functions should not return the prior row if the desired index value does not exist. Subset functions should return what the user requested; they should not assume the user wanted something different than what they requested.
If this is what you want, you can handle it fairly easily with an if statement.
series.subset <- series[full.dates[11]]
if (NROW(series.subset) == 0) {
  # merge series with an empty zoo object
  # that contains the index value you want
  prior <- merge(series, zoo(, full.dates[11]))
  # lag *back* one period so the NA is on the prior value
  prior <- lag(prior, 1)
  # get the index value at the prior value
  prior <- index(prior)[is.na(prior)]
  # subset again
  series.subset <- series[prior]
}
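If an explicit lookup is acceptable instead of overloading [, base R's findInterval() can locate the last observation at or before a given time without zoo. This is only a sketch: last_at_or_before is a made-up helper name, the start date is fixed rather than taken from Sys.Date(), and the values are made distinct so the result is easy to see. It assumes the dates are sorted in increasing order.

```r
full.dates   <- seq(as.Date("2014-01-01"), by = "day", length.out = 15)
series.dates <- full.dates[c(1:10, 12, 15)]
values       <- seq_along(series.dates)  # distinct values, for illustration

# Hypothetical helper: value of the last observation at or before `when`.
# `dates` must be sorted in increasing order.
last_at_or_before <- function(when, dates, values) {
  # findInterval() returns the position of the largest date <= when,
  # or 0 if `when` is earlier than every date.
  pos <- findInterval(as.numeric(when), as.numeric(dates))
  if (pos == 0) NA else values[pos]
}

# full.dates[11] is not in the series, so the 10th observation is returned.
print(last_at_or_before(full.dates[11], series.dates, values))
```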

Custom function within subset of data, base functions, vector output?

Apologies for a semi 'double post'. I feel I should be able to crack this but I'm going round in circles. This is on a similar note to my previously well answered question:
Within ID, check for matches/differences
test <- data.frame(
ID=c(rep(1,3),rep(2,4),rep(3,2)),
DOD = c(rep("2000-03-01",3), rep("2002-05-01",4), rep("2006-09-01",2)),
DOV = c("2000-03-05","2000-06-05","2000-09-05",
"2004-03-05","2004-06-05","2004-09-05","2005-01-05",
"2006-10-03","2007-02-05")
)
What I want to do is tag the subjects whose first visit (as at DOV) was less than 180 days from their diagnosis (DOD). I have the following from the plyr package.
ddply(test, "ID", function(x) ifelse( (as.numeric(x$DOV[1]) - as.numeric(x$DOD[1])) < 180,1,0))
Which gives:
ID V1
1 A 1
2 B 0
3 C 1
What I would like is a vector 1,1,1,0,0,0,0,1,1 so I can append it as a column to the data frame. Basically this ddply function is fine: it makes a 'lookup' table where I can see which IDs have their first visit within 180 days of their diagnosis. I could then take my original test and go through it to make an indicator variable, but I'd have thought I should be able to do this in one step.
I'd also like to use base if possible. I had a method with 'by', but again it only gave one result per ID and was also a list. Have been trying with aggregate but getting things like 'by has to be a list', then 'it's not the same length' and using the formula method of input I'm stumped 'cbind(DOV,DOD) ~ ID'...
Appreciate the input, keen to learn!
After wrapping as.Date around the creation of those date columns, this returns the desired marking vector assuming the df named 'test' is sorted by ID (and done in base):
# could put an ordering operation here if needed
0 + unlist( # to make vector from list and coerce logical to integer
lapply(split(test, test$ID), # to apply fn with ID
function(x) rep( # to extend a listwise value across all ID's
min(x$DOV-x$DOD) <180, # compare the minimum of a set of intervals
NROW(x)) ) )
11 12 13 21 22 23 24 31 32 # the labels
1 1 1 0 0 0 0 1 1 # the values
I have added stringsAsFactors=FALSE to the data.frame call:
test <- data.frame(ID=c(rep(1,3),rep(2,4),rep(3,2)),
DOD = c(rep("2000-03-01",3), rep("2002-05-01",4), rep("2006-09-01",2)),
DOV = c("2000-03-05","2000-06-05","2000-09-05","2004-03-05",
"2004-06-05","2004-09-05","2005-01-05","2006-10-03","2007-02-05")
, stringsAsFactors=FALSE)
CODE
test$V1 <- ifelse(c(FALSE, diff(test$ID) == 0), 0,
1*(as.numeric(as.Date(test$DOV)-as.Date(test$DOD))<180))
test$V1 <- ave(test$V1,test$ID,FUN=max)
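A one-step variant might also work with ave() alone, since ave() recycles a single per-group result across all rows of that group. This sketch takes the earliest visit per ID with min(), so it does not depend on row order; the column name early is made up for illustration.

```r
test <- data.frame(
  ID  = c(rep(1, 3), rep(2, 4), rep(3, 2)),
  DOD = as.Date(c(rep("2000-03-01", 3), rep("2002-05-01", 4),
                  rep("2006-09-01", 2))),
  DOV = as.Date(c("2000-03-05", "2000-06-05", "2000-09-05",
                  "2004-03-05", "2004-06-05", "2004-09-05", "2005-01-05",
                  "2006-10-03", "2007-02-05"))
)

# Days from diagnosis to each visit.
days_to_visit <- as.numeric(test$DOV - test$DOD)

# For each ID, flag whether the earliest visit came within 180 days;
# ave() spreads the single per-ID value across all rows of that ID.
test$early <- ave(days_to_visit, test$ID,
                  FUN = function(d) as.numeric(min(d) < 180))
print(test$early)
```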

Resources