Generate id for durations with dplyr - r

I have a column of timestamps (Time). I consider one duration to be a run of consecutive times spaced 1 s apart:
data <- data.frame(Time = c("2021-12-01 01:01:01","2021-12-01 01:01:02","2021-12-01 01:01:03","2021-12-01 01:01:05","2021-12-01 01:01:06"))
I would like to generate an Id for each duration like this:
data <- data.frame(Time = c("2021-12-01 01:01:01","2021-12-01 01:01:02","2021-12-01 01:01:03","2021-12-01 01:01:05","2021-12-01 01:01:06"),Id = c(1,1,1,2,2))
With dplyr...
Thank you

Up front:
cumsum(c(TRUE, as.numeric(diff(as.POSIXct(data$Time)), units = "secs") > 1L))
# [1] 1 1 1 2 2
First, you should really be working with real timestamps and not strings. If you're doing anything else with your Time field, it is almost certainly going to be a number-like operation, so you should do this up-front with
data$Time <- as.POSIXct(data$Time)
This works easily here because they are well-formed along the default format of "%Y-%m-%d %H:%M:%S"; see ?strptime for the %-codes.
From here, you want to keep track of when a difference in time is more than 1 second. The differencing is easy enough with:
as.numeric(diff(data$Time), units = "secs")
# [1] 1 1 2 1
Really, the key operator is diff, but it can report minutes or hours or such if the data is widely-enough spaced; there's an internal heuristic for that. Wrapping it in as.numeric(., units="secs") forces it to always be in seconds.
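A quick check of that unit heuristic, as a minimal sketch with made-up timestamps spaced two hours apart. The tz = "UTC" argument is an assumption added here so the result does not depend on local daylight-saving rules:

```r
# Hypothetical timestamps, two hours apart
x <- as.POSIXct(c("2021-12-01 00:00:00", "2021-12-01 02:00:00",
                  "2021-12-01 04:00:00"), tz = "UTC")
d <- diff(x)

units(d)                        # "hours": picked automatically by diff()
as.numeric(d)                   # 2 2, in whatever unit diff() picked
as.numeric(d, units = "secs")   # 7200 7200, always in seconds
```

Without units = "secs", a test such as > 1L would silently compare hours (or minutes, or days) against 1.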
From here, we need a cumulative sum of how often the difference exceeds 1 second, hence the test > 1L wrapped in cumsum(. > 1L).
Note that the input has length 5 but the differences have length 4. This makes sense once you realize that each difference is taken between two adjacent elements. To get one Id per row, we prepend TRUE as the first difference-test, so the first row always starts group 1. If you have already converted Time to a POSIXt class as suggested above, the original code reduces slightly to
cumsum(c(TRUE, as.numeric(diff(data$Time), units = "secs") > 1L))
and therefore to store it as Id,
data$Id <- cumsum(c(TRUE, as.numeric(diff(data$Time), units = "secs") > 1L))
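Putting the steps together, here is a minimal base R sketch. The helper name gap_id and its max_gap parameter are hypothetical, introduced only for illustration, and tz = "UTC" is an assumption. Since the question asks for dplyr, the same cumsum() expression should also work unchanged inside dplyr::mutate().

```r
data <- data.frame(Time = c("2021-12-01 01:01:01", "2021-12-01 01:01:02",
                            "2021-12-01 01:01:03", "2021-12-01 01:01:05",
                            "2021-12-01 01:01:06"))
data$Time <- as.POSIXct(data$Time, tz = "UTC")

# gap_id (hypothetical helper): start a new id whenever the gap to the
# previous timestamp exceeds max_gap seconds
gap_id <- function(time, max_gap = 1) {
  if (length(time) == 0) return(integer(0))
  gaps <- as.numeric(diff(time), units = "secs")
  cumsum(c(TRUE, gaps > max_gap))
}

data$Id <- gap_id(data$Time)
data$Id
# [1] 1 1 1 2 2
```

This assumes the rows are already sorted by Time; unsorted input would need an order() step first.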

Related

Converting time in hh:mm:ss format to second or minutes

I have a column of times in this format: 43:46:18, which shows hh:mm:ss (h: hours, m: minutes, s: seconds). I want to convert this column to seconds (or minutes), but the ":" separators get in the way and I do not know how to deal with them.
1) Define a function which separates the 3 parts of the times into a data frame of 3 columns, converts it to a matrix and then matrix multiplies that with a vector whose 3 components are the number of seconds in an hour, minute and second. Finally use c(...) to convert that from an nx1 matrix to a plain vector.
To test, apply that to the time column of the test data dat defined in the Note at the end. No packages are used.
time2sec <- function(x) {
c(as.matrix(read.table(text = x, sep = ":")) %*% c(3600, 60, 1))
}
# test using dat from Note at the end
transform(dat, seconds = time2sec(time))
giving:
id time seconds
1 1 43:46:18 157578
2 2 43:46:18 157578
2) An alternate method is to convert to POSIXlt regarding the hour as a year and then convert that to seconds. This gives the same result as above. Again, no packages are used.
time2sec_2 <- function(x) {
with(as.POSIXlt(x, format = "%Y:%M:%S"), 3600 * (year + 1900) + 60 * min + sec)
}
transform(dat, seconds = time2sec_2(time))
Note
dat <- data.frame(id = 1:2, time = "43:46:18")
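To make the matrix step in approach 1 more concrete, here is a small sketch that shows the intermediate 3-column matrix and also derives minutes, which the question mentions. The second input value is a made-up example.

```r
x <- c("43:46:18", "0:35:24")   # second value is hypothetical

# One row per time, one column each for hours, minutes and seconds
parts <- as.matrix(read.table(text = x, sep = ":"))
parts

# Weight the columns by seconds per hour, minute and second
secs <- c(parts %*% c(3600, 60, 1))
secs
# [1] 157578   2124

# Minutes are the seconds divided by 60
mins <- secs / 60
```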
The lubridate package contains some useful functions for dealing with dates and times, in particular hms is useful for reading times in your format. You could use this as in:
library(lubridate)
as.numeric(hms("43:46:18"))
# [1] 157578 #Conversion to seconds

Calculate recovery time to a condition in time series in R

I would like to calculate the number of days from the time a condition is not met, to when it is met again, in a time series of daily data in R.
Toy data:
day <- data.frame(
date = seq.POSIXt(
from = ISOdatetime(2017,07,01,0,0,0),
to = ISOdatetime(2017,08,26,0,0,0),
by = "1 day" ))
var <- c(5,6,5,5,0,0,0,0,0,1,1,2,3,3,4,3,4,5,4,5,5,4,5,4,0,1,1,2,3,4,5,5,5,4,4,4,4,5,3,3,3,2,2,2,2,2,2,2,2,2,2,2,2,1,1,0,0)
ts = cbind(day, var)
The condition is var > 3.
I'd like to identify each "recovery" period as the time where var > 0 but <= 3, but only following var going to zero. Then, I'd like the number of days to recovery for each period.
So, for the example data given here, I'd expect this output:
period 1 6
period 2 5
Since var never "recovers" at the end of the dataset, I would either want it not identified as a recovery period, or given a recovery time of 0 days.
I tried this:
ifelse(ts$var >3, 0 ,(ifelse(ts$var>0 & ts$var<4, 1, 0)))
and I think I could pair this if else statement with something that only counts sequential 1s and that would mostly do it. Only problem is that it identifies the end period with the slow drop-off as a “recovery period”, and it shouldn’t. It should only identify periods following a zero as a recovery period.
Here is what this example data looks like: [plot of var over time]. I think it's the minimal data I can provide that shows the realistic issues I've had with making counts of data outside of recovery periods.
I need to do this over a long and much more dynamic time series, so an efficient way to do this would be greatly appreciated.
edit
- I don't think this will behave the way you expect it to if var does something like this
[... 0, 1, 2, 1, 0, 2, 4, ...]
But may possibly be adapted to handle this case.
original answer
I haven't tested this much, I'd suggest checking it works with weirder cases (e.g. var is all zeros, starts or ends at a period boundary, other corner cases...)
# ignore zeroes if they precede another zero
s <- which(var == 0 & c(tail(var, -1), NA) != 0)
e <- which(var > 3)
sapply(s, function(x) head(e[e > x], 1) - x)
The approach here is to identify all possible start and end points of periods, then find the first end point that occurs after each start point and take the difference. A simple loop or maybe even a clever regex could be a good alternative.
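One corner case worth illustrating: when no end point follows a start point, head() returns an empty vector and sapply() falls back to a list. A minimal sketch of the same start/end idea that returns NA instead is shown here. The function name recovery_days and its threshold parameter are hypothetical, not part of the original answer.

```r
# recovery_days (hypothetical helper): days from the last zero of each
# zero streak to the first later value above `threshold`, or NA if none
recovery_days <- function(var, threshold = 3) {
  s <- which(var == 0 & c(tail(var, -1), NA) != 0)  # last zero of each streak
  e <- which(var > threshold)                       # candidate end points
  vapply(s, function(x) {
    nxt <- e[e > x]
    if (length(nxt)) nxt[1] - x else NA_integer_
  }, integer(1))
}

recovery_days(c(5, 0, 0, 1, 2, 4, 0, 1, 5))
# [1] 3 2
recovery_days(c(0, 1, 2))   # never recovers
# [1] NA
```

It shares the limitation noted in the edit: for 0, 1, 2, 1, 0, 2, 4 the first zero is matched to the 4 at the end, even though another zero occurs in between.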
Here is an alternative approach which uses the rleid() function from the data.table package to group by contiguous streaks of zero and non-zero values. It then finds the position within each group of the first occurrence of a value > 3:
library(data.table)
setDT(ts)[, if (.GRP > 1) first(which(var > 3)), rleid(var == 0)]
rleid V1
1: 3 6
2: 5 5
The first group is skipped because it is either a streak of zeros or has no preceding zero value.
This approach works even in the case Callum Webb has described in the edit of his answer:
# append data
var <- c(var, 0,1,2,1,0,2,4)
date = seq.POSIXt(
from = ISOdatetime(2017,07,01,0,0,0),
along.with = var,
by = "1 day" )
ts = data.frame(date, var)
setDT(ts)[, if (.GRP > 1) first(which(var > 3)), rleid(var == 0)]
rleid V1
1: 3 6
2: 5 5
3: 9 2
So, it has recognized that there is a recovery period of 2 days after the final zero.
For the sake of completeness, in case the sequence 0, 1, 2, 1, 0 is also considered to include a recovery period of 3 days, although it never reaches a value greater than 3:
setDT(ts)[, if (.GRP > 1) if (all(var %between% c(1, 3))) .N else first(which(var > 3)),
rleid(var == 0)]
rleid V1
1: 3 6
2: 5 5
3: 7 3
4: 9 2
Here all days between two zeros are counted if all values lie between 1 and 3.
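The run-length grouping used above can also be sketched in base R with rle(), for readers who do not have data.table available. This is an illustrative equivalent of the first data.table call, not the answer's own code; the variable names runs, run_id and rec are made up.

```r
var <- c(5,6,5,5,0,0,0,0,0,1,1,2,3,3,4,3,4,5,4,5,5,4,5,4,0,1,1,2,3,4,5,5,5,
         4,4,4,4,5,3,3,3,2,2,2,2,2,2,2,2,2,2,2,2,1,1,0,0)

# Number the contiguous streaks of zero / non-zero values (like rleid)
runs   <- rle(var == 0)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# For every streak except the first, position of the first value > 3
rec <- unlist(lapply(split(var, run_id)[-1], function(g) which(g > 3)[1]))
rec <- rec[!is.na(rec)]   # drop streaks that never exceed 3
rec
# 3 5 
# 6 5 
```

The names of the result are the streak numbers, matching the rleid column in the data.table output.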

Matrix Error- data length doesn't match, even though it should

I am trying to write a function that will return the number of business days between two dates (not just excluding weekends, but holidays as well). I'm approaching it by building a matrix with rownames corresponding to days of the week with the elements of the matrix either a 1 or a 0: a 0 if it is a holiday or the extra couple elements to fill the matrix.
I've checked the length of each vector in the code. It checks out. I've run the code manually in the console, one line at a time, and it works perfectly. BUT if I run the function, it displays this error message:
Warning message:
In matrix(da, nrow = 7, dimnames = list(n)) :
data length [132] is not a sub-multiple or multiple of the number of rows [7]
I'm using R 3.1.1, mostly working in Rstudio. The cal mentioned in the code can be found here.
Here's the code:
dte <- function(date) {
#Input a date and it tells you the number of business (not including holidays)
#days until that date
#Take the target date and turn it into a date
d <- strptime(date,format="%Y-%m-%d")
#Obtain current date
c <- strptime(Sys.Date(), format="%Y-%m-%d")
#Calculate the difference in days
diff <- d-c
#Extract the actual number difference
f <- diff[[1]]
#Get the list of holidays
cal <- dget("cal")
cal <- as.Date(cal)
#Get the full list of dates between now and the target date
b <- Sys.Date()+0:f
#Find which days in the range are holidays
if(any(b %in% cal)) {
bt <- b[b %in% cal]
#Return the position of the holidays within the range
bn <- which(b %in% bt)
} else {
#Set holidays present to 0
bn <- 0
}
#Build a vector of the weekdays starting with the current weekday
n <- weekdays(Sys.Date()+0:6)
#Create a vector as long as the difference with a 1 in each place
v <- rep(1,f)
#Set each holiday to 0
v[bn] <- v[bn]-1
#Extra steps to make sure that the matrix is full but only with 1s where we want them.
g <- ((trunc(f/7)+1)*7)-f
u <- rep(0,g)
da <- c(v,u)
#Create the matrix
m <- matrix(da,nrow=7,dimnames=list(n))
#Extract all of the workweeks and add them up
ww <- m[c("Monday","Tuesday","Wednesday","Thursday","Friday"),]
r <- sum(ww)
r
}
The problem is that your strptime calls return POSIXt objects, which have time components and are therefore affected by daylight saving time. Observe:
(d1<-strptime("2014-08-24",format="%Y-%m-%d"))
# [1] "2014-08-24 EDT"
(d2<-strptime("2014-12-31",format="%Y-%m-%d"))
# [1] "2014-12-31 EST"
d2-d1
# Time difference of 129.0417 days
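So there is not a whole number of days between the two values, which causes complications later in your code. With a difference such as f = 129.0417, rep(1, f) has length 129 and g = ((trunc(f/7)+1)*7)-f is 3.9583, so rep(0, g) has length 3, giving 132 elements rather than the 133 needed to fill 7 rows. If you use as.Date rather than strptime, you won't have this problem, because Date objects don't carry a time of day.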
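For comparison, a minimal sketch of the same two dates handled as Date objects, where the difference is always a whole number of days:

```r
d1 <- as.Date("2014-08-24")
d2 <- as.Date("2014-12-31")

d2 - d1
# Time difference of 129 days

f <- as.numeric(d2 - d1)
f == round(f)   # TRUE: no fractional day from a daylight-saving shift
```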
But I'm not really sure why you're even bothering with the matrix at all. I think a simpler implementation would look like
dte <- function(date) {
d <- as.Date(date,format="%Y-%m-%d")
c <- Sys.Date()
cal <- dget("cal")
cal <- as.Date(cal)
#Get the full list of dates between now and the target date
b <- seq(c, d, by="1 day")
return(sum(as.POSIXlt(b)$wday %in% 1:5 & (!b %in% cal)))
}
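Since the function above depends on a local "cal" file and on the current date, here is a self-contained variant that can be tried directly. The name business_days and its from, to and holidays arguments are hypothetical, and the single holiday used below is just an example value.

```r
# business_days (hypothetical helper): count weekdays from `from` to `to`
# (both inclusive) that are not listed in `holidays`
business_days <- function(from, to, holidays = as.Date(character(0))) {
  days <- seq(as.Date(from), as.Date(to), by = "1 day")
  is_weekday <- as.POSIXlt(days)$wday %in% 1:5   # 0 = Sunday, 6 = Saturday
  sum(is_weekday & !(days %in% holidays))
}

# Monday to Friday of one week
business_days("2014-08-25", "2014-08-29")
# [1] 5

# Two working weeks with one Monday holiday in between
business_days("2014-08-25", "2014-09-05", holidays = as.Date("2014-09-01"))
# [1] 9
```

Like the original, this counts both end dates and assumes `from` is not later than `to`.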

Conditional subsetting of data frame based on HH:MM:SS formatted column

So I have a large df with a column called "session" that is in the format
HH:MM:SS (e.g. 0:35:24 for 35 mins and 24 secs).
I want to create a subset of the df based on a condition like > 2 mins or < 90 mins on the "session" column.
I tried to first convert the column format into Date:
df$session <- as.Date(df$session, "%h/%m/%s")
I was going to then use the subset() to create my conditional subset but the above code generates a column of NAs.
subset.morethan2min <-subset(df, CONDITION)
where CONDITION is df$session >2 mins?
How should I manipulate the "session" column in order to be able to subset on a condition as described?
Sorry very new to R so welcome any suggestions.
Thanks!
UPDATE:
I converted the session column to POSIXct, then used the minute() function from the lubridate package to get numerical values for the hour and minute components. Not a neat solution, but it seems to work for my needs right now. I would still welcome a neater solution, though.
df$sessionPOSIX <- as.POSIXct(strptime(df$session, "%H:%M:%S"))
df$minute <- minute(df$sessionPOSIX)
subset.morethan2min <- subset(df, minute > 2)
A date is not the same as a period. The easiest way to handle periods is to use the lubridate package:
library(lubridate)
df$session <- hms(df$session)
df.morethan2min <- subset(df, df$session > period(2, 'minute'))
hms() converts your duration stamps into period objects, and period() creates a period object of the specified length for comparison.
As an aside, there are numerous other ways to subset data frames, including the [ operator and functions like filter() in the dplyr package, but that's beyond what you need for your current purposes.
Probably simpler ways to do this, but here's one solution:
set.seed(1234)
tDF <- data.frame(
Val = rnorm(100),
Session = paste0(
sample(0:23,100,replace=TRUE),
":",
sample(0:59,100,replace=TRUE),
":",
sample(0:59,100,replace=TRUE),
collapse=NULL), # paste0() has no sep argument, so it is dropped here
stringsAsFactors=FALSE
)
##
toSec <- function(hms){
Long <- as.POSIXct(
paste0(
"2013-01-01 ",
hms),
format="%Y-%m-%d %H:%M:%S",
tz="America/New_York")
3600*as.numeric(substr(Long,12,13))+
60*as.numeric(substr(Long,15,16))+
as.numeric(substr(Long,18,19))
}
##
tDF <- cbind(
tDF,
Seconds = toSec(tDF$Session),
Minutes = toSec(tDF$Session)/60
)
##
> head(tDF)
Val Session Seconds Minutes
1 -1.2070657 15:21:41 55301 921.6833
2 0.2774292 12:58:24 46704 778.4000
3 1.0844412 7:32:45 27165 452.7500
4 -2.3456977 18:26:46 66406 1106.7667
5 0.4291247 12:56:34 46594 776.5667
6 0.5060559 17:27:11 62831 1047.1833
Then you can just subset your data easily by doing subset(tDF, Minutes > some_number).
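A shorter base R sketch of the same idea, which splits the strings directly instead of going through POSIXct and then applies both conditions from the question (more than 2 minutes and less than 90 minutes). The helper name to_seconds and the sample sessions are made up for illustration.

```r
# to_seconds (hypothetical helper): convert "H:MM:SS" strings to seconds
to_seconds <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) * c(3600, 60, 1)), numeric(1))
}

# Made-up example data
df <- data.frame(id = 1:4,
                 session = c("0:01:30", "0:35:24", "1:45:00", "0:02:01"),
                 stringsAsFactors = FALSE)

df$minutes <- to_seconds(df$session) / 60
subset(df, minutes > 2 & minutes < 90)   # keeps rows 2 and 4
```

Because the hours are treated as a plain number, this also handles sessions longer than 24 hours, which a clock-time format such as %H would reject.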

Subsetting zoo series by a time that is not in the series

Is there a good package in R that allows to sub-set (i.e. index into) timeseries by times that are not in the time series?
E.g. for financial applications, indexing a price series by a time stamp that is not in the database, should return the latest available price before the time stamp.
In code, this is what I would like:
n =15
full.dates = seq(Sys.Date(), by = 'day', length = n)
series.dates = full.dates[c(1:10, 12, 15)]
require(zoo)
series=zoo(rep(1,length(series.dates)), series.dates)
series[full.dates[11]]
this returns
Data:
numeric(0)
Index:
character(0)
however, I would like this to return the value of the last existing date before full.dates[11], which is full.dates[10]:
series[full.dates[10]]
2014-01-03
1
Thanks
You can use index to extract index of the observations in your zoo object. The index can then be used for subsetting the object. Step by step to show the logic (you only need the last step, if I have understood you correctly):
# the index of the observations, here dates
index(series)
# are the dates smaller than your reference date?
index(series) < full.dates[11]
# subset observations: dates less than reference date
series[index(series) < full.dates[11]]
# select last observation before reference date:
tail(series[index(series) < full.dates[11]], 1)
# 2014-01-03
# 1
A possible alternative may be to expand your time series and "replac[e] each NA with the most recent non-NA" using na.locf and the xout argument (see also ?na.locf and ?approx).
# expand time series to the range of dates in 'full.dates'
series2 <- na.locf(series, xout = full.dates)
series2
# select observation at reference date
series2[full.dates[11]]
# 2014-01-04
# 1
If you would rather have missing values in your incomplete series replaced by "next observation carried backward", you need to merge your series with a 'dummy' zoo object which contains the desired range of consecutive dates.
series3 <- merge(series, zoo(, full.dates))
na.locf(series3, fromLast = TRUE)
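The "latest observation at or before a date" lookup can also be sketched without zoo, using findInterval() from base R on plain vectors. This is a different technique from the zoo answers, shown only for illustration. The helper name last_obs is hypothetical, the start date is fixed (instead of Sys.Date()) so that full.dates[10] is 2014-01-03 as in the question, and the values are made distinct so the match is visible.

```r
full.dates   <- seq(as.Date("2013-12-25"), by = "day", length.out = 15)
series.dates <- full.dates[c(1:10, 12, 15)]
values       <- seq_along(series.dates)   # made-up, distinct values

# last_obs (hypothetical helper): latest observation at or before `query`.
# `dates` must be sorted in increasing order for findInterval().
last_obs <- function(query, dates, values) {
  idx <- findInterval(as.numeric(query), as.numeric(dates))
  idx[idx == 0] <- NA   # query lies before the first observation
  data.frame(query = query, matched = dates[idx], value = values[idx])
}

last_obs(full.dates[11], series.dates, values)
#        query    matched value
# 1 2014-01-04 2014-01-03    10
```

A query date earlier than the first observation yields NA rather than an error.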
na.locf(x, xout = newdate) seems not much worse than subscripting but at any rate here we define a subclass of "zoo" called "zoo2" in which [ uses na.locf. This is an untested minimal implementation but it could be extended:
as.zoo2 <- function(x) UseMethod("as.zoo2")
as.zoo2.zoo <- function(x) structure(x, class = c("zoo2", setdiff(class(x), "zoo2")))
"[.zoo2" <- function(x, i, ...) {
if (!missing(i) && inherits(i, class(index(x)))) {
zoo:::`[.zoo`(na.locf(x, xout = i),, ...)
} else as.zoo2(zoo:::`[.zoo`(x, i, ...))
}
This gives:
> series2 <- as.zoo2(series)
> series2[full.dates[11]]
2014-01-04
1
I would strongly argue that subset functions should not return the prior row if the desired index value does not exist. Subset functions should return what the user requested; they should not assume the user wanted something different than what they requested.
If this is what you want, you can handle it fairly easily with an if statement.
series.subset <- series[full.dates[11]]
if(NROW(series.subset)==0) {
# merge series with an empty zoo object
# that contains the index value you want
prior <- merge(series, zoo(,full.dates[11]))
# lag *back* one period so the NA is on the prior value
prior <- lag(prior, 1)
# get the index value at the prior value
prior <- index(prior)[is.na(prior)]
# subset again
series.subset <- series[prior]
}
