Converting timestamp state event logs to runtime in R data.table

I have large data set of logged timestamps corresponding to state changes (e.g., light switch flips) that look like this:
library(data.table)
library(lubridate)
foo <- data.table(
  ts = ymd_hms("2013-01-01 01:00:01",
               "2013-01-01 05:34:34",
               "2013-01-02 14:12:12",
               "2013-01-02 20:01:00",
               "2013-01-02 23:01:00",
               "2013-01-03 03:00:00",
               "2013-05-04 05:00:00"),
  state = c(1, 0, 1, 0, 0, 1, 0)
)
And I'm trying to (1) convert the history of state logs into run-times in seconds, and (2) convert these into daily cumulative run-times. Most (but not all) of the time, consecutive logged state values alternate. This is a kludgy start, but it falls a little short.
foo[, dif := c(as.numeric(diff(ts), units = "secs"), NA)]  # diff() returns n-1 values, so pad with NA
foo[state==1][, list(runtime = sum(dif)), .(floor_date(ts, "day"))]
In particular, when the state is "on" during a period that crosses midnight, this approach isn't smart enough to split things up, and it incorrectly reports a runtime longer than one day. Using diff is also fragile, since it miscomputes durations when there are consecutive identical states or NAs.
Any suggestions that correctly resolve the runtimes while staying fast and efficient for large data sets?

This should work. I played around with different starting values of foo, but there could still be some edge cases I didn't factor in. One thing to take note of: if your real data has a timezone that observes daylight saving time, then this will break when building the data.table of all dates. You can work around that by doing a force_tz to UTC or GMT first (you can change it back later). On the other hand, if you need to account for a 23- or 25-hour day, then you'll need to strategically change the dates back to your timezone.
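For example, a minimal sketch of that timezone workaround, assuming a POSIXct ts column (the return timezone below is just an illustration):

#pretend the timestamps are UTC so every calendar day is exactly 24 hours
foo[, ts := force_tz(ts, tzone = "UTC")]
#...run the steps below, then restore the original timezone if needed, e.g.:
#overlap[, datestamp := force_tz(datestamp, tzone = "America/Chicago")]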
#I'm using the devel version of data.table, which includes the shift()
#function for leading/lagging variables
foo[, (paste0("next", names(foo))) := shift(.SD, 1, 0, "lead")]
#shift() with fill=NA produced an error for some reason; this is a workaround
foo[nrow(foo), `:=`(nextts = NA, nextstate = NA)]
#make a data.table with every date from min(ts) to max(ts)
complete <- data.table(datestamp = seq(from = floor_date(foo[, min(ts)], unit = "day"),
                                       to = ceiling_date(foo[, max(ts)], unit = "day"),
                                       by = "days"))
#make a column for the end of each day
complete[, enddate := datestamp + hours(23) + minutes(59) + seconds(59.999)]
#set keys and then do the overlapping join
setkey(foo, ts, nextts)
setkey(complete, datestamp, enddate)
overlap <- foverlaps(foo[state == 1], complete, type = "any")
#compute run time for each row, clipping each interval to its day
overlap[, runtime := as.numeric(difftime(pmin(datestamp + days(1), nextts),
                                         pmax(datestamp, ts),
                                         units = "secs"))]
#summarize down to seconds per day
overlap[, list(runtime = sum(runtime)), by = datestamp]
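As a quick sanity check, the per-day totals should add up to the total length of the raw on-intervals (using the nextts column built above):

#total seconds summed across all days...
overlap[, sum(runtime)]
#...should equal the total length of the state == 1 intervals
foo[state == 1, sum(as.numeric(difftime(nextts, ts, units = "secs")), na.rm = TRUE)]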

Related

R: Turn timestamps into (as short as possible) integers

Edit 1: I think a possible solution would be to count the number of 15-minute intervals elapsed since a starting date. If anyone has thoughts on this, please come forward. Thanks.
As the title says, I am looking for a way to turn timestamps into as small as possible integers.
Explanation of the situation:
I am working with "panelAR". I have T>N panel-data containing different timestamps that look like this (300,000 rows in total):
df$timestamp[1]
[1] "2013-08-01 00:15:00 UTC"
class(df$timestamp)
[1] "POSIXct" "POSIXt"
I am using panelAR and thus need the timestamp as an integer. I can't simply use as.integer because I would exceed R's integer maximum, resulting in only NAs. This was my first try to work around the problem:
df$timestamp <- as.numeric(gsub("[: -]", "" , df$timestamp, perl=TRUE))
Take the substring starting at the 3rd position (because the leading "20" is redundant) and stop before the 2nd-to-last position (because all timestamps end at 00 seconds). (I need shorter integers so as not to exceed R's integer maximum.)
df$timestamp <- substr(df$timestamp, 3, nchar(df$timestamp)-2)
#Save as integer
df$timestamp <- as.integer(df$timestamp)
#Result
df$timestamp[1]
[1] 1308010015
This allows panelAR to work with it, but the numbers seem to be way too large. When I try to run a regression with it, I get the following error message:
"cannot allocate vector of size 1052.2 GB"
I am looking for a way to turn these timestamps into (as small as possible) integers in order to work with panelAR.
Any help is greatly appreciated.
This big number corresponds to the number of seconds elapsed since 1970-01-01 00:00:00. Do your timestamps have regular intervals? If they are, let's say, every 15 minutes, you could divide all the integers by 900, which might help.
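A minimal sketch of that idea, assuming the timestamps really are on a 15-minute grid:

#epoch seconds divided by 900 (= 15 * 60) gives one integer per 15-minute slot
ts <- as.POSIXct("2013-08-01 00:15:00", tz = "UTC")
as.integer(as.numeric(ts) %/% 900)
#[1] 1528129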
Another option is to pick your earliest date and subtract it from the others:
#generate some dates:
a <- as.POSIXct("2013-01-01 00:00:00 UTC")
b <- as.POSIXct("2013-08-01 00:15:00 UTC")
series <- seq(a,b, by = 'min')
#calculate the difference explicitly in seconds (a bare `-` picks its
#units automatically and can silently return days instead of seconds)
elapsed <- as.numeric(difftime(series, min(series), units = "secs"))
If you still get memory problems, I might combine both.
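Combining both might look roughly like this (a sketch, again assuming 15-minute spacing):

#subtract the earliest timestamp, then divide by the 15-minute spacing
idx <- as.integer(as.numeric(difftime(series, min(series), units = "secs")) %/% 900)
range(idx)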
I managed to solve the main question. Since a memory error still occurs, I think it stems from the number of observations and the way panelAR computes things. I will open a separate question for that matter.
I used
df$timestampnew <- as.integer(difftime(df$timestamp, "2013-01-01 00:00:00", units = "min")/15)
to get integers that count the number of 15-min intervals elapsed since a certain date.
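For illustration, a small self-contained check of that one-liner (the origin date is the one assumed above):

ts <- as.POSIXct(c("2013-01-01 00:15:00", "2013-08-01 00:15:00"), tz = "UTC")
as.integer(difftime(ts, as.POSIXct("2013-01-01 00:00:00", tz = "UTC"), units = "mins") / 15)
#[1]     1 20353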

Grouping Time Duration by Date when Intervals Cross Midnight

I'm working with a simple dataframe in R containing two columns that represent a time interval:
Started (Date/Time)
Ended (Date/Time)
I want to create a column containing the duration of these time intervals where I can then group by date. The issue is some of the intervals cross midnight and thus have time durations associated with two different dates. Rather than arbitrarily grouping these by their start/end dates I'd like to find a way to include times prior to midnight in one date group and those after midnight in the next day's group.
My current approach seems inefficient, plus I'm hitting a roadblock. First I reformatted the df and created a blank column to hold duration, plus another to hold a "new end date" for performing interval operations:
Start.Date
Start.Time
End.Date
End.Time
Duration
End.Time.New
I then used a loop to find instances where the time crossed midnight, storing the last second of that day, 23:59:59, in the End.Time.New column:
for (i in 1:nrow(df)) {
  if (df$End.Time[i] < df$Start.Time[i]) {
    df$End.Time.New[i] <- '23:59:59'
  }
}
The idea would be that, for instances where End.Time.New is not NA, I could calculate Duration from Start.Time and End.Time.New and use Start.Date as my group-by variable. I would then have to generate a duplicate row that adds one day to the start date and perform a similar operation (using End.Date and 00:00:00) to populate the duration column, and I haven't been able to figure out how to make this work.
Is this separate-and-loop approach the best way to achieve this or is there a more efficient strategy using functions I may not be aware of?
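One vectorized sketch of the duplicate-row idea described above, assuming POSIXct Started and Ended columns and that no interval spans more than one midnight:

library(lubridate)
crosses <- floor_date(df$Ended, "day") > floor_date(df$Started, "day")
pre  <- df[crosses, ]
post <- df[crosses, ]
pre$Ended    <- floor_date(pre$Started, "day") + days(1)  # clip to end of start day
post$Started <- floor_date(post$Ended, "day")             # resume at midnight of end day
out <- rbind(df[!crosses, ], pre, post)
out$Duration <- as.numeric(difftime(out$Ended, out$Started, units = "secs"))
out$Group    <- floor_date(out$Started, "day")            # date to group by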

Changing Time To A Comparable Function In R

I have a dataset in .csv, and I have added a column of my own in the csv containing the total time taken for a task to be completed. There are two other columns with the start time and the end time, from which I calculated the total-time-taken column. The start and end columns are in the datetime format 5/7/2018 16:13, while the total-time-taken column is in the format 0:08:20 (H:MM:SS).
I understand that for datetimes it is possible to use as.Date or as.POSIXlt to change the variable type from factor to date. Is there a function I can use to convert my total-time-taken column (currently a factor) so that I can use it for scatterplots/plots in general? I tried as.numeric, but the numbers that come out are gibberish and do not correspond to the original times.
If you want to plot the total time taken for each row, then I would suggest just plotting that difference as seconds. Here is a code snippet which shows how you can convert your start or end date into a numerical value:
start <- "5/7/2018 16:13"
start_date <- as.POSIXct(start, format="%d/%m/%Y %H:%M")
as.numeric(start_date)
[1] 1530799980
The above is a UNIX timestamp, which is number of seconds since the epoch (January 1, 1970). But, since you want a difference between start and end times, this detail does not really matter for you, and the difference you get should be valid.
If you want to use minutes, hours, or some other time unit, then you can easily convert.
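If the total-time-taken strings themselves need converting (rather than recomputing from start/end), one hedged option is lubridate's hms, which parses "H:MM:SS" into something numeric-friendly:

library(lubridate)
total <- hms("0:08:20")   # the question's example value
as.numeric(total)         # 500 seconds
as.numeric(total) / 60    # ~8.33 minutes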

Fix split.xts behaviour prior to the epoch (1-1-1970)

I noticed some strange xts behaviour when trying to split an object that goes back a long way. The behaviour of split changes at the epoch.
#Create some data
dates <- seq(as.Date("1960-01-01"),as.Date("1980-01-01"),"days")
x <- rnorm(length(dates))
data <- xts(x, order.by=dates)
If we split the xts object by week, it defines the last day of the week as Monday prior to 1970. Post-1970, it defines it as Sunday (expected behaviour).
#Split the data, keep the last day of the week
lastdayofweek <- do.call(rbind, lapply(split(data, "weeks"), last))
head(lastdayofweek)
tail(lastdayofweek)
(Calendar images for 1960 and 1979 omitted.)
This seems to only be a problem for weeks, not months or years.
#Split the data, keep the last day of the month
lastdayofmonth <- do.call(rbind, lapply(split(data, "months"), last))
head(lastdayofmonth)
tail(lastdayofmonth)
The behaviour seems likely to be due to the following, though I am not sure why it would apply only to weeks. From the xts documentation on CRAN:
"For dates prior to the epoch (1970-01-01) the ending time is aligned to the 59.0000 second. This is due to a bug/feature in the R implementation of asPOSIXct and mktime0 at the C-source level. This limits the precision of ranges prior to 1970 to 1 minute granularity with the current xts workaround."
My workaround has been to shift the dates before splitting the objects for data prior to 1970, if I am splitting on weeks. I expect someone else has a more elegant solution (or a way to avoid the error).
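For reference, a rough sketch of that shift-then-split workaround (the offset below is just an illustration; any whole number of weeks that moves the data past 1970 preserves weekday alignment):

#shift pre-epoch data forward by a whole number of weeks, split, then shift back
offset <- 7 * 600  # 4200 days, comfortably past the epoch
shifted <- xts(coredata(data), order.by = index(data) + offset)
weekly <- lapply(split(shifted, "weeks"),
                 function(p) xts(coredata(p), order.by = index(p) - offset))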
EDIT: To be clear as to what the question is, I am looking for an answer that
a) specifies why this happens (so I can understand the nature of the error better, and therefore avoid it) and/or
b) the best workaround to deal with it.
One "workaround" would be to check out Rev. 743 or earlier, as it appears to me that this broke in Rev. 744.
svn checkout -r 743 svn://svn.r-forge.r-project.org/svnroot/xts/
But a much better idea is to file a bug report so that you don't have to use an old version forever. (Also, other bugs may have been patched and/or new features added since Rev. 743.)

Creating a specific sequence of date/times in R

I want to create a single column with a sequence of date/times increasing every hour for one year or one month (for example). I was using code like this to generate the sequence:
start.date <- "2012-01-15"
start.time <- "00:00:00"
interval <- 60                   # 60 minutes
increment.mins <- interval * 60  # the increment is actually in seconds
x <- paste(start.date, start.time)
for (i in 1:365) {
  print(strptime(x, "%Y-%m-%d %H:%M:%S") + i * increment.mins)
}
However, I am not sure how to specify the range of the sequence of dates and hours, and I have been having problems dealing with the first hour "00:00:00". What is the best way to specify the length of the date/time sequence for a month, a year, etc.? Any suggestion will be appreciated.
I would strongly recommend using the POSIXct datatype. That way you can use seq without any problems and work with the data however you want.
start <- as.POSIXct("2012-01-15")
interval <- 60  # minutes
end <- start + as.difftime(1, units = "days")
seq(from = start, by = interval * 60, to = end)  # `by` is in seconds
Now you can do whatever you want with your vector of timestamps.
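To get exactly one month or one year of hourly stamps, seq also accepts string increments and a length.out; a sketch:

start <- as.POSIXct("2012-01-15 00:00:00")
one_month <- seq(from = start, by = "hour", length.out = 31 * 24)
one_year  <- seq(from = start, by = "hour", to = start + as.difftime(365, units = "days"))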
Try this. mondate is very clever about advancing by a month. For example, it will advance the last day of January to the last day of February, whereas other date/time classes tend to overshoot into March. chron does not use time zones, so you can't get the time-zone bugs with this code that you can when using POSIXct. Here x is from the question.
library(chron)
library(mondate)
start.time.num <- as.numeric(as.chron(x))
# +1 means one month. Use +12 if you want one year.
end.time.num <- as.numeric(as.chron(paste(mondate(x)+1, start.time)))
# 1/24 means one hour. Change as needed.
hours <- as.chron(seq(start.time.num, end.time.num, 1/24))
