I'm working with a simple dataframe in R containing two columns that represent a time interval:
Started (Date/Time)
Ended (Date/Time)
I want to create a column containing the duration of these time intervals where I can then group by date. The issue is some of the intervals cross midnight and thus have time durations associated with two different dates. Rather than arbitrarily grouping these by their start/end dates I'd like to find a way to include times prior to midnight in one date group and those after midnight in the next day's group.
My current approach seems inefficient, plus I'm hitting a roadblock. First I reformatted the df and created a blank column to hold duration, plus another to hold a "new end date" for performing interval operations:
Start.Date
Start.Time
End.Date
End.Time
Duration
End.Time.New
I then used a loop to find instances where the interval crossed midnight and store the last second of that day, 23:59:59, in the End.Time.New column:
for (i in 1:nrow(df)) {
  if (df$End.Time[i] < df$Start.Time[i]) {
    df$End.Time.New[i] <- '23:59:59'
  }
}
The idea would be that, for instances where End.Time.New is not NA, I could calculate Duration using Start.Time and End.Time.New, and use Start.Date as my group-by variable. I would then have to generate an identical row that adds 1 day to the start date and perform a similar operation (using End.Date and 00:00:00) to populate the duration column, but I haven't been able to figure out how to make this work.
Is this separate-and-loop approach the best way to achieve this, or is there a more efficient strategy using functions I may not be aware of?
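For what it's worth, the split-at-midnight idea can be vectorised without a loop or the extra columns. This is only a sketch: the sample data and column names are made up, the columns are assumed to already be POSIXct, and it assumes no interval spans more than one midnight:

```r
# hypothetical sample data: one interval crossing midnight, one that doesn't
df <- data.frame(
  Started = as.POSIXct(c("2023-01-01 22:30:00", "2023-01-02 09:00:00"), tz = "UTC"),
  Ended   = as.POSIXct(c("2023-01-02 01:15:00", "2023-01-02 10:00:00"), tz = "UTC")
)

# the midnight following each start
cut <- as.POSIXct(paste(as.Date(df$Started) + 1, "00:00:00"), tz = "UTC")
crosses <- df$Ended > cut

# portion of each interval before midnight (the whole interval if it fits in one day)
before <- data.frame(
  date = as.Date(df$Started),
  secs = as.numeric(difftime(pmin(df$Ended, cut), df$Started, units = "secs"))
)
# extra rows for the portion after midnight, grouped under the next day
after <- data.frame(
  date = as.Date(cut[crosses]),
  secs = as.numeric(difftime(df$Ended[crosses], cut[crosses], units = "secs"))
)
split_df <- rbind(before, after)
aggregate(secs ~ date, split_df, sum)
```

This builds the "duplicate row" you describe only for the intervals that actually cross midnight, then sums seconds per calendar date.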
I have a dataset in .csv, and I have added a column of my own to the csv that holds the total time taken for a task to be completed. There are two other columns that consist of the start time and the end time, and that is where I calculated the total-time-taken column from. The start time and end time columns are in the datetime format 5/7/2018 16:13, while the total-time-taken column is in the format 0:08:20 (H:MM:SS).
I understand that for datetimes it is possible to use the functions as.Date or as.POSIXlt to change the variable type from a factor to a date. Is there a function I can use to convert my total-time-taken column (also a factor) so that I can use it to plot scatterplots/plots in general? I tried as.numeric, but the numbers that come out are gibberish and do not correspond to the original time.
If you want to plot the total time taken for each row, then I would suggest just plotting that difference as seconds. Here is a code snippet which shows how you can convert your start or end date into a numerical value:
start <- "5/7/2018 16:13"
start_date <- as.POSIXct(start, format="%d/%m/%Y %H:%M")
as.numeric(start_date)
[1] 1530799980
The above is a UNIX timestamp, which is the number of seconds since the epoch (January 1, 1970). But since you want a difference between start and end times, this detail does not really matter for you, and the difference you get should be valid.
If you want to use minutes, hours, or some other time unit, then you can easily convert.
Is there a way to window filter dates by a number of days excluding weekends?
I know you can use the between function for filtering between two specific dates, but I only know one of the two dates; the other date I would like is 4 days prior, counting business days only (not weekends).
A pseudo-example of what I am looking for: given this Wednesday, I want to filter everything up to 4 business days beforehand:
window(z, start = as.POSIXct("2017-09-13"), end = as.POSIXct("2017-09-20"))
Another example would be if I am given this Friday's date, the start date would be Monday.
Ideally, I want to be able to play with the window value.
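Base R has no built-in business-day offset, but a small helper that walks back one day at a time and skips Saturdays and Sundays is enough to compute the missing start date for the window. The function name busdays_before is made up, and the window value n is the knob you wanted to play with:

```r
# step back `n` business days from `end`, skipping weekends
busdays_before <- function(end, n) {
  d <- as.Date(end)
  while (n > 0) {
    d <- d - 1
    # %u gives the ISO weekday: "6" = Saturday, "7" = Sunday
    if (!format(d, "%u") %in% c("6", "7")) n <- n - 1
  }
  d
}

busdays_before(as.Date("2017-09-20"), 4)  # Wed 2017-09-20 -> Thu 2017-09-14
```

The result feeds straight into your window() call as the start argument. It also matches your second example: 4 business days before a Friday lands on the preceding Monday.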
I have large data set of logged timestamps corresponding to state changes (e.g., light switch flips) that look like this:
library(data.table)
library(lubridate)
foo <-
data.table(ts = ymd_hms("2013-01-01 01:00:01",
"2013-01-01 05:34:34",
"2013-01-02 14:12:12",
"2013-01-02 20:01:00",
"2013-01-02 23:01:00",
"2013-01-03 03:00:00",
"2013-05-04 05:00:00"),
state = c(1, 0, 1, 0, 0, 1, 0) )
And I'm trying to (1) convert the history of state logs into run-times in seconds, and (2) convert these into daily cumulative run-times. Most (but not all) of the time, consecutive logged state values alternate. This is a kludgy start, but it falls a little short.
foo[, dif := c(as.numeric(diff(ts), units = "secs"), NA)]  # pad with NA so lengths match
foo[state == 1][, list(runtime = sum(dif)), by = .(floor_date(ts, "day"))]
In particular, when the state is "on" during a period that crosses midnight, this approach isn't smart enough to split things up, and it incorrectly reports a runtime longer than one day. Using diff is also fragile, since it will give wrong answers if there are consecutive identical states or NAs.
Any suggestions that will correctly resolve the runtimes while staying fast and efficient for large data sets?
This should work. I played around with different starting values of foo, but there could still be some edge cases I didn't factor in. One thing to take note of: if your real data has a timezone that observes daylight saving time, this will break when building the data.table of all dates. You can work around that by doing a force_tz to UTC or GMT first (you can change it back later). On the other hand, if you need to account for a 25-hour or 23-hour day, you'll need to strategically convert back to your timezone.
# I'm using the devel version of data.table, which includes shift() for leading/lagging variables
foo[, (paste0("next", names(foo))) := shift(.SD, 1, fill = 0, type = "lead")]
# shift with fill = NA produced an error for some reason; this is a workaround
foo[nrow(foo), `:=`(nextts = NA, nextstate = NA)]

# make a data.table with every date from min(ts) to max(ts)
complete <- data.table(
  datestamp = seq(from = floor_date(foo[, min(ts)], unit = "day"),
                  to   = ceiling_date(foo[, max(ts)], unit = "day"),
                  by   = "days"))
# make a column for the end of each day
complete[, enddate := datestamp + hours(23) + minutes(59) + seconds(59.999)]

# set keys, then do the overlapping join
setkey(foo, ts, nextts)
setkey(complete, datestamp, enddate)
overlap <- foverlaps(foo[state == 1], complete, type = "any")

# compute run time for each row, clipped to the day it falls in
overlap[, runtime := as.numeric(difftime(pmin(datestamp + days(1), nextts),
                                         pmax(datestamp, ts), units = "secs"))]
# summarize down to seconds per day
overlap[, list(runtime = sum(runtime)), by = datestamp]
I have a csv file that contains many thousands of timestamped data points. The file includes the following columns: Date, Tag, East, North & DistFromMean. The following is a sample of the data in the file:
The data is recorded approximately every 15 minutes for 12 tags over a month. Starting from the first date entry, I want to select subsets of the data, e.g. every 3 hours, but because the tags transmit at slightly different rates I need minimum and maximum bounds around each start and end time.
I have found a related previous question, but I don't understand the answer well enough to implement it.
The solution could first ask for the tag number, then the period required, perhaps in minutes from the start time (i.e. every 3 hrs or 180 minutes), then the minimum and maximum time ranges, both of which would be constant for whatever period was used. The minimum and maximum would probably need to be plus and minus 6 minutes from the selected period.
As the code below shows, I've managed to read in the file, change the Date format to POSIXlt and extract data within a specific time frame but the bit I'm stuck on is extracting the data every nth minute and within a range.
TestData <- read.csv("TestData.csv", header = TRUE, as.is = TRUE)
TestData$Date <- strptime(TestData$Date, "%d/%m/%Y %H:%M")
TestData[TestData$Date >= as.POSIXlt("2014-02-26 07:10:00") &
         TestData$Date <  as.POSIXlt("2014-02-26 07:18:00"), ]
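One way to sketch the "every nth minute within a tolerance" step in base R: compute each row's offset in minutes from the first record, reduce it modulo the period, and keep rows within the tolerance of a period boundary. The function and its name (select_periodic) are hypothetical, and Date is assumed to already be POSIXct:

```r
# period and tol are in minutes (e.g. 180 and 6 for every 3 h, plus/minus 6 min)
select_periodic <- function(df, tag, period = 180, tol = 6) {
  sub <- df[df$Tag == tag, ]
  t0 <- min(sub$Date)
  # minutes elapsed since the first record, reduced modulo the period
  offset <- as.numeric(difftime(sub$Date, t0, units = "mins")) %% period
  # keep rows within `tol` minutes either side of a period boundary
  sub[offset <= tol | offset >= period - tol, ]
}

# made-up data: records at 0, 5, 90, 175, 182 and 360 minutes after the start
d <- data.frame(
  Tag  = "A",
  Date = as.POSIXct("2014-02-26 07:10:00", tz = "UTC") +
         c(0, 5, 90, 175, 182, 360) * 60
)
select_periodic(d, "A")  # drops only the 90-minute row
```

The tag, period, and tolerance are the three inputs you describe, so this could be wrapped in whatever prompting logic you need.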
I want to create a single column with a sequence of date/time increasing every hour for one year or one month (for example). I was using a code like this to generate this sequence:
start.date <- "2012-01-15"
start.time <- "00:00:00"
interval <- 60                   # 60 minutes
increment.mins <- interval * 60  # step expressed in seconds
x <- paste(start.date, start.time)
for (i in 1:365) {
  print(strptime(x, "%Y-%m-%d %H:%M:%S") + i * increment.mins)
}
However, I am not sure how to specify the range of the sequence of dates and hours, and I have been having problems dealing with the first hour, "00:00:00". What is the best way to specify the length of the date/time sequence for a month, a year, etc.? Any suggestion will be appreciated.
I would strongly recommend using the POSIXct datatype. That way you can use seq without any problems and use the resulting timestamps however you want.
start <- as.POSIXct("2012-01-15")
interval <- 60  # minutes
end <- start + as.difftime(1, units = "days")
seq(from = start, by = interval * 60, to = end)  # by is in seconds
Now you can do whatever you want with your vector of timestamps.
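seq.POSIXt also understands calendar units, so extending the same idea to a month or a year is just a different end point. A sketch (pinned to UTC to sidestep DST; swap "month" for "year" as needed):

```r
start <- as.POSIXct("2012-01-15 00:00:00", tz = "UTC")

# advance one calendar month to get the end point
end <- seq(start, by = "month", length.out = 2)[2]

# one timestamp per hour, starting at 00:00:00
hourly <- seq(from = start, to = end, by = "hour")
length(hourly)  # 31 days * 24 h, plus the endpoint -> 745
```

Because the first element is the start itself, the "00:00:00" hour is included automatically.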
Try this. mondate is very clever about advancing by a month: for example, it will advance the last day of January to the last day of February, whereas other date/time classes tend to overshoot into March. chron does not use time zones, so with this code you can't get the time zone bugs you can run into with POSIXct. Here x is from the question.
library(chron)
library(mondate)
start.time.num <- as.numeric(as.chron(x))
# +1 means one month. Use +12 if you want one year.
end.time.num <- as.numeric(as.chron(paste(mondate(x)+1, start.time)))
# 1/24 means one hour. Change as needed.
hours <- as.chron(seq(start.time.num, end.time.num, 1/24))