Time to failure variable based on start and end timestamps in R

I have two data sets. Data set 1 contains timestamps at 15-minute intervals, starting at 2009-08-18 18:15:00 and ending at 2012-11-09 22:30:00, with measurements taken at those times. Data set 2 has start and end timestamps for faults occurring in a factory. There are 6 fault types; their start and end times also fall on 15-minute intervals, though a fault can last longer than one interval, and they all fall between 2009-08-18 18:15:00 and 2012-11-09 22:30:00. I am trying to create a time-to-failure variable for the faults, where -i indicates the next fault is i intervals (of 15 minutes) away and i indicates the fault started i intervals ago. For example,
DataSet1
Timestamp            Sensor 1
2009-09-04 10:00:00  30
2009-09-04 10:30:00  40
2009-09-04 10:45:00  33
2009-09-04 11:00:00  23
2009-09-04 11:15:00  24
2009-09-04 11:30:00  42
DataSet 2
Start Time      End Time        Fault Type
09/04/09 10:45  9/4/2009 11:15  1
09/04/09 21:45  9/4/2009 22:00  1
09/04/09 23:00  9/4/2009 23:15  1
09/05/09 10:45  9/5/2009 11:15  1
09/05/09 21:30  9/5/2009 23:15  1
09/08/09 10:45  9/8/2009 12:30  1
So what I want to end up with is the following time-to-failure variable (TTF1); I would then repeat the process for fault types 2-6:
Timestamp            Sensor 1  TTF1
2009-09-04 10:00:00  30         -3
2009-09-04 10:30:00  40         -1
2009-09-04 10:45:00  33          0
2009-09-04 11:00:00  23          1
2009-09-04 11:15:00  24          2
2009-09-04 11:30:00  42        -41
I know I can use the sqldf function to separate out each fault type, but I have no clue where to begin with counting the time-to-fault variable. I'm very stuck; any help would be greatly appreciated!

You can use the difftime() function from base R to get the time difference between two timestamps, for example:
(z <- Sys.time() - 3600)
Sys.time() - z # just over 3600 seconds.
as.difftime(c("0:3:20", "11:23:15"))
as.difftime(c("3:20", "23:15", "2:"), format = "%H:%M") # 3rd gives NA
(z <- as.difftime(c(0,30,60), units = "mins"))
as.numeric(z, units = "secs")
as.numeric(z, units = "hours")
format(z)
I would recommend setting units = "mins". You can convert the result to character, strip out any non-numeric text with gsub, then convert back with as.numeric. Finally, divide by 15 to get the 15-minute units you want; use floor to round the result if needed.
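To make that concrete, here is a minimal sketch of the full TTF1 calculation, assuming data frames named ds1 and ds2 shaped like the samples above; the names and the ttf_one helper are mine, not from the question:
ds1 <- data.frame(
  Timestamp = as.POSIXct(c("2009-09-04 10:00:00", "2009-09-04 10:30:00",
                           "2009-09-04 10:45:00", "2009-09-04 11:00:00",
                           "2009-09-04 11:15:00", "2009-09-04 11:30:00")),
  Sensor1   = c(30, 40, 33, 23, 24, 42)
)
ds2 <- data.frame(  # first two type-1 faults from the sample, already filtered
  Start = as.POSIXct(c("2009-09-04 10:45:00", "2009-09-04 21:45:00")),
  End   = as.POSIXct(c("2009-09-04 11:15:00", "2009-09-04 22:00:00"))
)

# TTF for a single timestamp: >= 0 inside a fault, negative before the next one
ttf_one <- function(t, starts, ends) {
  inside <- which(t >= starts & t <= ends)
  if (length(inside) > 0) {
    # within a fault: intervals elapsed since that fault started
    return(as.numeric(difftime(t, starts[inside[1]], units = "mins")) / 15)
  }
  nxt <- starts[starts > t]
  if (length(nxt) == 0) return(NA_real_)  # no later fault
  # before the next fault: negative count of intervals until it starts
  -as.numeric(difftime(min(nxt), t, units = "mins")) / 15
}

ds1$TTF1 <- vapply(seq_len(nrow(ds1)),
                   function(i) ttf_one(ds1$Timestamp[i], ds2$Start, ds2$End),
                   numeric(1))
ds1$TTF1
# [1]  -3  -1   0   1   2 -41
Filtering DataSet 2 by fault type first (your sqldf step) and repeating this per type gives TTF1 through TTF6.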

Related

Calculating mean and sd of bedtime (hh:mm) in R - the problem is times before/after midnight

I have the following dataset:
data <- read.table(text="
wake_time sleep_time
08:38:00 23:05:00
09:30:00 00:50:00
06:45:00 22:15:00
07:27:00 23:34:00
09:00:00 23:00:00
09:05:00 00:10:00
06:40:00 23:28:00
10:00:00 23:30:00
08:10:00 00:10:00
08:07:00 00:38:00", header=T)
I used the chron package to calculate the average wake_time:
> mean(times(data$wake_time))
[1] 08:20:12
But when I do the same for the variable sleep_time, this happens:
> mean(times(data$sleep_time))
[1] 14:04:00
I guess the result is distorted because sleep_time contains times from both before and after midnight.
But how can I solve this problem?
Additionally, how can I calculate the sd of the times? I want to use it like "mean wake-up-time 08:20 ± 44 min", for example.
The times() values are stored as numbers between 0 and 1, representing a fraction of a day. If the sleep time is earlier than the wake time, you can "add a day" before taking the mean. For example:
library(chron)
wake <- times(data$wake_time)
sleep <- times(data$sleep_time)
times(mean(ifelse(sleep < wake, sleep+1, sleep)))
# [1] 23:40:00
And since the values are fractions of a day, if you want the sd in minutes, convert the day fractions to minutes before taking the sd:
sd(ifelse(sleep < wake, sleep+1, sleep) * 24*60)
# [1] 47.60252
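If you want the "mean ± sd" string from the question, a small follow-up sketch (the variable names are mine):
m <- mean(wake)                      # chron "times" object, e.g. 08:20:12
s <- sd(as.numeric(wake) * 24 * 60)  # sd in minutes
sprintf("mean wake-up-time %s +/- %.0f min", substr(format(m), 1, 5), s)
# [1] "mean wake-up-time 08:20 +/- 67 min"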

boxplot time series by monthly average in r

I have a time series with hourly data on energy consumption in the form of a zoo object, and there are 16 indices (in the range [1:143206]) for which the Date is NA. Here is a sample of the data:
Date PJMW_MW
1 2002-04-01 01:00:00 4374
...
8709 2003-03-29 23:00:00 4827
8710 2003-03-30 00:00:00 4611
8711 2003-03-30 01:00:00 4421
8712 NA 4285
8713 2003-03-30 03:00:00 4212
8714 2003-03-30 04:00:00 4321
...
143206 2018-08-03 00:00:00 5489
The data above is in a data.frame called dat, but I also have it as a zoo object called hourly_ts:
1 4374
...
7709 6135
7710 6324
7711 6626
7712 6866
7713 6987
7714 7028
7715 7026
...
143206 5265
I would like to see the monthly averages, e.g. in which months consumption is generally higher, and I saw that there is a simple formula for this: boxplot(hourly_ts ~ cycle(hourly_ts))
But it gives the error Error in cycle.zoo(hourly_ts) : 'x' is not regular.
The weird thing is that hourly_ts has a specified frequency (24 hours per day) and start time (April 1st 2002, 01:00:00), so there shouldn't be any missing values in the time index.
Supposing the missing values are what's causing the irregularity, is there a way I can add the values myself?
I would also like to use the aggregate function but have no idea what the by parameter should be.
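A minimal sketch of one way to proceed, assuming dat holds the Date and PJMW_MW columns shown above and that the series really is strictly hourly (so the NA dates can be rebuilt from position alone):
library(zoo)
# rebuild the full hourly index from the first timestamp; this fills the
# 16 NA dates under the strictly-hourly assumption
dat$Date <- as.POSIXct(dat$Date)
full_index <- seq(from = dat$Date[1], by = "hour", length.out = nrow(dat))
hourly_ts <- zoo(dat$PJMW_MW, order.by = full_index)

# monthly means via aggregate(); by maps each timestamp to its year-month
monthly_means <- aggregate(hourly_ts, by = as.yearmon, FUN = mean)

# or a boxplot of the hourly values grouped by calendar month
boxplot(coredata(hourly_ts) ~ format(index(hourly_ts), "%m"),
        xlab = "month", ylab = "PJMW_MW")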

Maintain date time stamp when calculating time intervals

I calculated the time intervals between date and time based on location and sensor. Here is some of my data:
datehour <- c("2016-03-24 20","2016-03-24 06","2016-03-24 18","2016-03-24 07","2016-03-24 16",
"2016-03-24 09","2016-03-24 15","2016-03-24 09","2016-03-24 20","2016-03-24 05",
"2016-03-25 21","2016-03-25 07","2016-03-25 19","2016-03-25 09","2016-03-25 12",
"2016-03-25 07","2016-03-25 18","2016-03-25 08","2016-03-25 16","2016-03-25 09",
"2016-03-26 20","2016-03-26 06","2016-03-26 18","2016-03-26 07","2016-03-26 16",
"2016-03-26 09","2016-03-26 15","2016-03-26 09","2016-03-26 20","2016-03-26 05",
"2016-03-27 21","2016-03-27 07","2016-03-27 19","2016-03-27 09","2016-03-27 12",
"2016-03-27 07","2016-03-27 18","2016-03-27 08","2016-03-27 16","2016-03-27 09")
location <- c(1,1,2,2,3,3,4,4,"out","out",1,1,2,2,3,3,4,4,"out","out",
1,1,2,2,3,3,4,4,"out","out",1,1,2,2,3,3,4,4,"out","out")
sensor <- c(1,16,1,16,1,16,1,16,1,16,1,16,1,16,1,16,1,16,1,16,
1,16,1,16,1,16,1,16,1,16,1,16,1,16,1,16,1,16,1,16)
Temp <- c(35,34,92,42,21,47,37,42,63,12,35,34,92,42,21,47,37,42,63,12,
35,34,92,42,21,47,37,42,63,12,35,34,92,42,21,47,37,42,63,12)
df <- data.frame(datehour,location,sensor,Temp)
I used the following code to calculate the time differences. However, it does not keep the correct datehour with each entry; see columns datehour1 and datehour2.
df$datehour <- as.POSIXct(df$datehour, format = "%Y-%m-%d %H")
final.time.df <- setDT(df)[order(datehour, location, sensor),
                           .(difftime(datehour[-length(datehour)], datehour[-1], unit = "hour"),
                             datehour1 = datehour[1], datehour2 = datehour[2]),
                           .(location, sensor)]
I would like each time difference to be paired with the two timestamps used to calculate it. I would like the result to be the following:
location sensor V1 datehour1 datehour2
out 16 -28 hours 2016-03-24 05:00:00 2016-03-25 09:00:00
1 16 -25 hours 2016-03-24 06:00:00 2016-03-25 07:00:00
2 16 -26 hours 2016-03-24 07:00:00 2016-03-25 09:00:00
3 16 -22 hours 2016-03-24 09:00:00 2016-03-25 07:00:00
4 16 -23 hours 2016-03-24 09:00:00 2016-03-25 08:00:00
4 1 -27 hours 2016-03-24 15:00:00 2016-03-25 18:00:00
3 1 -20 hours 2016-03-24 16:00:00 2016-03-25 12:00:00
2 1 -25 hours 2016-03-24 18:00:00 2016-03-25 19:00:00
1 1 -25 hours 2016-03-24 20:00:00 2016-03-25 21:00:00
out 1 -20 hours 2016-03-24 20:00:00 2016-03-25 16:00:00
Okay, so I'm not an expert at data.table solutions by any means, and as a result I'm not quite sure how your grouping statement resolves the values down to 10 rows.
That said, I think the answer to your question (if you haven't already solved it another way) lies in the difftime(datehour[-length(datehour)], datehour[-1], unit = "hour") chunk of code: not that it calculates the difference incorrectly, but that it prevents the grouping statement from resolving to the expected number of groups.
I tried separating the grouping from the time difference calculation, and was able to get to your expected output (obviously some formatting required):
final.time.df <- setDT(df)[order(datehour, location, sensor),
                           .(datehour1 = datehour[1], datehour2 = datehour[2]),
                           .(location, sensor)]
final.time.df$diff <- final.time.df$datehour1 - final.time.df$datehour2
If I've missed the point, feel free to let me know and I'll delete the answer! I know it's not a particularly insightful answer, but it looks like this might do it, and I'm stuck on a problem myself right now, and wanted to try to help.
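For what it's worth, here is a sketch that computes the difference inside the same grouped call, so each pair keeps its timestamps and the units stay in hours (df is the question's data after the as.POSIXct conversion):
library(data.table)
# one grouped call: keep both timestamps and their difference together
final.time.df <- setDT(df)[order(datehour, location, sensor),
                           .(diff = difftime(datehour[1], datehour[2], units = "hours"),
                             datehour1 = datehour[1],
                             datehour2 = datehour[2]),
                           by = .(location, sensor)]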

Create a time interval of 15 minutes from minutely data in R?

I have some data which is formatted in the following way:
time count
00:00 17
00:01 62
00:02 41
So I have from 00:00 to 23:59hours and with a counter per minute. I'd like to group the data in intervals of 15 minutes such that:
time count
00:00-00:15 148
00:16-00:30 284
I have tried to do it manually, but this is exhausting, so I am sure there has to be a function or something to do it easily, but I haven't figured out how.
I'd really appreciate some help!!
Thank you very much!
For data that's in POSIXct format, you can use the cut function to create 15-minute groupings, and then aggregate by those groups. The code below shows how to do this in base R and with the dplyr and data.table packages.
First, create some fake data:
set.seed(4984)
dat = data.frame(time = seq(as.POSIXct("2016-05-01"), as.POSIXct("2016-05-01") + 60*99, by = 60),
                 count = sample(1:50, 100, replace = TRUE))
Base R
cut the data into 15 minute groups:
dat$by15 = cut(dat$time, breaks="15 min")
time count by15
1 2016-05-01 00:00:00 22 2016-05-01 00:00:00
2 2016-05-01 00:01:00 11 2016-05-01 00:00:00
3 2016-05-01 00:02:00 31 2016-05-01 00:00:00
...
98 2016-05-01 01:37:00 20 2016-05-01 01:30:00
99 2016-05-01 01:38:00 29 2016-05-01 01:30:00
100 2016-05-01 01:39:00 37 2016-05-01 01:30:00
Now aggregate by the new grouping column, using sum as the aggregation function:
dat.summary = aggregate(count ~ by15, FUN=sum, data=dat)
by15 count
1 2016-05-01 00:00:00 312
2 2016-05-01 00:15:00 395
3 2016-05-01 00:30:00 341
4 2016-05-01 00:45:00 318
5 2016-05-01 01:00:00 349
6 2016-05-01 01:15:00 397
7 2016-05-01 01:30:00 341
dplyr
library(dplyr)
dat.summary = dat %>%
  group_by(by15 = cut(time, "15 min")) %>%
  summarise(count = sum(count))
data.table
library(data.table)
dat.summary = setDT(dat)[ , list(count=sum(count)), by=cut(time, "15 min")]
UPDATE: To answer the comment: for this case, the end point of each grouping interval is as.POSIXct(as.character(dat$by15)) + 60*15 - 1. In other words, the end point of the grouping interval is 15 minutes minus one second after the start of the interval. We add 60*15 - 1 because POSIXct is denominated in seconds. The as.POSIXct(as.character(...)) is needed because cut returns a factor, and this converts it back to date-time so that we can do math on it.
If you want the end point at the nearest minute before the next interval (instead of the nearest second), you could use as.POSIXct(as.character(dat$by15)) + 60*14.
If you don't know the break interval, for example because you chose the number of breaks and let R pick the interval, you can find the number of seconds to add with max(unique(diff(as.POSIXct(as.character(dat$by15))))) - 1.
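As a concrete sketch of that endpoint arithmetic (the end15 column name is mine):
# add the closing timestamp of each 15-minute bin to the summary
dat.summary$end15 <- as.POSIXct(as.character(dat.summary$by15)) + 60*15 - 1
head(dat.summary)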
The cut approach is handy but slow with large data frames. The following approach is approximately 1,000x faster than the cut approach (tested with 400k records).
# Function: Truncate (floor) POSIXct to time interval (specified in seconds)
# Author: Stephen McDaniel # PowerTrip Analytics
# Date : 2017MAY
# Copyright: (C) 2017 by Freakalytics, LLC
# License: MIT
floor_datetime <- function(date_var, floor_seconds = 60,
                           origin = "1970-01-01") { # defaults to minute rounding
  if (!is(date_var, "POSIXct")) stop("Please pass in a POSIXct variable")
  # NAs propagate through the arithmetic, so vectors with NAs need no special case
  as.POSIXct(floor(as.numeric(date_var) / floor_seconds) * floor_seconds,
             origin = origin)
}
Sample output:
test <- data.frame(good = as.POSIXct(Sys.time()),
                   bad1 = as.Date(Sys.time()),
                   bad2 = as.POSIXct(NA))
test$good_15 <- floor_datetime(test$good, 15 * 60)
test$bad1_15 <- floor_datetime(test$bad1, 15 * 60)
Error in floor_datetime(test$bad1, 15 * 60) :
Please pass in a POSIXct variable
test$bad2_15 <- floor_datetime(test$bad2, 15 * 60)
test
good bad1 bad2 good_15 bad2_15
1 2017-05-06 13:55:34.48 2017-05-06 <NA> 2017-05-06 13:45:00 <NA>
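Applied to the minutely dat built earlier in this thread, a hedged usage sketch (the by15_fast name is mine) replaces the cut() key with the floored timestamp:
# use the fast floor as the grouping key, then aggregate as before
dat$by15_fast <- floor_datetime(dat$time, 15 * 60)
dat.summary.fast <- aggregate(count ~ by15_fast, FUN = sum, data = dat)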
You can do it in one line by using the trs() function from the foqat package, like so:
df_15mins=trs(df, "15 mins")
Below is a reproducible example:
library(foqat)
head(aqi[,c(1,2)])
# Time NO
#1 2017-05-01 01:00:00 0.0376578
#2 2017-05-01 01:01:00 0.0341483
#3 2017-05-01 01:02:00 0.0310285
#4 2017-05-01 01:03:00 0.0357016
#5 2017-05-01 01:04:00 0.0337507
#6 2017-05-01 01:05:00 0.0238120
#mean
aqi_15mins=trs(aqi[,c(1,2)], "15 mins")
head(aqi_15mins)
# Time NO
#1 2017-05-01 01:00:00 0.02736549
#2 2017-05-01 01:15:00 0.03244958
#3 2017-05-01 01:30:00 0.03743626
#4 2017-05-01 01:45:00 0.02769419
#5 2017-05-01 02:00:00 0.02901817
#6 2017-05-01 02:15:00 0.03439455

Combine timedelta and date column, group by time interval

I need to combine two separate columns to one datetime column.
The pandas dataframe looks as follows:
calendarid time_delta_actualdeparture actualtriptime
20140101 0 days 06:35:49.000020000 27.11666667
20140101 0 days 06:51:37.000020000 24.83333333
20140101 0 days 07:11:40.000020000 28.1
20140101 0 days 07:31:40.000020000 23.03333333
20140101 0 days 07:53:34.999980000 23.3
20140101 0 days 08:14:13.000020000 51.81666667
I would like to convert it to look like this:
calendarid actualtriptime
2014-01-01 6:30:00 mean of trip times in time interval
2014-01-01 7:00:00 mean of trip times in time interval
2014-01-01 7:30:00 mean of trip times in time interval
2014-01-01 8:00:00 mean of trip times in time interval
2014-01-01 8:30:00 mean of trip times in time interval
Essentially I would like to combine the two columns into one and then group into 30-minute intervals, taking the mean of the actual trip time in each interval. I've unsuccessfully tried many techniques, but I am still learning python/pandas. Can anyone help me with this?
Convert your 'calendarid' column to a datetime and add the delta to get the starting times.
In [5]: df['calendarid'] = pd.to_datetime(df['calendarid'], format='%Y%m%d')
In [7]: df['calendarid'] = df['calendarid'] + df['time_delta_actualdeparture']
In [8]: df
Out[8]:
calendarid time_delta_actualdeparture actualtriptime
0 2014-01-01 06:35:49.000020 06:35:49.000020 27.116667
1 2014-01-01 06:51:37.000020 06:51:37.000020 24.833333
2 2014-01-01 07:11:40.000020 07:11:40.000020 28.100000
3 2014-01-01 07:31:40.000020 07:31:40.000020 23.033333
4 2014-01-01 07:53:34.999980 07:53:34.999980 23.300000
5 2014-01-01 08:14:13.000020 08:14:13.000020 51.816667
Then you can set your date column as the index and resample at a 30-minute frequency to get the mean over each interval.
In [19]: df.set_index('calendarid').resample('30Min', label='right').mean()
Out[19]:
actualtriptime
calendarid
2014-01-01 07:00:00 25.975000
2014-01-01 07:30:00 28.100000
2014-01-01 08:00:00 23.166667
2014-01-01 08:30:00 51.816667
