finding time interval using time and date variables - r

I have multiple date variables that are in the following format ("01-Jan-2010"), pretty much %d-%b-%Y and I also have multiple time variables which are in military format.
So for event 1, there are date1 and time1, for event 2, there are date2 and time2.
How can I get the time difference between the events in minutes in R?
Thanks in advance
event1 event1_time event1_date event2 event2_time event_2date
1 14:13 2014-10-10 1 15:34 2014-10-11
1 16:15 2011-02-01 1 18:22 2011-02-02

Here is an option
# Define a function to parse the `date` and `time` strings
to_dt <- function(date, time)
strptime(sprintf("%s %s", date, time), format = "%Y-%m-%d %H:%M")
transform(
df,
diff = difftime(
to_dt(event_2date, event2_time),
to_dt(event1_date, event1_time),
units = "mins"))
# event1 event1_time event1_date event2 event2_time event_2date diff
#1 1 14:13 2014-10-10 1 15:34 2014-10-11 1521 mins
#2 1 16:15 2011-02-01 1 18:22 2011-02-02 1567 mins
Or the same in dplyr
library(dplyr)
df %>%
mutate(diff = difftime(
to_dt(event_2date, event2_time),
to_dt(event1_date, event1_time),
units = "mins"))
If you would like the difference in another unit (e.g. days), you can use units = "days" inside difftime (take a look at ?difftime to see more options).
Sample data
df <- read.table(text =
"event1 event1_time event1_date event2 event2_time event_2date
1 14:13 2014-10-10 1 15:34 2014-10-11
1 16:15 2011-02-01 1 18:22 2011-02-02", header = T)

Related

Converting mixed times into 24 hour format

I currently have a dataset with multiple different time formats(AM/PM, numeric, 24hr format) and I'm trying to turn them all into 24hr format. Is there a way to standardize mixed format columns?
Current sample data
time
12:30 PM
03:00 PM
0.961469907
0.913622685
0.911423611
09:10 AM
18:00
Desired output
new_time
12:30:00
15:00:00
23:04:31
21:55:37
21:52:27
09:10:00
18:00:00
I know how to do them all individually(an example below), but is there a way to do it all in one go because I have a large amount of data and can't go line by line?
#for numeric time
> library(chron)
> x <- c(0.961469907, 0.913622685, 0.911423611)
> times(x)
[1] 23:04:31 21:55:37 21:52:27
The decimal times are a pain but we can parse them first, feed them back as a character then use lubridate's parse_date_time to do them all at once
library(tidyverse)
library(chron)
# Create reproducible dataframe
df <-
tibble::tibble(
time = c(
"12:30 PM",
"03:00 PM",
0.961469907,
0.913622685,
0.911423611,
"09:10 AM",
"18:00")
)
# Parse times
df <-
df %>%
dplyr::mutate(
time_chron = chron::times(as.numeric(time)),
time_chron = if_else(
is.na(time_chron),
time,
as.character(time_chron)),
time_clean = lubridate::parse_date_time(
x = time_chron,
orders = c(
"%I:%M %p", # HH:MM AM/PM 12 hour format
"%H:%M:%S", # HH:MM:SS 24 hour format
"%H:%M")), # HH:MM 24 hour format
time_clean = hms::as_hms(time_clean)) %>%
select(-time_chron)
Which gives us
> df
# A tibble: 7 × 2
time time_clean
<chr> <time>
1 12:30 PM 12:30:00
2 03:00 PM 15:00:00
3 0.961469907 23:04:31
4 0.913622685 21:55:37
5 0.911423611 21:52:27
6 09:10 AM 09:10:00
7 18:00 18:00:00

How to calculate time elapsed between rows in R

I have date and time as separate columns, which i combined into a single column using library(lubridate)
Now i want to create a new column that would calculate the elapsed time between two consecutive rows for each unique ID
I tried diff, however the error i am getting is that the new column has +1 rows compared to original data set
s1$DT<-with(s1, mdy(Date.of.Collection) + hm(MILITARY.TIME))#this worked - #needs the library lubridate
s1$ElapsedTime<-difff(s1$DT)
units(s1$ElapsedTime)<-"hours"
Subject.ID time DT Time elapsed
1 Dose 8/1/2018 8:15 0
1 time point1 8/1/2018 9:56 0.070138889
1 time point2 8/2/2018 9:56 1.070138889
2 Dose 9/4/2018 10:50 0
2 time point1 9/11/2018 11:00 7.006944444
3 Dose 10/1/2018 10:20 0
3 time point1 10/2/2018 14:22 1.168055556
3 time point2 10/3/2018 12:15 2.079861111
From your comment, you don't need a "diff"; in conventional R-speak, a "diff" would be T1-T0, T2-T1, T3-T2, ..., Tn - Tn-1.
For you, one of these will work to give you T1,2,...,n - T0.
Base R
do.call(
rbind,
by(patients, patients$Subject.ID, function(x) {
x$elapsed <- x$realDT - x$realDT[1]
units(x$elapsed) <- "hours"
x
})
)
# Subject.ID time1 DT Time elapsed realDT
# 1.1 1 Dose 8/1/2018 8:15 0.000000 hours 2018-08-01 08:15:00
# 1.2 1 time_point1 8/1/2018 9:56 1.683333 hours 2018-08-01 09:56:00
# 1.3 1 time_point2 8/2/2018 9:56 25.683333 hours 2018-08-02 09:56:00
# 2.4 2 Dose 9/4/2018 10:50 0.000000 hours 2018-09-04 10:50:00
# 2.5 2 time_point1 9/11/2018 11:00 168.166667 hours 2018-09-11 11:00:00
# 3.6 3 Dose 10/1/2018 10:20 0.000000 hours 2018-10-01 10:20:00
# 3.7 3 time_point1 10/2/2018 14:22 28.033333 hours 2018-10-02 14:22:00
# 3.8 3 time_point2 10/3/2018 12:15 49.916667 hours 2018-10-03 12:15:00
dplyr
library(dplyr)
patients %>%
group_by(Subject.ID) %>%
mutate(elapsed = `units<-`(realDT - realDT[1], "hours")) %>%
ungroup()
data.table
library(data.table)
patDT <- copy(patients)
setDT(patDT)
patDT[, elapsed := `units<-`(realDT - realDT[1], "hours"), by = "Subject.ID"]
Notes:
The "hours" in the $elapsed column is just an artifact of dealing with a time-difference thing, it should not affect most operations. To get rid of it, make sure you're in the right units ("hours", "secs", ..., see ?units) and use as.numeric.
The only reasons I used as.POSIXct as above are that I'm not a lubridate user, and the data as provided is not in a time format. You shouldn't need it if your Time is a proper time format, in which case you'd use that field instead of my hacky realDT.
On similar lines, if you do calculate realDT and use it, you really don't need both realDT and the pair of DT and Time.
The data I used:
patients <- read.table(header=TRUE, stringsAsFactors=FALSE, text="
Subject.ID time1 DT Time elapsed
1 Dose 8/1/2018 8:15 0
1 time_point1 8/1/2018 9:56 0.070138889
1 time_point2 8/2/2018 9:56 1.070138889
2 Dose 9/4/2018 10:50 0
2 time_point1 9/11/2018 11:00 7.006944444
3 Dose 10/1/2018 10:20 0
3 time_point1 10/2/2018 14:22 1.168055556
3 time_point2 10/3/2018 12:15 2.079861111")
# this is necessary for me because DT/Time here are not POSIXt (they're just strings)
patients$realDT <- as.POSIXct(paste(patients$DT, patients$Time), format = "%m/%d/%Y %H:%M")

aggregate data frame to typical year/week

so i have a large data frame with a date time column of class POSIXct and a another column with price data of class numeric. the date time column has values of the form "1998-12-07 02:00:00 AEST" that are half hour observations across 20 years. a sample data set can be generated with the following code (vary the 100 to whatever number of observations are necessary):
data.frame(date.time = seq.POSIXt(as.POSIXct("1998-12-07 02:00:00 AEST"), as.POSIXct(Sys.Date()+1), by = "30 min")[1:100], price = rnorm(100))
i want to look at a typical year and typical week. so for the typical year i have the following code:
mean.year <- aggregate(df$price, by = list(format(df$date.time, "%m-%d %H:%M")), mean)
it seems to give me what i want:
Group.1 x
1 01-01 00:00 31.86200
2 01-01 00:30 34.20526
3 01-01 01:00 28.40105
4 01-01 01:30 26.01684
5 01-01 02:00 23.68895
6 01-01 02:30 23.70632
however the column "Group.1" is of class character and i would like it to be of class POSIXct. how can i do this?
for the typical week i have the following code
mean.week <- aggregate(df$price, by = list(format(df$date.time, "%wday %H:%M")), mean)
the output is as follows
Group.1 x
1 0day 00:00 33.05613
2 0day 00:30 30.92815
3 0day 01:00 29.26245
4 0day 01:30 29.47959
5 0day 02:00 29.18380
6 0day 02:30 25.99400
again, column "Group.1" is of class character and i would like POSIXct. also, i would like to have the day of the week as "Monday", "Tuesday", etc. instead of 0day. how would i do this?
Convert the datetime to a character string that can validly be converted back to POSIXct and then do so:
mean.year <- aggregate(df["price"],
by = list(time = as.POSIXct(format(df$date.time, "2000-%m-%d %H:%M"))), mean)
head(mean.year)
## time price
## 1 2000-12-07 02:00:00 -0.56047565
## 2 2000-12-07 02:30:00 -0.23017749
## 3 2000-12-07 03:00:00 1.55870831
## 4 2000-12-07 03:30:00 0.07050839
## 5 2000-12-07 04:00:00 0.12928774
## 6 2000-12-07 04:30:00 1.71506499
To get the day of the week use %a or %A -- see ?strptime for the list of percent codes.
mean.week <- aggregate(df["price"],
by = list(time = format(df$date.time, "%a %H:%M")), mean)
head(mean.week)
## time price
## 1 Mon 02:00 -0.56047565
## 2 Mon 02:30 -0.23017749
## 3 Mon 03:00 1.55870831
## 4 Mon 03:30 0.07050839
## 5 Mon 04:00 0.12928774
## 6 Mon 04:30 1.71506499
Note
The input df in reproducible form -- note that set.seed is needed to make it reproducible:
set.seed(123)
df <- data.frame(date.time = seq.POSIXt(as.POSIXct("1998-12-07 02:00:00 AEST"),
as.POSIXct(Sys.Date()+1), by = "30 min")[1:100], price = rnorm(100))

R: extract hour from variable format timestamp

My dataframe has timestamp with and without seconds, and a random use of 0 in front of months and hours, i.e. 01 or 1
library(tidyverse)
df <- data_frame(cust=c('A','A','B','B'), timestamp=c('5/31/2016 1:03:12', '05/25/2016 01:06',
'6/16/2016 01:03', '12/30/2015 23:04:25'))
cust timestamp
A 5/31/2016 1:03:12
A 05/25/2016 01:06
B 6/16/2016 01:03
B 12/30/2015 23:04:25
How to extract hours into a separate column? The desired output:
cust timestamp hours
A 5/31/2016 1:03:12 1
A 05/25/2016 01:06 1
B 6/16/2016 9:03 9
B 12/30/2015 23:04:25 23
I prefer the answer with tidyverse and mutate, but my attempt fails to extract hours correctly:
df %>% mutate(hours=strptime(timestamp, '%H') %>% as.character() )
# A tibble: 4 × 3
cust timestamp hours
<chr> <chr> <chr>
1 A 5/31/2016 1:03:12 2016-10-31 05:00:00
2 A 05/25/2016 01:06 2016-10-31 05:00:00
3 B 6/16/2016 01:03 2016-10-31 06:00:00
4 B 12/30/2015 23:04:25 2016-10-31 12:00:00
Try this:
library(lubridate)
df <- data.frame(cust=c('A','A','B','B'), timestamp=c('5/31/2016 1:03:12', '05/25/2016 01:06',
'6/16/2016 09:03', '12/30/2015 23:04:25'))
df %>% mutate(hours=hour(strptime(timestamp, '%m/%d/%Y %H:%M')) %>% as.character() )
cust timestamp hours
1 A 5/31/2016 1:03:12 1
2 A 05/25/2016 01:06 1
3 B 6/16/2016 09:03 9
4 B 12/30/2015 23:04:25 23
Here is a solution that appends 00 for the seconds when they are missing, then converts to a date using lubridate and extracts the hours using format. Note, if you don't want the 00:00 at the end of the hours, you can just eliminate them from the output format in format:
df %>%
mutate(
cleanTime = ifelse(grepl(":[0-9][0-9]:", timestamp)
, timestamp
, paste0(timestamp, ":00")) %>% mdy_hms
, hour = format(cleanTime, "%H:00:00")
)
returns:
cust timestamp cleanTime hour
<chr> <chr> <dttm> <chr>
1 A 5/31/2016 1:03:12 2016-05-31 01:03:12 01:00:00
2 A 05/25/2016 01:06 2016-05-25 01:06:00 01:00:00
3 B 6/16/2016 01:03 2016-06-16 01:03:00 01:00:00
4 B 12/30/2015 23:04:25 2015-12-30 23:04:25 23:00:00
Your timestamp is a character string (), you need to format is as a date (with as.Date for example) before you can start using functions like strptime.
You are going to have to go through some string manipulations to have properly formatted data before you can convert it to dates. Prepend a zero to months with a single digit and append :00 to hours with missing seconds. Use strsplit() and other regex functions. Afterwards do as.Date(df$timestamp,format = '%m/%d/%Y %H:%M:%S'), then you will be able to use strptime to extract the hours.

How to finding a given length of runs in a series of data?

I'm trying to study times in which flow was operating at a given level. I would like to find when flows were above a given level for 4 or more hours. How would I go about doing this?
Sample code:
Date<-format(seq(as.POSIXct("2014-01-01 01:00"), as.POSIXct("2015-01-01 00:00"), by="hour"), "%Y-%m-%d %H:%M", usetz = FALSE)
Flow<-runif(8760, 0, 2300)
IsHigh<- function(x ){
if (x < 1600) return(0)
if (1600 <= x) return(1)
}
isHighFlow = unlist(lapply(Flow, IsHigh))
df = data.frame(Date, Flow, isHighFlow )
I was asked to edit my questions to supply what I would like to see as an output.
I would like to see a data from such as the one below. The only issue is the hourseHighFlow is incorrect. I'm not sure how to fix the code to generation the correct hoursHighFlow.
temp <- df %>%
mutate(highFlowInterval = cumsum(isHighFlow==1)) %>%
group_by(highFlowInterval) %>%
summarise(hoursHighFlow = n(), minDate = min(as.character(Date)), maxDate = max(as.character(Date)))
#Then join the two tables together.
temp2<-sqldf("SELECT *
FROM temp LEFT JOIN df
ON df.Date BETWEEN temp.minDate AND temp.maxDate")
Able to use subset to select the length of time running at a high flow rate.
t<-subset(temp2,isHighFlow==1)
t<-subset(t, hoursHighFlow>=4)
Put it in a data.table:
require(data.table)
DT <- data.table(df)
Mark runs and lengths:
DT[,`:=`(r=.GRP,rlen=.N),by={r <- rle(isHighFlow);rep(1:length(r[[1]]),r$lengths)}]
Subset to long runs:
DT[rlen>4L]
How it works:
New columns are created in the second argument of DT[i,j,by] with :=.
.GRP and .N are special variables for, respectively, the index and size of the by group.
A data.table can be subset simply with DT[i], unlike a data.frame.
Apart from subsetting, most of what works with a data.frame works the same on a data.table.
Here is a solution using the dplyr package:
df %>%
mutate(interval = cumsum(isHighFlow!=lag(isHighFlow, default = 0))) %>%
group_by(interval) %>%
summarise(hoursHighFlow = n(), minDate = min(as.character(Date)), maxDate = max(as.character(Date)), isHighFlow = mean(isHighFlow)) %>%
filter(hoursHighFlow >= 4, isHighFlow == 1)
Result:
interval hoursHighFlow minDate maxDate isHighFlow
1 25 4 2014-01-03 07:00 2014-01-03 10:00 1
2 117 4 2014-01-12 01:00 2014-01-12 04:00 1
3 245 6 2014-01-23 13:00 2014-01-23 18:00 1
4 401 6 2014-02-07 03:00 2014-02-07 08:00 1
5 437 5 2014-02-11 02:00 2014-02-11 06:00 1
6 441 4 2014-02-11 21:00 2014-02-12 00:00 1
7 459 4 2014-02-13 09:00 2014-02-13 12:00 1
8 487 4 2014-02-16 03:00 2014-02-16 06:00 1
9 539 7 2014-02-21 08:00 2014-02-21 14:00 1
10 567 4 2014-02-24 11:00 2014-02-24 14:00 1
.. ... ... ... ... ...
As Frank notes, you could achieve the same result with using rle to set intervals, replacing the mutate line with:
mutate(interval = rep(1:length(rle(df$isHighFlow)[[2]]),rle(df$isHighFlow)[[1]])) %>%

Resources