Consolidating rows by max and min dates - r

I have a dataset that looks like this.
id1 = c(1,1,1,1,1,1,1,1,2,2)
id2 = c(3,3,3,3,3,3,3,3,3,3)
lat = c(-62.81559,-62.82330, -62.78693,-62.70136, -62.76476,-62.48157,-62.49064,-62.45838,42.06258,42.06310)
lon = c(-61.15518, -61.14885,-61.17801,-61.00363, -59.14270, -59.22009, -59.32967, -59.04125 ,154.70579, 154.70625)
start_date= as.POSIXct(c('2016-03-24 15:30:00', '2016-03-24 15:30:00','2016-03-24 23:40:00','2016-03-25 12:50:00','2016-03-29 18:20:00','2016-06-01 02:40:00','2016-06-01 08:00:00','2016-06-01 16:30:00','2016-07-29 20:20:00','2016-07-29 20:20:00'), tz = 'UTC')
end_date = as.POSIXct(c('2016-03-24 23:40:00', '2016-03-24 18:50:00','2016-03-25 03:00:00','2016-03-25 19:20:00','2016-04-01 03:30:00','2016-06-02 01:40:00','2016-06-01 14:50:00','2016-06-02 01:40:00','2016-07-30 07:00:00','2016-07-30 07:00:00'),tz = 'UTC')
speed = c(2.9299398, 2.9437502, 0.0220565, 0.0798409, 1.2824859, 1.8685429, 3.7927680, 1.8549291, 0.8140249,0.8287073)
df = data.frame(id1, id2, lat, lon, start_date, end_date, speed)
id1 id2 lat lon start_date end_date speed
1 1 3 -62.81559 -61.15518 2016-03-24 15:30:00 2016-03-24 23:40:00 2.9299398
2 1 3 -62.82330 -61.14885 2016-03-24 15:30:00 2016-03-24 18:50:00 2.9437502
3 1 3 -62.78693 -61.17801 2016-03-24 23:40:00 2016-03-25 03:00:00 0.0220565
4 1 3 -62.70136 -61.00363 2016-03-25 12:50:00 2016-03-25 19:20:00 0.0798409
5 1 3 -62.76476 -59.14270 2016-03-29 18:20:00 2016-04-01 03:30:00 1.2824859
6 1 3 -62.48157 -59.22009 2016-06-01 02:40:00 2016-06-02 01:40:00 1.8685429
7 1 3 -62.49064 -59.32967 2016-06-01 08:00:00 2016-06-01 14:50:00 3.7927680
8 1 3 -62.45838 -59.04125 2016-06-01 16:30:00 2016-06-02 01:40:00 1.8549291
9 2 3 42.06258 154.70579 2016-07-29 20:20:00 2016-07-30 07:00:00 0.8140249
10 2 3 42.06310 154.70625 2016-07-29 20:20:00 2016-07-30 07:00:00 0.8287073
The actual dataset is larger. What I would like to do is consolidate this dataset based on date ranges, grouped by id1 and id2, such that if the date/time range on one row is within 12 hours of the next date/time range, i.e. ABS(end_date[1] - start_date[2]) < 12hrs, the rows should be consolidated, with the new start_date being the earliest date and the end_date being the latest. All other values (lat, lon, speed) will be averaged. This is in some sense a 'deduping' effort, as rows that are within 12 hours actually represent the same 'event'. For the above example the final result would be:
id1 id2 lat lon start_date end_date speed
1 1 3 -62.7818 -61.12142 2016-03-24 15:30:00 2016-03-25 19:20:00 1.493897
2 1 3 -62.76476 -59.14270 2016-03-29 18:20:00 2016-04-01 03:30:00 1.2824859
3 1 3 -62.47686 -59.197 2016-06-01 02:40:00 2016-06-02 01:40:00 2.505413
4 2 3 42.06284 154.706 2016-07-29 20:20:00 2016-07-30 07:00:00 0.8213661
With the first four rows consolidated (into row 1), the 5th row left alone (row 2), rows 6-8 consolidated (row 3), and rows 9-10 consolidated (row 4).
I have been trying to do this with dplyr group_by and summarize, but I can't seem to get the date ranges to come out correctly.
Hopefully someone can determine a simple means of solving the problem. Extra points if you know how to do it in SQL ;-) so I can dedupe before even pulling this into R.

Here is a first very naive implementation. Warning: it is slow, not pretty, and still missing the start and end dates in the output! Note that it expects the rows to be ordered by date and time. If that's not the case in the data set, you can do it in R or SQL first (as shown below). Sorry that I can't think of a dplyr or SQL solution. I'd also like to see those two, if anyone has got an idea.
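For example, this base R one-liner pre-sorts the data (a sketch, assuming the column names from the question); the dedupe() function below then assumes this ordering:
# order by the grouping ids, then by event start time
df <- df[order(df$id1, df$id2, df$start_date), ]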
dedupe <- function(df) {
  counter = 1
  temp_vector = unlist(df[1, ])
  summarized_df = df[0, c(1, 2, 3, 4, 7)]
  colnames(summarized_df) = colnames(df)[c(1, 2, 3, 4, 7)]
  summarized_df$counter = NULL
  for (i in 2:nrow(df)) {
    if (((abs(difftime(df[i, "start_date"], df[i - 1, "end_date"], units = "h")) < 12) ||
         abs(difftime(df[i, "start_date"], df[i - 1, "start_date"], units = "h")) < 12) &&
        df[i, "id1"] == df[i - 1, "id1"] &&
        df[i, "id2"] == df[i - 1, "id2"]) {
      # same ids and time ranges within 12 hours: accumulate into the current group
      # sum up columns and keep the maximum end_date
      temp_vector[c(3, 4, 7)] = temp_vector[c(3, 4, 7)] + unlist(df[i, c(3, 4, 7)])
      temp_vector["end_date"] = max(temp_vector["end_date"], df[i, "end_date"])
      counter = counter + 1
      if (i == nrow(df)) {
        # in the last iteration we need to flush the current group
        summarized_df[nrow(summarized_df) + 1, c(1, 2)] = df[i, c(1, 2)]
        summarized_df[nrow(summarized_df), 3:5] = temp_vector[c(3, 4, 7)] / counter
        summarized_df[nrow(summarized_df), "counter"] = counter
      }
    } else {
      # new event: write out group statistics for the previous group,
      # then reset temp_vector and counter
      # (the ids must come from the previous row, which closed the group)
      summarized_df[nrow(summarized_df) + 1, c(1, 2)] = df[i - 1, c(1, 2)]
      summarized_df[nrow(summarized_df), 3:5] = temp_vector[c(3, 4, 7)] / counter
      summarized_df[nrow(summarized_df), "counter"] = counter
      counter = 1
      temp_vector[c(3, 4, 7)] = unlist(df[i, c(3, 4, 7)])
      # (note: a final row that itself starts a new event is not emitted)
    }
  }
  return(summarized_df)
}
Function call
> dedupe(df)
   id1 id2       lat       lon     speed counter
4    1   3 -62.78179 -61.12142 1.4938968       4
5    1   3 -62.76476 -59.14270 1.2824859       1
8    1   3 -62.47686 -59.19700 2.5054133       3
10   2   3  42.06284 154.70602 0.8213661       2
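For what it's worth, here is a rough dplyr sketch of the same logic (untested beyond the sample data; it mirrors the loop's condition by flagging a new event only when both the start-to-previous-end and start-to-previous-start gaps are at least 12 hours, then summarising per event):
library(dplyr)
df %>%
  arrange(id1, id2, start_date) %>%
  group_by(id1, id2) %>%
  mutate(
    new_event = is.na(lag(end_date)) |
      (abs(difftime(start_date, lag(end_date), units = "hours")) >= 12 &
       abs(difftime(start_date, lag(start_date), units = "hours")) >= 12),
    event = cumsum(new_event)
  ) %>%
  group_by(id1, id2, event) %>%
  summarise(lat = mean(lat), lon = mean(lon),
            start_date = min(start_date), end_date = max(end_date),
            speed = mean(speed), .groups = "drop")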

This can be easily achieved by using insurancerating::reduce():
df |>
insurancerating::reduce(begin = start_date, end = end_date, id1, id2,
agg_cols = c(lat, lon, speed), agg = "mean",
min.gapwidth = 12 * 3600)
#> id1 id2 index end_date start_date lat lon
#> 1 1 3 0 2016-03-25 19:20:00 2016-03-24 15:30:00 -62.78180 -61.12142
#> 2 1 3 1 2016-04-01 03:30:00 2016-03-29 18:20:00 -62.76476 -59.14270
#> 3 1 3 2 2016-06-02 01:40:00 2016-06-01 02:40:00 -62.47686 -59.19700
#> 4 2 3 0 2016-07-30 07:00:00 2016-07-29 20:20:00 42.06284 154.70602
#> speed
#> 1 1.4938969
#> 2 1.2824859
#> 3 2.5054133
#> 4 0.8213661
Created on 2022-06-13 by the reprex package (v2.0.1)

Related

make monthly ranges in R

I have this function to generate monthly ranges; it should handle years where February has 28 or 29 days:
starts ends
1 2017-01-01 2017-01-31
2 2017-02-01 2017-02-28
3 2017-03-01 2017-03-31
It works with:
make_date_ranges(as.Date("2017-01-01"), Sys.Date())
But it gives an error with:
make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
Why?
make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
Error in data.frame(starts, ends) :
arguments imply differing number of rows: 38, 36
add_months <- function(date, n){
  seq(date, by = paste(n, "months"), length = 2)[2]
}
make_date_ranges <- function(start, end){
  starts <- seq(from = start,
                to = Sys.Date() - 1,  # note: runs to today, not to `end`
                by = "1 month")
  ends <- c((seq(from = add_months(start, 1),
                 to = end,
                 by = "1 month")) - 1,
            (Sys.Date() - 1))
  data.frame(starts, ends)
}
## usage
make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
1) First, define start of month, som, and end of month, eom, functions which take a Date class object, a date string in standard Date format, or a yearmon object, and produce a Date class object giving the start or end of its month.
Using those, create a monthly Date series s using the start of each month from the month/year of from to that of to. Use pmax to ensure that the series does not extend before from and pmin so that it does not extend past to.
The input arguments can be strings in standard Date format, Date class objects or yearmon class objects. In the yearmon case it assumes the user wanted the full month for every month. (The if statement can be omitted if you don't need to support yearmon inputs.)
library(zoo)
som <- function(x) as.Date(as.yearmon(x))
eom <- function(x) as.Date(as.yearmon(x), frac = 1)
date_ranges2 <- function(from, to) {
  if (inherits(to, "yearmon")) to <- eom(to)
  s <- seq(som(from), eom(to), "month")
  data.frame(from = pmax(as.Date(from), s), to = pmin(as.Date(to), eom(s)))
}
date_ranges2("2000-01-10", "2000-06-20")
## from to
## 1 2000-01-10 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-20
date_ranges2(as.yearmon("2000-01"), as.yearmon("2000-06"))
## from to
## 1 2000-01-01 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-30
2) This alternative takes the same approach but defines start of month (som) and end of month (eom) functions without using yearmon so that only base R is needed. It takes character strings in standard Date format or Date class inputs and gives the same output as (1).
som <- function(x) as.Date(cut(as.Date(x), "month")) # start of month
eom <- function(x) som(som(x) + 32) - 1 # end of month
date_ranges3 <- function(from, to) {
  s <- seq(som(from), as.Date(to), "month")
  data.frame(from = pmax(as.Date(from), s), to = pmin(as.Date(to), eom(s)))
}
date_ranges3("2000-01-10", "2000-06-20")
## from to
## 1 2000-01-10 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-20
date_ranges3(som("2000-01-10"), eom("2000-06-20"))
## from to
## 1 2000-01-01 2000-01-31
## 2 2000-02-01 2000-02-29
## 3 2000-03-01 2000-03-31
## 4 2000-04-01 2000-04-30
## 5 2000-05-01 2000-05-31
## 6 2000-06-01 2000-06-30
You don't need to use seq twice -- you can subtract 1 day from the first of each month to get the ends: generate one too many starts, then shift & subset:
make_date_ranges = function(start, end) {
  # format(end, "%Y-%m-01") essentially truncates end to the first day of
  # end's month; 32 days later is guaranteed to be in the subsequent month
  starts = seq(from = start, to = as.Date(format(end, '%Y-%m-01')) + 32, by = 'month')
  data.frame(starts = head(starts, -1L), ends = tail(starts - 1, -1L))
}
x = make_date_ranges(as.Date("2017-01-01"), as.Date("2019-12-31"))
rbind(head(x), tail(x))
# starts ends
# 1 2017-01-01 2017-01-31
# 2 2017-02-01 2017-02-28
# 3 2017-03-01 2017-03-31
# 4 2017-04-01 2017-04-30
# 5 2017-05-01 2017-05-31
# 6 2017-06-01 2017-06-30
# 31 2019-07-01 2019-07-31
# 32 2019-08-01 2019-08-31
# 33 2019-09-01 2019-09-30
# 34 2019-10-01 2019-10-31
# 35 2019-11-01 2019-11-30
# 36 2019-12-01 2019-12-31

Complex conditional groupby in R

Here is the problem I am trying to solve.
I want to transform Table 1 into Table 2.
Table 1:
df
# icustay_id starttime endtime vaso_rate vaso_amount
# 1 1 2019-09-10 13:20:00 2019-09-11 13:20:00 3 293.0896
# 2 1 2019-09-11 13:30:00 2019-09-12 01:20:00 9 602.9983
# 3 1 2019-09-14 16:40:00 2019-09-15 16:40:00 4 208.9360
# 4 2 2019-09-10 12:40:00 2019-09-13 13:20:00 2 864.1494
# 5 3 2019-09-10 01:20:00 2019-09-11 13:20:00 9 405.2939
Table 2:
df
# icustay_id starttime endtime vaso_rate vaso_amount
# 1 1 2019-09-10 13:20:00 2019-09-12 01:20:00 3 293.0896
# 2 1 2019-09-14 16:40:00 2019-09-15 16:40:00 4 208.9360
# 3 2 2019-09-10 12:40:00 2019-09-13 13:20:00 2 864.1494
# 4 3 2019-09-10 01:20:00 2019-09-11 13:20:00 9 405.2939
As you may notice, I am trying to build a function that will:
- For every single unique patient (unique icustay_id), merge rows ONLY if the medication has been stopped for less than an hour.
- When rows merge:
  - Some columns will retain the same value (i.e. the patient identifiers)
  - Some columns must be modified:
    - Keep the earlier starttime
    - Keep the later endtime
    - Average the vaso_rate
    - Sum the vaso_amount
To do so, I have decided to add another identifier column that takes the value 1 when the condition is met; once all the rows have been checked, I group by icustay_id and that new column.
My code as written, however, does not assign the appropriate ID with respect to the condition.
Here is the sample df creation code:
set.seed(1)
df <- data.frame(
  icustay_id = c(1, 1, 1, 2, 3),
  starttime = as.POSIXct(c("2019-09-10 13:20", "2019-09-11 13:30", "2019-09-14 16:40", "2019-09-10 12:40", "2019-09-10 01:20")),
  endtime = as.POSIXct(c("2019-09-11 13:20", "2019-09-11 01:20", "2019-09-15 16:40", "2019-09-13 13:20", "2019-09-11 13:20")),
  vaso_rate = sample(1:10, 5, replace = TRUE),
  vaso_amount = runif(5, 0, 1000)
)
Here is the function code that I have right now:
merge_pressor_doses <- function(df){
  df %>% arrange(icustay_id, starttime)
  for (i in unique(df$icustay_id)) {
    for (j in which(df$icustay_id == i)) {
      start <- df$starttime[as.numeric(j) + 1]
      end <- df$endtime[as.numeric(j)]
      stopduration <- as.numeric(difftime(start, end, units = 'mins'))
      bool <- stopduration < 60
      df <- df %>% mutate(
        group = case_when(
          bool = TRUE ~ 1,
          bool = FALSE ~ 0)
      )
    }
  }
  return(df)
}
This should result in:
df
# icustay_id starttime endtime vaso_rate vaso_amount group
# 1 1 2019-09-10 13:20:00 2019-09-11 13:20:00 3 293.0896 1
# 2 1 2019-09-11 13:30:00 2019-09-12 01:20:00 9 602.9983 1
# 3 1 2019-09-14 16:40:00 2019-09-15 16:40:00 4 208.9360 0
# 4 2 2019-09-10 12:40:00 2019-09-13 13:20:00 2 864.1494 1
# 5 3 2019-09-10 01:20:00 2019-09-11 13:20:00 9 405.2939 1
But in my case the 3rd row is assigned a value of 1...
If I can manage to make this portion of the code work, I could proceed to achieve my objective.
The eventual second portion of the code would be:
group_by(group, icustay_id) %>%
  summarise(
    starttime = min(starttime),
    endtime = max(endtime),
    vaso_rate = mean(vaso_rate),
    sum_vaso_amount = sum(vaso_amount))
Thank you in advance!!
I'd create a new column pause which says how much time passed since the last medication. Then, using this column, we assign group ids to medications with cumsum(pause >= 1): start with 0, and whenever pause is >= 1 hour, it's a different group.
set.seed(1)
df <- data.frame(
  icustay_id = c(1, 1, 1, 2, 3),
  starttime = as.POSIXct(c("2019-09-10 13:20", "2019-09-11 13:30", "2019-09-14 16:40", "2019-09-10 12:40", "2019-09-10 01:20")),
  endtime = as.POSIXct(c("2019-09-11 13:20", "2019-09-11 01:20", "2019-09-15 16:40", "2019-09-13 13:20", "2019-09-11 13:20")),
  vaso_rate = sample(1:10, 5, replace = TRUE),
  vaso_amount = runif(5, 0, 1000)
)
library(dplyr)
library(tidyr)
df <-
  df %>%
  group_by(icustay_id) %>%
  mutate(pause = difftime(starttime, lag(endtime), units = "hours")) %>%
  replace_na(list(pause = 0)) %>%
  mutate(vaso_id = cumsum(pause >= 1))
# A tibble: 5 x 7
# Groups: icustay_id [3]
# icustay_id starttime endtime vaso_rate vaso_amount pause vaso_id
# <dbl> <dttm> <dttm> <int> <dbl> <drtn> <int>
# 1 1 2019-09-10 13:20:00 2019-09-11 13:20:00 9 898. 0.0000000 hours 0
# 2 1 2019-09-11 13:30:00 2019-09-11 01:20:00 4 945. 0.1666667 hours 0
# 3 1 2019-09-14 16:40:00 2019-09-15 16:40:00 7 661. 87.3333333 hours 1
# 4 2 2019-09-10 12:40:00 2019-09-13 13:20:00 1 629. 0.0000000 hours 0
# 5 3 2019-09-10 01:20:00 2019-09-11 13:20:00 2 61.8 0.0000000 hours 0
Then we can use the code you provided.
df %>%
  group_by(icustay_id, vaso_id) %>%
  summarise(
    starttime = min(starttime),
    endtime = max(endtime),
    vaso_rate = mean(vaso_rate),
    sum_vaso_amount = sum(vaso_amount)
  )
# A tibble: 4 x 6
# Groups: icustay_id [3]
# icustay_id vaso_id starttime endtime vaso_rate sum_vaso_amount
# <dbl> <int> <dttm> <dttm> <dbl> <dbl>
# 1 1 0 2019-09-10 13:20:00 2019-09-11 13:20:00 6.5 1843.
# 2 1 1 2019-09-14 16:40:00 2019-09-15 16:40:00 7 661.
# 3 2 0 2019-09-10 12:40:00 2019-09-13 13:20:00 1 629.
# 4 3 0 2019-09-10 01:20:00 2019-09-11 13:20:00 2 61.8

How to generate a unique ID for each group based on relative date interval in R using dplyr?

I have a cohort of data with multiple person visits and want to group visits with a common ID based on person # and the time of the visit. The condition is: if a start is within 24 hours of the previous exit, then those visits should share the same ID.
Sample of what data looks like:
dat <- data.frame(
  Person_ID = c(1, 1, 1, 2, 3, 3, 3, 4, 4),
  Admit_Date_Time = as.POSIXct(c("2017-02-07 15:26:00", "2017-04-21 10:20:00",
                                 "2017-04-22 12:12:00", "2017-10-16 01:31:00",
                                 "2017-01-24 02:41:00", "2017-01-24 05:31:00",
                                 "2017-01-28 04:26:00", "2017-12-01 01:31:00",
                                 "2017-12-01 01:31:00"), format = "%Y-%m-%d %H:%M"),
  Discharge_Date_Time = as.POSIXct(c("2017-03-01 11:42:00", "2017-04-22 05:56:00",
                                     "2017-04-26 21:01:00", "2017-10-18 20:11:00",
                                     "2017-01-27 22:15:00", "2017-01-26 15:35:00",
                                     "2017-01-28 09:25:00", "2017-12-05 18:33:00",
                                     "2017-12-04 16:41:00"), format = "%Y-%m-%d %H:%M"),
  Visit_ID = c(1:9))
This is what I tried to start with:
dat1 <-
  dat %>%
  arrange(Person_ID, Admit_Date_Time) %>%
  group_by(Person_ID) %>%
  mutate(Previous_Visit_Interval = difftime(lag(Discharge_Date_Time, 1),
                                            Admit_Date_Time, units = "hours")) %>%
  mutate(start = c(1, Previous_Visit_Interval[-1] < hours(-24)),
         run = cumsum(start))
dat1$ID = as.numeric(as.factor(paste0(dat1$Person_ID, dat1$run)))
Which is almost right, except that it does not give the correct ID for visit 7 (person #3), since there are three visits: the second is entirely within the first, and the third starts within 24 hours of the first but not the second.
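One possible fix, sketched here as an aside (assuming dplyr and the dat above): compare each admission against the running maximum of all prior discharge times, via lag(cummax(...)), instead of only the immediately preceding discharge:
library(dplyr)
dat %>%
  arrange(Person_ID, Admit_Date_Time) %>%
  group_by(Person_ID) %>%
  mutate(
    # latest discharge seen before this row (numeric seconds; NA for the first visit)
    latest_discharge = lag(cummax(as.numeric(Discharge_Date_Time))),
    new_visit = is.na(latest_discharge) |
      as.numeric(Admit_Date_Time) - latest_discharge > 24 * 3600,
    run = cumsum(new_visit)
  ) %>%
  ungroup() %>%
  mutate(ID = as.numeric(factor(paste0(Person_ID, "_", run))))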
There's probably a way to shorten this, but here's an approach using tidyr::gather and spread. By gathering into long format, we can track the cumulative admissions inside each visit. A new visit is recorded whenever there's a new Person_ID or that Person_ID completed a visit (cumulative admissions went to zero) at least 24 hours prior.
library(dplyr)
library(tidyr)
library(lubridate) # for ddays()
dat1 <- dat %>%
  # Gather into long format with event type in one column, timestamp in another
  gather(event, time, Admit_Date_Time:Discharge_Date_Time) %>%
  # I want discharges to have an effect up to 24 hours later. Sort using that.
  mutate(time_adj = if_else(event == "Discharge_Date_Time",
                            time + ddays(1),
                            time)) %>%
  arrange(Person_ID, time_adj) %>%
  # For each Person_ID, track cumulative admissions. 0 means a visit has completed.
  # (b/c we sorted by time_adj, these reflect the 24hr period after discharges.)
  group_by(Person_ID) %>%
  mutate(admissions = if_else(event == "Admit_Date_Time", 1, -1)) %>%
  mutate(admissions_count = cumsum(admissions)) %>%
  ungroup() %>%
  # Record a new Hosp_ID when either (a) a new Person, or (b) preceded by a
  # completed visit (ie admissions_count was zero).
  mutate(Hosp_ID_chg = 1 *
           (Person_ID != lag(Person_ID, default = 1) |  # (a)
            lag(admissions_count, default = 1) == 0),   # (b)
         Hosp_ID = cumsum(Hosp_ID_chg)) %>%
  # Spread back into original format
  select(-time_adj, -admissions, -admissions_count, -Hosp_ID_chg) %>%
  spread(event, time)
Results
> dat1
# A tibble: 9 x 5
Person_ID Visit_ID Hosp_ID Admit_Date_Time Discharge_Date_Time
<dbl> <int> <dbl> <dttm> <dttm>
1 1 1 1 2017-02-07 15:26:00 2017-03-01 11:42:00
2 1 2 2 2017-04-21 10:20:00 2017-04-22 05:56:00
3 1 3 2 2017-04-22 12:12:00 2017-04-26 21:01:00
4 2 4 3 2017-10-16 01:31:00 2017-10-18 20:11:00
5 3 5 4 2017-01-24 02:41:00 2017-01-27 22:15:00
6 3 6 4 2017-01-24 05:31:00 2017-01-26 15:35:00
7 3 7 4 2017-01-28 04:26:00 2017-01-28 09:25:00
8 4 8 5 2017-12-01 01:31:00 2017-12-05 18:33:00
9 4 9 5 2017-12-01 01:31:00 2017-12-04 16:41:00
Here's a data.table approach using an overlap-join
library( data.table )
library( lubridate )
setDT( dat )
setorder( dat, Person_ID, Admit_Date_Time )
#create a 1-day extension after each discharge
dt2 <- dat[, discharge_24h := Discharge_Date_Time %m+% days(1)][]
#now create id
setkey( dat, Admit_Date_Time, discharge_24h )
#create data-table with overlap-join, create groups based on overlapping ranges
dt2 <- setorder(
  foverlaps( dat,
             dat,
             mult = "first",
             type = "any",
             nomatch = 0L ),
  Visit_ID )[, list( Visit_ID = i.Visit_ID,
                     Hosp_ID = .GRP ),
             by = .( Visit_ID )][, Visit_ID := NULL]
#reorder the result
setorder( dt2[ dat, on = "Visit_ID" ][, discharge_24h := NULL], Visit_ID )[]
# Visit_ID Hosp_ID Person_ID Admit_Date_Time Discharge_Date_Time
# 1: 1 1 1 2017-02-07 15:26:00 2017-03-01 11:42:00
# 2: 2 2 1 2017-04-21 10:20:00 2017-04-22 05:56:00
# 3: 3 2 1 2017-04-22 12:12:00 2017-04-26 21:01:00
# 4: 4 3 2 2017-10-16 01:31:00 2017-10-18 20:11:00
# 5: 5 4 3 2017-01-24 02:41:00 2017-01-27 22:15:00
# 6: 6 4 3 2017-01-24 05:31:00 2017-01-26 15:35:00
# 7: 7 4 3 2017-01-28 04:26:00 2017-01-28 09:25:00
# 8: 8 5 4 2017-12-01 01:31:00 2017-12-05 18:33:00
# 9: 9 5 4 2017-12-01 01:31:00 2017-12-04 16:41:00

Convert start time and total duration to elapsed time per hour

I have data on start time ('startTime', a date-time variable, POSIXct) and duration in minutes ('duration_minutes'):
df <- data.frame(id = c(1, 2, 3),
                 startTime = as.POSIXct(c("2018-01-01 12:15:31",
                                          "2018-01-02 23:43:00",
                                          "2018-01-03 11:00:11")),
                 duration_minutes = c(315, 120, 45))
I want to convert the start time and duration to elapsed time per hour, for each hour, from the hour of the start time to the last hour at the end of the duration:
df_result <- data.frame(id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3),
                        startTime = c("2018-01-01 12:15:31", "2018-01-01 13:00:00",
                                      "2018-01-01 14:00:00", "2018-01-01 15:00:00",
                                      "2018-01-01 16:00:00", "2018-01-01 17:00:00",
                                      "2018-01-02 23:43:00", "2018-01-03 00:00:00",
                                      "2018-01-03 01:00:00",
                                      "2018-01-03 11:00:11"),
                        duration_minutes = c(44.48, 60, 60, 60, 60, 30.5, 17, 60, 43, 45))
Please advise on a possible solution.
Another possibility:
library(data.table)
library(lubridate)
setDT(df)
df[ , ceil_start := ceiling_date(start, "hour", change_on_boundary = TRUE)]
df[ , {
  if (difftime(ceil_start, start, units = "min") > dur) {
    .SD[ , .(start, dur)]
  } else {
    end <- start + dur * 60
    time <- c(start,
              seq(from = ceil_start,
                  to = floor_date(end, "hour"),
                  by = "hour"),
              end)
    .(start = head(time, -1), dur = `units<-`(diff(time), "mins"))
  }
},
by = id]
# id start dur
# 1: 1 2018-01-01 12:15:31 44.48333 mins
# 2: 1 2018-01-01 13:00:00 60.00000 mins
# 3: 1 2018-01-01 14:00:00 60.00000 mins
# 4: 1 2018-01-01 15:00:00 60.00000 mins
# 5: 1 2018-01-01 16:00:00 60.00000 mins
# 6: 1 2018-01-01 17:00:00 30.51667 mins
# 7: 2 2018-01-02 23:43:00 17.00000 mins
# 8: 2 2018-01-03 00:00:00 60.00000 mins
# 9: 2 2018-01-03 01:00:00 43.00000 mins
# 10: 3 2018-01-03 11:00:11 45.00000 mins
# 11: 4 2018-01-03 11:35:00 25.00000 mins
# 12: 4 2018-01-03 12:00:00 10.00000 mins
# 13: 5 2018-01-03 00:00:00 60.00000 mins
# 14: 5 2018-01-03 01:00:00 0.00000 mins
Explanation
Convert data.frame to data.table (setDT). Round up start times to nearest hour (ceiling_date(start, "hour", ...). Use change_on_boundary = TRUE for easier handling of times without minutes and seconds (not in the data, but tested).
To handle cases when the end time (start + duration) is in the same hour as the start time (e.g. id = 3), check if the difference between the rounded time and the start time is larger than the duration (if (difftime(ceil_start, start, units = "min") > dur)). If so, just select the start and duration columns (.SD[, .(start, dur)]).
For other cases (else), calculate the end time: end <- start + dur * 60. Create a sequence from the up-rounded start time ('ceil_start') to the down-rounded end time, with an hourly increment (seq(from = ceil_start, to = floor_date(end, "hour"), by = "hour")). Concatenate with the 'start' and 'end' times. Return all times except the last (head(time, -1)) and calculate the differences between time steps in minutes (`units<-`(diff(time), "mins")).
For start times with H:M:S = 00:00:00 and a duration that is a multiple of 60 min, like id = 5, the current solution gives a row with a duration of 0 minutes for the last hour. While waiting for a more elegant solution, a quick and dirty way is just to delete such rows with duration = 0, e.g. as sketched below.
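A minimal sketch of that cleanup, assuming the result of the call above has been saved in a data.table named res (a placeholder name):
# drop the spurious zero-minute rows left by the whole-hour edge case
res <- res[as.numeric(dur) > 0]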
Data
Please note that I have added cases not included in the original data: id = 4 (see also my comment above) and id = 5.
df <- data.frame(id = 1:5,
                 start = as.POSIXct(c("2018-01-01 12:15:31",
                                      "2018-01-02 23:43:00",
                                      "2018-01-03 11:00:11",
                                      "2018-01-03 11:35:00",
                                      "2018-01-03 00:00:00")),
                 dur = c(315, 120, 45, 35, 60))
Try this:
library(data.table)
library(lubridate)
library(magrittr)
df <-
  setDT(df)[, start_ceiling := ceiling_date(startTime, "hour", change_on_boundary = TRUE)] %>%
  .[, `:=` (
    reps = ifelse(
      startTime + (duration_minutes * 60) <= start_ceiling,
      1,
      pmax(2, floor(duration_minutes / 60) + 1)
    ),
    initial_diff = as.numeric(difftime(start_ceiling[1], startTime[1], units = "mins"))
  ), by = id] %>%
  .[, df[df[, rep(.I, reps)]]] %>%
  .[, startTime := pmax(startTime, floor_date(startTime, "hour") + hours(0:(.N - 1))), by = id] %>%
  .[reps > 1, duration_minutes := c(initial_diff[.N],
                                    rep(60, reps[.N] - 2),
                                    (duration_minutes[.N] - initial_diff[.N]) %% 60), by = id] %>%
  .[!(duration_minutes == 0 & reps > 1), ] %>%
  .[, c("reps", "start_ceiling", "initial_diff") := NULL]
I've tested this with all the scenarios we've gathered so far, and this is the output:
id startTime duration_minutes
1: 1 2018-01-01 12:15:31 44.48333
2: 1 2018-01-01 13:00:00 60.00000
3: 1 2018-01-01 14:00:00 60.00000
4: 1 2018-01-01 15:00:00 60.00000
5: 1 2018-01-01 16:00:00 60.00000
6: 1 2018-01-01 17:00:00 30.51667
7: 2 2018-01-02 23:43:00 17.00000
8: 2 2018-01-03 00:00:00 60.00000
9: 2 2018-01-03 01:00:00 43.00000
10: 3 2018-01-03 11:00:11 45.00000
11: 4 2018-01-04 10:00:00 60.00000
12: 4 2018-01-04 11:00:00 5.00000
13: 5 2018-01-05 00:00:00 60.00000
14: 6 2018-01-06 11:35:00 25.00000
15: 6 2018-01-06 12:00:00 10.00000
16: 7 2018-01-07 00:00:00 60.00000
17: 7 2018-01-07 01:00:00 60.00000
Data used:
df <- data.frame(
  id = c(1, 2, 3, 4, 5, 6, 7),
  startTime = as.POSIXct(
    c(
      "2018-01-01 12:15:31",
      "2018-01-02 23:43:00",
      "2018-01-03 11:00:11",
      "2018-01-04 10:00:00",
      "2018-01-05 00:00:00",
      "2018-01-06 11:35:00",
      "2018-01-07 00:00:00"
    )
  ),
  duration_minutes = c(315, 120, 45, 65, 60, 35, 120)
)
df
id startTime duration_minutes
1 1 2018-01-01 12:15:31 315
2 2 2018-01-02 23:43:00 120
3 3 2018-01-03 11:00:11 45
4 4 2018-01-04 10:00:00 65
5 5 2018-01-05 00:00:00 60
6 6 2018-01-06 11:35:00 35
7 7 2018-01-07 00:00:00 120

How to replace missing value in time series data by looping?

I'm trying to create a loop to replace missing time series data with value == 0.
This is my data:
df
Times value
05-03-2018 09:00:00 1
05-03-2018 09:01:26 2
05-03-2018 09:04:28 1
05-03-2018 09:07:05 2
05-03-2018 09:09:05 1
and my desired output is:
Times value
05-03-2018 09:00:00 1
05-03-2018 09:01:26 2
05-03-2018 09:02:00 0
05-03-2018 09:03:00 0
05-03-2018 09:04:28 1
05-03-2018 09:05:00 0
05-03-2018 09:06:00 0
05-03-2018 09:07:05 2
05-03-2018 09:08:00 0
05-03-2018 09:09:05 1
Missing minutes in the data should be created and assigned a value of 0.
What should I do? Create a new dummy table with the missing minutes, or build a sequence in a loop?
You could do this with the dplyr and padr packages. padr is very useful for extending datetime series between two dates or adding missing values.
library(dplyr)
library(padr)
df1 %>%
  thicken(interval = "min") %>%  # roll time series up to minutes
  pad(by = "Times_min") %>%      # add missing minute intervals
  fill_by_value(value) %>%       # fill missing values with 0
  mutate(Times = if_else(is.na(Times), Times_min, Times)) %>%  # fill NA's in Times column
  select(-Times_min)             # drop the no-longer-needed column
pad applied on the interval: min
Times value
1 2018-03-05 09:00:00 1
2 2018-03-05 09:01:26 2
3 2018-03-05 09:02:00 0
4 2018-03-05 09:03:00 0
5 2018-03-05 09:04:28 1
6 2018-03-05 09:05:00 0
7 2018-03-05 09:06:00 0
8 2018-03-05 09:07:05 2
9 2018-03-05 09:08:00 0
10 2018-03-05 09:09:05 1
data:
df1 <- structure(list(Times = structure(c(1520240400, 1520240486, 1520240668,
1520240825, 1520240945), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
value = c(1, 2, 1, 2, 1)), row.names = c(NA, -5L), class = "data.frame")
You could create a second 'complete' data frame and merge them together.
dif <- diff(as.numeric(range(df1$Times)))
df1 <- merge(df1,
             data.frame(Times = as.POSIXct(0:(dif/60) * 60,
                                           origin = df1[1, 1], tz = "UTC")),
             all = TRUE)
Then replace resulting NAs with 0.
df1[is.na(df1$value), 2] <- 0
Finally remove the duplicates.
df1 <- df1[-which(duplicated(strftime(df1$Times, format="%M"))) + 1, ]
Yields:
> df1
Times value
1 2018-03-05 09:00:00 1
3 2018-03-05 09:01:26 2
4 2018-03-05 09:02:00 0
5 2018-03-05 09:03:00 0
7 2018-03-05 09:04:28 1
8 2018-03-05 09:05:00 0
9 2018-03-05 09:06:00 0
11 2018-03-05 09:07:05 2
12 2018-03-05 09:08:00 0
14 2018-03-05 09:09:05 1
Data:
df1 <- structure(list(Times = structure(c(1520240400, 1520240486, 1520240668,
1520240825, 1520240945), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
value = c(1, 2, 1, 2, 1)), row.names = c(NA, -5L), class = "data.frame")
library(tidyverse)
library(lubridate)
library(magrittr)
Recreate your data
df <- tibble(
  Times = c("05-03-2018 09:00:00", "05-03-2018 09:01:26",
            "05-03-2018 09:04:28", "05-03-2018 09:07:05",
            "05-03-2018 09:09:05"),
  value = c(1, 2, 1, 2, 1)
)
Code
Parse your Times variable to datetime
df$Times %<>% parse_datetime("%d-%m-%Y %H:%M:%S")
Create a new variable join that is truncated to the minute
df %<>% mutate(join = floor_date(Times, unit = "minute"))
Create a new data frame with one variable also called join and containing every minute in your range
all <- tibble(
  join = seq(as_datetime(first(df$Times)), as_datetime(last(df$Times)), by = 60)
)
Join both data frames
result <- left_join(all, df)
Add the "missing minutes" to your Times variable
result$Times[is.na(result$Times)] <- result$join[is.na(result$Times)]
Replace the NA by 0
result$value[is.na(result$value)] <- 0
Remove the join variable
result %>%
  select(-join)
Result
# A tibble: 10 x 2
Times value
<dttm> <dbl>
1 2018-03-05 09:00:00 1
2 2018-03-05 09:01:26 2
3 2018-03-05 09:02:00 0
4 2018-03-05 09:03:00 0
5 2018-03-05 09:04:28 1
6 2018-03-05 09:05:00 0
7 2018-03-05 09:06:00 0
8 2018-03-05 09:07:05 2
9 2018-03-05 09:08:00 0
10 2018-03-05 09:09:05 1
