How to calculate duration in R sequentially

I have a dataset (dput below) where patients are intubated (ObservationValue of "Start") and extubated (ObservationValue of "Stop"). Occasionally there are, unfortunately, misentries where the patient has an extubation recorded before the intubation, as seen in EncounterID #3 below, where an extubation row appears before the intubation row. In those instances I have no interest in calculating a ventilation duration. My question therefore is: how can I calculate the duration by pairing each intubation with the first extubation that happened after it?
My apologies if this is a basic question, I am still fairly new to using R for data management.
Here is dput output of the dataset I currently have:
test <- structure(list(
  EncounterID = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3),
  ObservationDate = c("2018-01-12 15:27:00", "2018-01-12 19:02:00", "2018-03-03 21:09:00", "2018-03-06 07:56:00",
                      "2019-12-03 15:54:00", "2019-12-03 20:06:00", "2019-12-04 11:40:00", "2019-12-06 08:13:00",
                      "2019-12-23 18:50:00", "2019-12-23 16:00:00"),
  ObservationValue = c("Start", "Stop", "Start", "Stop", "Start", "Stop", "Start", "Stop", "Stop", "Start")),
  row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))

test$ObservationDate <- as.POSIXct(test[["ObservationDate"]], tz = "", format = "%Y-%m-%d %H:%M")

With a bit of data wrangling and reshaping to wide format you could do:
library(dplyr)
library(tidyr)
test |>
  arrange(EncounterID, ObservationDate) |>
  group_by(EncounterID) |>
  mutate(id = cumsum(ObservationValue == "Start")) |>
  ungroup() |>
  pivot_wider(names_from = ObservationValue, values_from = ObservationDate) |>
  rename(Intubation = Start, Extubation = Stop) |>
  mutate(duration = difftime(Extubation, Intubation, units = "days"))
#> # A tibble: 5 × 5
#> EncounterID id Intubation Extubation duration
#> <dbl> <int> <dttm> <dttm> <drtn>
#> 1 1 1 2018-01-12 15:27:00 2018-01-12 19:02:00 0.1493056 days
#> 2 1 2 2018-03-03 21:09:00 2018-03-06 07:56:00 2.4493056 days
#> 3 2 1 2019-12-03 15:54:00 2019-12-03 20:06:00 0.1750000 days
#> 4 2 2 2019-12-04 11:40:00 2019-12-06 08:13:00 1.8562500 days
#> 5 3 1 2019-12-23 16:00:00 2019-12-23 18:50:00 0.1180556 days
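If misentries can still leave a Stop with no preceding Start within an encounter (its cumulative id is 0), or produce a reversed pair with a negative duration, a sketch extending the pipeline above can drop those cases; the extra filter() calls are an addition, not part of the original answer:
test |>
  arrange(EncounterID, ObservationDate) |>
  group_by(EncounterID) |>
  mutate(id = cumsum(ObservationValue == "Start")) |>
  ungroup() |>
  filter(id > 0) |>  # drop any Stop recorded before an encounter's first Start
  pivot_wider(names_from = ObservationValue, values_from = ObservationDate) |>
  rename(Intubation = Start, Extubation = Stop) |>
  mutate(duration = difftime(Extubation, Intubation, units = "days")) |>
  filter(is.na(duration) | duration > 0)  # keep unmatched Starts, drop implausible pairs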

Related

Stocks Daily Returns with R data.frame

(daily return percentage) / 100 = (today's close - yesterday's close) / yesterday's close
I have a data frame like this,
date close
1 2018-09-21 3410.486
2 2018-09-22 3310.126
3 2018-09-23 3312.482
4 2018-09-24 3269.432
5 2018-09-25 3204.922
I'd like to calculate daily returns and make it like this,
date close change
1 2018-09-21 3410.486 3.03%
2 2018-09-22 3310.126 -0.07%
3 2018-09-23 3312.482 1.32%
4 2018-09-24 3269.432 2.01%
5 2018-09-25 3321.825 NA
library(tidyverse)
library(tidyquant)
df %>%
tq_mutate(select = close,
mutate_fun = periodReturn,
period = "daily",
col_rename = "return")
# A tibble: 5 x 3
date close return
<date> <dbl> <dbl>
1 2018-09-21 3410. 0
2 2018-09-22 3310. -0.0294
3 2018-09-23 3312. 0.000712
4 2018-09-24 3269. -0.0130
5 2018-09-25 3205. -0.0197
Just using dplyr.
df1 %>%
mutate(change = (close - lag(close)) / lag(close))
date close change
1 2018-09-21 3410.486 NA
2 2018-09-22 3310.126 -0.0294268911
3 2018-09-23 3312.482 0.0007117554
4 2018-09-24 3269.432 -0.0129962970
5 2018-09-25 3204.922 -0.0197312561
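If the change column should end up as a percentage string like in the desired output, one option (a sketch, assuming the scales package is installed) is to format the computed ratio afterwards:
library(dplyr)
df1 %>%
  mutate(change = (close - lag(close)) / lag(close),
         change = scales::percent(change, accuracy = 0.01))  # e.g. "-2.94%"; the first row stays NA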
data:
df1 <- structure(list(date = structure(c(17795, 17796, 17797, 17798,
17799), class = "Date"), close = c(3410.486, 3310.126, 3312.482,
3269.432, 3204.922), change = structure(c(0, 0, 0, 0, 0), tsp = c(0,
4, 1))), row.names = c(NA, -5L), class = "data.frame")

Perform calculation on column of data frame that has array of datetimes as values

I am trying to get the time difference between elements of an array; a sample of the data is below, and the image at the bottom describes the problem I am trying to solve. I have a dataframe column, events, where each value is an array of date and time entries corresponding to events, and other columns which partition time into a before, evaluation, and after period. I would like to calculate statistics on the time between events.
Update

Using the excellent answer by danlooo below, which gives me almost exactly what I need once I:
- add the four boundary events corresponding to before_eval_begin, eval_month, and after_eval_end to the event array,
- calculate duration only for consecutive events, and
- tweak the before/after case_when statement,
the following code appears to work:
library(tidyverse)  # dplyr, tidyr, stringr, readr (str_remove_all, separate_rows, parse_datetime)
library(lubridate)  # days(), ceiling_date()

duration <-
  data %>%
  mutate(across(before_event_eval:after_eval_end, as.character)) %>%
  as_tibble() %>%
  mutate(
    events = events %>% str_remove_all("[\\[\\]\\\"]")
  ) %>%
  mutate(
    events = ifelse(events == "", events,
                    paste0(events, ",",
                           before_event_eval, ",",
                           as.character(as.Date(eval_month) - days(1)), ",",
                           as.character(ceiling_date(as.Date('2021-02-01'), "month")), ",",
                           after_eval_end))
  ) %>%
  separate_rows(events, sep = ",") %>%
  rename(event = events) %>%
  filter(event != "") %>%
  mutate(across(before_event_eval:after_eval_end, parse_datetime)) %>%
  mutate(
    event = event %>% parse_datetime(),
    position = case_when(
      event >= before_event_eval & event < eval_month ~ "before",
      event <= after_eval_end & event > eval_month ~ "after"
    )
  ) %>%
  arrange(id, event) %>%
  group_by(id) %>%
  mutate(duration = as.numeric(event - lag(event))) %>%
  group_by(id, position) %>%
  summarise(time_until_first = first(duration[!is.na(duration)]),
            timebetween_last = last(duration[!is.na(duration)]),
            min_duration = min(duration, na.rm = TRUE),
            avg_duration = mean(duration, na.rm = TRUE),
            max_duration = max(duration, na.rm = TRUE))
I think a general strategy would be as follows, but I am not sure how to proceed after step 1 and perform computations on the cleaned array:
1. Remove brackets and quotes from the string.
2. Create an ordered vector of events.
3. Determine if each event falls before or after the eval month:
   • Before: event is >= before_eval_begin and < eval_month
   • After: event is > eval_month and <= after_eval_end
4. Determine the time between events for each period (before, after), including times relative to before_eval_begin, eval_month, and after_eval_end.
5. Return the statistics below (if events is missing, then all of these values should be set to 185):
   • Time to first event in pre period
   • Time between last event in pre period and end of pre period
   • Average time between events for pre period
   • Minimum of time between events in pre period
   • Maximum of time between events in pre period
   • Time to first event in post period
   • Time between last event in post period and end of post period
   • Minimum of time between events in post period
   • Maximum of time between events in post period
Edit: removed duplicate events and added an id column.
Data
structure(list(id = c(1, 2, 3, 4), before_event_eval = structure(c(1596240000,
1596240000, 1604188800, 1604188800), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), eval_month = structure(c(1612137600, 1612137600,
1619827200, 1619827200), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
after_eval_end = structure(c(1627776000, 1627776000, 1635724800,
1635724800), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
events = c("[\"2021-01-28 13:25:32\",\"2021-01-28 18:25:32\"]",
"[\"2021-04-30 18:25:32\",\"2021-01-15 11:25:32\",\"2021-01-30 18:25:32\",\"2021-03-30 18:25:32\",\"2021-01-27 11:25:32\",\"2021-01-30 18:26:32\"]",
"[]", "[\"2021-04-28 13:25:32\",\"2021-05-28 10:25:32\"]"
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-4L))
Picture of Problem
Something like this?
library(tidyverse)
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
data <- structure(list(
before_event_eval = structure(c(
1596240000, 1596240000,
1604188800, 1604188800
), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
eval_month = structure(c(
1612137600, 1612137600, 1619827200,
1619827200
), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
after_eval_end = structure(c(
1627776000, 1627776000, 1635724800,
1635724800
), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
events = c(
"[\"2021-01-28 13:25:32\",\"2021-01-28 18:25:32\"]",
"[\"2021-04-30 18:25:32\",\"2021-01-15 11:25:32\",\"2021-01-30 18:25:32\",\"2021-03-30 18:25:32\",\"2021-01-27 11:25:32\",\"2021-01-30 18:25:32\",\"2021-01-30 18:25:32\"]",
"[]", "[\"2021-04-28 13:25:32\",\"2021-05-28 10:25:32\"]"
)
), class = c("tbl_df", "tbl", "data.frame"), row.names = c(
NA,
-4L
))
events <-
data %>%
as_tibble() %>%
mutate(
id = row_number(),
events = events %>% str_remove_all("[\\[\\]\\\"]")
) %>%
separate_rows(events, sep = ",") %>%
rename(event = events) %>%
filter(event != "") %>%
mutate(
event = event %>% parse_datetime(),
position = case_when(
event >= before_event_eval &
year(event) == year(eval_month) &
month(event) < month(eval_month) ~ "before",
event <= after_eval_end &
year(event) == year(eval_month) &
month(event) > month(eval_month) ~ "after"
)
) %>%
arrange(event)
events
#> # A tibble: 11 × 6
#> before_event_eval eval_month after_eval_end
#> <dttm> <dttm> <dttm>
#> 1 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#> 2 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#> 3 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#> 4 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#> 5 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#> 6 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#> 7 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#> 8 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#> 9 2020-11-01 00:00:00 2021-05-01 00:00:00 2021-11-01 00:00:00
#> 10 2020-08-01 00:00:00 2021-02-01 00:00:00 2021-08-01 00:00:00
#> 11 2020-11-01 00:00:00 2021-05-01 00:00:00 2021-11-01 00:00:00
#> # … with 3 more variables: event <dttm>, id <int>, position <chr>
durations <-
events$event %>%
as.character() %>%
unique() %>%
combn(2) %>%
t() %>%
as_tibble() %>%
transmute(
from = parse_datetime(V1),
to = parse_datetime(V2),
duration = to - from
) %>%
left_join(events, by = c("from" = "event"))
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
#> Using compatibility `.name_repair`.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
durations
#> # A tibble: 44 × 8
#> from to duration before_event_eval
#> <dttm> <dttm> <drtn> <dttm>
#> 1 2021-01-15 11:25:32 2021-01-27 11:25:32 288 hours 2020-08-01 00:00:00
#> 2 2021-01-15 11:25:32 2021-01-28 13:25:32 314 hours 2020-08-01 00:00:00
#> 3 2021-01-15 11:25:32 2021-01-28 18:25:32 319 hours 2020-08-01 00:00:00
#> 4 2021-01-15 11:25:32 2021-01-30 18:25:32 367 hours 2020-08-01 00:00:00
#> 5 2021-01-15 11:25:32 2021-03-30 18:25:32 1783 hours 2020-08-01 00:00:00
#> 6 2021-01-15 11:25:32 2021-04-28 13:25:32 2474 hours 2020-08-01 00:00:00
#> 7 2021-01-15 11:25:32 2021-04-30 18:25:32 2527 hours 2020-08-01 00:00:00
#> 8 2021-01-15 11:25:32 2021-05-28 10:25:32 3191 hours 2020-08-01 00:00:00
#> 9 2021-01-27 11:25:32 2021-01-28 13:25:32 26 hours 2020-08-01 00:00:00
#> 10 2021-01-27 11:25:32 2021-01-28 18:25:32 31 hours 2020-08-01 00:00:00
#> # … with 34 more rows, and 4 more variables: eval_month <dttm>,
#> # after_eval_end <dttm>, id <int>, position <chr>
durations %>%
group_by(position) %>%
summarise(
min_duration = min(duration),
avg_duration = mean(duration),
max_duration = max(duration)
)
#> # A tibble: 2 × 4
#> position min_duration avg_duration max_duration
#> <chr> <drtn> <drtn> <drtn>
#> 1 after 664 hours 876.750 hours 1408 hours
#> 2 before 5 hours 1600.925 hours 3191 hours
Created on 2022-04-26 by the reprex package (v2.0.0)
To only look at consecutive events, one can do
durations <-
events %>%
arrange(position, event) %>%
mutate(
from = event,
to = lead(event)
)
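A sketch (an extension, not part of the original answer) of carrying those consecutive pairs through to the same per-position summary; the group_by(id, position) is an assumption so that pairs never span different ids or periods:
durations <-
  events %>%
  arrange(position, event) %>%
  group_by(id, position) %>%  # assumption: pair events only within one id and period
  mutate(from = event,
         to = lead(event),
         duration = to - from) %>%
  filter(!is.na(to)) %>%
  summarise(min_duration = min(duration),
            avg_duration = mean(duration),
            max_duration = max(duration),
            .groups = "drop")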

Adding dates and times to event durations

As an addition to this question, is it possible to also add when an event started and when it finished, in additional column(s)?
Here is a reproducible example pulled from the OP.
df <- structure(list(Time = structure(c(1463911500, 1463911800, 1463912100,
1463912400, 1463912700, 1463913000), class = c("POSIXct", "POSIXt"
), tzone = ""), Temp = c(20.043, 20.234, 6.329, 20.424, 20.615,
20.805)), row.names = c(NA, -6L), class = "data.frame")
> df
Time Temp
1 2016-05-22 12:05:00 20.043
2 2016-05-22 12:10:00 20.234
3 2016-05-22 12:15:00 6.329
4 2016-05-22 12:20:00 20.424
5 2016-05-22 12:25:00 20.615
6 2016-05-22 12:30:00 20.805
library(dplyr)
library(data.table)  # for rleid()
df %>%
# add id for different periods/events
mutate(tmp_Temp = Temp > 20, id = rleid(tmp_Temp)) %>%
# keep only periods with high temperature
filter(tmp_Temp) %>%
# for each period/event, get its duration
group_by(id) %>%
summarise(event_duration = difftime(last(Time), first(Time)))
id event_duration
<int> <time>
1 1 5 mins
2 3 10 mins
i.e. there would be two more columns: "start_DateTime" and "end_DateTime".
Thanks!
Sure. Modify the final summarise() like this:
df %>%
# add id for different periods/events
mutate(tmp_Temp = Temp > 20, id = rleid(tmp_Temp)) %>%
# keep only periods with high temperature
filter(tmp_Temp) %>%
# for each period/event, get its duration
group_by(id) %>%
summarise(event_duration = difftime(last(Time), first(Time)),
start_DateTime = min(Time),
end_DateTime = max(Time))
#> # A tibble: 2 × 4
#> id event_duration start_DateTime end_DateTime
#> <int> <drtn> <dttm> <dttm>
#> 1 1 5 mins 2016-05-22 12:05:00 2016-05-22 12:10:00
#> 2 3 10 mins 2016-05-22 12:20:00 2016-05-22 12:30:00
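If pulling in data.table just for rleid() is not desired, the run id can be rebuilt with cumsum() in plain dplyr; a sketch that should behave the same way here:
library(dplyr)

df %>%
  # a change-point counter reproduces rleid(tmp_Temp) for this purpose
  mutate(tmp_Temp = Temp > 20,
         id = cumsum(tmp_Temp != lag(tmp_Temp, default = first(tmp_Temp))) + 1) %>%
  filter(tmp_Temp) %>%
  group_by(id) %>%
  summarise(event_duration = difftime(last(Time), first(Time)),
            start_DateTime = min(Time),
            end_DateTime = max(Time))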

Convert date format from 31-May-2020 to time series data in R

I have dates in a format like 01-June-2020. I would like to convert them to time series data in R. I tried as.Date but it returns NAs.
Here is the data:
dput(head(TData))
structure(list(Date = c("31-May-20", "01-Jun-20", "02-Jun-20",
"03-Jun-20", "04-Jun-20", "07-Jun-20"), Price = c(7213.03, 7288.81,
7285.23, 7222.41, 7207.78, 7267.86), Open = c(7050.66, 7213.03,
7288.81, 7285.23, 7222.41, 7207.78), High = c(7338.96, 7288.81,
7321.36, 7311.85, 7207.78, 7277.7), Low = c(7149.71, 7202.14,
7277.63, 7202.39, 7129.25, 7233.67), Vol. = c("307.44M", "349.59M",
"343.52M", "286.85M", "234.18M", "225.87M"), `Change %` = c("2.30%",
"1.05%", "-0.05%", "-0.86%", "-0.20%", "0.83%")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
We have to specify the format. By default, the format is YYYY-MM-DD, i.e. %Y-%m-%d. Here it is a 2-digit day (%d), followed by an abbreviated month name (%b) and a 2-digit year (%y):
TData$Date <- as.Date(TData$Date, '%d-%b-%y')
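A quick check against the dput above (a minimal example; note that %b parses month names in the session's locale, so an English locale is assumed):
as.Date(c("31-May-20", "01-Jun-20"), format = "%d-%b-%y")
#> [1] "2020-05-31" "2020-06-01"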
If we want to create a time series object, maybe use xts:
library(lubridate)
library(xts)
library(dplyr)
TData %>%
mutate(Date = dmy(Date)) %>%
select(Date, where(is.numeric)) %>%
{xts(.[-1], order.by = .$Date)}
Price Open High Low
2020-05-31 7213.03 7050.66 7338.96 7149.71
2020-06-01 7288.81 7213.03 7288.81 7202.14
2020-06-02 7285.23 7288.81 7321.36 7277.63
2020-06-03 7222.41 7285.23 7311.85 7202.39
2020-06-04 7207.78 7222.41 7207.78 7129.25
2020-06-07 7267.86 7207.78 7277.70 7233.67
Or we may use tsibble:
library(tsibble)
TData %>%
mutate(Date = dmy(Date)) %>%
select(Date, where(is.numeric)) %>%
as_tsibble(index = Date)
-output
# A tsibble: 6 x 5 [1D]
Date Price Open High Low
<date> <dbl> <dbl> <dbl> <dbl>
1 2020-05-31 7213. 7051. 7339. 7150.
2 2020-06-01 7289. 7213. 7289. 7202.
3 2020-06-02 7285. 7289. 7321. 7278.
4 2020-06-03 7222. 7285. 7312. 7202.
5 2020-06-04 7208. 7222. 7208. 7129.
6 2020-06-07 7268. 7208. 7278. 7234.
We can also use lubridate package functions. Since months are stored as abbreviated month names, we use %b instead of %m here:
library(lubridate)
df %>%
mutate(Date = as_date(Date, format = "%d-%b-%Y"))
# A tibble: 6 x 7
Date Price Open High Low Vol. `Change %`
<date> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 2020-05-31 7213. 7051. 7339. 7150. 307.44M 2.30%
2 2020-06-01 7289. 7213. 7289. 7202. 349.59M 1.05%
3 2020-06-02 7285. 7289. 7321. 7278. 343.52M -0.05%
4 2020-06-03 7222. 7285. 7312. 7202. 286.85M -0.86%
5 2020-06-04 7208. 7222. 7208. 7129. 234.18M -0.20%
6 2020-06-07 7268. 7208. 7278. 7234. 225.87M 0.83%

R: Reshape every 2 rows of data into 1 row based on factor of a single column

I have the below data:
TimeStamp Fab23.A start.end
1 2020-03-02 20:44:00 27.54236 start
2 2020-03-02 20:50:00 186.08670 end
3 2020-03-03 18:12:00 37.33132 start
4 2020-03-03 18:16:00 189.78060 end
5 2020-03-04 17:48:00 33.78360 start
6 2020-03-04 17:52:00 190.08100 end
.
.
.
I'm trying to reshape them into the format below, based on the categorical value in the last column:
start.TimeStamp end.TimeStamp start.Fab23.A end.Fab23.A
1 2020-03-02 20:44:00 2020-03-02 20:50:00 27.54236 186.08670
2 2020-03-03 18:12:00 2020-03-03 18:16:00 37.33132 189.78060
3 2020-03-04 17:48:00 2020-03-04 17:52:00 33.78360 190.08100
.
.
.
I have tried the reshape and melt functions, but to no avail.
Will appreciate any advice.
You can use:
library(dplyr)
df %>%
group_by(start.end) %>%
mutate(row = row_number()) %>%
tidyr::pivot_wider(names_from = start.end,
values_from = c(TimeStamp, Fab23.A)) %>%
select(-row)
# A tibble: 3 x 4
# TimeStamp_start TimeStamp_end Fab23.A_start Fab23.A_end
# <chr> <chr> <dbl> <dbl>
#1 2020-03-0220:44:00 2020-03-0220:50:00 27.5 186.
#2 2020-03-0318:12:00 2020-03-0318:16:00 37.3 190.
#3 2020-03-0417:48:00 2020-03-0417:52:00 33.8 190.
Or using data.table:
library(data.table)
dcast(setDT(df), rowid(start.end) ~ start.end, value.var = c("TimeStamp", "Fab23.A"))
data
df <- structure(list(TimeStamp = c("2020-03-0220:44:00", "2020-03-0220:50:00",
"2020-03-0318:12:00", "2020-03-0318:16:00", "2020-03-0417:48:00",
"2020-03-0417:52:00"), Fab23.A = c(27.54236, 186.0867, 37.33132,
189.7806, 33.7836, 190.081), start.end = c("start", "end", "start",
"end", "start", "end")), class = "data.frame", row.names = c(NA, -6L))
