Description
I am trying to count the time (in mins) between two events associated with a value column. The data has a date (o3.date), a time (o4.arrival), and a value (i1.arrivalEval) column, similar to the sample dataset below. The value column (i1.arrivalEval) represents a waiting time, and two events can be associated with it.
First, if this value exceeds the 90th percentile of all waiting times, overcrowding occurs. The overcrowding threshold is defined as the 90th percentile.
Second, once overcrowding occurs, it takes time for the waiting time to normalize to the average waiting time (the average object in the code below). I am trying to count the minutes this normalization takes after overcrowding, for each incident and for each day.
For example, on 2013-01-01 the waiting time (i1.arrivalEval) spiked to 179 mins (the overcrowding threshold is 176) at 15:30:00. It then dropped below the average waiting time (around 101) at 16:00:00, so the count is 30 mins. The second overcrowding incident on the same day is at 2013-01-01 22:20:00, and it takes 10 mins to drop below the average waiting time. When one day has multiple overcrowding incidents, each incident should be recorded with its time of overcrowding.
If the computation can be done with ddply or tidyverse packages, that would be great.
Desired Output
The output will have four columns: date, datetime of overcrowding, datetime of normalization (i.e., dropping below the average waiting time), and the time between the two events.
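A minimal sketch of the pairing logic on a tiny hand-made day, assuming dplyr and tidyr are installed. The data frame toy and the thresholds avg_wait and crowd_cut are hypothetical stand-ins, not values taken from the sample dataset. Each overcrowded record is paired with the first later record of the same day that is at or below the average; an incident that has not normalized by the end of the day keeps NA.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one record every 10 minutes on a single day
toy <- data.frame(
  o4.arrival     = as.POSIXct("2013-01-01 15:00:00", tz = "UTC") + 600 * 0:6,
  i1.arrivalEval = c(90, 120, 180, 150, 110, 95, 185)
)
avg_wait  <- 100  # assumed average waiting time
crowd_cut <- 176  # assumed 90th percentile (overcrowding threshold)

incidents <- toy %>%
  mutate(o3.date = as.Date(o4.arrival)) %>%
  group_by(o3.date) %>%
  arrange(o4.arrival, .by_group = TRUE) %>%
  # keep the timestamp only where the wait is at or below the average
  mutate(normalized = replace(o4.arrival, i1.arrivalEval > avg_wait, NA)) %>%
  # carry the next "calm" timestamp backwards onto the earlier rows of the day
  fill(normalized, .direction = "up") %>%
  ungroup() %>%
  filter(i1.arrivalEval >= crowd_cut) %>%
  transmute(
    o3.date,
    overcrowded = o4.arrival,
    normalized,
    minutes = as.numeric(difftime(normalized, overcrowded, units = "mins"))
  )

incidents
```

In this toy day the spike at 15:20 is paired with the below-average record at 15:50 (30 minutes), while the spike at 16:00 has no later calm record and stays NA. Unlike a plain lead(), consecutive overcrowded records all point to the same normalization time.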
Sample Dataset
# sample dataset
library(tidyverse)
library(tibble)
library(lubridate)
rename <- dplyr::rename
select <- dplyr::select
set.seed(10000)
id <- 1:1000
i1.arrivalEval <- sample(10:200, 1000, replace = TRUE)
set.seed(10003)
o4.arrival <- sample(seq(as.POSIXct('2013/01/01'), as.POSIXct('2013/01/30'), by="10 mins"), 1000)
d <-
  data.frame(id, i1.arrivalEval, o4.arrival) %>%
  mutate(o3.date = as.Date(o4.arrival)) %>%
  arrange(o4.arrival)
average <- mean(d$i1.arrivalEval)
overcrowding <- quantile(d$i1.arrivalEval,.90)
# Goal: For *each day*, *counting number of minutes* until the waiting time reaches
# below average after overcrowding (spike in waiting time)
#
# For example, for day '2013-01-01', the waiting time (i1.arrivalEval) spiked to 179 mins (> overcrowding 176) at 15:30:00
# Then, it reaches to below average (around 101) waiting time at 16:00:00.
# Thus, the count of minutes will be 30 mins.
# The second incident of overcrowding on the day will be 2013-01-01 22:20:00 and
# it takes 10 mins to drop below the average waiting time.
# When one day has multiple overcrowding incidents, record each incident with the time of overcrowding.
# The output will have four columns:
# date, time of overcrowding, time of resolved time, and count of minutes until below average waiting time
Update: updated dataset and the solution
# Updated dataset by #Donalid Seinen
set.seed(10000)
d <- data.frame(id = 1:1000,
                i1.arrivalEval = sample(10:200, 1000, replace = TRUE),
                o4.arrival = sample(seq(as.POSIXct('2013/01/01'),
                                        as.POSIXct('2013/01/30'),
                                        by = "10 mins"), 1000)) %>%
  mutate(o3.date = as.Date(o4.arrival)) %>%
  arrange(o4.arrival)
average <- mean(d$i1.arrivalEval)
overcrowding <- quantile(d$i1.arrivalEval,.90)
# Solution (ddply comes from plyr; calling it as plyr::ddply avoids masking dplyr verbs)
d %>%
  plyr::ddply('o3.date', function(i) {
    i %>%
      select(-o3.date) %>%
      # flag overcrowded records and number the incidents within the day
      mutate(
        o2.overcrowding = ifelse(i1.arrivalEval >= overcrowding, 1, 0),
        o1.incidents = cumsum(o2.overcrowding)
      ) %>%
      # drop everything before the first incident of the day
      filter(o1.incidents > 0) %>%
      mutate(
        o1.slow = ifelse(i1.arrivalEval <= average, 1, 0)
      ) %>%
      # keep only overcrowded or below-average records
      filter(o2.overcrowding != 0 | o1.slow != 0) %>%
      arrange(o4.arrival) %>%
      # the next kept record is taken as the normalization time
      mutate(o4.normalized = lead(o4.arrival, 1L)) %>%
      select(id, i1.arrivalEval, o4.arrival, o4.normalized, everything()) %>%
      filter(o2.overcrowding == 1) %>%
      mutate(o1.count = difftime(o4.normalized, o4.arrival, units = 'mins')) %>%
      select(-o2.overcrowding, -o1.incidents, -i1.arrivalEval, -o1.slow)
  })
Related
I want to count the number of patients waiting to be seen by a doctor at every 15-minute mark over a 3.5-year time frame.
I have a first data.frame (dates) which has 122880 rows of dates (one every 15 min).
I have another data.frame (episode) which has 225000 rows of patient ID with the time when they came to the ER, the time when they left and their ID.
I have a third data.frame (care) with the time the doctor saw the patients and the patients' ID.
Here is my code:
for(hour in 1:122880){
  for(patient in 1:nrow(episode)){
    # patient is in the ER at this time point
    if(episode$begin[patient] < dates[hour] & episode$end[patient] > dates[hour]){
      no_episode = episode$id[patient]
      # patient has not been seen by the doctor yet
      if(care$begin[care$id==no_episode] > dates[hour]){
        nb_wait = nb_wait + 1
        delay = delay + dates[hour]-episode$begin[patient]
      }
    }
  }
  nb_wait_total = rbind(nb_wait_total, nb_wait)
  delay_total = rbind(delay_total, delay)
  nb_wait = 0
  delay = 0
}
The first loop is to set the date and write the results.
The second loop + first if statement is to search in the episode data.frame which patients are in the ER during the time frame.
The second if counts the patients who haven't seen the doctor yet. I also sum how long those patients have been waiting.
It is extremely slow (an estimated 30 days), and from the tests I have done, the slowest line is this one:
if(episode$begin[patient]<dates[hour]&episode$end[patient]>dates[hour])
Any idea how to clean my code so it doesn't take so much time?
I have tried cutting the episode data.frame into 4, which speeds up the process but would still take 7 days!
Thank you!
Update! Went from 30 days to 11 hours, thanks to the comments I had on the post!
Here is the code after modification:
for(hour in 1:122880){
  # ids of all patients in the ER at this time point (vectorized over episode)
  temp <- episode$id[episode$begin<dates[hour]&episode$end>dates[hour]]
  # seq_along() skips the loop body when temp is empty
  for(element in seq_along(temp)){
    no_episode = temp[element]
    if(care$begin[care$id==no_episode]>dates[hour]){
      nb_wait = nb_wait + 1
      delay = delay + dates[hour]-episode$begin[episode$id==no_episode]
    }
  }
  nb_wait_total = rbind(nb_wait_total, nb_wait)
  delay_total = rbind(delay_total, delay)
  nb_wait = 0
  delay = 0
}
Getting rid of the first if statement and the 2nd (longest) loop did the trick!
Not sure if this does exactly what you need, but it uses data of similar size and calculates the wait counts at each 15 minute interval in the date range in under 1 second.
First, here's some fake data. I assume here that "id" + "date" are unique identifiers in both the "care" and "episode" tables, and there's a 1:1 match for each. I also assume patient always arrives before doctor arrival, and ends after doctor arrival.
library(tidyverse); library(lubridate)
set.seed(42)
epi_n <- 500000 # before filtering; results in 230k data
care <- tibble(
  id = sample(1:99999, epi_n, replace = TRUE),
  begin = ymd_h(2010010100) + runif(epi_n, max = 3.5*365*24*60*60),
  date = as_date(begin)) %>%
  filter(hour(begin) >= 7, hour(begin) <= 17) %>%
  distinct(id, date, .keep_all = TRUE) # make "id" and "date" unique identifiers
episode <- care %>%
  transmute(id, date,
            begin = begin - rpois(nrow(care), 30)*60,
            end = begin + rgamma(nrow(care), 1)*600) %>%
  arrange(begin)
Matching the data between the two data sets takes 0.2 sec.
tictoc::tic()
combined <- left_join(
  episode %>% rename("patient_arrival" = "begin"),
  care %>% rename("dr_arrival" = "begin"),
  by = c("id", "date")  # explicit join keys; the only columns the two tables share
)
tictoc::toc()
Counting how many patients had arrived but doctors hadn't at each 15 minute interval takes another 0.2 sec.
Here, I isolate each patient arrival and each doctor arrival; the first moment adds one to the wait at that moment, and the second reduces the wait count by one. We can sort those and count the cumulative number of waiting patients. Finally, I collate in a table like your "dates" table, and fill the wait counts that were in effect prior to each 15 minute interval. Then I just show those.
tictoc::tic()
combined %>%
  select(id, patient_arrival, dr_arrival) %>%
  pivot_longer(-id, values_to = "hour") %>%
  mutate(wait_chg = if_else(name == "patient_arrival", 1, -1)) %>%
  arrange(hour) %>%
  mutate(wait_count = cumsum(wait_chg)) %>%
  bind_rows(.,
            tibble(name = "15 minute interval",
                   hour = seq.POSIXt(ymd_h(2010010100),
                                     to = ymd_h(2013070112),
                                     by = "15 min")) %>%
              filter(hour(hour) >= 6, hour(hour) < 20)
  ) %>%
  arrange(hour) %>%
  fill(wait_count) %>%
  replace_na(list(wait_count = 0)) %>%
  filter(name == "15 minute interval") %>%
  select(hour, wait_count)
tictoc::toc()
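The +1/-1 event idea described above can be checked by hand on a tiny example. This is a base R sketch with made-up numbers: patient_arrival and dr_arrival are hypothetical times in minutes, and findInterval() stands in for the bind_rows/fill step to look up the count in effect at each 15-minute grid point.

```r
# Hypothetical toy data: arrival and doctor-contact times, in minutes
patient_arrival <- c(2, 5, 20, 40)
dr_arrival      <- c(10, 32, 25, 70)

# One +1 event per patient arrival, one -1 event per doctor contact
events <- data.frame(
  time   = c(patient_arrival, dr_arrival),
  change = rep(c(1, -1), each = length(patient_arrival))
)
events <- events[order(events$time), ]
events$waiting <- cumsum(events$change)  # patients waiting after each event

# Look up the count in effect at each grid point (every 15 minutes)
grid <- seq(0, 75, by = 15)
idx  <- findInterval(grid, events$time)      # index of the last event at or before each grid point
wait_count <- c(0, events$waiting)[idx + 1]  # 0 when no event has happened yet

data.frame(grid, wait_count)
```

At minute 15, for instance, the first patient has already been seen (at 10) and the third has not arrived yet (20), so only the second patient is waiting. The toy times avoid ties with the grid; how ties should be counted is a choice the real data would need to settle.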
I have a dataset with patients who were under observation for 72 hours. Patient's oxygen levels were measured every 4 seconds but some observations had to be removed due to issues with accuracy of the data. As a result, patients have a varying number of observations.
While patients were observed, they underwent various interventions. The objective of my analysis is to assess whether interventions affected the patient's oxygen levels or not. Therefore, I am doing a comparison of the oxygen levels pre-intervention, during intervention and post-intervention.
While the actual analysis isn't too difficult, I am having a hard time subsetting the data. For example, I only want observations 300 seconds prior to the start of an intervention and 300 seconds post intervention. I have to take into account the fact that an individual may have multiple interventions over the course of the time window and there are multiple subjects.
I have provided some sample code below to generate a dataset but please let me know if I'm missing anything.
id <- rep(c(1,2,3), each = 1000)
intervention <- c(rep(0,200), rep(1,10), rep(0,153), rep(0,5), rep(0,284), rep(0,20), rep(0,159), rep(0,23), rep(0,146),
rep(0,123), rep(1,23), rep(0,356), rep(1,8), rep(0,234), rep(1,23), rep(0,233),
rep(0,345), rep(1,12), rep(0,48), rep(1,15), rep(0,74), rep(1,4), rep(0,233), rep(1,82), rep(0,187))
final <- data.frame(id, intervention)
final <- final %>%
group_by(id) %>%
mutate(time = row_number() * 4)
So far, I have tried this method but I was only able to isolate single observations 5 mins pre and post an intervention and not all observations between those time windows (ie, the single observation 5mins prior to start of an intervention and the single observation 5 mins post an intervention but not all the observations in between these three points)
data <- final %>%
  filter(intervention == 1) %>%
  mutate(five_mins_after = time + 300,        # 5 mins after intervention
         five_mins_before = time - 300) %>%   # 5 mins before intervention
  filter(id == "1")
data2 <- final %>%
  filter(intervention == 0,
         id == "1")
data_after <- data %>%
  dplyr::select(five_mins_after)
data_before <- data %>%
  dplyr::select(five_mins_before)
data3 <- merge(data2, data_after, by.x = "time", by.y = "five_mins_after")
data4 <- merge(data2, data_before, by.x = "time", by.y = "five_mins_before")
final <- final %>%
  dplyr::bind_rows(data3) %>%
  dplyr::bind_rows(data4)
Please let me know if you need any additional information and thanks for your time!
PS: Apologies if I missed anything, first time asking for help here
Here is one answer. Although long, it gathers the observations from 300 seconds before to 300 seconds after the beginning of each intervention.
Let me know if you'd like further explanation or if I have misunderstood anything.
library(magrittr)
library(tidyverse)
### Sample code
id <- rep(c(1,2,3), each = 1000)
intervention <- c(rep(0,200), rep(1,10), rep(0,153), rep(0,5), rep(0,284), rep(0,20), rep(0,159), rep(0,23), rep(0,146),
rep(0,123), rep(1,23), rep(0,356), rep(1,8), rep(0,234), rep(1,23), rep(0,233),
rep(0,345), rep(1,12), rep(0,48), rep(1,15), rep(0,74), rep(1,4), rep(0,233), rep(1,82), rep(0,187))
final <- data.frame(id, intervention)
final <- final %>%
group_by(id) %>%
mutate(time = row_number() * 4)
### Start of data processing to get wanted observations
# Order it by id and time
final %<>% arrange(id, time)
# Loop over the unique ids
obs_to_keep <- list()
for(i in unique(final$id)) {
  # Get the times at which there is no intervention
  time_zero_intervention <- final %>%
    filter(id == i & intervention == 0) %>%
    select(time)
  # Obtain all times right after a zero intervention, where intervention could be 1
  time_plus_4 <- time_zero_intervention$time + 4
  # Among those times, keep the ones where intervention == 1 (the intervention starts)
  starts_of_interventions <- final %>%
    filter(id == i & time %in% time_plus_4) %>%
    filter(intervention == 1)
  # Loop over each one of the times where an intervention starts
  all_times <- list()
  # seq_along() skips the loop when an id has no intervention start
  for(n in seq_along(starts_of_interventions$time)) {
    # Gather 300 secs prior and post
    time_300_before <- starts_of_interventions$time[n] - 300
    time_300_after <- starts_of_interventions$time[n] + 300
    # Filter for observations in this interval
    all_times[[n]] <- final %>%
      filter(id == i) %>%
      filter(time >= time_300_before & time <= time_300_after)
  }
  if(length(all_times) == 1){
    obs_to_keep[[i]] <- as.data.frame(all_times)
  }
  else {
    obs_to_keep[[i]] <- do.call(rbind, all_times)
  }
}
# Make a data frame from the list
df <- do.call(rbind, obs_to_keep)
# Order it by id and time
df %<>% arrange(id, time)
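A more compact variant of the same idea, sketched in base R on hypothetical toy data (the data frame toy and the helper keep_windows are illustrative names, not part of the original post). An intervention start is taken to be any row where the flag switches from 0 to 1 within a subject, and every row within 300 seconds of a start is kept. Unlike the loop above, this sketch also treats an intervention on a subject's very first row as a start.

```r
# Hypothetical toy data: two subjects, one reading every 4 seconds
toy <- data.frame(
  id           = rep(c(1, 2), each = 400),
  time         = rep(seq(4, 1600, by = 4), times = 2),
  intervention = c(rep(0, 200), rep(1, 10), rep(0, 190),  # id 1: starts at time 804
                   rep(0, 400))                           # id 2: no intervention
)

# Keep the rows of one subject that lie within half_width seconds of a start
keep_windows <- function(d, half_width = 300) {
  d <- d[order(d$time), ]
  # a start is a 1 whose previous reading was 0 (or that has no previous reading)
  is_start <- d$intervention == 1 & c(0, head(d$intervention, -1)) == 0
  starts   <- d$time[is_start]
  if (length(starts) == 0) return(d[0, ])
  near <- vapply(d$time, function(tm) any(abs(tm - starts) <= half_width), logical(1))
  d[near, ]
}

windows <- do.call(rbind, lapply(split(toy, toy$id), keep_windows))
range(windows$time)
```

Because each row is tested against all starts of its subject, overlapping windows from interventions that are close together are kept once rather than duplicated.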
I want to check whether there is a discontinuity in the years of record of my streamflow data. I wrote a loop to select the stations with more than 30 years of records, and I also want to keep only the ones with continuous data (that is, consecutive years such as 1998, 1999, 2000, 2001, but not a jump from 1998 to 2000).
How can I write a loop that checks continuity in the years, or even the months? At the moment this is my code, which works as I expect:
Daily_Streamflow <- Daily_Streamflow %>% mutate(year = year(Date))
Daily_Streamflow <- Daily_Streamflow %>% mutate(month = month(Date))
if (length(unique(Daily_Streamflow$year)) > 30)
{
  (name <- paste("DSF", siteNumber[i], sep = "_"))
  (name2 <- paste("Site_Info", siteNumber[i], sep = "_"))
  assign(name, value = Daily_Streamflow)
  assign(name2, value = siteINFO)
  #print(name)
}
How can I add the continuity condition to the if?
My desired output is to save as data frames in R only the records of streamflow gages with a period of record longer than 30 years (length(unique(Daily_Streamflow$year)) > 30), with the added condition that there is no discontinuity of more than 3 months in the records. What I am missing is this second condition, the continuity check.
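One possible way to express the continuity condition, as a sketch only: sort the unique record dates and test whether the largest gap between consecutive dates stays within a limit. The helper is_continuous is a hypothetical name, and 92 days is an assumed stand-in for "3 months".

```r
# Hypothetical helper: TRUE when no gap between consecutive records exceeds max_gap_days
is_continuous <- function(dates, max_gap_days = 92) {
  dates <- sort(unique(as.Date(dates)))
  if (length(dates) < 2) return(TRUE)
  max(as.numeric(diff(dates))) <= max_gap_days  # diff() of Dates is in days
}

# Toy check: daily records with and without a one-year hole
full  <- seq(as.Date("1998-01-01"), as.Date("2001-12-31"), by = "day")
holey <- full[format(full, "%Y") != "1999"]

is_continuous(full)
is_continuous(holey)
```

With such a helper, the existing test could become if (length(unique(Daily_Streamflow$year)) > 30 && is_continuous(Daily_Streamflow$Date)), assuming Date holds one entry per recorded day.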
I need to assign a time interval based on the "cnt_rows" column, grouped by "Name". If the count is around
96, the interval is 15 mins: with a count of 94 the timestamps should stop at
11:15 PM (based on the number of rows), and with exactly 96 they should end at 11:45 PM every day.
The same applies to the 5-minute interval. The timestamps should not run past the end of the day.
cnt_rows = c("94","94",".",".","94","286","286",".",".",".","286","96","96",".",".","96")
Name = c("Alan","Alan",".",".","Alan","Steve","Steve",".",".",".","Steve","Mike","Mike",".",".","Mike")
Values = c("10","10",".",".","45","91","35",".",".",".","46","34","5",".",".","34")
Input Table
df = data.frame(cnt_rows,Name,Values)
Output Table
dt = c("2019-12-01 00:00:00","2019-12-01 00:15:00",".",".","2019-12-01 23:15:00","2019-12-01 00:00:00","2019-12-01 00:05:00",".",".",".","2019-12-01 23:45:00","2019-12-01 00:00:00","2019-12-01 00:15:00",".",".","2019-12-01 23:45:00")
df_out = data.frame(cnt_rows,Name,Values,dt)
Thanks in advance.
Maybe you can try:
library(dplyr)
date <- as.POSIXct('2019-12-01')
df %>%
  mutate(breaks = ifelse(cnt_rows %in% c(94, 96), '15 min', '5 min')) %>%
  group_by(Name) %>%
  mutate(dt = seq(date, by = first(breaks), length.out = n()))
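To see that seq() with length.out produces the end times asked for, here is a small base R check. The helper pick_break and its cutoff of 96 rows are assumptions made for illustration (the answer above hard-codes the counts 94 and 96 instead), and UTC is used only to keep the printed times independent of the local time zone.

```r
# Hypothetical helper: choose the step from the number of rows in a day
pick_break <- function(n_rows) if (n_rows <= 96) "15 min" else "5 min"

start <- as.POSIXct("2019-12-01 00:00:00", tz = "UTC")

# Last timestamp (HH:MM) of a sequence with n_rows entries starting at midnight
last_time <- function(n_rows) {
  stamps <- seq(start, by = pick_break(n_rows), length.out = n_rows)
  format(tail(stamps, 1), "%H:%M")
}

last_times <- vapply(c(94, 96, 286), last_time, character(1))
last_times
```

The 94-row sequence stops at 23:15 and the 96-row and 286-row sequences at 23:45, matching the output table in the question. Any count up to 96 at 15 minutes or up to 288 at 5 minutes stays within the day.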
I have a data frame where each row is a different timestamp. The older data in the data frame was collected at 30-minute intervals, while the more recent data was collected at 15-minute intervals. I would like to run a for loop (or maybe an ifelse statement) that calculates the time difference between consecutive rows: if the difference is equal to 30 minutes (the example below uses 1800 seconds), the loop continues, but if it encounters a 15-minute difference (900 seconds in the example below), it stops and tells me the row on which this first occurred.
x <- as.POSIXct("2000-01-01 01:00", tz = "", "%Y-%m-%d %H:%M")
y <- as.POSIXct("2000-01-10 12:30", tz = "", "%Y-%m-%d %H:%M")
xx <- as.POSIXct("2000-01-10 12:45", tz = "", "%Y-%m-%d %H:%M")
yy <- as.POSIXct("2000-01-20 23:45", tz = "", "%Y-%m-%d %H:%M")
a.30 <- as.data.frame(seq(from = x, to = y, by = 1800))
names(a.30)[1] <- "TimeStamp"
a.15 <- as.data.frame(seq(from = xx, to = yy, by = 900))
names(a.15)[1] <- "TimeStamp"
dat <- rbind(a.30,a.15)
In the example dat data frame, the time difference switches from 30-minute to 15-minute intervals at row 457. I would like to automate the process of identifying the row where this change in time difference first occurs.
We can use difftime to calculate the time difference in minutes and create a logical vector based on that difference:
library(dplyr)
dat %>%
  summarise(ind = which.max(abs(as.numeric(difftime(TimeStamp,
    lag(TimeStamp, default = TimeStamp[2]), units = 'mins'))) < 30))
# ind
#1 457
Here's another way that uses slightly different logic. Calculate the difference, and create a column with the row number. Then filter to where the difference is 15, and take the first row.
library(tidyverse)
dat %>%
  mutate(Diff = TimeStamp - lag(TimeStamp), rownum = row_number()) %>%
  filter(Diff == 15) %>%
  slice(1)
            TimeStamp    Diff rownum
1 2000-01-10 12:45:00 15 mins    457
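For comparison, a base R sketch that works in seconds, so the result does not depend on the unit that difftime picks automatically. The vector stamps rebuilds the example timestamps; using UTC is an assumption made here to keep daylight-saving shifts out of the differences.

```r
# Rebuild the example timestamps (UTC assumed)
a30 <- seq(as.POSIXct("2000-01-01 01:00", tz = "UTC"),
           as.POSIXct("2000-01-10 12:30", tz = "UTC"), by = 1800)
a15 <- seq(as.POSIXct("2000-01-10 12:45", tz = "UTC"),
           as.POSIXct("2000-01-20 23:45", tz = "UTC"), by = 900)
stamps <- c(a30, a15)

# Gaps between consecutive timestamps, in seconds
gap_secs <- diff(as.numeric(stamps))

# diff() is one element shorter than its input, so add 1 to get the row
# on which the first 900-second gap ends
first_15 <- which(gap_secs == 900)[1] + 1
first_15
```

If no 900-second gap exists, which() returns an empty vector and first_15 is NA, which is easy to test for.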