R nested for-loops take 30 days to run

I want to count the number of patients waiting to be seen by a doctor every 15 minutes over a 3.5-year time frame.
I have a first data.frame (dates) with 122880 rows of dates (one every 15 minutes).
I have a second data.frame (episode) with 225000 rows, one per patient, with the time each patient came to the ER, the time they left, and their ID.
I have a third data.frame (care) with the time the doctor saw each patient and the patient's ID.
Here is my code:
for(hour in 1:122880){
  for(patient in 1:nrow(episode)){
    if(episode$begin[patient] < dates[hour] & episode$end[patient] > dates[hour]){
      no_episode = episode$id[patient]
      if(care$begin[care$id == no_episode] > dates[hour]){
        nb_wait = nb_wait + 1
        delay = delay + dates[hour] - episode$begin[patient]
      }
    }
  }
  nb_wait_total = rbind(nb_wait_total, nb_wait)
  delay_total = rbind(delay_total, delay)
  nb_wait = 0
  delay = 0
}
The first loop sets the date and writes the results.
The second loop plus the first if statement searches the episode data.frame for the patients who are in the ER during the current interval.
The second if counts the patients who haven't seen the doctor yet; I also sum how long those patients have been waiting.
It is extremely slow (estimated 30 days to finish), and from the tests I have done, the slowest line is this one:
if(episode$begin[patient] < dates[hour] & episode$end[patient] > dates[hour])
Any idea how to clean up my code so it doesn't take so much time?
I have tried splitting the episode data.frame into 4 pieces, which speeds up the process, but it would still take 7 days!
Thank you!

Update! Went from 30 days to 11 hours, thanks to the comments I had on the post!
Here is the code after modification:
for(hour in 1:122880){
  temp <- episode$id[episode$begin < dates[hour] & episode$end > dates[hour]]
  for(element in seq_along(temp)){
    no_episode = temp[element]
    if(care$begin[care$id == no_episode] > dates[hour]){
      nb_wait = nb_wait + 1
      delay = delay + dates[hour] - episode$begin[episode$id == no_episode]
    }
  }
  nb_wait_total = rbind(nb_wait_total, nb_wait)
  delay_total = rbind(delay_total, delay)
  nb_wait = 0
  delay = 0
}
Getting rid of the first if statement and the 2nd (longest) loop did the trick!
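The remaining inner loop could likely be vectorized away as well. Here is a sketch (untested; it assumes id is a unique key in both episode and care, so each patient matches exactly one care row):
# Sketch only: fully vectorized inner step; assumes "id" is unique in both tables.
for(hour in 1:122880){
  in_er <- episode$begin < dates[hour] & episode$end > dates[hour]
  unseen <- care$id[care$begin > dates[hour]]  # patients whose doctor hasn't arrived yet
  waiting <- episode[in_er & episode$id %in% unseen, ]
  nb_wait_total <- rbind(nb_wait_total, nrow(waiting))
  delay_total <- rbind(delay_total, sum(dates[hour] - waiting$begin))
}
Growing nb_wait_total and delay_total with rbind is itself slow; preallocating two length-122880 vectors and assigning by index should shave off more time.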

Not sure if this does exactly what you need, but it uses data of similar size and calculates the wait counts at each 15 minute interval in the date range in under 1 second.
First, here's some fake data. I assume here that "id" + "date" are unique identifiers in both the "care" and "episode" tables, and that there's a 1:1 match for each. I also assume the patient always arrives before the doctor, and that each episode ends after the doctor's arrival.
library(tidyverse); library(lubridate)
set.seed(42)
epi_n <- 500000 # before filtering; results in ~230k rows
care <- tibble(
  id = sample(1:99999, epi_n, replace = TRUE),
  begin = ymd_h(2010010100) + runif(epi_n, max = 3.5*365*24*60*60),
  date = as_date(begin)) %>%
  filter(hour(begin) >= 7, hour(begin) <= 17) %>%
  distinct(id, date, .keep_all = TRUE) # make "id" and "date" unique identifiers
episode <- care %>%
  transmute(id, date,
            begin = begin - rpois(nrow(care), 30)*60,
            end = begin + rgamma(nrow(care), 1)*600) %>%
  arrange(begin)
Matching the data between the two data sets takes 0.2 sec.
tictoc::tic()
combined <- left_join(
  episode %>% rename("patient_arrival" = "begin"),
  care %>% rename("dr_arrival" = "begin")
)
tictoc::toc()
Counting how many patients had arrived but doctors hadn't at each 15 minute interval takes another 0.2 sec.
Here, I isolate each patient arrival and each doctor arrival; a patient arrival adds one to the wait count at that moment, and a doctor arrival reduces it by one. We can sort those events and take the cumulative sum to get the number of waiting patients at any moment. Finally, I collate with a table like your "dates" table and fill in the wait counts that were in effect prior to each 15 minute interval. Then I keep just those interval rows.
tictoc::tic()
combined %>%
  select(id, patient_arrival, dr_arrival) %>%
  pivot_longer(-id, values_to = "hour") %>%
  mutate(wait_chg = if_else(name == "patient_arrival", 1, -1)) %>%
  arrange(hour) %>%
  mutate(wait_count = cumsum(wait_chg)) %>%
  bind_rows(.,
            tibble(name = "15 minute interval",
                   hour = seq.POSIXt(ymd_h(2010010100),
                                     to = ymd_h(2013070112),
                                     by = "15 min")) %>%
              filter(hour(hour) >= 6, hour(hour) < 20)
  ) %>%
  arrange(hour) %>%
  fill(wait_count) %>%
  replace_na(list(wait_count = 0)) %>%
  filter(name == "15 minute interval") %>%
  select(hour, wait_count)
tictoc::toc()

Related

Is there a way to identify multiple index points in a dataset and isolate a time window around each index point in R?

I have a dataset of patients who were under observation for 72 hours. Patients' oxygen levels were measured every 4 seconds, but some observations had to be removed due to issues with the accuracy of the data. As a result, patients have a varying number of observations.
While patients were observed, they underwent various interventions. The objective of my analysis is to assess whether interventions affected the patient's oxygen levels or not. Therefore, I am doing a comparison of the oxygen levels pre-intervention, during intervention and post-intervention.
While the actual analysis isn't too difficult, I am having a hard time subsetting the data. For example, I want all observations from 300 seconds prior to the start of an intervention through 300 seconds post-intervention. I have to take into account that an individual may have multiple interventions over the course of the observation window, and that there are multiple subjects.
I have provided some sample code below to generate a dataset, but please let me know if I'm missing anything.
library(dplyr)  # needed for %>%, group_by(), mutate()
id <- rep(c(1,2,3), each = 1000)
intervention <- c(rep(0,200), rep(1,10), rep(0,153), rep(0,5), rep(0,284), rep(0,20), rep(0,159), rep(0,23), rep(0,146),
                  rep(0,123), rep(1,23), rep(0,356), rep(1,8), rep(0,234), rep(1,23), rep(0,233),
                  rep(0,345), rep(1,12), rep(0,48), rep(1,15), rep(0,74), rep(1,4), rep(0,233), rep(1,82), rep(0,187))
final <- data.frame(id, intervention)
final <- final %>%
  group_by(id) %>%
  mutate(time = row_number() * 4)
So far, I have tried the method below, but I was only able to isolate the single observations 5 minutes before and 5 minutes after an intervention, not all the observations within those time windows (i.e., I get the single observation 5 minutes prior to the start of an intervention and the single observation 5 minutes after it, but not the observations in between these points).
data <- final %>%  # note: "final4" in the original appears to be a typo for "final"
  filter(intervention == 1) %>%
  mutate(five_mins_after = time + 300,       # 5 mins after intervention start
         five_mins_before = time - 300) %>%  # 5 mins before intervention start
  filter(id == "1")
data2 <- final %>%
  filter(intervention == 0,
         id == "1")
data_after <- data %>%
  dplyr::select(five_mins_after)
data_before <- data %>%
  dplyr::select(five_mins_before)
data3 <- merge(data2, data_after, by.x = "time", by.y = "five_mins_after")
data4 <- merge(data2, data_before, by.x = "time", by.y = "five_mins_before")
final <- final %>%
  dplyr::bind_rows(data3) %>%
  dplyr::bind_rows(data4)
Please let me know if you need any additional information and thanks for your time!
PS: Apologies if I missed anything, first time asking for help here
Here is the answer. Although long, it works fine for gathering the observations from 300 seconds before to 300 seconds after the beginning of an intervention.
Let me know if you'd like further explanation or if I have misunderstood anything.
library(magrittr)
library(tidyverse)
### Sample code
id <- rep(c(1,2,3), each = 1000)
intervention <- c(rep(0,200), rep(1,10), rep(0,153), rep(0,5), rep(0,284), rep(0,20), rep(0,159), rep(0,23), rep(0,146),
                  rep(0,123), rep(1,23), rep(0,356), rep(1,8), rep(0,234), rep(1,23), rep(0,233),
                  rep(0,345), rep(1,12), rep(0,48), rep(1,15), rep(0,74), rep(1,4), rep(0,233), rep(1,82), rep(0,187))
final <- data.frame(id, intervention)
final <- final %>%
  group_by(id) %>%
  mutate(time = row_number() * 4)
### Start of data processing to get wanted observations
# Order by id and time
final %<>% arrange(id, time)
# Loop over the unique ids
obs_to_keep <- list()
for(i in unique(final$id)) {
  # Times with no intervention for this id
  time_zero_intervention <- final %>%
    filter(id == i & intervention == 0) %>%
    select(time)
  # Times 4 seconds after a zero-intervention observation, which could be intervention == 1
  time_plus_4 <- time_zero_intervention$time + 4
  # Keep the times where an intervention actually starts
  starts_of_interventions <- final %>%
    filter(id == i & time %in% time_plus_4) %>%
    filter(intervention == 1)
  # Loop over each time where an intervention starts
  all_times <- list()
  for(n in seq_along(starts_of_interventions$time)) {
    # Gather 300 secs prior and post
    time_300_before <- starts_of_interventions$time[n] - 300
    time_300_after <- starts_of_interventions$time[n] + 300
    # Filter for observations in this interval
    all_times[[n]] <- final %>%
      filter(id == i) %>%
      filter(time >= time_300_before & time <= time_300_after)
  }
  if(length(all_times) == 1){
    obs_to_keep[[i]] <- as.data.frame(all_times)
  } else {
    obs_to_keep[[i]] <- do.call(rbind, all_times)
  }
}
# Make a data frame from the list
df <- do.call(rbind, obs_to_keep)
# Order it by id and time
df %<>% arrange(id, time)

Counting time between two events associated with a value column

Description
I am trying to count the time (in minutes) between two events associated with a value column. The data has a date (o3.date), time (o4.arrival), and value (i1.arrivalEval) column, similar to the sample dataset below. The value column (i1.arrivalEval) represents a waiting time, and there are two events associated with it.
First, if this value exceeds the 90th percentile of all waiting times, overcrowding happens. The overcrowding threshold is defined as the 90th percentile.
Second, once overcrowding happens, it takes time to normalize back to the average waiting time (the average object in the code below) for the day. I am trying to count, for each incident and each day, the time it takes to return to below the average waiting time after overcrowding.
For example, for the day 2013-01-01, the waiting time (i1.arrivalEval) spiked to 179 mins (overcrowding threshold is 176) at 15:30:00. Then it reached below the average waiting time (around 101) at 16:00:00, so the count of minutes is 30. The second overcrowding incident on the same day is at 2013-01-01 22:20:00, and it takes 10 mins to reach below the average waiting time. When one day has multiple overcrowding incidents, each incident is recorded with its time of overcrowding.
If the computation can be done with ddply or other tidyverse-style packages, that would be great.
Desired Output
The output will have four columns: date, datetime of overcrowding, datetime of normalization (i.e., reaching below the average waiting time), and the time between the two events.
Sample Dataset
# sample dataset
library(tidyverse)
library(tibble)
library(lubridate)
rename <- dplyr::rename
select <- dplyr::select
set.seed(10000)
id <- 1:1000
set.seed(10000)
i1.arrivalEval <- sample(10:200, 1000, replace = T)
set.seed(10003)
o4.arrival <- sample(seq(as.POSIXct('2013/01/01'), as.POSIXct('2013/01/30'), by = "10 mins"), 1000)
d <-
  data.frame(id, i1.arrivalEval, o4.arrival) %>%
  mutate(o3.date = as.Date(o4.arrival)) %>%
  arrange(o4.arrival)
average <- mean(d$i1.arrivalEval)
overcrowding <- quantile(d$i1.arrivalEval, .90)
# Goal: For *each day*, *count the number of minutes* until the waiting time reaches
# below average after overcrowding (a spike in waiting time)
#
# For example, for day '2013-01-01', the waiting time (i1.arrivalEval) spiked to 179 mins (> overcrowding 176) at 15:30:00.
# Then, it reached below the average waiting time (around 101) at 16:00:00.
# Thus, the count of minutes will be 30 mins.
# The second incident of overcrowding on the day will be 2013-01-01 22:20:00 and
# it takes 10 mins to reach below the average waiting time.
# When one day has multiple overcrowding incidents, record each incident with the time of overcrowding.
# The output will have four columns:
# date, time of overcrowding, time of normalization, and count of minutes until below average waiting time
Update: updated dataset and the solution
# Updated dataset by Donald Seinen
set.seed(10000)
d <- data.frame(id = 1:1000, i1.arrivalEval = sample(10:200, 1000, replace = T),
                o4.arrival = sample(seq(as.POSIXct('2013/01/01'),
                                        as.POSIXct('2013/01/30'),
                                        by = "10 mins"), 1000)) %>%
  mutate(o3.date = as.Date(o4.arrival)) %>%
  arrange(o4.arrival)
average <- mean(d$i1.arrivalEval)
overcrowding <- quantile(d$i1.arrivalEval, .90)
# Solution (ddply is from plyr, not dplyr; calling plyr::ddply directly
# avoids loading plyr and masking dplyr verbs)
d %>%
  plyr::ddply('o3.date', function(i){
    i %>%
      select(-o3.date) %>%
      mutate(
        o2.overcrowding = ifelse(i1.arrivalEval >= overcrowding, 1, 0),
        o1.incidents = cumsum(o2.overcrowding)
      ) %>%
      filter(o1.incidents > 0) %>%
      mutate(
        o1.slow = ifelse(i1.arrivalEval <= average, 1, 0)
      ) %>%
      filter(o2.overcrowding != 0 | o1.slow != 0) %>%
      arrange(o4.arrival) %>%
      mutate(o4.normalized = lead(o4.arrival, 1L)) %>%
      select(id, i1.arrivalEval, o4.arrival, o4.normalized, everything()) %>%
      filter(o2.overcrowding == 1) %>%
      mutate(o1.count = difftime(o4.normalized, o4.arrival, units = 'mins')) %>%
      select(-o2.overcrowding, -o1.incidents, -i1.arrivalEval, -o1.slow)
  })

Conditionally calculating average time between events by group in R

I am working with a call log dataset from a telephone hotline service. There are three call outcomes: Answered, Abandoned, and Engaged. I am trying to find the average time taken by each caller to contact the hotline again after abandoning a previous call. The time difference can be expressed in seconds, minutes, hours, or days, and I would like all four if possible.
Here is some mock data with the variables I am working with:
library(wakefield)  # for generating the CallOutcome variable
library(dplyr)
library(stringi)
library(Pareto)
library(uuid)
n_users <- 1300
n_rows <- 365000
set.seed(1)
Date <- seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "1 day")
Date <- sample(rep(Date, each = 1000), replace = T)
u <- runif(length(Date), 0, 60*60*12)  # "noise" to add to each day's midnight timepoint
CallDateTime <- as.POSIXlt(u, origin = paste(Date, "00:00:00"))
CallDateTime
CallOutcome <- r_sample_factor(x = c("Answered", "Abandoned", "Engaged"), n = length(Date))
CallOutcome
data <- data.frame(Date, CallDateTime, CallOutcome)
relative_probs <- rPareto(n = n_users, t = 1, alpha = 0.3, truncation = 500)
unique_ids <- UUIDgenerate(n = n_users)
data$CallerId <- sample(unique_ids, size = n_rows, prob = relative_probs, replace = TRUE)
data <- data %>% arrange(CallDateTime)
head(data)
So to reiterate, if a caller abandons their call (represented by "Abandoned" in the CallOutcome column), I would like to know the average time taken for the caller to make another call to the service, in the four time units I have mentioned. Any pointers on how I can achieve this would be great :)
Keep rows in the data where, for each ID, the current row is "Abandoned" and the next row is not "Abandoned". Find the difference in time between every 2 rows to get the time required for the caller to call the service again after an abandoned call, then take the average of those durations to get the average time.
library(dplyr)
data %>%
  # Test the answer on a smaller subset first if needed:
  # slice(1:1000) %>%
  arrange(CallerId, CallDateTime) %>%
  group_by(CallerId) %>%
  filter(CallOutcome == 'Abandoned' & dplyr::lead(CallOutcome) != 'Abandoned' |
           CallOutcome != 'Abandoned' & dplyr::lag(CallOutcome) == 'Abandoned') %>%
  mutate(group = rep(row_number(), each = 2, length.out = n())) %>%
  group_by(group, .add = TRUE) %>%
  summarise(avg_sec = difftime(CallDateTime[2], CallDateTime[1], units = 'secs')) %>%
  mutate(avg_sec = as.numeric(mean(avg_sec)),
         avg_min = avg_sec/60,
         avg_hour = avg_min/60,
         avg_day = avg_hour/24) -> result
result
First, I would create the lead variable (basically, calculate the "next" value by group). Then it's as easy as using whatever unit you want for difftime. A density plot can help you analyze these differences, as shown below.
data <- data %>%
  group_by(CallerId) %>%
  mutate(CallDateTime_Next = lead(CallDateTime)) %>%
  ungroup() %>%
  mutate(
    diff_days = difftime(CallDateTime_Next, CallDateTime, units = 'days'),
    diff_hours = difftime(CallDateTime_Next, CallDateTime, units = 'hours'),
    diff_mins = difftime(CallDateTime_Next, CallDateTime, units = 'mins'),
    diff_secs = difftime(CallDateTime_Next, CallDateTime, units = 'secs')
  )
data %>%
  filter(CallOutcome == 'Abandoned') %>%
  ggplot() +
  geom_density(aes(x = diff_days))

Check continuity in streamflow records in R

I want to check whether there is a discontinuity in years in the records of my streamflow data. I created a loop to select the stations with more than 30 years of records, and I also want to select only the ones with continuous data (meaning continuity in years: 1998, 1999, 2000, 2001, but not a jump from 1998 to 2000).
How can I make a loop to select for continuity in the years, or even months? At the moment this is my code, which works as expected:
Daily_Streamflow <- Daily_Streamflow %>% mutate(month = month(Date))
Daily_Streamflow <- Daily_Streamflow %>% mutate(year = year(Date))  # the if below needs a year column
if (length(unique(Daily_Streamflow$year)) > 30) {
  name <- paste("DSF", siteNumber[i], sep = "_")
  name2 <- paste("Site_Info", siteNumber[i], sep = "_")
  assign(name, value = Daily_Streamflow)
  assign(name2, value = siteINFO)
  #print(name)
}
How can I add the continuity condition to the if?
My desired output is to save as data frames in R only the records of streamflow gages with a period of record longer than 30 years (length(unique(Daily_Streamflow$year)) > 30), with the added condition that there is no discontinuity in the records of more than 3 months. What I am missing is this second continuity condition.
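Here is a sketch of how that condition might be expressed (untested; it assumes Daily_Streamflow$Date is a Date column and treats any gap between consecutive observations of more than 92 days, roughly 3 months, as a discontinuity):
# Sketch only: continuity condition. Assumes Daily_Streamflow$Date is a Date column;
# "continuous" means no gap of more than ~3 months (92 days) between observations.
gaps <- as.numeric(diff(sort(Daily_Streamflow$Date)), units = "days")
is_continuous <- length(gaps) == 0 || max(gaps) <= 92
if (length(unique(Daily_Streamflow$year)) > 30 && is_continuous) {
  name <- paste("DSF", siteNumber[i], sep = "_")
  name2 <- paste("Site_Info", siteNumber[i], sep = "_")
  assign(name, value = Daily_Streamflow)
  assign(name2, value = siteINFO)
}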

Plotting frequency of occurrences based on start/end times in R

I have a "trips" dataset that includes a unique trip id, and a start and end time (the specific hour and minute) of the trips. These trips were all taken on the same day. I am trying to determine the number of cars on the road at any given time and plot it as a line graph using ggplot in R. In other words, a car is "on the road" at any time in between its start and end time.
The most similar example I can find uses the following structure:
yearly_counts <- trips %>%
  count(year, trip_id)
ggplot(data = yearly_counts, mapping = aes(x = year, y = n)) +
  geom_line()
Would the best approach be to modify this structure to have a "minutesByHour_count" variable with a count for every minute of every hour? That seems inefficient to me, and it still doesn't solve the problem of deriving the counts from the start/end times.
Is there any easier way to do this?
Here's an example based on counting each start as an additional car, and each end as a reduction in the count:
library(tidyverse)
df %>%
  gather(type, time, c(start_hour, end_hour)) %>%
  mutate(count_chg = if_else(type == "start_hour", 1, -1)) %>%
  arrange(time) %>%
  mutate(car_count = cumsum(count_chg)) %>%
  ggplot(aes(time, car_count)) +
  geom_step()
Sample data:
df <- data.frame(
  uniqueID = 1:60,
  start_hour = seq(8, 12, length.out = 60),
  dur_hour = 0.05*1:60
)
df$end_hour <- df$start_hour + df$dur_hour
df$dur_hour <- NULL
