Allocate time series interval to a column based on count of rows - r

I need to assign a time interval to each row based on the "cnt_rows" column, grouped by "Name". For example, a count of around
96 means a 15-minute interval; if the count is 94, the sequence should stop at
11:15 PM (based on the number of rows), and if it is exactly 96 it should end at 11:45 PM each day.
The same applies to the 5-minute interval. The interval must not run past the end of the day.
cnt_rows = c("94","94",".",".","94","286","286",".",".",".","286","96","96",".",".","96")
Name = c("Alan","Alan",".",".","Alan","Steve","Steve",".",".",".","Steve","Mike","Mike",".",".","Mike")
Values = c("10","10",".",".","45","91","35",".",".",".","46","34","5",".",".","34")
Input Table
df = data.frame(cnt_rows,Name,Values)
Output Table
dt = c("2019-12-01 00:00:00","2019-12-01 00:15:00",".",".","2019-12-01 23:15:00","2019-12-01 00:00:00","2019-12-01 00:05:00",".",".",".","2019-12-01 23:45:00","2019-12-01 00:00:00","2019-12-01 00:15:00",".",".","2019-12-01 23:45:00")
df_out = data.frame(cnt_rows,Name,Values,dt)
Thanks in advance.

Maybe you can try:
library(dplyr)
date <- as.POSIXct('2019-12-01')
df %>%
  mutate(breaks = ifelse(cnt_rows %in% c(94, 96), '15 min', '5 min')) %>%
  group_by(Name) %>%
  mutate(dt = seq(date, by = first(breaks), length.out = n()))
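As a rough sanity check (a sketch assuming the full data rather than the "." placeholder rows shown in the example), the last timestamp per group should land where the question expects: 94 rows at 15 minutes ends 93 x 15 = 1395 minutes after midnight, i.e. 23:15.
df %>%
  mutate(breaks = ifelse(cnt_rows %in% c(94, 96), '15 min', '5 min')) %>%
  group_by(Name) %>%
  mutate(dt = seq(date, by = first(breaks), length.out = n())) %>%
  summarise(last_dt = max(dt))
# Expected:
#   Alan  2019-12-01 23:15:00  (94 rows  x 15 min)
#   Mike  2019-12-01 23:45:00  (96 rows  x 15 min)
#   Steve 2019-12-01 23:45:00  (286 rows x  5 min)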

Related

R nested for-loops takes 30 days to run

I want to count the number of patients waiting to be seen by a doctor at each 15-minute interval over a 3.5-year time frame.
I have a first data.frame (dates) with 122880 rows of dates (one per 15 minutes).
I have a second data.frame (episode) with 225000 rows, holding each patient's ID, the time they came to the ER, and the time they left.
I have a third data.frame (care) with the time the doctor saw each patient and the patient's ID.
Here is my code:
for(hour in 1:122880){
  for(patient in 1:nrow(episode)){
    if(episode$begin[patient] < dates[hour] & episode$end[patient] > dates[hour]){
      no_episode = episode$id[patient]
      if(care$begin[care$id == no_episode] > dates[hour]){
        nb_wait = nb_wait + 1
        delay = delay + dates[hour] - episode$begin[patient]
      }
    }
  }
  nb_wait_total = rbind(nb_wait_total, nb_wait)
  delay_total = rbind(delay_total, delay)
  nb_wait = 0
  delay = 0
}
The first loop is to set the date and write the results.
The second loop plus the first if statement searches the episode data.frame for the patients who are in the ER during the time frame.
The second if counts the patients who haven't seen the doctor yet. I also sum up how long the patients have been waiting.
It is extremely slow (estimated 30 days), and from the tests I have done, the slowest line is this one:
if(episode$begin[patient] < dates[hour] & episode$end[patient] > dates[hour])
Any idea how to clean my code so it doesn't take so much time?
I have tried cutting the episode data.frame into 4, which speeds up the process but would still take 7 days!
Thank you!
Update! Went from 30 days to 11 hours, thanks to the comments I had on the post!
Here is the code after modification:
for(hour in 1:122880){
  temp <- episode$id[episode$begin < dates[hour] & episode$end > dates[hour]]
  for(element in 1:length(temp)){
    no_episode = temp[element]
    if(care$begin[care$id == no_episode] > dates[hour]){
      nb_wait = nb_wait + 1
      delay = delay + dates[hour] - episode$begin[episode$id == no_episode]
    }
  }
  nb_wait_total = rbind(nb_wait_total, nb_wait)
  delay_total = rbind(delay_total, delay)
  nb_wait = 0
  delay = 0
}
Getting rid of the first if statement and the 2nd (longest) loop did the trick!
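For reference, the remaining inner loop can be vectorized as well. Here is a minimal sketch, assuming (as the original code does) exactly one care row per episode id; it also preallocates the result vectors instead of growing them with rbind():
nb_wait_total <- numeric(122880)
delay_total <- numeric(122880)
for(hour in 1:122880){
  # All patients present in the ER at this instant
  in_er <- episode$begin < dates[hour] & episode$end > dates[hour]
  # Doctor arrival time for each present patient (one care row per id assumed)
  seen <- care$begin[match(episode$id[in_er], care$id)]
  waiting <- seen > dates[hour]
  nb_wait_total[hour] <- sum(waiting)
  delay_total[hour] <- sum(as.numeric(dates[hour] - episode$begin[in_er][waiting], units = "mins"))
}
Note the delay is converted to minutes explicitly here; the original rbind() version left the difftime units implicit.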
Not sure if this does exactly what you need, but it uses data of similar size and calculates the wait counts at each 15-minute interval in the date range in under 1 second.
First, here's some fake data. I assume here that "id" + "date" together form a unique identifier in both the "care" and "episode" tables, with a 1:1 match between them. I also assume the patient always arrives before the doctor, and the episode ends after the doctor arrives.
library(tidyverse); library(lubridate)
set.seed(42)
epi_n <- 500000 # before filtering; results in ~230k rows
care <- tibble(
  id = sample(1:99999, epi_n, replace = TRUE),
  begin = ymd_h(2010010100) + runif(epi_n, max = 3.5*365*24*60*60),
  date = as_date(begin)) %>%
  filter(hour(begin) >= 7, hour(begin) <= 17) %>%
  distinct(id, date, .keep_all = TRUE) # make "id" and "date" unique identifiers
episode <- care %>%
  transmute(id, date,
            begin = begin - rpois(nrow(care), 30)*60,
            end = begin + rgamma(nrow(care), 1)*600) %>%
  arrange(begin)
Matching the data between the two data sets takes 0.2 sec.
tictoc::tic()
combined <- left_join(
  episode %>% rename("patient_arrival" = "begin"),
  care %>% rename("dr_arrival" = "begin")
)
tictoc::toc()
Counting how many patients had arrived but hadn't yet been seen at each 15-minute interval takes another 0.2 sec.
Here, I isolate each patient arrival and each doctor arrival; a patient arrival adds one to the wait count at that moment, and a doctor arrival reduces it by one. We can sort those events and take the cumulative sum to get the number of waiting patients over time. Finally, I collate in a table like your "dates" table, fill in the wait counts that were in effect prior to each 15-minute interval, and show just those rows.
tictoc::tic()
combined %>%
  select(id, patient_arrival, dr_arrival) %>%
  pivot_longer(-id, values_to = "hour") %>%
  mutate(wait_chg = if_else(name == "patient_arrival", 1, -1)) %>%
  arrange(hour) %>%
  mutate(wait_count = cumsum(wait_chg)) %>%
  bind_rows(.,
            tibble(name = "15 minute interval",
                   hour = seq.POSIXt(ymd_h(2010010100),
                                     to = ymd_h(2013070112),
                                     by = "15 min")) %>%
              filter(hour(hour) >= 6, hour(hour) < 20)
  ) %>%
  arrange(hour) %>%
  fill(wait_count) %>%
  replace_na(list(wait_count = 0)) %>%
  filter(name == "15 minute interval") %>%
  select(hour, wait_count)
tictoc::toc()

Is it possible in R to split my date-time values into 5 different columns (Year, month, date, hour, minute)?

I am really new to R and this is probably a really basic question: let's say I have a dataset with a column of date values in the format "y-m-d H:M:S", stored as a factor.
How do I split that one column into five?
Given example:
x <- as.factor(c("2018-01-03 12:34:32.92382", "2018-01-03 12:50:40.00040"))
x <- as_datetime(x) # convert to a date-time (POSIXct)
x <- x %>%
  dplyr::mutate(year = lubridate::year(x),
                month = lubridate::month(x),
                day = lubridate::day(x),
                hour = lubridate::hour(x),
                minute = lubridate::minute(x),
                second = lubridate::second(x))
I get the error: no applicable method for 'mutate' applied to an object of class "c('POSIXct', 'POSIXt')".
Convert it into a data frame first, then the mutate part will work:
x %>%
  as.data.frame() %>%
  rename(x = '.') %>%
  dplyr::mutate(year = lubridate::year(x),
                month = lubridate::month(x),
                day = lubridate::day(x),
                hour = lubridate::hour(x),
                minute = lubridate::minute(x),
                second = lubridate::second(x))
                    x year month day hour minute   second
1 2018-01-03 12:34:32 2018     1   3   12     34 32.92382
2 2018-01-03 12:50:40 2018     1   3   12     50 40.00040
You could also make your mutate a little bit cleaner utilizing the power of across:
library(dplyr)
library(lubridate)
x %>%
  data.frame(date = .) %>%
  mutate(across(date,
                lst(year, month, day, hour, minute, second),
                .names = "{.fn}"))
(Note: across() takes a list of functions rather than the defunct funs(); lst() self-names the functions, which .names = "{.fn}" then uses for the new column names.)

Sequence a group of dates in R

I wish to generate some Tidy data.
26 companies are observed everyday for 10 days.
Each day a value is recorded.
The first day is: 2020/1/1
How do I create a list of dates so that the first 26 rows of the date column of the data frame are "2020/1/1" (year, month, day), the next 26 rows are "2020/1/2", and so on?
Here is the data frame without the date column:
library(tidyverse)
set.seed(33)
date_chunk <- rep(as.Date("2020/1/1"), 26)
# Tidy data. 10 sequential days starting 2020/1/1
df <- tibble(
  company = rep(letters, 10),
  value = sample(0:5, 260, replace = TRUE),
  color = "grey"
)
You can try this:
rep(seq(as.Date("2020-01-01"), as.Date("2020-01-10"), 1), each = 26)
This returns a vector of dates from 2020-01-01 to 2020-01-10 in which each date is repeated 26 times.
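To attach these dates to the df built above, a quick sketch (this assumes the rows are ordered as all 26 companies for day 1, then all 26 for day 2, and so on, which is what rep(letters, 10) produces):
library(dplyr)
df %>%
  mutate(date = rep(seq(as.Date("2020-01-01"), as.Date("2020-01-10"), 1), each = 26))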
Alternatively, for each company we can add row_number() - 1 to the first date_chunk value to get an incremental sequence of dates.
library(dplyr)
df %>%
  group_by(company) %>%
  mutate(date = first(date_chunk) + row_number() - 1)

Finding fraction of time doing an activity per hour using lubridate

I have a dataset of start times and end times over an entire year for a certain activity. I want to break up the day into 24 intervals, each 1 hour in length, and then calculate and plot the fraction of time the person spent doing the activity per hour. I already converted the times using lubridate's mdy_hm().
Suppose these sample data in dataframe df:
start_time     end_time       duration
8/14/15 23:36  8/15/15 5:38   359
8/15/15 14:50  8/15/15 15:25  35
8/15/15 22:43  8/16/15 2:41   236
8/16/15 3:12   8/16/15 6:16   181
8/16/15 16:52  8/16/15 17:58  66
8/16/15 23:21  8/16/15 23:47  26
8/17/15 0:04   8/17/15 2:02   118
8/17/15 8:31   8/17/15 9:45   74
8/17/15 11:06  8/17/15 13:46  159
How can I find the fraction of the activity per hour over the whole year? I will then plot the result. I have tried extracting the hour with hour(), using group_by() on the time variables, and using the mean function within summarize() on duration, but I'm unsure of the logic.
Thank you for any help.
The group_by(...) %>% summarise(...) pattern works best when your data is in 'tidy' format, where each row is one observation of the data you want to aggregate over. In your case, an observation is a minute of the activity within a given hour and date. We can get there by generating those minute-by-minute observations as a list column, using tidyr::unnest() to expand the generated data into a long data frame, and then counting over that data frame:
library(dplyr)
library(lubridate)
library(tidyr)
library(ggplot2)
df <-
  tibble(
    start_time = c("8/14/15 23:36","8/15/15 14:50","8/15/15 22:43",
                   "8/16/15 3:12","8/16/15 16:52","8/16/15 23:21",
                   "8/17/15 0:04","8/17/15 8:31","8/17/15 11:06"),
    end_time = c("8/15/15 5:38","8/15/15 15:25","8/16/15 2:41",
                 "8/16/15 6:16","8/16/15 17:58","8/16/15 23:47",
                 "8/17/15 2:02","8/17/15 9:45","8/17/15 13:46")
  ) %>%
  mutate(across(c(start_time, end_time), mdy_hm))
worked_hours <- df %>%
  # First, make a long df with one minute per row
  group_by(start_time, end_time) %>%
  mutate(mins = list(tibble(
    min = seq(from = start_time, to = end_time - minutes(1), by = as.difftime(minutes(1)))
  ))) %>%
  unnest(mins) %>%
  ungroup() %>%
  # Aggregate over the long df (count rows, i.e. minutes, per date and hour)
  select(min) %>%
  mutate(date = date(min), hour = factor(hour(min), levels = 0:23)) %>%
  group_by(date, hour) %>%
  tally() %>%
  # Calculate the percentage of each hour spent on the activity
  mutate(prop = n / 60 * 100)
worked_hours %>%
  # Use tidyr::complete to fill in unobserved values
  complete(date, hour, fill = list(n = 0, prop = 0)) %>%
  ggplot(aes(x = hour, y = prop)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ date, ncol = 1)
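For a full year of data, the minute-by-minute expansion can get large (roughly half a million rows). A hypothetical alternative, not from the answer above, is to split each activity interval across the hour buckets it touches and sum the overlap directly; a sketch using the same df:
library(dplyr)
library(lubridate)
library(purrr)
library(tidyr)
df %>%
  # One row per hour bucket that each activity interval touches
  mutate(bucket = map2(floor_date(start_time, "hour"),
                       floor_date(end_time - seconds(1), "hour"),
                       ~ seq(.x, .y, by = "hour"))) %>%
  unnest(bucket) %>%
  # Overlap of the interval with each one-hour bucket, in minutes
  mutate(mins = as.numeric(pmin(end_time, bucket + hours(1)) - pmax(start_time, bucket),
                           units = "mins")) %>%
  group_by(date = date(bucket), hour = hour(bucket)) %>%
  summarise(prop = sum(mins) / 60 * 100, .groups = "drop")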

How to filter a dataset by the time stamp

I'm working with some bird GPS tracking data, and I would like to exclude points based on the time stamp.
Some background information- the GPS loggers track each bird for just over 24 hours, starting in the evening, and continuing through the night and the following day. What I would like to do is exclude points taken after 9:30pm on the day AFTER deployment (so removing points from the very end of the track).
As an R novice, I'm struggling because the deployment dates differ for each bird, so I can't simply use subset() for a specific date and time.
An example of my dataframe (df):
BirdID  x         y        Datetime
15K12   492719.9  5634805  2015-06-23 18:25:00
15K12   492491.5  5635018  2015-06-23 18:27:00
15K70   455979.1  5653581  2015-06-24 19:54:00
15K70   456040.9  5653668  2015-06-24 19:59:00
So, pretending these points represent the start of the GPS track for each animal, I would like to remove points after 9:30 pm on June 24 for bird 15K12, and after 9:30 on June 25 for bird 15K70.
Any ideas?
First, check whether df$Datetime is a date-time variable:
class(df$Datetime)
If it's not, you can convert it with lubridate:
df$Datetime <- ymd_hms(df$Datetime)
Then use mutate to create a new variable, newdate, which takes the earliest date in each bird's data and sets the cutoff to 21:30:00 on the following day.
Then filter the Datetime column against newdate, keeping only the observations recorded before the cutoff.
library(dplyr); library(lubridate)
df %>%
  group_by(BirdID) %>%
  mutate(newdate = as.POSIXct(date(min(Datetime)) + days(1) + hours(21) + minutes(30))) %>%
  filter(Datetime < newdate)
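As a quick check, applying the same cutoff rule to just the four sample rows from the question gives the dates it asks for (a sketch rebuilding those rows by hand):
df_small <- data.frame(
  BirdID = c("15K12", "15K12", "15K70", "15K70"),
  Datetime = ymd_hms(c("2015-06-23 18:25:00", "2015-06-23 18:27:00",
                       "2015-06-24 19:54:00", "2015-06-24 19:59:00"))
)
df_small %>%
  group_by(BirdID) %>%
  summarise(cutoff = as.POSIXct(date(min(Datetime)) + days(1) + hours(21) + minutes(30)))
# BirdID  cutoff
# 15K12   2015-06-24 21:30:00   (9:30 pm the day after deployment)
# 15K70   2015-06-25 21:30:00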
Here is a reproducible example:
library(dplyr); library(lubridate)
set.seed(1)
# Create a data frame (1000 observations)
BirdID <- paste(rep(floor(runif(250, 1, 20)), 4),
                rep("k", 1000),
                rep(floor(runif(250, 1, 40)), 4), sep = "")
x <- rnorm(1000, mean = 47000, sd = 2000)
y <- rnorm(1000, mean = 5650000, sd = 300000)
Datetime <- as.POSIXct(rnorm(1000, mean = as.numeric(as.POSIXct("2015-06-23 18:25:00")), sd = 99999),
                       tz = "GMT", origin = "1970-01-01")
df <- data.frame(BirdID, x, y, Datetime, stringsAsFactors = FALSE)
# Filter the data frame by the specified cutoff
df_filtered <- df %>%
  group_by(BirdID) %>%
  mutate(newdate = as.POSIXct(date(min(Datetime)) + days(1) + hours(21) + minutes(30))) %>%
  filter(Datetime < newdate)
This should solve the problem.
