How do I write this code? (hour() is from the lubridate package.)
Objective: if the hour part of PICK_DATE is 16:00 or later, ADJ_PICK_DATE should be 03:00 the next day. If the hour part of PICK_DATE is earlier than 03:00, ADJ_PICK_DATE should be 03:00 the same day. The problem is that when no change is needed, i.e. when the hour part of PICK_DATE is between 03:00 and 16:00, the code still adds 3 hours to PICK_DATE.
x$PICK_TIME <- cut(hour(x$PICK_DATE), c(-1, 2, 15, 24), c("EARLY", "OKAY", "LATE"))
x$ADJ_PICK_DATE <- ifelse(x$PICK_TIME=="EARLY",
as.POSIXct(paste(format(x$PICK_DATE, "%d-%b-%Y"), "03:00"),
format="%d-%b-%Y %H:%M"), x$PICK_DATE)
x$ADJ_PICK_DATE <- ifelse(x$PICK_TIME=="LATE",
as.POSIXct(paste(format(x$PICK_DATE+86400, "%d-%b-%Y"),
"03:00"), format="%d-%b-%Y %H:%M"),
x$ADJ_PICK_DATE)
x$ADJ_PICK_DATE <- as.POSIXct(x$ADJ_PICK_DATE, origin = "1970-01-01")
Help please.
Sample data:
PICK_DATE SHIP_DATE
01-APR-2017 00:51 02-APR-2017 06:55 AM
01-APR-2017 00:51 02-APR-2017 12:11 PM
01-APR-2017 00:51 02-APR-2017 12:11 PM
01-APR-2017 00:51 02-APR-2017 09:39 AM
Here is a simple, reproducible example. I had to make up some sample data, based on an earlier question you asked. I suggest reading up on dplyr and lubridate, as they will help you with manipulating dates.
EDIT: Updated to work with end-of-month dates.
library(lubridate)
library(dplyr)
df <- data.frame(pick_date = c("01-APR-2017 00:51", "02-APR-2017 08:53", "15-APR-2017 16:12", "23-APR-2017 02:04", "30-APR-2017 20:08"), ship_date = c("05-APR-2017 06:55", "09-APR-2017 12:11", "30-APR-2017 13:11", "02-MAY-2017 15:16", "05-MAY-2017 09:57"))
df %>%
mutate(pick_date = dmy_hm(pick_date)) %>%
mutate(ship_date = dmy_hm(ship_date)) %>%
mutate(pick_time = case_when(
hour(pick_date) < 3 ~ "early",
hour(pick_date) >= 16 ~ "late",
TRUE ~ "okay")
) %>%
mutate(new_pick_time = case_when(
pick_time == "early" ~ hms(hours(3)),
pick_time == "late" ~ hms(hours(3)),
TRUE ~ hms(paste0(hour(pick_date), "H ", minute(pick_date), "M ", second(pick_date), "S")))
) %>%
mutate(temp_pick_date = case_when(
pick_time == "early" ~ pick_date,
pick_time == "late" ~ pick_date + days(1),
TRUE ~ pick_date)
) %>%
mutate(new_pick_date = make_datetime(year(temp_pick_date), month(temp_pick_date), day(temp_pick_date), hour(new_pick_time), minute(new_pick_time), second(new_pick_time))) %>%
select(-new_pick_time, -temp_pick_date)
This returns
pick_date ship_date pick_time new_pick_date
1 2017-04-01 00:51:00 2017-04-05 06:55:00 early 2017-04-01 03:00:00
2 2017-04-02 08:53:00 2017-04-09 12:11:00 okay 2017-04-02 08:53:00
3 2017-04-15 16:12:00 2017-04-30 13:11:00 late 2017-04-16 03:00:00
4 2017-04-23 02:04:00 2017-05-02 15:16:00 early 2017-04-23 03:00:00
5 2017-04-30 20:08:00 2017-05-05 09:57:00 late 2017-05-01 03:00:00
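A more compact variant of the same idea, offered as a sketch rather than a drop-in replacement. Base ifelse() drops the POSIXct class, which is why the code in the question needs the final as.POSIXct(..., origin = ...) call, whereas case_when() keeps it. The data frame name picks, the sample values and the use of floor_date() are assumptions made for illustration.

```r
library(dplyr)
library(lubridate)

# Hypothetical sample data, parsed as UTC
picks <- data.frame(
  pick_date = ymd_hm(c("2017-04-01 00:51", "2017-04-02 08:53",
                       "2017-04-15 16:12", "2017-04-30 20:08"))
)

picks <- picks %>%
  mutate(new_pick_date = case_when(
    # before 03:00: move forward to 03:00 on the same day
    hour(pick_date) < 3 ~ floor_date(pick_date, "day") + hours(3),
    # 16:00 or later: move to 03:00 on the next day
    hour(pick_date) >= 16 ~ floor_date(pick_date, "day") + days(1) + hours(3),
    # otherwise leave the timestamp untouched
    TRUE ~ pick_date
  ))
```

Because every branch returns a POSIXct, no intermediate columns or origin conversion are needed.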
So it sounds like you just need to do two different arithmetic operations, conditional on the hour of a date time?
The simplest way I can think of to access the hour component is to store the time as a POSIXlt. The "lt" stands for "local time", and the object is stored as a list, which lets you access the different time components as named elements.
Like this:
> time <- as.POSIXlt('2017-07-29 15:12:01')
> time
[1] "2017-07-29 15:12:01 EDT"
> time$hour
[1] 15
So you could write a function that does the operation you desire, and feed it your date column. Hard for me to take it further because I don't quite understand the question, but here's a skeleton:
ComputeDifference <- function(time) {
if (time$hour < 3) {
# code to count orders between 0 and 3 "from same day 3:00"
}
if (time$hour >= 16) {
# code to consider late orders
}
}
If you throw in sample data and refine the question, maybe I can take a more thorough crack at this.
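One way the skeleton could be filled in, as a minimal base R sketch. It assumes the rule from the question (before 03:00 becomes 03:00 the same day, 16:00 or later becomes 03:00 the next day). The function name adjust_pick_date and the sample values are hypothetical.

```r
# Hypothetical helper: snap times outside 03:00-16:00 to 03:00
adjust_pick_date <- function(x) {
  lt <- as.POSIXlt(x)
  early <- lt$hour < 3
  late <- lt$hour >= 16
  snap <- early | late
  # late rows roll over to the next day; as.POSIXct() normalises
  # out-of-range days such as "April 31" to the following month
  lt$mday <- lt$mday + late
  lt$hour[snap] <- 3L
  lt$min[snap] <- 0L
  lt$sec[snap] <- 0
  as.POSIXct(lt)
}

pick <- as.POSIXct(c("2017-04-01 00:51", "2017-04-02 08:53",
                     "2017-04-15 16:12", "2017-04-30 20:08"), tz = "UTC")
adjusted <- adjust_pick_date(pick)
```

The function is vectorised, so it can be applied to a whole column at once, e.g. x$ADJ_PICK_DATE <- adjust_pick_date(x$PICK_DATE).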
I am looking at how extubation rates in an intensive care unit have changed over the course of the pandemic.
I have a data set which has hourly timestamps next to a category of airway types which simplified looks like this:
Time
AirwayStatus
2020/01/01 00:00
ETT/LMA
2020/01/01 01:00
ETT/LMA
2020/01/01 02:00
Own Airway
2020/01/01 03:00
Own Airway
2020/01/01 04:00
ETT/LMA
What I am effectively looking to do is find the times when the patient is extubated (ETT/LMA turns to Own Airway) and also when intubated (Own Airway to ETT/LMA). Eventually I want to be able to see how often an extubated patient has to be re-intubated.
Re-intubation within 48 hours is known as a failed extubation, and we expect to see vastly different data during the pandemic compared to before.
My idea so far is to create a separate column with the AirwayStatus of the prior hour and then count the rows where the two differ. This seems unsophisticated though, and I was hoping some of you clever people may have a nicer option.
Thank you in advance
Using dplyr from tidyverse:
Supposing you have a dataframe (or tibble) df and patient(?) id ID:
library(dplyr)
df <- tibble(
ID = c(1,1,1,1,1),
Time = c("2020/01/01 00:00", "2020/01/01 01:00", "2020/01/01 02:00", "2020/01/01 03:00", "2020/01/01 04:00"),
AirwayStatus = c("ETT/LMA", "ETT/LMA", "Own Airway", "Own Airway", "ETT/LMA"))
df <- df %>%
group_by(ID) %>%
arrange(Time) %>%
mutate(
Extubated = ifelse(AirwayStatus == "Own Airway" & lag(AirwayStatus) == "ETT/LMA", TRUE, FALSE),
Intubated = ifelse(AirwayStatus == "ETT/LMA" & lag(AirwayStatus) == "Own Airway", TRUE, FALSE))
result <- df %>%
summarise_at(c("Extubated", "Intubated"), sum, na.rm = TRUE)
result
Result:
# A tibble: 1 x 3
ID Extubated Intubated
<dbl> <int> <int>
1 1 1 1
This allows grouping by patient id which you will most likely do.
It's a bit longer than Oliver's answer though.
Your idea is the right way to go. You can skip storing the intermediate results, but they have to be computed anyway. Let's assume your data is called df; then we could do something like
# Read the table (could not get read.table to work)
library(data.table)
df <- fread("Time AirwayStatus
2020/01/01 00:00 ETT/LMA
2020/01/01 01:00 ETT/LMA
2020/01/01 02:00 Own Airway
2020/01/01 03:00 Own Airway
2020/01/01 04:00 ETT/LMA")
setDF(df)
# Convert time to a date format
df$Time <- as.POSIXct(df$Time)
n <- nrow(df)
# Find changes
df$change <- with(df, c(FALSE, AirwayStatus[seq(n - 1)] != AirwayStatus[seq(2, n)]))
# estimate the length of time since last change
df$hours_between_change[df$change] <- with(df, diff(c(NA, Time[change])) / 3600)
df
Time AirwayStatus change hours_between_change
1 2020-01-01 00:00:00 ETT/LMA FALSE NA
2 2020-01-01 01:00:00 ETT/LMA FALSE NA
3 2020-01-01 02:00:00 Own Airway TRUE NA
4 2020-01-01 03:00:00 Own Airway FALSE NA
5 2020-01-01 04:00:00 ETT/LMA TRUE 2
Note I store the intermediate results here. We likely could make it a bit more readable using dplyr but this does the job.
Here is an approach using dplyr.
First, you might want to consider a separate column to indicate an intubation or extubation "event." If someone is "Own Airway" and then the previous row has "ETT/LMA", we assume the person has been extubated. The opposite can also be determined for intubation.
Then, you can filter and only focus on these events.
For each event, you may want to check whether the event is an "Extubation", the following event is an "Intubation", and the time difference is < 48 hrs. If so, the extubation is actually a "failed extubation."
This may handle situations where someone has data that begins with "Own Airway" and gets intubated (if no extubation event, then cannot be failed extubation). It will also keep extubation events where the time difference is > 48 hrs as well.
library(tidyverse)
df %>%
mutate(Event = case_when(
AirwayStatus == "Own Airway" & lag(AirwayStatus) == "ETT/LMA" ~ "Extubation",
AirwayStatus == "ETT/LMA" & lag(AirwayStatus) == "Own Airway" ~ "Intubation",
TRUE ~ NA_character_)
) %>%
filter(!is.na(Event)) %>%
mutate(Event = ifelse(
Event == "Extubation" & lead(Event) == "Intubation" & (as.numeric(difftime(lead(Time), Time, units = "hours")) < 48),
"Failed Extubation",
Event
))
Output
Time AirwayStatus Event
1 2020-01-01 02:00:00 Own Airway Failed Extubation
2 2020-01-01 04:00:00 ETT/LMA Intubation
Data
df <- structure(list(Time = structure(c(1577858400, 1577862000, 1577865600,
1577869200, 1577872800), class = c("POSIXct", "POSIXt"), tzone = ""),
AirwayStatus = c("ETT/LMA", "ETT/LMA", "Own Airway", "Own Airway",
"ETT/LMA"), Event = c(NA, NA, "Extubated", NA, "Intubated"
)), row.names = c(NA, -5L), class = "data.frame")
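The same logic can be applied per patient. The following is a sketch under assumptions: the ID column, the data frame name airway and the two-patient sample data are made up for illustration, and the time difference is computed explicitly in hours so the 48-hour threshold does not depend on the units difftime happens to pick.

```r
library(dplyr)

# Hypothetical two-patient data; the ID column is an assumption
airway <- tibble(
  ID = c(1, 1, 1, 1, 2, 2, 2),
  Time = as.POSIXct(c("2020-01-01 00:00", "2020-01-01 02:00",
                      "2020-01-01 04:00", "2020-01-05 00:00",
                      "2020-01-01 00:00", "2020-01-01 06:00",
                      "2020-01-04 12:00"), tz = "UTC"),
  AirwayStatus = c("ETT/LMA", "Own Airway", "ETT/LMA", "Own Airway",
                   "ETT/LMA", "Own Airway", "ETT/LMA")
)

events <- airway %>%
  group_by(ID) %>%
  arrange(Time, .by_group = TRUE) %>%
  # label status changes within each patient; other rows stay NA
  mutate(Event = case_when(
    AirwayStatus == "Own Airway" & lag(AirwayStatus) == "ETT/LMA" ~ "Extubation",
    AirwayStatus == "ETT/LMA" & lag(AirwayStatus) == "Own Airway" ~ "Intubation"
  )) %>%
  filter(!is.na(Event)) %>%
  # hours until the patient's next event (NA for the last event)
  mutate(
    hours_to_next = as.numeric(difftime(lead(Time), Time, units = "hours")),
    Failed = Event == "Extubation" & !is.na(hours_to_next) & hours_to_next < 48
  ) %>%
  ungroup()

# extubations and failed extubations per patient
failed_per_patient <- events %>%
  group_by(ID) %>%
  summarise(extubations = sum(Event == "Extubation"),
            failed = sum(Failed))
```

Grouping before lag() and lead() keeps one patient's last row from being compared with the next patient's first row.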
I'm trying to calculate business hours between two dates. Business hours vary depending on the day.
Weekdays have 15 business hours (8:00-23:00); Saturdays and Sundays have 12 business hours (9:00-21:00).
For example: with start date 07/24/2020 22:20 (Friday) and end date 07/25/2020 21:20 (Saturday), since I'm only interested in the business hours, the result should be 12.67 hours.
Here is an example of the data frame and the desired output:
start_date end_date business_hours
07/24/2020 22:20 07/25/2020 21:20 12.67
07/14/2020 21:00 07/16/2020 09:30 18.50
07/18/2020 08:26 07/19/2020 10:00 13.00
07/10/2020 08:00 07/13/2020 11:00 42.00
Here is something you can try with lubridate. I adapted another function I had that I thought might be helpful.
First create a sequence of dates between the two dates of interest. Then create intervals based on business hours, checking each date if on the weekend or not.
Then, "clamp" the start and end times to the allowed business hours time intervals using pmin and pmax.
You can use time_length to get the time measurement of the intervals; summing them up will give you total time elapsed.
library(lubridate)
library(dplyr)
calc_bus_hours <- function(start, end) {
my_dates <- seq.Date(as.Date(start), as.Date(end), by = "day")
my_intervals <- if_else(weekdays(my_dates) %in% c("Saturday", "Sunday"),
interval(ymd_hm(paste(my_dates, "09:00"), tz = "UTC"), ymd_hm(paste(my_dates, "21:00"), tz = "UTC")),
interval(ymd_hm(paste(my_dates, "08:00"), tz = "UTC"), ymd_hm(paste(my_dates, "23:00"), tz = "UTC")))
int_start(my_intervals[1]) <- pmax(pmin(start, int_end(my_intervals[1])), int_start(my_intervals[1]))
int_end(my_intervals[length(my_intervals)]) <- pmax(pmin(end, int_end(my_intervals[length(my_intervals)])), int_start(my_intervals[length(my_intervals)]))
sum(time_length(my_intervals, "hour"))
}
calc_bus_hours(as.POSIXct("07/24/2020 22:20", format = "%m/%d/%Y %H:%M", tz = "UTC"), as.POSIXct("07/25/2020 21:20", format = "%m/%d/%Y %H:%M", tz = "UTC"))
[1] 12.66667
Edit: For Spanish language, use c("sábado", "domingo") instead of c("Saturday", "Sunday")
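To avoid depending on the language of the day names at all, the weekend check could use the numeric day of the week instead. This is a small sketch; the helper name is_weekend is hypothetical and is not part of the function above.

```r
library(lubridate)

# Hypothetical helper: with week_start = 1, Monday is 1 and Sunday is 7,
# so Saturday and Sunday are the values above 5 in any locale
is_weekend <- function(dates) {
  wday(dates, week_start = 1) > 5
}

# Friday, Saturday, Sunday, Monday
weekend_flags <- is_weekend(as.Date(c("2020-07-24", "2020-07-25",
                                      "2020-07-26", "2020-07-27")))
```

Inside calc_bus_hours, is_weekend(my_dates) would then take the place of the weekdays(my_dates) %in% c(...) condition.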
For the data frame example, you can use mapply to call the function using the two selected columns as arguments. Try:
df$business_hours <- mapply(calc_bus_hours, df$start_date, df$end_date)
start end business_hours
1 2020-07-24 22:20:00 2020-07-25 21:20:00 12.66667
2 2020-07-14 21:00:00 2020-07-16 09:30:00 18.50000
3 2020-07-18 08:26:00 2020-07-19 10:00:00 13.00000
4 2020-07-10 08:00:00 2020-07-13 11:00:00 42.00000
I'm fairly new in R and need some help.
I have two data frames with rather similar information. The first data frame has information about misconnections for an airline, whereas the other one is the entire timetable for the same airline. What I need is a new column in the misconnection data.frame containing flights from the timetable that can replace the delayed flights at the transit airport.
The replacement flights need to meet a range of conditions (within a certain time horizon, on the same weekday, and flying to the same destination). In addition, I want R to choose the flight that is closest in time to the new arrival time at the transit airport (from the misconnection data.frame).
The misconnection data.frame looks like the following (1620 lines in total):
miscon <- data.frame(flight.date = as.Date(c("2019-08-05", "2019-10-03", "2019-07-21", "2019-05-29"), format="%Y-%m-%d"),
Outbound.airport = c("MXP", "KRK", "KLU", "OTP"),
arr.time = as.POSIXct(c("19:25:00", "20:52:00", "07:33:00", "18:49:00"), format="%H:%M:%S"),
next.pos.dep = as.POSIXct(c("19:36:00", "21:17:00", "07:58:00", "19:14:00"), format="%H:%M:%S"),
weekday = c("4", "7", "7", "3"))
view(miscon)
flight.date Outbound.airport arr.time next.pos.dep Weekday
1 2019-08-05 MXP 19:25:00 19:36:00 4
2 2019-10-03 KRK 20:52:00 21:17:00 7
3 2019-07-21 KLU 07:33:00 07:58:00 7
4 2019-05-29 OTP 18:49:00 19:14:00 3
And the timetable data.frame would look like this:
tt <- data.frame(start.date = as.Date(c("2019-03-25", "2019-05-02", "2019-07-30", "2019-05-29"), format="%Y-%m-%d"),
end.date = as.Date(c("2019-10-21", "2019-10-27", "2019-08-26", "2019-06-01"), format="%Y-%m-%d"),
weekday = c("1234567", "1.3..67", "1.34567", "..3.5.."),
Outbound.airport = c("KLU", "KLU", "MXP", "OTP"),
dep.time = as.POSIXct(c("12:20:00", "15:55:00", "19:55:00", "20:34:00"), format="%H:%M:%S"))
view(tt)
start.date end.date Weekday Outbound.airport dep.time
1 2019-03-25 2019-10-21 1234567 KLU 12:20:00
2 2019-05-02 2019-10-27 1.3..67 KLU 15:55:00
3 2019-07-30 2019-08-26 1.34567 MXP 19:55:00
4 2019-03-30 2019-06-01 ..3.5.. OTP 20:34:00
In Excel, this problem is solved using index matching, which I've managed. However, the problem is slightly too big for Excel to handle, which is why I need to convert this to R. I tried the match and mutate functions in R, but it seems the values I'm matching must be equal, which I do not expect mine to be.
I also found an interesting solution to a similar problem using the DescTools package, which I tried to implement with no success.
get_close2 <- function(xx=tt, yy=miscon) {
pos <- vector(mode = "numeric")
for(i in 1:dim(yy)[1]) {
pos[i] <- DescTools::Closest(xx$dep.time, yy$next.pos.dep[i])
#print(pos[i])
yy$new.flight[i] <- pos[i]
}
out <- yy
return(out)
}
get_close2()
For this one, I tried with only one condition. It generated a column, but with NAs only. Obviously, I am far off right now, which is why I'm reaching out for help. I hope the problem is clear. The end result would preferably look something like the following:
miscon
flight.date Outbound.airport arr.time next.pos.dep Weekday new.flight.time
1 2019-12-05 MXP 19:25:00 19:36:00 4 19:55:00
2 2019-10-03 KRK 20:52:00 21:17:00 7 NA
3 2019-07-21 KLU 07:33:00 07:58:00 7 12:20:00
4 2019-05-29 OTP 18:49:00 19:14:00 3 20:34:00
I think you can do it as follows. First, I would rearrange the Weekday column so that you have one row for each weekday a flight is going:
library(data.table)
library(dplyr)
library(tidyr)
tt <- tt %>% separate(weekday, into = as.character(1:7), sep = 1:6) %>%
gather(key="key", value="weekday", -c(start.date, end.date, Outbound.airport, dep.time)) %>%
filter(weekday %in% 1:7) %>%
select(-key)
Then I would do a left join of miscon and tt on the airport and weekday.
tt <- data.table(tt)
miscon <- data.table(miscon)
setkey(miscon, Outbound.airport, weekday)
setkey(tt, Outbound.airport, weekday)
df <- tt[miscon]
Check if flight date is on a valid date:
df = df[flight.date>=start.date & flight.date<=end.date]
Now you have a data.frame of all possible connections. The only thing left is to find the minimum time between the flights for each connection.
df[,timediff:= dep.time-arr.time, by=.(weekday, Outbound.airport)]
Now you can filter the rows by the minimum time delay (timediff):
df = df[ , .SD[which.min(timediff)], by=.(weekday, Outbound.airport, flight.date, arr.time, next.pos.dep)]
setnames(df, "dep.time", "new.flight.time")
> df
weekday Outbound.airport flight.date arr.time next.pos.dep start.date end.date new.flight.time timediff
1: 7 KLU 2019-07-21 2020-04-27 07:33:00 2020-04-27 07:58:00 2019-03-25 2019-10-21 2020-04-27 12:20:00 17220 secs
2: 4 MXP 2019-08-05 2020-04-27 19:25:00 2020-04-27 19:36:00 2019-07-30 2019-08-26 2020-04-27 19:55:00 1800 secs
3: 3 OTP 2019-05-29 2020-04-27 18:49:00 2020-04-27 19:14:00 2019-05-29 2019-06-01 2020-04-27 20:34:00 6300 secs
The solution is a bit of a mix of dplyr and data.table.
Ok, it's not pretty but you have a fairly complex issue, and it's not fully clear to me if this gives you what you are looking for - you will need to check it on a larger dataset than the small example you provide to be sure first.
# setup
library(data.table)
setDT(tt)
setDT(miscon)
# make tt long format splitting weekdays out
tt <- melt(tt[, paste("V", 1:7, sep = "") := tstrsplit(weekday, "")][, -"weekday"], measure.vars = paste("V", 1:7, sep = ""))[value != "."][, c("weekday", "value", "variable") := .(value, NULL, NULL)]
# join, calculate time difference, convert format of times, rank on new.dep.time within group, and filter
newDT <- miscon[tt, on = c("Outbound.airport", "weekday"), nomatch = 0][
, new.dep.time := as.numeric(dep.time - arr.time)][
, c("arr.time", "dep.time", "next.pos.dep") := .(format(arr.time, "%H:%M"), format(dep.time, "%H:%M"), format(next.pos.dep, "%H:%M"))][
, new.dep.rank := rank(new.dep.time), by = c("Outbound.airport", "weekday")][
new.dep.rank == 1, -c("new.dep.rank", "new.dep.time")]
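For comparison, the matching rules from the question can also be written as a plain base R loop. This is only a sketch under stated assumptions: it treats "closest" as the earliest departure at or after next.pos.dep, it parses the times in UTC, and the helper name find_replacement is hypothetical. The data is copied from the question, minus the arr.time column.

```r
miscon <- data.frame(
  flight.date = as.Date(c("2019-08-05", "2019-10-03", "2019-07-21", "2019-05-29")),
  Outbound.airport = c("MXP", "KRK", "KLU", "OTP"),
  next.pos.dep = as.POSIXct(c("19:36:00", "21:17:00", "07:58:00", "19:14:00"),
                            format = "%H:%M:%S", tz = "UTC"),
  weekday = c("4", "7", "7", "3")
)
tt <- data.frame(
  start.date = as.Date(c("2019-03-25", "2019-05-02", "2019-07-30", "2019-05-29")),
  end.date = as.Date(c("2019-10-21", "2019-10-27", "2019-08-26", "2019-06-01")),
  weekday = c("1234567", "1.3..67", "1.34567", "..3.5.."),
  Outbound.airport = c("KLU", "KLU", "MXP", "OTP"),
  dep.time = as.POSIXct(c("12:20:00", "15:55:00", "19:55:00", "20:34:00"),
                        format = "%H:%M:%S", tz = "UTC")
)

# Hypothetical helper: earliest valid departure for row i of miscon
find_replacement <- function(i) {
  ok <- tt$Outbound.airport == miscon$Outbound.airport[i] &
    # the timetable pattern (e.g. "1.3..67") must contain the weekday digit
    grepl(miscon$weekday[i], tt$weekday, fixed = TRUE) &
    miscon$flight.date[i] >= tt$start.date &
    miscon$flight.date[i] <= tt$end.date &
    tt$dep.time >= miscon$next.pos.dep[i]
  if (!any(ok)) return(NA_character_)
  format(min(tt$dep.time[ok]), "%H:%M:%S", tz = "UTC")
}

miscon$new.flight.time <- vapply(seq_len(nrow(miscon)), find_replacement,
                                 character(1))
```

A loop like this is slower than a data.table join on a large timetable, but each condition is easy to read and to check against the Excel result.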
I am trying to label times from 4 am to 12 pm as morning, 12 pm to 9 pm as evening, and 9 pm to 4 am as night. I want to use this in a logistic regression model that predicts whether an arrest happens, given the type and time of the crime.
I have tried using lubridate functions, but because the column is stored as strings I have not been able to use them. as.Date is no help either, since some of the strings look like 03/26/2015 06:56:30 PM while others look like 04-12-15 20:24. The two formats are completely different, so a single as.Date call cannot parse both.
One option besides as.Date would be to convert every 04-12-15 20:24 value to the 03/26/2015 06:56:30 PM format, e.g. by replacing each - with / in the date part.
I don't know how to achieve this.
You can use case_when() from the dplyr library to determine the format of the date and then proceed with the conversion based on the format type. From there we check the 24H time component to determine the time of day based on the bins in the OP.
library(dplyr)
chicago15 <- data.frame(Date = c("03/26/2015 06:56:30 PM","04-12-15 20:24",
"03/26/2015 06:56:30 AM","04-12-15 21:24",
"12/31/2017 03:28:43 AM"))
chicago15 %>%
dplyr::mutate(Date2 = dplyr::case_when(
grepl('-',Date) ~ as.POSIXct(Date,format = '%m-%d-%y %H:%M'),
TRUE ~ as.POSIXct(Date,format = '%m/%d/%Y %I:%M:%S %p')
)) %>%
dplyr::mutate(Time_of_Day = dplyr::case_when(
as.numeric(format(Date2,'%H')) >= 21 ~ 'night',
as.numeric(format(Date2,'%H')) >= 12 ~ 'evening',
as.numeric(format(Date2,'%H')) >= 4 ~ 'morning',
TRUE ~ 'night'
))
Date Date2 Time_of_Day
1 03/26/2015 06:56:30 PM 2015-03-26 18:56:30 evening
2 04-12-15 20:24 2015-04-12 20:24:00 evening
3 03/26/2015 06:56:30 AM 2015-03-26 06:56:30 morning
4 04-12-15 21:24 2015-04-12 21:24:00 night
5 12/31/2017 03:28:43 AM 2017-12-31 03:28:43 night
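Once the dates are parsed, the binning step can also be done with a lookup instead of a chain of conditions. A minimal sketch, assuming the same bins as above; the function name time_of_day and the sample timestamps are made up for illustration.

```r
# Hypothetical helper: map the hour of a POSIXct to the bins from the question
time_of_day <- function(x) {
  h <- as.integer(format(x, "%H"))
  # findInterval() returns 1 for hours 0-3, 2 for 4-11, 3 for 12-20, 4 for 21-23
  c("night", "morning", "evening", "night")[findInterval(h, c(0, 4, 12, 21))]
}

stamps <- as.POSIXct(c("2015-03-26 18:56:30", "2015-04-12 21:24:00",
                       "2017-12-31 03:28:43", "2015-03-26 06:56:30"),
                     tz = "UTC")
labels <- time_of_day(stamps)
```

Listing "night" twice in the label vector handles the bin that wraps around midnight.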
I have a data frame with hour stamps and the corresponding measured temperatures. The measurements are taken continuously at random intervals. I would like to convert the hours to full date-times, keeping the temperatures. My data frame looks like this (the measurements started on 20/05/2016):
Time, Temp
09.25,28
10.35,28.2
18.25,29
23.50,30
01.10,31
12.00,36
02.00,25
I would like to create a data.frame with respective date-time and Temp like below:
Time, Temp
2016-05-20 09:25,28
2016-05-20 10:35,28.2
2016-05-20 18:25,29
2016-05-20 23:50,30
2016-05-21 01:10,31
2016-05-21 12:00,36
2016-05-22 02:00,25
I am thankful for any comments and tips on the packages or functions in R, I can have a look to do this. Thanks for your time.
A possible solution in base R:
df$Time <- as.POSIXct(strptime(paste('2016-05-20', sprintf('%05.2f',df$Time)), format = '%Y-%m-%d %H.%M', tz = 'GMT'))
df$Time <- df$Time + cumsum(c(0,diff(df$Time)) < 0) * 86400 # 86400 = 60 * 60 * 24
which gives:
> df
Time Temp
1 2016-05-20 09:25:00 28.0
2 2016-05-20 10:35:00 28.2
3 2016-05-20 18:25:00 29.0
4 2016-05-20 23:50:00 30.0
5 2016-05-21 01:10:00 31.0
6 2016-05-21 12:00:00 36.0
7 2016-05-22 02:00:00 25.0
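To see why the cumsum() term works, it can help to look at the day offset on its own. This sketch uses the raw Time values from the question; the variable names are made up for illustration.

```r
time_num <- c(9.25, 10.35, 18.25, 23.50, 1.10, 12.00, 2.00)

# TRUE wherever the clock time drops below the previous one,
# i.e. wherever a new day must have started
rolled_over <- c(0, diff(time_num)) < 0

# running count of day changes = number of days to add to each row
day_offset <- cumsum(rolled_over)
day_offset
# 0 0 0 0 1 1 2
```

Multiplying day_offset by 86400 converts it to seconds, which is what gets added to the POSIXct values.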
An alternative with data.table, using shift to compare each time with the previous one (of course you can also use cumsum with diff, as above):
library(data.table)
setDT(df)[, Time := as.POSIXct(strptime(paste('2016-05-20', sprintf('%05.2f',Time)), format = '%Y-%m-%d %H.%M', tz = 'GMT')) +
cumsum(Time < shift(Time, fill = Time[1])) * 86400]
Or with dplyr:
library(dplyr)
df %>%
mutate(Time = as.POSIXct(strptime(paste('2016-05-20',
sprintf('%05.2f',Time)),
format = '%Y-%m-%d %H.%M', tz = 'GMT')) +
cumsum(c(0,diff(Time)) < 0)*86400)
which will both give the same result.
Used data:
df <- read.table(text='Time, Temp
09.25,28
10.35,28.2
18.25,29
23.50,30
01.10,31
12.00,36
02.00,25', header=TRUE, sep=',')
You can use a custom date format combined with some code that detects when a new day begins (assuming the first measurement takes place earlier in the day than the last measurement of the previous day).
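# starting day
start_date <- "2016-05-20"
# read Time as character so the leading zeros survive
values <- read.csv('values.txt', colClasses = c("character", NA))
# time of the previous measurement ("00.00" for the first row)
last <- c("00.00", head(values$Time, -1))
# zero-padded HH.MM strings compare correctly as text; each drop starts a new day
day <- cumsum(values$Time < last)
Time <- strptime(paste(start_date, values$Time), "%Y-%m-%d %H.%M")
values$Time <- as.POSIXct(Time) + day * 86400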