This is a tough one for me. I have 3 months of data (up to 1m obs) and 2 columns in my data.frame:
Date_Time Number
12/1/2015 12:00:01 AM 92222222
12/1/2015 12:00:29 AM 32211111
12/1/2015 12:00:41 AM 22333333
12/1/2015 12:00:43 AM 12222222
..... .....
12/1/2015 9:00:02 AM 92222222
12/2/2015 12:00:02 AM 32211111
How can I count the occurrences (frequency) of each value in column "Number" within a time frame of 24 hours?
The expected result for the above example:
92222222 Freq: 2
32211111 Freq: 2
22333333 Freq: 1
12222222 Freq: 1
EDIT
The time frame of 24 hours refers to an interval of 24 hours; it doesn't mean from midnight to midnight. For example, if someone calls at 5 PM today and calls again at 3 PM the next day, this should be counted as 2.
Edit 2:
To be clearer, the objective of this analysis is to find the number of repeat calls to the call center within a 24-hour window.
For example, a customer called from contact number 01101111 on 1/Jan/2016 1:32:01 PM,
then called again on 1/Jan/2016 1:59:43 PM, and finally called the next day, 2/Jan/2016 12:21:02 PM.
The frequency of 01101111 is "3" because the number is repeated 3 times in less than 24 hours.
Based on your comments, for any number the start of the period is the earliest call from that number.
Below is the commented code:
library(lubridate)
library(dplyr)
calls <- structure(list(Date_Time = structure(1:6, .Label = c("12/1/2015 12:00:01 AM",
"12/1/2015 12:00:29 AM", "12/1/2015 12:00:41 AM", "12/1/2015 12:00:43 AM",
"12/1/2015 9:00:02 AM", "12/2/2015 12:00:02 AM"), class = "factor"),
Number = structure(c(4L, 3L, 2L, 1L, 4L, 3L), .Label = c("12222222",
"22333333", "32211111", "92222222"), class = "factor")), .Names = c("Date_Time",
"Number"), row.names = c(NA, -6L), class = "data.frame")
count_freq <- function(timestamps){
  # Given all the occurrences of calls from a number, find the
  # earliest one and count how many occur within 24 hours of it
  dtime <- sort(mdy_hms(as.character(timestamps)))
  start_time <- dtime[1]
  end_time <- start_time + hours(24)
  sum(dtime >= start_time & dtime <= end_time)
}
out <- group_by(calls, Number) %>%
  summarise(freq = count_freq(Date_Time))
Here is another approach that outputs, for each row, the frequency of that row's number over the following 24 hours. It is most likely slower than tfc's.
df<-read.table(header = TRUE, sep=",", text="Date_Time, Number
12/1/2015 12:00:01 AM, 92222222
12/1/2015 12:00:29 AM, 32211111
12/1/2015 12:00:41 AM, 22333333
12/1/2015 12:00:43 AM, 12222222
12/1/2015 9:00:02 AM, 92222222
12/2/2015 12:00:02 AM, 32211111")
df$Date_Time<-as.POSIXct(df$Date_Time, format="%m/%d/%Y %I:%M:%S %p")
library(dplyr)
ncount<-function(x){
target<-x[2]
starttime<-as.POSIXct(x[1], format="%Y-%m-%d %H:%M:%S")
endtime<-starttime+ 24*60*60 #1 day later
nrow(filter(df, Number==target & Date_Time>=starttime & Date_Time<=endtime))
}
df$freq<-apply(df, 1, function(x){ncount(x)} )
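With up to a million rows, the row-wise apply() above filters the whole data frame once per row. A minimal base-R sketch of the same per-row count, restricted to each number's own timestamps, might look like the following. The helper name count_within_24h and the UTC time zone are assumptions for illustration and are not part of either answer.

```r
# Toy data mirroring the example in the question (UTC assumed)
calls <- data.frame(
  Date_Time = as.POSIXct(
    c("2015-12-01 00:00:01", "2015-12-01 00:00:29", "2015-12-01 00:00:41",
      "2015-12-01 00:00:43", "2015-12-01 09:00:02", "2015-12-02 00:00:02"),
    tz = "UTC"),
  Number = c("92222222", "32211111", "22333333", "12222222", "92222222", "32211111"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for every timestamp (in seconds) of one number, count the
# calls from that same number falling in [timestamp, timestamp + 24 hours]
count_within_24h <- function(secs) {
  vapply(secs, function(s) sum(secs >= s & secs <= s + 24 * 3600), integer(1))
}

# ave() applies the helper within each Number group and keeps the row order
calls$freq <- ave(as.numeric(calls$Date_Time), calls$Number, FUN = count_within_24h)
calls$freq
# 2 2 1 1 1 1
```

The comparison is still quadratic within a single number, so this only helps when each number has far fewer calls than the data set has rows.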
I have the following df with the Date column having hourly marks for an entire year:
Date TD RN D.RN Press Temp G.Temp. Rad
1 2018-01-01 00:00:00 154.0535 9.035156 1.416667 950.7833 7.000000 60.16667 11.27000
2 2018-01-01 01:00:00 154.5793 9.663900 1.896667 951.2000 6.766667 59.16667 11.23000
3 2018-01-01 01:59:59 154.5793 7.523438 2.591667 951.0000 6.066667 65.16667 11.23500
4 2018-01-01 02:59:59 154.0535 7.994792 2.993333 951.1833 5.733333 64.00000 11.16833
5 2018-01-01 03:59:59 154.4041 6.797526 3.150000 951.4833 5.766667 57.83333 11.13500
6 2018-01-01 04:59:59 155.1051 12.009766 3.823333 951.0833 5.216667 61.33333 11.22167
I want to add a factor column 'Quarters' that indicates each quarter according to the 'Date'.
As far as I understand I can do that by:
Radiation$Quarter<-cut(Radiation$Date, breaks = "quarters", labels = c("Q1", "Q2", "Q3", "Q4"))
But I also want to add a factor column 'Day/Night' which indicates whether it's day or night, having:
Day → 8am - 8pm
Night → 8pm - 8am
It seems like with the cut() function there's no way to indicate time ranges.
You can use an ifelse/case_when statement after extracting hour from time.
library(dplyr)
library(lubridate)
df %>%
mutate(hour = hour(Date),
label = case_when(hour >= 8 & hour <= 19 ~ 'Day',
TRUE ~ 'Night'))
In base R :
df$hour = as.integer(format(df$Date, '%H'))
transform(df, label = ifelse(hour >= 8 & hour <= 19, 'Day', 'Night'))
We can also do
library(dplyr)
library(lubridate)
df %>%
mutate(hour = hour(Date),
label = case_when(between(hour, 8, 19) ~ "Day", TRUE ~ "Night"))
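To check the boundaries (8:00 counts as day, 20:00 as night), here is a small base-R sketch on four made-up timestamps. It swaps in findInterval() for the ifelse/case_when logic above; the timestamps and the UTC time zone are assumptions for illustration.

```r
# Four illustrative timestamps around the day/night boundaries (UTC assumed)
stamps <- as.POSIXct(c("2018-01-01 07:59:59", "2018-01-01 08:00:00",
                       "2018-01-01 19:59:59", "2018-01-01 20:00:00"), tz = "UTC")
hour <- as.integer(format(stamps, "%H"))

# findInterval() returns 0 before 8, 1 for 8 <= hour < 20, and 2 from 20 onwards
label <- c("Night", "Day", "Night")[findInterval(hour, c(8, 20)) + 1]
label
# "Night" "Day" "Day" "Night"
```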
I am looking at how extubation rates in an intensive care unit have changed over the course of the pandemic.
I have a data set which has hourly timestamps next to a category of airway types which simplified looks like this:
Time AirwayStatus
2020/01/01 00:00 ETT/LMA
2020/01/01 01:00 ETT/LMA
2020/01/01 02:00 Own Airway
2020/01/01 03:00 Own Airway
2020/01/01 04:00 ETT/LMA
What I am effectively looking to do is find the times when the patient is extubated (ETT/LMA turns to Own Airway) and also when intubated (own airway to ETT/LMA). Eventually I want to be able to see how often an extubated patient has to be re-intubated.
Within 48 hours this is known as a failed extubation and we are expecting to see vastly different data during the pandemic compared to before.
The idea I have so far is to create a separate column with the AirwayStatus of the prior hour and then count the rows where the two differ. This seems unsophisticated though, and I was hoping some of you clever people may have a nicer option.
Thank you in advance
Using dplyr from tidyverse:
Supposing you have a dataframe (or tibble) df and patient(?) id ID:
library(dplyr)
df <- tibble(
ID = c(1,1,1,1,1),
Time = c("2020/01/01 00:00", "2020/01/01 01:00", "2020/01/01 02:00", "2020/01/01 03:00", "2020/01/01 04:00"),
AirwayStatus = c("ETT/LMA", "ETT/LMA", "Own Airway", "Own Airway", "ETT/LMA"))
df <- df %>%
group_by(ID) %>%
arrange(Time) %>%
mutate(
Extubated = ifelse(AirwayStatus == "Own Airway" & lag(AirwayStatus) == "ETT/LMA", TRUE, FALSE),
Intubated = ifelse(AirwayStatus == "ETT/LMA" & lag(AirwayStatus) == "Own Airway", TRUE, FALSE))
result <- df %>%
summarise_at(c("Extubated", "Intubated"), sum, na.rm = TRUE)
result
Result:
# A tibble: 1 x 3
ID Extubated Intubated
<dbl> <int> <int>
1 1 1 1
This allows grouping by patient id which you will most likely do.
It's a bit longer than Oliver's answer though.
Your idea is the right way to go. You can skip storing intermediate results, but they have to be computed anyway. Let's assume your data is called df; then we could do something similar to
# Read table: (Could get read.table to work)
library(data.table)
df <- fread("Time AirwayStatus
2020/01/01 00:00 ETT/LMA
2020/01/01 01:00 ETT/LMA
2020/01/01 02:00 Own Airway
2020/01/01 03:00 Own Airway
2020/01/01 04:00 ETT/LMA")
setDF(df)
# Convert time to a date format
df$Time <- as.POSIXct(df$Time)
n <- nrow(df)
# Find changes
df$change <- with(df, c(FALSE, AirwayStatus[seq(n - 1)] != AirwayStatus[seq(2, n)]))
# estimate the length of time since last change
df$hours_between_change[df$change] <- with(df, diff(c(NA, Time[change])) / 3600)
df
Time AirwayStatus change hours_between_change
1 2020-01-01 00:00:00 ETT/LMA FALSE NA
2 2020-01-01 01:00:00 ETT/LMA FALSE NA
3 2020-01-01 02:00:00 Own Airway TRUE NA
4 2020-01-01 03:00:00 Own Airway FALSE NA
5 2020-01-01 04:00:00 ETT/LMA TRUE 2
Note I store the intermediate results here. We likely could make it a bit more readable using dplyr but this does the job.
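As a variation on the change detection, base R's rle() collapses the status column into runs, and every boundary between two runs is an intubation or extubation. This is only a sketch for a single patient whose rows are already sorted by time; the variable names are illustrative.

```r
# Status column from the example, already in time order (single patient assumed)
status <- c("ETT/LMA", "ETT/LMA", "Own Airway", "Own Airway", "ETT/LMA")

runs <- rle(status)
# Row index at which each run starts: 1, 3, 5
run_start <- cumsum(c(1, head(runs$lengths, -1)))

# Compare each run with the one before it
prev <- head(runs$values, -1)
curr <- runs$values[-1]
extubation_rows <- run_start[-1][prev == "ETT/LMA" & curr == "Own Airway"]
intubation_rows <- run_start[-1][prev == "Own Airway" & curr == "ETT/LMA"]

extubation_rows  # 3
intubation_rows  # 5
```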
Here is an approach using dplyr.
First, you might want to consider a separate column to indicate an intubation or extubation "event." If someone is "Own Airway" and then the previous row has "ETT/LMA", we assume the person has been extubated. The opposite can also be determined for intubation.
Then, you can filter and only focus on these events.
For each event, you may want to capture when the event is "Extubation", and then following event is "Intubation", and the time difference is < 48 hrs. If this is true, then the extubation is actually a "failed extubation."
This may handle situations where someone has data that begins with "Own Airway" and gets intubated (if no extubation event, then cannot be failed extubation). It will also keep extubation events where the time difference is > 48 hrs as well.
library(tidyverse)
df %>%
mutate(Event = case_when(
AirwayStatus == "Own Airway" & lag(AirwayStatus) == "ETT/LMA" ~ "Extubation",
AirwayStatus == "ETT/LMA" & lag(AirwayStatus) == "Own Airway" ~ "Intubation",
TRUE ~ NA_character_)
) %>%
filter(!is.na(Event)) %>%
mutate(Event = ifelse(
Event == "Extubation" & lead(Event) == "Intubation" & (difftime(lead(Time), Time, units = "hours") < 48),
"Failed Extubation",
Event
))
Output
Time AirwayStatus Event
1 2020-01-01 02:00:00 Own Airway Failed Extubation
2 2020-01-01 04:00:00 ETT/LMA Intubation
Data
df <- structure(list(Time = structure(c(1577858400, 1577862000, 1577865600,
1577869200, 1577872800), class = c("POSIXct", "POSIXt"), tzone = ""),
AirwayStatus = c("ETT/LMA", "ETT/LMA", "Own Airway", "Own Airway",
"ETT/LMA"), Event = c(NA, NA, "Extubated", NA, "Intubated"
)), row.names = c(NA, -5L), class = "data.frame")
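One thing to watch when comparing a time gap against 48: subtracting two POSIXct values returns a difftime whose units R picks automatically, so the bare number may be in minutes, hours or days. The sketch below uses two made-up timestamps 72 hours apart to show why asking difftime() for hours explicitly is the safer comparison.

```r
t1 <- as.POSIXct("2020-01-01 02:00", tz = "UTC")
t2 <- as.POSIXct("2020-01-04 02:00", tz = "UTC")  # 72 hours later

# Plain subtraction: R chooses the units, here days
gap_auto <- t2 - t1
units(gap_auto)             # "days"
as.numeric(gap_auto) < 48   # TRUE, although the real gap is 72 hours

# Explicit units make the comparison mean what it says
gap_hours <- difftime(t2, t1, units = "hours")
as.numeric(gap_hours) < 48  # FALSE
```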
I'm trying to calculate business hours between two dates. Business hours vary depending on the day.
Weekdays have 15 business hours (8:00-23:00); Saturdays and Sundays have 12 business hours (9:00-21:00).
For example: start date 07/24/2020 22:20 (Friday) and end date 07/25/2020 21:20 (Saturday). Since I'm only interested in the business hours, the result should be 12.67 hours.
Here is an example of the dataframe and desired output:
start_date end_date business_hours
07/24/2020 22:20 07/25/2020 21:20 12.67
07/14/2020 21:00 07/16/2020 09:30 18.50
07/18/2020 08:26 07/19/2020 10:00 13.00
07/10/2020 08:00 07/13/2020 11:00 42.00
Here is something you can try with lubridate. I adapted another function I had that I thought might be helpful.
First create a sequence of dates between the two dates of interest. Then create intervals based on business hours, checking each date if on the weekend or not.
Then, "clamp" the start and end times to the allowed business hours time intervals using pmin and pmax.
You can use time_length to get the time measurement of the intervals; summing them up will give you total time elapsed.
library(lubridate)
library(dplyr)
calc_bus_hours <- function(start, end) {
my_dates <- seq.Date(as.Date(start), as.Date(end), by = "day")
my_intervals <- if_else(weekdays(my_dates) %in% c("Saturday", "Sunday"),
interval(ymd_hm(paste(my_dates, "09:00"), tz = "UTC"), ymd_hm(paste(my_dates, "21:00"), tz = "UTC")),
interval(ymd_hm(paste(my_dates, "08:00"), tz = "UTC"), ymd_hm(paste(my_dates, "23:00"), tz = "UTC")))
int_start(my_intervals[1]) <- pmax(pmin(start, int_end(my_intervals[1])), int_start(my_intervals[1]))
int_end(my_intervals[length(my_intervals)]) <- pmax(pmin(end, int_end(my_intervals[length(my_intervals)])), int_start(my_intervals[length(my_intervals)]))
sum(time_length(my_intervals, "hour"))
}
calc_bus_hours(as.POSIXct("07/24/2020 22:20", format = "%m/%d/%Y %H:%M", tz = "UTC"), as.POSIXct("07/25/2020 21:20", format = "%m/%d/%Y %H:%M", tz = "UTC"))
[1] 12.66667
Edit: For Spanish language, use c("sábado", "domingo") instead of c("Saturday", "Sunday")
For the data frame example, you can use mapply to call the function using the two selected columns as arguments. Try:
df$business_hours <- mapply(calc_bus_hours, df$start_date, df$end_date)
start end business_hours
1 2020-07-24 22:20:00 2020-07-25 21:20:00 12.66667
2 2020-07-14 21:00:00 2020-07-16 09:30:00 18.50000
3 2020-07-18 08:26:00 2020-07-19 10:00:00 13.00000
4 2020-07-10 08:00:00 2020-07-13 11:00:00 42.00000
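The same clamp idea can also be sketched in base R without lubridate intervals, which sidesteps the locale-dependent weekday names. This is a minimal sketch, assuming both timestamps are POSIXct in UTC and that there are no holidays; the function name business_hours is hypothetical.

```r
# Hypothetical helper: business hours between two POSIXct timestamps (UTC assumed)
business_hours <- function(start, end) {
  days <- seq(as.Date(start), as.Date(end), by = "day")
  # wday is 0 for Sunday and 6 for Saturday, independent of locale
  is_weekend <- as.POSIXlt(days)$wday %in% c(0, 6)
  open_hour  <- ifelse(is_weekend, 9, 8)
  close_hour <- ifelse(is_weekend, 21, 23)

  # Opening and closing instants of each day, in seconds since the epoch
  midnight <- as.numeric(days) * 86400
  opens  <- midnight + open_hour * 3600
  closes <- midnight + close_hour * 3600

  # Clamp each day's window to [start, end]; days with no overlap contribute zero
  overlap <- pmin(closes, as.numeric(end)) - pmax(opens, as.numeric(start))
  sum(pmax(overlap, 0)) / 3600
}

business_hours(as.POSIXct("2020-07-24 22:20", tz = "UTC"),
               as.POSIXct("2020-07-25 21:20", tz = "UTC"))
# 12.66667
```

Because every day is clamped against both the start and the end, the single-day case needs no special handling.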
I would like to calculate time difference in R of "A" and "B". The data that I have is the hour/minute/am-pm of individuals when they go to sleep("A") and at what time they wake up("B"): (df is called time)
Hour(A) Min(A) AMPM(A) Hour(B) Min(B) AMPM(B)
1 30 AM 7 30 AM
4 00 AM 9 00 AM
11 30 PM 6 30 AM
I have been doing some research, and what I found is that I could create the time as a character string and then convert it to a time format.
First, I used the unite() function (tidyverse) to join the hour(A) and min(A). Then, I created another column with a "fake" date (if it was pm: "2019-04-13" & am "2019-04-14"). Then, I used again the function unite() to join the date and the time and with the function strptime() I change the class to time.
For hour(B), min(B) and AMPM(B), I also used the function unite and join the three columns. Then I applied the function strptime() to change the class to a time.
Finally, I am using the function difftime() to find the difference between A and B, but I can't understand why I am getting unusual results.
time <- time %>% mutate(Date = ifelse(AMPM(A) == " AM", "2019-04-14", "2019-04-13"))
time$Date <- as.Date(time$Date)
#Using unite to join Hour(A) with Mins(A) and Hour(B) with Mins(B)
time <- time %>% unite(Sleeptime,HourA,MinsA, sep = ":") %>% unite(Wakeuptime, HourB,MinsB, sep = ":")
#Adding the seconds
time$Sleeptime <- paste0(time$Sleeptime,":00")
#Using unite to join Hours(B)Mins(B) with AMPM(B)
time <- time %>% unite(Wakeuptime, Wakeuptime ,AMPMWake, sep = "" )
#Changing the class for time (B)
time$Wakeuptime2 <- strptime(x = paste0(time$Wakeuptime2, "m"), format = "%I:%M %p")
#Joining the fake date for (A) with the time(A)
time <- time %>% unite(ST, Date, Sleeptime, sep = " ")
#Changing the class for time (A)
time$ST = strptime(time$ST,format='%Y-%m-%d %H:%M:%S')
#Calculating the difference in time
time$difference <- difftime(time$Wakeuptime2, time$ST, units = "hours")
What I need is another column with the difference in hour or minutes
Hour(A) Min(A) AMPM(A) Hour(B) Min(B) AMPM(B) DIFF (min)
1 30 AM 7 30 AM 360
4 00 AM 9 00 AM 300
11 30 PM 6 30 AM 420
We could use paste to assemble the fragments of time(A) and time(B), then convert as.POSIXct. From bed-times with PM we subtract 8.64e4 (one day in seconds). Now it's easy to calculate the differences within an apply.
tmp <- sapply(list(1:3, 4:6), function(x) {
cl <- as.POSIXct(apply(time[x], 1, paste, collapse=":"), format="%I:%M:%p")
return(ifelse(time[tail(x, 1)] == "PM", cl - 8.64e4, cl))
})
time <- cbind(time, `DIFF(min)`=apply(tmp, 1, diff)/60)
time
# Hour(A) Min(A) AMPM(A) Hour(B) Min(B) AMPM(B) DIFF(min)
# 1 1 30 AM 7 30 AM 360
# 2 4 0 AM 9 0 AM 300
# 3 11 30 PM 6 30 AM 420
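Since only the time of day matters here, the difference can also be worked out with plain arithmetic on minutes since midnight, wrapping around with a modulo of 1440 when the wake-up time falls on the next day. This swaps in a different technique from the as.POSIXct approach above; the helper name to_minutes is hypothetical, and the sketch assumes nobody sleeps 24 hours or longer.

```r
# Hypothetical helper: minutes since midnight from 12-hour clock parts
to_minutes <- function(hour, min, ampm) {
  # hour %% 12 maps 12 AM to 0 and keeps 12 PM at 12 after adding the PM offset
  (hour %% 12 + ifelse(ampm == "PM", 12, 0)) * 60 + min
}

sleep <- to_minutes(c(1, 4, 11), c(30, 0, 30), c("AM", "AM", "PM"))
wake  <- to_minutes(c(7, 9, 6),  c(30, 0, 30), c("AM", "AM", "AM"))

# A negative difference means the wake-up is on the following day; %% wraps it
diff_min <- (wake - sleep) %% 1440
diff_min
# 360 300 420
```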
Data
time <- structure(list(`Hour(A)` = c(1L, 4L, 11L), `Min(A)` = c(30L,
0L, 30L), `AMPM(A)` = c("AM", "AM", "PM"), `Hour(B)` = c(7L,
9L, 6L), `Min(B)` = c(30L, 0L, 30L), `AMPM(B)` = c("AM", "AM",
"AM")), row.names = c(NA, -3L), class = "data.frame")
I have a data.frame that contains two date columns, one for date of birth (DOB) for an individual, and a reference point in time (Snapshot.Date), let's say it's the date we last saw that individual. There are other columns (omitted), so I'd ideally like the results to be added as a column to my existing data.frame.
I would like to calculate how many months (continuous) lie between the individual's last birthday (relative to the Snapshot.Date) and the Snapshot.Date.
I've tried a plyr solution and a base sapply solution, and they are both slower than I expected they would be -- (and I need to process one million rows in my 'real' data.frame)
First, here is a test dataset. 20 original records (with the 'special' case of Feb 29th, only existing in a leap year).
data.test = structure(list(Snapshot.Date = structure(c(1433030400, 1396224000,
1375228800, 1396224000, 1383177600, 1362009600, 1367280000, 1369958400,
1346371200, 1348963200, 1435622400, 1435622400, 1435622400, 1435622400,
1435622400, 1435622400, 1435622400, 1435622400, 1435622400, 1346371200
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), DOB = structure(c(-268790400,
-155692800, -955065600, -551232000, -149644800, -774230400, -485395200,
-17625600, -131932800, -387244800, 545961600, 18489600, -230515200,
441676800, -32745600, 775180800, 713491200, 483235200, 114307200,
-815443200), class = c("POSIXct", "POSIXt"), tzone = "UTC")), .Names = c("Snapshot.Date",
"DOB"), row.names = c(32806L, 21294L, 14880L, 21730L, 17525L,
8516L, 11068L, 11751L, 2564L, 3832L, 802276L, 1031697L, 129222L,
588224L, 1093247L, 878037L, 370736L, 709108L, 861908L, 2199L), class = "data.frame")
And the function for calculating months (I'm sure this can be improved too).
months_since_last_birthday = function(CurrentDate, DateOfBirth)
{
  last_birthday = DateOfBirth
  # Feb 29th only occurs once every four years, so reset it to the 28th
  if(month(last_birthday) == 2 && day(last_birthday) == 29)
  {
    day(last_birthday) = 28
  }
  year(last_birthday) = year(CurrentDate)
  if(last_birthday > CurrentDate)
  {
    last_birthday = last_birthday - years(1) # last year's birthday is the most recent occurrence
  }
  # interval() is the current lubridate name for what older versions called new_interval()
  return(as.period(interval(last_birthday, CurrentDate)) / months(1))
}
For the base 20 records, here is the desired output:
Snapshot.Date DOB Months.Since.Birthday
32806 2015-05-31 1961-06-26 11.1643836
21294 2014-03-31 1965-01-25 2.1972603
14880 2013-07-31 1939-09-27 10.1315068
21730 2014-03-31 1952-07-14 8.5589041
17525 2013-10-31 1965-04-05 6.8547945
8516 2013-02-28 1945-06-20 8.2630137
11068 2013-04-30 1954-08-15 8.4931507
11751 2013-05-31 1969-06-11 11.6575342
2564 2012-08-31 1965-10-27 10.1315068
3832 2012-09-30 1957-09-24 0.1972603
802276 2015-06-30 1987-04-21 2.2958904
1031697 2015-06-30 1970-08-03 10.8876712
129222 2015-06-30 1962-09-12 9.5917808
588224 2015-06-30 1983-12-31 5.9863014
1093247 2015-06-30 1968-12-18 6.3945205
878037 2015-06-30 1994-07-26 11.1315068
370736 2015-06-30 1992-08-11 10.6246575
709108 2015-06-30 1985-04-25 2.1643836
861908 2015-06-30 1973-08-16 10.4602740
2199 2012-08-31 1944-02-29 6.0986301
Scaling up the dataset for benchmarking:
# Make 5000 records total for benchmarking, didn't replicate Feb 29th
# since it is a very rare case in the data
set.seed(1)
data.test = rbind(data.test, data.test[sample(1:19, size = 4980, replace = TRUE),])
start.time = Sys.time()
res = suppressMessages(adply(data.test , 1, transform, Months.Since.Birthday = months_since_last_birthday(Snapshot.Date, DOB)))
end.time = Sys.time()
# end.time - start.time
# Time difference of 1.793945 mins
start.time = Sys.time()
data.test$Months.Since.Birthday = suppressMessages(sapply(1:5000, function(x){return(months_since_last_birthday(data.test$Snapshot.Date[x], data.test$DOB[x]))}))
end.time = Sys.time()
# end.time - start.time
# Time difference of 1.743053 mins
Am I doing something seriously wrong? Does this seem really slow to you?
Any feedback is welcome!
Unless I'm missing something obvious, there are a bunch of built in ways of working with time data in R, notably base::difftime which may have saved you some trouble.
Taking your above dataset data.test:
data.test$dif <- round(as.vector(as.double(difftime(strptime(data.test$Snapshot.Date, format = "%Y-%m-%d"), strptime(data.test$DOB, format = "%Y-%m-%d"), units = "days"))) %% 365, 1)
or, to lay it out more logically (this won't work if you copy-paste it):
data.test$dif <-
round(
as.vector(
as.double(
difftime(
strptime(data.test$Snapshot.Date, format = "%Y-%m-%d"),
strptime(data.test$DOB, format = "%Y-%m-%d"), units = "days")
)
)
%% 365,
1)
The above uses the difftime function to find the difference between the two dates with the given format (format = "%Y-%m-%d") in terms of days, then performs remainder division to get the number of days since the last birthday. I personally think this is a better measure than months because a difference of 2 months between July and August is a different number of days than a 2 month difference between January and February.
Note: The above solution does not incorporate leap years. You could easily look up a list of leap years and add 1 day to the checkup or subtract 1 day from the birthday of each individual who lived through that leap year to get an accurate number.
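The leap-year caveat can be avoided by rebuilding each person's most recent birthday as a date and subtracting it, which is still fully vectorised. The following is a minimal base-R sketch, assuming both columns are POSIXct in UTC as in data.test and treating Feb 29 birthdays as Feb 28, as the question does. The function name days_since_last_birthday is hypothetical.

```r
# Hypothetical helper: whole days between the most recent birthday and the snapshot
days_since_last_birthday <- function(snapshot, dob) {
  snap  <- as.POSIXlt(snapshot, tz = "UTC")
  birth <- as.POSIXlt(dob, tz = "UTC")

  # Treat Feb 29 birthdays as Feb 28 (POSIXlt months are 0-based)
  mday <- ifelse(birth$mon == 1 & birth$mday == 29, 28, birth$mday)

  # Birthday in the snapshot year and in the year before
  make_date <- function(year) {
    as.Date(sprintf("%04d-%02d-%02d", as.integer(year),
                    as.integer(birth$mon + 1), as.integer(mday)))
  }
  this_year <- make_date(snap$year + 1900)
  prev_year <- make_date(snap$year + 1899)

  # ifelse() drops the Date class, leaving days since the epoch
  snap_date <- as.Date(snapshot)
  last_bday <- ifelse(this_year > snap_date, prev_year, this_year)
  as.numeric(snap_date) - last_bday
}

snapshot <- as.POSIXct(c("2015-05-31", "2012-08-31"), tz = "UTC")
dob      <- as.POSIXct(c("1961-06-26", "1944-02-29"), tz = "UTC")
days_since_last_birthday(snapshot, dob)
# 339 185
```

The result is in days rather than months, in line with the answer above.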