How come difftime() gives me answers way out of proportion? - r

I have date-time values and I'd like to calculate the difference between them. I tried subtracting the two times, as in t1 - t2, but the units switch: some of the output is in minutes, some in hours and some in days, which makes it hard to work with.
I used difftime from lubridate and it gave me results that don't make sense.
my_tibble %>%
mutate(time_diff = difftime(t2, t1, units = "mins"))
t1 t2 time_diff
<dttm> <dttm> <drtn>
1 2018-06-30 18:26:28 2018-07-01 01:26:43 0.2342667 mins
2 2018-06-30 19:33:03 2018-07-01 09:36:56 423.8818500 mins
3 2018-06-30 19:32:51 2018-07-01 02:33:41 0.8219833 mins
4 2018-06-30 23:09:59 2018-07-01 06:11:45 1.7654167 mins
5 2018-06-30 23:22:30 2018-07-01 06:23:00 0.4852000 mins
Here's more information.
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 6 obs. of 1 variable:
$ t1: POSIXct, format: "2018-07-01 01:26:43" "2018-07-01 09:36:56" "2018-07-01 02:33:41" "2018-07-01 06:11:45"
For what it's worth, the file comes from a CSV that has the time t1 defined as the number of milliseconds since the unix epoch. Here is how I read in the dataframe.
my_tibble <- read_csv("table.csv") %>%
mutate(t1 = as.POSIXct(epoch_milli / 1000, origin="1970-01-01")) %>%
mutate_if(is.numeric, as.character) %>%
as_tibble()
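For reference, here is a minimal sketch of the epoch conversion described above. The value of epoch_milli is hypothetical (it is not taken from the CSV), and tz = "UTC" is an assumption added so that the printed time does not depend on the local time zone.

```r
# Hypothetical epoch value in milliseconds (not taken from the CSV in the question)
epoch_milli <- 1530383188000

# as.POSIXct() counts seconds since the origin, so scale milliseconds down first
t1 <- as.POSIXct(epoch_milli / 1000, origin = "1970-01-01", tz = "UTC")

format(t1, "%Y-%m-%d %H:%M:%S")
# "2018-06-30 18:26:28"
```

If the division by 1000 is skipped, the same number is read as seconds and the resulting date lands far in the future, which is one way to end up with differences that look out of proportion.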

I couldn't reproduce the issue once "mins" was quoted correctly (units = "mins"). Also note that difftime() is a base R function, not part of lubridate.
library(dplyr)
my_tibble %>%
mutate(time_diff = difftime(t2, t1, units = "mins"))
# A tibble: 5 x 3
# t1 t2 time_diff
# <dttm> <dttm> <drtn>
#1 2018-06-30 18:26:28 2018-07-01 01:26:43 420.2500 mins
#2 2018-06-30 19:33:03 2018-07-01 09:36:56 843.8833 mins
#3 2018-06-30 19:32:51 2018-07-01 02:33:41 420.8333 mins
#4 2018-06-30 23:09:59 2018-07-01 06:11:45 421.7667 mins
#5 2018-06-30 23:22:30 2018-07-01 06:23:00 420.5000 mins
data
my_tibble <- structure(list(t1 = structure(c(1530383188, 1530387183, 1530387171,
1530400199, 1530400950), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
t2 = structure(c(1530408403, 1530437816, 1530412421, 1530425505,
1530426180), class = c("POSIXct", "POSIXt"), tzone = "UTC")),
row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
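As for the unit switching mentioned in the question: plain subtraction of date-times lets R choose the unit automatically based on the size of the differences, whereas difftime() with an explicit units argument always reports in that unit. A small base R sketch using the first row of the data above (the variable names auto and mins are illustrative):

```r
t1 <- as.POSIXct("2018-06-30 18:26:28", tz = "UTC")
t2 <- as.POSIXct("2018-07-01 01:26:43", tz = "UTC")

# Plain subtraction lets R pick the unit (hours here, because the gap is under a day)
auto <- t2 - t1
units(auto)
# "hours"

# difftime() with an explicit unit always reports in that unit
mins <- difftime(t2, t1, units = "mins")
as.numeric(mins)
# 420.25

# An existing difftime can also be converted after the fact
units(auto) <- "mins"
as.numeric(auto)
# 420.25
```

Wrapping the result in as.numeric() gives a plain number, which is usually easier to work with in later calculations.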

Related

How to add rows in a dataset based on conditions in R

I have a dataset of bookings in which the length of stay can span two or three months. For every such booking I want to create one row per month, with the revenue divided across the months and the remaining booking information kept the same. If a booking falls entirely within one month, it should be shown as it is.
structure(list(channel = c("109", "109", "Agent"), room_stay_status = c("ENQUIRY",
"ENQUIRY", "CHECKED_OUT"), start_date = structure(c(1637971200,
1640995200, 1640995200), tzone = "UTC", class = c("POSIXct",
"POSIXt")), end_date = structure(c(1643155200, 1642636800, 1641168000
), tzone = "UTC", class = c("POSIXct", "POSIXt")), los = c(60,
19, 2), booker = c("Anuj", "Anuj", "Anuj"), area = c("Goa", "Goa",
"Goa"), property_sku = c("Amna-3b", "Amna-3b", "Amna-3b"), Revenue = c(90223.666,
5979, 7015.9), Booking_ref = c("aed97", "b497h9", "bde65")), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
output should look like this
structure(list(channel = c("109", "109", "109", "109", "Agent"
), room_stay_status = c("ENQUIRY", "ENQUIRY", "ENQUIRY", "ENQUIRY",
"CHECKED_OUT"), start_date = structure(c(1637971200, 1638316800,
1640995200, 1640995200, 1640995200), tzone = "UTC", class = c("POSIXct",
"POSIXt")), end_date = structure(c(1638230400, 1640908800, 1643155200,
1642636800, 1641168000), tzone = "UTC", class = c("POSIXct",
"POSIXt")), los = c(4, 31, 25, 19, 2), booker = c("Anuj", "Anuj",
"Anuj", "Anuj", "Anuj"), area = c("Goa", "Goa", "Goa", "Goa",
"Goa"), property_sku = c("Amna-3b", "Amna-3b", "Amna-3b", "Amna-3b",
"Amna-3b"), Revenue = c(6014.91106666667, 46615.5607666667, 37593.1941666667,
5979, 7015.9), Booking_ref = c("aed97", "aed97", "aed97", "b497h9",
"bde65")), row.names = c(NA, -5L), class = c("tbl_df", "tbl",
"data.frame"))
Many thanks in advance.
A quick attempt here (assuming your input data is named df_in and your expected output df_out), which seems to do the trick:
library("dplyr")
library("tidyr")
library("lubridate")
# Function for creating a vector from start (st) to end (nd) with intermediate
# months inside
cut_months <- function(st, nd) {
repeat {
# Grow vector, keep adding next month
next_month <- ceiling_date(tail(st, 1) + seconds(1), "month")
if (next_month > nd) {
st <- append(st, nd)
break
} else {
st <- append(st, next_month)
}
}
return(st)
}
# Let's try it
print(cut_months(df_in$start_date[1], df_in$end_date[2]))
# [1] "2021-11-27 01:00:00 CET" "2021-12-01 01:00:00 CET" "2022-01-01 00:00:00 CET" "2022-01-20 01:00:00 CET"
# Function for expanding months:
expand_months <- function(df) {
expand_rows <-
df %>%
# Expand months and unnest list-column
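# SIMPLIFY = FALSE keeps a list-column even when every booking yields the same number of dates
mutate(key_dates = mapply(cut_months, start_date, end_date, SIMPLIFY = FALSE)) %>%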
select(-start_date, -end_date) %>%
unnest(key_dates) %>%
# Compute needed values
group_by(Booking_ref) %>%
arrange(Booking_ref, key_dates) %>%
mutate(
start_date = key_dates,
end_date = lead(key_dates),
los = as.numeric(as.duration(start_date %--% end_date), "days"), # Ceiling this?
Revenue = Revenue * los / sum(los, na.rm = TRUE)
) %>%
arrange(Booking_ref, start_date) %>%
# Clean-up
filter(!is.na(end_date)) %>%
select(-key_dates)
expand_rows
}
# Print results and compare:
expand_months(df_in)
## A tibble: 5 x 10
## Groups: Booking_ref [3]
#channel room_stay_status los booker area property_~1 Revenue Booki~2 start_date end_date
#<chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dttm> <dttm>
#1 109 ENQUIRY 4 Anuj Goa Amna-3b 6015. aed97 2021-11-27 01:00:00 2021-12-01 01:00:00
#2 109 ENQUIRY 31.0 Anuj Goa Amna-3b 46553. aed97 2021-12-01 01:00:00 2022-01-01 00:00:00
#3 109 ENQUIRY 25.0 Anuj Goa Amna-3b 37656. aed97 2022-01-01 00:00:00 2022-01-26 01:00:00
#4 109 ENQUIRY 19 Anuj Goa Amna-3b 5979 b497h9 2022-01-01 01:00:00 2022-01-20 01:00:00
#5 Agent CHECKED_OUT 2 Anuj Goa Amna-3b 7016. bde65 2022-01-01 01:00:00 2022-01-03 01:00:00
## ... with abbreviated variable names 1: property_sku, 2: Booking_ref
df_out
## A tibble: 5 x 10
#channel room_stay_status start_date end_date los booker area property_~1 Revenue Booki~2
#<chr> <chr> <dttm> <dttm> <dbl> <chr> <chr> <chr> <dbl> <chr>
#1 109 ENQUIRY 2021-11-27 00:00:00 2021-11-30 00:00:00 4 Anuj Goa Amna-3b 6015. aed97
#2 109 ENQUIRY 2021-12-01 00:00:00 2021-12-31 00:00:00 31 Anuj Goa Amna-3b 46616. aed97
#3 109 ENQUIRY 2022-01-01 00:00:00 2022-01-26 00:00:00 25 Anuj Goa Amna-3b 37593. aed97
#4 109 ENQUIRY 2022-01-01 00:00:00 2022-01-20 00:00:00 19 Anuj Goa Amna-3b 5979 b497h9
#5 Agent CHECKED_OUT 2022-01-01 00:00:00 2022-01-03 00:00:00 2 Anuj Goa Amna-3b 7016. bde65
## ... with abbreviated variable names 1: property_sku, 2: Booking_ref
I do not understand entirely how you distribute the Revenue. Consider that left as an exercise to fix :).
Hint: you need a ceiling() around the computation of the new los which computes decimal days.
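On the revenue question: comparing the input with the expected output suggests (this is an inference, not something the question states) that revenue is split in proportion to the nights falling in each month, relative to the booking's original los. A worked check for booking "aed97", with illustrative variable names:

```r
revenue_total <- 90223.666     # Revenue of booking "aed97" in the sample data
los_total     <- 60            # its original length of stay
los_by_month  <- c(4, 31, 25)  # nights per month, taken from the expected output

# Proportional split: each month gets its share of nights times the total revenue
revenue_by_month <- revenue_total * los_by_month / los_total
round(revenue_by_month, 3)
# 6014.911 46615.561 37593.194
```

These values agree with the Revenue column of the expected output, and they sum back to the original total.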
Using solution from this post to split date:
library(dplyr)
library(tidyr)      # unnest()
library(purrr)      # map()
library(lubridate)  # month()
df2 <- df %>%
group_by(id = row_number()) %>% # for each row
mutate(seq = list(seq(start_date, end_date, "day")), # create a sequence of dates with 1 day step
month = map(seq, month)) %>% # get the month for each one of those dates in sequence
unnest(c(seq, month)) %>% # unnest both list-columns
group_by(Booking_ref, id, month) %>% # for each group, row and month
summarise(start_date = min(seq), # get minimum date as start
end_date = max(seq)) %>% # get maximum date as end
ungroup() %>% # ungroup
select(-id, - month)%>%
group_by(Booking_ref)%>%
mutate(last_date=max(end_date)) # get last_date to determine los
df3 <- merge(df2,df%>%select(-start_date,-end_date),by=c('Booking_ref'),all.x=T)%>%
mutate(timespam=end_date-start_date)%>%
mutate(los2=as.numeric(case_when(last_date==end_date~timespam,
T~timespam+1)),
Revenue2=Revenue*los2/los)
out_df <- df3%>%
select(-Revenue,-los,-timespam,-last_date)%>%
rename(Revenue=Revenue2,
los=los2)
> out_df
Booking_ref start_date end_date channel room_stay_status booker area property_sku los Revenue
1 aed97 2022-01-01 2022-01-26 109 ENQUIRY Anuj Goa Amna-3b 25 37593.194
2 aed97 2021-11-27 2021-11-30 109 ENQUIRY Anuj Goa Amna-3b 4 6014.911
3 aed97 2021-12-01 2021-12-31 109 ENQUIRY Anuj Goa Amna-3b 31 46615.561
4 b497h9 2022-01-01 2022-01-20 109 ENQUIRY Anuj Goa Amna-3b 19 5979.000
5 bde65 2022-01-01 2022-01-03 Agent CHECKED_OUT Anuj Goa Amna-3b 2 7015.900
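For comparison, the month-splitting step can also be sketched in base R alone. split_booking() is a hypothetical helper, not part of either answer. It assumes whole-day dates and that the final (check-out) day is not counted as a night, which is inferred from the los values in the expected output.

```r
# Hypothetical helper (base R only): split one booking into per-month pieces.
split_booking <- function(start_date, end_date) {
  days  <- seq(as.Date(start_date), as.Date(end_date), by = "day")
  # Group the days by calendar month; unname() keeps the result free of row names
  parts <- unname(split(days, format(days, "%Y-%m")))

  first_day <- vapply(parts, function(d) as.numeric(min(d)), numeric(1))
  last_day  <- vapply(parts, function(d) as.numeric(max(d)), numeric(1))

  out <- data.frame(
    start_date = as.Date(first_day, origin = "1970-01-01"),
    end_date   = as.Date(last_day, origin = "1970-01-01"),
    los        = lengths(parts)
  )
  # The last piece ends on the check-out day, which is not counted as a night
  out$los[nrow(out)] <- out$los[nrow(out)] - 1L
  out
}

res <- split_booking("2021-11-27", "2022-01-26")
res$los
# 4 31 25
```

Because the pieces are keyed by year and month together, a booking that crosses a year boundary keeps its months in chronological order.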

Calculate availability with dates

I have a tibble and want to compute monthly availability (A), defined as
A = uptime / (uptime + downtime),
where (monthly) downtime is end - start, by month and uptime is total time (1 month) - downtime. What is the way to compute monthly availability for the year 2019?
This is the sample data
structure(list(start = structure(c(1550048400, 1558008000, 1558703040,
1561032000, 1560945660, 1563451200), tzone = "UTC", class = c("POSIXct",
"POSIXt")), end = structure(c(1550143989, 1558008000, 1558956840,
1561032000, 1560945660, 1563451200), tzone = "GMT", class = c("POSIXct",
"POSIXt"))), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
First, you have inconsistent "tzone" attributes, one is "UTC" and the other is "GMT". It's minor (and slightly noisy), so I'll preempt the noise (though no change in the results):
attr(dat$end, "tzone") <- "UTC"
A helper function:
fun <- function(mon1, mon2, x = dat) {
# include every outage that overlaps the month, even one that spans the whole month ...
tmp <- x[with(x, start < mon2 & end >= mon1),] |>
# ... but if start-to-end straddles a month begin/end, then truncate it
transform(
start = pmax(start, mon1),
end = pmin(end, mon2)
)
data.frame(start = mon1, end = mon2) |>
transform(downtime = c(sum(with(tmp, as.numeric(end - start, units = "hours"))), 0)[1]) |>
transform(uptime = as.numeric(mon2 - mon1, units = "hours") - downtime) |>
transform(A = uptime / ( uptime + downtime))
}
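And the work in base R. Note that with length.out=12 the consecutive pairs cover January through November only; to include December 2019, use length.out=13 and months[-13] in place of months[-12]: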
months <- seq(as.POSIXct("2019-01-01 00:00:00", tz="UTC"), by="1 month", length.out=12)
months
# [1] "2019-01-01 UTC" "2019-02-01 UTC" "2019-03-01 UTC" "2019-04-01 UTC" "2019-05-01 UTC" "2019-06-01 UTC" "2019-07-01 UTC"
# [8] "2019-08-01 UTC" "2019-09-01 UTC" "2019-10-01 UTC" "2019-11-01 UTC" "2019-12-01 UTC"
do.call(rbind, Map(fun, months[-12], months[-1]))
# start end downtime uptime A
# 1 2019-01-01 2019-02-01 0.0000 744.0000 1.0000000
# 2 2019-02-01 2019-03-01 26.5525 645.4475 0.9604874
# 3 2019-03-01 2019-04-01 0.0000 744.0000 1.0000000
# 4 2019-04-01 2019-05-01 0.0000 720.0000 1.0000000
# 5 2019-05-01 2019-06-01 70.5000 673.5000 0.9052419
# 6 2019-06-01 2019-07-01 0.0000 720.0000 1.0000000
# 7 2019-07-01 2019-08-01 0.0000 744.0000 1.0000000
# 8 2019-08-01 2019-09-01 0.0000 744.0000 1.0000000
# 9 2019-09-01 2019-10-01 0.0000 720.0000 1.0000000
# 10 2019-10-01 2019-11-01 0.0000 744.0000 1.0000000
# 11 2019-11-01 2019-12-01 0.0000 720.0000 1.0000000
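The February row can be verified by hand from the definition of A. The sketch below uses only base R; the variable names are illustrative, and the outage timestamps are the first row of the question's sample data.

```r
# Outage taken from the first row of the sample data (epoch seconds, UTC)
start <- as.POSIXct(1550048400, origin = "1970-01-01", tz = "UTC")
end   <- as.POSIXct(1550143989, origin = "1970-01-01", tz = "UTC")

# Month boundaries for February 2019
month_start <- as.POSIXct("2019-02-01", tz = "UTC")
month_end   <- as.POSIXct("2019-03-01", tz = "UTC")

# Explicit units keep every quantity in hours
downtime <- as.numeric(difftime(end, start, units = "hours"))              # 26.5525
total    <- as.numeric(difftime(month_end, month_start, units = "hours"))  # 672
uptime   <- total - downtime                                               # 645.4475

A <- uptime / (uptime + downtime)
round(A, 7)
# 0.9604874
```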
If you are trying to calculate the value of 'A' for each month, then the process would be:
sum up all the down time in each month
subtract that from the total time in the month to get the uptime
divide the uptime by the total time in the month
This is possible using the lubridate package:
library(lubridate)
library(dplyr)
data <- data %>%
mutate(downtime = end-start,
month = format(end, "%Y-%m %b"),
month_time = ceiling_date(end,
unit = "months") - floor_date(end,
unit = "months")) %>%
group_by(month) %>%
summarise(downtime = sum(downtime),
month_time = month_time[1]) %>%
mutate(uptime = month_time - downtime,
A = as.numeric(uptime) / as.numeric(uptime + downtime))

Adding dates and times to event durations

As an addition to this question, is it possible to add when an event started and when it finished in another column(s)?
Here is a reproducible example pulled from the OP.
df <- structure(list(Time = structure(c(1463911500, 1463911800, 1463912100,
1463912400, 1463912700, 1463913000), class = c("POSIXct", "POSIXt"
), tzone = ""), Temp = c(20.043, 20.234, 6.329, 20.424, 20.615,
20.805)), row.names = c(NA, -6L), class = "data.frame")
> df
Time Temp
1 2016-05-22 12:05:00 20.043
2 2016-05-22 12:10:00 20.234
3 2016-05-22 12:15:00 6.329
4 2016-05-22 12:20:00 20.424
5 2016-05-22 12:25:00 20.615
6 2016-05-22 12:30:00 20.805
library(dplyr)
df %>%
# add id for different periods/events
mutate(tmp_Temp = Temp > 20, id = data.table::rleid(tmp_Temp)) %>% # rleid() comes from data.table
# keep only periods with high temperature
filter(tmp_Temp) %>%
# for each period/event, get its duration
group_by(id) %>%
summarise(event_duration = difftime(last(Time), first(Time)))
id event_duration
<int> <time>
1 1 5 mins
2 3 10 mins
That is, the result should have two more columns: "start_DateTime" and "end_DateTime".
Thanks!
Sure. Modify the final summarise() like this:
df %>%
# add id for different periods/events
mutate(tmp_Temp = Temp > 20, id = data.table::rleid(tmp_Temp)) %>% # rleid() comes from data.table
# keep only periods with high temperature
filter(tmp_Temp) %>%
# for each period/event, get its duration
group_by(id) %>%
summarise(event_duration = difftime(last(Time), first(Time)),
start_DateTime = min(Time),
end_DateTime = max(Time))
#> # A tibble: 2 × 4
#> id event_duration start_DateTime end_DateTime
#> <int> <drtn> <dttm> <dttm>
#> 1 1 5 mins 2016-05-22 12:05:00 2016-05-22 12:10:00
#> 2 3 10 mins 2016-05-22 12:20:00 2016-05-22 12:30:00
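The ids 1 and 3 in the output come from run-length numbering of the Temp > 20 flag. If data.table is not available, the same ids can be sketched with base R's rle(). The names above, runs and id are illustrative.

```r
Temp  <- c(20.043, 20.234, 6.329, 20.424, 20.615, 20.805)  # from the example data
above <- Temp > 20

# rle() finds runs of equal values; repeating each run's index by its length
# gives one id per observation
runs <- rle(above)
id   <- rep(seq_along(runs$lengths), runs$lengths)
id
# 1 1 2 3 3 3
```

Run 2 is the single low-temperature reading, which is why it disappears after filter(tmp_Temp) and only ids 1 and 3 remain.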

diff time r studio is not giving the outcome properly

I'm trying to calculate the time difference in minutes between two columns with difftime, and the results don't seem right.
BLOCK_DATE_TIME.x     a1_0                  diff
2019-04-26 19:07:00   2019-04-27 09:00:00   773
2019-08-27 08:30:00   2019-08-27 09:00:00   -30
the code to calculate diff is the following:
test$diff <- difftime(test$a1_0,test$BLOCK_DATE_TIME.x,units = "mins")
The desired outcome is 833 for the first row and 30 for the second.
Thanks in advance
It is possible that the columns are not of a date-time class. If we convert them to POSIXct, it works as expected:
library(dplyr)
library(lubridate)
test1 <- test %>%
mutate(across(c(BLOCK_DATE_TIME.x, a1_0), ymd_hms),
diff = difftime(a1_0, BLOCK_DATE_TIME.x, units = "mins"))
test1
# BLOCK_DATE_TIME.x a1_0 diff
#1 2019-04-26 19:07:00 2019-04-27 09:00:00 833 mins
#2 2019-08-27 08:30:00 2019-08-27 09:00:00 30 mins
data
test <- structure(list(BLOCK_DATE_TIME.x = c("2019-04-26 19:07:00",
"2019-08-27 08:30:00"
), a1_0 = c("2019-04-27 09:00:00", "2019-08-27 09:00:00")), row.names = c(NA,
-2L), class = "data.frame")
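The same conversion can be sketched in base R without lubridate. The explicit format string and tz = "UTC" are assumptions, since the question does not state a time zone. Fixing the zone keeps daylight-saving shifts from changing the result, and passing the later time first keeps the differences positive.

```r
test <- data.frame(
  BLOCK_DATE_TIME.x = c("2019-04-26 19:07:00", "2019-08-27 08:30:00"),
  a1_0              = c("2019-04-27 09:00:00", "2019-08-27 09:00:00")
)

fmt   <- "%Y-%m-%d %H:%M:%S"
start <- as.POSIXct(test$BLOCK_DATE_TIME.x, format = fmt, tz = "UTC")
end   <- as.POSIXct(test$a1_0, format = fmt, tz = "UTC")

# Later time first, explicit unit, then drop the difftime class
test$diff <- as.numeric(difftime(end, start, units = "mins"))
test$diff
# 833 30
```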

Converting dates to hours in R

I have a start and end date for individuals, and I need to determine whether the time elapsed from start to end is within 2 days or is 3 or more days. These dates are assigned to record ids. How can I filter the records that ended within 2 days of the start date, and the ones that ended 3 or more days after it?
Record_id <- c("2245","6728","5122","9287")
Start <- c("2021-01-13 CST" ,"2021-01-21 CST" ,"2021-01-17 CST","2021-01-13 CST")
End <- c("2021-01-21 18:00:00 CST", "2021-01-22 16:00:00 CST", "2021-01-22 13:00:00 CST","2021-01-25 15:00:00 CST")
I tried using
elapsed.time <- DF$start %--% DF$End
time.duration <- as.duration(elapsed.time)
but I am getting an error because the End date contains hours. Thank you.
Here's a dplyr pipe that will include both constraints (2 and 3 days):
df %>%
mutate(across(Start:End, as.POSIXct)) %>%
mutate(d = difftime(End, Start, units = "days")) %>%
filter(!between(difftime(End, Start, units = "days"), 2, 3))
# # A tibble: 4 x 4
# Record_id Start End d
# <chr> <dttm> <dttm> <drtn>
# 1 2245 2021-01-13 00:00:00 2021-01-21 18:00:00 8.750000 days
# 2 6728 2021-01-21 00:00:00 2021-01-22 16:00:00 1.666667 days
# 3 5122 2021-01-17 00:00:00 2021-01-22 13:00:00 5.541667 days
# 4 9287 2021-01-13 00:00:00 2021-01-25 15:00:00 12.625000 days
I included mutate(d= so that we can see what the actual differences are. If you were looking to remove those, then use filter(between(..)) (no !).
In the case of the data you provided, all observations are less than 2 or more than 3 days. I'll expand this range so that we can see it in effect:
df %>%
mutate(across(Start:End, as.POSIXct)) %>%
mutate(d = difftime(End, Start, units = "days")) %>%
filter(!between(difftime(End, Start, units = "days"), 1, 6))
# # A tibble: 2 x 4
# Record_id Start End d
# <chr> <dttm> <dttm> <drtn>
# 1 2245 2021-01-13 00:00:00 2021-01-21 18:00:00 8.750 days
# 2 9287 2021-01-13 00:00:00 2021-01-25 15:00:00 12.625 days
Data
df <- structure(list(Record_id = c("2245", "6728", "5122", "9287"), Start = c("2021-01-13 CST", "2021-01-21 CST", "2021-01-17 CST", "2021-01-13 CST"), End = c("2021-01-21 18:00:00 CST", "2021-01-22 16:00:00 CST", "2021-01-22 13:00:00 CST", "2021-01-25 15:00:00 CST")), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
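To label each record rather than filter it, the elapsed days can be bucketed directly. This is a base R sketch under two assumptions: the trailing "CST" label is dropped and both columns are parsed in one fixed zone (UTC here), and the bucket labels are made up for illustration.

```r
Start <- c("2021-01-13 CST", "2021-01-21 CST", "2021-01-17 CST", "2021-01-13 CST")
End   <- c("2021-01-21 18:00:00 CST", "2021-01-22 16:00:00 CST",
           "2021-01-22 13:00:00 CST", "2021-01-25 15:00:00 CST")

# Drop the zone label and parse both columns in the same fixed zone (assumption)
strip_zone <- function(x) sub(" CST$", "", x)
start <- as.POSIXct(strip_zone(Start), format = "%Y-%m-%d", tz = "UTC")
end   <- as.POSIXct(strip_zone(End), format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Elapsed time in days as a plain number
days <- as.numeric(difftime(end, start, units = "days"))

# Illustrative labels; anything strictly between 2 and 3 days gets its own bucket
bucket <- ifelse(days <= 2, "within 2 days",
                 ifelse(days >= 3, "3+ days", "between 2 and 3 days"))
bucket
# "3+ days" "within 2 days" "3+ days" "3+ days"
```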
I just converted the character to a date time with lubridate and then subtracted the dates. What you'll get back are days. I then filter for dates that are within 2 days.
Record_id<- c("2245","6728","5122","9287")
Start<-c("2021-01-13 CST" ,"2021-01-21 CST" ,"2021-01-17 CST","2021-01-13 CST")
End<-c("2021-01-21 18:00:00 CST", "2021-01-22 16:00:00 CST", "2021-01-22 13:00:00 CST","2021-01-25 15:00:00 CST")
df <- dplyr::tibble(x = Record_id, y = Start, z = End)
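library(dplyr)  # provides %>% and vars()
df %>%
dplyr::mutate_at(vars(y:z), ~ lubridate::as_datetime(.)) %>%
# request days explicitly; plain z - y lets R pick the unit
dplyr::mutate(diff = as.numeric(difftime(z, y, units = "days"))) %>%
dplyr::filter(diff <= 2 )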
