I have a dataframe with two columns that represent the start of an event and the planned end of the event.
What is the best way to add a column showing the duration in days of each event in the dataframe?
Another alternative would be to directly create a new dataset from it using the group_by function, in which I could see the average duration of a campaign for each day, but that seems too complicated.
structure(list(launched_at = c("03/26/2021", "03/24/2021", "01/05/2021",
"02/17/2021", "02/15/2021", "02/25/2021"), deadline = c("04/25/2021",
"04/08/2021", "01/17/2021", "03/03/2021", "03/01/2021", "04/26/2021"
)), row.names = c(NA, 6L), class = "data.frame")
We could use the mdy function from the lubridate package:
library(lubridate)
library(dplyr)
df %>%
  mutate(across(everything(), mdy),   # only needed if the columns are not already Dates
         duration_days = as.integer(deadline - launched_at))
launched_at deadline duration_days
1 2021-03-26 2021-04-25 30
2 2021-03-24 2021-04-08 15
3 2021-01-05 2021-01-17 12
4 2021-02-17 2021-03-03 14
5 2021-02-15 2021-03-01 14
6 2021-02-25 2021-04-26 60
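The question also mentions, as an alternative, seeing the average campaign duration for each launch day. A hedged sketch building on the same conversion (avg_duration_days is just an illustrative column name):
df %>%
  mutate(across(everything(), mdy),                        # convert both date columns
         duration_days = as.integer(deadline - launched_at)) %>%
  group_by(launched_at) %>%                                # one row per launch date
  summarise(avg_duration_days = mean(duration_days), .groups = "drop")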
One option:
as.POSIXct(df$deadline, tz = "UTC", format = "%m/%d/%Y") -
  as.POSIXct(df$launched_at, tz = "UTC", format = "%m/%d/%Y")
Time differences in days
[1] 30 15 12 14 14 60
If you're looking for the duration between 'launched_at' and 'deadline':
library(dplyr)
df %>%
  mutate(launched_at = as.Date(launched_at, "%m/%d/%Y"),
         deadline = as.Date(deadline, "%m/%d/%Y"),
         duration = deadline - launched_at)
launched_at deadline duration
1 2021-03-26 2021-04-25 30 days
2 2021-03-24 2021-04-08 15 days
3 2021-01-05 2021-01-17 12 days
4 2021-02-17 2021-03-03 14 days
5 2021-02-15 2021-03-01 14 days
6 2021-02-25 2021-04-26 60 days
A more concise way (credit to @Darren Tsai):
df %>%
  mutate(across(c(launched_at, deadline), ~ as.Date(.x, "%m/%d/%Y")),
         duration = deadline - launched_at)
You can use the built-in functions within and as.Date:
df = within(df, {
  launched_at = as.Date(launched_at, "%m/%d/%Y")
  deadline = as.Date(deadline, "%m/%d/%Y")
  duration = deadline - launched_at})
  launched_at   deadline duration
1  2021-03-26 2021-04-25  30 days
2  2021-03-24 2021-04-08  15 days
3  2021-01-05 2021-01-17  12 days
4  2021-02-17 2021-03-03  14 days
5  2021-02-15 2021-03-01  14 days
6  2021-02-25 2021-04-26  60 days
Another option using difftime:
df <- structure(list(launched_at = c("03/26/2021", "03/24/2021", "01/05/2021",
"02/17/2021", "02/15/2021", "02/25/2021"), deadline = c("04/25/2021",
"04/08/2021", "01/17/2021", "03/03/2021", "03/01/2021", "04/26/2021"
)), row.names = c(NA, 6L), class = "data.frame")
df$duration <- with(df, difftime(as.Date(deadline, "%m/%d/%Y"), as.Date(launched_at, "%m/%d/%Y"), units = c("days")))
df
#> launched_at deadline duration
#> 1 03/26/2021 04/25/2021 30 days
#> 2 03/24/2021 04/08/2021 15 days
#> 3 01/05/2021 01/17/2021 12 days
#> 4 02/17/2021 03/03/2021 14 days
#> 5 02/15/2021 03/01/2021 14 days
#> 6 02/25/2021 04/26/2021 60 days
Created on 2022-07-22 by the reprex package (v2.0.1)
I am having trouble figuring out how to account for and sum continuous time observations across multiple dates and time events in my dataset. A similar question is found here, but it only accounts for one instance of a continuous time event. I have a dataset with multiple date and time combinations. Here is an example from that dataset, which I am manipulating in R:
date.1 <- c("2021-07-21", "2021-07-21", "2021-07-21", "2021-07-29", "2021-07-29", "2021-07-30", "2021-08-01","2021-08-01","2021-08-01")
time.1 <- c("15:57:59", "15:58:00", "15:58:01", "15:46:10", "15:46:13", "18:12:10", "18:12:10","18:12:11","18:12:13")
df <- data.frame(date.1, time.1)
df
date.1 time.1
1 2021-07-21 15:57:59
2 2021-07-21 15:58:00
3 2021-07-21 15:58:01
4 2021-07-29 15:46:10
5 2021-07-29 15:46:13
6 2021-07-30 18:12:10
7 2021-08-01 18:12:10
8 2021-08-01 18:12:11
9 2021-08-01 18:12:13
I tried the following script from the linked question:
df$missingflag <- c(1, diff(as.POSIXct(df$time.1, format="%H:%M:%S", tz="UTC"))) > 1
df
date.1 time.1 missingflag
1 2021-07-21 15:57:59 FALSE
2 2021-07-21 15:58:00 TRUE
3 2021-07-21 15:58:01 FALSE
4 2021-07-29 15:46:10 FALSE
5 2021-07-29 15:46:13 TRUE
6 2021-07-30 18:12:10 TRUE
7 2021-08-01 18:12:10 FALSE
8 2021-08-01 18:12:11 FALSE
9 2021-08-01 18:12:13 TRUE
But it did not work as anticipated and did not get me closer to my answer. It would only have been an intermediate step and probably wouldn't have answered my question.
The GOAL would be to account for all the continuous time observations and put them into a new table like this:
date.1 time.1 secs
1 2021-07-21 15:57:59 3
4 2021-07-29 15:46:10 1
5 2021-07-29 15:46:13 1
6 2021-07-30 18:12:10 1
7 2021-08-01 18:12:10 2
9 2021-08-01 18:12:13 1
You will see that the start time of each continuous time observation is recorded, along with the total number of seconds (secs) observed since the start of that continuous observation. The script would need to account for date.1, as there are multiple dates in the dataset.
Thank you in advance.
You can create a datetime object by combining the date and time columns, take the difference of consecutive values, and create groups in which all timestamps 1 second apart belong to the same group. For each group, count the number of rows and take the first datetime value.
library(dplyr)
library(tidyr)
df %>%
  unite(datetime, date.1, time.1, sep = ' ') %>%
  mutate(datetime = lubridate::ymd_hms(datetime)) %>%
  group_by(grp = cumsum(difftime(datetime,
                                 lag(datetime, default = first(datetime)),
                                 units = 'secs') > 1)) %>%
  summarise(datetime = first(datetime),
            secs = n(), .groups = 'drop') %>%
  select(-grp)
# datetime secs
# <dttm> <int>
#1 2021-07-21 15:57:59 3
#2 2021-07-29 15:46:10 1
#3 2021-07-29 15:46:13 1
#4 2021-07-30 18:12:10 1
#5 2021-08-01 18:12:10 2
#6 2021-08-01 18:12:13 1
I have kept datetime as a single combined column here, but if needed you can split it back into two columns using
%>% separate(datetime, c('date', 'time'), sep = ' ')
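For completeness, a sketch of the full pipe with that split appended (date and time as the new column names, as in the snippet above):
df %>%
  unite(datetime, date.1, time.1, sep = ' ') %>%
  mutate(datetime = lubridate::ymd_hms(datetime)) %>%
  group_by(grp = cumsum(difftime(datetime, lag(datetime, default = first(datetime)),
                                 units = 'secs') > 1)) %>%
  summarise(datetime = first(datetime), secs = n(), .groups = 'drop') %>%
  select(-grp) %>%
  mutate(datetime = format(datetime)) %>%          # back to character before splitting
  separate(datetime, c('date', 'time'), sep = ' ')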
I'd like to count, using R, how many days from the given list:
2020-10-01
2020-10-03
2020-10-07
2020-10-08
2020-10-09
2020-10-10
2020-10-14
2020-10-17
2020-10-21
2020-10-22
2020-10-27
2020-10-29
2020-10-30
were within a given period from start to end:
id start end
1 2020-10-05 2020-10-30
2 2020-10-06 2020-10-29
3 2020-10-10 2020-10-12
And the result should be for example:
id number of days
1 5
2 18
3 12
Here is a tidyverse approach with lubridate and dplyr.
library(lubridate)
library(dplyr)
df %>%
  count(id, start, end,
        wt = days %within% interval(start, end),
        name = "number_of_days")
#> id start end number_of_days
#> 1 1 2020-10-05 2020-10-30 11
#> 2 2 2020-10-06 2020-10-29 10
#> 3 3 2020-10-10 2020-10-12 1
For each row, count the number of days within the interval of start and end (extremes included).
(If you don't want to see start and end just remove them from the first line of count)
Where:
days <- c("2020-10-01",
"2020-10-03",
"2020-10-07",
"2020-10-08",
"2020-10-09",
"2020-10-10",
"2020-10-14",
"2020-10-17",
"2020-10-21",
"2020-10-22",
"2020-10-27",
"2020-10-29",
"2020-10-30")
df <- read.table(text = " id start end
1 2020-10-05 2020-10-30
2 2020-10-06 2020-10-29
3 2020-10-10 2020-10-12", header = TRUE)
days <- as.Date(days)
df$start <- as.Date(df$start)
df$end <- as.Date(df$end)
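As noted above, if you only want id and the day count in the output, a minimal variant of the same call is:
df %>%
  count(id, wt = days %within% interval(start, end), name = "number_of_days")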
Assuming all the dates are of Date class, you can use mapply (here df1 holds the list of dates in a dates column and df2 holds the start/end periods):
df2$num_days <- mapply(function(x, y) sum(df1$dates >= x & df1$dates <= y), df2$start, df2$end)
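Applied to the objects defined above (days and df, already converted to Date), the same idea reads, as a sketch:
df$number_of_days <- mapply(function(x, y) sum(days >= x & days <= y), df$start, df$end)
df
#   id      start        end number_of_days
# 1  1 2020-10-05 2020-10-30             11
# 2  2 2020-10-06 2020-10-29             10
# 3  3 2020-10-10 2020-10-12              1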
Is there a way to calculate a "task time" for working hours only? Working hours are 8 to 5, Monday through Friday. (The original post showed an example using datediff() and the expected result as screenshots.) Sample task times:
df %>%
select(v_v_initiated,v_v_complete)
v_v_initiated v_v_complete
1 2020-04-23 14:13:52.0000000 2020-04-23 16:04:28.0000000
2 2020-11-10 11:48:53.0000000 2020-11-10 13:12:31.0000000
3 2020-10-20 16:03:39.0000000 2020-10-20 16:25:16.0000000
4 2020-04-02 13:43:54.0000000 2020-04-02 14:14:45.0000000
5 2020-07-09 08:52:54.0000000 2020-07-23 09:18:29.0000000
6 2020-06-09 14:56:33.0000000 2020-06-10 07:44:17.0000000
7 2020-09-17 15:11:39.0000000 2020-09-17 15:13:41.0000000
8 2020-10-28 14:08:20.0000000 2020-10-28 14:07:35.0000000
9 2020-04-21 12:55:36.0000000 2020-04-27 12:56:17.0000000
10 2020-11-06 11:02:03.0000000 2020-11-06 11:02:30.0000000
11 2020-02-17 12:29:21.0000000 2020-02-18 12:52:23.0000000
12 2020-08-25 15:25:46.0000000 2020-08-26 10:18:26.0000000
13 2020-02-19 15:05:28.0000000 2020-02-20 09:43:48.0000000
14 2020-09-23 21:19:41.0000000 2020-09-24 14:52:21.0000000
15 2020-07-01 14:20:11.0000000 2020-07-01 14:20:59.0000000
16 2020-05-01 15:22:58.0000000 2020-05-01 16:32:35.0000000
17 2020-06-29 13:10:58.0000000 2020-06-30 13:53:29.0000000
18 2020-06-16 12:56:54.0000000 2020-06-16 14:27:15.0000000
19 2020-03-27 11:02:29.0000000 2020-03-30 15:18:51.0000000
20 2020-04-08 07:38:01.0000000 2020-04-08 07:52:35.0000000
21 2020-07-30 09:32:42.0000000 2020-07-30 10:32:28.0000000
22 2020-06-17 14:03:31.0000000 2020-07-10 15:38:03.0000000
23 2020-04-24 10:41:27.0000000 2020-04-29 13:07:05.0000000
24 2020-08-26 10:41:10.0000000 2020-08-26 12:55:23.0000000
25 2020-10-26 18:11:16.0000000 2020-10-27 16:10:39.0000000
26 2020-01-08 11:12:49.0000000 2020-01-09 09:18:37.0000000
27 2020-04-17 11:40:10.0000000 2020-04-17 15:51:21.0000000
28 2020-02-11 10:38:21.0000000 2020-02-11 10:33:54.0000000
29 2020-03-23 12:10:21.0000000 2020-03-23 12:33:06.0000000
30 2020-06-02 12:44:00.0000000 2020-06-03 08:28:05.0000000
31 2020-04-13 09:30:31.0000000 2020-04-13 13:16:55.0000000
32 2020-04-07 17:36:02.0000000 2020-04-07 17:36:44.0000000
33 2020-01-15 12:24:42.0000000 2020-01-15 12:25:00.0000000
34 2020-08-18 08:55:58.0000000 2020-08-18 09:02:34.0000000
35 2020-07-06 14:10:23.0000000 2020-07-07 10:28:05.0000000
36 2020-03-25 15:03:20.0000000 2020-03-31 14:17:43.0000000
37 2020-01-29 12:58:33.0000000 2020-02-14 09:53:06.0000000
38 2020-02-07 15:11:21.0000000 2020-02-10 09:13:53.0000000
39 2020-07-27 17:51:13.0000000 2020-07-29 11:52:51.0000000
40 2020-09-02 11:43:02.0000000 2020-09-02 13:10:46.0000000
41 2020-07-22 11:04:50.0000000 2020-07-22 11:12:34.0000000
42 2020-06-29 13:57:17.0000000 2020-06-30 07:34:55.0000000
43 2020-07-21 10:46:58.0000000 2020-07-21 16:15:59.0000000
44 2020-05-27 07:38:46.0000000 2020-05-27 07:51:24.0000000
45 2020-07-14 10:33:49.0000000 2020-07-14 11:38:28.0000000
46 2020-06-04 16:59:09.0000000 2020-06-09 10:49:20.0000000
You could adapt another function that calculates business hours for a time interval (such as this one).
First, create a sequence of dates from start to end, and keep only the weekdays.
Next, create time intervals using the business hours of interest (in this case, "08:00" to "17:00").
Determine how much of each day's business hours overlap with your times. This way, if a time starts at "09:05", that time will be used for the start of that day instead of "08:00".
Finally, sum up the time intervals and determine the number of business days (assuming a 9-hour day), plus the remaining hours and minutes.
If you want to apply this function to rows in a data frame, you could use mapply as in:
df$business_hours <- mapply(calc_bus_hours, df$start_date, df$end_date)
Hope this is helpful.
library(lubridate)
library(dplyr)
calc_bus_hours <- function(start, end) {
  # all calendar dates covered by the interval, weekends dropped
  my_dates <- seq.Date(as.Date(start), as.Date(end), by = "day")
  my_dates <- my_dates[!weekdays(my_dates) %in% c("Saturday", "Sunday")]
  # one 08:00-17:00 interval per business day
  my_intervals <- interval(ymd_hm(paste(my_dates, "08:00"), tz = "UTC"),
                           ymd_hm(paste(my_dates, "17:00"), tz = "UTC"))
  # clip the first and last intervals to the actual start and end times
  int_start(my_intervals[1]) <- pmax(pmin(start, int_end(my_intervals[1])), int_start(my_intervals[1]))
  int_end(my_intervals[length(my_intervals)]) <- pmax(pmin(end, int_end(my_intervals[length(my_intervals)])),
                                                      int_start(my_intervals[length(my_intervals)]))
  # total business minutes, reported as 9-hour days, hours, and minutes
  total_time <- sum(time_length(my_intervals, "minutes"))
  total_days <- total_time %/% (9 * 60)
  total_hours <- total_time %% (9 * 60) %/% 60
  total_minutes <- total_time - (total_days * 9 * 60) - (total_hours * 60)
  paste(total_days, "days,", total_hours, "hours,", total_minutes, "minutes")
}
calc_bus_hours(as.POSIXct("11/4/2020 9:05", format = "%m/%d/%Y %H:%M", tz = "UTC"),
               as.POSIXct("11/9/2020 11:25", format = "%m/%d/%Y %H:%M", tz = "UTC"))
[1] "3 days, 2 hours, 20 minutes"
Edit: As mentioned by @DPH, this is more complex with holidays and partial holidays.
You could create a data frame of holidays and indicate times open, allowing for partial holidays (e.g., Christmas Eve from 8:00 AM to Noon).
Here is a modified function that should give comparable results.
library(lubridate)
library(dplyr)
holiday_df <- data.frame(
date = as.Date(c("2020-12-24", "2020-12-25", "2020-12-31", "2020-01-01")),
start = c("08:00", "08:00", "08:00", "08:00"),
end = c("12:00", "08:00", "08:00", "08:00")
)
calc_bus_hours <- function(start, end) {
  my_dates <- seq.Date(as.Date(start), as.Date(end), by = "day")
  my_dates_df <- data.frame(
    date = my_dates[!weekdays(my_dates) %in% c("Saturday", "Sunday")],
    start = "08:00",
    end = "17:00"
  )
  all_dates <- union_all(
    inner_join(my_dates_df["date"], holiday_df),
    anti_join(my_dates_df, holiday_df["date"])
  ) %>%
    arrange(date)
  my_intervals <- interval(ymd_hm(paste(all_dates$date, all_dates$start), tz = "UTC"),
                           ymd_hm(paste(all_dates$date, all_dates$end), tz = "UTC"))
  int_start(my_intervals[1]) <- pmax(pmin(start, int_end(my_intervals[1])), int_start(my_intervals[1]))
  int_end(my_intervals[length(my_intervals)]) <- pmax(pmin(end, int_end(my_intervals[length(my_intervals)])),
                                                      int_start(my_intervals[length(my_intervals)]))
  total_time <- sum(time_length(my_intervals, "minutes"))
  total_days <- total_time %/% (9 * 60)
  total_hours <- total_time %% (9 * 60) %/% 60
  total_minutes <- total_time - (total_days * 9 * 60) - (total_hours * 60)
  paste(total_days, "days,", total_hours, "hours,", total_minutes, "minutes")
}
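As a sketch of how the holiday table changes the result (the dates and times below are made up for illustration), an interval spanning December 23-24 only picks up the four open hours on Christmas Eve:
calc_bus_hours(as.POSIXct("2020-12-23 09:00", tz = "UTC"),
               as.POSIXct("2020-12-24 13:00", tz = "UTC"))
# Dec 23 09:00-17:00 = 480 min, Dec 24 capped at 08:00-12:00 = 240 min
# 720 min in total, so the expected result is "1 days, 3 hours, 0 minutes"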
I am trying to figure out how to add a row when a date range spans a calendar year. Below is a minimal reprex:
I have a data frame like this:
have <- data.frame(
from = c(as.Date('2018-12-15'), as.Date('2019-12-20'), as.Date('2019-05-13')),
to = c(as.Date('2019-06-20'), as.Date('2020-01-25'), as.Date('2019-09-10'))
)
have
#> from to
#> 1 2018-12-15 2019-06-20
#> 2 2019-12-20 2020-01-25
#> 3 2019-05-13 2019-09-10
I want a data.frame that splits into two rows when to and from span a calendar year.
want <- data.frame(
from = c(as.Date('2018-12-15'), as.Date('2019-01-01'), as.Date('2019-12-20'), as.Date('2020-01-01'), as.Date('2019-05-13')),
to = c(as.Date('2018-12-31'), as.Date('2019-06-20'), as.Date('2019-12-31'), as.Date('2020-01-25'), as.Date('2019-09-10'))
)
want
#> from to
#> 1 2018-12-15 2018-12-31
#> 2 2019-01-01 2019-06-20
#> 3 2019-12-20 2019-12-31
#> 4 2020-01-01 2020-01-25
#> 5 2019-05-13 2019-09-10
I want to do this because, for a particular row, I want to know how many days fall in each year.
want$time_diff_by_year <- difftime(want$to, want$from)
Created on 2020-05-15 by the reprex package (v0.3.0)
Any base R or tidyverse solutions would be much appreciated.
You can determine the additional years needed for your date intervals with map2, then unnest to create additional rows for each year.
Then, you can take the intersection of each date interval with the corresponding full calendar year. This keeps the partial years starting Jan 1 or ending Dec 31 of a given year.
library(tidyverse)
library(lubridate)
have %>%
  mutate(date_int = interval(from, to),
         year = map2(year(from), year(to), seq)) %>%
  unnest(year) %>%
  mutate(year_int = interval(as.Date(paste0(year, '-01-01')), as.Date(paste0(year, '-12-31'))),
         year_sect = intersect(date_int, year_int),
         from_new = as.Date(int_start(year_sect)),
         to_new = as.Date(int_end(year_sect))) %>%
  select(from_new, to_new)
Output
# A tibble: 5 x 2
from_new to_new
<date> <date>
1 2018-12-15 2018-12-31
2 2019-01-01 2019-06-20
3 2019-12-20 2019-12-31
4 2020-01-01 2020-01-25
5 2019-05-13 2019-09-10
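To get back to the stated goal (days in each year), one could append the question's own difftime step to that result; split_years below is just an assumed name for the output of the pipe above (add 1 if both endpoints should count):
split_years %>%   # split_years: the result of the pipe above, assigned to a name
  mutate(time_diff_by_year = difftime(to_new, from_new, units = "days"))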
I have a data frame df that contains different transactions. Each transaction has a start date and an end date. The two variables for this are start_time and end_time. They are of the class POSIXct.
An example of how they look are as follows "2018-05-23 23:40:00" "2018-06-24 00:10:00".
There are about 13000 transactions in df, and I want to extract all transactions that overlap at least part of the specified time interval, if not all of it. The time interval or range is 20:00:00 - 8:00:00, so basically 8 P.M. <= interval < 8 A.M.
I am trying to use dplyr and the function filter() to do this however my problem is I am not sure how to write the boolean expression. What I have written in code so far is this:
df %>% filter(hour(start_time) >= 20 | hour(start_time) < 8 | hour(end_time) >= 20 | hour(end_time) < 8)
I thought this might get all transactions that contain at least part of that interval, but then I thought about transactions that start and end outside of the interval yet last long enough to contain those hours anyway. I considered adding | duration > 12, because any transaction with a duration longer than 12 hours must contain part of that time interval. However, this code feels unnecessarily long, and there must be a simpler way that I don't know of.
I'll start with a sample data frame, since a sample df isn't given in the question:
library(lubridate)
library(dplyr)
set.seed(69)
dates <- as.POSIXct("2020-04-01") + days(sample(30, 10, TRUE))
start_time <- dates + seconds(sample(86400, 10, TRUE))
end_time <- start_time + seconds(sample(50000, 10, TRUE))
df <- data.frame(Transaction = LETTERS[1:10], start_time, end_time)
df
#> Transaction start_time end_time
#> 1 A 2020-04-18 16:51:03 2020-04-19 00:05:54
#> 2 B 2020-04-28 21:32:10 2020-04-29 06:18:06
#> 3 C 2020-04-03 02:12:52 2020-04-03 06:11:20
#> 4 D 2020-04-17 19:15:43 2020-04-17 21:01:52
#> 5 E 2020-04-09 11:36:19 2020-04-09 19:01:14
#> 6 F 2020-04-14 20:51:25 2020-04-15 06:08:10
#> 7 G 2020-04-08 12:01:55 2020-04-09 01:45:53
#> 8 H 2020-04-16 01:43:38 2020-04-16 04:22:39
#> 9 I 2020-04-08 23:11:51 2020-04-09 09:04:26
#> 10 J 2020-04-07 12:28:08 2020-04-07 12:55:42
We can enumerate the possibilities for a match as follows:
Any start time before 08:00 or at/after 20:00
Any stop time before 08:00 or at/after 20:00
The stop and start times are on different dates.
Using a little modular math, we can write this as:
df %>% filter((hour(start_time) + 12) %% 20 > 11 |
              (hour(end_time) + 12) %% 20 > 11 |
              date(start_time) != date(end_time))
#> Transaction start_time end_time
#> 1 A 2020-04-18 16:51:03 2020-04-19 00:05:54
#> 2 B 2020-04-28 21:32:10 2020-04-29 06:18:06
#> 3 C 2020-04-03 02:12:52 2020-04-03 06:11:20
#> 4 D 2020-04-17 19:15:43 2020-04-17 21:01:52
#> 5 F 2020-04-14 20:51:25 2020-04-15 06:08:10
#> 6 G 2020-04-08 12:01:55 2020-04-09 01:45:53
#> 7 H 2020-04-16 01:43:38 2020-04-16 04:22:39
#> 8 I 2020-04-08 23:11:51 2020-04-09 09:04:26
You can check that all the times are at least partly within the given range, and that the two removed rows are not.
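For intuition on the modular expression, a quick check shows that it flags exactly the hours at or after 20:00 and before 08:00:
h <- 0:23
h[(h + 12) %% 20 > 11]
#> [1]  0  1  2  3  4  5  6  7 20 21 22 23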