I've been looking for answers and messing around with my code for a couple hours. I have a dataset that looks like the following for a specific ID:
# A tibble: 14 × 3
ID state orderDate
<dbl> <chr> <dttm>
1 4227631 1 2022-03-14 19:00:00
2 4227631 1 2022-03-14 20:00:00
3 4227631 1 2022-03-15 11:00:00
4 4227631 0 2022-03-15 11:00:00
5 4227631 1 2022-03-15 20:00:00
6 4227631 1 2022-03-16 04:00:00
7 4227631 0 2022-03-16 04:00:00
8 4227631 1 2022-03-16 05:00:00
9 4227631 0 2022-03-16 13:00:00
10 4227631 1 2022-03-16 15:00:00
This occurs for hundreds of IDs. For this example, I am using dplyr to group_by ID. I only care when state changes between values, not when it stays the same.
I want to calculate the cumulative time each ID remains in state 1. Instances where state 1 is repeated multiple times before it changes should be ignored. I have been planning to use lubridate and dplyr to perform the analysis.
Tibble I am using for this example:
structure(list(ID = c(4227631, 4227631, 4227631, 4227631, 4227631,
4227631, 4227631, 4227631, 4227631, 4227631), state = c("1",
"1", "1", "0", "1", "1", "0", "1", "0", "1"), orderDate = structure(c(1647284400,
1647288000, 1647342000, 1647342000, 1647374400, 1647403200, 1647403200,
1647406800, 1647435600, 1647442800), tzone = "UTC", class = c("POSIXct",
"POSIXt"))), row.names = c(NA, -10L), class = c("tbl_df", "tbl",
"data.frame"))
I've tried various solutions, such as "Cumulative time with reset"; however, I'm having trouble with lag and with incorporating it into this specific analysis.
The expected output would maybe look something like this:
And then I would plan to sum all statusOne together to figure out cumulative time spent in this state.
More elegant solutions are welcome, as is a link to a prior question that covers this.
EDIT
Using the solution below, I figured it out!
The solution didn't cover the situations where state 0 immediately followed state 1, and we wanted the total time elapsed between these states as well.
df %>%
  group_by(ID) %>%
  mutate(max = cumsum(ifelse(state == lag(state, default = "1"), 0, 1))) %>%
  mutate(hours1 = ifelse(max == lag(max) &
                           state == "1", difftime(orderDate, lag(orderDate), units = "h"), NA)) %>%
  mutate(hours2 = ifelse(state == "0" & lag(state) == "1",
                         difftime(orderDate, lag(orderDate), units = "h"), NA)) %>%
  mutate(hours1 = replace_na(hours1, 0),
         hours2 = replace_na(hours2, 0)) %>%
  mutate(hours = hours1 + hours2) %>%
  select(-hours1, -hours2) %>%
  summarise(total_hours = sum(hours, na.rm = TRUE)) %>%
  filter(total_hours != 0)
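One way to read the total computed above: every row in state 1 contributes the time until the next row of the same ID, whatever that next row's state is. Below is a minimal base R sketch of that reading. It assumes the rows are already in time order within each ID, and the helper column hours_to_next is a name made up here.

```r
# Example data from the question (one ID, already in time order).
df <- data.frame(
  ID = 4227631,
  state = c("1", "1", "1", "0", "1", "1", "0", "1", "0", "1"),
  orderDate = as.POSIXct(
    c(1647284400, 1647288000, 1647342000, 1647342000, 1647374400,
      1647403200, 1647403200, 1647406800, 1647435600, 1647442800),
    origin = "1970-01-01", tz = "UTC"
  )
)

# Timestamp of the next row within the same ID (NA for each ID's last row).
secs <- as.numeric(df$orderDate)
next_secs <- ave(secs, df$ID, FUN = function(t) c(t[-1], NA))

# Hours from each row to the next one (hypothetical helper column).
df$hours_to_next <- (next_secs - secs) / 3600

# Sum the gaps that start in state 1, per ID.
in_one <- df$state == "1"
total_hours <- tapply(df$hours_to_next[in_one], df$ID[in_one], sum, na.rm = TRUE)
total_hours
```

For the example ID this gives 32 hours (1 + 15 + 0 + 8 + 0 + 8). The final row is in state 1 but has no successor, so it contributes nothing.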
This is far from elegant, but at least it appears to provide the correct answer:
library(tidyverse)
df <- structure(list(ID = c(4227631, 4227631, 4227631, 4227631, 4227631,
4227631, 4227631, 4227631, 4227631, 4227631),
state = c("1", "1", "1", "0", "1", "1", "0", "1", "0", "1"),
orderDate = structure(c(1647284400, 1647288000, 1647342000,
1647342000, 1647374400, 1647403200,
1647403200, 1647406800, 1647435600,
1647442800),
tzone = "UTC",
class = c("POSIXct", "POSIXt"))),
row.names = c(NA, -10L),
class = c("tbl_df", "tbl", "data.frame"))
df2 <- df %>%
  group_by(ID) %>%
  mutate(tmp = ifelse(state == lag(state, default = "1"), 0, 1),
         max = cumsum(tmp)) %>%
  mutate(hours = ifelse(max == lag(max), difftime(orderDate, lag(orderDate), units = "h"), NA)) %>%
  select(-tmp)
df3 <- df2 %>%
  group_by(max) %>%
  summarise(max, statusOne = sum(hours, na.rm = TRUE))
df4 <- left_join(df2, df3, by = "max") %>%
  distinct() %>%
  select(-c(max, hours)) %>%
  mutate(statusOne = ifelse(statusOne != 0 & lag(statusOne, default = 1) == statusOne, 0, statusOne))
df4
#> # A tibble: 10 × 4
#> # Groups: ID [1]
#> ID state orderDate statusOne
#> <dbl> <chr> <dttm> <dbl>
#> 1 4227631 1 2022-03-14 19:00:00 16
#> 2 4227631 1 2022-03-14 20:00:00 0
#> 3 4227631 1 2022-03-15 11:00:00 0
#> 4 4227631 0 2022-03-15 11:00:00 0
#> 5 4227631 1 2022-03-15 20:00:00 8
#> 6 4227631 1 2022-03-16 04:00:00 0
#> 7 4227631 0 2022-03-16 04:00:00 0
#> 8 4227631 1 2022-03-16 05:00:00 0
#> 9 4227631 0 2022-03-16 13:00:00 0
#> 10 4227631 1 2022-03-16 15:00:00 0
Created on 2022-04-04 by the reprex package (v2.0.1)
Edit
It's a lot more straightforward to get the total hours in state 1 for each ID:
df %>%
  group_by(ID) %>%
  mutate(max = cumsum(ifelse(state == lag(state, default = "1"), 0, 1))) %>%
  mutate(hours = ifelse(max == lag(max), difftime(orderDate, lag(orderDate), units = "h"), NA)) %>%
  summarise(total_hours = sum(hours, na.rm = TRUE))
#> # A tibble: 1 × 2
#> ID total_hours
#> <dbl> <dbl>
#> 1 4227631 24
Created on 2022-04-04 by the reprex package (v2.0.1)
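The run-numbering idea behind the cumsum column can be sketched in base R as well: number each run of identical consecutive states, take the span of each run, and add up the spans of the state-1 runs. This is only an illustration for a single ID that is already in time order; run_id, run_hours and run_state are names made up here.

```r
state <- c("1", "1", "1", "0", "1", "1", "0", "1", "0", "1")
times <- as.POSIXct(
  c(1647284400, 1647288000, 1647342000, 1647342000, 1647374400,
    1647403200, 1647403200, 1647406800, 1647435600, 1647442800),
  origin = "1970-01-01", tz = "UTC"
)

# A new run starts at the first row and wherever the state differs from the previous row.
run_id <- cumsum(c(TRUE, state[-1] != state[-length(state)]))

# Span of each run in hours: last timestamp minus first timestamp.
run_hours <- tapply(as.numeric(times), run_id, function(t) diff(range(t)) / 3600)

# State shared by all rows of each run.
run_state <- tapply(state, run_id, function(s) s[1])

# Total time spent inside runs of state 1.
total_hours <- sum(run_hours[run_state == "1"])
total_hours
```

On the example data the state-1 runs span 16, 8, 0 and 0 hours, which adds up to the 24 hours reported above.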
Related
I have an odd situation: when I use dplyr::rowwise() and min in mutate, it outputs a single value across all rows rather than one value per row. It works with my other data frames in the same session, and I'm not sure what the issue is. I have also restarted RStudio.
df <- indf %>%
  dplyr::rowwise() %>%
  mutate(test = min(as.Date(date1), as.Date(date2), na.rm = TRUE))
structure(list(id = structure(c("5001", "3002", "2001", "1001",
"6001", "9001"), label = "Subject name or identifier", format.sas = "$"),
date1 = structure(c(NA, 18599, NA, NA, NA, NA), class = "Date"),
date2 = structure(c(18472, 18597, 18638, 18675, 18678, 18696
), class = "Date"), test = structure(c(18472, 18472, 18472,
18472, 18472, 18472), class = "Date")), class = c("rowwise_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), groups = structure(list(
.rows = structure(list(1L, 2L, 3L, 4L, 5L, 6L), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame")))
This could be the result of loading the plyr package after dplyr, which masks dplyr's mutate:
library(dplyr)
indf %>%
rowwise %>%
plyr::mutate(test = min(date1, date2, na.rm = TRUE))
# A tibble: 6 × 4
# Rowwise:
id date1 date2 test
<chr> <date> <date> <date>
1 5001 NA 2020-07-29 2020-07-29
2 3002 2020-12-03 2020-12-01 2020-07-29
3 2001 NA 2021-01-11 2020-07-29
4 1001 NA 2021-02-17 2020-07-29
5 6001 NA 2021-02-20 2020-07-29
6 9001 NA 2021-03-10 2020-07-29
versus using :: to call the function from dplyr explicitly:
indf %>%
rowwise %>%
dplyr::mutate(test = min(date1, date2, na.rm = TRUE))
# A tibble: 6 × 4
# Rowwise:
id date1 date2 test
<chr> <date> <date> <date>
1 5001 NA 2020-07-29 2020-07-29
2 3002 2020-12-03 2020-12-01 2020-12-01
3 2001 NA 2021-01-11 2021-01-11
4 1001 NA 2021-02-17 2021-02-17
5 6001 NA 2021-02-20 2021-02-20
6 9001 NA 2021-03-10 2021-03-10
Note that rowwise is slow; it may be better to use the vectorized pmin:
indf %>%
ungroup %>%
dplyr::mutate(test = pmin(date1, date2, na.rm = TRUE))
# A tibble: 6 × 4
id date1 date2 test
<chr> <date> <date> <date>
1 5001 NA 2020-07-29 2020-07-29
2 3002 2020-12-03 2020-12-01 2020-12-01
3 2001 NA 2021-01-11 2021-01-11
4 1001 NA 2021-02-17 2021-02-17
5 6001 NA 2021-02-20 2021-02-20
6 9001 NA 2021-03-10 2021-03-10
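The difference between the two results comes down to min versus pmin: without row-wise grouping, min collapses all of its inputs into one value, while pmin compares element by element. A small base R sketch using three rows that mirror the data above:

```r
date1 <- as.Date(c(NA, "2020-12-03", NA))
date2 <- as.Date(c("2020-07-29", "2020-12-01", "2021-01-11"))

# min() pools both vectors and returns a single date.
overall <- min(date1, date2, na.rm = TRUE)

# pmin() returns the smaller date for each position.
per_row <- pmin(date1, date2, na.rm = TRUE)

overall
per_row
```

Here overall is the single date 2020-07-29, which is the value that was repeated in every row, while per_row holds one date per position.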
I have a list of accounts (300k plus rows), going back six years, with a user number, open and close dates, and other information, such as location. We offer a variety of accounts, and a user can have one or several, in any combination, and both in succession as well as overlapping.
I've been asked to find out how many users we have in any given month. They'd like it split by location, as well as total.
So I have a table like this:
User Open Close Area
1 A 2018-02-13 2018-07-31 West
2 B 2018-02-26 2018-06-04 North
3 B 2018-02-27 2018-03-15 North
4 C 2018-02-27 2018-05-26 South
5 C 2018-03-15 2018-06-03 South
6 D 2018-03-20 2018-07-02 East
7 E 2018-04-01 2018-06-19 West
8 E 2018-04-14 2018-05-04 West
9 F 2018-03-20 2018-04-19 North
10 G 2018-04-26 2018-07-04 South
11 H 2017-12-29 2018-03-21 East
12 I 2016-11-29 2020-04-10 West
13 J 2018-01-31 2018-12-20 West
14 K 2017-10-31 2018-10-30 North
15 K 2018-10-31 2019-10-30 North
And I want to get to one that looks something like this:
Month Total North East South West
1 Feb 18 3 1 0 1 1
2 Mar 18 5 2 1 1 1
3 Apr 18 7 2 1 2 2
4 May 18 6 1 1 2 2
5 Jun 18 6 1 1 2 2
6 Jul 18 3 0 1 1 1
I can filter the data to get to what I need for individual months using
df%>%
filter(Open <= as.Date("2018-04-30") & Close >= as.Date("2018-04-01")) %>%
distinct(User, .keep_all = TRUE) %>%
count(Area)
But what I can't figure out is how to repeat that for every month in the data set automatically. Is there any way of getting R to repeat the above for every month in my data set, and then pass the results into a second table?
Any and all help gratefully received, and many thanks for your time.
Edit: added examples to the source data where Matin Gal's solution returned NA for years
This is a general solution that works for dates spanning more than one year.
library(dplyr)
library(tidyr)
library(lubridate)
df %>%
  group_by(rn = row_number()) %>%
  mutate(seq = list(seq(month(Open), month(Close) + 12 * (year(Close) - year(Open))))) %>%
  unnest(seq) %>%
  mutate(
    seq_2 = (seq - 1) %% 12 + 1,
    month = month(seq_2, label = TRUE),
    # plain arithmetic avoids the NA that Open + months(n) returns when the
    # target day does not exist (e.g. Jan 31 + 1 month)
    year = year(Open) + (seq - 1) %/% 12
  ) %>%
  ungroup() %>%
  distinct(User, month, year, Area) %>%
  count(month, year, Area) %>%
  pivot_wider(
    names_from = "Area",
    values_from = "n",
    values_fill = 0
  ) %>%
  mutate(Total = rowSums(across(c(North, South, West, East))))
returns
month year North South West East Total
<ord> <dbl> <int> <int> <int> <int> <dbl>
1 Feb 2018 1 1 1 0 3
2 Mar 2018 2 1 1 1 5
3 Apr 2018 2 2 2 1 7
4 May 2018 1 2 2 1 6
5 Jun 2018 1 2 2 1 6
6 Jul 2018 0 1 1 1 3
Data
df <- structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), User = c("A",
"B", "B", "C", "C", "D", "E", "E", "F", "G"), Open = structure(c(17575,
17588, 17589, 17589, 17605, 17610, 17622, 17635, 17610, 17647
), class = "Date"), Close = structure(c(17743, 17686, 17605,
17677, 17685, 17714, 17701, 17655, 17640, 17716), class = "Date"),
Area = c("West", "North", "North", "South", "South", "East",
"West", "West", "North", "South")), problems = structure(list(
row = 10L, col = "Area", expected = "", actual = "embedded null",
file = "literal data"), row.names = c(NA, -1L), class = c("tbl_df",
"tbl", "data.frame")), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -10L), spec = structure(list(
cols = list(id = structure(list(), class = c("collector_double",
"collector")), User = structure(list(), class = c("collector_character",
"collector")), Open = structure(list(format = ""), class = c("collector_date",
"collector")), Close = structure(list(format = ""), class = c("collector_date",
"collector")), Area = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
Here's how I'd do it:
library(tidyverse)
set.seed(14159)
## generating some data that looks roughly
## like your data
data <- tibble(
user = sample(LETTERS[1:5], size = 20, replace = TRUE),
open = sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 20),
close = sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 20),
area = sample(c("N", "E", "S", "W"), 20, replace = T)
) %>%
filter(
close > open
)
data
#> # A tibble: 9 × 4
#> user open close area
#> <chr> <date> <date> <chr>
#> 1 A 1999-04-03 1999-07-28 N
#> 2 B 1999-01-27 1999-05-12 W
#> 3 B 1999-06-05 1999-12-29 W
#> 4 C 1999-09-26 1999-12-30 W
#> 5 C 1999-04-21 1999-12-04 E
#> 6 C 1999-08-11 1999-12-12 N
#> 7 A 1999-02-13 1999-09-16 W
#> 8 E 1999-02-17 1999-05-21 E
#> 9 B 1999-07-26 1999-08-16 S
## figuring out what months are in between open and close
get_months_in_range <- function(open, close) {
seq.Date(
open,
close,
by = "month"
) %>%
list()
}
data %>%
rowwise() %>%
mutate(
Month = get_months_in_range(open, close)
) %>%
ungroup() %>%
unnest_longer(
col = Month
) %>%
count(Month, area) %>%
pivot_wider(
names_from = area,
values_from = n,
values_fill = 0
) %>%
rowwise() %>%
mutate(
Total = sum(
c_across(
-Month
)
)
) %>%
ungroup()
#> # A tibble: 45 × 6
#> Month W E N S Total
#> <date> <int> <int> <int> <int> <int>
#> 1 1999-01-27 1 0 0 0 1
#> 2 1999-02-13 1 0 0 0 1
#> 3 1999-02-17 0 1 0 0 1
#> 4 1999-02-27 1 0 0 0 1
#> 5 1999-03-13 1 0 0 0 1
#> 6 1999-03-17 0 1 0 0 1
#> 7 1999-03-27 1 0 0 0 1
#> 8 1999-04-03 0 0 1 0 1
#> 9 1999-04-13 1 0 0 0 1
#> 10 1999-04-17 0 1 0 0 1
#> # … with 35 more rows
Created on 2021-08-18 by the reprex package (v2.0.1)
It's not the world's sexiest solution, but I think it'll get you where you're trying to go. Basically, I just make a helper function that gives me all the dates between open and close and then you can group by those to figure out how many users you have in any given month. Let me know if you want more explanation about what the long chain of dplyr stuff is doing.
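To group by calendar month rather than by the exact dates that seq produces, one option is to snap both ends of each account to the first of the month before expanding. The base R sketch below does this on four rows taken from the question's table. The helper month_start and the intermediate names are made up for illustration, and it assumes Open is never later than Close.

```r
accounts <- data.frame(
  User  = c("A", "B", "B", "C"),
  Open  = as.Date(c("2018-02-13", "2018-02-26", "2018-02-27", "2018-02-27")),
  Close = as.Date(c("2018-07-31", "2018-06-04", "2018-03-15", "2018-05-26")),
  Area  = c("West", "North", "North", "South")
)

# Hypothetical helper: first day of the month a date falls in.
month_start <- function(d) as.Date(format(d, "%Y-%m-01"))

# One row per account and calendar month the account overlaps.
expanded <- do.call(rbind, lapply(seq_len(nrow(accounts)), function(i) {
  data.frame(
    User  = accounts$User[i],
    Area  = accounts$Area[i],
    Month = seq(month_start(accounts$Open[i]), month_start(accounts$Close[i]), by = "month")
  )
}))

# A user with several accounts in the same month and area counts once.
expanded <- unique(expanded)

# Users per month and area, plus a total per month.
counts <- table(Month = format(expanded$Month, "%Y-%m"), Area = expanded$Area)
total  <- rowSums(counts)
counts
total
```

Snapping to the month start also keeps the closing month in the range: a sequence that starts on the 26th and steps by month stops before a close date on the 4th, so that last month would otherwise be dropped.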
Welcome to SO. I can't test this code as you haven't provided a snippet of your data in the right format (see below for a suggestion on this point), but I think the basic idea is to extract a month-year value from Open and then use group_by. For example:
library(lubridate)
library(dplyr)
df %>%
  mutate(
    Date = ymd(Open),  # the sample data shows Open as year-month-day
    Month_Yr = format_ISO8601(Date, precision = "ym")) %>%
  group_by(Month_Yr) %>%
  distinct(User, .keep_all = TRUE) %>%
  count(Area)
Generally, when sharing data on SO it's best to use dput(). See ?dput for info on how to use it if you're unsure.
I have two datasets, one with values at specific time points for different IDs and another one with several time frames for the IDs. Now I want to check if the timepoint in dataframe one is within any of the time frames from dataset 2 matching the ID.
For example:
df1:
ID date time
1 2020-04-14 11:00:00
1 2020-04-14 18:00:00
1 2020-04-15 10:00:00
1 2020-04-15 20:00:00
1 2020-04-16 11:00:00
1 ...
2 ...
df2:
ID start end
1 2020-04-14 16:00:00 2020-04-14 20:00:00
1 2020-04-15 18:00:00 2020-04-16 13:00:00
2 ...
2
What I want:
df1_new:
ID date time mark
1 2020-04-14 11:00:00 0
1 2020-04-14 18:00:00 1
1 2020-04-15 10:00:00 0
1 2020-04-15 20:00:00 1
1 2020-04-16 11:00:00 1
1 ...
2 ...
Any help would be appreciated!
An option could be:
library(tidyverse)
library(lubridate)
df_1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L), date = c("14.04.2020",
"14.04.2020", "15.04.2020", "15.04.2020", "16.04.2020"), time = c("11:00:00",
"18:00:00", "10:00:00", "20:00:00", "11:00:00"), date_time = structure(c(1586862000,
1586887200, 1586944800, 1586980800, 1587034800), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), class = "data.frame", row.names = c(NA,
-5L))
df_2 <- structure(list(ID = c(1L, 1L), start = c("14.04.2020 16:00",
"15.04.2020 18:00"), end = c("14.04.2020 20:00", "16.04.2020 13:00"
)), class = "data.frame", row.names = c(NA, -2L))
df_22 <- df_2 %>%
mutate(across(c("start", "end"), dmy_hm)) %>%
group_nest(ID)
left_join(x = df_1, y = df_22, by = "ID") %>%
as_tibble() %>%
mutate(mark = map2_dbl(date_time, data, ~+any(.x %within% interval(.y$start, .y$end)))) %>%
select(-data)
#> # A tibble: 5 x 5
#> ID date time date_time mark
#> <int> <chr> <chr> <dttm> <dbl>
#> 1 1 14.04.2020 11:00:00 2020-04-14 11:00:00 0
#> 2 1 14.04.2020 18:00:00 2020-04-14 18:00:00 1
#> 3 1 15.04.2020 10:00:00 2020-04-15 10:00:00 0
#> 4 1 15.04.2020 20:00:00 2020-04-15 20:00:00 1
#> 5 1 16.04.2020 11:00:00 2020-04-16 11:00:00 1
Created on 2021-05-25 by the reprex package (v2.0.0)
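The check performed by the pipeline above can also be sketched in base R without list columns: for each time point, test whether any time frame with the same ID contains it. The data frame names points and frames are made up here, and the interval ends are treated as inclusive, which matches how %within% behaves as far as I know.

```r
points <- data.frame(
  ID = c(1L, 1L, 1L, 1L, 1L),
  date_time = as.POSIXct(
    c("2020-04-14 11:00:00", "2020-04-14 18:00:00", "2020-04-15 10:00:00",
      "2020-04-15 20:00:00", "2020-04-16 11:00:00"),
    tz = "UTC"
  )
)

frames <- data.frame(
  ID    = c(1L, 1L),
  start = as.POSIXct(c("2020-04-14 16:00:00", "2020-04-15 18:00:00"), tz = "UTC"),
  end   = as.POSIXct(c("2020-04-14 20:00:00", "2020-04-16 13:00:00"), tz = "UTC")
)

# mark = 1 if the point lies inside at least one frame of the same ID, else 0.
points$mark <- vapply(seq_len(nrow(points)), function(i) {
  t <- points$date_time[i]
  hit <- frames$ID == points$ID[i] & frames$start <= t & t <= frames$end
  as.integer(any(hit))
}, integer(1))

points
```

This loops over every point, so for large tables a join-based approach may be preferable; the sketch is only meant to make the logic explicit.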
My question extends this one: Generate rows between two dates into a data frame in R
I have a dataset on admissions, discharges and lengths of stay (Stay_in_days) of patients from a hospital. It looks like this:
ID Admission Discharge Stay_in_days
1 2020-08-20 15:25:03 2020-08-21 21:09:49 1.239
2 2020-10-04 21:53:43 2020-10-09 11:02:57 4.548
...
Dates are in POSIXct format so far.
I aim for this:
ID Date Stay_in_days
1 2020-08-20 15:25:03 0.357
1 2020-08-21 21:09:49 1.239
2 2020-10-04 21:53:43 0.087
2 2020-10-05 00:00:00 1.087
2 2020-10-06 00:00:00 2.087
2 2020-10-07 00:00:00 3.087
2 2020-10-08 00:00:00 4.087
2 2020-10-09 11:02:57 4.548
...
What I have done so far:
M <- Map(seq, patients$Admission, patients$Discharge, by = "day")
patients2 <- data.frame(
ID = rep.int(patients$ID, vapply(M, length, 1L)),
Date = do.call(c, M)
)
patients2 <- patients2 %>%
  mutate(
    Date2=as.Date(Date, format = "%Y-%m-%d"),
    Dat2=Date2+1,
    Diff=difftime(Date2, Date, units = "days")
  )
but this gives me:
ID Date Date2 Diff
1 2020-08-20 17:25:03 2020-08-21 0.375
1 2020-08-21 17:25:03 2020-08-22 0.357
2 2020-10-04 23:53:43 2020-10-05 0.087
2 2020-10-05 23:53:43 2020-10-06 0.087
2 2020-10-06 23:53:43 2020-10-07 0.087
2 2020-10-07 23:53:43 2020-10-08 0.087
2 2020-10-08 23:53:43 2020-10-09 0.087
...
Strangely enough, it adds two hours to the Admission date but calculates the correct length of stay. Can someone explain?
Here is some data:
structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20), Admission = structure(c(1597937103.872,
1598717768.704, 1599060521.984, 1599758087.168, 1599815496.704,
1600702198.784, 1600719631.36, 1601065923.584, 1601119400.96,
1601215476.736, 1601236710.4, 1601416934.4, 1601499640.832, 1601545647.104,
1601587328, 1601644868.608, 1601741206.528, 1601848423.424, 1601901245.44,
1601913828.352), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Discharge = structure(c(1598044189.696, 1598897337.344, 1599144670.208,
1599845118.976, 1599842366.464, 1602733683.712, 1603372135.424,
1601125168.128, 1601314173.952, 1605193905.152, 1602190259.2,
1601560720.384, 1601737143.296, 1602705634.304, 1602410460.16,
1602698425.344, 1601770566.656, 1602241377.28, 1602780476.416,
1602612048.896), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
Stay_in_days = c(1.239, 2.078, 0.974, 1.007, 0.311, 23.513,
30.7, 0.686, 2.254, 46.047, 11.036, 1.664, 2.749, 13.426,
9.527, 12.194, 0.34, 4.548, 10.176, 8.081)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
Thanks in advance for your help!
It's a bit crude, but it works:
library(tidyverse)
library(lubridate)
df %>%
pivot_longer(cols = -c(ID, Stay_in_days), names_to = "Event", values_to = "DATE") %>%
group_by(ID) %>%
mutate(dummy = case_when(Event == "Admission" ~ 0,
Event == "Discharge" ~ max(floor(Stay_in_days),1))) %>%
complete(dummy = seq(min(dummy), max(dummy), 1)) %>%
mutate(Event = ifelse(is.na(Event), "Dummy", Event),
DATE = if_else(is.na(DATE), first(DATE)+dummy*24*60*60, DATE),
Stay_in_days = case_when(Event == "Admission" ~ as.numeric(difftime(ceiling_date(DATE, "day"), DATE, units = "days")),
Event == "Discharge" ~ Stay_in_days,
TRUE ~ dummy + as.numeric(difftime(ceiling_date(first(DATE), "day"), first(DATE), units = "days")))) %>%
select(ID, DATE, Stay_in_days)
# A tibble: 199 x 3
# Groups: ID [20]
ID DATE Stay_in_days
<dbl> <dttm> <dbl>
1 1 2020-08-20 15:25:03 0.358
2 1 2020-08-21 21:09:49 1.24
3 2 2020-08-29 16:16:08 0.322
4 2 2020-08-30 16:16:08 1.32
5 2 2020-08-31 18:08:57 2.08
6 3 2020-09-02 15:28:41 0.355
7 3 2020-09-03 14:51:10 0.974
8 4 2020-09-10 17:14:47 0.281
9 4 2020-09-11 17:25:18 1.01
10 5 2020-09-11 09:11:36 0.617
# ... with 189 more rows
Explanation of logic: For the first date in every ID, Stay_in_days gives the duration from the admission date-time to the following midnight. For intermediate dates, it just adds 1 to the previous value. For the discharge date, it retains the stay value calculated before pivoting. Hope this is what you were after.
Explanation of code: After pivoting longer, I used a dummy column to create the intermediate date-time objects. After that, I just mutate the columns to generate the output described above.
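As a worked check of the first-day value, here is a small base R sketch for patient 1. It uses trunc() instead of lubridate's ceiling_date(), which is a substitution made for this illustration, and the variable names are made up.

```r
admission <- as.POSIXct("2020-08-20 15:25:03", tz = "UTC")

# Midnight that follows the admission: start of the admission day plus 24 hours.
next_midnight <- as.POSIXct(trunc(admission, "days")) + 86400

# Fraction of a day between admission and that midnight.
first_day <- as.numeric(difftime(next_midnight, admission, units = "days"))
round(first_day, 3)

# Intermediate rows add whole days on top of the first-day fraction.
first_day + 1:3
```

From 15:25:03 to midnight is 30897 seconds, or about 0.358 days, in line with the first row of the output above.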
You can achieve this with pivot_longer from tidyr.
Edit: with comments:
df1 <- df %>%
select(ID = ID, date1 = Admission, date2 = Discharge, Stay_in_days) %>% # prepare for pivoting
pivot_longer(
cols = starts_with("date"),
names_to = "Date1",
values_to = "Date",
) %>% # pivot to longformat
select(-Date1) %>% # remove temporary Date1
relocate(Stay_in_days, .after = Date) %>% # change column order
group_by(ID) %>%
mutate(idgroup = row_number()) %>% # within each ID: 1 = admission, 2 = discharge
mutate(Stay_in_days = replace(Stay_in_days, row_number() == 1, 0)) %>% # set Admission to zero
ungroup()
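Separately, on the two-hour shift raised in the question: a plausible cause, though not confirmed by the thread, is that the time zone attribute was lost when the sequences were combined, so the values were printed in the local time zone. As far as I know, c() on POSIXct objects dropped the tzone attribute in R versions before 4.1.0. The sketch below assumes a local zone of UTC+2 and uses Europe/Berlin purely as an example; it shows that only the printed clock time changes while the stored instant stays the same.

```r
x <- as.POSIXct("2020-08-20 15:25:03", tz = "UTC")

# Same instant, displayed in a zone that is two hours ahead of UTC in August.
shifted <- x
attr(shifted, "tzone") <- "Europe/Berlin"

format(x, "%Y-%m-%d %H:%M:%S")
format(shifted, "%Y-%m-%d %H:%M:%S")

# The underlying number of seconds since the epoch is unchanged.
as.numeric(x) == as.numeric(shifted)

# Setting the attribute back restores the UTC display.
attr(shifted, "tzone") <- "UTC"
format(shifted, "%Y-%m-%d %H:%M:%S")
```

This would also explain why the computed lengths of stay are still correct: differences between instants do not depend on the zone used for printing.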
I have the following dataset
flight DateTime
<chr> <dttm>
1 1 2016-08-16 07:56:06
2 1 2016-08-16 07:57:10
3 1 2016-08-16 07:57:07
4 2 2016-08-15 18:35:09
5 2 2016-08-15 18:39:51
6 2 2016-08-15 18:46:53
7 3 2016-08-16 14:02:33
8 3 2016-08-16 13:25:10
9 3 2016-08-16 13:39:43
dt <- structure(list(flight = c("1", "1", "1", "2", "2", "2", "3",
"3", "3"), DateTime = structure(c(1471334352, 1471334210, 1471334262,
1471286072, 1471284963, 1471286347, 1471356056, 1471355584, 1471353810
), class = c("POSIXct", "POSIXt"), tzone = "UTC")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -9L))
I'm trying to calculate the duration of each flight (for instance in hours) using tidyverse. So far I have tried the code below, but I suspect there is a more concise way to get to the same result, for instance a function that I could summarise() by?
dt %>% group_by(flight) %>%
filter(DateTime == max(DateTime) | DateTime == min(DateTime)) %>%
mutate(dur = if_else(DateTime == max(DateTime), 'Max', 'Min')) %>%
spread(dur, DateTime) %>% mutate(duration = Max - Min)
# A tibble: 3 x 4
# Groups: flight [3]
flight Max Min duration
<chr> <dttm> <dttm> <time>
1 1 2016-08-16 07:59:12 2016-08-16 07:56:50 2.36666666666667
2 2 2016-08-15 18:39:07 2016-08-15 18:16:03 23.0666666666667
3 3 2016-08-16 14:00:56 2016-08-16 13:23:30 37.4333333333333
You don't have to use spread. If your goal is only to calculate the time difference, this gives the same result (note that the latest time comes first, so the duration is positive):
dt %>%
  group_by(flight) %>%
  summarise(duration = difftime(max(DateTime), min(DateTime), units = "hours"))
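As a worked check of the per-flight span (latest minus earliest timestamp), here is a base R sketch on the dput data from the question. It reports minutes so the numbers can be compared with the table in the question; duration_min is a name made up here.

```r
dt <- data.frame(
  flight = c("1", "1", "1", "2", "2", "2", "3", "3", "3"),
  DateTime = as.POSIXct(
    c(1471334352, 1471334210, 1471334262,
      1471286072, 1471284963, 1471286347,
      1471356056, 1471355584, 1471353810),
    origin = "1970-01-01", tz = "UTC"
  )
)

# Span per flight in minutes: latest timestamp minus earliest timestamp.
duration_min <- tapply(as.numeric(dt$DateTime), dt$flight,
                       function(t) diff(range(t)) / 60)
duration_min
```

The spans are 142, 1384 and 2246 seconds, i.e. roughly 2.37, 23.07 and 37.43 minutes; divide by 60 again to get hours.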