I have a data frame with this data:
# A tibble: 6 × 3
rowid Arrival Depart
<int> <dttm> <dttm>
1 1 2023-02-11 07:00:00 2023-02-11 17:30:00
2 2 2023-02-13 10:00:00 2023-02-13 18:00:00
3 3 2023-02-14 08:00:00 2023-02-14 17:00:00
4 4 2023-02-15 08:00:00 2023-02-15 17:00:00
5 5 2023-02-16 08:00:00 2023-02-16 18:00:00
6 6 2023-02-18 07:00:00 2023-02-18 17:30:00
structure(list(rowid = 1:6, Arrival = structure(c(1676098800,
1676282400, 1676361600, 1676448000, 1676534400, 1676703600), tzone = "UTC", class = c("POSIXct",
"POSIXt")), Depart = structure(c(1676136600, 1676311200, 1676394000,
1676480400, 1676570400, 1676741400), tzone = "UTC", class = c("POSIXct",
"POSIXt"))), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
I set the following objects:
ri <- 2
int <- test_int[2]
int (an interval) becomes:
> int
[1] 2023-02-13 06:00:00 EST--2023-02-13 13:00:00 EST
and then run this code:
test <- test %>% mutate(
interval_start = if_else(rowid == ri, int_start(int), Arrival),
interval_end = if_else(rowid == ri, int_end(int), Depart)
) %>% select(Arrival, interval_start, Depart, interval_end)
The result is this:
# A tibble: 6 × 4
Arrival interval_start Depart interval_end
<dttm> <dttm> <dttm> <dttm>
1 2023-02-11 07:00:00 2023-02-11 02:00:00 2023-02-11 17:30:00 2023-02-11 12:30:00
2 2023-02-13 10:00:00 2023-02-13 06:00:00 2023-02-13 18:00:00 2023-02-13 13:00:00
3 2023-02-14 08:00:00 2023-02-14 03:00:00 2023-02-14 17:00:00 2023-02-14 12:00:00
4 2023-02-15 08:00:00 2023-02-15 03:00:00 2023-02-15 17:00:00 2023-02-15 12:00:00
5 2023-02-16 08:00:00 2023-02-16 03:00:00 2023-02-16 18:00:00 2023-02-16 13:00:00
6 2023-02-18 07:00:00 2023-02-18 02:00:00 2023-02-18 17:30:00 2023-02-18 12:30:00
structure(list(Arrival = structure(c(1676098800, 1676282400,
1676361600, 1676448000, 1676534400, 1676703600), tzone = "UTC", class = c("POSIXct",
"POSIXt")), interval_start = structure(c(1676098800, 1676286000,
1676361600, 1676448000, 1676534400, 1676703600), class = c("POSIXct",
"POSIXt")), Depart = structure(c(1676136600, 1676311200, 1676394000,
1676480400, 1676570400, 1676741400), tzone = "UTC", class = c("POSIXct",
"POSIXt")), interval_end = structure(c(1676136600, 1676311200,
1676394000, 1676480400, 1676570400, 1676741400), class = c("POSIXct",
"POSIXt"))), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
For some reason the if_else() call returns unexpected times instead of the Arrival/Depart values, although it does set rowid 2 to the correct time.
Does anyone know why this happens and how I can fix it?
Thanks to George Savva, I was able to see that the erroneous values appeared because my int variable was in the EST time zone instead of UTC, which explains the 5-hour difference.
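As a minimal base R sketch of how a five-hour shift can appear without any stored value being wrong: a POSIXct holds an instant, and the time zone attribute only changes how that instant is printed. The zone name "America/New_York" is an assumption standing in for the EST zone mentioned above.

```r
# One instant, displayed in two time zones (base R only).
utc <- as.POSIXct("2023-02-13 11:00:00", tz = "UTC")

est <- utc
attr(est, "tzone") <- "America/New_York"  # assumed zone; changes display only

# The underlying number of seconds since the epoch is identical.
same_instant <- as.numeric(utc) == as.numeric(est)

# The printed clock times differ by the UTC offset (5 hours in February).
utc_text <- format(utc, "%Y-%m-%d %H:%M:%S")
est_text <- format(est, "%Y-%m-%d %H:%M:%S")
```

Under these assumptions, giving every column the same time zone attribute before comparing printed values should make the apparent discrepancy disappear.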
I have a table with multiple datetime columns, I wish to extract weekday for each of those columns and add as a new column.
Sample dataset:
structure(list(mealTime = structure(c(1542492000, 1578852000,
1604253600, 1545901200, 1549821600, 1544306400), tzone = "UTC", class = c("POSIXct",
"POSIXt")), weight_measure_time = structure(c(1542226000, 1578812400,
1594710000, 1545896762, 1546416823, 1544227245), tzone = "UTC", class = c("POSIXct",
"POSIXt")), height_measure_time = structure(c(1542106434, 1543337043,
1543337043, 1542387988, 1542366547, 1542802228), tzone = "UTC", class = c("POSIXct",
"POSIXt")), hba1c_measure_time = structure(c(1542106860, 1573455600,
1594625400, 1544781600, 1545920520, 1544096580), tzone = "UTC", class = c("POSIXct",
"POSIXt")), bpMeasureTime = structure(c(1542380623, 1578812400,
1583218800, 1545896774, 1546416837, 1544266110), tzone = "UTC", class = c("POSIXct",
"POSIXt"))), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
which looks something like this:
> smple
# A tibble: 6 x 5
mealTime weight_measure_time height_measure_time
<dttm> <dttm> <dttm>
1 2018-11-17 22:00:00 2018-11-14 20:06:40 2018-11-13 10:53:54
2 2020-01-12 18:00:00 2020-01-12 07:00:00 2018-11-27 16:44:03
3 2020-11-01 18:00:00 2020-07-14 07:00:00 2018-11-27 16:44:03
4 2018-12-27 09:00:00 2018-12-27 07:46:02 2018-11-16 17:06:28
5 2019-02-10 18:00:00 2019-01-02 08:13:43 2018-11-16 11:09:07
6 2018-12-08 22:00:00 2018-12-08 00:00:45 2018-11-21 12:10:28
# ... with 2 more variables: hba1c_measure_time <dttm>, bpMeasureTime <dttm>
For the above dataset, the expected result is as follows, i.e. for each datetime column, extract the weekday and add it as a new column:
glimpse(smple)
Rows: 6
Columns: 10
$ mealTime <dttm> 2018-11-17 22:00:00, 2020-01-12 18:00:00, 20~
$ weight_measure_time <dttm> 2018-11-14 20:06:40, 2020-01-12 07:00:00, 20~
$ height_measure_time <dttm> 2018-11-13 10:53:54, 2018-11-27 16:44:03, 20~
$ hba1c_measure_time <dttm> 2018-11-13 11:01:00, 2019-11-11 07:00:00, 20~
$ bpMeasureTime <dttm> 2018-11-16 15:03:43, 2020-01-12 07:00:00, 20~
$ mealTime_day <chr> "Saturday", "Sunday", "Sunday", "Thursday", "~
$ weight_measure_time_day <chr> "Wednesday", "Sunday", "Tuesday", "Thursday",~
$ height_measure_time_day <chr> "Tuesday", "Tuesday", "Tuesday", "Friday", "F~
$ hba1c_measure_time_day <chr> "Tuesday", "Monday", "Monday", "Friday", "Thu~
$ bpMeasureTime_day <chr> "Friday", "Sunday", "Tuesday", "Thursday", "W~
In base R, I can achieve the above as follows:
smple[paste(colnames(smple), "day", sep="_")] = apply(smple, 2, lubridate::wday, label=TRUE, abbr=FALSE)
I wanted to know if there is a similar way in the tidyverse that adds columns dynamically by evaluating both LHS and RHS.
Making use of across and where you could do:
library(dplyr)
library(lubridate)
mutate(smpl, across(where(is.POSIXct), lubridate::wday,
label=TRUE, abbr=FALSE, .names = "{.col}_day"))
#> # A tibble: 6 x 10
#> mealTime weight_measure_time height_measure_time
#> <dttm> <dttm> <dttm>
#> 1 2018-11-17 22:00:00 2018-11-14 20:06:40 2018-11-13 10:53:54
#> 2 2020-01-12 18:00:00 2020-01-12 07:00:00 2018-11-27 16:44:03
#> 3 2020-11-01 18:00:00 2020-07-14 07:00:00 2018-11-27 16:44:03
#> 4 2018-12-27 09:00:00 2018-12-27 07:46:02 2018-11-16 17:06:28
#> 5 2019-02-10 18:00:00 2019-01-02 08:13:43 2018-11-16 11:09:07
#> 6 2018-12-08 22:00:00 2018-12-08 00:00:45 2018-11-21 12:10:28
#> # … with 7 more variables: hba1c_measure_time <dttm>, bpMeasureTime <dttm>,
#> # mealTime_day <dbl>, weight_measure_time_day <dbl>,
#> # height_measure_time_day <dbl>, hba1c_measure_time_day <dbl>,
#> # bpMeasureTime_day <dbl>
Here is one way to solve your problem:
df[paste0(names(df), "_day")] <- lapply(df, weekdays)
Base R solution:
cbind(
  df,
  setNames(
    data.frame(Map(weekdays, df)),
    paste0(
      names(df),
      ifelse(grepl("_", names(df)), "_day_of_week", "DayOfWeek")
    )
  )
)
dplyr solution only using weekdays from base R
library(dplyr)
df %>%
mutate(across(everything(), weekdays, .names = "{.col}_day"))
Output:
mealTime weight_measure_time height_measure_time hba1c_measure_time bpMeasureTime mealTime_day weight_measure_time_day
<dttm> <dttm> <dttm> <dttm> <dttm> <chr> <chr>
1 2018-11-17 22:00:00 2018-11-14 20:06:40 2018-11-13 10:53:54 2018-11-13 11:01:00 2018-11-16 15:03:43 Samstag Mittwoch
2 2020-01-12 18:00:00 2020-01-12 07:00:00 2018-11-27 16:44:03 2019-11-11 07:00:00 2020-01-12 07:00:00 Sonntag Sonntag
3 2020-11-01 18:00:00 2020-07-14 07:00:00 2018-11-27 16:44:03 2020-07-13 07:30:00 2020-03-03 07:00:00 Sonntag Dienstag
4 2018-12-27 09:00:00 2018-12-27 07:46:02 2018-11-16 17:06:28 2018-12-14 10:00:00 2018-12-27 07:46:14 Donnerstag Donnerstag
5 2019-02-10 18:00:00 2019-01-02 08:13:43 2018-11-16 11:09:07 2018-12-27 14:22:00 2019-01-02 08:13:57 Sonntag Mittwoch
6 2018-12-08 22:00:00 2018-12-08 00:00:45 2018-11-21 12:10:28 2018-12-06 11:43:00 2018-12-08 10:48:30 Samstag Samstag
# ... with 3 more variables: height_measure_time_day <chr>, hba1c_measure_time_day <chr>, bpMeasureTime_day <chr>
I have two date columns in a data frame with more than 100k observations:
date1                startdate
2020-07-30 23:00:00  NA
2020-12-10 04:00:00  2021-06-30 20:00:00
2020-10-26 21:00:00  NA
2019-12-03 03:01:00  2020-02-01 01:00:00
NA                   2020-06-28 07:30:00
I have to fill the missing values in the startdate column. My idea is to compute the average number of days between date1 and startdate, and then replace each NA in startdate with date1 plus that average.
Desired output
For instance, if the average is 70 days, then:
date1                startdate
2020-07-30 23:00:00  2020-10-08 23:00:00
2020-12-10 04:00:00  2021-06-30 20:00:00
2020-10-26 21:00:00  2021-01-04 21:00:00
2019-12-03 03:01:00  2020-02-01 01:00:00
NA                   2020-06-28 07:30:00
Reproducible example:
structure(list(date1 = structure(c(1594069500,
1575320400, 1603742400, NA, 1574975100, 1570845660, 1575061500,
1564714860, 1576544400, 1574802300, 1576198800, 1575338460, 1575666180,
NA, 1594327800, 1595365200, 1594069800, 1591905600, 1594414800,
NA), class = c("POSIXct", "POSIXt"), tzone = ""), startdate = structure(c(1599242400,
1577127600, NA, 1603396800, 1577516400, 1573714800, 1577689200,
1566374400, 1577343600, 1577516400, 1577343600, NA, 1577257200,
NA, 1605193200, 1605106800, 1600358400, 1600358400, 1600272000,
NA), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(1L,
2L, 7591L, 8301L, 8692L, 8694L, 8699L, 8703L, 8706L, 8709L, 8710L,
8714L, 8715L, 8730L, 8732L, 8733L, 8736L, 8740L, 8745L, 8749L
), class = "data.frame")
You can use difftime to calculate average duration between startdate and date1. Replace NA values in startdate by adding the average value to date1.
avg <- as.numeric(mean(difftime(df$startdate, df$date1, units = 'secs'), na.rm = TRUE))
df$startdate[is.na(df$startdate)] <- df$date1[is.na(df$startdate)] + avg
df
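The same idea on a tiny made-up data frame, as a sketch only: the three rows below are hypothetical and chosen so that the average gap is easy to check by hand (10 days and 20 days give 15 days).

```r
# Hypothetical toy data: the third row has a missing startdate.
df <- data.frame(
  date1 = as.POSIXct(c("2020-01-01 00:00:00",
                       "2020-02-01 00:00:00",
                       "2020-03-01 00:00:00"), tz = "UTC"),
  startdate = as.POSIXct(c("2020-01-11 00:00:00",
                           "2020-02-21 00:00:00",
                           NA), tz = "UTC")
)

# Gap in seconds for every row; NA where either date is missing.
gap_secs <- as.numeric(difftime(df$startdate, df$date1, units = "secs"))
avg_secs <- mean(gap_secs, na.rm = TRUE)

# Fill only rows where startdate is missing but date1 is known.
to_fill <- is.na(df$startdate) & !is.na(df$date1)
df$startdate[to_fill] <- df$date1[to_fill] + avg_secs
```

Working in seconds avoids any ambiguity about difftime units when the average is added back to a POSIXct value. Rows where date1 is also missing stay NA.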
Something like this?
I cannot verify it, since your desired output does not match your sample data.
library(data.table)
setDT(mydata)
# the mean is taken over the full table, not only the rows selected in i
mydata[is.na(startdate) & !is.na(date1),
       startdate := date1 + round(mean(abs(mydata$date1 - mydata$startdate), na.rm = TRUE))]
I have two datasets, one with values at specific time points for different IDs and another one with several time frames for the IDs. Now I want to check if the timepoint in dataframe one is within any of the time frames from dataset 2 matching the ID.
For example:
df1:
ID date time
1 2020-04-14 11:00:00
1 2020-04-14 18:00:00
1 2020-04-15 10:00:00
1 2020-04-15 20:00:00
1 2020-04-16 11:00:00
1 ...
2 ...
df2:
ID start end
1 2020-04-14 16:00:00 2020-04-14 20:00:00
1 2020-04-15 18:00:00 2020-04-16 13:00:00
2 ...
2
What I want:
df1_new:
ID date time mark
1 2020-04-14 11:00:00 0
1 2020-04-14 18:00:00 1
1 2020-04-15 10:00:00 0
1 2020-04-15 20:00:00 1
1 2020-04-16 11:00:00 1
1 ...
2 ...
Any help would be appreciated!
An option could be:
library(tidyverse)
library(lubridate)
df_1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L), date = c("14.04.2020",
"14.04.2020", "15.04.2020", "15.04.2020", "16.04.2020"), time = c("11:00:00",
"18:00:00", "10:00:00", "20:00:00", "11:00:00"), date_time = structure(c(1586862000,
1586887200, 1586944800, 1586980800, 1587034800), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), class = "data.frame", row.names = c(NA,
-5L))
df_2 <- structure(list(ID = c(1L, 1L), start = c("14.04.2020 16:00",
"15.04.2020 18:00"), end = c("14.04.2020 20:00", "16.04.2020 13:00"
)), class = "data.frame", row.names = c(NA, -2L))
df_22 <- df_2 %>%
mutate(across(c("start", "end"), dmy_hm)) %>%
group_nest(ID)
left_join(x = df_1, y = df_22, by = "ID") %>%
as_tibble() %>%
mutate(mark = map2_dbl(date_time, data, ~+any(.x %within% interval(.y$start, .y$end)))) %>%
select(-data)
#> # A tibble: 5 x 5
#> ID date time date_time mark
#> <int> <chr> <chr> <dttm> <dbl>
#> 1 1 14.04.2020 11:00:00 2020-04-14 11:00:00 0
#> 2 1 14.04.2020 18:00:00 2020-04-14 18:00:00 1
#> 3 1 15.04.2020 10:00:00 2020-04-15 10:00:00 0
#> 4 1 15.04.2020 20:00:00 2020-04-15 20:00:00 1
#> 5 1 16.04.2020 11:00:00 2020-04-16 11:00:00 1
Created on 2021-05-25 by the reprex package (v2.0.0)
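For comparison, a base R sketch of the same check without nesting or joins. The object names obs and frames are hypothetical, and the values are copied from the example in the question; this loops over rows, so it is meant to illustrate the logic rather than to scale to large data.

```r
# Time points per ID (values taken from the question's example).
obs <- data.frame(
  ID = c(1L, 1L, 1L),
  date_time = as.POSIXct(c("2020-04-14 11:00:00",
                           "2020-04-14 18:00:00",
                           "2020-04-16 11:00:00"), tz = "UTC")
)

# Time frames per ID.
frames <- data.frame(
  ID = c(1L, 1L),
  start = as.POSIXct(c("2020-04-14 16:00:00", "2020-04-15 18:00:00"), tz = "UTC"),
  end   = as.POSIXct(c("2020-04-14 20:00:00", "2020-04-16 13:00:00"), tz = "UTC")
)

# For each row, mark 1 if the time point lies inside any frame of the same ID.
obs$mark <- vapply(seq_len(nrow(obs)), function(i) {
  f <- frames[frames$ID == obs$ID[i], ]
  as.integer(any(obs$date_time[i] >= f$start & obs$date_time[i] <= f$end))
}, integer(1))
```

The comparison treats both interval ends as inclusive, which matches how %within% behaves for the sample data here.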
I have a dataset with tracking data containing Datetime, Latitude and Longitude variables for several months which looks like this:
> start <- as.POSIXct("2018-08-01 00:00:00", format="%Y-%m-%d %H:%M:%S", tz="UTC")
> datetime <- seq(from = start, length.out = 2880, by = "5 mins")
> lat<-rep(seq(from=50, to= 30, length.out = 10), each=288)
> lon<-rep(seq(from=110, to= 70, length.out = 10), each=288)
> data<-cbind.data.frame(datetime, lat, lon)
> head(data)
datetime lat lon
1 2018-08-01 00:00:00 50 110
2 2018-08-01 00:05:00 50 110
3 2018-08-01 00:10:00 50 110
4 2018-08-01 00:15:00 50 110
5 2018-08-01 00:20:00 50 110
6 2018-08-01 00:25:00 50 110
I want to add a new column with information on the period of the day, i.e. day, dusk, night or dawn, based on information from getSunlightTimes:
> data$date<-as.Date(data$datetime)
> sun<-getSunlightTimes(data=data, tz="UTC", keep=c("sunrise","sunset","night","nightEnd"))
> head(sun)
date lat lon sunrise sunset night
1 2018-08-01 50 110 2018-07-31 21:09:50 2018-08-01 12:25:18 2018-08-01 15:01:56
2 2018-08-01 50 110 2018-07-31 21:09:50 2018-08-01 12:25:18 2018-08-01 15:01:56
3 2018-08-01 50 110 2018-07-31 21:09:50 2018-08-01 12:25:18 2018-08-01 15:01:56
4 2018-08-01 50 110 2018-07-31 21:09:50 2018-08-01 12:25:18 2018-08-01 15:01:56
5 2018-08-01 50 110 2018-07-31 21:09:50 2018-08-01 12:25:18 2018-08-01 15:01:56
6 2018-08-01 50 110 2018-07-31 21:09:50 2018-08-01 12:25:18 2018-08-01 15:01:56
nightEnd
1 2018-07-31 18:33:13
2 2018-07-31 18:33:13
3 2018-07-31 18:33:13
4 2018-07-31 18:33:13
5 2018-07-31 18:33:13
6 2018-07-31 18:33:13
So dusk would correspond to datetime values between sunset and night, dawn between nightEnd and sunrise, day between sunrise and sunset and night between night and nightEnd.
I've tried:
> data$period<-rep(" ", length.out=nrow(data))
> data$period[which(data$datetime>sun$sunrise & data$datetime<sun$sunset)]<-"day"
> data$period[which(data$datetime>sun$sunset & data$datetime<sun$night)]<-"dusk"
> data$period[which(data$datetime>sun$nightEnd & data$datetime<sun$sunrise)]<-"dawn"
> data$period[which(data$period==" ")]<-"night"
but this causes problems at the transition between days. Anyone have any suggestions?
Kind regards
A tidyverse option that - like the solution by Roman Luštrik - uses a left_join.
library(dplyr)
library(lubridate)
df1 %>%
mutate(date = date(datetime)) %>%
left_join(df2) %>%
distinct() %>%
mutate(period = case_when(datetime %within% interval(sunset, night) ~ 'dusk',
datetime %within% interval(nightEnd, sunrise) ~ 'dawn',
datetime %within% interval(sunrise, sunset) ~ 'day',
datetime %within% interval(night, nightEnd) ~ 'night')) %>%
select(datetime, lat, lon, period)
# # A tibble: 6 x 4
# datetime lat lon period
# <dttm> <dbl> <dbl> <chr>
# 1 2018-08-01 00:00:00 50 110 day
# 2 2018-08-01 00:05:00 50 110 day
# 3 2018-08-01 00:10:00 50 110 day
# 4 2018-08-01 00:15:00 50 110 day
# 5 2018-08-01 00:20:00 50 110 day
# 6 2018-08-01 00:25:00 50 110 day
Data
df1 <- structure(list(datetime = structure(c(1533081600, 1533081900,
1533082200, 1533082500, 1533082800, 1533083100), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), lat = c(50, 50, 50, 50, 50, 50), lon = c(110,
110, 110, 110, 110, 110)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
df2 <- structure(list(date = structure(c(17744, 17744, 17744, 17744,
17744, 17744), class = "Date"), lat = c(50, 50, 50, 50, 50, 50
), lon = c(110, 110, 110, 110, 110, 110), sunrise = structure(c(1533071390,
1533071390, 1533071390, 1533071390, 1533071390, 1533071390), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), sunset = structure(c(1533126318, 1533126318,
1533126318, 1533126318, 1533126318, 1533126318), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), night = structure(c(1533135716, 1533135716,
1533135716, 1533135716, 1533135716, 1533135716), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), nightEnd = structure(c(1533061993,
1533061993, 1533061993, 1533061993, 1533061993, 1533061993), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
What I would do is do a left-join, that is attach the sunset/sunrise data to your actual dates and do your magic there.
library(suncalc)
start <- as.POSIXct("2018-08-01 00:00:00", format = "%Y-%m-%d %H:%M:%S",
tz = "UTC")
datetime <- seq(from = start, length.out = 2880, by = "5 mins")
lat <- rep(seq(from = 50, to = 30, length.out = 10), each = 288)
lon <- rep(seq(from = 110, to = 70, length.out = 10), each = 288)
data <- cbind.data.frame(datetime, lat, lon)
data$date <- as.Date(data$datetime)
sun <- getSunlightTimes(data = data, tz = "UTC",
keep = c("sunrise", "sunset", "night", "nightEnd"))
xy <- merge(x = data,
y = sun[, c("date", "sunrise", "sunset", "night", "nightEnd")],
by = "date")
xy$period <- rep(" ", length.out = nrow(xy))
xy$period[xy$datetime > xy$sunrise & xy$datetime < xy$sunset] <- "day"
xy$period[xy$datetime > xy$sunset & xy$datetime < xy$night] <- "dusk"
xy$period[xy$datetime > xy$nightEnd & xy$datetime < xy$sunrise] <- "dawn"
xy$period[xy$period == " "] <- "night"
set.seed(357)
xy[sample(1:nrow(xy), size = 7), ]
date datetime lat lon sunrise sunset night nightEnd period
396824 2018-08-05 2018-08-05 18:45:00 41.11111 92.22222 2018-08-04 22:51:28 2018-08-05 13:05:23 2018-08-05 14:55:08 2018-08-04 21:01:43 night
407953 2018-08-05 2018-08-05 22:00:00 41.11111 92.22222 2018-08-04 22:51:28 2018-08-05 13:05:23 2018-08-05 14:55:08 2018-08-04 21:01:43 night
685732 2018-08-09 2018-08-09 06:25:00 32.22222 74.44444 2018-08-09 00:23:28 2018-08-09 13:54:45 2018-08-09 15:25:18 2018-08-08 22:52:56 day
638810 2018-08-08 2018-08-08 16:50:00 34.44444 78.88889 2018-08-08 00:01:06 2018-08-08 13:41:48 2018-08-08 15:16:02 2018-08-07 22:26:53 night
392649 2018-08-05 2018-08-05 17:35:00 41.11111 92.22222 2018-08-04 22:51:28 2018-08-05 13:05:23 2018-08-05 14:55:08 2018-08-04 21:01:43 night
165451 2018-08-02 2018-08-02 23:50:00 47.77778 105.55556 2018-08-01 21:36:31 2018-08-02 12:34:05 2018-08-02 14:53:05 2018-08-01 19:17:31 night
159367 2018-08-02 2018-08-02 22:05:00 47.77778 105.55556 2018-08-01 21:36:31 2018-08-02 12:34:05 2018-08-02 14:53:05 2018-08-01 19:17:31 night
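One way to sidestep the day-transition problem, sketched here as a different technique from the answers above, is to put the sun events for a location on a single sorted timeline and look up which event most recently occurred. The event times below are copied from the question's getSunlightTimes output for one location; the label assignment (which period starts at each event) is my reading of the question's definitions.

```r
# Sorted event times for one location (UTC), copied from the question.
events <- as.POSIXct(c("2018-07-31 18:33:13",   # nightEnd -> dawn begins
                       "2018-07-31 21:09:50",   # sunrise  -> day begins
                       "2018-08-01 12:25:18",   # sunset   -> dusk begins
                       "2018-08-01 15:01:56"),  # night    -> night begins
                     tz = "UTC")
labels <- c("dawn", "day", "dusk", "night")

# Hypothetical time points to classify.
x <- as.POSIXct(c("2018-07-31 20:00:00",
                  "2018-08-01 00:00:00",
                  "2018-08-01 13:00:00",
                  "2018-08-01 16:00:00"), tz = "UTC")

# findInterval returns the index of the last event at or before each point.
# An index of 0 (point before the first event) would need separate handling.
idx <- findInterval(as.numeric(x), as.numeric(events))
period <- labels[idx]
```

For several days, the event vector would simply be extended with the following days' events (and the labels repeated), so a time point late in the day is compared against the next sunrise rather than the previous one.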
I know the following problem can be solved using reduce from Bioconductor's IRanges package.
But since that function only accepts numeric input, and I am working with data.table anyway, I am wondering whether the following can be achieved using data.table's foverlaps().
Sample data
structure(list(group = c("A", "A", "A", "A", "B", "B", "B", "B"
), subgroup = c(1, 1, 2, 2, 1, 1, 2, 2), start = structure(c(1514793600,
1514795400, 1514794200, 1514798100, 1514815200, 1514817000, 1514815800,
1514818800), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
end = structure(c(1514794500, 1514797200, 1514794800, 1514799000,
1514816100, 1514818800, 1514817600, 1514820600), class = c("POSIXct",
"POSIXt"), tzone = "UTC")), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
# group subgroup start end
# 1: A 1 2018-01-01 08:00:00 2018-01-01 08:15:00
# 2: A 1 2018-01-01 08:30:00 2018-01-01 09:00:00
# 3: A 2 2018-01-01 08:10:00 2018-01-01 08:20:00
# 4: A 2 2018-01-01 09:15:00 2018-01-01 09:30:00
# 5: B 1 2018-01-01 14:00:00 2018-01-01 14:15:00
# 6: B 1 2018-01-01 14:30:00 2018-01-01 15:00:00
# 7: B 2 2018-01-01 14:10:00 2018-01-01 14:40:00
# 8: B 2 2018-01-01 15:00:00 2018-01-01 15:30:00
Question
What I would like to achieve, is to join/merge events (by group) when:
a range (start - end) overlaps (or partially overlaps) another range
the start of a range is the end of another range
Subgroups can be ignored
As mentioned above, I know this can be done using Bioconductor's IRanges reduce, but I wonder if the same can be achieved using data.table. I can't shake the feeling that foverlaps should be able to tackle my problem, but I cannot figure out how...
Since I'm an intermediate R user, but pretty much a novice in data.table, it's hard for me to 'read' some solutions already provided on Stack Overflow. So I'm not sure if a similar question has already been asked and answered (if so, please be gentle ;-) )
Desired output
structure(list(group = c("A", "A", "A", "B"), start = structure(c(1514793600,
1514795400, 1514798100, 1514815200), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), end = structure(c(1514794800, 1514797200,
1514799000, 1514820600), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
# group start end
# 1: A 2018-01-01 08:00:00 2018-01-01 08:20:00
# 2: A 2018-01-01 08:30:00 2018-01-01 09:00:00
# 3: A 2018-01-01 09:15:00 2018-01-01 09:30:00
# 4: B 2018-01-01 14:00:00 2018-01-01 15:30:00
If you arrange on group and start (in that order) and unselect the indx column, this solution posted by David Arenburg works perfectly: How to flatten/merge overlapping time periods in R
library(dplyr)
df1 %>%
group_by(group) %>%
arrange(group, start) %>%
mutate(indx = c(0, cumsum(as.numeric(lead(start)) >
cummax(as.numeric(end)))[-n()])) %>%
group_by(group, indx) %>%
summarise(start = first(start), end = last(end)) %>%
select(-indx)
group start end
<chr> <dttm> <dttm>
1 A 2018-01-01 08:00:00 2018-01-01 08:20:00
2 A 2018-01-01 08:30:00 2018-01-01 09:00:00
3 A 2018-01-01 09:15:00 2018-01-01 09:30:00
4 B 2018-01-01 14:00:00 2018-01-01 15:30:00
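The cummax idea behind that pipeline can also be written in base R for a single group. This is only a sketch: the helper name merge_intervals is hypothetical, it assumes at least one interval and UTC times, and it would be applied per group (for example via split) on real data.

```r
# Merge overlapping or touching intervals for one group (base R sketch).
merge_intervals <- function(start, end) {
  o <- order(start)
  start <- start[o]
  end <- end[o]
  n <- length(start)

  # Latest end seen so far, as seconds since the epoch.
  running_end <- cummax(as.numeric(end))

  # A new block begins when a start lies strictly after every earlier end,
  # so intervals that merely touch are merged.
  new_block <- c(TRUE, as.numeric(start[-1]) > running_end[-n])
  last_in_block <- c(new_block[-1], TRUE)

  data.frame(
    start = start[new_block],
    end = as.POSIXct(running_end[last_in_block],
                     origin = "1970-01-01", tz = "UTC")  # UTC is assumed
  )
}

# Group A from the sample data above.
a_start <- as.POSIXct(c("2018-01-01 08:00:00", "2018-01-01 08:30:00",
                        "2018-01-01 08:10:00", "2018-01-01 09:15:00"), tz = "UTC")
a_end   <- as.POSIXct(c("2018-01-01 08:15:00", "2018-01-01 09:00:00",
                        "2018-01-01 08:20:00", "2018-01-01 09:30:00"), tz = "UTC")
res <- merge_intervals(a_start, a_end)
```

Taking the running maximum of the end times, rather than the last end in each block, keeps the result correct when one interval is fully contained in an earlier one.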