Time-interval overlap match by group - R

Suppose I have the following DF:
id flag time
1 1 2017-01-01 UTC--2017-01-07 UTC
1 0 2018-01-01 UTC--2019-01-01 UTC
1 0 2017-01-03 UTC--2017-01-09 UTC
2 1 2017-01-01 UTC--2017-01-15 UTC
2 1 2018-07-01 UTC--2018-09-01 UTC
2 1 2018-10-12 UTC--2018-10-20 UTC
2 0 2017-01-12 UTC--2017-01-16 UTC
2 0 2017-03-01 UTC--2017-03-15 UTC
2 0 2017-12-01 UTC--2017-12-31 UTC
2 0 2018-08-15 UTC--2018-09-19 UTC
2 0 2018-10-01 UTC--2018-10-21 UTC
Created with the following code:
library(lubridate)

df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2),
                 flag = c(1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0),
                 time = c(interval(ymd(20170101), ymd(20170107)),
                          interval(ymd(20180101), ymd(20190101)),
                          interval(ymd(20170103), ymd(20170109)),
                          # Cases
                          interval(ymd(20170101), ymd(20170115)),
                          interval(ymd(20180701), ymd(20180901)),
                          interval(ymd(20181012), ymd(20181020)),
                          # Controls
                          interval(ymd(20170112), ymd(20170116)),
                          interval(ymd(20170301), ymd(20170315)),
                          interval(ymd(20171201), ymd(20171231)),
                          interval(ymd(20180815), ymd(20180919)),
                          interval(ymd(20181001), ymd(20181021))))
And I want to obtain this result:
id flag time value
1 1 2017-01-01 UTC--2017-01-07 UTC NA
1 0 2018-01-01 UTC--2019-01-01 UTC 0
1 0 2017-01-03 UTC--2017-01-09 UTC 1
2 1 2017-01-01 UTC--2017-01-15 UTC NA
2 1 2018-07-01 UTC--2018-09-01 UTC NA
2 1 2018-10-12 UTC--2018-10-20 UTC NA
2 0 2017-01-12 UTC--2017-01-16 UTC 1
2 0 2017-03-01 UTC--2017-03-15 UTC 0
2 0 2017-12-01 UTC--2017-12-31 UTC 0
2 0 2018-08-15 UTC--2018-09-19 UTC 1
2 0 2018-10-01 UTC--2018-10-21 UTC 1
That is, I want to compare each flag = 0 interval against all flag = 1 intervals within the same id, to see whether there is at least one overlap between them.
For this purpose I have tried lubridate's int_overlaps function.
I have tried the following code, but it does not work:
result <- df %>%
  group_by(id) %>%
  mutate(value = ifelse(flag == 0 & int_overlaps(time, any(time[flag == 1])), 1, 0))
(The call fails because any() is meant for logical vectors, so any(time[flag == 1]) never hands int_overlaps() the flag = 1 intervals to compare against; the check has to be done row-wise.) I have found a very similar approach:
R: Determine if each date interval overlaps with all other date intervals in a dataframe

You can use map_int from purrr to check, for each flag = 0 row, whether its interval overlaps any flag = 1 interval within the same id:
library(tidyverse)
library(lubridate)
df %>%
  group_by(id) %>%
  mutate(value = ifelse(flag == 0,
                        map_int(time, ~ any(int_overlaps(.x, time[flag == 1]))),
                        NA))
Output
# A tibble: 11 x 4
# Groups: id [2]
id flag time value
<dbl> <dbl> <Interval> <int>
1 1 1 2017-01-01 UTC--2017-01-07 UTC NA
2 1 0 2018-01-01 UTC--2019-01-01 UTC 0
3 1 0 2017-01-03 UTC--2017-01-09 UTC 1
4 2 1 2017-01-01 UTC--2017-01-15 UTC NA
5 2 1 2018-07-01 UTC--2018-09-01 UTC NA
6 2 1 2018-10-12 UTC--2018-10-20 UTC NA
7 2 0 2017-01-12 UTC--2017-01-16 UTC 1
8 2 0 2017-03-01 UTC--2017-03-15 UTC 0
9 2 0 2017-12-01 UTC--2017-12-31 UTC 0
10 2 0 2018-08-15 UTC--2018-09-19 UTC 1
11 2 0 2018-10-01 UTC--2018-10-21 UTC 1

I add another answer, adapted from here:
R: Determine if each date interval overlaps with all other date intervals in a dataframe
result <- df %>%
  group_by(id) %>%
  mutate(value = map_int(seq_along(time), function(x) {
    if (flag[x] == 1) return(NA_integer_)
    y <- setdiff(which(flag == 1), x)  # flag = 1 rows, excluding the current row
    as.integer(any(int_overlaps(time[x], time[y])))
  }))
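For readers avoiding purrr, here is a base R sketch of the same per-id comparison (my own variant, not from the linked question; it assumes df was built with lubridate as above):

library(lubridate)

value <- rep(NA_integer_, nrow(df))
for (j in which(df$flag == 0)) {
  cases <- df$time[df$flag == 1 & df$id == df$id[j]]  # flag = 1 intervals in the same id
  value[j] <- as.integer(any(int_overlaps(df$time[j], cases)))
}
df$value <- value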

Related

R: Replace NA resulting from joining but not NA in original row

Let's say I have two dataframes:
library(tibble)

weights <- tibble(Time = c(as.POSIXct("1900-01-01 10:00:00"),
                           as.POSIXct("1900-01-01 13:00:00"),
                           as.POSIXct("1900-01-01 14:00:00")),
                  weight = c(3, NA, 4),
                  is_weight = c(1, 1, 1))

heights <- tibble(Time = c(as.POSIXct("1900-01-01 11:00:00"),
                           as.POSIXct("1900-01-01 12:00:00"),
                           as.POSIXct("1900-01-01 15:00:00")),
                  height = c(4, NA, 5),
                  is_height = c(1, 1, 1))
After joining them by Time and filling the NAs in the is_* columns, I get:
library(dplyr)

df <- full_join(weights, heights, by = "Time") %>%
  arrange(Time) %>%
  mutate(is_weight = replace(is_weight, is.na(is_weight), 0)) %>%
  mutate(is_height = replace(is_height, is.na(is_height), 0))
df
# A tibble: 6 x 5
Time weight is_weight height is_height
<dttm> <dbl> <dbl> <dbl> <dbl>
1 1900-01-01 10:00:00 3 1 NA 0
2 1900-01-01 11:00:00 NA 0 4 1
3 1900-01-01 12:00:00 NA 0 NA 1
4 1900-01-01 13:00:00 NA 1 NA 0
5 1900-01-01 14:00:00 4 1 NA 0
6 1900-01-01 15:00:00 NA 0 5 1
Now, there are two types of NAs. In the third row, height is NA because it comes from the original data, i.e. we really have no value at this time. However, in row 4, height is NA because of the joining operation. The same goes for weight. What I want: keep the "original" NAs, and only fill the NAs resulting from the join with the previous value where possible. For example, for weight, whenever is_weight = 0, I want to copy the value of weight from the last row where is_weight = 1. Same for height.
Time weight is_weight height is_height
<dttm> <dbl> <dbl> <dbl> <dbl>
1 1900-01-01 10:00:00 3 1 NA 0
2 1900-01-01 11:00:00 3 0 4 1
3 1900-01-01 12:00:00 3 0 NA 1
4 1900-01-01 13:00:00 NA 1 NA 0
5 1900-01-01 14:00:00 4 1 NA 0
6 1900-01-01 15:00:00 4 0 5 1
So with this I want to ensure that at every timestep I have the last received value of each variable. Is there a neat generic way that I can also apply to, say, 10 columns?
My idea was to replace the "original" NAs with some value like 1234, then use the fill method after joining and replace 1234 back with NA.
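For example, a sketch of that idea for the weight column on the joined data (is_weight already records which rows are original; 1234 is an arbitrary placeholder assumed never to occur as a real value):

library(dplyr)
library(tidyr)

sentinel <- 1234  # assumed absent from the real data

df %>%
  mutate(weight = if_else(is_weight == 1 & is.na(weight), sentinel, weight)) %>%  # mark "real" NAs
  fill(weight) %>%                                                                # fill join-induced NAs
  mutate(weight = na_if(weight, sentinel))                                        # restore the real NAs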
With match + fill:
library(tidyr)

df %>%
  fill(weight) %>%
  mutate(weight = ifelse(!is.na(match(Time, weights$Time)) &
                           is.na(weights$weight[match(df$Time, weights$Time)]),
                         NA, weight))
Time weight is_weight height is_height
<dttm> <dbl> <dbl> <dbl> <dbl>
1 1900-01-01 10:00:00 3 1 NA 0
2 1900-01-01 11:00:00 3 0 4 1
3 1900-01-01 12:00:00 3 0 NA 1
4 1900-01-01 13:00:00 NA 1 NA 0
5 1900-01-01 14:00:00 4 1 NA 0
6 1900-01-01 15:00:00 4 0 5 1
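A generic version for many columns (my own sketch, not from the answer above) can lean on the is_<var> indicator columns: give every row the value of the most recent is_<var> = 1 row, NA or not, which reproduces the table above for both weight and height:

library(purrr)

vars <- c("weight", "height")   # extend to all measured columns

result <- reduce(vars, function(d, v) {
  is_v  <- d[[paste0("is_", v)]]
  n_obs <- cumsum(is_v)              # number of real observations seen so far
  obs   <- d[[v]][is_v == 1]         # observed values, real NAs included
  d[[v]] <- c(NA, obs)[n_obs + 1]    # last observed value; NA before the first observation
  d
}, .init = df)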

R: Compute max of next 12 hours for each timestep

I have this dataframe:
library(tibble)

df <- tibble(id = c(1, 1, 2, 2),
             v = c(0, 3, 1, 2),
             time = c(as.POSIXct("2016-12-01 12:30:00"),
                      as.POSIXct("2016-12-01 20:30:00"),
                      as.POSIXct("2016-12-01 3:30:00"),
                      as.POSIXct("2016-12-01 12:30:00")))
# A tibble: 4 x 3
id v time
<dbl> <dbl> <dttm>
1 1 0 2016-12-01 12:30:00
2 1 3 2016-12-01 20:30:00
3 2 1 2016-12-01 03:30:00
4 2 2 2016-12-01 12:30:00
For each timestep and within each id, I want to compute the max value of v within a specific time period after it, e.g. the next 12 hours. My solution is the following:
library(dplyr)

df %>%
  group_by(id) %>%
  mutate(max_in_12h = purrr::map_dbl(time, function(t) {
    max(v[time > t & time <= t + 60 * 60 * 12])  # window after t, excluding t itself
  }))
id v time max_in_12h
<dbl> <dbl> <dttm> <dbl>
1 1 0 2016-12-01 12:30:00 3
2 1 3 2016-12-01 20:30:00 -Inf
3 2 1 2016-12-01 03:30:00 2
4 2 2 2016-12-01 12:30:00 -Inf
(The -Inf values come from max() on an empty vector when no later rows fall inside the window.) However, in my experience, purrr scales poorly when the dataframe has millions of rows. Is there another neat option?
You will need to test whether the performance is adequate, but here is an alternative.
library(sqldf)

sqldf("select a.*, max(b.v) as max
       from df a
       left join df b on a.id = b.id and
                         b.time > a.time and
                         b.time <= a.time + 60 * 60 * 12
       group by a.rowid")
giving (the left join keeps unmatched rows, so rows with no later observation within 12 hours get NA):
id v time max
1 1 0 2016-12-01 12:30:00 3
2 1 3 2016-12-01 20:30:00 NA
3 2 1 2016-12-01 03:30:00 2
4 2 2 2016-12-01 12:30:00 NA
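Another option, if the data is huge, is the slider package (a sketch, assuming slider is installed); slide_index_max() computes windowed maxima over an irregular time index. Note that its window includes the current row, so rows with no later observation get their own value rather than NA:

library(dplyr)
library(slider)

df %>%
  group_by(id) %>%
  mutate(max_in_12h = slide_index_max(v, time,
                                      after = as.difftime(12, units = "hours"))) %>%
  ungroup()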

Filter data based on subgroups in R

In reality it's much more complex, but let's say my data looks like this:
df <- data.frame(
  id = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  event = c(0, 0, 0, 1, 1, 1, 1, 0, 0, 0),
  day = c(1, 3, 3, 1, 6, 6, 7, 1, 4, 6),
  time = c("2016-10-25 14:00:00", "2016-10-27 12:00:15", "2016-10-27 15:30:00",
           "2016-10-23 11:00:00", "2016-10-28 08:00:15", "2016-10-28 23:00:00",
           "2016-10-29 12:00:00", "2016-10-24 15:00:00", "2016-10-27 15:00:15",
           "2016-10-29 16:00:00"))
df$time <- as.POSIXct(df$time)
Output:
id event day time
1 1 0 1 2016-10-25 14:00:00
2 1 0 3 2016-10-27 12:00:15
3 1 0 3 2016-10-27 15:30:00
4 2 1 1 2016-10-23 11:00:00
5 2 1 6 2016-10-28 08:00:15
6 2 1 6 2016-10-28 23:00:00
7 2 1 7 2016-10-29 12:00:00
8 3 0 1 2016-10-24 15:00:00
9 3 0 4 2016-10-27 15:00:15
10 3 0 6 2016-10-29 16:00:00
What I need to do:
If event is 0, I want to keep only the last 24 hours per id.
If event is 1, I want to keep the 6th day.
I know how to keep the last 24 hours in general:
library(dplyr)
library(lubridate)

last_twentyfour_hours <- df %>%
  group_by(id) %>%
  filter(time > last(time) - hours(24))
But how do I filter differently for each group?
Thank you very much in advance!
Grouped by 'id' and 'event', do a filter with if/else: if 0 is in 'event', use the OP's condition; otherwise return the rows where 'day' is 6.
library(dplyr)
library(lubridate)
df %>%
  group_by(id, event) %>%
  filter(if (0 %in% event) time > last(time) - hours(24) else day == 6) %>%
  ungroup()
Output:
# A tibble: 5 × 4
id event day time
<dbl> <dbl> <dbl> <dttm>
1 1 0 3 2016-10-27 12:00:15
2 1 0 3 2016-10-27 15:30:00
3 2 1 6 2016-10-28 08:00:15
4 2 1 6 2016-10-28 23:00:00
5 3 0 6 2016-10-29 16:00:00
We could use the & and | operators (grouping by id alone works here because each id carries a single event value):
df %>%
  group_by(id) %>%
  filter(event == 0 & time > last(time) - hours(24) |
         event == 1 & day == 6)
id event day time
<dbl> <dbl> <dbl> <dttm>
1 1 0 3 2016-10-27 12:00:15
2 1 0 3 2016-10-27 15:30:00
3 2 1 6 2016-10-28 08:00:15
4 2 1 6 2016-10-28 23:00:00
5 3 0 6 2016-10-29 16:00:00
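For comparison, a base R sketch of the same combined condition, using ave() to compute the per-id last time:

last_time <- ave(as.numeric(df$time), df$id,
                 FUN = function(x) x[length(x)])

subset(df, (event == 0 & as.numeric(time) > last_time - 24 * 3600) |
           (event == 1 & day == 6))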

Fill down data frame in R with date and numeric values

I have a data frame that looks this way:
library(tibble)

df <- tibble(date = c('6/8/2021 18:58',
                      '6/8/2021 19:00',
                      '6/8/2021 19:05',
                      '6/8/2021 19:07'),
             values = c(1, 0, 1, 0))
date            values
6/8/2021 18:58  1
6/8/2021 19:00  0
6/8/2021 19:05  1
6/8/2021 19:07  0
That I need transformed to look like this:
date             values
6/8/2021 18:58   1
6/9/2021 18:59   1
6/10/2021 19:00  0
6/11/2021 19:01  0
6/12/2021 19:02  0
6/13/2021 19:03  0
6/14/2021 19:04  0
6/15/2021 19:05  1
6/16/2021 19:06  1
6/17/2021 19:07  0
Thanks in advance for the help.
Use complete to expand 'date' into a one-minute sequence from the first to the last timestamp after converting the 'date' column to POSIXct with mdy_hm, then add the day offsets and carry each 1 forward one extra minute:
library(dplyr)
library(tidyr)
library(lubridate)
df %>%
  mutate(date = mdy_hm(date)) %>%
  complete(date = seq(first(date), last(date), by = '1 min'),
           fill = list(values = 0)) %>%
  mutate(date = date + days(row_number() - 1),   # shift row k forward by k - 1 days
         values = +(values | lag(values)))       # carry each 1 into the following minute
Output:
# A tibble: 10 x 2
date values
<dttm> <int>
1 2021-06-08 18:58:00 1
2 2021-06-09 18:59:00 1
3 2021-06-10 19:00:00 0
4 2021-06-11 19:01:00 0
5 2021-06-12 19:02:00 0
6 2021-06-13 19:03:00 0
7 2021-06-14 19:04:00 0
8 2021-06-15 19:05:00 1
9 2021-06-16 19:06:00 1
10 2021-06-17 19:07:00 0
Maybe you can try base R code like below:
df <- transform(
  df,
  date = strptime(date, "%m/%d/%Y %H:%M")
)
dfout <- transform(
  data.frame(Date = with(df, seq(date[1], date[length(date)], by = "1 min"))),
  Values = with(df, values[findInterval(Date, df$date)])  # value of the most recent original row
)
which gives (note that this fills minute by minute on the same day, without the day increments shown in the desired output):
Date Values
1 2021-06-08 18:58:00 1
2 2021-06-08 18:59:00 1
3 2021-06-08 19:00:00 0
4 2021-06-08 19:01:00 0
5 2021-06-08 19:02:00 0
6 2021-06-08 19:03:00 0
7 2021-06-08 19:04:00 0
8 2021-06-08 19:05:00 1
9 2021-06-08 19:06:00 1
10 2021-06-08 19:07:00 0
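If the day-by-day shift in the target output is really intended, the increment used in the tidyverse answer can be bolted onto dfout (a small sketch on the result above):

dfout$Date <- dfout$Date + 86400 * (seq_len(nrow(dfout)) - 1)  # push row k forward by k - 1 days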

Using slice() to filter for peak intervals

Consider the following dataset:
library(tibble)
library(readr)

df <- tibble(
  interval = rep(1:10, 4),
  channel = rep(1:2, each = 20),
  date = parse_date(rep(c("2020-07-01", "2020-07-02", "2020-07-03", "2020-07-04"),
                        times = 2, each = 5)),
  time = parse_time(
    rep(format(seq.POSIXt(as.POSIXct(Sys.Date() + 0.05),
                          as.POSIXct(Sys.Date() + 0.95), length.out = 5),
               "%H:%M:%S", tz = "GMT"), 8), format = "%H:%M:%S"),
  trigger = c(rep(0, 5),       # Ch 1, day 1; no max
              0, 2, 0, 2, 0,   # Ch 1, day 2; 2 maxes
              rep(0, 5),       # Ch 1, day 3; no max
              0, 0, 2, 0, 0,   # Ch 1, day 4
              0, 0, 10, 0, 0,  # Ch 2, day 1
              10, rep(0, 4),   # Ch 2, day 2; max at head
              rep(0, 4), 10,   # Ch 2, day 3; max at tail
              4, 10, 4, 10, 0) # Ch 2, day 4; 2 maxes
)
# A tibble: 40 x 5
interval channel date time trigger
<int> <int> <date> <time> <dbl>
1 1 1 2020-07-01 01:12 0
2 2 1 2020-07-01 06:36 0
3 3 1 2020-07-01 12:00 0
4 4 1 2020-07-01 17:24 0
5 5 1 2020-07-01 22:48 0
6 6 1 2020-07-02 01:12 0
7 7 1 2020-07-02 06:36 2
8 8 1 2020-07-02 12:00 0
9 9 1 2020-07-02 17:24 2
10 10 1 2020-07-02 22:48 0
# ... with 30 more rows
My data has 10,000+ rows from a sensor recording daily how many times it's triggered in a time interval. I want to use slice() to filter a 2-hour interval around the time of peak triggers for each day. I have code that works, but it produces warnings for specific situations that I'll explain shortly. Although the warnings do not compromise the results, I would feel more at ease if I did not have them. The conditions I need to consider are:
A sensor not being triggered for > 1 day (trigger = 0)
Triggers peaking at the head or tail end of a day
Triggers peaking more than once a day (the same max at different times)
I mainly code using tidyverse and lubridate functions. My best working code so far is as follows:
library(dplyr)

df %>%
  group_by(date, channel) %>%
  slice(abs(which.max(trigger) + (-1:1))) %>%  # simplifying my interval with 1 row around the peak
  ungroup() %>%
  arrange(channel) %>%
  print()
# A tibble: 20 x 5
interval channel date time trigger
<int> <int> <date> <time> <dbl>
1 1 1 2020-07-01 01:12 0
2 2 1 2020-07-01 06:36 0
3 6 1 2020-07-02 01:12 0
4 7 1 2020-07-02 06:36 2
5 8 1 2020-07-02 12:00 0
6 1 1 2020-07-03 01:12 0
7 2 1 2020-07-03 06:36 0
8 7 1 2020-07-04 06:36 0
9 8 1 2020-07-04 12:00 2
10 9 1 2020-07-04 17:24 0
11 2 2 2020-07-01 06:36 0
12 3 2 2020-07-01 12:00 10
13 4 2 2020-07-01 17:24 0
14 6 2 2020-07-02 01:12 10
15 7 2 2020-07-02 06:36 0
16 4 2 2020-07-03 17:24 0
17 5 2 2020-07-03 22:48 10
18 6 2 2020-07-04 01:12 4
19 7 2 2020-07-04 06:36 10
20 8 2 2020-07-04 12:00 4
I have thought about slicing by interval rather than by peak, but the intervals are not always sequential; it depends on when I reset my programs. If there are 2 or more peaks, I wouldn't mind filtering for just the first peak. If I could identify where there are multiple peaks, that's a plus! Lastly, if there are no triggers for a day, I don't want that day included. I think I could post-filter the inactivity out, but I would still get the warnings.
Quick recap:
My goal is to filter a 2-hour interval around the time of peak triggers. If you can recommend tidyverse/lubridate (or any really!) solutions, I would appreciate the help. Thanks!
You can write a custom function to test various conditions so that no warning is generated.
custom_fun <- function(trigger) {
  # trigger values greater than 0
  inds <- trigger > 0
  # if any value is greater than 0
  if (any(inds)) {
    # the peak position plus one row either side
    vals <- which.max(trigger) + -1:1
    # drop positions that fall off the head or tail of the day
    return(vals[vals > 0 & vals <= length(trigger)])
  }
  # select nothing if no trigger > 0 (slice(0) keeps zero rows)
  else return(0)
}
and then apply it for each date and channel.
library(dplyr)

df %>%
  group_by(date, channel) %>%
  # flag days where the same maximum occurs more than once
  mutate(multiple_peak = sum(trigger == max(trigger)) > 1) %>%
  slice(custom_fun(trigger)) %>%
  ungroup()
# A tibble: 16 x 6
# interval channel date time trigger multiple_peak
# <int> <int> <date> <time> <dbl> <lgl>
# 1 2 2 2020-07-01 06:36 0 FALSE
# 2 3 2 2020-07-01 12:00 10 FALSE
# 3 4 2 2020-07-01 17:24 0 FALSE
# 4 6 1 2020-07-02 01:12 0 TRUE
# 5 7 1 2020-07-02 06:36 2 TRUE
# 6 8 1 2020-07-02 12:00 0 TRUE
# 7 6 2 2020-07-02 01:12 10 FALSE
# 8 7 2 2020-07-02 06:36 0 FALSE
# 9 4 2 2020-07-03 17:24 0 FALSE
#10 5 2 2020-07-03 22:48 10 FALSE
#11 7 1 2020-07-04 06:36 0 FALSE
#12 8 1 2020-07-04 12:00 2 FALSE
#13 9 1 2020-07-04 17:24 0 FALSE
#14 6 2 2020-07-04 01:12 4 TRUE
#15 7 2 2020-07-04 06:36 10 TRUE
#16 8 2 2020-07-04 12:00 4 TRUE
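A filter()-based variant of the same idea (my own sketch): out-of-range positions around the peak are clipped automatically because those rows simply don't exist, all-zero days are dropped with any(), and which.max() picks the first peak when there are several:

library(dplyr)

df %>%
  group_by(date, channel) %>%
  mutate(multiple_peak = sum(trigger == max(trigger)) > 1) %>%
  filter(any(trigger > 0),
         abs(row_number() - which.max(trigger)) <= 1) %>%  # first peak plus one row either side
  ungroup()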
