I have a dataframe (Reports_following_AC) where each row represents a report. This dataframe looks like this:
> head(Reports_following_AC)
Park Month Obs_con Coy_Season Number_AC Number_4w_AC
<chr> <date> <dbl> <dbl> <int> <int>
1 14st NE - Coventry 2019-06-14 1 2 8 0
2 14st NE - Coventry 2019-10-12 0 3 10 0
3 14st NE - Coventry 2019-10-13 0 3 10 0
4 14st NE - Coventry 2021-06-23 1 2 10 0
5 Airways Park 2020-07-05 0 2 3 0
6 Airways Park 2021-07-18 1 2 6 0
I would like to add a column to my Reports_following_AC dataframe, "Last_treatment", based on the "AC_code" column of the Reaction_per_park_per_day_3 dataframe (below). In my Reaction_per_park_per_day_3 dataframe, each row represents an AC event.
The Last_treatment column added to the Reports_following_AC dataframe would contain the "AC_code" (treatment) of the last AC event prior to a report in a Park, provided that AC event occurred within the 4 weeks (28 days) before the report.
> head(Reaction_per_park_per_day_3)
# A tibble: 6 x 10
Park Date AC_code
<chr> <date> <dbl>
1 14st NE - Coventry 2019-06-05 6
2 14st NE - Coventry 2019-07-12 7
3 14st NE - Coventry 2019-10-05 1
4 14st NE - Coventry 2021-06-18 2
5 Airways Park 2020-06-26 1
6 Airways Park 2021-06-30 5
The resulting dataframe would therefore look like this:
Park Month Obs_con Coy_Season Number_AC Number_4w_AC Last_treatment
<chr> <date> <dbl> <dbl> <int> <int> <dbl>
1 14st NE - Coventry 2019-06-14 1 2 8 0 6
2 14st NE - Coventry 2019-10-12 0 3 10 0 1
3 14st NE - Coventry 2019-10-13 0 3 10 0 1
4 14st NE - Coventry 2021-06-23 1 2 10 0 NA
5 Airways Park 2020-07-05 0 2 3 0 1
6 Airways Park 2021-07-18 1 2 6 0 5
I tried the following code, but it's not quite working: instead of providing the AC_code for the last AC event prior to the report (if within 28 days of the report), it provides the AC_code for all the AC events within 28 days of the report.
Reports_following_AC_1 <- Reports_following_AC %>%
  left_join(select(Reaction_per_park_per_day_3, c(Park, Date, AC_code))) %>%
  filter(Date <= Month) %>%
  group_by(Park, Month, Obs_con, Coy_Season) %>%
  mutate(Last_treatment = if_else((Month - max(Date)) < 28, AC_code, as.character(NA))) %>%
  distinct()
> head(Reports_following_AC_1)
Park Month Obs_con Coy_Season Number_AC Number_4w_AC Date AC_code Last_treatment
<chr> <date> <dbl> <dbl> <int> <int> <date> <chr> <chr>
1 14st NE - Coventry 2019-06-14 1 2 8 0 2019-01-30 3 NA
2 14st NE - Coventry 2019-06-14 1 2 8 0 2019-01-30 4 NA
3 14st NE - Coventry 2019-06-14 1 2 8 0 2019-01-30 1 NA
4 14st NE - Coventry 2019-06-14 1 2 8 0 2019-02-01 4 NA
5 14st NE - Coventry 2019-06-14 1 2 8 0 2019-02-01 2 NA
6 14st NE - Coventry 2019-06-14 1 2 8 0 2019-02-04 1 NA
I'm ideally looking for a dplyr solution, but I'm open to other possibilities.
If I understand correctly, you want to join with a selection of columns from Reaction_per_park_per_day_3? This should work:
Reports_following_AC_1 <- Reports_following_AC %>%
  left_join(select(Reaction_per_park_per_day_3, c(Park, Date, AC_code)), by = "Park") %>%
  filter(Date <= Month) %>%
  group_by(Park, Month, Obs_con, Coy_Season) %>%
  mutate(Last_treatment = if_else((Month - max(Date)) < 28, lag(AC_code), as.character(NA))) %>%
  distinct()
I figured it out!
Reports_following_AC_1 <- Reports_following_AC %>%
  left_join(select(Reaction_per_park_per_day_3, c(Park, Date, AC_code))) %>%
  filter(Date < Month) %>%
  group_by(Park, Month, Obs_con, Coy_Season, Number_4w_AC) %>%
  mutate(Last_treatment = last(if_else((Month - max(Date)) < 28, AC_code, as.character(NA)))) %>%
  select(c(Park, Month, Obs_con, Coy_Season, Number_4w_AC, Last_treatment)) %>%
  distinct()
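For comparison, here is a hedged alternative sketch (not tested on the full data, and assuming AC_code is numeric as shown in head(Reaction_per_park_per_day_3); swap NA_real_ for NA_character_ if it is character): keep only the most recent prior AC event per report before applying the 28-day cutoff, then join back so reports with no prior event simply get NA.
library(dplyr)

# Most recent AC event before each report, per park
last_codes <- Reports_following_AC %>%
  left_join(select(Reaction_per_park_per_day_3, Park, Date, AC_code), by = "Park") %>%
  filter(Date < Month) %>%                       # only events before the report
  group_by(Park, Month) %>%
  slice_max(Date, n = 1, with_ties = FALSE) %>%  # keep the latest event only
  ungroup() %>%
  transmute(Park, Month,
            Last_treatment = if_else(as.numeric(Month - Date) < 28, AC_code, NA_real_))

# Reports with no prior AC event in their park get NA here
Reports_following_AC_1 <- Reports_following_AC %>%
  left_join(last_codes, by = c("Park", "Month"))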
This is very similar to the question I asked previously (see Count the number of rows a month before a date), but the solution suggested does not fix my issue in this case.
I have a dataframe that looks like this:
> Reports_per_park_per_week_3
Park Week Coy_Season Reports_per_week Number_4w_AC Year
<chr> <date> <chr> <dbl> <int> <chr>
1 Airways Park 2018-04-29 1 5 0 2018
2 Airways Park 2018-05-06 2 2 1 2018
3 Airways Park 2018-05-13 2 0 1 2018
4 Baker Park 2018-05-20 2 3 2 2018
5 Baker Park 2018-05-27 2 9 2 2018
6 Baker Park 2018-06-03 2 2 5 2018
I would like to create another column that calculates the total number of reports per park over the month up to and including the week being evaluated (see the desired output below). This column therefore has to take into account the Park, Week and Reports_per_week columns.
> Reports_per_park_per_week_3
Park Week Coy_Season Reports_per_week Number_4w_AC Year Reports_4w
<chr> <date> <chr> <dbl> <int> <chr> <dbl>
1 Airways Park 2018-04-29 1 5 0 2018 5
2 Airways Park 2018-05-06 2 2 1 2018 7
3 Airways Park 2018-05-13 2 0 1 2018 7
4 Baker Park 2018-05-20 2 3 2 2018 3
5 Baker Park 2018-05-27 2 9 2 2018 12
6 Baker Park 2018-06-03 2 2 5 2018 14
Does this do what you want? It is assumed here that your time series all have 1-week spacing throughout (no weeks are skipped) and that there are zero reports prior to the earliest week in each time series.
library("dplyr")
library("zoo")
Reports_per_park_per_week_3 %>%
  group_by(Park) %>%
  arrange(Week, .by_group = TRUE) %>%
  mutate(Reports_4w = rollsum(c(integer(3L), Reports_per_week), 4L))
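If the 1-week-spacing assumption does not hold (some weeks are skipped), a hedged alternative sketch is to sum over an explicit 28-day window of dates rather than a fixed number of rows. It uses only dplyr and purrr and assumes Week is a Date column:
library("dplyr")
library("purrr")

Reports_per_park_per_week_3 %>%
  group_by(Park) %>%
  arrange(Week, .by_group = TRUE) %>%
  # For each week, sum the reports whose Week falls in the 28 days ending on it
  mutate(Reports_4w = map_dbl(Week,
                              ~ sum(Reports_per_week[Week > .x - 28 & Week <= .x]))) %>%
  ungroup()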
I have the following two data frames:
> head(Reaction_per_park_per_day_3)
Park Date Type_1_2 Number_AC_events
<chr> <date> <chr> <int>
1 Beaverdam Flats 2018-09-25 0 1
2 Nosehill 64 ave 2018-09-26 0 1
3 Nosehill 64 ave 2018-09-26 0 1
4 Nosehill Macewin 2018-09-26 0 1
5 Crestmont 2018-09-27 0 2
6 Country Hills G.C. - Nose Creek 2018-09-28 0 1
> head(All_reports_per_month2)
Month Park Code Reports_per_month
<date> <chr> <chr> <dbl>
1 2018-09-29 Beaverdam Flats 1 1
2 2018-10-12 Nosehill 64 ave 2 1
3 2018-10-25 Nosehill 64 ave 1 2
4 2018-09-21 Crestmont 1 1
5 2018-09-29 Crestmont 2 1
I would like to add a "days since last AC event" column to All_reports_per_month2 that would take into account the date and the park of the AC event as well as the date and park of the report. If the report date is prior to the first AC event in a certain park, NA would appear. See example below:
Month Park Code Reports_per_month Days_since_last_AC
<date> <chr> <chr> <dbl> <chr>
1 2018-09-29 Beaverdam Flats 1 1 4
2 2018-10-12 Nosehill 64 ave 2 1 16
3 2018-10-25 Nosehill 64 ave 1 2 29
4 2018-09-21 Crestmont 1 1 NA
5 2018-09-29 Crestmont 2 1 2
Any help would be appreciated!
This is a joining and filtering operation that will use the dplyr package.
# import the packages
library( dplyr )
# join the data tables and filter so that we are always looking back in time
All_reports_per_month2 %>%
  left_join( Reaction_per_park_per_day_3, by="Park" ) %>%
  filter( Date <= Month ) %>%
  group_by( Park, Month ) %>%
  summarize( Days_since_last_AC = Month - max(Date) )
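One caveat: the filter step drops reports that have no AC event on or before their date (such as the Crestmont 2018-09-21 row). A hedged sketch of a fix, under the same assumptions, is to join the summary back onto All_reports_per_month2 so those rows are kept with Days_since_last_AC = NA, as requested:
library( dplyr )

days_since <- All_reports_per_month2 %>%
  left_join( Reaction_per_park_per_day_3, by="Park" ) %>%
  filter( Date <= Month ) %>%
  group_by( Park, Month ) %>%
  summarize( Days_since_last_AC = as.numeric( Month - max(Date) ), .groups = "drop" )

# Reports with no prior AC event in their park keep NA after this join
All_reports_per_month2 %>%
  left_join( days_since, by = c("Park", "Month") )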
What I want to do
I have a dataset of protest events in the United States. Some events are stand-alone events, while others persist day-after-day (a "multi-day event"). My dataset is structured at the daily level, so a three-day multi-day event is spread out over three rows.
I want to accomplish the following:
Create a cumulative sum of the number of days thus far in any given multi-day event. Specifically, I want to count the number of days between the "First day" and "Last day" of any linked event.
Put the total number of days of each multi-day event in a variable
"Name" each multi-day event by concatenating the state in which the protest occurred and a sequential identity number starting at 1 in each state and extending upwards.
Data
Here's a reproducible example:
# Library
library(tidyverse) # Brings in dplyr
# Generate the data set of protests
test <- data.frame(state = c(rep("Washington", 10), rep("Idaho", 10)),
date = lubridate::ymd(c("2021-01-01", "2021-01-03", "2021-01-04", "2021-01-10", "2021-01-15",
"2021-01-16", "2021-01-17", "2021-01-18", "2021-01-19", "2021-01-28",
"2021-01-12", "2021-01-13", "2021-01-14", "2021-02-01", "2021-02-03",
"2021-02-04", "2021-02-05", "2021-02-08", "2021-02-10", "2021-02-14")))
# Now create some lags and leads
test <- test %>%
  group_by(state) %>%
  mutate(date_lag = lag(date),
         date_lead = lead(date),
         days_last = date - date_lag,
         days_next = date_lead - date,
         link_last = if_else(days_last <= 1, 1, 0),
         link_next = if_else(days_next <= 1, 1, 0),
         sequence = if_else(link_last == 0 & link_next == 1, "First day",
                    if_else(is.na(link_last) == TRUE & link_next == 1, "First day",
                    if_else(link_last == 1 & link_next == 1, "Ongoing",
                    if_else(link_last == 1 & link_next == 0, "Last day",
                    if_else(link_last == 1 & is.na(link_next) == TRUE, "Last day", "Not linked"))))))
This generates the following dataframe:
state date date_lag date_lead days_last days_next link_last link_next sequence
<chr> <date> <date> <date> <drtn> <drtn> <dbl> <dbl> <chr>
1 Washington 2021-01-01 NA 2021-01-03 NA days 2 days NA 0 NA
2 Washington 2021-01-03 2021-01-01 2021-01-04 2 days 1 days 0 1 First day
3 Washington 2021-01-04 2021-01-03 2021-01-10 1 days 6 days 1 0 Last day
4 Washington 2021-01-10 2021-01-04 2021-01-15 6 days 5 days 0 0 Not linked
5 Washington 2021-01-15 2021-01-10 2021-01-16 5 days 1 days 0 1 First day
6 Washington 2021-01-16 2021-01-15 2021-01-17 1 days 1 days 1 1 Ongoing
7 Washington 2021-01-17 2021-01-16 2021-01-18 1 days 1 days 1 1 Ongoing
8 Washington 2021-01-18 2021-01-17 2021-01-19 1 days 1 days 1 1 Ongoing
9 Washington 2021-01-19 2021-01-18 2021-01-28 1 days 9 days 1 0 Last day
10 Washington 2021-01-28 2021-01-19 NA 9 days NA days 0 NA NA
11 Idaho 2021-01-12 NA 2021-01-13 NA days 1 days NA 1 NA
12 Idaho 2021-01-13 2021-01-12 2021-01-14 1 days 1 days 1 1 Ongoing
13 Idaho 2021-01-14 2021-01-13 2021-02-01 1 days 18 days 1 0 Last day
14 Idaho 2021-02-01 2021-01-14 2021-02-03 18 days 2 days 0 0 Not linked
15 Idaho 2021-02-03 2021-02-01 2021-02-04 2 days 1 days 0 1 First day
16 Idaho 2021-02-04 2021-02-03 2021-02-05 1 days 1 days 1 1 Ongoing
17 Idaho 2021-02-05 2021-02-04 2021-02-08 1 days 3 days 1 0 Last day
18 Idaho 2021-02-08 2021-02-05 2021-02-10 3 days 2 days 0 0 Not linked
19 Idaho 2021-02-10 2021-02-08 2021-02-14 2 days 4 days 0 0 Not linked
20 Idaho 2021-02-14 2021-02-10 NA 4 days NA days 0 NA NA
What I want to create:
state date date_lag date_lead days_last days_next link_last link_next sequence cumulative duration name
<chr> <date> <date> <date> <drtn> <drtn> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
1 Washington 2021-01-01 NA 2021-01-03 NA days 2 days NA 0 NA NA 0 NA
2 Washington 2021-01-03 2021-01-01 2021-01-04 2 days 1 days 0 1 First day 1 2 Washington.1
3 Washington 2021-01-04 2021-01-03 2021-01-10 1 days 6 days 1 0 Last day 2 2 Washington.1
4 Washington 2021-01-10 2021-01-04 2021-01-15 6 days 5 days 0 0 Not linked NA 0 NA
5 Washington 2021-01-15 2021-01-10 2021-01-16 5 days 1 days 0 1 First day 1 5 Washington.2
6 Washington 2021-01-16 2021-01-15 2021-01-17 1 days 1 days 1 1 Ongoing 2 5 Washington.2
7 Washington 2021-01-17 2021-01-16 2021-01-18 1 days 1 days 1 1 Ongoing 3 5 Washington.2
8 Washington 2021-01-18 2021-01-17 2021-01-19 1 days 1 days 1 1 Ongoing 4 5 Washington.2
9 Washington 2021-01-19 2021-01-18 2021-01-28 1 days 9 days 1 0 Last day 5 5 Washington.2
10 Washington 2021-01-28 2021-01-19 NA 9 days NA days 0 NA NA NA NA NA
11 Idaho 2021-01-12 NA 2021-01-13 NA days 1 days NA 1 NA 1 3 Idaho.1
12 Idaho 2021-01-13 2021-01-12 2021-01-14 1 days 1 days 1 1 Ongoing 2 3 Idaho.1
13 Idaho 2021-01-14 2021-01-13 2021-02-01 1 days 18 days 1 0 Last day 3 3 Idaho.1
14 Idaho 2021-02-01 2021-01-14 2021-02-03 18 days 2 days 0 0 Not linked NA NA NA
15 Idaho 2021-02-03 2021-02-01 2021-02-04 2 days 1 days 0 1 First day 1 3 Idaho.2
16 Idaho 2021-02-04 2021-02-03 2021-02-05 1 days 1 days 1 1 Ongoing 2 3 Idaho.2
17 Idaho 2021-02-05 2021-02-04 2021-02-08 1 days 3 days 1 0 Last day 3 3 Idaho.2
18 Idaho 2021-02-08 2021-02-05 2021-02-10 3 days 2 days 0 0 Not linked NA NA NA
19 Idaho 2021-02-10 2021-02-08 2021-02-14 2 days 4 days 0 0 Not linked NA NA NA
20 Idaho 2021-02-14 2021-02-10 NA 4 days NA days 0 NA NA NA NA NA
Side question: Why is test$sequence[11] an NA and not "First day"?
I'm not sure these are the specific numbers you're looking for, but this represents what seems to me a simpler and more idiomatic tidyverse approach:
test %>%
  group_by(state) %>%
  mutate(days_last = as.numeric(date - lag(date)),
         new_section = 1*(is.na(days_last) | days_last > 1), # EDIT
         section = cumsum(new_section),
         name = paste(state, section, sep = ".")) %>%
  group_by(name) %>%
  mutate(duration = as.numeric(max(date) - min(date) + 1),
         sequence = case_when(duration == 1 ~ "Unlinked",
                              row_number() == 1 ~ "First Day",
                              row_number() == n() ~ "Last Day",
                              TRUE ~ "Ongoing")) %>%
  ungroup()
Here, I mark any gap of more than one day as a new event, take the cumulative sum, and use that to define the duration of each event.
# A tibble: 20 x 8
state date days_last new_section section name duration sequence
<chr> <date> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
1 Washington 2021-01-01 NA 1 1 Washington.1 1 Unlinked
2 Washington 2021-01-03 2 1 2 Washington.2 2 First Day
3 Washington 2021-01-04 1 0 2 Washington.2 2 Last Day
4 Washington 2021-01-10 6 1 3 Washington.3 1 Unlinked
5 Washington 2021-01-15 5 1 4 Washington.4 5 First Day
6 Washington 2021-01-16 1 0 4 Washington.4 5 Ongoing
7 Washington 2021-01-17 1 0 4 Washington.4 5 Ongoing
8 Washington 2021-01-18 1 0 4 Washington.4 5 Ongoing
9 Washington 2021-01-19 1 0 4 Washington.4 5 Last Day
10 Washington 2021-01-28 9 1 5 Washington.5 1 Unlinked
11 Idaho 2021-01-12 NA 1 1 Idaho.1 3 First Day
12 Idaho 2021-01-13 1 0 1 Idaho.1 3 Ongoing
13 Idaho 2021-01-14 1 0 1 Idaho.1 3 Last Day
14 Idaho 2021-02-01 18 1 2 Idaho.2 1 Unlinked
15 Idaho 2021-02-03 2 1 3 Idaho.3 3 First Day
16 Idaho 2021-02-04 1 0 3 Idaho.3 3 Ongoing
17 Idaho 2021-02-05 1 0 3 Idaho.3 3 Last Day
18 Idaho 2021-02-08 3 1 4 Idaho.4 1 Unlinked
19 Idaho 2021-02-10 2 1 5 Idaho.5 1 Unlinked
20 Idaho 2021-02-14 4 1 6 Idaho.6 1 Unlinked
I think creating specific functions to do the counting is easier than trying to do everything in a single pipe.
I left all the intermediate steps and the intermediate columns in the output so you can see what each step is doing. It's very likely you won't need to keep all these columns and you probably can simplify the steps once you understand the approach.
library(tidyverse) # Brings in dplyr
# Generate the data set of protests
test <- data.frame(state = c(rep("Washington", 10), rep("Idaho", 10)),
date = lubridate::ymd(c("2021-01-01", "2021-01-03", "2021-01-04", "2021-01-10", "2021-01-15",
"2021-01-16", "2021-01-17", "2021-01-18", "2021-01-19", "2021-01-28",
"2021-01-12", "2021-01-13", "2021-01-14", "2021-02-01", "2021-02-03",
"2021-02-04", "2021-02-05", "2021-02-08", "2021-02-10", "2021-02-14")))
# For each row, how many days the current event has lasted so far
# (0 on a "first day", counting up on continuation days)
event_count <- function(v){
  cnt <- 0
  result <- integer(length(v))
  for(idx in seq_along(v)) {
    if(v[idx]) {
      cnt <- 0
    } else {
      cnt <- cnt + 1
    }
    result[idx] <- cnt
  }
  result
}

# TRUE for rows that belong to a multi-day event (a first day that is followed
# by a continuation day, or any continuation day) and therefore need a name
need_name <- function(cnt) {
  result <- logical(length(cnt))
  for(idx in seq_along(cnt)){
    if(cnt[idx] == 0){
      if(idx == length(cnt)){
        result[idx] <- FALSE
        break
      }
      result[idx] <- (cnt[idx + 1] != 0)
    } else{
      result[idx] <- TRUE
    }
  }
  result
}

# Give each run of flagged rows (each multi-day event) a sequential number;
# unflagged rows get 0
running_count <- function(v) {
  cnt <- 0
  flag <- FALSE
  result <- integer(length(v))
  for(idx in seq_along(v)){
    if(v[idx]) {
      if(!flag) {
        cnt <- cnt + 1
        flag <- !flag
      }
      result[idx] <- cnt
    } else{
      result[idx] <- 0
      flag <- FALSE
    }
  }
  result
}
test %>%
  group_by(state) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    duration = date - lag(date),            # --- Compute durations
    is_first = duration != 1                # --- Check if it is the first day of a protest
  ) %>%
  replace_na(list(is_first = TRUE)) %>%     # --- No more NAs
  ungroup %>%
  mutate(
    cnt = event_count(is_first),            # --- How many days does this event have?
    need_name = need_name(cnt)              # --- Should we name this event?
  ) %>%
  group_by(state) %>%
  mutate(
    name_number = running_count(need_name)  # --- What's the event count?
  ) %>%
  mutate(
    name = paste0(state, ".", name_number), # --- Create names
    cumulative = cnt + 1                    # --- Start counting from one instead of zero
  ) %>%
  group_by(name) %>%
  mutate(
    duration = max(duration)                # --- Calc total duration
  ) %>%
  ungroup() %>%
  mutate(                                   # --- Adding the NAs back
    name = if_else(name_number == 0, NA_character_, name),
    duration = if_else(name_number == 0, NA_integer_, as.integer(duration)),
    cumulative = if_else(name_number == 0, NA_integer_, as.integer(cumulative)),
  )
data.table::rleid is useful here to create run lengths based on the condition days_last == 1 | days_next == 1 (i.e. sequential dates). If you want to allow different gap lengths between linked days, you can edit that condition.
library(dplyr)
library(data.table)
test %>%
  dplyr::group_by(state) %>%
  dplyr::mutate(days_last = c(NA, diff(date)),
                days_next = as.numeric(lead(date) - date),
                name = paste0(state, ".", data.table::rleid(days_last == 1 | days_next == 1))) %>%
  dplyr::group_by(name) %>%
  dplyr::mutate(sequence = case_when(
                  n() == 1 ~ "Not Linked",
                  row_number() == 1 ~ "First day",
                  n() == row_number() ~ "Last day",
                  T ~ "Ongoing"),
                duration = n(),
                cumulative = seq_along(name)) %>%
  dplyr::ungroup()
Output
state date days_last days_next name sequence duration cumulative
<chr> <date> <dbl> <dbl> <chr> <chr> <int> <int>
1 Washington 2021-01-01 NA 2 Washington.1 Not Linked 1 1
2 Washington 2021-01-03 2 1 Washington.2 First day 2 1
3 Washington 2021-01-04 1 6 Washington.2 Last day 2 2
4 Washington 2021-01-10 6 5 Washington.3 Not Linked 1 1
5 Washington 2021-01-15 5 1 Washington.4 First day 5 1
6 Washington 2021-01-16 1 1 Washington.4 Ongoing 5 2
7 Washington 2021-01-17 1 1 Washington.4 Ongoing 5 3
8 Washington 2021-01-18 1 1 Washington.4 Ongoing 5 4
9 Washington 2021-01-19 1 9 Washington.4 Last day 5 5
10 Washington 2021-01-28 9 NA Washington.5 Not Linked 1 1
11 Idaho 2021-01-12 NA 1 Idaho.1 First day 3 1
12 Idaho 2021-01-13 1 1 Idaho.1 Ongoing 3 2
13 Idaho 2021-01-14 1 18 Idaho.1 Last day 3 3
14 Idaho 2021-02-01 18 2 Idaho.2 Not Linked 1 1
15 Idaho 2021-02-03 2 1 Idaho.3 First day 3 1
16 Idaho 2021-02-04 1 1 Idaho.3 Ongoing 3 2
17 Idaho 2021-02-05 1 3 Idaho.3 Last day 3 3
18 Idaho 2021-02-08 3 2 Idaho.4 First day 2 1
19 Idaho 2021-02-10 2 4 Idaho.4 Last day 2 2
20 Idaho 2021-02-14 4 NA Idaho.5 Not Linked 1 1
If needed, you can use the NAs in the days_last column (or the "Not Linked" label) to set NA values in the other columns, as sketched below.
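For example (a sketch only, assuming the pipe above was saved to a hypothetical object named result, with duration and cumulative stored as integers as in the output):
library(dplyr)

# Blank out the event columns for stand-alone days, mirroring the NAs in the
# desired output
result %>%
  mutate(name = if_else(sequence == "Not Linked", NA_character_, name),
         duration = if_else(sequence == "Not Linked", NA_integer_, duration),
         cumulative = if_else(sequence == "Not Linked", NA_integer_, cumulative))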
Side question: Why is test$sequence[11] an NA and not "First day"?
Generally, NA propagates in R, meaning that if NA is part of an evaluation then NA is normally returned. When you define sequence, your first if_else condition is link_last == 0 & link_next == 1. On row 11, link_last = NA and link_next = 1, so what you're evaluating is:
NA == 0 & 1 == 1
[1] NA
Instead, your is.na() condition should come first. As your if_else is currently written, that nested condition is never reached:
is.na(NA) & 1 == 1
[1] TRUE
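For completeness, here is a hedged sketch of that reordering (it assumes test still contains the link_last and link_next columns built in the question; only the order of the conditions changes, which is enough to make row 11 come out as "First day", while rows whose remaining comparisons involve NA stay NA as before):
library(dplyr)

test %>%
  group_by(state) %>%
  # Check the is.na() cases before any comparison that could evaluate to NA
  mutate(sequence = if_else(is.na(link_last) & link_next == 1, "First day",
                    if_else(link_last == 1 & is.na(link_next), "Last day",
                    if_else(link_last == 0 & link_next == 1, "First day",
                    if_else(link_last == 1 & link_next == 1, "Ongoing",
                    if_else(link_last == 1 & link_next == 0, "Last day", "Not linked")))))) %>%
  ungroup()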
Here is a data.table approach.
library(data.table)
# Convert from data.frame to data.table
setDT(test)
# Subset the variables.
test2 <- test[, .(state, date, days_last = as.numeric(days_last),
                  days_next = as.numeric(days_next), sequence)]
# Code
test2[, name := paste0(state, '.', rleid(days_last == 1 | days_next == 1)),
      by = state][
  , ':='(duration = .N,
         cumulative = seq(1:.N)),
  by = name
][, c('days_next', 'days_last') := NULL] # Removing these variables. Feel free to add back!
# Reorder the variables
test2 <- setcolorder(test2, c('state', 'name', 'date',
                              'sequence', 'duration',
                              'cumulative'))
# Print first 15 rows
print(test2[1:15,])
#> state name date sequence duration cumulative
#> 1: Washington Washington.1 2021-01-01 <NA> 1 1
#> 2: Washington Washington.2 2021-01-03 First day 2 1
#> 3: Washington Washington.2 2021-01-04 Last day 2 2
#> 4: Washington Washington.3 2021-01-10 Not linked 1 1
#> 5: Washington Washington.4 2021-01-15 First day 5 1
#> 6: Washington Washington.4 2021-01-16 Ongoing 5 2
#> 7: Washington Washington.4 2021-01-17 Ongoing 5 3
#> 8: Washington Washington.4 2021-01-18 Ongoing 5 4
#> 9: Washington Washington.4 2021-01-19 Last day 5 5
#> 10: Washington Washington.5 2021-01-28 <NA> 1 1
#> 11: Idaho Idaho.1 2021-01-12 <NA> 3 1
#> 12: Idaho Idaho.1 2021-01-13 Ongoing 3 2
#> 13: Idaho Idaho.1 2021-01-14 Last day 3 3
#> 14: Idaho Idaho.2 2021-02-01 Not linked 1 1
#> 15: Idaho Idaho.3 2021-02-03 First day 3 1
Created on 2021-03-16 by the reprex package (v0.3.0)
I have the following data:
household person time mode
1 1 07:45:00 non-car
1 1 09:05:00 car
1 2 08:10:00 non-car
1 3 22:45:00 non-car
1 4 08:30:00 car
1 5 22:00:00 car
2 1 07:45:00 non-car
2 2 16:45:00 car
I want to add a column that shows whether a non-car trip occurs at most 1 hour before a car trip within the same household.
That column should hold the index of the person (or persons) whose time overlaps with another person's in this way.
In the first household above, the time of the first person is within 1 hour of person 4's, so the new column shows 4 in front of the first person and 1 in front of the 4th person.
Desired output:
household person time mode overlap
1 1 07:45:00 non-car 4
1 1 09:05:00 car 2
1 2 08:10:00 non-car 4,1
1 3 22:45:00 non-car 0
1 4 08:30:00 car 1,2
1 5 22:00:00 car 0
2 1 07:45:00 non-car 0
2 2 16:45:00 car 0
If there is no intersection with another household member, the value can be 0 (or something like NA).
Here's a dplyr approach that produces those matches.
library(dplyr); library(hms)
df %>%
  # Connect the table to itself, linking by household.
  # So every row gets linked to every row (including itself)
  # with the same household. The original data will end in .x and
  # the joined data will end in .y, so we can compare them below.
  left_join(df, by = c("household")) %>%
  # Find the difference in time, in seconds
  mutate(time_dif = abs(time.y - time.x)) %>%
  filter(time_dif < 3600,        # Keep if <1hr difference
         person.x != person.y,   # Keep if different person
         mode.x != mode.y) %>%   # Keep if different mode
  # We have the answers now, everything below is for formatting
  # Rename and hide some variables we don't need any more
  select(household, person = person.x, time = time.x,
         mode = mode.x, other = person.y) %>%
  # Combine each person's overlaps into one row
  group_by(household, person, time) %>%
  summarise(overlaps = paste(other, collapse = ","), times = length(other)) %>%
  # Add back all original rows, even if no overlaps
  right_join(df) %>%
  ungroup()
## A tibble: 7 x 6
# household person time overlaps times mode
# <int> <int> <time> <chr> <int> <chr>
#1 1 1 07:45 4 1 non-car
#2 1 1 09:05 2 1 car
#3 1 2 08:10 1,4 2 non-car
#4 1 3 22:45 NA NA non-car
#5 1 4 08:30 1,2 2 car
#6 2 1 07:45 NA NA non-car
#7 2 2 16:45 NA NA car