R function for creating uneven groups based on uneven dates

I am trying to find an R function that can index groups iteratively, given a set of unevenly spaced dates, uneven group sizes, and by grouped cases. Here are example data:
> h
# A tibble: 20 x 2
ID date
<int> <date>
1 1 2021-01-07
2 1 2021-01-11
3 1 2021-01-15
4 1 2021-01-16
5 1 2021-01-21
6 1 2021-01-26
7 1 2021-02-04
8 1 2021-02-08
9 1 2021-02-13
10 1 2021-02-20
11 1 2021-02-23
12 1 2021-02-27
13 2 2021-01-05
14 2 2021-01-11
15 2 2021-02-02
16 2 2021-02-08
17 2 2021-02-08
18 2 2021-02-14
19 2 2021-02-17
20 2 2021-02-21
For each unique ID, I want to find the first date (chronologically) and create a group (i.e., group==1) for that case and any other rows within 7 days. For the next date after those 7 days, create a second group (i.e., group==2) for that case and any others within the following 7 days. Note: the next date is not necessarily exactly 7 days after the initial date. Repeat this process for the remaining cases to get the desired output:
# A tibble: 20 x 3
ID date group
<int> <date> <dbl>
1 1 2021-01-07 1
2 1 2021-01-11 1
3 1 2021-01-15 2
4 1 2021-01-16 2
5 1 2021-01-21 2
6 1 2021-01-26 3
7 1 2021-02-04 4
8 1 2021-02-08 4
9 1 2021-02-13 5
10 1 2021-02-20 5
11 1 2021-02-23 6
12 1 2021-02-27 6
13 2 2021-01-05 1
14 2 2021-01-11 1
15 2 2021-02-02 2
16 2 2021-02-08 2
17 2 2021-02-08 2
18 2 2021-02-14 3
19 2 2021-02-17 3
20 2 2021-02-21 3
Using a rolling window function of 7 days will not work, as far as I can tell, as it will group the cases incorrectly. But I am wondering if a sort of custom rolling window function could be used? I would prefer a solution using dplyr, but other options would also work. Any help here is appreciated.
> dput(h)
structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), date = structure(c(18634,
18638, 18642, 18643, 18648, 18653, 18662, 18666, 18671, 18678,
18681, 18685, 18632, 18638, 18660, 18666, 18666, 18672, 18675,
18679), class = "Date")), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))

Define a function date1 which, given the start date of the prior row's group and the current row's date, returns the start date of the current row's group -- which must be one of those two arguments. Then, grouping by ID, use Reduce to apply it cumulatively over the dates within each ID, and convert the result to factor and then to integer.
library(dplyr)

# Start a new group when the current date is more than 7 days past the
# start date of the current group; otherwise keep the same start date.
date1 <- function(prev, x) if (x > prev + 7) x else prev

h %>%
  group_by(ID) %>%
  mutate(group = as.integer(factor(Reduce(date1, date, accumulate = TRUE)))) %>%
  ungroup()
giving:
# A tibble: 20 x 3
ID date group
<int> <date> <int>
1 1 2021-01-07 1
2 1 2021-01-11 1
3 1 2021-01-15 2
4 1 2021-01-16 2
5 1 2021-01-21 2
6 1 2021-01-26 3
7 1 2021-02-04 4
8 1 2021-02-08 4
9 1 2021-02-13 5
10 1 2021-02-20 5
11 1 2021-02-23 6
12 1 2021-02-27 6
13 2 2021-01-05 1
14 2 2021-01-11 1
15 2 2021-02-02 2
16 2 2021-02-08 2
17 2 2021-02-08 2
18 2 2021-02-14 3
19 2 2021-02-17 3
20 2 2021-02-21 3
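To see what the accumulating Reduce produces, here is a minimal sketch run on ID 1's first six dates (taken from the dput above). Note that Reduce may return the dates as their underlying numeric representation, so we convert back for display:
d1 <- as.Date(c("2021-01-07", "2021-01-11", "2021-01-15",
                "2021-01-16", "2021-01-21", "2021-01-26"))
date1 <- function(prev, x) if (x > prev + 7) x else prev
# Start date of the group each row belongs to
starts <- Reduce(date1, d1, accumulate = TRUE)
as.Date(starts, origin = "1970-01-01")
#> [1] "2021-01-07" "2021-01-07" "2021-01-15" "2021-01-15" "2021-01-15" "2021-01-26"
# Distinct start dates become consecutive group numbers
as.integer(factor(starts))
#> [1] 1 1 2 2 2 3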

For each ID group, initialize group as a vector of NAs. While any group element is still NA, take the first date where group is NA and add 0 and 7 days to it to form a date range. For rows whose date falls in that range, set group to 1 more than the current maximum of group (treating an all-NA group as 0).
library(data.table)

df <- as.data.table(h)
df[order(ID, date), {
  group <- rep(NA_real_, .N)
  while (any(is.na(group))) {
    # Window: first unassigned date through 7 days later (inclusive)
    group_range <- first(date[is.na(group)]) + c(0, 7)
    group[date %between% group_range] <- 1 + max(fcoalesce(group, 0))
  }
  list(date, group)
}, by = ID]
# ID date group
# 1: 1 2021-01-07 1
# 2: 1 2021-01-11 1
# 3: 1 2021-01-15 2
# 4: 1 2021-01-16 2
# 5: 1 2021-01-21 2
# 6: 1 2021-01-26 3
# 7: 1 2021-02-04 4
# 8: 1 2021-02-08 4
# 9: 1 2021-02-13 5
# 10: 1 2021-02-20 5
# 11: 1 2021-02-23 6
# 12: 1 2021-02-27 6
# 13: 2 2021-01-05 1
# 14: 2 2021-01-11 1
# 15: 2 2021-02-02 2
# 16: 2 2021-02-08 2
# 17: 2 2021-02-08 2
# 18: 2 2021-02-14 3
# 19: 2 2021-02-17 3
# 20: 2 2021-02-21 3
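As a side note, %between% is inclusive on both ends, so a date falling exactly 7 days after the window start stays in the current group. A quick check of the window logic, assuming data.table is loaded:
d <- as.Date("2021-01-07") + c(0, 4, 7, 8)
d %between% (as.Date("2021-01-07") + c(0, 7))
#> [1]  TRUE  TRUE  TRUE FALSE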
Here's another version where I try to limit the computations; I have no idea if it's actually faster.
df[order(ID, date), {
  group <- rep(NA_integer_, .N)
  i <- 1L
  g <- 1L
  while (i <= .N) {
    group_range <- date[i] + c(0, 7)
    chg <- date %between% group_range
    group[chg] <- g
    g <- g + 1L
    i <- i + sum(chg)  # jump straight past the rows just assigned
  }
  list(date, group)
}, by = ID]

Related

How to create a new column that counts the number of occurrences of a value in another column and orders them by date

I have a 2 column data frame with "date" and "ID" headings. Some IDs are listed more than once. I want to create a new column "Attempt" that denotes the number of attempts that each ID has taken, ordered by the date of occurrence.
Here is my sample data:
ID <- c(1, 2, 5, 8, 4, 9, 1, 11, 15, 32, 54, 1, 4, 2, 14)
Date <- c("2021-04-12", "2021-04-12", "2021-04-13", "2021-04-14", "2021-04-19",
          "2021-04-19", "2021-04-20", "2021-04-21", "2021-04-22", "2021-04-28",
          "2021-04-28", "2021-04-29", "2021-04-29", "2021-05-06", "2021-05-07")
Data <- data.frame(ID, Date)
Data$Date <- as.Date(Data$Date, format = "%Y-%m-%d")
I tried various iterations of duplicated(). I can remove all duplicates or make every instance of a duplicated value "2" or "3" for example, but I want each occurrence to be ordered based on the date of the attempt taken.
Here is my expected result column to be added onto the original data frame:
Attempt <- c(1,1,1,1,1,1,2,1,1,1,1,3,2,2,1)
Data %>%
  group_by(ID) %>%
  mutate(Attempt = row_number())
ID Date Attempt
1 1 2021-04-12 1
2 2 2021-04-12 1
3 5 2021-04-13 1
4 8 2021-04-14 1
5 4 2021-04-19 1
6 9 2021-04-19 1
7 1 2021-04-20 2
8 11 2021-04-21 1
9 15 2021-04-22 1
10 32 2021-04-28 1
11 54 2021-04-28 1
12 1 2021-04-29 3
13 4 2021-04-29 2
14 2 2021-05-06 2
15 14 2021-05-07 1
If you have a recent version of dplyr (1.1.0 or later, which introduced the .by argument), use
Data %>%
  mutate(Attempt = row_number(), .by = ID)
Using data.table
library(data.table)
setDT(Data)[, Attempt := rowid(ID)]
Output:
> Data
ID Date Attempt
1: 1 2021-04-12 1
2: 2 2021-04-12 1
3: 5 2021-04-13 1
4: 8 2021-04-14 1
5: 4 2021-04-19 1
6: 9 2021-04-19 1
7: 1 2021-04-20 2
8: 11 2021-04-21 1
9: 15 2021-04-22 1
10: 32 2021-04-28 1
11: 54 2021-04-28 1
12: 1 2021-04-29 3
13: 4 2021-04-29 2
14: 2 2021-05-06 2
15: 14 2021-05-07 1
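For completeness, a base R sketch of the same within-group row numbering, assuming Data is already ordered by Date as in the example:
Data$Attempt <- ave(seq_along(Data$ID), Data$ID, FUN = seq_along)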

Using lag function to find the last value for a specific individual

I'm trying to create a column in my spreadsheet that takes the last recorded value (IC) for a specific individual (by the Datetime column) and populates it into a column (LIC) for the current event.
A sub-sample of my data looks like this (actual dataset has 4949 rows and 37 individuals):
> head(ACdatas.scale)
Date Datetime ID.2 IC LIC
1 2019-05-25 2019-05-25 11:57 139 High NA
2 2019-06-09 2019-06-09 19:42 139 Low NA
3 2019-07-05 2019-07-05 20:12 139 Medium NA
4 2019-07-27 2019-07-27 17:27 152 Low NA
5 2019-08-04 2019-08-04 9:13 152 Medium NA
6 2019-08-04 2019-08-04 16:18 139 Medium NA
I would like to be able to populate the last value from the IC column into the current LIC column for the current event (see below)
> head(ACdatas.scale)
Date Datetime ID.2 IC LIC
1 2019-05-25 2019-05-25 11:57 139 High NA
2 2019-06-09 2019-06-09 19:42 139 Low High
3 2019-07-05 2019-07-05 20:12 139 Medium Low
4 2019-07-27 2019-07-27 17:27 152 Low NA
5 2019-08-04 2019-08-04 9:13 152 Medium Low
6 2019-08-04 2019-08-04 16:18 139 Medium Medium
I've tried the following code:
ACdatas.scale <- ACdatas.scale %>%
  arrange(ID.2, Datetime) %>%
  group_by(ID.2) %>%
  mutate(LIC = lag(IC))
This worked some of the time, but when I checked back through the data, it seemed to have a problem when the date switched: it could accurately populate the field within the same day, but not when the previous event was on the previous day. Just to make it super confusing, it only had issues with some of the day switches, and not all! Help please!
Sample data:
dat <- data.frame(id=c(rep("A",5),rep("B",5)), IC=c(1:5,11:15))
dplyr
library(dplyr)
dat %>%
  group_by(id) %>%
  mutate(LIC = lag(IC)) %>%
  ungroup()
# # A tibble: 10 x 3
# id IC LIC
# <chr> <int> <int>
# 1 A 1 NA
# 2 A 2 1
# 3 A 3 2
# 4 A 4 3
# 5 A 5 4
# 6 B 11 NA
# 7 B 12 11
# 8 B 13 12
# 9 B 14 13
# 10 B 15 14
data.table
library(data.table)
as.data.table(dat)[, LIC := shift(IC, type = "lag"), by = .(id)][]
# id IC LIC
# <char> <int> <int>
# 1: A 1 NA
# 2: A 2 1
# 3: A 3 2
# 4: A 4 3
# 5: A 5 4
# 6: B 11 NA
# 7: B 12 11
# 8: B 13 12
# 9: B 14 13
# 10: B 15 14
base R
dat$LIC <- ave(dat$IC, dat$id, FUN = function(z) c(NA, z[-length(z)]))
dat
# id IC LIC
# 1 A 1 NA
# 2 A 2 1
# 3 A 3 2
# 4 A 4 3
# 5 A 5 4
# 6 B 11 NA
# 7 B 12 11
# 8 B 13 12
# 9 B 14 13
# 10 B 15 14
Using your data:
mydat <- structure(list(
  Date = structure(c(18041, 18056, 18082, 18104, 18112, 18112), class = "Date"),
  Datetime = structure(c(1558760220, 1560084120, 1562332320,
                         1564223220, 1564884780, 1564910280),
                       class = c("POSIXct", "POSIXt"), tzone = ""),
  ID.2 = c(139, 139, 139, 152, 152, 139),
  IC = c("High", "Low", "Medium", "Low", "Medium", "Medium"),
  LIC = c(NA, NA, NA, NA, NA, NA)),
  row.names = c(NA, -6L), class = "data.frame")
mydat %>%
  arrange(Datetime) %>%
  group_by(ID.2) %>%
  mutate(LIC = lag(IC))
# A tibble: 6 x 5
# Groups: ID.2 [2]
Date Datetime ID.2 IC LIC
<date> <dttm> <dbl> <chr> <chr>
1 2019-05-25 2019-05-25 11:57:00 139 High NA
2 2019-06-09 2019-06-09 19:42:00 139 Low High
3 2019-07-05 2019-07-05 20:12:00 139 Medium Low
4 2019-07-27 2019-07-27 17:27:00 152 Low NA
5 2019-08-04 2019-08-04 09:13:00 152 Medium Low
6 2019-08-04 2019-08-04 16:18:00 139 Medium Medium
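The key point is that lag() is purely positional: it returns the value from the row above within the group, so rows must be sorted by Datetime before lagging. A minimal sketch of the failure mode, assuming dplyr is loaded (the toy column names here are made up for illustration):
d <- data.frame(id = c("A", "A"), t = c(2, 1), v = c("late", "early"))
# Unsorted: lag() returns whichever row happens to sit above
d %>% group_by(id) %>% mutate(prev = lag(v))
# Sorted first: lag() returns the chronologically previous value
d %>% arrange(id, t) %>% group_by(id) %>% mutate(prev = lag(v))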

Generating cumulative sum, total sum, and unique identifiers between start and end dates

What I want to do
I have a dataset of protest events in the United States. Some events are stand-alone events, while others persist day-after-day (a "multi-day event"). My dataset is structured at the daily level, so a three-day multi-day event is spread out over three rows.
I want to accomplish the following:
Create a cumulative sum of the number of days thus far in any given multi-day event. Specifically, I want to count the number of days between the "First day" and "Last day" of any linked event.
Put the total number of days of each multi-event as a variable
"Name" each multi-day event by concatenating the state in which the protest occurred and a sequential identity number starting at 1 in each state and extending upwards.
Data
Here's a reproducible example:
# Library
library(tidyverse) # Brings in dplyr

# Generate the data set of protests
test <- data.frame(
  state = c(rep("Washington", 10), rep("Idaho", 10)),
  date = lubridate::ymd(c("2021-01-01", "2021-01-03", "2021-01-04", "2021-01-10", "2021-01-15",
                          "2021-01-16", "2021-01-17", "2021-01-18", "2021-01-19", "2021-01-28",
                          "2021-01-12", "2021-01-13", "2021-01-14", "2021-02-01", "2021-02-03",
                          "2021-02-04", "2021-02-05", "2021-02-08", "2021-02-10", "2021-02-14")))
# Now create some lags and leads
test <- test %>%
  group_by(state) %>%
  mutate(date_lag = lag(date),
         date_lead = lead(date),
         days_last = date - date_lag,
         days_next = date_lead - date,
         link_last = if_else(days_last <= 1, 1, 0),
         link_next = if_else(days_next <= 1, 1, 0),
         sequence = if_else(link_last == 0 & link_next == 1, "First day",
                    if_else(is.na(link_last) == TRUE & link_next == 1, "First day",
                    if_else(link_last == 1 & link_next == 1, "Ongoing",
                    if_else(link_last == 1 & link_next == 0, "Last day",
                    if_else(link_last == 1 & is.na(link_next) == TRUE, "Last day", "Not linked"))))))
This generates the following dataframe:
state date date_lag date_lead days_last days_next link_last link_next sequence
<chr> <date> <date> <date> <drtn> <drtn> <dbl> <dbl> <chr>
1 Washington 2021-01-01 NA 2021-01-03 NA days 2 days NA 0 NA
2 Washington 2021-01-03 2021-01-01 2021-01-04 2 days 1 days 0 1 First day
3 Washington 2021-01-04 2021-01-03 2021-01-10 1 days 6 days 1 0 Last day
4 Washington 2021-01-10 2021-01-04 2021-01-15 6 days 5 days 0 0 Not linked
5 Washington 2021-01-15 2021-01-10 2021-01-16 5 days 1 days 0 1 First day
6 Washington 2021-01-16 2021-01-15 2021-01-17 1 days 1 days 1 1 Ongoing
7 Washington 2021-01-17 2021-01-16 2021-01-18 1 days 1 days 1 1 Ongoing
8 Washington 2021-01-18 2021-01-17 2021-01-19 1 days 1 days 1 1 Ongoing
9 Washington 2021-01-19 2021-01-18 2021-01-28 1 days 9 days 1 0 Last day
10 Washington 2021-01-28 2021-01-19 NA 9 days NA days 0 NA NA
11 Idaho 2021-01-12 NA 2021-01-13 NA days 1 days NA 1 NA
12 Idaho 2021-01-13 2021-01-12 2021-01-14 1 days 1 days 1 1 Ongoing
13 Idaho 2021-01-14 2021-01-13 2021-02-01 1 days 18 days 1 0 Last day
14 Idaho 2021-02-01 2021-01-14 2021-02-03 18 days 2 days 0 0 Not linked
15 Idaho 2021-02-03 2021-02-01 2021-02-04 2 days 1 days 0 1 First day
16 Idaho 2021-02-04 2021-02-03 2021-02-05 1 days 1 days 1 1 Ongoing
17 Idaho 2021-02-05 2021-02-04 2021-02-08 1 days 3 days 1 0 Last day
18 Idaho 2021-02-08 2021-02-05 2021-02-10 3 days 2 days 0 0 Not linked
19 Idaho 2021-02-10 2021-02-08 2021-02-14 2 days 4 days 0 0 Not linked
20 Idaho 2021-02-14 2021-02-10 NA 4 days NA days 0 NA NA
What I want to create:
state date date_lag date_lead days_last days_next link_last link_next sequence cumulative duration name
<chr> <date> <date> <date> <drtn> <drtn> <dbl> <dbl> <chr> <dbl> <dbl> <chr>
1 Washington 2021-01-01 NA 2021-01-03 NA days 2 days NA 0 NA NA 0 NA
2 Washington 2021-01-03 2021-01-01 2021-01-04 2 days 1 days 0 1 First day 1 2 Washington.1
3 Washington 2021-01-04 2021-01-03 2021-01-10 1 days 6 days 1 0 Last day 2 2 Washington.1
4 Washington 2021-01-10 2021-01-04 2021-01-15 6 days 5 days 0 0 Not linked NA 0 NA
5 Washington 2021-01-15 2021-01-10 2021-01-16 5 days 1 days 0 1 First day 1 5 Washington.2
6 Washington 2021-01-16 2021-01-15 2021-01-17 1 days 1 days 1 1 Ongoing 2 5 Washington.2
7 Washington 2021-01-17 2021-01-16 2021-01-18 1 days 1 days 1 1 Ongoing 3 5 Washington.2
8 Washington 2021-01-18 2021-01-17 2021-01-19 1 days 1 days 1 1 Ongoing 4 5 Washington.2
9 Washington 2021-01-19 2021-01-18 2021-01-28 1 days 9 days 1 0 Last day 5 5 Washington.2
10 Washington 2021-01-28 2021-01-19 NA 9 days NA days 0 NA NA NA NA NA
11 Idaho 2021-01-12 NA 2021-01-13 NA days 1 days NA 1 NA 1 3 Idaho.1
12 Idaho 2021-01-13 2021-01-12 2021-01-14 1 days 1 days 1 1 Ongoing 2 3 Idaho.1
13 Idaho 2021-01-14 2021-01-13 2021-02-01 1 days 18 days 1 0 Last day 3 3 Idaho.1
14 Idaho 2021-02-01 2021-01-14 2021-02-03 18 days 2 days 0 0 Not linked NA NA NA
15 Idaho 2021-02-03 2021-02-01 2021-02-04 2 days 1 days 0 1 First day 1 3 Idaho.2
16 Idaho 2021-02-04 2021-02-03 2021-02-05 1 days 1 days 1 1 Ongoing 2 3 Idaho.2
17 Idaho 2021-02-05 2021-02-04 2021-02-08 1 days 3 days 1 0 Last day 3 3 Idaho.2
18 Idaho 2021-02-08 2021-02-05 2021-02-10 3 days 2 days 0 0 Not linked NA NA NA
19 Idaho 2021-02-10 2021-02-08 2021-02-14 2 days 4 days 0 0 Not linked NA NA NA
20 Idaho 2021-02-14 2021-02-10 NA 4 days NA days 0 NA NA NA NA NA
Side question: Why is test$sequence[11] an NA and not "First day"?
I'm not sure these are the specific numbers you're looking for, but this represents what seems to me a simpler and more idiomatic tidyverse approach:
test %>%
  group_by(state) %>%
  mutate(days_last = as.numeric(date - lag(date)),
         new_section = 1 * (is.na(days_last) | days_last > 1), # EDIT
         section = cumsum(new_section),
         name = paste(state, section, sep = ".")) %>%
  group_by(name) %>%
  mutate(duration = as.numeric(max(date) - min(date) + 1),
         sequence = case_when(duration == 1 ~ "Unlinked",
                              row_number() == 1 ~ "First Day",
                              row_number() == n() ~ "Last Day",
                              TRUE ~ "Ongoing")) %>%
  ungroup()
Here, I mark any gap of more than one day as a new event, take the cumulative sum, and use that to define the duration of each event.
# A tibble: 20 x 8
state date days_last new_section section name duration sequence
<chr> <date> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
1 Washington 2021-01-01 NA 1 1 Washington.1 1 Unlinked
2 Washington 2021-01-03 2 1 2 Washington.2 2 First Day
3 Washington 2021-01-04 1 0 2 Washington.2 2 Last Day
4 Washington 2021-01-10 6 1 3 Washington.3 1 Unlinked
5 Washington 2021-01-15 5 1 4 Washington.4 5 First Day
6 Washington 2021-01-16 1 0 4 Washington.4 5 Ongoing
7 Washington 2021-01-17 1 0 4 Washington.4 5 Ongoing
8 Washington 2021-01-18 1 0 4 Washington.4 5 Ongoing
9 Washington 2021-01-19 1 0 4 Washington.4 5 Last Day
10 Washington 2021-01-28 9 1 5 Washington.5 1 Unlinked
11 Idaho 2021-01-12 NA 1 1 Idaho.1 3 First Day
12 Idaho 2021-01-13 1 0 1 Idaho.1 3 Ongoing
13 Idaho 2021-01-14 1 0 1 Idaho.1 3 Last Day
14 Idaho 2021-02-01 18 1 2 Idaho.2 1 Unlinked
15 Idaho 2021-02-03 2 1 3 Idaho.3 3 First Day
16 Idaho 2021-02-04 1 0 3 Idaho.3 3 Ongoing
17 Idaho 2021-02-05 1 0 3 Idaho.3 3 Last Day
18 Idaho 2021-02-08 3 1 4 Idaho.4 1 Unlinked
19 Idaho 2021-02-10 2 1 5 Idaho.5 1 Unlinked
20 Idaho 2021-02-14 4 1 6 Idaho.6 1 Unlinked
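The core of this answer is the cumsum-over-gaps idiom: flag the rows where a new event starts, then take a running sum so each event gets its own id. A minimal sketch on a bare date vector:
d <- as.Date(c("2021-01-03", "2021-01-04", "2021-01-10"))
gap <- c(TRUE, diff(d) > 1) # TRUE wherever a new event starts
cumsum(gap)
#> [1] 1 1 2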
I think creating specific functions to do the counting is easier than trying to do everything in a single pipe.
I left all the intermediate steps and the intermediate columns in the output so you can see what each step is doing. It's very likely you won't need to keep all these columns, and you can probably simplify the steps once you understand the approach.
library(tidyverse) # Brings in dplyr
# Generate the data set of protests
test <- data.frame(
  state = c(rep("Washington", 10), rep("Idaho", 10)),
  date = lubridate::ymd(c("2021-01-01", "2021-01-03", "2021-01-04", "2021-01-10", "2021-01-15",
                          "2021-01-16", "2021-01-17", "2021-01-18", "2021-01-19", "2021-01-28",
                          "2021-01-12", "2021-01-13", "2021-01-14", "2021-02-01", "2021-02-03",
                          "2021-02-04", "2021-02-05", "2021-02-08", "2021-02-10", "2021-02-14")))
event_count <- function(v) {
  cnt <- 0
  result <- integer(length(v))
  for (idx in seq_along(v)) {
    if (v[idx]) {
      cnt <- 0
    } else {
      cnt <- cnt + 1
    }
    result[idx] <- cnt
  }
  result
}

need_name <- function(cnt) {
  result <- logical(length(cnt))
  for (idx in seq_along(cnt)) {
    if (cnt[idx] == 0) {
      if (idx == length(cnt)) {
        result[idx] <- FALSE
        break
      }
      result[idx] <- (cnt[idx + 1] != 0)
    } else {
      result[idx] <- TRUE
    }
  }
  result
}

running_count <- function(v) {
  cnt <- 0
  flag <- FALSE
  result <- integer(length(v))
  for (idx in seq_along(v)) {
    if (v[idx]) {
      if (!flag) {
        cnt <- cnt + 1
        flag <- !flag
      }
      result[idx] <- cnt
    } else {
      result[idx] <- 0
      flag <- FALSE
    }
  }
  result
}
test %>%
  group_by(state) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(
    duration = date - lag(date),           # --- Compute durations
    is_first = duration != 1               # --- Check if it is the first day of a protest
  ) %>%
  replace_na(list(is_first = TRUE)) %>%    # --- No more NAs
  ungroup() %>%
  mutate(
    cnt = event_count(is_first),           # --- How many days does this event have?
    need_name = need_name(cnt)             # --- Should we name this event?
  ) %>%
  group_by(state) %>%
  mutate(
    name_number = running_count(need_name) # --- What's the event count?
  ) %>%
  mutate(
    name = paste0(state, ".", name_number), # --- Create names
    cumulative = cnt + 1                    # --- Start counting from one instead of zero
  ) %>%
  group_by(name) %>%
  mutate(
    duration = max(duration)               # --- Calc total duration
  ) %>%
  ungroup() %>%
  mutate(                                  # --- Adding the NAs back
    name = if_else(name_number == 0, NA_character_, name),
    duration = if_else(name_number == 0, NA_integer_, as.integer(duration)),
    cumulative = if_else(name_number == 0, NA_integer_, as.integer(cumulative))
  )
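To see what each helper computes, here is a quick trace on a toy is_first vector (TRUE marks the first day of an event), assuming the three functions above are defined:
is_first <- c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)
event_count(is_first) # days elapsed so far within each event
#> [1] 0 1 2 0 0 1
need_name(event_count(is_first)) # rows that belong to a multi-day event
#> [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
running_count(need_name(event_count(is_first))) # sequential event number (0 = unnamed)
#> [1] 1 1 1 0 2 2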
data.table::rleid is useful here to create run-length ids based on the condition days_last == 1 | days_next == 1 (i.e., sequential dates). If you want different event lengths, you can edit that condition.
library(dplyr)
library(data.table)
test %>%
  dplyr::group_by(state) %>%
  dplyr::mutate(days_last = c(NA, diff(date)),
                days_next = as.numeric(lead(date) - date),
                name = paste0(state, ".",
                              data.table::rleid(days_last == 1 | days_next == 1))) %>%
  dplyr::group_by(name) %>%
  dplyr::mutate(sequence = case_when(n() == 1 ~ "Not Linked",
                                     row_number() == 1 ~ "First day",
                                     n() == row_number() ~ "Last day",
                                     TRUE ~ "Ongoing"),
                duration = n(),
                cumulative = seq_along(name)) %>%
  dplyr::ungroup()
Output
state date days_last days_next name sequence duration cumulative
<chr> <date> <dbl> <dbl> <chr> <chr> <int> <int>
1 Washington 2021-01-01 NA 2 Washington.1 Not Linked 1 1
2 Washington 2021-01-03 2 1 Washington.2 First day 2 1
3 Washington 2021-01-04 1 6 Washington.2 Last day 2 2
4 Washington 2021-01-10 6 5 Washington.3 Not Linked 1 1
5 Washington 2021-01-15 5 1 Washington.4 First day 5 1
6 Washington 2021-01-16 1 1 Washington.4 Ongoing 5 2
7 Washington 2021-01-17 1 1 Washington.4 Ongoing 5 3
8 Washington 2021-01-18 1 1 Washington.4 Ongoing 5 4
9 Washington 2021-01-19 1 9 Washington.4 Last day 5 5
10 Washington 2021-01-28 9 NA Washington.5 Not Linked 1 1
11 Idaho 2021-01-12 NA 1 Idaho.1 First day 3 1
12 Idaho 2021-01-13 1 1 Idaho.1 Ongoing 3 2
13 Idaho 2021-01-14 1 18 Idaho.1 Last day 3 3
14 Idaho 2021-02-01 18 2 Idaho.2 Not Linked 1 1
15 Idaho 2021-02-03 2 1 Idaho.3 First day 3 1
16 Idaho 2021-02-04 1 1 Idaho.3 Ongoing 3 2
17 Idaho 2021-02-05 1 3 Idaho.3 Last day 3 3
18 Idaho 2021-02-08 3 2 Idaho.4 First day 2 1
19 Idaho 2021-02-10 2 4 Idaho.4 Last day 2 2
20 Idaho 2021-02-14 4 NA Idaho.5 Not Linked 1 1
If need be, you can use the NA in the days_last column to set NA values in other rows.
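To see how rleid builds the event ids, here is a minimal illustration, assuming data.table is loaded; each new run of identical values gets the next consecutive id:
rleid(c(FALSE, TRUE, TRUE, FALSE, TRUE))
#> [1] 1 2 2 3 4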
Side question: Why is test$sequence[11] an NA and not "First day"?
Generally, NA propagates in R, meaning that if NA is part of an evaluation, NA is normally returned. When you define sequence, your first if_else condition is link_last == 0 & link_next == 1. On row 11, link_last = NA and link_next = 1, so what you're evaluating is:
NA == 0 & 1 == 1
[1] NA
Instead, your nested is.na() condition should come first. As your if_else is currently written, that nested condition is never evaluated:
is.na(NA) & 1 == 1
[1] TRUE
Here is a data.table approach.
library(data.table)
# Convert from data.frame to data.table
setDT(test)
# Subset the variables.
test2 <- test[, .(state, date, days_last = as.numeric(days_last),
                  days_next = as.numeric(days_next), sequence)]
# Code
test2[, name := paste0(state, '.', rleid(days_last == 1 | days_next == 1)),
      by = state][
  , `:=`(duration = .N,
         cumulative = seq_len(.N)),
  by = name
][, c('days_next', 'days_last') := NULL] # Removing these variables. Feel free to add back!
# Reorder the variables
test2 <- setcolorder(test2, c('state', 'name', 'date',
                              'sequence', 'duration', 'cumulative'))
# Print first 15 rows
print(test2[1:15,])
#> state name date sequence duration cumulative
#> 1: Washington Washington.1 2021-01-01 <NA> 1 1
#> 2: Washington Washington.2 2021-01-03 First day 2 1
#> 3: Washington Washington.2 2021-01-04 Last day 2 2
#> 4: Washington Washington.3 2021-01-10 Not linked 1 1
#> 5: Washington Washington.4 2021-01-15 First day 5 1
#> 6: Washington Washington.4 2021-01-16 Ongoing 5 2
#> 7: Washington Washington.4 2021-01-17 Ongoing 5 3
#> 8: Washington Washington.4 2021-01-18 Ongoing 5 4
#> 9: Washington Washington.4 2021-01-19 Last day 5 5
#> 10: Washington Washington.5 2021-01-28 <NA> 1 1
#> 11: Idaho Idaho.1 2021-01-12 <NA> 3 1
#> 12: Idaho Idaho.1 2021-01-13 Ongoing 3 2
#> 13: Idaho Idaho.1 2021-01-14 Last day 3 3
#> 14: Idaho Idaho.2 2021-02-01 Not linked 1 1
#> 15: Idaho Idaho.3 2021-02-03 First day 3 1
Created on 2021-03-16 by the reprex package (v0.3.0)

Limiting group size when grouping dates by interval

I have a data frame with dates and I would like to group the dates by intervals of 9 days, but with a maximum group size of 7 dates. So if we find 9 dates within an interval, the last 2 dates should roll over to the next group, and so on.
And the starting date of an interval can only be an existing date of the dataset.
Here is an example :
start_date <- as.Date("2020-04-17")
dates <- c(start_date,
           start_date + 10:16,
           start_date + c(17, 18, 20),
           start_date + c(30, 39))
x <- data.frame(date = dates)
> x
date
1 2020-04-17
2 2020-04-27
3 2020-04-28
4 2020-04-29
5 2020-04-30
6 2020-05-01
7 2020-05-02
8 2020-05-03
9 2020-05-04
10 2020-05-05
11 2020-05-07
12 2020-05-17
13 2020-05-26
And the expected output:
date group
1 2020-04-17 1
2 2020-04-27 2
3 2020-04-28 2
4 2020-04-29 2
5 2020-04-30 2
6 2020-05-01 2
7 2020-05-02 2
8 2020-05-03 2
9 2020-05-04 3
10 2020-05-05 3
11 2020-05-07 3
12 2020-05-17 4
13 2020-05-26 4
I'm really stuck on this; nothing I have tried so far has worked. Any help would be really appreciated, thank you!
I believe this is what you want. As you can see, the code is quite inefficient, but I can't think of a way to do it without going through the dates sequentially.
start_date <- as.Date("2020-04-17")
dates <- c(start_date,
           start_date + 10:16,
           start_date + c(17, 18, 20),
           start_date + c(30, 39))
x <- data.frame(date = dates)

assign_group <- function(group_var, group_number) {
  # Find the start of the group (the first unassigned row)
  start_idx <- min(which(is.na(group_var)))
  # Find the end of the group (either group size == 7 or the dates within the 9-day range)
  end_idx <- start_idx + min(6, sum(x$date > x$date[start_idx] &
                                      x$date <= x$date[start_idx] + 9))
  # Take care of an out-of-range index
  end_idx <- min(end_idx, length(group_var))
  # Assign the group number
  group_var[start_idx:end_idx] <- group_number
  return(group_var)
}

group <- rep(NA, nrow(x))
group_number <- 1
while (any(is.na(group))) {
  group <- assign_group(group, group_number)
  group_number <- group_number + 1
  print(group)
}
#> [1] 1 NA NA NA NA NA NA NA NA NA NA NA NA
#> [1] 1 2 2 2 2 2 2 2 NA NA NA NA NA
#> [1] 1 2 2 2 2 2 2 2 3 3 3 NA NA
#> [1] 1 2 2 2 2 2 2 2 3 3 3 4 4
x$group <- group
x
#> date group
#> 1 2020-04-17 1
#> 2 2020-04-27 2
#> 3 2020-04-28 2
#> 4 2020-04-29 2
#> 5 2020-04-30 2
#> 6 2020-05-01 2
#> 7 2020-05-02 2
#> 8 2020-05-03 2
#> 9 2020-05-04 3
#> 10 2020-05-05 3
#> 11 2020-05-07 3
#> 12 2020-05-17 4
#> 13 2020-05-26 4
Created on 2020-05-27 by the reprex package (v0.3.0)
Here is an option using Rcpp:
library(Rcpp)

cppFunction("
IntegerVector grpDates(IntegerVector dates, int winsize, int daysaft) {
    int sz = dates.size(), start = 0;
    IntegerVector res(sz);
    res[0] = 1;
    for (int i = 1; i < sz; i++) {
        // start a new group once the day span or the row cap is exceeded
        if ((dates[i] - dates[start] > daysaft) || (i - start + 1 > winsize)) {
            res[i] = res[i-1] + 1;
            start = i;
        } else {
            res[i] = res[i-1];
        }
    }
    return res;
}")
x$group <- grpDates(dates, 7L, 9L)
x
Output:
date group
1 2020-04-17 1
2 2020-04-27 2
3 2020-04-28 2
4 2020-04-29 2
5 2020-04-30 2
6 2020-05-01 2
7 2020-05-02 2
8 2020-05-03 2
9 2020-05-04 3
10 2020-05-05 3
11 2020-05-07 3
12 2020-05-17 4
13 2020-05-26 4
14 2020-06-03 5
15 2020-06-04 5
16 2020-06-05 5
17 2020-06-06 5
18 2020-06-07 5
19 2020-06-08 5
20 2020-06-09 5
data with more date rows:
start_date <- as.Date("2020-04-17")
dates <- c(start_date,
           start_date + 10:16,
           start_date + c(17, 18, 20),
           start_date + c(30, 39),
           start_date + 47:53)
x <- data.frame(date = dates)
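If you'd rather avoid compilation, the same sequential logic can be written in plain R. A sketch mirroring grpDates above (same winsize/daysaft semantics, not benchmarked against the Rcpp version):
grp_dates_r <- function(dates, winsize = 7L, daysaft = 9L) {
  n <- length(dates)
  res <- integer(n)
  res[1] <- 1L
  start <- 1L
  for (i in seq_len(n)[-1]) {
    # Start a new group once the day span or the row cap is exceeded
    if (as.numeric(dates[i] - dates[start]) > daysaft || i - start + 1L > winsize) {
      res[i] <- res[i - 1L] + 1L
      start <- i
    } else {
      res[i] <- res[i - 1L]
    }
  }
  res
}
x$group <- grp_dates_r(x$date)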

dplyr mutate function to evaluate values within columns (current, previous, next) vertically

I have scoured SO for a way to achieve what I need, without luck, so here it goes.
A while back I discovered the package dplyr and its potential. I am thinking this package can do what I want, I just don't know how. This is a small subset of my data, but it should be representative of my problem.
dummy <- structure(list(
  time = structure(1:20, .Label = c(
    "2015-03-25 12:24:00", "2015-03-25 21:08:00", "2015-03-25 21:13:00",
    "2015-03-25 21:47:00", "2015-03-26 03:08:00", "2015-04-01 20:30:00",
    "2015-04-01 20:34:00", "2015-04-01 20:42:00", "2015-04-01 20:45:00",
    "2015-09-29 18:26:00", "2015-09-29 19:11:00", "2015-09-29 21:21:00",
    "2015-09-29 22:03:00", "2015-09-29 22:38:00", "2015-09-30 00:48:00",
    "2015-09-30 01:38:00", "2015-09-30 01:41:00", "2015-09-30 01:45:00",
    "2015-09-30 01:47:00", "2015-09-30 01:49:00"), class = "factor"),
  ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
  station = c(1L, 1L, 1L, 2L, 3, 4L, 4L, 4L, 4L, 5L, 5L, 6L, 6L, 5, 5, 5L, 7, 7, 7L, 7)),
  .Names = c("time", "ID", "station"), class = "data.frame", row.names = c(NA, -20L))
I wish to evaluate rows within the time column conditional on the ID and station columns. Specifically, I would like the function (dplyr?) to evaluate each time row and compare its time to the previous row (row-1) and the next row (row+1). If the time of the current row is within 1 hour of the time of the previous and/or next row, and the ID and station of the current row match those of the previous and/or next row, then I would like to record a 1 in a new column, otherwise a 0.
How would I achieve this using dplyr?
The expected outcome should be like this:
time ID station new.value
1 2015-03-25 12:24:00 1 1 0
2 2015-03-25 21:08:00 1 1 1
3 2015-03-25 21:13:00 1 1 1
4 2015-03-25 21:47:00 1 2 0
5 2015-03-26 03:08:00 1 3 0
6 2015-04-01 20:30:00 1 4 1
7 2015-04-01 20:34:00 1 4 1
8 2015-04-01 20:42:00 1 4 1
9 2015-04-01 20:45:00 1 4 1
10 2015-09-29 18:26:00 2 5 1
11 2015-09-29 19:11:00 2 5 1
12 2015-09-29 21:21:00 2 6 1
13 2015-09-29 22:03:00 2 6 1
14 2015-09-29 22:38:00 2 5 0
15 2015-09-30 00:48:00 2 5 1
16 2015-09-30 01:38:00 2 5 1
17 2015-09-30 01:41:00 2 7 1
18 2015-09-30 01:45:00 2 7 1
19 2015-09-30 01:47:00 2 7 1
20 2015-09-30 01:49:00 2 7 1
Here is an option using difftime with dplyr's mutate. First, we use a group_by operation to make sure the comparison happens within each unique combination of ID and station. difftime calculates the time difference; here the units are set to hours for convenience. The lag and lead functions, also from dplyr, shift the selected column backward or forward. Combined with the vectorised difftime, this lets you calculate the time difference between the current row and the previous/next row. abs makes sure the result is an absolute value, the condition < 1 checks that the difference is within an hour, and as.integer converts the logical values (TRUE or FALSE) to (1 or 0) correspondingly.
library(dplyr)
dummy %>%
  group_by(ID, station) %>%
  mutate(new.value = as.integer(
    abs(difftime(time, lag(time, default = Inf), units = "hours")) < 1 |
    abs(difftime(time, lead(time, default = Inf), units = "hours")) < 1))
Source: local data frame [20 x 4]
Groups: ID, station [7]
time ID station new.value
(time) (int) (dbl) (int)
1 2015-03-25 12:24:00 1 1 0
2 2015-03-25 21:08:00 1 1 1
3 2015-03-25 21:13:00 1 1 1
4 2015-03-25 21:47:00 1 2 0
5 2015-03-26 03:08:00 1 3 0
6 2015-04-01 20:30:00 1 4 1
7 2015-04-01 20:34:00 1 4 1
8 2015-04-01 20:42:00 1 4 1
9 2015-04-01 20:45:00 1 4 1
10 2015-09-29 18:26:00 2 5 1
11 2015-09-29 19:11:00 2 5 1
12 2015-09-29 21:21:00 2 6 1
13 2015-09-29 22:03:00 2 6 1
14 2015-09-29 22:38:00 2 5 0
15 2015-09-30 00:48:00 2 5 1
16 2015-09-30 01:38:00 2 5 1
17 2015-09-30 01:41:00 2 7 1
18 2015-09-30 01:45:00 2 7 1
19 2015-09-30 01:47:00 2 7 1
20 2015-09-30 01:49:00 2 7 1
Psidom's answer is great -- here's a data.table approach.
library(data.table)
setDT(dummy)
# you do NOT want a factor for your time variable
dummy[, time := as.POSIXct(time) ]
dummy[, `:=`(lag_diff = c(Inf, diff(as.numeric(time))),
             lead_diff = c(diff(as.numeric(time)), Inf)),
      by = .(ID, station)]
dummy[, new.value := as.integer(lag_diff < 3600 | lead_diff < 3600)]
dummy
Another solution using base R functions (sapply and difftime):
n <- nrow(dummy)
dummy$new.value <-
  as.numeric(sapply(1:n, function(i)
    (i < n && (dummy[i, "ID"] == dummy[i+1, "ID"] && dummy[i, "station"] == dummy[i+1, "station"])
       && abs(as.numeric(difftime(dummy[i, "time"], dummy[i+1, "time"]), "hours")) <= 1)
    ||
    (i > 1 && (dummy[i, "ID"] == dummy[i-1, "ID"] && dummy[i, "station"] == dummy[i-1, "station"])
       && abs(as.numeric(difftime(dummy[i, "time"], dummy[i-1, "time"]), "hours")) <= 1)
  ))
# > dummy
# time ID station new.value
# 1 2015-03-25 12:24:00 1 1 0
# 2 2015-03-25 21:08:00 1 1 1
# 3 2015-03-25 21:13:00 1 1 1
# 4 2015-03-25 21:47:00 1 2 0
# 5 2015-03-26 03:08:00 1 3 0
# 6 2015-04-01 20:30:00 1 4 1
# 7 2015-04-01 20:34:00 1 4 1
# 8 2015-04-01 20:42:00 1 4 1
# 9 2015-04-01 20:45:00 1 4 1
# 10 2015-09-29 18:26:00 2 5 1
# 11 2015-09-29 19:11:00 2 5 1
# 12 2015-09-29 21:21:00 2 6 1
# 13 2015-09-29 22:03:00 2 6 1
# 14 2015-09-29 22:38:00 2 5 0
# 15 2015-09-30 00:48:00 2 5 1
# 16 2015-09-30 01:38:00 2 5 1
# 17 2015-09-30 01:41:00 2 7 1
# 18 2015-09-30 01:45:00 2 7 1
# 19 2015-09-30 01:47:00 2 7 1
# 20 2015-09-30 01:49:00 2 7 1
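All three answers hinge on the same one-hour test; here is the difftime unit handling in isolation, using the second and third timestamps from the data (ID 1, station 1, 5 minutes apart):
t1 <- as.POSIXct("2015-03-25 21:08:00")
t2 <- as.POSIXct("2015-03-25 21:13:00")
difftime(t2, t1, units = "hours")
#> Time difference of 0.08333333 hours
abs(as.numeric(difftime(t2, t1, units = "hours"))) < 1
#> [1] TRUE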
