limited size dates groups by interval - r

I have a data frame with dates and I would like to group dates by interval of 9 days, but the group size should be of 7 dates maximum. So if we find 9 days in the interval, the 2 last dates should roll to the next group and so on.
And the starting date of an interval can only be an existing date of the dataset.
Here is an example :
start_date <- as.Date("2020-04-17")
dates <- c(start_date,
start_date + 10:16,
start_date + c(17, 18, 20),
start_date + c(30, 39))
x <- data.frame(date = dates)
> x
date
1 2020-04-17
2 2020-04-27
3 2020-04-28
4 2020-04-29
5 2020-04-30
6 2020-05-01
7 2020-05-02
8 2020-05-03
9 2020-05-04
10 2020-05-05
11 2020-05-07
12 2020-05-17
13 2020-05-26
And the exected output :
date group
1 2020-04-17 1
2 2020-04-27 2
3 2020-04-28 2
4 2020-04-29 2
5 2020-04-30 2
6 2020-05-01 2
7 2020-05-02 2
8 2020-05-03 2
9 2020-05-04 3
10 2020-05-05 3
11 2020-05-07 3
12 2020-05-17 4
13 2020-05-26 4
I'm really stuck ony this, nothing worked from what I tried so far, any help would be really apprectiated, thank you !

I believe this is what you want. As you can see, the code is quite inefficient, but I can't think of the way without going sequentially.
start_date <- as.Date("2020-04-17")
dates <- c(start_date,
start_date + 10:16,
start_date + c(17, 18, 20),
start_date + c(30, 39))
x <- data.frame(date = dates)
assign_group <- function(group_var, group_number) {
# finding the start of the group
start_idx <- min(which(is.na(group_var)))
# finding the end of the group (either group size == 7 or the dates in the range)
end_idx <- start_idx + min(6, sum(x$date > x$date[start_idx] &
x$date <= x$date[start_idx] + 9))
# taking care of the out of range index
end_idx <- min(end_idx, length(group_var))
# assign group number
group_var[start_idx:end_idx] <- group_number
return(group_var)
}
group <- rep(NA, nrow(x))
group_number <- 1
while(sum(is.na(group[length(group)])) > 0){
group <- assign_group(group, group_number)
group_number <- group_number + 1
print(group)
}
#> [1] 1 NA NA NA NA NA NA NA NA NA NA NA NA
#> [1] 1 2 2 2 2 2 2 2 NA NA NA NA NA
#> [1] 1 2 2 2 2 2 2 2 3 3 3 NA NA
#> [1] 1 2 2 2 2 2 2 2 3 3 3 4 4
x$group <- group
x
#> date group
#> 1 2020-04-17 1
#> 2 2020-04-27 2
#> 3 2020-04-28 2
#> 4 2020-04-29 2
#> 5 2020-04-30 2
#> 6 2020-05-01 2
#> 7 2020-05-02 2
#> 8 2020-05-03 2
#> 9 2020-05-04 3
#> 10 2020-05-05 3
#> 11 2020-05-07 3
#> 12 2020-05-17 4
#> 13 2020-05-26 4
Created on 2020-05-27 by the reprex package (v0.3.0)

Here is an option using Rcpp:
library(Rcpp)
cppFunction("
IntegerVector grpDates(IntegerVector dates, int winsize, int daysaft) {
int sz = dates.size(), start = 0;
IntegerVector res(sz);
res[0] = 1;
for (int i = 1; i < sz; i++) {
if ((dates[i] - dates[start] > daysaft) || (i - start + 1 > winsize)) {
res[i] = res[i-1] + 1;
start = i;
} else {
res[i] = res[i-1];
}
}
return res;
}")
x$group <- grpDates(dates, 7L, 9L)
x
output:
date group
1 2020-04-17 1
2 2020-04-27 2
3 2020-04-28 2
4 2020-04-29 2
5 2020-04-30 2
6 2020-05-01 2
7 2020-05-02 2
8 2020-05-03 2
9 2020-05-04 3
10 2020-05-05 3
11 2020-05-07 3
12 2020-05-17 4
13 2020-05-26 4
14 2020-06-03 5
15 2020-06-04 5
16 2020-06-05 5
17 2020-06-06 5
18 2020-06-07 5
19 2020-06-08 5
20 2020-06-09 5
data with more date rows:
start_date <- as.Date("2020-04-17")
dates <- c(start_date,
start_date + 10:16,
start_date + c(17, 18, 20),
start_date + c(30, 39),
start_date + 47:53)
x <- data.frame(date = dates)

Related

How to create a new column that counts the number of occurrences of a value in another column and orders them by date

I have a 2 column data frame with "date" and "ID" headings. Some IDs are listed more than once. I want to create a new column "Attempt" that denotes the number of attempts that each ID has taken, ordered by the date of occurrence.
Here is my sample data:
ID <- c(1,2,5,8,4,9,1,11,15,32,54,1,4,2,14)
Date <- c("2021-04-12", "2021-04-12", "2021-04-13", "2021-04-14", "2021-04-19",
"2021-04-19", "2021-04-20", "2021-04-21", "2021-04-22", "2021-04-28",
"2021-04-28", "2021-04-29", "2021-04-29", "2021-05-06", "2021-05-07")
Data <- data.frame(ID, Date)
Data$Date <- as.Date(Data$Date, format="%Y-%m-%d")
I tried various iterations of duplicated(). I can remove all duplicates or make every instance of a duplicated value "2" or "3" for example, but I want each occurrence to be ordered based on the date of the attempt taken.
Here is my expected result column to be added onto the original data frame:
Attempt <- c(1,1,1,1,1,1,2,1,1,1,1,3,2,2,1)
Data %>%
group_by(ID)
mutate(Attempt1 = row_number())
ID Date Attempt
1 1 2021-04-12 1
2 2 2021-04-12 1
3 5 2021-04-13 1
4 8 2021-04-14 1
5 4 2021-04-19 1
6 9 2021-04-19 1
7 1 2021-04-20 2
8 11 2021-04-21 1
9 15 2021-04-22 1
10 32 2021-04-28 1
11 54 2021-04-28 1
12 1 2021-04-29 3
13 4 2021-04-29 2
14 2 2021-05-06 2
15 14 2021-05-07 1
If you have the latest version of dplyr use
Data %>%
mutate(Attempt = row_number(), .by = ID)
Using data.table
library(data.table)
setDT(Data)[, Attempt := rowid(ID)]
-output
> Data
ID Date Attempt
1: 1 2021-04-12 1
2: 2 2021-04-12 1
3: 5 2021-04-13 1
4: 8 2021-04-14 1
5: 4 2021-04-19 1
6: 9 2021-04-19 1
7: 1 2021-04-20 2
8: 11 2021-04-21 1
9: 15 2021-04-22 1
10: 32 2021-04-28 1
11: 54 2021-04-28 1
12: 1 2021-04-29 3
13: 4 2021-04-29 2
14: 2 2021-05-06 2
15: 14 2021-05-07 1

Converting variable with 5 digit numbers and dates into date values

I have the following data, which contains some date values as 5 digit character values. When I try to convert to date, the correct date changes to NA value.
dt <- data.frame(id=c(1,1,1,1,1,1,2,2,2,2,2),
Registrationdate=c('2019-01-09','2019-01-09','2019-01-09','2019-01-09','2019-01-09',
'2019-01-09',"44105","44105","44105","44105","44105"))
Expected value
id Registrationdate
1 1 2019-01-09
2 1 2019-01-09
3 1 2019-01-09
4 1 2019-01-09
5 1 2019-01-09
6 1 2019-01-09
7 2 2020-10-01
8 2 2020-10-01
9 2 2020-10-01
10 2 2020-10-01
11 2 2020-10-01
I tried using
library(openxlsx)
dt$Registrationdate <- convertToDate(dt$Registrationdate, origin = "1900-01-01")
But I got
1 1 <NA>
2 1 <NA>
3 1 <NA>
4 1 <NA>
5 1 <NA>
6 1 <NA>
7 2 2020-10-01
8 2 2020-10-01
9 2 2020-10-01
10 2 2020-10-01
11 2 2020-10-01
Here's one approach using a mix of dplyr and base R:
library(dplyr, warn = FALSE)
dt |>
mutate(Registrationdate = if_else(grepl("-", Registrationdate),
as.Date(Registrationdate),
openxlsx::convertToDate(Registrationdate, origin = "1900-01-01")))
#> Warning in openxlsx::convertToDate(Registrationdate, origin = "1900-01-01"): NAs
#> introduced by coercion
#> id Registrationdate
#> 1 1 2019-01-09
#> 2 1 2019-01-09
#> 3 1 2019-01-09
#> 4 1 2019-01-09
#> 5 1 2019-01-09
#> 6 1 2019-01-09
#> 7 2 2020-10-01
#> 8 2 2020-10-01
#> 9 2 2020-10-01
#> 10 2 2020-10-01
#> 11 2 2020-10-01
Created on 2022-10-15 with reprex v2.0.2
library(janitor)
dt$Registrationdate <- convert_to_date(dt$Registrationdate)
id Registrationdate
1 1 2019-01-09
2 1 2019-01-09
3 1 2019-01-09
4 1 2019-01-09
5 1 2019-01-09
6 1 2019-01-09
7 2 2020-10-01
8 2 2020-10-01
9 2 2020-10-01
10 2 2020-10-01
11 2 2020-10-01
Another option is to import columns in the expected format. An example with openxlsx2 is shown below. The top half creates a file that causes the behavior you see with openxlsx. This is because some of the rows in the Registrationdate column are formatted as dates and some as strings, a fairly common error caused by the person who generated the xlsx input.
With openxlsx2 you can define the type of column you want to import. The option was inspired by readxl (iirc).
library(openxlsx2)
## prepare data
date_as_string <- data.frame(
id = rep(1, 6),
Registrationdate = rep('2019-01-09', 6)
)
date_as_date <- data.frame(
id = rep(2, 5),
Registrationdate = rep(as.Date('2019-01-10'), 5)
)
options(openxlsx2.dateFormat = "yyyy-mm-dd")
wb <- wb_workbook()$
add_worksheet()$
add_data(x = date_as_string)$
add_data(x = date_as_date, colNames = FALSE, startRow = 7)
#wb$open()
## read data as date
dt <- wb_to_df(wb, types = c(id = 1, Registrationdate = 2))
## check that Registrationdate is actually a Date column
str(dt$Registrationdate)
#> Date[1:10], format: "2019-01-09" "2019-01-09" "2019-01-09" "2019-01-09" "2019-01-09" ...

R function for creating uneven groups based on uneven dates

I am trying to find an R function that can index groups iteratively, given a set of unevenly spaced dates, uneven group sizes, and by grouped cases. Here are example data:
> h
# A tibble: 20 x 2
ID date
<int> <date>
1 1 2021-01-07
2 1 2021-01-11
3 1 2021-01-15
4 1 2021-01-16
5 1 2021-01-21
6 1 2021-01-26
7 1 2021-02-04
8 1 2021-02-08
9 1 2021-02-13
10 1 2021-02-20
11 1 2021-02-23
12 1 2021-02-27
13 2 2021-01-05
14 2 2021-01-11
15 2 2021-02-02
16 2 2021-02-08
17 2 2021-02-08
18 2 2021-02-14
19 2 2021-02-17
20 2 2021-02-21
For each unique ID, I want to find the first date (chronologically) and create a group (i.e., group==1) for that case and any other rows within 7 days. For the next date after 7 days, create a second group (i.e., group==2) for that case and any others within the next 7 days. Note: the next date is not necessarily exactly 7 days after the initial date. Repeat this process for the remaining remaining cases to get the desired output:
# A tibble: 20 x 3
ID date group
<int> <date> <dbl>
1 1 2021-01-07 1
2 1 2021-01-11 1
3 1 2021-01-15 2
4 1 2021-01-16 2
5 1 2021-01-21 2
6 1 2021-01-26 3
7 1 2021-02-04 4
8 1 2021-02-08 4
9 1 2021-02-13 5
10 1 2021-02-20 5
11 1 2021-02-23 6
12 1 2021-02-27 6
13 2 2021-01-05 1
14 2 2021-01-11 1
15 2 2021-02-02 2
16 2 2021-02-08 2
17 2 2021-02-08 2
18 2 2021-02-14 3
19 2 2021-02-17 3
20 2 2021-02-21 3
Using a rolling window function of 7 days will not work, as far as I can tell, as it will group the cases incorrectly. But I am wondering if a sort of custom rolling window function could be used? I would prefer a solution using dplyr, but other options would also work. Any help here is appreciated.
> dput(h)
structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), date = structure(c(18634,
18638, 18642, 18643, 18648, 18653, 18662, 18666, 18671, 18678,
18681, 18685, 18632, 18638, 18660, 18666, 18666, 18672, 18675,
18679), class = "Date")), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
Define a function date1 which given the first date of the group of the prior row's point and the current row's date returns the date of the start of the current group -- that must be one of the two arguments. Then grouping by ID use Reduce to apply that to the dates in each ID and convert the result to factor and then to integer.
library(dplyr)
date1 <- function(prev, x) if (x > prev + 7) x else prev
h %>%
group_by(ID) %>%
mutate(group = as.integer(factor(Reduce(date1, date, acc = TRUE)))) %>%
ungroup
giving:
# A tibble: 20 x 3
ID date group
<int> <date> <dbl>
1 1 2021-01-07 1
2 1 2021-01-11 1
3 1 2021-01-15 2
4 1 2021-01-16 2
5 1 2021-01-21 2
6 1 2021-01-26 3
7 1 2021-02-04 4
8 1 2021-02-08 4
9 1 2021-02-13 5
10 1 2021-02-20 5
11 1 2021-02-23 6
12 1 2021-02-27 6
13 2 2021-01-05 1
14 2 2021-01-11 1
15 2 2021-02-02 2
16 2 2021-02-08 2
17 2 2021-02-08 2
18 2 2021-02-14 3
19 2 2021-02-17 3
20 2 2021-02-21 3
For each ID group, create group as a vector of NAs. While some group elements are still NA, take the first date value where group is NA and add 0 and 7 days to it to make a range of dates. For any rows where date is in the calculated date range, set elements of group to 1 more than the current max value of group (or 0 if group is still all NA).
library(data.table)
setDT(df)
df[order(ID, date), {
group <- rep(NA_real_, .N)
while(any(is.na(group))){
group_range <- first(date[is.na(group)]) + c(0, 7)
group[date %between% group_range] <- 1 + max(fcoalesce(group, 0))
}
list(date, group)
}, by = ID]
# ID date group
# 1: 1 2021-01-07 1
# 2: 1 2021-01-11 1
# 3: 1 2021-01-15 2
# 4: 1 2021-01-16 2
# 5: 1 2021-01-21 2
# 6: 1 2021-01-26 3
# 7: 1 2021-02-04 4
# 8: 1 2021-02-08 4
# 9: 1 2021-02-13 5
# 10: 1 2021-02-20 5
# 11: 1 2021-02-23 6
# 12: 1 2021-02-27 6
# 13: 2 2021-01-05 1
# 14: 2 2021-01-11 1
# 15: 2 2021-02-02 2
# 16: 2 2021-02-08 2
# 17: 2 2021-02-08 2
# 18: 2 2021-02-14 3
# 19: 2 2021-02-17 3
# 20: 2 2021-02-21 3
Here's another version where I try to limit the computations. No idea if it's actually faster
df[order(ID, date), {
group <- rep(NA_integer_, .N)
i <- 1L
g <- 1L
while(i <= .N){
group_range <- date[i] + c(0, 7)
chg <- date %between% group_range
group[chg] <- g
g <- g + 1L
i <- i + sum(chg)
}
list(date, group)
}, by = ID]

Generating test data in R

I am trying to generate this table as one of the inputs to a test.
id diff d
1: 1 2 2020-07-31
2: 1 1 2020-08-01
3: 1 1 2020-08-02
4: 1 1 2020-08-03
5: 1 1 2020-08-04
6: 2 2 2020-07-31
7: 2 1 2020-08-01
8: 2 1 2020-08-02
9: 2 1 2020-08-03
10: 2 1 2020-08-04
11: 3 2 2020-07-31
12: 3 1 2020-08-01
13: 3 1 2020-08-02
14: 3 1 2020-08-03
15: 3 1 2020-08-04
16: 4 2 2020-07-31
17: 4 1 2020-08-01
18: 4 1 2020-08-02
19: 4 1 2020-08-03
20: 4 1 2020-08-04
21: 5 2 2020-07-31
22: 5 1 2020-08-01
23: 5 1 2020-08-02
24: 5 1 2020-08-03
25: 5 1 2020-08-04
id diff d
I have done it like this -
input1 = data.table(id=as.character(1:5), diff=1)
input1 = input1[,.(d=seq(as.Date('2020-07-31'), by='days', length.out = 5)),.(id, diff)]
input1[d == '2020-07-31']$diff = 2
diff is basically the number of days to the next weekday. Eg. 31st Jul 2020 is Friday. Hence diff is 2 which is the diff to the next weekday, Monday. For the others it will be 1.
Is there a more R idiomatic way of doing this ?
I personally dont like that I had to generate the date sequence for each of the ids separately or the hardcoding of the diff that I have to do in the input for 31st July. Is there a more generic way of doing this without the hardcoding?
We can create all combination of dates and id using crossing and create diff column based on whether the weekday is "Friday".
library(dplyr)
tidyr::crossing(id = 1:5, d = seq(as.Date('2020-07-31'),
by='days', length.out = 5)) %>%
mutate(diff = as.integer(weekdays(d) == 'Friday') + 1)
Similar logic using base R expand.grid :
transform(expand.grid(id = 1:5,
d = seq(as.Date('2020-07-31'), by='days', length.out = 5)),
diff = as.integer(weekdays(d) == 'Friday') + 1)
and CJ in data.table :
library(data.table)
df <- CJ(id = 1:5, d = seq(as.Date('2020-07-31'), by='days', length.out = 5))
df[, diff := as.integer(weekdays(d) == 'Friday') + 1]

Using slice() to filter for peak intervals

Consider the following dataset:
df <- tibble(
interval = rep(1:10, 4),
channel = rep(1:2, each = 20),
date = parse_date(rep(c("2020-07-01", "2020-07-02", "2020-07-03", "2020-07-04"),
times = 2, each = 5)),
time = parse_time(
rep(format(seq.POSIXt(as.POSIXct(Sys.Date() + 0.05),
as.POSIXct(Sys.Date() + 0.95), length.out = 5),
"%H:%M:%S", tz="GMT"), 8), format = "%H:%M:%S"),
trigger = c(rep(0,5), # Ch 1, day 1; no max
0, 2, 0, 2, 0, # Ch 1, day 2; 2 maxes
rep(0, 5), # Ch 1, day 3; no max
0, 0, 2, 0, 0, # Ch 1, day 4
0, 0, 10, 0, 0, # Ch 2, day 1
10, rep(0, 4), # Ch 2, day 2; max at head
rep(0, 4), 10, # Ch 2, day 3; max at tail
4, 10, 4, 10, 0) # Ch 2, day 4; 2 maxes
)
# A tibble: 40 x 5
interval channel date time trigger
<int> <int> <date> <time> <dbl>
1 1 1 2020-07-01 01:12 0
2 2 1 2020-07-01 06:36 0
3 3 1 2020-07-01 12:00 0
4 4 1 2020-07-01 17:24 0
5 5 1 2020-07-01 22:48 0
6 6 1 2020-07-02 01:12 0
7 7 1 2020-07-02 06:36 2
8 8 1 2020-07-02 12:00 0
9 9 1 2020-07-02 17:24 2
10 10 1 2020-07-02 22:48 0
# ... with 30 more rows
My data has 10,000+ rows from a sensor recording daily how many times it's triggered in a time interval. I want to use slice() to filter a 2-hour interval around the time of peak triggers for each day. I have code that works, but it produces warnings for specific situations that I'll explain shortly. Although the warnings do not compromise the results, I would feel more at ease if I did not have them. The conditions I need to consider are:
A sensor not being triggered for > 1 day (trigger = 0)
Triggers peaking at the head or tail end of a day
Triggers peak at more than once a day (the same max at different times)
I mainly code using tidyverse and lubridate functions. My best working code so far is as follows:
df %>%
group_by(date, channel) %>%
slice(abs(which.max(trigger) + (-1:1))) %>% # Simplifying my interval with 1 row around the peak
ungroup() %>%
arrange(channel) %>%
print()
# A tibble: 20 x 5
interval channel date time trigger
<int> <int> <date> <time> <dbl>
1 1 1 2020-07-01 01:12 0
2 2 1 2020-07-01 06:36 0
3 6 1 2020-07-02 01:12 0
4 7 1 2020-07-02 06:36 2
5 8 1 2020-07-02 12:00 0
6 1 1 2020-07-03 01:12 0
7 2 1 2020-07-03 06:36 0
8 7 1 2020-07-04 06:36 0
9 8 1 2020-07-04 12:00 2
10 9 1 2020-07-04 17:24 0
11 2 2 2020-07-01 06:36 0
12 3 2 2020-07-01 12:00 10
13 4 2 2020-07-01 17:24 0
14 6 2 2020-07-02 01:12 10
15 7 2 2020-07-02 06:36 0
16 4 2 2020-07-03 17:24 0
17 5 2 2020-07-03 22:48 10
18 6 2 2020-07-04 01:12 4
19 7 2 2020-07-04 06:36 10
20 8 2 2020-07-04 12:00 4
I have thought to slice by interval rather than the peak, but the intervals are not always sequential; it depends on when I reset my programs. If there are 2 or more peaks, I wouldn't mind filtering for the first peak. If I could identify where there are multiple peaks, that's a plus! Lastly, if there are no triggers for a day, I wouldn't want that day included. I think I could post-filter the inactivity out, but I would still get the warnings.
Quick recap:
My goal is to filter a 2-hour interval around the time of peak triggers. If you can recommend tidyverse/lubridate (or any really!) solutions, I would appreciate the help. Thanks!
You can write a custom function to test various conditions so that no warning is generated.
custom_fun <- function(trigger) {
#trigger value greater than 0
inds <- trigger > 0
#If any value greater than 0
if(any(inds)) {
#return the 2-hour interval
vals <- which.max(trigger) + -1:1
#remove values during head and tail of the day
return(vals[vals > 0 & vals <= length(trigger)])
}
#Don't select anything if no trigger > 0
else return(0)
}
and then apply it for each date and channel.
library(dplyr)
df %>%
group_by(date, channel) %>%
#If multiple peaks present.
mutate(mulitple_peak = sum(trigger == max(trigger)) > 1) %>%
slice(custom_fun(trigger)) %>%
ungroup()
# A tibble: 16 x 6
# interval channel date time trigger mulitple_peak
# <int> <int> <date> <time> <dbl> <lgl>
# 1 2 2 2020-07-01 06:36 0 FALSE
# 2 3 2 2020-07-01 12:00 10 FALSE
# 3 4 2 2020-07-01 17:24 0 FALSE
# 4 6 1 2020-07-02 01:12 0 TRUE
# 5 7 1 2020-07-02 06:36 2 TRUE
# 6 8 1 2020-07-02 12:00 0 TRUE
# 7 6 2 2020-07-02 01:12 10 FALSE
# 8 7 2 2020-07-02 06:36 0 FALSE
# 9 4 2 2020-07-03 17:24 0 FALSE
#10 5 2 2020-07-03 22:48 10 FALSE
#11 7 1 2020-07-04 06:36 0 FALSE
#12 8 1 2020-07-04 12:00 2 FALSE
#13 9 1 2020-07-04 17:24 0 FALSE
#14 6 2 2020-07-04 01:12 4 TRUE
#15 7 2 2020-07-04 06:36 10 TRUE
#16 8 2 2020-07-04 12:00 4 TRUE

Resources