Determining at most 1 hour time difference between car and non-car mode - r

I have
household person time mode
1 1 07:45:00 non-car
1 1 09:05:00 car
1 2 08:10:00 non-car
1 3 22:45:00 non-car
1 4 08:30:00 car
1 5 22:00:00 car
2 1 07:45:00 non-car
2 2 16:45:00 car
I want to add a column that flags when a non-car trip is at most 1 hour before a car trip within each family.
That column should hold the index of the person (or persons) whose time overlaps in this way with the current row.
In the example above, for the first family, person 1's time is less than 1 hour before person 4's, so the new column shows 4 for person 1 and 1 for person 4.
household person time mode overlap
1 1 07:45:00 non-car 4
1 1 09:05:00 car 2
1 2 08:10:00 non-car 4,1
1 3 22:45:00 non-car 0
1 4 08:30:00 car 1,2
1 5 22:00:00 car 0
2 1 07:45:00 non-car 0
2 2 16:45:00 car 0
If there is no overlap with another family member, the value can be 0 or something like NA.

Here's a dplyr approach that produces those matches.
library(dplyr); library(hms)
df %>%
# Join the table to itself by household, so every row is paired
# with every row (including itself) in the same household.
# Columns from the original data end in .x and columns from
# the joined data end in .y, so we can compare them below.
left_join(df, by = c("household")) %>%
# Find the difference in time, in seconds
mutate(time_dif = abs(time.y - time.x)) %>%
filter(time_dif < 3600, # Keep if <1hr difference
person.x != person.y, # Keep if different person
mode.x != mode.y) %>% # Keep if different mode
# We have the answers now, everything below is for formatting
# Rename and hide some variables we don't need any more
select(household, person = person.x, time = time.x,
mode = mode.x, other = person.y) %>%
# Combine each person's overlaps into one row
group_by(household, person, time) %>%
summarise(overlaps = paste(other, collapse =","), times = length(other)) %>%
# Add back all original rows, even if no overlaps
right_join(df)
## A tibble: 7 x 6
# household person time overlaps times mode
# <int> <int> <time> <chr> <int> <chr>
#1 1 1 07:45 4 1 non-car
#2 1 1 09:05 2 1 car
#3 1 2 08:10 1,4 2 non-car
#4 1 3 22:45 NA NA non-car
#5 1 4 08:30 1,2 2 car
#6 2 1 07:45 NA NA non-car
#7 2 2 16:45 NA NA car
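For readers without dplyr, the same self-join idea can be sketched in base R. This is only an illustration under assumptions: it uses a reduced toy subset of household 1 from the question (one row per person), and the helper column `secs` is a name introduced here, not something from the answer above.

```r
# Toy subset of household 1 from the question (one row per person).
df <- data.frame(
  household = c(1, 1, 1, 1),
  person    = c(1, 2, 3, 4),
  time      = c("07:45:00", "08:10:00", "22:45:00", "08:30:00"),
  mode      = c("non-car", "non-car", "non-car", "car"),
  stringsAsFactors = FALSE
)

# Convert "HH:MM:SS" to seconds since midnight (hypothetical helper column).
df$secs <- sapply(strsplit(df$time, ":"),
                  function(p) sum(as.numeric(p) * c(3600, 60, 1)))

# Self-join within household: every row is paired with every row.
pairs <- merge(df, df, by = "household", suffixes = c(".x", ".y"))

# Keep pairs of different people with different modes, under 1 hour apart.
hits <- subset(pairs,
               person.x != person.y &
                 mode.x != mode.y &
                 abs(secs.y - secs.x) < 3600)
hits <- hits[order(hits$person.x, hits$person.y), ]

# Collapse the matching partners into one string per person.
overlaps <- tapply(hits$person.y, hits$person.x, paste, collapse = ",")
overlaps
```

Persons with no match (person 3 here) simply do not appear in `overlaps`, which is why the dplyr answer joins back to the original rows at the end.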


How can I create a day number variable in R based on dates?

I want to create a variable with the number of the day a participant took a survey (first day, second day, third day, etc.)
The issue is that there are participants that took the survey after midnight.
For example, this is what it looks like:
08/03/2020 08:17
08/03/2020 12:01
08/04/2020 15:08
08/04/2020 22:16
07/03/2020 08:10
07/03/2020 12:03
07/04/2020 15:07
07/05/2020 00:16
08/22/2020 09:17
08/23/2020 11:04
08/24/2020 00:01
10/03/2020 08:37
10/03/2020 11:13
10/04/2020 15:20
10/04/2020 23:05
This is what I want:
08/03/2020 08:17
08/03/2020 12:01
08/04/2020 15:08
08/04/2020 22:16
07/03/2020 08:10
07/03/2020 12:03
07/04/2020 15:07
07/05/2020 00:16
08/22/2020 09:17
08/23/2020 11:04
08/24/2020 00:01
10/03/2020 08:37
10/03/2020 11:13
10/04/2020 15:20
10/04/2020 23:05
How can I create the day variable so that participants who took the survey after midnight are still counted under the previous day?
I tried the code from here, but I have issues with participants taking surveys after midnight.
Please see the code below:
data2 <- data %>%
mutate(date2 = as.Date(date, format = "%m/%d/%Y %H:%M")) %>%
group_by(id) %>%
mutate(row = row_number(),
date3 = as.Date(ifelse(row == 1, date2, NA), origin = "1970-01-01")) %>%
fill(date3) %>%
ungroup() %>%
mutate(diff = as.numeric(date2 - date3 + 1)) %>%
select(-date2, -date3, -row)
#> id date diff
#> 1 1 08/03/2020 08:17 1
#> 2 1 08/03/2020 12:01 1
#> 3 1 08/04/2020 15:08 2
#> 4 1 08/04/2020 22:16 2
#> 5 2 07/03/2020 08:10 1
#> 6 2 07/03/2020 12:03 1
#> 7 2 07/04/2020 15:07 2
#> 8 2 07/05/2020 00:16 3
Here is one approach that makes the survey dates explicit. First, make sure your date is in POSIXct format, as suggested in the comments (if not done already). Then, if the hour is less than 2 (midnight to 2 AM), subtract 1 from the date so that survey_date reflects the day before; otherwise, keep the date. The time zone argument tz is set to "" (the current time zone) in both conversions so that they agree. Finally, after grouping by Id, subtract the first survey_date from each survey_date and add 1 to get the day number. You can use as.numeric to make this column numeric if desired.
Note: if you want to just note consecutive days taken the survey (and ignore gaps in days between surveys) you can substitute for the last line:
mutate(day = cumsum(survey_date != lag(survey_date, default = first(survey_date))) + 1)
This will increase day by 1 every new survey_date found for a given Id.
library(dplyr); library(lubridate) # hour() comes from lubridate
df %>%
mutate(date = as.POSIXct(date, format = "%m/%d/%Y %H:%M", tz = "")) %>%
mutate(survey_date = if_else(hour(date) < 2,
as.Date(date, format = "%Y-%m-%d", tz = "") - 1,
as.Date(date, format = "%Y-%m-%d", tz = ""))) %>%
group_by(Id) %>%
mutate(day = survey_date - first(survey_date) + 1)
Id date survey_date day
<int> <dttm> <date> <drtn>
1 1 2020-08-03 08:17:00 2020-08-03 1 days
2 1 2020-08-03 12:01:00 2020-08-03 1 days
3 1 2020-08-04 15:08:00 2020-08-04 2 days
4 1 2020-08-04 22:16:00 2020-08-04 2 days
5 2 2020-07-03 08:10:00 2020-07-03 1 days
6 2 2020-07-03 12:03:00 2020-07-03 1 days
7 2 2020-07-04 15:07:00 2020-07-04 2 days
8 2 2020-07-05 00:16:00 2020-07-04 2 days
9 3 2020-08-22 09:17:00 2020-08-22 1 days
10 3 2020-08-23 11:04:00 2020-08-23 2 days
11 3 2020-08-24 00:01:00 2020-08-23 2 days
12 4 2020-10-03 08:37:00 2020-10-03 1 days
13 4 2020-10-03 11:13:00 2020-10-03 1 days
14 4 2020-10-04 15:20:00 2020-10-04 2 days
15 4 2020-10-04 23:05:00 2020-10-04 2 days
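The cutoff rule can also be sketched in base R by shifting each timestamp back by the cutoff before taking the date. This is a minimal sketch under assumptions: timestamps are parsed as UTC to keep the example deterministic, `cutoff_hours` is a name introduced here, and only the four rows of participant 2 from the question are used.

```r
# The four timestamps of participant 2 from the question, parsed as UTC (assumption).
ts <- as.POSIXct(c("07/03/2020 08:10", "07/03/2020 12:03",
                   "07/04/2020 15:07", "07/05/2020 00:16"),
                 format = "%m/%d/%Y %H:%M", tz = "UTC")

# Surveys taken before this hour count toward the previous day (hypothetical name).
cutoff_hours <- 2

# Shift back by the cutoff, then keep only the date part.
survey_date <- as.Date(ts - cutoff_hours * 3600, tz = "UTC")

# Day number relative to the first survey date.
day <- as.integer(survey_date - survey_date[1]) + 1L
day
```

The 00:16 survey on 07/05 lands on day 2, the same as the 15:07 survey on 07/04.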

Referring to the row above when using mutate() in R

I want to create a new variable in a dataframe that refers to the value of the same new variable in the row above. Here's an example of what I want to do:
A horse is in a field divided into four zones. The horse is wearing a beacon that signals every minute, and the signal is picked up by one of four sensors, one for each zone. The field has a fence that runs most of the way down the middle, such that the horse can pass easily between zones 2 and 3, but to get between zones 1 and 4 it has to go via 2 and 3. The horse cannot jump over the fence.
| |
sensor 2 | X | | sensor 3
| | |
| | |
| | |
sensor 1 | Y| | sensor 4
| | |
In the schematic above, if the horse is at position X, it will be picked up by sensor 2. If the horse is near the middle fence at position Y, however, it may be picked up by either sensor 1 or sensor 4, the ranges of which overlap slightly.
In the toy example below, I have a dataframe where I have location data each minute for 20 minutes. In most cases, the horse moves one zone at a time, but in several instances, it switches back and forth between zone 1 and 4. This should be impossible: the horse cannot jump the fence, and neither can it run around in the space of a minute.
I therefore want to calculate a new variable in the dataset that provides the "true" location of the animal, accounting for the impossibility of travelling between 1 and 4.
Here's the data:
example <- data.frame(time = seq(as.POSIXct("2022-01-01 09:00:00"),
as.POSIXct("2022-01-01 09:20:00"),
by="1 mins"),
location = c(1,1,1,1,2,3,3,3,4,4,4,3,3,2,1,1,4,1,4,1,4))
Create two new variables: "prevloc" is where the animal was in the previous minute, and "diffloc" is the absolute difference between the animal's current and previous location.
example <- example %>% mutate(prevloc = lag(location),
diffloc = abs(location - prevloc))
Next, just change the first value of "diffloc" from NA to zero:
example <- example %>% mutate(diffloc = ifelse(is.na(diffloc), 0, diffloc))
Now we have a dataframe where diffloc is either 0 (animal didn't move), 1 (animal moved one zone), or 3 (animal apparently moved from zone 1 to zone 4 or vice versa). Where diffloc = 3, I want to create a "true" location taking account of the fact that such a change in location is impossible.
In my example, the animal went from zone 1 -> 4 -> 1 -> 4 -> 1 -> 4. Based on the fact that the animal started in zone 1, my assumption is that the animal just stayed in zone 1 the whole time.
My attempt to solve this below, which doesn't work:
example <- example %>%
mutate(returnloc = ifelse(diffloc < 3, location, lag(returnloc)))
I wonder whether anyone can help me to solve this? I've been trying for a couple of days and haven't even got close...
One possible solution is, when diffloc == 3, to look at the most recent previous value that is neither 1 nor 4. If it is 2, the horse must be in zone 1 afterwards; if it is 3, the horse must be in zone 4.
example %>%
mutate(trueloc = case_when(diffloc == 3 & sapply(seq(row_number()), \(i) tail(location[1:i][!location %in% c(1, 4)], 1) == 2) ~ 1,
diffloc == 3 & sapply(seq(row_number()), \(i) tail(location[1:i][!location %in% c(1, 4)], 1) == 3) ~ 4,
T ~ location))
time location prevloc diffloc trueloc
1 2022-01-01 09:00:00 1 NA 0 1
2 2022-01-01 09:01:00 1 1 0 1
3 2022-01-01 09:02:00 1 1 0 1
4 2022-01-01 09:03:00 1 1 0 1
5 2022-01-01 09:04:00 2 1 1 2
6 2022-01-01 09:05:00 3 2 1 3
7 2022-01-01 09:06:00 3 3 0 3
8 2022-01-01 09:07:00 3 3 0 3
9 2022-01-01 09:08:00 4 3 1 4
10 2022-01-01 09:09:00 4 4 0 4
11 2022-01-01 09:10:00 4 4 0 4
12 2022-01-01 09:11:00 3 4 1 3
13 2022-01-01 09:12:00 3 3 0 3
14 2022-01-01 09:13:00 2 3 1 2
15 2022-01-01 09:14:00 1 2 1 1
16 2022-01-01 09:15:00 1 1 0 1
17 2022-01-01 09:16:00 4 1 3 1
18 2022-01-01 09:17:00 1 4 3 1
19 2022-01-01 09:18:00 4 1 3 1
20 2022-01-01 09:19:00 1 4 3 1
21 2022-01-01 09:20:00 4 1 3 1
Here is an approach using a function containing a for-loop.
You cannot rely on diff, because this will not pick up sequences of (wrong) zone 4's.
c(1,1,4,4,4,1,1,1) should be converted to c(1,1,1,1,1,1,1,1) if I understand your question correctly.
So, you need to iterate (I think).
# custom sample data set
example <- data.frame(time = seq(as.POSIXct("2022-01-01 09:00:00"),
as.POSIXct("2022-01-01 09:20:00"),
by="1 mins"),
location = c(1,1,1,1,2,3,3,3,4,4,4,3,3,2,1,1,4,4,4,1,4))
# Make it a data.table, make sure the time is ordered
library(data.table)
setDT(example, key = "time")
# function
fixLocations <- function(x) {
  for(i in 2:length(x)) {
    # a jump of more than one zone is impossible: keep the previous (corrected) location
    if (abs(x[i] - x[i-1]) > 1) x[i] <- x[i-1]
  }
  return(x)
}
NB: this function only works if the location in the first row is correct. If it starts with (wrong) zone 4's, it will go awry.
example[, locationNew := fixLocations(location)][]
# time location locationNew
# 1: 2022-01-01 09:00:00 1 1
# 2: 2022-01-01 09:01:00 1 1
# 3: 2022-01-01 09:02:00 1 1
# 4: 2022-01-01 09:03:00 1 1
# 5: 2022-01-01 09:04:00 2 2
# 6: 2022-01-01 09:05:00 3 3
# 7: 2022-01-01 09:06:00 3 3
# 8: 2022-01-01 09:07:00 3 3
# 9: 2022-01-01 09:08:00 4 4
#10: 2022-01-01 09:09:00 4 4
#11: 2022-01-01 09:10:00 4 4
#12: 2022-01-01 09:11:00 3 3
#13: 2022-01-01 09:12:00 3 3
#14: 2022-01-01 09:13:00 2 2
#15: 2022-01-01 09:14:00 1 1
#16: 2022-01-01 09:15:00 1 1
#17: 2022-01-01 09:16:00 4 1
#18: 2022-01-01 09:17:00 4 1
#19: 2022-01-01 09:18:00 4 1
#20: 2022-01-01 09:19:00 1 1
#21: 2022-01-01 09:20:00 4 1
# time location locationNew
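The same "carry the corrected previous value forward" logic can be written without an explicit loop using base R's Reduce() with accumulate = TRUE. This is only a sketch of the for-loop above in a different form; `fix_locations` is a name introduced here, and it shares the same limitation that the first location must be correct.

```r
# Each step sees the already-corrected previous location (prev) and the raw
# current reading (cur); an impossible jump keeps the previous location.
fix_locations <- function(x) {
  Reduce(function(prev, cur) if (abs(cur - prev) > 1) prev else cur,
         x, accumulate = TRUE)
}

fix_locations(c(1, 1, 4, 4, 4, 1, 1, 1))
```

This can be used inside mutate() as well, e.g. mutate(trueloc = fix_locations(location)), since it takes a vector and returns a vector of the same length.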

Given a series of dates and a birth day, is there a way to obtain the age at every date entry along with a final age using the lubridate package?

I have a database of information pertaining to individuals observed over time. I would like to find a way to obtain the age of these individuals whenever a record was taken. Assuming BIRTH is assigned an age of 0, I would like to obtain the age, in either days or months, at each later visit. It would also be helpful to obtain a final age (in days or months) for each individual (*not included in the code). For example, for ID (A), the final age would be 10 months. I would like to use the lubridate package, as its built-in date features make it easier to work with dates. Any help with this is much appreciated.
date ID status
1 2000-01-01 A BIRTH
2 2000-01-14 A ETC
3 2000-01-25 A ETC
4 2000-02-12 A ETC
5 2000-02-27 A ETC
6 2000-06-05 A ETC
7 2000-10-30 A ETC
8 2001-02-04 B BIRTH
9 2001-06-15 B ETC
10 2001-12-26 B ETC
11 2002-05-22 B ETC
12 2002-06-04 B ETC
13 2000-01-08 C BIRTH
14 2000-07-11 C ETC
15 2000-08-18 C ETC
16 2000-11-27 C ETC
<-c("2000-01-01","2000-01-14","2000-01-25","2000-02-12","2000-02-27","2000-06-05","2000-10-30",
print(df2)
date ID status age
1 2000-01-01 A BIRTH 0
2 2000-01-14 A ETC 1
3 2000-01-25 A ETC 1
4 2000-02-12 A ETC 2
5 2000-02-27 A ETC 2
6 2000-06-05 A ETC 6
7 2000-10-30 A ETC 10
8 2001-02-04 B BIRTH 0
9 2001-06-15 B ETC 4
10 2001-12-26 B ETC 10
11 2002-05-22 B ETC 15
12 2001-02-04 B ETC 16
13 2000-01-08 C BIRTH 0
14 2000-07-11 C ETC 6
15 2000-08-18 C ETC 7
16 2000-11-27 C ETC 10
For calculations related to age in years or months, I'd like to encourage you to try the clock package rather than lubridate. lubridate is a great package, but produces some unexpected results with these kinds of calculations if you aren't 100% sure of what you are doing. In clock, the function to do this is date_count_between(). Notice that one of the results is different between clock and lubridate here:
library(clock)
library(lubridate, warn.conflicts = FALSE)
library(dplyr, warn.conflicts = FALSE)
df <- tibble(
date = c("2000-01-01","2000-01-14",
ID = c("A","A","A","A","A","A",
status = c("BIRTH","ETC","ETC","ETC",
df %>%
mutate(date = date_parse(date)) %>%
group_by(ID) %>%
mutate(birth_date = date[status == "BIRTH"]) %>%
ungroup() %>%
mutate(age_clock = date_count_between(birth_date, date, "month"),
age_lubridate = as.period(date - birth_date) %/% months(1))
#> # A tibble: 16 × 6
#> date ID status birth_date age_clock age_lubridate
#> <date> <chr> <chr> <date> <int> <dbl>
#> 1 2000-01-01 A BIRTH 2000-01-01 0 0
#> 2 2000-01-14 A ETC 2000-01-01 0 0
#> 3 2000-01-25 A ETC 2000-01-01 0 0
#> 4 2000-02-12 A ETC 2000-01-01 1 1
#> 5 2000-02-27 A ETC 2000-01-01 1 1
#> 6 2000-06-05 A ETC 2000-01-01 5 5
#> 7 2000-10-30 A ETC 2000-01-01 9 9
#> 8 2001-02-04 B BIRTH 2001-02-04 0 0
#> 9 2001-06-15 B ETC 2001-02-04 4 4
#> 10 2001-12-26 B ETC 2001-02-04 10 10
#> 11 2002-05-22 B ETC 2001-02-04 15 15
#> 12 2002-06-04 B ETC 2001-02-04 16 15
#> 13 2000-01-08 C BIRTH 2000-01-08 0 0
#> 14 2000-07-11 C ETC 2000-01-08 6 6
#> 15 2000-08-18 C ETC 2000-01-08 7 7
#> 16 2000-11-27 C ETC 2000-01-08 10 10
clock says that 2001-02-04 to 2002-06-04 is 16 months, while the lubridate method here only says it is 15 months. This has to do with the fact that the lubridate calculation uses the length of an average month, which doesn't always accurately reflect how we think about months.
Consider this simple example. I think most people would agree that a child born on this date in February is "1 month and 1 day" old on the March date below. But the lubridate calculation shows 0 months!
library(clock)
library(lubridate, warn.conflicts = FALSE)
# "1 month and 1 day apart"
feb <- as.Date("2020-02-28")
mar <- as.Date("2020-03-29")
# As expected when thinking about age in months
date_count_between(feb, mar, "month")
#> [1] 1
# Not expected
as.period(mar - feb) %/% months(1)
#> [1] 0
secs_in_day <- 86400
secs_in_month <- as.numeric(months(1))
secs_in_month / secs_in_day
#> [1] 30.4375
# Less than 30.4375 days, so not 1 month
mar - feb
#> Time difference of 30 days
The issue is that lubridate uses the length of an average month in the computation, which is 30.4375 days. But there are only 30 days between these two dates, so it isn't considered a full month.
clock, on the other hand, uses the day component of the starting date to determine if a "full month" has passed or not. In other words, because we have passed the 28th of March, clock decides that 1 month has passed, which is consistent with how we generally think about age.
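To make the "day component" rule concrete, here is a rough base-R sketch of counting whole months that way. It is an illustration of the rule as described above, not clock's actual implementation; `months_between` is a name introduced here, and end-of-month edge cases (for example a start date on the 31st) may be handled differently by clock.

```r
# Count whole months between two dates: take the calendar-month difference,
# then subtract one if the end day-of-month has not yet reached the start's.
months_between <- function(start, end) {
  s <- as.POSIXlt(start)
  e <- as.POSIXlt(end)
  n <- 12 * (e$year - s$year) + (e$mon - s$mon)
  n - (e$mday < s$mday)
}

months_between(as.Date("2020-02-28"), as.Date("2020-03-29"))
months_between(as.Date("2001-02-04"), as.Date("2002-06-04"))
```

The first call gives 1 and the second gives 16, matching the clock results shown above for those two pairs of dates.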
Using dplyr and lubridate, we can do the following. We first turn the date column into a date. Then we group by ID, find the birth date and calculate the number of months since that date via some lubridate magic (see How do I use the lubridate package to calculate the number of months between two date vectors where one of the vectors has NA values?).
df1 %>%
mutate(date = as_date(date)) %>%
group_by(ID) %>%
mutate(birth_date = date[status == "BIRTH"],
age = as.period(date - birth_date) %/% months(1))
Which gives:
date ID status birth_date age
<date> <fct> <fct> <date> <dbl>
1 2000-01-01 A BIRTH 2000-01-01 0
2 2000-01-14 A ETC 2000-01-01 0
3 2000-01-25 A ETC 2000-01-01 0
4 2000-02-12 A ETC 2000-01-01 1
5 2000-02-27 A ETC 2000-01-01 1
6 2000-06-05 A ETC 2000-01-01 5
7 2000-10-30 A ETC 2000-01-01 9
8 2001-02-04 B BIRTH 2001-02-04 0
9 2001-06-15 B ETC 2001-02-04 4
10 2001-12-26 B ETC 2001-02-04 10
11 2002-05-22 B ETC 2001-02-04 15
12 2002-06-04 B ETC 2001-02-04 15
13 2000-01-08 C BIRTH 2000-01-08 0
14 2000-07-11 C ETC 2000-01-08 6
15 2000-08-18 C ETC 2000-01-08 7
16 2000-11-27 C ETC 2000-01-08 10
Which is your expected output except for some rounding differences. See my comment on your question.

R conditional count of unique value over date range/window

In R, how can you count the number of observations fulfilling a condition over a time range?
Specifically, I want to count the number of distinct id values by country over the last 8 months, but only count an id if it occurs at least twice during these 8 months. For the count, it does not matter whether an id occurs 2x or 100x (doing this in 2 steps is maybe easier). NA exists in both id and country. Since this could be taken care of separately, accounting for it is not necessary but would still be helpful.
My best attempt so far is below. It does not account for the restriction (an id must appear at least twice in the previous 8 months), and I also find its counting odd for date "2017-12-12", where desired_unrestricted should be 4 by my count but the code gives 2.
dt[, date := as.Date(date)][
, totalids := sapply(date,
function(x) length(unique(id[between(date, x - lubridate::month(8), x)]))),
by = country]
ID <- c("1","1","1","1","1","1","2","2","2","3","3",NA,"4")
Date <- c("2017-01-01","2017-01-01", "2017-01-05", "2017-05-01", "2017-05-01","2018-05-02","2017-01-01", "2017-01-05", "2017-05-01", "2017-05-01","2017-05-01","2017-12-12","2017-12-12" )
Value <- c(2,4,3,5,2,5,8,17,17,3,7,5,3)
Country <- c("UK","UK","US","US",NA,"US","UK","UK","US","US","US","US","US")
Desired <- c(1,1,0,2,NA,0,1,2,2,2,2,1,1)
Desired_unrestricted <- c(2,2,1,3,NA,1,2,2,3,3,3,4,4)
dt <- data.frame(id=ID, date=Date, value=Value, country=Country, desired_output=Desired, desired_unrestricted=Desired_unrestricted)
Thanks in advance.
This data.table-only answer is motivated by a comment,
library(data.table)
setDT(dt) # the question builds `dt` with data.frame()
dt[, date := as.Date(date)] # if not already `Date`-class
dt[, date8 := do.call(c, lapply(dt$date, function(z) seq(z, length=2, by="-8 months")[2]))
][, results := dt[dt, on = .(country, date > date8, date <= date),
length(Filter(function(z) z > 1, table(id))), by = .EACHI]$V1
][, date8 := NULL ]
# id date value country desired_output desired_unrestricted results
# <char> <Date> <num> <char> <num> <num> <int>
# 1: 1 2017-01-01 2 UK 1 2 1
# 2: 1 2017-01-01 4 UK 1 2 1
# 3: 1 2017-01-05 3 US 0 1 0
# 4: 1 2017-05-01 5 US 1 3 2
# 5: 1 2017-05-01 2 <NA> NA NA 0
# 6: 1 2018-05-02 5 US 0 1 0
# 7: 2 2017-01-01 8 UK 1 2 1
# 8: 2 2017-01-05 17 UK 2 2 2
# 9: 2 2017-05-01 17 US 1 3 2
# 10: 3 2017-05-01 3 US 2 3 2
# 11: 3 2017-05-01 7 US 2 3 2
# 12: <NA> 2017-12-12 5 US 2 4 1
# 13: 4 2017-12-12 3 US 2 4 1
That's a lot to absorb.
Quick walk-through:
"8 months ago":
seq(z, length=2, by="-8 months")[2]
seq.Date (inferred by calling seq with a Date-class first argument) starts at z (current date for each row) and produces a sequence of length 2 with 8 months between them. seq always starts at the first argument, so length=1 won't work (it'll only return z); length=2 guarantees that the second value in the returned vector will be the "8 months before date" that we need.
Date subtraction:
[, date8 := do.call(c, lapply(dt$date, function(z) seq(...)[2])) ]
A simple base-R method for subtracting 8 months is seq(date, length=2, by="-8 months")[2]. seq.Date requires its first argument to be length-1, so we need to sapply or lapply it; unfortunately, sapply drops the class, so we lapply it and then programmatically combine the results with do.call(c, ...) (since c(..) on the list itself creates a list-column, and unlist will de-class it). (Perhaps this part can be improved.)
We need that in dt first since we do a non-equi (range-based) join based on this value.
Counting id with 2 or more visits:
length(Filter(function(z) z > 1, table(id)))
We produce a table(id), which gives us the count of each id within the join-period. Filter(fun, ...) lets us drop those that have a count below 2, and we're left with a named vector of ids that had 2 or more visits. Its length is the number we need.
Self non-equi join:
dt[dt, on = .(country, date > date8, date <= date), ... ]
Relatively straight-forward. This is an open/closed ranging, it can be changed to both-closed if you prefer.
Self non-equi join but count ids by-row: by=.EACHI.
Retrieve the results of that and assign into the original dt:
[, results := dt[...]$V1 ]
Since the non-equi join included a value (length(Filter(...))) without a name, it's named V1, and all we want is that. (To be honest, I don't know exactly why assigning it more directly doesn't work ... but the counts are all wrong. Perhaps it's backwards by-row tallying.)
[, date8 := NULL ]
(Nothing fancy here, just proper data-stewardship :-)
There are some discrepancies in my counts versus your desired_output, I wonder if those are just typos in the OP; I think the math is right ...
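Two pieces of the walk-through above can be tried in isolation on small inputs. This is just a sketch: the two dates are taken from the question's data, while the `ids` vector and the names `d`, `back8` and `counts` are made up here for illustration.

```r
# Two dates from the question's data.
d <- as.Date(c("2017-05-01", "2017-12-12"))

# "8 months ago" for a single date: second element of a length-2 sequence.
one <- seq(d[1], length.out = 2, by = "-8 months")[2]
format(one)

# sapply() simplifies the result to plain numbers and loses the Date class ...
class(sapply(d, function(z) seq(z, length.out = 2, by = "-8 months")[2]))

# ... whereas lapply() plus do.call(c, ...) keeps it.
back8 <- do.call(c, lapply(d, function(z) seq(z, length.out = 2, by = "-8 months")[2]))
class(back8)
format(back8)

# Counting ids with 2 or more visits: table() tallies each id (NA is dropped
# by default), Filter() keeps the tallies above 1, length() counts them.
ids <- c("1", "1", "2", "3", "3", NA)   # made-up ids for illustration
counts <- Filter(function(z) z > 1, table(ids))
length(counts)
```

Here ids "1" and "3" each appear twice, so the final count is 2; id "2" and the NA do not contribute.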
Here is another option:
setkey(dt, country, date, id)
dt[, date := as.IDate(date)][,
eightmthsago := as.IDate(sapply(as.IDate(date), function(x) seq(x, by="-8 months", length.out=2L)[2L]))]
dt[, c("out", "out_unres") :=
dt[dt, on=.(country, date>=eightmthsago, date<=date),
by=.EACHI, {
v <- id[!is.na(id)]
.(uniqueN(v[duplicated(v)]), uniqueN(v))
}][,1L:3L := NULL]]
output (like r2evans, I am also getting different output from desired as there seems to be a miscount in the desired output):
id date value country desired_output desired_unrestricted eightmthsago out out_unres
1: 1 2017-05-01 2 <NA> NA NA 2016-09-01 0 1
2: 1 2017-01-01 2 UK 1 2 2016-05-01 1 2
3: 1 2017-01-01 4 UK 1 2 2016-05-01 1 2
4: 2 2017-01-01 8 UK 1 2 2016-05-01 1 2
5: 2 2017-01-05 17 UK 2 2 2016-05-05 2 2
6: 1 2017-01-05 3 US 0 1 2016-05-05 0 1
7: 1 2017-05-01 5 US 1 3 2016-09-01 2 3
8: 2 2017-05-01 17 US 1 3 2016-09-01 2 3
9: 3 2017-05-01 3 US 2 3 2016-09-01 2 3
10: 3 2017-05-01 7 US 2 3 2016-09-01 2 3
11: <NA> 2017-12-12 5 US 2 4 2017-04-12 1 4
12: 4 2017-12-12 3 US 2 4 2017-04-12 1 4
13: 1 2018-05-02 5 US 0 1 2017-09-02 0 2
Although this question is tagged with data.table, here is a dplyr::rowwise solution to the problem. Is this what you had in mind? The output looks valid to me: the number of ids in the last 8 months that have a count of at least 2.
dt <- dt %>% mutate(date = as.Date(date))
dt %>%
group_by(country) %>%
group_modify(~ .x %>%
rowwise() %>%
mutate(totalids = .x %>%
filter(date <= .env$date, date >= .env$date %m-% months(8)) %>%
pull(id) %>%
table() %>%
`[`(. >1) %>%
length()))
#> # A tibble: 13 x 7
#> # Groups: country [3]
#> country id date value desired_output desired_unrestricted totalids
#> <chr> <chr> <date> <dbl> <dbl> <dbl> <int>
#> 1 UK 1 2017-01-01 2 1 2 1
#> 2 UK 1 2017-01-01 4 1 2 1
#> 3 UK 2 2017-01-01 8 1 2 1
#> 4 UK 2 2017-01-05 17 2 2 2
#> 5 US 1 2017-01-05 3 0 1 0
#> 6 US 1 2017-05-01 5 1 3 2
#> 7 US 1 2018-05-02 5 0 1 0
#> 8 US 2 2017-05-01 17 1 3 2
#> 9 US 3 2017-05-01 3 2 3 2
#> 10 US 3 2017-05-01 7 2 3 2
#> 11 US <NA> 2017-12-12 5 2 4 1
#> 12 US 4 2017-12-12 3 2 4 1
#> 13 <NA> 1 2017-05-01 2 NA NA 0
Created on 2021-09-02 by the reprex package (v2.0.1)

Group records with time interval overlap

I have a data frame (with N=16) that contains ID (character), w_from (date), and w_to (date). Each record represents a task.
Here’s the data in R.
ID <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2)
w_from <- c("2010-01-01","2010-01-05","2010-01-29","2010-01-29",
w_to <- c("2010-01-31","2010-01-15", "2010-02-13","2010-02-28",
df <- data.frame(ID, w_from, w_to)
df$w_from <- as.Date(df$w_from)
df$w_to <- as.Date(df$w_to)
I need to generate a group number by ID for the records whose time intervals overlap. As an example, and in general terms, if record#1 overlaps with record#2, and record#2 overlaps with record#3, then record#1, record#2, and record#3 all belong to the same group.
Also, if record#1 overlaps with record#2 and record#3, but record#2 doesn't overlap with record#3, then record#1, record#2, and record#3 still all belong to the same group.
In the example above and for ID=1, the first four records overlap.
Here is the final output:
Also, if this can be done using dplyr, that would be great!
Try this:
df %>%
group_by(ID) %>%
arrange(w_from) %>%
mutate(group = 1+cumsum(
cummax(lag(as.numeric(w_to), default = first(as.numeric(w_to)))) < as.numeric(w_from)))
# A tibble: 16 x 4
# Groups: ID [2]
ID w_from w_to group
<dbl> <date> <date> <dbl>
1 1 2010-01-01 2010-01-31 1
2 1 2010-01-05 2010-01-15 1
3 1 2010-01-29 2010-02-13 1
4 1 2010-01-29 2010-02-28 1
5 1 2010-03-01 2010-03-16 2
6 1 2010-03-15 2010-03-16 2
7 1 2010-07-15 2010-08-14 3
8 1 2010-09-10 2010-10-10 4
9 1 2010-11-01 2010-12-01 5
10 1 2010-11-30 2010-12-30 5
11 1 2010-12-15 2010-12-20 5
12 1 2010-12-31 2011-02-19 6
13 1 2011-02-01 2011-03-23 6
14 2 2011-07-01 2011-07-31 1
15 2 2011-07-01 2011-07-06 1
16 2 2012-04-01 2012-06-30 2
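To see why the cummax() trick works, it may help to run the same logic step by step in base R on a few rows. This is a minimal sketch using five of the ID = 1 rows from the output above (rows 1, 2, 3, 5 and 6), already sorted by w_from; the intermediate names `to_num` and `prev_max_end` are introduced here.

```r
# Five ID = 1 records from the output above, sorted by start date.
w_from <- as.Date(c("2010-01-01", "2010-01-05", "2010-01-29",
                    "2010-03-01", "2010-03-15"))
w_to   <- as.Date(c("2010-01-31", "2010-01-15", "2010-02-13",
                    "2010-03-16", "2010-03-16"))

to_num <- as.numeric(w_to)

# Latest end date seen among all earlier records (the first record is
# compared against its own end date, like lag(..., default = first(...))).
prev_max_end <- cummax(c(to_num[1], head(to_num, -1)))

# A new group starts whenever a record begins after everything before it ended.
group <- 1 + cumsum(prev_max_end < as.numeric(w_from))
group
```

Using the running maximum of the end dates, rather than just the previous row's end date, is what keeps a long early interval "open" across shorter ones nested inside it.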
