Grouping temporally arranged events across sites with dplyr - R

I'm working with an ecological dataset that has multiple individuals moving across a landscape where they can be detected at multiple sites. The data has a beginning and ending timestamp for when an individual was detected at a given site; from here on we'll call this time window for an individual at a site an "event". These events are the rows in this data. I sorted this data by time and noticed I can have multiple events while an individual remains at a given site (which can happen when an individual moves away from a receiver and comes back to it without being detected at an adjacent receiver).
Here's example data for a single individual, x:
input <- data.frame(
  individual = c("x", "x", "x", "x", "x", "x", "x"),
  site = c("a", "a", "a", "b", "b", "a", "a"),
  start_time = as.POSIXct(c("2020-01-14 11:11:11", "2020-01-14 11:13:10", "2020-01-14 11:16:20",
                            "2020-02-14 11:11:11", "2020-02-14 11:13:10",
                            "2020-03-14 11:12:11", "2020-03-15 11:12:11")),
  end_time = as.POSIXct(c("2020-01-14 11:11:41", "2020-01-14 11:13:27", "2020-01-14 11:16:50",
                          "2020-02-14 11:13:11", "2020-02-14 11:15:10",
                          "2020-03-14 11:20:11", "2020-03-15 11:20:11")))
I want to aggregate these smaller events (e.g. the first 3 events at site a) into one larger event where I summarize the start/end times for the whole event:
output <- data.frame(
  individual = c("x", "x", "x"),
  site = c("a", "b", "a"),
  start_time = as.POSIXct(c("2020-01-14 11:11:11", "2020-02-14 11:11:11", "2020-03-14 11:12:11")),
  end_time = as.POSIXct(c("2020-01-14 11:16:50", "2020-02-14 11:15:10", "2020-03-15 11:20:11")))
Note that time intervals for events vary.
Using group_by(individual, site) would mean losing this temporal info, since individuals can travel among sites multiple times. I thought about using some sort of helper dataframe that summarizes events for individuals at sites but I am not sure how to retain the temporal info. I suppose there is a way to do this by indexing row numbers/looping in base but I am hoping there is a nifty dplyr trick that can help with this problem.

One approach is to take a cumulative count of how many times the site has changed, and use that counter to summarize each individual's contiguous stretch of time at one site.
library(dplyr)
input %>%
  arrange(individual, start_time) %>%
  mutate(indiv_new_site = cumsum(site != lag(site, default = ""))) %>%
  group_by(individual, site, indiv_new_site) %>%
  summarize(start_time = min(start_time),
            end_time = max(end_time))
# A tibble: 3 x 5
# Groups:   individual, site [2]
  individual site  indiv_new_site start_time          end_time
  <chr>      <chr>          <int> <dttm>              <dttm>
1 x          a                  1 2020-01-14 11:11:11 2020-01-14 11:16:50
2 x          a                  3 2020-03-14 11:12:11 2020-03-15 11:20:11
3 x          b                  2 2020-02-14 11:11:11 2020-02-14 11:15:10
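Because the result above is ordered by the grouping columns, the two runs at site a end up next to each other rather than in chronological order, and the helper counter is kept in the output. If the goal is to match the requested output exactly, a small follow-up on the same pipeline (plain dplyr, nothing new assumed) would be:
input %>%
  arrange(individual, start_time) %>%
  mutate(indiv_new_site = cumsum(site != lag(site, default = ""))) %>%
  group_by(individual, site, indiv_new_site) %>%
  summarize(start_time = min(start_time),
            end_time = max(end_time), .groups = "drop") %>%
  arrange(individual, start_time) %>%  # restore chronological order of the runs
  select(-indiv_new_site)              # drop the helper counter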

We could use rle from base R
library(dplyr)
input %>%
  arrange(individual, start_time) %>%
  group_by(individual, site,
           grp = with(rle(site), rep(seq_along(values), lengths))) %>%
  summarize(start_time = min(start_time),
            end_time = max(end_time), .groups = 'drop') %>%
  select(-grp)
Output:
# A tibble: 3 x 4
# individual site start_time end_time
# <chr> <chr> <dttm> <dttm>
#1 x a 2020-01-14 11:11:11 2020-01-14 11:16:50
#2 x a 2020-03-14 11:12:11 2020-03-15 11:20:11
#3 x b 2020-02-14 11:11:11 2020-02-14 11:15:10

In data.table we can use rleid.
library(data.table)
setDT(input)
input[, .(site = first(site),
          start_time = min(start_time),
          end_time = max(end_time)),
      .(individual, rleid(site))]
# individual rleid site start_time end_time
#1: x 1 a 2020-01-14 11:11:11 2020-01-14 11:16:50
#2: x 2 b 2020-02-14 11:11:11 2020-02-14 11:15:10
#3: x 3 a 2020-03-14 11:12:11 2020-03-15 11:20:11

Related

R function for selecting a set number of observations per hour/day that are evenly separated

I am trying to create a function that selects a defined number of observations per defined time frame.
I have managed to create a function that subsets for one observation per hour:
library(dplyr)
library(lubridate)  # for hour()

# create example df
timestamp <- seq(ISOdate(2022, 05, 20), ISOdate(2022, 05, 22), "min")
Animal_ID <- c(rep("Avi", length(timestamp)), rep("David", length(timestamp)))
timestamp <- as.character(c(timestamp, timestamp))
df <- as.data.frame(cbind(Animal_ID, timestamp))

# function
OnePerHour <- function(df) {
  dataOnePerHour <- df %>%
    group_by(Animal_ID, hour(timestamp), as.Date(timestamp)) %>%
    filter(row_number(Animal_ID) == 1)
  return(dataOnePerHour)
}
However, I am not able to wrap my head around expanding this so that I can select more observations per hour that are evenly distributed.
In this example there is an observation every minute, but in the "real dataset" there might be only three or four observations per hour, 15 minutes apart, for one animal, and an observation every second for another. So, let's say I am looking for 3 obs/hr and the observation frequency is 1/min; then minutes 1, 21, 41 are exactly what I am looking for. If there are only three observations (15 minutes apart) I would like to include all of them.
Any help will be much appreciated.
Idan
Here's a solution that creates times_per_hour equally spaced intervals within every hour for every Animal_ID, and then chooses the first observation within that interval. If there aren't any observations within that interval, however, no observation will be chosen. So if you want 3 per hour and you have observations at 12:01, 12:02, and 12:03, you're only going to get the first one, because there were no observations between 12:20-12:40 or 12:40-1:00.
library(dplyr)
library(tidyr)
library(lubridate)

# create example df
timestamp <- seq(ISOdate(2022, 05, 20), ISOdate(2022, 05, 22), "min")
Animal_ID <- c(rep("Avi", length(timestamp)), rep("David", length(timestamp)))
timestamp <- as.character(c(timestamp, timestamp))
df <- as.data.frame(cbind(Animal_ID, timestamp))

get_observations <- function(df, times_per_hour, min_date_time, max_date_time) {
  # make a dataframe with all possible minutes between the min and max times
  timespan <- expand_grid(Animal_ID = unique(df$Animal_ID),
                          # replace with min and max datetimes of the data
                          timestamp = seq(min_date_time, max_date_time, "min"))
  ideal_times <- timespan %>%
    group_by(Animal_ID, hour = hour(timestamp), date = as.Date(timestamp)) %>%
    # select the beginning of the interval from which you want an observation
    slice(seq(1, n(), by = 60 / times_per_hour)) %>%
    mutate(time_interval = interval(timestamp,
                                    lead(timestamp, default = max_date_time))) %>%
    select(-timestamp)

  df %>%
    mutate(hour = hour(timestamp), date = as.Date(timestamp)) %>%
    # join so every time interval is matched with all the obs in that hour
    right_join(ideal_times, by = c("Animal_ID", "hour", "date")) %>%
    # then remove all the obs that aren't in the exact interval
    filter(as_datetime(timestamp) %within% time_interval) %>%
    group_by(Animal_ID, time_interval) %>%
    # then take the first observation
    slice(1) %>%
    ungroup() %>%
    select(-time_interval)
}

# choose 10% so that observations are not equally spaced
sample_df <- slice_sample(df, prop = .1)
get_observations(sample_df, times_per_hour = 3,
                 min_date_time = ISOdate(2022, 05, 20), max_date_time = ISOdate(2022, 05, 22))
#> # A tibble: 259 × 4
#> Animal_ID timestamp hour date
#> <chr> <chr> <int> <date>
#> 1 Avi 2022-05-20 12:00:00 12 2022-05-20
#> 2 Avi 2022-05-20 12:32:00 12 2022-05-20
#> 3 Avi 2022-05-20 12:48:00 12 2022-05-20
#> 4 Avi 2022-05-20 13:15:00 13 2022-05-20
#> 5 Avi 2022-05-20 13:35:00 13 2022-05-20
#> 6 Avi 2022-05-20 13:52:00 13 2022-05-20
#> 7 Avi 2022-05-20 14:17:00 14 2022-05-20
#> 8 Avi 2022-05-20 14:28:00 14 2022-05-20
#> 9 Avi 2022-05-20 14:48:00 14 2022-05-20
#> 10 Avi 2022-05-20 15:16:00 15 2022-05-20
#> # … with 249 more rows
Created on 2022-05-23 by the reprex package (v2.0.1)
If I understand correctly, I might do something like this. Maybe a bit long: I first add the date, hour, and minute, and calculate the time difference from the previous time point per animal ID.
Then I calculate the number of observations in each hour and create a logical filter column based on your description.
library(dplyr)
library(lubridate)

df <- df %>%
  mutate(dt = as.Date(timestamp),
         hr = hour(timestamp),
         m = minute(timestamp)) %>%
  group_by(Animal_ID) %>%
  mutate(time_diff = m - lag(m))

df <- df %>%
  group_by(Animal_ID, dt, hr) %>%
  mutate(num_in_hour = n(),
         filterlogic = (num_in_hour == 60 & m %in% c(1, 21, 41)) |
                       (num_in_hour %in% c(3, 4) & time_diff == 15)) %>%
  filter(filterlogic == TRUE)

R - Condensing rows based on continuity of start and end timestamps

I have some IDs with timestamps that are exhibiting behavior that is out of my control. This is causing new rows for the same ID, but the end_time for the first row of a given ID matches the start_time of the next row for that same ID. An example looks like this:
df <- data.frame(id = c("1", "1", "1", "2", "3", "3"),
                 start_time = c("7/4/2020 10:06:27", "7/16/2020 07:16:44", "7/16/2020 07:20:32",
                                "7/9/2020 03:27:37", "7/4/2020 02:01:49", "7/9/2020 00:00:00"),
                 end_time = c("7/16/2020 07:16:44", "7/16/2020 07:20:32", "7/25/2020 18:17:46",
                              "7/21/2020 20:13:16", "7/5/2020 09:17:54", "7/11/2020 15:43:22"))
> df
id start_time end_time
1 1 7/4/2020 10:06:27 7/16/2020 07:16:44
2 1 7/16/2020 07:16:44 7/16/2020 07:20:32
3 1 7/16/2020 07:20:32 7/25/2020 18:17:46
4 2 7/9/2020 03:27:37 7/21/2020 20:13:16
5 3 7/4/2020 02:01:49 7/5/2020 09:17:54
6 3 7/9/2020 00:00:00 7/11/2020 15:43:22
But I want to 'stitch' together the timestamps for id = 1 (and other possible rows exhibiting this behavior) to form one distinct row like below. Note how the end_time and start_time entries for this id match up identically.
> df
id start_time end_time
1 1 7/4/2020 10:06:27 7/25/2020 18:17:46
2 2 7/9/2020 03:27:37 7/21/2020 20:13:16
3 3 7/4/2020 02:01:49 7/5/2020 09:17:54
4 3 7/9/2020 00:00:00 7/11/2020 15:43:22
The entries for id = 3 are appropriate as they are, because the end_time of the first instance is distinct from the start_time of the second instance, so the desired behavior is for those rows to be unchanged.
Here's a dplyr solution:
library(dplyr)
df %>%
  mutate_at(vars(start_time, end_time), ~ as.POSIXct(., format = "%m/%d/%Y %T")) %>%
  arrange(id, start_time) %>%
  group_by(id) %>%
  mutate(grp = cumsum(start_time != dplyr::lag(end_time, default = end_time[1]))) %>%
  group_by(id, grp) %>%
  summarize(start_time = start_time[1], end_time = end_time[n()]) %>%
  ungroup() %>%
  select(-grp)
# # A tibble: 4 x 3
# id start_time end_time
# <chr> <dttm> <dttm>
# 1 1 2020-07-04 10:06:27 2020-07-25 18:17:46
# 2 2 2020-07-09 03:27:37 2020-07-21 20:13:16
# 3 3 2020-07-04 02:01:49 2020-07-05 09:17:54
# 4 3 2020-07-09 00:00:00 2020-07-11 15:43:22
Edited to add as.POSIXct. One comment was to make sure to sort by start_time so that the comparisons were valid, but note that since those columns were originally strings, there was no guarantee that they would sort correctly. For instance, sort(c("7/31/2020", "12/01/2020")) sorts incorrectly. Unlikely to matter given this data, but still a good safeguard.
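To illustrate that point with a minimal, self-contained sketch (the two date strings are made up for demonstration):
dates_chr <- c("7/31/2020", "12/01/2020")

# Character sort is lexicographic: "12/..." sorts before "7/..." because "1" < "7",
# so December ends up ahead of July.
sort(dates_chr)
#> [1] "12/01/2020" "7/31/2020"

# Converting to Date (or POSIXct) first gives the intended chronological order.
sort(as.Date(dates_chr, format = "%m/%d/%Y"))
#> [1] "2020-07-31" "2020-12-01"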

R Summarise Data based on logical index for multiple criteria including date difference

I have a dataset with a datetime at which a medication was ordered (RequestedDtm), and I want to get a specific data point, pc_before_rx, which is recorded at an ObsDtm:
RxUnique %>%
  inner_join(PatientVS) %>%
  group_by(UniqueID, RequestedDtm) %>%
  arrange(UniqueID, ObsDtm) %>%
  summarise(
    pc_before_rx = last(ObsValue[ObsCatalogName == 'pc_recording' &
                                   difftime(RequestedDtm, ObsDtm, units = 'hours') < 24]),
    dt_settings_before_rx = last(ObsDtm[ObsCatalogName == 'pc_recording' &
                                          difftime(RequestedDtm, ObsDtm, units = 'hours') < 24])) %>%
  mutate(time = difftime(dt_settings_before_rx, RequestedDtm, units = 'hours'))
Produces:
  UniqueID RequestedDtm        pc_before_rx dt_before_rx        time
  <chr>    <dttm>                     <dbl> <dttm>              <drtn>
1 5936655  2020-05-02 11:03:38           NA NA                  NA hours
2 5925423  2020-04-23 06:01:43           14 2020-05-01 23:26:00 209.404508 hours
3 5917885  2020-04-12 16:35:53           12 2020-05-08 23:55:00 631.318448 hours
4 5930494  2020-05-01 10:36:54           15 2020-05-05 00:00:00  85.384895 hours
Clearly, the logical index for getting the last result of that day prior to the RequestedDtm is not working. Is there any way to get this filter/logical index to work?
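One likely culprit (a hedged guess, since the full data aren't shown): difftime(RequestedDtm, ObsDtm) is negative for observations recorded after the order, and a negative value still satisfies < 24, so later observations can be selected; the large positive values in the time column are consistent with that. A minimal sketch that bounds the window on both sides, keeping the column names and pipeline from the question otherwise unchanged:
library(dplyr)

RxUnique %>%
  inner_join(PatientVS) %>%
  group_by(UniqueID, RequestedDtm) %>%
  arrange(UniqueID, ObsDtm) %>%
  summarise(
    # keep only pc recordings taken in the 24 hours *before* the order,
    # i.e. 0 <= RequestedDtm - ObsDtm < 24 hours
    pc_before_rx = last(ObsValue[ObsCatalogName == 'pc_recording' &
                                   ObsDtm <= RequestedDtm &
                                   difftime(RequestedDtm, ObsDtm, units = 'hours') < 24]),
    dt_settings_before_rx = last(ObsDtm[ObsCatalogName == 'pc_recording' &
                                          ObsDtm <= RequestedDtm &
                                          difftime(RequestedDtm, ObsDtm, units = 'hours') < 24])) %>%
  mutate(time = difftime(dt_settings_before_rx, RequestedDtm, units = 'hours'))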

Assigning total to correct month from date range

I've got a data set with reservation data that has the below format:
property <- c('casa1', 'casa2', 'casa3')
check_in <- as.Date(c('2018-01-01', '2018-01-30','2018-02-28'))
check_out <- as.Date(c('2018-01-02', '2018-02-03', '2018-03-02'))
total_paid <- c(100,110,120)
df <- data.frame(property,check_in,check_out, total_paid)
My goal is to have the monthly total_paid amount divided by days and assigned to each month correctly for budget reasons.
While there's no issue for casa1, both casa2 and casa3 have days reserved in two different months, and the totals get skewed because of this.
Any help much appreciated!
Here you go:
library(dplyr)
library(tidyr)

df %>%
  mutate(id = seq_along(property),  # make a few helper variables
         day_paid = total_paid / as.numeric(check_out - check_in),
         date = check_in) %>%
  group_by(id) %>%
  complete(date = seq.Date(check_in, (check_out - 1), by = "day")) %>%  # get a date for each day of stay (except the last)
  ungroup() %>%                                   # one row per day of stay
  mutate(month = cut(date, breaks = "month")) %>% # determine the month of each date
  fill(property, check_in, check_out, total_paid, day_paid) %>%
  group_by(id, month) %>%
  summarise(property = unique(property),
            check_in = unique(check_in),
            check_out = unique(check_out),
            total_paid = unique(total_paid),
            paid_month = sum(day_paid))           # summarise per month
result:
# A tibble: 5 x 7
# Groups: id [3]
id month property check_in check_out total_paid paid_month
<int> <fct> <fct> <date> <date> <dbl> <dbl>
1 1 2018-01-01 casa1 2018-01-01 2018-01-02 100 100
2 2 2018-01-01 casa2 2018-01-30 2018-02-03 110 55
3 2 2018-02-01 casa2 2018-01-30 2018-02-03 110 55
4 3 2018-02-01 casa3 2018-02-28 2018-03-02 120 60
5 3 2018-03-01 casa3 2018-02-28 2018-03-02 120 60
I hope it's somewhat readable but please ask if there is something I should explain. Convention is that people don't pay the last day of a stay, so I took that into account.

Melting Data by Date Range

I'm running into an RStudio data issue regarding properly melting data. It currently is in the following form:
Campaign, ID, Start Date, End Date, Total Number of Days, Total Spend, Total Impressions, Total Conversions
I would like my data to look like the following:
Campaign, ID, Date, Spend, Impressions, Conversions
Each 'date' should contain a specific day the campaign was run while spend, impressions, and conversions should equal Total Spend / Total # of Days, Total Impressions / Total # of Days, and Total Conversions / Total # of Days, respectively.
I'm working in RStudio so a solution in R is needed. Does anyone have experience manipulating data like this?
This works, but it's not particularly efficient. If your data is millions of rows or more, I've had better luck using SQL and inequality joins.
library(tidyverse)

# create some bogus data
data <- data.frame(ID = 1:10,
                   StartDate = sample(seq.Date(as.Date("2018-01-01"), as.Date("2018-12-31"), "day"), 10),
                   Total = runif(10)) %>%
  mutate(EndDate = StartDate + floor(runif(10) * 14))

# generate all dates between the min and max in the dataset
AllDates = data.frame(Date = seq.Date(min(data$StartDate), max(data$EndDate), "day"),
                      Dummy = TRUE)

# join via a dummy variable to add rows for all dates to every ID
data %>%
  mutate(Dummy = TRUE) %>%
  inner_join(AllDates, by = c("Dummy" = "Dummy")) %>%
  # filter to just the dates between the start and end
  filter(Date >= StartDate, Date <= EndDate) %>%
  # divide the total by the number of days
  group_by(ID) %>%
  mutate(TotalPerDay = Total / n()) %>%
  select(ID, Date, TotalPerDay)
# A tibble: 91 x 3
# Groups: ID [10]
ID Date TotalPerDay
<int> <date> <dbl>
1 1 2018-06-21 0.00863
2 1 2018-06-22 0.00863
3 1 2018-06-23 0.00863
4 1 2018-06-24 0.00863
5 1 2018-06-25 0.00863
6 1 2018-06-26 0.00863
7 1 2018-06-27 0.00863
8 1 2018-06-28 0.00863
9 1 2018-06-29 0.00863
10 1 2018-06-30 0.00863
# ... with 81 more rows
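For reference, the inequality-join idea mentioned at the top of this answer can also be done in R itself. Here is a minimal sketch using a data.table non-equi join on the same bogus data generated above (data.table is an assumption here, not something the question requires); it avoids materializing the full ID x date cross join before filtering:
library(data.table)

dt <- as.data.table(data)   # the bogus data from above
all_dates <- data.table(Date = seq.Date(min(dt$StartDate), max(dt$EndDate), "day"))

# non-equi join: attach each date to the IDs whose StartDate/EndDate range covers it
# (for very large joins, allow.cartesian = TRUE may be needed)
daily <- dt[all_dates,
            on = .(StartDate <= Date, EndDate >= Date),
            .(ID, Date = i.Date, Total),
            nomatch = NULL]   # drop dates that no ID covers

# divide each ID's total across its days, as in the dplyr version
daily[, TotalPerDay := Total / .N, by = ID][, Total := NULL]
daily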
