check whether event occurred in 30-second intervals - r

I have a data set with an event ID and a timestamp for when each event happened, for example 9/2/2019 17:06. I want to build a Markov chain model with two states, noevent and event. To avoid building a continuous-time Markov chain, I want to split the period into 30-second intervals and check whether an event happened in each interval. Could someone help me do this in R? Thank you!
So far I have only parsed the date format and calculated the time between consecutive events, as well as how many no-event intervals fall between them.
# parse timestamps and compute seconds between consecutive events
data$timestamp <- as.POSIXct(data$timestamp, format = "%m/%d/%Y %H:%M:%S")
n <- nrow(data)
for (i in 2:n) {
  data$diff[i] <- difftime(data$timestamp[i], data$timestamp[i - 1], units = "secs")
}
data$Num <- round(data$diff / 30)  # number of 30-second intervals between events
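As an aside, the same differences can be computed without a loop; a minimal sketch, assuming the timestamps are already sorted:
# vectorized equivalent of the loop above; assumes sorted timestamps
data$diff <- c(NA, diff(as.numeric(data$timestamp)))
data$Num <- round(data$diff / 30)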

Tidyverse solution
Use lubridate::floor_date() to round timestamps down to 30-second intervals and tidyr::complete() to fill in the intervals with no events:
library(dplyr)
library(tidyr)
library(lubridate)

data %>%
  mutate(timestamp = floor_date(timestamp, "30 seconds")) %>%
  complete(timestamp = full_seq(timestamp, 30)) %>%
  mutate(
    event = ifelse(!is.na(id), "yes", "no"),
    .keep = "unused"
  )
# A tibble: 8 × 2
  timestamp           event
  <dttm>              <chr>
1 2023-02-19 10:01:00 yes
2 2023-02-19 10:01:30 no
3 2023-02-19 10:02:00 yes
4 2023-02-19 10:02:30 no
5 2023-02-19 10:03:00 no
6 2023-02-19 10:03:30 no
7 2023-02-19 10:04:00 no
8 2023-02-19 10:04:30 yes
Base R solution
The same logic as above, using only base functions:
times <- as.POSIXlt(data$timestamp)
times$sec <- ifelse(times$sec < 30, 0, 30)  # floor seconds to 0 or 30
intervals <- seq(min(times), max(times), by = 30)
data.frame(
  intervals,
  event = ifelse(intervals %in% as.POSIXct(times), "yes", "no")
)
            intervals event
1 2023-02-19 10:01:00   yes
2 2023-02-19 10:01:30    no
3 2023-02-19 10:02:00   yes
4 2023-02-19 10:02:30    no
5 2023-02-19 10:03:00    no
6 2023-02-19 10:03:30    no
7 2023-02-19 10:04:00    no
8 2023-02-19 10:04:30   yes
Example data
In the future, it’s best if you include example data in your question. See How to make a great R reproducible example. For these solutions, I used:
data <- data.frame(
  id = 1:3,
  timestamp = as.POSIXct(c(
    "2023-02-19 10:01:23",
    "2023-02-19 10:02:01",
    "2023-02-19 10:04:45"
  ))
)
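As a next step toward the two-state Markov chain the question asks about, the yes/no sequence can be tabulated into a transition matrix; a minimal sketch, with res standing in (hypothetically) for the saved result of either pipeline above:
# count transitions between consecutive 30-second states
# (res is the hypothetical saved result of one of the pipelines above)
trans <- table(from = head(res$event, -1), to = tail(res$event, -1))
prop.table(trans, margin = 1)  # row-normalized transition probabilities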

Related

Sum time across different continuous time events across date and time combinations in R

I am having trouble figuring out how to account for and sum continuous time observations across multiple dates and times in my dataset. A similar question is found here, but it only handles one instance of a continuous time event. I have a dataset with multiple date and time combinations. Here is an example from that dataset, which I am manipulating in R:
date.1 <- c("2021-07-21", "2021-07-21", "2021-07-21", "2021-07-29", "2021-07-29",
            "2021-07-30", "2021-08-01", "2021-08-01", "2021-08-01")
time.1 <- c("15:57:59", "15:58:00", "15:58:01", "15:46:10", "15:46:13",
            "18:12:10", "18:12:10", "18:12:11", "18:12:13")
df <- data.frame(date.1, time.1)
df
      date.1   time.1
1 2021-07-21 15:57:59
2 2021-07-21 15:58:00
3 2021-07-21 15:58:01
4 2021-07-29 15:46:10
5 2021-07-29 15:46:13
6 2021-07-30 18:12:10
7 2021-08-01 18:12:10
8 2021-08-01 18:12:11
9 2021-08-01 18:12:13
I tried the following script from the link above:
df$missingflag <- c(1, diff(as.POSIXct(df$time.1, format="%H:%M:%S", tz="UTC"))) > 1
df
      date.1   time.1 missingflag
1 2021-07-21 15:57:59       FALSE
2 2021-07-21 15:58:00        TRUE
3 2021-07-21 15:58:01       FALSE
4 2021-07-29 15:46:10       FALSE
5 2021-07-29 15:46:13        TRUE
6 2021-07-30 18:12:10        TRUE
7 2021-08-01 18:12:10       FALSE
8 2021-08-01 18:12:11       FALSE
9 2021-08-01 18:12:13        TRUE
But it did not work as anticipated and did not get me closer to my answer. It would only have been an intermediate step and probably wouldn't answer my question anyway.
The GOAL is to account for all the continuous time observations and put them into a new table like this:
      date.1   time.1 secs
1 2021-07-21 15:57:59    3
4 2021-07-29 15:46:10    1
5 2021-07-29 15:46:13    1
6 2021-07-30 18:12:10    1
7 2021-08-01 18:12:10    2
9 2021-08-01 18:12:13    1
You will see that the start time of each continuous time observation is recorded, along with the total number of seconds (secs) observed since the start of that observation. The script would need to account for date.1, as there are multiple dates in the dataset.
Thank you in advance.
You can create a datetime object by combining the date and time columns, take the difference of consecutive values, and create groups in which all times 1 second apart belong to the same group. For each group, count the number of rows and take their first datetime value.
library(dplyr)
library(tidyr)

df %>%
  unite(datetime, date.1, time.1, sep = ' ') %>%
  mutate(datetime = lubridate::ymd_hms(datetime)) %>%
  group_by(grp = cumsum(difftime(datetime,
                lag(datetime, default = first(datetime)), units = 'secs') > 1)) %>%
  summarise(datetime = first(datetime),
            secs = n(), .groups = 'drop') %>%
  select(-grp)
#  datetime            secs
#  <dttm>              <int>
#1 2021-07-21 15:57:59     3
#2 2021-07-29 15:46:10     1
#3 2021-07-29 15:46:13     1
#4 2021-07-30 18:12:10     1
#5 2021-08-01 18:12:10     2
#6 2021-08-01 18:12:13     1
I have kept datetime as a single combined column here, but if needed you can separate it again into two columns using
%>% separate(datetime, c('date', 'time'), sep = ' ')
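For comparison, the same gap-based grouping can be sketched in base R, assuming the rows are already in time order:
# base R sketch: group runs of observations at most 1 second apart
dt <- as.POSIXct(paste(df$date.1, df$time.1), tz = "UTC")
grp <- cumsum(c(TRUE, diff(as.numeric(dt)) > 1))
data.frame(datetime = dt[!duplicated(grp)],
           secs = as.vector(table(grp)))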

Grouping temporally arranged events across sites dplyr

I'm working with an ecological dataset that has multiple individuals moving across a landscape where they can be detected at multiple sites. The data has a beginning and ending timestamp for when an individual was detected at a given site; from here on we'll call this time window for an individual at a site an "event". These events are the rows in this data. I sorted this data by time and noticed I can have multiple events while an individual remains at a given site (which can happen when an individual moves away from the receiver and comes back to it without being detected at an adjacent receiver).
Here's example data for a single individual, x:
input <- data.frame(
  individual = c("x", "x", "x", "x", "x", "x", "x"),
  site = c("a", "a", "a", "b", "b", "a", "a"),
  start_time = as.POSIXct(c("2020-01-14 11:11:11", "2020-01-14 11:13:10", "2020-01-14 11:16:20",
                            "2020-02-14 11:11:11", "2020-02-14 11:13:10",
                            "2020-03-14 11:12:11", "2020-03-15 11:12:11")),
  end_time = as.POSIXct(c("2020-01-14 11:11:41", "2020-01-14 11:13:27", "2020-01-14 11:16:50",
                          "2020-02-14 11:13:11", "2020-02-14 11:15:10",
                          "2020-03-14 11:20:11", "2020-03-15 11:20:11"))
)
I want to aggregate these smaller events (e.g. the first 3 events at site a) into one larger event where I summarize the start/end times for the whole event:
output <- data.frame(
  individual = c("x", "x", "x"),
  site = c("a", "b", "a"),
  start_time = as.POSIXct(c("2020-01-14 11:11:11", "2020-02-14 11:11:11", "2020-03-14 11:12:11")),
  end_time = as.POSIXct(c("2020-01-14 11:16:50", "2020-02-14 11:15:10", "2020-03-15 11:20:11"))
)
Note that time intervals for events vary.
Using group_by(individual, site) would mean losing this temporal info, since individuals can travel among sites multiple times. I thought about using some sort of helper dataframe that summarizes events for individuals at sites, but I am not sure how to retain the temporal info. I suppose there is a way to do this by indexing row numbers/looping in base R, but I am hoping there is a nifty dplyr trick that can help with this problem.
One approach is to take the cumulative sum of the number of times the site has changed, and use that count to summarize each individual's contiguous stay at one site.
library(dplyr)
input %>%
  arrange(individual, start_time) %>%
  mutate(indiv_new_site = cumsum(site != lag(site, default = ""))) %>%
  group_by(individual, site, indiv_new_site) %>%
  summarize(start_time = min(start_time),
            end_time = max(end_time))
# A tibble: 3 x 5
# Groups:   individual, site [2]
  individual site  indiv_new_site start_time          end_time
  <chr>      <chr>          <int> <dttm>              <dttm>
1 x          a                  1 2020-01-14 11:11:11 2020-01-14 11:16:50
2 x          a                  3 2020-03-14 11:12:11 2020-03-15 11:20:11
3 x          b                  2 2020-02-14 11:11:11 2020-02-14 11:15:10
We could use rle from base R
library(dplyr)
input %>%
  arrange(individual, start_time) %>%
  group_by(individual, site, grp = with(rle(site),
           rep(seq_along(values), lengths))) %>%
  summarize(start_time = min(start_time),
            end_time = max(end_time), .groups = 'drop') %>%
  select(-grp)
Output
# A tibble: 3 x 4
#  individual site  start_time          end_time
#  <chr>      <chr> <dttm>              <dttm>
#1 x          a     2020-01-14 11:11:11 2020-01-14 11:16:50
#2 x          a     2020-03-14 11:12:11 2020-03-15 11:20:11
#3 x          b     2020-02-14 11:11:11 2020-02-14 11:15:10
In data.table we can use rleid.
library(data.table)

setDT(input)
input[, .(site = first(site),
          start_time = min(start_time),
          end_time = max(end_time)), .(individual, rleid(site))]
#   individual rleid site          start_time            end_time
#1:          x     1    a 2020-01-14 11:11:11 2020-01-14 11:16:50
#2:          x     2    b 2020-02-14 11:11:11 2020-02-14 11:15:10
#3:          x     3    a 2020-03-14 11:12:11 2020-03-15 11:20:11
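On dplyr 1.1.0 or later, consecutive_id() expresses the same run-length grouping directly; a sketch:
library(dplyr)  # >= 1.1.0 for consecutive_id()

input %>%
  arrange(individual, start_time) %>%
  group_by(individual, site, grp = consecutive_id(site)) %>%
  summarize(start_time = min(start_time),
            end_time = max(end_time), .groups = 'drop') %>%
  select(-grp)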

Split a row into two when a date range spans a change in calendar year

I am trying to figure out how to add a row when a date range spans a calendar-year boundary. Below is a minimal reprex:
I have a data frame like this:
have <- data.frame(
  from = c(as.Date('2018-12-15'), as.Date('2019-12-20'), as.Date('2019-05-13')),
  to = c(as.Date('2019-06-20'), as.Date('2020-01-25'), as.Date('2019-09-10'))
)
have
#>         from         to
#> 1 2018-12-15 2019-06-20
#> 2 2019-12-20 2020-01-25
#> 3 2019-05-13 2019-09-10
I want a data.frame that splits into two rows when to and from span a calendar year.
want <- data.frame(
  from = c(as.Date('2018-12-15'), as.Date('2019-01-01'), as.Date('2019-12-20'),
           as.Date('2020-01-01'), as.Date('2019-05-13')),
  to = c(as.Date('2018-12-31'), as.Date('2019-06-20'), as.Date('2019-12-31'),
         as.Date('2020-01-25'), as.Date('2019-09-10'))
)
want
#>         from         to
#> 1 2018-12-15 2018-12-31
#> 2 2019-01-01 2019-06-20
#> 3 2019-12-20 2019-12-31
#> 4 2020-01-01 2020-01-25
#> 5 2019-05-13 2019-09-10
I want to do this because, for a particular row, I want to know how many days fall in each year.
want$time_diff_by_year <- difftime(want$to, want$from)
Created on 2020-05-15 by the reprex package (v0.3.0)
Any base R or tidyverse solutions would be much appreciated.
You can determine the additional years needed for your date intervals with map2, then unnest to create additional rows for each year.
Then, you can take the intersection of each date interval with the corresponding full calendar year. This clips the partial years so they start on Jan 1 or end on Dec 31 of the given year.
library(tidyverse)
library(lubridate)
have %>%
  mutate(date_int = interval(from, to),
         year = map2(year(from), year(to), seq)) %>%
  unnest(year) %>%
  mutate(year_int = interval(as.Date(paste0(year, '-01-01')),
                             as.Date(paste0(year, '-12-31'))),
         year_sect = intersect(date_int, year_int),
         from_new = as.Date(int_start(year_sect)),
         to_new = as.Date(int_end(year_sect))) %>%
  select(from_new, to_new)
Output
# A tibble: 5 x 2
  from_new   to_new
  <date>     <date>
1 2018-12-15 2018-12-31
2 2019-01-01 2019-06-20
3 2019-12-20 2019-12-31
4 2020-01-01 2020-01-25
5 2019-05-13 2019-09-10
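From there, the stated goal of days per year per row is a simple difference; a small sketch, adding 1 so that both endpoints count as covered days:
# inclusive day count per row; drop the + 1 for an exclusive count
want$days_in_year <- as.numeric(want$to - want$from) + 1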

extracting subset of data based on whether a transaction contains at least a part of the time range in R

I have a data frame df that contains different transactions. Each transaction has a start date and an end date. The two variables for this are start_time and end_time. They are of the class POSIXct.
An example of how they look is as follows: "2018-05-23 23:40:00", "2018-06-24 00:10:00".
There are about 13000 transactions in df, and I want to extract all transactions that contain at least a bit of the specified time interval, if not all of it. The time interval or range is 20:00:00 - 08:00:00, so basically 8 P.M. <= interval < 8 A.M.
I am trying to use dplyr and the function filter() to do this; however, my problem is that I am not sure how to write the boolean expression. What I have written in code so far is this:
df %>% filter(hour(start_time) >= 20 | hour(start_time) < 8 | hour(end_time) >= 20 | hour(end_time) < 8)
I thought maybe this would get all transactions that contain at least a part of that interval, but then I thought about transactions that start and end outside of that interval yet last so long that they must contain those hours. I thought of adding | duration > 12, because any transaction longer than 12 hours will contain a part of that time interval. However, I feel like this code is unnecessarily long and there must be a simpler way, but I don't know how.
I'll start with a sample data frame, since a sample df isn't given in the question:
library(lubridate)
library(dplyr)
set.seed(69)
dates <- as.POSIXct("2020-04-01") + days(sample(30, 10, TRUE))
start_time <- dates + seconds(sample(86400, 10, TRUE))
end_time <- start_time + seconds(sample(50000, 10, TRUE))
df <- data.frame(Transaction = LETTERS[1:10], start_time, end_time)
df
#>    Transaction          start_time            end_time
#> 1            A 2020-04-18 16:51:03 2020-04-19 00:05:54
#> 2            B 2020-04-28 21:32:10 2020-04-29 06:18:06
#> 3            C 2020-04-03 02:12:52 2020-04-03 06:11:20
#> 4            D 2020-04-17 19:15:43 2020-04-17 21:01:52
#> 5            E 2020-04-09 11:36:19 2020-04-09 19:01:14
#> 6            F 2020-04-14 20:51:25 2020-04-15 06:08:10
#> 7            G 2020-04-08 12:01:55 2020-04-09 01:45:53
#> 8            H 2020-04-16 01:43:38 2020-04-16 04:22:39
#> 9            I 2020-04-08 23:11:51 2020-04-09 09:04:26
#> 10           J 2020-04-07 12:28:08 2020-04-07 12:55:42
We can enumerate the possibilities for a match as follows:
Any start time before 08:00 or after 20:00.
Any end time before 08:00 or after 20:00.
The start and end times are on different dates.
Using a little modular math, we can write this as (for hours 0-23, (hour + 12) %% 20 is greater than 11 exactly when the hour is below 8 or at least 20):
df %>% filter((hour(start_time) + 12) %% 20 > 11 |
                (hour(end_time) + 12) %% 20 > 11 |
                date(start_time) != date(end_time))
#>   Transaction          start_time            end_time
#> 1           A 2020-04-18 16:51:03 2020-04-19 00:05:54
#> 2           B 2020-04-28 21:32:10 2020-04-29 06:18:06
#> 3           C 2020-04-03 02:12:52 2020-04-03 06:11:20
#> 4           D 2020-04-17 19:15:43 2020-04-17 21:01:52
#> 5           F 2020-04-14 20:51:25 2020-04-15 06:08:10
#> 6           G 2020-04-08 12:01:55 2020-04-09 01:45:53
#> 7           H 2020-04-16 01:43:38 2020-04-16 04:22:39
#> 8           I 2020-04-08 23:11:51 2020-04-09 09:04:26
You can check that all of these transactions lie at least partly within the given range, and that the two removed rows do not.
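A quick way to do that check is to filter on the negation of all three conditions, which should return exactly the removed transactions, E and J; a sketch:
# complement: transactions entirely within 08:00-20:00 on a single date
df %>% filter((hour(start_time) + 12) %% 20 <= 11 &
                (hour(end_time) + 12) %% 20 <= 11 &
                date(start_time) == date(end_time))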

Is there an R function for finding a list of all dates between two values, then inserting them as rows?

I have a dataframe in the following format:
Contract_Begin Contract_End FP
2020-01-01     2020-01-31    5
2020-01-01     2020-03-31    6
If Contract_End - Contract_Begin is more than 1 month, I want to insert the additional months as rows below. Here is the desired output.
Contract_Begin Contract_End FP
2020-01-01     2020-01-31    5
2020-01-01                   6
2020-02-01                   6
2020-03-01                   6
I am trying to accomplish this in R as part of data pre-processing. Any help is greatly appreciated.
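For reference, a minimal construction of this example data (an assumption; the answer below refers to it as df1, with the date columns read in as character):
df1 <- data.frame(
  Contract_Begin = c("2020-01-01", "2020-01-01"),
  Contract_End = c("2020-01-31", "2020-03-31"),
  FP = c(5, 6)
)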
We can use map2 to get the sequence of dates from 'Contract_Begin' to 'Contract_End', then unnest the list column created by map2 to expand the rows:
library(dplyr)
library(tidyr)
library(purrr)
df1 %>%
  mutate_at(1:2, as.Date) %>%
  mutate(Contract_Begin = map2(Contract_Begin, Contract_End, seq,
                               by = "1 month")) %>%
  unnest(c(Contract_Begin))
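With the df1 sketched above, this should expand the second contract into one row per month. Note that Contract_End is carried along on every row rather than left blank as in the desired output, so blanking the repeats would need a small extra step:
# expected result, roughly:
#   Contract_Begin Contract_End FP
# 1 2020-01-01     2020-01-31    5
# 2 2020-01-01     2020-03-31    6
# 3 2020-02-01     2020-03-31    6
# 4 2020-03-01     2020-03-31    6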
