Extract data values at a higher frequency than time stamps - r

I have continuous behavior data with the timestamp when the subject changed behaviors and what each behavior was, and I need to extract the instantaneous behavior at each minute, starting at the second that the first behavior began: if the behavior started at 17:34:06, I'd define the next minute as 17:35:06. I also have the durations of each behavior calculated. This is what my data looks like:
df <- data.frame(Behavior = c("GRAZ", "MLTC", "GRAZ", "MLTC", "VIGL"),
Behavior_Start = c("2022-05-10 17:34:06","2022-05-10 17:38:04","2022-05-10 17:38:26","2022-05-10 17:41:49","2022-05-10 17:42:27"),
Behavior_Duration_Minutes = c(0.000000,3.961683,4.325933,7.722067,8.350017))
I've used cut() to bin each row into the minute it falls into, but I can't figure out how to get the behavior values for the minutes in which a new behavior doesn't occur (i.e. minutes 2:4 here), and this bases it off the minute but doesn't account for the second that the first behavior began.
time <- data.frame(as.POSIXct(df$Behavior_Start, tz = "America/Denver"))
colnames(time) <- "time"
df <- cbind(df,time)
df.cut <- data.frame(df, cuts = cut(df$time, breaks= "1 min", labels = FALSE))
So the dataframe I'd like to end up with would look like this:
new.df <- data.frame(Minute = c(1:10),
Timestamp = c("2022-05-10 17:34:06","2022-05-10 17:35:06","2022-05-10 17:36:06","2022-05-10 17:37:06","2022-05-10 17:38:06","2022-05-10 17:39:06","2022-05-10 17:40:06","2022-05-10 17:41:06","2022-05-10 17:42:06","2022-05-10 17:43:06"),
Behavior = c("GRAZ","GRAZ","GRAZ","MLTC","GRAZ","GRAZ","GRAZ","MLTC","VIGL","VIGL"))

Your data:
your_df <- data.frame(
  Behavior = c("Grazing","Vigilant","Grazing","Other","Grazing"),
  Behavior_Start = c("2022-05-10 17:34:06","2022-05-10 17:38:04","2022-05-10 17:38:26","2022-05-10 17:41:49","2022-05-10 17:42:27"),
  Behavior_Duration_Minutes = c(0.000000,3.961683,4.325933,7.722067,8.350017)
)
Using lead() on the duration column gives you the start and end of each behavior "period"; you then need to generate one entry for every whole minute that falls inside that period.
# Make a list column that generates the sequence of whole minutes "included" in
# each period of the `Behavior_Duration_Minutes` column. You'll need to play with
# this logic in terms of whether you want `ceiling()`, `floor()`, `round()` etc.
# Also update the endpoint, here hardcoded at 10 minutes.
library(dplyr)
high_res_df <-
  your_df %>%
  mutate(
    minutes_covered = purrr::map2(
      ceiling(Behavior_Duration_Minutes),
      lead(Behavior_Duration_Minutes, default = 10),
      ~seq(.x, .y)
    )
  )
high_res_df
#> Behavior Behavior_Start Behavior_Duration_Minutes minutes_covered
#> 1 Grazing 2022-05-10 17:34:06 0.000000 0, 1, 2, 3
#> 2 Vigilant 2022-05-10 17:38:04 3.961683 4
#> 3 Grazing 2022-05-10 17:38:26 4.325933 5, 6, 7
#> 4 Other 2022-05-10 17:41:49 7.722067 8
#> 5 Grazing 2022-05-10 17:42:27 8.350017 9, 10
Now that you've generated the list of minutes included, you can use unnest() to get closer to your desired output.
# And here expand out that list-column into a regular sequence
high_res_long <-
  high_res_df %>%
  tidyr::unnest(minutes_covered)
high_res_long
#> # A tibble: 11 × 4
#> Behavior Behavior_Start Behavior_Duration_Minutes minutes_covered
#> <chr> <chr> <dbl> <int>
#> 1 Grazing 2022-05-10 17:34:06 0 0
#> 2 Grazing 2022-05-10 17:34:06 0 1
#> 3 Grazing 2022-05-10 17:34:06 0 2
#> 4 Grazing 2022-05-10 17:34:06 0 3
#> 5 Vigilant 2022-05-10 17:38:04 3.96 4
#> 6 Grazing 2022-05-10 17:38:26 4.33 5
#> 7 Grazing 2022-05-10 17:38:26 4.33 6
#> 8 Grazing 2022-05-10 17:38:26 4.33 7
#> 9 Other 2022-05-10 17:41:49 7.72 8
#> 10 Grazing 2022-05-10 17:42:27 8.35 9
#> 11 Grazing 2022-05-10 17:42:27 8.35 10
Created on 2023-01-13 with reprex v2.0.2
You'll need to play around with this a bit to match exactly what you want.
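If the goal is strictly the behavior in progress at each sampled instant, one alternative is to build the one-minute grid from the first start time and look up the most recent behavior start with base R's findInterval(). This is only a minimal sketch on the question's data: n_minutes, sample_times and new_df are illustrative names, and the 10-minute horizon is an assumption. Because it samples purely by timestamp, its result can differ from a hand-written expected table.

```r
# Question's data, with the start times parsed as POSIXct
df <- data.frame(
  Behavior = c("GRAZ", "MLTC", "GRAZ", "MLTC", "VIGL"),
  Behavior_Start = as.POSIXct(
    c("2022-05-10 17:34:06", "2022-05-10 17:38:04", "2022-05-10 17:38:26",
      "2022-05-10 17:41:49", "2022-05-10 17:42:27"),
    tz = "America/Denver"
  ),
  stringsAsFactors = FALSE
)

n_minutes <- 10  # assumed sampling horizon

# One sample per minute, anchored at the second the first behavior began
sample_times <- seq(df$Behavior_Start[1], by = "1 min", length.out = n_minutes)

# For each sample, index of the last behavior that started at or before it
idx <- findInterval(as.numeric(sample_times), as.numeric(df$Behavior_Start))

new_df <- data.frame(
  Minute = seq_len(n_minutes),
  Timestamp = sample_times,
  Behavior = df$Behavior[idx],
  stringsAsFactors = FALSE
)
new_df
```

findInterval() requires the start times to be sorted in increasing order, which holds for a continuous behavior log like this one.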


Grouping to add a new column produces one less row than dataset. How do I make the rows match?

I have a dataframe (df) with 3 columns - a stage number, time data, and pressure data. Here is a portion of it:
df <- structure(list(
  Stage = c(1, 1, 1, 1, 1, 2, 2, 2),
  Pressure = c(3.24, 12.218, 9.634, 9.027, 9.027, 0, 14.28, 1.737),
  DateTime = structure(c(1624720853, 1624720854, 1624720855, 1624720856, 1624720857, 1624905025, 1624905026, 1624905027),
                       tzone = "", class = c("POSIXct", "POSIXt"))),
  class = "data.frame",
  row.names = c(NA, -8L))
I want to calculate the slope/derivative (change in pressure over change in time) at each pressure point within each stage. I have figured out how to calculate the slope, but there are sometimes large gaps between stages, and I don't need the slope for changes in pressure between stages.
I have code that I believe would work, but because it takes differences between rows, the output will always be one row shorter than the total number of rows within a stage.
df<- df%>%
group_by(Stage) %>%
mutate(dp.dt = diff(Pressure)/as.numeric(diff(DateTime)) )
This is the error, and like I mentioned, I believe it is happening because the code is looking at the difference in rows, which should result in one less row than the true number of rows in a stage:
Error: Problem with `mutate()` column `dp.dt`.
i `dp.dt = diff(Pressure)/as.numeric(diff(DateTime))`.
i `dp.dt` must be size 5 or 1, not 4.
The error occurred in group 1: JobStage = 1.
In the end, I am looking for something like the table below. Is there a way to induce an NA, add a row, or fill the missing row with something so that I get my desired table?
Please let me know if I need to clarify anything.
Any help would be appreciated. Thank you.
Stage  Pressure  DateTime
1      3.24      2021-06-26 10:20:53
1      12.218    2021-06-26 10:20:54
1      9.634     2021-06-26 10:20:55
1      9.027     2021-06-26 10:20:56
1      9.027     2021-06-26 10:20:57
2      0         2021-06-28 13:30:25
2      14.28     2021-06-28 13:30:26
2      1.737     2021-06-28 13:30:27
diff() returns a vector one element shorter than its input. Just append an NA to each diff and it should work:
library(dplyr)
df %>%
  group_by(Stage) %>%
  mutate(dp.dt = c(diff(Pressure), NA_real_) /
           as.numeric(c(diff(DateTime), NA_real_))) %>%
  ungroup()
# A tibble: 8 × 6
Stage Pressure DateTime class row.names dp.dt
<dbl> <dbl> <dttm> <chr> <int> <dbl>
1 1 3.24 2021-06-26 10:20:53 data.frame NA 8.98
2 1 12.2 2021-06-26 10:20:54 data.frame -8 -2.58
3 1 9.63 2021-06-26 10:20:55 data.frame NA -0.607
4 1 9.03 2021-06-26 10:20:56 data.frame -8 0
5 1 9.03 2021-06-26 10:20:57 data.frame NA NA
6 2 0 2021-06-28 13:30:25 data.frame -8 14.3
7 2 14.3 2021-06-28 13:30:26 data.frame NA -12.5
8 2 1.74 2021-06-28 13:30:27 data.frame -8 NA
You may use lead/lag to get next and previous value respectively.
library(dplyr)
library(lubridate)
df %>%
  mutate(DateTime = ymd_hms(DateTime)) %>%
  group_by(Stage) %>%
  mutate(dp.dt = (lead(Pressure) - Pressure) / as.numeric(lead(DateTime) - DateTime)) %>%
  ungroup()
# Stage Pressure DateTime dp.dt
# <dbl> <dbl> <dttm> <dbl>
#1 1 3.24 2021-06-26 23:20:53 8.98
#2 1 12.2 2021-06-26 23:20:54 -2.58
#3 1 9.63 2021-06-26 23:20:55 -0.607
#4 1 9.03 2021-06-26 23:20:56 0
#5 1 9.03 2021-06-26 23:20:57 NA
#6 2 0 2021-06-29 02:30:25 14.3
#7 2 14.3 2021-06-29 02:30:26 -12.5
#8 2 1.74 2021-06-29 02:30:27 NA
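For readers who prefer base R, the same "pad the diff with an NA inside each group" idea can be sketched with ave(). The helper name fwd_diff is hypothetical, and the data is rebuilt here with an assumed UTC time zone. Converting DateTime to numeric seconds first avoids relying on the units that diff() picks automatically for POSIXct values.

```r
df <- data.frame(
  Stage = c(1, 1, 1, 1, 1, 2, 2, 2),
  Pressure = c(3.24, 12.218, 9.634, 9.027, 9.027, 0, 14.28, 1.737),
  DateTime = as.POSIXct(
    c(1624720853, 1624720854, 1624720855, 1624720856, 1624720857,
      1624905025, 1624905026, 1624905027),
    origin = "1970-01-01", tz = "UTC"  # time zone assumed for illustration
  )
)

# Forward difference padded with a trailing NA, so it keeps the input length
fwd_diff <- function(x) c(diff(x), NA_real_)

# ave() applies the helper within each Stage and returns a full-length vector
d_pressure <- ave(df$Pressure, df$Stage, FUN = fwd_diff)
d_seconds  <- ave(as.numeric(df$DateTime), df$Stage, FUN = fwd_diff)

df$dp.dt <- d_pressure / d_seconds
df
```

The last row of each stage gets NA, so no slope is computed across the gap between stages.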

extracting subset of data based on whether a transaction contains at least a part of the time range in R

I have a data frame df that contains different transactions. Each transaction has a start date and an end date, stored in the variables start_time and end_time, both of class POSIXct.
For example, one transaction's start_time and end_time are "2018-05-23 23:40:00" and "2018-06-24 00:10:00".
There are about 13,000 transactions in df, and I want to extract every transaction that overlaps the specified time interval at least partly. The interval is 20:00:00 - 8:00:00, so basically 8 P.M. <= time < 8 A.M.
I am trying to use dplyr and the function filter() to do this however my problem is I am not sure how to write the boolean expression. What I have written in code so far is this:
df %>% filter(hour(start_time) >= 20 | hour(start_time) < 8 |hour(end_time) >= 20 | hour(end_time) < 8 )
I thought maybe this would get all transactions that contain at least a part of that interval but then I thought about transactions that maybe start and end outside of that interval but their duration is so long that it contains those hours from the interval. I thought maybe of adding | duration > 12 because any start time that has a duration longer than 12 hours will contain a part of that time interval. However, I feel like this code is unnecessarily long and there must be a simpler way but I don't know how.
I'll start with a sample data frame, since a sample df isn't given in the question:
library(lubridate)
library(dplyr)
dates <- as.POSIXct("2020-04-01") + days(sample(30, 10, TRUE))
start_time <- dates + seconds(sample(86400, 10, TRUE))
end_time <- start_time + seconds(sample(50000, 10, TRUE))
df <- data.frame(Transaction = LETTERS[1:10], start_time, end_time)
df
#> Transaction start_time end_time
#> 1 A 2020-04-18 16:51:03 2020-04-19 00:05:54
#> 2 B 2020-04-28 21:32:10 2020-04-29 06:18:06
#> 3 C 2020-04-03 02:12:52 2020-04-03 06:11:20
#> 4 D 2020-04-17 19:15:43 2020-04-17 21:01:52
#> 5 E 2020-04-09 11:36:19 2020-04-09 19:01:14
#> 6 F 2020-04-14 20:51:25 2020-04-15 06:08:10
#> 7 G 2020-04-08 12:01:55 2020-04-09 01:45:53
#> 8 H 2020-04-16 01:43:38 2020-04-16 04:22:39
#> 9 I 2020-04-08 23:11:51 2020-04-09 09:04:26
#> 10 J 2020-04-07 12:28:08 2020-04-07 12:55:42
We can enumerate the possibilities for a match as follows:
Any start time before 08:00 or after 20:00
Any stop time before 08:00 or after 20:00
The stop and start times are on different dates.
Using a little modular math, we can write this as:
df %>% filter((hour(start_time) + 12) %% 20 > 11 |
(hour(end_time) + 12) %% 20 > 11 |
date(start_time) != date(end_time))
#> Transaction start_time end_time
#> 1 A 2020-04-18 16:51:03 2020-04-19 00:05:54
#> 2 B 2020-04-28 21:32:10 2020-04-29 06:18:06
#> 3 C 2020-04-03 02:12:52 2020-04-03 06:11:20
#> 4 D 2020-04-17 19:15:43 2020-04-17 21:01:52
#> 5 F 2020-04-14 20:51:25 2020-04-15 06:08:10
#> 6 G 2020-04-08 12:01:55 2020-04-09 01:45:53
#> 7 H 2020-04-16 01:43:38 2020-04-16 04:22:39
#> 8 I 2020-04-08 23:11:51 2020-04-09 09:04:26
You can check that all the times are at least partly within the given range, and that the two removed rows are not.
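To see why the modular expression works, it can be compared against the plain "hour >= 20 or hour < 8" test for every possible hour. The sketch below does that in base R and wraps the three conditions in a helper. The names in_night and touches_night are hypothetical, and the three test rows copy transactions E, D and G from the sample output with an assumed UTC time zone.

```r
# (h + 12) %% 20 > 11 is TRUE exactly for hours 20..23 and 0..7
in_night <- function(h) (h + 12) %% 20 > 11

hours <- 0:23
same_as_explicit <- identical(in_night(hours), hours >= 20 | hours < 8)

# A transaction touches the night window if it starts or ends inside it,
# or if it spans a change of calendar date
touches_night <- function(start, end) {
  h_start <- as.POSIXlt(start)$hour
  h_end   <- as.POSIXlt(end)$hour
  in_night(h_start) | in_night(h_end) |
    format(start, "%Y-%m-%d") != format(end, "%Y-%m-%d")
}

start <- as.POSIXct(c("2020-04-09 11:36:19", "2020-04-17 19:15:43",
                      "2020-04-08 12:01:55"), tz = "UTC")
end   <- as.POSIXct(c("2020-04-09 19:01:14", "2020-04-17 21:01:52",
                      "2020-04-09 01:45:53"), tz = "UTC")
result <- touches_night(start, end)
result
```

The explicit comparison is arguably easier to read than the modular form; both select the same hours.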

How to calculate a time period until a condition is matched

I need to calculate the total duration of runs of consecutive dates, where a run ends when the difference between two consecutive dates is greater than 13 seconds.
For example, in the data frame created with the code below, the column test holds the time difference between consecutive dates. What I need are the events (runs of time) delimited by the rows where test > 13 seconds.
# Create a vector of dates with a random time difference in seconds between records
dates <- seq(as.POSIXct("2020-01-01 00:00:02"), as.POSIXct("2020-01-02 00:00:02"), by = "2 sec")
dates <- dates + sample(15, length(dates), replace = T)
# Create a data.frame
data <- data.frame(id = 1:length(dates), dates = dates)
# Create a test field with the time difference between each date and the next
data$test <- c(diff(data$dates, lag = 1), 0)
# Delete the zero and negative time
data <- data[data$test > 0, ]
What I want is something like this:
To get to your desired result we need to define 'blocks' of observations. A new block starts wherever test is greater than 13.
We start by identifying the split_point, and then, using the rle function, we can assign an ID to each block.
Then we can filter out the split points and summarize the remaining blocks: once with the sum of seconds, then with the min of the event dates.
split_point <- data$test <=13
# Find continuous blocks
block_str <- rle(split_point)
# Create block IDs
data$block <- rep(seq_along(block_str$lengths), block_str$lengths)
data <- data[split_point, ] # Remove split points
# Summarize
final_df <- aggregate(test ~ block, data = data, FUN = sum)
dtevent <- aggregate(dates ~ block, data= data, FUN=min)
# Join the two summaries
final_df$DatetimeEvent <- dtevent$dates
#> block test DatetimeEvent
#> 1 1 101 2020-01-01 00:00:09
#> 2 3 105 2020-01-01 00:01:11
#> 3 5 277 2020-01-01 00:02:26
#> 4 7 46 2020-01-01 00:04:58
#> 5 9 27 2020-01-01 00:05:30
#> 6 11 194 2020-01-01 00:05:44
Created on 2020-04-02 by the reprex package (v0.3.0)
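Because the question's data is random, the block logic is easier to follow on a tiny fixed vector. This is only an illustrative sketch: the values in test are made up, and kept and totals are hypothetical names.

```r
# Made-up gaps in seconds; values above 13 mark the split points
test <- c(2, 5, 14, 3, 4, 20, 1)
split_point <- test <= 13

# rle() finds runs of TRUE/FALSE; numbering the runs gives block IDs
runs  <- rle(split_point)
block <- rep(seq_along(runs$lengths), runs$lengths)

# Drop the split rows, then total the remaining seconds per block
kept   <- data.frame(test, block)[split_point, ]
totals <- aggregate(test ~ block, data = kept, FUN = sum)
totals
```

Blocks made of split points receive their own IDs and are then filtered out, which is why only odd block numbers survive in the outputs shown in this thread.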
Using dplyr for convenience's sake:
final_df <- data %>%
mutate(split_point = test <= 13,
block = with(rle(split_point), rep(seq_along(lengths), lengths))) %>%
group_by(block) %>%
filter(split_point) %>%
summarise(DateTimeEvent = min(dates), TotalTime = sum(test))
#> # A tibble: 1,110 x 3
#> block DateTimeEvent TotalTime
#> <int> <dttm> <drtn>
#> 1 1 2020-01-01 00:00:06 260 secs
#> 2 3 2020-01-01 00:02:28 170 secs
#> 3 5 2020-01-01 00:04:11 528 secs
#> 4 7 2020-01-01 00:09:07 89 secs
#> 5 9 2020-01-01 00:10:07 37 secs
#> 6 11 2020-01-01 00:10:39 135 secs
#> 7 13 2020-01-01 00:11:56 50 secs
#> 8 15 2020-01-01 00:12:32 124 secs
#> 9 17 2020-01-01 00:13:52 98 secs
#> 10 19 2020-01-01 00:14:47 83 secs
#> # … with 1,100 more rows
Created on 2020-04-02 by the reprex package (v0.3.0)
(results are different because reprex recreates the data each time)

Trouble using object in dataframe after a pipe (decomposition of a msts object)

I do time series decomposition and I want to save the resulting objects in a dataframe. It works if I store the results in an object and use it to make the dataframe afterwards:
# needed packages
library(forecast)
library(magrittr)
# some "time series"
vec <- 1:1000 + rnorm(1000)
# store pipe results
pipe_out <-
  # do decomposition
  decompose(msts(vec, start= c(2001, 1, 1), seasonal.periods= c(7, 365.25))) %>%
  # relevant data
  .$x
# make a dataframe with the stored seasonal data
data.frame(ts= pipe_out)
But doing the same as a one-liner fails:
decompose(msts(vec, start= c(2001, 1, 1), seasonal.periods= c(7, 365.25))) %>%
data.frame(ts= .$seasonal)
I get the error
Error in as.data.frame.default(x[[i]], optional = TRUE, stringsAsFactors = stringsAsFactors) :
cannot coerce class ‘"decomposed.ts"’ to a data.frame
I thought that the pipe simply moves forward the things that came up in the last step which saves us storing those things in objects. If so, shouldn't both codes result in the very same output?
EDIT (from comments)
The first code works but it is a bad solution because if one wants to extract all the vectors of the decomposed time series one would need to do it in multiple steps. Something like the following would be better:
decompose(msts(vec, start= c(2001, 1, 1),
seasonal.periods= c(7, 365.25))) %>%
data.frame(seasonal= .$seasonal, x=.$x, trend=.$trend, random=.$random)
It's unclear from your example whether you want to extract $x or $seasonal. Either way, you can extract part of a list either with the `[[`() function in base or the alias extract2() in magrittr, as you prefer. You should then use the . when you create a data.frame in the last step.
Cleaning up the code a bit to be consistent with the piping, the following works:
vec <- 1:1000 + rnorm(1000)
vec %>%
msts(start = c(2001, 1, 1), seasonal.periods= c(7, 365.25)) %>%
decompose %>%
`[[`("seasonal") %>%
# extract2("seasonal") %>% # Another option, uncomment if preferred
data.frame(ts = .) %>%
head # Just for the reprex, remove as required
#> ts
#> 1 -1.17332998
#> 2 0.07393265
#> 3 0.37631946
#> 4 0.30640395
#> 5 1.04279779
#> 6 0.20470768
Created on 2019-11-28 by the reprex package (v0.3.0)
Edit based on comment:
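To do what you mention in the comments, you need to wrap the data.frame() call in curly brackets. Without them, magrittr still passes the left-hand side as the first argument, because `.` appears only inside nested expressions such as .$seasonal; that is why data.frame() tried to coerce the whole decomposed.ts object. Hence, the following works: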
vec <- 1:1000 + rnorm(1000)
vec %>%
msts(start= c(2001, 1, 1), seasonal.periods = c(7, 365.25)) %>%
decompose %>%
{data.frame(seasonal = .$seasonal,
trend = .$trend)} %>%
head
#> seasonal trend
#> 1 -0.4332034 NA
#> 2 -0.6185832 NA
#> 3 -0.5899566 NA
#> 4 0.7640938 NA
#> 5 -0.4374417 NA
#> 6 -0.8739449 NA
However, for your specific use case, it may be clearer and easier to use magrittr::extract and then simply bind_cols:
vec %>%
msts(start= c(2001, 1, 1), seasonal.periods = c(7, 365.25)) %>%
decompose %>%
magrittr::extract(c("seasonal", "trend")) %>%
bind_cols %>%
head
#> # A tibble: 6 x 2
#> seasonal trend
#> <dbl> <dbl>
#> 1 -0.433 NA
#> 2 -0.619 NA
#> 3 -0.590 NA
#> 4 0.764 NA
#> 5 -0.437 NA
#> 6 -0.874 NA
Created on 2019-11-29 by the reprex package (v0.3.0)
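The same extraction can also be done without any pipe, since a decomposed.ts object is just a list of components. The sketch below uses base R only, with an assumed monthly toy series (frequency 12) rather than the msts object from the question; x, dec and out are illustrative names.

```r
set.seed(1)

# Toy series: four years of monthly data with a linear trend plus noise
x <- ts(1:48 + rnorm(48), frequency = 12)

# decompose() returns a list with $x, $seasonal, $trend, $random, ...
dec <- decompose(x)

# Pull the components of interest into one data frame
out <- data.frame(
  seasonal = as.numeric(dec$seasonal),
  trend    = as.numeric(dec$trend),
  random   = as.numeric(dec$random)
)
head(out)
```

The leading NA values in trend come from the centered moving average that decompose() uses, which is also why the trend column is NA in the outputs above.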
With daily data, decompose() does not work well because it will only handle the annual seasonality and will give relatively poor estimates of it. If the data involve human behaviour, it will probably have both weekly and annual seasonal patterns.
Also, msts objects are not great for daily data either because they don't store the dates explicitly.
I suggest you use tsibble objects with an STL decomposition instead. Here is an example using your data.
library(tsibble)
mydata <- tsibble(
  day = as.Date(seq(as.Date("2001-01-01"), length=1000, by=1)),
  vec = 1:1000 + rnorm(1000)
)
#> Using `day` as index variable.
#> # A tsibble: 1,000 x 2 [1D]
#> day vec
#> <date> <dbl>
#> 1 2001-01-01 0.161
#> 2 2001-01-02 2.61
#> 3 2001-01-03 1.37
#> 4 2001-01-04 3.15
#> 5 2001-01-05 4.43
#> 6 2001-01-06 7.35
#> 7 2001-01-07 7.10
#> 8 2001-01-08 10.0
#> 9 2001-01-09 9.16
#> 10 2001-01-10 10.2
#> # … with 990 more rows
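# Compute a decomposition
# (in current feasts releases STL() is a model specification, so the call is
#  mydata %>% model(STL(vec)) %>% components())
mydata %>% STL(vec)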
#> # A dable: 1,000 x 7 [1D]
#> # STL Decomposition: vec = trend + season_year + season_week + remainder
#> day vec trend season_year season_week remainder season_adjust
#> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2001-01-01 0.161 14.7 -14.6 0.295 -0.193 14.5
#> 2 2001-01-02 2.61 15.6 -14.2 0.0865 1.04 16.7
#> 3 2001-01-03 1.37 16.6 -15.5 0.0365 0.240 16.9
#> 4 2001-01-04 3.15 17.6 -13.0 -0.0680 -1.34 16.3
#> 5 2001-01-05 4.43 18.6 -13.4 -0.0361 -0.700 17.9
#> 6 2001-01-06 7.35 19.5 -12.4 -0.122 0.358 19.9
#> 7 2001-01-07 7.10 20.5 -13.4 -0.181 0.170 20.7
#> 8 2001-01-08 10.0 21.4 -12.7 0.282 1.10 22.5
#> 9 2001-01-09 9.16 22.2 -13.8 0.0773 0.642 22.9
#> 10 2001-01-10 10.2 22.9 -12.7 0.0323 -0.0492 22.9
#> # … with 990 more rows
Created on 2019-11-30 by the reprex package (v0.3.0)
The output is a dable (decomposition table) which behaves like a dataframe most of the time. So you can extract the trend column, or either of the seasonal component columns in the usual way.

How to rewrite an R loop taking averages of every 15 observations to same code but without a loop

I am dealing with a huge dataset (years of 1-minute-interval observations of energy usage). I want to convert it from 1-min-interval to 15-min-interval.
I have written a for loop which does this successfully (tested on a small subset of the data); however, when I tried running it on the main data, it was executing very slowly - and it would have taken me over 175 hours to run the full loop (I stopped it mid-execution).
The data to be converted to 15-minute intervals is the kWh usage; converting it simply requires taking the average of the first 15 observations, then the next 15, and so on. This is the loop that's working:
# Opening the file
data <- read.csv("1.csv",colClasses="character",na.strings="?")
# Adding an index to each row
total <- nrow(data)
data$obsnum <- seq.int(nrow(data))
# Calculating 15 min kwH usage
data$use_15_min <- data$use
for (i in 1:total) {
  int_used <- floor((i-1)/15)
  obsNum <- 15*int_used
  sum <- 0
  for (j in 1:15) {
    usedIndex <- as.numeric(obsNum+j)
    sum <- as.numeric(data$use[usedIndex]) + sum
  }
  data$use_15_min[i] <- sum/15
}
I have been searching for a function that can do the same, but without using loops, as I imagine this should save much time. Yet, I haven't been able to find one. How is it possible to achieve the same functionality without using a loop?
Try data.table:
library(data.table)
DT <- data.table(data)
n <- nrow(DT)
DT[, use_15_min := mean(use), by = gl(n, 15, n)]
The question is missing the input data so we used this:
data <- data.frame(use = 1:100)
A potential solution is to calculate the running mean (e.g. using TTR::runMean) and then keep every 15th observation. runMean() averages over a trailing window, so the first complete 15-observation mean sits in row 15. Here is an example:
df <- data.frame(x = 1:100, y = runif(100))
df$runmean <- TTR::runMean(df$y, n = 15)
df_15 <- df[seq(15, nrow(df), 15), ]
I cannot test it, as I do not have your data, but perhaps:
# `use` was read in as character, so convert it before averaging
data$use_15_min <- TTR::runMean(as.numeric(data$use), n = 15)
data_15_min <- data[seq(15, nrow(data), 15), ]
I would use lubridate::floor_date to create the 15-minute groupings.
library(dplyr)
library(lubridate)
df <- tibble(
  date = seq(ymd_hm("2019-01-01 00:00"), by = "min", length.out = 60 * 24 * 7),
  value = rnorm(n = 60 * 24 * 7)
)
#> # A tibble: 10,080 x 2
#> date value
#> <dttm> <dbl>
#> 1 2019-01-01 00:00:00 0.182
#> 2 2019-01-01 00:01:00 0.616
#> 3 2019-01-01 00:02:00 -0.252
#> 4 2019-01-01 00:03:00 0.0726
#> 5 2019-01-01 00:04:00 -0.917
#> 6 2019-01-01 00:05:00 -1.78
#> 7 2019-01-01 00:06:00 -1.49
#> 8 2019-01-01 00:07:00 -0.818
#> 9 2019-01-01 00:08:00 0.275
#> 10 2019-01-01 00:09:00 1.26
#> # ... with 10,070 more rows
df %>%
  mutate(
    nearest_15_mins = floor_date(date, "15 mins")
  ) %>%
  group_by(nearest_15_mins) %>%
  summarise(
    avg_value_at_15_mins_int = mean(value)
  )
#> # A tibble: 672 x 2
#> nearest_15_mins avg_value_at_15_mins_int
#> <dttm> <dbl>
#> 1 2019-01-01 00:00:00 -0.272
#> 2 2019-01-01 00:15:00 -0.129
#> 3 2019-01-01 00:30:00 0.173
#> 4 2019-01-01 00:45:00 -0.186
#> 5 2019-01-01 01:00:00 -0.188
#> 6 2019-01-01 01:15:00 0.104
#> 7 2019-01-01 01:30:00 -0.310
#> 8 2019-01-01 01:45:00 -0.173
#> 9 2019-01-01 02:00:00 0.0137
#> 10 2019-01-01 02:15:00 0.419
#> # ... with 662 more rows
