I'm trying to find a way of merging overlapping time intervals that can deal with milliseconds.
Three potential options have been posted here:
How to flatten / merge overlapping time periods
However, I don't need to group by ID, and so am finding the dplyr and data.table methods confusing (I'm not sure whether they can deal with milliseconds, as I can't get them to work).
I have managed to get the IRanges solution working, but it converts the POSIXct objects to integers (via as.numeric) to calculate the overlaps, and I'm assuming that is why the milliseconds are absent from the output.
The missing milliseconds don't seem to be just a display issue: when I subtract the resulting start and end times, I get whole-second results.
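For what it's worth, the fractional seconds do survive a plain numeric conversion; they only seem to vanish once the values are coerced to integers (a quick illustrative check below, not the IRanges code itself):
x <- as.POSIXct("2019-07-15 21:32:43.565", tz = "UTC")
as.numeric(x) %% 1         # ~0.565, the fraction is still there
as.integer(as.numeric(x))  # fraction dropped once coerced to integer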
Here's a sample of my data:
start <- c("2019-07-15 21:32:43.565",
"2019-07-15 21:32:43.634",
"2019-07-15 21:32:54.301",
"2019-07-15 21:34:08.506",
"2019-07-15 21:34:09.957")
end <- c("2019-07-15 21:32:48.445",
"2019-07-15 21:32:49.045",
"2019-07-15 21:32:54.801",
"2019-07-15 21:34:10.111",
"2019-07-15 21:34:10.236")
df <- data.frame(start, end)
The output I get from the IRanges solution:
start end
1 2019-07-15 21:32:43 2019-07-15 21:32:49
2 2019-07-15 21:32:54 2019-07-15 21:32:54
3 2019-07-15 21:34:08 2019-07-15 21:34:10
And the desired result:
start end
1 2019-07-15 21:32:43.565 2019-07-15 21:32:49.045
2 2019-07-15 21:32:54.301 2019-07-15 21:32:54.801
3 2019-07-15 21:34:08.506 2019-07-15 21:34:10.236
Suggestions would be very much appreciated!
I've found it is quite easy to preserve milliseconds if you use POSIXlt format. Although there are faster ways to calculate the overlap, it's fast enough for most purposes to just loop through the data frame.
Here's a reproducible example.
start <- c("2019-07-15 21:32:43.565",
"2019-07-15 21:32:43.634",
"2019-07-15 21:32:54.301",
"2019-07-15 21:34:08.506",
"2019-07-15 21:34:09.957")
end <- c("2019-07-15 21:32:48.445",
"2019-07-15 21:32:49.045",
"2019-07-15 21:32:54.801",
"2019-07-15 21:34:10.111",
"2019-07-15 21:34:10.236")
df <- data.frame(start = as.POSIXlt(start), end = as.POSIXlt(end))
i <- 1
while(i < nrow(df))
{
  # rows whose interval overlaps row i (this includes row i itself)
  overlaps <- which(df$start < df$end[i] & df$end > df$start[i])
  if(length(overlaps) > 1)
  {
    # extend row i to cover the whole overlapping group and drop the other rows
    df$end[i] <- max(df$end[overlaps])
    df <- df[-overlaps[-which(overlaps == i)], ]
    i <- i - 1 # re-check row i in case it now overlaps later rows
  }
  i <- i + 1
}
So now our data frame doesn't have overlaps:
df
#> start end
#> 1 2019-07-15 21:32:43 2019-07-15 21:32:49
#> 3 2019-07-15 21:32:54 2019-07-15 21:32:54
#> 4 2019-07-15 21:34:08 2019-07-15 21:34:10
Although it appears we have lost the milliseconds, this is just a display issue, as we can show by doing this:
df$end - df$start
#> Time differences in secs
#> [1] 5.48 0.50 1.73
as.numeric(df$end - df$start)
#> [1] 5.48 0.50 1.73
Created on 2020-02-20 by the reprex package (v0.3.0)
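If you want the milliseconds to actually print, that is just a matter of the digits.secs option or an explicit format string; for example (display only, the stored values are unchanged):
options(digits.secs = 3)
df                                      # start/end now print with millisecond digits
format(df$end, "%Y-%m-%d %H:%M:%OS3")   # %OS3 = seconds with 3 decimal places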
I think the best thing to do here is to use the clock package (for a true sub-second precision date-time type) along with the ivs package (for merging overlapping intervals).
Using POSIXct for sub-second date-times can be a bit challenging for various reasons, which I've talked about here.
The key here is iv_groups(), which merges all overlapping intervals and returns the intervals that remain after all of the overlaps have been merged. It is also backed by a C implementation that is very fast.
library(clock)
library(ivs)
library(dplyr)
df <- tibble(
start = c(
"2019-07-15 21:32:43.565", "2019-07-15 21:32:43.634",
"2019-07-15 21:32:54.301", "2019-07-15 21:34:08.506",
"2019-07-15 21:34:09.957"
),
end = c(
"2019-07-15 21:32:48.445", "2019-07-15 21:32:49.045",
"2019-07-15 21:32:54.801", "2019-07-15 21:34:10.111",
"2019-07-15 21:34:10.236"
)
)
# Parse into "naive time" (i.e. with a yet-to-be-defined time zone)
# using a millisecond precision
df <- df %>%
mutate(
start = naive_time_parse(start, format = "%Y-%m-%d %H:%M:%S", precision = "millisecond"),
end = naive_time_parse(end, format = "%Y-%m-%d %H:%M:%S", precision = "millisecond"),
)
df
#> # A tibble: 5 × 2
#> start end
#> <tp<naive><milli>> <tp<naive><milli>>
#> 1 2019-07-15T21:32:43.565 2019-07-15T21:32:48.445
#> 2 2019-07-15T21:32:43.634 2019-07-15T21:32:49.045
#> 3 2019-07-15T21:32:54.301 2019-07-15T21:32:54.801
#> 4 2019-07-15T21:34:08.506 2019-07-15T21:34:10.111
#> 5 2019-07-15T21:34:09.957 2019-07-15T21:34:10.236
# Now combine these start/end boundaries into a single interval vector
df <- df %>%
mutate(interval = iv(start, end), .keep = "unused")
df
#> # A tibble: 5 × 1
#> interval
#> <iv<tp<naive><milli>>>
#> 1 [2019-07-15T21:32:43.565, 2019-07-15T21:32:48.445)
#> 2 [2019-07-15T21:32:43.634, 2019-07-15T21:32:49.045)
#> 3 [2019-07-15T21:32:54.301, 2019-07-15T21:32:54.801)
#> 4 [2019-07-15T21:34:08.506, 2019-07-15T21:34:10.111)
#> 5 [2019-07-15T21:34:09.957, 2019-07-15T21:34:10.236)
# And use `iv_groups()` to merge all overlapping intervals.
# It returns the remaining intervals after all overlaps have been removed.
df %>%
summarise(interval = iv_groups(interval))
#> # A tibble: 3 × 1
#> interval
#> <iv<tp<naive><milli>>>
#> 1 [2019-07-15T21:32:43.565, 2019-07-15T21:32:49.045)
#> 2 [2019-07-15T21:32:54.301, 2019-07-15T21:32:54.801)
#> 3 [2019-07-15T21:34:08.506, 2019-07-15T21:34:10.236)
Created on 2022-04-05 by the reprex package (v2.0.1)
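If you need separate start/end columns again afterwards, ivs provides accessors for the two interval bounds, so a possible follow-up (my addition, not part of the reprex above) is:
df %>%
  summarise(interval = iv_groups(interval)) %>%
  mutate(start = iv_start(interval), end = iv_end(interval), .keep = "unused")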
I'm looking for a simple, correct way to convert a date/time (POSIXct) column into an elapsed time that starts at 00:00:00.
I couldn't find an existing answer for R, but if I overlooked one, please tell me :)
So I have this :
date/time              v1
2022-02-16 15:07:15    38937
2022-02-16 15:07:17    39350
And I would like this :
time        v1
00:00:00    38937
00:00:02    39350
Can somebody help me with this?
Thanks :)
You can calculate the difference between the two datetimes in seconds, and add it to an arbitrary date starting at "00:00:00", before formatting the result to include only the time. See the time column in the reprex underneath:
library(dplyr)
library(lubridate)
df %>%
mutate(
date = lubridate::ymd_hms(date),
seconds = as.numeric(date - first(date)),
time = format(
lubridate::ymd_hms("2022-01-01 00:00:00") + seconds,
format = "%H:%M:%S"
)
)
#> # A tibble: 2 × 4
#> date v1 seconds time
#> <dttm> <dbl> <dbl> <chr>
#> 1 2022-02-16 15:07:15 38937 0 00:00:00
#> 2 2022-02-16 15:07:17 39350 2 00:00:02
Created on 2022-03-30 by the reprex package (v2.0.1)
Note that this will be misleading if you ever have over 24 hours between two datetimes. In these cases you should probably include the date.
Data
df <- tibble::tribble(
~date, ~v1,
"2022-02-16 15:07:15", 38937,
"2022-02-16 15:07:17", 39350
)
You can subtract the first record of date/time from every date/time value, and convert the result to a time with the hms() function from the hms package.
library(dplyr)
library(hms)
df %>%
mutate(`date/time` = hms::hms(as.numeric(as.POSIXct(`date/time`) - as.POSIXct(first(`date/time`)))))
date/time v1
1 00:00:00 38937
2 00:00:02 39350
Note that with this method, a time difference greater than 1 day is still reflected in the result, for example:
df <- read.table(header = T, check.names = F, sep = "\t", text = "
date/time v1
2022-02-16 15:07:15 38937
2022-02-18 15:07:17 39350")
df %>%
mutate(`date/time` = hms::hms(as.numeric(as.POSIXct(`date/time`) - as.POSIXct(first(`date/time`)))))
date/time v1
1 00:00:00 38937
2 48:00:02 39350
Problem description
I work with thrice-monthly data a lot. Thrice monthly (roughly every 10 days, also referred to as a dekad) is the typical reporting interval for water-related data in the former Soviet Union and for many other climate/water data sets around the world. Below is an exemplary data set with 2 variables:
> date = unique(floor_date(seq.Date(as.Date("2019-01-01"), as.Date("2019-12-31"),
by="day"), "10days"))
> example_data <- tibble(
date = date[day(date)!=31],
value = seq(1,36,1),
var = "A") %>%
add_row(tibble(
date = date[day(date)!=31],
value = seq(10,360,10),
var = "B"))
> example_data
# A tibble: 72 x 3
# Groups: var [2]
date value var
<ord> <dbl> <chr>
1 2019-01-01 1 A
2 2019-01-01 10 B
3 2019-01-11 2 A
4 2019-01-11 20 B
5 2019-01-21 3 A
6 2019-01-21 30 B
7 2019-02-01 4 A
8 2019-02-01 40 B
9 2019-02-11 5 A
10 2019-02-11 50 B
# … with 62 more rows
In the example I chose the 1st, 11th, and 21st to date the dekads, but it would actually be more appropriate to index them as dekad 1 to 3 per month (analogous to months 1 to 12 per year) or as dekad 1 to 36 per year (analogous to day of the year). The most elegant solution would be a proper date format for dekadal data, like yearmonth in lubridate. However, lubridate does not plan to support dekadal data in the near future (github conversation).
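(To make the indexing explicit, here is a small sketch with plain integer arithmetic; dekad_of_month()/dekad_of_year() are just illustrative helpers, not existing functions:)
library(lubridate)
dekad_of_month <- function(x) pmin((day(x) - 1) %/% 10 + 1, 3)        # 1, 2 or 3 within a month
dekad_of_year  <- function(x) (month(x) - 1) * 3 + dekad_of_month(x)  # 1 .. 36 within a year
dekad_of_year(as.Date(c("2019-01-01", "2019-01-11", "2019-12-21")))
#> [1]  1  2 36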
I have workflows using tsibble and timetk which work well with monthly data but it would really be more appropriate to work with the original dekadal time steps and I'm looking for a way to be able to use the tidyverse functions with dekadal data with as few cumbersome workarounds as possible.
The problem with using daily dates for dekadal data in tsibble is that it identifies the time interval as daily, so you get a lot of data gaps between your 3 values per month:
> example_data_tsbl <- as_tsibble(example_data, index = date, key = var)
> count_gaps(example_data_tsbl, .full = FALSE)
# A tibble: 70 x 4
var .from .to .n
<chr> <date> <date> <int>
1 A 2019-01-02 2019-01-10 9
2 A 2019-01-12 2019-01-20 9
3 A 2019-01-22 2019-01-31 10
# …
Here's what I did so far:
I saw here that ordered factors can be used as indices in tsibble, but timetk does not recognise factors as indices. timetk instead suggests defining custom indices (see 2.).
It is possible to add custom indices to tsibble, but I haven't found any examples of this and I don't understand how to use those functions (a vignette is still planned). I have started reading the code to work out how to add support for dekadal data, but I'm a bit overwhelmed.
Questions
Will dekadal custom indices in tsibble behave similarly as the yearmonth or weekyear?
Would anyone here have an example to share on how to add custom indices to tsibble?
Or does anyone know of another way to elegantly handle dekadal data in the tidyverse?
This doesn't discuss tsibbles but it was too long for a comment and does provide an alternative.
zoo can do this either by (1) the code below, which does not require creating a new class, or (2) by creating a new class and methods. For the second alternative, following the methods that the yearmon class provides would be sufficient; see here. zoo itself does not have to be modified.
As we see below, with the first approach dates are shown as year(cycle) where cycle is 1, 2, ..., 36. Internally the dates are stored as year + (cycle-1)/36.
It would also be possible to use the ts class if the dates were consecutive month-thirds (or, if not, if you don't mind having NAs inserted to make them so). For that, use as.ts(z).
Start a fresh session with no packages loaded, copy and paste the input DF shown in the Note at the end, and then run this code. Date2dek converts a Date vector, or a character vector of dates in standard yyyy-mm-dd format, to the dek format described above. dek2Date performs the inverse transformation; it is not actually used below but might be useful.
library(zoo)
# convert Date or yyyy-mm-dd char vector
Date2dek <- function(x, ...) with(as.POSIXlt(x, tz="GMT"),
1900 + year + (mon + ((mday >= 11) + (mday >= 21)) / 3) / 12)
dek2Date <- function(x, ...) { # not used below but shows inverse
cyc <- round(36 * (as.numeric(x) %% 1)) + 1
if(all(is.na(x))) return(as.Date(x))
month <- (cyc - 1) %/% 3 + 1
day <- 10 * ((cyc - 1) %% 3) + 1
year <- floor(x + .001)
ix <- !is.na(year)
as.Date(paste(year[ix], month[ix], day[ix], sep = "-"))
}
# DF given in Note below
z <- read.zoo(DF, split = "var", FUN = Date2dek, regular = TRUE, freq = 36)
z
The result is the following zooreg object:
A B
2019(1) 1 10
2019(2) 2 20
2019(3) 3 30
2019(4) 4 40
2019(5) 5 50
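As a quick sanity check (my addition, not in the original answer), dek2Date applied to the index of z recovers the original dates:
dek2Date(index(z))
## [1] "2019-01-01" "2019-01-11" "2019-01-21" "2019-02-01" "2019-02-11"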
Note
DF <- data.frame(
date = as.Date(ISOdate(2019, rep(1:2, 3:2), c(1, 11, 21))),
value = c(1:5, 10*(1:5)),
var = rep(c("A", "B"), each = 5))
Extending tsibble to support a new index requires defining methods for these generics:
index_valid() - This method should return TRUE if the class is acceptable as an index
interval_pull() - This method accepts your index values and computes the interval of the data. The interval can be created using tsibble:::new_interval(). You may find tsibble::gcd_interval() useful for computing the smallest interval.
seq() and + - These methods are used to produce future time values using the new_data() function.
A minimal example of a new tsibble index class for 'year' is as follows:
library(tsibble)
#>
#> Attaching package: 'tsibble'
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, union
library(vctrs)
# Object creation function
my_year <- function(x = integer()) {
x <- vec_cast(x, integer())
vctrs::new_vctr(x, class = "year")
}
# Declare this class as a valid index
index_valid.year <- function(x) TRUE
# Compute the interval of a year input
interval_pull.year <- function(x) {
tsibble::new_interval(
year = tsibble::gcd_interval(vec_data(x))
)
}
# Specify how sequences are generated from years
seq.year <- function(from, to, by, length.out = NULL, along.with = NULL, ...) {
from <- vec_data(from)
if (!rlang::is_missing(to)) {
vec_assert(to, my_year())
to <- vec_data(to)
}
my_year(NextMethod())
}
# Define `+` operation as needed for `new_data()`
vec_arith.year <- function(op, x, y, ...) {
my_year(vec_arith(op, vec_data(x), vec_data(y), ...))
}
# Use the new index class
x <- tsibble::tsibble(
year = my_year(c(2018, 2020, 2024)),
y = rnorm(3),
index = "year"
)
x
#> # A tsibble: 3 x 2 [2Y]
#> year y
#> <year> <dbl>
#> 1 2018 0.211
#> 2 2020 -0.410
#> 3 2024 0.333
interval(x)
#> <interval[1]>
#> [1] 2Y
new_data(x, 3)
#> # A tsibble: 3 x 1 [2Y]
#> year
#> <year>
#> 1 2026
#> 2 2028
#> 3 2030
Created on 2021-02-08 by the reprex package (v0.3.0)
So I clean revenue data every quarter, and I need to use a two-quarter moving average to predict the quarterly revenue of each individual product for the next five years (I know this will just end up being the same average for now). Attached here is the data frame: Revenue Df
Right now I have the data in wide format. You'll see I created the empty forecasting columns by having the user enter a start and end date for the forecast, which then creates a column for every quarter in between. How can I fill these forecast columns using a moving average? I also converted the data to long format and still could not figure out how to fill the forecast. Also, I know 9-30-2020 shows up in the forecast; we want to replace that with the actuals even if the user inputs that date for the forecast.
for(i in ncol(Revenue_df)){
if(i<3)
{Revenue_df[,i]<- Revenue_df[,i]}
else{
Revenue_df[,i]<-(Revenue_df[,i-1]+Revenue_df[,i-2])/2
}
}
Product<- c("a","b","c","d","e")
Revenue.3_30_2020<- c(50,40,30,20,10)
Revenue.6_30_2020<- c(50,45,28,19,17)
Revenue.9_30_2020<- c(25,20,22,17,24)
revenue<- data.frame(Product,Revenue.3_30_2020,Revenue.6_30_2020,Revenue.9_30_2020)
forecast.sequence<- c("2020-09-30","2020-12-31","2021-03-31","2021-06-30","2021-09-30","2021-12-31","2022-03-31",
"2022-06-30","2022-09-30","2022-12-31","2023-03-31","2023-06-30","2023-09-30","2023-12-31","2024-03-31",
"2024-06-30","2024-09-30","2024-12-31")
forecast.sequence.amount<- paste("FC.Amount.",forecast.sequence)
revenue[,forecast.sequence.amount]<-NA
I tried this code and it did not work; any suggestions? Also attached is the code for the sample data frame shown in the picture. Sorry for the bad format, this is only my second time asking a question on here.
This seems to be a bit simple for a product forecast. You might want to look at the forecast and fable packages for forecast functions that can account for trends and seasonality. These would, however, require more than two data points. Anyway, taking your problem as given, the following code seems to do what you describe.
EDIT
I've made the forecast calculation a function to make it more straightforward to use.
library(tidyverse)
product<- c("a","b","c","d","e")
Revenue.3_30_2020<- c(50,40,30,20,10)
Revenue.6_30_2020<- c(50,45,28,19,17)
Revenue.9_30_2020<- c(25,20,22,17,24)
revenue<- data.frame( Product = product, Revenue.3_30_2020,Revenue.6_30_2020,Revenue.9_30_2020)
rev_frcst <- function(revenue, frcst_end, frcst_prefix) {
#
# Arguments:
# revenue = data frame with
# Product containing product name
# columns with the format "prefix.m_day_year" containing product quantities for past quarters
# frcst_end = end date for quarterly forecast
# frcst_prefix = string containing prefix for forecast
#
# convert revenue to long format
#
rev_long <- revenue %>% pivot_longer(cols = -Product, names_to = "Quarter", values_to = "Revenue") %>%
mutate(quarter_end = as.Date(str_remove(Quarter,"Revenue."), "%m_%d_%Y"))
num_revenue <- nrow(rev_long)/length(unique(revenue$Product))
#
# generate forecast dates
#
forecast.sequence <- seq( max(rev_long$quarter_end),
as.Date(frcst_end),
by = "quarter")[-1]
#
# Add forecast rows to data
#
rev_long <- rev_long %>%
bind_rows(expand_grid(Product=unique(revenue$Product), quarter_end = forecast.sequence) %>%
mutate(Quarter = paste(frcst_prefix, quarter_end)))
#
# Define moving average function
#
mov_avg <- function(num_frcst, x) {
# x holds the last two actuals; append num_frcst empty slots for the forecasts
y <- c(x, numeric(num_frcst))
for(i in 1:num_frcst + 2) {  # note: 1:num_frcst + 2 is 3:(num_frcst + 2)
y[i] <- .5*(y[i-1] + y[i-2]) }
y[1:num_frcst + 2]  # return only the forecast values
}
#
# Calculate forecast
#
rev_long_2 <- rev_long %>% group_by(Product) %>%
mutate(forecast = c(Revenue[1:num_revenue],
mov_avg(num_frcst =length(forecast.sequence),
x = Revenue[1:2 + num_revenue - 2]))) %>%
arrange(Product, quarter_end)
}
#
# call rev_frcst to calculate the forecast
#
rev_forecast <- rev_frcst(revenue=revenue,
frcst_end = "2024-12-31",
frcst_prefix = "FC.Amount.")
which gives
Product Quarter Revenue quarter_end forecast
<chr> <chr> <dbl> <date> <dbl>
1 a Revenue.3_30_2020 50 2020-03-30 50
2 a Revenue.6_30_2020 50 2020-06-30 50
3 a Revenue.9_30_2020 25 2020-09-30 25
4 a FC.Amount. 2020-12-30 NA 2020-12-30 37.5
5 a FC.Amount. 2021-03-30 NA 2021-03-30 31.2
6 a FC.Amount. 2021-06-30 NA 2021-06-30 34.4
7 a FC.Amount. 2021-09-30 NA 2021-09-30 32.8
8 a FC.Amount. 2021-12-30 NA 2021-12-30 33.6
9 a FC.Amount. 2022-03-30 NA 2022-03-30 33.2
10 a FC.Amount. 2022-06-30 NA 2022-06-30 33.4
I have a time column in R as:
22:34:47
06:23:15
7:35:15
5:45
How can I convert all the time values in the column into hh:mm:ss format? I have used
as_date(a$time, tz=NULL) but I am not able to get the format I want.
Here is an option with parse_date_time which can take multiple formats
library(lubridate)
format(parse_date_time(time, c("HMS", "HM"), tz = "GMT"), "%H:%M:%S")
#[1] "22:34:47" "06:23:15" "07:35:15" "05:45:00"
data
time <- c("22:34:47", "06:23:15", "7:35:15", "5:45")
Nothing a bit of formatting can't take care of:
x <- c("22:34:47","06:23:15","7:35:15","5:45")
format(
pmax(
as.POSIXct(x, format="%T", tz="UTC"),
as.POSIXct(x, format="%R", tz="UTC"), na.rm=TRUE
),
"%T"
)
#[1] "22:34:47" "06:23:15" "07:35:15" "05:45:00"
The pmax means any additional seconds will be taken in preference to just hh:mm.
You could get functional if you wanted to get a similar result with less typing, and more opportunity for turning it into a repeatable function.
do.call(pmax, c(lapply(c("%T","%R"), as.POSIXct, x=x, tz="UTC"), na.rm=TRUE))
Using a tidyverse approach with dplyr and hms verbs.
library(dplyr)
library(hms)
a <- tibble(time = c("22:34:47", "06:23:15", "7:35:15", "5:45"))
a %>%
mutate(
time = case_when(
is.na(parse_hms(time)) ~ parse_hm(time),
TRUE ~ parse_hms(time)
)
)
# # A tibble: 4 x 1
# time
# <time>
# 1 22:34
# 2 06:23
# 3 07:35
# 4 05:45
Note that the use of case_when could be replaced with an ifelse. The reason for this conditional is that parse_hms will return NA for values without seconds.
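For instance, here is a sketch of the same mutate written with dplyr's if_else(), which (unlike base ifelse()) keeps the hms class of the result:
a %>%
  mutate(time = if_else(is.na(parse_hms(time)), parse_hm(time), parse_hms(time)))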
You may also want the output to be a POSIXct value; you can adapt the previous solution to do so.
a %>%
mutate(
time = case_when(
is.na(parse_hms(time)) ~ as.POSIXct(parse_hm(time)),
TRUE ~ as.POSIXct(parse_hms(time))
)
)
# # A tibble: 4 x 1
# time
# <dttm>
# 1 1970-01-01 22:34:47
# 2 1970-01-01 06:23:15
# 3 1970-01-01 07:35:15
# 4 1970-01-01 05:45:00
Note this will set the date to origin, which is 1970-01-01 by default.
I don't often have to work with dates in R, but I imagine this is fairly easy. I have several years of daily data, as below, and I want the sum of the values for each 8-day period. What is the best approach?
Any help you can provide will be greatly appreciated!
str(temp)
'data.frame':648 obs. of 2 variables:
$ Date : Factor w/ 648 levels "2001-03-24","2001-03-25",..: 1 2 3 4 5 6 7 8 9 10 ...
$ conv2: num -3.93 -6.44 -5.48 -6.09 -7.46 ...
head(temp)
Date amount
24/03/2001 -3.927020472
25/03/2001 -6.4427004
26/03/2001 -5.477592528
27/03/2001 -6.09462162
28/03/2001 -7.45666902
29/03/2001 -6.731540928
30/03/2001 -6.855206184
31/03/2001 -6.807210228
1/04/2001 -5.40278802
I tried to use the aggregate function but for some reason it doesn't work and aggregates in the wrong way:
z <- aggregate(amount ~ Date, timeSequence(from =as.Date("2001-03-24"),to =as.Date("2001-03-29"), by="day"),data=temp,FUN=sum)
I prefer the xts package for such manipulations.
I read your data in as a zoo object; note the flexibility of the format option.
library(xts)
ts.dat <- read.zoo(text ='Date amount
24/03/2001 -3.927020472
25/03/2001 -6.4427004
26/03/2001 -5.477592528
27/03/2001 -6.09462162
28/03/2001 -7.45666902
29/03/2001 -6.731540928
30/03/2001 -6.855206184
31/03/2001 -6.807210228
1/04/2001 -5.40278802',header=TRUE,format = '%d/%m/%Y')
Then I extract the indices of the given periods:
ep <- endpoints(ts.dat,'days',k=8)
Finally, I apply my function to the time series at each index:
period.apply(x=ts.dat,ep,FUN=sum )
2001-03-29 2001-04-01
-36.13014 -19.06520
Use cut() in your aggregate() command.
Some sample data:
set.seed(1)
mydf <- data.frame(
DATE = seq(as.Date("2000/1/1"), by="day", length.out = 365),
VALS = runif(365, -5, 5))
Now, the aggregation. See ?cut.Date for details. You can specify the number of days you want in each group using cut:
output <- aggregate(VALS ~ cut(DATE, "8 days"), mydf, sum)
list(head(output), tail(output))
# [[1]]
# cut(DATE, "8 days") VALS
# 1 2000-01-01 8.242384
# 2 2000-01-09 -5.879011
# 3 2000-01-17 7.910816
# 4 2000-01-25 -6.592012
# 5 2000-02-02 2.127678
# 6 2000-02-10 6.236126
#
# [[2]]
# cut(DATE, "8 days") VALS
# 41 2000-11-16 17.8199285
# 42 2000-11-24 -0.3772209
# 43 2000-12-02 2.4406024
# 44 2000-12-10 -7.6894484
# 45 2000-12-18 7.5528077
# 46 2000-12-26 -3.5631950
rollapply. The zoo package has a rolling apply function which can also do non-rolling aggregations. First convert the temp data frame into zoo using read.zoo like this:
library(zoo)
zz <- read.zoo(temp)
and then its just:
rollapply(zz, 8, sum, by = 8)
Drop the by = 8 if you want a rolling total instead.
(Note that the two versions of temp in your question are not the same. They have different column headings and the Date columns are in different formats. I have assumed the str(temp) output version here. For the head(temp) version one would have to add a format = "%d/%m/%Y" argument to read.zoo.)
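For completeness, that variant would look something like this (a sketch assuming the head(temp) column layout shown in the question):
zz <- read.zoo(temp, format = "%d/%m/%Y")  # Date column given as dd/mm/yyyy text
rollapply(zz, 8, sum, by = 8)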
aggregate. Here is a solution that does not use any external packages. It uses aggregate based on the original data frame.
ix <- 8 * ((1:nrow(temp) - 1) %/% 8 + 1)
aggregate(temp[2], list(period = temp[ix, 1]), sum)
Note that ix looks like this:
> ix
[1] 8 8 8 8 8 8 8 8 16
so it groups the indices of the first 8 rows, the second 8 and so on.
Those are NOT Date-classed variables. (No self-respecting program would display a date like that, not to mention the fact that these are labeled as factors.) [I later noticed these were not the same objects.] Furthermore, the timeSequence function (at least the one in the timeDate package) does not return a Date class vector either. So your expectation that there would be a "right way" for two disparate non-Date objects to be aligned in a sensible manner is ill-conceived. The irony is that just using the temp$Date column would have worked, since:
> z <- aggregate(amount ~ Date, data=temp , FUN=sum)
> z
Date amount
1 1/04/2001 -5.402788
2 24/03/2001 -3.927020
3 25/03/2001 -6.442700
4 26/03/2001 -5.477593
5 27/03/2001 -6.094622
6 28/03/2001 -7.456669
7 29/03/2001 -6.731541
8 30/03/2001 -6.855206
9 31/03/2001 -6.807210
But to get it in 8 day intervals use cut.Date:
> z <- aggregate(temp$amount ,
list(Dts = cut(as.Date(temp$Date, format="%d/%m/%Y"),
breaks="8 day")), FUN=sum)
> z
Dts x
1 2001-03-24 -49.792561
2 2001-04-01 -5.402788
A cleaner approach extending @G. Grothendieck's approach. Note: it does not take into account whether the dates are continuous or discontinuous; the sum is calculated over a fixed width.
code
library(zoo)

interval = 8 # your desired date interval: 2 days, 3 days or whatever
enddate = interval - 1 # this sets the end-date offset
z <- aggregate(. ~ V1, data = df, sum) # aggregate sum of all duplicate dates
z$V1 <- as.Date(z$V1)
nrows = nrow(z) # number of aggregated rows (computed after z exists)
data.frame(Start.date = z[seq(1, nrows, interval), 1],
           End.date = z[seq(1, nrows, interval) + enddate, 1],
           Total.sum = rollapply(z$V2, interval, sum, by = interval, partial = TRUE))
output
Start.date End.date Total.sum
1 2000-01-01 2000-01-08 9.1395926
2 2000-01-09 2000-01-16 15.0343960
3 2000-01-17 2000-01-24 4.0974712
4 2000-01-25 2000-02-01 4.1102645
5 2000-02-02 2000-02-09 -11.5816277
data
df <- data.frame(
V1 = seq(as.Date("2000/1/1"), by="day", length.out = 365),
V2 = runif(365, -5, 5))