Converting Monthly Data to Daily in R

I have a data.frame df that has monthly data:
Date Value
2008-01-01 3.5
2008-02-01 9.5
2008-03-01 0.1
I want there to be data on every day in the month (and I will assume Value does not change during each month) since I will be merging this into a different table that has daily data.
I want the output to look like this:
Date Value
2008-01-02 3.5
2008-01-03 3.5
2008-01-04 3.5
2008-01-05 3.5
2008-01-06 3.5
2008-01-07 3.5
2008-01-08 3.5
2008-01-09 3.5
2008-01-10 3.5
2008-01-11 3.5
2008-01-12 3.5
2008-01-13 3.5
2008-01-14 3.5
2008-01-15 3.5
2008-01-16 3.5
2008-01-17 3.5
2008-01-18 3.5
2008-01-19 3.5
2008-01-20 3.5
2008-01-21 3.5
2008-01-22 3.5
2008-01-23 3.5
2008-01-24 3.5
2008-01-25 3.5
2008-01-26 3.5
2008-01-27 3.5
2008-01-28 3.5
2008-01-29 3.5
2008-01-30 3.5
2008-01-31 3.5
2008-02-01 9.5
I have tried to.daily, but my call:
df <- to.daily(df$Date)
returns
Error in to.period(x, "days", name = name, ...) : ‘x’ contains no data

Not sure if I understood perfectly, but I think something like this may work.
First, I define the monthly data table:
library(data.table)
DT_month = data.table(Date = as.Date(c("2008-01-01", "2008-02-01", "2008-03-01", "2008-05-01", "2008-07-01")),
                      Value = c(3.5, 9.5, 0.1, 5, 8))
Then, you have to do the following:
DT_month[, Month := month(Date)]
DT_month[, Year := year(Date)]
start_date = min(DT_month$Date)
end_date = max(DT_month$Date)
DT_daily = data.table(Date = seq.Date(start_date, end_date, by = "day"))
DT_daily[, Month := month(Date)]
DT_daily[, Year := year(Date)]
DT_daily[, Value := -100]
for (i in unique(DT_daily$Year)) {
  for (j in unique(DT_daily$Month)) {
    if (length(DT_month[Year == i & Month == j, Value]) != 0) {
      DT_daily[Year == i & Month == j, Value := DT_month[Year == i & Month == j, Value]]
    }
  }
}
Basically, the code stores the month and year of each monthly value in separate columns.
Then it creates a vector of daily dates running from the minimum to the maximum date in your monthly data, and adds separate year and month columns for the daily data as well.
Finally, it loops through every combination of year and month, filling the daily values with the monthly ones. If there is no data for a certain combination of month and year, it will show -100.
Please let me know if it works.
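For reference, the same month-and-year fill can be written without the nested loops as a data.table update join; a minimal sketch using the objects defined above:
# for each daily row, look up the monthly row with the same Year and Month
# and copy its Value; days without a monthly match keep the -100 placeholder
DT_daily[DT_month, on = .(Year, Month), Value := i.Value]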

An option using tidyr::expand, which expands each row into a sequence of days from the first day of the month to the last. lubridate::floor_date() provides the first day of the month, and lubridate::ceiling_date() - days(1) provides the last day.
library(tidyverse)
library(lubridate)
df %>%
  mutate(Date = ymd(Date)) %>%
  group_by(Date) %>%
  expand(Date = seq(floor_date(Date, unit = "month"),
                    ceiling_date(Date, unit = "month") - days(1), by = "day"), Value) %>%
  as.data.frame()
# Date Value
# 1 2008-01-01 3.5
# 2 2008-01-02 3.5
# 3 2008-01-03 3.5
# 4 2008-01-04 3.5
# 5 2008-01-05 3.5
#.....so on
# 32 2008-02-01 9.5
# 33 2008-02-02 9.5
# 34 2008-02-03 9.5
# 35 2008-02-04 9.5
# 36 2008-02-05 9.5
#.....so on
# 85 2008-03-25 0.1
# 86 2008-03-26 0.1
# 87 2008-03-27 0.1
# 88 2008-03-28 0.1
# 89 2008-03-29 0.1
# 90 2008-03-30 0.1
# 91 2008-03-31 0.1
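As an aside, lubridate's days_in_month() gives an equivalent way to compute the last day of each month; a small sketch on the same df:
library(lubridate)
# Date + days_in_month(Date) - 1 is the last calendar day of that month,
# equivalent to ceiling_date(Date, unit = "month") - days(1)
ymd(df$Date) + days_in_month(ymd(df$Date)) - 1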
Data:
df <- read.table(text =
"Date Value
2008-01-01 3.5
2008-02-01 9.5
2008-03-01 0.1",
header = TRUE, stringsAsFactors = FALSE)

to.daily can only be applied to xts/zoo objects, and it can only convert to a LOWER frequency, i.e. from daily to monthly, but not the other way round.
One easy way to accomplish what you want is converting df to an xts object:
library(xts)  # also loads zoo
df.xts <- xts(df$Value, order.by = as.Date(df$Date))
And merge, like so:
na.locf(merge(df.xts, foo = zoo(NA, order.by = seq(start(df.xts), end(df.xts),
                                                   by = "day")))[, 1])
df.xts
2008-01-01 3.5
2008-01-02 3.5
2008-01-03 3.5
2008-01-04 3.5
2008-01-05 3.5
2008-01-06 3.5
2008-01-07 3.5
....
2008-01-27 3.5
2008-01-28 3.5
2008-01-29 3.5
2008-01-30 3.5
2008-01-31 3.5
2008-02-01 9.5
2008-02-02 9.5
2008-02-03 9.5
2008-02-04 9.5
2008-02-05 9.5
2008-02-06 9.5
2008-02-07 9.5
2008-02-08 9.5
....
2008-02-28 9.5
2008-02-29 9.5
2008-03-01 0.1
If you want the value to adjust continuously over the course of a month, use na.spline in place of na.locf.
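For instance, the interpolated variant would be the following sketch, using the same objects as above:
# spline interpolation between the monthly anchor points instead of
# carrying the last observation forward
na.spline(merge(df.xts, foo = zoo(NA, order.by = seq(start(df.xts), end(df.xts),
                                                     by = "day")))[, 1])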

Maybe not an efficient one, but with base R we can do:
do.call("rbind", lapply(1:nrow(df), function(i)
data.frame(Date = seq(df$Date[i],
(seq(df$Date[i],length=2,by="months") - 1)[2], by = "1 days"),
value = df$Value[i])))
We basically generate a sequence of dates from the start date to the last day of that month, which is calculated by
(seq(df$Date[i], length = 2, by = "months") - 1)[2]
and repeat the same value for all the dates, putting them in a data frame.
We get a list of data frames, which we then rbind together using do.call.

Another way:
library(lubridate)
d <- read.table(text = "Date Value
2008-01-01 3.5
2008-02-01 9.5
2008-03-01 0.1",
stringsAsFactors = FALSE, header = TRUE)
Dates <- seq(from = min(as.Date(d$Date)),
             to = ceiling_date(max(as.Date(d$Date)), "month") - days(1),
             by = "1 days")
data.frame(Date = Dates,
           Value = setNames(d$Value, d$Date)[format(Dates, format = "%Y-%m-01")])
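The last line works because setNames() builds a lookup vector keyed by each month's first day; a quick illustration:
# indexing the named vector with "YYYY-MM-01" strings recycles each month's
# value across every day of that month
lookup <- setNames(d$Value, d$Date)
lookup["2008-02-01"]
## 2008-02-01
##        9.5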

Related

Is there a function in R that will sum values based on Date of Year?

I have a data table (Precip15) consisting of columns of precipitation, day of year (DOY), and Date_Time in POSIXct format. I need to be able to see the total precipitation (Rain_cm) for every day recorded. Any suggestions?
An example of the data table format looks like this:
DOY Rain Rain_cm Date_Time
179 6 0.6 2019-06-28 15:00:00
179 0 NA 2019-06-28 15:15:00
179 2 0.2 2019-06-28 16:45:00
180 0 NA 2019-06-29 10:00:00
180 10.2 1.2 2019-06-29 10:15:00
180 2 0.2 2019-06-29 13:00:00
I need it to look like this:
DOY Rain_cm
179 0.8
180 1.4
or possibly:
Date Rain_cm
2019-06-28 0.8
2019-06-29 1.4
Thanks in advance for any help!
Here are some base R solutions using the data frame DF defined reproducibly in the Note at the end. Solutions based on dplyr, data.table or zoo packages would be possible as well.
1) aggregate: aggregate on DOY or on Date (defined in the transform statement below), depending on what you want. Note that aggregate automatically removes rows with NAs.
aggregate(Rain_cm ~ DOY, DF, sum)
## DOY Rain_cm
## 1 179 0.8
## 2 180 1.4
DF2 <- transform(DF, Date = as.Date(Date_Time))
aggregate(Rain_cm ~ Date, DF2, sum)
## Date Rain_cm
## 1 2019-06-28 0.8
## 2 2019-06-29 1.4
2) rowsum: another base R solution is rowsum, returning a one-column matrix with the row names being the values of the grouping variable. DF2 is from (1).
with(na.omit(DF), rowsum(Rain_cm, DOY))
## [,1]
## 179 0.8
## 180 1.4
with(na.omit(DF2), rowsum(Rain_cm, Date))
## [,1]
## 2019-06-28 0.8
## 2019-06-29 1.4
3) tapply: another base R approach is tapply. This produces a named numeric vector. DF2 is from (1).
with(DF, tapply(Rain_cm, DOY, sum, na.rm = TRUE))
## 179 180
## 0.8 1.4
with(DF2, tapply(Rain_cm, Date, sum, na.rm = TRUE))
## 2019-06-28 2019-06-29
## 0.8 1.4
4) xtabs: xtabs can be used to form an xtabs table object. DF2 is from (1).
xtabs(Rain_cm ~ DOY, DF)
## DOY
## 179 180
## 0.8 1.4
xtabs(Rain_cm ~ Date, DF2)
## Date
## 2019-06-28 2019-06-29
## 0.8 1.4
Note
The data in reproducible form is assumed to be:
Lines <- "DOY Rain Rain_cm Date_Time
179 6 0.6 2019-06-28 15:00:00
179 0 NA 2019-06-28 15:15:00
179 2 0.2 2019-06-28 16:45:00
180 0 NA 2019-06-29 10:00:00
180 10.2 1.2 2019-06-29 10:15:00
180 2 0.2 2019-06-29 13:00:00"
L <- readLines(textConnection(Lines))
# turn only the first three runs of spaces on each line into commas, so that
# Date_Time (which itself contains a space) stays in a single field
DF <- read.csv(text = sub("(\\S+) +(\\S+) +(\\S+) +", "\\1,\\2,\\3,", L))
A dplyr/lubridate alternative, with the data entered as a tibble:
library(dplyr)
library(tibble)
library(lubridate)
df <- tribble(
  ~DOY, ~Rain, ~Rain_cm, ~Date_Time
  , 179 ,    6 ,  0.6 , "2019-06-28 15:00:00"
  , 179 ,    0 ,   NA , "2019-06-28 15:15:00"
  , 179 ,    2 ,  0.2 , "2019-06-28 16:45:00"
  , 180 ,    0 ,   NA , "2019-06-29 10:00:00"
  , 180 , 10.2 ,  1.2 , "2019-06-29 10:15:00"
  , 180 ,    2 ,  0.2 , "2019-06-29 13:00:00"
)
df %>%
  mutate(Date_Time = ymd_hms(Date_Time)) %>%
  mutate(Date = as.Date(Date_Time)) %>%
  group_by(Date) %>%
  summarise(perDate = sum(Rain_cm, na.rm = TRUE))
Date perDate
<date> <dbl>
1 2019-06-28 0.8
2 2019-06-29 1.4
You can use the aggregate and cut functions to calculate your total daily precip values. The following code will provide you with the desired results:
precipTotals <- aggregate(Rain_cm ~ cut(Date_Time, breaks = "day"), data = df,
                          FUN = sum, na.rm = TRUE)
Make sure your precip columns are numeric (as.numeric()) and your Date_Time is in POSIXct (as.POSIXct()) format, and this will work for you.
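For completeness, since the first answer notes that data.table solutions would be possible as well, here is a minimal sketch (assuming DF2 from (1) above):
library(data.table)
# daily totals: group by Date and sum, ignoring the NA readings
setDT(DF2)[, .(Rain_cm = sum(Rain_cm, na.rm = TRUE)), by = Date]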

Complete data frame with missing date ranges for multiple parameters

I have the following data frame:
Date_from <- c("2013-02-01","2013-05-10","2013-08-13","2013-02-01","2013-05-10","2013-08-13","2013-02-01","2013-05-10","2013-08-13")
Date_to <- c("2013-05-07","2013-08-12","2013-11-18","2013-05-07","2013-08-12","2013-11-18","2013-05-07","2013-08-12","2013-11-18")
y <- data.frame(Date_from,Date_to)
y$concentration <- c("1.5","2.5","1.5","3.5","1.5","2.5","1.5","3.5","3")
y$Parameter<-c("A","A","A","B","B","B","C","C","C")
y$Date_from <- as.Date(y$Date_from)
y$Date_to <- as.Date(y$Date_to)
y$concentration <- as.numeric(y$concentration)
I need to check, for EACH Parameter, whether the date range begins on the first day of the year (2013-01-01) and ends on the last day of the year (2013-12-31). If not, I need to add an extra row at the beginning and at the end for each parameter to complete the date range to a full year. The result should look like this:
Date_from Date_to concentration Parameter
2013-01-01 2013-01-31 NA NA
2013-02-01 2013-05-07 1.5 A
2013-05-10 2013-08-12 2.5 A
2013-08-13 2013-11-18 1.5 A
2013-11-19 2013-12-31 NA NA
2013-01-01 2013-01-31 NA NA
2013-02-01 2013-05-07 3.5 B
2013-05-10 2013-08-12 1.5 B
2013-08-13 2013-11-18 2.5 B
2013-11-19 2013-12-31 NA NA
2013-01-01 2013-01-31 NA NA
2013-02-01 2013-05-07 1.5 C
2013-05-10 2013-08-12 3.5 C
2013-08-13 2013-11-18 3.0 C
2013-11-19 2013-12-31 NA NA
Please note: The date ranges are only equal in this example for simplification.
UPDATE: This is my original data snippet and code:
sm<-read.csv("https://www.dropbox.com/s/tft6inwcrjqujgt/Test_data.csv?dl=1",sep=";",header=TRUE)
cleaned_sm<-sm[,c(4,5,11,14)] ##Delete obsolete columns
colnames(cleaned_sm)<-c("Parameter","Concentration","Date_from","Date_to")
cleaned_sm$Date_from<-as.Date(cleaned_sm$Date_from, format ="%d.%m.%Y")
cleaned_sm$Date_to<-as.Date(cleaned_sm$Date_to, format ="%d.%m.%Y")
# detect comma decimal separator and replace with dot decimal separator, as comma is not recognised as a number
cleaned_sm=lapply(cleaned_sm, function(x) gsub(",", ".", x))
cleaned_sm<-data.frame(cleaned_sm)
cleaned_sm$Concentration <- as.numeric(cleaned_sm$Concentration)
cleaned_sm$Date_from <- as.Date(cleaned_sm$Date_from)
cleaned_sm$Date_to <- as.Date(cleaned_sm$Date_to)
Added code based on @jasbner:
cleaned_sm %>%
  group_by(Parameter) %>%
  do(add_row(.,
             Date_from = ymd(max(Date_to)) + 1,
             Date_to = ymd(paste(year(max(Date_to)), "1231")),
             Parameter = .$Parameter[1])) %>%
  do(add_row(.,
             Date_to = ymd(min(Date_from)) - 1,
             Date_from = ymd(paste(year(min(Date_from)), "0101")),
             Parameter = .$Parameter[1],
             .before = 0)) %>%
  filter(!duplicated(Date_from, fromLast = T), !duplicated(Date_to))
My attempt with dplyr and lubridate. Hacked together, but I think it should work. Note that this does not look for any gaps in the middle of the date ranges. Basically, for each group, you add a row before and after that particular group. Then, in any case where the date range already starts at the beginning of the year or ends at the end of the year, the added rows are filtered out.
library(dplyr)
library(lubridate)
cleaned_sm %>%
  group_by(Parameter) %>%
  do(add_row(.,
             Date_from = ymd(max(.$Date_to)) + 1,
             Date_to = ymd(paste(year(max(.$Date_to)), "1231")),
             Parameter = .$Parameter[1])) %>%
  do(add_row(.,
             Date_to = ymd(min(.$Date_from)) - 1,
             Date_from = ymd(paste(year(min(.$Date_from)), "0101")),
             Parameter = .$Parameter[1],
             .before = 0)) %>%
  filter(!duplicated(Date_from, fromLast = T), !duplicated(Date_to))
# A tibble: 15 x 4
# Groups: Parameter [3]
# Date_from Date_to concentration Parameter
# <date> <date> <dbl> <chr>
# 1 2013-01-01 2013-01-31 NA A
# 2 2013-02-01 2013-05-07 1.50 A
# 3 2013-05-10 2013-08-12 2.50 A
# 4 2013-08-13 2013-11-18 1.50 A
# 5 2013-11-19 2013-12-31 NA A
# 6 2013-01-01 2013-01-31 NA B
# 7 2013-02-01 2013-05-07 3.50 B
# 8 2013-05-10 2013-08-12 1.50 B
# 9 2013-08-13 2013-11-18 2.50 B
# 10 2013-11-19 2013-12-31 NA B
# 11 2013-01-01 2013-01-31 NA C
# 12 2013-02-01 2013-05-07 1.50 C
# 13 2013-05-10 2013-08-12 3.50 C
# 14 2013-08-13 2013-11-18 3.00 C
# 15 2013-11-19 2013-12-31 NA C
This seems like it requires a combination of different packages to attack it. I am using tidyr, data.table, and lubridate.
library(tidyr)
library(dplyr)
library(data.table)
date.start <- seq.Date(as.Date("2013-01-01"), as.Date("2013-12-31"), by = "day")
Date.Int <- data.frame(Date_from = date.start, Date_to = date.start)
y_wide <- y %>% spread(Parameter, concentration)
y_wide <- as.data.table(setkey(as.data.table(y_wide), Date_from, Date_to))
Date.Int <- as.data.table(setkey(as.data.table(Date.Int), Date_from, Date_to))
dats <- foverlaps(Date.Int, y_wide, nomatch = NA)
fin.dat <- dats %>%
  mutate(A = ifelse(is.na(A), -5, A),
         seqs = cumsum(!is.na(A) & A != lag(A, default = -5))) %>%
  group_by(seqs) %>%
  summarise(Date_from = first(i.Date_from),
            Date_to = last(i.Date_to),
            A = first(A),
            B = first(B),
            C = first(C)) %>%
  mutate(A = ifelse(A == -5, NA, A)) %>%
  ungroup() %>%
  gather(Concentration, Parameter, A:C) %>%
  mutate(Concentration = ifelse(is.na(Parameter), NA, Concentration))
Okay, so I created a vector of dates from a start point to an end point (date.start), then turned it into a data.frame in which the start and end dates are the same (Date.Int). This is because foverlaps needs to compare two intervals; the identical start and end dates in Date.Int are now officially intervals.

I then took the data you provided and spread it, turning it from long format to wide format, and converted it into a data.table. Keying a data.table sets up how it should be arranged, and when using foverlaps you have to key the start dates and end dates (in that order). foverlaps determines whether an interval falls within another interval of dates. If you print out dats, you will see a bunch of lines with NA for everything, because they did not fall within an interval.

So now we have to group these in some manner. I picked grouping by values of "A" in dats; the grouping variable is called seqs. Then I summarised the data, switched it back from wide format to long format, and replaced the appropriate NA values.
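As a minimal illustration of what foverlaps() matches, using toy intervals rather than the question's data:
library(data.table)
x <- data.table(start = as.Date("2013-01-05"), end = as.Date("2013-01-05"))
y2 <- data.table(start = as.Date("2013-01-01"), end = as.Date("2013-01-31"), val = 1)
setkey(y2, start, end)
# the one-day interval in x falls inside January, so val = 1 is attached
foverlaps(x, y2)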

Convert Speech Start and End Time into Time Series

I am looking to convert the following R data frame into one that is indexed by seconds, and I have no idea how to do it. Maybe dcast, but then I'm confused about how to expand out the word that's being spoken.
startTime endTime word
1 1.900s 2.300s hey
2 2.300s 2.800s I'm
3 2.800s 3s John
4 3s 3.400s right
5 3.400s 3.500s now
6 3.500s 3.800s I
7 3.800s 4.300s help
The desired output looks like this:
Time word
1.900s hey
2.000s hey
2.100s hey
2.200s hey
2.300s I'm
2.400s I'm
2.500s I'm
2.600s I'm
2.700s I'm
2.800s John
2.900s John
3.000s right
3.100s right
3.200s right
3.300s right
One solution can be achieved using tidyr::expand.
EDITED: based on feedback from the OP, as his data had duplicate startTime values.
library(tidyverse)
step = 0.1
df %>%
  group_by(rnum = row_number()) %>%
  expand(Time = seq(startTime, max(startTime, (endTime - step)), by = step), word = word) %>%
  arrange(Time) %>%
  ungroup() %>%
  select(-rnum)
# # A tibble: 24 x 2
# # Groups: word [7]
# Time word
# <dbl> <chr>
# 1 1.90 hey
# 2 2.00 hey
# 3 2.10 hey
# 4 2.20 hey
# 5 2.30 I'm
# 6 2.40 I'm
# 7 2.50 I'm
# 8 2.60 I'm
# 9 2.70 I'm
# 10 2.80 John
# ... with 14 more rows
Data
df <- read.table(text =
"startTime endTime word
1.900 2.300 hey
2.300 2.800 I'm
2.800 3 John
3 3.400 right
3.400 3.500 now
3.500 3.800 I
3.800 4.300 help",
header = TRUE, stringsAsFactors = FALSE)
dcast() is used for reshaping data from long to wide format (thereby aggregating), while the OP wants to reshape from wide to long format, thereby filling in the missing timestamps.
There is an alternative approach which uses a non-equi join.
Prepare data
However, startTime and endTime need to be turned into numeric variables by removing the trailing "s" before we can proceed.
library(data.table)
cols <- stringr::str_subset(names(DF), "Time$")
setDT(DF)[, (cols) := lapply(.SD, function(x) as.numeric(stringr::str_replace(x, "s", ""))),
.SDcols = cols]
Non-equi join
A sequence of timestamps covering the whole period is created and right joined to the dataset, but only those timestamps which fall within the given interval are retained. From the accepted answer, it seems that endTime must not be included in the result, so the join condition has to be adjusted accordingly.
DF[DF[, CJ(time = seq(min(startTime), max(endTime), 0.1))],
on = .(startTime <= time, endTime > time), nomatch = 0L][
, endTime := NULL][] # a bit of clean-up
startTime word
1: 1.9 hey
2: 2.0 hey
3: 2.1 hey
4: 2.2 hey
5: 2.3 I'm
6: 2.4 I'm
7: 2.5 I'm
8: 2.6 I'm
9: 2.7 I'm
10: 2.8 John
11: 2.9 John
12: 3.0 right
13: 3.1 right
14: 3.2 right
15: 3.3 right
16: 3.4 now
17: 3.5 I
18: 3.6 I
19: 3.7 I
20: 3.8 help
21: 3.9 help
22: 4.0 help
23: 4.1 help
24: 4.2 help
startTime word
Note that this approach does not require introducing row numbers.
nomatch = 0L avoids NA rows in case of gaps in the dialogue.
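As a side note, CJ() here just builds a keyed one-column table of candidate timestamps to join against; a toy example:
library(data.table)
CJ(time = seq(1.9, 2.2, 0.1))
##    time
## 1:  1.9
## 2:  2.0
## 3:  2.1
## 4:  2.2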
Data
library(data.table)
DF <- fread("
rn startTime endTime word
1 1.900s 2.300s hey
2 2.300s 2.800s I'm
3 2.800s 3s John
4 3s 3.400s right
5 3.400s 3.500s now
6 3.500s 3.800s I
7 3.800s 4.300s help
", drop = 1L)

How can I connect a dataset with the average of the values between two dates of another dataset in R?

I want to connect two datasets with each other by adding a new column called Average. This column is the average of the durations between Date and Date - diff. I have two datasets; the first one is called data and looks like this:
Date Weight diff Loc.nr
2013-01-24 1040 7 2
2013-01-31 1000 7 2
2013-01-19 500 4 9
2013-01-23 1040 4 9
2013-01-28 415 5 9
2013-01-31 650 3 9
The other one is called Rain.duration; its Duration column holds the hours of rain on each day. This dataset looks like this:
Date Duration
2013-01-14 4.5
2013-01-15 0.0
2013-01-16 6.9
2013-01-17 0.0
2013-01-18 1.8
2013-01-19 2.1
2013-01-20 0.0
2013-01-21 0.0
2013-01-22 4.3
2013-01-23 0.0
2013-01-24 7.5
2013-01-25 4.7
2013-01-26 0.0
2013-01-27 0.7
2013-01-28 5.0
2013-01-29 0.0
2013-01-30 3.1
2013-01-31 2.8
I wrote some code to do this:
for (i in 1:nrow(data)) {
  for (j in 1:nrow(Rain.duration)) {
    if (data$Date[i] == Rain.duration$Date[j]) {
      average <- as.array(Rain.duration$Duration[(j - (data$diff[i])):j])
      j <- nrow(Rain.duration)
    }
  }
  data$Average[i] <- mean(average)
}
The problem with this code is that, because of the size of my datasets, it takes about 3 days to run. Is there a faster way to do this?
My expected outcome is:
Date Weight diff Loc.nr Average
2013-01-24 1040 7 2 1.96
2013-01-31 1000 7 2 2.98
2013-01-19 500 4 9 2.16
2013-01-23 1040 4 9 1.28
2013-01-28 415 5 9 2.98
2013-01-31 650 3 9 2.73
Here's a dplyr solution:
library(dplyr)
library(purrr)  # for pluck()
# 'Weather' is the rain-duration table (called Rain.duration in the question)
# add row number as a new column just to make it easier to read
weather_with_rows <- Weather %>%
  mutate(Rownum = row_number())
# write a function to filter by row number, then return the average duration
getavgduration <- function(mydate, mydiff) {
  myrow = weather_with_rows %>%
    filter(Date == mydate) %>%
    pluck("Rownum")
  mystartrow = myrow - mydiff
  myduration = weather_with_rows %>%
    filter(
      Rownum <= myrow,
      Rownum >= mystartrow
    )
  mean(myduration$Duration)
}
# get the average duration for each Date/diff pair
averages <- data %>%
  group_by(Date, Diff) %>%
  summarize(Average = getavgduration(Date, Diff)) %>%
  ungroup()
# join this back into the original data frame
# this step might not be necessary
# and might be a big drag on performance,
# depending on the size of your real data
data_with_avg_duration <- data %>%
  left_join(averages, by = c("Date", "Diff"))
This old question does not have an accepted answer yet, so I feel obliged to post an alternative solution which aggregates in a non-equi join.
The OP has requested to compute the average duration of rain, from a table Rain.duration of daily hours of rainfall, for each date interval given in data.
library(data.table)
# make sure Date columns are of class Date
setDT(data)[, Date := as.Date(Date)]
setDT(Rain.duration)[, Date := as.Date(Date)]
# aggregate in a non-equi join and assign the result to a new column
data[, Average := Rain.duration[data[, .(upper = Date, lower = Date - diff)],
                                on = .(Date <= upper, Date >= lower),
                                mean(Duration), by = .EACHI]$V1][]
Date Weight diff Loc.nr Average
1: 2013-01-24 1040 7 2 1.962500
2: 2013-01-31 1000 7 2 2.975000
3: 2013-01-19 500 4 9 2.160000
4: 2013-01-23 1040 4 9 1.280000
5: 2013-01-28 415 5 9 2.983333
6: 2013-01-31 650 3 9 2.725000
The key part is
Rain.duration[data[, .(upper = Date, lower = Date - diff)],
              on = .(Date <= upper, Date >= lower),
              mean(Duration), by = .EACHI]
Date Date V1
1: 2013-01-24 2013-01-17 1.962500
2: 2013-01-31 2013-01-24 2.975000
3: 2013-01-19 2013-01-15 2.160000
4: 2013-01-23 2013-01-19 1.280000
5: 2013-01-28 2013-01-23 2.983333
6: 2013-01-31 2013-01-28 2.725000
which does a non-equi join with the date ranges derived from data:
data[, .(upper = Date, lower = Date - diff)]
upper lower
1: 2013-01-24 2013-01-17
2: 2013-01-31 2013-01-24
3: 2013-01-19 2013-01-15
4: 2013-01-23 2013-01-19
5: 2013-01-28 2013-01-23
6: 2013-01-31 2013-01-28
by = .EACHI requests that the aggregate mean(Duration) be computed for each date interval on the fly, which avoids creating and copying temporary subsets.
Note that this solution will give correct answers even if Rain.duration has gaps or is unordered, as it relies only on Date, as opposed to the other solutions, which use row numbers.
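As a toy illustration of by = .EACHI, with made-up values unrelated to the question's data:
library(data.table)
dt <- data.table(g = c("a", "a", "b"), v = 1:3)
qry <- data.table(g = c("a", "b"))
# mean(v) is evaluated once per row of qry, i.e. on each matching subset
dt[qry, on = .(g), mean(v), by = .EACHI]
##    g  V1
## 1: a 1.5
## 2: b 3.0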

Data.Table: Aggregate by every two weeks

So let's take the following data.table. It has dates and a column of numbers. I'd like to get the week of each date and then aggregate (sum) over each two-week period.
library(data.table)
Date <- as.Date(c("1980-01-01", "1980-01-02", "1981-01-05", "1981-01-05", "1982-01-08", "1982-01-15",
                  "1980-01-16", "1980-01-17", "1981-01-18", "1981-01-22", "1982-01-24", "1982-01-26"))
Runoff <- c(2, 1, 0.1, 3, 2, 5, 1.5, 0.5, 0.3, 2, 1.5, 4)
DT <- data.table(Date, Runoff)
DT
So from the date, I can easily get the year and week.
DT[,c("Date_YrWeek") := paste(substr(Date,1,4), week(Date), sep="-")][]
What I'm struggling with is aggregating every two weeks.
I thought that I'd get the first date for each week and filter using those values. Unfortunately, that would be pretty foolish.
DT[,.(min(Date)),by=.(Date_YrWeek)][order(Date)]
The final result would end up being the sum of every two weeks.
weeks sum_value
1 and 2 ...
3 and 4 ...
5 and 6 ...
Anyone have an efficient way to do this with data.table?
1) Define the two-week periods as starting from the minimum Date. Then we can get the total Runoff for each such period like this:
DT[, .(sum_value = sum(Runoff)),
   keyby = .(Date = 14 * (as.numeric(Date - min(Date)) %/% 14) + min(Date))]
giving the following, where the Date column is the date of the first day of the two-week period:
Date sum_value
1: 1980-01-01 3.0
2: 1980-01-15 2.0
3: 1980-12-30 3.1
4: 1981-01-13 2.3
5: 1981-12-29 2.0
6: 1982-01-12 6.5
7: 1982-01-26 4.0
2) If you prefer the text shown in the question for the first column then:
DT[, .(sum_value = sum(Runoff)),
   keyby = .(two_week = as.numeric(Date - min(Date)) %/% 14)][
   , .(weeks = paste(2 * two_week + 1, "and", 2 * two_week + 2), sum_value)]
giving:
weeks sum_value
1: 1 and 2 3.0
2: 3 and 4 2.0
3: 53 and 54 3.1
4: 55 and 56 2.3
5: 105 and 106 2.0
6: 107 and 108 6.5
7: 109 and 110 4.0
Update: Revised and added (2).
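To see what the bucketing expression in (1) does, take a toy date 20 days after the start:
d0 <- as.Date("1980-01-01")
d <- d0 + 20
# integer division by 14 gives the fortnight index (20 %/% 14 = 1); scaling
# back up and adding the origin gives the first day of that fortnight
14 * (as.numeric(d - d0) %/% 14) + d0
## [1] "1980-01-15"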
With tidyverse and lubridate:
library(tidyverse)
library(lubridate)
summary <- DT %>%
  mutate(TwoWeeks = round_date(Date, "2 weeks")) %>%
  group_by(TwoWeeks) %>%
  summarise(sum_value = sum(Runoff))
summary
# A tibble: 9 × 2
TwoWeeks sum_value
<date> <dbl>
1 1979-12-30 3.0
2 1980-01-13 1.5
3 1980-01-20 0.5
4 1981-01-04 3.1
5 1981-01-18 0.3
6 1981-01-25 2.0
7 1982-01-10 2.0
8 1982-01-17 5.0
9 1982-01-24 5.5
Lubridate's round_date() will aggregate dates within ranges you can specify through size and unit, in this case "2 weeks": each date is rounded to the nearest two-week boundary, and that boundary date is what appears in the TwoWeeks column.
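For example, taking the first date in DT (consistent with the first row of the summary above):
library(lubridate)
# 1980-01-01 is nearer to the boundary 1979-12-30 than to the following one
round_date(as.Date("1980-01-01"), "2 weeks")
## [1] "1979-12-30"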
