Split-apply aggregation of time series data in R

I have some weather forecast data, which records the forecast amount of rainfall for every hour. I would like to compare this to observation data, which has the observed amount of rainfall for every 6 hours. So, I need to aggregate the forecast data to 6-hourly data.
Here is an overview of my data:
DateUtc StationID FcstDay PrecipQuantity_hSum
1 2014-01-01 12:00:00 54745 0 0
2 2014-01-01 13:00:00 54745 0 0
3 2014-01-01 14:00:00 54745 0 0
4 2014-01-01 15:00:00 54745 0 0
5 2014-01-01 16:00:00 54745 0 0
6 2014-01-01 17:00:00 54745 0 0
7 2014-01-01 18:00:00 54745 0 0
8 2014-01-01 19:00:00 54745 0 0
9 2014-01-01 20:00:00 54745 0 0
10 2014-01-01 21:00:00 54745 0 0
11 2014-01-01 22:00:00 54745 0 0
12 2014-01-01 23:00:00 54745 0 0
13 2014-01-02 00:00:00 54745 1 0
14 2014-01-02 01:00:00 54745 1 0
15 2014-01-02 02:00:00 54745 1 0
16 2014-01-02 03:00:00 54745 1 0
17 2014-01-02 04:00:00 54745 1 0
18 2014-01-02 05:00:00 54745 1 0
19 2014-01-02 06:00:00 54745 1 0
20 2014-01-02 07:00:00 54745 1 0
... <NA> <NA> ... ...
13802582 2014-11-20 08:00:00 55005 7 0
13802583 2014-11-20 09:00:00 55005 7 0
13802584 2014-11-20 10:00:00 55005 7 0
13802585 2014-11-20 11:00:00 55005 7 0
13802586 2014-11-20 12:00:00 55005 7 0
To aggregate correctly, it is important to split by StationID (the weather station) and FcstDay (number of days between date of calculating prediction and the date being forecast) before aggregating.
I have used the xts package to do the aggregating and it works as expected if I manually subset the data first e.g.
z <- fcst[which(fcst$StationID=="54745" & fcst$FcstDay==1),]
z.xts <- xts(z$PrecipQuantity_hSum, z$DateUtc)
ends <- endpoints(z.xts, "hours", 6)
precip6 <- as.data.frame(period.apply(z.xts, ends, sum))
I need to automate the subsetting, but I have tried to wrap the xts functions in various split-apply functions and always get the same error:
Error in xts(z$PrecipQuantity_hSum, z$DateUtc) :
NROW(x) must match length(order.by)
This is my latest version of my code:
df <- data.frame()
d_ply(
.data = fcst,
.variables = c("FcstDay", "StationID"),
.fun = function(z){
z.xts <- xts(z$PrecipQuantity_hSum, z$DateUtc)
ends <- endpoints(z.xts, "hours", 6)
precip6 <- as.data.frame(period.apply(z.xts, ends, sum))
precip6$DateUtc <- rownames(precip6)
rownames(precip6) <- NULL
df <- rbind.fill(df, precip6)
})
I've also tried nested for loops. Can anybody give any guidance on what's wrong? I've included the code for a reproducible example set below. Thanks in advance.
DateUtc <- rep(seq(from=ISOdatetime(2014,1,1,0,0,0), to=ISOdatetime(2014,12,30,0,0,0), by=(60*60)), times=9)
StationID <- rep(c("50060","50061","50062"), each=3*8713)
FcstDay <- rep(c(1,2,3), each=8713, times=3)
PrecipQuantity_hSum <- rgamma(78417, shape=1, rate=20)
fcst <- data.frame(DateUtc, StationID, FcstDay, PrecipQuantity_hSum)

I think the error David Robinson is getting is because your example code uses PrecipQuantity_6hSum and not PrecipQuantity_hSum. Once this is changed your ddply code is working for me.
Does this work for you?
library(plyr)
library(xts)
df <- ddply(
  .data = fcst,
  .variables = c("FcstDay", "StationID"),
  .fun = function(z){
    # build an hourly xts series for this FcstDay/StationID group
    z.xts <- xts(z$PrecipQuantity_hSum, z$DateUtc)
    ends <- endpoints(z.xts, "hours", 6)
    precip6 <- as.data.frame(period.apply(z.xts, ends, sum))
    precip6$DateUtc <- rownames(precip6)
    rownames(precip6) <- NULL
    return(precip6)
  })
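For comparison, here is a minimal base R sketch of the same split-then-sum idea without xts or plyr. It is only an illustration under assumptions: the names fcst_small, bin and bin_start are made up for this example, the timestamps are taken to be UTC, and each 6-hour bin is labelled by its start, whereas period.apply labels each period by its last observation.

```r
# Tiny stand-in for fcst (hypothetical data): 2 stations, 12 hourly values of 1 each
hours <- seq(as.POSIXct("2014-01-01 00:00:00", tz = "UTC"),
             by = "hour", length.out = 12)
fcst_small <- data.frame(
  DateUtc = rep(hours, times = 2),
  StationID = rep(c("50060", "50061"), each = 12),
  FcstDay = 1,
  PrecipQuantity_hSum = 1
)

# Start of the 6-hour bin each row falls into, in seconds since the epoch
six_hours <- 6 * 3600
fcst_small$bin <- floor(as.numeric(fcst_small$DateUtc) / six_hours) * six_hours

# Sum within every StationID / FcstDay / bin combination
precip6 <- aggregate(PrecipQuantity_hSum ~ StationID + FcstDay + bin,
                     data = fcst_small, FUN = sum)

# Convert the numeric bin back to a timestamp for readability
precip6$bin_start <- as.POSIXct(precip6$bin, origin = "1970-01-01", tz = "UTC")
precip6
```

Because the grouping columns are part of the formula, no manual subsetting by StationID and FcstDay is needed.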

Related

How to calculate number of hours from a fixed start point that varies among levels of a variable

The dataframe df1 summarizes detections of different individuals (ID) through time (Datetime). As a short example:
library(lubridate)
df1<- data.frame(ID= c(1,2,1,2,1,2,1,2,1,2),
Datetime= ymd_hms(c("2016-08-21 00:00:00","2016-08-24 08:00:00","2016-08-23 12:00:00","2016-08-29 03:00:00","2016-08-27 23:00:00","2016-09-02 02:00:00","2016-09-01 12:00:00","2016-09-09 04:00:00","2016-09-01 12:00:00","2016-09-10 12:00:00")))
> df1
ID Datetime
1 1 2016-08-21 00:00:00
2 2 2016-08-24 08:00:00
3 1 2016-08-23 12:00:00
4 2 2016-08-29 03:00:00
5 1 2016-08-27 23:00:00
6 2 2016-09-02 02:00:00
7 1 2016-09-01 12:00:00
8 2 2016-09-09 04:00:00
9 1 2016-09-01 12:00:00
10 2 2016-09-10 12:00:00
I want to calculate for each row, the number of hours (Hours_since_begining) since the first time that the individual was detected.
I would expect something like that (It can contain some mistakes since I did the calculations by hand):
> df1
ID Datetime Hours_since_begining
1 1 2016-08-21 00:00:00 0
2 2 2016-08-24 08:00:00 0
3 1 2016-08-23 12:00:00 60 # Number of hours between "2016-08-21 00:00:00" (first time detected the Ind 1) and "2016-08-23 12:00:00"
4 2 2016-08-29 03:00:00 115
5 1 2016-08-27 23:00:00 167 # Number of hours between "2016-08-21 00:00:00" (first time detected the Ind 1) and "2016-08-27 23:00:00"
6 2 2016-09-02 02:00:00 210
7 1 2016-09-01 12:00:00 276
8 2 2016-09-09 04:00:00 380
9 1 2016-09-01 12:00:00 276
10 2 2016-09-10 12:00:00 412
Does anyone know how to do it?
Thanks in advance!
You can do this:
library(tidyverse)
# first get min datetime by ID
min_datetime_id <- df1 %>% group_by(ID) %>% summarise(min_datetime = min(Datetime))
# join with df1 on ID and compute the time difference in hours
df1 <- df1 %>% left_join(min_datetime_id, by = "ID") %>% mutate(Hours_since_beginning = as.numeric(difftime(Datetime, min_datetime, units = "hours")))
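If you would rather avoid the join, the same per-ID minimum can be computed in place. This is a small base R sketch, assuming the timestamps are in UTC (which is what ymd_hms produces by default); first_seen is a name invented for the example.

```r
df1 <- data.frame(
  ID = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2),
  Datetime = as.POSIXct(c(
    "2016-08-21 00:00:00", "2016-08-24 08:00:00", "2016-08-23 12:00:00",
    "2016-08-29 03:00:00", "2016-08-27 23:00:00", "2016-09-02 02:00:00",
    "2016-09-01 12:00:00", "2016-09-09 04:00:00", "2016-09-01 12:00:00",
    "2016-09-10 12:00:00"), tz = "UTC")
)

# ave() returns, for every row, the minimum timestamp (in seconds) of that row's ID
first_seen <- ave(as.numeric(df1$Datetime), df1$ID, FUN = min)

# Difference in seconds converted to hours
df1$Hours_since_beginning <- (as.numeric(df1$Datetime) - first_seen) / 3600
df1
```

On this sample the result matches the hand-calculated column in the question (0, 0, 60, 115, 167, 210, 276, 380, 276, 412).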

R: data.table aggregate using external grouping vector

I have data
dt <- data.table(time=as.POSIXct(c("2018-01-01 01:01:00","2018-01-01 01:05:00","2018-01-01 01:01:00")), y=c(1,10,9))
> dt
time y
1: 2018-01-01 01:01:00 1
2: 2018-01-01 01:05:00 10
3: 2018-01-01 01:01:00 9
and I would like to aggregate by time. Usually, I would do
dt[,list(sum=sum(y),count=.N), by="time"]
time sum count
1: 2018-01-01 01:01:00 10 2
2: 2018-01-01 01:05:00 10 1
but this time, I would also like to get zero values for the minutes in between, i.e.,
time sum count
1: 2018-01-01 01:01:00 10 2
2: 2018-01-01 01:02:00 0 0
3: 2018-01-01 01:03:00 0 0
4: 2018-01-01 01:04:00 0 0
5: 2018-01-01 01:05:00 10 1
Could this be done, for example, using an external vector
times <- seq(from=min(dt$time),to=max(dt$time),by="mins")
that can be fed to the data.table function as a grouping variable?
You would typically do this with a join (either before or after the aggregation). For example:
dt <- dt[J(times), on = "time"]
dt[,list(sum=sum(y, na.rm = TRUE), count= sum(!is.na(y))), by=time]
# time sum count
#1: 2018-01-01 01:01:00 10 2
#2: 2018-01-01 01:02:00 0 0
#3: 2018-01-01 01:03:00 0 0
#4: 2018-01-01 01:04:00 0 0
#5: 2018-01-01 01:05:00 10 1
Or in a "piped" version:
dt[J(times), on = "time"][
, .(sum = sum(y, na.rm = TRUE), count= sum(!is.na(y))),
by = time]
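To make the idea behind the join explicit, here is a rough base R sketch that places each observation on the full minute grid and fills the empty slots with zeros. It assumes the timestamps are in UTC and fall exactly on whole minutes; obs, slot and filled are illustrative names, not part of the question.

```r
obs <- data.frame(
  time = as.POSIXct(c("2018-01-01 01:01:00", "2018-01-01 01:05:00",
                      "2018-01-01 01:01:00"), tz = "UTC"),
  y = c(1, 10, 9)
)

# Full grid of minutes between the first and last observation
times <- seq(from = min(obs$time), to = max(obs$time), by = "min")

# Position of every observation on the grid
slot <- match(as.numeric(obs$time), as.numeric(times))

# Count per grid slot; slots without observations get 0
count <- tabulate(slot, nbins = length(times))

# Sum per grid slot, starting from all zeros
total <- numeric(length(times))
per_slot <- tapply(obs$y, slot, sum)
total[as.integer(names(per_slot))] <- per_slot

filled <- data.frame(time = times, sum = total, count = count)
filled
```

The data.table join above does the same thing in one step and will scale better on large tables.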

Summations by conditions on another row dealing with time

I am looking to run a cumulative sum at every row for values that occur in two columns before and after that point. In this case I have the volume of 2 incident types at every given minute over two days. I want to create a column which adds up all the incidents that occurred before and after each row, by type. SUMIF from Excel comes to mind, but I'm not sure how to port that over to R:
EDIT: ADDED set.seed and easier numbers
I have the following data set:
library(data.table)
set.seed(42)
master_min =
setDT(
data.frame(master_min = seq(
from=as.POSIXct("2016-1-1 0:00", tz="America/New_York"),
to=as.POSIXct("2016-1-2 23:00", tz="America/New_York"),
by="min"
))
)
incident1= round(runif(2821, min=0, max=10))
incident2= round(runif(2821, min=0, max=10))
master_min = head(cbind(master_min, incident1, incident2), 5)
How do I essentially compute the following logic:
for each row, sum all the incident1 values that occurred before that row's timestamp and all the incident2 values that occurred after that row's timestamp? A data.table solution would be ideal, or failing that dplyr, as I am working with a large dataset. Below is a before and after for the data:
BEFORE:
master_min incident1 incident2
1: 2016-01-01 00:00:00 9 6
2: 2016-01-01 00:01:00 9 5
3: 2016-01-01 00:02:00 3 5
4: 2016-01-01 00:03:00 8 6
5: 2016-01-01 00:04:00 6 9
AFTER THE CALCULATION:
master_min incident1 incident2 new_column
1: 2016-01-01 00:00:00 9 6 25
2: 2016-01-01 00:01:00 9 5 29
3: 2016-01-01 00:02:00 3 5 33
4: 2016-01-01 00:03:00 8 6 30
5: 2016-01-01 00:04:00 6 9 29
If I understand correctly:
# Cumsum of incident1, without current row:
master_min$sum1 <- cumsum(master_min$incident1) - master_min$incident1
# Reverse cumsum of incident2, without current row:
master_min$sum2 <- rev(cumsum(rev(master_min$incident2))) - master_min$incident2
# Your new column:
master_min$new_column <- master_min$sum1 + master_min$sum2
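*update
The following two lines can also do the job. Note that sum1 here includes the current row's incident1; subtract master_min$incident1, as in the first version, if the current row should be excluded.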
master_min$sum1 <- cumsum(master_min$incident1)
master_min$sum2 <- sum(master_min$incident2) - cumsum(master_min$incident2)
I rewrote the question's example a bit to show a more comprehensive structure
library(data.table)
master_min <-
setDT(
data.frame(master_min = seq(
from=as.POSIXct("2016-1-1 0:00", tz="America/New_York"),
to=as.POSIXct("2016-1-1 0:09", tz="America/New_York"),
by="min"
))
)
set.seed(2)
incident1= as.integer(runif(10, min=0, max=10))
incident2= as.integer(runif(10, min=0, max=10))
master_min = cbind(master_min, incident1, incident2)
Now master_min looks like this
> master_min
master_min incident1 incident2
1: 2016-01-01 00:00:00 1 5
2: 2016-01-01 00:01:00 7 2
3: 2016-01-01 00:02:00 5 7
4: 2016-01-01 00:03:00 1 1
5: 2016-01-01 00:04:00 9 4
6: 2016-01-01 00:05:00 9 8
7: 2016-01-01 00:06:00 1 9
8: 2016-01-01 00:07:00 8 2
9: 2016-01-01 00:08:00 4 4
10: 2016-01-01 00:09:00 5 0
Apply transformations
master_min$sum1 <- cumsum(master_min$incident1)
master_min$sum2 <- sum(master_min$incident2) - cumsum(master_min$incident2)
Results
> master_min
master_min incident1 incident2 sum1 sum2
1: 2016-01-01 00:00:00 1 5 1 37
2: 2016-01-01 00:01:00 7 2 8 35
3: 2016-01-01 00:02:00 5 7 13 28
4: 2016-01-01 00:03:00 1 1 14 27
5: 2016-01-01 00:04:00 9 4 23 23
6: 2016-01-01 00:05:00 9 8 32 15
7: 2016-01-01 00:06:00 1 9 33 6
8: 2016-01-01 00:07:00 8 2 41 4
9: 2016-01-01 00:08:00 4 4 45 0
10: 2016-01-01 00:09:00 5 0 50 0
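Since the question asked for a data.table solution, here is a hedged sketch that adds the column by reference with := and checks it against the five-row BEFORE/AFTER table from the question. The table name check is made up for this example, and the current row is excluded from both sums, as in the expected output.

```r
library(data.table)

# The five rows from the question's BEFORE table (timestamps omitted for brevity)
check <- data.table(
  incident1 = c(9, 9, 3, 8, 6),
  incident2 = c(6, 5, 5, 6, 9)
)

# incident1 strictly before the row + incident2 strictly after the row
check[, new_column := (cumsum(incident1) - incident1) +
                      (sum(incident2) - cumsum(incident2))]
check
```

This assumes the rows are already sorted by timestamp; if not, order the table by master_min first (for example with setorder).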

Filling higher resolution zoo obj with data from lower resolution zoo obj

I have one zoo object with hourly observations, and one with daily observations.
My goal is to merge the two series by the index into one object, where I match daily values with all hourly values of the same date.
To be specific, the first object zX contains hourly observations with no missing values. The second object zY contains a list of certain special dates. These should be added to zX as a dummy on every observation on that day.
library(zoo)
# 3 days of data with hourly resolution
x <- runif(24*3)
indexHour <- as.POSIXct(as.Date("2015-01-01") + seq(0, (24*3-1)/24, 1/24))
zX <- zoo(x, indexHour)
# Only 2 days of data with daily resolution - one date is missing
y <- c(0, 2)
indexDay <- as.POSIXct(c(as.Date("2015-01-01"), as.Date("2015-01-3")))
zY <- zoo(y, indexDay)
Expected output
2015-01-01 00:00:00 0.78671677 0
2015-01-01 01:00:00 0.40625297 0
...
2015-01-01 23:00:00 0.75371677 0
2015-01-02 00:00:00 0.34571677 NA
2015-01-02 01:00:00 0.40625297 NA
...
2015-01-02 23:00:00 0.12671677 NA
2015-01-03 00:00:00 0.54671677 2
2015-01-03 01:00:00 0.40625297 2
...
2015-01-03 23:00:00 0.23671677 2
Try this:
z <- cbind(zX, zY = coredata(zY)[match(as.Date(time(zX)), as.Date(time(zY)))])
giving:
> head(z, 30)
zX zY
2014-12-31 19:00:00 0.20050507 0
2014-12-31 20:00:00 0.98745944 0
2014-12-31 21:00:00 0.02685118 0
2014-12-31 22:00:00 0.82922065 0
2014-12-31 23:00:00 0.77466073 0
2015-01-01 00:00:00 0.87494486 0
2015-01-01 01:00:00 0.39466493 0
2015-01-01 02:00:00 0.49233047 0
2015-01-01 03:00:00 0.19231866 0
2015-01-01 04:00:00 0.91684281 0
2015-01-01 05:00:00 0.48264758 0
2015-01-01 06:00:00 0.08900482 0
2015-01-01 07:00:00 0.48236308 0
2015-01-01 08:00:00 0.30624266 0
2015-01-01 09:00:00 0.48860905 0
2015-01-01 10:00:00 0.18761759 0
2015-01-01 11:00:00 0.37730202 0
2015-01-01 12:00:00 0.51766405 0
2015-01-01 13:00:00 0.30146257 0
2015-01-01 14:00:00 0.66511275 0
2015-01-01 15:00:00 0.66457355 0
2015-01-01 16:00:00 0.92248105 0
2015-01-01 17:00:00 0.17868851 0
2015-01-01 18:00:00 0.71363131 0
2015-01-01 19:00:00 0.82189523 NA
2015-01-01 20:00:00 0.73392131 NA
2015-01-01 21:00:00 0.95409518 NA
2015-01-01 22:00:00 0.49774272 NA
2015-01-01 23:00:00 0.27700155 NA
2015-01-02 00:00:00 0.85833340 NA
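The printed index above starts at 2014-12-31 19:00:00, presumably because as.POSIXct(as.Date(...)) gives midnight UTC and the console shows it in the local time zone. The sketch below shows the same match-by-date idea in base R with the time zone pinned to UTC, so the dates line up with what is printed. The names hourly, daily_time, daily_value and dummy are only for illustration.

```r
# 3 days of hourly timestamps, explicitly in UTC
hourly <- seq(as.POSIXct("2015-01-01 00:00:00", tz = "UTC"),
              by = "hour", length.out = 24 * 3)

# Daily values for 2 of the 3 days (2015-01-02 is missing)
daily_time <- as.POSIXct(c("2015-01-01", "2015-01-03"), tz = "UTC")
daily_value <- c(0, 2)

# Look up each hourly timestamp's calendar date among the daily dates;
# hours on a date with no daily value become NA
dummy <- daily_value[match(as.Date(hourly, tz = "UTC"),
                           as.Date(daily_time, tz = "UTC"))]

head(data.frame(hourly, dummy), 3)
```

With zoo objects the same lookup vector can be bound to zX with cbind, as in the answer above.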
Inspired by the join statements in How to join (merge) data frames (inner, outer, left, right)?, the following code produces the desired output:
x <- cbind(x = coredata(zX), date = format(as.Date(index(zX))))
y <- cbind(y = coredata(zY), date = format(as.Date(index(zY))))
z <- zoo(merge(x, y, by = 'date', all.x=TRUE), index(zX))
z <- z[,!colnames(z) %in% c('date')]
View(z)

How to create multiple rules for "lapply" function

Currently, I have an array of list(groupA) like the following example:
$AAAAAA
time timegap
1 06:00:00 0
2 07:00:00 60
3 08:00:00 40
4 09:00:00 0
5 10:00:00 30
$BBBBBBB
time timegap
1 06:00:00 0
2 07:00:00 60
3 08:00:00 40
4 09:00:00 0
5 10:00:00 30
I am trying to create a function that generates a dummy variable if timegap is greater than a certain number. The challenge is that the threshold is different when the time is in the range from 07:00:00 to 09:00:00.
What I did was the following:
dummytime<-function(x){
if(x$time>times("07:00:00") & x$time<times("09:00:00")){
d<-c(1200)
}
else{
d<-c(600)
}
dummytime<- as.numeric(x$timegap>=d)
as.data.frame(dummytime)
}
dumtime<-lapply(groupm2,dummytime)
However, I got an error like this:
Error in if (as.logical(x$time > times("07:00:00") & x$time < times("09:00:00")))
{ : missing value where TRUE/FALSE needed
Any suggestion? Thanks for help in advance.
Here is one way. Since you used the chron package to convert character to time, I have done the same. Then I created a list and applied lapply to it.
library(chron)
# time to Class 'times'
df1$time <- chron(times = df1$time)
df2$time <- chron(times = df2$time)
# Create a list
ana <- list(df1 = df1, df2 = df2)
#$df1
# time timegap
#1 06:00:00 0
#2 07:00:00 60
#3 08:00:00 40
#4 09:00:00 0
#5 10:00:00 30
lapply(ana, function(x){
x$test <- ifelse(x$time >= "07:00:00" & x$time <= "09:00:00",
1200, 600)
x
})
#$df1
# time timegap test
#1 06:00:00 0 600
#2 07:00:00 60 1200
#3 08:00:00 40 1200
#4 09:00:00 0 1200
#5 10:00:00 30 600
#$df2
# time timegap test
#1 06:00:00 0 600
#2 07:00:00 60 1200
#3 08:00:00 40 1200
#4 09:00:00 0 1200
#5 10:00:00 30 600
Or
lapply(ana, transform,
test = ifelse(time >= "07:00:00" & time <= "09:00:00", 1200, 600))
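To go one step further and produce the dummy variable itself, a sketch along these lines might help. It is only an illustration: it assumes the times are stored as zero-padded "HH:MM:SS" character strings (so plain string comparison orders them correctly) rather than chron objects. The function name add_dummy and the thresholds 50 and 30 are made up so that the small sample shows both outcomes; the 1200 and 600 from the question can be passed instead.

```r
df1 <- data.frame(
  time = c("06:00:00", "07:00:00", "08:00:00", "09:00:00", "10:00:00"),
  timegap = c(0, 60, 40, 0, 30),
  stringsAsFactors = FALSE
)
groups <- list(AAAAAA = df1, BBBBBBB = df1)

# Hypothetical helper: row-wise threshold, then a 0/1 dummy
add_dummy <- function(x, lo = "07:00:00", hi = "09:00:00",
                      inside = 50, outside = 30) {
  # ifelse() is vectorised, unlike if(), so every row gets its own threshold
  threshold <- ifelse(x$time >= lo & x$time <= hi, inside, outside)
  x$dummy <- as.integer(x$timegap >= threshold)
  x
}

result <- lapply(groups, add_dummy)
result$AAAAAA
```

The original error comes from if(), which expects a single TRUE or FALSE; a whole column of comparisons (or an NA among them) does not fit, which is why the vectorised ifelse() is used here.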
