period.apply function with large endpoints - r

I have a time series and want to use the period.apply() function from the xts library to compute the mean over each 377-day period.
A reproducible example follows:
library(xts)
zoo.data <- zoo(rnorm(4231)+10,as.Date(13514:17744,origin="1970-01-01"))
ep <- endpoints(zoo.data,'days', k =377)
period.apply(zoo.data, INDEX=ep, FUN=function(x) mean(x))
The output generated is
2007-05-28 2007-12-31 2008-10-05 2008-12-31 2009-02-02 2009-12-31
9.905663 9.800760 10.006344 10.052163 10.152453 10.032073
2010-06-13 2010-12-31 2011-10-22 2011-12-31 2012-02-18 2012-12-31
9.879439 10.038644 9.957582 9.977026 9.959094 10.004348
2013-06-29 2013-12-31 2014-11-07 2014-12-31 2015-03-06 2015-12-31
10.004620 10.086071 9.902875 9.843695 9.851306 10.072610
2016-07-14 2016-12-31 2017-11-23 2017-12-31 2018-03-22 2018-08-01
9.966911 10.199251 10.001628 10.263590 10.181235 10.059080
The output is unexpected: consecutive dates are not 377 days apart. It appears that each period stops at the year end (20xx-12-31) before moving on to the next endpoint.

I am not sure you can solve this using the endpoints function directly.
Here is one way to solve it using built-in functions. It is a slightly more
general solution.
In the code below, you can uncomment the commented lines to print the number of observations in the last interval when it is shorter than k.
library(xts)
apply.fun <- function(data, variable=1, fun=mean, k=377) { # variable: variable name or column index
  data <- as.xts(data)
  variable <- data[, variable, drop=TRUE]
  idx <- index(data)
  byindex <- as.integer(idx - first(idx)) %/% k # interval identifiers
  endates <- idx[!duplicated(byindex, fromLast=TRUE)] # last date of each interval
  ans <- setNames(tapply(variable, byindex, fun), endates)
  #inter.end <- sum(byindex==last(byindex))
  #if(inter.end < k) cat(sprintf("Last interval has fewer observations: %d<k=%d\n\n", inter.end, k))
  return(as.xts(as.matrix(ans)))
}
set.seed(147)
zoo.data <- zoo(rnorm(4231)+10,as.Date(13514:17744,origin="1970-01-01"))
apply.fun(zoo.data, 1, mean)
# [,1]
# 2008-01-12 10.043735
# 2009-01-23 10.042741
# 2010-02-04 9.957842
# 2011-02-16 10.016998
# 2012-02-28 9.932871
# 2013-03-11 9.932731
# 2014-03-23 10.045344
# 2015-04-04 10.015821
# 2016-04-15 10.015023
# 2017-04-27 10.038887
# 2018-05-09 9.978744
# 2018-08-01 10.004074
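The core of the function above is integer division of the day offset by k, which does not depend on xts. A minimal base-R sketch of just that step, assuming daily data without gaps (the names dates, vals, window_id and result are made up for this illustration):

```r
# Illustrative base-R sketch; `dates`, `vals`, `window_id` and `result` are hypothetical names.
set.seed(147)
dates <- as.Date(13514:17744, origin = "1970-01-01")
vals  <- rnorm(length(dates)) + 10
k     <- 377

# Days elapsed since the first observation, integer-divided by k:
# 0 for the first 377 days, 1 for the next 377, and so on.
window_id <- as.integer(dates - dates[1]) %/% k

# Mean per window, labelled with the last date that falls in each window.
window_means <- tapply(vals, window_id, mean)
window_ends  <- dates[!duplicated(window_id, fromLast = TRUE)]
result <- data.frame(end = window_ends, mean = as.numeric(window_means))
print(result)
```

Every window except the last spans exactly 377 days; the last one simply ends where the data ends.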

Related

Using xts with timespans crossing calendar dates: How to use period.apply (xts) or POSIXct datetime arguments in these cases in R?

I have a problem applying a function (min) to a specific repeating time period. Basically, my data looks like this sample:
library(xts)
start <- as.POSIXct("2018-05-18 00:00")
tseq <- seq(from = start, length.out = 1440, by = "10 mins")
set.seed(10) # seed the generator before sampling
Measurings <- data.frame(
  Time = tseq,
  Temp = sample(10:37, 1440, replace = TRUE)
)
Measurings_xts <- xts(Measurings[,-1], Measurings$Time)
With much-appreciated help (here), I found out that the min and max functions (unlike mean, which works right away in period.apply) must be wrapped in a helper function. They can then be calculated for calendar-based periods (hours, days, years...) using this solution:
colMin <- function(x, na.rm = FALSE) {
apply(x, 2, min, na.rm = na.rm)
}
epHours <- endpoints(Measurings_xts, "hours")
Measurings_min <- period.apply(Measurings_xts, epHours, colMin)
For meteorological analyses I need to calculate further minima for a less intuitive timespan, crossing the calendar day, that I fail to define in code:
I need to output the minimum nighttime temperature from e.g. 2018-05-18 19:00 to 2018-05-19 7:00 in the morning for each night in my dataset.
I have tried shifting the time column up or down so that each night falls within one calendar day. However, this solution is error-prone and doesn't work for my real data, where some observations are missing. How do I use POSIXct datetimes and/or xts functionality to calculate minima in this case?
You could solve this by creating your own "end points" for period.apply:
# Choose the appropriate time ranges
z <- Measurings_xts["T19:00/T07:00"]
# Creating your own "endpoints":
epNights <- which(diff.xts(index(z), units = "mins") > 10) - 1
Subtract one from each index, because which() reports each jump at the first observation of the next "night interval".
Then append the last data point in the data set to your endpoints vector, and use the result in period.apply:
epNights <- c(epNights, nrow(z))
Measurings_min <- period.apply(z, epNights, colMin)
Measurings_min
# [,1]
# 2018-05-18 07:00:00 10
# 2018-05-19 07:00:00 10
# 2018-05-20 07:00:00 10
# 2018-05-21 07:00:00 10
# 2018-05-22 07:00:00 10
# 2018-05-23 07:00:00 10
# 2018-05-24 07:00:00 11
# 2018-05-25 07:00:00 10
# 2018-05-26 07:00:00 10
# 2018-05-27 07:00:00 10
# 2018-05-27 23:50:00 12
Here is one approach that works by defining a new group for each night interval:
# define the time interval, e.g. from 19:00 to 7:00
from <- 19
to <- 7
hours <- as.numeric(strftime(index(Measurings_xts), format="%H"))
y <- rle(as.numeric(findInterval(hours, c(to,from)) != 1))
y$values[c(TRUE, FALSE)] <- cumsum(y$values[c(TRUE, FALSE)])
grp <- inverse.rle(y)
# grp is a grouping variable that is 0 for everything outside the
# defined interval , 1 for the first night, 2 for the second...
s <- split(Measurings_xts, grp); s$`0` <- NULL
# min_value will contain the minimum value for each night interval
min_value <- sapply(s, min)
# to see the date interval for each value
start <- sapply(s, function(x) as.character(index(x)[1]))
end <- sapply(s, function(x) as.character(index(x)[length(x)]))
data.frame(start, end, min_value)
# start end min_value
#1 2018-05-18 2018-05-18 06:50:00 10
#2 2018-05-18 19:00:00 2018-05-19 06:50:00 10
#3 2018-05-19 19:00:00 2018-05-20 06:50:00 10
#4 2018-05-20 19:00:00 2018-05-21 06:50:00 10
#5 2018-05-21 19:00:00 2018-05-22 06:50:00 10
#6 2018-05-22 19:00:00 2018-05-23 06:50:00 10
#7 2018-05-23 19:00:00 2018-05-24 06:50:00 11
#8 2018-05-24 19:00:00 2018-05-25 06:50:00 10
#9 2018-05-25 19:00:00 2018-05-26 06:50:00 10
#10 2018-05-26 19:00:00 2018-05-27 06:50:00 10
#11 2018-05-27 19:00:00 2018-05-27 23:50:00 12
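Both answers rely on the position of observations in the series. As a complementary sketch in base R, each night can also be keyed on the timestamps themselves, which should tolerate missing observations. It assumes UTC timestamps and a night running from 19:00 up to, but excluding, 07:00; temp, is_night and night_id are hypothetical names, and the temperatures are a deterministic stand-in rather than the sampled data from the question:

```r
# Illustrative base-R sketch; `temp`, `is_night` and `night_id` are hypothetical names.
tseq <- seq(as.POSIXct("2018-05-18 00:00", tz = "UTC"),
            length.out = 1440, by = "10 mins")
temp <- 10 + (seq_along(tseq) %% 28)   # deterministic stand-in for the readings

hour     <- as.POSIXlt(tseq)$hour
is_night <- hour >= 19 | hour < 7      # 19:00 up to, but excluding, 07:00

# Shifting back by 7 hours moves the early-morning part of a night onto the
# calendar date of the evening it belongs to.
night_id <- format(as.Date(tseq - 7 * 3600))

# Minimum temperature per night, named by the date on which the night starts.
night_min <- tapply(temp[is_night], night_id[is_night], min)
print(night_min)
```

The first group ("2018-05-17") only contains the early morning of 18 May, and the last one only the evening of 27 May, mirroring the partial first and last rows in the outputs above.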

R and Data.table - applying rollapply over multiple columns

I would really appreciate help applying rollapply to each column of a data.table:
time AUD NZD EUR GBP USD AUD
1 2013-01-01 20:00 0.213 -0.30467 -0.127515
2 2013-01-01 20:05 0.21191 -0.30467 -0.127975
3 2013-01-01 20:10 0.212185 -0.304965 -0.127935
4 2013-01-01 20:15 0.212055 -0.30511 -0.1288
5 2013-01-01 20:20 0.211225 -0.30536 -0.12938
6 2013-01-01 20:25 0.211185 -0.30527 -0.129195
7 2013-01-01 20:30 0.21159 -0.3059 -0.13043
8 2013-01-01 20:35 0.21142 -0.304955 -0.13155
9 2013-01-01 20:40 0.21093 -0.30419 -0.132715
10 2013-01-01 20:45 0.2078 -0.30339 -0.13544
11 2013-01-01 20:50 0.208445 -0.30304 -0.135645
12 2013-01-01 20:55 0.208735 -0.30185 -0.1357
13 2013-01-01 21:00 0.20891 -0.303265 -0.13722
14 2013-01-01 21:05 0.20903 -0.30428 -0.137495
15 2013-01-01 21:10 0.209615 -0.305495 -0.13734
16 2013-01-01 21:15 0.20981 -0.30588 -0.13772
17 2013-01-01 21:20 0.209855 -0.306935 -0.13801
18 2013-01-01 21:25 0.209585 -0.30604 -0.138045
19 2013-01-01 21:30 0.210105 -0.3061 -0.137765
20 2013-01-01 21:35 0.210335 -0.30734 -0.138525
Code that works:
library("zoo")
library("data.table")
calculateAverage <- function (x,N) {
  # rolling mean of the column that was passed in (x), not always out[,1]
  rollapply(x, N, mean)
}
col1 <- out[,2]
col2 <- out[,3]
col3 <- out[,4]
average1 <- calculateAverage(col1, 2)
average2 <- calculateAverage(col2, 2)
average3 <- calculateAverage(col3, 2)
combine <- cbind(average1, average2, average3)
tempMatrix <- matrix(, nrow = nrow(out), ncol = ncol(out))
tempMatrix[2:nrow(out), 1:3] <- combine
Suggestion from SO:
test <- lapply(out[,with=F], function(x) rollapply(x,width=2, FUN=mean))
Challenges:
1. The code I wrote works, but it feels inefficient and not generic. It has to be modified whenever the number of columns changes.
2. The suggestion from SO returns a list, which is not useful to me.
If you can suggest an alternative method, I would really appreciate it!
Thanks in advance
Edit:
Data table added
data <- cbind(mtcars,as.Date(c("2007-06-22", "2004-02-13")))
merge(rollapply(Filter(is.numeric, data), 2, mean),
Filter(Negate(is.numeric), data))
The first line creates data that contains non-numeric values as well. This only mimics your data, which is not available right now.
The second line keeps only the numeric columns and applies the rolling mean to each of them.
Suggestion from David Arenburg worked perfectly!
MaPrice <- function(x, N) {
Mavg <- rollapply(x, N, mean)
Mavg
}
SpreadMA <- out[, lapply(.SD, MaPrice, N = 20)]
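For comparison, the same "one rolling mean per column, result stays a table" idea can be sketched in base R without zoo or data.table. This swaps rollapply for stats::filter, which pads the first N - 1 rows with NA instead of dropping them; roll_mean and the toy table prices are hypothetical names for this illustration:

```r
# Illustrative base-R sketch; `roll_mean` and the toy table `prices` are hypothetical.
roll_mean <- function(x, N) {
  # Trailing moving average: NA for the first N - 1 rows,
  # then the mean of the current and the previous N - 1 values.
  as.numeric(stats::filter(x, rep(1 / N, N), sides = 1))
}

prices <- data.frame(a = c(1, 2, 3, 4), b = c(10, 20, 30, 40))

# lapply() returns a list; assigning into prices_ma[] keeps the data.frame shape,
# so nothing needs to change when columns are added or removed.
prices_ma <- prices
prices_ma[] <- lapply(prices, roll_mean, N = 2)
print(prices_ma)
```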

Count time stamps in different time intervals - issue with interval which spans midnight

I have a dataframe ("observations") with time stamps in H:M format ("Time"). In a second dataframe ("intervals"), I have time ranges defined by "From" and "Till" variables, also in H:M format.
I want to count the number of observations that fall within each interval. I have been using between from data.table, which has worked without any problem when dates are included.
However, now I only have time stamps, without dates. This causes problems for times that fall in the interval spanning midnight (20:00 - 05:59). These times are not counted by the code I have tried.
Example below
interval.data <- data.frame(From = c("14:00", "20:00", "06:00"), Till = c("19:59", "05:59", "13:59"), stringsAsFactors = F)
observations <- data.frame(Time = c("14:32", "15:59", "16:32", "21:34", "03:32", "02:00", "00:00", "05:57", "19:32", "01:32", "02:22", "06:00", "07:50"), stringsAsFactors = F)
interval.data
# From Till
# 1: 14:00:00 19:59:00
# 2: 20:00:00 05:59:00 # <- interval including midnight
# 3: 06:00:00 13:59:00
observations
# Time
# 1: 14:32:00
# 2: 15:59:00
# 3: 16:32:00
# 4: 21:34:00 # Rows 4-8 & 10-11 fall in the 'midnight interval', but are not counted
# 5: 03:32:00 #
# 6: 02:00:00 #
# 7: 00:00:00 #
# 8: 05:57:00 #
# 9: 19:32:00
# 10: 01:32:00 #
# 11: 02:22:00 #
# 12: 06:00:00
# 13: 07:50:00
library(data.table)
library(plyr)
adply(interval.data, 1, function(x, y) sum(y[, 1] %between% c(x[1], x[2])), y = observations)
# From Till V1
# 1 14:00 19:59 4
# 2 20:00 05:59 0 # <- zero counts - wrong!
# 3 06:00 13:59 2
One approach is to use a non-equi join in data.table, together with its helper function as.ITime for working with time strings.
You'll have an issue with the interval that spans midnight, but there should only ever be one of those. Since you're interested in the number of observations per interval, and assuming the intervals together cover the whole day, you can treat this interval as the complement of the others: any time that matches no other interval belongs to it.
For example, first convert your data.frame to data.table
library(data.table)
## set your data.frames as `data.table`
setDT(interval.data)
setDT(observations)
Then use as.ITime to convert to an integer representation of time
## convert time stamps
interval.data[, `:=`(FromMins = as.ITime(From),
TillMins = as.ITime(Till))]
observations[, TimeMins := as.ITime(Time)]
## you could combine this step with the non-equi join directly, but I'm separating it for clarity
You can now use a non-equi join to find the interval that each time falls within. Note that the times that return NA are those that fall inside the midnight-spanning interval:
interval.data[
observations
, on = .(FromMins <= TimeMins, TillMins > TimeMins)
]
# From Till FromMins TillMins Time
# 1: 14:00 19:59 872 872 14:32
# 2: 14:00 19:59 959 959 15:59
# 3: 14:00 19:59 992 992 16:32
# 4: NA NA 1294 1294 21:34
# 5: NA NA 212 212 03:32
# 6: NA NA 120 120 02:00
# 7: NA NA 0 0 00:00
# 8: NA NA 357 357 05:57
# 9: 14:00 19:59 1172 1172 19:32
# 10: NA NA 92 92 01:32
# 11: NA NA 142 142 02:22
# 12: 06:00 13:59 360 360 06:00
# 13: 06:00 13:59 470 470 07:50
Then, to get the number of observations per interval, count rows with .N grouped by interval, which can be chained onto the end of the statement above:
interval.data[
observations
, on = .(FromMins <= TimeMins, TillMins > TimeMins)
][
, .N
, by = .(From, Till)
]
# From Till N
# 1: 14:00 19:59 4
# 2: NA NA 7
# 3: 06:00 13:59 2
Here the NA group corresponds to the interval that spans midnight.
I just tweaked your code to get the desired result. Hope this helps!
adply(interval.data, 1, function(x, y)
  if(x[1] > x[2]) return(sum(y[, 1] %between% c(x[1], "23:59"), y[, 1] %between% c("00:00", x[2]))) else return(sum(y[, 1] %between% c(x[1], x[2]))), y = observations)
Output is:
From Till V1
1 14:00 19:59 4
2 20:00 05:59 7
3 06:00 13:59 2
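The wrap-around rule behind both answers can be written out directly: when From is later than Till, a time belongs to the interval if it is at or after From or at or before Till. A minimal base-R sketch on the example data, where to_minutes and count_in are hypothetical helpers introduced only for this illustration:

```r
# Illustrative base-R sketch; `to_minutes` and `count_in` are hypothetical helpers.
to_minutes <- function(hm) {
  # "HH:MM" -> minutes since midnight
  parts <- strsplit(hm, ":", fixed = TRUE)
  vapply(parts, function(p) as.integer(p[1]) * 60L + as.integer(p[2]), integer(1))
}

count_in <- function(from, till, times) {
  f <- to_minutes(from)
  t <- to_minutes(till)
  x <- to_minutes(times)
  if (f <= t) {
    sum(x >= f & x <= t)   # ordinary interval within one day
  } else {
    sum(x >= f | x <= t)   # interval wraps past midnight
  }
}

interval.data <- data.frame(From = c("14:00", "20:00", "06:00"),
                            Till = c("19:59", "05:59", "13:59"),
                            stringsAsFactors = FALSE)
observations <- data.frame(Time = c("14:32", "15:59", "16:32", "21:34", "03:32",
                                    "02:00", "00:00", "05:57", "19:32", "01:32",
                                    "02:22", "06:00", "07:50"),
                           stringsAsFactors = FALSE)

interval.data$N <- mapply(count_in, interval.data$From, interval.data$Till,
                          MoreArgs = list(times = observations$Time),
                          USE.NAMES = FALSE)
print(interval.data)
```

Unlike the complement trick, this does not require the intervals to cover the whole day, and it would also handle more than one midnight-spanning interval.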

Summarise a vector and then append the summary statistics to the original dataframe in R

Intro:
I would like to compute the mean, standard deviation, and standard error of a numeric vector in a given dataframe and then create three new vectors using these summary statistics. I then need to combine them with the original dataframe.
Example Code:
## Creating our dataframe:
datetime <- c("5/12/2017 16:15:00","5/16/2017 16:45:00","5/19/2017 17:00:00")
datetime <- as.POSIXct(datetime, format = "%m/%d/%Y %H:%M:%S")
values <- c(1,2,3)
df <- data.frame(datetime, values)
## Here's the current output:
head(df)
datetime values
1 2017-05-12 16:15:00 1
2 2017-05-16 16:45:00 2
3 2017-05-19 17:00:00 3
## And here's the desired output:
head(df1)
datetime values mean sd se
1 2017-05-12 16:15:00 1 2 0.816 0.471
2 2017-05-16 16:45:00 2 2 0.816 0.471
3 2017-05-19 17:00:00 3 2 0.816 0.471
Thanks in advance!
For those who are curious as to why I am trying to do this, I am following this tutorial. I need to make one of those line graph plots with errorbars for some calibrations between a low-cost sensor and an expensive reference instrument.
You can do the assignment simultaneously. Suppose you already have helper functions for your choice of sd and se:
sd0 <- function(x){sd(x) / sqrt(length(x)) * sqrt(length(x) - 1)}
se0 <- function(x){ sd0(x) / sqrt(length(x))}
Then you can try:
df[c('mean', 'sd', 'se')] <- lapply(list(mean, sd0, se0), function(f) f(df$values))
# > df
# datetime values mean sd se
# 1 2017-05-12 16:15:00 1 2 0.8164966 0.4714045
# 2 2017-05-16 16:45:00 2 2 0.8164966 0.4714045
# 3 2017-05-19 17:00:00 3 2 0.8164966 0.4714045
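The sd0 helper rescales R's sample standard deviation (which divides by n - 1) to the population form (which divides by n); that is why the result is 0.816 rather than 1 for these values. A short check of that reading, where pop_sd is a name introduced only for this illustration:

```r
# Check of the helpers from the answer above on the example values.
sd0 <- function(x) sd(x) / sqrt(length(x)) * sqrt(length(x) - 1)
se0 <- function(x) sd0(x) / sqrt(length(x))

values <- c(1, 2, 3)

# Population standard deviation written out directly: divide by n, not n - 1.
pop_sd <- sqrt(mean((values - mean(values))^2))

print(c(sd0 = sd0(values), pop_sd = pop_sd, se0 = se0(values)))
```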
Here is the dplyr solution, with sd0 and se0 as given in mt1022's answer:
library(dplyr)
df %>% mutate(mean = mean(values), sd = sd0(values), se = se0(values))

R Search for a particular time from index

I use an xts object. The index of the object is as below. There is one for every hour of the day for a year.
"2011-01-02 18:59:00 EST"
"2011-01-02 19:58:00 EST"
"2011-01-02 20:59:00 EST"
In columns are values associated with each index entry. What I want to do is calculate the standard deviation of the value for all Mondays at 18:59 for the complete year. There should be 52 values for the year.
I'm able to search for the day of the week using the weekdays() function, but my problem is searching for the time, such as 18:59:00 or any other time.
You can do this by using interaction to create a factor from the combination of weekdays and .indexhour, then use split to select the relevant observations from your xts object.
set.seed(21)
x <- .xts(rnorm(1e4), seq(1, by=60*60, length.out=1e4))
groups <- interaction(weekdays(index(x)), .indexhour(x))
output <- lapply(split(x, groups), function(x) c(count=length(x), sd=sd(x)))
output <- do.call(rbind, output)
head(output)
# count sd
# Friday.0 60 1.0301030
# Monday.0 59 0.9204670
# Saturday.0 60 0.9842125
# Sunday.0 60 0.9500347
# Thursday.0 60 0.9506620
# Tuesday.0 59 0.8972697
You can use the .index* family of functions (don't forget the '.' in front of 'index'!):
fxts[.indexmon(fxts)==0] # it's zero-based (!) and gives you all the January values
fxts[.indexmday(fxts)==1] # beginning of month
fxts[.indexwday(fxts)==1] # Mondays
require(quantmod)
> fxts
value
2011-01-02 19:58:00 1
2011-01-02 20:59:00 2
2011-01-03 18:59:00 3
2011-01-09 19:58:00 4
2011-01-09 20:59:00 5
2011-01-10 18:59:00 6
2011-01-16 18:59:00 7
2011-01-16 19:58:00 8
2011-01-16 20:59:00 9
fxts[.indexwday(fxts)==1] #this gives you all the Mondays
For subsetting by time of day, use
fxts["T19:30/T20:00"] # this will give you the time period you are looking for
And here you combine weekday and time period:
fxts["T18:30/T20:00"] & fxts[.indexwday(fxts)==1] # to get a logical vector or
fxts["T18:30/T21:00"][.indexwday(fxts["T18:30/T21:00"])==1] # to get the values
> value
2011-01-03 18:59:00 3
2011-01-10 18:59:00 6
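The same selection can be sketched in base R on plain POSIXct timestamps, matching the weekday and the exact clock time, and then taking the standard deviation the question asks for. This assumes UTC timestamps instead of the EST index in the question; stamps, value and sel are hypothetical names that mirror the fxts example above:

```r
# Illustrative base-R sketch; `stamps`, `value` and `sel` are hypothetical names.
stamps <- as.POSIXct(c("2011-01-02 19:58:00", "2011-01-02 20:59:00",
                       "2011-01-03 18:59:00", "2011-01-09 19:58:00",
                       "2011-01-09 20:59:00", "2011-01-10 18:59:00",
                       "2011-01-16 18:59:00", "2011-01-16 19:58:00",
                       "2011-01-16 20:59:00"), tz = "UTC")
value <- 1:9

is_monday <- as.POSIXlt(stamps)$wday == 1          # 0 = Sunday, 1 = Monday
is_1859   <- format(stamps, "%H:%M") == "18:59"    # exact clock time
sel <- is_monday & is_1859

print(data.frame(time = stamps[sel], value = value[sel]))

# Standard deviation of the values observed on Mondays at 18:59.
monday_sd <- sd(value[sel])
print(monday_sd)
```

Note that 2011-01-16 18:59 is excluded because that date is a Sunday.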

Resources