I would really appreciate it if you could help me run rollapply over each column of the data.table below.
time AUDNZD EURGBP USDAUD
1 2013-01-01 20:00 0.213 -0.30467 -0.127515
2 2013-01-01 20:05 0.21191 -0.30467 -0.127975
3 2013-01-01 20:10 0.212185 -0.304965 -0.127935
4 2013-01-01 20:15 0.212055 -0.30511 -0.1288
5 2013-01-01 20:20 0.211225 -0.30536 -0.12938
6 2013-01-01 20:25 0.211185 -0.30527 -0.129195
7 2013-01-01 20:30 0.21159 -0.3059 -0.13043
8 2013-01-01 20:35 0.21142 -0.304955 -0.13155
9 2013-01-01 20:40 0.21093 -0.30419 -0.132715
10 2013-01-01 20:45 0.2078 -0.30339 -0.13544
11 2013-01-01 20:50 0.208445 -0.30304 -0.135645
12 2013-01-01 20:55 0.208735 -0.30185 -0.1357
13 2013-01-01 21:00 0.20891 -0.303265 -0.13722
14 2013-01-01 21:05 0.20903 -0.30428 -0.137495
15 2013-01-01 21:10 0.209615 -0.305495 -0.13734
16 2013-01-01 21:15 0.20981 -0.30588 -0.13772
17 2013-01-01 21:20 0.209855 -0.306935 -0.13801
18 2013-01-01 21:25 0.209585 -0.30604 -0.138045
19 2013-01-01 21:30 0.210105 -0.3061 -0.137765
20 2013-01-01 21:35 0.210335 -0.30734 -0.138525
Code that works:
library("zoo")
library("data.table")
calculateAverage <- function(x, N) {
  # rolling mean of width N over the supplied column
  rollapply(x, N, mean)
}
col1 <- out[,2]
col2 <- out[,3]
col3 <- out[,4]
average1 <- calculateAverage(col1, 2)
average2 <- calculateAverage(col2, 2)
average3 <- calculateAverage(col3, 2)
combine <- cbind(average1, average2, average3)
tempMatrix <- matrix(NA, nrow = nrow(out), ncol = ncol(out))
tempMatrix[2:nrow(out), 1:3] <- combine
Suggestion from SO:
test <- lapply(out[,with=F], function(x) rollapply(x,width=2, FUN=mean))
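For reference, the list this returns can be turned straight back into a data.table, since every element has the same length. A minimal sketch, assuming the first column of out holds the time stamps and the rest are numeric:
test <- lapply(out[, -1, with = FALSE], function(x) rollapply(x, width = 2, FUN = mean))
as.data.table(test)          # one column per pair, nrow(out) - 1 rows
# or: do.call(cbind, test)   # the same data as a plain matrix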
Challenges:
1. The code I created works, but it feels inefficient and not generic: it needs to be modified whenever the number of columns changes
2. The output of the SO suggestion is a list, which is not useful to me
If an alternative method is suggested, I would really appreciate it!
Thanks in advance
Edit:
Data table added
data <- cbind(mtcars,as.Date(c("2007-06-22", "2004-02-13")))
merge(rollapply(Filter(is.numeric, data), 2, mean),
Filter(Negate(is.numeric), data))
The first line creates data, so that there are not only numeric values in it. This is only to mimic your data, which is not available right now.
The second line keeps only the numeric columns, applies the rolling mean to each of them, and merges the result back with the non-numeric columns.
Suggestion from David Arenburg worked perfectly!
MaPrice <- function(x, N) {
Mavg <- rollapply(x, N, mean)
Mavg
}
SpreadMA <- out[, lapply(.SD, MaPrice, N = 20)]
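If out also carries a non-numeric time column, .SDcols can restrict the rolling mean to the numeric columns. A sketch, assuming the time column is named "time":
num_cols <- setdiff(names(out), "time")
SpreadMA <- out[, lapply(.SD, MaPrice, N = 20), .SDcols = num_cols]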
Related
I have a time series and want to use the period.apply() function from the xts library to estimate the mean over every 377 days.
A reproducible example follows:
zoo.data <- zoo(rnorm(5031)+10,as.Date(13514:17744,origin="1970-01-01"))
ep <- endpoints(zoo.data,'days', k =377)
period.apply(zoo.data, INDEX=ep, FUN=function(x) mean(x))
The output generated is
2007-05-28 2007-12-31 2008-10-05 2008-12-31 2009-02-02 2009-12-31
9.905663 9.800760 10.006344 10.052163 10.152453 10.032073
2010-06-13 2010-12-31 2011-10-22 2011-12-31 2012-02-18 2012-12-31
9.879439 10.038644 9.957582 9.977026 9.959094 10.004348
2013-06-29 2013-12-31 2014-11-07 2014-12-31 2015-03-06 2015-12-31
10.004620 10.086071 9.902875 9.843695 9.851306 10.072610
2016-07-14 2016-12-31 2017-11-23 2017-12-31 2018-03-22 2018-08-01
9.966911 10.199251 10.001628 10.263590 10.181235 10.059080
The output is unexpected, as the difference between consecutive dates is not 377 days. The output shows that it stops at each year end (20xx-12-31) before moving on to the next endpoint.
I am not sure that you can solve this using the endpoints function directly.
Here is one way to solve it using built-in functions. It is a slightly more general solution.
In the code below, you can uncomment the commented lines to print the number of observations in the last interval.
library(xts)
apply.fun <- function(data, variable=1, fun=mean, k=377) { # variable: variable name or column index
data <- as.xts(data)
variable <- data[, variable, drop=TRUE]
idx <- index(data)
byindex <- as.integer(idx - first(idx)) %/% k # interval identifiers
endates <- idx[!duplicated(byindex, fromLast=TRUE)]
ans <- setNames(tapply(variable, byindex, fun), endates)
#inter.end <- sum(byindex==last(byindex))
#if(inter.end < k) cat(sprintf("Last interval has fewer observations: %d<k=%d\n\n", inter.end, k))
return(as.xts(as.matrix(ans)))
}
set.seed(147)
zoo.data <- zoo(rnorm(5031)+10,as.Date(13514:17744,origin="1970-01-01"))
apply.fun(zoo.data, 1, mean)
# [,1]
# 2008-01-12 10.043735
# 2009-01-23 10.042741
# 2010-02-04 9.957842
# 2011-02-16 10.016998
# 2012-02-28 9.932871
# 2013-03-11 9.932731
# 2014-03-23 10.045344
# 2015-04-04 10.015821
# 2016-04-15 10.015023
# 2017-04-27 10.038887
# 2018-05-09 9.978744
# 2018-08-01 10.004074
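As a quick sanity check (a sketch, not part of the original answer), the spacing of the returned end dates can be inspected; consecutive differences should be 377 days, apart from the shorter final interval:
res <- apply.fun(zoo.data, 1, mean)
diff(as.Date(index(res)))   # expect 377-day gaps, with a shorter last interval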
I have a vector of dates:
dates <- seq(as.Date('2017-01-01'), as.Date('2017-12-31'), by = 'days')
I want to create a data frame where this vector is repeated for n rows. Can anyone tell me how I might be able to accomplish this? Any help is greatly appreciated.
Thanks for the suggestions so far. Unfortunately, I think my intention was unclear in my original question. I would like each of n rows in the data frame to contain the vector of dates so that the final data frame would look something like this:
1 2017-01-01 2017-01-02.....2017-12-31
2 2017-01-01 2017-01-02.....2017-12-31
3 2017-01-01 2017-01-02.....2017-12-31
.
.
.
n 2017-01-01 2017-01-02.....2017-12-31
You can use rep to repeat the vector and then coerce it to a data frame. For example, repeating 10 times:
num_repeat <- 10
dates <- data.frame(rep(
seq(as.Date('2017-01-01'), as.Date('2017-12-31'), by = 'days'),
times = num_repeat))
As the question asker is hoping to fill n rows, wouldn't it make more sense to specify length.out rather than times?
set.seed(1)
dtf <- data.frame(A=letters[sample(1:27, 1000, TRUE)])
dtf$B <- rep(dates, length.out=nrow(dtf)) # `dates` here is the original vector from the question, not the data.frame above
tail(dtf)
# A B
# 995 d 2017-09-22
# 996 u 2017-09-23
# 997 r 2017-09-24
# 998 h 2017-09-25
# 999 f 2017-09-26
# 1000 h 2017-09-27
We use replicate to do this
n <- 5
out <- do.call(rbind, replicate(n, as.data.frame(as.list(dates)),
simplify = FALSE))
names(out) <- paste0('V', seq_along(out))
dim(out)
#[1] 5 365
out[1:3, 1:3]
# V1 V2 V3
#1 2017-01-01 2017-01-02 2017-01-03
#2 2017-01-01 2017-01-02 2017-01-03
#3 2017-01-01 2017-01-02 2017-01-03
out[1:3, 362:365]
# V362 V363 V364 V365
#1 2017-12-28 2017-12-29 2017-12-30 2017-12-31
#2 2017-12-28 2017-12-29 2017-12-30 2017-12-31
#3 2017-12-28 2017-12-29 2017-12-30 2017-12-31
I have a dataframe ("observations") with time stamps in H:M format ("Time"). In a second dataframe ("intervals"), I have time ranges defined by "From" and "Till" variables, also in H:M format.
I want to count the number of observations that fall within each interval. I have been using between from data.table, which has worked without any problem when dates are included.
However, now I only have time stamps, without dates. This causes problems for the times that occur in the interval spanning midnight (20:00 - 05:59): these times are not counted by the code I have tried.
Example below
interval.data <- data.frame(From = c("14:00", "20:00", "06:00"), Till = c("19:59", "05:59", "13:59"), stringsAsFactors = F)
observations <- data.frame(Time = c("14:32", "15:59", "16:32", "21:34", "03:32", "02:00", "00:00", "05:57", "19:32", "01:32", "02:22", "06:00", "07:50"), stringsAsFactors = F)
interval.data
# From Till
# 1: 14:00:00 19:59:00
# 2: 20:00:00 05:59:00 # <- interval including midnight
# 3: 06:00:00 13:59:00
observations
# Time
# 1: 14:32:00
# 2: 15:59:00
# 3: 16:32:00
# 4: 21:34:00 # Rows 4-8 & 10-11 fall in the 'midnight interval' but are not counted
# 5: 03:32:00 #
# 6: 02:00:00 #
# 7: 00:00:00 #
# 8: 05:57:00 #
# 9: 19:32:00
# 10: 01:32:00 #
# 11: 02:22:00 #
# 12: 06:00:00
# 13: 07:50:00
library(data.table)
library(plyr)
adply(interval.data, 1, function(x, y) sum(y[, 1] %between% c(x[1], x[2])), y = observations)
# From Till V1
# 1 14:00 19:59 4
# 2 20:00 05:59 0 # <- zero counts - wrong!
# 3 06:00 13:59 2
One approach is to use a non-equi join in data.table, along with its helper function as.ITime for working with time strings.
You'll have an issue with the interval that spans midnight, but there should only ever be one of those. And as you're interested in the number of observations per 'group' of intervals, you can treat this group as the 'Not' of all the others.
For example, first convert your data.frame to data.table
library(data.table)
## set your data.frames as `data.table`
setDT(interval.data)
setDT(observations)
Then use as.ITime to convert to an integer representation of time
## convert time stamps
interval.data[, `:=`(FromMins = as.ITime(From),
TillMins = as.ITime(Till))]
observations[, TimeMins := as.ITime(Time)]
## you could combine this step with the non-equi join directly, but I'm separating it for clarity
You can now use a non-equi join to find the interval that each time falls within. Note that the rows that return NA are those falling inside the midnight-spanning interval.
interval.data[
observations
, on = .(FromMins <= TimeMins, TillMins > TimeMins)
]
# From Till FromMins TillMins Time
# 1: 14:00 19:59 872 872 14:32
# 2: 14:00 19:59 959 959 15:59
# 3: 14:00 19:59 992 992 16:32
# 4: NA NA 1294 1294 21:34
# 5: NA NA 212 212 03:32
# 6: NA NA 120 120 02:00
# 7: NA NA 0 0 00:00
# 8: NA NA 357 357 05:57
# 9: 14:00 19:59 1172 1172 19:32
# 10: NA NA 92 92 01:32
# 11: NA NA 142 142 02:22
# 12: 06:00 13:59 360 360 06:00
# 13: 06:00 13:59 470 470 07:50
Then, to get the number of observations for the groups of intervals, you just take .N grouped by the interval columns, which can be chained onto the end of the above statement:
interval.data[
observations
, on = .(FromMins <= TimeMins, TillMins > TimeMins)
][
, .N
, by = .(From, Till)
]
# From Till N
# 1: 14:00 19:59 4
# 2: NA NA 7
# 3: 06:00 13:59 2
Here the NA group corresponds to the interval that spans midnight.
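If the explicit labels are wanted back, the NA group can be relabelled after the fact. A sketch, relying on the fact that the midnight-spanning interval is the only one whose From is later than its Till:
res <- interval.data[
observations
, on = .(FromMins <= TimeMins, TillMins > TimeMins)
][
, .N
, by = .(From, Till)
]
res[is.na(From), c("From", "Till") := interval.data[FromMins > TillMins, .(From, Till)]]
res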
I just tweaked your code to get the desired result. Hope this helps!
adply(interval.data, 1, function(x, y)
  if (x[1] > x[2])
    sum(y[, 1] %between% c(x[1], "23:59"), y[, 1] %between% c("00:00", x[2]))
  else
    sum(y[, 1] %between% c(x[1], x[2])),
  y = observations)
(Note that "23:59" and "00:00" must be quoted: unquoted, 23:59 is the integer sequence 23, 24, ..., 59.)
Output is:
From Till V1
1 14:00 19:59 4
2 20:00 05:59 7
3 06:00 13:59 2
Consider this:
library(lubridate)
time <- seq(ymd_hms("2014-02-24 23:00:00"), ymd_hms("2014-06-25 08:32:00"), by="hour")
group <- rep(LETTERS[1:20], each = length(time))
value <- sample(-10^3:10^3,length(time), replace=TRUE)
df2 <- data.frame(time,group,value)
str(df2)
> head(df2)
time group value
1 2014-02-24 23:00:00 A 246
2 2014-02-25 00:00:00 A -261
3 2014-02-25 01:00:00 A 628
4 2014-02-25 02:00:00 A 429
5 2014-02-25 03:00:00 A -49
6 2014-02-25 04:00:00 A -749
I would like to create a variable that contains, for each group, the rolling mean of value over the last 5 days (not including the current observation), considering only observations that fall at the exact same hour as the current observation.
In other words:
At time 2014-02-24 23:00:00, df2['rolling_mean_same_hour'] contains the mean of the values of value observed at 23:00:00 during the last 5 days in the data (not including 2014-02-24 of course).
I would like to do that in either dplyr or data.table. I confess to having no idea how to do that.
Any ideas?
Many thanks!
You can calculate the rollmean() with your data grouped by the group variable and the hour of the time variable. Normally rollmean() includes the current observation, but you can use the shift() function to exclude it from the rolling mean:
library(data.table); library(zoo)
setDT(df2)
df2[, .(rolling_mean_same_hour = shift(
rollmean(value, 5, na.pad = TRUE, align = 'right'),
n = 1,
type = 'lag'),
time), .(hour(time), group)]
# hour group rolling_mean_same_hour time
# 1: 23 A NA 2014-02-24 23:00:00
# 2: 23 A NA 2014-02-25 23:00:00
# 3: 23 A NA 2014-02-26 23:00:00
# 4: 23 A NA 2014-02-27 23:00:00
# 5: 23 A NA 2014-02-28 23:00:00
# ---
#57796: 22 T -267.0 2014-06-20 22:00:00
#57797: 22 T -389.6 2014-06-21 22:00:00
#57798: 22 T -311.6 2014-06-22 22:00:00
#57799: 22 T -260.0 2014-06-23 22:00:00
#57800: 22 T -26.8 2014-06-24 22:00:00
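Since the question allows dplyr as well, here is a rough equivalent of the same idea. A sketch, assuming lubridate's hour() and zoo's right-aligned rollmeanr():
library(dplyr); library(zoo); library(lubridate)
df2 %>%
  group_by(group, hr = hour(time)) %>%                # same hour, same group
  arrange(time, .by_group = TRUE) %>%
  mutate(rolling_mean_same_hour =
           lag(rollmeanr(value, 5, fill = NA))) %>%   # lag() excludes the current observation
  ungroup()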
I have a dataframe with a POSIXct datetime column and a column with a value.
The value may contain periods of NA, and sometimes there are even gaps of several hours (no rows written at all), e.g.:
t v
2014-01-01 20:00:00 1000
2014-01-01 20:15:00 2300
2014-01-01 20:30:00 1330
2014-01-01 20:45:00 NA
2014-01-01 21:00:00 NA
2014-01-01 22:15:00 NA
2014-01-01 22:30:00 1330
2014-01-01 22:45:00 3333
One can easily see that there is a period with simply no data written (21:00 to 22:15)
When I now apply
aggregate(data, list(t = cut(data$t, "1 hour")), FUN = sum)
it interprets anything missing as zero. When plotting the result with ggplot2 and geom_line, the curve in that region breaks down from the 1000s to the 10s.
I want that aggregate returns NA values for every hour that is not represented by the data (missing or NA itself), such that the values are not bent down to 0 and such that the line plot shows a gap in that period (disconnected data points).
Thanks to @JulienNavarre and @user20650, who both contributed parts of the solution, here is my final solution, which can additionally handle data at non-regular times and demands at least x valid values per hour for aggregation.
data$t <- as.POSIXct(strptime(data$t,"%Y-%m-%d %H:%M:%S"))
x <- 4 # data available x times per hour
h <- 1 # aggregate to every h hours
# aggregation yields NA if the data does not have x valid values per hour
dataagg <- aggregate(data$v, list(t=cut(data$t, paste(h,"hours"))),
function(z) ifelse(length(z)<x*h||any(is.na(z)),NA,sum(z,na.rm=T)))
dataagg$t <- as.POSIXct(strptime(dataagg$t, '%Y-%m-%d %H:%M:%S'))
# Now fill up missing datetimes with NA
a <- seq(min(dataagg$t), max(dataagg$t), by=paste(h,"hours"))
t <- a[seq(1, length(a), by=1)]
tdf <- as.data.frame(t)
tdf$t <- as.POSIXct(strptime(tdf$t, '%Y-%m-%d %H:%M:%S'))
dataaggfinal <- merge(dataagg, tdf, by="t", all.y=T)
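With the NA rows retained, geom_line() leaves the expected gap in the curve. A quick sketch (note that aggregate names the value column x by default):
library(ggplot2)
ggplot(dataaggfinal, aes(t, x)) + geom_line()   # line is disconnected where x is NA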
What you want is not entirely clear, but maybe you are looking for a right join, which you can do with merge and all.y = TRUE.
Afterwards you can compute your grouped sum with aggregate.
> data$t <- as.POSIXct(data$t)
>
> time.seq <- seq(min(as.POSIXct(data$t)), max(as.POSIXct(data$t)), by = "min")[seq(1, 166, by = 15)]
>
> merge(data, as.data.frame(time.seq), by.x = "t", by.y = "time.seq", all.y = T)
t v
1 2014-01-01 20:00:00 1000
2 2014-01-01 20:15:00 2300
3 2014-01-01 20:30:00 1330
4 2014-01-01 20:45:00 NA
5 2014-01-01 21:00:00 NA
6 2014-01-01 21:15:00 NA
7 2014-01-01 21:30:00 NA
8 2014-01-01 21:45:00 NA
9 2014-01-01 22:00:00 NA
10 2014-01-01 22:15:00 NA
11 2014-01-01 22:30:00 1330
12 2014-01-01 22:45:00 3333
And the x argument of aggregate should be, in this case, the variable you want to sum, so it is data$v, not data.
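Putting the two remarks together, a sketch: right-join onto the complete 15-minute grid first, then aggregate data$v per hour. sum() propagates NA, so hours with missing data stay NA:
filled <- merge(data, as.data.frame(time.seq), by.x = "t", by.y = "time.seq", all.y = TRUE)
aggregate(filled$v, list(t = cut(filled$t, "1 hour")), FUN = sum)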