I want to split my large xts object into a list of regular one-second periods containing all the observations of the original object. The goal is to send each list element to nodes on my cluster for processing.
I came up with this solution, which is quite elaborate. I'm wondering if this code can be simplified:
library(xts)
set.seed(123)
myts = xts(1:10000, as.POSIXlt(1366039619, tz="EST", origin="1970-01-01") + rnorm(10000, 1, 100))
# ensure we have at least one observation per second
secs = seq(trunc(index(head(myts, 1))), trunc(index(tail(myts, 1))), by="s")
myts = merge(myts, secs, fill=na.locf)
# generate one-second period endpoints
myts.aligned = align.time(myts, 1)
myts.ep = endpoints(myts.aligned, "seconds", 1)
# split large xts object into a list of one-second periods
myts.list = lapply(1:(length(myts.ep)-1), function(x, myts, ep) { myts[ep[x]:ep[x+1],] }, myts, myts.ep)
# call to parLapply here...
I think this does what you want:
split(myts, "secs")
It will create a list where each component is 1 second of non-overlapping data.
See ?split.xts
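For the cluster step mentioned in the question, a minimal sketch could look like the following. It assumes a local two-worker cluster from the parallel package and uses a placeholder per-chunk function (summing the values); both are illustrative assumptions, not part of the original question. Note that split() only returns seconds that actually contain observations, so empty seconds do not produce list elements.

```r
library(xts)
library(parallel)

set.seed(123)
# Synthetic irregular series, similar in spirit to the question's data
idx <- as.POSIXct(1366039619, origin = "1970-01-01", tz = "EST") + rnorm(1000, 1, 100)
myts <- xts(seq_along(idx), order.by = sort(idx))

# One list element per second that contains at least one observation
chunks <- split(myts, f = "seconds")

# Hypothetical local cluster with two workers; swap in your own cluster setup
cl <- makeCluster(2)
invisible(clusterEvalQ(cl, library(xts)))

# Placeholder per-chunk work: sum the values in each one-second chunk
chunk_sums <- parLapply(cl, chunks, function(chunk) sum(coredata(chunk)))
stopCluster(cl)
```

If every second must be represented, the padding step from the question (merging in a regular sequence of seconds) would still be needed before splitting.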
I'm calculating the difference in seconds between two consecutive rows with the following code:
set.seed(79)
library(outbreaks)
library(lubridate)
# Import data
disease_df <- measles_hagelloch_1861[, 3, drop = FALSE]
# Generate a random time for each day
disease_df$time <- sample(1:86400, nrow(disease_df), replace = TRUE)
disease_df$time <- hms::as.hms(disease_df$time)
# Combine date and time
disease_df$time1 <- with(disease_df, ymd(date_of_prodrome) + hms(time))
# Sort data
disease_df <- disease_df[order(disease_df$time1), ]
# Difference in days between two consecutive rows
disease_df$diff <- as.numeric(difftime(disease_df$date_of_prodrome,
                                       dplyr::lag(disease_df$date_of_prodrome, 1), units = 'days'))
# Difference in seconds between two consecutive rows
disease_df$diff1 <- as.numeric(difftime(disease_df$time1,
                                        dplyr::lag(disease_df$time1, 1), units = 'secs'))
Here is the resulting data frame, along with the error message "longer object length is not a multiple of shorter object length".
Could you please explain why difftime works fine for days but results in an error for seconds? Thank you so much!
The time1 column is of type POSIXlt. I am not really sure why difftime with units = 'secs' doesn't work, but if you convert it to POSIXct, it works without any error.
disease_df$time1 <- as.POSIXct(disease_df$time1)
disease_df$diff1 <- as.numeric(difftime(disease_df$time1,
                                        dplyr::lag(disease_df$time1, 1), units = 'secs'))
Apparently dplyr was not happy with the line dplyr::lag(disease_df$time1, 1) because of the class of disease_df$time1 (POSIXlt).
Converting it to POSIXct works, so just update this part of your code:
# Combine date and time and convert to POSIXct
disease_df$time1 <- as.POSIXct(with(disease_df, ymd(date_of_prodrome) + hms(time)))
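To illustrate the difference between the two classes, here is a small base R sketch with made-up timestamps. It only shows that POSIXlt is stored as a list of components while POSIXct is a plain numeric vector, and that consecutive differences are straightforward once the values are POSIXct. It does not try to pin down exactly where the original error comes from.

```r
# Made-up timestamps, for illustration only
x_lt <- as.POSIXlt(c("2020-01-01 00:00:00",
                     "2020-01-01 00:00:30",
                     "2020-01-01 00:02:00"), tz = "UTC")

# POSIXlt is a list of date-time components under the hood
storage_lt <- typeof(x_lt)

# POSIXct is a numeric vector of seconds since the epoch
x_ct <- as.POSIXct(x_lt)
storage_ct <- typeof(x_ct)

# Difference in seconds between consecutive elements, NA for the first one
n <- length(x_ct)
diff_secs <- c(NA, as.numeric(difftime(x_ct[-1], x_ct[-n], units = "secs")))
```

Because a POSIXct vector behaves like an ordinary atomic vector, helpers that shift or pad vectors tend to handle it more predictably than POSIXlt.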
I frequently use to.daily to convert 1 min OHLC data to a daily format but am trying to find a way to do the same with overnight data. I was hoping to see the option to specify what time a "day" starts and ends but didn't see that.
Overnight session being 18:00 to 09:30.
Does anyone have a simple way to do this?
You could use time-of-day subsetting with which.i = TRUE to find all of the observations you don't want. Then subset the original data with the negative of the result, so all the non-overnight observations will be dropped.
# assume data are in an xts object named 'x'
DayObs <- x["T09:30/T18:30", which.i = TRUE]
Overnight <- x[-DayObs,]
You might need to change the start and end times in the time-of-day subset call.
If you already have your data subset so that it only includes the overnight session, you can aggregate to "daily" using period.apply() and custom endpoints. Assuming your data are in an object named x:
# a session ends at row i when row i is before 18:00 and row i + 1 is at or after 18:00
ep <- c(0, which(diff(.indexhour(x) >= 18) == 1), nrow(x))
makeOHLC <- function(x) {
  op <- as.numeric(first(x))
  cl <- as.numeric(last(x))
  c(Open = op, High = max(x), Low = min(x), Close = cl)
}
period.apply(x, ep, makeOHLC)
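Putting the two steps together, a self-contained sketch on synthetic one-minute data might look like this. The data, the UTC time zone, and the rule for detecting session boundaries (a jump from a time before 18:00 to a time at or after 18:00) are assumptions for illustration; real data with gaps or holidays may need a different rule.

```r
library(xts)

# Synthetic 1-minute series covering three days (illustrative data only)
idx <- seq(as.POSIXct("2024-01-01 00:00:00", tz = "UTC"),
           by = "1 min", length.out = 3 * 24 * 60)
x <- xts(seq_along(idx), order.by = idx)

# Keep only the overnight session: 18:00 up to (not including) 09:30
minute_of_day <- .indexhour(x) * 60 + .indexmin(x)
overnight <- x[minute_of_day >= 18 * 60 | minute_of_day < 9 * 60 + 30, ]

# A session ends at row i when row i is before 18:00
# and row i + 1 is at or after 18:00
evening <- .indexhour(overnight) >= 18
ep <- c(0, which(diff(evening) == 1), nrow(overnight))

makeOHLC <- function(x) {
  c(Open = as.numeric(first(x)), High = max(x),
    Low = min(x), Close = as.numeric(last(x)))
}

# One OHLC row per overnight session, stamped with the session's last observation
overnight_ohlc <- period.apply(overnight, ep, makeOHLC)
```

The first and last rows here cover partial sessions, because the synthetic data starts at midnight and ends at 23:59.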
I want to create a data frame of 15-minute intervals over 24 hours, starting with a certain interval on several dates. I use a loop for that, but instead of the actual intervals it stores the number of seconds, which is not useful in my case. Is there any way to avoid this? I need the intervals to see how often timed events happen in these intervals. I found one similar question, but the answer concentrated on using lapply instead of apply, which isn't applicable here.
So here is a basic example:
begin<-as.POSIXct(rbind("2016-03-31 09:00:00","2016-04-12 09:00:00"))
end<-as.POSIXct(rbind("2016-03-31 09:15:00","2016-04-12 09:15:00"))
int<-as.interval(begin,end)
aufl<-duration(15, "mins")
Intervall=data.frame()
for (j in 1:length(int)) {
  for (i in 1:96) {
    Intervall[j, i] <- int_shift(int[j], aufl * (i - 1))
  }
}
Intervall
I created an answer, I hope this is what you are looking for. If not, please comment:
library(lubridate)
begin <- as.POSIXct(rbind("2016-03-31 09:00:00","2016-04-12 09:00:00"))
# copy begin time for loop
begin_new <- begin
# create duration object
aufl <- duration(15, "mins")
# count times for loop
times <- 24*60/15
# create dataframe with begin time
Intervall <- data.frame(begin,stringsAsFactors = FALSE)
for (i in 1:times){
  cat("test",i,"\n")
  # save old time for interval calculation
  begin_start <- begin_new
  # add 15 Minutes to original time
  begin_new <- begin_new + aufl
  cat(begin_new,"\n")
  # create an interval object between
  new_dur <- interval(begin_start,begin_new)
  # bind to original dataframe
  Intervall <- cbind(Intervall,new_dur)
}
# Add column names
vec_names <- paste0("v",c(1:(times+1)))
colnames(Intervall) <- vec_names
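If the end goal is only to count how often events fall into each 15-minute window, a plain data frame of start and end times may be enough. The base R sketch below is an alternative to storing lubridate interval objects, not a rewrite of the loop above. The events vector and the UTC time zone are made up for illustration.

```r
begin <- as.POSIXct("2016-03-31 09:00:00", tz = "UTC")
n_bins <- 24 * 60 / 15            # 96 windows of 15 minutes

# 97 boundaries delimit 96 consecutive windows
breaks <- seq(begin, by = "15 min", length.out = n_bins + 1)

# Hypothetical event times, for illustration only
events <- begin + c(60, 120, 16 * 60, 23.9 * 3600)

# Index of the window each event falls into (start inclusive, end exclusive)
bin <- findInterval(as.numeric(events), as.numeric(breaks))

# One row per window, with the number of events it contains
windows <- data.frame(
  start = breaks[-length(breaks)],
  end   = breaks[-1],
  n     = tabulate(bin, nbins = n_bins)
)
```

For several start dates, the same steps could be wrapped in a function and applied to each start time, for example with lapply.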
Let's say our dataset looks as follows;
demand <- ts(BJsales, start = c(2000, 1), frequency = 12)
plot(demand)
Now I pass the time series object to HoltWinters and plot the fitted data.
hw <- HoltWinters(demand)
plot(hw)
I want to take the difference between the demand and the fitted data to find the mean absolute deviation (MAD).
I took the demand by hw$x
I took the fit by hw$fit
accu_Holt_data <- as.data.frame(hw$x)
fore_holt <- as.data.frame(hw$fit)
differnce <- accu_Holt_data - fore_holt
I can't take the difference because the row counts are different.
Following up on my comment above, you can do something like this:
dta <- cbind(hw$fit[, 1], hw$x)
mean(abs(dta[, 2] - dta[, 1]), na.rm = TRUE)
There are two main issues with your approach. First, hw$fit is a multi-column time series object, where the first column, xhat, represents the filtered series. Second, the two time series have different indices. Hence the need for something like cbind to merge the time series.
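An equivalent way to line the two series up, sketched here under the same setup as the question, is to trim the observed series to the time range covered by the fitted values with window(). The variable names are illustrative.

```r
demand <- ts(BJsales, start = c(2000, 1), frequency = 12)
hw <- HoltWinters(demand)

# Fitted values: the xhat column of the fitted multi-column series
fitted_xhat <- hw$fitted[, "xhat"]

# Trim the observed series so it starts where the fitted values start
actual <- window(demand, start = start(fitted_xhat))

# Mean absolute deviation between observed and fitted values
mad_value <- mean(abs(actual - fitted_xhat))
```

This should give the same number as the cbind approach, since the NA rows dropped there are exactly the observations before the fitted series begins.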
I have an xts numeric matrix that includes multiple days of minute interval series. I need to calculate statistics for each day on the minute periods, add new columns, and then put all the day series back together.
I have tried apply.daily(), which calls my stats function with an xts matrix, for each day, but I can't figure how to return the modified day series back to the invoking function and reassemble the full set of modified data.
One solution that could work is to use endpoints(x, on = "day") in a loop, then call rbind to reassemble the processed day frames. Is there a better solution?
process = function(myxts) {
  day.indexes = endpoints(myxts, on="days")
  days = length(day.indexes) - 1
  l = list()
  list.index = 1
  for( i in 1:days ) {
    day.begin = day.indexes[i] + 1
    day.end = day.indexes[i+1]
    l[[list.index]] = ets.sym.process.daily(myxts[day.begin:day.end])
    list.index = list.index + 1
  }
  return(do.call("rbind", l))
}
You should be able to use some combination of do.call(rbind, lapply(split(myxts,"days"), myfun)). It's hard to be more specific without a reproducible example.
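As a rough illustration of that pattern, here is a sketch on synthetic hourly data. The per-day function add_daily_stats is hypothetical (it adds each observation's deviation from that day's mean) and stands in for whatever statistics you actually need.

```r
library(xts)

# Synthetic hourly series spanning several days (illustrative data only)
idx <- seq(as.POSIXct("2024-01-01 09:00:00", tz = "UTC"),
           by = "1 hour", length.out = 72)
myxts <- xts(matrix(as.numeric(seq_along(idx)), ncol = 1,
                    dimnames = list(NULL, "price")),
             order.by = idx)

# Hypothetical per-day function: returns the day's data with one extra column
add_daily_stats <- function(day) {
  dev <- day[, "price"] - mean(coredata(day[, "price"]))
  colnames(dev) <- "dev"
  merge(day, dev)
}

# Split by day, process each day, then stack the pieces back together
result <- do.call(rbind, lapply(split(myxts, f = "days"), add_daily_stats))
```

Each processed day keeps its own index, so rbind reassembles the rows in their original time order.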