How to calculate daily standard deviation from data collected hourly in R?

I calculated daily means from hourly data for all four variables in the dataset without any issues using the xts function apply.daily(df.xts, FUN=mean), which gives me daily averages of each variable. However, I am not able to do the same for the standard deviation. When using apply.daily(df.xts, FUN=sd) I just get the index (time stamps) and a single column of values as output. What am I missing? I have the same issue with var.
Thank you

See issues 124 and 128 on GitHub for the reasoning.
Solution (for now?): use an extra package called matrixStats. The code below returns the standard deviation per column of an xts object.
apply.daily(df.xts, matrixStats::colSds)
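If you would rather not add a dependency, a minimal sketch of the same idea is to compute the per-column standard deviation yourself inside apply.daily. The data is synthetic and the names hourly, vals and daily_sd are made up for illustration. I am assuming apply.daily accepts a function that returns one value per column, which is how colSds is used above.

```r
library(xts)

# Synthetic data: 48 hourly observations (two days), two variables
idx <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC") + 3600 * (0:47)
set.seed(1)
vals <- cbind(temp = rnorm(48), hum = rnorm(48))
hourly <- xts(vals, order.by = idx, tzone = "UTC")

# One standard deviation per column for each day
daily_sd <- apply.daily(hourly, function(d) apply(coredata(d), 2, sd))
daily_sd
```

The same pattern should work for var by swapping sd for var inside the inner apply call.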

Related

Analyzing disparate time series in R

Are there tools in R that simplify analysis of lagged and disparate time series? For example:
Daily values that only occur on weekdays (no entry on weekends or holidays)
vs
Bi-annual values
What I'm seeking is ways to:
Complete the missing daily values (with interpolated, or last value rolled forward, etc.)
Look for correlation between daily values and the bi-annual value (only the values that came before the bi-annual event)
As an example:
10-year treasury note interest rate (daily on non-holiday weekdays) as "X" and i-bond fixed rate as "Y" (set May 1/Nov 1)
Any suggestions appreciated.
I've built a test dataset manually for "x" and used functions in zoo to populate the missing values (interpolated), but I'm hoping for a less "brute-force" method for analyzing the disparate time series. I've used lag functions in the past, but those were on time series with matching intervals.
What Jon commented is what I had in mind:
expand a weekday time series to full week using missing value function(s) in zoo
Sample the daily value - say April 15 for the May 1, Oct 15 for Nov 1
Ideally be able to automate - say loop through April 1-30, Oct 1-30 to look for highest RSqr for the model of choice (linear, polynomial, etc.)
Not have to build discrete datasets for each of the above - but if that is what is required I can do it programmatically - I've done that with stock data in the past. I was looking for a more efficient means of selecting the datasets ad hoc during the analysis.
I don't have code to post, because I'm clueless as to the feature/function that would make the date selection I'm after possible (at least in R).
Thanks for the input so far. It has already been useful in helping me look at alternative methods to achieve what I'm after.
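A rough sketch of the workflow described above, using zoo. Everything here is made up for illustration: the weekday series is a random walk rather than real treasury rates, the four bi-annual values in y are invented, and the names x_full, r2_by_lag and best_lag are hypothetical. It rolls the last weekday value forward over weekends, then scans look-backs of 1 to 30 days before each bi-annual date without building a separate dataset for each one.

```r
library(zoo)

# Synthetic weekday-only daily series "x" (random walk, not real rates)
all_days <- seq(as.Date("2020-01-01"), as.Date("2021-12-31"), by = "day")
weekdays_only <- all_days[!(format(all_days, "%u") %in% c("6", "7"))]
set.seed(1)
x <- zoo(cumsum(rnorm(length(weekdays_only))), weekdays_only)

# Expand to every calendar day, then roll the last value forward
x_full <- na.locf(merge(x, zoo(, all_days), all = TRUE))

# Hypothetical bi-annual series "y" with the dates it is set on
y_dates <- as.Date(c("2020-05-01", "2020-11-01", "2021-05-01", "2021-11-01"))
y <- c(0.5, 0.2, 0.4, 0.1)

# R-squared of a linear model for each look-back of 1..30 days
r2_by_lag <- sapply(1:30, function(k) {
  x_at <- coredata(x_full)[match(y_dates - k, index(x_full))]
  summary(lm(y ~ x_at))$r.squared
})
best_lag <- which.max(r2_by_lag)
```

With only four bi-annual points the R-squared values mean very little. The point is only that the date selection can be done by matching on the index inside the loop. Swapping lm for another model, or na.locf for an interpolating fill, should not change the structure.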

dealing with NA in seasonal cycle analysis R

I have a timeseries of monthly data with lots of missing datapoints, set to NA. I want to simply subtract the annual cycle from the data, ignoring the missing entries. It seems that the decompose function can't handle missing data points, but I have seen elsewhere that the seasonal package is suggested instead. However I am also running into problems there too with the NA.
Here is a minimum reproducible example of the problem using a built in dataset...
library(seasonal)
# set range to missing NA in Co2 dataset
c2<-co2
c2[c2>330 & c2<350]=NA
seas(c2,na.action=na.omit)
Error in na.omit.ts(x) : time series contains internal NAs
Yes, I know! That's why I asked you to omit them! Let's try this:
seas(c2,na.action=na.x13)
Error: X-13 run failed
Errors:
- Adding MV1981.Apr exceeds the number of regression effects
allowed in the model (80).
Hmmm, interesting, no idea what that means, okay, please just exclude the NA:
seas(c2,na.action=na.exclude)
Error in na.omit.ts(x) : time series contains internal NAs
That didn't help much! And for good measure:
decompose(c2)
Error in na.omit.ts(x) : time series contains internal NAs
I'm on the following:
R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
Why is leaving out NA such a problem? I'm obviously being completely stupid, but I can't see what I'm doing wrong with the seas function. Happy to consider an alternative solution using xts.
My first solution was simply to calculate the seasonal cycle manually, converting to a data frame to subtract the vector and then transforming back:
# seasonal cycle
scycle=tapply(c2,cycle(c2),mean,na.rm=T)
# converting to df
df=tapply(c2, list(year=floor(time(c2)), month = cycle(c2)), c)
# subtract seasonal cycle
for (i in 1:nrow(df)){df[i,]=df[i,]-scycle}
# convert back to timeseries
anomco2=ts(c(t(df)),start=start(c2),freq=12)
Not very pretty, and not very efficient either.
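For what it's worth, the same subtraction can be sketched without the data frame round trip by looking up each observation's monthly mean through cycle(). This uses base R only; the variable names follow the code above.

```r
# Same masked series as above
c2 <- co2
c2[c2 > 330 & c2 < 350] <- NA

# Mean of each calendar month, ignoring NAs
scycle <- tapply(c2, cycle(c2), mean, na.rm = TRUE)

# Look up each observation's monthly mean and subtract it in one step
anomco2 <- c2 - as.numeric(scycle)[as.integer(cycle(c2))]
```

The result keeps the ts attributes of c2, and the NAs stay where they were in the original series.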
The comment by missuse led me to a near-duplicate question that I had missed, "Seasonal decompose of monthly data including NA in r", which suggested the zoo package. It seems to work really well for additive series:
library(zoo)
c2=co2
c2[c2>330&c2<350]=NA
d=decompose(na.StructTS(c2))
plot(co2)
lines(d$x,col="red")
shows that the series is very well reconstructed through the missing period.
The output of decompose has the trend and seasonal cycle available. I wish I could transfer my bounty to user https://stackoverflow.com/users/516548/g-grothendieck for this helpful response. Thanks to user missuse too.
However, if the missing portion is at the end of the series, the software has to extrapolate the trend and has more difficulties. The original series (in black) maintains the trend, while the trend is smaller in the reconstructed series (red):
c2=co2
c2[c2>350]=NA
d=decompose(na.StructTS(c2))
plot(co2)
lines(d$x,col="red")
Lastly, if instead the missing portion is at the start of the series, the software is unable to extrapolate backwards in time and throws an error... I feel another SO question coming on...
c2=co2
c2[c2<330]=NA
d=decompose(na.StructTS(c2))
Error in StructTS(y) :
the first value of the time series must not be missing
You could just use an algorithm that fills in the missing data first.
(e.g. from package imputeTS or zoo)
imputeTS for example has extra imputation algorithms for seasonal time series e.g.:
x <- na_seadec(co2)
Another good option for seasonal data:
x <- na_kalman(co2)
And now just go on without the missing data.
An important hint from Adrian Tompkins (see also comment below):
This works best when the missing data is somewhere in the middle. For a lot of leading NAs the method is not a good choice. In this case it fills the NAs, but it is not able to extrapolate the trend backwards:
c2<-co2
c2[c2<330]<-NA
c3<-na_kalman(c2)
c4<-na_seadec(c2)
plot(co2)
lines(c3,col="blue")
lines(c4,col="red")

NA in time series handling?

I am dealing with a forecast of time series in R. I have several questions:
I would like to ask how we can handle missing values in time series?
I guess we can somehow interpolate them?
Can you suggest some solution in R for this?
One solution is the imputeTS package.
library(imputeTS)
# amount of NA
table(is.na(tsAirgap))
# Kalman smoothing imputation (one of the best)
imp_tsAirgap <- na_kalman(tsAirgap)
# Imputed time-series, no NAs
table(is.na(imp_tsAirgap))
If you would like to delete the missing values and their corresponding time-stamps, you can also use the na.remove function within the tseries package.
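To make the difference between the options concrete, here is a small sketch with the zoo package on a made-up six-day series. The names z, z_lin, z_locf and z_drop are only for illustration.

```r
library(zoo)

# Tiny made-up daily series with gaps
z <- zoo(c(1, NA, 3, NA, NA, 6), as.Date("2021-01-01") + 0:5)

z_lin  <- na.approx(z)     # linear interpolation between neighbours
z_locf <- na.locf(z)       # last observation carried forward
z_drop <- z[!is.na(z)]     # drop missing values and their time stamps

cbind(original = z, linear = z_lin, locf = z_locf)
```

Interpolation gives 1, 2, 3, 4, 5, 6 here, while carrying the last observation forward gives 1, 1, 3, 3, 3, 6. Which one is appropriate depends on what the series measures.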

Check Seasonality in time series

I have 2 years of hourly data and I want to check for seasonality.
1. Decomposing the series shows seasonality, but since decomposition alone is not enough, what else can I use to check seasonality in R?
2. I tried hourly seasonality, but I am not sure of the seasonal period. How do I determine the frequency in R?
Frequency is the number of observations per unit of time. But, in my short experience, the unit of time depends on the event you are studying. For example, if you have monthly data of a yearly seasonal event (like the flowering of some plants) and you sampled 5 times each month, frequency will be 5*12. I suggest you decompose your time series and check for seasonality there. You can use ts, stl and plot.stl. Try to adjust the parameters as best as you can, but also check what happens when you change them.
Please read through the link below. If you want to keep multiple seasonal periods in the data, you can also paste a sample of your data here for further suggestions.
https://robjhyndman.com/hyndsight/seasonal-periods/
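One way to get a first guess at the period, sketched here with base R only, is to look for the largest peak in the periodogram and then pass that period to ts before calling stl. The series below is synthetic (a 24-hour sine wave plus noise), and the names hours, y and dominant_period are made up for illustration. Real hourly data may well show several peaks, for example daily and weekly.

```r
# Synthetic hourly series: 60 days with a 24-hour cycle plus noise
set.seed(42)
n <- 24 * 60
hours <- seq_len(n)
y <- 10 + 3 * sin(2 * pi * hours / 24) + rnorm(n)

# Raw periodogram; frequencies are in cycles per observation
sp <- spec.pgram(y, taper = 0, detrend = TRUE, plot = FALSE)
dominant_period <- 1 / sp$freq[which.max(sp$spec)]
dominant_period

# Use the estimated period as the ts frequency and decompose
fit <- stl(ts(y, frequency = round(dominant_period)), s.window = "periodic")
plot(fit)
```

Plotting acf(y, lag.max = 72) is a quick cross-check: a seasonal series should show spikes at multiples of the period.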

Transforming to time series object before ARIMA

Is it always mandatory to transform the csv file into time series object before performing auto.arima()?
x<-read.csv("c://text.csv")
text<-ts(x,frequency=12, start=c(1946,1))
test<-auto.arima(text)
Is the transformation mandatory for ARIMA?
Also, is there any minimum number of past lagged terms required for performing effective forecasting through ARIMA?
In the source code of auto.arima, on line 16, you can read x <- as.ts(x), so the function first converts the data to a time series object; it is therefore not mandatory to supply one.
The number of lags is determined using a performance criterion (mostly AIC); the procedure starts at lag 0 and goes up to the maximum allowed lag order.
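One caveat worth checking before skipping the conversion: as.ts on a plain vector has no way of knowing the seasonal period, so the frequency defaults to 1. As far as I understand, auto.arima only considers seasonal models when the frequency is greater than 1, so for monthly data it still makes sense to build the ts object yourself. A small base R illustration, using the built-in AirPassengers data as a stand-in for the CSV column:

```r
# A plain numeric vector, e.g. one column read from a CSV file
x <- as.numeric(AirPassengers)

# What as.ts() produces when no time information is supplied
frequency(as.ts(x))      # 1: no seasonal period is known

# Supplying the period explicitly, as in the question
x_ts <- ts(x, frequency = 12, start = c(1949, 1))
frequency(x_ts)          # 12: monthly data with a yearly cycle
```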
