Correctly lagging with irregularly spaced data - R

I have some irregularly spaced data, say table A. The frequency is every 2-5 days. I have another data set, table B, which has entries for every weekday. I want to run the following regression:
A_{t} = alpha + beta1 * B_{t-2 months} + error
where, when I lag B, if there isn't an observation exactly 60 days ago, e.g. if 60 days ago was a Sunday, then I just pick the next Monday. I can of course construct this with a for loop, but what is the R way? Currently the data are stored in MySQL tables and I am using RMySQL to access them.
Thanks for the help.

You want the zoo package and its documentation --- which has numerous examples about how to aggregate, align, transform, ... data along the time dimension.
It is a hard problem. You'll have to think about how you do it --- but at least appropriate and powerful tools exist. There are also plenty of usage examples here and on the R lists.
At a minimum, you could use na.locf() to carry your last irregular observation forward to the next regular one (after having merged the data based on daily dates). You can then use lag() operators on the regular data. Also, the packages dynlm and dyn facilitate modeling with lm() on data held in zoo objects by adding lags etc. to the formula interface.
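For concreteness, here is a minimal sketch of that merge-and-fill approach with zoo; the object names, the columns pulled from MySQL, and the fixed 60-day stand-in for "2 months" are assumptions based on the question, not tested code:
library(zoo)
# A: irregular series (every 2-5 days), B: weekday series -- both as zoo objects
# indexed by Date, e.g. built from the RMySQL result data frames (columns assumed):
# A <- zoo(resA$value, as.Date(resA$date))
# B <- zoo(resB$value, as.Date(resB$date))
# Expand B onto a full daily grid and fill the gaps with the *next* available
# weekday value ("if 60 days ago was a Sunday, pick the next Monday"):
daily   <- seq(start(B), end(B), by = "day")
B_daily <- na.locf(merge(B, zoo(, daily)), fromLast = TRUE)
# For every observation date of A, look up B roughly two months (60 days) earlier;
# dates outside B's range become NA and are dropped by lm()
B_lag60 <- coredata(B_daily)[match(index(A) - 60, index(B_daily))]
fit <- lm(coredata(A) ~ B_lag60)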

Related

Time series benchmarking/reconciliation and revisions - are there methods that minimise revisions?

I am using the tempdisagg R package for benchmarking quarterly time series to annual time series from different (more trusted) sources (by temporally disaggregating the annual data using the quarterly data as indicator series).
The time series are sub series and sum series, and these identities should hold after benchmarking, too. I.e. if
S = A + B - C,
then
predict(td(S,...)) = predict(td(A, ...)) + predict(td(B, ...)) - predict(td(C,...)).
I have tried the Denton-Cholette and the Chow-Lin-maxlog methods.
This is to be carried out regularly, so ideally I would like a disaggregation method that minimises revisions. I have tried removing up to ten years' worth of data from various time series to see if any method outperforms the others in terms of minimising revisions, but it seems that it depends on a combination of time series volatility and method, and I can't reach a conclusion.
It would be possible to use a combination of different methods on the sub series, I guess.
Is there any comprehensive knowledge on benchmarking and revisions?
I have attached some graphs in an attempt to illustrate the problem. Ideally, we would like to see one line that just changes colour according to the various years of data, as in the first two graphs until about 2015. The black lines in the graphs are the raw data.
Your question seems to consist of two independent parts.
You mention that the identity S = A + B - C can be achieved with our tempdisagg R-library by predict(td(S,...)) = predict(td(A, ...)) + predict(td(B, ...)) - predict(td(C,...)).
This is usually not the case. You will have to apply td() to three of the four series and compute the fourth series implicitly (e.g. S = predict(td(A, ...)) + predict(td(B, ...)) - predict(td(C,...))).
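For illustration, a hedged sketch of that implicit approach; the object names (A.a, B.a, C.a for the annual series and A.q, B.q, C.q for the quarterly indicators) are placeholders, not objects from your data:
library(tempdisagg)
# benchmark the three sub series directly (Denton-Cholette takes exactly one
# indicator and no intercept in the formula)
A.bench <- predict(td(A.a ~ 0 + A.q, method = "denton-cholette"))
B.bench <- predict(td(B.a ~ 0 + B.q, method = "denton-cholette"))
C.bench <- predict(td(C.a ~ 0 + C.q, method = "denton-cholette"))
# derive the sum series implicitly so that the identity S = A + B - C holds
S.bench <- A.bench + B.bench - C.bench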
To answer your question about the revisions, a reproducible example would be handy. You could create such an example with the example time series in our tempdisagg library, which are accessible by data(tempdisagg).
Since the Chow-Lin method is based on a regression (in your case involving the annual time series), the regression parameters will change with every new or revised annual value. As a consequence, all values of the resulting quarterly series will be revised. When applying the Denton method, no parameters have to be estimated, so only the most recent years of the resulting quarterly series are prone to revision. If your focus is on the whole resulting quarterly time series, far fewer quarters are prone to revision when using the Denton method compared to the Chow-Lin method. If your focus is on the revisions of the most recent quarters/years, it's a different story and I doubt that there is a clear-cut answer.
I hope this helps. Our paper Temporal Disaggregation of Time Series contains an overview of the different temporal disaggregation methods implemented in the tempdisagg library.

Interpolate a high frequency time series

I have a physical time series spanning two years of sample data at a 30-minute frequency, but there are multiple wide intervals of missing data, as you can see here:
I tried the function na.interp from the forecast package, with a bad result (shown above):
sapply(dataframeTS[2:10], na.interp)
I'm looking for a more useful method.
UPDATE:
Here is more info about the pattern I want to capture, specifically the raw data. This subsample is from May.
You might want to try the **imputeTS** package. It's an R package dedicated to time series missing value imputation.
The na_seadec(), na_seasplit(), and na_kalman() methods might be interesting here.
There are many more algorithm options - you can find a list in this paper about the package.
In this specific case I would try:
na_seasplit(yourData)
or
na_kalman(yourData)
or
na_seadec(yourData)
Be aware that you may need to supply the seasonality information correctly with the time series (you have to create a time series (ts object) and set the frequency parameter).
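For example, a minimal sketch assuming a daily seasonal pattern in the 30-minute data (48 observations per day) and a value column named "value"; adjust the frequency to whatever seasonality you expect:
library(imputeTS)
x  <- ts(dataframeTS$value, frequency = 48)  # 48 half-hour observations per day
x2 <- na_seadec(x)                           # seasonally decomposed imputation
# visual check of the imputed stretches (imputeTS >= 3.0)
ggplot_na_imputations(x, x2)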
It still might not work out at all; you will have to try.
(if you can provide the data I'll also give it a try)

R plot data.frame to get more effective overview of data

At work when I want to understand a dataset (I work with portfolio data in life insurance), I would normally use pivot tables in Excel to look at e.g. the development of variables over time or dependencies between variables.
I remembered from university the nice R function with which you can plot every column of a data frame against every other column, as in:
For the dependency between issue.age and duration this plot is actually interesting, because you can clearly see that high issue ages come with shorter policy durations (since there is a maximum age for each policy). However, the plots involving the issue year iss.year are much less "visual". In fact, you can't see anything in them. I would like to see at a glance whether the distribution of issue ages has changed over the different issue years, something like
where you could see immediately that the average age of newly issued policies has been increasing from 2014 to 2016.
I don't want to write code that needs to be customized for every dataset that I put in because then I can also do it faster manually in Excel.
So my question is, is there an easy way to plot each column of a matrix against every other column with more flexible chart types than with the standard plot(data.frame)?
The ggpairs() function from the GGally library does this. It has a lot of capability for visualizing columns of all different types, and provides a lot of control over what to visualize.
For example, here is a snippet based on the package vignette:
library(GGally)
data(tips, package = "reshape")
ggpairs(tips)
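As a hedged sketch for your portfolio case (the data frame portfolio and its columns issue.age, duration and iss.year are assumed from your description): turning the issue year into a factor makes ggpairs() draw per-year boxplots and histograms for the age distribution, which is roughly the at-a-glance view you describe.
# GGally is already loaded above
portfolio$iss.year <- factor(portfolio$iss.year)  # treat the issue year as discrete
ggpairs(portfolio[, c("iss.year", "issue.age", "duration")])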

R gstat spatio-temporal variogram kriging

I am trying to use the function variogramST from the R package gstat to calculate a spatio-temporal variogram.
There are 12 years of data with 20'000 data points at irregular points in space and time (no full or partial grid). I have to use the STIDF class from the spacetime package for an irregular data set. I would like a temporal semivariogram with reference points at 0, 90, 180, 270 days, up to a few years. Unfortunately, both computational and memory problems occur. When the command
samplevariogram<-variogramST(formula=formula_gstat,data=STIDF1)
is run without further arguments, the semivariogram takes into account only very short time periods as reference points, which does not seem to capture the inherent data structure appropriately.
There are more arguments for this function at the user's disposal, but I am not sure how to parametrize them correctly: tlag, tunit, twindow. Specifically, I am wondering how they interact and how to achieve my goal as described above. So I tried the following code
samplevariogram<-variogramST(formula=formula_gstat,data=STIDF1,tlag= ...., tunit=... , twindow= ...)
The following code is not working due to memory issues on my computer with 32 GB of RAM:
samplevariogram<-variogramST(formula=formula_gstat,data=STIDF1,tlag=90*(0:20), tunit="days")
but it might perhaps be flawed otherwise. Furthermore, the latter line of code also seems infeasible in terms of computation time.
Does someone know how to specify the variogramST function from the gstat package correctly, aiming at the desired time intervals?
Thanks
If I understand correctly, the twindow argument should be the number of observations to include when calculating the space-time variogram. Assuming your 20k points are distributed more or less evenly over the 12 years, you have about 1600 points per year. Again, assuming I understand things correctly, if you wanted to include about two years of data in the temporal autocorrelation calculations, you would do:
samplevariogram<-variogramST(formula=formula_gstat,data=STIDF1,tlag=90*(0:20), tunit="days",twindow=2*1600)

Fast Fourier Transform and Clustering of Time Series

I'm making a project connected with identifying the dynamics of sales. This is what a piece of my database looks like: http://imagizer.imageshack.us/a/img854/1958/zlco.jpg. There are three columns:
Product - the product group
Week - time since the launch of the product (in weeks), first 26 weeks
Sales_gain - how the sales of the product change by week
In the database there are 3,302 observations = 127 time series.
My aim is to cluster the time series into groups that show different sales dynamics. Before clustering, I want to use the Fast Fourier Transform to turn the time series into vectors, taking into consideration amplitude etc., and then use a distance algorithm to group the products.
This is my first time dealing with FFT and clustering, so I would be grateful if anybody could point out the steps I have to take before/after using FFT to group the sales dynamics. I want to do all steps in R, so it would be wonderful if somebody could say which procedures I should use for each step.
This is what my time series look like now: http://imageshack.com/a/img703/6726/sru7.jpg
Please note that I am relatively new to time series analysis (that's why I cannot put here my code) so any clarity you could provide in R or any package you could recommend that would accomplish this task efficiently would be appreciated.
P.S. Instead of FFT I found code for DWT here -> www.rdatamining.com/examples/time-series-clustering-classification but cannot use it on my database and time series (suggest R to analyze new time series after 26 weeks). Can somebody explain it to me?
You may have too little data for FFT/DWT to make sense. DTW may be better, but I also don't think it makes sense for sales data - why would there be an x-week temporal offset from one location to another? It's not as if the data were captured at unknown starting weeks.
FFT and DWT are good when your data has interesting repetitive patterns and you have A) a good temporal resolution (for audio data, e.g. 16000 Hz - I am talking about thousands of data points!) and B) no idea of what frequencies to expect. If you know you will have e.g. weekly patterns (e.g. no sales on Sundays), then you should filter them out with other algorithms instead.
DTW (dynamic time warping) is good when you don't know when the events start and how they align. Say you are capturing heart measurements. You cannot expect the hearts of two subjects to beat in synchronization. DTW will try to align this data, and may (or may not) succeed in matching e.g. an anomaly in the heartbeat of two subjects. In theory...
Maybe you don't need specialized time-series methods here at all, because
A) your data has too low a temporal resolution, and
B) your data is already perfectly aligned.
Maybe all you need is to spend more time preprocessing your data, in particular on normalization, to be able to capture similarity.
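As a hedged sketch of that last suggestion (the data frame name sales and the choice of four clusters are assumptions; the column names follow your description):
# one row per product, one column per week (127 x 26 matrix)
m <- tapply(sales$Sales_gain, list(sales$Product, sales$Week), mean)
# z-score each series so clustering reflects the shape of the sales dynamics,
# not the overall level
m_scaled <- t(scale(t(m)))
# hierarchical clustering on Euclidean distances between the normalized series
hc     <- hclust(dist(m_scaled), method = "ward.D2")
groups <- cutree(hc, k = 4)
table(groups)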

Resources