[R+zoo]: Operations on time series with different temporal resolutions - r

I have two time series (sensor data) with different temporal resolutions. A time series from the class "xts / zoo" (TS1) includes hourly values and the other time series (TS2) has a better temporal resolution (one observation every 10 minutes). I.e. for TS1 I have 24 data points (observations) per day and for TS2 I have 144 data points per day.
When I calculate TS1-TS2 for one day I get a result with 24 data points (low temporal resolution). What I would like to achieve is to obtain a result with 144 data points (as TS2, better temporal resolution).
Is it possible to achieve this in R?
P.S.:
That's no a trivial problem because in an hourly interval I just have one observation from TS1 and 6 observations from TS2, so I could imagine this problem can be solved if one draws a fit line between every two points of TS1 and calculate the difference between the line and the data points from TS2. But I know no R Function to do this.

You can approximate missing values using na.approx for linear/constant approx or na.spline for polynomial one.
## new index to be used
new.index <-
seq(min(index(TS1)),max(index(TS1)), by=as.difftime(10,units='mins'))
## linear approx
TS1.new <- na.approx(merge(TS1 ,xts(NULL,new.index)))
Now you can susbtract your ts, (even if maybe you should check that they have same start dates)
TS2-TS1.new

Related

time series with multiple observations per unit of time

I have a dataset of the daily spreads of 500 stocks. My eventual goal is to make a model using extreme value theory. However as one of the first steps, I want to check my data for volatility clustering and leptokurticity. So I first want R to see my data as a time series and I want to plot my data. However, I only find examples of time series with only one observation per unit of time. Is there a possibility for R to treat my type of dataset as a time series? And what's the best way to plot it?

How to create and analyze a time series with variable test frequency in R

Here is a short description of the problem I am trying to solve: I have test data for multiple variables (weight, thickness, absorption, etc.) that are taken at varying intervals over time - no set schedule, sometimes a test a day, sometimes days might go between tests. I want to detect trends in each of these and alert stake holders when any parameter is trending up/down more than a certain amount. I first did a linear model between each variable's raw data and test time (I converted the test time to days or weeks since a fixed date) and create a table with slopes for each variable - so the stake holders can view one table for all variables and quickly see if any of them is raising concern. The issue was that the data for most variables is very noisy. Someone suggested using time series functions, separating noise and seasonality from the trends, and studying the trend component for a cleaner analysis. I started to look into this and see a couple concerns/questions already:
Time series analysis seems to require specifying a frequency - how do you handle this if your test data is not taken at regular intervals
If one gets over the issue in #1 above, decomposes the data, and gets the trend separated out (ie. take out particularly the random variation/noise), how would you then get a slope metric from that? Namely, if I wanted to then fit a linear model to the trend component of the raw data (after decomposing), what would be the x (independent) variable? Is there a way to connect the trend component of the ts-decompose function with the original data's x-axis data (in this case the actual test date/times, say converted to weeks or days from a fixed date)?
Finally, is there a better way of accomplishing what I explained above? I am only looking for general trends over time - say over 3 months of data, not day to day trends.
Time series are generally used to see if previous observations of a variable have influence on future observations. You would model under the assumption that the previous observations are able to predict the future observations. That is the reason for that most (not all) time series models require evenly spaced instances of training data. If your data is not only very noisy, but also not collected on a regular basis, then you should seriously consider if time series is the appropriate choice of modelling.
Time series analysis seems to require specifying a frequency - how do you handle this if your test data is not taken at regular intervals.
What you can do, is creating an aggregate by increasing the time bucket (shift from daily data to a weekly average for instance) such that every unit of time has an instance of training data. Following your final comment, you could create the average of the observations of the last 3 months of data instead from the observations.
If one gets over the issue in #1 above, decomposes the data, and gets the trend separated out (ie. take out particularly the random variation/noise), how would you then get a slope metric from that? Namely, if I wanted to then fit a linear model to the trend component of the raw data (after decomposing), what would be the x (independent) variable?
In the simplest case of a linear model, the independent variable is the unit of time corresponding to the prediction you are trying to make. However this is not always regarded a time series model.
In the case of an autoregressive model, this would be the previous observation of what you are trying to predict, something similar to y(t) = x(t-1), for instance multiplied by a smoothing factor. I encourage you to read Forecasting: principles and practice which is an excellent book on the matter.
Is there a way to connect the trend component of the ts-decompose function with the original data's x-axis data (in this case the actual test date/times, say converted to weeks or days from a fixed date)?
The function decompose.ts returns a list which includes trend. Trend is a vector of the estimated trend components corresponding to it's respective time value.
Let's create an example time series with linear trend
df <- data.frame(
date = seq(from = as.Date("2021-01-01"), to = as.Date("2021-01-10"), by=1)
)
df$value <- jitter(seq(from = 1, to = nrow(df), by=1))
time_series <- ts(df$value, frequency = 5)
df$trend <- decompose(time_series)$trend
> df
date value trend
1 2021-01-01 0.9170296 NA
2 2021-01-02 1.8899565 NA
3 2021-01-03 3.0816892 2.992256
4 2021-01-04 4.0075589 4.042486
5 2021-01-05 5.0650478 5.046874
6 2021-01-06 6.1681775 6.051641
7 2021-01-07 6.9118942 7.074260
8 2021-01-08 8.1055282 8.041628
9 2021-01-09 9.1206522 NA
10 2021-01-10 9.9018900 NA
As you see, the trend component already is an estimate of the dependent variable at the corresponding time. In decompose the estimate of trend is based on a moving average.

setting the parameter "frequency" for the multi-year average of hourly ozone data

I am trying to decompose a time series which is the monthly multi-year average of hourly ozone data. There are 288 data points (24 hours * 12 months). STL needs ts object to extract the components of time series. And ts has the parameter "frequency". As far as I know, it is the number of observations in one period. For example, it is 12 for monthly averaged temperature data.
What is the frequency for my case since If I use 288
data_ts=stl(ts(data,frequency = 288),s.window = "per"))
As expected, it throws the error "series is not periodic or has less than two periods".
BTW, I am aware of other methods to extract seasonality, but I also need to check the results with STL.
Best
Assuming you have hourly data, there are 24 periods per day, and 24*365.25 periods per year on average. Months would appear to be irrelevant for a natural phenomenon such as ozone. Similarly, weeks are irrelevant. So you just need seasonal periods of 24 and 24*265.35.
The mstl() function from the forecast package can handle multiple seasonal periods.
library(forecast)
data_ts <- mstl(msts(data, seasonal.periods = c(24, 24*365.25)))
However, if you actually have monthly data, then the frequency is 12.
data_ts <- mstl(ts(data, frequency = 12))
As you can see in the picture ACF, the ACF of your data clearly shows an annual seasonal trend.
it peaks at yearly lag at about 12, 24, etc.
If I am behalf on you, I will use freq=12 to decompose my time series data.

How are the intermediate values between observations in time series calculated in ts() in R?

I need help regarding how frequency affects my time series. I fit a daily time series data with frequency = 7 When I view the time series, I get intermediate values between days. I have data for 60 days. I created a time series for the same
ts.v1<- ts(V1, start = as.Date("2017-08-01"), end = as.Date("2017-09-30"), frequency = 7)
which gives me 421 values. I kind of understood that it has to do with the frequency as the value is a product of 7 and 60. What I need to know is- how are these calculated? And why? Isn't frequency used only to tell your time series whether the data is daily/weekly/annual etc.? (I referred to this)
Similarly in my ACF and PACF plots, the lag values are < 1 meaning there are seven values to make 1 'lag'. In that scenario, when I estimate arima(p,d,q) using these plots would the values be taken as lag x frequency?
Normally one does not use Date class with ts. With ts, the frequency is the number of points in a unit interval. Just use:
ts(V1, frequency = 7)
The times will be 1, 1 + 1/7, 1 + 2/7, ... You can later match them to the proper dates if need be.

Suggestions for clustering methods

I have two time series of meteorological measurements (i.e., X and Y). Both X and Y time series were constructed using daily measurements over a period of one year. By plotting X time series versus Y times series as a scatterplot and connecting all the points by date in ascending order, a closed loop is obtained representing the annual cycle. I have measurements at N locations and thus I have N loops (i.e., annual cycles) which I want to cluster to find those that have similar shapes.
With so many clustering methods, I am not sure which one will be more appropriate to use for this analysis (initially I was
thinking to use self-organizing maps).
Thank you very much for any suggestions.
Unless you have too many time series, I suggest to start with hierarchical clustering. It's easy to interpret because of the dendrogram.
For similarity, a cyclic version of DTW may be good, assuming that there is some delay between different locations.

Resources