Analyzing disparate time series in R

Are there tools in R that simplify the analysis of lagged and disparate time series? For example:
Daily values that only occur on weekdays (no entry on weekends or holidays)
vs
Bi-annual values
What I'm seeking is ways to:
Complete the missing daily values (with interpolated, or last value rolled forward, etc.)
Look for correlation between daily values and the bi-annual value (only the values that came before the bi-annual event)
As an example:
10-year treasury note interest rate (daily on non-holiday weekdays) as "X" and i-bond fixed rate as "Y" (set May 1/Nov 1)
Any suggestions appreciated.
I've built a test dataset manually for "x" and used functions in zoo to populate the missing values (interpolated), but I'm hoping for a less "brute-force" method for analyzing the disparate time series. I've used lag functions in the past, but those were on time series with matching intervals.
What Jon commented is what I had in mind:
expand a weekday time series to full week using missing value function(s) in zoo
Sample the daily value - say April 15 for May 1, Oct 15 for Nov 1
Ideally be able to automate - say loop through April 1-30, Oct 1-30 to look for the highest R-squared for the model of choice (linear, polynomial, etc.)
Not have to build discrete datasets for each of the above - but if that is what is required I can do it programmatically - I've done that with stock data in the past. I was looking for a more efficient means of selecting the datasets ad hoc during the analysis.
I don't have code to post, because I'm clueless as to the feature/function that would make the date selection I'm after possible (at least in R).
Thanks for the input so far. It has already been useful in helping me look at alternative methods to achieve what I'm after.
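The workflow described above (fill the weekday series out to every calendar day, then pick the daily value a fixed number of days before each May 1 / Nov 1 date and loop over that offset) could be sketched roughly as follows. This is only a minimal sketch with made-up data: the series x and y, the helper x_at() and the 1 to 30 day offset range are all hypothetical, and y is deliberately constructed from x so that the search has something to find.

```r
library(zoo)

set.seed(1)

# Hypothetical weekday-only daily series standing in for "X" (e.g. a yield).
days  <- seq(as.Date("2015-01-01"), as.Date("2019-12-31"), by = "day")
wdays <- days[!as.POSIXlt(days)$wday %in% c(0, 6)]   # drop Sundays (0) and Saturdays (6)
x     <- zoo(2 + cumsum(rnorm(length(wdays), sd = 0.02)), wdays)

# Expand to every calendar day by merging with an empty zoo series on the full index.
x_full   <- merge(x, zoo(, days))
x_locf   <- na.locf(x_full, na.rm = FALSE)     # last value rolled forward
x_interp <- na.approx(x_full, na.rm = FALSE)   # linear interpolation

# Hypothetical semiannual series "Y", set on May 1 and Nov 1 of each year.
# It is built from X 15 days earlier plus noise, purely for illustration.
y_dates <- sort(as.Date(c(paste0(2015:2019, "-05-01"), paste0(2015:2019, "-11-01"))))
x_at    <- function(d) coredata(x_locf)[match(d, index(x_locf))]  # hypothetical helper
y       <- 0.5 * x_at(y_dates - 15) + rnorm(length(y_dates), sd = 0.001)

# For each offset k, regress Y on the daily value k days before each reset date.
offsets <- 1:30
r2 <- sapply(offsets, function(k) {
  x_k <- x_at(y_dates - k)
  summary(lm(y ~ x_k))$r.squared
})
best <- offsets[which.max(r2)]   # offset with the highest R-squared
```

Because the lookup is done by date on the filled series, no separate dataset has to be built per offset; swapping lm(y ~ x_k) for another model (for example lm(y ~ poly(x_k, 2))) changes only one line. With only ten semiannual points the R-squared values will be noisy, so the "best" offset should be read with caution.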

Related

How to make prediction if trend/lag values are used as predictors?

I have a 3-year hourly sales dataset (9 am to 6 pm, Monday to Saturday only) and would like to make predictions for either a day or a week ahead with linear regression. The dataset excludes all national holidays, since the store is not open on those days. This time series shows strong intra-day and intra-week seasonality and a high peak around national holidays. So I extracted the following variables for feature engineering:
(1) time-related features: timestamp("2021-02-01 09:00:00"), hour, weekday, month, year
(2) one-week-lag variable
(3) a trend variable, obtained by decomposing the historical data and adding the "trend" component as a new predictor
decomp_ts <- decompose(ts)
data$trend <- decomp_ts$trend
(4) holiday dummy variables indicating the day before holidays
The model works fine, but I have run into two questions when deploying it with real-time data.
(1) I wanted to use predict() with a dataset covering the week ahead as the "newdata" argument, but I am not sure what to do with the trend variable. Should I run an additional prediction of the trend for the next week and add it back to the future dataset to predict the sales?
(2) How would you suggest generating the one-week lag data given the missing values caused by holidays? In my case, a store may open on only two days in Christmas week, so the one-week-lag variable for the following week will contain missing values for the days when the store was closed.
I look forward to any suggestions.
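For question (2), one possible approach is to build the lag by matching on the timestamp exactly 7 days earlier rather than by shifting rows, and to fall back to the value 14 days earlier when the store was closed. The sketch below is a minimal base R illustration under that assumption; the data, the closed day, the helper lag_by_time() and the two-week fallback rule are all hypothetical choices, not something taken from the question.

```r
# Hypothetical hourly sales: 09:00-17:00 start times, Monday to Saturday, four weeks.
hours <- seq(as.POSIXct("2021-02-01 09:00:00", tz = "UTC"),
             as.POSIXct("2021-02-27 17:00:00", tz = "UTC"), by = "hour")
lt    <- as.POSIXlt(hours)
open  <- lt$wday != 0 & lt$hour >= 9 & lt$hour <= 17   # no Sundays, opening hours only
sales <- data.frame(timestamp = hours[open])
sales$sales <- seq_len(nrow(sales))                    # placeholder values

# Pretend the store was closed on one weekday (a stand-in for a holiday).
sales <- sales[as.Date(sales$timestamp) != as.Date("2021-02-10"), ]

# Look up the sales value exactly `days` days before each timestamp; NA if absent.
lag_by_time <- function(df, days) {
  ts_num <- as.numeric(df$timestamp)
  df$sales[match(ts_num - days * 86400, ts_num)]
}

sales$lag_1w <- lag_by_time(sales, 7)
sales$lag_2w <- lag_by_time(sales, 14)

# Assumed fallback rule: use the two-week lag where the one-week lag is missing.
sales$lag_filled <- ifelse(is.na(sales$lag_1w), sales$lag_2w, sales$lag_1w)
```

Matching on time keeps each lag aligned to the same hour and weekday even when rows are missing. Whether a two-week fallback, an average of neighbouring weeks, or simply dropping those rows is the better choice depends on the data.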

Calculate rolling yearly differences in R with xts

I would like to calculate rolling yearly differences based on a daily time series held in an xts object in R. However, I currently see two issues:
The number of trading days per year is not constant.
There could be holes in the time series, e.g. one year could be missing in-between.
Are there functions available in the library to take such rolling differences without constant lags (e.g. a lag of 260 days could sometimes be off by 10 days)? Or would the correct approach here be to search, for each date, for the same date one year before (minus one or two days to account for weekends)?
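The second idea (look up, for each date, the same calendar date one year earlier and fall back to the nearest earlier observation) can be sketched in base R with findInterval(). This is only an illustration on plain vectors with made-up values; with an xts object the same vectors could be obtained via index() and coredata(). The 5-day tolerance max_gap is an assumption introduced here to stop a missing year from being silently bridged.

```r
# Hypothetical irregular daily series (sorted dates, with holes).
dates  <- as.Date(c("2019-01-02", "2019-01-03", "2019-07-01",
                    "2020-01-02", "2020-01-06", "2020-07-01", "2020-10-01"))
values <- c(100, 101, 105, 110, 111, 120, 125)

# Same calendar date one year earlier.
lt      <- as.POSIXlt(dates)
lt$year <- lt$year - 1
target  <- as.Date(lt)

# Position of the last observation on or before each target date (0 if none).
pos <- findInterval(as.numeric(target), as.numeric(dates))

# Accept the match only if it lies within max_gap days of the target (assumed tolerance).
max_gap <- 5
ok      <- pos > 0
ok[ok]  <- as.numeric(target[ok] - dates[pos[ok]]) <= max_gap

yoy     <- rep(NA_real_, length(values))
yoy[ok] <- values[ok] - values[pos[ok]]
yoy
```

Dates whose prior-year counterpart is missing, or too far from any earlier observation, stay NA instead of being compared against an unrelated date.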

Create time series in R with data for week days only

I have a data set containing the energy usage by day (date) from 01 Jan 2016 through to 07 Nov 2017 on a daily basis. One of the fields therein is a flag for non working day (nwd) with values of 0 and 1 indicating whether or not this is a working day.
The structure of the data looks like this :-
Date,usage,avgtemp,nwd
2016-01-01,28.5,105986,1
2016-01-02,29.2,105548,1
.
.
.
2017-11-07,98457,23.5,0
I created a data frame with these values - no problems. I then created 2 other data frames, one with nwd = 1 and the other with nwd = 0, for the non-working and working days respectively.
I am trying to generate a time series (using zoo or xts package - I am open to either) for each of these 2 data frames so that I can then do the non stationarity tests (adf/pp) on them and then do the arima modelling to build a forecast model of the usage.
Can I use a time series for such data sets when the data is not quite regular? Each of these series will have gaps - the work day series may have fewer than 5 consecutive days in a week if there are holidays in between. The same would apply to the non-working day series.
I cannot summarize this at a weekly level as I need to forecast them at a daily level and possibly at the half hourly level subsequently. I might even want to do 'ardl' modelling later using 'avgtemp' as one of the regressors.
P.S.
Found a post which to some extent is similar to mine but I can't seem to get it going based on the responses there :-
how to convert data frame into time series in R
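As a rough illustration of the setup described above, the two subsets can each be turned into a zoo object indexed by their actual dates, which keeps the gaps explicit rather than pretending the days are contiguous. The data below is a hypothetical one-month stand-in that reuses the column names from the question; the helper to_zoo() is likewise made up for this sketch.

```r
library(zoo)

# Hypothetical stand-in for the real data: one month, same column names as above.
set.seed(42)
dates  <- seq(as.Date("2016-01-01"), as.Date("2016-01-31"), by = "day")
energy <- data.frame(
  Date    = dates,
  usage   = round(runif(length(dates), 9e4, 1.1e5)),
  avgtemp = round(runif(length(dates), 15, 30), 1),
  nwd     = as.integer(as.POSIXlt(dates)$wday %in% c(0, 6))  # weekends flagged as non-working
)
energy$nwd[energy$Date == as.Date("2016-01-01")] <- 1L       # a holiday on a weekday

# One zoo object per subset, indexed by the actual dates, so the gaps stay visible.
to_zoo <- function(df) zoo(as.matrix(df[, c("usage", "avgtemp")]), order.by = df$Date)
z_work <- to_zoo(subset(energy, nwd == 0))
z_off  <- to_zoo(subset(energy, nwd == 1))

is.regular(z_work, strict = TRUE)   # are all steps between observations equal?
range(diff(index(z_work)))          # smallest and largest step, in days
```

If the values are then passed to a model as a plain vector (for example via coredata(z_work[, "usage"])), consecutive working days are treated as adjacent observations. Whether that is acceptable for the stationarity tests and ARIMA fit is a modelling assumption to check, not something the conversion guarantees.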

Designating dynamic start in time-series vectors in R

I have some problems with time-series designation of vectors in R.
I work with time series, and when I want to set a vector to a certain period, I feel quite confident about how to do it. I have simply done as follows: name <- ts(name, frequency=12, start=c(2007,1)). As you can see, I have monthly data.
I am making an R template for colleagues to use, and I want them to be able to carry out a recursive ARIMA regression from any given starting point. That is, I have a range of in-sample predicted values and I want to designate a start value that is n monthly observations after 2007 (or whatever start date is used), where n is the start value of the recursive regression.

first and last from the xts time-series package do exactly what you want, i.e. to get the first 2 months of an object x:
first(x, '2 months')
or the last 6 weeks:
last(x, '6 weeks')
Valid period.types are: secs, seconds, mins, minutes, hours, days, weeks, months, quarters, and years. As always you can find much more detailed information using ?xts::first.
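For a plain ts object like the one in the question, a similar effect can be had with window() from base R, using time() to turn "n observations after the start" into a start time. A minimal sketch, where the series x and the value of n are made up for illustration:

```r
# Hypothetical monthly series starting January 2007.
x <- ts(1:36, frequency = 12, start = c(2007, 1))

n <- 14                                        # number of observations to skip
x_from_n <- window(x, start = time(x)[n + 1])  # series beginning n observations after the start

start(x_from_n)    # year and month of the new first observation
length(x_from_n)   # remaining number of observations
```

Because the start is derived from time(x), the same line works whatever start date or frequency the original series was given, which may suit a template where colleagues only supply n.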

Role of frequency parameter in ts

How does the ts() function use its frequency parameter? What is the effect of assigning wrong values as frequency?
I am trying to use 1.5 years of website usage data to build a time series model so that I can forecast the usage for coming periods. I am using data at daily level. What should be the frequency here - 7 or 365 or 365.25?

The frequency is "the" period at which seasonal cycles repeat. I use "the" in scare quotes since, of course, there are often multiple cycles in time series data. For instance, daily data often exhibit weekly patterns (a frequency of 7) and yearly patterns (a frequency of 365 or 365.25 - the difference often does not matter).
In your case, I would assume that weekly patterns dominate, so I would assign frequency=7. If your data exhibits additional patterns, e.g., holiday effects, you can use specialized methods accounting for multiple seasonalities, or work with dummy coding and a regression-based framework.
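To make the effect of the frequency argument concrete, here is a small sketch on simulated daily data with a built-in weekly pattern (the data is invented for illustration). With frequency = 7, decompose() can estimate seven day-of-week effects; with frequency = 1 there is no seasonal period declared, and decompose() stops with an error.

```r
set.seed(1)
n     <- 7 * 78                                # about 1.5 years of daily observations
usage <- 100 + 10 * sin(2 * pi * seq_len(n) / 7) + rnorm(n)

y7  <- ts(usage, frequency = 7)                # one time unit = one week of 7 observations
dec <- decompose(y7)
round(dec$figure, 1)                           # the 7 estimated day-of-week effects

y1  <- ts(usage, frequency = 1)                # no seasonal period declared
res <- try(decompose(y1), silent = TRUE)
inherits(res, "try-error")                     # TRUE: decompose() needs a period > 1
```

The same series therefore yields a seasonal component or none at all depending only on the frequency it was tagged with, which is why the choice matters for any seasonal method applied afterwards.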

Here, the frequency parameter is not a frequency that you can observe in the data of your time series. Instead, you have to specify the frequency at which samples of the time series were taken. In your case, this is simply 1 day, or 1.
The value you give here will influence the results of later analyses (for example, average requests per time unit, or a Fourier transform to find the actual frequencies in the data). E.g. if you wanted all your results in units of hours instead of days, you would pass 24 instead of 1 as the frequency, because your samples were taken once every 24 hours.