Generate artificial sales data series in R with an ARIMA model & add noise

I'm extremely new to R and pretty new to time series analysis. I have been looking for the answer online but I can't seem to find it.
I need to generate an artificial time series that would represent sales data of a product X. I need to make a few variations: an independent series (I can do that with Excel), a series with AR(1) with rho = 0.8, and a version with added noise.
For the AR(1), I know I can use
ar.sim <- arima.sim(model = list(ar = c(0.8)), n = 52)
ar.sim
This generates 52 data points along the lines of:
[1] -2.080871400 -1.327156528 -0.672868162 0.521280151 -0.200243250
[6] 0.808095642 2.208014641 4.266957623 1.682261358 0.796715498
Question 1: How do I make it start from a certain level, e.g. from 300 products sold? I thought of n.start, but that seems to be something else. Do I just add 300 every time? That seems wrong.
Question 2: How can I add noise to this series?
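A rough sketch of one way to do both (not an answer from the thread; the scale factors 20 and 5 below are arbitrary assumptions): arima.sim() returns a zero-mean series, so you can shift it to a base level of 300 and then add independent noise on top with rnorm().
set.seed(1)
ar.sim <- arima.sim(model = list(ar = 0.8), n = 52)  # zero-mean AR(1)
sales <- 300 + 20 * ar.sim                           # shift/scale to a level around 300 units
noisy <- sales + rnorm(52, mean = 0, sd = 5)         # add white noise on top
plot.ts(cbind(sales, noisy))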

Related

How to create and analyze a time series with variable test frequency in R

Here is a short description of the problem I am trying to solve: I have test data for multiple variables (weight, thickness, absorption, etc.) that are taken at varying intervals over time - no set schedule; sometimes there is a test a day, sometimes days might pass between tests. I want to detect trends in each of these and alert stakeholders when any parameter is trending up or down more than a certain amount. I first fit a linear model between each variable's raw data and test time (I converted the test time to days or weeks since a fixed date) and created a table with slopes for each variable, so the stakeholders can view one table for all variables and quickly see if any of them is raising concern. The issue is that the data for most variables is very noisy. Someone suggested using time series functions, separating noise and seasonality from the trends, and studying the trend component for a cleaner analysis. I started to look into this and already see a couple of concerns/questions:
Time series analysis seems to require specifying a frequency - how do you handle this if your test data is not taken at regular intervals?
If one gets over the issue in #1 above, decomposes the data, and gets the trend separated out (i.e. removes, in particular, the random variation/noise), how would you then get a slope metric from that? Namely, if I wanted to then fit a linear model to the trend component of the raw data (after decomposing), what would be the x (independent) variable? Is there a way to connect the trend component of the ts-decompose function with the original data's x-axis data (in this case the actual test date/times, say converted to weeks or days from a fixed date)?
Finally, is there a better way of accomplishing what I explained above? I am only looking for general trends over time - say over 3 months of data, not day-to-day trends.
Time series models are generally used to see whether previous observations of a variable influence future observations. You model under the assumption that the previous observations are able to predict the future ones. That is why most (not all) time series models require evenly spaced instances of training data. If your data is not only very noisy, but also not collected on a regular basis, then you should seriously consider whether time series is the appropriate choice of model.
Time series analysis seems to require specifying a frequency - how do you handle this if your test data is not taken at regular intervals?
What you can do is create an aggregate by increasing the time bucket (shifting from daily data to a weekly average, for instance) so that every unit of time has an instance of training data. Following your final comment, you could work with the average of the observations over the last 3 months of data instead of the raw observations.
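A minimal sketch of that kind of aggregation, assuming a data frame df with a date column and a measurement column called weight (both names are made up for illustration):
## bucket irregularly spaced tests into weekly averages
df$week <- as.Date(cut(df$date, breaks = "week"))
weekly <- aggregate(weight ~ week, data = df, FUN = mean)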
If one gets over the issue in #1 above, decomposes the data, and gets the trend separated out (i.e. removes, in particular, the random variation/noise), how would you then get a slope metric from that? Namely, if I wanted to then fit a linear model to the trend component of the raw data (after decomposing), what would be the x (independent) variable?
In the simplest case of a linear model, the independent variable is the unit of time corresponding to the prediction you are trying to make. However, this is not always regarded as a time series model.
In the case of an autoregressive model, it would be the previous observation of what you are trying to predict, something similar to y(t) = y(t-1), for instance multiplied by a smoothing factor. I encourage you to read Forecasting: Principles and Practice, which is an excellent book on the matter.
Is there a way to connect the trend component of the ts-decompose function with the original data's x-axis data (in this case the actual test date/times, say converted to weeks or days from a fixed date)?
The function decompose() returns a list which includes trend: a vector of the estimated trend components, each corresponding to its respective time value.
Let's create an example time series with a linear trend:
df <- data.frame(
  date = seq(from = as.Date("2021-01-01"), to = as.Date("2021-01-10"), by = 1)
)
df$value <- jitter(seq(from = 1, to = nrow(df), by = 1))
time_series <- ts(df$value, frequency = 5)
df$trend <- decompose(time_series)$trend
> df
         date     value    trend
1  2021-01-01 0.9170296       NA
2  2021-01-02 1.8899565       NA
3  2021-01-03 3.0816892 2.992256
4  2021-01-04 4.0075589 4.042486
5  2021-01-05 5.0650478 5.046874
6  2021-01-06 6.1681775 6.051641
7  2021-01-07 6.9118942 7.074260
8  2021-01-08 8.1055282 8.041628
9  2021-01-09 9.1206522       NA
10 2021-01-10 9.9018900       NA
As you can see, the trend component is already an estimate of the dependent variable at the corresponding time. In decompose() the trend estimate is based on a moving average.
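To turn that trend component into a slope metric (a sketch building on the example above), you can regress the non-NA trend values on time measured in days since the first date, which keeps the trend aligned with the original x-axis:
df$days <- as.numeric(df$date - min(df$date))  # days since the first test date
fit <- lm(trend ~ days, data = df)             # rows with NA trend are dropped
coef(fit)["days"]                              # estimated slope of the trend per day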

Predicting multivariate time series with RNN

I have been experimenting with an R package called rnn.
The code is available here:
https://github.com/bquast/rnn
It has a very nice example for financial time series prediction.
I have read the code and I understand it uses the sequence of the time series to predict the next day's instrument value in advance.
The following is an example of a run with 10 hidden nodes and 200 epochs:
[Plot: RNN financial time series prediction]
What I would expect as result is that the algorithm succeed, at least in part, to predict in advance the value of the instrument.
From what I can see, it apparently only approximates the value of the time series at the current day, without giving any prediction for the next day.
Is my expectation wrong?
This code is very simple, how would you improve it?
y <- X[, 1:input$training_amount + input$prediction_gap, as.numeric(input$target)]
matrix(y, ncol = input$training_amount)
y.train moves all the data forward by a day, so that is what is being trained on: next-day data for the currency pair you care about. With ncol = training_amount, when there are too many columns (they now run up to training_amount + prediction_gap), the first data points fall off; hence all the data gets moved forward by the prediction_gap.
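A tiny illustration of the index shift (the numbers are made up): in R, 1:n + gap adds the gap to every element of 1:n, so the selected window starts gap positions later.
x <- 101:110
training_amount <- 5
prediction_gap <- 2
x[1:training_amount]                   # 101 102 103 104 105 (current-day inputs)
x[1:training_amount + prediction_gap]  # 103 104 105 106 107 (targets shifted forward)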

Can we easily avoid the "maximum supported lag is 350 error" when using the ARIMA function in R?

I am currently fitting a SARIMAX model to big data sets. Information is retrieved every 10 minutes, so I have a vector of size 52560 for a year of data. Considering it is representing electricity load in a device throughout said year, we can observe a daily pattern, a weekly one and a yearly one.
There is also a trend, so I need to difference my series 4 times. Since the set is for 1 year, I can set the yearly seasonality aside and focus on the other 3. Let's say I get something like this:
dauch = diff(diff(diff(auch2), 144), 1008)
With 144 being the daily seasonality (6×24 10-minute points per day) and 1008 the weekly one.
I would like to fit a SARIMAX model on which I worked with my predecessor. He found that SARIMAX(2,1,5)(1,2,8)[144] was the best one for this series. However, I get an error whenever I try this:
themodel <- arima(auch[1000:4024, 1], order = c(2, 1, 5),
                  seasonal = list(order = c(1, 2, 8), period = 144),
                  xreg = tmpf[988:4012])
tmpf being the temperature used as an exogenous variable. The error is the following:
Error in makeARIMA(trarma[[1L]], trarma[[2L]], Delta, kappa, SSinit) :
maximum supported lag is 350
I don't really understand what it means in my case, because the period I chose is 144, which is less than 350. I feel like I need to keep D = 2 in the model because of the dual differencing for the daily and weekly patterns, but I don't know how to solve this. Thanks.

Is it possible to arrange a time series in such a way that a specific autocorrelation is created?

I have a file containing 2,500 random numbers. Is it possible to rearrange these saved numbers in such a way that a specific autocorrelation is created? Let's say an autocorrelation at lag 1 of 0.2, an autocorrelation at lag 2 of 0.4, etc.
Any help is greatly appreciated!
To be more specific:
The time series of a daily return in percent of an asset has the following characteristics that I am trying to recreate:
Leptokurtic, symmetric distribution, let's say centered at a daily return of zero
No significant autocorrelations (because the sign of a daily return is not predictable)
Significant autocorrelations if the time series is squared
The aim is to produce a random time series which satisfies all three of these characteristics. The only two inputs should be the leptokurtic distribution (which I have already created) and the specific autocorrelation of the squared resulting time series (e.g. the final squared time series should have an autocorrelation at lag 1 of 0.2).
I only know how to produce random numbers out of my own mixed-distribution. Naturally if I would square this resulting time series, there would be no autocorrelation. I would like to find a way which takes this into account.
Generally, the most straightforward way to create autocorrelated data is to generate the data so that it is autocorrelated. For example, you could create an autocorrelated path by always using the value at p-1 as the mean for the random draw at time period p.
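A minimal sketch of that idea (the choice of rnorm() and sd = 1 is an assumption): each draw is centred on the previous value, which induces autocorrelation by construction.
set.seed(42)
n <- 2500
x <- numeric(n)
x[1] <- rnorm(1)
for (p in 2:n) {
  x[p] <- rnorm(1, mean = x[p - 1], sd = 1)  # previous value used as the mean
}
acf(x, lag.max = 5, plot = FALSE)            # inspect the resulting autocorrelation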
Rearranging is not only hard, but sort of odd conceptually. What are you really trying to do in the end? Giving some context might allow better answers.
There are functions for simulating correlated data: arima.sim() from the stats package and simulate.Arima() from the forecast package.
simulate.Arima() has the advantages that (1) it can simulate seasonal ARIMA models (sometimes called "SARIMA") and (2) it can simulate a continuation of an existing time series to which you have already fit an ARIMA model. To use simulate.Arima(), you do need to already have an Arima object.
UPDATE:
Type ?arima.sim and scroll down to "Examples".
Alternatively:
install.packages("forecast")
library(forecast)
fit <- auto.arima(USAccDeaths)
plot(USAccDeaths, xlim = c(1973, 1982))
lines(simulate(fit, 36), col = "red")

Evolution of linear regression coefficients over time

I would like to observe the evolution of the linear regression coefficients over time. To be more precise, let's have a time frame of 2 years where the linear regression always uses a data set spanning 1 year. After the first regression, we move one week further (i.e. we add a new week, but one week is also subtracted from the beginning) and do the regression again until we reach the final date: altogether, there will be 52 regressions.
My problem is that there are some holidays in the data set, so we cannot simply add 7 days as one might suggest. I would like to have some wrapper function that would do the above for many other functions from different packages, for example forecast.lm() from the forecast package, or any other function one can think of: the objective in every case would be to find the evolution of the linear regression parameters week by week.
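A rough sketch of the rolling scheme described above, ignoring the holiday complication (df, date, y and x are assumed names): each iteration fits a regression on one year of data, and then the window is shifted forward by a week.
starts <- seq(min(df$date), by = "1 week", length.out = 52)
coefs <- t(sapply(seq_along(starts), function(i) {
  s <- starts[i]
  window <- df[df$date >= s & df$date < s + 365, ]  # one-year window
  coef(lm(y ~ x, data = window))                    # intercept and slope for this window
}))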
I think you might get more answers if you edit/subdivide your question in a clear way: (1) How do I find holidays (it's not clear what your definition of holidays is)? (2) How do I slice up a data set accordingly? (3) How do I run a linear regression in each chunk?
(1) Find holidays: I can't really help here, as I don't know how they're defined/coded in your data set. library(sos); findFn("holiday") finds some options.
(2) Partition the data set according to inter-holiday/weekend intervals. The example below assumes holidays are coded as 1 and non-holidays as 0.
(3) Run the linear regression for each chunk and extract the coefficients.
d <- data.frame(holiday = c(0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0),
                x = runif(14), y = runif(14))
## label each inter-holiday period (the counter increments when a holiday block ends)
per <- cumsum(c(1, diff(d$holiday) == -1))  ## maybe use rle() instead
## drop holiday rows and split the remaining rows into chunks by period
dd <- with(d, split(subset(d, !holiday), per[!holiday]))
## fit y ~ x in each chunk and collect the coefficients, one row per chunk
t(sapply(lapply(dd, lm, formula = y ~ x), coef))
