The following is a (not perfectly) seasonal time series that I am trying to fit an ARIMA model to:
data plot
I made it stationary (confirmed by an ADF test) using a first-order regular difference and a first-order seasonal difference, after which it looked like this:
acf and pacf of the stationary data
The seasonal components are at lags 12 and 24, both significant in the ACF plot, while in the PACF plot only lag 12 is significant.
I tried many combinations of p, d, q, P, D, Q, but no matter what the combination, the first seasonal lag of the residuals always comes out significant in both the ACF and PACF plots.
I decided to go with ARIMA(4,1,4)(1,1,1)[96] (even though 12 is the natural frequency of the data (monthly), it happens to show seasonality at roughly 96-month intervals, i.e. every 8 years) because it gave the best log-likelihood score, but it still doesn't fit the seasonal components.
residual acf and pacf
Can anyone suggest what I should improve or try so that the model fits all the lags?
Sharing the r file and dataset here: https://drive.google.com/drive/folders/1okMUkBj2W2nF9NkoX4igq2-7QP-cgnSO?usp=sharing
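The poster's data is only in the linked Drive folder, so here is a hedged sketch using AirPassengers as a stand-in monthly series: fit the hand-picked orders with seasonal period 12 (rather than 96) and inspect the residual diagnostics, then let auto.arima search the orders with d = D = 1 for comparison.

```r
library(forecast)

# Sketch on a stand-in series (AirPassengers, monthly): fit the hand-picked
# orders with the natural seasonal period 12 and check the seasonal lags of
# the residuals.
fit <- Arima(AirPassengers, order = c(4, 1, 4), seasonal = c(1, 1, 1),
             lambda = 0)          # Box-Cox lambda = 0, i.e. a log transform
checkresiduals(fit)               # Ljung-Box test plus residual ACF

# A broader search, letting auto.arima pick the orders with d = D = 1:
fit2 <- auto.arima(AirPassengers, d = 1, D = 1, lambda = 0)
```

If the seasonal residual lag survives with period 12, it is usually more productive to revisit the transformation or the seasonal differencing than to inflate the seasonal period.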
I've got this prediction problem for daily data across several years. My data has both yearly and weekly seasonality.
I tried using the following recurrence (which I just came up with, from nowhere if you like): x[n] = 1/4 * (x[n-738] + x[n-364] + x[n-7] + 1/6 * (x[n-1] + x[n-2] + x[n-3] + x[n-4] + x[n-5] + x[n-6]))
Basically, I am taking into consideration some of the previous days in the week before the day I am trying to predict and also the corresponding day a year and two years earlier. I am doing an average over them.
My question is: can one try to improve the prediction by replacing the coefficients 1/4,1/6 etc with coefficients that would make the mean squared residual smaller?
Personally, I see your problem as a regression.
If you have enough data, I would go with a time series prediction model. You said that the data has yearly and weekly seasonality. To cope with that, you can have two models, one with a weekly window and one dealing with the yearly pattern, and then combine them somehow (a linear combination, or even another model).
However, if you don't have enough data, you could try passing the lagged values x[n-k] above as features to a regression model such as linear regression, an SVM, or a feed-forward neural network; in theory it will find coefficients that produce a small enough loss (error).
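The suggestion above can be sketched in base R (on a toy stand-in series, since the original data is not shown): build the same lagged values used in the recurrence as feature columns and let lm() estimate the weights that minimize the squared residuals, replacing the fixed 1/4 and 1/6.

```r
# Toy stand-in series; the real data would be the poster's daily series.
set.seed(1)
y <- as.numeric(arima.sim(list(ar = 0.6), n = 1200))

lags <- c(738, 364, 7, 1, 2, 3, 4, 5, 6)   # the lags used in the recurrence
n   <- length(y)
idx <- (max(lags) + 1):n                   # rows where every lag is available
X   <- sapply(lags, function(k) y[idx - k])  # one column per lag
colnames(X) <- paste0("lag", lags)
df  <- data.frame(y = y[idx], X)

# Least squares picks the coefficients that minimize the mean squared residual
fit <- lm(y ~ ., data = df)
coef(fit)
```

The fitted coefficients play exactly the role of the hand-chosen 1/4 and 1/6 weights, but are chosen to minimize the in-sample mean squared error.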
I have seasonal time series data for weekly retail sales and am trying to use the tslm function of Hyndman's forecast package to fit a model with regressors in addition to trend and season.
The issue I'm running into is that when I build the tslm, before adding any regressors (only trend + season), I get a perfect fit (R^2 = 1) on the training data!
A perfect fit is problematic because any additional covariate I add to the model (# of items being sold, distribution, etc.) has no impact on predictions (insignificant). Just looking at the data, I know these other regressors matter, so I'm not exactly sure where I'm going wrong. Hoping somebody in the community can help me out.
Some information about the data I am using:
Dataset contains weekly data from 2014 - 2017
Training data contains 156 weekly observations (2014 - 2016)
Test data contains 48 observations in 2017
I am using weekly seasonality to build the time series:
ts.train <- ts(df.train$sales, freq=365.25/7)
m.lm <- tslm(ts.train ~ trend + season + items, data=df.train)
p.lm <- forecast(m.lm,
h=48,
newdata=data.frame(items=df.test$items))
If I leave "items" out of the formula, the predictions do not change at all.
I appreciate any input and guidance!
"items" probably has too many variables (if they are dummy variables), since you get a perfect fit. See: https://www.otexts.org/fpp2/useful-predictors.html
For example, you need only 6 dummy variables to account for the 7 days of the week.
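A minimal base-R illustration of that dummy-variable rule: with an intercept in the model, a 7-level weekday factor contributes only 6 dummy columns, because model.matrix() drops one reference level automatically.

```r
# 7 weekday levels -> intercept + 6 dummies, never 7 dummies plus intercept,
# otherwise the columns would be perfectly collinear.
day <- factor(rep(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"), 10))
X <- model.matrix(~ day)
ncol(X)   # 7 columns: intercept + 6 weekday dummies
```

If a predictor supplies a full set of indicator columns alongside an intercept (or alongside the season dummies), the design matrix becomes rank-deficient and the fit can look spuriously perfect.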
I am working on building a time series model.
However, I am having trouble understanding what the difference is between the simulate function and the forecast function in the forecast package.
Suppose I built an ARIMA model and want to use it to simulate future values as far as 10 years ahead. The data is hourly and we have a year's worth of data.
When using forecast to predict the next 1000-step-ahead estimation, I got the following plot.
Using forecast method
Then I used the simulate function to simulate the next 1000 simulated values and got the following plot.
Using simulate method
Data points after the red line are simulated data points.
In the latter example, I used the following code to simulate the future values:
simulate(arima1, nsim=1000, future=TRUE, bootstrap=TRUE)
where arima1 is my trained ARIMA model; bootstrapped residuals are used because the model residuals are not very normal.
Per definition in the forecast package, future=TRUE means that we are simulating future values based on the historical data.
Can anyone tell me what the difference is between these two methods? Why does simulate() give me much more realistic results, while the forecasted values from forecast() just converge to a constant after several steps (with much less fluctuation than the results from simulate())?
A simulation is a possible future sample path of the series.
A point forecast is the mean of all possible future sample paths. So the point forecasts are usually much less variable than the data.
The forecast function produces point forecasts (the mean) and interval forecasts containing the estimated variation in the future sample paths.
As a side point, an ARIMA model is not appropriate for this time series because of the skewness. You might need to use a transformation first.
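The distinction above can be seen on a toy AR(1) series (a stand-in, since the original hourly data is not shown): the point forecast from forecast() decays toward the series mean, while a path from simulate() keeps the variability of the data.

```r
library(forecast)

set.seed(123)
y   <- ts(arima.sim(list(ar = 0.7), n = 200), frequency = 12)
fit <- Arima(y, order = c(1, 0, 0))

fc  <- forecast(fit, h = 100)                    # mean of all future paths
sim <- simulate(fit, nsim = 100, future = TRUE)  # one possible future path

sd(fc$mean)   # small: the point forecasts converge to the series mean
sd(sim)       # comparable to sd(y): a sample path fluctuates like the data
```

The prediction intervals in fc capture the spread that individual simulated paths exhibit; the mean itself is deliberately smooth.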
I have a panel data set of, let's say, 1000 observations, so i = 1, 2, ..., 1000. The data set runs on a daily basis for a month, so t = 1, 2, ..., 31.
I want to estimate an individual-specific autoregression in R:
y_{i,10} = α_i + β_i·y_{i,9} + γ_i·y_{i,8} + ... + δ_i·y_{i,1} + ε_{i,10}
and then produce density forecasts for the next 21 days, that is, density forecasts for y_{i,11}, y_{i,12}, etc.
My questions are:
1. Can I do this with the plm package? I know how to estimate with the plm package, but I do not know how to produce the forecasts.
2. Would it be easier (and correct) to treat each individual as a separate time series, use an ARIMA(9,0,0) for each one of them, and then get the density forecasts? If so, how can I get the density forecasts?
3. In (2), how can I include individual-specific effects that are constant over time?
Thanks a lot
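For the per-individual ARIMA(9,0,0) route, one hedged sketch (on a toy panel with simulated series, and with 60 observations per individual so that an AR(9) is estimable) is to fit each individual separately and approximate the density forecast by simulating many future sample paths; the intercept that Arima() includes by default plays the role of the individual-specific constant α_i.

```r
library(forecast)

set.seed(42)
# Toy panel: 3 individuals, 60 daily observations each (hypothetical stand-in)
panel <- replicate(3, arima.sim(list(ar = 0.5), n = 60), simplify = FALSE)

# Fit AR(9) per individual; include.mean = TRUE (the default) estimates an
# individual-specific constant, i.e. the fixed effect alpha_i.
dens <- lapply(panel, function(y) {
  fit <- Arima(y, order = c(9, 0, 0))
  # 500 simulated 21-day-ahead paths approximate the forecast density
  replicate(500, as.numeric(simulate(fit, nsim = 21, future = TRUE)))
})

# Quantiles across paths give the density forecast at each horizon
apply(dens[[1]], 1, quantile, probs = c(0.05, 0.5, 0.95))
```

With the real panel, one-month series are very short for nine lags, so shrinking the lag order or pooling across individuals may be necessary; the simulation step itself is unchanged.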
I use the "vars" R package to do a multivariate time series analysis. The thing is, when I fit a bivariate VAR, serial.test() always gives a really low p-value, so we reject H0 and the residuals are correlated. The usual fix is to increase the order of the VAR, but even with a very high order (p = 20 or even more) my residuals are still correlated.
How is it possible ?
I can't really give you reproducible code because I don't know how to reproduce a VAR whose residuals are always correlated. To me it's a really unusual situation, but if someone knows how it's possible, that would be great.
This is probably a better question for Cross Validated as it doesn't contain any R code or a reproducible example, but you're probably going to need to do more digging than "I have a low p-value". Have you tested your data for normality? Also, to say
The right thing to do is to increase the order of the VAR
is very inaccurate. What type of data are you working with that you would set a lag order as high as 20? A typical value for yearly data is 1, for quarterly data 4, and for monthly data 12. You can't just keep throwing higher and higher orders at your problem and expect it to fix issues in the underlying data.
Assuming you have an optimal lag value and your data is normally distributed and you still have a low p-value there are several ways to go.
Minor cases of positive serial correlation (say, lag-1 residual autocorrelation in the range 0.2 to 0.4, or a Durbin-Watson statistic between 1.2 and 1.6) indicate that there is some room for fine-tuning in the model. Consider adding lags of the dependent variable and/or lags of some of the independent variables. Or, if you have an ARIMA+regressor procedure available in your statistical software, try adding an AR(1) or MA(1) term to the regression model. An AR(1) term adds a lag of the dependent variable to the forecasting equation, whereas an MA(1) term adds a lag of the forecast error. If there is significant correlation at lag 2, then a 2nd-order lag may be appropriate.
If there is significant negative correlation in the residuals (lag-1 autocorrelation more negative than -0.3 or DW stat greater than 2.6), watch out for the possibility that you may have overdifferenced some of your variables. Differencing tends to drive autocorrelations in the negative direction, and too much differencing may lead to artificial patterns of negative correlation that lagged variables cannot correct for.
If there is significant correlation at the seasonal period (e.g. at lag 4 for quarterly data or lag 12 for monthly data), this indicates that seasonality has not been properly accounted for in the model. Seasonality can be handled in a regression model in one of the following ways:
(i) seasonally adjust the variables (if they are not already seasonally adjusted), or
(ii) use seasonal lags and/or seasonally differenced variables (caution: be careful not to overdifference!), or
(iii) add seasonal dummy variables to the model (i.e., indicator variables for different seasons of the year, such as MONTH=1 or QUARTER=2, etc.).
The dummy-variable approach enables additive seasonal adjustment to be performed as part of the regression model: a different additive constant can be estimated for each season of the year. If the dependent variable has been logged, the seasonal adjustment is multiplicative. (Something else to watch out for: it is possible that although your dependent variable is already seasonally adjusted, some of your independent variables may not be, causing their seasonal patterns to leak into the forecasts.)
Major cases of serial correlation (a Durbin-Watson statistic well below 1.0, autocorrelations well above 0.5) usually indicate a fundamental structural problem in the model. You may wish to reconsider the transformations (if any) that have been applied to the dependent and independent variables. It may help to stationarize all variables through appropriate combinations of differencing, logging, and/or deflating.
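Returning to the vars package itself, a more systematic workflow than raising p by hand is to choose the lag order by information criteria and then re-run the portmanteau test. A minimal sketch on the Canada dataset that ships with the package (a stand-in for the questioner's data):

```r
library(vars)

data(Canada)                                     # example data from the package
# Pick the lag order by information criteria instead of inflating it by hand
sel <- VARselect(Canada, lag.max = 8, type = "const")
fit <- VAR(Canada, p = sel$selection["SC(n)"], type = "const")
serial.test(fit, lags.pt = 16, type = "PT.asymptotic")  # Portmanteau test
```

If the residuals stay correlated even at the criterion-selected order, the remedies in the quoted passage above (transformations, differencing, seasonal terms) are the next thing to try, not a larger p.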