Multivariate time series model using MARSS package (or maybe dlm) - r

I have two temporal processes. I would like to see whether one process (X_{t,2}) can be used to improve forecasts of the other (X_{t,1}). I have multiple sources providing data on X_{t,2} (e.g. 3 time series measuring X_{t,2}). All time series require a seasonal component.
I found the MARSS notation a natural fit for this type of model, and the code looks like this:
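library(MARSS)
# data: matrix with one row per series (4 rows) and one column per time step
TT = ncol(data) # number of time steps
Z=factor(c("R","S","S","S")) # observation matrix: series 1 observes state R, series 2-4 observe state S
B=matrix(list(1,0,"beta",1),2,2) # evolution matrix (filled column-wise, so beta sits in row 1, column 2)
A="zero" # demeaned data
R=matrix(list(0),4,4); diag(R)=c("r","s","s","s")
Q="diagonal and unequal"
U="zero"
period = 12
per.1st = 1 # first period (month) of the data
# Now create the seasonal indicator covariates
c.in = diag(period)
for(i in 2:(ceiling(TT/period))) {c.in = cbind(c.in,diag(period))}
c.in = c.in[,(1:TT)+(per.1st-1)]
rownames(c.in) = month.abb
C = "unconstrained" # 2 x 12 matrix
dlmfit = MARSS(data, model=list(Z=Z,B=B,Q=Q,C=C, c=c.in,R=R,A=A,U=U))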
I got a beta estimate implying that the second temporal process is useful in forecasting the first one. To my dismay, though, MARSS gives me an error when I use MARSSsimulate to forecast, because one of the matrices (the one related to seasonality) is time-varying.
Does anyone know a way around this issue in the MARSS package? If not, any tips on fitting an analogous model using, say, the dlm package?

I was able to represent my state-space model in a form suitable for the dlm package, but I ran into some problems with dlm too. First, the ML estimates are VERY unstable. I bypassed this issue by constructing the dlm model from the MARSS estimates. However, dlmFilter is not working properly. I think the issue is that dlmFilter is not designed to deal with models that have multiple sources for one time series plus additional seasonal components. dlmForecast, on the other hand, gives me the forecasts that I need!
In summary, for my multivariate time series model (with multiple sources providing data for one of the temporal processes), the MARSS library gave me reasonable parameter estimates and allowed me to obtain filtered and smoothed values of the states, but forecasts were not possible. The dlm package gave fishy estimates for my model and dlmFilter didn't work, but I was able to use dlmForecast to forecast values using the model I fitted in MARSS and re-expressed in the form dlm expects.
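For readers who only need point forecasts, here is a minimal base-R sketch of what the forecast step amounts to. It assumes the MARSS state equation x_t = B x_{t-1} + C c_t + w_t with y_t = Z x_t + v_t, extends the seasonal indicator matrix past the end of the data, and iterates the state mean forward. The helper season_matrix and all numeric values (B, C, x_T, TT, h) are made up for illustration and do not come from a fitted model; forecast variances are ignored.

```r
period  <- 12
per.1st <- 1
TT <- 30   # hypothetical number of observed months
h  <- 6    # forecast horizon

# Hypothetical helper: seasonal indicator columns for arbitrary time indices
season_matrix <- function(index, period, per.1st = 1) {
  month <- ((index + per.1st - 2) %% period) + 1
  m <- matrix(0, period, length(index),
              dimnames = list(month.abb[seq_len(period)], NULL))
  m[cbind(month, seq_along(index))] <- 1
  m
}

c_obs    <- season_matrix(1:TT, period, per.1st)              # same pattern as c.in
c_future <- season_matrix((TT + 1):(TT + h), period, per.1st) # continues the pattern

# Made-up parameter values standing in for the estimates
B   <- matrix(c(1, 0, 0.3, 1), 2, 2)              # beta = 0.3 in row 1, column 2
C   <- rbind(sin(2 * pi * (1:period) / period),   # 2 x 12 seasonal effects
             cos(2 * pi * (1:period) / period))
Z   <- matrix(c(1, 0, 0, 0,  0, 1, 1, 1), 4, 2)   # series 1 -> state 1, series 2-4 -> state 2
x_T <- c(1, 2)                                    # state estimate at the last time step

# Iterate the state mean forward; the noise terms have mean zero
x_fc   <- matrix(NA_real_, 2, h)
x_prev <- x_T
for (k in seq_len(h)) {
  x_prev    <- B %*% x_prev + C %*% c_future[, k, drop = FALSE]
  x_fc[, k] <- x_prev
}
y_fc <- Z %*% x_fc   # point forecasts for the four observed series
```

The same recursion is what a forecasting routine has to do internally, which is why it needs the future columns of the seasonal covariate and not just the observed ones.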

Related

tsCV() function versus train/test split for time-series in R

The data are monthly and stationary, and I have fitted an ARIMA model using auto.arima. A couple of checks (ACF, ADF, etc.) were applied to the data before creating the model.
y is my monthly time series object, created with the ts() function:
myarima=auto.arima(y, stepwise = F,approximation = F,trace=T)
Then I use forecast function:
fc = forecast(myarima,h=10) # named fc to avoid masking the forecast() function
autoplot(fc)
In this case I did not create train and test sets, because my data fluctuate at the end. If I created a train/test split, the test set would have to equal the forecast horizon (the last 10 months), so the model would never see the fluctuations at the end, since they would be in the test set. It would be great to be enlightened about K-fold cross-validation as a way to avoid this kind of scenario.
Without train and test split, after creating the model and visualizing the forecast, I went for tsCV():
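myforecast_arima <- function(x, h) {
  # stepwise and approximation are auto.arima() arguments, not forecast() arguments
  forecast(auto.arima(x, stepwise = FALSE, approximation = FALSE), h = h)
}
error_myarima <- tsCV(y, myforecast_arima, h = 10)
mean(error_myarima^2, na.rm = TRUE) # MSE pooled over all horizons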
Then I get a fairly low MSE value, around 0.30.
So, my question is: is this a trustworthy way to evaluate ARIMA models and then deploy them? Or should I use the train/test split method? What would you prefer in general? Should I use any other method? How can I determine the window parameter of the tsCV() function? If my approach is correct, how can I improve it? What are the biggest differences between K-fold CV and the tsCV() function?
Thank you!
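To make the comparison with a single train/test split concrete, here is a minimal base-R sketch of rolling-origin (expanding window) evaluation, which is the idea behind tsCV(). It is an illustration under assumptions, not the forecast package's implementation: the series is simulated, the model is a fixed AR(1) fitted with stats::arima() rather than auto.arima(), and min_train is a made-up minimum training length.

```r
set.seed(1)
y <- arima.sim(model = list(ar = 0.6), n = 80)   # hypothetical stationary series

h         <- 3    # forecast horizon
min_train <- 40   # smallest training window (assumption)
origins   <- min_train:(length(y) - h)

# One row per forecast origin, one column per horizon
errors <- matrix(NA_real_, nrow = length(origins), ncol = h)
for (k in seq_along(origins)) {
  n_train <- origins[k]
  fit  <- arima(y[1:n_train], order = c(1, 0, 0), method = "ML")
  pred <- as.numeric(predict(fit, n.ahead = h)$pred)
  errors[k, ] <- as.numeric(y[(n_train + 1):(n_train + h)]) - pred
}

mse_by_horizon <- colMeans(errors^2)   # MSE for 1-, 2- and 3-step-ahead forecasts
mse_overall    <- mean(errors^2)
```

Unlike K-fold CV, every fit here uses only observations that precede the ones being forecast, and every late observation (including fluctuations at the end of the series) ends up both in some test window and in some later training window. Looking at the MSE per horizon is usually more informative than pooling all horizons into one number.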

How to build a model for temperature-outcome using dlm?

I have a dataset containing information about weather, air pollution and health outcomes. I want to regress cardiac deaths (CVD) on temperature (T) and lagged temperature (T1). I have previously used a glm model in R with the following script:
#for mean daily temperature and temperature lags separately.
modelT<-glm(cvd~T, data=datapoisson, family=poisson(link="log"), na.action=na.omit)
I get the effect estimates and standard errors, which I convert to risk ratios.
Now I want to use a dynamic linear model or distributed linear model to check the predictor-outcome and lagged-predictor-outcome associations. However, I can't find a script for running such a model in R.
I installed the dlm package in R, but I still can't figure out how to build a model with it.
I would appreciate it if someone could help.
Could you try least squares multiple regression to predict the outcome? I used that method when I tried to 'predict' which factors influenced power in a floating offshore wind turbine. It is good for correlating multiple parameters.
The question linked below fits a plane to a set of points, which seems like a similar idea.
https://math.stackexchange.com/questions/99299/best-fitting-plane-given-a-set-of-points
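For reference, the GLM step described in the question (Poisson regression on temperature and a lagged temperature term, then converting estimates and standard errors to risk ratios) can be sketched in base R as below. This is a plain GLM with one lag, not a dynamic linear model or a distributed lag model. The data are simulated and the names temp, temp_lag1 and dat are made up for the example.

```r
set.seed(42)
n <- 365
temp      <- 15 + 10 * sin(2 * pi * (1:n) / 365) + rnorm(n, sd = 3)  # daily mean temperature
temp_lag1 <- c(NA, temp[-n])                                         # previous day's temperature

# Simulated daily death counts with a small effect of both terms
lag_for_sim <- ifelse(is.na(temp_lag1), temp, temp_lag1)
cvd <- rpois(n, lambda = exp(2 + 0.01 * temp + 0.02 * lag_for_sim))
dat <- data.frame(cvd, temp, temp_lag1)

fit <- glm(cvd ~ temp + temp_lag1, data = dat,
           family = poisson(link = "log"), na.action = na.omit)

# Risk ratio per 1-degree increase, with Wald 95% confidence limits
est <- coef(summary(fit))
rr_table <- cbind(
  RR    = exp(est[, "Estimate"]),
  lower = exp(est[, "Estimate"] - 1.96 * est[, "Std. Error"]),
  upper = exp(est[, "Estimate"] + 1.96 * est[, "Std. Error"])
)[-1, ]   # drop the intercept row
```

The first day is dropped by na.omit because its lagged temperature is missing, so the model is fitted on n - 1 rows.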

How to check and control for autocorrelation in a mixed effect model of longitudinal data?

I have behavioral data for many groups of birds over 10 days of observation. I wanted to investigate whether there is a temporal pattern in some behaviors (e.g. does mate competition increase over time?) And I was told that I had to account for the autocorrelation of the data, since behavior is unlikely to be independent in each day.
However I was wondering about two things:
(1) Since I'm not interested in the differences in y among days but in the trend of y over days, do I still need to correct for autocorrelation?
(2) If yes, how do I control for the autocorrelation so that I'm left with only the signal (and noise, of course)?
For the second question, keep in mind I will be analyzing the effect of time on behavior using mixed models in R (since there are random effects such as pseudo-replication), but I have not found any straightforward method of correcting for autocorrelation in the data when modeling the responses.
(1) Yes, you should check for/account for autocorrelation.
The first example here shows how to estimate trends in a mixed model while accounting for autocorrelation.
You can fit these models with lme from the nlme package. Here's a mixed model without autocorrelation included:
library(nlme)
cmod_lme <- lme(GS.NEE ~ cYear,
                data=mc2, method="REML",
                random = ~ 1 + cYear | Site)
and you can explore the autocorrelation by using plot(ACF(cmod_lme)).
(2) Add correlation to the model, something like this:
cmod_lme_acor <- update(cmod_lme,
                        correlation=corAR1(form=~cYear|Site))
@JeffreyGirard notes that to check the ACF after updating the model to include the correlation argument, you will need to use plot(ACF(cmod_lme_acor, resType = "normalized")).
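Since the mc2 data from the answer is not available here, the same workflow can be tried on the Ovary dataset that ships with nlme. This is only a sketch of the pattern (fit without correlation, inspect the ACF, add corAR1, re-check with normalized residuals); the model formula is the standard one used with that dataset and has nothing to do with the bird data in the question.

```r
library(nlme)

# Ovary: follicle counts per mare over scaled time (grouped by Mare)
fm_indep <- lme(follicles ~ sin(2 * pi * Time) + cos(2 * pi * Time),
                data = Ovary, random = pdDiag(~ sin(2 * pi * Time)))

# Autocorrelation of the within-group residuals before any correction
acf_raw <- ACF(fm_indep, maxLag = 10)

# Same model with an AR(1) structure on the within-group errors
fm_ar1  <- update(fm_indep, correlation = corAR1())
phi_hat <- coef(fm_ar1$modelStruct$corStruct, unconstrained = FALSE)

# After the update, check the normalized residuals instead
acf_norm <- ACF(fm_ar1, maxLag = 10, resType = "normalized")

# Likelihood-based comparison of the two fits
cmp <- anova(fm_indep, fm_ar1)
```

plot(acf_raw, alpha = 0.05) and plot(acf_norm, alpha = 0.05) draw the two ACFs with approximate significance bounds. The fixed-effect trend estimates usually change little, but their standard errors can change noticeably once the correlation is modelled, which is why it matters even when only the trend is of interest.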

Difference between simulate() and forecast() in "forecast" package

I am working on building a time series model.
However, I am having trouble understanding what the difference is between the simulate function and the forecast function in the forecast package.
Suppose I built an arima model and want to use it to simulate future values up to 10 years ahead. The data is hourly and we have a year's worth of data.
When using forecast to predict the next 1000-step-ahead estimation, I got the following plot.
[Figure: Using forecast method]
Then I used the simulate function to simulate the next 1000 simulated values and got the following plot.
[Figure: Using simulate method]
Data points after the red line are simulated data points.
In the latter example, I used the following code to simulate the future values.
simulate(arima1, nsim=1000, future=TRUE, bootstrap=TRUE)
where arima1 is my trained arima model; bootstrapped residuals are used because the model residuals are not very normal.
Per definition in the forecast package, future=TRUE means that we are simulating future values based on the historical data.
Can anyone tell me what the difference is between these two methods? Why does simulate() give me much more realistic results, while the forecasted values from forecast() just converge to a constant after several iterations (without the fluctuation seen in the results from simulate())?
A simulation is a possible future sample path of the series.
A point forecast is the mean of all possible future sample paths. So the point forecasts are usually much less variable than the data.
The forecast function produces point forecasts (the mean) and interval forecasts containing the estimated variation in the future sample paths.
As a side point, an ARIMA model is not appropriate for this time series because of the skewness. You might need to use a transformation first.
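The statement that a point forecast is the mean of all possible future sample paths can be checked numerically. The sketch below uses only base R and makes several assumptions: the series is a simulated AR(1), the model is fitted with stats::arima(), and future paths are generated by hand with Gaussian innovations (not with the forecast package's simulate(), and not with bootstrapped residuals as in the question).

```r
set.seed(123)
y   <- arima.sim(model = list(ar = 0.8), n = 300) + 10   # hypothetical series with mean 10
fit <- arima(y, order = c(1, 0, 0), method = "ML")

h       <- 20
n_paths <- 5000

# Point forecasts: the conditional mean of the future values
point_fc <- as.numeric(predict(fit, n.ahead = h)$pred)

# Hand-rolled future sample paths from the fitted AR(1)
phi   <- unname(coef(fit)["ar1"])
mu    <- unname(coef(fit)["intercept"])   # arima() reports the series mean here
sigma <- sqrt(fit$sigma2)

paths <- matrix(NA_real_, n_paths, h)
last  <- rep(as.numeric(y[length(y)]), n_paths)
for (k in seq_len(h)) {
  last       <- mu + phi * (last - mu) + rnorm(n_paths, sd = sigma)
  paths[, k] <- last
}

mean_path <- colMeans(paths)   # averages out the noise, approaches point_fc
```

Plotting paths[1, ] against point_fc shows the contrast from the question: a single path keeps fluctuating like the data, while the point forecast decays smoothly towards the series mean. Averaging the paths column by column recovers the smooth curve.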

From Auto.arima to forecast in R

I don't quite understand the syntax of how forecast() applies external regressors in the library(forecast) in R.
My fit looks like this:
fit <- auto.arima(Y,xreg=factors)
where Y is a timeSeries object 100 x 1 and factors is a timeSeries object 100 x 5.
When I go to forecast, I apply...
forecast(fit, h=horizon)
And I get an error:
Error in forecast.Arima(fit, h = horizon) : No regressors provided
Does it want me to add back the xregressors from the fit? I thought these were included in the fit object as fit$xreg. Does that mean it's asking for future values of the xregressors, or that I should repeat the same values I used in the fit set? The documentation doesn't cover the meaning of xreg in the forecast step.
I believe all this means I should use
forecast(fit, h=horizon,xreg=factors)
or
forecast(fit, h=horizon,xreg=fit$xreg)
Which gives the same results. But I'm not sure whether the forecast step is interpreting the factors as future values, or appropriately as previous ones. So,
Is this doing a forecast out of purely past values, as I expect?
Why do I have to specify the xreg values twice? It doesn't run if I exclude them, so it doesn't behave like an option.
Correct me if I am wrong, but I think you may not completely understand how the ARIMA model with regressors works.
When you forecast with a simple ARIMA model (without regressors), it simply uses past values of your time series to predict future values. In such a model, you could simply specify your horizon, and it would give you a forecast until that horizon.
When you use regressors to build an ARIMA model, you need to include future values of the regressors to forecast. For example, if you used temperature as a regressor, and you were predicting disease incidence, then you would need future values of temperature to predict disease incidence.
In fact, the documentation does talk about xreg specifically. Look up ?forecast.Arima and look at both the arguments h and xreg. You will see that if xreg is used, then h is ignored. Why? Because if your model uses xreg, then it needs future values of the regressors for forecasting.
So, in your code, h was simply ignored when you included xreg. Since you passed the same values that you used to fit the model, it gave you predictions for that same set of regressors, treated as if they were future values.
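The same requirement can be seen with base R's stats::arima() and predict(), which is a close analogue of the auto.arima()/forecast() pair but not the same API: the argument there is called newxreg, and n.ahead has to match its number of rows. The data and the names x and x_future below are made up for the sketch.

```r
set.seed(7)
n <- 100
x <- cbind(x1 = rnorm(n), x2 = rnorm(n))   # regressors observed over the fitting period
y <- 2 + 1.5 * x[, "x1"] - 0.5 * x[, "x2"] + arima.sim(model = list(ar = 0.5), n = n)

fit <- arima(y, order = c(1, 0, 0), xreg = x, method = "ML")

# Forecasting needs regressor values for the FUTURE periods
x_future <- cbind(x1 = rnorm(5), x2 = rnorm(5))   # stand-in for known or forecast future values
fc <- predict(fit, n.ahead = nrow(x_future), newxreg = x_future)

# Leaving the future regressors out fails, much like the error in the question
no_xreg <- tryCatch(predict(fit, n.ahead = 5),
                    error = function(e) conditionMessage(e))
```

fc$pred holds five forecasts, one per row of x_future. Passing the fitting-period regressors again would run without error, but it would produce 100 "forecasts" that treat past regressor values as if they were future ones.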
Related:
https://stats.stackexchange.com/questions/110589/arima-with-xreg-rebuilding-the-fitted-values-by-hand
I read that arima in R is borked; see Issues 3 and 4 at
https://www.stat.pitt.edu/stoffer/tsa4/Rissues.htm
There, xreg is suggested as a way to derive the proper intercept.
I'm using Real Statistics for Excel to figure out what the actual constant is. I had a professor tell me you need to have a constant.
The two fits below give the same forecasts. So it appears you can use xreg to get some descriptive information, but you would have to follow the stats.stackexchange link above to derive the values from them by hand.
f = auto.arima(lacondos[,1])
f$coef
g = Arima(lacondos[,1],c(t(matrix(arimaorder(f)))),include.constant=FALSE,xreg=1:length(lacondos[,1]))
g$coef
