I am trying to investigate the relationship between some Google Trends Data and Stock Prices.
I ran the augmented Dickey-Fuller (ADF) test and the KPSS test to make sure that both time series are integrated of the same order (I(1)).
However, after I took first differences, the ACF plot was completely insignificant (except at lag 0, where it is 1 by definition), which told me that the differenced series behave like white noise.
Nevertheless I tried to estimate a VAR model which you can see attached.
As you can see, only one constant is significant. I have read that because Stocks.ts.l1 is not significant in the equation for GoogleTrends and GoogleTrends.ts.l1 is not significant in the equation for Stocks, there is no dynamic relationship between the two time series, and each can be modelled independently of the other with an AR(p) model.
I checked the residuals of the model. They fulfill the assumptions: normality is not fully satisfied but acceptable, the residuals are homoscedastic, the model is stable, and there is no autocorrelation.
But what does it mean if no coefficient is significant, as in the Stocks.ts equation? Is the model simply inappropriate for the data because the data don't follow an AR process? Or is the model so bad that a constant would describe the data better than the model does? Or a combination of the two? Any suggestions for how I could proceed with my analysis?
Thanks in advance


Interpreting ACF and PACF plots for SARIMA model

I'm new to time series and used the monthly ozone concentration data from Rob Hyndman's website to do some forecasting.
After doing a log transformation and differencing at lags 1 and 12 to get rid of the trend and seasonality respectively, I plotted the ACF and PACF (shown in the attached image). Am I on the right track, and how would I interpret this as a SARIMA?
There seems to be a pattern every 11 lags in the PACF plot, which makes me think I should do more differencing (at 11 lags), but doing so gives me a worse plot.
I'd really appreciate any of your help!
I got rid of the differencing at lag 1 and just used lag 12 instead, and this is what I got for the ACF and PACF.
From there, I deduced that: SARIMA(1,0,1)x(1,1,1) (AIC: 520.098)
or SARIMA(1,0,1)x(2,1,1) (AIC: 521.250)
would be a good fit, but auto.arima gave me (3,1,1)x(2,0,0) (AIC: 560.7) with its default settings and (1,1,1)x(2,0,0) (AIC: 558.09) with stepwise and approximation turned off.
I am confused about which model to use, but based on the lowest AIC, would SARIMA(1,0,1)x(1,1,1) be the best? Also, what concerns me is that none of the models pass the Ljung-Box test. Is there any way I can fix this?
It is quite difficult to manually select a model order that will perform well at forecasting a dataset. This is why Rob has built the 'auto.arima' function in his R forecast package, to figure out the model that may perform best based on certain metrics.
When you see a PACF plot with significantly negative lags, that usually means you have overdifferenced your data. Try removing the first-order difference and keeping the lag-12 (seasonal) difference. Then carry on making your best guess.
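A quick way to see the overdifferencing signature is to difference a series that needs no differencing. This is a minimal sketch on simulated white noise (not the ozone data; the seed and sample size are arbitrary choices of mine): the lag-1 partial autocorrelation is near zero before differencing and clearly negative afterwards.

```r
# Overdifferencing demo on simulated data (not the ozone series).
set.seed(1)
e <- rnorm(5000)   # white noise: already stationary
d <- diff(e)       # an unnecessary first difference

# pacf() starts at lag 1, so element 1 is the lag-1 partial autocorrelation.
pacf_raw  <- pacf(e, plot = FALSE)$acf[1]
pacf_diff <- pacf(d, plot = FALSE)$acf[1]

round(c(raw = pacf_raw, overdifferenced = pacf_diff), 2)
```

In theory, differencing white noise gives a lag-1 autocorrelation of -0.5, so a large negative spike at lag 1 after differencing is a hint to try one difference fewer.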
I'd recommend trying his auto.arima function and passing it a time series object with frequency = 12. He has a good writeup of seasonal ARIMA models here:
If you would like more insight into manually selecting a SARIMA model order, this is a good read:
In response to your Edit:
I think it would be beneficial to this post if you clarify your objective. Which of the following are you trying to achieve?
1. Find a model whose residuals satisfy the Ljung-Box test.
2. Produce the most accurate out-of-sample forecast.
3. Manually select lag orders such that the ACF and PACF plots show no significant lags remaining.
In my opinion, #2 is the most sought-after objective, so I'll assume that is your goal. In my experience, #3 produces poor results out of sample. With regard to #1, I am usually not concerned about correlation remaining in the residuals. We know we do not have the true model for this time series, so I see no reason to expect that an approximate model that performs well out of sample will leave nothing behind in the residuals; what remains may be more complex, or nonlinear, etc.
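For anyone who does want to look at objective #1, this is roughly how the Ljung-Box check can be run in base R. It is only a sketch: the built-in AirPassengers series stands in for the ozone data, and the order (0,1,1)(0,1,1)[12], the lag of 24, and fitdf = 2 are illustrative assumptions, not recommendations.

```r
# Stand-in data: the built-in AirPassengers series, not the ozone data.
y <- log(AirPassengers)

# SARIMA(0,1,1)(0,1,1)[12], chosen only for illustration.
fit <- arima(y, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))

# Ljung-Box test on the residuals.
# fitdf = number of estimated ARMA coefficients (here ma1 and sma1).
lb <- Box.test(residuals(fit), lag = 24, type = "Ljung-Box", fitdf = 2)
lb$p.value
```

A small p-value says some autocorrelation is left in the residuals; as argued above, that alone does not have to rule out a model that forecasts well.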
To provide you another SARIMA result, I ran this data through some code I've developed and found the following equation produced the minimal error on a cross-validation period.
Final model is:
SARIMA(0,1,1)(1,1,1)[12] with a constant, fitted to the natural log of the time series.
The errors in the cross validation period are:
MAPE = 16%
MAE = 0.46
RSQR = 74%
Here is the Partial Autocorrelation plot of the residuals for your information.
To my understanding, this is roughly similar in methodology to selecting an equation based on AICc, but it is ultimately a different approach. Regardless, if your objective is out-of-sample accuracy, I'd recommend evaluating equations by their out-of-sample accuracy rather than by in-sample fit, tests, or plots.
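To make the out-of-sample comparison concrete, here is a minimal sketch using a single holdout period. Everything specific in it is an assumption for illustration: the built-in AirPassengers series replaces the ozone data, the last 24 months are held out, and the two candidate orders are arbitrary. It is not the code that produced the numbers above.

```r
# Stand-in data: built-in AirPassengers (monthly), not the ozone series.
y <- AirPassengers
h <- 24                                  # hold out the last 24 months
n <- length(y)
train <- ts(y[1:(n - h)], start = start(y), frequency = frequency(y))
test  <- as.numeric(y[(n - h + 1):n])

# Candidate orders, listed only as examples.
candidates <- list(
  "(0,1,1)(0,1,1)" = list(order = c(0, 1, 1), seasonal = c(0, 1, 1)),
  "(1,1,0)(0,1,1)" = list(order = c(1, 1, 0), seasonal = c(0, 1, 1))
)

# Fit on the log scale, forecast h steps, score on the original scale.
score <- function(spec) {
  fit <- arima(log(train), order = spec$order,
               seasonal = list(order = spec$seasonal, period = 12),
               method = "ML")
  fc  <- exp(as.numeric(predict(fit, n.ahead = h)$pred))
  err <- test - fc
  c(MAE = mean(abs(err)), MAPE = 100 * mean(abs(err / test)))
}

results <- t(sapply(candidates, score))  # one row per candidate model
results
```

The forecasts are back-transformed with exp() before scoring, so MAE and MAPE are on the original scale of the series.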

poLCA not stable estimates

I am trying to run a latent class analysis with covariates using the poLCA package. However, every time I run the model, the multinomial logit coefficients come out different. I have accounted for changes in the order of the classes and set a very high number of replications (nrep=1500), yet rerunning the model still gives different results. For example, I have 3 classes (high, low, medium). No matter the order in which the classes are considered in the estimation, the multinomial model gives me different coefficients for the same comparisons (such as low vs high and medium vs high) across estimations. Should I increase the number of replications further to get stable results? Any idea why this is happening? I know that with set.seed() I can replicate the results, but I would like to obtain stable estimates so that I can claim the results are valid. Thank you very much!
From the manual (?poLCA):
As long as probs.start=NULL, each function call will use different
(random) initial starting parameters
You need to use set.seed() or set probs.start in order to get consistent results across function calls.
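Separately from the random starts, coefficients from two runs are only comparable once both are expressed against the same reference class. poLCA reports covariate coefficients relative to the first class, and which substantive class comes first can change between runs. A minimal sketch of the conversion in base R, using made-up coefficient values (run1 and rebase are hypothetical names, not part of poLCA):

```r
# Hypothetical coefficients shaped like a 3-class fit:
# rows = intercept and one covariate, columns = class 2 and class 3 vs class 1.
run1 <- matrix(c(0.5, 1.0,     # class 2 vs class 1
                 -0.2, 0.3),   # class 3 vs class 1
               nrow = 2,
               dimnames = list(c("(Intercept)", "x"), c("2v1", "3v1")))

# Re-express the coefficients relative to a different reference class.
rebase <- function(coef_vs_first, new_ref) {
  full <- cbind(0, coef_vs_first)   # class 1 vs itself is 0 by definition
  out  <- full - full[, new_ref]    # subtract the new reference column
  colnames(out) <- paste0("class", seq_len(ncol(out)), "_vs_", new_ref)
  out[, -new_ref, drop = FALSE]
}

rebase(run1, new_ref = 2)
```

If a second run labels the old class 2 as class 1, its reported coefficients should match rebase(run1, 2) up to estimation noise. If they do not, the two runs have found different solutions rather than the same solution with switched labels.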
Actually, if different starting points do not converge to the same solution, you have a data problem.
LCA uses a kind of maximum likelihood estimation. If there is no convergence, you have an under-identification problem: you have too little information to estimate the number of classes you specified. A lower number of classes might run, or you will have to impose some a priori restrictions.
You might wish to read Latent Class and Latent Transition Analysis by Collins and Lanza. It was a great help for me.

PLM in R with time invariant variable

I am trying to analyze a panel data set that includes observations for each US state collected across 45 years.
I have two predictor variables that vary across time (A,B) and one that does not vary (C). I am especially interested in knowing the effect of C on the dependent variable Y, while controlling for A and B, and for the differences across states and time.
This is the model that I have, using the plm package in R.
library(plm)
random <- plm(Y ~ log1p(A) + B + C, data = data, index = c("state", "year"), model = "random")
My reasoning is that with a time-invariant variable I should be using a random rather than a fixed effects model.
My question is: are my model and reasoning correct?
Thank you for your help in advance.
You are basing your decision between fixed and random effects solely on computational grounds. Please look at the specific assumptions associated with each model. The Hausman test is often used to choose between the fixed and the random effects model, but it should not be taken as the definitive answer (any good textbook will have further details).
Pooled OLS could also yield a good model, if its assumptions hold. Computationally, pooled OLS will also give you estimates for time-invariant variables.
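The computational point can be checked with plain lm() on simulated data. This is a sketch under assumptions of my own: the data are simulated, state dummies (the LSDV form of fixed effects) stand in for plm's within estimator, and the true effect of C is set to 2.

```r
set.seed(42)
n_states <- 10
n_years  <- 45
panel <- data.frame(
  state = rep(seq_len(n_states), each = n_years),
  year  = rep(seq_len(n_years), times = n_states)
)
panel$A <- rexp(nrow(panel))
panel$B <- rnorm(nrow(panel))
C_by_state <- rnorm(n_states)
panel$C <- C_by_state[panel$state]   # constant within each state
panel$Y <- 1 + 0.5 * log1p(panel$A) - 0.3 * panel$B + 2 * panel$C +
  rnorm(nrow(panel))

# Fixed effects via state dummies: C is collinear with the dummies.
fe <- lm(Y ~ factor(state) + log1p(A) + B + C, data = panel)

# Pooled OLS: C can be estimated because there are no state dummies.
pooled <- lm(Y ~ log1p(A) + B + C, data = panel)

c(fixed_effects = unname(coef(fe)["C"]), pooled = unname(coef(pooled)["C"]))
```

The fixed effects fit returns NA for C because C is a linear combination of the state dummies, while pooled OLS returns an estimate. Whether that estimate is trustworthy depends on the model assumptions, not on the fact that it can be computed.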

Mixed Logit fitted probabilities in RSGHB

My question has to do with using the RSGHB package to predict choice probabilities per alternative from mixed logit models (variation across respondents) with correlated coefficients.
I understand that the choice probabilities are simulated at the individual level, and that averaging the individual shares gives the preference share. All the sources I have found treat each prediction as a separate simulation, which makes the whole process cumbersome if many predictions are needed.
Since one can save the respondent-specific coefficient draws, wouldn't it be faster to simply apply the logit transform to each (vector of) coefficient draw? Once this is done, new or existing alternatives could be evaluated faster than by rerunning a whole simulation for each required alternative. For the time being, using a fitted() approach will not help me understand how prediction actually works.
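As a rough sketch of that idea in base R: take one coefficient vector per respondent, compute utilities for the alternatives of interest, apply the logit formula, and average across respondents. The betas matrix here is simulated and only stands in for coefficients saved from an estimated model; the attribute names and levels are invented, and no RSGHB functions are used.

```r
set.seed(7)
n_resp <- 200

# Hypothetical individual-level coefficients (rows = respondents).
betas <- cbind(price   = rnorm(n_resp, mean = -1.0, sd = 0.5),
               quality = rnorm(n_resp, mean =  1.5, sd = 0.7))

# Attribute levels of the alternatives to simulate (rows = alternatives).
alts <- rbind(A = c(price = 1.0, quality = 1.0),
              B = c(price = 1.5, quality = 2.0),
              C = c(price = 0.5, quality = 0.5))

logit_probs <- function(betas, alts) {
  u <- betas %*% t(alts)      # respondents x alternatives utilities
  u <- u - apply(u, 1, max)   # stabilise exp(); probabilities are unchanged
  expu <- exp(u)
  expu / rowSums(expu)        # each row sums to 1
}

probs  <- logit_probs(betas, alts)   # individual choice probabilities
shares <- colMeans(probs)            # preference shares
shares
```

With several draws per respondent, the same function could be applied to each draw and the probabilities averaged within respondent before averaging across respondents. New alternatives only require new rows in alts.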

How to test time series model?

I am wondering what a good approach for testing a time series model would be. Suppose I have a time series on the time domain t1, t2, ..., tN, with inputs zt1, zt2, ..., ztN and outputs x1, x2, ..., xN.
Now, if this were a classical data mining problem, I could use known approaches like cross-validation, leave-one-out, a 70-30 split, or something else.
But how should I approach testing my model with time series? Should I build the model on the first t1, t2, ..., t(N-k) inputs and test it on the last k inputs? And what if we want to optimise the prediction p steps ahead rather than k (where p < k)? I am looking for a robust solution that I can apply to my specific case.
With time series fitting, you need to be careful not to use your out-of-sample data until after you've developed your model. The main problem with modelling is that it is simply too easy to overfit.
Typically we use 70% of the data for in-sample modelling and 30% for out-of-sample testing/validation. When the model is used in production, the data collected day to day becomes true out-of-sample data: data you have never seen or used.
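If the goal is p-step-ahead accuracy, one common variant is rolling-origin evaluation: refit on the data up to each origin and score only the forecast p steps ahead. A minimal sketch, assuming a simulated AR(1) series and an AR(1) model purely for illustration (rolling_origin_mae and min_train are hypothetical names):

```r
set.seed(123)
# Simulated AR(1) series, for illustration only.
y <- as.numeric(arima.sim(list(ar = 0.6), n = 200))

rolling_origin_mae <- function(y, p, min_train) {
  origins <- seq(min_train, length(y) - p)   # last observation used for fitting
  errs <- sapply(origins, function(t0) {
    fit <- arima(y[1:t0], order = c(1, 0, 0), method = "ML")
    fc  <- predict(fit, n.ahead = p)$pred[p]  # forecast for time t0 + p
    y[t0 + p] - fc
  })
  mean(abs(errs))
}

# Mean absolute error at horizons 1, 2 and 3.
mae_by_horizon <- sapply(1:3, function(p) rolling_origin_mae(y, p, min_train = 150))
mae_by_horizon
```

Each forecast uses only data available at its origin, so the errors are genuinely out of sample, and the horizon being scored matches the horizon you care about.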
Now, if you have enough data points, I'd suggest trying a rolling-window fitting approach. For each time step in your in-sample data, you look back window_size time steps to fit your model and see how the parameters of your model vary over time. For example, let's say your model is a linear regression with Y = B0 + B1*X1 + B2*X2. You'd run the regression roughly N - window_size times over the sample, where N is the in-sample length. This way, you understand how sensitive your betas are to time, among other things.
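The rolling regression described above can be sketched as follows. The data are simulated with a slope B1 that drifts over time, so the drift should show up in the rolling estimates; window_size = 60 is an arbitrary choice. Counting every full window gives N - window_size + 1 fits.

```r
set.seed(99)
N <- 300
window_size <- 60

X1 <- rnorm(N)
X2 <- rnorm(N)
B1 <- seq(0.5, 1.5, length.out = N)   # slope drifts over time by construction
Y  <- 1 + B1 * X1 - 0.5 * X2 + rnorm(N, sd = 0.3)
dat <- data.frame(Y, X1, X2)

# One regression per full window of window_size consecutive observations.
starts <- seq_len(N - window_size + 1)
betas <- t(sapply(starts, function(s) {
  coef(lm(Y ~ X1 + X2, data = dat[s:(s + window_size - 1), ]))
}))

# Compare the X1 coefficient in the earliest and latest windows.
c(first = betas[1, "X1"], last = betas[nrow(betas), "X1"])
```

Plotting each column of betas against the window start is a simple way to judge how stable the coefficients are.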
It sounds like you have a choice between:
1. Using the first few years of data to create the model, then seeing how well it predicts the remaining years.
2. Using all the years of data for some subset of input conditions, then seeing how well it predicts using the remaining input conditions.
