Lag length in ADF tests using different packages in R

In a time series analysis, I tested 30 time series with 62 observations each for a unit root with the ur.df test from the R package urca (Bernhard Pfaff), with the lag length selected by the AIC criterion. Without exception, a lag length of 1 was chosen. This seems highly implausible. Testing with the CADF test from the R package CADFtest (which performs an ordinary ADF test if x ~ 1 is chosen), and the AIC criterion for lag length selection, the number of lags varies between 0 and 7. Can someone explain the tendency towards a uniform and short lag length in urca?
Furthermore, when the lag lengths in ur.df and CADFtest are the same, the test statistics are not. For instance, for the time series lcon (natural logarithm of consumption per head) 1950-2010 in the Netherlands, the test statistics (constant and trend) are -1.5378 (1 lag) with ur.df and -2.4331 (1 lag) with CADFtest. adf.test from the R package tseries computes a test statistic equal to that of ur.df (-1.5378, 1 lag). So whether a unit root is rejected depends on the package, which is not an optimal situation.
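For reference, a minimal hedged sketch (on a simulated placeholder series, not the lcon data) of how the three tests can be run with the same fixed lag so that their statistics are directly comparable:
library(urca)      # ur.df()
library(CADFtest)  # CADFtest()
library(tseries)   # adf.test()
set.seed(123)
y <- cumsum(rnorm(62))                       # placeholder random-walk series
summary(ur.df(y, type = "trend", lags = 1))  # urca
CADFtest(y, type = "trend", max.lag.y = 1)   # CADFtest (ordinary ADF, no covariates)
adf.test(y, k = 1)                           # tseries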

There seems to be a severe problem due to the sensitivity of the results to the sample length. Some observations might change the result dramatically (e.g. the estimation sample for lag length p = 4 starts one observation later than for p = 3). Therefore the series should start at a common date for all candidate lag lengths (as is also recommended for IC-based selection of the lag length in VAR models in general). So if max.lag.y = 6, the supplied time series needs to be truncated accordingly (i.e. y[-c(1:5)]), as sketched below. Unfortunately this is not the default. Hope this helps. I'm not sure this is the only issue with CADFtest, though.
(see also https://stat.ethz.ch/pipermail/r-help/2011-November/295519.html )
Best,
Hannes
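A minimal sketch of the truncation suggested above (y, the lag settings and the deterministic terms are placeholders, not tied to the lcon series):
library(CADFtest)
# with max.lag.y = 6, drop the first 5 observations by hand so that the
# candidate lag lengths are compared on a common sample
y_trunc <- y[-c(1:5)]
CADFtest(y_trunc ~ 1, type = "trend", max.lag.y = 6, criterion = "AIC")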

I had the same problem: you need to specify the maximum number of lags explicitly, otherwise the default (lags = 1) is used.
For example:
ur.df(variable, type = "drift", lags = 30, selectlags = "AIC")
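A hedged follow-up: the lag that AIC actually selected can be read off the fitted test regression (the number of z.diff.lag terms), for example:
adf <- ur.df(variable, type = "drift", lags = 30, selectlags = "AIC")
summary(adf)  # prints the test regression, the retained lagged differences and the test statistic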

Related

How to estimate less conservative standard errors when using post-stratified weights without full information in the survey package?

I'm encountering (very) huge standard errors in my analysis of proportions with post-stratified data when using the survey package.
I'm working with a data set including (normalized) weights calculated via raking by another party. I don't know exactly how the strata have been defined (e.g. "ageXgender" has been used, but it's unclear which categorization has been used). Let's assume a simple random sample with a considerable amount of non-response.
Is there any way to estimate the reduced standard errors due to post-stratification without exact information about the procedure in survey? I could recalibrate the weights with rake() if I could exactly define the strata, but I don't have enough information for this.
I have tried to infer the strata by grouping all equal weights together, thinking I would at least get an upper bound on the reduction in standard errors this way, but using them only led to marginally reduced standard errors and sometimes even increased standard errors:
# An example with the api datasets, pretending that pw are post-stratification weights of unknown origin
library(survey)
data(api)
apistrat$pw <- apistrat$pw / mean(apistrat$pw)  # normalized weights
# Include some more extreme weights to simulate my data
mins <- which(apistrat$pw == min(apistrat$pw))
maxs <- which(apistrat$pw == max(apistrat$pw))
apistrat[mins[1:5], "pw"] <- 0.1
apistrat[maxs[1:5], "pw"] <- 10
apistrat[mins[6:10], "pw"] <- 0.2
apistrat[maxs[6:10], "pw"] <- 5
dclus1 <- svydesign(id = ~1, weights = ~pw, data = apistrat)
# "Estimate" strata from the weights
apistrat$ps_est <- as.factor(apistrat$pw)
dclus_ps_est <- svydesign(id = ~1, strata = ~ps_est, weights = ~pw, data = apistrat)
svymean(~api00, dclus1)
svymean(~api00, dclus_ps_est)
# this actually increases the SE instead of reducing it
My real weights are also much more complex with 700 unique values in 1000 cases.
Is it possible to somehow approximate the reduction of standard errors due to post-stratification without knowing the real variables and categories and, especially, the population values for rake? Could I use rake with some assumptions about the variables and categories used in the strata definitions, but without the population totals, in some way?
If your data are already raked, then you know the population totals exactly: raking makes the estimated population totals equal the true population totals for the raking variables. So, if you know the raking variables, you can estimate the population totals and then rake. The raking won't change the weights (because ex hypothesi these were already raked), but it will change the standard error estimates; a sketch follows below.
(The next version of the survey package will have an option in svydesign to do exactly this.)
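A hedged sketch of that suggestion, using the api data as a stand-in; the choice of raking variables (stype and both) is purely an assumption for illustration, since the real raking variables are unknown:
library(survey)
data(api)
des <- svydesign(id = ~1, weights = ~pw, data = apistrat)
# if the weights really were raked on these variables, the weighted totals
# estimate the corresponding population totals
pop.stype <- data.frame(stype = levels(apistrat$stype),
                        Freq  = as.numeric(coef(svytotal(~stype, des))))
pop.both  <- data.frame(both  = levels(apistrat$both),
                        Freq  = as.numeric(coef(svytotal(~both, des))))
# re-rake on the assumed margins (with genuinely raked weights this step leaves
# the weights essentially unchanged) so that the design object carries the
# calibration information and the standard errors reflect it
des_rake <- rake(des, sample.margins = list(~stype, ~both),
                 population.margins = list(pop.stype, pop.both))
svymean(~api00, des)       # post-stratification ignored
svymean(~api00, des_rake)  # SEs account for the (assumed) raking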

How to test efficiently for spatial/temporal autocorrelation in a time series

I am looking at count data over a period of 31 years and 274 connected plots. I suspect these data have spatial/temporal autocorrelation and I am looking for a well-organized method for testing this.
So far I used DHARMa to check my model residuals which returned the following message:
testing for spatial autocorrelation requires unique x,y values - if
you have several observations per location, either use the
recalculateResiduals function to aggregate residuals per location, or
extract the residuals from the fitted object, and plot / test each of
them independently for spatially repeated subgroups (a typical
scenario would repeated spatial observation, in which case one could
plot / test each time step separately for temporal autocorrelation).
Note that the latter must be done by hand, outside
testSpatialAutocorrelation
As aggregating is not an option I’ve decided to test each of them independently. The following output is the result of testing a single year for spatial autocorrelation:
> Moran.I(spat$Residuals, dists.inv)
$observed
[1] -0.007104585
$expected
[1] -0.003663004
$sd
[1] 0.004742504
$p.value
[1] 0.4680297
How can I interpret this output? And what would be a good method of testing every single year in my dataset? I thought about writing a loop (something like the one sketched at the end of this question), but I worry that would make things very hard to read.
The same thing applies to testing for temporal autocorrelation. This is one of the 274 plots:
> lmtest::dwtest(temp$Residuals ~ 1, order.by = temp$year)
Durbin-Watson test
data: temp$Residuals ~ 1
DW = 2.1637, p-value = 0.6775
alternative hypothesis: true autocorrelation is greater than 0
Is there a smart method of running multiple tests to quickly identify the affected years/plots?
Also how would I include the strength of the spatial/temporal autocorrelation separately per year or plot in the final model?
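A hedged sketch of the per-year loop idea mentioned above, assuming the residuals sit in a data frame resid_df with columns year, x, y and Residuals (all of these names are placeholders) and using ape::Moran.I as in the question:
library(ape)
moran_by_year <- lapply(split(resid_df, resid_df$year), function(d) {
  dists <- as.matrix(dist(cbind(d$x, d$y)))   # plot-to-plot distances within the year
  dists.inv <- 1 / dists
  diag(dists.inv) <- 0
  m <- Moran.I(d$Residuals, dists.inv)
  data.frame(year = d$year[1], observed = m$observed,
             expected = m$expected, p.value = m$p.value)
})
moran_tab <- do.call(rbind, moran_by_year)
# with 31 tests, some multiplicity adjustment is advisable before flagging years
moran_tab$p.adj <- p.adjust(moran_tab$p.value, method = "holm")
The same split/apply pattern would work for the per-plot Durbin-Watson tests.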

Test for Stationarity in time series

I need to check second order stationarity of a time series of length 7320 (I have 1800 such time series). These time series are displacement recorded at 1800 sites on a mountain.
I tried using the Priestley-Subba Rao test in R: stationarity(). For 1 time series out of the 1800, I got these values:
p-value for T : 2.109424e-15
p-value for I+R : 9.447661e-06
p-value for T+I+R : 1.4099e-10
Could you please tell me how to interpret these? All I know is that if the p-value for T is 0, the null hypothesis of the time series being stationary is rejected. Also, for the 2nd time series out of 1800, I got these values:
p-value for T : 0
p-value for I+R : 1.458063e-09
p-value for T+I+R : 0
Could you tell me how to differentiate between the two? Both time series are from the same dataset. Also, is it possible that one time series is stationary and another is not, given that they are from the same site and recorded at the exact same time?
I also tried the Wavelet Spectrum Test in R: the hwtos2() function. But this function requires a time series whose length is a power of 2. Is there another, better test for stationarity that does not restrict the length of the time series?
The book "Nonstationarities in Hydrologic and Environmental Time Series" (Springer Ed.), at pag. 119, provides a good explanation for interpreting those p-values within the Priestley-Subba Rao test.
In general, you may also take a look at:
https://www.stat.tamu.edu/~suhasini/test_papers/priestley_subbarao70.pdf
About other stationarity tests, you may have a look at the "weakly.stationary()" function within the "analytics" package and at the "costat" package, whose info is at:
https://www.jstatsoft.org/article/view/v055i01
where there is a suggestion for handling time series of non-dyadic length (i.e., length not equal to 2^J for some natural number J). On p. 5 (see the sketch after this answer):
"It should be made clear that this is not a limitation of wavelets per se, but of the computationally efficient algorithms used to compute the intended quantities. Data sets of other lengths can be handled by zero-padding or truncation"
Some interesting info at:
https://arxiv.org/pdf/1603.06415.pdf
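A hedged sketch of the zero-padding / truncation idea quoted above, for a series of length 7320 (x is a placeholder vector; hwtos2() is from the locits package):
n <- length(x)                      # 7320
n_trunc <- 2^floor(log2(n))         # 4096: largest dyadic length that fits
n_pad   <- 2^ceiling(log2(n))       # 8192: next power of two
x_trunc <- x[1:n_trunc]
x_pad   <- c(x, rep(0, n_pad - n))  # note: padding distorts the end of the series
library(locits)
hwtos2(x_trunc)                     # wavelet spectrum test on the truncated series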

Is it possible to rearrange a time series so that a specific autocorrelation is created?

I have a file containing 2,500 random numbers. Is it possible to rearrange these saved numbers so that a specific autocorrelation is created? Let's say, an autocorrelation at lag 1 of 0.2, an autocorrelation at lag 2 of 0.4, and so on.
Any help is greatly appreciated!
To be more specific:
The time series of a daily return in percent of an asset has the following characteristics that I am trying to recreate:
Leptokurtic, symmetric distribution, let's say centered at a daily return of zero
No significant autocorrelations (because the sign of a daily return is not predictable)
Significant autocorrelations if the time series is squared
The aim is to produce a random time series which satisfies all three characteristics. The only two inputs should be the leptokurtic distribution (this I have already created) and the specific autocorrelation of the squared resulting time series (e.g. the final squared time series should have an autocorrelation at lag 1 of 0.2).
I only know how to produce random numbers from my own mixed distribution. Naturally, if I squared the resulting time series, there would be no autocorrelation. I would like to find a way that takes this into account.
Generally the most straightforward way to create autocorrelated data is to generate the data so that it's autocorrelated. For example, you could create an autocorrelated path by always using the value at time p-1 as the mean for the random draw at time period p (see the sketch below).
Rearranging is not only hard, but sort of odd conceptually. What are you really trying to do in the end? Giving some context might allow better answers.
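A minimal sketch of that idea (each draw uses the previous value as its mean), which produces a strongly autocorrelated, random-walk-like path:
set.seed(1)
n <- 2500
x <- numeric(n)
x[1] <- rnorm(1)
for (p in 2:n) x[p] <- rnorm(1, mean = x[p - 1], sd = 1)
acf(x, lag.max = 5, plot = FALSE)   # large positive autocorrelations at short lags
Scaling the previous value by a coefficient below 1 would give a stationary AR(1) process instead.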
There are functions for simulating correlated data: arima.sim() from the stats package and simulate.Arima() from the forecast package.
simulate.Arima() has the advantages that (1) it can simulate seasonal ARIMA models (sometimes called "SARIMA") and (2) it can simulate a continuation of an existing time series to which you have already fitted an ARIMA model. To use simulate.Arima(), you do need to already have an Arima object.
UPDATE:
Type ?arima.sim and scroll down to the examples.
Alternatively:
install.packages("forecast")
library(forecast)
fit <- auto.arima(USAccDeaths)
plot(USAccDeaths,xlim=c(1973,1982))
lines(simulate(fit, 36),col="red")

R, cointegration, multivariate, ca.jo(), Johansen

I am new to R and cointegration, so please have patience with me as I try to explain what I am trying to do. I am trying to find cointegrated variables among 1500-2000 voltage variables in the western power system in Canada/US. The frequency is hourly (common in power systems) and cointegrated combinations can involve as few as N variables and at most M variables.
I tried to use ca.jo but here are issues that I ran into:
1) ca.jo (Johansen) has a limit to the number of variables it can work with
2) ca.jo appears to force the first variable in the y(t) vector to be the dependent variable (see below).
Eigenvectors, normalised to first column: (These are the cointegration relations)
V1.l2 V2.l2 V3.l2
V1.l2 1.0000000 1.0000000 1.0000000
V2.l2 -0.2597057 -2.3888060 -0.4181294
V3.l2 -0.6443270 -0.6901678 0.5429844
As you can see, ca.jo tries to find linear combinations of the 3 variables but forces the coefficient on the first variable (in this case V1) to be 1, i.e. treats it as the dependent variable. My understanding was that ca.jo would try all combinations such that every variable is selected as the dependent variable. You can see the same treatment in the examples given in the documentation for ca.jo.
3) ca.jo does not appear to find linear combinations of fewer variables than there are in the y(t) vector. So if there were 5 variables and 3 of them were cointegrated (i.e. V1 ~ V2 + V3), then ca.jo fails to find this combination. Perhaps I am not using ca.jo correctly, but my expectation was that a cointegrated combination where V1 ~ V2 + V3 is the same as V1 ~ V2 + V3 + 0 x V4 + 0 x V5. In other words, the coefficients of the variables that are NOT cointegrated should be zero, and ca.jo should find this type of combination.
I would greatly appreciate some further insight, as I am fairly new to R and cointegration and have spent the past 2 months teaching myself.
Thank you.
I have also posted on nabble:
http://r.789695.n4.nabble.com/ca-jo-cointegration-multivariate-case-tc3469210.html
I'm not an expert, but since no one is responding, I'm going to take a stab at this one. EDIT: I noticed that I just answered a 4 year old question. Hopefully it might still be useful to others in the future.
Your general understanding is correct. I'm not going to go into great detail about the whole procedure but will try to give some general insight. The first thing the Johansen procedure does is create a VECM out of the VAR model that best corresponds to the data (this is why you also need the lag length of the VAR as input to the procedure). The procedure then investigates the non-lagged component matrix of the VECM by looking at its rank: if the variables are not cointegrated, then the rank of the matrix will not be significantly different from 0. A more intuitive way of understanding the Johansen VECM equations is to notice the comparability with the ADF procedure for each distinct row of the model.
Furthermore, the rank of the matrix is equal to the number of its eigenvalues (characteristic roots) that are different from zero. Each eigenvalue is associated with a different cointegrating vector, which is equal to its corresponding eigenvector. Hence, an eigenvalue significantly different from zero indicates a significant cointegrating vector. Significance of the vectors can be tested with two distinct statistics: the max statistic or the trace statistic. The trace test tests the null hypothesis of at most r cointegrating vectors against the alternative of more than r cointegrating vectors. In contrast, the maximum eigenvalue test tests the null hypothesis of r cointegrating vectors against the alternative of r + 1 cointegrating vectors.
Now for an example,
library(vars)   # VAR()
library(urca)   # ca.jo()
# We fit the data to a VAR to obtain the optimal lag length, using the SC information criterion.
varest <- VAR(yourData, p = 1, type = "const", lag.max = 24, ic = "SC")
# obtain the lag length of the VAR that best fits the data
lagLength <- max(2, varest$p)
# Perform the Johansen procedure for cointegration.
# Allow intercepts in the cointegrating vector: data without zero mean.
# Use the trace statistic (null hypothesis: number of cointegrating vectors <= r).
res <- ca.jo(yourData, type = "trace", ecdet = "const", K = lagLength, spec = "longrun")
testStatistics <- res@teststat
criticalValues <- res@cval
# If the test statistic for r <= 0 is greater than the corresponding critical value,
# then r <= 0 is rejected and we have at least one cointegrating vector.
# We use the 90% confidence level to make our decision.
if (testStatistics[length(testStatistics)] >= criticalValues[dim(criticalValues)[1], 1]) {
  # Keep the eigenvector that has the maximum eigenvalue. Note: we throw away the constant!
  cointVector <- res@V[1:ncol(yourData), which.max(res@lambda)]
}
This piece of code checks whether there is at least one cointegrating vector (i.e. r <= 0 is rejected) and then extracts the vector with the strongest cointegrating properties, in other words the vector with the largest eigenvalue (lambda).
Regarding your question: the procedure does not "force" anything. It checks all combinations; that is why you get your 3 different vectors. It is my understanding that the method simply scales/normalizes each vector on the first variable.
Regarding your other question: the procedure calculates the vectors for which the residual has the strongest mean-reverting / stationarity properties. If one or more of your variables does not contribute further to these properties, then the component for this variable in the vector will indeed be 0. However, if the component value is not 0, then it means that "stronger" cointegration was found by including the extra variable in the model.
Furthermore, you can test the significance of your components. Johansen allows a researcher to test a hypothesis about one or more coefficients in the cointegrating relationship by viewing the hypothesis as a restriction on the non-lagged component matrix in the VECM. If there exist r cointegrating vectors, only these linear combinations, or linear transformations of them, will be stationary. However, I'm not aware of how to perform these extra checks in R (a possible approach is sketched at the end of this answer).
Probably the best way for you to proceed is to first test the combinations that contain a smaller number of variables. You then have the option not to add extra variables to these cointegrating subsets if you don't want to. But as already mentioned, adding other variables can potentially increase the cointegrating properties / stationarity of your residuals. Whether that is the behaviour you want will depend on your requirements.
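As a hedged aside on those "extra checks": urca itself provides blrtest() for testing linear restrictions on the cointegrating vectors (beta). A sketch, reusing the res object from the code above (with ecdet = "const", beta has one row per variable plus one for the constant; the particular restriction shown is only an illustration):
library(urca)
p <- ncol(yourData) + 1   # rows of beta: the variables plus the constant
H <- diag(p)[, -2]        # restriction matrix forcing the 2nd coefficient to zero
summary(blrtest(res, H = H, r = 1))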
I've been searching for an answer to this and I think I found one, so I'm sharing it hoping it's the right solution.
By using the Johansen test you test for the rank (the number of cointegrating vectors), and it also returns the eigenvectors, and the alphas and betas that build said vectors.
In theory, if you reject r = 0 and accept r = 1 (the statistic for r = 0 exceeds its critical value and the one for r = 1 does not), you would look for the largest eigenvalue and build your vector from it. In this case, if the largest eigenvalue were the first, the vector would be V1*1 + V2*(-0.26) + V3*(-0.64).
This would generate the cointegration residuals for these variables (a small numeric sketch is given at the end of this answer).
Again, I'm not 100% sure, but pretty sure the above is how it works.
Nonetheless, you can always use the cajools function from the urca package to create a VECM automatically. You only need to feed it a ca.jo object and define the number of ranks (https://cran.r-project.org/web/packages/urca/urca.pdf).
If someone could confirm / correct this, it would be appreciated.
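A small numeric sketch of forming the cointegration residuals from that first eigenvector (the data object and column names are placeholders):
# error-correction term: linear combination of the levels with the eigenvector weights
ect <- as.matrix(yourData[, c("V1", "V2", "V3")]) %*% c(1, -0.26, -0.64)
plot(ect, type = "l")   # should look mean-reverting if the variables are cointegrated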
