I am working with bivariate time series data and used a VAR model to fit and forecast.
But the p-value from serial.test (Portmanteau test) comes out at p << 0.05. Is that okay?
> var1 = VAR(datax.ts, p= 8)
> serial.test(var1, lags.pt=10, type = "PT.asymptotic")
Portmanteau Test (asymptotic)
data: Residuals of VAR object var1
Chi-squared = 23.724, df = 8, p-value = 0.002549
Or is this wrong? Also, the forecast is flat. Any idea how to change this?
I have attached the raw data for your reference.
If I understood you correctly, you estimated a VAR model using the vars package and then tested the model for autocorrelation in the errors using a portmanteau test.
The null hypothesis of no autocorrelation is rejected since the p-value of 0.002549 is lower than the significance level alpha of 0.05.
Since autocorrelation in the residuals is an undesirable feature, you will want to move on and search for a specification whose residuals show no autocorrelation.
Put differently, because there is still autocorrelation in the errors, there is systematic variation left over that the model has not explained.
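If it helps, here is a minimal sketch (assuming your data are in datax.ts, as in your snippet) of how one might scan lag orders for a specification whose residuals pass the portmanteau test; VARselect() can also suggest an order via information criteria:
library(vars)
# lag orders suggested by the usual information criteria
VARselect(datax.ts, lag.max = 12)$selection
# refit the VAR at each candidate order and record the portmanteau
# p-value (lags.pt must exceed the VAR order); a p-value above 0.05
# means no evidence of residual autocorrelation
pvals <- sapply(1:12, function(p) {
  fit <- VAR(datax.ts, p = p)
  serial.test(fit, lags.pt = 16, type = "PT.asymptotic")$serial$p.value
})
round(pvals, 3)
Increasing the lag order is only one remedy; adding deterministic terms or differencing non-stationary series can also remove residual autocorrelation.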
I am trying to use the car::Anova function to carry out joint Wald chi-squared tests for interaction terms involving categorical variables.
I would like to compare results when using bootstrapped variance-covariance matrix for the model coefficients. I have some concerns about the normality of residuals and am doing this as a first step before considering permutation tests as an alternative to joint Wald chi-squared tests.
I have computed the variance-covariance matrix from the model fitted to 1000 bootstrap resamples of the data. The problem is that the car::Anova.merMod function does not seem to use the user-specified variance-covariance matrix: I get the same results whether I specify vcov. or not.
I have made a very simple example below where I try to use the identity matrix in Anova(). I have tried this with the more realistic bootstrapped var-cov as well.
I looked at the code on GitHub, and it looks like there is a line where vcov. is overwritten with vcov(mod), so that might be a bug. However, I thought I'd see if anyone here had come across this issue, or could see whether I had made a mistake.
Any help would be great!
library(lme4)
library(car)

set.seed(1)  # seed added for reproducibility (not in the original)
df1 = data.frame(y = rbeta(180, 2, 5), x = rnorm(180), group = letters[1:30])
mod1 = lmer(y ~ x + (1 | group), data = df1)
# Default: uses the variance-covariance matrix from the model
Anova(mod1)
# Should use the user-specified var-cov matrix but does not - same results as above
Anova(mod1, vcov. = diag(2))
# I'm not bootstrapping the var-cov matrix here to save space/time
P.S. Using car::linearHypothesis works for a user-specified vcov, but it does not give results using type 3 sums of squares, and it is more laborious to use for more than one interaction term. Therefore I'd prefer to use car::Anova if possible (see the sketch below).
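For what it's worth, a minimal sketch of that linearHypothesis() workaround, reusing mod1 and the identity matrix from the example above (the identity matrix again just stands in for a bootstrapped var-cov matrix):
library(car)
# linearHypothesis() respects vcov., so the two calls below should differ;
# with Anova() they come out identical, which is the reported problem
linearHypothesis(mod1, "x = 0")                  # model-based var-cov
linearHypothesis(mod1, "x = 0", vcov. = diag(2)) # user-specified var-cov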
I am using "glmnet" package (in R) mostly to perform regularized linear regression.
However I am wondering if it can perform LASSO-type regressions with non-negative (integer) continuous (dependent) outcome variable.
I can use family = poisson, but the outcome variable is not specifically "count" variable. It is just a continuous variable with lower limit 0.
I aware of "lower.limits" function, but I guess it is for covariates (independent variables). (Please correct me if my understanding of this function not right.)
I look forward to hearing from you all! Thanks :-)
You are right that lower.limits in glmnet constrains the coefficients of the covariates, not the response. Poisson enforces a lower limit of zero because the linear predictor is exponentiated to get back to the "counts".
Along those lines, it will most likely work if you transform your response variable. One quick way is to take the log of the response, do the fit, and transform back; the back-transformed predictions are then always positive, although you first have to deal with any zeros.
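As a minimal sketch of that idea (assuming a numeric predictor matrix X and a strictly positive response y; zeros would need an offset or a different approach):
library(glmnet)
# fit on the log scale and exponentiate the predictions back;
# exp() guarantees the back-transformed predictions are positive
fit <- cv.glmnet(X, log(y))
pred <- exp(predict(fit, X, s = "lambda.min"))
range(pred)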
An alternative is a power transformation. There is a lot to think about here, and since you did not provide your data, I can only try a two-parameter Box-Cox on an example dataset:
library(glmnet)
library(mlbench)
library(geoR)
data(BostonHousing)
data = BostonHousing
data$chas = as.numeric(data$chas)  # convert the factor to numeric
# rescale the response to the range [0, 1]
data$medv = (data$medv - min(data$medv)) / diff(range(data$medv))
Then I use a quick approximation via PCA (rather than fitting all the variables) to get suitable lambda and lambda2 values:
bcfit = boxcoxfit(object = data[,14],
xmat = prcomp(data[,-14],scale=TRUE,center=TRUE)$x[,1:2],
lambda2=TRUE)
bcfit
Fitted parameters:
lambda lambda2 beta0 beta1 beta2 sigmasq
0.42696313 0.00001000 -0.83074178 -0.09876102 0.08970137 0.05655903
Convergence code returned by optim: 0
Check lambda2: it is the one that is critical for deciding whether you can get a negative value, and it should be rather small.
Create the functions for the power transform and its inverse:
bct = function(y, l1, l2) ((y + l2)^l1 - 1) / l1
bctinverse = function(y, l1, l2) (y * l1 + 1)^(1 / l1) - l2
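A quick sanity check that the two functions invert each other (up to floating-point error):
y <- data$medv
yt <- bct(y, bcfit$lambda[1], bcfit$lambda[2])
max(abs(bctinverse(yt, bcfit$lambda[1], bcfit$lambda[2]) - y))  # ~ 0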
Now we transform the response:
data$medv_trans = bct(data$medv,bcfit$lambda[1],bcfit$lambda[2])
And fit glmnet:
fit = glmnet(x=as.matrix(data[,1:13]),y=data$medv_trans,nlambda=500)
Get predictions over all lambdas; you can see there are no negative predictions once you transform back:
pred = predict(fit,as.matrix(data[,1:13]))
range(bctinverse(pred,bcfit$lambda[1],bcfit$lambda[2]))
[1] 0.006690685 0.918473356
And let's say we do a fit with cv:
fit = cv.glmnet(x=as.matrix(data[,1:13]),y=data$medv_trans)
pred = predict(fit,as.matrix(data[,1:13]))
pred_transformed = bctinverse(pred, bcfit$lambda[1], bcfit$lambda[2])
plot(data$medv,pred_transformed,xlab="orig response",ylab="predictions")
I would like to test the suitability of a dynamic linear model which I have fitted to a problem dataset, using the SS() function in the dse package in R. Are there any ways of testing the fit of the model in R using likelihoods and information criteria?
For illustrative purposes, assume that my model is a random walk: the state evolves as X(t) = X(t-1) + e(t) with e(t) ~ N(0,1), and the observations are Y(t) = X(t) + w(t) with w(t) ~ N(0,1). The code in R is:
kalman.filter=dse::SS(F = matrix(1,1,1),
Q = matrix(1,1,1),
H = matrix(1,1,1),
R = matrix(1,1,1),
z0 = matrix(0,1,1),
P0 = matrix(0,1,1)
)
Assume that the actual observations were then:
simulate.kalman.filter=simulate(kalman.filter, start = 1, freq = 1, sampleT = 100)
Then assume we fit a model called "test":
test=l(kalman.filter, simulate.kalman.filter)
How can I test the fit of the data (simulate.kalman.filter) to the theoretical model in R? I am looking for functions such as the likelihood and the Bayesian Information Criterion.
I have figured out the answer to the question.
The function for doing this is informationTests() in the same dse package. It returns the AIC, BIC, and negative log-likelihood of the fitted model. In the example above, this is done by:
informationTests(test)
Remember that a model with a lower BIC is considered better. You can also compare two models (say you had a second model, test2, fitted to the same data) by adding the second model as an argument:
informationTests(test, test2)
This tabulates the AIC, BIC and likelihoods against one another.
I created a linear regression model of two continuous variables, Income and Expense; the former is the independent variable and the latter the dependent. I initially found heteroskedasticity in the model after looking at the spread of the data and then running a post-estimation test (Breusch-Pagan), which gave p-value < 2.2e-16. Since this was less than the significance level of 0.05, I rejected the null hypothesis of homoskedasticity and concluded that heteroskedasticity exists.
In trying to correct the heteroskedasticity I used the box-cox transformation on the dependent variable using the following code:
library(MASS)                                        # for boxcox()
lmodI <- lm(expense ~ income, data = newexcel)       # my original model
boxcox(lmodI, lambda = seq(0, 0.5, 0.1))             # found the ideal lambda value to be 0.35
newexcel <- cbind(newexcel, newexcel$expense^0.35)   # added the new variable to the dataframe
names(newexcel)[14] <- "Yprime"                      # renamed the column to "Yprime"
lmodINew <- lm(Yprime ~ income, data = newexcel)     # created the new linear model
I then decided to compare the old model to the new to see if I had corrected the heteroskedasticity - creating the following diagnostic plots:
Original model: [diagnostic plots not shown]
New model: [diagnostic plots not shown]
I also ran the Breusch-Pagan test for the new model and found that the p-value stayed the same, at p-value < 2.2e-16. This, and the fact that I couldn't see much difference between the two sets of diagnostic plots, has confused me, as I expected the method to fix the heteroskedasticity.
I expected the p-value for the new model to be above 0.05, so that I could not reject the null hypothesis and could thus conclude homoskedasticity. Have I done something wrong in the Box-Cox transformation?
From your plots it seems you have a couple of hundred observations. Remember that the Breusch-Pagan statistic is essentially the number of observations times the R-squared from the auxiliary regression of the squared residuals on the regressors (see eqn. [8.16] in Wooldridge 2015). When n is large, even a tiny R-squared produces a large statistic, so the test will almost always reject the null hypothesis.
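You can see this mechanically by computing the LM form of the statistic yourself; here is a sketch using lmodI from your question (bptest() is from the lmtest package):
library(lmtest)
# n * R-squared from the auxiliary regression of the squared residuals
# on the regressors (Wooldridge 2015, eqn. [8.16]); the default
# studentized bptest() computes exactly this statistic
aux <- lm(resid(lmodI)^2 ~ income, data = newexcel)
stat <- nobs(lmodI) * summary(aux)$r.squared
pchisq(stat, df = 1, lower.tail = FALSE)  # compare with bptest(lmodI)
With n in the hundreds, even an economically negligible amount of heteroskedasticity yields a large statistic and hence a tiny p-value.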
I am trying to explain to myself the forecasting result from applying an ARIMA model to a time-series dataset. The data is from the M1-Competition, the series is MNB65. I am trying to fit the data to an ARIMA(1,0,0) model and get the forecasts. I am using R. Here are some output snippets:
> arima(x, order = c(1,0,0))
Series: x
ARIMA(1,0,0) with non-zero mean
Call: arima(x = x, order = c(1, 0, 0))
Coefficients:
ar1 intercept
0.9421 12260.298
s.e. 0.0474 202.717
> predict(arima(x, order = c(1,0,0)), n.ahead=12)
$pred
Time Series:
Start = 53
End = 64
Frequency = 1
[1] 11757.39 11786.50 11813.92 11839.75 11864.09 11887.02 11908.62 11928.97 11948.15 11966.21 11983.23 11999.27
I have a few questions:
(1) How do I explain that although the dataset shows a clear downward trend, the forecast from this model trends upward? This also happens for ARIMA(2,0,0), which is the best ARIMA fit for the data using auto.arima (forecast package) and for an ARIMA(1,0,1) model.
(2) The intercept value for the ARIMA(1,0,0) model is 12260.298. Shouldn't the intercept satisfy the equation: C = mean * (1 - sum(AR coeffs)), in which case, the value should be 715.52. I must be missing something basic here.
(3) This is clearly a series with non-stationary mean. Why is an AR(2) model still selected as the best model by auto.arima? Could there be an intuitive explanation?
Thanks.
No ARIMA(p,0,q) model will allow for a trend because the model is stationary. If you really want to include a trend, use ARIMA(p,1,q) with a drift term, or ARIMA(p,2,q). The fact that auto.arima() is suggesting 0 differences would usually indicate there is no clear trend.
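For example, a sketch using the forecast package (the order (1,1,0) is purely illustrative, not a recommendation for your series):
library(forecast)
# an ARIMA(p,1,q) model with a drift term can produce trending forecasts
fit <- Arima(x, order = c(1, 1, 0), include.drift = TRUE)
plot(forecast(fit, h = 12))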
The help file for arima() shows that the intercept is actually the mean. That is, the AR(1) model is (Y_t - c) = ϕ(Y_{t-1} - c) + e_t rather than Y_t = c + ϕY_{t-1} + e_t as you might expect.
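So the constant in your equation follows directly from the printed mean. Checking with the rounded output above (the exact value depends on the unrounded coefficients, which is why it will not match your 715.52 exactly):
# implied regression constant c * (1 - phi)
12260.298 * (1 - 0.9421)
# [1] 709.8713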
auto.arima() uses a unit root test to determine the number of differences required. So check the results from the unit root test to see what's going on. You can always specify the required number of differences in auto.arima() if you think the unit root tests are not leading to a sensible model.
Here are the results from two tests for your data:
R> adf.test(x)
Augmented Dickey-Fuller Test
data: x
Dickey-Fuller = -1.031, Lag order = 3, p-value = 0.9249
alternative hypothesis: stationary
R> kpss.test(x)
KPSS Test for Level Stationarity
data: x
KPSS Level = 0.3491, Truncation lag parameter = 1, p-value = 0.09909
So the ADF says strongly non-stationary (the null hypothesis in that case) while the KPSS doesn't quite reject stationarity (the null hypothesis for that test). auto.arima() uses the latter by default. You could use auto.arima(x,test="adf") if you wanted the first test. In that case, it suggests the model ARIMA(0,2,1) which does have a trend.