Simulating an ARMA Model Using R

My professor and I, who are new to time series analysis in R, are attempting to simulate an ARMA model. However, we are having trouble understanding where the parameters for the simulation come from. When simulating an ARMA model in R with the arima.sim() function, one required argument is model =, a list whose ar and ma components give the AR and MA coefficients, respectively. The issue we are running into is that we do not know where these AR and MA coefficients come from. Would anyone happen to know where the coefficients arise from?
I have tried searching the internet for information regarding this issue. The only answer I have seen is that the coefficients come from running an ACF and PACF, but there has been no further explanation of what we should run the ACF and PACF over to generate these coefficients. Are we running the ACF and PACF over previously simulated data, or something else?
AR(1) Model Example Code
# Coefficient specifications for two AR(1) simulations
# (note: arima.sim() takes sd as its own argument, not as part of the model list)
Ar.sm <- list(order = c(1, 0, 0), ar = 0.1)
Ar.lg <- list(order = c(1, 0, 0), ar = 0.1)
AR1.sm <- arima.sim(model = Ar.sm, n = 50, sd = 0.1)
AR1.lg <- arima.sim(model = Ar.lg, n = 50, sd = 0.1)
Any help would be greatly appreciated. Additionally, if anyone has found any literature or videos explaining this more in depth, that would be fantastic. Thank you and have a nice day.

ARMA is really a class of models: you get different models by choosing different parameters. An ARMA(p,q) model has p auto-regressive (AR) terms and q moving-average (MA) terms, and the AR and MA coefficients set the size of those terms. If you are merely simulating a model (as opposed to making inferences from data), then it is up to you to set the coefficients to whatever values you want to simulate with. You are correct that different coefficient values give different kinds of results, and these are closely connected to the ACF and PACF.
Since you are simulating, may I suggest that you simply simulate some examples with coefficients of your choice, and vary the coefficients you put into the simulation to see the differences in what you get out. It would also be a useful exercise to construct the sample ACF and PACF of your simulated data and see how these vary as you change the coefficient values going into the simulation. This will give you a better idea of the connection between the coefficients and the output of the model.
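For instance, a minimal sketch along those lines (the coefficient values 0.1 and 0.9 are arbitrary choices for illustration):
# Simulate two AR(1) series with a weak and a strong coefficient,
# then compare their sample ACF and PACF.
set.seed(123)
sim_weak   <- arima.sim(model = list(ar = 0.1), n = 500)
sim_strong <- arima.sim(model = list(ar = 0.9), n = 500)
op <- par(mfrow = c(2, 2))
acf(sim_weak,  main = "ACF, ar = 0.1")
pacf(sim_weak, main = "PACF, ar = 0.1")
acf(sim_strong,  main = "ACF, ar = 0.9")
pacf(sim_strong, main = "PACF, ar = 0.9")
par(op)
The stronger the AR coefficient, the more slowly the sample ACF decays, while the PACF of an AR(1) process cuts off after lag 1 in both cases.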

Related

How to decompose a gamma distribution into two gamma distributions in R

Is there an algorithm available in R that can decompose a gamma distribution into two (or more) gamma distributions? If so, can you give me an example? Basically, I have a data set that looks like a gamma distribution when I plot it with respect to time (it is time series data). The data record the movement of an animal, and the animal can be in two different states: hungry and not hungry. My immediate reaction was to use a Hidden Markov Model and see if I can predict the two states. I was trying to use the depmix() function from the depmixS4 library in R to see if I could recover the two different states. However, I don't really know how to use this function with a gamma distribution. The following is the code that I wrote, but it says that I need an argument for gamma, which I don't understand. Can someone tell me what parameter I should use and how to determine it? Thanks!
mod <- depmix(freq ~ 1, data = mod.data, nstates = 2, family = gamma())
fit.mod <- fit(mod)
Thank you!
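A note for reference: the error most likely arises because gamma() is base R's gamma function (which needs a numeric argument) rather than a GLM family object. A hedged sketch of a corrected call, assuming depmixS4 accepts the Gamma() family for this response and reusing mod.data and freq from the question (starting values or a different link may still be needed):
library(depmixS4)
# Sketch only: use the GLM family constructor Gamma(), not base R's gamma()
mod <- depmix(freq ~ 1, data = mod.data, nstates = 2, family = Gamma())
fit.mod <- fit(mod)
summary(fit.mod)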

ACF PACF Determination ARIMA

Please help me confirm my understanding. For the graphs below (ACF and PACF plots), I believe AR(p) = 0 and MA(q) = 0. Is that correct?
First, let's learn more...
As Aashiq Reza provided the description link, I think the ACF and PACF plots you shared look like an MA(2) process.
ARIMA(p, d, q) has three elements: p is the AR order, d is the order of differencing, and q is the MA order. These orders define the lags that enter the model's regression formula, so if both p and q are zero (with no differencing), the model reduces to white noise and is not really an ARIMA model anymore.
My suggestion: probabilistic model selection...
You can evaluate candidate models for a time series using information criteria such as AIC and BIC. For example, take a preset grid of possible p and q values, fit each model, and compute the criterion for it; the model with the smallest criterion value is the best one. This link helps with the calculation in Python.
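The same idea can be sketched directly in R (y is a placeholder for your series, and the grid bounds are arbitrary):
# Fit ARMA(p, q) models over a small grid of orders and keep the one with the lowest AIC.
orders <- expand.grid(p = 0:3, q = 0:3)
ics <- apply(orders, 1, function(o) {
  fit <- try(arima(y, order = c(o["p"], 0, o["q"])), silent = TRUE)
  if (inherits(fit, "try-error")) NA else AIC(fit)
})
orders[which.min(ics), ]   # the (p, q) pair with the smallest AIC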

How to deal with spatially autocorrelated residuals in GLMM

I am conducting an analysis of where on the landscape a predator encounters potential prey. My response data is binary with an Encounter location = 1 and a Random location = 0 and my independent variables are continuous but have been rescaled.
I originally used a GLM structure
glm_global <- glm(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs,
data=Data_scaled, family=binomial)
but realized that this failed to account for potential spatial-autocorrelation in the data (a spline correlogram showed high residual correlation up to ~1000m).
Correlog_glm_global <- spline.correlog(x = Data_scaled[, "Y"],
                                       y = Data_scaled[, "X"],
                                       z = residuals(glm_global, type = "pearson"),
                                       xmax = 1000)
I attempted to account for this by implementing a GLMM (in lme4) with the predator group as the random effect.
glmm_global <- glmer(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs+(1|Group),
data=Data_scaled, family=binomial)
When comparing the AIC of the global GLMM (1144.7) to the global GLM (1149.2) I get a delta AIC value >2, which suggests that the GLMM fits the data better. However, I am still getting essentially the same correlation in the residuals, as shown on the spline correlogram for the GLMM model.
Correlog_glmm_global <- spline.correlog(x = Data_scaled[, "Y"],
                                        y = Data_scaled[, "X"],
                                        z = residuals(glmm_global, type = "pearson"),
                                        xmax = 10000)
I also tried explicitly including the Lat*Long of all the locations as an independent variable but results are the same.
After reading up on options, I tried running Generalized Estimating Equations (GEEs) in “geepack”, thinking this would allow me more flexibility with regard to explicitly defining the correlation structure (as in GLS models for normally distributed response data) instead of being limited to compound symmetry (which is what we get with GLMM). However, I realized that my data still demanded the use of compound symmetry (or “exchangeable” in geepack) since I didn’t have a temporal sequence in the data. When I ran the global model
gee_global <- geeglm(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs,
id=Pride, corstr="exchangeable", data=Data_scaled, family=binomial)
(using scaled or unscaled data made no difference so this is with scaled data for consistency)
suddenly none of my covariates were significant. However, being a novice with GEE modelling, I don't know (a) whether this is a valid approach for these data or (b) whether this has even accounted for the residual autocorrelation that has been evident throughout.
I would be most appreciative for some constructive feedback as to 1) which direction to go once I realized that the GLMM model (with predator group as a random effect) still showed spatially autocorrelated Pearson residuals (up to ~1000m), 2) if indeed GEE models make sense at this point and 3) if I have missed something in my GEE modelling. Many thanks.
Taking spatial autocorrelation into account in your model can be done in many ways. I will restrict my response to the main R packages that deal with random effects.
First, you could go with the package nlme and specify a correlation structure for your residuals (many are available: corGaus, corLin, corSpher, ...). You should try several of them and keep the best model. In this case the spatial autocorrelation is treated as continuous and is approximated by a global function.
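For a binary response, one way to use these nlme-style correlation structures is MASS::glmmPQL, which accepts the same corStruct classes. A sketch under the variable names from the question (an illustration, not a drop-in answer):
library(MASS)   # glmmPQL
library(nlme)   # corExp, corGaus, corSpher, ...
# Exponential spatial correlation on the coordinates, with the same random
# intercept for Group as in the GLMM above; other corStruct classes can be swapped in.
pql_global <- glmmPQL(Encounter ~ Dist_water_cs + coverMN_cs + I(coverMN_cs^2) +
                        Prey_bio_stand_cs + Prey_freq_stand_cs + Dist_centre_cs,
                      random = ~ 1 | Group,
                      correlation = corExp(form = ~ X + Y),
                      family = binomial, data = Data_scaled)
plot(Variogram(pql_global, form = ~ X + Y, resType = "pearson"))
Note that glmmPQL does not return a true likelihood, so competing correlation structures are usually compared through residual diagnostics (correlograms, variograms) rather than AIC.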
Second, you could go with the package mgcv and add a bivariate spline over the spatial coordinates to your model. This way you can capture a spatial pattern and even map it. Strictly speaking this method does not model the spatial autocorrelation itself, but it may solve the problem. If space is discrete in your case, you could use a Markov random field smooth instead. This website is very helpful for finding examples: https://www.fromthebottomoftheheap.net
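Here is what the mgcv approach might look like, again reusing the question's variable names (a sketch; Group must be a factor for the random-effect smooth):
library(mgcv)
# Thin-plate smooth of the coordinates absorbs the broad spatial pattern,
# plus a random-effect smooth standing in for the (1 | Group) term.
gam_global <- gam(Encounter ~ Dist_water_cs + coverMN_cs + I(coverMN_cs^2) +
                    Prey_bio_stand_cs + Prey_freq_stand_cs + Dist_centre_cs +
                    s(X, Y, bs = "tp") + s(Group, bs = "re"),
                  family = binomial, data = Data_scaled, method = "REML")
plot(gam_global, select = 1, scheme = 2)   # map the fitted spatial surface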
Third, you could go with the package brms. It allows you to specify very complex models with other correlation structures in the residuals (CAR and SAR). The package uses a Bayesian approach.
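A hedged sketch of what such a brms model could look like; a Gaussian-process term gp(X, Y) is used here instead of the CAR/SAR structures just mentioned, since those require a discrete spatial structure (an adjacency or weights matrix), and fitting a GP can be slow on large data sets:
library(brms)
brm_global <- brm(Encounter ~ Dist_water_cs + coverMN_cs + I(coverMN_cs^2) +
                    Prey_bio_stand_cs + Prey_freq_stand_cs + Dist_centre_cs +
                    gp(X, Y) + (1 | Group),
                  family = bernoulli(), data = Data_scaled)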
I hope this helps. Good luck.

Difference between glmnet() and cv.glmnet() in R?

I'm working on a project that would show the potential influence a group of events has on an outcome. I'm using the glmnet package, specifically its Poisson family. Here's my code:
# de <- data imported from sql connection
x <- model.matrix(~.,data = de[,2:7])
y <- (de[,1])
reg <- cv.glmnet(x,y, family = "poisson", alpha = 1)
reg1 <- glmnet(x,y, family = "poisson", alpha = 1)
Co <- coef(?reg or reg1?, s = ???)
summ <- summary(Co)
c <- data.frame(Name= rownames(Co)[summ$i],
Lambda= summ$x)
c2 <- c[with(c, order(-Lambda)), ]
The beginning imports a large amount of data from my database in SQL. I then put it in matrix format and separate the response from the predictors.
This is where I'm confused: I can't figure out exactly what the difference is between the glmnet() function and the cv.glmnet() function. I realize that the cv.glmnet() function is a k-fold cross-validation of glmnet(), but what exactly does that mean in practical terms? They provide the same value for lambda, but I want to make sure I'm not missing something important about the difference between the two.
I'm also unclear as to why it runs fine when I specify alpha = 1 (supposedly the default) but not if I leave it out.
Thanks in advance!
glmnet is an R package that can be used to fit regularized regression models such as ridge and lasso. The alpha argument determines what type of model is fit: alpha = 0 fits a ridge model and alpha = 1 fits a lasso model.
cv.glmnet() performs cross-validation, by default 10-fold, which can be adjusted using nfolds. A 10-fold CV randomly divides your observations into 10 non-overlapping folds of approximately equal size; each fold in turn serves as the validation set while the model is fit on the remaining 9 folds. Managing the bias-variance trade-off is usually the motivation for such validation methods. In the case of lasso and ridge models, CV helps choose the value of the tuning parameter lambda.
In your example, you can do plot(reg) or look at reg$lambda.min to see the value of lambda that gives the smallest CV error, and then derive the test MSE for that value of lambda. By default, glmnet() fits the ridge or lasso path over an automatically selected range of lambda values, which may not give the lowest test MSE. Hope this helps!
Between reg$lambda.min and reg$lambda.1se: lambda.min will give you the lowest CV error, but depending on how much error you can tolerate you may want to choose reg$lambda.1se, as this value further shrinks the number of predictors. You may also choose the mean of reg$lambda.min and reg$lambda.1se as your lambda value.
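To fill in the placeholder in the question's code, a sketch using the cross-validated object reg from above:
plot(reg)                            # CV error across the lambda path
reg$lambda.min                       # lambda with the smallest CV error
reg$lambda.1se                       # largest lambda within 1 SE of the minimum
Co <- coef(reg, s = "lambda.min")    # or s = "lambda.1se" for a sparser model
The plain glmnet() fit (reg1) stores the whole coefficient path, so coef(reg1, s = ...) also works, but the choice of s then has to come from somewhere, which is exactly what cv.glmnet() provides.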

OLS estimation with AR(1) term

For reasons that I cannot explain (because I can't, not because I don't want to), a process used at my office requires running some regressions on Eviews.
The equation specification used on Eviews is:
dependent_variable c independent_variable ar(1)
Furthermore, the process used is "NLS and ARMA."
I don't use Eviews but, as I understand it, that equation means an OLS regression with a constant, one independent variable and an AR(1) term.
I tried running this in R:
result <- lm(df$dependent[2:48] ~ df$independent[1:47] + df$dependent[1:47])
Where df is a data.frame containing the dependent and independent variables (both spanning 48 observations).
Am I doing it right? Because the parameter estimations, while similar, are different in Eviews. Different enough that I cannot use them.
I've thoroughly searched the internet for what this means. I've read up on ARIMA and ARMAX models but I don't think that this is it. I'm sorry but I'm not that knowledgeable on statistics. By the way, estimating ARMAX models seems very complicated and is done by ML, not LS, so I'm really hoping that's not it.
EDIT: I had to edit the model indexes again because I messed them up, again.
You need the arima function; see ?arima.
Example with some data
y <- lh # lh is Luteinizing Hormone in Blood Samples in datasets package (Base)
set.seed(001)
x <- rnorm(length(y), 100, 10)
arima(y, order = c(1,0,0), xreg=x)
Call:
arima(x = y, order = c(1, 0, 0), xreg = x)

Coefficients:
         ar1  intercept       x
      0.5810     1.8821  0.0053
s.e.  0.1153     0.6991  0.0068

sigma^2 estimated as 0.195:  log likelihood = -29.08,  aic = 66.16
See ?arima to find help about its arguments.
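Applied to the data frame in the question (column names assumed from the lm() call shown there), the equivalent call would look something like this:
# Regression with an intercept, one regressor and AR(1) errors, mirroring the
# EViews specification "dependent_variable c independent_variable ar(1)".
result <- arima(df$dependent, order = c(1, 0, 0), xreg = df$independent)
result
# method = "CSS" (conditional sum of squares) may land closer to EViews' NLS
# estimates than the default "CSS-ML", though some difference is to be expected.
Small discrepancies with the EViews output are normal because the estimation methods differ.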
