OLS estimation with AR(1) term - r

For reasons that I cannot explain (because I can't, not because I don't want to), a process used at my office requires running some regressions on Eviews.
The equation specification used on Eviews is:
dependent_variable c independent_variable ar(1)
Furthermore, the process used is "NLS and ARMA."
I don't use Eviews but, as I understand it, that equation means an OLS regression with a constant, one independent variable and an AR(1) term.
I tried running this in R:
result <- lm(df$dependent[2:48] ~ df$independent[1:47] + df$dependent[1:47])
Where df is a data.frame containing the dependent and independent variables (both spanning 48 observations).
Am I doing it right? Because the parameter estimations, while similar, are different in Eviews. Different enough that I cannot use them.
I've thoroughly searched the internet for what this means. I've read up on ARIMA and ARMAX models but I don't think that this is it. I'm sorry but I'm not that knowledgeable on statistics. By the way, estimating ARMAX models seems very complicated and is done by ML, not LS, so I'm really hoping that's not it.
EDIT: I had to edit the model indexes again because I messed them up, again.

You need the arima() function; see ?arima.
Example with some data:
y <- lh # lh is Luteinizing Hormone in Blood Samples in datasets package (Base)
set.seed(001)
x <- rnorm(length(y), 100, 10)
arima(y, order = c(1,0,0), xreg=x)
Call:
arima(x = y, order = c(1, 0, 0), xreg = x)
Coefficients:
         ar1  intercept       x
      0.5810     1.8821  0.0053
s.e.  0.1153     0.6991  0.0068
sigma^2 estimated as 0.195: log likelihood = -29.08, aic = 66.16
See ?arima to find help about its arguments.
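If the ar(1) term in EViews means what I believe it does (a regression whose errors follow an AR(1) process, fitted by nonlinear least squares), then the lm() call in the question fits a different model, because it adds a lagged dependent variable instead. Substituting u_t = rho * u_(t-1) + e_t into y_t = c0 + b * x_t + u_t gives an equation that nls() can fit directly. The sketch below uses simulated data; every name and value in it is made up for illustration and nothing comes from the question's data.

```r
# Simulated data: a regression with AR(1) errors (illustrative values only)
set.seed(1)
n <- 200
x_sim <- rnorm(n)
u_sim <- as.numeric(arima.sim(list(ar = 0.6), n = n))  # AR(1) errors, rho = 0.6
y_sim <- 1 + 2 * x_sim + u_sim

# Current and lagged values side by side (the first observation is lost)
d <- data.frame(y = y_sim[-1], ylag = y_sim[-n],
                x = x_sim[-1], xlag = x_sim[-n])

# y_t = c0 * (1 - rho) + rho * y_(t-1) + b * (x_t - rho * x_(t-1)) + e_t
fit_nls <- nls(y ~ c0 * (1 - rho) + rho * ylag + b * (x - rho * xlag),
               data = d,
               start = list(c0 = mean(d$y), b = 0, rho = 0))
coef(fit_nls)

# The same model fitted with arima() for comparison
fit_ml <- arima(y_sim, order = c(1, 0, 0), xreg = x_sim)
coef(fit_ml)
```

The two sets of estimates should be close but not identical, since arima() uses maximum likelihood by default while nls() minimises the conditional sum of squares.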


Optimizing a GAM for Smoothness

I am currently trying to fit a generalized additive model (GAM) in R using a response variable and three predictor variables. One of the predictors is linear, and the dataset consists of 298 observations.
I have run the following code to generate a basic GAM:
GAM <- gam(response~ linearpredictor+ s(predictor2) + s(predictor3), data = data[2:5])
This produces a model with 18 degrees of freedom and seems to substantially overfit the data. I'm wondering how I might generate a GAM that maximizes smoothness and predictive error. I realize that each of these features is going to come at the expense of the other, but is there a good way to find the optimal model that doesn't overfit?
Additionally, I need to perform leave-one-out cross validation (LOOCV), and I am not sure how to make sure that gam() does this in the mgcv package. Any help on either of these problems would be greatly appreciated. Thank you.
I have also generated 1,000,000 GAMs with varying combinations of smoothing parameters and ranged the maximum degrees of freedom allowed from 10 (as shown in the code below) to 19. The variable "combinations2" is a list of all 1,000,000 combinations of smoothers I selected. This code is designed to try and balance degrees of freedom and AIC score. It does function, but I'm not sure that I'm actually going to be able to find the optimal model from this. I also cannot tell how to make sure that it uses LOOCV.
BestGAM <- gam(response ~ linearpredictor + predictor2 + predictor3, data = data[2:5])
for (i in 1:100000) {
  PotentialGAM <- gam(response ~ linearpredictor + s(predictor2) + s(predictor3),
                      data = data[2:5],
                      sp = c(combinations2[i, ]$Var1, combinations2[i, ]$Var2))
  cmp <- AIC(PotentialGAM, BestGAM)  # row 1: candidate, row 2: current best
  if (cmp$df[1] <= 10 && cmp$AIC[1] < cmp$AIC[2]) {
    BestGAM <- PotentialGAM
    listNumber <- i
  }
}
You are fitting your GAM using generalised cross validation (GCV) smoothness selection. GCV is a way to get around the invariance problem of ordinary cross validation (OCV; what you call LOOCV) when estimating GAMs. GCV is the same as OCV applied to a rotated version of the fitting problem (rotating y - Xβ by Q, any orthogonal matrix). {mgcv} doesn't actually need to do the rotation when fitting with GCV, and the expected GCV score isn't affected by the rotation, so in effect GCV is just OCV (Wood 2017, p. 260).
It has been shown that GCV can undersmooth (resulting in more wiggly models) as the objective function (GCV profile) can become flat around the optimum. Instead it is preferred to estimate GAMs (with penalized smooths) using REML or ML smoothness selection; add method = "REML" (or "ML") to your gam() call.
If the REML or ML fit is as wiggly as the GCV one with your data, then I'd be inclined to presume that gam() is not overfitting, but that there is something about your response data that hasn't been explained here (are the data ordered in time, for example?).
As to your question
how I might generate a GAM that maximizes smoothness and [minimize?] predictive error,
you are already doing that using GCV smoothness selection and for a particular definition of "smoothness" (in this case it is squared second derivatives of the estimated smooths, integrated over the range of the covariates, and summed over smooths).
If you want GCV but smoother models, you can increase the gamma argument above 1; gamma = 1.4 is often used, for example, which means that each EDF costs 40% more in the GCV criterion.
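To see how these choices play out, the sketch below fits the same model three ways (default GCV, REML, and GCV with gamma = 1.4) and compares the total effective degrees of freedom. It uses data simulated with mgcv's gamSim(), not the data from the question, so the numbers it prints say nothing about your own fit.

```r
library(mgcv)

# Simulated example data from mgcv (illustrative only)
set.seed(1)
dat <- gamSim(1, n = 300, verbose = FALSE)

form <- y ~ s(x0) + s(x1) + s(x2) + s(x3)

m_gcv   <- gam(form, data = dat)                   # default: GCV smoothness selection
m_reml  <- gam(form, data = dat, method = "REML")  # REML smoothness selection
m_gamma <- gam(form, data = dat, gamma = 1.4)      # GCV with a heavier price per EDF

# Total effective degrees of freedom of each fit
edf_total <- sapply(list(GCV = m_gcv, REML = m_reml, GCV_gamma = m_gamma),
                    function(m) sum(m$edf))
edf_total
```

A lower total EDF means a smoother overall fit; if all three agree closely, the wiggliness is probably supported by the data.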
FWIW, you can get the LOOCV (OCV) score for your model without actually fitting 298 GAMs through the use of the influence matrix A. Here's a reproducible example using my {gratia} package:
library("gratia")   # for data_sim()
library("mgcv")
df <- data_sim("eg1", seed = 1)
m <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = df, method = "REML")
A <- influence(m)                        # diagonal of the influence matrix
r <- residuals(m, type = "response")
ocv_score <- mean(r^2 / (1 - A)^2)       # note the squared denominator
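The shortcut rests on the leave-one-out identity: the prediction error for observation i when it is left out equals r_i / (1 - A_ii), so the OCV score is the mean of r_i^2 / (1 - A_ii)^2. For an ordinary linear model the identity is exact and easy to check by brute force. The sketch below does that in base R on simulated data (all names are illustrative); for a penalized GAM the same formula applies with the smoothing parameters held fixed.

```r
# Simulated data for a simple linear model (illustrative only)
set.seed(1)
n <- 50
d <- data.frame(x = runif(n))
d$y <- 1 + 2 * d$x + rnorm(n, sd = 0.5)

fit <- lm(y ~ x, data = d)
r <- residuals(fit)
h <- hatvalues(fit)                     # diagonal of the hat (influence) matrix

# Shortcut: no refitting needed
ocv_shortcut <- mean(r^2 / (1 - h)^2)

# Brute force: refit n times, each time leaving one observation out
loo_err <- sapply(seq_len(n), function(i) {
  fit_i <- lm(y ~ x, data = d[-i, ])
  d$y[i] - predict(fit_i, newdata = d[i, , drop = FALSE])
})
ocv_brute <- mean(loo_err^2)

all.equal(ocv_shortcut, ocv_brute)
```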

How to perform logistic regression on not binary variable?

I was searching for this answer and I'm really surprised that I haven't found it. I just want to perform three-level logistic regression in R.
Let's define some artificial data:
set.seed(42)
y <- sample(0:2, 100, replace = T)
x <- rnorm(100)
My variable y contains three numbers: 0, 1 and 2. So I thought that the simplest way would be just to use:
glm(y ~ x, family = binomial("logit"))
However, I got a message saying that y should be in the interval [0,1]. Do you know how I can perform this regression?
Please note: I know that logistic regression with more than two levels is not straightforward and that there are several techniques for it, e.g. one-vs-all. But when I searched for them, I wasn't able to find any.
Logistic regression as implemented by glm only works for 2 levels of output, not 3.
The message is a little vague because you can specify the y-variable in logistic regression as 0s and 1s, or as a proportion (between 0 and 1) with a weights argument specifying the number of subjects the proportion is based on.
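A small sketch of the two specifications, on simulated data with made-up names: the proportion-plus-weights form and the cbind(successes, failures) form describe the same binomial model, so their coefficients should agree.

```r
# Simulated grouped binomial data (illustrative only)
set.seed(42)
n_trials <- rep(10, 20)                        # subjects per group
x <- rnorm(20)
successes <- rbinom(20, n_trials, plogis(0.5 * x))

# Response as a proportion, with weights giving the group sizes
fit_prop <- glm(successes / n_trials ~ x, family = binomial, weights = n_trials)

# Response as a two-column matrix of successes and failures
fit_cbind <- glm(cbind(successes, n_trials - successes) ~ x, family = binomial)

all.equal(coef(fit_prop), coef(fit_cbind))
```

Neither form helps with a three-level response; they only explain why the message talks about values between 0 and 1.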
With 3 or more ordered levels in the response you need to use a generalization. One common generalization is proportional odds logistic regression (which also goes by other names). The polr function in the MASS package and the lrm function in the rms package (and probably other functions in other packages) fit these types of models, but glm does not.
set.seed(42)
y <- sample(0:2, 100, replace = TRUE)
x <- rnorm(100)
multinomial regression
If you don't want to treat your responses as ordered (i.e., nominal or categorical values):
library(nnet) ## 'recommended' package, i.e. installed by default
multinom(y~x)
Results
# weights: 9 (4 variable)
initial value 109.861229
final value 104.977336
converged
Call:
multinom(formula = y ~ x)
Coefficients:
   (Intercept)           x
1 -0.001529465  0.29386524
2 -0.649236723 -0.01933747
Residual Deviance: 209.9547
AIC: 217.9547
Or, if your responses are ordered:
ordinal regression
MASS::polr() does proportional-odds logistic regression. (You may also be interested in the ordinal package, which has more features; it can also do multinomial models.)
library(MASS) ## also 'recommended'
polr(ordered(y)~x)
Results
Call:
polr(formula = ordered(y) ~ x)
Coefficients:
         x
0.06411137
Intercepts:
       0|1        1|2
-0.4102819  1.3218487
Residual Deviance: 212.165
AIC: 218.165
If you read the error message, it offers a hint that you might get success with:
y <- sample(seq(0,1,length=3), 100, replace = T)
And in fact, you do. Now your challenge might be to interpret that in the context of the actual situation (which you have not described). You do get a warning, but R warnings are not errors.
You might also look up the topic of polychotomous logistic regression, which is implemented in several variants that might be useful in particular situations. Frank Harrell's book Regression Modeling Strategies has material on such techniques. You may also post further questions on CrossValidated.com if you need help choosing which route to go.

Model Syntax for Simple Moderation Model in Lavaan (with bootstrapping)

I am a social scientist currently running a simple moderation model in R, in the form of y ~ x + m + m * x. My moderator is a binary categorical variable (two separate groups).
I started out with lm(), bootstrapped the estimates with boot() and obtained BCa confidence intervals with boot.ci(). Since there is no automated way of doing this for all parameters (at my coding level at least), this is a bit tedious. However, I have now seen that the lavaan package offers bootstrapping as part of the regular sem() function, and also BCa CIs as part of parameterEstimates(). So I was wondering (since I am using lavaan in other analyses) whether I could just replace lm() with lavaan for the sake of keeping my work more consistent.
Doing this, I was wondering about what the equivalent model for lavaan would be to test for moderation in the same way. I saw this post where Jeremy Miles proposes the code below, which I follow mostly.
mod.1 <- "
y ~ c(a, b) * x
y ~~ c(v1, v1) * y # This step needed for exact equivalence
y ~ c(int1, int2) * 1
modEff := a - b
mEff := int1 - int2"
But it would be great if you could help me figure out some final things.
1) What does the y ~~ c(v1, v1) * y part mean, and why is it needed for "exact equivalence" to the lm model? From the output it seems this constrains the variance of the outcome to the same value in both groups?
2) From the post, am I right to understand that either including the interaction effect as calculated above OR constraining (only) the slope between models and looking at model fit with anova() would be the same test for moderation?
3) The lavaan page says that adding test = "bootstrap" to the sem() function allows for bootstrap-adjusted p-values. However, I have read a lot about p-values conflicting with the BCa CIs at times, and this has happened to me. Searching around, I understand that this conflict comes from the assumptions about the distribution of the data under H0 that are made for p-values, but not for CIs (which just give the range of most likely values). I was therefore wondering what exactly it means that the p-values given here are "bootstrap-adjusted". Is it technically more correct to report these for my SEM models than the CIs?
Many questions, but I would be very grateful for any help you can provide.
Best,
Alex
I think I can answer at least questions 1 and 2, but it is probably easier not to use SEM and instead to write a function that conveniently gives you CIs for all coefficients of your model.
So first, to answer your questions:
1) What is proposed in the code you gave is called multigroup comparison. Essentially this means that you fit the same SEM to two different groups of cases in your dataset. It is equivalent to a moderated regression with a binary moderator because in both cases you get two slopes (often called "simple slopes") for the scalar predictor, one slope per group of the moderator.
Now, in your lavaan code you only see the scalar predictor x. The binary moderator is implied by group = "m" when you fit the model with fit.1 <- sem(mod.1, data = df, group = "m") (took this from the page you linked).
The two-element vectors (c( , )) in the lavaan code specify named parameters for the first and second group, respectively. By y ~~ c(v1, v1) * y, the residual variances of y are set equal in both groups because they have the same name. In contrast, the slopes c(a, b) and the intercepts c(int1, int2) are allowed to vary between groups.
2) Yes. If you use the SEM, you would fit the model a second time, adding a == b, and compare this model to the first version where the slopes can differ. This is the same as comparing lm() models with and without the interaction term x:m in the formula.
3) Here I cannot provide a direct answer to your question. I suspect that if you want BCa CIs as you would get from applying boot.ci to an lm model fit, this might not be implemented. In the lavaan documentation BCa confidence intervals are only mentioned once: in the section about the parameterEstimates function, which can also perform bootstrap (see p. 89). However, it does not produce actual BCa (bias-corrected and accelerated) CIs but only bias-corrected ones.
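The equivalence between separate group slopes and an interaction term can be checked with plain lm(), without lavaan. The sketch below uses simulated data with made-up effect sizes: the difference between the two group-specific slopes is the same number as the x:m coefficient, and anova() on the models with and without the interaction is the moderation test.

```r
# Simulated data with a binary moderator m (illustrative values only)
set.seed(1)
n <- 200
d <- data.frame(x = rnorm(n), m = rep(c(0, 1), each = n / 2))
d$y <- 1 + 0.5 * d$x + 0.3 * d$m + 0.4 * d$x * d$m + rnorm(n)

# Moderated regression: one model with an interaction term
fit_int <- lm(y ~ x * m, data = d)

# "Multigroup" view: one regression per group
slope_g0 <- coef(lm(y ~ x, data = d[d$m == 0, ]))[["x"]]
slope_g1 <- coef(lm(y ~ x, data = d[d$m == 1, ]))[["x"]]

# The slope difference equals the interaction coefficient
slope_diff <- slope_g1 - slope_g0
c(slope_diff = slope_diff, interaction = coef(fit_int)[["x:m"]])

# Test of moderation: models with and without the interaction
anova(lm(y ~ x + m, data = d), fit_int)
```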
As mentioned above, I guess the simplest solution would be to use lm() and either repeat the boot.ci procedure for each coefficient or write a wrapper function that does this for you. I suggest this also because a reviewer may be quite puzzled to see you do multigroup SEM instead of a simple moderated regression, which is much more common.
You probably did something like this already:
lm_fit <- function(dat, idx) coef( lm(y ~ x*m, data=dat[idx, ]) )
bs_out <- boot::boot(mydata, statistic=lm_fit, R=1000)
ci_out <- boot::boot.ci(bs_out, conf=.95, type="bca", index=1)
Now you can either repeat the last line for each coefficient, i.e., varying index from 1 to 4, or get fancy and let R do the repeating with a function like this:
all_ci <- function(bs) {
  est <- bs$t0
  lower <- numeric(length(est))
  upper <- numeric(length(est))
  for (i in seq_along(est)) {
    # the last two entries of the $bca row are the lower and upper limits
    ci <- tail(boot::boot.ci(bs, type = "bca", index = i)$bca[1, ], 2)
    lower[i] <- ci[1]
    upper[i] <- ci[2]
  }
  cbind(est, lower, upper)
}
all_ci(bs_out)
I am sure this could be written more concisely but it should work fine for bootstraps of simple lm() models.

How to deal with spatially autocorrelated residuals in GLMM

I am conducting an analysis of where on the landscape a predator encounters potential prey. My response data is binary with an Encounter location = 1 and a Random location = 0 and my independent variables are continuous but have been rescaled.
I originally used a GLM structure
glm_global <- glm(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs,
data=Data_scaled, family=binomial)
but realized that this failed to account for potential spatial-autocorrelation in the data (a spline correlogram showed high residual correlation up to ~1000m).
Correlog_glm_global <- spline.correlog (x = Data_scaled[, "Y"],
y = Data_scaled[, "X"],
z = residuals(glm_global,
type = "pearson"), xmax = 1000)
I attempted to account for this by implementing a GLMM (in lme4) with the predator group as the random effect.
glmm_global <- glmer(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs+(1|Group),
data=Data_scaled, family=binomial)
When comparing the AIC of the global GLMM (1144.7) to the global GLM (1149.2) I get a Delta AIC value >2, which suggests that the GLMM fits the data better. However, I am still getting essentially the same correlation in the residuals, as shown on the spline correlogram for the GLMM model.
Correlog_glmm_global <- spline.correlog (x = Data_scaled[, "Y"],
y = Data_scaled[, "X"],
z = residuals(glmm_global,
type = "pearson"), xmax = 10000)
I also tried explicitly including the Lat*Long of all the locations as an independent variable but results are the same.
After reading up on options, I tried running Generalized Estimating Equations (GEEs) in “geepack” thinking this would allow me more flexibility with regards to explicitly defining the correlation structure (as in GLS models for normally distributed response data) instead of being limited to compound symmetry (which is what we get with GLMM). However I realized that my data still demanded the use of compound symmetry (or “exchangeable” in geepack) since I didn’t have temporal sequence in the data. When I ran the global model
gee_global <- geeglm(Encounter ~ Dist_water_cs+coverMN_cs+I(coverMN_cs^2)+
Prey_bio_stand_cs+Prey_freq_stand_cs+Dist_centre_cs,
id=Pride, corstr="exchangeable", data=Data_scaled, family=binomial)
(using scaled or unscaled data made no difference so this is with scaled data for consistency)
suddenly none of my covariates were significant. However, being a novice with GEE modelling I don’t know a) if this is a valid approach for this data or b) whether this has even accounted for the residual autocorrelation that has been evident throughout.
I would be most appreciative for some constructive feedback as to 1) which direction to go once I realized that the GLMM model (with predator group as a random effect) still showed spatially autocorrelated Pearson residuals (up to ~1000m), 2) if indeed GEE models make sense at this point and 3) if I have missed something in my GEE modelling. Many thanks.
Spatial autocorrelation can be taken into account in your model in many ways. I will restrict my response to the main R packages that deal with random effects.
First, you could go with the package nlme and specify a correlation structure for your residuals (many are available: corGaus, corLin, corSpher, ...). You should try several of them and keep the best model. In this case the spatial autocorrelation is considered continuous and can be approximated by a global function.
Second, you could go with the package mgcv and add a bivariate spline of the spatial coordinates to your model. This way, you could capture a spatial pattern and even map it. In a strict sense, this method doesn't take the spatial autocorrelation into account, but it may solve the problem. If space is discrete in your case, you could go with a Markov random field smooth. This website is very helpful for finding examples: https://www.fromthebottomoftheheap.net
Third, you could go with the package brms. This allows you to specify very complex models with other correlation structures for your residuals (CAR and SAR). The package uses a Bayesian approach.
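As a minimal sketch of the second option, the code below fits a binomial GAM with a bivariate smooth of the coordinates. The data are simulated and the column names (X, Y, cover, Encounter) are only loosely modelled on the question; this is an illustration of the model form, not a recipe tuned to the real data.

```r
library(mgcv)

# Simulated presence/absence data with a smooth spatial pattern (illustrative only)
set.seed(1)
n <- 400
d <- data.frame(X = runif(n), Y = runif(n), cover = rnorm(n))
eta <- -0.5 + 0.8 * d$cover + 2 * sin(2 * pi * d$X) * cos(2 * pi * d$Y)
d$Encounter <- rbinom(n, 1, plogis(eta))

# Covariate effect plus a bivariate spline of the spatial coordinates
m_spatial <- gam(Encounter ~ cover + s(X, Y),
                 family = binomial, data = d, method = "REML")
summary(m_spatial)

# Pearson residuals, ready to feed into a correlogram check
res_spatial <- residuals(m_spatial, type = "pearson")
```

After fitting, the residuals can be passed to the same spline correlogram used in the question to see whether the spatial smooth has absorbed the pattern.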
I hope this helps. Good luck!

Plot the Profile Deviance for a GLM fit in R

I would like to be able to plot the profile deviance for a parameter estimate fitted using the function glm() in R. The profile Deviance is the deviance function for different values of the parameter estimate in question, after estimating all other parameters. I need to plot the deviance for several values around the fitted parameter, to check the assumption of quadratic deviance function.
My model is predicting reconviction of offenders. The formula is of the form:
reconv ~ [other variables] + sex, where reconv is a binary yes/no factor, and sex is binary male/female factor. I'd like to plot the profile deviance of the parameter estimated for sex=female (sex=male is the reference level).
The glm() function estimated the parameter as -0.22, with std error 0.12.
[I'm asking this question because there was no answer I could find, but I worked it out, and wanted to post a solution to be of use to others. Of course, additional help is welcome. :-)]
See the profileModel package by Ioannis Kosmidis. He had a paper in the R Journal (R News it would appear) illustrating the package:
Ioannis Kosmidis. The profileModel R package: Profiling objectives for models with linear predictors. R News, 8(2):12-18, October 2008.
The PDF is here (entire newsletter).
See ?profile.glm (and example("profile.glm")) in the MASS package -- I think it will do everything you want (this is not loaded by default, but it is mentioned in ?profile, which might have been the first place you looked ...) (Note that the profiles are generally plotted on a signed-square-root scale, so that a truly quadratic profile will appear as a straight line.)
The way I found to do this involves using the offset() function (as detailed in Pawitan, Y. (2001) 'In All Likelihood' p172).
The answers given by @BenBolker and @GavinSimpson are better than this, in that they reference packages which will do everything this does and a lot more.
I'm posting this because it's another way round it, and also because plotting things "manually" is sometimes nice for learning. It taught me a lot.
# 'mydata' is a placeholder name for the data frame used to fit 'model'
sexi <- as.numeric(mydata$sex) - 1   # recode the factor as 0/1 numeric
beta <- numeric(60)                  # vector to store the betas
deviance <- numeric(60)              # vector to store the deviances
for (i in 1:60) {
  # grid from 0.08 down to -0.51, either side of the fitted MLE (here -0.22)
  beta[i] <- 0.09 - (0.01 * i)
  mod <- update(model,
                . ~ . - sex                      # get rid of the fitted variable
                + offset(I(sexi * beta[i])))     # replace it with an offset term
  deviance[i] <- mod$deviance        # store the i-th deviance
}
best <- which.min(deviance)          # index of the best deviance; should be at the fitted value
deviance0 <- deviance - deviance[best]   # shift the deviance so its minimum is zero
betahat <- beta[best]                # best beta; should be the fitted value
stderror <- 0.12187                  # std error of sex, found in summary(model)
# quadratic reference function to check the quadratic assumption against
quadratic <- ((beta - betahat)^2) * (1 / (stderror^2))
x11()
plot(beta, deviance0, type = "l", xlab = "Beta(sex)", ylim = c(0, 4))
lines(beta, quadratic, lty = 2, col = 3)   # add quadratic reference line
abline(h = 3.84, lty = 3)                  # add line at deviance = 3.84
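For anyone who wants to run the offset approach end to end, here is a self-contained sketch on simulated data. The variable names mirror the question (reconv, sex), but the data, the extra covariate age and all coefficient values are invented for illustration. It centres the grid on the fitted estimate and takes the standard error from the model instead of typing it in.

```r
# Simulated reconviction data (illustrative only)
set.seed(1)
n <- 500
sex <- factor(sample(c("male", "female"), n, replace = TRUE),
              levels = c("male", "female"))   # male is the reference level
age <- rnorm(n)
reconv <- rbinom(n, 1, plogis(-0.3 + 0.4 * age - 0.5 * (sex == "female")))

model <- glm(reconv ~ age + sex, family = binomial)
est <- coef(summary(model))["sexfemale", ]
b_hat <- unname(est["Estimate"])
se <- unname(est["Std. Error"])

# Grid of 61 values within 3 standard errors; the middle value is the MLE
sexi <- as.numeric(sex) - 1
beta_grid <- seq(b_hat - 3 * se, b_hat + 3 * se, length.out = 61)

# Profile deviance: fix the sex coefficient via an offset, refit the rest
prof_dev <- sapply(beta_grid, function(b) {
  deviance(glm(reconv ~ age + offset(sexi * b), family = binomial))
})
prof_dev0 <- prof_dev - deviance(model)       # zero at the MLE
quad <- ((beta_grid - b_hat) / se)^2          # quadratic (Wald) reference

plot(beta_grid, prof_dev0, type = "l", xlab = "Beta(sex)",
     ylab = "Profile deviance")
lines(beta_grid, quad, lty = 2, col = 3)      # quadratic reference line
abline(h = 3.84, lty = 3)                     # 95% chi-squared(1) cutoff
```

If the solid and dashed curves lie close together below the 3.84 line, the quadratic approximation behind the Wald interval is reasonable for this parameter.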
