R, AR(1) and FGLS Computations: unable to create a lagged variable in R

[Images in the original post: "Problem" (the regression to run) and "Data"]
Hello everyone. I have been trying to solve this problem for the past two weeks with little to no help from my professor. I am not really asking for the exact code, but I can't even get the problem started.
We are given data on the variables (M, R, Y) and are asked to run the regression shown in the image. My issue is that I can't seem to create a lagged variable. I have been told both that 1) we don't need to use time series objects and 2) I should be using time series objects (based on my own research). Furthermore, when we get to the part where we run the regression for t = 2 onwards to get p_hat, the only coefficient I can obtain is 1.0000, and the B1 we should be getting comes out as something like 3.14e-11, which is clearly wrong. Here is what I currently have for my variables, based on the data given. If anyone is able to help guide me in the right direction I would really appreciate it.
#Creates Time Series Objects Which Can Be Lagged using lag()
Mt2 <- ts(data=DATA$M, start=1,end=180,frequency=1)
Mt1 <- ts(data=DATA$M, start=1,end=180,frequency=1)
Rt2 <- ts(data=DATA$R, start=1,end=180,frequency=1)
Yt2 <- ts(data=DATA$Y, start=1,end=180,frequency=1)
#Dependent Variable starts at t=2 and ends at t=181
#Lag Variable starts at t=1 and ends at t=180
Model_A <- lm( Mt2 ~ lag(Mt2,1) + Rt2 + Yt2, data=DATA)
bgtest(Model_A) #Conclude there is Autocorrelation
e <- resid(Model_A)
et <- ts(e,start=2,end=180,frequency=1)
et2 <- ts(e,start=1, end=179, frequency=1)
Model_e <- lm(et ~ et2)

Using lag() inside a plain lm() formula does not work because lag() only shifts the time index and leaves the data vector unchanged. Since lm() (or more precisely model.frame()) ignores the time index, this ends up including the response variable as one of the regressors, hence a regression coefficient of 1. As an illustration:
lm(Nile ~ lag(Nile, -1))
## Call:
## lm(formula = Nile ~ lag(Nile, -1))
##
## Coefficients:
## (Intercept) lag(Nile, -1)
## -1.819e-13 1.000e+00
There are various solutions to this problem, including:
Using dedicated time series infrastructure such as ar() or arima(). The latter can also easily incorporate additional regressors via xreg:
arima(Nile, c(1, 0, 0))
## Call:
## arima(x = Nile, order = c(1, 0, 0))
##
## Coefficients:
## ar1 intercept
## 0.5063 919.5685
## s.e. 0.0867 29.1410
##
## sigma^2 estimated as 21125: log likelihood = -639.95, aic = 1285.9
Using convenience interfaces to lm() with additional model.frame
processing such as dyn or dynlm:
library("dynlm")
dynlm(Nile ~ L(Nile, 1))
## Time series regression with "ts" data:
## Start = 1872, End = 1970
##
## Call:
## dynlm(formula = Nile ~ L(Nile, 1))
##
## Coefficients:
## (Intercept) L(Nile, 1)
## 452.7668 0.5043
Doing the lag pre-processing "by hand" before calling lm():
nile <- ts.intersect(Nile = Nile, NileLag = lag(Nile, -1))
lm(Nile ~ NileLag, data = nile)
## Call:
## lm(formula = Nile ~ NileLag, data = nile)
##
## Coefficients:
## (Intercept) NileLag
## 452.7668 0.5043
Depending on whether you need a particular estimation method or returned model class for further computations, one or the other strategy might be more handy. In your particular case I would recommend the last solution and then go through the different steps in the exercise by using plain lm().
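For your data that could look roughly like this (a sketch only; it assumes a data frame DATA with columns M, R and Y, as in your snippet, and that the p_hat you mention is the AR(1) coefficient of the residuals):
# Align M with its own lag and the contemporaneous R and Y
d <- ts.intersect(M = ts(DATA$M), Mlag = lag(ts(DATA$M), -1),
R = ts(DATA$R), Y = ts(DATA$Y), dframe = TRUE)
Model_A <- lm(M ~ Mlag + R + Y, data = d)
# Regress e_t on e_{t-1} to estimate the AR(1) coefficient of the residuals
e <- ts(resid(Model_A))
de <- ts.intersect(e = e, elag = lag(e, -1), dframe = TRUE)
p_hat <- coef(lm(e ~ elag, data = de))["elag"]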

Use a data.frame and create your lagged variable "by hand", or use something like dplyr::lag. Like this:
X <- 1:10
Xlag <- c(NA, X[1:9])
The base R lag() function is not very useful for anything other than ts objects, and ts objects themselves aren't very useful outside the specific modeling functions that support them.
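A minimal sketch of the same idea with dplyr::lag, again assuming a data frame DATA with columns M, R and Y (names taken from the question, not verified):
library(dplyr)
dat <- DATA %>% mutate(Mlag = dplyr::lag(M))  # first row of Mlag is NA
fit <- lm(M ~ Mlag + R + Y, data = dat)       # lm() drops the NA row automatically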

Related

How to incorporate a random effect into a nonlinear mixed effect (nlme) model?

I'd like to build a nonlinear mixed effect model that describes the relationship between two variables, "x" and "y", which vary randomly by a third variable "r" using an exponential rise to a maximum as described by the equation: y = theta(1-exp(-beta*x)).
I've been able to create the nonlinear model for x and y using nls(), but I have not been successful in incorporating a random effect into nlme().
When I build the model using nlme() I end up with an error message: "Error in eval(predvars, data, env) : object 'theta' not found". This error is unexpected to me since the nls() model ran without issue using the same dataframe.
To build the dataset:
x = c(33,35,16,8,31,31,31,23,7,7,7,7,11,11,3,3,6,6,32,32,1,17,17,17,25,40,40,6,6,29,29,13,23,23,44,44,43,43,13,4,6,15,17,22,28,8,11,22,32,6,12,20,27,15,29,29,29,29,29,12,12,16,16,12,12,2,49,49,14,14,14,37,2.87,4.86,7.90,11.95,16.90,16.90,18.90,18.89,22.00,24.08,27.14,30.25,31.22,32.26,7,14,19,31,36,7,14,19,31,36,7,16,16,16,16,16,16,32,32,32,32,32,32,11,11,11,13,13,13,13,13,13,13,13,13,13,13,13,9,9)
y = c(39.61,32.66,27.06,21.74,22.18,38.19,35.02,23.13,9.70,14.20,13.40,15.30,18.80,19.00,3.80,4.90,15.00,14.20,24.90,16.56,1.76,29.29,28.49,18.64,27.10,9.47,14.14,10.27,8.44,26.15,25.43,22.00,19.00,13.00,73.19,67.76,32.34,36.86,8.00,1.57,8.33,16.20,14.69,18.95,20.52,4.92,8.28,15.27,18.37,6.60,10.98,12.56,19.04,5.49,21.00,12.90,17.30,11.40,12.20,15.63,15.22,33.80,17.78,19.33,3.86,8.57,30.40,13.39,11.93,4.55,6.18,12.70,2.71,7.23,5.61,22.74,15.71,16.95,18.31,20.78,17.64,20.00,19.52,24.86,30.06,24.92,4.17,11.02,10.08,14.94,25.98,0.00,3.67,3.67,6.69,11.90,5.06,13.21,10.33,0.00,0.00,6.47,8.38,28.57,25.26,28.67,27.92,33.69,29.61,6.11,7.13,6.93,4.81,15.34,4.90,14.94,8.88,10.24,8.80,10.46,10.48,9.19,9.67,9.40,24.98,50.79)
r = c("A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B","C","C","D","E","E","E","F","G","G","H","H","H","H","I","I","I","J","J","J","J","K","L","L","L","L","L","L","L","L","L","L","L","L","L","L","M","N","N","N","N","N","O","P","P","P","P","P","Q","R","R","S","S","S","T","U","U","U","U","U","U","U","U","U","U","U","U","U","U","V","V","V","V","V","V","V","V","V","V","W","X","X","X","X","Y","Y","Z","Z","Z","Z","Z","Z","AA","AA","AA","AB","AB","AB","AB","AB","AB","AB","AB","AB","AB","AB","AB","AC","AC")
df = data.frame(x,y,r)
To build the nonlinear model without "r" as a random effect.
nls_test = nls(y~theta*(1-exp(-beta*x)),
data = df,
start = list(beta = 0.2, theta = 38),
trace = TRUE)
In my model, the only fixed effect is x and the only random effect is r. I've tried building an nlme() model that reflects this, based on the nlme package documentation (https://cran.r-project.org/web/packages/nlme/nlme.pdf), more specifically the lines of code found on page 186 of the documentation linked above.
The nlme() object that I've tried to create with my data is as follows:
nlme_test = nlme(y ~ theta*(1-exp(-beta*x)),
fixed = x~1,
random = r~1,
data = df,
start = c(theta = 38,
beta = 0.2))
And results in the following error.
Error in eval(predvars, data, env) : object 'theta' not found
From what I gather, this is related to 'theta' not being included in the dataframe ("df") used to build the nlme object, but it is unclear to me why this occurs, as most examples I have found for this error relate to the predict() function and missing columns or disagreements between column names.
Also, since the nls() model (nls_test) worked fine using the same start = c(theta = 38, beta = 0.2) and without a 'theta' or 'beta' column in df, I'm a bit confused as to why I'm receiving an error about a missing column.
Does anyone have suggestions or references to help me incorporate the random effect into my nlme model?
Thanks!
Expanding on my (now deleted, because incomplete) comment, I assume this is what you want to do. Please confirm carefully by reading the help page about nlme (i.e. ?nlme::nlme).
nlme_test <- nlme(y ~ theta*(1-exp(-beta*x)),
fixed = theta + beta ~ 1,
random = theta + beta ~ 1,
groups = ~ r,
data = df,
start = c(theta = 38,
beta = 0.2))
The fixed and random arguments should not name the variables in your model formula but the regression parameters. This way the function knows which parts of the model are variables (to be found in data) and which parts are parameters. Also, you missed the groups argument in order to specify how the data is clustered.
Output:
summary(nlme_test)
## Nonlinear mixed-effects model fit by maximum likelihood
## Model: y ~ theta * (1 - exp(-beta * x))
## Data: df
## AIC BIC logLik
## 887.6224 904.6401 -437.8112
##
## Random effects:
## Formula: list(theta ~ 1, beta ~ 1)
## Level: r
## Structure: General positive-definite, Log-Cholesky parametrization
## StdDev Corr
## theta 1.145839e+01 theta
## beta 1.061366e-05 0.01
## Residual 6.215030e+00
##
## Fixed effects: theta + beta ~ 1
## Value Std.Error DF t-value p-value
## theta 21.532188 2.8853414 96 7.462614 0e+00
## beta 0.104404 0.0251567 96 4.150144 1e-04
## Correlation:
## theta
## beta -0.548
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -2.89510795 -0.51882772 -0.09466037 0.34471808 3.66855121
##
## Number of Observations: 126
## Number of Groups: 29
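If it helps, you can also pull apart the fitted object to see the population-level estimates and the per-group deviations (a quick sketch using the nlme_test object fitted above):
fixef(nlme_test)  # population-level theta and beta
ranef(nlme_test)  # deviations in theta and beta for each level of r
coef(nlme_test)   # combined group-specific coefficients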

Quasi-Poisson mixed-effect model on overdispersed count data from multiple imputed datasets in R

I'm dealing with a problem that has three parts I can solve separately, but now I need to solve them together:
extremely skewed, over-dispersed dependent count variable (the number of incidents while doing something),
necessity to include random effects,
lots of missing values -> multiple imputation -> 10 imputed datasets.
To solve the first two parts, I chose a quasi-Poisson mixed-effect model. Since stats::glm isn't able to include random effects properly (or I haven't figured it out) and lme4::glmer doesn't support the quasi-families, I worked with glmer(family = "poisson") and then adjusted the std. errors, z statistics and p-values as recommended here and discussed here. So I basically turn Poisson mixed-effect regression into quasi-Poisson mixed-effect regression "by hand".
This is all good with one dataset. But I have 10 of them.
I roughly understand the procedure for analyzing multiply imputed datasets: 1. imputation, 2. model fitting, 3. pooling results (I'm using the mice library). I can do these steps for a Poisson regression but not for a quasi-Poisson mixed-effect regression. Is it even possible to A) pool across models based on a quasi-distribution and B) get residuals from a pooled object (class "mipo")? I'm not sure. I'm also not sure how to interpret the pooled results for mixed models (the random effects are missing from the pooled output; I've found this page, which I'm currently trying to work through).
Can I get some help, please? Any suggestions on how to complete the analysis (addressing all three issues above) would be highly appreciated.
Example of data is here (repre_d_v1 and repre_all_data are stored in there) and below is a crucial part of my code.
library(dplyr); library(tidyr); library(tidyverse); library(lme4); library(broom.mixed); library(mice)
# please download "qP_data.RData" from the last link above and load them
## ===========================================================================================
# quasi-Poisson mixed model from single data set (this is OK)
# first run Poisson regression on df "repre_d_v1", then turn it into quasi-Poisson
modelSingle = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
family = "poisson",
data = repre_d_v1)
# I know there are some warnings but it's because I share only a modified subset of data with you (:
printCoefmat(coef(summary(modelSingle))) # unadjusted coefficient table
# define quasi-likelihood adjustment function
quasi_table = function(model, ctab = coef(summary(model))) {
phi = sum(residuals(model, type = "pearson")^2) / df.residual(model)
qctab = within(as.data.frame(ctab),
{`Std. Error` = `Std. Error`*sqrt(phi)
`z value` = Estimate/`Std. Error`
`Pr(>|z|)` = 2*pnorm(abs(`z value`), lower.tail = FALSE)
})
return(qctab)
}
printCoefmat(quasi_table(modelSingle)) # done, makes sense
## ===========================================================================================
# now let's work with more than one data set
# object "repre_all_data" of class "mids" contains 10 imputed data sets
# fit model using with() function, then pool()
modelMultiple = with(data = repre_all_data,
expr = glmer(Y ~ Gender + Age + Xi + Age:Xi + (1|Country) + (1|Participant_ID),
family = "poisson"))
summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
# this has quite similar structure as coef(summary(someGLM))
# but I don't see where are the random effects?
# and more importantly, I wanted a quasi-Poisson model, not just Poisson model...
# ...but here it is not possible to use quasi_table function (defined earlier)...
# ...and that's because I can't compute "phi"
This seems reasonable, with the caveat that I'm only thinking about the computation, not whether this makes statistical sense. What I'm doing here is computing the dispersion for each of the individual fits and then applying it to the summary table, using a variant of the machinery that you posted above.
## compute dispersion values
phivec <- vapply(modelMultiple$analyses,
function(model) sum(residuals(model, type = "pearson")^2) / df.residual(model),
FUN.VALUE = numeric(1))
phi_mean <- mean(phivec)
ss <- summary(pool(modelMultiple)) # class "mipo" ("mipo.summary")
## adjust
qctab <- within(as.data.frame(ss),
{ std.error <- std.error*sqrt(phi_mean)
statistic <- estimate/std.error
p.value <- 2*pnorm(abs(statistic), lower.tail = FALSE)
})
The results look weird (dispersion < 1, all model results identical), but I'm assuming that's because you gave us a weird subset as a reproducible example ...
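One quick way to sanity-check the adjustment is to line up the pooled standard errors before and after scaling (a sketch reusing the ss, qctab and phi_mean objects from above):
data.frame(term = ss$term,
se_poisson = ss$std.error,
se_quasi = qctab$std.error,
sqrt_phi = sqrt(phi_mean))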

Getting a subset of variables in R summary

When using the summary function in R, is there an option I can pass in there to present only a subset of the variables?
In my example, I ran a panel regression with several explanatory variables and many dummy variables whose coefficients I do not want to present. I suppose there is a simple way to do this, but I couldn't find it in the function documentation. Thanks
It is in the documentation, but you have to look for the associated print method for summary.plm. The argument is subset. Use it as in the following example:
library(plm)
data("Grunfeld", package = "plm")
mod <- plm(inv ~ value + capital, data = Grunfeld)
print(summary(mod), subset = c("capital"))
Assuming the regression you ran behaves similarly as the summary() of a basic lm() model:
# set up data
x <- 1:100 * runif(100, .01, .02)
y <- 1:100 * runif(100, .01, .03)
# run a very basic linear model
mylm <- lm(x ~ y)
summary(mylm)
# we can save summary of our linear model as a variable
mylm_summary <- summary(mylm)
# we can then isolate coefficients from this summary (summary is just a list)
mylm_summary$coefficients
#output:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2007199 0.04352267 4.611846 1.206905e-05
y 0.5715838 0.03742379 15.273273 1.149594e-27
# note that the class of this "coefficients" object is a matrix
class(mylm_summary$coefficients)
# output
[1] "matrix"
# we can convert that matrix into a data frame so it is easier to work with and subset
mylm_df_coefficients <- data.frame(mylm_summary$coefficients)
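From there, you can drop the rows you don't want to show; the "dummy" prefix below is purely illustrative and would be whatever your dummy variables are actually called:
# keep every coefficient whose name does not start with "dummy"
keep <- !grepl("^dummy", rownames(mylm_df_coefficients))
mylm_df_coefficients[keep, ]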

Set contrasts in glm

I have binomial count data, coming from a set of conditions, that are overdisperesed. To simulate them I use the beta binomial distribution implemented by the rbetabinom function of the emdbook R package:
library(emdbook)
set.seed(1)
df <- data.frame(p = rep(runif(3,0,1)),
n = as.integer(runif(30,100,200)),
theta = rep(runif(3,1,5)),
cond = rep(LETTERS[1:3],10),
stringsAsFactors=F)
df$k <- sapply(1:nrow(df), function(x) rbetabinom(n=1, prob=df$p[x], size=df$n[x],theta = df$theta[x], shape1=1, shape2=1))
I want to find the effect of each condition (cond) on the counts (k).
I think the glm.nb model of the MASS R package allows modelling that:
library(MASS)
fit <- glm.nb(k ~ cond + offset(log(n)), data = df)
My question is how to set the contrasts such that I get the effect of each condition relative to the mean effects over all conditions rather than relative to the dummy condition A?
Two things: (1) if you want contrasts relative to the mean, use contr.sum rather than the default contr.treatment; (2) you probably shouldn't fit beta-binomial data with a negative binomial model; use a beta-binomial model instead (e.g. via VGAM or bbmle)!
library(emdbook)
set.seed(1)
df <- data.frame(p = rep(runif(3,0,1)),
n = as.integer(runif(30,100,200)),
theta = rep(runif(3,1,5)),
cond = rep(LETTERS[1:3],10),
stringsAsFactors=FALSE)
## slightly abbreviated
df$k <- rbetabinom(n=nrow(df), prob=df$p,
size=df$n,theta = df$theta, shape1=1, shape2=1)
With VGAM:
library(VGAM)
## note dbetabinom/rbetabinom from emdbook are masked
options(contrasts=c("contr.sum","contr.poly"))
vglm(cbind(k,n-k)~cond,data=df,
family=betabinomialff(zero=2)
## hold shape parameter 2 constant
)
## Coefficients:
## (Intercept):1 (Intercept):2 cond1 cond2
## 0.4312181 0.5197579 -0.3121925 0.3011559
## Log-likelihood: -147.7304
Here intercept is the mean shape parameter across the levels; cond1 and cond2 are the differences of levels 1 and 2 from the mean (this doesn't give you the difference of level 3 from the mean, but by construction it should be (-cond1-cond2) ...)
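If the vglm() fit above is stored in an object (say fit_bb, which is not done in the original call), that third-level deviation can be computed directly from the sum-to-zero constraint:
fit_bb <- vglm(cbind(k, n - k) ~ cond, data = df,
family = betabinomialff(zero = 2))
cc <- coef(fit_bb)
-(cc["cond1"] + cc["cond2"])  # effect of the third level ("C") relative to the mean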
I find the parameterization with bbmle (with logit-probability and dispersion parameter) a little easier:
detach("package:VGAM")
library(bbmle)
mle2(k~dbetabinom(k, prob=plogis(lprob),
size=n, theta=exp(ltheta)),
parameters=list(lprob~cond),
data=df,
start=list(lprob=0,ltheta=0))
## Coefficients:
## lprob.(Intercept) lprob.cond1 lprob.cond2 ltheta
## -0.09606536 -0.31615236 0.17353311 1.15201809
##
## Log-likelihood: -148.09
The log-likelihoods are about the same (the VGAM parameterization is a bit better); in theory, if we allowed both shape1 and shape2 (VGAM) or lprob and ltheta (bbmle) to vary across conditions, we'd get the same log-likelihoods for both parameterizations.
Effects must be estimated relative to some base level. The effect of having any of the 3 conditions would be the same as a constant in the regression.
Since the intercept is the expected mean value when the dummies for both estimated levels (i.e. "B" and "C") are 0, it is the mean value for the reference group (i.e. "A") only.
Therefore, you basically already have this information in your model, or at least as close to it as you can get.
The mean value of a comparison group is the intercept plus the comparison group's coefficient. The comparison groups' coefficients, as you know, therefore give you the effect of having the comparison group = 1 (bearing in mind that each level of your categorical variable is a dummy variable which = 1 when that level is present) relative to the reference group.
So your results give you the means and relative effects of each level. You can of course switch out the reference level according to your preference.
That should hopefully give you all the information you need. If not then you need to ask yourself precisely what information it is that you're after.
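If all you want is a different baseline, a minimal sketch of switching the reference level (assuming the df and MASS setup from the question, and that the default treatment contrasts are still in effect, i.e. you have not set contr.sum as in the other answer):
df$cond <- relevel(factor(df$cond), ref = "B")  # make "B" the reference level
fit_refB <- glm.nb(k ~ cond + offset(log(n)), data = df)
summary(fit_refB)$coefficients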

Updating data in lm() calls

Is there an equivalent to update for the data part of an lm call object?
For example, say I have the following model:
dd = data.frame(y=rnorm(100),x1=rnorm(100))
Model_all <- lm(formula = y ~ x1, data = dd)
Is there a way of operating on the lm object to have the equivalent effect of:
Model_1t50 <- lm(formula = y ~ x1, data = dd[1:50,])
I am trying to construct some pseudo out-of-sample forecast tests, and it would be very convenient to have a single lm object and to simply roll the data.
I'm fairly certain that update actually does what you want!
example(lm)
dat1 <- data.frame(group,weight)
lm1 <- lm(weight ~ group, data=dat1)
dat2 <- data.frame(group,weight=2*weight)
lm2 <- update(lm1,data=dat2)
coef(lm1)
##(Intercept) groupTrt
## 5.032 -0.371
coef(lm2)
## (Intercept) groupTrt
## 10.064 -0.742
If you're hoping for an efficiency gain from this, you'll be disappointed -- R just substitutes the new arguments and re-evaluates the call (see the code of update.default). But it does make the code a lot cleaner ...
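For the pseudo out-of-sample exercise itself, a minimal rolling-origin sketch along these lines should work (it assumes the dd and Model_all objects from the question; the window sizes are arbitrary):
# refit on the first k rows, then forecast observation k + 1
fits <- lapply(50:99, function(k) update(Model_all, data = dd[1:k, ]))
preds <- mapply(function(fit, k) predict(fit, newdata = dd[k + 1, , drop = FALSE]),
fits, 50:99)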
biglm objects can be updated to include more data, but not less. So you could do this in the opposite order, starting with less data and adding more. See http://cran.r-project.org/web/packages/biglm/biglm.pdf
However, I suspect you're interested in parameters estimated for subpopulations (i.e. if rows 1:50 correspond to level "a" of a factor variable factrvar). In this case, you should use an interaction in your formula (~factrvar*x1) rather than subsetting to data[1:50,]. An interaction of this type will give different effect estimates for each level of factrvar. This is more efficient than estimating each parameter separately, and it constrains any additional parameters (i.e. x2 in ~factrvar*x1 + x2) to be the same across values of factrvar -- if you estimated the same model multiple times on different subsets, x2 would receive a separate parameter estimate each time.
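A small sketch of that interaction approach, reusing dd from the question and a hypothetical factor factrvar marking the two subpopulations:
dd$factrvar <- factor(rep(c("a", "b"), each = 50))
fit_int <- lm(y ~ factrvar * x1, data = dd)
coef(fit_int)  # separate intercept and slope adjustments for level "b"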

Resources