Logit regression: glmer vs bife - r

I am working on a panel dataset and trying to run a logit regression with fixed effects.
I found that glmer models from the lme4 package and the bife package are suited for this kind of work.
However, when I run a regression with each model, I do not get the same results (estimates, standard errors, etc.).
Here is the code and results for the glmer model with an intercept:
glmer_1 <- glmer(CVC_dummy~at_log + (1|year), data=own, family=binomial(link="logit"))
summary(glmer_1)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.43327 0.09635 -66.77 <2e-16 ***
at_log 0.46335 0.01101 42.09 <2e-16 ***
Without an intercept:
glmer_2 <- glmer(CVC_dummy~at_log + (1|year)-1, data=own, family=binomial(link="logit"))
summary(glmer_2)
Estimate Std. Error z value Pr(>|z|)
at_log 0.46554 0.01099 42.36 <2e-16 ***
And with the bife package:
bife_1 <- bife(CVC_dummy~at_log | year, data=own, model="logit")
summary(bife_1)
Estimate Std. error t-value Pr(> t)
at_log 0.4679 0.0110 42.54 <2e-16 ***
Why are the estimated coefficients of at_log different between the two packages?
Which package should I use?

There is quite a bit of confusion about the terms fixed effects and random effects. From your first sentence, I guess that you intend to fit a fixed-effects model.
However, while bife fits fixed-effects models, glmer fits random-effects (mixed-effects) models.
The two often get confused because mixed-effects models distinguish between fixed effects (your usual coefficients for the independent variables you are interested in) and random effects (the variances/std. dev. of your random intercepts and/or random slopes).
Fixed-effects models, on the other hand, are called that because they cancel out group-level differences by including a dummy variable for each group (or for all groups but one when the model keeps an intercept), hence a fixed effect for each group.
However, not all fixed-effects models work by explicitly including indicator variables: bife uses pseudo-demeaning instead. The results are the same, and it is still called a fixed-effects model.
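To make the dummy-variable description concrete, here is a minimal sketch on simulated data. The variable names, group sizes and coefficient values are all invented for illustration and are not taken from the question. It fits a fixed-effects logit the brute-force way, with one indicator per group in a plain glm. I have not run bife alongside it, so the agreement with bife is what the explanation above leads one to expect, not something shown here.

```r
# Simulated panel: 20 groups ("years") with 500 observations each.
# All names and parameter values are hypothetical.
set.seed(42)
n_groups <- 20
n_per_group <- 500
year <- factor(rep(seq_len(n_groups), each = n_per_group))
alpha <- rnorm(n_groups, mean = -1, sd = 0.5)  # one true intercept per group
x <- rnorm(n_groups * n_per_group)
beta_true <- 0.5
y <- rbinom(length(x), size = 1,
            prob = plogis(alpha[as.integer(year)] + beta_true * x))

# Fixed-effects logit "by hand": one dummy per group, no global intercept.
fe_fit <- glm(y ~ x + year - 1, family = binomial(link = "logit"))

beta_hat <- unname(coef(fe_fit)["x"])   # slope of interest
n_dummies <- length(coef(fe_fit)) - 1   # one coefficient per group
print(beta_hat)
print(n_dummies)
```

In this setup the group intercepts are estimated as ordinary coefficients, which is what separates it from glmer, where (1|year) only estimates the variance of the group intercepts.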


Model averaging with MuMIn: interpretation of coefficient-names in results

I am doing model averaging with MuMIn and trying to interpret the results.
Everything works fine, but I am wondering about the names of my coefficients in the results:
Model-averaged coefficients:
(full average)
Estimate Std. Error Adjusted SE
cond((Int)) 0.9552775 0.0967964 0.0969705
cond(Distanzpunkt) -0.0001217 0.0001451 0.0001453
cond(area_km2) 0.0022712 0.0030379 0.0030422
cond(prop) 0.0487036 0.1058994 0.1060808
Does anyone know what "cond()" tells me and why it appears in the model output?
Within the models, the coefficients are named "Distanzpunkt", "area_km2" and "prop".
Were you fitting a zero-inflated model with glmmTMB? If so, cond() refers to the terms in the conditional model, as opposed to the zero-inflation model.

Resurrecting coefficients from simulated data in Poisson regression

I am trying to understand how to resurrect model estimates from simulated data in a Poisson regression. There are other similar posts on interpreting coefficients on StackExchange/CrossValidated (https://stats.stackexchange.com/questions/11096/how-to-interpret-coefficients-in-a-poisson-regression, https://stats.stackexchange.com/questions/128926/how-to-interpret-parameter-estimates-in-poisson-glm-results), but I think my question is different (although admittedly related). I am trying to resurrect known relationships in order to understand what is happening with the model. I am posting here instead of CrossValidated because I think it is less a question of statistical interpretation and more one of how to get a known/simulated relationship back via code.
Here are some simulated data y and x with known relationships to some response resp
set.seed(707)
x<-rnorm(10000,mean=5,sd=1)
y<-rnorm(10000,mean=5,sd=1)
resp<-(0.5*x+0.7*y-0.1*x*y) # where I define some relationships
With a linear regression, it is very straightforward:
summary(lm(resp~y+x+y:x))
The output shows the exact linear relationship between x, y, and the interaction.
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.592e-14 1.927e-15 8.260e+00 <2e-16 ***
y 7.000e-01 3.795e-16 1.845e+15 <2e-16 ***
x 5.000e-01 3.800e-16 1.316e+15 <2e-16 ***
y:x -1.000e-01 7.489e-17 -1.335e+15 <2e-16 ***
Now, if I am interested in a Poisson regression, I need integers. I just round, but keep the relationship between predictors and response:
resp<-round((0.5*x+0.7*y-0.1*x*y),0)
glm1<-glm(resp~y+x+y:x,family=poisson())
summary(glm1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.419925 0.138906 3.023 0.0025 **
y 0.163919 0.026646 6.152 7.66e-10 ***
x 0.056689 0.027375 2.071 0.0384 *
y:x -0.011020 0.005261 -2.095 0.0362 *
It is my understanding that one needs to exponentiate the results to understand them, because of the link function. But here, neither the exponentiated estimate nor intercept + estimate get me back to the original values.
> exp(0.419925+0.163919)
[1] 1.792917
> exp(0.163919)
[1] 1.178119
How do I interpret these values as related to the original 0.7*y relationship?
Now if I put that same linear equation into the exponential function, I get the values directly - no need to use exp():
resp<-round(exp(0.5*x+0.7*y-0.1*x*y),0)
summary(glm(resp~y+x+y:x,family=poisson()))
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.002970 0.045422 0.065 0.948
y 0.699539 0.008542 81.894 <2e-16 ***
x 0.499476 0.008912 56.047 <2e-16 ***
y:x -0.099922 0.001690 -59.121 <2e-16 ***
Can someone explain to me what I am misinterpreting here, and how I might find the original values of a known relationship without first using the exp() function, as above?
You're neglecting the fact that the Poisson GLM uses a log link (exponential inverse-link) by default (or rather, you're not using that information consistently). You should either generate your 'data' with an exponential inverse-link:
resp <- round(exp(0.5*x+0.7*y-0.1*x*y))
or fit the model with an identity link (family=poisson(link="identity")). (I wouldn't recommend the latter, as it is rarely a sensible model.)
For what it's worth, it's harder to simulate Poisson data that will exactly match a specified set of parameters, because (unlike the Gaussian where you can reduce the variance to arbitrarily small values) you can't generate real Poisson data with arbitrarily little noise. (Your round() statement produces integers, but not Poisson-distributed outcomes.)
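As a sketch of what "real Poisson data" could look like here, the code below reuses the question's predictors and coefficients but draws the response with rpois() from the exponentiated linear predictor, instead of rounding. The object names eta, fit and est are my own. Under the log link the model is E[resp] = exp(b0 + b1*y + b2*x + b3*y*x), so the fitted coefficients should come back near 0.7, 0.5 and -0.1 on the link scale, only with sampling noise that cannot be made arbitrarily small.

```r
set.seed(707)
n <- 10000
x <- rnorm(n, mean = 5, sd = 1)
y <- rnorm(n, mean = 5, sd = 1)

# Linear predictor on the log scale (same coefficients as in the question).
eta <- 0.5 * x + 0.7 * y - 0.1 * x * y

# Genuine Poisson draws whose mean is exp(eta), rather than round(exp(eta)).
resp <- rpois(n, lambda = exp(eta))

fit <- glm(resp ~ y + x + y:x, family = poisson(link = "log"))
est <- coef(fit)
print(round(est, 3))
```

The estimates are read directly on the link scale; exponentiating a coefficient gives a multiplicative effect on the expected count, not the original 0.7.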

Random slope for time in subject not working in lme4

I cannot fit a random slope in this model with lme4 (1.1-7):
> difJS<-lmer(JS~Tempo+(Tempo|id),dat,na.action=na.omit)
Error: number of observations (=274) <= number of random effects (=278) for term
(Tempo | id); the random-effects parameters and the residual variance (or scale
parameter) are probably unidentifiable
With nlme it is working:
> JSprova<-lme(JS~Tempo,random=~1+Tempo|id,data=dat,na.action=na.omit)
> summary(JSprova)
Linear mixed-effects model fit by REML Data: dat
AIC BIC logLik
769.6847 791.3196 -378.8424
Random effects:
Formula: ~1 + Tempo | id
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
(Intercept) 1.1981593 (Intr)
Tempo 0.5409468 -0.692
Residual 0.5597984
Fixed effects: JS ~ Tempo
Value Std.Error DF t-value p-value
(Intercept) 4.116867 0.14789184 138 27.837013 0.0000
Tempo -0.207240 0.08227474 134 -2.518874 0.0129
Correlation:
(Intr)
Tempo -0.837
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-2.79269550 -0.39879115 0.09688881 0.41525770 2.32111142
Number of Observations: 274
Number of Groups: 139
I think it is a problem of missing data, as I have a few cases where the DV is missing at time two, but with na.action=na.omit shouldn't the two packages behave in the same way?
It is "working" with lme, but I'm 99% sure that your random slopes are indeed confounded with the residual variation. The problem is that you only have two measurements per subject (or only one measurement per subject in 4 cases -- but that's not important here), so that a random slope plus a random intercept for every individual gives one random effect for every observation.
If you try intervals() on your lme fit, it will give you an error saying that the variance-covariance matrix is unidentifiable.
You can force lmer to do it by disabling some of the identifiability checks (see below).
library("lme4")
library("nlme")
library("plyr")
Restrict the data to only two points per individual:
sleepstudy0 <- ddply(sleepstudy,"Subject",
function(x) x[1:2,])
m1 <- lme(Reaction~Days,random=~Days|Subject,data=sleepstudy0)
intervals(m1)
## Error ... cannot get confidence intervals on var-cov components
lmer(Reaction~Days+(Days|Subject),data=sleepstudy0)
## error
If you want you can force lmer to fit this model:
m2B <- lmer(Reaction~Days+(Days|Subject),data=sleepstudy0,
control=lmerControl(check.nobs.vs.nRE="ignore"))
## warning messages
The estimated variances are different from those estimated by lme, but that's not surprising since some of the parameters are jointly unidentifiable.
If you're only interested in inference on the fixed effects, it might be OK to ignore these problems, but I wouldn't recommend it.
The sensible thing to do is to recognize that the variation among slopes is unidentifiable; there may be among-individual variation among slopes, but you just can't estimate it with this model. Don't try; fit a random-intercept model and let the implicit/default random error term take care of the variation among slopes.
There's a recent related question on CrossValidated; there I also refer to another example.
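The count behind the lmer error can be reproduced with plain arithmetic. This sketch only uses the numbers reported in the question (274 observations from the error message, 139 groups from the lme summary). The variable names are mine.

```r
n_obs <- 274        # observations after na.omit, from the error message
n_subjects <- 139   # "Number of Groups" in the lme summary

# (Tempo | id) gives each subject a random intercept and a random slope.
re_per_subject <- 2
n_random_effects <- n_subjects * re_per_subject
print(n_random_effects)            # 278
print(n_obs <= n_random_effects)   # TRUE: this is the condition lmer rejects

# A random-intercept model, (1 | id), has one random effect per subject.
n_re_intercept_only <- n_subjects * 1
print(n_obs > n_re_intercept_only) # TRUE: passes the same check
```

This is only the bookkeeping that the check performs; it says nothing about how well the remaining variance components are estimated.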

How to estimate a Spatial Autoregressive model in R?

I am trying to estimate some spatial models in R using the data from a paper on spatial econometric models using cross-section time series data by Franzese & Hays (2007).
I focus on their results given in table 4 (see below).
Using lm I am able to replicate their results for the OLS, S-OLS, and S-2SLS models.
However, in trying to estimate the S-ML (Spatial Maximum Likelihood) model I run into trouble.
If I use a GLM model there are some minor differences for some of the explanatory variables but there is quite a large margin with regard to the estimated coefficient for the spatial lag (output shown below).
I'm not entirely sure about why GLM is not the right estimation method in this case.
Using GLS I get results similar to GLM (possibly related).
require(MASS)
m4<-glm(lnlmtue~lnlmtue_1+SpatLag+DENSITY+DEIND+lngdp_pc+UR+TRADE+FDI+LLVOTE+LEFTC+TCDEMC+GOVCON+OLDAGE+factor(cc)+factor(year),family=gaussian,data=fh)
summary(m4)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.199091355 3.924227850 1.835 0.068684 .
lnlmtue_1 0.435487985 0.080844033 5.387 0.000000293 ***
SpatLag -0.437680018 0.101078950 -4.330 0.000028105 ***
DENSITY 0.007633016 0.010268468 0.743 0.458510
DEIND 0.040270153 0.032304496 1.247 0.214618
I tried using the splm package, but this leads to even larger inconsistencies (output shown below).
Moreover, I'm not able to include fixed effects in the model.
require(splm)
m4a<-spml(lnlmtue~lnlmtue_1+DENSITY+DEIND+lngdp_pc+UR+TRADE+FDI+LLVOTE+LEFTC+TCDEMC+GOVCON+OLDAGE,data=fh,index=c("cc","year"),listw=mat2listw(wmat),
model="pooling",spatial.error="none",lag=TRUE)
summary(m4a)
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
(Intercept) 1.79439070 0.78042284 2.2993 0.02149 *
lnlmtue_1 0.75795987 0.04828145 15.6988 < 2e-16 ***
DENSITY -0.00026038 0.00203002 -0.1283 0.89794
DEIND -0.00489516 0.01414457 -0.3461 0.72928
So basically my question really is how does one properly estimate a SAR model with cross-section time-series data in R?
R-code
Replication data
Adjacency matrix
Is it critical that you use R?
I suggest that you examine the features of Geoda, a free spatial analysis package available from Arizona State University.
Though I have only used it to run basic spatial OLS (not 2SLS), I was pleased with Geoda's flexibility and visualization tools. I encourage you to skim the documentation and consider downloading the latest release.
If you must use R, I suggest exploring the GeoXp package (http://cran.r-project.org/web/packages/GeoXp/index.html).

Residual variance extracted from glm and lmer in R

I am trying to take what I have read about multilevel modelling and merge it with what I know about glm in R. I am now using the height growth data from here.
I have done some coding shown below:
library(lme4)
library(ggplot2)
setwd("~/Documents/r_code/multilevel_modelling/")
rm(list=ls())
oxford.df <- read.fwf("oxboys/OXBOYS.DAT",widths=c(2,7,6,1))
names(oxford.df) <- c("stu_code","age_central","height","occasion_id")
oxford.df <- oxford.df[!is.na(oxford.df[,"age_central"]),]
oxford.df[,"stu_code"] <- factor(as.character(oxford.df[,"stu_code"]))
oxford.df[,"dummy"] <- 1
chart <- ggplot(data=oxford.df,aes(x=occasion_id,y=height))
chart <- chart + geom_point(aes(colour=stu_code))
# see if lm and glm give the same estimate
glm.01 <- lm(height~age_central+occasion_id,data=oxford.df)
glm.02 <- glm(height~age_central+occasion_id,data=oxford.df,family="gaussian")
summary(glm.02)
vcov(glm.02)
var(resid(glm.02))
(logLik(glm.01)*-2)-(logLik(glm.02)*-2)
1-pchisq(-2.273737e-13,1)
# lm and glm give the same estimation
# so glm.02 will be used from now on
# see if lmer without level2 variable give same result as glm.02
mlm.03 <- lmer(height~age_central+occasion_id+(1|dummy),data=oxford.df,REML=FALSE)
(logLik(glm.02)*-2)-(logLik(mlm.03)*-2)
# 1-pchisq(-3.408097e-07,1)
# glm.02 and mlm.03 give the same estimation, only if REML=FALSE
mlm.03 gives me the following output:
> mlm.03
Linear mixed model fit by maximum likelihood
Formula: height ~ age_central + occasion_id + (1 | dummy)
Data: oxford.df
AIC BIC logLik deviance REMLdev
1650 1667 -819.9 1640 1633
Random effects:
Groups Name Variance Std.Dev.
dummy (Intercept) 0.000 0.0000
Residual 64.712 8.0444
Number of obs: 234, groups: dummy, 1
Fixed effects:
Estimate Std. Error t value
(Intercept) 142.994 21.132 6.767
age_central 1.340 17.183 0.078
occasion_id 1.299 4.303 0.302
Correlation of Fixed Effects:
(Intr) ag_cnt
age_central 0.999
occasion_id -1.000 -0.999
You can see that there is a variance for the residual in the random effects section. According to Applied Multilevel Analysis - A Practical Guide by Jos W.R. Twisk, this represents the amount of "unexplained variance" in the model.
I wondered if I could arrive at the same residual variance from glm.02, so I tried the following:
> var(resid(glm.01))
[1] 64.98952
> sd(resid(glm.01))
[1] 8.061608
The results are slightly different from the mlm.03 output. Does this refer to the same "residual variance" stated in mlm.03?
Your glm.02 and glm.01 estimate a simple linear regression model using least squares. On the other hand, mlm.03 is a linear mixed model estimated through maximum likelihood.
I don't know your dataset, but it looks like you use the dummy variable to create a cluster structure at level-2 with zero variance.
So your question basically has two answers, but only the second is important in your case. The models glm.02 and mlm.03 do not give the same residual variance estimate, because:
1. The models are usually different (mixed effects vs. classical regression). In your case, however, the dummy variable seems to suppress the additional variance component in the mixed model, so the models appear to be equivalent.
2. The method used to estimate the residual variance is different. glm uses LS, while lmer uses ML in your code. ML estimates of the residual variance are slightly biased (resulting in smaller variance estimates). This can be solved by using REML instead of ML to estimate variance components.
Using classic ML (instead of REML), however, is still necessary and correct for the likelihood-ratio test. Using REML the comparison of the two likelihoods would not be correct.
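The size of the gap can be checked with base R. This is a minimal sketch on simulated data (the data and the names x1, x2, yy are made up; only n = 234 and the three coefficients mirror the question). It compares three divisors for the same residual sum of squares: n - 1, which is what var(resid(...)) uses, n for ML, and n - p for the usual unbiased estimate, which is also what REML gives when there is no extra variance component.

```r
set.seed(1)
n <- 234
x1 <- rnorm(n)
x2 <- rnorm(n)
yy <- 140 + 1.3 * x1 + 1.3 * x2 + rnorm(n, sd = 8)

fit <- lm(yy ~ x1 + x2)
rss <- sum(resid(fit)^2)
p <- length(coef(fit))            # 3 coefficients, as in the question's model

var_sample   <- var(resid(fit))   # divides by n - 1
var_ml       <- rss / n           # maximum-likelihood estimate
var_unbiased <- rss / (n - p)     # equals summary(fit)$sigma^2
print(c(sample = var_sample, ml = var_ml, unbiased = var_unbiased))

# Applying the n - 1 -> n rescaling to the value reported in the question:
ml_from_question <- 64.98952 * (n - 1) / n
print(round(ml_from_question, 3)) # 64.712, the residual variance shown by mlm.03
```

So the two numbers in the question describe the same residuals with different divisors. Note that var(resid(...)) divides by n - 1 and not n - p, so switching lmer to REML would not be expected to reproduce 64.98952 exactly either.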
Cheers!
