I'm trying to estimate a logistic unit fixed effects model for panel data using R. My dependent variable is binary and measured daily over two years for 13 locations.
The goal of this model is to predict the value of y for a particular day and location based on x.
zero <- seq(from=0, to=1, by=1)
ids = dplyr::data_frame(location=seq(from=1, to=13, by=1))
dates = dplyr::data_frame(date = seq(as.Date("2015-01-01"), as.Date("2016-12-31"), by="days"))
data = merge(dates, ids)
data$y <- sample(zero, size=9503, replace=TRUE)
data$x <- sample(zero, size=9503, replace=TRUE)
While surveying the available packages to do so, I've read a number of ways to (apparently) do this, but I'm not confident I've understood the differences between packages and approaches.
From what I have read so far, glm(), survival::clogit() and pglm::pglm() can be used to do this, but I'm wondering if there are substantial differences between the packages and what those might be.
Here are the calls I've used:
fixed <- glm(y ~ x + factor(location), data=data)
fixed <- clogit(y ~ x + strata(location), data=data)
One of the reasons for this insecurity is the error I get when using pglm (also see this question) that pglm can't use the "within" model:
fixed <- pglm(y ~ x, data=data, index=c("location", "date"), model="within", family=binomial("logit")).
What distinguishes the "within" model of pglm from the approaches in glm() and clogit() and which of the three would be the correct one to take here when trying to predict y for a given date and unit?
I don't see that you have defined a proper hypothesis to test within the context of what you are calling "panel data", but as far as getting glm to give estimates for logistic coefficients within strata it can be accomplished by adding family="binomial" and stratifying by your "unit" variable:
> fixed <- glm(y ~ x + strata(unit), data=data, family="binomial")
> fixed
Call: glm(formula = y ~ x + strata(unit), family = "binomial", data = data)
Coefficients:
(Intercept) x strata(unit)unit=2 strata(unit)unit=3
0.10287 -0.05910 -0.08302 -0.03020
strata(unit)unit=4 strata(unit)unit=5 strata(unit)unit=6 strata(unit)unit=7
-0.06876 -0.05042 -0.10200 -0.09871
strata(unit)unit=8 strata(unit)unit=9 strata(unit)unit=10 strata(unit)unit=11
-0.09702 0.02742 -0.13246 -0.04816
strata(unit)unit=12 strata(unit)unit=13
-0.11449 -0.16986
Degrees of Freedom: 9502 Total (i.e. Null); 9489 Residual
Null Deviance: 13170
Residual Deviance: 13170 AIC: 13190
That will not take into account any date-ordering, which is what I would have expected to be the interest. But as I said above, there doesn't yet appear to be a hypothesis that is premised on any sequential ordering.
This would create a fixed effects model that included a spline relationship of date to probability of y-event. I chose to center the date rather than leaving it as a very large integer:
library(splines)
fixed <- glm(y ~ x + ns(scale(date),3) + factor(unit), data=data, family="binomial")
fixed
#----------------------
Call: glm(formula = y ~ x + ns(scale(date), 3) + factor(unit), family = "binomial",
data = data)
Coefficients:
(Intercept) x ns(scale(date), 3)1 ns(scale(date), 3)2
0.13389 -0.05904 0.04431 -0.10727
ns(scale(date), 3)3 factor(unit)2 factor(unit)3 factor(unit)4
-0.03224 -0.08302 -0.03020 -0.06877
factor(unit)5 factor(unit)6 factor(unit)7 factor(unit)8
-0.05042 -0.10201 -0.09872 -0.09702
factor(unit)9 factor(unit)10 factor(unit)11 factor(unit)12
0.02742 -0.13246 -0.04816 -0.11450
factor(unit)13
-0.16987
Degrees of Freedom: 9502 Total (i.e. Null); 9486 Residual
Null Deviance: 13170
Residual Deviance: 13160 AIC: 13200
Related
glmer is used to estimate effects on the logit scale of y when the data are clustered. In the following model
fit1 = glmer(y ~ treat + x + ( 1 | cluster), family = binomial(link = "logit"))
the exp of the coefficient of treat is the odds ratio of a binary 0-1 treatment variable, x is a covariate, and cluster is a clustering indicator across which we model a random effect (intercept). A standard approach in glm's to estimate risk ratios is to use a log link instead, i.e. family=binomial(link = "log"). However using this in glmer I get error
Error in (function (fr, X, reTrms, family, nAGQ = 1L, verbose = 0L, maxit = 100L, :
(maxstephalfit) PIRLS step-halvings failed to reduce deviance in pwrssUpdate
after calling
fit1 = glmer(y ~ treat + x + ( 1 | cluster), family = binomial(link = "log"))
A web search revealed other people had similar issues with the Gamma family.
This seems to be a general problem as the reproducible example below demonstrates. My question thus is: how can I estimate risk ratios using a mixed effect model like glmer?
Reproducible Example
The following code simulates data that replicates the problem.
n = 1000 # sample size
m = 50 # number of clusters
J = sample(1:m, n, replace = T) # simulate cluster membership
x = rnorm(n) # simulate covariate
treat = rbinom(n, 1, 0.5) # simulate random treatment
u = rnorm(m) # simulate random intercepts
lt = x + treat + u[J] # compute linear term of logistic mixed effect model
p = 1/(1+exp(-lt)) # use logit link to transform to probabilities
y = rbinom(n,1,p) # draw binomial outcomes
d = data.frame(y, x, treat)
# First fit logistic model with glmer
fit1 = glmer( y ~ treat + x + (1 | as.factor(J)),
family = binomial(link = "logit"), data = d)
summary(fit1)
# Now try to log link
fit2 = glmer( y ~ treat + x + (1 | as.factor(J)),
family = binomial(link = "log"), data = d)
This error is returned due to your model producing values > 1:
PIRLS step-halvings failed to reduce deviance in pwrssUpdate
...
When using lme4 to fit GLMMs with link functions that do not automatically constrain the response to the allowable range of the distributional family (e.g. binomial models with a log link, where the estimated probability can be >1, or inverse-Gamma models, where the estimated mean can be negative), it is not unusual to get this error. This occurs because lme4 doesn’t do anything to constrain the predicted values, so NaN values pop up, which aren’t handled gracefully. If possible, switch to a link function to one that constrains the response (e.g. logit link for binomial or log link for Gamma).
Unfortunately, the suggested workaround is to use a different link function.
The following paper surveys a number of alternative model choices for calculation for [adjusted] relative risk:
Model choices to obtain adjusted risk difference estimates from a binomial regression model with convergence problems: An assessment of methods of adjusted risk difference estimation (2016)
Let LL = loglikelihood
Residual Deviance = 2(LL(Saturated Model) - LL(Proposed Model))
However, when I use glm function, it seems that
Residual Deviance = -2LL(Proposed Model)
For example,
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
mydata$rank <- factor(mydata$rank)
mylogit <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")
summary(mylogit)
###
Residual deviance: 458.52 on 394 degrees of freedom
AIC: 470.52
#Residual deviance
-2*logLik(mylogit)
##'log Lik.' 458.5175 (df=6)
#AIC
-2*logLik(mylogit)+2*(5+1)
##470.5175
Where is LL(Saturated Model) and how can I get it's value in R?
Thank you.
I have got the answer: it only happens when the log likelihood of the saturated model is 0, which for discrete models implies that the probability of the observed data under the saturated model is 1. Binary data is pretty much the only case where this is true (because individual fitted probabilities become either zero or one).H and Here for details.
I can use gls() from the nlme package to build mod1 with no random effects.
I can then compare mod1 using AIC to mod2 built using lme() which does include a random effect.
mod1 = gls(response ~ fixed1 + fixed2, method="REML", data)
mod2 = lme(response ~ fixed1 + fixed2, random = ~1 | random1, method="REML",data)
AIC(mod1,mod2)
Is there something similar to gls() for the lme4 package which would allow me to build mod3 with no random effects and compare it to mod4 built using lmer() which does include a random effect?
mod3 = ???(response ~ fixed1 + fixed2, REML=T, data)
mod4 = lmer(response ~ fixed1 + fixed2 + (1|random1), REML=T, data)
AIC(mod3,mod4)
With modern (>1.0) versions of lme4 you can make a direct comparison between lmer fits and the corresponding lm model, but you have to use ML --- it's hard to come up with a sensible analogue of the "REML criterion" for a model without random effects (because it would involve a linear transformation of the data that set all of the fixed effects to zero ...)
You should be aware that there are theoretical issues with information-theoretic comparisons between models with and without variance components: see the GLMM FAQ for more information.
library(lme4)
fm1 <- lmer(Reaction~Days+(1|Subject),sleepstudy, REML=FALSE)
fm0 <- lm(Reaction~Days,sleepstudy)
AIC(fm1,fm0)
## df AIC
## fm1 4 1802.079
## fm0 3 1906.293
I prefer output in this format (delta-AIC rather than raw AIC values):
bbmle::AICtab(fm1,fm0)
## dAIC df
## fm1 0.0 4
## fm0 104.2 3
To test, let's simulate data with no random effect (I had to try a couple of random-number seeds to get an example where the among-subject std dev was actually estimated as zero):
rr <- simulate(~Days+(1|Subject),
newparams=list(theta=0,beta=fixef(fm1),
sigma=sigma(fm1)),
newdata=sleepstudy,
family="gaussian",
seed=103)[[1]]
ss <- transform(sleepstudy,Reaction=rr)
fm1Z <- update(fm1,data=ss)
VarCorr(fm1Z)
## Groups Name Std.Dev.
## Subject (Intercept) 0.000
## Residual 29.241
fm0Z <- update(fm0,data=ss)
all.equal(c(logLik(fm0Z)),c(logLik(fm1Z))) ## TRUE
While I agree that with Ben that the simplest solution is to set REML=FALSE, the maximum REML likelihood for a model without random effects is well defined and is fairly straightforward to compute via the well known relation
between the ordinary profile likelihood function and the restricted likelihood.
The following code simulates data for which the estimated variance of the random intercept of a LMM ends up at 0 such that the maximum restricted log likelihood of the LMM should be equal to the restricted likelihood of the model without any random effects included.
The restricted likelihood of the LM is computed via the above formula and evaluates to the same value as that of the LMM.
An even simpler alternative is to use glmmTMB:
library(lme4)
#> Loading required package: Matrix
# simulate some toy data for which the LMM ends up at the boundary
set.seed(5)
n <- 100 # the sample size
x <- rnorm(n)
y <- rnorm(n)
group <- factor(rep(1:10,10))
# fit the LMM via REML
mod1 <- lmer(y ~ x + (1|group), REML=TRUE, control=lmerControl(boundary.tol=1e-8))
#> boundary (singular) fit: see ?isSingular
logLik(mod1)
#> 'log Lik.' -147.8086 (df=4)
# fit a model without random effects and compute its maximum REML log likelihood
mod0 <- lm(y ~ x)
p <- length(coef(mod0)) # number of fixed effect parameters
X <- model.matrix(mod0) # the fixed effect design matrix
sigma.REML <- summary(mod0)$sigma # REMLE of sigma
# the maximum ordinary log likelihood evaluated at the REML estimates
logLik.lm.at.REML <- sum(dnorm(residuals(mod0), 0, sigma.REML, log=TRUE))
# the restricted log likelihood of the model without random effects (via above formula)
logLik.lm.at.REML + p/2*log(2*pi) - 1/2*(- p*log(sigma.REML^2) + determinant(crossprod(X))$modulus)
#> [1] -147.8086
#> attr(,"logarithm")
#> [1] TRUE
library(glmmTMB)
data <- data.frame(y,x,group)
logLik(glmmTMB(y~x, family = gaussian(), data=data, REML=TRUE))
#> 'log Lik.' -147.8086 (df=3)
logLik(glmmTMB(y~x+(1|group), family = gaussian(), data=data, REML=TRUE))
#> 'log Lik.' -147.8086 (df=4)
I have the following data
data.set <- data.frame("varA"=rnorm(50),"varB"=rnorm(50),
"varC"=rnorm(50), binary.outcome=sample(c(0,1),50,replace=T) )
exp.vars <- c("varA","varB","varC")
I then wish to apply a logistic model using all of the exp.vars as dependent variables without hard coding them (I want to put this into a function so that different combinations of exp.vars can be tried. My attempt:
results <- glm( binary.outcome ~ get(paste(exp.vars, collapse="+")), family=binomial,
data=data.set )
How can I get this to work?
The . in the formula tells R to use all variables in the data.frame data.set (except y) as predictors. This should do it:
glm( binary.outcome ~ ., family=binomial,
data=data.set )
Call: glm(formula = binary.outcome ~ ., family = binomial, data = data.set)
Coefficients:
(Intercept) varA varB varC
-0.4820 0.1878 -0.3974 -0.4566
Degrees of Freedom: 49 Total (i.e. Null); 46 Residual
Null Deviance: 66.41
Residual Deviance: 62.06 AIC: 70.06
and from ?formula
There are two special interpretations of . in a formula. The usual one
is in the context of a data argument of model fitting functions and
means ‘all columns not otherwise in the formula’: see terms.formula.
In the context of update.formula, only, it means ‘what was previously
in this part of the formula’.
I have built a survival cox-model, which includes a covariate * time interaction (non-proportionality detected).
I am now wondering how could I most easily get survival predictions from my model.
My model was specified:
coxph(formula = Surv(event_time_mod, event_indicator_mod) ~ Sex +
ageC + HHcat_alt + Main_Branch + Acute_seizure + TreatmentType_binary +
ICH + IVH_dummy + IVH_dummy:log(event_time_mod)
And now I was hoping to get a prediction using survfit and providing new.data for the combination of variables I am doing the predictions:
survfit(cox, new.data=new)
Now as I have event_time_mod in the right-hand side in my model I need to specify it in the new data frame passed on to survfit. This event_time would need to be set at individual times of the predictions. Is there an easy way to specify event_time_mod to be the correct time to survfit?
Or are there any other options for achieving predictions from my model?
Of course I could create as many rows in the new data frame as there are distinct times in the predictions and setting to event_time_mod to correct values but it feels really cumbersome and I thought that there must be a better way.
You have done what is refereed to as
An obvious but incorrect approach ...
as stated in Using Time Dependent Covariates and Time Dependent Coefficients in the Cox Model vignette in version 2.41-3 of the R survival package. Instead, you should use the time-transform functionality, i.e., the tt function as stated in the same vignette. The code would be something similar to the example in the vignette
> library(survival)
> vfit3 <- coxph(Surv(time, status) ~ trt + prior + karno + tt(karno),
+ data=veteran,
+ tt = function(x, t, ...) x * log(t+20))
>
> vfit3
Call:
coxph(formula = Surv(time, status) ~ trt + prior + karno + tt(karno),
data = veteran, tt = function(x, t, ...) x * log(t + 20))
coef exp(coef) se(coef) z p
trt 0.01648 1.01661 0.19071 0.09 0.9311
prior -0.00932 0.99073 0.02030 -0.46 0.6462
karno -0.12466 0.88279 0.02879 -4.33 1.5e-05
tt(karno) 0.02131 1.02154 0.00661 3.23 0.0013
Likelihood ratio test=53.8 on 4 df, p=5.7e-11
n= 137, number of events= 128
The survfit though does not work when you have a tt term
> survfit(vfit3, veteran[1, ])
Error in survfit.coxph(vfit3, veteran[1, ]) :
The survfit function can not yet process coxph models with a tt term
However, you can easily get out the terms, linear predictor or mean response with predict. Further, you can create the term over time for the tt term using the answer here.