I was using glmer to fit a logistic regression model with 2.5 million observations. However, after I added the multilevel component (a few hundred thousand groups), the model became too slow to fit in a timely manner on my computer. I want to try a generalized additive model instead, but I am confused about how to write the code.
The glmer code is as follows:
start <- Sys.time()
mylogit.m1a <- glmer(outcome ~ exposure * risk + tenure.yr + CurrentAge +
                       percap.inc.k + employment + rentership + pop.change +
                       pop.den.k + (1 | geo_id / house_id),
                     data = temp, family = "binomial",
                     control = glmerControl(optimizer = "bobyqa", calc.derivs = FALSE))
print(Sys.time() - start)
The example I found writes the gam like this:
ga_model = gam(
Reaction ~ Days + s(Subject, bs = 're') + s(Days, Subject, bs = 're'),
data = sleepstudy,
method = 'REML'
)
But I am confused about why there are two terms in parentheses, and about what I should put in the parentheses to specify the model correctly.
The details are given in ?smooth.construct.re.smooth.spec:
Exactly how the random effects are implemented is best seen by example. Consider the model term ‘s(x,z,bs="re")’. This will result in the model matrix component corresponding to ‘~x:z-1’ being added to the model matrix for the whole model.
So s(Days, Subject, bs = "re") is equivalent to the (0 + Days | Subject) term in the corresponding lmer model: both of them encode "random variation in slope with respect to day across subjects".
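For concreteness, here is the mixed-model formulation that the gam above corresponds to (a minimal sketch, assuming the sleepstudy data shipped with lme4; the intercept and slope come out uncorrelated in both formulations, because the two re smooths are separate terms):
library(lme4)
# (1 | Subject) matches s(Subject, bs = "re");
# (0 + Days | Subject) matches s(Days, Subject, bs = "re")
lmer_model <- lmer(Reaction ~ Days + (1 | Subject) + (0 + Days | Subject),
                   data = sleepstudy)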
So your (1 | geo_id / house_id) would be translated to mgcv syntax as
s(geo_id, bs = "re") + s(geo_id, house_id, bs = "re")
(the nesting syntax a/b expands in general to a + a:b).
A couple of other comments:
you should probably use bam() as a drop-in replacement for gam() (much faster)
you may very well run into problems with memory usage: mgcv doesn't use sparse matrices for the random effects terms, so they can get big
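Putting the pieces together, a minimal sketch of the translated model under those caveats (variable names taken from your glmer call; I'm assuming geo_id and house_id can be converted to factors, which bs = "re" requires):
library(mgcv)
temp$geo_id   <- factor(temp$geo_id)    # bs = "re" needs factors for random intercepts
temp$house_id <- factor(temp$house_id)
ga_model <- bam(outcome ~ exposure * risk + tenure.yr + CurrentAge +
                  percap.inc.k + employment + rentership + pop.change + pop.den.k +
                  s(geo_id, bs = "re") + s(geo_id, house_id, bs = "re"),
                data = temp, family = binomial, method = "fREML",
                discrete = TRUE)  # discretization speeds up bam further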
I'd appreciate any insights into staggered DiD (difference-in-differences) models.
I wanted to ask whether I'm using the correct function to set up the model for a DiD (data structure provided below):
# construct the interaction variable by hand
sample$did <- sample$time * sample$treated
didreg <- lm(y ~ time + treated + did + x + factor(year) + factor(firm), data = sample)
The data looks like:
I'm not familiar with difference-in-differences modelling, but from skimming the Wikipedia article it seems that what you want is a simple interaction. To fit that, you don't even need to calculate a new variable (did); you can specify the interaction directly in the model. There are a couple of ways to write that with R formula syntax:
# Simple main-effects model, no interaction
main_mod <- lm(y ~ time + treated + x + factor(year) + factor(firm), data = sample)
# Model with the interaction effect explicitly specified
did_mod1 <- lm(y ~ time + treated + time:treated + x + factor(year) + factor(firm), data = sample)
# Model with shortened syntax for specifying interactions
did_mod2 <- lm(y ~ time * treated + x + factor(year) + factor(firm), data = sample)
did_mod1 and did_mod2 are identical; did_mod2 is just a more compact way of writing the same model. The * indicates that you want both the main effects and the interaction of the variables on either side of it. It's recommended to always include the main effects when you fit an interaction, so the shorthand saves time and space.
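If you want to convince yourself that the two specifications really are the same model, a quick check (a sketch, assuming the fits above) is:
# Same design matrix, so the coefficients agree exactly
all.equal(coef(did_mod1), coef(did_mod2))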
I'm trying to include time fixed effects (dummies for years generated with model.matrix) into a PPML regression in R.
Without time fixed effects, the regression is:
require(gravity)
my_model <- PPML(y = "v", dist = "dist",
                 x = c("land", "contig", "comlang_ethno",
                       "smctry", "tech", "exrate"),
                 vce_robust = TRUE, data = database)
I've tried to add the argument fe=c("year") within the PPML function, but it doesn't work.
I'd appreciate any help on this.
I would comment on the previous answer but don't have enough reputation. The gravity model in your PPML command specifies v = dist × exp(land + contig + comlang_ethno + smctry + tech + exrate + TimeFE) = exp(log(dist) + land + contig + comlang_ethno + smctry + tech + exrate + TimeFE).
The formula inside glm should have as its right-hand side the variables inside the exponential, because the RHS is the linear predictor produced by the link function (the Poisson default being the natural log). So, in sum, your command should be
glm(v ~ log(dist) + land + contig + comlang_ethno + smctry + tech + exrate + factor(year),
    family = "quasipoisson", data = database)
and in particular, you need to have distance in logs on the RHS (unlike the previous answer).
Just make sure that year is a factor; then you can use the plain-and-simple glm function as
model.PPML <- glm(y ~ dist + year, family = "quasipoisson", data = database)
which gives you the results with year as dummies/fixed effects. The robust SEs are then calculated with
lmtest::coeftest(model.PPML, vcov = sandwich::vcovHC(model.PPML, type = "HC1"))
The PPML function does nothing more, it just isn't very flexible.
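Putting the two steps together, a minimal end-to-end sketch (assuming the database object and variable names from the question):
database$year <- factor(database$year)   # year as factor => year dummies / time FE
model.PPML <- glm(v ~ log(dist) + land + contig + comlang_ethno + smctry +
                    tech + exrate + year,
                  family = "quasipoisson", data = database)
lmtest::coeftest(model.PPML, vcov = sandwich::vcovHC(model.PPML, type = "HC1"))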
As an alternative to PPML and glm, you can also solve your problem with the function femlm (from the package FENmlm), which handles fixed-effect estimation for maximum likelihood models.
The two main advantages of femlm are:
you can add as many fixed-effects as you want, and they are dealt with separately, leading to computation times far below glm's (especially when the fixed-effects contain many categories)
standard errors can be clustered with intuitive commands
Here's an example regarding your problem (with just two variables and the year fixed-effects):
library(FENmlm)
# (default family is Poisson, 'pipe' separates variables from fixed-effects)
res = femlm(v ~ log(dist) + land | year, base)
summary(res, se = "cluster")
This code estimates the coefficients of the variables log(dist) and land with year fixed-effects; it then displays the coefficient table with standard errors clustered with respect to year.
Going beyond your initial question, now assume you have a more complex case with three fixed-effects: country_i, country_j and year. You'd write:
res = femlm(v ~ log(dist) + land | country_i + country_j + year, base)
You can then easily play around with clustered standard-errors:
# Cluster w.r.t. country_i (default is first cluster encountered):
summary(res, se = "cluster")
summary(res, se = "cluster", cluster = "year") # cluster w.r.t. year cluster
# Two-way clustering:
summary(res, se = "twoway") # two-way clustering w.r.t. country_i & country_j
# two way clustering w.r.t. country_i & year:
summary(res, se = "twoway", cluster = c("country_i", "year"))
For more information on the package, the vignette can be found at https://cran.r-project.org/web/packages/FENmlm/vignettes/FENmlm.html.
I am fitting a maximal model to untransformed response times to correct trials, with two, two-level, centered categorical predictors (Stimulation, Cognate Status) and an orthogonal second-order polynomial with 5 levels (Block). Random effects include full crossed structure with correlations. 32 subjects, 60 items, balanced, within-subjects design, 12,406 observations. The model converges but the summary takes an age to process.
The model runs without any convergence issues, but summary() launches a memory-intensive process and never compiles/prints the output. I don't have any issues with the summary() function on other objects.
I have included the code for the model for reference.
Max.lmer.RT <- lmer(RT ~ StimCent.r * (ot1 + ot2) * CogStatCent.r +
                      (1 + StimCent.r * (ot1 + ot2) * CogStatCent.r | PID) +
                      (1 + StimCent.r * (ot1 + ot2) | DutchName:CogStatCent.r),
                    data = TDL.cent.RT, REML = FALSE,
                    control = lmerControl(optimizer = "nloptwrap2",
                                          optCtrl = list(maxfun = 100000)))
summary(Max.lmer.RT)
A fix or suggestions on what might be causing this would be much appreciated.
I'm attempting to "translate" a model run in HLM7 software to R lmer syntax.
This is from the now-ubiquitous "Math achievement" dataset. The outcome is math achievement score, and in the dataset there are various student-level predictors (such as minority status, SES, and whether or not the student is female) and various school level predictors (such as Catholic vs. Public).
The only predictors in the model I want to fit are student-level predictors, which have all been group-mean centered to deal with dummy variables (aside: contrast codes are better). The students are nested in schools, so we should (I think) have random effects specified for all of the components of the model.
Here is the HLM model:
Level-1 Model
(note: all predictors at level one are group-mean centered)
MATHACH_ij = β_0j + β_1j*(MINORITY_ij) + β_2j*(FEMALE_ij) + β_3j*(SES_ij) + r_ij
Level-2 Models
β_0j = γ_00 + u_0j
β_1j = γ_10 + u_1j
β_2j = γ_20 + u_2j
β_3j = γ_30 + u_3j
Mixed Model
MATHACH_ij = γ_00 + γ_10*MINORITY_ij + γ_20*FEMALE_ij + γ_30*SES_ij + u_0j + u_1j*MINORITY_ij + u_2j*FEMALE_ij + u_3j*SES_ij + r_ij
Translating it to lmer syntax, I try:
(note: _gmc means the variable has been group mean centered, the grouping factor is "school_id")
model1 <- lmer(mathach ~ minority_gmc + female_gmc + ses_gmc +
                 (minority_gmc | school_id) + (female_gmc | school_id) +
                 (ses_gmc | school_id),
               data = data, REML = FALSE)
When I run this model I get results that don't mesh with the HLM results. Am I specifying the random effects incorrectly?
Thanks!
When you specify your random-effects structure, you can include all of the random effects in a single set of parentheses. Your current specification fits three separate terms, each with its own (redundant) random intercept, and forces the random effects to be uncorrelated, whereas the HLM model estimates the full covariance matrix of u_0j through u_3j. While this may not resolve every discrepancy, I believe the appropriate random-effects syntax for your model is this:
lmer(mathach ~ minority_gmc + female_gmc + ses_gmc +
       (1 + minority_gmc + female_gmc + ses_gmc | school_id),
     data = data, REML = FALSE)
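If you did want uncorrelated random effects (which is what the three-parentheses version implies, minus the duplicated intercepts), lme4's double-bar syntax expresses that directly; a sketch:
# no covariances estimated between the random intercept and slopes
lmer(mathach ~ minority_gmc + female_gmc + ses_gmc +
       (1 + minority_gmc + female_gmc + ses_gmc || school_id),
     data = data, REML = FALSE)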
Suppose I compute the Hausman-Taylor estimator using the plm command with the option model = "ht". From the result, I would like to obtain a robust variance-covariance matrix to make inference fully robust. For this purpose the vcovHC() command (part of the plm package) is used. Here is a minimal example:
data("Wages", package = "plm")
ht <- plm(lwage ~ wks + south + smsa + married + exp + I(exp^2) +
            bluecol + ind + union + sex + black + ed |
            sex + black + bluecol + south + smsa + ind,
          data = Wages, model = "ht", index = 595)
vcvHT <- vcovHC(ht, method = "arellano")
Error in vcovHC.plm(ht, method = "arellano") :
Model has to be either random, within or pooling model
Technically, as the error message indicates, vcovHC() is unable to compute the VCV matrix because it does not support models of the type returned by plm(..., model = "ht").
My question is this:
Why doesn't vcovHC() support the Hausman-Taylor model? Is it because standard errors based on a (cluster-)robust VCV matrix shouldn't be used for theoretical reasons (inconsistency etc.), or is it simply not implemented yet but safe to use (if programmed by hand)?
It is currently not implemented (yet); but since HT is a special kind of IV estimator, it should in principle be possible to compute an HC covariance. I will get around to doing it sometime. A production version requires a lot of interface work and consideration of all possible cases, but an ad-hoc function might be relatively easy to write, based on components from the model object.
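For anyone who wants to experiment in the meantime, here is a purely illustrative sketch of such an ad-hoc function: a plain Arellano-type cluster sandwich built from generic model components. It assumes model.matrix() and residuals() return usable pieces for the HT fit (not guaranteed), and it ignores the instrument projection that a proper HT-robust estimator would require, so treat any output with caution.
# Hypothetical helper, NOT part of plm: cluster-robust sandwich from
# generic components of a fitted model object.
adhoc_vcov_cluster <- function(fit, cluster) {
  X <- model.matrix(fit)                    # assumption: available for the HT fit
  u <- as.numeric(residuals(fit))
  bread <- solve(crossprod(X))              # (X'X)^{-1}
  scores <- rowsum(X * u, group = cluster)  # per-cluster score sums
  meat <- crossprod(scores)                 # sum over clusters of s_g s_g'
  bread %*% meat %*% bread
}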