extend() doesn't show its effect in the R package simr

I'm trying to reproduce the example of Green and MacLeod in https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12504.
library(simr)
model1 <- glmer(z ~ x + (1|g), family="poisson", data=simdata)
summary(model1)
fixef(model1)["x"] <- -0.05 # Specify desired effect size
model3 <- extend(model1, along="g", n=15) # Add more groups
summary(model3)
However, in the output of model 3, the number of groups is not extended (same result as in model 1):
Random effects:
Groups Name Variance Std.Dev.
g (Intercept) 0.08345 0.2889
Number of obs: 30, groups: g, 3
I know that checking the row counts gives different results, but why isn't the extension reflected in the regression output? Am I doing something wrong? How can I extend the number of groups so that I can calculate a proper powerCurve?
> nrow(getData(model1))
[1] 30
> nrow(getData(model3))
[1] 150

First off, well done on checking that the functions are doing what you expect. I hadn't expected users to be quite this thorough.
model3 now has two datasets attached to it: the original one that lme4 knows about, and an attribute newData that simr checks instead.
So print and summary, which are part of lme4, report values from the old dataset, but getData, powerSim and powerCurve in simr will use the new extended dataset:
> powerSim(model3, nsim=10, progress=FALSE)
Power for predictor 'x', (95% confidence interval):
80.00% (44.39, 97.48)
Test: z-test
Effect size for x is -0.050
Based on 10 simulations, (0 warnings, 0 errors)
alpha = 0.05, nrow = 150
Time elapsed: 0 h 0 m 0 s
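With the extended model in hand, powerCurve works the same way; a minimal sketch (nsim kept small for speed):
pc <- powerCurve(model3, along="g", nsim=10)  ## also reads the extended data
print(pc)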
I've opened an issue on the GitHub repository; this is probably a bug that should be fixed.

Related

lme4: How to specify random slopes while constraining all correlations to 0?

Due to an interesting turn of events, I'm trying to use the lme4 package in R to fit a model in which the random slopes are not allowed to correlate with each other or with the random intercept. Effectively, I want to estimate the variance parameter for each random slope, but none of the correlations/covariances. From the reading I've done so far, I think what I want is effectively a diagonal variance/covariance structure for the random effects.
An answer to a similar question here provides a workaround for specifying a model where slopes are correlated with intercepts, but not with each other. I also know the || syntax in lme4 makes slopes that are correlated with each other, but not with the intercepts [struck through in the original edit; see below]. Neither of these seems to fully accomplish what I'm looking for.
Borrowing the example from the earlier post, if my model is:
m1 <- lmer(Y ~ A + B + (1 + A + B | Subject), data = mydata)
is there a way to specify the model such that I estimate variance parameters for A and B while constraining all three correlations to 0? I would like to achieve a result that looks something like this:
VarCorr(m1)
## Groups Name Std.Dev. Corr
## Subject (Intercept) 1.41450
## A 1.49374 0.000
## B 2.47895 0.000 0.000
## Residual 0.96617
I'd prefer a solution that could achieve this for an arbitrary number of random slopes. For example, if I were to add a random effect for a third variable C, there would be 6 correlation parameters to fix at 0 rather than 3. However, anything that could get me started in the right direction would be extremely helpful.
Edit:
When asking this question, I misunderstood what the || syntax does in lme4. I have struck through the incorrect statement above to avoid misleading anyone in the future.
This is exactly what the double-bar notation does. However, note that || in lme4 does not work as one might expect for factor variables. It does work properly in glmmTMB, and the afex::mixed() function is a wrapper for [g]lmer which implements a fully functional version of ||. (I have been meaning to import this into lme4 for years but just haven't gotten around to it yet ...)
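For factor predictors, then, a sketch of the afex workaround (f and mydata here are hypothetical placeholders, not objects from this question):
library(afex)
## expand_re = TRUE expands the || notation across factor levels before fitting,
## giving a genuinely diagonal random-effects structure
m_diag <- mixed(Y ~ f + (1 + f || Subject), data = mydata, expand_re = TRUE)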
simulated example
library(lme4)
set.seed(101)
dd <- data.frame(A = runif(500), B = runif(500),
                 Subject = factor(rep(1:25, 20)))
dd$Y <- simulate(~ A + B + (1 + A + B | Subject),
                 newdata = dd,
                 family = gaussian,
                 newparams = list(beta = rep(1, 3), theta = rep(1, 6), sigma = 1))[[1]]
solution
summary(m <- lmer(Y ~ A + B + (1 + A + B || Subject), data = dd))
The correlations aren't listed because they are structurally absent (internally, the random effects term is expanded to (1|Subject) + (0 + A|Subject) + (0+B|Subject), which is also why the groups are listed as Subject, Subject.1, Subject.2).
Random effects:
Groups Name Variance Std.Dev.
Subject (Intercept) 0.8744 0.9351
Subject.1 A 2.0016 1.4148
Subject.2 B 2.8718 1.6946
Residual 0.9456 0.9724
Number of obs: 500, groups: Subject, 25
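For comparison, the expansion described above can also be written out by hand; a short sketch on the same simulated data (it should reproduce the fit from the || version):
m2 <- lmer(Y ~ A + B + (1 | Subject) + (0 + A | Subject) + (0 + B | Subject),
           data = dd)
VarCorr(m2)  ## same three variance components, no correlation parameters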

Two-level modelling with lme in R

I am interested in estimating a mixed-effects model with two random components (sorry for the somewhat imprecise notation; I am somewhat new to these kinds of models). I also want the standard errors of the variances of the random components. That is why I am somewhat bound to using lme: I found this description of how to calculate those standard errors and, also interesting, the standard error of a function of these variances (link).
I believe I know how to use lmer. I am ultimately interested in model2. For model1, both commands yield the same estimates, but model2 with lme yields different results than model2 with lmer from the lme4 package. Could you help me figure out how to set up the random components for lme? That would be much appreciated. Thanks. Please find my MWE attached.
Best
Daniel
#### load all packages #####
loadpackage <- function(x) {
  for (i in x) {
    # require() returns TRUE invisibly if it was able to load the package
    if (!require(i, character.only = TRUE)) {
      # if the package could not be loaded, (re-)install it
      install.packages(i, dependencies = TRUE)
    }
    # load the package (after installing)
    library(i, character.only = TRUE)
  }
}
# Then try/install packages...
loadpackage( c("nlme", "msm", "lmeInfo", "lme4"))
alcohol1 <- read.table("https://stats.idre.ucla.edu/stat/r/examples/alda/data/alcohol1_pp.txt", header=T, sep=",")
attach(alcohol1)
id <- as.factor(id)
age <- as.factor(age)
model1.lmer <-lmer(alcuse ~ 1 + peer + (1|id))
summary(model1.lmer)
model2.lmer <-lmer(alcuse ~ 1 + peer + (1|id) + (1|age))
summary(model2.lmer)
model1.lme <- lme(alcuse ~ 1+ peer, data = alcohol1, random = ~ 1 |id, method ="REML")
summary(model1.lme)
model2.lme <- lme(alcuse ~ 1+ peer, data = alcohol1, random = ~ 1 |id + 1|age, method ="REML")
Edit (15/09/2021):
Estimating the model as follows and then returning the estimates via nlme::VarCorr gives me different results. While the estimates seem to be in the right ballpark, it is as if they are switched across components.
model2a.lme <- lme(alcuse ~ 1+ peer, data = alcohol1, random = ~ 1 |id/age, method ="REML")
summary(model2a.lme)
nlme::VarCorr(model2a.lme)
Variance StdDev
id = pdLogChol(1)
(Intercept) 0.38390274 0.6195989
age = pdLogChol(1)
(Intercept) 0.47892113 0.6920413
Residual 0.08282585 0.2877948
EDIT (16/09/2021):
Since Bob pushed me to think more about my model, I want to give some additional information. Please note that the data I use in the MWE do not match my true data; I just use them for illustration since I cannot upload my true data. I have a household panel with income, demographic information and parent indicators.
I am interested in intergenerational mobility. Sibling correlations of permanent income are an industry standard. At the very least, contemporaneous observations are very bad proxies of permanent income: due to transitory shocks, i.e., classical measurement error, those estimates are most certainly attenuated. For this reason, we exploit the longitudinal dimension of our data.
For sibling correlations, this amounts to hypothesising that the income process is as follows:
$$Y_{ijt} = \beta X_{ijt} + \epsilon_{ijt}.$$
Here Y is income of individual i from family j in year t. X comprises age and survey-year indicators to account for life-cycle effects and macroeconomic conditions in the survey years. Epsilon is a compound term comprising a random individual component, a random family component, and a transitory component (measurement error and short-lived shocks). It looks as follows:
$$\epsilon_{ijt} = \alpha_i + \gamma_j + \eta_{ijt}.$$
The variance of income is then:
$$\sigma^2_\epsilon = \sigma^2_\alpha + \sigma^2_\gamma + \sigma^2_\eta.$$
The quantity we are interested in is
$$\rho = \frac{\sigma^2_\gamma}{\sigma^2_\alpha + \sigma^2_\gamma},$$
which reflects the share of shared family (and other characteristics) among siblings of the variation in permanent income.
By the way: the struggle is simply because I want standard errors for all estimates, including for $\rho$.
This is an example of crossed vs nested random effects. (Note that the example you refer to is fitting a different kind of model, a random-slopes model rather than a model with two different grouping variables ...)
If you try with(alcohol1, table(age,id)) you can see that every id is associated with every possible age (14, 15, 16). Or subset(alcohol1, id==1) for example:
id age coa male age_14 alcuse peer cpeer ccoa
1 1 14 1 0 0 1.732051 1.264911 0.2469111 0.549
2 1 15 1 0 1 2.000000 1.264911 0.2469111 0.549
3 1 16 1 0 2 2.000000 1.264911 0.2469111 0.549
There are three possible models you could fit with random effects of age (indexed by i) and id (indexed by j):
crossed ((1|age) + (1|id)): Y_{ij} = beta0 + beta1*peer + eps1_i + eps2_j + epsr_{ij}; alcohol use varies among individuals and, independently, across ages (this model won't work very well because there are only three distinct ages in the data set; more levels are usually needed)
id nested within age ((1|age/id) = (1|age) + (1|age:id)): Y_{ij} = beta0 + beta1*peer + eps1_i + eps2_{ij} + epsr_{ij}; alcohol use varies across ages, and varies across individuals within ages (see note above about number of levels).
age nested within id ((1|id/age) = (1|id) + (1|age:id)): Y_{ij} = beta0 + beta1*peer + eps1_j + eps2_{ij} + epsr_{ij}; alcohol use varies across individuals, and varies across ages within individuals
Here eps1_i, eps2_{ij}, and epsr_{ij} are normal deviates; epsr is the residual error term.
The latter two models actually don't make sense in this case; because there is only one observation per age/id combination, the nested variance (eps2) is completely confounded with the residual variance (epsr). lme doesn't notice this; if you try to fit one of the nested models in lmer it will give an error that
number of levels of each grouping factor must be < number of observations (problems: id:age)
(Although if you try to compute confidence intervals based on model1.lme you'll get an error "cannot get confidence intervals on var-cov components: Non-positive definite approximate variance-covariance", which is a hint that something is wrong.)
You could restate this problem as saying that the residual variation, and the variation among ages within individuals, are jointly unidentifiable (can't be separated from each other, statistically).
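For concreteness, the three specifications in lmer syntax (a sketch; as noted, the two nested versions fail on these data because id:age has one observation per level):
m_crossed <- lmer(alcuse ~ peer + (1 | id) + (1 | age), data = alcohol1)
## the nested fits error in lmer here ("number of levels of each grouping
## factor must be < number of observations"):
## m_id_in_age <- lmer(alcuse ~ peer + (1 | age/id), data = alcohol1)
## m_age_in_id <- lmer(alcuse ~ peer + (1 | id/age), data = alcohol1)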
The updated answer here shows how to get the standard errors of the variance components from an lmer model, so you shouldn't be stuck with lme (but you should think carefully about which model you're really trying to fit ...)
The GLMM FAQ might also be useful.
More generally, the standard error of
rho = (V_gamma)/(V_alpha + V_gamma)
will be hard to compute accurately, because this is a nonlinear function of the model parameters. You can apply the delta method, but the most reliable approach would be to use parametric bootstrapping: if you have a fitted model m, then something like this should work:
var_ratio <- function(m) {
    ## extract the variance components by grouping-factor name
    v <- sapply(VarCorr(m), as.numeric)
    v[["family"]]/(v[["family"]] + v[["id"]])
}
confint(m, method = "boot", FUN = var_ratio)
You should specify the random effects in lme using /, not +.
Using lmer:
model2.lmer <-lmer(alcuse ~ 1 + peer + (1|id) + (1|age), data = alcohol1)
summary(model2.lmer)
Linear mixed model fit by REML ['lmerMod']
Formula: alcuse ~ 1 + peer + (1 | id) + (1 | age)
Data: alcohol1
REML criterion at convergence: 651.3
Scaled residuals:
Min 1Q Median 3Q Max
-2.0228 -0.5310 -0.1329 0.5854 3.1545
Random effects:
Groups Name Variance Std.Dev.
id (Intercept) 0.08078 0.2842
age (Intercept) 0.30313 0.5506
Residual 0.56175 0.7495
Number of obs: 246, groups: id, 82; age, 82
Fixed effects:
Estimate Std. Error t value
(Intercept) 0.3039 0.1438 2.113
peer 0.6074 0.1151 5.276
Correlation of Fixed Effects:
(Intr)
peer -0.814
Using lme:
model2.lme <- lme(alcuse ~ 1+ peer, data = alcohol1, random = ~ 1 |id/age, method ="REML")
summary(model2.lme)
Linear mixed-effects model fit by REML
Data: alcohol1
AIC BIC logLik
661.3109 678.7967 -325.6554
Random effects:
Formula: ~1 | id
(Intercept)
StdDev: 0.4381206
Formula: ~1 | age %in% id
(Intercept) Residual
StdDev: 0.4381203 0.7494988
Fixed effects: alcuse ~ 1 + peer
Value Std.Error DF t-value p-value
(Intercept) 0.3038946 0.1438333 164 2.112825 0.0361
peer 0.6073948 0.1151228 80 5.276060 0.0000
Correlation:
(Intr)
peer -0.814
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-2.0227793 -0.5309669 -0.1329302 0.5853768 3.1544873
Number of Observations: 246
Number of Groups:
id age %in% id
82 82
Okay, finally. Just to sketch my confidential data: I have a panel of individuals. The data include siblings, identified via mnr. income is earnings, wavey is the survey year, age contains the age factors, female is a factor for gender, and pid is the factor identifying the individual.
m1 <- lmer(income ~ age + wavey + female + (1|pid) + (1 | mnr),
data = panel)
vv <- vcov(m1, full = TRUE)
covvar <- vv[58:60, 58:60]
covvar
3 x 3 Matrix of class "dgeMatrix"
cov_pid.(Intercept) cov_mnr.(Intercept) residual
[1,] 2.6528679 -1.4624588 -0.4077576
[2,] -1.4624588 3.1015001 -0.0597926
[3,] -0.4077576 -0.0597926 1.1634680
mean <- as.data.frame(VarCorr(m1))$vcov
mean
[1] 17.92341 16.86084 56.77185
deltamethod(~ x2/(x1+x2), mean, covvar, ses =TRUE)
[1] 0.04242089
The last scalar should be what I interpret as the share of the variance in permanent income due to the siblings' shared background.
Thanks to @Ben Bolker, who pointed me in this direction.

simr package - effect sizes

I'm using the simr package to estimate power and effect sizes for my models. I don't understand how the package estimates the effect sizes, though, or what kind of effect it reports (is it Cohen's d?).
For example, for my model, in which AQ and LSAS are continuous predictors and cond is a categorical (3-level) predictor, I get this output (for AQ):
> model.cnv.cue = lme4::lmer(DV ~ AQ_centr + cond + LSAS_centr + (1 | code), data = mydata, REML = FALSE)
> powerSim(model.cnv.cue,nsim = 200)
Power for predictor 'AQ_centr', (95% confidence interval):
60.50% (53.36, 67.32)
Test: Kenward Roger (package pbkrtest)
Effect size for AQ_centr is -0.048
Based on 200 simulations, (0 warnings, 0 errors)
alpha = 0.05, nrow = 153
Time elapsed: 0 h 0 m 23 s
nb: result might be an observed power calculation
Is it Cohen's d = -0.048? Or r? And what does the Kenward-Roger test have to do with this?
And then, when I run it for the categorical predictor, there are no effect sizes reported:
> model.cnv.cue = lme4::lmer(CNV_500_cue ~ cond + AQ_centr + LSAS_centr + (1 | code), data = ANT, REML = FALSE)
> powerSim(model.cnv.cue,nsim = 200)
Power for predictor 'cond', (95% confidence interval):
95.50% (91.63, 97.92)
Test: Likelihood ratio
Based on 200 simulations, (0 warnings, 0 errors)
alpha = 0.05, nrow = 153
Time elapsed: 0 h 0 m 13 s
nb: result might be an observed power calculation
So how does the package estimate effect sizes? And how do I get effect sizes for categorical predictors?
The effect size of -0.048 is the slope of your predictor AQ_centr.
Kenward-Roger tests are used to compute the p-values for the continuous predictor; for your categorical predictor, likelihood-ratio tests are used. Instead of KR you could also have used bootstrap etc.; it's just the method for computing p-values.
Your 3-level categorical predictor is probably split into 2 dummy variables when it enters the model. If you are interested in the effect of one specific dummy variable (say, cond2), you can run a z-test on it, like so:
powerSim(model.cnv.cue, fixed('cond2', 'z'), nsim=200)
To find out about the dummy variables, you can take a look at the model summary:
summary(model.cnv.cue)$coef
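As in the extend() example at the top of this page, you can also set the slope to a hypothesised value before simulating, rather than relying on the observed estimate (which the "observed power" note in the output warns about); a sketch with an illustrative value:
fixef(model.cnv.cue)["AQ_centr"] <- -0.1  ## hypothesised effect size (illustrative)
powerSim(model.cnv.cue, fixed("AQ_centr"), nsim=200)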
More info can be found here:
https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12504
https://besjournals.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1111%2F2041-210X.12504&file=mee312504-sup-0001-AppendixS1.html

Simulate data for mixed-effects model with predefined parameter

I'm trying to simulate data for a model expressed with the following formula:
lme4::lmer(y ~ a + b + (1|subject), data) but with a set of given parameters:
a <- rnorm(), measured at the subject level (e.g. nSubjects = 50)
y is measured at the observation level (e.g. nObs = 7 for each subject)
b <- rnorm(), measured at the observation level and correlated at a given r with a
the variance ratio of the random effects in lmer(y ~ 1 + (1 | subject), data) is fixed at, for example, 50/50 or 10/90 (and so on)
some random noise is present (so that a full model does not explain all the variance)
the effect size of the fixed effects can be set at a predefined level (e.g. dCohen = 0.5)
I have played with various packages like powerlmm, simstudy and simr but still fail to find a working solution that accommodates the number of parameters I'd like to define beforehand.
Also, for my own learning purposes, I'd prefer a base-R method to a package solution.
The closest example I have found is a blog post by Ben Ogorek, "Hierarchical linear models and lmer", which looks great, but I can't figure out how to control the parameters listed above.
Any help would be appreciated.
Also, if there is a package that I don't know of that can do this type of simulation, please let me know.
Some questions about the model definition:
How do we specify a correlation between two random vectors of different lengths? I'm not sure, so I'll sample 350 values (nObs*nSubjects) and throw away most of the values for the subject-level effect.
I'm not sure about "variance ratio" here. By definition, the theta parameters (standard deviations of the random effects) are scaled by the residual standard deviation (sigma); e.g. if sigma=2 and theta=2, then the residual std dev is 2 and the among-subject std dev is 4.
Define parameter/experimental design values:
nSubjects <- 50
nObs <- 7
## means of a, b are 0 without loss of generality
sdvec <- c(a = 1, b = 1)
rho <- 0.5    ## correlation between a and b
betavec <- c(intercept = 0, a = 1, b = 2)
beta_sc <- betavec
beta_sc[-1] <- beta_sc[-1]*sdvec  ## scale slopes by sd, keep the intercept
theta <- 0.4  ## = 20/50
sigma <- 1
Set up data frame:
library(lme4)
set.seed(101)
## generate correlated a, b variables
mm <- MASS::mvrnorm(nSubjects*nObs,
                    mu = c(0, 0),
                    Sigma = matrix(c(1, rho, rho, 1), 2, 2)*outer(sdvec, sdvec))
subj <- factor(rep(seq(nSubjects),each=nObs)) ## or ?gl
## sample every nObs'th value of a
avec <- mm[seq(1,nObs*nSubjects,by=nObs),"a"]
avec <- rep(avec,each=nObs) ## replicate
bvec <- mm[,"b"]
dd <- data.frame(a=avec,b=bvec,Subject=subj)
Simulate:
dd$y <- simulate(~ a + b + (1 | Subject),
                 newdata = dd,
                 newparams = list(beta = beta_sc, theta = theta, sigma = 1),
                 family = gaussian)[[1]]
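As a quick sanity check (a sketch), refit the generating model and compare the recovered parameters with the inputs:
m <- lmer(y ~ a + b + (1 | Subject), data = dd)
VarCorr(m)  ## among-subject SD should be near theta*sigma = 0.4
fixef(m)    ## should be near beta_sc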

svyglm in the survey package in R not returning standard errors

I'd really appreciate some assistance with this. I'd like to estimate coefficients and 95% CIs for a glm applied to a household survey with two levels (defined by dd and hh.num1). I've only recently come across the survey package.
I've been following the examples in the vignette for (1) setting up a dataset that reflects the sampling design, using svydesign, and (2) fitting a glm with svyglm. For the example datasets:
library(survey)
data(api)
head(apiclus1)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1)
logitmodel <- svyglm(I(sch.wide == "Yes") ~ awards + comp.imp + enroll + target + hsg +
                       pct.resp + mobility + ell + meals,
                     design = dclus1, family = quasibinomial())
summary(logitmodel)
Adding lots of variables seems OK, so I'm confident the package works with a good dataset.
When I do the same with my dataset, the standard errors come back as Inf once 3 or 4 variables are added, and I can't figure out why. It seems to be more common with factors. I'm sorry that I haven't been able to replicate the error with the other examples, but the dataset can be downloaded here.
So using this dataset:
load("balo2_7March17.Rdat")
dclus1 <- svydesign(id=~dd+hh.num1, weights=~chweight, data = balo2)
glm1 <- svyglm(out.penta ~ factor(MN18c) + windex5 + age.y,
design=dclus1, family=quasibinomial())
summary(glm1)
If MN18c is numeric then the standard errors are produced; if it's a factor (and it should be), the standard errors are Inf. Short of knowing what else to do, I'll need to try the analysis in Stata. I saw some commentary that errors may occur when the model is applied to a "bad" dataset, but what makes a dataset "bad"?
The problem is that you have zero residual degrees of freedom in your model. The residual df is the design df (the number of PSUs minus the number of strata) minus the number of predictors, which can easily go negative when you have two large clusters per stratum. This definition of residual df is probably conservative, but it's not a straightforward question.
> degf(dclus1)
[1] 5
> glm1$df.resid
[1] 0
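That zero follows from the arithmetic described above; a sketch of the check (assuming the fitted objects, and that coefficients beyond the intercept count against the design df):
degf(dclus1)                             ## design df: number of PSUs minus number of strata
length(coef(glm1))                       ## 6 coefficients, including the intercept
degf(dclus1) - (length(coef(glm1)) - 1)  ## residual df: 5 - 5 = 0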
You can extract the standard errors with
> SE(glm1)
(Intercept) factor(MN18c)2 factor(MN18c)3 factor(MN18c)4 windex5
0.5461374 0.4655331 0.2805168 0.3718879 0.1376936
age.y
0.1638210
and if you are willing to use a different residual degrees of freedom, you can specify it to summary and get p-values. In particular, if none of your covariates are at the cluster level, there is a reasonable argument that the regression doesn't use up degrees of freedom, so for one parameter at a time you can do
> summary(glm1, df=degf(dclus1))
Call:
svyglm(formula = out.penta ~ factor(MN18c) + windex5 + age.y,
design = dclus1, family = quasibinomial())
Survey design:
svydesign(id = ~dd + hh.num1, weights = ~chweight, data = balo2)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.0848 0.5461 -5.648 0.00241 **
factor(MN18c)2 -0.1183 0.4655 -0.254 0.80957
factor(MN18c)3 -0.4908 0.2805 -1.750 0.14059
factor(MN18c)4 -0.6137 0.3719 -1.650 0.15981
windex5 0.2556 0.1377 1.856 0.12256
age.y 0.9934 0.1638 6.064 0.00176 **
Combining parameters (e.g. to test the three coefficients making up MN18c) is more problematic, and I think you at least need df = degf(dclus1) - 3 + 1.
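One way to carry out that joint test is regTermTest() from the survey package, passing the adjusted df; a sketch under that assumption:
regTermTest(glm1, ~factor(MN18c), df = degf(dclus1) - 3 + 1)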
In the forthcoming version 4.1, the package will report standard errors in this situation (but not p-values, unless a different df= is specified).
