SIMR package - effect sizes - r

I'm using SIMR package to estimate power and effect sizes of my models. I don't understand how the package estimates the effect sizes, though, and what kind of an effect it reports (is it Cohen's d?).
E.g.
For my model, in which AQ and LSAS are continuous predictors and cond is a categorical (3 level) predictor, I get this output (for AQ):
> model.cnv.cue = lme4::lmer(DV ~ AQ_centr + cond + LSAS_centr + (1 | code), data = mydata, REML = FALSE)
> powerSim(model.cnv.cue,nsim = 200)
Power for predictor 'AQ_centr', (95% confidence interval):
60.50% (53.36, 67.32)
Test: Kenward Roger (package pbkrtest)
Effect size for AQ_centr is -0.048
Based on 200 simulations, (0 warnings, 0 errors)
alpha = 0.05, nrow = 153
Time elapsed: 0 h 0 m 23 s
nb: result might be an observed power calculation
Is it Cohen's d = -0.048? Or r? What does the Kenward-Roger test have to do with this?
And then, when I run it for the categorical predictor, there are no effect sizes reported:
> model.cnv.cue = lme4::lmer(CNV_500_cue ~ cond + AQ_centr + LSAS_centr + (1 | code), data = ANT, REML = FALSE)
> powerSim(model.cnv.cue,nsim = 200)
Power for predictor 'cond', (95% confidence interval):
95.50% (91.63, 97.92)
Test: Likelihood ratio
Based on 200 simulations, (0 warnings, 0 errors)
alpha = 0.05, nrow = 153
Time elapsed: 0 h 0 m 13 s
nb: result might be an observed power calculation
So how does the package estimate the effect sizes? And how to get effect sizes for categorical predictors?

The effect size -0.048 is the slope of your predictor AQ_centr.
Kenward-Roger tests are used to calculate the p-values for your continuous predictor; for your categorical predictor, likelihood ratio tests are used. Instead of KR you could also have used a bootstrap test, etc. (the test is just the method for computing p-values; it does not change the reported effect size).
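As a rough illustration of the two points above, here is a minimal sketch using `simdata`, the example dataset that ships with simr (not the data from the question). The model, variable names and the small `nsim` are assumptions chosen only to keep the example quick; the idea is that the "effect size" line printed by `powerSim` should match the fixed-effect coefficient, whichever test is requested.

```r
library(lme4)
library(simr)

# simdata ships with simr: y is a Gaussian response, x a numeric predictor,
# g a grouping factor (illustrative stand-ins for DV, AQ_centr and code)
fit <- lmer(y ~ x + (1 | g), data = simdata)

# The unstandardized slope; powerSim's "Effect size for x" should show this value
slope <- fixef(fit)["x"]
print(slope)

# Same model and same effect size; only the method for computing p-values differs
ps_z  <- powerSim(fit, test = fixed("x", "z"),  nsim = 20, progress = FALSE)
ps_lr <- powerSim(fit, test = fixed("x", "lr"), nsim = 20, progress = FALSE)
print(ps_z)
print(ps_lr)
```

Because the slope is on the raw scale of the response per unit of the predictor, it is neither Cohen's d nor r.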
Your 3 level categorical predictor is probably split into 2 dummy variables when entering the model. If you are interested in the effect of one specific dummy variable (let's say cond2), you can run a z-test on it, like so:
powerSim(model.cnv.cue, fixed('cond2', 'z'), nsim=200)
To find out about the dummy variables, you can take a look at the model summary:
summary(model.cnv.cue)$coef
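To see where names like `cond2` come from, the following base R sketch builds the design matrix for a made-up 3-level factor. The level names here are hypothetical; with R's default treatment contrasts, each non-reference level gets its own dummy column named after the factor plus the level.

```r
# Hypothetical 3-level factor standing in for `cond`
cond <- factor(c("neutral", "cue", "target", "neutral", "cue", "target"),
               levels = c("neutral", "cue", "target"))

# Default treatment contrasts: the first level is the reference,
# the other two levels each become one dummy column
X <- model.matrix(~ cond)
colnames(X)
# "(Intercept)" "condcue" "condtarget"
```

If the levels were coded "1", "2", "3", the columns would be called cond2 and cond3, which is the name you would then pass to fixed().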
More info can be found here:
https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12504
https://besjournals.onlinelibrary.wiley.com/action/downloadSupplement?doi=10.1111%2F2041-210X.12504&file=mee312504-sup-0001-AppendixS1.html

Related

Weird plots when plotting logistic regression residuals vs predictor variables?

I have fitted a logistic regression for an outcome (a type of side effect - whether patients have this or not). The formula and results of this model is below:
model <- glm(side_effect_G1 ~ age + bmi + surgerytype1 + surgerytype2 + surgerytype3 + cvd + rt_axilla, family = 'binomial', data= data1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.888112 0.859847 -9.174 < 2e-16 ***
age 0.028529 0.009212 3.097 0.00196 **
bmi 0.095759 0.015265 6.273 3.53e-10 ***
surgery11 0.923723 0.524588 1.761 0.07826 .
surgery21 1.607389 0.600113 2.678 0.00740 **
surgery31 1.544822 0.573972 2.691 0.00711 **
cvd1 0.624692 0.290005 2.154 0.03123 *
rt1 -0.816374 0.353953 -2.306 0.02109 *
I want to check my models, so I have plotted residuals against predictors or fitted values. I know, if a model is properly fitted, there should be no correlation between residuals and predictors and fitted values so I essentially run...
residualPlots(model)
My plots look odd: from the examples I have seen online, the residuals should be symmetrical around 0. Also, my factor variables aren't shown as box plots, although I have checked the structure of my data and coded surgery1, surgery2, surgery4,cvd,rt as factors. Can someone help me interpret my plots and show me how to plot boxplots for my factor variables?
Thanks
This pattern is expected when the response variable is imbalanced. In your plots most of the residuals fall below the dotted line, so I suspect this is the case.
Very briefly, residuals are symmetric around 0 in logistic regression only when the classes are balanced. If the data are heavily imbalanced towards the reference label (the 0 label), the intercept is forced towards a low value (i.e. towards the 0 label), and the positive labels have very large Pearson residuals (because they deviate a lot from the expected value). You can read more about imbalanced classes and logistic regression in this post
Here's an example to demonstrate this, using a dataset where the residuals are more evenly distributed:
library(mlbench)
library(car)
data(PimaIndiansDiabetes)
table(PimaIndiansDiabetes$diabetes)
neg pos
500 268
mdl = glm(diabetes ~ .,data=PimaIndiansDiabetes,family="binomial")
residualPlots(mdl)
Let's make it more imbalanced, and you get a plot exactly like yours:
da = PimaIndiansDiabetes
wh = c(which(da$diabetes=="neg"),which(da$diabetes == "pos")[1:100])
da = da[wh,]
table(da$diabetes)
neg pos
500 100
mdl = glm(diabetes ~ .,data=da,family="binomial")
residualPlots(mdl)
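The same asymmetry can be reproduced without any extra packages. This is a small simulated sketch (the sample size, coefficients and seed are arbitrary assumptions): with roughly 5% positives, the Pearson residual (y - p) / sqrt(p(1 - p)) is large and positive for the rare 1s, and small and negative for the many 0s.

```r
set.seed(1)
n <- 2000
x <- rnorm(n)

# Intercept of -3 makes positives rare (heavily imbalanced response)
y <- rbinom(n, 1, plogis(-3 + 0.5 * x))

fit <- glm(y ~ x, family = binomial)
r <- residuals(fit, type = "pearson")

mean(y)              # share of positives
tapply(r, y, range)  # residual range for y = 0 and for y = 1
```

Most points sit just below zero and a few sit far above it, which is the pattern residualPlots() draws for an imbalanced outcome.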

Why the statistical power I obtained with powerSim (simr) is way too low?

I am trying to compute the statistical power of my linear mixed model, which includes 2 conditions and 16 observations from 3 subjects. Fixed effect is the condition ('Cond') and random effect is subject identity (SubID).
Here is my data
lme4 <- lmer(delta ~ Cond + (1|SubID), data = Data)
I used powerSim (simr package):
powerSim (fit = lme4, test = fixed ("Cond" ), nsim = 1000)
And this is what I obtained: a power of 7-9%, which is very low.
Power for predictor 'Cond', (95% confidence interval):
8.5% ( 9.46, 15.39)
Test: Likelihood ratio
Based on 1000 simulations, (20 warnings, 0 errors)
alpha = 0.05, nrow = 16
Given that some of the data reach significance, I am expecting a power around 60 to 80%. Could you tell me whether I did anything wrong in the computation?

Extend doesn't show effects in R package simr

I'm trying to reproduce the example of Green and MacLeod in https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/2041-210X.12504.
library(simr)
model1 <- glmer(z ~ x + (1|g), family="poisson", data=simdata)
summary(model1)
fixef(model1)["x"] <- -0.05 # Specify desired effect size
model3 <- extend(model1, along="g", n=15) # Add more groups
summary(model3)
However, in the output of model3, the number of groups is not extended (same result as for model1):
Random effects:
Groups Name Variance Std.Dev.
g (Intercept) 0.08345 0.2889
Number of obs: 30, groups: g, 3
I know that by checking the rows I get different results, but why isn't that part of the regression? Am I doing something wrong? How can I extend the number of groups so that I can calculate a proper powerCurve?
> nrow(getData(model1))
[1] 30
> nrow(getData(model2))
[1] 130
First off, well done on checking that the functions are doing what you expect. I hadn't expected users to be quite this thorough.
model3 now has two datasets attached to it. The original one that lme4 knows about, and an attribute newData that simr checks instead.
So print and summary, which are part of lme4, give you values from the old dataset.
But getData and powerSim and powerCurve in simr will use the new extended dataset:
> powerSim(model3, nsim=10, progress=FALSE)
Power for predictor 'x', (95% confidence interval):
80.00% (44.39, 97.48)
Test: z-test
Effect size for x is -0.050
Based on 10 simulations, (0 warnings, 0 errors)
alpha = 0.05, nrow = 150
Time elapsed: 0 h 0 m 0 s
I've opened an issue on the GitHub repository; this is probably a bug that should be fixed.
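A quick way to confirm which dataset simr will simulate from is to inspect `getData()` directly instead of relying on `summary()`. This sketch reuses the model from the question on simr's bundled `simdata`; the helper variable names are just for illustration.

```r
library(lme4)
library(simr)

model1 <- glmer(z ~ x + (1 | g), family = "poisson", data = simdata)
fixef(model1)["x"] <- -0.05                      # desired effect size
model3 <- extend(model1, along = "g", n = 15)    # extend to 15 groups

# getData() returns the data simr actually uses for simulation
n_old <- nrow(getData(model1))
n_new <- nrow(getData(model3))
g_new <- length(unique(getData(model3)$g))

c(rows_before = n_old, rows_after = n_new, groups_after = g_new)
```

If `groups_after` shows 15, then powerSim and powerCurve will work with the extended design even though summary(model3) still reports 3 groups.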

Syntax for diagonal variance-covariance matrix for non-linear mixed effects model in nlme

I am analysing routinely collected substance use data during the first 12 months of treatment in a large sample of outpatients attending drug and alcohol treatment services. I am interested in whether differing levels of methamphetamine use (no use, low use, and high use) at the outset of treatment predict different levels after a year in treatment, but the data are very irregular, with different clients measured at different times and different numbers of times during their year of treatment.
The data for the high and low use group seem to suggest that drug use at outset reduces during the first 3 months of treatment and then asymptotes. Hence I thought I would try a non-linear exponential decay model.
I started with the following nonlinear generalised least squares model using the gnls() function in the nlme package:
fitExp <- gnls(outcome ~ C*exp(-k*yearsFromStart),
params = list(C ~ atsBase_fac, k ~ atsBase_fac),
data = dfNL,
start = list(C = c(nsC[1], lsC[1], hsC[1]),
k = c(nsC[2], lsC[2], hsC[2])),
weights = varExp(-0.8, form = ~ yearsFromStart),
control = gnlsControl(nlsTol = 0.1))
where outcome is the number of days of drug use in the 28 days previous to measurement, atsBase_fac is a three-level categorical predictor indicating level of amphetamine use at baseline (noUse, lowUse, and highUse), yearsFromStart is a continuous predictor indicating time from start of treatment in years (baseline = 0, max = 1), C is a parameter indicating initial level of drug use, and k is the rate of decay in drug use. The starting values of C and k are taken from nls models estimating these parameters for each group. These are the results of that model:
Generalized nonlinear least squares fit
Model: outcome ~ C * exp(-k * yearsFromStart)
Data: dfNL
AIC BIC logLik
27672.17 27725.29 -13828.08
Variance function:
Structure: Exponential of variance covariate
Formula: ~yearsFromStart
Parameter estimates:
expon
0.7927517
Coefficients:
Value Std.Error t-value p-value
C.(Intercept) 0.130410 0.0411728 3.16738 0.0015
C.atsBase_faclow 3.409828 0.1249553 27.28839 0.0000
C.atsBase_fachigh 20.574833 0.3122500 65.89218 0.0000
k.(Intercept) -1.667870 0.5841222 -2.85534 0.0043
k.atsBase_faclow 2.481850 0.6110666 4.06150 0.0000
k.atsBase_fachigh 9.485155 0.7175471 13.21886 0.0000
So it looks as if there are differences between groups in the initial rate of drug use and in the rate of reduction in drug use. I would like to go a step further and fit a nonlinear mixed effects model. I tried consulting Pinheiro and Bates' book accompanying the nlme package, but the only models I could find that used irregular, sparse data like mine used a self-starting function, and my model does not do that.
I tried to adapt the gnls() model to nlme like so:
fitNLME <- nlme(model = outcome ~ C*exp(-k*yearsFromStart),
data = dfNL,
fixed = list(C ~ atsBase_fac, k ~ atsBase_fac),
random = pdDiag(yearsFromStart ~ id),
groups = ~ id,
start = list(fixed = c(nsC[1], lsC[1], hsC[1], nsC[2], lsC[2], hsC[2])),
weights = varExp(-0.8, form = ~ yearsFromStart),
control = nlmeControl(optim = "optimizer"))
but I keep getting an error message, I presume through errors in the syntax specifying the random effects.
Can anyone give me some tips on how the syntax for the random effects works in nlme?
The only dataset in Pinheiro and Bates that resembled mine used a diagonal variance-covariance matrix. Can anyone fill me in on the syntax of this nlme function, or suggest a better one?
p.s. I wish I could provide a reproducible example but coming up with synthetic data that re-creates the same errors is way beyond my skills.
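For reference, here is a minimal sketch of the diagonal random-effects syntax on the Loblolly data that ships with R, following the example in the nlme help page. It is not the model from the question: it uses the self-starting SSasymp function and a grouped dataset, so the grouping factor is picked up from the data rather than from a groups argument. The point is only the form of `random`: the left-hand side of the formula lists model parameters (here Asym and lrc), not covariates such as time.

```r
library(nlme)

# Asymptotic growth model with uncorrelated (diagonal) random effects
# on two of the three parameters
fm <- nlme(height ~ SSasymp(age, Asym, R0, lrc),
           data   = Loblolly,
           fixed  = Asym + R0 + lrc ~ 1,
           random = pdDiag(Asym + lrc ~ 1),
           start  = c(Asym = 103, R0 = -8.5, lrc = -3.3))

# Variances of the random effects; pdDiag means no covariance is estimated
VarCorr(fm)
```

By analogy, and as an untested assumption, the model in the question would name its own parameters there, for example pdDiag(C + k ~ 1), with groups = ~ id supplying the grouping.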

R geepack: unreasonably large estimates using GEE

I am using geepack for R to estimate a logistic marginal model with geeglm(), but I am getting garbage estimates. They are about 16 orders of magnitude too large. However, the p-values seem to be similar to what I expected. This means that the response essentially becomes a step function. See attached plot
Here is the code that generates the plot:
require(geepack)
data = read.csv(url("http://folk.uio.no/mariujon/data.csv"))
fit = geeglm(moden ~ 1 + power, id = defacto, data=data, corstr = "exchangeable", family=binomial)
summary(fit)
plot(moden ~ power, data=data)
x = 0:2500
y = predict(fit, newdata=data.frame(power = x), type="response" )
lines(x,y)
Here is the regression table:
Call:
geeglm(formula = moden ~ 1 + power, family = binomial, data = data,
id = defacto, corstr = "exchangeable")
Coefficients:
Estimate Std.err Wald Pr(>|W|)
(Intercept) -7.38e+15 1.47e+15 25.1 5.4e-07 ***
power 2.05e+13 1.60e+12 164.4 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Estimated Scale Parameters:
Estimate Std.err
(Intercept) 1.03e+15 1.65e+37
Correlation: Structure = exchangeable Link = identity
Estimated Correlation Parameters:
Estimate Std.err
alpha 0.196 3.15e+21
Number of clusters: 3 Maximum cluster size: 381
Hoping for some help. Thanks!
Kind regards,
Marius
I will give three procedures, each of which is a marginalized random intercept model (MRIM). These MRIMs have coefficients with marginal logistic interpretations and are of smaller magnitude than the GEE:
| Model | (Intercept) | power | LogL |
|-------|-------------|--------|--------|
| `L_N` | -1.050| 0.00267| -270.1|
| `LLB` | -0.668| 0.00343| -273.8|
| `LPN` | -1.178| 0.00569| -266.4|
compared to a glm that doesn't account for any correlation, for reference:
| Model | (Intercept) | power | LogL |
|-------|-------------|--------|--------|
| strt | -0.207| 0.00216| -317.1|
A marginalized random intercept model (MRIM) is worth exploring because you want a marginal model with exchangeable correlation structure for the clustered data, and that is the type of structure MRIMs exhibit.
The code (especially R script with comments) and PDFs for literature are in the GITHUB repo. I detail the code and literature down below.
The concept of MRIM has been around since 1999, and some background reading on this is in the GITHUB repo. I suggest reading Swihart et al 2014 first because it reviews the other papers.
In chronological order --
L_N Heagerty (1999): the approach fits a random intercept logistic model with a normally distributed random intercept. The trick is that the predictor in the random intercept model is nonlinearly parameterized with marginal coefficients so that the resulting marginal model has a marginal logistic interpretation. Its code is the lnMLE R package (not on CRAN, but on Patrick Heagerty's website here). This approach is denoted L_N in the code to indicate logit (L) on the marginal, no interpretation on the conditional scale (_) and a normally (N) distributed random intercept.
LLB Wang & Louis (2003): the approach fits a random intercept logistic model with a bridge distributed random intercept. Unlike Heagerty 1999 where the trick is nonlinear-predictor for the random intercept model, the trick is a special random effects distribution (the bridge distribution) that allows both the random intercept model and the resulting marginal model to have a logistic interpretation. Its code is implemented with gnlmix4MMM.R (in the repo) which uses rmutil and repeated R packages. This approach is denoted LLB in the code to indicate logit (L) on the marginal, logit (L) on the conditional scale and a bridge (B) distributed intercept.
LPN Caffo and Griswold (2006): the approach fits a random intercept probit model with a normally distributed random intercept, whereas Heagerty 1999 used a logit random intercept model. This substitution makes computations easier and still yields a marginal logit model. Its code is implemented with gnlmix4MMM.R (in the repo) which uses rmutil and repeated R packages. This approach is denoted LPN in the code to indicate logit (L) on the marginal, probit (P) on the conditional scale and a normally (N) distributed intercept.
Griswold et al (2013): another review / practical introduction.
Swihart et al 2014: This is a review paper for Heagerty 1999 and Wang & Louis 2003 as well as others and generalizes the MRIM method. One of the most interesting generalizations is allowing the logistic CDF (equivalently, logit link) in both the marginal and conditional models to instead be a stable distribution that approximates a logistic CDF. Its code is implemented with gnlmix4MMM.R (in the repo) which uses rmutil and repeated R packages. I denote this SSS in the R script with comments to indicate stable (S) on the marginal, stable (S) on the conditional scale and a stable (S) distributed intercept. It is included in the R script but not detailed in this post on SO.
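To make the "marginal logit, conditional probit" idea of LPN concrete, here is a small numerical check in base R. It takes the conditional mean from the `mu` formula used for LPN in this post, with `sigma2` playing the role of exp(pmix), and integrates it over a normal random intercept; the particular values of `eta` and `sigma2` are arbitrary assumptions. The result should equal the logistic function of the marginal linear predictor.

```r
eta    <- 0.7   # marginal linear predictor, a0 + a1 * power
sigma2 <- 2     # random-intercept variance, exp(pmix) in the LPN code

# Conditional (probit-scale) success probability given random intercept u
cond_prob <- function(u) pnorm(qnorm(plogis(eta)) * sqrt(1 + sigma2) + u)

# Average over u ~ N(0, sigma2) to get the marginal probability
marginal <- integrate(function(u) cond_prob(u) * dnorm(u, sd = sqrt(sigma2)),
                      lower = -Inf, upper = Inf, rel.tol = 1e-8)$value

c(marginal = marginal, target = plogis(eta))
```

The two numbers agree because averaging pnorm(a + u) over u ~ N(0, sigma2) gives pnorm(a / sqrt(1 + sigma2)), so the sqrt(1 + sigma2) scaling cancels and a marginal logit remains.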
Prep
#code from OP Question: edit `data` to `d`
require(geepack)
d = read.csv(url("http://folk.uio.no/mariujon/data.csv"))
fit = geeglm(moden ~ 1 + power, id = defacto, data=d, corstr = "exchangeable", family=binomial)
summary(fit)
plot(moden ~ power, data=d)
x = 0:2500
y = predict(fit, newdata=data.frame(power = x), type="response" )
lines(x,y)
#get some starting values from glm():
strt <- coef(glm(moden ~ power, family = binomial, data=d))
strt
#I'm so sorry but these methods use attach()
attach(d)
L_N Heagerty (1999)
# marginally specifies a logit link and has a nonlinear conditional model
# the following code will not run if lnMLE is not successfully installed.
# See https://faculty.washington.edu/heagerty/Software/LDA/MLV/
library(lnMLE)
L_N <- logit.normal.mle(meanmodel = moden ~ power,
logSigma= ~1,
id=defacto,
model="marginal",
data=d,
beta=strt,
r=10)
print.logit.normal.mle(L_N)
Prep for LLB and LPN
library("gnlm")
library("repeated")
source("gnlmix4MMM.R") ## see ?gnlmix; in GITHUB repo
y <- cbind(d$moden,(1-d$moden))
LLB Wang and Louis (2003)
LLB <- gnlmix4MMM(y = y,
distribution = "binomial",
mixture = "normal",
random = "rand",
nest = defacto,
mu = ~ 1/(1+exp(-(a0 + a1*power)*sqrt(1+3/pi/pi*exp(pmix)) - sqrt(1+3/pi/pi*exp(pmix))*log(sin(pi*pnorm(rand/sqrt(exp(pmix)))/sqrt(1+3/pi/pi*exp(pmix)))/sin(pi*(1-pnorm(rand/sqrt(exp(pmix))))/sqrt(1+3/pi/pi*exp(pmix)))))),
pmu = c(strt, log(1)),
pmix = log(1))
print("code: 1 -best 2-ok 3,4,5 - problem")
LLB$code
print("coefficients")
LLB$coeff
print("se")
LLB$se
LPN Caffo and Griswold (2006)
LPN <- gnlmix4MMM(y = y,
distribution = "binomial",
mixture = "normal",
random = "rand",
nest = defacto,
mu = ~pnorm(qnorm(1/(1+exp(-a0 - a1*power)))*sqrt(1+exp(pmix)) + rand),
pmu = c(strt, log(1)),
pmix = log(1))
print("code: 1 -best 2-ok 3,4,5 - problem")
LPN$code
print("coefficients")
LPN$coeff
print("se")
LPN$se
coefficients from 3 approaches:
rbind("L_N"=L_N$beta, "LLB" = LLB$coefficients[1:2], "LPN"=LPN$coefficients[1:2])
max log likelihood for 3 models:
rbind("L_N"=L_N$logL, "LLB" = -LLB$maxlike, "LPN"=-LPN$maxlike)
