emmeans: regrid() for binomial GLMM with user-defined link function

I have fitted a binomial GLMM in R with a modified link function with a fixed guessing probability as suggested in this thread - except that the guessing probability is 1/2 and not 1/3. Therefore the sigmoidal activation in my case becomes:
P(correct) = 0.5 + 0.5*(exp(term)/(1 + exp(term))).
My model looks like this:
library(lme4)
m = 2
mod = glmer(correct ~ group*stim_strength + (stim_strength|subject),
            family = binomial(link = mafc.logit(m)), data = obs_data)
where: guessing probability is 1/m; correct is a categorical variable indicating correct/incorrect response; group is a factor with two levels; stim_strength is numerical with values in [0,1]; mafc.logit is the function suggested in the thread.
I'm essentially fitting separate psychometric curves of the stimulus strength (stim_strength) for the two groups, while taking into account the inter-subject fluctuations in slope and intercept (random effect structure (stim_strength|subject)).
This is what I get:
plot_model(mod, type = 'emm', terms = c('stim_strength', 'group'))
(plot of the estimated marginal means of %correct against stim_strength, by group, omitted)
The model describes the data nicely, and I now want to perform some post-hoc analyses on it. Specifically, I want to run for example:
mod.emm = emmeans(mod, ~group|stim_strength, at=list(stim_strength=c(.25,.75)))
confint(regrid(mod.emm))
contrast(regrid(mod.emm), 'pairwise', simple = 'group', combine = TRUE, adjust = 'holm')
i.e. compute confidence intervals for the %correct of the two groups at some specified values of stim_strength, and compare the %correct of the two groups at these values.
Note that I'm using regrid(), because I want the analyses to be done on the back-transformed values, not on the linear part of the model!
However, regrid() won't work with a user-defined link function. In fact, regrid() is simply ignored here, as you can see e.g. from the output of the confint() call above (the estimates are labelled prob, but they are clearly still on the link scale rather than transformed to [.5, 1]):
stim_strength = 0.25:
 group   prob    SE  df asymp.LCL asymp.UCL
 1     -1.329 0.173 Inf    -1.716    -0.942
 2     -0.553 0.161 Inf    -0.913    -0.192

stim_strength = 0.75:
 group   prob    SE  df asymp.LCL asymp.UCL
 1      1.853 0.372 Inf     1.018     2.687
 2      3.375 0.395 Inf     2.489     4.261
Similarly, when adding type='response' in emmeans, I get the message:
Unknown transformation "mafc.logit(2)": no transformation done
Any workaround?
Thanks!

Looking at the linked suggestion, it appears that mafc.logit() is a function that returns a list with all the information needed to implement the transform. All you need to do is update the emmGrid object with that information:
mod.emm <- update(mod.emm, tran = mafc.logit(2))
confint(regrid(mod.emm), adjust = 'holm')
# etc...
See, for example, this vignette section and possibly other parts of that vignette.
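If you prefer not to call mafc.logit() again (or it is not available when you post-process the fit), a hand-rolled transformation list works as well. This is a minimal sketch for a guessing probability of 1/2; the list mirrors the structure returned by stats::make.link(), which is the form emmeans accepts for tran:
## hand-rolled 2AFC logit link: P(correct) = 0.5 + 0.5 * plogis(eta)
mafc_logit_2 <- list(
  linkfun = function(mu) qlogis((mu - 0.5) / 0.5),  # response scale -> linear predictor
  linkinv = function(eta) 0.5 + 0.5 * plogis(eta),  # linear predictor -> response scale
  mu.eta  = function(eta) 0.5 * dlogis(eta),        # derivative of linkinv (for delta-method SEs)
  name    = "mafc.logit(2)"
)
mod.emm <- update(mod.emm, tran = mafc_logit_2)
confint(regrid(mod.emm))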

Related

Question regarding LASSO confidence intervals using selectiveinference package in R

I want to get confidence intervals for LASSO regression. For this, I used the selectiveInference package in R.
The fixedLassoInf function in this package provides confidence intervals for lasso regression at a given value of lambda. We can also pass the coefficient vector obtained from the glmnet package to this function.
The coefficients for LASSO logistic regression at a given lambda, obtained using the glmnet package, are as follows:
require(ISLR)
require(glmnet)
require(selectiveInference)
y1 <- Default$default
x1 <- model.matrix(default ~ student + balance + income + student*income, Default)[, -1]
lasso.mod1 <- glmnet(x1,y1, alpha = 1, lambda = 0.0003274549,family='binomial')
lasso.mod1$beta
> lasso.mod1$beta
4 x 1 sparse Matrix of class "dgCMatrix"
                              s0
studentYes         -6.131640e-01
balance             5.635401e-03
income              2.429232e-06
studentYes:income   .
Then I used the fixedLassoInf function in the selectiveInference package to get the confidence intervals:
y1 <- Default$default
beta = coef(lasso.mod1, x=x1, y=y1, s=lambda/1000, exact=T)
y1= ifelse(y1=="NO",0,1)
out = fixedLassoInf(x1,(y1),beta,lambda,family="binomial",alpha=0.05)
out
However, I am getting following Warning messages:
Warning messages:
1: In fixedLogitLassoInf(x, y, beta, lambda, alpha = alpha, type = "partial", :
   Solution beta does not satisfy the KKT conditions (to within specified tolerances)
2: In fixedLogitLassoInf(x, y, beta, lambda, alpha = alpha, type = "partial", :
   Solution beta does not satisfy the KKT conditions (to within specified tolerances). You might try rerunning glmnet with a lower setting of the 'thresh' parameter, for a more accurate convergence.
3: glm.fit: algorithm did not converge
Also, the output I am getting does not look correct:
Call:
fixedLassoInf(x = x1, y = (y1), beta = beta, lambda = lambda,
family = "binomial", alpha = 0.05)
Testing results at lambda = 0.000, with alpha = 0.050
 Var      Coef   Z-score P-value LowConfPt UpConfPt LowTailArea UpTailArea
   1  1142.801  1884.776       1      -Inf  -60.633           0          0
   2     0.386  1664.734       0     0.023      Inf           0          0
   3     0.029  3318.110       0     0.001      Inf           0          0
   4    -0.029 -1029.985       1      -Inf   -0.003           0          0
Based on the warning messages, there is a problem with the Karush-Kuhn-Tucker (KKT) conditions.
Can anyone help me to figure this out?
Thank you.
One of my university teachers always said
Fitting is an art, not a technique.
What I mean: expect that you will need manual work for parameter guessing and multiple iterations of fitting. You might even question the method of fitting itself, but let's not go down that path.
Anyhow, R will not do the magic of finding the correct model (here: the number of parameters for the LASSO) for you. From the output you show, you seem to have 4 variables, of which 3 are close to zero, so I suggest starting with:
Bounding the maximal number of variables in the model, e.g. dfmax=2, seems a good start
Limiting the maximum number of variables ever to be nonzero, e.g. pmax=2
The glmnet documentation gives further details on these and other options; a minimal sketch is below.
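A minimal sketch of those two constraints (the cutoff of 2 is illustrative only; x1 refers to the design matrix defined at the top of the question):
library(glmnet)
y1 <- Default$default   # the original factor response
# Cap the model size: dfmax stops the path once more than 2 variables have entered,
# pmax never allows more than 2 coefficients to be nonzero.
lasso.mod2 <- glmnet(x1, y1, alpha = 1, family = "binomial", dfmax = 2, pmax = 2)
lasso.mod2$beta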

Set contrasts in glm

I have binomial count data, coming from a set of conditions, that are overdispersed. To simulate them I use the beta-binomial distribution implemented by the rbetabinom function of the emdbook R package:
library(emdbook)
set.seed(1)
df <- data.frame(p = rep(runif(3,0,1)),
                 n = as.integer(runif(30,100,200)),
                 theta = rep(runif(3,1,5)),
                 cond = rep(LETTERS[1:3],10),
                 stringsAsFactors=F)
df$k <- sapply(1:nrow(df), function(x) rbetabinom(n=1, prob=df$p[x], size=df$n[x], theta=df$theta[x], shape1=1, shape2=1))
I want to find the effect of each condition (cond) on the counts (k).
I think the glm.nb model of the MASS R package allows modelling that:
library(MASS)
fit <- glm.nb(k ~ cond + offset(log(n)), data = df)
My question is how to set the contrasts such that I get the effect of each condition relative to the mean effects over all conditions rather than relative to the dummy condition A?
Two things: (1) if you want contrasts relative to the mean, use contr.sum rather than the default contr.treatment; (2) you probably shouldn't fit beta-binomial data with a negative binomial model; use a beta-binomial model instead (e.g. via VGAM or bbmle)!
library(emdbook)
set.seed(1)
df <- data.frame(p = rep(runif(3,0,1)),
                 n = as.integer(runif(30,100,200)),
                 theta = rep(runif(3,1,5)),
                 cond = rep(LETTERS[1:3],10),
                 stringsAsFactors=FALSE)
## slightly abbreviated
df$k <- rbetabinom(n=nrow(df), prob=df$p,
                   size=df$n, theta=df$theta, shape1=1, shape2=1)
With VGAM:
library(VGAM)
## note dbetabinom/rbetabinom from emdbook are masked
options(contrasts=c("contr.sum","contr.poly"))
vglm(cbind(k,n-k)~cond,data=df,
family=betabinomialff(zero=2)
## hold shape parameter 2 constant
)
## Coefficients:
## (Intercept):1 (Intercept):2 cond1 cond2
## 0.4312181 0.5197579 -0.3121925 0.3011559
## Log-likelihood: -147.7304
Here the intercept is the mean shape parameter across the levels; cond1 and cond2 are the differences of levels 1 and 2 from the mean (this doesn't directly give you the difference of level 3 from the mean, but by construction it should be -(cond1+cond2); a sketch of computing it is below).
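For example, assuming the vglm() call above is assigned to an object, say fit_vglm (a name chosen here for illustration), the level-3 deviation can be recovered from the sum-to-zero constraint:
fit_vglm <- vglm(cbind(k, n - k) ~ cond, data = df,
                 family = betabinomialff(zero = 2))
## deviation of level 3 ("C") from the mean, implied by the sum-to-zero coding
cond3 <- -(coef(fit_vglm)["cond1"] + coef(fit_vglm)["cond2"])
cond3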
I find the parameterization with bbmle (with logit-probability and dispersion parameter) a little easier:
detach("package:VGAM")
library(bbmle)
mle2(k~dbetabinom(k, prob=plogis(lprob),
size=n, theta=exp(ltheta)),
parameters=list(lprob~cond),
data=df,
start=list(lprob=0,ltheta=0))
## Coefficients:
## lprob.(Intercept) lprob.cond1 lprob.cond2 ltheta
## -0.09606536 -0.31615236 0.17353311 1.15201809
##
## Log-likelihood: -148.09
The log-likelihoods are about the same (the VGAM parameterization is a bit better); in theory, if we allowed both shape1 and shape2 (VGAM) or lprob and ltheta (bbmle) to vary across conditions, we'd get the same log-likelihoods for both parameterizations.
Effects must be estimated relative to some base level; an "effect" of having any one of the 3 conditions, estimated for all of them at once, would be indistinguishable from the constant in the regression.
Since the intercept is the expected value when the dummies for both estimated levels (i.e. "B" and "C") are 0, it is the mean value for the reference group (i.e. "A").
Therefore, you basically already have this information in your model, or at least as close to it as you can get.
The mean value of a comparison group is the intercept plus that group's coefficient. Each comparison group's coefficient therefore gives you the effect of that group (each level of your categorical variable being a dummy variable that equals 1 when that level is present) relative to the reference group.
So your results give you the means and relative effects of each level. You can of course switch out the reference level according to your preference (a sketch is below).
That should hopefully give you all the information you need. If not, then you need to ask yourself precisely what information you are after.
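For instance, a minimal sketch of switching the reference level in the model from the question (choosing "B" as the new reference is purely illustrative):
df$cond <- relevel(factor(df$cond), ref = "B")         # make "B" the reference level
fit_B <- glm.nb(k ~ cond + offset(log(n)), data = df)  # coefficients are now relative to "B"
summary(fit_B)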

R - 2x2 mixed ANOVA with repeated measures simple effect analysis

I would like to ask how to perform a simple main effect analysis in R correctly when there is an interaction effect between the Group and Stage variables.
One of my friends did the same analysis in SPSS (using Bonferroni correction) and I am trying to reproduce his results in R.
I have a data set with the following structure:
ID  Group  Stage      Y
 1  I      pre    0.123
 1  I      post   0.453
 2  II     pre    0.676
 2  II     post   0.867
 3  I      pre    0.324
 3  I      post   0.786
 4  II     pre    0.986
 4  II     post   0.112
...
This is a 2x2 mixed ANOVA design (1 between-subject variable 'Group', 1 within-subject variable 'Stage', which constitutes a repeated measure of the dependent variable Y).
I analysed it using ezANOVA function:
ezANOVA(data = dat, dv = y, wid = ID, between = Group, within = Stage, detailed = TRUE, type = "III")
I found a significant Stage*Group interaction, so I have to determine simple effects using Bonferroni correction. I tried to do that in several ways. For example, to test for a difference between the levels of Stage within group I, I tried:
dataControl <- subset(dat, Group == "control" )
ezANOVA(data = dataControl, dv = y, wid = ID, within = Stage, detailed = TRUE, type = "III")  # method 1
aov(y ~ Stage + Error(ID/Stage), data = dataControl)  # method 2
t.test(y ~ Stage, data = dataControl, paired = TRUE)  # method 3
But every method gave me a different p-value, and none of these p-values matched those calculated with SPSS. Interestingly, the main-effect p-values and other calculations gave the same results in SPSS and R, so I conclude that I am using the wrong method for the simple main effect analysis.
I would be very thankful if you could help me.
If you want R to give you the same numbers as SPSS, do this:
#pairwise comparisons
library(asbio)
bonf <- pairw.anova(data$dv, data$group, method="bonf") #also try "tukey" or "lsd"
print(bonf)
#plot(bonf) #can plot the CFs
This will give you the test statistic, mean difference, lower and upper confidence bounds, decision, and Bonferroni-adjusted p-value for each pairwise comparison (a sketch of mapping this onto your column names is below).
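As a rough sketch of how this maps onto the column names in your data (ID, Group, Stage, Y, with dat being the data frame you passed to ezANOVA), you could compare all four Group x Stage cell means at once; note that, like the snippet above, this treats the cells as independent groups:
library(asbio)
cells <- interaction(dat$Group, dat$Stage)                   # I.pre, I.post, II.pre, II.post
bonf  <- pairw.anova(y = dat$Y, x = cells, method = "bonf")  # Bonferroni-adjusted pairwise comparisons
print(bonf)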

Cubic spline method for longitudinal series data?

I have serial data formatted as follows:
time milk Animal_ID
30 25.6 1
31 27.2 1
32 24.4 1
33 17.4 1
34 33.6 1
35 25.4 1
33 29.4 2
34 25.4 2
35 24.7 2
36 27.4 2
37 22.4 2
80 24.6 3
81 24.5 3
82 23.5 3
83 25.5 3
84 24.4 3
85 23.4 3
. . .
In general, about 300 animals have milk records at different time points over a short period. If we pool their data together and ignore the different Animal_IDs, we get a curve of milk ~ time like the line in the figure below:
The figure also shows the data for one example animal: the individual series are short and highly variable. My purpose is to smooth each animal's data, but it would be good if the model also allowed the general pattern learned from the whole data set to be included. I used different smoothers (ns, bs, smooth.spline) with the following format, but it just did not work:
mod <- lme(milk ~ bs(time, df=3), data=dat, random = ~1|Animal_ID)
I am hoping that somebody who has already dealt with this problem can give me some advice. Thanks.
The full dataset can be accessed from here:
https://www.dropbox.com/s/z9b5teh3su87uu7/dat.txt?dl=0
I would suggest you use the mgcv package. This is one of the recommended R packages and fits a class of models called generalized additive mixed models. You can simply load it with library(mgcv). It is a very powerful library, which can handle everything from the simplest linear regression model, to generalized linear models, to additive models, to generalized additive models, as well as models with mixed effects (fixed effects + random effects). You can list all (exported) functions of mgcv via
ls("package:mgcv")
And you can see there are many of them.
For your specific data and problem, you may use a model with formula:
model <- milk ~ s(time, bs = 'cr', k = 100) + s(Animal_ID, bs = 're')
In mgcv, s() sets up a smooth function, represented by the spline basis implied by bs. "cr" is the cubic regression spline basis, which is exactly what you want. k is the number of knots. It should be chosen depending on the number of unique values of the variable time in your data set. If you set k to exactly this number, you end up with a smoothing spline, while any smaller value gives a regression spline. However, both will be penalized (if you know what penalization means). I read your data in:
dat <- na.omit(read.csv("data.txt", header = TRUE))  ## I saved your data into file "data.txt"
dat$Animal_ID <- factor(dat$Animal_ID)
nrow(dat) ## 12624 observations
length(unique(dat$time)) ## 157 unique time points
length(ID <- levels(dat$Animal_ID)) ## 355 cows
There are 157 unique values, so I reckon k = 100 is possibly appropriate.
For Animal_ID (coerced to a factor), we need a model term for the random effect. "re" is a special class for i.i.d. random effects. It is passed to bs for internal matrix construction reasons (so this is not a smooth function!).
Now, to fit a GAM, you can call the legacy gam or the constantly developing bam (gam for big data). I think you will want the latter. They have a calling convention similar to lm and glm. For example, you can do:
fit <- bam(model, data = dat, family = "gaussian", discrete = TRUE, nthreads = 2)
As you can see, bam allows multi-core parallel computation via nthreads, while discrete is a newly developed feature which can speed up matrix formation.
Since you are dealing with time series data, you might finally consider some temporal autocorrelation. mgcv allows an AR1 correlation structure, whose correlation coefficient is passed via the bam argument rho. However, you also need an index, AR.start, to tell mgcv how the time series breaks up into pieces: when a new Animal_ID is reached, AR.start gets a TRUE to indicate a new segment of the time series. See ?bam for details; a sketch follows.
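A sketch of what that might look like (the value rho = 0.5 is purely illustrative; you would normally estimate it, e.g. from the residual ACF of an initial fit):
## sort so each cow's series is contiguous, then flag the first row of each cow
dat <- dat[order(dat$Animal_ID, dat$time), ]
dat$ar.start <- !duplicated(dat$Animal_ID)
fit_ar <- bam(milk ~ s(time, bs = "cr", k = 100) + s(Animal_ID, bs = "re"),
              data = dat, rho = 0.5, AR.start = dat$ar.start)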
mgcv also provides
summary.gam function for model summary
gam.check for basic model checking
plot.gam function for plotting individual terms
predict.gam (or predict.bam) for prediction on new data.
For example, the summary of the above suggested model is:
> summary(fit)
Family: gaussian
Link function: identity
Formula:
milk ~ s(time, bs = "cr", k = 100) + s(Animal_ID, bs = "re")
Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  26.1950     0.2704   96.89   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
                edf Ref.df       F  p-value
s(time)       10.81  13.67   5.908 1.99e-11 ***
s(Animal_ID) 351.43 354.00 136.449  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.805 Deviance explained = 81.1%
fREML = 29643 Scale est. = 5.5681 n = 12624
The edf (effective degrees of freedom) may be thought of as a measure of the degree of non-linearity. We put in k = 100 but ended up with edf = 10.81, which suggests that the spline s(time) has been heavily penalized. You can view what s(time) looks like with:
plot.gam(fit, pages = 1)
Note that the random effect s(Animal_ID) also has a "smooth": a cow-specific constant. For random effects, a Gaussian QQ plot will be returned.
The diagnostic figures returned by
invisible(gam.check(fit))
look OK, so I think the model is acceptable (I am not offering model selection here, so think up a better model if you believe there is one).
If you want to make prediction for Animal_ID = 26, you may do
newd <- data.frame(time = 1:150, Animal_ID = 26)
oo <- predict.gam(fit, newd, type = "link", se.fit = TRUE)
Note that
You need to include both variables in newd (otherwise mgcv complains about a missing variable).
Since you have only one spline smooth, s(time), and the random effect term s(Animal_ID) is a constant per Animal_ID, it is OK to use type = 'link' for individual prediction. By the way, type = 'terms' is slower than type = 'link'.
If you want to make prediction for more than one cows, try something like this:
pred.ID <- ID[1:10]  ## predict the first 10 cows
newd <- data.frame(time = rep(1:150, times = length(pred.ID)),
                   Animal_ID = factor(rep(pred.ID, each = 150)))
oo <- predict.bam(fit, newd, type = "link", se.fit = TRUE)
Note that
I have used predict.bam here, as we now have 150 * 10 = 1500 data points to predict, and we also require se.fit = TRUE. This is rather expensive, so predict.bam is faster than predict.gam. In particular, if you have fitted your model using bam(..., discrete = TRUE), you can use predict.bam(..., discrete = TRUE). Prediction goes through the same matrix formation steps as in model fitting (see ?smoothCon, used in model fitting, and ?PredictMat, used in prediction, if you are keen to know more about the internal structure of mgcv).
I specified Animal_ID as a factor, because it is a random effect.
For more on mgcv, you can refer to the library manual. Check especially ?mgcv, ?gam, ?bam and ?s.
Final update
Though I said that I would not help you with model selection, I think this model is better (it gives a higher adjusted R-squared) and is also more reasonable in the following sense:
model <- milk ~ s(time, bs = 'cr', k = 20) + s(Animal_ID, bs = 're') + s(Animal_ID, time, bs = 're')
The last term imposes a random slope. This means we are assuming that each individual cow has a different growing/declining pattern of milk production, which is a more sensible assumption for your problem; the earlier model with only a random intercept is not sufficient. After adding this random slope, the smooth term s(time) looks smoother. This is a good sign, not a bad sign, because we want a simple explanation for s(time), don't we? Compare the s(time) you get from both models, and see what you discover (a sketch of the comparison is below).
I have also reduced k = 100 to k = 20. As we saw in the previous fit, the edf for this term is about 10, so k = 20 is quite sufficient.
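A sketch of that comparison (fit is the earlier random-intercept model; fit2 is just a name chosen here for the new fit):
fit2 <- bam(milk ~ s(time, bs = "cr", k = 20) + s(Animal_ID, bs = "re") +
              s(Animal_ID, time, bs = "re"),
            data = dat, family = "gaussian", discrete = TRUE, nthreads = 2)
summary(fit2)               # compare the adjusted R-squared with summary(fit)
par(mfrow = c(1, 2))
plot.gam(fit,  select = 1)  # s(time) from the random-intercept model
plot.gam(fit2, select = 1)  # s(time) from the random-intercept + random-slope model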

Anova Type 2 and Contrasts

The study design of the data I have to analyse is simple. There is 1 control group (CTRL) and 2 different treatment groups (TREAT_1 and TREAT_2). The data also include 2 covariates, COV1 and COV2. I have been asked to check whether there is a linear or quadratic treatment effect in the data.
I created a dummy data set to explain my situation:
df1 <- data.frame(
  Observation = c(rep("CTRL",15), rep("TREAT_1",13), rep("TREAT_2", 12)),
  COV1 = c(rep("A1", 30), rep("A2", 10)),
  COV2 = c(rep("B1", 5), rep("B2", 5), rep("B3", 10), rep("B1", 5), rep("B2", 5), rep("B3", 10)),
  Variable = c(3944133, 3632461, 3351754, 3655975, 3487722, 3644783, 3491138, 3328894,
               3654507, 3465627, 3511446, 3507249, 3373233, 3432867, 3640888,
               3677593, 3585096, 3441775, 3608574, 3669114, 4000812, 3503511, 3423968,
               3647391, 3584604, 3548256, 3505411, 3665138,
               4049955, 3425512, 3834061, 3639699, 3522208, 3711928, 3576597, 3786781,
               3591042, 3995802, 3493091, 3674475)
)
plot(Variable ~ Observation, data = df1)
As you can see from the plot, there is a linear relationship between the control and the treatment groups. To check whether this linear effect is statistically significant, I change the contrasts using the contr.poly() function and fit a linear model like this:
contrasts(df1$Observation) <- contr.poly(levels(df1$Observation))
lm1 <- lm(log(Variable) ~ Observation, data = df1)
summary.lm(lm1)
From the summary we can see that the linear effect is statistically significant:
              Estimate Std. Error t value Pr(>|t|)
Observation.L 0.029141   0.012377   2.355    0.024 *
Observation.Q 0.002233   0.012482   0.179    0.859
However, this first model does not include any of the two covariates. Including them results in a non-significant p-value for the linear relationship:
lm2 <- lm(log(Variable) ~ Observation + COV1 + COV2, data = df1)
summary.lm(lm2)
              Estimate Std. Error t value Pr(>|t|)
Observation.L  0.04116    0.02624   1.568    0.126
Observation.Q  0.01003    0.01894   0.530    0.600
COV1A2        -0.01203    0.04202  -0.286    0.776
COV2B2        -0.02071    0.02202  -0.941    0.354
COV2B3        -0.02083    0.02066  -1.008    0.320
So far so good. However, I have been told to conduct a Type II Anova rather than Type I. To conduct a Type II Anova I used the Anova() function from the car package.
Anova(lm2, type="II")
Anova Table (Type II tests)
Response: log(Variable)
             Sum Sq Df F value Pr(>F)
Observation 0.006253  2  1.4651 0.2453
COV1        0.000175  1  0.0820 0.7763
COV2        0.002768  2  0.6485 0.5292
Residuals   0.072555 34
The problem with using Type II here is that you do not get p-values for the linear and quadratic effects, so I do not know whether the effect is statistically linear and/or quadratic.
I found out that the following code produces the same p-value for Observation as the Anova() function. But the result also does not include any p-values for the linear or quadratic effect:
lm2 <- lm(log(Variable) ~ Observation + COV1 + COV2, data = df1)
lm3 <- lm(log(Variable) ~ COV1 + COV2, data = df1)
anova(lm2, lm3)
Does anybody know how to combine a Type II ANOVA with the contrasts function to obtain the p-values for the linear and quadratic effects?
Help would be very much appreciated.
Best
Peter
I found one partial workaround for this, but it may require further correction. The documentation for the drop1() function from the stats package indicates that it produces Type II sums of squares (although this page, http://www.statmethods.net/stats/anova.html, declares that drop1() produces Type III sums of squares), and I didn't spend too much time poring over this (http://afni.nimh.nih.gov/sscc/gangc/SS.html) to cross-check the sums of squares calculations. You could use it to calculate everything manually, but I suspect you're asking this question because it would be nice if someone had already worked through it.
Anyway, I added a second vector to the dummy data called Observation2, and set it up with just the linear contrasts (you can only specify one set of contrasts for a given vector at a given time):
df1[,"Observation2"]<-df1$Observation
contrasts(df1$Observation2, how.many=1)<-contr.poly
Then created a third linear model:
lm3<-lm(log(Variable)~Observation2+COV1+COV2, data=df1)
And conducted F tests with drop1 to compare F statistics from Type II ANOVAs between the two models:
lm2, which contains both the linear and quadratic terms:
drop1(lm2, test="F")
lm3, which contains just the linear contrasts:
drop1(lm3, test="F")
This doesn't include a direct comparison of the models against each other, although the F statistic is higher (and the p-value accordingly lower) for the linear model, which would lead one to rely on it instead of the quadratic model (a sketch of a direct comparison is below).
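If you do want a direct comparison, the two codings are nested (lm3 drops the quadratic contrast that lm2 contains), so a sketch of an extra-sum-of-squares F test for the quadratic term would be:
## F test for the quadratic contrast: linear-only coding (lm3) vs. linear + quadratic (lm2)
anova(lm3, lm2)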
