I have defined a binary response mixed effects model using the R function glmer as follows:
fit <- glmer(binary_r ~ cat1 + (1 | SUBJECTIDf) + (1 | cat2) + (1 | cat1:cat2),
             family = binomial("logit"), data = mydata)
where cat1 and cat2 are categorical variables and SUBJECTIDf denotes the factor variable tagging the individual subjects of the study. Also, SUBJECTIDf and cat2 are cross-classified factors.
I would like to do the following for the above model:
1. create a table which presents the probability of a positive response for each combination of categories pertaining to cat1 and cat2;
2. create a plot (possibly a caterpillar plot) which displays the probabilities defined under 1.; and
3. create summary statistics for the probabilities defined under 1., including the minimum and maximum probabilities, across all combinations of categories pertaining to cat1 and cat2.
I am assuming that, to achieve the above, the individual probabilities defined under 1. would be formed as averages or medians across all subjects for a given combination of cat1 and cat2.
I would be most grateful for advice on how to proceed.
Since you didn't give a reproducible example, I'm going to simulate one; this part of the answer only sets up an example data set.
## crossed cat2 and SUBJECTIDf
dd <- expand.grid(cat2=factor(letters[1:10]),
SUBJECTIDf=factor(1:10))
## each subject gets one value of cat1:
## for example, half get A while half get B
cat_tab <- data.frame(SUBJECTIDf=factor(1:10),cat1=rep(c("A","B"),5))
dd <- merge(dd,cat_tab)
I'm having a little trouble accommodating the design consideration that cat1 and cat2 are not cross-classified. I'll just knock out some categories:
dd <- with(dd,dd[!(cat1=="A" & cat2 %in% c("a","b","c")),])
Now that we have the design set up, we simulate response values:
library(lme4)
form <- binary_r ~ cat1 + (1 | SUBJECTIDf) + (1 | cat2) + (1 | cat1:cat2)
dd$binary_r <- simulate(form[-2], ## RHS only
family=binomial,
newdata=dd,
newparams=list(beta=0:1,
theta=c(2,4,1)),
seed=101)[[1]]
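A quick sanity check on the simulated response:
table(dd$binary_r)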
At this point we come in with the model fit you've suggested above.
fit <- glmer(form, family = binomial, data=dd)
create a table which presents the probability of a positive response for each combination of categories pertaining to cat1 and cat2;
By setting re.form below to exclude subject ID, we are implicitly computing the value for the hypothetical median individual (i.e., random effect set to zero; mean and median predictions coincide on the logit scale, but not once we back-transform to the probability scale).
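As a quick illustration of that last point (with made-up logit values, purely for exposition): back-transforming the mean logit is not the same as averaging the back-transformed probabilities.
logits <- c(-1, 0, 3)   ## hypothetical subject-level linear predictors
plogis(mean(logits))    ## probability at the mean/median logit, approx. 0.66
mean(plogis(logits))    ## average of the subject-level probabilities, approx. 0.57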
Get unique combinations of cat1 and cat2 found in the data:
newdd <- unique(dd[,c("cat1","cat2")])
newdd$SUBJECTIDf <- NA ## need to have SUBJECTIDf in the data frame ...
t1 <- predict(fit,newdata=newdd,type="response",
re.form=~(1|cat2)+(1|cat1:cat2))
newdd <- data.frame(newdd[,c("cat1","cat2")],pred=t1)
head(newdd)
## cat1 cat2 pred
## 4 A d 0.215336024
## 5 A e 0.944897414
## 6 A f 0.036751551
## 7 A g 0.003819873
## 8 A h 0.970115614
## 9 A i 0.003819873
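If a two-way layout is preferred, here is a small sketch using base tapply() (combinations that were knocked out of the design show up as NA):
with(newdd, round(tapply(pred, list(cat1, cat2), mean), 3))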
We could also compute predictions for all individuals as follows:
## we happen to have a factorial design, but expand.grid() would
## e.g. fill in missing values
newdd2 <- unique(dd[,c("cat1","cat2","SUBJECTIDf")])
t2 <- predict(fit,newdata=newdd2,type="response",
re.form=NULL)
newdd2$pred <- t2
head(newdd2)
In order to create the plot (see below) we would have to summarize across individuals within category combinations.
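For example, a sketch of that aggregation using means (use FUN = median for medians, matching the "averages or medians" mentioned in the question):
agg <- aggregate(pred ~ cat1 + cat2, data = newdd2, FUN = mean)
head(agg)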
create a plot (possibly a caterpillar plot) which displays the probabilities defined under 1.;
library(ggplot2); theme_set(theme_bw())
ggplot(newdd,aes(cat2,pred,colour=cat1))+
geom_point()+scale_colour_brewer(palette="Set1")
Or, aggregating the subject-specific predictions:
ggplot(newdd2,aes(cat2,pred,colour=cat1))+
stat_summary(fun=mean,geom="point")+
scale_colour_brewer(palette="Set1")
We could use reorder() on the cat2 categories to try to get a more sensible order, but since there is a cat1:cat2 interaction, that might not work too well. Caterpillar plots (i.e. getting uncertainties on the predictions) are a little trickier, because of the difficulty of getting uncertainty on predictions that combine uncertainty in the conditional modes (the values of the individual random effects) and in the fixed effects. This can be done by (1) assuming the conditional modes and fixed effects are independent or (2) parametric bootstrapping (bootMer), but both are a little more trouble than I'm willing to take at the moment ...
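For completeness, here is a rough, untested sketch of option (2) using bootMer(); nsim is kept small for speed, use.u = TRUE conditions on the fitted random effects, and the prediction frame reuses the variables from the example above:
pframe <- data.frame(newdd[, c("cat1", "cat2")], SUBJECTIDf = NA)
pfun <- function(m) predict(m, newdata = pframe, type = "response",
                            re.form = ~ (1 | cat2) + (1 | cat1:cat2))
bb <- bootMer(fit, FUN = pfun, nsim = 100, use.u = TRUE, seed = 101)
## 95% percentile intervals for each cat1/cat2 combination
ci <- t(apply(bb$t, 2, quantile, c(0.025, 0.975), na.rm = TRUE))
newdd$lwr <- ci[, 1]
newdd$upr <- ci[, 2]
## caterpillar-style plot: cells ordered by predicted probability
ggplot(newdd, aes(pred, reorder(interaction(cat1, cat2), pred), colour = cat1)) +
  geom_pointrange(aes(xmin = lwr, xmax = upr)) +
  scale_colour_brewer(palette = "Set1") +
  labs(x = "predicted probability", y = "cat1:cat2")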
create summary statistics for the probabilities defined under 1., including the minimum and maximum probabilities, across all combinations of categories pertaining to cat1 and cat2.
This doesn't really make sense to me unless we are going the disaggregated route. If we have predicted for each combination of cat1 and cat2, then we only have a single value for each combination (i.e., no "min/max" probabilities). Aggregating is easy in base R, e.g.
aggregate(pred~cat1:cat2,data=newdd2,
FUN=function(x) c(min=min(x),max=max(x)))
or in the tidyverse:
library(dplyr)
newdd2 %>% group_by(cat1,cat2) %>%
summarise(min=min(pred),max=max(pred))
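If instead the summary is meant to be taken over the one-value-per-cell probabilities in newdd (i.e. across cat1/cat2 combinations rather than across subjects within them), base summary() is enough:
summary(newdd$pred)
range(newdd$pred)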
Related
How can I calculate the estimate of factor loading for latent variables between different data frames?
I have 3 data frames related to 1 latent variable (the same variables A B C D) but in different intervals.
dataframe1 (100-120 days). (1550 records)
dataframe2 (120-150 days). (1780 records)
dataframe3 (180-250 days). (1670 records)
library(lavaan)
model1 <- 'latent_variable1 =~ A + B + C + D
A~~B'
output1 <- cfa(model1, data=dataframe1, std.lv=TRUE)
output2 <- cfa(model1, data=dataframe2, std.lv=TRUE)
output3 <- cfa(model1, data=dataframe3, std.lv=TRUE)
Now I would like to calculate the difference in the estimates between the 3 latent variables using SEM.
Could someone help me?
example:
model<- 'latent_variable1 ~ latent_variable2 ~ latent_variable3'
output4<-sem(model, data =????, std.lv=TRUE)
If I understood correctly, your aim is mainly to test the differences in the factor loadings across these 3 datasets. In fact, you do not have 3 latent variables, just one, applied in 3 different contexts.
This is a specific type of measurement invariance assessment for a latent variable.
So, what you need is:
merge the datasets into one dataset, with an indication of their origin. Let's call it group:
dataframe1$group<-"G1"
dataframe2$group<-"G2"
dataframe3$group<-"G3"
df<-rbind(dataframe1,dataframe2)
df<-rbind(df,dataframe3)
run a multi-group assessment using this group variable:
output1 <- cfa(model1, data=df, std.lv=TRUE, group="group")
This will return 3 outputs, one per group, similar to fitting the model in the 3 different dataframes separately.
You can now constrain the loadings to be equal
output2 <- cfa(model1, data=df, std.lv=TRUE, group="group",group.equal="loadings")
and we can assess the significance of the differences between them:
anova(output1, output2)
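As a follow-up, you can also inspect the group-specific loadings that this test compares, e.g. via parameterEstimates():
pe <- parameterEstimates(output1)
subset(pe, op == "=~", select = c(lhs, rhs, group, est, se))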
I'm trying to simulate data for a model expressed with the following formula:
lme4::lmer(y ~ a + b + (1|subject), data) but with a set of given parameters:
a <- rnorm() measured at the subject level (e.g. nSubjects = 50)
y is measured at the observation level (e.g. nObs = 7 for each subject)
b <- rnorm() measured at the observation level and correlated at a given r with a
the variance ratio of the random effects in lmer(y ~ 1 + (1 | subject), data) is fixed at, for example, 50/50 or 10/90 (and so on)
some random noise is present (so that a full model does not explain all the variance)
the effect size of the fixed effects can be set at a predefined level (e.g. dCohen = 0.5)
I played with various packages like powerlmm, simstudy, and simr but still fail to find a working solution that will accommodate the number of parameters I'd like to define beforehand.
Also, for my own learning purposes, I'd prefer a base R method to a package solution.
The closest example I found is a blog post by Ben Ogorek "Hierarchical linear models and lmer" which looks great but I can't figure out how to control for parameters listed above.
Any help would be appreciated.
Also, if there is a package that I don't know of that can do this type of simulation, please let me know.
Some questions about the model definition:
How do we specify a correlation between two random vectors that are different lengths? I'm not sure: I'll sample 350 values (nObs*nSubjects) and throw away most of the values for the subject-level effect.
Not sure about "variance ratio" here. By definition, the theta parameters (standard deviations of the random effects) are scaled by the residual standard deviation (sigma); e.g., if sigma=2 and theta=2, then the residual std dev is 2 and the among-subject std dev is 4.
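If "variance ratio" means var(subject)/(var(subject) + var(residual)), then a small helper (ratio_to_theta() is my own name, not part of lme4) converts that ratio into a theta value:
## theta is the among-subject SD in units of sigma, so
## ratio = theta^2/(theta^2 + 1)  =>  theta = sqrt(ratio/(1 - ratio))
ratio_to_theta <- function(ratio) sqrt(ratio/(1 - ratio))
ratio_to_theta(0.5)   ## 1, i.e. a 50/50 split
ratio_to_theta(0.1)   ## approx. 0.33, i.e. a 10/90 split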
Define parameter/experimental design values:
nSubjects <- 50
nObs <- 7
## means of a,b are 0 without loss of generality
sdvec <- c(a=1,b=1)
rho <- 0.5 ## correlation
betavec <- c(intercept=0,a=1,b=2)
beta_sc <- betavec
beta_sc[-1] <- beta_sc[-1]*sdvec ## scale the slope parameters by the sds (keep the intercept)
theta <- 0.4 ## = 20/50
sigma <- 1
Set up data frame:
library(lme4)
set.seed(101)
## generate a, b variables
mm <- MASS::mvrnorm(nSubjects*nObs,
mu=c(0,0),
Sigma=matrix(c(1,rho,rho,1),2,2)*outer(sdvec,sdvec))
subj <- factor(rep(seq(nSubjects),each=nObs)) ## or ?gl
## sample every nObs'th value of a
avec <- mm[seq(1,nObs*nSubjects,by=nObs),"a"]
avec <- rep(avec,each=nObs) ## replicate
bvec <- mm[,"b"]
dd <- data.frame(a=avec,b=bvec,Subject=subj)
Simulate:
dd$y <- simulate(~a+b+(1|Subject),
newdata=dd,
newparams=list(beta=beta_sc,theta=theta,sigma=sigma),
family=gaussian)[[1]]
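As a check (optional), refit the generating model to the simulated data and see whether the parameters are roughly recovered:
fit1 <- lmer(y ~ a + b + (1 | Subject), data = dd)
summary(fit1)     ## fixed effects should be near beta_sc
VarCorr(fit1)     ## among-subject SD approx. theta*sigma, residual SD approx. sigma
cor(dd$a, dd$b)   ## should be near rho = 0.5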
I have 100 groups with 40 observations each. I know there might be other appropriate models, but I am currently only interested in the following.
reg<-lmList(Y ~ Intercept + a + b + c + d | grp,data=data, pool=F)
In order to save the residuals (from lmList) to my table I just do
data$residual <- residuals(reg)
Now I want to save the beta coefficients to the original table as well. Since each group only has a single set of coefficients, it should be the same within a group but different across groups.
coef <- coef(reg) gives me a list with the group coefficients. However R shows this as a list with only 5 columns (excluding the group names).
data$coef<-coef[,c(1) ] gives me the intercepts but I lose the group information.
I was thinking about creating a separate table with group identification so that I just merge the original table and the coefficient tables. But could not figure out how to get the group identification along with the coefficients.
If there is an easier way to do this please help.
If fm1 is your fitted lmList object then I think simply making the row names into an additional column should do what you want:
library(nlme)
fm1 <- lmList(distance ~ age | Subject, Orthodont)
res <- data.frame(Subject=rownames(coef(fm1)),coef(fm1),check.names=FALSE)
rownames(res) <- NULL ## now redundant
head(res)
Subject (Intercept) age
1 M16 16.95 0.550
2 M05 13.65 0.850
...
Then you should be able to merge() (although merging with the Orthodont object seems problematic - I think this may be because it's a weird groupedData object ...)
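A sketch of that merge step, converting Orthodont to a plain data frame first to sidestep the groupedData issue (note that the shared age column needs a suffix):
orth_df <- as.data.frame(Orthodont)
merged <- merge(orth_df, res, by = "Subject",
                suffixes = c("", ".coef"))  ## the per-subject slope becomes 'age.coef'
head(merged)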
I want to make a random effects plot using the plot method for ranef objects (plot.ranef.lme).
library(nlme)
x <- Orthodont
# change factor to unordered for this example
x$Subject <- factor(x$Subject, ordered=FALSE)
m <- lme(distance ~ age, x, random = ~ 1 | Subject)
re <- ranef(m)
plot(re)
Above, the order of the factor on the y-axis follows the order of the factor levels.
Now, I want the order of the levels to correspond to the size of the random effect parameters. The best I could come up with is to estimate the model, reorder the factor levels using the estimated random effects, and then estimate the model again. This is clumsy to say the least, but I was unable to get this done via arguments to the plot method (I am not very familiar with lattice).
o <- order(re[, 1])
x$Subject <- factor(x$Subject, levels=levels(x$Subject)[o])
m <- lme(distance ~ age, x, random = ~ 1 | Subject)
re <- ranef(m)
plot(re)
This is what I want but without using the clumsy approach above.
How can I do this in a more sensible way?
I don't think there is a parameter that can be used to change the order of the levels. You would have to do it by hand.
That said, you can build your own dotplot from the re object and use reorder() to order the factor.
library(lattice)
dat <- data.frame(x=row.names(re), y=re[,1]) ## the single column of random-intercept values
dotplot(reorder(x,y)~y,data=dat)
The goal: calculate 95% confidence intervals for eta values from a repeated measures ANOVA.
The design is a two-factor design (Factor 1 has 3 levels, Factor 2 has 7 levels), fully crossed, with 30 subjects. However, I am unsure whether I am doing the bootstrapping correctly to calculate confidence intervals for eta. I use the boot() function to resample by subject, but leave the factors and the levels within the factors alone. Do I need a more complicated bootstrapping/resampling scheme where I resample by factor/levels, or is it okay to resample only at the level of subjects? My code seems to give reasonable results...
library(ez)
library(boot)
library(reshape2)
###create a data.frame for a 2-factor (Factor1-3 levels, Factor2-7 levels) fully crossed design with 30 subjects & fill fake data values
subject.number<-factor(rep(1:30,each=21))
factor1.levels<-rep(rep(c("level1","level2","level3"),each=7),30)
factor2.levels<-rep(rep(c("level1","level2","level3","level4","level5","level6","level7"),3),30)
set.seed(1234)
fake.data<-rnorm(630,mean=3)
dframe<-data.frame(subject.number,factor1.levels,factor2.levels,fake.data)
names(dframe)<-c("Subject","Factor1","Factor2","OutcomeValue")
###to work with boot() convert from long to wide format
dframe.wide<-dcast(dframe,Subject~Factor1+Factor2,value.var="OutcomeValue")
###function to use with boot() to calculate generalized eta value for Factor1, Factor2, and Factor1xFactor2 interaction in a repeated measures ANOVA
generalized_eta<-function(data,indices){
d.wide<-data[indices,] #use boot() indices to sample data
#now that we have used the indices from boot(), convert the data back to long format with correct Factor labels
dframe.long<-melt(d.wide,value.name="OutcomeValue",id="Subject")
dframe.long<-cbind(dframe.long,colsplit(dframe.long$variable,"_",c("Factor1","Factor2")))
dframe.long$Factor1<-factor(dframe.long$Factor1)
dframe.long$Factor2<-factor(dframe.long$Factor2)
dframe.long$Subject<-factor(dframe.long$Subject)
#do repeated measures ANOVA with ezANOVA() which calculates generalized eta
aov.ez = ezANOVA(data = dframe.long, dv = .(OutcomeValue), wid = .(Subject), within = .(Factor1,Factor2), type = 1)
#return the three generalized eta values - Factor1, Factor2, Factor1xFactor2
return(aov.ez[[1]]$ges)
}
###call boot() to do the bootstrap - only 200 to make it fast
results<-boot(data=dframe.wide,statistic=generalized_eta,R=200)
###plot the bootstrap results
plot(results,index=1) #for Factor1
plot(results,index=2) #for Factor2
plot(results,index=3) #for Factor1xFactor2
###create 95%-CI from bootstrap results
boot.ci(results,type="bca",index=1) #for Factor1
boot.ci(results,type="bca",index=2) #for Factor2
boot.ci(results,type="bca",index=3) #for Factor1xFactor2