I am doing a simulation study for a mixed effect model (three levels; observations nested within subjects within schools):
f <- lmer(measurement ~ time + race + gender + s_ses +
fidelity + (1 + time|school/subject), mydata_long, REML=0)
The model allows the intercept and time slope to vary across subjects and schools. I am wondering how I can fix the variances to be specific values. I do know how to do that when there is only random intercept:
VarCorr(f)['subject:school']<-0.13
VarCorr(f)['school']<-0.20
However, when there is a random slope, this code doesn't work because each grouping factor now has several variance components (see the attached picture).
How can I fix the variances of subject:school (Intercept), subject:school time, school (Intercept), and school time to specific values in this case? Any suggestions?
A simulation example. The hardest part is getting the random-effects parameters correctly specified. The key things you need to know are: (1) internally, the random-effects variance matrix is scaled by the residual variance; (2) for vector-valued random effects (like this random-slopes model), the variance-covariance matrix is specified in terms of its Cholesky factor: if we want covariance matrix V, there is a lower-triangular matrix C such that C %*% t(C) == V. We compute C as t(chol(V)) (chol() itself returns the upper-triangular factor), then read off the elements of the lower triangle (including the diagonal) in column-major order (see helper functions below).
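As a small check of point (2), here is a minimal sketch using a made-up 2x2 covariance matrix (the values are arbitrary, and no scaling by the residual variance is applied yet). It takes the lower-triangular Cholesky factor and reads off its lower triangle in column-major order.

```r
## illustrative covariance matrix (values chosen arbitrarily)
V <- matrix(c(4,   1.2,
              1.2, 9), nrow = 2, byrow = TRUE)
## chol() returns the upper-triangular factor R with t(R) %*% R == V,
## so the lower-triangular factor is its transpose
C <- t(chol(V))
## check that C reproduces V
chk <- all.equal(C %*% t(C), V)
## lower triangle (including the diagonal) in column-major order:
## C[1,1], C[2,1], C[2,2]
theta_unscaled <- C[lower.tri(C, diag = TRUE)]
```

Dividing V by the squared residual standard deviation before factoring gives the scaled values that lme4 expects, which is what the helper functions below do.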
Set up experimental design (simplified from yours, but with the same random effects components):
library(lme4)
mydata_long <- expand.grid(time=1:40,
school=factor(letters[1:25]),
subject=factor(LETTERS[1:25]))
Helper functions: conv_sc converts from a vector of standard deviations, one or more correlation parameters (in lower-triangular/column-major order), and a residual standard deviation to a vector of "theta" parameters as used internally by lme4 (see description above); conv_chol converts back the other way.
conv_sc <- function(sdvec,cor,sigma) {
## construct symmetric matrix with cor in lower/upper triangles
cormat <- matrix(1,nrow=length(sdvec),ncol=length(sdvec))
cormat[lower.tri(cormat)] <- cor
cormat[upper.tri(cormat)] <- t(cormat)[upper.tri(cormat)]
## convert to covariance matrix and scale by 1/sigma^2
V <- outer(sdvec, sdvec)*cormat/sigma^2
## extract lower triangle in column-major order
return(t(chol(V))[lower.tri(V,diag=TRUE)])
}
conv_chol <- function(ch, s) {
m <- matrix(NA,2,2)
m[lower.tri(m,diag=TRUE)] <- ch
m[upper.tri(m)] <- 0
V <- m %*% t(m) * s^2
list(sd=sqrt(diag(V)), cor=cov2cor(V)[1,2])
}
If you want to start from covariance matrices rather than standard deviations and correlations you can modify the code to skip some steps (starting and ending with V).
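One way this could look, as a sketch only: conv_V and conv_theta are hypothetical names introduced here, not functions from lme4, and the example matrix corresponds to standard deviations 0.7 and 1.2 with correlation 0.3.

```r
## hypothetical helper: covariance matrix V + residual sd -> theta vector
conv_V <- function(V, sigma) {
  C <- t(chol(V / sigma^2))          ## lower-triangular factor of scaled V
  C[lower.tri(C, diag = TRUE)]       ## column-major lower triangle
}
## hypothetical inverse: theta vector + residual sd -> covariance matrix
conv_theta <- function(theta, sigma, n = 2) {
  m <- matrix(0, n, n)
  m[lower.tri(m, diag = TRUE)] <- theta
  m %*% t(m) * sigma^2
}
## sd = (0.7, 1.2), cor = 0.3  =>  covariance = 0.7 * 1.2 * 0.3 = 0.252
V <- matrix(c(0.49, 0.252, 0.252, 1.44), 2, 2)
th <- conv_V(V, sigma = 0.5)
V_back <- conv_theta(th, sigma = 0.5)  ## should reproduce V
```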
Pick some values and convert (and back-convert, to check)
tt1 <- conv_sc(c(0.7, 1.2), 0.3, 0.5)
tt2 <- conv_sc(c(1.4, 0.2), -0.2, 0.5)
tt <- c(tt1, tt2)
conv_chol(tt1, s=0.5)
conv_chol(tt2, s=0.5)
Set up formula and simulate:
form <- m ~ time + (1 + time|school/subject)
set.seed(101)
mydata_long$m <- simulate(form[-2], ## [-2] drops the response
family=gaussian,
newdata=mydata_long,
newparams=list(theta=tt,
beta=c(1,1),
sigma=0.5))[[1]]
f <- lmer(form, data=mydata_long, REML=FALSE)
VarCorr(f)
The fitted results are close to what we requested above ...
Groups Name Std.Dev. Corr
subject:school (Intercept) 0.66427
time 1.16488 0.231
school (Intercept) 1.78312
time 0.22459 -0.156
Residual 0.49772
Now do the same thing 200 times, to explore the distribution of estimates:
simfun <- function() {
mydata_long$m <- simulate(form[-2],
family=gaussian,
newdata=mydata_long,
newparams=list(theta=tt,
beta=c(1,1),
sigma=0.5))[[1]]
f <- lmer(form, data=mydata_long, REML=FALSE)
return(as.data.frame(VarCorr(f))[,"sdcor"])
}
set.seed(101)
res <- plyr::raply(200,suppressMessages(simfun()),.progress="text")
Here plyr::raply() is used for convenience; you can do this however you like (for loop, lapply(), replicate(), purrr::map() ...)
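For instance, a replicate() version might look like the sketch below. It uses a toy stand-in (sim_toy, a hypothetical function) rather than the lmer-based simfun() so that it runs without any packages; the point is only the shape of the result.

```r
## toy stand-in for simfun(): returns a fixed-length numeric vector
sim_toy <- function(n = 50) {
  x <- rnorm(n, mean = 1, sd = 2)
  c(mean = mean(x), sd = sd(x))
}
set.seed(101)
## replicate() stores each run as a column, so transpose to get
## one row per run (the layout boxplot() needs to draw one box per quantity)
res_toy <- t(replicate(200, sim_toy()))
dim(res_toy)
```

With the real simfun(), the same pattern would be t(replicate(200, suppressMessages(simfun()))).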
par(las=1)
boxplot(res)
## add true values to the plot
points(1:7,c(0.7,1.2,0.3,1.4,0.2,-0.2,0.5),col=2,cex=3,lwd=3)
I have this model
where TD is a binary variable and Strata is a numeric variable taking values in {1,2,3}. I need 95% CIs for these two linear combinations:
I have this code to construct confidence intervals:
pwp_gt_int <- coxph(Surv(tstart2,tstop2,status==1) ~ TD+ TD:strata(event))
mod_summ <- summary(pwp_gt_int)
coefs <- mod_summ$coefficients[, "coef"]  ## estimates only (first column of the summary table)
X <- model.matrix(pwp_gt_int)
dof <- nrow(X) - ncol(X)
coefs_var <- vcov(pwp_gt_int)
halfCI <- qt(0.975, dof) * sqrt(diag(coefs_var))
matrix(c(coefs - halfCI, coefs + halfCI), nrow=3)
But I need something like this:
coefs[2] = coefs[1] + 2*coefs[2]
coefs[3] = coefs[1] + 3*coefs[3]
matrix(c(coefs - halfCI, coefs + halfCI), nrow=3)
But the CIs I get are not plausible; I think I'm not getting the variance-covariance matrix for the linear combinations right.
Please help.
It looks like you're asking for two different things: one is the variance of a linear combination and the other is a confidence interval (and as such, a variance) for a non-linear combination. The linear combination is relatively easy. We know that the variance of a linear combination is:
$$V(Ab) = A\,V(b)\,A^\top$$
where A is a matrix of constants and V(b) is the variance-covariance matrix of the random variables (in this case, the coefficients). If your coefficient vector has three values in it, and you want to do as you suggest in your last block of code, then you would define:
$$A = \begin{pmatrix} 1 & 2 & 0 \\ 1 & 0 & 3 \end{pmatrix}$$
or in R as:
A = matrix(c(1,1,2,0,0,3), ncol=3)
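To see what each row of A computes, here is a quick check with made-up coefficient values (b_toy is purely illustrative and has nothing to do with the fitted model):

```r
A <- matrix(c(1, 1, 2, 0, 0, 3), ncol = 3)
A   ## filled column by column: rows are (1, 2, 0) and (1, 0, 3)
## made-up coefficients, only to show what each row computes
b_toy <- c(0.5, -0.2, 0.1)
lc <- drop(A %*% b_toy)
## row 1: 0.5 + 2 * (-0.2) = 0.1
## row 2: 0.5 + 3 * 0.1    = 0.8
lc
```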
Then, you could make the linear combinations and their variances with:
b <- matrix(coef(pwp_gt_int)[1:3], ncol=1)
V <- vcov(pwp_gt_int)[1:3,1:3]
lincom <- A %*% b
v_lincom <- A %*% V %*% t(A)
sds <- sqrt(diag(v_lincom))
crit <- qt(.975, dof)
cis <- cbind(lincom - crit*sds, lincom + crit*sds)
That would be the confidence interval for the linear combination. The problem is that there isn't such an easy formula for the variance of a non-linear combination. Further, the confidence intervals may be asymmetric. One thing you could do is an end-point transformation, where you take lincom and cis and then exponentiate all of them. Another option would be a parametric bootstrap. Here's what that would look like.
B <- MASS::mvrnorm(2500, b, V)
nlcom <- exp(A %*% b)
nlsim <- exp(A %*% t(B))
nlcis <- apply(nlsim, 1, quantile, c(.025,.975))
Now, nlcis would have the confidence bounds for the non-linear combination. This should work given your data, but without the data to try it out, I'm not sure.
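The end-point transformation mentioned above can be sketched as follows. The estimates and covariance matrix here are hypothetical (b_toy, V_toy), and a normal critical value is used as an assumption in place of qt(); with a real fit you would substitute coef() and vcov().

```r
## hypothetical estimates and covariance matrix, for illustration only
b_toy <- c(0.5, -0.2, 0.1)
V_toy <- diag(c(0.04, 0.01, 0.01))
A <- matrix(c(1, 1, 2, 0, 0, 3), ncol = 3)
lincom <- drop(A %*% b_toy)
se <- sqrt(diag(A %*% V_toy %*% t(A)))
crit <- qnorm(0.975)   ## normal critical value (an assumption)
ci_log <- cbind(est = lincom,
                lwr = lincom - crit * se,
                upr = lincom + crit * se)
## exponentiate the estimate and both end points
ci_exp <- exp(ci_log)
ci_exp
```

The intervals are symmetric on the log scale but not after exponentiating, which is the asymmetry referred to above.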
I'm trying to extract the confidence intervals and the intercept values that are plotted with dotplot(ranef()). How can I do this?
library(lme4)  ## sleepstudy ships with lme4, so load the package first
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
lattice::dotplot(ranef(fm1, condVar=TRUE))
I tried exploring the list object fm1 but could not find the CI.
rr <- ranef(fm1) ## condVar = TRUE has been the default for a while
as.data.frame() gives the conditional mode and SD, from which you can calculate the intervals (technically, these are not "confidence intervals" because the values of the BLUPs/conditional modes are not parameters ...)
dd <- as.data.frame(rr)
transform(dd, lwr = condval - 1.96*condsd, upr = condval + 1.96*condsd)
Or with broom.mixed::tidy:
broom.mixed::tidy(fm1, effects = "ran_vals", conf.int = TRUE)
broom.mixed::tidy() uses as.data.frame.ranef.mer() (the method called by as.data.frame) internally: this function takes the rather complicated data structure described in ?lme4::ranef and extracts the conditional modes and standard deviations in a more user-friendly format:
If ‘condVar’ is ‘TRUE’ the ‘"postVar"’
attribute is an array of dimension j by j by k (or a list of such
arrays). The kth face of this array is a positive definite
symmetric j by j matrix. If there is only one grouping factor in
the model the variance-covariance matrix for the entire random
effects vector, conditional on the estimates of the model
parameters and on the data, will be block diagonal; this j by j
matrix is the kth diagonal block. With multiple grouping factors
the faces of the ‘"postVar"’ attributes are still the diagonal
blocks of this conditional variance-covariance matrix but the
matrix itself is no longer block diagonal.
In this particular case, here's what you need to do to replicate the condsd column of as.data.frame():
## get the 'postVar' attribute of the first (and only) RE term
aa <- attr(rr$Subject, "postVar")
## for each slice of the array, extract the diagonal;
## transpose and drop dimensions;
## take the square root
sqrt(c(t(apply(aa, 3, diag))))
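To make the ordering explicit, here is the same operation on a toy array (aa_toy is made up; it mimics a "postVar" array with j = 2 effects and k = 3 groups):

```r
## toy 'postVar'-like array: 2 x 2 x 3 (j = 2 effects, k = 3 groups)
aa_toy <- array(0, dim = c(2, 2, 3))
for (k in 1:3) aa_toy[, , k] <- diag(c(k, 10 * k))  ## variances k and 10k
## apply() returns a 2 x 3 matrix: effects in rows, groups in columns
v <- apply(aa_toy, 3, diag)
## transposing and flattening lists all groups for effect 1,
## then all groups for effect 2
sd_toy <- sqrt(c(t(v)))
sd_toy
```

The result runs through every group for the first effect and then every group for the second, which is why the transpose is needed before flattening.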
I have three time-series variables (x,y,z) measured in 3 replicates. x and z are the independent variables. y is the dependent variable. t is the time variable. All the three variables follow diel variation, they increase during the day and decrease during the night. An example with a simulated dataset is below.
library(nlme)
library(tidyverse)
n <- 100
t <- seq(0, 4*pi, length.out = n)
a <- 3
b <- 2
c.unif <- runif(n)
amp <- 2
datalist = list()
for(i in 1:3){
y <- 3*sin(b*t)+rnorm(n)*2
x <- 2*sin(b*t+2.5)+rnorm(n)*2
z <- 4*sin(b*t-2.5)+rnorm(n)*2
data = as_tibble(cbind(y,x,z))%>%mutate(t = 1:100)%>% mutate(replicate = i)
datalist[[i]] <- data
}
df <- do.call(rbind,datalist)
ggplot(df)+
geom_line(aes(t,x),color='red')+geom_line(aes(t,y),color='blue')+
geom_line(aes(t,z),color = 'green')+facet_wrap(~replicate, nrow = 1)+theme_bw()
I can identify the lead/lag of y with respect to x and z individually. This can be done with the ccf() function in R. For example:
ccf(x,y)
ccf(z,y)
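As a sketch of how a lag can be read off programmatically, the toy series below are built so that y trails x by exactly 3 time steps (the data and the 3-step delay are made up for illustration). According to the ccf() documentation, lag k of ccf(x, y) estimates the correlation between x[t+k] and y[t], so the peak should appear at a negative lag.

```r
set.seed(1)
n <- 200
x_full <- rnorm(n + 3)
x <- x_full[4:(n + 3)]
## y[t] = x[t - 3] + a little noise, i.e. y lags x by 3 steps
y <- x_full[1:n] + rnorm(n, sd = 0.1)
cc <- ccf(x, y, lag.max = 10, plot = FALSE)
## lag with the largest absolute cross-correlation
best_lag <- cc$lag[which.max(abs(cc$acf))]
best_lag
```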
But I would like to do it in a multivariate regression approach. For example, the lme function from the nlme package indicates that x and z negatively affect y:
fit <- lme(data = df, y ~ x + z, random = ~1|replicate, correlation = corCAR1(form = ~ t|replicate))
It is impossible (in actual data) that x and z can negatively affect y.
I need the time-lead/lag and also I would like to get the standardized coefficient (t-value to compare the effect size), both from the same model.
Is there any multivariate model available that can give me the lead/lag and also give me regression coefficient?
We might be considering the " statistical significance of Cramer Rao estimation of a lower bound". In order to find Xbeta-Xinfinity, taking the expectation of Xbeta and an assumed mean neu; will yield a variable, neu^squared which can replace Xinfinity. Using the F test-likelihood ratio, the degrees of freedom is p2-p1 = n-p2.
Put it this way, the estimates are n=(-2neu^squared/neu^squared+n), phi t = y/Xbeta and Xbeta= (y-betazero)/a.
The point estimate is derived from y=aXbeta + b: , Xbeta. The time lead lag is phi t and the standardized coefficient is n. The regression generates the lower bound Xbeta, where t=beta.
Spectral analysis of the linear distribution indicates a point estimate beta zero = 0.27 which is a significant peak of
variability. Scaling Xbeta by Betazero would be an appropriate idea.
I'm trying to simulate data for a model expressed with the following formula:
lme4::lmer(y ~ a + b + (1|subject), data) but with a set of given parameters:
a <- rnorm() measured at subject level (e.g nSubjects = 50)
y is measured at the observation level (e.g. nObs = 7 for each subject)
b <- rnorm() measured at observation level and correlated at a given r with a
variance ratio of the random effects in lmer(y ~ 1 + (1 | subject), data) is fixed at for example 50/50 or 10/90 (and so on)
some random noise is present (so that a full model does not explain all the variance)
effect size of the fixed effects can be set at a predefined level (e.g. dCohen=0.5)
I played with various packages like: powerlmm, simstudy or simr but still fail to find a working solution that will accommodate the amount of parameters I'd like to define beforehand.
Also for my learning purposes I'd prefer a base R method than a package solution.
The closest example I found is a blog post by Ben Ogorek "Hierarchical linear models and lmer" which looks great but I can't figure out how to control for parameters listed above.
Any help would be appreciated.
Also, if there is a package I don't know of that can do these types of simulations, please let me know.
Some questions about the model definition:
How do we specify a correlation between two random vectors that are different lengths? I'm not sure: I'll sample 350 values (nObs*nSubject) and throw away most of the values for the subject-level effect.
Not sure about "variance ratio" here. By definition, the theta parameters (standard deviations of the random effects) are scaled by the residual standard deviation (sigma), e.g. if sigma=2, theta=2, then the residual std dev is 2 and the among-subject std dev is 4
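A small sketch of this scaling follows. The helper theta_from_share is hypothetical, and it assumes one possible reading of "variance ratio": the share p of the total (among-subject plus residual) variance that is due to subjects, so that p = theta^2 / (theta^2 + 1).

```r
## the example from the text: sigma = 2, theta = 2
sigma_ex <- 2
theta_ex <- 2
among_sd <- theta_ex * sigma_ex   ## among-subject std dev = 4
## hypothetical helper: share of total variance due to subjects -> theta
theta_from_share <- function(p) sqrt(p / (1 - p))
th_50 <- theta_from_share(0.5)    ## 50/50 split -> theta = 1
th_10 <- theta_from_share(0.1)    ## 10/90 split -> theta = 1/3
```

Other readings of the ratio (for example a ratio of standard deviations) would lead to a different mapping.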
Define parameter/experimental design values:
nSubjects <- 50
nObs <- 7
## means of a,b are 0 without loss of generality
sdvec <- c(a=1,b=1)
rho <- 0.5 ## correlation
betavec <- c(intercept=0,a=1,b=2)
beta_sc <- betavec[-1]*sdvec ## scale parameter values by sd
theta <- 0.4 ## = 20/50
sigma <- 1
Set up data frame:
library(lme4)
set.seed(101)
## generate a, b variables
mm <- MASS::mvrnorm(nSubjects*nObs,
mu=c(0,0),
Sigma=matrix(c(1,rho,rho,1),2,2)*outer(sdvec,sdvec))
subj <- factor(rep(seq(nSubjects),each=nObs)) ## or ?gl
## sample every nObs'th value of a
avec <- mm[seq(1,nObs*nSubjects,by=nObs),"a"]
avec <- rep(avec,each=nObs) ## replicate
bvec <- mm[,"b"]
dd <- data.frame(a=avec,b=bvec,Subject=subj)
Simulate:
dd$y <- simulate(~a+b+(1|Subject),
newdata=dd,
newparams=list(beta=c(betavec["intercept"],beta_sc),theta=theta,sigma=sigma),
family=gaussian)[[1]]
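Since the question asks for a base R route, here is a sketch of the same data-generating process written out by hand, without simulate(). The predictors are drawn independently for brevity (an assumption; the correlated a and b built above could be dropped in instead), and dd_base is a name introduced here.

```r
set.seed(101)
nSubjects <- 50
nObs <- 7
beta <- c(intercept = 0, a = 1, b = 2)
theta <- 0.4
sigma <- 1
subj <- rep(seq_len(nSubjects), each = nObs)
## illustrative predictors: a varies by subject, b by observation
a <- rep(rnorm(nSubjects), each = nObs)
b <- rnorm(nSubjects * nObs)
## subject-level random intercepts with sd = theta * sigma
u <- rnorm(nSubjects, sd = theta * sigma)
## residual noise with sd = sigma
eps <- rnorm(nSubjects * nObs, sd = sigma)
y <- beta["intercept"] + beta["a"] * a + beta["b"] * b + u[subj] + eps
dd_base <- data.frame(y = unname(y), a = a, b = b, Subject = factor(subj))
```

Fitting lmer(y ~ a + b + (1 | Subject), dd_base) to data generated this way should recover values near the chosen parameters, up to sampling noise.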
I'd like to use R to find the critical values for the Pearson correlation coefficient.
This has proved difficult to find in search engines since the standard variable for the Pearson correlation coefficient is itself r. In turn, I'm finding a lot of r critical value tables (rather than how to find this by using the statistical package R).
I'm looking for a function that will provide output like the following:
I'm comfortable finding the correlation with:
cor(x,y)
However, I'd also like to find the critical values.
Is there a function I can use to enter n (or degrees of freedom) as well as alpha in order to find the critical value?
The significance of a correlation coefficient, r, is determined by converting r to a t-statistic and then finding the significance of that t-value at the degrees of freedom that correspond to the sample size, n. So, you can use R to find the critical t-value and then convert that value back to a correlation coefficient to find the critical correlation coefficient.
critical.r <- function( n, alpha = .05 ) {
df <- n - 2
critical.t <- qt(alpha/2, df, lower.tail = FALSE)
critical.r <- sqrt( (critical.t^2) / ( (critical.t^2) + df ) )
return(critical.r)
}
# Example usage: Critical correlation coefficient at sample size of n = 100
critical.r( 100 )
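One way to sanity-check the result is to convert the critical r back into a t-statistic and confirm that its two-sided p-value equals alpha. The sketch below redefines the function under a different name (crit_r, introduced here) so that it is self-contained.

```r
crit_r <- function(n, alpha = 0.05) {
  df <- n - 2
  tc <- qt(alpha / 2, df, lower.tail = FALSE)
  sqrt(tc^2 / (tc^2 + df))
}
n <- 100
r_c <- crit_r(n)
## convert back to t and compute the two-sided p-value
t_c <- r_c * sqrt((n - 2) / (1 - r_c^2))
p_c <- 2 * pt(t_c, df = n - 2, lower.tail = FALSE)
c(critical_r = r_c, p_value = p_c)
```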
The general structure of hypothesis testing is kind of a mish-mash of two systems: Fisherian and Neyman-Pearson. Statisticians understand the differences but rarely does this get clearly presented in undergraduate stats classes. R was designed by and intended for statisticians as a toolbox, so they constructed a function named cor.test that will deliver a p-value (part of the Fisherian tradition) as well as a confidence interval for "r" (derived on the basis of the Neyman-Pearson formalism.) Fisher and Neyman had bitter disputes in their lifetime. The "critical value" terminology is part of the N-P testing strategy. It is equivalent to building a confidence interval and finding the particular statistic that reaches exactly a threshold value of 0.05 significance.
The code for constructing the inferential statistics in cor.test is available with:
methods(cor.test)
getAnywhere(cor.test.default)
# scroll down
method <- "Pearson's product-moment correlation"
#-----partial code----
r <- cor(x, y)
df <- n - 2L
ESTIMATE <- c(cor = r)
PARAMETER <- c(df = df)
STATISTIC <- c(t = sqrt(df) * r/sqrt(1 - r^2))
p <- pt(STATISTIC, df)
# ---- omitted some set up and error checking ----
# this is the confidence interval section------
z <- atanh(r)
sigma <- 1/sqrt(n - 3)
cint <- switch(alternative, less = c(-Inf, z + sigma *
qnorm(conf.level)), greater = c(z - sigma * qnorm(conf.level),
Inf), two.sided = z + c(-1, 1) * sigma * qnorm((1 +
conf.level)/2))
cint <- tanh(cint)
So now you know how R does it. Notice that there is no "critical value" mentioned. I suspect that your hope was to find some table where a tabulation of "r" and "df" was laid out displaying the minimum "r" that would reach a significance of 0.05 for a given 'df'. Such a table could be built but that's not how this particular toolbox is constructed. You should now have the tools to build it yourself.
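For example, such a table could be built along these lines. This is a sketch using the t-to-r conversion from the earlier answer; crit_r and the chosen df and alpha values are illustrative, not part of cor.test.

```r
## critical r for a two-sided test, given degrees of freedom and alpha
crit_r <- function(df, alpha) {
  tc <- qt(alpha / 2, df, lower.tail = FALSE)
  sqrt(tc^2 / (tc^2 + df))
}
dfs <- c(5, 10, 20, 50, 100)
alphas <- c(0.10, 0.05, 0.01)
## outer() evaluates crit_r for every df/alpha combination
tab <- outer(dfs, alphas, crit_r)
dimnames(tab) <- list(df = dfs, alpha = alphas)
round(tab, 3)
```

Each row is one value of df and each column one significance level; the entries are the smallest |r| that reaches that level.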
I would do the same. But if you are using a Spearman correlation you need to convert t into r using a different formula.
Just replace the last line before the return in the function with this one:
critical.r <- sqrt(((critical.t^2) / (df)) + 1)