I'm trying to extract the confidence intervals and the intercept values that are plotted with dotplot(ranef()). How can I do this?
attach(sleepstudy)
library(lme4)
fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
lattice::dotplot(ranef(fm1, condVar=TRUE))
I tried exploring the list object fm1 but could not find the CIs.
rr <- ranef(fm1) ## condVar = TRUE has been the default for a while
With as.data.frame(): this gives the conditional modes and SDs, from which you can calculate the intervals (technically, these are not "confidence intervals", because the values of the BLUPs/conditional modes are not parameters ...):
dd <- as.data.frame(rr)
transform(dd, lwr = condval - 1.96*condsd, upr = condval + 1.96*condsd)
Or with broom.mixed::tidy:
broom.mixed::tidy(fm1, effects = "ran_vals", conf.int = TRUE)
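If the goal is to redraw something like the dotplot() with these intervals, the tidy output can be plotted directly. A minimal ggplot2 sketch (assuming the usual tidy column names level, term, estimate, conf.low, conf.high):
library(ggplot2)
tt <- broom.mixed::tidy(fm1, effects = "ran_vals", conf.int = TRUE)
ggplot(tt, aes(x = estimate, y = level,
               xmin = conf.low, xmax = conf.high)) +
    geom_pointrange() +
    facet_wrap(~ term, scales = "free_x")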
broom.mixed::tidy() uses as.data.frame.ranef.mer() (the method called by as.data.frame) internally: this function takes the rather complicated data structure described in ?lme4::ranef and extracts the conditional modes and standard deviations in a more user-friendly format:
If ‘condVar’ is ‘TRUE’ the ‘"postVar"’
attribute is an array of dimension j by j by k (or a list of such
arrays). The kth face of this array is a positive definite
symmetric j by j matrix. If there is only one grouping factor in
the model the variance-covariance matrix for the entire random
effects vector, conditional on the estimates of the model
parameters and on the data, will be block diagonal; this j by j
matrix is the kth diagonal block. With multiple grouping factors
the faces of the ‘"postVar"’ attributes are still the diagonal
blocks of this conditional variance-covariance matrix but the
matrix itself is no longer block diagonal.
In this particular case, here's what you need to do to replicate the condsd column of as.data.frame():
## get the 'postVar' attribute of the first (and only) RE term
aa <- attr(rr$Subject, "postVar")
## for each slice of the array, extract the diagonal;
## transpose and drop dimensions;
## take the square root
sqrt(c(t(apply(aa, 3, diag))))
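As a quick consistency check (the rows of as.data.frame() are ordered with all levels of the first term first, which is why the transpose is needed), this should return TRUE:
all.equal(dd$condsd, sqrt(c(t(apply(aa, 3, diag)))))  ## should be TRUE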
I am doing a simulation study for a mixed effect model (three levels; observations nested within subjects within schools):
f <- lmer(measurement ~ time + race + gender + s_ses +
          fidelity + (1 + time|school/subject), mydata_long, REML=0)
The model allows the intercept and time slope to vary across subjects and schools. I am wondering how I can fix the variances to be specific values. I do know how to do that when there is only a random intercept:
VarCorr(f)['subject:school']<-0.13
VarCorr(f)['school']<-0.20
However, when there is a random slope, this code doesn't work, since the variance structure has several components (see the attached picture).
How can I fix the variances of the subject:school (Intercept), subject:school time, school (Intercept), and school time terms to specific values in this case? Any suggestions?
A simulation example. The hardest part is getting the random-effects parameters correctly specified. The key things you need to know are (1) internally, the random-effects variance matrix is scaled by the residual variance; (2) for vector-valued random effects (as in this random-slopes model), the variance-covariance matrix is specified in terms of its Cholesky factor: if we want covariance matrix V, there is a lower-triangular matrix C such that C %*% t(C) == V. We compute C as the transpose of chol(V), then read off the elements of the lower triangle (including the diagonal) in column-major order (see the helper functions below).
Set up experimental design (simplified from yours, but with the same random effects components):
mydata_long <- expand.grid(time=1:40,
                           school=factor(letters[1:25]),
                           subject=factor(LETTERS[1:25]))
Helper functions: conv_sc() converts from a vector of standard deviations, one or more correlation parameters (in lower-triangular/column-major order), and a residual standard deviation to the vector of "theta" parameters used internally by lme4 (see the description above); conv_chol() converts back the other way.
conv_sc <- function(sdvec, cor, sigma) {
    ## construct symmetric matrix with cor in lower/upper triangles
    cormat <- matrix(1, nrow=length(sdvec), ncol=length(sdvec))
    cormat[lower.tri(cormat)] <- cor
    cormat[upper.tri(cormat)] <- t(cormat)[upper.tri(cormat)]
    ## convert to covariance matrix and scale by 1/sigma^2
    V <- outer(sdvec, sdvec)*cormat/sigma^2
    ## extract the lower triangle of the Cholesky factor in column-major order
    return(t(chol(V))[lower.tri(V, diag=TRUE)])
}
conv_chol <- function(ch, s) {
    ## rebuild the lower-triangular Cholesky factor from its elements
    m <- matrix(NA, 2, 2)
    m[lower.tri(m, diag=TRUE)] <- ch
    m[upper.tri(m)] <- 0
    ## back-transform to the covariance matrix, then to sd/cor
    V <- m %*% t(m) * s^2
    list(sd=sqrt(diag(V)), cor=cov2cor(V)[1,2])
}
If you want to start from covariance matrices rather than standard deviations and correlations you can modify the code to skip some steps (starting and ending with V).
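For example, a hypothetical variant (not part of the helpers above) that starts directly from the unscaled covariance matrix V might look like this:
conv_sc_V <- function(V, sigma) {
    ## scale by the residual variance, take the lower Cholesky factor,
    ## and return its lower triangle in column-major order
    t(chol(V/sigma^2))[lower.tri(V, diag=TRUE)]
}
## e.g. equivalent to conv_sc(c(0.7, 1.2), 0.3, 0.5):
## conv_sc_V(outer(c(0.7, 1.2), c(0.7, 1.2)) * matrix(c(1, 0.3, 0.3, 1), 2), 0.5)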
Pick some values and convert (and back-convert, to check)
tt1 <- conv_sc(c(0.7, 1.2), 0.3, 0.5)
tt2 <- conv_sc(c(1.4, 0.2), -0.2, 0.5)
tt <- c(tt1, tt2)
conv_chol(tt1, s=0.5)
conv_chol(tt2, s=0.5)
Set up formula and simulate:
form <- m ~ time + (1 + time|school/subject)
set.seed(101)
mydata_long$m <- simulate(form[-2],  ## [-2] drops the response
                          family=gaussian,
                          newdata=mydata_long,
                          newparams=list(theta=tt,
                                         beta=c(1,1),
                                         sigma=0.5))[[1]]
f <- lmer(form, data=mydata_long, REML=FALSE)
VarCorr(f)
The fitted results are close to what we requested above ...
 Groups         Name        Std.Dev. Corr
 subject:school (Intercept) 0.66427
                time        1.16488   0.231
 school         (Intercept) 1.78312
                time        0.22459  -0.156
 Residual                   0.49772
Now do the same thing 200 times, to explore the distribution of estimates:
simfun <- function() {
    mydata_long$m <- simulate(form[-2],
                              family=gaussian,
                              newdata=mydata_long,
                              newparams=list(theta=tt,
                                             beta=c(1,1),
                                             sigma=0.5))[[1]]
    f <- lmer(form, data=mydata_long, REML=FALSE)
    return(as.data.frame(VarCorr(f))[,"sdcor"])
}
set.seed(101)
res <- plyr::raply(200,suppressMessages(simfun()),.progress="text")
Here plyr::raply() is used for convenience; you can do this however you like (for loop, lapply(), replicate(), purrr::map(), ...).
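For example, roughly the same thing with base replicate() (res2 is just an illustrative name; the transpose is needed because replicate() binds results as columns):
set.seed(101)
res2 <- t(replicate(200, suppressMessages(simfun())))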
par(las=1)
boxplot(res)
## add true values to the plot
points(1:7, c(0.7, 1.2, 0.3, 1.4, 0.2, -0.2, 0.5), col=2, cex=3, lwd=3)
I have binomial count data, coming from a set of conditions, that are overdispersed. To simulate them I use the beta-binomial distribution implemented by the rbetabinom function of the emdbook R package:
library(emdbook)
set.seed(1)
df <- data.frame(p = rep(runif(3,0,1)),
                 n = as.integer(runif(30,100,200)),
                 theta = rep(runif(3,1,5)),
                 cond = rep(LETTERS[1:3],10),
                 stringsAsFactors=FALSE)
df$k <- sapply(1:nrow(df), function(x) rbetabinom(n=1, prob=df$p[x], size=df$n[x],theta = df$theta[x], shape1=1, shape2=1))
I want to find the effect of each condition (cond) on the counts (k).
I think the glm.nb model of the MASS R package allows modelling that:
library(MASS)
fit <- glm.nb(k ~ cond + offset(log(n)), data = df)
My question is how to set the contrasts such that I get the effect of each condition relative to the mean effects over all conditions rather than relative to the dummy condition A?
Two things: (1) if you want contrasts relative to the mean, use contr.sum rather than the default contr.treatment; (2) you probably shouldn't fit beta-binomial data with a negative binomial model; use a beta-binomial model instead (e.g. via VGAM or bbmle)!
library(emdbook)
set.seed(1)
df <- data.frame(p = rep(runif(3,0,1)),
                 n = as.integer(runif(30,100,200)),
                 theta = rep(runif(3,1,5)),
                 cond = rep(LETTERS[1:3],10),
                 stringsAsFactors=FALSE)
## slightly abbreviated
df$k <- rbetabinom(n=nrow(df), prob=df$p,
                   size=df$n, theta=df$theta, shape1=1, shape2=1)
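For completeness on point (1): instead of changing the global options() (as done for the VGAM fit below), the sum-to-zero contrasts can be attached to the factor itself, so only this variable is affected. A minimal sketch with the (not really appropriate) glm.nb fit from the question; fit_sum is just an illustrative name:
library(MASS)
df$cond <- factor(df$cond)
contrasts(df$cond) <- contr.sum(3)   ## deviations from the mean across conditions
fit_sum <- glm.nb(k ~ cond + offset(log(n)), data = df)  ## illustrative
## cond1 and cond2 are now the deviations of A and B from the mean;
## the deviation of C is -(cond1 + cond2)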
With VGAM:
library(VGAM)
## note dbetabinom/rbetabinom from emdbook are masked
options(contrasts=c("contr.sum","contr.poly"))
vglm(cbind(k, n-k) ~ cond, data=df,
     family=betabinomialff(zero=2)  ## hold shape parameter 2 constant
     )
## Coefficients:
## (Intercept):1 (Intercept):2 cond1 cond2
## 0.4312181 0.5197579 -0.3121925 0.3011559
## Log-likelihood: -147.7304
Here the intercept is the mean shape parameter across the levels; cond1 and cond2 are the differences of levels 1 and 2 from the mean. (This doesn't directly give you the difference of level 3 from the mean, but by construction it is -(cond1 + cond2) ...)
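If you want that third deviation explicitly, it can be computed from the coefficients, assuming the vglm fit above has been stored in an object (say fit_vglm, an illustrative name):
cc <- coef(fit_vglm)           ## assuming fit_vglm <- vglm(...) as above
-(cc["cond1"] + cc["cond2"])   ## deviation of condition C from the mean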
I find the parameterization with bbmle (with logit-probability and dispersion parameter) a little easier:
detach("package:VGAM")
library(bbmle)
mle2(k ~ dbetabinom(k, prob=plogis(lprob),
                    size=n, theta=exp(ltheta)),
     parameters=list(lprob~cond),
     data=df,
     start=list(lprob=0, ltheta=0))
## Coefficients:
## lprob.(Intercept) lprob.cond1 lprob.cond2 ltheta
## -0.09606536 -0.31615236 0.17353311 1.15201809
##
## Log-likelihood: -148.09
The log-likelihoods are about the same (the VGAM parameterization is a bit better); in theory, if we allowed both shape1 and shape2 (VGAM) or lprob and ltheta (bbmle) to vary across conditions, we'd get the same log-likelihoods for both parameterizations.
Effects must be estimated relative to some base level; an "effect" for every one of the 3 conditions would be indistinguishable from the constant (intercept) in the regression.
Since the intercept is the expected value when the dummies for both estimated levels (i.e. "B" and "C") are 0, it is the mean value only for the reference group (i.e. "A").
Therefore, you basically already have this information in your model, or at least as close to it as you can get.
The mean value of a comparison group is the intercept plus the comparison group's coefficient. The comparison groups' coefficients, as you know, therefore give you the effect of having the comparison group = 1 (bearing in mind that each level of your categorical variable is a dummy variable which = 1 when that level is present) relative to the reference group.
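For example, with the treatment-coded glm.nb fit from the question (coefficients are on the log scale because of the log link and the offset):
cf <- coef(fit)
cf["(Intercept)"]                 ## log rate (k/n) for the reference level A
cf["(Intercept)"] + cf["condB"]   ## log rate for level B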
So your results give you the means and relative effects of each level. You can of course switch out the reference level according to your preference.
That should hopefully give you all the information you need. If not then you need to ask yourself precisely what information it is that you're after.
I need to create "2D data set with 200 samples created from a multivariate Gaussian distribution with a non-diagonal covariance matrix", but I'm neither a statistician nor a mathematician, and I didn't exactly get this.
Here is what I understood: a diagonal matrix is a matrix that has all zeros in the entries outside the main diagonal. Therefore, I assume non-diagonal means a matrix that doesn't have all zeros in the entries outside the main diagonal, so any random matrix would do, right? So I started by creating a random matrix; because it doesn't say any size here, I just did 100x100:
m <- matrix(rnorm(100*100), 100, 100)
I don't know how to achieve the rest. I know the sample() function which creates a sample, but how can I create "2D data set with 200 samples created from a multivariate Gaussian distribution"?
As long as you have a mean vector and a covariance matrix, simulating from a multivariate normal distribution is very simple, via MASS::mvrnorm. Have a look at ?mvrnorm for how to use this function.
If you do not have a special requirement on the covariance matrix (i.e., a random covariance matrix will do), you first need to create a proper covariance matrix.
A covariance matrix must be positive definite. We can create a positive-definite matrix by taking the crossproduct of a full-rank matrix: if an n * p (n >= p) matrix X has full column rank, then A = t(X) %*% X is positive definite (hence a proper covariance matrix).
Let's first generate a random X matrix:
p <- 100 ## we want p-dimensional multivariate normal
set.seed(0); X <- matrix(runif(p * p), p, p) ## this random matrix has full rank
Then get a covariance matrix:
COV <- crossprod(X) ## t(X) %*% X but about 2 times faster
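A quick sanity check that COV really is positive definite (all eigenvalues strictly positive):
min(eigen(COV, symmetric = TRUE, only.values = TRUE)$values) > 0  ## should be TRUE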
We also need a mean vector. Let's assume zero means:
mu <- rep(0, p)
Now we call MASS::mvrnorm for random sampling:
library(MASS) ## no need to install
x <- mvrnorm(1000, mu, COV) ## mvrnorm(sample.size, mean, covariance)
Now x contains 1000 samples from the 100-dimensional (p-dimensional) multivariate normal distribution with mean mu and covariance COV.
> str(x)
num [1:1000, 1:100] 1.66 -2.82 6.62 6.46 -3.35 ...
- attr(*, "dimnames")=List of 2
x is a matrix, each row of which is a random sample. So in total we have 1000 rows.
For a multivariate normal distribution, every marginal distribution is still normal, so we can plot histograms of the marginals. The following plots the 1st, 10th, 20th and 30th marginals:
par(mfrow = c(2,2))
hist(x[, 1], main = "1st marginal")
hist(x[, 10], main = "10th marginal")
hist(x[, 20], main = "20th marginal")
hist(x[, 30], main = "30th marginal")
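For the specific task in the question (a 2D data set with 200 samples), the same recipe reduces to something like this; the covariance entries are just illustrative values:
Sigma <- matrix(c(1.0, 0.6,
                  0.6, 2.0), nrow = 2)  ## non-diagonal, positive definite
set.seed(1)
y <- mvrnorm(200, mu = c(0, 0), Sigma = Sigma)
plot(y, xlab = "dimension 1", ylab = "dimension 2")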
I was wondering how to get the actual components from predict(..., type = "terms"). I know that if I take the rowSums and add the attr(,"constant") value to each, I will get the predicted values, but what I'm not sure about is how this attr(,"constant") is split up between the columns. All in all, how do I alter the matrix returned by predict so that each value represents the model coefficient multiplied by the prediction data? The result should be a matrix (or data.frame) with the same dimensions as returned by predict, but whose rowSums add up to the predicted values with no further alteration needed.
Note: I realize I could probably take the coefficients produced by the model and matrix multiply them with my prediction matrix but I'd rather not do it that way to avoid any problems that factors could produce.
Edit: The goal of this question is not to produce a way of summing the rows to get the predicted values; that was just meant as a sanity check.
If I have the equation y = 2*a + 3*b + c and my predicted value is 500, I want to know what 2*a was, what 3*b was, and what c was at that particular point. Right now I feel like these values are being returned by predict but they've been scaled. I need to know how to un-scale them.
It's not split up between the columns - it corresponds to the intercept. If you include an intercept in the model, then it is the mean of the predictions. For example,
## With intercept
fit <- lm(Sepal.Length ~ Sepal.Width + Species, data=iris)
tt <- predict(fit, type="terms")
pp <- predict(fit)
attr(tt, "constant")
# [1] 5.843333
attr(scale(pp, scale=F), "scaled:center")
# [1] 5.843333
## or
mean(pp)
# [1] 5.843333
If you make the model without an intercept, there won't be a constant, so you will have a matrix where the rowSums correspond to the predictions.
## Without intercept
fit1 <- lm(Sepal.Length ~ Sepal.Width + Species - 1, data=iris)
tt1 <- predict(fit1, type="terms")
attr(tt1, "constant")
# [1] 0
all.equal(rowSums(tt1), predict(fit1))
## [1] TRUE
Scaling (subtracting the mean of) the response variable changes only the intercept, so when there is no intercept no scaling is done.
fit2 <- lm(scale(Sepal.Length, scale=F) ~ Sepal.Width + Species, data=iris)
all.equal(coef(fit2)[-1], coef(fit)[-1])
## [1] TRUE
As far as I know, the constant is stored as an attribute to save memory. If you want rowSums to calculate the correct predicted values, you either need to create an extra column containing the constant or just add the constant to the output of rowSums (see the unnecessarily verbose example below).
rowSums_lm <- function(A){
    ## A: the matrix returned by predict(<lm object>, type = "terms")
    if(!is.matrix(A) || is.null(attr(A, "constant"))){
        stop("Input must be a matrix with a 'constant' attribute")
    }
    rowSums(A) + attr(A, "constant")
}
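A quick usage check with the objects from the first code block (fit, tt, pp); this should return TRUE:
all.equal(rowSums_lm(tt), pp)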
I have a question regarding the ordinal package in R, specifically the predict.clm() function. I would like to calculate the linear predictor of an ordered probit estimation. With the polr function of the MASS package the linear predictor can be accessed via object$lp; it gives one value per observation and matches what I understand the linear predictor to be, namely X_i'beta. If I instead use predict(object, newdata, type = "linear.predictor") on an ordered probit estimation fitted with clm(), I get a list with the elements eta1 and eta2,
with one column each, if newdata contains the dependent variable, or
with as many columns as there are levels in the dependent variable, if newdata doesn't contain the dependent variable.
Unfortunately I don't have a clue what that means, and I can't find any information about it in the documentation or the author's papers. Would someone be so kind as to enlighten me?
Cheers,
AK
UPDATE (after comment):
The basic clm model is defined like this (see the clm tutorial for details):
g(P(Y_i <= j)) = theta_j - x_i'beta,   j = 1, ..., J-1
where g is the link function, the theta_j are the threshold (intercept) parameters, and beta are the regression coefficients.
Generating data:
library(ordinal)
set.seed(1)
test.data <- data.frame(y=gl(4,5),
                        x=matrix(c(sample(1:4,20,TRUE)+rnorm(20), rnorm(20)), ncol=2))
head(test.data) # two independent variables
test.data$y # four levels in y
Constructing models:
## (data.frame() splits the matrix into the columns x.1 and x.2)
fm.polr <- MASS::polr(y ~ x.1 + x.2, data=test.data)  # using polr
fm.clm <- clm(y ~ x.1 + x.2, data=test.data)          # using clm
Now we can access thetas and betas (see formula above):
# Thetas
fm.polr$zeta # using polr
fm.clm$alpha # using clm
# Betas
fm.polr$coefficients # using polr
fm.clm$beta # using clm
Obtaining the linear predictor x_i'beta (the part of the formula that does not involve the thresholds theta):
fm.polr$lp # using polr
apply(test.data[,2:3], 1, function(x) sum(fm.clm$beta*x)) # using clm
New data generation:
# Contains only independent variables
new.data <- data.frame(x=matrix(c(rnorm(10)+sample(1:4,10,T), rnorm(10)), ncol=2))
new.data[1,] <- c(0,0) # intentionally for demonstration purpose
new.data
There are four types of predictions available for a clm model. We are interested in type="linear.predictor", which returns a list with two matrices, eta1 and eta2. They contain the linear predictors for each observation in new.data:
lp.clm <- predict(fm.clm, new.data, type="linear.predictor")
lp.clm
Note 1: eta1 and eta2 contain essentially the same values; eta2 is just eta1 shifted by one in the category index j, so between them they leave the two ends of the linear-predictor scale open.
all.equal(lp.clm$eta1[,1:3], lp.clm$eta2[,2:4], check.attributes=FALSE)
# [1] TRUE
Note 2: The prediction for the first row of new.data is equal to the thetas (since we set that row to zeros).
all.equal(lp.clm$eta1[1,1:3], fm.clm$alpha, check.attributes=FALSE)
# [1] TRUE
Note 3: We can construct such predictions manually. For instance, the prediction for the second row of new.data:
second.line <- fm.clm$alpha - sum(fm.clm$beta*new.data[2,])
all.equal(lp.clm$eta1[2,1:3], second.line, check.attributes=FALSE)
# [1] TRUE
Note 4: If new.data contains the response variable, then predict returns only the linear predictor for the observed level of y. Again we can check this manually:
new.data$y <- gl(4,3,length=10)
lp.clm.y <- predict(fm.clm, new.data, type="linear.predictor")
lp.clm.y
lp.manual <- sapply(1:10, function(i) lp.clm$eta1[i,new.data$y[i]])
all.equal(lp.clm.y$eta1, lp.manual)
# [1] TRUE