Get Residual Variance-Covariance Matrix in lme4 - r

I am fitting a linear mixed effects model using lme4:
library(lme4)
data(Orthodont, package="nlme")  # Orthodont ships with nlme
dent <- Orthodont
d.test <- lmer(distance ~ age + (1|Subject), data=dent)
If we say generically Y = X * B + Z * d + e is the form of a linear mixed effects model, then I am trying to get Var(Y) = Z * Var(d) * Z^t + Var(e) from the results of the model.
Is the following formulation the right way to do this?
k <- table(dent$Subject)[1]
vars <- VarCorr(d.test)
v <- as.data.frame(vars)
sigma <- attr(vars, "sc")
s.tech <- diag(v$vcov[1], nrow=k)
icc <- v$vcov[1]/sum(v$vcov)
s.tech[upper.tri(s.tech)] <- icc
s.tech[lower.tri(s.tech)] <- icc
sI <- diag(sigma^2, nrow=length(dent$age))
var.b <- kronecker(diag(1, nrow=length(dent$age)/k), s.tech)
var.y <- sI + var.b
I think this is a simple question, but I can't find code for doing this anywhere, so I'm asking if I'm doing it right.

You can do this a bit more easily if you know about getME(), which is a general-purpose extract-bits-of-a-lmer-fit function. In particular, you can extract the transposed Z matrix (getME(.,"Zt")) and the transposed Lambda matrix (getME(.,"Lambdat")); the Lambda matrix is the Cholesky factor of the scaled variance-covariance matrix of the conditional modes (BLUPs). In your notation, Var(d) is the residual variance times the cross-product of Lambda.
The previously cited answer is pretty good, but the answer below is slightly more general (it should work for any lmer fit).
Fit model:
library(lme4)
data(Orthodont,package="nlme")
d.test <- lmer(distance ~ age + (1|Subject), data=Orthodont)
Extract components:
var.d <- crossprod(getME(d.test,"Lambdat"))  # Var(d)/sigma^2: cross-product of the relative Cholesky factor
Zt <- getME(d.test,"Zt")                     # transposed random-effects model matrix Z'
vr <- sigma(d.test)^2                        # residual variance
Combine them:
var.b <- vr*(t(Zt) %*% var.d %*% Zt)   # Z Var(d) Z'
sI <- vr * Diagonal(nrow(Orthodont))   # Var(e): residual variance on the diagonal
var.y <- var.b + sI
A picture:
image(var.y)
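As a quick sanity check, here is a sketch using the objects above (and relying on the fact that the rows of Orthodont are grouped by subject): for a random-intercept model, each within-subject block of var.y should be the between-subject variance everywhere plus the residual variance on the diagonal.
vc     <- as.data.frame(VarCorr(d.test))
sig2_b <- vc$vcov[1]                                      # between-subject (intercept) variance
sig2_e <- sigma(d.test)^2                                 # residual variance
k      <- sum(Orthodont$Subject == Orthodont$Subject[1])  # observations for the first subject
block  <- matrix(sig2_b, k, k) + diag(sig2_e, k)          # expected within-subject block
all.equal(as.matrix(var.y[1:k, 1:k]), block, check.attributes = FALSE)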

Related

Replace lm coefficients and calculate results of the new lm in R

I am able to change the coefficients of my linear model. Then I want to compare the results of my "new" model with the new coefficients, but R is not calculating the results with the new coefficients.
As you can see in the example below, the summaries of my models fit and fit1 are exactly the same, although results like the multiple R-squared or the fitted values should change.
set.seed(2157010) #forgot set.
x1 <- 1998:2011
x2 <- x1 + rnorm(length(x1))
y <- 3*x2 + rnorm(length(x1)) #you had x, not x1 or x2
fit <- lm( y ~ x1 + x2)
# view original coefficients
coef(fit)
# generate second function for comparing results
fit1 <- fit
# replace coefficients with new values, use whole name which is coefficients:
fit1$coefficients[2:3] <- c(5, 1)
# view new coefficients
coef(fit1)
# Comparing
summary(fit)
summary(fit1)
Thanks in advance
It might be easier to compute the multiple R^2 yourself with the substituted parameters.
mult_r2 <- function(beta, y, X) {
  tot_ss <- var(y) * (length(y) - 1)   # total sum of squares
  rss <- sum((y - X %*% beta)^2)       # residual sum of squares
  1 - rss/tot_ss
}
(or, more compactly, following the comments, you could compute p <- X %*% beta; (cor(y, p))^2; note, though, that the squared correlation equals this R^2 only for the least-squares fit, not for arbitrary substituted coefficients)
mult_r2(coef(fit), y = model.response(model.frame(fit)), X = model.matrix(fit))
## 0.9931179, matches summary()
Now with new coefficients:
new_coef <- coef(fit)
new_coef[2:3] <- c(5,1)
mult_r2(new_coef, y = model.response(model.frame(fit)), X = model.matrix(fit))
## [1] -343917
That last result seems pretty wild, but the substituted coefficients are very different from the true least-squares coeffs, and negative R^2 is possible when the model is bad enough ...
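To see why the fit statistics in summary(fit1) do not change: summary.lm() works from the residuals and fitted values stored inside the lm object, so replacing the coefficients alone does not recompute them. A minimal sketch (reusing new_coef and the objects defined above) that recomputes the fit by hand:
X <- model.matrix(fit)
new_fitted <- drop(X %*% new_coef)    # fitted values under the substituted coefficients
new_resid  <- y - new_fitted
c(old_rss = sum(resid(fit)^2), new_rss = sum(new_resid^2))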

95% CI for survival linear combination (interaction) using vcov

I have this model (the fit is shown below), where TD is a binary variable and Strata is a numeric variable equal to 1, 2, or 3. I need to get 95% CIs for two linear combinations of the coefficients (the ones constructed in the second code block below). I use the following code to construct confidence intervals:
pwp_gt_int <- coxph(Surv(tstart2, tstop2, status==1) ~ TD + TD:strata(event))
mod_summ <- summary(pwp_gt_int)
coefs <- mod_summ$coefficients
X <- model.matrix(pwp_gt_int)
dof <- nrow(X) - ncol(X)
coefs_var <- vcov(pwp_gt_int)
halfCI <- qt(0.975, dof) * sqrt(diag(coefs_var))
matrix(c(coefs - halfCI, coefs + halfCI), nrow=3)
but I need something like this:
coefs[2] = coefs[1] + 2*coefs[2]
coefs[3] = coefs[1] + 3*coefs[3]
matrix(c(coefs - halfCI, coefs + halfCI), nrow=3)
But the CIs I got are not plausible; I think I'm not getting the variance-covariance matrix for the linear combinations right.
Please help.
It looks like you're asking for two different things - one is the variance of a linear combination and the other is a confidence interval (and as such, a variance) for a non-linear combination. The linear combination is relatively easy. We know that the variance of a linear combination is
V(Ab) = A V(b) A'
where A is a matrix of constants and V(b) is the variance-covariance matrix of the random variables (in this case, the coefficients). If your coefficient vector has three values in it, and you want the two combinations from your last block of code, then you would define
A = | 1 2 0 |
    | 1 0 3 |
or in R:
A = matrix(c(1,1,2,0,0,3), ncol=3)
Then, you could make the linear combinations and their variances with:
b <- matrix(coef(pwp_gt_int)[1:3], ncol=1)
V <- vcov(pwp_gt_int)[1:3,1:3]
lincom <- A %*% b
v_lincom <- A %*% V %*% t(A)
sds <- sqrt(diag(v_lincom))
crit <- qt(.975, dof)
cis <- cbind(lincom - crit*sds, lincom + crit*sds)
That would be the confidence interval for the linear combination. The problem is that there isn't such an easy formula for the variance of a non-linear combination. Further, the confidence intervals may be asymmetric. One thing you could do is an end-point transformation, where you take lincom and cis and then exponentiate all of them. Another option would be a parametric bootstrap. Here's what that would look like.
B <- MASS::mvrnorm(2500, b, V)                     # draws from the approximate sampling distribution of the coefficients
nlcom <- exp(A %*% b)                              # point estimates of the exponentiated combinations
nlsim <- exp(A %*% t(B))                           # the same combinations for each simulated draw
nlcis <- apply(nlsim, 1, quantile, c(.025,.975))   # percentile confidence bounds
Now, nlcis would have the confidence bounds for the non-linear combination. This should work given your data, but without the data to try it out, I'm not sure.
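For comparison, the end-point transformation mentioned above amounts to exponentiating the linear-scale estimates and bounds directly (a sketch using the objects defined above):
exp_est <- exp(lincom)   # estimates on the exponentiated scale
exp_cis <- exp(cis)      # end-point transformed confidence bounds
cbind(exp_est, exp_cis)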

Observed information matrix for a logistic model

I have proposed my own model and am now trying to implement it in R, but I have got stuck on how to compute the observed information matrix from my formula. I have used glm() to fit a logistic model with a penalty term on a binary data set with variables x1, x2, x3, y (all binary 0/1); fit1 is the glm() model and dev.new is the penalised deviance.
X.tilde <- as.matrix(x) # n*p matrix of the data set
W <- Diagonal(length(y), weights) # n*n diagonal matrix of the weights
qq <- exp(fit1$fitted.values)/(1 + exp(fit1$fitted.values)) # n*1 vector (pi=probability of the logistic model )
cc <- t(1 - qq) # n*1 vector
gg <- (dev.new) * t(dev.new) # p*p matrix
ff <- (X.tilde) %*% t(X.tilde) # n*n matrix
pp <- exp(fit1$coefficients)/(1 + exp(fit1$coefficients)) # p*1 matrix
ss <- t(1/(1 + exp(fit1$coefficients))) # p*1 vector
aa <- t(X.tilde) %*% qq %*% cc %*% W %*% (X.tilde) # p*p matrix
firstP <- (aa + (pp * ss)) # p*p matrix
info.mat <- firstP+gg # p*p matrix
info.mat <- as.matrix(info.mat)
This code returns the following error:
Error in e1 + Matrix(e2) :
Matrices must have same number of rows for arithmetic
In my theory the dimensions are fine, but when I implement it they are not correct.
Any help?

Individual terms in prediction of linear regression

I performed a regression analysis in R on some dataset and am trying to compute the contribution of each individual independent variable to the dependent variable for each row in the dataset.
So something like this:
set.seed(123)
y <- rnorm(10)
m <- data.frame(v1=rnorm(10), v2=rnorm(10), v3=rnorm(10))
regr <- lm(formula=y~v1+v2+v3, data=m)
summary(regr)
terms <- predict.lm(regr,m, type="terms")
In short: run a regression and use the predict function to calculate the terms of v1, v2, and v3 in dataset m. But I am having a hard time understanding what the predict function is calculating. I would expect it to multiply the regression coefficient by the variable's values. So something like this for v1:
coefficients(regr)[2]*m$v1
But that gives different results compared to the predict function.
Own calculation:
0.55293884 0.16253411 0.18103537 0.04999729 -0.25108302 0.80717945 0.22488764 -0.88835486 0.31681455 -0.21356803
And predict function calculation:
0.45870070 0.06829597 0.08679724 -0.04424084 -0.34532115 0.71294132 0.13064950 -0.98259299 0.22257641 -0.30780616
The predict function is off by 0.1 or so. Also, if you add all the terms from the predict function together with the constant, it doesn't add up to the total prediction (using type="response"). What does the predict function calculate here, and how can I tell it to calculate what I did with coefficients(regr)[2]*m$v1?
All the following lines result in the same predictions:
# our computed predictions
coefficients(regr)[1] + coefficients(regr)[2]*m$v1 +
coefficients(regr)[3]*m$v2 + coefficients(regr)[4]*m$v3
# prediction using predict function
predict.lm(regr,m)
# prediction using terms matrix, note that we have to add the constant.
terms_predict = predict.lm(regr,m, type="terms")
terms_predict[,1]+terms_predict[,2]+terms_predict[,3]+attr(terms_predict,'constant')
You can read more about type="terms" in the predict.lm help page (?predict.lm).
The reason your own calculation (coefficients(regr)[2]*m$v1) and the predict function's calculation (terms_predict[,1]) differ is that the columns in the terms matrix are centered around their means, so each column's mean becomes zero:
# this is equal to terms_predict[,1]
coefficients(regr)[2]*m$v1-mean(coefficients(regr)[2]*m$v1)
# indeed, all columns are centered; i.e. have a mean of 0.
round(sapply(as.data.frame(terms_predict),mean),10)
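Relatedly, a sketch using the objects above: the 'constant' attribute is the intercept plus the means that were subtracted from each column, which is why the centered terms plus the constant reproduce the full predictions.
contrib_means <- c(mean(coefficients(regr)[2]*m$v1),
                   mean(coefficients(regr)[3]*m$v2),
                   mean(coefficients(regr)[4]*m$v3))
all.equal(unname(attr(terms_predict, "constant")),
          unname(coefficients(regr)[1] + sum(contrib_means)))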
Hope this helps.
The function predict(...,type="terms") centers each variable by its mean. As a result, the output is a little difficult to interpret. Here's an alternative where each variable (constant, x1, and x2) is multiplied by its coefficient.
TLDR: pred_terms <- model.matrix(formula(mod$terms), testData) %*% diag(coef(mod))
library(tidyverse)
### simulate data
set.seed(123)
nobs <- 50
x1 <- cumsum(rnorm(nobs) + 3)
x2 <- cumsum(rnorm(nobs) * 3)
y <- 2 + 2*x1 -0.5*x2 + rnorm(nobs,0,50)
df <- data.frame(t=1:nobs, y=y, x1=x1, x2=x2)
train <- 1:round(0.7*nobs,0)
rm(x1, x2, y)
trainData <- df[train,]
testData <- df[-train,]
### linear model
mod <- lm(y ~ x1 + x2 , data=trainData)
summary(mod)
### predict test set
test_preds <- predict(mod, newdata=testData)
head(test_preds)
### contribution by predictor
test_contribution <- model.matrix(formula(mod$terms), testData) %*% diag(coef(mod))
colnames(test_contribution) <- names(coef(mod))
head(test_contribution)
all(round(apply(test_contribution, 1, sum),5) == round(test_preds,5)) ## should be true
### Visualize each contribution
test_contribution_df <- as.data.frame(test_contribution)
test_contribution_df$pred <- test_preds
test_contribution_df$t <- row.names(test_contribution_df)
test_contribution_df$actual <- df[-train,"y"]
test_contribution_df_long <- pivot_longer(test_contribution_df, -t, names_to="variable")
names(test_contribution_df_long)
ggplot(test_contribution_df_long, aes(x=t, y=value, group=variable, color=variable)) +
geom_line() +
theme_bw()

Nonlinear regression with sampling weights (package survey)

I would like to estimate the coefficients of a nonlinear model with a binary dependent variable. The nonlinearity arises because two regressors, A and B, depend on a subset of the dataset and on the two parameters lambda1 and lambda2 respectively:
y = alpha + beta1 * A(lambda1) + beta2 * B(lambda2) + delta * X + epsilon
where, for each observation i,
A_i(lambda1) = sum_{k=1}^{a_i - 1} [ (a_i - k)^lambda1 / sum_{j=1}^{a_i - 1} (a_i - j)^lambda1 ] * R_{i,k}
where a and the R's are variables in the data.frame (this is what the function A() below computes). The regressor B(lambda2) is defined in a similar way.
Moreover, I need to include what in Stata are known as pweights, i.e. survey weights or sampling weights. For this reason, I'm working with the R package survey by Thomas Lumley.
First, I create a function for A (and B), i.e.:
A <- function(l1){
  R <- as.matrix(data[, 1:80])
  a <- data[, 169]
  N <- length(a)
  var <- numeric(N)
  for (i in 1:N) {
    ai <- rep(a[i], a[i]-1)  # vector of a(i)
    k <- 1:(a[i]-1)          # numbers from 1 to a(i)-1
    num <- (ai-k)^l1
    den <- sum((ai-k)^l1)
    w <- num/den
    w <- c(w, rep(0, dim(R)[2]-length(w)))
    var[i] <- R[i,] %*% w
  }
  return(var)
}
B <- function(l2){
  C <- as.matrix(data[, 82:160])
  a <- data[, 169]
  N <- length(a)
  var <- numeric(N)
  for (i in 1:N) {
    ai <- rep(a[i], a[i]-1)  # vector of a(i)
    k <- 1:(a[i]-1)          # numbers from 1 to a(i)-1
    num <- (ai-k)^l2
    den <- sum((ai-k)^l2)
    w <- num/den
    w <- c(w, rep(0, dim(C)[2]-length(w)))
    var[i] <- C[i,] %*% w
  }
  return(var)
}
But the problem is that I don't know how to include the nonlinear regressors in the model (or in the survey design, using the function svydesign):
d_test <- svydesign(id=~1, data = data, weights = ~data$hw0010)
Because, when I try to estimate the model:
# log-likelihood function:
LLsvy <- function(y, model, lambda1, lambda2){
  aux1 <- y * log(pnorm(model))
  aux2 <- (1-y) * log(1 - pnorm(model))
  LL <- aux1 + aux2
  return(LL)
}
fit <- svymle(loglike = LLsvy,
              formulas = list(~y, model = ~ A(lambda1) + B(lambda2) + X, lambda1 = ~1, lambda2 = ~1),
              design = d_test,
              start = list(c(0,0,0,0), c(lambda1=11), c(lambda2=8)),
              na.action = "na.exclude")
I get the error message:
Error in eval(expr, envir, enclos) : object 'lambda1' not found
I think that the problem is in including the nonlinear part, because everything works fine if I fix A and B for some lambda1 and lambda2 (so that the model becomes linear):
lambda1=11
lambda2=8
data$A <- A(lambda1)
data$B <- B(lambda2)
d_test <- svydesign(id=~1, data = data, weights = ~data$hw0010)
LLsvylin <- function(y, model){
  aux1 <- y * log(pnorm(model))
  aux2 <- (1-y) * log(1 - pnorm(model))
  LL <- aux1 + aux2
  return(LL)
}
fitlin <- svymle(loglike = LLsvylin,
                 formulas = list(~y, model = ~ A + B + X),
                 design = d_test,
                 start = list(0,0,0,0),
                 na.action = "na.exclude")
On the contrary, if I don't use the sampling weights, I can easily estimate my nonlinear model using the function mle from package stats4 or the function mle2 from package bbmle.
To sum up: how can I combine sampling weights (via svymle) with estimating a nonlinear model (which I can do using mle or mle2)?
=========================================================================
A problem with the nonlinear part of the model also arises when using the function svyglm (with fixed lambda1 and lambda2, in order to get good starting values for svymle):
lambda1=11
lambda2=8
model0 = y ~ A(lambda1) + B(lambda2) + X
probit1 <- svyglm(formula = model0,
                  data = data,
                  family = binomial(link = probit),
                  design = d_test)
Because I get the error message:
Error in svyglm.survey.design(formula = model0, data = data, family = binomial(link = probit), :
all variables must be in design= argument
This isn't what svymle does -- it's for generalised linear models, which have linear predictors and a potentially complicated likelihood or loss function. You want non-linear weighted least squares, with a simple loss function but complicated predictors.
There isn't an implementation of design-weighted nonlinear least squares in the survey package, probably because no-one has previously asked for one. You could try emailing the package author.
The upcoming version 4 of the survey package will have a function svynls, so if you know how to fit your model without sampling weights using nls you will be able to fit it with sampling weights.
