95% CI for survival linear combination (interaction) using vcov in R

I have this model:
pwp_gt_int <- coxph(Surv(tstart2, tstop2, status == 1) ~ TD + TD:strata(event))
where TD is a binary variable and the strata variable (event) is numeric, taking the values {1, 2, 3}. I need to get 95% CIs for these two linear combinations: $\beta_1 + 2\beta_2$ and $\beta_1 + 3\beta_3$. I have this code to construct confidence intervals:
mod_summ <- summary(pwp_gt_int)
coefs <- mod_summ$coefficients
X <- model.matrix(pwp_gt_int)
dof <- nrow(X) - ncol(X)
coefs_var <- vcov(pwp_gt_int)
halfCI <- qt(0.975, dof) * sqrt(diag(coefs_var))
matrix(c(coefs - halfCI, coefs + halfCI), nrow = 3)
but I need something like this:
coefs[2] = coefs[1] + 2*coefs[2]
coefs[3] = coefs[1] + 3*coefs[3]
matrix(c(coefs - halfCI, coefs + halfCI), nrow=3)
But the CIs I get are not plausible; I think I'm not getting the variance-covariance matrix of the linear combinations right.
Please help.

It looks like you're asking for two different things: one is the variance of a linear combination, and the other is a confidence interval (and as such, a variance) for a non-linear combination. The linear combination is relatively easy. We know that the variance of a linear combination is
$$V(Ab) = A\,V(b)\,A^{\top}$$
where $A$ is a matrix of constants and $V(b)$ is the variance-covariance matrix of the random variables (in this case, the coefficients). If your coefficient vector has three values in it, and you want the combinations you suggest in your last block of code, then you would define
$$A = \begin{pmatrix} 1 & 2 & 0 \\ 1 & 0 & 3 \end{pmatrix}$$
or in R:
A <- matrix(c(1, 1, 2, 0, 0, 3), ncol = 3)
Then, you could make the linear combinations and their variances with:
b <- matrix(coef(pwp_gt_int)[1:3], ncol = 1)
V <- vcov(pwp_gt_int)[1:3, 1:3]
lincom <- A %*% b                    # the two linear combinations
v_lincom <- A %*% V %*% t(A)         # their variance-covariance matrix
sds <- sqrt(diag(v_lincom))
crit <- qt(0.975, dof)
cis <- cbind(lincom - crit * sds, lincom + crit * sds)
That would be the confidence interval for the linear combination. The problem is that there isn't such an easy formula for the variance of a non-linear combination. Further, the confidence intervals may be asymmetric. One thing you could do is an end-point transformation, where you take lincom and cis and then exponentiate all of them. Another option would be a parametric bootstrap. Here's what that would look like.
# draw coefficient vectors from their approximate sampling distribution
B <- MASS::mvrnorm(2500, b, V)
nlcom <- exp(A %*% b)            # point estimates of the non-linear combinations
nlsim <- exp(A %*% t(B))         # the same combinations for each simulated draw
nlcis <- apply(nlsim, 1, quantile, c(0.025, 0.975))
Now, nlcis would have the confidence bounds for the non-linear combination. This should work given your data, but without the data to try it out, I'm not sure.
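For comparison, the end-point transformation mentioned above just exponentiates the estimate and the symmetric interval computed on the linear scale; a minimal sketch reusing lincom and cis from the code above:
# columns: point estimate, lower bound, upper bound (now on the exponentiated scale)
exp(cbind(lincom, cis))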

Related

How to manually calculate coefficients for Gamma GLM

The input I'm giving to the glm function is:
glm(family = fam, data = regFrame1, start = starter1,
    formula = as.formula(paste(yvar, "~ . + 0")),
    na.action = na.exclude, y = TRUE)
where the family is Gamma and the link function is identity.
I'm trying to manually reproduce the coefficients from my model, one of which is, for example:
              Estimate Std. Error t value Pr(>|t|)
coefficient A 480.6062   195.2952   2.461 0.013902 *
I know the equation I need for coefficient A is
$$\hat{\beta}_A = (X^{T}X)^{-1}X^{T}Y$$
where $Y$ is my dependent variable and $X$ is my independent variable.
In R I write this to produce $\hat{\beta}_A$:
# x transposed multiplied by x when both are matrices
xtx <- t(x) %*% x
# x transposed multiplied by y when both are matrices
xty <- t(x) %*% y
# we need to inverse xtx
xtxinv <- solve(xtx, tol=0)
# finally we multiply the inverse of xtx by xty to get betaHat
betaHat <- xtxinv %*% xty
This gives betaHat = 148.
When I do this calculation manually, I get the coefficient that glm produces with the default Gaussian family (i.e. without specifying a family), which looks like this:
glm(data=regFrame1,formula=as.formula(paste(yvar,"~.+0")),na.action=na.exclude,y=T)
So the question is: how do I tailor my manual calculation to the Gamma family with identity link, instead of the Gaussian identity default used by glm.fit in R?
The only two differences between my two runs of the glm function are: (1) providing the family (Gamma with identity link), and (2) giving starting values (100 for each column in the data frame).
I tried to recreate the glm.fit calculation manually to get the coefficient (beta) out. When I don't provide a family or starting values I get the correct answer, but when I give Gamma as the family and identity as the link, with starting values, I get a very different coefficient.
For linear regression, which is fit by least squares, $\hat{\beta}$ is indeed $(X^{T}X)^{-1}X^{T}Y$. However, a generalized linear model is fit by iteratively reweighted least squares (IRLS), which is an iterative algorithm, so there is no direct formula to compute $\hat{\beta}$. We can, however, compute the analogue of the hat matrix $H$ from linear regression. In linear regression, the hat matrix is $H = X(X^{T}X)^{-1}X^{T}$. In a generalized linear model, the analogue of the hat matrix is
$$H = WX(X^{T}WX)^{-1}X^{T}, \qquad W = \operatorname{diag}\big(\mu'(X\beta)\big).$$
In both cases, $Hy$ gives the fitted values $\hat{y}$. Here is code to demonstrate.
#' Test that the two parameterizations of Gamma are the same
curve(dgamma(x, 3, scale=3), xlim=c(0, 10))
grid <- seq(0, 10, length=1000)
d <- 1/grid/gamma(3)*(grid/(1/3)/9)^3*exp(-grid/3)
plot(grid, d, type='l')
#' Generate random variates according to a GLM with
#'   Y_i ~ Gamma(mean = mu,
#'               squared coefficient of variation (variance over squared mean) = phi)
#' Equivalently, Y_i ~ Gamma(shape = alpha, scale = beta) with
#'   mu  = alpha * beta
#'   phi = 1 / alpha
#' Let the coefficient vector be Beta = (3, 4)
set.seed(123)
X <- data.frame(x1 = runif(1000, 0, 10))
mu <- (3 + 4 * X$x1)^(-1)   # inverse link, the default for the Gamma family
y <- NULL
for (i in 1:1000) {
  alpha <- 1/3              # phi = 1/alpha = 3
  beta  <- mu[i] * 3        # scale, so that alpha * beta = mu[i]
  y[i]  <- rgamma(1, alpha, scale = beta)
}
#' Fit the model and compute the hat matrix, then the fitted values manually
mod <- glm(y ~ ., family = Gamma(), data = X)
x <- as.matrix(cbind(1, X))
W <- diag(c(-(x %*% c(3, 4))^(-2)))   # W = diag(mu'(X beta)): for the inverse link, mu'(eta) = -eta^(-2)
H <- W %*% x %*% solve(t(x) %*% W %*% x) %*% t(x)
# Manual fitted values
head(H %*% y)
# Fitted values from model
head(mod$fitted.values)
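To connect this back to the original question about the identity link: there is no one-step formula, but the IRLS iteration itself is short. Below is a minimal sketch (not from the original answer) of IRLS for a Gamma GLM with identity link; irls_gamma_identity is a made-up helper name, x is assumed to be the design matrix, y the positive response, and beta0 starting values that keep the fitted means positive (e.g. the 100s from the question):
irls_gamma_identity <- function(x, y, beta0, maxit = 100, tol = 1e-10) {
  beta <- beta0
  for (i in seq_len(maxit)) {
    mu <- drop(x %*% beta)   # identity link: mu = eta = X beta
    w  <- 1 / mu^2           # IRLS weights: (dmu/deta)^2 / V(mu) = 1/mu^2 for Gamma
    z  <- y                  # working response reduces to y for the identity link
    beta_new <- solve(t(x) %*% (w * x), t(x) %*% (w * z))
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  drop(beta)
}
# At convergence this should agree (up to tolerance) with
# coef(glm(y ~ x + 0, family = Gamma(link = "identity"), start = beta0))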

Fix variances to specific values in lme4/lmer

I am doing a simulation study for a mixed effect model (three levels; observations nested within subjects within schools):
f <- lmer(measurement ~ time + race + gender + s_ses +
fidelity + (1 + time|school/subject), mydata_long, REML=0)
The model allows the intercept and time slope to vary across subjects and schools. I am wondering how I can fix the variances to specific values. I do know how to do that when there is only a random intercept:
VarCorr(f)['subject:school']<-0.13
VarCorr(f)['school']<-0.20
However, when there is a random slope, this code doesn't work, since the variance component now has several parts (see the attached picture).
How can I fix the variances of subject:school (Intercept), subject:school time, school (Intercept), and school time to specific values in this case? Any suggestions?
A simulation example. The hardest part is getting the random-effects parameters correctly specified. The key things you need to know are: (1) internally, the random-effects variance matrix is scaled by the residual variance; (2) for vector-valued random effects (as in this random-slopes model), the variance-covariance matrix is specified in terms of its Cholesky factor: if we want covariance matrix V, there is a lower-triangular matrix C such that C %*% t(C) == V. We compute C as t(chol(V)) (chol() returns the upper-triangular factor), then read off the elements of the lower triangle (including the diagonal) in column-major order (see the helper functions below).
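For a single 2x2 block, the conversion described above looks like this (an illustrative sketch using the same numbers as tt1 below; the variable names are placeholders):
sd1 <- 0.7; sd2 <- 1.2; rho <- 0.3; sigma <- 0.5
V <- matrix(c(sd1^2, rho * sd1 * sd2, rho * sd1 * sd2, sd2^2), 2, 2) / sigma^2  # scaled by the residual variance
C <- t(chol(V))                              # lower-triangular Cholesky factor
theta_block <- C[lower.tri(C, diag = TRUE)]  # column-major lower triangle, as lme4 stores it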
Set up experimental design (simplified from yours, but with the same random effects components):
mydata_long <- expand.grid(time=1:40,
school=factor(letters[1:25]),
subject=factor(LETTERS[1:25]))
Helper functions to convert from (i) a vector of standard deviations, (ii) one or more correlation parameters (in lower-triangular/column-major order), and (iii) a residual standard deviation to a vector of "theta" parameters as used internally by lme4 (see the description above), and back the other way (conv_chol):
conv_sc <- function(sdvec, cor, sigma) {
  ## construct symmetric matrix with cor in lower/upper triangles
  cormat <- matrix(1, nrow = length(sdvec), ncol = length(sdvec))
  cormat[lower.tri(cormat)] <- cor
  cormat[upper.tri(cormat)] <- t(cormat)[upper.tri(cormat)]
  ## convert to covariance matrix and scale by 1/sigma^2
  V <- outer(sdvec, sdvec) * cormat / sigma^2
  ## extract lower triangle in column-major order
  return(t(chol(V))[lower.tri(V, diag = TRUE)])
}
conv_chol <- function(ch, s) {
  m <- matrix(NA, 2, 2)
  m[lower.tri(m, diag = TRUE)] <- ch
  m[upper.tri(m)] <- 0
  V <- m %*% t(m) * s^2
  list(sd = sqrt(diag(V)), cor = cov2cor(V)[1, 2])
}
If you want to start from covariance matrices rather than standard deviations and correlations you can modify the code to skip some steps (starting and ending with V).
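For example, a variant that starts directly from a (within-group) covariance matrix might look like this; conv_V is a made-up name, not part of lme4:
conv_V <- function(V, sigma) {
  Vs <- V / sigma^2                        # scale by the residual variance
  t(chol(Vs))[lower.tri(Vs, diag = TRUE)]  # lower triangle of the Cholesky factor, column-major
}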
Pick some values and convert (and back-convert, to check)
tt1 <- conv_sc(c(0.7, 1.2), 0.3, 0.5)
tt2 <- conv_sc(c(1.4, 0.2), -0.2, 0.5)
tt <- c(tt1, tt2)
conv_chol(tt1, s=0.5)
conv_chol(tt2, s=0.5)
Set up formula and simulate:
form <- m ~ time + (1 + time|school/subject)
set.seed(101)
mydata_long$m <- simulate(form[-2],   ## [-2] drops the response
                          family = gaussian,
                          newdata = mydata_long,
                          newparams = list(theta = tt,
                                           beta = c(1, 1),
                                           sigma = 0.5))[[1]]
f <- lmer(form, data=mydata_long, REML=FALSE)
VarCorr(f)
The fitted results are close to what we requested above ...
 Groups         Name        Std.Dev. Corr
 subject:school (Intercept) 0.66427
                time        1.16488   0.231
 school         (Intercept) 1.78312
                time        0.22459  -0.156
 Residual                   0.49772
Now do the same thing 200 times, to explore the distribution of estimates:
simfun <- function() {
  mydata_long$m <- simulate(form[-2],
                            family = gaussian,
                            newdata = mydata_long,
                            newparams = list(theta = tt,
                                             beta = c(1, 1),
                                             sigma = 0.5))[[1]]
  f <- lmer(form, data = mydata_long, REML = FALSE)
  return(as.data.frame(VarCorr(f))[, "sdcor"])
}
set.seed(101)
res <- plyr::raply(200, suppressMessages(simfun()), .progress = "text")
Here plyr::raply() is used for convenience; you can do this however you like (for loop, lapply(), replicate(), purrr::map(), ...).
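For instance, the base-R equivalent using replicate() (one of the alternatives just mentioned) would be:
res <- t(replicate(200, suppressMessages(simfun())))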
par(las = 1)
boxplot(res)
## add the true values to the plot (sd, sd, cor for each grouping level, then the residual sd)
points(1:7, c(0.7, 1.2, 0.3, 1.4, 0.2, -0.2, 0.5), col = 2, cex = 3, lwd = 3)

R: Confidence intervals on non-linear fit with a non-analytic model

I need to fit x-y data with a model, which is non-analytic. I have a function f(x) that calculates the model for each x numerically, but there is no analytical equation. For the fit, I use optim in R. I minimise RMS between the model and the data. It works well and returns reasonable parameters.
I would like to find confidence intervals (or at least standard errors) for the best-fitting parameters. I found on the internet that this can be done from the Hessian matrix, but only when maximising a log-likelihood function. I don't know how to do that here; all I have is x, y and f(x), from which I find the RMS. Alas, I have no good way of estimating errors on y.
How can I find confidence intervals on my fit parameters?
Edit: perhaps an example in R might help explain what I'm asking for. This example uses a simple analytic function to fit the data; in my real case the function is non-analytic, so I cannot use, e.g., nls.
set.seed(666)
# generate data
x <- seq(100) / 100
y <- 0.5 * x + rnorm(100, sd = 0.03) + 0.2
# function to fit
f <- function(x, a, b) {
  a * x + b
}
# error function to minimise: RMS
errfun <- function(par, x, y) {
  a <- par[1]
  b <- par[2]
  err <- sqrt(sum((f(x, a, b) - y)^2))
}
# use optim to fit the model to the data
par <- c(1, 0)
res <- optim(par, errfun, gr = NULL, x, y)
# best-fitting parameters
best_a <- res$par[1]
best_b <- res$par[2]
The best fitting parameters are a = 0.50 and b = 0.20. I need to find 95% confidence intervals on these.
This is a job for the bootstrap:
(1) Create a large number of synthetic datasets x*. These are created by sampling from x with replacement, with the same number of data points as were in x. For example, if your data is (1,2,3,4,5,6), an x* might be (5,2,4,4,2,3) (note that values might appear multiple times, or not at all, because we are sampling with replacement).
(2) For each x*, calculate f(x*). If there are other parameters which don't depend on the data, don't change them (so f(x,a,b,c) becomes f(x*,a,b,c) as long as a, b, c don't depend on x). Call these quantities f*.
(3) You can estimate anything you want from these f*. If you want the standard deviation of f(x), take the standard deviation of f*. If you want the 95% confidence interval, take the range from the 2.5 to the 97.5 percentiles of f*. More formally, if you want to estimate g(f(x)) you estimate it as g(f(x*)).
I should say this is a very practically-oriented explanation of the bootstrap. I have glossed over many theoretical details, but the bootstrap is near-universally applicable (basically as long as the thing you are trying to estimate actually exists, you are usually okay).
To apply this to the example you have given in your code:
x <- seq(100) / 100
y <- 0.5 * x + rnorm(100, sd = 0.03) + 0.2
# function to fit
f <- function(x, a, b) {
  a * x + b
}
# error function to minimise: RMS
errfun <- function(par, x, y) {
  a <- par[1]
  b <- par[2]
  err <- sqrt(sum((f(x, a, b) - y)^2))
}
# this is the part where we bootstrap
# use optim to fit the model to the data
best_a <- best_b <- numeric(10000)
for (i in 1:10000) {
  j <- sample(100, replace = TRUE)
  x.boot <- x[j]; y.boot <- y[j]
  par <- c(1, 0)
  res <- optim(par, errfun, gr = NULL, x.boot, y.boot)
  # best-fitting parameters
  best_a[i] <- res$par[1]
  best_b[i] <- res$par[2]
}
# now, we look at the *vector* best_a
# for example, if you want the standard deviation of a,
sd(best_a)
# or a 95% confidence interval for b,
quantile(best_b, c(0.025, 0.975))
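And likewise for a, which the question also asks about, the 95% percentile interval is:
quantile(best_a, c(0.025, 0.975))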

Individual terms in prediction of linear regression

I performed a regression analysis in R on some dataset and am trying to predict the contribution of each individual independent variable to the dependent variable for each row in the dataset.
So something like this:
set.seed(123)
y <- rnorm(10)
m <- data.frame(v1=rnorm(10), v2=rnorm(10), v3=rnorm(10))
regr <- lm(formula=y~v1+v2+v3, data=m)
summary(regr)
terms <- predict.lm(regr,m, type="terms")
In short: run a regression and use the predict function to calculate the terms of v1, v2 and v3 in dataset m. But I am having a hard time understanding what the predict function is calculating. I would expect it to multiply the regression coefficients by the variable data, so something like this for v1:
coefficients(regr)[2]*m$v1
But that gives different results compared to the predict function.
Own calculation:
0.55293884 0.16253411 0.18103537 0.04999729 -0.25108302 0.80717945 0.22488764 -0.88835486 0.31681455 -0.21356803
And predict function calculation:
0.45870070 0.06829597 0.08679724 -0.04424084 -0.34532115 0.71294132 0.13064950 -0.98259299 0.22257641 -0.30780616
The predict function is off by 0.1 or so. Also, if you add all the terms from the predict function together with the constant, it doesn't add up to the total prediction (using type="response"). What does the predict function calculate here, and how can I tell it to calculate what I did with coefficients(regr)[2]*m$v1?
All the following lines result in the same predictions:
# our computed predictions
coefficients(regr)[1] + coefficients(regr)[2]*m$v1 +
coefficients(regr)[3]*m$v2 + coefficients(regr)[4]*m$v3
# prediction using predict function
predict.lm(regr,m)
# prediction using terms matrix, note that we have to add the constant.
terms_predict = predict.lm(regr,m, type="terms")
terms_predict[,1]+terms_predict[,2]+terms_predict[,3]+attr(terms_predict,'constant')
You can read more about using type="terms" here.
The reason that your own calculation (coefficients(regr)[2]*m$v1) and the predict function calculation (terms_predict[,1]) are different is because the columns in the terms matrix are centered around the mean, so their mean becomes zero:
# this is equal to terms_predict[,1]
coefficients(regr)[2]*m$v1-mean(coefficients(regr)[2]*m$v1)
# indeed, all columns are centered; i.e. have a mean of 0.
round(sapply(as.data.frame(terms_predict),mean),10)
Hope this helps.
The function predict(..., type="terms") centers each variable by its mean. As a result, the output is a little difficult to interpret. Here's an alternative where each variable (constant, x1, and x2) is multiplied by its coefficient.
TLDR: pred_terms <- model.matrix(formula(mod$terms), testData) %*% diag(coef(mod))
library(tidyverse)
### simulate data
set.seed(123)
nobs <- 50
x1 <- cumsum(rnorm(nobs) + 3)
x2 <- cumsum(rnorm(nobs) * 3)
y <- 2 + 2*x1 -0.5*x2 + rnorm(nobs,0,50)
df <- data.frame(t=1:nobs, y=y, x1=x1, x2=x2)
train <- 1:round(0.7*nobs,0)
rm(x1, x2, y)
trainData <- df[train,]
testData <- df[-train,]
### linear model
mod <- lm(y ~ x1 + x2 , data=trainData)
summary(mod)
### predict test set
test_preds <- predict(mod, newdata=testData)
head(test_preds)
### contribution by predictor
test_contribution <- model.matrix(formula(mod$terms), testData) %*% diag(coef(mod))
colnames(test_contribution) <- names(coef(mod))
head(test_contribution)
all(round(apply(test_contribution, 1, sum),5) == round(test_preds,5)) ## should be true
### Visualize each contribution
test_contribution_df <- as.data.frame(test_contribution)
test_contribution_df$pred <- test_preds
test_contribution_df$t <- row.names(test_contribution_df)
test_contribution_df$actual <- df[-train,"y"]
test_contribution_df_long <- pivot_longer(test_contribution_df, -t, names_to="variable")
names(test_contribution_df_long)
ggplot(test_contribution_df_long, aes(x=t, y=value, group=variable, color=variable)) +
geom_line() +
theme_bw()

Get Residual Variance-Covariance Matrix in lme4

I am fitting a linear mixed effects model using lme4:
library(lme4)
data(Orthodont, package = "nlme")   # the Orthodont data ships with nlme
dent <- Orthodont
d.test <- lmer(distance ~ age + (1|Subject), data = dent)
If we say generically Y = X * B + Z * d + e is the form of a linear mixed effects model, then I am trying to get Var(Y) = Z * Var(d) * Z^t + Var(e) from the results of the model.
Is the following formulation the right way to do this?
k <- table(dent$Subject)[1]
vars <- VarCorr(d.test)
v <- as.data.frame(vars)
sigma <- attr(vars, "sc")
s.tech <- diag(v$vcov[1], nrow=k)
icc <- v$vcov[1]/sum(v$vcov)
s.tech[upper.tri(s.tech)] <- icc
s.tech[lower.tri(s.tech)] <- icc
sI <- diag(sigma^2, nrow=length(dent$age))
var.b <- kronecker(diag(1, nrow=length(dent$age)/k), s.tech)
var.y <- sI + var.b
I think this is a simple question, but I can't find anywhere code for doing this, so I'm asking if I'm doing it right.
You can do this a bit more easily if you know about getME(), which is a general-purpose extract-bits-of-a-lmer-fit function. In particular, you can extract the transposed Z matrix (getME(., "Zt")) and the transposed Lambda matrix (getME(., "Lambdat")); the Lambda matrix is the Cholesky factor of the scaled variance-covariance matrix of the conditional modes (BLUPs). In your notation, Var(d) is the residual variance times the cross-product of Lambda.
The answer cited here is pretty good but the answer below is slightly more general (it should work for any lmer fit).
Fit model:
library(lme4)
data(Orthodont,package="nlme")
d.test <- lmer(distance ~ age + (1|Subject), data=Orthodont)
Extract components:
var.d <- crossprod(getME(d.test,"Lambdat"))
Zt <- getME(d.test,"Zt")
vr <- sigma(d.test)^2
Combine them:
var.b <- vr*(t(Zt) %*% var.d %*% Zt)
sI <- vr * Diagonal(nrow(Orthodont))
var.y <- var.b + sI
A picture:
image(var.y)
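As a quick sanity check (an added illustration, not part of the original answer): for this random-intercept model, each diagonal element of var.y should equal the between-subject variance plus the residual variance, and within-subject off-diagonal elements should equal the between-subject variance alone:
vc <- as.data.frame(VarCorr(d.test))
all.equal(unname(diag(var.y)[1]), sum(vc$vcov))   # subject variance + residual variance
all.equal(unname(var.y[1, 2]), vc$vcov[1])        # observations 1 and 2 share a subject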
