Nelson-Aalen estimate in mice - R

So I am dealing with the imputation of a data set containing time-to-event data. Several papers suggest that using the Nelson-Aalen estimate (as an approximation to the cumulative baseline hazard) gives better imputation results. Is there a way to compute the Nelson-Aalen estimate and bind it to my data set in R? I have found a function named nelsonaalen(data, time, event) in the mice package, but I am afraid it will cause an error, since I can only include one time variable (the failure time) in it. The variables in my data are as follows:
N = 1000
xt=runif(N, 0, 50)
x1=rnorm(N, 2, 1)
x2=rnorm(N, -2, 1)
x3 <- rnorm(N, 0.5*x1 + 0.5*x2, 2)
x4 <- rnorm(N, 0.3333*x1 + 0.3333*x2 + 0.3333*x3, 2 )
lp <- 0.05*x1 + 0.2*x2 + 0.1*x3 + 0.02*x4
T <- qweibull(runif(N,pweibull(xt,shape = 7.5, scale = 84*exp(-lp/7.5)),1), shape=7.5, scale=84*exp(-lp/7.5))
Cens1 <- 100
time_M <- pmin(T,Cens1)
event_M <- time_M == T  # TRUE if the failure time was observed before censoring
Here xt denotes the starting time, T denotes the failure time, and x1 to x4 are my covariates; I will create missing values in two of them (x3 and x4).
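For concreteness, a minimal sketch of the call I have in mind, assuming nelsonaalen() only needs the single follow-up time plus the event indicator:
library(mice)
dat <- data.frame(time = time_M, event = as.integer(event_M),
                  x1 = x1, x2 = x2, x3 = x3, x4 = x4)
# nelsonaalen(data, timevar, statusvar) returns the estimated cumulative
# hazard H(t), evaluated at each subject's own follow-up time
dat$hazard <- nelsonaalen(dat, time, event)
head(dat)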

Related

3-variable linear model in R

I want to get coefficients for a linear model related to synergism/antagonism between different chemicals.
Chemicals X, Y, Z. Coefficients b0...b7.
0 = b0 + b1x + b2y + b3z + b4xy + b5xz + b6yz + b7xyz
Some combination of X, Y, Z will kill 50% of cells (b0 represents this total effectiveness), and the sign/magnitude of the higher order terms represents interactions between chemicals.
Given real datapoints, I want to fit this model.
EDIT: I've gotten rid of the trivial solution by adding a forcing value at the start. Test data:
x1 <- c(0,1,2,3,4)
y1 <- c(0,2,1,5,4)
z1 <- c(0,1,-0.66667,-6,-7.25)
q <- c(-1,0,0,0,0)
model <- lm(q ~ x1*y1*z1)
This set has coefficients: -30, 12, 6, 4, 1, -1, 0, 0.5
EDIT: Progress made from before, will try some more data points. The first four coefficients look good (multiply by 30):
Coefficients:
(Intercept)        x1        y1        z1     x1:y1     x1:z1     y1:z1  x1:y1:z1
   -1.00000   0.47826   0.24943   0.13730  -0.05721        NA        NA        NA
EDIT: Adding more data points hasn't been successful so far, not sure if I need to have a certain minimum amount to be accurate.
Am I setting things up correctly? Once I have the coefficients, I want to solve for z so that I can plot a 3D surface. Thanks!
I was able to get the coefficients using just 16 arbitrary data points, and appending a point to exclude the trivial answer:
x1 <- c(0,1,2,3,4,1,2,3,4,1,2,3,4,5,6,7,8)
y1 <- c(0,2,1,5,4,3,7,5,8,6,2,1,5,5,3,5,7)
z1 <- c(0,1,-0.66667,-6,-7.25,-0.66667,-5.55556,-6,-6.125,-4,-2.5,-6,-6.8,-7.3913,-11.1429,-8.2069,-6.83333)
q <- c(-1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0)
model <- lm(q ~ x1*y1*z1)
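To get the surface, the fitted equation can be solved for z: from 0 = b0 + b1x + b2y + b3z + b4xy + b5xz + b6yz + b7xyz, grouping the z terms gives z = -(b0 + b1x + b2y + b4xy) / (b3 + b5x + b6y + b7xy). A sketch of the plotting step (solve_z is a helper introduced here for illustration; the coefficient order is the one lm() reports for q ~ x1*y1*z1):
b <- coef(model)   # (Intercept), x1, y1, z1, x1:y1, x1:z1, y1:z1, x1:y1:z1
solve_z <- function(x, y) {
  num <- b[1] + b[2] * x + b[3] * y + b[5] * x * y
  den <- b[4] + b[6] * x + b[7] * y + b[8] * x * y
  -num / den   # the z that sets the fitted response to zero
}
# evaluate on a grid and draw the surface
xg <- seq(0, 8, length.out = 30)
yg <- seq(0, 8, length.out = 30)
persp(xg, yg, outer(xg, yg, solve_z), theta = 30, phi = 20,
      xlab = "x", ylab = "y", zlab = "z")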

How to do a one-sample location, two-way approximate Z test in R using estimates from the delta method?

I used the delta method to estimate the difference between two coefficients from a glm fit (attached code below). Now, I want to compare this estimate to zero (i.e., a null hypothesis of no difference). One article mentions using a one-sample location, two-way approximate Z test to compute this difference.
However, I cannot find an easy way to do this in R using the delta-method difference. I looked over the two-sample Z-test documentation and thought of plugging the difference into the z-statistic formula, but I am not sure that is the best way to go about it.
##GENERATE DATA SET
y <- c(1:12)
x1 <- rep(c(1000, 4000, 0), each = 4)
x2 <- rep(c(0, 1000, 4000), each = 4)
df <- data.frame(y, x1, x2)
##RUN GLM
library(lmerTest)
g1 <- glm(log(y) ~ x1 + x2, data = df)
##Use delta-method to estimate the difference between coefficients of x1 and x2 (Ritz & Streibig 2008)
library(car)
g1.delta <- deltaMethod(g1,"(-x1) - (-x2)")
                 Estimate         SE      2.5 %  97.5 %
(-x1) - (-x2)  2.3217e-04 7.3180e-05 8.8738e-05   4e-04
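A minimal sketch of what plugging into the z-statistic formula would look like, assuming the delta-method SE is a valid large-sample standard error for the difference:
# two-sided approximate Z test of H0: (-x1) - (-x2) = 0
z.stat <- g1.delta$Estimate / g1.delta$SE
p.val <- 2 * pnorm(-abs(z.stat))
c(z = z.stat, p = p.val)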

CausalImpact package in R doesn't work for Poisson bsts model

I'd like to use the CausalImpact package in R to estimate the impact of an intervention on infectious-disease case counts. We typically characterize the distributions of case counts as either Poisson or negative binomial. The bsts() function allows us to specify the Poisson family, but this leads to an error in CausalImpact():
set.seed(1)
x1 <- 100 + arima.sim(model = list(ar = 0.999), n = 100)
y <- rpois(100, 1.2 * x1)
y[71:100] <- y[71:100] + 10
data <- cbind(y, x1)
pre.period <- c(1, 70)
post.period <- c(71, 100)
post.period.response <- y[post.period[1] : post.period[2]]
y[post.period[1] : post.period[2]] <- NA
ss <- AddLocalLevel(list(), y)
bsts.model <- bsts(y ~ x1, ss, family="poisson", niter = 1000)
impact <- CausalImpact(bsts.model = bsts.model,
                       post.period.response = post.period.response)
Error in rnorm(prod(dim(state.samples)), 0, sigma.obs) : invalid arguments
This is because bsts.model has no sigma.obs slot when it is generated with family = "poisson".
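A quick way to see this from the fitted object (a sketch that just inspects its components):
# Gaussian fits carry posterior draws of the residual SD in sigma.obs,
# which CausalImpact uses to simulate predictions; Poisson fits do not
"sigma.obs" %in% names(bsts.model)   # FALSE for the Poisson fit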
Am I doing this correctly or is there another way to use CausalImpact with Poisson data? (I'd also love to be able to use negative binomial data, but I won't get too greedy).
Last, is this the best place for coding issues for CausalImpact? I didn't see an Issues tab on the GitHub page.

lme4: Random slopes shared by all observations

I'm using R's lme4. Suppose I have a mixed-effects logistic-regression model in which I want some random slopes shared by every observation: random in the sense that the slopes should all come from a single normal distribution. This is essentially ridge regression, but without choosing the penalty size by cross-validation.
I tried the following code:
library(lme4)
ilogit = function(v)
    1 / (1 + exp(-v))
set.seed(20)
n = 100
x1 = rnorm(n)
x2 = rnorm(n)
x3 = rnorm(n)
x4 = rnorm(n)
x5 = rnorm(n)
y.p = ilogit(.5 + x1 - x2)
y = rbinom(n = n, size = 1, prob = y.p)
m1 = glm(
    y ~ x1 + x2 + x3 + x4 + x5,
    family = binomial)
print(round(d = 2, unname(coef(m1))))
m2 = glmer(
    y ~ ((x1 + x2 + x3 + x4 + x5) | 1),
    family = binomial)
print(round(d = 2, unname(coef(m2))))
This yields:
Loading required package: Matrix
[1] 0.66 1.14 -0.78 -0.01 -0.16 0.25
Error: (p <- ncol(X)) == ncol(Y) is not TRUE
Execution halted
What did I do wrong? What's the right way to do this?
Looks like lme4 can't do this as-is. Here's what @amoeba said in stats.SE chat:
What Kodi wants to do is definitely a mixed model, in the sense of Bates et al.; see e.g. eq. (2) here: https://cran.r-project.org/web/packages/lme4/vignettes/lmer.pdf As far as I can see, the X and Z design matrices are equal in this case. However, there is no way to use lme4 to fit this (without hacking into the code): it allows only the particular Z matrices that arise from model formulas of the type (formula | factor).
See https://stat.ethz.ch/pipermail/r-sig-mixed-models/2011q1/015581.html "We intend to allow lmer to be able to use more flexible model matrices for the random effects although, at present, that requires a certain amount of tweaking on the part of the user"
And https://stat.ethz.ch/pipermail/r-sig-mixed-models/2009q2/002351.html "I view the variance-covariance structures available in the lme4 package as being related to random-effects terms in the model matrix. A random-effects term is of the form (LMexpr | GrpFac). The expression on the right of the vertical bar is evaluated as a factor, which I call the grouping factor. The expression on the left is evaluated as a linear model expression."
That's all quotes from Bates. He does say "In future versions of lme4 I plan to allow for extensions of the unconditional variance-covariance structures" (in 2009), but I don't think this was ever implemented.
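Given the ridge connection noted in the question, one practical substitute is penalized logistic regression with a fixed penalty. A sketch, assuming the glmnet package (here lambda is set by hand rather than estimated as a random-slope variance, which is the part lme4 would have done for you):
library(glmnet)
X <- cbind(x1, x2, x3, x4, x5)
# alpha = 0 requests a pure ridge (L2) penalty on the slopes;
# the intercept is left unpenalized, matching the mixed-model setup
m3 <- glmnet(X, y, family = "binomial", alpha = 0, lambda = 0.1)
coef(m3)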

R: Predict (0,1) in logistic regression in glm()

I am trying to model a "what if" situation in a binary logit model. I am estimating the probability of passing a test given the test's level of difficulty (1 = easiest, 5 = toughest), with gender as a control. (The data is here.) Students are administered a test which is generally tough ("HIGH" in the data). From this we can estimate the impact of test difficulty on the likelihood of passing:
model = glm(PASS ~ as.factor(SEX) + as.factor(HIGH), family=binomial(link="logit"), data=df)
summary(model)
We can also get the predicted probabilities of passing with:
predict.high = predict(model, type="response")
The question is, what if the "LOW" test were given instead? To get the new probabilities, we can do:
newdata = rename.vars(subset(df, select=c(-HIGH)), 'LOW','HIGH')  # rename.vars() is from the gdata package
predict.low = predict(model, newdata=newdata, type="response")
But how do I know how many additional students would have passed in this case? Is there an obvious switch in glm() I am not seeing?
I have not yet tried to dig out my code for prediction that I wrote based on Gelman and Hill (2006), who, I seem to recall, used simulation; I still intend to do that. One aspect of your question that seems unique in my limited experience: I am accustomed to predicting for a single observation (here, a single student taking a single test), whereas you seem to want to predict the difference between two sets of predictions. In other words, you want to predict how many more students will pass if given a set of 5 easy exams rather than a set of 5 hard exams.
I am not sure whether Gelman and Hill (2006) covered that. You also seem to want to do this with a frequentist approach.
I am thinking that if you can predict for a single observation, so that you have a confidence interval for each observation, then perhaps you can estimate a weighted average probability of passing within each group and subtract the two weighted averages. The delta method could be used to estimate a confidence interval on the weighted averages and on their difference.
Covariance among predicted observations might have to be assumed to be 0 to implement that approach.
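A minimal frequentist sketch of that idea, assuming the model and newdata objects from the question (per-observation covariances are ignored, as just noted, so this gives point estimates only):
p.high <- predict(model, type = "response")
p.low  <- predict(model, newdata = newdata, type = "response")
# expected number of additional passes if everyone took the LOW test
sum(p.low) - sum(p.high)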
If assuming a covariance of 0 is not satisfactory then perhaps a Bayesian approach would be better. Again, I am only familiar with predicting for a single observation. With a Bayesian approach I have predicted a single observation by including the independent variables, but not the dependent variable, for the observation to be predicted. I suppose you could predict for every observation in the same Bayesian run (predict each student in HIGH and in LOW). The weighted averages of passing tests for each group and the difference in weighted averages are derived parameters and I suspect could be included directly in the code for the Bayesian logistic regression. Then you would have your point estimate and estimate of variance for probability of passing each group of tests and for the difference in probability of passing each group of tests. If you want the difference in the number of students passing each group of tests, perhaps that could be included in the Bayesian code as a derived parameter also.
I realize this answer, so far, has been more conversational than might be desired. I am simply mapping out strategies to attempt without having had the time yet to try implementing those strategies. Providing all of the R and WinBUGS code to implement both proposed strategies might take me a few days. (WinBUGS or OpenBUGS can be called from within R.) I will append the code to this answer as I go along. If anyone deems my proposed strategies, and/or forthcoming code, incorrect I hope they will feel free to point out my errors and offer corrections.
EDIT
Below is code that generates fake data and analyzes that data using a frequentist and a Bayesian approach. I have not yet added the code to implement the above ideas for prediction; I will try to add the Bayesian prediction code in the next 1-2 days. I only used three tests instead of five. The way the code is written below, you can change the number of students, n, to any non-zero number that is divisible by 6.
# Bayesian_logistic_regression_June2012.r
# June 24, 2012
library(R2WinBUGS)
library(arm)
library(BRugs)
set.seed(3234)
# create fake data for n students and three tests
n <- 1200
# create factors for n/6 students in each of 6 categories
gender <- c(rep(0, (n/2)), rep(1, (n/2)))
test2 <- c(rep(0, (n/6)), rep(1, (n/6)), rep(0, (n/6)),
           rep(0, (n/6)), rep(1, (n/6)), rep(0, (n/6)))
test3 <- c(rep(0, (n/6)), rep(0, (n/6)), rep(1, (n/6)),
           rep(0, (n/6)), rep(0, (n/6)), rep(1, (n/6)))
# assign slopes to factors
B0 <- 0.4
Bgender <- -0.2
Btest2 <- 0.6
Btest3 <- 1.2
# estimate probability of passing test
p.pass <- ( exp(B0 + Bgender * gender +
                Btest2 * test2 +
                Btest3 * test3) /
            (1 + exp(B0 + Bgender * gender +
                     Btest2 * test2 +
                     Btest3 * test3)))
# identify which students passed their test, 0 = fail, 1 = pass
passed <- rep(0, n)
r.passed <- runif(n,0,1)
passed[r.passed <= p.pass] = 1
# use frequentist approach in R to estimate probability
# of passing test
m.freq <- glm(passed ~ as.factor(gender) +
                       as.factor(test2) +
                       as.factor(test3),
              family = binomial)
summary(m.freq)
# predict(m.freq, type = "response")
# use OpenBUGS to analyze same data set
# Define model
sink("Bayesian.logistic.regression.txt")
cat("
model {
# Priors
alpha ~ dnorm(0,0.01)
bgender ~ dnorm(0,0.01)
btest2 ~ dnorm(0,0.01)
btest3 ~ dnorm(0,0.01)
# Likelihood
for (i in 1:n) {
passed[i] ~ dbin(p[i], 1)
logit(p[i]) <- (alpha + bgender * gender[i] +
btest2 * test2[i] +
btest3 * test3[i])
}
# Derived parameters
p.g.t1 <- exp(alpha) / (1 + exp(alpha))
p.b.t1 <- exp(alpha + bgender) / (1 + exp(alpha + bgender))
p.g.t2 <- ( exp(alpha + btest2) /
(1 + exp(alpha + btest2)))
p.b.t2 <- ( exp(alpha + bgender + btest2) /
(1 + exp(alpha + bgender + btest2)))
p.g.t3 <- ( exp(alpha + btest3) /
(1 + exp(alpha + btest3)))
p.b.t3 <- ( exp(alpha + bgender + btest3) /
(1 + exp(alpha + bgender + btest3)))
}
", fill = TRUE)
sink()
my.data <- list(passed = passed,
                gender = gender,
                test2 = test2,
                test3 = test3,
                n = length(passed))
# Inits function
inits <- function(){ list(alpha = rlnorm(1),
                          bgender = rlnorm(1),
                          btest2 = rlnorm(1),
                          btest3 = rlnorm(1)) }
# Parameters to estimate
params <- c("alpha", "bgender", "btest2", "btest3",
"p.g.t1", "p.b.t1", "p.g.t2", "p.b.t2",
"p.g.t3", "p.b.t3")
# MCMC settings
nc <- 3
ni <- 2000
nb <- 500
nt <- 2
# Start Gibbs sampling
out <- bugs(data = my.data, inits = inits,
            parameters.to.save = params,
            "c:/users/Mark W Miller/documents/Bayesian.logistic.regression.txt",
            program = 'OpenBUGS',
            n.thin = nt, n.chains = nc,
            n.burnin = nb, n.iter = ni, debug = TRUE)
print(out, dig = 5)
Before I attempt to implement the weighted-average approach to prediction I wanted to convince myself that it might work. So I ginned up the following code, which seems to suggest it may:
# specify number of girls taking each test and
# number of boys taking each test
g.t1 <- rep(0,400)
b.t1 <- rep(0,120)
g.t2 <- rep(0,1200)
b.t2 <- rep(0,50)
g.t3 <- rep(0,1000)
b.t3 <- rep(0,2000)
# specify probability of individuals in each of the
# 6 groups passing their test
p.g1.t1 <- 0.40
p.b1.t1 <- 0.30
p.g1.t2 <- 0.60
p.b1.t2 <- 0.50
p.g1.t3 <- 0.80
p.b1.t3 <- 0.70
# identify which individuals in each group passed their test
g.t1[1:(p.g1.t1 * length(g.t1))] = 1
sum(g.t1)
b.t1[1:(p.b1.t1 * length(b.t1))] = 1
sum(b.t1)
g.t2[1:(p.g1.t2 * length(g.t2))] = 1
sum(g.t2)
b.t2[1:(p.b1.t2 * length(b.t2))] = 1
sum(b.t2)
g.t3[1:(p.g1.t3 * length(g.t3))] = 1
sum(g.t3)
b.t3[1:(p.b1.t3 * length(b.t3))] = 1
sum(b.t3)
# determine the weighted average probability of passing
# on test day for all individuals as a class
wt.ave.p <- ((p.g1.t1 * length(g.t1) + p.b1.t1 * length(b.t1) +
              p.g1.t2 * length(g.t2) + p.b1.t2 * length(b.t2) +
              p.g1.t3 * length(g.t3) + p.b1.t3 * length(b.t3)) /
             (length(g.t1) + length(b.t1) + length(g.t2) +
              length(b.t2) + length(g.t3) + length(b.t3)))
wt.ave.p
# determine the expected number of individuals passing
# their test in the class as a whole
exp.num.pass <- wt.ave.p * (length(g.t1) + length(b.t1) +
                            length(g.t2) + length(b.t2) +
                            length(g.t3) + length(b.t3))
exp.num.pass
# determine the number of individuals passing
num.passing <- (sum(g.t1) + sum(b.t1) +
                sum(g.t2) + sum(b.t2) +
                sum(g.t3) + sum(b.t3))
num.passing
# the expected number of students passing, exp.num.pass,
# should equal the observed number of students passing,
# num.passing regardless of the number of students in each
# group and regardless of the probability of passing a
# given test, within rounding error
identical(round(exp.num.pass), round(num.passing))
Hopefully in the next couple of days I can try adding the prediction code to the above Bayesian code.
EDIT - June 27, 2012
I have not forgotten about this. Rather, I have encountered several problems:
1. With logistic regression it is possible to predict: (a) the probability, p, that students in a given group pass a test, and (b) the outcome (0 or 1) of each student taking a test, with all of the 0's and 1's then averaged. I am not sure which of these to use (see the sketch after this list). The point estimate and SD of the predicted p are identical to the estimated p for known test outcomes, while the point estimate of the averaged 0's and 1's is a little different and its SD is much larger. I believe I want (b), the average of the predicted 0's and 1's, but I am checking various websites and books to be sure. Collett (1991) has a worked example that does not employ computer code, but that example includes a half-dozen variables, including 2 interactions, and I am having a little trouble getting my Bayesian estimates to match her frequentist estimates.
2. With lots of derived parameters the program takes a long time to run.
3. OpenBUGS has been crashing frequently, I believe even without the prediction code. I am not sure whether that is because of something I am doing wrong, because of changes in recent versions of R or of R packages, because I am running the code under 64-bit R, or something else.
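A minimal sketch of the two flavors, using the frequentist fit m.freq from above (simulating 0/1 outcomes from the fitted probabilities is my reading of option (b)):
# (a) fitted probability of passing for each student
p.hat <- predict(m.freq, type = "response")
# (b) simulated 0/1 outcomes drawn from those probabilities, then averaged
sim01 <- rbinom(length(p.hat), size = 1, prob = p.hat)
c(mean.fitted.p = mean(p.hat), mean.simulated.01 = mean(sim01))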
I will try to post the prediction code soon, but all of the above issues have slowed me down.
You can also turn predicted probabilities into simulated pass/fail outcomes by comparing each one to a uniform draw:
passed.sim <- as.integer(predicted_probabilities > runif(length(predicted_probabilities)))
This is a stochastic decision rule, analogous to the acceptance step in Metropolis-Hastings, rather than a deterministic cutoff such as 0.5.
