I have the hypothesis that the automatically found model below is not significantly better than one where the middle segment is horizontal. How could I test that?
df<-structure(list(ageThen=c(9,10,11,12,13,14,15,16,17,
18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,
34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,
50,51,52,53,54,55,56,57,58,59,60,61,62),mTh=c(-0.057,
-0.253,-0.345,-0.185,-0.155,-0.013,0.285,0.16,0.197,0.199,
0.215,0.288,0.401,0.363,0.387,0.37,0.387,0.28,0.571,
0.383,0.297,0.366,0.36,0.25,0.269,0.235,0.273,0.336,
0.354,0.286,0.331,0.21,0.32,0.278,0.195,0.257,0.259,
0.251,0.222,0.206,0.214,-0.072,-0.123,-0.043,-0.003,0.116,
-0.193,-0.218,-0.278,-0.265,-0.218,-0.541,-0.76,-0.401
),n=c(64L,524L,20595L,2504L,795L,704L,1700L,1239L,
1273L,1149L,1011L,1122L,1031L,814L,717L,667L,462L,414L,
405L,313L,256L,305L,187L,255L,240L,221L,262L,227L,230L,
239L,199L,290L,201L,246L,217L,215L,273L,229L,213L,193L,
199L,204L,159L,207L,148L,121L,115L,89L,87L,78L,68L,
85L,55L,80L)),class=c("tbl_df","tbl","data.frame"),row.names=c(NA,
-54L))
library(segmented)
m1<-lm(mTh~ageThen,data=df,weights=n)##initialfit
s2<-segmented(m1,psi=c(20,50))##twobreakpoints,estimatedstartingvalues
plot(mTh~ageThen,data=df)
lines(df$ageThen,predict(s2),col=2,lwd=2)
This workflow seems to solve my problem. The literature that I saw suggests that the there is a significant increase in the predictive value of the model if the elpd_diff/se_diff (in absolute terms) is larger than 2, in other places 5. In my case it isn't, so unconstrained model is not much better predictor than the constrained (theoretical). A citable source would be much appreciated to corroborate this.
https://lindeloev.github.io/mcp/articles/comparison.html#what-is-loo-cv.
https://discourse.mc-stan.org/t/interpreting-elpd-diff-loo-package/1628
library(mcp)
df<-structure(list(ageThen=c(9,10,11,12,13,14,15,16,17,
18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,
34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,
50,51,52,53,54,55,56,57,58,59,60,61,62),mTh=c(-0.057,
-0.253,-0.345,-0.185,-0.155,-0.013,0.285,0.16,0.197,0.199,
0.215,0.288,0.401,0.363,0.387,0.37,0.387,0.28,0.571,
0.383,0.297,0.366,0.36,0.25,0.269,0.235,0.273,0.336,
0.354,0.286,0.331,0.21,0.32,0.278,0.195,0.257,0.259,
0.251,0.222,0.206,0.214,-0.072,-0.123,-0.043,-0.003,0.116,
-0.193,-0.218,-0.278,-0.265,-0.218,-0.541,-0.76,-0.401
),n=c(64L,524L,20595L,2504L,795L,704L,1700L,1239L,
1273L,1149L,1011L,1122L,1031L,814L,717L,667L,462L,414L,
405L,313L,256L,305L,187L,255L,240L,221L,262L,227L,230L,
239L,199L,290L,201L,246L,217L,215L,273L,229L,213L,193L,
199L,204L,159L,207L,148L,121L,115L,89L,87L,78L,68L,
85L,55L,80L)),class=c("tbl_df","tbl","data.frame"),row.names=c(NA,
-54L))
modelC = list(
mTh | weights(n) ~ 1 + ageThen, # slope
~ 0 , # joined plateau
~ 0 + ageThen # joined slope
)
fitC<-mcp(modelC, data = df, prior = list(cp_1 = "dunif(12, 25)", cp_2 = "dunif(35, 60)"))
plot(fitC)
summary(fitC)
modelUc = list(
mTh | weights(n) ~ 1 + ageThen, # slope
~ 0 + ageThen, # joined slope
~ 0 + ageThen # joined slope
)
fitUc<-mcp(modelUc, data = df, prior = list(cp_1 = "dunif(12, 25)", cp_2 = "dunif(35, 60)"))
plot(fitUc)
summary(fitUc)
fitUc$loo = loo(fitUc)
fitC$loo = loo(fitC)
library(tidyverse)
as.data.frame(loo::loo_compare(fitUc$loo,fitC$loo))%>%
mutate(diffRatio=elpd_diff/se_diff, .keep='used')
I wrote a binomial regression model to predict the prevalence of igneous stone, v, at an archaeological site based on proximity to a river, river_dist, but when I use the predict() function I'm getting odd cyclical results instead of the curve I was expecting. For reference, my data:
v n river_dist
1 102 256 1040
2 1 11 720
3 19 24 475
4 12 15 611
Which I fit to this model:
library(bbmle)
m_r <- mle2(ig$v ~ dbinom(size=ig$n, prob = 1/(1+exp(-(a + br * river_dist)))),
start = list(a = 0, br = 0), data = ig)
This produces a coefficient which, when back-transformed, suggests about 0.4% decrease in the likelihood of igneous stone per meter from the river (br = 0.996):
exp(coef(m_r))
That's all good. But when I try to predict new values, I get this odd cycling of values:
newdat <- data.frame(river_dist=seq(min(ig$river_dist), max(ig$river_dist),len=100))
newdat$v <- predict(m_r, newdata=newdat, type="response")
plot(v~river_dist, data=ig, col="red4")
lines(v ~ river_dist, newdat, col="green4", lwd=2)
Example of predicted values:
river_dist v
1 475.0000 216.855114
2 480.7071 9.285536
3 486.4141 20.187424
4 492.1212 12.571487
5 497.8283 213.762248
6 503.5354 9.150584
7 509.2424 19.888471
8 514.9495 12.381805
9 520.6566 210.476312
10 526.3636 9.007289
11 532.0707 19.571218
12 537.7778 12.180629
Why are the values cycling up and down like that, creating crazy spikes when graphed?
In order for newdata to work, you have to specify the variables as 'raw' values rather than with $:
library(bbmle)
m_r <- mle2(v ~ dbinom(size=n, prob = 1/(1+exp(-(a + br * river_dist)))),
start = list(a = 0, br = 0), data = ig)
At this point, as #user20650 suggests, you'll also have to specify a value (or values) for n in newdata.
This model appears to be identical to binomial regression: is there a reason not to use
glm(cbind(v,n-v) ~ river_dist, data=ig, family=binomial)
? (bbmle:mle2 is more general, but glm is much more robust.) (Also: fitting two parameters to four data points is theoretically fine, but you should not try to push the results too far ... in particular, a lot of the default results from GLM/MLE are asymptotic ...)
Actually, in double-checking the correspondence of the MLE fit with GLM I realized that the default method ("BFGS", for historical reasons) doesn't actually give the right answer (!); switching to method="Nelder-Mead" improves things. Adding control=list(parscale=c(a=1,br=0.001)) to the argument list, or scaling the river dist (e.g. going from "1 m" to "100 m" or "1 km" as the unit), would also fix the problem.
m_r <- mle2(v ~ dbinom(size=n,
prob = 1/(1+exp(-(a + br * river_dist)))),
start = list(a = 0, br = 0), data = ig,
method="Nelder-Mead")
pframe <- data.frame(river_dist=seq(500,1000,length=51),n=1)
pframe$prop <- predict(m_r, newdata=pframe, type="response")
CIs <- lapply(seq(nrow(ig)),
function(i) prop.test(ig[i,"v"],ig[i,"n"])$conf.int)
ig2 <- data.frame(ig,setNames(as.data.frame(do.call(rbind,CIs)),
c("lwr","upr")))
library(ggplot2); theme_set(theme_bw())
ggplot(ig2,aes(river_dist,v/n))+
geom_point(aes(size=n)) +
geom_linerange(aes(ymin=lwr,ymax=upr)) +
geom_smooth(method="glm",
method.args=list(family=binomial),
aes(weight=n))+
geom_line(data=pframe,aes(y=prop),colour="red")
Finally, note that your third-farthest site is an outlier (although the small sample size means it doesn't hurt much).
I'm trying to get a lme with self constructed interaction variables to fit. I need those for post-hoc analysis.
library(nlme)
# construct fake dataset
obsr <- 100
dist <- rep(rnorm(36), times=obsr)
meth <- dist+rnorm(length(dist), mean=0, sd=0.5); rm(dist)
meth <- meth/dist(range(meth)); meth <- meth-min(meth)
main <- data.frame(meth = meth,
cpgl = as.factor(rep(1:36, times=obsr)),
pbid = as.factor(rep(1:obsr, each=36)),
agem = rep(rnorm(obsr, mean=30, sd=10), each=36),
trma = as.factor(rep(sample(c(TRUE, FALSE), size=obsr, replace=TRUE), each=36)),
depr = as.factor(rep(sample(c(TRUE, FALSE), size=obsr, replace=TRUE), each=36)))
# check if all factor combinations are present
# TRUE for my real dataset; Naturally TRUE for the fake dataset
with(main, all(table(depr, trma, cpgl) >= 1))
# construct interaction variables
main$depr_trma <- interaction(main$depr, main$trma, sep=":", drop=TRUE)
main$depr_cpgl <- interaction(main$depr, main$cpgl, sep=":", drop=TRUE)
main$trma_cpgl <- interaction(main$trma, main$cpgl, sep=":", drop=TRUE)
main$depr_trma_cpgl <- interaction(main$depr, main$trma, main$cpgl, sep=":", drop=TRUE)
# model WITHOUT preconstructed interaction variables
form1 <- list(fixd = meth ~ agem + depr + trma + depr*trma + cpgl +
depr*cpgl +trma*cpgl + depr*trma*cpgl,
rndm = ~ 1 | pbid,
corr = ~ cpgl | pbid)
modl1 <- nlme::lme(fixed=form1[["fixd"]],
random=form1[["rndm"]],
correlation=corCompSymm(form=form1[["corr"]]),
data=main)
# model WITH preconstructed interaction variables
form2 <- list(fixd = meth ~ agem + depr + trma + depr_trma + cpgl +
depr_cpgl + trma_cpgl + depr_trma_cpgl,
rndm = ~ 1 | pbid,
corr = ~ cpgl | pbid)
modl2 <- nlme::lme(fixed=form2[["fixd"]],
random=form2[["rndm"]],
correlation=corCompSymm(form=form2[["corr"]]),
data=main)
The first model fits without any problems whereas the second model gives me following error:
Error in MEEM(object, conLin, control$niterEM) :
Singularity in backsolve at level 0, block 1
Nothing i found out about this error so far helped me to solve the problem. However the solution is probably pretty easy.
Can someone help me? Thanks in advance!
EDIT 1:
When i run:
modl3 <- lm(form1[["fixd"]], data=main)
modl4 <- lm(form2[["fixd"]], data=main)
The summaries reveal that modl4 (with the self constructed interaction variables) in contrast to modl3 shows many more predictors. All those that are in 4 but not in 3 show NA as coefficients. The problem therefore definitely lies within the way i create the interaction variables...
EDIT 2:
In the meantime I created the interaction variables "by hand" (mainly paste() and grepl()) - It seems to work now. However I would still be interested in how i could have realized it by using the interaction() function.
I should have only constructed the largest of the interaction variables (combining all 3 simple variables).
If i do so the model gets fit. The likelihoods then are very close to each other and the number of coefficients matches exactly.
I'm trying to build a survival model in JAGS that allows for time-varying covariates. I'd like it to be a parametric model — for example, assuming survival follows the Weibull distribution (but I'd like to allow the hazard to vary, so exponential is too simple). So, this is essentially a Bayesian version of what can be done in the flexsurv package, which allows for time-varying covariates in parametric models.
Therefore, I want to be able to enter the data in a 'counting-process' form, where each subject has multiple rows, each corresponding to a time interval in which their covariates remained constant (as described in this pdf or here. This is the (start, stop] formulation that the survival or flexurv packages allow.
Unfortunately, every explanation of how to perform survival analysis in JAGS seems to assume one row per subject.
I attempted to take this simpler approach and extend it to the counting process format, but the model does not correctly estimate the distribution.
A Failed Attempt:
Here's an example. First we generate some data:
library('dplyr')
library('survival')
## Make the Data: -----
set.seed(3)
n_sub <- 1000
current_date <- 365*2
true_shape <- 2
true_scale <- 365
dat <- data_frame(person = 1:n_sub,
true_duration = rweibull(n = n_sub, shape = true_shape, scale = true_scale),
person_start_time = runif(n_sub, min= 0, max= true_scale*2),
person_censored = (person_start_time + true_duration) > current_date,
person_duration = ifelse(person_censored, current_date - person_start_time, true_duration)
)
person person_start_time person_censored person_duration
(int) (dbl) (lgl) (dbl)
1 1 11.81416 FALSE 487.4553
2 2 114.20900 FALSE 168.7674
3 3 75.34220 FALSE 356.6298
4 4 339.98225 FALSE 385.5119
5 5 389.23357 FALSE 259.9791
6 6 253.71067 FALSE 259.0032
7 7 419.52305 TRUE 310.4770
Then we split the data into 2 observations per subject. I'm just splitting each subject at time = 300 (unless they didn't make it to time=300, in which they get just one observation).
## Split into multiple observations per person: --------
cens_point <- 300 # <----- try changing to 0 for no split; if so, model correctly estimates
dat_split <- dat %>%
group_by(person) %>%
do(data_frame(
split = ifelse(.$person_duration > cens_point, cens_point, .$person_duration),
START = c(0, split[1]),
END = c(split[1], .$person_duration),
TINTERVAL = c(split[1], .$person_duration - split[1]),
CENS = c(ifelse(.$person_duration > cens_point, 1, .$person_censored), .$person_censored), # <— edited original post here due to bug; but problem still present when fixing bug
TINTERVAL_CENS = ifelse(CENS, NA, TINTERVAL),
END_CENS = ifelse(CENS, NA, END)
)) %>%
filter(TINTERVAL != 0)
person split START END TINTERVAL CENS TINTERVAL_CENS
(int) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1 1 300.0000 0 300.0000 300.00000 1 NA
2 1 300.0000 300 487.4553 187.45530 0 187.45530
3 2 168.7674 0 168.7674 168.76738 1 NA
4 3 300.0000 0 300.0000 300.00000 1 NA
5 3 300.0000 300 356.6298 56.62979 0 56.62979
6 4 300.0000 0 300.0000 300.00000 1 NA
Now we can set up the JAGS model.
## Set-Up JAGS Model -------
dat_jags <- as.list(dat_split)
dat_jags$N <- length(dat_jags$TINTERVAL)
inits <- replicate(n = 2, simplify = FALSE, expr = {
list(TINTERVAL_CENS = with(dat_jags, ifelse(CENS, TINTERVAL + 1, NA)),
END_CENS = with(dat_jags, ifelse(CENS, END + 1, NA)) )
})
model_string <-
"
model {
# set priors on reparameterized version, as suggested
# here: https://sourceforge.net/p/mcmc-jags/discussion/610036/thread/d5249e71/?limit=25#8c3b
log_a ~ dnorm(0, .001)
log(a) <- log_a
log_b ~ dnorm(0, .001)
log(b) <- log_b
nu <- a
lambda <- (1/b)^a
for (i in 1:N) {
# Estimate Subject-Durations:
CENS[i] ~ dinterval(TINTERVAL_CENS[i], TINTERVAL[i])
TINTERVAL_CENS[i] ~ dweibull( nu, lambda )
}
}
"
library('runjags')
param_monitors <- c('a', 'b', 'nu', 'lambda')
fit_jags <- run.jags(model = model_string,
burnin = 1000, sample = 1000,
monitor = param_monitors,
n.chains = 2, data = dat_jags, inits = inits)
# estimates:
fit_jags
# actual:
c(a=true_shape, b=true_scale)
Depending on where the split point is, the model estimates very different parameters for the underlying distribution. It only gets the parameters right if the data isn't split into the counting process form. It seems like this is not the way to format the data for this kind of problem.
If I am missing an assumption and my problem is less related to JAGS and more related to how I'm formulating the problem, suggestions are very welcome. I might be despairing that time-varying covariates can't be used in parametric survival models (and can only be used in models like the Cox model, which assumes constant hazards and which doesn't actually estimate the underlying distribution)— however, as I mentioned above, the flexsurvreg package in R does accommodate the (start, stop] formulation in parametric models.
If anyone knows how to build a model like this in another language (e.g. STAN instead of JAGS) that would be appreciated too.
Edit:
Chris Jackson provides some helpful advice via email:
I think the T() construct for truncation in JAGS is needed here. Essentially for each period (t[i], t[i+1]) where a person is alive but the covariate is constant, the survival time is left-truncated at the start of the period, and possibly also right-censored at the end. So you'd write something like y[i] ~ dweib(shape, scale[i])T(t[i], )
I tried implementing this suggestion as follows:
model {
# same as before
log_a ~ dnorm(0, .01)
log(a) <- log_a
log_b ~ dnorm(0, .01)
log(b) <- log_b
nu <- a
lambda <- (1/b)^a
for (i in 1:N) {
# modified to include left-truncation
CENS[i] ~ dinterval(END_CENS[i], END[i])
END_CENS[i] ~ dweibull( nu, lambda )T(START[i],)
}
}
Unfortunately this doesn't quite do the trick. With the old code, the model was mostly getting the scale parameter right, but doing a very bad job on the shape parameter. With this new code, it gets very close to the correct shape parameter, but consistently over-estimates the scale parameter. I have noticed that the degree of over-estimation is correlated with how late the split point comes. If the split-point is early (cens_point = 50), there's not really any over-estimation; if it's late (cens_point = 350), there is a lot.
I thought maybe the problem could be related to 'double-counting' the observations: if we see a censored observation at t=300, then from that same person, an uncensored observation at t=400, it seems intuitive to me that this person is contributing two data-points to our inference about the Weibull parameters when really they should just be contributing one point. I, therefore, tried incorporating a random-effect for each person; however, this completely failed, with huge estimates (in the 50-90 range) for the nu parameter. I'm not sure why that is, but perhaps that's a question for a separate post. Since I'm not whether the problems are related, you can find the code for this whole post, including the JAGS code for that model, here.
You can use rstanarm package, which is a wrapper around STAN. It allows to use standard R formula notation to describe survival models. stan_surv function accepts arguments in a "counting process" form. Different base hazard functions including Weibull can be used to fit the model.
The survival part of rstanarm - stan_surv function is still not available at CRAN so you should install the package directly from mc-stan.org.
install.packages("rstanarm", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
Please see the code below:
library(dplyr)
library(survival)
library(rstanarm)
## Make the Data: -----
set.seed(3)
n_sub <- 1000
current_date <- 365*2
true_shape <- 2
true_scale <- 365
dat <- data_frame(person = 1:n_sub,
true_duration = rweibull(n = n_sub, shape = true_shape, scale = true_scale),
person_start_time = runif(n_sub, min= 0, max= true_scale*2),
person_censored = (person_start_time + true_duration) > current_date,
person_duration = ifelse(person_censored, current_date - person_start_time, true_duration)
)
## Split into multiple observations per person: --------
cens_point <- 300 # <----- try changing to 0 for no split; if so, model correctly estimates
dat_split <- dat %>%
group_by(person) %>%
do(data_frame(
split = ifelse(.$person_duration > cens_point, cens_point, .$person_duration),
START = c(0, split[1]),
END = c(split[1], .$person_duration),
TINTERVAL = c(split[1], .$person_duration - split[1]),
CENS = c(ifelse(.$person_duration > cens_point, 1, .$person_censored), .$person_censored), # <— edited original post here due to bug; but problem still present when fixing bug
TINTERVAL_CENS = ifelse(CENS, NA, TINTERVAL),
END_CENS = ifelse(CENS, NA, END)
)) %>%
filter(TINTERVAL != 0)
dat_split$CENS <- as.integer(!(dat_split$CENS))
# Fit STAN survival model
mod_tvc <- stan_surv(
formula = Surv(START, END, CENS) ~ 1,
data = dat_split,
iter = 1000,
chains = 2,
basehaz = "weibull-aft")
# Print fit coefficients
mod_tvc$coefficients[2]
unname(exp(mod_tvc$coefficients[1]))
Output, which is consistent with true values (true_shape <- 2; true_scale <- 365):
> mod_tvc$coefficients[2]
weibull-shape
1.943157
> unname(exp(mod_tvc$coefficients[1]))
[1] 360.6058
You can also look at STAN source using rstan::get_stanmodel(mod_tvc$stanfit) to compare STAN code with the attempts you made in JAGS.