I'm trying to write stan code for multilevel logistic regression. The model that I tried is a mixed intercept logistic model with two predictors. The first level is children level and the second level is mom level. When I tried to match the summary result of the code I wrote versus the one generated by function stan_glmer(), the results of fixed intercept did not match. First, the data I used as below:
library(rstanarm)
library(rstan)
data(guImmun, package = "mlmRev")
summary(guImmun)
require(dplyr)
guImmun <- guImmun %>%
mutate(immun = ifelse(immun == "N",0,1))
Second, the stan code was written as below:
data {
int N; // number of obs
int M; // number of groups
int K; // number of predictors
int y[N]; // outcome
row_vector[K] x[N]; // predictors
int g[N]; // map obs to groups (kids to women)
}
parameters {
real alpha;
real a[M];
vector[K] beta;
real<lower=0,upper=10> sigma;
}
model {
alpha ~ normal(0,1);
a ~ normal(0,sigma);
beta ~ normal(0,1);
for(n in 1:N) {
y[n] ~ bernoulli(inv_logit( alpha + a[g[n]] + x[n]*beta));
}
}
Fitting data to the model:
guI_data <- list(g=as.integer(guImmun$mom),
y=guImmun$immun,
x=data.frame(guImmun$kid2p, guImmun$mom25p),
N=nrow(guImmun),
K=2,
M=nlevels(guImmun$mom))
ranIntFit <- stan(file = "first_model.stan", data = guI_data,
iter = 500, chains = 1)
summary(ranIntFit, pars = c("alpha", "beta", "a[1]", "a[2]", "a[3]", "sigma"),
probs = c(0.025, 0.975),
digits = 2)
I got the following result:
results of written model
However, if I use stan_glmer() function, the result would be presented as follows.
M1_stanglmer <- stan_glmer(immun ~ kid2p + mom25p + (1 | mom),
family = binomial("logit"),
data = guImmun,
iter = 500,
chains = 1,
seed = 349)
print(M1_stanglmer, digits = 2)
But the results do not match, especially the result of fixed intercept.
Results generated by the stan_glmer() function
Could anyone help me figure out what's wrong with my code? Thanks!
So, I wouldn't expect an exact equivalence between your model in Stan and the version implemented in stan_glmer, but for models that sample well it's reasonable to expect the estimates to be similar.
However, in your case, there is yet another issue impacting your estimates:
The covariates you are using in the guI_Data$x object have values in {1,2}, where the typical implementation would use values in {0,1} to represent a binary covariate. This is what is done in stan_glmer.
This coding is apparent if you use glimpse to inspect the data structure:
> library(tidyverse)
> glimpse(guI_data)
List of 6
$ g: int [1:2159] 1 2 3 4 5 5 6 7 7 8 ...
$ y: num [1:2159] 1 0 0 0 0 1 1 1 1 1 ...
$ x:'data.frame': 2159 obs. of 2 variables:
..$ guImmun.kid2p : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 1 2 2 ...
..$ guImmun.mom25p: Factor w/ 2 levels "N","Y": 1 1 1 1 2 1 1 2 2 2 ...
$ N: int 2159
$ K: num 2
$ M: int 1595
This is having the biggest impact in your intercept parameter, since the intercept represents the expected linear predictor when all covariates are 0. That value will often change when covariates are transformed or added.
Actually, I would expect that the estimated coefficients from your fit and the stan_glmer model are in fact similar, once you take this transformation into consideration.
For example, consider:
define: x_m = x + 1
Your model (m): yhat_m = alpha_m + x_m1*beta_m1 + x_m2*beta_m2
Stan_glmer: yhat = alpha + x_1*beta_1 + x_2*beta_2
and substitute:
yhat_m = alpha_m + (x_1 + 1)*beta_m1 + (x_2 + 1)*beta_m1
yhat_m = alpha_m + x_1*beta_m1 + beta_m1 + x_2*beta_m2 + beta_m2
yhat_m = alpha_m + beta_m1 + beta_m2 + x_1*beta_m1 + x_2*beta_m2
If we assume that yhat_m ~= yhat, beta_m1 ~= beta_1, and beta_m2 ~= beta_2... Then
alpha = alpha_m + beta_m1 + beta_m2
So, I would expect the stan_glmer alpha (-1.7) to be close to the hand-coded Stan alpha + both betas (-3.2 + 1.7 - 0.1).
Which indeed it is (-1.6).
If you furthermore update your Stan data to scale these covariates as {0,1} instead of {1,2}:
guI_data2 <- list(g=as.integer(guImmun$mom),
y=guImmun$immun,
x=data.frame(guImmun$kid2p == "Y", guImmun$mom25p == "Y"),
N=nrow(guImmun),
K=2,
M=nlevels(guImmun$mom))
ranIntFit2 <- stan(file = "first_model.stan", data = guI_data2,
iter = 500, chains = 1)
And look at the output:
> summary(ranIntFit2, pars = c('alpha', 'beta'))
$summary
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
alpha -1.5110714 0.022982199 0.1903571 -1.8974997 -1.6318370 -1.5038593 -1.3861628334 -1.1729671 68.60488 1.0505237
beta[1] 1.5224756 0.025017739 0.1737332 1.2260666 1.4058789 1.5118314 1.6492158203 1.8673450 48.22471 1.0592955
beta[2] -0.1206084 0.009410305 0.1640406 -0.4267987 -0.2368855 -0.1267984 -0.0003187197 0.1894375 303.87510 0.9964177
You can confirm for yourself that you are in the right ballpark.
After this, the differences between your model and the stan_glmer will come down to the priors, the parameterization of hierarchical parameters, the sampling quality, etc.
Aside: there are a number of ways that categorical covariates can be coded into a model.matrix, each for a specific interpretation of the effect parameters. The models are usually equivalent, meaning that one can convert from one parameterization to another using linear transformations of effects as done above.
Related
I wrote a binomial regression model to predict the prevalence of igneous stone, v, at an archaeological site based on proximity to a river, river_dist, but when I use the predict() function I'm getting odd cyclical results instead of the curve I was expecting. For reference, my data:
v n river_dist
1 102 256 1040
2 1 11 720
3 19 24 475
4 12 15 611
Which I fit to this model:
library(bbmle)
m_r <- mle2(ig$v ~ dbinom(size=ig$n, prob = 1/(1+exp(-(a + br * river_dist)))),
start = list(a = 0, br = 0), data = ig)
This produces a coefficient which, when back-transformed, suggests about 0.4% decrease in the likelihood of igneous stone per meter from the river (br = 0.996):
exp(coef(m_r))
That's all good. But when I try to predict new values, I get this odd cycling of values:
newdat <- data.frame(river_dist=seq(min(ig$river_dist), max(ig$river_dist),len=100))
newdat$v <- predict(m_r, newdata=newdat, type="response")
plot(v~river_dist, data=ig, col="red4")
lines(v ~ river_dist, newdat, col="green4", lwd=2)
Example of predicted values:
river_dist v
1 475.0000 216.855114
2 480.7071 9.285536
3 486.4141 20.187424
4 492.1212 12.571487
5 497.8283 213.762248
6 503.5354 9.150584
7 509.2424 19.888471
8 514.9495 12.381805
9 520.6566 210.476312
10 526.3636 9.007289
11 532.0707 19.571218
12 537.7778 12.180629
Why are the values cycling up and down like that, creating crazy spikes when graphed?
In order for newdata to work, you have to specify the variables as 'raw' values rather than with $:
library(bbmle)
m_r <- mle2(v ~ dbinom(size=n, prob = 1/(1+exp(-(a + br * river_dist)))),
start = list(a = 0, br = 0), data = ig)
At this point, as #user20650 suggests, you'll also have to specify a value (or values) for n in newdata.
This model appears to be identical to binomial regression: is there a reason not to use
glm(cbind(v,n-v) ~ river_dist, data=ig, family=binomial)
? (bbmle:mle2 is more general, but glm is much more robust.) (Also: fitting two parameters to four data points is theoretically fine, but you should not try to push the results too far ... in particular, a lot of the default results from GLM/MLE are asymptotic ...)
Actually, in double-checking the correspondence of the MLE fit with GLM I realized that the default method ("BFGS", for historical reasons) doesn't actually give the right answer (!); switching to method="Nelder-Mead" improves things. Adding control=list(parscale=c(a=1,br=0.001)) to the argument list, or scaling the river dist (e.g. going from "1 m" to "100 m" or "1 km" as the unit), would also fix the problem.
m_r <- mle2(v ~ dbinom(size=n,
prob = 1/(1+exp(-(a + br * river_dist)))),
start = list(a = 0, br = 0), data = ig,
method="Nelder-Mead")
pframe <- data.frame(river_dist=seq(500,1000,length=51),n=1)
pframe$prop <- predict(m_r, newdata=pframe, type="response")
CIs <- lapply(seq(nrow(ig)),
function(i) prop.test(ig[i,"v"],ig[i,"n"])$conf.int)
ig2 <- data.frame(ig,setNames(as.data.frame(do.call(rbind,CIs)),
c("lwr","upr")))
library(ggplot2); theme_set(theme_bw())
ggplot(ig2,aes(river_dist,v/n))+
geom_point(aes(size=n)) +
geom_linerange(aes(ymin=lwr,ymax=upr)) +
geom_smooth(method="glm",
method.args=list(family=binomial),
aes(weight=n))+
geom_line(data=pframe,aes(y=prop),colour="red")
Finally, note that your third-farthest site is an outlier (although the small sample size means it doesn't hurt much).
I want to estimate a binomial model with the R package MCMCglmm. The model shall incorporate an intercept and a slope - both as fixed and random parts. How do I have to specify an accepted prior? (Note, here is a similar question, but in a much more complicated setting.)
Assume the data have the following form:
y x cluster
1 0 -0.56047565 1
2 1 -0.23017749 1
3 0 1.55870831 1
4 1 0.07050839 1
5 0 0.12928774 1
6 1 1.71506499 1
In fact, the data have been generated by
set.seed(123)
nj <- 15 # number of individuals per cluster
J <- 30 # number of clusters
n <- nj * J
x <- rnorm(n)
y <- rbinom(n, 1, prob = 0.6)
cluster <- factor(rep(1:nj, each = J))
dat <- data.frame(y = y, x = x, cluster = cluster)
The information in the question about the model, suggest to specify fixed = y ~ 1 + x and random = ~ us(1 + x):cluster. With us() you allow the random effects to be correlated (cf. section 3.4 and table 2 in Hadfield's 2010 jstatsoft-article)
First of all, as you only have one dependent variable (y), the G part in the prior (cf. equation 4 and section 3.6 in Hadfield's 2010 jstatsoft-article) for the random effects variance(s) only needs to have one list element called G1. This list element isn't the actual prior distribution - this was specified by Hadfield to be an inverse-Wishart distribution. But with G1 you specify the parameters of this inverse-Whishart distribution which are the scale matrix ( in Wikipedia notation and V in MCMCglmm notation) and the degrees of freedom ( in Wikipedia notation and nu in MCMCglmm notation). As you have two random effects (the intercept and the slope) V has to be a 2 x 2 matrix. A frequent choice is the two dimensional identity matrix diag(2). Hadfield often uses nu = 0.002 for the degrees of freedom (cf. his course notes)
Now, you also have to specify the R part in the prior for the residual variance. Here again an inverse-Whishart distribution was specified by Hadfield, leaving the user to specify its parameters. As we only have one residual variance, V has to be a scalar (lets say V = 0.5). An optional element for R is fix. With this element you specify, whether the residual variance shall be fixed to a certain value (than you have to write fix = TRUE or fix = 1) or not (then fix = FALSE or fix = 0). Notice, that you don't fix the residual variance to be 0.5 by fix = 0.5! So when you find in Hadfield's course notes fix = 1, read it as fix = TRUE and look to which value of V it is was fixed.
All togehter we set up the prior as follows:
prior0 <- list(G = list(G1 = list(V = diag(2), nu = 0.002)),
R = list(V = 0.5, nu = 0.002, fix = FALSE))
With this prior we can run MCMCglmm:
library("MCMCglmm") # for MCMCglmm()
set.seed(123)
mod0 <- MCMCglmm(fixed = y ~ 1 + x,
random = ~ us(1 + x):cluster,
data = dat,
family = "categorical",
prior = prior0)
The draws from the Gibbs-sampler for the fixed effects are found in mod0$Sol, the draws for the variance parameters in mod0$VCV.
Normally a binomial model requires the residual variance to be fixed, so we set the residual variance to be fixed at 0.5
set.seed(123)
prior1 <- list(G = list(G1 = list(V = diag(2), nu = 0.002)),
R = list(V = 0.5, nu = 0.002, fix = TRUE))
mod1 <- MCMCglmm(fixed = y ~ 1 + x,
random = ~ us(1 + x):cluster,
data = dat,
family = "categorical",
prior = prior1)
The difference can be seen by comparing mod0$VCV[, 5] to mod1$VCV[, 5]. In the later case, all entries are 0.5 as specified.
I'm trying to build a survival model in JAGS that allows for time-varying covariates. I'd like it to be a parametric model — for example, assuming survival follows the Weibull distribution (but I'd like to allow the hazard to vary, so exponential is too simple). So, this is essentially a Bayesian version of what can be done in the flexsurv package, which allows for time-varying covariates in parametric models.
Therefore, I want to be able to enter the data in a 'counting-process' form, where each subject has multiple rows, each corresponding to a time interval in which their covariates remained constant (as described in this pdf or here. This is the (start, stop] formulation that the survival or flexurv packages allow.
Unfortunately, every explanation of how to perform survival analysis in JAGS seems to assume one row per subject.
I attempted to take this simpler approach and extend it to the counting process format, but the model does not correctly estimate the distribution.
A Failed Attempt:
Here's an example. First we generate some data:
library('dplyr')
library('survival')
## Make the Data: -----
set.seed(3)
n_sub <- 1000
current_date <- 365*2
true_shape <- 2
true_scale <- 365
dat <- data_frame(person = 1:n_sub,
true_duration = rweibull(n = n_sub, shape = true_shape, scale = true_scale),
person_start_time = runif(n_sub, min= 0, max= true_scale*2),
person_censored = (person_start_time + true_duration) > current_date,
person_duration = ifelse(person_censored, current_date - person_start_time, true_duration)
)
person person_start_time person_censored person_duration
(int) (dbl) (lgl) (dbl)
1 1 11.81416 FALSE 487.4553
2 2 114.20900 FALSE 168.7674
3 3 75.34220 FALSE 356.6298
4 4 339.98225 FALSE 385.5119
5 5 389.23357 FALSE 259.9791
6 6 253.71067 FALSE 259.0032
7 7 419.52305 TRUE 310.4770
Then we split the data into 2 observations per subject. I'm just splitting each subject at time = 300 (unless they didn't make it to time=300, in which they get just one observation).
## Split into multiple observations per person: --------
cens_point <- 300 # <----- try changing to 0 for no split; if so, model correctly estimates
dat_split <- dat %>%
group_by(person) %>%
do(data_frame(
split = ifelse(.$person_duration > cens_point, cens_point, .$person_duration),
START = c(0, split[1]),
END = c(split[1], .$person_duration),
TINTERVAL = c(split[1], .$person_duration - split[1]),
CENS = c(ifelse(.$person_duration > cens_point, 1, .$person_censored), .$person_censored), # <— edited original post here due to bug; but problem still present when fixing bug
TINTERVAL_CENS = ifelse(CENS, NA, TINTERVAL),
END_CENS = ifelse(CENS, NA, END)
)) %>%
filter(TINTERVAL != 0)
person split START END TINTERVAL CENS TINTERVAL_CENS
(int) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
1 1 300.0000 0 300.0000 300.00000 1 NA
2 1 300.0000 300 487.4553 187.45530 0 187.45530
3 2 168.7674 0 168.7674 168.76738 1 NA
4 3 300.0000 0 300.0000 300.00000 1 NA
5 3 300.0000 300 356.6298 56.62979 0 56.62979
6 4 300.0000 0 300.0000 300.00000 1 NA
Now we can set up the JAGS model.
## Set-Up JAGS Model -------
dat_jags <- as.list(dat_split)
dat_jags$N <- length(dat_jags$TINTERVAL)
inits <- replicate(n = 2, simplify = FALSE, expr = {
list(TINTERVAL_CENS = with(dat_jags, ifelse(CENS, TINTERVAL + 1, NA)),
END_CENS = with(dat_jags, ifelse(CENS, END + 1, NA)) )
})
model_string <-
"
model {
# set priors on reparameterized version, as suggested
# here: https://sourceforge.net/p/mcmc-jags/discussion/610036/thread/d5249e71/?limit=25#8c3b
log_a ~ dnorm(0, .001)
log(a) <- log_a
log_b ~ dnorm(0, .001)
log(b) <- log_b
nu <- a
lambda <- (1/b)^a
for (i in 1:N) {
# Estimate Subject-Durations:
CENS[i] ~ dinterval(TINTERVAL_CENS[i], TINTERVAL[i])
TINTERVAL_CENS[i] ~ dweibull( nu, lambda )
}
}
"
library('runjags')
param_monitors <- c('a', 'b', 'nu', 'lambda')
fit_jags <- run.jags(model = model_string,
burnin = 1000, sample = 1000,
monitor = param_monitors,
n.chains = 2, data = dat_jags, inits = inits)
# estimates:
fit_jags
# actual:
c(a=true_shape, b=true_scale)
Depending on where the split point is, the model estimates very different parameters for the underlying distribution. It only gets the parameters right if the data isn't split into the counting process form. It seems like this is not the way to format the data for this kind of problem.
If I am missing an assumption and my problem is less related to JAGS and more related to how I'm formulating the problem, suggestions are very welcome. I might be despairing that time-varying covariates can't be used in parametric survival models (and can only be used in models like the Cox model, which assumes constant hazards and which doesn't actually estimate the underlying distribution)— however, as I mentioned above, the flexsurvreg package in R does accommodate the (start, stop] formulation in parametric models.
If anyone knows how to build a model like this in another language (e.g. STAN instead of JAGS) that would be appreciated too.
Edit:
Chris Jackson provides some helpful advice via email:
I think the T() construct for truncation in JAGS is needed here. Essentially for each period (t[i], t[i+1]) where a person is alive but the covariate is constant, the survival time is left-truncated at the start of the period, and possibly also right-censored at the end. So you'd write something like y[i] ~ dweib(shape, scale[i])T(t[i], )
I tried implementing this suggestion as follows:
model {
# same as before
log_a ~ dnorm(0, .01)
log(a) <- log_a
log_b ~ dnorm(0, .01)
log(b) <- log_b
nu <- a
lambda <- (1/b)^a
for (i in 1:N) {
# modified to include left-truncation
CENS[i] ~ dinterval(END_CENS[i], END[i])
END_CENS[i] ~ dweibull( nu, lambda )T(START[i],)
}
}
Unfortunately this doesn't quite do the trick. With the old code, the model was mostly getting the scale parameter right, but doing a very bad job on the shape parameter. With this new code, it gets very close to the correct shape parameter, but consistently over-estimates the scale parameter. I have noticed that the degree of over-estimation is correlated with how late the split point comes. If the split-point is early (cens_point = 50), there's not really any over-estimation; if it's late (cens_point = 350), there is a lot.
I thought maybe the problem could be related to 'double-counting' the observations: if we see a censored observation at t=300, then from that same person, an uncensored observation at t=400, it seems intuitive to me that this person is contributing two data-points to our inference about the Weibull parameters when really they should just be contributing one point. I, therefore, tried incorporating a random-effect for each person; however, this completely failed, with huge estimates (in the 50-90 range) for the nu parameter. I'm not sure why that is, but perhaps that's a question for a separate post. Since I'm not whether the problems are related, you can find the code for this whole post, including the JAGS code for that model, here.
You can use rstanarm package, which is a wrapper around STAN. It allows to use standard R formula notation to describe survival models. stan_surv function accepts arguments in a "counting process" form. Different base hazard functions including Weibull can be used to fit the model.
The survival part of rstanarm - stan_surv function is still not available at CRAN so you should install the package directly from mc-stan.org.
install.packages("rstanarm", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
Please see the code below:
library(dplyr)
library(survival)
library(rstanarm)
## Make the Data: -----
set.seed(3)
n_sub <- 1000
current_date <- 365*2
true_shape <- 2
true_scale <- 365
dat <- data_frame(person = 1:n_sub,
true_duration = rweibull(n = n_sub, shape = true_shape, scale = true_scale),
person_start_time = runif(n_sub, min= 0, max= true_scale*2),
person_censored = (person_start_time + true_duration) > current_date,
person_duration = ifelse(person_censored, current_date - person_start_time, true_duration)
)
## Split into multiple observations per person: --------
cens_point <- 300 # <----- try changing to 0 for no split; if so, model correctly estimates
dat_split <- dat %>%
group_by(person) %>%
do(data_frame(
split = ifelse(.$person_duration > cens_point, cens_point, .$person_duration),
START = c(0, split[1]),
END = c(split[1], .$person_duration),
TINTERVAL = c(split[1], .$person_duration - split[1]),
CENS = c(ifelse(.$person_duration > cens_point, 1, .$person_censored), .$person_censored), # <— edited original post here due to bug; but problem still present when fixing bug
TINTERVAL_CENS = ifelse(CENS, NA, TINTERVAL),
END_CENS = ifelse(CENS, NA, END)
)) %>%
filter(TINTERVAL != 0)
dat_split$CENS <- as.integer(!(dat_split$CENS))
# Fit STAN survival model
mod_tvc <- stan_surv(
formula = Surv(START, END, CENS) ~ 1,
data = dat_split,
iter = 1000,
chains = 2,
basehaz = "weibull-aft")
# Print fit coefficients
mod_tvc$coefficients[2]
unname(exp(mod_tvc$coefficients[1]))
Output, which is consistent with true values (true_shape <- 2; true_scale <- 365):
> mod_tvc$coefficients[2]
weibull-shape
1.943157
> unname(exp(mod_tvc$coefficients[1]))
[1] 360.6058
You can also look at STAN source using rstan::get_stanmodel(mod_tvc$stanfit) to compare STAN code with the attempts you made in JAGS.
I am a college student taking a stats class on ecological design and need help figuring out how to go through a power analysis in R. I am using an ANCOVA design with no interaction; in this hypothetical experiment there are two flower varieties planted in independent plots and for each plot the explanatory variable collected is soil moisture with the response being flower yield.
I simulated a dataset called sim.flowers that has in it an alpha effect for the difference between flowers, a slope, an n for my x values, and a sigma for a normal distribution I will use in my y vector (following an ancova model of y = alpha + beta0 + beta1X; for simplicity I made the intercept 0) See below:
sim.flowers <- function(alpha,slope,n,sigma) {
x <-runif(2*n, min = -1, max = 1)
flower.effects <- rep(c(0,alpha),each=n) #there are two different flower varieties and I gave them a true difference of 1.
y <- flower.effects + slope * x + rnorm(2*n, 0, sigma)
data.frame(x=x,y=y,flower.effects = flower.effects)
}
I tested that out and it worked, it gave me a data set with an X a Y and column for flower effects
> test1 <- sim.oats(1,0.5,3,0.3)
> test1
x y flower.effects
1 -0.99913780 -0.31373866 0
2 -0.38610391 0.41070965 0
3 -0.58308522 0.07426254 0
4 -0.35900237 0.36395132 1
5 -0.07296464 1.29149447 1
6 0.18575996 0.85001847 1
The goal is to create a figure showing power to detect the flower Variety effect, with different lines for different numbers of replicates for each flower Variety treatment for this I was told to should choose only 1 value of the Soil Moisture effect.
And a figure showing power to detect the Soil Moisture effect, with different lines for different numbers of replicates for each flower Variety treatment, choosing 1 value of the flower Variety
To start to get here I ran a linear regression in a for loop so that I could extract p-values and be able to plot a power graph setting the probability to reject the null at 0.5 my code is below
> sim.flowers.many <- function(alpha,slope,n,sigma,numsimulations){
+ pvals <-numeric(numsimulations)
+ for(i in 1:numsimulations){
+ thisdat <-sim.flowers(alpha,slope,n,sigma)
+ thisfit <-lm(y~x,thisdat)
+ pvals[i]<-coefficients(summary(thisfit))['x','Pr(>|t|)']
+ }
+ return(pvals)
+ }
> sim.flowers.many( alpha = 1,slope = 0.5, n = 3, sigma = 6, numsimulations =3)
[1] 0.7662218 0.4454654 0.2414637
I seem to be getting p-values alright. I did the following with the expectation that I would get a dataframe with a column that states the probability of rejecting the null so that I could just plot it out.
> determine.power <- function(true.slopes){
+ true.slopes <-0:3
+ out<- data.frame(true.slopes,prob.reject.null=NA)
+ for(i in 1:length(true.slopes)){
+ thesepvals <- sim.flowers.many(alpha = 1,slope = 0.5, n = 3, sigma = 6, numsimulations =3)
+ out[i,2] <- mean(thesepvals < 0.05)
+ }
+ return(out)
+ }
I got this as an output:
> determine.power(true.slopes=0.5)
true.slopes prob.reject.null
1 0 0.0000000
2 1 0.0000000
3 2 0.3333333
4 3 0.3333333
And thought I could graph it out like this:
power.results <- determine.power(seq(0, 2, by = 0.2))
plot(prob.reject.null ~ true.slopes, data = power.results, main = "Power analysis", xlab = "true slope",
ylab = "Prob. reject null", ylim = c(0,1), type = "b", col = "slateblue4")
grid(col = "hotpink")
While I seem to be understanding the writing of the code needed for a power analysis I'm having trouble understanding how to create the two figures that I'm supposed to. I don't know how to change this code so that the power figures I generate reflect the effects of soil moisture and flower variety. I would appreciate any help with working through this problem and I apologize for the length but I felt it was necessary context for the question I am asking.
You can compare your results with pwr.f2.test{pwr}
I'm trying to create a model using the MCMCglmm package in R.
The data are structured as follows, where dyad, focal, other are all random effects, predict1-2 are predictor variables, and response 1-5 are outcome variables that capture # of observed behaviors of different subtypes:
dyad focal other r present village resp1 resp2 resp3 resp4 resp5
1 10101 14302 0.5 3 1 0 0 4 0 5
2 10405 11301 0.0 5 0 0 0 1 0 1
…
So a model with only one outcome (teaching) is as follows:
prior_overdisp_i <- list(R=list(V=diag(2),nu=0.08,fix=2),
G=list(G1=list(V=1,nu=0.08), G2=list(V=1,nu=0.08), G3=list(V=1,nu=0.08), G4=list(V=1,nu=0.08)))
m1 <- MCMCglmm(teaching ~ trait-1 + at.level(trait,1):r + at.level(trait,1):present,
random= ~idh(at.level(trait,1)):focal + idh(at.level(trait,1)):other +
idh(at.level(trait,1)):X + idh(at.level(trait,1)):village,
rcov=~idh(trait):units, family = "zipoisson", prior=prior_overdisp_i,
data = data, nitt = nitt.1, thin = 50, burnin = 15000, pr = TRUE, pl = TRUE, verbose = TRUE, DIC = TRUE)
Hadfield's course notes (Ch 5) give an example of a multinomial model that uses only a single outcome variable with 3 levels (sheep horns of 3 types). Similar treatment can be found here: http://hlplab.wordpress.com/2009/05/07/multinomial-random-effects-models-in-r/ This is not quite right for what I'm doing, but contains helpful background info.
Another reference (Hadfield 2010) gives an example of a multi-response MCMCglmm that follows the same format but uses cbind() to predict a vector of responses, rather than a single outcome. The same model with multiple responses would look like this:
m1 <- MCMCglmm(cbind(resp1, resp2, resp3, resp4, resp5) ~ trait-1 +
at.level(trait,1):r + at.level(trait,1):present,
random= ~idh(at.level(trait,1)):focal + idh(at.level(trait,1)):other +
idh(at.level(trait,1)):X + idh(at.level(trait,1)):village,
rcov=~idh(trait):units,
family = cbind("zipoisson","zipoisson","zipoisson","zipoisson","zipoisson"),
prior=prior_overdisp_i,
data = data, nitt = nitt.1, thin = 50, burnin = 15000, pr = TRUE, pl = TRUE, verbose = TRUE, DIC = TRUE)
I have two programming questions here:
How do I specify a prior for this model? I've looked at the materials mentioned in this post but just can't figure it out.
I've run a similar version with only two response variables, but I only get one slope - where I thought I should get a different slope for each resp variable. Where am I going wrong, or having I misunderstood the model?
Answer to my first question, based on the HLP post and some help from a colleage/stats consultant:
# values for prior
k <- 5 # originally: length(levels(dative$SemanticClass)), so k = # of outcomes for SemanticClass aka categorical outcomes
I <- diag(k-1) #should make matrix of 0's with diagonal of 1's, dimensions k-1 rows and k-1 columns
J <- matrix(rep(1, (k-1)^2), c(k-1, k-1)) # should make k-1 x k-1 matrix of 1's
And for my model, using the multinomial5 family and 5 outcome variables, the prior is:
prior = list(
R = list(fix=1, V=0.5 * (I + J), n = 4),
G = list(
G1 = list(V = diag(4), n = 4))
For my second question, I need to add an interaction term to the fixed effects in this model:
m <- MCMCglmm(cbind(Resp1, Resp2...) ~ -1 + trait*predictorvariable,
...
The result gives both main effects for the Response variables and posterior estimates for the Response/Predictor interaction (the effect of the predictor variable on each response variable).