Stan Model Posterior Predictions Outside Possible Range of Data - r

I am having a lot of fun right now learning the ropes of modeling in Stan. Right now I'm wrestling with a model of a mixed between- and within-subjects factorial experimental design. There are different groups of subjects, and each subject indicates how much they expect each of three different beverages (water, decaf, and coffee) to reduce their caffeine withdrawal. The outcome variable - expectancy of withdrawal reduction - was measured on a Visual Analog Scale from 0 to 10, with 0 indicating no expectation of withdrawal reduction and 10 indicating a very high expectation. I want to test whether there are between-group differences in the expected withdrawal-reduction potential of the three different beverages.
Here is the data
df <- data.frame(id = rep(1:46, each = 3),
group = c(3,3,3,1,1,1,3,3,3,1,1,1,3,3,3,1,1,1,3,3,3,2,2,2,1,1,1,3,3,3,3,3,3,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,2,2,2,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,2,2,2,3,3,3,3,3,3,2,2,2,3,3,3,3,3,3,1,1,1,3,3,3,3,3,3,1,1,1,2,2,2,2,2,2,1,1,1,2,2,2,2,2,2,1,1,1,1,1,1,2,2,2,2,2,2,1,1,1,1,1,1,3,3,3,1,1,1,3,3,3),
bevType = rep(c(3,2,1), times = 46),
score = c(2.9,1.0,0.0,9.5,5.0,4.5,9.0,3.0,5.0,5.0,0.0,3.0,9.5,2.0,3.0,8.5,0.0,6.0,5.2,3.0,4.0,8.4,7.0,2.0,10.0,0.0,3.0,7.3,1.0,1.8,8.5,2.0,9.0,10.0,5.0,10.0,8.3,2.0,5.0,6.0,0.0,5.0,6.0,0.0,5.0,10.0,0.0,5.0,6.8,1.0,4.8,8.0,1.0,4.0,7.0,4.0,6.0,6.5,1.0,3.1,9.0,1.0,0.0,6.0,0.0,2.0,9.5,4.0,6.0,8.0,1.0,3.8,0.4,0.0,7.0,7.0,0.0,3.0,9.0,2.0,5.0,9.5,2.0,7.0,7.9,5.0,4.9,8.0,1.0,1.0,9.3,5.0,7.9,6.5,2.0,3.0,8.0,2.0,6.0,10.0,0.0,5.0,6.0,0.0,5.0,6.8,0.1,7.0,8.0,3.0,9.1,8.2,0.0,7.9,8.2,5.0,0.0,9.2,1.0,3.1,9.1,3.0,0.6,5.7,2.0,5.1,7.0,0.0,7.4,8.0,1.0,1.5,9.1,4.0,4.3,8.5,8.0,5.0))
Now for the model. The model has a grand mean parameter a, a categorical predictor representing the groups' deflections from the grand mean bGroup, a term for deflections of the different beverage types from the grand mean bBev, a term for each subject's intercept bSubj, and a term for the group-by-beverage interaction bGxB. I also estimated separate noise parameters for each beverage type.
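In symbols, the likelihood for each observation is score[i] ~ Normal(a + bGroup[gIndex[i]] + bBev[bIndex[i]] + bSubj[sIndex[i]] + bGxB[gIndex[i], bIndex[i]], sigma[bIndex[i]]).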
To allow posterior predictive checks I drew from the joint posterior using the generated quantities block and the normal_rng function.
### Step 1: Put data into list
dList <- list(N = 138,
nSubj = 46,
nGroup = 3,
nBev = 3,
sIndex = df$id,
gIndex = df$group,
bIndex = df$bevType,
score = df$score,
gMean = 4.718841,
gSD = 3.17)
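(An aside: the hard-coded gMean and gSD look like the sample mean and SD of the scores; if so, they can be computed rather than typed in:)
dList$gMean <- mean(df$score) # should match the 4.718841 above
dList$gSD <- sd(df$score)     # should match the 3.17 above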
### Step 2: the model
write("
data{
int<lower=1> N;
int<lower=1> nSubj;
int<lower=1> nGroup;
int<lower=1> nBev;
int<lower=1,upper=nSubj> sIndex[N];
int<lower=1,upper=nGroup> gIndex[N];
int<lower=1,upper=nBev> bIndex[N];
real score[N];
real gMean;
real gSD;
}
parameters{
real a;
vector[nSubj] bSubj;
vector[nGroup] bGroup;
vector[nBev] bBev;
vector[nBev] bGxB[nGroup]; // array of vectors (easier to index here than a matrix)
vector<lower=0>[nBev] sigma; // lower bound keeps the noise scales positive
real<lower=0> sigma_s;
real<lower=0> sigma_g;
real<lower=0> sigma_b;
real<lower=0> sigma_gb;
}
model{
vector[N] mu;
//hyper-priors
sigma_s ~ normal(0,10);
sigma_g ~ normal(0,10);
sigma_b ~ normal(0,10);
sigma_gb ~ normal(0,10);
//priors
sigma ~ cauchy(0,1);
a ~ normal(gMean, gSD);
bSubj ~ normal(0, sigma_s);
bGroup ~ normal(0,sigma_g);
bBev ~ normal(0,sigma_b);
for (i in 1:nGroup) { //hierarchical prior on interaction
bGxB[i] ~ normal(0, sigma_gb);
}
// likelihood
for (i in 1:N){
score[i] ~ normal(a + bGroup[gIndex[i]] + bBev[bIndex[i]] + bSubj[sIndex[i]] + bGxB[gIndex[i]][bIndex[i]], sigma[bIndex[i]]);
}
}
generated quantities{
real y_draw[N];
for (i in 1:N) {
y_draw[i] = normal_rng(a + bGroup[gIndex[i]] + bBev[bIndex[i]] + bSubj[sIndex[i]] + bGxB[gIndex[i]][bIndex[i]], sigma[bIndex[i]]);
}
}
", file = "temp.stan")
### Step 3: generate the chains
mod <- stan(file = "temp.stan",
data = dList,
iter = 5000,
warmup = 3000,
cores = 1,
chains = 1)
Next we extract the draws from the joint posterior and generate estimates of the group means and the lower and upper 95% HPDI limits. First we need a function to calculate the HPDI:
HPDIFunct <- function (vector) {
sortVec <- sort(vector)
ninetyFiveVec <- ceiling(.95*length(sortVec)) # number of draws inside the interval
fiveVec <- length(sortVec) - ninetyFiveVec # number of candidate interval starts
diffVec <- sapply(1:fiveVec, function (i) sortVec[i + ninetyFiveVec] - sortVec[i])
minVal <- sortVec[which.min(diffVec)] # start of the narrowest 95% interval
maxVal <- sortVec[which.min(diffVec) + ninetyFiveVec] # end of that interval
return(list(sortVec, minVal, maxVal))
}
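As a quick sanity check (not part of the original analysis): for a large standard normal sample, the 95% HPDI should come out close to (-1.96, 1.96).
set.seed(1)
check <- HPDIFunct(rnorm(1e5))
c(check[[2]], check[[3]]) # approximately -1.96 and 1.96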
Now to extract the draws from the posterior
### Step 4: posterior predictive checks
y_draw <- data.frame(extract(mod, pars = "y_draw"))
And plot the mean, lower HPDI and upper HPDI draws of these draws against the actual data.
df$drawMean <- apply(y_draw, 2, mean)
df$HPDI_Low <- apply(y_draw, 2, function(i) HPDIFunct(i)[[2]][1])
df$HPDI_Hi <- apply(y_draw, 2, function(i) HPDIFunct(i)[[3]][1])
### Step 5: plot posterior draws against actual data
ggplot(df, aes(x = factor(bevType), colour = factor(group))) +
geom_jitter(aes(y = score), shape = 1, position = position_dodge(width=0.9)) +
geom_point(aes(y = drawMean), position = position_dodge(width=0.9), stat = "summary", fun.y = "mean", shape = 3, size = 3, stroke = 2) +
geom_point(aes(y = HPDI_Low), position = position_dodge(width=0.9), stat = "summary", fun.y = "mean", shape = 1, size = 3, stroke = 1) +
geom_point(aes(y = HPDI_Hi), position = position_dodge(width=0.9), stat = "summary", fun.y = "mean", shape = 1, size = 3, stroke = 1) +
scale_colour_manual(name = "Experimental Group", labels = c("Group 1", "Group 2", "Group 3"), values = c("#616a6b", "#00AFBB", "#E7B800")) +
scale_x_discrete(labels = c("Water", "Decaf", "Coffee")) +
labs(x = "Beverage Type", y = "Expectancy of Withdrawal Alleviation") +
scale_y_continuous(breaks = seq(0,10,2)) +
theme(axis.text.x = element_text(size = 12),
axis.title.x = element_text(face = "bold"),
axis.title.y = element_text(face = "bold"),
axis.text.y = element_text(size = 12),
legend.title = element_text(size = 13, face = "bold"))
Looking at the graph, for Water expectancies the model seems to represent the centre (crosses) and spread (open circles) of the data quite well. But this breaks down for the Decaf and Coffee expectancies. For Decaf expectancies the lower HPDI limit is below the range of possible values (lower limit = 0) and the spread of the draws from the posterior (represented in each group by the open circles) is too large. For Coffee expectancies the upper HPDI limit is above the range of the data (upper limit = 10) and the spread is likewise too large for the actual data.
So my question is:
How do I constrain the draws from the joint posterior to the actual range of the data?
Is there some sort of brute-force way to constrain the draws from the posterior in Stan? Or would a more adaptable estimation of differences in the variance across the three beverage conditions be more effective (in which case this would be more of a CV question than a SO question)?

The standard way to constrain a posterior variable is to use a link function to transform it. That's the way generalized linear models (GLMs) like logistic regression and Poisson regression work. For example, to go from positive to unconstrained, we use a log transform. To go from a probability in (0, 1) to unconstrained, we use a log-odds transform.
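A minimal R illustration of the idea, using made-up values: map a score on (0, 10) to the unconstrained scale and back.
score <- 7.3
z <- qlogis(score / 10) # log-odds transform: (0, 10) -> (-Inf, Inf)
plogis(z) * 10          # inverse transform recovers the original score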
If your outcomes are ordinal values on a 0-10 scale, a common approach that respects that data scale is ordinal logistic regression.

To expand on @Bob Carpenter's answer, here are two ways you could approach the problem. (I've had cause to use both of these recently and struggled to get them up and running; this may be useful to other beginners like me.)
Method 1: Ordered Logistic Regression
We're going to assume that each user has a "true" expectancy for each response, which is on an arbitrary continuous scale, and model it as a latent variable. If the user's actual responses fall into K categories, we also model K - 1 cutpoints between those categories. The probability that the user selects a given response category is equal to the area under the logistic pdf between the relevant cutpoints.
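Concretely, Stan's ordered_logistic implies P(y = k) = inv_logit(eta - c[k-1]) - inv_logit(eta - c[k]), with c[0] = -Inf and c[K] = +Inf. A small R sketch with made-up numbers:
eta <- 0.5               # latent score
cutpoints <- c(-1, 0, 1) # K - 1 = 3 cutpoints -> K = 4 categories
probs <- -diff(c(1, plogis(eta - cutpoints), 0))
probs      # one probability per response category
sum(probs) # sums to 1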
The Stan model looks like this. The main difference is that the model fits an additional ordered vector of cutpoints and uses the ordered_logistic distribution. (I also changed the priors on the sigmas to half-Cauchy, with <lower=0> declarations keeping them positive, and switched to a non-centered parameterization. But those changes are independent of the question at hand.) Edit: I also added inputs for new (hypothetical) observations about which we want to make predictions, and a new generated quantity for those predictions.
data {
// the real data
int<lower=1> N;
int<lower=1> nSubj;
int<lower=1> nGroup;
int<lower=1> nBev;
int minResponse;
int maxResponse;
int<lower=1,upper=nSubj> sIndex[N];
int<lower=1,upper=nGroup> gIndex[N];
int<lower=1,upper=nBev> bIndex[N];
int<lower=minResponse,upper=maxResponse> score[N];
// hypothetical observations for new predictions
int<lower=1> nNewPred;
int<lower=0> nNewSubj;
int<lower=0> nNewGroup;
int<lower=0> nNewBev;
int<lower=1,upper=nSubj+nNewSubj> sNewIndex[nNewPred];
int<lower=1,upper=nGroup+nNewGroup> gNewIndex[nNewPred];
int<lower=1,upper=nBev+nNewBev> bNewIndex[nNewPred];
}
parameters {
real a;
vector[nSubj] bSubj;
vector[nGroup] bGroup;
vector[nBev] bBev;
vector[nBev] bGxB[nGroup];
real<lower=0> sigma_s;
real<lower=0> sigma_g;
real<lower=0> sigma_b;
real<lower=0> sigma_gb;
ordered[maxResponse - minResponse] cutpoints;
}
model {
// hyper-priors
sigma_s ~ cauchy(0, 1);
sigma_g ~ cauchy(0, 1);
sigma_b ~ cauchy(0, 1);
sigma_gb ~ cauchy(0, 1);
// priors
a ~ std_normal();
bSubj ~ std_normal();
bGroup ~ std_normal();
bBev ~ std_normal();
for (i in 1:nGroup) {
bGxB[i] ~ std_normal();
}
// likelihood
for(i in 1:N) {
score[i] ~ ordered_logistic(a +
(bGroup[gIndex[i]] * sigma_g) +
(bBev[bIndex[i]] * sigma_b) +
(bSubj[sIndex[i]] * sigma_s) +
(bGxB[gIndex[i]][bIndex[i]] * sigma_gb),
cutpoints);
}
}
generated quantities {
real y_draw[N];
real y_new_pred[nNewPred];
vector[nGroup+nNewGroup] bNewGroup;
vector[nBev+nNewBev] bNewBev;
vector[nSubj+nNewSubj] bNewSubj;
vector[nBev+nNewBev] bNewGxB[nGroup+nNewGroup];
// generate posterior predictions for the real data
for (i in 1:N) {
y_draw[i] = ordered_logistic_rng(a +
(bGroup[gIndex[i]] * sigma_g) +
(bBev[bIndex[i]] * sigma_b) +
(bSubj[sIndex[i]] * sigma_s) +
(bGxB[gIndex[i]][bIndex[i]] * sigma_gb),
cutpoints);
}
// generate predictions for the new observations
for (i in 1:(nGroup+nNewGroup)) {
if (i <= nGroup) { bNewGroup[i] = bGroup[i]; }
else { bNewGroup[i] = normal_rng(0, 1); }
}
for (i in 1:(nBev+nNewBev)) {
if (i <= nBev) { bNewBev[i] = bBev[i]; }
else { bNewBev[i] = normal_rng(0, 1); }
}
for (i in 1:(nSubj+nNewSubj)) {
if (i <= nSubj) { bNewSubj[i] = bSubj[i]; }
else { bNewSubj[i] = normal_rng(0, 1); }
}
for (j in 1:(nGroup+nNewGroup)) { // bNewGxB is indexed [group][bev], matching its declaration
for (i in 1:(nBev+nNewBev)) {
if (j <= nGroup && i <= nBev) { bNewGxB[j][i] = bGxB[j][i]; }
else { bNewGxB[j][i] = normal_rng(0, 1); }
}
}
for (i in 1:nNewPred) {
y_new_pred[i] = ordered_logistic_rng(a +
(bNewGroup[gNewIndex[i]] * sigma_g) +
(bNewBev[bNewIndex[i]] * sigma_b) +
(bNewSubj[sNewIndex[i]] * sigma_s) +
(bNewGxB[gNewIndex[i]][bNewIndex[i]] * sigma_gb),
cutpoints);
}
}
It looks like responses in your dataset are recorded to the nearest tenth, so that gives us 101 possible categories between 0 and 10. To keep everything as Stan-friendly integers, we can multiply all the responses by 10. (I also added one to each response because I had trouble fitting the model when one of the possible categories was zero.) Edit: Added new test data for a hypothetical "subject 47", one observation for each group/beverage.
new.pred.obs = expand.grid(group = 1:3, bevType = 2:3) %>%
mutate(id = max(df$id) + 1)
dList <- list(N = 138,
nSubj = 46,
nGroup = 3,
nBev = 3,
minResponse = 1,
maxResponse = 101,
sIndex = df$id,
gIndex = df$group,
bIndex = df$bevType,
score = round(df$score * 10) + 1, # round() guards against floating-point error in the integer conversion
nNewPred = nrow(new.pred.obs),
nNewSubj = 1,
nNewGroup = 0,
nNewBev = 0,
sNewIndex = new.pred.obs$id,
gNewIndex = new.pred.obs$group,
bNewIndex = new.pred.obs$bevType)
After we extract y_draw, we can convert it back to the original scale:
y_draw <- (data.frame(extract(mod, pars = "y_draw")) - 1) / 10
Everything else is the same as before. Now the posterior predictions are correctly confined to [0, 10].
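(A quick check on the rescaled draws:)
range(y_draw) # should lie within [0, 10]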
To draw inferences on the original scale about differences between beverages, we can use the predictions for our hypothetical data. For each sample, we have one predicted output for a new subject in each group/beverage combination. We can compare the "coffee" vs. "decaf" responses within each sample and group:
# Get predictions for hypothetical observations
new.preds.df = data.frame(rstan::extract(mod, pars = "y_new_pred")) %>%
rownames_to_column("sample") %>%
gather(obs, pred, -sample) %>%
mutate(obs = gsub("y_new_pred\\.", "", obs),
pred = (pred - 1) / 10) %>%
inner_join(new.pred.obs %>%
rownames_to_column("obs") %>%
mutate(bevType = paste("bev", bevType, sep = ""),
group = paste("Group", group)),
by = c("obs")) %>%
select(-obs) %>%
spread(bevType, pred) %>%
mutate(bevTypeDiff = bev3 - bev2)
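From there, a per-group summary of the posterior coffee-minus-decaf difference is straightforward (a sketch, assuming new.preds.df as constructed above):
new.preds.df %>%
group_by(group) %>%
summarize(mean_diff = mean(bevTypeDiff),
lower_95 = quantile(bevTypeDiff, 0.025),
upper_95 = quantile(bevTypeDiff, 0.975))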
(Alternatively, we could have done this prediction for new observations in R, or in a separate Stan model; see here for examples of how this could be done.)
Method 2: Beta Regression
Once we get up to 101 response categories, calling these possibilities discrete categories seems a little strange. It feels more natural to say, as your original model tried to do, that we're capturing a continuous outcome that happens to be bounded between 0 and 10. Also, in ordered logistic regression, the response categories don't have to be regularly spaced with respect to the latent variable. (This is a feature, not a bug; for example, for Likert responses, there's no guarantee that the difference between "Strongly agree" and "Agree" is the same as the difference between "Agree" and "Neither agree nor disagree".) As a result, it's difficult to say anything about the "distance" a particular factor moves a response on the original scale (as opposed to the scale of the latent variable). But the cutpoints inferred by the model above are pretty regularly spaced, which again suggests that the outcome in your dataset is already reasonably scale-like:
# Get the sampled parameters
sampled.params.df = data.frame(as.array(mod)[,1,]) %>%
select(-matches("y_draw")) %>%
rownames_to_column("iteration")
# Plot selected cutpoints
sampled.params.df %>%
select(matches("cutpoints")) %>%
gather(cutpoint, value) %>%
mutate(cutpoint.num = as.numeric(gsub("^cutpoints\\.([0-9]+)\\.$", "\\1", cutpoint))) %>%
group_by(cutpoint.num) %>%
summarize(mean.value = mean(value),
lower.95 = quantile(value, 0.025),
lower.50 = quantile(value, 0.25),
upper.50 = quantile(value, 0.75),
upper.95 = quantile(value, .975)) %>%
ggplot(aes(x = cutpoint.num, y = mean.value)) +
geom_point(size = 3) +
geom_linerange(aes(ymin = lower.95, ymax = upper.95)) +
geom_linerange(aes(ymin = lower.50, ymax = upper.50), size = 2) +
scale_x_continuous("cutpoint", breaks = seq(0, 100, 10)) +
scale_y_continuous("") +
theme_bw()
(Thick and thin lines represent 50% and 95% intervals, respectively. I'm enjoying the little "jump" every 10 cutpoints, which suggests subjects treated, say, 5.9 vs. 6.0 as a larger difference than 5.8 vs. 5.9. But the effect seems to be quite mild. The scale also seems to stretch out a bit towards the high end, but again, it's not too drastic.)
For a continuous outcome in a bounded interval, we can use the beta distribution; see here and here for further discussion.
For the beta distribution, we need two parameters, mu and phi, from which the two (positive) shape parameters mu * phi and (1 - mu) * phi are built. In this example, I allowed mu to be unbounded on the linear scale and applied inv_logit before feeding it into the beta distribution; I constrained phi to be positive and gave it a half-Cauchy prior. But you could do it in any number of ways. I also coded a full set of mu parameters but only a single phi; again, you can experiment with other options.
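(A quick R sketch of this mean/precision parameterization, with made-up values, showing how mu and phi translate into the two shape parameters:)
mu <- 0.7; phi <- 10
curve(dbeta(x, shape1 = mu * phi, shape2 = (1 - mu) * phi),
from = 0, to = 1, xlab = "score / 10", ylab = "density")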
data {
int<lower=1> N;
int<lower=1> nSubj;
int<lower=1> nGroup;
int<lower=1> nBev;
int<lower=1,upper=nSubj> sIndex[N];
int<lower=1,upper=nGroup> gIndex[N];
int<lower=1,upper=nBev> bIndex[N];
vector<lower=0,upper=1>[N] score;
}
parameters {
real a;
real<lower=0> a_phi; // lower bound keeps the precision positive
vector[nSubj] bSubj;
vector[nGroup] bGroup;
vector[nBev] bBev;
vector[nBev] bGxB[nGroup];
real<lower=0> sigma_s;
real<lower=0> sigma_g;
real<lower=0> sigma_b;
real<lower=0> sigma_gb;
}
model {
vector[N] mu;
//hyper-priors
sigma_s ~ cauchy(0, 1);
sigma_g ~ cauchy(0, 1);
sigma_b ~ cauchy(0, 1);
sigma_gb ~ cauchy(0, 1);
//priors
a ~ std_normal();
a_phi ~ cauchy(0, 1);
bSubj ~ std_normal();
bGroup ~ std_normal();
bBev ~ std_normal();
for (i in 1:nGroup) {
bGxB[i] ~ std_normal();
}
// likelihood
for(i in 1:N) {
mu[i] = a +
(bGroup[gIndex[i]] * sigma_g) +
(bBev[bIndex[i]] * sigma_b) +
(bSubj[sIndex[i]] * sigma_s) +
(bGxB[gIndex[i]][bIndex[i]] * sigma_gb);
score[i] ~ beta(inv_logit(mu[i]) * a_phi,
(1 - inv_logit(mu[i])) * a_phi);
}
}
generated quantities {
real y_draw[N];
real temp_mu;
for (i in 1:N) {
temp_mu = a +
(bGroup[gIndex[i]] * sigma_g) +
(bBev[bIndex[i]] * sigma_b) +
(bSubj[sIndex[i]] * sigma_s) +
(bGxB[gIndex[i]][bIndex[i]] * sigma_gb);
y_draw[i] = beta_rng(inv_logit(temp_mu) * a_phi,
(1 - inv_logit(temp_mu)) * a_phi);
}
}
The beta distribution is supported on (0, 1), so we divide the observed scores by 10. (The model also fails if we give it scores of exactly 0 or 1, so I converted the original-scale scores of 0 and 10 to 0.01 and 0.99, respectively.)
dList.beta <- list(N = 138,
nSubj = 46,
nGroup = 3,
nBev = 3,
sIndex = df$id,
gIndex = df$group,
bIndex = df$bevType,
score = ifelse(df$score == 0, 0.01,
ifelse(df$score == 10, 0.99,
df$score / 10)))
Undo the transformation when extracting y_draw, and then the procedure is the same as before.
y_draw.beta <- data.frame(extract(mod.beta, pars = "y_draw")) * 10
Once again, the posterior draws are correctly bounded.

Related

Setting hypotheses in R package SPRT

I am new to using the SPRT package in R to perform sequential proportion ratio testing, and vignettes/tutorials for this package seem to be sparse.
The SPRT() function can receive cumulative values of n and k (trials and events). I will be using this method on large studies where trials and events will be tallied daily in a cumulative fashion, and I want to check my logic on how I have applied SPRT().
SPRT requires users to set explicit null and alternative hypotheses. I have set these to:
H_0: treat = control
H_1: treat = control * 1.01
In the for-loop that follows I apply the SPRT() function every day to compute the log likelihood ratio of the cumulative data under each hypothesis, and I really just want to confirm that this is the correct way to analyze the data. Most examples I have seen set h0 and h1 explicitly (e.g., h0 = .85 and h1 = .85 * 1.01), while I have set them to reflect the observed rates for each day in the cumulative data, which seems more appropriate in the setting of an experiment (e.g., h0 = df_sprt$control[i]/df_sprt$n[i], h1 = (df_sprt$control[i] * MDE)/df_sprt$n[i]).
library(SPRT)
library(tidyverse)
# simulate cumulative data from an AB Test
set.seed(42)
DAYS <- 14
DAILY_N <- 1e3
BASERATE <- .85
MDE <- 1.02
df_sprt <-
tibble(
day = 1:DAYS,
control = rbinom(n = DAYS, size = DAILY_N, prob = BASERATE),
treat = rbinom(n = DAYS, size = DAILY_N, prob = BASERATE*MDE),
n = DAILY_N
) %>%
mutate(
control = cumsum(control),
treat = cumsum(treat),
n = cumsum(n)
)
# apply SPRT in a for loop
wald_a <- vector('numeric', length = nrow(df_sprt))
wald_b <- vector('numeric', length = nrow(df_sprt))
llr <- vector('numeric', length = nrow(df_sprt))
for (i in 1:nrow(df_sprt)) {
out <- SPRT(
distribution = "bernoulli",
type1 = 0.05, type2 = 0.20,
h0 = df_sprt$control[i]/df_sprt$n[i], h1 = (df_sprt$control[i] * MDE)/df_sprt$n[i],
n = df_sprt$n[i],
k = df_sprt$treat[i]
)
wald_a[i] <- out$wald.A
wald_b[i] <- out$wald.B
llr[i] <- out$llr
}
sprt_out <-
tibble(
llr,
wald_a,
wald_b,
cohort_day = 1:DAYS
)
# Plot the results
sprt_out %>%
ggplot(aes(x = cohort_day, y = llr)) +
geom_hline(
yintercept =
c(max(sprt_out$wald_a), max(sprt_out$wald_b)),
color = c('darkgreen', 'red')
) +
geom_point() +
geom_line() +
annotate(
x=10,y=max(sprt_out$wald_b),
label="Reject Alternative Hy & Retain Null Hy",
vjust=-1, geom="text", color = 'red'
) +
annotate(
x=10,y=max(sprt_out$wald_a),
label="Reject Null Hy & Accept Alternative Hy",
vjust=1.5, geom="text", color = 'darkgreen'
) +
scale_y_continuous(breaks = -10:20) +
scale_x_continuous(breaks = 1:20) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

deSolve: differential equations with two consecutive dynamics

I am simulating a ring tube with flowing water and a temperature gradient using deSolve::ode(). The ring is modelled as a vector where each element has a temperature value and position.
I am modelling the heat diffusion formula
(1) dT/dt = alpha * d2T/dx2
But I'm struggling with also moving the water along the ring. In theory, it's just about substituting the temperature at element i in the tube vector with the temperature s places earlier. Since s may not be an integer, it can be separated into its integer part (n) and fractional part (p): s = n + p. Consequently, the change in temperature due to the water moving becomes
(2) dT[i] = T[i-n] + p * (T[i-(n+1)] - T[i-n]) - T[i]
The problem is that s equals the water velocity v times the time step dt chosen at each iteration of the ODE solver.
My idea is to treat the phenomena as additive: first compute (1), then (2), and finally add them together. I'm worried, though, about the effect of time. The ODE solver with implicit methods decides the time step automatically and scales the unitary change delta down linearly.
My question is whether just returning (1) + (2) in the derivative function is correct, or whether I should break the two processes apart and compute the derivatives separately. In the second case, what would be the suggested approach?
EDIT:
Following @tpetzoldt's suggestion, I tried to implement the water flow using ReacTran::advection.1D(). My model has multiple sources of variation in temperature: the spontaneous symmetric heat diffusion; the water flow; a source of heat that is turned on if the temperature near a sensor (placed before the heat source) drops below a lower threshold and turned off if it rises above an upper threshold; and a constant heat dispersion determined by a cyclical external temperature.
Under the "Moving water" comment there is still my previous version of the code, now superseded by ReacTran::advection.1D().
The plot_type argument allows visualizing either a time sequence of the temperature in the water tube ("pipe"), or the temperature sequence at the sensors (before and after the heater).
library(deSolve)
library(dplyr)
library(ggplot2)
library(tidyr)
library(ReacTran)
test <- function(simTime = 5000, vel = 1, L = 500, thresh = c(16, 25), heatT = 25,
heatDisp = .0025, baseTemp = 15, alpha = .025,
adv_method = 'up', plot_type = c('pipe', 'sensors')) {
plot_type <- match.arg(plot_type)
sensorP <- round(L/2)
vec <- c(rep(baseTemp, L), 0)
eventfun <- function(t, y, pars) {
heat <- y[L + 1] > 0
if (y[sensorP] < thresh[1] & heat == FALSE) { # if heat is FALSE -> T was above the threshold
#browser()
y[L + 1] <- heatT
}
if (y[sensorP] > thresh[2] & heat == TRUE) { # if heat is TRUE -> T was below the threshold
#browser()
y[L + 1] <- 0
}
return(y)
}
rootfun <- function (t, y, pars) {
heat <- y[L + 1] > 0
trigger_root <- 1
if (y[sensorP] < thresh[1] & heat == FALSE & t > 1) { # if heat is FALSE -> T was above the threshold
#browser()
trigger_root <- 0
}
if (y[sensorP] > thresh[2] & heat == TRUE & t > 1) { # if heat is TRUE -> T was below the threshold
#browser()
trigger_root <- 0
}
return(trigger_root)
}
roll <- function(x, n) {
x[((1:length(x)) - (n + 1)) %% length(x) + 1]
}
fun <- function(t, y, pars) {
v <- y[1:L]
# Heat diffusion: dT/dt = alpha * d2T/d2X
d2Td2X <- c(v[2:L], v[1]) + c(v[L], v[1:(L - 1)]) - 2 * v
dT_diff <- pars * d2Td2X
# Moving water
# nS <- floor(vel)
# pS <- vel - nS
#
# v_shifted <- roll(v, nS)
# nS1 <- nS + 1
# v_shifted1 <- roll(v, nS + 1)
#
# dT_flow <- v_shifted + pS * (v_shifted1 - v_shifted) - v
dT_flow <- advection.1D(v, v = vel, dx = 1, C.up = v[L], C.down = v[1],
adv.method = adv_method)$dC
dT <- dT_flow + dT_diff
# heating of the ring after the sensor
dT[sensorP + 1] <- dT[sensorP + 1] + y[L + 1]
# heat dispersion
dT <- dT - heatDisp * (v - baseTemp + 2.5 * sin(t/(60*24) * pi * 2))
return(list(c(dT, 0)))
}
out <- ode.1D(y = vec, times = 1:simTime, func = fun, parms = alpha, nspec = 1,
events = list(func = eventfun, root = T),
rootfunc = rootfun)
if (plot_type == 'sensors') {
## Trend of the temperature at the sensors levels
out %>%
{.[,c(1, sensorP + 1, sensorP + 3, L + 2)]} %>%
as.data.frame() %>%
setNames(c('time', 'pre', 'post', 'heat')) %>%
mutate(Amb = baseTemp + 2.5 * sin(time/(60*24) * pi * 2)) %>%
pivot_longer(-time, values_to = "val", names_to = "trend") %>%
ggplot(aes(time, val)) +
geom_hline(yintercept = thresh) +
geom_line(aes(color = trend)) +
theme_minimal() +
theme(panel.spacing=unit(0, "lines")) +
labs(x = 'time', y = 'T°', color = 'sensor')
} else {
## Trend of the temperature in the whole pipe
out %>%
as.data.frame() %>%
pivot_longer(-time, values_to = "val", names_to = "x") %>%
filter(time %in% round(seq.int(1, simTime, length.out = 40))) %>%
ggplot(aes(as.numeric(x), val)) +
geom_hline(yintercept = thresh) +
geom_line(alpha = .5, show.legend = FALSE) +
geom_point(aes(color = val)) +
scale_color_gradient(low = "#56B1F7", high = "red") +
facet_wrap(~ time) +
theme_minimal() +
theme(panel.spacing=unit(0, "lines")) +
labs(x = 'x', y = 'T°', color = 'T°')
}
}
Interestingly, with a higher number of segments (L = 500) and a higher speed (vel = 2), it's possible to observe a spiking sequence at the post-heating sensor. Also, the processing time increases drastically, but more as an effect of the increased velocity than of the increased pipe resolution.
My biggest doubt now is whether ReacTran::advection.1D() makes sense in my context, since I'm modeling water temperature, while this function seems more related to the concentration of a solute in flowing water.
The problem looks like a PDE example with a mobile and a fixed phase. A good introduction to the "method of lines" (MOL) approach with R/deSolve can be found in the paper about ReacTran by Soetaert and Meysman (2012), doi.org/10.1016/j.envsoft.2011.08.011.
An example PDE can be found on slide 55 of some workshop slides; more can be found in the teaching package RTM.
R/deSolve/ReacTran tries to make ODEs/PDEs easy, but pitfalls remain. If numerical dispersion or oscillations occur, they can be caused by violating the Courant-Friedrichs-Lewy condition.
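As a rough illustration (not from the original answer), the Courant number for the advection term should stay at or below 1:
# Courant number C = v * dt / dx; values above 1 risk instability/oscillations
vel <- 2; dt <- 1; dx <- 1 # values as in the model above
vel * dt / dx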

Fixing a parameter to a distribution in JAGS

In the Bayesian programming language JAGS, I am looking for a way to fix a parameter to a specific distribution, as opposed to a constant. The paragraph below presents this question more explicitly and references JAGS code. I would also be open to answers that use other probabilistic programming languages (e.g., Stan).
The first code chunk below (model1) is a JAGS script designed to estimate a two-group Gaussian mixture model with unequal variances. I am looking for a way to fix one of the parameters (say $\mu_2$) to a particular distribution (e.g., dnorm(0,0.0001)). I know how to fix $\mu_2$ to a constant (e.g., see model2 in code chunk 2), though I cannot find a way to fix $\mu_2$ to my prior belief (e.g., see model3 in code chunk 3, which shows conceptually what I am trying to do).
Thanks in advance!
Code chunk 1
model1 = "
model {
for (i in 1:n1){
y1[i] ~ dnorm (mu1 , phi1)
}
for (i in 1:n2){
y2[i] ~ dnorm (mu2 , phi2)
}
# Priors
phi1 ~ dgamma(.001,.001)
phi2 ~ dgamma(.001,.001)
sigma2.1 <- 1/phi1
sigma2.2 <- 1/phi2
mu1 ~ dnorm (0,0.0001)
mu2 ~ dnorm (0,0.0001)
# Create a variable for the mean difference
delta <- mu1 - mu2
}
"
Code chunk 2
model2 = "
model {
for (i in 1:n1){
y1[i] ~ dnorm (mu1 , phi1)
}
for (i in 1:n2){
y2[i] ~ dnorm (mu2 , phi2)
}
# Priors
phi1 ~ dgamma(.001,.001)
phi2 ~ dgamma(.001,.001)
sigma2.1 <- 1/phi1
sigma2.2 <- 1/phi2
mu1 ~ dnorm (0,0.0001)
mu2 <- 1.27
# Create a variable for the mean difference
delta <- mu1 - mu2
}
"
Code chunk 3
model3 = "
model {
for (i in 1:n1){
y1[i] ~ dnorm (mu1 , phi1)
}
for (i in 1:n2){
y2[i] ~ dnorm (mu2 , phi2)
}
# Priors
phi1 ~ dgamma(.001,.001)
phi2 ~ dgamma(.001,.001)
sigma2.1 <- 1/phi1
sigma2.2 <- 1/phi2
mu1 ~ dnorm (0,0.0001)
mu2 <- dnorm (0,0.0001)
# Create a variable for the mean difference
delta <- mu1 - mu2
}
"
I don't know JAGS, but here are two Stan versions. One takes a single sample of mu2 across all iterations; the second takes a different sample of mu2 for each iteration.
Either way, I'm not qualified to judge whether this is actually a good idea. (The second version, in particular, is something that the Stan team has deliberately tried to avoid, for the reasons described here.) But it's at least possible.
(In both examples, I changed some of the prior distributions to make the data easier to work with, but the basic idea is the same.)
One sample of mu2
First, the Stan model.
data {
int<lower=0> n1;
vector[n1] y1;
int<lower=0> n2;
vector[n2] y2;
}
transformed data {
// Set mu2 to a single randomly selected value (instead of giving it a prior
// and estimating it).
real mu2 = normal_rng(0, 0.0001);
}
parameters {
real mu1;
real<lower=0> phi1;
real<lower=0> phi2;
}
transformed parameters {
real sigma1 = 1 / phi1;
real sigma2 = 1 / phi2;
}
model {
mu1 ~ normal(0, 0.0001);
phi1 ~ gamma(1, 1);
phi2 ~ gamma(1, 1);
y1 ~ normal(mu1, sigma1);
y2 ~ normal(mu2, sigma2);
}
generated quantities {
real delta = mu1 - mu2;
// We can't return mu2 from the transformed data block. So if we want to see
// what it was, we have to copy its value into a generated quantity and return
// that.
real mu2_return = mu2;
}
Next, R code to generate fake data and fit the model.
# Generate fake data.
n1 = 1000
n2 = 1000
mu1 = rnorm(1, 0, 0.0001)
mu2 = rnorm(1, 0, 0.0001)
phi1 = rgamma(1, shape = 1, rate = 1)
phi2 = rgamma(1, shape = 1, rate = 1)
y1 = rnorm(n1, mu1, 1 / phi1)
y2 = rnorm(n2, mu2, 1 / phi2)
delta = mu1 - mu2
# Fit the Stan model.
library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = T)
stan.data = list(n1 = n1, y1 = y1, n2 = n2, y2 = y2)
stan.model = stan(file = "stan_model.stan",
data = stan.data,
cores = 3, iter = 1000)
We can extract the samples from the Stan model and see that we correctly recovered the parameters' true values - except, of course, in the case of mu2.
# Pull out the samples.
library(tidybayes)
library(tidyverse)
stan.model %>%
spread_draws(mu1, phi1, mu2_return, phi2) %>%
ungroup() %>%
dplyr::select(.draw, mu1, phi1, mu2 = mu2_return, phi2) %>%
pivot_longer(cols = -c(.draw), names_to = "parameter") %>%
ggplot(aes(x = value)) +
geom_histogram() +
geom_vline(data = data.frame(parameter = c("mu1", "phi1", "mu2", "phi2"),
true.value = c(mu1, phi1, mu2, phi2)),
aes(xintercept = true.value), color = "red", size = 1.5) +
facet_wrap(~ parameter, scales = "free") +
theme_bw() +
scale_x_continuous("Parameter value") +
scale_y_continuous("Number of samples")
New sample of mu2 for each iteration
We can't generate a random number in the parameters, transformed parameters, or model block; again, this is a deliberate design choice. But we can generate a whole bunch of numbers in the transformed data block and grab a new one for each iteration. To do this, we need a way to figure out which iteration we're on in the parameters block. I used Louis's solution from the end of this discussion on the Stan forums. First, save the following C++ code as iter.hpp in your working directory:
static int itct = 1;
inline void add_iter(std::ostream* pstream__) {
itct += 1;
}
inline int get_iter(std::ostream* pstream__) {
return itct;
}
Next, define the Stan model as follows. The functions add_iter() and get_iter() are defined in iter.hpp; if you're working in RStudio, you'll get error symbols when you edit the Stan file because RStudio doesn't know that we're going to bring in those function definitions from elsewhere.
functions {
void add_iter();
int get_iter();
}
data {
int<lower=0> n1;
vector[n1] y1;
int<lower=0> n2;
vector[n2] y2;
int<lower=0> n_iterations;
}
transformed data {
vector[n_iterations + 1] all_mu2s;
for(n in 1:(n_iterations + 1)) {
all_mu2s[n] = normal_rng(0, 0.0001);
}
}
parameters {
real mu1;
real<lower=0> phi1;
real<lower=0> phi2;
}
transformed parameters {
real sigma1 = 1 / phi1;
real sigma2 = 1 / phi2;
real mu2 = all_mu2s[get_iter()];
}
model {
mu1 ~ normal(0, 0.0001);
phi1 ~ gamma(1, 1);
phi2 ~ gamma(1, 1);
y1 ~ normal(mu1, sigma1);
y2 ~ normal(mu2, sigma2);
}
generated quantities {
real delta = mu1 - mu2;
add_iter();
}
Note that the model actually generates 1 more random value for mu2 than we need. When I tried generating exactly n_iterations random values, I got an error informing me that Stan had tried to access all_mu2s[1001].
I find this worrisome, because it means I don't fully understand what's going on internally - shouldn't there be only 1000 iterations, given the R code below? But it just looks like an off-by-one error, and the fitted model looks reasonable, so I didn't pursue this further.
Also, note that this approach gets the iteration number, but not the chain. I ran just one chain; if you run more than one chain, the ith value of mu2 will be the same in each chain. That same Stan forums discussion has a suggestion for distinguishing among chains, but I didn't explore it.
Finally, generate fake data and fit the model to it. When we compile the model, we need to sneak in the function definitions from iter.hpp, as described here.
# Generate fake data.
n1 = 1000
n2 = 1000
mu1 = rnorm(1, 0, 0.0001)
mu2 = rnorm(1, 0, 0.0001)
phi1 = rgamma(1, shape = 1, rate = 1)
phi2 = rgamma(1, shape = 1, rate = 1)
y1 = rnorm(n1, mu1, 1 / phi1)
y2 = rnorm(n2, mu2, 1 / phi2)
delta = mu1 - mu2
n.iterations = 1000
# Fit the Stan model.
library(rstan)
stan.data = list(n1 = n1, y1 = y1, n2 = n2, y2 = y2,
n_iterations = n.iterations)
stan.model = stan_model(file = "stan_model.stan",
allow_undefined = T,
includes = paste0('\n#include "',
file.path(getwd(), 'iter.hpp'),
'"\n'))
stan.model.fit = sampling(stan.model,
data = stan.data,
chains = 1,
iter = n.iterations,
pars = c("mu1", "phi1", "mu2", "phi2"))
Once again, we recovered the values of mu1, phi1, and phi2 reasonably well. This time, we used a whole range of values for mu2, which follow the specified distribution.
# Pull out the samples.
library(tidybayes)
library(tidyverse)
stan.model.fit %>%
spread_draws(mu1, phi1, mu2, phi2) %>%
ungroup() %>%
dplyr::select(.draw, mu1, phi1, mu2 = mu2, phi2) %>%
pivot_longer(cols = -c(.draw), names_to = "parameter") %>%
ggplot(aes(x = value)) +
geom_histogram() +
stat_function(dat = data.frame(parameter = "mu2", value = 0),
fun = function(.x) { dnorm(.x, 0, 0.0001) * 0.01 },
color = "blue", size = 1.5) +
geom_vline(data = data.frame(parameter = c("mu1", "phi1", "mu2", "phi2"),
true.value = c(mu1, phi1, mu2, phi2)),
aes(xintercept = true.value), color = "red", size = 1.5) +
facet_wrap(~ parameter, scales = "free") +
theme_bw() +
scale_x_continuous("Parameter value") +
scale_y_continuous("Number of samples")

Marginal effects / interaction plots for lfe felm regression object

I need to create an interaction / marginal effects plot for a fixed effects model including clustered standard errors generated using the lfe "felm" command.
I have already created a function that achieves this. However, before I start using it, I wanted to double-check whether this function is correctly specified. Please find the function and a reproducible example below.
library(lfe)
### defining function
felm_marginal_effects <- function(regression_model, data, treatment, moderator, treatment_translation, moderator_translation, dependent_variable_translation, alpha = 0.05, se = NULL) {
library(ggplot2)
library(ggthemes)
library(gridExtra)
### defining function to get average marginal effects
getmfx <- function(betas, data, treatment, moderator) {
betas[treatment] + betas[paste0(treatment, ":", moderator)] * data[, moderator]
}
### defining function to get marginal effects for specific levels of the treatment variable
getmfx_high_low <- function(betas, data, treatment, moderator, treatment_val) {
betas[treatment] * treatment_val + betas[paste0(treatment, ":", moderator)] * data[, moderator] * treatment_val
}
### Defining function to analytically derive standard error for marginal effects
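### (delta method: Var(b_t + b_tm * m) = Var(b_t) + m^2 * Var(b_tm) + 2 * m * Cov(b_t, b_tm),
### where b_t is the treatment coefficient and b_tm the interaction coefficient)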
getvarmfx <- function(my_vcov, data, treatment, moderator) {
my_vcov[treatment, treatment] + data[, moderator]^2 * my_vcov[paste0(treatment, ":", moderator), paste0(treatment, ":", moderator)] + 2 * data[, moderator] * my_vcov[treatment, paste0(treatment, ":", moderator)]
}
### constraining data to relevant variables
data <- data[, c(treatment, moderator)]
### getting marginal effects
data[, "marginal_effects"] <- getmfx(coef(regression_model), data, treatment, moderator)
### getting marginal effects for high and low cases of treatment variable
data[, "marginal_effects_treatment_low"] <- getmfx_high_low(coef(regression_model), data, treatment, moderator, quantile(data[,treatment], 0.05))
data[, "marginal_effects_treatment_high"] <- getmfx_high_low(coef(regression_model), data, treatment, moderator, quantile(data[,treatment], 0.95))
### getting robust SEs
if (is.null(se)) {
data$se <- getvarmfx(regression_model$vcv, data, treatment, moderator)
} else if (se == "clustered") {
data$se <- getvarmfx(regression_model$clustervcv, data, treatment, moderator)
} else if (se == "robust") {
data$se <- getvarmfx(regression_model$robustvcv, data, treatment, moderator)
}
### Getting CI bounds
data[, "ci_lower"] <- data[, "marginal_effects"] - abs(qt(alpha/2, regression_model$df, lower.tail = TRUE)) * sqrt(data$se)
data[, "ci_upper"] <- data[, "marginal_effects"] + abs(qt(alpha/2, regression_model$df, lower.tail = TRUE)) * sqrt(data$se)
### plotting marginal effects plot
p_1 <- ggplot(data, aes_string(x = moderator)) +
geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper), fill = "grey70", alpha = 0.4) +
geom_line(aes(y = marginal_effects)) +
theme_fivethirtyeight() +
theme(plot.title = element_text(size = 11.5, hjust = 0.5), axis.title = element_text(size = 10)) +
geom_rug() +
xlab(moderator_translation) +
ylab(paste("Marginal effect of",treatment_translation,"on",dependent_variable_translation)) +
ggtitle("Average marginal effects")
p_2 <- ggplot(data, aes_string(x = moderator)) +
geom_line(aes(y = marginal_effects_treatment_high, color = paste0("High ",treatment_translation))) +
geom_line(aes(y = marginal_effects_treatment_low, color = paste0("Low ",treatment_translation))) +
theme_fivethirtyeight() +
theme(plot.title = element_text(size = 11.5, hjust = 0.5), axis.title = element_text(size = 10), axis.title.y = element_blank(), legend.justification = c(0.95, 0.95), legend.position = c(1, 1), legend.direction = "vertical") +
geom_rug() +
xlab(moderator_translation) +
ylab(paste("Marginal effect of",treatment_translation,"on",dependent_variable_translation)) +
ggtitle("Marginal effects at high / low levels of treatment") +
scale_color_manual(name = NULL, values = c(rgb(229, 93, 89, maxColorValue = 255), rgb(75, 180, 184, maxColorValue = 255)), labels=c(paste0("High ",treatment_translation), paste0("Low ",treatment_translation)))
### exporting plots as combined grob
return(grid.arrange(p_1, p_2, ncol = 2))
}
### example:
# example model (just for demonstration, fixed effects and cluster variables make little sense here)
model <- felm(mpg ~ cyl + am + cyl:am | carb | 0 | cyl, data = mtcars)
# creating marginal effects plot
felm_marginal_effects(regression_model = model, data = mtcars, treatment = "cyl", moderator = "am", treatment_translation = "Number of cylinders", moderator_translation = "Transmission", dependent_variable_translation = "Miles per (US) gallon")
Happy for any advice on how to make this a better, "well-coded", fast function so that it's more useful for others afterwards. However, I'm mostly looking to confirm whether it's "correct" in the first place.
Additionally, I wanted to check back with the community regarding some remaining questions, particularly:
Can I use the standard errors I generated for the average marginal effects for the "high" and "low" treatment cases as well, or do I need to generate different standard errors for these cases? If so, how?
Instead of using the analytically derived standard errors, I could also calculate bootstrapped standard errors by creating many coefficient estimates based on repeated sub-samples of the data. How would I generate bootstrapped standard errors for the high / low case?
Is there something about fixed effects models or fixed effects models with clustered standard errors that make marginal effects plots or anything else I did in the code fundamentally inadmissible?
PS.: The above function and questions are kind of an extension of How to plot marginal effect of an interaction after felm() function

Plotting a graph with sample sizes and power estimates

I have simulated a linear model 1000 times using randomly generated height and weight values, and randomly assigned each participant to a treatment or non-treatment group (a factor coded 1 and 0). Let's say the model was:
lm(bmi~height + weight + treatment, data = df)
I am now struggling for the following:
The model now needs to cycle through sample sizes between 300 and 500 in steps of 10, with 1000 replications at each size, and store the proportion of simulated experiments with p-values less than 0.05, in order to estimate the power to detect a change of 0.5 in bmi between the two treatment groups at the 5% significance level.
After doing the above, I then need to create a figure that depicts the sample sizes on the x-axis and the estimated power on the y-axis, and also marks the smallest sample size that achieves an 80% power estimate in a distinct color.
Any ideas how and where to go from here?
Thanks,
Chris
I would do it something like this:
library(dplyr)
library(ggplot2)
# first, encapsulate the steps required to generate one sample of data
# at a given sample size, run the model, and extract the treatment p-value
do_simulate <- function(n) {
# use assumed data generating process to simulate data and add error
data <- tibble(height = rnorm(n, 69, 0.1),
weight = rnorm(n, 197.8, 1.9),
treatment = sample(c(0, 1), n, replace = TRUE),
error = rnorm(n, sd = 1.75),
bmi = 703 * weight / height^2 + 0.5 * treatment + error)
# model the data
mdl <- lm(bmi ~ height + weight + treatment, data = data)
# extract p-value for treatment effect
summary(mdl)[["coefficients"]]["treatment", "Pr(>|t|)"]
}
# second, wrap that single simulation in a replicate so that you can perform
# many simulations at a given sample size and estimate power as the proportion
# of simulations that achieve a significant p-value
simulate_power <- function(n, alpha = 0.05, r = 1000) {
p_values <- replicate(r, do_simulate(n))
power <- mean(p_values < alpha)
return(c(n, power))
}
# third, estimate power at each of your desired
# sample sizes and restructure that data for ggplot
mx <- vapply(seq(300, 500, 10), simulate_power, numeric(2))
plot_data <- tibble(n = mx[1, ],
power = mx[2, ])
# fourth, make a note of the minimum sample size to achieve your desired power
plot_data %>%
filter(power > 0.80) %>%
top_n(-1, n) %>%
pull(n) -> min_n
# finally, construct the plot
ggplot(plot_data, aes(x = n, y = power)) +
geom_smooth(method = "loess", se = FALSE) +
geom_vline(xintercept = min_n)
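To mark the smallest qualifying sample size in a distinct color, as the question asks, you could additionally highlight that point (a sketch, assuming plot_data and min_n from above):
ggplot(plot_data, aes(x = n, y = power)) +
geom_smooth(method = "loess", se = FALSE) +
geom_point(data = filter(plot_data, n == min_n), color = "red", size = 3) +
geom_hline(yintercept = 0.8, linetype = "dashed")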
