Failing to simulate data for a negative binomial probability distribution - r

I am hoping someone can help me.
In a beginners workshop I attended, in the process of fitting a multiple regression model, the instructor initially established a prior predictive check using a Poisson distribution for the outcome. This was done in two steps. Initially, a function was created:
multiple_regression_poisson_dgp <- function(predictor1,
predictor2, alpha_mean, alpha_sd, beta_predictor1_mean,
beta_predictor1_sd, beta_predictor2_mean,
beta_predictor2_sd) {
N <- length(predictor1)
alpha <- rnorm(1, mean = alpha_mean, sd = alpha_sd);
beta_predictor1 <- rnorm(1, mean = beta_predictor1_mean,
sd = beta_predictor1_sd);
beta_predictor2 <- rnorm(1, mean = beta_predictor2_mean,
sd = beta_super_sd);
outcome <- rpois(N, lambda = alpha + beta_predictor1 *
predictor1 + beta_predictor2 * predictor2)
return(outcome)
}
After this function was created, the following priors were generated:
multiple_regression_poisson_dgp(dataset$predictor1,
dataset$predictor2,
alpha_mean = 1,
alpha_sd = 0.5,
beta_predictor1_mean = -0.25,
beta_predictor1_sd = 0.5,
beta_predictor2_mean = 0,
beta_predictor2_sd = 1)
This worked fine. The issue is that, further down the line, it was shown that the Poisson distribution was not the most adequate. The negative binomial was suggested as the next step. Unfortunately, when I try to replicate the process for the negative binomial, I am unsuccessful. I have tried to replicate both of the steps shown above, but for the negative binomial. The first step was coded as:
multiple_regression_negative_binomial_dgp <-
function(predictor1, predictor2, alpha_mean, alpha_sd,
beta_predictor1_mean, beta_predictor1_sd,
beta_predictor2_mean, beta_predictor2_sd, phi_mean,
phi_sd) {
N <- length(predictor1)
alpha <- rnorm(1, mean = alpha_mean, sd = alpha_sd);
beta_predictor1 <- rnorm(1, mean = beta_predictor1_mean,
sd = beta_predictor1_sd);
beta_predictor2 <- rnorm(1, mean = beta_predictor2_mean,
sd = beta_super_sd);
phi <- rnorm(1, mean = phi_mean, sd = phi_sd);
outcome<- rnbinom(N, size = mu + mu^2/phi, mu = alpha +
beta_predictor1 * predictor1 + beta_predictor2 *
predictor2)
return(outcome)
}
Because there is a phi in the negative binomial, and given that it will be a parameter whose prior I will be calculating, I assumed it needed to be added to the equation. Additionally, given the documentation for rnbinom(), I thought i could treat mu as I treated lambda in the Poisson generation, feeding the regression equation onto it.
The function is likely inadequate, but after I create it and move onto the second step, the errors emerge. The second step I coded as:
multiple_regression_negative_binomial_dgp(dataset$predictor1,
dataset$predictor2,
alpha_mean = 1,
alpha_sd = 0.5,
beta_predictor1_mean = -0.25,
beta_predictor1_sd = 0.5,
beta_predictor2_mean = 0,
beta_predictor2_sd = 1,
phi_mean = 0,
phi_sd = 1)
However, as soon as I try to run this data generating process, I get the warning stating:
Error in rnbinom(N, size = mu, mu = alpha + beta_predictor1 * predictor1 + beta_predictor2 * predictor2 : object 'mu' not found
Any help would be much appreciated, I realize that I am applying a more mechanistic mindset in trying to replicate the Poisson data generating process for the negative binomial one, but I have been unable to find any clues as to how to solve this. Most examples I came across define a value for mu and for size, instead of 'feeding' it the formula.

In your multiple_regression_negative_binomial_dgp function, you call rnbinom. That function needs a size argument, and you assign mu + mu^2/phi to it, but mu is not defined within the function, nor passed to it. The fact that rnbinom contains a mu argument, which you do provide (alpha + beta_predictor1 * predictor1 + beta_predictor2 * predictor2), doesn't take care of it, because rnbinom doesn't pass that information over to size. I would suggest you try:
multiple_regression_negative_binomial_dgp <- function(predictor1, predictor2,
alpha_mean, alpha_sd, beta_predictor1_mean,
beta_predictor1_sd, beta_predictor2_mean,
beta_predictor2_sd, phi_mean, phi_sd) {
N <- length(predictor1)
alpha <- rnorm(1, mean=alpha_mean, sd=alpha_sd)
beta_predictor1 <- rnorm(1, mean=beta_predictor1_mean, sd=beta_predictor1_sd)
beta_predictor2 <- rnorm(1, mean=beta_predictor2_mean, sd=beta_super_sd);
phi <- rnorm(1, mean=phi_mean, sd=phi_sd)
Mu <- alpha + beta_predictor1*predictor1 + beta_predictor2*predictor2
outcome <- rnbinom(N, size=Mu + Mu^2/phi, mu=Mu)
return(outcome)
}

Related

Fixing a parameter to a distribution in JAGS

In the Bayesian programing language JAGS, I am looking for a way to fix a parameter to a specific distribution, as opposed to a constant. The paragraph below presents this question more explicitly and references JAGS code. I would also be open to answers that use other probabilistic programming languages (e.g., stan).
The first code chunk below (model1) is a JAGS script designed to estimate a two-group Gaussian mixture model with unequal variances. I am looking for a way to fix one of the parameters (say $\mu_2$) to a particular distribution (e.g., dnorm(0,0.0001)). I know how to fix $\mu_2$ to a constant (e.g., see model2 in code chunk 2), though I cannot find a way to fix $\mu_2$ to my prior belief(e.g., see model3 in code chunk 3, which shows conceptually what I am trying to do).
Thanks in advance!
Code chunk 1
model1 = "
model {
for (i in 1:n1){
y1[i] ~ dnorm (mu1 , phi1)
}
for (i in 1:n2){
y2[i] ~ dnorm (mu2 , phi2)
}
# Priors
phi1 ~ dgamma(.001,.001)
phi2 ~ dgamma(.001,.001)
sigma2.1 <- 1/phi1
sigma2.2 <- 1/phi2
mu1 ~ dnorm (0,0.0001)
mu2 ~ dnorm (0,0.0001)
# Create a variable for the mean difference
delta <- mu1 - mu2
}
"
Code chunk 2
model2 = "
model {
for (i in 1:n1){
y1[i] ~ dnorm (mu1 , phi1)
}
for (i in 1:n2){
y2[i] ~ dnorm (mu2 , phi2)
}
# Priors
phi1 ~ dgamma(.001,.001)
phi2 ~ dgamma(.001,.001)
sigma2.1 <- 1/phi1
sigma2.2 <- 1/phi2
mu1 ~ dnorm (0,0.0001)
mu2 <- 1.27
# Create a variable for the mean difference
delta <- mu1 - mu2
}
"
Code chunk 3
model3 = "
model {
for (i in 1:n1){
y1[i] ~ dnorm (mu1 , phi1)
}
for (i in 1:n2){
y2[i] ~ dnorm (mu2 , phi2)
}
# Priors
phi1 ~ dgamma(.001,.001)
phi2 ~ dgamma(.001,.001)
sigma2.1 <- 1/phi1
sigma2.2 <- 1/phi2
mu1 ~ dnorm (0,0.0001)
mu2 <- dnorm (0,0.0001)
# Create a variable for the mean difference
delta <- mu1 - mu2
}
"
I don't know JAGS, but here are two Stan versions. One takes a single sample of mu2 across all iterations; the second takes a different sample of mu2 for each iteration.
Either way, I'm not qualified to judge whether this is actually a good idea. (The second version, in particular, is something that the Stan team has deliberately tried to avoid, for the reasons described here.) But it's at least possible.
(In both examples, I changed some of the prior distributions to make the data easier to work with, but the basic idea is the same.)
One sample of mu2
First, the Stan model.
data {
int<lower=0> n1;
vector[n1] y1;
int<lower=0> n2;
vector[n2] y2;
}
transformed data {
// Set mu2 to a single randomly selected value (instead of giving it a prior
// and estimating it).
real mu2 = normal_rng(0, 0.0001);
}
parameters {
real mu1;
real<lower=0> phi1;
real<lower=0> phi2;
}
transformed parameters {
real sigma1 = 1 / phi1;
real sigma2 = 1 / phi2;
}
model {
mu1 ~ normal(0, 0.0001);
phi1 ~ gamma(1, 1);
phi2 ~ gamma(1, 1);
y1 ~ normal(mu1, sigma1);
y2 ~ normal(mu2, sigma2);
}
generated quantities {
real delta = mu1 - mu2;
// We can't return mu2 from the transformed data block. So if we want to see
// what it was, we have to copy its value into a generated quantity and return
// that.
real mu2_return = mu2;
}
Next, R code to generate fake data and fit the model.
# Generate fake data.
n1 = 1000
n2 = 1000
mu1 = rnorm(1, 0, 0.0001)
mu2 = rnorm(1, 0, 0.0001)
phi1 = rgamma(1, shape = 1, rate = 1)
phi2 = rgamma(1, shape = 1, rate = 1)
y1 = rnorm(n1, mu1, 1 / phi1)
y2 = rnorm(n2, mu2, 1 / phi2)
delta = mu1 - mu2
# Fit the Stan model.
library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = T)
stan.data = list(n1 = n1, y1 = y1, n2 = n2, y2 = y2)
stan.model = stan(file = "stan_model.stan",
data = stan.data,
cores = 3, iter = 1000)
We can extract the samples from the Stan model and see that we correctly recovered the parameters' true values - except, of course, in the case of mu2.
# Pull out the samples.
library(tidybayes)
library(tidyverse)
stan.model %>%
spread_draws(mu1, phi1, mu2_return, phi2) %>%
ungroup() %>%
dplyr::select(.draw, mu1, phi1, mu2 = mu2_return, phi2) %>%
pivot_longer(cols = -c(.draw), names_to = "parameter") %>%
ggplot(aes(x = value)) +
geom_histogram() +
geom_vline(data = data.frame(parameter = c("mu1", "phi1", "mu2", "phi2"),
true.value = c(mu1, phi1, mu2, phi2)),
aes(xintercept = true.value), color = "red", size = 1.5) +
facet_wrap(~ parameter, scales = "free") +
theme_bw() +
scale_x_continuous("Parameter value") +
scale_y_continuous("Number of samples")
New sample of mu2 for each iteration
We can't generate a random number in the parameters, transformed parameters, or model block; again, this is a deliberate design choice. But we can generate a whole bunch of numbers in the transformed data block and grab a new one for each iteration. To do this, we need a way to figure out which iteration we're on in the parameters block. I used Louis's solution from the end of this discussion on the Stan forums. First, save the following C++ code as iter.hpp in your working directory:
static int itct = 1;
inline void add_iter(std::ostream* pstream__) {
itct += 1;
}
inline int get_iter(std::ostream* pstream__) {
return itct;
}
Next, define the Stan model as follows. The functions add_iter() and get_iter() are defined in iter.hpp; if you're working in RStudio, you'll get error symbols when you edit the Stan file because RStudio doesn't know that we're going to bring in those function definitions from elsewhere.
functions {
void add_iter();
int get_iter();
}
data {
int<lower=0> n1;
vector[n1] y1;
int<lower=0> n2;
vector[n2] y2;
int<lower=0> n_iterations;
}
transformed data {
vector[n_iterations + 1] all_mu2s;
for(n in 1:(n_iterations + 1)) {
all_mu2s[n] = normal_rng(0, 0.0001);
}
}
parameters {
real mu1;
real<lower=0> phi1;
real<lower=0> phi2;
}
transformed parameters {
real sigma1 = 1 / phi1;
real sigma2 = 1 / phi2;
real mu2 = all_mu2s[get_iter()];
}
model {
mu1 ~ normal(0, 0.0001);
phi1 ~ gamma(1, 1);
phi2 ~ gamma(1, 1);
y1 ~ normal(mu1, sigma1);
y2 ~ normal(mu2, sigma2);
}
generated quantities {
real delta = mu1 - mu2;
add_iter();
}
Note that the model actually generates 1 more random value for mu2 than we need. When I tried generating exactly n_iterations random values, I got an error informing me that Stan had tried to access all_mu2s[1001].
I find this worrisome, because it means I don't fully understand what's going on internally - shouldn't there be only 1000 iterations, given the R code below? But it just looks like an off-by-one error, and the fitted model looks reasonable, so I didn't pursue this further.
Also, note that this approach gets the iteration number, but not the chain. I ran just one chain; if you run more than one chain, the ith value of mu2 will be the same in each chain. That same Stan forums discussion has a suggestion for distinguishing among chains, but I didn't explore it.
Finally, generate fake data and fit the model to it. When we compile the model, we need to sneak in the function definitions from iter.hpp, as described here.
# Generate fake data.
n1 = 1000
n2 = 1000
mu1 = rnorm(1, 0, 0.0001)
mu2 = rnorm(1, 0, 0.0001)
phi1 = rgamma(1, shape = 1, rate = 1)
phi2 = rgamma(1, shape = 1, rate = 1)
y1 = rnorm(n1, mu1, 1 / phi1)
y2 = rnorm(n2, mu2, 1 / phi2)
delta = mu1 - mu2
n.iterations = 1000
# Fit the Stan model.
library(rstan)
stan.data = list(n1 = n1, y1 = y1, n2 = n2, y2 = y2,
n_iterations = n.iterations)
stan.model = stan_model(file = "stan_model.stan",
allow_undefined = T,
includes = paste0('\n#include "',
file.path(getwd(), 'iter.hpp'),
'"\n'))
stan.model.fit = sampling(stan.model,
data = stan.data,
chains = 1,
iter = n.iterations,
pars = c("mu1", "phi1", "mu2", "phi2"))
Once again, we recovered the values of mu1, phi1, and phi2 reasonably well. This time, we used a whole range of values for mu2, which follow the specified distribution.
# Pull out the samples.
library(tidybayes)
library(tidyverse)
stan.model.fit %>%
spread_draws(mu1, phi1, mu2, phi2) %>%
ungroup() %>%
dplyr::select(.draw, mu1, phi1, mu2 = mu2, phi2) %>%
pivot_longer(cols = -c(.draw), names_to = "parameter") %>%
ggplot(aes(x = value)) +
geom_histogram() +
stat_function(dat = data.frame(parameter = "mu2", value = 0),
fun = function(.x) { dnorm(.x, 0, 0.0001) * 0.01 },
color = "blue", size = 1.5) +
geom_vline(data = data.frame(parameter = c("mu1", "phi1", "mu2", "phi2"),
true.value = c(mu1, phi1, mu2, phi2)),
aes(xintercept = true.value), color = "red", size = 1.5) +
facet_wrap(~ parameter, scales = "free") +
theme_bw() +
scale_x_continuous("Parameter value") +
scale_y_continuous("Number of samples")

Log-likelihood calculation given estimated parameters

In general: I want to calculate the (log) likelihood of data N given the estimated model parameters from data O.
More specifically, I want to know if my ll_given_modPars function below exists in one of the may R packages dealing with data modeling (lme4, glmm, etc.) as shown in this abstract example (not run):
library(lme4)
o_model <- lmer(observed ~ fixed.id + (1|random.id), data = O, REML = F)
n_logLik <- ll_given_modPars(model.estimates = o_model, data = N)
The fictional example above is on a linear mixed model for simplicity but I would like to eventually do this in a generalized linear mixed model which deals with the Poisson family or directly the negative binomial (for lme4: glmer(..., family="poisson") or glmer.nb ).
From what I could see most packages deal with parameter estimation (great, I need that) but then compare models on the same data with different combinations of fixed and random effects using anova or something to that extent which is not what I want to do.
I want the log likelihood for the same parameters on different data.
The main attempts made:
After not finding a function which seems to be doing that I thought of 'simply' tweaking the lme4 code to my purposes: it calculates the log likelihood for parameters given the data so I thought I could use the same framework but not have it optimize over different parameters but isolate the likelihood calculation function and just give it the parameters and the data. Unfortunately the code is a bit above my current skills https://github.com/lme4/lme4/blob/master/R/nbinom.R (I get a bit lost in how they use the objects over which they optimize).
I thought of doing the likelihood calculation myself, starting with a linear mixed model and then working my way up to more involved ones. But already with this example I'm having a hard time following the math and even when using the formula as specified the obtained log-likelihood is still different (I don't know why, see code in appendix) and I fear it will take me too long before I'll be able to do it for the more involved models (such as Poisson or negative binomial)
At this point I'm not sure what avenue is best to pursue and would appreciate any input you might have.
Appendix: Trying to calculate the log-likelihood (or finding a closed form approximation) based on How does lmer (from the R package lme4) compute log likelihood?. lmer (from lme4) gives a log-likelihood of -17.8 and I get -45.56
library(lme4)
set.seed(7)
n <- 2 # number of groups
m <- 4 # number of instances per group
fixed.effect <- c(0, -2, -1, 1)
tau <- 5 # standard deviation of random effects
sigma <- 2 # standard deviation of error
random.effect <- rnorm(n, mean=0, sd=tau)
sim.data <- data.frame(GROUP.ID=as.factor(rep(1:n, each=m)),
GROUP.EFFECT=rep(random.effect, each=m),
INSTANCE.ID=as.factor(rep(1:m, times=n)),
INSTANCE.EFFECT=rep(fixed.effect, times=n))
# calculate expected Y value
sim.data$EXPECT.Y <- sim.data$GROUP.EFFECT + sim.data$INSTANCE.EFFECT
# now observe Y value, assuming normally distributed with fixed std. deviation
sim.data$OBS.Y <- rnorm(nrow(sim.data), mean=sim.data$EXPECT.Y, sigma)
model <- lmer(OBS.Y ~ INSTANCE.ID + (1|GROUP.ID), data = sim.data, REML=F)
summary(model)
toy.model.var <- VarCorr(model)
toy.model.sigma <- attr(toy.model.var, 'sc') # corresponds to the epsilon, residual standard deviation
toy.model.tau.squared <- toy.model.var[[1]][1] # corresponds to variance of random effects
toy.model.betas <- model#beta
# left product, spread within gropus
toy.data <- rbind(sim.data$OBS.Y[1:4], sim.data$OBS.Y[5:8])
toy.mean.adj <- rbind(toy.data[1,] - mean(unlist(toy.data[1,])), toy.data[2,] - mean(unlist(toy.data[2,])))
toy.mean.adj.prod1 <- prod(dnorm(unlist(toy.mean.adj[1,]), mean = 0, sd = toy.model.sigma))
toy.mean.adj.prod2 <- prod(dnorm(unlist(toy.mean.adj[2,]), mean = 0, sd = toy.model.sigma))
toy.mean.adj.final.prod <- toy.mean.adj.prod1 * toy.mean.adj.prod2
# right product, spread between gropus
toy.mean.beta.adj <- rbind(mean(unlist(toy.data[1,])) - toy.model.betas, mean(unlist(toy.data[2,])) - toy.model.betas)
toy.mean.beta.adj[1,] <- toy.mean.beta.adj[1,] - c(0, toy.model.betas[1], toy.model.betas[1], toy.model.betas[1])
toy.mean.beta.adj[2,] <- toy.mean.beta.adj[2,] - c(0, toy.model.betas[1], toy.model.betas[1], toy.model.betas[1])
toy.mean.beta.adj.prod1 <- prod(dnorm(unlist(toy.mean.beta.adj[1,]), mean = 0, sd = sqrt(toy.model.sigma^2/4 + toy.model.tau.squared)) * sqrt(2/4*pi*toy.model.sigma^2))
toy.mean.beta.adj.prod2 <- prod(dnorm(unlist(toy.mean.beta.adj[2,]), mean = 0, sd = sqrt(toy.model.sigma^2/4 + toy.model.tau.squared)) * sqrt(2/4*pi*toy.model.sigma^2))
toy.mean.beta.adj.final.prod <- toy.mean.beta.adj.prod1 * toy.mean.beta.adj.prod2
toy.total.prod <- toy.mean.adj.final.prod * toy.mean.beta.adj.final.prod
log(toy.total.prod)
EDIT: A helpful link was provided in the comments (https://stats.stackexchange.com/questions/271903/understand-marginal-likelihood-of-mixed-effects-models). Converting my example from above I can replicate the log-likelihood
library(mvtnorm)
z = getME(model, "Z")
zt = getME(model, "Zt")
psi = bdiag(replicate(2, toy.model.tau.squared, simplify=FALSE))
betw = z%*%psi%*%zt
err = Diagonal(8, sigma(model)^2)
v = betw + err
dmvnorm(sim.data$OBS.Y, predict(model, re.form=NA), as.matrix(v), log=TRUE)
While I did not manage to come up with a closed form solution for all of them, I did manage to reproduce the log-likelihoods using numerical integration. I have posted below small examples for how it works in the LMM setting (assuming normal residuals random effects) as well as the GLMM with Poisson and Negative-Binomial. Note that especially the latter one tends so differ ever so slightly when you increase the sample size. My guess is that there is some rounding happening somewhere but for my purposes the precision achieved here is good enough. I will for now accept my own answer but if someone posts a closed form for the Poisson or the Negative-Binomial I will happily accept your answer :)
library(lme4)
library(mvtnorm)
################################################################################
# LMM numerical integration
set.seed(7)
n <- 2 # number of groups
m <- 4 # number of instances per group
fixed.effect <- c(0, -2, -1, 1)
tau <- 5 # standard deviation of random effects
sigma <- 2 # standard deviation of error
random.effect <- rnorm(n, mean=0, sd=tau)
normal.data <- data.frame(GROUP.ID=as.factor(rep(1:n, each=m)),
GROUP.EFFECT=rep(random.effect, each=m),
INSTANCE.ID=as.factor(rep(1:m, times=n)),
INSTANCE.EFFECT=rep(fixed.effect, times=n))
# calculate expected Y value
normal.data$EXPECT.Y <- normal.data$GROUP.EFFECT + normal.data$INSTANCE.EFFECT
# now observe Y value, assuming normally distributed with fixed std. deviation
normal.data$OBS.Y <- rnorm(nrow(normal.data), mean=normal.data$EXPECT.Y, sigma)
normal.model <- lmer(OBS.Y ~ INSTANCE.ID + (1|GROUP.ID), data = normal.data, REML=F)
summary(normal.model)
normal.model.var <- VarCorr(normal.model)
normal.model.sigma <- attr(normal.model.var, 'sc') # corresponds to the epsilon, residual standard deviation
normal.model.tau.squared <- normal.model.var[[1]][1] # corresponds to variance of random effects
normal.model.betas <- normal.model#beta
normal.group.tau <- sqrt(normal.model.tau.squared)
normal.group.sigma <- sigma(normal.model)
normal.group.beta <- predict(normal.model, re.form=NA)[1:4]
integrate_group1 <- function(x){
p1 <- dnorm(normal.data$OBS.Y[1] - normal.group.beta[1] - x, mean = 0, sd = normal.group.sigma) * dnorm(x, mean = 0, sd = normal.group.tau)
p2 <- dnorm(normal.data$OBS.Y[2] - normal.group.beta[2] - x, mean = 0, sd = normal.group.sigma)
p3 <- dnorm(normal.data$OBS.Y[3] - normal.group.beta[3] - x, mean = 0, sd = normal.group.sigma)
p4 <- dnorm(normal.data$OBS.Y[4] - normal.group.beta[4] - x, mean = 0, sd = normal.group.sigma)
p_out <- p1 * p2 * p3 * p4
p_out
}
normal.group1.integration <- integrate(integrate_group1, lower = -10*normal.group.tau, upper = 10*normal.group.tau, subdivisions = 10000L, rel.tol = 1e-10, abs.tol = 1e-50)$value[1]
integrate_group2 <- function(x){
p1 <- dnorm(normal.data$OBS.Y[5] - normal.group.beta[1] - x, mean = 0, sd = normal.group.sigma) * dnorm(x, mean = 0, sd = normal.group.tau)
p2 <- dnorm(normal.data$OBS.Y[6] - normal.group.beta[2] - x, mean = 0, sd = normal.group.sigma)
p3 <- dnorm(normal.data$OBS.Y[7] - normal.group.beta[3] - x, mean = 0, sd = normal.group.sigma)
p4 <- dnorm(normal.data$OBS.Y[8] - normal.group.beta[4] - x, mean = 0, sd = normal.group.sigma)
p_out <- p1 * p2 * p3 * p4
p_out
}
normal.group2.integration <- integrate(integrate_group2, lower = -10*normal.group.tau, upper = 10*normal.group.tau, subdivisions = 10000L, rel.tol = 1e-10, abs.tol = 1e-50)$value[1]
log(normal.group1.integration) + log(normal.group2.integration)
#################################
# Poisson numerical integration
set.seed(13) #13
n <- 2 # number of groups
m <- 4 # number of instances per group
# effect sizes are much smaller since they are exponentiated
fixed.effect <- c(0, -0.2, -0.1, 0.2)
tau <- 1.5 # standard deviation of random effects
# sigma <- 1.5 # standard deviation of error
random.effect <- rnorm(n, mean=0, sd=tau) # guide effect
poisson.data <- data.frame(GROUP.ID=as.factor(rep(1:n, each=m)),
GROUP.EFFECT=rep(random.effect, each=m),
INSTANCE.ID=as.factor(rep(1:m, times=n)),
INSTANCE.EFFECT=rep(fixed.effect, times=n))
# calculate expected Y value
poisson.data$EXPECT.Y <- exp(poisson.data$GROUP.EFFECT + poisson.data$INSTANCE.EFFECT)
# now observe Y value, assuming normally distributed with fixed std. deviation
poisson.data$OBS.Y <- rpois(nrow(poisson.data), poisson.data$EXPECT.Y)
poisson.model <- glmer(OBS.Y ~ INSTANCE.ID + (1|GROUP.ID), data = poisson.data, family="poisson")
summary(poisson.model)
poisson.model.var <- VarCorr(poisson.model)
poisson.model.sigma <- attr(poisson.model.var, 'sc') # corresponds to the epsilon, residual standard deviation
poisson.model.tau.squared <- poisson.model.var[[1]][1] # corresponds to variance of random effects
poisson.model.betas <- poisson.model#beta
poisson.group.tau <- sqrt(poisson.model.tau.squared)
poisson.group.sigma <- sigma(poisson.model)
poisson.group.beta <- predict(poisson.model, re.form=NA)[1:4]
integrate_group1 <- function(x){
p1 <- dpois(poisson.data$OBS.Y[1], lambda = exp(poisson.group.beta[1] + x)) * dnorm(x, mean = 0, sd = poisson.group.tau)
p2 <- dpois(poisson.data$OBS.Y[2], lambda = exp(poisson.group.beta[2] + x))
p3 <- dpois(poisson.data$OBS.Y[3], lambda = exp(poisson.group.beta[3] + x))
p4 <- dpois(poisson.data$OBS.Y[4], lambda = exp(poisson.group.beta[4] + x))
p_out <- p1 * p2 * p3 * p4
p_out
}
poisson.group1.integration <- integrate(integrate_group1, lower = -10*poisson.group.tau, upper = 10*poisson.group.tau, subdivisions = 10000L, rel.tol = 1e-10, abs.tol = 1e-50)$value[1]
integrate_group2 <- function(x){
p1 <- dpois(poisson.data$OBS.Y[5], lambda = exp(poisson.group.beta[1] + x)) * dnorm(x, mean = 0, sd = poisson.group.tau)
p2 <- dpois(poisson.data$OBS.Y[6], lambda = exp(poisson.group.beta[2] + x))
p3 <- dpois(poisson.data$OBS.Y[7], lambda = exp(poisson.group.beta[3] + x))
p4 <- dpois(poisson.data$OBS.Y[8], lambda = exp(poisson.group.beta[4] + x))
p_out <- p1 * p2 * p3 * p4
p_out
}
poisson.group2.integration <- integrate(integrate_group2, lower = -10*poisson.group.tau, upper = 10*poisson.group.tau, subdivisions = 10000L, rel.tol = 1e-10, abs.tol = 1e-50)$value[1]
log(poisson.group1.integration) + log(poisson.group2.integration)
#############
# Negative-Binomial numerical integration
set.seed(13) #13
n <- 100 # number of groups
m <- 4 # number of instances per group
# effect sizes are much smaller since they are exponentiated
fixed.effect <- c(0, -0.2, -0.1, 0.2)
tau <- 1.5 # standard deviation of random effects
theta <- 0.5
# sigma <- 1.5 # standard deviation of error
random.effect <- rnorm(n, mean=0, sd=tau) # guide effect
nb.data <- data.frame(GROUP.ID=as.factor(rep(1:n, each=m)),
GROUP.EFFECT=rep(random.effect, each=m),
INSTANCE.ID=as.factor(rep(1:m, times=n)),
INSTANCE.EFFECT=rep(fixed.effect, times=n))
# calculate expected Y value
nb.data$EXPECT.Y <- exp(nb.data$GROUP.EFFECT + nb.data$INSTANCE.EFFECT)
# now observe Y value, assuming normally distributed with fixed std. deviation
nb.data$OBS.Y <- rnbinom(nrow(nb.data), mu = nb.data$EXPECT.Y, size = theta)
nb.model <- glmer.nb(OBS.Y ~ INSTANCE.ID + (1|GROUP.ID), data = nb.data)
summary(nb.model)
nb.model.var <- VarCorr(nb.model)
nb.model.sigma <- attr(nb.model.var, 'sc') # corresponds to the epsilon, residual standard deviation
nb.model.tau.squared <- nb.model.var[[1]][1] # corresponds to variance of random effects
nb.model.betas <- nb.model#beta
nb.group.tau <- sqrt(nb.model.tau.squared)
nb.group.beta <- predict(nb.model, re.form=NA)[1:4]
nb.group.dispersion <- getME(nb.model, "glmer.nb.theta")
integration_function_generator <- function(input.obs, input.beta, input.dispersion, input.tau){
function(x){
p1 <- dnbinom(input.obs[1], mu = exp(input.beta[1] + x), size = input.dispersion) * dnorm(x, mean = 0, sd = input.tau)
p2 <- dnbinom(input.obs[2], mu = exp(input.beta[2] + x), size = input.dispersion)
p3 <- dnbinom(input.obs[3], mu = exp(input.beta[3] + x), size = input.dispersion)
p4 <- dnbinom(input.obs[4], mu = exp(input.beta[4] + x), size = input.dispersion)
p_out <- p1 * p2 * p3 * p4
p_out
}
}
nb.all.group.integrations <- c()
for(i in 1:n){
temp.obs <- nb.data$OBS.Y[(1:4)+(i-1)*4]
temp_integrate_function <- integration_function_generator(temp.obs, nb.group.beta, nb.group.dispersion, nb.group.tau)
temp.integration <- integrate(temp_integrate_function, lower = -10*nb.group.tau, upper = 10*nb.group.tau, subdivisions = 10000L, rel.tol = 1e-10, abs.tol = 1e-50)$value[1]
nb.all.group.integrations <- c(nb.all.group.integrations, temp.integration)
}
sum(log(nb.all.group.integrations))

Simulate an AR(1) process with uniform innovations

I need to plot an AR(1) graph for the process
y[k] = 0.75 * y[k-1] + e[k] for y0 = 1.
Assume that e[k] is uniformly distributed on the interval [-0.5, 0.5].
I am trying to use arima.sim:
library(tseries)
y.0 <- arima.sim(model=list(ar=.75), n=100)
plot(y.0)
It does not seem correct. Also, what parameters do I change if y[0] = 10?
We want to use R base function arima.sim for this task, and no extra libraries are required.
By default, arima.sim generates ARIMA with innovations ~ N(0,1). If we want to change this, we need to control the rand.gen or innov argument. For example, you want innovations from uniform distributions U[-0.5, 0.5], we can do either of the following:
arima.sim(model=list(ar=.75), n=100, rand.gen = runif, min = -0.5, max = 0.5)
arima.sim(model=list(ar=.75), n = 100, innov = runif(100, -0.5, 0.5))
Example
set.seed(0)
y <- arima.sim(model=list(ar=.75), n = 100, innov = runif(100, -0.5, 0.5))
ts.plot(y)
In case we want to have explicit control on y[0], we can just shift the above time series such that it starts from y[0]. Suppose y0 is our desired starting value, we can do
y <- y - y[1] + y0
For example, starting from y0 = 1:
y <- y - y[1] + 1
ts.plot(y)

R Remove intercepts in logistic regression

I am using the rms library to perform regularized logistic regression, and wish to force the intercept to zero. I'm using the following to simulate and regress:
library(rms)
N = 100
pred <- vapply(1:12, function(i) rnorm(N, mean = 0, sd =1), numeric(N))
resp <- 20*pred[, 1] - 3*pred[, 7] - 2*pred[, 8] + matrix(rnorm(N, sd = 0.1)) + 20
pr <- 1 / (1 + exp(-resp))
y <- rbinom(N, 1, pr)
lrm(y ~ pred, penalty = 1)
The post at How to remove intercept in R suggests including '0 +' or '- 1' in the model formula. However, this does not appear to work for lrm.
You can use glmnet. It also includes a cross validation function for choosing the turning parameter.
library(glmnet)
N = 1000
pred <- vapply(1:12, function(i) rnorm(N, mean = 0, sd =1), numeric(N))
resp <- 20*pred[, 1] - 3*pred[, 7] - 2*pred[, 8] + matrix(rnorm(N, sd = 0.1)) + 20
pr <- 1 / (1 + exp(-resp))
y <- rbinom(N, 1, pr)
result <- cv.glmnet(pred, y, family="binomial", intercept=FALSE)
# best lambda based on cv
result$lambda.min
# coefficient
coef(result$glmnet.fit, s=result$lambda.min)

Using R for simulation based power analysis of Multi-Factor Within-Subjects Repeated Measures ANOVA

I have been trying to run a power analysis for study design that will be analyzed by a 4 factor Repeated Measures ANOVA where all factors are within-subjects. After a lot of searching and some help on Cross Validated, it is clear that I need to
Do a simulation where I generate data based on some pilot data I have
Run it through the appropriate anova model
And then iterate to find the average power.
The code pasted below was found at this link, and it seems to do exactly what I want to do, albeit for only a 2 factor within-subjects ANOVA.
However, I am having trouble understanding all of the details and what specific lines do because I am very new to R. I would be very grateful if anyone could shed some light on these lines of code:
My questions:
Why is the intercept for all conditions set to -1? Is that standard to this model?
intercept = rep(-1, nconds)
Do the values in these columns represent the levels of the x1 and x2 factors or are they SDs etc? If they are the levels, shouldn't the levels for x2 be (0, .5, 0, .5)?
true.effect.x1 = c(0, 0, .5, .5)
true.effect.x2 = c(0, .5, .5, .5)
What mean and SDs do I take from my pilot data (from multiple subjects) for these three lines of code?
#relatively large subject-specific variance in intercepts
sub.intercept = rep(rnorm(nsub, mean=0, sd=2), times=1, each=nconds)
#relatively small by-subject adjustments to the effects
sub.effect = rep(rnorm(nsub, mean=0, sd=0.05), times=1, each=nconds)
#unexplained error
error = rnorm(nsub*nconds, mean=0, sd=1)
I know this is a very long question but I would really appreciate any help anyone could provide! Thank you so so much!
FULL CODE
library(ez)
nsub = 30
nconds = 4
nsims = 100
#create an empty matrix that will be populated with p-values
p = matrix(NA, nrow=nsims, ncol=3)
#subject vector
sub = sort(rep(1:nsub, nconds))
#2x2 factorial design
cond = data.frame(x1=c('a','a','b','b'), x2=c('c','d','c','d'))
# fixed effects
intercept = rep(-1, nconds)
true.effect.x1 = c(0, 0, .5, .5)
true.effect.x2 = c(0, 0.5, .5, .5)
X = rep((intercept + true.effect.x1 + true.effect.x2),nsub)
#simulation loop
for (x in 1:nsims)
{
#random effects
#relatively large subject-specific variance in intercepts
sub.intercept = rep(rnorm(nsub, mean=0, sd=2), times=1, each=nconds)
#relatively small by-subject adjustments to the effects
sub.effect = rep(rnorm(nsub, mean=0, sd=0.05), times=1, each=nconds)
#unexplained error
error = rnorm(nsub*nconds, mean=0, sd=1)
#simulated dependent variable
observed.y = X + (sub.intercept + sub.effect + error)
#place everything in a data frame
df = cbind(sub, cond, observed.y)
names(df) = c('sub','x1','x2','y')
df$sub = as.factor(df$sub)
#extract the p-values for each effect from a repeated measure ANOVA (ezANOVA from 'ez' package)
p[x,1] = ezANOVA(data=df, dv=.(y), wid=.(sub), within=.(x1, x2))$ANOVA[1,5]
p[x,2] = ezANOVA(data=df, dv=.(y), wid=.(sub), within=.(x1, x2))$ANOVA[2,5]
p[x,3] = ezANOVA(data=df, dv=.(y), wid=.(sub), within=.(x1, x2))$ANOVA[3,5]
}
###### p-values < .05 ? ######
sig.x1 = ifelse(p[,1] <= .05, 1, 0)
sig.x2 = ifelse(p[,2] <= .05, 1, 0)
sig.int = ifelse(p[,3] <= .05, 1, 0)
###### Histograms ######
par(mfrow=c(3,1))
hist(p[,1], 20, xaxp=c(0, 1, 20), col=2, main = paste('power X1:', mean(sig.x1 * 100), '% with ', nsub, 'subjects'))
hist(p[,2], 20, xaxp=c(0, 1, 20), col=2, main = paste('power X2:', mean(sig.x2 * 100), '% with ', nsub, 'subjects'))
hist(p[,3], 20, xaxp=c(0, 1, 20), col=2, main = paste('power interaction:', mean(sig.int * 100), '% with ', nsub, 'subjects'))
1) Because -1 = -sum(true.effect.x1) and the author that you were incompletely copying only had one "true.effect". They wanted a sum-type of contrast. This is suggesting to me that you don't have a sufficient background in statistics to really understand this project.
2) No, they represent covariate values against which coefficients will be estimated. You could use contrasts like c(-2,0,1,1) and the beta estimate would have been one-half what it would be for c(-1,0,-.5,-.5). Think of them as categorical indicators. "Dummy variables" is one term that is used.
3) The question is confusing and I think it is you that are confused and not me. The sd specified is twice the effect, and I think you have copied the comment above from another source without understanding.

Resources