What is causing this rjags error: dimension mismatch? - r

I am experiencing issues with running the following time-series JAGS model in R:
data(lynx)
y <- as.vector(lynx)
y
x <- 1:length(y)
library(rjags)
mod <- "model {
alpha ~ dnorm(0, 0.0001)
beta ~ dnorm(0, 0.0001)
lambda ~ dgamma(1, 1)
for (i in 2:length(y)) {
y[i] ~ dpois(lambda[i])
lambda[i] <- alpha + beta * x[i - 1]
}
}"
mod <- textConnection(mod)
samples <- jags.model(mod, data = list('x' = x, 'y' = y), n.chains = 3)
# Error in jags.model(mod, data = list(x = x, y = y), n.chains = 3) :
# RUNTIME ERROR:
# Cannot insert node into lambda[1:114]. Dimension mismatch
Is someone able to explain what the above error is referring to and how to fix it?

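lambda is used as the rate of the Poisson distribution inside your loop, where it is indexed as a vector (lambda[i]), but you also give it a scalar gamma prior (lambda ~ dgamma(1, 1)). JAGS cannot define the same node both ways, which is what the dimension mismatch refers to. Drop the gamma prior, since lambda[i] is fully determined by alpha, beta and x. On top of this, you need to use the appropriate link function for the Poisson distribution: a log link keeps the rate positive.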
mod <- "model {
alpha ~ dnorm(0, 0.0001)
beta ~ dnorm(0, 0.0001)
for (i in 2:length(y)) {
y[i] ~ dpois(lambda[i])
log(lambda[i]) <- alpha + beta * x[i - 1]
}
}"
mod <- textConnection(mod)
# create model object
model_fit <- jags.model(mod, data = list('x' = x, 'y' = y), n.chains = 3)
# collect samples
samples <- coda.samples(model_fit, c("alpha", "beta"), n.iter = 10000)
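
As a quick cross-check that does not need JAGS, the same mean structure can be fitted by maximum likelihood with glm(). This is only a sketch under the assumption that a frequentist fit is an acceptable sanity check for the posterior means; the data frame d and the column name x_lag are illustrative names, not part of the original code.

```r
# Same mean structure as the corrected JAGS model:
# log(lambda[i]) = alpha + beta * x[i - 1], for i = 2, ..., n
data(lynx)
y <- as.vector(lynx)
x <- seq_along(y)
n <- length(y)

# Pair each y[i] with the previous time index x[i - 1]
d <- data.frame(y = y[-1], x_lag = x[-n])

# Poisson regression with a log link, so the fitted rates are always positive
fit <- glm(y ~ x_lag, data = d, family = poisson(link = "log"))
coef(fit)           # rough counterparts of alpha and beta
range(fitted(fit))  # fitted rates on the response scale
```

With vague priors such as dnorm(0, 0.0001), the posterior means of alpha and beta would be expected to land close to these estimates.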


Nonlinear regression in R error in step factor

I need to approximate data points with an exponential function that has a lower limit (the variable y is a price over time and I need to fix a minimum value, so the asymptote of the exponential function can't be at 0). For some values of y my code works, but for others it returns an error. How can I solve this? Thanks
R code:
y <- c(26973, 24907, 22999, 21236, 19609, 18107, 16720, 15439, 14256, 13163,
12155, 11224, 10364, 9570, 8836)
x <- c(1:15)
train <- data.frame(x, y)
colnames(train) <- c("x", "y")
# Select an approximate $\theta$, since theta must be lower than min(y), and greater than zero
theta.0 <- min(train$y) * 0.5 #min(data.df$y) * 0.5
# Estimate the rest parameters using a linear model
model.0 <- lm(log(y - theta.0) ~ x, data = train)
alpha.0 <- exp(coef(model.0)[1])
beta.0 <- coef(model.0)[2]
# Starting parameters
start <- list(alpha = alpha.0, beta = beta.0, theta = theta.0)
print(start)
model <- nls(y ~ alpha * exp(beta * x) + theta , data = train, start = start)
plot(train$x, train$y)
lines(train$x, predict(model, list(x = train$x)), col = 'skyblue', lwd = 3)
Output:
Error in nls(y ~ alpha * exp(beta * x) + theta, data = train, start = start) :
step factor 0.000488281 reduced below 'minFactor' of 0.000976562

Failing to optimise negative binomial model using optim

I am trying to manually optimise a negative binomial regression model using the optim() function in R, predicting a count variable y from a matrix of factors X with the following code:
# generating some fake data
n <- 1000
X <- matrix(NA, ncol = 5, nrow = n)
X[,1] <- 1
X[,2] <- sample(size = n, x = c(0,1), replace = TRUE)
X[,3] <- sample(size = n, x = c(0,1), replace = TRUE)
X[,4] <- sample(size = n, x = c(0,1), replace = TRUE)
X[,5] <- sample(size = n, x = c(0,1), replace = TRUE)
beta0 <- 3
beta1 <- -2
beta2 <- -2
beta3 <- -4
beta4 <- -0.9
k <- 0.9
## draws from negative binomial distribution
mu <- exp(beta0 + beta1 * X[,2] + beta2 * X[,3] + beta3 * X[,4] + beta4 * X[,5])
theta <- mu + mu ^2 / k
# dependent variable (rnegbin() comes from the MASS package)
library(MASS)
y <- rnegbin(n, mu = mu, theta = theta)
# function to be optimised
negbin_ll <- function(y, X, theta){
beta <- theta[1:ncol(X)]
alpha <- theta[ncol(X) + 1]
logll <- y * log(alpha) + y *( beta %*% t(X) ) - (y + (1 / alpha ) ) * log( 1 + alpha * exp(beta %*% t(X))) + lgamma(y + (1 / alpha)) - lgamma ( y + 1) - lgamma ( 1 / alpha)
logll <- sum( logll )
return(logll)
}
stval <- rep(0, ncol(X) + 1)
res <-
optim(
stval,
negbin_ll,
y = y,
X = X,
control = list(fnscale = -1),
hessian = TRUE,
method = "BFGS"
)
The code should produce point estimates from the optimisation process, but instead optim() fails with the error: Error in optim(stval, negbin_ll, y = y, X = X, control = list(fnscale = -1), : initial value in 'vmmin' is not finite.
I already tried to change log(gamma(...)) to lgamma(...) in the likelihood function and tried many other ways, but I fail to get estimates.
Changing the start values of optim also does not help.
Do you have any idea if there is any particularity to the likelihood function that leads to values being treated in any odd fashion?
Help would be much appreciated.
optim tries several points on its way to the optimum, and in your case it hits some non-positive values in the arguments of the logs (the starting value alpha = 0 already makes log(alpha) and 1 / alpha non-finite). One way around this is to reject any parameter values that make those arguments non-positive by returning a large negative number (negative because you are maximising), like -length(y)*10^6. I rewrote the log-likelihood function along those lines, and it more or less works:
negbin_ll <- function(y, X, theta){
beta <- theta[1:ncol(X)]
alpha <- theta[ncol(X) + 1]
if(any(alpha<=0)) return(-length(y)*10^6)
if(any(1 + alpha * exp(beta %*% t(X))<=0)) return(-length(y)*10^6)
logll <- y * log(alpha) + y *( beta %*% t(X) ) - (y + (1 / alpha ) ) * log( 1 + alpha * exp(beta %*% t(X))) + lgamma(y + (1 / alpha)) - lgamma ( y + 1) - lgamma ( 1 / alpha)
logll <- sum( logll )
return(logll)
}
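
An alternative to returning a penalty value, sketched here as an assumption rather than as the method used in the answer above, is to let optim() work on log(alpha), so that alpha = exp(...) is positive for every trial point. The simulated data, the function name negbin_ll_log and the use of rnbinom() from base R are illustrative choices.

```r
set.seed(1)

# Small simulated data set (illustrative values, not the ones from the question)
n <- 500
X <- cbind(1, rbinom(n, 1, 0.5))
beta_true <- c(1, -0.5)
alpha_true <- 0.5
mu <- as.vector(exp(X %*% beta_true))
y <- rnbinom(n, size = 1 / alpha_true, mu = mu)

# Same NB2 log-likelihood as above, but the last parameter is log(alpha)
negbin_ll_log <- function(par, y, X) {
  beta <- par[seq_len(ncol(X))]
  alpha <- exp(par[ncol(X) + 1])   # always strictly positive
  eta <- as.vector(X %*% beta)
  sum(y * log(alpha) + y * eta -
        (y + 1 / alpha) * log1p(alpha * exp(eta)) +
        lgamma(y + 1 / alpha) - lgamma(y + 1) - lgamma(1 / alpha))
}

# Starting at zero now means alpha = exp(0) = 1, so the initial value is finite
res <- optim(rep(0, ncol(X) + 1), negbin_ll_log, y = y, X = X,
             control = list(fnscale = -1), method = "BFGS")

beta_hat <- res$par[seq_len(ncol(X))]
alpha_hat <- exp(res$par[ncol(X) + 1])  # back-transform to the alpha scale
```

The design choice is to move the positivity constraint into the parameterisation, so no trial point has to be rejected inside the likelihood.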

Fixing a parameter to a distribution in JAGS

In the Bayesian programming language JAGS, I am looking for a way to fix a parameter to a specific distribution, as opposed to a constant. The paragraph below presents this question more explicitly and references JAGS code. I would also be open to answers that use other probabilistic programming languages (e.g., Stan).
The first code chunk below (model1) is a JAGS script designed to estimate a two-group Gaussian mixture model with unequal variances. I am looking for a way to fix one of the parameters (say $\mu_2$) to a particular distribution (e.g., dnorm(0,0.0001)). I know how to fix $\mu_2$ to a constant (e.g., see model2 in code chunk 2), though I cannot find a way to fix $\mu_2$ to my prior belief (e.g., see model3 in code chunk 3, which shows conceptually what I am trying to do).
Thanks in advance!
Code chunk 1
model1 = "
model {
for (i in 1:n1){
y1[i] ~ dnorm (mu1 , phi1)
}
for (i in 1:n2){
y2[i] ~ dnorm (mu2 , phi2)
}
# Priors
phi1 ~ dgamma(.001,.001)
phi2 ~ dgamma(.001,.001)
sigma2.1 <- 1/phi1
sigma2.2 <- 1/phi2
mu1 ~ dnorm (0,0.0001)
mu2 ~ dnorm (0,0.0001)
# Create a variable for the mean difference
delta <- mu1 - mu2
}
"
Code chunk 2
model2 = "
model {
for (i in 1:n1){
y1[i] ~ dnorm (mu1 , phi1)
}
for (i in 1:n2){
y2[i] ~ dnorm (mu2 , phi2)
}
# Priors
phi1 ~ dgamma(.001,.001)
phi2 ~ dgamma(.001,.001)
sigma2.1 <- 1/phi1
sigma2.2 <- 1/phi2
mu1 ~ dnorm (0,0.0001)
mu2 <- 1.27
# Create a variable for the mean difference
delta <- mu1 - mu2
}
"
Code chunk 3
model3 = "
model {
for (i in 1:n1){
y1[i] ~ dnorm (mu1 , phi1)
}
for (i in 1:n2){
y2[i] ~ dnorm (mu2 , phi2)
}
# Priors
phi1 ~ dgamma(.001,.001)
phi2 ~ dgamma(.001,.001)
sigma2.1 <- 1/phi1
sigma2.2 <- 1/phi2
mu1 ~ dnorm (0,0.0001)
mu2 <- dnorm (0,0.0001)
# Create a variable for the mean difference
delta <- mu1 - mu2
}
"
I don't know JAGS, but here are two Stan versions. One takes a single sample of mu2 across all iterations; the second takes a different sample of mu2 for each iteration.
Either way, I'm not qualified to judge whether this is actually a good idea. (The second version, in particular, is something that the Stan team has deliberately tried to avoid, for the reasons described here.) But it's at least possible.
(In both examples, I changed some of the prior distributions to make the data easier to work with, but the basic idea is the same.)
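
To make the difference between the two versions concrete, here is a small base-R sketch (not Stan) of what each one does to mu2 over the course of sampling. The names n_iter, mu2_single and mu2_fresh are hypothetical, and the standard deviation of 0.0001 simply mirrors the normal_rng(0, 0.0001) call in the Stan code.

```r
set.seed(42)
n_iter <- 1000

# Version 1: draw mu2 once and reuse that value in every iteration
mu2_single <- rep(rnorm(1, mean = 0, sd = 0.0001), n_iter)

# Version 2: draw a fresh mu2 for each iteration
mu2_fresh <- rnorm(n_iter, mean = 0, sd = 0.0001)

# The first vector is constant; the second varies from iteration to iteration
length(unique(mu2_single))
length(unique(mu2_fresh))
```

In the first version the other parameters are estimated conditional on one fixed value of mu2; in the second, their draws also reflect the spread of mu2 under its assumed distribution.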
One sample of mu2
First, the Stan model.
data {
int<lower=0> n1;
vector[n1] y1;
int<lower=0> n2;
vector[n2] y2;
}
transformed data {
// Set mu2 to a single randomly selected value (instead of giving it a prior
// and estimating it).
real mu2 = normal_rng(0, 0.0001);
}
parameters {
real mu1;
real<lower=0> phi1;
real<lower=0> phi2;
}
transformed parameters {
real sigma1 = 1 / phi1;
real sigma2 = 1 / phi2;
}
model {
mu1 ~ normal(0, 0.0001);
phi1 ~ gamma(1, 1);
phi2 ~ gamma(1, 1);
y1 ~ normal(mu1, sigma1);
y2 ~ normal(mu2, sigma2);
}
generated quantities {
real delta = mu1 - mu2;
// We can't return mu2 from the transformed data block. So if we want to see
// what it was, we have to copy its value into a generated quantity and return
// that.
real mu2_return = mu2;
}
Next, R code to generate fake data and fit the model.
# Generate fake data.
n1 = 1000
n2 = 1000
mu1 = rnorm(1, 0, 0.0001)
mu2 = rnorm(1, 0, 0.0001)
phi1 = rgamma(1, shape = 1, rate = 1)
phi2 = rgamma(1, shape = 1, rate = 1)
y1 = rnorm(n1, mu1, 1 / phi1)
y2 = rnorm(n2, mu2, 1 / phi2)
delta = mu1 - mu2
# Fit the Stan model.
library(rstan)
options(mc.cores = parallel::detectCores())
rstan_options(auto_write = TRUE)
stan.data = list(n1 = n1, y1 = y1, n2 = n2, y2 = y2)
stan.model = stan(file = "stan_model.stan",
data = stan.data,
cores = 3, iter = 1000)
We can extract the samples from the Stan model and see that we correctly recovered the parameters' true values - except, of course, in the case of mu2.
# Pull out the samples.
library(tidybayes)
library(tidyverse)
stan.model %>%
spread_draws(mu1, phi1, mu2_return, phi2) %>%
ungroup() %>%
dplyr::select(.draw, mu1, phi1, mu2 = mu2_return, phi2) %>%
pivot_longer(cols = -c(.draw), names_to = "parameter") %>%
ggplot(aes(x = value)) +
geom_histogram() +
geom_vline(data = data.frame(parameter = c("mu1", "phi1", "mu2", "phi2"),
true.value = c(mu1, phi1, mu2, phi2)),
aes(xintercept = true.value), color = "red", size = 1.5) +
facet_wrap(~ parameter, scales = "free") +
theme_bw() +
scale_x_continuous("Parameter value") +
scale_y_continuous("Number of samples")
New sample of mu2 for each iteration
We can't generate a random number in the parameters, transformed parameters, or model block; again, this is a deliberate design choice. But we can generate a whole bunch of numbers in the transformed data block and grab a new one for each iteration. To do this, we need a way to figure out which iteration we're on in the parameters block. I used Louis's solution from the end of this discussion on the Stan forums. First, save the following C++ code as iter.hpp in your working directory:
static int itct = 1;
inline void add_iter(std::ostream* pstream__) {
itct += 1;
}
inline int get_iter(std::ostream* pstream__) {
return itct;
}
Next, define the Stan model as follows. The functions add_iter() and get_iter() are defined in iter.hpp; if you're working in RStudio, you'll get error symbols when you edit the Stan file because RStudio doesn't know that we're going to bring in those function definitions from elsewhere.
functions {
void add_iter();
int get_iter();
}
data {
int<lower=0> n1;
vector[n1] y1;
int<lower=0> n2;
vector[n2] y2;
int<lower=0> n_iterations;
}
transformed data {
vector[n_iterations + 1] all_mu2s;
for(n in 1:(n_iterations + 1)) {
all_mu2s[n] = normal_rng(0, 0.0001);
}
}
parameters {
real mu1;
real<lower=0> phi1;
real<lower=0> phi2;
}
transformed parameters {
real sigma1 = 1 / phi1;
real sigma2 = 1 / phi2;
real mu2 = all_mu2s[get_iter()];
}
model {
mu1 ~ normal(0, 0.0001);
phi1 ~ gamma(1, 1);
phi2 ~ gamma(1, 1);
y1 ~ normal(mu1, sigma1);
y2 ~ normal(mu2, sigma2);
}
generated quantities {
real delta = mu1 - mu2;
add_iter();
}
Note that the model actually generates 1 more random value for mu2 than we need. When I tried generating exactly n_iterations random values, I got an error informing me that Stan had tried to access all_mu2s[1001].
I find this worrisome, because it means I don't fully understand what's going on internally - shouldn't there be only 1000 iterations, given the R code below? But it just looks like an off-by-one error, and the fitted model looks reasonable, so I didn't pursue this further.
Also, note that this approach gets the iteration number, but not the chain. I ran just one chain; if you run more than one chain, the ith value of mu2 will be the same in each chain. That same Stan forums discussion has a suggestion for distinguishing among chains, but I didn't explore it.
Finally, generate fake data and fit the model to it. When we compile the model, we need to sneak in the function definitions from iter.hpp, as described here.
# Generate fake data.
n1 = 1000
n2 = 1000
mu1 = rnorm(1, 0, 0.0001)
mu2 = rnorm(1, 0, 0.0001)
phi1 = rgamma(1, shape = 1, rate = 1)
phi2 = rgamma(1, shape = 1, rate = 1)
y1 = rnorm(n1, mu1, 1 / phi1)
y2 = rnorm(n2, mu2, 1 / phi2)
delta = mu1 - mu2
n.iterations = 1000
# Fit the Stan model.
library(rstan)
stan.data = list(n1 = n1, y1 = y1, n2 = n2, y2 = y2,
n_iterations = n.iterations)
stan.model = stan_model(file = "stan_model.stan",
allow_undefined = TRUE,
includes = paste0('\n#include "',
file.path(getwd(), 'iter.hpp'),
'"\n'))
stan.model.fit = sampling(stan.model,
data = stan.data,
chains = 1,
iter = n.iterations,
pars = c("mu1", "phi1", "mu2", "phi2"))
Once again, we recovered the values of mu1, phi1, and phi2 reasonably well. This time, we used a whole range of values for mu2, which follow the specified distribution.
# Pull out the samples.
library(tidybayes)
library(tidyverse)
stan.model.fit %>%
spread_draws(mu1, phi1, mu2, phi2) %>%
ungroup() %>%
dplyr::select(.draw, mu1, phi1, mu2 = mu2, phi2) %>%
pivot_longer(cols = -c(.draw), names_to = "parameter") %>%
ggplot(aes(x = value)) +
geom_histogram() +
stat_function(data = data.frame(parameter = "mu2", value = 0),
fun = function(.x) { dnorm(.x, 0, 0.0001) * 0.01 },
color = "blue", size = 1.5) +
geom_vline(data = data.frame(parameter = c("mu1", "phi1", "mu2", "phi2"),
true.value = c(mu1, phi1, mu2, phi2)),
aes(xintercept = true.value), color = "red", size = 1.5) +
facet_wrap(~ parameter, scales = "free") +
theme_bw() +
scale_x_continuous("Parameter value") +
scale_y_continuous("Number of samples")

Calculating RSS manually with given pairs of beta0 and beta1

I am trying to manually calculate the RSS for a dataset with given pairs of beta0 and beta1. For each (beta_0,beta_1) pair of values, I need to calculate the residual sum of squares. Store it as a vector in data called RSS. Here's the code provided.
x = pinotnoir$Aroma
y = pinotnoir$Quality
fit = lm(y ~ x)
summary(fit)
b0s <- seq(0, 10, .1)
b1s <- seq(0, 4, .01)
data <- expand.grid(beta0=b0s, beta1=b1s)
Here's what I have so far. I think the residual calculation is wrong but I'm not sure how to fix it.
rows = length(b1s)
rsd <- rep(NA,rows)
for (i in 1:rows){
residual = (y - (b0s[i] + b1s[i] * x))^2
rsd[i] <- residual
}
data <- expand.grid(beta0=b0s, beta1=b1s, RSS=rsd)
Any help would be appreciated. Thanks in advance!
I am not sure this is exactly what you are after, but by adapting your code slightly you can get the sum of squared residuals for each (beta0, beta1) pair and find which pair minimises it (using the mtcars data for the example).
mtcars
x = mtcars$drat
y = mtcars$wt
(fit = lm(y ~ x))
summary(fit)
grid_len <- 20
b0s <- seq(5, 10, length.out = grid_len)
b1s <- seq(-3, -1, length.out = grid_len)
(data <- expand.grid(beta0=b0s, beta1=b1s))
rows = nrow(data)
resids <- rep(NA,rows)
for (i in 1:rows) {
fitted <- (data$beta0[i] + (data$beta1[i] * x))
squared_resid <- (y - fitted)^2
SSR <- sum(squared_resid)
resids[i] <- SSR
cat(i, ": ", SSR, "\n")
}
data[which.min(resids), ]
fit
results:
> data[which.min(resids), ]
beta0 beta1
332 7.894737 -1.315789
> fit
Call:
lm(formula = y ~ x)
Coefficients:
(Intercept) x
7.906 -1.304
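
Since the question asks for the RSS to be stored in data itself, the loop can also be written as one vectorised call that adds an RSS column to the grid. This is a sketch using the same mtcars example and grid as above; the helper name rss_for and the object name grid are illustrative.

```r
x <- mtcars$drat
y <- mtcars$wt

grid_len <- 20
grid <- expand.grid(beta0 = seq(5, 10, length.out = grid_len),
                    beta1 = seq(-3, -1, length.out = grid_len))

# Residual sum of squares for a single (beta0, beta1) pair
rss_for <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

# Apply it to every row of the grid and store the result as a column
grid$RSS <- mapply(rss_for, grid$beta0, grid$beta1)

# Grid point with the smallest RSS, to compare with coef(lm(y ~ x))
best <- grid[which.min(grid$RSS), ]
best
```

The grid minimum can only approximate the least-squares fit, so its RSS is never smaller than the RSS of lm(y ~ x).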

Trouble Using Function which takes Formula Argument in R

I have a function implementing the Fisher algorithm in R for a GLM which takes a formula as an argument. However, when attempting to run it I get the error: Error in model.frame.default(formula = formula, drop.unused.levels = TRUE) : invalid type (closure) for variable 't'
I tried calling it in several ways (e.g. using as.formula(y ~ t)), but nothing seems to work.
myglm <- function(formula,data,start = 0) {
X = model.matrix(formula,data) #It appears that the issue comes from this line
Y = data[,1]
n = dim(X)[1]
p <- dim(X)[2]
beta_0 = rep(1,p)
M = t(X)%*%X
beta = rep(0,p)#Least Squares Estimate
epsilon = 0.01
#Run Fisher Iterations
while (norm(beta-beta_0,type = "2")/norm(beta_0, type = "2") > epsilon) {
beta_0 = beta
eta = X %*% beta
lambda = exp(eta)
F = t(X) %*% diag(as.vector(lambda)) %*% X #Fisher information matrix
s = t(X) %*% (Y - exp(eta)) #Score function
beta = beta + solve(F) %*% s
}
vcov = solve(F)
coef = matrix(c(0,0,0,0),nrow = 2, ncol = 2)
coef[,1] = beta
coef[,2] = t(sqrt(diag(vcov)))
colnames(coef) = c("Coefficients","Standard error")
rownames(coef) = c("beta1", "beta2")
#Calculate Deviance
mod_sat = glm(formula, family = poisson(link = "log"))
log_likelihood = Y %*% eta - exp(eta)
deviance = 2*(LogLik(mod_sat) - log_likelihood)
return(list(coef,deviance,vcov))
}
f = formula(y ~ t)
load(url("https://www.math.ntnu.no/emner/TMA4315/2020h/hoge-veluwe.Rdata")) #This is stored as "data"
myglm(f, data)
Your issue is in this line:
mod_sat = glm(formula, family = poisson(link = "log"))
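You need to pass a data = argument to glm() so that it knows where to find the variables in the formula. Without it, glm() looks up t in the formula's environment and finds the base R function t() (a closure) instead of a column of your data, which is exactly what the error message is complaining about. Note also that the R function is logLik(), not LogLik().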
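
The behaviour can be reproduced on a small made-up data set. The sketch below assumes a data frame d with columns y and t and a stray y in the calling environment (so that only t is unresolved); these names and the simulated values are illustrative, not the Hoge Veluwe data from the question.

```r
set.seed(1)

# Made-up data with the same column names as in the question
d <- data.frame(y = rpois(50, lambda = 3), t = seq_len(50) / 10)
f <- y ~ t

# A stray y outside the data frame, so that only t is looked up incorrectly
y <- d$y

# Without data =, t resolves to the base function t(), and glm() fails
msg <- tryCatch(
  glm(f, family = poisson(link = "log")),
  error = function(e) conditionMessage(e)
)
msg

# With data =, both y and t are taken from the data frame
fit <- glm(f, data = d, family = poisson(link = "log"))
coef(fit)
```

Avoiding variable names that clash with base R functions, such as t, also makes this kind of error easier to spot.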
