I'm working on a Bayesian data modeling project and went back to my notes from a Bayesian statistics class on rjags, where my professor covered a very similar model. That model ran without any issues, but when I adapt it to my own data I get an error instead of output.
Here is my code:
library(rjags)
set.seed(12196)
# Number of bear reports
Y <- black_bears$Y # Number of surveys reporting a bear sighting
N <- black_bears$N # Number of surveys submitted on whether a bear was seen or not
q <- Y/N # Proportion of surveys reporting a bear sighting
n <- length(Y)
X <- log(q)-log(1-q) # X = logit(q)
data <- list(Y=Y,N=N,X=X)
params <- c("beta")
model_string <- textConnection("model{
# Likelihood
for (i in 1:n){
Y[i] ~ dbinom(p[i], N[i])
logit(p[i] <- beta[1] + beta[2]*X[i])
}
# Priors
beta[1] ~ dnorm(0, 0.01)
beta[2] ~ dnorm(0, 0.01)
}")
model <- jags.model(model_string,data = data, n.chains=2,quiet=TRUE)
update(model, 10000, progress.bar="none")
samples1 <- coda.samples(model, variable.names=params, thin=5, n.iter=20000, progress.bar="none")
plot(samples1)
I'm getting presented with this error:
Any and all help is appreciated. Thank you!
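For reference, here is a hedged sketch (not from the original post) of the two details most likely at issue in the model above: in JAGS the link function goes on the left-hand side, i.e. logit(p[i]) <- ..., and n is used as the loop bound but is not included in the data list.
# Hedged sketch, not the poster's final code: corrected logit() syntax, and n passed to JAGS
data <- list(Y=Y, N=N, X=X, n=n)
model_string <- textConnection("model{
  # Likelihood
  for (i in 1:n){
    Y[i] ~ dbinom(p[i], N[i])
    logit(p[i]) <- beta[1] + beta[2]*X[i]   # parenthesis closes after p[i]
  }
  # Priors
  beta[1] ~ dnorm(0, 0.01)
  beta[2] ~ dnorm(0, 0.01)
}")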
Related
I am trying to argue that ARIMA models are better than AR models, i.e. since AR is a subset of ARIMA, the best ARIMA model will be no worse than the best AR model and may be better. I have used an AR(6) model, and auto.arima() in R has told me that an ARIMA(1,0,2) model is optimal by AICc. I have used both to do a rolling window forecast, but I am getting an RMSE of 3.901 for AR(6) and 4.503 for ARIMA(1,0,2). My code for the forecasting is below (I know it is not very advanced, but I'm a beginner and this is the best way I could find; it matches my results by hand):
#find moving averages and residual errors
ma=rep(NA,14976)
for (i in 3:14976){
  ma[i] = mean(ds[(i-2):(i-1)])
}
frame <- ds-ma
#fit model
model <- arima(ds[1:14676],order=c(1,0,2),include.mean=TRUE,method="ML")
`%+=%` = function(e1,e2){eval.parent(substitute(e1 <- e1 + e2))}
training_data <- data[1:14676]
test_data <- data[14677:14976]
window <- 1
window1 <- 2
coef <- model$coef
history <- training_data[(length(training_data)-window+1):14676]
predictions <- list()
for (i in (1:length(test_data))){
  length <- length(history)
  lag <- array()
  for (d in ((length-window+1):length)){
    lag[d-i+1] <- history[d]
  }
  yhat <- coef[length(coef)]-1
  for (t in (1:window)){
    yhat %+=% (coef[t]*lag[window-t+1])
  }
  if (window1 != 0){
    for (j in ((window+1):(window+window1))){
      yhat %+=% (coef[j]*frame[14676+i-j+1])
    }
  }
  obs <- test_data[i]
  predictions <- append(predictions,yhat)
  history <- append(history,obs)
  print(predictions)
}
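For reference, the RMSE can then be computed from the rolling-window output (a minimal sketch, assuming predictions and test_data as defined above):
# Hedged sketch: RMSE of the rolling one-step forecasts against the test set
sqrt(mean((unlist(predictions) - test_data)^2))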
The graph that comes out for the ARIMA(1,0,2) forecast (compared to the actual values in the test set) looks better in shape, but sits noticeably higher than the actual values. It seems like the intercept needs to be lower, which does give a better RMSE, but arima() gave the intercept it did, so I haven't changed it.
I am super new to JAGS and Bayesian statistics, and have simply been trying to follow Chapter 22 on Bayesian statistics in the second edition of Crawley's The R Book. I copy the code down exactly as it appears in the book for the simple linear model growth = a + b * tannin, where there are 9 rows of two continuous variables: growth and tannin. The data and packages are this:
install.packages("R2jags")
library(R2jags)
growth <- c(12,10,8,11,6,7,2,3,3)
tannin <- c(0,1,2,3,4,5,6,7,8)
N <- c(1,2,3,4,5,6,7,8,9)
bay.df <- data.frame(growth,tannin,N)
The ASCII file looks like this:
model{
for(i in 1:N) {
growth[i] ~ dnorm(mu[i],tau)
mu[i] <- a+b*tannin[i]
}
a ~ dnorm(0.0, 1.0E-4)
b ~ dnorm(0.0, 1.0E-4)
sigma <- 1.0/sqrt(tau)
tau ~ dgamma(1.0E-3, 1.0E-3)
}
But then, when I use this code:
practicemodel <- jags(data=data.jags, parameters.to.save = c("a","b","tau"),
                      n.iter=100000, model.file="regression.bugs.txt", n.chains=3)
I get an error message that says:
module glm loaded
Compiling model graph
Resolving undeclared variables
Deleting model
Error in jags.model(model.file, data = data, inits = init.values, n.chains = n.chains, :
RUNTIME ERROR:
Non-conforming parameters in function :
The problem has been solved!
Basically the change is from N <- c(1,2,...) to N <- 9, but there is one other solution as well, where no N object is created up front: specify N inside the data.jags list as the number of rows in the data frame, as in the short block below.
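That alternative, spelled out:
# Alternative: no standalone N object; the row count goes straight into the data list
data.jags <- list(growth = bay.df$growth,
                  tannin = bay.df$tannin,
                  N      = nrow(bay.df))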
Here is the new code:
# Make the data frame
growth <- c(12,10,8,11,6,7,2,3,3)
tannin <- c(0,1,2,3,4,5,6,7,8)
# CHANGED: this tells the JAGS code that there are 9 rows of data
N <- 9
bay.df <- data.frame(growth,tannin)
library(R2jags)
# Now, write the Bugs model and save it in a text file
sink("regression.bugs.txt") #tell R to put the following into this file
cat("
model{
for(i in 1:N) {
growth[i] ~ dnorm(mu[i],tau)
mu[i] <- a+b*tannin[i]
}
a ~ dnorm(0.0, 1.0E-4)
b ~ dnorm(0.0, 1.0E-4)
sigma <- 1.0/sqrt(tau)
tau ~ dgamma(1.0E-3, 1.0E-3)
}
", fill=TRUE)
sink() #tells R to stop putting things into this file.
#tell jags the names of the variables containing the data
data.jags <- list("growth","tannin","N")
# run the jags() function to fit the model:
practicemodel <- jags(data=data.jags,parameters.to.save = c("a","b","tau"),
n.iter=100000, model.file="regression.bugs.txt", n.chains=3)
# inspect the model output. Important to note that the output will
# be different every time because there's a stochastic element to the model
practicemodel
# plots the information nicely, can visualize the error
# margin for each parameter and deviance
plot(practicemodel)
Thanks for the help! I hope this helps others.
I am trying to model the variance in overall species richness with the habitat covariates of a camera trapping station using R2jags. However, I keep getting the error:
"Error in jags.model(model.file, data = data, inits = init.values, n.chains = n.chains, :
RUNTIME ERROR:
Non-conforming parameters in function inprod"
I used a very similar function in my previous JAGS model (to find the species richness) so I am not sure why it is not working now...
I have already tried formatting the covariates within the inprod function in different ways, as a data frame and a matrix, to no avail.
Variable specification:
J=length(ustations) #number of camera stations
NSite=Global.Model$BUGSoutput$sims.list$Nsite
NS=apply(NSite,2,function(x)c(mean(x)))
###What I think is causing the problem:
COV <- data.frame(as.numeric(station.cov$NDVI), as.numeric(station.cov$TRI), as.numeric(station.cov$dist2edge), as.numeric(station.cov$dogs), as.numeric(station.cov$Leopard_captures))
###but I have also tried:
COV <- cbind(station.cov$NDVI, station.cov$TRI, station.cov$dist2edge, station.cov$dogs, station.cov$Leopard_captures)
JAGS model:
sink("Variance_model.txt")
cat("model {
# Priors
Y ~ dnorm(0,0.001) #Mean richness
X ~ dnorm(0,0.001) #Mean variance
for (a in 1:length(COV)){
U[a] ~ dnorm(0,0.001)} #Variance covariates
# Likelihood
for (i in 1:J) {
mu[i] <- Y #Hyper-parameter for station-specific all richness
NS[i] ~ dnorm(mu[i], tau[i]) #Likelihood
tau[i] <- (1/sigma2[i])
log(sigma2[i]) <- X + inprod(U,COV[i,])
}
}
", fill=TRUE)
sink()
Bundle data:
var.data <- list(NS = NS,
                 COV = COV,
                 J = J)
# Inits function
var.inits <- function(){list(
Y =rnorm(1),
X =rnorm(1),
U =rnorm(length(COV)))}
# Parameters to estimate
var.params <- c("Y","X","U")
# MCMC settings
nc <- 3
ni <-20000
nb <- 10000
nthin <- 10
Start Gibbs sampler:
jags(data=var.data,
inits=var.inits,
parameters.to.save=var.params,
model.file="Variance_model.txt",
n.chains=nc,n.iter=ni,n.burnin=nb,n.thin=nthin)
Ultimately, I get the error:
Compiling model graph
Resolving undeclared variables
Allocating nodes
Deleting model
Error in jags.model(model.file, data = data, inits = init.values, n.chains = n.chains, :
RUNTIME ERROR:
Non-conforming parameters in function inprod
In the end, I would like to calculate the mean and 95% credible interval (BCI) estimates of the habitat covariates hypothesized to influence the variance in station-specific (point-level) species richness.
Any help would be greatly appreciated!
It looks like you are using length to generate the priors for U. In JAGS this function returns the number of elements in a node array; in this case, that would be the number of rows in COV multiplied by the number of columns.
Instead, I would supply the number of covariate columns as a scalar in the data list that you pass to jags.model.
var.data <- list(NS = NS,
COV = COV,
J=J,
ncov = ncol(COV)
)
Following this, you can modify your JAGS code where you are generating your priors for U. The model would then become:
sink("Variance_model.txt")
cat("model {
# Priors
Y ~ dnorm(0,0.001) #Mean richness
X ~ dnorm(0,0.001) #Mean variance
for (a in 1:ncov){ # THIS IS THE ONLY LINE OF CODE THAT I MODIFIED
U[a] ~ dnorm(0,0.001)} #Variance covariates
# Likelihood
for (i in 1:J) {
mu[i] <- Y #Hyper-parameter for station-specific all richness
NS[i] ~ dnorm(mu[i], tau[i]) #Likelihood
tau[i] <- (1/sigma2[i])
log(sigma2[i]) <- X + inprod(U,COV[i,])
}
}
", fill=TRUE)
sink()
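One small follow-up, not part of the original answer: since length() in R returns the total number of elements of a matrix, it is safest to build the inits for U from the column count as well, e.g.:
# Hedged sketch: inits for U matched to the number of covariate columns
var.inits <- function(){list(
  Y = rnorm(1),
  X = rnorm(1),
  U = rnorm(ncol(COV)))}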
I would like to estimate the coefficients of a nonlinear model with a binary dependent variable. The nonlinearity arises because two regressors, A and B, depend on a subset of the dataset and on the two parameters lambda1 and lambda2 respectively:
y = alpha + beta1 * A(lambda1) + beta2 * B(lambda2) + delta * X + epsilon
where, for each observation i,
A_i(lambda1) = sum_{k=1}^{a_i - 1} w_k * R[i,k],  with  w_k = (a_i - k)^lambda1 / sum_{j=1}^{a_i - 1} (a_i - j)^lambda1,
where a and the R columns are variables in the data.frame. The regressor B(lambda2) is defined in a similar way.
Moreover, I need to include what in Stata are known as pweights, i.e. survey weights or sampling weights. For this reason, I'm working with the R package survey by Thomas Lumley.
First, I create a function for A (and B), i.e.:
A <- function(l1){
  R <- as.matrix(data[,1:(80)])
  a <- data[,169]
  N = length(a)
  var <- numeric(N)
  for (i in 1:N) {
    ai <- rep(a[i],a[i]-1)   # vector of a(i)
    k <- 1:(a[i]-1)          # numbers from 1 to a(i)-1
    num <- (ai-k)^l1
    den <- sum((ai-k)^l1)
    w <- num/den
    w <- c(w,rep(0,dim(R)[2]-length(w)))
    var[i] <- R[i,] %*% w
  }
  return(var)
}
B <- function(l2){
  C <- as.matrix(data[,82:(161-1)])
  a <- data[,169]
  N = length(a)
  var <- numeric(N)
  for (i in 1:N) {
    ai <- rep(a[i],a[i]-1)   # vector of a(i)
    k <- 1:(a[i]-1)          # numbers from 1 to a(i)-1
    num <- (ai-k)^l2
    den <- sum((ai-k)^l2)
    w <- num/den
    w <- c(w,rep(0,dim(C)[2]-length(w)))
    var[i] <- C[i,] %*% w
  }
  return(var)
}
But the problem is that I don't know how to include the nonlinear regressors in the model (or in the survey design, using the function svydesign):
d_test <- svydesign(id=~1, data = data, weights = ~data$hw0010)
Because, when I try to estimate the model:
# loglikelihood function:
LLsvy <- function(y, model, lambda1, lambda2){
aux1 <- y * log(pnorm(model))
aux2 <- (1-y) * log(1-pnorm(model))
LL <- (aux1) + (aux2)
return(LL)
}
fit <- svymle(loglike=LLsvy,
formulas=list(~y, model = ~ A(lambda1)+B(lambda2)+X,lambda1=~1,lambda2=~1),
design=d_test,
start=list(c(0,0,0,0),c(lambda1=11),c(lambda2=8)),
na.action="na.exclude")
I get the error message:
Error in eval(expr, envir, enclos) : object 'lambda1' not found
I think that the problem is in including the nonlinear part, because everything works fine if I fix A and B for some lambda1 and lambda2 (so that the model becomes linear):
lambda1=11
lambda2=8
data$A <- A(lambda1)
data$B <- B(lambda2)
d_test <- svydesign(id=~1, data = data, weights = ~data$hw0010)
LLsvylin <- function(y, model){
aux1 <- y * log(pnorm(model))
aux2 <- (1-y) * log(1-pnorm(model))
LL <- (aux1) + (aux2)
return(LL)
}
fitlin <- svymle(loglike=LLsvylin,
formulas=list(~y, model = ~A+B+X),
design=d_test,
start=list(0,0,0,0),
na.action="na.exclude")
On the contrary, if I don't use the sampling weights, I can easily estimate my nonlinear model using the function mle from package stats4 or the function mle2 from package bbmle.
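For context, a minimal sketch of what that unweighted fit could look like with bbmle::mle2 (the column names y and X in data and the start values are assumptions, not taken from the original post; A() and B() are the functions defined above):
library(bbmle)
# Hedged sketch: unweighted probit negative log-likelihood with lambda1/lambda2 as free parameters
negll <- function(alpha, beta1, beta2, delta, lambda1, lambda2) {
  eta <- alpha + beta1 * A(lambda1) + beta2 * B(lambda2) + delta * data$X
  -sum(data$y * pnorm(eta, log.p = TRUE) +
       (1 - data$y) * pnorm(eta, lower.tail = FALSE, log.p = TRUE))
}
fit_unweighted <- mle2(negll,
                       start = list(alpha = 0, beta1 = 0, beta2 = 0, delta = 0,
                                    lambda1 = 11, lambda2 = 8))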
To sum up,
how can I combine sampling weights (svymle) while estimating a nonlinear model (which I can do using mle or mle2)?
=========================================================================
A problem with the nonlinear part of the model arises also when using the function svyglm (with fixed lambda1 and lambda2, in order to get good starting values for svymle):
lambda1=11
lambda2=8
model0 = y ~ A(lambda1) + B(lambda2) + X
probit1 = svyglm(formula = model0,
data = data,
family = binomial(link=probit),
design = d_test)
Because I get the error message:
Error in svyglm.survey.design(formula = model0, data = data, family = binomial(link = probit), :
all variables must be in design= argument
This isn't what svymle does -- it's for generalised linear models, which have linear predictors and a potentially complicated likelihood or loss function. You want non-linear weighted least squares, with a simple loss function but complicated predictors.
There isn't an implementation of design-weighted nonlinear least squares in the survey package, probably because no-one has previously asked for one. You could try emailing the package author.
The upcoming version 4 of the survey package will have a function svynls, so if you know how to fit your model without sampling weights using nls you will be able to fit it with sampling weights.
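A minimal sketch of the corresponding unweighted nls() fit (assumptions: columns y and X in data, the A() and B() functions from the question, placeholder start values that would need tuning):
# Hedged sketch: the unweighted nonlinear least-squares fit that svynls is meant to generalise
fit_nls <- nls(
  y ~ pnorm(alpha + beta1 * A(lambda1) + beta2 * B(lambda2) + delta * X),
  data  = data,
  start = list(alpha = 0, beta1 = 0.1, beta2 = 0.1, delta = 0.1,
               lambda1 = 11, lambda2 = 8)
)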
Suppose I have two models created by calling glm() on the same data but with different formulas and/or families. Now I want to compare which model is better by predicting on new, unseen data. Something like this:
mod1 <- glm(formula1, family1, data)
mod2 <- glm(formula2, family2, data)
mu1 <- predict(mod1, newdata, type = "response")
mu2 <- predict(mod2, newdata, type = "response")
How can I tell which of the predictions mu1 or mu2 is better?
Is there some simple command to compute the log likelihood of a prediction?
It would be easier to answer this with a reproducible example.
It often makes more sense to choose a family a priori rather than according to goodness of fit -- for example, if you have count (non-negative integer) responses with no obvious upper bound, your only real choice that lies strictly within the exponential family is Poisson.
Simulate example data:
set.seed(101)
x <- runif(1000)
mu <- exp(1+2*x)
y <- rgamma(1000,shape=3,scale=mu/3)
d <- data.frame(x,y)
New data:
nd <- data.frame(x=runif(100))
nd$y <- rgamma(100,shape=3,scale=exp(1+2*nd$x)/3)
Fit Gamma and Gaussian:
mod1 <- glm(y~x,family=Gamma(link="log"),data=d)
mod2 <- glm(y~x,family=gaussian(link="log"),data=d)
Predictions:
mu1 <- predict(mod1, newdata=nd, type="response")
mu2 <- predict(mod2, newdata=nd, type="response")
Extract shape/scale parameters:
sigma <- sqrt(summary(mod2)$dispersion)
shape <- MASS::gamma.shape(mod1)$alpha
Root mean squared error:
rmse <- function(x1,x2) sqrt(mean((x1-x2)^2))
rmse(mu1,nd$y) ## 5.845
rmse(mu2,nd$y) ## 5.842
Negative log likelihoods (lower is better; by this criterion the Gamma model clearly predicts the new data better, even though the RMSEs above are nearly identical):
-sum(dgamma(nd$y,shape=shape,scale=mu1/shape,log=TRUE)) ## 276.84
-sum(dnorm(nd$y,mean=mu2,sd=sigma,log=TRUE)) ## 318.4