I'm attempting to do this in JAGS:
z[l] ~ dbeta(0.5,0.5)
y[i,l] ~ z[l]*dnorm(0,10000) + inprod(1-z[l],dnegbin(exp(eta_star[i,l]),alpha[l]))
(dnorm(0,10000) approximates a Dirac delta at 0: see here if you are interested in the model).
But I get:
RUNTIME ERROR:
Incorrect number of arguments in function dnegbin
But if I do this:
y[i,l] ~ dnegbin(exp(eta_star[i,l]),alpha[l])
It runs just fine. I suspect that I cannot multiply a value by a distribution, so I imagine that something like this could work:
z[l] ~ dbeta(0.5,0.5)
pointmass_0[l] ~ dnorm(0,10000)
y[i,l] ~ dnegbin(exp(eta_star[i,l]),alpha[l])
y_star[i,l] = z[l]*pointmass_0[l]+inprod(1-z[l],y[i,l])
If I run that I get:
ystar[1,1] is a logical node and cannot be observed
You are trying to fit a zero-inflated negative binomial model. You can do this in JAGS if you use the "ones trick", a pseudo-likelihood method that can be used when the distribution of your outcome variable is not one of the standard distributions in JAGS but you can still write down an expression for the likelihood.
The "ones trick" consists of creating pseudo-observations with the value 1. These are then modeled as Bernoulli random variables with probability parameter Lik/C, where Lik is the likelihood of your observations and C is a large constant chosen to ensure that Lik/C << 1.
data {
  C <- 10000
  for (i in 1:N) {
    one[i,1] <- 1
  }
}
model {
  for (i in 1:N) {
    one[i,1] ~ dbern(lik[i,1]/C)
    lik[i,1] <- (y[i,1]==0)*z[1] + (1 - z[1]) * lik.NB[i,1]
    lik.NB[i,1] <- dnegbin(y[i,1], exp(eta_star[i,1]), alpha[1])
  }
  z[1] ~ dbeta(0.5,0.5)
}
Note that the name dnegbin is overloaded in JAGS. There is a distribution that has two parameters and a function that takes three arguments and returns the likelihood. We are using the latter.
I am thinking of adding zero-inflated versions of count distributions to JAGS, since the above construction is quite awkward for the user, whereas zero-inflated distributions are quite easy to implement internally in JAGS.
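To see why the ones trick works, it can help to reproduce the arithmetic outside JAGS. The following is a minimal R sketch, not part of the model above: the function name zinb_lik and the values of z, p and r are made up for illustration, and it assumes that R's dnbinom(x, size = r, prob = p) uses the same parameterization as the three-argument dnegbin(x, p, r) in JAGS.

```r
# Zero-inflated negative binomial likelihood, mirroring lik[i,1] in the model.
# zinb_lik is a hypothetical helper; z is the zero-inflation probability.
zinb_lik <- function(y, z, p, r) {
  (y == 0) * z + (1 - z) * dnbinom(y, size = r, prob = p)
}

y <- c(0, 0, 3, 7)                      # illustrative counts
lik <- zinb_lik(y, z = 0.3, p = 0.4, r = 2)

C <- 10000                              # large constant so that lik / C << 1
# Observing a 1 from Bernoulli(lik / C) contributes lik / C to the likelihood,
# which is proportional to lik, so the posterior is unchanged.
pseudo <- dbinom(1, size = 1, prob = lik / C)
print(all.equal(pseudo * C, lik))
```

The constant C only rescales the likelihood, which is why its exact value does not matter as long as lik/C stays below 1.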
I too would like to know a better way to handle this situation.
One cheesy solution is to add a stochastic node
ystarstar[i,j] ~ dnorm(ystar[i,j],10000000)
(i.e. a Normal distribution with a very high precision, or a Dirac delta in your terminology) to the model.
I'm trying to use gee to model counts of an outcome with a population offset. I have models with interaction terms and am trying to use the effects package to summarize parameter estimates and odds ratios (ORs).
When I compute ORs by hand, I'm not sure why the result doesn't match the output I get from the effects::allEffects() function. The data can't be shared but the model is
mdl <- geeglm(count~age+gender+age:gender+offset(log(totalpop)),
family="poisson", corstr="exchangeable", id=geo,
waves=year, data=df)
I use the below code to compute stuff manually. log_OR sums the interaction terms without intercepts added to parameter. log_odds sums the parameters with intercept. The code is taken from here.
tibble(
variables = names(coef(mdl)),
log_OR = c(...),
log_odds = c(...),
OR = exp(log_OR),
odds = exp(log_odds),
probability = odds / (1 + odds)
) %>%
mutate_if(is.numeric, ~round(., 5)) %>%
knitr::kable()
I then compare my manual calculations to the output of allEffects below. They don't match. Can someone help me see what I am doing wrong?
result <- allEffects(mdl)
allEffects(mdl) %>% summary()
variable <- result[["age:gender"]][["x"]]
Prob <- result$`age:gender`$fit
Prob_upper <- result$`age:gender`$upper
Prob_lower <- result$`age:gender`$lower
model_Est <- data.frame("Est"=Prob, "CI Lower"= Prob_lower,
"CI Upper"= Prob_upper)
model_OR <- exp(model_Est)
model_Est <- data.frame("Variable"=variable, model_Est)
model_OR <- data.frame("Variable"=variable, model_OR)
You haven't given us very much to go on, but the cause is almost certainly that the offset isn't being dealt with properly. (The first thing I would try is running the model without the offset to see if the results from effects and your by-hand calculations match: that's not the model you want, but it will confirm that the problem is with the offsets.)
?effects says:
offset a function to be applied to the offset values (if
there is an offset) in a linear or generalized linear
model, or a mixed-effects model fit by ‘lmer’ or ‘glmer’;
or a numeric value, to which the offset will be set. The
default is the ‘mean’ function, and thus the offset will
be set to its mean; in the case of ‘"svyglm"’ objects,
the default is to use the survey-design weighted mean.
Note: Only offsets defined by the ‘offset’ argument to
‘lm’, ‘glm’, ‘svyglm’, ‘lmer’, or ‘glmer’ will be handled
correctly; use of the ‘offset’ function in the model
formula is not supported.
(emphasis added)
methods("effects") lists only effects.glm and effects.lm, which suggests that the model is being treated as a glm (i.e., there is no specialized method for GEE models). So, this suggests:
(1) you need to include offset= as a separate argument in your model.
(2) when doing your hand calculation, make sure the value of the offset is set to the mean value across all observations (unless you choose to use the offset= argument to effects/allEffects to change the default summary function).
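Both points can be checked on simulated data with a plain Poisson glm. This is only a sketch under assumptions: the data, the coefficient values, and the single covariate age are invented, and it uses glm rather than geeglm, so it illustrates the offset handling, not the GEE fit itself.

```r
set.seed(1)
n <- 200
# Simulated stand-in data; none of these values come from the question.
df <- data.frame(age = runif(n, 20, 70),
                 totalpop = round(runif(n, 1e3, 1e4)))
df$count <- rpois(n, df$totalpop * exp(-6 + 0.02 * df$age))

# (1) Offset passed as a separate argument, not inside the formula.
fit <- glm(count ~ age, family = poisson, offset = log(totalpop), data = df)

# (2) Hand calculation with the offset fixed at its mean over all observations.
off_mean <- mean(log(df$totalpop))
age0 <- 40
eta_hand <- unname(coef(fit)[1] + coef(fit)[2] * age0 + off_mean)

# The same linear predictor from predict(), at a population whose log is off_mean.
eta_pred <- unname(predict(fit,
                           newdata = data.frame(age = age0,
                                                totalpop = exp(off_mean)),
                           type = "link"))
print(c(hand = eta_hand, predict = eta_pred))
```

If the hand calculation leaves the offset out, or uses a different summary than the mean, the two numbers will differ by a constant on the log scale.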
I have some time series data (x, y) which is consistent with a linear model. I would like to construct a regression formula in which one of the coefficients corresponds to the y-value at a specific moment in time (for example, Jan 1, 2013), rather than the y-intercept at x=0 (which would correspond to Jan 1, year 0 A.D., so far in the past as to be not of interest).
Using the nls() solver, I am able to construct the type of model that I want, and using simulated test data, I am able to get reasonable results out of it (meaning that the fit results approximately reproduce the "truth" values that I used to generate the artificial sample data). Using the lm() solver, I am also able to construct a linear model; however, it defaults to providing me with a fitted y-intercept at a time value that I don't want, i.e., Jan 1, 0 A.D. When I attempt to give the lm() solver the same formula that works with the nls() solver, it returns an invalid model formula error:
# Force warnings to be printed as they are generated
options(warn=1)
# -------------------
# Define Linear Model
# -------------------
# Express the same linear model three ways:
# 1.) As a function (needed for constructing test data)
# 2.) As a formula appropriate for nls()
# 3.) As a formula appropriate for lm() [has 2013 offset removed]
# In first two versions, physical interpretations of model coefficients are:
# sv: starting value on Jan 1, 2013
# slope: annual rate of linear increase
linear_func <- function(year, sv, slope) {
  sv + slope * (year-2013)
}
linear_form_offset <- (value ~ sv + slope * (year-2013))
linear_form_nooffset <- (value ~ year)
# -------------------
# Construct Test Data
# -------------------
sv_true <- 5000
slope_true <- 1500
year <- c(2013.5, 2014.5, 2015.5, 2016.5, 2017.5, 2018.5, 2019.5)
# Use truth values, and add some Gaussian noise
value <- linear_func(year, sv_true, slope_true) + rnorm(length(year), sd=100)
dftest <- data.frame(year, value)
# ------------------
# Obtain Fit Results
# ------------------
# nls solver requires approximate starting values, somewhere near the local
# vicinity of the final optimized values.
print("Now running nls (with offset)")
initcoef <- c(sv=3000, slope=1000)
fitresult <- nls(formula=linear_form_offset, data=dftest, start=initcoef)
print(coef(fitresult))
# lm solver, by contrast, has no concept of starting values, so omit them here
print("Now running lm (no offset)")
fitresult <- lm(formula=linear_form_nooffset, data=dftest)
print(coef(fitresult))
# lm solver using the offset formula that I would actually like to use --
# this results in an invalid model formula error.
print("Now running lm (with offset)")
fitresult <- lm(formula=linear_form_offset, data=dftest)
print(coef(fitresult))
When I run this example, I obtain the following as a typical result:
source("test_fit.R")
[1] "Now running nls (with offset)"
sv slope
5002.463 1518.854
[1] "Now running lm (no offset)"
(Intercept) year
-3052450.171 1518.854
[1] "Now running lm (with offset)"
Error in terms.formula(formula, data = data) :
invalid model formula in ExtractVars
>
Question 1 (simple): I am aware that, given a y-intercept at Jan 1, 0 A.D. and slope, I could easily calculate a corresponding y-value at Jan 1, 2013. However, for various reasons, I would prefer not to do that. I'd like to just construct the actual regression model that I really want, and solve that model natively using lm(). Is there any way (e.g., a preferred alternative syntax) to do that?
Question 2 (deeper): What is actually going on here, anyway? My naive understanding of formulas is that they are a specific object type in R--a formula should stand on its own as either an intrinsically valid or invalid construction, without external reference to the solver algorithm that is attempting to regress against it. But here the formula's validity seems to depend entirely upon which solver is actually using it. Why is this? And if the rules for constructing valid formulas are different for lm() vs. nls(), where are those rules written down, so that I can avoid running afoul of them next time?
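One possible approach to Question 1, offered as a sketch and not as the only option, is to shift the predictor inside the lm() formula with I(), so that the intercept is the fitted value at year 2013. On Question 2: in an lm() formula, operators such as +, -, * and : describe which terms enter the model and are not arithmetic, and sv and slope would be looked up as variables, whereas nls() evaluates the right-hand side as an ordinary R expression with named parameters. The rules for lm()-style formulas are documented in ?formula. The simulated data below repeat the setup from the question; the object names fit_centered and fit_raw are made up.

```r
set.seed(42)
year <- c(2013.5, 2014.5, 2015.5, 2016.5, 2017.5, 2018.5, 2019.5)
value <- 5000 + 1500 * (year - 2013) + rnorm(length(year), sd = 100)
dftest <- data.frame(year, value)

# I() makes lm() treat (year - 2013) as arithmetic, so the intercept is the
# fitted value at year == 2013 instead of at year == 0.
fit_centered <- lm(value ~ I(year - 2013), data = dftest)
fit_raw <- lm(value ~ year, data = dftest)

print(coef(fit_centered))  # intercept plays the role of sv, the other term is slope
print(coef(fit_raw))       # same slope, intercept extrapolated back to year 0
```

Centering the predictor inside the data frame first (for example, a new column year - 2013) gives the same fit and avoids I() entirely.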
I understand I can calculate the log likelihood of each sample during sampling, e.g.
...
model {
for (i in 1:N) {
(y[i] - 1) ~ bernoulli(p[i, 2]);
}
}
generated quantities {
vector[N] log_lik;
for (i in 1:N){
log_lik[i] = bernoulli_lpmf((y[i] - 1) | p[i, 2]);
}
}
After fitting, I can then extract log likelihood using the loo package:
log_lik_m <- extract_log_lik(stan_fit)
But I want to evaluate log likelihood of unseen data. This is possible in brms:
ll <- log_lik(fit_star, newdata = new_df)
But I would like to do this with rstan, since I can't easily define my model in brms (I am assuming).
For reference, I am trying to use Estimated LFO-CV to evaluate and compare my time-series model.
(e.g. https://github.com/paul-buerkner/LFO-CV-paper/blob/master/sim_functions.R#L186)
(https://mc-stan.org/loo/articles/loo2-lfo.html)
Thanks to the link from @dipetkov, I solved this myself. I didn't use the exact methods in the link, but came up with an alternative. You can call Stan functions from R to compute the log likelihood for your model, even with unseen data (and it's very fast!).
First, I put everything in my transformed parameters block into a function in Stan's functions block. Then, I created a second function that wraps the first function and evaluates the log likelihood for given observations and provided parameter estimates (I then removed my generated quantities block). rstan has a function expose_stan_functions which adds all functions in the Stan functions block to the R environment.
You can then call the log likelihood function you made to evaluate your model with any observations (previously seen or unseen), along with a set of parameter estimates.
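The shape of the result can be sketched in base R without Stan. This is an illustration under assumptions, not the code I actually ran: the posterior draws p_draws are simulated stand-ins for draws extracted from a fitted model, y_new is made-up unseen data, and dbinom plays the role of the exposed Stan log likelihood function for a simple Bernoulli outcome.

```r
set.seed(123)
S <- 1000                              # number of posterior draws
p_draws <- rbeta(S, 8, 4)              # stand-in for draws of a success probability
y_new <- c(1, 0, 1, 1)                 # hypothetical unseen observations

# One column per new observation, one row per posterior draw,
# the same layout that extract_log_lik() returns for in-sample data.
log_lik_new <- sapply(y_new, function(y) {
  dbinom(y, size = 1, prob = p_draws, log = TRUE)
})

# Numerically stable log of the mean of exp(x).
log_mean_exp <- function(x) max(x) + log(mean(exp(x - max(x))))

# Log predictive density of the unseen data, summed over observations.
elpd_new <- sum(apply(log_lik_new, 2, log_mean_exp))
print(dim(log_lik_new))
print(elpd_new)
```

With a real model, the call inside sapply would be replaced by the exposed Stan function, applied to each set of parameter draws.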
R, Bayestats and JAGS newbie here. I'm working on modeling some count data, right censored. Poisson seems to be my best guess. I wanna do a hierarchical model, as it leaves me with more possibilities to fine tune the parameters. Can I simply write something like this:
A[i,j] <- dpois(a[i,j])
a[i,j] <- b[i,]*x[i,j] + c[i] for all j,
where x[i,j] are my variables, or should I separate the censored time interval from the previous ones or something?
b[,] and c have a prior.
Thank you!
It is not clear to me what is supposed to be hierarchical.
You can have the time effect separated from the covariate effect, in which case the covariate effect is not related to the station.
Moreover, the mean of a Poisson distribution must be positive, so the linear part of your GLM should go through a log link instead of being used as the mean directly. Look here: http://www.petrkeil.com/?p=1709
A suggestion for you could be:
b1 ~ prior
b2 ~ prior
for (i in 1:n_stations) {
  c[i] ~ prior
}
for (t in 1:n_time) {
  b_time[t] ~ prior
  for (i in 1:n_stations) {
    A[i,t] ~ dpois(a[i,t])
    log(a[i,t]) <- b1*b_time[t]*X1[i,t] + b2*b_time[t]*X2[i,t] + c[i]
  }
}
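The effect of the log link can be checked quickly in R. This is a small simulation sketch with invented dimensions and coefficient values, and a single covariate X1 for brevity: the linear predictor can take any sign, but its exponential is always a valid Poisson mean.

```r
set.seed(2)
n_stations <- 3
n_time <- 5
# Made-up covariate values, one row per station and one column per time point.
X1 <- matrix(rnorm(n_stations * n_time), n_stations, n_time)
c_station <- rnorm(n_stations, mean = 1, sd = 0.3)  # hypothetical station intercepts
b1 <- 0.5                                           # hypothetical covariate effect

eta <- b1 * X1 + c_station   # linear predictor; c_station[i] is added to row i
a <- exp(eta)                # log link: the Poisson mean is always positive
A <- matrix(rpois(length(a), a), n_stations, n_time)
print(range(eta))
print(range(a))
```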
I'm trying to analyze repairable systems reliability using growth models.
I have already fitted a Crow-AMSAA model, but I wonder if there is any package or code for fitting a Generalized Renewal Process (Kijima Model I or Model II)
in R and finding its parameters Beta, Lambda (or alpha) and q.
(or some other model for the mean cumulative function MCF)
The equation number 15 of this article gives an expression for the
Log-likelihood
I tried to create the function like this:
likelihood.G1=function(theta,x){
  # x is a vector with the failure times, theta vector of parameters
  a=theta[1] #Alpha
  b=theta[2] #Beta
  q=theta[3] #q
  logl2=log(b/a) # First part of the equation
  for (i in 1:length(x)){
    logl2=logl2 +(b-1)*log(x[i]/(a*(1+q)^(i-1))) -(x[i]/(a*(1+q)^(i-1)))^b
  }
  return(-logl2) #Negative of the log-likelihood
}
And then use some routine to minimize -Log(L):
theta=c(0.5,1.2,0.8) #Start parameters (lambda,beta,q)
nlm(likelihood.G1,theta, x=Data)
Or also
optim(theta,likelihood.G1,method="BFGS",x=Data)
However, there seems to be some mistake, since the parameters it returns make no sense.
Any ideas of what I'm doing wrong?
Thanks
Looking at equation (16) of the paper you reference and comparing it with your code, it looks like you are missing one term in the for loop. It seems that each data point contributes three terms to the log-likelihood, but in your code (inside the loop) you only have two terms (not counting the update of the running sum).
Specifically, your code does not include the 4th term in equation (16), nor the 7th term, and so on. This is at least one error in the code. An extra consideration is that α and β are constrained to be greater than zero. I am not sure whether the solver you are using takes this constraint into account.
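One common way to respect the positivity constraint with an unconstrained solver is to optimize over the logarithms of the parameters. The sketch below shows the idea on a plain Weibull likelihood, which is only a stand-in: it is not the Kijima log-likelihood from the paper, and the data, starting values and function name negloglik_log are invented for illustration.

```r
set.seed(7)
x <- rweibull(200, shape = 1.5, scale = 2)   # simulated stand-in data

# Average negative log-likelihood as a function of log(shape) and log(scale).
# Exponentiating inside the function keeps both parameters positive
# no matter which values the optimizer tries.
negloglik_log <- function(log_theta, x) {
  shape <- exp(log_theta[1])
  scale <- exp(log_theta[2])
  -mean(dweibull(x, shape = shape, scale = scale, log = TRUE))
}

fit <- optim(c(0, 0), negloglik_log, x = x, method = "BFGS")
est <- exp(fit$par)                          # back-transform to the original scale
names(est) <- c("shape", "scale")
print(est)
```

The same reparameterization could be applied to a and b in likelihood.G1 by passing log(a) and log(b) in theta and exponentiating them at the top of the function.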