How to obtain new samples from a ZIP or ZINB model for a Bayesian p-value - r

Hopefully someone can help me with this one, because I am really stuck and cannot find my coding error!
I am fitting zero-inflated Poisson / negative binomial GLMs (no random effects) in JAGS (with R2Jags), and everything is fine with the parameter estimates, priors, initial values and chain convergence. All results are perfectly in line with, e.g., the estimates from the pscl package, including my calculation of Pearson residuals in the model...
The only thing I cannot get to work is drawing a new (replicated) sample from the model to obtain a Bayesian p-value for evaluating the model fit. The "normal" Poisson and negative binomial models I fitted before all gave the expected replicated samples, and no problems occurred.
Here's my code so far, but the important part is "#New Samples":
model{
  # 1. Priors
  beta ~ dmnorm(b0[], B0[,])
  aB ~ dnorm(0.001, 1)
  # 2. Likelihood function
  for (i in 1:N){
    # Logistic part
    W[i] ~ dbern(psi.min1[i])
    psi.min1[i] <- 1 - psi[i]
    eta.psi[i] <- aB
    logit(psi[i]) <- eta.psi[i]
    # Poisson part
    Y[i] ~ dpois(mu.eff[i])
    mu.eff[i] <- W[i] * mu[i]
    log(mu[i]) <- max(-20, min(20, eta.mu[i]))
    eta.mu[i] <- inprod(beta[], X[i,])
    # Discrepancy measures:
    ExpY[i] <- mu[i] * (1 - psi[i])
    VarY[i] <- (1 - psi[i]) * (mu[i] + psi[i] * pow(mu[i], 2))
    PRes[i] <- (Y[i] - ExpY[i]) / sqrt(VarY[i])
    D[i] <- pow(PRes[i], 2)
    # New Samples:
    YNew[i] ~ dpois(mu.eff[i])
    PResNew[i] <- (YNew[i] - ExpY[i]) / sqrt(VarY[i])
    DNew[i] <- pow(PResNew[i], 2)
  }
  Fit <- sum(D[1:N])
  FitNew <- sum(DNew[1:N])
}
The big problem is that I have really tried all the combinations and alterations I think could/should work, but when I look at the simulated samples, I get this:
> all.equal( Jags1$BUGSoutput$sims.list$YNew, Jags1$BUGSoutput$sims.list$Y )
[1] TRUE
And, to make it really weird, the means of Fit and FitNew do differ:
> Jags1$BUGSoutput$mean$Fit
[1] 109.7883
> Jags1$BUGSoutput$mean$FitNew
[1] 119.2111
Has anyone a clue what I am doing wrong? Any help would be deeply appreciated!
Kind regards, Ulf

I suspect this isn't the case, but the only obvious reason I can think of for Y[i] and YNew[i] always being identical is that mu.eff[i] is ~zero, either because W[i] is 0 or because mu[i] is close to zero. This implies that Y[] is always zero, which is easy to check from your data, but as I said it does seem odd that you would be trying to model this... Otherwise, I'm not sure what is going on... try simplifying the code to see if that solves the problem, and then add things back in until it breaks again. Some other suggestions:
It may be helpful for debugging to look at the actual values of Y and YNew rather than just whether Y == YNew
If you want a negative binomial (= gamma-Poisson) try sampling mu[i] from a gamma distribution - I have used this formulation for ZINB models extensively, so am sure it works
Your prior for aB looks odd to me - it gives a prior 95% CI for the zero inflation of around 12-88% - is that what you intended? And why a mean of 0.001 rather than 0? If you have no predictors, then a beta prior for psi.min1 seems more natural - and if you have no useful prior information, a beta(1,1) prior would be an obvious choice (sketched after these points).
Minor point but you are calculating a lot of deterministic functions of aB within the for loop - this is going to slow down your model...
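Picking up the beta-prior suggestion above, a minimal sketch of how the zero-inflation part might then look in the JAGS model (this is an assumption about the intended replacement, not code from the question; psi.min1 becomes a single scalar, so it moves outside the loop and the ExpY/VarY lines would use the scalar psi):
# Sketch only: intercept-only zero inflation with a flat beta(1,1) prior
psi.min1 ~ dbeta(1, 1)      # prior probability that the count process is "on"
psi <- 1 - psi.min1         # zero-inflation probability
for (i in 1:N){
  W[i] ~ dbern(psi.min1)    # Bernoulli indicator, as in the original model
  # Poisson part and discrepancy measures unchanged, with psi[i] replaced by psi
}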
Hope that helps,
Matt

So, after nearly having a nervous breakdown and typing everything again and again while searching for my coding error, I found the most stupid error I have ever made - so far:
I just did not specify "Y" as a parameter to save, only "YNew", so when I compared YNew and Y from the sims.list with all.equal, I did not get what I thought I was getting. I do not know why JAGS gives me Y at all (from the sims.list of the JAGS object), but for some reason it just returns YNew when asked for Y. So this part is actually right:
Jags1$BUGSoutput$mean$Fit
[1] 109.7883
Jags1$BUGSoutput$mean$FitNew
[1] 119.2111
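For anyone following along, the Bayesian p-value itself can then be computed from the monitored Fit and FitNew draws; a minimal sketch, assuming both quantities are included in parameters.to.save:
# proportion of MCMC draws in which the replicated data show a larger
# discrepancy than the observed data; values near 0.5 indicate an adequate fit
mean(Jags1$BUGSoutput$sims.list$FitNew > Jags1$BUGSoutput$sims.list$Fit)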
So I hope that I did not cause a major confusion for anybody...

Related

"convergence" for a derived quantity in JAGS/R2Jags

UPDATE: Now with Traceplot example
UPDATE: Now with new traceplot
I am trying to adapt Outhwaite et al.'s 2018 code for occupancy modelling and have a couple of questions that I just can't seem to find an answer for...
Code used to create model
cat(
"model{
  ### Model ###
  # State model
  for (i in 1:nsite){
    for (t in 1:nyear){
      z[i,t] ~ dbern(psi[i,t])
      logit(psi[i,t]) <- b[t] + u[i]
    }
  }
  # Observation model
  for(j in 1:nvisit) {
    y[j] ~ dbern(Py[j]+0.0001)
    Py[j] <- z[Site[j],Year[j]]*p[j]
    logit(p[j]) <- a[Year[j]] + c*logL[j]
  }
  ### Priors ###
  # State model priors
  for(t in 1:nyear){
    b[t] ~ dunif(-10,10)        # fixed year effect
  }
  for (i in 1:nsite) {
    u[i] ~ dnorm(0, tau.u)      # random site effect
  }
  tau.u <- 1/(sd.u * sd.u)
  sd.u ~ dunif(0, 5)            # half-uniform hyperpriors
  # Observation model priors
  for (t in 1:nyear) {
    a[t] ~ dnorm(mu.a, tau.a)   # random year effect
  }
  mu.a ~ dnorm(0, 0.01)
  tau.a <- 1 / (sd.a * sd.a)
  sd.a ~ dunif(0, 5)            # half-uniform hyperpriors
  c ~ dunif(-10, 10)            # sampling effort effect
  ### Derived parameters ###
  # Finite sample occupancy - proportion of occupied sites
  for (t in 1:nyear) {
    psi.fs[t] <- sum(z[1:nsite,t])/nsite
  }
  #data# nyear, nsite, nvisit, y, logL, Site, Year
}", file="bmmodel.txt"
)
Note that dbern(Py[j]+0.0001) includes a correction factor since dbern(0) is not supported in JAGS.
I am running the model on some plant data just basically trying it out to see if it runs and converges and behaves as I would expect it to.
Question number 1 (ANSWERED): I am interested in the quantity psi.fs[t]. But since the model calculates this quantity after the actual modelling process, can convergence be assessed for psi.fs[t]?
R code for running model with R2JAGS
jagsrespsi<-jags(data.list, inits=test.inits,
n.chains=2, n.iter=15000, n.thin=3,
DIC=T,
model.file=paste0(modeltype,"model.txt"), parameters.to.save=c("psi.fs"))
Question number 2: When I use traceplot(jagsrespsi), the traceplot seems all over the place, but the Rhat for jagsrespsi$BUGSoutput is 1 for all my years. gelman.diag(as.mcmc(jagsrespsi)) also indicates convergence. The same goes for monitoring psi!
I am very astonished by this model behaviour and suspect there is something wrong... but I have no idea where to look.
Yes, you can check psi.fs[] for convergence in exactly the same way as you check the convergence of the model's parameters. That's exactly what happens, for example, in a logistic regression, where the fitted probabilities of response are calculated as exp(z)/(1 + exp(z)) for some linear predictor z.
When you say the traceplot is "all over the place", what do you mean? This could be either good or bad. Can you show an example? A "good" traceplot looks like a "fat, hairy caterpillar": consecutive samples taken from all regions of the sample space, a horizontal hair ball. Although written for SAS, this page gives a reasonable high level description of what a good trace plot looks like, and what problems might be indicated by less-than-ideal examples.
In response to your edit to include the trace plot...
That doesn't look like a particularly good traceplot to me: there seems to be some negative autocorrelation between successive samples. Have you calculated the effective sample size (ESS)?
But the plot may look a little odd because your chain is very short, IMHO. You can use the ESS to provide a very rough approximation for the accuracy of an estimated probability: the worst-case half-width of the CI for a binomial proportion is +/-2 * sqrt(0.5*0.5/N), where N is the sample size (or the ESS in this case). So even if the efficiency of your MCMC process is 1 - so that the ESS equals the chain length (about 2,500 retained samples here) - the accuracy of your estimates is only +/-0.02. To estimate a probability to 2 decimal places (so that the half-width of the CI is no more than 0.005), you need an ESS of 40,000.
There's nothing wrong with using short chain lengths during testing, but for "production" runs I would always use a chain length much greater than 2,500. (And I'd also use multiple chains so that I can use Gelman-Rubin statistics to test for convergence.)
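A minimal sketch of the ESS check mentioned above, assuming the jagsrespsi object from the question (effectiveSize() comes from the coda package):
library(coda)
ess <- effectiveSize(as.mcmc(jagsrespsi))  # effective sample size per monitored node
ess
2 * sqrt(0.5 * 0.5 / ess)                  # rough worst-case half-width of a CI for a probability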

Why is the likelihood/AIC of my poisson regression infinite?

I am trying to evaluate the model fit of several regressions in R, and I have run into a problem I have had multiple times now: the log-likelihood of my Poisson regression is infinite.
I'm using a non-integer dependent variable (Note: I know what I'm doing in this regard), and I'm wondering if maybe that's the problem. However, I don't get an infinite log-likelihood when running the regression with glm.nb.
Code to reproduce the issue is below.
Edit: the problem appears to go away when I coerce the DV to integer. Any idea how to get log likelihood from Poissons with non-integer DVs?
# Input Data
so_data <- data.frame(dv = c(21.0552722691125, 24.3061351414885, 7.84658638053276,
25.0294679770848, 15.8064731063311, 10.8171744654056, 31.3008088413026,
2.26643928259238, 18.4261153345417, 5.62915828161753, 17.0691184593063,
1.11959635820499, 30.0154935602592, 23.0000809735738, 28.4389825676123,
27.7678405415711, 23.7108405071757, 23.5070651053276, 14.2534787168392,
15.2058525068363, 19.7449094187771, 2.52384709295823, 29.7081691356397,
32.4723790240354, 19.2147002673637, 61.7911384519901, 10.5687170234821,
23.9047421013736, 18.4889651451222, 13.0360878554798, 15.1752866581849,
11.5205948111817, 31.3539840929108, 31.7255952728076, 25.3034625215724,
5.00013988265465, 30.2037887018226, 1.86123112349445, 3.06932041603219,
22.6739418581257, 6.33738321053804, 24.2933951601142, 14.8634827414491,
31.8302947881089, 34.8361908525564, 1.29606416941288, 13.206844629927,
28.843579313401, 25.8024295609021, 14.4414831628722, 18.2109680632694,
14.7092063453463, 10.0738043919183, 28.4124482962025, 27.1004208775326,
1.31350378236957, 14.3009307888745, 1.32555197766214, 2.70896028922312,
3.88043749517381, 3.79492216916016, 19.4507965653633, 32.1689088941444,
2.61278585713499, 41.6955885902228, 2.13466761675063, 30.4207256294235,
24.8231524369244, 20.7605955978196, 17.2182798298094, 2.11563574288652,
12.290778250655, 0.957467139696772, 16.1775287334746))
# Run Model
p_mod <- glm(dv ~ 1, data = so_data, family = poisson(link = 'log'))
# Be Confused
logLik(p_mod)
Elaborating on @ekstroem's comment: the Poisson distribution is only supported over the non-negative integers (0, 1, ...). So, technically speaking, the probability of any non-integer value is zero -- although R does allow for a little bit of fuzz, to allow for round-off/floating-point representation issues:
> dpois(1,lambda=1)
[1] 0.3678794
> dpois(1.1,lambda=1)
[1] 0
Warning message:
In dpois(1.1, lambda = 1) : non-integer x = 1.100000
> dpois(1+1e-7,lambda=1) ## fuzz
[1] 0.3678794
It is theoretically possible to compute something like a Poisson log-likelihood for non-integer values:
my_dpois <- function(x, lambda, log=FALSE) {
  LL <- -lambda + x*log(lambda) - lfactorial(x)
  if (log) LL else exp(LL)
}
but I would be very careful - some quick tests with integrate suggest it integrates to 1 (after I fixed the bug in it), but I haven't checked more carefully that this is really a well-posed probability distribution. (On the other hand, some reasonable-seeming posts on CrossValidated suggest that it's not insane ...)
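The kind of quick check with integrate mentioned above might look like this (a sketch; lambda = 5 is just an arbitrary example value):
# the continuous extension should integrate to (approximately) 1 over [0, Inf)
integrate(my_dpois, lower = 0, upper = Inf, lambda = 5)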
You say "I know what I'm doing in this regard"; can you give some more of the context? Some alternative possibilities (although this is steering into CrossValidated territory) -- the best answer depends on where your data really come from (i.e., why you have "count-like" data that are non-integer but you think should be treated as Poisson).
a quasi-Poisson model (family=quasipoisson). (R will still not give you log-likelihood or AIC values in this case, because technically they don't exist -- you're supposed to do inference on the basis of the Wald statistics of the parameters; see e.g. here for more info.)
a Gamma model (probably with a log link)
if the data started out as count data that you've scaled by some measure of effort or exposure, use an appropriate offset model ...
a generalized least-squares model (nlme::gls) with an appropriate heteroscedasticity specification
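To make the first two options concrete, here is a rough sketch using the so_data example from the question (whether either model is appropriate depends on where your data really come from):
# quasi-Poisson: fits, but logLik()/AIC() are deliberately unavailable
qp_mod <- glm(dv ~ 1, data = so_data, family = quasipoisson(link = 'log'))
summary(qp_mod)
# Gamma with a log link: a continuous alternative that does have a log-likelihood
g_mod <- glm(dv ~ 1, data = so_data, family = Gamma(link = 'log'))
logLik(g_mod)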
The Poisson log-likelihood involves calculating log(factorial(x)) (https://www.statlect.com/fundamentals-of-statistics/Poisson-distribution-maximum-likelihood). For values larger than about 30 this has to be done using Stirling's approximation formula to avoid exceeding the limits of computer arithmetic. Sample code in Python:
import numpy as np

# define a likelihood function. https://www.statlect.com/fundamentals-of-statistics/Poisson-distribution-maximum-likelihood
def loglikelihood_f(lmba, x):
    # Using Stirling's formula to avoid calculating the factorial directly:
    # log(factorial(n)) ~ n*ln(n) - n
    n = x.size
    logfactorial = x * np.log(x + 0.001) - x   # instead of np.log(factorial(x))
    logfactorial[logfactorial == -np.inf] = 0
    result = (
        - np.sum(logfactorial)
        - n * lmba
        + np.log(lmba) * np.sum(x)
    )
    return result

linear regression using lm() - surprised by the result

I used a linear regression on data I have, using the lm function. Everything works (no error message), but I'm somehow surprised by the result: I am under the impression R "misses" a group of points, i.e. the intercept and slope are not the best fit. For instance, I am referring to the group of points at coordinates x=15-25,y=0-20.
My questions:
is there a function to compare fit with "expected" coefficients and "lm-calculated" coefficients?
have I made a silly mistake when coding, leading the lm to do that?
Following some answers: additional information on x and y
x and y are both visual estimates of disease symptoms. There is the same uncertainty on both of them.
The data and code are here:
x1=c(24.0,23.9,23.6,21.6,21.0,20.8,22.4,22.6,
21.6,21.2,19.0,19.4,21.1,21.5,21.5,20.1,20.1,
20.1,17.2,18.6,21.5,18.2,23.2,20.4,19.2,22.4,
18.8,17.9,19.1,17.9,19.6,18.1,17.6,17.4,17.5,
17.5,25.2,24.4,25.6,24.3,24.6,24.3,29.4,29.4,
29.1,28.5,27.2,27.9,31.5,31.5,31.5,27.8,31.2,
27.4,28.8,27.9,27.6,26.9,28.0,28.0,33.0,32.0,
34.2,34.0,32.6,30.8)
y1=c(100.0,95.5,93.5,100.0,98.5,99.5,34.8,
45.8,47.5,17.4,42.6,63.0,6.9,12.1,30.5,
10.5,14.3,41.1, 2.2,20.0,9.8,3.5,0.5,3.5,5.7,
3.1,19.2,6.4, 1.2, 4.5, 5.7, 3.1,19.2, 6.4,
1.2,4.5,81.5,70.5,91.5,75.0,59.5,73.3,66.5,
47.0,60.5,47.5,33.0,62.5,87.0,86.0,77.0,
86.0,83.0,78.5,83.0,83.5,73.0,69.5,82.5,78.5,
84.0,93.5,83.5,96.5,96.0,97.5)
## x11()
plot(x1,y1,xlim=c(0,35),ylim=c(0,100))
# linear regression
reg_lin=lm(y1 ~ x1)
abline(reg_lin,lty="solid", col="royalblue")
text(12.5,25,labels="R result",col="royalblue", cex=0.85)
text(12.5,20,labels=bquote(y== .(5.26)*x - .(76)),col="royalblue", cex=0.85)
# result I would have imagined
abline(a=-150,b=8,lty="dashed", col="red")
text(27.5,25,labels="What I think is better",col="red", cex=0.85)
text(27.5,20,labels=bquote(y== .(8)*x - .(150)),col="red", cex=0.85)
Try this:
reg_lin_int <- reg_lin$coefficients[1]
reg_lin_slp <- reg_lin$coefficients[2]
sum((y1 - (reg_lin_int + reg_lin_slp*x1)) ^ 2)
# [1] 39486.33
sum((y1 - (-150 + 8 * x1)) ^ 2)
# [1] 55583.18
The sum of squared residuals is lower under the lm fit line. This is to be expected, as reg_lin_int and reg_lin_slp are guaranteed to produce the minimal total squared error.
Intuitively, we know estimators under squared loss functions are sensitive to outliers. The fit is "missing" the group at the bottom because, by doing so, it gets closer to the group at the top left, which is much further away -- and squared distance gives those points more weight.
In fact, if we use Least Absolute Deviations regression (i.e., specify an absolute loss function instead of a square), the result is much closer to your guess:
library(quantreg)
lad_reg <- rq(y1 ~ x1)
(Pro tip: use lwd to make your graphs much more readable)
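For instance, adding the LAD line to the plot from the question might look like this (a sketch; the colour and line type are arbitrary choices):
abline(coef(lad_reg)[1], coef(lad_reg)[2], lty = "dotdash", col = "darkgreen", lwd = 2)  # LAD fit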
What gets even closer to what you had in mind is Total Least Squares, as mentioned by @nongkrong and @MikeWilliamson. Here is the result of TLS on your sample:
v <- prcomp(cbind(x1, y1))$rotation
bbeta <- -v[-ncol(v), ncol(v)] / v[ncol(v), ncol(v)]  # slope from the smallest principal component (the normal to the fitted line)
inter <- mean(y1) - bbeta * mean(x1)
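The resulting TLS line can then be added to the same plot (again a sketch with arbitrary styling):
abline(inter, bbeta, lty = "dotted", col = "purple", lwd = 2)  # TLS fit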
You got a nice answer already, but maybe this is also helpful:
As you know, OLS minimizes the sum of squared errors in y-direction. This implies that the uncertainty of your x-values is negligible, which is often the case. But possibly it's not the case for your data. If we assume that uncertainties in x and y are equal and do Deming regression we get a fit more similar to what you expected.
library(MethComp)
dem_reg <- Deming(x1, y1)
abline(dem_reg[1:2], col = "green")
You don't provide detailed information about your data. Thus, this might be useful or not.

R gmm package using exactly identified moment conditions

For exactly identified moments, GMM results should be the same regardless of the initial starting values. This doesn't appear to be the case, however.
library(gmm)
data(Finance)
x <- data.frame(rm=Finance[1:500,"rm"], rf=Finance[1:500,"rf"])
# want to solve for coefficients theta[1], theta[2] in exactly identified
# system
g <- function(theta, x)
{
  m.1 <- x[,"rm"] - theta[1] - theta[2]*x[,"rf"]
  m.z <- (x[,"rm"] - theta[1] - theta[2]*x[,"rf"])*x[,"rf"]
  f <- cbind(m.1, m.z)
  return(f)
}
# gmm coefficient result should be identical to ols regressing rm on rf
# since two moments are E[u]=0 and E[u*rf]=0
model.lm <- lm(rm ~ rf, data=x)
model.lm
# gmm is consistent with lm given correct starting values
summary(gmm(g, x, t0=model.lm$coefficients))
# problem is that using different starting values leads to different
# coefficients
summary(gmm(g, x, t0=rep(0,2)))
Is there something wrong with my setup?
The gmm package author Pierre Chausse was kind enough to respond to my inquiry.
For linear models, he suggests using the formula approach:
gmm(rm ~ rf, ~rf, data=x)
For non-linear models, he emphasizes that the starting values are indeed critical. In the case of exactly identified models, he suggests setting the fnscale to a small number to force the optim minimizer to converge closer to 0. Also, he thinks the BFGS algorithm works better with GMM.
summary(gmm(g, x, t0=rep(0,2), method = "BFGS", control=list(fnscale=1e-8)))
Both solutions work for this example. Thanks Pierre!
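As a quick sanity check (a sketch reusing model.lm and x from the question), the formula interface should reproduce the OLS coefficients up to numerical tolerance:
coef(model.lm)
coef(gmm(rm ~ rf, ~rf, data = x))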

Error in Gradient Descent Calculation

I tried to write a function to calculate gradient descent for a linear regression model. However, the answers I was getting do not match the answers I get using the normal equation method.
My sample data is:
df <- data.frame(c(1,5,6),c(3,5,6),c(4,6,8))
with c(4,6,8) being the y values.
lm_gradient_descent <- function(df, learning_rate, y_col=length(df), scale=TRUE){
  n_features <- length(df)  # n_features is the number of features in the data set
  # using mean normalization to scale features
  if(scale==TRUE){
    for (i in 1:(n_features)){
      df[,i] <- (df[,i]-mean(df[,i]))/sd(df[,i])
    }
  }
  y_data <- df[,y_col]
  df[,y_col] <- NULL
  par <- rep(1,n_features)
  df <- merge(1,df)
  data_mat <- data.matrix(df)
  # we need a temp_arr to store each iteration of parameter values so that we can do a
  # simultaneous update
  temp_arr <- rep(0,n_features)
  diff <- 1
  while(diff>0.0000001){
    for (i in 1:(n_features)){
      temp_arr[i] <- par[i]-learning_rate*sum((data_mat%*%par-y_data)*df[,i])/length(y_data)
    }
    diff <- par[1]-temp_arr[1]
    print(diff)
    par <- temp_arr
  }
  return(par)
}
Running this function,
lm_gradient_descent(df,0.0001,,0)
the results I got were
c(0.9165891,0.6115482,0.5652970)
when I use the normal equation method, I get
c(2,1,0).
Hope someone can shed some light on where I went wrong in this function.
You used the stopping criterion
old parameters - new parameters <= 0.0000001
First of all I think there's an abs() missing if you want to use this criterion (though my ignorance of R may be at fault).
But even if you use
abs(old parameters - new parameters) <= 0.0000001
this is not a good stopping criterion: it only tells you that progress has slowed down, not that it's already sufficiently accurate. Try instead simply to iterate for a fixed number of iterations. Unfortunately it's not that easy to give a good, generally applicable stopping criterion for gradient descent here.
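A minimal sketch of that idea, keeping the variable names from your function (n_iter would be a new, hypothetical argument giving the number of sweeps):
for (iter in 1:n_iter) {
  for (i in 1:n_features) {
    temp_arr[i] <- par[i] - learning_rate * sum((data_mat %*% par - y_data) * df[, i]) / length(y_data)
  }
  par <- temp_arr
}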
It seems that you have not implemented a bias term. In a linear model like this, you always want to have an additional additive constant, i.e., your model should be like
w_0 + w_1*x_1 + ... + w_n*x_n.
Without the w_0 term, you usually won't get a good fit.
I know this is a couple of weeks old at this point, but I'm going to take a stab at it for several reasons, namely:
Relatively new to R so deciphering your code and rewriting it is good practice for me
Working on a different Gradient Descent problem so this is all fresh to me
Need the Stack Overflow points, and
As far as I can tell you never got a working answer.
First, regarding your data structures. You start with a dataframe, rename a column, strip out a vector, then strip out a matrix. It would be a lot easier to just start with an X matrix (capitalized since its component 'features' are referred to as x_i) and a y solution vector.
X <- cbind(c(1,5,6),c(3,5,6))
y <- c(4,6,8)
We can easily see what the desired solutions are, with and without scaling, by fitting a linear model. (NOTE: we only scale X/features and not y/solutions.)
> lm(y~X)
Call:
lm(formula = y ~ X)
Coefficients:
(Intercept) X1 X2
-4 -1 3
> lm(y~scale(X))
Call:
lm(formula = y ~ scale(X))
Coefficients:
(Intercept) scale(X)1 scale(X)2
6.000 -2.646 4.583
With regard to your code, one of the beauties of R is that it can perform matrix multiplication, which is significantly faster than using loops.
lm_gradient_descent <- function(X, y, learning_rate, scale=TRUE){
  if(scale==TRUE){X <- scale(X)}
  X <- cbind(1, X)
  theta <- rep(0, ncol(X))  # your old temp_arr
  diff <- 1
  old.error <- sum( (X %*% theta - y)^2 ) / (2*length(y))
  while(diff>0.000000001){
    theta <- theta - learning_rate * t(X) %*% (X %*% theta - y) / length(y)
    new.error <- sum( (X %*% theta - y)^2 ) / (2*length(y))
    diff <- abs(old.error - new.error)
    old.error <- new.error
  }
  return(theta)
}
And to show it works...
> lm_gradient_descent(X, y, .01, 0)
[,1]
[1,] -3.9360685
[2,] -0.9851775
[3,] 2.9736566
vs expected of (-4, -1, 3)
For what it's worth, while I agree with @cfh that I would prefer a loop with a defined number of iterations, I'm actually not sure you need the abs function. If diff < 0, then your function is not converging.
Finally, rather than using something like old.error and new.error, I'd suggest using a vector that records all errors. You can then plot that vector to see how quickly your function converges.
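A minimal sketch of that bookkeeping, based on the loop in the function above (errors is a new vector; everything else is unchanged):
errors <- c()                                   # cost at every iteration
while (diff > 0.000000001) {
  theta <- theta - learning_rate * t(X) %*% (X %*% theta - y) / length(y)
  new.error <- sum((X %*% theta - y)^2) / (2 * length(y))
  errors <- c(errors, new.error)                # record it
  diff <- abs(old.error - new.error)
  old.error <- new.error
}
plot(errors, type = "l")                        # see how quickly the cost drops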
