Error with ZIP and ZINB Models upon subsetting and factoring data - r

I am trying to run ZIP and ZINB models to look at some of the factors that might help explain disease (orf) distribution within 8 geographical regions. The model works fine for some regions but not others. However, after factoring and running the model in R I get an error message.
How can I solve this problem, or is there a model that might work better with the subset? The analysis will only make sense if it is uniform across all regions.
zinb3 <- zeroinfl(Cases2012 ~ Precip + Altitude + factor(Breed) + factor(Farming.Practise) +
                    factor(Lambing.Management) + factor(Thistles),
                  data = orf3, dist = "negbin", link = "logit")
Error in solve.default(as.matrix(fit$hessian)) :
system is computationally singular: reciprocal condition number = 2.99934e-24
Results after fitting zerotrunc and glm as suggested by @Achim Zeileis. How do I interpret the zerotrunc output given that there are no p-values? Also, how can I correct the error with glm?
zerotrunc(Cases2012 ~ Flock2012 + Stocking.Density2012 + Precip + Altitude + factor(Breed) +
            factor(Farming.Practise) + factor(Lambing.Management) + factor(Thistles),
          data = orf1, subset = Cases2012 > 0)
Call:
zerotrunc(formula = Cases2012 ~ Flock2012 + Stocking.Density2012 + Precip + Altitude +
factor(Breed) + factor(Farming.Practise) + factor(Lambing.Management) + factor(Thistles),
data = orf1, subset = Cases2012 > 0)
Coefficients (truncated poisson with log link):
(Intercept) Flock2012 Stocking.Density2012
14.1427130 -0.0001318 -0.0871504
Precip Altitude factor(Breed)2
-0.1467075 -0.0115919 -3.2138767
factor(Farming.Practise)2 factor(Lambing.Management)2 factor(Thistles)3
1.3699477 -2.9790725 2.0403543
factor(Thistles)4
0.8685876
glm(factor(Cases2012 ~ 0) ~ Precip + Altitude + factor(Breed) + factor(Farming.Practise) +
      factor(Lambing.Management) + factor(Thistles) + Flock2012 + Stocking.Density2012,
    data = orf1, family = binomial)
Error in unique.default(x, nmax = nmax) :
unique() applies only to vectors

It's hard to say exactly what is going on based on the information provided. However, I suspect that the data in some regions do not allow the specified model to be fitted. For example, there might be regions where certain factor levels (of Breed, Farming.Practise, Lambing.Management, or Thistles) have only zero values (or only non-zero values, although that is less frequent in practice). Then the coefficient estimates often degenerate, so that the associated zero-inflation probability goes to 1 and the count coefficient cannot be estimated.
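A quick way to check for such degenerate cells (a sketch, assuming the orf3 data frame from the question) is to cross-tabulate each factor against whether any cases were observed:
# Factor levels whose rows are all zero (or all non-zero) leave the
# corresponding zero-inflation or count coefficients unidentifiable, which is
# a common cause of a computationally singular Hessian.
with(orf3, table(factor(Breed), Cases2012 > 0))
with(orf3, table(factor(Lambing.Management), Cases2012 > 0))
with(orf3, table(factor(Thistles), Cases2012 > 0))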
It's typically easier to separate these effects by using the hurdle rather than the zero-inflation model. The two parts of the model can then also be fitted separately by glm(factor(y > 0) ~ ..., ..., family = binomial) and zerotrunc(y ~ ..., ..., subset = y > 0). The latter function is essentially the same code that pscl uses, but it has been factored out into a standalone function in the countreg package on R-Forge (not yet on CRAN).
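As a concrete sketch of the two hurdle parts for the variables in the question (note that the glm error above most likely comes from factor(Cases2012 ~ 0): the ~ builds a formula object rather than a logical vector, hence "unique() applies only to vectors"; the comparison should be Cases2012 > 0):
library(countreg)  # from R-Forge; provides zerotrunc()
# Zero hurdle: does a farm have any cases at all?
zero_part <- glm(factor(Cases2012 > 0) ~ Precip + Altitude + factor(Breed) +
                   factor(Farming.Practise) + factor(Lambing.Management) +
                   factor(Thistles) + Flock2012 + Stocking.Density2012,
                 data = orf1, family = binomial)
# Count part: positive counts only.
count_part <- zerotrunc(Cases2012 ~ Flock2012 + Stocking.Density2012 + Precip +
                          Altitude + factor(Breed) + factor(Farming.Practise) +
                          factor(Lambing.Management) + factor(Thistles),
                        data = orf1, subset = Cases2012 > 0)
# Printing the fitted objects only shows coefficients; summary() should report
# standard errors, z statistics and p-values for both parts.
summary(zero_part)
summary(count_part)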

Related

How to correctly interpret glmmTMB models with large z statistics/conflicting error messages?

I am using glmmTMB to run a zero-inflated two-component hurdle model to determine how certain covariates might influence (1) whether or not a fish has food in its stomach and (2) if the stomach contains food, which covariates affect the number of prey items found in its stomach.
My data consists of the year a fish was caught, the season it was caught, sex, condition, place of origin, gross sea age (1SW = one year at sea, MSW = multiple years at sea), its genotype at two different loci, and fork length residuals. Data are available at my GitHub here.
Model interpretation
When I run the model (see code below), I get the following warning message about unusually large z-statistics.
library(glmmTMB)
library(DHARMa)
library(performance)
set.seed(111)
feast_or_famine_all_prey_df <- glmmTMB(num_prey ~ autumn_winter +
                                         fishing_season + sex + condition_scaled +
                                         place_of_origin +
                                         sea_age/(gene1 + gene2 + fork_length_residuals) +
                                         (1|location),
                                       data = data_5,
                                       family = nbinom2,
                                       ziformula = ~ .,
                                       dispformula = ~ fishing_season + place_of_origin,
                                       control = glmmTMBControl(optCtrl = list(iter.max = 100000,
                                                                               eval.max = 100000),
                                                                profile = TRUE, collect = FALSE))
summary(feast_or_famine_all_prey_df)
diagnose(feast_or_famine_all_prey_df)
Since the data do display imbalance for the offending variables (e.g. mean number of prey items in autumn = 85.33 vs. 10.61 in winter), I think the associated model parameters are near the edge of their range, hence the extreme probabilities suggested by the z-statistics. Since this reflects the underlying data structure (please correct me if I'm wrong!) and is not a failure of the model itself, is the model output safe to interpret and use?
Conflicting error messages
Using the diagnose() function as well as exploring model diagnostics using the DHARMa package seem to suggest the model is okay.
diagnose(feast_or_famine_all_prey_df)
ff_all_prey_residuals_df<- simulateResiduals(feast_or_famine_all_prey_df, n = 1000)
testUniformity(ff_all_prey_residuals_df)
testOutliers(ff_all_prey_residuals_df, type = "bootstrap")
testDispersion(ff_all_prey_residuals_df)
testQuantiles(ff_all_prey_residuals_df)
testZeroInflation(ff_all_prey_residuals_df)
However, if I run the code performance::r2_nakagawa(feast_or_famine_all_prey_df) then I get the following error messages:
> R2 for Mixed Models
Conditional R2: 0.333
Marginal R2: 0.251
Warning messages:
1: In (function (start, objective, gradient = NULL, hessian = NULL, :
NA/NaN function evaluation
2: In (function (start, objective, gradient = NULL, hessian = NULL, :
NA/NaN function evaluation
3: In (function (start, objective, gradient = NULL, hessian = NULL, :
NA/NaN function evaluation
4: In fitTMB(TMBStruc) :
Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
5: In fitTMB(TMBStruc) :
Model convergence problem; false convergence (8). See vignette('troubleshooting')
None of these appeared using diagnose() nor were they (to the best of my knowledge) hinted at by the DHARMa diagnostics. Should these errors be believed?
Short answer: when you run performance::r2_nakagawa it refits the model with the fixed effects components removed. It's possible that your R^2 estimates are unreliable, but this shouldn't affect any of the other model results.
(update after much digging):
The code descends through these functions:
performance::r2_nakagawa
performance:::.compute_random_vars
insight::get_variance
insight:::.compute_variances
insight:::.compute_variance_distribution
insight:::.variance_distributional
insight:::null_model
insight:::.null_model_mixed
at which point it tries to run a null model with no fixed effects (num_prey ~ (1 | location)). This is where the warnings are coming from.
When I run your code I get R^2 values of 0.308/0.237, which does suggest that this is a somewhat unstable calculation (not that these differences would really change the conclusion much).
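If you want to see where the warnings arise, one rough check (a sketch using the objects from the question; the null model that insight actually builds may keep different zero-inflation and dispersion settings) is to fit the random-effects-only model directly:
# Refit the intercept-only null model by hand; any Hessian / false-convergence
# warnings raised here concern only the R^2 computation, not the full model.
null_mod <- glmmTMB(num_prey ~ (1 | location), data = data_5, family = nbinom2)
diagnose(null_mod)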

Strange glmulti results: Why are interaction variables from the candidate model dropped/not included?

I have been using glmulti to obtain model averaged estimates and relative importance values for my variables of interest. In running glmulti I specified a candidate model for which all variables and interactions were included based on a priori knowledge (see code below).
After running the glmulti model I studied the results using the functions summary() and weightable(). There seem to be a number of strange things going on with the results which I do not understand.
First of all, when I run my candidate model with the lme4 glmer() function I obtain an AIC value of 2086. In the glmulti output this candidate model (with exactly the same formula) has a higher AIC value (2107), as a result of which it appears at position 8 out of 26 in the list of all potential models (as obtained through the weightable() function).
What seems to be causing this problem is that the logArea:Habitat interaction is dropped from the candidate model, despite level = 2 being specified. The function summary(output_new@objects[[8]]) provides a different formula (without the logArea:Habitat interaction) compared to the formula provided through weightable(). This explains why the candidate model's AIC value is not the same as the one obtained through lme4, but I do not understand why the interaction term logArea:Habitat is missing from the formula. The same is happening for other possible models: it seems that for all models with two or more interactions, one interaction is dropped.
Does anyone have an explanation for what is going on? Any help would be much appreciated!
Best,
Robert
Note: I have created a subset of my data (https://drive.google.com/open?id=1rc0Gkp7TPdnhW6Bw87FskL5SSNp21qxl) and simplified the candidate model by removing variables in order to decrease model run time. (The problem remains the same)
newdat <- Data_ommited2[, c("Presabs", "logBodymass", "logIsolation", "Matrix", "logArea", "Protection",
                            "Migration", "Habitat", "Guild", "Study", "Species", "SpeciesStudy")]
glmer.glmulti <- function(formula, data, random, ...) {
  glmer(paste(deparse(formula), random), data = data, family = binomial(link = "logit"),
        contrasts = list(Matrix = contr.sum, Habitat = contr.treatment,
                         Protection = contr.treatment, Guild = contr.sum),
        glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 100000)))
}
output_new <- glmulti(y = Presabs ~ Matrix + logArea*Protection + logArea*Habitat,
                      data = sampledata,
                      random = '+(1|Study)+(1|Species)+(1|SpeciesStudy)',
                      family = binomial,
                      method = 'h',
                      level = 2,
                      marginality = TRUE,
                      crit = 'aic',
                      fitfunc = glmer.glmulti,
                      confsetsize = 26)
print(output_new)
summary(output_new)
weightable(output_new)
I found a post (https://stats.stackexchange.com/questions/341356/glmulti-package-in-r-reporting-incorrect-aicc-values) from someone who encountered the same problem, and it appears that the problem was caused by this part of the code:
glmer.glmulti <- function(formula, data, random, ...) {
  glmer(paste(deparse(formula), random), data = data, family = binomial(link = "logit"))
}
By changing this part of the code to the following, the problem was solved:
glmer.glmulti <- function(formula, data, random, ...) {
  newf <- formula
  newf[[3]] <- substitute(f + r,
                          list(f = newf[[3]],
                               r = reformulate(random)[[2]]))
  glmer(newf, data = data,
        family = binomial(link = "logit"))
}
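A plausible explanation for why the original wrapper drops terms (an assumption based on the default behaviour of deparse(), not stated in the linked post): deparse() wraps long formulas at 60 characters, returning a character vector of several strings, and paste() then appends the random part to each piece rather than to one complete formula. The substitute()/reformulate() version avoids this by keeping the formula as a language object. A minimal illustration:
# Illustration only: a formula long enough to exceed deparse()'s default
# 60-character width is returned as two strings, and paste() recycles the
# random-effects text onto each of them instead of onto a single formula.
f <- Presabs ~ Matrix + logArea * Protection + logArea * Habitat + Migration
length(deparse(f))                                   # more than 1 once the formula is long
paste(deparse(f), "+(1|Study)+(1|Species)+(1|SpeciesStudy)")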

Ordinal regression - proportional odds assumption not met for variable in interaction

I am trying to analyze a dataset with an ordinal response (0-4) and three categorical factors. I'm interested in the interactions of all three factors as well as the main effects. I used the clm function of the package "ordinal" and checked the assumptions using the nominal_test function. It revealed a significant difference for one of the predictors, and now I don't know how to proceed... I tried to put the problematic factor and all its interactions in the "nominal" argument (see code), and R gives me warnings. Nevertheless, I made several likelihood ratio tests, always comparing a model including an interaction with one missing it (anova(without, with, test="Chisq")), and got some nice significant results. Still, I feel like I have no clue what I'm doing here and I don't trust the results. So my question is: is what I did OK? What else can I do? Or is the data just 'unanalyzable'?
Here is the code for the test:
# this is the model
res <- clm(cue ~ intention:outcome:age +
             intention:outcome +
             intention:age +
             outcome:age +
             intention + outcome + age +
             Gender,
           data = xdata)
#proportional odds assumption
nominal_test(res)
# Df logLik AIC LRT Pr(>Chi)
#<none> -221.50 467.00
#intention 3 -215.05 460.11 12.891 0.004879 **
#outcome 3 -219.44 468.87 4.124 0.248384
#age
#Gender 3 -219.50 469.00 3.994 0.262156
#intention:outcome
#intention:age
#outcome:age 6 -217.14 470.28 8.716 0.190199
#intention:outcome:age 12 -188.09 424.19 66.808 1.261e-09 ***
And here is an example of how I tried to solve it and check the 3-way interaction of all three predictors. I did the same for the 2-way interactions as well...
res <- clm(cue ~ outcome:age +
             outcome + age +
             Gender,
           nominal = ~ intention:age:outcome +
             intention:age +
             intention:outcome +
             intention,
           data = xdata)
res.red <- clm(cue ~ outcome:age +
                 outcome + age +
                 Gender,
               nominal = ~
                 intention:age +
                 intention:outcome +
                 intention,
               data = xdata)
anova(res, res.red, test = "Chisq")
# no.par AIC logLik LR.stat df Pr(>Chisq)
#res.red 26 412.50 -180.25
#res 33 424.11 -179.05 2.3945 7 0.9348
And here is the warning that R gives me when I try to fit the model:
Warning message:
(-3) not all thresholds are increasing: fit is invalid
In addition: Absolute convergence criterion was met, but relative criterion was not met
I'm especially concerned about the message "fit is invalid"... I don't know what to do with this and would be happy about any idea or hint!
Thank you!
Have you tried using a more general model like the partial proportional odds model? Your data only have to be nominal, not ordinal, to use this model. If you find huge differences between the log-likelihoods, your assumption about ordinality is not met.
You can use vglm() from the VGAM package. Here are a few examples.
As I don't know what your data looks like, I can't say whether it's unanalyzable, but the code would be something like this:
library(VGAM)
res <- vglm(cue ~ intention:outcome:age +
              intention:outcome +
              intention:age +
              outcome:age +
              intention + outcome + age +
              Gender,
            family = cumulative(parallel = FALSE ~ intention),
            data = xdata)
summary(res)
I think you could use pchisq() as proposed in the example I posted above to compare both models, as you did before with anova():
pchisq(deviance(res) - deviance(res.red),
df = df.residual(res) - df.residual(res.red), lower.tail = FALSE)
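For that comparison to be meaningful, both fits should come from vglm(); a hedged sketch of a fully proportional-odds counterpart to use as res.red (an assumption, not part of the original answer):
# parallel = TRUE constrains all predictors to have proportional (parallel)
# effects across the response thresholds, i.e. the ordinary cumulative model.
res.red <- vglm(cue ~ intention:outcome:age + intention:outcome + intention:age +
                  outcome:age + intention + outcome + age + Gender,
                family = cumulative(parallel = TRUE),
                data = xdata)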

Resolving symmetry in gnls model

I'm trying to fit a logistic growth model in R, using gnls in the nlme package.
I have previously successfully fit a model:
mod1 <- gnls(Weight ~ I(A/(1 + exp(b + v0*Age + v1*Sum.T))),
             data = df,
             start = c(A = 13.157132, b = 3, v0 = 0.16, v1 = -0.0059),
             na.action = na.omit)
However, I now wish to constrain b so that it is not fitted by the model, so I have tried fitting a second model:
mod2 <- gnls(Weight ~ I(A/(1 + exp(log((A/1.022) - 1) + v0*Age + v1*Sum.T))),
             data = df,
             start = c(A = 13.157132, v0 = 0.16, v1 = -0.0059),
             na.action = na.omit)
This model returned the error:
Error in gnls(Weight ~ A/(1 + exp(log((A/1.022) - 1) + v0 * Age + :
approximate covariance matrix for parameter estimates not of full rank
Warning messages:
1: In log((A/1.022) - 1) : NaNs produced
Searching the error suggests that the problem is caused by symmetry in the model, and solutions to specific questions involve adapting the formula with different parameters. Unfortunately, my statistical knowledge is not good enough to a) fully understand the problem or b) adapt the formula myself.
As for the warning messages (there were 15 in all, all the same), I can't see why they arise, because this section of the model works on its own (with example numbers).
Any help with any of these queries would be greatly appreciated.
It may be informative for users to know that I finally solved this with what turned out to be a simple solution (with help from a friend).
Since exp(a+b) = exp(a)*exp(b), the equation can be rewritten:
Weight ~ I(A/(1 + ((A/1.022) - 1) * exp(v0*Age + v1*Sum.T)))
This fits without any problems. In general, rewriting the equation in a different form would seem to be the answer.
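For completeness, the reparameterised call would look something like this (a sketch reusing the data, start values, and variable names from the question):
# Same model as mod2, but with exp(log((A/1.022) - 1)) pulled out of the
# exponent, so the troublesome log() term disappears from the formula.
mod3 <- gnls(Weight ~ I(A/(1 + ((A/1.022) - 1) * exp(v0*Age + v1*Sum.T))),
             data = df,
             start = c(A = 13.157132, v0 = 0.16, v1 = -0.0059),
             na.action = na.omit)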

gwr fitting using packages mgcv and R2BayesX in R

I want to compare GWR fits produced by spgwr and mgcv, but I got an error with the gam function of mgcv. Here is an example:
require(spgwr)
require(mgcv)
require(R2BayesX)
data(columbus)
col.bw <- gwr.sel(crime ~ income + housing, data = columbus, verbose = FALSE,
                  coords = cbind(columbus$x, columbus$y))
col.gauss <- gwr(crime ~ income + housing, data = columbus,
                 coords = cbind(columbus$x, columbus$y),
                 bandwidth = col.bw, hatmatrix = TRUE)
# gwr fitting with intercept
col.gam <- gam(crime ~ s(x,y) + s(x,y)*income + s(x,y)*housing, data = columbus)   # mgcv ERROR
b1 <- bayesx(crime ~ sx(x,y) + sx(x,y)*income + sx(x,y)*housing, data = columbus)  # R2BayesX ERROR
Questions:
1. How to fit the same GWR using the gam and bayesx functions (the smooth functions of location)?
2. How to control the parameters to be as similar as possible, including the optimal bandwidth?
The mgcv error comes from the way you are specifying the "interactions" between the spatial smooth and the variables income and housing. Read ?gam.models for details on using by terms. I think for this you need:
col.gam <- gam(crime ~ s(x, y, k = 5) + s(x, y, by = income, k = 5) +
                 s(x, y, by = housing, k = 5), data = columbus)
In this example, as there are only 49 observations, you need to restrict the dimensions of the basis functions, which I do here with k = 5, but you should investigate whether you need to vary these a little, within the constraints of the data.
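One way to judge whether k = 5 is adequate (a sketch; gam.check() is part of mgcv):
# gam.check() reports, among other diagnostics, the effective degrees of
# freedom and k-index for each smooth, which indicate whether the chosen basis
# dimension is too restrictive.
gam.check(col.gam)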
By the looks of the error from bayesx, you have the same issue of specifying the model incorrectly. I'm not familiar with bayesx(), but it looks like it uses the same s() function as supplied with mgcv, so the model specification should be the same as I show above.
As for 2., can you expand on what you mean here? Comparable between gam() and bayesx(), or getting both (or one) of these comparable with the spgwr() model?
