Durbin-Watson test on the residuals of a mixed model (R)

I am trying to calculate the temporal autocorrelation of a Poisson-distributed mixed model and was wondering how to do so. I get an error that says "$ operator not defined for this S4 class". I can successfully run dwtest() on a plain Poisson GLM, but not on the mixed model I am really interested in.
Successful model and code:
library(lmtest)

temp.nem.cuc.glm <- glm(nem.cuc ~ year.collected, family = poisson(link = "log"), data = AllDat)
summary(temp.nem.cuc.glm)

time <- AllDat$year.collected
dwnem.cuc <- dwtest(temp.nem.cuc.glm, order.by = time, alternative = "two.sided",
                    iterations = 50, exact = FALSE, tol = 1e-10)
dwnem.cuc
Unsuccessful model and code:
# the model I am really interested in
library(lme4)

nem.cuc.pois <- glmer(nem.cuc ~ I(year.collected - 1930) + I(standard.length..mm./100) + (1|sites1),
                      family = "poisson", data = AllDat)
time <- AllDat$year.collected
dwnemresid.cuc <- dwtest(nem.cuc.pois, order.by = time, alternative = "two.sided",
                         iterations = 50, exact = FALSE, tol = 1e-10)
dwnemresid.cuc

For future reference, I found the check_autocorrelation() function from the performance package, but it only returns whether it detected autocorrelation or not.
I would also be interested in a DW test that returns an actual statistic, if anyone knows of one, or in a way to solve the S4 class error, since I encounter it repeatedly.
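One possible workaround, not from the original thread, is sketched below: lmtest::dwtest() also accepts a formula, so the test can be run on residuals extracted from the glmer fit rather than on the S4 model object itself (the choice of deviance residuals here is an assumption).
library(lmtest)
library(lme4)

# Extract residuals from the fitted glmer model and test them directly;
# regressing the residuals on an intercept lets dwtest() compute its statistic on them.
res <- residuals(nem.cuc.pois, type = "deviance")
time <- AllDat$year.collected
dwtest(res ~ 1, order.by = time, alternative = "two.sided")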

How to correctly interpret glmmTMB models with large z statistics/conflicting error messages?

I am using glmmTMB to run a zero-inflated two-component hurdle model to determine how certain covariates might influence (1) whether or not a fish has food in its stomach and (2) if the stomach contains food, which covariates affect the number of prey items found in its stomach.
My data consists of the year a fish was caught, the season it was caught, sex, condition, place of origin, gross sea age (1SW = one year at sea, MSW = multiple years at sea), its genotype at two different loci, and fork length residuals. Data are available at my GitHub here.
Model interpretation
When I run the model (see code below), I get the following warning message about unusually large z-statistics.
library(glmmTMB)
library(DHARMa)
library(performance)

set.seed(111)
feast_or_famine_all_prey_df <- glmmTMB(num_prey ~ autumn_winter + fishing_season +
                                         sex + condition_scaled + place_of_origin +
                                         sea_age/(gene1 + gene2 + fork_length_residuals) +
                                         (1|location),
                                       data = data_5,
                                       family = nbinom2,
                                       ziformula = ~ .,
                                       dispformula = ~ fishing_season + place_of_origin,
                                       control = glmmTMBControl(optCtrl = list(iter.max = 100000,
                                                                               eval.max = 100000),
                                                                profile = TRUE, collect = FALSE))
summary(feast_or_famine_all_prey_df)
diagnose(feast_or_famine_all_prey_df)
Since the data do display imbalance for the offending variables (e.g. mean number of prey items in autumn = 85.33, mean number of prey items in winter = 10.61), I think the associated model parameters are near the edge of their range, hence the extreme z-statistics. Since this is an actual reflection of the underlying data structure (please correct me if I'm wrong!) and not a failure of the model itself, is the model output safe to interpret and use?
Conflicting error messages
Using the diagnose() function as well as exploring model diagnostics using the DHARMa package seem to suggest the model is okay.
diagnose(feast_or_famine_all_prey_df)
ff_all_prey_residuals_df<- simulateResiduals(feast_or_famine_all_prey_df, n = 1000)
testUniformity(ff_all_prey_residuals_df)
testOutliers(ff_all_prey_residuals_df, type = "bootstrap")
testDispersion(ff_all_prey_residuals_df)
testQuantiles(ff_all_prey_residuals_df)
testZeroInflation(ff_all_prey_residuals_df)
However, if I run the code performance::r2_nakagawa(feast_or_famine_all_prey_df) then I get the following error messages:
> R2 for Mixed Models
Conditional R2: 0.333
Marginal R2: 0.251
Warning messages:
1: In (function (start, objective, gradient = NULL, hessian = NULL, :
NA/NaN function evaluation
2: In (function (start, objective, gradient = NULL, hessian = NULL, :
NA/NaN function evaluation
3: In (function (start, objective, gradient = NULL, hessian = NULL, :
NA/NaN function evaluation
4: In fitTMB(TMBStruc) :
Model convergence problem; non-positive-definite Hessian matrix. See vignette('troubleshooting')
5: In fitTMB(TMBStruc) :
Model convergence problem; false convergence (8). See vignette('troubleshooting')
None of these appeared using diagnose() nor were they (to the best of my knowledge) hinted at by the DHARMa diagnostics. Should these errors be believed?
Short answer: when you run performance::r2_nakagawa it refits the model with the fixed effects components removed. It's possible that your R^2 estimates are unreliable, but this shouldn't affect any of the other model results.
(update after much digging):
The code descends through these functions:
performance::r2_nakagawa
performance:::.compute_random_vars
insight::get_variance
insight:::.compute_variances
insight:::.compute_variance_distribution
insight:::.variance_distributional
insight:::null_model
insight:::.null_model_mixed
at which point it tries to run a null model with no fixed effects (num_prey ~ (1 | location)). This is where the warnings are coming from.
When I run your code I get R^2 values of 0.308/0.237, which does suggest that this is a somewhat unstable calculation (not that these differences would really change the conclusion much).
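To check whether the instability really comes from that internal null fit, one could refit it directly and see whether the same warnings appear; a minimal sketch, assuming the null model uses the same data and conditional family as the full model (the internal call may differ in details such as the zero-inflation part):
library(glmmTMB)

# Random-effects-only null model, roughly what insight:::null_model builds internally
null_prey <- glmmTMB(num_prey ~ (1 | location),
                     data = data_5,
                     family = nbinom2)
summary(null_prey)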

boosting the 'devtol' parameter in lme4

I am estimating a mixed model with glmer and get the error
Error in zeta(shiftpar, start = opt[seqpar1][-w]) : profiling detected new, lower deviance
I found a suggestion that "boosting" the devtol parameter fixes this, but I can't work out how to do that.
Here is my model:
m3.glmer <- glmer(binExap ~ (1|id) + Lag1 + Lag2 + Lag5 + BroadQ,
                  data = CLnMD,
                  family = binomial(link = "logit"),
                  nAGQ = 1,
                  control = glmerControl(optimizer = "bobyqa",
                                         optCtrl = list(maxfun = 100000)))
This is the code I am using for estimating the CIs:
KIsBoot <- confint.merMod(m3.glmer, method = "profile", nsim = 250)
Now where do I boost/how would I boost "devtol"?
This is admittedly a bit obscure. confint.merMod() takes a ... argument that gets passed to profile.merMod. ?profile.merMod says:
devtol: tolerance for fitted deviances less than baseline (supposedly
minimum) deviance.
So, if you want to ignore this check completely,
confint(m3.glmer, devtol = Inf)
should work. (You don't need .merMod, R figures that out automatically; "profile" is the default setting; and nsim is ignored unless method = "boot" [we should add a warning!])
However, I would also say a little bit pessimistically that if you're getting this error your profile CIs might not be very reliable ... try visualizing the profile as well (pp <- profile(m3.glmer, devtol = Inf); lattice::xyplot(pp)) to make sure it looks reasonable (i.e. at least monotonic!)
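Putting that together, a short sketch of the suggested workflow (same model object as above): profile once with the deviance check relaxed, inspect the profile, then reuse it for the intervals so the profiling is not repeated.
library(lme4)
library(lattice)

# Profile with the deviance-tolerance check disabled, then inspect it
pp <- profile(m3.glmer, devtol = Inf)
xyplot(pp)    # profiles should look smooth and roughly monotonic on each side

# confint() on an existing profile object avoids re-profiling the model
confint(pp)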

nlminb problem, convergence error code = 1 message = iteration limit reached without convergence (10)

I am trying to find the best-fitting model for my data using library(nlme) and the lme() function in R. Here is my model with a fixed slope:
FixedRopeLength <- lme(EnergyCost ~ RopeLength,
                       data = data,
                       random = ~ 1 | Subject, method = "ML")
summary(FixedRopeLength)
To see whether a random slope provides a better model than a fixed slope, I let the slope vary across Subject as follows:
RandomRopeLength <- lme(EnergyCost ~ RopeLength,
                        data = data,
                        random = ~ RopeLength | Subject, method = "ML")
summary(RandomRopeLength)
However, I got this error:
Error in lme.formula(EnergyCost ~ RopeLength, data = data, random =
~RopeLength | : nlminb problem, convergence error code = 1
message = iteration limit reached without convergence (10)
Any solution?
Thank you so much for your help. Your code worked. I only needed to adapt your control settings to the lme() function. Here is the code that resolves the error above:
RandomRopeLength <- lme(EnergyCost ~ RopeLength, data = data,
                        random = ~ RopeLength | Subject, method = "ML",
                        control = list(msMaxIter = 1000, msMaxEval = 1000))
summary(RandomRopeLength)
Thanks!
?lme shows that there is a control argument, which redirects you to ?lmeControl, which gives you
msMaxIter: maximum number of iterations for the optimization step
inside the ‘lme’ optimization. Default is ‘50’.
and
msMaxEval: maximum number of evaluations of the objective function
permitted for nlminb. Default is ‘200’.
These correspond to eval.max and iter.max from ?nlminb. Since I'm not sure which of these is the problem, I would re-run the model with
control = lmeControl(msMaxIter = 1000, msMaxEval = 1000)
However, I'll warn you that once you have a problem that experiences numerical problems with the default parameter settings, adjusting the parameter settings may just lead to other problems farther down the line ...
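Once both fits converge, the comparison the question set out to make (random vs. fixed slope) can be done with a likelihood-ratio test; a sketch, assuming the two ML fits defined above:
# Refit the random-slope model with the relaxed control settings, then compare
RandomRopeLength <- update(FixedRopeLength,
                           random = ~ RopeLength | Subject,
                           control = lmeControl(msMaxIter = 1000, msMaxEval = 1000))
anova(FixedRopeLength, RandomRopeLength)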

mlr3: obtaining response (predicted survival time) from surv.gbm

surv.gbm in the mlr3 framework outputs linear predictors; however, what I'm really interested in are predicted survival times per case, which I want to compare with the actual survival times. Is there a way to obtain predicted survival times?
In the mlr3 book, there is an example of a transformation between linear predictors and a distribution.
pod = po("distrcompose", param_vals = list(form = "ph", overwrite = FALSE))
prediction = pod$predict(list(base = prediction_distr, pred = prediction_lp))$output
Is there a way to change this pipeline so that it converts "lp" to "response" ?
Any help would be appreciated.
Yes, this is definitely possible; it just requires another transformation. Your first step is correct: compose a distribution from the linear predictor. As you're using surv.gbm, Cox PH is the only possible underlying model, so the default form for distrcompose works here.
Now you need the crankcompositor pipeline to create a survival-time prediction from that distribution. You could use the mean, median, or mode of the distribution; people usually pick mean or median, but that's your choice! Just make sure to include response = TRUE, overwrite = FALSE. Example code below, including creating predictions and scoring with RMSE (surprisingly good!). I think the book may need updating...
Thanks,
Raphael
library(mlr3extralearners)
library(mlr3proba)
library(mlr3pipelines)
library(mlr3)

learn = ppl("crankcompositor", ppl("distrcompositor", lrn("surv.gbm")),
            response = TRUE, overwrite = FALSE, method = "mean",
            graph_learner = TRUE)

set.seed(1)
task = tgen("simsurv")$generate(50)
learn$train(task)
p = learn$predict(task)
p$score(msr("surv.rmse"))
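For the case-by-case comparison the question asks about, the prediction object can be turned into a table; a small sketch (the exact column set depends on the mlr3proba version):
library(data.table)

# One row per case: the observed (time, status) truth and the composed response
# (predicted survival time) end up side by side for direct comparison.
pred_tab <- as.data.table(p)
head(pred_tab)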

How to handle a skewed response in H2O algorithms

In my problem dataset, the response variable is extremely skewed to the left. I have tried to fit the model with h2o.randomForest() and h2o.gbm() as below. I can tune min_split_improvement and min_rows to avoid overfitting in these two cases, but with these models I still see very high errors on the tail observations. I have tried using weights_column to oversample the tail observations and undersample the other observations, but it does not help.
h2o.model <- h2o.gbm(x = predictors, y = response, training_frame = train,
                     validation_frame = valid, seed = 1, ntrees = 150, max_depth = 10,
                     min_rows = 2, model_id = "GBM_DD", balance_classes = TRUE, nbins = 20,
                     stopping_metric = "MSE", stopping_rounds = 10, min_split_improvement = 0.0005)
h2o.model <- h2o.randomForest(x = predictors, y = response, training_frame = train,
                              validation_frame = valid, seed = 1, ntrees = 150, max_depth = 10,
                              min_rows = 2, model_id = "DRF_DD", balance_classes = TRUE, nbins = 20,
                              stopping_metric = "MSE", stopping_rounds = 10, min_split_improvement = 0.0005)
I have also tried the h2o.automl() function of the h2o package for better performance; however, I see significant overfitting, and I don't know of any parameters in h2o.automl() to control it.
Does anyone know of a way to avoid overfitting with h2o.automl()?
EDIT: the distribution of the log-transformed response is given below (after the suggestion from Erin).
EDIT 2: distribution of the original response.
H2O AutoML uses H2O algos (e.g. RF, GBM) underneath, so if you're not able to get good models there, you will suffer from the same issues using AutoML. I am not sure I would call this overfitting; it's more that your models are not doing well at predicting outliers.
My recommendation is to log your response variable, which is a useful thing to do when you have a skewed response. In the future, H2O AutoML will try to detect a skewed response automatically and take the log, but that's not a feature of the current version (H2O 3.16.*).
Here's a bit more detail if you are not familiar with this process. First, create a new column, e.g. log_response, as follows and use that as the response when training (in RF, GBM or AutoML):
train[,"log_response"] <- h2o.log(train[,response])
Caveats: If you have zeros in your response, you should use h2o.log1p() instead. Make sure not to include the original response in your predictors. In your case, you don't need to change anything because you are already explicitly specifying the predictors using a predictors vector.
Keep in mind that when you log the response, your predictions and model metrics will be on the log scale, so you may need to convert your predictions back to the original scale, like this:
model <- h2o.randomForest(x = predictors, y = "log_response",
                          training_frame = train, validation_frame = valid)
log_pred <- h2o.predict(model, test)
pred <- h2o.exp(log_pred)
This gives you the predictions, but if you also want to see the metrics, you will have to compute them with the h2o.make_metrics() function on the back-transformed predictions rather than extracting the metrics from the model.
perf <- h2o.make_metrics(predicted = pred, actual = test[,response])
h2o.mse(perf)
You can try this using RF like I showed above, or a GBM, or with AutoML (which should give better performance than a single RF or GBM).
Hopefully that helps improve the performance of your models!
When your target variable is skewed, MSE is not a good metric to use. I would try changing the loss function, because GBM tries to fit the model to the gradient of the loss function, and you want to make sure you are using the correct distribution. If you have a spike at zero and a right-skewed positive target, Tweedie would probably be a better option.
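For illustration, a sketch of what switching the GBM loss to Tweedie might look like with the same predictors and frames as above (the tweedie_power value is only a placeholder to tune on the validation frame):
# Same setup as the earlier GBM, but with a Tweedie loss instead of the default
h2o.model.tweedie <- h2o.gbm(x = predictors, y = response,
                             training_frame = train, validation_frame = valid,
                             seed = 1, ntrees = 150, max_depth = 10, min_rows = 2,
                             distribution = "tweedie",
                             tweedie_power = 1.5)  # between 1 and 2; value chosen here arbitrarily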
