I'm creating diagnostics for my glmmTMB model using DHARMa, and while I understand most of the output, I have trouble interpreting the plot of scaled residuals versus a predictor variable: there is a red dashed line. Any advice on interpretation?
Example of one of the residual-vs-predictor plots:
Let me know if you need more information to give me an answer.
The red dashed line is (to my knowledge) simply a non-parametric estimate of the mean of the residuals. In a perfect-world scenario, we would expect it to be 0.
In the real world, we expect to see no systematic deviations from 0. Yours looks rather good here: it oscillates around 0 at random, and only in the region where information is lacking (pred > 2.5) does it start to deviate.
The answer can be found here. In short, what you see is a nonparametric smoother, which is the default for large datasets. You can get a more intuitive plot by specifying quantreg = TRUE.
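For reference, a minimal sketch of how you might request those quantile regression lines; the model object m and the predictor column dat$pred are placeholders, not names from your code:

library(DHARMa)
sims <- simulateResiduals(fittedModel = m)  # scaled (quantile) residuals
## quantreg = TRUE swaps the default smoother for quantile regression
## lines at the 0.25 / 0.5 / 0.75 quantiles
plotResiduals(sims, form = dat$pred, quantreg = TRUE)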
I'm fitting a multiple linear regression model. I used the bptest function (Breusch-Pagan test) to check for heteroscedasticity, and the result was significant at the 0.05 level.
How can I deal with the heteroscedasticity?
Try a different flavour of linear regression (sketches below):
- Ordinary Least Squares (OLS) for homoscedasticity.
- Weighted Least Squares (WLS) for heteroscedasticity without correlated errors.
- Generalized Least Squares (GLS) for heteroscedasticity with correlated errors.
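In R terms, a hedged sketch of the three options (y, x, and dat are placeholders):

fit_ols <- lm(y ~ x, data = dat)  # OLS: assumes constant error variance

## WLS: weight observations by the inverse of their assumed variance,
## e.g. variance proportional to x
fit_wls <- lm(y ~ x, data = dat, weights = 1 / x)

## GLS: model the variance structure explicitly; varPower() lets the
## error SD grow as a power of the fitted values
library(nlme)
fit_gls <- gls(y ~ x, data = dat, weights = varPower())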
Welcome to SO, Arun.
Personally, I don't think heteroskedasticity is something you "solve". Rather, it's something you need to allow for in your model.
You haven't given us any of your data, so let's assume that the variance of your residuals increases with the magnitude of your predictor. A simple approach is to transform the data so that the variance becomes roughly constant; one way of doing this might be to log-transform your response. That might give you a more constant variance, but it also transforms your model: on the original scale your errors are no longer additive and identically distributed.
Alternatively, you might have two groups of observations that you want to compare with a t-test, but the variance in one group is larger than in the other. That's a different sort of heteroskedasticity. There are variants of the standard "pooled variance" t-test, such as Welch's t-test, that handle that.
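For the two-group case, the Welch variant is in fact R's default; a small sketch with placeholder vectors g1 and g2:

t.test(g1, g2)                    # Welch: does not assume equal variances
t.test(g1, g2, var.equal = TRUE)  # classic pooled-variance t-test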
I realise this isn't an answer to your question in the conventional sense. I would have made it a comment, but I knew before I started that I'd need more words than a comment would let me have.
I have modeled the probability of an aggressive (vs indolent) form of recurrent respiratory papillomatosis as a function of age at diagnosis. Generally speaking, those who are diagnosed before the age of 5 have an 80% probability of running an aggressive course. Those diagnosed after the age of 10 have about a 30% chance. Between 5 and 10 years it is somewhere in between. Within each of the three age groups there does not seem to be a correlation with age.
Look at the curve (open circles) that logistic regression wants to fit, and compare it with my manual line (dotted) that seems to describe better what is going on. My x-axis is the log of diagnostic age; the y-axis is the probability of aggressive disease. How do I model the dotted line? I thought of using my own logistic function, but I do not know how to make R find the best parameters.
Am I missing something in my understanding of the mathematics of the two graphs?
How do I operationalize this in R? Or perhaps I am looking for the dashed green line. I cannot believe the dashed line is correct: biologically speaking, there is little reason to think that the risk for someone diagnosed at age 9.9 is very different from someone diagnosed at age 10.1.
I agree that discontinuous or step functions typically make little ecological sense. Then again, probably your dotted line doesn't, either. If we can agree that the level won't make any discontinuous jumps (as in your green dashed line), then why should the regression coefficient of the response to age make discontinuous jumps to yield the "kinks" in your green line?
You could consider transforming your age using splines to model nonlinearities. Just make sure that you don't overfit. Logistic regression will never yield a perfect fit, so don't search for one.
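As a hedged sketch of the spline idea (the data frame dat and the columns aggressive and dxage are assumed names, and df = 3 is an arbitrary starting point):

library(splines)
## Natural cubic spline in log(age); increase df only if the data
## support it, to avoid overfitting
fit <- glm(aggressive ~ ns(log(dxage), df = 3),
           family = binomial, data = dat)
## Inspect the fitted probability curve over a grid of ages
grid <- data.frame(dxage = seq(1, 20, length.out = 200))
plot(grid$dxage, predict(fit, newdata = grid, type = "response"),
     type = "l", xlab = "age at diagnosis", ylab = "P(aggressive)")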
The "standard" logistic function $\frac{1}{1+e^{-x}}$ passes through 0 and 1 at $±\infty$. This is not a great match for your data, which doesn't seem to approach either of those values, but instead approaches 0.8 from the left and 0.3 from the right.
You may want to add scale and offset parameters so that you can squash and shift that curve into that range. My guess is that, despite the extra parameters, the model will fit better (via AIC, etc) and will end up resembling your dashed line.
Edit: You're on the right track. The next step would be to replace the hard-coded values of 0.5 and 0.3 with parameters to be fit. Your model would look something like
aggressive ~ offset + gain / (1 + exp(-tau * (dxage - shift)))
You would then fit this with nls: simply pass in the formula (above) and the data. If you have reasonable guesses for the starting values (which you do here), providing them can help speed convergence.
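A minimal sketch of that fit; the data frame dat, the 0/1 column aggressive, the column dxage, and the starting values are all assumptions:

fit <- nls(aggressive ~ offset + gain / (1 + exp(-tau * (dxage - shift))),
           data  = dat,
           start = list(offset = 0.3,  # right-hand plateau
                        gain   = 0.5,  # 0.8 - 0.3
                        tau    = -3,   # negative: probability falls with age
                        shift  = 2))   # transition midpoint on the log-age axis
summary(fit)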
I had a dataset for which I needed to provide a linear regression model that represents diameter as a function of length. The data, which has length in the first column and diameter in the second, looked like:
0.455,0.365
0.44,0.365
I carried out the required operations on the given dataset in R and plotted the regression line for the data.
I am just confused about what to conclude from the parameters (slope = 0.8154, intercept = -0.019413, correlation coefficient = 0.98). Can I conclude anything other than that the line is a good fit? I am new to statistics. Any help would be appreciated.
The slope of 0.8154 tells you that each unit increase in length is associated with an increase of 0.8154 units in diameter. The intercept of -0.019413 is probably statistically insignificant in this case; to verify that, look at its t-statistic, for example.
On this page you can find a nice course with visualizations about simple linear regression and other statistical methods that answer your questions.
From the slope and intercept alone, you cannot conclude whether the line is a good fit. The correlation coefficient of 0.98 says the two variables are highly associated and that a straight line could fit your data well. The p-values for the slope and intercept tell you whether each coefficient is significantly different from zero; if they are small (say, below 0.05), that supports keeping those terms in the model.
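A small sketch of where to find those p-values (the data frame and column names are assumed from the question):

fit <- lm(diameter ~ length, data = dat)
summary(fit)  # coefficient table with t-statistics and p-values
confint(fit)  # 95% confidence intervals for intercept and slope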
I ran a model using glmer looking at the effect that Year and Treatment had on the number of points covered with wood, then plotted the residuals to check for normality; the resulting graph is slightly skewed to the right. Is this normally distributed?
model <- glmer(Number~Year*Treatment(1|Year/Treatment), data=data,family=poisson)
This site recommends using glmmPQL if your data is not normal: http://ase.tufts.edu/gsc/gradresources/guidetomixedmodelsinr/mixed%20model%20guide.html
library(MASS)
library(nlme)
model1 <- glmmPQL(Number ~ Year * Treatment, ~1 | Year/Treatment,
                  family = gaussian(link = "log"),
                  data = data,
                  start = coef(lm(log(Number) ~ Year * Treatment, data = data)),
                  na.action = na.pass, verbose = FALSE)
summary(model1)
plot(model1)
Now, do you transform the data in the Excel file or in the R code (Number1 <- log(Number)) before running this model? Does link = "log" mean the data is already log-transformed, or that the model will apply the transformation itself?
If you have data with zeros, is it acceptable to add 1 to all observations to make them positive so they can be log-transformed: Number1 <- log(Number + 1)?
Is fit <- anova(model, model1, test = "Chisq") sufficient to compare both models?
Many thanks for any advice!
tl;dr: your diagnostic plots look OK to me; you can probably proceed to interpreting your results.
This formula:
Number~Year*Treatment+(1|Year/Treatment)
might not be quite right (besides the missing + between the terms above ...). In general you shouldn't include the same term in both the random and the fixed effects. (There is one exception: if Year has more than a few values and there are multiple observations per year, you can include it as a continuous covariate in the fixed effects and as a grouping factor in the random effects, so this might be correct.)
I'm not crazy about the linked introduction; at a quick skim there's nothing horribly wrong with it, but there seem to be a lot of minor inaccuracies and confusions. "Use glmmPQL if your data aren't Normal" is really shorthand for "you might want to use a GLMM if your data aren't Normal". Your glmer model should be fine.
interpreting diagnostic plots is a bit of an art, but the degree of deviation that you show above doesn't look like a problem.
since you don't need to log-transform your data, you don't need to get into the slightly messy issue of how to log-transform data containing zeros. In general log(1+x) transformations for count data are reasonable - but, again, unnecessary here.
anova() in this context does a likelihood ratio test, which is a reasonable way to compare models.
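As a hedged sketch of the pattern for two nested glmer fits, using the question's formula (with the random-effects caveats above in mind); note that glmmPQL fits have no true likelihood, so anova() cannot compare a glmer fit against one:

m_full    <- glmer(Number ~ Year * Treatment + (1 | Year/Treatment),
                   data = data, family = poisson)
m_reduced <- update(m_full, . ~ . - Year:Treatment)  # drop the interaction
anova(m_reduced, m_full)  # likelihood-ratio (chi-squared) test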
I want to compare the curve fits of three models by their r-squared values. I ran the models using the nls and drc packages. It appears, though, that neither of those packages calculates r-squared values; they do give "residual std error" and "residual sum of squares".
Can these two be used to compare model fits?
This is really a statistics question, rather than a coding question: consider posting on stats.stackexchange.com; you're likely to get a better answer.
RSQ is not really meaningful for non-linear regression. This is why summary.nls(...) does not provide it. See this post for an explanation.
There is a common, and understandable, tendency to hope for a single statistic that allows one to assess which of a set of models better fits a dataset. Unfortunately, it doesn't work that way. Here are some things to consider.
Generally, the best model is the one that has a mechanistic underpinning. Do your models reflect some physical process, or are you just trying a bunch of mathematical equations and hoping for the best? The former approach almost always leads to better models.
You should consider how the models will be used. Will you be interpolating (e.g. estimating y|x within the range of your dataset), or will you be extrapolating (estimating y|x outside the range of your data)? Some models yield a fit that provides relatively accurate estimates slightly outside the dataset range, and others completely fall apart.
Sometimes the appropriate modeling technique is suggested by the type of data you have. For example, if you have data that counts something, then y is likely to be Poisson-distributed and a generalized linear model (glm) in the Poisson family is indicated. If your data is binary (i.e. only two possible outcomes, success or failure), then a binomial glm is indicated (so-called logistic regression).
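Sketches of both (all names are placeholders):

fit_pois  <- glm(count ~ x, family = poisson, data = dat)    # count response
fit_binom <- glm(success ~ x, family = binomial, data = dat) # binary response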
The key underlying assumption of least squares techniques is that the error in y is normally distributed with mean 0 and constant variance. We can test this after doing the fit by looking at a plot of standardized residuals vs. the fitted values, and by looking at a normal Q-Q plot of the residuals. If the residuals plot shows scatter increasing or decreasing with the fitted values, then the model is not a good one. If the normal Q-Q plot is not close to a straight line, then the residuals are not normally distributed and a different model is probably indicated.
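For a fit object fit from nls, a minimal sketch of those two checks:

r <- residuals(fit) / summary(fit)$sigma  # standardized residuals
plot(fitted(fit), r,
     xlab = "fitted values", ylab = "standardized residuals")
abline(h = 0, lty = 2)  # look for fanning in or out around this line
qqnorm(r); qqline(r)    # points should hug the reference line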
Sometimes certain data points have high leverage with a given model, meaning that the fit is unduly influenced by those points. If this is a problem you will see it in a leverage plot. This indicates a weak model.
For a given model, it may be the case that not all of the parameters are significantly different from 0 (e.g., p-value of the coefficient > 0.05). If this is the case, you need to explore the model without those parameters. With nls, this often implies a completely different model.
Assuming that your model passes the tests above, it is reasonable to look at the F-statistic for the fit. This is essentially the ratio of SSR/SSE corrected for the dof in the regression (R) and the residuals (E). A model with more parameters will generally have smaller residual SS, but that does not make it a better model. The F-statistic accounts for this in that models with more parameters will have larger regression dof and smaller residual dof, making the F-statistic smaller.
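For nested nls fits, anova() reports exactly this kind of F-test (the model names are placeholders):

anova(fit_simple, fit_full)  # F statistic, dof, and p-value; models must be nested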
Finally, having considered the items above, you can consider the residual standard error. Generally, all other things being equal, smaller residual standard error is better. Trouble is, all other things are never equal. This is why I would recommend looking at RSE last.