Poisson regression with an overload of zeroes in SAS (count data)

I am testing different models to find the best fit and the most robust statistics for my data. My dataset contains over 50,000 observations; approximately 99.3% of the data are zeroes, so only about 0.7% are actual events.
For reference, see: https://imgur.com/a/CUuTlSK
I am searching for the best fit among the following models: Logistic, Poisson, NB, ZIP, ZINB, PLH, NBLH (NB: negative binomial, ZI: zero-inflated, P: Poisson, LH: logit hurdle).
The first approach I tried was estimating the binary response with logistic regression.
My questions: Can I use Poisson regression on the binary variable, or should I instead replace the binary response with some integer value, for instance the associated loss: if y=1 then y_val = y*loss? In my case the variance of y_val becomes approx. 2.5E9. I have stuck with the binary variable because, for this purpose, it does not matter how much the company defaulted with; a default is a default regardless of the amount.
With both logistic regression and Poisson I got some terrible statistics: a very high deviance (with a p-value of 0), poor estimates (many of the estimated parameters are 0, so the odds ratio is 1), very narrow confidence intervals; everything seems to be 'wrong'. If I transform the response variable to log(y_val) for y>1 in the Poisson model, the statistics seem to improve - however, this goes against the assumption of an integer count response in Poisson regression.
I have briefly tested the ZINB; it does not change the statistics significantly (i.e. it does not help at all in this case).
Is there a proper way of dealing with such a dataset? I am interested in achieving the best fit for my data (on startup businesses and their default status).
The data are cleaned and ready to be fitted. Is there anything I should be aware of that I have not mentioned?
I use the GENMOD procedure in SAS with dist=poisson, zip, zinb, etc.
Thanks in advance.

Sorry, my rep is too low to comment, so it has to be an answer.
You should consider an undersampling technique before fitting any regression/model, because your target rate is below 5%, which makes it extremely difficult to predict.
Undersampling is a method of cutting out non-target events in order to increase the target ratio. I really recommend considering it; I have used it once in practice and it proved quite helpful.
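To make the idea concrete, here is a minimal R sketch of undersampling the non-events before fitting a logistic model; the data frame df, the response name default, and the 10:1 ratio are illustrative placeholders, not taken from the question.
# Minimal undersampling sketch; df, default and the 10:1 ratio are placeholders
set.seed(1)
events     <- df[df$default == 1, ]
non_events <- df[df$default == 0, ]

# Keep every event and a random sample of non-events (here 10 per event)
sampled <- non_events[sample(nrow(non_events), 10 * nrow(events)), ]
train   <- rbind(events, sampled)

fit <- glm(default ~ ., data = train, family = binomial)
summary(fit)

# Note: undersampling distorts the intercept/base rate, so correct or reweight
# before using predicted probabilities on the full population.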

Related

Coefficient value of covariate in Cox PH model is too big

I am trying to develop a Cox PH model with time-varying covariates in R. I use the coxph function from the survival package. There was no trouble during the estimation process, but the coefficient of one covariate is far too large, specifically 2.5e+32.
I can't work out the reason for this problem or how to tackle it. The variable is nonstationary, and the proportional hazards assumption is violated. Could either of these facts cause such a large coefficient?
More information would help in framing your problem.
Anyway, I doubt non-proportionality is to blame. It would imply that you have some outliers heavily biasing your coefficient beyond reasonable expectations. You could give this a quick look by plotting the output of cox.zph.
Another possible explanation is that this rather depends on the unit of measure you used to define your covariate. Can the magnitude of the coefficient be meaningfully interpreted? If so, you could simply re-scale/standardise/log-transform that covariate to obtain a 'more manageable' coefficient (if this is theoretically appropriate).
This could also be due to the so-called 'complete separation', which has been discussed here and here.
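A minimal R sketch of the first two checks above (the cox.zph diagnostic and rescaling the covariate); fit, mydata, time, status and x are placeholder names, not the poster's actual objects.
library(survival)

# Placeholder model; replace time, status and x with your own variables
fit <- coxph(Surv(time, status) ~ x, data = mydata)

# 1. Proportional-hazards check via scaled Schoenfeld residuals
zp <- cox.zph(fit)
print(zp)
plot(zp)

# 2. Rescale the suspect covariate and refit, to rule out a units problem
mydata$x_std <- as.numeric(scale(mydata$x))
fit_std <- coxph(Surv(time, status) ~ x_std, data = mydata)
summary(fit_std)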

How to interpret a VAR model without significant coefficients?

I am trying to investigate the relationship between some Google Trends Data and Stock Prices.
I performed the augmented Dickey-Fuller (ADF) test and the KPSS test to make sure that both time series are integrated of the same order (I(1)).
However, after I took the first differences, the ACF plot was completely insignificant (except at lag 0, of course), which told me that the differenced series behave like white noise.
Nevertheless I tried to estimate a VAR model which you can see attached.
As you can see, only one constant is significant. I have already read that because Stocks.ts.l1 is not significant in the equation for GoogleTrends and GoogleTrends.ts.l1 is not significant in the equation for Stocks, there is no dynamic relationship between the two time series, and each can also be modeled independently with an AR(p) model.
I checked the residuals of the model. They fulfill the assumptions (the residuals are not perfectly normally distributed but acceptable, there is homoscedasticity, the model is stable and there is no autocorrelation).
But what does it mean if no coefficient is significant, as in the Stocks.ts equation? Is the model simply inappropriate for the data because the data do not follow an AR process? Or is the model so poor that a constant would describe the data better than the model does? Or a combination of both? Any suggestions on how I could proceed with my analysis?
Thanks in advance
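For what it's worth, here is a minimal R sketch of the workflow described in the question; stocks and gtrends are placeholder ts objects, not the actual data.
library(tseries)  # adf.test, kpss.test
library(vars)     # VAR, serial.test, causality

# Unit-root / stationarity checks on the placeholder series
adf.test(stocks);  kpss.test(stocks)
adf.test(gtrends); kpss.test(gtrends)

# First differences to obtain I(0) series
y <- cbind(Stocks = diff(stocks), GoogleTrends = diff(gtrends))

var_fit <- VAR(y, p = 1, type = "const")
summary(var_fit)

serial.test(var_fit)                        # residual autocorrelation check
causality(var_fit, cause = "GoogleTrends")  # Granger causality test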

Running a GLM with a Gamma distribution, but data includes zeros

I'm trying to run a GLM in R for biomass data (reproductive biomass and the ratio of reproductive biomass to vegetative biomass) as a function of habitat type ("hab"), the year the data were collected ("year"), and the site of data collection ("site"). My data look like they would fit a Gamma distribution well, but I have 8 observations with zero biomass (out of ~800 observations), so the model won't run. What's the best way to deal with this? What other error distribution could I use? Or would adding a very small value (such as 0.0000001) to my zero observations be viable?
My model is:
reproductive_biomass <- glm(repro.biomass ~ hab * year + site, data = biom, family = Gamma(link = "log"))
Ah, zeroes - gotta love them.
Depending on the system you're studying, I'd be tempted to check out zero-inflated or hurdle models - the basic idea is that there are two components to the model: a binomial process deciding whether the response is zero or nonzero, and then a Gamma that works on the nonzeroes. The slick part is that you can then do inference on the coefficients of both components, and the two parts can even use different sets of predictors.
See http://seananderson.ca/2014/05/18/gamma-hurdle.html - but a search for "zero-inflated gamma" or "Tweedie models" might also yield something informative and/or scholarly.
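As a rough sketch of that two-part (hurdle) approach, reusing the variable names from the question (the nonzero indicator is added here, not part of the original data):
# Two-part (hurdle) sketch, reusing the names from the question
biom$nonzero <- as.numeric(biom$repro.biomass > 0)

# Part 1: binomial model for zero vs. nonzero biomass
m_zero <- glm(nonzero ~ hab * year + site, data = biom, family = binomial)

# Part 2: Gamma GLM fitted to the nonzero observations only
m_pos <- glm(repro.biomass ~ hab * year + site,
             data = subset(biom, repro.biomass > 0),
             family = Gamma(link = "log"))

summary(m_zero)
summary(m_pos)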
In an ideal world, your analytic tool should fit your system and your intended inferences. The zero-inflated world is pretty sweet, but is conditional on the assumption of separate processes. Thus an important question to answer, of course, is what zeroes "mean" in the context of your study, and only you can answer that - whether they're numbers that just happened to be really really small, or true zeroes that are the result of some confounding process like your coworker spilling the bleach (or something otherwise uninteresting to your study), or else true zeroes that ARE interesting.
Another thought: ask the same question over on crossvalidated, and you'll probably get an even more statistically informed answer. Good luck!

How do you correctly perform a glmmPQL on non-normal data?

I ran a model using glmer looking at the effect that Year and Treatment had on the number of points covered with wood, then plotted the residuals to check for normality; the resulting graph is slightly skewed to the right. Is this normally distributed?
model <- glmer(Number~Year*Treatment(1|Year/Treatment), data=data,family=poisson)
This site recommends using glmmPQL if your data is not normal: http://ase.tufts.edu/gsc/gradresources/guidetomixedmodelsinr/mixed%20model%20guide.html
library(MASS)
library(nlme)
model1 <- glmmPQL(Number ~ Year * Treatment, ~ 1 | Year/Treatment,
                  family = gaussian(link = "log"),
                  data = data, start = coef(lm(Log ~ Year * Treatment)),
                  na.action = na.pass, verbose = FALSE)
summary(model1)
plot(model1)
Now, do you transform the data in the Excel document or in the R code (Number1 <- log(Number)) before running this model? Does link = "log" imply that the data are already log-transformed, or that the model will transform them?
If you have data with zeros, is it acceptable to add 1 to all observations to make it more than zero in order to log transform it: Number1<-log(Number+1)?
Is fit<-anova(model,model1,test="Chisq") sufficient to compare both models?
Many thanks for any advice!
tl;dr your diagnostic plots look OK to me, you can probably proceed to interpret your results.
This formula:
Number~Year*Treatment+(1|Year/Treatment)
might not be quite right (besides the missing + between the terms above...). In general you shouldn't include the same term in both the random and the fixed effects (although there is one exception: if Year has more than a few values and there are multiple observations per year, you can include it as a continuous covariate in the fixed effects and as a grouping factor in the random effects - so this might be correct).
I'm not crazy about the linked introduction; at a quick skim there's nothing horribly wrong with it, but there seem to be a lot of minor inaccuracies and confusions. "Use glmmPQL if your data aren't Normal" is really shorthand for "you might want to use a GLMM if your data aren't Normal". Your glmer model should be fine.
Interpreting diagnostic plots is a bit of an art, but the degree of deviation that you show above doesn't look like a problem.
Since you don't need to log-transform your data, you don't need to get into the slightly messy issue of how to log-transform data containing zeros. In general, log(1+x) transformations for count data are reasonable - but, again, unnecessary here.
anova() in this context does a likelihood ratio test, which is a reasonable way to compare models.
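For completeness, a minimal sketch of the corrected glmer model together with an anova() comparison; the data frame and variable names are the poster's, but the reduced model without the interaction is only illustrative.
library(lme4)

# Corrected formula from the answer above
model_full <- glmer(Number ~ Year * Treatment + (1 | Year/Treatment),
                    data = data, family = poisson)

# Illustrative reduced model for a likelihood ratio test
model_main <- glmer(Number ~ Year + Treatment + (1 | Year/Treatment),
                    data = data, family = poisson)

anova(model_main, model_full)  # anova() performs a likelihood ratio test here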

Comparing nonlinear regression models

I want to compare the curve fits of three models by R-squared values. I ran the models using the nls and drc packages. It appears, though, that neither of those packages calculates R-squared values; they report "residual std error" and "residual sum of squares" instead.
Can these two be used to compare model fits?
This is really a statistics question, rather than a coding question: consider posting on stats.stackexchange.com; you're likely to get a better answer.
RSQ is not really meaningful for non-linear regression. This is why summary.nls(...) does not provide it. See this post for an explanation.
There is a common, and understandable, tendency to hope for a single statistic that allows one to assess which of a set of models better fits a dataset. Unfortunately, it doesn't work that way. Here are some things to consider.
Generally, the best model is the one that has a mechanistic underpinning. Do your models reflect some physical process, or are you just trying a bunch of mathematical equations and hoping for the best? The former approach almost always leads to better models.
You should consider how the models will be used. Will you be interpolating (e.g. estimating y|x within the range of your dataset), or will you be extrapolating (estimating y|x outside the range of your data)? Some models yield a fit that provides relatively accurate estimates slightly outside the dataset range, and others completely fall apart.
Sometimes the appropriate modeling technique is suggested by the type of data you have. For example, if you have data that count something, then y is likely to be Poisson-distributed and a generalized linear model (glm) in the Poisson family is indicated. If your data are binary (e.g. only two possible outcomes, success or failure), then a binomial glm is indicated (so-called logistic regression).
The key underlying assumption of least squares techniques is that the error in y is normally distributed with mean 0 and constant variance. We can test this after doing the fit by looking at a plot of standardized residuals vs. y, and by looking at a Normal Q-Q plot of the residuals. If the residuals plot shows scatter increasing or decreasing with y, then the model is not a good one. If the Normal Q-Q plot is not close to a straight line, then the residuals are not normally distributed and a different model is probably indicated.
Sometimes certain data points have high leverage with a given model, meaning that the fit is unduly influenced by those points. If this is a problem you will see it in a leverage plot. This indicates a weak model.
For a given model, it may be the case that not all of the parameters are significantly different from 0 (e.g., p-value of the coefficient > 0.05). If this is the case, you need to explore the model without those parameters. With nls, this often implies a completely different model.
Assuming that your model passes the tests above, it is reasonable to look at the F-statistic for the fit. This is essentially the ratio of SSR/SSE corrected for the dof in the regression (R) and the residuals (E). A model with more parameters will generally have smaller residual SS, but that does not make it a better model. The F-statistic accounts for this in that models with more parameters will have larger regression dof and smaller residual dof, making the F-statistic smaller.
Finally, having considered the items above, you can consider the residual standard error. Generally, all other things being equal, smaller residual standard error is better. Trouble is, all other things are never equal. This is why I would recommend looking at RSE last.
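To illustrate the residual checks and the final comparison by residual standard error, here is a small self-contained sketch with simulated data and two made-up nls models; nothing here comes from the poster's dataset.
# Simulated data and two illustrative nls fits
set.seed(1)
x <- seq(1, 10, length.out = 50)
y <- 5 * (1 - exp(-0.4 * x)) + rnorm(50, sd = 0.3)

fit1 <- nls(y ~ a * (1 - exp(-b * x)), start = list(a = 4, b = 0.5))
fit2 <- nls(y ~ Vm * x / (K + x),      start = list(Vm = 6, K = 2))

# Residual diagnostics: residuals vs. fitted and a Normal Q-Q plot
plot(fitted(fit1), residuals(fit1)); abline(h = 0, lty = 2)
qqnorm(residuals(fit1)); qqline(residuals(fit1))

# Parameter significance (p-values) and residual standard error for each fit
summary(fit1)
summary(fit2)

# Residual standard errors, compared last as suggested above
c(fit1 = summary(fit1)$sigma, fit2 = summary(fit2)$sigma)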
