Regression for a Rate variable in R - r

I was tasked with developing a regression model looking at student enrollment in different programs. This is a very nice, clean data set where the enrollment counts follow a Poisson distribution well. I fit a model in R (using both a GLM and a zero-inflated Poisson model). The resulting residuals seemed reasonable.
However, I was then instructed to change the count of students to a "rate", calculated as students / school_population (each school has its own population). This is no longer a count variable, but a proportion between 0 and 1. It is considered the "proportion of enrollment" in a program.
This "rate" (students/population) is no longer Poisson, but is certainly not normal either. So, I'm a bit lost as to the appropriate distribution, and subsequent model to represent it.
A log-normal distribution seems to fit this rate well; however, I have many 0 values, so it can't actually be fitted.
Any suggestions on the best form of distribution for this new parameter, and how to model it in R?
Thanks!

As suggested in the comments, you could keep the Poisson model and use an offset:
glm(response~predictor1+predictor2+predictor3+ ... + offset(log(population)),
family=poisson,data=...)
Or you could use a binomial GLM, either
glm(cbind(response,pop_size-response) ~ predictor1 + ... , family=binomial,
data=...)
or
glm(response/pop_size ~ predictor1 + ... , family=binomial,
weights=pop_size,
data=...)
The latter form is sometimes more convenient, although less widely used.
Be aware that in general switching from Poisson to binomial will change the
link function from log to logit, although you can use family=binomial(link="log") if you prefer.
Zero-inflation might be easier to model with the Poisson + offset combination (I'm not sure if the pscl package, the most common approach to ZIP, handles offsets, but I think it does), which will be more commonly available than a zero-inflated binomial model.
I think glmmADMB will do a zero-inflated binomial model, but I haven't tested it.
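As a rough sketch of the options above, here they are on simulated data. The data frame `schools` and its columns `students`, `population` and `x` are made up for illustration and are not taken from the question:

```r
set.seed(1)
n <- 200
schools <- data.frame(
  population = sample(200:2000, n, replace = TRUE),  # hypothetical school sizes
  x = rnorm(n)                                       # hypothetical predictor
)
# Simulate enrollment counts with an enrollment rate of roughly 5%
schools$students <- rbinom(n, size = schools$population,
                           prob = plogis(-3 + 0.4 * schools$x))

# Poisson model for the rate: log(population) enters as an offset
fit_pois <- glm(students ~ x + offset(log(population)),
                family = poisson, data = schools)

# Binomial model, two equivalent specifications
fit_bin1 <- glm(cbind(students, population - students) ~ x,
                family = binomial, data = schools)
fit_bin2 <- glm(students / population ~ x, weights = population,
                family = binomial, data = schools)

coef(fit_pois)   # log link: coefficients are log rate ratios
coef(fit_bin1)   # logit link: coefficients are log odds ratios
coef(fit_bin2)   # same estimates as fit_bin1
```

When the rates are small, as here, the Poisson and binomial coefficients tend to be close, because the log and logit links nearly coincide for small probabilities.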

Related

How to use binomial family for non-integers in GLMMTMB?

Here's my model:
revisitsm0 <- glmmTMB(cbind(revisits_per_bout, tot_visits_bout - revisits_per_bout) ~ experiment_type * foraging_bout + (1|colony/bee_id), data=table_training, family=binomial)
My model doesn't fit so well because of dispersion, so I square-rooted my variables "revisits_per_bout" and "tot_visits_bout", hence giving me non-integers.
Since quasibinomial is not available in GLMMTMB, how can I fix this?
Thanks!
This really belongs on CrossValidated, because it is a question of "what to do" more than "how to do it".
family = "betabinomial" should be the simplest way to handle overdispersion (I would not recommend transforming the response variable. It is a good rule of thumb that if you have count-type (non-negative integer) responses, it's best to model them on the original scale).
if overdispersion seems to be associated with particular predictors (e.g. dispersion is different for different experiment types, even after taking the mean-variance relationship of the beta-binomial into account; you can use a scale-location plot, sqrt(abs(pearson_resids)) vs. the predictor of interest, to assess this), you can use family = "betabinomial" plus a dispformula = argument
you could also use an observation-level random effect
you can also do a post hoc conversion of a binomial fit to a quasi-binomial fit following the recipe in the GLMM FAQ 'fitting models with overdispersion' section (this link has more general advice/info on methods for dealing with overdispersion)
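The post hoc quasi-binomial idea can be sketched with a plain glm(). This is a simplification: there are no random effects here, unlike the glmmTMB model in the question, and the data and the names `dat`, `succ`, `tot` and `grp` are invented for illustration. The Pearson-based dispersion estimate is used to inflate the standard errors of the binomial fit:

```r
set.seed(42)
n <- 120
dat <- data.frame(
  grp = factor(rep(c("a", "b"), each = n / 2)),  # hypothetical treatment
  tot = sample(10:30, n, replace = TRUE)         # trials per observation
)
# Overdispersed successes: per-observation probabilities are beta-distributed
p <- rbeta(n, shape1 = ifelse(dat$grp == "a", 2, 3), shape2 = 4)
dat$succ <- rbinom(n, size = dat$tot, prob = p)

fit <- glm(cbind(succ, tot - succ) ~ grp, family = binomial, data = dat)

# Dispersion estimate: Pearson chi-square / residual degrees of freedom
phi <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)

# Quasi-binomial standard errors = binomial standard errors * sqrt(phi)
se_binom <- sqrt(diag(vcov(fit)))
se_quasi <- se_binom * sqrt(phi)

# Cross-check against glm's own quasibinomial family
fit_q <- glm(cbind(succ, tot - succ) ~ grp, family = quasibinomial, data = dat)
cbind(se_binom, se_quasi, se_glm_quasi = sqrt(diag(vcov(fit_q))))
```

A value of `phi` well above 1 points to overdispersion. For a mixed model, the GLMM FAQ recipe mentioned above is the place to look; this sketch only shows the scaling idea.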

Relationship between logistic regression and linear regression

I've encountered a problem where I need to analyze the relationship between a movie's length, a movie's price and its sales on a video streaming platform. Now I have two choices to quantify sales as my dependent variable:
whether or not a user ended up buying the movie
selling rate (# of people buying the movie / # of people who watched the trailer)
If I use the selling rate, I would essentially use a linear regression where I have
selling rate= beta_0 + beta_1*length + beta_2*price + beta_3*length*price
But if I'm asked to use option 1 where my response is a binary output, and I assume I need to switch to logistic regression, how would the standard error change? Will the standard error be an underestimate?
Your SE will be on a different scale, but if you have a large effect with the continuous outcome there is a solid chance that you will get the same inferences with the binary logistic. The logistic is "throwing away" nearly all the variability in the responses, so it has relatively low power. As SweetSpot said, you should treat this as a glm problem because of the restrictions on the range of the outcome. That is, you don't want a model that can give you negative counts/rates. The variance estimates also need care. Consider using glm with family = binomial for the yes/no sold outcome and family = poisson for the count/rate. The UCLA web pages for logistic, Poisson and negative binomial regression are a great place to start. Probably the best book for people who want clean writing without proofs is Agresti's Introduction to Categorical Data Analysis.
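One way to see how the two framings relate: a binomial GLM fitted to per-movie counts (buyers out of trailer viewers) gives the same coefficients and standard errors as a logistic regression fitted to the equivalent one-row-per-viewer data. A small sketch on made-up data follows; the `movies` data frame, its columns and the simulated effect sizes are all hypothetical:

```r
set.seed(7)
m <- 40
movies <- data.frame(
  length  = round(runif(m, 80, 180)),   # minutes (hypothetical)
  price   = round(runif(m, 2, 15), 2),  # purchase price (hypothetical)
  viewers = sample(50:300, m, replace = TRUE)
)
eta <- -1 + 0.005 * movies$length - 0.15 * movies$price
movies$buyers <- rbinom(m, size = movies$viewers, prob = plogis(eta))

# Aggregated form: buyers out of viewers for each movie
fit_rate <- glm(cbind(buyers, viewers - buyers) ~ length * price,
                family = binomial, data = movies)

# Individual form: one row per viewer with a 0/1 "bought" outcome
idx  <- rep(seq_len(m), times = movies$viewers)
long <- movies[idx, c("length", "price")]
long$bought <- unlist(Map(function(b, v) rep(c(1, 0), c(b, v - b)),
                          movies$buyers, movies$viewers))
fit_binary <- glm(bought ~ length * price, family = binomial, data = long)

# Estimates and standard errors side by side
cbind(rate = coef(fit_rate), binary = coef(fit_binary))
cbind(rate = sqrt(diag(vcov(fit_rate))), binary = sqrt(diag(vcov(fit_binary))))
```

The agreement holds only when the predictors are movie-level, as in this sketch. An ordinary linear regression on the raw selling rate ignores the number of viewers behind each rate, which is one reason its standard errors can differ.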

Poisson Regression with overload of zeroes SAS

I am testing different models to find the best fit and the most robust statistics for my data. My dataset contains over 50000 observations, of which approx. 99.3% are zeroes, so only about 0.7% are actual events.
See also: https://imgur.com/a/CUuTlSK
I search to find the best fit of the following models; Logistic, Poisson, NB, ZIP, ZINB, PLH, NBLH. (NB: Negative-binomial, ZI: Zero-Inflated, P: Poisson, LH: Logit Hurdle)
The first way I tried doing this was by estimating the binary response with logistic regression.
My questions: Can I use Poisson on the binary variable, or should I instead replace the binary variable with integer values, for instance the associated loss (if y=1 then y_val=y*loss)? In my case, the variance of y_val becomes approx. 2.5E9. I have kept the binary variable because, for this purpose, it does not matter how much the company defaulted with; a default is a default regardless of the amount.
With both logistic regression and Poisson, I got terrible statistics: a very high deviance value (and a p-value of 0), terrible estimates (many of the estimated parameters are 0, i.e. odds ratio = 1), very low confidence intervals; everything seems to be 'wrong'. If I transform the response variable to log(y_val) for y>1 in Poisson, the statistics seem to get better; however, this violates the Poisson assumption of an integer count response.
I briefly have tested the ZINB, it does not change the statistics significantly (=it does not help at all in this case).
Is there a proper way of dealing with such a dataset? I am interested in achieving the best fit for my data (about startup businesses and their default status).
The data are cleaned and ready to be fitted. Is there anything I should be aware of that I haven't mentioned?
I use the genmod procedure in SAS with dist=Poisson, zinb, zip etc.
Thanks in advance.
Sorry, my rep is too low to comment, so it has to be an answer.
You should consider an undersampling technique before using any regression/model, because your target rate is below 5%, which makes it extremely difficult to predict.
Undersampling is a method of cutting out non-target events in order to increase the target ratio. I really recommend considering it; I used it once in my practice and it seemed pretty helpful.
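A minimal sketch of undersampling in R, on simulated data with roughly 0.7% events. The data frame `d`, the columns `default` and `x`, and the 10% retention rate are placeholders, not values from the question. Non-events are sampled at a known rate. Because that shifts the intercept of a logistic model, the sketch also adds log(rate) back to the intercept so that predicted probabilities refer to the original population:

```r
set.seed(123)
n <- 50000
d <- data.frame(x = rnorm(n))
d$default <- rbinom(n, size = 1, prob = plogis(-5.2 + 0.8 * d$x))

keep_rate  <- 0.10                      # fraction of non-events to keep (arbitrary)
events     <- which(d$default == 1)
non_events <- which(d$default == 0)
kept_non   <- sample(non_events, size = round(keep_rate * length(non_events)))
d_under    <- d[c(events, kept_non), ]  # all events plus a sample of non-events

mean(d$default)        # event share before undersampling
mean(d_under$default)  # event share after undersampling

fit_under <- glm(default ~ x, family = binomial, data = d_under)

# Intercept correction for the known sampling rate of non-events
b <- coef(fit_under)
b["(Intercept)"] <- b["(Intercept)"] + log(keep_rate)
b
```

The slope is estimated from fewer rows, so undersampling mainly saves computation and does not add information about the rare events.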

Get the AIC or BIC criterion from gamm, gam, and lme models: How in mgcv? And how can I trust the result? [closed]

I am new to Gamms and gams, so this question may be a bit basic, I'd appreciate your help on this very much:
I am using the following code:
M.gamm <- gamm (bsigsi ~ s(summodpa, sed,k= 1, fx= TRUE, bs="tp") + s(sumlightpa, sed, k=1, fx= TRUE, bs="tp") , random = list(school=~ 1) , method= "ML", na.action= na.omit, data= Pilot_fitbit2)
The code runs, but gives me this feedback:
Warning messages: 1: In smooth.construct.tp.smooth.spec(object,
dk$data, dk$knots) : basis dimension, k, increased to minimum
possible
2: In smooth.construct.tp.smooth.spec(object, dk$data, dk$knots) :
basis dimension, k, increased to minimum possible
Questions:
My major question is however how I can get an AIC or BIC from this?
I've tried BIC(M.gamm$gam) and BIC(M.gamm$lme), since gamm consists of both parts (lme and gam). For the latter one (with lme) I do get a value, but for the first one I don't. Does anyone know why, and how I can get one?
The issue is that I would like to compare this value to the BIC value of a gam model, and I am not sure which one (BIC(M.gamm$lme) or BIC(M.gamm$gam)) would be the correct one. It is possible for me to derive a BIC and AIC for the gam and lme model.
If I were able to get the AIC or BIC for the gamm model, how can I know I can trust the results? What do I need to be careful with so that I interpret the result correctly? Currently, I am using ML in all models and also use the same package (mgcv) to estimate lme, gam, and gamm to establish comparability.
Any help/ advice or ideas on this would be greatly appreciated!!
Best wishes,
Noemi
Thank you very much for this!
These warnings come as a result of requesting a smoother basis of a single function for each of your two smooths; this doesn't make any sense, as both such bases would only contain the equivalent of constant functions, both of which are unidentifiable given that you have another constant term (the intercept) in your model. Once mgcv applied identifiability constraints, the two smooths would get dropped entirely from the model.
Hence the warnings; mgcv didn't do what you wanted. Instead it set k to the smallest value possible. Set k to something larger; you might as well leave it at the default and not specify it in the s() if you want a low rank smooth. Also, unless you really want an unpenalized spline fit, don't use fx = TRUE.
I'm not really familiar with any theory for BIC applied to GAM(M)s that corrects for smoothness selection. The AIC method for gam() models estimated using REML smoothness selection does have some theory behind it, including a recent paper by Simon Wood and colleagues.
The mgcv FAQ has the following two things to say
How can I compare gamm models? In the identity link normal errors case, then AIC and hypothesis testing based methods are fine. Otherwise it is best to work out a strategy based on the summary.gam. Alternatively, simple random effects can be fitted with gam, which makes comparison straightforward. Package gamm4 is an alternative, which allows AIC type model selection for generalized models.
When using gamm or gamm4, the reported AIC is different for the gam object and the lme or lmer object. Why is this? There are several reasons for this. The most important is that the models being used are actually different in the two representations. When treating the GAM as a mixed model, you are implicitly assuming that if you gathered a replicate dataset, the smooths in your model would look completely different to the smooths from the original model, except for having the same degree of smoothness. Technically you would expect the smooths to be drawn afresh from their distribution under the random effects model. When viewing the gam from the usual penalized regression perspective, you would expect smooths to look broadly similar under replication of the data. i.e. you are really using Bayesian model for the smooths, rather than a random effects model (it's just that the frequentist random effects and Bayesian computations happen to coincide for computing the estimates). As a result of the different assumptions about the data generating process, AIC model comparisons can give rather different answers depending on the model adopted. Which you use should depend on which model you really think is appropriate. In addition the computations of the AICs are different. The mixed model AIC uses the marginal likelihood and the corresponding number of model parameters. The gam model uses the penalized likelihood and the effective degrees of freedom.
So, I'd probably stick to AIC, not use BIC. I'd be thinking about which interpretation of the GAM(M) I was interested most in. I'd also likely fit the random effects you have here using gam() if they are this simple. An equivalent model would include + s(school, bs = 're') in the main formula and exclude the random bit whilst using gam()
gam(bsigsi ~ s(summodpa, sed) + s(sumlightpa, sed) +
s(school, bs = 're'), data = Pilot_fitbit2,
method = 'REML')
Do be careful with 2D isotropic smooths; sed, summodpa and sumlightpa need to be in the same units and have the same degree of wiggliness in each smooth. If these aren't in the same units or have different wigglinesses, use te() instead of s() for the 2D terms.
Also be careful with variables that appear in two or more smooths like this; mgcv will do its best to make the models identifiable, but you can easily get into computational problems even so. A better modelling approach would be to estimate the marginal effects of sed and the other terms plus their 2nd order interactions by decomposing the effects in the two 2D smooths as follows:
gam(bsigsi ~ s(sed) + s(summodpa) + s(sumlightpa) +
ti(summodpa, sed) + ti(sumlightpa, sed) +
s(school, bs = 're'), data = Pilot_fitbit2,
method = 'REML')
where the ti() smooths are tensor product interaction bases, where the main effects of the two marginal variables have been removed from the basis. Hence you can treat them as pure smooth interaction terms. In this way, the main effect of sed is contained in a single smooth term.
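A sketch of how the AIC comparison might look with gam(), on simulated data standing in for Pilot_fitbit2. The data, the effect sizes and the object names `m_iso` and `m_ti` are invented, and mgcv is assumed to be installed:

```r
library(mgcv)

set.seed(2)
n <- 300
dat <- data.frame(
  sed        = runif(n),
  summodpa   = runif(n),
  sumlightpa = runif(n),
  school     = factor(sample(1:10, n, replace = TRUE))
)
school_eff <- rnorm(10, sd = 0.5)   # simulated school-level intercepts
dat$bsigsi <- sin(2 * pi * dat$sed) + dat$summodpa * dat$sed +
  school_eff[as.integer(dat$school)] + rnorm(n, sd = 0.3)

# Random intercept for school written as an "re" smooth; k left at its default
m_iso <- gam(bsigsi ~ s(summodpa, sed) + s(sumlightpa, sed) +
               s(school, bs = "re"),
             data = dat, method = "REML")

# Main effects plus tensor product interactions
m_ti <- gam(bsigsi ~ s(sed) + s(summodpa) + s(sumlightpa) +
              ti(summodpa, sed) + ti(sumlightpa, sed) +
              s(school, bs = "re"),
            data = dat, method = "REML")

# Both models are gam() fits to the same rows, so the AICs are on the same footing
AIC(m_iso, m_ti)
```

Because both fits come from gam(), the question of whether to read the AIC from the $gam or the $lme component of a gamm object does not arise.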

Comparing nonlinear regression models

I want to compare the curve fits of three models by r-squared values. I ran models using the nls and drc packages. It appears, though, that neither of those packages calculates r-squared values; they give "residual std error" and "residual sum of squares" instead.
Can these two be used to compare model fits?
This is really a statistics question, rather than a coding question: consider posting on stats.stackexchange.com; you're likely to get a better answer.
RSQ is not really meaningful for non-linear regression. This is why summary.nls(...) does not provide it. See this post for an explanation.
There is a common, and understandable, tendency to hope for a single statistic that allows one to assess which of a set of models better fits a dataset. Unfortunately, it doesn't work that way. Here are some things to consider.
Generally, the best model is the one that has a mechanistic underpinning. Do your models reflect some physical process, or are you just trying a bunch of mathematical equations and hoping for the best? The former approach almost always leads to better models.
You should consider how the models will be used. Will you be interpolating (e.g. estimating y|x within the range of your dataset), or will you be extrapolating (estimating y|x outside the range of your data)? Some models yield a fit that provides relatively accurate estimates slightly outside the dataset range, and others completely fall apart.
Sometimes the appropriate modeling technique is suggested by the type of data you have. For example, if you have data that counts something, then y is likely to be poisson distributed and a generalized linear model (glm) in the poisson family is indicated. If your data is binary (e.g. only two possible outcomes, success or failure), then a binomial glm is indicated (so-called logistic regression).
The key underlying assumption of least squares techniques is that the error in y is normally distributed with mean 0 and constant variance. We can test this after doing the fit by looking at a plot of standardized residuals vs. y, and by looking at a Normal Q-Q plot of the residuals. If the residuals plot shows scatter increasing or decreasing with y, then the model is not a good one. If the Normal Q-Q plot is not close to a straight line, then the residuals are not normally distributed and probably a different model is indicated.
Sometimes certain data points have high leverage with a given model, meaning that the fit is unduly influenced by those points. If this is a problem you will see it in a leverage plot. This indicates a weak model.
For a given model, it may be the case that not all of the parameters are significantly different from 0 (e.g., p-value of the coefficient > 0.05). If this is the case, you need to explore the model without those parameters. With nls, this often implies a completely different model.
Assuming that your model passes the tests above, it is reasonable to look at the F-statistic for the fit. This is essentially the ratio of SSR/SSE corrected for the dof in the regression (R) and the residuals (E). A model with more parameters will generally have smaller residual SS, but that does not make it a better model. The F-statistic accounts for this in that models with more parameters will have larger regression dof and smaller residual dof, making the F-statistic smaller.
Finally, having considered the items above, you can consider the residual standard error. Generally, all other things being equal, smaller residual standard error is better. Trouble is, all other things are never equal. This is why I would recommend looking at RSE last.
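To make the last steps concrete, here is a small sketch with nls() on simulated data. The two candidate models, the starting values and all object names are chosen for illustration only:

```r
set.seed(10)
x <- seq(0.5, 10, by = 0.5)
y <- 10 * x / (2 + x) + rnorm(length(x), sd = 0.3)  # simulated saturating curve
d <- data.frame(x, y)

# Two candidate models (starting values picked by eye for this simulation)
fit_mm  <- nls(y ~ Vm * x / (K + x), data = d, start = list(Vm = 8, K = 1))
fit_exp <- nls(y ~ a * (1 - exp(-b * x)), data = d, start = list(a = 8, b = 0.5))

# Residual checks come first: scatter vs. fitted values and a Normal Q-Q plot
# plot(fitted(fit_mm), resid(fit_mm)); qqnorm(resid(fit_mm)); qqline(resid(fit_mm))

# Parameter estimates with standard errors and p-values
summary(fit_mm)$coefficients

# Residual sum of squares and residual standard error for each model
data.frame(
  model = c("Michaelis-Menten", "exponential rise"),
  RSS   = c(deviance(fit_mm), deviance(fit_exp)),
  RSE   = c(summary(fit_mm)$sigma, summary(fit_exp)$sigma)
)
```

Both models here have two parameters, so their residual degrees of freedom match and the RSE values can be compared directly. With different numbers of parameters, the caveats above about degrees of freedom apply.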

Resources