Relationship between logistic regression and linear regression - r

I've encountered a problem where I need to analyze the relationship between a movie's length, its price, and its sales on a video streaming platform. I have two choices for quantifying sales as my dependent variable:
1. whether or not a user ended up buying the movie
2. the selling rate (# of people buying the movie / # of people who watched the trailer)
If I use the selling rate, I would essentially use a linear regression:
selling_rate = beta_0 + beta_1*length + beta_2*price + beta_3*length*price
But if I'm asked to use option 1, where my response is a binary outcome, I assume I need to switch to logistic regression. How would the standard errors change? Would they be underestimates?

Your standard errors will be on a different scale, but if you have a large effect with the continuous outcome there is a solid chance you will get the same inferences from the binary logistic model. The logistic model is "throwing away" nearly all the variability in the responses, so it has relatively low power. As SweetSpot said, you should treat this as a GLM problem because of the restriction on the range of the outcome: you don't want a model that can give you negative counts or rates, and the variance estimates need care. Consider using glm with family = binomial for the yes/no sold outcome and family = poisson for the count/rate. The UCLA web pages for logistic, Poisson, and negative binomial regression are a great place to start. Probably the best book for people who want clean writing without proofs is Agresti's Introduction to Categorical Data Analysis.
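In code, something along these lines (untested sketch; the data frame movies and its columns are placeholders for your own data):

# Binary outcome: did this user buy the movie?
fit_binary <- glm(bought ~ length * price, family = binomial, data = movies)

# Count of buyers, with trailer views as exposure, for a rate interpretation
fit_rate <- glm(n_bought ~ length * price + offset(log(n_trailer_views)),
                family = poisson, data = movies)

summary(fit_binary)  # standard errors are on the log-odds scale
summary(fit_rate)    # standard errors are on the log-rate scale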

Related

Comparing mixed models: singular fit for random effect group but only in some models. Should I drop RE group from all models or just where singular?

I am fitting several mixed models using the lmer function in the package lme4, each with the same fixed effects and random effects, but different response variables. The purpose of these models is to determine how environmental conditions influence different fruit and seed traits in a particular tree species. I want to know which traits respond most strongly to which environmental variables, and how well the variation in each trait is captured by each model overall.
The data have been collected from several sites, and several trees within each site.
Response variables: measures of fruits and seeds, e.g. fresh mass, dry mass, volume
Fixed effects: temperature, rainfall, soil nitrogen, soil phosphorus
Random effects: Sites, trees
Example of model notation I have been using:
lmer(fruit.mass ~ temperature + rainfall + soil.N + soil.P +
       (1 | site/tree), data = fruit)
My problem: some of the models run fine with no detectable issues; however, some produce a singular fit where the estimated variance for 'site' is 0.
I know there is considerable debate around dealing with singular fits, but one approach is to drop site and keep the tree-level random effect. The models run fine after this.
My question: should I then drop the site random effect from the models which weren't singular if I want to compare these models in any way? If so, are there certain methods for comparing model performance better suited to this situation?
If this is covered in publications or discussion threads then any links would be much appreciated.
When the model converges to a singular fit, it indicates that the random structure is overfitted. Therefore I would argue that it does not make sense to compare these. I would also be concerned about multiple testing in this situation.
I would drop the models that give "boundary is singular" warnings. The rest are fine and you can compare among them. Of note: it is best to specify REML = FALSE when comparing models (I can give references if need be). Once the "best" model is selected you can run it normally (i.e. with REML). I would recommend using conditional AIC (cAIC4 package), for example anocAIC(model1, model2, ...). Another good option is the performance package; it has many functions, such as model_performance and check_model.
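Something like this, as a rough sketch (untested; fruit.volume is just a second illustrative response, and I've written the soil variables as soil.N / soil.P):

library(lme4)
library(cAIC4)
library(performance)

# Refit the candidate models with ML (REML = FALSE) for comparison
m_mass <- lmer(fruit.mass ~ temperature + rainfall + soil.N + soil.P +
                 (1 | site/tree), data = fruit, REML = FALSE)
m_vol  <- lmer(fruit.volume ~ temperature + rainfall + soil.N + soil.P +
                 (1 | site/tree), data = fruit, REML = FALSE)

anocAIC(m_mass, m_vol)      # conditional AIC comparison (cAIC4)
model_performance(m_mass)   # R2, ICC, AIC, RMSE, etc. (performance)
check_model(m_mass)         # diagnostic plots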

Calculating importance of independent variable in explaining variance of dependent variable in linear regression

I am working on a Media Mix Modeling (MMM) project where I have to build a linear model for predicting traffic, with various spends as input variables. The fitted linear model equation is:
Traffic = 1918 + 0.08*TV_Spend + 0.01*Print_Spend + 0.05*Display_spend
I want to calculate two things which I don't know how to do:
1. How much does each variable contribute to explaining the variance of traffic?
2. What percentage of total traffic is due to each independent variable?
I think this question has already been answered several times in several places (a duplicate?). For example, see:
https://stats.stackexchange.com/questions/79399/calculate-variance-explained-by-each-predictor-in-multiple-regression-using-r
You may also want to compute standardized regression coefficients (first standardize the variables, then rerun the regression analysis) to find out which independent variable has the largest effect on the dependent variable (if significant, I would like to add). I think standardized regression weights are more intuitive to interpret than the explained variance.
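Roughly like this (untested sketch; the data frame name mmm is a placeholder, and the columns follow your fitted equation):

# Standardize the variables, then rerun the regression
mmm_std <- as.data.frame(scale(mmm[, c("Traffic", "TV_Spend",
                                       "Print_Spend", "Display_spend")]))
fit_std <- lm(Traffic ~ TV_Spend + Print_Spend + Display_spend, data = mmm_std)
summary(fit_std)  # coefficients are now standardized (beta) weights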
Cheers,
Peter

Using offset in GAM zero inflated poisson (ziP) model

I am trying to model count data of birds in forest fragments of different sizes. As the plots in which the surveys were conducted also differ in size among fragments, I would like to add survey plot size as an offset term to convert counts to densities.
As I understand from previous questions on this site, this is generally done for Poisson models, as these have a log link. The GAM (mgcv package) I am running with family ziP has link="identity". As far as I understand, in such cases the offset term will be subtracted from the response, rather than giving the desired response/offset rate.
However, when I run the model with the offset term and plot the results it seems to be giving the result I want (I compared the plot for a poisson model with the ziP model).
This is the model I used, whereby Guild reflects different feeding guilds, logArea is the log of fragment size and Study is my random effect (data come from several studies).
gam1 <- gam(Count ~ Guild + s(logArea, by = Guild) + s(Study, bs = "re"),
            offset = lnTotalPlotsize, family = ziP(), data = Data_ommited2)
Can someone explain how GAM handles offset terms in this case (ziP model with identity link)? Is it really resulting in the desired response/offset rate or is it doing something else?
Thanks for your help!
Regards,
Robert
Whilst only the identity link is provided, the linear predictor returns the log of the expected count. As such, the linear predictor is on a log scale and your use of the offset is fine.
Basically, the model is parameterized for the log of the response, not the response itself, hence the identity link is used. This is the same as for the ziplss() family.
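If you want to avoid any ambiguity, a sketch of the alternative is to move the offset into the formula; as far as I recall, an offset in the formula is honoured by predict(), whereas the offset= argument is ignored at prediction time (worth double-checking for your version of mgcv):

library(mgcv)

gam2 <- gam(Count ~ Guild + s(logArea, by = Guild) + s(Study, bs = "re") +
              offset(lnTotalPlotsize),
            family = ziP(), data = Data_ommited2)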

Backward Elimination for Cox Regression

I want to explore the following variables and their 2-way interactions as possible predictors: the number of siblings (nsibs), weaning age (wmonth), maternal age (mthage), race, poverty, birthweight (bweight) and maternal smoking (smoke).
I created my Cox regression formula but I don't know how to form the 2-way interaction with the predictors:
coxph(Surv(wmonth, chldage1) ~ as.factor(nsibs) + mthage + race + poverty +
        bweight + smoke, data = pneumon)
final <- step(coxph(Surv(wmonth, chldage1) ~ (as.factor(nsibs) + mthage + race +
                      poverty + bweight + smoke)^2, data = pneumon),
              direction = "backward")
The formula interface for coxph is the same as for lm or glm. If you need to form all the two-way interactions, use the ^ operator with the "sum" of the covariates as its first argument and 2 as its second:
coxph(Surv(wmonth, chldage1) ~
        (as.factor(nsibs) + mthage + race + poverty + bweight + smoke)^2,
      data = pneumon)
I do not think there is a dedicated step-down function for Cox regression. Therneau has spoken out in the past against making the process easy to automate. As Roland notes in his comment, the prevailing opinion among the R Core and package authors is that stepwise procedures are statistically suspect. (This often creates some culture shock when people cross over to R from SPSS or SAS, where the culture is more accepting of stepwise procedures and where social-science stats courses seem to endorse the method.)
First off, you need to address the question of whether your data have enough events to support such a complex model. The statistical power of Cox models is driven by the number of events, not the number of subjects at risk. An admittedly imperfect rule of thumb is that you need 10-15 events for each covariate, and by expanding the interactions perhaps 10-fold you expand the required number of events by a similar factor.
Harrell has discussed such matters in his RMS book and rms-package documentation and advocates applying shrinkage to the covariate estimates in the process of any selection method. That would be a more statistically principled route to follow.
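As a rough sketch of that route (untested; the formula just mirrors your coxph() call and the penalty grid is arbitrary):

library(rms)

f <- cph(Surv(wmonth, chldage1) ~ (as.factor(nsibs) + mthage + race + poverty +
           bweight + smoke)^2,
         data = pneumon, x = TRUE, y = TRUE)
p <- pentrace(f, penalty = seq(0, 20, by = 2))  # scan a grid of penalties
f_pen <- update(f, penalty = p$penalty)         # refit with the selected penalty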
If you do have such a large dataset and there is no theory in your domain of investigation about which covariate interactions are more likely to be important, an alternative would be to examine the full interaction model and then proceed with the perspective that each modification of your model adds to the number of degrees of freedom for the overall process. I have faced such a situation in the past (thousands of events, millions at risk), and my approach was to keep the interactions that met a more stringent statistical criterion. I restricted this approach to groups of variables that were considered related, and examined them first for their 2-way correlations. With no categorical variables in my model except smoking and gender, and 5 continuous covariates, I kept 2-way interactions that had delta-deviance (distributed as chi-square statistics) measures of 30 or more. I was thereby retaining interactions that "achieved significance" where the implicit degrees of freedom were much higher than the naive software listings. I also compared the results for the retained covariate interactions with and without the removed interactions to make sure that the process had not meaningfully shifted the magnitudes of the predicted effects. I also used Harrell's rms package's validation and calibration procedures.
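For the delta-deviance screening, one such comparison could look like this (the particular interaction is arbitrary, just to show the mechanics):

library(survival)

fit_full    <- coxph(Surv(wmonth, chldage1) ~ mthage * bweight + race + poverty + smoke,
                     data = pneumon)
fit_reduced <- update(fit_full, . ~ . - mthage:bweight)
anova(fit_reduced, fit_full)  # the Chisq column is the delta-deviance for that interaction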

Regression for a Rate variable in R

I was tasked with developing a regression model looking at student enrollment in different programs. This is a very nice, clean data set where the enrollment counts follow a Poisson distribution well. I fit a model in R (using both a GLM and a zero-inflated Poisson model), and the resulting residuals seemed reasonable.
However, I was then instructed to change the count of students to a "rate", calculated as students / school_population (each school has its own population). This is no longer a count variable, but a proportion between 0 and 1, considered the "proportion of enrollment" in a program.
This "rate" (students/population) is no longer Poisson, but is certainly not normal either. So, I'm a bit lost as to the appropriate distribution, and subsequent model to represent it.
A log-normal distribution seems to fit this rate well; however, I have many 0 values, so it won't actually fit.
Any suggestions on the best form of distribution for this new parameter, and how to model it in R?
Thanks!
As suggested in the comments you could keep the Poisson model and do it with an offset:
glm(response ~ predictor1 + predictor2 + predictor3 + ... + offset(log(population)),
    family = poisson, data = ...)
Or you could use a binomial GLM, either
glm(cbind(response,pop_size-response) ~ predictor1 + ... , family=binomial,
data=...)
or
glm(response/pop_size ~ predictor1 + ... , family=binomial,
weights=pop_size,
data=...)
The latter form is sometimes more convenient, although less widely used.
Be aware that in general switching from Poisson to binomial will change the link function from log to logit, although you can use family=binomial(link="log") if you prefer.
Zero-inflation might be easier to model with the Poisson + offset combination (I'm not sure if the pscl package, the most common approach to ZIP, handles offsets, but I think it does), which will be more commonly available than a zero-inflated binomial model.
I think glmmADMB will do a zero-inflated binomial model, but I haven't tested it.
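As a rough sketch of the zero-inflated Poisson + offset route with pscl (untested; as far as I know zeroinfl() accepts offset() in the count part of the formula, and the variable names are placeholders):

library(pscl)

fit_zip <- zeroinfl(students ~ program + offset(log(school_population)) | 1,
                    dist = "poisson", data = schools)
summary(fit_zip)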
