How to adjust for an age effect in an ANOVA? (R)

I recently submitted a paper to a peer-reviewed journal. The aim is to assess whether the census blocks with the worst material deprivation index in a given region are those with the lowest access to emergency health services (in terms of travel time, taking into account the locations of the services and of the ambulances: accumulated time).
For this we did the following: after measuring the MDI (material deprivation index, a rather classical index) and obtaining the total journey times to the emergency services, we performed an ANOVA in R, with MDI as a categorical variable (6 levels) and travel time as the continuous variable.
The reviewer now says that we should check whether there is an effect of age. He argues that in some countries the more deprived patients are also the youngest. He adds that the effect of age could be taken into account in a linear regression (or an ordered logistic regression if normality is not assumed) with the proportion of people aged under 30 (for example) as an independent variable in addition to the MDI categories. Thus, we would obtain the beta coefficient for the effect of MDI on travel times adjusted for age (mean age).
To get a handle on the age effect, I would intuitively have added an age-structure indicator (for example, the % of young people, the % of older people, etc.), and the ANOVA would become an ANCOVA, i.e. simply an ANOVA with an additional covariate. But the reviewer suggests a linear or logistic regression. Does this make sense to you? As a non-statistician, I am not sure what I should do in this case.
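To make my idea concrete, here is a minimal sketch of the ANCOVA I have in mind, assuming a data frame d with hypothetical columns time (travel time), MDI (the 6-level factor) and prop_under30 (an age-structure indicator):

# Hypothetical column names; the covariate enters first so the MDI test is age-adjusted.
fit <- lm(time ~ prop_under30 + MDI, data = d)
anova(fit)    # sequential ANCOVA table: MDI effect after adjusting for age structure
summary(fit)  # beta coefficients for the MDI levels, adjusted for age

As far as I understand, this is the same model as the reviewer's linear regression with the age proportion as an additional independent variable, but I may be missing something.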
I hope I am not too confusing. Please tell me if I am, and I will try to clarify.
Thanks in advance,
Cheers
PS: I am not a native English speaker and my English is rather poor. Apologies for the mistakes.

Related

How to interpret quasi-Poisson GLM p-values given hidden default categories?

Hi
I am struggling with how to interpret my quasi-Poisson GLM table because of what I have learned are default (reference) categories. For example, in my case I have 3 sites, but only 2 sites are shown in the table (Husab Mine and Swakop); the third site (Hope Mine) is not shown. Likewise, for sex only male (M) is displayed, and female is a default category.
I am studying pollinator visitation rate to a dioecious plant, comparing visits to male plants (M) and visits to female plants (F) within and across sites. All I want to find out is whether pollinator visitation rate is affected by site and plant sex.
My data are overdispersed, so I opted for the quasi-Poisson model, which I now find confusing to interpret.
So I assume the model is comparing the two sites in the table to the default site, i.e. Husab Mine vs Hope Mine and Swakop vs Hope Mine, but is it not also supposed to compare the two sites shown in the table, i.e. Husab Mine vs Swakop? Which p-values should I pick to conclude whether visitation rate is significantly affected by site and sex?
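My model looks roughly like this (simplified, with hypothetical column names in a data frame d); I have read that drop1() gives the overall tests and that the emmeans package gives the missing pairwise comparisons, but I am not sure:

fit <- glm(visits ~ site + sex, family = quasipoisson, data = d)
summary(fit)                 # each coefficient compares a level to the reference level
drop1(fit, test = "F")       # overall tests of site and sex (F tests for quasi-families)
library(emmeans)
pairs(emmeans(fit, ~ site))  # all pairwise site comparisons, incl. Husab Mine vs Swakop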
Thanks for your help.

Predict the injury time of a football match?

I have a project which requires me to predict the injury time of a football match.
I have the relevant information, such as goals, corners, the referee, the two team names, and the injury time for each half.
I tried to use Poisson regression, but I am confused about how to include the referee as a factor in my model, since different referees were involved in different numbers of matches: say, Tom was involved in 200 games while Jerry was in 30.
I tried adding the factor "referee" to the model, and the summary told me that only some of the referees have a significant effect on the result. So I wonder: is it correct to add the referee to the model directly, and are there any other methods I can use?
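For reference, what I tried looks roughly like the first fit below (all column and data-frame names are made up). I have also read that a random effect for referee (via the lme4 package) might handle the unequal match counts, though I am unsure:

library(lme4)
# injury_minutes, goals, corners, referee and matches are hypothetical names
fit_fixed  <- glm(injury_minutes ~ goals + corners + referee,
                  family = poisson, data = matches)          # referee as a fixed factor
fit_random <- glmer(injury_minutes ~ goals + corners + (1 | referee),
                    family = poisson, data = matches)        # partially pools referees with few games
summary(fit_random)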
Thanks!

Relationship between logistic regression and linear regression

I've encountered a problem where I need to analyze the relationship between a movie's length, its price, and its sales on a video streaming platform. I have two choices for quantifying sales as my dependent variable:
whether or not a user ended up buying the movie
selling rate (# of people buying the movie / # of people who watched the trailer)
If I use the selling rate, I would essentially use a linear regression:
selling_rate = beta_0 + beta_1*length + beta_2*price + beta_3*(length*price)
But if I'm asked to use option 1, where my response is a binary output, I assume I need to switch to logistic regression. How would the standard error change? Will the standard error be an underestimate?
Your SE will be on a different scale, but if you have a large effect with the continuous outcome, there is a solid chance that you will get the same inferences with the binary logistic model. The logistic model is "throwing away" nearly all the variability in the responses, so it has relatively low power.
As SweetSpot said, you should treat this as a GLM problem because of the restrictions on the range of the outcome: you don't want a model that can give you negative counts or rates. The variance estimates also need care. Consider using glm with family = binomial for the yes/no sold outcome and family = poisson for the count/rate.
The UCLA web pages for logistic, Poisson and negative binomial regression are a great place to start. Probably the best book for people who want clean writing without proofs is Agresti's Introduction to Categorical Data Analysis.
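A minimal sketch of both fits, with made-up column and data-frame names:

# Option 1: binary purchase per user, logistic regression
fit_user  <- glm(bought ~ length * price, family = binomial, data = users)
# Option 2: per-movie buyer counts, Poisson with trailer viewers as an exposure offset
fit_movie <- glm(buyers ~ length * price + offset(log(viewers)),
                 family = poisson, data = movies)
summary(fit_user)$coefficients   # SEs on the log-odds scale
summary(fit_movie)$coefficients  # SEs on the log-rate scale

Note that length * price expands to the two main effects plus their interaction, matching the beta_3 term in the equation above.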

Rasch: Decision for model and group analysis

I currently have a data set of about 300 people on behaviors (answers: yes/no/NA), plus variables on age, place of residence (city/country), income, etc.
In principle, I would like to find out the item difficulties for the overall sample (which R package is best for this? How does it work? I don't fully understand some of the code I have seen),
and in the next step examine different groups (young/old, city/country, income via a median split) with regard to their possibly significantly different item difficulties.
How do I do that? Is this possible with Wald tests, Rasch trees, or raschmix? Or do I need latent groups, i.e. groups formed in a data-driven way?
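One possible starting point, sketched with the eRm package (one common choice for Rasch models; resp and age_group are hypothetical objects):

library(eRm)
fit <- RM(resp)                          # resp: person-by-item matrix of 0/1 (NAs allowed)
summary(fit)                             # item difficulty estimates for the overall sample
lr  <- LRtest(fit, splitcr = age_group)  # Andersen LR test: do difficulties differ by group?
Waldtest(fit, splitcr = age_group)       # item-level Wald tests for the same split

As far as I can tell, manifest splits like these do not require latent groups; raschtree() in the psychotree package and raschmix are the usual tools when the groups are to be found in a data-driven way.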

Backward Elimination for Cox Regression

I want to explore the following variables and their 2-way interactions as possible predictors: the number of siblings (nsibs), weaning age (wmonth), maternal age (mthage), race, poverty, birthweight (bweight) and maternal smoking (smoke).
I created my Cox regression formula, but I don't know how to form the 2-way interactions with the predictors:
coxph(Surv(wmonth, chldage1) ~ as.factor(nsibs) + mthage + race + poverty + bweight + smoke, data = pneumon)
final <- step(coxph(Surv(wmonth, chldage1) ~ (as.factor(nsibs) + mthage + race + poverty + bweight + smoke)^2, data = pneumon), direction = 'backward')
The formula interface is the same for coxph as it is for lm or glm. If you need to form all the two-way interactions, you use the ^-operator with a first argument of the "sum" of the covariates and a second argument of 2:
coxph(Surv(wmonth, chldage1) ~
      (as.factor(nsibs) + mthage + race + poverty + bweight + smoke)^2,
      data = pneumon)
I do not think there is a dedicated stepdown function for Cox regression. Therneau has spoken out in the past against making the process easy to automate. As Roland notes in his comment, the prevailing opinion among the R Core package authors is that stepwise procedures are statistically suspect. (This often creates some culture shock for people crossing over to R from SPSS or SAS, where the culture is more accepting of stepwise procedures and where social-science statistics courses seem to endorse the method.)
First off, you need to address the question of whether your data have enough events to support such a complex model. The statistical power of Cox models is driven by the number of events, not the number of subjects at risk. An admittedly imperfect rule of thumb is that you need 10-15 events for each covariate; by expanding to all two-way interactions you multiply the number of model terms perhaps 10-fold, and you expand the required number of events by a similar factor.
Harrell has discussed such matters in his RMS book and rms-package documentation and advocates applying shrinkage to the covariate estimates in the process of any selection method. That would be a more statistically principled route to follow.
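As a sketch of that route, using his rms package (illustrative only, not a recommendation to keep the full interaction model):

library(rms)
fit <- cph(Surv(wmonth, chldage1) ~
           (as.factor(nsibs) + mthage + race + poverty + bweight + smoke)^2,
           data = pneumon, x = TRUE, y = TRUE)
validate(fit, B = 200)   # bootstrap-validated indexes; large optimism flags overfitting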
If you do have such a large dataset and there is no theory in your domain of investigation regarding which covariate interactions are more likely to be important, an alternative would be to examine the full interaction model and then proceed with the perspective that each modification of your model adds to the number of degrees of freedom of the overall process. I have faced such a situation in the past (thousands of events, millions at risk), and my approach was to keep the interactions that met a more stringent statistical threshold. I restricted this approach to groups of variables that were considered related, and examined them first for their 2-way correlations. With no categorical variables in my model except smoking and gender, and 5 continuous covariates, I kept 2-way interactions whose delta-deviance measures (distributed as chi-square statistics) were 30 or more. I was thereby retaining interactions that "achieved significance" where the implicit degrees of freedom were much higher than the naive software listings. I also compared the results for the retained covariate interactions with and without the removed interactions, to make sure that the process had not meaningfully shifted the magnitudes of the predicted effects, and I used Harrell's rms-package validation and calibration procedures.
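The delta-deviance comparisons described above can be done with nested coxph fits; a minimal sketch with an illustrative subset of the covariates:

full    <- coxph(Surv(wmonth, chldage1) ~ (mthage + bweight + smoke)^2, data = pneumon)
reduced <- coxph(Surv(wmonth, chldage1) ~ mthage + bweight + smoke, data = pneumon)
anova(reduced, full)   # partial likelihood-ratio chi-square for the block of interactions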
