Backward Elimination for Cox Regression - r

I want to explore the following variables and their 2-way interactions as possible predictors: the number of siblings (nsibs), weaning age (wmonth), maternal age (mthage), race, poverty, birthweight (bweight) and maternal smoking (smoke).
I created my Cox regression formula but I don't know how to form the 2-way interaction with the predictors:
coxph(Surv(wmonth, chldage1) ~ as.factor(nsibs) + mthage + race + poverty +
        bweight + smoke, data = pneumon)
final <- step(coxph(Surv(wmonth, chldage1) ~ (as.factor(nsibs) + mthage + race +
        poverty + bweight + smoke)^2, data = pneumon), direction = "backward")

The formula interface is the same for coxph as it is for lm or glm. If you need to form all the two-way interactions, you use the ^-operator with a first argument of the "sum" of the covariates and a second argument of 2:
coxph(Surv(wmonth, chldage1) ~
        (as.factor(nsibs) + mthage + race + poverty + bweight + smoke)^2,
      data = pneumon)
I do not think there is a stepdown function for Cox regression. Therneau has spoken out in the past against making the process easy to automate. As Roland notes in his comment, the prevailing opinion among the R Core package authors is that stepwise procedures are statistically suspect. (This often creates some culture shock when people cross over to R from SPSS or SAS, where the culture is more accepting of stepwise procedures and where social science statistics courses seem to endorse the method.)
First off, you need to address the question of whether your data has enough events to support such a complex model. The statistical power of a Cox model is driven by the number of events, not the number of subjects at risk. An admittedly imperfect rule of thumb is that you need 10-15 events for each covariate (i.e. for each estimated parameter); by expanding to all two-way interactions you multiply the number of parameters perhaps 10-fold, and so you multiply the required number of events by a similar factor.
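As a rough check (a minimal sketch, assuming chldage1 is the 0/1 event indicator used in the Surv() call above), you can compare the number of events with the number of parameters the full interaction model would estimate:
library(survival)
## Number of events, assuming chldage1 is the 0/1 event indicator
n_events <- sum(pneumon$chldage1)
## Number of parameters in the full two-way interaction model
fit_full <- coxph(Surv(wmonth, chldage1) ~
                    (as.factor(nsibs) + mthage + race + poverty + bweight + smoke)^2,
                  data = pneumon)
n_params <- length(coef(fit_full))
n_events / n_params   # rule of thumb: aim for roughly 10-15 events per parameter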
Harrell has discussed such matters in his RMS book and in the rms package documentation, and he advocates applying shrinkage to the covariate estimates as part of any selection method. That would be a more statistically principled route to follow.
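One concrete shrinkage route (a sketch only, not Harrell's rms workflow itself) is a penalized Cox model via the glmnet package, which keeps all the interaction terms but shrinks their coefficients; the cross-validated choice of lambda below is my own assumption:
library(survival)
library(glmnet)
## Design matrix with all two-way interactions (drop the intercept column)
X <- model.matrix(~ (as.factor(nsibs) + mthage + race + poverty + bweight + smoke)^2,
                  data = pneumon)[, -1]
y <- with(pneumon, Surv(wmonth, chldage1))   # glmnet >= 4.1 accepts a Surv object
## Lasso-penalized Cox regression; lambda chosen by cross-validation
cv_fit <- cv.glmnet(X, y, family = "cox", alpha = 1)
coef(cv_fit, s = "lambda.1se")   # shrunken coefficients; many are exactly zero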
If you do have such a large dataset and there is no theory in your domain of investigation regarding which covariate interactions are more likely to be important, an alternative would be to examine the full interaction model and then proceed with the perspective that each modification of your model adds to the number of degrees of freedom consumed by the overall process. I have faced such a situation in the past (thousands of events, millions at risk), and my approach was to keep only the interactions that met a more stringent criterion. I restricted this approach to groups of variables that were considered related, and I first examined their 2-way correlations. With no categorical variables in my model except smoking and gender, and 5 continuous covariates, I kept the 2-way interactions whose delta-deviance (distributed as a chi-square statistic) was 30 or more. I was thereby retaining interactions that "achieved significance" even when the implicit degrees of freedom were much higher than the naive software listings would suggest. I also compared the results for the retained covariate interactions with and without the removed interactions, to make sure that the process had not meaningfully shifted the magnitudes of the predicted effects, and I used the validation and calibration procedures in Harrell's rms package.
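For the delta-deviance comparisons described above, a likelihood-ratio test between nested coxph fits gives the chi-square statistic directly; a minimal sketch (the interaction dropped here is just an arbitrary example):
library(survival)
fit_full    <- coxph(Surv(wmonth, chldage1) ~
                       (as.factor(nsibs) + mthage + race + poverty + bweight + smoke)^2,
                     data = pneumon)
## The same model with one example interaction removed
fit_reduced <- update(fit_full, . ~ . - mthage:bweight)
## Twice the difference in log partial likelihood, distributed as chi-square
anova(fit_reduced, fit_full)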

Related

I'm trying to create a logistic regression model for a spam email dataset but have a lot of variables (over 2500)(NOVICE)

As said above, I'm trying to create a model for detecting spam emails based on word occurrences. The information in my dataset is as follows:
about 2800 variables, each representing a word and the frequency of its occurrence
a binary spam variable: 1 for spam, 0 for legitimate
I've been using online resources but can only find logistic regression and NN tutorials for much smaller datasets, which seem much simpler in comparison. So far I've totalled the word counts for spam and non-spam emails, but I'm having trouble creating the model itself.
Does anyone have any sources or insight on how to manage this with a much larger dataset?
Apologies for the simple question (if it is so); I appreciate any advice.
A classical approach uses a generalised linear model (GLM) with a penalty on the coefficients; in this case the GLM is a logistic regression model. The classic penalised approaches are the LASSO, ridge regression and the elastic net. The shrinkage of your parameter estimates may be such that no parameters are selected as predictive if the ratio of the number of variables (p) to the number of samples (N) is too high; tuning parameters control the amount of shrinkage. Overall it is a well-studied topic. Your question doesn't mention which programming language you will use, but you will find helpful packages in Python, R, Julia and other widespread data science languages. There is also a lot of information in the CV (Cross Validated) community.
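A minimal sketch of such a penalised logistic regression in R with glmnet; the object names (word_counts, spam) are placeholders for your data:
library(glmnet)
## word_counts: numeric matrix (emails x ~2800 word-frequency columns)
## spam: 0/1 response vector
X <- as.matrix(word_counts)
y <- spam
## alpha = 1 is the LASSO, alpha = 0 is ridge, values in between are elastic net
cv_fit <- cv.glmnet(X, y, family = "binomial", alpha = 1)
coef(cv_fit, s = "lambda.1se")                               # sparse set of retained words
predict(cv_fit, newx = X, type = "class", s = "lambda.1se")  # fitted class labels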
I would start by analysing each variable individually: fit a logistic regression for each one, and retain only those whose p-value is clearly significant.
After this first screening step, you can run a more complex logistic regression model that includes the variables retained from the first step.
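A sketch of that screening step, assuming the word-frequency columns and a 0/1 spam column live in a data frame called emails (the names and threshold are only illustrative, and screening this many variables raises multiple-testing concerns):
## Univariate screening: one logistic regression per word variable
word_vars <- setdiff(names(emails), "spam")
p_values <- sapply(word_vars, function(v) {
  fit <- glm(reformulate(v, response = "spam"), data = emails, family = binomial)
  coef(summary(fit))[2, "Pr(>|z|)"]
})
keep <- word_vars[p_values < 0.001]   # illustrative cutoff
## Second stage: joint model with the retained variables
fit2 <- glm(reformulate(keep, response = "spam"), data = emails, family = binomial)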

Comparing mixed models: singular fit for random effect group but only in some models. Should I drop RE group from all models or just where singular?

I am fitting several mixed models using the lmer function in the package lme4, each with the same fixed effects and random effects, but different response variables. The purpose of these models is to determine how environmental conditions influence different fruit and seed traits in a particular tree species. I want to know which traits respond most strongly to which environmental variables, and how well the variation in each trait is captured by each model overall.
The data have been collected from several sites, and several trees within each site.
Response variables: measures of fruits and seeds, e.g. fresh mass, dry mass, volume
Fixed effects: temperature, rainfall, soil nitrogen, soil phosphorus
Random effects: Sites, trees
Example of model notation I have been using:
lmer(fruit.mass ~ temperature + rainfall + `soil N` + `soil P` +
       (1 | site/tree), data = fruit)
My problem: some of the models run fine with no detectable issues; however, some produce a singular fit where the estimated variance for 'site' is 0.
I know there is considerable debate around how to deal with singular-fit models, although one approach is to drop site and keep the tree-level random effect. The models run fine after this.
My question: should I then drop the site random effect from the models which weren't singular if I want to compare these models in any way? If so, are there certain methods for comparing model performance better suited to this situation?
If this is covered in publications or discussion threads then any links would be much appreciated.
When a model converges to a singular fit, it indicates that the random-effects structure is overfitted. Therefore I would argue that it does not make sense to compare these models. I would also be concerned about multiple testing in this situation.
I would drop the models that give "boundary is singular" warnings. The rest are fine and you can compare among them. Of note: it is best to specify REML = FALSE when comparing models (I can give references if need be). Once the "best" model is selected you can refit it normally (i.e. with REML). I would recommend using conditional AIC (the cAIC4 package), for example anocAIC(model1, model2, ...). Another good one is the performance package; it has many options, such as model_performance, check_model, and so on.
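A sketch of that workflow, reusing the model from the question (the backticked variable names, the site:tree grouping for the reduced model, and the ML refits are my assumptions):
library(lme4)
library(cAIC4)        # conditional AIC
library(performance)  # model_performance(), check_model(), compare_performance()
## Refit with ML (REML = FALSE) for model comparison
m_full <- lmer(fruit.mass ~ temperature + rainfall + `soil N` + `soil P` +
                 (1 | site/tree), data = fruit, REML = FALSE)
m_tree <- lmer(fruit.mass ~ temperature + rainfall + `soil N` + `soil P` +
                 (1 | site:tree), data = fruit, REML = FALSE)  # tree-level effect only
isSingular(m_full)                    # flags the boundary (singular) fit
cAIC(m_full); cAIC(m_tree)            # conditional AIC for each candidate
compare_performance(m_full, m_tree)
## Once a model is chosen, refit it with REML = TRUE for the final estimates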

Singularity in Linear Mixed Effects Models

Dataset Description: I use a dataset with neuropsychological (np) tests from several subjects. Every subject has more than one test in his/her follow-up, i.e. one test per year. I study the cognitive decline in these subjects. The information that I have is: individual number (identity number), education (years), gender (M/F as a factor), age (years), and time from baseline (= years after the first np test).
AIM: My aim is to measure the rate of change in the np tests, i.e. the cognitive decline per year, for each subject. To do that I use linear mixed effects models (LMEMs), taking into account the above parameters, and I compute the slope for each subject.
Question: When I run the possible models (combining different parameters each time), I also check them for singularity, and the result in almost all cases is TRUE. So my models present singularity! If I wanted to use these models to make predictions this would not be good, as it means the model overfits the data. But since I just want to find the slope for each individual, I think this is not a problem; or, even better, I think it is an advantage, as in that case singularity offers a more precise calculation of the subjects' slopes. Do you think this reasoning is correct?
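For reference, a minimal sketch of the kind of random-slope model described, assuming a data frame np with columns id, score, time (years from baseline), age, education and gender (all names are illustrative):
library(lme4)
m <- lmer(score ~ time + age + education + gender + (1 + time | id), data = np)
## Per-subject slope = fixed effect of time + that subject's random slope deviation
subject_slopes <- coef(m)$id[, "time"]
head(subject_slopes)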

R linear regression with lm - how to deal with categorical variables with thousands of values (like city or zip code)?

I am using R and the linear regression function lm() to build a prediction model for business sales of retail stores. Among the many predictor (feature) variables in my dataset, there are some categorical (factor) features that can take on thousands of different values, such as zip code (and/or city name). For example, there are over 6000 different zip codes for California alone; if I instead use city, there are over 400 cities.
I understand that lm() creates a dummy variable for each level of a categorical feature. The problem is that when I run lm(), the explosion of variables takes a lot of memory and a really long time. How can I avoid or handle this situation with my categorical variables?
Your intuition to move from zip codes to cities is good. However, the question is: is there a further level of spatial aggregation which will capture the important spatial variation but will result in fewer categorical (i.e. dummy) variables? Probably. Depending on your question, simply including a dummy for rural/suburban/urban may be all you need.
In your case geographic region is likely a proxy meant to capture variation in socio-economic conditions. If so, why not include the socio-economic data directly? To do this you could use your city/zip data to link to US census data.
However, if you really need/want to include cities, try estimating a fixed effects model. The resulting within estimator differences out time-invariant categorical effects such as your city effects.
Even if you find a way to obtain an OLS estimate with 400 cities in R, I would strongly encourage you not to use an OLS estimator; use a ridge or lasso estimator instead. Unless your data is massive (it can't be too big since you're using R), the inclusion of so many dummy variables is going to dramatically reduce the degrees of freedom, which can lead to over-fitting and generally poorly estimated coefficients and standard errors.
In slightly more sophisticated language, when degrees of freedom are low the minimization problem you solve when estimating OLS is "ill-posed", and consequently you should use regularization. For example, ridge regression (i.e. Tikhonov regularization) would be a good solution; a minimal sketch appears at the end of this answer. Remember, however, that ridge regression is a biased estimator, so you should consider bias correction.
My solutions, in order of preference:
Aggregate up to a coarser spatial area (e.g. regions instead of cities).
A fixed effects estimator.
Ridge regression.
If you don't like my suggestions, I would suggest you pose this question on Cross Validated. IMO your question is closer to a statistics question than a programming question.
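For the ridge option, a minimal sketch with glmnet and a sparse design matrix, which keeps thousands of zip or city dummies memory-friendly (the data frame stores and the column sales are placeholders):
library(glmnet)
library(Matrix)
## Sparse design matrix: factor levels become dummy columns without dense storage
X <- sparse.model.matrix(sales ~ . - 1, data = stores)
y <- stores$sales
## alpha = 0 is ridge (Tikhonov regularization); lambda chosen by cross-validation
cv_ridge <- cv.glmnet(X, y, alpha = 0)
coef(cv_ridge, s = "lambda.min")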

PLM in R with time invariant variable

I am trying to analyze a panel data which includes observations for each US state collected across 45 years.
I have two predictor variables that vary across time (A,B) and one that does not vary (C). I am especially interested in knowing the effect of C on the dependent variable Y, while controlling for A and B, and for the differences across states and time.
This is the model that I have, using the plm package in R:
random <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"), model = "random", data = data)
My reasoning is that with a time-invariant variable I should be using a random effects rather than a fixed effects model.
My question is: Is my model and thinking correct?
Thank you for your help in advance.
You base your decision between fixed and random effects solely on computational grounds. Please look at the specific assumptions associated with the different models. The Hausman test is often used to discriminate between the fixed and the random effects models, but it should not be taken as the definitive answer (any good textbook will have further details).
Pooled OLS could also yield a good model, if its assumptions hold. Computationally, pooled OLS will also give you estimates for time-invariant variables.
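For comparison, a sketch of the fixed, random and pooled specifications plus a Hausman test (variable names follow the question; note that the within model drops the time-invariant C):
library(plm)
fixed  <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"),
              model = "within",  data = data)   # C is dropped: it is time-invariant
random <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"),
              model = "random",  data = data)
pooled <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"),
              model = "pooling", data = data)
phtest(fixed, random)   # Hausman test: a small p-value favours the fixed effects model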
