Variable Importance Magnitude Meaning [migrated]

I am attempting to perform recursive feature elimination (RFE) in mlr3 using ranger random forest models, and I have a statistical question about the meaning of the magnitude of the variable importance scores.
I have looked at a couple of different outcomes with the same set of independent variables and noticed that the variable importance magnitudes following RFE (using impurity as the importance method) are quite different for the different outcomes. For example, in one model the top feature has a score of 3, while the top feature of another model has a score of 0.15. Does this indicate that the independent variables are more important to the outcome in model 1 (top score 3) than in model 2 (top score 0.15)? I am just trying to wrap my head around the meaning of the magnitudes of the variable importance scores across these models.
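For concreteness, a minimal sketch (with a hypothetical data frame my_data and target outcome1, and assuming a regression task) of how these scores can be extracted in mlr3. Note that ranger's impurity importance for regression is a summed decrease in node variance, so its scale depends on the variance of the outcome itself:
library(mlr3)
library(mlr3learners)

# my_data and outcome1 are hypothetical placeholders
task <- as_task_regr(my_data, target = "outcome1")
learner <- lrn("regr.ranger", importance = "impurity")
learner$train(task)
learner$importance()  # named numeric vector, sorted in decreasing order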

Related

How to correctly interpret fixed effect in a GLMM with repeated measures data and quantify its effect [migrated]

I have a dataframe of relative proportions of cell types, measured at 3 different timepoints in several individuals.
I want to write a model that primarily evaluates the effect of time, but also the potential influence of a few other factors on the longitudinal development.
My approach so far would be to loop over every dependent variable and every possible metadata variable and evaluate the model:
library(glmmTMB)
model <- glmmTMB(cell_type1 ~ Age + metadata1 + (1 | patient_id), data = data,
                 family = beta_family(), na.action = na.omit, REML = FALSE)
car::Anova(model, test.statistic = "Chisq", type = 2)
I have a hard time interpreting the results from this. For example, if I do get a significant p-value for both my cell type and my metadata variable, can I say that the effect of Age is significantly influenced by my metadata variable? If so, what is the best way to quantify this influence? The metadata variables are usually fixed for one individual and don't change over the observation period.
Unfortunately, the potential influences on my time-dependent effect include binary, categorical, as well as continuous data.
What would be the best way to approach this problem?
Edit: Another approach, which seems a little more sensible to me right now, would be to fit a baseline model (time ~ celltype) and compare it to another model, time ~ celltype + metadata, via a likelihood ratio test. However, I'd still be stuck at quantifying the effect of the metadata variable.
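For concreteness, the moderation asked about above ("the effect of Age is influenced by my metadata variable") would usually be expressed as an interaction term, and the likelihood ratio comparison from the edit could then look like this minimal sketch (variable names taken from the question, otherwise hypothetical):
library(glmmTMB)

# Baseline: time effect only
m0 <- glmmTMB(cell_type1 ~ Age + (1 | patient_id), data = data,
              family = beta_family(), REML = FALSE)
# Moderation: does metadata1 modify the Age effect?
m1 <- glmmTMB(cell_type1 ~ Age * metadata1 + (1 | patient_id), data = data,
              family = beta_family(), REML = FALSE)
anova(m0, m1)  # likelihood ratio test; requires ML fits, hence REML = FALSE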

How should I select features for logistic regression in R? [closed]

I tried several ways of selecting predictors for a logistic regression in R. I used lasso logistic regression to get rid of irrelevant features, cutting their number from 60 to 24, and then used those 24 variables in my stepAIC logistic regression, after which I further cut one variable with a p-value of approximately 0.1. What other feature selection methods can, or even should, I use? I tried to look for an ANOVA correlation coefficient analysis, but I didn't find any examples for R. And I think I cannot use a correlation heatmap in this situation, since my output is categorical? I have seen some instances recommending lasso and stepAIC, and other instances criticising them, but I didn't find any definitive, comprehensive alternative, which left me confused.
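For reference, the pipeline described above might look like the following minimal sketch (df and outcome are hypothetical names; predictors are assumed numeric so that the model.matrix column names match the data frame's columns):
library(glmnet)
library(MASS)

x <- model.matrix(outcome ~ ., data = df)[, -1]
y <- df$outcome
cv <- cv.glmnet(x, y, family = "binomial", alpha = 1)  # lasso screen
b <- coef(cv, s = "lambda.min")
keep <- rownames(b)[as.vector(b) != 0][-1]             # nonzero coefficients, intercept dropped

full <- glm(reformulate(keep, response = "outcome"), data = df, family = binomial)
stepAIC(full, direction = "backward")                  # stepwise AIC on the survivors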
Given the methodological nature of your question you might also get a more detailed answer at Cross Validated: https://stats.stackexchange.com/
From the information provided, 23-24 independent variables seems like quite a number to me. If you do not have a large sample, remember that overfitting might be an issue (i.e. a low cases-to-variables ratio). Indications of overfitting are large parameter estimates and standard errors, or failure of convergence, for instance. You have obviously already used stepwise variable selection via stepAIC, which would also have been my first try had I chosen to let the model do the variable selection.
If you spot any issues with standard errors or parameter estimates, further options down the road might be to collapse categories of independent variables, or to check whether there is any evidence of multicollinearity, which could also result in deleting highly correlated variables and narrowing down the number of remaining features.
Apart from a strictly mathematical approach, you might also want to identify features that are likely to be related to your outcome of interest according to your underlying subject-matter hypothesis and your previous experience, i.e. look at the model from your point of view as an expert in your field of interest.
If sample size is not an issue and the point is reducing the number of features, you may consider running a principal component analysis (PCA) to find out about highly correlated features and do the regression with the principal components instead, which are uncorrelated linear combinations of your "old" features. PCA can be run with the base R functions prcomp or princomp and explored with the factoextra package: http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/
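A minimal sketch of that route (X a data frame of numeric predictors and y a binary outcome, both hypothetical; keeping five components is an arbitrary choice here):
library(factoextra)  # visualisation; prcomp itself is base R

pca <- prcomp(X, center = TRUE, scale. = TRUE)
fviz_eig(pca)                          # scree plot to choose the number of components
scores <- as.data.frame(pca$x[, 1:5])  # e.g. keep the first five components
fit <- glm(y ~ ., data = cbind(y = y, scores), family = binomial)
summary(fit)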

lmer or binomial GLMM [closed]

I am running a mixed model in R. However, I am having some difficulty understanding the type of model I should be running for the data that I have.
Let's call the dependent variable the number of early button presses in a computerised experiment. An experiment is made up of multiple trials. In each trial a participant has to press a button to react to a target appearing on a screen. However, they may press the button too early, and this is what is being measured as the outcome variable. So, for example, participant A may have a total of 3 early button presses across the trials of an experiment, whereas participant B may have 15.
In a straightforward linear regression model using the lm command in R, I would think of this outcome as a continuous numerical variable; after all, it's a number that participants score in the experiment. However, I am not trying to run a linear regression; I am trying to run a mixed model with random effects. My understanding of a mixed model in R is that the data the model takes should be structured as every participant by every trial. When the data is structured like this at the trial level, I suddenly have a lot of 1s and 0s in my outcome column, as of course at the trial level participants may accidentally press the button too early, scoring a 1, or not, scoring a 0.
Does this sound like something that needs to be considered as categorical? If so, would it then be looked at through the glmer function with family set to binomial?
Thanks
As stated by Martin, this question seems to be more of a Cross Validated question, but I'll throw in my 2 cents here.
The question often becomes what you're interested in with the experiment, and whether you have cause to believe that there is a random effect in your model. In your example you have 2 possible effects that could be random: the individuals and the trials. In classical random-effect models, the random effects are often chosen based on a series of rules of thumb, such as:
Whether the parameter can be thought of as random. This often refers to the levels changing within a factor; in this situation both the individuals and the trials are likely to change between experiments.
Whether you're interested in the systematic effect (e.g. how much did A affect B); then the effect is not random and should be considered for the fixed effects. In your case, it is really only relevant if there are enough trials to see a systematic effect across individuals, but one could then question how relevant this effect would be for generalized results.
Several other rules of thumb exist out there, but this at least gives us a place to start. The next question becomes which effect we're actually interested in. In your case it is not quite clear, but it sounds like you're interested in one of the following:
How many early button presses can we expect for any given trial
How many early button presses can we expect for any given individual
How big is the chance that an early button press happens during any given trial
For the first 2, you can benefit from averaging over either individual or trial and using a linear mixed-effect model with the counterpart as a random effect, although I would argue that a Poisson generalized linear model is likely a better fit, as you are modelling counts, which can only be positive. E.g., in a rather general sense:
# df is assumed to contain the raw data
library(lme4)
# 1)
df_agg <- aggregate(. ~ individual, data = df)
lmer(early_clicks ~ . - individual + (1 | individual), data = df_agg)
# or better: glmer(early_clicks ~ . - individual + (1 | individual), family = poisson, data = df_agg)
# 2)
df_agg <- aggregate(. ~ trial, data = df)
lmer(early_clicks ~ . - trial + (1 | trial), data = df_agg)
# or better: glmer(early_clicks ~ . - trial + (1 | trial), family = poisson, data = df_agg)
# 3)
glmer(early_clicks ~ . + (1 | trial) + (1 | individual), family = binomial, data = df)
Note that we could use 3) to get answers for 1) and 2), by using 3) to predict probabilities and using these to find the expected early_clicks. However, one can show theoretically that the estimation methods used in linear mixed models are exact, while this is not possible for generalized linear mixed models, so the results may differ slightly (or quite substantially) between the models. Especially in 3), the number of random effects may be quite substantial compared to the number of observations, and in practice the model may be impossible to estimate.
Disclaimer
I have only very briefly gone over some principles, and while they may serve as a very brief introduction, they are by no means exhaustive. In the last 15-20 years the theory and practical side of mixed-effect models have been extended substantially. If you'd like more information about mixed-effect models, I'd suggest starting with the GLMM FAQ page by Ben Bolker (and others) and the references listed within it. For estimation and implementations, I suggest reading the vignettes of the lme4, glmmTMB and possibly merTools packages, glmmTMB being the more recent and interesting project.

Backward Elimination for Cox Regression

I want to explore the following variables and their 2-way interactions as possible predictors: the number of siblings (nsibs), weaning age (wmonth), maternal age (mthage), race, poverty, birthweight (bweight) and maternal smoking (smoke).
I created my Cox regression formula, but I don't know how to form the 2-way interactions with the predictors:
coxph(Surv(wmonth, chldage1) ~ as.factor(nsibs) + mthage + race + poverty + bweight + smoke, data = pneumon)
final <- step(coxph(Surv(wmonth, chldage1) ~ (as.factor(nsibs) + mthage + race + poverty + bweight + smoke)^2, data = pneumon), direction = 'backward')
The formula interface is the same for coxph as it is for lm or glm. If you need to form all the two-way interactions, you use the ^-operator with a first argument of the "sum" of the covariates and a second argument of 2:
coxph(Surv(wmonth, chldage1) ~
    (as.factor(nsibs) + mthage + race + poverty + bweight + smoke)^2,
    data = pneumon)
I do not think there is a dedicated step-down function for Cox regression. Therneau has spoken out in the past against making the process easy to automate. As Roland notes in his comment, the prevailing opinion among the R Core package authors is that stepwise procedures are statistically suspect. (This often creates some culture shock when people cross over to R from SPSS or SAS, where the culture is more accepting of stepwise procedures and where social science stats courses seem to endorse the method.)
First off, you need to address the question of whether your data have enough events to support such a complex model. The statistical power of Cox models is driven by the number of events, not the number of subjects at risk. An admittedly imperfect rule of thumb is that you need 10-15 events for each covariate; by expanding the interactions perhaps 10-fold, you expand the required number of events by a similar factor.
Harrell has discussed such matters in his RMS book and rms-package documentation and advocates applying shrinkage to the covariate estimates in the process of any selection method. That would be a more statistically principled route to follow.
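If you want a concrete starting point for the shrinkage route, here is a minimal sketch using the lasso for Cox models via glmnet (not Harrell's rms tooling; variable names taken from the question, glmnet >= 4.0 assumed for Surv support):
library(glmnet)
library(survival)

x <- model.matrix(~ (as.factor(nsibs) + mthage + race + poverty +
                       bweight + smoke)^2, data = pneumon)[, -1]
y <- Surv(pneumon$wmonth, pneumon$chldage1)
cv <- cv.glmnet(x, y, family = "cox")  # cross-validated penalty choice
coef(cv, s = "lambda.1se")             # shrunken coefficients; many are exactly zero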
If you do have such a large dataset, and there is no theory in your domain of investigation regarding which covariate interactions are more likely to be important, an alternative would be to examine the full interaction model and then proceed with the perspective that each modification of your model adds to the number of degrees of freedom of the overall process. I have faced such a situation in the past (thousands of events, millions at risk), and my approach was to keep the interactions that met a more stringent criterion. I restricted this approach to groups of variables that were considered related, and examined them first for their 2-way correlations. With no categorical variables in my model except smoking and gender, and 5 continuous covariates, I kept 2-way interactions that had delta-deviance measures (distributed as chi-square statistics) of 30 or more. I was thereby retaining interactions that "achieved significance" where the implicit degrees of freedom were much higher than the naive software listings suggested. I also compared the results for the retained covariate interactions with and without the removed interactions, to make sure that the process had not meaningfully shifted the magnitudes of the predicted effects, and I used the validation and calibration procedures of Harrell's rms package.

PLM in R with time invariant variable

I am trying to analyze panel data which include observations for each US state collected across 45 years.
I have two predictor variables that vary across time (A, B) and one that does not (C). I am especially interested in knowing the effect of C on the dependent variable Y, while controlling for A and B and for the differences across states and time.
This is the model that I have, using plm package in R.
random <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"), model = "random", data = data)
My reasoning is that, with a time-invariant variable, I should be using a random rather than a fixed effects model.
My question is: Is my model and thinking correct?
Thank you for your help in advance.
You base your decision between fixed and random effects solely on computational grounds. Please look at the specific assumptions associated with the different models. The Hausman test is often used to discriminate between the fixed and the random effects model, but it should not be taken as the definitive answer (any good textbook will have further details).
Pooled OLS could also yield a good model, if it applies. Computationally, pooled OLS will also give you estimates for time-invariant variables.
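As a concrete illustration of that comparison, a minimal sketch of the Hausman test with plm (the time-invariant C cannot be estimated under the within transformation, so it is omitted from the fixed effects specification; phtest compares the coefficients the two models share):
library(plm)

fixed <- plm(Y ~ log1p(A) + B, index = c("state", "year"),
             model = "within", data = data)
random <- plm(Y ~ log1p(A) + B + C, index = c("state", "year"),
              model = "random", data = data)
phtest(fixed, random)  # a small p-value favours the fixed effects model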
