How do the various criteria affect the outcome?

I have 4 projects (Alfa, Beta, Gamma, Delta). The projects are ranked from 1 to 4 based on personal feelings. Afterwards, the projects are evaluated on 6 criteria (C1-C6).
What is the significance index of each criterion (how important was each criterion in the evaluation of the projects, based on personal feeling)? What statistic can be used to figure this out? It is probably not multiple nonlinear regression, because the order of the criteria does not matter in my case, whereas in regression it does.
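One simple option (my illustration, not from the thread) is a rank-based association measure: the Spearman correlation between each criterion's scores and the overall personal ranking can serve as a rough significance index. A sketch in R, with made-up criterion scores as placeholders for the real values:

# Personal ranking of the 4 projects (1 = best).
ranks    <- c(Alfa = 1, Beta = 2, Gamma = 3, Delta = 4)
# Placeholder criterion scores; substitute the real C1-C6 evaluations.
criteria <- matrix(runif(4 * 6), nrow = 4,
                   dimnames = list(names(ranks), paste0("C", 1:6)))
# Spearman correlation of each criterion with the ranking:
sapply(colnames(criteria), function(j)
  cor(criteria[, j], ranks, method = "spearman"))

With only 4 projects these correlations are very noisy, so treat them as descriptive rather than inferential.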

Related

Variable Importance Magnitude Meaning

I am attempting to perform recursive feature elimination (RFE) in mlr3 using ranger random forest models, and I have a statistical question about the meaning of the variable importance score magnitude.
I have looked at a couple of different outcomes with the same independent variable set and noticed that the variable importance magnitudes following RFE (I used impurity as the method) are quite different for the different outcomes. For example, in one model the top feature has a score of 3, while the top feature of another model has a score of 0.15. Does this indicate that the independent variables are more important to the outcome in model 1 (top variable has a value of 3) than in model 2 (top variable has a value of 0.15)? I just want to wrap my head around the meaning of the magnitudes of the variable importance scores across these models.
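Worth noting (my addition, not part of the question): impurity importance is a sum of impurity decreases, so its absolute scale depends on the outcome; for regression forests it scales with the outcome's variance. Raw magnitudes are therefore not comparable across models; only rankings within a model are. A minimal sketch, assuming the mlr3, mlr3learners, and ranger packages and using a built-in task as a stand-in for your data:

library(mlr3)
library(mlr3learners)  # provides regr.ranger (needs the ranger package installed)

task    <- tsk("mtcars")  # placeholder task; substitute your own
learner <- lrn("regr.ranger", importance = "impurity")
learner$train(task)

imp <- learner$importance()   # raw impurity scores, on a model-specific scale
round(imp / max(imp), 3)      # relative scores in [0, 1], comparable within this model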

clustering standard errors within MLMs/lme4

Is it possible to use clustered standard errors and multilevel models together, and how does one implement this in R?
In my setup I am running a conjoint experiment in 26 countries with 2000 participants per country. As in any conjoint experiment, each participant is shown two vignettes and asked to choose/rate each vignette. The same participant is then shown two fresh vignettes for comparison and asked to repeat the task. In this case each participant performs two comparisons. The hierarchy is thus comparisons nested within individuals nested within countries. I am currently running a multilevel model with each comparison at level 1 and country as the level-2 unit. Obviously comparisons within individuals are likely to be correlated, so I'd like to cluster standard errors at the individual level as well. It seems overkill to add another level in the MLM for this, since the size of my clusters is extremely small (n = 2), and it makes more sense to do my analysis at the individual level (not to mention unnecessarily complicating the model, since with 2000 individuals * 26 countries the parameter space becomes huge). Is this possible? If so, how does one do this in R together with a multilevel model setup?
The cluster size of 2 is not an issue, and I don't see any issue with the parameter space either. If you fit random intercepts for participants and countries, these are estimated as latent, normally distributed variables, so you estimate only their variances rather than one free parameter per participant. A model such as:
lmer(outcome ~ fixed_effects + (1 | country/participant), data = dat)
will handle the dependencies within clusters (at both the participant and the country level), so there will be no need for clustered standard errors.
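A minimal runnable version of that model, assuming a data frame dat with columns outcome, treatment, country, and participant (treatment stands in for the conjoint attributes):

library(lme4)

# (1 | country/participant) expands to (1 | country) + (1 | country:participant),
# i.e. random intercepts for countries and for participants within countries.
fit <- lmer(outcome ~ treatment + (1 | country/participant), data = dat)
summary(fit)  # shows the variance components for both grouping levels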

Most straightforward R package for setting subject as random effect in mixed logit model

I have a dataset in which individuals, each belonging to a particular group, repeatedly chose between multiple discrete outcomes.
subID  group  choice
1      Big    A
1      Big    B
2      Small  B
2      Small  B
2      Small  C
3      Big    A
3      Big    B
...    ...    ...
I want to test how group membership influences choice, and want to account for non-independence of observations due to repeated choices being made by the same individuals. In turn, I planned to implement a mixed multinomial regression treating group as a fixed effect and subID as a random effect. It seems that there are a few options for multinomial logits in R, and I'm hoping for some guidance on which may be most easily implemented for this mixed model:
1) multinom - The nnet package provides the multinom function. This appears to be a nice, clear, straightforward option... for fixed-effects models. However, is there a way to implement random effects with multinom? A previous CV post suggests that a multinomial model can be handled as a mixed-effects GLM with a Poisson distribution and a log link. However, I don't understand (a) why this is the case or (b) the required syntax. Can anyone clarify?
2) mlogit - A fantastic package, with incredibly helpful vignettes. However, the "mixed logit" documentation refers to models that have random effects related to alternative-specific covariates (implemented via the rpar argument). My model has no alternative-specific variables; I simply want to account for the random intercepts of the participants. Is this possible with mlogit? Is that variance automatically accounted for by setting subID as the id.var when reshaping the data to long form with mlogit.data? EDIT: I just found an example of "tricking" mlogit into providing random coefficients for variables that vary across individuals (very bottom here), but I don't quite understand the syntax involved.
3) MCMCglmm is evidently another option. However, as a relative novice with R and someone completely unfamiliar with Bayesian stats, I'm not personally comfortable parsing example syntax of mixed logits with this package, or, even following the syntax, making guesses at priors or other needed arguments.
Any guidance toward the most straightforward approach and its syntax implementation would be thoroughly appreciated. I'm also wondering if the random effect of subID needs to be nested within group (as individuals are members of groups), but that may be a question for CV instead. In any case, many thanks for any insights.
I would recommend the Apollo package by Hess & Palma. It comes with great documentation and a quite helpful user group.
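For a non-Bayesian alternative (my sketch, not from the thread): the mclogit package's mblogit function fits multinomial baseline-category logit models with random effects, which covers a plain per-subject random intercept:

library(mclogit)

# Assumes a data frame dat with columns choice (a factor), group, and subID,
# one row per choice occasion, as in the table above.
fit <- mblogit(choice ~ group, random = ~ 1 | subID, data = dat)
summary(fit)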

Backward Elimination for Cox Regression

I want to explore the following variables and their 2-way interactions as possible predictors: the number of siblings (nsibs), weaning age (wmonth), maternal age (mthage), race, poverty, birthweight (bweight) and maternal smoking (smoke).
I created my Cox regression formula but I don't know how to form the 2-way interaction with the predictors:
coxph(Surv(wmonth, chldage1) ~ as.factor(nsibs) + mthage + race + poverty + bweight + smoke, data = pneumon)
final <- step(coxph(Surv(wmonth, chldage1) ~ (as.factor(nsibs) + mthage + race + poverty + bweight + smoke)^2, data = pneumon), direction = 'backward')
The formula interface is the same for coxph as it is for lm or glm. If you need to form all the two-way interactions, you use the ^ operator with a first argument of the "sum" of the covariates and a second argument of 2:
coxph(Surv(wmonth, chldage1) ~
        (as.factor(nsibs) + mthage + race + poverty + bweight + smoke)^2,
      data = pneumon)
I do not think there is a step-style stepdown function for Cox regression. Therneau has spoken out in the past against making the process easy to automate. As Roland notes in his comment, the prevailing opinion among the R Core package authors is that stepwise procedures are statistically suspect. (This often creates some culture shock when people cross over to R from SPSS or SAS, where the culture is more accepting of stepwise procedures and where social science stats courses seem to endorse the method.)
First off, you need to address the question of whether your data have enough events to support such a complex model. The statistical power of Cox models is driven by the number of events, not the number of subjects at risk. An admittedly imperfect rule of thumb is that you need 10-15 events for each covariate; expanding to all two-way interactions multiplies the number of covariates perhaps 10-fold, and so expands the required number of events by a similar factor.
Harrell has discussed such matters in his RMS book and in the rms package documentation, and advocates applying shrinkage to the covariate estimates as part of any selection method. That would be a more statistically principled route to follow.
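A hedged sketch of that route, assuming the rms package and the pneumon data from the question: validate() can repeat a fast backward stepdown (fastbw) inside each bootstrap resample, so the optimism created by the selection itself is estimated rather than ignored.

library(rms)

# x = TRUE, y = TRUE keep the design matrix and response for resampling.
f <- cph(Surv(wmonth, chldage1) ~ as.factor(nsibs) + mthage + race +
           poverty + bweight + smoke,
         data = pneumon, x = TRUE, y = TRUE)
validate(f, B = 200, bw = TRUE)  # bootstrap validation with backward stepdown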
If you do have such a large dataset, and there is no theory in your domain of investigation regarding which covariate interactions are more likely to be important, an alternative would be to examine the full interaction model and then proceed with the perspective that each modification of your model adds to the number of degrees of freedom of the overall process.
I have faced such a situation in the past (thousands of events, millions at risk), and my approach was to keep the interactions that met a more stringent statistical criterion. I restricted this approach to groups of variables that were considered related, and examined them first for their 2-way correlations. With no categorical variables in my model except smoking and gender, and 5 continuous covariates, I kept 2-way interactions that had delta-deviance measures (distributed as chi-square statistics) of 30 or more. I was thereby retaining interactions that "achieved significance" where the implicit degrees of freedom were much higher than the naive software listings suggested. I also compared the results for the retained covariate interactions with and without the removed interactions, to make sure that the process had not meaningfully shifted the magnitudes of the predicted effects. And I used Harrell's rms package's validation and calibration procedures.
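The delta-deviance comparison described above is a likelihood-ratio test between nested Cox models. A minimal sketch with the question's variables, assuming the survival package:

library(survival)

# Main-effects model vs. the full two-way interaction model.
main <- coxph(Surv(wmonth, chldage1) ~ as.factor(nsibs) + mthage + race +
                poverty + bweight + smoke, data = pneumon)
full <- coxph(Surv(wmonth, chldage1) ~ (as.factor(nsibs) + mthage + race +
                poverty + bweight + smoke)^2, data = pneumon)
anova(main, full)  # delta-deviance, chi-square on the interaction df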

Problems with writing to a table from a looped stepwise regression

I have a total of 95 potential predictor variables, and I'd like to reduce that number to the variables with more predictive power. My plan thus far has been to write some code to:
within a loop, select 6 random predictors and perform a stepwise regression (direction = "both") upon them;
continue this loop for 100,000 iterations, to sample as many of the possible combinations as is practical.
The significance of each predictor (from the summary command) is judged by its p-value: values < 0.05 are coded as '1' and values > 0.05 as '0' for the 6 predictors (or fewer) which make it through. The predictor name is preserved in the loop output table.
I cannot seem to create a single output table with the 95 columns and write to each individual line using the 6-column ones generated for each iteration of the loop.
So is there any way to add to an array created with:
results <- array(NA, c(100000, 95))
with column names assigned by:
colnames(results) <- c(<inputdata>)
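For what it's worth, the bookkeeping itself can be done by indexing the master table by predictor name. A sketch under the question's own plan (my code, not from the thread; it assumes a data frame dat whose response column is y and whose remaining 95 columns are numeric predictors):

preds   <- setdiff(names(dat), "y")
results <- matrix(NA, nrow = 100000, ncol = length(preds),
                  dimnames = list(NULL, preds))

for (i in seq_len(nrow(results))) {
  sel <- sample(preds, 6)                            # 6 random predictors
  fit <- step(lm(reformulate(sel, response = "y"), data = dat),
              direction = "both", trace = 0)
  pv  <- summary(fit)$coefficients[-1, "Pr(>|t|)"]   # drop the intercept row
  results[i, sel]       <- 0                         # entered the loop, coded 0
  results[i, names(pv)] <- as.numeric(pv < 0.05)     # 1 where p < 0.05
}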
Instead of choosing variables at random, why not use a shrinkage and variable selection method, such as the lasso or least angle regression? Both will automatically select the variables that are most correlated with the outcome.
There are mature R packages for both (glmnet for the lasso, lars for least angle regression).
aix and Ben Bolker have both made good suggestions. I'd also recommend glmnet, and take a look at the settings for dfmax and pmax, which allow you to constrain the number of active variables in a model and the total number of variables considered along a particular sequence of models.
Essentially, stepwise regression, one variable at a time, is a little antiquated (oh, when I was a young iterator, doing my first iterations, I did stepwise regression all the time), but it's good to move on to a different methodology entirely. There are instances where it's still reasonable, but they're few and rather specialized. All-subsets modeling, however, should be avoided: it simply doesn't scale, and virtually nothing is gained from all of that computational effort.
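A minimal glmnet sketch along those lines, assuming a numeric predictor matrix x with the 95 columns and a response vector y (dfmax and pmax are the constraints mentioned above):

library(glmnet)

# Cross-validated lasso; dfmax caps the active variables in any one model,
# pmax caps the variables ever nonzero along the lambda path.
cvfit <- cv.glmnet(x, y, alpha = 1, dfmax = 10, pmax = 20)
coef(cvfit, s = "lambda.1se")  # nonzero rows are the selected predictors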
