Calculating VIF for ordinal logistic regression & multicollinearity in R

I am running an ordinal regression model. I have 8 explanatory variables: 4 are categorical ('0' or '1') and 4 are continuous. Before fitting, I want to be sure there's no multicollinearity, so I use the variance inflation factor (the vif function from the car package):
library(MASS)  # polr()
library(car)   # vif()
mod1 <- polr(Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8, Hess = TRUE, data = df)
vif(mod1)
but I get a VIF value of 125 for one of the variables, as well as the following warning :
Warning message: In vif.default(mod1) : No intercept: vifs may not be sensible.
However, when I convert my dependent variable to numeric (instead of a factor), and do the same thing with a linear model :
mod2<-lm(Y ~ X1+X2+X3+X4+X5+X6+X7+X8, data=df)
vif(mod2)
This time all the VIF values are below 3, suggesting that there's no multicollinearity.
I am confused about the vif function. How can it return VIFs > 100 for one model and low VIFs for another? Should I stick with the second result and still fit an ordinal model anyway?

The vif() function uses determinants of the correlation matrix of the parameters (and subsets thereof) to calculate the VIF. In a linear model, this includes just the regression coefficients (excluding the intercept). The vif() function wasn't intended to be used with ordered logit models, so when it takes the variance-covariance matrix of the parameters, it includes the threshold parameters (i.e., the intercepts), which it would normally exclude in a linear model. This is why you get the warning: the function doesn't know to look for the threshold parameters and remove them. The VIF is really a function of the inter-correlations in the design matrix, which depend neither on the dependent variable nor on the non-linear mapping from the linear predictor into the space of the response variable (i.e., the link function in a GLM). You should therefore get the right answer with your second approach, using lm() with a numeric version of your dependent variable.
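To illustrate the point that the VIF depends only on the predictors, here is a minimal base R sketch on simulated data (the variables x1, x2, x3 are made up for illustration and are not the asker's data). It assumes every predictor occupies a single column, and computes the VIF two ways without ever involving a response variable.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # deliberately correlated with x1
x3 <- rbinom(n, 1, 0.5)               # a 0/1 predictor
X  <- cbind(x1 = x1, x2 = x2, x3 = x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the others
vif_by_regression <- sapply(colnames(X), function(j) {
  others <- X[, colnames(X) != j, drop = FALSE]
  r2 <- summary(lm(X[, j] ~ others))$r.squared
  1 / (1 - r2)
})

# Equivalent shortcut: the diagonal of the inverse correlation matrix
vif_by_inverse <- diag(solve(cor(X)))

print(round(rbind(vif_by_regression, vif_by_inverse), 3))
```

Both rows of the printed table should agree, and neither calculation uses Y, which is why the choice between polr() and lm() should not change the collinearity diagnosis.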

Related

mnrval Matlab function in R

Can someone please help me with the equivalent of the mnrval function in R? I have not been able to find one where predicted probabilities are returned based on arguments, coefficient estimates and predictor values. I tried to rewrite the Matlab function in R but was unable to because one of the inner functions that was used was private. I would highly appreciate your help on this.
The documentation page on mnrval() states
MNRVAL Predict values for a nominal or ordinal multinomial regression model.
PHAT = MNRVAL(B,X) computes predicted probabilities for the nominal
multinomial logistic regression model with predictor values X. B is the
intercept and coefficient estimates as returned by the MNRFIT function. X
is an N-by-P design matrix with N observations on P predictor variables.
MNRVAL automatically includes intercept (constant) terms in the model; do
not enter a column of ones directly into X. PHAT is an N-by-K matrix of
predicted probabilities for each multinomial category.
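Going by the description quoted above, the nominal case could be sketched in base R roughly as follows. This is not a port of the Matlab source: mnrval_nominal is a hypothetical helper name, and the sketch assumes that B has one column per non-reference category with the intercepts in its first row, and that the last category is the reference.

```r
# Hypothetical helper, not part of any package.
# B: (P + 1) x (K - 1) matrix; row 1 holds the intercepts, one column per
#    non-reference category (assumed layout).
# X: N x P matrix of predictor values, without a column of ones.
mnrval_nominal <- function(B, X) {
  X   <- as.matrix(X)
  B   <- as.matrix(B)
  eta <- cbind(1, X) %*% B            # N x (K - 1) log-odds vs. the reference
  eta <- cbind(eta, 0)                # reference category has log-odds 0
  eta <- eta - apply(eta, 1, max)     # stabilise the exponentials row by row
  p   <- exp(eta)
  p / rowSums(p)                      # N x K matrix of probabilities
}

# Made-up coefficients for P = 2 predictors and K = 3 categories
B <- matrix(c( 0.5, -1.0,
              -0.2,  0.3,
               0.1,  0.4), nrow = 3, byrow = TRUE)
X <- matrix(c(0, 0,
              1, 2), nrow = 2, byrow = TRUE)
phat <- mnrval_nominal(B, X)
print(phat)
```

Each row of phat sums to 1. The ordinal case uses a different (cumulative) parameterisation and is not covered by this sketch.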

Logistic regression for numerical predictor?

I'm working on a data set and want to use some of following variables to predict "Operatieduur". All the predictors have been factorized.
LogicFit <- train(Operatieduur ~ Anesthesioloog + Aorta_chirurgie + Benadering +
Chirurg + Operatietype, data = TrainData,
method="glm", family="binomial")
Here I use "train" function from caret package to make a logistic fitting with glm. When I ran this code I got the error message:
1: model fit failed for Resample01: parameter=none Error in eval(family$initialize) : y values must be 0 <= y <= 1
I googled it and found that the reason is that the response "Operatieduur" is a continuous numerical value (it's a duration). So how should I modify the call to use the predictors (they are all categorical) to predict a continuous numerical value? Can logistic regression do that?
Logistic regression predicts categories, not numerical values. If you want to predict a continuous numerical variable (even from categorical predictors), use ordinary linear regression instead. Depending on the number of categories in your predictor variables, you may want to consider one-hot/dummy encoding.
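As a rough illustration of the linear regression route, here is a base R sketch on made-up data (the column names surgeon, type and duration are invented and only stand in for the asker's variables). With caret, the analogous change would be to call train() with method = "lm" and no family argument.

```r
set.seed(42)
n <- 120
# Made-up data: two categorical predictors and a continuous duration (minutes)
train_data <- data.frame(
  surgeon = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  type    = factor(sample(c("open", "keyhole"), n, replace = TRUE))
)
train_data$duration <- 90 +
  20 * (train_data$surgeon == "B") +
  35 * (train_data$type == "open") +
  rnorm(n, sd = 10)

# lm() expands factors into dummy variables internally
fit <- lm(duration ~ surgeon + type, data = train_data)
print(coef(fit))

# New cases must use the same factor levels as the training data
new_cases <- data.frame(
  surgeon = factor(c("A", "B"), levels = levels(train_data$surgeon)),
  type    = factor(c("keyhole", "open"), levels = levels(train_data$type))
)
pred <- predict(fit, newdata = new_cases)
print(pred)
```

The predictions are continuous durations rather than class labels, which is what the response variable here calls for.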

beta coefficient in Anova with R and XLStat

I’m working with the software R and XLStat. I’ve conducted a one-way ANOVA (my categorical variable has 3 levels (1, 2, 3) and my response variable is quantitative, on a 1-10 scale).
I ran this ANOVA in both R and XLStat, and the outputs (F statistic, p-value, coefficient estimates, t-values, standard errors, …) are exactly the same.
However, XLStat offers an extra output: the standardized coefficients (also called beta coefficients). At first I was surprised, because I didn’t think beta coefficients could be calculated for a categorical variable, and according to the literature I read, it doesn’t make sense.
Anyway, I tried to find these coefficients with R using the only formula I found: beta = estimate * sd(x)/sd(y), where sd(x) is the standard deviation of the categorical variable (converted to a numeric variable in R in order to calculate sd(x), which seems logical) and sd(y) is the standard deviation of my response variable.
The first beta I obtained with R is the same as in XLStat, but the second and the third are not. Given that the first one matches, I suppose that XLStat also converts the categorical variable to a numeric variable (which is senseless, but that is not the question).
Moreover, I ran the ANOVA in Statistica to see whether XLStat had made a mistake, but its beta coefficients are the same as XLStat's …
So my question is: what is the formula to obtain the beta coefficients in a one-way ANOVA?
I would also like to ask about the relevance of these beta coefficients for a categorical variable. From my own reasoning and the publications I have read, they don't make sense …
PS: the contrasts in R and XLStat are sum(ai)=0. For beta coefficients, XLStat removes the intercept. I guess this could be important, but I don't know how.
The formula for obtaining beta coefficients from metric coefficients for an ANOVA is the same as for a linear regression. The standardized coefficients have no sensible interpretation for categorical variables, but in general they are useful for comparing the relative effects of IVs measured on different scales.
In R, either use scale() to transform the data to z-scores before fitting the model, or pass the fitted lm object to lm.beta() (provided by add-on packages such as QuantPsyc).
It is not clear why you would obtain different beta coefficients with XLStat, but it could have something to do with degrees of freedom if it's not an error. This example compares the lm.beta() function in R with SAS and obtains the same coefficients.
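The relationship between the formula and the z-score approach can be checked with a small base R sketch on simulated data. It assumes R's default treatment contrasts rather than the sum-to-zero contrasts mentioned in the question, and it applies the formula to each dummy column of the design matrix instead of to the factor converted to a single number.

```r
set.seed(7)
g <- factor(sample(1:3, 90, replace = TRUE))
y <- c(4, 6, 7)[as.integer(g)] + rnorm(90)

fit <- lm(y ~ g)                              # default treatment contrasts
D   <- model.matrix(fit)[, -1, drop = FALSE]  # dummy columns, intercept dropped

# beta_j = b_j * sd(x_j) / sd(y), applied to each dummy column
beta_formula <- coef(fit)[-1] * apply(D, 2, sd) / sd(y)

# Same thing via z-scores: scale the response and the dummy columns, then refit
ys <- as.numeric(scale(y))
beta_scaled <- coef(lm(ys ~ scale(D)))[-1]

print(round(cbind(beta_formula, beta_scaled), 4))
```

The two columns should agree. Taking sd() of the factor coded as 1, 2, 3 gives a single value for all coefficients, which is one possible reason only some of the hand-computed betas matched the other software.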

interpret/extract coefficients from factor variable in glmnet

I have run a logit model through glmnet. I am extracting the coefficients from the minimum lambda, and it gives me the results I expect. However I have a factor variable with nine unique values, and glmnet produces a single coefficient for this, which is expected for a binary variable but not factor...
library(glmnet)
coef(model.obj, s = 'lambda.min')
#output:
TraumaticInj 2.912419e-02
Toxin .
OthInj 4.065266e-03
CurrentSTDYN 7.601812e-01
GeoDiv 1.372628e-02 #this is a factor variable w/ 9 options...
so my questions:
1) how should I interpret a single coefficient from a factor variable in glmnet?
2) is there a method to extract the coefficients for the different factors of the variable?
glmnet doesn't handle factor variables. You have to convert them to dummy variables using, e.g., model.matrix(). So the result you are seeing is glmnet treating your factor variable as a single numeric variable.
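A minimal sketch of that conversion in base R, on made-up data (the values and level names div1 to div9 are invented; only the column names echo the question):

```r
# Made-up data: one numeric predictor and one factor with nine levels
set.seed(3)
df <- data.frame(
  OthInj = rnorm(50),
  GeoDiv = factor(sample(paste0("div", 1:9), 50, replace = TRUE),
                  levels = paste0("div", 1:9))
)

# Expand the factor into 0/1 columns; drop the intercept column that
# model.matrix() adds, since glmnet fits its own intercept
x <- model.matrix(~ OthInj + GeoDiv, data = df)[, -1]

print(colnames(x))
print(dim(x))
```

The matrix x has one column per non-reference level of GeoDiv. Passing it as the x argument to glmnet() or cv.glmnet() would then yield a separate coefficient for each of those levels.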
Can't be done, b/c glmnet doesn't treat factor variables. This is pretty much answered here: How does glmnet's standardize argument handle dummy variables?
This comment by @R_User in the answer is particularly insightful:
@DTRM - In general, one does not standardize categorical variables to
retain the interpretability of the estimated regressors. However, as
pointed out by Tibshirani here:
statweb.stanford.edu/~tibs/lasso/fulltext.pdf, "The lasso method
requires initial standardization of the regressors, so that the
penalization scheme is fair to all regressors. For categorical
regressors, one codes the regressor with dummy variables and then
standardizes the dummy variables" - so while this causes arbitrary
scaling between continuous and categorical variables, it's done for
equal penalization treatment. – R_User Dec 6 '13 at 1:20

R, Multinomial Regression: How to Find Conditional Probabilities?

In R, given a multinomial linear logit regression, I would need to obtain the conditional probability given some values of the predictors.
For example, using the function multinom from the package nnet, suppose I have computed fit <- multinom(response ~ predictor). From fit, how can I obtain the probabilities of the different response classes, given a certain value of the predictor?
I thought of using something like predict(fit,newdata,type=???), but I have no idea about how to continue.
I found a possible solution: predict(fit, newdata = newdata, type = "probs"), where newdata is a data frame holding the predictor values of interest. This returns the class probabilities for every value of the predictor: each row corresponds to one value.
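For a concrete run, here is a short sketch using the built-in iris data as a stand-in for the asker's response and predictor (the choice of Species and Sepal.Length is only for illustration):

```r
library(nnet)  # multinom()

fit <- multinom(Species ~ Sepal.Length, data = iris, trace = FALSE)

# newdata must be a data frame whose column names match the predictors
newdata <- data.frame(Sepal.Length = c(5, 6, 7))
probs <- predict(fit, newdata = newdata, type = "probs")
print(round(probs, 3))
```

Each row of probs corresponds to one row of newdata and each column to one response class, so the rows sum to 1.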
