Logistic Regression Model & Multicollinearity of Categorical Variables in R

I have a training dataset that has 3233 rows and 62 columns. The dependent variable is Happy (train$Happy), which is binary. The other 61 columns are categorical independent variables.
I've created a logistic regression model as follows:
logModel <- glm(Happy ~ ., data = train, family = binomial)
However, I want to reduce the number of independent variables that go into the model, perhaps down to 20 or so. I would like to start by getting rid of collinear categorical variables.
Can someone shed some light on how to determine which categorical variables are collinear, and what threshold I should use when removing a variable from the model?
Thank you!

If your variables were continuous, the obvious solution would be penalized logistic regression (lasso); in R it is implemented in glmnet.
With categorical variables the problem is much more difficult.
I was in a similar situation and I used the importance plot from the randomForest package to reduce the number of variables.
This will not help you find collinearity, but only rank the variables by importance.
You have only about 60 variables, and maybe you have knowledge of the field, so you can try adding to your model some variables that make sense to you (like z = x1 - x3, if you think the difference x1 - x3 is important) and then rank them according to a random forest model, as in the sketch below.
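A minimal sketch of that importance ranking, assuming the train data frame from the original question (the random forest is only doing the variable screening here, not the final modelling):
library(randomForest)

# All predictors are categorical, so make sure they are stored as factors;
# a factor response makes this a classification forest
train[] <- lapply(train, as.factor)

rf <- randomForest(Happy ~ ., data = train, importance = TRUE)
varImpPlot(rf)   # the importance plot used for the ranking
importance(rf)   # numeric scores (mean decrease in accuracy / Gini)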

You could use Cramér's V, or the related phi or contingency coefficient (see a great paper at http://www.harding.edu/sbreezeel/460%20files/statbook/chapter15.pdf), to measure collinearity among categorical variables. If two or more categorical variables have a Cramér's V value close to 1, it means they're highly "correlated" and you may not need to keep all of them in your logistic regression model.
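A minimal sketch of a pairwise Cramér's V screen, again assuming the train data frame from the question; the 0.8 cutoff is purely illustrative, since there is no universally agreed threshold:
# Cramér's V from the chi-square statistic of a two-way table
cramers_v <- function(x, y) {
  tbl <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tbl, correct = FALSE)$statistic)
  n <- sum(tbl)
  sqrt(as.numeric(chi2) / (n * (min(dim(tbl)) - 1)))
}

preds <- setdiff(names(train), "Happy")
v <- outer(preds, preds,
           Vectorize(function(a, b) cramers_v(train[[a]], train[[b]])))
dimnames(v) <- list(preds, preds)

# Flag highly associated pairs (illustrative cutoff, not a standard)
which(v > 0.8 & upper.tri(v), arr.ind = TRUE)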

Related

How to compare importance of qualitative AND quantitative factors in a mixed linear model?

I am trying to explain the effects of different variables (qualitative and quantitative) on a quantitative variable. The aim is to see if we can raise its value by acting on the other variables.
For that, I analysed a big dataset (more than 10,000 rows) with a mixed linear model. Prior to constructing the model, I standardized all the quantitative factors.
I would like to compare the importance of the factors' effects on the covariable. The aim is to determine which one is more efficient to use in order to increase the variable of interest's value.
As a result, I need to compare quantitative factors with qualitative factors.
I tried to do that by comparing the "standardized" estimates from my regression (for the fixed effects only). I divided these estimates by the standard deviation of the covariable. I supposed this allows me to assess whether the variation in the covariable due to each factor is important relative to the overall variation of the covariable.
We can summarize it in this graph:
[graph of the standardized fixed-effect estimates omitted]
I think this approach is suited for quantitative variables. I can say, for example, that increasing the value of quantitative variable n°1 by one standard deviation decreases the covariable by 40% of its standard deviation, whereas doing the same on quantitative variable n°2 only increases the covariable by 10%. I conclude that quantitative variable n°1 has more effect on the covariable than quantitative variable n°2.
I am not so sure about the qualitative variables.
I figured I could approximate the importance of these qualitative variables by the average effect of all the factor levels. Is it appropriate?
And I don't know at all how to compare these effects to those of the quantitative variables. Do I need a new method?
Edit: after the comment of Shawn Hemelstrand, I will give you an example. So if you take the built-in mtcars dataset:
And you run a mixed model on it (even if it seems unnecessary to add a random variable):
library(tidyverse)
library(lme4)

dat <- mtcars %>%
  # Put factors where they should be
  mutate(across(c(vs, am, gear, carb), as.factor)) %>%
  # Standardize quantitative variables
  mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))

# lmer() requires at least one random term, so gear is used here as a
# (somewhat artificial) random intercept
regression <- lmer(mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + carb +
                     (1 | gear), data = dat)
Then, the aim would be to determine which factor(s) have the most effect on the mpg value, because you want to find ways to augment it. The question would be: do the other variables have effects on mpg that matter relative to the overall mpg variation, and relative to the effects of the other variables?
For that, you have to compare quantitative variables with qualitative variables.
PS: I have some factors in my dataset that have more than 80 levels, so comparing each level one by one isn't an option.

How can I get log-likelihoods in mitools (survey design with multiple imputed datasets)?

I've successfully run a logistic regression for a complex survey design where data were imputed into multiple datasets with mitools.
Although I can get the confidence interval and the significance of each variable, I'm interested in studying the significance of a block of variables (dummy variables that represent a categorical variable with several categories). That could be accomplished by subtracting the log-likelihoods of models fitted with and without this block of variables.
Can this be accomplished with mitools?
Thank you.
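A minimal sketch of that nested-model comparison on an imputation list, using plain glm fits for illustration (imps and the variable names are hypothetical; with a complex design you would fit svyglm on each design object instead, and the per-imputation statistics still have to be pooled, e.g. with miceadds::micombine.chisquare(), before quoting a single p-value):
library(mitools)

# imps: an imputationList of the completed datasets (hypothetical name)
full    <- with(imps, glm(y ~ x1 + x2 + block1 + block2, family = binomial))
reduced <- with(imps, glm(y ~ x1 + x2, family = binomial))

# Per-imputation likelihood-ratio statistics for the block of dummies
lr <- mapply(function(f, r) 2 * (as.numeric(logLik(f)) - as.numeric(logLik(r))),
             full, reduced)
lr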

Right Regression for 2 binomial IVs and one metric DV

My independent variables are binomial (gender and posture),
but my dependent variable is interval-scaled (a seven-step Likert scale).
What is the right regression model for this? And how do I apply it in R?
Thank you for any advice.
For the constellation of 2 binomial IVs and one metric DV, it is okay to use a simple linear model, like lm in R.
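A minimal sketch, with hypothetical column names:
# gender and posture as factors, Likert response treated as metric
dat$gender  <- as.factor(dat$gender)
dat$posture <- as.factor(dat$posture)

fit <- lm(likert_score ~ gender + posture, data = dat)
summary(fit)
Adding gender:posture to the formula would also test the interaction, if that is of interest.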

Use of multiple imputation in R for two-level binary logistic regression model

I am using the glmer function of the R library lme4 to fit a generalized linear mixed model (GLMM) using the Laplace approximation, with a binary response variable, BINARY_r, say. I have one level-two fixed-effects variable (‘FIXED’, say) and two level-two cross-classified random-effects variables (‘RANDOM1’ and ‘RANDOM2’, say). The level-one binary response variable, BINARY_r, is nested separately within each of the two level-two variables. The logit link function is used to represent the non-Gaussian nature of the response variable. The interaction between the two random-effects variables is represented by ‘RANDOM1:RANDOM2’. All three independent variables are categorical. The model takes the form,
BINARY_r ~ FIXED + RANDOM1 + RANDOM2 + RANDOM1:RANDOM2.
There are missing data for ‘FIXED’ and ‘BINARY_r’, and I wish to explore the improvement in the model by applying multiple imputation to each of these two variables.
I am very unclear, however, as to how to use MI in R to refit the same glmer model with imputed data for FIXED and BINARY_r included. Can you help, please?
Many thanks in advance
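One common route is to impute with mice and pool the refitted models; a minimal sketch, assuming the data sit in a data frame dat (pooling merMod fits additionally requires the broom.mixed package, and the random-effects syntax below is the usual lme4 encoding of a cross-classified structure):
library(lme4)
library(mice)

# Impute the variables with missingness (FIXED, BINARY_r); by default mice
# uses the remaining columns as predictors
imp <- mice(dat, m = 5, seed = 1)

# Refit the cross-classified logistic GLMM on each completed dataset
fits <- with(imp, glmer(BINARY_r ~ FIXED + (1 | RANDOM1) + (1 | RANDOM2) +
                          (1 | RANDOM1:RANDOM2), family = binomial))

# Pool the fixed effects across imputations (Rubin's rules)
pool(fits)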

Linear model (lm) when dependent variable is a factor/categorical variable?

I want to do linear regression with the lm function. My dependent variable is a factor called AccountStatus:
1: 0 days in arrears, 2: 30–60 days in arrears, 3: 60–90 days in arrears, and 4: 90+ days in arrears (4 levels).
As independent variable I have several numeric variables: Loan to value, debt to income and interest rate.
Is it possible to do a linear regression with these variables? I looked on the internet and found something about dummies, but those were all for the independent variable.
This did not work:
fit <- lm(factor(AccountStatus) ~ OriginalLoanToValue, data=mydata)
summary(fit)
Linear regression does not take categorical variables for the dependent part; it has to be continuous. Considering that your AccountStatus variable has only four levels, it is unfeasible to treat it as continuous. Before commencing any statistical analysis, one should be aware of the measurement levels of one's variables.
What you can do is use multinomial logistic regression, see here for instance. Alternatively, you can recode the AccountStatus as dichotomous and use simple logistic regression.
Sorry to disappoint you, but this is just an inherent restriction of multiple regression, nothing to do with R really. If you want to learn more about which statistical technique is appropriate for different combinations of measurement levels of dependent and independent variables, I can wholeheartedly advise this book.
Expanding a little bit on @MaximK's answer: multinomial approaches are appropriate when the levels of the factor are unordered. In your case, however, when the measurement level is ordinal (i.e. ordered, but the distance between the levels is unknown/undefined), you can get more out of your data by doing ordinal regression, e.g. with the polr() function in the MASS package or with functions in the ordinal package. However, since ordinal regression has different/more complex underlying theory than simple linear regression, you should probably read more about it (e.g. at the Wikipedia article linked above, in the vignettes of the ordinal package, at the UCLA stats consulting page on ordinal regression, or by browsing related questions on CrossValidated).
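A minimal sketch with MASS::polr, reusing the question's variable names:
library(MASS)

# Ordinal regression needs the response as an ordered factor
mydata$AccountStatus <- factor(mydata$AccountStatus,
                               levels = c(1, 2, 3, 4), ordered = TRUE)

fit <- polr(AccountStatus ~ OriginalLoanToValue, data = mydata, Hess = TRUE)
summary(fit)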
If you can give a numeric value to the variables then you might have a solution. You have to rename the values to numbers, then convert the variable into a numeric one. Here is how:
library(plyr)

# Map each factor level to the desired numeric label, then convert
my.data2$islamic_leviathan_score <- revalue(my.data2$islamic_leviathan,
  c("(1) Very Suitable" = "3", "(2) Suitable" = "2",
    "(3) Somewhat Suitable" = "1", "(4) Not Suitable At All" = "-1"))
my.data2$islamic_leviathan_score_1 <- as.numeric(as.character(my.data2$islamic_leviathan_score))
This recodes the factor levels and then converts the variable to numeric. The results I get are consistent with the original values contained in the dataset when the variables are stored as factors. You can use this solution to rename the levels to whatever you like while converting them to numeric variables.
Finally, this is worth doing because it allows you to draw histograms or fit regressions, something you cannot do directly with factor variables.
Hope this helps!
