Model Analysis in R (Logistic Regression) - r

I have a data file (1 million rows) with one outcome variable, Status (yes/no), three continuous variables, and five nominal variables (5 categories in each).
I want to predict the outcome, i.e. status.
I would like to know which type of analysis is best for building the model.
I have seen logit, probit, and logistic regression. I am confused about where to start and how to identify the variables that are most likely to be useful.
data file:
gender,region,age,company,speciality,jobrole,diag,labs,orders,status
M,west,41,PA,FPC, Assistant,code18,27,3,yes
M,Southwest,65,CV,FPC,Worker,code18,69,11,no
M,South,27,DV,IMC,Assistant,invalid,62,13,no
M,Southwest,18,CV,IMC,Worker,code8,6,1,yes
PS: I am using R.
Any help would be greatly appreciated. Thanks!

Of the three, most people start their analysis with logistic regression.
Note that logistic regression and the logit model are the same thing.
When deciding between logistic and probit, go for logistic.
Probit can be slightly faster to fit, but logistic regression has the edge in interpretability: its coefficients are log-odds and can be exponentiated into odds ratios.
Now, to settle on variables: you can vary the number of predictors you include in the model.
model1 <- glm(status ~., data = df, family = binomial(link = 'logit'))
Now check the model summary (summary(model1)) to see which predictors appear to be important.
model2 <- glm(status ~ gender + region + age + company + speciality + jobrole + diag + labs, data = df, family = binomial(link = 'logit'))
By reducing the number of variables you will be better able to identify which ones matter.
Also, ensure that you have performed data cleaning prior to this.
Avoid including highly correlated variables; you can check the continuous ones with cor().
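A minimal sketch of that workflow, assuming the column names from the sample file above and that the data are read from a placeholder file name:
# read the file; character columns become factors
df <- read.csv("mydata.csv", stringsAsFactors = TRUE)   # "mydata.csv" is a placeholder name
df$status <- factor(df$status, levels = c("no", "yes")) # glm then models P(status == "yes")

# full model with every predictor
model1 <- glm(status ~ ., data = df, family = binomial(link = 'logit'))
summary(model1)   # z-tests / p-values give a first idea of which predictors matter

# correlations among the continuous predictors only
cor(df[, c("age", "labs", "orders")])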

Related

Step vs. StepAIC

I'm trying to understand why my code has taken several days to run and how I can improve the next iteration. I'm on my third day and continue to get outputs with only marginal improvements in AIC. The last few AICs have been 18135.38, 18187.43, and 18243.13. I currently have 33 covariates in the model. The "none" option is 12th from the bottom, so there are still many covariates to run.
The data is ~610K observations with ~1600 variables. The outcome variables and covariates are mostly binary. My covariates were chosen after doing univariate logistic regression and p-value adjustment with the Holm procedure (alpha = 0.05). No interaction terms are included.
The code I've written is here:
intercept_only <- glm(outcome ~ 1, data=data, family="binomial")
full.model <- glm(outcome ~ 157 covariates, data=data, family = "binomial")
forward_step_model <- step(intercept_only, direction = "forward", scope = formula(full.model))
I'm hoping to run the same code on a different outcome variable with double the number of covariates, identified in the same way as above, but am worried it will take even longer to process. I see there are both the step and stepAIC functions for performing stepwise regression. Is there an appreciable difference between these functions? Are there other ways of doing this? Is there any way to speed up the processing?
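For what it's worth, step() (base R) and MASS::stepAIC() implement essentially the same AIC-based stepwise search, so switching between them is unlikely to change runtime much. A hedged sketch of the shared interface, reusing the objects defined above, showing the arguments that cap the number of steps and silence the per-step printout:
library(MASS)

# same search strategy as step(); `steps` caps the number of forward additions,
# `trace = 0` suppresses per-step output, and k = log(n) would switch the criterion to BIC
forward_step_model <- stepAIC(intercept_only,
                              direction = "forward",
                              scope = formula(full.model),
                              steps = 50,
                              trace = 0)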

Having trouble with overfitting in simple R logistic regression

I am a newbie to R and I am trying to perform a logistic regression on a set of clinical data.
My independent variables are AGE, TEMP, WBC, NLR, CRP, PCT, ESR, IL6, and TIME.
My dependent variable is binomial CRKP.
After fitting the model with glm, I got this warning message:
glm.fit <- glm(CRKP ~ AGE + TEMP + WBC + NLR + CRP + PCT + ESR, data = cv, family = binomial, subset=train)
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
I looked into potential causes and used the corrplot function to see whether there is multicollinearity that could be causing overfitting.
This is the plot I get (image not included here). The correlation plot shows that my ESR and PCT variables are highly correlated; similarly, CRP and IL6 are highly correlated. But they are all important clinical indicators that I would like to keep in the model.
I tried to use the VIF to selectively discard variables, but wouldn't that introduce bias? I would also have to sacrifice some of my variables of interest.
Does anyone know what I can do? Please help. Thank you!
You have a multicollinearity problem but don't want to drop variables. In this case you can use Partial Least Squares (PLS) or Principal Component Regression (PCR).
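A minimal sketch of the PCR idea for a binary outcome, assuming the variable names from the question and ignoring the train/test split for brevity (the principal components replace the correlated predictors in an ordinary glm):
# principal components of the (scaled) correlated predictors
pcs <- prcomp(cv[, c("AGE", "TEMP", "WBC", "NLR", "CRP", "PCT", "ESR", "IL6")],
              center = TRUE, scale. = TRUE)
summary(pcs)   # decide how many components to keep

# keep, say, the first three components and use them as predictors
cv_pc <- data.frame(CRKP = cv$CRKP, pcs$x[, 1:3])
fit_pcr <- glm(CRKP ~ PC1 + PC2 + PC3, data = cv_pc, family = binomial)
summary(fit_pcr)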

How to drop specific instances of a factor variable in R from a regression

I'm running this regression.
model1 <- lm(DV ~ IV1 + IV2 + IV3 + SubjectID, data = df)
I'm checking for multicollinearity between the variables. SubjectID is the ID of each subject. Each subject has 8 observations and there are about 300 subjects. When I run the model above I get no errors. When I run car::vif I get an error indicating there is multicollinearity in the model. I check the regression results and the model says three of the SubjectIDs are multicollinear. This really surprised me; my understanding is that only one of the SubjectIDs should be linearly dependent on the rest. Regardless, assume I know that SubjectID2, SubjectID3, and SubjectID4 are multicollinear: how do I drop them? My understanding is that if I simply subset them out then R will describe other factor levels as linearly dependent, so I can't simply drop them.
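One possible workaround, sketched under the assumption that "SubjectID2", "SubjectID3" and "SubjectID4" are the dummy-column names that model.matrix actually produces for the offending levels: build the design matrix yourself, drop those columns, and refit. alias() also shows exactly which terms are linearly dependent.
alias(model1)   # lists the aliased (linearly dependent) coefficients

# build the design matrix from the fitted model, drop the offending dummy columns, refit
X <- model.matrix(model1)
X <- X[, !colnames(X) %in% c("SubjectID2", "SubjectID3", "SubjectID4")]
y <- model.response(model.frame(model1))
model2 <- lm(y ~ X - 1)   # -1 because X already contains its own intercept column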

Why use heteroscedasticity-robust standard errors in logistic regression?

I am following a course on R. At the moment, we are working with logistic regression. The basic form we are taught is this one:
model <- glm(
  formula = y ~ x1 + x2,
  data = df,
  family = quasibinomial(link = "logit"),
  weights = weight
)
This makes perfect sense to me. However, we are then recommended to use the following to get coefficients with heteroscedasticity-robust inference:
model_rob <- lmtest::coeftest(model, sandwich::vcovHC(model))
This confuses me a bit. Reading about vcovHC, it states that it produces a "heteroskedasticity-consistent" covariance estimate. Why would you do this for logistic regression? I thought it did not assume homoscedasticity. Also, I am not sure what coeftest does.
Thank you!
You're right: homoscedasticity (residuals having the same variance at each level of the predictor) is not an assumption of logistic regression. However, the binary response in logistic regression is inherently heteroscedastic (its variance depends on the fitted probability), which is what a "heteroscedasticity-consistent" estimator accounts for. As @MrFlick already pointed out, if you would like more information on that topic, Cross Validated is likely the place to ask. coeftest produces Wald test statistics for the estimated coefficients; these tests tell you whether a predictor (independent variable) appears to be associated with the dependent variable in your data.
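If it helps to see the two side by side, a small sketch comparing the model-based and the robust standard errors for the same model object as above:
summary(model)$coefficients                               # model-based (quasibinomial) standard errors
lmtest::coeftest(model, vcov. = sandwich::vcovHC(model))  # sandwich / heteroscedasticity-consistent SEs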

Mixed Modelling - Different Results between lme and lmer functions

I am currently working through Andy Field's book, Discovering Statistics Using R. Chapter 14 is on Mixed Modelling and he uses the lme function from the nlme package.
The model he creates, using speed dating data, is such:
speedDateModel <- lme(dateRating ~ looks + personality +
                        gender + looks:gender + personality:gender +
                        looks:personality,
                      data = speedData,
                      random = ~ 1 | participant/looks/personality)
I tried to recreate a similar model using the lmer function from the lme4 package; however, my results are different. I thought I had the proper syntax, but maybe not?
speedDateModel.2 <- lmer(dateRating ~ looks + personality + gender +
                           looks:gender + personality:gender +
                           (1 | participant) + (1 | looks) + (1 | personality),
                         data = speedData, REML = FALSE)
Also, when I look at the coefficients of these models I notice that they only include random intercepts for each participant. I was then trying to create a model that produces both random intercepts and slopes, but I can't seem to get the syntax right in either function. Any help would be greatly appreciated.
The only difference between the lme and the corresponding lmer formula should be that the random and fixed components are aggregated into a single formula:
dateRating ~ looks + personality +
  gender + looks:gender + personality:gender +
  looks:personality + (1 | participant/looks/personality)
Using (1|participant) + (1|looks) + (1|personality) is only equivalent if looks and personality have unique values at each nested level.
It's not clear which continuous variable you want to use to define your slopes: if you have a continuous variable x and groups g, then (x|g), or equivalently (1+x|g), will give you a random-slopes model (x should also be included in the fixed-effects part of the model, i.e. the full formula should be y ~ x + (x|g) ...).
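For concreteness, a minimal sketch of that random-slopes syntax with lme4, using generic placeholder names (y, x, g and d are not columns of the speed-dating data):
library(lme4)

# random intercept and a random slope for x within each level of the grouping factor g
m_slopes <- lmer(y ~ x + (1 + x | g), data = d)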
Update: I got the data, or rather a script file that allows one to reconstruct the data, from here. Field makes a common mistake in his book, one I have made several times in the past: since there is only a single observation in the data set for each participant/looks/personality combination, the three-way interaction (the lowest grouping level) has one level per observation. In a linear mixed model, this means the variance at the lowest level of nesting is confounded with the residual variance.
You can see this in two ways:
lme appears to fit the model just fine, but if you try to calculate confidence intervals via intervals(), you get
intervals(speedDateModel)
## Error in intervals.lme(speedDateModel) :
## cannot get confidence intervals on var-cov components:
## Non-positive definite approximate variance-covariance
If you try this with lmer you get:
## Error: number of levels of each grouping factor
## must be < number of observations
In both cases, this is a clue that something's wrong. (You can overcome this in lmer if you really want to: see ?lmerControl.)
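If you really do want lmer to fit the over-parameterised model, a hedged sketch of the relevant override (see ?lmerControl for the exact check names; this assumes check.nobs.vs.nlev is the check being triggered):
speedDateModel.3 <- lmer(dateRating ~ looks + personality + gender +
                           looks:gender + personality:gender + looks:personality +
                           (1 | participant/looks/personality),
                         data = speedData,
                         control = lmerControl(check.nobs.vs.nlev = "ignore",
                                               check.nobs.vs.nRE  = "ignore"))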
If we leave out the lowest grouping level, everything works fine:
sd2 <- lmer(dateRating ~ looks + personality +
              gender + looks:gender + personality:gender +
              looks:personality +
              (1 | participant/looks),
            data = speedData)
Compare lmer and lme fixed effects:
all.equal(fixef(sd2),fixef(speedDateModel)) ## TRUE
The starling example here gives another example and further explanation of this issue.

Resources