Having trouble with overfitting in simple R logistic regression

I am a newbie to R and I am trying to perform a logistic regression on a set of clinical data.
My independent variables are AGE, TEMP, WBC, NLR, CRP, PCT, ESR, IL6, and TIME.
My dependent variable is CRKP, which is binomial.
After fitting the model with glm(), I got this warning message:
glm.fit <- glm(CRKP ~ AGE + TEMP + WBC + NLR + CRP + PCT + ESR, data = cv, family = binomial, subset=train)
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
I looked up potential causes and used the corrplot function to check whether multicollinearity could be leading to overfitting.
This is the correlation plot I got.
The correlation plot shows that ESR and PCT are highly correlated, and likewise CRP and IL6. But they are all important clinical indicators that I would like to keep in the model.
I tried using VIF to selectively discard variables, but wouldn't that introduce bias? I would also have to sacrifice some of my variables of interest.
Does anyone know what I can do? Please help. Thank you!

You have a multicollinearity problem but don't want to drop variables. In that case you can use Partial Least Squares (PLS) regression or Principal Component Regression (PCR), both of which combine the correlated predictors into uncorrelated components instead of discarding any of them.
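As a rough illustration (not from the original answer), here is a minimal sketch of principal-component logistic regression using base R's prcomp(), assuming cv is a data frame containing the predictor columns and the binary outcome CRKP named as in the question:

# predictors named in the question; adjust to your data
preds <- c("AGE", "TEMP", "WBC", "NLR", "CRP", "PCT", "ESR", "IL6")
pca <- prcomp(cv[, preds], center = TRUE, scale. = TRUE)

# keep enough components to explain, say, 90% of the predictor variance
ncomp <- which(cumsum(pca$sdev^2) / sum(pca$sdev^2) >= 0.90)[1]
scores <- as.data.frame(pca$x[, 1:ncomp, drop = FALSE])
scores$CRKP <- cv$CRKP

# logistic regression on the (uncorrelated) component scores
pcr_fit <- glm(CRKP ~ ., data = scores, family = binomial)
summary(pcr_fit)

The components are linear combinations of all the original predictors, so no clinical variable is dropped outright; for a PLS variant that also uses the outcome when constructing the components, the pls or plsRglm packages are possible starting points.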

Related

Longitudinal analysis using sampling weights in R

I have longitudinal data from two surveys and I want to do a pre-post analysis. Normally, I would use survey::svyglm() or svyVGAM::svy_vglm (for multinomial family) to include sampling weights, but these functions don't account for the random effects. On the other hand, lme4::lmer accounts for the repeated measures, but not the sampling weights.
For continuous outcomes, I understand that I can do
w_data_wide <- svydesign(ids = ~1, data = data_wide, weights = data_wide$weight)
svyglm((post-pre) ~ group, w_data_wide)
and get the same estimates that I would get if I could use lmer(outcome ~ group*time + (1|id), data_long) with weights [please correct me if I'm wrong].
However, for categorical outcomes, I don't know how to do the analysis. WeMix::mix() has a weights parameter, but I'm not sure whether it treats them as sampling weights. It also doesn't support the multinomial family.
So, to sum up: can you enlighten me on how to do a pre-post analysis of categorical outcomes with 2 or more levels? Any tips about packages/functions in R and how to use/write them would be appreciated.
I give below some data sets with binomial and multinomial outcomes:
library(data.table)
set.seed(1)
data_long <- data.table(
  id = rep(1:5, 2),
  time = c(rep("Pre", 5), rep("Post", 5)),
  outcome1 = sample(c("Yes", "No"), 10, replace = TRUE),
  outcome2 = sample(c("Low", "Medium", "High"), 10, replace = TRUE),
  outcome3 = rnorm(10),
  group = rep(sample(c("Man", "Woman"), 5, replace = TRUE), 2),
  weight = rep(c(1, 0.5, 1.5, 0.75, 1.25), 2)
)
data_wide <- dcast(data_long, id~time, value.var = c('outcome1','outcome2','outcome3','group','weight'))[, `:=` (weight_Post = NULL, group_Post = NULL)]
EDIT
As I said below in the comments, I've been using lmer and glmer with the variables used to calculate the weights as predictors. It turns out that glmer reports a lot of problems (convergence, high eigenvalues...), so I took another look at @ThomasLumley's answer in this post and others (https://stat.ethz.ch/pipermail/r-help/2012-June/315529.html | https://stats.stackexchange.com/questions/89204/fitting-multilevel-models-to-complex-survey-data-in-r).
So my question now is whether I can use participant id as clusters in svydesign
library(survey)
w_data_long_cluster <- svydesign(ids = ~id, data = data_long, weights = data_long$weight)
summary(svyglm(factor(outcome1) ~ group*time, w_data_long_cluster, family="quasibinomial"))
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)         1.875e+01  1.000e+00  18.746   0.0339 *
groupWoman         -1.903e+01  1.536e+00 -12.394   0.0513 .
timePre             5.443e-09  5.443e-09   1.000   0.5000
groupWoman:timePre  2.877e-01  1.143e+00   0.252   0.8431
and still interpret groupWoman:timePre as the difference in the average rate of change/improvement in the outcome over time between sex groups, as if I were using mixed models with participants as random effects.
Thank you once again!
A linear model with svyglm does not give the same parameter estimates as lme4::lmer. It does estimate the same parameters as lme4::lmer if the model is correctly specified, though.
Generalised linear models with svyglm or svy_vglm don't estimate the same parameters as lme4::glmer, as you note. However, they do estimate perfectly good regression parameters, and if you aren't specifically interested in the variance components or in estimating the realised random effects (BLUPs) I would recommend just using svyglm or svy_vglm.
Another option, if you have non-survey software for random-effects versions of the models, is to use that. If you scale the weights to sum to the sample size, and if all the clustering in the design is modelled by random effects in the model, you will get at least a reasonable approximation to valid inference. That's what I've seen recommended for Bayesian survey modelling, for example.
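For what it's worth, a minimal sketch of that scaled-weights approach using the data_long object from the question. Note that lme4::glmer treats its weights argument as prior weights, so with non-integer sampling weights this is only the approximation described above (and glmer will warn about non-integer numbers of successes):

library(lme4)
library(data.table)

# rescale the sampling weights to sum to the sample size
data_long[, w_scaled := weight * .N / sum(weight)]

# random intercept per participant, scaled weights passed to glmer
fit_approx <- glmer(factor(outcome1) ~ group * time + (1 | id),
                    data = data_long, family = binomial,
                    weights = w_scaled)
summary(fit_approx)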

What weights mean in WeightIt package

I want to balance my data using the WeightIt package in R (method = "ebal"). I have used code similar to the one below:
#Balancing covariates between treatment groups (binary)
W1 <- weightit(treat ~ age + educ + married + nodegree + re74, data = lalonde, method = "ebal", estimand = "ATT")
match.data(W1)
The result is my data table with an additional column called weights. What do those weights mean, and how do I go on from here? (My next step would be to run a logit regression on the balanced dataset.)
Thank you so much for helping!
weightit() estimates weights that, when applied to a dataset, yield balance in the treatment groups. To estimate effects in the weighted sample, include the weights in a regression of the outcome on the treatment. This is demonstrated in the WeightIt vignette.
You should not use match.data() with WeightIt. I'm not sure where you found the code to do that. match.data() is for use with MatchIt, which is a different package with its own functions. The fact that match.data() happened to work with WeightIt is unintended behavior and should not be relied on.
To estimate the effect of the treatment on a binary outcome (which I'll denote as Y in the code below and assume is in the lalonde dataset, even though in reality it is not), you would run the following after running the first line in your code above:
fit <- glm(Y ~ treat, data = lalonde, weights = W1$weights, family = binomial)
lmtest::coeftest(fit, vcov. = sandwich::vcovHC)
The coefficient on treat is the log odds ratio of the outcome.
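Not part of the original answer, but as a quick follow-up you could exponentiate that coefficient (and its robust confidence interval) to report an odds ratio instead of a log odds ratio:

exp(coef(fit)["treat"])                                      # odds ratio
exp(lmtest::coefci(fit, "treat", vcov. = sandwich::vcovHC))  # robust 95% CI on the OR scale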

How to plot interaction of logistic regression model (glm) with multiple imputed data (MICE)?

I created an interaction term with iv*sex and imputed the data with mice. Then I used the imputed data to run a logistic regression model (glm):
model <- with(data=imp, glm(dv~control+iv+sex+iv*sex, family="binomial"))
The following are the abbreviations of the variable names:
dependent variable=dv, independent variable=iv, moderator=sex, interaction term= iv*sex
There is a significant interaction for iv*sex and I would like to plot a graph of the interaction, but I couldn't find out how. Any solutions would be greatly appreciated. Thanks!
I've just run into the same issue and solved it with the effects package.
e <- effects::effect("iv*sex", model)
e <- as.data.frame(e)
ggplot2::ggplot(e, ggplot2::aes(iv, fit, color = sex, group = sex)) +
  ggplot2::geom_point() +
  ggplot2::geom_line()
Here "fit" is the fitted value of your dependent variable, i.e. "dv" in your case.

Model Analysis in R (Logistic Regression)

I have a data file (1 million rows) that has one outcome variable, status (Yes/No), three continuous variables, and 5 nominal variables (5 categories in each variable).
I want to predict the outcome, i.e. status.
I wanted to know which type of analysis is good for building the model.
I have seen logit, probit, and logistic regression. I am confused about where to start and how to identify which variables are most likely to be useful for the analysis.
data file:
gender,region,age,company,speciality,jobrole,diag,labs,orders,status
M,west,41,PA,FPC,Assistant,code18,27,3,yes
M,Southwest,65,CV,FPC,Worker,code18,69,11,no
M,South,27,DV,IMC,Assistant,invalid,62,13,no
M,Southwest,18,CV,IMC,Worker,code8,6,1,yes
PS: Using R language.
Any help would be greatly appreciated Thanks !
Of the three, most people start their analysis with logistic regression.
Note that logistic and logit regression are the same thing.
When deciding between logistic and probit, go for logistic.
Probit usually runs faster, while logistic regression has the edge in interpretability (its coefficients are log odds ratios).
Now, to settle on variables: you can vary the number of variables you include in your model.
model1 <- glm(status ~., data = df, family = binomial(link = 'logit'))
Now check the model summary to assess the importance of the predictor variables.
model2 <- glm(status ~ gender + region + age + company + speciality + jobrole + diag + labs, data = df, family = binomial(link = 'logit'))
By reducing the number of variables you will be better able to identify which variables are important.
Also, ensure that you have performed data cleaning prior to this.
Avoid including highly correlated variables; you can check for them using cor(), as in the sketch below.
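For example, a minimal sketch of that check on the continuous columns of the sample data (age, labs, orders), assuming the file has been read into the df used above:

num_vars <- df[, c("age", "labs", "orders")]
round(cor(num_vars, use = "pairwise.complete.obs"), 2)

# optional visual check with the corrplot package
# corrplot::corrplot(cor(num_vars, use = "pairwise.complete.obs"))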

Mixed Modelling - Different Results between lme and lmer functions

I am currently working through Andy Field's book, Discovering Statistics Using R. Chapter 14 is on Mixed Modelling and he uses the lme function from the nlme package.
The model he creates, using speed-dating data, is as follows:
speedDateModel <- lme(dateRating ~ looks + personality +
                        gender + looks:gender + personality:gender +
                        looks:personality,
                      random = ~ 1 | participant/looks/personality)
I tried to recreate a similar model using the lmer function from the lme4 package; however, my results are different. I thought I had the proper syntax, but maybe not?
speedDateModel.2 <- lmer(dateRating ~ looks + personality + gender +
                           looks:gender + personality:gender +
                           (1 | participant) + (1 | looks) + (1 | personality),
                         data = speedData, REML = FALSE)
Also, when I look at the coefficients of these models, I notice that they only produce random intercepts for each participant. I was then trying to create a model that produces both random intercepts and slopes. I can't seem to get the syntax right for either function to do this. Any help would be greatly appreciated.
The only difference between the lme and the corresponding lmer formula should be that the random and fixed components are aggregated into a single formula:
dateRating ~ looks + personality +
  gender + looks:gender + personality:gender +
  looks:personality + (1 | participant/looks/personality)
using (1|participant) + (1|looks) + (1|personality) is only equivalent if looks and personality have unique values at each nested level.
It's not clear which continuous variable you want to define your slopes with: if you have a continuous variable x and groups g, then (x|g), or equivalently (1+x|g), will give you a random-slopes model (x should also be included in the fixed-effects part of the model, i.e. the full formula should be y ~ x + (x|g) ...)
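For illustration, a minimal sketch of that random-slopes syntax in both packages, with a placeholder response y, continuous predictor x, grouping factor g, and data frame dat:

library(lme4)
# random intercept and random slope for x within each level of g
m_lmer <- lmer(y ~ x + (1 + x | g), data = dat)

library(nlme)
# the nlme equivalent
m_lme <- lme(y ~ x, random = ~ 1 + x | g, data = dat)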
update: I got the data, or rather a script file that allows one to reconstruct the data, from here. Field makes a common mistake in his book, which I have made several times in the past: since there is only a single observation in the data set for each participant/looks/personality combination, the three-way interaction has one level per observation. In a linear mixed model, this means the variance at the lowest level of nesting will be confounded with the residual variance.
You can see this in two ways:
lme appears to fit the model just fine, but if you try to calculate confidence intervals via intervals(), you get
intervals(speedDateModel)
## Error in intervals.lme(speedDateModel) :
## cannot get confidence intervals on var-cov components:
## Non-positive definite approximate variance-covariance
If you try this with lmer you get:
## Error: number of levels of each grouping factor
## must be < number of observations
In both cases, this is a clue that something's wrong. (You can overcome this in lmer if you really want to: see ?lmerControl and the sketch below.)
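For completeness, a sketch of what that lmerControl override might look like (my assumption is that you would relax the observations-vs-levels checks; not recommended here, since the lowest nesting level is confounded with the residual variance):

sd_full <- lmer(dateRating ~ looks + personality +
                  gender + looks:gender + personality:gender +
                  looks:personality +
                  (1 | participant/looks/personality),
                data = speedData,
                control = lmerControl(check.nobs.vs.nlev = "ignore",
                                      check.nobs.vs.nRE  = "ignore"))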
If we leave out the lowest grouping level, everything works fine:
sd2 <- lmer(dateRating ~ looks + personality +
              gender + looks:gender + personality:gender +
              looks:personality +
              (1 | participant/looks),
            data = speedData)
Compare lmer and lme fixed effects:
all.equal(fixef(sd2),fixef(speedDateModel)) ## TRUE
The starling example here gives another example and further explanation of this issue.
