Force inclusion of observations with missing data in lmer - r

I want to fit a linear mixed-effects model using lme4::lmer without discarding observations with missing data. That is, I want lmer to go ahead and maximize the likelihood using all the data.
Am I correct in thinking that using na.pass produces this behavior? This unanswered question is making me wonder if this might be wrong.

lmer (like most model functions) can't deal with missing data. To illustrate that:
library(lme4)
data(Orthodont, package="nlme")
Orthodont$nsex <- as.numeric(Orthodont$Sex=="Male")
Orthodont$nsexage <- with(Orthodont, nsex*age)
Orthodont[1, 2] <- NA  # introduce a missing value in age
lmer(distance ~ age + (age|Subject) + (0+nsex|Subject) +
     (0 + nsexage|Subject), data=Orthodont, na.action = na.pass)
#Error in lme4::lFormula(formula = distance ~ age + (age | Subject) + (0 + :
# NA in Z (random-effects model matrix): please use "na.action='na.omit'" or "na.action='na.exclude'"
If you don't want to discard observations with missing data, your only option is imputation. Check out packages like mice or Amelia.
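To see what na.pass actually does, here is a minimal sketch using only base R's model.frame (the object names d, mf_pass and mf_omit are made up for illustration). na.pass keeps the incomplete row and hands the NA on to the model matrices, which is what lmer then refuses to fit, whereas na.omit drops the row up front:

```r
# Orthodont ships with nlme; work on a plain data.frame copy
data(Orthodont, package = "nlme")
d <- as.data.frame(Orthodont)
d$age[1] <- NA                      # same missing value as in the example above

# na.pass keeps the incomplete row, NA included
mf_pass <- model.frame(distance ~ age, data = d, na.action = na.pass)
# na.omit removes any row with an NA in a model variable
mf_omit <- model.frame(distance ~ age, data = d, na.action = na.omit)

nrow(mf_pass)        # all rows, one of them with age = NA
nrow(mf_omit)        # one row fewer
anyNA(mf_pass$age)   # TRUE: the NA is still there
```

So na.pass does not make the fit "use" the incomplete observation; it only postpones the problem to the model-matrix stage.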

Related

Permutation test error for likelihood ratio test of mixed model in R: permlmer, lmer, lme4, predictmeans

I would like to test the main effect of a categorical variable using a permutation test on a likelihood ratio test. I have a continuous outcome and a dichotomous grouping predictor and a categorical time predictor (Day, 5 levels).
Data is temporarily available in rda format via this Drive link.
library(lme4)
lmer1 <- lmer(outcome ~ Group*Day + (1 | ID), data = data, REML = FALSE, na.action = na.exclude)
lmer2 <- lmer(outcome ~ Group + (1 | ID), data = data, REML = FALSE, na.action = na.exclude)
library(predictmeans)
permlmer(lmer2,lmer1)
However, this code gives me the following error:
Error in density.default(c(lrtest1, lrtest), kernel = "epanechnikov") :
need at least 2 points to select a bandwidth automatically
The following code does work, but I don't believe it gives me a permutation version of the LR test:
library(nlme)
lme1 <- lme(outcome ~ Genotype*Day,
random = ~1 | ID,
data = data,
na.action = na.exclude)
library(pgirmess)
PermTest(lme1)
Can anyone point out why I get the "epanechnikov" error when using the permlmer function?
Thank you!
The issue is with missing values (NA/NaN): remove all rows that contain them from your dataset and rerun the models. I had the same problem and that solved it.
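A minimal sketch of that fix, assuming the model variables are named as in the question (outcome, Group, Day, ID); the toy data frame below is invented purely so the snippet runs on its own. The idea is to drop incomplete rows once, up front, so that both models (and every permutation refit) see exactly the same rows:

```r
# Toy stand-in for the real data (invented values)
set.seed(42)
toy <- data.frame(
  ID      = factor(rep(1:10, each = 5)),
  Group   = factor(rep(c("A", "B"), each = 25)),
  Day     = factor(rep(1:5, times = 10)),
  outcome = rnorm(50)
)
toy$outcome[c(3, 17, 40)] <- NA     # a few missing outcomes

# Keep only rows that are complete for every variable used in the models
model_vars <- c("outcome", "Group", "Day", "ID")
toy_cc <- toy[complete.cases(toy[, model_vars]), ]

nrow(toy)      # 50
nrow(toy_cc)   # 47
# The models would then be refitted with data = toy_cc and no na.action,
# e.g. lmer(outcome ~ Group * Day + (1 | ID), data = toy_cc, REML = FALSE)
```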

Stepwise regression in r with mixed models: numbers of rows changing [duplicate]

I want to run a stepwise regression in R to choose the best-fitting model. My code is below:
library(MASS)  # provides stepAIC
full.modelfixed <- glm(died_ed ~ age_1 + gender + race + insurance + injury + ais + blunt_pen +
                         comorbid + iss + min_dist + pop_dens_new + age_mdn + male_pct +
                         pop_wht_pct + pop_blk_pct + unemp_pct + pov_100x_npct +
                         urban_pct, data = trauma, family = binomial(link = 'logit'), na.action = na.exclude)
reduced.modelfixed <- stepAIC(full.modelfixed, direction = "backward")
This gives the following error message:
Error in stepAIC(full.modelfixed, direction = "backward") :
number of rows in use has changed: remove missing values?
Almost every variable in the data has some missing values, so I cannot simply delete all rows with missing values (data = na.omit(data)).
Any idea on how to fix this?
Thanks!!
This should probably be in a stats forum (stats.stackexchange) but briefly, there are a number of considerations.
The main one is that when comparing two models they need to be fitted on the same dataset (i.e., on exactly the same rows).
For example:
glm1 <- glm(Dependent~indep1+indep2+indep3, family = binomial, data = data)
glm2 <- glm(Dependent~indep1+indep2, family = binomial, data = data)
Now imagine that we are missing values of indep3 but not indep1 or indep2.
When we run glm1 we are running it on a smaller dataset: the rows for which we have the dependent variable and all three independent ones (i.e., we exclude any rows where indep3 is missing).
When we run glm2, the rows missing a value for indep3 are included, because those rows do contain dependent, indep1 and indep2, which are the variables in the model.
We can no longer directly compare models as they are fitted on different datasets.
I think broadly you can either
1) Limit to data which is complete
2) If appropriate consider multiple imputation
Hope that helps.
You can use the mice package to do imputation; working with the imputed dataset will then not give you this error.
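Option 1 (limit to complete data) can be sketched as follows. This is only an illustration under stated assumptions: the data frame toy and its columns are simulated, and base R's step() is swapped in for MASS::stepAIC to keep the snippet dependency-free (both stop with the same "number of rows in use has changed" message when the row count differs between candidate models).

```r
# Simulated data (invented): only x3 has missing values
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(0.8 * toy$x1))
toy$x3[sample(n, 20)] <- NA

# Keep only rows that are complete for every variable in the FULL model,
# so that dropping a term can never change the rows in use
vars <- c("y", "x1", "x2", "x3")
toy_cc <- toy[complete.cases(toy[, vars]), ]

full    <- glm(y ~ x1 + x2 + x3, family = binomial, data = toy_cc)
reduced <- step(full, direction = "backward", trace = 0)

nobs(full) == nobs(reduced)   # TRUE: both models use the same rows
```

Whether discarding the incomplete rows is acceptable depends on how much data is lost and why it is missing; with missing values in almost every variable, imputation may be the more sensible route.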

R: Using a variable with less observations in a regression (plm)

I have been trying to deal with this for a while now with no luck. Essentially, what I am doing is a two-stage least squares on some panel data. To do this I am using the plm package. What I want to do is
1) Do a 2SLS
2) Get the residuals from the 2SLS in 1)
3) Use these residuals as an instrument in a different 2SLS
The issue I have is that in the first 2SLS the number of observations used is smaller than the total number of observations in the dataset, so my residuals vector is too short and I get the following error
Error in model.frame.default(terms(formula, lhs = lhs, rhs = rhs, data = data, :
variable lengths differ (found for 'ivreg.2.a$residuals')
Here is the code I am trying to run for reference, let me know if you need any more details. I really just need my residual vector to be the same length as the data used in the first 2SLS. For reference my data has 1713 observations, however, only 1550 get used in the regression and as a result my residuals vector is length 1550. My code for the two 2SLS regressions is below.
library(plm)
ivreg.2.a = plm(formula = diff(loda) ~ factor(year) + diff(lgdp) |
                  index_g_l + diff(lcru_l) + diff(lcru_l_sq) + factor(year),
                index = c("country", "year"), model = "within",
                data = panel[complete.cases(panel[, c(1,2,3,4,5,7)]),])
ivreg.2.a = plm(formula = diff(lgdp) ~ factor(year) + index_g_l + diff(lcru_l) + diff(lcru_l_sq) + diff(loda) |
                  index_g_l + diff(lcru_l) + diff(lcru_l_sq) + factor(year) + ivreg.2.a$residuals,
                index = c("country", "year"), model = "within",
                data = panel[complete.cases(panel[, c(1,2,3,4,5,7)]),])
Let me know if you need anything else.
I assume the 163 observations are dropped because they have NA in one of the relevant variables. Most *lm functions in R have an na.action argument, which can be used to pad the residuals to the correct length. E.g., when observation 3 is missing,
residuals(lm(formula, data, na.action=na.omit)) # 1 2 4
residuals(lm(formula, data, na.action=na.exclude)) # 1 2 NA 4
The plm documentation, however, says that this argument is "currently not fully supported", so it would be simpler to filter those 1550 rows into a new data frame first and do all subsequent work on that.
BTW, if plm behaves like lm, you shouldn't need to specify complete.cases for it to work, as it should just skip anything with NAs.
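A runnable version of the lm illustration above, as a minimal sketch with invented toy data (d, x, y are made-up names). It shows the difference between na.omit and na.exclude for lm, and the "filter first" alternative; it says nothing about how plm itself handles na.action.

```r
# Toy data (invented): observation 3 has a missing predictor
d <- data.frame(x = c(1, 2, NA, 4, 5),
                y = c(2.1, 3.9, 6.2, 8.1, 9.8))

r_omit <- residuals(lm(y ~ x, data = d, na.action = na.omit))
r_excl <- residuals(lm(y ~ x, data = d, na.action = na.exclude))

length(r_omit)   # 4: the incomplete row is dropped
length(r_excl)   # 5: padded with NA at position 3

# The padded vector lines up with the original rows
d$res <- r_excl

# Alternative: filter once, then do all further work on the same rows
d_cc <- d[complete.cases(d), ]
d_cc$res <- residuals(lm(y ~ x, data = d_cc))
```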

R - differences between cindex function from pec package and coxph

I'm comparing the cindex function from the pec package with the resulting concordance index from coxph (survival package).
1) First the results between these two functions are different
library(pec)
library(survival)
library(prodlim)
# Simulate survival data
set.seed(12)
dat <- SimSurv(1000)
# C-index from coxph
mod1 <- coxph(Surv(time,status)~X1+X2, data=dat)
summary(mod1)$concordance[1]
0.846249
# C-index from cindex
cindex(mod1,formula=Surv(time,status)~X1+X2,data=dat)
AppCindex Pairs Concordant
coxph.model 83 915194 759712
2) If I use a counting process format, the cindex function gives me an error
data(Melanoma)
# Calculate age at entry
Melanoma$age_entry <- Melanoma$age-(Melanoma$time/365.25)
# Use just one outcome (no competing risk scenario)
Melanoma$out <- ifelse(Melanoma$status==1,1,0)
mod1 <- coxph(Surv(age_entry,age,out)~ulcer+thick, data=Melanoma)
summary(mod1)$concordance[1]
0.7661805
cindex(mod1,formula=Surv(age_entry,age,out)~ulcer+thick,data=Melanoma)
Error: is.null(entry) | all(entry <= time) is not TRUE
Does anyone have a clue 1) why the two C-indexes are different and 2) whether it is possible to use a counting process format in the cindex function?
thanks!
Re: 1) the concordance reported by coxph differs from pec::cindex
It looks like the two functions handle ties differently by default when computing the C-index.
Formula in coxph (see ?survConcordance()):
(agree + tied/2)/(agree + disagree + tied).
Formula in pec::cindex:
agree/(agree + disagree)
Try computing these manually from the pec::cindex output. For coxph, obtain the agree/disagree counts as follows (via ?survConcordance):
fit <- coxph(Surv(time, status) ~ x1, data=df)
survConcordance(Surv(time, status) ~predict(fit), df)
References
?coxph.object has:
concordance: the concordance, as computed by survConcordance.
?survConcordance has:
The final concordance is (agree + tied/2)/(agree + disagree + tied).
NOTE: read the full help file for important notes on these ties.
I got pec's formula by manually checking the numbers.
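To make the effect of the two formulas concrete, here is a small hand-rolled sketch. It is not the implementation used by either package: cindex_counts is a hypothetical helper, the data are invented, and it ignores complications such as tied event times. It only counts usable pairs and then applies the two formulas quoted above.

```r
# Hypothetical helper: count agreeing, disagreeing and tied pairs.
# A pair (i, j) is usable when subject i has an observed event before time j.
cindex_counts <- function(time, status, risk) {
  agree <- disagree <- tied <- 0
  n <- length(time)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (status[i] == 1 && time[i] < time[j]) {
        if (risk[i] > risk[j]) {
          agree <- agree + 1          # earlier event has the higher predicted risk
        } else if (risk[i] < risk[j]) {
          disagree <- disagree + 1
        } else {
          tied <- tied + 1            # equal predicted risk
        }
      }
    }
  }
  c(agree = agree, disagree = disagree, tied = tied)
}

# Invented toy data: the third subject is censored
time   <- c(2, 4, 6, 8, 10)
status <- c(1, 1, 0, 1, 1)
risk   <- c(3, 2, 2, 1, 2)

cnt <- cindex_counts(time, status, risk)
cnt   # agree = 5, disagree = 1, tied = 2

# Ties count as half (the formula quoted from ?survConcordance)
c_half <- unname((cnt["agree"] + cnt["tied"] / 2) / sum(cnt))
# Ties dropped (the formula inferred above for pec::cindex)
c_drop <- unname(cnt["agree"] / (cnt["agree"] + cnt["disagree"]))

c_half   # 0.75
c_drop   # 5/6, about 0.833
```

With many tied predictions the two conventions can differ noticeably, which is consistent with the small gap seen in the question.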

how to do predictions from cox survival model with time varying coefficients

I have built a survival cox-model, which includes a covariate * time interaction (non-proportionality detected).
I am now wondering how could I most easily get survival predictions from my model.
My model was specified:
coxph(formula = Surv(event_time_mod, event_indicator_mod) ~ Sex +
ageC + HHcat_alt + Main_Branch + Acute_seizure + TreatmentType_binary +
ICH + IVH_dummy + IVH_dummy:log(event_time_mod)
And now I was hoping to get a prediction using survfit, passing a new data frame (newdata) with the combination of variables I want predictions for:
survfit(cox, newdata=new)
Now as I have event_time_mod in the right-hand side in my model I need to specify it in the new data frame passed on to survfit. This event_time would need to be set at individual times of the predictions. Is there an easy way to specify event_time_mod to be the correct time to survfit?
Or are there any other options for achieving predictions from my model?
Of course I could create as many rows in the new data frame as there are distinct times in the predictions and setting to event_time_mod to correct values but it feels really cumbersome and I thought that there must be a better way.
You have done what is referred to as
An obvious but incorrect approach ...
in the vignette "Using Time Dependent Covariates and Time Dependent Coefficients in the Cox Model" (version 2.41-3 of the R survival package). Instead, you should use the time-transform functionality, i.e., the tt function described in the same vignette. The code would be similar to this example from the vignette:
> library(survival)
> vfit3 <- coxph(Surv(time, status) ~ trt + prior + karno + tt(karno),
+ data=veteran,
+ tt = function(x, t, ...) x * log(t+20))
>
> vfit3
Call:
coxph(formula = Surv(time, status) ~ trt + prior + karno + tt(karno),
data = veteran, tt = function(x, t, ...) x * log(t + 20))
              coef exp(coef) se(coef)     z       p
trt        0.01648   1.01661  0.19071  0.09  0.9311
prior     -0.00932   0.99073  0.02030 -0.46  0.6462
karno     -0.12466   0.88279  0.02879 -4.33 1.5e-05
tt(karno)  0.02131   1.02154  0.00661  3.23  0.0013
Likelihood ratio test=53.8 on 4 df, p=5.7e-11
n= 137, number of events= 128
However, survfit does not work when the model has a tt term:
> survfit(vfit3, veteran[1, ])
Error in survfit.coxph(vfit3, veteran[1, ]) :
The survfit function can not yet process coxph models with a tt term
However, you can easily get out the terms, linear predictor or mean response with predict. Further, you can create the term over time for the tt term using the answer here.
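As a small worked illustration of what the fitted tt term means, the time-varying coefficient of karno in vfit3 is beta(t) = b_karno + b_tt * log(t + 20). The sketch below just plugs in the two coefficients printed above; the time grid and the object names are arbitrary choices for illustration, not something taken from the vignette.

```r
# Coefficients copied from the printed vfit3 output above
b_karno <- -0.12466
b_tt    <-  0.02131

# Arbitrary illustrative time points (days)
t_grid <- c(0, 30, 90, 180, 365)

# Time-varying log hazard ratio per unit of karno
beta_t <- b_karno + b_tt * log(t_grid + 20)

data.frame(time = t_grid, beta = beta_t, hazard_ratio = exp(beta_t))
```

The coefficient starts out negative and moves toward zero as time increases, i.e., the protective effect of a higher Karnofsky score fades over follow-up. The same construction gives the tt contribution to the linear predictor at any prediction time.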
