Visualising logistic regression using the effects package in R - r

I am using the effects package in R to plot the effects of categorical and numerical predictors in a binomial logistic regression estimated using the lme4 package. My dependent variable is the presence or absence of a virus in an individual animal, and my predictors are various individual traits (e.g. sex, age, month/year captured, presence of parasites, scaled mass index (SMI)), with site as a random effect.
When I use the allEffects function on my regression, I get the plots below. When compared to the model summary output below, you can see that the slope of each line appears to be zero, regardless of the estimated coefficients, and there is something strange going on with the scale of the y-axes where the ticks and tick labels appear to be overwritten on the same point.
Here is my code for the model and the summary output:
library(lme4)
library(effects)
virus1.mod <- glmer(virus1 ~ age + sex + month.yr + parasite + SMI + (1|site), data=virus1data, family=binomial)
virus1.effects<-allEffects(virus1.mod)
plot(virus1.effects, ylab="Probability(infected)", rug=FALSE)
> summary(virus1.mod)
Generalized linear mixed model fit by maximum likelihood ['glmerMod']
Family: binomial ( logit )
Formula: virus1 ~ age + sex + month.yr + parasite + SMI + (1 | site)
Data: virus1data
AIC BIC logLik deviance
189.5721 248.1130 -76.7860 153.5721
Random effects:
Groups Name Variance Std.Dev.
site (Intercept) 4.729e-10 2.175e-05
Number of obs: 191, groups: site, 6
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.340e+00 2.572e+00 2.076 0.03789 *
ageJ 1.126e+00 8.316e-01 1.354 0.17583
sexM -3.943e-02 4.562e-01 -0.086 0.93113
month.yrFeb-08 -2.259e+01 6.405e+04 0.000 0.99972
month.yrFeb-09 -2.201e+01 2.741e+04 -0.001 0.99936
month.yrJan-08 -2.516e+00 8.175e-01 -3.078 0.00208 **
month.yrJan-09 -2.607e+00 8.066e-01 -3.232 0.00123 **
month.yrJul-08 -1.428e+00 8.571e-01 -1.666 0.09563 .
month.yrJul-09 -2.795e+00 1.170e+00 -2.389 0.01691 *
month.yrJun-08 -2.259e+01 3.300e+04 -0.001 0.99945
month.yrMar-09 -5.451e-01 6.705e-01 -0.813 0.41622
month.yrMar-08 -1.863e+00 7.921e-01 -2.352 0.01869 *
month.yrMay-09 -6.319e-01 8.956e-01 -0.706 0.48047
month.yrMay-08 3.818e-01 1.015e+00 0.376 0.70691
month.yrSep-08 2.563e+01 5.806e+05 0.000 0.99996
parasiteTRUE -6.329e-03 4.834e-01 -0.013 0.98955
SMI -3.438e-01 1.616e-01 -2.127 0.03342 *
And str of my data frame:
> str(virus1data)
'data.frame': 191 obs. of 8 variables:
$ virus1 : Factor w/ 2 levels "0","1": 1 1 1 1 1 2 1 2 1 1 ...
$ age : Factor w/ 2 levels "A","J": 1 1 1 1 1 1 1 1 1 1 ...
$ sex : Factor w/ 2 levels "F","M": 2 2 2 2 1 1 2 1 2 2 ...
$ site : Factor w/ 6 levels "site1","site2","site3",..: 1 1 1 1 2 2 2 3 2 3 ...
$ rep : Factor w/ 7 levels "NRF","L","NR",..: 3 7 3 7 1 1 3 1 7 7 ...
$ month.yr : Factor w/ 17 levels "Feb-08","Feb-09",..: 4 5 5 5 13 7 14 9 9 9 ...
$ parasite : Factor w/ 2 levels "FALSE","TRUE": 1 1 2 1 1 2 2 1 2 1 ...
$ SMI : num 14.1 14.8 14.5 13.1 15.3 ...
- attr(*, "na.action")=Class 'omit' Named int [1:73] 6 12 13 21 22 23 24 25 26 27 ...
.. ..- attr(*, "names")= chr [1:73] "1048" "1657" "1866" "2961" ...
Without making my actual data available, does anyone have an idea of what might be causing this? I have used this function with a different dataset (same independent variables but a different virus as the response variable, and different records) without problems.
This is the first time I have posted on CV, so I hope that the question is appropriate and that I have provided enough (and the right) information.
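One thing the summary output itself hints at, offered as a guess rather than a diagnosis: several month.yr levels have estimates of roughly -22 or +25 with standard errors in the tens of thousands. That pattern often appears when a factor level contains only infected or only uninfected animals (separation), and such extreme values can stretch the axis of an effects plot so that everything else looks flat. A minimal sketch on simulated data (all names here are hypothetical, and glm is used instead of glmer to keep it self-contained) shows how to spot such levels with a cross-tabulation:

```r
set.seed(1)

# Simulated stand-in for the real data: "Feb" has no infected animals at all.
d <- data.frame(
  month = factor(rep(c("Jan", "Feb", "Mar"), each = 20),
                 levels = c("Jan", "Feb", "Mar")),
  virus = c(rbinom(20, 1, 0.5), rep(0, 20), rbinom(20, 1, 0.5))
)

# Cross-tabulate predictor level by outcome; a zero cell flags separation.
tab <- with(d, table(month, virus))
print(tab)
sep_levels <- rownames(tab)[apply(tab == 0, 1, any)]
print(sep_levels)

# The separated level gets a large negative estimate and a huge standard error.
fit <- suppressWarnings(glm(virus ~ month, family = binomial, data = d))
print(coef(summary(fit)))
```

On the data in the question, the analogous check would be with(virus1data, table(month.yr, virus1)).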

Related

"contrasts can be applied only to factors with 2 or more levels" Despite having multiple levels in each factor

I am working on a two-way mixed ANOVA using the data below, with one dependent variable, one between-subjects variable and one within-subjects variable. When I tested the normality of the residuals of the dependent variable, I found that they are not normally distributed, although at this point I am still able to perform the two-way ANOVA. However, when I apply a log10 transformation and run the script again using the log-transformed variable, I get the error "contrasts can be applied only to factors with 2 or more levels".
> str(m_runjumpFREQ)
'data.frame': 564 obs. of 8 variables:
$ ID1 : int 1 2 3 4 5 6 7 8 9 10 ...
$ ID : chr "ID1" "ID2" "ID3" "ID4" ...
$ Group : Factor w/ 2 levels "II","Non-II": 1 1 1 1 1 1 1 1 1 1 ...
$ Pos : Factor w/ 3 levels "center","forward",..: 2 1 2 3 2 2 1 3 2 2 ...
$ Match_outcome : Factor w/ 2 levels "W","L": 2 2 2 2 2 2 2 2 2 1 ...
$ time : Factor w/ 8 levels "runjump_nADJmin_q1",..: 1 1 1 1 1 1 1 1 1 1 ...
$ runjump : num 0.0561 0.0858 0.0663 0.0425 0.0513 ...
$ log_runjumpFREQ: num -1.25 -1.07 -1.18 -1.37 -1.29 ...
Some answers on StackOverflow to this error have mentioned that one or more factors in the data set, used for the ANOVA, are of less than two levels. But as seen above they are not.
Another explanation I have read is that missing values may be the issue. There are NAs:
m1_nasum <- sum(is.na(m_runjumpFREQ$log_runjumpFREQ))
> m1_nasum
[1] 88
However, I get the same error even after removing the rows including NA's as follows.
> m_runjumpFREQ <- na.omit(m_runjumpFREQ)
> m1_nasum <- sum(is.na(m_runjumpFREQ$log_runjumpFREQ))
> m1_nasum
[1] 0
I can run the same script without the log transformation and it works, but with it I get the same error. The factors are the same either way and the missing values do not make a difference. Either I am making a crucial mistake or the issue is in the log transformation lines below.
log_runjumpFREQ <- log10(m_runjumpFREQ$runjump)
m_runjumpFREQ <- cbind(m_runjumpFREQ, log_runjumpFREQ)
I appreciate the help.
It is not good enough that the factors have 2 levels. In addition those levels must actually be present. For example, below f has 2 levels but only 1 is actually present.
y <- (1:6)^2
x <- 1:6
f <- factor(rep(1, 6), levels = 1:2)
nlevels(f) # f has 2 levels
## [1] 2
lm(y ~ x + f)
## Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
## contrasts can be applied only to factors with 2 or more levels
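To check this on your own data, it can help to count the levels that are actually present among the rows the model will use, i.e. after incomplete rows are dropped. A minimal sketch with a hypothetical toy data frame (the column names y and f are made up for illustration):

```r
# Toy data: f declares two levels, but every "b" row has a missing response.
d <- data.frame(
  y = c(1.2, 0.8, NA, NA, NA, 1.5),
  f = factor(c("a", "a", "b", "b", "b", "a"))
)

# Rows that survive NA removal, as lm()/aov() would see them by default.
used <- d[complete.cases(d), ]
factors <- Filter(is.factor, used)

# Compare the declared number of levels with the number actually observed.
levels_declared <- sapply(factors, nlevels)
levels_present <- sapply(factors, function(x) nlevels(droplevels(x)))

print(rbind(levels_declared, levels_present))
```

Any factor whose levels_present count is below 2 would trigger the contrasts error, even though nlevels() still reports 2 or more.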

Calling error in glmer() with family=gaussian (identity link)

I am trying to fit a GLMM to a data frame with 53 obs. of 17 variables. All variables are standardized, but they do not follow a normal distribution; there are no missing values. The str() of the data frame is something like below.
species : Factor w/ 19 levels "spp1","spp2",..: 5 18 12 15 19 4 6 14 16 5 ...
association : Factor w/ 4 levels "assoA","assoB",..: 1 1 2 2 2 3 3 4 4 1 ...
site : Factor w/ 2 levels "site1","site2": 1 1 1 1 1 1 1 1 1 2 ...
obs.no : int 1 1 1 1 1 1 1 1 1 1 ...
trait1: num 0.652 0.428 0.535 0.389 0.486 ...
trait2 : num 0.135 0.16 0.134 0.142 0.159
(clipped trait3 to 13)
I executed the following code to check the significance of differences between sites and association classes for the given trait.
model1= glmer(trait1 ~ association+site+ (1 | species),data=df6,family=gaussian)
and received the error given below.
In glmer(trait1 ~ association+site+ (1 | species),data=df6, :
calling glmer() with family=gaussian (identity link) as a shortcut to lmer() is deprecated; please call lmer() directly
After this I want to estimate parameters with Gauss-Hermite quadrature. Any recommendation to fix this error and code to execute Gauss-Hermite quadrature is very much appreciated.
You actually posted the answer. Use lmer, not glmer:
model1 = lmer(trait1~association+site+(1|species), data=df6)
Clarifying: the reason that glmer(..., family = gaussian(link = "identity")) is not allowed (and that lme4 insists you use lmer(...) instead) is that there is no point using numerical (Gauss-Hermite) quadrature for a linear mixed model (which is exactly the special case of a GLMM with a Gaussian response and an identity link); in this case the integral can be expressed in closed form as a penalized least-squares problem (conditional on the random-effects variance/covariance parameters): see Bates et al. 2015.
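A short sketch of the distinction, using the sleepstudy and cbpp example data sets that ship with lme4 rather than the asker's df6: lmer() handles the Gaussian case, while the nAGQ argument of glmer() requests adaptive Gauss-Hermite quadrature for a genuinely non-Gaussian response. The choice of 10 quadrature points is arbitrary here, and nAGQ > 1 is only available for models with a single scalar random effect.

```r
library(lme4)

# Gaussian response, identity link: a linear mixed model, so use lmer().
fit_lmm <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

# Binomial response: a true GLMM, where quadrature is meaningful.
# nAGQ = 10 asks for adaptive Gauss-Hermite quadrature with 10 points.
fit_agq <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
                 data = cbpp, family = binomial, nAGQ = 10)

print(fixef(fit_lmm))
print(fixef(fit_agq))
```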

Interpreting categorical variable importance in logistic regression

I'm using the caret package in R to build a logistic regression model for binary classification and one of my predictors is a categorical variable with 4 levels. Below is my code.
> mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")
> mydata$admit <- factor(mydata$admit)
> mydata$rank <- factor(mydata$rank)
> str(mydata)
'data.frame': 400 obs. of 4 variables:
$ admit: Factor w/ 2 levels "0","1": 1 2 2 2 1 2 2 1 2 1 ...
$ gre : int 380 660 800 640 520 760 560 400 540 700 ...
$ gpa : num 3.61 3.67 4 3.19 2.93 3 2.98 3.08 3.39 3.92 ...
$ rank : Factor w/ 4 levels "1","2","3","4": 3 3 1 4 4 2 1 2 3 2 ...
> mymod <- train(admit ~ gre + gpa + rank, data=mydata, method="glm", family="binomial")
> summary(mymod)$coeff
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.989979073 1.139950936 -3.500132 0.0004650273
gre 0.002264426 0.001093998 2.069864 0.0384651284
gpa 0.804037549 0.331819298 2.423119 0.0153878974
rank2 -0.675442928 0.316489661 -2.134171 0.0328288188
rank3 -1.340203916 0.345306418 -3.881202 0.0001039415
rank4 -1.551463677 0.417831633 -3.713131 0.0002047107
> varImp(mymod)
glm variable importance
Overall
rank3 100.00
rank4 90.72
gpa 19.50
rank2 3.55
gre 0.00
My question is, how do I interpret varImp for the model, especially with respect to rank? Since R has assumed rank1 to be the baseline class, does varImp being highest for rank3 mean that admit is most different for the observations when rank is 3 in comparison with when rank is 1? If this is the case, it doesn't seem to tell the same story as the coefficients of the model, because rank4 has a larger coefficient (in absolute value) than rank3, even though it is of lower importance according to varImp.
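The numbers above are consistent with varImp reporting, for a glm fit, the absolute value of each coefficient's z statistic rescaled so that the smallest is 0 and the largest is 100. The sketch below checks this using only the z values printed in the question; it is an arithmetic check on that output, not a call into caret itself.

```r
# z values copied from the coefficient table above (intercept excluded).
z <- c(gre = 2.069864, gpa = 2.423119,
       rank2 = -2.134171, rank3 = -3.881202, rank4 = -3.713131)

# Rescale |z| to the 0-100 range.
abs_z <- abs(z)
imp <- 100 * (abs_z - min(abs_z)) / (max(abs_z) - min(abs_z))

print(round(sort(imp, decreasing = TRUE), 2))
```

On this reading, rank3 ranks above rank4 because its standard error is smaller, so its |z| is larger, even though rank4 has the larger coefficient in absolute value. Each rank row compares one level with the baseline rank1; none of them measures the importance of the rank variable as a whole.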

Kaplan Meier Survival curve results differ between R and SAS?

I'm re-running Kaplan-Meier Survival Curves from previously published data, using the exact data set used in the publication (Charpentier et al. 2008 - Inbreeding depression in ring-tailed lemurs (Lemur catta): genetic diversity predicts parasitism, immunocompetence, and survivorship). This publication ran the curves in SAS Version 9, using LIFETEST, to analyze the age at death structured by genetic heterozygosity and sex of the animal (n=64). She reports a Chi square value of 6.31 and a p value of 0.012; however, when I run the curves in R, I get a Chi square value of 0.9 and a p value of 0.821. Can anyone explain this??
R code used: age is the time to death, mort1 is the censoring indicator, sex is the stratum, and ho2 is the factor delineating the two groups to be compared.
> survdiff(Surv(age, mort1)~ho2+sex,data=mariekmsurv1)
Call:
survdiff(formula = Surv(age, mort1) ~ ho2 + sex, data = mariekmsurv1)
N Observed Expected (O-E)^2/E (O-E)^2/V
ho2=1, sex=F 18 3 3.23 0.0166 0.0215
ho2=1, sex=M 12 3 2.35 0.1776 0.2140
ho2=2, sex=F 17 5 3.92 0.3004 0.4189
ho2=2, sex=M 17 4 5.50 0.4088 0.6621
Chisq= 0.9 on 3 degrees of freedom, p= 0.821
> str(mariekmsurv1)
'data.frame': 64 obs. of 6 variables:
$ id : Factor w/ 65 levels "","aeschylus",..: 14 31 33 30 47 57 51 39 36 3 ...
$ sex : Factor w/ 3 levels "","F","M": 3 2 3 2 2 2 2 2 2 2 ...
$ mort1: int 0 0 0 0 0 0 0 0 0 0 ...
$ age : num 0.12 0.192 0.2 0.23 1.024 ...
$ sex.1: Factor w/ 3 levels "","F","M": 3 2 3 2 2 2 2 2 2 2 ...
$ ho2 : int 1 1 1 2 1 1 1 1 1 2 ...
- attr(*, "na.action")=Class 'omit' Named int [1:141] 65 66 67 68 69 70 71 72 73 74 ...
.. ..- attr(*, "names")= chr [1:141] "65" "66" "67" "68" ...
Some ideas:
Try running it in SAS -- see if you get the same results as the author. Maybe they didn't send you the exact same dataset they used.
Look into the default values of the relevant SAS PROC and compare to the defaults of the R function you are using.
Given the huge difference between the chi-squared values (6.31 and 0.9) and p-values (0.012 and 0.821) from the SAS and R survival analyses, I suspect that the wrong variables were used in one of the two procedures.
Procedural or data-handling differences between SAS and R can cause only very small differences.
This is not a software error, this is highly likely to be a human error.
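As one concrete default to compare, and assuming the published analysis tested ho2 with sex as a stratification variable (the question describes sex as "the stratum"): in R, putting sex on the right-hand side as a plain term makes survdiff compare all four ho2-by-sex groups, which is what the 3 degrees of freedom in the output above reflect. Wrapping it in strata() instead compares the two ho2 groups within sex. The sketch uses the ovarian data set from the survival package as a stand-in, since the original data are not available here.

```r
library(survival)

# Plain term: tests equality across all rx-by-resid.ds combinations.
fit_joint <- survdiff(Surv(futime, fustat) ~ rx + resid.ds, data = ovarian)

# strata(): tests rx only, stratified by resid.ds.
fit_strat <- survdiff(Surv(futime, fustat) ~ rx + strata(resid.ds),
                      data = ovarian)

print(fit_joint)
print(fit_strat)
```

For the data in the question, the analogous call would be survdiff(Surv(age, mort1) ~ ho2 + strata(sex), data = mariekmsurv1). Whether that reproduces the published value depends on what the SAS code actually did.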

How do I extract lmer fixed effects by observation?

I have an lmer object, constructed from some repeated-measures nutrient intake data (two 24-hour intake periods per RespondentID):
Male.lme2 <- lmer(BoxCoxXY ~ -1 + AgeFactor + IntakeDay + (1|RespondentID),
data = Male.Data,
weights = SampleWeight)
and I can successfully retrieve the random effects by RespondentID using ranef(Male.lme1). I would also like to collect the result of the fixed effects by RespondentID. coef(Male.lme1) does not provide exactly what I need, as I show below.
> summary(Male.lme1)
Linear mixed model fit by REML
Formula: BoxCoxXY ~ AgeFactor + IntakeDay + (1 | RespondentID)
Data: Male.Data
AIC BIC logLik deviance REMLdev
9994 10039 -4990 9952 9980
Random effects:
Groups Name Variance Std.Dev.
RespondentID (Intercept) 0.19408 0.44055
Residual 0.37491 0.61230
Number of obs: 4498, groups: RespondentID, 2249
Fixed effects:
Estimate Std. Error t value
(Intercept) 13.98016 0.03405 410.6
AgeFactor4to8 0.50572 0.04084 12.4
AgeFactor9to13 0.94329 0.04159 22.7
AgeFactor14to18 1.30654 0.04312 30.3
IntakeDayDay2Intake -0.13871 0.01809 -7.7
Correlation of Fixed Effects:
(Intr) AgFc48 AgF913 AF1418
AgeFactr4t8 -0.775
AgeFctr9t13 -0.761 0.634
AgFctr14t18 -0.734 0.612 0.601
IntkDyDy2In -0.266 0.000 0.000 0.000
I have appended the fitted results to my data, head(Male.Data) shows
NutrientID RespondentID Gender Age SampleWeight IntakeDay IntakeAmt AgeFactor BoxCoxXY lmefits
2 267 100020 1 12 0.4952835 Day1Intake 12145.852 9to13 15.61196 15.22633
7 267 100419 1 14 0.3632839 Day1Intake 9591.953 14to18 15.01444 15.31373
8 267 100459 1 11 0.4952835 Day1Intake 7838.713 9to13 14.51458 15.00062
12 267 101138 1 15 1.3258785 Day1Intake 11113.266 14to18 15.38541 15.75337
14 267 101214 1 6 2.1198688 Day1Intake 7150.133 4to8 14.29022 14.32658
18 267 101389 1 5 2.1198688 Day1Intake 5091.528 4to8 13.47928 14.58117
The first couple of lines from coef(Male.lme1) are:
$RespondentID
(Intercept) AgeFactor4to8 AgeFactor9to13 AgeFactor14to18 IntakeDayDay2Intake
100020 14.28304 0.5057221 0.9432941 1.306542 -0.1387098
100419 14.00719 0.5057221 0.9432941 1.306542 -0.1387098
100459 14.05732 0.5057221 0.9432941 1.306542 -0.1387098
101138 14.44682 0.5057221 0.9432941 1.306542 -0.1387098
101214 13.82086 0.5057221 0.9432941 1.306542 -0.1387098
101389 14.07545 0.5057221 0.9432941 1.306542 -0.1387098
To demonstrate how the coef results relate to the fitted estimates in Male.Data (which were grabbed using Male.Data$lmefits <- fitted(Male.lme1)), for the first RespondentID, who has the AgeFactor level 9to13:
- the fitted value is 15.22633, which equals, from the coefficients, (Intercept) + (AgeFactor9to13) = 14.28304 + 0.9432941
Is there a clever command that will do what I want automatically, which is to extract the fixed effect estimate for each subject? Or am I faced with a series of if statements applying the correct AgeFactor level to each subject to get the correct fixed effect estimate, after deducting the random effect contribution from the intercept?
Update, apologies, was trying to cut down on the output I was providing and forgot about str(). Output is:
>str(Male.Data)
'data.frame': 4498 obs. of 11 variables:
$ NutrientID : int 267 267 267 267 267 267 267 267 267 267 ...
$ RespondentID: Factor w/ 2249 levels "100020","100419",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Gender : int 1 1 1 1 1 1 1 1 1 1 ...
$ Age : int 12 14 11 15 6 5 10 2 2 9 ...
$ BodyWeight : num 51.6 46.3 46.1 63.2 28.4 18 38.2 14.4 14.6 32.1 ...
$ SampleWeight: num 0.495 0.363 0.495 1.326 2.12 ...
$ IntakeDay : Factor w/ 2 levels "Day1Intake","Day2Intake": 1 1 1 1 1 1 1 1 1 1 ...
$ IntakeAmt : num 12146 9592 7839 11113 7150 ...
$ AgeFactor : Factor w/ 4 levels "1to3","4to8",..: 3 4 3 4 2 2 3 1 1 3 ...
$ BoxCoxXY : num 15.6 15 14.5 15.4 14.3 ...
$ lmefits : num 15.2 15.3 15 15.8 14.3 ...
The BodyWeight and Gender aren't being used (this is the males data, so all the Gender values are the same) and the NutrientID is similarly fixed for the data.
I have been doing horrible ifelse statements since I posted, so will try out your suggestion immediately. :)
Update2: this works perfectly with my current data and should be future-proof for new data, thanks to DWin for the extra help in the comment for this. :)
AgeLevels <- length(unique(Male.Data$AgeFactor))
Temp <- as.data.frame(fixef(Male.lme1)['(Intercept)'] +
c(0,fixef(Male.lme1)[2:AgeLevels])[
match(Male.Data$AgeFactor, c("1to3", "4to8", "9to13","14to18", "19to30","31to50","51to70","71Plus") )] +
c(0,fixef(Male.lme1)[(AgeLevels+1)])[
match(Male.Data$IntakeDay, c("Day1Intake","Day2Intake") )])
names(Temp) <- c("FxdEffct")
Below is how I've always found it easiest to extract the individuals' fixed effects and random effects components in the lme4-package. It actually extracts the corresponding fit to each observation. Assuming we have a mixed-effects model of form:
y = Xb + Zu + e
where Xb are the fixed effects and Zu are the random effects, we can extract the components (using lme4's sleepstudy as an example):
library(lme4)
fm1 <- lmer(Reaction ~ Days + (Days|Subject), sleepstudy)
# Xb
fix <- getME(fm1,'X') %*% fixef(fm1)
# Zu
ran <- t(as.matrix(getME(fm1,'Zt'))) %*% unlist(ranef(fm1))
# Xb + Zu
fixran <- fix + ran
I know that this works as a generalized approach to extracting components from linear mixed-effects models. For non-linear models, the model matrix X contains repeats and you may have to tailor the above code a bit. Here's some validation output as well as a visualization using lattice:
> head(cbind(fix, ran, fixran, fitted(fm1)))
[,1] [,2] [,3] [,4]
[1,] 251.4051 2.257187 253.6623 253.6623
[2,] 261.8724 11.456439 273.3288 273.3288
[3,] 272.3397 20.655691 292.9954 292.9954
[4,] 282.8070 29.854944 312.6619 312.6619
[5,] 293.2742 39.054196 332.3284 332.3284
[6,] 303.7415 48.253449 351.9950 351.9950
# Xb + Zu
> all(round((fixran),6) == round(fitted(fm1),6))
[1] TRUE
# e = y - (Xb + Zu)
> all(round(resid(fm1),6) == round(sleepstudy[,"Reaction"]-(fixran),6))
[1] TRUE
nobs <- 10 # 10 observations per subject
legend = list(text=list(c("y", "Xb + Zu", "Xb")), lines = list(col=c("blue", "red", "black"), pch=c(1,1,1), lwd=c(1,1,1), type=c("b","b","b")))
require(lattice)
xyplot(
Reaction ~ Days | Subject, data = sleepstudy,
panel = function(x, y, ...){
panel.points(x, y, type='b', col='blue')
panel.points(x, fix[(1+nobs*(panel.number()-1)):(nobs*(panel.number()))], type='b', col='black')
panel.points(x, fixran[(1+nobs*(panel.number()-1)):(nobs*(panel.number()))], type='b', col='red')
},
key = legend
)
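In lme4 versions from 1.0 onward (an assumption about the installed version), the same decomposition can be written without indexing into ranef() by hand: predict() with re.form = NA returns the fixed-effects part Xb for each observation, and getME(fm1, "b") returns the random-effects vector in the same order as the columns of Z. A minimal sketch on sleepstudy:

```r
library(lme4)

fm1 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)

# Xb: population-level prediction for each row, random effects set to zero.
fix <- predict(fm1, re.form = NA)

# Zu: Z and b come from the same fitted object, so their ordering agrees.
ran <- as.vector(getME(fm1, "Z") %*% getME(fm1, "b"))

# Xb + Zu should reproduce the fitted values.
fixran <- fix + ran
print(all.equal(unname(fixran), unname(fitted(fm1))))
```

For the model in the question, predict(Male.lme1, re.form = NA) should likewise give one fixed-effects value per row of Male.Data, without any manual matching of factor levels.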
It is going to be something like this (although you really should have given us the results of str(Male.Data) because model output does not tell us the factor levels for the baseline values:)
#First look at the coefficients
fixef(Male.lme2)
#Then do the calculations
fixef(Male.lme2)['(Intercept)'] +
c(0,fixef(Male.lme2)[2:4])[
match(Male.Data$AgeFactor, c("1to3", "4to8", "9to13","14to18") )] +
c(0,fixef(Male.lme2)[5])[
match(Male.Data$IntakeDay, c("Day1Intake","Day2Intake") )]
You are basically running the original data through a match function to pick the correct coefficient(s) to add to the intercept ... which will be 0 if the data is the factor's base level (whose spelling I am guessing at.)
EDIT: I just noticed that you put a "-1" in the formula, so perhaps all of your AgeFactor terms are listed in the output and you can take out the 0 in the coefficient vector and the invented AgeFactor level in the match table vector.
