Weird plots when plotting logistic regression residuals vs predictor variables? - r

I have fitted a logistic regression for an outcome (a type of side effect, i.e. whether patients have it or not). The formula and results of this model are below:
model <- glm(side_effect_G1 ~ age + bmi + surgerytype1 + surgerytype2 + surgerytype3 + cvd + rt_axilla, family = 'binomial', data= data1)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.888112 0.859847 -9.174 < 2e-16 ***
age 0.028529 0.009212 3.097 0.00196 **
bmi 0.095759 0.015265 6.273 3.53e-10 ***
surgery11 0.923723 0.524588 1.761 0.07826 .
surgery21 1.607389 0.600113 2.678 0.00740 **
surgery31 1.544822 0.573972 2.691 0.00711 **
cvd1 0.624692 0.290005 2.154 0.03123 *
rt1 -0.816374 0.353953 -2.306 0.02109 *
I want to check my model, so I have plotted the residuals against the predictors and the fitted values. I know that if a model is properly fitted there should be no correlation between the residuals and the predictors or fitted values, so I essentially run...
residualPlots(model)
My plots look funny because, from the examples I have seen online, the residuals should be symmetrical around 0. Also, my factor variables aren't shown as box plots, although I have checked the structure of my data and coded surgery1, surgery2, surgery4, cvd, rt as factors. Can someone help me interpret my plots and guide me on how to plot box plots for my factor variables?
Thanks

This is expected for an imbalanced dataset, which your label (response variable) appears to be. In your plots most of the residuals actually fall below the dotted line, so I suspect this is the case.
Very briefly, the symmetry of the residuals around 0 only holds for logistic regression when your classes are balanced. If the data are heavily imbalanced towards the reference label (the 0 label), the intercept is forced towards a low value (i.e. the 0 label), and the positive labels will have very large Pearson residuals (because they deviate a lot from what the model expects). You can read more about imbalanced classes and logistic regression in this post
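As a quick numerical illustration (my numbers, not from the post): the Pearson residual is r = (y - p) / sqrt(p * (1 - p)), so a positive case whose fitted probability is dragged down by the imbalance gets a large residual, while a typical negative case stays near 0:
p <- 0.05                     # a fitted probability pushed low by imbalance
(1 - p) / sqrt(p * (1 - p))   # positive case (y = 1): about 4.36
(0 - p) / sqrt(p * (1 - p))   # negative case (y = 0): about -0.23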
Here's an example to demonstrate this, using a dataset where you can see evenly distributed residuals:
library(mlbench)
library(car)
data(PimaIndiansDiabetes)
table(PimaIndiansDiabetes$diabetes)
neg pos
500 268
mdl <- glm(diabetes ~ ., data = PimaIndiansDiabetes, family = "binomial")
residualPlots(mdl)
Let's make it more imbalanced, and you get a plot exactly like yours:
da <- PimaIndiansDiabetes
# keep all 500 negatives but only the first 100 positives
wh <- c(which(da$diabetes == "neg"), which(da$diabetes == "pos")[1:100])
da <- da[wh, ]
table(da$diabetes)
neg pos
500 100
mdl <- glm(diabetes ~ ., data = da, family = "binomial")
residualPlots(mdl)
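To see the mechanism on the intercept directly (my addition), compare the intercepts of the two fits; the imbalanced one should be pulled further towards the majority (neg) label:
coef(mdl)["(Intercept)"]   # expect a more negative intercept than in the balanced fit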

Related

issue with creating bar plot from logistic regression results

I am trying to plot the interaction results of a logistic regression where my independent variable (gbg) is binary and my moderator (gender) is binary. The interaction term is gender*gbg. Here is the code for the logistic regression I ran:
model <- glm(y ~ gender + lunch + cohort + race + gbg + gender*gbg, data = data, family = binomial)
Sample output is here:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.02336 0.20860 -4.906 9.3e-07
gender 0.20342 0.15663 1.299 0.1940
cohort 0.07891 0.13436 0.587 0.5570
race 0.08707 0.18150 0.480 0.6314
gbg 0.22623 0.20622 1.097 0.2726
gxgbg -0.76378 0.30647 -2.492 0.0127
I know how to plot interaction results when at least one of my IVs or moderators is continuous, but I don't know how to do it when both are binary and my outcome is also binary. I tried this with no luck:
plot_model(model, type = "int")
Can someone please help?
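One common way to visualise a binary-by-binary interaction is on the probability scale: compute the predicted probability for each gender x gbg cell and plot those. A minimal sketch with emmeans (my suggestion, not from the original post, and it assumes gender and gbg are coded as factors):
library(emmeans)
# predicted probability for each gender x gbg cell, with confidence intervals
emm <- emmeans(model, ~ gbg | gender, type = "response")
emm
plot(emm)   # quick built-in plot of the cell probabilities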

Syntax for diagonal variance-covariance matrix for non-linear mixed effects model in nlme

I am analysing routinely collected substance use data from the first 12 months of treatment in a large sample of outpatients attending drug and alcohol treatment services. I am interested in whether differing levels of methamphetamine use (no use, low use, and high use) at the outset of treatment predict different levels after a year in treatment, but the data are very irregular, with different clients measured at different times and a different number of times during their year of treatment.
The data for the high and low use groups seem to suggest that drug use at outset reduces during the first 3 months of treatment and then asymptotes. Hence I thought I would try a non-linear exponential decay model.
I started with the following nonlinear generalised least squares model using the gnls() function in the nlme package:
fitExp <- gnls(outcome ~ C*exp(-k*yearsFromStart),
params = list(C ~ atsBase_fac, k ~ atsBase_fac),
data = dfNL,
start = list(C = c(nsC[1], lsC[1], hsC[1]),
k = c(nsC[2], lsC[2], hsC[2])),
weights = varExp(-0.8, form = ~ yearsFromStart),
control = gnlsControl(nlsTol = 0.1))
where outcome is the number of days of drug use in the 28 days prior to measurement, atsBase_fac is a three-level categorical predictor indicating level of amphetamine use at baseline (noUse, lowUse, and highUse), yearsFromStart is a continuous predictor indicating time from start of treatment in years (baseline = 0, max = 1), C is a parameter indicating initial level of drug use, and k is the rate of decay in drug use. The starting values of C and k are taken from nls models estimating these parameters for each group. These are the results of that model:
Generalized nonlinear least squares fit
Model: outcome ~ C * exp(-k * yearsFromStart)
Data: dfNL
AIC BIC logLik
27672.17 27725.29 -13828.08
Variance function:
Structure: Exponential of variance covariate
Formula: ~yearsFromStart
Parameter estimates:
expon
0.7927517
Coefficients:
Value Std.Error t-value p-value
C.(Intercept) 0.130410 0.0411728 3.16738 0.0015
C.atsBase_faclow 3.409828 0.1249553 27.28839 0.0000
C.atsBase_fachigh 20.574833 0.3122500 65.89218 0.0000
k.(Intercept) -1.667870 0.5841222 -2.85534 0.0043
k.atsBase_faclow 2.481850 0.6110666 4.06150 0.0000
k.atsBase_fachigh 9.485155 0.7175471 13.21886 0.0000
So it looks as if there are differences between groups in the initial level of drug use and in the rate of reduction in drug use. I would like to go a step further and fit a nonlinear mixed-effects model. I tried consulting Pinheiro and Bates's book accompanying the nlme package, but the only models I could find that used irregular, sparse data like mine used a self-starting function, and my model does not.
I tried to adapt the gnls() model to nlme like so:
fitNLME <- nlme(model = outcome ~ C*exp(-k*yearsFromStart),
data = dfNL,
fixed = list(C ~ atsBase_fac, k ~ atsBase_fac),
random = pdDiag(yearsFromStart ~ id),
groups = ~ id,
start = list(fixed = c(nsC[1], lsC[1], hsC[1], nsC[2], lsC[2], hsC[2])),
weights = varExp(-0.8, form = ~ yearsFromStart),
control = nlmeControl(optim = "optimizer"))
but I keep getting error messages, I presume because of errors in the syntax specifying the random effects.
Can anyone give me some tips on how the syntax for the random effects works in nlme?
The only dataset in Pinheiro and Bates that resembled mine used a diagonal variance-covariance matrix. Can anyone fill me in on the syntax of this nlme function, or suggest a better one?
p.s. I wish I could provide a reproducible example but coming up with synthetic data that re-creates the same errors is way beyond my skills.
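For reference, the usual nlme convention is to declare the random effects on the model parameters (C and k) rather than on a covariate, with pdDiag supplying the diagonal variance-covariance matrix. A minimal sketch (the grouping by id and the starting values are carried over from the question; whether this converges on the real data is untested):
fitNLME <- nlme(model = outcome ~ C * exp(-k * yearsFromStart),
                data = dfNL,
                fixed = list(C ~ atsBase_fac, k ~ atsBase_fac),
                random = pdDiag(C + k ~ 1),  # uncorrelated random C and k per group
                groups = ~ id,
                start = c(nsC[1], lsC[1], hsC[1], nsC[2], lsC[2], hsC[2]),
                weights = varExp(-0.8, form = ~ yearsFromStart))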

Standardizing regression coefficients changed significance

I originally had this formula:
lm(PopDif ~ RailDensityDif + Ports + Coast, data = Pop) and got a coefficient of 1,419,000 for RailDensityDif, -0.1011 for ports, and 3418 for Coast. After scaling the variables: lm(scale(PopDif) ~ scale(RailDensityDif) + scale(Ports) + scale(Coast), data = Pop), my coefficient for RailDensityDif is 0.02107 and 0.2221 for Coast, so now Coast is more significant than RailDensityDif. I know scaling isn't supposed to change the significance—why did this happen?
tldr; The p-values characterising the statistical significance of parameters in a linear model may change following scaling (standardising) variables.
As an example, I will work with the mtcars dataset, and regress mpg on disp and drat; or in R's formula language mpg ~ disp + drat.
1. Three linear models
We implement three different (OLS) linear models, the difference being different scaling strategies of the variables.
To start, we don't do any scaling.
m1 <- lm(mpg ~ disp + drat, data = mtcars)
Next, we scale values using scale which by default does two things: (1) It centers values at 0 by subtracting the mean, and (2) it scales values to have unit variance by dividing the (centered) values by their standard deviation.
m2 <- lm(mpg ~ disp + drat, data = as.data.frame(scale(mtcars)))
Note that we can apply scale to the data.frame directly, which will scale values by column. scale returns a matrix so we need to transform the resulting object back to a data.frame.
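As a quick sanity check (my addition), the scaled columns do come out with mean 0 and standard deviation 1:
zz <- scale(mtcars)
round(colMeans(zz), 12)   # all (numerically) zero
apply(zz, 2, sd)          # all 1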
Finally, we scale values using scale without centering. (Note that with center = FALSE, scale divides each column by its root mean square rather than its standard deviation; for what follows, all that matters is that each column is divided by a constant.)
m3 <- lm(mpg ~ disp + drat, data = as.data.frame(scale(mtcars, center = F)))
2. Comparison of parameter estimates and statistical significance
Let's inspect the parameter estimates for m1
summary(m1)$coef
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 21.84487993 6.747971087 3.237252 3.016655e-03
#disp -0.03569388 0.006652672 -5.365345 9.191388e-06
#drat 1.80202739 1.542091386 1.168561 2.520974e-01
We get the t values from the ratio of the parameter estimates and standard errors; the p-value is then twice the area under the pdf of the t-distribution with nrow(mtcars) - 3 degrees of freedom (as we estimate 3 parameters) beyond |t| (corresponding to a two-sided t-test). So, for example, for disp we confirm the t value
summary(m1)$coef["disp", "Estimate"] / summary(m1)$coef["disp", "Std. Error"]
#[1] -5.365345
and the p-value
2 * pt(summary(m1)$coef["disp", "Estimate"] / summary(m1)$coef["disp", "Std. Error"], nrow(mtcars) - 3)
#[1] 9.191388e-06
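Note that this relies on the t value being negative here; a sign-safe two-sided version (my addition) takes the absolute value first:
tval <- summary(m1)$coef["disp", "t value"]
2 * pt(-abs(tval), df = nrow(mtcars) - 3)
#[1] 9.191388e-06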
Let's take a look at results from m2:
summary(m2)$coef
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -1.306994e-17 0.09479281 -1.378790e-16 1.000000e+00
#disp -7.340121e-01 0.13680614 -5.365345e+00 9.191388e-06
#drat 1.598663e-01 0.13680614 1.168561e+00 2.520974e-01
Notice how the t value of the intercept (the ratio of its estimate and standard error) differs from that in m1, due to the centering of the data; the t values and p-values of disp and drat are unchanged.
If however, we don't center values and only scale them to have unit variance
summary(m3)$coef
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 1.0263872 0.31705513 3.237252 3.016655e-03
#disp -0.4446985 0.08288348 -5.365345 9.191388e-06
#drat 0.3126834 0.26757994 1.168561 2.520974e-01
we can see that while the estimates and standard errors differ from the (unscaled) results of m1, their respective ratios (i.e. the t values) are identical, now including the intercept. So (default) scale(...) changes the statistical significance of the intercept while scale(..., center = FALSE) leaves all p-values unchanged.
It's easy to see why dividing values by a constant does not change the ratio of the OLS parameter estimates and standard errors once we look at their closed forms; see e.g. here.
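To sketch the algebra (my notation, not from the linked post): let $D = \operatorname{diag}(s_1, \dots, s_p)$ hold the constants each column of $X$ is divided by, so the scaled design matrix is $X_s = X D^{-1}$. Then
$$\hat\beta_s = (X_s^\top X_s)^{-1} X_s^\top y = D (X^\top X)^{-1} X^\top y = D \hat\beta, \qquad \widehat{\operatorname{Var}}(\hat\beta_s) = \hat\sigma^2\, D (X^\top X)^{-1} D,$$
so each estimate and its standard error are multiplied by the same $s_j$, and the t values cancel out unchanged. Rescaling $y$ by a constant multiplies every estimate and standard error by that same constant, so it cancels too. Centering, by contrast, changes what the intercept estimates, which is why its t value changed in m2.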

GAM: why does mgcv::gam provide different results depending on the order of the levels of the explanatory variable?

I am trying to get the seasonal trend of two groups of individuals using GAMMs. I performed two analyses, changing the order of the levels of the explanatory variable, in order to get one plot of the seasonal trend for each level.
However, I am surprised by the output of the two GAMMs, because it varies with the order of the levels of the explanatory variable. I expected the results to be the same, because the data and the model are the same on both occasions. However, as you can see below, the results change the inference about the data studied.
My database contains the following variables:
Species: 4 levels
Populations: 20 levels
Reproductive_State: 2 levels
Survival_probability: range [0-1]
Year
Month
Fortnight: from 1 to 26 (called Seasonality in analysis)
I am trying to get descriptive estimates and plots of the "common seasonal survival of the species", checking for differences between the two levels of the variable Reproductive_State.
To check this, I did:
# Specify the contrast: Reproductive group
data$Reproductive_Group <- as.factor(data$Reproductive_State)
data$Reproductive_Group <- as.ordered(data$Reproductive_Group)
contrasts(data$Reproductive_Group) <- 'contr.treatment'
model_1 <- gam(Survival_probability ~ Reproductive_Group + s(Seasonality) + s(Seasonality, by = Reproductive_Group), random = list(Species = ~1, Population = ~1), family = quasibinomial, data = data)
Later I changed the order of the levels of Reproductive_Group and performed the same analysis:
data$Reproductive_Group <- factor(data$Reproductive_Group, levels = c("phiNB", "phiB"))
levels(data$Reproductive_Group)
model_2 <- gam(Survival_probability ~ Reproductive_Group + s(Seasonality) + s(Seasonality, by = Reproductive_Group), random = list(Species = ~1, Population = ~1), family = quasibinomial, data = data)
In the first model the output is:
Formula:
Survival_probability ~ +s(Seasonality) + s(Seasonality, by = Rep_Group)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.83569 0.01202 152.8 <2e-16 ***
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Seasonality) 3.201 3.963 2.430 0.05046 .
s(Seasonality):Rep_GroupphiNB 5.824 6.956 2.682 0.00991 **
whereas the output of the second model is:
Formula:
Survival_probability ~ +s(Seasonality) + s(Seasonality, by = Rep_Group)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.83554 0.01205 152.4 <2e-16 ***
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Seasonality) 5.927 7.061 6.156 3.66e-07 ***
s(Seasonality):Rep_GroupphiB 3.218 3.981 1.029 0.411
Furthermore I have attached the plots of the two models:
[plot: Group_B_as_second_level]
[plot: Group_NB_as_second_level]
I thought that the plot of the seasonality should be the same for both analyses, as long as it represents exclusively the seasonality. However, if the seasonality smooth reflects the seasonal trend of the reference level, plot 1 of the first picture should match plot 2 of the second picture and vice versa, and they don't.
Note that I followed the blog Overview GAMM analysis of time series data for writing the formula and checking the differences in the seasonal trend across the two reproductive states.
Do you know why I obtain different results with these two models?

Testing differences in coefficients including interactions from piecewise linear model

I'm running a piecewise linear random coefficient model testing the influence of a covariate on the second piece. Specifically, I want to test whether the coefficient of the second piece under the influence of the covariate (piece2 + piece2:covariate) differs from the coefficient of the first piece (piece1), i.e. whether the growth rate differs.
I set up some exemplary data:
set.seed(100)
# set up dependent variable
temp <- rep(seq(0,23),50)
y <- c(rep(seq(0,23),50)+rnorm(24*50), ifelse(temp <= 11, temp + runif(1200), temp + rnorm(1200) + (temp/sqrt(temp))))
# set up ID variable, variables indicating pieces and the covariate
id <- sort(rep(seq(1,100),24))
piece1 <- rep(c(seq(0,11), rep(11,12)),100)
piece2 <- rep(c(rep(0,12), seq(1,12)),100)
covariate <- c(rep(0,24*50), rep(c(rep(0,12), rep(1,12)), 50))
# data frame
example.data <- data.frame(id, y, piece1, piece2, covariate)
# run piecewise linear random effects model and show results
library(lme4)
lmer.results <- lmer(y ~ piece1 + piece2*covariate + (1|id) , example.data)
summary(lmer.results)
I came across the linearHypothesis() command from the car package to test differences in coefficients. However, I could not find an example on how to use it when including interactions.
Can I even use linearHypothesis() to test this or am I aiming for the wrong test?
I appreciate your help.
Many thanks in advance!
Mac
Assuming your output looks like this:
Estimate Std. Error t value
(Intercept) 0.26293 0.04997 5.3
piece1 0.99582 0.00677 147.2
piece2 0.98083 0.00716 137.0
covariate 2.98265 0.09042 33.0
piece2:covariate 0.15287 0.01286 11.9
If I understand correctly what you want, you are looking for the contrast:
piece1-(piece2+piece2:covariate)
or
c(0,1,-1,0,-1)
My preferred tool for this is the function estimable in the gmodels package; you could also do it by hand or with one of the functions in Frank Harrell's packages.
library(gmodels)
estimable(lmer.results, c(0, 1, -1, 0, -1), conf.int = TRUE)
giving
Estimate Std. Error p value Lower.CI Upper.CI
(0 1 -1 0 -1) -0.138 0.0127 0 -0.182 -0.0928
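The same contrast can also be tested with glht from the multcomp package (a sketch, assuming the coefficient order shown above):
library(multcomp)
# one-row contrast matrix: piece1 - (piece2 + piece2:covariate)
K <- rbind("piece1 - (piece2 + piece2:covariate)" = c(0, 1, -1, 0, -1))
summary(glht(lmer.results, linfct = K))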
