Interpretation of coefficients in logistic regression - r

I know there are answers to similar questions but I cannot understand exactly how to interpret the coefficients estimates in logistic regression.
I am trying to predict if the probability of having a certain disease is influenced by two variables: sex and being a veteran. (there are many more variables that I am considering but for the sake of simplicity I just mentioned these two)
I have this result:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 16.46555 7.73733 2.128 0.033332 *
is.Female 0.37127 0.07204 5.154 2.55e-07 ***
is.Veteran -0.43195 0.13957 -3.095 0.001970 **
I want to claim how sex influences having the disease.
Can I say that being a female makes the probability of having the disease 0.37 times higher than if the patient was a man?

You are fitting a logistic regression, so you can't interpret the regression coefficient directly. You can calculate the odds ratio (OR) with regression coefficient.
In this case,
OR=exp(0.37)=1.45
This means that given the veteran status, risk of female = 1.45 * risk of male.

Related

How to extract average marginal effects (or predicted values) following panel data in Julia?

I am using this helpful package https://github.com/FixedEffects/FixedEffectModels.jl in Julia which helps to run Fixed Effects models.
I have one problem though, i am not sure how to extract the average marginal effects or predicted values of an interaction variable following the use of that package. For instance, the two lines below show how to extract the average marginal effects in Stata.
xtreg chronic_illness age country_birth social_class#macro_unemployment, fe
margins crisis, dydx(social_class)
Here is how to extract them in R:
How to run the predicted probabilities (or average marginal effects) for individuals fixed effects in panel data using R?
Is there by any chance a similar version of it in Julia?
Here is the model that i am running in Julia:
m= reg(df1, #formula(chronic_illness ~ status+ age + social_class*crisis + fe(id) + fe(year) + fe(country)), contrasts=contr,save=true)
Chronic illness is a binary variable (0= no chronic illness), crisis is a binary variable (0= no financial crisis). The idea is to see how much different social classes scored on chronic illness when there is no crisis compared with when there is one. The model here just shows me the interaction effect, but i am interested in the baseline values.
Here is the output:
Fixed Effect Model
========================================================================================================
Number of obs: 1468882 Degrees of freedom: 459252
R2: 0.703 R2 Adjusted: 0.567
F-Stat: 62.8378 p-value: 0.000
R2 within: 0.001 Iterations: 18
========================================================================================================
cillness | Estimate Std.Error t value Pr(>|t|) Lower 95% Upper 95%
-------------------------------------------------------------------------------------------------------
status: Unemployed | 0.0145335 0.00157535 9.22556 0.000 0.0114459 0.0176212
status: missing | -0.00702545 0.0136504 -0.51467 0.607 -0.0337797 0.0197288
age | 0.00178437 2.79058 0.000639427 0.999 -5.46766 5.47123
class: Lower-middle class | 0.00458331 250.251 1.83149e-5 1.000 -490.478 490.487
class: Working class | 0.0286466 163.324 0.000175398 1.000 -320.081 320.138
crisis | -0.00600744 0.00156138 -3.84753 0.000 -0.00906768 -0.00294719
class: Lower-middle class & crisis | -0.00189866 0.00192896 -0.984289 0.325 -0.00567936 0.00188205
class: Working class & crisis | -0.00332881 0.00170221 -1.95558 0.051 -0.0066651 7.46994e-6
I'm not aware of an exact match for emmeans in Julia, but you might be interested in the Effects.jl package. From the docs:
Regression models are useful but they can be tricky to interpret. Variable centering and contrast coding can obscure the meaning of main effects. Interaction terms, especially higher order ones, only increase the difficulty of interpretation. Here, we introduce Effects.jl which translates the fitted model, including estimated uncertainty, back into data space. Using Effects.jl, it is possible to generate effects plots that enable rapid visualization and interpretation of regression models.
As an aside, it appears that you have a binary response in your model so you probably should not fit a lienar model but a model appropriate for your data such that you can interpret coefficients as changes in probabilities/log odds, e.g. a logit or probit model. (Note that these aren't supported in FixedEffectsModels though, you'd have to fall back onto GLM.jl if your data isn't too big or GLFixedEffectModels).

R coxme: How to get study-specific treatment effect and 95% confidence interval from a mixed effect model?

I am using the coxme to fix a mixed effect Cox model. I have two random effects in the model (intercept [baseline hazard] and slope of the treatment). After running the model, I got the fixed coefficients and random effects output:
Fixed coefficients
coef exp(coef) se(coef) z p
treatment 0.420463710 1.5226675 0.28313885 1.49 0.14
Random effects
Group Variable Std Dev Variance Corr
Study Intercept 1.3979618 1.9542971 -0.6995296
treatment 0.4815433 0.2318839
I also used the ranef to get the random effect coefficients for each study. Outpout:
Intercept treatment
Study1 -1.591497 0.5479255
Study2 1.276003 -0.4392570
Study3 -1.051512 0.3621938
Study4 1.367005 -0.4708623
I was wondering how I can obtain the treatment effect (coefficient) and 95% CI for each study separately. Can I get the point estimate by summing up the overall coefficient of fixed effect and study-specific random effect (e.g. 0.420463710 + 0.5479255 for Study1)? But what about the 95%CI?
Is there any way to get the study-specific treatment effect and 95% CI (and the study's weight), like in ipdforest in Stata?
Thank you very much.

R: Calculate and interpret odds ratio in logistic regression

I am having trouble interpreting the results of a logistic regression. My outcome variable is Decision and is binary (0 or 1, not take or take a product, respectively).
My predictor variable is Thoughts and is continuous, can be positive or negative, and is rounded up to the 2nd decimal point.
I want to know how the probability of taking the product changes as Thoughts changes.
The logistic regression equation is:
glm(Decision ~ Thoughts, family = binomial, data = data)
According to this model, Thoughts has a significant impact on probability of Decision (b = .72, p = .02). To determine the odds ratio of Decision as a function of Thoughts:
exp(coef(results))
Odds ratio = 2.07.
Questions:
How do I interpret the odds ratio?
Does an odds ratio of 2.07 imply that a .01 increase (or decrease) in Thoughts affect the odds of taking (or not taking) the product by 0.07 OR
Does it imply that as Thoughts increases (decreases) by .01, the odds of taking (not taking) the product increase (decrease) by approximately 2 units?
How do I convert odds ratio of Thoughts to an estimated probability of Decision?
Or can I only estimate the probability of Decision at a certain Thoughts score (i.e. calculate the estimated probability of taking the product when Thoughts == 1)?
The coefficient returned by a logistic regression in r is a logit, or the log of the odds. To convert logits to odds ratio, you can exponentiate it, as you've done above. To convert logits to probabilities, you can use the function exp(logit)/(1+exp(logit)). However, there are some things to note about this procedure.
First, I'll use some reproducible data to illustrate
library('MASS')
data("menarche")
m<-glm(cbind(Menarche, Total-Menarche) ~ Age, family=binomial, data=menarche)
summary(m)
This returns:
Call:
glm(formula = cbind(Menarche, Total - Menarche) ~ Age, family = binomial,
data = menarche)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0363 -0.9953 -0.4900 0.7780 1.3675
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -21.22639 0.77068 -27.54 <2e-16 ***
Age 1.63197 0.05895 27.68 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3693.884 on 24 degrees of freedom
Residual deviance: 26.703 on 23 degrees of freedom
AIC: 114.76
Number of Fisher Scoring iterations: 4
The coefficients displayed are for logits, just as in your example. If we plot these data and this model, we see the sigmoidal function that is characteristic of a logistic model fit to binomial data
#predict gives the predicted value in terms of logits
plot.dat <- data.frame(prob = menarche$Menarche/menarche$Total,
age = menarche$Age,
fit = predict(m, menarche))
#convert those logit values to probabilities
plot.dat$fit_prob <- exp(plot.dat$fit)/(1+exp(plot.dat$fit))
library(ggplot2)
ggplot(plot.dat, aes(x=age, y=prob)) +
geom_point() +
geom_line(aes(x=age, y=fit_prob))
Note that the change in probabilities is not constant - the curve rises slowly at first, then more quickly in the middle, then levels out at the end. The difference in probabilities between 10 and 12 is far less than the difference in probabilities between 12 and 14. This means that it's impossible to summarise the relationship of age and probabilities with one number without transforming probabilities.
To answer your specific questions:
How do you interpret odds ratios?
The odds ratio for the value of the intercept is the odds of a "success" (in your data, this is the odds of taking the product) when x = 0 (i.e. zero thoughts). The odds ratio for your coefficient is the increase in odds above this value of the intercept when you add one whole x value (i.e. x=1; one thought). Using the menarche data:
exp(coef(m))
(Intercept) Age
6.046358e-10 5.113931e+00
We could interpret this as the odds of menarche occurring at age = 0 is .00000000006. Or, basically impossible. Exponentiating the age coefficient tells us the expected increase in the odds of menarche for each unit of age. In this case, it's just over a quintupling. An odds ratio of 1 indicates no change, whereas an odds ratio of 2 indicates a doubling, etc.
Your odds ratio of 2.07 implies that a 1 unit increase in 'Thoughts' increases the odds of taking the product by a factor of 2.07.
How do you convert odds ratios of thoughts to an estimated probability of decision?
You need to do this for selected values of thoughts, because, as you can see in the plot above, the change is not constant across the range of x values. If you want the probability of some value for thoughts, get the answer as follows:
exp(intercept + coef*THOUGHT_Value)/(1+(exp(intercept+coef*THOUGHT_Value))
Odds and probability are two different measures, both addressing the same aim of measuring the likeliness of an event to occur. They should not be compared to each other, only among themselves!
While odds of two predictor values (while holding others constant) are compared using "odds ratio" (odds1 / odds2), the same procedure for probability is called "risk ratio" (probability1 / probability2).
In general, odds are preferred against probability when it comes to ratios since probability is limited between 0 and 1 while odds are defined from -inf to +inf.
To easily calculate odds ratios including their confident intervals, see the oddsratio package:
library(oddsratio)
fit_glm <- glm(admit ~ gre + gpa + rank, data = data_glm, family = "binomial")
# Calculate OR for specific increment step of continuous variable
or_glm(data = data_glm, model = fit_glm,
incr = list(gre = 380, gpa = 5))
predictor oddsratio CI.low (2.5 %) CI.high (97.5 %) increment
1 gre 2.364 1.054 5.396 380
2 gpa 55.712 2.229 1511.282 5
3 rank2 0.509 0.272 0.945 Indicator variable
4 rank3 0.262 0.132 0.512 Indicator variable
5 rank4 0.212 0.091 0.471 Indicator variable
Here you can simply specify the increment of your continuous variables and see the resulting odds ratios. In this example, the response admit is 55 times more likely to occur when predictor gpa is increased by 5.
If you want to predict probabilities with your model, simply use type = response when predicting your model. This will automatically convert log odds to probability. You can then calculate risk ratios from the calculated probabilities. See ?predict.glm for more details.
I found this epiDisplay package, works fine! It might be useful for others but note that your confidence intervals or exact results will vary according to the package used so it is good to read the package details and chose the one that works well for your data.
Here is a sample code:
library(epiDisplay)
data(Wells, package="carData")
glm1 <- glm(switch~arsenic+distance+education+association,
family=binomial, data=Wells)
logistic.display(glm1)
Source website
The above formula to logits to probabilities, exp(logit)/(1+exp(logit)), may not have any meaning. This formula is normally used to convert odds to probabilities. However, in logistic regression an odds ratio is more like a ratio between two odds values (which happen to already be ratios). How would probability be defined using the above formula? Instead, it may be more correct to minus 1 from the odds ratio to find a percent value and then interpret the percentage as the odds of the outcome increase/decrease by x percent given the predictor.

Random slope for time in subject not working in lme4

I can not insert a random slope in this model with lme4(1.1-7):
> difJS<-lmer(JS~Tempo+(Tempo|id),dat,na.action=na.omit)
Error: number of observations (=274) <= number of random effects (=278) for term
(Tempo | id); the random-effects parameters and the residual variance (or scale
parameter) are probably unidentifiable
With nlme it is working:
> JSprova<-lme(JS~Tempo,random=~1+Tempo|id,data=dat,na.action=na.omit)
> summary(JSprova)
Linear mixed-effects model fit by REML Data: dat
AIC BIC logLik
769.6847 791.3196 -378.8424
Random effects:
Formula: ~1 + Tempo | id
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
(Intercept) 1.1981593 (Intr)
Tempo 0.5409468 -0.692
Residual 0.5597984
Fixed effects: JS ~ Tempo
Value Std.Error DF t-value p-value
(Intercept) 4.116867 0.14789184 138 27.837013 0.0000
Tempo -0.207240 0.08227474 134 -2.518874 0.0129
Correlation:
(Intr)
Tempo -0.837
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-2.79269550 -0.39879115 0.09688881 0.41525770 2.32111142
Number of Observations: 274
Number of Groups: 139
I think it is a problem of missing data as I have few cases where there is a missing data in time two of the DV but with na.action=na.omit should not the two package behave in the same way?
It is "working" with lme, but I'm 99% sure that your random slopes are indeed confounded with the residual variation. The problem is that you only have two measurements per subject (or only one measurement per subject in 4 cases -- but that's not important here), so that a random slope plus a random intercept for every individual gives one random effect for every observation.
If you try intervals() on your lme fit, it will give you an error saying that the variance-covariance matrix is unidentifiable.
You can force lmer to do it by disabling some of the identifiability checks (see below).
library("lme4")
library("nlme")
library("plyr")
Restrict the data to only two points per individual:
sleepstudy0 <- ddply(sleepstudy,"Subject",
function(x) x[1:2,])
m1 <- lme(Reaction~Days,random=~Days|Subject,data=sleepstudy0)
intervals(m1)
## Error ... cannot get confidence intervals on var-cov components
lmer(Reaction~Days+(Days|Subject),data=sleepstudy0)
## error
If you want you can force lmer to fit this model:
m2B <- lmer(Reaction~Days+(Days|Subject),data=sleepstudy0,
control=lmerControl(check.nobs.vs.nRE="ignore"))
## warning messages
The estimated variances are different from those estimated by lme, but that's not surprising since some of the parameters are jointly unidentifiable.
If you're only interested in inference on the fixed effects, it might be OK to ignore these problems, but I wouldn't recommend it.
The sensible thing to do is to recognize that the variation among slopes is unidentifiable; there may be among-individual variation among slopes, but you just can't estimate it with this model. Don't try; fit a random-intercept model and let the implicit/default random error term take care of the variation among slopes.
There's a recent related question on CrossValidated; there I also refer to another example.

Residual variance extracted from glm and lmer in R

I am trying to take what I have read about multilevel modelling and merge it with what I know about glm in R. I am now using the height growth data from here.
I have done some coding shown below:
library(lme4)
library(ggplot2)
setwd("~/Documents/r_code/multilevel_modelling/")
rm(list=ls())
oxford.df <- read.fwf("oxboys/OXBOYS.DAT",widths=c(2,7,6,1))
names(oxford.df) <- c("stu_code","age_central","height","occasion_id")
oxford.df <- oxford.df[!is.na(oxford.df[,"age_central"]),]
oxford.df[,"stu_code"] <- factor(as.character(oxford.df[,"stu_code"]))
oxford.df[,"dummy"] <- 1
chart <- ggplot(data=oxford.df,aes(x=occasion_id,y=height))
chart <- chart + geom_point(aes(colour=stu_code))
# see if lm and glm give the same estimate
glm.01 <- lm(height~age_central+occasion_id,data=oxford.df)
glm.02 <- glm(height~age_central+occasion_id,data=oxford.df,family="gaussian")
summary(glm.02)
vcov(glm.02)
var(glm.02$residual)
(logLik(glm.01)*-2)-(logLik(glm.02)*-2)
1-pchisq(-2.273737e-13,1)
# lm and glm give the same estimation
# so glm.02 will be used from now on
# see if lmer without level2 variable give same result as glm.02
mlm.03 <- lmer(height~age_central+occasion_id+(1|dummy),data=oxford.df,REML=FALSE)
(logLik(glm.02)*-2)-(logLik(mlm.03)*-2)
# 1-pchisq(-3.408097e-07,1)
# glm.02 and mlm.03 give the same estimation, only if REML=FALSE
mlm.03 gives me the following output:
> mlm.03
Linear mixed model fit by maximum likelihood
Formula: height ~ age_central + occasion_id + (1 | dummy)
Data: oxford.df
AIC BIC logLik deviance REMLdev
1650 1667 -819.9 1640 1633
Random effects:
Groups Name Variance Std.Dev.
dummy (Intercept) 0.000 0.0000
Residual 64.712 8.0444
Number of obs: 234, groups: dummy, 1
Fixed effects:
Estimate Std. Error t value
(Intercept) 142.994 21.132 6.767
age_central 1.340 17.183 0.078
occasion_id 1.299 4.303 0.302
Correlation of Fixed Effects:
(Intr) ag_cnt
age_central 0.999
occasion_id -1.000 -0.999
You can see that there is a variance for the residual in the random effect section, which I have read from Applied Multilevel Analysis - A Practical Guide by Jos W.R. Twisk, that this represents the amount of "unexplained variance" from the model.
I wondered if I could arrive at the same residual variance from glm.02, so I tried the following:
> var(resid(glm.01))
[1] 64.98952
> sd(resid(glm.01))
[1] 8.061608
The results are slightly different from the mlm.03 output. Does this refer to the same "residual variance" stated in mlm.03?
Your glm.02 and glm.01 estimate a simple linear regression model using least squares. On the other hand, mlm.03 is a linear mixed model estimated through maximum likelihood.
I don't know your dataset, but it looks like you use the dummy variable to create a cluster structure at level-2 with zero variance.
So your question has basically two answers, but only the second answer is important in your case. The models glm.02 and mlm.03 do not contain the same residual variance estimate, because...
The models are usually different (mixed effects vs. classical regression). In your case, however, the dummy variable seems to supress the additional variance component in the mixed model. So for me the models seem to be equal.
The method used to estimate the residual variance is different. glm uses LS, lmer uses ML in your code. ML estimates for the residual variance are slightly biased (resulting in smaller variance estimates). This can be solved by using REML instead of ML to estimate variance components.
Using classic ML (instead of REML), however, is still necessary and correct for the likelihood-ratio test. Using REML the comparison of the two likelihoods would not be correct.
Cheers!

Resources