R: Calculate and interpret odds ratio in logistic regression

I am having trouble interpreting the results of a logistic regression. My outcome variable is Decision and is binary (0 or 1, not take or take a product, respectively).
My predictor variable is Thoughts and is continuous, can be positive or negative, and is rounded up to the 2nd decimal point.
I want to know how the probability of taking the product changes as Thoughts changes.
The logistic regression equation is:
glm(Decision ~ Thoughts, family = binomial, data = data)
According to this model, Thoughts has a significant impact on probability of Decision (b = .72, p = .02). To determine the odds ratio of Decision as a function of Thoughts:
exp(coef(results))
Odds ratio = 2.07.
Questions:
How do I interpret the odds ratio?
Does an odds ratio of 2.07 imply that a .01 increase (or decrease) in Thoughts affect the odds of taking (or not taking) the product by 0.07 OR
Does it imply that as Thoughts increases (decreases) by .01, the odds of taking (not taking) the product increase (decrease) by approximately 2 units?
How do I convert odds ratio of Thoughts to an estimated probability of Decision?
Or can I only estimate the probability of Decision at a certain Thoughts score (i.e. calculate the estimated probability of taking the product when Thoughts == 1)?

The coefficient returned by a logistic regression in R is on the logit (log-odds) scale. To convert it to an odds ratio, you can exponentiate it, as you've done above. To convert a logit to a probability, you can use the function exp(logit)/(1+exp(logit)), which base R also provides as plogis(logit). However, there are some things to note about this procedure.
First, I'll use some reproducible data to illustrate:
library('MASS')
data("menarche")
m<-glm(cbind(Menarche, Total-Menarche) ~ Age, family=binomial, data=menarche)
summary(m)
This returns:
Call:
glm(formula = cbind(Menarche, Total - Menarche) ~ Age, family = binomial,
data = menarche)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0363 -0.9953 -0.4900 0.7780 1.3675
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -21.22639 0.77068 -27.54 <2e-16 ***
Age 1.63197 0.05895 27.68 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 3693.884 on 24 degrees of freedom
Residual deviance: 26.703 on 23 degrees of freedom
AIC: 114.76
Number of Fisher Scoring iterations: 4
The coefficients displayed are for logits, just as in your example. If we plot these data and this model, we see the sigmoidal function that is characteristic of a logistic model fit to binomial data
#predict gives the predicted value in terms of logits
plot.dat <- data.frame(prob = menarche$Menarche/menarche$Total,
age = menarche$Age,
fit = predict(m, menarche))
#convert those logit values to probabilities
plot.dat$fit_prob <- exp(plot.dat$fit)/(1+exp(plot.dat$fit))
library(ggplot2)
ggplot(plot.dat, aes(x=age, y=prob)) +
geom_point() +
geom_line(aes(x=age, y=fit_prob))
Note that the change in probabilities is not constant - the curve rises slowly at first, then more quickly in the middle, then levels out at the end. The difference in probabilities between 10 and 12 is far less than the difference in probabilities between 12 and 14. This means that it's impossible to summarise the relationship of age and probabilities with one number without transforming probabilities.
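To put numbers on this, the coefficients printed in the summary above can be plugged straight into the inverse logit. This is only a sketch: the helper name p_at_age is made up for illustration, and plogis() is base R's inverse-logit function.

```r
# Coefficients copied from the summary(m) output above
b0 <- -21.22639
b1 <- 1.63197

# Hypothetical helper: fitted probability of menarche at a given age
p_at_age <- function(age) plogis(b0 + b1 * age)

p <- p_at_age(c(10, 12, 14))
p         # fitted probabilities at ages 10, 12 and 14
diff(p)   # change from 10 to 12, then from 12 to 14
```

The second difference is several times larger than the first, even though both spans cover two years of age.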
To answer your specific questions:
How do you interpret odds ratios?
Exponentiating the intercept gives the odds of a "success" (in your data, the odds of taking the product) when x = 0 (i.e. Thoughts = 0); strictly speaking this is an odds, not an odds ratio. Exponentiating your coefficient gives the odds ratio: the factor by which those odds are multiplied when x increases by one whole unit (e.g. from Thoughts = 0 to Thoughts = 1). Using the menarche data:
exp(coef(m))
(Intercept) Age
6.046358e-10 5.113931e+00
We could interpret this as the odds of menarche occurring at age = 0 being about .0000000006 (6e-10). Or, basically impossible. Exponentiating the age coefficient tells us the expected multiplicative change in the odds of menarche for each one-unit increase in age. In this case, it's just over a quintupling. An odds ratio of 1 indicates no change, whereas an odds ratio of 2 indicates a doubling, etc.
Your odds ratio of 2.07 implies that a 1 unit increase in 'Thoughts' increases the odds of taking the product by a factor of 2.07.
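Since the question asks about a .01 change in Thoughts, here is a small sketch of how the same coefficient scales to a smaller step. It uses the rounded b = .72 from the question (so the per-unit value comes out slightly below the reported 2.07); the variable names are just for illustration.

```r
b <- 0.72                      # coefficient reported in the question (rounded)

or_per_unit <- exp(b)          # odds ratio for a 1-unit increase in Thoughts
or_per_001  <- exp(b * 0.01)   # odds ratio for a 0.01-unit increase

c(or_per_unit, or_per_001)
```

The odds ratio for a step of size d is exp(b * d), which equals the per-unit odds ratio raised to the power d. A .01 increase therefore multiplies the odds by a factor only slightly above 1, rather than adding 0.07 or 2 to anything.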
How do you convert odds ratios of thoughts to an estimated probability of decision?
You need to do this for selected values of thoughts, because, as you can see in the plot above, the change is not constant across the range of x values. If you want the probability of some value for thoughts, get the answer as follows:
exp(intercept + coef*THOUGHT_Value) / (1 + exp(intercept + coef*THOUGHT_Value))
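As a concrete sketch of that calculation, here it is on the menarche model from earlier, since the original data isn't available. The value age_value = 13 is an arbitrary choice for illustration; the last line checks the manual result against predict().

```r
library(MASS)   # provides the menarche data
m <- glm(cbind(Menarche, Total - Menarche) ~ Age,
         family = binomial, data = menarche)

age_value <- 13   # arbitrary predictor value chosen for illustration
logit <- coef(m)[["(Intercept)"]] + coef(m)[["Age"]] * age_value

p_manual  <- exp(logit) / (1 + exp(logit))   # formula written out by hand
p_plogis  <- plogis(logit)                   # same transformation, built into R
p_predict <- predict(m, newdata = data.frame(Age = age_value),
                     type = "response")

c(p_manual, p_plogis, unname(p_predict))     # all three should agree
```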

Odds and probability are two different measures, both addressing the same aim of measuring the likeliness of an event to occur. They should not be compared to each other, only among themselves!
While odds of two predictor values (while holding others constant) are compared using "odds ratio" (odds1 / odds2), the same procedure for probability is called "risk ratio" (probability1 / probability2).
In general, odds are preferred over probabilities when it comes to ratios, since probability is limited to between 0 and 1 while odds range from 0 to +inf (and log odds from -inf to +inf).
To easily calculate odds ratios including their confidence intervals, see the oddsratio package:
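A quick numerical check of how the three scales relate, using a handful of arbitrary probabilities (the variable names are only illustrative):

```r
p <- c(0.01, 0.25, 0.5, 0.75, 0.99)   # arbitrary probabilities

odds     <- p / (1 - p)   # always positive, unbounded above
log_odds <- log(odds)     # negative below p = 0.5, positive above; same as qlogis(p)

data.frame(p, odds, log_odds)
```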
library(oddsratio)
fit_glm <- glm(admit ~ gre + gpa + rank, data = data_glm, family = "binomial")
# Calculate OR for specific increment step of continuous variable
or_glm(data = data_glm, model = fit_glm,
incr = list(gre = 380, gpa = 5))
  predictor oddsratio CI.low (2.5 %) CI.high (97.5 %)          increment
1       gre     2.364          1.054            5.396                380
2       gpa    55.712          2.229         1511.282                  5
3     rank2     0.509          0.272            0.945 Indicator variable
4     rank3     0.262          0.132            0.512 Indicator variable
5     rank4     0.212          0.091            0.471 Indicator variable
Here you can simply specify the increment of your continuous variables and see the resulting odds ratios. In this example, the odds of the response admit are about 55 times higher when predictor gpa is increased by 5.
If you want to predict probabilities with your model, simply use type = "response" when predicting from your model. This will automatically convert log odds to probability. You can then calculate risk ratios from the calculated probabilities. See ?predict.glm for more details.
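A minimal sketch of that last step, using the menarche data from MASS as a stand-in (the ages compared are arbitrary). It also shows why the two ratios behave differently: the odds ratio per unit is the same everywhere, while the risk ratio depends on where you start.

```r
library(MASS)   # provides the menarche data
m <- glm(cbind(Menarche, Total - Menarche) ~ Age,
         family = binomial, data = menarche)

ages <- data.frame(Age = c(11, 12, 15, 16))          # arbitrary comparison points
p <- unname(predict(m, newdata = ages, type = "response"))  # probabilities

risk_ratio_young <- p[2] / p[1]   # age 12 vs 11
risk_ratio_old   <- p[4] / p[3]   # age 16 vs 15

odds <- p / (1 - p)
odds_ratio_young <- odds[2] / odds[1]
odds_ratio_old   <- odds[4] / odds[3]

c(risk_ratio_young, risk_ratio_old)   # differ across the age range
c(odds_ratio_young, odds_ratio_old)   # both equal exp(coef(m)["Age"])
```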

I found the epiDisplay package, which works fine! It might be useful for others, but note that your confidence intervals or exact results will vary according to the package used, so it is good to read the package details and choose the one that works well for your data.
Here is a sample code:
library(epiDisplay)
data(Wells, package="carData")
glm1 <- glm(switch~arsenic+distance+education+association,
family=binomial, data=Wells)
logistic.display(glm1)

The above formula for converting logits to probabilities, exp(logit)/(1+exp(logit)), may not have any meaning when applied to a coefficient. This formula is normally used to convert log odds to probabilities. However, in logistic regression an exponentiated coefficient is an odds ratio, i.e. a ratio between two odds values (which happen to already be ratios). How would probability be defined using the above formula? Instead, it may be more correct to subtract 1 from the odds ratio to get a percentage, and then interpret it as the odds of the outcome increasing/decreasing by x percent for a one-unit increase in the predictor.

Related

How do you interpret thresholds in CLMM (intercepts and odds ratio)

How do you express and interpret CLMM thresholds? Preferably using Odds ratio. Any help is highly appreciated! :)
Some background info: I have done a survey of human perceptions towards an animal. I used a CLMM to analyze the data with perception (1, 2, 3) as my response variable. My predictors are education, gender, experience of crop-raiding and observation of the animal. Village (where the survey was performed) was included as a random effect. Using package: "ordinal", https://cran.r-project.org/web/packages/ordinal/ordinal.pdf
Now I'm in the process of interpreting the results of the CLMM. I would like to use sjPlot and analyze results based on odds ratios, but I'm having issues understanding the thresholds. E.g. 1|2 comes out with an estimate of -1.0375 and an odds ratio of 0.35. The sjPlot output also indicates that the odds ratio has a significant p-value (figure of sjPlot below). What does this threshold odds ratio mean?
Xclmm <- clmm(OrdinalPercept ~ Education + Gender + Crop_raid + Obs + (1|IDVillage), data = HR, Hess = TRUE, threshold = "flexible")
summary(Xclmm)
Coefficients
Am I right in this interpretation of the estimates?
Threshold coefficient estimate:
1|2 -1.0375
2|3 0.3724
1 < -1.0375
2 > -1.0375 and < 0.3724 (-1.0376 to 0.3723)
3 > 0.3724
Respondent of gender female (estimate = 0.783) is expected to be in the third group ((3), estimate > 0.3724). And female respondents have 2.19 (odds ratio) times the odds of men, being in that third group (perception group 3).
What does the odds ratio of 1|2 and 2|3 mean?
Sjplot
Thanks :)
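One way to see what the thresholds do is to turn them into category probabilities. The sketch below assumes the parameterisation documented for the ordinal package, logit(P(Y <= j)) = threshold_j - eta, with the random effect set to zero and all other predictors at their reference level. The helper cat_probs is hypothetical, and the numbers are the estimates quoted in the question.

```r
theta <- c("1|2" = -1.0375, "2|3" = 0.3724)   # thresholds from the question
beta_female <- 0.783                          # Gender estimate from the question

# Hypothetical helper: category probabilities for a linear predictor eta
cat_probs <- function(eta) {
  cum <- plogis(unname(theta) - eta)    # P(Y <= 1), P(Y <= 2)
  probs <- diff(c(0, cum, 1))           # P(Y = 1), P(Y = 2), P(Y = 3)
  setNames(probs, c("1", "2", "3"))
}

cat_probs(0)             # reference respondent
cat_probs(beta_female)   # female respondent, other predictors at reference
exp(theta)               # baseline odds of Y <= j versus Y > j
```

Under that assumption, exp(-1.0375), about 0.35, is the baseline odds of being in category 1 rather than a higher category. It is an odds, not an odds ratio comparing two groups, so a p-value testing it against 1 is usually not of much interest.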

Back-transforming gamma GLM to natural units to be able to predict values in unsampled locations

I'm working with ecological data, where I have used cameras to sample animal detections (converted to biomass) and run various models to identify the best fitting model, chosen through looking at diagnostic plots, AIC and parameter effect size. The model is a gamma GLM (due to biomass having a continuous response), with a log link. The chosen model has the predictor variables of distance to water ("dist_water") and distance to forest patch ("dist_F3"). This is the model summary:
glm(formula = RAI_biomass ~ Dist_water.std + Dist_F3.std, family = Gamma(link = "log"),
data = biomass_RAI)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3835 -1.0611 -0.3937 0.4355 1.5923
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.3577 0.2049 26.143 2.33e-16 ***
Dist_water.std -0.7531 0.2168 -3.474 0.00254 **
Dist_F3.std 0.5831 0.2168 2.689 0.01452 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Gamma family taken to be 0.9239696)
Null deviance: 41.231 on 21 degrees of freedom
Residual deviance: 24.232 on 19 degrees of freedom
AIC: 287.98
Number of Fisher Scoring iterations: 7
The covariates were standardised prior to running the model. What I need to do now is to back-transform this model into natural units in order to predict biomass values at unsampled locations (in this case, farms). I made a table of each farm and their respective distance to water, and distance to forest patch. I thought the way to do this would be to use the exp(coef(biomass_glm)), but when I did this the dist_water.std coefficient changed direction and became positive.
exp(coef(biomass_glm8))
## Intercept Dist_water.std Dist_F3.std
## 212.2369519 0.4709015 1.7915026
To me this seems problematic, as in the original GLM, an increasing distance to water meant a decrease in biomass (this makes sense) - but now we are seeing the opposite? The calculated biomass response had a very narrow range, from 210.97-218.9331 (for comparison, in the original data, biomass ranged from 3.04-2227.99).
I then tried to take the exponent of the entire model, without taking the exponent of each coefficient individually:
farms$biomass_est2 <- exp(5.3577 + (-0.7531*farms$Farm_dist_water_std) + (0.5831*farms$Farm_dist_F3_std))
and this gave me a new biomass response that makes a bit more sense, i.e. more variation given the variation in the two covariates (2.93-1088.84).
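For what it's worth, exponentiating the whole linear predictor is what predict(..., type = "response") does for a log-link GLM. Each exp(coef) on its own is a multiplicative factor per standardised unit, so a value below 1 (such as 0.47) still means a decrease. Here is a sketch on simulated data; all names and numbers are made up and this is not the biomass data.

```r
set.seed(1)
n <- 200
# Made-up raw distances standing in for distance to water / forest patch
dat <- data.frame(water_raw = runif(n, 0, 5000), forest_raw = runif(n, 0, 2000))

# Standardise, keeping the centring and scaling constants for later
w_mu <- mean(dat$water_raw);  w_sd <- sd(dat$water_raw)
f_mu <- mean(dat$forest_raw); f_sd <- sd(dat$forest_raw)
dat$water_std  <- (dat$water_raw  - w_mu) / w_sd
dat$forest_std <- (dat$forest_raw - f_mu) / f_sd

# Simulate a gamma response with a log-linear mean
mu <- exp(5 - 0.7 * dat$water_std + 0.5 * dat$forest_std)
dat$biomass <- rgamma(n, shape = 2, rate = 2 / mu)

fit <- glm(biomass ~ water_std + forest_std,
           family = Gamma(link = "log"), data = dat)

# New sites must be standardised with the ORIGINAL means and SDs
new_sites <- data.frame(water_raw = c(500, 4000), forest_raw = c(1500, 200))
new_sites$water_std  <- (new_sites$water_raw  - w_mu) / w_sd
new_sites$forest_std <- (new_sites$forest_raw - f_mu) / f_sd

pred_response <- predict(fit, newdata = new_sites, type = "response")
pred_manual   <- exp(predict(fit, newdata = new_sites, type = "link"))
cbind(pred_response, pred_manual)   # identical columns
```

The point about standardisation matters for the farms table: if the farm distances were standardised with their own mean and SD rather than those of the camera sites, the predictions will be on the wrong scale.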
I then tried converting the coefficient estimates by doing e^B - 1, which gave again different results, although most similar to the ones obtained through exp(coef(biomass_glm)):
exp(-0.7531) - 1 #dist_water = -0.5290955
exp(0.5831) - 1 #dist_F3 = 0.7915837
exp(5.3577) - 1 #intercept = 211.2362
My question is, why are these estimates different, and what is the best way to take this gamma GLM with a log link and convert it into a format that can be used to calculate predicted values? Any help would be greatly appreciated!

How to report with APA style a Bayesian Linear (Mixed) Models using rstanarm?

I'm currently struggling with how to report, following APA-6 recommendations, the output of rstanarm::stan_lmer().
First, I'll fit a mixed model within the frequentist approach, then will try to do the same using the bayesian framework.
Here's the reproducible code to get the data:
library(tidyverse)
library(neuropsychology)
library(rstanarm)
library(lmerTest)
df <- neuropsychology::personality %>%
select(Study_Level, Sex, Negative_Affect) %>%
mutate(Study_Level=as.factor(Study_Level),
Negative_Affect=scale(Negative_Affect)) # I understood that scaling variables is important
Now, let's fit a linear mixed model in the "traditional" way to test the impact of Sex (male/female) on Negative Affect (negative mood) with the study level (years of education) as random factor.
fit <- lmer(Negative_Affect ~ Sex + (1|Study_Level), df)
summary(fit)
The output is the following:
Linear mixed model fit by REML t-tests use Satterthwaite approximations to degrees of
freedom [lmerMod]
Formula: Negative_Affect ~ Sex + (1 | Study_Level)
Data: df
REML criterion at convergence: 3709
Scaled residuals:
Min 1Q Median 3Q Max
-2.58199 -0.72973 0.02254 0.68668 2.92841
Random effects:
Groups Name Variance Std.Dev.
Study_Level (Intercept) 0.04096 0.2024
Residual 0.94555 0.9724
Number of obs: 1327, groups: Study_Level, 8
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 0.01564 0.08908 4.70000 0.176 0.868
SexM -0.46667 0.06607 1321.20000 -7.064 2.62e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr)
SexM -0.149
To report it, I would say that "we fitted a linear mixed model with negative affect as outcome variable, sex as predictor and study level was entered as a random effect. Within this model, the male level led to a significant decrease of negative affect (beta = -0.47, t(1321)=-7.06, p < .001)."
Is that correct?
Then, let's try to fit the model within a bayesian framework using rstanarm:
fitB <- stan_lmer(Negative_Affect ~ Sex + (1|Study_Level),
data=df,
prior=normal(location=0, scale=1),
prior_intercept=normal(location=0, scale=1),
prior_PD=FALSE)
print(fitB, digits=2)
This returns:
stan_lmer
family: gaussian [identity]
formula: Negative_Affect ~ Sex + (1 | Study_Level)
------
Estimates:
Median MAD_SD
(Intercept) 0.02 0.10
SexM -0.47 0.07
sigma 0.97 0.02
Error terms:
Groups Name Std.Dev.
Study_Level (Intercept) 0.278
Residual 0.973
Num. levels: Study_Level 8
Sample avg. posterior predictive
distribution of y (X = xbar):
Median MAD_SD
mean_PPD 0.00 0.04
------
For info on the priors used see help('prior_summary.stanreg').
I think that median is the median of the posterior distribution of the coefficient and MAD_SD the equivalent of the standard deviation. These parameters are close to the beta and standard error of the frequentist model, which is reassuring. However, I do not know how to formalize and put the output in words.
Moreover, if I do the summary of the model (summary(fitB, probs=c(.025, .975), digits=2)), I get other features of the posterior distribution:
...
Estimates:
mean sd 2.5% 97.5%
(Intercept) 0.02 0.11 -0.19 0.23
SexM -0.47 0.07 -0.59 -0.34
...
Is something like the following good?
"we fitted a linear mixed model within the bayesian framework with negative affect as outcome variable, sex as predictor and study level was entered as a random effect. Priors for the coefficient and the intercept were set to normal (mean=0, sd=1). Within this model, the features of the posterior distribution of the coefficient associated with the male level suggest a decrease of negative affect (mean = -0.47, sd = 0.07, 95% CI[-0.59, -0.34])."
Thanks for your help.
The following is personal opinion that may or may not be acceptable to a psychology journal.
To report it, I would say that "we fitted a linear mixed model with negative affect as outcome variable, sex as predictor and study level was entered as a random effect. Within this model, the male level led to a significant decrease of negative affect (beta = -0.47, t(1321)=-7.06, p < .001)."
Is that correct?
That is considered correct from a frequentist perspective.
The key concepts from a Bayesian perspective are that (conditional on the model, of course)
There is a 0.5 probability that the true effect is less than the posterior median and a 0.5 probability that the true effect is greater than the posterior median. Frequentists tend to see a posterior median as being like a numerical optimum.
The posterior_interval function yields credible intervals around the median with a default probability of 0.9 (although a lower number produces more accurate estimates of the bounds). So, you can legitimately say that there is a probability of 0.9 that the true effect is between those bounds. Frequentists tend to see confidence intervals as being like credible intervals.
The as.data.frame function will give you access to the raw draws, so mean(as.data.frame(fitB)$SexM > 0) yields the probability that the expected difference in the outcome between men and women in the same study is positive. Frequentists tend to see these probabilities as being like p-values.
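The three summaries can be sketched in base R on a vector of draws. The draws below are simulated stand-ins (a normal centred at -0.47 with sd 0.07), NOT output from the model; with the real fit you would take the column as.data.frame(fitB)$SexM, or call posterior_interval(fitB, prob = 0.9) for the interval.

```r
set.seed(42)
# Stand-in for the posterior draws of the SexM coefficient (simulated, not real output)
draws <- rnorm(4000, mean = -0.47, sd = 0.07)

post_median <- median(draws)                       # 50% of the draws lie on each side
ci_90 <- quantile(draws, probs = c(0.05, 0.95))    # central 90% credible interval
p_negative <- mean(draws < 0)                      # posterior probability of a negative effect

post_median
ci_90
p_negative
```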
For a Bayesian approach, I would say
We fit a linear model using Markov Chain Monte Carlo with negative affect as the outcome variable, sex as predictor and the intercept was allowed to vary by study level.
And then talk about the estimates using the three concepts above.

Understanding output of 'predict' in R

I'm trying to understand the output from predict(), as well as understand whether this approach is appropriate for the problem I'm trying to solve. The prediction intervals don't make sense to me, but when I plot this on a scatterplot it looks like a good model:
I created a simple linear regression model of deal size ($) with a company's sales volume as a predictor variable. The data is faked, with deal size being a multiple of sales volume plus or minus some noise:
Call:
lm(formula = deal_size ~ sales_volume, data = accounts)
Residuals:
Min 1Q Median 3Q Max
-19123502 -3794671 -3426616 4838578 17328948
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.709e+06 1.727e+05 21.48 <2e-16 ***
sales_volume 1.898e-01 2.210e-03 85.88 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6452000 on 1586 degrees of freedom
Multiple R-squared: 0.823, Adjusted R-squared: 0.8229
F-statistic: 7376 on 1 and 1586 DF, p-value: < 2.2e-16
The predictions were generated thusly:
d = data.frame(accounts, predict(fit, interval="prediction"))
When I plot sales_volume vs. deal_size on a scatterplot, and overlay the regression line with the prediction interval, it looks good, except for a few intervals that span negative values where sales is at or near zero.
I understand fit is the predicted value, but what are lwr and upr? Do they define the intervals in absolute terms (y coordinates)? The intervals seem to be extremely wide, wider than would make sense if my model was a good fit:
sales_volume    deal_size          fit          lwr          upr
           0            0  3709276.494  -8950776.04  16369329.03
           0   8586337.22  3709276.494  -8950776.04  16369329.03
      110000  549458.6512  3730150.811 -8929897.298  16390198.92
When you use predict with an lm model, you can specify an interval. You have three choices: none, which returns no interval, confidence and prediction. The last two return different values. The first column will be, as you said, the predicted values (column fit). You then have two other columns: lwr and upr, which are the lower and upper bounds of the requested interval.
What is the difference between confidence and prediction?
confidence is a (by default 95%, use level if you wish to change that) confidence interval for the mean of the predicted value. It is the green interval on your plot. Whereas prediction is a (also 95%) prediction interval for individual values, meaning that should you repeat your experiment/survey/... a huge number of times, you can expect that about 95% of new values will fall in the yellow interval, thus making it a lot wider than the green one, as the green one only evaluates the mean.
And as you can see on your plot, almost all values are in the yellow interval. R doesn't know that your values can only be positive, which explains why the yellow interval "begins" under 0.
Also, when you say "The intervals seem to be extremely wide, wider than would make sense if my model was a good fit", you can see in your plot that the interval is not that big, considering that you can expect 95% of the values to be in it, and you can clearly see a trend in your data. And your model is clearly a good fit, as the adjusted R squared and the global p-value tell you.
Just a slight rephrasing of @etienne above, which is very good and accurate.
Confidence interval is the (1-alpha; e.g. 95%) interval for the mean prediction (or group response). I.e. if you have 10 new companies with sales volume of 2e+08, predict(..., interval = "confidence") will give you the long-run average interval for your group mean.
With Var(\hat y | X = x*) = \sigma^2 (1/n + (x* - \bar x)^2 / SXX)
The prediction interval is the (1-alpha; e.g. 95%) interval for an individual response -- predict(..., interval = "prediction"). I.e. for a single new company with sales volume of 2e+08...
With Var(y* - \hat y | X = x*) = \sigma^2 (1 + 1/n + (x* - \bar x)^2 / SXX)
(Sorry that LaTeX isn't supported)
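Both intervals can be reproduced by hand from those two variance expressions. A minimal sketch on simulated data (not the deal-size data; all names and numbers are made up), with sigma^2 replaced by its estimate from the fit:

```r
set.seed(123)
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 3 * x + rnorm(n, sd = 2)   # simulated data
fit <- lm(y ~ x)

x0 <- 5                             # arbitrary new predictor value
new <- data.frame(x = x0)
conf_int <- predict(fit, new, interval = "confidence")
pred_int <- predict(fit, new, interval = "prediction")

# Reproduce both intervals from the variance formulas
s2    <- summary(fit)$sigma^2               # estimate of sigma^2
sxx   <- sum((x - mean(x))^2)
h     <- 1 / n + (x0 - mean(x))^2 / sxx
tcrit <- qt(0.975, df = n - 2)

conf_manual <- conf_int[1, "fit"] + c(-1, 1) * tcrit * sqrt(s2 * h)
pred_manual <- pred_int[1, "fit"] + c(-1, 1) * tcrit * sqrt(s2 * (1 + h))

rbind(conf_int[1, c("lwr", "upr")], conf_manual)
rbind(pred_int[1, c("lwr", "upr")], pred_manual)
```

The extra "1 +" term is the variance of a single new observation around the line, which is why the prediction interval stays wide no matter how large n gets.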

What is the difference between Multiple R-squared and Adjusted R-squared in a single-variate least squares regression?

Could someone explain to the statistically naive what the difference between Multiple R-squared and Adjusted R-squared is? I am doing a single-variate regression analysis as follows:
v.lm <- lm(epm ~ n_days, data=v)
print(summary(v.lm))
Results:
Call:
lm(formula = epm ~ n_days, data = v)
Residuals:
Min 1Q Median 3Q Max
-693.59 -325.79 53.34 302.46 964.95
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2550.39 92.15 27.677 <2e-16 ***
n_days -13.12 5.39 -2.433 0.0216 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 410.1 on 28 degrees of freedom
Multiple R-squared: 0.1746, Adjusted R-squared: 0.1451
F-statistic: 5.921 on 1 and 28 DF, p-value: 0.0216
The "adjustment" in adjusted R-squared is related to the number of variables and the number of observations.
If you keep adding variables (predictors) to your model, R-squared will improve - that is, the predictors will appear to explain the variance - but some of that improvement may be due to chance alone. So adjusted R-squared tries to correct for this, by taking into account the ratio (N-1)/(N-k-1) where N = number of observations and k = number of variables (predictors).
It's probably not a concern in your case, since you have a single variate.
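Using that ratio, the usual formula is adjusted R-squared = 1 - (1 - R-squared) * (N-1)/(N-k-1). A quick check against the numbers in the output above (N = 30 is inferred from the 28 residual degrees of freedom plus the two estimated coefficients):

```r
r2 <- 0.1746   # Multiple R-squared from the output above
n  <- 30       # 28 residual degrees of freedom + 2 estimated coefficients
k  <- 1        # one predictor (n_days)

adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
round(adj_r2, 4)   # 0.1451, matching the reported adjusted R-squared
```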
The Adjusted R-squared is close to, but different from, the value of R2. Instead of being based on the explained sum of squares SSR and the total sum of squares SSY, it is based on the overall variance (a quantity we do not typically calculate), s2T = SSY/(n - 1) and the error variance MSE (from the ANOVA table) and is worked out like this: adjusted R-squared = (s2T - MSE) / s2T.
This approach provides a better basis for judging the improvement in a fit due to adding an explanatory variable, but it does not have the simple summarizing interpretation that R2 has.
If I haven't made a mistake, you can verify the values of adjusted R-squared and R-squared as follows:
s2T <- sum(anova(v.lm)[[2]]) / sum(anova(v.lm)[[1]])
MSE <- anova(v.lm)[[3]][2]
adj.R2 <- (s2T - MSE) / s2T
On the other side, R2 is: SSR/SSY, where SSR = SSY - SSE
attach(v)
SSE <- deviance(v.lm) # or SSE <- sum((epm - predict(v.lm,list(n_days)))^2)
SSY <- deviance(lm(epm ~ 1)) # or SSY <- sum((epm-mean(epm))^2)
SSR <- (SSY - SSE) # or SSR <- sum((predict(v.lm,list(n_days)) - mean(epm))^2)
R2 <- SSR / SSY
The R-squared is not dependent on the number of variables in the model. The adjusted R-squared is.
The adjusted R-squared adds a penalty for adding variables to the model that are uncorrelated with the variable you're trying to explain. You can use it to judge whether a variable is relevant to the thing you're trying to explain.
Adjusted R-squared is R-squared with some divisions added to make it dependent on the number of variables in the model.
Note that, in addition to number of predictive variables, the Adjusted R-squared formula above also adjusts for sample size. A small sample will give a deceptively large R-squared.
Ping Yin & Xitao Fan, J. of Experimental Education 69(2): 203-224, "Estimating R-squared shrinkage in multiple regression", compares different methods for adjusting r-squared and concludes that the commonly-used ones quoted above are not good. They recommend the Olkin & Pratt formula.
However, I've seen some indication that population size has a much larger effect than any of these formulas indicate. I am not convinced that any of these formulas are good enough to allow you to compare regressions done with very different sample sizes (e.g., 2,000 vs. 200,000 samples; the standard formulas would make almost no sample-size-based adjustment). I would do some cross-validation to check the r-squared on each sample.
