PERMANOVA assumption that multivariate spread among groups is similar to variance homogeneity in ANOVA - r

I am trying to understand the PERMANOVA assumption that multivariate spread among groups should be similar, analogous to variance homogeneity in univariate ANOVA. I wrote some R code to explore this, but I don't find that result. Why?
My code:
library(vegan)
# Four similar populations:
spdf <- matrix(NA, 60, 4, dimnames =
list(1:60, c("sp1", "sp2", "sp3", "sp4")))
spdf <- as.data.frame(spdf)
eff <- sort(rep(1:6, 10))
spdf$sp1 = eff + rnorm(60, 0, 0.25)
spdf$sp2 = eff + rnorm(60, 0, 0.25)
spdf$sp3 = eff + rnorm(60, 0, 0.25)
spdf$sp4 = eff + rnorm(60, 0, 0.25)
#3 Treatment
treat <- gl(3, 20, labels = paste("t", 1:3, sep=""))
# distance matrix
envdist <- vegdist(spdf, method="euclidean")
# when computing beta-dispersion in anova we have no group differences
# but in adonis is different
anova(betadisper(envdist, treat))
adonis(spdf~treat)

You seem to be confusing a lot of things here. PERMANOVA is a multivariate ANOVA with permutation-based testing. PERMANOVA tests for differences between group centroids, in other words it compares the multivariate means. It does assume homogeneity of variances. To check that any difference between groups in terms of their centroids is not being induced by differences in variances, we might use the multivariate dispersion method implemented in betadisper() in R. adonis() and betadisper() are doing very different things:
adonis() gives an analysis like PERMANOVA,
betadisper() gives an analysis of multivariate spread.
What we can conclude, therefore, is that the two methods behave as expected: adonis() correctly detects a difference in means (a significant treat effect)
> adonis(spdf~treat)
Call:
adonis(formula = spdf ~ treat)
Permutation: free
Number of permutations: 999
Terms added sequentially (first to last)
Df SumsOfSqs MeanSqs F.Model R2 Pr(>F)
treat 2 3.5326 1.76628 113.66 0.79952 0.001 ***
Residuals 57 0.8858 0.01554 0.20048
Total 59 4.4184 1.00000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
and betadisper() correctly fails to reject the null hypothesis of homogeneous multivariate dispersions (all your groups had the same dispersion):
> anova(betadisper(envdist, treat))
Analysis of Variance Table
Response: Distances
Df Sum Sq Mean Sq F value Pr(>F)
Groups 2 0.1521 0.076041 1.1099 0.3366
Residuals 57 3.9050 0.068509
This is all in accord with the way you simulated the data.
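For completeness, a hedged sketch (not part of the original exchange; it reuses treat from the question's code, and the object names spdf2 and envdist2 are invented) of the converse simulation, where the groups share a centroid but differ in spread. Here betadisper() should reject homogeneity of dispersions, while adonis() will typically show no treat effect because the centroids coincide:
set.seed(1)
# Same layout as the question, but equal centroids and group-specific spread
spdf2 <- as.data.frame(matrix(rnorm(60 * 4, mean = 5,
                                    sd = rep(c(0.1, 0.5, 2), each = 20)),
                              nrow = 60, dimnames = list(1:60, paste0("sp", 1:4))))
envdist2 <- vegdist(spdf2, method = "euclidean")
anova(betadisper(envdist2, treat))  # should now reject homogeneous dispersions
adonis(spdf2 ~ treat)               # centroids coincide, so treat is usually n.s.
                                    # (use adonis2() in newer vegan versions)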


How to use lapply or sapply for GLM on multiple species separately?

I am trying to run a GLM on multiple different species in my data set. Currently I have been subsetting my data for each species and copy-pasting this code, and it has turned into quite a mess. I know there has to be a better way to do this (maybe with the lapply function?), but I'm not sure how to begin.
I'm running the model on the CPUE (catch per unit effort) for a species and using Year, Salinity, Discharge, and Rainfall as my explanatory variables.
My data is here: https://drive.google.com/file/d/1_ylbMoqevvsuucwZn2VMA_KMNaykDItk/view?usp=sharing
This is the code that I have tried. It gets the job done, but I have just been copying this code and changing the species each time. I'm hoping to find a way to simplify this process and clean up my code a bit.
fish_df$pinfishCPUE <- ifelse(fish_df$Commonname == "Pinfish", fish_all$CPUE, 0)
#create binomial column
fish_df$binom <- ifelse(fish_df$pinfishCPUE > 0, 1,0)
glm.full.bin = glm(binom~Year+Salinity+Discharge +Rainfall,data=fish_df,family=binomial)
glm.base.bin = glm(binom~Year,data=fish_df,family=binomial)
#step to simplify model and get appropriate order
glm.step.bin = step(glm.base.bin,scope=list(upper=glm.full.bin,lower=~Year),direction='forward',
trace=1,k=log(nrow(fish_df)))
#final model - may choose to reduce based on deviance and cutoff in above step
glm.final.bin = glm.step.bin
print(summary(glm.final.bin))
#calculate the LSMeans for the proportion of positive trips
lsm.b.glm = emmeans(glm.final.bin,"Year",data=fish_df)
LSMeansProp = summary(lsm.b.glm)
Output:
Call:
glm(formula = log.CPUE ~ Month + Salinity + Temperature, family = gaussian,
data = fish_B_pos)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.8927 -0.7852 0.1038 0.8974 3.5887
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.38530 0.72009 3.313 0.00098 ***
Month 0.10333 0.03433 3.010 0.00272 **
Salinity -0.13530 0.01241 -10.900 < 2e-16 ***
Temperature 0.06901 0.01434 4.811 1.9e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 1.679401)
Null deviance: 1286.4 on 603 degrees of freedom
Residual deviance: 1007.6 on 600 degrees of freedom
AIC: 2033.2
Number of Fisher Scoring iterations: 2
I would suggest the following approach: create a function for the models, then use lapply() over the list that results from applying split() to the data frame by the variable Commonname:
library(emmeans)
#Load data
fish_df <- read.csv('fish_df.csv',stringsAsFactors = F)
#Code
List <- split(fish_df,fish_df$Commonname)
#Function for models
mymodelfun <- function(x)
{
#Create binomial column
x$binom <- ifelse(x$pinfishCPUE > 0, 1,0)
glm.full.bin = glm(binom~Year+Salinity+Discharge +Rainfall,data=x,family=binomial)
glm.base.bin = glm(binom~Year,data=x,family=binomial)
#step to simplify model and get appropriate order
glm.step.bin = step(glm.base.bin,scope=list(upper=glm.full.bin,lower=~Year),direction='forward',
trace=1,k=log(nrow(x)))
#final model - may choose to reduce based on deviance and cutoff in above step
glm.final.bin = glm.step.bin
print(summary(glm.final.bin))
#calculate the LSMeans for the proportion of positive trips
lsm.b.glm = emmeans(glm.final.bin,"Year",data=x)
LSMeansProp = summary(lsm.b.glm)
return(LSMeansProp)
}
#Apply function
Lmods <- lapply(List,mymodelfun)
Lmods will contain the results of the models; here is an example:
Lmods$`Atlantic Stingray`
Output:
Year emmean SE df asymp.LCL asymp.UCL
2009 -22.6 48196 Inf -94485 94440
Results are given on the logit (not the response) scale.
Confidence level used: 0.95
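As a small follow-up (not part of the original answer), the per-species emmeans summaries stored in Lmods can be stacked into one data frame for easier inspection; this assumes every species model returns the same emmeans columns:
# Stack the per-species LSMeans tables, tagging each row with its species name
Lmods_df <- do.call(rbind, lapply(names(Lmods), function(sp) {
  cbind(Commonname = sp, as.data.frame(Lmods[[sp]]))
}))
head(Lmods_df)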

One-way ANOVA for stratified samples in R

I have a stratified sample with three groups ("a","b","c") that were drawn from a larger population N. All groups have 30 observations, but their proportions in N are unequal, hence their sampling weights differ.
I use the survey package in R to calculate summary statistics and linear regression models and would like to know how to calculate a one-way ANOVA correcting for the survey design (if necessary).
My assumption, and please correct me if I'm wrong, is that the standard error of the variance should normally be higher for a group whose weight is smaller, hence a simple ANOVA that does not account for the survey design should not be reliable.
Here is an example. Any help would be appreciated.
## Oneway- ANOVA tests in R for surveys with stratified sampling-design
library("survey")
# create test data
test.df<-data.frame(
id=1:90,
variable=c(rnorm(n = 30,mean=150,sd=10),
rnorm(n = 30,mean=150,sd=10),
rnorm(n = 30,mean=140,sd=10)),
groups=c(rep("a",30),
rep("b",30),
rep("c",30)),
weights=c(rep(1,30), # undersampled
rep(1,30),
rep(100,30))) # oversampled
# correct for survey design
test.df.survey<-svydesign(id=~id,
strata=~groups,
weights=~weights,
data=test.df)
## descriptive statistics
# boxplot
svyboxplot(~variable~groups,test.df.survey)
# means
svyby(~variable,~groups,test.df.survey,svymean)
# variances
svyby(~variable,~groups,test.df.survey,svyvar)
### ANOVA ###
## One-way ANOVA without correcting for survey design
summary(aov(formula = variable~groups,data = test.df))
Hmm, this is an interesting question. As far as I know, it is difficult to account for weights in a one-way ANOVA, so I'll show the way I would approach this problem.
I'm going to use a two-way ANOVA and then some post hoc tests.
First of all, let's build a linear model based on your data and check how it looks.
library(car)
library(agricolae)
model.lm = lm(variable ~ groups * weights, data = test.df)
shapiro.test(resid(model.lm))
Shapiro-Wilk normality test
data: resid(model.lm)
W = 0.98238, p-value = 0.263
leveneTest(variable ~ groups * factor(weights), data = test.df)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 2.6422 0.07692 .
87
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The distribution is close to normal; Levene's test is borderline (p ≈ 0.077), so homogeneity of variance, which a parametric test such as ANOVA assumes, is questionable. However, let's perform the test anyway.
Several plots to check that our data fits to this test:
hist(resid(model.lm))
plot(model.lm)
Here is the interpretation of the plots; they actually don't look bad.
Let's run two-way anova:
anova(model.lm)
Analysis of Variance Table
Response: variable
Df Sum Sq Mean Sq F value Pr(>F)
groups 2 2267.8 1133.88 9.9566 0.0001277 ***
Residuals 87 9907.8 113.88
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
As you can see, the results are very close to yours. Now some post hoc tests:
(result.hsd = HSD.test(model.lm, list('groups', 'weights')))
$statistics
MSerror Df Mean CV MSD
113.8831 87 147.8164 7.2195 6.570186
$parameters
test name.t ntr StudentizedRange alpha
Tukey groups:weights 3 3.372163 0.05
$means
variable std r Min Max Q25 Q50 Q75
a:1 150.8601 11.571185 30 113.3240 173.0429 145.2710 151.9689 157.8051
b:1 151.8486 8.330029 30 137.1907 176.9833 147.8404 150.3161 154.7321
c:100 140.7404 11.762979 30 118.0823 163.9753 131.6112 141.1810 147.8231
$comparison
NULL
$groups
variable groups
b:1 151.8486 a
a:1 150.8601 a
c:100 140.7404 b
attr(,"class")
[1] "group"
And here is a slightly different way:
aov_cont<- aov(test.df$variable ~ test.df$groups * test.df$weights)
summary(aov_cont)
Df Sum Sq Mean Sq F value Pr(>F)
test.df$groups 2 2268 1133.9 9.957 0.000128 ***
Residuals 87 9908 113.9
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(TukeyHSD(aov_cont))
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = test.df$variable ~ test.df$groups * test.df$weights)
$`test.df$groups`
diff lwr upr p adj
b-a 0.9884608 -5.581725 7.558647 0.9315792
c-a -10.1197048 -16.689891 -3.549519 0.0011934
c-b -11.1081657 -17.678352 -4.537980 0.0003461
Summarizing, the results are very close to yours. Personally, I would run a two-way ANOVA with the (*) operator, or with (+) when you are sure that your variables are independent (an additive model).
Group c, with the larger weight, differs substantially from groups a and b.
According to the main statistician of our institute, there is no easy implementation of this kind of analysis in any common modeling environment. The reason is that ANOVA and ANCOVA are linear models that were not developed further after the emergence of general linear models (and later generalized linear models, GLMs) in the 1970s.
A normal linear regression model yields practically the same results as an ANOVA, but is much more flexible regarding variable choice. Since weighting methods exist for GLMs (see survey package in R) there is no real need to develop methods to weight for stratified sampling design in ANOVA... simply use a GLM instead.
summary(svyglm(variable~groups,test.df.survey))
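If you also want a single omnibus test of the groups effect (the design-based analogue of the one-way ANOVA F test), the survey package's regTermTest() can be applied to that fitted svyglm. A short sketch, reusing test.df.survey from the question (the object name mod.svy is mine):
mod.svy <- svyglm(variable ~ groups, design = test.df.survey)
# Design-based Wald test of the whole groups term
regTermTest(mod.svy, ~groups)
# A working likelihood-ratio variant is also available:
# regTermTest(mod.svy, ~groups, method = "LRT")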

get p-values from post-hoc duncan test in r

I want to perform a post-hoc Duncan test (using the "agricolae" package in R) after running a one-way ANOVA comparing the means of 3 groups.
## run one-way anova
> t1 <- aov(q3a ~ pgy,data = pgy)
> summary(t1)
Df Sum Sq Mean Sq F value Pr(>F)
pgy 2 13 6.602 5.613 0.00367 **
Residuals 6305 7416 1.176
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
1541 observations deleted due to missingness
## run post-hoc duncan test
> duncan.test(t1,"pgy",group = T, console = T)
Study: t1 ~ "pgy"
Duncan's new multiple range test
for q3a
Mean Square Error: 1.176209
pgy, means
q3a std r Min Max
PGY1 1.604292 1.068133 2656 1 5
PGY2 1.711453 1.126446 2017 1 5
PGY3 1.656269 1.057937 1635 1 5
Groups according to probability of means differences and alpha level( 0.05 )
Means with the same letter are not significantly different.
q3a groups
PGY2 1.711453 a
PGY3 1.656269 ab
PGY1 1.604292 b
However, the output only tells me that the means of PGY1 and PGY2 are different, without p-values for each group comparison (post-hoc pairwise t tests would generate p-values for each comparison).
How can I get p-values from a Duncan test?
Thanks!!
One solution would be to use PostHocTest from the DescTools package.
Here is an example using the warpbreaks sample data.
require(DescTools);
res <- aov(breaks ~ tension, data = warpbreaks);
PostHocTest(res, method = "duncan");
#
# Posthoc multiple comparisons of means : Duncan's new multiple range test
# 95% family-wise confidence level
#
#$tension
# diff lwr.ci upr.ci pval
#M-L -10.000000 -17.95042 -2.049581 0.01472 *
#H-L -14.722222 -23.08443 -6.360012 0.00072 ***
#H-M -4.722222 -12.67264 3.228197 0.23861
The pairwise differences between the means for every group are given in the first column (e.g. M-L, and so on), along with confidence intervals and p-values.
For example, the difference in the mean breaks between H and M is not statistically significant.
If performing Duncan's test is not a critical requirement, you can also run pairwise.t.test with various other multiple comparison corrections. For example, using Bonferroni's method
with(warpbreaks, pairwise.t.test(breaks, tension, p.adj = "bonferroni"));
#
# Pairwise comparisons using t tests with pooled SD
#
#data: breaks and tension
#
# L M
#M 0.0442 -
#H 0.0015 0.7158
#
#P value adjustment method: bonferroni
Results are consistent with those from the post-hoc Duncan's test.
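If you prefer to stay within agricolae, calling duncan.test() with group = FALSE should, as far as I recall its interface, return the pairwise comparisons, including p-values, in the $comparison element rather than letter groupings. A sketch on the same warpbreaks example:
library(agricolae)
res.aov <- aov(breaks ~ tension, data = warpbreaks)
# Pairwise differences with p-values instead of letter groups
duncan.test(res.aov, "tension", group = FALSE, console = TRUE)$comparison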

anova.rq() in quantreg package in R

I'm interested in comparing estimates from different quantiles (same outcome, same covariates) using the anova.rqlist function, which is called by anova in the quantreg package in R. However, the math in the function is beyond my rudimentary expertise. Let's say I fit 3 models at different quantiles:
library(quantreg)
data(Mammals) # data in quantreg to be used as a useful example
fit1 <- rq(weight ~ speed + hoppers + specials, tau = .25, data = Mammals)
fit2 <- rq(weight ~ speed + hoppers + specials, tau = .5, data = Mammals)
fit3 <- rq(weight ~ speed + hoppers + specials, tau = .75, data = Mammals)
Then I compare them using:
anova(fit1, fit2, fit3, test="Wald", joint=FALSE)
My question is: which of these models is being used as the basis of the comparison?
My understanding of the Wald test (wiki entry) is that it is based on the statistic W = (θ̂ − θ0)² / Var(θ̂), where θ̂ is the estimate of the parameter(s) of interest θ that is compared with the proposed value θ0.
So my question is what is the anova function in quantreg choosing as the θ0?
Based on the p-value returned from the anova, my best guess is that it is choosing the lowest quantile specified (i.e. tau = 0.25). Is there a way to specify the median (tau = 0.5) or, better yet, the mean estimate obtained using lm(y ~ x1 + x2 + x3, data)?
anova(fit1, fit2, fit3, joint=FALSE)
actually produces
Quantile Regression Analysis of Deviance Table
Model: weight ~ speed + hoppers + specials
Tests of Equality of Distinct Slopes: tau in { 0.25 0.5 0.75 }
Df Resid Df F value Pr(>F)
speed 2 319 1.0379 0.35539
hoppersTRUE 2 319 4.4161 0.01283 *
specialsTRUE 2 319 1.7290 0.17911
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
while
anova(fit3, fit1, fit2, joint=FALSE)
produces the exact same result
Quantile Regression Analysis of Deviance Table
Model: weight ~ speed + hoppers + specials
Tests of Equality of Distinct Slopes: tau in { 0.5 0.25 0.75 }
Df Resid Df F value Pr(>F)
speed 2 319 1.0379 0.35539
hoppersTRUE 2 319 4.4161 0.01283 *
specialsTRUE 2 319 1.7290 0.17911
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The order of the models is clearly being changed in the anova, but how is it that the F value and Pr(>F) are identical in both tests?
All the quantiles you input are used and there is not one model used as a reference.
I suggest you read this post and the related answer to understand what your "theta.0" is.
I believe what you are trying to do is to test whether the regression lines are parallel. In other words, whether the effects of the predictor variables (speed, hoppers and specials here) are uniform across quantiles.
You can use the anova() from the quantreg package to answer this question. You should indeed use several fits for each quantile.
When you use joint=FALSE as you did, you get coefficient-wise comparisons: one line per slope coefficient, so three lines here. Your results tell you that the effect of hoppers is not uniform across quantiles in your example, while there is no evidence that the slopes for speed or specials differ.
You can do an overall test of equality of the entire set of coefficients if you do not use joint=FALSE; that would give you a "Joint Test of Equality of Slopes" and therefore only one p-value.
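For illustration (this call is not shown in the question, but uses the same three fits), the joint version would simply be:
# One overall F test that all slope coefficients are equal across the three taus
anova(fit1, fit2, fit3)   # "Joint Test of Equality of Slopes"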
EDIT:
I think theta.0 is the average slope for all 'tau' values or the actual estimate from 'lm()', rather than a specific slope of any of the models. My reasoning is that 'anova.rq()' does not require any specific low value of 'tau' or even the median 'tau'.
There are several ways to test this. Either do the calculations by hand with theta.0 equal to the average value, or compare many combinations, because then you could have a situation where some of your models are close to the model with a low 'tau' value but not to the 'lm()' value. So if theta.0 is the slope of the first model with the lowest 'tau', then your Pr(>F) will be high, whereas in the other case it will be low.
This question should maybe have been asked on cross-validated.

Assessing/Improving prediction with linear discriminant analysis or logistic regression

I recently needed to combine two or more variables in a data set to evaluate whether their combination could enhance predictivity, so I ran some logistic regressions in R. Then, on the statistics Q&A site, someone suggested that I might use linear discriminant analysis.
Since I don't have fitcdiscr.m in MATLAB, I'd rather go with lda in R, but I cannot use the fit results to compute an AUC or anything similar. I see that the lda fit in R returns an object with several components, and I guess I should use fit$posterior to predict Cases against Controls, but I cannot extract those data from it.
For further information, I get this results as fit$posterior:
$posterior
0 1
1 0.7707927 0.22920726
2 0.7085165 0.29148352
3 0.6990989 0.30090106
4 0.5902161 0.40978387
5 0.8667109 0.13328912
6 0.6924406 0.30755939
7 0.7471086 0.25289141
8 0.7519326 0.24806736
And so on up to the last observation which is 242. Every time I try to take, for example, column 1 by fit$posterior[,1], I get:
1 2 3 4 5 6 7 8
0.7707927 0.7085165 0.6990989 0.5902161 0.8667109 0.6924406 0.7471086 0.7519326
9 10 11 12 13 14 15 16
0.7519326 0.6902850 0.7519326 0.8080445 0.8075360 0.8484318 0.4860899 0.8694121
I don't know which part of the code could be useful, since I made very basic computation:
library(gdata)
data=read.xls("ECGvarious.xls", perl="C:/Strawberry/perl/bin/perl.exe");
i=6;
p=19;
temp=data[,i];
temp1=data[, p];
library(MASS)
fit <- lda(Case ~ temp + temp1, data=data, na.action="na.omit", CV=TRUE)
I can't link the data; anyway, ECGvarious is simply N observations x P variables, where N = N1 + N2, with N1 the number of Controls and N2 the number of Cases, and the Cases are defined as subjects who developed the pathology after follow-up. The very last column of the data is just 0 or 1 for Controls and Cases, respectively.
When I performed the logistic regression, I did:
mod1<-glm(Case ~ temp + temp1, data=data, family="binomial");
auctemp=auc(Case~predict(mod1), data=data);
Here's my input concerning logistic regression and prediction (I don't know much about linear discrimination but understand it's closely related to logistic regression, which I know much better). I'm not sure I'm following all of your reasoning, nor if this will be a satisfactory answer, but hopefully it won't hurt. This has been a review of some epidemiology classes for me. I hope it's not too formal and addresses at least in part some of your questions. If not, and if other users think this would better belong on Cross Validated, I won't take offense. :)
Sample data
We'll first generate 200 observations, having increasing levels of probability for Case=1. The first predictor (pred1) will follow a distribution that is nonlinear, close to the one being modeled when doing logistic regression. It will be rather closely related to the proportion of Cases. The second predictor will just be random, uniformly distributed noise.
set.seed(2351)
df <- data.frame(Case = c(sample(c(0,1), size = 67, prob = c(0.8, 0.2), replace = TRUE),
sample(c(0,1), size = 66, prob = c(0.5, 0.5), replace = TRUE),
sample(c(0,1), size = 67, prob = c(0.2, 0.8), replace = TRUE)),
pred1 = 6/(1+4*exp(-seq(from = -3, to = 5, length.out = 200))) + rnorm(n = 200, mean = 2, sd=.5),
pred2 = runif(n = 200, min = 0, max = 100))
We see in the boxplot below that the observations where case==1 generally have higher pred1, which is intended (from the way we generated the data). At the same time, there is an overlap, otherwise it would make it too easy to decide on a cutoff point/threshold.
boxplot(pred1 ~ Case, data=df, xlab="Case", ylab="pred1")
Fitting the logistic model
First using both predictors:
model.1 <- glm(Case ~ pred1 + pred2, data=df, family=binomial(logit))
summary(model.1)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -2.058258 0.479094 -4.296 1.74e-05 ***
# pred1 0.428491 0.075373 5.685 1.31e-08 ***
# pred2 0.003399 0.005500 0.618 0.537
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 276.76 on 199 degrees of freedom
# Residual deviance: 238.51 on 197 degrees of freedom
# AIC: 244.51
As we'd expect, the first predictor is rather strongly related, and the second, poorly related to the outcome.
Note that to get Odds Ratios from those coefficients, we need to exponentiate them:
exp(model.1$coefficients[2:3])
# pred1 pred2
# 1.534939 1.003405 # Odds Ratios (making the relationships appear more clearly).
# Use `exp(confint(model.1))` to get confidence intervals.
We'll compare this model to a simpler model, removing the second predictor:
model.2 <- glm(Case ~ pred1, data=df, family=binomial(logit))
summary(model.2)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.87794 0.37452 -5.014 5.32e-07 ***
# pred1 0.42651 0.07514 5.676 1.38e-08 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 276.76 on 199 degrees of freedom
# Residual deviance: 238.89 on 198 degrees of freedom
# AIC: 242.89
exp(model.2$coefficients)[2]
# pred1
# 1.531907 # Odds Ratio
We could also run an anova(model.1, model.2), but let's skip this part and move on to prediction, keeping this simpler model since the second variable doesn't add much predictive value, if any. In practice, having more predictors is rarely a problem unless they're truly random noise, but here I focus more on the operation of predicting and choosing a proper threshold.
Stored predictions
In the model.2 object (a list), there is an item named fitted.values. Those values are exactly the same as what we'd get from predict(model.2, type="response") and can be interpreted as probabilities; one for each row, based on the predictor(s) and their coefficient(s).
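As a quick added sanity check (not in the original text), we can verify that equivalence directly:
# Stored fitted values and response-scale predictions coincide
all.equal(unname(model.2$fitted.values),
          unname(predict(model.2, type = "response")))
# [1] TRUE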
New predictions
It is also possible to predict the outcome for hypothetical rows not in our initial dataframe.
With model.1 (2 predictors):
predict(model.1, newdata = list(pred1=1, pred2=42), type="response")
# 1
# 0.1843701
With model.2 (1 predictor):
predict(model.2, newdata = list(pred1=12), type="response")
# 1
# 0.96232
Going from probability to binary response
Looking back at the link between our predictor pred1 and the calculated probability of having Case=1:
plot(df$pred1, model.2$fitted.values,
xlab="pred1", ylab="probability that Case=1")
We note that since we have only one predictor, the probability is a direct function of it. If we had kept the other predictor in the equation, we'd see points grouped around the same line, but in a cloud of points.
But this doesn't change the fact that if we are to evaluate how well our model can predict binary outcomes, we need to settle on a threshold above which we'll consider that an observation is a Case. Several packages have tools to help pick that threshold. But even without any additional package, we can calculate various properties over a range of thresholds using a function such as the following, which will calculate the sensitivity (ability to detect True Cases), specificity (ability to identify True Non Cases), and other properties well described here.
df.ana <- data.frame(thresh=seq(from = 0, to = 100, by = 0.5) / 100)
for(i in seq_along(df.ana$thresh)) {
df.ana$sensitivity[i] <- sum(df$Case==1 & (predict(model.2, type="resp") >= df.ana$thresh[i])) / sum(df$Case==1)
df.ana$specificity[i] <- sum(df$Case==0 & (predict(model.2, type="resp") < df.ana$thresh[i])) / sum(df$Case==0)
df.ana$pos.pred.value[i] <- sum(df$Case == 1 & (predict(model.2, type="resp") >= df.ana$thresh[i])) / sum(predict(model.2, type="resp") >= df.ana$thresh[i])
df.ana$neg.pred.value[i] <- sum(df$Case == 0 & (predict(model.2, type="resp") < df.ana$thresh[i])) / sum(predict(model.2, type="resp") < df.ana$thresh[i])
df.ana$accuracy[i] <- sum((predict(model.2, type="resp") >= df.ana$thresh[i]) == df$Case) / nrow(df)
}
which.max(df.ana$accuracy)
# [1] 46
optimal.thresh <- df.ana$thresh[which.max(df.ana$accuracy)] # 0.46
The accuracy is the proportion of correct predictions over all predictions. The 46th threshold (0.46) is the "best" for that matter. Let's check a few other neighboring rows in the generated dataframe; it tells us that 0.47 would work as well on all fronts. Fine-tuning would involve adding some new data to our initial dataframe.
df.ana[45:48,]
# thresh sensitivity specificity pos.pred.value neg.pred.value accuracy
# 45 0.45 0.7142857 0.6947368 0.7211538 0.6875000 0.705
# 46 0.46 0.7142857 0.7157895 0.7352941 0.6938776 0.715
# 47 0.47 0.7142857 0.7157895 0.7352941 0.6938776 0.715
# 48 0.48 0.7047619 0.7157895 0.7326733 0.6868687 0.710
Note that, for this thresholded 0/1 predictor, the auc function (area under the curve) gives the same number as the accuracy at that threshold:
library(pROC)
auc(Case ~ as.numeric(predict(model.2, type="response") >= optimal.thresh), data=df)
# Area under the curve: 0.715
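As an added aside (not in the original answer), the more common ROC AUC is computed from the continuous predicted probabilities rather than from the thresholded 0/1 predictions, and will generally differ from the single-threshold accuracy:
# ROC AUC based on the continuous predicted probabilities
auc(Case ~ predict(model.2, type = "response"), data = df)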
Some plots
# thresholds against accuracy
plot(x=df.ana$thresh, y=df.ana$accuracy, type="l",
xlab="Threshold", ylab="", xlim=c(0,1), ylim=c(0,1))
text(x = 0.1, y = 0.5, labels = "Accuracy", col="black")
# thresholds against Sensitivity
lines(x=df.ana$thresh, y=df.ana$sensitivity, type="l",col="blue") # Sensitivity We want to maximize this, but not too much
text(x = 0.1, y = 0.95, labels = "Sensitivity", col="blue")
# thresholds against specificity
lines(x=df.ana$thresh, y=df.ana$specificity, type="l", col="red") # Specificity we want to maximize also, but not too much
text(x = 0.1, y = 0.05, labels = "Specificity", col="red")
# optimal threshold vertical line
abline(v=optimal.thresh)
text(x=optimal.thresh + .01, y=0.05, labels= optimal.thresh)
Incidentally, all lines converge more or less to the same point, which suggests this is a good compromise between all the qualities we look for in a predictive tool. But depending on your objectives, it might be better picking a lower or a higher threshold. Statistical tools are useful, but in the end, some other considerations are often more important in making a final decision.
About ROC
The following graph is the same as the one which would be produced with pROC's roc:
plot(x=df.ana$specificity, y = df.ana$sensitivity, type="l", col="blue",
xlim = c(1,0), xlab = "Specificity", ylab = "Sensitivity")
# Equivalent to
# plot(roc(predictor=model.2$fitted.values, response = model.2$y))
Tabulations and other stats
The following function allows one to calculate, for a logistic model fit, the same stats seen above, and gives a 2x2 table for any chosen threshold.
diagnos.test <- function(model, threshold) {
  output <- list()
  pred <- predict(model, type = "response")   # fitted probabilities of the supplied model
  output$stats <- c(
    sensitivity  = sum(model$y == 1 & (pred >= threshold)) / sum(model$y == 1),
    specificity  = sum(model$y == 0 & (pred <  threshold)) / sum(model$y == 0),
    pos.pr.value = sum(model$y == 1 & (pred >= threshold)) / sum(pred >= threshold),
    neg.pr.value = sum(model$y == 0 & (pred <  threshold)) / sum(pred <  threshold),
    accuracy     = sum((pred >= threshold) == model$y) / length(model$y))
  output$tab <- addmargins(t(table(model$y, as.numeric(pred > threshold),
                                   dnn = list("Cases", "Predictions")))[2:1, 2:1])
  return(output)
}
diagnos.test(model.2, 0.47)
# $stats
# sensitivity specificity pos.pr.value neg.pr.value accuracy
# 0.7142857 0.7157895 0.7352941 0.6938776 0.7150000
#
# $tab
# Cases
# Predictions 1 0 Sum
# 1 75 27 102
# 0 30 68 98
# Sum 105 95 200
Final note
I don't pretend I have covered everything on prediction, sensitivity and specificity; my goal was more to go as far as possible using common language and calculations, not relying on any specific packages.
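Finally, coming back to the LDA side of the original question: a minimal, hedged sketch (it follows the question's object and column names, which I cannot run here, and assumes no missing values in Case, temp and temp1; with NAs the rows returned by lda would need realigning) of how the leave-one-out posterior probabilities could be turned into an AUC comparable to the one from the logistic regression:
library(MASS)
library(pROC)
fit <- lda(Case ~ temp + temp1, data = data, CV = TRUE)  # leave-one-out posteriors
p.case <- fit$posterior[, "1"]   # posterior probability of being a Case
auc(data$Case ~ p.case)          # ROC AUC from the LDA posterior probabilities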
