I have an issue when calculating logistic regression in R that, to me, makes no sense.
I have one parameter in the model, positive numbers (molecular weight).
I have a binary response variable, let's say either A or B.
My data table is called df1.
str(df1)
'data.frame':   1015 obs. of  2 variables:
$ Protein_Class: chr "A" "A" "A" "B" ...
$ MW : num 47114 29586 26665 34284 104297 ...
I make the model:
summary(glm(as.factor(df1[,1]) ~ df1[,2],family="binomial"))
The results are:
Call:
glm(formula = as.factor(df1[, 1]) ~ df1[, 2], family = "binomial")
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5556 -1.5516 0.8430 0.8439 0.8507
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.562e-01 1.251e-01 6.842 7.8e-12 ***
df1[, 2] -1.903e-07 3.044e-06 -0.063 0.95
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1239.2 on 1014 degrees of freedom
Residual deviance: 1239.2 on 1013 degrees of freedom
AIC: 1243.2
Number of Fisher Scoring iterations: 4
That's all fine and good until this point.
But, when I take the logarithm of my variable:
summary(glm(as.factor(df1[,1]) ~ log10(df1[,2]),family="binomial"))
Call:
glm(formula = as.factor(df1[, 1]) ~ log10(df1[, 2]), family = "binomial")
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8948 -1.4261 0.8007 0.8528 1.0469
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.7235 1.1169 -2.438 0.01475 *
log10(df1[, 2]) 0.8038 0.2514 3.197 0.00139 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1239.2 on 1014 degrees of freedom
Residual deviance: 1228.9 on 1013 degrees of freedom
AIC: 1232.9
Number of Fisher Scoring iterations: 4
The p-value has changed!
How can this be? And, more importantly, which one should I use?
My understanding was that logistic regression is based on ranks, and all I am doing is a monotone transformation. Note that the AUC of the model (the area under the ROC curve) remains the same.
There are no zero or negative values that are lost during the transformation.
Did I miss something here?
Any advice?
Thanks in advance,
Adam
There are a couple of things to think about. First, you can probably constrain your search to one side or the other of 1. Transformations with powers less than 1 - square root, log, inverse, etc. - all have a similar type of effect, to differing degrees: they pull in big values and spread out small values. Powers greater than 1 do the opposite - they increase the spread among big values and decrease the spread among small values - all generally assuming you've got no non-positive values in your variable. This is really, then, a question of what kind of transformation you want and, after that, how severe it has to be.
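To make that compress/spread effect concrete, here is a tiny illustration (made-up numbers, not the poster's data):
x <- c(1, 10, 100, 1000, 10000)
data.frame(x = x,
           sqrt = sqrt(x),    # mild compression of the big values
           log = log(x),      # stronger compression
           neg_inv = -1/x)    # strongest: the big values get squeezed together near 0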
First, what kind of transformation do you need? I made some fake data to illustrate the point:
library(dplyr)
library(tidyr)
library(ggplot2)
set.seed(1234)
x <- runif(1000, 1, 10000)
y.star <- -6 + log(x)
y <- rbinom(1000, 1, plogis(y.star) )
df <- tibble(
y=y,
x=x,
ystar=y.star)
Next, since this is just a bivariate relationship, we could plot it out with a loess curve. In particular, though, we want to know what the log-odds of y look like with respect to x. We can do this by transforming the predictions from the loess curve with the logistic quantile function, qlogis() - this takes the probabilities and puts them in log-odds form. Then, we could make the plot.
lo <- loess(y ~ x, span=.75)
df <- df %>% mutate(fit = predict(lo),
fit = case_when(
fit < .01 ~ .01,
fit > .99 ~ .99,
TRUE ~ fit))
ggplot(df) +
geom_line(aes(x=x, y=qlogis(fit)))
This looks like a classic log relationship. We could then implement a few different transformations and plot those - square root, log, and negative inverse.
lo1 <- loess(y ~ sqrt(x), span=.5)
lo2 <- loess(y ~ log(x), span=.5)
lo3 <- loess(y ~ I(-(1/x)), span=.5)
df <- df %>% mutate(fit1 = predict(lo1),
fit1 = case_when(
fit1 < .01 ~ .01,
fit1 > .99 ~ .99,
TRUE ~ fit1))
df <- df %>% mutate(fit2 = predict(lo2),
fit2 = case_when(
fit2 < .01 ~ .01,
fit2 > .99 ~ .99,
TRUE ~ fit2))
df <- df %>% mutate(fit3 = predict(lo3),
fit3 = case_when(
fit3 < .01 ~ .01,
fit3 > .99 ~ .99,
TRUE ~ fit3))
Next, we need to transform the data so the plotting will look right:
plot.df <- df %>%
tidyr::pivot_longer(cols=starts_with("fit"),
names_to="var",
values_to="vals") %>%
mutate(x2 = case_when(
var == "fit" ~ x,
var == "fit1" ~ sqrt(x),
var == "fit2" ~ log(x),
var == "fit3" ~ -(1/x),
TRUE ~ x),
var = factor(var, labels=c("Original", "Square Root", "Log", "Inverse")))
Then, we can make the plot:
ggplot(plot.df, aes(x=x2, y=vals)) +
geom_line() +
facet_wrap(~var, scales="free_x")
Here it looks like the log is the most linear of the bunch - not surprising, since we made the variable y.star with log(x). If we wanted to test between these different possibilities, Kevin Clarke, a political scientist at Rochester, proposed a paired sign test for evaluating the difference between non-nested models. There is a paper about it here. I wrote a package called clarkeTest that implements this test in R. So we could use it to compare the various alternatives:
m0 <- glm(y ~ x, data=df, family=binomial)
m1 <- glm(y ~ sqrt(x), data=df, family=binomial)
m2 <- glm(y ~ log(x), data=df, family=binomial)
m3 <- glm(y ~ I(-(1/x)), data=df, family=binomial)
Testing the original against the square root:
library(clarkeTest)
clarke_test(m0, m1)
#
# Clarke test for non-nested models
#
# Model 1 log-likelihood: -309
# Model 2 log-likelihood: -296
# Observations: 1000
# Test statistic: 400 (40%)
#
# Model 2 is preferred (p = 2.7e-10)
This shows that the square root is better than the original un-transformed variable.
clarke_test(m0, m2)
#
# Clarke test for non-nested models
#
# Model 1 log-likelihood: -309
# Model 2 log-likelihood: -284
# Observations: 1000
# Test statistic: 462 (46%)
#
# Model 2 is preferred (p = 0.018)
The above shows that the log is better than the un-transformed variable.
clarke_test(m0, m3)
#
# Clarke test for non-nested models
#
# Model 1 log-likelihood: -309
# Model 2 log-likelihood: -292
# Observations: 1000
# Test statistic: 550 (55%)
#
# Model 1 is preferred (p = 0.0017)
The above shows that the un-transformed variable is preferred to the negative inverse. Then, we can test the difference of the two models preferred to the original.
clarke_test(m1, m2)
#
# Clarke test for non-nested models
#
# Model 1 log-likelihood: -296
# Model 2 log-likelihood: -284
# Observations: 1000
# Test statistic: 536 (54%)
#
# Model 1 is preferred (p = 0.025)
This shows that the square root is better than the log transformation in terms of individual log-likelihoods.
Another option would be a grid search over possible transformations, looking at the AIC for each one. We first have to make a function that handles the case where the transformation power is 0, where we should substitute the log. Then we can run a model for each transformation and collect the AICs.
grid <- seq(-1,1, by=.1)
trans <- function(x, power){
if(power == 0){
tx <- log(x)
}else{
tx <- x^power
}
tx
}
mods <- lapply(grid, function(p)glm(y ~ trans(x, p),
data=df,
family=binomial))
aic.df <- tibble(
power = grid,
aic = sapply(mods, AIC))
Next, we can plot the AICs as a function of the power.
ggplot(aic.df, aes(x=power, y=aic)) +
geom_line()
This tells us that about -.25 is the appropriate transformation parameter. Note that there is a discrepancy between the Clarke test results and the AIC because AIC is based on the overall log-likelihood and the Clarke test is based on differences in the individual log-likelihoods.
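If you prefer to read the minimum off programmatically rather than from the plot, something like this works (a small addition; note that the grid above moves in steps of 0.1, so it returns the grid value nearest the visual minimum):
aic.df$power[which.min(aic.df$aic)]   # power on the grid with the lowest AIC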
We would find that this new proposed transformation is also worse than the square root:
m4 <- glm(y ~ I(x^-.25), data=df, family=binomial)
clarke_test(m1, m4)
#
# Clarke test for non-nested models
#
# Model 1 log-likelihood: -296
# Model 2 log-likelihood: -283
# Observations: 1000
# Test statistic: 559 (56%)
#
# Model 1 is preferred (p = 0.00021)
So, if you have a couple of different candidates in mind and you like the idea behind the Clarke test, you could use that to find the appropriate transformation. If you don't have a candidate in mind, a grid search is always a possibility.
Somewhat as a follow-up to the question Creating confidence intervals for regression curve in GLMM using Bootstrapping, I am interested in getting the correct values of a regression curve and the associated confidence interval curves.
Consider a case where, in a GLMM, there is one response variable, two continuous fixed effects, and one random effect. Here is some fake data:
library (dplyr)
set.seed (1129)
x1 <- runif(100,0,1)
x2 <- rnorm(100,0.5,0.4)
f1 <- gl(n = 5,k = 20)
rnd1<-rnorm(5,0.5,0.1)
my_data <- data.frame(x1=x1, x2=x2, f1=f1)
modmat <- model.matrix(~x1+x2, my_data)
fixed <- c(-0.12,0.35,0.09)
y <- (modmat%*%fixed+rnd1)
my_data$y <- ((y - min (y))/max(y- min (y))) %>% round (digits = 1)
rm (y)
The GLMM that I fit looks like this:
library(lme4)
m1 <- glmer(y ~ x1 + x2 + (1|f1), my_data, family = "binomial")
summary (m1)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: y ~ x1 + x2 + (1 | f1)
Data: my_data
AIC BIC logLik deviance df.resid
65.7 76.1 -28.8 57.7 96
Scaled residuals:
Min 1Q Median 3Q Max
-8.4750 -0.7042 -0.0102 1.5904 14.5919
Random effects:
Groups Name Variance Std.Dev.
f1 (Intercept) 1.996e-10 1.413e-05
Number of obs: 100, groups: f1, 5
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.668 2.051 -4.713 2.44e-06 ***
x1 12.855 2.659 4.835 1.33e-06 ***
x2 4.875 1.278 3.816 0.000136 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) x1
x1 -0.970
x2 -0.836 0.734
convergence code: 0
boundary (singular) fit: see ?isSingular
Plotting y vs x1:
plot (y~x1, my_data)
It should be possible to get a regression curve from the summary of m1. I have learned that I need to invert the link function (in this case, "logit"):
y = 1/(1+exp(-(Intercept+b*x1+c*x2)))
In order to plot a regression curve of x1 in a two-dimensional space, I set x2 = mean(x2) in the formula (which also seems important - the red line in the following plots ignores x2, apparently leading to considerable bias). The regression line:
xx <- seq (from = 0, to = 1, length.out = 100)
yy <- 1/(1+exp(-(-9.668+12.855*xx+4.875*mean(x2))))
yyy <- 1/(1+exp(-(-9.668+12.855*xx)))
lines (yy ~ xx, col = "blue")
lines (yyy~ xx, col = "red")
I think the blue line does not look so good (and the red line worse, of course). So, as a side question: is y = 1/(1+exp(-(Intercept+b*x1+c*x2))) always the right choice to back-transform the logit link? I am asking because I found this https://sebastiansauer.github.io/convert_logit2prob/, which made me suspicious. Or is there another reason the model does not fit so well? Maybe my data creation process is somewhat 'bad'.
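As a quick check of that back-transformation (not from the original post): base R's plogis() is the inverse logit, so it should agree with the manual formula exactly.
q <- -9.668 + 12.855*xx + 4.875*mean(x2)
all.equal(plogis(q), 1/(1 + exp(-q)))   # plogis() is the built-in inverse logit
# [1] TRUE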
What I need now is to add the 95% confidence interval to the curve. I think that bootstrapping using the bootMer function should be a good approach. However, all the examples I found were on models with a single fixed effect. Jamie Murphy asked a similar question, but he was interested in models containing a continuous and a categorical variable as fixed effects, here: Creating confidence intervals for regression curve in GLMM using Bootstrapping
But when it comes to models with more than one continuous variables as fixed effects, I get lost. Perhaps someone can help solve this issue - possibly with a modification of the second part of this tutorial:
https://www.r-bloggers.com/2015/06/confidence-intervals-for-prediction-in-glmms/
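One possible direction (a sketch, not from the original post, so treat the specific choices as assumptions): bootMer from lme4 can bootstrap population-level predictions over a grid of x1 while x2 is held at its mean, and percentile intervals can then be taken across the bootstrap replicates. This reuses m1 and my_data from above.
library(lme4)
newdat <- data.frame(x1 = seq(0, 1, length.out = 100), x2 = mean(my_data$x2))
# predictions on the response scale, ignoring the random effect (re.form = NA)
pred_fun <- function(fit) predict(fit, newdata = newdat, re.form = NA, type = "response")
boot_res <- bootMer(m1, pred_fun, nsim = 200)   # parametric bootstrap, refits the model nsim times
ci <- apply(boot_res$t, 2, quantile, probs = c(0.025, 0.975), na.rm = TRUE)
plot(y ~ x1, my_data)
lines(newdat$x1, pred_fun(m1), col = "blue")    # fitted curve at mean(x2)
lines(newdat$x1, ci[1, ], lty = 2)              # lower 95% bootstrap bound
lines(newdat$x1, ci[2, ], lty = 2)              # upper 95% bootstrap bound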
I would like to simulate data for a logistic regression where I can specify its explained variance beforehand. Have a look at the code below. I simulate four independent variables and specify that each logit coefficient should be of size log(2)=0.69. This works nicely, the explained variance (I report Cox & Snell's r2) is 0.34.
However, I need to specify the regression coefficients in such a way that a pre-specified r2 will result from the regression. So if I would like to produce an r2 of, let's say, exactly 0.1, how do the coefficients need to be specified? I am kind of struggling with this.
# Create independent variables
sigma.1 <- matrix(c(1,0.25,0.25,0.25,
0.25,1,0.25,0.25,
0.25,0.25,1,0.25,
0.25,0.25,0.25,1),nrow=4,ncol=4)
mu.1 <- rep(0,4)
n.obs <- 500000
library(MASS)
sample1 <- as.data.frame(mvrnorm(n = n.obs, mu.1, sigma.1, empirical=FALSE))
# Create latent continuous response variable
sample1$ystar <- 0 + log(2)*sample1$V1 + log(2)*sample1$V2 + log(2)*sample1$V3 + log(2)*sample1$V4
# Construct binary response variable
sample1$prob <- exp(sample1$ystar) / (1 + exp(sample1$ystar))
sample1$y <- rbinom(n.obs,size=1,prob=sample1$prob)
# Logistic regression
logreg <- glm(y ~ V1 + V2 + V3 + V4, data=sample1, family=binomial)
summary(logreg)
The output is:
Call:
glm(formula = y ~ V1 + V2 + V3 + V4, family = binomial, data = sample1)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.7536 -0.7795 -0.0755 0.7813 3.3382
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.002098 0.003544 -0.592 0.554
V1 0.691034 0.004089 169.014 <2e-16 ***
V2 0.694052 0.004088 169.776 <2e-16 ***
V3 0.693222 0.004079 169.940 <2e-16 ***
V4 0.699091 0.004081 171.310 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 693146 on 499999 degrees of freedom
Residual deviance: 482506 on 499995 degrees of freedom
AIC: 482516
Number of Fisher Scoring iterations: 5
And Cox and Snell's r2 gives:
library(pscl)
pR2(logreg)["r2ML"]
r2ML
0.3436523
If you add a random error term to the ystar variable, making ystar.r, and then work with that, you can tweak the standard deviation until it meets your specifications.
sample1$ystar.r <- sample1$ystar+rnorm(n.obs, 0, 3.8) # tried a few values
sample1$prob <- exp(sample1$ystar.r) / (1 + exp(sample1$ystar.r))
sample1$y <- rbinom(n.obs,size=1,prob=sample1$prob)
logreg <- glm(y ~ V1 + V2 + V3 + V4, data=sample1, family=binomial)
summary(logreg) # the estimates "shrink"
pR2(logreg)["r2ML"]
#-------
r2ML
0.1014792
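If you want to automate the "tweaking", a rough grid search over the noise SD could look like this (a sketch reusing sample1 and n.obs from the question; the r2 is computed directly from the glm deviances, which is what pR2's r2ML reports, and each SD value refits the model, so it takes a while with n.obs = 500000):
r2_cs <- function(fit) 1 - exp((fit$deviance - fit$null.deviance) / nobs(fit))  # Cox & Snell
target <- 0.10
sds <- seq(1, 6, by = 0.5)
r2s <- sapply(sds, function(s) {
  yr <- rbinom(n.obs, size = 1, prob = plogis(sample1$ystar + rnorm(n.obs, 0, s)))
  r2_cs(glm(yr ~ V1 + V2 + V3 + V4, data = sample1, family = binomial))
})
sds[which.min(abs(r2s - target))]   # noise SD whose r2 lands closest to the target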
R-squared (and its variations) is a random variable, as it depends on your simulated data. If you simulate data with the exact same parameters multiple times, you'll most likely get different values for R-squared each time. Therefore, you cannot produce a simulation where the R-squared will be exactly 0.1 just by controlling the parameters.
On the other hand, since it's a random variable, you could potentially simulate your data from a conditional distribution (conditioning on a fixed value of R-squared), but you would need to find out what these distributions look like (the math might get really ugly here; Cross Validated is more appropriate for this part).
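To illustrate that variability, here is a minimal sketch (reusing mu.1 and sigma.1 from the question, with a smaller n for speed; the Cox & Snell r2 is computed directly from the deviances):
library(MASS)
r2_cs <- function(fit) 1 - exp((fit$deviance - fit$null.deviance) / nobs(fit))
set.seed(99)
replicate(5, {
  X <- mvrnorm(5000, mu.1, sigma.1)
  y <- rbinom(5000, size = 1, prob = plogis(X %*% rep(log(2), 4)))
  r2_cs(glm(y ~ X, family = binomial))
})
# five different r2 values from the exact same parameters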
My R script produces the glm() coefficients below.
What is Poisson's lambda, then? It should be ~3.0 since that's what I used to create the distribution.
Call:
glm(formula = h_counts ~ ., family = poisson(link = log), data = pois_ideal_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-22.726 -12.726 -8.624 6.405 18.515
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.222532 0.015100 544.53 <2e-16 ***
h_mids -0.363560 0.004393 -82.75 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 11451.0 on 10 degrees of freedom
Residual deviance: 1975.5 on 9 degrees of freedom
AIC: 2059
Number of Fisher Scoring iterations: 5
random_pois = rpois(10000,3)
h=hist(random_pois, breaks = 10)
mean(random_pois) #verifying that the mean is close to 3.
h_mids = h$mids
h_counts = h$counts
pois_ideal_data <- data.frame(h_mids, h_counts)
pois_ideal_model <- glm(h_counts ~ ., pois_ideal_data, family=poisson(link=log))
summary_ideal=summary(pois_ideal_model)
summary_ideal
What are you doing here???!!! You used a glm to fit a distribution???
Well, it is not impossible to do so, but it is done via this:
set.seed(0)
x <- rpois(10000,3)
fit <- glm(x ~ 1, family = poisson())
i.e., we fit data with an intercept-only regression model.
fit$fitted[1]
# 3.005
This is the same as:
mean(x)
# 3.005
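Equivalently, since poisson() uses the log link by default, lambda is just the exponentiated intercept:
exp(coef(fit))
# (Intercept)
#       3.005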
It looks like you're trying to do a Poisson fit to aggregated or binned data; that's not what glm does. I took a quick look for canned ways of fitting distributions to binned data but couldn't find one; it looks like earlier versions of the bda package might have offered this, but not now.
At root, what you need to do is set up a negative log-likelihood function that sums (# counts) * log(prob(count value | lambda)) and minimize it with optim(); the solution given below using the bbmle package is a little more complex up-front but gives you added benefits like easily computing confidence intervals, etc.
Set up data:
set.seed(101)
random_pois <- rpois(10000,3)
tt <- table(random_pois)
dd <- data.frame(counts=unname(c(tt)),
val=as.numeric(names(tt)))
Here I'm using table rather than hist because histograms on discrete data are fussy (having integer cutpoints often makes things confusing because you have to be careful about right- vs left-closure)
Set up density function for binned Poisson data (to work with bbmle's formula interface, the first argument must be called x, and it must have a log argument).
dpoisbin <- function(x,val,lambda,log=FALSE) {
probs <- dpois(val,lambda,log=TRUE)
r <- sum(x*probs)
if (log) r else exp(r)
}
Fit lambda (log link helps prevent numerical problems/warnings from negative lambda values):
library(bbmle)
m1 <- mle2(counts~dpoisbin(val,exp(loglambda)),
data=dd,
start=list(loglambda=0))
all.equal(unname(exp(coef(m1))),mean(random_pois),tol=1e-6) ## TRUE
exp(confint(m1))
## 2.5 % 97.5 %
## 2.972047 3.040009
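For reference, a minimal optim()-based version of the same idea (a sketch using the dd data frame from above): minimize the binned negative log-likelihood directly, on the log-lambda scale to keep lambda positive.
nll <- function(loglambda) -sum(dd$counts * dpois(dd$val, exp(loglambda), log = TRUE))
opt <- optim(0, nll, method = "BFGS")
exp(opt$par)   # essentially mean(random_pois), as with the mle2 fit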
I recently needed to combine two or more variables in a data set to evaluate whether their combination could enhance predictivity, so I ran some logistic regressions in R. Then, on the statistics Q&A site, someone suggested that I might use linear discriminant analysis.
Since I don't have fitcdiscr.m in MATLAB, I'd rather go with lda in R, but I cannot use the fit results to compute an AUC or anything similar. Indeed, I see that the fit output of lda in R is some sort of object with multiple components (classes, posteriors), and I guess I should use fit$posterior to predict Cases against Controls, but I cannot extract those data from it.
For further information, I get this results as fit$posterior:
$posterior
0 1
1 0.7707927 0.22920726
2 0.7085165 0.29148352
3 0.6990989 0.30090106
4 0.5902161 0.40978387
5 0.8667109 0.13328912
6 0.6924406 0.30755939
7 0.7471086 0.25289141
8 0.7519326 0.24806736
And so on, up to the last observation, which is number 242. Every time I try to extract, for example, column 1 with fit$posterior[,1], I get:
1 2 3 4 5 6 7 8
0.7707927 0.7085165 0.6990989 0.5902161 0.8667109 0.6924406 0.7471086 0.7519326
9 10 11 12 13 14 15 16
0.7519326 0.6902850 0.7519326 0.8080445 0.8075360 0.8484318 0.4860899 0.8694121
I don't know which part of the code could be useful, since the computation I did is very basic:
library(gdata)
data=read.xls("ECGvarious.xls", perl="C:/Strawberry/perl/bin/perl.exe");
i=6;
p=19;
temp=data[,i];
temp1=data[, p];
library(MASS)
fit <- lda(Case ~ temp + temp1, data=data, na.action="na.omit", CV=TRUE)
I can't link the data; anyway, ECGvarious is simply an N observations x P variables table, with N = N1 + N2, where N1 is the number of Controls and N2 the number of Cases, and the Cases are defined as subjects who developed the pathology during follow-up. The very last column of data is just 0 or 1, for Controls and Cases respectively.
When I performed the logistic regression, I did:
library(pROC)
mod1 <- glm(Case ~ temp + temp1, data=data, family="binomial");
auctemp = auc(Case ~ predict(mod1), data=data);
Here's my input concerning logistic regression and prediction (I don't know much about linear discriminant analysis, but I understand it's closely related to logistic regression, which I know much better). I'm not sure I'm following all of your reasoning, nor if this will be a satisfactory answer, but hopefully it won't hurt. This has been a review of some epidemiology classes for me. I hope it's not too formal and addresses at least in part some of your questions. If not, and if other users think this would better belong on Cross Validated, I won't take offense. :)
Sample data
We'll first generate 200 observations, having increasing levels of probability for Case=1. The first predictor (pred1) will follow a distribution that is nonlinear, close to the one being modeled when doing logistic regression. It will be rather closely related to the proportion of Cases. The second predictor will just be random, uniformly distributed noise.
set.seed(2351)
df <- data.frame(Case = c(sample(c(0,1), size = 67, prob = c(0.8, 0.2), replace = TRUE),
sample(c(0,1), size = 66, prob = c(0.5, 0.5), replace = TRUE),
sample(c(0,1), size = 67, prob = c(0.2, 0.8), replace = TRUE)),
pred1 = 6/(1+4*exp(-seq(from = -3, to = 5, length.out = 200))) + rnorm(n = 200, mean = 2, sd=.5),
pred2 = runif(n = 200, min = 0, max = 100))
We see in the boxplot below that the observations where case==1 generally have higher pred1, which is intended (from the way we generated the data). At the same time, there is an overlap, otherwise it would make it too easy to decide on a cutoff point/threshold.
boxplot(pred1 ~ Case, data=df, xlab="Case", ylab="pred1")
Fitting the logistic model
First using both predictors:
model.1 <- glm(Case ~ pred1 + pred2, data=df, family=binomial(logit))
summary(model.1)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -2.058258 0.479094 -4.296 1.74e-05 ***
# pred1 0.428491 0.075373 5.685 1.31e-08 ***
# pred2 0.003399 0.005500 0.618 0.537
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 276.76 on 199 degrees of freedom
# Residual deviance: 238.51 on 197 degrees of freedom
# AIC: 244.51
As we'd expect, the first predictor is rather strongly related, and the second, poorly related to the outcome.
Note that to get Odds Ratios from those coefficients, we need to exponentiate them:
exp(model.1$coefficients[2:3])
# pred1 pred2
# 1.534939 1.003405 # Odds Ratios (making the relationships appear more clearly).
# Use `exp(confint(model.1))` to get confidence intervals.
We'll compare this model to a simpler model, removing the second predictor:
model.2 <- glm(Case ~ pred1, data=df, family=binomial(logit))
summary(model.2)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.87794 0.37452 -5.014 5.32e-07 ***
# pred1 0.42651 0.07514 5.676 1.38e-08 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 276.76 on 199 degrees of freedom
# Residual deviance: 238.89 on 198 degrees of freedom
# AIC: 242.89
exp(model.2$coefficients)[2]
# pred1
# 1.531907 # Odds Ratio
We could also run an anova(model.1, model.2), but let's skip that part and move on to prediction, keeping this simpler model since the second variable doesn't add much predictive value, if any. In practice, having more predictors is rarely a problem unless they're truly random noise, but here I focus more on the operation of predicting and on choosing a proper threshold.
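(For completeness, the comparison being skipped here is just a likelihood-ratio test of the two nested models; a one-line sketch, output omitted:)
anova(model.1, model.2, test = "Chisq")   # likelihood-ratio test of the nested models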
Stored predictions
In the model.2 object (a list), there is an item named fitted.values. Those values are exactly the same as what we'd get from predict(model.2, type="response"), and they can be interpreted as probabilities: one for each row, based on the predictor(s) and their coefficient(s).
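A quick way to convince yourself of that equivalence (a minimal check, not in the original answer):
all.equal(unname(model.2$fitted.values),
          unname(predict(model.2, type = "response")))
# [1] TRUE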
New predictions
It is also possible to predict the outcome for hypothetical rows not in our initial dataframe.
With model.1 (2 predictors):
predict(model.1, newdata = list(pred1=1, pred2=42), type="response")
# 1
# 0.1843701
With model.2 (1 predictor):
predict(model.2, newdata = list(pred1=12), type="response")
# 1
# 0.96232
Going from probability to binary response
Looking back at the link between our predictor pred1 and the calculated probability of having Case=1:
plot(df$pred1, model.2$fitted.values,
xlab="pred1", ylab="probability that Case=1")
We note that since we have only one predictor, the probability is a direct function of it. If we had kept the other predictor in the equation, we'd see points grouped around the same line, but in a cloud of points.
But this doesn't change the fact that if we are to evaluate how well our model can predict binary outcomes, we need to settle on a threshold above which we'll consider that an observation is a Case. Several packages have tools to help pick that threshold. But even without any additional package, we can calculate various properties over a range of thresholds using a function such as the following, which will calculate the sensitivity (ability to detect True Cases), the specificity (ability to identify True Non Cases), and other properties that are well described here.
df.ana <- data.frame(thresh=seq(from = 1, to = 100, by = 1) / 100)
for(i in seq_along(df.ana$thresh)) {
df.ana$sensitivity[i] <- sum(df$Case==1 & (predict(model.2, type="resp") >= df.ana$thresh[i])) / sum(df$Case==1)
df.ana$specificity[i] <- sum(df$Case==0 & (predict(model.2, type="resp") < df.ana$thresh[i])) / sum(df$Case==0)
df.ana$pos.pred.value[i] <- sum(df$Case == 1 & (predict(model.2, type="resp") >= df.ana$thresh[i])) / sum(predict(model.2, type="resp") >= df.ana$thresh[i])
df.ana$neg.pred.value[i] <- sum(df$Case == 0 & (predict(model.2, type="resp") < df.ana$thresh[i])) / sum(predict(model.2, type="resp") < df.ana$thresh[i])
df.ana$accuracy[i] <- sum((predict(model.2, type="resp") >= df.ana$thresh[i]) == df$Case) / nrow(df)
}
which.max(df.ana$accuracy)
# [1] 46
optimal.thresh <- df.ana$thresh[which.max(df.ana$accuracy)] # 0.46
The accuracy is the proportion of correct predictions over all predictions. The 46th threshold (0.46) is the "best" for that matter. Let's check a few other neighboring rows in the generated dataframe; it tells us that 0.47 would work as well on all fronts. Fine-tuning would involve adding some new data to our initial dataframe.
df.ana[45:48,]
# thresh sensitivity specificity pos.pred.value neg.pred.value accuracy
# 45 0.45 0.7142857 0.6947368 0.7211538 0.6875000 0.705
# 46 0.46 0.7142857 0.7157895 0.7352941 0.6938776 0.715
# 47 0.47 0.7142857 0.7157895 0.7352941 0.6938776 0.715
# 48 0.48 0.7047619 0.7157895 0.7326733 0.6868687 0.710
Note that the auc function (area under the curve) applied to these dichotomized predictions gives (sensitivity + specificity)/2, which here happens to match the accuracy for that threshold:
library(pROC)
auc(Case ~ as.numeric(predict(model.2, type="response") >= optimal.thresh), data=df)
# Area under the curve: 0.715
Some plots
# thresholds against accuracy
plot(x=df.ana$thresh, y=df.ana$accuracy, type="l",
xlab="Threshold", ylab="", xlim=c(0,1), ylim=c(0,1))
text(x = 0.1, y = 0.5, labels = "Accuracy", col="black")
# thresholds against Sensitivity
lines(x=df.ana$thresh, y=df.ana$sensitivity, type="l", col="blue") # Sensitivity: we want to maximize this, but not too much
text(x = 0.1, y = 0.95, labels = "Sensitivity", col="blue")
# thresholds against specificity
lines(x=df.ana$thresh, y=df.ana$specificity, type="l", col="red") # Specificity: we also want to maximize this, but not too much
text(x = 0.1, y = 0.05, labels = "Specificity", col="red")
# optimal threshold vertical line
abline(v=optimal.thresh)
text(x=optimal.thresh + .01, y=0.05, labels= optimal.thresh)
Incidentally, all the lines converge more or less to the same point, which suggests this is a good compromise between all the qualities we look for in a predictive tool. But depending on your objectives, it might be better to pick a lower or a higher threshold. Statistical tools are useful, but in the end, some other considerations are often more important in making a final decision.
About ROC
The following graph is the same as the one which would be produced with pROC's roc:
plot(x=df.ana$specificity, y = df.ana$sensitivity, type="l", col="blue",
xlim = c(1,0), xlab = "Specificity", ylab = "Sensitivity")
# Equivalent to
# plot(roc(predictor=model.2$fitted.values, response = model.2$y))
Tabulations and other stats
The following function allows one to calculate, for a logistic model fit, the same stats seen above, and gives a 2x2 table for any chosen threshold.
diagnos.test <- function(model, threshold) {
  output <- list()
  preds <- predict(model, type = "response")   # predicted probabilities for the model's own data
  output$stats <- c(
    sensitivity  = sum(model$y == 1 & (preds >= threshold)) / sum(model$y == 1),
    specificity  = sum(model$y == 0 & (preds <  threshold)) / sum(model$y == 0),
    pos.pr.value = sum(model$y == 1 & (preds >= threshold)) / sum(preds >= threshold),
    neg.pr.value = sum(model$y == 0 & (preds <  threshold)) / sum(preds <  threshold),
    accuracy     = sum((preds >= threshold) == model$y) / length(model$y))
  output$tab <- addmargins(t(table(model$y, as.numeric(preds > threshold),
                                   dnn = list("Cases", "Predictions")))[2:1, 2:1])
  return(output)
}
diagnos.test(model.2, 0.47)
# $stats
# sensitivity specificity pos.pr.value neg.pr.value accuracy
# 0.7142857 0.7157895 0.7352941 0.6938776 0.7150000
#
# $tab
# Cases
# Predictions 1 0 Sum
# 1 75 27 102
# 0 30 68 98
# Sum 105 95 200
Final note
I don't pretend I have covered everything on prediction, sensitivity and specificity; my goal was more to go as far as possible using common language and calculations, not relying on any specific packages.
I have a set of data in R and I want to run a regression to test for correlation using custom coefficients.
Example:
x = lm(a ~ b + c + d, data=data, weights=weights)
That gives me coefficients for b, c, and d, but I just want to give b, c, and d my own coefficients and find, for example, the r^2. How would I do so?
Let's assume your predetermined coefficients are a numeric vector named vec, holding an intercept followed by one coefficient per predictor (so four elements for b, c, and d), and that none of b, c, d are factors or character vectors:
#edit ... add a sum() function
(x = lm(a ~ 1, data=data,
        offset = apply(data[, c("b","c","d")], 1, function(x) { sum(c(1, x) * vec) })))
This should produce a model whose predictions are built from the specified coefficients. You will probably then want to run:
summary(x)
As always... if you want tested code, then provide a dataset for testing. With the mtcars dataframe:
m1 = lm(mpg ~ carb + wt, data=mtcars)
vec <- coef(m1)
(x = lm(mpg ~ 1, data=mtcars,
offset=apply( mtcars[c("carb","wt")], 1,
function(x){ sum( c(1,x) *vec)} )))
Call:
lm(formula = mpg ~ 1, data = mtcars, offset = apply(mtcars[c("carb",
"wt")], 1, function(x) {
sum( c(1, x) * vec)
}))
Coefficients:
(Intercept)
-7.85e-17
So the offset model (with the coefficients supplied through the offset) reproduces the m1 fit essentially exactly - the estimated intercept is numerically zero.
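A quick check of that claim (a small sketch reusing x and m1 from above):
all.equal(unname(fitted(x)), unname(fitted(m1)))  # offset model reproduces m1's fitted values
# [1] TRUE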
@BondedDust's method will be more efficient in the long run, but just for illustration, here's a simple example of how to create your own function to calculate R-squared for any regression coefficients you choose. We'll use the mtcars data set, which is built into R.
Assume a regression model that predicts "mpg" using the independent variables "carb" and "wt". a, b, and c are the three regression parameters that we need to provide to the function.
# Function to calculate R-squared
R2 = function(a,b,c) {
# Calculate the residual sum of squares from the regression model
SSresid = sum(((a + b*mtcars$carb + c*mtcars$wt) - mtcars$mpg)^2)
# Calculate the total sum of squares
SStot = sum((mtcars$mpg - mean(mtcars$mpg))^2)
# Calculate and return the R-squared for the regression model
return(1 - SSresid/SStot)
}
Now let's run the function. First let's see if our function matches the R-squared calculated by lm. We'll do this by creating a regression model in R, then we'll use the coefficients from that model and calculate the R-squared using our function and see if it matches the output from lm:
# Create regression model
m1 = lm(mpg ~ carb + wt, data=mtcars)
summary(m1)
Call:
lm(formula = mpg ~ carb + wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.5206 -2.1223 -0.0467 1.4551 5.9736
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.7300 1.7602 21.435 < 2e-16 ***
carb -0.8215 0.3492 -2.353 0.0256 *
wt -4.7646 0.5765 -8.265 4.12e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.839 on 29 degrees of freedom
Multiple R-squared: 0.7924, Adjusted R-squared: 0.7781
F-statistic: 55.36 on 2 and 29 DF, p-value: 1.255e-10
From the summary, we can see that the R-squared is 0.7924. Let's see what we get from the function we just created. All we need to do is feed our function the three regression coefficients listed in the summary above. We can hard-code those numbers, or we can extract the coefficients from the model object m1 (which is what I've done below):
R2(coef(m1)[1], coef(m1)[2], coef(m1)[3])
[1] 0.7924425
Now let's calculate the R-squared for other choices of the regression coefficients:
a = 37; b = -1; c = -3.5
R2(a, b, c)
[1] 0.5277607
a = 37; b = -2; c = -5
R2(a, b, c)
[1] 0.0256494
To check lots of values of a parameter at once, you can, for example, use sapply. The code below will return the R-squared for values of c ranging from -7 to -3 in increments of 0.1 (with the other two parameters set to the values returned by lm):
sapply(seq(-7,-3,0.1), function(x) R2(coef(m1)[1],coef(m1)[2],x))
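If you want to pull out the grid value that maximizes R-squared, a small extension (not in the original answer):
cs <- seq(-7, -3, 0.1)
r2s <- sapply(cs, function(x) R2(coef(m1)[1], coef(m1)[2], x))
cs[which.max(r2s)]  # should land on the grid value closest to the lm estimate for wt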