I recently needed to combine two or more variables in a data set to evaluate whether their combination could enhance predictivity, so I ran some logistic regressions in R. On the statistics Q&A site, someone suggested that I might use linear discriminant analysis.
Since I don't have fitcdiscr.m in MATLAB, I'd rather go with lda in R, but I can't use the fit results to compute an AUC or anything similar. I see that the lda fit in R returns an object with several components, and I guess I should use fit$posterior to predict Cases against Controls, but I can't extract that data from it.
For reference, this is what I get as fit$posterior:
$posterior
0 1
1 0.7707927 0.22920726
2 0.7085165 0.29148352
3 0.6990989 0.30090106
4 0.5902161 0.40978387
5 0.8667109 0.13328912
6 0.6924406 0.30755939
7 0.7471086 0.25289141
8 0.7519326 0.24806736
And so on up to the last observation, which is 242. Whenever I try to extract, for example, column 1 with fit$posterior[,1], I get:
1 2 3 4 5 6 7 8
0.7707927 0.7085165 0.6990989 0.5902161 0.8667109 0.6924406 0.7471086 0.7519326
9 10 11 12 13 14 15 16
0.7519326 0.6902850 0.7519326 0.8080445 0.8075360 0.8484318 0.4860899 0.8694121
I don't know which part of the code would be useful, since my computations are very basic:
library(gdata)
data=read.xls("ECGvarious.xls", perl="C:/Strawberry/perl/bin/perl.exe");
i=6;
p=19;
temp=data[,i];
temp1=data[, p];
library(MASS)
fit <- lda(Case ~ temp + temp1, data=data, na.action="na.omit", CV=TRUE)
I can't link the data; in any case, ECGvarious is simply an N-observations by P-variables table, with N = N1 + N2, where N1 is the number of Controls and N2 the number of Cases; Cases are defined as subjects who developed the pathology during follow-up. The very last column of data is just 0 or 1 for Controls and Cases, respectively.
When I performed the logistic regression, I did:
mod1<-glm(Case ~ temp + temp1, data=data, family="binomial");
auctemp=auc(Case~predict(mod1), data=data);
Here's my input concerning logistic regression and prediction (I don't know much about linear discriminant analysis but understand it's closely related to logistic regression, which I know much better). I'm not sure I'm following all of your reasoning, nor if this will be a satisfactory answer, but hopefully it won't hurt. This has been a review of some epidemiology classes for me. I hope it's not too formal and that it addresses at least some of your questions. If not, and if other users think this would better belong on Cross Validated, I won't take offense. :)
Sample data
We'll first generate 200 observations, having increasing levels of probability for Case=1. The first predictor (pred1) will follow a distribution that is nonlinear, close to the one being modeled when doing logistic regression. It will be rather closely related to the proportion of Cases. The second predictor will just be random, uniformly distributed noise.
set.seed(2351)
df <- data.frame(Case = c(sample(c(0,1), size = 67, prob = c(0.8, 0.2), replace = TRUE),
sample(c(0,1), size = 66, prob = c(0.5, 0.5), replace = TRUE),
sample(c(0,1), size = 67, prob = c(0.2, 0.8), replace = TRUE)),
pred1 = 6/(1+4*exp(-seq(from = -3, to = 5, length.out = 200))) + rnorm(n = 200, mean = 2, sd=.5),
pred2 = runif(n = 200, min = 0, max = 100))
We see in the boxplot below that observations where Case==1 generally have higher pred1, as intended (given how we generated the data). At the same time, there is some overlap; otherwise deciding on a cutoff point/threshold would be too easy.
boxplot(pred1 ~ Case, data=df, xlab="Case", ylab="pred1")
Fitting the logistic model
First using both predictors:
model.1 <- glm(Case ~ pred1 + pred2, data=df, family=binomial(logit))
summary(model.1)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -2.058258 0.479094 -4.296 1.74e-05 ***
# pred1 0.428491 0.075373 5.685 1.31e-08 ***
# pred2 0.003399 0.005500 0.618 0.537
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 276.76 on 199 degrees of freedom
# Residual deviance: 238.51 on 197 degrees of freedom
# AIC: 244.51
As we'd expect, the first predictor is rather strongly related, and the second, poorly related to the outcome.
Note that to get Odds Ratios from those coefficients, we need to exponentiate them:
exp(model.1$coefficients[2:3])
# pred1 pred2
# 1.534939 1.003405 # Odds Ratios (making the relationships appear more clearly).
# Use `exp(confint(model.1))` to get confidence intervals.
We'll compare this model to a simpler model, removing the second predictor:
model.2 <- glm(Case ~ pred1, data=df, family=binomial(logit))
summary(model.2)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -1.87794 0.37452 -5.014 5.32e-07 ***
# pred1 0.42651 0.07514 5.676 1.38e-08 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 276.76 on 199 degrees of freedom
# Residual deviance: 238.89 on 198 degrees of freedom
# AIC: 242.89
exp(model.2$coefficients)[2]
# pred1
# 1.531907 # Odds Ratio
We could also run anova(model.1, model.2), but let's skip that part and move on to prediction, keeping the simpler model since the second variable doesn't add much predictive value, if any. In practice, having more predictors is rarely a problem unless they're truly random noise, but here I focus more on the operation of predicting and choosing a proper threshold.
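For reference, that skipped comparison would be a likelihood-ratio test between the two nested models (a minimal sketch, not run in the original answer):
anova(model.2, model.1, test = "Chisq")  # tests whether adding pred2 significantly reduces the deviance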
Stored predictions
In the model.2 object (a list), there is an item named fitted.values. Those values are exactly the same as what we'd get from predict(model.2, type="response"), and can be interpreted as probabilities: one for each row, based on the predictor(s) and their coefficient(s).
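A quick way to convince yourself of that equivalence (a small sketch using the model.2 object from above):
all.equal(model.2$fitted.values, predict(model.2, type = "response"))  # should return TRUE
head(model.2$fitted.values)  # one probability per row of df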
New predictions
It is also possible to predict the outcome for hypothetical rows not in our initial dataframe.
With model.1 (2 predictors):
predict(model.1, newdata = list(pred1=1, pred2=42), type="response")
# 1
# 0.1843701
With model.2 (1 predictor):
predict(model.2, newdata = list(pred1=12), type="response")
# 1
# 0.96232
Going from probability to binary response
Looking back at the link between our predictor pred1 and the calculated probability of having Case=1:
plot(df$pred1, model.2$fitted.values,
xlab="pred1", ylab="probability that Case=1")
We note that since we have only one predictor, the probability is a direct function of it. If we had kept the other predictor in the equation, we'd see points grouped around the same line, but in a cloud of points.
But this doesn't change the fact that if we want to evaluate how well our model predicts binary outcomes, we need to settle on a threshold above which we'll consider an observation to be a Case. Several packages have tools to help pick that threshold. But even without any additional package, we can calculate various properties over a range of thresholds using a function such as the one below, which calculates the sensitivity (ability to detect True Cases), specificity (ability to identify True Non-Cases), and other properties well described here.
preds <- predict(model.2, type = "response")  # compute the predicted probabilities once
df.ana <- data.frame(thresh = seq(from = 0.01, to = 1, by = 0.01))
for(i in seq_along(df.ana$thresh)) {
df.ana$sensitivity[i] <- sum(df$Case==1 & (preds >= df.ana$thresh[i])) / sum(df$Case==1)
df.ana$specificity[i] <- sum(df$Case==0 & (preds < df.ana$thresh[i])) / sum(df$Case==0)
df.ana$pos.pred.value[i] <- sum(df$Case==1 & (preds >= df.ana$thresh[i])) / sum(preds >= df.ana$thresh[i])
df.ana$neg.pred.value[i] <- sum(df$Case==0 & (preds < df.ana$thresh[i])) / sum(preds < df.ana$thresh[i])
df.ana$accuracy[i] <- sum((preds >= df.ana$thresh[i]) == df$Case) / nrow(df)
}
which.max(df.ana$accuracy)
# [1] 46
optimal.thresh <- df.ana$thresh[which.max(df.ana$accuracy)] # 0.46
The accuracy is the proportion of correct predictions over all predictions. The 46th threshold (0.46) is the "best" by that measure. Let's check a few neighboring rows in the generated dataframe; it tells us that 0.47 would work just as well on all fronts. Fine-tuning would involve adding new data to our initial dataframe.
df.ana[45:48,]
# thresh sensitivity specificity pos.pred.value neg.pred.value accuracy
# 45 0.45 0.7142857 0.6947368 0.7211538 0.6875000 0.705
# 46 0.46 0.7142857 0.7157895 0.7352941 0.6938776 0.715
# 47 0.47 0.7142857 0.7157895 0.7352941 0.6938776 0.715
# 48 0.48 0.7047619 0.7157895 0.7326733 0.6868687 0.710
Note that the auc function (area under the curve), applied to these binarized predictions, gives essentially the same number: with a 0/1 predictor the AUC equals the average of sensitivity and specificity, which at this threshold happens to be very close to the accuracy:
library(pROC)
auc(Case ~ as.numeric(predict(model.2, type="response") >= optimal.thresh), data=df)
# Area under the curve: 0.715
Some plots
# thresholds against accuracy
plot(x=df.ana$thresh, y=df.ana$accuracy, type="l",
xlab="Threshold", ylab="", xlim=c(0,1), ylim=c(0,1))
text(x = 0.1, y = 0.5, labels = "Accuracy", col="black")
# thresholds against Sensitivity
lines(x=df.ana$thresh, y=df.ana$sensitivity, type="l",col="blue") # Sensitivity We want to maximize this, but not too much
text(x = 0.1, y = 0.95, labels = "Sensitivity", col="blue")
# thresholds against specificity
lines(x=df.ana$thresh, y=df.ana$specificity, type="l", col="red") # Specificity we want to maximize also, but not too much
text(x = 0.1, y = 0.05, labels = "Specificity", col="red")
# optimal threshold vertical line
abline(v=optimal.thresh)
text(x=optimal.thresh + .01, y=0.05, labels= optimal.thresh)
Incidentally, all lines converge more or less to the same point, which suggests this is a good compromise between all the qualities we look for in a predictive tool. But depending on your objectives, it might be better to pick a lower or a higher threshold. Statistical tools are useful, but in the end, other considerations are often more important in making a final decision.
About ROC
The following graph is the same as the one which would be produced with pROC's roc:
plot(x=df.ana$specificity, y = df.ana$sensitivity, type="l", col="blue",
xlim = c(1,0), xlab = "Specificity", ylab = "Sensitivity")
# Equivalent to
# plot(roc(predictor=model.2$fitted.values, response = model.2$y))
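For comparison, the usual AUC is computed from the continuous predicted probabilities rather than from the thresholded 0/1 predictions (a short sketch using the pROC package loaded above; output not shown here):
auc(Case ~ predict(model.2, type = "response"), data = df)  # threshold-free AUC of the model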
Tabulations and other stats
The following function allows one to calculate, for a logistic model fit, the same stats seen above, and gives a 2x2 table for any chosen threshold.
diagnos.test <- function(model, threshold) {
output <- list()
preds <- predict(model, type = "response")   # predicted probabilities from the supplied model
output$stats <- c(
sensitivity = sum(model$y==1 & (preds >= threshold)) / sum(model$y==1),
specificity = sum(model$y==0 & (preds < threshold)) / sum(model$y==0),
pos.pr.value = sum(model$y==1 & (preds >= threshold)) / sum(preds >= threshold),
neg.pr.value = sum(model$y==0 & (preds < threshold)) / sum(preds < threshold),
accuracy = sum((preds >= threshold) == model$y) / length(model$y))
output$tab <- addmargins(t(table(model$y, as.numeric(preds >= threshold), dnn = list("Cases", "Predictions")))[2:1,2:1])
return(output)
}
diagnos.test(model.2, 0.47)
# $stats
# sensitivity specificity pos.pr.value neg.pr.value accuracy
# 0.7142857 0.7157895 0.7352941 0.6938776 0.7150000
#
# $tab
# Cases
# Predictions 1 0 Sum
# 1 75 27 102
# 0 30 68 98
# Sum 105 95 200
Final note
I don't pretend to have covered everything about prediction, sensitivity and specificity; my goal was rather to go as far as possible using common language and calculations, without relying on any specific packages.
Related
I have an issue when fitting a logistic regression in R that, to me, makes no sense.
I have one parameter in the model, positive numbers (molecular weight).
I have a binary response variable, let's say either A or B.
My data table is called df1.
str(df1)
'data.frame': 1015 obs. of 2 variables:
$ Protein_Class: chr "A" "A" "A" "B" ...
$ MW : num 47114 29586 26665 34284 104297 ...
I make the model:
summary(glm(as.factor(df1[,1]) ~ df1[,2],family="binomial"))
The results are:
Call:
glm(formula = as.factor(df1[, 1]) ~ df1[, 2], family = "binomial")
Deviance Residuals:
Min 1Q Median 3Q Max
-1.5556 -1.5516 0.8430 0.8439 0.8507
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.562e-01 1.251e-01 6.842 7.8e-12 ***
df1[, 2] -1.903e-07 3.044e-06 -0.063 0.95
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1239.2 on 1014 degrees of freedom
Residual deviance: 1239.2 on 1013 degrees of freedom
AIC: 1243.2
Number of Fisher Scoring iterations: 4
That's all fine and good until this point.
But, when I take the logarithm of my variable:
summary(glm(as.factor(df1[,1]) ~ log10(df1[,2]),family="binomial"))
Call:
glm(formula = as.factor(df1[, 1]) ~ log10(df1[, 2]), family = "binomial")
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8948 -1.4261 0.8007 0.8528 1.0469
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.7235 1.1169 -2.438 0.01475 *
log10(df1[, 2]) 0.8038 0.2514 3.197 0.00139 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1239.2 on 1014 degrees of freedom
Residual deviance: 1228.9 on 1013 degrees of freedom
AIC: 1232.9
Number of Fisher Scoring iterations: 4
The p-value has changed!
How can this be? And more importantly, which one to use?
My understanding was that logistic regression is based on ranks, and all I did was a monotone transformation. Note that the AUROC of the model remains the same.
There are no zero or negative values that are lost during the transformation.
Did I miss something here?
Any advice?
Thanks in advance,
Adam
There are a couple of things to think about. First, you can probably constrain your search to one side or the other of 1. Transformations that decrease the power on x (square root, log, inverse, etc.) all have a similar type of effect, to differing degrees: they pull in big values and spread out small values. Powers greater than 1 do the opposite; they tend to increase the spread among big values and decrease the spread among small values (all generally assuming you have no non-positive values in your variable). This is really, then, a question of what kind of transformation you want and, after that, how severe it has to be.
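As a quick numeric illustration of that compressing effect (a minimal sketch, not part of the original answer), compare how each transformation spaces out the values 1, 10, 100 and 1000:
v <- c(1, 10, 100, 1000)
rbind(sqrt = sqrt(v),   # mild compression of the large values
      log  = log(v),    # stronger compression
      inv  = -1/v)      # strongest: the large values end up bunched together near 0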
First, what kind of transformation do you need? I made some fake data to illustrate the point:
library(dplyr)
library(tidyr)
library(ggplot2)
set.seed(1234)
x <- runif(1000, 1, 10000)
y.star <- -6 + log(x)
y <- rbinom(1000, 1, plogis(y.star) )
df <- tibble(
y=y,
x=x,
ystar=y.star)
Next, since this is just a bivariate relationship, we could plot it out with a loess curve. In particular, though, we want to know what the log-odds of y look like with respect to x. We can do this by transforming the predictions from the loess curve with the logistic quantile function, qlogis() - this takes the probabilities and puts them in log-odds form. Then, we could make the plot.
lo <- loess(y ~ x, span=.75)
df <- df %>% mutate(fit = predict(lo),
fit = case_when(
fit < .01 ~ .01,
fit > .99 ~ .99,
TRUE ~ fit))
ggplot(df) +
geom_line(aes(x=x, y=qlogis(fit)))
This looks like a classic log relationship. We could then implement a few different transformations and plot those - square root, log and negative inverse.
lo1 <- loess(y ~ sqrt(x), span=.5)
lo2 <- loess(y ~ log(x), span=.5)
lo3 <- loess(y ~ I(-(1/x)), span=.5)
df <- df %>% mutate(fit1 = predict(lo1),
fit1 = case_when(
fit1 < .01 ~ .01,
fit1 > .99 ~ .99,
TRUE ~ fit1))
df <- df %>% mutate(fit2 = predict(lo2),
fit2 = case_when(
fit2 < .01 ~ .01,
fit2 > .99 ~ .99,
TRUE ~ fit2))
df <- df %>% mutate(fit3 = predict(lo3),
fit3 = case_when(
fit3 < .01 ~ .01,
fit3 > .99 ~ .99,
TRUE ~ fit3))
Next, we need to transform the data so the plotting will look right:
plot.df <- df %>%
tidyr::pivot_longer(cols=starts_with("fit"),
names_to="var",
values_to="vals") %>%
mutate(x2 = case_when(
var == "fit" ~ x,
var == "fit1" ~ sqrt(x),
var == "fit2" ~ log(x),
var == "fit3" ~ -(1/x),
TRUE ~ x),
var = factor(var, labels=c("Original", "Square Root", "Log", "Inverse")))
Then, we can make the plot:
ggplot(plot.df, aes(x=x2, y=vals)) +
geom_line() +
facet_wrap(~var, scales="free_x")
Here, it looks like the log is the most linear of the bunch - not surprising since we made the variable y.star with log(x). If we wanted to test between these different possibilities, Kevin Clarke, a political scientist at Rochester, proposed a paired sign test for evaluating the difference between non-nested models. There is a paper about it here. I wrote a package called clarkeTest that implements this in R. So, we could use this to test the various alternatives:
m0 <- glm(y ~ x, data=df, family=binomial)
m1 <- glm(y ~ sqrt(x), data=df, family=binomial)
m2 <- glm(y ~ log(x), data=df, family=binomial)
m3 <- glm(y ~ I(-(1/x)), data=df, family=binomial)
Testing the original against the square root:
library(clarkeTest)
clarke_test(m0, m1)
#
# Clarke test for non-nested models
#
# Model 1 log-likelihood: -309
# Model 2 log-likelihood: -296
# Observations: 1000
# Test statistic: 400 (40%)
#
# Model 2 is preferred (p = 2.7e-10)
This shows that the square root is better than the original un-transformed variable.
clarke_test(m0, m2)
#
# Clarke test for non-nested models
#
# Model 1 log-likelihood: -309
# Model 2 log-likelihood: -284
# Observations: 1000
# Test statistic: 462 (46%)
#
# Model 2 is preferred (p = 0.018)
The above shows that the log is better than the un-transformed variable.
clarke_test(m0, m3)
#
# Clarke test for non-nested models
#
# Model 1 log-likelihood: -309
# Model 2 log-likelihood: -292
# Observations: 1000
# Test statistic: 550 (55%)
#
# Model 1 is preferred (p = 0.0017)
The above shows that the un-transformed variable is preferred to the negative inverse. Then, we can test the difference of the two models preferred to the original.
clarke_test(m1, m2)
#
# Clarke test for non-nested models
#
# Model 1 log-likelihood: -296
# Model 2 log-likelihood: -284
# Observations: 1000
# Test statistic: 536 (54%)
#
# Model 1 is preferred (p = 0.025)
This shows that the square root is better than the log transformation in terms of individual log-likelihoods.
Another option would be a grid search over possible transformations, looking at the AIC each time. We first have to make a function to deal with the case where the transformation power is 0, in which case we substitute the log. Then we can run a model for each transformation and collect the AICs.
grid <- seq(-1,1, by=.1)
trans <- function(x, power){
if(power == 0){
tx <- log(x)
}else{
tx <- x^power
}
tx
}
mods <- lapply(grid, function(p)glm(y ~ trans(x, p),
data=df,
family=binomial))
aic.df <- tibble(
power = grid,
aic = sapply(mods, AIC))
Next, we can plot the AICs as a function of the power.
ggplot(aic.df, aes(x=power, y=aic)) +
geom_line()
This tells us that about -.25 is the appropriate transformation parameter. Note that there is a discrepancy between the Clarke test results and the AIC because AIC is based on the overall log-likelihood and the Clarke test is based on differences in the individual log-likelihoods.
We would find that this new proposed transformation is also worse than the square root:
m4 <- glm(y ~ I(x^-.25), data=df, family=binomial)
clarke_test(m1, m4)
#
# Clarke test for non-nested models
#
# Model 1 log-likelihood: -296
# Model 2 log-likelihood: -283
# Observations: 1000
# Test statistic: 559 (56%)
#
# Model 1 is preferred (p = 0.00021)
So, if you have a couple of different candidates in mind and you like the idea behind the Clarke test, you could use that to find the appropriate transformation. If you don't have a candidate in mind, a grid search is always a possibility.
I've got a logistic prediction model which produced, for each person, a probability of being a case. Model AUC is 0.95.
Is there a way to determine the probability threshold that would give me 0.9 specificity? (Or any other arbitrarily specified level of specificity or sensitivity.) Thank you.
Empirical values of sensitivity and specificity are of course data-set dependent. You can try extracting the class probability predicted by the logistic model using predict and setting different thresholds to calibrate it against specificity, but keep in mind that for your specificity figures to remain accurate on test data, the proportions of the classes have to be similarly distributed in training and test populations. In the example below, I created a function to map training data specificity to logistic model probability response thresholds for a simulated dataset.
set.seed(100)
x = rnorm(1000)
y = sapply(x, function(zeta) rbinom(1, 1, plogis(zeta)))
data <- data.frame(x = x, y = y)
logistic_model <- glm(data = data, formula = y ~ 0 + x, family = "binomial")
summary(logistic_model)
# Call:
# glm(formula = y ~ 0 + x, family = "binomial", data = data)
#
# Deviance Residuals:
# Min 1Q Median 3Q Max
# -2.4626 -0.9187 0.5383 1.0284 2.3236
#
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# x 1.09347 0.08576 12.75 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 1386.3 on 1000 degrees of freedom
# Residual deviance: 1163.2 on 999 degrees of freedom
# AIC: 1165.2
#
# Number of Fisher Scoring iterations: 4
data$response <- predict(logistic_model, type = "response")
p_vals = seq(0,1,0.001)
specificity <- sapply(p_vals, function(p) sum(data$y == 0 & data$response < p)/sum(data$y == 0))
plot(p_vals, specificity, type = "l")
threshold_by_specificity <- function(spc)
return(p_vals[sum(specificity <= spc)])
threshold_by_specificity(0.1)
##0.13
threshold_by_specificity(0.3)
##0.251
P.S. I am quite sure there is a function to do this in the caret package, but I couldn't find it.
P.P.S. As an aside, the logistic model specifies a probability distribution for the class given the feature vector, and obtaining theoretical values for sensitivity and/or specificity would involve the opposite, that is, a model that specifies a distribution for the feature vector given the class. To obtain this from the logistic model you'd need to assume a prior distribution for the data (or fit one to it). Without more details, it's not apparent how you should go about doing that, or if it is even needed.
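P.P.P.S. If you prefer not to roll your own search, the pROC package can, I believe, do the same lookup directly via its coords() function. A hedged sketch using the data and model from above (check ?coords, as the interface has changed between versions):
library(pROC)
roc_obj <- roc(data$y, data$response)
coords(roc_obj, x = 0.9, input = "specificity", ret = "threshold")  # threshold achieving ~0.9 training specificity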
I'm doing logistic regression on the Boston data with a column high.medv (yes/no) which indicates whether the median house price given by column medv is more than 25 or not.
Below is my code for logistic regression.
high.medv <- ifelse(Boston$medv > 25, "Y", "N") # apply the desired condition to medv and store the result in a new variable called high.medv
ourBoston <- data.frame (Boston, high.medv)
ourBoston$high.medv <- as.factor(ourBoston$high.medv)
attach(Boston)
# 70% of data <- Train
train2<- subset(ourBoston,sample==TRUE)
# 30% will be Test
test2<- subset(ourBoston, sample==FALSE)
glm.fit <- glm (high.medv ~ lstat,data = train2, family = binomial)
summary(glm.fit)
The output is as follows:
Deviance Residuals:
[1] 0
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -22.57 48196.14 0 1
lstat NA NA NA NA
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 0.0000e+00 on 0 degrees of freedom
Residual deviance: 3.1675e-10 on 0 degrees of freedom
AIC: 2
Number of Fisher Scoring iterations: 21
I also need the following: I'm required to use the misclassification rate as the measure of error for two cases:
using lstat as the predictor, and
using all predictors except high.medv and medv.
But I am stuck at the regression itself.
With every classification algorithm, the art lies in choosing the threshold at which you will determine whether the result is positive or negative.
When you predict your outcomes on the test data set, you estimate the probability of the response variable being either 1 or 0. Therefore, you need to tell it where to cut, i.e. the threshold at which the prediction becomes 1 or 0.
A high threshold is more conservative about labeling a case as positive, which makes it less likely to produce false positives and more likely to produce false negatives. The opposite happens for low thresholds.
The usual procedure is to plot the rates that interest you, e.g., true positives and false positives against each other, and then choose the trade-off that is best for you.
set.seed(666)
# simulation of logistic data
x1 = rnorm(1000) # some continuous variables
z = 1 + 2*x1 # linear combination with a bias
pr = 1/(1 + exp(-z)) # pass through an inv-logit function
y = rbinom(1000, 1, pr)
df = data.frame(y = y, x1 = x1)
df$train = 0
df$train[sample(1:(2*nrow(df)/3))] = 1
df$new_y = NA
# modelling the response variable
mod = glm(y ~ x1, data = df[df$train == 1,], family = "binomial")
df$new_y[df$train == 0] = predict(mod, newdata = df[df$train == 0,], type = 'response') # predicted probabilities
dat = df[df$train==0,] # test data
To use the misclassification error to evaluate your model, you first need to set a threshold. For that, you can use the roc function from the pROC package, which calculates the rates and provides the corresponding thresholds:
library(pROC)
rates =roc(dat$y, dat$new_y)
plot(rates) # visualize the trade-off
rates$specificity # shows the ratio of true negative over overall negatives
rates$thresholds # shows you the corresponding thresholds
dat$jj = as.numeric(dat$new_y>0.7) # using 0.7 as a threshold to indicate that we predict y = 1
table(dat$y, dat$jj) # provides the miss classifications given 0.7 threshold
0 1
0 86 20
1 64 164
The accuracy of your model can be computed as the ratio of the number of observations you got right to the size of your sample.
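As a minimal sketch using the objects created above (dat$y and the thresholded prediction dat$jj), the misclassification rate and accuracy follow directly:
misclass_rate <- mean(dat$jj != dat$y)  # proportion of wrong predictions at the 0.7 threshold
accuracy <- 1 - misclass_rate           # same as sum(diag(table(dat$y, dat$jj))) / nrow(dat)
misclass_rate
accuracy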
I'm extremely stuck at the moment trying to figure out how to calculate a probability from my glm output in R. I know the effect in the data is very insignificant, but I would really love to be shown how to get the probability from an output like this. I was thinking of trying inv.logit(), but didn't know what variables to put inside the brackets.
The data are from an occupancy study. I'm assessing the success of a hair-trap method versus a camera trap in detecting 3 species (red squirrel, pine marten and invasive grey squirrel). I wanted to see what affected detection (or non-detection) of the various species. One hypothesis was that the detection of another focal species at the site would affect the detectability of the red squirrel. Given that the pine marten is a predator of the red squirrel and the grey squirrel is a competitor, the presence of those two species at a site might affect the detectability of the red squirrel.
Would this show the probability? inv.logit(-1.14 - 0.1322 * nonRS events)
glm(formula = RS_sticky ~ NonRSevents_before1stRS, family = binomial(link = "logit"), data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.7432 -0.7432 -0.7222 -0.3739 2.0361
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.1455 0.4677 -2.449 0.0143 *
NonRSevents_before1stRS -0.1322 0.1658 -0.797 0.4255
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 34.575 on 33 degrees of freedom
Residual deviance: 33.736 on 32 degrees of freedom
(1 observation deleted due to missingness)
AIC: 37.736
Number of Fisher Scoring iterations: 5
If you want to predict the probability of response for a specified set of values of the predictor variable:
pframe <- data.frame(NonRSevents_before1stRS=4)
predict(fitted_model, newdata=pframe, type="response")
where fitted_model is the result of your glm() fit, which you stored in a variable. You may not be familiar with the R approach to statistical analysis, which is to store the fitted model as an object/in a variable, then apply different methods to it (summary(), plot(), predict(), residuals(), ...)
(This is obviously only a made-up example: I don't know if 4 is a reasonable value for the NonRSevents_before1stRS variable.)
You can specify several different values to predict for at the same time (data.frame(NonRSevents_before1stRS=c(4,5,6,7,8))).
If you have multiple predictors, you have to specify a value for every predictor for every prediction, e.g. data.frame(x=4:8, y=mean(orig_data$y), ...).
If you want the predicted probabilities for the observations in your original data set, just predict(fitted_model, type="response")
You're correct that inv.logit() (from a bunch of different packages, don't know which you're using) or plogis() (from base R, essentially the same) will translate from the logit or log-odds scale to the probability scale, so
plogis(predict(fitted_model))
would also work (predict provides predictions on the link-function [in this case logit/log-odds] scale by default).
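In other words (a small sketch, with fitted_model again standing in for your stored glm fit), the two routes give identical numbers:
all.equal(plogis(predict(fitted_model)),             # back-transform the link-scale (log-odds) predictions
          predict(fitted_model, type = "response"))  # or ask predict for probabilities directly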
A logistic regression models the log odds of the outcome, so its coefficients can be read as log odds ratios. We'll illustrate how to interpret the coefficients with the space shuttle autolander data from the MASS package.
After loading the data, we'll create a binary dependent variable where:
1 = autolander used,
0 = autolander not used.
We will also create a binary independent variable for shuttle stability:
1 = stable positioning
0 = unstable positioning.
Then, we'll run glm() with family=binomial(link="logit"). Since the coefficients are log odds ratios, we'll exponentiate them to turn them back into odds ratios.
library(MASS)
str(shuttle)
shuttle$stable <- 0
shuttle[shuttle$stability =="stab","stable"] <- 1
shuttle$auto <- 0
shuttle[shuttle$use =="auto","auto"] <- 1
fit <- glm(use ~ factor(stable),family=binomial(link = "logit"),data=shuttle) # specifies base as unstable
summary(fit)
exp(fit$coefficients)
...and the output:
> fit <- glm(use ~ factor(stable),family=binomial(link = "logit"),data=shuttle) # specifies base as unstable
>
> summary(fit)
Call:
glm(formula = use ~ factor(stable), family = binomial(link = "logit"),
data = shuttle)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.1774 -1.0118 -0.9566 1.1774 1.4155
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.747e-15 1.768e-01 0.000 1.0000
factor(stable)1 -5.443e-01 2.547e-01 -2.137 0.0326 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 350.36 on 255 degrees of freedom
Residual deviance: 345.75 on 254 degrees of freedom
AIC: 349.75
Number of Fisher Scoring iterations: 4
> exp(fit$coefficients)
(Intercept) factor(stable)1
1.0000000 0.5802469
>
The intercept of 0 is the log odds for unstable, and the coefficient of -.5443 is the log odds ratio for stable versus unstable. After exponentiating the coefficients, we observe that the odds of autolander use for an unstable shuttle are 1.0, and are multiplied by .58 if the shuttle is stable. This means that the autolander is less likely to be used if the shuttle has stable positioning.
Calculating probability of autolander use
We can do this in two ways. First, the manual approach: exponentiate the coefficients and convert the odds to probabilities using the following equation.
p = odds / (1 + odds)
With the shuttle autolander data it works as follows.
# convert intercept to probability
odds_i <- exp(fit$coefficients[1])
odds_i / (1 + odds_i)
# convert stable="stable" to probability
odds_p <- exp(fit$coefficients[1]) * exp(fit$coefficients[2])
odds_p / (1 + odds_p)
...and the output:
> # convert intercept to probability
> odds_i <- exp(fit$coefficients[1])
> odds_i / (1 + odds_i)
(Intercept)
0.5
> # convert stable="stable" to probability
> odds_p <- exp(fit$coefficients[1]) * exp(fit$coefficients[2])
> odds_p / (1 + odds_p)
(Intercept)
0.3671875
>
The probability of autolander use when a shuttle is unstable is 0.5, and decreases to 0.37 when the shuttle is stable.
The second approach to generate probabilities is to use the predict() function.
# convert to probabilities with the predict() function
predict(fit,data.frame(stable="0"),type="response")
predict(fit,data.frame(stable="1"),type="response")
Note that the output matches the manually calculated probabilities.
> # convert to probabilities with the predict() function
> predict(fit,data.frame(stable="0"),type="response")
1
0.5
> predict(fit,data.frame(stable="1"),type="response")
1
0.3671875
>
Applying this to the OP data
We can apply these steps to the glm() output from the OP as follows.
coefficients <- c(-1.1455,-0.1322)
exp(coefficients)
odds_i <- exp(coefficients[1])
odds_i / (1 + odds_i)
# convert nonRSEvents = 1 to probability
odds_p <- exp(coefficients[1]) * exp(coefficients[2])
odds_p / (1 + odds_p)
# simulate up to 10 nonRSEvents prior to RS
coef_df <- data.frame(nonRSEvents=0:10,
intercept=rep(-1.1455,11),
nonRSEventSlope=rep(-0.1322,11))
coef_df$nonRSEventValue <- coef_df$nonRSEventSlope *
coef_df$nonRSEvents
coef_df$intercept_exp <- exp(coef_df$intercept)
coef_df$slope_exp <- exp(coef_df$nonRSEventValue)
coef_df$odds <- coef_df$intercept_exp * coef_df$slope_exp
coef_df$probability <- coef_df$odds / (1 + coef_df$odds)
# print the odds & probabilities by number of nonRSEvents
coef_df[,c(1,7:8)]
...and the final output.
> coef_df[,c(1,7:8)]
nonRSEvents odds probability
1 0 0.31806 0.24131
2 1 0.27868 0.21794
3 2 0.24417 0.19625
4 3 0.21393 0.17623
5 4 0.18744 0.15785
6 5 0.16423 0.14106
7 6 0.14389 0.12579
8 7 0.12607 0.11196
9 8 0.11046 0.09947
10 9 0.09678 0.08824
11 10 0.08480 0.07817
>
I was looking for a way to do a linear regression under positivity constraints, and therefore came across the nnls approach. However, I was wondering how I could get the same statistics from nnls as those provided by an lm object, more specifically the R-squared, the Akaike information criterion, the p-values and confidence intervals.
library(arm)
library(nnls)
data = runif(100*4, min = -1, max = 1)
data = matrix(data, ncol = 4)
colnames(data) = c("y", "x1", "x2", "x3")
data = as.data.frame(data)
data$x1 = -data$y
A = as.matrix(data[,c("x1", "x2", "x3")])
b = data$y
test = nnls(A,b)
print(test)
Is there a way to re-estimate this in an lm framework? Using an offset and fixing the coefficients did not work. Is there a way to obtain these statistics, or another way to create an lm object with positivity constraints on the coefficients?
Thanks
Romain.
What you are proposing to do is a massively bad idea, so much so that I'm reluctant to show you how to do it. The reason is that for OLS, assuming the residuals are normally distributed with constant variance, the parameter estimates follow a multivariate t-distribution and we can calculate confidence limits and p-values in the usual way.
However, if we perform NNLS on the same data, the residuals will not be normally distributed, and the standard techniques for calculating p-values, etc. will produce garbage. There are methods for estimating confidence limits on the parameters of an NNLS fit (see this reference for instance), but they are approximate and usually rely on fairly restrictive assumptions about the dataset.
On the other hand, it would be nice if some of the more basic functions for an lm object, such as predict(...), coef(...), residuals(...), etc., also worked for the result of an NNLS fit. One way to achieve that is to use nls(...): just because a model is linear in the parameters does not mean you cannot use non-linear least squares to find the parameters. nls(...) offers the option to set lower (and upper) limits on the parameters if you use the port algorithm.
set.seed(1) # for reproducible example
data <- as.data.frame(matrix(runif(1e4, min = -1, max = 1),nc=4))
colnames(data) <-c("y", "x1", "x2", "x3")
data$y <- with(data,-10*x1+x2 + rnorm(2500))
A <- as.matrix(data[,c("x1", "x2", "x3")])
b <- data$y
test <- nnls(A,b)
test
# Nonnegative least squares model
# x estimates: 0 1.142601 0
# residual sum-of-squares: 88391
# reason terminated: The solution has been computed sucessfully.
fit <- nls(y~b.1*x1+b.2*x2+b.3*x3,data,algorithm="port",lower=c(0,0,0))
fit
# Nonlinear regression model
# model: y ~ b.1 * x1 + b.2 * x2 + b.3 * x3
# data: data
# b.1 b.2 b.3
# 0.000 1.143 0.000
# residual sum-of-squares: 88391
As you can see, the result of using nnls(...) and the result of using nls(...) with lower=c(0,0,0) are identical. But nls(...) produces an nls object, which supports (most of) the same methods as an lm object. So you can write predict(fit), coef(fit), residuals(fit), AIC(fit), etc. You can also write summary(fit) and confint(fit), but beware: the values you get are not meaningful!!!
To illustrate the point about the residuals, we compare the residuals for an OLS fit to this data, with the residuals for the NNLS fit.
par(mfrow=c(1,2),mar=c(3,4,1,1))
qqnorm(residuals(lm(y~.,data)),main="OLS"); qqline(residuals(lm(y~.,data)))
qqnorm(residuals(fit),main="NNLS"); qqline(residuals(fit))
In this dataset, the stochastic part of the variability in y is N(0,1) by design, so the residuals from the OLS fit (Q-Q plot on the left) are normal. But the residuals from the same dataset fitted using NNLS are not remotely normal. This is because the true dependence of y on x1 is -10, but the NNLS fit forces it to 0. Consequently, the proportion of very large residuals (both positive and negative) is much higher than would be expected from the normal distribution.
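For the remaining statistics asked about, AIC is available directly from the nls object, and a descriptive R-squared-type quantity can be computed by hand (a hedged sketch; given the caveats above, treat these as descriptive only, not inferential):
AIC(fit)                              # information criterion of the constrained fit
rss <- sum(residuals(fit)^2)          # residual sum of squares (matches the nnls value, 88391)
tss <- sum((data$y - mean(data$y))^2) # total sum of squares
1 - rss / tss                         # pseudo R-squared of the constrained fit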
I think you could use bbmle's mle2 function to optimize the least squares likelihood function and calculate 95% confidence intervals on the nonnegative nnls coefficients. Furthermore, you can take into account that your coefficients cannot go negative by optimizing the log of your coefficients, so that on a backtransformed scale they could never become negative.
Here is a numerical example illustrating this approach, in the context of deconvoluting a superposition of Gaussian-shaped chromatographic peaks with Gaussian noise on them (any comments welcome).
First, let's simulate some data:
require(Matrix)
n = 200
x = 1:n
npeaks = 20
set.seed(123)
u = sample(x, npeaks, replace=FALSE) # peak locations which later need to be estimated
peakhrange = c(10,1E3) # peak height range
h = 10^runif(npeaks, min=log10(min(peakhrange)), max=log10(max(peakhrange))) # simulated peak heights, to be estimated
a = rep(0, n) # locations of spikes of simulated spike train, need to be estimated
a[u] = h
gauspeak = function(x, u, w, h=1) h*exp(((x-u)^2)/(-2*(w^2))) # shape of single peak, assumed to be known
bM = do.call(cbind, lapply(1:n, function (u) gauspeak(x, u=u, w=5, h=1) )) # banded matrix with theoretical peak shape function used
y_nonoise = as.vector(bM %*% a) # noiseless simulated signal = linear convolution of spike train with peak shape function
y = y_nonoise + rnorm(n, mean=0, sd=100) # simulated signal with gaussian noise on it
y = pmax(y,0)
par(mfrow=c(1,1))
plot(y, type="l", ylab="Signal", xlab="x", main="Simulated spike train (red) to be estimated given known blur kernel & with Gaussian noise")
lines(a, type="h", col="red")
Let's now deconvolute the measured noisy signal y with a banded matrix containing shifted copies of the known Gaussian-shaped blur kernel bM (this is our covariate/design matrix).
First, let's deconvolute the signal with nonnegative least squares:
library(nnls)
library(microbenchmark)
microbenchmark(a_nnls <- nnls(A=bM,b=y)$x) # 5.5 ms
plot(x, y, type="l", main="Ground truth (red), nnls estimate (blue)", ylab="Signal (black) & peaks (red & blue)", xlab="Time", ylim=c(-max(y),max(y)))
lines(x,-y)
lines(a, type="h", col="red", lwd=2)
lines(-a_nnls, type="h", col="blue", lwd=2)
yhat = as.vector(bM %*% a_nnls) # predicted values
residuals = (y-yhat)
nonzero = (a_nnls!=0) # nonzero coefficients
n = length(y) # number of observations
p = sum(nonzero)+1 # nr of estimated parameters = nr of nonzero coefficients+estimated variance
variance = sum(residuals^2)/(n-p) # estimated variance = 8114.505
Now let's optimize the negative log-likelihood of our Gaussian loss objective, optimizing the log of the coefficients so that on the backtransformed scale they can never become negative:
library(bbmle)
XM=as.matrix(bM)[,nonzero,drop=FALSE] # design matrix, keeping only covariates with nonnegative nnls coefs
colnames(XM)=paste0("v",as.character(1:n))[nonzero]
yv=as.vector(y) # response
# negative log likelihood function for gaussian loss
NEGLL_gaus_logbetas <- function(logbetas, X=XM, y=yv, sd=sqrt(variance)) {
-sum(stats::dnorm(x = y, mean = X %*% exp(logbetas), sd = sd, log = TRUE))
}
parnames(NEGLL_gaus_logbetas) <- colnames(XM)
system.time(fit <- mle2(
minuslogl = NEGLL_gaus_logbetas,
start = setNames(log(a_nnls[nonzero]+1E-10), colnames(XM)), # we initialise with nnls estimates
vecpar = TRUE,
optimizer = "nlminb"
)) # takes 0.86s
AIC(fit) # 2394.857
summary(fit) # now gives log(coefficients) (note that p values are 2 sided)
# Coefficients:
# Estimate Std. Error z value Pr(z)
# v10 4.57339 2.28665 2.0000 0.0454962 *
# v11 5.30521 1.10127 4.8173 1.455e-06 ***
# v27 3.36162 1.37185 2.4504 0.0142689 *
# v38 3.08328 23.98324 0.1286 0.8977059
# v39 3.88101 12.01675 0.3230 0.7467206
# v48 5.63771 3.33932 1.6883 0.0913571 .
# v49 4.07475 16.21209 0.2513 0.8015511
# v58 3.77749 19.78448 0.1909 0.8485789
# v59 6.28745 1.53541 4.0950 4.222e-05 ***
# v70 1.23613 222.34992 0.0056 0.9955643
# v71 2.67320 54.28789 0.0492 0.9607271
# v80 5.54908 1.12656 4.9257 8.407e-07 ***
# v86 5.96813 9.31872 0.6404 0.5218830
# v87 4.27829 84.86010 0.0504 0.9597911
# v88 4.83853 21.42043 0.2259 0.8212918
# v107 6.11318 0.64794 9.4348 < 2.2e-16 ***
# v108 4.13673 4.85345 0.8523 0.3940316
# v117 3.27223 1.86578 1.7538 0.0794627 .
# v129 4.48811 2.82435 1.5891 0.1120434
# v130 4.79551 2.04481 2.3452 0.0190165 *
# v145 3.97314 0.60547 6.5620 5.308e-11 ***
# v157 5.49003 0.13670 40.1608 < 2.2e-16 ***
# v172 5.88622 1.65908 3.5479 0.0003884 ***
# v173 6.49017 1.08156 6.0008 1.964e-09 ***
# v181 6.79913 1.81802 3.7399 0.0001841 ***
# v182 5.43450 7.66955 0.7086 0.4785848
# v188 1.51878 233.81977 0.0065 0.9948174
# v189 5.06634 5.20058 0.9742 0.3299632
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# -2 log L: 2338.857
exp(confint(fit, method="quad")) # backtransformed confidence intervals calculated via quadratic approximation (=Wald confidence intervals)
# 2.5 % 97.5 %
# v10 1.095964e+00 8.562480e+03
# v11 2.326040e+01 1.743531e+03
# v27 1.959787e+00 4.242829e+02
# v38 8.403942e-20 5.670507e+21
# v39 2.863032e-09 8.206810e+11
# v48 4.036402e-01 1.953696e+05
# v49 9.330044e-13 3.710221e+15
# v58 6.309090e-16 3.027742e+18
# v59 2.652533e+01 1.090313e+04
# v70 1.871739e-189 6.330566e+189
# v71 8.933534e-46 2.349031e+47
# v80 2.824905e+01 2.338118e+03
# v86 4.568985e-06 3.342200e+10
# v87 4.216892e-71 1.233336e+74
# v88 7.383119e-17 2.159994e+20
# v107 1.268806e+02 1.608602e+03
# v108 4.626990e-03 8.468795e+05
# v117 6.806996e-01 1.021572e+03
# v129 3.508065e-01 2.255556e+04
# v130 2.198449e+00 6.655952e+03
# v145 1.622306e+01 1.741383e+02
# v157 1.853224e+02 3.167003e+02
# v172 1.393601e+01 9.301732e+03
# v173 7.907170e+01 5.486191e+03
# v181 2.542890e+01 3.164652e+04
# v182 6.789470e-05 7.735850e+08
# v188 4.284006e-199 4.867958e+199
# v189 5.936664e-03 4.236704e+06
library(broom)
signlevels = tidy(fit)$p.value/2 # 1-sided p values for peak to be sign higher than 1
adjsignlevels = p.adjust(signlevels, method="fdr") # FDR corrected p values
a_nnlsbbmle = exp(coef(fit)) # exp to backtransform
max(a_nnls[nonzero]-a_nnlsbbmle) # -9.981704e-11, coefficients as expected almost the same
plot(x, y, type="l", main="Ground truth (red), nnls bbmle logcoeff estimate (blue & green, green=FDR p value<0.05)", ylab="Signal (black) & peaks (red & blue)", xlab="Time", ylim=c(-max(y),max(y)))
lines(x,-y)
lines(a, type="h", col="red", lwd=2)
lines(x[nonzero], -a_nnlsbbmle, type="h", col="blue", lwd=2)
lines(x[nonzero][(adjsignlevels<0.05)&(a_nnlsbbmle>1)], -a_nnlsbbmle[(adjsignlevels<0.05)&(a_nnlsbbmle>1)],
type="h", col="green", lwd=2)
sum((signlevels<0.05)&(a_nnlsbbmle>1)) # 14 peaks significantly higher than 1 before FDR correction
sum((adjsignlevels<0.05)&(a_nnlsbbmle>1)) # 11 peaks significant after FDR correction
I haven't tried to compare the performance of this method relative to either nonparametric or parametric bootstrapping, but it's surely much faster.
I was also inclined to think that I should be able to calculate Wald confidence intervals for the nonnegative nnls coefficients based on the information matrix, calculated at a log transformed scale to enforce the nonnegativity constraints and evaluated at the nnls estimates.
I think this goes like this:
XM=as.matrix(bM)[,nonzero,drop=FALSE] # design matrix
posbetas = a_nnls[nonzero] # nonzero nnls coefficients
dispersion=sum(residuals^2)/(n-p) # estimated dispersion (variance in case of gaussian noise) (1 if noise were poisson or binomial)
information_matrix = t(XM) %*% XM # observed Fisher information matrix for nonzero coefs, ie negative of the 2nd derivative (Hessian) of the log likelihood at param estimates
scaled_information_matrix = (t(XM) %*% XM)*(1/dispersion) # information matrix scaled by 1/dispersion
# let's now calculate this scaled information matrix on a log transformed Y scale (cf. stat.psu.edu/~sesa/stat504/Lecture/lec2part2.pdf, slide 20 eqn 8 & Table 1) to take into account the nonnegativity constraints on the parameters
scaled_information_matrix_logscale = scaled_information_matrix/((1/posbetas)^2) # scaled information_matrix on transformed log scale=scaled information matrix/(PHI'(betas)^2) if PHI(beta)=log(beta)
vcov_logscale = solve(scaled_information_matrix_logscale) # scaled variance-covariance matrix of coefs on log scale ie of log(posbetas) # PS maybe figure out how to do this in better way using chol2inv & QR decomposition - in R unscaled covariance matrix is calculated as chol2inv(qr(XW_glm)$qr)
SEs_logscale = sqrt(diag(vcov_logscale)) # SEs of coefs on log scale ie of log(posbetas)
posbetas_LOWER95CL = exp(log(posbetas) - 1.96*SEs_logscale)
posbetas_UPPER95CL = exp(log(posbetas) + 1.96*SEs_logscale)
data.frame("2.5 %"=posbetas_LOWER95CL,"97.5 %"=posbetas_UPPER95CL,check.names=F)
# 2.5 % 97.5 %
# 1 1.095874e+00 8.563185e+03
# 2 2.325947e+01 1.743600e+03
# 3 1.959691e+00 4.243037e+02
# 4 8.397159e-20 5.675087e+21
# 5 2.861885e-09 8.210098e+11
# 6 4.036017e-01 1.953882e+05
# 7 9.325838e-13 3.711894e+15
# 8 6.306894e-16 3.028796e+18
# 9 2.652467e+01 1.090340e+04
# 10 1.870702e-189 6.334074e+189
# 11 8.932335e-46 2.349347e+47
# 12 2.824872e+01 2.338145e+03
# 13 4.568282e-06 3.342714e+10
# 14 4.210592e-71 1.235182e+74
# 15 7.380152e-17 2.160863e+20
# 16 1.268778e+02 1.608639e+03
# 17 4.626207e-03 8.470228e+05
# 18 6.806543e-01 1.021640e+03
# 19 3.507709e-01 2.255786e+04
# 20 2.198287e+00 6.656441e+03
# 21 1.622270e+01 1.741421e+02
# 22 1.853214e+02 3.167018e+02
# 23 1.393520e+01 9.302273e+03
# 24 7.906871e+01 5.486398e+03
# 25 2.542730e+01 3.164851e+04
# 26 6.787667e-05 7.737904e+08
# 27 4.249153e-199 4.907886e+199
# 28 5.935583e-03 4.237476e+06
z_logscale = log(posbetas)/SEs_logscale # z values for log(coefs) being greater than 0, ie coefs being > 1 (since log(1) = 0)
pvals = pnorm(z_logscale, lower.tail=FALSE) # one-sided p values for log(coefs) being greater than 0, ie coefs being > 1 (since log(1) = 0)
pvals.adj = p.adjust(pvals, method="fdr") # FDR corrected p values
plot(x, y, type="l", main="Ground truth (red), nnls estimates (blue & green, green=FDR Wald p value<0.05)", ylab="Signal (black) & peaks (red & blue)", xlab="Time", ylim=c(-max(y),max(y)))
lines(x,-y)
lines(a, type="h", col="red", lwd=2)
lines(-a_nnls, type="h", col="blue", lwd=2)
lines(x[nonzero][pvals.adj<0.05], -a_nnls[nonzero][pvals.adj<0.05],
type="h", col="green", lwd=2)
sum((pvals<0.05)&(posbetas>1)) # 14 peaks significantly higher than 1 before FDR correction
sum((pvals.adj<0.05)&(posbetas>1)) # 11 peaks significantly higher than 1 after FDR correction
The results of these calculations and the ones returned by mle2 are nearly identical (but this is much faster), so I think this is right, and it would correspond to what we were implicitly doing with mle2.
Note that simply refitting the covariates that received positive nnls coefficients using a regular linear model does not work, since such a fit would not take the nonnegativity constraints into account and so would produce nonsensical confidence intervals that could go negative.
This paper "Exact post model selection inference for marginal screening" by Jason Lee & Jonathan Taylor also presents a method to do post-model selection inference on nonnegative nnls (or LASSO) coefficients and uses truncated Gaussian distributions for that. I haven't seen any openly available implementation of this method for nnls fits though - for LASSO fits there is the selectiveInference package that does something like that. If anyone would happen to have an implementation, please let me know!