What is the scale of parameter estimates produced by nnet::multinom? - r

I'm using the multinom function from the nnet package to do multinomial logistic regression in R. When I fit the model, I expected to get parameter estimates on the logit scale. However, transforming the coefficients with the inverse logit doesn't give probability estimates that match the predicted probabilities; see the example below.
The help file states that "A log-linear model is fitted, with coefficients zero for the first class", but how do I transform parameter estimates to get predicted effects on the probability scale?
library("nnet")
set.seed(123)
# Simulate some simple fake data
groups <- t(rmultinom(500, 1, prob = c(0.05, 0.3, 0.65))) %*% c(1:3)
moddat <- data.frame(group = factor(groups))
# Fit the multinomial model
mod <- multinom(group ~ 1, moddat)
predict(mod, type = "probs")[1,] # predicted probabilities recover generating probs
# But transformed coefficients don't become probabilities
plogis(coef(mod)) # inverse logit
1/(1 + exp(-coef(mod))) # inverse logit
Using predict I can recover the generating probabilities:
   1    2    3 
0.06 0.30 0.64
But taking the inverse logit of the coefficients does not give probabilities:
  (Intercept)
2   0.8333333
3   0.9142857

The inverse logit is the correct back transformation for a binomial model. In the case of a multinomial model, the appropriate back transformation is the softmax function, as described in this question.
The statement from the documentation that a "log-linear model is fitted with coefficient zero for the first class" essentially means that the reference class is fixed at 0 on the link scale.
To recover the probabilities manually from the example above:
library("nnet")
set.seed(123)
groups <- t(rmultinom(500, 1, prob = c(0.05, 0.3, 0.65))) %*% c(1:3)
moddat <- data.frame(group = factor(groups))
mod <- multinom(group ~ 1, moddat)
# weights: 6 (2 variable)
# initial value 549.306144
# final value 407.810115
# converged
predict(mod, type = "probs")[1,] # predicted probabilities recover generating probs
# 1 2 3
# 0.06 0.30 0.64
# Inverse logit is incorrect
1/(1 + exp(-coef(mod))) # inverse logit
# (Intercept)
# 2 0.8333333
# 3 0.9142857
# Use softmax transformation instead
softmax <- function(x){
  expx <- exp(x)
  return(expx / sum(expx))
}
# Add the reference category (0 on the link scale) and apply the softmax transformation
all_coefs <- rbind("1" = 0, coef(mod))
softmax(all_coefs)
# (Intercept)
# 1 0.06
# 2 0.30
# 3 0.64
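As a quick check (a sketch based on the code above), the manual softmax should reproduce the fitted probabilities returned by predict():
# The softmax of the coefficients (with the reference class fixed at 0)
# should match the fitted probabilities from predict()
all.equal(as.numeric(softmax(all_coefs)),
          as.numeric(predict(mod, type = "probs")[1, ]))
# Expected: TRUE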

Related

Confidence Interval in mixed effect models

library(lme4)
fm1 <- lmer(Reaction ~ Days + (Days|Subject), data = sleepstudy)
To generate a 95% CI, I can use the predictInterval() function from the package merTools.
library(merTools)
head(predictInterval(fm1, level = 0.95, seed = 123, n.sims = 100))
# fit upr lwr
# 1 255.4179 313.8781 184.1400
# 2 273.2944 333.2005 231.3584
# 3 291.8451 342.8701 240.8226
# 4 311.3562 359.2908 250.4980
# 5 330.3671 384.2520 270.7094
# 6 353.4378 409.9307 289.4760
In the documentation, it says about the predictInterval() function
This function provides a way to capture model uncertainty in predictions from multi-level models
fit with lme4. By drawing a sampling distribution for the random and the fixed effects and then
estimating the fitted value across that distribution, it is possible to generate a prediction interval for
fitted values that includes all variation in the model except for variation in the covariance parameters,
theta. This is a much faster alternative than bootstrapping for models fit to medium to large datasets.
My goal is to get all the fitted values instead of just the upper and lower CI, i.e. for each row I need the original n simulations from which the 95% CI is calculated. I checked the arguments in the documentation and tried this:
head(predictInterval(fm1, n.sims = 100, returnSims = TRUE, seed = 123, level = 0.95))
# fit upr lwr
# 1 255.4179 313.8781 184.1400
# 2 273.2944 333.2005 231.3584
# 3 291.8451 342.8701 240.8226
# 4 311.3562 359.2908 250.4980
# 5 330.3671 384.2520 270.7094
# 6 353.4378 409.9307 289.4760
Instead of getting the 100 simulations, it still gives me the same output. What am I doing wrong here?
A second question, though I believe this one is more of a StatsExchange question. The documentation says:
"By drawing a sampling distribution for the random and the fixed effects and then..."
How does it draw the sampling distribution? Could someone explain?
You can get simulated values if you specify newdata in the predictInterval() function.
predInt <- predictInterval(fm1, newdata = sleepstudy, n.sims = 100,
                           returnSims = TRUE, seed = 123, level = 0.95)
simValues <- attr(predInt, "sim.results")
Details on how the sampling distributions of the parameters are created are given in the Details section of the help page. You can get the estimates of the fit and the lower and upper boundaries as:
fit <- apply(simValues, 1, function(x){quantile(x, probs=0.500) } )
lwr <- apply(simValues, 1, function(x){quantile(x, probs=0.025) } )
upr <- apply(simValues, 1, function(x){quantile(x, probs=0.975) } )
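As a sanity check (a sketch, assuming sim.results has one row per observation and one column per simulation), the dimensions and the quantiles computed above should line up with what predictInterval() returned:
dim(simValues) # expected 180 x 100 for the call above (one row per row of sleepstudy)
# The quantiles should (at least approximately) reproduce the returned intervals
head(cbind(fit, lwr, upr))
head(predInt[, c("fit", "lwr", "upr")])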

incorrect logistic regression output

I'm doing logistic regression on the Boston data with a column high.medv (yes/no) which indicates whether the median house price given by the medv column is more than 25 or not.
Below is my code for logistic regression.
library(MASS) # Boston data set
high.medv <- ifelse(Boston$medv > 25, "Y", "N") # apply the condition to medv and store the result in a new variable called "high.medv"
ourBoston <- data.frame (Boston, high.medv)
ourBoston$high.medv <- as.factor(ourBoston$high.medv)
attach(Boston)
# 70% of the data -> train ("sample" is assumed to be a logical train/test indicator defined earlier)
train2 <- subset(ourBoston, sample == TRUE)
# 30% -> test
test2 <- subset(ourBoston, sample == FALSE)
glm.fit <- glm (high.medv ~ lstat,data = train2, family = binomial)
summary(glm.fit)
The output is as follows:
Deviance Residuals: 
[1]  0

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -22.57   48196.14       0        1
lstat             NA         NA      NA       NA

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 0.0000e+00  on 0  degrees of freedom
Residual deviance: 3.1675e-10  on 0  degrees of freedom
AIC: 2

Number of Fisher Scoring iterations: 21
I'm also required to use the misclassification rate as the measure of error for two cases:
using lstat as the only predictor, and
using all predictors except high.medv and medv.
But I am stuck at the regression itself.
With every classification algorithm, part of the art is choosing the threshold at which you declare a result positive or negative.
When you predict your outcomes on the test data set, you estimate the probability of the response variable being 1 or 0. Therefore, you need to decide where to cut, i.e. the threshold at which a predicted probability becomes a 1 or a 0.
A high threshold is more conservative about labeling a case as positive, which makes it less likely to produce false positives and more likely to produce false negatives. The opposite happens for low thresholds.
The usual procedure is to plot the rates that interest you, e.g. true positives against false positives, and then choose the trade-off that is best for you.
set.seed(666)
# simulation of logistic data
x1 = rnorm(1000) # some continuous variables
z = 1 + 2*x1 # linear combination with a bias
pr = 1/(1 + exp(-z)) # pass through an inv-logit function
y = rbinom(1000, 1, pr)
df = data.frame(y = y, x1 = x1)
df$train = 0
df$train[sample(nrow(df), 2 * nrow(df) / 3)] = 1 # randomly assign 2/3 of the rows to the training set
df$new_y = NA
# modelling the response variable
mod = glm(y ~ x1, data = df[df$train == 1,], family = "binomial")
df$new_y[df$train == 0] = predict(mod, newdata = df[df$train == 0,], type = 'response') # predicted probabilities
dat = df[df$train==0,] # test data
To use the misclassification error to evaluate your model, you first need to set a threshold. For that, you can use the roc function from the pROC package, which calculates the rates and provides the corresponding thresholds:
library(pROC)
rates = roc(dat$y, dat$new_y)
plot(rates) # visualize the trade-off
rates$specificities # the ratio of true negatives over all negatives at each threshold
rates$thresholds # the corresponding thresholds
dat$jj = as.numeric(dat$new_y > 0.7) # using 0.7 as the threshold for predicting y = 1
table(dat$y, dat$jj) # the misclassifications given the 0.7 threshold
      0   1
  0  86  20
  1  64 164
The accuracy of your model can be computed as the ratio of the number of observations you got right against the size of your sample.
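For completeness, a small sketch computing the accuracy and misclassification rate from the confusion table above:
conf_mat <- table(dat$y, dat$jj) # rows: observed, columns: predicted
accuracy <- sum(diag(conf_mat)) / sum(conf_mat) # proportion classified correctly
misclassification_rate <- 1 - accuracy # proportion of off-diagonal entries
c(accuracy = accuracy, misclassification = misclassification_rate)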

weights option in GAM

My dataset has many duplicated observations (but each observation should be counted). So I am considering the 'weights' option in GAM because it significantly reduces computation time.
The documentation of the gam function (in the mgcv package) says that they are equivalent (from ?gam, on the weights argument):
"Note that a weight of 2, for example, is equivalent to having made exactly the same observation twice."
But it does not seem right.
library(mgcv)
yy = c(5,2,8,9)
xx = 1:4
wgts = c(3,2,4,1)
yy2 = rep(yy, wgts)
xx2 = rep(xx, wgts)
mod1 = gam(yy2 ~ xx2)
mod2 = gam(yy ~ xx, weights = wgts)
mod3 = gam(yy ~ xx, weights = wgts / mean(wgts))
predict(mod1,data.frame(xx2=1:4))
predict(mod2,data.frame(xx=1:4))
predict(mod3,data.frame(xx=1:4))
The estimates are identical in all three models.
The standard errors are the same in models 2 and 3 but different in model 1.
GCV is different in all three models.
I understand GCVs can be different. But how can we say that the models are identical if standard errors are different? Is this an error, or is there any good explanation for this?
The issue you saw is not specific to GAM. You have used gam to fit a purely parametric model, in which case gam behaves almost the same as lm. To answer your questions, it is sufficient to focus on the linear regression case: whatever happens to a linear model also happens to GLMs and GAMs. Here is how we can reproduce the issue with lm:
yy <- c(5,2,8,9)
xx <- 1:4
wgts <- c(3,2,4,1)
yy2 <- rep(yy,wgts)
xx2 <- rep(xx,wgts)
fit1 <- lm(yy2 ~ xx2)
fit2 <- lm(yy ~ xx, weights = wgts)
fit3 <- lm(yy ~ xx, weights = wgts/mean(wgts))
summary1 <- summary(fit1)
summary2 <- summary(fit2)
summary3 <- summary(fit3)
pred1 <- predict(fit1, list(xx2 = xx), interval = "confidence", se.fit = TRUE)
pred2 <- predict(fit2, list(xx = xx), interval = "confidence", se.fit = TRUE)
pred3 <- predict(fit3, list(xx = xx), interval = "confidence", se.fit = TRUE)
All models have the same regression coefficients, but the other results may differ. You asked:
1. For the weighted regressions fit2 and fit3, why is almost everything the same except the residual standard error?
2. Why is weighted regression (fit2 or fit3) not equivalent to ordinary regression with the observations repeated?
Your first question is about the invariance of weighted least squares to rescaling of the weights. In brief:
If we rescale the weights by an arbitrary positive constant, only the residual standard error and the unscaled covariance change. Such a change does not imply a different, non-equivalent model; in fact, nothing related to prediction is affected. In weighted regression, don't focus on sigma^2 alone; it is only a marginal variance. What really matters is the variance of each observation, sigma^2 / w_i. If you multiply your weights by 2, you will find that sigma^2 doubles, but sigma^2 / w_i stays the same.
summary2$coef
summary3$coef
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 2.128713 3.128697 0.6803832 0.5664609
#xx 1.683168 1.246503 1.3503125 0.3094222
pred2
pred3
#$fit
# fit lwr upr
#1 3.811881 -5.0008685 12.62463
#2 5.495050 -0.1299942 11.12009
#3 7.178218 0.6095820 13.74685
#4 8.861386 -1.7302209 19.45299
#
#$se.fit
# 1 2 3 4
#2.048213 1.307343 1.526648 2.461646
#
#$df
#[1] 2
#
#$residual.scale ## for `pred2`
#[1] 3.961448
#
#$residual.scale ## for `pred3`
#[1] 2.50544
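As a quick numerical check of this scaling behaviour (a sketch using the stats::sigma and weights accessors on the fits above):
# fit3 uses weights wgts / mean(wgts), so its sigma^2 is smaller by a factor of mean(wgts)
sigma(fit2)^2 / sigma(fit3)^2 # should equal mean(wgts) = 2.5
# The implied variance of each observation, sigma^2 / w_i, is identical in both fits
all.equal(sigma(fit2)^2 / weights(fit2), sigma(fit3)^2 / weights(fit3))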
Your second question is about the meaning of weights. Weights are used to model a heteroscedastic response, i.e. to downweight observations with larger error variance, in least squares regression. Weights are proportional to the reciprocal variance: you give bigger weights to data with smaller expected errors. Weights can be non-integer, so they do not have a natural interpretation in terms of repeated data. Thus, what is written in the mgcv documentation is not rigorously correct.
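To illustrate the intended use (a hypothetical sketch, not part of the original question): when the error variance differs across observations, weights proportional to the reciprocal variance give the efficient fit.
# Hypothetical example: the error variance grows with x, so weight by 1 / variance
set.seed(1)
x <- 1:100
v <- x / 10 # known (relative) error variances
y <- 1 + 2 * x + rnorm(100, sd = sqrt(v))
fit_wls <- lm(y ~ x, weights = 1 / v) # weights proportional to reciprocal variance
coef(summary(fit_wls))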
The real difference between fit1 and fit2 is the residual degrees of freedom, n - p, where n is the number of observations and p is the number of non-NA coefficients. For both models we have p = 2 (intercept and slope), but fit1 has n = 10 while fit2 has n = 4. This has a dramatic effect on inference, as the standard errors for coefficients and predictions (and hence the confidence intervals) differ. The two models are far from equivalent.
summary1$coef
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 2.128713 1.5643486 1.360766 0.21068210
#xx2 1.683168 0.6232514 2.700625 0.02704784
summary2$coef
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 2.128713 3.128697 0.6803832 0.5664609
#xx 1.683168 1.246503 1.3503125 0.3094222
pred1
#$fit
# fit lwr upr
#1 3.811881 1.450287 6.173475
#2 5.495050 3.987680 7.002419
#3 7.178218 5.417990 8.938446
#4 8.861386 6.023103 11.699669
#
#$se.fit
# 1 2 3 4
#1.0241066 0.6536716 0.7633240 1.2308229
#
#$df # note, this is `10 - 2 = 8`
#[1] 8
#
#$residual.scale
#[1] 1.980724
pred2
#$fit
# fit lwr upr
#1 3.811881 -5.0008685 12.62463
#2 5.495050 -0.1299942 11.12009
#3 7.178218 0.6095820 13.74685
#4 8.861386 -1.7302209 19.45299
#
#$se.fit
# 1 2 3 4
#2.048213 1.307343 1.526648 2.461646
#
#$df # note, this is `4 - 2 = 2`
#[1] 2
#
#$residual.scale ## for `pred2`
#[1] 3.961448

Adding arbitrary curve with AUC 0.8 to ROC plot

I have a simple ROC plot that I am creating using pROC package:
plot.roc(response, predictor)
It is working fine, as expected, but I would like to add an "ideally" shaped reference curve with AUC 0.8 for comparison (the AUC of my ROC plot is 0.66).
Any thoughts?
Just to clarify, I am not trying to smooth my ROC plot, but to add a reference curve that would represent AUC 0.8 (similar to the reference diagonal line representing AUC 0.5).
The reference diagonal line has a meaning (a model that guesses randomly), so you would similarly have to define the model associated with your reference curve of AUC 0.8. Different models would be associated with different reference curves.
For instance, one might define a model in which the predicted probabilities are distributed uniformly between 0 and 1, and a point with predicted probability p is truly positive whenever U^k < p for a uniform random U, i.e. with probability p^(1/k), for some constant k. It turns out that for this model, k = 2 yields a curve with AUC 0.8.
library(pROC)
set.seed(144)
probs <- seq(0, 1, length.out=10000)
truth <- runif(10000)^2 < probs
plot.roc(truth, probs)
# Call:
# plot.roc.default(x = truth, predictor = probs)
#
# Data: probs in 3326 controls (truth FALSE) < 6674 cases (truth TRUE).
# Area under the curve: 0.7977
Some algebra shows that this particular family of models has AUC (2+3k)/(2+4k), meaning it can generate curves with AUC between 0.75 and 1 depending on the value of k.
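As a quick numerical check of that formula (a sketch; auc() is from pROC, and k = 3 is an arbitrary choice): the simulated AUC should be close to (2 + 3k)/(2 + 4k).
k <- 3
n <- 100000
probs2 <- seq(0, 1, length.out = n)
truth2 <- runif(n)^k < probs2 # same family as above, with exponent k
as.numeric(auc(truth2, probs2)) # simulated AUC
(2 + 3 * k) / (2 + 4 * k) # theoretical value, 11/14 ~ 0.786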
Another approach you could use is linked to logistic regression. If you have a logistic regression linear predictor value p, i.e. a predicted probability of 1/(1 + exp(-p)), then you could label the true outcome as TRUE if p plus some normally distributed noise exceeds 0, and as FALSE otherwise. If the noise has variance 0 your model will have AUC 1, and as the noise variance approaches infinity the AUC approaches 0.5.
If I assume the original predictions are drawn from the standard normal distribution, it looks like normally distributed noise with standard deviation 1.2 gives AUC 0.8 (I couldn't figure out a nice closed form for the AUC, though):
set.seed(144)
pred.fxn <- rnorm(10000)
truth <- (pred.fxn + rnorm(10000, 0, 1.2)) >= 0
plot.roc(truth, pred.fxn)
# Call:
# plot.roc.default(x = truth, predictor = pred.fxn)
#
# Data: pred.fxn in 5025 controls (truth FALSE) < 4975 cases (truth TRUE).
# Area under the curve: 0.7987
A quick and rough way is to add a quarter circle of radius 1 to your plot; the area under it is pi/4 = 0.7853982.
library(pROC)
library(car)
n <- 100L
x1 <- rnorm(n, 2.0, 0.5)
x2 <- rnorm(n, -1.0, 2)
y <- rbinom(n, 1L, plogis(-0.4 + 0.5 * x1 + 0.1 * x2))
mod <- glm(y ~ x1 + x2, "binomial")
probs <- predict(mod, type = "response")
plot(roc(y, probs))
ellipse(c(0, 0), matrix(c(1,0,0,1), 2, 2), radius = 1, center.pch = FALSE, col = "blue")

(R, quantreg): Hypothesis testing a large range of quantiles

I have a quantile regression model with 1 regressor and 1 regressand. I want to test the hypothesis that the regressor's coefficient is equal across every quantile. One approach I've thought of is to test over all tau in {0.01, 0.02, ..., 0.99}. However, I would then have to write:
anova(model1, model2, model3, ......., model99), where each model corresponds to a different tau. Question: How do I get anova() to accept a large number of models of class rq without manually typing them out?
My attempt at a solution has been to do this:
library(quantreg)
y = rnorm(100)
x = rnorm(100)
rqs_object <- rq(y~x,tau=1:99/100)
anova(rqs_object)
However, anova clearly doesn't take object type rqs, only type rq, unfortunately.
Cross-posted, until I decided that the problem has a large programming/specialist element.
I concentrate on question 1 and only on the programming part.
some data:
set.seed(65465)
y = rnorm(100)
x = rnorm(100)
Now I define a function that takes tau as input and does the fit:
rqfits <- function(tau) {
  require(quantreg)
  rq(y ~ x, tau = tau)
}
I can then apply this function to a vector of taus:
taus <- 1:5/10
fits <- lapply(taus,rqfits)
The result is a list of models.
We can now use do.call to pass our models to anova:
do.call(anova,fits)
Quantile Regression Analysis of Deviance Table
Model: y ~ x
Joint Test of Equality of Slopes: tau in { 0.1 0.2 0.3 0.4 0.5 }
  Df Resid Df F value Pr(>F)
1  4      496  1.0388 0.3866
Warning:
In summary.rq(x, se = se, covariance = TRUE) : 2 non-positive fis
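The same pattern scales to the full grid of quantiles the question asks about (a sketch; fitting 99 quantile regressions can be slow and may trigger further non-positive fis warnings, especially at extreme taus):
taus_all <- 1:99 / 100
fits_all <- lapply(taus_all, rqfits) # one rq fit per tau
# do.call(anova, fits_all) # joint test of equality of slopes across all taus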
