I use the glm.nb() function from the R MASS package to estimate the parameters of a negative binomial regression model. How can I calculate the predicted probability (probability mass function) for new data, and which R function can I use?
My dataset is as follows: y follows a negative binomial distribution and x is a covariate. I use glm.nb(y ~ x, data=data) to estimate the model parameters. Given a new x and y, how can I calculate the predicted probability?
Is there a way to calculate it using Java?
y x
91 1.000000
79 1.000000
86 1.000000
32 1.000000
41 1.000000
29 0.890609
44 1.000000
42 1.000000
31 0.734058
35 1.000000
Let's say you set up your data like this:
library(MASS)  # for glm.nb()
set.seed(1)
x = seq(-2, 8, .01)
y = rnbinom(length(x), mu=exp(x), size=10)
fit = glm.nb(y ~ x)
and you have a new point: you want to find the probability of y=100 given x=5.
You can get the predicted value of y from x using predict (with type="response" to tell it you want it after the inverse of the link function has been applied):
predicted.y = predict(fit, newdata=data.frame(x=5), type="response")
Then you could find out the probability with:
dnbinom(100, mu=predicted.y, size=fit$theta)
(This is using fit$theta, the maximum likelihood estimate of the "size" parameter of the negative binomial).
So in one function:
prob = function(newx, newy, fit) {
dnbinom(newy, mu=predict(fit, newdata=data.frame(x=newx), type="response"), size=fit$theta)
}
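For instance, for the single new point above (purely an illustrative call to the function just defined):
prob(newx = 5, newy = 100, fit = fit)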
Related
I am working on a project where we use R to compute a multiple linear regression to come up with some estimates. I found out how to compute prediction intervals, as in this example:
x <- rnorm(100, 10)
y <- x + rnorm(100, 5)
d <- data.frame(x = x, y = y)
mod <- lm(y ~ x, data = d)
d2 <- data.frame(x = c(0.3, 0.6, 0.2))
predict(mod, newdata = d2, interval = 'prediction')
Now I receive a prediction together with the lower and upper bounds of the prediction interval:
fit lwr upr
1 6.149834 3.630532 8.669137
2 6.425235 3.937989 8.912481
3 6.058034 3.527913 8.588155
However, I am wondering if there is a way to compute the likelihood of the new observation being in a given prediction interval (i.e. prediction +/- 1). In other words, I want to turn the computation “around”. Instead of asking “What are the upper and lower bounds of the 95% prediction interval”, I am asking “What is the likelihood of the new estimate being between a given upper and lower bound around the estimate?”.
To continue the example from above:
fit lwr upr likelihood
1 6.149834 5.149834 7.149834 ???
2 6.425235 5.425235 7.425235 ???
3 6.058034 5.058034 7.058034 ???
Does anyone have an idea how to compute this? Is there a predefined formula in R?
Thank you very much for your help!
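(A hedged sketch of one way this could be computed, not taken from the original post: under the usual lm assumptions a new observation follows a t predictive distribution centred at the fitted value, with standard deviation sqrt(se.fit^2 + sigma^2) and the residual degrees of freedom, so the probability of landing within fit ± 1 can be obtained with pt().)
p <- predict(mod, newdata = d2, se.fit = TRUE)
s <- sqrt(p$se.fit^2 + p$residual.scale^2) # sd of the predictive distribution for each new x
likelihood <- pt(1 / s, p$df) - pt(-1 / s, p$df) # P(fit - 1 < new y < fit + 1)
cbind(fit = p$fit, lwr = p$fit - 1, upr = p$fit + 1, likelihood)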
Sometimes I try to simulate data by using the rnorm function, which I have done below:
mom.iq <- rnorm(n = 1000, mean = 120, sd = 15)
kid.score <- rnorm(n = 1000, mean = 45, sd = 20)
df <- data.frame(mom.iq, kid.score)
But when I plot something like this, it usually ends up with data that's highly uncorrelated:
library(ggpubr)
ggscatter(df, x = "mom.iq", y = "kid.score") +
  geom_smooth(method = "lm")
However, I would like to simulate something with a stronger correlation if possible. Is there an easy way to do this within R? I'm aware that I could just produce my own values manually, but that's not very practical for recreating large samples.
What you are doing is generating two independent variables, so it is normal for them to be uncorrelated. What you can do instead is this:
# In order to make the values reproducible
set.seed(12345)
# Generate independent variable
x <- rnorm(n=1000, mean=120, sd=15)
# Generate the dependent variable
y <- 3*x + 6 + rnorm(n=1000, mean = 0, sd = 5)
I used 3 and 6, but you can define them as you want (a and b) in order to get a linear dependence defined as y = a*x + b.
Adding rnorm(n=1000, mean = 0, sd = 5) introduces some variability and avoids a perfect correlation between x and y. If you want more strongly correlated data, reduce the standard deviation (sd); to get a lower correlation, increase it.
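As a quick sanity check (not part of the original answer), you can inspect the resulting correlation:
cor(x, y) # close to 1 here, because sd = 5 is small relative to the spread of 3*x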
You can create your second variable by taking the first variable into account, and add some error with rnorm to avoid making the relationship completely deterministic:
library(ggplot2)
dat <- data.frame(father_age = rnorm(1000, 35, 5)) |>
dplyr::mutate(child_score = -father_age * 0.5 + rnorm(1000, 0, 4))
dat |>
ggplot(aes(father_age, child_score)) +
geom_point() +
geom_smooth(method = "lm")
#> `geom_smooth()` using formula 'y ~ x'
Created on 2022-07-07 by the reprex package (v2.0.1)
It seems to me that you don't just want to simulate arbitrary x and y with a linear relationship (which the other two answers show). You give your variables meaningful names, mom.iq and kid.score, so it appears that you want them to have a specific mean and variance. In this case, you can use MASS::mvrnorm to simulate samples from a multivariate normal distribution, where you can specify the correlation. This lets you preserve the marginal means and marginal variances you specified.
## your current specification of marginal mean and marginal standard deviation
mean_mom.iq <- 120
mean_kid.score <- 45
sd_mom.iq <- 15
sd_kid.score <- 20
## introduce a correlation coefficient between the two variables
## the coefficient must be between -1 and 1
corcoef <- 0.8
## the resulting covariance between the two variables
covariance <- corcoef * sd_mom.iq * sd_kid.score
## the variance-covariance matrix
Sigma <- matrix(c(sd_mom.iq^2, covariance, covariance, sd_kid.score^2), nrow = 2)
# [,1] [,2]
#[1,] 225 240
#[2,] 240 400
Now you can use MASS::mvrnorm.
xy <- MASS::mvrnorm(n = 500, mu = c(mean_mom.iq, mean_kid.score), Sigma = Sigma)
colnames(xy) <- c("mom.iq", "kid.score")
xydf <- data.frame(xy)
head(xydf)
# mom.iq kid.score
#1 111.6211 33.26241
#2 114.4765 42.49280
#3 115.8160 47.57242
#4 121.8656 53.16578
#5 152.1459 89.60617
#6 107.4360 39.00345
plot(xydf)
You can verify the marginal means and marginal variances of the simulated samples.
sapply(xydf, mean) ## mean, you specified 120 and 45
# mom.iq kid.score
# 119.9499 44.4193
sapply(xydf, sd) ## standard deviation, you specified 15 and 20
# mom.iq kid.score
# 15.35214 20.16483
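To also check that the requested correlation is roughly achieved, a quick follow-up (not in the original answer) would be:
cor(xydf$mom.iq, xydf$kid.score) # should be close to corcoef = 0.8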
I'm using the multinom function from the nnet package to do multinomial logistic regression in R. When I fit the model, I expected to get parameter estimates on the logit scale. However, transforming the coefficients with the inverse logit doesn't give probability estimates that match the predicted values; see the example below.
The help file states that "A log-linear model is fitted, with coefficients zero for the first class", but how do I transform parameter estimates to get predicted effects on the probability scale?
library("nnet")
set.seed(123)
# Simulate some simple fake data
groups <- t(rmultinom(500, 1, prob = c(0.05, 0.3, 0.65))) %*% c(1:3)
moddat <- data.frame(group = factor(groups))
# Fit the multinomial model
mod <- multinom(group ~ 1, moddat)
predict(mod, type = "probs")[1,] # predicted probabilities recover generating probs
# But transformed coefficients don't become probabilities
plogis(coef(mod)) # inverse logit
1/(1 + exp(-coef(mod))) # inverse logit
Using predict I can recover the generating probabilities:
1 2 3
0.06 0.30 0.64
But taking the inverse logit of the coefficients does not give probabilities:
(Intercept)
2 0.8333333
3 0.9142857
The inverse logit is the correct back transformation for a binomial model. In the case of a multinomial model, the appropriate back transformation is the softmax function, as described in this question.
The statement from the documentation that a "log-linear model is fitted with coefficient zero for the first class" essentially means that the linear predictor of the reference (first) class is fixed at 0 on the link scale.
To recover the probabilities manually from the example above:
library("nnet")
set.seed(123)
groups <- t(rmultinom(500, 1, prob = c(0.05, 0.3, 0.65))) %*% c(1:3)
moddat <- data.frame(group = factor(groups))
mod <- multinom(group ~ 1, moddat)
# weights: 6 (2 variable)
# initial value 549.306144
# final value 407.810115
# converged
predict(mod, type = "probs")[1,] # predicted probabilities recover generating probs
# 1 2 3
# 0.06 0.30 0.64
# Inverse logit is incorrect
1/(1 + exp(-coef(mod))) # inverse logit
# (Intercept)
# 2 0.8333333
# 3 0.9142857
# Use softmax transformation instead
softmax <- function(x){
expx <- exp(x)
return(expx/sum(expx))
}
# Add the reference category (0 on the link scale) and apply the softmax transformation
all_coefs <- rbind("1" = 0, coef(mod))
softmax(all_coefs)
# (Intercept)
# 1 0.06
# 2 0.30
# 3 0.64
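If the model includes covariates, the same idea applies row by row: build the matrix of linear predictors with a zero column for the reference class, then apply softmax to each row. A minimal sketch, assuming a hypothetical fitted model mod2 and a data frame newdata whose columns match the non-intercept coefficients in order (none of these names are from the original answer):
eta <- cbind(1, as.matrix(newdata)) %*% t(coef(mod2)) # linear predictors for classes 2..K
eta <- cbind(0, eta) # reference class fixed at 0 on the link scale
probs <- t(apply(eta, 1, softmax)) # one row of class probabilities per observation
# this should agree with predict(mod2, newdata, type = "probs")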
I'm doing logistic regression on the Boston data with a column high.medv (yes/no) which indicates whether the median house price given by the column medv is more than 25 or not.
Below is my code for logistic regression.
high.medv <- ifelse(Boston$medv > 25, "Y", "N") # applying the desired condition to medv and storing the result in a new variable called "high.medv"
ourBoston <- data.frame (Boston, high.medv)
ourBoston$high.medv <- as.factor(ourBoston$high.medv)
attach(Boston)
# 70% of data <- Train
train2<- subset(ourBoston,sample==TRUE)
# 30% will be Test
test2<- subset(ourBoston, sample==FALSE)
glm.fit <- glm (high.medv ~ lstat,data = train2, family = binomial)
summary(glm.fit)
The output is as follows:
Deviance Residuals:
[1] 0
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -22.57 48196.14 0 1
lstat NA NA NA NA
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 0.0000e+00 on 0 degrees of freedom
Residual deviance: 3.1675e-10 on 0 degrees of freedom
AIC: 2
Number of Fisher Scoring iterations: 21
Also, I need the following:
Now I'm required to use the misclassification rate as the measure of error for the two cases:
using lstat as the predictor, and
using all predictors except high.medv and medv.
but I am stuck at the regression itself.
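(An aside, and only a guess since it is not addressed below: the split variable sample used in subset() is never defined in the code shown, which would explain why the fitted model above ends up with essentially no data. One common way to build a 70/30 split would be a sketch like this, where sample_idx is a made-up name:)
set.seed(1) # hypothetical seed for reproducibility
sample_idx <- sample(c(TRUE, FALSE), nrow(ourBoston), replace = TRUE, prob = c(0.7, 0.3))
train2 <- ourBoston[sample_idx, ] # roughly 70% of the rows for training
test2 <- ourBoston[!sample_idx, ] # remaining ~30% for testing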
With every classification algorithm, the art lies in choosing the threshold at which you decide whether the result is positive or negative.
When you predict outcomes on the test data set, you estimate the probability of the response variable being 1 or 0. Therefore, you need to tell it where to cut, i.e. the threshold at which the prediction becomes 1 or 0.
A high threshold is more conservative about labeling a case as positive, which makes it less likely to produce false positives and more likely to produce false negatives. The opposite happens for low thresholds.
The usual procedure is to plot the rates that interest you, e.g., true positives and false positives against each other, and then choose the threshold that gives the best trade-off for you.
set.seed(666)
# simulation of logistic data
x1 = rnorm(1000) # some continuous variables
z = 1 + 2*x1 # linear combination with a bias
pr = 1/(1 + exp(-z)) # pass through an inv-logit function
y = rbinom(1000, 1, pr)
df = data.frame(y = y, x1 = x1)
df$train = 0
df$train[sample(1:(2*nrow(df)/3))] = 1
df$new_y = NA
# modelling the response variable
mod = glm(y ~ x1, data = df[df$train == 1,], family = "binomial")
df$new_y[df$train == 0] = predict(mod, newdata = df[df$train == 0,], type = 'response') # predicted probabilities
dat = df[df$train==0,] # test data
To use misclassification error to evaluate your model, you first need to set a threshold. For that, you can use the roc function from the pROC package, which calculates the rates and provides the corresponding thresholds:
library(pROC)
rates = roc(dat$y, dat$new_y)
plot(rates) # visualize the trade-off
rates$specificity # shows the ratio of true negatives over all negatives
rates$thresholds # shows you the corresponding thresholds
dat$jj = as.numeric(dat$new_y>0.7) # using 0.7 as a threshold to indicate that we predict y = 1
table(dat$y, dat$jj) # provides the misclassifications given the 0.7 threshold
0 1
0 86 20
1 64 164
The accuracy of your model can be computed as the ratio of the number of observations you classified correctly to the size of your sample.
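To get the misclassification rate explicitly from that table (a small follow-up sketch, not part of the original answer):
conf_mat <- table(dat$y, dat$jj) # confusion matrix at the 0.7 threshold
accuracy <- sum(diag(conf_mat)) / sum(conf_mat) # proportion classified correctly
misclassification_rate <- 1 - accuracy # roughly 0.25 for the table shown above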
I am working with some log-normal data, and naturally I want to demonstrate that a log-normal distribution gives a better fit than other candidate distributions. Essentially, I want to replicate the following graph with my data:
where the fitted density curves are juxtaposed over log(time).
The text where the linked image is from describes the process as fitting each model and obtaining the following parameters:
For that purpose, I fitted four naive survival models with the above-mentioned distributions:
survreg(Surv(time,event)~1,dist="family")
and extracted the shape parameter (α) and the coefficient (β).
I have several questions regarding the process:
1) Is this the right way of going about it? I have looked into several R packages but couldn't locate one that plots density curves as a built-in function, so I feel like I must be overlooking something obvious.
2) Are the values corresponding to the log-normal distribution (μ and σ²) just the mean and the variance of the intercept?
3) How can I create a similar table in R? (Maybe this is more of a Stack Overflow question.) I know I can just cbind them manually, but I am more interested in calling them from the fitted models. survreg objects store the coefficient estimates, but calling survreg.obj$coefficients returns a named numeric vector (instead of just a number).
4) Most importantly, how can I plot a similar graph? I thought it would be fairly simple if I just extracted the parameters and plotted them over the histogram, but so far no luck. The author of the text says he estimated the density curves from the parameters, but I just get a point estimate - what am I missing? Should I calculate the density curves manually based on the distribution before plotting?
I am not sure how to provide an MWE in this case, but honestly I just need a general solution for adding multiple density curves to survival data. On the other hand, if you think it will help, feel free to recommend an MWE setup and I will try to produce one.
Thanks for your input!
Edit: Based on eclark's post, I have made some progress. My parameters are:
Dist = data.frame(
Exponential = rweibull(n = 10000, shape = 1, scale = 6.636684),
Weibull = rweibull(n = 10000, shape = 6.068786, scale = 2.002165),
Gamma = rgamma(n = 10000, shape = 768.1476, scale = 1433.986),
LogNormal = rlnorm(n = 10000, meanlog = 4.986, sdlog = .877)
)
However, given the massive difference in scales, this is what I get:
Going back to question number 3, is this how I should get the parameters?
Currently this is how I do it (sorry for the mess):
summary(fit.exp)
Call:
survreg(formula = Surv(duration, confterm) ~ 1, data = data.na,
dist = "exponential")
Value Std. Error z p
(Intercept) 6.64 0.052 128 0
Scale fixed at 1
Exponential distribution
Loglik(model)= -2825.6 Loglik(intercept only)= -2825.6
Number of Newton-Raphson Iterations: 6
n= 397
summary(fit.wei)
Call:
survreg(formula = Surv(duration, confterm) ~ 1, data = data.na,
dist = "weibull")
Value Std. Error z p
(Intercept) 6.069 0.1075 56.5 0.00e+00
Log(scale) 0.694 0.0411 16.9 6.99e-64
Scale= 2
Weibull distribution
Loglik(model)= -2622.2 Loglik(intercept only)= -2622.2
Number of Newton-Raphson Iterations: 6
n= 397
summary(fit.gau)
Call:
survreg(formula = Surv(duration, confterm) ~ 1, data = data.na,
dist = "gaussian")
Value Std. Error z p
(Intercept) 768.15 72.6174 10.6 3.77e-26
Log(scale) 7.27 0.0372 195.4 0.00e+00
Scale= 1434
Gaussian distribution
Loglik(model)= -3243.7 Loglik(intercept only)= -3243.7
Number of Newton-Raphson Iterations: 4
n= 397
summary(fit.log)
Call:
survreg(formula = Surv(duration, confterm) ~ 1, data = data.na,
dist = "lognormal")
Value Std. Error z p
(Intercept) 4.986 0.1216 41.0 0.00e+00
Log(scale) 0.877 0.0373 23.5 1.71e-122
Scale= 2.4
Log Normal distribution
Loglik(model)= -2624 Loglik(intercept only)= -2624
Number of Newton-Raphson Iterations: 5
n= 397
I feel like I am particularly messing up the lognormal, given that it is not the standard shape-and-coefficient tandem but the mean and variance.
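(A side note on question 3, as a hedged sketch rather than anything from the original thread: the numbers reported by summary() can be pulled out of a survreg object directly, e.g. for the lognormal fit above; unname() just strips the name from the coefficient vector.)
intercept <- unname(coef(fit.log)) # 4.986, the (Intercept) value
scale <- fit.log$scale # 2.4, the Scale reported by summary()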
Try this; the idea is to generate random variables using the random distribution functions and then plot the density curves from the output data. Here is an example along the lines of what you need:
require(ggplot2)
require(dplyr)
require(tidyr)
SampleData <- data.frame(Duration=rlnorm(n = 184,meanlog = 2.859,sdlog = .246)) #Assume this is data we have sampled from a lognormal distribution
#Then we estimate the parameters of different types of distributions for that sample data and come up with these parameters
#We then generate a data frame with those distributions and parameters
Dist = data.frame(
Weibull = rweibull(10000,shape = 1.995,scale = 22.386),
Gamma = rgamma(n = 10000,shape = 4.203,scale = 4.699),
LogNormal = rlnorm(n = 10000,meanlog = 2.859,sdlog = .246)
)
#We use gather to prepare the distribution data in a manner better suited for group plotting in ggplot2
Dist <- Dist %>% gather(Distribution,Duration)
#Plot the sample data as a histogram
G1 <- ggplot(SampleData, aes(x = Duration)) + geom_histogram(aes(y = ..density..), binwidth = 5, colour = "black", fill = "white")
#Add the density distributions of the different distributions with the estimated parameters
G2 <- G1 + geom_density(aes(x=Duration,color=Distribution),data=Dist)
plot(G2)