Julia: quantiles and confidence intervals

Julia: quantiles and confidence intervals - julia

Does Julia have a function to calculate the density points where p% of the distribution is included?
Something like the scipy.stats norm.ppf function mentioned in this answer
Example: 2-sided 95% confidence interval:
> norm.ppf(1-(1-0.95)/2)
1.96
> norm.ppf(1-(1+0.95)/2)
-1.96

The quantile function from Distributions package is probably (95% CI) what you are looking for. For the Normal distributions you have:
julia> using Distributions
julia> quantile(Normal(0.0, 1.0),1-(1+0.95)/2)
-1.9599639845400576
julia> quantile(Normal(0.0, 1.0),1-(1-0.95)/2)
1.9599639845400576
The same function quantile can be used for other distributions.

Just to add another related enhancement to the answer, especially for users of Bayesian posteriors, we can define medianinterval as follows:
medianinterval(d,p = 0.95) = quantile(d,1-(1+p)/2),quantile(d,(1+p)/2)
and have:
julia> medianinterval(Normal())
(-1.9599639845400576, 1.9599639845400576)
But sometimes a more efficient (i.e. shorter) interval will be around the mode of the distribution. To address this we can define:
function modeinterval(d,p=0.95)
mcdf = cdf(d,mode(d))
endpoints = mcdf < p/2 ? (0,p) : mcdf > 1-p/2 ? (1-p,1) : (mcdf-p/2,mcdf+p/2)
return map(x->quantile(d,x), endpoints)
end
For the Normal distribution it doesn't matter since the mode is also the median, but for other distributions such as the Beta, we can have:
julia> modeinterval(Beta(2,8),0.2)
(0.09639068616673087, 0.15355172436770012)
julia> medianinterval(Beta(2,8),0.2)
(0.1498495815725847, 0.21227857915644155)
julia> 0.15355172436770012 - 0.09639068616673087
0.05716103820096925
julia> 0.21227857915644155 - 0.1498495815725847
0.06242899758385684
The mode interval covers the same fraction of the distribution with a shorter length. See Credible interval for related discussion.

Related

How does the pnorm aspect of work with z scores & x-values?

My professor assigned us some homework questions regarding normal distributions. We are using R studio to calculate our values instead of the z-tables.
One question asks about something about meteors where the mean (μ) = 4.35, standard deviation (σ) = 0.59 and we are looking for the probability of x>5.
I already figured out the answer with 1-pnorm((5-4.35)/0.59) ~ 0.135.
However, I am currently having some difficulty trying to understand what pnorm calculates.
Originally, I just assumed that z scores were the only arguments needed. So I proceeded to use pnorm(z-score) for most of the normal curvature problems.
The help page for pnorm accessed through ?pnorm() indicates that the usage is:
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE).
My professor also says that I am ignoring the mean and sd by just using pnorm(z-score). I feel like it is just easier to type in one value instead of the whole set of arguments. So I experimented and found that
1-pnorm((5-4.35)/0.59) = 1-pnorm(5,4.35,0.59)
So it looks like pnorm(z-score) = pnorm (x,μ,σ).
Is there a reason that using the z-score allows to skip the mean and
standard deviation in the pnorm function?
I have also noticed that trying to add μ,σ arguments with the z-score gives the wrong answer (ex: pnorm(z-score,μ,σ).
> 1-pnorm((5-4.35)/0.59)
[1] 0.1352972
> pnorm(5,4.35,0.59)
[1] 0.8647028
> 1-pnorm(5,4.35,0.59)
[1] 0.1352972
> 1-pnorm((5-4.35)/0.59,4.35,0.59)
[1] 1

That is because a z-score is standard normally distributed, meaning it has μ = 0 and σ = 1, which, as you found out, are the default parameters for pnorm().
The z-score is just the transformation of any normally distributed value to a standard normally distributed one.
So when you output the probability of the z-score for x = 5 you indeed get the same value than asking for the probability of x > 5 in a normal distribution with μ = 4.35 and σ = 0.59.
But when you add μ = 4.35 and σ = 0.59 to your z-score inside pnorm() you get it all wrong, because you're looking for a standard normally distributed value in a different distribution.
pnorm() (to answer your first question) calculates the cumulative density function, which shows you P(X < x) (the probability that a random variable takes a value equal or less than x). That's why you do 1 - pnorm(..) to find out P(X > x).

confidence interval around predicted value from complex inverse function

I'm trying to get a 95% confidence interval around some predicted values, but am not capable of achieving this.
Basically, I estimated a growth curve like this:
set.seed(123)
dat=data.frame(size=rnorm(50,10,3),age=rnorm(50,5,2))
S <- function(t,ts,C,K) ((C*K)/(2*pi))*sin(2*pi*(t-ts))
sommers <- function(t,Linf,K,t0,ts,C)
Linf*(1-exp(-K*(t-t0)-S(t,ts,C,K)+S(t0,ts,C,K)))
model <- nls(size~sommers(age,Linf,K,t0,ts,C),data=dat,
start=list(Linf=10,K=4.7,t0=2.2,C=0.9,ts=0.1))
I have independent size measurements, for which I would like to predict the age. Therefore, the inverse of the function, which is not very straightforward, I calculated like this:
model.out=coef(model)
S.out <- function(t)
((model.out[[4]]*model.out[[2]])/(2*pi))*sin(2*pi*(t-model.out[[5]]))
sommers.out <- function(t)
model.out[[1]]*(1-exp(-model.out[[2]]*(t-model.out[[3]])-S.out(t)+S.out(model.out[[3]])))
inverse = function (f, lower = -100, upper = 100) {
function (y) uniroot((function (x) f(x) - y), lower = lower, upper = upper)[1]
}
sommers.inverse = inverse(sommers.out, 0, 25)
x= sommers.inverse(10) #this works with my complete dataset, but not with this fake one
Although this works fine, I need to know the confidence interval (95%) around this estimate (x). For linear models there is for example "predict(... confidence=)". I could also bootstrap the function somehow to get the quantiles associated with the parameters (didn't find how), to then use the extremes of those to calculate the maximum and minimum values predictable. But that doesn't really look like the good way of doing this....
Any help would be greatly appreciated.
EDIT after answer:
So this worked (explained in the book of Ben Bolker, see answer):
vmat = mvrnorm(1000, mu = coef(mfit), Sigma = vcov(mfit))
dist = numeric(1000)
for (i in 1:1000) {dist[i] = sommers_inverse(9.938,vmat[i,])}
quantile(dist, c(0.025, 0.975))
On the rather bad fake data I gave, this works of course rather horrible. But on the real data (which I have a problem recreating), this is ok!

Unless I'm mistaken, you're going to have to use either regular (parametric) bootstrapping or a method called either "population predictive intervals" (e.g., see section 5 of chapter 7 of Bolker 2008), which assumes that the sampling distributions of your parameters are multivariate Normal. However, I think you may have bigger problems, unless I've somehow messed up your model in adapting it ...
Generate data (note that random data may actually bad for testing your model - see below ...)
set.seed(123)
dat <- data.frame(size=rnorm(50,10,3),age=rnorm(50,5,2))
S <- function(t,ts,C,K) ((C*K)/(2*pi))*sin(2*pi*(t-ts))
sommers <- function(t,Linf,K,t0,ts,C)
Linf*(1-exp(-K*(t-t0)-S(t,ts,C,K)+S(t0,ts,C,K)))
Plot the data and the initial curve estimate:
plot(size~age,data=dat,ylim=c(0,16))
agevec <- seq(0,10,length=1001)
lines(agevec,sommers(agevec,Linf=10,K=4.7,t0=2.2,ts=0.1,C=0.9))
I had trouble with nls so I used minpack.lm::nls.lm, which is slightly more robust. (There are other options here, e.g. calculating the derivatives and providing the gradient function, or using AD Model Builder or Template Model Builder, or using the nls2 package.)
For nls.lm we need a function that returns the residuals:
sommers_fn <- function(par,dat) {
with(c(as.list(par),dat),size-sommers(age,Linf,K,t0,ts,C))
}
library(minpack.lm)
mfit <- nls.lm(fn=sommers_fn,
par=list(Linf=10,K=4.7,t0=2.2,C=0.9,ts=0.1),
dat=dat)
coef(mfit)
## Linf K t0 C ts
## 10.6540185 0.3466328 2.1675244 136.7164179 0.3627371
Here's our problem:
plot(size~age,data=dat,ylim=c(0,16))
lines(agevec,sommers(agevec,Linf=10,K=4.7,t0=2.2,ts=0.1,C=0.9))
with(as.list(coef(mfit)), {
lines(agevec,sommers(agevec,Linf,K,t0,ts,C),col=2)
abline(v=t0,lty=2)
abline(h=c(0,Linf),lty=2)
})
With this kind of fit, the results of the inverse function are going to be extremely unstable, as the inverse function is many-to-one, with the number of inverse values depending sensitively on the parameter values ...
sommers_pred <- function(x,pars) {
with(as.list(pars),sommers(x,Linf,K,t0,ts,C))
}
sommers_pred(6,coef(mfit)) ## s(6)=9.93
sommers_inverse <- function (y, pars, lower = -100, upper = 100) {
uniroot(function(x) sommers_pred(x,pars) -y, c(lower, upper))$root
}
sommers_inverse(9.938, coef(mfit)) ## 0.28
If I pick my interval very carefully I can get back the correct answer ...
sommers_inverse(9.938, coef(mfit), 5.5, 6.2)
Maybe your model will be better behaved with more realistic data. I hope so ...

Optimizing a simple linear curve (constant and coefficient estimated from a regression)

I am trying to calculate the turning point of a a few functions where I have estimated the coefficient and constant from a regression. I'm using the optimize function for this as my curves are all linear.
My function looks like:
F<- function(x){
beta* x + alpha
}
mind: beta and alpha are both vectors here. When running the optimisation with optimize, I'm getting the following error:
Error in optimize(F, interval = c(10, 20), lower = (10), :
invalid function value in 'optimize'
Is this because optimize is running the optimisation mathematically, so the beta and alphas need to be single parameters? If anyone knows a better way of doing this please do contribute!
Thank you in advance :)

If the functions are linear, then they will be at a minimum at the lower end of the range where beta>=0, and at the upper end of the range if beta<=0 - no need to use optimize().
It's not entirely clear what you're expecting the code to do - if you want it to return an x for each set of parameters, look at optim() instead and have F return the sum, or run optimize on each set of parameters in turn using an apply() function or loop.
One other thing is that your syntax is a bit wonky - I imagine that you mean:
> F<- function(x){
+ beta* x + alpha
+ }
> alpha <- 1
> beta <- 2
> optimize(F,c(10,20))
$minimum
[1] 10.00006
$objective
[1] 21.00011

Function to find Negative binomial distribution in Julia

I am using below code to find Negative binomial distribution in R
dnbinom(n11, size=p[1], prob=p[2]/(p[2]+E))
where dnbinom is the function used for finding Negative binomial distribution
n11 & E are vector of integer.
Now i want to run the same code in Julia, which function should i have to use inplace of dnbinom
The function must have arguments as (x,size,prob)
where x = vector of probabilities.
size = target for number of successful trials, or dispersion parameter (the shape parameter of the gamma mixing distribution). Must be strictly positive, need not be integer.
prob = probability of success in each trial. 0 < prob <= 1.
Below is My full Code(Updated as per answers given, but still not working)
using Distributions
data = query("Select count_a,EXP_COUNT from SM_STAT_ALGO_LOCALTRADE_SOC;")
f([0.2,0.06,1.4,1.8,0.1],data[:,1],data[:,2])
function f(x::Vector,n11,E)
return sum(-log(x[5] * pdf(NegativeBinomial(x[1], x[2]/(x[2]+E), n11)) + (1-x[5]) * pdf(NegativeBinomial(x[3], x[4]/(x[4]+E),n11))))
end

Assuming that you want the probabilities of a vector of outcomes, you can do
using Distributions
function dnbinom(x, size, prob)
dist = NegativeBinomial(size,prob)
map(y->pdf(dist,y), x)
end
#show dnbinom([3,5], 10, 0.1)

To get the equivilaent of dbinom in R
dnbinom(1, 1, 0.5)
# [1] 0.25
you can use
using Distributions
pdf(NegativeBinomial(), 1)
# 0.25000000000000006
in julia.

Confidence interval for Weibull distribution

I have wind data that I'm using to perform extreme value analysis (calculate return levels). I'm using R with packages 'evd', 'extRemes' and 'ismev'.
I'm fitting GEV, Gumbel and Weibull distributions, in order to estimate the return levels (RL) for some period T.
For the GEV and Gumbel cases, I can get RL's and Confidence Intervals using the extRemes::return.level() function.
Some code:
require(ismev)
require(MASS)
data(wind)
x = wind[, 2]
rperiod = 10
fit <- fitdistr(x, 'weibull')
s <- fit$estimate['shape']
b <- fit$estimate['scale']
rlevel <- qweibull(1 - 1/rperiod, shape = s, scale = b)
## CI around rlevel
## ci.rlevel = ??
But for the Weibull case, I need some help to generate the CI's.

I suspect the excruciatingly correct answer will be that the joint confidence region is an ellipse or some bent-sausage shape but you can extract variance estimates for the parameters from the fit object with the vcov function and then build standard errors for which +/- 1.96 SE's should be informative:
> sqrt(vcov(fit)["shape", "shape"])
[1] 0.691422
> sqrt(vcov(fit)["scale", "scale"])
[1] 1.371256
> s +c(-1,1)*sqrt(vcov(fit)["shape", "shape"])
[1] 6.162104 7.544948
> b +c(-1,1)*sqrt(vcov(fit)["scale", "scale"])
[1] 54.46597 57.20848
The usual way to calculate a CI for a single parameter is to assume Normal distribution and use theta+/- 1.96*SE(theta). In this case, you have two parameters so doing that with both of them would give you a "box", the 2D analog of an interval. The truly correct answer would be something more complex in the 'scale'-by-'shape' parameter space and might be most easily achieved with simulation methods, unless you have a better grasp of theory than I have.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Julia: quantiles and confidence intervals - julia

Does Julia have a function to calculate the density points where p% of the distribution is included? Something like the scipy.stats norm.ppf function mentioned in this answer Example: 2-sided 95% confidence interval: > norm.ppf(1-(1-0.95)/2) 1.96 > norm.ppf(1-(1+0.95)/2) -1.96

Related

How does the pnorm aspect of work with z scores & x-values?

confidence interval around predicted value from complex inverse function

Optimizing a simple linear curve (constant and coefficient estimated from a regression)

Function to find Negative binomial distribution in Julia

Confidence interval for Weibull distribution

Categories

Resources