Calculating the mean of truncated log normal distribution - math

I am trying to calculate the mean of a truncated log normal distribution.
I have a random variable x which has a Log-Normal distribution with std a.
I would like to calculate the mean of x when x < y
Note - If x was normally distributed, it can be calculated using this library:
from scipy.stats import truncnorm
my_mean = 100
my_std = 20
myclip_a = 0
myclip_b = 95
a, b = (myclip_a - my_mean) / my_std, (myclip_b - my_mean) / my_std
new_mean = truncnorm.mean(a, b, my_mean, my_std)
I would like to convert this code with the assumption that the distribution is Log-Normal and not Normal.

There may well be more elegant ways of doing this, but when I had to solve this problem before I fell back on integrating the lognormal pdf multiplied by x between the truncation limits, then dividing by the probability mass between those limits.
Below is a Python example. Ignore the clumsy way I've specified the untruncated lognormal distribution mean and standard deviation; that's just a peculiarity of my work.
It should work between any truncations (x1 = lower limit, x2 = upper limit), including zero to infinity (using np.inf).
import math
from scipy.special import erf
import numpy as np
P10 = 50 # Untruncated P10 (ie 10% outcomes higher than this)
P90 = 10 # Untruncated P90 (ie 90% outcomes higher than this)
u = (np.log(P90)+np.log(P10))/2 # Untruncated Mean of the log transformed distribution
s = np.log(P10/P90)/2.562 # Standard Deviation
# Returns integral of the lognormal pdf multiplied by the lognormal outcomes (x)
# Between lower (x1) and upper (x2) truncations
# pdf and cdf equations from https://en.wikipedia.org/wiki/Log-normal_distribution
# Integral evaluated with;
# https://www.wolframalpha.com/input/?i2d=true&i=Integrate%5Bexp%5C%2840%29-Divide%5BPower%5B%5C%2840%29ln%5C%2840%29x%5C%2841%29-u%5C%2841%29%2C2%5D%2C%5C%2840%292*Power%5Bs%2C2%5D%5C%2841%29%5D%5C%2841%29%2Cx%5D
def ln_trunc_mean(u, s, x1, x2):
    if x2 != np.inf:
        upper = erf((s**2+u-np.log(x2))/(np.sqrt(2)*s))
        upper_cum_prob = 0.5*(1+erf((np.log(x2)-u)/(s*np.sqrt(2)))) # CDF
    else:
        upper = -1
        upper_cum_prob = 1
    if x1 != 0:
        lower = erf((s**2+u-np.log(x1))/(np.sqrt(2)*s))
        lower_cum_prob = 0.5*(1+erf((np.log(x1)-u)/(s*np.sqrt(2))))
    else:
        lower = 1
        lower_cum_prob = 0
    integrand = -0.5*np.exp(s**2/2+u)*(upper-lower) # Integral of PDF.x.dx
    return integrand / (upper_cum_prob - lower_cum_prob)
You could then evaluate - for example, the untruncated mean as well as a mean with upper & lower 1 percentile clipping as follows
# Untruncated Mean
print(ln_trunc_mean(u, s, 0, np.inf))
27.238164532490508
# Truncated mean between 5.2 and 96.4
print(ln_trunc_mean(u, s, 5.2, 96.4))
26.5089880192863
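As a cross-check on the erf-based function above, the same quantity can be written with the normal CDF and compared against direct numerical integration. This is only a sketch: the helper names lognormal_truncated_mean and lognormal_truncated_mean_numeric, and the use of scipy.integrate.quad, are my own assumptions and not part of the original answer. For the question as asked (mean of x given x < y), pass upper=y and leave lower at 0.

```python
import numpy as np
from scipy import integrate
from scipy.stats import lognorm, norm


def lognormal_truncated_mean(mu, sigma, lower=0.0, upper=np.inf):
    """E[X | lower < X < upper] when ln(X) ~ Normal(mu, sigma**2).

    Hypothetical helper name; mu and sigma are the mean and standard
    deviation of ln(X), i.e. u and s in the answer above.
    """
    # Standardised log-bounds; a lower bound of 0 maps to -inf.
    z_lo = -np.inf if lower <= 0 else (np.log(lower) - mu) / sigma
    z_hi = np.inf if np.isinf(upper) else (np.log(upper) - mu) / sigma
    # Integral of x * pdf(x) between the bounds.
    partial = np.exp(mu + sigma**2 / 2) * (norm.cdf(z_hi - sigma) - norm.cdf(z_lo - sigma))
    # Probability mass between the bounds.
    mass = norm.cdf(z_hi) - norm.cdf(z_lo)
    return partial / mass


def lognormal_truncated_mean_numeric(mu, sigma, lower, upper):
    """Same quantity by brute-force quadrature, as an independent check."""
    dist = lognorm(s=sigma, scale=np.exp(mu))
    partial, _ = integrate.quad(lambda x: x * dist.pdf(x), lower, upper)
    return partial / (dist.cdf(upper) - dist.cdf(lower))


# Same u and s as in the answer above.
u = (np.log(10) + np.log(50)) / 2
s = np.log(50 / 10) / 2.562
print(lognormal_truncated_mean(u, s))                     # untruncated mean
print(lognormal_truncated_mean(u, s, 5.2, 96.4))          # closed form
print(lognormal_truncated_mean_numeric(u, s, 5.2, 96.4))  # quadrature
print(lognormal_truncated_mean(u, s, upper=95))           # mean of x given x < 95
```

The closed form and the quadrature should agree to several decimal places; if they do not, the parameterisation (mean and standard deviation of ln(X), not of X) is the first thing to check.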

Related

Changing the distribution of a series of numbers with a power function

I'm trying to use a power function to change the distribution of a series of values between 0 and 1 such that the mean is 0.5.
ie. for each of the values in the series:
new_value = old_value ^ x
Where x is some number.
Is there a simple way to calculate the value of x?
You could run an optimizer from Python's scipy.
Here is an example:
import numpy as np
from scipy import optimize
values = np.random.uniform(0, 1, 5)
sol = optimize.root_scalar(lambda pwr: np.mean(values ** pwr) - 0.5,
bracket=[np.log(0.5) / np.log(values.max()), np.log(0.5) / np.log(values.min())])
print('given values:', values)
print('given mean:', values.mean())
print('power:', sol.root)
print('transformed values:', values ** sol.root)
print('mean of transformed values:', (values ** sol.root).mean())
Example output:
given values: [0.82082056 0.01531309 0.56587417 0.53283897 0.73051697]
given mean: 0.5330727532243068
power: 1.1562709936704882
transformed values: [0.79588022 0.00796968 0.5176988 0.48291519 0.69553611]
mean of transformed values: 0.5
A much simplified algorithm would be:
choose two limits: a = log(0.5)/log(max(values)) and b = log(0.5)/log(min(values))
calculating with a as power gives a mean lower (or equal) to 0.5
calculating with b as power gives a mean higher (or equal) to 0.5
choose a value m somewhere in the middle and calculate the mean with m as power; if that mean is lower than 0.5, m should replace a, otherwise m should replace b
repeat the previous step until either the mean is close enough to 0.5, or a and b get too close to each other
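The bisection steps described above can be sketched in a few lines of Python. The function name power_for_target_mean and its tolerance arguments are hypothetical, and the sketch assumes every value lies strictly between 0 and 1, so that the mean decreases as the power increases.

```python
import numpy as np


def power_for_target_mean(values, target=0.5, tol=1e-12, max_iter=200):
    """Bisection for p such that mean(values ** p) is close to target.

    Hypothetical helper; assumes 0 < values < 1 and 0 < target < 1.
    """
    values = np.asarray(values, dtype=float)
    a = np.log(target) / np.log(values.max())  # largest value maps to target: mean <= target
    b = np.log(target) / np.log(values.min())  # smallest value maps to target: mean >= target
    m = 0.5 * (a + b)
    for _ in range(max_iter):
        m = 0.5 * (a + b)
        mean_m = np.mean(values ** m)
        if abs(mean_m - target) < tol or abs(a - b) < tol:
            break
        if mean_m < target:
            a = m  # mean too low: m takes the place of a
        else:
            b = m  # mean too high: m takes the place of b
    return m


# The five values from the example output above.
values = np.array([0.82082056, 0.01531309, 0.56587417, 0.53283897, 0.73051697])
p = power_for_target_mean(values)
print("power:", p)
print("mean of transformed values:", np.mean(values ** p))
```

In practice optimize.root_scalar does the same job with a faster bracketing method; the loop is only meant to make the steps explicit.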

Drawing from truncated normal distribution delivers wrong standard deviation in R

I draw random numbers from a truncated normal distribution. The truncated normal distribution is supposed to have mean 100 and standard deviation 60 after truncation at 0 from the left.
I wrote an algorithm to compute the mean and sd of the normal distribution prior to the truncation (mean_old and sd_old).
The function vtruncnorm gives me the (wanted) variance of 60^2. However, when I draw random variables from the distribution, the standard deviation is around 96.
I don't understand why the sd of the random draws differs from the computed value of 60.
I tried increasing the number of draws; the sd still comes out around 96.
require(truncnorm)
mean_old = -5425.078
sd_old = 745.7254
val = rtruncnorm(10000, a=0, mean = mean_old, sd = sd_old)
sd(val)
sqrt(vtruncnorm( a=0, mean = mean_old, sd = sd_old))
OK, I did a quick test:
require(truncnorm)
val = rtruncnorm(1000000, a=7.2, mean = 0.0, sd = 1.0)
sd(val)
sqrt(vtruncnorm( a=7.2, mean = 0.0, sd = 1.0))
Canonical truncated Gaussian. At a=6 they are very close, for example 0.1554233 vs 0.1548865, depending on seed etc. At a = 7 they are systematically different, 0.1358143 vs 0.1428084 (the sampled value is smaller than the function call). I've checked with the Python implementation
import numpy as np
from scipy.stats import truncnorm
a, b = 7.0, 100.0
mean, var, skew, kurt = truncnorm.stats(a, b, moments='mvsk')
print(np.sqrt(var))
r = truncnorm.rvs(a, b, size=100000)
print(np.sqrt(np.var(r)))
and got back 0.1428083662823426 which is consistent with R vtruncnorm result. At your a=7.2 or so results are even worse.
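Moral of the story: at high values of a, the sample standard deviation from rtruncnorm and the value computed by vtruncnorm disagree systematically, and Python shows the same discrepancy. The test above does not say which side is off. The large-a expansion of the variance, 1/a^2 - 6/a^4 + ..., gives an sd of roughly 0.135 at a = 7, which is closer to the sampled value, so the computed moments deserve at least as much suspicion as the sampler.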
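One way to get an independent reference value in the far tail is to compute the standard deviation by quadrature on the shifted variable t = x - a, and to sample through the inverse survival function. The sketch below is my own and rests on assumptions: the helper names are hypothetical, and it only covers a standard normal truncated to [a, inf), not the general two-sided case.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm


def tail_sd_quadrature(a):
    """sd of a standard normal truncated to [a, inf), by quadrature.

    Works with t = x - a, whose density is proportional to
    exp(-a*t - t**2/2); this avoids dividing by the tiny tail probability.
    """
    weight = lambda t: np.exp(-a * t - 0.5 * t * t)
    m0, _ = integrate.quad(weight, 0, np.inf)
    m1, _ = integrate.quad(lambda t: t * weight(t), 0, np.inf)
    m2, _ = integrate.quad(lambda t: t * t * weight(t), 0, np.inf)
    return np.sqrt(m2 / m0 - (m1 / m0) ** 2)


def tail_sample(a, size, rng):
    """Inverse-survival-function sampling from the same truncated normal."""
    u = 1.0 - rng.uniform(size=size)  # uniform on (0, 1], avoids isf(0) = inf
    return norm.isf(u * norm.sf(a))


rng = np.random.default_rng(0)
for a in (6.0, 7.0, 7.2):
    draws = tail_sample(a, 200_000, rng)
    print(a, tail_sd_quadrature(a), draws.std(ddof=1))
```

If the quadrature value and the sample value agree with each other but not with a library's analytic moment, that points at the moment calculation rather than at the sampling.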

How to generate normally distributed random numbers in specific interval?

I want to generate 100 normally distributed random numbers in the interval [-50,50]. However, the numbers generated by the code below are not confined to [-50,50].
n <- rnorm(100, -50,50)
plot(n)
Your question is strangely asked, because it seems you don't fully understand the rnorm function.
rnorm(100, -50,50)
generates a sample of 100 points drawn from a normal distribution centered on -50, with a standard deviation of 50. So you need to specify what you mean by
"100 normally distributed random numbers in the interval [-50,50]". A normal distribution has no upper or lower limit: the probability density is never 0, it is just very low several standard deviations away from the mean. So:
Either you want a normal distribution centered on 0 with a standard deviation of 50, and the answer is rnorm(100, 0, 50), but you will get values above 50 and below -50.
Or you actually want a normal distribution with no values outside the [-50,50] range. In this case you still need to choose a standard deviation, and you need to discard the values drawn outside the range. You could do something like:
sd <- 50
n <- data.frame(draw = rnorm(1000, 0,sd))
final <- sample(n$draw[!with(n, draw > 50 | draw < -50)],100)
Here is an example of what it does for 2 different sd:
sd <- 10
n1 <- data.frame(draw = rnorm(1000, 0,sd))
final1 <- sample(n1$draw[!with(n1, draw > 50 | draw < -50)],100)
sd <- 50
n2 <- data.frame(draw = rnorm(1000, 0,sd))
final2 <- sample(n2$draw[!with(n2, draw > 50 | draw < -50)],100)
par(mfrow = c(1,2))
hist(final1,main = "sd = 10")
hist(final2,main = "sd = 50")
Or you just want to sample values in this range with a flat distribution. In this case, just use sample(-50:50, 100, replace = TRUE), which draws integers; runif(100, -50, 50) gives continuous values.
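For comparison, the "no value outside the range" option can be sketched in Python with scipy.stats.truncnorm, which samples the truncated distribution directly instead of discarding draws. The mean of 0 and standard deviation of 50 are the same free choices as in the R code above, and the seed is arbitrary.

```python
import numpy as np
from scipy.stats import truncnorm

lower, upper = -50.0, 50.0
mean, sd = 0.0, 50.0  # sd of the untruncated normal is still a free choice

# truncnorm takes its bounds in standard-deviation units relative to loc.
a, b = (lower - mean) / sd, (upper - mean) / sd
rng = np.random.default_rng(42)
draws = truncnorm.rvs(a, b, loc=mean, scale=sd, size=100, random_state=rng)

print(draws.min(), draws.max())
```

Note that sd here describes the normal distribution before truncation; the standard deviation of the truncated draws is smaller.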
You have to make a sacrifice. Either your random variable is not normally distributed because the tails are cut off, or you compromise on the boundaries. You can define your random variable to "practically" lie in a range, that is, you accept that a very small percentage lies outside. Maybe 1 % would be an acceptable choice for your purpose.
my_range <- setNames(c(-50, 50), c("lower", "upper"))
prob <- 0.01 # probability to lie outside of my_range
# you have to define this, 1 % in this case
my <- mean(my_range)
z_value <- qnorm(prob/2)
sigma <- (my - my_range["lower"]) / (-1 * z_value)
# proof
N <- 100000 # large number
sim_vec <- rnorm(N, my, sigma)
chk <- 1 - length(sim_vec[sim_vec >= my_range["lower"] &
sim_vec <= my_range["upper"]]) / length(sim_vec)
cat("simulated proportion outside range:", chk, "\n")
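The same calculation can be sketched in Python: pick sigma so that a chosen probability falls outside the range, then check it both exactly and by simulation. Variable names and the seed are my own choices.

```python
import numpy as np
from scipy.stats import norm

lower, upper = -50.0, 50.0
prob_outside = 0.01  # accepted probability of landing outside [lower, upper]

center = 0.5 * (lower + upper)
z = norm.ppf(1 - prob_outside / 2)  # two-sided quantile
sigma = (upper - center) / z

# Check: exact tail probability and a simulated proportion.
exact = 2 * norm.sf((upper - center) / sigma)
rng = np.random.default_rng(1)
sim = rng.normal(center, sigma, size=100_000)
outside = np.mean((sim < lower) | (sim > upper))
print(sigma, exact, outside)
```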

Sample from a custom likelihood function

I have the following likelihood function which I used in a rather complex model (in practice on a log scale):
library(plyr)
dcustom=function(x,sd,L,R){
  R. = (log(R) - log(x))/sd
  L. = (log(L) - log(x))/sd
  ll = pnorm(R.) - pnorm(L.)
  return(ll)
}
df=data.frame(Range=seq(100,500),sd=rep(0.1,401),L=200,U=400)
df=mutate(df, Likelihood = dcustom(Range, sd,L,U))
with(df,plot(Range,Likelihood,type='l'))
abline(v=200)
abline(v=400)
In this function, the sd is predetermined and L and R are "observations" (very much like the endpoints of a uniform distribution), so all 3 of them are given. The above function provides a large likelihood (1) if the model estimate x (derived parameter) is in between the L-R range, a smooth likelihood decrease (between 0 and 1) near the bounds (of which the sharpness is dependent on the sd), and 0 if it is too much outside.
This function works very well to obtain estimates of x, but now I would like to do the inverse: draw a random x from the above function. If I would do this many times, I would generate a histogram that follows the shape of the curve plotted above.
The ultimate goal is to do this in C++, but I think it would be easier for me if I could first figure out how to do this in R.
There's some useful information online that helps me start (http://matlabtricks.com/post-44/generate-random-numbers-with-a-given-distribution, https://stats.stackexchange.com/questions/88697/sample-from-a-custom-continuous-distribution-in-r) but I'm still not entirely sure how to do it and how to code it.
I presume (not sure at all!) the steps are:
transform likelihood function into probability distribution
calculate the cumulative distribution function
inverse transform sampling
Is this correct and if so, how do I code this? Thank you.
One idea might be to use the Metropolis-Hastings algorithm to obtain a sample from the distribution given all the other parameters and your likelihood.
# Metropolis-Hastings algorithm
set.seed(2018)
n_sample <- 100000
posterior_sample <- rep(NA, n_sample)
x <- 300 # starting value: I chose 300 based on your likelihood plot
for (i in 1:n_sample){
  lik <- dcustom(x = x, sd = 0.1, L = 200, R = 400)
  # propose a value for x (you can adjust the stepsize with the sd)
  x.proposed <- x + rnorm(1, 0, sd = 20)
  lik.proposed <- dcustom(x = x.proposed, sd = 0.1, L = 200, R = 400)
  r <- lik.proposed/lik # this is the acceptance ratio
  # accept new value with probability of ratio
  if (runif(1) < r) {
    x <- x.proposed
  }
  # record the current state every iteration, so that a rejected
  # proposal repeats the previous value in the chain
  posterior_sample[i] <- x
}
# plotting the density
approximate_distr <- na.omit(posterior_sample)
d <- density(approximate_distr)
plot(d, main = "Sample from distribution")
abline(v=200)
abline(v=400)
# If you now want to sample just a few values (for example, 5) you could use
sample(approximate_distr,5)
# Example output from one run (the exact values depend on the chain):
#[1] 281.7310 371.2317 378.0504 342.5199 412.3302
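The three steps listed in the question (normalise the likelihood, build the cumulative distribution, invert it) can also be done numerically on a grid. The Python sketch below is an assumption-laden illustration: the grid limits of 100 and 800, the grid size and the function name sample_by_inverse_cdf are all my own choices, and dcustom is simply re-expressed with scipy's normal CDF.

```python
import numpy as np
from scipy.stats import norm


def dcustom(x, sd, L, R):
    """Python version of the likelihood from the question."""
    return norm.cdf((np.log(R) - np.log(x)) / sd) - norm.cdf((np.log(L) - np.log(x)) / sd)


def sample_by_inverse_cdf(n, rng, sd=0.1, L=200.0, R=400.0,
                          lo=100.0, hi=800.0, grid_size=20001):
    """Grid-based inverse transform sampling (hypothetical helper).

    lo and hi must cover essentially all of the mass of the curve.
    """
    grid = np.linspace(lo, hi, grid_size)
    dens = dcustom(grid, sd, L, R)
    # Step 1 and 2: cumulative trapezoid rule, then normalise so the CDF ends at 1.
    increments = 0.5 * (dens[1:] + dens[:-1]) * np.diff(grid)
    cdf = np.concatenate(([0.0], np.cumsum(increments)))
    cdf /= cdf[-1]
    # Step 3: push uniform draws through the inverse CDF by interpolation.
    return np.interp(rng.uniform(size=n), cdf, grid)


rng = np.random.default_rng(2018)
draws = sample_by_inverse_cdf(20_000, rng)
print(draws[:5])
print(draws.mean())
```

A histogram of draws should follow the shape of the plotted likelihood curve; the accuracy is limited by the grid spacing rather than by chain mixing.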

Maximum Likelihood Estimation for three-parameter Weibull distribution in r

I want to estimate the scale, shape and threshold parameters of a 3p Weibull distribution.
What I've done so far is the following:
Referring to this post, Fitting a 3 parameter Weibull distribution in R
I've used the functions
EPS = sqrt(.Machine$double.eps) # "epsilon" for very small numbers
llik.weibull <- function(shape, scale, thres, x)
{
  sum(dweibull(x - thres, shape, scale, log=T))
}
thetahat.weibull <- function(x)
{
  if(any(x <= 0)) stop("x values must be positive")
  toptim <- function(theta) -llik.weibull(theta[1], theta[2], theta[3], x)
  mu = mean(log(x))
  sigma2 = var(log(x))
  shape.guess = 1.2 / sqrt(sigma2)
  scale.guess = exp(mu + (0.572 / shape.guess))
  thres.guess = 1
  res = nlminb(c(shape.guess, scale.guess, thres.guess), toptim, lower=EPS)
  c(shape=res$par[1], scale=res$par[2], thres=res$par[3])
}
to "pre-estimate" my Weibull parameters, such that I can use them as initial values for the argument "start" in the "fitdistr" function of the MASS-Package.
You might ask why I want to estimate the parameters twice... reason is that I need the variance-covariance-matrix of the estimates which is also estimated by the fitdistr function.
EXAMPLE:
set.seed(1)
thres <- 450
dat <- rweibull(1000, 2.78, 750) + thres
pre_mle <- thetahat.weibull(dat)
my_wb <- function(x, shape, scale, thres) {
  dweibull(x - thres, shape, scale)
}
ml <- fitdistr(dat, densfun = my_wb,
               start = list(shape = round(pre_mle[1], digits = 0),
                            scale = round(pre_mle[2], digits = 0),
                            thres = round(pre_mle[3], digits = 0)))
ml
> ml
      shape         scale         thres
    2.942548    779.997177    419.996196
 (  0.152129)  ( 32.194294)  ( 28.729323)
> ml$vcov
            shape       scale       thres
shape  0.02314322    4.335239   -3.836873
scale  4.33523868 1036.472551 -889.497580
thres -3.83687258 -889.497580  825.374029
This works quite well for cases where the shape parameter is above 1. Unfortunately my approach should deal with the cases where the shape parameter could be smaller than 1.
The reason why this is not possible for shape parameters that are smaller than 1 is described here: http://www.weibull.com/hotwire/issue148/hottopics148.htm
In Case 1 (all three parameters are unknown), the following is said:
"Define the smallest failure time of ti to be tmin. Then when γ → tmin, ln(tmin - γ) → -∞. If β is less than 1, then (β - 1)ln(tmin - γ) goes to +∞ . For a given solution of β, η and γ, we can always find another set of solutions (for example, by making γ closer to tmin) that will give a larger likelihood value. Therefore, there is no MLE solution for β, η and γ."
This makes a lot of sense. For this very reason I want to do it the way they described it on this page.
"In Weibull++, a gradient-based algorithm is used to find the MLE solution for β, η and γ. The upper bound of the range for γ is arbitrarily set to be 0.99 of tmin. Depending on the data set, either a local optimal or 0.99tmin is returned as the MLE solution for γ."
I want to set a feasible interval for gamma (in my code called 'thres') such that the solution is between (0, .99 * tmin).
Does anyone have an idea how to solve this problem?
In the function fitdistr there seems to be no opportunity doing a constrained MLE, constraining one parameter.
Another way to go could be the estimation of the asymptotic variance via the outer product of the score vectors. The score vector could be taken from the function thetahat.weibull(x) used above. But calculating the outer product manually (without a function) seems to be very time consuming and does not solve the problem of the constrained ML estimation.
Best regards,
Tim
It's not too hard to set up a constrained MLE. I'm going to do this in bbmle::mle2; you could also do it in stats4::mle, but bbmle has some additional features.
The larger issue is that it's theoretically difficult to define the sampling variance of an estimate when it's on the boundary of the allowed space; the theory behind Wald variance estimates breaks down. You can still calculate confidence intervals by likelihood profiling ... or you could bootstrap. I ran into a variety of optimization issues when doing this ... I haven't really thought about whether there are specific reasons.
Reformat three-parameter Weibull function for mle2 use (takes x as first argument, takes log as an argument):
dweib3 <- function(x, shape, scale, thres, log=TRUE) {
  dweibull(x - thres, shape, scale, log=log)
}
Starting function (slightly reformatted):
weib3_start <- function(x) {
  mu <- mean(log(x))
  sigma2 <- var(log(x))
  logshape <- log(1.2 / sqrt(sigma2))
  logscale <- mu + (0.572 / logshape)
  logthres <- log(0.5*min(x))
  list(logshape = logshape, logsc = logscale, logthres = logthres)
}
Generate data:
set.seed(1)
dat <- data.frame(x=rweibull(1000, 2.78, 750) + 450)
Fit model: I'm fitting the parameters on the log scale for convenience and stability, but you could use boundaries at zero as well.
tmin <- log(0.99*min(dat$x))
library(bbmle)
m1 <- mle2(x~dweib3(exp(logshape),exp(logsc),exp(logthres)),
data=dat,
upper=c(logshape=Inf,logsc=Inf,
logthres=tmin),
start=weib3_start(dat$x),
method="L-BFGS-B")
vcov(m1), which should normally provide a variance-covariance estimate (unless the estimate is on the boundary, which is not the case here) gives NaN values ... not sure why without more digging.
library(emdbook)
tmpf <- function(x,y) m1@minuslogl(logshape=x,
                                   logsc=coef(m1)["logsc"],
                                   logthres=y)
tmpf(1.1,6)
s1 <- curve3d(tmpf,
xlim=c(1,1.2),ylim=c(5.9,tmin),sys3d="image")
with(s1,contour(x,y,z,add=TRUE))
h <- lme4:::hessian(function(x) do.call(m1@minuslogl,as.list(x)),coef(m1))
vv <- solve(h)
diag(vv)
## [1] 0.002672240 0.001703674 0.004674833
(se <- sqrt(diag(vv))) ## standard errors
## [1] 0.05169371 0.04127558 0.06837275
cov2cor(vv)
## [,1] [,2] [,3]
## [1,] 1.0000000 0.8852090 -0.8778424
## [2,] 0.8852090 1.0000000 -0.9616941
## [3,] -0.8778424 -0.9616941 1.0000000
This is the variance-covariance matrix of the log-scaled variables. If you want to convert to the variance-covariance matrix on the original scale, you need to scale by (x_i)*(x_j) (i.e. by the derivatives of the transformation exp(x)).
outer(exp(coef(m1)),exp(coef(m1))) * vv
## logshape logsc logthres
## logshape 0.02312803 4.332993 -3.834145
## logsc 4.33299307 1035.966372 -888.980794
## logthres -3.83414498 -888.980794 824.831463
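The rescaling step is the first-order delta method: for theta on the log scale, the covariance of exp(theta) is approximately J V J with J = diag(exp(theta)), which is the same as the outer-product form used above. A minimal Python sketch follows; the numbers are purely illustrative and are not taken from the fit, and cov_log_to_natural is a hypothetical helper name.

```python
import numpy as np


def cov_log_to_natural(log_estimates, cov_log):
    """First-order (delta-method) covariance of exp(theta) given cov(theta)."""
    natural = np.exp(np.asarray(log_estimates, dtype=float))
    jacobian = np.diag(natural)  # d exp(theta_i) / d theta_i = exp(theta_i)
    return jacobian @ np.asarray(cov_log, dtype=float) @ jacobian


# Illustrative numbers only (not the estimates from the fit above).
theta = np.log([3.0, 800.0, 400.0])
se = np.array([0.05, 0.04, 0.07])
cor = np.array([[1.0, 0.9, -0.9],
                [0.9, 1.0, -0.95],
                [-0.9, -0.95, 1.0]])
cov_log = cor * np.outer(se, se)
cov_natural = cov_log_to_natural(theta, cov_log)
print(cov_natural)
```

Correlations are unchanged by this rescaling, which is why the cov2cor output on the log scale also applies on the original scale to first order.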
I don't know why this doesn't work with numDeriv - would be very careful with variance estimates above. (Maybe too close to boundary for Richardson extrapolation to work?)
library(numDeriv)
hessian()
grad(function(x) do.call(m1@minuslogl,as.list(x)),coef(m1)) ## looks OK
vcov(m1)
The profiles look OK ... (we have to supply std.err because the Hessian isn't invertible)
pp <- profile(m1,std.err=c(0.01,0.01,0.01))
par(las=1,bty="l",mfcol=c(1,3))
plot(pp,show.points=TRUE)
confint(pp)
## 2.5 % 97.5 %
## logshape 0.9899645 1.193571
## logsc 6.5933070 6.755399
## logthres 5.8508827 6.134346
Alternately, we can do this on the original scale ... one possibility would be to use the log-scaling to fit, then refit starting from those parameters on the original scale.
wstart <- as.list(exp(unlist(weib3_start(dat$x))))
names(wstart) <- gsub("log","",names(wstart))
m2 <- mle2(x~dweib3(shape,sc,thres),
data=dat,
lower=c(shape=0.001,sc=0.001,thres=0.001),
upper=c(shape=Inf,sc=Inf,
thres=exp(tmin)),
start=wstart,
method="L-BFGS-B")
vcov(m2)
## shape sc thres
## shape 0.02312399 4.332057 -3.833264
## sc 4.33205658 1035.743511 -888.770787
## thres -3.83326390 -888.770787 824.633714
all.equal(unname(coef(m2)),unname(exp(coef(m1))),tol=1e-4)
About the same as the values above.
We can fit with a small shape, if we are a little more careful to bound the parameters, but now we end up on the boundary for the threshold, which will cause lots of problems for the variance calculations.
set.seed(1)
dat <- data.frame(x = rweibull(1000, .53, 365) + 100)
tmin <- log(0.99 * min(dat$x))
m1 <- mle2(x ~ dweib3(exp(logshape), exp(logsc), exp(logthres)),
lower=c(logshape=-10,logsc=0,logthres=0),
upper = c(logshape = 20, logsc = 20, logthres = tmin),
data = dat,
start = weib3_start(dat$x), method = "L-BFGS-B")
For censored data, you need to replace dweibull with pweibull; see Errors running Maximum Likelihood Estimation on a three parameter Weibull cdf for some hints.
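For readers working in Python, the constrained fit can be sketched with scipy.optimize.minimize and L-BFGS-B. This is not a port of the bbmle code: the function name fit_weibull3, the parameterisation through log(min(x) - thres), the box bounds and the moment-based starting values are all my own assumptions, chosen so that the threshold stays between 0 and 0.99 * min(x).

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gamma as gamma_fn
from scipy.stats import weibull_min


def fit_weibull3(x, thres_frac=0.99):
    """Constrained MLE for a three-parameter Weibull (hypothetical helper).

    Parameters are optimised as log(shape), log(scale), log(min(x) - thres);
    the bounds on the last one keep 0 <= thres <= thres_frac * min(x).
    Assumes all x are positive.
    """
    x = np.asarray(x, dtype=float)
    x_min = x.min()

    def unpack(theta):
        return np.exp(theta[0]), np.exp(theta[1]), x_min - np.exp(theta[2])

    def nll(theta):
        shape, scale, thres = unpack(theta)
        return -np.sum(weibull_min.logpdf(x - thres, c=shape, scale=scale))

    # Rough starting values (an assumption): threshold at half the minimum,
    # shape from the coefficient of variation, scale from the mean.
    y = x - 0.5 * x_min
    shape0 = (np.std(y) / np.mean(y)) ** -1.086
    scale0 = np.mean(y) / gamma_fn(1 + 1 / shape0)
    start = np.log([shape0, scale0, 0.5 * x_min])

    bounds = [(-5.0, 5.0), (-10.0, 20.0),
              (np.log((1 - thres_frac) * x_min), np.log(x_min))]
    res = minimize(nll, start, method="L-BFGS-B", bounds=bounds)
    shape, scale, thres = unpack(res.x)
    return {"shape": shape, "scale": scale, "thres": thres,
            "nll": res.fun, "start_nll": nll(start)}


rng = np.random.default_rng(1)
x = 750 * rng.weibull(2.78, 1000) + 450
fit = fit_weibull3(x)
print(fit)
```

As in the R fits above, a threshold estimate that lands exactly on the 0.99 * min(x) bound is a sign that Wald-type variance estimates should not be trusted.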
Another possible solution is to do Bayesian inference. Using scale priors on the shape and scale parameters and a uniform prior on the location parameter, you can easily run Metropolis-Hastings as follows. It might be advisable to reparameterize in terms of log(shape), log(scale) and log(y_min - location) because the posterior for some of the parameters becomes strongly skewed, in particular for the location parameter. Note that the output below shows the posterior for the backtransformed parameters.
library(MCMCpack)
logposterior <- function(par,y) {
  gamma <- min(y) - exp(par[3])
  sum(dweibull(y-gamma,exp(par[1]),exp(par[2]),log=TRUE)) + par[3]
}
y <- rweibull(100,shape=.8,scale=10) + 1
chain0 <- MCMCmetrop1R(logposterior, rep(0,3), y=y, V=.01*diag(3))
chain <- MCMCmetrop1R(logposterior, rep(0,3), y=y, V=var(chain0))
plot(exp(chain))
summary(exp(chain))
This produces the following output
#########################################################
The Metropolis acceptance rate was 0.43717
#########################################################
Iterations = 501:20500
Thinning interval = 1
Number of chains = 1
Sample size per chain = 20000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
[1,] 0.81530 0.06767 0.0004785 0.001668
[2,] 10.59015 1.39636 0.0098738 0.034495
[3,] 0.04236 0.05642 0.0003990 0.001174
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
var1 0.6886083 0.768054 0.81236 0.8608 0.9498
var2 8.0756210 9.637392 10.50210 11.4631 13.5353
var3 0.0003397 0.007525 0.02221 0.0548 0.1939
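A rough Python counterpart of this approach is sketched below, with a hand-written random-walk Metropolis sampler standing in for MCMCmetrop1R. The step size of 0.1, the chain length, the burn-in and the seed are arbitrary assumptions, and there is no tuning of the proposal covariance as in the two-stage R run above, so mixing will be worse.

```python
import numpy as np
from scipy.stats import weibull_min


def log_posterior(par, y):
    """Log posterior for (log shape, log scale, log(min(y) - location)).

    Mirrors the R logposterior above: flat in log shape and log scale,
    flat in the location (hence the + par[2] Jacobian term).
    """
    shape, scale = np.exp(par[0]), np.exp(par[1])
    location = y.min() - np.exp(par[2])
    return np.sum(weibull_min.logpdf(y - location, c=shape, scale=scale)) + par[2]


def random_walk_metropolis(logpost, start, n_iter, step, rng, **kwargs):
    """Plain random-walk Metropolis; the current state is stored every iteration."""
    current = np.asarray(start, dtype=float)
    current_lp = logpost(current, **kwargs)
    chain = np.empty((n_iter, current.size))
    accepted = 0
    for i in range(n_iter):
        proposal = current + step * rng.standard_normal(current.size)
        proposal_lp = logpost(proposal, **kwargs)
        # Accept with probability min(1, posterior ratio), on the log scale.
        if np.log(1.0 - rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
            accepted += 1
        chain[i] = current
    return chain, accepted / n_iter


rng = np.random.default_rng(2018)
y = 10 * rng.weibull(0.8, 100) + 1  # shape 0.8, scale 10, location 1
chain, rate = random_walk_metropolis(log_posterior, np.zeros(3), 10_000, 0.1, rng, y=y)
posterior = np.exp(chain[3_000:])  # back-transform and drop burn-in
medians = np.median(posterior, axis=0)
print("acceptance rate:", rate)
print("posterior medians (shape, scale, min(y) - location):", medians)
```

The third back-transformed column is min(y) - location, matching the third variable in the R summary above.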
