Drawing from a truncated normal distribution delivers the wrong standard deviation in R

I am drawing random numbers from a truncated normal distribution. After truncation at 0 from the left, the distribution is supposed to have mean 100 and standard deviation 60.
I wrote an algorithm to compute the mean and sd of the normal distribution prior to truncation (mean_old and sd_old).
The function vtruncnorm gives me the (wanted) variance of 60^2. However, when I draw random variables from the distribution, the standard deviation is around 96.
I don't understand why the sd of the random variables differs from the computed value of 60.
I tried increasing the number of draws - it still results in an sd around 96.
require(truncnorm)
mean_old = -5425.078
sd_old = 745.7254
val = rtruncnorm(10000, a=0, mean = mean_old, sd = sd_old)
sd(val)
sqrt(vtruncnorm( a=0, mean = mean_old, sd = sd_old))

OK, I did a quick test:
require(truncnorm)
val = rtruncnorm(1000000, a=7.2, mean = 0.0, sd = 1.0)
sd(val)
sqrt(vtruncnorm( a=7.2, mean = 0.0, sd = 1.0))
This is the canonical truncated Gaussian. At a = 6 the two values are very close, e.g. 0.1554233 vs 0.1548865, depending on the seed etc. At a = 7 they are systematically different, 0.1358143 vs 0.1428084 (the sampled value is smaller than the one from the function call). I've checked with a Python implementation:
import numpy as np
from scipy.stats import truncnorm
a, b = 7.0, 100.0
mean, var, skew, kurt = truncnorm.stats(a, b, moments='mvsk')
print(np.sqrt(var))
r = truncnorm.rvs(a, b, size=100000)
print(np.sqrt(np.var(r)))
and got back 0.1428083662823426, which is consistent with the R vtruncnorm result. At your a = 7.2 the results are even worse.
Moral of the story: at high a values, sampling from rtruncnorm has a bug. Python has the same problem as well.
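One workaround, if you only need one-sided truncation deep in the tail, is inverse-CDF sampling. Below is a minimal sketch under assumptions: rtrunc_tail is a hypothetical helper for a standard normal truncated on the left at a. It samples in the lower tail and negates, because doubles are much denser near 0 than near 1, so pnorm(-a) retains far more floating-point resolution than 1 - pnorm(a).
# draw n standard normals truncated to (a, Inf) via the inverse CDF
rtrunc_tail <- function(n, a) {
  u <- runif(n, 0, pnorm(-a))  # uniform over the (mirrored) tail mass
  -qnorm(u)                    # invert the CDF, then reflect back to the upper tail
}
val <- rtrunc_tail(1e6, 7.2)
sd(val)  # should now agree with sqrt(vtruncnorm(a = 7.2, mean = 0, sd = 1))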

Related

Estimating PDF with monotonically declining density at tails

tldr: I am numerically estimating a PDF from simulated data, and I need the density to monotonically decrease outside of the 'main' density region (as x -> infinity). What I have yields a density close to zero, but one that does not monotonically decrease.
Detailed Problem
I am estimating a simulated maximum likelihood model, which requires me to numerically evaluate the probability density function of some random variable (whose density cannot be derived analytically) at some (observed) value x. The goal is to maximize the log-likelihood of these densities, which requires them to have no spurious local maxima.
Since I do not have an analytic likelihood function, I numerically simulate the random variable by drawing the random component from some known distribution and applying a non-linear transformation to it. I save the results of this simulation in a vector named Simulated_stats_.
I then use density() to approximate the PDF and approxfun() to evaluate the PDF at x:
# some example simulation
Simulated_stats_ <- runif(n = 500, 10, 15) + rnorm(n = 500, mean = 15, sd = 3)
# approximation for x
approxfun(density(Simulated_stats_))(x)
This works well within the range of Simulated_stats_, see image:
[Example PDF]
The problem is that I need to be able to evaluate the PDF far from the range of the simulated data.
So in the image above, I would need to evaluate the PDF at, say, x=50:
approxfun(density(Simulated_stats_))(50)
> [1] NA
So instead I use the from and to arguments in the density function, which correctly approximates the near-zero tails:
approxfun(
  density(Simulated_stats_, from = 0, to = max(Simulated_stats_)*10)
)(50)
[1] 1.924343e-18
Which is great, under one condition: I need the density to keep going to zero the further x is from the simulated range. That is, if I evaluate at x = 51 the result must be strictly smaller. (Otherwise, my estimator may find local maxima far from the 'true' region, since the likelihood function is not monotonic very far from the 'main' density mass, i.e. in the extrapolated region.)
To test this I evaluated the approximated PDF at fixed intervals, took logs, and plotted. The result is discouraging: far from the main density mass, the probability 'jumps' up and down - always very close to zero, but NOT monotonically decreasing.
a <- sapply(X = seq(from = 0, to = 100, by = 0.5), FUN = function(x){
  approxfun(
    density(Simulated_stats_, from = 0, to = max(Simulated_stats_)*10)
  )(x)
})
aa <- cbind(seq(from = 0, to = 100, by = 0.5), a)
plot(aa[,1], log(aa[,2]))
Result:
[Non-monotonic log density far from the density mass]
My question
Does this happen because of the kernel estimation in density() or is it inaccuracies in approxfun()? (or something else?)
What alternative methods can I use that will deliver a monotonically declining PDF far from the simulated density mass?
Or - how can I manually change the approximated PDF to monotonically decline the further I am from the density mass? I would happily stick some linear trend that goes to zero...
Thanks!
One possibility is to estimate the CDF using a beta regression model; a numerical estimate of the derivative of this model can then be used to estimate the PDF at any point. Here's an example of what I was thinking. I'm not sure if it helps you at all.
Import libraries
library(mgcv)
library(data.table)
library(ggplot2)
Generate your data
set.seed(123)
Simulated_stats_ <- runif(n = 500, 10, 15) + rnorm(n = 500, mean = 15, sd = 3)
Function to estimate CDF using gam beta regression model
get_mod <- function(ss, p = seq(0.02, 0.98, 0.02)) {
  qp = quantile(ss, probs = p)
  betamod = mgcv::gam(p ~ s(qp, bs = "cs"), family = mgcv::betar())
  return(betamod)
}
betamod <- get_mod(Simulated_stats_)
Very basic estimate of PDF at val given model that estimates CDF
est_pdf <- function(val, betamod, tol = 0.001) {
  xvals = c(val, val + tol)
  yvals = predict(betamod, newdata = data.frame(qp = xvals), type = "response")
  as.numeric((yvals[1] - yvals[2])/(xvals[1] - xvals[2]))
}
Let's check whether it's monotonically increasing below the min of Simulated_stats_:
test_x = seq(0,min(Simulated_stats_), length.out=1000)
pdf = sapply(test_x, est_pdf, betamod=betamod)
all(pdf == cummax(pdf))
[1] TRUE
Let's check whether it's monotonically decreasing above the max of Simulated_stats_:
test_x = seq(max(Simulated_stats_), 60, length.out=1000)
pdf = sapply(test_x, est_pdf, betamod=betamod)
all(pdf == cummin(pdf))
[1] TRUE
Additional thoughts 3/5/22
As discussed in comments, using the betamod to predict might slow down the estimator. While this could be resolved to a great extent by writing your own predict function directly, there is another possible shortcut.
Generate estimates from the betamod over the range of X, including the extremes
k <- sapply(seq(0,max(Simulated_stats_)*10, length.out=5000), est_pdf, betamod=betamod)
Use the approach you were initially taking, i.e. a linear interpolation, but instead of interpolating the density() output, interpolate k (i.e. the estimates from the beta model above):
lin_int = approxfun(x=seq(0,max(Simulated_stats_)*10, length.out=5000),y=k)
You can use the lin_int() function for prediction in the estimator, and it will be lightning fast. Note that it produces virtually the same value for a given x:
c(est_pdf(38,betamod), lin_int(38))
[1] 0.001245894 0.001245968
and it is very fast
microbenchmark::microbenchmark(
  list = alist("betamod" = est_pdf(38, betamod), "lin_int" = lin_int(38)), times = 100
)
Unit: microseconds
    expr    min      lq     mean  median      uq    max neval
 betamod 1157.0 1170.20 1223.304 1188.25 1211.05 2799.8   100
 lin_int    1.7    2.25    3.503    4.35    4.50   10.5   100
Finally, let's check the same plot you made before, but using lin_int() instead of approxfun(density(....)):
a <- sapply(X = seq(from = 0, to = 100, by = 0.5), lin_int)
aa <- cbind( seq(from = 0, to = 100, by = 0.5), a)
plot(aa[,1],log(aa[,2]))
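As an aside, the 'manual' fix floated in the question - splicing a trend that is linear on the log scale (i.e. an exponential tail) onto the kernel estimate beyond the data range - can also be sketched directly. This is only an illustration under assumptions: pdf_tail and rate are hypothetical, and the decay rate would need to be tuned or estimated (e.g. from the slope of the last few log-density points):
dens <- density(Simulated_stats_)
f <- approxfun(dens$x, dens$y)
hi <- max(Simulated_stats_)
rate <- 0.5  # assumed decay rate of the spliced tail
pdf_tail <- function(x) {
  # kernel estimate inside the data range, exponential decay beyond it;
  # the tail is continuous in level at x = hi by construction
  ifelse(x <= hi, f(x), f(hi) * exp(-rate * (x - hi)))
}
# log(pdf_tail(x)) declines linearly for x > hi, so no spurious local maxima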

HydeNet R package: setNode() simulates wrong standard deviation for normal distribution

In the R package HydeNet, I'm unable to simulate the standard deviations I specified earlier with setNode(). HydeNet exports to JAGS for simulations.
I load the libraries here and make a simple graph showing that node_a affects node_b:
library(rjags)
library(HydeNet)
dag = HydeNetwork(~node_a
+ node_b | node_a)
I give both nodes a standard deviation (node_a is set to 100, node_b to 0.25):
dag = setNode(dag, node_a, nodeType = "dnorm", mean = 10, sd = 100)
dag = setNode(dag, node_b, nodeType = "dnorm", mean = "15 + 0.2*node_a", sd = 0.25)
I compile to JAGS and simulate a data frame of 100,000 draws below:
comp_dag = compileJagsModel(dag)
sim = HydeSim(comp_dag, variable.names = c("node_a", "node_b"), n.iter = 100000)
Yet when I check the standard deviations of the nodes in the simulated data, they are wrong. For some reason, simulated_sd = sqrt(1/input_sd), and I don't know why it follows this equation.
sd(sim$node_a)
#0.1000575
sd(sim$node_b)
#1.992311
When HydeNet is fed the simulated data it estimates something close to the original input standard deviations, so weirdly that direction seems to work. This is unfortunate because I need to use both of these equations, which prevents me from simply reversing the equation I derived above.
sim_dag = HydeNetwork(~node_a
+ node_b | node_a, data = sim)
writeNetworkModel(sim_dag, pretty = T)
# model{
# node_a ~ dnorm(9.99955, 100.00706)
# node_b ~ dnorm(15.53815 + 0.1468 * node_a, 0.25169)
# }
I figured it out, though I thought I'd leave my post up for anyone else who's confused. JAGS takes mean and precision, not mean and standard deviation. Precision is 1/variance, which also explains the equation relating my input and output standard deviations.
Back-calculating with the equation I wrote (simulated_sd = 1/sqrt(input_sd)) will work if you're simulating from JAGS in R.
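To make the arithmetic concrete, here is a small sketch of the relationship and the fix (tau_for_sd is a hypothetical helper; in the generated JAGS model the second dnorm argument is the precision tau = 1/sd^2):
# the value passed as "sd" is interpreted by JAGS as precision tau,
# so the realized standard deviation is 1/sqrt(input_sd)
input_sd <- c(100, 0.25)
1 / sqrt(input_sd)
# [1] 0.1 2.0   # matches sd(sim$node_a) ~ 0.100 and sd(sim$node_b) ~ 1.99
# to obtain a true standard deviation s, supply tau = 1/s^2 instead
tau_for_sd <- function(s) 1 / s^2
tau_for_sd(c(100, 0.25))
# [1] 1e-04 1.6e+01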

Calculating the mean of a truncated log-normal distribution

I am trying to calculate the mean of a truncated log-normal distribution.
I have a random variable x which has a log-normal distribution with standard deviation a.
I would like to calculate the mean of x when x < y.
Note: if x were normally distributed, this could be calculated using this library:
from scipy.stats import truncnorm
my_mean = 100
my_std = 20
myclip_a = 0
myclip_b = 95
a, b = (myclip_a - my_mean) / my_std, (myclip_b - my_mean) / my_std
new_mean = truncnorm.mean(a, b, my_mean, my_std)
I would like to adapt this code under the assumption that the distribution is log-normal rather than normal.
There may well be more elegant ways of doing this, but I ended up integrating the lognormal PDF multiplied by x over the range between the truncation limits to solve this problem.
Below is a Python example - ignore the clumsy way I've specified the untruncated lognormal distribution's mean and standard deviation, that's just a peculiarity of my work.
It should work between any truncations (x1 = lower limit, x2 = upper limit), including zero to infinity (using np.inf):
from scipy.special import erf
import numpy as np

P10 = 50  # Untruncated P10 (ie 10% of outcomes are higher than this)
P90 = 10  # Untruncated P90 (ie 90% of outcomes are higher than this)
u = (np.log(P90) + np.log(P10))/2  # Untruncated mean of the log-transformed distribution
s = np.log(P10/P90)/2.562          # Standard deviation

# Returns the integral of the lognormal pdf multiplied by the lognormal outcomes (x)
# between the lower (x1) and upper (x2) truncations,
# normalized by the probability mass between them.
# pdf and cdf equations from https://en.wikipedia.org/wiki/Log-normal_distribution
# Integral evaluated with:
# https://www.wolframalpha.com/input/?i2d=true&i=Integrate%5Bexp%5C%2840%29-Divide%5BPower%5B%5C%2840%29ln%5C%2840%29x%5C%2841%29-u%5C%2841%29%2C2%5D%2C%5C%2840%292*Power%5Bs%2C2%5D%5C%2841%29%5D%5C%2841%29%2Cx%5D
def ln_trunc_mean(u, s, x1, x2):
    if x2 != np.inf:
        upper = erf((s**2 + u - np.log(x2))/(np.sqrt(2)*s))
        upper_cum_prob = 0.5*(1 + erf((np.log(x2) - u)/(s*np.sqrt(2))))  # CDF
    else:
        upper = -1
        upper_cum_prob = 1
    if x1 != 0:
        lower = erf((s**2 + u - np.log(x1))/(np.sqrt(2)*s))
        lower_cum_prob = 0.5*(1 + erf((np.log(x1) - u)/(s*np.sqrt(2))))
    else:
        lower = 1
        lower_cum_prob = 0
    integrand = -0.5*np.exp(s**2/2 + u)*(upper - lower)  # integral of x * pdf(x) dx
    return integrand / (upper_cum_prob - lower_cum_prob)
You could then evaluate, for example, the untruncated mean as well as a mean with upper and lower 1-percentile clipping, as follows:
# Untruncated Mean
print(ln_trunc_mean(u, s, 0, np.inf))
27.238164532490508
# Truncated mean between 5.2 and 96.4
print(ln_trunc_mean(u, s, 5.2, 96.4))
26.5089880192863
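For completeness, the same closed form translates directly into R, using pnorm in place of erf. This is a hedged re-implementation (this ln_trunc_mean is my own R version, not part of the answer above) of the identity E[X | x1 < X < x2] = exp(u + s^2/2) * (Phi((ln x2 - u - s^2)/s) - Phi((ln x1 - u - s^2)/s)) / (Phi((ln x2 - u)/s) - Phi((ln x1 - u)/s)):
ln_trunc_mean <- function(u, s, x1, x2) {
  # log(0) = -Inf and log(Inf) = Inf, so pnorm handles both limits naturally
  num <- pnorm((log(x2) - u - s^2)/s) - pnorm((log(x1) - u - s^2)/s)
  den <- pnorm((log(x2) - u)/s) - pnorm((log(x1) - u)/s)
  exp(u + s^2/2) * num / den
}
u <- (log(10) + log(50))/2
s <- log(50/10)/2.562
ln_trunc_mean(u, s, 0, Inf)     # ~27.238, the untruncated mean
ln_trunc_mean(u, s, 5.2, 96.4)  # ~26.509, matching the Python result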

Multiple random values between specific ranges in R?

I want to draw 50 samples from a (TRUNCATED) normal (Gaussian) distribution in the range 15-85, with mean = 35 and sd = 30. For reproducibility:
num = 50 # number of samples
rng = c(15, 85) # the range to pick the samples from
mu = 35 # mean
std = 30 # standard deviation
The following code gives 50 samples:
rnorm(n = num, mean = mu, sd = std)
However, I want these numbers to be strictly between the range 15-85. How can I achieve this?
UPDATE: Some people made great points in the comment section that this problem cannot be solved as stated, since the result will no longer be a Gaussian distribution. I added the word TRUNCATED to the original post so it makes more sense (truncated normal distribution).
As Limey said in the comments, by imposing a bounded region the distribution is no longer normal. There are several ways to achieve this.
library("MCMCglmm")
rtnorm(n = 50, mean = mu, sd = std, lower = 15, upper = 85)
is one method. If you want a more manual approach, you could simulate uniform draws between the bounds' CDF values and apply the normal quantile function:
bounds <- c(pnorm(15, mu, std), pnorm(85, mu, std))
samples <- qnorm(runif(50, bounds[1], bounds[2]), mu, std)
The idea is very basic: simulate quantiles uniformly between the CDF values at the two bounds, then map each quantile back to a value with the inverse CDF. The advantage of this approach over the one linked by GKi is that it preserves the normal shape between the bounds, whereas simulating unbounded draws and then clamping the vector to the bounds piles extra mass at the bounds compared to the normal distribution.
Note the outcome is not normal, as it is bounded.
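A third, rejection-based option (not shown above) is to draw from the untruncated normal and keep only the in-range draws. A minimal sketch; rtnorm_reject is a hypothetical helper, exact but wasteful when the bounds cut off most of the mass:
rtnorm_reject <- function(n, mu, std, lower, upper) {
  out <- numeric(0)
  while (length(out) < n) {
    x <- rnorm(n, mu, std)                  # propose from the untruncated normal
    out <- c(out, x[x > lower & x < upper]) # keep only draws inside the bounds
  }
  out[1:n]
}
samples <- rtnorm_reject(50, 35, 30, 15, 85)
range(samples)  # strictly inside (15, 85)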

Is pnorm(q = (x - mu)/sigma) ever different from pnorm(q = x, mean = mu, sd = sigma)?

It is the case that the probability density for a standardized and unstandardized random variable will differ. E.g., in R
dnorm(x = 0, mean = 1, sd = 2)
dnorm(x = (0 - 1)/2)
However,
pnorm(q = 0, mean = 1, sd = 2)
pnorm(q = (0 - 1)/2)
yields the same value.
Are there any situations in which the normal cumulative distribution function will yield a different probability for the same random variable when it is standardized versus unstandardized? If yes, is there a particular example in which this difference arises? If not, is there a general proof of this property?
Thanks so much for any help and/or insight!
This isn't really a coding question, but I'll answer it anyway.
Short answer: yes, they may differ.
Long answer:
A normal distribution is usually thought of as y = f(x), that is, a curve over the domain of x. When you standardize, you are converting from units of x to units of z. For example, if x ~ N(15, 5^2), then a value of 10 is 5 x-units less than the mean. Notice that this is also 1 standard deviation less than the mean. When you standardize, you convert x to z ~ N(0, 1^2). Now, that example value of 10, when standardized into z-units, becomes a value of -1 (i.e., it's still one standard deviation less than the mean).
As a result, the area under the curve to the left of x=10 is the same as the area under the curve to the left of z=-1. In words, the cumulative probability up to those cut-offs is the same.
However, the heights of the curves are different. Let the normal distribution curves be f(x) and g(z). Then f(10) != g(-1). In code:
dnorm(10, 15, 5) != dnorm(-1, 0, 1)
The reason is that the act of standardizing either "spreads" or "squishes" the f(x) curve to make it "fit" over the new z domain as g(z).
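Concretely, the link between the two heights is the change-of-variables Jacobian: f(x) = g(z)/sigma, with z = (x - mu)/sigma. A quick sketch to confirm both claims in R, using the example's numbers:
mu <- 15; sigma <- 5; x <- 10
z <- (x - mu)/sigma
all.equal(pnorm(x, mu, sigma), pnorm(z))        # TRUE: the CDF is invariant
all.equal(dnorm(x, mu, sigma), dnorm(z)/sigma)  # TRUE: the density scales by 1/sigma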
Here are two links that let you visualize the spreading/squishing:
https://academo.org/demos/gaussian-distribution/
https://www.intmath.com/counting-probability/normal-distribution-graph-interactive.php
Hope this helps!
