Changing the distribution of a series of numbers with a power function - math

I'm trying to use a power function to change the distribution of a series of values between 0 and 1 such that the mean is 0.5.
i.e., for each of the values in the series:
new_value = old_value ^ x
where x is some number.
Is there a simple way to calculate the value of x?

You could use a root finder from SciPy's optimize module.
Here is an example:
import numpy as np
from scipy import optimize
values = np.random.uniform(0, 1, 5)
# find the power for which the mean of values ** power equals 0.5;
# the bracket endpoints are the powers that map the max / min value exactly to 0.5
sol = optimize.root_scalar(
    lambda pwr: np.mean(values ** pwr) - 0.5,
    bracket=[np.log(0.5) / np.log(values.max()),
             np.log(0.5) / np.log(values.min())])
print('given values:', values)
print('given mean:', values.mean())
print('power:', sol.root)
print('transformed values:', values ** sol.root)
print('mean of transformed values:', (values ** sol.root).mean())
Example output:
given values: [0.82082056 0.01531309 0.56587417 0.53283897 0.73051697]
given mean: 0.5330727532243068
power: 1.1562709936704882
transformed values: [0.79588022 0.00796968 0.5176988 0.48291519 0.69553611]
mean of transformed values: 0.5
A much simplified algorithm would be (a minimal Python sketch follows below):
choose two limits: a = log(0.5)/log(max(values)) and b = log(0.5)/log(min(values))
using a as the power gives a mean lower than (or equal to) 0.5
using b as the power gives a mean higher than (or equal to) 0.5
choose a value m somewhere in the middle and calculate the mean with m as the power; if that mean is lower than 0.5, m should replace a, otherwise m should replace b
repeat the previous step until either the mean is close enough to 0.5, or a and b get too close to each other
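A minimal Python sketch of that bisection (my own illustration; the helper name find_power and the tolerance/iteration limits are arbitrary choices, not part of the answer):

import numpy as np

def find_power(values, target=0.5, tol=1e-9, max_iter=200):
    # limits from the answer: these powers map the max / min value exactly to 0.5
    a = np.log(0.5) / np.log(values.max())   # mean(values ** a) <= 0.5
    b = np.log(0.5) / np.log(values.min())   # mean(values ** b) >= 0.5
    for _ in range(max_iter):
        m = 0.5 * (a + b)                    # midpoint power
        mean_m = np.mean(values ** m)
        if abs(mean_m - target) < tol or abs(a - b) < tol:
            return m
        if mean_m < target:
            a = m                            # mean too low: m replaces a
        else:
            b = m                            # mean too high (or equal): m replaces b
    return m

values = np.random.uniform(0, 1, 5)
print(np.mean(values ** find_power(values)))   # should be very close to 0.5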

Calculating the mean of truncated log normal distribution

I am trying to calculate the mean of a truncated log-normal distribution.
I have a random variable x which has a log-normal distribution with standard deviation a.
I would like to calculate the mean of x when x < y.
Note: if x were normally distributed, this could be calculated using scipy's truncnorm:
from scipy.stats import truncnorm
my_mean = 100
my_std = 20
myclip_a = 0
myclip_b = 95
a, b = (myclip_a - my_mean) / my_std, (myclip_b - my_mean) / my_std
new_mean = truncnorm.mean(a, b, my_mean, my_std)
I would like to convert this code with the assumption that the distribution is Log-Normal and not Normal.
There may well be more elegant ways of doing this, but I ended up integrating the lognormal PDF multiplied by x over the range between the truncation limits to solve this problem.
Below is a Python example; ignore the clumsy way I've specified the mean and standard deviation of the untruncated lognormal distribution, that's just a peculiarity of my work.
It should work for any truncation limits (x1 = lower limit, x2 = upper limit), including zero to infinity (using np.inf).
from scipy.special import erf
import numpy as np

P10 = 50  # Untruncated P10 (ie 10% of outcomes are higher than this)
P90 = 10  # Untruncated P90 (ie 90% of outcomes are higher than this)
u = (np.log(P90) + np.log(P10)) / 2  # Mean of the log-transformed (untruncated) distribution
s = np.log(P10 / P90) / 2.562        # Standard deviation of the log-transformed distribution

# Returns the integral of the lognormal pdf multiplied by the lognormal outcomes (x)
# between the lower (x1) and upper (x2) truncations, divided by the probability mass
# between the truncations (i.e. the truncated mean).
# pdf and cdf equations from https://en.wikipedia.org/wiki/Log-normal_distribution
# Integral evaluated with:
# https://www.wolframalpha.com/input/?i2d=true&i=Integrate%5Bexp%5C%2840%29-Divide%5BPower%5B%5C%2840%29ln%5C%2840%29x%5C%2841%29-u%5C%2841%29%2C2%5D%2C%5C%2840%292*Power%5Bs%2C2%5D%5C%2841%29%5D%5C%2841%29%2Cx%5D
def ln_trunc_mean(u, s, x1, x2):
    if x2 != np.inf:
        upper = erf((s**2 + u - np.log(x2)) / (np.sqrt(2) * s))
        upper_cum_prob = 0.5 * (1 + erf((np.log(x2) - u) / (s * np.sqrt(2))))  # CDF at x2
    else:
        upper = -1
        upper_cum_prob = 1
    if x1 != 0:
        lower = erf((s**2 + u - np.log(x1)) / (np.sqrt(2) * s))
        lower_cum_prob = 0.5 * (1 + erf((np.log(x1) - u) / (s * np.sqrt(2))))  # CDF at x1
    else:
        lower = 1
        lower_cum_prob = 0
    integrand = -0.5 * np.exp(s**2 / 2 + u) * (upper - lower)  # integral of x * pdf(x) dx
    return integrand / (upper_cum_prob - lower_cum_prob)
You could then evaluate, for example, the untruncated mean as well as a mean with upper and lower 1-percentile clipping, as follows:
# Untruncated Mean
print(ln_trunc_mean(u, s, 0, np.inf))
27.238164532490508
# Truncated mean between 5.2 and 96.4
print(ln_trunc_mean(u, s, 5.2, 96.4))
26.5089880192863
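As a quick cross-check (my addition, not part of the original answer), the same conditional mean can be approximated numerically with scipy, reusing u and s from above; scipy.stats.lognorm with shape s and scale exp(u) describes the same distribution:

from scipy import integrate
from scipy.stats import lognorm

dist = lognorm(s=s, scale=np.exp(u))   # same lognormal as above
x1, x2 = 5.2, 96.4
num, _ = integrate.quad(lambda x: x * dist.pdf(x), x1, x2)   # integral of x * pdf(x) between the truncations
print(num / (dist.cdf(x2) - dist.cdf(x1)))   # should be close to the 26.5 printed above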

R: compute an integral with an unknown parameter equal to a certain value (for example: int x = 0.6)

I am trying to simulate values from an integral with an unknown parameter (to create a climatological forecaster).
My function is: $\int_{x=0}^{x=0.25} 4y^{-1/x}\,dx$
Normally one inputs the variable y and gets a value as output.
However, I want to input the value this integral is equal to and get the value of y as an output.
I have 3 runif vectors of length 1 000, 10 000 and 100 000 (with values between 0 and 1), which I use as my input values.
Say the first value is 0.3 and the second value is 0.78
I want to calculate for which y, the integral above is equal to 0.3 (or equal to 0.78 for the second value).
How can I do this in R?
I've tried a few things with the integrate function, but then I need a value for y to make that work.
You are trying to solve a non-linear equation with an integral inside.
Intuitively, what you need to do is start with an interval that contains the desired y. Then try different values of y, calculate the integral for each, and narrow the interval based on the result.
You can implement that in R using integrate and optimize as below:
f <- function(x, y) {
  4*y^(-1/x)
}
intf <- function(y) {
  integrate(f, 0, 0.25, y=y)
}
objective <- function(y, value) {
  abs(intf(y)$value - value)
}
optimize(objective, c(1, 10), value=0.3)
#$minimum
#[1] 1.14745
#
#$objective
#[1] 1.540169e-05
optimize(objective, c(1, 10), value=0.78)
#$minimum
#[1] 1.017891
#
#$objective
#[1] 0.0001655954
Here, f is the function to be integrated, intf calculates the integral for a given y, and objective measures the distance between the value of the integral against the desired value.
Since the optimize function finds the minimum of a function, it returns the y for which the integral is closest to the target value.
Note that non-linear equations with an integral inside are in general tough to solve. This case seems manageable since the function is monotonic and continuous in y. The solution y should be unique and can be easily found by narrowing down the interval.
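As an aside (not from the original answer), the same numbers can be cross-checked in Python with scipy; the helper int_f below is just an illustrative name:

import numpy as np
from scipy import integrate, optimize

def int_f(y):
    # integral of 4 * y^(-1/x) over x in (0, 0.25] for a given y > 1
    integrand = lambda x: 0.0 if x == 0 else 4.0 * y ** (-1.0 / x)
    val, _ = integrate.quad(integrand, 0.0, 0.25)
    return val

# find y so that the integral equals the target value 0.3
sol = optimize.root_scalar(lambda y: int_f(y) - 0.3, bracket=[1.01, 10.0])
print(sol.root)   # should be close to the 1.14745 found with R's optimize above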

R How to sample from an interrupted upside down bell curve

I've asked a related question before, which successfully received an answer. Now I want to sample values from an upside-down bell curve, but exclude a range of values that fall in the middle of it, as shown in the picture below:
I have this code currently working:
min <- 1
max <- 20
q <- min + (max-min)*rbeta(10000, 0.5, 0.5)
How may I adapt it to achieve the desired output?
Say you want a sample of 10,000 from your distribution but don't want any numbers between 5 and 15 in your sample. Why not just do:
q <- min + (max-min)*rbeta(50000, 0.5, 0.5);
q <- q[!(q > 5 & q < 15)][1:10000]
Which gives you this:
hist(q)
But still has the correct size:
length(q)
#> [1] 10000
An "upside-down bell curve" (relative to the normal distribution), with a certain interval excluded, can be sampled using the following algorithm. I write it in pseudocode because I'm not familiar with R. I adapted it from another answer I just posted.
Notice that this sampler samples in a truncated interval (here, the interval [x0, x1], with the exclusion of [x2, x3]) because it's not possible for an upside-down bell curve extended to infinity to integrate to 1 (which is one of the requirements for a probability density).
In the pseudocode, RNDU01() is a uniform(0, 1) random number.
x0pdf = 1 - exp(-(x0*x0))
x1pdf = 1 - exp(-(x1*x1))
ymax = max(x0pdf, x1pdf)
while true
    # Choose a random x-coordinate
    x = RNDU01() * (x1 - x0) + x0
    # Choose a random y-coordinate
    y = RNDU01() * ymax
    # Return x if it lies outside the excluded interval and y falls within the PDF
    if (x < x2 or x > x3) and y < 1 - exp(-(x*x)): return x
end
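A rough Python translation of that pseudocode (my own sketch, assuming x0 < x2 < x3 < x1):

import math
import random

def sample_inverted_bell(x0, x1, x2, x3):
    # Rejection sampling from a density proportional to 1 - exp(-x^2)
    # on [x0, x1], excluding the middle interval [x2, x3].
    ymax = max(1 - math.exp(-x0 * x0), 1 - math.exp(-x1 * x1))
    while True:
        x = random.uniform(x0, x1)   # random x-coordinate
        y = random.uniform(0, ymax)  # random y-coordinate
        # accept x if it is outside the excluded interval and y falls under the curve
        if (x < x2 or x > x3) and y < 1 - math.exp(-x * x):
            return x

samples = [sample_inverted_bell(-3, 3, -1, 1) for _ in range(10000)]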

Drawing from truncated normal distribution delivers wrong standard deviation in R

I draw random numbers from a truncated normal distribution. The truncated normal distribution is supposed to have mean 100 and standard deviation 60 after truncation at 0 from the left.
I used an algorithm to compute the mean and sd of the normal distribution prior to truncation (mean_old and sd_old).
The function vtruncnorm gives me the (wanted) variance of 60^2. However, when I draw random variables from the distribution, the standard deviation is around 96.
I don't understand why the sd of the drawn values differs from the computed value of 60.
I tried increasing the number of draws; the sd still comes out around 96.
require(truncnorm)
mean_old = -5425.078
sd_old = 745.7254
val = rtruncnorm(10000, a=0, mean = mean_old, sd = sd_old)
sd(val)
sqrt(vtruncnorm( a=0, mean = mean_old, sd = sd_old))
OK, I did a quick test:
require(truncnorm)
val = rtruncnorm(1000000, a=7.2, mean = 0.0, sd = 1.0)
sd(val)
sqrt(vtruncnorm( a=7.2, mean = 0.0, sd = 1.0))
This is the canonical truncated Gaussian. At a = 6 the two values are very close, e.g. 0.1554233 vs 0.1548865, depending on the seed etc. At a = 7 they are systematically different, 0.1358143 vs 0.1428084 (the sampled value is smaller than the function call). I've checked with a Python implementation
import numpy as np
from scipy.stats import truncnorm
a, b = 7.0, 100.0
mean, var, skew, kurt = truncnorm.stats(a, b, moments='mvsk')
print(np.sqrt(var))
r = truncnorm.rvs(a, b, size=100000)
print(np.sqrt(np.var(r)))
and got back 0.1428083662823426, which is consistent with the R vtruncnorm result. At your a = 7.2 or so the results are even worse.
Moral of the story: at high a values, sampling from rtruncnorm has a bug. Python has the same problem as well.
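For reference (my addition, not part of the original answer), the moments of a standard normal truncated from below at a have a closed form via the inverse Mills ratio lambda(a) = phi(a) / (1 - Phi(a)), which gives an independent value to compare both numbers against:

import numpy as np
from scipy.stats import norm

a = 7.0                                # left truncation point used in the comparison above
lam = norm.pdf(a) / norm.sf(a)         # inverse Mills ratio; sf avoids cancellation in 1 - cdf
var_exact = 1.0 + a * lam - lam**2     # Var[X | X > a] for a standard normal
print(np.sqrt(var_exact))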

range of values taken by f(x) based on a range of values for x

I would like to know the range of values that a function f(x) can take based on a range of values of x.
For instance, say I have a quadratic equation f(x)=x^2 - x + 0.2 and I want to know the range of f(x) for x in the range [0.2, 1].
is there a function or package in R that can do this?
If I understand your question correctly, you are looking for:
f <- function(x) x^2 - x + 0.2
x <- seq(0.2, 1, by=0.1)
range(f(x))
# [1] -0.05 0.20 # approximate numerical answer
If you want to know the range in an analytical way you have to do some mathematics (or further programming) to determine the maximum and minimum of the function f in that range of x.
An analytic answer can be calculated using calculus, if the function is differentiable. For the example quadratic, the calculation is:
f'(x) = 2x - 1 = 0 => x* = 1/2 is the argmin/argmax, and it lies within the domain for x: [0.2, 1].
Evaluate f at the domain endpoints, and the argmin/max:
f(0.2) = 0.04, f(0.5) = -0.05, f(1) = 0.2.
So min = -0.05, max = 0.2.
A numerical approximation will work if the function is well-behaved (e.g. continuous, differentiable). Otherwise, a spike or discontinuity (e.g. f(x) = 1/x) could be missed depending on the step-size.
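For completeness (not part of the original answers), a programmatic middle ground is a bounded scalar optimizer; the sketch below uses Python's scipy rather than R as an illustration:

import numpy as np
from scipy.optimize import minimize_scalar

f = lambda x: x**2 - x + 0.2
lo, hi = 0.2, 1.0

# interior extrema candidates found numerically, plus the interval endpoints
x_min = minimize_scalar(f, bounds=(lo, hi), method='bounded').x
x_max = minimize_scalar(lambda x: -f(x), bounds=(lo, hi), method='bounded').x
candidates = np.array([lo, hi, x_min, x_max])
print(f(candidates).min(), f(candidates).max())   # approx -0.05 and 0.2, matching the analytic result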
