Update Array to reduce Variance - math

I have an array X with float values and a fixed size. I need to add p (a float value) to one element of X so that the variance of X is reduced.
How do I select the proper element?
E.g.:
X = [0,1.2,1.7,2.1,1.7,0,1.3]
and
p = 0.5
What is the new X?

Variance is a measure of how far values are from the mean, so you can evaluate which changed value produces the greatest reduction of that measure. (Adding p to one element also shifts the mean by p/n, but that shift is the same whichever element you pick, so comparing the per-element squared deviations against the original mean still identifies the best index.)
import numpy as np

def Reduce_variance(X, p):
    # Calculate the mean
    mean = np.mean(X)
    # Squared deviations of the original values
    x = (np.array(X) - mean) ** 2
    # Squared deviations after adding p
    y = (np.array(X) + p - mean) ** 2
    # Per-element improvement
    z = x - y
    # Index of the largest improvement (ties go to the first index)
    index = np.argmax(z)
    # Replace
    X[index] = X[index] + p
    return X
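For example, with the numbers from the question, the element farthest below the mean (the first 0) gets the increment:

X = [0, 1.2, 1.7, 2.1, 1.7, 0, 1.3]
p = 0.5
print(Reduce_variance(X, p))
# [0.5, 1.2, 1.7, 2.1, 1.7, 0, 1.3]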

The variance is determined by how far elements are from the mean. If adding p to an element brings that element closer to the mean, then you will lower the variance. For example:
from statistics import variance, mean

X = [0, 1.2, 1.7, 2.1, 1.7, 0, 1.3]
p = 0.5

def change(X, p):
    mu = mean(X)
    for n in X:
        if abs(mu - n) > abs(mu - (n + p)):
            yield n + p
        else:
            yield n

variance(X), variance(change(X, p))
# (0.6961904761904762, 0.3747619047619048)

Related

How can I calculate the second-order derivative of a vector using finite differences if the interval is non-constant?

Say I have vectors x and y and want to calculate the second derivative of y with respect to x using finite differences.
I'd do
x <- rnorm(2000)
y <- x^2
y = y[order(x)]
x = sort(x)
dydx = diff(y) / diff(x)
d2ydx2 = c(NA, NA, diff(dydx) / diff(x[-1]))
plot(x, d2ydx2)
As you can see, there are a few points which are wildly inaccurate. I believe the problem arises because the values in dydx do not exactly correspond to those of x[-1], causing the second differentiation to give inaccurate results. Since the step in x is non-constant, the second-order differentiation is not straightforward. How can I do this?
Each time you take a numerical derivative, you lose one value from the vector and the output values shift over by one spot. You are correct that the error is due to the uneven spacing in the x values (an incorrect divisor in the dydx and d2ydx2 calculations).
To correct this, calculate a new set of x values corresponding to the midpoints between adjacent x values at each differentiation step. This is the location where the slope estimate actually applies.
Thus y'_1 = f'((x_1 + x_2)/2).
This method is not perfect, but the resulting error is much smaller.
# create the input
x <- sort(rnorm(2000))
y <- x^2
# calculate the first derivative and the midpoint x values
xprime <- x[-1] - diff(x)/2
dydx <- diff(y)/diff(x)
# calculate the second derivative and the midpoint x values
xpprime <- xprime[-1] - diff(xprime)/2
d2ydx2 <- diff(dydx)/diff(xprime)
plot(xpprime, d2ydx2)
Another way is to use splinefun, which returns a function from which you can calculate cubic spline derivatives.
Of course, given your example function y = x^2, the second derivative will always be 2:
x <- rnorm(2000)
y <- x^2
y = y[order(x)]
x = sort(x)
fun = splinefun(x,y)
plot(x,fun(x,deriv=2))
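As an alternative check in Python: NumPy's np.gradient handles non-uniform spacing directly when given the coordinate array (a sketch I'm adding, not from the original answers):

import numpy as np

# non-uniformly spaced sample points, mirroring the R example
x = np.sort(np.random.normal(size=2000))
y = x ** 2

# passing x as the second argument makes np.gradient use the actual spacing
dydx = np.gradient(y, x)       # first derivative
d2ydx2 = np.gradient(dydx, x)  # second derivative, close to 2 everywhere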

how to build an algorithm that is not in terms of M

X and Y agree to meet each other at a certain time. X shows up at time t (where t can range from 1 to n) with probability x[t], and Y shows up at time t with probability y[t]. Time t is in seconds.
I need an algorithm that runs in O(nlogn) time to calculate the probability that Y shows up after X.
The probabilities of X and Y are independent of each other
I tried to calculate an expression for the probability that X shows up M seconds before Y and plugging in all the t values (from 1 to n) and calculating the sum of all probabilities. But I will only get the probability in terms of M.
The probability that Y shows up after X is the sum over i of P(X = i) * P(Y > i). The basic O(n²) approach is: for each i, for each j > i, add x[i] * y[j]. But there is a more efficient way to calculate this:
P(Y > i) = 1 - P(Y <= i), so at each step you store the running P(Y <= i), which brings it down to O(n).
It would look like:
ysmaller = 0
total = 0
for i in range(n):
    ysmaller += y[i]
    total += x[i] * (1 - ysmaller)
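As a quick sanity check (my own example values, not from the question): with uniform arrival probabilities the answer should be (1 - P(same second)) / 2 = (n - 1) / (2n), and the loop reproduces that:

n = 4
x = [1 / n] * n  # X arrives uniformly at random in 1..n
y = [1 / n] * n  # Y arrives uniformly at random in 1..n

ysmaller = 0
total = 0
for i in range(n):
    ysmaller += y[i]
    total += x[i] * (1 - ysmaller)

print(total)              # 0.375
print((n - 1) / (2 * n))  # 0.375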

Calculating the mean of truncated log normal distribution

I am trying to calculate the mean of a truncated log normal distribution.
I have a random variable x which has a log-normal distribution with standard deviation a.
I would like to calculate the mean of x when x < y.
Note: if x were normally distributed, this could be calculated using this library:
from scipy.stats import truncnorm
my_mean = 100
my_std = 20
myclip_a = 0
myclip_b = 95
a, b = (myclip_a - my_mean) / my_std, (myclip_b - my_mean) / my_std
new_mean = truncnorm.mean(a, b, my_mean, my_std)
I would like to convert this code with the assumption that the distribution is Log-Normal and not Normal.
There may well be more elegant ways of doing this, but I ended up integrating the lognormal pdf multiplied by x over the range between the truncation limits to solve this problem.
Below is a Python example. Ignore the clumsy way I've specified the untruncated lognormal distribution's mean and standard deviation; that's just a peculiarity of my work.
It should work for any truncation limits (x1 = lower, x2 = upper), including zero to infinity (using np.inf).
from scipy.special import erf
import numpy as np

P10 = 50 # Untruncated P10 (ie 10% of outcomes are higher than this)
P90 = 10 # Untruncated P90 (ie 90% of outcomes are higher than this)
u = (np.log(P90) + np.log(P10)) / 2 # Untruncated mean of the log-transformed distribution
s = np.log(P10 / P90) / 2.562       # Standard deviation of the log-transformed distribution

# Returns the integral of the lognormal pdf multiplied by the lognormal outcomes (x)
# between the lower (x1) and upper (x2) truncations, normalised by the
# probability mass between them.
# pdf and cdf equations from https://en.wikipedia.org/wiki/Log-normal_distribution
# Integral evaluated with:
# https://www.wolframalpha.com/input/?i2d=true&i=Integrate%5Bexp%5C%2840%29-Divide%5BPower%5B%5C%2840%29ln%5C%2840%29x%5C%2841%29-u%5C%2841%29%2C2%5D%2C%5C%2840%292*Power%5Bs%2C2%5D%5C%2841%29%5D%5C%2841%29%2Cx%5D
def ln_trunc_mean(u, s, x1, x2):
    if x2 != np.inf:
        upper = erf((s**2 + u - np.log(x2)) / (np.sqrt(2) * s))
        upper_cum_prob = 0.5 * (1 + erf((np.log(x2) - u) / (s * np.sqrt(2)))) # CDF at x2
    else:
        upper = -1
        upper_cum_prob = 1
    if x1 != 0:
        lower = erf((s**2 + u - np.log(x1)) / (np.sqrt(2) * s))
        lower_cum_prob = 0.5 * (1 + erf((np.log(x1) - u) / (s * np.sqrt(2)))) # CDF at x1
    else:
        lower = 1
        lower_cum_prob = 0
    integrand = -0.5 * np.exp(s**2 / 2 + u) * (upper - lower) # Integral of x * pdf(x) dx
    return integrand / (upper_cum_prob - lower_cum_prob)
You could then evaluate, for example, the untruncated mean as well as the mean with the lower and upper tails clipped at roughly the 1st and 99th percentiles:
# Untruncated Mean
print(ln_trunc_mean(u, s, 0, np.inf))
27.238164532490508
# Truncated mean between 5.2 and 96.4
print(ln_trunc_mean(u, s, 5.2, 96.4))
26.5089880192863
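As a cross-check (my own sketch, reusing the u and s defined above, not part of the original answer), the same truncated mean can be obtained numerically with scipy.stats.lognorm, which matches this parameterisation when given shape s and scale exp(u):

from scipy.integrate import quad
from scipy.stats import lognorm
import numpy as np

dist = lognorm(s, scale=np.exp(u))  # lognormal with log-mean u and log-sd s
x1, x2 = 5.2, 96.4

num, _ = quad(lambda x: x * dist.pdf(x), x1, x2)  # integral of x * pdf(x)
mass = dist.cdf(x2) - dist.cdf(x1)                # probability between the limits
print(num / mass)  # should agree with ln_trunc_mean(u, s, x1, x2)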

Changing the distribution of a series of numbers with a power function

I'm trying to use a power function to change the distribution of a series of values between 0 and 1 such that the mean is 0.5.
ie. for each of the values in the series:
new_value = old_value ^ x
Where x is some number.
Is there a simple way to calculate the value of x?
You could use a root finder from SciPy's optimize module.
Here is an example:
import numpy as np
from scipy import optimize

values = np.random.uniform(0, 1, 5)
sol = optimize.root_scalar(lambda pwr: np.mean(values ** pwr) - 0.5,
                           bracket=[np.log(0.5) / np.log(values.max()),
                                    np.log(0.5) / np.log(values.min())])
print('given values:', values)
print('given mean:', values.mean())
print('power:', sol.root)
print('transformed values:', values ** sol.root)
print('mean of transformed values:', (values ** sol.root).mean())
Example output:
given values: [0.82082056 0.01531309 0.56587417 0.53283897 0.73051697]
given mean: 0.5330727532243068
power: 1.1562709936704882
transformed values: [0.79588022 0.00796968 0.5176988 0.48291519 0.69553611]
mean of transformed values: 0.5
A much simplified algorithm would be (a sketch of it in code follows below):
choose two limits: a = log(0.5)/log(max(values)) and b = log(0.5)/log(min(values))
calculating with a as the power gives a mean lower than (or equal to) 0.5
calculating with b as the power gives a mean higher than (or equal to) 0.5
choose a value m somewhere in the middle and calculate the mean with m as the power; if that mean is lower than 0.5, m should replace a, otherwise m should replace b
repeat the previous step until either the mean is close enough to 0.5, or a and b get too close to each other
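Here is that bisection as a minimal sketch (my own code, assuming all values lie strictly between 0 and 1):

import numpy as np

def find_power(values, target=0.5, tol=1e-12, max_iter=200):
    # Bisect for pwr such that mean(values ** pwr) == target.
    a = np.log(target) / np.log(values.max())  # mean(values ** a) <= target
    b = np.log(target) / np.log(values.min())  # mean(values ** b) >= target
    for _ in range(max_iter):
        m = 0.5 * (a + b)
        current = np.mean(values ** m)
        if abs(current - target) < tol or abs(a - b) < tol:
            break
        if current < target:
            a = m  # power too high: the mean undershoots the target
        else:
            b = m  # power too low: the mean overshoots the target
    return m

values = np.random.uniform(0, 1, 5)
pwr = find_power(values)
print((values ** pwr).mean())  # ~0.5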

How can I code this equation with double summation in R?

So I'm having a hard time coding the equation
b0 = (1/n) * sum_{i=1..n} ( y_i - sum_{j=1..p} x_ij * b_j ),
mainly the part which contains the double sum over the i's and over j.
In my case, n = 200 and p = 15. The y_i's are in a vector Y = (y_1, y_2, ..., y_n) of length 200, and the x_ij's are in a matrix with 15 columns and 200 rows. The b_j's are in a vector of length 15.
My own solution, which I'm fairly certain is wrong, is this:
b0 <- 1/200 * sum(Y - sum(matr*b))
And here is code which you can use to reproduce my vectors and matrix:
library(MASS) # for mvrnorm
matr <- t(mvrnorm(15, mu = rep(0, 200), diag(1, nrow = 200)))
Y <- rnorm(n = 200)
b <- rnorm(n = 15)
Use matrix multiplication:
mean(y - x %*% b)
Note that if y and x are known and b is the least squares regression estimate of the coefficients then we can write it as:
fm <- lm(y ~ x + 0)
mean(resid(fm))
and that necessarily equals 0 if there is an intercept, i.e. a constant column in x: the residual vector must be orthogonal to the range of x, and taking the mean is the same as taking the inner product of the residuals with a vector whose elements are all the same (and equal to 1/n).
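A quick numeric illustration of that orthogonality argument (my own sketch; note it uses Python/NumPy rather than the R of the answer above):

import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 15
x = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # first column is the intercept
y = rng.normal(size=n)

b, *_ = np.linalg.lstsq(x, y, rcond=None)  # least squares coefficient estimates
resid = y - x @ b
print(resid.mean())  # ~0 up to floating point, since resid is orthogonal to the column of 1s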
