Cannot figure out simple use of Cumulants.jl - julia

I cannot for the life of me figure out how to use Cumulants.jl to get moments or cumulants from some data. I find the docs (https://juliahub.com/docs/Cumulants/Vrq25/1.0.4/) completely over my head.
Suppose I have a vector of some data e.g.:
using Distributions
d = rand(Exponential(1), 1000)
The documentation suggests, so far as I can understand it, that cumulants(d, 3) should return the first three cumulants. The function is defined like so:
cumulants(data::Matrix{T}, m::Int = 4, b::Int = 2) where T<: AbstractFloat
A Matrix in Julia is, so far as I understand, a 2D array. So I convert my data to a 2D array:
dm = reshape(d, length(d), 1)
But I get:
julia> cumulants(dm,3)
ERROR: DimensionMismatch("bad block size 2 > 1")
My question concisely: how do I use Cumulants.jl to get the first m cumulants and the first m moments from some simulated data?
Thanks!
EDIT: In the above example, c = cumulants(dm,3,1) as suggested in a comment will give, for c:
3-element Array{SymmetricTensors.SymmetricTensor{Float64,N} where N,1}:
SymmetricTensors.SymmetricTensor{Float64,1}(Union{Nothing, Array{Float64,1}}[[1.0122452678071678]], 1, 1, 1, true)
SymmetricTensors.SymmetricTensor{Float64,2}(Union{Nothing, Array{Float64,2}}[[1.0336298356976195]], 1, 1, 1, true)
SymmetricTensors.SymmetricTensor{Float64,3}(Union{Nothing, Array{Float64,3}}[[2.5438037582591146]], 1, 1, 1, true)
I find that I can access the first, second, and third cumulants by:
c[1][1]
c[2][1,1]
c[3][1,1,1]
Which I arrived at essentially by guessing. I have no idea why this nutty output format exists. I still cannot figure out how to get the first m cumulants as a vector easily.

As I wrote in the comments, if you have a univariate problem you should use cumulants(dm,3,1). The cumulants are calculated using tensors, and the tensors are saved in a block structure where the blocks are of size bxb; b is the third argument in the function call. If you have only one column, the size of the tensors will be 1, so it doesn't make sense to save them in a 2x2 block.
To access the cumulants in Array form you have to convert them first. This is done by Array(cumulants(data, nc, b)[c]), where nc is the number of cumulants you want to calculate, b is the block size (for efficient storage of the tensors), and c is the cumulant you need.
Summing up:
using Cumulants
# univariate data
unidata = rand(1000,1)
uc = cumulants(unidata, 3, 1)
Array(uc[1])
#1-element Array{Float64,1}:
# 0.48772026299259374
Array(uc[2])
#1×1 Array{Float64,2}:
# 0.0811428357438324
Array(uc[3])
#[:, :, 1] =
# 0.0008653019738796724
# multivariate data
multidata = rand(1000,3)
mc = cumulants(multidata, 3, 2)
Array(mc[1])
#3-element Array{Float64,1}:
# 0.5024511157116442
# 0.4904838734508787
# 0.48286680648519215
Array(mc[2])
#3×3 Array{Float64,2}:
# 0.0834021 -0.00368562 -0.00151614
# -0.00368562 0.0835084 0.00233202
# -0.00151614 0.00233202 0.0808521
Array(mc[3])
# [:, :, 1] =
# -0.000506926 -0.000763061 -0.00183751
# -0.000763061 -0.00104804 -0.00117227
# -0.00183751 -0.00117227 0.00112968
#
# [:, :, 2] =
# -0.000763061 -0.00104804 -0.00117227
# -0.00104804 0.000889305 -0.00116559
# -0.00117227 -0.00116559 -0.000106866
#
# [:, :, 3] =
# -0.00183751 -0.00117227 0.00112968
# -0.00117227 -0.00116559 -0.000106866
# 0.00112968 -0.000106866 0.00131965
The optimal size of the blocks can be found in their software paper (https://arxiv.org/pdf/1701.05420.pdf), where they write (for proper latex formatting have a look at the paper):
5.2.1. The optimal size of blocks.
The number of coefficients required to store a super-symmetric tensor of order d and n dimensions is equal to (d+n−1 over n). The storage of tensor disregarding the super-symmetry requires n^d coefficients. The block structure introduced in [49] uses more than minimal amount of memory but allows for easier further processing of super-symmetric tensors. If we store the super-symmetric tensor in the block structure, the block size parameter b appears. In our implementation in order to store a super-symmetric tensor in the block structure we need, assuming n|b, an array of (n over b)^d pointers to blocks and an array of the same size of flags that contain the information if a pointer points to a valid block. Recall that diagonal blocks contain redundant information. Therefore on the one hand, the smaller the value of b, the less redundant elements on diagonals of the block structure. On the other hand, the larger the value of b, the smaller the number of blocks, the smaller the blocks’ operation overhead, and the fewer the number of pointers pointing to empty blocks. For detailed discussion of memory usage see [49]. The analysis of the influence of the parameter b on the computational time of cumulants for some parameters are presented in Fig. 2. We obtain the shortest computation time for b = 2 in almost all test cases, and this value will be set as default and used in all efficiency tests. Note that for b = 1 we loose all the memory savings.

Using Oskar's helpful answer, I thought I'd provide my wrapper function which accomplishes the goal of returning a vector of the first m cumulants, given an input of a 1D array of data.
using Cumulants
function mycumulants(d, m) # given a 1D array of data d, return a vector of the first m cumulants
res = zeros(m)
dm = reshape(d, length(d), 1) # Convert 1D array to 2D
c = cumulants(dm, m, 1) # Need the 1 (block size) or else it errors
for i in 1:m
res[i] = Array(c[i])[1]
end
return(res)
end
But it turns out this is really, really slow compared to directly calculating the raw moments u and converting them to cumulants k, e.g.
k[5] = u[5] - 5*u[4]*u[1] - 10*u[3]*u[2] + 20*u[3]*u[1]^2 + 30*u[2]^2*u[1] - 60*u[2]*u[1]^3 + 24*u[1]^5
so I think I won't be using Cumulants.jl after all for my purposes, which only involve univariate data at this time.
Example of time difference for calculating the first six cumulants from some simulated data:
----Data set 2----
Direct calculation:
1.997 ms (14 allocations: 469.47 KiB)
Cumulants.jl:
152.798 ms (318435 allocations: 17.59 MiB)
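For anyone who wants the direct route without typing out each formula by hand, here is a minimal sketch of the standard moment-to-cumulant recursion, k_n = u_n - sum_{j=1}^{n-1} C(n-1, j-1) k_j u_{n-j}. It is written in Python rather than Julia, and the names raw_moments and cumulants_from_moments are my own illustrations, not part of Cumulants.jl.

```python
from math import comb

def raw_moments(data, m):
    """Return the first m raw sample moments [u1, ..., um] of a 1D sequence."""
    n = len(data)
    return [sum(x ** r for x in data) / n for r in range(1, m + 1)]

def cumulants_from_moments(u):
    """Convert raw moments [u1, ..., um] into cumulants [k1, ..., km]."""
    k = []
    for n in range(1, len(u) + 1):
        # k_n = u_n - sum_{j=1}^{n-1} C(n-1, j-1) * k_j * u_{n-j}
        correction = sum(comb(n - 1, j - 1) * k[j - 1] * u[n - j - 1]
                         for j in range(1, n))
        k.append(u[n - 1] - correction)
    return k
```

As a sanity check, the exact raw moments of Exponential(1) are n!, and cumulants_from_moments([1, 2, 6, 24, 120]) returns [1, 1, 2, 6, 24], i.e. (n-1)!, which also agrees with the k[5] formula above.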

Related

Construct a bijective function to map arbitrary integer from [1, n] to [1, n] randomly

I want to construct a bijective function f(k, n, seed) from [1,n] to [1,n] where 1<=k<=n and 1<=f(k, n, seed)<=n for each given seed and n. The function actually should return a value from a random permutation of 1,2,...,n. The randomness is decided by the seed. Different seed may corresponds to different permutation. I want the function f(k, n, seed)'s time complexity to be O(1) for each 1<=k<=n and any given seed.
Anyone knows how can I construct such a function? The randomness is allowed to be pseudo-randomness. n can be very large (e.g. >= 1e8).
No matter how you do it, you will always have to store a list of numbers still available or numbers already used ... A simple possibility would be the following
const avail = [1,2,3, ..., n];
let random = new Random(seed)
function f(k,n) {
  let index = random.next(n - k);
  let result = avail[index];
  // move the last still-available element into the freed slot
  avail[index] = avail[n - k - 1];
  return result;
}
The assumptions for this are the following
the array avail is 0-indexed
random.next(x) creates a random integer i with 0 <= i < x
the first k to call the function f with is 0
f is called for contiguous k 0, 1, 2, 3, ..., n-1
The principle works as follows:
avail holds all numbers still available for the permutation. When you take a random index, the element at that index is the next element of the permutation. Then instead of slicing out that element from the array, which is quite expensive, you just replace the currently selected element with the last still-available element in the avail array. In the next iteration you (virtually) decrease the size of the avail array by 1 by decreasing the upper limit for the random by one.
I'm not sure how secure this random permutation is in terms of the distribution of the values, i.e. it may happen that a certain range of numbers is more likely to be at the beginning or at the end of the permutation.
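The scheme above is the Fisher-Yates shuffle carried out one step at a time, so with an unbiased random source every permutation is equally likely. A minimal Python sketch, assuming f is called with k = 0, 1, ..., n-1 in order (the name make_sequential_permutation is only for illustration):

```python
import random

def make_sequential_permutation(n, seed):
    """Return f(k) yielding the k-th element of a seeded permutation of 1..n.

    f must be called with k = 0, 1, ..., n-1 in that order.
    """
    rng = random.Random(seed)
    avail = list(range(1, n + 1))  # numbers not yet used

    def f(k):
        index = rng.randrange(n - k)     # uniform integer in [0, n-k)
        result = avail[index]
        avail[index] = avail[n - k - 1]  # overwrite with the last live element
        return result

    return f
```

Note that this still needs O(n) memory and sequential calls, so it does not give O(1) access to an arbitrary k.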
A simple, but not very 'random', approach would be to use the fact that, if a is relatively prime to n (ie they have no common factors), then
x-> (a*x + b)%n
is a permutation of {0,..n-1} to {0,..n-1}. To find the inverse of this, you can use the extended euclidean algorithm to find k and l so that
1 = gcd(a,n) = k*a+l*n
for then the inverse of the map above is
y -> (k*y + c) mod n
where c = -k*b mod n
So you could choose a to be a 'random' number in {0,..n-1} that is relatively prime to n, and b to be any number in {0,..n-1}
Note that you'll need to do this in 64 bit arithmetic to avoid overflow in computing a*x.
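A small sketch of this map and its inverse in Python, where integers do not overflow, so the 64-bit caveat does not apply. The helper name make_affine_permutation is made up for the example; pow(a, -1, n) computes the modular inverse k (Python 3.8+).

```python
from math import gcd

def make_affine_permutation(n, a, b):
    """Return (forward, inverse) for x -> (a*x + b) % n on {0, ..., n-1}."""
    if gcd(a, n) != 1:
        raise ValueError("a must be relatively prime to n")
    k = pow(a, -1, n)   # modular inverse of a: k*a = 1 (mod n)
    c = (-k * b) % n

    def forward(x):
        return (a * x + b) % n

    def inverse(y):
        return (k * y + c) % n

    return forward, inverse
```

Both directions are O(1) per call and need no stored table, at the cost of the permutation being very regular.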

Julia: Linking LAPACK 2.0 on Linux

I am using the eigs() function in Julia to compute eigenvalues and eigenvectors. The results are non-deterministic and often full of 0.0. A temporary solution is to link LAPACK 2.0.
Any idea how to do it on Ubuntu Linux? So far I have not been able to link it, and I do not have advanced Linux administration skills, so it would be good if someone could post a guide on how to link it correctly.
Thanks a lot.
Edit:
I wanted to add results but I noticed one flaw in my code. I was using matrix = sparse(map(collect,zip([triple(e,"weight") for e in edges(g)]...))..., num_vertices(g), num_vertices(g)). It comes from your answer to one of my questions. It works fine when vertices are indexed from 1. But my vertices have arbitrary indexes because I read them from a file, so I changed num_vertices to be equal to the largest index. I did not notice that this makes the computation treat the graph as having, for example, 1000 vertices when the largest index is 1000, even though the whole graph could consist of just 3 vertices, e.g. 1, 10 and 1000. Any idea how to fix it?
Edit 2:
#Content of matrix = matrix+matrix'
[2, 1] = 10.0
[3, 1] = 14.0
[1, 2] = 10.0
[3, 2] = 10.0
[5, 2] = 2.0
[1, 3] = 14.0
[2, 3] = 10.0
[4, 3] = 20.0
[5, 3] = 20.0
[3, 4] = 20.0
[2, 5] = 2.0
[3, 5] = 20.0
[6, 5] = 10.0
[5, 6] = 10.0
matrix = matrix+matrix'
(d, v) = eigs(matrix, nev=1, which=:LR, maxiter=1)
5 executions of code above:
[-0.3483956604402672
-0.3084333257587648
-0.6697046040724708
-0.37450798643794125
-0.4249810113292739
-0.11882760090004019]
[0.3483956604402674
0.308433325758765
0.6697046040724703
0.3745079864379416
0.424981011329274
0.11882760090004027]
[-0.3483956604402673
-0.308433325758765
-0.669704604072471
-0.37450798643794114
-0.4249810113292739
-0.1188276009000403]
[0.34839566044026726
0.30843332575876503
0.6697046040724703
0.37450798643794114
0.4249810113292739
0.11882760090004038]
[0.34839566044026715
0.30843332575876503
0.6697046040724708
0.3745079864379412
0.4249810113292738
0.11882760090004038]
The algorithm is indeed non-deterministic (as is obvious in the example in the question). But, there are two kinds of non-determinism in the answers:
the complete sign reversals of the eigenvector.
small accuracy errors.
If a vector is an eigenvector, so is every scalar multiple of it (mathematically, the eigenvector is part of a subspace of eigenvectors belonging to an eigenvalue). Thus, if v is an eigenvector, so is λv. When λ = -1 this is the sign reversal. But 2v is also an eigenvector. The eigs function normalizes the vectors to norm 1, so the only freedom left is this sign reversal. To solve this non-determinism, you can choose a sign for the first non-zero coordinate of the vector (say, positive) and multiply the eigenvector to make it so. In code:
v = v*sign(v[findfirst(v)])
Regarding the second non-determinism source (inaccuracies), it is important to note that the true eigenvalues and eigenvectors are often real numbers which cannot be accurately represented by Float64, thus the return values are always off. If the level of accuracy needed is low enough, rounding the values deterministically should make the resulting approximation the same. If this is not clear, consider an algorithm for calculating sqrt(2). It may be non-deterministic and return 1.4142135623730951 and sometimes 1.4142135623730949, but rounding to 5 decimal places would always yield 1.41421.
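Both fixes, choosing a sign and rounding, can be sketched together. This is plain Python on lists rather than Julia, and the function names are illustrative only:

```python
def canonical_sign(v, tol=0.0):
    """Flip v so that its first entry with |x| > tol is positive."""
    for x in v:
        if abs(x) > tol:
            return [-y for y in v] if x < 0 else list(v)
    return list(v)  # all-zero vector: nothing to fix

def canonical_eigenvector(v, digits=5):
    """Fix the sign, then round to a chosen number of decimal places."""
    return [round(x, digits) for x in canonical_sign(v)]
```

Applied to the first two vectors printed in the question, both calls give the same list. Rounding can still disagree when a value sits right on a rounding boundary, so this reduces the non-determinism rather than eliminating it in every case.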
The above should provide a guide to making the results more deterministic. But consider:
If there are multiple eigenvalues with the same value, the subspace of eigenvectors is more than 1 dimensional and there is more freedom to choose an eigenvector. This could make finding a deterministic vector (or vectors) to span this space more intricate.
Does the application really require this determinism?
(Thanks for the code bits - they do help. Even better when they can be quickly cut-and-pasted).

Julia : eigs() function returning different values after every evaluation

I noticed that when I run the eigs() function multiple times, each run gives a slightly different, though approximately equal, result.
Is there a way to make it return the same result every time? The output sometimes has a "+" sign and sometimes a "-" sign.
Content of M :
[2, 1] = 1.0
[3, 1] = 0.5
[1, 2] = 1.0
[3, 2] = 2.5
[1, 3] = 0.5
[2, 3] = 2.5
M = M+M'
(d, v) = eigs(M, nev=1, which=:LR)
I tried running the same function on the same sparse matrix in Python. Although the matrix looks a bit different, I think it is the same: the indices are simply numbered from 0, whereas in Julia they are numbered from 1. I do not know if that is a big difference. The values are approximately the same in Julia and Python, but in Python they are always the same after every evaluation. Also, the return values in Python are complex numbers, while in Julia they are real.
Python code:
Content of M.T :
(1, 0) 1.0
(2, 0) 0.5
(0, 1) 1.0
(2, 1) 2.5
(0, 2) 0.5
(1, 2) 2.5
from scipy.sparse import linalg
eigenvalue, eigenvector = linalg.eigs(M.T, k=1, which='LR')
Any idea why this behavior is occurring ?
Edit :
These are results of four evaluations of eigs
==========eigvalues==============
[2.8921298144977587]
===========eigvector=============
[-0.34667468634025667
-0.679134250677923
-0.6469878912367839]
=================================
==========eigvalues==============
[2.8921298144977596]
===========eigvector=============
[0.34667468634025655
0.6791342506779232
0.646987891236784]
=================================
==========eigvalues==============
[2.8921298144977596]
===========eigvector=============
[0.34667468634025655
0.6791342506779233
0.6469878912367841]
=================================
==========eigvalues==============
[2.8921298144977583]
===========eigvector=============
[0.3466746863402567
0.679134250677923
0.646987891236784]
=================================
The result of eigs depends on the initial vector for the Lanczos iterations. When not specified, it is random so even though all the vectors returned are correct the phase is not guaranteed to be the same over different iterations.
If you want the result to be the same every time, you can set v0 in eigs, e.g.
eigs(M, nev=1, which=:LR, v0 = ones(3))
As long as v0 doesn't change you should get deterministic results.
Note that if you want a deterministic result for testing purposes, you might want to consider a testing scheme that allows phase shifts since the phase can shift with the smallest perturbations. E.g. if you link a different BLAS or change the number of threads the result might change again.
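A testing scheme of that kind could look like the following sketch. It is plain Python, covers only the real-valued case where the "phase" is a sign, and the helper name equal_up_to_sign is hypothetical:

```python
import math

def equal_up_to_sign(u, v, tol=1e-8):
    """True if u is within tol of v or of -v (entrywise, absolute tolerance)."""
    if len(u) != len(v):
        return False
    same = all(math.isclose(a, b, rel_tol=0.0, abs_tol=tol)
               for a, b in zip(u, v))
    flipped = all(math.isclose(a, -b, rel_tol=0.0, abs_tol=tol)
                  for a, b in zip(u, v))
    return same or flipped
```

With the first two eigenvectors from the question as inputs, this returns True even though their signs are opposite.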

Vectorizing code to calculate (squared) Mahalanobis Distance

EDIT 2: this post seems to have been moved from CrossValidated to StackOverflow due to it being mostly about programming, but that means my fancy MathJax doesn't work anymore. Hopefully this is still readable.
Say I want to calculate the squared Mahalanobis distance between two vectors x and y with covariance matrix S. This is a fairly simple function defined by
M2(x, y; S) = (x - y)^T * S^-1 * (x - y)
With python's numpy package I can do this as
# x, y = numpy.ndarray of shape (n,)
# s_inv = numpy.ndarray of shape (n, n)
diff = x - y
d2 = diff.T.dot(s_inv).dot(diff)
or in R as
diff <- x - y
d2 <- t(diff) %*% s_inv %*% diff
In my case, though, I am given
m by n matrix X
n-dimensional vector mu
n by n covariance matrix S
and want to find the m-dimensional vector d such that
d_i = M2(x_i, mu; S) ( i = 1 .. m )
where x_i is the ith row of X.
This is not difficult to accomplish using a simple loop in python:
d = numpy.zeros((m,))
for i in range(m):
    diff = x[i,:] - mu
    d[i] = diff.T.dot(s_inv).dot(diff)
Of course, because the outer loop runs in python instead of in native code in the numpy library, it's not as fast as it could be. n and m are about 3-4 and several hundred thousand respectively, and I'm doing this somewhat often in an interactive program, so a speedup would be very useful.
Mathematically, the only way I've been able to formulate this using basic matrix operations is
d = diag( X' * S^-1 * X'^T )
where
x'_i = x_i - mu
which is simple to write a vectorized version of, but this is unfortunately outweighed by the inefficiency of calculating a 10-billion-plus element matrix and only taking the diagonal... I believe this operation should be easily expressible using Einstein notation, and thus could hopefully be evaluated quickly with numpy's einsum function, but I haven't even begun to figure out how that black magic works.
So, I would like to know: is there either a nicer way to formulate this operation mathematically (in terms of simple matrix operations), or could someone suggest some nice vectorized (python or R) code that does this efficiently?
BONUS QUESTION, for the brave
I don't actually want to do this once, I want to do it k ~ 100 times. Given:
m by n matrix X
k by n matrix U
Set of n by n covariance matrices each denoted S_j (j = 1..k)
Find the m by k matrix D such that
D_i,j = M2(x_i, u_j; S_j)
Where i = 1..m, j = 1..k, x_i is the ith row of X and u_j is the jth row of U.
I.e., vectorize the following code:
# s_inv is (k x n x n) array containing "stacked" inverses
# of covariance matrices
d = numpy.zeros( (m, k) )
for j in range(k):
    for i in range(m):
        diff = x[i, :] - u[j, :]
        d[i, j] = diff.T.dot(s_inv[j, :, :]).dot(diff)
First off, it seems like maybe you're getting S and then inverting it. You shouldn't do that; it's slow and numerically inaccurate. Instead, you should get the Cholesky factor L of S so that S = L L^T; then
M^2(x, y; L L^T)
= (x - y)^T (L L^T)^-1 (x - y)
= (x - y)^T L^-T L^-1 (x - y)
= || L^-1 (x - y) ||^2,
and since L is triangular L^-1 (x - y) can be computed efficiently.
As it turns out, scipy.linalg.solve_triangular will happily do a bunch of these at once if you reshape it properly:
import numpy as np
import scipy.linalg
L = np.linalg.cholesky(S)
y = scipy.linalg.solve_triangular(L, (X - mu[np.newaxis]).T, lower=True)
d = np.einsum('ij,ij->j', y, y)
Breaking that down a bit, y[i, j] is the ith component of L^-1 (X_j - \mu). The einsum call then does
d_j = \sum_i y_{ij} y_{ij}
= \sum_i y_{ij}^2
= || y_j ||^2,
like we need.
Unfortunately, solve_triangular won't vectorize across its first argument, so you should probably just loop there. If k is only about 100, that's not going to be a significant issue.
If you are actually given S^-1 rather than S, then you can indeed do this with einsum more directly. Since S is quite small in your case, it's also possible that actually inverting the matrix and then doing this would be faster. As soon as n is a nontrivial size, though, you're throwing away a lot of numerical accuracy by doing this.
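For the single-mean case with S^-1 already in hand, a minimal einsum sketch could look like this. The function name mahalanobis2_einsum is just for illustration, and s_inv is assumed to be the precomputed inverse as in the question:

```python
import numpy as np

def mahalanobis2_einsum(X, mu, s_inv):
    """Squared Mahalanobis distance of each row of X to mu, given S^-1."""
    diff = X - mu  # shape (m, n); mu is broadcast over the rows
    # d_i = sum_{k,l} diff[i,k] * s_inv[k,l] * diff[i,l]
    return np.einsum('ik,kl,il->i', diff, s_inv, diff)
```

This computes only the m diagonal entries of X' S^-1 X'^T and never forms the m by m matrix.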
To figure out what to do with einsum, write everything in terms of components. I'll go straight to the bonus case, writing S_j^-1 = T_j for notational convenience:
D_{ij} = M^2(x_i, u_j; S_j)
= (x_i - u_j)^T T_j (x_i - u_j)
= \sum_k (x_i - u_j)_k ( T_j (x_i - u_j) )_k
= \sum_k (x_i - u_j)_k \sum_l (T_j)_{k l} (x_i - u_j)_l
= \sum_{k l} (X_{i k} - U_{j k}) (T_j)_{k l} (X_{i l} - U_{j l})
So, if we make arrays X of shape (m, n), U of shape (k, n), and T of shape (k, n, n), then we can write this as
diff = X[np.newaxis, :, :] - U[:, np.newaxis, :]
D = np.einsum('jik,jkl,jil->ij', diff, T, diff)
where diff[j, i, k] = X[i, k] - U[j, k].
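If you want to convince yourself that the index string is right, a quick check against the double loop from the question is easy to write. The function names here are made up for the check; T holds the stacked inverse covariance matrices, shape (k, n, n):

```python
import numpy as np

def mahalanobis2_batched(X, U, T):
    """D[i, j] = (x_i - u_j)^T T_j (x_i - u_j) via a single einsum call."""
    diff = X[np.newaxis, :, :] - U[:, np.newaxis, :]  # shape (k, m, n)
    return np.einsum('jik,jkl,jil->ij', diff, T, diff)

def mahalanobis2_loops(X, U, T):
    """Reference implementation with explicit loops, for comparison only."""
    m, k = X.shape[0], U.shape[0]
    D = np.zeros((m, k))
    for j in range(k):
        for i in range(m):
            d = X[i, :] - U[j, :]
            D[i, j] = d.dot(T[j]).dot(d)
    return D
```

On small random inputs, np.allclose(mahalanobis2_batched(X, U, T), mahalanobis2_loops(X, U, T)) should hold.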
Dougal nailed this one with an excellent and detailed answer, but thought I'd share a small modification that I found increases efficiency in case anyone else is trying to implement this. Straight to the point:
Dougal's method was as follows:
def mahalanobis2(X, mu, sigma):
    L = np.linalg.cholesky(sigma)
    y = scipy.linalg.solve_triangular(L, (X - mu[np.newaxis,:]).T, lower=True)
    return np.einsum('ij,ij->j', y, y)
A mathematically equivalent variant I tried is
def mahalanobis2_2(X, mu, sigma):
    # Cholesky decomposition of inverse of covariance matrix
    # (Doing this in either order should be equivalent)
    linv = np.linalg.cholesky(np.linalg.inv(sigma))
    # Just do regular matrix multiplication with this matrix
    y = (X - mu[np.newaxis,:]).dot(linv)
    # Same as above, but note different index at end because the matrix
    # y is transposed here compared to above
    return np.einsum('ij,ij->i', y, y)
Ran both versions head-to-head 20x using identical random inputs and recorded the times (in milliseconds). For X as a 1,000,000 x 3 matrix (mu and sigma 3 and 3x3) I get:
Method 1 (min/max/avg): 30/62/49
Method 2 (min/max/avg): 30/47/37
That's about a 30% speedup for the 2nd version. I'm mostly going to be running this in 3 or 4 dimensions but to see how it scaled I tried X as 1,000,000 x 100 and got:
Method 1 (min/max/avg): 970/1134/1043
Method 2 (min/max/avg): 776/907/837
which is about the same improvement.
I mentioned this in a comment on Dougal's answer but adding here for additional visibility:
The first pair of methods above take a single center point mu and covariance matrix sigma and calculate the squared Mahalanobis distance to each row of X. My bonus question was to do this multiple times with many sets of mu and sigma and output a two-dimensional matrix. The set of methods above can be used to accomplish this with a simple for loop, but Dougal also posted a more clever example using einsum.
I decided to compare these methods with each other by using them to solve the following problem: Given k d-dimensional normal distributions (with centers stored in rows of k by d matrix U and covariance matrices in the last two dimensions of the k by d by d array S), find the density at the n points stored in rows of the n by d matrix X.
The density of a multivariate normal distribution is a function of the squared Mahalanobis distance of the point to the mean. Scipy has an implementation of this as scipy.stats.multivariate_normal.pdf to use as a reference. I ran all three methods against each other 10x using identical random parameters each time, with d=3, k=96, n=5e5. Here are the results, in points/sec:
[Method]: (min/max/avg)
Scipy: 1.18e5/1.29e5/1.22e5
Fancy 1: 1.41e5/1.53e5/1.48e5
Fancy 2: 8.69e4/9.73e4/9.03e4
Fancy 2 (cheating version): 8.61e4/9.88e4/9.04e4
where Fancy 1 is the better of the two methods above and Fancy 2 is Dougal's 2nd solution. Since Fancy 2 needs to calculate the inverses of all the covariance matrices, I also tried a "cheating version" where it was passed these as a parameter, but it looks like that didn't make a difference. I had planned on including the non-vectorized implementation but that was so slow it would have taken all day.
What we can take away from this is that using Dougal's first method is about 20% faster than however Scipy does it. Unfortunately despite its cleverness the 2nd method is only about 60% as fast as the first. There are probably some other optimizations that can be done but this is already fast enough for me.
I also tested how this scaled with higher dimensionality. With d=100, k=96, n=1e4:
Scipy: 7.81e3/7.91e3/7.86e3
Fancy 1: 1.03e4/1.15e4/1.08e4
Fancy 2: 3.75e3/4.10e3/3.95e3
Fancy 2 (cheating version): 3.58e3/4.09e3/3.85e3
Fancy 1 seems to have an even bigger advantage this time. Also worth noting that Scipy threw a LinAlgError 8/10 times, probably because some of my randomly-generated 100x100 covariance matrices were close to singular (which may mean that the other two methods are not as numerically stable, I did not actually check the results).

Generate 3 random numbers that sum to 1 in R

I am hoping to create 3 (non-negative) quasi-random numbers that sum to one, and repeat over and over.
Basically I am trying to partition something into three random parts over many trials.
While I am aware of
a = runif(3,0,1)
these of course don't sum to one. I was thinking that I could use 1-a as the max in the next runif, but it seems messy.
Any thoughts, oh wise stackoverflow-ers?
This question involves subtler issues than might be at first apparent. After looking at the following, you may want to think carefully about the process that you are using these numbers to represent:
## My initial idea (and commenter Anders Gustafsson's):
## Sample 3 random numbers from [0,1], sum them, and normalize
jobFun <- function(n) {
    m <- matrix(runif(3*n,0,1), ncol=3)
    m <- sweep(m, 1, rowSums(m), FUN="/")
    m
}
## Andrie's solution. Sample 1 number from [0,1], then break upper
## interval in two. (aka "Broken stick" distribution).
andFun <- function(n){
    x1 <- runif(n)
    x2 <- runif(n)*(1-x1)
    matrix(c(x1, x2, 1-(x1+x2)), ncol=3)
}
## ddzialak's solution (vectorized by me)
ddzFun <- function(n) {
    a <- runif(n, 0, 1)
    b <- runif(n, 0, 1)
    rand1 = pmin(a, b)
    rand2 = abs(a - b)
    rand3 = 1 - pmax(a, b)
    cbind(rand1, rand2, rand3)
}
## Simulate 10k triplets using each of the functions above
JOB <- jobFun(10000)
AND <- andFun(10000)
DDZ <- ddzFun(10000)
## Plot the distributions of values
par(mfcol=c(2,2))
hist(JOB, main="JOB")
hist(AND, main="AND")
hist(DDZ, main="DDZ")
Just draw two random numbers from (0, 1). Calling them a and b, you get:
rand1 = min(a, b)
rand2 = abs(a - b)
rand3 = 1 - max(a, b)
When you want to randomly generate numbers that add to 1 (or some other value) then you should look at the Dirichlet Distribution.
There is an rdirichlet function in the gtools package and running RSiteSearch('Dirichlet') brings up quite a few hits that could easily lead you to tools for doing this (and it is not hard to code by hand either for simple Dirichlet distributions).
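Coding it by hand is indeed short: normalising independent gamma draws gives a Dirichlet sample. Here is a sketch in Python rather than R, using only the standard library; the name dirichlet_sample is illustrative, not a library function:

```python
import random

def dirichlet_sample(alphas, rng=random):
    """Draw one Dirichlet(alphas) vector by normalising independent gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(g)
    return [x / total for x in g]
```

With alphas = [1, 1, 1] the result is uniform over all non-negative triples summing to 1, which is the same distribution you get from cutting [0, 1] at two uniform points.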
I guess it depends on what distribution you want on the numbers, but here is one way:
diff(c(0, sort(runif(2)), 1))
Use replicate to get as many sets as you want:
> x <- replicate(5, diff(c(0, sort(runif(2)), 1)))
> x
           [,1]       [,2]      [,3]      [,4]       [,5]
[1,] 0.66855903 0.01338052 0.3722026 0.4299087 0.67537181
[2,] 0.32130979 0.69666871 0.2670380 0.3359640 0.25860581
[3,] 0.01013117 0.28995078 0.3607594 0.2341273 0.06602238
> colSums(x)
[1] 1 1 1 1 1
I would simply randomly select 3 numbers from uniform distribution and then divide by their sum:
n <- 3
x <- runif(n, 0, 1)
y <- x / sum(x)
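isTRUE(all.equal(sum(y), 1)) # compare with a tolerance; sum(y) == 1 can be FALSE because of floating-point rounding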
n could be any number you like.
This problem and the different solutions proposed intrigued me. I did a little test of the three basic algorithms suggested and what average values they would yield for the numbers generated.
choose_one_and_divide_rest
means: [ 0.49999212 0.24982403 0.25018384]
standard deviations: [ 0.28849948 0.22032758 0.22049302]
time needed to fill array of size 1000000 was 26.874945879 seconds
choose_two_points_and_use_intervals
means: [ 0.33301421 0.33392816 0.33305763]
standard deviations: [ 0.23565652 0.23579615 0.23554689]
time needed to fill array of size 1000000 was 28.8600130081 seconds
choose_three_and_normalize
means: [ 0.33334531 0.33336692 0.33328777]
standard deviations: [ 0.17964206 0.17974085 0.17968462]
time needed to fill array of size 1000000 was 27.4301018715 seconds
The time measurements are to be taken with a grain of salt as they might be more influenced by the Python memory management than by the algorithm itself. I'm too lazy to do it properly with timeit. I did this on 1GHz Atom so that explains why it took so long.
Anyway, choose_one_and_divide_rest is the algorithm suggested by Andrie and the poster of the question him/herself (AND): you choose one value a in [0,1], then one in [a,1] and then you look what you have left. It adds up to one but that's about it, the first division is twice as large as the other two. One might have guessed as much ...
choose_two_points_and_use_intervals is the accepted answer by ddzialak (DDZ). It takes two points in the interval [0,1] and uses the size of the three sub-intervals created by these points as the three numbers. Works like a charm and the means are all 1/3.
choose_three_and_normalize is the solution by Anders Gustafsson and Josh O'Brien (JOB). It just generates three numbers in [0,1] and normalizes them back to a sum of 1. Works just as well and surprisingly a little bit faster in my Python implementation. The variance is a bit lower than for the second solution.
There you have it. No idea to what beta distribution these solutions correspond or which set of parameters in the corresponding paper I referred to in a comment but maybe someone else can figure that out.
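For reference, the three schemes can be sketched as follows. This is a fresh minimal Python version using only the standard library, not the script that produced the numbers above; the function names follow the labels used in the results, and column_means is a helper added for the example.

```python
import random

def choose_one_and_divide_rest(rng):
    x1 = rng.random()
    x2 = rng.random() * (1.0 - x1)   # a random share of what is left
    return [x1, x2, 1.0 - x1 - x2]

def choose_two_points_and_use_intervals(rng):
    a, b = sorted((rng.random(), rng.random()))
    return [a, b - a, 1.0 - b]       # lengths of the three sub-intervals

def choose_three_and_normalize(rng):
    x = [rng.random() for _ in range(3)]
    total = sum(x)
    return [v / total for v in x]

def column_means(sampler, trials, seed=0):
    """Average each of the three components over many seeded trials."""
    rng = random.Random(seed)
    sums = [0.0, 0.0, 0.0]
    for _ in range(trials):
        for i, v in enumerate(sampler(rng)):
            sums[i] += v
    return [s / trials for s in sums]
```

Calling column_means on each sampler with a large number of trials should reproduce the pattern of means reported above: roughly 1/2, 1/4, 1/4 for the first scheme and roughly 1/3 each for the other two.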
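The simplest solution is the probs() function from the wakefield package.
probs(3) will yield a vector of three values with a sum of 1.
To do this "over and over", use replicate(x, probs(3)), where x is the number of repetitions; rep(probs(3), x) would only repeat a single draw x times.
No drama.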
