Generate 3 random numbers that sum to 1 in R

I am hoping to create 3 (non-negative) quasi-random numbers that sum to one, and repeat over and over.
Basically I am trying to partition something into three random parts over many trials.
While I am aware of
a = runif(3,0,1)
I was thinking that I could use 1-a as the max in the next runif, but it seems messy.
But these of course don't sum to one. Any thoughts, oh wise stackoverflow-ers?

This question involves subtler issues than might be at first apparent. After looking at the following, you may want to think carefully about the process that you are using these numbers to represent:
## My initial idea (and commenter Anders Gustafsson's):
## Sample 3 random numbers from [0,1], sum them, and normalize
jobFun <- function(n) {
m <- matrix(runif(3*n,0,1), ncol=3)
m<- sweep(m, 1, rowSums(m), FUN="/")
m
}
## Andrie's solution. Sample 1 number from [0,1], then break upper
## interval in two. (aka "Broken stick" distribution).
andFun <- function(n){
x1 <- runif(n)
x2 <- runif(n)*(1-x1)
matrix(c(x1, x2, 1-(x1+x2)), ncol=3)
}
## ddzialak's solution (vectorized by me)
ddzFun <- function(n) {
a <- runif(n, 0, 1)
b <- runif(n, 0, 1)
rand1 = pmin(a, b)
rand2 = abs(a - b)
rand3 = 1 - pmax(a, b)
cbind(rand1, rand2, rand3)
}
## Simulate 10k triplets using each of the functions above
JOB <- jobFun(10000)
AND <- andFun(10000)
DDZ <- ddzFun(10000)
## Plot the distributions of values
par(mfcol=c(2,2))
hist(JOB, main="JOB")
hist(AND, main="AND")
hist(DDZ, main="DDZ")

Just draw two random numbers a and b from (0, 1); then you get:
rand1 = min(a, b)
rand2 = abs(a - b)
rand3 = 1 - max(a, b)

When you want to randomly generate numbers that add to 1 (or some other value) then you should look at the Dirichlet Distribution.
There is an rdirichlet function in the gtools package and running RSiteSearch('Dirichlet') brings up quite a few hits that could easily lead you to tools for doing this (and it is not hard to code by hand either for simple Dirichlet distributions).
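As a rough illustration of coding a simple Dirichlet draw by hand, the sketch below uses the standard construction of normalizing independent gamma variates. The function name rdirichlet_sketch and its alpha argument are made up for this example and are not part of gtools; with alpha = c(1, 1, 1) the triplets are uniform over all partitions of 1 into three parts.

```r
# Hypothetical helper (not from gtools): draw n Dirichlet vectors by
# normalizing independent gamma variates with shapes alpha.
rdirichlet_sketch <- function(n, alpha = c(1, 1, 1)) {
  k <- length(alpha)
  # The matrix is filled column by column, so column j uses shape alpha[j]
  g <- matrix(rgamma(n * k, shape = rep(alpha, each = n)), ncol = k)
  # Divide every row by its own sum so that each row adds up to 1
  g / rowSums(g)
}

set.seed(1)
parts <- rdirichlet_sketch(1000)
head(parts, 3)
```

Each row of parts is one random partition into three non-negative pieces.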

I guess it depends on what distribution you want on the numbers, but here is one way:
diff(c(0, sort(runif(2)), 1))
Use replicate to get as many sets as you want:
> x <- replicate(5, diff(c(0, sort(runif(2)), 1)))
> x
[,1] [,2] [,3] [,4] [,5]
[1,] 0.66855903 0.01338052 0.3722026 0.4299087 0.67537181
[2,] 0.32130979 0.69666871 0.2670380 0.3359640 0.25860581
[3,] 0.01013117 0.28995078 0.3607594 0.2341273 0.06602238
> colSums(x)
[1] 1 1 1 1 1

I would simply randomly select 3 numbers from uniform distribution and then divide by their sum:
n <- 3
x <- runif(n, 0, 1)
y <- x / sum(x)
isTRUE(all.equal(sum(y), 1))  # sum(y) == 1 can be FALSE because of floating-point rounding
n could be any number you like.

This problem and the different solutions proposed intrigued me. I did a little test of the three basic algorithms suggested and what average values they would yield for the numbers generated.
choose_one_and_divide_rest
means: [ 0.49999212 0.24982403 0.25018384]
standard deviations: [ 0.28849948 0.22032758 0.22049302]
time needed to fill array of size 1000000 was 26.874945879 seconds
choose_two_points_and_use_intervals
means: [ 0.33301421 0.33392816 0.33305763]
standard deviations: [ 0.23565652 0.23579615 0.23554689]
time needed to fill array of size 1000000 was 28.8600130081 seconds
choose_three_and_normalize
means: [ 0.33334531 0.33336692 0.33328777]
standard deviations: [ 0.17964206 0.17974085 0.17968462]
time needed to fill array of size 1000000 was 27.4301018715 seconds
The time measurements are to be taken with a grain of salt as they might be more influenced by the Python memory management than by the algorithm itself. I'm too lazy to do it properly with timeit. I did this on 1GHz Atom so that explains why it took so long.
Anyway, choose_one_and_divide_rest is the algorithm suggested by Andrie and the poster of the question him/herself (AND): you choose one value a in [0,1], then split the remaining interval [a,1] at a second random point, and take what is left as the third part. It adds up to one but that's about it: on average the first part is twice as large as each of the other two. One might have guessed as much ...
choose_two_points_and_use_intervals is the accepted answer by ddzialak (DDZ). It takes two points in the interval [0,1] and uses the size of the three sub-intervals created by these points as the three numbers. Works like a charm and the means are all 1/3.
choose_three_and_normalize is the solution by Anders Gustafsson and Josh O'Brien (JOB). It just generates three numbers in [0,1] and normalizes them back to a sum of 1. Works just as well and surprisingly a little bit faster in my Python implementation. The variance is a bit lower than for the second solution.
There you have it. No idea to what beta distribution these solutions correspond or which set of parameters in the corresponding paper I referred to in a comment but maybe someone else can figure that out.
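For readers who want to repeat this comparison in R rather than Python, here is a minimal sketch. The object names (one_and_rest, two_points, three_norm, summarise_parts) are chosen for this example only, and the sample size is an arbitrary assumption.

```r
set.seed(42)
n <- 1e5

# choose_one_and_divide_rest (AND)
x1 <- runif(n)
x2 <- runif(n) * (1 - x1)
one_and_rest <- cbind(x1, x2, 1 - x1 - x2)

# choose_two_points_and_use_intervals (DDZ)
a <- runif(n)
b <- runif(n)
two_points <- cbind(pmin(a, b), abs(a - b), 1 - pmax(a, b))

# choose_three_and_normalize (JOB)
m <- matrix(runif(3 * n), ncol = 3)
three_norm <- m / rowSums(m)

# Column means and standard deviations for one set of triplets
summarise_parts <- function(x) rbind(mean = colMeans(x), sd = apply(x, 2, sd))

lapply(list(AND = one_and_rest, DDZ = two_points, JOB = three_norm),
       summarise_parts)
```

The exact figures vary with the seed, but the pattern of means and standard deviations should resemble the Python results above.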

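The simplest solution is the probs() function in the wakefield package.
probs(3) will yield a vector of three values with a sum of 1.
To do this "over and over", use replicate(x, probs(3)), where x is the number of repetitions; rep(probs(3), x) would only repeat the same three values.
No drama.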

Related

R function to find difference in mean greater than or equal to a specific number

I have just started my basic statistics course using R and we're studying how to use R for paired t-tests. I have come across questions where we're given two sets of data and asked to test whether the difference in means is equal to 0 or greater than 0, and so on. The function we use for two samples x and y with an unknown variance is similar to the one below:
t.test(x, y, var.equal=TRUE, alternative="greater")
My question is, how would we do this if we wanted to test that the difference in means is greater than or equal to a specified number, against the alternative that it's less than that number, rather than 0?
For example, say we're given two data sets with the before and after weights of 10 people. How do we test that the mean difference in weight is greater than or equal to, say, 3kg against the alternative that the mean difference in weight is less than 3kg? Is there a way to do this? Would really appreciate any guidance on this matter.
It might be worthwhile posting on https://stats.stackexchange.com/ as well if you're in need of more theoretical proof. Is it ok to add/subtract the 3kg from either x or y and then use the t-test to check for similarity? I think this would tell you at least which outcome is more likely, if that's the end goal. It would be good to get feedback on this
# number of obs, and rnorm dist for simulating
N <- 10
mu <- 70
sd <- 10
set.seed(1)
x <- round(rnorm(N, mu, sd), 1)
# three outcomes
# (1) no change
y_same <- x + round(rnorm(N, 0, 5), 1)
# (2) average increase of 3
y_imp <- x + rnorm(N, 3, 5)
# (3) average decrease of 3
y_dec <- x + rnorm(N, -3, 5)
# say y_imp is true
y_act <- y_imp
# can we test whether we're closer to the output by altering
# the original data? or conversely, altering y_imp
t_inc <- t.test(x+3, y_act, var.equal=TRUE, alternative="two.sided")
t_dec <- t.test(x-3, y_act, var.equal=TRUE, alternative="two.sided")
t_inc$p.value
[1] 0.8279801
t_dec$p.value
[1] 0.0956033
# one with the highest p.value has the closest distribution, so
# +3 kg more likely than -3kg
You can set mu=3 to change the null hypothesis from 0 to 3 assuming your x variables are in the units you describe above.
t.test(x, y, mu=3, alternative="greater", paired=TRUE)
More (general) information on Stack Exchange: https://stats.stackexchange.com/questions/206316/can-a-paired-or-two-group-t-test-test-if-the-difference-between-two-means-is-l/206317#206317
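A small sketch of the mu argument on simulated paired data may help. Note that alternative names the alternative hypothesis, so for the hypotheses in the question (null: mean difference of at least 3kg; alternative: less than 3kg) it would be "less". The variable names and the simulated weights are assumptions made for this example.

```r
set.seed(1)
before <- round(rnorm(10, mean = 70, sd = 10), 1)
after  <- before + rnorm(10, mean = 3, sd = 5)

# H0: mean(after - before) >= 3   versus   H1: mean(after - before) < 3
res_paired <- t.test(after, before, mu = 3, alternative = "less", paired = TRUE)

# Equivalent one-sample t-test on the within-person differences
res_diff <- t.test(after - before, mu = 3, alternative = "less")

res_paired$p.value
res_diff$p.value
```

Both calls give the same test statistic and p-value, because a paired t-test is a one-sample t-test on the differences.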

R: draw from a vector using custom probability function

Forgive me if this has been asked before (I feel it must have, but could not find precisely what I am looking for).
How can I draw one element from a vector of whole numbers (from 1 through, say, 10) using a probability function that specifies different chances for the elements? If I want equal probabilities I use runif() to get a number between 1 and 10:
ceiling(runif(1,0,10))
How do I similarly sample from e.g. the exponential distribution to get a number between 1 and 10 (such that 1 is much more likely than 10), or a logistic probability function (if I want a sigmoid increasing probability from 1 through 10).
The only "solution" I can come up with is first to draw 1e6 numbers from, say, the sigmoid distribution and then scale min and max to 1 and 10 - but this looks clumsy.
UPDATE:
This awkward solution (and I don't feel it is very "correct") would go like this
#Draw enough from a distribution, here exponential
x <- rexp(1e3)
#Scale probs to e.g. 1-10
scaler <- function(vector, min, max){
(((vector - min(vector)) * (max - min))/(max(vector) - min(vector))) + min
}
x_scale <- scaler(x,1,10)
#And sample once (and round it)
round(sample(x_scale,1))
Are there not better solutions around ?
I believe sample() is what you are looking for, as @HubertL mentioned in the comments. You can specify an increasing function (e.g. logit()) and pass the vector you want to sample from v as an input. You can then use the output of that function as a vector of probabilities p. See the code below.
logit <- function(x) {
return(exp(x)/(exp(x)+1))
}
v <- c(seq(1,10,1))
p <- logit(seq(1,10,1))
sample(v, 1, prob = p, replace = TRUE)
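The same idea covers the exponential case from the question. The sketch below is one possible reading: it evaluates a density or distribution function at 1 through 10 and uses the values as weights. The rate, location and scale values are arbitrary assumptions, and sample() rescales the weights itself.

```r
set.seed(123)
v <- 1:10

# Exponentially decreasing weights: 1 is much more likely than 10
w_exp <- dexp(v, rate = 0.5)

# Sigmoid-shaped increasing weights, centred on the middle of the range
w_sig <- plogis(v, location = 5.5, scale = 1)

# Many draws at once, to make the shape of each distribution visible
draws_exp <- sample(v, 1e4, replace = TRUE, prob = w_exp)
draws_sig <- sample(v, 1e4, replace = TRUE, prob = w_sig)

table(draws_exp)
table(draws_sig)
```

For a single draw, replace 1e4 with 1.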

To find the distance between two roots in R

Suppose I have a function f(x) that is well defined on an interval I. I want to find the greatest and smallest roots of f(x), then taking the difference of them. What is a good way to program it?
To be precise, f can at worst be a rational function like (1+x)/(1-x). It should be a (high degree) polynomial most of the times. I only need to know the result numerically to some precision.
I am thinking about the following:
1. Convert f(x) into a form recognizable by R. (I can do this.)
2. Use R to list all roots of f(x) on I. (I found that the uniroot function only gives me one root.)
3. Use R to find the maximum and minimum elements in the list. (This should be possible once I have converted it to a vector.)
4. Take the difference of the two roots. (This should be trivial.)
I am stuck on step (2) and I do not know what to do. My professor gave a brute-force solution, suggesting that I do the following:
1. Divide the interval I into one million pieces.
2. Evaluate f at each endpoint, and find the endpoints where f>=0.
3. Choose the maximum and minimum elements from the set formed in step 2.
4. Take the difference between them.
I feel this way is not very efficient and might not work for all f in general, but I am having trouble to implement it even for quadratics. I do not know how to do step (2) as well. So I want to ask for a hint or some toy examples.
At this point I am trying to implement the following code:
Y=rep(0,200)
dim(Y)=c(100,2)
for(i in 1:100){
X=rnorm(9,0,1)
Z=rnorm(16,0,1)
a=0.64
b=a*sum(Z^2)/sum(X^2)
root_intervals <- function(f, interval, n = 1e6) {
xvals <- seq(interval[1], interval[2], length = n)
yvals <- f(xvals)
ypos <- yvals > 0
x1 <- which(tail(ypos, -1) != head(ypos, -1))
x2 <- x1 + 1
## so all the zeroes we can see are between x1 and x2
return(cbind(xvals[x1], xvals[x2]))
}
Up to here everything is okay, but when I try to extract the roots into Y[i,1], Y[i,2] with
Y[i,1]=(ri<-root intervals(function(x)(x/(a*x+b))^{9/2}*(1/((1-a)+a*(1-a)/b*x))^4-0.235505, c(0,40),n=1e6)[1]
I found I cannot evaluate it anymore. R keeps telling me
Error: unexpected symbol in:
"}
Y[i,1]=(ri<-root intervals"
and I got stuck. I really appreciate everyone's help as I am feeling lost.
I checked the function's expression many times using the plot function and it has no grammar mistakes. Also I believe it is well defined for all X in the interval.
This should give you a good start on the brute force solution. You're right, it's not elegant, but for relatively simple univariate functions, evaluating 1 million points is trivial.
root_intervals <- function(f, interval, n = 1e6) {
xvals <- seq(interval[1], interval[2], length = n)
yvals <- f(xvals)
ypos <- yvals > 0
x1 <- which(ypos[-1] != head(ypos, -1))
x2 <- x1 + 1
## so all the zeroes we can see are between x1 and x2
return(cbind(xvals[x1], xvals[x2]))
}
This function returns a two column matrix of x values, where the function changes sign between column 1 and column 2:
f1 <- function (x) 0.05 * x^5 - 2 * x^4 + x^3 - x^2 + 1
> (ri <- root_intervals(f1, c(-10, 10), n = 1e6))
[,1] [,2]
[1,] -0.6372706 -0.6372506
[2,] 0.8182708 0.8182908
> f1(ri)
[,1] [,2]
[1,] -3.045326e-05 6.163467e-05
[2,] 2.218895e-05 -5.579081e-05
Wolfram Alpha confirms results on the specified interval.
The top and bottom rows will be the min and max intervals found. These intervals (over which the function changes sign) are precisely what uniroot wants for its interval, so you could use it to solve for the (more) exact roots. Of course, if the function changes sign twice within one interval (or any even number of times), it won't be picked up, so choose a big n!
Response to edited question:
Looks like you're trying to define a bunch of functions, but your edits have syntax errors. Here's what I think you're trying to do: (this first part might take some more work to work right)
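## Build each function in its own environment so it keeps its own a and b;
## functions defined directly in the loop would all see the last values.
make_fun <- function(a, b) {
force(a); force(b)
function(x){(x/(a*x+b))^{9/2}*(1/((1-a)+a*(1-a)/b*x))^4-0.235505}
}
my_funs <- list()
Y=rep(0,200)
dim(Y)=c(100,2)
for(i in 1:100){
X=rnorm(9,0,1)
Z=rnorm(16,0,1)
a=0.64
b=a*sum(Z^2)/sum(X^2)
my_funs[[i]] <- make_fun(a, b)
}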
Here's using the root_intervals on the first of your generated functions.
> root_intervals(my_funs[[1]], interval = c(0, 40))
[,1] [,2]
[1,] 0.8581609 0.8582009
[2,] 11.4401314 11.4401714
Notice the output, a matrix, with the roots of the function being between the first and second columns. Being a matrix, you can't assign it to a vector. If you want a single root, use uniroot using each row to set the upper and lower bounds. This is left as an exercise to the reader.
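One way the uniroot step could look is sketched below, using the f1 polynomial from earlier in this answer. root_intervals is repeated so the snippet runs on its own; the grid size and tolerance are arbitrary assumptions, and the names roots and root_span are chosen for this example.

```r
# Same bracketing helper as above, repeated so this snippet is self-contained
root_intervals <- function(f, interval, n = 1e6) {
  xvals <- seq(interval[1], interval[2], length.out = n)
  ypos <- f(xvals) > 0
  x1 <- which(ypos[-1] != head(ypos, -1))
  cbind(xvals[x1], xvals[x1 + 1])
}

f1 <- function(x) 0.05 * x^5 - 2 * x^4 + x^3 - x^2 + 1

# Each row of ri brackets one sign change of f1
ri <- root_intervals(f1, c(-10, 10), n = 1e5)

# Refine every bracket with uniroot, one row at a time
roots <- apply(ri, 1, function(iv) uniroot(f1, interval = iv, tol = 1e-10)$root)

# Distance between the largest and smallest root found
root_span <- max(roots) - min(roots)
root_span
```

Inside the loop from the question, the result for the i-th function could then be stored with something like Y[i, ] <- range(roots), assuming at least one sign change was found.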

Fastest way to sample real values using a proportional probability

Given a numeric vector with N real numbers, what's the fastest way to sample k values, such that higher values have greater probability of being selected?
mathematically
prob(X) > prob(Y) when X > Y (Linearly)
This is easy with sample() when all entries are positive, just use the prob arg:
N = 1000
k = 600
x = runif(N, 0, 10)
results = sample(x, k, replace = TRUE, prob = x)
But it doesn't work in my case, because some values might be negative. I cannot drop or ignore negative values; that's the problem.
So, what's the fastest (in terms of code speed) way of doing this? Obviously I know how to solve this; the issue is code speed. One method should be slower than the other, I guess:
1 - Normalize the x vector (a call to `range()` would be necessary + division)
2 - Sum max(x) to x (a call to `max()` then sum)
Thanks.
A few comments. First, it's still not exactly clear what you want. Obviously, you want larger numbers to be chosen with higher probability, but there are a lot of ways of doing this. For example, either rank(x) or x-min(x) will produce a vector of non-negative weights which are monotonic in x.
Another point, you don't need to normalize the weights, because sample will do that for you, provided that the weights are non-negative:
> set.seed(1)
> sample(1:10,prob=1:10)
[1] 9 8 6 2 10 3 1 5 7 4
> set.seed(1)
> sample(1:10,prob=(1:10)/sum(1:10))
[1] 9 8 6 2 10 3 1 5 7 4
On edit: The OP is now asking for a weighting function which is "linear" in the input vector. Technically this is impossible, because linear functions are of the form f(X)=cX, so if a vector x contains both positive and negative values, then any linear function of x will also contain both positive and negative values, unless c=0, in which case it still does not give a valid vector of probability weights.
I think what you mean by "linear" is simply x-min(x). This is not a linear function, but an affine function. Moreover, even if you had specified that you wanted P(X) to vary as an affine function of X, that still would not have uniquely determined the probability weights, because there are an infinite number of possible affine functions that would yield valid weights (e.g. x-min(x)+1, etc.)
In any case, assuming x-min(x) is what you want, the question now becomes, what is the fastest way to compute x-min(x) in R. And I'm pretty sure that the answer is just x-min(x).
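Under that assumption, a minimal sketch of the whole sampling step could look like this. The variable names w and res are chosen for this example. One consequence worth knowing is that the smallest element gets weight 0 and can never be drawn; adding a positive constant to the weights would avoid that.

```r
set.seed(1)
x <- rnorm(1000)   # contains both negative and positive values
k <- 600

# Shift so that all weights are non-negative; sample() normalizes them itself
w <- x - min(x)

res <- sample(x, k, replace = TRUE, prob = w)

# Larger values are favoured, so the sampled mean sits above the overall mean
c(mean_all = mean(x), mean_sampled = mean(res))
```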
Finally, for constants anywhere near what you have in your example, there is not much point in trying to optimize the calculation of weights, because the random sampling is going to take much longer anyway. For example:
> library(microbenchmark)
> x<-rnorm(1000)
> k<-600
> p<-x-min(x)
> microbenchmark(x-min(x),sample(x,k,T,p))
Unit: microseconds
expr min lq median uq max neval
x - min(x) 6.56 6.9105 7.0895 7.2515 13.629 100
sample(x, k, T, p) 50.30 51.4360 51.7695 52.1970 66.196 100

Plot a table of binomial distributions in R

For a game design issue, I need to better inspect binomial distributions. Using R, I need to build a two-dimensional table that, given the fixed parameters 'pool' (the number of dice rolled) and 'sides' (the number of sides of the die), has:
In rows --> minimum for a success (ranging from 0 to sides, it's a discrete distribution)
In columns --> number of successes (ranging from 0 to pool)
I know how to calculate it as a single task, but I'm not sure on how to iterate to fill the entire table
EDIT: I forgot to say that I want to calculate the probability p of gaining at least the number of successes.
OK, I think this could be a simple solution. It has the ratio of successes in rows and the success thresholds on the dice roll (p) in columns.
poolDistribution <- function(n, sides=10, digits=2, roll.Under=FALSE){
m <- 1:sides
names(m) <- paste(m,ifelse(roll.Under,"-", "+"),sep="")
s <- 1:n
names(s) <- paste(s,n,sep="/")
sapply(m, function(m.value) round((if(roll.Under) (1 - pbinom(s - 1, n, (m.value)/sides))*100 else (1 - pbinom(s - 1, n, (sides - m.value + 1)/sides))*100), digits=digits))
}
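For the layout described in the question (thresholds in rows, number of successes from 0 to pool in columns), a compact alternative sketch uses outer(). It assumes a die succeeds when it shows at least the threshold; the names threshold, successes, p_success and tab are chosen for this example.

```r
pool  <- 3    # number of dice rolled
sides <- 10   # sides per die

threshold <- 1:sides   # minimum face value that counts as a success
successes <- 0:pool    # number of successes required

# Chance that a single die shows at least the threshold
p_success <- (sides - threshold + 1) / sides

# P(at least s successes among pool dice) for every threshold/s combination
tab <- outer(p_success, successes,
             function(p, s) pbinom(s - 1, pool, p, lower.tail = FALSE))
dimnames(tab) <- list(paste0(threshold, "+"), as.character(successes))

round(tab, 3)
```

The column for 0 successes is always 1, since rolling at least zero successes is certain.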
This gets you half of the way.
If you are new to R, you might miss out on the fact that a very powerful feature is that you can use a vector of values as an index to another vector. This makes part of the problem trivially easy:
pool <- 3
sides <- 20 # <cough>D&D<cough>
# you need to store the values somewhere, use a vector
NumberOfRollsPerSide <- rep(0, sides)
names(NumberOfRollsPerSide) <- 1:sides # this will be useful in table
## Repeat so long as there are still zeros
## ie, so long as there is a side that has not come up yet
while (any(NumberOfRollsPerSide == 0)) {
# roll once
oneRoll <- sample(1:sides, pool, TRUE)
# add (+1) to each sides' total rolls
# note that you can use the roll outcome to index the vector. R is great.
NumberOfRollsPerSide[oneRoll] <- NumberOfRollsPerSide[oneRoll] + 1
}
# These are your results:
NumberOfRollsPerSide
All you have left to do now is count, for each side, in which roll number it first came up.

Resources