Dealing with floating point errors in sums of probabilities [duplicate]

We know that the prob argument in sample is used to assign probability weights to the sampled values.
For example,
table(sample(1:4, 1e6, replace = TRUE, prob = c(0.2, 0.4, 0.3, 0.1)))/1e6
# 1 2 3 4
#0.2 0.4 0.3 0.1
table(sample(1:4, 1e6, replace = TRUE, prob = c(0.2, 0.4, 0.3, 0.1)))/1e6
# 1 2 3 4
#0.200 0.400 0.299 0.100
In this example the probabilities sum to exactly 1 (0.2 + 0.4 + 0.3 + 0.1), so the observed frequencies match prob. But what if the probabilities do not sum to 1? What output would it give? I thought it would raise an error, but it returns values anyway.
When the probabilities sum to more than 1:
table(sample(1:4, 1e6, replace = TRUE, prob = c(0.2, 0.5, 0.5, 0.1)))/1e6
# 1 2 3 4
#0.1544 0.3839 0.3848 0.0768
table(sample(1:4, 1e6, replace = TRUE, prob = c(0.2, 0.5, 0.5, 0.1)))/1e6
# 1 2 3 4
#0.1544 0.3842 0.3848 0.0767
When the probabilities sum to less than 1:
table(sample(1:4, 1e6, replace = TRUE, prob = c(0.1, 0.1, 0.5, 0.1)))/1e6
# 1 2 3 4
#0.124 0.125 0.625 0.125
table(sample(1:4, 1e6, replace = TRUE, prob = c(0.1, 0.1, 0.5, 0.1)))/1e6
# 1 2 3 4
#0.125 0.125 0.625 0.125
As we can see, running it multiple times gives output that does not match prob, yet the results are not random either. How are the numbers distributed in this case? Where is it documented?
I tried searching on the internet but didn't find any relevant information. I looked through the documentation at ?sample, which says:
The optional prob argument can be used to give a vector of weights for obtaining the elements of the vector being sampled. They need not sum to one, but they should be non-negative and not all zero. If replace is true, Walker's alias method (Ripley, 1987) is used when there are more than 200 reasonably probable values: this gives results incompatible with those from R < 2.2.0.
So it says that the prob argument need not sum to 1, but it doesn't say what to expect when it doesn't. Am I missing some part of the documentation? Does anybody have any idea?

Good question. The docs are unclear on this, but the question can be answered by reviewing the source code.
If you look at the R code, sample always delegates to another R function, sample.int. If you pass a single number x to sample, it uses sample.int to draw a vector of integers less than or equal to that number, whereas if x is a vector, it uses sample.int to generate a sample of integers less than or equal to length(x), then uses those to subset x.
Now, if you examine the function sample.int, it looks like this:
function (n, size = n, replace = FALSE, prob = NULL,
          useHash = (!replace && is.null(prob) && size <= n/2 && n > 1e+07))
{
    if (useHash)
        .Internal(sample2(n, size))
    else .Internal(sample(n, size, replace, prob))
}
The .Internal call means the sampling is done by compiled code written in C: in this case, the function do_sample, defined in src/main/random.c.
If you look at this C code, do_sample checks whether it has been passed a prob vector. If not, it samples on the assumption of equal weights. If prob exists, the function ensures that it is numeric and not NA. If prob passes these checks, a pointer to the underlying array of doubles is generated and passed to another function in random.c called FixUpProbs.
This function examines each member of prob and throws an error if any element is negative or not finite. It then normalises the weights by dividing each by the sum of all. There is therefore no requirement at all that prob sums to 1 inherent in the code. That is, even if prob sums to 1 in your input, the function will still compute the sum and divide each element by it.
Therefore, the parameter is poorly named. It should be "weights", as others here have pointed out. To be fair, the docs only say that prob should be a vector of weights, not absolute probabilities.
So the behaviour of the prob parameter from my reading of the code should be:
prob can be absent altogether, in which case sampling defaults to equal weights.
If any of prob's values are negative, infinite, or NA, the function will throw an error.
An error should be thrown if any of the prob values are non-numeric, as they will be interpreted as NA in the SEXP passed to the C code.
prob must have the same length as x, or the C code throws an error.
You can pass a zero probability as one or more elements of prob if you have specified replace = TRUE, as long as at least one probability is non-zero.
If you specify replace = FALSE, the number of samples you request must be at most the number of non-zero elements in prob. Essentially, FixUpProbs will throw an error if you ask it to sample an element with zero probability.
A valid prob vector will be normalised to sum to 1 and used as sampling weights.
As an interesting side effect of this behaviour, you can use odds instead of probabilities when choosing between 2 alternatives, by setting prob = c(1, odds).
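This normalisation is easy to check empirically: with the same seed, raw weights and their normalised counterparts produce exactly the same draws. A small sketch (the weights c(1, 3) are odds of 3:1 in favour of the second value):

```r
set.seed(1)
a <- sample(1:2, 20, replace = TRUE, prob = c(1, 3))        # raw weights (odds 3:1)
set.seed(1)
b <- sample(1:2, 20, replace = TRUE, prob = c(0.25, 0.75))  # already normalised
identical(a, b)  # TRUE: c(1, 3) is rescaled internally to c(0.25, 0.75)
```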

As already mentioned, the weights are normalized to sum to 1, as can be demonstrated with the weights from your first non-summing example, x <- c(0.2, 0.5, 0.5, 0.1):
> x <- c(0.2, 0.5, 0.5, 0.1)
> x/sum(x)
[1] 0.15384615 0.38461538 0.38461538 0.07692308
This matches your simulated tabulated data:
# 1 2 3 4
#0.1544 0.3839 0.3848 0.0768

Related

How to generate a series of number by function of sample in R, with given different probability in each try?

For example I have a vector about possibility is
myprob <- c(0.58, 0.51, 0.48, 0.46, 0.62)
And I want to sample a series of 1s and 0s, each time with probabilities c(1-myprob, myprob),
which means the first number in the series samples 1 and 0 with prob (0.42, 0.58), the second with (0.49, 0.51), and so on.
How can I generate the 5 numbers with sample?
The syntax of
Y <- sample(c(1,0), 1, replace=F, prob=c(1-myprob, myprob))
has an incorrect number of probabilities and outputs only 1 number,
while the syntax of
Y <- sample(c(1,0), 5, replace=F, prob=c(1-myprob, myprob))
seems to concentrate the probabilities on only 0.62 (or not, I am not sure, but the results do not seem correct at all).
Thanks for any reply in advance!
If myprob is the probability of drawing 1 for each iteration, then you can use rbinom, with n = 5 and size = 1 (5 iterations of a 1-0 draw).
set.seed(2)
rbinom(n = 5, size = 1, prob = myprob)
[1] 1 0 1 0 0
Maël already proposed a great solution sampling from a binomial distribution. There are probably many more alternatives and I just wanted to suggest two of them:
runif()
as.integer(runif(5) < myprob)
This will first generate a series of 5 uniformly distributed random numbers between 0 and 1, then compare that vector against myprob (a uniform draw falls below p with probability p, so each comparison is TRUE with probability myprob) and convert the logical values TRUE/FALSE to 1/0.
vapply(sample())
vapply(myprob, function(p) sample(1:0, 1, prob = c(1-p, p)), integer(1))
This is what you may have been looking for in the first place. This executes the sample() command by iterating over the values of myprob as p and returns the 5 draws as a vector.

dmultinom function for Multinomial distribution R

The function dmultinom(x, size = NULL, prob, log = FALSE) computes probabilities of a multinomial distribution. However, it does not run with size = 1.
Theoretically, when setting size=1 the Multinomial distribution should be equivalent to the Categorical distribution.
Does anybody know why it gives this error message?
FYI, Categorical distribution can be modelled by dist.Categorical {LaplacesDemon}.
Examples:
dmultinom(c(1,2,1),size = 1,prob = c(0.3,0.5,0.4))
Error in dmultinom(c(1, 2, 1), size = 1, prob = c(0.3, 0.5, 0.4)) :
size != sum(x)
dcat(c(1,2,1),p = c(0.3,0.5,0.4))
[1] 0.3 0.5 0.3
Thanks
LaplacesDemon::dcat and stats::dmultinom do two different things. If you have multiple observations dcat takes a vector of category values, whereas dmultinom takes a single vector response, so you have to construct a matrix of responses and use apply (or something).
library(LaplacesDemon)
probs <- c(0.3,0.5,0.2)
dcat(c(1,2,1), p = probs) ## ans: 0.3 0.5 0.3
x <- matrix(c(1, 0, 0,
              0, 1, 0,
              1, 0, 0),
            nrow = 3, byrow = TRUE)
apply(x, 1, dmultinom, size = 1, prob = probs)
(I modified your example because your original probabilities, c(0.3,0.5,0.4), don't add up to 1; neither function gives you a warning, but dmultinom automatically rescales the probabilities to sum to 1.)
If I try dmultinom(c(1,2,1),p=probs, size=1) I get
size != sum(x)
that is, dmultinom is interpreting c(1,2,1) as "one sample from group 1, two samples from group 2, 1 from group 3", which isn't consistent with a total sample size of 1 ...
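Put differently, with size = 1 a valid x is a one-hot vector, and the multinomial density reduces to the categorical probability of the selected class. A quick check (using the rescaled probabilities c(0.3, 0.5, 0.2)):

```r
probs <- c(0.3, 0.5, 0.2)
# A one-hot x is consistent with size = 1: a single draw landing in one category
dmultinom(c(0, 1, 0), size = 1, prob = probs)  # 0.5, i.e. P(category 2)
```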

Confusion Between 'sample' and 'rbinom' in R

Why are these not equivalent?
#First generate 10 numbers between 0 and .5
set.seed(1)
x <- runif(10, 0, .5)
These are the two statements I'm confused by:
#First
sample(rep(c(0,1), length(x)), size = 10, prob = c(rbind(1-x,x)), replace = F)
#Second
rbinom(length(x), size = 1, prob=x)
I was originally trying to use 'sample'. What I thought I was doing was generating ten (0,1) pairs, then assigning the probability that each would return either a 0 or a 1.
The second one works and gives me the output I need (trying to run a sim). So I've been able to solve my problem. I'm just curious as to what's going on under the hood with 'sample' so that I can understand R better.
The first area of difference is where each function specifies the length of the output vector in its parameter list; the name size has a different meaning in the two functions. (I hadn't thought about that source of confusion before, and I'm sure I have made this error myself many times.)
The random number generators (functions starting with r and ending with a distribution suffix) take the output length as their first parameter, n, whereas sample takes it as its second parameter, size. In sample, the draw is from the values in the first argument, and size is the length of the vector to create. In rbinom, n is the length of the vector to create, while size is the number of items hypothetically drawn from a theoretical urn whose distribution is determined by prob; each returned value is the number of "ones" drawn. Try:
rbinom(length(x), size = 10, prob=x)
Regarding the argument to prob: I don't think you need the c().
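To make the size distinction concrete, here is a small sketch: with size = 10, each of the length(x) returned values is a count of successes out of ten trials, so every value lands between 0 and 10.

```r
set.seed(1)
x <- runif(10, 0, .5)
draws <- rbinom(length(x), size = 10, prob = x)  # 10 counts, one per element of x
range(draws)  # each count lies between 0 and 10
```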
The difference between the two functions is quite simple.
Think of a pack of shuffled cards, and choose a number of cards from it. That is exactly the situation that sample simulates.
This code,
> set.seed(123)
> sample(1:40, 5)
[1] 12 31 16 33 34
randomly extracts five numbers from the 1:40 vector of numbers.
In your example, you set size = 1. It means you choose only one element from the pool of possible values. If you set size = 10 you will get ten values as you desire.
set.seed(1)
x <- runif(10, 0, .5)
> sample(rep(c(0,1), length(x)), size = 10, prob = c(rbind(1-x,x)), replace = F)
[1] 0 0 0 0 0 0 0 1 0 1
Instead, the goal of the rbinom function is to simulate events with discrete outcomes, such as the flip of a coin. Its prob parameter is the probability of success on a single trial; for a fair coin flip that probability is 0.5. Here we simulate 100 flips. If you think the coin could be biased in favor of one specific outcome, we can simulate that behaviour by setting the probability to 0.8, as in the example below.
> set.seed(123)
> table(rbinom(100, 1, prob = 0.5))
0 1
53 47
> table(rbinom(100, 1, prob = 0.8))
0 1
19 81

"sample" and "rbinom" functions in R

I guess it has been asked before, but I'm still a bit rusty about "sample" and "rbinom" functions in R, and would like to ask the following two simple questions:
a) Let's say we have:
rbinom(n = 5, size = 1, prob = c(0.9,0.2,0.3))
So "n" = 5 but "prob" is only given for three of them. What values does R assign for the remaining two?
b) Let's say we have:
sample(x = 1:3, size = 1, prob = c(.5,0.2,0.9))
According to R-help (?sample):
The optional prob argument can be used to give a vector of weights
for obtaining the elements of the vector being sampled.
They need not sum to one, but they should be non-negative and not all zero.
The question would be: why does "prob" not need to sum to one?
Any answers would be very appreciated: thank you!
From the documentation for rbinom:
The numerical arguments other than n are recycled to the length of the result.
This means that in your example the prob vector you pass in will be recycled until it reaches the required length (presumably 5). So the vector which will be used is:
c(0.9, 0.2, 0.3, 0.9, 0.2)
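The same recycling can be reproduced explicitly with rep_len, which repeats a vector until it reaches a target length:

```r
rep_len(c(0.9, 0.2, 0.3), 5)
# [1] 0.9 0.2 0.3 0.9 0.2
```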
As for the sample function, as @thelatemail pointed out, the probabilities do not have to sum to 1. The prob vector gets normalized to sum to 1 internally.

Function that return TRUE with a given probability

I'm looking for a function that returns TRUE with a given probability. Something like:
> proba = 2/3
> function(proba)
It returns TRUE (or 1) with a probability of 2/3 and it returns FALSE (or 0) with a probability of 1/3
The only way to compute that I can think of is:
> sample(c(rep(1,ceiling(proba*100)),rep(0,ceiling((1-proba)*100))),1)
but it gives only an approximation (and it is not really good looking!) as it can only deal with values that have a finite number of decimals.
proba <- 2/3
# number of values:
n <- 1
as.logical(rbinom(n,size=1,prob=proba))
prob <- runif(1) > 1/3 will do it for you (writing 1/3 directly avoids the truncated-decimal approximation). Or in the general case,
prob <-function(winval) runif(1)>(1-winval)
How about:
function(proba) sample(c(TRUE, FALSE), 1, prob = c(proba, 1 - proba))
And if you want to be able to draw any number of TRUE/FALSE, not just one:
function(proba, size) sample(c(TRUE, FALSE), size, prob = c(proba, 1 - proba),
replace = TRUE)
Just for reference, you can avoid the doubt about the fractional representation of your probabilities by creating the total population and then performing a selection, like so:
sample(c(rep(TRUE, 2), rep(FALSE, 1)), 1)
OR
sample(c(TRUE, TRUE, FALSE), 1)
Usually, we use probabilities to represent the selection likelihood of a population of unknown or impractically large size. Probability is used as a proxy. When you know the details of the population, using the exact population is actually preferred from a mathematical perspective. It also has the side effect of being a more accurate representation of this specific problem.
To extend the solution, you would need to convert your probabilities into a population total for each population subset. In this case, we have two subsets: TRUE, and FALSE. Instead of representing the selection likelihood of a TRUE individual as 2/3, you would instead state the number of TRUE individuals contained in the total population TRUE_N, and the number of FALSE individuals contained in the total population FALSE_N.
TRUE_N <- 2
FALSE_N <- 1
sample(c(rep(TRUE, TRUE_N), rep(FALSE, FALSE_N)), 1)
