R Precision for Double - Why code returns negative why positive outcome expected? - r

I am testing 2 ways of calculating Prod(b-a), where a and b are vectors of length n. Prod(b-a)=(b1-a1)(b2-a2)(b3-a3)*... (bn-an), where b_i>a_i>0 for all i=1,2,3, n. For some special cases, another way (Method 2) of calculation this prod(b-a) is more efficient. It uses the following formula, which is to expand the terms and sum them:
Here is my question is: When it happens that a_i very close to b_i, the true outcome could be very, very close 0, something like 10^(-16). Method 1 (substract and Multiply) always returns positive output. Method 2 of using the formula some times return negative output ( about 7~8% of time returning negative for my experiment). Mathematically, these 2 methods should return exactly the same output. But in computer language, it apparently produces different outputs.
Here are my codes to run the test. When I run the testing code for 10000 times, about 7~8% of my runs for method 2 returns negative output. According to the official document, the R double has the precision of "2.225074e-308" as indicated by R parameter: ".Machine$double.xmin". Why it's getting into the negative values when the differences are between 10^(-16) ~ 10^(-18)? Any help that sheds light on this will be apprecaited. I would also love some suggestions concerning how to practically increase the precision to higher level as indicated by R document.
########## Testing code 1.
ftest1case<-function(a,b) {
n<-length(a)
if (length(b)!=n) stop("--------- length a and b are not right.")
if ( any(b<a) ) stop("---------- b has to be greater than a all the time.")
out1<-prod(b-a)
out2<-0
N<-2^n
for ( i in 1:N ) {
tidx<-rev(as.integer(intToBits(x=i-1))[1:n])
tsign<-ifelse( (sum(tidx)%%2)==0,1.0,-1.0)
out2<-out2+tsign*prod(b[tidx==0])*prod(a[tidx==1])
}
c(out1,out2)
}
########## Testing code 2.
ftestManyCases<-function(N,printFreq=1000,smallNum=10^(-20))
{
tt<-matrix(0,nrow=N,ncol=2)
n<-12
for ( i in 1:N) {
a<-runif(n,0,1)
b<-a+runif(n,0,1)*0.1
tt[i,]<-ftest1case(a=a,b=b)
if ( (i%%printFreq)==0 ) cat("----- i = ",i,"\n")
if ( tt[i,2]< smallNum ) cat("------ i = ",i, " ---- Negative summation found.\n")
}
tout<-apply(tt,2,FUN=function(x) { round(sum(x<smallNum)/N,6) } )
names(tout)<-c("PerLess0_Method1","PerLee0_Method2")
list(summary=tout, data=tt)
}
######## Step 1. Test for 1 case.
n<-12
a<-runif(n,0,1)
b<-a+runif(n,0,1)*0.1
ftest1case(a=a,b=b)
######## Step 2 Test Code 2 for multiple cases.
N<-300
tt<-ftestManyCases(N=N,printFreq = 100)
tt[[1]]

It's hard for me to imagine when an algorithm that consists of generating 2^n permutations and adding them up is going to be more efficient than a straightforward product of differences, but I'll take your word for it that there are some special cases where it is.
As suggested in comments, the root of your problem is the accumulation of floating-point errors when adding values of different magnitudes; see here for an R-specific question about floating point and here for the generic explanation.
First, a simplified example:
n <- 12
set.seed(1001)
a <- runif(a,0,1)
b <- a + 0.01
prod(a-b) ## 1e-24
out2 <- 0
N <- 2^n
out2v <- numeric(N)
for ( i in 1:N ) {
tidx <- rev(as.integer(intToBits(x=i-1))[1:n])
tsign <- ifelse( (sum(tidx)%%2)==0,1.0,-1.0)
j <- as.logical(tidx)
out2v[i] <- tsign*prod(b[!j])*prod(a[j])
}
sum(out2v) ## -2.011703e-21
Using extended precision (with 1000 bits of precision) to check that the simple/brute force calculation is more reliable:
library(Rmpfr)
a_m <- mpfr(a, 1000)
b_m <- mpfr(b, 1000)
prod(a_m-b_m)
## 1.00000000000000857647286522936696473705868726043995807429578968484409120647055193862325070279593735821154440625984047036486664599510856317884962563644275433171621778761377125514191564456600405460403870124263023336542598111475858881830547350667868450934867675523340703947491662460873009229537576817962228e-24
This proves the point in this case, but in general doing extended-precision arithmetic will probably kill any performance gains you would get.
Redoing the permutation-based calculation with mpfr values (using out2 <- mpfr(0, 1000), and going back to the out2 <- out2 + ... running summation rather than accumulating the values in a vector and calling sum()) gives an accurate answer (at least to the first 20 or so digits, I didn't check farther), but takes 6.5 seconds on my machine (instead of 0.03 seconds when using regular floating-point).
Why is this calculation problematic? First, note the difference between .Machine$double.xmin (approx 2e-308), which is the smallest floating-point value that the system can store, and .Machine$double.eps (approx 2e-16), which is the smallest value such that 1+x > x, i.e. the smallest relative value that can be added without catastrophic cancellation (values a little bit bigger than this magnitude will experience severe, but not catastrophic, cancellation).
Now look at the distribution of values in out2v, the series of values in out2v:
hist(out2v)
There are clusters of negative and positive numbers of similar magnitude. If our summation happens to add a bunch of values that almost cancel (so that the result is very close to 0), then add that to another value that is not nearly zero, we'll get bad cancellation.
It's entirely possible that there's a way to rearrange this calculation so that bad cancellation doesn't happen, but I couldn't think of one easily.

Related

For loop in R to find expectation of N

I just need help writing a for loop because I'm so new at this I literally can't get it to work. I understand the math and can get an answer for one iteration, but I need multiple (let's say 100 iterations).
What I'm trying to do:
Generate 10 random uniformly distributed numbers.
Take the cumulative product and define it to be N.
So, N <- cumprod(U) and say that cumprod(U) >= exp(-3).
It's either TRUE or FALSE for cumprod(U) being >= exp(-3) for each random number multiplied.
Perform sum(N) which returns how many TRUE values there were. This number tells me how many times we had to multiply the randomly generated numbers together before we got below the value exp(-3). If I do this many many times, I should find that the expected value for N is around 3.
When I run the code below I get one answer, which is fine and expected, but what I can't figure out how to do since I'm not good at coding is how to get this code below to repeat itself 100 times (or 200 or 300, or whatever I choose). Can someone please help?
U <- runif(10)
N <- cumprod(U) >= exp(-3)
sum(N)
You do not need an explicit loop:
val <- exp(-3)
results <- replicate(100, sum(cumprod(runif(10)) >= val))
quantile(results)
table(results)
mean(results)

Finding the Proportion of a specific difference between the average of two vectors

I have a question for an assignment I'm doing.
Q:
"Set the seed at 1, then using a for-loop take a random sample of 5 mice 1,000 times. Save these averages.
What proportion of these 1,000 averages are more than 1 gram away from the average of x ?"
I understand that basically, I need to write a code that says: What percentage of "Nulls" is +or- 1 gram from the average of "x." I'm not really certain how to write that given that this course hasn't given us the information on how to do that yet is asking us to do so. Any help on how to do so?
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv"
filename <- basename(url)
download(url, destfile=filename)
x <- unlist( read.csv(filename) )
set.seed(1)
n <- 1000
nulls<-vector("numeric", n)
for(i in 1:n){
control <- sample(x, 5)
nulls[i] <-mean(control)
##I know my last line for this should be something like this
## mean(nulls "+ or - 1")> or < mean(x)
## not certain if they're asking for abs() to be involved.
## is the question asking only for those that are 1 gram MORE than the avg of x?
}
Thanks for any help.
Z
I do think that the absolute distance is what they're after here.
Vectors in R are nice in that you can just perform arithmetic operations between a vector and a scalar and it will apply it element-wise, so computing the absolute value of nulls - mean(x) is easy. The abs function also takes vectors as arguments.
Logical operators (such as < and >) can also be used in the same way, making it equally simple to compare the result with 1. This will yield a vector of booleans (TRUE/FALSE) where TRUE means the value at that index was indeed greater than 1, but booleans are really just numbers (1 or 0), so you can just sum that vector to find the number of TRUE elements.
I don't know what programming level you are on, but I hope this helps without giving the solution away completely (since you said it's for an assignment).

How can I compute ratios of large factorials without unnecessary overflows?

I'm writing a program (in R, in case that matters) in which I need to compute the number of unique permutations of a vector of elements, which can contain repeated values. The mathematical formula for this is straightforward: the factorial of the total number of elements divided by the product of the factorials of the counts of each unique element. However, calculating the result naively is very likely to lead to overflows even when the actual answer is not very large. For example:
# x has 200 elements, but 199 of them are identical
x <- c(rep(1, 199), 2)
num_unique_permutations <- factorial(length(x)) / prod(factorial(table(x)))
If this didn't overflow, then num_unique_permutations would be 200!/(199!*1!) = 200. However, both 200! and 199! overflow the max value of a double, so the actual result is NaN. Is there a good way to do this calculation that will always avoid overflows (or underflows) as long as the answer itself doesn't overflow? (Or perhaps, as long as it doesn't come within a factor of length(x) of overflowing?)
(Note that R uses doubles for most numerical calculations, but the problem is not specific to doubles. Any numeric type with a range has the same problem. Also, I don't care about losing a bit of precision to floating point math, since I'm just using this to get a rough upper bound on something.)
In base R use lfactorial, to compute the logarithms of the numerator and of the denominator. Then exponentiate the appropriate difference.
numer <- lfactorial(length(x))
denom <- sum(lfactorial(table(x)))
exp(numer - denom)
#[1] 200
This can be easily written as a function.
num_unique_permutations <- function(x){
numer <- lfactorial(length(x))
denom <- sum(lfactorial(table(x)))
exp(numer - denom)
}
num_unique_permutations(x)
#[1] 200
You can use the gmp library.
library(gmp)
factorial(as.bigz(length(x))) / prod(factorial(as.bigz(table(x))))
#[1] 200

Generate permutations in sequential order - R

I previously asked the following question
Permutation of n bernoulli random variables in R
The answer to this question works great, as long as n is relatively small (<30), otherwise the following error code occurs Error: cannot allocate vector of size 4.0 Gb. I can get the code to run with somewhat larger values by using my desktop at work but eventually the same error occurs. Even for values that my computer can handle, say 25, the code is extremely slow.
The purpose of this code to is to calculate the difference between the CDF of an exact distribution (hence the permutations) and a normal approximation. I randomly generate some data, calculate the test statistic and then I need to determine the CDF by summing all the permutations that result in a smaller test statistic value divided by the total number of permutations.
My thought is to just generate the list of permutations one at a time, note if it is smaller than my observed value and then go on to the next one, i.e. loop over all possible permutations, but I can't just have a data frame of all the permutations to loop over because that would cause the exact same size and speed issue.
Long story short: I need to generate all possible permutations of 1's and 0's for n bernoulli trials, but I need to do this one at a time such that all of them are generated and none are generated more than once for arbitrary n. For n = 3, 2^3 = 8, I would first generate
000
calculate if my test statistic was greater (1 or 0) then generate
001
calculate again, then generate
010
calculate, then generate
100
calculate, then generate
011
etc until 111
I'm fine with this being a loop over 2^n, that outputs the permutation at each step of the loop but doesn't save them all somewhere. Also I don't care what order they are generated in, the above is just how I would list these out if I was doing it by hand.
In addition if there is anyway to speed up the previous code that would also be helpful.
A good solution for your problem is iterators. There is a package called arrangements that is able to generate permutations in an iterative fashion. Observe:
library(arrangements)
# initialize iterator
iperm <- ipermutations(0:1, 3, replace = T)
for (i in 1:(2^3)) {
print(iperm$getnext())
}
[1] 0 0 0
[1] 0 0 1
.
.
.
[1] 1 1 1
It is written in C and is very efficient. You can also generate m permutations at a time like so:
iperm$getnext(m)
This allows for better performance because the next permutations are being generated by a for loop in C as opposed to a for loop in R.
If you really need to ramp up performance you can you the parallel package.
iperm <- ipermutations(0:1, 40, replace = T)
parallel::mclapply(1:100, function(x) {
myPerms <- iperm$getnext(10000)
# do something
}, mc.cores = parallel::detectCores() - 1)
Note: All code is untested.

Expected value of the difference between a sum of variables and a threshold

I had a custom deck consisting of eight cards of the sequence 2^n, n=0,..,6. I draw cards (without replacement) until the sum is equal or greater than the threshold. How can I implement in R a function that calculates the mean of the difference between the sum and the threshold??
I tried to do it using this How to store values in a vector with nested functions
but it takes ages... I think there is a way to do it with probabilities/simulations but I can figure out.
The threshold could be greater than the value of one single card, ie, threshold=500 or less than the value of a single card, ie, threshold=50
What I have done so far is to find all the subsets that meet the condition of the sum greater or equal to the threshold. Then I will only substract the threshold and calculate the mean.
I am using the following code in R. For a small set I get the answer quite fast. However, I have been running the function for several ours with the set containing the 56 numbers and is still working.
set<-c(rep(1,8),rep(2,8), rep(4,8),rep(8,8),rep(16,8),rep(32,8),rep(64,8))
recursive.subset <-function(x, index, current, threshold, result){
for (i in index:length(x)){
if (current + x[i] >= threshold){
store <<- append(store, sum(c(result,x[i])))
} else {
recursive.subset(x, i + 1, current+x[i], threshold, c(result,x[i]))
}
}
}
store <- vector()
inivector <- vector(mode="numeric", length=0) #initializing empty vector
recursive.subset (set, 1, 0, threshold, inivector)
I don't know if it is possible to get an exact solution, simply because there are so many possible combinations. It is probably better to do simulations, i.e. write a script for 1 full draw and then rerun that script many times. Since the solutions are very similar, the simulation should give a pretty good approximation.
Ok, here goes:
set <- rep(2^(0:6), each = 8)
thr <- 500
fun <- function(set,thr){
x <- cumsum(sample(set))
value <- x[min(which(x >= thr))]
value
}
system.time(a <- replicate(100000, fun(set,thr)))
# user system elapsed
# 1.10 0.00 1.09
mean(a - thr)
# [1] 21.22992
Explanation: Rather than drawing a card one at a time, I draw all cards simultaneously (sample) and then calculate the cumulative sum (cumsum). I then find the point where the cards at up to the threshold or larger, and find the sum of those cards back in x. We run this function many times with replicate, to obtain a vector of the values. We use mean(a-thr) to calculate the mean difference.
Edit: Made a really stupid typo in the code, fixed it now.
Edit2: Shortened the function a little.

Resources