I'm writing a program (in R, in case that matters) in which I need to compute the number of unique permutations of a vector of elements, which can contain repeated values. The mathematical formula for this is straightforward: the factorial of the total number of elements divided by the product of the factorials of the counts of each unique element. However, calculating the result naively is very likely to lead to overflows even when the actual answer is not very large. For example:
# x has 200 elements, but 199 of them are identical
x <- c(rep(1, 199), 2)
num_unique_permutations <- factorial(length(x)) / prod(factorial(table(x)))
If this didn't overflow, then num_unique_permutations would be 200!/(199!*1!) = 200. However, both 200! and 199! overflow the max value of a double, so the actual result is NaN. Is there a good way to do this calculation that will always avoid overflows (or underflows) as long as the answer itself doesn't overflow? (Or perhaps, as long as it doesn't come within a factor of length(x) of overflowing?)
(Note that R uses doubles for most numerical calculations, but the problem is not specific to doubles. Any numeric type with a range has the same problem. Also, I don't care about losing a bit of precision to floating point math, since I'm just using this to get a rough upper bound on something.)
In base R, use lfactorial to compute the logarithms of the numerator and of the denominator, then exponentiate the difference.
numer <- lfactorial(length(x))
denom <- sum(lfactorial(table(x)))
exp(numer - denom)
#[1] 200
This can be easily written as a function.
num_unique_permutations <- function(x){
  numer <- lfactorial(length(x))
  denom <- sum(lfactorial(table(x)))
  exp(numer - denom)
}
num_unique_permutations(x)
#[1] 200
You can use the gmp library.
library(gmp)
factorial(as.bigz(length(x))) / prod(factorial(as.bigz(table(x))))
#[1] 200
I am testing 2 ways of calculating prod(b-a), where a and b are vectors of length n: prod(b-a) = (b1-a1)*(b2-a2)*(b3-a3)*...*(bn-an), where b_i > a_i > 0 for all i = 1, 2, 3, ..., n. For some special cases, another way (Method 2) of calculating this prod(b-a) is more efficient. It uses the following formula, which expands the terms and sums them:
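Written out, the expansion (which the loop in the testing code below computes term by term, one subset per value of i) is

$$\prod_{i=1}^{n}(b_i - a_i) \;=\; \sum_{S \subseteq \{1,\dots,n\}} (-1)^{|S|} \Big(\prod_{i \in S} a_i\Big)\Big(\prod_{i \notin S} b_i\Big)$$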
Here is my question: when a_i is very close to b_i, the true result can be very, very close to 0, something like 10^(-16). Method 1 (subtract and multiply) always returns a positive output. Method 2, using the formula, sometimes returns a negative output (about 7~8% of the runs in my experiment). Mathematically, these 2 methods should return exactly the same result, but in a computer they apparently produce different outputs.
Here is my code to run the test. When I run the testing code 10000 times, about 7~8% of my runs for Method 2 return a negative output. According to the official documentation, R doubles have a precision of "2.225074e-308", as indicated by the R parameter .Machine$double.xmin. Why is it producing negative values when the differences are between 10^(-16) and 10^(-18)? Any help that sheds light on this will be appreciated. I would also love some suggestions on how to practically increase the precision to the higher level indicated in the R documentation.
########## Testing code 1.
ftest1case <- function(a, b) {
  n <- length(a)
  if (length(b) != n) stop("--------- length a and b are not right.")
  if (any(b < a)) stop("---------- b has to be greater than a all the time.")
  # Method 1: direct product of the differences
  out1 <- prod(b - a)
  # Method 2: expand the product over all 2^n subsets and sum the terms
  out2 <- 0
  N <- 2^n
  for (i in 1:N) {
    tidx <- rev(as.integer(intToBits(x = i - 1))[1:n])   # subset indicator for this term
    tsign <- ifelse((sum(tidx) %% 2) == 0, 1.0, -1.0)    # sign = (-1)^|S|
    out2 <- out2 + tsign * prod(b[tidx == 0]) * prod(a[tidx == 1])
  }
  c(out1, out2)
}
########## Testing code 2.
ftestManyCases <- function(N, printFreq = 1000, smallNum = 10^(-20))
{
  tt <- matrix(0, nrow = N, ncol = 2)
  n <- 12
  for (i in 1:N) {
    # Random case with each b_i slightly larger than a_i
    a <- runif(n, 0, 1)
    b <- a + runif(n, 0, 1) * 0.1
    tt[i, ] <- ftest1case(a = a, b = b)
    if ((i %% printFreq) == 0) cat("----- i = ", i, "\n")
    if (tt[i, 2] < smallNum) cat("------ i = ", i, " ---- Negative summation found.\n")
  }
  # Fraction of runs in which each method falls below smallNum
  tout <- apply(tt, 2, FUN = function(x) { round(sum(x < smallNum) / N, 6) })
  names(tout) <- c("PerLess0_Method1", "PerLess0_Method2")
  list(summary = tout, data = tt)
}
######## Step 1. Test for 1 case.
n<-12
a<-runif(n,0,1)
b<-a+runif(n,0,1)*0.1
ftest1case(a=a,b=b)
######## Step 2 Test Code 2 for multiple cases.
N<-300
tt<-ftestManyCases(N=N,printFreq = 100)
tt[[1]]
It's hard for me to imagine when an algorithm that consists of generating 2^n permutations and adding them up is going to be more efficient than a straightforward product of differences, but I'll take your word for it that there are some special cases where it is.
As suggested in comments, the root of your problem is the accumulation of floating-point errors when adding values of different magnitudes; see here for an R-specific question about floating point and here for the generic explanation.
First, a simplified example:
n <- 12
set.seed(1001)
a <- runif(n,0,1)
b <- a + 0.01
prod(a-b) ## 1e-24
out2 <- 0
N <- 2^n
out2v <- numeric(N)
for ( i in 1:N ) {
  tidx <- rev(as.integer(intToBits(x=i-1))[1:n])
  tsign <- ifelse( (sum(tidx)%%2)==0,1.0,-1.0)
  j <- as.logical(tidx)
  out2v[i] <- tsign*prod(b[!j])*prod(a[j])
}
sum(out2v) ## -2.011703e-21
Using extended precision (with 1000 bits of precision) to check that the simple/brute force calculation is more reliable:
library(Rmpfr)
a_m <- mpfr(a, 1000)
b_m <- mpfr(b, 1000)
prod(a_m-b_m)
## 1.00000000000000857647286522936696473705868726043995807429578968484409120647055193862325070279593735821154440625984047036486664599510856317884962563644275433171621778761377125514191564456600405460403870124263023336542598111475858881830547350667868450934867675523340703947491662460873009229537576817962228e-24
This proves the point in this case, but in general doing extended-precision arithmetic will probably kill any performance gains you would get.
Redoing the permutation-based calculation with mpfr values (using out2 <- mpfr(0, 1000), and going back to the out2 <- out2 + ... running summation rather than accumulating the values in a vector and calling sum()) gives an accurate answer (at least to the first 20 or so digits, I didn't check farther), but takes 6.5 seconds on my machine (instead of 0.03 seconds when using regular floating-point).
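A sketch of that mpfr redo, reusing a_m, b_m, n, and N from the chunks above (slow, but accurate); the if() guards are just there to avoid taking the product of an empty selection:

out2_m <- mpfr(0, 1000)
for (i in 1:N) {
  tidx  <- rev(as.integer(intToBits(x = i - 1))[1:n])
  tsign <- if (sum(tidx) %% 2 == 0) 1 else -1
  j     <- as.logical(tidx)
  term  <- mpfr(tsign, 1000)
  if (any(!j)) term <- term * prod(b_m[!j])   # product over the 'b' part of the term
  if (any(j))  term <- term * prod(a_m[j])    # product over the 'a' part of the term
  out2_m <- out2_m + term
}
out2_m   ## agrees with prod(a_m - b_m) to (at least) the first ~20 digits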
Why is this calculation problematic? First, note the difference between .Machine$double.xmin (approx 2e-308), which is the smallest positive floating-point value the system can store, and .Machine$double.eps (approx 2e-16), which is the smallest value x such that 1 + x > 1, i.e. the smallest relative difference that can be added to a number without being lost to cancellation entirely (values only a little bigger than this magnitude will experience severe, but not total, cancellation).
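You can see both constants, and the eps threshold in action, directly at the prompt:

.Machine$double.xmin                   ## 2.225074e-308 (smallest positive normalized double)
.Machine$double.eps                    ## 2.220446e-16  (machine epsilon)
1 + .Machine$double.eps     == 1       ## FALSE: still distinguishable from 1
1 + .Machine$double.eps / 2 == 1       ## TRUE:  the added value is lost completely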
Now look at the distribution of the individual terms stored in out2v:
hist(out2v)
There are clusters of negative and positive values of similar magnitude. If the running sum happens to add a bunch of values that almost cancel (so the partial sum is very close to 0) and then adds that to another value that is not nearly zero, we'll get bad cancellation.
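A minimal illustration of how mixing magnitudes like that loses information in plain double arithmetic:

big <- 1e20
small <- 1
(big - big) + small   ## 1, as expected
(big + small) - big   ## 0: 'small' is absorbed when added to 'big', then lost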
It's entirely possible that there's a way to rearrange this calculation so that bad cancellation doesn't happen, but I couldn't think of one easily.
The deviations from the mean should always sum to 0.
However, when the mean has a lot of digits, possibly infinitely many, like this one, which is 20/7, R fails to calculate it exactly.
x <- c(1,2,2,3,3,4,5)
sum(x - mean(x))
[1] -4.440892e-16
I am quite a newbie and have not found any information about this so far, maybe I was not searching for the right terms.
Is it possible to calculate with infinitely long numbers in R?
I am asking this out of theoretical interest.
The problem you have described is a general problem with all programming languages. Internally all floats are based on the IEEE754 convention. You can read more about it here.
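The classic illustration of the underlying binary representation issue:

0.1 + 0.2 == 0.3                ## FALSE
print(0.1 + 0.2, digits = 17)   ## 0.30000000000000004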
As far as I know there is no easy way around these small errors, except for using number representations with higher precision.
EDIT: R already uses the double-precision representation of floating-point numbers. To read more about it you can have a look at the R FAQ and this SO question.
If you deal with rational numbers only, such as your example, you can use the gmp package.
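For example, a minimal sketch using gmp's exact rationals (bigq), assuming you are happy to convert the input to rationals first:

library(gmp)
x <- as.bigq(c(1, 2, 2, 3, 3, 4, 5))   # exact rational representation
m <- sum(x) / length(x)                # exactly 20/7, no rounding
sum(x - m)                             # exactly 0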
You can use the Rmpfr package to deal with numbers with an arbitrary precision (that you have to set).
Another possibility is the lazyNumbers package, freshly released on CRAN:
library(lazyNumbers)
# create a vector of lazy numbers
x <- lazyvec(c(1, 2, 2, 3, 3, 4, 5))
# compute its mean
m <- sum(x) / length(x)
# sum expected to be 0
y <- sum(x - m)
# convert it to double
as.double(y)
## 0
I have a question for an assignment I'm doing.
Q:
"Set the seed at 1, then using a for-loop take a random sample of 5 mice 1,000 times. Save these averages.
What proportion of these 1,000 averages are more than 1 gram away from the average of x ?"
I understand that, basically, I need to write code that says: what percentage of the "nulls" is + or - 1 gram from the average of x? I'm not really certain how to write that, given that the course hasn't covered it yet but is asking us to do it anyway. Any help on how to do so?
library(downloader)   # download() below comes from the downloader package
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv"
filename <- basename(url)
download(url, destfile = filename)
x <- unlist(read.csv(filename))
set.seed(1)
n <- 1000
nulls<-vector("numeric", n)
for(i in 1:n){
  control <- sample(x, 5)
  nulls[i] <- mean(control)
  ## I know my last line for this should be something like this
  ## mean(nulls "+ or - 1") > or < mean(x)
  ## not certain if they're asking for abs() to be involved.
  ## is the question asking only for those that are 1 gram MORE than the avg of x?
}
Thanks for any help.
Z
I do think that the absolute distance is what they're after here.
Vectors in R are nice in that you can just perform arithmetic operations between a vector and a scalar and it will apply it element-wise, so computing the absolute value of nulls - mean(x) is easy. The abs function also takes vectors as arguments.
Logical operators (such as < and >) can also be used in the same way, making it equally simple to compare the result with 1. This will yield a vector of booleans (TRUE/FALSE) where TRUE means the value at that index was indeed greater than 1, but booleans are really just numbers (1 or 0), so you can just sum that vector to find the number of TRUE elements.
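A toy illustration with made-up numbers (deliberately not the assignment data), just to show the vectorized pattern:

v <- c(0.3, 1.4, 2.7, 0.9)
abs(v - 1)             ## 0.7 0.4 1.7 0.1
abs(v - 1) > 1         ## FALSE FALSE TRUE FALSE
sum(abs(v - 1) > 1)    ## 1, the count of TRUEs
mean(abs(v - 1) > 1)   ## 0.25, the proportion of TRUEs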
I don't know what programming level you are on, but I hope this helps without giving the solution away completely (since you said it's for an assignment).
I have looked a bit online and on this site, but I did not find any solution. My problem is relatively simple, so if you could point me to a possible solution, much appreciated.
test_vec <- c(2,8,709,600)
mean(exp(test_vec))
test_vec_bis <- c(2,8,710,600)
mean(exp(test_vec_bis))
exp(709)
exp(710)
# The numerical limit of R is at exp(709)
How can I calculate the mean of my vector and deal with the Inf values, knowing that R could probably handle the mean value itself but not all the values in the numerator of the mean calculation?
There is an edge case where you can solve your problem by simply restating it mathematically, but that requires that the length of your vector is extremely large and/or that your large exponents are close to the numeric limit:
Since the mean sum(x)/n can be written as sum(x/n), and since exp(x)/exp(y) = exp(x-y), you can calculate sum(exp(x - log(n))) instead, which buys you log(n) of headroom in the exponent.
mean(exp(test_vec))
[1] 2.054602e+307
sum(exp(test_vec - log(length(test_vec))))
[1] 2.054602e+307
sum(exp(test_vec_bis - log(length(test_vec_bis))))
[1] 5.584987e+307
While this works for your example, most likely this won't work for your real vector.
In this case, you will have to consult packages like Rmpfr, as suggested by @fra.
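For example, a minimal Rmpfr sketch (the 128 bits of precision here is an arbitrary choice):

library(Rmpfr)
x_m <- mpfr(test_vec_bis, 128)   # high-precision copy of the vector
sum(exp(x_m)) / length(x_m)      # exp(710) no longer overflows
## about 5.585e+307, consistent with the log-shifted sum above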
Here's one way, where you subset to keep only those elements of test_vec whose exponential is finite (< Inf):
mean(exp(test_vec)[which(exp(test_vec) < Inf)])
[1] 1.257673e+260
t2 <- c(2,8,600)
mean(exp(t2))
[1] 1.257673e+260
This assumes you were looking to exclude values that result in Inf, of course.
I have a set of percentages in R which I want to make sum to 100%. I need the individual percentages to be rounded to the nearest integer.
I have the percentages [20.5, 50.6, 25.8, 3.1]. I take the integer parts [20, 50, 25, 3], then work out the individual decimal remainders [0.5, 0.6, 0.8, 0.1], then sort the remainders in descending order, which gives the index order [3, 2, 1, 4]. Finally I need to add 1 to the integer with the highest remainder, then the next highest, until the total reaches 100.
It is not really as trivial a problem as it may seem. Some good discussions about the problem can be seen in this thread and that is also where I got the solution (which is identical to the idea in the OP).
Here's a function that might do it for you
round_percent <- function(x) {
  x <- x/sum(x)*100   # Standardize result
  res <- floor(x)     # Find integer bits
  rsum <- sum(res)    # Find out how much we are missing
  if(rsum<100) {
    # Distribute points based on remainders and a random tie breaker
    o <- order(x%%1, sample(length(x)), decreasing=TRUE)
    res[o[1:(100-rsum)]] <- res[o[1:(100-rsum)]]+1
  }
  res
}
Hope this helps. Note that there is no error checking at all in the function above.
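For example, with the percentages from the question (the remainders here are all distinct, so the random tie-breaker doesn't come into play):

round_percent(c(20.5, 50.6, 25.8, 3.1))
## [1] 20 51 26  3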
Update: I implemented a version of this in the MESS package. Look at MESS::round_percent.