Rounding percentages to 100% in R - r

I have a set of percentages in R that I want to sum to exactly 100%, with each individual percentage rounded to the nearest integer.
For example, I start with the percentages [20.5, 50.6, 25.8, 3.1]. I round them down to integers [20, 50, 25, 3], then work out the fractional parts [0.5, 0.6, 0.8, 0.1] and sort them in descending order, which gives the index order [3, 2, 1, 4]. I then want to add 1 to the integer with the largest fractional part, then to the next largest, and so on, until the total reaches 100.

It is not really as trivial a problem as it may seem. There are some good discussions of it in this thread, which is also where I got the solution (it is essentially the same idea as in the OP).
Here's a function that should do it for you:
round_percent <- function(x) {
  x <- x/sum(x)*100   # Standardize result
  res <- floor(x)     # Find integer bits
  rsum <- sum(res)    # Find out how much we are missing
  if (rsum < 100) {
    # Distribute points based on remainders and a random tie breaker
    o <- order(x %% 1, sample(length(x)), decreasing = TRUE)
    res[o[1:(100-rsum)]] <- res[o[1:(100-rsum)]] + 1
  }
  res
}
Hope this helps. Note that there is no error checking at all in the function above.
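For example, applied to the percentages from the question (there are no tied remainders here, so the random tie breaker does not come into play):
round_percent(c(20.5, 50.6, 25.8, 3.1))
# [1] 20 51 26  3    (sums to 100)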
Update: I implemented a version of this in the MESS package. Look at MESS::round_percent.

Related

R Precision for Double - Why does the code return negative values when a positive outcome is expected?

I am testing two ways of calculating prod(b-a), where a and b are vectors of length n: prod(b-a) = (b_1-a_1)(b_2-a_2)(b_3-a_3)*...*(b_n-a_n), where b_i > a_i > 0 for all i = 1, 2, ..., n. For some special cases another way (Method 2) of calculating prod(b-a) is more efficient. It expands the product and sums the resulting terms: prod(b-a) is the sum, over all subsets S of {1, ..., n}, of (-1)^|S| * (the product of a_i for i in S) * (the product of b_i for i not in S).
Here is my question: when the a_i are very close to the b_i, the true result can be very, very close to 0, something like 10^(-16). Method 1 (subtract and multiply) always returns a positive result. Method 2, which uses the expanded formula, sometimes returns a negative result (about 7-8% of the time in my experiment). Mathematically, the two methods should return exactly the same value, but in floating-point arithmetic they apparently produce different outputs.
Here is my code to run the test. When I run the testing code 10,000 times, about 7-8% of my runs for Method 2 return a negative result. According to the official documentation, an R double can represent values down to about 2.225074e-308, as indicated by .Machine$double.xmin. Why am I getting negative values when the differences are around 10^(-16) to 10^(-18)? Any help that sheds light on this will be appreciated. I would also love some suggestions on how to practically increase the precision beyond the level indicated by the R documentation.
########## Testing code 1.
ftest1case <- function(a, b) {
  n <- length(a)
  if (length(b) != n) stop("--------- length a and b are not right.")
  if (any(b < a)) stop("---------- b has to be greater than a all the time.")
  out1 <- prod(b-a)    # Method 1: subtract, then multiply
  out2 <- 0            # Method 2: expand and sum over all 2^n subsets
  N <- 2^n
  for (i in 1:N) {
    tidx <- rev(as.integer(intToBits(x = i-1))[1:n])
    tsign <- ifelse((sum(tidx) %% 2) == 0, 1.0, -1.0)
    out2 <- out2 + tsign * prod(b[tidx == 0]) * prod(a[tidx == 1])
  }
  c(out1, out2)
}
########## Testing code 2.
ftestManyCases <- function(N, printFreq = 1000, smallNum = 10^(-20)) {
  tt <- matrix(0, nrow = N, ncol = 2)
  n <- 12
  for (i in 1:N) {
    a <- runif(n, 0, 1)
    b <- a + runif(n, 0, 1) * 0.1
    tt[i, ] <- ftest1case(a = a, b = b)
    if ((i %% printFreq) == 0) cat("----- i = ", i, "\n")
    if (tt[i, 2] < smallNum) cat("------ i = ", i, " ---- Negative summation found.\n")
  }
  tout <- apply(tt, 2, FUN = function(x) { round(sum(x < smallNum)/N, 6) })
  names(tout) <- c("PerLess0_Method1", "PerLess0_Method2")
  list(summary = tout, data = tt)
}
######## Step 1. Test for 1 case.
n <- 12
a <- runif(n, 0, 1)
b <- a + runif(n, 0, 1) * 0.1
ftest1case(a = a, b = b)
######## Step 2. Test Code 2 for multiple cases.
N <- 300
tt <- ftestManyCases(N = N, printFreq = 100)
tt[[1]]
It's hard for me to imagine when an algorithm that consists of generating 2^n permutations and adding them up is going to be more efficient than a straightforward product of differences, but I'll take your word for it that there are some special cases where it is.
As suggested in comments, the root of your problem is the accumulation of floating-point errors when adding values of different magnitudes; see here for an R-specific question about floating point and here for the generic explanation.
First, a simplified example:
n <- 12
set.seed(1001)
a <- runif(n, 0, 1)
b <- a + 0.01
prod(a-b) ## 1e-24
out2 <- 0
N <- 2^n
out2v <- numeric(N)
for (i in 1:N) {
  tidx <- rev(as.integer(intToBits(x = i-1))[1:n])
  tsign <- ifelse((sum(tidx) %% 2) == 0, 1.0, -1.0)
  j <- as.logical(tidx)
  out2v[i] <- tsign * prod(b[!j]) * prod(a[j])
}
sum(out2v) ## -2.011703e-21
Using extended precision (with 1000 bits of precision) to check that the simple/brute force calculation is more reliable:
library(Rmpfr)
a_m <- mpfr(a, 1000)
b_m <- mpfr(b, 1000)
prod(a_m-b_m)
## 1.00000000000000857647286522936696473705868726043995807429578968484409120647055193862325070279593735821154440625984047036486664599510856317884962563644275433171621778761377125514191564456600405460403870124263023336542598111475858881830547350667868450934867675523340703947491662460873009229537576817962228e-24
This proves the point in this case, but in general doing extended-precision arithmetic will probably kill any performance gains you would get.
Redoing the permutation-based calculation with mpfr values (using out2 <- mpfr(0, 1000), and going back to the out2 <- out2 + ... running summation rather than accumulating the values in a vector and calling sum()) gives an accurate answer (at least to the first 20 or so digits, I didn't check farther), but takes 6.5 seconds on my machine (instead of 0.03 seconds when using regular floating-point).
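For reference, here is a sketch of that mpfr-based rerun, reconstructed from the description above (it reuses n, N, a_m and b_m defined earlier; this is not the answerer's exact code):
out2_m <- mpfr(0, 1000)
for (i in 1:N) {
  tidx <- rev(as.integer(intToBits(x = i-1))[1:n])
  tsign <- ifelse((sum(tidx) %% 2) == 0, 1.0, -1.0)
  j <- as.logical(tidx)
  out2_m <- out2_m + tsign * prod(b_m[!j]) * prod(a_m[j])  # running sum kept at 1000-bit precision
}
out2_m  # agrees with the extended-precision product above, but is far slower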
Why is this calculation problematic? First, note the difference between .Machine$double.xmin (approx 2e-308), which is the smallest positive floating-point value that the system can store, and .Machine$double.eps (approx 2e-16), which is the smallest value x such that 1 + x != 1, i.e. the smallest relative contribution that can be added to a value without being lost entirely (contributions a little bigger than this relative magnitude survive, but with severe loss of precision).
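A quick console illustration of that eps threshold (a minimal sketch, not part of the original answer):
.Machine$double.eps   # approx 2.22e-16
1 + 1e-16 == 1        # TRUE: the 1e-16 is absorbed completely
1 + 1e-15 == 1        # FALSE: it survives, but only barely
(1e20 + 1) - 1e20     # 0: the same absorption at a larger scale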
Now look at the distribution of the individual terms stored in out2v:
hist(out2v)
There are clusters of negative and positive numbers of similar magnitude. If the summation happens to add a bunch of values that almost cancel (so that the running total is very close to 0) and then adds another value that is not nearly zero, we get bad cancellation.
It's entirely possible that there's a way to rearrange this calculation so that bad cancellation doesn't happen, but I couldn't think of one easily.

Finding the Proportion of a specific difference between the average of two vectors

I have a question for an assignment I'm doing.
Q:
"Set the seed at 1, then using a for-loop take a random sample of 5 mice 1,000 times. Save these averages.
What proportion of these 1,000 averages are more than 1 gram away from the average of x ?"
I understand that, basically, I need to write code that answers: what proportion of the values in "nulls" is more than 1 gram above or below the average of "x"? I'm not really certain how to write that, given that the course hasn't yet covered how to do it but is asking us to do so. Any help on how to do this?
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv"
filename <- basename(url)
download(url, destfile=filename)
x <- unlist( read.csv(filename) )
set.seed(1)
n <- 1000
nulls<-vector("numeric", n)
for(i in 1:n){
control <- sample(x, 5)
nulls[i] <-mean(control)
##I know my last line for this should be something like this
## mean(nulls "+ or - 1")> or < mean(x)
## not certain if they're asking for abs() to be involved.
## is the question asking only for those that are 1 gram MORE than the avg of x?
}
Thanks for any help.
Z
I do think that the absolute distance is what they're after here.
Vectors in R are nice in that you can just perform arithmetic operations between a vector and a scalar and it will apply it element-wise, so computing the absolute value of nulls - mean(x) is easy. The abs function also takes vectors as arguments.
Logical operators (such as < and >) can also be used in the same way, making it equally simple to compare the result with 1. This will yield a vector of booleans (TRUE/FALSE) where TRUE means the value at that index was indeed greater than 1, but booleans are really just numbers (1 or 0), so you can just sum that vector to find the number of TRUE elements.
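As a toy illustration of those two ideas, using made-up numbers rather than the assignment data (the vector v is purely hypothetical):
v <- c(-2.5, 0.3, 1.7, 0.9)
abs(v - 1)            # element-wise: 3.5 0.7 0.7 0.1
abs(v - 1) > 1        # TRUE FALSE FALSE FALSE
sum(abs(v - 1) > 1)   # 1, the number of TRUE values
mean(abs(v - 1) > 1)  # 0.25, the proportion of TRUE values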
I don't know what programming level you are on, but I hope this helps without giving the solution away completely (since you said it's for an assignment).

How can I compute ratios of large factorials without unnecessary overflows?

I'm writing a program (in R, in case that matters) in which I need to compute the number of unique permutations of a vector of elements, which can contain repeated values. The mathematical formula for this is straightforward: the factorial of the total number of elements divided by the product of the factorials of the counts of each unique element. However, calculating the result naively is very likely to lead to overflows even when the actual answer is not very large. For example:
# x has 200 elements, but 199 of them are identical
x <- c(rep(1, 199), 2)
num_unique_permutations <- factorial(length(x)) / prod(factorial(table(x)))
If this didn't overflow, then num_unique_permutations would be 200!/(199!*1!) = 200. However, both 200! and 199! overflow the max value of a double, so the actual result is NaN. Is there a good way to do this calculation that will always avoid overflows (or underflows) as long as the answer itself doesn't overflow? (Or perhaps, as long as it doesn't come within a factor of length(x) of overflowing?)
(Note that R uses doubles for most numerical calculations, but the problem is not specific to doubles. Any numeric type with a range has the same problem. Also, I don't care about losing a bit of precision to floating point math, since I'm just using this to get a rough upper bound on something.)
In base R, use lfactorial to compute the logarithms of the numerator and of the denominator, then exponentiate the difference.
numer <- lfactorial(length(x))
denom <- sum(lfactorial(table(x)))
exp(numer - denom)
#[1] 200
This can be easily written as a function.
num_unique_permutations <- function(x) {
  numer <- lfactorial(length(x))
  denom <- sum(lfactorial(table(x)))
  exp(numer - denom)
}
num_unique_permutations(x)
#[1] 200
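As a quick cross-check on a small vector where the naive formula still works (a made-up example, not from the question): y has 5 elements with counts 2, 2 and 1, so the answer should be 5!/(2!*2!*1!) = 30.
y <- c(1, 1, 2, 2, 3)
factorial(length(y)) / prod(factorial(table(y)))  # 30, naive formula (fine for small inputs)
num_unique_permutations(y)                        # 30, log-space version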
You can use the gmp library.
library(gmp)
factorial(as.bigz(length(x))) / prod(factorial(as.bigz(table(x))))
#[1] 200

Expected value of the difference between a sum of variables and a threshold

I have a custom deck consisting of eight cards of each value in the sequence 2^n, n = 0,...,6 (56 cards in total). I draw cards (without replacement) until the sum is equal to or greater than a threshold. How can I implement in R a function that calculates the mean of the difference between the sum and the threshold?
I tried to do it using this: How to store values in a vector with nested functions,
but it takes ages... I think there is a way to do it with probabilities/simulations, but I can't figure it out.
The threshold could be greater than the value of any single card, e.g. threshold = 500, or less than the value of a single card, e.g. threshold = 50.
What I have done so far is to find all the subsets whose sum is greater than or equal to the threshold. Then I only have to subtract the threshold and calculate the mean.
I am using the following code in R. For a small set I get the answer quite fast, but I have been running the function for several hours with the set containing the 56 numbers and it is still running.
set <- c(rep(1,8), rep(2,8), rep(4,8), rep(8,8), rep(16,8), rep(32,8), rep(64,8))
recursive.subset <- function(x, index, current, threshold, result) {
  for (i in index:length(x)) {
    if (current + x[i] >= threshold) {
      store <<- append(store, sum(c(result, x[i])))
    } else {
      recursive.subset(x, i + 1, current + x[i], threshold, c(result, x[i]))
    }
  }
}
store <- vector()
inivector <- vector(mode = "numeric", length = 0)  # initializing empty vector
threshold <- 500  # threshold is not defined in the original snippet; 500 is one of the values mentioned above
recursive.subset(set, 1, 0, threshold, inivector)
I don't know if it is possible to get an exact solution, simply because there are so many possible combinations. It is probably better to do simulations, i.e. write a script for 1 full draw and then rerun that script many times. Since the solutions are very similar, the simulation should give a pretty good approximation.
Ok, here goes:
set <- rep(2^(0:6), each = 8)
thr <- 500
fun <- function(set, thr) {
  x <- cumsum(sample(set))
  value <- x[min(which(x >= thr))]
  value
}
system.time(a <- replicate(100000, fun(set,thr)))
# user system elapsed
# 1.10 0.00 1.09
mean(a - thr)
# [1] 21.22992
Explanation: Rather than drawing one card at a time, I draw all the cards at once in a random order (sample) and then calculate the running total (cumsum). I then find the first point where the running total reaches or exceeds the threshold and take the corresponding value from x. We run this function many times with replicate to obtain a vector of those values, and use mean(a - thr) to calculate the mean difference.
Edit: Made a really stupid typo in the code, fixed it now.
Edit2: Shortened the function a little.
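The same function can be reused unchanged for the smaller threshold mentioned in the question (thr = 50); only the value passed in changes:
b <- replicate(100000, fun(set, 50))
mean(b - 50)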

Round numbers in output to show small near-zero quantities as zero

I would like the output of my R console to look readable. To this end, I would like R to round all my numbers to the nearest N decimal places. I have had some success, but it doesn't work completely:
> options(scipen=100, digits=4)
> .000000001
[1] 0.000000001
> .1
[1] 0.1
> 1.23123123123
[1] 1.231
I would like the 0.000000001 to be displayed as simply 0. How does one do this? To be more specific: I would like a global fix for the entire R session. I realize I can modify individual values by rounding them, but that is less helpful than simply setting things once for the entire session.
Look at ?options, specifically the digits and scipen options.
try
sprintf("%.4f", 0.00000001)
[1] "0.0000"
Combine what Greg Snow and Ricardo Saporta have already given you to get the right answer: options('scipen' = 20) and options('digits' = 2), combined with round(x, 4).
round(x, 4) will round small near-zero quantities to zero.
Either you round off the results of your regression once and store it:
x <- round(x, 4)
... or else yes, you have to do that every time you display the small quantity, if you don't want to store its rounded value. In your case, since you said small near-zero quantities effectively represent zero, why don't you just round it?
If for some reason you need to keep both the precise and the rounded versions, then do:
x.rounded <- round(x, 4)
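Putting the pieces together in one session, a minimal illustration of how the display options and round() interact (using the values from the question):
options(scipen = 100, digits = 4)
round(0.000000001, 4)    # [1] 0
round(0.1, 4)            # [1] 0.1
round(1.23123123123, 4)  # [1] 1.231 (rounded to 1.2312, printed with 4 significant digits)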
