How to calculate a sum over a term including a rising factorial in R?

I am new to programming and R and would like to compute the following sum: sum_{i=0}^{k} (-1)^i * choose(k, i) * poch(-i*sigma, n), where poch(x, n) = x(x+1)...(x+n-1) is the rising factorial.
I used pochMpfr from the Rmpfr package for the rising factorial and a for-loop to compute the sum.
B <- rep(1, k + 1)
for (i in 0:k) {
  B[i + 1] <- (-1)^i * choose(k, i) * pochMpfr((-i) * sigma, n)
}
sum(B)
Doing so, I get the result as a list (each element printing with "mpfr") and thus cannot compute the sum. Is there a way to get the results directly as a matrix or vector, or to convert the list to a vector containing only the relevant elements?
The solution is probably quite easy, but I haven't found it while looking through the forums.

There is no need to use a for-loop; this should work:
library(Rmpfr)
# You do not define these in your question,
# so I just take some arbitrary values
k <- 10
n <- 3
sigma <- 0.3
i <- 0:k
B <- (-1)^i * choose(k, i) * pochMpfr((-i) * sigma, n)
sum(B)
## 1 'mpfr' number of precision 159 bits
## [1] 6.2977401071861993597462780570563107354142915151e-14
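If you do want to keep a loop, one workaround (a sketch, assuming only Rmpfr's mpfr() and asNumeric(); the 128-bit precision is an arbitrary choice) is to pre-allocate an mpfr vector instead of a numeric one, so the assignments stay mpfr rather than degrading to a list:
library(Rmpfr)
k <- 10; n <- 3; sigma <- 0.3
# Pre-allocate an mpfr vector so B[i + 1] <- ... keeps the mpfr class
B <- mpfr(rep(0, k + 1), precBits = 128)
for (i in 0:k) {
  B[i + 1] <- (-1)^i * choose(k, i) * pochMpfr((-i) * sigma, n)
}
sum(B)
asNumeric(sum(B))  # convert back to a plain double if needed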

Related

R precision for doubles - why does the code return a negative value when a positive outcome is expected?

I am testing 2 ways of calculating prod(b-a), where a and b are vectors of length n: prod(b-a) = (b_1-a_1)(b_2-a_2)(b_3-a_3)*...*(b_n-a_n), where b_i > a_i > 0 for all i = 1, 2, ..., n. For some special cases, another way (Method 2) of calculating prod(b-a) is more efficient. It uses the inclusion-exclusion expansion of the product, i.e. it expands the terms and sums them: prod_{i=1}^{n} (b_i - a_i) = sum over all subsets S of {1, ..., n} of (-1)^|S| * (prod_{i in S} a_i) * (prod_{i not in S} b_i).
Here is my question: when a_i is very close to b_i, the true outcome can be very, very close to 0, something like 10^(-16). Method 1 (subtract and multiply) always returns positive output. Method 2, using the formula, sometimes returns negative output (about 7~8% of the time in my experiment). Mathematically, these 2 methods should return exactly the same output, but in floating-point arithmetic they apparently produce different outputs.
Here is my code to run the test. When I run the testing code 10,000 times, about 7~8% of my runs for Method 2 return negative output. According to the official documentation, the R double has a precision of "2.225074e-308", as indicated by the R parameter .Machine$double.xmin. Why does it go negative when the differences are between 10^(-16) and 10^(-18)? Any help that sheds light on this will be appreciated. I would also love suggestions on how to practically increase the precision to the higher level indicated in the R documentation.
########## Testing code 1.
ftest1case <- function(a, b) {
  n <- length(a)
  if (length(b) != n) stop("--------- length a and b are not right.")
  if (any(b < a)) stop("---------- b has to be greater than a all the time.")
  out1 <- prod(b - a)
  out2 <- 0
  N <- 2^n
  for (i in 1:N) {
    tidx <- rev(as.integer(intToBits(x = i - 1))[1:n])
    tsign <- ifelse((sum(tidx) %% 2) == 0, 1.0, -1.0)
    out2 <- out2 + tsign * prod(b[tidx == 0]) * prod(a[tidx == 1])
  }
  c(out1, out2)
}
########## Testing code 2.
ftestManyCases <- function(N, printFreq = 1000, smallNum = 10^(-20)) {
  tt <- matrix(0, nrow = N, ncol = 2)
  n <- 12
  for (i in 1:N) {
    a <- runif(n, 0, 1)
    b <- a + runif(n, 0, 1) * 0.1
    tt[i, ] <- ftest1case(a = a, b = b)
    if ((i %% printFreq) == 0) cat("----- i = ", i, "\n")
    if (tt[i, 2] < smallNum) cat("------ i = ", i, " ---- Negative summation found.\n")
  }
  tout <- apply(tt, 2, FUN = function(x) round(sum(x < smallNum) / N, 6))
  names(tout) <- c("PerLess0_Method1", "PerLess0_Method2")
  list(summary = tout, data = tt)
}
######## Step 1. Test for 1 case.
n <- 12
a <- runif(n, 0, 1)
b <- a + runif(n, 0, 1) * 0.1
ftest1case(a = a, b = b)
######## Step 2. Test Code 2 for multiple cases.
N <- 300
tt <- ftestManyCases(N = N, printFreq = 100)
tt[[1]]
It's hard for me to imagine when an algorithm that consists of generating 2^n permutations and adding them up is going to be more efficient than a straightforward product of differences, but I'll take your word for it that there are some special cases where it is.
As suggested in comments, the root of your problem is the accumulation of floating-point errors when adding values of different magnitudes; see here for an R-specific question about floating point and here for the generic explanation.
First, a simplified example:
n <- 12
set.seed(1001)
a <- runif(n, 0, 1)
b <- a + 0.01
prod(a - b)  ## 1e-24
out2 <- 0
N <- 2^n
out2v <- numeric(N)
for (i in 1:N) {
  tidx <- rev(as.integer(intToBits(x = i - 1))[1:n])
  tsign <- ifelse((sum(tidx) %% 2) == 0, 1.0, -1.0)
  j <- as.logical(tidx)
  out2v[i] <- tsign * prod(b[!j]) * prod(a[j])
}
sum(out2v)  ## -2.011703e-21
Using extended precision (with 1000 bits of precision) to check that the simple/brute force calculation is more reliable:
library(Rmpfr)
a_m <- mpfr(a, 1000)
b_m <- mpfr(b, 1000)
prod(a_m-b_m)
## 1.00000000000000857647286522936696473705868726043995807429578968484409120647055193862325070279593735821154440625984047036486664599510856317884962563644275433171621778761377125514191564456600405460403870124263023336542598111475858881830547350667868450934867675523340703947491662460873009229537576817962228e-24
This proves the point in this case, but in general doing extended-precision arithmetic will probably kill any performance gains you would get.
Redoing the permutation-based calculation with mpfr values (using out2 <- mpfr(0, 1000), and going back to the out2 <- out2 + ... running summation rather than accumulating the values in a vector and calling sum()) gives an accurate answer (at least to the first 20 or so digits, I didn't check farther), but takes 6.5 seconds on my machine (instead of 0.03 seconds when using regular floating-point).
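For concreteness, a minimal sketch of that re-run (reusing n, N, a_m, and b_m from above; it mirrors the original loop with an mpfr accumulator):
out2 <- mpfr(0, 1000)  # high-precision running sum
for (i in 1:N) {
  tidx <- rev(as.integer(intToBits(x = i - 1))[1:n])
  j <- as.logical(tidx)
  tsign <- if (sum(tidx) %% 2 == 0) 1 else -1
  out2 <- out2 + tsign * prod(b_m[!j]) * prod(a_m[j])
}
out2  # agrees with the extended-precision product above to many digits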
Why is this calculation problematic? First, note the difference between .Machine$double.xmin (approx 2e-308), which is the smallest floating-point value that the system can store, and .Machine$double.eps (approx 2e-16), which is the smallest value x such that 1 + x > 1, i.e. the smallest relative difference that can be added without being lost entirely (values only a little bigger than this relative magnitude will experience severe, but not catastrophic, cancellation).
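A quick sanity check of those two constants (plain base R):
.Machine$double.xmin  # ~2.225074e-308: smallest positive normalized double
.Machine$double.eps   # ~2.220446e-16: relative spacing of doubles near 1
1 + 1e-17 == 1        # TRUE: an increment this far below eps is lost entirely
1 + 1e-15 == 1        # FALSE: above eps, the increment survives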
Now look at the distribution of the individual terms collected in out2v:
hist(out2v)
There are clusters of negative and positive numbers of similar magnitude. If our summation happens to add a bunch of values that almost cancel (so that the running total is very close to 0) and then adds another value that is not nearly zero, we'll get bad cancellation.
It's entirely possible that there's a way to rearrange this calculation so that bad cancellation doesn't happen, but I couldn't think of one easily.
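One partial mitigation worth mentioning (my suggestion, not something tried in the question) is compensated (Kahan) summation, which tracks the low-order bits lost at each addition. It reduces the error accumulated by the summation itself, though it cannot undo cancellation already baked into the individual terms; a minimal sketch:
kahan_sum <- function(x) {
  s <- 0     # running sum
  comp <- 0  # compensation: low-order bits lost so far
  for (xi in x) {
    y <- xi - comp       # apply the correction from the previous step
    t <- s + y           # add; low-order bits of y may be lost here
    comp <- (t - s) - y  # algebraically zero; numerically, the lost bits
    s <- t
  }
  s
}
kahan_sum(out2v)
Summing the terms in increasing order of magnitude, e.g. sum(out2v[order(abs(out2v))]), is another cheap variant to try.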

Finding the Proportion of a specific difference between the average of two vectors

I have a question for an assignment I'm doing.
Q:
"Set the seed at 1, then using a for-loop take a random sample of 5 mice 1,000 times. Save these averages.
What proportion of these 1,000 averages are more than 1 gram away from the average of x ?"
I understand that basically I need to write code that says: what percentage of "nulls" is + or - 1 gram from the average of "x"? I'm not really certain how to write that, given that this course is asking us to do so before it has shown us how. Any help on how to do so?
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/femaleControlsPopulation.csv"
filename <- basename(url)
download(url, destfile = filename)  # download() comes from the 'downloader' package
x <- unlist(read.csv(filename))
set.seed(1)
n <- 1000
nulls <- vector("numeric", n)
for (i in 1:n) {
  control <- sample(x, 5)
  nulls[i] <- mean(control)
  ## I know my last line for this should be something like this
  ## mean(nulls "+ or - 1") > or < mean(x)
  ## not certain if they're asking for abs() to be involved.
  ## is the question asking only for those that are 1 gram MORE than the avg of x?
}
Thanks for any help.
Z
I do think that the absolute distance is what they're after here.
Vectors in R are nice in that you can perform arithmetic operations between a vector and a scalar and they will be applied element-wise, so computing the absolute value of nulls - mean(x) is easy; the abs function also takes vectors as arguments.
Logical operators (such as < and >) can be used in the same way, making it equally simple to compare the result with 1. This yields a vector of booleans (TRUE/FALSE), where TRUE means the value at that index was indeed greater than 1. Booleans are really just numbers (1 or 0), so you can sum that vector to count the TRUE elements.
I don't know what programming level you are on, but I hope this helps without giving the solution away completely (since you said it's for an assignment).
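To make the pattern concrete on made-up numbers (v and target here are hypothetical, not the assignment data):
v <- c(9.2, 10.5, 11.8, 10.1)  # pretend these are the saved averages
target <- 10.4                 # pretend this is mean(x)
d <- abs(v - target)           # element-wise absolute distances
d > 1                          # logical vector: TRUE where distance exceeds 1
mean(d > 1)                    # proportion of TRUEs, since TRUE counts as 1
Note that mean() on a logical vector gives the proportion directly, combining the sum-and-divide steps described above.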

The sum of the first n odd integers

I am trying to create a function that computes the sum of the first n odd integers, i.e. the summation from i=1 to n of (2i-1).
If n = 1 it should output 1
If n = 2 it should output 4
I'm having problems using a for-loop, which only outputs the nth term:
n <- 2
for (i in 1:n) {
  y <- (2 * i) - 1
}
y
In R we try to avoid for-loops; this gives the running partial sums:
cumsum(seq(1, 2 * n, by = 2))
Or just use sum() instead of cumsum() if you don't want the series of partial sums.
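For instance, with n set to 4 (plain base R, just to show the difference between the two):
n <- 4
odds <- seq(1, 2 * n, by = 2)  # 1 3 5 7
cumsum(odds)                   # partial sums: 1 4 9 16
sum(odds)                      # just the total: 16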
There's actually no need to use a loop or to construct the sequence of the first n odd numbers here -- this is an arithmetic series, so we know the sum of the first n elements in closed form: 1 + 3 + ... + (2n-1) = n^2.
sum.first.n.odd <- function(n) n^2
sum.first.n.odd(1)
[1] 1
sum.first.n.odd(2)
[1] 4
sum.first.n.odd(100)
[1] 10000
This should be a good deal more efficient than any solution based on for or sum because it never computes the elements of the sequence.
[[Just seeing the title -- the OP apparently knows the analytic result and wanted something else...]]
Try this:
total <- 0  # named 'total' rather than 'sum' to avoid masking base::sum
n <- 2
for (i in seq(1, 2 * n, 2)) {
  total <- total + i
}
total
But, of course, R is rather slow when working with loops, which is why one should avoid them.

How many unique powers x^y are there for x in 1-1000 and y in 1-1000, using R

Using R, for integers x and y ∈ [1, 1000], calculate how many unique powers x^y exist.
This is what I have right now; I just don't know how to eliminate the duplicate numbers:
x <- 1:1000
y <- 1:1000
for (i in x) {
  for (j in y) {
    print(i^j)
  }
}
A combinatorial approach to this could split the numbers from 1-1000 into equivalence classes where each number in the class is the power of some other number. For instance, we would split the numbers 1-10 into (1), (2, 4, 8), (3, 9), (5), (6), (7), (10). None of the powers of values between equivalence classes will coincide, so we can just handle each equivalence class separately.
num.unique.comb <- function(limit) {
  # Count number of powers in each equivalence class (labeled by lowest val)
  num.powers <- rep(0, limit)
  # Handle 1 as special case
  num.powers[1] <- 1
  # Beyond sqrt(limit), all unhandled numbers are in their own equivalence class
  handled <- c(TRUE, rep(FALSE, limit - 1))
  for (base in 2:ceiling(sqrt(limit))) {
    if (!handled[base]) {
      # Handle all the values in 1:limit that are powers of base
      num.handle <- floor(log(limit, base))
      handled[base^(1:num.handle)] <- TRUE
      # Compute the powers of base that we cover
      num.powers[base] <- length(unique(as.vector(outer(1:num.handle, 1:limit))))
    }
  }
  num.powers[!handled] <- limit
  # Handle sums too big for standard numeric types
  library(gmp)
  print(sum(as.bigz(num.powers)))
}
num.unique.comb(10)
# [1] 76
num.unique.comb(1000)
# [1] 978318
One nice property of this combinatorial approach is that it's very fast compared to a brute-force approach. For instance, it takes less than 0.1 seconds to compute with limit set to 1000. This allows us to compute the result for much larger values:
# ~0.15 seconds
num.unique.comb(10000)
# [1] 99357483
# ~4 seconds
num.unique.comb(100000)
# [1] 9981335940
# ~220 seconds
num.unique.comb(1000000)
# [1] 999439867182
This is a pretty neat result -- in under 4 minutes we can count the unique values among 1 trillion computed powers, where each number can have up to 6 million digits!
Update: Based on this combinatorial code I've updated the OEIS entry for this sequence to include terms up to 10,000.
A brute-force approach would be to just compute all the powers and count the number of unique values:
num.unique.bf <- function(limit) {
length(unique(as.vector(sapply(1:limit, function(x) x^(1:limit)))))
}
num.unique.bf(10)
# [1] 76
A problem with this brute-force analysis is that you are dealing with large numbers that will create numerical issues. For instance:
1000^1000
# [1] Inf
As a result we get an inaccurate value:
# Wrong due to numerical issues!
num.unique.bf(1000)
# [1] 119117
However, a package like gmp enables us to compute even numbers as large as 1000^1000. My computer has trouble storing all 1 million numbers in memory at once, so I'll write them to a file (1.2 GB for limit = 1000 on my computer) and then count the number of unique values in that file:
library(gmp)
num.unique.bf2 <- function(limit) {
  sink("foo.txt")
  for (x in 1:limit) {
    vals <- as.bigz(x)^(1:limit)
    for (idx in 1:limit) {
      cat(paste0(as.character(vals[idx]), "\n"))
    }
  }
  sink()
  as.numeric(system("sort foo.txt | uniq | wc -l", intern = TRUE))
}
num.unique.bf2(10)
# [1] 76
num.unique.bf2(1000)
# [1] 978318
A quick check against the OEIS entry (which lists the first 1000 values of this sequence) shows that this is correct. This approach is rather slow (roughly 40 minutes on my computer); the combinatorial approach is significantly faster.

subset slow in large matrix

I have a numeric vector of length 5,000,000
>head(coordvec)
[1] 47286545 47286546 47286547 47286548 47286549 472865
and a 1,400,000 x 3 numeric matrix
>head(subscores)
        V1       V2     V3
1 47286730 47286725  0.830
2 47286740 47286791  0.065
3 47286750 47286806 -0.165
4 47288371 47288427  0.760
5 47288841 47288890  0.285
6 47288896 47288945  0.225
What I am trying to accomplish is: for each number in coordvec, find the average of V3 over the rows of subscores in which V1 and V2 encompass that number. To do that, I am taking the following approach:
results <- numeric(length(coordvec))
for (i in 1:length(coordvec)) {
  select_rows <- subscores[, 1] < coordvec[i] & subscores[, 2] > coordvec[i]
  scores_subset <- subscores[select_rows, 3]
  results[i] <- mean(scores_subset)
}
This is very slow, and would take a few days to finish. Is there a faster way?
Thanks,
Dan
I think there are two challenging parts to this question. The first is finding the overlaps. I'd use the IRanges package from Bioconductor (?findInterval in the base package might also be useful)
library(IRanges)
Create width-1 ranges representing the coordinate vector and a set of ranges representing the scores; I sort the coordinate vector for convenience, assuming that duplicate coordinates can be treated the same:
coord <- sort(sample(.Machine$integer.max, 5000000))
starts <- sample(.Machine$integer.max, 1200000)
scores <- runif(length(starts))
q <- IRanges(coord, width=1)
s <- IRanges(starts, starts + 100L)
Here we find which query overlaps which subject
system.time({
olaps <- findOverlaps(q, s)
})
This takes about 7s on my laptop. There are different types of overlaps (see ?findOverlaps) so maybe this step requires a bit of refinement.
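For example (a sketch only; which overlap semantics is right depends on whether the endpoints of V1/V2 should count), strict containment can be requested via the type argument:
olaps_within <- findOverlaps(q, s, type = "within")  # query strictly inside subject
The rest of the walkthrough continues with the unrestricted olaps.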
The result is a pair of vectors indexing the query and overlapping subject.
> olaps
Hits of length 281909
queryLength: 5000000
subjectLength: 1200000
  queryHits subjectHits
  <integer>   <integer>
1        19      685913
2        35      929424
3        46     1130191
4        52       37417
I think this is the end of the first complicated part, finding the 281909 overlaps. (I don't think the data.table answer offered elsewhere addresses this, though I could be mistaken...)
The next challenging part is calculating a large number of means. The built-in way would be something like
olaps0 <- head(olaps, 10000)
system.time({
res0 <- tapply(scores[subjectHits(olaps0)], queryHits(olaps0), mean)
})
which takes about 3.25s on my computer and appears to scale linearly, so maybe 90s for the 280k overlaps. But I think we can accomplish this tabulation efficiently with data.table. The original coordinates are start(q)[queryHits(olaps)], so:
require(data.table)
dt <- data.table(coord = start(q)[queryHits(olaps)],
                 score = scores[subjectHits(olaps)])
res1 <- dt[, mean(score), by = coord]$V1
which takes about 2.5s for all 280k overlaps.
Some more speed can be had by recognizing that the query hits are ordered. We want to calculate a mean for each run of query hits. We start by creating a variable to indicate the ends of each query hit run
idx <- c(queryHits(olaps)[-1] != queryHits(olaps)[-length(olaps)], TRUE)
and then calculate the cumulative scores at the ends of each run, the length of each run, and the difference between the cumulative score at the end and at the start of the run
scoreHits <- cumsum(scores[subjectHits(olaps)])[idx]
n <- diff(c(0L, seq_along(idx)[idx]))
xt <- diff(c(0L, scoreHits))
And finally, the mean is
res2 <- xt / n
This takes about 0.6s for all the data, and is identical to (though more cryptic than?) the data.table result
> identical(res1, res2)
[1] TRUE
The original coordinates corresponding to the means are
start(q)[ queryHits(olaps)[idx] ]
Something like this might be faster :
require(data.table)
subscores <- as.data.table(subscores)
subscores[, cond := V1 < coordvec & V2 > coordvec]
subscores[list(cond)[[1]], mean(V3)]
list(cond)[[1]] because: "When i is a single variable name, it is not considered an expression of column names and is instead evaluated in calling scope." source: ?data.table
Since your example isn't easily reproducible, and even if it were, none of your subscores meet your boolean condition, I'm not sure if this does exactly what you're looking for, but you can use one of the apply family and a function:
myfun <- function(x) {
  y <- subscores[, 1] < x & subscores[, 2] > x
  mean(subscores[y, 3])
}
sapply(coordvec, myfun)
You can also take a look at mclapply. If you have enough memory, this will probably speed things up significantly. However, you could also look at the foreach package, with similar results. You've got your for-loop "correct" by assigning into results rather than growing it, but really, you're doing a lot of comparisons; it will be hard to speed this up much.
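For instance, a parallelized version of the sapply() call above might look like the sketch below; mclapply() is in the base parallel package, mc.cores = 4 is an arbitrary choice, and the fork-based backend means this won't work this way on Windows:
library(parallel)
# Apply myfun to each coordinate on 4 worker processes, then flatten
results <- unlist(mclapply(coordvec, myfun, mc.cores = 4))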
