I am trying to create a function that computes the sum of digits of large numbers, of the order of 100^100. The approach described in this question does not work, as shown below. I tried to come up with a function that does the job, but have not been able to get very far.
The inputs would be of the form a^b, where 1 < a, b < 100 and a and b are integers. So, in that sense, I am open to making digitSumLarge a function that accepts two arguments.
digitSumLarge <- function(x) {
  pow <- floor(log10(x)) + 1L
  rem <- x
  i   <- 1L
  num <- integer(length = pow)
  # Individually isolate each digit, starting from the largest, and store it in num
  while (rem > 0) {
    num[i] <- rem %/% (10^(pow - i))
    rem    <- rem %% (10^(pow - i))
    i      <- i + 1L
  }
  return(num)
}
# Function in the highest voted answer of the linked question.
digitsum <- function(x) sum(floor(x / 10^(0:(nchar(x) - 1))) %% 10)
Consider the following tests:
x <- c(1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9)
as.numeric(paste(x, collapse = ''))
# [1] 1.234568e+17
sum(x)
# 90
digitSumLarge(as.numeric(paste(x, collapse = '')))
# 85
digitsum(as.numeric(paste(x, collapse = '')))
# 81, with warning message about loss of accuracy
Is there any way I can write such a function in R?
You need arbitrary-precision numbers. With R's numerics (double-precision floats), a^b can only be represented with limited precision, and not exactly once the input is large enough.
library(gmp)
a <- as.bigz(13)
b <- as.bigz(67)
sum(as.numeric(strsplit(as.character(a^b), split = "")[[1]]))
#[1] 328
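Wrapping this up as the two-argument function mentioned in the question is straightforward; here is a small sketch based on the same gmp approach:
library(gmp)
# digit sum of a^b, computed exactly with arbitrary-precision integers
digitSumLarge <- function(a, b) {
  digits <- strsplit(as.character(as.bigz(a)^b), split = "")[[1]]
  sum(as.integer(digits))
}
digitSumLarge(13, 67)
# [1] 328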
I have a matrix of size 55422 x 888, and the rcorr function (from the Hmisc package) produces this error:
library(Hmisc)
M1 <- matrix(rnorm(36), nrow = 55422, ncol = 888)
cor <- rcorr(t(M1), type = "pearson")
Error in double(p * p) : vector size cannot be NA
In addition: Warning message:
In p * p : NAs produced by integer overflow
Is there anything I can do to solve this?
This isn't a complete/working solution, but will give you an idea of some of the issues.
Your correlation matrix will contain n*(n-1)/2 = 1535771331 unique elements. If each correlation takes one microsecond to compute, computing the correlation matrix will take (n^2-n)/2/(1e6*3600) = 0.42 hours and require (n^2-n)/2*8/(2^30) = 11.4 GB of storage. These requirements are not impossible if you have a lot of RAM and time ...
In fact it's a little bit worse than this, since rcorr returns its results as a symmetric matrix (i.e., not taking advantage of the symmetry), and returns the n and P matrices as well, so the storage requirement will be approximately 5 times as great as stated above (double for the full matrix, x 2.5 because we have two double-precision and one integer matrix).
Getting to your specific question, the section on long vectors in the R internals manual discusses the maximum sizes of objects in R. The 'standard' limitation is that the total number of elements of the matrix should be less than 2^31 ((n^2-n)/2/(2^31-1) = 0.72), but the redundancy in the matrix gets you in trouble (as would the storage of the correlation, p-values, and the sample sizes).
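For reference, the back-of-the-envelope arithmetic above, done in R:
n <- 55422
(n^2 - n) / 2               # 1535771331 unique correlations
(n^2 - n) / 2 * 8 / 2^30    # ~11.4 GB to store them as doubles
(n^2 - n) / 2 / (2^31 - 1)  # ~0.72 of the 2^31 - 1 element limit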
If you still want to go ahead, here is an implementation by A.N. Spiess, copied from here, that breaks the problem into blocks and stores the results in a disk-backed array (i.e., not in RAM). This won't get you the p-values (and it's still not clear what you're going to do with all those values ...), but it works at least up to 40,000 columns (takes about a minute).
However, it seems to crap out on your actual problem size (888 x 55422) because the requested vector length is too large. I'd have to look more closely to see whether there is a limitation here we can get around. It seems we are still limited by the matrix dimensions: the maximum matrix dimension is sqrt(2^31-1), approximately 46341. With more work, we could still do the block-diagonal thing and split the problem into several components ...
set.seed(101)
nc <- 55422
nr <- 888
d <- matrix(rnorm(nr*nc), ncol = nc)
t1 <- system.time(b1 <- bigcor(d))  # bigcor() is defined below
bigcor <- function(
  x,
  y = NULL,
  fun = c("cor", "cov"),
  size = 2000,
  verbose = TRUE,
  ...)
{
  if (!require("ff")) stop("please install the ff package")
  fun <- match.arg(fun)
  if (fun == "cor") FUN <- cor else FUN <- cov
  if (fun == "cor") STR <- "Correlation" else STR <- "Covariance"
  if (!is.null(y) & NROW(x) != NROW(y)) stop("'x' and 'y' must have compatible dimensions!")

  NCOL <- ncol(x)
  if (!is.null(y)) YCOL <- NCOL(y)

  ## calculate remainder, largest 'size'-divisible integer and block size
  REST <- NCOL %% size
  LARGE <- NCOL - REST
  NBLOCKS <- NCOL %/% size

  ## preallocate square matrix of dimension
  ## ncol(x) in 'ff' double format
  if (is.null(y)) resMAT <- ff(vmode = "double", dim = c(NCOL, NCOL))
  else resMAT <- ff(vmode = "double", dim = c(NCOL, YCOL))

  ## split column numbers into 'nblocks' groups + remaining block
  GROUP <- rep(1:NBLOCKS, each = size)
  if (REST > 0) GROUP <- c(GROUP, rep(NBLOCKS + 1, REST))
  SPLIT <- split(1:NCOL, GROUP)

  ## create all unique combinations of blocks
  COMBS <- expand.grid(1:length(SPLIT), 1:length(SPLIT))
  COMBS <- t(apply(COMBS, 1, sort))
  COMBS <- unique(COMBS)
  if (!is.null(y)) COMBS <- cbind(1:length(SPLIT), rep(1, length(SPLIT)))

  ## initiate time counter
  timeINIT <- proc.time()

  ## iterate through each block combination, calculate correlation matrix
  ## between blocks and store them in the preallocated matrix on both
  ## symmetric sides of the diagonal
  for (i in 1:nrow(COMBS)) {
    COMB <- COMBS[i, ]
    G1 <- SPLIT[[COMB[1]]]
    G2 <- SPLIT[[COMB[2]]]

    ## if y = NULL
    if (is.null(y)) {
      if (verbose) cat(sprintf("#%d: %s of Block %s and Block %s (%s x %s) ... ",
                               i, STR, COMB[1], COMB[2], length(G1), length(G2)))
      RES <- FUN(x[, G1], x[, G2], ...)
      resMAT[G1, G2] <- RES
      resMAT[G2, G1] <- t(RES)
    } else { ## if y = smaller matrix or vector
      if (verbose) cat(sprintf("#%d: %s of Block %s and 'y' (%s x %s) ... ",
                               i, STR, COMB[1], length(G1), YCOL))
      RES <- FUN(x[, G1], y, ...)
      resMAT[G1, ] <- RES
    }

    if (verbose) {
      timeNOW <- proc.time() - timeINIT
      cat(timeNOW[3], "s\n")
    }

    gc()
  }

  return(resMAT)
}
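Note that the object returned by bigcor() is a disk-backed ff matrix rather than an ordinary R matrix. Assuming the call succeeded (which, per the note above, it does up to roughly 40,000 columns), ordinary indexing pulls a block back into RAM:
# inspect a corner of the disk-backed result; subsetting an 'ff' matrix
# returns an ordinary in-memory object for the requested block
dim(b1)
b1[1:5, 1:5]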
Suppose I have the following data frame
set.seed(36)
n <- 300
dat <- data.frame(x = round(runif(n,0,200)), y = round(runif(n, 0, 500)))
d <- dat[order(dat$y),]
For each value of d$y <= 300, I have to create a variable res in which the numerator is the sum of the indicator (d$x <= d$y[i]) and the denominator is the sum of the indicator (d$y >= d$y[i]). I have written the code in a for loop:
res <- NULL
for (i in seq_len(sum(d$y <= 300))) {
  numerator   <- sum(d$x <= d$y[i])
  denominator <- sum(d$y >= d$y[i])
  res[i] <- numerator / denominator
}
But my concern is that when the number of observations of x and y is large, that is, when the number of rows of the data frame increases, the for loop becomes slow. Additionally, if I simulate the data 1000 times and run the for loop for each simulation, the program becomes inefficient.
What would be a more efficient version of this code?
This depends on d already being sorted as it is:
# example data
set.seed(36)
n <- 1e5
dat <- data.frame(x = round(runif(n,0,200)), y = round(runif(n, 0, 500)))
d <- dat[order(dat$y),]
My suggestion (thanks to #alexis_laz for the denominator):
system.time(res3 <- {
  xs  <- sort(d$x)          # sorted x
  yt  <- d$y[d$y <= 300]    # truncated y
  num <- findInterval(yt, xs)
  den <- length(d$y) - match(yt, d$y) + 1L
  num/den
})
# user system elapsed
# 0 0 0
OP's approach:
system.time(res <- {
  res <- NULL
  for (i in seq_len(sum(d$y <= 300))) {
    numerator   <- sum(d$x <= d$y[i])
    denominator <- sum(d$y >= d$y[i])
    res[i] <- numerator / denominator
  }
  res
})
# user system elapsed
# 50.77 1.13 52.10
# verify it matched
all.equal(res,res3) # TRUE
#d.b's approach:
system.time(res2 <- {
  numerator   <- rowSums(outer(d$y, d$x, ">="))
  denominator <- rowSums(outer(d$y, d$y, "<="))
  res2 <- numerator/denominator
  res2 <- res2[d$y <= 300]
  res2
})
# Error: cannot allocate vector of size 74.5 Gb
# ^ This error is common when using outer() on large-ish problems
Vectorization. Generally, tasks are faster in R if they can be vectorized. The key functions related to ordered vectors have confusing names (findInterval, sort, order and cut), but fortunately they all work on vectors.
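As a quick illustration of the key primitive used above: for each element of its first argument, findInterval() returns how many elements of the (sorted) second argument are less than or equal to it, which is exactly the numerator count sum(d$x <= d$y[i]):
# findInterval(y, sorted_x) counts the elements of sorted_x that are <= each y
findInterval(c(2, 5), c(1, 3, 4, 7))
# [1] 1 3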
Continuous vs discrete. The match above should be a fast way to compute the denominator whether the data is continuous or has mass points / repeating values. If the data is continuous (and so has no repeats), the denominator can just be seq(length(xs), length = length(yt), by=-1). If it is fully discrete and has a lot of repetition (like the example here), there might be some way to make that faster as well, maybe like one of these:
den2 <- inverse.rle(with(rle(yt), list(
  values  = length(xs) - length(yt) + rev(cumsum(rev(lengths))),
  lengths = lengths)))

tab  <- unname(table(yt))
den3 <- rep(rev(cumsum(rev(tab))) + length(xs) - length(yt), tab)
# verify
all.equal(den,den2) # TRUE
all.equal(den,den3) # TRUE
findInterval will still work for the numerator for continuous data. It's not ideal for the repeated-values case considered here I guess (since we're redundantly finding the interval for many repeated yt values). Similar ideas for speeding that up likely apply.
Other options. As #chinsoon suggested, the data.table package might be a good fit if findInterval is too slow, since it has a lot of features focused on sorted data, but it's not obvious to me how to apply it here.
Instead of running a loop, generate all the numerators and denominators at once. This also allows you to keep track of which res is associated with which x and y. Later, you can keep only the ones you want.
You can use outer for element wise comparison between vectors.
numerator = rowSums(outer(d$y, d$x, ">=")) #Compare all y against all x
denominator = rowSums(outer(d$y, d$y, "<=")) #Compare all y against itself
res2 = numerator/denominator #Obtain 'res' for all rows
#I would first 'cbind' res2 to d and only then remove the ones for 'y <=300'
res2 = res2[d$y <= 300] #Keep only those 'res' that you want
Since this is using rowSums, this should be faster.
For a class assignment, I need to create a function that calculates n choose k. I did just that, and it works fine with small numbers (e.g. 6 choose 2), but I'm supposed to get it to work with 200 choose 50, where it naturally doesn't.
The answer is too large and R outputs NaN or Inf, saying:
> q5(200, 50)
[1] "NaN"
Warning message:
In factorial(n) : value out of range in 'gammafn'
I tried using logs and exponents, but it doesn't cut it.
q5 <- function(n, k) {
  answer <- log(exp( factorial(n) / ( factorial(k) * factorial(n - k) ) ))
  paste0(answer)
}
The answer to the actual question is that R cannot show numbers it cannot represent, and some of the terms in your expression are too big to represent, so it fails. However, there are approximations to the factorial that can be used - they work with logarithms, which grow much more slowly.
The most famous one, Stirling's approximation, was not accurate enough here, but Ramanujan's approximation came to the rescue :)
ramanujan <- function(n) {
  n*log(n) - n + log(n*(1 + 4*n*(1 + 2*n)))/6 + log(pi)/2
}

nchoosek <- function(n, k) {
  factorial(n) / (factorial(k) * factorial(n - k))
}

bignchoosek <- function(n, k) {
  exp(ramanujan(n) - ramanujan(k) - ramanujan(n - k))
}
nchoosek(20,5)
# [1] 15504
bignchoosek(20,5)
# [1] 15504.06
bignchoosek(200,50)
# [1] 4.538584e+47
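For what it's worth, base R's choose() also returns this value without overflowing, since it avoids forming the huge intermediate factorials:
choose(200, 50)
# [1] 4.538584e+47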
You can try this too:
q5 <- function(n, k) {
  # nchoosek = (n-k+1)(n-k+2)...n / (1.2...k)
  return(prod(sapply(1:k, function(i) (n - k + i) / i)))
}
q5(200, 50)
#[1] 4.538584e+47
or in log domain
q5 <- function(n, k) {
  # ln(nchoosek) = ln(n-k+1) + ln(n-k+2) + ... + ln(n) - ln(1) - ln(2) - ... - ln(k)
  return(exp(sum(sapply(1:k, function(i) log(n - k + i) - log(i)))))
}
q5(200, 50)
#[1] 4.538584e+47
The packages for large numbers:
Brobdingnag package for "Very large numbers in R":
https://cran.r-project.org/web/packages/Brobdingnag/index.html
Paper: https://www.researchgate.net/publication/251996764_Very_large_numbers_in_R_Introducing_package_Brobdingnag
library(Brobdingnag)
googol <- as.brob(10)^100 # googol:=10^100
googol
# [1] +exp(230.26) # exponential notation is convenient for very large numbers
gmp package for multiple Precision Arithmetic (big integers and rationals, prime number tests, matrix computation):
https://cran.r-project.org/web/packages/gmp/index.html
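If you need the exact integer rather than a double approximation, gmp can give it directly; a minimal example using gmp's chooseZ():
library(gmp)
chooseZ(200, 50)  # exact 200 choose 50 as a big integer (about 4.538584e+47)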
This solution calculates the complete row of Pascal's triangle:
x <- 1
print(x)
for (i in 1:200) { x <- c(0, x) + c(x, 0); print(x) }
x[51] ### 200 choose 50
## > x[51]
## [1] 4.538584e+47
(as I proposed for How would you program Pascal's triangle in R? )
If you want to speed up the code, drop the print(x) calls (output is a relatively slow operation).
To put the code in a function we can do
nchoosek <- function(n, k) {
  x <- 1
  for (i in 1:n) x <- c(0, x) + c(x, 0)
  x[k + 1]  ### n choose k
}
nchoosek(200, 50) ### testing the function
## [1] 4.538584e+47
Here is a more refined version of my function:
nchoosek <- function(n, k) {
  if (k == 0) return(1)
  if (k + k > n) k <- n - k
  if (k == 0) return(1)
  x <- 1
  for (i in 1:k) x <- c(0, x) + c(x, 0)
  for (i in 1:(n - k)) x <- x + c(0, head(x, -1))
  tail(x, 1)
}
nchoosek(200, 50) ### testing the function
## [1] 4.538584e+47
Having the following matrix and vector:
a <- matrix(c(1, 4, 7,
              2, 5, 8,
              3, 6, 9), nrow = 3)
b <- c(1, 1, 1)
How do I sum recursively over each row of the matrix inside a function until I obtain a desired result, using the last result to calculate the next operation, as shown:
b<-b+a[1,]
b<-b+a[2,]
b<-b+a[3,]
b<-b+a[1,]
b<-b+a[2,]
sum(b) > 100 # Sum recursively until sum(b) > 100
This operation looks similar to this answer: Multiply recursiverly in r. However, it uses the results from previous operations to calculate the next ones.
Here's a recursive function to do what you're after,
# Sample Data
a <- matrix(c(1, 4, 7,
              2, 5, 8,
              3, 6, 9), nrow = 3)
b <- c(1, 1, 1)
We create a function that references itself with a value that increments modulo the number of rows
recAdd <- function(b, a, start = 1, size = NROW(a)) {
  if (sum(b) > 100) return(b)
  return(recAdd(b + a[start, ], a, start = start %% size + 1, size))
}
> recAdd(b,a)
[1] 30 38 46
EDIT: Alternatively, here's a way with no recursion at all, which is much faster when the target is large relative to the sum of the matrix (but is slower on data of this size). Basically we get to take advantage of Euclidean (integer) division.
nonrecAdd <- function(b, a, target = 100) {
  Remaining <- target - sum(b)
  perloop   <- sum(a)
  nloops    <- Remaining %/% perloop
  Remaining <- Remaining %% perloop
  if (Remaining > 0) {
    cumulativeRowsums <- cumsum(rowSums(a))
    finalindex <- which((Remaining %/% cumulativeRowsums) == 0)[1]
    b + colSums(a) * nloops + colSums(a[1:finalindex, , drop = FALSE])
  } else {
    b + colSums(a) * nloops
  }
}
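A quick check on the sample data above; the non-recursive version should reproduce the recursive result:
nonrecAdd(b, a)
# [1] 30 38 46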
I have an Ax = b type linear system, where A is an upper-triangular matrix. The structure of A is defined as follows:
comp.Amat <- function(i,j,prob) ifelse(i > j, 0, dbinom(x=i, size=j, prob=prob))
prob <- 1/4
A <- outer(1:50, 1:50 , FUN=function(r,c) comp.Amat(r,c,prob) )
The entries of A are binomial probabilities, and the issue is that the diagonal entries rapidly approach 0 as the size of A grows.
If we define the vector b as follows as well:
b <- seq(1,50,1);
Then solve(a=A,b=b) - gives an error:
" system is computationally singular: reciprocal condition number = 1.07584e-64"
That makes sense, since the diagonal entries are almost 0, so the matrix becomes numerically singular.
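To see just how small the diagonal gets, the last diagonal entry of the 50 x 50 system is already below 1e-30:
dbinom(50, 50, prob = 1/4)  # = 0.25^50, the last diagonal entry of A
# [1] 7.888609e-31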
As a work-around, I have written the following recursive function, which starts by computing the value corresponding to the last diagonal entry and then substitutes that value back into the previous rows. Since each entry of the matrix is dbinom(x=i, size=j, prob) for j >= i, I can get a solution this way.
solve.for.x.custom <- function(A, b, prob) {
  n <- length(A[1, ])
  m <- length(A[, 1])
  x <- seq(1, n, 1)
  x[x > 0] <- -1000

  calc.inv.Aii <- function(i, j, prob) {
    res <- (1 / (prob * (1 - prob)))^i
    return(res)
  }

  for (i in m:1) {
    if (i == m) {
      rhs <- 0
    } else {
      rhs <- 0
      for (j in m:(i + 1)) {
        rhs <- dbinom(x = i, size = j, prob = prob) * x[j] + rhs
      }
    }
    x[i] <- (b[i] - rhs) * calc.inv.Aii(i, i)
  }
  print(x)
  return(x)
}
My problem is that when I multiply this solution x' by the matrix A, the errors (Ax' - b) are huge. Since I have an analytical solution (each entry x_i can be expressed in terms of binomial probabilities multiplied by previously computed values), the error I should get is 0 in each row.
I see that (1 / (1/a)) may not be exactly equal to a because of these issues. However, the current errors are really big (-1.13817489781529e+168).
x_prime=solve.for.x.custom(A, b, prob)
A%*%x_prime - b
#output
[,1]
[1,] -1.13817489781529e+168
[2,] 2.11872209742428e+167
[3,] -1.58403954589004e+166
[4,] 6.52328959209082e+164
[5,] -1.69562573261261e+163
[6,] 3.00614551450976e+161
***
[49,] -7.58010305220250e+08
[50,] 9.65162608741321e+03
I would really appreciate it if you could recommend any suggestions or efficient methods. I gave the size of A and b as 50, but I intend to grow them as well, in which case the error will increase too.
If your matrix A is upper triangular you probably want to use backsolve(A, b) rather than solve(A, b).
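A minimal sketch, assuming A and b as constructed in the question:
# back substitution on the upper triangle, no general matrix inversion
x_bs <- backsolve(A, b)
max(abs(A %*% x_bs - b))  # residual check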
You can do arbitrary precision in R with Rmpfr, which will require writing a compatible version of backsolve. With the code below we can get
> print(max(abs(b - .b)), digits=5)
1 'mpfr' number of precision 1024 bits
[1] 2.9686e-267
There is one important caveat though: the values in A may not be accurate enough, since they come from dbinom rather than from mpfr objects. Depending on your end goal, you may need to write your own version of dbinom using Rmpfr.
library(Rmpfr)
logcomp.Amat <- function(i,j,prob) ifelse(i > j, -Inf, dbinom(x=i, size=j, prob=prob, log=TRUE))
nbits <- 1024
.backsolve <- function(A, b) {
  n <- length(b)
  x <- mpfr(numeric(n), nbits)
  for (i in rev(seq_len(n))) {
    known <- i + seq_len(n - i)
    z <- if (length(known) > 0) sum(A[i, known] * x[known]) else 0
    x[i] <- (b[i] - z) / A[i, i]
  }
  return(x)
}
logA <- outer(1:50, 1:50, logcomp.Amat, prob=1/4)
b <- 1:50
A <- exp(mpfr(logA, nbits))
b <- mpfr(b, nbits)
x <- .backsolve(A, b)
.b <- as.vector(A %*% x)
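On the caveat about dbinom above: one option would be to build the probabilities from exact binomial coefficients instead. A rough sketch (dbinom_mpfr is a hypothetical helper name; it relies on Rmpfr's chooseMpfr()):
# hypothetical arbitrary-precision binomial pmf (not part of Rmpfr itself)
dbinom_mpfr <- function(x, size, prob, nbits = 1024) {
  p <- mpfr(prob, nbits)                        # probability as an mpfr number
  chooseMpfr(size, x) * p^x * (1 - p)^(size - x)
}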