Bitwise operations with bigz in gmp - r

I'm translating some cryptography scripts from Python to R. Python seems to handle very large integers much better than R can natively:
10593080468914978578954316149578855170502344604886137564370015851276669104055 >> 1
# 5296540234457489289477158074789427585251172302443068782185007925638334552027
But I'm aware of the gmp library for R, which handles them well (mostly):
as.bigz("10593080468914978578954316149578855170502344604886137564370015851276669104055")
For context, to translate these scripts I need to use bitwise operations. The problem is that these bigz objects are encoded as raw values, and so I can't use the base bitwise functions for them as they are incompatible.
Finding a workaround for shifting bits to the left and right is straightforward, but I need something that will:
Perform the equivalent of bitwAnd and bitwOr
On bigz values
WITHOUT losing precision.
Any ideas?
Bonus: if you can provide an interpretation of bitwAnd and bitwOr in terms of base 10 then that could work. Preferably with some example code in R, if not I can work around it.

I'm sure there must be a slicker and faster way, but one option would be something like this...
library(gmp)
z <- as.bigz("10593080468914978578954316149578855170502344604886137564370015851276669104055")
w <- as.bigz("1234874454654321549879876546351546654456432132321654987584654321321")
#express as numeric vectors of 0s and 1s
z1 <- as.numeric(charToRaw(as.character(z, b=2)))-48
w1 <- as.numeric(charToRaw(as.character(w, b=2)))-48
#normalise the lengths
mx <- max(length(z1), length(w1))
z1 <- c(rep(0, mx-length(z1)), z1)
w1 <- c(rep(0, mx-length(w1)), w1)
#then do & or | and convert back to bigz
zandw <- as.bigz(paste0("0b", rawToChar(as.raw(1*(z1 & w1) + 48))))
zorw <- as.bigz(paste0("0b", rawToChar(as.raw(1*(z1 | w1) + 48))))
zandw
Big Integer ('bigz') :
[1] 905773543034890641004226585015137324621885921615658881499355162273
zorw
Big Integer ('bigz') :
[1] 10593080469244079490573747058454505131838753934720683775076011957361968263103
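For the bonus question, bitwAnd/bitwOr can also be expressed in plain (base-10) arithmetic: peel off the lowest bit with %% 2 and shift right with %/% 2, both of which gmp provides for bigz. A minimal sketch (the names bigAnd and bigOr are just illustrative, and non-negative inputs are assumed):
# Hedged sketch: bitwise AND / OR on bigz using only %% and %/%, so no precision is lost
bigAnd <- function(a, b) {
  res <- as.bigz(0); pw <- as.bigz(1)
  while (a > 0 && b > 0) {
    if (a %% 2 == 1 && b %% 2 == 1) res <- res + pw  # both lowest bits set
    a <- a %/% 2; b <- b %/% 2; pw <- pw * 2         # shift right, next power of two
  }
  res
}
bigOr <- function(a, b) {
  res <- as.bigz(0); pw <- as.bigz(1)
  while (a > 0 || b > 0) {
    if (a %% 2 == 1 || b %% 2 == 1) res <- res + pw  # either lowest bit set
    a <- a %/% 2; b <- b %/% 2; pw <- pw * 2
  }
  res
}
bigAnd(z, w)  # should reproduce zandw above, just more slowly (one loop iteration per bit)
bigOr(z, w)   # should reproduce zorw above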

Related

Integer overflow from many-leveled factor with class.ind()?

I'm trying to convert a "big" factor into a set of indicator (i.e. dummy, binary, flag) variables in R as such:
FLN <- data.frame(nnet::class.ind(FinelineNumber))
where FinelineNumber is a 5,000-level factor from Kaggle.com's current Walmart contest (the data is public if you'd like to reproduce this error).
I keep getting this concerning-looking warning:
In n * (unclass(cl) - 1L) : NAs produced by integer overflow
Memory available to the system is essentially unlimited. I'm not sure what the problem is.
The source code of nnet::class.ind is:
function (cl) {
n <- length(cl)
cl <- as.factor(cl)
x <- matrix(0, n, length(levels(cl)))
x[(1L:n) + n * (unclass(cl) - 1L)] <- 1
dimnames(x) <- list(names(cl), levels(cl))
x
}
.Machine$integer.max is 2147483647. If n*(nlevels - 1L) is greater than this value that should produce your error. Solving for n:
imax <- .Machine$integer.max
nlevels <- 5000
imax/(nlevels-1L)
## [1] 429582.6
You'll encounter this problem if you have 429583 or more rows (not particularly big for a data-mining context). As commented above, you'll do much better with Matrix::sparse.model.matrix (or Matrix::fac2sparse), if your modeling framework can handle sparse matrices. Alternatively, you'll have to rewrite class.ind to avoid this bottleneck (i.e. indexing by rows and columns rather than by absolute location) [#joran comments above that R indexes large vectors via double-precision values, so you might be able to get away with just hacking that line to
x[(1:n) + n * (unclass(cl) - 1)] <- 1
possibly throwing in an explicit as.numeric() here or there to force the coercion to double ...]
Even if you were able to complete this step, you'd end up with a 5000*650000 matrix - it looks like that will be 12Gb.
print(650*object.size(matrix(1L,5000,1000)),units="Gb")
I guess if you've got 100Gb free that could be OK ...
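As a concrete sketch of the sparse route (assuming the Matrix package is available and FinelineNumber is the factor from the question):
library(Matrix)
# fac2sparse() returns a sparse levels-x-observations indicator matrix,
# so even ~650,000 rows x 5,000 levels stays small in memory
FLN_sparse <- fac2sparse(FinelineNumber)
# or via the formula interface, with observations in rows:
# FLN_sparse <- sparse.model.matrix(~ FinelineNumber - 1)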

Solving system of nonlinear equations in R

I am trying to solve a system of non-linear equations in R but it keeps giving me this error "number of items to replace is not a multiple of replacement length".
My code looks like this:
my_data <- Danske
D <- my_data$D
V <- my_data$V
r <- my_data$r
s <- my_data$s
fnewton <- function(x)
{
y <- numeric(2)
d1 <- (log(x[1]/D)+(r+x[2]^2/2))/x[2]
d2 <- d1-x[2]
y[1] <- V - (x[1]*pnorm(d1) - exp(-r)*D*pnorm(d2))
y[2] <- s*V - pnorm(d1)*x[2]*x[1]
y
}
xstart <- c(239241500000, 0.012396)
nleqslv(xstart, fnewton, method="Newton")
D, V, r and s are numeric[1:2508] values and I think that's where the problem comes from. If I have single 1x1 values, it solves it well; however, if I insert vectors with 2508 values, it only calculates the first x1 and x2 and then come the warnings with the message I wrote above.
Thank you for any help.
Lina
You don't really have a "system" of equations the way you've written your fnewton. May I recommend (disclaimer: I'm the author) you take a look at the ktsolve package? You may find that it'll get you the solutions you're looking for a bit more easily. You can use your fnewton almost as written, except that you will pass a collection of named scalar variables into the function.
If you want to solve (either with nleqslv or ktsolve) for a variety of input 'starting points', then you should wrap your approach inside a loop or *apply function.
Too long for a comment.
Without having a copy of your data, it's impossible to verify this, but...
You are passing fnewton(...) a vector of length 2, and expecting a vector of length 2 as the return value. But in your function, d1 and d2 are set to vectors of length 2508. Then you attempt to set y[1] and y[2] to vectors of length 2508. R can't do that, so it uses the first value in the RHS and provides the warnings.
I suggest you step through your function and see what each line is doing.
Can't propose a solution because I have no idea what you are trying to accomplish.
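If the goal is one (x1, x2) solution per row of the data, here is a rough sketch of the loop/apply idea suggested above (untested, since the Danske data aren't available); it solves the two-equation system once per observation, using that row's D, V, r and s:
library(nleqslv)
solve_one <- function(i) {
  Di <- D[i]; Vi <- V[i]; ri <- r[i]; si <- s[i]
  fn <- function(x) {
    d1 <- (log(x[1]/Di) + (ri + x[2]^2/2))/x[2]
    d2 <- d1 - x[2]
    c(Vi - (x[1]*pnorm(d1) - exp(-ri)*Di*pnorm(d2)),
      si*Vi - pnorm(d1)*x[2]*x[1])
  }
  nleqslv(c(239241500000, 0.012396), fn, method = "Newton")$x
}
# one row of solutions (x1, x2) per observation
results <- t(sapply(seq_along(D), solve_one))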

R: Big integer matrices

I have some big integer matrices (1000 x 1000000) that I have to multiply and do rowmax on.
They contain 0 and 1 (approx 99% 1 and 1% 0 and no other values).
My problem is memory consumption: Currently R eats 8 bytes per integer.
I have looked at SparseMatrix, but it seems I cannot set the default value to 1 instead of 0.
How can I represent these matrices in a memory efficient way, but so I can still multiply them as matrices and use rowmax?
Preferably it should work with R-2.15 and not require additional libraries.
Second idea: If you have a couple of these matrices, call them X_1 and X_2, let Y_1 = 1*1' - X_1 and Y_2 = 1*1' - X_2; the Y's can be sparse because they are 99% zero. So their product is
X_1 * X_2 = ( 1*1' - Y_1) * (1*1' - Y_2) = 1*1'*1*1' - Y_1*1*1' - 1*1'*Y_2 + Y_1 * Y_2
which you can simplify even further.
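A rough sketch of that decomposition using the Matrix package (X1 is n x k and X2 is k x m, both 0/1; the all-ones terms reduce to row and column sums of the sparse Y's, though note the dense result can itself be large):
library(Matrix)
Y1 <- Matrix(1 - X1, sparse = TRUE)   # ~99% zero, so sparse storage pays off
Y2 <- Matrix(1 - X2, sparse = TRUE)
k  <- ncol(X1)
# X1 %*% X2 = k*J - Y1*J - J*Y2 + Y1 %*% Y2, where J is an all-ones matrix
XX <- k - outer(rowSums(Y1), rep(1, ncol(X2))) -
          outer(rep(1, nrow(X1)), colSums(Y2)) +
          as.matrix(Y1 %*% Y2)
row_max <- apply(XX, 1, max)          # rowmax of the product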
There are several sparse matrix packages (slam, SparseM, Matrix, ...) but I doubt any will do a bitwise representation, or even single char, as you'd need here. You may have to code that up yourself.
Alternatively, packages like ff allow more compact storage but AFAIK will not do matrix ops for you. Maybe you could build that on top of them?
Off the top of my head, I can't think of a packaged solution...
It seems like you could represent this type of data extremely efficiently with run length encoding by row. From there, you could implement a matrix-vector multiply method for rle objects (which might be hard) and row-max (which should be trivial).
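For the row-max half at least, a tiny sketch (X taken to be one of the 0/1 matrices):
rows_rle <- apply(X, 1, rle)    # one rle object per row
row_max  <- vapply(rows_rle, function(r) max(r$values), numeric(1))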
Since there are only 1% 0's it would not be difficult to compress. One trivial example:
pseudo.matrix <- function(x) {
  nrow <- nrow(x)
  ncol <- ncol(x)
  zeroes.cells <- which(x == 0)   # linear indices of the (rare) zeros
  list(nrow = nrow, ncol = ncol, zeroes.cells = zeroes.cells)
}
This alone would reduce their memory size significantly. And it would be easy to recover the original matrix:
recover.matrix <- function(x) {
  m <- matrix(1, x$nrow, x$ncol)
  m[x$zeroes.cells] <- 0          # vectorised; no loop needed
  m
}
I guess it is possible to figure out a way to efficiently multiply these pseudo matrices, since the result for each cell would be something like the number of columns of the first matrix minus an adjustment regarding the number of zeros in the operation, but I am not sure how easy it would be to do this.

intToBin with large numbers

I'm using the intToBin() function from "R.utils" package and am having trouble using it to convert large decimal numbers to binary.
I get this error: NAs introduced by coercion.
Is there another function out there that can handle big numbers/ is there an algorithm/ code to implement such a function?
Thanks
If you read the help page for intToBin, it quite explicitly says it takes "integer" inputs. These are not mathematical "integers" but rather the computer-language-defined ints, which in R are limited to 32 bits (a maximum of 2147483647).
You'll need to find (or write :-() a function which converts floating-point numbers to binary floats, or if you're lucky, perhaps Rmpfr or gmp packages, which do arbitrary precision "big number" math, may have a float-to-binary tool.
By the time this gets posted, someone will have exposed my ignorance by posting an existing function, w/ my luck.
Edit -- like maybe the package pack
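For the conversion itself, gmp can already print a bigz in base 2 via as.character() with a base argument, which sidesteps the 32-bit limit entirely. A minimal sketch:
library(gmp)
z <- as.bigz("10593080468914978578954316149578855170502344604886137564370015851276669104055")
as.character(z, b = 2)   # binary representation as a string, arbitrary precision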
I needed a converter between doubles and hex numbers. So I wrote those, might be helpful to others
doubleToHex <- function(x) {
  # assumes x is a non-negative whole number
  if (x < 16)
    return(sprintf("%X", x))
  remainders <- c()
  while (x > 15) {
    remainders <- append(remainders, x %% 16)   # hex digits, least significant first
    x <- floor(x/16)
  }
  remainders <- paste(sprintf("%X", rev(remainders)), collapse = "")
  return(paste(sprintf("%X", x), remainders, sep = ""))  # leading digit must be formatted as hex too
}
hexToDouble <- function(x) {
  x <- strsplit(x, "")[[1]]
  output <- as.double(0)
  for (i in rev(seq_along(x))) {
    # digit value times its power of 16
    output <- output + as.numeric(as.hexmode(x[i])) * 16^(length(x) - i)
  }
  return(output)
}
doubleToHex(x = 8356723)    # "7F8373"
hexToDouble(x = "7F8373")   # 8356723
Hasn't been extensively tested yet, let me know if you detect a problem with it.

NAs produced by integer overflow + R on linux

I'm running an R script on a UNIX-based system. The script contains multiplication of large numbers, and the results were NAs from integer overflow; when I run the same script on Windows, this problem does not appear.
But I need to keep the script running overnight on the desktop (which is Unix).
Is there any solution to this problem?
Thanks
for(ol in seq(1,nrow(yi),by=25))
{
for(oh in seq(1,nrow(yi),by=25))
{
A=(N*(ol^2)) + ((N*(N+1)*(2*N+1))/6) -(2*ol*((N*N+1)/2)) + (2*N*ol*(N-oh+1)) + ((N-oh+1)*N^2) + (2*N*(oh-N-1)*(oh+N))
}
}
with N = 16569 = nrow(yi), but the first round is not being calculated on Unix.
Can you cast your integers to floating-point numbers in order to use floating-point math for the computations?
For example:
> x=as.integer(1000000)
> x*x
[1] NA
Warning message:
In x * x : NAs produced by integer overflow
> x=as.numeric(1000000)
> x*x
[1] 1e+12
As an aside, it is not entirely clear why the warning would appear in one environment but not the other. I first thought that 32-bit and 64-bit builds of R might be using 32-bit and 64-bit integers respectively, but that doesn't appear to be the case. Are both your environments configured identically in terms of how warnings are displayed?
As the other answers have pointed out, there is something a bit non-reproducible/strange about your results so far. Nevertheless, if you really must do exact calculations on large integers, you probably need an interface between R and some other system.
Some of your choices are:
the gmp package (see this page and scroll down to R)
an interface to the bc calculator on googlecode
there is a high precision arithmetic page on the R wiki which compares interfaces to Yacas, bc, and MPFR/GMP
there is a limited interface to the PARI/GP package in the elliptical package, but this is probably (much) less immediately useful than the preceding three choices
Most Unix or Cygwin systems should have bc installed already. GMP and Yacas are easy to install on modern Linux systems ...
Here's an extended example, with a function that can choose among numeric, integer, or bigz computation.
library(gmp)  ## needed for as.bigz() in the "bigz" branch
f1 <- function(ol=1L,oh=1L,N=16569L,type=c("num","int","bigz")) {
type <- match.arg(type)
## convert all values to appropriate type
if (type=="int") {
ol <- as.integer(ol)
oh <- as.integer(oh)
N <- as.integer(N)
one <- 1L
two <- 2L
six <- 6L
cc <- as.integer
} else if (type=="bigz") {
one <- as.bigz(1)
two <- as.bigz(2)
six <- as.bigz(6)
N <- as.bigz(N)
ol <- as.bigz(ol)
oh <- as.bigz(oh)
cc <- as.bigz
} else {
one <- 1
two <- 2
six <- 6
N <- as.numeric(N)
oh <- as.numeric(oh)
ol <- as.numeric(ol)
cc <- as.numeric
}
## if using bigz mode, the ratio needs to be converted back to bigz;
## defining cc() as above seemed to be the most transparent way to do it
N*ol^two + cc(N*(N+one)*(two*N+one)/six) -
ol*(N*N+one) + two*N*ol*(N-oh+one) +
(N-oh+one)*N^two + two*N*(oh-N-one)*(oh+N)
}
I removed a lot of unnecessary parentheses, which were actually making it harder to see what was going on. It is indeed true that for the (1,1) case the final result is not bigger than .Machine$integer.max, but some of the intermediate steps are ... (for the (1,1) case this actually reduces to -1/6*(N+2)*(4*N^2-5*N+3) ...)
f1() ## -3.032615e+12
f1() > .Machine$integer.max ## FALSE
N <- 16569L
N*(N+1)*(2*N+1) > .Machine$integer.max ## TRUE
N*(N+1L)*(2L*N+1L) ## integer overflow (NA)
f1(type="int") ## integer overflow
f1(type="bigz") ## "-3032615078557"
print(f1(),digits=20) ## -3032615078557: no actual loss of precision in this case
PS: you have a (N*N+1) term in your equation. Should that really be N*(N+1), or did you really mean N^2+1?
Given your comments, I guess that you seriously misunderstand the "correctness" of numbers in R. You say the outcome you get on Windows is something like -30598395869593930593. Now, on both 32-bit and 64-bit, that precision is not even possible using a double, let alone an integer:
> x <- -30598395869593930593
> format(x,scientific=F)
[1] "-30598395869593931776"
> all.equal(x,as.numeric(format(x,scientific=F)))
[1] TRUE
> as.integer(x)
[1] NA
You have 16 digits you can trust; all the rest is bollocks. Then again, an accuracy of 16 digits is already pretty strong. Most measurement tools don't even come close to that.
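The 16-digit figure comes straight from the 53-bit significand of a double:
.Machine$double.digits   # 53 bits in the significand
53 * log10(2)            # ~15.95 decimal digits of precision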
