R: Fast hashing of strings to integer modulo n?

I have a vector of strings and I would like to hash each element individually to integers modulo n.
This SO post suggests an approach using digest and strtoi, but when I try it I get NA as the returned value:
library(digest)
strtoi(digest("cc", algo = "xxhash32"), 16L)
So the above approach will not work, as it cannot even produce an integer, let alone one modulo n.
What's the best way to hash a large vector of strings to integers modulo n for some n? Efficient solutions are more than welcome as the vector is large.

R uses 32-bit integers for integer vectors, so the range of representable integers is restricted to about +/-2*10^9. strtoi returns NA because the number is too big.
The mpfr-function from the Rmpfr package should work for you:
mpfr(x = digest("cc", algo = "xxhash32"), base = 16)
[1] 4192999065
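If you only need the hash modulo n, a pure base-R workaround (a sketch of my own, not from the linked post; hash_mod is a hypothetical helper name) is to fold the hex digits of the digest into a running remainder, so the intermediate values never leave the 32-bit integer range:
library(digest)
# Fold the hex digits Horner-style, reducing modulo n at each step
hash_mod <- function(s, n) {
  hex_digits <- strtoi(strsplit(digest(s, algo = "xxhash32"), "")[[1]], 16L)
  Reduce(function(acc, d) (acc * 16 + d) %% n, hex_digits)
}
hash_mod("cc", 17)
# and for a whole vector:
vapply(c("string1", "string2"), hash_mod, numeric(1), n = 17)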

I made an Rcpp implementation using code from this SO post, and the resulting code is quite fast even for large-ish string vectors.
To use it:
if(!require(disk.frame)) devtools::install_github("xiaodaigh/disk.frame")
modn = 17
disk.frame::hashstr2i(c("string1","string2"), modn)
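For illustration, here is a minimal Rcpp sketch of the same general idea (a simple polynomial hash reduced modulo n at every step); note this is not disk.frame's actual implementation, just an assumed stand-in with a made-up name:
library(Rcpp)
cppFunction('
IntegerVector hash_mod_n(CharacterVector x, int n) {
  // simple polynomial rolling hash, reduced modulo n at each step
  IntegerVector out(x.size());
  for (R_xlen_t k = 0; k < x.size(); ++k) {
    std::string s = Rcpp::as<std::string>(x[k]);
    long long h = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
      h = (h * 31 + static_cast<unsigned char>(s[i])) % n;
    }
    out[k] = static_cast<int>(h);
  }
  return out;
}')
hash_mod_n(c("string1", "string2"), 17L)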

Related

Is it possible to have a smaller dataset than when using factors?

I am trying to decrease the memory footprint of some of my datasets where I have a small set of factors per column (repeated a large number of times). Are there better ways to minimize it? For comparison, this is what I get from just using factors:
library(pryr)
N <- 10 * 8
M <- 10
Initial data:
test <- data.frame(A = c(rep(strrep("A", M), N), rep(strrep("B", N), N)))
object_size(test)
# 1.95 kB
Using Factors:
test2 <- as.factor(test$A)
object_size(test2)
# 1.33 kB
Aside: I naively assumed that factors replaced the strings with a number, and was pleasantly surprised to see test2 smaller than test3 (defined below). Can anyone point me to some material on how to optimize the factor representation?
test3 <- data.frame(A = c(rep("1", N), rep("2", N)))
object_size(test3)
# 1.82 kB
I'm afraid the difference is minimal.
The principle would be easy enough: instead of (in your example) 160 strings, you would just be storing 2, along with 160 integers (which are only 4 bytes).
Except that R already stores character vectors internally in much the same way.
Every modern language supports strings of (virtually) unlimited length, which means you cannot store a vector (or array) of strings as one contiguous block, since any element can be reassigned a value of arbitrary length. If a somewhat longer value were assigned to one element, the rest of the array would have to be shifted, or the OS/language would have to reserve large amounts of space for each string.
Therefore, strings are stored at whatever place in memory is convenient, and arrays (or vectors in R) are stored as blocks of pointers to the place where the value actually is.
In the early days of R, each pointer pointed to a separate place in memory, even if the actual value was the same. So in your example, 160 pointers to 160 memory locations. But that has changed; nowadays it is implemented as 160 pointers to 2 memory locations.
There may be some small differences, mainly because a factor can normally support only 2^31-1 levels, meaning 32-bit integers are enough to store it, while a character vector mostly uses 64-bit pointers. Then again, there is more overhead in factors.
Generally, there may be some advantage in using a factor if you really have a large percentage of duplicates, but if that is not the case it may even harm your memory usage.
And the example you provided doesn't work as intended, as you're comparing a data.frame with a factor, instead of the bare character vector.
Even stronger: when I reproduce your example, I only get your results if I set stringsAsFactors to FALSE, so you're comparing a factor to a factor in a data.frame.
Comparing the results otherwise gives a much smaller difference: 1568 bytes for the character vector versus 1328 bytes for the factor.
And that only works if you have a lot of repeated values; if you look at this, you see that the factor can be larger:
> object.size(factor(sample(letters)))
2224 bytes
> object.size(sample(letters))
1712 bytes
So generally, there is no real way to compress your data while still keeping it easy to work with, except for using common sense in what you actually want to store.
I don't have a direct answer to your question, but here is some information from the book "Advanced R" by Hadley Wickham:
Factors
One important use of attributes is to define factors. A factor
is a vector that can contain only predefined values, and is used to
store categorical data. Factors are built on top of integer vectors
using two attributes: the class, “factor”, which makes them behave
differently from regular integer vectors, and the levels, which
defines the set of allowed values.
Also:
"While factors look (and often behave) like character vectors, they
are actually integers. Be careful when treating them like strings.
Some string methods (like gsub() and grepl()) will coerce factors to
strings, while others (like nchar()) will throw an error, and still
others (like c()) will use the underlying integer values. For this
reason, it’s usually best to explicitly convert factors to character
vectors if you need string-like behaviour. In early versions of R,
there was a memory advantage to using factors instead of character
vectors, but this is no longer the case."
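To see the quoted description in action, here is a quick generic illustration (not tied to your data):
f <- factor(c("A", "B", "A", "A"))
typeof(f)        # "integer": the underlying data are integer codes
unclass(f)       # the codes 1 2 1 1, with a "levels" attribute
attributes(f)    # $levels = c("A", "B"), $class = "factor"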
There is an R package called fst (Lightning Fast Serialization of Data Frames for R) with which you can create compressed fst objects for your data frame. A detailed explanation can be found in the fst package manual, but I'll briefly explain how to use it and how much space an fst object takes. First, let's make your test dataframe a bit larger, as follows:
library(pryr)
N <- 1000 * 8
M <- 100
test <- data.frame(A = c(rep(strrep("A", M), N), rep(strrep("B", N), N)))
object_size(test)
# 73.3 kB
Now, let's convert this dataframe into an fst object, as follows:
install.packages("fst") #install the package
library(fst) #load the package
path <- paste0(tempfile(), ".fst") #create a temporary '.fst' file
write_fst(test, path) #write the dataframe into the '.fst' file
test2 <- fst(path) #load the data as an fst object
object_size(test2)
# 2.14 kB
The disk space for the created .fst file is 434 bytes. You can deal with test2 as a normal dataframe (as far as I have tried).
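If you later need (part of) the data back in memory, the same file can be read with read_fst(); column and row selection is done on disk, so you only pay for what you read. A small sketch (test_subset is just an illustrative name):
test_subset <- read_fst(path, columns = "A", from = 1, to = 10) #first 10 rows of column A
head(test_subset)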
Hope this helps.

Bitwise operations with bigz in gmp

I'm translating some cryptography scripts from Python to R. Python seems to handle very large integers much better than R can natively:
10593080468914978578954316149578855170502344604886137564370015851276669104055 >> 1
# 5296540234457489289477158074789427585251172302443068782185007925638334552027
But I'm aware of the gmp library for R, which handles them well (mostly):
as.bigz("10593080468914978578954316149578855170502344604886137564370015851276669104055")
For context, to translate these scripts I need to use bitwise operations. The problem is that these bigz objects are encoded as raw values, and so I can't use the base bitwise functions for them as they are incompatible.
Finding workarounds for shifting bits to the left and right is straightforward (sketched below), but I need something that will:
Perform the equivalent of bitwAnd and bitwOr
On bigz values
WITHOUT losing precision.
Any ideas?
Bonus: if you can provide an interpretation of bitwAnd and bitwOr in terms of base 10, then that could work. Preferably with some example code in R; if not, I can work around it.
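(For reference, the shift workaround mentioned above can be written with gmp's ordinary arithmetic, assuming a current gmp where %/% and * are defined for bigz: a right shift by k is integer division by 2^k, and a left shift is multiplication by 2^k.)
library(gmp)
z <- as.bigz("10593080468914978578954316149578855170502344604886137564370015851276669104055")
z %/% 2   #equivalent of z >> 1
z * 4     #equivalent of z << 2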
I'm sure there must be a slicker and faster way, but one option would be something like this...
library(gmp)
z <- as.bigz("10593080468914978578954316149578855170502344604886137564370015851276669104055")
w <- as.bigz("1234874454654321549879876546351546654456432132321654987584654321321")
#express as numeric vectors of 0s and 1s
z1 <- as.numeric(charToRaw(as.character(z, b=2)))-48
w1 <- as.numeric(charToRaw(as.character(w, b=2)))-48
#normalise the lengths
mx <- max(length(z1), length(w1))
z1 <- c(rep(0, mx-length(z1)), z1)
w1 <- c(rep(0, mx-length(w1)), w1)
#then do & or | and convert back to bigz
zandw <- as.bigz(paste0("0b", rawToChar(as.raw(1*(z1 & w1) + 48))))
zorw <- as.bigz(paste0("0b", rawToChar(as.raw(1*(z1 | w1) + 48))))
zandw
Big Integer ('bigz') :
[1] 905773543034890641004226585015137324621885921615658881499355162273
zorw
Big Integer ('bigz') :
[1] 10593080469244079490573747058454505131838753934720683775076011957361968263103
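Wrapping the above in a small helper keeps the conversion logic in one place (bigz_bitw is just an illustrative name; op can be any element-wise logical operator such as `&` or `|`):
bigz_bitw <- function(a, b, op = `&`) {
  a_bits <- as.numeric(charToRaw(as.character(a, b = 2))) - 48
  b_bits <- as.numeric(charToRaw(as.character(b, b = 2))) - 48
  mx <- max(length(a_bits), length(b_bits))
  a_bits <- c(rep(0, mx - length(a_bits)), a_bits)   #pad to equal length
  b_bits <- c(rep(0, mx - length(b_bits)), b_bits)
  as.bigz(paste0("0b", rawToChar(as.raw(1 * op(a_bits, b_bits) + 48))))
}
bigz_bitw(z, w, `&`)   #same result as zandw above
bigz_bitw(z, w, `|`)   #same result as zorw above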

R sweep-equivalent with integer division

Is there a way to use sweep(dataframe) with integer division, or something that is equivalent to such?
This is a minimal example of sweep using ordinary division, which I want to replace with integer division:
sweep(x = mtcars, MARGIN = 2, STATS = unlist(mtcars[1,]), FUN = '/')
Some limitations I need to stick to:
I need to preserve the column names of the dataframe, as done in the example above.
I cannot just use round, floor, ceiling, or similar; it needs to be an equivalent of integer division (floor would have different effects on negative numbers than integer division).
If possible, I'd prefer to not store any information in additional variables during this process.
I'm dealing with a relatively large dataframe, so it could turn out that very slow solutions might not be an option here.
Does anyone know a way of achieving this in R?
Pass '%/%' as your function; that is integer division. See the arithmetic operator docs (?Arithmetic).
sweep(x = mtcars, MARGIN = 2, STATS = unlist(mtcars[1,]), FUN = '%/%')
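A quick sanity check (my own illustration, not part of the original answer) that the column names survive and the result matches plain %/% applied to a column:
res <- sweep(x = mtcars, MARGIN = 2, STATS = unlist(mtcars[1,]), FUN = '%/%')
names(res)                                    #column names of mtcars are preserved
all(res$mpg == mtcars$mpg %/% mtcars$mpg[1])  #should be TRUE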

Preallocate sparse matrix with max nonzeros in R

I'm looking to preallocate a sparse matrix in R (using simple_triplet_matrix) by providing the dimensions of the matrix, m x n, and also the number of non-zero elements I expect to have. Matlab has the function "spalloc" (see below), but I have not been able to find an equivalent in R. Any suggestions?
S = spalloc(m,n,nzmax) creates an all zero sparse matrix S of size m-by-n with room to hold nzmax nonzeros.
Whereas it may make sense to preallocate a traditional dense matrix in R (in the same way it is much more efficient to preallocate a regular (atomic) vector rather than increasing its size one by one), I'm pretty sure it will not pay to preallocate sparse matrices in R in most situations.
Why?
For dense matrices, you allocate and then assign "piece by piece", e.g.,
m[i,j] <- value
For sparse matrices, however that is very different: If you do something like
S[i,j] <- value
the internal code has to check if [i,j] is an existing entry (typically non-zero) or not. If it is, it can change the value, but otherwise, one way or another, the triplet (i, j, value) needs to be stored, and that means extending the current structure, etc. If you do this piece by piece, it is inefficient, mostly irrespective of whether you did any preallocation or not.
If, on the other hand, you already know in advance all the [i,j] combinations which will contain non-zeroes, you could "pre-allocate", but in this case just store the vectors i and j of length nnzero, say, and then use your underlying "algorithm" to also construct a vector x of the same length which contains all the corresponding values, i.e., entries.
Now, indeed, as @Pafnucy suggested, use spMatrix() or sparseMatrix(), two slightly different versions of the same functionality: constructing a sparse matrix, given its contents.
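A minimal sketch of that pattern with sparseMatrix() (the indices and values below are made up purely for illustration):
library(Matrix)
#build all triplets first, then construct the sparse matrix in one call
i <- c(1, 4, 7)            #row indices of the non-zero entries
j <- c(2, 5, 3)            #column indices
x <- c(1.5, -2.0, 7.25)    #the corresponding values
S <- sparseMatrix(i = i, j = j, x = x, dims = c(10, 10))
S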
I am happy to help further, as I am the maintainer of the Matrix package.

R: Anything faster than outer()?

Using R, I need to evaluate an expression of the form [using latex notation]
\frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n f(x_i-x_j),
where x_i,x_j are real (scalar) numbers and f is a nonlinear function with scalar input and output.
My current best [now using R commands] is
mat <- outer(x,x,function(y,z) f(y-z))
res <- mean(mat)
where x is a vector of length n which holds all the x_i's.
For n = 10000, this operation takes about 26 seconds on my PC, but (as expected) the computation time grows quickly with n. I'd like to speed this up, mainly because I want to pass the above result to an optimizer later on. Any suggestions? Thanks!
