Reproducible Random Values in Vector of Arbitrary Length - r

I need to generate a random number for each value in an index, and the result must be reproducible for each index value regardless of how many indexes are given.
As an example, I might provide the indexes 1 to 10 and then, at a different time, the indexes 5 to 10, and both calls need to return the same values for 5 to 10. Setting a global seed will not do this: it only keeps the nth item of the random call the same, by position in the vector.
What I have so far is this, which works as desired:
f = function(ix, min=0, max=1, seed=1){
  sapply(ix, function(x){
    set.seed(seed + x)
    runif(1, min, max)
  })
}
identical(f(1:10)[5:10],f(5:10)) #TRUE
identical(f(1:5),f(5:1)) #FALSE
identical(f(1:5),rev(f(5:1))) #TRUE
I was wondering if there is a more efficient way of achieving the above, without explicitly setting the seed for each index as an offset to the global seed.

You can use the digest package for tasks like this:
library(digest)
f = function(ix, seed=1){
  sapply(ix, digest, algo = "sha256", seed = seed)
}
identical(f(1:10)[5:10],f(5:10)) #TRUE
#> [1] TRUE
identical(f(1:5),f(5:1)) #FALSE
#> [1] FALSE
identical(f(1:5),rev(f(5:1))) #TRUE
#> [1] TRUE

Use encryption. With a given key, unique inputs will always produce unique outputs. As long as the numbers you input are distinct, the outputs will always be different; the same numbers will always encrypt to the same output ciphertext. Use DES for 64-bit numbers or AES for 128-bit numbers. For other sizes, either roll your own Feistel cipher (insecure, but random) or use the Hasty Pudding cipher.
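As a rough illustration of the "roll your own Feistel cipher" idea, here is a minimal base-R sketch. The function name feistel_unif, the round constants and the number of rounds are all made up for this example; it assumes non-negative integer indexes below 2^31 and is in no way secure. Because a Feistel network is a permutation, distinct indexes map to distinct outputs, and everything is vectorised, so no per-index set.seed call is needed.

```r
# Toy 32-bit Feistel network (illustrative only, NOT cryptographically secure).
# Assumes ix holds non-negative integers below 2^31.
feistel_unif <- function(ix, seed = 1, rounds = 4) {
  ix <- as.integer(ix)
  left  <- ix %/% 65536L   # upper 16 bits
  right <- ix %%  65536L   # lower 16 bits
  for (r in seq_len(rounds)) {
    # Arbitrary round function: mixes the right half with seed and round number
    f <- as.integer(((right * 40503 + seed * 7919 + r * 104729) %/% 16) %% 65536)
    new_right <- bitwXor(left, f)
    left  <- right
    right <- new_right
  }
  # Recombine the two halves and scale to [0, 1)
  (left * 65536 + right) / 2^32
}

identical(feistel_unif(1:10)[5:10], feistel_unif(5:10))  # TRUE
anyDuplicated(feistel_unif(1:1000))                      # 0
```

The statistical quality of such a toy construction is untested; it only shows how a keyed permutation gives index-stable values.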

Related

R: Fast hashing of strings to integer modulo n?

I have a vector of strings and I would like to hash each element individually to integers modulo n.
This SO post suggests an approach using digest and strtoi, but when I try it I get NA as the returned value:
library(digest)
strtoi(digest("cc", algo = "xxhash32"), 16L)
So the above approach does not work: it cannot even produce an integer, let alone one modulo n.
What's the best way to hash a large vector of strings to integers modulo n for some n? Efficient solutions are more than welcome as the vector is large.
R uses 32-bit integers for integer vectors, so the range of representable integers is restricted to about +/-2*10^9. strtoi returns NA because the number is too big.
The mpfr-function from the Rmpfr package should work for you:
mpfr(x = digest("cc", algo = "xxhash32"), base = 16)
[1] 4192999065
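If only the value modulo n is needed, the overflow can also be sidestepped without arbitrary-precision arithmetic by reducing the hex string one digit at a time (Horner's rule). This is a minimal base-R sketch; hex_mod is a hypothetical helper name, and the input is assumed to be a plain hex string such as the one digest returns. The example string f9ec1699 is simply 4192999065 written in hexadecimal.

```r
# Reduce hex strings modulo n digit by digit, so no intermediate value
# exceeds 16 * n and strtoi only ever sees a single hex digit.
hex_mod <- function(hex, n) {
  vapply(hex, function(h) {
    digits <- strtoi(strsplit(h, "")[[1]], 16L)  # each hex digit as 0..15
    acc <- 0
    for (d in digits) acc <- (acc * 16 + d) %% n
    acc
  }, numeric(1), USE.NAMES = FALSE)
}

hex_mod("f9ec1699", 17)   # same as 4192999065 %% 17
```

The same loop works unchanged for longer hashes (for example sha256 output), where the full value would not fit in a double either.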
I made an Rcpp implementation using code from this SO post, and the resulting code is quite fast even for large-ish string vectors.
To use it
if(!require(disk.frame)) devtools::install_github("xiaodaigh/disk.frame")
modn = 17
disk.frame::hashstr2i(c("string1","string2"), modn)

Indicator function in R

I'm looking for an indicator function in R, i.e. a function that returns a 1, if the value of an element in a vector is greater than 0 and returns zero, if the value of an element in a vector is less than 0.
I need to use this function on all elements in a vector returning a new vector with only zeros and ones.
Thanks.
There are a variety of ways, the minimal keystroke one:
Ivec <- 0+(vec>0)
Saves a couple of keystrokes over: as.numeric(vec>0). I would guess the ifelse(x>0,1,0)-approach would be somewhat slower if applied to a large vector or if used in simulations. Could also use:
Ivec <- 1*(vec>0)
If I understand you correctly, you want to apply this to an entire data frame. In that case I suggest using apply as below, where df is your data frame.
apply(df,2,function(x)ifelse((x>0),1,0))
If it is for only one vector, you can also use something like this:
x <- c(-2,3,1,0)
y <- ifelse(x>0,1,0)
print(y)
[1] 0 1 1 0 #Output
Hope this helps
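Comparison operators are already vectorised over data frames, so the apply call is optional. A small sketch with a made-up data frame df:

```r
df <- data.frame(a = c(-2, 3, 0), b = c(1, -1, 5))

# df > 0 is a logical matrix; multiplying by 1 turns TRUE/FALSE into 1/0
ind <- (df > 0) * 1
ind

# Same values as the apply/ifelse version
all(ind == apply(df, 2, function(x) ifelse(x > 0, 1, 0)))
```

Note that the result is a matrix rather than a data frame, just as with apply.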
The I function in R, called the Inhibit Interpretation/Conversion of Objects function, can be used for this purpose. For instance, the line below returns the values for the function I(x < 4) where X = {0, 1, 2, 3, 4, 5}:
> I(0:5 < 4)
[1] TRUE TRUE TRUE TRUE FALSE FALSE
In R TRUE and FALSE can be treated as 1 and 0s, but if you insist on your output being precisely those numbers, just wrap your I function into as.numeric.
There is also an Indicator function available in R (it is not in base R, so it needs the package that provides it):
Indicator(x,min,max)
-Inf and Inf are still the valid values.

Deterministic pseudorandom number generation in R

I need to generate a vector of "random" numbers, except that they need to be fully deterministic. The distribution from which the numbers come is not so important. What is a simple way to do this in R?
The reason for not using something like runif is that it returns a different sequence every time it is called.
The reason for not generating one sequence (with runif) and reusing it is that the calls are made on different machines. I could hardcode the sequence into a script, but the length of the sequence needed is unknown at design-time, so a pseudorandom sequence based on some hardcoded seed is preferable.
Are you aware of the set.seed() command?
R> set.seed(42); runif(3)
[1] 0.914806 0.937075 0.286140
R> set.seed(42); runif(3) # same seed, same numbers
[1] 0.914806 0.937075 0.286140
R> set.seed(12345); runif(3) # different seed, different numbers
[1] 0.720904 0.875773 0.760982
R>
There is also the SoDa package (and manual [PDF]) which allows you to wrap other operations and recover the starting and ending seed. It's just a wrapper around set.seed() but you can check for yourself (e.g. in unit tests) more easily.
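Since the sequence length is not known in advance, one option is to wrap set.seed in a small helper: with the same seed, a longer request simply extends the shorter one. The sketch below (det_runif is a hypothetical name) also puts the caller's RNG state back afterwards, which is an extra design choice rather than something the question requires.

```r
# Deterministic uniform numbers from a hardcoded seed, leaving the
# caller's RNG state as it was before the call.
det_runif <- function(n, seed = 42L) {
  genv <- globalenv()
  had_seed <- exists(".Random.seed", envir = genv, inherits = FALSE)
  if (had_seed) old_seed <- get(".Random.seed", envir = genv)
  on.exit({
    if (had_seed) {
      assign(".Random.seed", old_seed, envir = genv)
    } else {
      rm(".Random.seed", envir = genv)
    }
  })
  set.seed(seed)
  runif(n)
}

identical(det_runif(10)[1:5], det_runif(5))   # TRUE
```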

Argument of set.seed in R

I am trying to understand how set.seed works in R. I understand that it can reproduce random samples, but I don't know what the difference is between set.seed(1) and set.seed(123).
What does the argument in the brackets mean?
The seed argument in set.seed is a single value, interpreted as an integer (as described in help(set.seed)). A given seed produces a sequence of random values that is specific to that seed and is the same irrespective of the computer you run it on, which is what ensures reproducibility. So the random values generated after set.seed(1) and set.seed(123) will not be the same, but the values generated by R on your computer after set.seed(1) and by R on my computer with the same seed are the same.
set.seed(1)
x<-rnorm(10,2,1)
> x
[1] 1.373546 2.183643 1.164371 3.595281 2.329508 1.179532 2.487429 2.738325 2.575781 1.694612
set.seed(123)
y<-rnorm(10,2,1)
> y
[1] 1.4395244 1.7698225 3.5587083 2.0705084 2.1292877 3.7150650 2.4609162 0.7349388 1.3131471 1.5543380
> identical(x,y)
[1] FALSE
The majority of computer programs use deterministic algorithms to generate random numbers (which is why the numbers they generate are not truly random but pseudorandom, which is good enough for most purposes). R is no different, and you can think of the random numbers it generates as being part of a very long string of "random" numbers that, when summoned, just starts at some point and spits out pseudorandom numbers for you. By using set.seed() you are basically giving the program a starting point instead of letting it choose its own. That's why any user running the same seed number will get the same results.
You can run ?RNGkind for more information on the subject.

Questions about set.seed() in R

I understand what set.seed() does and when I might use it, but I still have many questions about the function. Here are a few:
1. Is it possible to "reset" set.seed() to something "more random" if you have called set.seed() earlier in your session? Is that even necessary?
2. Is it possible to view the seed that R is currently using?
3. Is there a way to make set.seed() allow alphanumeric seeds, the way one can enter them at random.org (be sure you are in the advanced mode, and see "Part 3" of the form to see what I mean)?
Just for fun:
set.seed.alpha <- function(x) {
  require("digest")
  hexval <- paste0("0x",digest(x,"crc32"))
  intval <- type.convert(hexval) %% .Machine$integer.max
  set.seed(intval)
}
So you can do:
set.seed.alpha("hello world")
(in fact x can be any R object, not just an alphanumeric string)
It's possible, if you set the seed to something like the final digits of your time epoch, but it's really not necessary. The intended use of PRNGs is that you set the seed once at the start of a session, and use successive generated variates from this. Do things differently, and you don't get to enjoy the various good theoretical and empirical properties the R RNGs have.
But I'm not sure you really understand the purpose of set.seed. It's not really there for you to get 'more random' numbers. If you are doing some kind of application for which the R PRNG is insufficient (for instance, if you require cryptographic randomness), you might as well generate all your random numbers by some alternate method and use them directly. The real purpose of set.seed is to produce reproducibility in results using RNGs. If you start the same analysis using the same sequence of random number generations, and set the seed to the same value, you will always get the same result. This is helpful in debugging, and for others reviewing your results.
To use the epoch time, do something like
t <- as.numeric(Sys.time())
seed <- 1e8 * (t - floor(t))
set.seed(seed); print(seed)
For your question 3 there is the char2seed function in the TeachingDemos package, which will take a character string (alphanumeric), convert it to an integer and, by default, use that to set a new seed. The idea was that students could use their name (or some combination/subset of names) as a seed so each student gets a different dataset, but the teacher can reproduce each student's dataset.
For an answer to 2, first see the help page ?RNGkind.
To find the kind of RNG in use:
RNGkind()
# [1] "Mersenne-Twister" "Inversion"
The Mersenne Twister is the default.
From the help page:
‘"Mersenne-Twister":’ From Matsumoto and Nishimura (1998). A
twisted GFSR with period 2^19937 - 1 and equidistribution in
623 consecutive dimensions (over the whole period). The
‘seed’ is a 624-dimensional set of 32-bit integers plus a
current position in that set.
To find the current seed in use, you need to first call the random number generator.
runif(1, 0, 1)
# [1] 0.9834062
.Random.seed
# [Gives a 626 length vector]
Calling set.seed(some_integer) followed by .Random.seed will always give the same 626-length vector if you use the same some_integer. To put it differently, the 626-length vector is determined solely by some_integer, given one is using the Mersenne Twister, of course.
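This is easy to check directly. The snippet assumes the default Mersenne-Twister generator is active; the variable names are arbitrary:

```r
set.seed(123)
state1 <- .Random.seed
set.seed(123)
state2 <- .Random.seed

identical(state1, state2)   # TRUE: same seed, same internal state
length(state1)              # 626 with Mersenne-Twister

set.seed(124)
identical(state1, .Random.seed)   # FALSE: a different seed gives a different state
```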
Also, of course, running set.seed to some fixed value will give you the same values for calls to random number routines following it. That's the main use for it in practice, to give reproducibility. E.g.
set.seed(1)
runif(5, 0, 1)
# [1] 0.2655087 0.3721239 0.5728534 0.9082078 0.2016819
rnorm(1, 0, 1)
# [1] 1.272429
set.seed(1)
runif(5, 0, 1)
# [1] 0.2655087 0.3721239 0.5728534 0.9082078 0.2016819
rnorm(1, 0, 1)
# [1] 1.272429
All the basic number generator code in R is in the file src/main/RNG.c in the source code.
It is in C, but fairly easy to follow.
I had the same issue as in question 1. I figured I can simply reset the seed in the loop with:
set.seed(123)
x<- rnorm(10,1,1)
set.seed(NULL)
This way the seed is re-initialized at the end of each loop iteration, as if no seed had been set. It worked for me.
