Using parSapply to generate random numbers - r

I am trying to run a function that contains a random number generator. The results are not what I expected, so I ran the following test:
# Case 1
set.seed(100)
A1 = matrix(NA,20,10)
for (i in 1:10) {
A1[,i] = sample(1:100,20)
}
# Case 2
set.seed(100)
A2 = sapply(seq_len(10),function(x) sample(1:100,20))
# Case 3
require(parallel)
set.seed(100)
cl <- makeCluster(detectCores() - 1)
A3 = parSapply(cl,seq_len(10), function(x) sample(1:100,20))
stopCluster(cl)
# Check: Case 1 result equals Case 2 result
identical(A1,A2)
# [1] TRUE
# Check: Case 1 result does NOT equal to Case 3 result
identical(A1,A3)
# [1] FALSE
# Check2: Would like to check if it's a matter of ordering
range(rowSums(A1))
# [1] 319 704
range(rowSums(A3))
# [1] 288 612
In the above code, parSapply generates a different set of random numbers than A1 and A2. The purpose of Check2 is that I suspected parSapply might merely alter the ordering, but that does not seem to be the case, since the minimum and maximum row sums differ.
I would appreciate it if someone could shed some light on why parSapply gives a different result from sapply. What am I missing here?
Thanks in advance!

Have a look at vignette("parallel") and in particular at "Section 6 Random-number generation". Among other things it states the following:
Some care is needed with parallel computation using (pseudo-)random numbers: the processes/threads which run separate parts of the computation need to run independent (and preferably reproducible) random-number streams.
When an R process is started up it takes the random-number seed from the object .Random.seed in a saved workspace or constructs one from the clock time and process ID when random-number generation is first used (see the help on RNG). Thus worker processes might get the same seed
because a workspace containing .Random.seed was restored or the random number generator has been used before forking: otherwise these get a non-reproducible seed (but with very high probability a different seed for each worker).
You should also have a look at ?clusterSetRNGStream.
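For instance, a minimal sketch along the lines of ?clusterSetRNGStream (this makes the parallel result reproducible across cluster runs, though it will still differ from the serial A1/A2):
library(parallel)
cl <- makeCluster(detectCores() - 1)
clusterSetRNGStream(cl, iseed = 100)  # give each worker its own L'Ecuyer-CMRG stream
A3 <- parSapply(cl, seq_len(10), function(x) sample(1:100, 20))
clusterSetRNGStream(cl, iseed = 100)  # reset the streams ...
A4 <- parSapply(cl, seq_len(10), function(x) sample(1:100, 20))
stopCluster(cl)
identical(A3, A4)  # TRUE: reproducible, but still not equal to the serial A1/A2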

Related

Sampling using common random numbers in r (efficiently!)

Is there any way to perform sampling using common random numbers with R?
There are many cases where you do the following many times (for instance, if you wanted to plot Monte Carlo estimates at many different parameter values). First, you sample, say, ten thousand variates from a normal distribution, and second, you take the average of some function of these samples, returning a single floating-point number. Now, if I wanted to change a few parameters, changing either of these two functions, I would have to redo those steps over and over again.
The naive way would be to sample fresh draws over and over again using some function like rnorm(). A less naive way would be to use a different function that takes a large collection of common random numbers. However, if I used this approach, there might still be a lot of copying going on here, due to R mostly using pass-by-value semantics. What are some tools that would allow me to get around this and avoid all this copying in the second situation?
I think you're asking two types of questions here:
Programmatically, can we preserve a large pull of random data in such a way that side-steps R's default pass-by-value?
Mathematically, if we make a large pull of random data and pick from it piece-meal, can we arbitrarily change the parameters used in the pull?
The answer to 1 is "yes": pass-by-reference semantics are possible in R, but they take a little more work. All of the implementations I've seen and played with are done with environments or non-R-native objects (C/C++ pointers to structs or such). Here is one example that caches a large pull of random "normal" data and checks the pool of available data on each call:
my_rnorm_builder <- function(deflen = 10000) {
  .cache <- numeric(0)
  .index <- 0L
  .deflen <- deflen
  check <- function(n) {
    if ((.index + n) > length(.cache)) {
      message("reloading") # this should not be here "in-production"
      l <- length(.cache)
      .cache <<- c(.cache[.index + seq_len(l - .index)],
                   rnorm(.deflen + n + l))
      .index <<- 0L
    }
  }
  function(n, mean = 0, sd = 1) {
    check(n)
    if (n > 0) {
      out <- mean + sd * .cache[.index + seq_len(n)]
      .index <<- .index + as.integer(n)
      return(out)
    } else return(numeric(0))
  }
}
It is by far not resilient to hostile users or other likely mistakes, and it does not guarantee the length of remaining available random numbers. (Putting in checks like that would slow it down below a threshold of reasonableness, with the benchmark in mind.)
Demo of it in operation:
my_rnorm <- my_rnorm_builder(1e6)
# starts empty
get(".index", env=environment(my_rnorm))
# [1] 0
length(get(".cache", env=environment(my_rnorm)))
# [1] 0
set.seed(2)
my_rnorm(3) # should see "reloading"
# reloading
# [1] -0.8969145 0.1848492 1.5878453
my_rnorm(3) # should not see "reloading"
# [1] -1.13037567 -0.08025176 0.13242028
# prove that we've changed things internally
get(".index", env=environment(my_rnorm))
# [1] 6
length(get(".cache", env=environment(my_rnorm)))
# [1] 1000003
head(my_rnorm(1e6)) # should see "reloading"
# reloading
# [1] 0.7079547 -0.2396980 1.9844739 -0.1387870 0.4176508 0.9817528
Let's make sure that the random-number scaling of sigma*x+mu makes sense by starting over and re-setting our seed:
# reload the definition of my_rnorm
my_rnorm <- my_rnorm_builder(1e6)
length(get(".cache", env=environment(my_rnorm)))
# [1] 0
set.seed(2)
my_rnorm(3) # should see "reloading"
# reloading
# [1] -0.8969145 0.1848492 1.5878453
my_rnorm(3, mean = 100) # should not see "reloading"
# [1] 98.86962 99.91975 100.13242
So to answer question 2: "yes". Quick inspection reveals that those last three numbers are indeed "100 plus" the numbers in the second my_rnorm(3) in the previous block. So just shifting "normal" random numbers by mu/sigma holds. And we did this while still using the large pre-pulled cache of random data.
But is it worth it? This is a naïve test/comparison in and of itself; constructive suggestions are welcome.
t(sapply(c(1, 5, 10, 100, 1000, 10000), function(n) {
  s <- summary(microbenchmark::microbenchmark(
    base = rnorm(n),
    my = my_rnorm(n),
    times = 10000, unit = "ns"
  ))
  c(n = n, setNames(s$median, s$expr))
}))
# reloading
# reloading
# reloading
# reloading
# reloading
# reloading
# reloading
# reloading
# reloading
# reloading
# reloading
# reloading
# reloading
# reloading
# n base my
# [1,] 1 1100 1100
# [2,] 5 1400 1300
# [3,] 10 1600 1400
# [4,] 100 6400 2000
# [5,] 1000 53100 6600
# [6,] 10000 517000 49900
(All medians are in nanoseconds.) So while it would have seemed intuitive that "smaller pulls done more frequently" (with rnorm) would benefit from this caching, I cannot explain why it is not very helpful until pulls of 100 and greater.
Can this be extended to other distributions? Almost certainly. "Uniform" would be straightforward (similarly scale and shift), but some others might take a little more calculus to do correctly. (For instance, it is not obvious without more research how the "t" distribution could alter the degrees-of-freedom on pre-pulled data ... if that's even possible. Though I do count myself a statistician in some ways, I am not prepared to claim yes/no/maybe on that one yet :-)
An addition to r2evans' answer concerning "is it worth it?": I don't think so, since instead of caching random draws one could also use a faster RNG. Here I am adding dqrnorm from my dqrng package to the comparison:
dqrnorm is the fastest method for n <= 100
for n > 100, caching and dqrnorm are comparable and much faster than rnorm
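For reference, a minimal sketch of that comparison (assuming the dqrng package is installed; dqset.seed() and dqrnorm() are its seeding and normal-draw functions):
library(dqrng)
dqset.seed(42)
microbenchmark::microbenchmark(
  base  = rnorm(1e4),
  dqrng = dqrnorm(1e4),   # drop-in style replacement for rnorm()
  times = 100
)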

Generate permutations in sequential order - R

I previously asked the following question
Permutation of n bernoulli random variables in R
The answer to this question works great as long as n is relatively small (<30); otherwise the following error occurs: Error: cannot allocate vector of size 4.0 Gb. I can get the code to run with somewhat larger values by using my desktop at work, but eventually the same error occurs. Even for values that my computer can handle, say 25, the code is extremely slow.
The purpose of this code to is to calculate the difference between the CDF of an exact distribution (hence the permutations) and a normal approximation. I randomly generate some data, calculate the test statistic and then I need to determine the CDF by summing all the permutations that result in a smaller test statistic value divided by the total number of permutations.
My thought is to just generate the list of permutations one at a time, note if it is smaller than my observed value and then go on to the next one, i.e. loop over all possible permutations, but I can't just have a data frame of all the permutations to loop over because that would cause the exact same size and speed issue.
Long story short: I need to generate all possible permutations of 1's and 0's for n Bernoulli trials, but I need to do this one at a time, such that all of them are generated and none are generated more than once, for arbitrary n. For n = 3, 2^3 = 8, so I would first generate
000
calculate if my test statistic was greater (1 or 0) then generate
001
calculate again, then generate
010
calculate, then generate
100
calculate, then generate
011
etc until 111
I'm fine with this being a loop over 2^n that outputs the permutation at each step but doesn't save them all somewhere. Also, I don't care what order they are generated in; the above is just how I would list them out if I were doing it by hand.
In addition, if there is any way to speed up the previous code, that would also be helpful.
A good solution for your problem is iterators. There is a package called arrangements that is able to generate permutations in an iterative fashion. Observe:
library(arrangements)
# initialize iterator
iperm <- ipermutations(0:1, 3, replace = TRUE)
for (i in 1:(2^3)) {
  print(iperm$getnext())
}
[1] 0 0 0
[1] 0 0 1
.
.
.
[1] 1 1 1
It is written in C and is very efficient. You can also generate m permutations at a time like so:
iperm$getnext(m)
This allows for better performance because the next permutations are being generated by a for loop in C as opposed to a for loop in R.
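For instance (untested, like the rest of the code here; I am assuming a multi-row result with one permutation per row):
iperm <- ipermutations(0:1, 3, replace = TRUE)  # re-initialize the iterator
iperm$getnext(4)  # fetch the first four 0/1 triples in one call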
If you really need to ramp up performance, you can use the parallel package.
iperm <- ipermutations(0:1, 40, replace = TRUE)
parallel::mclapply(1:100, function(x) {
  myPerms <- iperm$getnext(10000)
  # do something
}, mc.cores = parallel::detectCores() - 1)
Note: All code is untested.

R sample function issue over 10 million values

I found this quirk in R and can't find much evidence for why it occurs. I was trying to recreate a sample as a check and discovered that the sample function behaves differently in certain cases. See this example:
# Look at the first ten rows of a randomly ordered vector of the first 10 million integers
set.seed(4)
head(sample(1:10000000), 10)
[1] 5858004 89458 2937396 2773749 8135739 2604277 7244055 9060916 9490395 731445
# Select a specified sample of size 10 from this same list
set.seed(4)
sample(1:10000000, size = 10)
[1] 5858004 89458 2937396 2773749 8135739 2604277 7244055 9060916 9490395 731445
# Try the same for sample size 10,000,001
set.seed(4)
head(sample(1:10000001), 10)
[1] 5858004 89458 2937396 2773750 8135740 2604277 7244056 9060917 9490396 731445
set.seed(4)
sample(1:10000001, size = 10)
[1] 5858004 89458 2937397 2773750 8135743 2604278 7244060 9060923 9490404 731445
I tested many values up to this 10 million threshold and found that the values matched (though I admit to not testing more than 10 output rows).
Anyone know what's going on here? Is there something significant about this 10 million number?
Yes, there's something special about 1e7. If you look at the sample code, it ends up calling sample.int. And as you can see at ?sample, the default value for the useHash argument of sample.int is
useHash = (!replace && is.null(prob) && size <= n/2 && n > 1e7)
That && n > 1e7 means that when you get above 1e7, the default preference switches to useHash = TRUE. If you want consistency, call sample.int directly and specify the useHash value. (TRUE is a good choice for memory efficiency; see the argument description at ?sample for details.)
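For example, a minimal sketch based on the question's code (fixing useHash on both calls should make them agree; I have not re-run this):
set.seed(4)
a <- head(sample.int(10000001), 10)                    # full shuffle: useHash defaults to FALSE
set.seed(4)
b <- sample.int(10000001, size = 10, useHash = FALSE)  # force the same algorithm
identical(a, b)  # expected TRUE once both calls use the same method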

What is the true difference between set.seed(n) and set.seed(n+1)

I am trying to figure out how the set.seed() function works in R.
I am curious whether set.seed(3) and set.seed(4) are more likely to generate duplicate samples than set.seed(3) and set.seed(100)?
If yes, how many unique samples can set.seed(3) generate before a match with the samples generated by set.seed(4) appears?
If not, can I conclude that the value of n in set.seed(n) does not mean anything, as long as the values are different?
I heard something about independent random streams. Is this n related to that?
If yes, how can I define an independent random stream?
I have already read What does the integer while setting the seed mean?, but it does not seem to answer my questions.
Let me also try to give a brief, easy answer. I do believe the two comments are useful.
We sometimes need random numbers in our programs. Computers rely on an algorithm to generate random numbers. Because of this, we have the option to re-create the sequence of random numbers generated, which is quite useful for reproducing someone's work. In R, if we use
set.seed(42)
runif(5)
at any point, it will always give the same sequence of random numbers.
It is not expected that there be a relationship between set.seed(n) and set.seed(n+1), or between set.seed(n1) and set.seed(n2). Nor is set.seed(3) expected to reproduce the stream of set.seed(4) after some number of iterations, or vice versa.
So, in general, one can treat sequences of random numbers generated by different seeds to be independent.
I think it's a bad idea to make any assumptions about the relationship between streams of random numbers that are produced by two different seeds unless the underlying random number generator documents the relationship. For example, I was surprised to learn that the default Mersenne-Twister RNG acts like this:
> set.seed(0)
> x <- runif(10)
> set.seed(1)
> y <- runif(10)
> x[2:10] == y[1:9]
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
I haven't noticed this kind of behavior for any other pair of seed values, but that was enough to scare me away from making assumptions.
If you care about these issues, you should read about the nextRNGStream and nextRNGSubStream functions in the parallel package. These are intended to generate .Random.seed values that result in independent streams of random numbers.
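For example, a minimal sketch (assuming you are at the top level of a session, where .Random.seed can be assigned directly):
library(parallel)
RNGkind("L'Ecuyer-CMRG")   # nextRNGStream() works on L'Ecuyer-CMRG states
set.seed(42)
s1 <- .Random.seed         # state for stream 1
s2 <- nextRNGStream(s1)    # state for an independent stream 2
.Random.seed <- s1
runif(3)                   # draws from stream 1
.Random.seed <- s2
runif(3)                   # draws from stream 2, designed to be independent of stream 1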

How do I stop set.seed() after a specific line of code?

I would like to end the scope of set.seed() after a specific line in order to have real randomization for the rest of the code. Here is an example in which I want set.seed() to work for "rnorm" (line 4), but not for the "sample" call (line 9):
set.seed(2014)
f<-function(x){0.5*x+2}
datax<-1:100
datay<-f(datax)+rnorm(100,0,5)
daten<-data.frame(datax,datay)
model<-lm(datay~datax)
plot(datax,datay)
abline(model)
a<-daten[sample(nrow(daten),20),]
points(a,col="red",pch=16)
modela<-lm(a$datay~a$datax)
abline(modela, col="red")
Thanks for suggestions, indeed!
set.seed(NULL)
See help documents - ?set.seed:
"If called with seed = NULL it re-initializes (see ‘Note’) as if no seed had yet been set."
Simply use the current system time to "undo" the seed by introducing a new unique random seed:
set.seed(Sys.time())
If you need more precision, consider fetching the system timestamp by millisecond (use R's system(..., intern = TRUE) function).
set.seed() only works for the next execution, so what you want is already happening. See this example:
set.seed(12)
sample(1:15, 5)
[1] 2 12 13 4 15
sample(1:15, 5) # run the same code again and you will see different results
[1] 1 3 9 15 12
set.seed(12) # set the seed again to reproduce the first set of results
sample(1:15, 5)
[1] 2 12 13 4 15
set.seed() only applies to the next line that draws random numbers and will not influence the commands that follow. If you want it to work for the other lines as well, you must call set.seed() again with the same seed parameter.
