I need to generate a vector of "random" numbers, except that they need to be fully deterministic. The distribution from which the numbers come is not so important. What is a simple way to do this in R?
The reason for not using something like runif is that it returns a different sequence every time it is called.
The reason for not generating one sequence (with runif) and reusing it is that the calls are made on different machines. I could hardcode the sequence into a script, but the length of the sequence needed is unknown at design-time, so a pseudorandom sequence based on some hardcoded seed is preferable.
Are you aware of the set.seed() command?
R> set.seed(42); runif(3)
[1] 0.914806 0.937075 0.286140
R> set.seed(42); runif(3) # same seed, same numbers
[1] 0.914806 0.937075 0.286140
R> set.seed(12345); runif(3) # different seed, different numbers
[1] 0.720904 0.875773 0.760982
R>
There is also the SoDa package (and its manual), which allows you to wrap other operations and recover the starting and ending seed. It's just a wrapper around set.seed(), but it lets you check for yourself (e.g. in unit tests) more easily.
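As a rough illustration of the wrapping idea (this is not the SoDa API), here is a minimal base-R sketch. The helper name with_fixed_seed is hypothetical. It assumes the default behaviour in which R keeps the RNG state in .Random.seed in the global environment, and it restores the caller's state afterwards so the rest of the session is not disturbed.

```r
# Hypothetical helper (not part of any package): evaluate `expr` with a
# fixed seed, then put the caller's RNG state back as it was.
with_fixed_seed <- function(seed, expr) {
  had_state <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  if (had_state) old_state <- get(".Random.seed", envir = globalenv())
  on.exit({
    if (had_state) {
      assign(".Random.seed", old_state, envir = globalenv())
    } else {
      rm(".Random.seed", envir = globalenv())
    }
  })
  set.seed(seed)
  force(expr)  # the argument is evaluated lazily, i.e. after set.seed()
}

# Same seed, same numbers, regardless of what happened before the call.
a <- with_fixed_seed(42, runif(3))
invisible(rnorm(100))  # unrelated random draws in between
b <- with_fixed_seed(42, runif(3))
identical(a, b)

# The surrounding RNG state is left untouched by the helper.
set.seed(1)
state_before <- .Random.seed
invisible(with_fixed_seed(42, runif(3)))
identical(state_before, .Random.seed)
```

Because the length is just an argument to runif(), this works when the required length is only known at run time.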
When generating exams using the function exams2nops we randomly generate data for each of the produced exams (let's say 5 different versions). We would like to use the exact same version of each exam to produce the solutions version (using exams2pdf). Is it possible to create the solution version on the go when generating exams with exams2nops? By exact same version I mean the same order of the multiple-choice answers and the same wrong values (using the marvelous num_to_schoice function). We save the .rds objects used in each exercise, which allows us to import them when generating solutions; however, the wrong options and their order are different, since they are random. Should we also save a specific seed in the .rds object? Inside each exercise, we have several randomly generated values.
When you set the same random seed prior to calling exams2pdf() and exams2nops() you should get the same random versions of the exercises.
Illustration: n = 2 versions of an exam with 3 exercises.
library("exams")
exm <- c("capitals.Rmd", "deriv2.Rmd", "tstat2.Rmd")
set.seed(1)
exm1 <- exams2pdf(exm, n = 2)
set.seed(1)
exm2 <- exams2nops(exm, n = 2)
Compare the question list of all three exercises in the second random version of the exams:
all.equal(exm1[[2]][[1]]$questionlist, exm2[[2]][[1]]$questionlist)
## [1] TRUE
all.equal(exm1[[2]][[2]]$questionlist, exm2[[2]][[2]]$questionlist)
## [1] TRUE
all.equal(exm1[[2]][[3]]$questionlist, exm2[[2]][[3]]$questionlist)
## [1] TRUE
Both have to be called separately, though; there is currently no option to produce both in one go.
I am still quite new to R (I used to program in Matlab) and I am trying to use the parallel package to speed up some calculations. Below is an example in which I calculate the rolling standard deviation of a matrix (by column) using the zoo package, with and without parallelising the code. However, the shapes of the two outputs are different.
# load library
library('zoo')
library('parallel')
library('snow')
# Data
z <- matrix(runif(100000,0,1),100,1000) # 100 x 1000 = 100000 values
#This is what I want to calculate with timing
system.time(zz <- rollapply(z,10,sd,by.column=T, fill=NA))
# Trying to achieve the same output with parallel computing
cl<-makeSOCKcluster(4)
clusterEvalQ(cl, library(zoo))
system.time(yy <-parCapply(cl,z,function(x) rollapplyr(x,10,sd,fill=NA)))
stopCluster(cl)
My first output zz has the same dimensions as input z, whereas output yy is a vector rather than a matrix. I understand that I can do something like matrix(yy,nrow(z),ncol(z)) however I would like to know if I have done something wrong or if there is a better way of coding to improve this. Thank you.
From the documentation:
parRapply and parCapply always return a vector. If FUN always returns
a scalar result this will be of length the number of rows or columns:
otherwise it will be the concatenation of the returned values.
And:
parRapply and parCapply are parallel row and column apply functions
for a matrix x; they may be slightly more efficient than parApply but
do less post-processing of the result.
So, I'd suggest you use parApply.
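A minimal sketch of that suggestion, assuming the zoo package is installed and using a smaller matrix and a 2-worker cluster than in the question to keep it quick. With MARGIN = 2 the function is applied per column, and because every call returns a vector of the same length, parApply assembles the results into a matrix the way apply does.

```r
library(parallel)
library(zoo)

# Smaller stand-in for the matrix in the question.
z <- matrix(runif(100 * 20), nrow = 100, ncol = 20)

cl <- makeCluster(2)
clusterEvalQ(cl, library(zoo))  # each worker needs zoo loaded

# MARGIN = 2: apply the rolling sd to each column on the workers.
yy <- parApply(cl, z, 2, function(x) rollapply(x, 10, sd, fill = NA))

stopCluster(cl)

dim(yy)  # same dimensions as z
```

Note that the question mixes rollapply (centred window by default) and rollapplyr (right-aligned window), so the two results there would differ in where the NA padding sits even after reshaping; the sketch uses rollapply in both roles.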
I would like to find all combinations of vector elements that match a specific condition. The function expand.grid returns all possible combinations without checking for a condition. It is possible to test for the condition after calling expand.grid, but in some situations the number of possible combinations is too large to generate them all with expand.grid. Is there a function that allows me to check for a condition while generating the combinations?
This is a simplified version of the problem:
A <- seq.int(12, from=0, by=1)*15
B <- seq.int(27, from=0, by=1)*23
C <- seq.int(18, from=0, by=1)*18
D <- seq.int(33, from=0, by=1)*10
out<-expand.grid(A,B,C,D) #out is a dataframe with 235144 x 4 as dimensions
idx<-which(rowSums(out)<=400 & rowSums(out)>=300) #Only a small fraction of 'out' is needed
results <- out[idx, ]
In a word, no. After all, if you knew a priori which combinations were desirable/undesirable, you could exclude them from the expansion, e.g. expand.grid(A[A<20],B[B<15],...) . In the general case, which I'm assuming is your real question, you have no simple way to exclude portions of the input vectors.
You might just want to write a multilevel loop which tests each combination in turn and saves or rejects it. This will be slow (again, unless you come up with some clever algorithm to predict regions which are all TRUE or FALSE). So, in the long run, you may be better off using some of the R-packages which partition large calculations (and datasets) so as to avoid exceeding your memory limits.
Now that I've said all that, someone's going to post a link to a package which does exactly that :-(
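For the specific example in the question, the "test each combination in turn" idea can be sketched without a hand-written nested loop. This is only a sketch under one stated assumption: all values are non-negative, so a partial sum that already exceeds the upper bound can never come back into range and can be dropped early. The function name filtered_grid is hypothetical.

```r
A <- (0:12) * 15
B <- (0:27) * 23
C <- (0:18) * 18
D <- (0:33) * 10

# Hypothetical helper: build the combinations one vector at a time and
# prune partial combinations whose running total is already too large.
# Assumes every value is >= 0, otherwise the pruning step is not valid.
filtered_grid <- function(vectors, lower, upper) {
  combos <- data.frame(total = 0)
  for (name in names(vectors)) {
    next_col <- setNames(data.frame(vectors[[name]]), name)
    combos <- merge(combos, next_col, by = NULL)  # by = NULL: cross join
    combos$total <- combos$total + combos[[name]]
    combos <- combos[combos$total <= upper, , drop = FALSE]
  }
  combos <- combos[combos$total >= lower, , drop = FALSE]
  rownames(combos) <- NULL
  combos
}

results <- filtered_grid(list(A = A, B = B, C = C, D = D),
                         lower = 300, upper = 400)
nrow(results)
```

The intermediate data frames never hold more than the surviving partial combinations times the length of the next vector, which is where the saving over a full expand.grid comes from. For a general condition that cannot be checked on partial combinations, the answer above still applies.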
I am trying to understand how set.seed works in R. I understand that it can reproduce random samples, but I don't know what the difference is between set.seed(1) and set.seed(123).
What does the argument in the brackets mean?
The seed argument in set.seed is a single value, interpreted as an integer (as documented in help(set.seed)). Each seed produces a sequence of random values specific to that seed, and that sequence is the same irrespective of the computer you run it on (assuming the same RNG kind and a comparable R version), which is what ensures reproducibility. So the random values generated after set.seed(1) and after set.seed(123) will not be the same, but the random values generated by R on your computer after set.seed(1) and by R on my computer after the same seed are the same.
set.seed(1)
x<-rnorm(10,2,1)
> x
[1] 1.373546 2.183643 1.164371 3.595281 2.329508 1.179532 2.487429 2.738325 2.575781 1.694612
set.seed(123)
y<-rnorm(10,2,1)
> y
[1] 1.4395244 1.7698225 3.5587083 2.0705084 2.1292877 3.7150650 2.4609162 0.7349388 1.3131471 1.5543380
> identical(x,y)
[1] FALSE
The majority of computer programs use deterministic algorithms to generate random numbers (which is why the numbers they generate are not truly random but pseudorandom, which is good enough for most purposes). R is no different, and you can think of the random numbers it generates as being part of a very long string of "random" numbers that, when summoned, just starts at some point and spits out pseudorandom numbers for you. By using set.seed() you are basically giving the program a starting point instead of letting it choose its own. That's why any user running the same seed number will get the same results.
You can run ?RNGkind for more information on the subject.
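A small sketch of the "starting point in a long string" picture, using only base R and the default generator: drawing five numbers twice after seeding walks along the same stream as drawing ten numbers at once after the same seed.

```r
set.seed(42)
first_five  <- runif(5)
second_five <- runif(5)   # continues where the previous call stopped

set.seed(42)              # back to the same starting point
all_ten <- runif(10)

identical(c(first_five, second_five), all_ten)
```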
I understand what set.seed() does and when I might use it, but I still have many questions about the function. Here are a few:
Is it possible to "reset" set.seed() to something "more random" if you have called set.seed() earlier in your session? Is that even necessary?
Is it possible to view the seed that R is currently using?
Is there a way to make set.seed() allow alphanumeric seeds, the way one can enter them at random.org (be sure you are in the advanced mode, and see "Part 3" of the form to see what I mean)?
Just for fun:
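set.seed.alpha <- function(x) {
  require("digest")
  # hash the object to 8 hex digits, e.g. "0x0d4a1185"
  hexval <- paste0("0x", digest(x, "crc32"))
  # as.is = TRUE avoids the warning newer R versions give when it is omitted
  intval <- type.convert(hexval, as.is = TRUE) %% .Machine$integer.max
  set.seed(intval)
}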
So you can do:
set.seed.alpha("hello world")
(in fact x can be any R object, not just an alphanumeric string)
It's possible, if you set the seed to something like the final digits of your time epoch, but it's really not necessary. The intended use of PRNGs is that you set the seed once at the start of a session, and use successive generated variates from this. Do things differently, and you don't get to enjoy the various good theoretical and empirical properties the R RNGs have.
But I'm not sure you really understand the purpose of set.seed. It's not really there for you to get 'more random' numbers. If you are doing some kind of application for which the R PRNG is insufficient (for instance, if you require cryptographic randomness), you might as well generate all your random numbers by some alternate method and use them directly. The real purpose of set.seed is to produce reproducibility in results using RNGs. If you start the same analysis using the same sequence of random number generations, and set the seed to the same value, you will always get the same result. This is helpful in debugging, and for others reviewing your results.
To use the epoch time, do something like
t <- as.numeric(Sys.time())
seed <- 1e8 * (t - floor(t))
set.seed(seed); print(seed)
For your question 3 there is the char2seed function in the TeachingDemos package, which will take a character string (alphanumeric), convert it to an integer and, by default, use that to set a new seed. The idea was that students could use their name (or some combination/subset of names) as a seed so that each student gets a different dataset, but the teacher can reproduce each student's dataset.
For an answer to 2, first see the help page ?RNGkind.
To find the kind of RNG in use:
RNGkind()
# [1] "Mersenne-Twister" "Inversion"
The Mersenne Twister is the default.
From the help page:
‘"Mersenne-Twister":’ From Matsumoto and Nishimura (1998). A
twisted GFSR with period 2^19937 - 1 and equidistribution in
623 consecutive dimensions (over the whole period). The
‘seed’ is a 624-dimensional set of 32-bit integers plus a
current position in that set.
To find the current seed in use, you need to first call the random number generator.
runif(1, 0, 1)
# [1] 0.9834062
.Random.seed
# [Gives a 626 length vector]
Calling set.seed(some_integer) followed by .Random.seed will always give the same 626-length vector if you use the same some_integer. To put it differently, the 626-length vector is determined solely by some_integer, given that one is using the Mersenne Twister, of course.
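A short sketch to check this, assuming the default Mersenne Twister and that .Random.seed lives in the global environment (the default). It also shows that the state can be saved mid-stream and put back later, which replays the draws that followed.

```r
# Same integer seed, same full RNG state.
set.seed(2024)
state_a <- .Random.seed
set.seed(2024)
state_b <- .Random.seed
identical(state_a, state_b)
length(state_a)

# Save the state part-way through a stream and restore it afterwards.
invisible(runif(3))
saved <- .Random.seed
x1 <- runif(3)
assign(".Random.seed", saved, envir = globalenv())
x2 <- runif(3)
identical(x1, x2)
```

Restoring the saved vector is the only general way to return to a point in the middle of a stream; set.seed() can only take you to the states reachable from an integer seed.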
Also, of course, running set.seed to some fixed value will give you the same values for calls to random number routines following it. That's the main use for it in practice, to give reproducibility. E.g.
set.seed(1)
runif(5, 0, 1)
# [1] 0.2655087 0.3721239 0.5728534 0.9082078 0.2016819
rnorm(1, 0, 1)
# [1] 1.272429
set.seed(1)
runif(5, 0, 1)
# [1] 0.2655087 0.3721239 0.5728534 0.9082078 0.2016819
rnorm(1, 0, 1)
# [1] 1.272429
All the basic number generator code in R is in the file src/main/RNG.c in the source code.
It is in C, but fairly easy to follow.
I had the same issue as in question 1. I figured I could simply reset the seed inside the loop:
set.seed(123)
x<- rnorm(10,1,1)
set.seed(NULL)
This way the fixed seed is discarded at the end of each loop iteration: set.seed(NULL) re-initializes the generator as if no seed had been set. It worked for me.