I am trying to understand how set.seed works in R. I understand the basics and can reproduce random samples, but I don't know what the difference is between set.seed(1) and set.seed(123).
What does the argument in the parentheses mean?
The seed argument in set.seed is a single value, interpreted as an integer (as described in help(set.seed)). Each seed determines its own sequence of random values, and that sequence is the same regardless of which computer you run it on (assuming the same RNG settings, see ?RNGkind), which is what ensures reproducibility. So the random values generated after set.seed(1) and after set.seed(123) will not be the same, but the values generated by R on your computer after set.seed(1) and by R on my computer after the same seed are the same.
set.seed(1)
x<-rnorm(10,2,1)
> x
[1] 1.373546 2.183643 1.164371 3.595281 2.329508 1.179532 2.487429 2.738325 2.575781 1.694612
set.seed(123)
y<-rnorm(10,2,1)
> y
[1] 1.4395244 1.7698225 3.5587083 2.0705084 2.1292877 3.7150650 2.4609162 0.7349388 1.3131471 1.5543380
> identical(x,y)
[1] FALSE
Most computer programs use deterministic algorithms to generate random numbers (which is why the numbers they generate are not truly random but pseudorandom, which is good enough for most purposes). R is no different, and you can think of the random numbers it generates as being part of a very long string of "random" numbers: when summoned, R starts at some point in that string and returns pseudorandom numbers from there. By using set.seed() you are basically giving the program a starting point instead of letting it choose its own. That's why any user running the same seed number will get the same results.
You can run ?RNGkind for more information on the subject.
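As a small sketch of the "starting point" idea: after the same seed, drawing five values twice walks through the same stretch of the stream as drawing ten values at once. This assumes the default generator settings reported by RNGkind(); the variable names are only for illustration.

```r
set.seed(1)
first_half  <- rnorm(5)
second_half <- rnorm(5)   # continues where the previous call stopped

set.seed(1)
all_at_once <- rnorm(10)  # restarts from the same starting point

# Both routes should yield the same ten numbers in the same order.
identical(c(first_half, second_half), all_at_once)
```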
I'm practicing different methods of splitting data sets. However, the two methods give different numbers of observations. Shouldn't the number of observations be the same with both functions, since the split ratio and the original data set are the same?
Here's my code
##split data set with caTools
set.seed(144)
split.5<-sample.split(CTR$CTR,0.7)
ctr.test<-filter(CTR,split.5==FALSE)
ctr.train<-filter(CTR,split.5==TRUE)
##split data set with sample function
train.id=sample(nrow(CTR),0.7*nrow(CTR))
ctr_test=CTR[-train.id,]
ctr_train=CTR[train.id,]
According to my calculator, the number of observations from sample is correct: it equals the total number of observations * 0.7.
Thanks a lot for help!
Here's a modified version of your code, let me know if this isn't what you meant.
library(caTools)  # provides sample.split()
set.seed(144)
split.5<-sample.split(chickwts$weight,0.7)
ctr.test<-dplyr::filter(chickwts,split.5==FALSE)
ctr.train<-dplyr::filter(chickwts,split.5==TRUE)
set.seed(144)
train.id=sample(nrow(chickwts),0.7*nrow(chickwts))
ctr_test=chickwts[-train.id,]
ctr_train=chickwts[train.id,]
And then you can see that
nrow(ctr_test) == nrow(ctr.test)
is TRUE.
I think you are asking why you don't end up with the same rows, that is, why the row indices in train.id are not the same as the rows where split.5 is TRUE.
Using ?Random will get you more information on random number generation in R. I didn't look into the code of sample.split() but the documentation doesn't say much about the details of the RNG used.
So if you try this
set.seed(144)
split.52<-sample.split(chickwts$weight,0.7)
ctr.test2<-dplyr::filter(chickwts,split.52==FALSE)
ctr.train2<-dplyr::filter(chickwts,split.52==TRUE)
ctr.test2 == ctr.test
set.seed(144)
train.id2=sample(nrow(chickwts),0.7*nrow(chickwts))
ctr_test2=chickwts[-train.id2,]
ctr_train2=chickwts[train.id2,]
ctr_test == ctr_test2
You see that each method works as you expected but they are working somewhat differently.
Update:
A closer look reveals that sample.split() uses runif(). In contrast sample() relies on much lower level code. This is an article that can get you started on what that means:
https://datascienceconfidential.github.io/r/2017/12/28/how-to-read-source-code-internal-r-function.html
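To illustrate how two splitting strategies can share a seed and a training size and still pick different rows, here is a base-R sketch. The runif()-based split below is a hypothetical stand-in written for this example; it is not the actual implementation of sample.split().

```r
n <- nrow(chickwts)   # built-in data set
k <- floor(0.7 * n)   # target number of training rows

# Hypothetical runif()-based split: keep the k rows with the smallest draws.
set.seed(144)
in_train_runif <- rank(runif(n), ties.method = "first") <= k

# sample()-based split: draw k distinct row indices directly.
set.seed(144)
train_id <- sample(n, k)

# Same number of training rows in both cases.
sum(in_train_runif) == length(train_id)

# The selected rows themselves will generally differ, because each
# approach turns the random number stream into indices in its own way.
setequal(which(in_train_runif), train_id)
```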
When generating exams using the function exams2nops, we randomly generate data for each of the produced exams (let's say 5 different versions). We would like to use the exact same version of each exam to produce the solutions version (using exams2pdf). Is it possible to create the solution version on the go when generating exams with exams2nops? By exact same version I mean the same order of the multiple-choice answers and the same wrong values (using the marvelous num_to_schoice function). We save the .rds objects used in each exercise, allowing us to import them when generating solutions; however, the wrong options and their order are different since they are random. Should we also save a specific seed in the .rds object? Inside each exercise, we have several randomly generated values.
When you set the same random seed prior to calling exams2pdf() and exams2nops() you should get the same random versions of the exercises.
Illustration: n = 2 versions of an exam with 3 exercises.
library("exams")
exm <- c("capitals.Rmd", "deriv2.Rmd", "tstat2.Rmd")
set.seed(1)
exm1 <- exams2pdf(exm, n = 2)
set.seed(1)
exm2 <- exams2nops(exm, n = 2)
Compare the question list of all three exercises in the second random version of the exams:
all.equal(exm1[[2]][[1]]$questionlist, exm2[[2]][[1]]$questionlist)
## [1] TRUE
all.equal(exm1[[2]][[2]]$questionlist, exm2[[2]][[2]]$questionlist)
## [1] TRUE
all.equal(exm1[[2]][[3]]$questionlist, exm2[[2]][[3]]$questionlist)
## [1] TRUE
Both have to be called separately, though; there is currently no option to produce both in one go.
I need to generate a vector of "random" numbers, except that they need to be fully deterministic. The distribution from which the numbers come is not so important. What is a simple way to do this in R?
The reason for not using something like runif is that it returns a different sequence every time it is called.
The reason for not generating one sequence (with runif) and reusing it is that the calls are made on different machines. I could hardcode the sequence into a script, but the length of the sequence needed is unknown at design-time, so a pseudorandom sequence based on some hardcoded seed is preferable.
Are you aware of the set.seed() command?
R> set.seed(42); runif(3)
[1] 0.914806 0.937075 0.286140
R> set.seed(42); runif(3) # same seed, same numbers
[1] 0.914806 0.937075 0.286140
R> set.seed(12345); runif(3) # different seed, different numbers
[1] 0.720904 0.875773 0.760982
R>
There is also the SoDa package (and manual [PDF]) which allows you to wrap other operations and recover the starting and ending seed. It's just a wrapper around set.seed() but you can check for yourself (e.g. in unit tests) more easily.
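If the goal is a deterministic sequence of arbitrary length that does not disturb other random draws in the same session, one possible sketch is a small wrapper that sets a hardcoded seed and restores the previous RNG state afterwards. The function name deterministic_runif and its default seed are made up for this example.

```r
deterministic_runif <- function(n, seed = 42L) {
  # Remember the caller's RNG state (it may not exist in a fresh session).
  had_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  if (had_seed) old_seed <- get(".Random.seed", envir = globalenv())
  # Restore that state on exit so other random draws are unaffected.
  on.exit({
    if (had_seed) {
      assign(".Random.seed", old_seed, envir = globalenv())
    } else {
      rm(".Random.seed", envir = globalenv())
    }
  })
  set.seed(seed)
  runif(n)
}

# Same values on every call; a longer request extends the shorter one.
deterministic_runif(3)
identical(deterministic_runif(5)[1:3], deterministic_runif(3))
```

Because the length is only an argument, the same script can be run on different machines and asked for as many values as turn out to be needed.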
I would like to find all combinations of vector elements that match a specific condition. The function expand.grid returns all possible combinations without checking for a specific condition. It is possible to test for a specific condition after using the expand.grid function, but in some situations the number of possible combinations is too large to generate them with expand.grid. Therefore, is there a function that allows me to check for a condition while generating the combinations?
This is a simplified version of the problem:
A <- seq.int(12, from=0, by=1)*15
B <- seq.int(27, from=0, by=1)*23
C <- seq.int(18, from=0, by=1)*18
D <- seq.int(33, from=0, by=1)*10
out<-expand.grid(A,B,C,D) #out is a dataframe with 235144 x 4 as dimensions
idx<-which(rowSums(out)<=400 & rowSums(out)>=300) #Only a small fraction of 'out' is needed
results <- out[idx, ]
In a word, no. After all, if you knew a priori which combinations were desirable/undesirable, you could exclude them from the expansion, e.g. expand.grid(A[A<20],B[B<15],...) . In the general case, which I'm assuming is your real question, you have no simple way to exclude portions of the input vectors.
You might just want to write a multilevel loop which tests each combination in turn and saves or rejects it. This will be slow (again, unless you come up with some clever algorithm to predict regions which are all TRUE or FALSE). So, in the long run, you may be better off using some of the R-packages which partition large calculations (and datasets) so as to avoid exceeding your memory limits.
Now that I've said all that, someone's going to post a link to a package which does exactly that :-(
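For the specific condition in the question (a bounded row sum over non-negative values), one possible "clever algorithm" is to build the combinations one vector at a time and drop partial combinations as soon as their sum exceeds the upper bound. This is only a sketch under that non-negativity assumption; combos_in_range is a made-up helper name, not a function from any package.

```r
A <- seq.int(0, 12) * 15
B <- seq.int(0, 27) * 23
C <- seq.int(0, 18) * 18
D <- seq.int(0, 33) * 10

combos_in_range <- function(vectors, lower, upper) {
  # Start with a single empty partial combination (1 row, 0 columns).
  partial <- matrix(numeric(0), nrow = 1, ncol = 0)
  for (v in vectors) {
    # Extend every surviving partial combination by every element of v.
    idx <- expand.grid(row = seq_len(nrow(partial)), elem = seq_along(v))
    partial <- cbind(partial[idx$row, , drop = FALSE], v[idx$elem])
    # All values are non-negative, so a partial sum above 'upper'
    # can never come back into range: prune it now.
    partial <- partial[rowSums(partial) <= upper, , drop = FALSE]
  }
  # The lower bound can only be checked once all columns are present.
  partial[rowSums(partial) >= lower, , drop = FALSE]
}

res <- combos_in_range(list(A, B, C, D), lower = 300, upper = 400)
nrow(res)
```

The intermediate matrices stay much smaller than the full grid because most partial combinations are discarded early; with negative values or a different kind of condition this pruning rule would no longer be valid.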
I understand what set.seed() does and when I might use it, but I still have many questions about the function. Here are a few:
Is it possible to "reset" set.seed() to something "more random" if you have called set.seed() earlier in your session? Is that even necessary?
Is it possible to view the seed that R is currently using?
Is there a way to make set.seed() allow alphanumeric seeds, the way one can enter them at random.org (be sure you are in the advanced mode, and see "Part 3" of the form to see what I mean)?
Just for fun:
set.seed.alpha <- function(x) {
require("digest")
hexval <- paste0("0x",digest(x,"crc32"))
intval <- type.convert(hexval) %% .Machine$integer.max
set.seed(intval)
}
So you can do:
set.seed.alpha("hello world")
(in fact x can be any R object, not just an alphanumeric string)
It's possible, if you set the seed to something like the final digits of your time epoch, but it's really not necessary. The intended use of PRNGs is that you set the seed once at the start of a session, and use successive generated variates from this. Do things differently, and you don't get to enjoy the various good theoretical and empirical properties the R RNGs have.
But I'm not sure you really understand the purpose of set.seed. It's not really there for you to get 'more random' numbers. If you are doing some kind of application for which the R PRNG is insufficient (for instance, if you require cryptographic randomness), you might as well generate all your random numbers by some alternate method and use them directly. The real purpose of set.seed is to produce reproducibility in results using RNGs. If you start the same analysis using the same sequence of random number generations, and set the seed to the same value, you will always get the same result. This is helpful in debugging, and for others reviewing your results.
To use the epoch time, do something like
t <- as.numeric(Sys.time())
seed <- 1e8 * (t - floor(t))
set.seed(seed); print(seed)
For your question 3 there is the char2seed function in the TeachingDemos package, which will take a character string (alphanumeric), convert it to an integer, and by default use that to set a new seed. The idea was that students could use their name (or some combination/subset of names) as a seed so each student gets a different dataset, but the teacher can reproduce each student's dataset.
For an answer to 2, first see the help page ?RNGkind.
To find the kind of RNG in use:
RNGkind()
# [1] "Mersenne-Twister" "Inversion"
The Mersenne Twister is the default.
From the help page:
‘"Mersenne-Twister":’ From Matsumoto and Nishimura (1998). A
twisted GFSR with period 2^19937 - 1 and equidistribution in
623 consecutive dimensions (over the whole period). The
‘seed’ is a 624-dimensional set of 32-bit integers plus a
current position in that set.
To find the current seed in use, you need to first call the random number generator.
runif(1, 0, 1)
# [1] 0.9834062
.Random.seed
# [Gives a 626 length vector]
Calling set.seed(some_integer) followed by .Random.seed will always give the same 626-length vector if you use the same some_integer. To put it differently, the 626-length vector is determined solely by some_integer, given one is using the Mersenne Twister, of course.
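A quick way to check this for yourself, as a sketch assuming the default Mersenne Twister (the seed value 10 and the variable names are arbitrary):

```r
set.seed(10)
state_a <- .Random.seed       # full generator state right after seeding
invisible(runif(3))           # advance the generator
state_moved <- .Random.seed

set.seed(10)
state_b <- .Random.seed       # state after seeding again with the same value

length(state_a)                  # length of the state vector
identical(state_a, state_b)      # same seed, same full state
identical(state_a, state_moved)  # drawing numbers changes the state
```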
Also, of course, running set.seed to some fixed value will give you the same values for calls to random number routines following it. That's the main use for it in practice, to give reproducibility. E.g.
set.seed(1)
runif(5, 0, 1)
# [1] 0.2655087 0.3721239 0.5728534 0.9082078 0.2016819
rnorm(1, 0, 1)
# [1] 1.272429
set.seed(1)
runif(5, 0, 1)
# [1] 0.2655087 0.3721239 0.5728534 0.9082078 0.2016819
rnorm(1, 0, 1)
# [1] 1.272429
All the basic number generator code in R is in the file src/main/RNG.c in the source code.
It is in C, but fairly easy to follow.
I had the same issue as in question 1. I then figured I could simply reset the seed in the loop with:
set.seed(123)
x<- rnorm(10,1,1)
set.seed(NULL)
This way, at the end of each iteration the fixed seed is discarded and R re-initializes the generator as if no seed had been set. It worked for me.