I am wondering how to fix the random seed for the sample() function in R. Here is a simple example:
set.seed(1)
tmp <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
sample(tmp, size = 3) # (*)
#[1] 9 4 7
sample(tmp, size = 3) # (**)
#[1] 1 2 5
sample(tmp, size = 3) # (***)
#[1] 7 2 3
sample(tmp, size = 3) # (****)
#[1] 3 1 5
# reset the random seed and repeat
set.seed(1)
sample(tmp, size = 3) # (*)
#[1] 9 4 7
sample(tmp, size = 3) # (**)
#[1] 1 2 5
sample(tmp, size = 3) # (***)
#[1] 7 2 3
sample(tmp, size = 3) # (****)
#[1] 3 1 5
When I call sample() repeatedly, the returned values change, although the pattern of changes is the same each time I reset the seed. I cannot understand why the sample() calls are affected differently by set.seed(). How can I make sample() return the same values on every call? Thank you for taking the time to read this question.
Something like this may be what you want:
{set.seed(1); sample(tmp, 3)}
[1] 9 4 7
This will return the same result whenever it is called, because the seed is reset immediately before each draw.
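If you need this pattern repeatedly, you could wrap the reset-and-draw in a small helper (seeded_sample is a hypothetical name; note that set.seed() resets the global RNG state as a side effect):
# hypothetical helper: reset the seed, then draw, so every call gives the same result
seeded_sample <- function(x, size, seed = 1) {
  set.seed(seed)  # side effect: resets the global RNG state
  sample(x, size)
}
seeded_sample(tmp, 3)
# [1] 9 4 7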
I think your post already answers the question. Notice that the first sample drawn after setting the seed is the same both times, which is exactly what set.seed() is expected to do:
set.seed(1)
tmp <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
sample(tmp, size = 3)
[1] 9 4 7
sample(tmp, size = 3)
[1] 1 2 5
set.seed(1)
tmp <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
sample(tmp, size = 3)
[1] 9 4 7
sample(tmp, size = 3)
[1] 1 2 5
The first draw of three after setting the seed is the same both times, and so is the second draw of three. It may be easier to see how set.seed() works by looking at a single draw of 6 randomly selected values:
set.seed(1)
tmp <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
sample(tmp, size = 6)
[1] 9 4 7 1 2 5
Your question seems to reflect a basic misunderstanding of Pseudo-Random Number Generators (PRNGs) and seeds. PRNGs maintain an internal state containing a fixed number of bits, and advance from the current state to the next state by applying a deterministic function. That function determines how "random" the sequence of state-based outputs appears to be, even though it's not actually random (hence the pseudo prefix). Given a fixed number of bits, the algorithm will eventually repeat its internal state, at which point the entire sequence repeats since it's deterministic. How long it takes to do this is called the cycle length of the PRNG.
Once you realize that all PRNGs will eventually repeat their sequence, it's easy to see that a seed does nothing more (nor less) than provide an entry point to that cycle. A lot of people mistakenly think the seed influences the quality of the PRNG, but that's incorrect. The quality is based on the deterministic transition function, not on the seed.
So why do PRNGs allow you to specify a seed? Basically it's for reproducibility. Firstly, it's really hard to debug a program if you can't reproduce the data which exposed your bug. Re-running with the same seed will allow you to trace through to whatever caused a problem. Secondly, this allows for fairer comparisons of alternative systems. Let's say you want to compare how a grocery store, or a bank, works when you add an additional clerk or teller. If you run both configurations with the same set of customers and customer transactions by controlling seeding, the differences you see between those configurations are directly attributable to the change in the system configuration. Without controlling the seeding, differences may be damped or amplified by the randomness of the sequence. Yes, in the long run that would come out by taking a larger sample, but the point is that you can reduce the variance of the answer by clever use of seeding rather than boosting the sample size.
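As a practical aside, R exposes the generator's internal state as the .Random.seed vector in the global environment, so reproducibility can also be had by capturing and restoring the state directly instead of re-seeding; a minimal sketch:
set.seed(1)
state <- .Random.seed   # capture the PRNG's internal state
x <- sample(10, 3)
.Random.seed <- state   # restore the state (works at top level)
y <- sample(10, 3)
identical(x, y)
# [1] TRUE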
Bottom line - your example is working exactly as expected. Resetting the seed produces the same sequence of randomness when sampled in the same way.
Related
I want to simulate a sequence of random non-negative integer values in R.
However, those values should not follow any particular probability distribution function and could be empirically distributed.
How do I go about doing it?
You will need a distribution; there is no alternative, philosophically. There's no such thing as a "random number," only numbers randomly distributed according to some distribution.
To sample from an empirical distribution stored as my_dist, you can use sample():
my_dist <- c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55) # first 10 Fibonacci numbers
sample(my_dist, 100, replace = TRUE) # draw 100 numbers from my_dist with replacement
Or, for some uniformly-distributed numbers between (for instance) 1 and 10, you could do:
sample(1:10, 100, replace = TRUE)
There are, of course, specific distributions implemented as functions in base R and various packages, but I'll avoid those since you said you weren't interested in them.
Editing per Rui's good suggestion: If you want non-uniform variables, you can specify the prob parameter:
sample(1:3, 100, replace = TRUE, prob = c(6, 3, 1))
# draws a 1 with 60% prob., a 2 with 30% prob., and a 3 with 10% prob.
I am working with a large dataset that contains longitudinal data on the gambling behavior of 184,113 participants. The data are based on complete tracking of electronic gambling behavior within a gambling operator. Gambling behavior is aggregated at a monthly level over a total of 70 months. I have an ID variable separating participants, a time variable (months), and numerous gambling behavior variables such as active days played in a given month, bets placed in a given month, total losses in a given month, etc. Participants vary in when they were actively gambling: one participant may have gambled in months 2, 3, 4, and 7, another in months 3, 5, and 7, and a third in months 23, 24, 48, and 65.
I am attempting to fit a truncated negative binomial 2 model in glmmTMB, and I am wondering how the package handles the absence of zeros. I have longitudinal data on gambling behavior: days played in each month (for a total of 70 months). The variable can take values between 1 and 31 (depending on the month); there are no zeros, because participants' months with 0 days played are absent from the dataset. Here is an example of how the data are structured, with just two participants:
# Example variables and data frame in long form
# Includes id variable, time variable and example variable
id <- c(1, 1, 1, 1, 2, 2, 2)
time <- c(2, 3, 4, 7, 3, 5, 7)
daysPlayed <- c(2, 2, 3, 3, 2, 2, 2)
dfLong <- data.frame(id = id, time = time, daysPlayed = daysPlayed)
My question: how do I specify where the truncation happens in glmmTMB? Does it default to 0? I want to truncate at 0 and have run the following code (I am going to compare models; the first one is a simple unconditional one):
DaysPlayedUnconditional <- glmmTMB(daysPlayed ~ 1 + (1 | id), dfLong, family = truncated_nbinom2)
Will it do the trick?
From Ben Bolker via r-sig-mixed-models@r-project.org:
"I'm not 100% clear on your question, but: glmmTMB only does zero-truncation, not k-truncation with k > 0, i.e. you can only specify the model
Prob(x == 0) = 0
Prob(x > 0) = Prob(NBinom(x)) / Prob(NBinom(x > 0))
(terrible notation, but hopefully you get the idea)"
What is actually going on in R, i.e. what are the statistics behind rbinom(3, 10, 0.75)? I know the 3 is the number of observations, the 10 is the number of trials, and 0.75 is the probability of success.
The first result from running rbinom(3, 10, 0.75) was 9, 8, 9.
The second result was 7, 8, 7.
The third result was 9, 6, 6.
How is R creating the sequence of numbers?
I know it is using the binomial distribution, C(n, x) * p^x * (1-p)^(n-x), where p = 0.75, 1-p = 0.25, and n = 10, but what is x? What is the range of possible values? Thank you.
set.seed(42)
rbinom(3, 10, 0.75)
#[1] 6 5 8
The number of experiments is 3. Each experiment consists of 10 trials, and the probability of success in each trial is 0.75. Each returned value is x, the number of successes in one experiment, so the possible values range from 0 to 10.
In the first experiment (trial # 1 - trial # 10), there are 6 successes.
In the second experiment (trial # 11 - trial # 20), there are 5 successes.
In the third experiment (trial # 21 - trial # 30), there are 8 successes.
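In other words, each number reported by rbinom() is the count of successes in 10 independent Bernoulli(0.75) trials. Here is a rough sketch of the same idea with the trials spelled out (this does not reproduce rbinom's exact random stream, so the numbers will differ, but the distribution is identical):
set.seed(42)
# each experiment: count the successes among 10 Bernoulli(0.75) trials
replicate(3, sum(runif(10) < 0.75))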
I'm new to R, but I am trying to use it to aggregate losses observed from a severity distribution according to an observation from a frequency distribution (essentially what rcompound does). However, I need a more granular approach, as I need to manipulate the severity distribution before 'aggregation'.
Let's take an example. Suppose you have:
rpois(10,lambda=3)
This gives you something like:
[1] 2 2 3 5 2 5 6 4 3 1
Additionally, suppose the severity of losses is determined by:
rgamma(20,shape=1,scale=10000)
So we also have the following output:
[1] 233.0257 849.5771 7760.4402 731.5646 8982.7640 24172.2369 30824.8424 22622.8826 27646.5168 1638.2333 6770.9010 2459.3722 782.0580 16956.1417 1145.4368 5029.0473 3485.6412 4668.1921 5637.8359 18672.0568
My question is: what is an efficient way to get R to take each Poisson observation in turn and aggregate the corresponding losses from my severity distribution? For example, the first Poisson observation is 2, so adding the first two observations from my Gamma distribution gives 1082.60.
I say this needs to be 'efficient' (in run time) for two reasons:
- The Poisson parameter may become significantly large, i.e. up to 1000 or so.
- The number of realisations is likely to be up to 1,000,000, i.e. up to a million Poisson and Gamma observations to sort through.
Any help would be greatly appreciated.
Thanks, Dave.
It looks like you want to split the gamma vector at the positions indicated by the cumulative sums of the Poisson vector.
The following function (from here) does the splitting:
# split x into chunks, each chunk beginning at one of the indices in pos
splitAt <- function(x, pos) unname(split(x, cumsum(seq_along(x) %in% pos)))
pois <- c(2, 2, 3, 5, 2, 5, 6, 4, 3, 1)
gam <- c(233.0257, 849.5771, 7760.4402, 731.5646, 8982.7640, 24172.2369, 30824.8424, 22622.8826, 27646.5168, 1638.2333, 6770.9010, 2459.3722, 782.0580, 16956.1417, 1145.4368, 5029.0473, 3485.6412, 4668.1921, 5637.8359, 18672.0568)
posits <- cumsum(pois)
Then do the following:
sapply(splitAt(gam, posits + 1), sum)
[1] 1082.603 8492.005 63979.843 61137.906 17738.200 19966.153 18672.057
According to the post I linked to above, the splitAt() function slows down for large arrays, so you could (if necessary) consider the alternatives proposed in that post. For my part, I generated 1e6 Poisson and 1e6 Gamma draws, and the above function ran in 0.78 sec on my machine.
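For what it's worth, a vectorised alternative that avoids splitting altogether is to label each severity draw with the index of its Poisson observation and compute grouped sums with rowsum(). This sketch assumes one Gamma draw per individual loss, i.e. a severity vector of length sum(pois), which is the usual compound-distribution setup:
pois <- rpois(1e6, lambda = 3)
gam  <- rgamma(sum(pois), shape = 1, scale = 10000)  # one severity draw per loss
grp  <- rep(seq_along(pois), times = pois)           # Poisson index for each loss
agg  <- rowsum(gam, grp)  # grouped sums; observations with zero losses are simply absent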
I am setting up my R code for a Monte Carlo simulation, and I need a sample of 1 number from a random distribution. To test how the sample function works in R, I ran the code below, but I do not understand the reason for the different results.
x <- rnorm(1,8,0)
x
#8
y <- sample(x = rnorm(1, 8, 0), size = 1)
y
#4
Quoting ?sample,
If x has length 1, is numeric (in the sense of is.numeric) and x >= 1,
sampling via sample takes place from 1:x.
you're actually drawing from c(1, 2, 3, 4, 5, 6, 7, 8) and not from c(8).
However, it works if we draw from a "character" vector, since the length-1 shortcut only applies to numeric x:
as.numeric(sample(as.character(rnorm(1,8,0)), size=1))
# [1] 8
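A cleaner general fix, suggested in the Examples section of ?sample, is a small resample() wrapper that always treats x as the population, whatever its length:
resample <- function(x, ...) x[sample.int(length(x), ...)]
resample(rnorm(1, 8, 0), size = 1)
# [1] 8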