Plot a table of binomial distributions in R

For a game design issue, I need to better inspect binomial distributions. Using R, I need to build a two-dimensional table that, given fixed parameters 'pool' (the number of dice rolled) and 'sides' (the number of sides of each die), has:
In rows --> the minimum roll for a success (ranging from 0 to sides; it's a discrete distribution)
In columns --> the number of successes (ranging from 0 to pool)
I know how to calculate it as a single task, but I'm not sure how to iterate to fill the entire table.
EDIT: I forgot to say that I want to calculate the probability p of gaining at least that number of successes.
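(For reference, one way to fill such a table in a single step is outer() combined with pbinom(); the sketch below uses placeholder values for pool and sides and assumes the "roll m or higher succeeds" convention.)
pool  <- 5
sides <- 10
# single-die success probability for each minimum roll m (m or higher succeeds)
p.success <- (sides - (1:sides) + 1) / sides
# P(at least k successes) for k = 0..pool, one row per threshold m
tab <- outer(p.success, 0:pool, function(p, k) 1 - pbinom(k - 1, pool, p))
dimnames(tab) <- list(paste0(1:sides, "+"), 0:pool)
round(tab, 3)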

OK, I think this could be a simple solution. It has the ratio of successes on the rows and the success threshold for a single die roll (p) on the columns.
poolDistribution <- function(n, sides=10, digits=2, roll.Under=FALSE){
  m <- 1:sides
  names(m) <- paste(m, ifelse(roll.Under, "-", "+"), sep="")
  s <- 1:n
  names(s) <- paste(s, n, sep="/")
  sapply(m, function(m.value) {
    p <- if (roll.Under) m.value/sides else (sides - m.value + 1)/sides
    round((1 - pbinom(s - 1, n, p)) * 100, digits=digits)  # P(at least s successes), in %
  })
}

This gets you half of the way.
If you are new to R, you might not realise that one of its most powerful features is the ability to use a vector of values as an index into another vector. This makes part of the problem trivially easy:
pool <- 3
sides <- 20 # <cough>D&D<cough>
# you need to store the values somewhere, use a vector
NumberOfRollsPerSide <- rep(0, sides)
names(NumberOfRollsPerSide) <- 1:sides # this will be useful in a table
## Repeat as long as there are still zeros,
## i.e. as long as there is a side that has not come up yet
while (any(NumberOfRollsPerSide == 0)) {
  # roll once
  oneRoll <- sample(1:sides, pool, TRUE)
  # add (+1) to each side's total rolls
  # note that you can use the roll outcome to index the vector. R is great.
  NumberOfRollsPerSide[oneRoll] <- NumberOfRollsPerSide[oneRoll] + 1
}
# These are your results:
NumberOfRollsPerSide
All that's left to do now is record, for each side, the roll number at which it first came up.
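A minimal sketch of that last step, re-using the same simulation idea (the variable names below are just illustrative): instead of only counting rolls per side, record the roll number at which each side first appears.
firstAppearance <- rep(NA_integer_, sides)
rollNumber <- 0
while (any(is.na(firstAppearance))) {
  rollNumber <- rollNumber + 1
  oneRoll <- sample(1:sides, pool, TRUE)
  notSeenYet <- oneRoll[is.na(firstAppearance[oneRoll])]
  firstAppearance[notSeenYet] <- rollNumber  # first roll on which each new side showed up
}
firstAppearance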

Related

How to find the probability of extinction = 1 using Galton-Watson process in R?

I am simulating a basic Galton-Watson process (GWP) using a geometric distribution. I'm using this to find the probability of extinction for each generation. My question is, how do I find the generation at which the probability of extinction is equal to 1?
For example, I can create a function for the GWP like so:
# Galton-Watson Process for geometric distribution
GWP <- function(n, p) {
  Sn <- c(1, rep(0, n))
  for (i in 2:(n + 1)) {
    Sn[i] <- sum(rgeom(Sn[i - 1], p))
  }
  return(Sn)
}
where n is the number of generations.
Then, if I set the geometric distribution parameter p = 0.25, to calculate the probability of extinction for, say, generation 10 I just do this:
N <- 10 # Number of elements in the initial population.
GWn <- replicate(N, GWP(10, 0.25)[10])
probExtinction <- sum(GWn==0)/N
probExtinction
This will give me the probability of extinction for generation 10... To find the probability of extinction for each generation I have to change the index value (to the corresponding generation number) when creating GWn... But what I'm trying to do is find the generation at which the probability of extinction equals 1.
Any suggestions as to how I might go about solving this problem?
I can tell you how you would do this problem in principle, but I'm going to suggest that you may run into some difficulties (if you already know everything I'm about to say, just take it as advice to the next reader ...)
Theoretically, the Galton-Watson process extinction probability never goes exactly to 1 (unless prob==1, or in the infinite-time limit); a short sketch of the theoretical calculation follows after these points.
Of course, for any given replicate and random-number seed you can compute the first time point (if any) at which all of your lineages have gone extinct. This will be highly variable across runs, depending on the random-number seed ...
The distribution of extinction times is extremely skewed; lineages that don't go extinct immediately will last a loooong time ...
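As a sketch of the first point (assuming R's rgeom parameterisation, i.e. offspring counts 0, 1, 2, ... with success probability p): the per-lineage extinction probability is the smallest root of G(s) = s, where G(s) = p / (1 - (1 - p) s) is the offspring probability generating function.
extinction_prob <- function(p) {
  if ((1 - p) / p <= 1) return(1)             # mean offspring <= 1: extinction is certain (eventually)
  f <- function(s) p / (1 - (1 - p) * s) - s  # G(s) - s
  uniroot(f, c(0, 1 - 1e-9))$root             # smallest root, strictly inside (0, 1)
}
extinction_prob(0.25)  # about 1/3 for the p = 0.25 case in the question
extinction_prob(0.5)   # 1: the critical case used in the simulation below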
I modified your GWP function in two ways to make it more efficient: (1) stop the simulation when the lineage goes extinct; (2) replace the sum of geometric deviates with a single negative binomial deviate (see here)
GWP <- function(n, p) {
  Sn <- c(1, rep(0, n))
  for (i in 2:(n + 1)) {
    Sn[i] <- rnbinom(1, size=Sn[i - 1], prob=p)
    if (Sn[i]==0) break ## extinct, bail out
  }
  return(Sn)
}
The basic strategy now is: (1) run the simulations for a while, keep the entire trajectory; (2) compute extinction probability in every generation; (3) find the first generation such that p==1.
set.seed(101)
N <- 10 # Number of elements in the initial population.
maxgen <- 100
GWn <- replicate(N, GWP(maxgen, 0.5), simplify="array")
probExtinction <- rowSums(GWn==0)/N
which(probExtinction==1)[1]
(Subtract 1 from the last result if you want to start indexing from generation 0.) In this case the answer is NA, because 1 of the 10 lineages manages to stay alive (and indeed gets very large, so it will probably persist almost forever).
plot(0:maxgen, probExtinction, type="s") ## plot extinction probability
matplot(1+GWn,type="l",lty=1,col=1,log="y") ## plot lineage sizes (log(1+x) scale)
## demonstration that (sum(rgeom(n,...)) is equiv to rnbinom(1,size=n,...)
nmax <- 70
plot(prop.table(table(replicate(10000, sum(rgeom(10, prob=0.3))))),
     xlim=c(0,nmax))
points(0:nmax, dnbinom(0:nmax, size=10, prob=0.3), col=2, pch=16)

How to create a for loop in R to run a simple random sample and calculate the average of each set

data = read.csv(file= "~/Downloads/data.csv")
temp=(data$temp)
n=75
N=length(temp)
s=sample(1:N, n)
ybar=mean(temp[s])
I want to run the sample 100 times where n is 75, then calculate the average of each sample and subtract each average from a set number (50).
Maybe a short loop is the way to go. Notice the bracketed index on the left-hand side of the assignment in the last line of the loop - it's the key to using a loop for your calculation!
# set a seed - always a good idea when using randomness like 'sample()'
set.seed(123)
# pre-allocate an "empty" vector to fill in with results (one per repetition)
reps <- 100
ybar_vec <- vector(length = reps)
# do your calculation "reps" times
for (i in 1:reps) {
  s <- sample(N, n)                  # draw a sample of n indices from 1:N
  ybar_vec[i] <- 50 - mean(temp[s])  # store i-th calc as i-th element of ybar_vec
}
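An equivalent alternative, shown only as a sketch (it re-uses temp, N, and n from above): replicate() repeats the expression 100 times and collects the results into a vector.
set.seed(123)
# 100 repetitions of: draw n indices from 1:N, average temp over them, subtract from 50
ybar_vec2 <- replicate(100, 50 - mean(temp[sample(N, n)]))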

Faster way to generate large list of vectors from permuted datasets [R]

Setup
For the purposes of my simulation, I'm generating a list of B=2000 elements, with each element being the output of a permutation procedure in which I first permute the rows of a 200x8000 matrix and, for each column, I calculate the Kolmogorov-Smirnov test statistic between the first and second 100 rows (you can think of the first 100 rows as data from one group and the second 100 rows as data from another group).
Question
This process takes a very long time (about 30-40 minutes) to generate the list. Is there a much faster way? In the future, I'd like to increase B to a larger value.
Code
B=2000
n.row=200; n.col=8000
#Generate sample data
samp.dat = matrix(rnorm(n.row*n.col), nrow=n.row)
perm.KS.list = NULL
for (b in 1:B){
  #permute the rows
  perm.dat.tmp = samp.dat[sample(nrow(samp.dat)), ]
  #Compute the permutation-based test statistics
  perm.KS.list[[b]] = apply(perm.dat.tmp, 2, function(y) ks.test.stat(y[1:100], y[101:200]))
}
#Modified KS-test function (from base package)
ks.test.stat <- function(x, y){
  x <- x[!is.na(x)]
  n <- length(x)
  y <- y[!is.na(y)]
  n.x <- as.double(n)
  n.y <- length(y)
  w <- c(x, y)
  z <- cumsum(ifelse(order(w) <= n.x, 1/n.x, -1/n.y))
  z <- z[c(which(diff(sort(w)) != 0), n.x + n.y)] #exclude ties
  STATISTIC <- max(abs(z))
  return(STATISTIC)
}
The 1:B loop has several places to optimize, but I agree that the real time consumer is that inner function. Because you're simulating well-behaved samples, you can make two simplifying assumptions that the general base function can't:
There aren't missing values, which obviates the is.na() adjustments.
The two sides (i.e., x & y) have the same number of elements, so you don't need to count them separately. Instead of splitting y in the loop and then joining the pieces back together in the function (into w), just keep them together. The balanced sides also permit simplifications like removing the ifelse() clause: the comparison produces a bunch of 0/1s, which are rescaled to -1/+1s with integer arithmetic.
The reduced function below saves about 25% of the time. I also used integers, instead of doubles, inside cumsum().
ks.test.stat.balanced <- function(w){
  n <- as.integer(length(w) * .5)
  # z <- cumsum(ifelse(order(w) <= n, 1L, -1L)) / n
  z <- cumsum((order(w) <= n)*2L - 1L) / n
  # z <- z[c(which(diff(sort(w)) != 0), n + n)] #exclude ties
  return( max(abs(z)) )
}
Ties shouldn't occur often with your gaussian rng, and the diff(sort(.)) is very expensive. If you're willing to remove that protection, the time is reduced by about 65%.
If you move the equation for z into abs(), it saves a little time over all those reps. I kept it separate above, so it's easier to read.
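For completeness, a hedged sketch of how the loop body would call the balanced version (the whole permuted column is passed, instead of splitting it into two halves first):
# inside the 1:B loop, replacing the original apply() call
perm.KS.list[[b]] = apply(perm.dat.tmp, 2, ks.test.stat.balanced)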
Edit: in case of an unbalanced simulation (see the sketch after these points), I'd recommend you:
still keep out the is.na,
still pass w,
still keep as much as possible in integer, not numeric, but
now include arguments n1 & n2 for the two group sizes.
Also, experiment with precalculating 1/n before cumsum() to avoid a lot of expensive divisions. Try to think of other math-y ways to pull calculations out of an inner loop so they occur less frequently.
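A possible sketch of that unbalanced variant, under the same no-NA and no-ties assumptions (n1 and n2 are the hypothetical group-size arguments suggested above; the per-step increments are built without ifelse()):
ks.test.stat.unbalanced <- function(w, n1, n2) {
  from.x <- order(w) <= n1                    # TRUE where the sorted value came from the first group
  z <- cumsum(from.x * (1/n1 + 1/n2) - 1/n2)  # +1/n1 for group-1 values, -1/n2 otherwise
  return( max(abs(z)) )
}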

R: draw from a vector using custom probability function

Forgive me if this has been asked before (I feel it must have, but I could not find precisely what I am looking for).
How can I draw one element of a vector of whole numbers (from 1 through, say, 10) using a probability function that specifies different chances for the elements? If I want equal probabilities I use runif() to get a number between 1 and 10:
ceiling(runif(1,1,10))
How do I similarly sample from e.g. the exponential distribution to get a number between 1 and 10 (such that 1 is much more likely than 10), or a logistic probability function (if I want a sigmoid increasing probability from 1 through 10)?
The only "solution" I can come up with is first to draw 1e6 numbers from, say, the sigmoid distribution and then scale the min and max to 1 and 10 - but this looks clumsy.
UPDATE:
This awkward solution (and I don't feel it is very "correct") would go like this:
#Draw enough values from a distribution, here the exponential
x <- rexp(1e3)
#Scale the values to e.g. 1-10
scaler <- function(vector, min, max){
  (((vector - min(vector)) * (max - min))/(max(vector) - min(vector))) + min
}
x_scale <- scaler(x, 1, 10)
#And sample once (and round it)
round(sample(x_scale, 1))
Are there not better solutions around?
I believe sample() is what you are looking for, as #HubertL mentioned in the comments. You can define an increasing function (here the logistic sigmoid, called logit() below), evaluate it on the vector v you want to sample from, and use the output of that function as a vector of probabilities p. See the code below.
logit <- function(x) {
  return(exp(x)/(exp(x)+1))
}
v <- c(seq(1,10,1))
p <- logit(seq(1,10,1))
sample(v, 1, prob = p, replace = TRUE)
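The same pattern covers the other case in the question (1 much more likely than 10): any decreasing positive weights can be passed as prob, for example exponential density values; the rate of 0.5 below is an arbitrary illustrative choice, and sample() normalises the weights internally.
p_dec <- dexp(1:10, rate = 0.5)   # decreasing weights: 1 most likely, 10 least likely
sample(v, 1, prob = p_dec, replace = TRUE)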

Draw random numbers from distribution within a certain range

I want to draw a number of random variables from a series of distributions. However, the values returned have to be no higher than a certain threshold.
Let’s say I want to use the gamma distribution, the threshold is 10, and I need n=100 random numbers. I now want 100 random numbers between 0 and 10. (Say scale and shape are 1.)
Getting 100 random variables is obviously easy...
rgamma(100, shape = 1, rate = 1)
But how can I ensure that these values range from 0 to 10?
EDIT
To make my question clearer: the 100 values drawn should be scaled between 0 and 10, so that the highest drawn value is 10 and the lowest 0. Sorry if this was not clear...
EDIT No2
To add some context to the random numbers I need: I want to draw "system repair times" that follow certain distributions. However, within the system simulation there is a binomial probability of repairs being "simple" (i.e. short repair time) or "complicated" (i.e. long repair time). I now need a function that provides "short repair times" and one that provides "long repair times". The threshold would be the differentiation between short and long repair times. Again, I hope this makes my question a little clearer.
This is not possible with a gamma distribution.
The support of a distribution determines the range of sample data drawn from it.
As the support of the gamma distribution is (0, inf), this is not possible (see https://en.wikipedia.org/wiki/Gamma_distribution).
If you really want to have a gamma distribution, take a rejection sampling approach as Alex Reynolds suggests.
Otherwise look for a distribution with a bounded/finite support (see https://en.wikipedia.org/wiki/List_of_probability_distributions), e.g. the uniform or the binomial.
Well, fill the vector with rejection sampling; untested code:
v <- rep(-1.0, 100)
k <- 1
while (TRUE) {
  q <- rgamma(1, shape=1, rate=1)
  if (q > 0.0 && q < 10) {   # keep only draws below the threshold of 10
    v[k] <- q
    k <- k + 1
    if (k > 100)
      break
  }
}
I'm not sure you can keep the properties of the original distribution while imposing additional conditions... But something like this will do the job:
Filter(function(x) x < 10, rgamma(1000,1,1))[1:100]
For the scaling - beware, the outcome will not follow the original distribution (but there's no way around that, as the other answers pointed out):
# rescale numeric vector into (0, 1) interval
# clip everything outside the range
rescale <- function(vec, lims=range(vec), clip=c(0, 1)) {
  # find the coefficients of the transforming linear equation
  # that maps the lims range to (0, 1)
  slope <- (1 - 0) / (lims[2] - lims[1])
  intercept <- -slope * lims[1]
  xformed <- slope * vec + intercept
  # do the clipping
  xformed[xformed < 0] <- clip[1]
  xformed[xformed > 1] <- clip[2]
  xformed
}
# this is the requested data
10 * rescale(rgamma(100,1,1))
Use the truncdist package. It truncates any distribution between lower and upper bounds.
Hope that helped.
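A brief sketch of that approach, assuming the rtrunc() interface of the truncdist package (spec names the distribution, a and b are the bounds, and the remaining arguments are passed to the gamma functions):
# install.packages("truncdist")  # if not already installed
library(truncdist)
x <- rtrunc(100, spec = "gamma", a = 0, b = 10, shape = 1, rate = 1)
range(x)  # all values fall in (0, 10)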
