R - How to Speed Up Recursion and Double Summation

Since this is essentially a question about how to perform a computation efficiently in R, I will start with the equation and the code, and then explain the underlying problem afterwards for those who find it useful or interesting.
I have written a script in R to generate values using the following function:
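(The formula itself appears to have been an image in the original post; reconstructing it from the script below, it reads, in LaTeX notation:)

\[
P(n,t) \;=\; \sum_{i=t}^{n}\binom{n}{i}\Big(\tfrac{1}{3}\Big)^{i}\Big(\tfrac{2}{3}\Big)^{n-i}
\;+\; \sum_{k=1}^{t-1}\sum_{j=0}^{k-1}\binom{n}{k}\Big(\tfrac{1}{6}\Big)^{j}\Big(\tfrac{1}{6}\Big)^{k-j}\Big(\tfrac{2}{3}\Big)^{n-k}\,P(k-j,\;t-k)
\]

where the first sum is taken to be 0 when t > n and the double sum is taken to be 0 when t = 1 (the code's choose(n, n-k) equals choose(n, k)).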
The function, as you can see, is recursive and involves double summation. It works well for small numbers around 15 or lower, but the execution time gets prohibitively long at higher values of n and t. I need to be able to perform the calculation for every n and t pair from 1 to 30. Is there a way to write a script that won't take months to execute?
My current script is:
explProb <- function(n, t) {
  prob <- 0
  #################################
  # FIRST PART - SINGLE SUMMATION
  #################################
  i <- 0
  if (t <= n) {
    i <- c(t:n)
  }
  prob <- sum(choose(n, i[i > 0]) * ((1/3)^(i[i > 0])) * ((2/3)^(n - i[i > 0])))
  #################################
  # SECOND PART - DOUBLE SUMMATION
  #################################
  if (t >= 2) {
    for (k in 1:(t - 1)) {
      j <- c(0:(k - 1))
      prob <- prob + sum(choose(n, n - k) * ((1/6)^(j)) * ((1/6)^(k - j)) * ((2/3)^(n - k)) * explProb(k - j, t - k))
    }
  }
  return(prob)
}
MAX_DICE <- 30
MAX_THRESHOLD <- 30
probabilities <- matrix(0, MAX_DICE, MAX_THRESHOLD)
for (dice in 1:MAX_DICE) {
  for (threshold in 1:MAX_THRESHOLD) {
    # print(sprintf("DICE = %d : THRESH = %d", dice, threshold))
    probabilities[dice, threshold] <- explProb(dice, threshold)
  }
}
I am trying to write a script to generate a set of probabilities for a particular type of dice roll in a tabletop roleplaying game (Shadowrun 5th Edition, to be specific). The type of dice roll is called an "Exploding Dice Roll". In case you are not familiar with how these rolls work in this game, let me briefly explain.
Whenever you try to accomplish a task you make a test by rolling a number of six-sided dice. Your goal is to get a predetermined number of "hits" when rolling those dice. A "hit" is defined as a 5 or 6 on a six-sided die. So, for example, if you have a dice pool of 5 dice, and you roll: 1, 3, 3, 5, 6, then you have gotten 2 hits.
In some cases you are allowed to re-roll all of the 6's that were rolled in order to try to get MORE hits. This is called an "exploding" roll. The 6's count as hits, but can be re-rolled to "explode" into even more hits. For clarification I'll give a quick example...
If you roll 10 dice and get a result of 1, 2, 2, 4, 5, 5, 6, 6, 6, 6, then you have gotten 6 hits on the first roll... However, the 4 dice that rolled 6's can be re-rolled again. If you roll those dice and get 3, 5, 6, 6, then you have 3 more hits for a total of 9 hits. But you can now re-roll the two new sixes you got... etc... You keep re-rolling the sixes, adding the 5's and 6's to your total hits, and keep going until you get a roll with no sixes.
The function listed above generates these probabilities, taking as input the "# of dice" and the "number of hits" (called the "threshold" here).
n = # of Dice being rolled
t = Threshold number of "hits" to be reached

Calculation with Transition Matrix
If we have n=10 dice, then the probability of 0 to 10 occurrences of an event with prob=2/6 may be efficiently calculated in R as
dbinom(0:10,10,2/6)
Since you are allowed to keep rolling until failure, any number of ultimate hits is possible (the support of the distribution is [0, Inf)), albeit with geometrically diminishing probabilities. A recursive numeric solution is still feasible: the diminishing probabilities let us cut the recursion off at machine precision, and the threshold gives a natural point at which to censor.
Since rerolls are with a smaller number of dice, it makes sense to precalculate all transition probabilities.
X<-outer(0:10,0:10,function(x,size) dbinom(x,size,2/6))
Where the entry in the i-th row and j-th column gives the probability of (i-1) successes (hits) with (j-1) trials (dice rolled). For example, the probability of exactly 1 success with 6 trials is located at X[2,7].
Now if you start out with 10 dice, we can represent this as the vector
d<-c(rep(0,10),1)
Showing that with probability 1 we have 10 dice, and probability 0 everywhere else.
After a single roll, the probabilities of the number of live dice is X %*% d.
After two rolls, the probabilities are X %*% X %*% d. We can calculate the live dice state probabilities after any number of rolls by iterating.
T<-Reduce(function(dn,n) X %*% dn,1:11,d,accumulate=TRUE)
Where T[[1]] gives the probabilities of live dice before the first roll and T[[11]] gives the probabilities of live dice before the 11th (i.e. after the 10th).
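Those state probabilities are already enough for expected values. For instance, here is a small sketch (mine, not part of the original answer) using the T computed above: in this model every hit keeps a die alive for the next roll, so the expected total number of hits is the sum of the expected number of live dice before rolls 2, 3, ...
sum(sapply(T[-1], function(p) sum(p * 0:10)))
# approximately 5, the theoretical value 10 * (1/3) / (2/3)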
This is sufficient to calculate expected values, but for the distribution of cumulative sums, we'll need to track additional information in the state. The following function reshapes a state matrix at each step so that the i-th row and j-th column has the probability of (i-1) live dice with a current cumulative total of j-1.
step <- function(m) {
  # row/column index of every cell of m
  idx <- arrayInd(seq_along(m), dim(m))
  # shift each cell's column so the new column index is (row + col - 1),
  # i.e. the cumulative total grows by the current number of live dice
  idx[, 2] <- rowSums(idx) - 1
  # dimensions of the expanded matrix, taken from the last (largest) index
  i <- idx[nrow(idx), ]
  m2 <- matrix(0, i[1], i[2])
  m2[idx] <- m
  return(m2)
}
In order to recover the probabilities for cumulative totals, we use the following convenience function to sum across anti-diagonals
conv<-function(m)
tapply(c(m),c(row(m)+col(m)-2),FUN=sum)
The probabilities of continuing to roll rapidly diminish, so I've cut off at 40, and shown up to 20, rounded to 4 places
round(conv(Reduce(function(mn,n) X %*% step(mn), 1:40, X %*% d))[1:21],4)
#> 0 1 2 3 4 5 6 7 8 9
#> 0.0173 0.0578 0.1060 0.1413 0.1531 0.1429 0.1191 0.0907 0.0643 0.0428
#>
#> 10 11 12 13 14 15 16 17 18 19
#> 0.0271 0.0164 0.0096 0.0054 0.0030 0.0016 0.0008 0.0004 0.0002 0.0001
Calculation with Simulation
This can also be calculated in reasonable time with reasonable precision using simple simulation.
We simulate a roll of n 6-sided dice with sample(1:6,n,replace=TRUE), calculate the number to re-roll, and iterate until none are available, counting "hits" along the way.
sim <- function(n) {
  k <- 0
  while (n > 0) {
    roll <- sample(1:6, n, replace = TRUE)
    n <- sum(roll >= 5)
    k <- k + n
  }
  return(k)
}
Now we can simply replicate a large number of trials and tabulate
prop.table(table(replicate(100000,sim(10))))
#> 0 1 2 3 4 5 6 7 8 9
#> 0.0170 0.0588 0.1053 0.1431 0.1518 0.1433 0.1187 0.0909 0.0657 0.0421
#>
#> 10 11 12 13 14 15 16 17 18 19
#> 0.0252 0.0161 0.0102 0.0056 0.0030 0.0015 0.0008 0.0004 0.0002 0.0001
This is quite feasible even with 30 dice (a few seconds even with 100,000 replications).

Efficient Calculation Using Probability Distributions
The approaches in the question and in my other answer use sums over transitions of dependent binomial distributions. The dependency, which arises from carrying previous successes (hits) over into subsequent trials (rolls), complicates the calculations.
An alternative approach is to view each die separately: roll a single die for as long as it keeps turning up a hit. Each die is independent of the others, so the random variables may be summed efficiently through convolution. Moreover, the distribution for each die is geometric, and the sum of independent geometric random variables follows a negative binomial distribution.
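As a quick sanity check of the per-die claim (under the model used in these answers, a die stays live with probability 1/3 after each roll), the per-die hit count is geometric, which is exactly a negative binomial with size 1:
round(dgeom(0:5, prob = 2/3), 4)
round(dnbinom(0:5, size = 1, prob = 2/3), 4)
# both give 0.6667 0.2222 0.0741 0.0247 0.0082 0.0027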
R provides the negative binomial distribution, so the results obtained in my other answer may be had all at once by
round(dnbinom(0:19,10,prob=2/3),4)
[1] 0.0173 0.0578 0.1060 0.1413 0.1531 0.1429 0.1191 0.0907 0.0643 0.0428
[11] 0.0271 0.0164 0.0096 0.0054 0.0030 0.0016 0.0008 0.0004 0.0002 0.0001
The probability matrix in the question, with MAX_DICE=MAX_THRESHOLD=10, has its first column equal to
1-dnbinom(0,1:10,prob=2/3)
So, you might be looking for the cumulative distribution function. I have not been able to figure out your intentions with the subsequent columns, but perhaps the goal was
outer(1:10,0:10,function(size,x) 1-dnbinom(x,size,prob=2/3))
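If the intent of the matrix is P(at least t hits with n dice), then under this model the cumulative form would be (my guess at the intent, not something the question confirms):
outer(1:10, 1:10, function(size, t) 1 - pnbinom(t - 1, size, prob = 2/3))
whose first column reproduces the 1-dnbinom(0,1:10,prob=2/3) values above.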

Related

How to understand the result of Discrete Fourier Transform under period finding?

I am learning how to use the Discrete Fourier Transform (DFT) to find the period of a^x mod N, where x is a positive integer, a is any prime number, and N is the product of two prime factors p and q.
For example, the period of 2^x mod 15 is 4:
>>> for x in range(8):
... print(2**x % 15)
...
Output: 1 2 4 8 1 2 4 8
^-- the next period
and the result of the DFT is as follows:
[figure: DFT of 2^x mod 15, cited from O'Reilly Programming Quantum Computers, chapter 12]
There are 4 spikes with 4-unit spacing, and I think the latter 4 means that the period is 4.
But when N is 35 and the period is 12:
>>> for x in range(16):
... print(2**x % 35)
...
Output: 1 2 4 8 16 32 29 23 11 22 9 18 1 2 4 8
^-- the next period
In this case, there are 8 spikes greater than 100, whose locations are 0, 5, 6, 11, 32, 53, 58, 59, respectively.
Does the location sequence imply the magic number 12? And how should I understand the "12 evenly spaced spikes" in the right-hand graph?
[figure: DFT of 2^x mod 35, cited from O'Reilly Programming Quantum Computers, chapter 12]
See How to compute Discrete Fourier Transform? and all the sublinks, especially How do I obtain the frequencies of each value in an FFT?.
As you can see, the i-th element of the DFT result (counting from 0 to n-1 inclusive) represents the frequency
f(i) = i * fsampling / n
The DFT uses only those sinusoidal frequencies, so if your signal contains a different one (even a slightly different frequency or shape), aliasing occurs.
An aliased sinusoid creates 2 frequencies in the DFT output, one higher and one lower.
Any sharp edge translates into many frequencies (usually a continuous spectrum, like your last example).
f(0) is not a frequency at all; it represents the DC offset.
On top of all this, if the input of your DFT is real-valued, then the DFT result is symmetric, meaning you can use only the first half of the result; the second half (not including f(0)) is just a mirror image. This makes sense, as you cannot represent a frequency bigger than fsampling/2 in real-valued data.
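A concrete illustration in R (R is what the rest of this page uses; the choice of n = 64, an exact whole number of periods, is mine): when the sample length is an exact multiple of the period T, the DFT of 2^x mod 15 is nonzero only at multiples of n/T.
n <- 64                           # an exact multiple of the period T = 4
x <- numeric(n)
x[1] <- 1                         # 2^0 mod 15
for (i in 2:n) x[i] <- (2 * x[i - 1]) %% 15
spec <- Mod(fft(x))               # magnitude spectrum
which(spec > 1e-8) - 1            # 0-based indices of the nonzero bins
# 0 16 32 48  -> 4 spikes spaced n/T = 16 apart
So it is the spacing of the spikes, n/T, that encodes the period; here the number of spikes also happens to equal T = 4, consistent with the 4 spikes described above.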
Conclusion:
You cannot read the signal's frequency directly off the DFT, as there is an infinite number of ways such a signal could have been composed. The DFT reconstructs the signal using sine waves, and your signal is definitely not a sine wave, so the results will not match what you expect.
Matching the DFT's frequencies to yours is done by choosing n correctly for the DFT; however, without knowing the frequency ahead of time you cannot do this ...
It may be possible to compute a single sine wave's frequency from its 2 aliases; however, your signal is not a sine wave, so that is not applicable to your case anyway.
I would use different approaches to determine the frequency of an integer numeric signal:
compute a histogram of the signal
(that is, count how many times each value occurs)
test possible frequencies
You could brute-force all possible periods of the signal and test whether consecutive periods are the same; however, for big data this is not optimal...
We can use the histogram to speed this up. If you look at the counts cnt(ix) from the histogram of a periodic signal with frequency f and period T in data of size n, then the frequency should be a common divisor of all the counts:
T = n/f
k*f = GCD(all non zero cnt[i])
where k divides the GCD result. However, if n is not an exact multiple of T, or the signal has noise or slight deviations in it, this will not work exactly. We can still at least estimate the GCD and test all frequencies around it, which will still be faster than brute force.
So each count (not accounting for noise) should satisfy:
cnt(ix) = ~ n/(f*k)
k = { 1,2,3,4,...,n/f}
so:
f = ~ n/(cnt(ix)*k)
So if you have a signal like this:
1,1,1,2,2,2,2,3,3,1,1,1,2,2,2,2,3,3,1
then the histogram would be cnt[]={0,7,8,4,0,0,0,0,...} and n=19, so computing f (in periods per n samples) for each used value leads to:
f(ix) = n/(cnt(ix)*k)
f(1) = 19/(7*k) = ~ 2.714/k
f(2) = 19/(8*k) = ~ 2.375/k
f(3) = 19/(4*k) = ~ 4.750/k
Now the real frequency should be a common divisor (CD) of these estimates, so taking the biggest and smallest of them, rounded up and down (ignoring noise), leads to these options:
f = CD(2,4) = 2
f = CD(3,4) = none
f = CD(2,5) = none
f = CD(3,5) = none
So now test the frequency (luckily just one candidate is valid in this case): 2 periods per 19 samples means T = ~9.5, so test it rounded both down and up ...
signal(t+ 0)=1,1,1,2,2,2,2,3,3,1,1,1,2,2,2,2,3,3,1
signal(t+ 9)=1,1,1,2,2,2,2,3,3,1 // check 9 elements
signal(t+10)=1,1,2,2,2,2,3,3,1,? // check 10 elements
As you can see, signal(t ... t+9) == signal(t+9 ... t+18), meaning the period is T=9.
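A small R sketch of that last check (the helper test_period is mine, not part of the original answer):
sig <- c(1,1,1,2,2,2,2,3,3,1,1,1,2,2,2,2,3,3,1)
test_period <- function(s, per) {
  m <- length(s) - per
  all(head(s, m) == tail(s, m))   # s[i] == s[i + per] for every valid i
}
sapply(8:10, function(per) test_period(sig, per))
# FALSE TRUE FALSE  -> only per = 9 matches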

How to generate n random numbers from negative binomial distribution?

I am trying to write a function that generates n random numbers from a negative binomial distribution.
To do this, I first made a function to generate n random numbers from a geometric distribution. My function for generating n random numbers from a geometric distribution is as follows:
rGE <- function(n, p) {
  I <- rep(NA, n)
  for (j in 1:n) {
    x <- rBer(1, p)
    i <- 1 # number of trials
    while (x == 0) {
      x <- rBer(1, p)
      i <- i + 1
    }
    I[j] <- i
  }
  return(I)
}
I tested this function (rGE); for example, rGE(10,0.5) generates 10 random numbers from a geometric distribution with success probability 0.5. One random result was:
[1] 2 4 2 1 1 3 4 2 3 3
In the rGE function I used a function named rBer, which is:
rBer <- function(n, p) {
  sample(0:1, n, replace = TRUE, prob = c(1 - p, p))
}
Now, I want to improve my above function (rGE) in order to make a function for generating n random numbers from a negative binomial distribution. I made the following function:
rNB <- function(n, r, p) {
  I <- seq(n)
  for (j in 1:n) {
    x <- 0
    x <- rBer(1, p)
    i <- 1 # number of trials
    while (x == 0 & I[j] != r) {
      x <- rBer(1, p)
      i <- i + 1
    }
    I[j] <- i
  }
  return(I)
}
I tested rNB(3,2,0.1), which should generate 3 random numbers from a negative binomial distribution with parameters r=2 and p=0.1, several times:
> rNB(3,2,0.1)
[1] 2 1 7
> rNB(3,2,0.1)
[1] 3 1 4
> rNB(3,2,0.1)
[1] 3 1 2
> rNB(3,2,0.1)
[1] 3 1 3
> rNB(3,2,0.1)
[1] 46 1 13
As you can see, I think my function (rNB) does not work correctly, because it always generates 1 for the second random number.
Could anyone help me correct my function (rNB) so that it generates n random numbers from a negative binomial distribution with parameters n, r, and p, where r is the number of successes and p is the probability of success?
[[Hint: Explanations regarding geometric distribution and negative binomial distribution:
Geometric distribution: In probability theory and statistics, the geometric distribution is either of two discrete probability distributions:
The probability distribution of the number X of Bernoulli trials needed to get one success, supported on the set { 1, 2, 3, ... }.
The probability distribution of the number Y = X − 1 of failures before the first success, supported on the set { 0, 1, 2, 3, ... }
Negative binomial distribution: A negative binomial experiment is a statistical experiment that has the following properties:
The experiment consists of x repeated trials.
Each trial can result in just two possible outcomes. We call one of these outcomes a success and the other, a failure.
The probability of success, denoted by P, is the same on every trial.
The trials are independent; that is, the outcome on one trial does not affect the outcome on other trials.
The experiment continues until r successes are observed, where r is specified in advance.
]]
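(For reference, R's built-in generators follow the second convention: rgeom(n, p) returns the number of failures before the first success and rnbinom(n, size, p) the number of failures before the size-th success, so their values start at 0 rather than 1. For example:)
min(rgeom(1000, 0.5))   # almost certainly 0, unlike rGE above, whose values start at 1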
Your function will be much faster if you use R's native vectorization. The way you can do this is to generate all your Bernoulli trials at once.
Note that for a negative binomial distribution that counts the failures before the r-th success, with success probability p, the expected number of failures is r * (1 - p) / p, so the expected number of Bernoulli trials needed for r successes is r / p (Reference).
If we want to draw n negative binomial samples, the expected total number of Bernoulli trials is therefore n * r / p, so we want to draw at least that many Bernoulli samples. For simplicity, we can start by drawing twice that number: 2 * n * r / p. In the unlikely case that this is not enough, we draw twice as many again, repeatedly, until we have enough; once the sum of the resulting vector of Bernoulli trials is at least n * r, we know we have enough successes to build our n negative binomial draws.
We can now run a cumsum on the vector of Bernoulli trials to keep track of the number of successes so far. Integer-dividing (%/%) the number of successes seen strictly before each trial by r then labels every Bernoulli trial with the negative binomial draw it belongs to. You then table this vector.
The first n entries of the table (obtained by subsetting the table with [1:n], or equivalently with [seq(n)]) are your negative binomial draws. We remove the table's names with as.numeric, and we subtract the number of successes (i.e. r) from each count, since we are counting only the failures, not the successes.
rNB <- function(n, r, p) {
  mult <- 2
  all_samples <- 0
  # keep drawing Bernoulli trials until we have at least n * r successes
  # (the expected number of trials needed is n * r / p)
  while (sum(all_samples) < n * r)
  {
    all_samples <- rBer(ceiling(mult * n * r / p), p)
    mult <- mult * 2
  }
  # label each trial by the number of successes strictly before it, integer-divided
  # by r: this assigns every Bernoulli trial to the negative binomial draw it belongs to
  draw_id <- (cumsum(all_samples) - all_samples) %/% r
  # each draw's count includes its r successes, so subtract r to keep only the failures
  as.numeric(table(draw_id))[seq(n)] - r
}
So we can do:
rNB(3, 2, 0.1)
#> [1] 14 19 41
rNB(3, 2, 0.1)
#> [1] 23 6 56
rNB(3, 2, 0.1)
#> [1] 11 31 59
rNB(3, 2, 0.1)
#> [1] 7 21 14
mean(rNB(10000, 2, 0.1))
#> [1] 18.0002
We can test this against R's own rnbinom:
mean(rnbinom(10000, 2, 0.1))
#> [1] 18.0919
hist(rnbinom(10000, 2, 0.5), breaks = 0:20)
hist(rNB(10000, 2, 0.5), breaks = 0:20)
Note that the logic of your own version isn't quite right. In particular, the line while(x == 0 & I[j] != r) doesn't make sense: I is just the vector 1:n, so in your example, whenever j is 2, I[j] equals r and the loop never runs. This is why your second number is always 1. I don't know what you were trying to do here.
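To see this concretely (a two-line illustration, not part of the original answer):
I <- seq(3)   # I is just c(1, 2, 3)
I[2] != 2     # FALSE, so when j = 2 and r = 2 the while condition fails immediately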
If you want to do it one Bernoulli trial at a time, as you are doing in your own version, try this modified function. The variable names should hopefully make it easy to follow the logic:
rNB <- function(n, r, p) {
  # Create an empty vector of length n for our results
  draws <- numeric(n)
  # Now for each of the n trials we will get a negative binomial sample:
  for (i in 1:n) {
    # Create success and failure counters for this draw
    failures <- successes <- 0
    # Now run Bernoulli trials, counting successes and failures as we go
    # until we hit r successes
    while (successes < r)
    {
      if (rBer(1, p) == 1)
        successes <- successes + 1
      else
        failures <- failures + 1
    }
    # Once we have reached r successes, the current number of failures is our
    # negative binomial draw
    draws[i] <- failures
  }
  return(draws)
}
This gives identical results to the faster, albeit more opaque, vectorized version.

Count the Number of 6s Rolled on a Number of Dice in R

I am trying to develop code that will tell me the likelihood of rolling at least one six when using 1 through 20 dice. I am specifically trying to build a single piece of code that loops through the problem space and generates this information. The question has left me at a loss.
I have tried using the sample function and looked at contingency tables.
die1 = sample(1:6,n,replace=T)
die2 = sample(1:6,n,replace=T)
sum_of_dice = die1 + die2
counts = table(sum_of_dice)
proba_empiric = counts/sum(counts)
barplot(proba_empiric)
The above provides the basis for a probability, but not for the joint probability of two dice.
The final code should be able to tell me the likelihood of rolling a six on 1 die, 2 dice, 3 dice, all the way up to twenty dice.
One way to simulate the probability of rolling at least one 6 using 1 to 20 dice is to use rbinom():
sapply(1:20, function(x) mean(rbinom(10000, x, 1/6) > 0))
[1] 0.1675 0.3008 0.4174 0.5176 0.5982 0.6700 0.7157 0.7704 0.8001 0.8345 0.8643 0.8916 0.9094 0.9220 0.9310
[16] 0.9471 0.9547 0.9623 0.9697 0.9718
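For comparison, the exact probability of at least one six with x dice is 1 - (5/6)^x, so the simulated values can be checked directly:
round(1 - (5/6)^(1:20), 4)
# e.g. for 20 dice this gives 0.9739, close to the simulated 0.9718 above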
If I am understanding you correctly, you have 20 dice and you want to know the probability of at least one six appearing among them.
We can write a function to roll one die
roll_die <- function() sample(6, 1)
Then write another function which rolls 20 dice and checks whether there is at least one six among them
roll_20_die <- function() {
  any(replicate(20, roll_die()) == 6)
}
and replicate this function a sufficient number of times to estimate the probability
n <- 10000
table(replicate(n, roll_20_die()))/n
# FALSE TRUE
#0.0244 0.9756

Calculating accuracy for already existing forecast

I have the original forecast data from a company (12 observations), and next to that I have the 12 REAL observations. I just want to calculate the accuracy of the company's current method against the real data and tell them what the MSE, MAPE, MAD, MAE, etc. is. So I don't have to calculate the forecast myself, just use these 2 datasets. I can't get my head around how to use the accuracy() function in this case. I can convert the forecast dataset to a time series object, but I still keep getting errors.
Does anyone know how to help me out?
> Forecast_data
1 8237
2 13438
3 10026
4 9651
5 11043
6 8500
7 10126
8 11560
9 11175
10 9103
11 14456
12 10308
> Real data
1 16507
2 14637
3 15210
4 17818
5 17606
6 13396
7 11603
8 11094
9 14087
10 14304
11 17887
12 14116
Look at the first "date" (1) for the moment. The actual/observed value is a[1]=16507, and the forecast/estimate is f[1]=8237.
So the error/deviation is e[1] = f[1] - a[1] = 8237 - 16507 = -8270 (you underestimate), and in percentage terms p[1] = e[1]/a[1] = -8270/16507 = -0.501 = -50.1% (you underestimate by 50%).
Do this for all the dates and you'll get a column of error in value e[i] and in percentage p[i].
The MSE (Mean Squared Error) is the average of e[i]^2.
The MAD (Mean Absolute Deviation) is the average of abs(e[i]).
The MAE (Mean Absolute Error) is the average of abs(e[i]) (the same thing here).
The MAPE (Mean Absolute Percent Error) is the average of abs(p[i]).
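A minimal sketch in R using the two series from the question (the variable names are mine); the forecast package's accuracy() should report most of these measures directly if you prefer:
forecast_vals <- c(8237, 13438, 10026, 9651, 11043, 8500, 10126, 11560, 11175, 9103, 14456, 10308)
actual_vals   <- c(16507, 14637, 15210, 17818, 17606, 13396, 11603, 11094, 14087, 14304, 17887, 14116)
e <- forecast_vals - actual_vals   # errors
p <- e / actual_vals               # percentage errors
c(MSE  = mean(e^2),
  MAD  = mean(abs(e)),             # same as MAE here
  MAE  = mean(abs(e)),
  MAPE = mean(abs(p)) * 100)
# library(forecast); accuracy(forecast_vals, actual_vals) should give RMSE, MAE, MAPE, etc.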

Autocorrelation function of binary time series

I have a binary (1 or 0) time series of an event and I want to calculate its ACF. The problem is that I need to split the TS into clusters according to their duration and calculate the ACF of each subset.
Let me show you an example:
TS : (1,1,1,0,0,1,1,0,0,0,1)
I'd like to have an ACF that is a sum of :
ACF of cluster 1 : (1,1,1,0,0,0,0,0,0,0,0)
ACF of cluster 2 : (1,1,0,0,0,0,0,0,0,0,0)
ACF of cluster 3 : (1,0,0,0,0,0,0,0,0,0,0)
and then average these 3 vectors to get the result I need. The number of clusters is arbitrary, and the duration of the time series varies between roughly 1k and 10k observations.
It's not clear to me at all what you're trying to do.
In agreement with @OttoKässi, I don't understand the logic behind the subsets. Why three? Why those three? What is the (mathematical) rationale for constructing those subsets?
More fundamentally, averaging correlation coefficients makes little sense to me. In autocorrelation, you calculate Pearson's product-moment correlation coefficients of the vector with different lagged versions of that same vector. Then you want to do that for three different (orthogonal) vectors, and average the coefficients? Why? That makes no statistical sense to me.
That aside, to calculate the autocorrelation for the three vectors you can do the following:
# Your sample vectors
v <- list(
v1 = c(1,1,1,0,0,0,0,0,0,0,0),
v2 = c(1,1,0,0,0,0,0,0,0,0,0),
v3 = c(1,0,0,0,0,0,0,0,0,0,0));
# Calculate acf for lag = 0 ... 10 and store as columns in dataframe
# The rows correspond to lag = 0 ... 10
acf <- as.data.frame(lapply(v, function(x) as.numeric(acf(x, plot = FALSE)$acf)));
acf;
# v1 v2 v3
#1 1.00000000 1.00000000 1.000000000
#2 0.63257576 0.47979798 -0.009090909
#3 0.26515152 -0.04040404 -0.018181818
#4 -0.10227273 -0.06060606 -0.027272727
#5 -0.13636364 -0.08080808 -0.036363636
#6 -0.17045455 -0.10101010 -0.045454545
#7 -0.20454545 -0.12121212 -0.054545455
#8 -0.23863636 -0.14141414 -0.063636364
#9 -0.27272727 -0.16161616 -0.072727273
#10 -0.18181818 -0.18181818 -0.081818182
#11 -0.09090909 -0.09090909 -0.090909091
If you now insist, you could calculate average correlation coefficients for different lags by taking the row averages. Mind you, I don't see how this makes statistical sense though.
rowMeans(acf);
#[1] 1.00000000 0.36776094 0.06885522 -0.06338384 -0.08451178 -0.10563973
#[7] -0.12676768 -0.14789562 -0.16902357 -0.14848485 -0.09090909
