Converting stochastic probabilities into target numbers in R

Given the following stochastic transition matrix, how can I convert these probabilities to random number generator target numbers?
transmat <- structure(c(0.77, 0.561, 0.14, 0.187, 0.07, 0.063, 0, 0.063,
                        0.01, 0.125, 0.01, 0.001),
                      dim = c(2L, 6L),
                      dimnames = list(c("0", "1"), c("0", "1", "2", "5", "8", "9")))
The aim is to output target numbers for a transition to occur in a simulation environment. This can be done manually, as in the example below, which gives approximate dice-roll ranges for the sum of 3 fair six-sided dice for the first row of the transition matrix (see the probability distribution here); however, an algorithm to do this is needed.
In the provided example, state 0 will remain as state 0 with a roll of 3-12, but will transition to state 1 on a roll of 13-14, state 2 with a roll of 15-16, state 8 on a roll of 17, and state 9 on a roll of 18.
  0      1       2       5   8    9
0 "3-12" "13-14" "15-16" "-" "17" "18"
Conceptually I am unsure how to proceed. One thought was to (1) index the probabilities in the row along with the range of random number generator results, and (2) somehow compare the cumulative probability and the remaining probabilities with the probabilities of the possible random numbers.

My understanding of your problem is that you'd like to generate simulations from the probabilities in your matrix transmat.
Apart from the literal way of performing the dice-roll simulation you describe (i.e. generating random dice rolls using floor(runif(n = 1, min = 1, max = 7)) and picking the next state with some kind of if statement), you could use sample to generate realizations from the probabilities in transmat directly:
sample(x = as.numeric(colnames(transmat)),
       size = 10^4,
       replace = TRUE,
       prob = transmat[1, ])
The key here is using the probabilities from transmat as the prob argument of sample. A count of the simulated states using table:
   0    1    2    8    9
7681 1413  713   98   95
shows this is a simulation of the transition probabilities in the first row of your transition matrix.

Could you use different colored dice (or just roll one die at a time)? That would give more states, each with the same probability. For example, there are 6^3 = 216 equally likely states with three dice if they are ordered by color (or by throw order). For the first row of transmat, the mapping could be:
c(
  "1,1,1 - 5,4,4" = 0.768518519,
  "5,4,5 - 6,3,5" = 0.143518519,
  "6,3,6 - 6,6,2" = 0.069444444,
  " - "           = 0,
  "6,6,3 - 6,6,4" = 0.009259259,
  "6,6,5 - 6,6,6" = 0.009259259
)
#> 1,1,1 - 5,4,4 5,4,5 - 6,3,5 6,3,6 - 6,6,2 - 6,6,3 - 6,6,4 6,6,5 - 6,6,6
#> 0.768518519 0.143518519 0.069444444 0.000000000 0.009259259 0.009259259
Each additional die/roll would increase the precision by a factor of 6.
If that's an option, I have some ideas about automating it; one possibility is sketched below.
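Here is a minimal sketch of one way to automate that mapping (my own construction, not code from the original answer): take the cumulative probabilities of a row, scale them to the 216 equally likely ordered outcomes, and convert the outcome indices back into dice triples.

# map one row of transmat onto ranges of the 216 ordered three-dice outcomes
p <- transmat[1, ]
upper <- round(cumsum(p) * 216)     # cumulative outcome counts out of 216
lower <- c(1, head(upper, -1) + 1)  # each range starts after the previous one ends
# convert a 1-based outcome index into an ordered dice triple "d1,d2,d3"
as_dice <- function(i) {
  i <- i - 1
  paste(i %/% 36 + 1, (i %/% 6) %% 6 + 1, i %% 6 + 1, sep = ",")
}
ranges <- ifelse(lower > upper, "-",  # zero-probability transitions get "-"
                 paste(sapply(lower, as_dice), sapply(upper, as_dice), sep = " - "))
setNames(ranges, colnames(transmat))

For the first row this reproduces the ranges listed above ("1,1,1 - 5,4,4" through "6,6,5 - 6,6,6"), with rounding error of at most half an outcome (1/432) at each boundary.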

Related

How to generate a series of numbers using sample in R, with a different probability in each try?

For example, I have a vector of probabilities:
myprob <- c(0.58, 0.51, 0.48, 0.46, 0.62)
and I want to sample a series of 1s and 0s, where each draw uses the probabilities c(1 - myprob, myprob):
for the first number in the series, the function samples 1 and 0 with probabilities (0.42, 0.58), for the second with (0.49, 0.51), and so on.
How can I generate the 5 numbers with sample?
The syntax
Y <- sample(c(1, 0), 1, replace = FALSE, prob = c(1 - myprob, myprob))
has the wrong number of probabilities and outputs only 1 number if I specify prob,
while the syntax
Y <- sample(c(1, 0), 5, replace = FALSE, prob = c(1 - myprob, myprob))
seems to concentrate the probabilities on only 0.62 (or not; I am not sure, but the results do not seem correct at all).
Thanks for any reply in advance!
If myprob is the probability of drawing 1 for each iteration, then you can use rbinom, with n = 5 and size = 1 (5 iterations of a 1-0 draw).
set.seed(2)
rbinom(n = 5, size = 1, prob = myprob)
[1] 1 0 1 0 0
Maël already proposed a great solution, sampling from a binomial distribution. There are probably many more alternatives; I just wanted to suggest two of them:
runif()
as.integer(runif(5) > myprob)
This first generates a series of 5 uniformly distributed random numbers between 0 and 1, then compares that vector against myprob and converts the logical values TRUE/FALSE to 1/0. Note that this follows the question's convention of drawing 1 with probability 1 - myprob[i]; flip the comparison to < if, as in the rbinom answer, myprob should be the probability of drawing 1.
vapply(sample())
vapply(myprob, function(p) sample(1:0, 1, prob = c(1 - p, p)), integer(1))
This is what you may have been looking for in the first place. It executes the sample() command once per value of myprob (as p), drawing 1 and 0 with probabilities c(1 - p, p) as in your question, and returns the 5 draws as a vector.
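As a quick sanity check (my addition, not part of the original answer), repeating the vapply draw many times should give per-position frequencies close to the intended probabilities:

set.seed(1)
# 5 x 10000 matrix: one column per repetition of the 5 draws
draws <- replicate(1e4, vapply(myprob, function(p) sample(1:0, 1, prob = c(1 - p, p)), integer(1)))
rowMeans(draws)  # 1 is drawn with probability 1 - myprob[i], so this should approach 0.42, 0.49, 0.52, 0.54, 0.38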

Simulation in R for multistage sampling

I am trying to simulate a multi-stage sampling scheme as follows:
I have a population of size N = 90.
I need to choose 20 samples out of N = 90 (without replacement) with equal probability.
Then I need to choose 17 samples out of those 20 (so my previous sample becomes the population, i.e. N = 20), without replacement and with equal probability.
Then I need to take the 3 (20 - 17) remaining samples and choose 2 samples out of those 3 (here N = 3), without replacement and with equal probability.
I need to replicate the whole thing 1000 times and calculate the mean for step 2, step 3, and step 4.
replicate(1000, sample(1:90, 20, replace = FALSE, prob = NULL))
I am trying to use this, but I am not getting results.
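A minimal sketch of this scheme (my own addition; it assumes the quantities of interest are simply the sampled labels 1:90) could be:

# one full pass through the three sampling stages
one_run <- function() {
  stage2 <- sample(1:90, 20)                    # 20 out of 90, without replacement
  stage3 <- sample(stage2, 17)                  # 17 out of those 20
  stage4 <- sample(setdiff(stage2, stage3), 2)  # 2 out of the 3 left over
  c(mean(stage2), mean(stage3), mean(stage4))
}
set.seed(1)
res <- replicate(1000, one_run())  # 3 x 1000 matrix: one column per replication
rowMeans(res)                      # average of each stage's sample mean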

Conditional probability in R

The question:
A screening test for a disease that affects 0.05% of the male population is able to identify the disease in 90% of the cases where an individual actually has it. The test, however, generates 1% false positives (it gives a positive reading when the individual does not have the disease). Find the probability that a man has the disease given that he has tested positive. Then find the probability that a man has the disease given that he has a negative test.
My wrong attempt:
I first started by letting:
• T be the event that a man has a positive test
• Tc be the event that a man has a negative test
• D be the event that a man has actually the disease
• Dc be the event that a man does not have the disease
Therefore we need to find P(D|T) and P(D|Tc).
Then I wrote this code:
set.seed(110)
sims = 1000
D = rep(0, sims)
Dc = rep(0, sims)
T = rep(0, sims)
Tc = rep(0, sims)
# run the loop
for(i in 1:sims){
  # flip to see if we have the disease
  flip = runif(1)
  # if we got the disease, mark it
  if(flip <= .0005){
    D[i] = 1
  }
  # if we have the disease, we need to flip for T and Tc
  if(D[i] == 1){
    # flip for T
    flip1 = runif(1)
    # see if we got T
    if(flip1 < 1/9){
      T[i] = 1
    }
    # flip for Tc
    flip2 = runif(1)
    # see if we got Tc
    if(flip2 < 1/10){
      Tc[i] = 1
    }
  }
}
# P(D|T)
mean(D[T == 1])
# P(D|Tc)
mean(D[Tc == 1])
# P(D|T)
mean(D[T == 1])
# P(D|Tc)
mean(D[Tc == 1])
I'm really struggling so any help would be appreciated!
Perhaps the best way to think through a conditional probability question like this is with a concrete example.
Say we tested one million individuals in the population. Then 500 individuals (0.05% of one million) would be expected to have the disease, of whom 450 would be expected to test positive and 50 to test negative (since the false negative rate is 10%).
Conversely, 999,500 would be expected to not have the disease (one million minus the 500 who do have the disease), but since 1% of them would test positive, then we would expect 9,995 people (1% of 999,500) with false positive results.
So, given a positive test result taken at random, it either belongs to one of the 450 people with the disease who tested positive, or one of the 9,995 people without the disease who tested positive - we don't know which.
This is the situation in the first question, since we have a positive test result but don't know whether it is a true positive or a false positive. The probability of our subject having the disease given their positive test is the probability that they are one of the 450 true positives out of the 10,445 people with positive results (9995 false positives + 450 true positives). This boils down to the simple calculation 450/10,445 or 0.043, which is 4.3%.
Similarly, a negative test taken at random belongs either to one of the 989,505 (999,500 - 9,995) people without the disease who tested negative, or to one of the 50 people with the disease who tested negative, so the probability of having the disease given a negative test is 50/989,555 (the 989,505 true negatives plus the 50 false negatives), or roughly 0.005%.
I think this question demonstrates the importance of taking disease prevalence into account when interpreting test results, and has very little to do with programming or R; it requires only a calculator (at most).
If you really wanted to run a simulation in R, you could do:
set.seed(1) # This makes the sample reproducible
sample_size <- 1000000 # This can be changed to get a larger or smaller sample
# Create a large sample of 1 million "people", using a 1 to denote disease and
# a 0 to denote no disease, with probabilities of 0.0005 (which is 0.05%) and
# 0.9995 (which is 99.95%) respectively.
disease <- sample(x = c(0, 1),
                  size = sample_size,
                  replace = TRUE,
                  prob = c(0.9995, 0.0005))
# Create an empty vector to hold the test results for each person
test <- numeric(sample_size)
# Simulate the test results of people with the disease, using a 1 to denote
# a positive test and 0 to denote a negative test. This uses a probability of
# 0.9 (which is 90%) of having a positive test and 0.1 (which is 10%) of having
# a negative test. We draw as many samples as we have people with the disease
# and put them into the "test" vector at the locations corresponding to the
# people with the disease.
test[disease == 1] <- sample(x = c(0, 1),
                             size = sum(disease),
                             replace = TRUE,
                             prob = c(0.1, 0.9))
# Now we do the same for people without the disease, simulating their test
# results, with a 1% probability of a positive test.
test[disease == 0] <- sample(x = c(0, 1),
                             size = sample_size - sum(disease),
                             replace = TRUE,
                             prob = c(0.99, 0.01))
Now that we have run our simulation, we can count the true positives, false positives, true negatives, and false negatives by creating a contingency table:
contingency_table <- table(disease, test)
contingency_table
#>        test
#> disease      0      1
#>       0 989566   9976
#>       1     38    420
and get the approximate probability of having the disease given a positive test like this:
contingency_table[2, 2] / sum(contingency_table[,2])
#> [1] 0.04040015
and the probability of having the disease given a negative test like this:
contingency_table[2, 1] / sum(contingency_table[,1])
#> [1] 3.83992e-05
You'll notice that the probability estimates from sampling are not that accurate, because some of the sampling probabilities are very small. You could simulate a larger sample, but it might take a while for your computer to run it.
Created on 2021-08-19 by the reprex package (v2.0.0)
To expand on Allan's answer, but relating it back to Bayes' Theorem, if you prefer:
From the question, you know (converting percentages to probabilities):
P(D) = 0.0005, P(T|D) = 0.9, and P(T|Dc) = 0.01, and hence P(Dc) = 0.9995 and P(Tc|D) = 0.1.
Plugging in:
P(D|T) = P(T|D)P(D) / (P(T|D)P(D) + P(T|Dc)P(Dc)) = 0.00045 / (0.00045 + 0.009995) ≈ 0.0431
P(D|Tc) = P(Tc|D)P(D) / (P(Tc|D)P(D) + P(Tc|Dc)P(Dc)) = 0.00005 / (0.00005 + 0.989505) ≈ 0.0000505
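The same calculation in R (my addition, a direct transcription of the formulas above):

p_D    <- 0.0005  # prevalence, P(D)
p_T_D  <- 0.9     # sensitivity, P(T|D)
p_T_Dc <- 0.01    # false positive rate, P(T|Dc)
p_T <- p_T_D * p_D + p_T_Dc * (1 - p_D)  # P(T), total probability of a positive
p_T_D * p_D / p_T                        # P(D|T)  ~ 0.0431
(1 - p_T_D) * p_D / (1 - p_T)            # P(D|Tc) ~ 0.0000505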

How to understand the result of the Discrete Fourier Transform in period finding?

I am learning how to use the Discrete Fourier Transform (DFT) to find the period of a^x mod N, where x is a positive integer, a is any prime number, and N is the product of two prime factors p and q.
For example, the period of 2^x mod 15 is 4:
>>> for x in range(8):
...     print(2**x % 15)
...
Output: 1 2 4 8 1 2 4 8
                ^-- the next period
and the result of the DFT is as follows:
[DFT magnitude plot cited from O'Reilly, Programming Quantum Computers, chapter 12]
There are 4 spikes with 4-unit spacing, and I think the 4 means that the period is 4.
But when N is 35 and the period is 12:
>>> for x in range(16):
...     print(2**x % 35)
...
Output: 1 2 4 8 16 32 29 23 11 22 9 18 1 2 4 8
                                        ^-- the next period
In this case, there are 8 spikes greater than 100, whose locations are 0, 5, 6, 11, 32, 53, 58, and 59.
Does the location sequence imply the magic number 12? And how should I understand "12 evenly spaced spikes" in the right-hand graph?
[DFT magnitude plot cited from O'Reilly, Programming Quantum Computers, chapter 12]
See How to compute Discrete Fourier Transform? and all the sublinks, especially How do I obtain the frequencies of each value in an FFT?.
As you can see, the i-th element of the DFT result (counting from 0 to n-1 inclusive) represents the frequency
f(i) = i * fsampling / n
and the DFT uses only those sinusoidal frequencies. So if your signal contains a different one (even a slightly different frequency or shape), aliasing occurs.
An aliased sinusoid creates 2 frequencies in the DFT output, one at a higher and one at a lower frequency.
Any sharp edge translates into many frequencies (usually a continuous spectrum, like your last example).
f(0) is not a frequency; it represents the DC offset.
On top of all this, if the input of your DFT is real-valued, then the DFT result is symmetric, meaning you can use only the first half of the result, as the second half is just a mirror image (not including f(0)). This makes sense, as you cannot represent a frequency greater than fsampling/2 in real-valued data.
Conclusion:
You cannot directly obtain the frequency of the signal from the DFT, as there are an infinite number of ways such a signal could have been produced. The DFT reconstructs the signal using sine waves, and your signal is definitely not a sine wave, so the results will not match what you expect.
Matching the DFT bin frequencies to yours is done by correctly choosing n for the DFT; however, without knowing the frequency ahead of time, you cannot do this...
It may be possible to compute a single sine wave's frequency from its 2 aliases; however, your signal is not a sine wave, so that is not applicable in your case anyway.
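To make the clean case concrete, here is a small illustration (my addition, written in R since the rest of this thread uses it; none of this code is from the answer itself). When the period T exactly divides the signal length n, the DFT magnitude is nonzero only at multiples of n/T, i.e. there are T evenly spaced spikes:

n <- 16
x <- 2^(0:(n - 1)) %% 15   # the mod-15 signal from the question, period T = 4
round(Mod(fft(x)), 3)      # nonzero magnitude only at (0-based) bins 0, 4, 8, 12

That gives T = 4 spikes spaced n/T = 4 apart, matching the first plot. With period 12 and a 64-point transform (which your spike locations 0..59 suggest), 12 does not divide 64, so the energy smears into the aliased pattern of the second plot.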
I would use a different approach to determine the frequency of an integer numeric signal:
compute a histogram of the signal
so count how many occurrences of each number there are
test possible frequencies
You can brute-force all possible periods of the signal and test whether consecutive periods are the same; however, for big data this is not optimal...
We can use the histogram to speed this up. If you look at the counts cnt(ix) from the histogram for a periodic signal of frequency f and period T in data of size n, then the frequency should be a common divisor of all the counts:
T = n/f
k*f = GCD(all nonzero cnt[ix])
where k is some integer factor of the GCD result. However, if n is not an exact multiple of T, or the signal has noise or slight deviations in it, this will not hold exactly. We can still estimate the GCD and test all frequencies around the estimate, which is still faster than brute force.
So for each count (not accounting for noise) the following should hold:
cnt(ix) ≈ n/(f*k), with k in {1, 2, 3, 4, ..., n/f}
so:
f ≈ n/(cnt(ix)*k)
So if you have a signal like this:
1,1,1,2,2,2,2,3,3,1,1,1,2,2,2,2,3,3,1
then the histogram would be cnt[] = {0,7,8,4,0,0,0,0,...} and n = 19, so computing f (in periods per n samples) for each value present leads to:
f(ix) = n/(cnt(ix)*k)
f(1) = 19/(7*k) ≈ 2.714/k
f(2) = 19/(8*k) ≈ 2.375/k
f(3) = 19/(4*k) ≈ 4.750/k
Now the real frequency should be a common divisor (CD) of these results, so taking the estimates for the biggest and smallest counts, rounded up and down (ignoring noise), leads to these options:
f = CD(2,4) = 2
f = CD(3,4) = none
f = CD(2,5) = none
f = CD(3,5) = none
So now test the frequency (luckily just one candidate is valid in this case): 2 periods per 19 samples means T ≈ 9.5, so test it rounded both down and up...
signal(t+ 0) = 1,1,1,2,2,2,2,3,3,1,1,1,2,2,2,2,3,3,1
signal(t+ 9) = 1,1,1,2,2,2,2,3,3,1   // check 9 elements
signal(t+10) = 1,1,2,2,2,2,3,3,1,?   // check 10 elements
As you can see, signal(t..t+9) == signal(t+9..t+18), meaning the period is T = 9.
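Here is a rough R transcription of this recipe (my own sketch; the answer above gives only pseudocode), using histogram counts for candidate frequencies and a direct shift comparison to confirm the period:

signal <- c(1,1,1,2,2,2,2,3,3,1,1,1,2,2,2,2,3,3,1)
n <- length(signal)
cnt <- table(signal)          # histogram: value 1 -> 7, 2 -> 8, 3 -> 4
n / as.numeric(cnt)           # candidate frequency estimates for k = 1
# confirm a candidate period p: samples one period apart must agree
test_period <- function(p) all(signal[1:(n - p)] == signal[(p + 1):n])
which(sapply(1:(n - 1), test_period))   # smallest hit is the period, here 9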

Confusion Between 'sample' and 'rbinom' in R

Why are these not equivalent?
#First generate 10 numbers between 0 and .5
set.seed(1)
x <- runif(10, 0, .5)
These are the two statements I'm confused by:
#First
sample(rep(c(0, 1), length(x)), size = 10, prob = c(rbind(1 - x, x)), replace = FALSE)
#Second
rbinom(length(x), size = 1, prob=x)
I was originally trying to use 'sample'. What I thought I was doing was generating ten (0,1) pairs, then assigning the probability that each would return either a 0 or a 1.
The second one works and gives me the output I need (trying to run a sim). So I've been able to solve my problem. I'm just curious as to what's going on under the hood with 'sample' so that I can understand R better.
The first area of difference is where in the parameter list the length of the output vector is specified, because the name size has different meanings in these two functions. (I hadn't thought about that source of confusion before, and I'm sure I have made this error myself many times.)
The random number generators (starting with r and having a distribution suffix) have that choice as the first parameter, whereas sample has it as the second. In your first statement, size = 10 sets the output length, while in rbinom it is n that does so and size = 1 means each draw is a single Bernoulli trial. In sample, the draw is from the values in the first argument, while size is the length of the vector to create. In rbinom, n is the length of the vector to create, while size is the number of items to hypothetically draw from a theoretical urn whose distribution is determined by prob; the value returned is the number of "ones". Try:
rbinom(length(x), size = 10, prob=x)
Regarding the argument to prob: I don't think you need the c().
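If the goal is 10 independent Bernoulli draws with the probabilities in x, a sample()-based equivalent (a sketch of my own, not from either answer) is to draw each value separately:

# one independent 0/1 draw per element of x, with P(1) = x[i]
vapply(x, function(p) sample(c(0, 1), 1, prob = c(1 - p, p)), numeric(1))

This matches rbinom(length(x), size = 1, prob = x) in distribution, though not draw-for-draw under the same seed, since the two functions consume random numbers differently.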
The difference between the two functions is quite simple.
Think of a shuffled pack of cards from which you choose a number of cards: that is exactly the situation sample simulates.
This code,
> set.seed(123)
> sample(1:40, 5)
[1] 12 31 16 33 34
randomly extracts five numbers from the vector of numbers 1:40.
In your example, you set size = 1. That means you choose only one element from the pool of possible values. If you set size = 10, you will get ten values, as you desire.
set.seed(1)
x <- runif(10, 0, .5)
> sample(rep(c(0,1), length(x)), size = 10, prob = c(rbind(1-x,x)), replace = F)
[1] 0 0 0 0 0 0 0 1 0 1
Instead, the goal of the rbinom function is to simulate events whose results are "discrete", such as the flip of a coin. Its parameter is the probability of success on a trial; for a fair coin flip that probability is 0.5, and here we simulate 100 flips. If you think the coin could be weighted to favor one specific outcome, we can simulate that behaviour by setting the probability to 0.8, as in the example below.
> set.seed(123)
> table(rbinom(100, 1, prob = 0.5))
0 1
53 47
> table(rbinom(100, 1, prob = 0.8))
0 1
19 81
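To tie the two functions together (my addition, not part of the original answer): a single biased coin flip can be written either way, and over many draws the frequencies agree.

set.seed(123)
mean(rbinom(1e4, size = 1, prob = 0.8))                       # close to 0.8
mean(replicate(1e4, sample(c(0, 1), 1, prob = c(0.2, 0.8))))  # close to 0.8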
