Decimal precision problems with runif - r

I'm running into issues when simulating low-probability events with runif in R, and wondering how to solve this.
Consider the following example for an experiment where we simulate values of TRUE with probability 5e-10 in a sample of size 1e9, and check whether any of these samples got that value of TRUE. This experiment is repeated 10 times:
set.seed(123)
probability <- 0.0000000005
n_samples <- 1000000000
n_tries <- 10
for (i in 1:n_tries) {
  print(any(runif(n=n_samples, min=0, max=1) < probability))
}
The code above runs relatively fast, and nearly half of the experiment replicates return TRUE, as expected.
However, as soon as the probability becomes 5e-11 (probability <- 0.00000000005), that expectation fails and no TRUE values are returned, even if the number of replicates is increased (I used n_tries <- 100 twice with no luck; the whole process took 1h to run).
This suggests that runif is not returning values with as much precision as 11 decimals. This was unexpected, as R, to my understanding, works with as many as 16 decimals of precision, and we might need to simulate processes with probabilities that small (around 15 decimals).
Is this why runif fails to provide the expected output? Are there any alternatives or solutions to this problem?
Thank you
EDIT: I ran a test to check whether this problem could be related to boundary bias (a reduced probability density near the extreme values 0 or 1). To do so, a constant (e.g. k <- 0.5) is added to the result of runif, and the sum is compared against the value of probability plus that same constant. However, that does not seem to fix the issue.
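One possible explanation, offered as an assumption rather than a confirmed diagnosis: R's help page on its random number generators notes that most of the supplied uniform generators return 32-bit integer values that are converted to doubles, so runif can take at most about 2^32 distinct values, spaced roughly 2.3e-10 apart. A threshold of 5e-11 would then sit below the smallest value runif can return. A minimal sketch of one workaround is to skip the per-sample uniforms and draw the number of TRUE values in each replicate directly from a binomial distribution. The names hits, any_hit and p_any are introduced here for illustration only.

```r
set.seed(123)
probability <- 5e-11
n_samples <- 1e9
n_tries <- 10

# Number of TRUE values among n_samples Bernoulli(probability) draws,
# one count per replicate, without generating the individual samples.
hits <- rbinom(n_tries, size = n_samples, prob = probability)
any_hit <- hits > 0
print(any_hit)

# Theoretical chance that a replicate contains at least one TRUE,
# computed with log1p/expm1 to avoid rounding error for tiny probabilities.
p_any <- -expm1(n_samples * log1p(-probability))
print(p_any)
```

This only replaces the "did anything happen" check. If the individual event positions are needed, they would have to be sampled separately once the count is known.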

Related

Simulation to find random sequences

With R I can try to find the probability that the Age vector below resulted from random sampling. I used the runs test (from the randtests package), which resulted in p-value = 0.2892. Other colleagues used the rle function (run length encoding in R) or others to simulate the probability that random allocation generated the observed sequences. Their result shows p < 0.00000001 that this sequence is the result of random sampling. I am trying to find the R code to replicate their findings. Any help on how to set up a simulation that replicates their findings is highly appreciated.
Update: I received advice from a statistician that I can do this using a non-parametric bootstrap. However, I still do not know how this can be done. I appreciate your help.
example:
Age <-c(68,71,72,69,80,78,80,81,84,82,67,73,65,68,66,70,69,72,74,73,68,75,70,72,75,73,69,75,74,79,80,78,80,81,79,82,69,73,67,66,70,72,69,72,75,80,68,69,71,77,70,73) ;
randtests::runs.test(Age);
X <- rle(Age);X$lengths
What was initially presented isn't the whole story. If one looks at the supplement where these numbers are from, the reported p-value is for comparing two vectors. OP only provides one, and hence the task is not well-defined.
The full assertion of the research article is that
group1 <- c(68,71,72,69,80,78,80,81,84,82,67,73,65,68,66,70,69,72,74,73,68,75,70,72,75,73)
group2 <- c(69,75,74,79,80,78,80,81,79,82,69,73,67,66,70,72,69,72,75,80,68,69,71,77,70,73)
being two independent random samples has a p-value < 0.00000001.
Even checking identity along position (10 entries in original) with permutations within a group, I'm seeing only 2 or 3 draws per million that have a similar number of identical values. I.e., something like:
set.seed(123)
mean(replicate(1e6, sum(sample(group1, length(group1)) == group2)) >= 10)
# 2e-06
Testing correlations and/or bootstrapping could easily be in the p-value range that is reported (nothing as extreme in 100 million simulations).
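The permutation check above hard-codes the observed count of 10. A slightly more general sketch computes the observed statistic from the data and reports a permutation p-value. The names observed, perm_counts and p_est are illustrative, the number of permutations is kept small so it runs quickly, and the +1 correction is one common convention rather than anything taken from the original article.

```r
group1 <- c(68,71,72,69,80,78,80,81,84,82,67,73,65,68,66,70,69,72,74,73,68,75,70,72,75,73)
group2 <- c(69,75,74,79,80,78,80,81,79,82,69,73,67,66,70,72,69,72,75,80,68,69,71,77,70,73)

# Observed statistic: number of positions where both groups hold the same value.
observed <- sum(group1 == group2)

set.seed(123)
n_perm <- 1e4  # increase (e.g. to 1e6) for a finer estimate
# Shuffle group1 and recount the positional matches against group2.
perm_counts <- replicate(n_perm, sum(sample(group1) == group2))

# Permutation p-value with the +1 correction so the estimate is never exactly 0.
p_est <- (sum(perm_counts >= observed) + 1) / (n_perm + 1)
print(c(observed = observed, p_est = p_est))
```

With only 1e4 permutations the estimate cannot resolve p-values much below 1e-4, so the number of permutations has to grow with the extremeness of the claim being checked.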

How to get only positive values in a monte-carlo simulation?

Using the code below we get 100,000 random values normally distributed around the mean, and the values can be positive or negative. I am dealing with a problem where negative simulation results make no sense. How can I generate a normal distribution with only positive values? Or is there any other appropriate way to handle this?
runs <- 100000
sims <- rnorm(runs,mean=30,sd=30)
If you want to get rid of the negatives you can do this after your code. This will give you a kind of truncated normal distribution if that's what you're after.
sims <- sims[sims>0]
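Note that filtering shortens the vector, so you end up with fewer than runs values. If you need exactly runs positive draws, a minimal sketch of the inverse-CDF approach to sampling a normal truncated at zero looks like this. The names mu, sigma, p0 and u are introduced here for illustration.

```r
set.seed(1)
runs <- 100000
mu <- 30
sigma <- 30

# Probability mass of the untruncated normal that lies at or below zero.
p0 <- pnorm(0, mean = mu, sd = sigma)

# Draw uniforms on (p0, 1) and map them back through the normal quantile
# function, so every resulting value is strictly positive.
u <- runif(runs, min = p0, max = 1)
sims <- qnorm(u, mean = mu, sd = sigma)

length(sims)  # exactly `runs` values
min(sims)
```

Whether a truncated normal is appropriate at all depends on the quantity being modelled. A distribution that is positive by construction (for example log-normal or gamma) may be a better fit, but that is a modelling decision rather than a coding one.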

Estimate the chance n rolls of m fair six-sided dice

Similar to the De Méré problem
I want to generate a Monte Carlo simulation to estimate the probability of rolling at least one from n rolls of m fair six-sided dice.
My code:
m<-5000
n<-3
x<-replicate(m, sample(1:6,n,TRUE)==1)
p<-sum(x)/m
p is the probability estimated. Here I get the value 0.4822.
My questions:
1) Is there any other way without using sum to do it?
2) I suspect the code is wrong, as the probability may be too high.
Although the question as stated is a little unclear, the code suggests you want to estimate the chance of obtaining at least one outcome of "1" among n independent dice and that you aim to estimate this by simulating the experiment m times.
Program simulations from the inside out. Begin with a single iteration. You started well, but to be perfectly clear let's redo it using a highly suggestive syntax. Try this:
1 %in% sample(1:6,n,TRUE)
This uses sample to realize the results of n independent fair dice and checks whether the outcome 1 appears among any of them.
Once you are satisfied that this emulates your experiment (run it a bunch of times), then indeed replicate will perform the simulation:
x <- replicate(m, 1 %in% sample(1:6,n,TRUE))
That produces m results. Each will be TRUE (interpreted as equal to 1) in all iterations where 1 appeared and otherwise will be FALSE (interpreted as 0). Consequently, the average number of times 1 appeared can be obtained as
mean(x)
This empirical frequency is a good estimate of the theoretical probability.
As a check, note that 1 will not appear on a single die with a probability of 1-1/6 = 5/6 and therefore--because the n dice are independent--will not appear on any of them with a probability of (5/6)^n. Consequently the chance a 1 will appear must be 1 - (5/6)^n. Let us output those two values: the simulation mean and theoretical result. We might also include a Z score, which is a measure of how far away from the theoretical result the mean is. Typically, Z scores between -2 and 2 aren't significant evidence of any discrepancy.
Here's the full code. Although there are faster ways to write it, this is very fast already and is about as clear as one could make it.
m <- 5000 # Number of simulation iterations
n <- 3 # Number of dice per iteration
set.seed(17) # For reproducible results
x <- replicate(m, 1 %in% sample(1:6,n,TRUE))
# Compare to a theoretical result.
theory <- 1-(5/6)^n
avg <- mean(x)
Z <- (avg - theory) / sd(x) * sqrt(length(x))
c(Mean=signif(avg, 5), Theoretical=signif(theory, 5), Z.score=signif(Z, 3))
The output is
Mean Theoretical Z.score
0.4132 0.4213 -1.1600
Notice that neither result is anywhere near n/6, which would be 1/2 = 0.500.
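For reference, here is a sketch of one of the faster, vectorized variants: draw all m*n rolls at once, arrange them as one row per iteration, and check each row for a 1. The name rolls is introduced for illustration. Because the random numbers are consumed in a different pattern, the simulated mean need not match the output above digit for digit.

```r
m <- 5000     # Number of simulation iterations
n <- 3        # Number of dice per iteration
set.seed(17)  # For reproducible results

# One row per iteration, one column per die.
rolls <- matrix(sample(1:6, m * n, replace = TRUE), nrow = m, ncol = n)

# TRUE for each iteration in which at least one die shows a 1.
x <- rowSums(rolls == 1) > 0

theory <- 1 - (5/6)^n
c(Mean = mean(x), Theoretical = theory)
```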

Preventing a Gillespie SSA Stochastic Model From Running Negative

I have produced a stochastic model of infection (parasitic worm) using a Gillespie SSA. The model uses the "GillespieSSA" package (https://cran.r-project.org/web/packages/GillespieSSA/index.html).
In short, the code models a population of discrete compartments. Movement between compartments depends on user-defined rate equations. The SSA algorithm calculates the number of events produced by each rate equation for a given timestep (tau) and updates the population accordingly; the process repeats up to a given time point. The problem is that the number of events is assumed to be Poisson distributed (Poisson(rate[i]*tau)), which produces an error when the rate is negative, including when population numbers become negative.
library(GillespieSSA)
# Parameter Values
sir.parms <- c(deltaHinfinity=0.00299, CHi=0.00586, deltaH0=0.0854, aH=0.5,
muH=0.02, SigmaW=0.1, SigmaM =0.8, SigmaL=104, phi=1.15, f = 0.6674,
deltaVo=0.0166, CVo=0.0205, alphaVo=0.5968, beta=52, mbeta=7300 ,muV=52, g=0.0096, N=100)
# Initial Population Values
sir.x0 <- c(W=20,M=10,L=0.02)
# Rate Equations
sir.a <- c("((deltaH0+deltaHinfinity*CHi*mbeta*L)/(1+CHi*mbeta*L))*mbeta*L*N"
,"SigmaW*W*N", "muH*W*N", "((1/2)*phi*f)*W*N", "SigmaM*M*N", "muH*M*N",
"(deltaVo/(1+CVo*M))*beta*M*N", "SigmaL*L*N", "muV*L*N", "alphaVo*M*L*N", "(aH/g)*L*N")
# Population change for each event
sir.nu <- matrix(c(+0.01,0,0,
-0.01,0,0,
-0.01,0,0,
0,+0.01,0,
0,-0.01,0,
0,-0.01,0,
0,0,+0.01/230,
0,0,-0.01/230,
0,0,-0.01/230,
0,0,-0.01/230,
0,0,-0.01/32),nrow=3,ncol=11,byrow=FALSE)
runs <- 10
set.seed(1)
# Data Frame of output
sir.out <- data.frame(time=numeric(),W=numeric(),M=numeric(),L=numeric())
# Multiple runs and combining data and SSA methods
for(i in 1:runs){
  sim <- ssa(sir.x0,sir.a,sir.nu,sir.parms, method="ETL", tau=1/12, tf=140, simName="SIR")
  sim.out <- data.frame(time=sim$data[,1],W=sim$data[,2],M=sim$data[,3],L=sim$data[,4])
  sim.out$run <- i
  sir.out <- rbind(sir.out,sim.out)
}
Thus, rates are computed and the model updates the population values for each time step, with the data stored in a data frame and then appended to the previous runs. However, when population levels get very low, the number of events that reduce a population can exceed the number in the compartment. One method is to make the time step very small; however, this greatly increases the run time of the simulation.
My question: is there a way to augment the code so that, as the data is calculated at each time step, any negative population values are converted to 0?
I have tried working on this problem, but only seem to be able to come up with methods that alter the values once the simulation is complete, with the negative values still causing issues in the runs themselves.
E.g.
sir.out$L[sir.out$L < 0] <- 0
Any help would be appreciated
I believe the problem is the method you set ("ETL") in the ssa function. The ETL method will eventually produce negative numbers. You can try the "OTL" method, based on "Efficient step size selection for the tau-leaping simulation method", which has a few more parameters that you can tweak, but the basic command is:
ssa(sir.x0,sir.a,sir.nu,sir.parms, method="OTL", tf=140, simName="SIR")
Or the direct method, which will not produce negative numbers at all:
ssa(sir.x0,sir.a,sir.nu,sir.parms, method="D", tf=140, simName="SIR")
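If clamping at zero during the run is really what you want, one option is to write the leap loop by hand. The sketch below is plain base R and does not use the GillespieSSA API. The function tau_leap_clamped, its arguments, and the toy birth-death model are all hypothetical and only illustrate the idea of truncating at zero inside each step.

```r
# Hypothetical helper: explicit tau-leaping with states clamped at zero.
# x0: named initial state, rates_fun: function(state) -> propensities,
# nu: state-change matrix (species x reactions), tau: step size.
tau_leap_clamped <- function(x0, rates_fun, nu, tau, n_steps) {
  out <- matrix(NA_real_, nrow = n_steps + 1, ncol = length(x0) + 1,
                dimnames = list(NULL, c("time", names(x0))))
  x <- x0
  out[1, ] <- c(0, x)
  for (s in seq_len(n_steps)) {
    a <- pmax(rates_fun(x), 0)             # propensities must be non-negative
    k <- rpois(length(a), a * tau)         # events per reaction in this step
    x <- pmax(x + as.vector(nu %*% k), 0)  # apply the changes, then clamp at zero
    out[s + 1, ] <- c(s * tau, x)
  }
  out
}

# Toy birth-death model (not the model from the question).
x0 <- c(W = 20)
nu <- matrix(c(1, -1), nrow = 1)           # one species, two reactions
rates <- function(x) c(5, 0.5 * x[["W"]])  # constant birth, per-capita death

set.seed(1)
n_steps <- 120
res <- tau_leap_clamped(x0, rates, nu, tau = 1/12, n_steps = n_steps)
head(res)
```

Be aware that clamping discards the events that would have pushed a compartment below zero, so it biases the process. It is a pragmatic patch, not a substitute for a method that controls the step size.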

SVM in R, value of cost doesn't affect test error rate

I am currently using SVMs in R (e1071) with linear kernels to attempt to classify a high dimensional data set. It consists of around 300 patients with around 12000 gene activity levels measured for each patient. My goal is to predict patient response (binary: treatment effective or not) to a certain drug based upon these gene activities.
I want to establish the range of cost values to pass to the tune.svm function and this is where I am running into trouble. My understanding is that the way to do this is to try progressively smaller and larger values until lower and upper bounds for reasonable performance are respectively established; nevertheless, when I attempt to do this, no matter how large or small I make my possible costs, my resulting test error rate is never worse than about 50%. This is happening both with my actual data set and with this toy version. If this subset is too small I can provide a more significant chunk of it. Thanks for any advice.
My code:
library(e1071)
dat.ex <- read.table("svm_ex.txt", header=TRUE, row.names=1)
trainingSize <- 20
possibleCosts <- c(10^-50, 10^-25, 10^25, 10^50)
trainingDat <- sample(1:dim(dat.ex)[1], replace = FALSE, size = trainingSize)
ex.results <- vector()
for(i in 1:length(possibleCosts))
{
  svm.ex <- svm(dat.ex[trainingDat, -1], factor(dat.ex[trainingDat, 1]), kernel="linear", cost=possibleCosts[i], type="C-classification")
  test.ex <- predict(svm.ex, newdata=data.frame(x = dat.ex[-trainingDat,-1]))
  truth.ex <- table(pred = test.ex, truth = factor(dat.ex[-trainingDat,1]))
  exTestCorrectRate <- (truth.ex[1,1] + truth.ex[2,2])/(dim(dat.ex)[1] - trainingSize)
  ex.results[i] <- exTestCorrectRate
}
print(ex.results)
First, you are trying extreme values of C. You should check a much smaller range of values (say between 1e-15 and 1e10) at a much greater resolution (for example, 25 different values across the interval I suggested).
Second, you have a very small dataset. 20 training vectors with 10 dimensions may be hard to model.
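As a rough sketch of the first point, a grid of 25 cost values spaced evenly on a log scale over the suggested range can be built like this. The name cost_grid is illustrative, and passing it on to the tuning function is left as a comment because it depends on your data.

```r
# 25 cost values, evenly spaced in log10 between 1e-15 and 1e10.
cost_grid <- 10^seq(-15, 10, length.out = 25)
print(signif(cost_grid, 3))

# Assumption: the grid would then be supplied as the cost range when tuning,
# e.g. through the cost argument of e1071's tune.svm, together with your own
# predictors and response.
```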
I discovered the problem. In the full data set approximately 2/3 of the responses are 1 and 1/3 are 0. For these extreme parameters, every response was predicted to be 1 and thus test error rates in the range of 50% - 80% (with some fluctuations occurring due to training data selection) kept occurring.
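A quick way to spot this kind of issue is to compare the classifier against a majority-class baseline before tuning anything. The sketch below uses made-up labels with roughly the 2/3 to 1/3 split described above. The names y, baseline_class and baseline_acc are illustrative, and the simulated labels are not the real data.

```r
set.seed(42)
# Simulated binary response with about 2/3 ones (stand-in for the real labels).
y <- factor(rbinom(300, size = 1, prob = 2/3), levels = c(0, 1))

# Class balance and the accuracy of always predicting the most common class.
class_share <- prop.table(table(y))
baseline_class <- names(which.max(class_share))
baseline_acc <- max(class_share)

print(class_share)
cat("Always predicting", baseline_class, "gives accuracy", round(baseline_acc, 3), "\n")
```

If the SVM's test accuracy sits at about this baseline for every cost value, the model is most likely predicting a single class, and the confusion table will show one empty row.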
