rbinom is producing a higher than expected number of successes

I was attempting to recreate an example using the rbinom function, but the number of "successes" is a lot higher than I would expect given the low probabilities.
numSamples <- 10000 #number of samples to be drawn from population
numTrials <- 100 #this is the sample size (size of each sample)
probs <- seq(0.001, 0.9999, 0.01)
for (i in 1:length(probs)) {
x <- rbinom(n = numSamples, size = numTrials, prob = probs[i])
}
Everything seems straightforward, except that I am getting the number of successes to be between 97 and 100 for all of the samples. When I do a few test cases manually using the smaller probabilities such as 0.001, I get the expected number of successes: 0. So there is an issue in how my for loop is handling things. What is going wrong?

Your loop overwrites x on every iteration, so when it finishes, x only holds the draws for the last probability (about 0.99); with size = 100 that gives counts between 97 and 100. Here's one way to keep all the samples:
res <- list()
for (i in 1:length(probs)) {
  res <- c(res,
           list(rbinom(n = numSamples, size = numTrials, prob = probs[i])))
}
You can then unlist(res) to get one long vector, or do.call(rbind, res) to collapse to a matrix. Growing a list this way and then collapsing it doesn't incur the same performance penalty as growing a vector.
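For example (a small sketch; all_draws and draw_mat are just illustrative names):
all_draws <- unlist(res)          # one long vector, numSamples * length(probs) values
draw_mat <- do.call(rbind, res)   # one row per probability, numSamples columns
rowMeans(draw_mat) / numTrials    # empirical success proportions, roughly tracking probs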

Related

How can I make my for loop in R run faster? Can I vectorize this?

#Start: Initialize values
#For each block length (BlockLengths) I will run 10 estimates (ThetaL). For each estimate, I simulate 50000 observations (Obs). Each estimate is calculated on the basis of the block length.
Index=0 #Initializing Index.
ThetaL=10 #Number of estimations of Theta.
Obs=50000 #Sample size.
Grp=vector(length=7) #Initializing a vector of the number of blocks. It depends on the block lengths (see BlockLengths below)
Theta=matrix(data=0,nrow=ThetaL,ncol=7) #Initializing a matrix of the estimates of Thetas. There are 10 for each block length.
BlockLengths<-c(10,25,50,100,125,200,250) #Setting the block lengths
for (r in BlockLengths){
  Index = Index + 1
  Grp[Index] = Obs/r
  for (k in 1:ThetaL){
    #Start: Constructing the sample
    Y1 <- matrix(data=0, nrow=Obs, ncol=2)
    Y1[1,] <- runif(2,0,1)
    Y1[1,1] <- -log(-(Y1[1,1])^2 + 1)
    Y1[1,2] <- -log(-(Y1[1,2])^2 + 1)
    for (i in 2:Obs){
      Y1[i,1] <- Y1[i-1,2]
      Y1[i,2] <- runif(1,0,1)
      Y1[i,2] <- -log(-(Y1[i,2])^2 + 1)
    }
    X1 <- vector(length=Obs)
    for (i in 1:Obs){
      X1[i] <- max(Y1[i,])
    }
    #End: Constructing the sample
    K = 0 #K counts the number of blocks with at least one exceedance
    for (t in 1:Grp[Index]){ #Loop from 1 to the number of groups
      a = 0
      for (j in (1+r*(t-1)):(t*r)){ #Loop over the sample within each group
        if (X1[j] > quantile(X1,0.99)){ #If a value exceeds the high threshold, add 1 to a
          a = a + 1
        }
      }
      if (a >= 1){ #If a is at least 1, the group has had an exceedance
        K = K + 1 #Count the number of blocks with at least one exceedance
      }
    }
    N <- sum(X1 >= quantile(X1,0.99)) #Total number of exceedances
    Theta[k,Index] <- (1/r) * ((log(1-K/Grp[Index])) / (log(1-N/Obs))) #Estimate
    #Theta[k,Index] <- K/N
  }
}
I have been running the above code without errors and it takes about 20 minutes, but I want to run the code for larger samples and more repetitions, which makes the run time absurdly long. I tried to keep only the necessary parts inside the loops to optimize it a little. Is it possible to optimize it further, or should I use another programming language, since I've read that R is slow with for loops? Will vectorization help? If so, how can I vectorize the code?
First, you can define BlockLengths before Grp and Theta, as both of them depend on its length:
Index = 0
ThetaL = 2
Obs = 10000
BlockLengths = c(10,25)
Grp = vector(length = length(BlockLengths))
Theta = matrix(data = 0, nrow = ThetaL, ncol = length(BlockLengths))
Note: I decreased the size of the problem so that I could run it faster. With this specification, your original loop took 24.5 seconds.
Now, for the operation itself, there were three points where I could improve things:
Creation of Y1: the second column can be generated all at once, just by creating Obs random numbers with runif(). Then, the first column can be created as a lag of the second column. With only this alteration, the loop ran in 21.5 seconds (a 12% improvement).
Creation of X1: you can replace the row-wise loop by applying the max function over rows with apply. This alteration saved a further 1.5 seconds (a 6% improvement).
Calculation of K: you can, for each t, take all the values of X1[(1+r*(t-1)):(t*r)] and run the condition on all of them at once (instead of using the second loop). The any(...) does the same job as your a>=1 check. Furthermore, you can remove the first loop by using lapply, then sum the resulting logical vector, yielding the same result as your combination of if(a>=1) and K=K+1. The pipes (|>) are just for better visualization of the order of operations. This is by far the most important alteration, saving a further 18.4 seconds (a 75% improvement).
for (r in BlockLengths){
  Index = Index + 1
  Grp[Index] = Obs/r
  for (k in 1:ThetaL){
    Y1 <- matrix(data = 0, nrow = Obs, ncol = 2)
    Y1[,2] <- -log(-(runif(Obs))^2 + 1)
    Y1[,1] <- c(-log(-(runif(1))^2 + 1), Y1[-Obs,2])
    X1 <- apply(Y1, 1, max)
    K <- lapply(1:Grp[Index], function(t){any(X1[(1+r*(t-1)):(t*r)] > quantile(X1,0.99))}) |> unlist() |> sum()
    N <- sum(X1 >= quantile(X1, 0.99))
    Theta[k,Index] <- (1/r) * ((log(1-K/Grp[Index])) / (log(1-N/Obs)))
  }
}
Using set.seed() I got the same results as your original loop.
A possible way to improve further is to substitute the r and k loops with purrr::map, as sketched below.
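For example, a rough sketch of that idea (the estimate_theta helper is mine, purely for illustration; it also computes the 0.99 quantile once per simulated sample instead of recomputing it for every block, and uses purrr's map_lgl/map_dbl):
library(purrr)
estimate_theta <- function(r, Obs) {
  # construct one sample, as in the loop above
  Y1 <- matrix(0, nrow = Obs, ncol = 2)
  Y1[, 2] <- -log(-(runif(Obs))^2 + 1)
  Y1[, 1] <- c(-log(-(runif(1))^2 + 1), Y1[-Obs, 2])
  X1 <- apply(Y1, 1, max)
  thr <- quantile(X1, 0.99)   # threshold computed once per sample
  grp <- Obs / r
  K <- sum(map_lgl(1:grp, function(t) any(X1[(1 + r * (t - 1)):(t * r)] > thr)))
  N <- sum(X1 >= thr)
  (1 / r) * (log(1 - K / grp) / log(1 - N / Obs))
}
Theta <- map(BlockLengths, function(r) map_dbl(1:ThetaL, function(k) estimate_theta(r, Obs))) |>
  do.call(what = cbind)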

How to use for loops to take multiple samples?

The setup is: "Suppose that you have a population of 10,000 and 6,000 people are Democrats. We simulate survey sampling from this hypothetical population. First, generate a vector of the population in R. Then, create 500 random samples of size n = 1,000."
My code so far is:
pop<-c(rep("Democrat", 6000), rep("Republican", 4000))
nTrials <- 500
n <- 1000
results <- rep(NA, nTrials)
for(i in 1:nTrials)
{
sampled <- sample(x=pop, size=n, replace=FALSE)
results[i] <- sampled
}
Then I tried a numeric version of the population so I could sum up the number of Democrats in each sample:
pop <- c(rep(1, 6000), rep(0, 4000))
nTrials <- 500
n <- 1000
results <- matrix(data=NA, ncol = nTrials, nrow = n)
Y<-matrix(data=NA, ncol=nTrials, nrow=1)
for(i in 1:nTrials)
{
sampled <- sample(x=pop, size=n, replace=TRUE)
results[,i]<- sampled
Y[,i]<- sum(results[,i])
}
I think this code works, but I'm worried about how to tell if the matrix is filling correctly.
You can easily view objects you've saved using the View() function. Link to View function in R.
We can also put a line into our R code that halts execution until a keystroke. Stack Exchange thread covering this.
Putting the two together, we can put two lines into a loop: one which shows us the current version of the final output, and another which pauses the loop until we continue. This will let us explore the behaviour of the loop step by step. Using one of your loops as an example:
for(i in 1:nTrials)
{
sampled <- sample(x=pop, size=n, replace=TRUE)
results[,i]<- sampled
Y[,i]<- sum(results[,i])
View(Y)
readline(prompt="Press [enter] to continue")
}
Keep in mind this will keep going for the specified number of trials.
You could limit the number of trials, but then you cannot be sure of getting the same result, so instead we could put a break into the code. Link to info about the break statement. This lets us interrupt a for loop early, assuming we are happy with how things are building up. To make the break really shine, let's pair it with some user input, so you can choose whether you'd like to continue or not. Link for collecting user input in R.
So, combining all of this, we get something like:
for(i in 1:nTrials)
{
sampled <- sample(x=pop, size=n, replace=TRUE)
results[,i]<- sampled
Y[,i]<- sum(results[,i])
View(Y)
interrupt = readline(prompt="Enter 1 for next loop, 0 to exit: ")
if (interrupt == 0) {break}
}
For what it's worth, your code looks perfectly fine to me so far.
Try this
replicate(500L, sample(c("Democrat", "Republican"), 1000L, replace = TRUE, prob = c(0.6, 0.4)), simplify = FALSE)
Or this
pop <- c(rep("Democrat", 6000L), rep("Republican", 4000L))
replicate(500L, sample(pop, 1000L), simplify = FALSE)
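If you then want the count of Democrats in each sample (the role of Y in your matrix version), a small follow-up sketch (samples and dem_counts are just illustrative names):
samples <- replicate(500L, sample(pop, 1000L), simplify = FALSE)
dem_counts <- vapply(samples, function(s) sum(s == "Democrat"), integer(1))
head(dem_counts)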

for loop in R — generating and calculating different results

Given rnorm(100), I need to create a for loop in R that calculates the mean and the standard deviation, generating 100 different numbers each time, and stores those results in a vector.
How can I accomplish that?
So far I have:
a <- rnorm(100) # generate 100 random numbers with normal distribution
sample <- 100 # number of samples
results <- rep(NA, 100) # vector creation for storing the results
for (i in 1:sample){
results[i] <- as.data.frame() # then I'm stuck here, lol, the most important part
}
This function will return the mean, standard deviation and sample size. The only argument required is the sample size (n).
results <- function(n) {
  x <- rnorm(n)   # draw one sample and reuse it for both statistics
  data.frame(Mean = mean(x),
             Stdev = sd(x),
             n = n)
}
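For example, to collect 100 repetitions into a single object (one possible usage; sims is just an illustrative name, using lapply instead of an explicit for loop):
sims <- do.call(rbind, lapply(1:100, function(i) results(100)))
head(sims)   # one row per repetition, with the mean and sd of that sample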

Filling a matrix with for-loop output

I want to fill a matrix with data simulated by using a for loop containing the rbinom function. This loop executes the rbinom function 100 times, thus generating a different outcome every run. However, I can't find a way to get these outcomes into a matrix for further analysis. When assigning the for loop to an object, this object appears empty in the environment and can thus not be used in the matrix ('data' must be of a vector type, was 'NULL').
When not including the rbinom function in a for loop, it can be assigned to an object and I'm able to use the output in the matrix. Every column, however, contains the exact same sequence of numbers. When only running the for loop containing the rbinom function, I do get different sequences, as it runs the rbinom function 100 times instead of once. I just don't know how to integrate the loop into the matrix.
The two pieces of code I have:
n = 100
size = 7
loop_vill <- for (i in 1:100) {
print(rbinom(n=n, size=size, prob=0.75)) #working for-loop
}
vill <- rbinom(n=n, size=size, prob=0.75)
sim_data_vill <- matrix(data=vill, nrow=length(loop_vill), ncol=100)
#creates a matrix in which all columns are exact copies; should be solved when able to use the outputs of loop_vill
sim_data_vill
When calling sim_data_vill, it (logically) contains a matrix of 100 rows and 100 columns, with all columns being the same. However, I would like to see a matrix with all columns being different (thus containing the output of a new run of the rbinom-function each time).
Hello, as far as I can see you are having a few problems:
You are currently not running the for loop for each column (only the one vector is saved in vill).
You are not looping over the rbinom call.
Now there are a few ways to achieve what you want (scroll to the last example for the efficient way).
method 1: For loop
Using your idea, we can use a for loop. The best approach is to pre-allocate an empty matrix first and fill it in with the for loop:
nsim <- 100 #how many rbinom vectors (columns) we simulate
n <- 100000
size = 7
prob = 0.75
sim_data_vill_for_loop <- matrix(ncol = nsim, nrow = n)
for(i in seq(nsim)) #iterate from 1 to nsim
sim_data_vill_for_loop[, i] <- rbinom(n, size = size, prob = prob) #fill in 1 column at a time
Now this will work, but is a bit slow, and requires a whopping 3 lines of code for the simulation part!
method 2: apply
We can remove the for loop and the pre-assigned matrix by using one of the myriad apply-like functions. One such function is replicate. This reduces the massive 3 lines of code to:
sim_data_vill_apply <- replicate(nsim, rbinom(n, size, prob))
Wow, that was short. But can we do even better? Actually, running functions such as rbinom multiple times can be rather slow and costly.
method 3: using vectorized functions (very fast)
One thing you will hear whispered (or shouted) when it comes to programming in R is the word vectorized. Basically, calling a function induces overhead, and if you are working with a vectorized function, calling it once ensures you only incur that overhead once, instead of multiple times. All distribution functions in R, such as rbinom, are vectorized. So what if we just do all the simulation in one go?
sim_data_vill_vectorized_functions <- matrix(rbinom(nsim * n, size, prob), ncol = nsim, nrow = n, byrow = FALSE) #perform all simulations in 1 rbinom call, and fill in 1 matrix.
So let's quickly check how much faster this is compared to using a for loop or apply. This can be done using the microbenchmark package:
library(microbenchmark)
microbenchmark(for_loop = {
sim_data_vill_for_loop <- matrix(ncol = nsim, nrow = n)
for(i in seq(nsim)) #iterate from 1 to nsim
sim_data_vill_for_loop[, i] <- rbinom(n, size = size, prob = prob) #fill in 1 column at a time
},
apply = {
sim_data_vill_apply <- replicate(nsim, rbinom(n, size, prob))
},
vectorized = {
sim_data_vill_vectorized <- matrix(rbinom(nsim * n, size = size, prob = prob), ncol = nsim, nrow = n, byrow = FALSE)
}
)
Unit: milliseconds
expr min lq mean median uq max neval
for_loop 751.6121 792.5585 837.5512 812.7034 848.2479 1058.4144 100
apply 752.4156 781.3419 837.5626 803.7456 901.6601 1154.0365 100
vectorized 696.9429 720.2255 757.7248 737.6323 765.3453 921.3982 100
Looking at the median times, running all the simulations at once is about 75 ms faster than using a for loop. Here that is not a big deal, but in other cases it might be (reverse n and nsim, and you will start seeing the overhead become a big part of the calculations).
Even if it is not a big deal, using vectorized computations where they pop up is always preferred, both to make code more readable and to avoid unnecessary bottlenecks that have already been optimized away in existing implementations.

Comparing random samples using bootstrapping (to determine minimum sample size)

I'm looking for a faster way to compare random samples using a bootstrapping method.
At the moment my code takes a sample size, then takes 100 different random samples and records the mean of each of them. Then it takes a slightly larger sample size and does the same thing. And so on.
The purpose of this is to find the minimum sample size needed to accurately represent my population.
My code at the moment:
set.seed(124)
mydata <- rnorm(10000, 1, 100)
Summary_Table <- data.frame(0:0)
#generate list of sample sizes
Sample.sizes <- seq(1, length(mydata), by = 100)
#take 100 random samples of each size and record summary stats
for(i in 1:100) {
  for(j in 1:length(Sample.sizes)) {
    Random.Sample <- mydata[sample(1:length(mydata), Sample.sizes[j], replace=FALSE)]
    Summary_Table[j,i] <- mean(Random.Sample) # - median(Random.Sample)) / sd(Random.Sample)
  }
}
Trouble is, this takes 15 minutes to run with the dataset I am using. Is there a quicker way that I am missing?
You might want to consider parallelizing it. E.g.
getSumStat <- function(size) {
mean(mydata[sample(1:length(mydata), size, replace=FALSE)])
}
library(parallel)
cl <- makeCluster(getOption("cl.cores", 3))
clusterExport(cl, varlist=c("getSumStat", "mydata"))
out <- parSapply(cl, Sample.sizes, function(j) replicate(100, getSumStat(j)))
stopCluster(cl)
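For what it's worth, out should come back as a 100 x length(Sample.sizes) matrix (replicates in rows, sample sizes in columns, roughly the transpose of your Summary_Table), so you can summarise it directly, for example:
dim(out)            # 100 replicates by length(Sample.sizes) sample sizes
apply(out, 2, sd)   # spread of the bootstrapped means at each sample size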
