Comparing random samples using bootstrapping (to determine minimum sample size) in R

I'm looking for a faster way to compare random samples using a bootstrapping method.
At the moment my code takes a sample size, then draws 100 different random samples of that size and records the mean of each. It then takes a slightly larger sample size and does the same thing, and so on.
The purpose of this is to find the minimum sample size needed to accurately represent my population.
My code at the moment:
set.seed(124)
mydata <- rnorm(10000, 1, 100)
Summary_Table <- data.frame(0:0)

# generate list of sample sizes
Sample.sizes <- seq(1, length(mydata), by = 100)

# take 100 random samples of each size and record summary stats
for (i in 1:100) {
  for (j in 1:length(Sample.sizes)) {
    Random.Sample <- mydata[sample(1:length(mydata), Sample.sizes[j], replace = FALSE)]
    Summary_Table[j, i] <- mean(Random.Sample) # - median(Random.Sample)) / sd(Random.Sample)
  }
}
Trouble is, this takes 15 minutes to run with the dataset I am using. Is there a quicker way that I am missing?

You might want to consider parallelizing it. E.g.
getSumStat <- function(size) {
  mean(mydata[sample(1:length(mydata), size, replace = FALSE)])
}
library(parallel)
cl <- makeCluster(getOption("cl.cores", 3))
clusterExport(cl, varlist=c("getSumStat", "mydata"))
out <- parSapply(cl, Sample.sizes, function(j) replicate(100, getSumStat(j)))
stopCluster(cl)
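The result out is a 100 x length(Sample.sizes) matrix, with one column of bootstrap means per sample size. To get at the stated goal of a minimum sample size, one option (a sketch; the plot labels are only illustrative) is to look at how the spread of those means shrinks as the sample size grows:

colnames(out) <- Sample.sizes
spread <- apply(out, 2, sd)              # spread of the 100 bootstrap means at each sample size
plot(Sample.sizes, spread, type = "l",
     xlab = "sample size", ylab = "SD of the 100 sample means")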

Related

rbinom is producing a higher than expected number of successes

I was just attempting to recreate an example using the rbinom function, but the number of "successes" is a lot higher than I would expect given the low probabilities.
numSamples <- 10000 # number of samples to be drawn from population
numTrials <- 100    # this is the sample size (size of each sample)
probs <- seq(0.001, 0.9999, 0.01)

for (i in 1:length(probs)) {
  x <- rbinom(n = numSamples, size = numTrials, prob = probs[i])
}
Everything seems straightforward, except that I am getting the number of successes to be between 97 and 100 for all of the samples. When I do a few test cases manually using the smaller probabilities such as 0.001, I get the expected number of successes: 0. So there is an issue in how my for loop is reading things. What is going wrong?
Your loop overwrites x on every iteration, so after it finishes x only holds the draws for the last (largest) probability, about 0.99, which is why every count is between 97 and 100. Here's one way to keep all the samples:
res <- list()
for (i in 1:length(probs)) {
  res <- c(res,
           list(rbinom(n = numSamples, size = numTrials, prob = probs[i])))
}
You can then unlist(res) to get one long vector, or do.call(rbind, res) to collapse to a matrix. Growing a list this way and then collapsing it doesn't incur the same performance penalty as growing a vector.
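If all you need is that matrix, a vectorized sketch that skips the explicit loop entirely (reusing numSamples, numTrials and probs from above) is:

res_mat <- sapply(probs, function(p)
  rbinom(n = numSamples, size = numTrials, prob = p))
# numSamples rows, one column per probability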

for loop in R - generating and calculating different results

Given rnorm(100), I need to create a for loop in R that calculates the mean and the standard deviation, generating 100 different numbers each time, and stores those results in a vector.
How can I accomplish that?
So far I have:
a <- rnorm(100)         # generate 100 random numbers from a normal distribution
sample <- 100           # number of samples
results <- rep(NA, 100) # vector for storing the results

for (i in 1:sample){
  results[i] <- as.data.frame() # then I'm stuck here - the most important part
}
This function will return the mean, standard deviation and sample size. The only argument required is the sample size (n).
results <- function(n) {
  x <- rnorm(n) # draw a single sample of size n so the mean and SD describe the same draw
  data.frame(Mean = mean(x),
             Stdev = sd(x),
             n = n)
}
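To get the 100 repetitions the original question asks for, one way (a sketch) is to call it repeatedly and stack the one-row data frames:

out <- do.call(rbind, lapply(1:100, function(i) results(100)))
head(out)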

two sided ks test loop, get p.value

I have a column of data from which I am taking randomized sub samples of 50%.
I'm running a two-sided KS test to compare the distribution of 50% of the data against 100% of the data, to see if the distribution is still a significant fit.
In order to meet my objectives I want to run this as a loop of, say, 1000 iterations to get an average p-value from 1000 randomized subsamples. This line of code gives me a single p-value for a random subset of 50% of my sample:
dat50=dat[sample(nrow(dat),replace=F,size=0.50*nrow(dat)),]
ks.test(dat[,1],dat50[,1], alternative="two.sided")
I need a line of code that will run this 1000 times saving the resulting (different) p value each time in a column which I can then average. The code I'm trying to get to work looks like this:
x <- numeric(100)
for (i in 1:100){
  x <- ks.test(dat[,7], dat50[,7], alternative="two.sided")
  x <- x$p.value
}
However, this does not store multiple p-values.
I also tried this:
get.p.value <- function(df1, df2) {
  x <- rf(5, df1=df1, df2=df2)
  p.value <- ks.test(dat[,6], dat50[,6], alternative="two.sided")$p.value
}
replicate (2000, get.p.value(df1 = 5, df2 = 10))
I hope that is clear and I would appreciate any help solving this so much!
In your for loop you are overwriting x in each iteration, meaning that you only keep the p-value from the last iteration. Try this instead:
x <- numeric(100)
for (i in 1:length(x))
  x[i] <- ks.test(dat[,7], dat[sample(nrow(dat), replace=F, size=0.5*nrow(dat)), 7])$p.value
You can get the same result using replicate with:
replicate(100, ks.test(dat[,7], dat[sample(nrow(dat), replace=F, size=0.5*nrow(dat)),7])$p.value)
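The average p-value the question asks for is then just the mean of whichever vector you store, e.g. with the 1000 subsamples mentioned in the question:

pvals <- replicate(1000,
  ks.test(dat[,7], dat[sample(nrow(dat), replace=F, size=0.5*nrow(dat)), 7])$p.value)
mean(pvals)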

Working with multiple cores and sparse matrices in R

I am working on a project that requires large matrices with a large number of zeros. Unfortunately, as some of these matrices can have more than 1e10 elements, working with "standard" R matrices is not an option due to RAM constraints. I also need to work on multiple cores, as the computation can take quite a long time and really shouldn't.
So far, I have been working with the foreach package, and converted the results (which come in standard matrices) to sparse matrices afterwards. I can't help but think that there must be a smarter way.
Here is a minimal example of what I have been doing so far:
library(doSNOW)  # loads foreach and snow as dependencies
library(Matrix)

cl <- makeSOCKcluster(8)
registerDoSNOW(cl)
Mat <- foreach(j = 1:length(lambda), .combine = 'cbind') %dopar% {
  replicate(iter, rpois(n = 1, lambda[j]))
}
Mat <- Matrix(Mat, sparse = TRUE)
stopCluster(cl)
The lambdas are all quite small, so that only every 5th element or so is different from zero, making it sensible to store the results in a sparse matrix.
Unfortunately, it has now become necessary to increase the number of iterations from 1e6 to at least 1e7, so that the matrix produced by the foreach loop is too large to fit in 8 GB of RAM. What I now want to do is split the task into steps of 1e6 iterations each, and combine these into a single, sparse matrix.
I now have the following as an idea:
library(Matrix)
library(snow)

cl <- makeSOCKcluster(8)

iter <- 1e6
steps <- 1e5
numsteps <- iter / steps

draws <- function(x, lambda, steps){
  replicate(n = steps, rpois(n = 1, lambda = lambda))
}

for (i in 1:numsteps){
  Mat <- Matrix(0, nrow = steps, ncol = 96, sparse = TRUE)
  Mat <- Matrix(
    parApply(cl = cl, X = Mat, MARGIN = 2, FUN = draws, lambda = 0.2, steps = steps),
    sparse = TRUE)
  if (!exists("fullmat")) fullmat <- Mat else fullmat <- rBind(fullmat, Mat)
  rm(Mat)
}
stopCluster(cl)
stopCluster(cl)
It works fine, but I had to fix lambda to some value. For my application, I need the values in the ith row to come from a Poisson distribution with mean equal to the ith element of the lambda vector. This obviously worked fine in the foreach loop, but I have yet to find a way to make it work in an apply loop.
My questions are:
1. Is it possible to have the apply function "know" which row it is operating on and pass a corresponding argument to a function?
2. Is there a way to work with foreach and sparse matrices without needing to create a standard matrix and convert it into a sparse one in the next step?
3. If neither of the above, is there a way for me to manually assign tasks to the slave R processes - that is, could I specifically tell one process to work on column 1, another to work on column 2, and so on, each creating a sparse vector and combining these only in the last step?
I was able to find a solution to my problem.
In my case, I am able to define a unique ID for each of the columns, and can address the parameters by that. The following code should illustrate what I mean:
library(snow)
library(Matrix)

iter <- 1e6
steps <- 1e5

# define a unique id
SZid <- seq(from = 1, to = 10, by = 1)

# in order to have reproducible code, generate random parameters (one per id)
SZlambda <- replicate(n = 10, runif(n = 1, min = 0, max = .5))
SZmu     <- replicate(n = 10, runif(n = 1, min = 10, max = 15))
SZsigma  <- replicate(n = 10, runif(n = 1, min = 1, max = 3))

cl <- makeSOCKcluster(8)
clusterExport(cl, list = c("SZlambda", "SZmu", "SZsigma"))

numsteps <- iter / steps

MCSZ <- function(SZid, steps){ # Monte Carlo simulation
  lambda <- SZlambda[SZid]; mu <- SZmu[SZid]; sigma <- SZsigma[SZid]
  replicate(steps, sum(rlnorm(meanlog = mu, sdlog = sigma,
                              n = rpois(n = 1, lambda))))
}

for (i in 1:numsteps){
  Mat <- Matrix(
    parSapply(cl, X = SZid, FUN = MCSZ, steps = steps), sparse = TRUE)
  if (!exists("LossSZ")) LossSZ <- Mat else LossSZ <- rBind(LossSZ, Mat)
  rm(Mat)
}
stopCluster(cl)
The trick is to apply the function not over the matrix, but over a vector of unique ids that line up with the indices of the parameters.
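The same pattern answers the original per-column lambda question directly. A minimal sketch (the 96 columns, the chunk size of 1e5 and the lambda values are placeholders, not the actual parameters):

library(snow)
library(Matrix)

lambda <- runif(96, min = 0, max = 0.5) # hypothetical per-column rates
cl <- makeSOCKcluster(8)
clusterExport(cl, list = "lambda")
# iterate over column indices, so each worker can look up its own lambda
Mat <- Matrix(parSapply(cl, seq_along(lambda),
                        function(j) rpois(n = 1e5, lambda = lambda[j])),
              sparse = TRUE)
stopCluster(cl)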

Parallel Monte Carlo Simulation in R using snowfall

I am trying to compare up to thousands of estimated beta distributions. Each beta distribution is characterized by its two shape parameters, alpha and beta.
I now draw 100,000 samples from every distribution. As a final result I want to get an ordering of the distributions with the highest probability in every sample draw.
My first approach was to use lapply to generate a matrix of N * NDRAWS numeric values, which consumed too much memory as N gets beyond 10,000 (10,000 * 100,000 * 8 bytes).
So I decided to use a sequential approach: order every single draw, accumulate the ranks over all draws, and get the final order, as shown in the example below:
set.seed(12345)
N = 100
NDRAWS = 100000
df <- data.frame(alpha = sample(1:20, N, replace = T),
                 beta = sample(1:200, N, replace = T))
vec <- vector(mode = "integer", length = N)

for (i in 1:NDRAWS){
  # order probabilities after a single draw for every theta
  pos <- order(rbeta(N, shape1 = df$alpha, shape2 = df$beta))
  # sum up winning positions for every theta
  vec[pos] <- vec[pos] + 1:N
}

# order thetas
ord <- order(-vec)
df[ord,]
This consumes only N * 4 bytes of memory, as there is no giant matrix, just a single vector of length N. My question now is: how can I speed up this operation using snowfall (or any other multicore package) by taking advantage of my 4 CPU cores, instead of using just one core?
# parallelize using snowfall pckg
library(snowfall)
sfInit( parallel=TRUE, cpus=4, type="SOCK")
sfLapply( 1:NDRAWS, function(x) ?????? )
sfStop()
Any help is appreciated!
This can be parallelized in the same way that one would parallelize random forest or bootstrapping. You just perform the sequential code on each of the workers but with each using a smaller number of iterations. That is much more efficient than splitting each iteration of the for loop into a separate parallel task.
Here's your complete example converted to use the foreach package with the doParallel backend:
set.seed(12345)
N = 100
NDRAWS = 100000
df <- data.frame(alpha = sample(1:20, N, replace = T),
                 beta = sample(1:200, N, replace = T))

library(doParallel)
nworkers <- detectCores()
cl <- makePSOCKcluster(nworkers)
clusterSetRNGStream(cl, c(1, 2, 3, 4, 5, 6, 7))
registerDoParallel(cl)

vec <- foreach(ndraws = rep(ceiling(NDRAWS / nworkers), nworkers),
               .combine = '+') %dopar% {
  v <- integer(N)
  for (i in 1:ndraws) {
    pos <- order(rbeta(N, shape1 = df$alpha, shape2 = df$beta))
    v[pos] <- v[pos] + 1:N
  }
  v
}
ord <- order(-vec)
df[ord,]
Note that this gives different results than the sequential version because different random numbers are generated by the workers. I used the parallel random number support provided by the parallel package since that is good practice.
Well, the functionality is there. I'm not sure though what you'd be returning with each iteration.
Perhaps try this?
myFunc <- function(xx, N) {
  pos <- order(rbeta(N, shape1 = df$alpha, shape2 = df$beta))
  vec[pos] + 1:N
}
Using doParallel will allow you to add results:
require(doParallel)
registerDoParallel(cores=4)
foreach(i=1:NDRAWS, .combine='+') %dopar% myFunc(i, N)
