How can I handle a very large number of combinations in R?

Here is my problem - I would like to generate a fairly large number of factorial combinations and then apply some constraints on them to narrow down the list of all possible combinations. However, this becomes an issue when the number of all possible combinations becomes extremely large.
Let's take an example - Assume we have 8 variables (A; B; C; etc.) each taking 3 levels/values (A={1,2,3}; B={1,2,3}; etc.).
The list of all possible combinations would be 3^8 (= 6561) and can be generated as follows:
tic <- function(){start.time <<- Sys.time()}          # crude timer: start
toc <- function(){round(Sys.time() - start.time, 4)}  # crude timer: elapsed seconds
nX = 8                                    # number of variables
tic()
lk = lapply(1:nX, function(x) c(1,2,3))   # the levels of each variable
toc()
tic()
mapx = expand.grid(lk)                    # all 3^8 = 6561 combinations, one per row
mapx$idx = 1:nrow(mapx)                   # row identifier used later by combn
toc()
So far so good: these operations run quickly (< 1 second), even if we significantly increase the number of variables.
The next step is to generate a corrected set of all pairwise comparisons (an uncorrected set would be obtained by freely combining all 6561 options with each other, leading to 6561*6561 = 43,046,721 combinations). The size of this "universe" would be 6561*(6561-1)/2 = 21,520,080. Already pretty big!
I am using the built-in R function combn to get this done. In this example the running time remains acceptable (about 20 seconds on my PC), but things become impossible with a higher number of variables and/or more levels per variable (the running time increases exponentially; for example, it already took 177 seconds with 9 variables!). My biggest concern, though, is that the object size would become so large that R could no longer handle it (memory issue).
tic()
univ = t(combn(mapx$idx,2))   # every unordered pair of row indices, as a 2-column matrix
toc()
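To put the memory concern in numbers, here is a back-of-the-envelope estimate of the size of the pair-index matrix alone (an illustrative sketch; it assumes 4 bytes per R integer):

n_pairs <- choose(3^8, 2)        # 21,520,080 unordered pairs
n_pairs * 2 * 4 / 1024^2         # two integer columns: ~164 Mb
choose(3^9, 2) * 2 * 4 / 1024^3  # with 9 variables: ~1.4 Gb, before any filtering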
The next step would be to identify the list of combinations meeting some pre-defined constraints. For instance, I would like to sub-select all combinations sharing exactly 3 common elements (i.e. 3 variables take the same values). Again the running time will be very long (even with 8 variables), as my approach is to loop over all combinations previously defined.
tic()
vrf = NULL
vrf = sapply(1:nrow(univ), function(x){
  j1 = mapx[mapx$idx==univ[x,1], -ncol(mapx)]  # levels of the first member of the pair
  j2 = mapx[mapx$idx==univ[x,2], -ncol(mapx)]  # levels of the second member
  cond = ifelse(sum(j1==j2)==3, 1, 0)          # 1 if exactly 3 variables match
  return(cond)})
toc()
tic()
univ = univ[vrf==1,]
toc()
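As a side note on the filtering step, the same check can be written without an explicit row-by-row loop by comparing the level matrices of both members of each pair (an untested sketch, reusing mapx, univ and nX from above; it trades the loop for two large intermediate matrices, so it addresses speed rather than the memory concern):

lev = as.matrix(mapx[, 1:nX])          # level columns only, without idx
m1  = lev[univ[, 1], , drop = FALSE]   # levels of the first member of each pair
m2  = lev[univ[, 2], , drop = FALSE]   # levels of the second member
vrf = rowSums(m1 == m2)                # number of variables on which the pair agrees
univ = univ[vrf == 3, ]                # keep pairs sharing exactly 3 variables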
Would you know how to overcome this issue? Any tips/advice would be more than welcome!

Related

Generate permutations in sequential order - R

I previously asked the following question
Permutation of n bernoulli random variables in R
The answer to this question works great as long as n is relatively small (< 30); otherwise the following error occurs: Error: cannot allocate vector of size 4.0 Gb. I can get the code to run with somewhat larger values by using my desktop at work, but eventually the same error occurs. Even for values that my computer can handle, say 25, the code is extremely slow.
The purpose of this code to is to calculate the difference between the CDF of an exact distribution (hence the permutations) and a normal approximation. I randomly generate some data, calculate the test statistic and then I need to determine the CDF by summing all the permutations that result in a smaller test statistic value divided by the total number of permutations.
My thought is to generate the permutations one at a time, note whether each is smaller than my observed value, and then move on to the next one, i.e. loop over all possible permutations. But I can't simply build a data frame of all the permutations to loop over, because that would cause exactly the same size and speed issues.
Long story short: I need to generate all possible permutations of 1's and 0's for n Bernoulli trials, but I need to do this one at a time, such that all of them are generated and none is generated more than once, for arbitrary n. For n = 3 (2^3 = 8 permutations), I would first generate
000
calculate if my test statistic was greater (1 or 0) then generate
001
calculate again, then generate
010
calculate, then generate
100
calculate, then generate
011
etc until 111
I'm fine with this being a loop over 2^n that outputs the permutation at each step of the loop but doesn't save them all somewhere. Also, I don't care what order they are generated in; the above is just how I would list them out if I were doing it by hand.
In addition, if there is any way to speed up the previous code, that would also be helpful.
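For reference, the "loop over 2^n without storing anything" idea can be sketched in base R with intToBits(), which turns the loop counter into the corresponding 0/1 pattern (an untested sketch; my_test_statistic and observed are hypothetical placeholders, and intToBits() limits this approach to n <= 31):

n <- 3
count <- 0
i <- 0
while (i < 2^n) {
  perm <- as.integer(intToBits(i))[1:n]   # i-th 0/1 pattern: 0 0 0, 1 0 0, 0 1 0, ...
  # stat  <- my_test_statistic(perm)      # hypothetical: statistic for this permutation
  # count <- count + (stat <= observed)   # hypothetical: compare against the observed value
  i <- i + 1
}
# count / 2^n would then be the permutation CDF value (assuming all patterns are equally likely)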
A good solution for your problem is iterators. There is a package called arrangements that is able to generate permutations in an iterative fashion. Observe:
library(arrangements)
# initialize iterator
iperm <- ipermutations(0:1, 3, replace = TRUE)
for (i in 1:(2^3)) {
  print(iperm$getnext())
}
[1] 0 0 0
[1] 0 0 1
.
.
.
[1] 1 1 1
It is written in C and is very efficient. You can also generate m permutations at a time like so:
iperm$getnext(m)
This allows for better performance because the next permutations are being generated by a for loop in C as opposed to a for loop in R.
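For instance, a hypothetical batched run of the small example above, pulling the 2^3 = 8 patterns in two chunks of 4 rows each:

iperm <- ipermutations(0:1, 3, replace = TRUE)
for (chunk in 1:2) {
  m <- iperm$getnext(4)   # a 4 x 3 matrix, one 0/1 pattern per row
  # process the rows of m here
}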
If you really need to ramp up performance, you can use the parallel package.
iperm <- ipermutations(0:1, 40, replace = TRUE)
parallel::mclapply(1:100, function(x) {
  myPerms <- iperm$getnext(10000)
  # do something
}, mc.cores = parallel::detectCores() - 1)
Note: All code is untested.

R lookup time for very long vector

In the R programming language...
Bottleneck in my code:
a <- a[b]
where:
a, b are vectors of length 90 million.
a is a logical vector.
b is a permutation of the indices of a.
This operation is slow: it takes ~ 1.5 - 2.0 seconds.
I thought straightforward indexing would be much faster, even for large vectors.
Am I simply stuck? Or is there a way to speed this up?
Context:
P is a large matrix (10k rows, 5k columns).
rows = names, columns = features, values = real numbers.
Problem: Given a subset of names, I need to obtain matrix Q, where:
Each column of Q is sorted (independently of the other columns of Q).
The values in a column of Q come from the corresponding column of P and are only those from the rows of P which are in the given subset of names.
Here is a naive implementation:
Psub <- P[names,]
Q <- sapply( Psub , sort )
But I am given 10,000 distinct subsets of names (each subset covers somewhere between 20% and 90% of the total). Taking the subset and sorting each time is incredibly slow.
Instead, I can pre-compute the order vector:
b <- sapply(P, order)
b <- convert_to_linear_index(as.data.frame(b), dim(P))  # my own function
# Now b is a vector of length nrow(P) * ncol(P)
a <- rownames(P) %in% myNames
a <- rep(a, ncol(P))
a <- a[b]
a <- matrix(a, nrow = length(myNames))  # note: as.matrix() silently ignores nrow; matrix() is needed here
I don't see this getting much faster than that. You can try to write an optimized C function to do exactly this, which might cut the time in half or so (and that's optimistic -- vectorized R operations like this don't have much overhead), but not much more than that.
You've got approx 10^8 values to go through. Each time through the internal loop, it needs to increment the iterator, get the index b[i] out of memory, look up a[b[i]] and then save that value into newa[i]. I'm not a compiler/assembly expert by a long shot, but this sounds like on the order of 5-10 instructions, which means you're looking at "big O" of 1 billion instructions total, so there's a clock rate limit to how fast this can go.
Also, R stores logical values as 32 bit ints, so the array a will take up about 400 megs, which doesn't fit into cache, so if b is a more or less random permutation, then you're going to be missing the cache regularly (on most lookups to a, in fact). Again, I'm not an expert, but I would think it's likely that the cache misses here are the bottleneck, and if that's the case, optimized C won't help much.
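(If you want to sanity-check that figure, object.size() reports the exact footprint of the logical vector; an illustrative one-liner:)

print(object.size(logical(90e6)), units = "Mb")   # ~343 Mb; the ~400 MB above is a round-number estimate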
Aside from writing it in C, the other thing to do is determine whether there are any assumptions you can make that would let you not go through the whole array. For example, if you know most of the indices will not change, and you can figure out which ones do change, you might be able to make it go faster.
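To make that last suggestion concrete, here is a minimal sketch of the "only touch the entries that actually move" idea (assuming most of b is the identity permutation; whether it pays off depends on how many entries change):

changed <- which(b != seq_along(b))   # positions whose source index actually differs
a[changed] <- a[b[changed]]           # the right-hand side is evaluated before assignment, so this is safe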
On edit, here are some numbers. I have an AMD with clock speed of 2.8GHz. It takes me 3.4 seconds with a random permutation (i.e. lots of cache misses) and 0.7 seconds with either 1:n or n:1 (i.e. very few cache misses), which breaks into 0.6 seconds of execution time and 0.1 of system time, presumably to allocate the new array. So it does appear that cache misses are the thing. Maybe optimized C code could shave something like 0.2 or 0.3 seconds off of that base time, but if the permutation is random, that won't make much difference.
> x<-sample(c(T,F),90*10**6,T)
> prm<-sample(90*10**6)
> prm1<-1:length(prm)
> prm2<-rev(prm1)
> system.time(x<-x[prm])
user system elapsed
3.317 0.116 3.436
> system.time(x<-x[prm1])
user system elapsed
0.593 0.140 0.734
> system.time(x<-x[prm2])
user system elapsed
0.631 0.112 0.743

Forming a Wright-Fisher loop with "sample()"

I am trying to create a simple loop to generate a Wright-Fisher simulation of genetic drift with the sample() function (I'm actually not dead-set on using this function, but, in my naivety, it seems like the right way to go). I know that sample() randomly selects values from a vector based on certain probabilities. My goal is to create a system that keeps making random selections from successive sets. For example, if it takes some original set of values and samples a second set from it, I'd like the loop to take another random sample from that second set (using the probabilities that were defined earlier).
I'd like to just learn how to do this in a very general way. Therefore, the specific probabilities and elements are arbitrary at this point. The only things that matter are (1) that every element can be repeated and (2) that the size of the set must stay constant across generations, per Wright-Fisher. As an example, I've been playing with the following:
V <- c(1,1,2,2,2,2)
sample(V, size=6, replace=TRUE, prob=c(1,1,1,1,1,1))
Regrettably, I don't have any code to share yet, precisely because I'm not sure how to start writing this kind of loop. I know that for() loops are used to repeat a function multiple times, so my guess is to start there. However, from what I've researched about these, it seems that you have to start with a variable (typically i), and there is no variable in this sampling that seems explicitly obvious (which isn't to say one couldn't be made up).
If you wanted to repeatedly sample from a population with replacement for a total of iter iterations, you could use a for loop:
set.seed(144) # For reproducibility
population <- init.population
for (i in seq_len(iter)) {
  population <- sample(population, replace=TRUE)
}
population
# [1] 1 1 1 1 1 1
Data:
init.population <- c(1, 1, 2, 2, 2, 2)
iter <- 100
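If you also want to keep the trajectory of the simulation rather than just the final generation, a hypothetical extension of the same loop is to record the frequency of one allele at each step:

set.seed(144)
population <- init.population
freq1 <- numeric(iter)                # frequency of allele "1" in each generation
for (i in seq_len(iter)) {
  population <- sample(population, replace=TRUE)
  freq1[i] <- mean(population == 1)   # drifts until it hits 0 or 1 (loss or fixation)
}
plot(freq1, type="s", xlab="generation", ylab="frequency of allele 1")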

R probability simulation that won't terminate?

I'm teaching a statistics class where I'm having students explore questions in probability and statistics through simulation using R. Recently there was some confusion about the probability of getting exactly two 6's when rolling 5 dice. The answer is choose(5,2)*5^3/6^5, but some students were convinced that "order shouldn't matter", i.e. that the answer should be choose(5,2)*choose(25,3)/choose(30,5).
I thought it would be fun to have them simulate rolling 5 dice thousands of times, keeping track of the empirical probability for each experiment, and then repeat the experiment many times. The problem is that the two numbers above are sufficiently close that it's quite hard to get a simulation to tease out the difference in a statistically significant fashion (of course I could just be doing it wrong). I tried rolling 5 dice 100,000 times, then repeating the experiment 10,000 times. This took an hour or so to run on my i7 Linux machine and still allowed for a 25% chance that the correct answer is choose(5,2)*choose(25,3)/choose(30,5).
So I increased the number of dice rolls per experiment to 10^6. Now the code has been running for over 2 days and shows no sign of finishing. I'm confused by this, as I only increased the number of operations by an order of magnitude, implying that the run time should be closer to 10 hours.
Second question: Is there a better way to do this? See code posted below:
probdist = rep(0,10000)
for (j in 1:length(probdist))
{
  outcome = rep(0,1000000)
  for (k in 1:1000000)
  {
    rolls = sample(1:6, 5, replace=T)
    if (length(rolls[rolls == 6]) == 2) outcome[k] = 1
  }
  probdist[j] = sum(outcome)/length(outcome)
}
A good rule of thumb is to never, ever write a for loop in R. Here's an alternative solution:
doSample <- function()
{
  sum(sample(1:6,size=5,replace=TRUE)==6)==2
}
> system.time(samples <- replicate(n=10000,expr=doSample()))
user system elapsed
0.06 0.00 0.06
> mean(samples)
[1] 0.1588
> choose(5,2)*5^3/6^5
[1] 0.160751
Doesn't seem to be too accurate with 10,000 samples. Better with 100,000:
> system.time(samples <- replicate(n=100000,expr=doSample()))
user system elapsed
0.61 0.02 0.61
> mean(samples)
[1] 0.16135
I had originally awarded a correct answer check to M. Berk for his/her suggestion to use the R replicate() function. Further investigation has forced me to rescind my previous endorsement. It turns out that replicate() is just a wrapper for sapply(), which doesn't actually afford any performance benefits over a for loop (this seems to be a common misconception). In any case, I prepared 3 versions of the simulation, 2 using a for loop and one using replicate, as suggested, and ran them one after the other, starting from a fresh R session each time, in order to compare the execution times:
# dice26dist1.r: For () loop version with unnecessary array allocation
probdist = rep(0,100)
for (j in 1:length(probdist))
{
  outcome = rep(0,1000000)
  for (k in 1:1000000)
  {
    rolls = sample(1:6, 5, replace=T)
    if (length(rolls[rolls == 6]) == 2) outcome[k] = 1
  }
  probdist[j] = sum(outcome)/length(outcome)
}
system.time(source('dice26dist1.r'))
user system elapsed
596.365 0.240 598.614
# dice26dist2.r: For () loop version
probdist = rep(0,100)
for (j in 1:length(probdist))
{
  outcomes = 0
  for (k in 1:1000000)
  {
    rolls = sample(1:6, 5, replace=T)
    if (length(rolls[rolls == 6]) == 2) outcomes = outcomes + 1
  }
  probdist[j] = outcomes/1000000
}
system.time(source('dice26dist2.r'))
user system elapsed
506.331 0.076 508.104
# dice26dist3.r: replicate() version
doSample <- function()
{
  sum(sample(1:6,size=5,replace=TRUE)==6)==2
}
probdist = rep(0,100)
for (j in 1:length(probdist))
{
  samples = replicate(n=1000000,expr=doSample())
  probdist[j] = mean(samples)
}
system.time(source('dice26dist3.r'))
user system elapsed
804.042 0.472 807.250
From this you can see that the replicate() version is considerably slower than either of the for loop versions by any system.time metric. I had originally thought that my problem was mostly due to cache misses from allocating the million-element outcome[] array, but comparing the times of dice26dist1.r and dice26dist2.r indicates that this has only a nominal impact on performance (although the impact on system time is considerable: a >300% difference).
One might argue that I'm still using for loops in all three simulations, but as far as I can tell this is completely unavoidable when simulating a random process; I have to simulate actually going through the random process (in this case, rolling 5 dice) every time. I would love to know about any technique that would allow me to avoid using a for loop (in a way that improves performance, of course). I understand that this problem would lend itself very effectively to parallelization, but I'm talking about using a single R session -- is there a way to make this faster?
Vectorization is almost always preferred to any for loop. In this case, you should see substantial speedup by generating all your dice throws first, then checking how many in each group of five equal 6.
set.seed(5)
N <- 1e6
foo <- matrix(sample(1:6, 5*N, replace=TRUE), ncol=5)
p <- mean(rowSums(foo==6)==2)
se <- sqrt(p*(1-p)/N)
p
## [1] 0.160382
Here's a 95% confidence interval:
p + se*qnorm(0.975)*c(-1,1)
## [1] 0.1596628 0.1611012
We can see that the true answer (ans1) is in the interval, but the false answer (ans2) is not, or we could perform significance tests; the p-value when testing the true answer is 0.31 but for the false answer is 0.0057.
(ans1 <- choose(5,2)*5^3/6^5)
## [1] 0.160751
pnorm(abs((ans1-p)/se), lower=FALSE)*2
## [1] 0.3145898
(ans2 <- choose(5,2)*choose(25,3)/choose(30,5))
## [1] 0.1613967
pnorm(abs((ans2-p)/se), lower=FALSE)*2
## [1] 0.005689008
Note that I'm generating all the dice throws at once; if memory is an issue, you could split this up into pieces and combine, as you did in your original post. This is possibly what caused your unexpected slowdown; if it was necessary to use swap memory, that would slow things down substantially. If so, better to increase the number of times you run the loop, not the number of rolls within the loop.
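A chunked version of the same vectorized simulation might look like this (a sketch; the chunk size and the number of chunks are arbitrary, and the estimator is identical to the all-at-once version, just with a much smaller peak memory footprint):

set.seed(5)
nchunk <- 10                          # split 1e6 experiments into 10 blocks of 1e5
chunk  <- 1e5
hits   <- 0
for (blk in seq_len(nchunk)) {
  foo  <- matrix(sample(1:6, 5*chunk, replace=TRUE), ncol=5)
  hits <- hits + sum(rowSums(foo==6) == 2)
}
p <- hits / (nchunk * chunk)          # same estimate of P(exactly two 6's)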

Efficiencies for nested for loop

I've created the following code that nests a for loop inside of a for loop in R. It is a simulation to calculate Power. I've read that R isn't great for doing for loops but I was wondering if there are any efficiencies I could apply to make this run a bit faster. I'm fairly new to R as well as programming of any sort. Right now the run times I'm seeing are:
m=10 I get .17 sec
m=100 I get 3.95 sec
m=1000 I get 246.26 sec
m=2000 I get 1003.55 sec
I was hoping to set the number of times to sample, m, upwards of 100K, but I'm afraid to even set this at 10K.
Here is the code:
m = 1000 # number of times we are going to take samples
popmean=120 # set population mean at 120
popvar=225 # set known/established population
variance at 225
newvar=144 # variance of new methodology
alpha=.01 # set alpha
teststatvect = matrix(nrow=m,ncol=1) # empty vector to populate with test statistics
power = matrix(nrow=200,ncol=1) # empty vector to populate with power
system.time( # not needed - using to gauge how long this takes
for (n in 1:length(power)) { # begin for loop for different sample sizes
for(i in 1:m){ # begin for loop to take "m" samples
y=rnorm(n,popmean,sqrt(newvar)) # sample of size n with mean 120 and var=144
ts=sum((y-popmean)^2/popvar) # calculate test statistic for each sample
teststatvect[i]=ts # loop and populate the vector to hold test statistics
vecpvals=pchisq(teststatvect,n) # calculate the pval of each statistic
power[n]=length(which(vecpvals<=alpha))/length(vecpvals) # loop to populate power vector. Power is the proportion lessthan ot equal to alpha
}
}
)
I reorganized your code a bit and got rid of the inner loop.
Sampling one long vector of random numbers (and then collapsing it into a matrix) is much faster than repeatedly sampling short vectors (replicate, as suggested in another answer, is nice for readability, but in this case you can do better by sampling random numbers in a block)
colSums is faster than summing inside a for loop or using apply.
it's just sugar (i.e. it isn't actually any more efficient), but you can use mean(pvals<=alpha) in place of sum(pvals<=alpha)/length(pvals)
I defined a function to return the power for a specified set of parameters (including sample size), then used sapply to range over the vector of sizes (not faster than a for loop, but cleaner and maybe easier to generalize).
Code:
powfun <- function(ssize=100,
                   m=1000,          ## samples per trial
                   popmean=120,     ## pop mean
                   popvar=225,      ## known/established pop variance
                   newvar=144,      ## variance of new methodology
                   alpha=0.01,
                   sampchisq=FALSE) ## sample directly from chi-squared distrib?
{
  if (!sampchisq) {
    ymat <- matrix(rnorm(ssize*m,popmean,sd=sqrt(newvar)),ncol=m)
    ts <- colSums((ymat-popmean)^2/popvar) ## test statistic
  } else {
    ts <- rchisq(m,df=ssize)*newvar/popvar
  }
  pvals <- pchisq(ts,df=ssize) ## pval
  mean(pvals<=alpha) ## power
}
Do you really need the power for every integer value of sample size, or would a more widely spaced set of sample sizes be OK? (If you need exact values, interpolation would probably be pretty accurate.)
ssizevec <- seq(10,250,by=5)
set.seed(101)
system.time(powvec <- sapply(ssizevec,powfun,m=5000)) ## 13 secs elapsed
This is reasonably fast and might get you up to m=1e5 if you needed, but I'm not quite sure why you need results that are that precise -- the power curve is reasonably smooth with m=5000 ...
If you're impatiently waiting for long simulations, you can also get a progress bar to print by replacing sapply(ssizevec,powfun,m=5000) with library(plyr); aaply(ssizevec,.margins=1,powfun,.progress="text",m=5000)
Finally, I think you can speed the whole thing up a lot by sampling chi-squared values directly, or by doing an analytical power calculation (!). I think that rchisq(m,df=ssize)*newvar/popvar is equivalent to the first two lines of the loop, and you might even be able to do a numerical computation on the chi-squared densities directly ...
system.time(powvec2 <- sapply(ssizevec,powfun,m=5000,sampchisq=TRUE))
## 0.24 seconds elapsed
(I just tried this out, sampling m=1e5 at every value of sample size from 1 to 200 ... it takes 24 seconds ... but I still think it might be unnecessary.)
A picture:
par(bty="l",las=1)
plot(ssizevec,powvec,type="l",xlab="sample size",ylab="power",
     xlim=c(0,250),ylim=c(0,1))
lines(ssizevec,powvec2,col="red")
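Following up on the "analytical power calculation" remark above: since the test statistic is ts = (newvar/popvar) * chi-squared with ssize degrees of freedom, the power P(pchisq(ts, ssize) <= alpha) can be written in closed form (a sketch; powfun_exact is a hypothetical name, not part of the answer above):

powfun_exact <- function(ssize, popvar=225, newvar=144, alpha=0.01) {
  # P( ts <= qchisq(alpha, ssize) )  with  ts = (newvar/popvar) * chi^2_ssize
  pchisq(qchisq(alpha, df=ssize) * popvar/newvar, df=ssize)
}
## e.g. overlay on the simulated curves:
## lines(ssizevec, sapply(ssizevec, powfun_exact), col="blue", lty=2)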
In general, you want as far as possible to take advantage of vectorization, not so much for speed as readability/comprehension.
Why is the write to power[n] inside the inner loop (and, I guess, the calculation of vecpvals as well)? Shouldn't that be in the outer loop, after the inner loop finishes? You may also want to move the calculation of the square root outside both loops.
Why are teststatvect and power initialized as matrices (which are explicitly two-dimensional arrays) and not as vectors (or rather, as one-dimensional arrays, using array)? Is "variance at 225" just the end of the comment from the previous line? You may want to check formatting. (Is this homework?)
For what it looks like you're trying to do here, you may want to take advantage of the very handy function replicate, perhaps by writing a specific function to call it on.
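A minimal sketch of that replicate-based structure (one hypothetical helper, same model and parameters as the question; not tuned for speed):

one_power <- function(n, m=1000, popmean=120, popvar=225, newvar=144, alpha=0.01) {
  sdnew <- sqrt(newvar)                         # square root computed once, outside the loop
  ts <- replicate(m, sum((rnorm(n, popmean, sdnew) - popmean)^2 / popvar))
  mean(pchisq(ts, df=n) <= alpha)               # power at sample size n
}
power <- sapply(1:200, one_power)               # one value per sample size, as in the question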
