Run the script using the hard drive - R

Given a set of n inputs, I want to generate all permutations of 0's and 1's (essentially the input matrix for a truth table). To do so, I am using the permutations() function from the gtools package in R, as follows:
> permutations(2,n,v=c(0,1),repeats.allowed=TRUE)
where n is the number of inputs.
However, for a sufficiently large n (say 26), the variable becomes very large (for n = 26 it would be approximately 13 GB in size). Given this, I wanted to know whether there is any way (in R) to use the hard disk instead of creating the variable in RAM? (I might actually have to run this with n = 86, which would be an impossible thing to do in RAM.)
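If the rows only need to be visited once (as in a truth-table sweep), one workaround is to avoid materialising the matrix at all and compute each row on demand from its index. Below is a minimal base-R sketch of that idea; bit_row() is a hypothetical helper, not part of gtools, and it only helps while 2^n is itself enumerable, so n = 86 remains out of reach for any exhaustive approach:
bit_row <- function(k, n) {
  # k is a 0-based row index; returns the corresponding length-n vector of 0s and 1s
  (k %/% 2^((n - 1):0)) %% 2
}

n <- 3
for (k in 0:(2^n - 1)) {
  row <- bit_row(k, n)
  # process `row` here instead of storing all 2^n rows in RAM
  print(row)
}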

Generate permutations in sequential order - R

I previously asked the following question:
Permutation of n Bernoulli random variables in R
The answer to that question works great as long as n is relatively small (< 30); otherwise the following error occurs: Error: cannot allocate vector of size 4.0 Gb. I can get the code to run with somewhat larger values by using my desktop at work, but eventually the same error occurs. Even for values that my computer can handle, say 25, the code is extremely slow.
The purpose of this code is to calculate the difference between the CDF of an exact distribution (hence the permutations) and a normal approximation. I randomly generate some data, calculate the test statistic, and then I need to determine the CDF by summing all the permutations that result in a smaller test statistic value and dividing by the total number of permutations.
My thought is to generate the permutations one at a time, note whether each one gives a value smaller than my observed value, and then go on to the next one, i.e. loop over all possible permutations. But I can't just build a data frame of all the permutations to loop over, because that would cause exactly the same size and speed issue.
Long story short: I need to generate all possible permutations of 1's and 0's for n Bernoulli trials, but I need to do this one at a time, such that all of them are generated and none is generated more than once, for arbitrary n. For n = 3 (2^3 = 8), I would first generate
000
calculate if my test statistic was greater (1 or 0) then generate
001
calculate again, then generate
010
calculate, then generate
100
calculate, then generate
011
etc until 111
I'm fine with this being a loop over 2^n that outputs the permutation at each step but doesn't save them all somewhere. I also don't care what order they are generated in; the above is just how I would list them if I were doing it by hand.
In addition, if there is any way to speed up the previous code, that would also be helpful.
A good solution for your problem is iterators. There is a package called arrangements that is able to generate permutations in an iterative fashion. Observe:
library(arrangements)

# initialize the iterator
iperm <- ipermutations(0:1, 3, replace = TRUE)

for (i in 1:(2^3)) {
  print(iperm$getnext())
}
[1] 0 0 0
[1] 0 0 1
.
.
.
[1] 1 1 1
It is written in C and is very efficient. You can also generate m permutations at a time like so:
iperm$getnext(m)
This allows for better performance because the next permutations are being generated by a for loop in C as opposed to a for loop in R.
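For example, a rough sketch of chunked processing for the use case above (the chunk size and the value of n are illustrative choices, not from the original answer):
library(arrangements)

n <- 20
chunk_size <- 1024
iperm <- ipermutations(0:1, n, replace = TRUE)

for (j in seq_len(ceiling(2^n / chunk_size))) {
  chunk <- iperm$getnext(chunk_size)  # matrix with up to chunk_size rows
  # ... compute the test statistic for each row of `chunk` here ...
}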
If you really need to ramp up performance, you can use the parallel package.
iperm <- ipermutations(0:1, 40, replace = TRUE)

parallel::mclapply(1:100, function(x) {
  myPerms <- iperm$getnext(10000)
  # do something with this batch of permutations
}, mc.cores = parallel::detectCores() - 1)
Note: All code is untested.

Forming a Wright-Fisher loop with "sample()"

I am trying to create a simple loop to generate a Wright-Fisher simulation of genetic drift with the sample() function (I'm actually not dead-set on using this function, but, in my naivety, it seems like the right way to go). I know that sample() randomly selects values from a vector based on certain probabilities. My goal is to create a system that keeps making random selections from successive sets: for example, if it takes some original set of values and samples a second set from it, I'd like the loop to then take another random sample from that second set (using the probabilities that were defined earlier).
I'd like to learn how to do this in a very general way, so the specific probabilities and elements are arbitrary at this point. The only things that matter are (1) that every element can be repeated and (2) that the size of the set stays constant across generations, per Wright-Fisher. As an example, I've been playing with the following:
V <- c(1,1,2,2,2,2)
sample(V, size=6, replace=TRUE, prob=c(1,1,1,1,1,1))
Regrettably, my issue is that I don't have any code to share yet precisely because I'm not sure of how to start writing this kind of loop. I know that for() loops are used to repeat a function multiple times, so my guess is to start there. However, from what I've researched about these, it seems that you have to start with a variable (typically i). I don't have any variables in this sampling that seem explicitly obvious; which isn't to say one couldn't be made up.
If you wanted to repeatedly sample from a population with replacement for a total of iter iterations, you could use a for loop:
set.seed(144)  # For reproducibility
population <- init.population
for (i in seq_len(iter)) {
  population <- sample(population, replace = TRUE)
}
population
# [1] 1 1 1 1 1 1
Data:
init.population <- c(1, 1, 2, 2, 2, 2)
iter <- 100
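If the goal is to watch drift over time rather than just obtain the final population, a small extension of the same loop (sketched below; the variable names are my own) can record the frequency of one allele in each generation:
set.seed(144)
init.population <- c(1, 1, 2, 2, 2, 2)
n.gen <- 100

population <- init.population
freq1 <- numeric(n.gen)
for (g in seq_len(n.gen)) {
  population <- sample(population, replace = TRUE)
  freq1[g] <- mean(population == 1)   # frequency of allele "1" in generation g
}
plot(freq1, type = "l", xlab = "generation", ylab = "frequency of allele 1")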

Generating large adjacency matrix

I'm trying to generate an adjacency matrix from a csv.
The csv contains 2 columns, one for users and one for projects. The two columns form a bipartite graph, where each user can be part of multiple projects or none at all, but there are no edges between nodes of the same set (there are no repeated entries for the same user-project pair, but the same user or project can appear in several different pairs).
I wrote a routine in Matlab that compares each user's projects with the entire project set using ismember(a,b). The algorithm runs iteratively through each entry. In the end, I have an adjacency matrix M of size (|users| + |projects|) x (|users| + |projects|).
For entry counts below 15,000 it works fast, but for samples of 15,000 and above, Matlab stalls. I initialize the adjacency matrix as a zeros matrix (zeros(r,c)) and add the results of ismember(a,b) row by row. But for my Matlab installation, a zeros(15000,15000) matrix almost maxes out the memory. I also tried making a zero matrix of that size in R (matrix(0, 15000, 15000)) and it maxes out R's memory as well.
Is there a way to get around this? My full sample size is 597,000 rows (with ~70,000 users and ~35,000 projects) and I want to run a network analysis of it.
Also, I want to keep it in matrix format and not as an adjacency list, because I have a max-flow/min-cut algorithm I want to run on the results and it only works with matrices.
Updated:
The data looks like this
User | Project
382 2429
385 2838
294 2502
... ...
It is taken from SourceForge using Zerlot from the University of Notre Dame, where each integer value is a key in a SQL database.
I want to convert this affiliation data into a one-mode user-to-user adjacency matrix where each edge between users is a shared project.
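One way to avoid ever allocating a dense matrix of that size is to build a sparse bipartite incidence matrix and project it onto the user set with a sparse cross-product. Below is a hedged sketch using the Matrix package; the file name and column names are assumed from the example above, and UU is a user-to-user matrix whose entries count shared projects:
library(Matrix)

edges <- read.csv("affiliations.csv")   # assumed file with columns User, Project
u <- factor(edges$User)
p <- factor(edges$Project)

# sparse |users| x |projects| incidence matrix
B <- sparseMatrix(i = as.integer(u), j = as.integer(p), x = 1,
                  dimnames = list(levels(u), levels(p)))

UU <- tcrossprod(B)   # |users| x |users|; entry (a, b) = number of shared projects
diag(UU) <- 0         # drop self-loops if they are not wanted
Keeping UU sparse matters here: with roughly 70,000 users, the dense equivalent would be about 70,000^2 x 8 bytes, i.e. around 39 GB.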

Clustering big data

I have a list like this:
A B score
B C score
A C score
......
where the first two columns contain the variable names and the third column contains the score between them. The total number of variables is 250,000 (A, B, C, ...), and the score is a float in [0,1]. The file is approximately 50 GB. The pairs (A,B) with a score of 1 have been removed, as more than half the entries were 1.
I wanted to perform hierarchical clustering on the data.
Should I convert the linear form to a matrix with 250,000 rows and 250,000 columns? Or should I partition the data and do the clustering?
I'm clueless with this. Please help!
Thanks.
Your input data already is the matrix.
However, hierarchical clustering usually scales as O(n^3). That won't work with your data set's size. Plus, implementations usually need more than one copy of the distance matrix: at 8 bytes x 250,000 x 250,000, that is about 500 GB per copy, so with two copies you may well need 1 TB of RAM.
Some special cases can run in O(n^2): SLINK does. If your data is nicely sorted, it should be possible to run single-link in a single pass over your file. But you will have to implement this yourself. Don't even think of using R or something fancy.

Very slow raster::sampleRandom, what can I do as a workaround?

tl;dr: why is raster::sampleRandom taking so much time, e.g. to extract 3k cells from 30k cells (over 10k timesteps)? Is there anything I can do to improve the situation?
EDIT: workaround at bottom.
Consider an R script in which I have to read a big file (usually more than 2-3 GB) and perform a quantile calculation over the data. I use the raster package to read the (netCDF) file. I'm using R 3.1.2 under 64-bit GNU/Linux with 4 GB of RAM, of which 3.5 GB is available most of the time.
As the files are often too big to fit into memory (even 2 GB files for some reason will NOT fit into 3 GB of available memory: unable to allocate vector of size 2GB), I cannot always do the following, which is what I would do if I had 16 GB of RAM:
pr <- brick(filename[i], varname=var[i], na.rm=T)
qs <- quantile(getValues(pr)*gain[i], probs=qprobs, na.rm=T, type=8, names=F)
But instead I can sample a smaller number of cells from my files using the sampleRandom() function from the raster package, and still get good statistics.
e.g.:
pr <- brick(filename[i], varname=var[i], na.rm=T)
qs <- quantile(sampleRandom(pr, cnsample)*gain[i], probs=qprobs, na.rm=T, type=8, names=F)
I perform this over 6 different files (i goes from 1 to 6), all of which have about 30k cells and 10k timesteps (so 300M values each). The files are:
1.4GB, 1 variable, filesystem 1
2.7GB, 2 variables, so about 1.35GB for the variable that I read, filesystem 2
2.7GB, 2 variables, so about 1.35GB for the variable that I read, filesystem 2
2.7GB, 2 variables, so about 1.35GB for the variable that I read, filesystem 2
1.2GB, 1 variable, filesystem 3
1.2GB, 1 variable, filesystem 3
Note that:
the files are on three different NFS filesystems, whose performance I'm not sure of. I cannot rule out that the NFS filesystems' performance varies greatly from one moment to the next.
RAM usage is at 100% the whole time the script runs, but the system does not use all of its swap.
sampleRandom(dataset, N) takes N non-NA random cells from one layer (= one timestep) and reads their values; it then does the same for those N cells in every layer. If you visualize the dataset as a 3D matrix, with Z as the timesteps, the function takes N random non-NA columns. However, I guess the function does not know that all the layers have their NAs in the same positions, so it has to check that any column it chooses has no NAs in it.
When using the same commands on files with 8393 cells (about 340 MB in total) and reading all the cells, the computing time is a fraction of that needed to read 1000 cells from a file with 30k cells.
The full script which produces the output below is here, with comments etc.
If I try to read all the 30k cells:
cannot allocate vector of size 2.6 Gb
If I read 1000 cells:
5 min
45 min
30 min
30 min
20 min
20 min
If I read 3000 cells:
15 min
18 min
35 min
34 min
60 min
60 min
If I try to read 5000 cells:
2.5 h
22 h
for the remaining files I had to stop after 18 h, as I had to use the workstation for other tasks
With more tests, I've been able to find out that it's the sampleRandom() function that's taking most of the computing time, not the calculation of the quantile (which I can speed up using other quantile functions, such as kuantile()).
Why is sampleRandom() taking so long? Why does it perform so strangely, sometimes fast and sometimes very slow?
What is the best workaround? I guess I could manually generate N random cells for the 1st layer and then manually raster::extract for all timesteps.
EDIT:
The working workaround is:
cells <- sampleRandom(pr[[1]], cnsample, cells=TRUE)  # extract cnsample random cells from the first layer, excluding NAs; cells[,1] holds their cell numbers
prvals <- pr[cells[,1]]  # read those cells from all layers
qs <- quantile(prvals, probs=qprobs, na.rm=TRUE, type=8, names=FALSE)  # compute the quantiles
This works and is very fast because all layers have NAs in the same positions. I think this should be an option that sampleRandom() could implement.

Resources