Creating a distance matrix in R for a matrix in a higher dimensional space

I have created a Euclidean distance matrix using the dist() function in R.
Below is my R script. Because the resulting matrix has dimensions 16809 * 16809, running the script in R gives the error message:
Error: cannot allocate vector of size 1.1 Gb
So is there any way to get rid of this error?
I haven't used parallelization in R previously. Can it be done using parallelization?
rnd.points = matrix(runif(3 * 16809), ncol = 3)
rnd.points <- rnd.points[1:5,]
ds <- dist(rnd.points)
as.matrix(ds) -> nt
nt

As @Gopola said: dist(.) computes all pairwise distances, and hence needs
O(n^2) memory. Indeed, dist() is efficient and only stores half of the symmetric n x n matrix.
If I compute dist() on a computer with enough RAM, it works nicely, and indeed creates an object ds of size 1.1 Gb ... which is not so large for today's computers.
rnd.points <- matrix(runif(3 * 16809), ncol = 3)
ds <- dist(rnd.points)
object.size(ds)
Note however that your
as.matrix(ds) -> nt
is not such a good idea, as the resulting matrix nt is slightly more than twice the size of ds: nt is a full n x n matrix, whereas ds stores only the n(n-1)/2 entries below the diagonal.
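The sizes quoted above can be checked with a little arithmetic. The sketch below is in Python rather than R, purely for the arithmetic. It assumes 8 bytes per stored double and ignores R's small per-object header. The helper names are made up for illustration.

```python
def dist_bytes(n, bytes_per_value=8):
    """Bytes for the n*(n-1)/2 values a dist-style object stores (lower triangle, no diagonal)."""
    return n * (n - 1) // 2 * bytes_per_value


def full_matrix_bytes(n, bytes_per_value=8):
    """Bytes for a dense n x n matrix of doubles."""
    return n * n * bytes_per_value


def gib(num_bytes):
    """Convert bytes to GiB (2**30 bytes), the unit R reports as 'Gb'."""
    return num_bytes / 2**30


n = 16809
print(f"dist object: {gib(dist_bytes(n)):.2f} GiB")
print(f"full matrix: {gib(full_matrix_bytes(n)):.2f} GiB")
```

For n = 16809 this gives roughly 1.05 GiB for the dist object, in line with the 1.1 Gb in the error message, and a bit over 2.1 GiB for the full matrix.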

The O/S has a principal limit on RAM addressing (smaller for a 32-bit system, larger for a 64-bit system).
The O/S also has a design-based limit on the maximum RAM a single process can allocate (and kills the process beyond that).
I had the same in-RAM constraints in Python and went beyond them.
Sure, at some cost, but it was a worthwhile piece of experience.
Python's numpy has a wonderful feature for this very scenario seamlessly built in: numpy.memmap. The word seamlessly is intentionally emphasised, as it is of core importance for your problem re-formulation / re-design costs. There are tools available, but it will take your time to master them and to re-design your algorithm (libraries et al) so that it can use the new tools, guess what, SEAMLESSLY. This is the hidden part of the iceberg.
Handy R tools available:
filebacked.big.matrix (from the bigmemory package), which also supports HPC cluster-wide sharing for distributed processing (thus addressing both the PSPACE and PTIME dimensions of the HPC processing challenge, unless you unfortunately hit the filesystem's file-size ceiling)
ff, which allows
library(ff)
pt_coords <- ff( vmode = "double", dim = c(16809, 3), initdata = 0 )
pt_dists <- ff( vmode = "double", dim = c(16809, 16809), initdata = -1 )
and lets you work with it in a simple, matrix-like [row, column] mode to fill in the points and process their pairwise distances et al,
?ffsave for further details on saving your resulting distances data
and last, but not least
mmap + indexing
Parallel? No. Distributed? Yes, that might help with PTIME:
As noted with filebacked.big.matrix, there are chances to split the computational PSPACE into smaller segments for distributed processing and so reduce the PTIME. In principle, though, this is just a concurrent (re)use of available resources, not [ PARALLEL ] system behaviour. It has to be admitted that a lot of marketing texts (and, sadly, even technology marketing has joined this knowingly incorrect practice) misuse the words parallel / parallelism where merely concurrent system behaviour is observed; there are not many real, true-PARALLEL systems.
Conclusion:
Big matrices are doable in R well beyond the in-RAM limits; select the tools most suitable for your problem domain and harness all the HPC resources you can.
Error: cannot allocate vector of size 1.1 Gb is solved.
Nothing but resources imposes limits and delays on our computing-ready tasks, so do not hesitate to make your move while computing resources are still available for your project. Otherwise you will find yourself with all the re-engineered software ready, but waiting in a queue for the computing resources.
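To make the "mmap + indexing" idea a bit more concrete, here is a minimal sketch using only Python's standard library. It is not numpy.memmap and not the R ff / bigmemory API. It only shows the principle: pairwise distances go into a file-backed buffer, so the operating system pages them in and out instead of holding everything in RAM. The function names and the on-disk layout (one 8-byte double per pair, upper triangle row by row) are assumptions made for this illustration.

```python
import math
import mmap
import os
import struct
import tempfile

VALUE = struct.Struct("<d")  # one little-endian 8-byte double per stored distance


def condensed_index(i, j, n):
    """0-based slot of the pair (i, j), i < j, in a row-by-row upper-triangle layout."""
    return i * n - i * (i + 1) // 2 + (j - i - 1)


def build_distance_file(points, path):
    """Write all pairwise Euclidean distances of `points` (at least 2) to a file-backed buffer."""
    n = len(points)
    n_pairs = n * (n - 1) // 2
    with open(path, "wb") as fh:
        fh.truncate(n_pairs * VALUE.size)  # reserve the space on disk
    with open(path, "r+b") as fh, mmap.mmap(fh.fileno(), 0) as buf:
        for i in range(n - 1):
            for j in range(i + 1, n):
                d = math.dist(points[i], points[j])
                VALUE.pack_into(buf, condensed_index(i, j, n) * VALUE.size, d)
        buf.flush()
    return n_pairs


def read_distance(path, i, j, n):
    """Look up one distance without loading the whole file."""
    if i == j:
        return 0.0
    if i > j:
        i, j = j, i
    with open(path, "rb") as fh, mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as buf:
        return VALUE.unpack_from(buf, condensed_index(i, j, n) * VALUE.size)[0]


# Tiny usage example with three 3-D points
points = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 1.0)]
path = os.path.join(tempfile.mkdtemp(), "dists.bin")
build_distance_file(points, path)
print(read_distance(path, 1, 0, len(points)))
```

The pure-Python double loop is of course far too slow for 16809 points. The point of the sketch is the storage pattern, which is what ff and filebacked.big.matrix provide in R with matrix-like indexing on top.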

Related

Chainer: ParallelUpdater performance vs MultiprocessUpdater

I'd like to train a CNN on the CIFAR10 dataset with chainer on multiple GPUs on a single node. I tried adapting this example to use ParallelUpdater, in a manner identical to the mnist data parallel example, but training performance was very poor: slower than training on one GPU, even though all 8 GPUs were being utilized. I changed to MultiprocessParallelUpdater and performance (iters/sec) was much better.
Bad:
num_gpus = 8
chainer.cuda.get_device_from_id(0).use()
train_iter = chainer.iterators.SerialIterator(train, batch_size)
if num_gpus > 0:
    updater = training.updater.ParallelUpdater(
        train_iter,
        optimizer,
        devices={('main' if device == 0 else str(device)): device for device in range(num_gpus)},
    )
else:
    updater = training.updater.StandardUpdater(train_iter, optimizer, device=0)
Good:
num_gpus = 8
devices = range(num_gpus)
train_iters = [chainer.iterators.MultiprocessIterator(i, batch_size, n_processes=num_gpus)
               for i in chainer.datasets.split_dataset_n_random(train, len(devices))]
test_iter = chainer.iterators.MultiprocessIterator(test, batch_size, repeat=False, n_processes=num_gpus)
device = 0 if num_gpus > 0 else -1  # -1 indicates CPU, 0 indicates first GPU device.
if num_gpus > 0:
    updater = training.updaters.MultiprocessParallelUpdater(train_iters, optimizer, devices=range(num_gpus))
else:
    updater = training.updater.StandardUpdater(train_iters[0], optimizer, device=device)
I also ran this benchmarking script with 8 GPUs, using the ParallelUpdater, but performance was also very poor: https://github.com/mitmul/chainer-cifar10/blob/master/train.py
My question is: how can I get good performance from ParallelUpdater, and what might I be doing wrong with it?
Thanks!
When using multiple GPUs, there is some communication overhead, so each iteration could be slower.
If you are using a data-parallel method, you can use a much larger batch size and a larger learning rate, which could accelerate your training.
I am not so familiar with ParallelUpdater, so my understanding might be wrong.
I guess the purpose of ParallelUpdater is not speed; instead, its main purpose is to use memory efficiently to compute a large-batch gradient.
Reading the source code, the model update is done in a Python for loop, so due to the GIL (Global Interpreter Lock) I guess the computation itself is not done in parallel.
https://github.com/chainer/chainer/blob/master/chainer/training/updaters/parallel_updater.py#L118
As written, you can use MultiprocessParallelUpdater if you want the speed benefit of using multiple GPUs.
Also, you can consider using ChainerMN, an extension library for multi-GPU training with chainer (see its GitHub repository and documentation).

BDgraph R package producing different (but consistent) results on different OSs

I'm producing samples from a G-Wishart distribution (for example Mohammadi and Wit (2015) and Mohammadi et al. (2017) ) using the BDgraph package in R, but I'm getting different results from one OS to another.
The results are however consistent on the same OS across different machines!
To see this (and to give a minimum reproducible example) I'll sample from the rgwish function on one OS (say linux)
library(BDgraph)
N = 10000
s=7
nu = s+5
m = sample(5:50,s,replace = TRUE)
G = matrix(nrow = s,ncol = s,
c(0,1,0,0,0,0,0,
0,0,1,1,0,0,0,
0,0,0,1,1,1,0,
0,0,0,0,1,1,0,
0,0,0,0,0,1,0,
0,0,0,0,0,0,1,
0,0,0,0,0,0,0))
sample_linux <- rgwish( n = N, adj.g = G, b = nu - s + 1 , D = diag(m,s,s) )
save.image("foo.RData")
I'll then save the resulting samples and the parameters somewhere. Reboot on (say) Windows and run
load("foo.RData")
library(BDgraph)
sample_win <- rgwish( n = N, adj.g = G, b = nu - s + 1 , D = diag(m,s,s) )
plot( density( sample_linux[7,7,],n=2024), type="l")
lines( density( sample_win[7,7,],n=2024 ) ,col="red" )
The two marginal distributions (of this last diagonal element in this example) are clearly different in my experience.
If I however repeat the procedure on another machine with linux installed the two samples coincide.
The underlying graph G doesn't seem to matter. I've tried both decomposable and non-decomposable graphs and different formats for the adjacency matrix (with diagonal or not, symmetric or upper triangular, etc.), although the one here seems to be the preferred format, and inside the rgwish function the authors correct for it anyway.
R version is 3.4.1 on all the machines, and BDgraph and all connected packages are at their latest available version*.
For those who might be curious OSX gives a consistently different third set of answers...
The only thing changing that I can think of are the BLAS and LAPACK libraries, but I haven't installed any "experimental"/weird package, openBLAS on both my linux systems and I don't even know which one on Windows (the one R comes with in the binaries from CRAN)...
EDIT: I suppose that there wasn't really a question, so...what do you think of it? Any idea why this could happen? Any idea how to solve the issue?
Until proven wrong I'll assume I'm the one doing something wrong, either in sampling or in verifying, so I decided to write here before contacting the maintainer of the package directly.
*(igraph compiled from github in both cases as normal install on linux fails.)
Problem solved from (I believe) version 2.42 of the package.
The issue was with sampling random numbers inside some OMP parallel region. Linux and MacOSX could make use of OMP while my version under Windows couldn't, hence different results under different OSs (the Windows version was correct for reference).
The author of the package figured out the problem and provided the fix which will be available from the next release at the time of this answer.

R running out of memory during time series distance computation

Problem description
I have 45000 short time series (length 9) and would like to compute the distances for a cluster analysis. I realize that this will result in (the lower triangle of) a matrix of size 45000x45000, a matrix with more than 2 billion entries. Unsurprisingly, I get:
> proxy::dist(ctab2, method="euclidean")
Error: cannot allocate vector of size 7.6 Gb
What can I do?
Ideas
Increase available/addressable memory somehow? However, these 7.6G are probably beyond some hard limit that cannot be extended? In any case, the system has 16GB memory and the same amount of swap. By "Gb", R seems to mean Gigabyte, not Gigabit, so 7.6Gb puts us already dangerously close to a hard limit.
Perhaps a different distance computation method instead of euclidean, say DTW, might be more memory efficient? However, as explained below, the memory limit seems to be the resulting matrix, not the memory required at computation time.
Split the dataset into N chunks and compute the matrix in N^2 parts (actually only those parts relevant for the lower triangle) that can later be reassembled? (This might look similar to the solution to a similar problem proposed here.) It seems to be a rather messy solution, though. Further, I will need the 45K x 45K matrix in the end anyway. However, this seems to hit the limit. The system also gives the memory allocation error when generating a 45K x 45K random matrix:
> N=45000; memorytestmatrix <- matrix( rnorm(N*N,mean=0,sd=1), N, N)
Error: cannot allocate vector of size 15.1 Gb
30K x 30K matrices are possible without problems, R gives the resulting size as
> print(object.size(memorytestmatrix), units="auto")
6.7 Gb
1 Gb more and everything would be fine, it seems. Sadly, I do not have any large objects that I could delete to make room. Also, ironically,
> system('free -m')
Warning message:
In system("free -m") : system call failed: Cannot allocate memory
I have to admit that I am not really sure why R refuses to allocate 7.6 Gb; the system certainly has more memory, although not a lot more. ps aux shows the R process as the single biggest memory user. Maybe there is an issue with how much memory R can address even if more is available?
Related questions
Answers to other questions related to R running out of memory, like this one, suggest using more memory-efficient methods of computation.
This very helpful answer suggests to delete other large objects to make room for the memory intensive operation.
Here, the idea to split the data set and compute distances chunk-wise is suggested.
Software & versions
R version is 3.4.1. System kernel is Linux 4.7.6, x86_64 (i.e. 64bit).
> version
_
platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 4.1
year 2017
month 06
day 30
svn rev 72865
language R
version.string R version 3.4.1 (2017-06-30)
nickname Single Candle
Edit (Aug 27): Some more information
Updating the Linux kernel to 4.11.9 has no effect.
The bigmemory package may also run out of memory. It uses shared memory in /dev/shm/ of which the system by default (but depending on configuration) allows half the size of the RAM. You can increase this at runtime by doing (for instance) mount -o remount,size=12Gb /dev/shm, but this may still not allow usage of 12Gb. (I do not know why, maybe the memory management configuration is inconsistent then). Also, you may end up crashing your system if you are not careful.
R apparently actually allows access to the full RAM and can create objects up to that size. It just seems to fail for particular functions such as dist. I will add this as an answer, but my conclusions are a bit based on speculation, so I do not know to what degree this is right.
R apparently actually allows access to the full RAM. This works perfectly fine:
N=45000; memorytestmatrix <- matrix(nrow=N, ncol=N)
This is the same thing I tried before as described in the original question, but with a matrix of NAs instead of rnorm random variates. Assigning a floating-point value to one of the elements (memorytestmatrix[1,1] <- 0.5) still works and recasts the matrix as a numeric (double) matrix.
Consequently, I suppose, you can have a matrix of that size, but you cannot do it the way the dist function attempts to do it. A possible explanation is that the function operates with multiple objects of that size in order to speed the computation up. However, if you compute the distances element-wise and change the values in place, this works.
library(mefa) # for the vec2dist function
euclidian <- function(series1, series2) {
  return((sum((series1 - series2)^2))^.5)
}
mx = nrow(ctab2)
distMatrixE <- vec2dist(0.0, size=mx)
for (coli in 1:(mx-1)) {
  for (rowi in (coli+1):mx) {
    # Element indices in dist objects count the rows down column by column, from left to right, in the lower triangular matrix without the main diagonal.
    # From row and column indices, the element index for the dist object is computed like so:
    element <- (mx^2-mx)/2 - ((mx-coli+1)^2 - (mx-coli+1))/2 + rowi - coli
    # ... and now, we replace the distances in place
    distMatrixE[element] <- euclidian(ctab2[rowi,], ctab2[coli,])
  }
}
(Note that addressing in dist objects is a bit tricky, since they are not matrices but 1-dimensional vectors of size (N²-N)/2 recast as lower triangular matrices of size N x N. If we go through rows and columns in the right order, it could also be done with a counter variable, but computing the element index explicitly is clearer, I suppose.)
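As a sanity check on the index formula, the following small sketch (Python, 1-based indices kept exactly as in the R loop) confirms that the explicit formula and a simple running counter agree. The function names are hypothetical and exist only for this check.

```python
def dist_element(rowi, coli, mx):
    """1-based position in a dist vector of the pair (rowi, coli), with 1 <= coli < rowi <= mx."""
    total = (mx**2 - mx) // 2                                  # number of stored entries
    remaining = ((mx - coli + 1)**2 - (mx - coli + 1)) // 2    # entries from column coli onwards
    return total - remaining + rowi - coli


def column_order_pairs(mx):
    """Pairs (rowi, coli) in the order a dist object stores them: column by column, rows downwards."""
    return [(rowi, coli) for coli in range(1, mx) for rowi in range(coli + 1, mx + 1)]


mx = 6
indices = [dist_element(rowi, coli, mx) for rowi, coli in column_order_pairs(mx)]
print(indices)  # consecutive integers starting at 1 if the formula matches the storage order
```

Since the formula yields consecutive positions when the loops run in this order, a counter incremented in the innermost loop would address the same elements.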
Also note that it may be possible to speed this up by making use of sapply by computing more than one value at a time.
There exist good algorithms that do not need a full distance matrix in memory.
For example, SLINK, DBSCAN and OPTICS.
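To illustrate how this can work, here is a minimal sketch of SLINK (single-linkage clustering in the pointer representation), as I understand Sibson's algorithm. It is written in plain Python for readability, not speed. It computes each distance when it is needed and keeps only three vectors of length n, never the full matrix. The function name and the return format are choices made for this example.

```python
import math


def slink(points, dist=math.dist):
    """Single-linkage clustering in pointer representation, using O(n) memory.

    Returns (pi, lam): lam[j] is the height at which point j is merged into a
    cluster containing a higher-indexed point, and pi[j] is the highest index
    in that cluster. The last point has lam = inf.
    """
    n = len(points)
    pi = [0] * n
    lam = [math.inf] * n
    for i in range(n):
        pi[i] = i
        lam[i] = math.inf
        # distances from the new point i to all earlier points (one row, not a matrix)
        m = [dist(points[j], points[i]) for j in range(i)]
        for j in range(i):
            if lam[j] >= m[j]:
                m[pi[j]] = min(m[pi[j]], lam[j])
                lam[j] = m[j]
                pi[j] = i
            else:
                m[pi[j]] = min(m[pi[j]], m[j])
        # relabel pointers that now point at something merged later than i
        for j in range(i):
            if lam[j] >= lam[pi[j]]:
                pi[j] = i
    return pi, lam


# Four points on a line; the finite lam values are the dendrogram merge heights
points = [(0.0,), (1.0,), (3.0,), (7.0,)]
pi, lam = slink(points)
print(sorted(lam[:-1]))
```

Run time is still quadratic in the number of points, but memory stays linear, which is the limiting factor for 45000 series.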

How to speed up the generation of a latin hypercube (LHS) design

I'm trying to generate an optimized LHS (Latin Hypercube Sampling) design in R, with sample size N = 400 and d = 7 variables, but it's taking forever. My PC is an HP Z820 workstation with 12 cores, 32 GB RAM, Windows 7 64 bit, and I'm running Microsoft R Open, which is a multicore version of R. The code has been running for half an hour, but I still don't see any results:
library(lhs)
lhs_design <- optimumLHS(n = 400, k = 7, verbose = TRUE)
It seems a bit weird. Is there anything I could do to speed it up? I heard that parallel computing may help with R, but I don't know how to use it, and I have no idea if it speeds up only code that I write myself, or if it could speed up an existing package function such as optimumLHS. I don't necessarily have to use the lhs package. My only requirement is that I would like to generate an LHS design which is optimized in terms of the S-optimality criterion, the maximin metric, or some other similar optimality criterion (thus, not just a vanilla LHS). If worse comes to worst, I could even accept a solution in a different environment than R, but it must be either MATLAB or an open source environment.
Just a little code to check performance.
library(lhs)
library(ggplot2)
performance<-c()
for(i in 1:100){
  ptm<-proc.time()
  invisible(optimumLHS(n = i, k = 7, verbose = FALSE))
  time<-print(proc.time()-ptm)[[3]]
  performance<-rbind(performance,data.frame(time=time, n=i))
}
ggplot(performance,aes(x=n,y=time))+
  geom_point()
Not looking too good. It seems to me you might be in for a very long wait indeed. Based on the algorithm, I don't think there is a way to speed things up via parallel processing, since to optimize the separation between sample points, you need to know the location of all the sample points. I think your only option for speeding this up will be to take a smaller sample or get (access to) a faster computer. It strikes me that this is something that only really has to be done once, so is there a resource where you could just get a properly sampled and optimized distribution already computed?
So it looks like ~650 hours for my machine, which is very comparable to yours, to compute with n=400.
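If a fully optimized design is out of reach, a much cheaper compromise might be acceptable: generate several plain random LHS designs and keep the one with the largest minimum pairwise distance (the maximin criterion mentioned in the question). The sketch below does that in plain Python. To be clear, this is not the algorithm optimumLHS uses and it will generally give a worse design. The function names, the number of candidates and the seed are all arbitrary choices for illustration.

```python
import math
import random


def random_lhs(n, k, rng):
    """n points in [0, 1)^k with exactly one point per stratum [i/n, (i+1)/n) in every dimension."""
    columns = []
    for _ in range(k):
        strata = list(range(n))
        rng.shuffle(strata)  # random assignment of strata to points
        columns.append([(s + rng.random()) / n for s in strata])
    return [tuple(col[i] for col in columns) for i in range(n)]


def min_pairwise_distance(design):
    """Maximin criterion: the smallest Euclidean distance between any two design points."""
    return min(math.dist(p, q) for a, p in enumerate(design) for q in design[a + 1:])


def best_of_random_lhs(n, k, n_candidates=20, seed=0):
    """Keep the candidate LHS whose closest pair of points is farthest apart."""
    rng = random.Random(seed)
    candidates = (random_lhs(n, k, rng) for _ in range(n_candidates))
    return max(candidates, key=min_pairwise_distance)


design = best_of_random_lhs(n=40, k=7, n_candidates=10, seed=1)
print(len(design), round(min_pairwise_distance(design), 3))
```

Each candidate costs one pass over all point pairs, so the work grows linearly with the number of candidates and they could be evaluated independently of each other.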

Tracking memory usage and garbage collection in R

I am running functions which are deeply nested and consume quite a bit of memory as reported by the Windows task manager. The output variables are relatively small (1-2 orders of magnitude smaller than the amount of memory consumed), so I am assuming that the difference can be attributed to intermediate variables assigned somewhere in the function (or within sub-functions being called) and a delay in garbage collection. So, my questions are:
1) Is my assumption correct? Why or why not?
2) Is there any sense in simply nesting calls to functions more deeply rather than assigning intermediate variables? Will this reduce memory usage?
3) Suppose a scenario in which R is using 3GB of memory on a system with 4GB of RAM. After running gc(), it's now using only 2GB. In such a situation, is R smart enough to run garbage collection on its own if I had, say, called another function which used up 1.5GB of memory?
There are certain datasets I am working with which are able to crash the system as it runs out of memory when they are processed, and I'm trying to alleviate this. Thanks in advance for any answers!
Josh
1) Memory used to represent objects in R and memory marked by the OS as in-use are separated by several layers (R's own memory handling, when and how the OS reclaims memory from applications, etc.). I'd say (a) I don't know for sure but (b) at times the task manager's notion of memory use might not accurately reflect the memory actually in use by R, but that (c) yes, probably the discrepancy you describe reflects memory allocated by R to objects in your current session.
2) In a function like
f = function() { a = 1; g=function() a; g() }
invoking f() prints 1, implying that memory used by a is still being marked as in use when g is invoked. So nesting functions doesn't help with memory management, probably the reverse.
Your best bet is to clean-up or re-use variables representing large allocations before making more large allocations. Appropriately designed functions can help with this, e.g.,
f = function() { m = matrix(0, 10000, 10000); 1 }
g = function() { m = matrix(0, 10000, 10000); 1 }
h = function() { f(); g() }
The large memory of f is no longer needed by the time f returns, and so is available for garbage collection if the large memory required for g necessitates this.
3) If R tries to allocate memory for a variable and can't, it'll run its garbage collector and try again. So you don't gain anything by running gc() yourself.
I'd make sure that you've written memory efficient code, and if there are still issues I'd move to a 64bit platform where memory is less of an issue.
R has facilities for memory profiling, but it needs to be built with that enabled. While we enable that for Debian / Ubuntu, I do not know what the default for Windows is.
Usage of memory profiling is discussed (briefly) in the 'Writing R Extensions' manual.
Coping with (limited) memory on a 32-bit system (and particularly Windows) has its challenges. Most people will recommend that you switch to a system with as much RAM as possible running a 64-bit OS.

Resources