Large arrays in Julia - bigdata

I have a 10000x10000 array in Julia, say A=rand(10000,10000). How can I store that large array so I can work with it in an IDE like Atom/Juno, performing matrix operations, determinants, eigenvalues and so on? Or if I transfer that array to R, is there a way to work with such a large array in R?

If your data is sparse (not all cells have values) you can store it as a sparse matrix, which will greatly reduce the memory footprint (see https://docs.julialang.org/en/v1/stdlib/SparseArrays/). Whether or not it fits into memory also depends on what the elements of the Matrix are. E.g. can you represent the values with Int8, or do you need 64-bit precision elements? A Matrix is not just a Matrix.
On a more general note, if your objects become so big they don't fit into memory, you can write them to disk and "memory-map" them; that way you can use on-disk matrices for anything you would use a normal Matrix for. You can check the documentation here: https://docs.julialang.org/en/v1/stdlib/Mmap

Related

Is calculating the index in an array more efficient than letting the compiler do it?

I'm trying to generalize a neural network function to arbitrarily many layers, so I need multiple matrices to hold the weights for each neuron in each layer. I was originally declaring explicit matrix objects in R to hold my weights for each layer. Instead of having one matrix per layer, I thought of a way (not claiming it's original) to store all of my weights in a single array, and defined an "indexing function" to map a weight to its appropriate index in the array.
I defined the function as follows:
where w_{ijk} is the k-th weight of the j-th neuron in the i-th layer and L(r) is the number of neurons in layer r. (After writing these definitions, I realized that Stack Overflow doesn't allow LaTeX like MathOverflow does, which is unfortunate.)
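For concreteness, one possible version of such an indexing function (a sketch assuming fully connected layers, where every neuron in layer i has one weight per neuron in layer i-1; the names and layer sizes here are illustrative, not the original definition):
L <- c(4, 8, 8, 2)                               # made-up layer sizes
weights <- numeric(sum(L[-1] * L[-length(L)]))   # one flat array for all weights
weight_index <- function(i, j, k, L) {
  # weights belonging to all layers before layer i (layer 1 has no weights)
  before <- if (i > 2) sum(L[2:(i - 1)] * L[1:(i - 2)]) else 0
  # skip the neurons before neuron j in layer i, then take position k
  before + (j - 1) * L[i - 1] + k
}
weights[weight_index(3, 2, 5, L)]                # 5th weight of neuron 2 in layer 3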
Now the question is: is it more efficient to compute the index of my weights in this way, or is it actually less efficient?
After looking up how indices are computed for arrays in general, this is essentially what is done at compile time anyway if I just keep one matrix per layer holding the weights, so it seems like I may just be making my code overly complicated and harder to understand if there's no difference in time efficiency.
TL;DR: use the matrices; it's easier to understand and takes advantage of optimized CPU instructions.
In computer science parlance, the efficiency (scalability) of algorithms is reasoned about using Big O cost, and a score can be given to both time and space complexity.
Using Big O notation, let's compare the two approaches:
Array Approach
time complexity:
Array index access is O(1): no matter how large an array becomes, it is just as computationally easy to access an element given its index.
As you've created a function to compute the index of the k-th weight, this adds a small cost, but it would still run in constant O(1) time since it is a closed-form mathematical expression, so it's negligible.
space complexity:
O(N) where N is the number of weights across all layers.
Matrices Approach
time complexity:
A matrix is essentially a 2D array with O(1) access.
space complexity:
O(N + M), where N is the number of neurons and M is the number of weights.
Conceptually, we can see that the two approaches have an equivalent time and space complexity score.
However, there are other trade-offs involved (and as a good SO-er I must inform you of them).
When it comes to working with the data, the array approach is less efficient because it circumvents the opportunity for vectorised (SIMD) operations. As @liborm alluded to, these vectorised operations are handled by lower-level system libraries like LAPACK/BLAS, which "batch" CPU instructions for matrix operations (much less overhead than dispatching a new instruction for every element).
Instead of having one matrix per layer, I thought of a way ... to store all of my weights in a single array
It's hard to see why you would opt for the latter, as it requires you to create a bespoke indexing function. Maybe it's nicer to think of all your weights being in one long array? However, I would argue that the mental load required to maintain the array mapping is higher than having one matrix dedicated to each layer.
A hash-table-like structure of matrices would be much easier to reason about:
layers <- list(layer1 = matrix(...), layer2 = matrix(...), layerN = matrix(...))
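For instance, a minimal sketch with made-up layer sizes (the weight initialisation here is arbitrary):
sizes  <- c(4, 8, 8, 2)                  # neurons per layer
layers <- lapply(seq_len(length(sizes) - 1), function(i)
  matrix(rnorm(sizes[i + 1] * sizes[i]), nrow = sizes[i + 1], ncol = sizes[i]))
names(layers) <- paste0("layer", seq_along(layers))
layers$layer2[3, ]                       # all weights of neuron 3 in the second weight matrix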
Further reading
http://www.noamross.net/blog/2014/4/16/vectorization-in-r--why.html
There are many factors to take into consideration with each of the approaches. I'm not familiar with R, but I'm assuming matrices' buffers are represented as one-dimensional arrays in memory (even if they are written as two-dimensional arrays in the underlying C implementation, the compiler stores them as a one-dimensional array in memory).
The overall outline of the memory operations is:
Case: several matrices per layer
Allocation of matrices: one allocation per layer, so the allocation overhead grows with the number of layers.
Accessing the indices: one O(1) offset computation per access, plus a lookup of the right matrix for the layer.
Case: one matrix for all layers + index calculation
Allocation of matrix cost: a single contiguous allocation for all weights.
Accessing each of the indices cost: one O(1) offset computation per access.
Function cost: a handful of extra arithmetic operations per access to compute the offset.
We can clearly see that the second case scales better, even though there's the additional cost of the function call.
Having said that, in general, having a statically allocated array with all the weights for all the layers should be faster.
In most cases a computer's bottleneck is memory bandwidth, and the best way to counteract this is to minimize the number of memory accesses.
With this in mind there's another, more primitive reason why the second approach will probably be faster: caches.
Here's a good explanation of the performance difference when accessing a two-dimensional array in a loop, by Good Ol' Bob Martin.
TL;DR: Caches take advantage of the principle of locality, so having memory accesses spatially close to each other (as you would in one single array, accessed in a cache-friendly way as explained in Bob Martin's answer) gives better performance than having them spatially separated (in several distinct arrays).
PS: I also recommend benchmarking both approaches and comparing, since these cache nuances are machine-dependent. It might be the case that the dataset/NN is small enough to fit completely in RAM, or even in cache on a very powerful server.
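For example, a rough benchmarking sketch along those lines (using microbenchmark; the size and the crossprod-style comparison are arbitrary choices, not a full NN):
library(microbenchmark)
n <- 512
W <- matrix(rnorm(n * n), n, n)
x <- rnorm(n)
w <- as.vector(W)                        # the same weights flattened (column-major)
microbenchmark(
  blas_matrix  = crossprod(W, x),        # t(W) %*% x, handled by BLAS
  manual_index = vapply(seq_len(n), function(j)
    sum(w[((j - 1) * n + 1):(j * n)] * x), numeric(1)),
  times = 20
)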
I'm sure you want to use some kind of native array objects, so you get the speedups provided by BLAS/LAPACK implementations (see e.g. the Intel MKL discussion here if you're on Windows). Most of the time in NN evaluation is spent in matrix multiplications (like SGEMM), and this is where BLAS implementations like Intel MKL can be an order of magnitude faster.
That is, even if the hand-coded indices for your single-array multi-layer network were super fast, you wouldn't be able to use them with the optimised multiplication routines, which would make your whole network significantly slower. Use the native array objects and create a multi-layer abstraction on top of them.
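A minimal sketch of that abstraction (reusing the made-up layer sizes from the earlier sketch; tanh as the activation is an arbitrary choice):
sizes  <- c(4, 8, 8, 2)
layers <- lapply(seq_len(length(sizes) - 1), function(i)
  matrix(rnorm(sizes[i + 1] * sizes[i]), sizes[i + 1], sizes[i]))
forward <- function(x, layers, act = tanh) {
  for (W in layers) x <- act(W %*% x)    # each step is a BLAS-backed matrix product
  x
}
forward(rnorm(4), layers)                # the 2 outputs of the last layer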
But actually, if you want speed and usability (and to really build some NN models), you should consider using something like the R interface to TensorFlow. As a bonus you'll get things like running on the GPU for free.
Nice puzzle. If you compute the index yourself, that calculation happens at runtime, while the compiler's own index arithmetic is fixed at compile time; I just want to understand how you would let the compiler compute it. If you need to play with this information at any later point, I would suggest a hashmap-like mechanism. I had done the same for a similar need.

JuMP with sparse matrices?

How do I deal with sparse matrices in JuMP?
For example, suppose I want to impose a constraint of the form:
A * x == 0
where A is a sparse matrix and x a vector of variables. I assume that the sparsity of A could be exploited to make the optimization faster. How can I take advantage of this in JuMP?
JuMP already benefits from sparse matrices in different ways. I've not checked the source, but refer to a paper cited by JuMP.jl:
In the case of LP, the input data structures are the vectors c and b
and the matrix A in sparse format, and the routines to generate these
data structures are called matrix generators
One point to note is that the main task of algebraic modeling languages (AMLs) like JuMP is to generate input data structures for solvers. AMLs like JuMP do not solve the generated problems themselves; they call appropriate standard solvers to do the task.

Why are matrices (in R) so much slower and larger than image files that contain the same data?

I am working with raw imaging mass spectrometry data. This kind of data is very similar to a traditional image file, except that rather than 3 colour channels, we have channels corresponding to the number of ions we are measuring (in my case, 300). The data is originally stored in a proprietary format, but can be exported to a .txt file as a table with the format:
x, y, z, i (intensity), m (mass)
As you can imagine, the files can be huge. A typical image might be 256 x 256 x 20, giving 1310720 pixels. If each has 300 mass channels, this gives a table with 393216000 rows and 5 columns. This is huge! And consequently won't fit into memory. Even if I select smaller subsets of the data (such as a single mass), the files are very slow to work with. By comparison, the proprietary software is able to load up and work with these files extremely quickly, for example just taking a second or two to open up a file into memory.
I hope I have made myself clear. Can anyone explain this? How can it be that two files containing essentially the exact same data can have such different sizes and speeds? How can I work with a matrix of image data much faster?
Can anyone explain this?
Yep
How can it be that two files containing essentially the exact same data can have such different sizes and speeds?
R uses doubles as its default numeric type. Thus, just the storage for your data frame is about 16 Gb. The proprietary software is most likely using float as the underlying type, cutting the memory requirement to about 8 Gb.
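A quick back-of-the-envelope check of those figures (assuming the 393216000-row, 5-column table from the question is stored as plain numeric columns):
rows <- 256 * 256 * 20 * 300     # 393,216,000 rows in the exported table
rows * 5 * 8 / 1e9               # ~15.7 GB with 8-byte doubles (R's default)
rows * 5 * 4 / 1e9               # ~7.9 GB if the same values were 4-byte floats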
How can I work with a matrix of image data much faster?
Buy a computer with 32 Gb of RAM. Even with a 32 Gb computer, think about using data.table in R with operations done by reference, because R likes to copy data frames.
Or you might want to move to Python/pandas for processing, with explicit use of dtype=float32.
UPDATE
If you want to stay with R, take a look at the bigmemory package (link), though I would say dealing with it is not for people with a weak heart.
The answer to this question turned out to be a little esoteric and pretty specific to my data-set, but may be of interest to others. My data is very sparse - i.e. most of the values in my matrix are zero. Therefore, I was able to significantly reduce the size of my data using the Matrix package (capitalisation important), which is designed to more efficiently handle sparse matrices. To implement the package, I just inserted the line:
data <- Matrix(data)
The amount of space saved will vary depending on the sparsity of the dataset, but in my case I reduced 1.8 GB to 156 Mb. A Matrix behaves just like a matrix, so there was no need to change my other code, and there was no noticeable change in speed. Sparsity is obviously something that the proprietary format could take advantage of.
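A toy illustration of the effect (the size and sparsity here are made up; the saving depends entirely on how many non-zero entries you have):
library(Matrix)
dense <- matrix(0, nrow = 5000, ncol = 5000)
dense[sample(length(dense), 10000)] <- rnorm(10000)   # ~0.04% non-zero
sparse <- Matrix(dense, sparse = TRUE)
object.size(dense)     # ~200 MB of doubles
object.size(sparse)    # a small fraction of that (compressed sparse column storage)
sparse %*% rnorm(5000) # behaves like an ordinary matrix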

dist() function in R: vector size limitation

I was trying to draw a hierarchical clustering of some samples (40 of them) over some features (genes). I have a big table with 500k rows and 41 columns (the first one is the name), and when I tried
d<-dist(as.matrix(file),method="euclidean")
I got this error
Error: cannot allocate vector of size 1101.1 Gb
How can I get around this limitation? I googled it and came across the ff package in R, but I don't quite understand whether that could solve my issue.
Thanks!
Generally speaking, hierarchical clustering is not the best approach for dealing with very large datasets.
In your case, however, there is a different problem. If you want to cluster samples, the structure of your data is wrong. Observations should be represented as the rows, and gene expression (or whatever kind of data you have) as the columns.
Let's assume you have data like this:
data <- as.data.frame(matrix(rnorm(n=500000*40), ncol=40))
What you want to do is:
# Create transposed data matrix
data.matrix.t <- t(as.matrix(data))
# Create distance matrix
dists <- dist(data.matrix.t)
# Clustering
hcl <- hclust(dists)
# Plot
plot(hcl)
NOTE
You should remember that euclidean distances can be rather misleading when you work with high-dimensional data.
When dealing with large data sets, R is not the best choice.
The majority of methods in R seem to be implemented by computing a full distance matrix, which inherently needs O(n^2) memory and runtime. Matrix-based implementations don't scale well to large data, unless the matrix is sparse (which a distance matrix by definition isn't).
I don't know if you realized that 1101.1 Gb is more than a terabyte. I don't think you have that much RAM, and you probably won't have the time to wait for this matrix to be computed either.
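You can check that arithmetic yourself (assuming dist stores the lower triangle as 8-byte doubles; the exact figure in the error message suggests somewhat more than 500k rows):
n <- 500000                      # number of rows fed to dist()
n * (n - 1) / 2 * 8 / 2^30       # ~931 Gb for the lower triangle alone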
For example, ELKI is much more powerful for clustering, as you can enable index structures to accelerate many algorithms. This saves both memory (usually down to linear memory usage, for storing the cluster assignments) and runtime (usually down to O(n log n), one O(log n) operation per object).
But of course, it also varies from algorithm to algorithm. K-means for example, which needs point-to-mean distances only, does not need (and cannot use) an O(n^2) distance matrix.
So in the end: I don't think the memory limit of R is your actual problem. The method you want to use doesn't scale.
I just experienced a related issue, but with fewer rows (around 100 thousand, for 16 columns).
RAM size is the limiting factor.
To limit the memory requirements I used two different functions from two different packages.
From parallelDist, the function parDist() allows you to obtain the distances quite fast. It of course uses RAM during the process, but it seems that the resulting dist object takes less memory (no idea why).
Then I used the hclust() function, but from the package fastcluster. fastcluster is actually not that fast on such an amount of data, but it seems to use less memory than the default hclust().
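A minimal sketch of that combination, assuming the observations to cluster are the rows of a numeric matrix x (e.g. the transposed data from the answer above):
library(parallelDist)
library(fastcluster)
d   <- parDist(x, method = "euclidean")     # multi-threaded distance computation
hcl <- fastcluster::hclust(d, method = "complete")
plot(hcl)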
Hope this will be useful for anybody who finds this topic.

dist function with large number of points

I am using the dist {stats} function to calculate the distance between points. My problem is that I have 24469 points, and the output of the dist function gives me a vector of length 18705786 instead of a matrix. I already tried to export it with as.matrix, but the file is too large.
How can I find out which points each distance corresponds to?
For example, which(distance<=700) gives me positions in the vector, but how can I find out which pairs of points those distances correspond to?
There are some things you could try, depending on what exactly you need:
Calculate the distances in a loop, and only keep those that match the criterion (their positions can be mapped back to point pairs; see the sketch after this list). Especially when the number of matches is much smaller than the total size of the distance matrix, this saves a lot of RAM. This loop is probably very slow if implemented in pure R; that is also why dist does not use R but, I believe, C to perform the calculations. This could mean that you get your results, but have to wait a while. Alternatively, the excellent Rcpp package would allow you to write this in C/C++, probably making it much, much faster.
Start using packages like bigmemory for storing the distance matrix. You then build it in a loop and store it iteratively in the bigmemory object (I have not worked with bigmemory before, so I don't know the exact details). After building the matrix, you can access it to extract your desired results. Effectively, all tricks for handling large data in R apply here. See e.g. the R SO posts on big data.
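On the mapping itself: with the default dist settings the vector stores the lower triangle column by column, so positions like those from which(distance<=700) can be converted back to point pairs arithmetically. A hedged sketch (worth spot-checking against as.matrix() on a small example):
pts <- matrix(rnorm(2000 * 2), ncol = 2)    # toy stand-in for your points
d   <- dist(pts)
k   <- which(d <= 0.7)                      # like your which(distance <= 700)
n   <- attr(d, "Size")
# recover, for each position in k, the column (j) and row (i) of the lower triangle
j <- ceiling((2 * n + 1 - sqrt((2 * n + 1)^2 - 8 * (n + k))) / 2 - 1)
i <- k - ((j - 1) * n - j * (j - 1) / 2) + j
pairs <- cbind(point_a = i, point_b = j)    # the two point indices for each distance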
Some interesting links (found googling for r distance matrix for large vector):
Efficient (memory-wise) function for repeated distance matrix calculations AND chunking of extra large distance matrices
(lucky you!) http://stevemosher.wordpress.com/2012/04/08/using-bigmemory-for-a-distance-matrix/
