Max Array Length in Julia

I can create an array of a million elements like this:
Array(1:1_000_000)
Vector{Int64} with 1000000 elements
but if I try to create an array of a billion elements I get this:
Array(1:1_000_000_000)
Julia has exited.
Press Enter to start a new session.
Is Julia not able to handle a billion elements in an array or what am I doing wrong here?

You are creating an Array of Int64, each of which needs to be stored in memory:
julia> sizeof(3)
8
So at some point you're bound to run out of memory: a billion Int64 values need about 8 GB for the data alone. This is not due to some inherent limit on the number of elements in an array, but rather to the size of the overall array, which in turn depends on the size of each element. Consider:
julia> sizeof(Int8(3))
1
julia> [Int8(1) for _ in 1:1_000_000_000]
1000000000-element Array{Int8,1}:
1
1
1
⋮
1
1
1
so filling the array with a smaller data type (8-bit rather than 64-bit Integer) allows me to create an array with more elements.
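To put rough numbers on this, here is a small sketch of the arithmetic. It is written in Python purely for illustration; the helper names `array_bytes` and `to_gib` are made up, and the calculation ignores any per-array header, so treat the result as a lower bound.

```python
def array_bytes(n_elements: int, bytes_per_element: int) -> int:
    """Payload size of a dense array: element count times element size."""
    return n_elements * bytes_per_element


def to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB (2**30 bytes)."""
    return n_bytes / 2**30


if __name__ == "__main__":
    n = 1_000_000_000
    # 8 and 1 are the sizes reported by sizeof for Int64 and Int8 above.
    for name, size in [("Int64", 8), ("Int8", 1)]:
        print(f"{name}: {to_gib(array_bytes(n, size)):.2f} GiB")
```

Comparing these figures with the free memory on the machine gives a quick indication of whether an allocation has any chance of succeeding.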

While there is no limit on how big an Array can be in Julia, there is obviously the limit of available RAM (mentioned in the other answer). Basically, you can assume that all your available system memory can be allocated to a Julia process. sizeof is a good way to calculate how much RAM you need.
However, if you actually do big array computing in Julia, the above limit can be circumvented in many ways:
Use massive memory machines from a major cloud computing provider. I use Julia on AWS Linux and it works like a charm - you can have up to 4TB RAM on a virtual machine and 24TB RAM on a bare metal machine. While it is not a Julia solution, sometimes it is the easiest and cheapest way to go.
Sometimes your data is sparse - you do not actually use all of those memory cells. In such cases consider SparseArrays. In other cases your sparse data is structured in some specific way (e.g. non-zero values only on the diagonal); in that case use BandedMatrices.jl. It is worth noting that there is even a Julia package for infinite algebra. Basically whatever you find at the Julia Matrices project is worth looking at.
You can use memory mapping - that means that most of your array is on disk and only some part is held in RAM. In this way you are limited by your disk space rather than the RAM.
You can use DistributedArrays.jl and have a single huge Array hosted on several machines.
Hope it will be useful for you or other people trying to do big data algebra in Julia.
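As an illustration of the memory-mapping option, here is a minimal sketch. It uses Python's standard `mmap` module only to show the idea; Julia's own Mmap standard library plays the same role there. The file name, the helper names and the 64-bit little-endian layout are assumptions made for this example.

```python
import mmap
import os
import struct
import tempfile

ITEM = struct.Struct("<q")  # one little-endian 64-bit integer (8 bytes)


def write_item(mm: mmap.mmap, index: int, value: int) -> None:
    """Store value at position index of the file-backed array."""
    ITEM.pack_into(mm, index * ITEM.size, value)


def read_item(mm: mmap.mmap, index: int) -> int:
    """Load the value at position index of the file-backed array."""
    return ITEM.unpack_from(mm, index * ITEM.size)[0]


def demo(n_elements: int = 1_000_000) -> tuple:
    """Write to both ends of a disk-backed array and read the values back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "big_array.bin")  # hypothetical file name
        with open(path, "wb") as f:
            f.truncate(n_elements * ITEM.size)  # reserve space on disk
        with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
            write_item(mm, 0, 42)
            write_item(mm, n_elements - 1, -7)
            return read_item(mm, 0), read_item(mm, n_elements - 1)


if __name__ == "__main__":
    print(demo())
```

Only the pages that are actually touched need to be resident in RAM; the operating system handles paging the rest to and from the file.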

Related

Rf_allocVector only allocates and does not zero out memory

The original motivation behind this is that I have a dynamically sized array of floats that I want to pass to R through Rcpp without incurring either the cost of zeroing out or the cost of a deep copy.
Originally I had thought that there might be some way to take a heap-allocated array, make R's gc system aware of it, and then wrap it with other data to create a "Rcpp::NumericVector", but it seems that's not possible - or not doable with my current knowledge.
However, and correct me if I'm wrong, it looks like simply constructing a NumericVector with a size N and then using it as an N-sized allocation will call R.h's Rf_allocVector, which itself does not zero out the allocated array. I tested it in a small C program that gets dyn.loaded into R, and the contents look like garbage values. I also took a peek at the assembly and there doesn't seem to be any zeroing out.
Can anyone confirm this or offer any alternate solution?
Welcome to StackOverflow.
You marked this rcpp but that is a function from the C API of R -- whereas the Rcpp API offers you its constructors which do in fact set the memory to zero:
> Rcpp::cppFunction("NumericVector goodVec(int n) { return NumericVector(n); }")
> sum(goodVec(1e7))
[1] 0
>
This creates a dynamically allocated vector using R's memory functions. The vector is indistinguishable from R's own. And it has the memory set to zero, as we use R_Calloc, which is documented in Writing R Extensions as setting the memory to zero. (We may also use memcpy() explicitly, you can check the sources.)
So in short, you just have yourself confused over what the C API of R, as well as Rcpp offer, and what is easiest to use when. Keep reading documentation, running and writing examples, and studying existing code. It's all out there!

OpenCL doesn't allow late initialization of a variable in constant space

I want to generate a matrix which will be read by many threads after its generation, so I declared it at program scope. It has to be constant, so I am just assigning values once.
1) Why does OpenCL require initialization at declaration?
2) How can I fix this issue?
1) Because you can't tell the GPU which elements are written by which threads. Constants are prepared by the preprocessor using the scalar engine, not the parallel one. The parallel engine would need N x N synchronizations to achieve that, where N is the number of threads participating in building the constant buffer.
2-a) If you want to work with constant memory, prepare a simple (__global, not constant) buffer in one kernel, then use it as a constant buffer in the next kernel (the engine puts it in the constant memory space). But constant space is small, so the matrix should be small. This needs 2 kernels, which means kernel overhead.
2-b) If cache performance is enough, just use a buffer. Then it can be done in a single kernel (the first thread group prepares the matrix; the remaining ones compute using it, not starting until the first group gives a signal using atomic functions).
2-c) If local memory is bigger than constant memory, you can use local memory and have each compute unit build the matrix for itself, so it should take the same number of cycles (maybe even fewer if you use all cores) and is probably faster than constant memory. This doesn't need communication between thread groups, so it would be fast.
2-d) If the matrix is big and you need most of the bandwidth, distribute it across all memory spaces. Example: put 1/4 of the matrix in constant memory (5x bandwidth), 1/4 in local memory (10x bandwidth), 1/4 in global memory (2x from cache performance), and the remaining data in instruction space (the instructions themselves), so multiple threads would be working on 4 different places concurrently, using all the bandwidth (constant + local + cache + instruction cache).

Append OpenCL result to list / Reduce solution space

I have an OpenCL kernel with multiple work items. Let's assume for discussion that I have a 2-D workspace with x*y elements working on an equally sized, but sparse, array of input elements. Few of these input elements produce a result that I want to keep; most don't. I want to enqueue another kernel that only takes the kept results as input.
Is it possible in OpenCL to append results to some kind of list to pass them as input to another Kernel or is there a better idea to reduce the volume of the solution space? Furthermore: Is this even a good question to ask with the programming model of OpenCL in mind?
What I would do if the amount of result data is a small percentage (i.e. 0-10%) is use local atomics and global atomics, with a global counter.
Data interface between kernel 1 <----> Kernel 2:
int counter //used by atomics to know where to write
data_type results[counter]; //used to store the results
Kernel1:
Create a kernel function that does the operation on the data
Work items that do produce a result:
Save the result to local memory, and ensure no data races occur using local atomics in a local counter.
Use the work item 0 to save all the local results back to global memory using global atomics.
Kernel2:
Work items lower than "counter" do work, the others just return.
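The data flow of this scheme can be sketched as below. This is a sequential Python stand-in, not OpenCL: the loop plays the role of the work items, the plain integer `counter` plays the role of the global atomic counter (on a device the slot would come from something like `atomic_inc`), and the intermediate local-memory stage is left out. The function names and the sample predicate are assumptions made for illustration.

```python
def kernel1(inputs, produces_result, compute):
    """Stand-in for Kernel1: append kept results to a compact buffer."""
    results = [None] * len(inputs)  # worst case: every work item keeps a result
    counter = 0                     # stands in for the global atomic counter
    for value in inputs:            # each iteration models one work item
        if produces_result(value):
            slot = counter          # on a device: slot = atomic_inc(&counter)
            counter += 1
            results[slot] = compute(value)
    return results, counter


def kernel2(results, counter, process):
    """Stand-in for Kernel2: only work items below counter do any work."""
    output = []
    for work_item in range(len(results)):
        if work_item >= counter:
            continue                # "the others just return"
        output.append(process(results[work_item]))
    return output


if __name__ == "__main__":
    sparse_input = [0, 3, 0, 0, 5, 0, 7]
    kept, count = kernel1(sparse_input, lambda v: v != 0, lambda v: v * v)
    print(count, kernel2(kept, count, lambda r: r + 1))
```

On real hardware the order in which results land in the buffer is not deterministic, so the second kernel should not rely on it.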

OpenCl and power iteration method (eigendecomposition)

I'm new to OpenCL and I'm trying to implement the power iteration method (described over here) for matrix sizes over 100000x100000!
Actually I have no idea how to implement this.
That's because work-groups are restricted by CL_DEVICE_MAX_WORK_GROUP_SIZE (so I can't make one work-group with 1000000 work-items).
But on each iteration step I need to synchronize and normalize the vector.
1) So is it possible to do all calculations inside one kernel? (I think the answer is no if the matrix size is more than CL_DEVICE_MAX_WORK_GROUP_SIZE.)
2) Can I make "while" loop in the host code? and is it still profitable to use GPU in this case?
something like:
while (condition)
{
    kernel calling
    synchronization
}
2: Yes, you can make a while loop in host code. Whether this is still profitable in terms of performance depends on whether the kernel that is called achieves a good speedup. My personal preference is not to pack too much logic into a single kernel, because smaller kernels are easier to maintain and sometimes easier to optimize. But of course, invoking a kernel has a (small) overhead that has to be taken into account. And whether combining two kernels into one can bring a speedup (or new potential for optimizations) depends on what the kernels are actually doing. But in this case (Matrix Multiplication and Vector Normalization) I'd personally start with two different kernels that are invoked from the host in a while-loop.
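The host-driven loop suggested here can be sketched as follows. This is plain Python rather than OpenCL host code, purely to show the control flow: `matvec` and `normalize` stand in for the two kernels, and their names, the tolerance and the iteration cap are illustrative assumptions. The sketch also assumes the product vector never becomes all zeros.

```python
import math


def matvec(matrix, vector):
    """Stand-in for kernel 1: dense matrix-vector product."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def normalize(vector):
    """Stand-in for kernel 2: scale the vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def power_iteration(matrix, tol=1e-12, max_iters=1000):
    """Host loop: call both 'kernels' until the eigenvalue estimate settles."""
    vector = normalize([1.0] * len(matrix))
    eigenvalue = 0.0
    for _ in range(max_iters):
        product = matvec(matrix, vector)
        # Rayleigh quotient; vector has unit length, so no division is needed.
        estimate = sum(p * v for p, v in zip(product, vector))
        vector = normalize(product)
        converged = abs(estimate - eigenvalue) < tol
        eigenvalue = estimate
        if converged:
            break
    return eigenvalue, vector


if __name__ == "__main__":
    value, vec = power_iteration([[2.0, 0.0], [0.0, 1.0]])
    print(value, vec)
```

In an OpenCL version each call would become a kernel enqueue, with the convergence check being the only value read back to the host per iteration.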
1: Since a 100000x100000 matrix with float values will take at least 40GB of memory, you'll have to think about the approach in general anyhow. There is a vast amount of literature on Matrix operations, their parallelization, and the corresponding implementations on the GPU. One important aspect from the "high level" point of view is whether the matrices are dense or sparse ( http://en.wikipedia.org/wiki/Sparse_matrix ). Depending on the sparsity, it might even be possible to handle 100000x100000 matrices in main memory. Apart from that, you might consider having a look at a library for matrix operations (e.g. http://viennacl.sourceforge.net/ ) because implementing an efficient matrix multiplication is challenging, particularly for sparse matrices. But if you want to go the whole way on your own: Good luck ;-) and ... the CL_DEVICE_MAX_WORK_GROUP_SIZE imposes no limitation on the problem size. In fact, the problem size (that is, the total number of work-items) in OpenCL is virtually infinitely large. If your CL_DEVICE_MAX_WORK_GROUP_SIZE is 256, and you want to handle 10000000000 elements, then you create 10000000000/256 work groups and let OpenCL care about how they are actually dispatched and executed. For matrix operations, the CL_DEVICE_MAX_WORK_GROUP_SIZE is primarily relevant when you want to use local memory (and you will have to, in order to achieve good performance): The size of the work groups thus implicitly defines how large your chunks of local memory may be.
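The figures in this answer are easy to check with a few lines of arithmetic. The sketch below is in Python for convenience; the helper names are made up, and 4 bytes per element corresponds to a single-precision float.

```python
def dense_matrix_bytes(rows: int, cols: int, bytes_per_element: int = 4) -> int:
    """Memory for a dense matrix, assuming 4-byte floats by default."""
    return rows * cols * bytes_per_element


def work_group_count(total_work_items: int, work_group_size: int) -> int:
    """Number of work groups needed to cover all work items (rounded up)."""
    return -(-total_work_items // work_group_size)


if __name__ == "__main__":
    print(dense_matrix_bytes(100_000, 100_000) / 10**9, "GB")
    print(work_group_count(10_000_000_000, 256), "work groups")
```

The rounding up matters when the problem size is not a multiple of the work-group size; the surplus work items then need a bounds check inside the kernel.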

OpenCL computation times much longer than CPU alternative

I'm taking my first steps in OpenCL (and CUDA) for my internship. All nice and well, I now have working OpenCL code, but the computation times are way too high, I think. My guess is that I'm doing too much I/O, but I don't know where that could be.
The code is for the main: http://pastebin.com/i4A6kPfn, and for the kernel: http://pastebin.com/Wefrqifh I'm starting to measure time after segmentPunten(segmentArray, begin, eind); has returned, and I end measuring time after the last clEnqueueReadBuffer.
Computation time on an Nvidia GT440 is 38.6 seconds, on a GT555M 35.5, on an Athlon II X4 5.6 seconds, and on an Intel P8600 6 seconds.
Can someone explain this to me? Why are the computation times so high, and what solutions are there for this?
What it is supposed to do (short version): calculate how much noise load is produced by an airplane that is passing by.
Long version: there are several Observer Points (OP), which are the points at which sound from a passing airplane is measured. The flight path is divided into 10.000 segments; this is done in the function segmentPunten. The double for loop in the main gives each OP a coordinate. There are two kernels. The first one calculates the distance from a single OP to a single segment. This is then saved in the array "afstanden". The second kernel calculates the sound load in an OP from all the segments.
Just eyeballing your kernel, I see this:
kernel void SEL(global const float *afstanden, global double *totaalSEL,
                const int aantalSegmenten)
{
    // ...
    for(i = 0; i < aantalSegmenten; i++) {
        double distance = afstanden[threadID * aantalSegmenten + i];
        // ...
    }
    // ...
}
It looks like aantalSegmenten is being set to 1000. You have a loop in each
kernel that accesses global memory 1000 times. Without crawling through the code,
I'm guessing that many of these accesses overlap when considering your
computation as a whole. Is this the case? Will two work items access the same
global memory? If so, you will see a potentially huge win on the
GPU from rewriting your algorithm to partition the work such that you read
from a specific global memory location only once, saving it in local memory.
After that, each work item in the work group that needs that location can read it quickly.
As an aside, the CL specification allows you to omit the leading __ from CL
keywords like global and kernel. I don't think many newcomers to CL realize
that.
Before optimizing further, you should first get an understanding of what is taking all that time. Is it the kernel compiles, data transfer, or actual kernel execution?
As mentioned above, you can get rid of the kernel compiles by caching the results. I believe some OpenCL implementations (the Apple one at least) already do this automatically. With others, you may need to do the caching manually. Here are instructions for the caching.
If the performance bottleneck is the kernel itself, you can probably get a major speed-up by organizing the 'afstanden' array lookups differently. Currently, when a block of threads performs a read from memory, the addresses are spread out through the memory, which is a real killer for GPU performance. You'd ideally want to index the array with something like afstanden[ndx*NUM_THREADS + threadID], which would make the accesses from a work group load a contiguous block of memory. This is much faster than the current, essentially random, memory lookup.
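The difference between the two layouts shows up when listing which addresses neighbouring threads touch in the same loop iteration. The sketch below does this in Python for a handful of threads; the function names and the small thread count are assumptions made for illustration, and 1000 is the segment count mentioned earlier in the thread.

```python
def current_index(thread_id: int, i: int, num_segments: int) -> int:
    """Layout used in the kernel: afstanden[threadID * aantalSegmenten + i]."""
    return thread_id * num_segments + i


def interleaved_index(thread_id: int, i: int, num_threads: int) -> int:
    """Suggested layout: afstanden[ndx * NUM_THREADS + threadID]."""
    return i * num_threads + thread_id


def addresses_in_iteration(index_fn, stride, num_threads, i):
    """Element offsets read by threads 0..num_threads-1 in loop iteration i."""
    return [index_fn(t, i, stride) for t in range(num_threads)]


if __name__ == "__main__":
    threads, segments = 4, 1000
    print(addresses_in_iteration(current_index, segments, threads, 0))
    print(addresses_in_iteration(interleaved_index, threads, threads, 0))
```

With the current layout adjacent threads are 1000 elements apart, while the interleaved layout puts them on consecutive elements, which is the pattern the hardware can service as one contiguous load.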
First of all, you are measuring not the computation time but the whole kernel read-in/compile/execute mumbo-jumbo. To do a fair comparison, measure the computation time from the first "non-static" part of your program. (For example, from the first clSetKernelArg to the last clEnqueueReadBuffer.)
If the execution time is still too high, then you can use some kind of profiler (such as VisualProfiler from NVidia), and read the OpenCL Best Practices guide which is included in the CUDA Toolkit documentation.
As for the raw kernel execution time: consider (and measure) whether you really need double precision for your calculation, because double precision calculations are artificially slowed down on consumer-grade NVidia cards.

Resources