MPI RMA create windows for ghost cells in a multidimensional array - mpi

If I have an array A[100][100][100], how do I create a window for remote memory access for the six face subarrays (ghost cells), especially for A[0][:][:] and A[99][:][:]?
In MPI-1, I created vector types to send/recv the ghost cells.
In MPI-2 and -3, do I need to expose the entire array or only the ghost cells? Of course, the latter will be much better if possible.

MPI RMA windows are contiguous areas of memory, and for performance reasons implementations might require that they be allocated specifically using MPI_ALLOC_MEM. The boundary cells on 4 of the 6 sides of a 3-D array are not contiguous in memory. Some implementations could also require that windows start aligned on a page or another kind of boundary. Therefore you have to register a window that spans the whole array.
While it is technically possible to expose two separate windows for A[0][:][:] and A[99][:][:] and these would not expose any other parts of the array, this is simply not possible for A[:][0][:], A[:][99][:], and so on because of their discontinuous character.
I would suggest that you allocate A using MPI_ALLOC_MEM (or MPI_Alloc_mem if you program in C/C++). You could then use the appropriate vector types in MPI_GET and MPI_PUT in order to easily access the remote halo cells as well as the local cells that are to be copied over.


Does a Tcl nested dictionary use references to references, and avoid capacity issues?

According to the thread:
TCL max size of array
Tcl cannot have >128M list/dictionary elements. However, one could have a nested dictionary whose total number of values (across the different levels) exceeds that limit.
Now, is the nested dictionary using references, by design? That would mean that as long as no single level of the dictionary tree has more than 128M elements, you should be fine. Is that true?
Thanks.
The current limitation is that no individual memory object (C struct or array) can be larger than 2GB, and it's because the high-performance memory allocator (and a few other key APIs) uses a signed 32-bit integer for the size of memory chunk to allocate.
This wasn't a significant limitation on a 32-bit machine, where the OS itself would usually restrict you at about the time when you started to near that limit.
However, on a 64-bit machine it's possible to address much more, while at the same time the size of pointers is doubled; e.g., 2GB of space means about 256M elements for a list, since each needs at least one pointer to hold the reference for the value inside it. In addition, the reference counter system might well hit a limit in such a scheme, though that wouldn't be the problem here.
If you create a nested structure, the total number of leaf memory objects that can be held within it can be much larger, but you need to take great care to never get the string serialisation of the list or dictionary since that would probably hit the 2GB hard limit. If you're really handling very large numbers of values, you might want to consider using a database like SQLite as storage instead as that can be transparently backed by disk.
Fixing the problem is messy because it impacts a lot of the API and ABI, and creates a lot of wreckage in the process (plus a few subtle bugs if not done carefully, IIRC). We'll fix it in Tcl 9.0.

vector<vector> as a quick-traversal 2d data structure

I'm currently considering the implementation of a 2D data structure to allow me to store and draw objects in correct Z-Order (GDI+, entities are drawn in call order). The requirements are loosely:
Ability to add new objects to the top of any depth index
Ability to remove arbitrary object
(Ability to move object to the top of new depth index, accomplished by 2 points above)
Fast in-order and reverse-order traversal
As the main requirement is speed of traversal across the full data, the first thing that came to mind was an array-like structure, e.g. a vector of vectors. It also easily allows for pushing new objects (removing objects is not so great...). This works perfectly fine for our requirements, as it just so happens that the bulk of drawable entities don't change, and the ones that do sit at the top end of the order.
However it got me thinking of the implications for more dynamic requirements:
A vector will resize itself as required -> as the 'depth' vectors would need to be maintained contiguously in memory (top-level vector enforces it), this could lead to some pretty expensive vector resizes. Worst case all vectors need to be moved to new memory location, average case requiring all vectors up the chain to be moved.
Vectors will often hold a buffer at the end for adding new objects -> traversal could still easily force a cache miss while jumping between 'depth' vectors, rendering the top-level vector's contiguous memory less beneficial
Could someone confirm that these observations are indeed correct, making a vector a mostly very expensive structure for storing larger dynamic data sets?
From my thoughts above, I end up deducing that while traversing the whole dataset, specifically jumping between different vectors in the top-level vector, you might as well use any other data structure with inferior traversal complexity, or similar random access complexity (linked_list; map). Traversal would effectively be the same, as we might as well assume the cache misses will happen anyway, and we save ourselves a lot of bother by not keeping the depth vectors contiguously in memory.
Would that indeed be a good solution? If I'm not mistaken, on a 1D problem space, this would come down to what's more important traversal or addition/removal, vector or linked-list. On a 2D space I'm not so sure it is so black and white.
I'm wondering what sort of application requires good traversal across a 2D space, without compromising data addition/removal, and what sort of data structures are used there.
P.S. I just noticed I'm completely ignoring space-complexity, so might as well keep on ignoring it (unless you feel like adding more insight :D)
Your first assumption is somewhat incorrect.
Instead of thinking of a vector as the blob of memory itself, think of it as a pointer to an automatically managed blob of memory plus some metadata to keep track of it. A vector object itself has a fixed size; the memory it manages doesn't. (See this example; note that the size of the vector object is constant: https://ideone.com/3mwjRz)
A vector of vectors can be thought of as an array of pointers. Resizing what the pointers point to doesn't mean you need to resize the array that contains them. The promise of items being contiguous still holds: the parent array has all of the pointers adjacent to each other and each pointer points to a contiguous chunk of memory. However, it's not guaranteed that the end of arr[0][N-1] is adjacent to the beginning of arr[1][0]. (To this end, your second point is correct.)
I guess that a linked list would be more appropriate, as you will always be traversing the whole list (vectors are good for random access). Linked list inserts and removals are very cheap, and traversal isn't that different from vector traversal. Maybe you should consider a doubly linked list, as you want to traverse it in both directions.

What is the advantage of using a 1d image over a 1d buffer?

I understand that in 2d, images are cached in x and y directions.
But in 1d, why would you want to use an image? Is the memory used
for images faster than memory used for buffers?
A 1D image is still an image, so it has all the advantages an image has over a buffer. That is:
Image IO operations are usually well-cached.
Samplers can be used, which gives benefits like computationally cheap interpolation, hardware-resolved out-of-bounds access, etc.
Though, you should remember that Image has some constraints in comparison to regular Buffer:
A single image can be used either for reading or for writing within one kernel, but not both.
You can't use the vloadN / vstoreN operations, which can handle up to 16 values per call. Your best option is the read_imageX & write_imageX functions, which can load/store up to 4 values per call. That can be a serious issue on a GPU with a vector architecture.
If you are not using a 4-component format, you usually lose part of the performance, as many functions process samples from all color planes simultaneously, so the payload decreases.
As for GPUs, different parts of the hardware are involved in processing images and buffers, so it's difficult to say in general how one compares to the other. Careful benchmarking and algorithm optimization are needed.

OpenCL: Work items, Processing elements, NDRange

My classmates and me are being confronted with OpenCL for the first time. As expected, we ran into some issues. Below I summarized the issues we had and the answers we found. However, we're not sure that we got it all right, so it would be great if you guys could take a look at both our answers and the questions below them.
Why didn't we split that up into single questions?
They partly relate to each other.
We think these are typical beginner's questions. Those fellow students who we consulted all replied "Well, that I didn't understand either."
Work items vs. Processing elements
In most of the lectures on OpenCL that I have seen, they use the same illustration to introduce computing units and processing elements as well as work groups and work items. This has led my classmates and me to continuously confuse these concepts. Therefore we came up with a definition that emphasizes the fact that processing elements are very different from work items:
A work item is a kernel that is being executed, whereas a processing element is an abstract model that represents something that actually does computations. A work item is something that exists only temporarily in software, while a processing element abstracts something that physically exists in hardware. However, depending on the hardware and therefore depending on the OpenCL implementation, a work item might be mapped to and executed by some piece of hardware that is represented by a so-called processing element.
Question 1: Is this correct? Is there a better way to express this?
NDRange
This is how we perceive the concept of NDRange:
The amount of work items that are out there is being represented by the NDRange size. Commonly, this is also being referred to as the global size. However, the NDRange can be either one-, two-, or three-dimensional ("ND"):
A one-dimensional problem would be some computation on a linear vector. If the vector's size is 64 and there are 64 work items to process that vector, then the NDRange size equals 64.
A two-dimensional problem would be some computation on an image. In the case of a 1024x768 image, the NDRange size Gx would be 1024 and the NDRange size Gy would be 768. This assumes that there are 1024x768 work items out there to process each pixel of that image. The NDRange size then equals 1024x768.
A three-dimensional example would be some computation on a 3D model or so. Additionally, there is NDRange size Gz.
Question 2: Once again, is this correct?
Question 3: These dimensions are simply there for convenience, right? One could simply store the color values of each pixel of an image in a linear vector of size width * height. The same is true for any 3D problem.
Various
Question 4: We were being told that the execution of kernels (in other words: work items) could be synchronized within a work group using barrier(CLK_LOCAL_MEM_FENCE); Understood. We were also (repeatedly) being told that work groups cannot be synchronized. Alright. But then what's the use of barrier(CLK_GLOBAL_MEM_FENCE);?
Question 5: In our host program, we specify a context that consists of one or more device(s) from one of the available platforms. However, we can only enqueue kernels in a so-called command queue that is linked to exactly one device (that has to be in the context). Again: The command queue is not linked to the previously defined context, but to a single device. Right?
Question 1: Almost correct. A work-item is an instance of a kernel (see paragraph 2 of section 3.2 of the standard). See also the definition of processing element from the standard:
Processing Element: A virtual scalar processor. A work-item may
execute on one or more processing elements.
see also the answer I provided to that question.
Questions 2 & 3: Whether you use more than one dimension, or exactly as many work-items as you have data elements to process, depends on your problem; it's up to you and whatever makes development easier. Note also that with OpenCL 1.2 and below you have a constraint that forces the global size to be a multiple of the work-group size (removed in OpenCL 2.0).
Question 4: Yes, synchronization during the execution of a kernel is only possible within a work-group, thanks to barriers. The difference between the flags you pass as a parameter refers to the type of memory. With CLK_LOCAL_MEM_FENCE, all work-items have to make sure that data they write to local memory will be visible to the others. With CLK_GLOBAL_MEM_FENCE it's the same, but for global memory.
Question 5: Within a context you can have several devices, each with several command queues. As you stated, a command-queue is linked to one device, but you can enqueue your kernels in different command-queues on different devices. Note that if two command-queues try to access the same memory object (without synchronization) you get undefined behavior. You'd typically use two or more command queues when their respective jobs are not related.
However, you can synchronize command-queues through events, and as a matter of fact you can also create your own events (called user events); see section 5.9 for events and section 5.10 for user events (of the standard).
I'd advise you to read at least the first chapters (1 to 5) of the standard. If you're in a hurry, at least chapter 2, which is actually the glossary.

vector types and one-dimensional arrays in OpenCL

I'd like to implement two versions of my kernel, a vector and a scalar versions.
Now I'm wondering whether, say, the double4 type is similar in terms of memory access to an array of double of size 4.
What I have in mind is to use the same data type for my two kernels where in the scalar one I will just work on each component individually (.s0 .. .s3) like with a regular array.
In other words, I'd like to use OpenCL vector types for storage only in the scalar kernel, and take advantage of the vector properties in the vector kernel.
I honestly don't want to have different variable types for each kernel.
Does that make sense to you guys?
Any hints here?
Thank you,
Éric.
2-, 4-, 8- and 16-element vectors are laid out in memory just like 2/4/8/16 scalars. The exception is 3-element vectors, which use as much memory as 4-element vectors. The main benefit of using vectors in my experience has been that all devices support some form of instruction-level parallelism, either through SIMD instructions as on CPUs, or through executing independent instructions simultaneously, which happens on GPUs.
Regarding the memory access pattern:
This depends first and foremost on your OpenCL kernel compiler: a reasonable compiler would use a single memory transaction to fetch the data for multiple array cells used in a single work item, or even multiple cells used in multiple items. On NVidia GPUs, global device memory is read in units of 128 bytes, which makes it worthwhile to coalesce as many as 32 float values for every read; see
NVidia CUDA Best Practices Guide: Coalesced Access to Global Memory
So using float4 might not even be enough to maximize your bandwidth utilization.
Regarding the use of vector types in kernels:
I believe that these would be useful mostly, if not only, on CPUs with vector instructions, and not on GPUs - where work items are inherently scalar; the vectorization is over multiple work items.
Not sure if I get your question. I'll give it a try with a bunch of general hints & tricks.
You don't have arrays in private memory, so here vectors can come in handy. As is described by the others, memory-alignment is comparable. See http://streamcomputing.eu/blog/2013-11-30/basic-concepts-malloc-kernel/ for some information.
The option you are missing is using the structs. Read the second part of the first answer of Arranging memory for OpenCL to know more.
Another thing that could be handy:
__attribute__((vec_type_hint(vectortype)))
Intel has various explanations: http://software.intel.com/sites/products/documentation/ioclsdk/2013XE/OG/Writing_Kernels_to_Directly_Target_the_Intel_Architecture_Processors.htm
It is quite tricky to write multiple kernels in one. You can use macro-tricks as described in http://streamcomputing.eu/blog/2013-10-17/writing-opencl-code-single-double-precision/
