OpenCL performance: using arrays of primitives vs arrays of structures

I'm trying to use several arrays of doubles in a kernel, all of the same length. Instead of passing each double* in as a separate argument, I know I can define a structure in the .cl file that holds several doubles and then pass the kernel a single pointer to an array of those structures instead.
Will the performance differ between the two approaches? Please correct me if I am wrong, but I think passing individual double pointers means the accesses can be coalesced. Will accesses to the structures also be coalesced?

As long as your structures don't contain any pointers, what you describe is absolutely possible. The primary impact is generally, as you've already considered, the effect this has on the coalescing of memory operations. How big that effect is depends on your memory access pattern, the size of your struct and the device you're running on. More details would be needed to describe this more fully.
That said, one case where I've used a struct in this way very successfully is where the element being read is the same for all work items in a work group. In this case there is no penalty on my hardware (an NVIDIA GTX 570). It is also worth remembering that in some cases the added latency introduced by serialising the memory operations can be hidden. In the CUDA world this would be achieved by having high occupancy for a problem with high arithmetic intensity.
Finally, it is worth pointing out that the semantic clarity of using a struct can be a benefit in and of itself. You'll have to weigh this against any performance cost for your particular problem. My advice is to try it and see; it is very difficult to predict the impact of these issues ahead of time.
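To make the comparison concrete, here is a minimal sketch of the two layouts being discussed; the kernel and field names are invented for illustration only:

#pragma OPENCL EXTENSION cl_khr_fp64 : enable   // double support assumed to be available

// Option A: separate arrays of primitives (one double* argument per quantity).
__kernel void sum_soa(__global const double *a,
                      __global const double *b,
                      __global double *out)
{
    size_t i = get_global_id(0);
    out[i] = a[i] + b[i];           // neighbouring work items read neighbouring doubles: coalesced
}

// Option B: a single array of structures passed through one pointer.
typedef struct { double a; double b; } pair_t;

__kernel void sum_aos(__global const pair_t *in,
                      __global double *out)
{
    size_t i = get_global_id(0);
    out[i] = in[i].a + in[i].b;     // each work item reads with a 16-byte stride; coalescing depends on the device
}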

Theoretically the performance is the same. However, if you access some of the members more often than others, using several segregated arrays will perform much better, due to cache locality. On the other hand, most operations become more awkward when the data is split across several arrays.

The structures and the single elements will have the exact same performance.
Suppose you have a big array of doubles, where the first work item uses elements 0, 100, 200, 300, ... and the next one uses 1, 101, 201, 301, ...
If you have a structure of 100 doubles, in memory the first structure comes first (elements 0-99), then the second (100-199), and so on. The kernels access exactly the same memory in exactly the same places; the only difference is how you define the memory abstraction.
In the more general case of a structure holding different element types (char, int, double, bool, ...) the alignment may not be the same as for a single flat array of data, but the accesses will still be "semi-coalesced". I would even bet the performance is still the same.
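As a sketch of that equivalence (names invented, a global work size of 100 and cl_khr_fp64 support assumed), the two views below touch exactly the same addresses:

typedef struct { double d[100]; } block_t;   // 100 doubles, laid out contiguously

__kernel void flat(__global const double *a, __global double *out, int nblocks)
{
    size_t gid = get_global_id(0);           // work item i reads elements i, i+100, i+200, ...
    double acc = 0.0;
    for (int k = 0; k < nblocks; ++k)
        acc += a[k * 100 + gid];
    out[gid] = acc;
}

__kernel void structured(__global const block_t *a, __global double *out, int nblocks)
{
    size_t gid = get_global_id(0);           // same addresses, expressed through the struct
    double acc = 0.0;
    for (int k = 0; k < nblocks; ++k)
        acc += a[k].d[gid];
    out[gid] = acc;
}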

Related

Is this an advantage of MPI_PACK over derived datatype?

Suppose a process is going to send a number of arrays of different sizes but of the same type to another process in a single communication, so that the receiver builds the same arrays in its memory. Prior to the communication the receiver doesn't know the number of arrays and their sizes. So it seems to me that though the task can be done quite easily with MPI_Pack and MPI_Unpack, it cannot be done by creating a new datatype because the receiver doesn't know enough. Can this be regarded as an advantage of MPI_PACK over derived datatypes?
There is some passage in the official document of MPI which may be referring to this:
The pack/unpack routines are provided for compatibility with previous libraries. Also, they provide some functionality that is not otherwise available in MPI. For instance, a message can be received in several parts, where the receive operation done on a later part may depend on the content of a former part.
You are absolutely right. The way I phrase it is that with MPI_Pack I can make "self-documenting messages". First you pack an integer that says how many elements are coming up, then you pack those elements. The receiver inspects that first int, then unpacks the elements. The only catch is that the receiver needs to know an upper bound on the number of bytes in the pack buffer, but you can get that with a separate message, or with an MPI_Probe.
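A minimal sketch of that pattern (the ranks, tag, and buffer sizes here are made up, and error checking is omitted):

/* Sender (rank 0): pack a count followed by that many doubles. */
int nvals = 42;
double vals[42];
char buf[4096];                      /* upper bound on the packed size */
int position = 0;
MPI_Pack(&nvals, 1, MPI_INT, buf, sizeof buf, &position, MPI_COMM_WORLD);
MPI_Pack(vals, nvals, MPI_DOUBLE, buf, sizeof buf, &position, MPI_COMM_WORLD);
MPI_Send(buf, position, MPI_PACKED, 1, 99, MPI_COMM_WORLD);

/* Receiver (rank 1): probe for the packed size, then unpack the count first. */
MPI_Status status;
char rbuf[4096];
double rvals[512];                   /* enough for anything that fits in rbuf */
int nbytes, rpos = 0, rnvals;
MPI_Probe(0, 99, MPI_COMM_WORLD, &status);
MPI_Get_count(&status, MPI_PACKED, &nbytes);
MPI_Recv(rbuf, nbytes, MPI_PACKED, 0, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
MPI_Unpack(rbuf, nbytes, &rpos, &rnvals, 1, MPI_INT, MPI_COMM_WORLD);
MPI_Unpack(rbuf, nbytes, &rpos, rvals, rnvals, MPI_DOUBLE, MPI_COMM_WORLD);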
There is of course the matter that unpacking a packed message is way slower than straight copying out of a buffer.
Another advantage to packing is that it makes heterogeneous data much easier to handle. The MPI_Type_struct is rather a bother.

sending data: MPI_Type_contiguous vs primitive types

I am trying to exchange data (30 chars) between two processes, to understand the MPI_Type_contiguous API:
char data[30];
MPI_Datatype mytype;
MPI_Type_contiguous(10, MPI_CHAR, &mytype);
MPI_Type_commit(&mytype);
MPI_Send(data, 3, mytype, 1, 99, MPI_COMM_WORLD);
But the same task could have been accomplished via:
MPI_Send(data, 30, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
I guess there is no latency advantage, as I am using only a single function call in both cases (or is there?).
Can anyone share a use case where MPI_Type_contiguous is advantageous over primitive types (in terms of performance or ease of accomplishing a task)?
One use that immediately comes to mind is sending very large messages. Since the count argument to MPI_Send is of type int, on a typical LP64 (Unix-like OSes) or LLP64 (Windows) 64-bit OS it is not possible to directly send more than 2^31 - 1 elements, even if the MPI implementation uses 64-bit lengths internally. With modern compute nodes having hundreds of GiBs of RAM, this is becoming a nuisance. The solution is to create a new datatype of length m and send n elements of the new type for a total of n*m data elements, where n*m can now be up to (2^31 - 1)^2 = 2^62 - 2^32 + 1. The method is future-proof and can also be used on 128-bit machines, as MPI datatypes can be nested even further. This workaround, together with the fact that registering a datatype is way cheaper (in execution time) than the time it takes such large messages to traverse the network, was used by the MPI Forum to reject the proposal for adding new large-count APIs or modifying the argument types of the existing ones. Jeff Hammond has written a library to simplify the process.
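For illustration, a rough sketch of that workaround (the buffer, chunk count, destination and tag below are placeholders):

MPI_Datatype chunk;
MPI_Type_contiguous(1 << 20, MPI_DOUBLE, &chunk);    /* 2^20 doubles per element of the new type */
MPI_Type_commit(&chunk);
/* nchunks * 2^20 doubles are sent, which may exceed 2^31 even though
   every count argument remains a plain int */
MPI_Send(bigbuf, nchunks, chunk, 1, 99, MPI_COMM_WORLD);
MPI_Type_free(&chunk);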
Another use is in MPI-IO. When setting the view of a file with MPI_File_set_view, a contiguous datatype can be provided as the elementary datatype. This allows one, for example, to handle binary files of complex numbers more simply in languages that do not have a built-in complex datatype (like earlier versions of C).
MPI_Type_contiguous is for making a new datatype which is count copies of the existing one. This is useful to simplify the process of sending a number of datatypes together, as you don't need to keep track of their combined size (the count in MPI_Send can be replaced by 1).
For your case, I think it is exactly the same. The text from Using MPI, adapted slightly to match your example, is:
When a count argument is used in an MPI operation, it is the same as if a contiguous type of that size had been constructed.
MPI_Send(data, count, MPI_CHAR, 1, 99, MPI_COMM_WORLD);
is exactly the same as
MPI_Type_contiguous(count, MPI_CHAR, &mytype);
MPI_Type_commit(&mytype);
MPI_Send(data, 1, mytype, 1, 99, MPI_COMM_WORLD);
MPI_Type_free(&mytype);
You are correct: as there is only one actual communication call, the latency will be identical (and so will the bandwidth, since the same number of bytes are sent).

Does a Tcl nested dictionary use references to references, and avoid capacity issues?

According to the thread:
TCL max size of array
Tcl cannot have more than 128M list/dictionary elements. However, one could have a nested dictionary whose total number of values (across the different levels) exceeds that limit.
Now, is the nested dictionary using references, by design? This would mean that as long as no single level of the dictionary tree has more than 128M elements, you should be fine. Is that true?
Thanks.
The current limitation is that no individual memory object (C struct or array) can be larger than 2GB, and it's because the high-performance memory allocator (and a few other key APIs) uses a signed 32-bit integer for the size of memory chunk to allocate.
This wasn't a significant limitation on a 32-bit machine, where the OS itself would usually restrict you at about the time when you started to near that limit.
However, on a 64-bit machine it's possible to address much more, while at the same time the size of pointers is doubled: 2GB of space means roughly 256M elements for a list, since each element needs at least one 8-byte pointer to hold the reference for the value inside it. In addition, the reference counter system might well hit a limit in such a scheme, though that wouldn't be the problem here.
If you create a nested structure, the total number of leaf memory objects that can be held within it can be much larger, but you need to take great care to never get the string serialisation of the list or dictionary since that would probably hit the 2GB hard limit. If you're really handling very large numbers of values, you might want to consider using a database like SQLite as storage instead as that can be transparently backed by disk.
Fixing the problem is messy because it impacts a lot of the API and ABI, and creates a lot of wreckage in the process (plus a few subtle bugs if not done carefully, IIRC). We'll fix it in Tcl 9.0.

vector types and one-dimensional arrays in OpenCL

I'd like to implement two versions of my kernel, a vector and a scalar version.
Now I'm wondering whether, say, the double4 type is similar in terms of memory access to an array of four doubles.
What I have in mind is to use the same data type for my two kernels, where in the scalar one I would just work on each component individually (.s0 .. .s3), as with a regular array.
In other words, I'd like to use OpenCL vector types for storage only in the scalar kernel, and take advantage of the vector properties in the vector kernel.
I honestly don't want to have different variable types for each kernel.
Does that make sense to you guys?
Any hints here?
Thank you,
Éric.
2, 4, 8 and 16 element vectors are laid out in memory just like 2/4/8/16 scalars. The exception is 3 element vectors, which use as much memory as 4 element vectors. The main benefit of using vectors in my experience has been that all devices support some form of instruction level parallelism, either through SIMD instructions like on CPUs or through executing independent instructions simultaneously, which happens on GPUs.
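As a sketch of what the question describes (the kernel names are invented, and double support assumes the cl_khr_fp64 extension), both kernels can take the same double4 buffer:

__kernel void vector_version(__global const double4 *in, __global double4 *out)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * 2.0;                   // operate on all four lanes at once
}

__kernel void scalar_version(__global const double4 *in, __global double *out)
{
    size_t i = get_global_id(0);
    double4 v = in[i];
    out[i] = v.s0 + v.s1 + v.s2 + v.s3;     // treat the vector purely as storage for four individual doubles
}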
Regarding the memory access pattern:
This depends first and foremost on your OpenCL kernel compiler: a reasonable compiler would use a single memory transaction to fetch the data for multiple array cells used in a single work item, or even multiple cells used in multiple items. On NVidia GPUs, global device memory is read in units of 128 bytes, which makes it worthwhile to coalesce as many as 32 float values for every read; see
NVidia CUDA Best Practices Guide: Coalesced Access to Global Memory
So using float4 might not even be enough to maximize your bandwidth utilization.
Regarding the use of vector types in kernels:
I believe that these would be useful mostly, if not only, on CPUs with vector instructions, and not on GPUs - where work items are inherently scalar; the vectorization is over multiple work items.
Not sure if I get your question. I'll give it a try with a bunch of general hints&tricks.
You don't have arrays in private memory, so here vectors can come in handy. As is described by the others, memory-alignment is comparable. See http://streamcomputing.eu/blog/2013-11-30/basic-concepts-malloc-kernel/ for some information.
The option you are missing is using the structs. Read the second part of the first answer of Arranging memory for OpenCL to know more.
Another thing that could be handy:
__attribute__((vec_type_hint(vectortype)))
Intel has various explanations: http://software.intel.com/sites/products/documentation/ioclsdk/2013XE/OG/Writing_Kernels_to_Directly_Target_the_Intel_Architecture_Processors.htm
It is quite tricky to write multiple kernels in one. You can use macro-tricks as described in http://streamcomputing.eu/blog/2013-10-17/writing-opencl-code-single-double-precision/
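As a rough sketch of that macro approach (the USE_DOUBLE switch and the type aliases below are just illustrative), a single kernel source can be built for either precision:

#ifdef USE_DOUBLE
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
typedef double  real;
typedef double4 real4;
#else
typedef float   real;
typedef float4  real4;
#endif

__kernel void scale(__global const real *in, __global real *out, real factor)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * factor;    // same source compiles as float or double depending on -DUSE_DOUBLE
}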

Count the frequency of bytes in a purely functional language

If we had an assignment:
Given a block of binary data, count the frequency of the bytes within it.
And you were supposed to do this in C, the answer would be trivial and reasonably fast even for larger binary blocks. How would one go about implementing this in a purely functional language, without side effects?
For example, if you wrote a function that accepted frequency counts for each byte and the rest of the list of bytes, and returned modified frequency counts, it would have to do an awful lot of work for a data set of 100M bytes.
Also, if you sorted the data and then somehow counted the amount of subsequent same-valued bytes, the sort itself would take a lot of time.
Is there a reasonable way to implement this?
The straightforward way to do it is indeed to pass in and return data structures mapping bytes to counts. This would probably be implemented as some kind of tree (since that's what you get out of the standard library containers, as far as I know). In pure functional programming when you're passed in a tree and you need to return a new tree with a difference in only one node, the returned tree ends up sharing almost all of its structure and data with the original tree.
There is some overhead in traversing the tree to get to a count, but since you're counting bytes the tree never has more than 256 elements, so the overhead is at most log(256), which is a constant. It doesn't get larger for large data sets and doesn't change the big-O complexity of the algorithm. That's actually true even if you use the greatest possible overhead of copying around a full 256-entry array of counts with no sharing.
If you want to optimise this, you can take advantage of the fact that the "intermediate" frequency counts are never needed except as part of the computation of the next set of counts. That means you can use various techniques for getting the implementation to use destructive updates even while you're still semantically writing functional code. An STRef in Haskell basically lets you do this manually.
Theoretically the compiler could notice that you're replacing a never-needed-again value with a new one, so it could do the update in place for you. I don't know whether any production-ready compilers are currently able to make this optimisation.
