Endianness and OpenCL Transfers

In OpenCL, transfers between the host (CPU) side and the device (GPU) side are accomplished through clEnqueueReadBuffer(...)/clEnqueueWriteBuffer(...). However, the documentation does not specify whether any endian-related conversions take place in the underlying driver.
I'm developing on x86-64 with an NVIDIA card, both little-endian, so the potential problem doesn't arise for me.
Does conversion happen, or do I need to do it myself?

The transfers do not do any conversions. The runtime does not know the type of your data.
You can probably expect conversions only on kernel arguments.

You can query the device endianness (call clGetDeviceInfo and check CL_DEVICE_ENDIAN_LITTLE), but I am not aware of a way to get transparent conversions.
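Since the transfers copy raw bytes, any swapping has to happen on the host before the write (or after the read). A minimal sketch of that idea in plain Python, not in OpenCL itself: the function name and the device_is_little_endian parameter are assumptions made for illustration, with the flag standing in for the result of the CL_DEVICE_ENDIAN_LITTLE query.

```python
import sys
from array import array


def to_device_order(values, device_is_little_endian):
    """Return the bytes of C floats laid out in the device's byte order.

    device_is_little_endian is a hypothetical parameter: in a real program
    it would hold the result of the CL_DEVICE_ENDIAN_LITTLE query.
    """
    buf = array("f", values)  # elements are stored in host byte order
    host_is_little_endian = sys.byteorder == "little"
    if host_is_little_endian != device_is_little_endian:
        buf.byteswap()  # reverse the bytes of every element in place
    return buf.tobytes()


# Same byte order on both sides: bytes are passed through untouched.
same = to_device_order([1.0, 2.0], sys.byteorder == "little")
# Opposite byte order: each element's bytes are reversed.
swapped = to_device_order([1.0, 2.0], sys.byteorder != "little")
```

The swap is per element, so the element width matters: a buffer of 64-bit values needs 8-byte swaps, which is exactly the type information the runtime does not have.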

This is the point where, IMHO, the specification is not satisfactory.
At first it is clear about pointers: the data that a pointer references can be in host or device byte order, one can declare this with a pointer attribute, and the default byte order is that of the device.
So according to this, developers have to take care of the endianness of the data they feed as input to a kernel.
But then "Appendix B - Portability" says that implementations may or may not automatically convert the endianness of kernel arguments, and that developers should consult the vendor's documentation in case host and device byte order differ.
Sorry for being that direct, but what shit is that. I mean, the intention of the OpenXX specifications is that they should make it possible to write cross-platform code. But when aspects this significant can vary from implementation to implementation, this is not really possible.
The next point is what all this means for OpenCL/OpenGL interoperation.
In OpenGL, data for buffer objects like VBOs has to be in host byte order. So what happens when such a buffer is shared between OpenCL and OpenGL? Must its data be converted before and after it is processed by an OpenCL kernel or not?


Is this an advantage of MPI_PACK over derived datatype?

Suppose a process is going to send a number of arrays of different sizes but of the same type to another process in a single communication, so that the receiver builds the same arrays in its memory. Prior to the communication the receiver doesn't know the number of arrays and their sizes. So it seems to me that though the task can be done quite easily with MPI_Pack and MPI_Unpack, it cannot be done by creating a new datatype because the receiver doesn't know enough. Can this be regarded as an advantage of MPI_PACK over derived datatypes?
There is a passage in the official MPI document which may be referring to this:
The pack/unpack routines are provided for compatibility with previous libraries. Also, they provide some functionality that is not otherwise available in MPI. For instance, a message can be received in several parts, where the receive operation done on a later part may depend on the content of a former part.
You are absolutely right. The way I phrase it is that with MPI_Pack I can make "self-documenting messages". First you store an integer that says how many elements are coming up, then you pack those elements. The receiver inspects that first int, then unpacks the elements. The only catch is that the receiver needs to know an upper bound on the number of bytes in the pack buffer, but you can get that from a separate message, or from an MPI_Probe followed by MPI_Get_count.
There is of course the matter that unpacking a packed message is way slower than straight copying out of a buffer.
Another advantage to packing is that it makes heterogeneous data much easier to handle. The MPI_Type_struct is rather a bother.
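The "self-documenting message" layout described above can be illustrated without MPI. The following is only an analogy built on Python's struct module, not MPI_Pack itself: the function names are made up for illustration, and the running position variable plays the role of the position argument of MPI_Unpack.

```python
import struct


def pack_arrays(arrays):
    """Pack several integer arrays into one self-describing byte buffer."""
    buf = bytearray(struct.pack("<i", len(arrays)))    # how many arrays follow
    for arr in arrays:
        buf += struct.pack("<i", len(arr))             # length of this array
        buf += struct.pack(f"<{len(arr)}i", *arr)      # the elements themselves
    return bytes(buf)


def unpack_arrays(buf):
    """Rebuild the arrays, reading each size before the data it describes."""
    (count,) = struct.unpack_from("<i", buf, 0)
    position = 4                                       # bytes consumed so far
    arrays = []
    for _ in range(count):
        (length,) = struct.unpack_from("<i", buf, position)
        position += 4
        arrays.append(list(struct.unpack_from(f"<{length}i", buf, position)))
        position += 4 * length
    return arrays


message = pack_arrays([[1, 2, 3], [4], [5, 6]])
rebuilt = unpack_arrays(message)
```

The receiver never needs to know the number or sizes of the arrays in advance, because every later read depends on a value read earlier from the same buffer.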

I don't understand what exactly the function bytesToWrite() does in Qt

I searched for bytesToWrite in the docs and this is what I found: "For buffered devices, this function returns the number of bytes waiting to be written. For devices with no buffer, this function returns 0."
First, what does "buffered devices" mean? And can anyone please explain to me what exactly this function does and where or how I can use it?
Many IO devices are buffered, which means that data isn't sent straight away, but it is accumulated to be sent in bulk when there is a sufficient amount.
This is done essentially for better performance, as sending data normally has some fixed overhead (at the very least the syscall overhead), which is well amortized when sending data in bulk, but would have to be paid for each write if no buffering were used.
(Notice that here we are only talking about QIODevice buffers; normally there are also all kinds of kernel-mode buffers and buffers internal to the hardware devices themselves.)
bytesToWrite tells you how much stuff is in the QIODevice write buffer, i.e. how many bytes you wrote that are waiting to be actually written (as in, given to the OS to write).
I never actually had to use that member, but I suppose it could be useful e.g. in a producer-consumer scenario (if the write buffer drops below some threshold, it's time to compute the next chunk of data to send), to manually handle buffering in some places, or even just for debugging/logging purposes.
It's actually very useful when you're using an asynchronous API.
You can, for example, use it inside a slot connected to bytesWritten() to tell whether the buffer is empty and the data has been fully written or not.
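To make the buffering idea concrete, here is a toy model in Python. It is not Qt code: BufferedDevice and its methods are hypothetical stand-ins, with bytes_to_write() playing the role of bytesToWrite() and each flush_some() call marking the point where Qt would emit bytesWritten().

```python
class BufferedDevice:
    """Toy stand-in for a buffered I/O device (hypothetical, not a Qt class)."""

    def __init__(self):
        self._pending = bytearray()  # accepted by write() but not yet given to the OS
        self.sent = bytearray()      # bytes that have been "given to the OS"

    def write(self, data):
        self._pending += data        # returns immediately; nothing is sent yet
        return len(data)

    def bytes_to_write(self):
        return len(self._pending)    # plays the role of bytesToWrite()

    def flush_some(self, limit):
        """Hand up to `limit` pending bytes to the OS and return how many went out."""
        chunk = self._pending[:limit]
        del self._pending[:limit]
        self.sent += chunk
        return len(chunk)


dev = BufferedDevice()
dev.write(b"hello world")
progress = []
while dev.bytes_to_write() > 0:
    written = dev.flush_some(4)      # in Qt, bytesWritten(written) would fire here
    progress.append((written, dev.bytes_to_write()))
```

Each entry in progress pairs the size of one partial write with what is still pending afterwards; the data is fully written only once the second value reaches 0, which is the check suggested above for a bytesWritten() slot.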

Is it possible to get device load in OpenCL

I know how to use clGetDeviceInfo to query information about the device but I don't know how to get information about the device at runtime. For example, how much global memory is in use right now? How busy have the processing elements been, on average, in the last n nanoseconds?
AFAIK, no. OpenCL itself does not have any API to query the current status of a device. That information is exposed by the vendor of your particular implementation (for example GPUPerfAPI from AMD or the Graphics Performance Analyzers from Intel).
Hope this helps.
What I did to be able to determine the free memory at runtime is write a wrapper around clDevice (or cl::Device in my case) and pipe all buffer allocations through said wrapper.
At the beginning of the program, I query the total device memory (CL_DEVICE_GLOBAL_MEM_SIZE), and when buffers are allocated I store their addresses and sizes in a vector so I can subtract the accumulated size of the currently allocated buffers from the total memory.
With OpenCL, you can register callbacks on buffers that are called when the buffer is destroyed (clSetMemObjectDestructorCallback). So I use those to clean up when the buffer is released. Hint: the cl_mem parameter passed to the callback is NOT a valid mem object. It may have already been destroyed, so you cannot query it for its size (that took me a couple of hours, even though it's clearly stated in the standard ...).
This way, I always know how much memory is left on the device.
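The bookkeeping described above can be sketched as follows. This is plain Python that never calls OpenCL: DeviceMemoryTracker and its method names are hypothetical, total_bytes stands in for the CL_DEVICE_GLOBAL_MEM_SIZE value, and on_destroyed() stands in for the destructor callback.

```python
class DeviceMemoryTracker:
    """Hypothetical bookkeeping wrapper; it does not call OpenCL itself."""

    def __init__(self, total_bytes):
        self.total_bytes = total_bytes  # would come from CL_DEVICE_GLOBAL_MEM_SIZE
        self._sizes = {}                # handle -> size, recorded at allocation time
        self._next_handle = 0

    def allocate(self, size):
        if size > self.free_bytes():
            raise MemoryError("not enough tracked device memory")
        handle = self._next_handle
        self._next_handle += 1
        self._sizes[handle] = size
        return handle

    def on_destroyed(self, handle):
        # Stand-in for the destructor callback: rely on the size stored at
        # allocation time, because the buffer itself can no longer be queried.
        self._sizes.pop(handle)

    def free_bytes(self):
        return self.total_bytes - sum(self._sizes.values())


tracker = DeviceMemoryTracker(1024)
a = tracker.allocate(256)
b = tracker.allocate(512)
free_before = tracker.free_bytes()
tracker.on_destroyed(a)
free_after = tracker.free_bytes()
```

Storing the size at allocation time is what makes the callback usable, since the destroyed buffer cannot be asked for it. The result only covers allocations that go through the wrapper.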

Endianness of network data transmissions over TCP/IP

Here is a question I've been trying to solve for quite some time. It does not pertain to a particular language, although it matters less for those with a VM that specifies endianness. I know, like 99.9999% of the people who use sockets to send data over TCP/IP, that the protocol specifies an endianness for its own fields, like destination address, port and such. The thing I don't know is if it requires the payload to be in a specific format to prevent incompatibilities.
For example, let's say I develop a protocol that is not a presentation layer and that I, due to the immense dominance of little-endian devices nowadays, decide to make little endian (for example, the positions of the players and such are transmitted in little-endian order). Think of a network module for a game engine, where latencies matter and byte conversion would cost a noticeable amount of time. Of course the address, port and all other protocol-related data would be specified in big endian as is mandatory; I'm talking about the payload, and only that.
Would that protocol work out of the box (translating the contents as necessary, of course, once the transmission is received) on a big-endian machine? Or would the checksums of the IP protocol or something of the kind get computed wrong since the data is in a different order, given that the programmer has no control over them unless raw sockets are used?
Since the whole explanation can be misleading, feel free to ask for clarifications.
Thank you very much.
"The thing I don't know is if it requires the payload to be in a specific format to prevent incompatibilities."
It doesn't, and it doesn't have a way of telling. To TCP it's just a byte-stream. It is up to the application protocol to decide endian-ness, and it is up to the implementors at each end to implement it correctly. There is a convention to use big-endian, but there's no compulsion.
Application-layer protocols dictate their own endianness. However, by convention, multi-byte integer values should be sent in network byte order (big endian) for consistency across platforms, such as by using the platform-provided hton...() (host-to-network) and ntoh...() (network-to-host) function implementations in your code. On little-endian systems, they will do the necessary byte swapping. On big-endian systems, they are no-ops. The functions provide an abstraction layer so code does not have to worry about that.
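A small sketch of what "the application protocol decides" looks like in practice, using Python's struct and socket modules. The player_x and player_y fields are made-up payload values; the point is only that the byte order is fixed by the format string on both ends, not by the machine.

```python
import socket
import struct

player_x, player_y = 1.5, -2.25  # hypothetical payload fields

little = struct.pack("<2f", player_x, player_y)  # little-endian wire format
big = struct.pack("!2f", player_x, player_y)     # network (big-endian) wire format

# Decoding with the matching format gives the same values on any host.
decoded_little = struct.unpack("<2f", little)
decoded_big = struct.unpack("!2f", big)

# htonl/ntohl round-trip on any host; they swap bytes only on little-endian machines.
value = 0x12345678
round_trip = socket.ntohl(socket.htonl(value))
```

Either wire format works over TCP as long as sender and receiver agree on it; TCP carries the bytes in the order they were written.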

Endianness of HMAC-SHA code

I am transmitting AES messages. My understanding is that: 1. The AES algorithm treats messages as byte-wise and is endian-neutral. 2. The Initialization Vector is endian-neutral as far as transmission and reception is concerned.
I am also calculating an HMAC-SHA384 code for the message. From my reading it sounds as though HMAC-SHA384 does need byte-swapping if the transmission endianness (big-endian in my case) does not match machine endianness. Should the swapping occur between bytes 0 and 47, 1 and 46, and so on? Can anyone more knowledgeable in the subject than I confirm or contradict this please?
I am presently using the .NET HMACSHA384 class, but on the other end I will be writing C++ code and don't yet know what library will provide the HMAC code.
leppie is right, both are sent as byte arrays. And you can be pretty sure that the byte array received will conform to NIST specifications and test vectors. So you should not overly worry about endianness in this case.
If anyone needs to worry, it is the implementors of the hash function. E.g. NIST unfortunately specified a little-endian machine (Intel processor) as the reference platform for SHA-3. The first version of Skein (1.0) had incorrect test vectors because of a bug regarding endianness.
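As an illustration of "sent as byte arrays", here is a minimal sketch using Python's hmac and hashlib modules rather than the .NET or C++ libraries from the question. The key and message are placeholder values, and verify() is a hypothetical helper name.

```python
import hashlib
import hmac

key = b"hypothetical shared key"      # illustrative values, not from the question
message = b"ciphertext bytes go here"

# The tag is a 48-byte string; it is transmitted in exactly this order.
tag = hmac.new(key, message, hashlib.sha384).digest()


def verify(key, message, received_tag):
    """Recompute the tag and compare it byte for byte, with no reordering."""
    expected = hmac.new(key, message, hashlib.sha384).digest()
    return hmac.compare_digest(expected, received_tag)


accepted = verify(key, message, tag)
rejected = verify(key, message, tag[::-1])  # a reversed tag no longer matches
```

Reversing the 48 bytes, as considered in the question, makes verification fail: the receiver compares the tag as an ordered byte sequence, whatever the endianness of either machine.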
