What's the difference between int and cl_int in OpenCL? [duplicate]

This question already has an answer here: When to use the OpenCL API scalar data types?
There are many data types in OpenCL, such as int, cl_int, char, cl_char, short, and cl_short. But what is the difference between int and cl_int, and when should I use cl_int instead of int?

The size of an int in C/C++ is machine dependent. It is guaranteed to be at least 16 bits, but these days it will usually be 32 bits, and could also be 64. This poses a problem when passing data between a host and device in OpenCL: if the device has a different idea of what the size of an int is, then passing int values to the device might not produce the expected results.
The OpenCL headers define cl_int as a datatype that is always exactly 32 bits, which matches the size that an OpenCL device expects. This means that you can pass a cl_int value, or an array of cl_int values, from the host to the device (and back) without running into problems with mismatched sizes.
So, whenever you are writing host code that deals with values or buffers that will be passed to the device, you should always use the cl_ datatypes.
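For example, a minimal host-side sketch (the kernel, context, and argument index are hypothetical; only the use of the cl_ types is the point here):

#include <CL/cl.h>

/* cl_int is exactly 32 bits on every host, matching the device's int. */
cl_int count = 1024;
clSetKernelArg(kernel, 0, sizeof(cl_int), &count);

/* The same applies to whole buffers of cl_int values: */
cl_int hostData[1024];
cl_mem buf = clCreateBuffer(context,
                            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            sizeof(hostData), hostData, NULL);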

Related

Accessing structured data following a struct in OpenCL

Summary: Does OpenCL permit creating a pointer in a kernel function from a pointer to a structure and a byte offset to data after the structure in the same memory block?
I'm trying to better understand the limitations of OpenCL with regard to pointers and structures. A project I'm currently working on involves processing different kinds of signal nodes, which can have drastically different sized state data from one processing instance to the next. I'm starting with a low-latency Linux CPU implementation (SCHED_FIFO, so no memory allocation or system calls in the processing threads), but I am trying to plan for an eventual OpenCL implementation.
With this in mind, I started designing the algorithm to allocate all the state data as one block, which begins with a structure and has additional data structures and arrays appended, being careful about proper alignment for data types. Integer offset fields in the structures indicate the byte positions of the additional data in the buffer. So technically there aren't any pointers in the structures, which would likely not work when passing the data from host to device. However, the resulting size of the state data will differ from one synthesis node to the next, though the size won't change once it is allocated. I'm not sure if this breaks the "no variable length structures" rule of OpenCL or not.
Simple example (pseudo OpenCL code):
// Additional data following Node structure:
//   cl_float fArray[fArrayLen];
//   cl_uint  iArray[iArrayLen];
typedef struct
{
    cl_float val1;
    cl_float val2;
    cl_uint  fArrayOfs;
    cl_uint  fArrayLen;
    cl_uint  iArrayOfs;
    cl_uint  iArrayLen;
    ...
} Node;

void
node_process (__global Node *node)
{
    __global cl_float *fArray;
    __global cl_uint  *iArray;

    // Construct pointers to arrays following Node structure
    fArray = ((cl_uchar *)node) + node->fArrayOfs;
    iArray = ((cl_uchar *)node) + node->iArrayOfs;
    ...
}
If this isn't possible, does anyone have suggestions for defining complex data structures which are somewhat dynamic in nature, without passing dozens of pointers to kernel functions? The dynamic nature only applies when they are allocated, not once the kernel is processing. The only other option I can think of is defining the processing node state as a union and passing the additional data structures as separate parameters to the kernel function, but that is likely to turn into a huge number of function parameters. Or maybe a __local structure with pointers is permissible?
Yes, this is allowed in OpenCL (as long as you stick to alignment rules, as you mentioned). You will, however, want to be very careful:
First,
fArray = ((cl_uchar *)node) + node->fArrayOfs;
         ^^^^^^^^^^
You've missed off the memory type here; make sure you include __global, or it defaults to (IIRC) __private, which takes you straight to the land of undefined behaviour. In general, I recommend being explicit about the memory type for all pointer declarations and types, as the defaults are often non-obvious.
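For reference, a corrected version of those two assignments with the address space spelled out (and a cast to the proper element type, which strict compilers will also want), keeping the question's pseudo-OpenCL type names:

fArray = (__global cl_float *)((__global cl_uchar *)node + node->fArrayOfs);
iArray = (__global cl_uint  *)((__global cl_uchar *)node + node->iArrayOfs);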
Second, if you're planning to run this on GPUs and the control flow and memory access patterns for adjacent work-items are very different, you are in for a bad time, performance-wise. I recommend reading the GPU vendors' OpenCL performance optimisation guides before architecting the way you split up the work and design the data structures.

Self Referencing Pointer in OpenCL

I have OpenCL C++ code working on the Intel platform. I am aware that pointers are not accepted within a structure on the kernel side. However, I have a class which uses a self-referencing pointer. I am able to replicate the class as a structure on the host side, but I am not able to do the same on the device side.
For example as follows:
class Classname {
    Classname *SameClass_Selfreferencingpointer;
};
On the Host side I have done the same for the structure as well:
struct Structurename {
    Structurename *SameStructure_Selfreferencingpointer;
};
Could someone give an alternate option for this implementation for the device side?
Thank you for any help in advance.
Since there is no malloc on an OpenCL device, and structs are used in buffers as arrays of structs, you could store each struct's index so it knows where it sits in the array. You can allocate a big buffer before launching the kernel, then use atomic functions to increment a fake malloc pointer as if it were allocating from the buffer, but simply returning an integer that points to the last "allocated" struct index. The host side would then just use the index instead of a pointer.
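A minimal sketch of that idea in OpenCL C (names like pool and alloc_counter are made up for illustration):

// Nodes reference each other by array index instead of by pointer.
typedef struct
{
    int   next;   // index of the next node in the pool, -1 for "null"
    float value;
} Node;

__kernel void link_nodes (__global Node *pool,
                          volatile __global int *alloc_counter)
{
    // Atomically reserve the next free slot, emulating malloc from a
    // big buffer that was allocated before the kernel was launched.
    int idx = atomic_inc(alloc_counter);
    pool[idx].next  = -1;
    pool[idx].value = 0.0f;
}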
If struct alignment becomes an issue between host and device, you can add indexing of fields too, such as the starting byte of field A and the starting byte of field B, all compacted into a single 4-byte integer for a struct with four used fields besides the indexes.
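For instance, a sketch of that field-offset packing, assuming each offset fits in one byte (the helper name is illustrative):

// Four byte offsets, one per field, packed into a single cl_uint.
cl_uint pack_offsets (cl_uchar a, cl_uchar b, cl_uchar c, cl_uchar d)
{
    return ((cl_uint)a)       |
           ((cl_uint)b << 8)  |
           ((cl_uint)c << 16) |
           ((cl_uint)d << 24);
}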
Maybe you can add a preprocessing stage:
the host writes an artificial number, such as 3.1415, to a field
the device checks the floating-point values in the struct at all byte offsets until it finds 3.1415
the device puts the found byte offset into an array and sends it to the host
the host then writes float fields in the struct starting from that byte offset
so the host and device become alignment compatible, and the same offset is used in all kernels that get a struct from the host. Maybe the opposite direction is better:
the device puts 3.14 in a field of the struct
the device writes the struct to an array of structs
the host gets the buffer
the host checks for 3.14 and finds the byte offset
the host writes an fp number starting from that offset for future work
This would need both your class and its replicated struct on the host and device sides.
You should also look into the SYCL API.

How to replace MPI_Pack_size if I need to send more than 2GB of data?

I want to send and receive more than 2 GB of data using MPI and I came across a lot of articles like the ones cited below:
http://blogs.cisco.com/performance/can-we-count-on-mpi-to-handle-large-datasets,
http://blogs.cisco.com/performance/new-things-in-mpi-3-mpi_count
talking about changes made starting with MPI 3.0 that allow sending and receiving bigger chunks of data.
Most of the functions now take an MPI_Count parameter instead of int, but not all of them.
How can I replace
int MPI_Pack_size(int incount, MPI_Datatype datatype, MPI_Comm comm,
                  int *size)
in order to get the size of a larger buffer? (Here the size can be at most 2 GB.)
The MPI_Pack routines (MPI_Pack, MPI_Unpack, MPI_Pack_size, MPI_Pack_external) are, as you see, unable to support more than 32 bits' worth of data, due to the int pointer used as an output value. I don't know why the standard did not provide MPI_Pack_x, MPI_Unpack_x, MPI_Pack_size_x, and MPI_Pack_external_x -- presumably an oversight? As Jeff suggests, it might have been done so because packing multiple gigs of data is unlikely to provide much benefit. Still, it breaks orthogonality not to have those...
A quality implementation (I do not know if MPICH is one of those) should return an error about the type being too big, allowing you to pack a smaller amount of data.
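If you only need the total packed size, one hedged workaround is to query MPI_Pack_size in chunks that stay within int range and accumulate the result in a wider type (CHUNK_ELEMS and the helper name are illustrative):

#include <mpi.h>
#include <stddef.h>

size_t large_pack_size (size_t total_elems, MPI_Datatype dtype, MPI_Comm comm)
{
    const size_t CHUNK_ELEMS = 1 << 20;   /* 1M elements per query */
    size_t total_bytes = 0;
    while (total_elems > 0) {
        int n = (int)(total_elems < CHUNK_ELEMS ? total_elems : CHUNK_ELEMS);
        int chunk_bytes = 0;
        MPI_Pack_size(n, dtype, comm, &chunk_bytes);
        total_bytes += (size_t)chunk_bytes;
        total_elems -= (size_t)n;
    }
    return total_bytes;
}

Since MPI_Pack_size already returns an upper bound on the space needed, summing per-chunk results remains a safe over-estimate for buffer allocation.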

MPI_Aint in MPI_(I)NEIGHBOR_ALLTOALLW() vs int in MPI_(I)ALLTOALLW()

With MPI 3.0, neighborhood collective communications were introduced.
In two of them (MPI_NEIGHBOR_ALLTOALLW and MPI_INEIGHBOR_ALLTOALLW), the displacements (sdispls and rdispls) are arrays of const MPI_Aint, contrary to how the corresponding ordinary collective functions (MPI_ALLTOALLW and MPI_IALLTOALLW) are defined: there they are arrays of const int.
Also considering what the MPI Standard v3.0 says about MPI_Aint (page 16):
2.5.6 Addresses
Some MPI procedures use address arguments that represent an absolute address in the
calling program. The datatype of such an argument is MPI_Aint in C and
INTEGER (KIND=MPI_ADDRESS_KIND) in Fortran. These types must have the same width
and encode address values in the same manner such that address values in one language
may be passed directly to another language without conversion. There is the MPI constant
MPI_BOTTOM to indicate the start of the address range.
I still don't get the point, nor the difference between int and MPI_Aint (other than that MPI_Aint can't be negative)!
MPI_Aint is a portable C data type that can hold memory addresses, and it could be larger than the usual int. The policy of the MPI Forum is to not change the signature of existing MPI calls (as it could break existing applications - see here). Rather, new calls are introduced that supersede the old ones. The rationale is that int worked well until LP64 64-bit architectures became popular, at which point int could no longer address the whole virtual address space of a single process. After this realisation, later versions of the standard gave some MPI calls new variants that use MPI_Aint or MPI_Count (a large integer type) instead of int. For example, MPI_Get_elements_x supersedes MPI_Get_elements and uses MPI_Count instead of int.
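A short sketch of the large-count variant in use (buf, count, source, tag, and comm are placeholders):

MPI_Status status;
MPI_Count  nelems;

MPI_Recv(buf, count, MPI_INT, source, tag, comm, &status);
/* MPI_Get_elements_x reports the element count as MPI_Count,
 * which can exceed the range of int. */
MPI_Get_elements_x(&status, MPI_INT, &nelems);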
In this respect, MPI_Alltoallw is an old call (it comes from MPI-2.0) and retains its signature using int offsets, while MPI_(I)Neighbor_alltoallw is a new one (it comes with MPI-3.0) and uses the address type in order to be able to work with data located (almost) anywhere in memory.
The same applies to the Fortran bindings.
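To make the difference concrete, here is a C sketch (assuming a process with exactly two neighbours in a Cartesian topology; all names are illustrative):

#include <mpi.h>

void neighbour_exchange (MPI_Comm cart_comm, double *sendbuf, double *recvbuf)
{
    int          counts[2] = { 1, 1 };
    MPI_Datatype types[2]  = { MPI_DOUBLE, MPI_DOUBLE };
    MPI_Aint     sdispls[2], rdispls[2];

    /* Displacements are byte offsets held in MPI_Aint, so they can
     * address the whole virtual address space, unlike the int
     * displacements of MPI_Alltoallw. */
    sdispls[0] = 0;
    sdispls[1] = (MPI_Aint)sizeof(double);
    rdispls[0] = 0;
    rdispls[1] = (MPI_Aint)sizeof(double);

    MPI_Neighbor_alltoallw(sendbuf, counts, sdispls, types,
                           recvbuf, counts, rdispls, types, cart_comm);
}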

How to send integers over QTcpSocket?

I'm new to Qt, so forgive me if this is a completely stupid question...
I'm trying to use QTcpSocket.
If I do:
...
QTcpSocket *socket;
socket = new QTcpSocket(this);
socket->write(1);
...
It complains about write() not working with integers (not a const char *).
If I do:
...
QTcpSocket *socket;
socket = new QTcpSocket(this);
socket->write("1");
...
the other side sees it as the integer 49 (ASCII for 1).
On a similar but different issue, is it possible to send structs or unions over QTcpSocket?
==================================================
EDIT:
The server already accepts integers, and is expecting an integer - I have no control over that.
The problem you have is not really related to Qt; the same issue would arise with any other socket or streaming interface.
The provider of the server needs to give you the protocol description. This description usually contains the ports used (TCP, UDP, port numbers), other TCP parameters, and the coding of the transmitted data. Sometimes this protocol (or its implementation) is called a (protocol) stack.
The coding not only contains the byte ordering, but also a description of how complex structures are transmitted.
The latter information is often coded in something that is called "ASN.1" - Abstract Syntax Notation.
In case your server is really simple, just accepts integers one after the other without any meta-information, and runs on the same platform, then you could do something like this:
for (int i : mySetOfIntegers)
{
    ioDevice->write((const char *) &i, sizeof(i));
}
You take the address of your integer as the start of a data buffer and transmit as many bytes as your integer has.
But note well: this will fail if you transmit data from an Intel architecture to a 16-bit architecture or a Motorola PPC.
I suggest using QDataStream with sockets. This will protect you from little-endian/big-endian conversion problems.
So, something like this:
qint32 myData = 1;
QDataStream os( &mySocket );
os << myData;
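On the receiving side the same approach works in reverse (a sketch; mySocket is assumed to be connected and the data to have arrived already):

qint32 myData;
QDataStream is( &mySocket );
is >> myData;   // QDataStream undoes the byte-order conversion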
When you write in a string representation, you also have to interpret the string on the other side; there is, for example, QString::toInt().
When you write the integer as an integer, you get more throughput, as it takes fewer bytes to transmit. However, you should read up on the topic of network byte order.
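If you do send raw integers rather than using QDataStream, you can convert explicitly; a sketch using Qt's QtEndian helpers (socket is a placeholder for your connected QTcpSocket*):

#include <QtEndian>

qint32 value = 1;
qint32 wire  = qToBigEndian(value);   // host order -> network byte order
socket->write(reinterpret_cast<const char *>(&wire), sizeof(wire));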
In principle it is possible to copy structs, etc. into a buffer and send them over the network. However, things get complicated again when you transmit data between different architectures or even just different builds of your software. So you shouldn't send the raw data, but use serialization! See this question:
Serialization with Qt
It provides answers on how to generate streams out of objects and objects out of streams. These streams are what you then use to transmit over the network. Then you don't have to deal with the integers themselves anymore!
These are the overloads you are looking for:
qint64 QIODevice::write ( const char * data, qint64 maxSize );
and
qint64 QIODevice::write ( const QByteArray & byteArray );
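For example, a sketch combining the QByteArray overload with QDataStream (socket is assumed to be a connected QTcpSocket*):

QByteArray block;
QDataStream out(&block, QIODevice::WriteOnly);
out << qint32(1);        // serialize the integer into the byte array
socket->write(block);    // send it via the QByteArray overload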
