I am writing simple debugging/logging functions using a ring buffer in a chunk of global memory. The problem is the lack of any snprintf-like function in OpenCL. What would you suggest? Using some embedded implementation and extending the format specification for vector types?
(Please do not reply that string ops are inefficient and that OpenCL is designed for computations; I know that.)
Some CPU implementations support printf etc., so that might help if your implementation does not rely on unsupported work-group dimensions. When I worked with OpenCL I would usually do the verification on the host side, i.e. implement the buffer-reading algorithm and then write the data back using a 1:1 map of the work items to the result buffer. This makes it quite easy to verify, as you know which thread wrote what given the index in the result buffer. It might be a good idea to initialize the client buffer with known data (i.e. copy a host buffer into the result buffer before executing the kernel) to avoid confusion.
I realize this isn't a very technical answer, but I hope it helps somewhat.
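That said, here is a minimal host-side sketch of the 1:1 debug-buffer idea (assumes an existing ctx, queue, and a kernel built from the source shown; N and the names are illustrative, and error checks are omitted):

// kernel source: each work item writes exactly one slot of the result buffer
const char* src =
    "__kernel void dbg(__global int* result) {"
    "    result[get_global_id(0)] = (int)get_global_id(0);"
    "}";

std::vector<int> host(N, -1);   // known sentinel data, as suggested above
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                            N * sizeof(int), host.data(), &err);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
size_t gws = N;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, N * sizeof(int), host.data(),
                    0, NULL, NULL);
// any slot still holding -1 identifies a work item that never wrote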
There are a lot of questions online about allocating, copying, indexing, etc. 2d and 3d arrays on CUDA. I'm getting a lot of conflicting answers, so I'm attempting to compile past questions to see if I can ask the right ones.
First link: https://devtalk.nvidia.com/default/topic/392370/how-to-cudamalloc-two-dimensional-array-/
Problem: Allocating a 2d array of pointers
User solution: use mallocPitch
"Correct" inefficient solution: Use malloc and memcpy in a for loop for each row (Absurd overhead)
"More correct" solution: Squash it into a 1d array "professional opinion," one comment saying no one with an eye on performance uses 2d pointer structures on the gpu
Second link: https://devtalk.nvidia.com/default/topic/413905/passing-a-multidimensional-array-to-kernel-how-to-allocate-space-in-host-and-pass-to-device-/
Problem: Allocating space on host and passing it to device
Sub link: https://devtalk.nvidia.com/default/topic/398305/cuda-programming-and-performance/dynamically-allocate-array-of-structs/
Sub link solution: Coding pointer based structures on the GPU is a bad experience and highly inefficient, squash it into a 1d array.
Third link: Allocate 2D Array on Device Memory in CUDA
Problem: Allocating and transferring 2d arrays
User solution: use mallocPitch
Other solution: flatten it
Fourth link: How to use 2D Arrays in CUDA?
Problem: Allocate and traverse 2d arrays
Submitted solution: Does not show allocation
Other solution: squash it
There are a lot of other sources mostly saying the same thing but in multiple instances I see warnings about pointer structures on the GPU.
Many people claim the proper way to allocate an array of pointers is with a call to malloc and memcpy for each row, yet the functions cudaMallocPitch and cudaMemcpy2D exist. Are these functions somehow less efficient? Why wouldn't they be the default answer?
The other 'correct' answer for 2d arrays is to squash them into one array. Should I just get used to this as a fact of life? I'm very persnickety about my code and it feels inelegant to me.
Another solution I was considering was to make a matrix class that uses a 1d pointer array, but I can't find a way to implement the double-bracket operator.
Also according to this link: Copy an object to device?
and the sub link answer: cudaMemcpy segmentation fault
This gets a little iffy.
The classes I want to use CUDA with all have 2d/3d arrays, and wouldn't there be a lot of overhead in converting those to 1d arrays for CUDA?
I know I've asked a lot, but in summary: should I get used to squashed arrays as a fact of life, or can I use the 2d allocate-and-copy functions without incurring the bad overhead of the solution where alloc and cpy are called in a for loop?
Since your question compiles a list of other questions, I'll answer by compiling a list of other answers.
cudaMallocPitch/cudaMemcpy2D:
First, the CUDA runtime API functions like cudaMallocPitch and cudaMemcpy2D do not actually involve either double-pointer allocations or 2D (doubly-subscripted) arrays. This is easy to confirm simply by looking at the documentation and noting the types of parameters in the function prototypes. The src and dst parameters are single-pointer parameters; they cannot be doubly subscripted or doubly dereferenced. For additional example usage, here is one of many questions on this, here is a fully worked example, and another example covering various concepts associated with cudaMallocPitch/cudaMemcpy2D usage is here. Instead, the correct way to think about these functions is that they work with pitched allocations. Also, you cannot use cudaMemcpy2D to transfer data when the underlying allocation has been created using a set of malloc (or new, or similar) operations in a loop. That sort of host data allocation construction is particularly ill-suited to working with the data on the device.
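To illustrate the pitched, single-pointer style these functions expect (a sketch with illustrative names; error checking omitted):

// host data stored contiguously, row-major, width x height floats
float* h_data = (float*)malloc(width * height * sizeof(float));

float* d_data;     // a single pointer, not float**
size_t pitch;      // bytes per padded device row, chosen by the runtime
cudaMallocPitch((void**)&d_data, &pitch, width * sizeof(float), height);

// cudaMemcpy2D handles the differing row strides on each side
cudaMemcpy2D(d_data, pitch,                  // dst pointer and row stride
             h_data, width * sizeof(float),  // src pointer and row stride
             width * sizeof(float), height,  // bytes per row, number of rows
             cudaMemcpyHostToDevice);

// in a kernel, row y starts at (float*)((char*)d_data + y * pitch)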
general, dynamically allocated 2D case:
If you wish to learn how to use a dynamically allocated 2D array in a CUDA kernel (meaning you can use doubly-subscripted access, e.g. data[x][y]), then the cuda tag info page contains the "canonical" question for this; it is here. The answer given by talonmies there includes the proper mechanics, as well as appropriate caveats:
there is additional, non-trivial complexity
the access will generally be less efficient than 1D access, because data access requires dereferencing 2 pointers, instead of 1.
(note that allocating an array of objects, where the object(s) has an embedded pointer to a dynamic allocation, is essentially the same as the 2D array concept, and the example you linked in your question is a reasonable demonstration for that)
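The mechanics in that canonical answer look roughly like this (a sketch; illustrative names, error checking omitted):

// device array of device row pointers, staged through a host copy
float** d_rows;
cudaMalloc(&d_rows, height * sizeof(float*));

std::vector<float*> h_rows(height);
for (int r = 0; r < height; ++r)
    cudaMalloc(&h_rows[r], width * sizeof(float));   // one allocation per row

cudaMemcpy(d_rows, h_rows.data(), height * sizeof(float*), cudaMemcpyHostToDevice);

// a kernel can now use d_rows[y][x]: two dereferences per access,
// which is exactly the efficiency caveat noted above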
Also, here is a thrust method for building a general dynamically allocated 2D array.
flattening:
If you think you must use the general 2D method, then go ahead; it's not impossible (although sometimes people struggle with the process!). However, due to the added complexity and reduced efficiency, the canonical "advice" here is to "flatten" your storage method and use "simulated" 2D access. Here is one of many examples of questions/answers discussing "flattening".
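A minimal sketch of flattened storage with simulated 2D indexing (illustrative names; error checking omitted):

__global__ void scale(float* data, int width, int height, float s)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        data[y * width + x] *= s;   // data[y][x] becomes data[y * width + x]
}

// host side: one allocation, one copy
float* d_flat;
cudaMalloc(&d_flat, width * height * sizeof(float));
cudaMemcpy(d_flat, h_flat, width * height * sizeof(float), cudaMemcpyHostToDevice);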
general, dynamically allocated 3D case:
As we extend this to 3 (or higher!) dimensions, the general case becomes overly complex to handle, IMO. The additional complexity should strongly motivate us to seek alternatives. The triply-subscripted general case involves 3 pointer accesses before the data is actually retrieved, so it is even less efficient. Here is a fully worked example (2nd code example).
special case: array width known at compile time:
Note that it should be considered a special case when the array dimension(s) (the width, in the case of a 2D array, or 2 of the 3 dimensions for a 3D array) is known at compile time. In this case, with an appropriate auxiliary type definition, we can "instruct" the compiler how the indexing should be computed, and we can use doubly-subscripted access with considerably less complexity than the general case, and with no loss of efficiency due to pointer-chasing. Only one pointer need be dereferenced to retrieve the data (regardless of array dimensionality, if n-1 dimensions are known at compile time for an n-dimensional array). The first code example in the already-mentioned answer here gives a fully worked example of that in the 3D case, and the answer here gives a 2D example of this special case.
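In code, the special case looks roughly like this (a sketch; WIDTH, grid, and block are illustrative):

#define WIDTH 64   // known at compile time

__global__ void kern(float (*data)[WIDTH], int height)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    if (y < height)
        data[y][0] = 1.0f;   // doubly-subscripted, but only one pointer dereference
}

// host side:
float (*d_data)[WIDTH];
cudaMalloc(&d_data, height * WIDTH * sizeof(float));
kern<<<grid, block>>>(d_data, height);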
doubly-subscripted host code, singly-subscripted device code:
Finally, another methodology option allows us to easily mix 2D (doubly-subscripted) access in host code while using only 1D (singly-subscripted, perhaps with "simulated 2D" access) in device code. A worked example of that is here. By organizing the underlying allocation as a contiguous allocation, then building the pointer "tree", we can enable doubly-subscripted access on the host and still easily pass the flat allocation to the device. Although the example does not show it, it would be possible to extend this method to create a doubly-subscripted access system on the device, based on the flat allocation and a manually-created pointer "tree"; however, this would have approximately the same issues as the 2D general dynamically allocated method given above: it would involve double-pointer (double-dereference) access, so it is less efficient, and there is some complexity associated with building the pointer "tree" for use in device code (e.g. it would probably necessitate an additional cudaMemcpy operation).
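A sketch of the contiguous-plus-pointer-tree idea (illustrative names; error checking omitted):

// one contiguous block; host-side row pointers are built into it
float* h_flat = (float*)malloc(width * height * sizeof(float));
std::vector<float*> h_rows(height);
for (int r = 0; r < height; ++r)
    h_rows[r] = h_flat + r * width;   // each row pointer aims into the flat block

h_rows[2][5] = 1.0f;                  // doubly-subscripted access on the host

// the device only ever sees the flat allocation
float* d_flat;
cudaMalloc(&d_flat, width * height * sizeof(float));
cudaMemcpy(d_flat, h_flat, width * height * sizeof(float), cudaMemcpyHostToDevice);
// device code indexes it as d_flat[y * width + x]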
From the above methods, you'll need to choose one that fits your appetite and needs. There is not one single recommendation that fits every possible case.
I would like to know and understand how one can declare half-precision buffers and pointers in SYCL, namely in the following ways:
Via the buffer class.
Using malloc_device() function.
Also, suppose I have an existing fp32 matrix / array on the host side. How can I copy its contents to fp16 memory on the GPU side?
TIA
For half-precision, you can just use sycl::half as the template parameter for either of these. For copying, you'll need to convert the data from fp32 to fp16, which you could do with a kernel that performs the conversion element-wise:

accHalf[i] = static_cast<sycl::half>(accFloat[i]);

This seems to be a well-documented problem with solutions; see this thread.
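For completeness, a small sketch of both declaration styles plus a device-side conversion (assumes an existing n and hostFloats; names are illustrative, SYCL 2020 USM):

#include <sycl/sycl.hpp>

sycl::queue q;

// 1) via the buffer class
sycl::buffer<sycl::half, 1> halfBuf{sycl::range<1>{n}};

// 2) via malloc_device()
sycl::half* d_half = sycl::malloc_device<sycl::half>(n, q);

// copy an existing fp32 host array by converting on the device
float* d_float = sycl::malloc_device<float>(n, q);
q.memcpy(d_float, hostFloats, n * sizeof(float)).wait();
q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
    d_half[i] = static_cast<sycl::half>(d_float[i]);   // fp32 -> fp16
}).wait();
sycl::free(d_float, q);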
I need to send some fairly large data structures between instances of my running Ada program. Obviously JSON over HTTPS is an option. Not one I want to use, as its data overhead is bigger than I'd like, but it will work for now.
Ideally I'd want to mash it into a binary blob sent with a hash to confirm the message. Is there a decent way to do this in Ada?
I would look for a solution based on Streams, sent over TCP.
If you want to implement your own blocking and hashing, you’ll probably need to write the raw stream to memory first so that you can tell how big the blob is and work out the checksum. A fairly straightforward approach to this would be here, spec and body.
For a solution that’s had a lot more work put into it, look at Dmitry Kazakov’s Simple Components’ Block Streams.
Ideally I'd want to mash it into a binary blob sent with a hash to confirm the message. Is there a decent way to do this in Ada?
As mentioned, the DSA [Annex E] is an excellent way to handle this, though with some caveats due to the implementations (rather than language) — the definitions of/for the DSA are broad enough that the transport could be mostly anything, so long as the interface (RPC- and Stream-based) is respected.
Things will be simple[r] if you structure your program with proper categorizations from the outset —see Pure, Shared_Passive, Remote_Types, and Remote_Call_Interface in the ARM, and the Ada Rationale— rather than trying to shoehorn something extant into the DSA's required structuring. (That said, there are some cases where modifying an extant program to be DSA-capable is simply a matter of adding the categorization pragmas/aspects and configuring+compiling.)
Note that Ada's containers are designed so that they can be used in DSA programs, and are all [IIRC] Remote_Types categorized.
I need to send some fairly large data structures between instances of my running Ada program.
Also an option is ASN.1, which allows you to make a type-definition for some data in a language- and machine-independent manner. There are several ASN.1 compilers and a good chunk can generate Ada; here's one (written in F*, IIRC) used by the ESA and freely available open-source.
ASN.1 has encoding rules optimized for space (the Packed Encoding Rules, PER), and so will give you the most compact on-the-wire representation.
Obviously JSON over HTTPS is an option. Not one I want to use, as its data overhead is bigger than I'd like, but it will work for now.
Using HTTP and JSON directly is attractive for many people because it's "easy", though this ease is typically misleading: all the things that they don't do, such as range-checking values or validating the structure, are offloaded onto the programmer. — That said, you can make things modular and use generics to allow you to "swap out" methods.
Generic
   Type Data(<>) is private;
   Type Transport_Type(<>) is private;
   Target : However_you_address_the_target;
   with Function Encode(Input : Data) return Transport_Type;
Procedure Send( Value : Data );

and

Generic
   Type Data(<>) is private;
   Type Transport_Type(<>) is private;
   with Function Decode(Input : Transport_Type) return Data;
Function Receive( Value : Transport_Type ) return Data;
Or something similar to this. I would rate this as less convenient than using the DSA, but also possibly a bit simpler, considering you [mostly] don't have to worry about categorization with this method.
Pipes are one of OpenCL 2.0's new features, and this feature has been demonstrated in the AMD APP SDK's producer/consumer example. I've read some articles about pipes' use cases, and they all follow the producer/consumer pattern.
My question is: given that OpenCL 2.0 provides shared virtual memory, the same functionality can be achieved by creating a global memory object and passing the pointer to two kernel functions. So what's the difference between a pipe object and a global memory object? Or was it invented just for optimization?
It is as useful as std::vector and std::queue.
One is useful to store data, while the other is useful to store packets.
Packets are indeed data, but it is much easier to handle them as small units rather than a big block.
Pipes in OpenCL let you consume these small packets in a kernel without having to deal with the indexing + storing + pointers + for-loop hell you'd face if you implemented a pipe mechanism manually in the kernel.
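As a rough sketch of the producer/consumer pattern (illustrative names; error checking omitted):

// OpenCL 2.0 kernel side
__kernel void producer(__write_only pipe int p)
{
    int value = (int)get_global_id(0);
    write_pipe(p, &value);            // returns 0 if the packet was written
}

__kernel void consumer(__read_only pipe int p, __global int* out)
{
    int value;
    if (read_pipe(p, &value) == 0)    // 0 means a packet was available
        out[get_global_id(0)] = value;
}

// host side: the pipe object is created once and bound to both kernels
cl_mem p = clCreatePipe(ctx, 0, sizeof(int), max_packets, NULL, &err);
clSetKernelArg(producer_kernel, 0, sizeof(cl_mem), &p);
clSetKernelArg(consumer_kernel, 0, sizeof(cl_mem), &p);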
Pipes are useful, for example, when each work item can generate a variable number of outputs. Prior to OpenCL 2.0 this was difficult to handle.
Pipes may reside in faster memory (vendor-specific); e.g., Altera recommends using pipes to exchange data between kernels instead of using global memory.
Pipes are designed to transfer data from one kernel to another without the need to store/load the data in/from global or host memory. This is essentially a FIFO on the FPGA device, so accessing the data is much faster than going through DDR or host memory. This is probably the reason to use an FPGA as an accelerator.
Sometimes DDR is used to share data between kernels as well. One example is when a SIMD kernel wants to share data with a single-task kernel that has requirements on the input data sequence, since pipe packets written from a SIMD kernel can arrive out of order.
Beyond pipes, you can use Altera channels for more functionality, but these are not portable to other OpenCL devices.
Hope this can help. :)
So I'm having a hard time grasping the idea behind pointers and all that memory allocation.
I'm thinking: nowadays, with computers as powerful as they are, why do we have to use pointers at all?
Isn't there always a workaround to do things without the help of pointers?
Pointers are an indirection: instead of working with the data itself, you are working with something that points to the data. Depending on the semantics of the language, this allows many things: cheaply switching to another instance of data (by setting the pointer to point to another instance), accessing the original data through a passed pointer without having to make a possibly expensive copy, etc.
Memory allocation is related to pointers, but separate: you can have pointers without allocating memory. The reason you need pointers for memory allocation is that the actual address at which the allocated block of memory resides is not known at compile time, so you can only access it via a level of indirection (i.e. pointers) -- the compiler statically allocates space for the pointer that will point to the dynamically allocated memory.
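A tiny C++ illustration of that indirection (illustrative names and sizes):

struct Big { int data[1024]; };             // imagine this being megabytes

void by_value(Big b)    { b.data[0] = 1; }  // mutates a throwaway copy
void by_pointer(Big* b) { b->data[0] = 1; } // mutates the caller's object

int main() {
    Big big = {};
    by_value(big);     // copies the whole struct; original is unchanged
    by_pointer(&big);  // copies only an address; original is modified
}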
Pointers are incredibly powerful. Just because computers have faster processing nowadays doesn't mean there's any reason to abandon something as essential as pointers. Passing around giant chunks of memory on the stack is inefficient at best, catastrophic at worst. With pointers, you only need to maintain a reference to where the data resides, rather than duplicating huge chunks of memory each time you call a function.
Also, if you're copying all the data every time, how do you modify the original data? Aside from returning the copy of the structure in every call that touches it.
I remember reading somewhere that Dijkstra was assessing a student for a programming course; this student was quite intelligent but s/he wasn't able to solve the problem because there was sort of a mental block.
All the code was sort of ok, but what was needed was simply to use the expression
a[a[i+1]] = j;
and even though being so close to the solution, the goal still seemed to be miles away.
Languages "without pointers" already exist... e.g. BASIC. Without explicit pointers, that is. But the indirection idea... the idea that you can have data to mean just where to find other data is central to programming.
The very idea of an array is about being able to use computed values to find other values.
Trying to hide this idea is a horrible plan. According to Dijkstra, anyone who has been exposed to the BASIC language has already received such mental mutilation that recovery as a good programmer is impossible (and probably the absence of explicit indirection was one of the problems).
I think he was exaggerating.
Just a bit.