Declaring half-precision floating point memory in SYCL (OpenCL)

I would like to know and understand how one can declare half-precision buffers and pointers in SYCL, namely in the following ways:
Via the buffer class.
Using the malloc_device() function.
Also, suppose I have an existing fp32 matrix/array on the host side. How can I copy its contents to fp16 memory on the GPU side?
TIA

For half precision, you can just use sycl::half as the template parameter for either of these.
For copying, you'll need to convert the data from fp32 to fp16, which you can do with a kernel that performs the conversion element by element:
accHalf[i] = static_cast<sycl::half>(accFloat[i]);
This seems to be a well-documented problem with existing solutions; see this thread.
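Here is a minimal sketch putting both pieces together, assuming a SYCL 2020 compiler (names like hostData and n are illustrative, and error handling is omitted):

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    sycl::queue q;
    std::vector<float> hostData(1024, 1.0f);  // existing fp32 data on the host
    const size_t n = hostData.size();

    // (1) Half precision via the buffer class: sycl::half as the element type.
    sycl::buffer<float, 1> floatBuf{hostData.data(), sycl::range<1>{n}};
    sycl::buffer<sycl::half, 1> halfBuf{sycl::range<1>{n}};

    q.submit([&](sycl::handler& h) {
        sycl::accessor accFloat{floatBuf, h, sycl::read_only};
        sycl::accessor accHalf{halfBuf, h, sycl::write_only, sycl::no_init};
        // Conversion kernel: fp32 -> fp16, element by element.
        h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
            accHalf[i] = static_cast<sycl::half>(accFloat[i]);
        });
    });

    // (2) Half precision via USM: malloc_device templated on sycl::half.
    float* floatDev = sycl::malloc_device<float>(n, q);
    sycl::half* halfDev = sycl::malloc_device<sycl::half>(n, q);
    q.memcpy(floatDev, hostData.data(), n * sizeof(float)).wait();
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
        halfDev[i] = static_cast<sycl::half>(floatDev[i]);
    }).wait();

    sycl::free(floatDev, q);
    sycl::free(halfDev, q);
}
```

Note that there is no host-side fp32-to-fp16 bulk copy in SYCL itself; in both cases the conversion happens in a kernel.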

Related

Indexing Dynamic Arrays in CUDA [duplicate]

There are a lot of questions online about allocating, copying, indexing, etc. of 2d and 3d arrays in CUDA. I'm getting a lot of conflicting answers, so I'm attempting to compile past questions to see if I can ask the right ones.
First link: https://devtalk.nvidia.com/default/topic/392370/how-to-cudamalloc-two-dimensional-array-/
Problem: Allocating a 2d array of pointers
User solution: use mallocPitch
"Correct" inefficient solution: Use malloc and memcpy in a for loop for each row (Absurd overhead)
"More correct" solution: Squash it into a 1d array "professional opinion," one comment saying no one with an eye on performance uses 2d pointer structures on the gpu
Second link: https://devtalk.nvidia.com/default/topic/413905/passing-a-multidimensional-array-to-kernel-how-to-allocate-space-in-host-and-pass-to-device-/
Problem: Allocating space on host and passing it to device
Sub link: https://devtalk.nvidia.com/default/topic/398305/cuda-programming-and-performance/dynamically-allocate-array-of-structs/
Sub link solution: Coding pointer based structures on the GPU is a bad experience and highly inefficient, squash it into a 1d array.
Third link: Allocate 2D Array on Device Memory in CUDA
Problem: Allocating and transferring 2d arrays
User solution: use mallocPitch
Other solution: flatten it
Fourth link: How to use 2D Arrays in CUDA?
Problem: Allocate and traverse 2d arrays
Submitted solution: Does not show allocation
Other solution: squash it
There are a lot of other sources mostly saying the same thing but in multiple instances I see warnings about pointer structures on the GPU.
Many people claim the proper way to allocate an array of pointers is with a call to malloc and memcpy for each row, yet the functions cudaMallocPitch and cudaMemcpy2D exist. Are these functions somehow less efficient? Why wouldn't this be the default answer?
The other 'correct' answer for 2d arrays is to squash them into one array. Should I just get used to this as a fact of life? I'm very persnickety about my code and it feels inelegant to me.
Another solution I was considering was to make a matrix class that uses a 1d pointer array, but I can't find a way to implement the double bracket operator.
Also according to this link: Copy an object to device?
and the sub link answer: cudaMemcpy segmentation fault
This gets a little iffy.
The classes I want to use CUDA with all have 2/3d arrays and wouldn't there be a lot of overhead in converting those to 1d arrays for CUDA?
I know I've asked a lot, but in summary: should I get used to squashed arrays as a fact of life, or can I use the 2d allocate and copy functions without incurring bad overhead like in the solution where alloc and cpy are called in a for loop?
Since your question compiles a list of other questions, I'll answer by compiling a list of other answers.
cudaMallocPitch/cudaMemcpy2D:
First, the cuda runtime API functions like cudaMallocPitch and cudaMemcpy2D do not actually involve either double-pointer allocations or 2D (doubly-subscripted) arrays. This is easy to confirm simply by looking at the documentation and noting the types of parameters in the function prototypes. The src and dst parameters are single-pointer parameters. They could not be doubly-subscripted or doubly dereferenced. For additional example usage, here is one of many questions on this, and here is a fully worked example. Another example covering various concepts associated with cudaMallocPitch/cudaMemcpy2D usage is here. Instead, the correct way to think about these is that they work with pitched allocations. Also, you cannot use cudaMemcpy2D to transfer data when the underlying allocation has been created using a set of malloc (or new, or similar) operations in a loop. That sort of host data allocation construction is particularly ill-suited to working with the data on the device.
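As a hedged sketch of the pitched pattern (host side only; the sizes are illustrative and error checking is omitted), note that both the source and the destination are single pointers into contiguous storage:

```cpp
#include <cuda_runtime.h>
#include <vector>

int main() {
    const size_t width = 1024, height = 768;        // illustrative sizes
    std::vector<float> host(width * height, 1.0f);  // one contiguous host allocation

    // Pitched device allocation: a single pointer plus a row pitch in bytes.
    float* dev = nullptr;
    size_t pitch = 0;
    cudaMallocPitch((void**)&dev, &pitch, width * sizeof(float), height);

    // 2D copy: src and dst are single pointers, not pointer arrays.
    cudaMemcpy2D(dev, pitch,
                 host.data(), width * sizeof(float),  // host rows are tightly packed
                 width * sizeof(float), height,
                 cudaMemcpyHostToDevice);

    // In a kernel, row y then starts at (float*)((char*)dev + y * pitch).

    cudaFree(dev);
}
```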
general, dynamically allocated 2D case:
If you wish to learn how to use a dynamically allocated 2D array in a CUDA kernel (meaning you can use doubly-subscripted access, e.g. data[x][y]), then the cuda tag info page contains the "canonical" question for this; it is here. The answer given by talonmies there includes the proper mechanics, as well as appropriate caveats:
there is additional, non-trivial complexity
the access will generally be less efficient than 1D access, because data access requires dereferencing 2 pointers, instead of 1.
(note that allocating an array of objects, where the object(s) has an embedded pointer to a dynamic allocation, is essentially the same as the 2D array concept, and the example you linked in your question is a reasonable demonstration for that)
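A hedged sketch of those mechanics (sizes illustrative, error checking omitted): allocate each row on the device, build the array of row pointers on the host, then copy the pointer array itself to the device:

```cpp
#include <cuda_runtime.h>

__global__ void addOne2D(float** data, int width) {
    // Doubly-subscripted access: two pointer dereferences per element.
    data[blockIdx.x][threadIdx.x] += 1.0f;
}

int main() {
    const int height = 4, width = 256;

    // One device allocation per row, tracked by a host-side pointer array.
    float* hostRowPtrs[height];
    for (int y = 0; y < height; ++y)
        cudaMalloc((void**)&hostRowPtrs[y], width * sizeof(float));

    // Copy the array of row pointers to the device.
    float** devRowPtrs = nullptr;
    cudaMalloc((void**)&devRowPtrs, height * sizeof(float*));
    cudaMemcpy(devRowPtrs, hostRowPtrs, height * sizeof(float*),
               cudaMemcpyHostToDevice);

    addOne2D<<<height, width>>>(devRowPtrs, width);
    cudaDeviceSynchronize();

    for (int y = 0; y < height; ++y) cudaFree(hostRowPtrs[y]);
    cudaFree(devRowPtrs);
}
```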
Also, here is a thrust method for building a general dynamically allocated 2D array.
flattening:
If you think you must use the general 2D method, then go ahead, it's not impossible (although sometimes people struggle with the process!) However, due to the added complexity and reduced efficiency, the canonical "advice" here is to "flatten" your storage method, and use "simulated" 2D access. Here is one of many examples of questions/answers discussing "flattening".
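A minimal sketch of the flattened pattern (illustrative sizes; the index arithmetic y * width + x is the "simulated" 2D access):

```cpp
#include <cuda_runtime.h>
#include <vector>

__global__ void addOneFlat(float* data, int width) {
    int x = threadIdx.x, y = blockIdx.x;
    data[y * width + x] += 1.0f;   // simulated 2D indexing into a 1D allocation
}

int main() {
    const int height = 4, width = 256;
    std::vector<float> host(height * width, 0.0f);

    // A single flat allocation and a single ordinary copy suffice.
    float* dev = nullptr;
    cudaMalloc((void**)&dev, host.size() * sizeof(float));
    cudaMemcpy(dev, host.data(), host.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    addOneFlat<<<height, width>>>(dev, width);

    cudaMemcpy(host.data(), dev, host.size() * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dev);
}
```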
general, dynamically allocated 3D case:
As we extend this to 3 (or higher!) dimensions, the general case becomes overly complex to handle, IMO. The additional complexity should strongly motivate us to seek alternatives. The triply-subscripted general case involves 3 pointer accesses before the data is actually retrieved, so even less efficient. Here is a fully worked example (2nd code example).
special case: array width known at compile time:
Note that it should be considered a special case when the array dimension(s) (the width, in the case of a 2D array, or 2 of the 3 dimensions for a 3D array) is known at compile-time. In this case, with an appropriate auxiliary type definition, we can "instruct" the compiler how the indexing should be computed, and in this case we can use doubly-subscripted access with considerably less complexity than the general case, and there is no loss of efficiency due to pointer-chasing. Only one pointer need be dereferenced to retrieve the data (regardless of array dimensionality, if n-1 dimensions are known at compile time for a n-dimensional array). The first code example in the already-mentioned answer here (first code example) gives a fully worked example of that in the 3D case, and the answer here gives a 2D example of this special case.
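A short sketch of that special case (the width is a compile-time constant; names are illustrative): the auxiliary row type lets the compiler fold the 2D index into a single offset, so only one pointer is dereferenced:

```cpp
#include <cuda_runtime.h>

const int WIDTH = 256;        // array width known at compile time
typedef float row_t[WIDTH];   // auxiliary type: one row of the 2D array

__global__ void fill(row_t* data) {
    // Doubly-subscripted access, but only a single pointer dereference:
    // the compiler computes the offset blockIdx.x * WIDTH + threadIdx.x.
    data[blockIdx.x][threadIdx.x] = 1.0f;
}

int main() {
    const int height = 4;
    row_t* dev = nullptr;
    cudaMalloc((void**)&dev, height * sizeof(row_t));  // one flat allocation
    fill<<<height, WIDTH>>>(dev);
    cudaDeviceSynchronize();
    cudaFree(dev);
}
```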
doubly-subscripted host code, singly-subscripted device code:
Finally another methodology option allows us to easily mix 2D (doubly-subscripted) access in host code while using only 1D (singly-subscripted, perhaps with "simulated 2D" access) in device code. A worked example of that is here. By organizing the underlying allocation as a contiguous allocation, then building the pointer "tree", we can enable doubly-subscripted access on the host, and still easily pass the flat allocation to the device. Although the example does not show it, it would be possible to extend this method to create a doubly-subscripted access system on the device based off a flat allocation and a manually-created pointer "tree", however this would have approximately the same issues as the 2D general dynamically allocated method given above: it would involve double-pointer (double-dereference) access, so less efficient, and there is some complexity associated with building the pointer "tree", for use in device code (e.g. it would necessitate an additional cudaMemcpy operation, probably).
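A hedged sketch of that mix (host side; illustrative sizes): the storage is one contiguous block, the host-side pointer "tree" merely indexes into it, and the flat block is what gets copied to the device:

```cpp
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int height = 4, width = 256;

    // One contiguous host allocation plus a host-side pointer "tree".
    std::vector<float> storage(height * width, 0.0f);
    std::vector<float*> rows(height);
    for (int y = 0; y < height; ++y)
        rows[y] = storage.data() + y * width;

    rows[2][5] = 3.14f;   // doubly-subscripted access on the host

    // The underlying flat allocation passes to the device unchanged;
    // device code then uses singly-subscripted dev[y * width + x].
    float* dev = nullptr;
    cudaMalloc((void**)&dev, storage.size() * sizeof(float));
    cudaMemcpy(dev, storage.data(), storage.size() * sizeof(float),
               cudaMemcpyHostToDevice);
    cudaFree(dev);
}
```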
From the above methods, you'll need to choose one that fits your appetite and needs. There is not one single recommendation that fits every possible case.

Is there a safe alternative to replace the unsafe slice::from_raw_parts for raw pointers?

I have a Rust dynamic library which is intended to be called from any language. The arguments to the exported function are two char * pointers to memory and the lengths of the two pieces of memory.
The problem is that from_raw_parts reduces to a memcpy and can segfault in a variety of dangerous ways if, for example, the lengths are wrong. I'm then using bincode::deserialize on the slices to use them as Rust objects. Is there any safer option for dealing with incoming raw pointers to memory?
No.
What you are asking doesn't make sense. To some extent, the entire reason that Rust the language exists is because raw pointers are inherently dangerous. Rust's references (and their related lifetimes) are a structured way of performing compile-time checks to ensure that a pointer is valid and safe to use.
Once you start using raw pointers, the compiler can no longer help you with those pointers and it's now up to you to ensure that safety is guaranteed.
from_raw_parts reduces to a memcpy
This doesn't seem correct. No memory should be copied to create a slice. A Rust slice is effectively just a pair of (pointer, length) — the same things that you are passing in separately. I'd expect those each to be register-sized, so calling memcpy would be overkill.
Using the resulting slice could possibly involve copying the data, but that's not due to from_raw_parts anymore.

string formatting in OpenCL?

I am writing simple debugging/logging functions using a ring buffer in a chunk of the global memory. The problem is the lack of any snprintf-like function in OpenCL. What would be the suggestion? To use some embedded implementation and extend the format specification for vector types?
(Please do not reply that string ops are inefficient and that OpenCL is designed for computations; I know that.)
Some CPU implementations support printf and the like, so that might help if your implementation does not rely on unsupported work-group dimensions. When I worked with OpenCL, I usually did the verification on the host side, i.e. I implemented the buffer-reading algorithm and then wrote the data back using a 1:1 map of the work items to the result buffer. This makes it quite easy to verify, since you know which thread wrote what, given the index in the result buffer. It might be a good idea to initialize the result buffer with known data (i.e. copy a host buffer into the result buffer before executing the kernel) to avoid confusion.
I realize this isn't a very technical answer, but I hope it helps somewhat.
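A minimal sketch of that 1:1 debug-buffer idea, written as kernel-side OpenCL C (the buffer names are illustrative):

```c
// Each work item writes its diagnostics into its own slot, so the host
// can attribute every value to the exact thread that produced it.
__kernel void compute(__global const float* in,
                      __global float* out,
                      __global float* debug)   // one slot per work item
{
    size_t gid = get_global_id(0);
    float value = in[gid] * 2.0f;   // illustrative computation
    out[gid] = value;
    debug[gid] = value;             // 1:1 map: slot gid <- work item gid
}
```

The host then reads the debug buffer back and compares each slot against the expected value for that work item.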

How do you work around the inability to pass a list of cl_mem into a kernel invocation?

There are lots of real-world reasons you'd want to do this. Ours is because we have a list of variable length data structures, and we want to be able to change the size of one of the elements without recopying them all.
Here's a few things I've tried:
Just have a lot of kernel arguments. Sure, sounds hacky, but works for small N. This is actually what we've been doing.
Do 1) with some sort of macro loop which extends the kernel args to the max size (which I think is device dependent). I don't really want to do this... it sounds bad.
Create some sort of list of structs which contain pointers, and fill it before your kernel invocation. I tried this, and I think it violates the spec. According to what I've seen on the nVidia forums, preserving the address of a device pointer beyond one kernel invocation is illegal. If anyone can point to where in the spec it says this, I'd love to know, because I can't find it. However, this definitely breaks on ATI hardware, as it moves the objects around.
Give up, store the variable sized objects in a big array, and write a clever algorithm to use empty space so the whole array must be reflowed less often. This will work, but is an inelegant, complicated design. Also, it requires lots of scary pointer arithmetic...
Does anyone else have other ideas, or experiences trying to do this? Is there a least-hacky way, and why?
To 3:
OpenCL 1.1 spec page 193 says "Arguments to kernel functions in a program cannot be declared as a pointer to a pointer(s)."
A struct containing a pointer to a buffer object might not violate a strict reading of this sentence, but it is within its spirit: no pointers to buffer objects may be passed as arguments from host code to a kernel, even if they're hidden inside a user-defined struct.
I'd opt for option 5: do not use variable-size data structures. If you have any way of making them constant-size, by all means do it. It will make your life a whole lot easier. To be precise, there is no 'variable size structure': every struct definition produces constant-sized structs, so if the size has changed then the struct itself has changed and therefore requires another mem object. Every pointer passed to a kernel function must have a single type.
In addition to sharpneli's answer, option 5:
If the objects have similar sizes, you could use a union over the biggest possible object size. But make sure you use explicit alignment. Pass a second buffer identifying the union member used for each object in your variable-sized-objects-in-static-size-union buffer, as in the sketch below.
I resorted to this when using OpenCL library code that only allowed one variable array of arbitrary type: I simply used cl_float2 to pass two floats. Since the cl_floatN types are implemented as unions, what works for the built-in types will work for you as well.
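A hedged sketch of that union-plus-tag layout in OpenCL C (the variant types, tags, and alignment value are all illustrative):

```c
// Fixed-size union covering all object variants, with explicit alignment.
typedef struct { float x, y, z; }    variant_a_t;
typedef struct { int id; float w; }  variant_b_t;

typedef union {
    variant_a_t a;
    variant_b_t b;
} __attribute__((aligned(16))) object_t;

enum { TAG_A = 0, TAG_B = 1 };

__kernel void process(__global object_t* objects,
                      __global const int* tags)  // which union member is live
{
    size_t gid = get_global_id(0);
    if (tags[gid] == TAG_A) {
        objects[gid].a.x += 1.0f;
    } else {
        objects[gid].b.w *= 2.0f;
    }
}
```

Every object now has the same static size, so a single cl_mem holds the whole list and the tag buffer replaces the per-object pointers.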

What should replace "memcpy" inside OpenCL kernels?

The OpenCL language, which extends C99, does not provide the memcpy function. What should be used instead?
As far as I know, there is nothing like that defined in OpenCL. OpenCL does not provide a concept like dynamic memory, so such functionality is not really needed.
You can simply run over your array with a for loop and copy the data element by element; the target array is of fixed size anyway, since the array length has to be specified at compile time.
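A minimal sketch of such an element-by-element copy helper in OpenCL C (names are illustrative):

```c
// Copies n floats from global memory into a private destination array.
void copy_floats(__global const float* src, float* dst, int n)
{
    for (int i = 0; i < n; ++i)
        dst[i] = src[i];
}

__kernel void example(__global const float* input, __global float* output)
{
    float tmp[16];                 // fixed-size private array
    copy_floats(input, tmp, 16);   // "memcpy" replacement, element by element
    output[get_global_id(0)] = tmp[0];
}
```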
On the other hand, OpenCL (and OpenGL, as a kind of origin) was defined in a more static way: the data is provided to the GPU up front and the result size is defined in advance. The GPU computes the input into the pre-defined output location. It is not meant to create more processes within the GPU, and it is also not meant to allocate memory dynamically, so as not to disturb the host doing that.
