STD Classes in CUDA Kernel - vector

I know that std classes such as string, vector, map or set cannot be used in a CUDA kernel. However, working without them is very uncomfortable. I have to write a lot of code in CUDA kernels, so I would like to use at least strings and vectors. I'm not talking about something like Thrust. I want to be able to write something like this:
__global__ void kernel()
{
    cuda_vector<int> a;
    for (int i = 0; i < 10; i++)
        a.push_back(i);
}

int main()
{
    kernel<<<1, 512>>>();
    return 0;
}
This should create 512 threads, and in each thread I want to create a cuda_vector object and use it like std::vector. I didn't find any solution on the internet, so I started to write my own class. Each member function of this class is declared as both __host__ and __device__ so that I can use it on both the CPU and the GPU.
Theoretically it can be implemented, but only on the Fermi architecture, because we need to allocate memory dynamically. I have a GTX 580 and started to write my own vector, but it's tiring and takes a lot of time. Isn't there an existing implementation I can use? I can't believe that there isn't one. Do so many software developers write CUDA code without it? And has no one tried to write their own version?

The reason you don't find something like std::vector for CUDA is performance. A traditional vector object doesn't fit well with the CUDA model. If you plan to use only 512 threads, each managing a std::vector-like object, your performance is going to be worse than running the same code on the CPU.
GPU threads are not like CPU threads; they should be as light as possible. Use thread blocks and shared memory to have the threads cooperate. If you are manipulating a string, each thread should work on one character. If you are using vectors on the CPU, pass the data to the GPU as an array and have each thread work on one element. Basically, think about how to solve the problem with the CUDA programming model, as opposed to solving it with a CPU approach and then translating it to CUDA.
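If a small per-thread container is still wanted, one way to keep threads light is to give it a fixed capacity, so that no memory is allocated inside the kernel. The following is only a minimal sketch under that assumption, not an existing library: the names fixed_vector, HD and sum_first_ten are hypothetical. The HD macro expands to __host__ __device__ under nvcc and to nothing under a plain C++ compiler, so the same class can be exercised on the host.

```cpp
#include <cassert>
#include <cstddef>

// When compiled by nvcc, mark functions for both host and device;
// with a plain C++ compiler the macro expands to nothing.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical fixed-capacity vector: storage lives inside the object,
// so no dynamic allocation is needed in device code.
template <typename T, std::size_t Capacity>
class fixed_vector {
public:
    HD fixed_vector() : size_(0) {}

    // Returns false instead of growing when the vector is full.
    HD bool push_back(const T& value) {
        if (size_ >= Capacity) return false;
        data_[size_++] = value;
        return true;
    }

    HD std::size_t size() const { return size_; }
    HD T& operator[](std::size_t i) { return data_[i]; }
    HD const T& operator[](std::size_t i) const { return data_[i]; }

private:
    T data_[Capacity];
    std::size_t size_;
};

// Host-side helper mirroring the loop from the question.
inline int sum_first_ten() {
    fixed_vector<int, 16> a;
    for (int i = 0; i < 10; i++) a.push_back(i);
    int sum = 0;
    for (std::size_t i = 0; i < a.size(); i++) sum += a[i];
    return sum;
}
```

The trade-off is that the capacity must be chosen at compile time and a full vector rejects further elements instead of growing.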

I've not used it, but the CuPP framework may be of interest to you, especially the vector<T> implementation. Looks like it could do what you need it to do.

Related

Translate OpenCL SPIR-V to Vulkan SPIR-V

Is it possible to translate OpenCL-style SPIR-V to Vulkan-style SPIR-V?
I know that it is possible to use clspv to compile OpenCL C to Vulkan-style SPIR-V, but I haven't seen any indication that it also supports ingesting OpenCL-style SPIR-V.
Thank you for any suggestions if you know how to achieve this :)
"I know that it is possible to use clspv to compile OpenCL C to Vulkan-style SPIR-V, but I haven't seen any indication that it also supports ingesting OpenCL-style SPIR-V."
clspv compiles to "OpenCL-style SPIR-V". In other words, it uses the OpenCL execution model and also the OpenCL memory model. The answer to your question is no (in general). The problem is that, for example, GLSL uses a logical memory model, which means pointers are abstract, so you can't have pointers to pointers. OpenCL allows this because it uses a physical memory model. There are also other things in OpenCL which cannot be expressed in GLSL. You could try to write a translator, and it might work for some very simple code, but that's about it.

Alternative to clGetGLContextInfoKHR?

I am writing an N-body physics simulation. Is there an alternative to the OpenCL clGetGLContextInfoKHR() function? I need to find out at runtime which GPU is used for OpenGL rendering so that I can use OpenCL for vertex manipulation on the same GPU (for performance reasons).
I have searched OpenCL.dll for the function clGetGLContextInfoKHR() using Dependency Walker, but it seems that the implementation installed on my computer does not support it, since this function is missing from the DLL. I have also tried glGetString(GL_RENDERER), but the name string it returns differs from the name string that clGetDeviceInfo(.., CL_DEVICE_NAME, ...) returns (not by much, but enough to make it difficult, for example, to distinguish two GPUs from the same manufacturer). Is there any other way besides manually choosing the correct OpenCL device?
Thanks for your help!

OpenCL dead lock possibility

I'm using global atomics to synchronize between work groups in OpenCL.
The kernel uses code like this to wait until counter becomes expected_value:
... global volatile uint* counter;
if (get_local_id(0) == 0) {
    while (*counter != expected_value);
}
barrier(0);
And at another place it does
if (get_local_id(0) == 0) atomic_inc(counter);
Theoretically, the algorithm is such that this should always work if all work groups are running concurrently. But if one work group starts only after another has completely finished, then the kernel can deadlock.
On CPU and on GPU (NVIDIA CUDA platform), it seems to always work, even with a large number of work groups (over 8000).
For this algorithm, this seems to be the most efficient implementation. (It computes prefix sums over each line of a 2D buffer.)
Does OpenCL and/or NVidia's OpenCL implementation guarantee that this always works?
"Does OpenCL and/or NVidia's OpenCL implementation guarantee that this always works?"
As far as the OpenCL standard is concerned, this is not guaranteed (similarly for CUDA). Now, in practice, it may very well work due to your specific OpenCL implementation, but bear in mind that it's not guaranteed by the standard, so make sure you understand your implementation's execution model to ensure this is safe, and that such code won't necessarily be portable across other conforming implementations.
"Theoretically the algorithm is such that this should always work, if all work groups are running concurrently"
OpenCL states that work groups can run in any order, and not necessarily in parallel or even concurrently. CUDA has similar wording, although CUDA 9 does support a form of grid-wide synchronization.
OpenCL spec, 3.2.2 Execution Model: Execution of kernel-instances:
A conforming implementation may choose to serialize the work-groups so a correct algorithm cannot assume that work-groups will execute in parallel. There is no safe and portable way to synchronize across the independent execution of work-groups since once in the work-pool, they can execute in any order.
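Since the question mentions prefix sums, a portable alternative to spinning on a global counter is to split the work into several kernel launches, assuming an in-order command queue so that each launch finishes before the next one starts. The sketch below is a host-side C++ simulation of that idea, not OpenCL code: each outer loop stands in for one kernel launch, each iteration for one work group, and the function name multipass_inclusive_scan is hypothetical. It handles a single row; a 2D buffer would repeat this per line.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side simulation: each loop over `g` stands in for one kernel launch,
// and each iteration stands in for one work group. Groups are deliberately
// visited in reverse order to show that no pass depends on group ordering.
// Assumes group_size > 0.
std::vector<int> multipass_inclusive_scan(const std::vector<int>& in,
                                          std::size_t group_size) {
    const std::size_t n = in.size();
    const std::size_t groups = (n + group_size - 1) / group_size;
    std::vector<int> out(n);
    std::vector<int> totals(groups, 0);

    // Pass 1 ("kernel 1"): every group scans its own chunk independently.
    for (std::size_t g = groups; g-- > 0;) {
        int running = 0;
        const std::size_t begin = g * group_size;
        const std::size_t end = begin + group_size < n ? begin + group_size : n;
        for (std::size_t i = begin; i < end; i++) {
            running += in[i];
            out[i] = running;
        }
        totals[g] = running;
    }

    // Pass 2 ("kernel 2"): exclusive scan of the per-group totals.
    std::vector<int> offsets(groups, 0);
    for (std::size_t g = 1; g < groups; g++)
        offsets[g] = offsets[g - 1] + totals[g - 1];

    // Pass 3 ("kernel 3"): every group adds its offset, again in any order.
    for (std::size_t g = groups; g-- > 0;) {
        const std::size_t begin = g * group_size;
        const std::size_t end = begin + group_size < n ? begin + group_size : n;
        for (std::size_t i = begin; i < end; i++) out[i] += offsets[g];
    }
    return out;
}
```

No group ever waits for another group inside a pass, so the result does not depend on whether the implementation runs the groups concurrently or serializes them. The cost is extra launches and one more trip through global memory.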

string formatting in OpenCL?

I am writing simple debugging/logging functions using a ring buffer in a chunk of global memory. The problem is the lack of any snprintf-like function in OpenCL. What would you suggest? Should I use some embedded implementation and extend the format specification for vector types?
(Please do not reply that string ops are inefficient and that OpenCL is designed for computations; I know that.)
Some CPU implementations support printf etc., so that might help if your implementation does not rely on unsupported work-group dimensions. When I worked with OpenCL, I would usually do the verification on the host side, i.e. implement the buffer-reading algorithm and then write the data back using a 1:1 map of the work items to the result buffer. This makes it quite easy to verify, as you know which thread wrote what given the index in the result buffer. It might be a good idea to initialize the result buffer with known data (i.e. copy a host buffer into the result buffer before executing the kernel) to avoid confusion.
I realize this isn't a very technical answer, but I hope it helps somewhat.
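To make the host-side approach more concrete, here is a rough sketch in plain C++ that only simulates the pattern. Everything in it is hypothetical: DebugRecord, fake_kernel and format_record are illustrative names, and fake_kernel stands in for a real kernel in which work item gid writes only to slot gid. The formatting itself happens on the host with snprintf after the buffer has been read back.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical fixed-size debug record written by each work item.
struct DebugRecord {
    int code;     // 0 = slot still holds the known initial data, 1 = written
    float value;  // one numeric payload
};

// Stand-in for the kernel: work item `gid` writes only to slot `gid`.
void fake_kernel(std::vector<DebugRecord>& result) {
    for (std::size_t gid = 0; gid < result.size(); gid++) {
        result[gid].code = 1;
        result[gid].value = 0.5f * static_cast<float>(gid);
    }
}

// Host-side formatting with snprintf, after the buffer has been read back.
std::string format_record(std::size_t gid, const DebugRecord& r) {
    char line[64];
    if (r.code == 1) {
        std::snprintf(line, sizeof line, "item %zu: value=%.1f", gid,
                      static_cast<double>(r.value));
    } else {
        std::snprintf(line, sizeof line, "item %zu: untouched (%.1f)", gid,
                      static_cast<double>(r.value));
    }
    return std::string(line);
}
```

Filling the buffer with a sentinel such as {0, -1.0f} before the launch makes slots that no work item wrote to stand out in the formatted output.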

What should replace "memcpy" inside OpenCL kernels?

The OpenCL language, which extends C99, does not provide the memcpy function. What should be used instead?
As far as I know, nothing like that is defined in OpenCL. OpenCL does not provide a concept like dynamic memory, and therefore such functionality is not needed.
You could just loop over your array with a for loop and copy the data element by element. However, the target array has a fixed size, because the array length must be specified at compile time.
On the other hand, OpenCL (like OpenGL, which is a kind of origin for it) was defined in a more static way. The data needs to be provided to the GPU, and the result size needs to be defined up front. The GPU computes from the input into the pre-defined output location. It is not meant to create more processes within the GPU, and it is also not meant to allocate memory dynamically, so as not to interfere with the host, which does the allocation.
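The element-by-element loop mentioned above could look roughly like this. It is written as plain C++ so it can be compiled on the host; in OpenCL C the pointers would additionally carry address-space qualifiers such as __global or __local, and the function name copy_elements is just an illustration.

```cpp
#include <cassert>
#include <cstddef>

// Element-by-element copy; the caller guarantees that both arrays hold at
// least `count` elements and that they do not overlap.
void copy_elements(const int* src, int* dst, std::size_t count) {
    for (std::size_t i = 0; i < count; i++) {
        dst[i] = src[i];
    }
}
```

Inside a kernel it is often preferable to let each work item copy a single element instead of having one work item loop over the whole array.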

Resources