I'm studying how to offload some fairly heavy calculations to GPUs. Although my machine has an NVIDIA RTX GPU, I would like to avoid CUDA in order to develop something that is portable to other GPUs as well (at least in its core). Hence the choice of OpenCL.
My current biggest concern is that, within the core that is suitable for offload, I make intensive use of LAPACK's SVD implementation.
However, in OpenCL, kernel code can neither:
Be linked to external libraries. There's a "workaround" using clEnqueueNativeKernel(), but it does not seem to apply here (the call would have to happen inside a kernel itself), not to mention that it is not very portable, since it requires the device to support the CL_EXEC_NATIVE_KERNEL capability;
Accept function pointers as kernel arguments.
So, does anyone know of an open-source SVD implementation written as an OpenCL kernel, which can then be called from within a parent OpenCL kernel?
I googled and found several libraries/implementations of SVD for GPU offload, but I couldn't see how to "embed" them into an OpenCL kernel (they all seem to be implementations meant to be launched from host code). If I'm wrong, please correct me. Any help is more than welcome.
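For context, clEnqueueNativeKernel() is purely a host-side API: it enqueues a plain C callback on a device that exposes CL_EXEC_NATIVE_KERNEL, so there is no way to invoke it from inside kernel code. A minimal host-side sketch of what it does (svdCallback and SvdArgs are hypothetical, purely for illustration):
#include <CL/cl.h>

/* Hypothetical argument block for the native callback */
typedef struct { float *matrix; int rows; int cols; } SvdArgs;

/* Runs on the host CPU (a CL_EXEC_NATIVE_KERNEL-capable device), never inside a kernel */
void CL_CALLBACK svdCallback(void *args)
{
    SvdArgs *a = (SvdArgs *)args;
    (void)a;   /* placeholder: call LAPACK's SVD (e.g. LAPACKE_sgesvd) on a->matrix here */
}

void enqueueNativeSvd(cl_command_queue queue, float *hostMatrix, int rows, int cols)
{
    SvdArgs args = { hostMatrix, rows, cols };   /* copied by the runtime */
    clEnqueueNativeKernel(queue, svdCallback, &args, sizeof(args),
                          0, NULL, NULL,    /* no cl_mem objects patched in */
                          0, NULL, NULL);   /* no wait list, no event out */
}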
Implement an event-callback API between host and kernel using only atomic functions such that:
void callExternalLib(__global int * ptr)
{
    atomic_add(ptr, 1);               // raise the "call requested" flag
    // if clWaitForEvents is not supported inside a kernel,
    // spin until the host clears the flag back to 0
    while (atomic_add(ptr, 0) == 1)
    {
        // busy-wait until the completion signal (0) is received
    }
    dynamicParallelismLaunchRestOfTheAlgorithm();
}

__kernel void test(__global int * communication, __global int * data)
{
    callExternalLib(communication);
}

// at the same time on the host, with a dedicated event thread
// (if OpenCL events do not work between GPU and host):
while (ptr.load() == 0)
{
    std::this_thread::yield();
}
if (ptr.load() == CALL_SVD)
{
    clMagmaCopyToGraphicsCard();  // not required if the buffer handle can be shared
    clMagmaComputeOnGPU();
    clMagmaCopyToHost();          // not required if the buffer handle can be shared
    copyToYourOpenCLBuffer();     // not required if the buffer handle can be shared
    ptr.store(0);                 // inform the kernel's threads that the clMAGMA call has completed
}
From https://man.opencl.org/atomic_store.html:
With fine-grained system SVM, sharing happens at the granularity of individual loads and stores anywhere in host memory. Memory consistency is always guaranteed at synchronization points, but to obtain finer control over consistency, the OpenCL atomics functions may be used to ensure that the updates to individual data values made by one unit of execution are visible to other execution units. In particular, when a host thread needs fine control over the consistency of memory that is shared with one or more OpenCL devices, it must use atomic and fence operations that are compatible with the C11 atomic operations.
I don't know if your graphics card / driver supports this. OpenCL 2.0 may not be fully supported by all GPUs.
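If the hardware does support it, the handshake above can be built on a fine-grained SVM allocation that both sides update with compatible atomics. A minimal sketch under that assumption (context and kernel are assumed to exist; host in C++, device side shown as comments; the interoperability of std::atomic<int> with the device's atomic_int is exactly the fine-grained-SVM-with-atomics capability being relied on):
#include <CL/cl.h>
#include <atomic>

// Host side: allocate a fine-grained SVM flag that supports atomics.
std::atomic<int> *flag = static_cast<std::atomic<int>*>(
    clSVMAlloc(context,
               CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER | CL_MEM_SVM_ATOMICS,
               sizeof(std::atomic<int>), 0));
flag->store(0);
clSetKernelArgSVMPointer(kernel, 0, flag);   // kernel parameter: volatile __global atomic_int *flag
// ... launch the kernel, then poll flag->load(std::memory_order_acquire) on a host thread ...

// Device side (OpenCL C 2.0), signalling with a scope the host can observe:
//   atomic_store_explicit(flag, CALL_SVD, memory_order_release, memory_scope_all_svm_devices);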
To make host-side libraries run directly on a GPU, you'll need to convert some parts by hand (a small before/after illustration follows this list):
allocations
math functions' implementations like sqrt, cos, sin, exp
intrinsic functions (a GPU can't run AVX, except maybe Intel's Xeon Phi?)
alignments of structs and arrays
dependencies on other libraries
maybe even calling conventions? (some GPUs don't have a real call stack)
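As a rough before/after illustration of that conversion (the scale_cpu/scale_gpu names are made up for this example): a host helper that allocates its own memory and calls libm becomes a kernel that receives pre-allocated buffers and uses the OpenCL built-in math functions:
/* host C version (.c file): does its own allocation and uses libm */
#include <stdlib.h>
#include <math.h>

float *scale_cpu(const float *in, int n)
{
    float *out = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i)
        out[i] = sqrtf(in[i]) * expf(-in[i]);
    return out;
}

/* OpenCL C kernel version (.cl file): buffers come from the host, built-ins replace libm */
__kernel void scale_gpu(__global const float *in, __global float *out)
{
    int i = get_global_id(0);
    out[i] = sqrt(in[i]) * exp(-in[i]);
}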
The latency of an atomically-triggered GPU-library call like this should be negligible if the work is heavy, but the approach is not suitable when every clock cycle matters on the GPU side. So it wouldn't be a good fit for small matrices.
Related
I am new to OpenCL and am working on block cipher encryption using OpenCL on an FPGA. I have read some papers and know that there are two sorts of kernels in OpenCL (single work-item and NDRange). The functions of an NDRange kernel will not be pipelined by the compiler automatically, while the functions of a single work-item kernel will.
Is it recommended to implement a single work-item kernel rather than an NDRange kernel on an FPGA? Why?
If I want to make the kernel run in a loop until all the data has been read (fetch some data from the host at one time, run on the FPGA, write back), how can the pipeline be achieved?
A single work-item kernel allows you to move the computation loops into your kernel, and you can generate custom pipelines, make clever optimizations on accumulations, and control access patterns through pragmas. An NDRange kernel relies on you to partition the data among the work-items, and the compiler generates SIMD-type hardware, with each unit described by your kernel. It is good if your problem has regular data parallelism that makes partitioning easy. The NDRange kernels of OpenCL are designed for SIMD computation units like GPUs. You can also utilize "channels" to move data between single work-item kernels in streaming applications, relieving DRAM bandwidth. For NDRange kernels you would have to use global memory as the medium of data sharing between kernels.
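For instance, a minimal single work-item kernel keeps its loop inside the kernel body so the offline compiler can pipeline the iterations (a sketch, not a tuned design):
// Launched as a single work-item (task): one instance owns the whole loop,
// and the FPGA compiler pipelines the loop iterations.
__kernel void accumulate(__global const int * restrict data,
                         __global int * restrict result,
                         const int n)
{
    int sum = 0;
    for (int i = 0; i < n; ++i)
        sum += data[i];           // pipelined accumulation
    *result = sum;
}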
Shreedutt's answer was generally acceptable before ~2016. Intel's pipe and channel implementations go well beyond it:
You can have multiple kernels or multiple work-items accessing a pipe, provided there is no work-item-variant behaviour or dependency on the work-item ID.
A deterministic order of access to a pipe from multiple work-items (e.g. NDRange kernels) can be controlled and guaranteed: "When pipes exist in the body of a loop with multiple work-items, as shown below, each loop iteration executes prior to subsequent iterations. This implies that loop iteration 0 of each work-item in a work-group executes before iteration 1 of each work-item in a work-group, and so on."
__kernel void ordering (__global int * data,
                        write_only pipe int __attribute__((blocking)) req)
{
    write_pipe(req, &data[get_global_id(0)]);
}
The channels extension is a viable alternative, which can be executed in a loop with multiple work-items:
#pragma OPENCL EXTENSION cl_intel_channels : enable
channel int req;    // channel declaration (missing from the original snippet)

__kernel void ordering (__global int * data, int X)
{
    int n = 0;
    while (n < X)
    {
        write_channel_intel(req, data[get_global_id(0)]);
        n++;
    }
}
There are restrictions and caveats, which can be found in section 5 of UG-OCL002 | 2018.05.23 for channels and pipes. I would suggest a read through it, and watch the latest training block: https://www.youtube.com/watch?v=_0RtAKeRl00. Another huge caveat is that the big companies decided to have separate code syntaxes for OpenCL, requiring different pragmas, one more so than the other.
I should have started, however, with this IWOCL presentation: https://www.iwocl.org/wp-content/uploads/iwocl2017-kapre-patel-opencl-pipes.pdf. The reason is that these are new compute models, and huge performance gains can be attained by properly structuring your parallel application. Even more important is to learn how to move, and how NOT to move, data. Check out the latest GPU trick for removing transposes: https://devblogs.nvidia.com/tensor-core-ai-performance-milestones/
We can do more tricks like this on FPGAs, can't we?
I leave it to interested readers and contributors to weigh in on XILINX OpenCL pipes.
IMHO this is the most important topic for software-defined FPGAs since sliced bread, especially if we are to win some races in ML/AI, GPUs vs. FPGAs. I am rooting for the FPGA team.
Consider a pair of OpenCL kernels which read and write to the same memory locations. As a simple example, consider the following OpenCL program:
__kernel void k1(__global int * a)
{
    a[0] = 2*a[1];
}

__kernel void k2(__global int * a)
{
    a[1] = a[0]-1;
}
If many threads are launched, running many of each of these kernels, the resulting state of global memory is non-deterministic.
This still potentially allows one to write asynchronous algorithms which accept any of the possible orderings of the operations within the kernels.
However, this requires that reads and writes to global GPU memory are atomic.
My questions are:
Is this guaranteed to be true on any current GPGPU hardware?
Is this considered undefined behavior by the OpenCL standard? If so, what do common implementations (specifically the one included with the CUDA toolkit) do?
How can one test this concern?
If you enqueue your kernel commands to a single command queue that is created as an in-order queue (i.e. you didn't specify CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE when you created it), then only one kernel command will execute at a time. This means that you won't have any such issues between different kernel instances (although you could still have race conditions between work-items in a single kernel instance, if they are accessing the same memory locations).
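For reference, a queue created with a properties value of 0 is such an in-order queue (context and device are assumed to exist already):
#include <CL/cl.h>

cl_int err;
/* No CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE bit set, so commands execute one at a time, in order. */
cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);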
If you are using out-of-order queues, or multiple command-queues, then you may indeed have a race condition. There is no guarantee that your load-modify-store sequence will be an atomic operation, and this will cause undefined behaviour.
Depending on what you actually want to do with your kernels, you may be able to make use of OpenCL's built-in atomic functions, which do allow you to perform a particular set of read-modify-write operations in an atomic manner.
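For example, if the kind of update in k1/k2 is really an accumulation, the built-in integer atomics express it without a race (a small sketch):
// Every work-item adds its element into a shared counter atomically.
__kernel void accumulate_counter(__global int * counter, __global const int * data)
{
    atomic_add(counter, data[get_global_id(0)]);
}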
I have been trying to do an FFT in OpenCL. It worked for me with a kernel like this:
__kernel void butterfly(__global float2* twid, __global float2* X,
                        const int n)
{
    /* Butterfly structure */
}
I call this kernel thousands of times, so the READ/WRITE traffic to global memory takes too much time. The twid (float2) array is only read, never modified, while the X array is both read and written.
1. Which is the most suitable type of memory for this?
2. If I use local memory, will I be able to pass it to another kernel as an argument without copying it to global memory?
I am a beginner in OpenCL.
Local memory is only usable within the work group; it can't be seen by other work groups and can't be used by other kernels. Only global memory and images can do those things.
Think of local memory as a user-managed cache used to accelerate multiple accesses to the same global memory within a work group.
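As a sketch of that idea for the FFT case above (TWID_SIZE is a hypothetical compile-time constant, and the twiddle table is assumed to fit in local memory): each work-group copies twid from global to local memory once, and all subsequent reads hit the fast local copy:
#define TWID_SIZE 256   // hypothetical: number of twiddle factors cached per work-group

__kernel void butterfly_cached(__global const float2* twid, __global float2* X,
                               const int n)
{
    __local float2 twidLocal[TWID_SIZE];

    // Cooperative copy: each work-item loads a slice of the twiddle table.
    for (int i = get_local_id(0); i < TWID_SIZE; i += get_local_size(0))
        twidLocal[i] = twid[i];
    barrier(CLK_LOCAL_MEM_FENCE);   // make the local copy visible to the whole work-group

    /* ... butterfly stages now read twidLocal[] instead of twid[] ... */
}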
If you are doing the FFT on small blocks, they may fit into private memory. Otherwise, as Dithermaster said, use local memory.
Also, I've implemented some FFT kernels, and I strongly advise you to avoid the butterfly scheme unless you're 100% sure of it. Simple schemes (even matrix multiplication) may show better results because of vectorization and good memory access patterns. The butterfly scheme is optimized for sequential processing; on a GPU it may show poor performance.
On Nvidia GPUs, when I call clEnqueueNDRange, the program waits for it to finish before continuing. More precisely, I'm calling its equivalent C++ binding, CommandQueue::enqueueNDRange, but this shouldn't make a difference. This only happens remotely on Nvidia hardware (3 Tesla M2090s); on our office workstations with AMD GPUs, the call is non-blocking and returns immediately. I don't have local Nvidia hardware to test on - we used to, and I remember similar behavior then too, but it's a bit hazy.
This makes spreading the work across multiple GPUs harder. I've tried starting a new thread for each call to enqueueNDRange using std::async/std::future from the new C++11 spec, but that doesn't seem to work either - monitoring the GPU usage in nvidia-smi, I can see that the memory usage on GPU 0 goes up, then it does some work, then the memory on GPU 0 goes down and the memory on GPU 1 goes up, that one does some work, and so on. My gcc version is 4.7.0.
Here's how I'm starting the kernels, where increment is the desired global work size divided by the number of devices, rounded up to the nearest multiple of the desired local work size:
std::vector<cl::CommandQueue> queues;
/* Population of queues happens somewhere */
cl::NDRange offset, increment, local;
std::vector<std::future<cl_int>> enqueueReturns;
int numDevices = queues.size();
/* Calculation of increment (local is taken from the function parameters) */

//Distribute the job among each of the devices in the context
for(int i = 0; i < numDevices; i++)
{
    //Update the offset for the current device
    offset = cl::NDRange(i*increment[0], i*increment[1], i*increment[2]);
    //Start a new thread for each call to enqueueNDRangeKernel
    enqueueReturns.push_back(std::async(
        std::launch::async,
        &cl::CommandQueue::enqueueNDRangeKernel,
        &queues[i],
        kernels[kernel],
        offset,
        increment,
        local,
        (const std::vector<cl::Event>*)NULL,
        (cl::Event*)NULL));
    //Without those last two casts, the program won't even compile
}

//Wait for all threads to join before returning
for(int i = 0; i < numDevices; i++)
{
    execError = enqueueReturns[i].get();
    if(execError != CL_SUCCESS)
        std::cerr << "Informative error omitted due to length" << std::endl;
}
The kernels definitely should be running on the call to std::async, since I can create a little dummy function, set a breakpoint on it in GDB, and see it hit the moment std::async is called. However, if I make a wrapper function for enqueueNDRangeKernel, run it there, and put a print statement after the run, I can see that some time passes between the prints.
P.S. The Nvidia dev zone is down due to hackers and such, so I haven't been able to post the question there.
EDIT: Forgot to mention - the buffer that I'm passing to the kernel as an argument (the one I mention above that seems to get passed between the GPUs) is declared with CL_MEM_COPY_HOST_PTR. I had been using CL_MEM_READ_WRITE before, with the same effect.
I emailed the Nvidia guys and actually got a pretty fair response. There's a sample in the Nvidia SDK that shows that, for each device, you need to create separate:
queues - so you can represent each device and enqueue work to it;
buffers - one buffer for each array you need to pass to the device, otherwise the devices will pass around a single buffer, waiting for it to become available and effectively serializing everything;
kernels - I think this one's optional, but it makes specifying arguments a lot easier.
Furthermore, you have to call enqueueNDRangeKernel for each queue from a separate thread. That's not in the SDK sample, but the Nvidia guy confirmed that the calls are blocking.
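A rough sketch of that per-device setup with the C++ bindings (names such as devices, byteCount and myKernel are illustrative, error handling omitted):
// One queue, one buffer and one kernel object per device, so nothing is shared.
std::vector<cl::CommandQueue> queues;
std::vector<cl::Buffer>       buffers;
std::vector<cl::Kernel>       kernels;

for (const cl::Device &dev : devices)
{
    queues.emplace_back(context, dev);                            // a queue per device
    buffers.emplace_back(context, CL_MEM_READ_WRITE, byteCount);  // a buffer per device
    kernels.emplace_back(program, "myKernel");                    // hypothetical kernel name
}
// Then, from a separate std::async/std::thread per device:
//   queues[i].enqueueWriteBuffer(...); kernels[i].setArg(0, buffers[i]);
//   queues[i].enqueueNDRangeKernel(...);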
After doing all this, I achieved concurrency on multiple GPUs. However, there's still a bit of a problem. On to the next question...
Yes, you're right. AFAIK the Nvidia implementation has a synchronous "clEnqueueNDRange". I have noticed this when using my library (Brahma) as well. I don't know if there is a workaround or a way of preventing this, other than using a different implementation (and hence a different device).
I'm trying to understand the architecture of OpenCL devices such as GPUs, and I fail to see why there is an explicit bound on the number of work items in a local work group, i.e. the constant CL_DEVICE_MAX_WORK_GROUP_SIZE.
It seems to me that this should be taken care of by the compiler, i.e. if a (one-dimensional for simplicity) kernel is executed with local workgroup size 500 while its physical maximum is 100, and the kernel looks for example like this:
__kernel void test(__global float* input) {
    int i = get_global_id(0);
    someCode(i);
    barrier(CLK_GLOBAL_MEM_FENCE);
    moreCode(i);
    barrier(CLK_GLOBAL_MEM_FENCE);
    finalCode(i);
}
then it could be converted automatically to an execution with work group size 100 on this kernel:
__kernel void test(__global float* input) {
    int i = get_global_id(0);
    someCode(5*i);
    someCode(5*i+1);
    someCode(5*i+2);
    someCode(5*i+3);
    someCode(5*i+4);
    barrier(CLK_GLOBAL_MEM_FENCE);
    moreCode(5*i);
    moreCode(5*i+1);
    moreCode(5*i+2);
    moreCode(5*i+3);
    moreCode(5*i+4);
    barrier(CLK_GLOBAL_MEM_FENCE);
    finalCode(5*i);
    finalCode(5*i+1);
    finalCode(5*i+2);
    finalCode(5*i+3);
    finalCode(5*i+4);
}
However, it seems that this is not done by default. Why not? Is there a way to automate this process (other than writing a pre-compiler for it myself)? Or is there an intrinsic problem which can make my method fail on certain examples (and if so, can you give me one)?
I think that the origin of the CL_DEVICE_MAX_WORK_GROUP_SIZE lies in the underlying hardware implementation.
Multiple threads are running simultaneously on the compute units, and every one of them needs to keep state (for call, jmp, etc.). Most implementations use a stack for this, and if you look at the AMD Evergreen family there is a hardware limit on the number of stack entries that are available (every stack entry has sub-entries). This in essence limits the number of threads every compute unit can handle simultaneously.
As for whether the compiler could do this to make it possible: it could work, but it would mean recompiling the kernel all over again, which isn't always possible. I can imagine situations where developers dump the compiled kernel for each platform in a binary format and ship it with their software, just for "not so open-source" reasons.
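For context, shipping a precompiled kernel that way usually means extracting the binary from a built program, roughly like this (a sketch for a single-device program; error handling omitted):
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

/* Dump the binary of an already-built, single-device cl_program to a file. */
void dumpProgramBinary(cl_program program, const char *path)
{
    size_t binarySize = 0;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(binarySize), &binarySize, NULL);

    unsigned char *binary = (unsigned char *)malloc(binarySize);
    clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(binary), &binary, NULL);

    FILE *f = fopen(path, "wb");
    fwrite(binary, 1, binarySize, f);
    fclose(f);
    free(binary);
    /* later: recreate the program with clCreateProgramWithBinary() */
}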
Those constants are queried from the device by the compiler in order to determine a suitable work group size at compile time (where compiling, of course, refers to compiling the kernel). I might be getting you wrong, but it seems you're thinking of setting those values yourself, which wouldn't be the case.
The responsibility is within your code to query the system capabilities and be prepared for whatever hardware it will run on.
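A minimal sketch of that query, assuming a device and a built kernel are already at hand:
#include <CL/cl.h>
#include <stdio.h>

void printWorkGroupLimits(cl_device_id device, cl_kernel kernel)
{
    size_t deviceMax = 0, kernelMax = 0;
    /* Hard upper bound imposed by the hardware for any kernel */
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(deviceMax), &deviceMax, NULL);
    /* Possibly tighter bound for this particular kernel (register / local-memory usage) */
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(kernelMax), &kernelMax, NULL);
    printf("max work-group size: device %zu, kernel %zu\n", deviceMax, kernelMax);
}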