I've created an in-order OpenCL queue. My pipeline enqueues multiple kernels into the queue.
queue = clCreateCommandQueue(cl.context, cl.device, 0, &cl.error);
for (i = 0; i < num_kernels; i++) {
    clEnqueueNDRangeKernel(queue, kernels[i], dims, NULL, global_work_group_size, local_work_group_size, 0, NULL, &event);
}
The output of kernels[0] is input to kernels[1], the output of kernels[1] is input to kernels[2], and so on.
Since my command queue is an in-order queue, my assumption is that kernels[1] will start only after kernels[0] has completed.
Is my assumption valid?
Should I use clWaitForEvents to make sure the previous kernel is completed before enqueuing the next kernel?
Is there any way I can stack multiple kernels into the queue & just pass the input to kernels[0] & directly get the output from the last kernel? (without having to enqueue every kernel one by one)
Your assumption is valid. You do not need to wait for events in an in-order queue. Take a look at the OpenCL doc:
https://www.khronos.org/registry/OpenCL/sdk/1.2/docs/man/xhtml/clCreateCommandQueue.html
If the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property of a
command-queue is not set, the commands enqueued to a command-queue
execute in order. For example, if an application calls
clEnqueueNDRangeKernel to execute kernel A followed by a
clEnqueueNDRangeKernel to execute kernel B, the application can assume
that kernel A finishes first and then kernel B is executed. If the
memory objects output by kernel A are inputs to kernel B then kernel B
will see the correct data in memory objects produced by execution of
kernel A. If the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property of a
command-queue is set, then there is no guarantee that kernel A will
finish before kernel B starts execution.
As to the other question: yes, you'll need to explicitly enqueue every kernel that you want to run. Consider it a good thing, as there is no magic happening.
Of course you can always write your own helpers in C/C++ (or whatever host language you are using) that simplify this, and potentially hide the cumbersome kernel calls. Or use some GPGPU abstraction library to do the same.
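For example, a minimal helper along those lines might look like this (a sketch with hypothetical names; it assumes an in-order queue and that each kernel's arguments, including the buffers chaining one stage's output to the next stage's input, have already been set):

#include <CL/cl.h>

// Sketch: enqueue a pre-built chain of kernels into an in-order queue.
// Because the queue is in-order, each kernel runs only after the
// previous one has completed, so no events are needed.
cl_int enqueue_kernel_chain(cl_command_queue queue,
                            cl_kernel *kernels, cl_uint num_kernels,
                            cl_uint dims,
                            const size_t *global, const size_t *local)
{
    for (cl_uint i = 0; i < num_kernels; ++i) {
        cl_int err = clEnqueueNDRangeKernel(queue, kernels[i], dims, NULL,
                                            global, local, 0, NULL, NULL);
        if (err != CL_SUCCESS)
            return err;
    }
    return CL_SUCCESS;
}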
OpenCL 1.2: I am running a sequence of kernels in two separate command queues so that they can be scheduled concurrently, synchronising at the end. Separate data buffers are being used in these queues.
I started by using the same kernel objects in the separate queues. However, that seemed to get the data buffers all mixed up between the two queues. I searched but could not find anything in the references regarding this. In particular, I see nothing explicitly mentioned on the clSetKernelArg() page. There is a note saying
Users may not rely on a kernel object to retain objects specified as argument values to the kernel.
which I am not sure applies to this case.
So my devised solution was to inline the kernel code and make two separate kernel functions that call this code, one for each of the kernels in the two parallel queues. This fixed the issue.
However:
(1) I was not happy with this, so I tested again on different devices. Data buffers are jumbled up on the Intel HD630 GPU, but not on the AMD Radeon Pro 560 (where all is good).
(2) This solution does not scale. If I want to implement a larger amount of task parallelism using the same context, creating separate kernels for each parallel stream is no good. I have not tested two separate contexts; I suppose it would work, but then it would mean copying data from one context to the other at the sync point, which defeats the whole exercise.
Has anyone tried this, or have any insights on the issue?
So my devised solution is to inline the kernel code and make two separate kernel functions that call this code for each one of the kernels in the two parallel queues
You don't need to do that. You can achieve the same effect by simply creating multiple cl_kernel objects from the same cl_program. Simply call clCreateKernel multiple times with the same arguments. If it helps, you can think of cl_kernel as a struct that contains 1) a pointer to some executable code, and 2) storage for arguments. A cl_kernel does not actually "own" the code; the program does.
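For illustration, a minimal sketch of that approach (the kernel name "my_kernel" and the surrounding names are placeholders):

#include <CL/cl.h>

// Create two independent cl_kernel objects from the same cl_program.
// Each cl_kernel carries its own argument storage, so the two queues
// can no longer clobber each other's buffer bindings.
void make_kernels(cl_program program, cl_mem buf_q1, cl_mem buf_q2,
                  cl_kernel *kernel_q1, cl_kernel *kernel_q2)
{
    cl_int err;
    *kernel_q1 = clCreateKernel(program, "my_kernel", &err);
    *kernel_q2 = clCreateKernel(program, "my_kernel", &err);

    // Bind a different buffer to each instance; the bindings are
    // independent even though both kernels share the same code.
    clSetKernelArg(*kernel_q1, 0, sizeof(cl_mem), &buf_q1);
    clSetKernelArg(*kernel_q2, 0, sizeof(cl_mem), &buf_q2);
}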
However, that seemed to get the data buffers all mixed up between the two queues
Are you aware that there are no implicit guarantees on the order of command execution across queues?
Example: if you have one in-order cl_command_queue, enqueuing commands A and then B into it guarantees that A will execute before B. However, if you have two command queues C1 and C2, then if you enqueue A into C1 and B into C2, B can execute before A (or A before B).
If you need to enforce a specific order, you must use events. Like this:
cl_event ev;
cl_kernel A, B;
cl_command_queue C1, C2;
...
clEnqueueNDRangeKernel(C1, A, ..., 0, NULL, &ev);
clEnqueueNDRangeKernel(C2, B, ..., 1, &ev, NULL);
clReleaseEvent(ev);
This guarantees B executes after A.
I need to write an OpenCL program for reducing a large buffer (several million floats) into a single float. For the simplicity of the question I will suppose here that I need to compute the sum of all floats.
So I have written a kernel which takes a float buffer as input and sums it in packets of 64. It writes the result to a buffer which is 64 times smaller. I then iterate the call of this kernel until the data is small enough to be copied back to the host and summed by the CPU.
I'm new to OpenCL: do I need a barrier between the kernels so that they run sequentially, or is OpenCL smart enough to detect that the nth kernel pass writes to an output buffer that is used as the input buffer of the (n+1)th pass?
Or is there a smarter approach?
If you are using a single, in-order command queue for all of your kernel launches (i.e. you do not use the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property), then each kernel invocation will run to completion before the next begins - you do not need any explicit barriers to enforce this behaviour.
If you are using an out-of-order command queue or multiple queues, you can enforce data dependencies via the use of OpenCL events. Each call to clEnqueueNDRangeKernel can optionally return an event object, which can be passed to subsequent commands as dependencies.
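For reference, a minimal sketch of the iterative reduction on a single in-order queue (hypothetical names; assumes n is a power of 64 and that the kernel sums 64 inputs per work-item, reading from arg 0 and writing the partial sums to arg 1):

#include <CL/cl.h>

// Sketch: launch the reduction kernel repeatedly until the remaining
// data is small enough for the host. No explicit barriers are needed:
// the in-order queue serializes the passes.
void reduce_on_device(cl_command_queue queue, cl_kernel reduce,
                      cl_mem in, cl_mem out, size_t n)
{
    while (n >= 64) {
        size_t global = n / 64;  // one work-item per packet of 64
        clSetKernelArg(reduce, 0, sizeof(cl_mem), &in);
        clSetKernelArg(reduce, 1, sizeof(cl_mem), &out);
        clEnqueueNDRangeKernel(queue, reduce, 1, NULL, &global, NULL,
                               0, NULL, NULL);
        // Ping-pong the buffers for the next pass.
        cl_mem tmp = in; in = out; out = tmp;
        n = global;
    }
    // 'in' now holds the n remaining partial sums; read them back
    // with clEnqueueReadBuffer and finish the sum on the CPU.
}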
I know that MPI_SENDRECV allows one to avoid the deadlock problems that can arise with the classic MPI_SEND and MPI_RECV functions.
I would like to know if MPI_SENDRECV(sent_to_process_1, receive_from_process_0) is equivalent to:
MPI_ISEND(sent_to_process_1, request1)
MPI_IRECV(receive_from_process_0, request2)
MPI_WAIT(request1)
MPI_WAIT(request2)
with the asynchronous MPI_ISEND and MPI_IRECV functions?
From what I have seen, MPI_ISEND and MPI_IRECV create a fork (i.e. 2 processes). So if I follow this logic, the first call of MPI_ISEND generates 2 processes. One does the communication and the other calls MPI_IRECV, which forks itself into 2 processes.
But once the communication of the first MPI_ISEND is finished, does the second process call MPI_IRECV again? With this logic, the above equivalence doesn't seem to be valid...
Maybe I should change to this:
MPI_ISEND(sent_to_process_1, request1)
MPI_WAIT(request1)
MPI_IRECV(receive_from_process_0, request2)
MPI_WAIT(request2)
But I think that it could also create deadlocks.
Could anyone give me another solution using MPI_ISEND, MPI_IRECV and MPI_WAIT that gets the same behaviour as MPI_SENDRECV?
There are some dangerous lines of thought in the question and other answers. When you start a non-blocking MPI operation, the MPI library doesn't create a new process/thread/etc. I believe you're thinking of something more like an OpenMP parallel region, where new threads/tasks are created to do some work.
In MPI, starting a non-blocking operation is like telling the MPI library that you have some things that you'd like to get done whenever MPI gets a chance to do them. There are lots of equally valid options for when they are actually completed:
It could be that they all get done later when you call a blocking completion function (like MPI_WAIT or MPI_WAITALL). These functions guarantee that when the blocking completion call returns, all of the requests that you passed in as arguments are finished (in your case, the MPI_ISEND and the MPI_IRECV). Regardless of when the operations actually take place (see the next few paragraphs), you as an application can't consider them done until they are actually marked as completed by a function like MPI_WAIT or MPI_TEST.
The operations could get done "in the background" during another MPI operation. For instance, if you do something like the code below:
MPI_Request req[2];
MPI_Isend(..., MPI_COMM_WORLD, &req[0]);
MPI_Irecv(..., MPI_COMM_WORLD, &req[1]);
MPI_Barrier(MPI_COMM_WORLD);
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
The MPI_ISEND and the MPI_IRECV would probably actually do the data transfers in the background during the MPI_BARRIER. This is because, as an application, you are transferring "control" of your application to the MPI library during the MPI_BARRIER call. This lets the library make progress on any ongoing MPI operations it wants. Most likely, when the MPI_BARRIER is complete, so are most of the operations that were started before it.
Some MPI libraries allow you to specify that you want a "progress thread". This tells the MPI library to start up another thread (note that a thread is not a process) in the background that will actually do the MPI operations for you while your application continues in the main thread.
Remember that all of these in the end require that you actually call MPI_WAIT or MPI_TEST or some other function like it to ensure that your operation is actually complete, but none of these spawn new threads or processes to do the work for you when you call your non-blocking functions. Those calls really just stick the operations on a list of things to do (which, in reality, is how most MPI libraries implement them).
The best way to think of how MPI_SENDRECV is implemented is to do two non-blocking calls with one completion function:
MPI_Request req[2];
MPI_Isend(..., MPI_COMM_WORLD, &req[0]);
MPI_Irecv(..., MPI_COMM_WORLD, &req[1]);
MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
How I usually do this on node i communicating with node i+1:
mpi_isend(send_to_process_iPlus1, requests(1))
mpi_irecv(recv_from_process_iPlus1, requests(2))
...
mpi_waitall(2, requests)
You can see how ordering your commands this way with non-blocking communication allows you (during the ... above) to perform any computation that does not rely on the send/recv buffers while the communication proceeds. Overlapping computation with communication is often crucial for maximizing performance.
mpi_sendrecv, on the other hand (while avoiding any deadlock issues), is still a blocking operation. Thus, your program must remain in that routine during the entire send/recv process.
Final points: you can initialize more than 2 requests and wait on all of them in the same way, using the above structure just as with 2 requests. For instance, it's quite easy to start communication with node i-1 as well and wait on all 4 of the requests. Using mpi_sendrecv you must always have a paired send and receive; what if you only want to send?
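For completeness, here is a hedged C sketch of the four-request variant (the neighbour ranks, buffers, and tag values are assumptions for illustration):

#include <mpi.h>

// Sketch: exchange one double array with both neighbours using four
// non-blocking calls and a single completion call. Messages travelling
// right use tag 0, messages travelling left use tag 1.
void exchange_with_neighbors(int left, int right,
                             double *send_left, double *send_right,
                             double *recv_left, double *recv_right,
                             int count)
{
    MPI_Request reqs[4];

    MPI_Isend(send_right, count, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(send_left,  count, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[1]);
    MPI_Irecv(recv_left,  count, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[2]);
    MPI_Irecv(recv_right, count, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[3]);

    /* ... computation that does not touch the four buffers ... */

    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
}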
I have a long sequence of kernels I need to run on some data like
data -> kernel1 -> data1 -> kernel2 -> data2 -> kernel3 -> data3 etc.
I need all the intermediate results to copied back to the host as well, so the idea would be something like (pseudo code):
inputdata = clCreateBuffer(...hostBuffer[0]);
for (int i = 0; i < N; ++i)
{
    // create output buffer
    outputdata = clCreateBuffer(...);
    // run kernel
    kernel = clCreateKernel(...);
    kernel.setArg(0, inputdata);
    kernel.setArg(1, outputdata);
    enqueueNDRangeKernel(kernel);
    // read intermediate result
    enqueueReadBuffer(outputdata, hostBuffer[i]);
    // output of operation becomes input of next
    inputdata = outputdata;
}
There are several ways to schedule these operations:
The simplest is to always wait for the event of the previous enqueue operation, so we wait for a read operation to complete before proceeding with the next kernel. I can release buffers as soon as they are not needed.
OR make everything as asynchronous as possible, where kernel and read enqueues only wait for previous kernels, so buffer reads can happen while another kernel is running.
In the second (asynchronous) case I have a few questions:
Do I have to keep references to all cl_mem objects in the long chain of actions and release them after everything is complete?
Importantly, how does OpenCL handle the case when the sum of all memory objects exceeds the total memory available on the device? At any point a kernel only needs its input and output buffers (which should fit in memory), but what if 4 or 5 of these buffers exceed the total? How does OpenCL allocate/deallocate these memory objects behind the scenes? How does this affect the reads?
I would be grateful if someone could clarify what happens in these situations, and perhaps there is something relevant to this in the OpenCL spec.
Thank you.
Your second case is the way to go.
In the second (asynchronous) case I have a few questions:
Do I have to keep references to all cl_mem objects
in the long chain of actions and release them after
everything is complete?
Yes. But if all the data arrays are of the same size, I would use just 2 and overwrite one after the other on each iteration.
Then you only need 2 memory zones, and the release and allocation only occur at the beginning/end.
Don't worry about the data having bad values; if you set proper events, the processing will wait for the I/O to finish, i.e.:
data -> kernel1 -> data1 -> kernel2 -> data -> kernel3 -> data1
                      \-> I/O operation    \-> I/O operation
To do that, just set an event dependency that forces kernel3 to start only once the first I/O has finished. You can chain all the events that way.
NOTE: Using 2 queues, one for I/O and another for processing, will give you parallel I/O, which can be up to 2 times faster.
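A minimal sketch of that double-buffered, two-queue scheme (hypothetical names; error checks omitted; both queues are assumed to be in-order, and buf[0] initially holds the input data):

#include <CL/cl.h>

// Sketch: alternate between two device buffers, running kernels on one
// queue and reads on another. Events enforce the two cross-queue
// dependencies: each read waits for the kernel that produced its data,
// and each kernel waits for the read that last consumed the buffer it
// is about to overwrite.
void run_chain(cl_command_queue exec_q, cl_command_queue io_q,
               cl_kernel *kernels, int n, cl_mem buf[2],
               void **host_out, size_t bytes,
               cl_uint dims, const size_t *global, const size_t *local)
{
    cl_event kernel_done, read_done[2] = {NULL, NULL};

    for (int i = 0; i < n; ++i) {
        cl_mem in  = buf[i % 2];        // output of the previous kernel
        cl_mem out = buf[(i + 1) % 2];  // about to be overwritten
        clSetKernelArg(kernels[i], 0, sizeof(cl_mem), &in);
        clSetKernelArg(kernels[i], 1, sizeof(cl_mem), &out);

        // Wait for the read (two iterations ago) that consumed 'out'.
        cl_event *wait = read_done[(i + 1) % 2] ? &read_done[(i + 1) % 2] : NULL;
        clEnqueueNDRangeKernel(exec_q, kernels[i], dims, NULL, global, local,
                               wait ? 1 : 0, wait, &kernel_done);
        if (wait) clReleaseEvent(*wait);

        // Read the intermediate result on the I/O queue while the next
        // kernel runs on the execution queue.
        clEnqueueReadBuffer(io_q, out, CL_FALSE, 0, bytes, host_out[i],
                            1, &kernel_done, &read_done[(i + 1) % 2]);
        clReleaseEvent(kernel_done);
    }
    clFinish(io_q);  // the reads depend on the kernels, so this drains both
    if (read_done[0]) clReleaseEvent(read_done[0]);
    if (read_done[1]) clReleaseEvent(read_done[1]);
}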
Importantly, how does OpenCL handle the case when the sum
of all memory objects exceeds that of the total memory available on the
device?
It gives an error such as CL_OUT_OF_RESOURCES or CL_MEM_OBJECT_ALLOCATION_FAILURE when allocating.
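As a hedged illustration of checking for that (placeholder names; note that some implementations defer the actual device allocation, so the failure may only surface later, e.g. when a command that uses the buffer executes):

#include <CL/cl.h>
#include <stdio.h>

// Sketch: create a buffer and check the error code immediately.
cl_mem create_checked(cl_context context, size_t bytes)
{
    cl_int err;
    cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE, bytes, NULL, &err);
    if (err != CL_SUCCESS) {
        // e.g. CL_OUT_OF_RESOURCES or CL_MEM_OBJECT_ALLOCATION_FAILURE
        fprintf(stderr, "clCreateBuffer failed: %d\n", err);
        return NULL;
    }
    return buf;
}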
At any point a kernel only needs the input and output buffers
(which should fit in memory), but what if 4 or 5 of these buffers
exceed the total, how does OpenCL allocate/deallocate these memory
objects behind the scenes? How does this affect the reads?
It will not do this automatically, except if you have created the memory with a host pointer (CL_MEM_USE_HOST_PTR). But I'm unsure the OpenCL driver will handle that properly. I would not allocate more than the maximum if I were you.
I was under the impression (sorry, I was going to cite specification but can't find it today, so I downgraded the strength of my assertion) that when you enqueue a kernel with cl_mem references, it takes a retain on those objects, and releases them when the kernel is done.
This could allow you to release these objects after enqueuing a kernel without actually having to wait for the kernel to finish running. This is how the async "clEnqueue" operations are reconciled with the synchronous operations (i.e., memory release), and it prevents the use of released memory objects by the runtime and kernel.
I am interested in implementing a sort of event driven dispatch queue using MPI (message passing interface). The basic problem I want to solve is this: I have a master process which inserts jobs into a global queue, and each available slave process retrieves the next job in the queue if there is one.
From what I've read about MPI, it seems like sending and receiving processes must be in agreement about when they send and receive. As in, if one process sends a message but the other process does not know it needs to receive (or vice versa), everything deadlocks. Is there any way to make every process a bit more independent?
You can do that as follows:
Declare one master node (0) that is going to distribute the tasks. On this node the pseudocode is:
int sendTo
for task in tasks:
    MPI_Recv(...sendTo, MPI_INT, MPI_ANY_SOURCE, ...)
    MPI_Send(job, ..., receiver: sendTo)
for node in nodes:
    MPI_Recv(...sendTo, MPI_INT, MPI_ANY_SOURCE, ...)
    MPI_Send(job_null, ..., receiver: sendTo)
In the slave nodes the code would be:
while (true)
    MPI_Send(myNodenum to 0, MPI_INT)
    MPI_Recv(job from 0)
    if (job == job_null)
        break
    else
        execute job
I think this should work.
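A hedged C sketch of this protocol (a "job" here is just an int, and the tag values are arbitrary choices for illustration):

#include <mpi.h>

#define TAG_READY 0
#define TAG_JOB   1
#define JOB_NULL  (-1)  // sentinel telling a worker to stop

// Master (rank 0): hand out num_jobs integer jobs on demand, then
// send every worker the stop sentinel.
void master(int num_jobs, int num_workers)
{
    int worker, job, stop = JOB_NULL;
    for (job = 0; job < num_jobs; ++job) {
        MPI_Recv(&worker, 1, MPI_INT, MPI_ANY_SOURCE, TAG_READY,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&job, 1, MPI_INT, worker, TAG_JOB, MPI_COMM_WORLD);
    }
    for (int i = 0; i < num_workers; ++i) {
        MPI_Recv(&worker, 1, MPI_INT, MPI_ANY_SOURCE, TAG_READY,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&stop, 1, MPI_INT, worker, TAG_JOB, MPI_COMM_WORLD);
    }
}

// Worker: announce availability, then execute jobs until the sentinel.
void worker(int rank)
{
    for (;;) {
        MPI_Send(&rank, 1, MPI_INT, 0, TAG_READY, MPI_COMM_WORLD);
        int job;
        MPI_Recv(&job, 1, MPI_INT, 0, TAG_JOB, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        if (job == JOB_NULL)
            break;
        /* ... execute job ... */
    }
}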
You might want to use Charm++; it's not explicitly an event-driven framework, but it does provide an abstraction mechanism for performing tasks and distributing those tasks dynamically.