OpenCL 2.0 Device command queue keeps filling up and halting execution

I am utilizing OpenCL's enqueue_kernel() function to enqueue kernels dynamically from the GPU to reduce unnecessary host interactions. Here is a simplified example of what I am trying to do in the kernels:
kernel void kernelA(args)
{
    //This kernel is the one that is enqueued from the host, with only one work item. This kernel
    //could be considered the "master" kernel that controls the logic of when to enqueue tasks
    //First, it checks if a condition is met, then it enqueues kernelB
    if (some condition)
    {
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelB(args);});
    }
    else
    {
        //do other things
    }
}
kernel void kernelB(args)
{
    //Do some stuff
    //Only enqueue the next kernel with the first work item. I do this because the things
    //occurring in kernelC rely on the things that kernelB does, so it must take place after kernelB is completed,
    //hence, the CLK_ENQUEUE_FLAGS_WAIT_KERNEL
    if (get_global_id(0) == 0)
    {
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelC(args);});
    }
}
kernel void kernelC(args)
{
    //Do some stuff. This one in particular is one step in a sorting algorithm
    //This kernel will enqueue kernelD if a condition is met, otherwise it will
    //return to kernelA
    if (get_global_id(0) == 0 && other requirements)
    {
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelD(args);});
    }
    else if (get_global_id(0) == 0)
    {
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelA(args);});
    }
}
kernel void kernelD(args)
{
    //Do some stuff
    //Finally, if some condition is met, enqueue kernelC again. What this will do is it will
    //bounce back and forth between kernelC and kernelD until the condition is
    //no longer met. If it isn't met, go back to kernelA
    if (some condition)
    {
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelC(args);});
    }
    else
    {
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelA(args);});
    }
}
So that is the general flow of the program, and it works perfectly and does exactly as I intended it to do, in the exact order I intended it to do it in, except for one issue. In certain cases when the workload is very high, a random one of the enqueue_kernel()s will fail to enqueue and halt the program. This happens because the device queue is full, and it cannot fit another task into it. But I cannot for the life of me figure out why this is, even after extensive research.
I thought that once a task in the queue (a kernel for instance) is finished, it would free up that spot in the queue. So my queue should really only reach a max of like 1 or 2 tasks at a time. But this program will literally fill up the entire 262,144 byte size of the device command queue, and stop functioning.
I would greatly appreciate some potential insight as to why this is happening if anyone has any ideas. I am sort of stuck and cannot continue until I get past this issue.
Thank you in advance!
(BTW I am running on a Radeon RX 590 card, and am using the AMD APP SDK 3.0 to use with OpenCL 2.0)

I don't know exactly what's going wrong, but I've noticed a few things in the code you posted and this feedback would be too long/hard to read in comments, so here goes - not a definite answer, but an attempt to get a bit closer:
Code doesn't quite do what the comments say
In kernelD, you have:
//Finally, if some condition is met, enqueue kernelC again.
…
if (get_global_id(0) == 0)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelD(args);});
}
This actually enqueues kernelD itself again, not kernelC as the comments suggest. The other condition branch enqueues kernelA.
This could be a typo in the reduced version of your code.
Potential task explosion
This could again be down to the way you've abridged the code, but I don't quite see how
So my queue should really only reach a max of like 1 or 2 tasks at a time.
can be true. By my reading, all work items of both kernelC and kernelD will spawn new tasks; and as there seems to be more than 1 work item in each case, this seems like it could easily spawn a very large number of tasks:
For example, in kernelC:
if (get_global_id(0) == 0 && other requirements)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelD(args);});
}
else
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelA(args);});
}
kernelB will have created at least 256 work items running kernelC. Here, work item 0 will (if other requirements met) spawn 1 task with at least 256 more work items, and 255+ tasks with 1 work-item running kernelA. kernelD behaves similarly.
So with a few iterations, you could easily end up with a few thousand tasks for running kernelA queued. I don't really know what your code does, but it seems like a good idea to check if cutting down these hundreds of kernelA tasks improves the situation, and whether you can perhaps modify kernelA so that you just enqueue it once with a range instead of enqueueing a work size of 1 from every work item. (Or something along those lines - perhaps enqueue once per group if that makes more sense. Basically, reduce the number of times enqueue_kernel gets called.)
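To make the arithmetic above concrete, here is a small sketch in plain host-side C (not OpenCL). The function name count_enqueues and the guard_else_branch flag are made up for illustration; it only models the branch structure of the kernelC snippet quoted above and counts how many enqueue_kernel calls one launch would make.

```c
#include <assert.h>

/* Hypothetical model, not device code: counts the enqueue_kernel calls made by
 * one launch of kernelC with `work_items` work items.
 * guard_else_branch == 0: plain `else`, so every work item other than 0
 *                         enqueues kernelA.
 * guard_else_branch != 0: `else if (get_global_id(0) == 0)`, so only work
 *                         item 0 ever enqueues anything. */
long count_enqueues(long work_items, int guard_else_branch) {
    assert(work_items >= 0);
    long calls = 0;
    for (long gid = 0; gid < work_items; gid++) {
        if (gid == 0) {
            calls++; /* work item 0 enqueues kernelD or kernelA */
        } else if (!guard_else_branch) {
            calls++; /* unguarded else: this work item enqueues kernelA */
        }
    }
    return calls;
}
```

For a launch of 256 work items, count_enqueues(256, 0) gives 256 calls, while count_enqueues(256, 1) gives 1. Multiply that by the number of iterations of the C/D loop and the difference in queue pressure becomes obvious.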
enqueue_kernel() return value
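Have you actually checked the return value of enqueue_kernel? It returns CLK_SUCCESS when the enqueue worked and CLK_ENQUEUE_FAILURE otherwise; if the program is built with the -g option, it can return a more specific code instead, such as CLK_DEVICE_QUEUE_FULL. So even if my suggestion above isn't possible, perhaps you can set some global state that allows kernelA to restart the calculation once more tasks have drained, if it was interrupted.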

Related

Will this producer-consumer scenario with semaphores ever deadlock?

I have read through a whole bunch of producer-consumer problems that use semaphores, but I haven't been able to find an answer for this exact one. What I want to know is if this solution will ever deadlock?
semaphore loadedBuffer = 0
semaphore emptyBuffer = N where N > 2
semaphore mutex = 1
Producer() {
    P(emptyBuffer)
    P(mutex)
    //load buffer
    V(loadedBuffer)
    v(mutex)
}
Consumer() {
    P(loadedBuffer)
    P(mutex)
    //empty buffer
    V(mutex)
    v(emptyBuffer)
}
I do believe this is a good solution, because I cannot find a circumstance where this would deadlock because any time the mutex semaphore is used, a thread cannot possibly be waiting on anything else.
Am I correct in assuming this is a good solution and will never deadlock?

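It's not quite clear what P(), V() and v() are in your algorithm, but in general you need just two semaphores, as described on Wikipedia (this version assumes a single producer and a single consumer; with several of either, you also need a mutex around the buffer access):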
semaphore fillCount = 0; // items produced
semaphore emptyCount = BUFFER_SIZE; // remaining space
procedure producer()
{
    while (true)
    {
        item = produceItem();
        down(emptyCount);
        putItemIntoBuffer(item);
        up(fillCount);
    }
}
procedure consumer()
{
    while (true)
    {
        down(fillCount);
        item = removeItemFromBuffer();
        up(emptyCount);
        consumeItem(item);
    }
}
Source: https://en.wikipedia.org/wiki/Producer%E2%80%93consumer_problem#Using_semaphores

MPI irecv,isend communication between tasks

I am new to MPI, so sorry if this sounds stupid. I want a process to have an MPI_Irecv. If it has been called for a task, then it finds a result and sends the result back to the process that called it. How can I check if it has been actually assigned to a task? So that I can have an if{} in which that task takes place while the rest of the process continues with other stuff.
Code example:
for (i=0;i<size_of_Q;i++) {
    MPI_Irecv( &shmeio, 1, mpi_point_C, root, 77, MPI_COMM_WORLD, &req );
    //I want to put an if right here.
    //If it's true process does task.
    //Finds a number. then
    MPI_Isend( &Bestcandidate, 1, mpi_point_C, root, 66, MPI_COMM_WORLD, &req );
    //so that it can return the result.
    //if it wasn't assigned a task it carries on with its other tasks.
} //(here is where for loop ends)

You might be confusing what MPI is supposed to do. MPI isn't really a tasking-based model as compared to some others (map reduce, some parts of OpenMP, etc.). MPI has historically focused on SPMD (single program multiple data) types of applications. That's not to say that MPI can't handle MPMD (there's an entire chapter in the standard about dynamic processes, and most launchers can run different executables on different ranks).
With that in mind, when you start your job, you'll usually have all of the processes that you'll ever have (unless you're using dynamic processing like MPI_COMM_SPAWN). You probably used something like:
mpiexec -n 8 ./my_program arg1 arg2 arg3
Many times, if people are trying to emulate a tasking (or master/worker) model, they'll treat rank 0 as the special "master":
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (0 == rank) {
    while (/* work not done */) {
        /* Check if parts of work are done */
        /* Send work to ranks without work */
    }
} else {
    while (/* work not done */) {
        /* Get work from master */
        /* Compute */
        /* Send results to master */
    }
}
Often, when waiting for the work, you'll do something like:
for (i = 1; i < num_processes; i++) {
    /* requests[] is 0-based, so rank i uses slot i - 1 */
    MPI_Irecv(&result[i], ..., &requests[i - 1]);
}
This will set up a receive for each rank that will send you results. Then later, you can do something like:
MPI_Testany(num_processes - 1, requests, &index, &flag, &status);
if (flag) {
    /* Process work */
    /* slot index belongs to rank index + 1 */
    MPI_Send(work_data, ..., index + 1, ...);
}
This will check to see if any of the requests (the handles used to track the status of a nonblocking operation) are completed and will then send new work to the worker that finished.
Obviously, all of this code is not copy/paste ready. You'll have to figure out how/if it applies to your work and adapt it accordingly.

How do I check that all MPI procs were used to call a procedure?

I have designed a procedure that must be called by all processors in the communicator in order to function properly. If the user called it with only the root rank, I want the procedure to know this and then produce a meaningful error message to the user of the procedure. At first I thought of having the procedure call a checking routine shown below:
subroutine AllProcsPresent
    ! Checks that all procs have been used to call this procedure
    use MPI_stub, only: nproc, Allreduce
    integer :: counter
    counter=1
    call Allreduce(counter) ! This is a stub procedure that will add "counter" across all procs
    if (counter(1)==get_nproc()) then
        return
    else
        print *, "meaningful error"
    end if
end subroutine AllProcsPresent
But this won't work because the Allreduce is going to wait for all procs to check in and if only root was used to do the call, the other procs will never arrive. Is there a way to do what I'm trying to do?

There's not much you can do here. You might want to look at 'collchk' for ideas, though it's hard to find good documentation for that package. Here's its git home:
http://git.mpich.org/mpe.git/tree/HEAD:/src/collchk
If you look at 'NOTES' there's an item about "call consistency" described as "Ensures that all processes in the communicator have made the same call in a given event". Hope that can give you some ideas.

Ensuring that a collective operation is entered by all ranks within a communicator is the responsibility of the programmer.
That said, you might consider using the MPI 3.0 non-blocking collective MPI_Ibarrier with an MPI_Test loop and a timeout. Non-blocking collectives can't be cancelled, though, so if the other ranks do not join the operation within your timeout, you will have to abort the entire job. Something like:
#include <mpi.h>
#include <time.h>

void AllPresent(MPI_Comm comm, double timeout) {
    int all_here = 0;
    MPI_Request req;
    /* 10 ms between polls; sleep(0.01) would truncate to sleep(0) and busy-wait */
    struct timespec pause = {0, 10 * 1000 * 1000};
    MPI_Ibarrier(comm, &req);
    double start_time = MPI_Wtime();
    do {
        MPI_Test(&req, &all_here, MPI_STATUS_IGNORE);
        if (!all_here) {
            nanosleep(&pause, NULL);
            if (MPI_Wtime() - start_time > timeout) {
                /* Print an error message */
                MPI_Abort(comm, 1);
            }
        }
    } while (!all_here);
    /* Run your procedure now */
}

limit the number of children and descendants processes

I have to use fork() recursively, but limit the number of forked processes (including children and descendants) to (for example) 100. Considering this code snippet:
void recursive(int n) {
    for(int i=0; i<n; i++) {
        if(number_of_processes() < 100) {
            if(fork() == 0) {
                number_of_processes_minus_one();
                recursive(i);
                exit(0);
            }
        }
        else
            recursive(i);
    }
}
How can I implement number_of_processes() and number_of_processes_minus_one()? Do I have to use IPC? I tried to pre-create a file, write PROC_MAX into it and lock-read-write-unlock it in number_of_processes(), but it still eats all my PIDs.

I suspect that the simplest thing to do is to use a pipe. Before you fork anything create a pipe, write 100 bytes into the write side, and close the write side. Then, try to read one byte from the pipe whenever you want to fork. If you are able to read a byte, then fork. If not, then don't. Trying to track the number of total forks with a global variable will fail if children are allowed to fork, but the pipe will persist across all descendants.
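A minimal sketch of the pipe idea, assuming POSIX pipe(), read() and write(). The names token_pool_create and token_pool_take are hypothetical helpers, not part of the question's code, and the budget is assumed to be small enough to fit in the pipe's buffer (100 bytes is far below the usual limit).

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper: creates a pipe preloaded with `budget` one-byte tokens
 * and returns its read end, or -1 on error. Call it once, before any fork().
 * Assumes `budget` fits in the pipe buffer; otherwise write() would block. */
int token_pool_create(int budget) {
    assert(budget >= 0);
    int fds[2];
    if (pipe(fds) != 0) {
        return -1;
    }
    for (int i = 0; i < budget; i++) {
        char token = 'x';
        if (write(fds[1], &token, 1) != 1) {
            close(fds[0]);
            close(fds[1]);
            return -1;
        }
    }
    /* With no writer left, reading from the drained pipe returns 0 (EOF)
     * instead of blocking. */
    close(fds[1]);
    return fds[0];
}

/* Hypothetical helper: returns 1 if a token was taken (the caller may fork),
 * 0 if the budget is used up or the read failed. */
int token_pool_take(int read_fd) {
    char token;
    return read(read_fd, &token, 1) == 1;
}
```

In recursive(), the check would become something like if (token_pool_take(fd) && fork() == 0) { ... }. The read end is inherited by every descendant, so all of them draw from the same budget. Note that this version never returns tokens, so it caps the total number of forks rather than the number of processes alive at once.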

Concurrent Processing - Petersons Algorithm

For those unfamiliar, the following is Peterson's algorithm used for process coordination:
int No_Of_Processes; // Number of processes
int turn; // Whose turn is it?
int interested[No_Of_Processes]; // All values initially FALSE
void enter_region(int process) {
    int other; // number of the other process
    other = 1 - process; // the opposite process
    interested[process] = TRUE; // this process is interested
    turn = process; // set flag
    while(turn == process && interested[other] == TRUE); // wait
}
void leave_region(int process) {
    interested[process] = FALSE; // process leaves critical region
}
My question is, can this algorithm give rise to deadlock?

No, there is no deadlock possible.
The only place a process waits is the while loop. The process argument is local to each thread and has a different value in each, whereas turn is shared. So turn == process can be true for at most one thread at any given moment, which means the two can never both be stuck in the loop.
That said, your code is not correct as written: Peterson's algorithm only works for two concurrent threads, not for an arbitrary No_Of_Processes as in your code.
In the original algorithm for N processes, deadlocks are possible.
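For reference, here is a sketch of the two-process version in C11. It uses sequentially consistent atomics from <stdatomic.h>, which is an assumption on my part and not in the question's code: with plain int variables, the compiler or CPU is free to reorder the stores and loads, and the algorithm can then fail to exclude.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Shared state for exactly two processes, numbered 0 and 1.
 * Static storage means both flags start out false and turn starts at 0. */
static atomic_bool interested[2];
static atomic_int turn;

void enter_region(int process) {
    assert(process == 0 || process == 1);
    int other = 1 - process;                  /* the opposite process */
    atomic_store(&interested[process], true); /* announce interest */
    atomic_store(&turn, process);             /* last one to write turn waits */
    while (atomic_load(&turn) == process && atomic_load(&interested[other])) {
        /* spin until the other process leaves or writes turn itself */
    }
}

void leave_region(int process) {
    assert(process == 0 || process == 1);
    atomic_store(&interested[process], false); /* leave the critical region */
}
```

Each thread calls enter_region(id) before its critical section and leave_region(id) after it, with id being 0 in one thread and 1 in the other.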
