Device-side enqueue causes CL_OUT_OF_RESOURCES - opencl

I have a program utilizing OpenCL 2.0 because I want to take advantage of device-side enqueue. I have a test program that performs the following tasks on the host side:
Allocates 16 kilobytes of floating point memory on the device and zeros it out.
Builds the OpenCL program below, and creates a kernel of masterKernel()
Sets the first argument of masterKernel() (heap) to the allocated memory in step 1
Enqueues that masterKernel() via clEnqueueNDRangeKernel() with a work_dim of 1 and a global work size of 1. (So it only runs once, with get_global_id(0) always being zero)
Reads the memory back into the host and displays it.
Here is the OpenCL code:
//This function was stripped down to nothing for testing purposes.
kernel void childKernel(global float* heap)
{
}
//Enqueues the child kernel.
kernel void masterKernel(global float* heap)
{
ndrange_t ndRange = ndrange_1D(16); //Arbitrary, could be any number.
if(get_global_id(0) == 0)
{
enqueue_kernel(get_default_queue(), 0, ndRange,
^{ childKernel(heap); });
}
}
The program builds successfully. However, when I try to run masterKernel(), The call to enqueue_kernel() here causes the host side call to clEnqueueNDRangeKernel() to fail with an error code of CL_OUT_OF_RESOURCES. OpenCL's documentation says enqueue_kernel() should return CL_SUCCESS or CL_ENQUEUE_FAILURE depending on if the block enqueues successfully or not. It does not say that clEnqueueNDRangeKernel() itself should fail. Here are some other things I've tried:
Commenting out the call to enqueue_kernel() causes the program to succeed.
Adding a line that sets heap[0] to any number causes the host-side program to reflect that change. So I know that it's not a problem with how I'm feeding the arguments in
Modifying the if statement so that it reads something impossible like if(get_global_id(0) == 6000) still causes the error. This tells me that the error is not caused by enqueue_kernel() executing (I verified get_global_size(0) == 1), but merely that it exists in the program at all.
Modifying the if statement to if(0) does make the error not happen.
Making it so childKernel() actually does something does not make the error go away.
I am not really sure what to try next. I know my device supports OpenCL 2.0. My device is an AMD Radeon R9 380 graphics card. I do not have access to any other OpenCL 2.0 capable hardware to test it on.

I ended up figuring this one out. This issue happened because I did not create a device-side queue (one with the flags of CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_ON_DEVICE | CL_QUEUE_ON_DEVICE_DEFAULT).

Related

When to use MPI_BUFFER_ATTACH?

As far as I know, MPI_BUFFER_ATTACH must be called by a process if it is going to do buffered communication. But does this include the standard MPI_SEND as well? We know that MPI_SEND may behave either as a synchronous send or as a buffered send.
You need to call MPI_Buffer_attach() only if you plan to perform (explicitly) buffered sends via MPI_Bsend().
If you only plan to MPI_Send() or MPI_Isend(), then you do not need to invoke MPI_Buffer_attach().
FWIW, buffered sends are error prone and I strongly encourage you not to use them.
MPI_Buffer_attach
Attaches a user-provided buffer for sending
Synopsis
int MPI_Buffer_attach(void *buffer, int size)
Input Parameters
buffer
initial buffer address (choice)
size
buffer size, in bytes (integer)
Notes
The size given should be the sum of the sizes of all outstanding
Bsends that you intend to have, plus MPI_BSEND_OVERHEAD for each Bsend
that you do. For the purposes of calculating size, you should use
MPI_Pack_size. In other words, in the code
MPI_Buffer_attach( buffer, size );
MPI_Bsend( ..., count=20, datatype=type1, ... );
...
MPI_Bsend( ..., count=40, datatype=type2, ... );
the value of size in the MPI_Buffer_attach call should be greater than the value computed by
MPI_Pack_size( 20, type1, comm, &s1 );
MPI_Pack_size( 40, type2, comm, &s2 );
size = s1 + s2 + 2 * MPI_BSEND_OVERHEAD;
The MPI_BSEND_OVERHEAD gives the maximum amount of space that may be used in the buffer for use by the BSEND routines in using the buffer. This value is in mpi.h (for C) and mpif.h (for Fortran).
Thread and Interrupt Safety
The user is responsible for ensuring that multiple threads do not try to update the same MPI object from different threads. This routine should not be used from within a signal handler.
The MPI standard defined a thread-safe interface but this does not mean that all routines may be called without any thread locks. For example, two threads must not attempt to change the contents of the same MPI_Info object concurrently. The user is responsible in this case for using some mechanism, such as thread locks, to ensure that only one thread at a time makes use of this routine. Because the buffer for buffered sends (e.g., MPI_Bsend) is shared by all threads in a process, the user is responsible for ensuring that only one thread at a time calls this routine or MPI_Buffer_detach.
Notes for Fortran
All MPI routines in Fortran (except for MPI_WTIME and MPI_WTICK) have an additional argument ierr at the end of the argument list. ierr is an integer and has the same meaning as the return value of the routine in C. In Fortran, MPI routines are subroutines, and are invoked with the call statement.
All MPI objects (e.g., MPI_Datatype, MPI_Comm) are of type INTEGER in Fortran.
Errors
All MPI routines (except MPI_Wtime and MPI_Wtick) return an error value; C routines as the value of the function and Fortran routines in the last argument. Before the value is returned, the current MPI error handler is called. By default, this error handler aborts the MPI job. The error handler may be changed with MPI_Comm_set_errhandler (for communicators), MPI_File_set_errhandler (for files), and MPI_Win_set_errhandler (for RMA windows). The MPI-1 routine MPI_Errhandler_set may be used but its use is deprecated. The predefined error handler MPI_ERRORS_RETURN may be used to cause error values to be returned. Note that MPI does not guarentee that an MPI program can continue past an error; however, MPI implementations will attempt to continue whenever possible.
MPI_SUCCESS
No error; MPI routine completed successfully.
MPI_ERR_BUFFER
Invalid buffer pointer. Usually a null buffer where one is not valid.
MPI_ERR_INTERN
An internal error has been detected. This is fatal. Please send a bug report to mpi-bugs#mcs.anl.gov.
See Also MPI_Buffer_detach, MPI_Bsend
Refer Here For More
Buffer allocation and usage
Programming with MPI
MPI - Bsend usage

heap memory release policy in Arduino

#include <Arduino.h>
#include "include/MainComponent.h"
/*
 Turns on an LED on for one second, then off for one second, repeatedly.
*/
MainComponent* mainComponent;
void setup()
{
   mainComponent = new MainComponent();
   mainComponent->beginComponent();
}
void loop()
{
   mainComponent->runComponent();
}
is there any callback to release memory in Arduino ?(e.g to call delete mainComponent)
or this will happen automatically as the loop ends?
what is the strategy to ensure freeing the memory allocated in that code snippet?
SCENARIO :"I wanted to access the object in both methods , so the  object is declared in the global scope then instantiated at setup."
What happen when loop() terminated ? will  mainComponent still remain in the memory?
If it was in OS NO , process will terminated then resources will be deallocated.
So in Arduino how can I achieve above SCENARIO , by ensuring memory will be deallocated when the controller is switched off ?
What is confusing you is that the main() function is hidden by the basic Arduino IDE. Your programs have a main() function just like on any other platform, and have a lifecycle same as when run on a computer with OS. If you look under arduino___\hardware\cores\aduino, you will find a file main.cpp, which is included into your binaries:
int main(void)
{
init();
//...
setup();
for (;;) {
loop();
if (serialEventRun) serialEventRun();
}
return 0;
}
Considering this file you will now see, that while you exit the loop(), it is continuously called. Your program never exits. In general, your best pattern is to new objects once and never delete, like you have done here. If you are new'ing and delete'ing objects repeatedly on a microcontroller, you are not thinking about lifecycles and resources wisely.
So
"is the new'd object deleted at return from loop()?" No, the program is still running and it stays on the heap.
"What happens at power off? Is there a way to clean up?" The moment the supply voltage drops too low, the microcontroller will stop executing instructions. Power supervisor circuitry prevents the controller from doing anything erratic as the voltage drops (should prevent) When the voltage is conpletely drained, all the RAM is lost. Without adding circuitry, you have no way to execute any clean up at power off.
"Do I need to clean up?" No, at power up, everything is reset to a known state. Operation cannot be affected by anything left behind in RAM (presumes you initialize all your variables).

OpenCL Profiling Info Unavailable for Marker Commands

Using AMDs APP OpenCL implementation with JOCL bindings, I'm trying to create a generic bracketing profiler using Java automatic resource management. The basic idea is:
class Timer implements AutoCloseable {
...
Timer {
...
clEnqueueMarker( commandQueue, startEvent );
}
void close() {
cl_event stopEvent = new cl_event();
clEnqueueMarker( commandQueue, stopEvent );
clFinish( commandQueue );
... calculate and output times ...
}
}
My problem is that profiling information is not available for the marker command events (stopEvent and startEvent). This is despite a) setting CL_QUEUE_PROFILING_ENABLE on the command queue and b) flushing and waiting on the command queue and verifying that the stop and start events are CL_COMPLETE with no errors.
So my question is, is profiling supported on marker commands in AMD OpenCL? If not, is it explicitly disallowed by the spec (I found nothing to this effect)?
Thanks.
I've rechecked the spec and it seems to me that what you get is normal (though I've never paid much attention to that detail previously). In the section 5.12 about profiling, the standard states:
This section describes profiling of OpenCL functions that are enqueued
as commands to a command-queue. The specific functions being
referred to are: clEnqueue{Read|Write|Map}Buffer,
clEnqueue{Read|Write}BufferRect, clEnqueue{Read|Write|Map}Image,
clEnqueueUnmapMemObject, clEnqueueCopyBuffer, clEnqueueCopyBufferRect,
clEnqueueCopyImage, clEnqueueCopyImageToBuffer,
clEnqueueCopyBufferToImage, clEnqueueNDRangeKernel , clEnqueueTask and
clEnqueueNativeKernel.
So the clEnqueueMarker() function is not in the list, and I guess the CL_PROFILING_INFO_NOT_AVAILABLE value returned makes sense.
I just tried this and it seems to work now. Tested on Windows 10 with an AMD 7870 and on Linux with Nvidias Titan Black and Titan X cards.
The OpenCL 1.2 specs still contain the paragraph #CaptainObvious quoted. The clEnqueueMarker function is still missing, but I can get profiling information without a problem.
The start and end times on marker events are always equal, which makes a lot of sense.
Btw. clEnqueueMarker is deprecated in OpenCL 1.2 and should be replaced with clEnqueueMarkerWithWaitList.

Confusion over the time taken by write() and read() sys calls

The below code simply calculates the time taken to write a file.
#include<time.h>
void main()
{
int fp;
long a,b;
char *str = "Life is like that only";
fp = open("tmp.txt",O_WRONLY,0666);
time(&a);
write(fp,str);
time(&b);
/*(b-a) should be the time taken to write
* the file tmp.txt.
*/
close(fp);
return;
}
My question is that if we have a single CPU then whether the time taken (b-a) would be exact or it can be affected by the execution of other process running parallel.
Some posts here mention that write() and read() can be treated almost like atomic syscalls as if they are not successful EINTR is set that simply means to try again.But still does that mean if it is successful then in the course of its execution all other processes are on hold.
Other processes (that are not using I/O or that are using I/O on different devices) can run while your process is waiting for the write to complete, and your process may not immediately get the CPU back after it completes.
In practice, for a small write to a regular file, your write() will probably return immediately after copying your data into a kernel-space buffer, rather than waiting for it to go all the way to the disk.

How can I terminate a QThread

Recently ,I come across this problem as I memtioned in this Title.
I have tried by using QThread::terminate(),but I just can NOT stop
the thread ,which is in a dead loop (let's say,while(1)).
thanks a lot.
Terminating the thread is the easy solution to stopping an async operation, but it is usually a bad idea: the thread could be doing a system call or could be in the middle of updating a data structure when it is terminated, which could leave the program or even the OS in an unstable state.
Try to transform your while(1) into while( isAlive() ) and make isAlive() return false when you want the thread to exit.
QThreads can deadlock if they finish "naturally" during termination.
For example in Unix, if the thread is waiting on a "read" call, the termination attempt (a Unix signal) will make the "read" call abort with an error code before the thread is destroyed.
That means that the thread can still reach it's natural exit point while being terminated. When it does so, a deadlock is reached since some internal mutex is already locked by the "terminate" call.
My workaround is to actually make sure that the thread never returns if it was terminated.
while( read(...) > 0 ) {
// Do stuff...
}
while( wasTerminated )
sleep(1);
return;
wasTerminated here is actually implemented a bit more complex, using atomic ints:
enum {
Running, Terminating, Quitting
};
QAtomicInt _state; // Initialized to Running
void myTerminate()
{
if( _state.testAndSetAquire(Running, Terminating) )
terminate();
}
void run()
{
[...]
while(read(...) > 0 ) {
[...]
}
if( !_state.testAndSetAquire(Running, Quitting) ) {
for(;;) sleep(1);
}
}
Have you tried exit or quit?
Did the thread call QThread::setTerminationEnabled(false)? That would cause thread termination to delay indefinitely.
EDIT: I don't know what platform you're on, but I checked the Windows implementation of QThread::terminate. Assuming the thread was actually running to begin with, and termination wasn't disabled via the above function, it's basically a wrapper around TerminateThread() in the Windows API. This function accepts disrespect from no thread, and tends to leave a mess behind with resource leaks and similar dangling state. If it's not killing the thread, you're either dealing with zombie kernel calls (most likely blocked I/O) or have even bigger problems somewhere.
To use unnamed pipes
int gPipeFdTest[2]; //create a global integer array
As an when where you intend to create pipes use
if( pipe(gPipeFdTest) < 0)
{
perror("Pipe failed");
exit(1);
}
The above code will create a pipe which has two ends gPipeFdTest[0] for reading and gPipeFdTest[1] for writing. What you can do is in your run function set up to read the pipe using select system call. And from where you want to come out of run, there set up to write using write system call. I have used select system call for monitoring the read end of the pipe as it suits my implmentation. Try to figure all this out in your case. If you need any more help, give me a buzz.
Edit:
My problem was just like yours. I had a while(1) loop and the other things I tried needed mutexes and other fancy multithreading mumbo jumbo, which added complexity and debugging was nightmare. Using pipes absolved me from those complexities besides simplified the code. I am not saying that it is the best option but in my case it turned out to be the best and cleanest alternative. I was bugged my hung application before this solution.

Resources