OpenCL Profiling Info Unavailable for Marker Commands - opencl

Using AMD's APP OpenCL implementation with JOCL bindings, I'm trying to create a generic bracketing profiler using Java's automatic resource management. The basic idea is:
class Timer implements AutoCloseable {
    ...
    Timer() {
        ...
        clEnqueueMarker( commandQueue, startEvent );
    }
    @Override
    public void close() {
        cl_event stopEvent = new cl_event();
        clEnqueueMarker( commandQueue, stopEvent );
        clFinish( commandQueue );
        ... calculate and output times ...
    }
}
My problem is that profiling information is not available for the marker command events (stopEvent and startEvent). This is despite a) setting CL_QUEUE_PROFILING_ENABLE on the command queue and b) flushing and waiting on the command queue and verifying that the stop and start events are CL_COMPLETE with no errors.
So my question is, is profiling supported on marker commands in AMD OpenCL? If not, is it explicitly disallowed by the spec (I found nothing to this effect)?
Thanks.

I've rechecked the spec, and it seems to me that what you get is normal (though I had never paid much attention to that detail before). In section 5.12, on profiling, the standard states:
This section describes profiling of OpenCL functions that are enqueued
as commands to a command-queue. The specific functions being
referred to are: clEnqueue{Read|Write|Map}Buffer,
clEnqueue{Read|Write}BufferRect, clEnqueue{Read|Write|Map}Image,
clEnqueueUnmapMemObject, clEnqueueCopyBuffer, clEnqueueCopyBufferRect,
clEnqueueCopyImage, clEnqueueCopyImageToBuffer,
clEnqueueCopyBufferToImage, clEnqueueNDRangeKernel, clEnqueueTask and
clEnqueueNativeKernel.
So the clEnqueueMarker() function is not in the list, and I guess the CL_PROFILING_INFO_NOT_AVAILABLE value returned makes sense.

I just tried this and it seems to work now. Tested on Windows 10 with an AMD 7870 and on Linux with Nvidia's Titan Black and Titan X cards.
The OpenCL 1.2 spec still contains the paragraph @CaptainObvious quoted. The clEnqueueMarker function is still missing from that list, but I can get profiling information without a problem.
The start and end times on marker events are always equal, which makes a lot of sense.
Btw. clEnqueueMarker is deprecated in OpenCL 1.2 and should be replaced with clEnqueueMarkerWithWaitList.

Related

Device-side enqueue causes CL_OUT_OF_RESOURCES

I have a program utilizing OpenCL 2.0 because I want to take advantage of device-side enqueue. I have a test program that performs the following tasks on the host side:
1. Allocates 16 kilobytes of floating point memory on the device and zeros it out.
2. Builds the OpenCL program below, and creates a kernel for masterKernel().
3. Sets the first argument of masterKernel() (heap) to the memory allocated in step 1.
4. Enqueues masterKernel() via clEnqueueNDRangeKernel() with a work_dim of 1 and a global work size of 1. (So it only runs once, with get_global_id(0) always being zero.)
5. Reads the memory back into the host and displays it.
Here is the OpenCL code:
//This function was stripped down to nothing for testing purposes.
kernel void childKernel(global float* heap)
{
}

//Enqueues the child kernel.
kernel void masterKernel(global float* heap)
{
    ndrange_t ndRange = ndrange_1D(16); //Arbitrary, could be any number.
    if(get_global_id(0) == 0)
    {
        enqueue_kernel(get_default_queue(), 0, ndRange,
                       ^{ childKernel(heap); });
    }
}
The program builds successfully. However, when I try to run masterKernel(), the call to enqueue_kernel() causes the host-side call to clEnqueueNDRangeKernel() to fail with an error code of CL_OUT_OF_RESOURCES. OpenCL's documentation says enqueue_kernel() should return CL_SUCCESS or CL_ENQUEUE_FAILURE depending on whether the block enqueues successfully. It does not say that clEnqueueNDRangeKernel() itself should fail. Here are some other things I've tried:
Commenting out the call to enqueue_kernel() causes the program to succeed.
Adding a line that sets heap[0] to any number causes the host-side program to reflect that change. So I know that it's not a problem with how I'm feeding the arguments in.
Modifying the if statement so that it reads something impossible like if(get_global_id(0) == 6000) still causes the error. This tells me that the error is not caused by enqueue_kernel() executing (I verified get_global_size(0) == 1), but merely that it exists in the program at all.
Modifying the if statement to if(0) does make the error not happen.
Making it so childKernel() actually does something does not make the error go away.
I am not really sure what to try next. I know my device supports OpenCL 2.0. My device is an AMD Radeon R9 380 graphics card. I do not have access to any other OpenCL 2.0 capable hardware to test it on.
I ended up figuring this one out. The issue happened because I did not create a device-side queue (one created with the flags CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_ON_DEVICE | CL_QUEUE_ON_DEVICE_DEFAULT).

When Qt-5 will fail the connect

Reading Qt signal & slots documentation, it seems that the only reason for a new style connection to fail is:
"If there is already a duplicate (exact same signal to the exact same slot on the same objects), the connection will fail and connect will return false"
This means that the connection already succeeded the first time, and that multiple connections are not allowed when using Qt::UniqueConnection.
Does this mean that a Qt 5 style connection will always succeed? Are there any other reasons for failure?
The new-style connect can still fail at runtime for a variety of reasons:
Either sender or receiver is a null pointer. Obviously this requires a check that can only happen at runtime.
The PMF you specified for a signal is not actually a signal. Lacking proper C++ reflection capabilities, all you can do at compile time is checking that the signal is a non-static member function of the sender's class.
However, that's not enough to make it a signal: it also needs to be in a signals: section in your class definition. When moc sees your class definition, it will generate some metadata containing the information that that function is indeed a signal. So, at runtime, the pointer passed to connect is looked up in a table, and connect itself will fail if the pointer is not found (because you did not pass a signal).
The check in the previous point actually requires a comparison between pointers to member functions. It's a particularly tricky one, because it will typically involve two different TUs:
1. One is the TU containing moc-generated data (typically a moc_class.cpp file). In this TU there's the aforementioned table containing, amongst other things, pointers to the signals (which are just ordinary member functions).
2. The other is the TU where you actually invoke connect(sender, &Sender::signal, ...), which generates the pointer that gets looked up in the table.
Now, the two TUs may be in the same application, or perhaps one is in a library and the other in your application, or maybe in two libraries, etc; your platform's ABI starts to get into play.
In theory, the pointers stored when doing 1. are identical to the pointers generated when doing 2.; in practice, we've found cases where this does not happen (cf. this bug report that I reported some time ago, where older versions of GNU ld on ARM generated code that failed the comparison).
For Qt this meant disabling certain optimizations and/or passing some extra flags to the places where we know this to happen and break user software. For instance, as of Qt 5.9, there is no support for -Bsymbolic* flags on GCC on anything but x86 and x86-64.
Of course, this does not mean we've found and fixed all the possible places. New compilers and more aggressive optimizations might trigger this bug again in the future, making connect return false, even when everything is supposed to work.
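The table lookup described in this answer can be pictured with a small plain C++ sketch. It is not Qt's actual implementation: Sender, kSignalTable, signalIndex and connectSketch are hypothetical names, and the hand-written array only stands in for the metadata that moc would generate. It shows why the check can only happen at runtime and why it depends on two pointers to the same member function comparing equal.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a QObject subclass; this is not Qt code.
struct Sender {
    int changes = 0;
    bool done = false;
    void valueChanged() { ++changes; }   // treated as a "signal" in this sketch
    void finished() { done = true; }     // treated as a "signal" in this sketch
    void helper() { changes = -1; }      // ordinary member function, not registered
};

using SenderFn = void (Sender::*)();

// Plays the role of the moc-generated table of known signals.
static const SenderFn kSignalTable[] = {
    &Sender::valueChanged,
    &Sender::finished,
};

// Returns the index of fn in the table, or -1 if fn is not a registered signal.
// The lookup relies on operator== between pointers to member functions.
int signalIndex(SenderFn fn) {
    assert(fn != nullptr);
    const std::size_t n = sizeof(kSignalTable) / sizeof(kSignalTable[0]);
    for (std::size_t i = 0; i < n; ++i) {
        if (kSignalTable[i] == fn) return static_cast<int>(i);
    }
    return -1;
}

// Mimics the runtime checks: a null object or a non-signal makes it fail.
bool connectSketch(const Sender* sender, SenderFn signal, const Sender* receiver) {
    if (sender == nullptr || receiver == nullptr) return false;
    return signalIndex(signal) >= 0;
}
```

Here connectSketch(&a, &Sender::helper, &b) compiles but returns false, because helper is a valid member function that is absent from the table. If a toolchain ever produced two different pointer values for the same function in two TUs, the same lookup would fail for a genuine signal, which is the failure mode the answer describes.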
Yes, it can fail if either the sender or the receiver is not a valid object (nullptr, for example).
Example
QObject* obj1 = new QObject();
QObject* obj2 = new QObject();
// Will succeed
connect(obj1, &QObject::destroyed, obj2, &QObject::deleteLater);
delete obj1;
obj1 = nullptr;
// Will fail even if it compiles
connect(obj1, &QObject::destroyed, obj2, &QObject::deleteLater);
Do not try to register a pointer type. I used the macro
#define QT_REG_TYPE(T) qRegisterMetaType<T>(#T)
with the pointer type CMyWidget*, and that was the problem. Using the type directly worked.
No, it's not always successful. The docs give an example where connect would return false because the signal signature should not contain variable names.
// WRONG
QObject::connect(scrollBar, SIGNAL(valueChanged(int value)),
label, SLOT(setNum(int value)));

How to change the dart-sqlite code from synchronous style to asynchronous?

I'm trying to use Dart with SQLite through the dart-sqlite project.
But I found a problem: the API it provides is synchronous. The code looks like this:
// Iterating over a result set
var count = c.execute("SELECT * FROM posts LIMIT 10", callback: (row) {
print("${row.title}: ${row.body}");
});
print("Showing ${count} posts.");
With such code, I can't use Dart's Future support, and the code blocks on SQL operations.
I wonder how to change the code to asynchronous style? You can see it defines some native functions here: https://github.com/sam-mccall/dart-sqlite/blob/master/lib/sqlite.dart#L238
_prepare(db, query, statementObject) native 'PrepareStatement';
_reset(statement) native 'Reset';
_bind(statement, params) native 'Bind';
_column_info(statement) native 'ColumnInfo';
_step(statement) native 'Step';
_closeStatement(statement) native 'CloseStatement';
_new(path) native 'New';
_close(handle) native 'Close';
_version() native 'Version';
The native functions are mapped to some c++ functions here: https://github.com/sam-mccall/dart-sqlite/blob/master/src/dart_sqlite.cc
Is it possible to make it asynchronous? If so, what should I do?
If it is not possible and I have to rewrite it, do I have to rewrite all of the following?
The dart file
The c++ wrapper file
The actual sqlite driver
UPDATE:
Thanks to @GregLowe's comment, I learned that Dart's Completer can convert callback style to future style, which lets me use Dart's doSomething().then(...) instead of passing a callback function.
But after reading the source of dart-sqlite, I realized that, in the implementation of dart-sqlite, the callback is not event-based:
int execute([params = const [], bool callback(Row)]) {
  _checkOpen();
  _reset(_statement);
  if (params.length > 0) _bind(_statement, params);
  var result;
  int count = 0;
  var info = null;
  while ((result = _step(_statement)) is! int) {
    count++;
    if (info == null) info = new _ResultInfo(_column_info(_statement));
    if (callback != null && callback(new Row._internal(count - 1, info, result)) == true) {
      result = count;
      break;
    }
  }
  // If update affected no rows, count == result == 0
  return (count == 0) ? result : count;
}
Even if I use Completer, it won't improve performance, because execute still blocks until every row has been stepped through. I think I may have to rewrite the C++ code to make it event-based first.
You should be able to write a wrapper without touching the C++. Have a look at how to use the Completer class in dart:async. Basically you need to create a Completer, return Completer.future immediately, and then call Completer.complete(row) from the existing callback.
Re: update. Have you seen this article, specifically the bit about asynchronous extensions? i.e. If the C++ API is synchronous you can run it in a separate thread, and use messaging to communicate with it. This could be a way to do it.
The big problem you've got is that SQLite is an embedded database; in order to process your query and provide your results, it must do computation (and I/O) in your process. What's more, in order for its transaction handling system to work, it either needs its connection to be in the thread that created it, or for you to run in serialized mode (with a performance hit).
Because these are fairly hard constraints, your plan of switching things to an asynchronous operation mode is unlikely to go well except by using multiple threads. Since using multiple connections complicates things a lot (as you can't share some things between them, such as TEMP TABLEs) let's consider going for a single serialized connection; all activity will be serialized at the DB level, but for an application that doesn't use the DB a lot it will be OK. At the C++ level, you'd be talking about calling that execute from another thread and then sending messages back to the caller thread to indicate each row and the completion.
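A minimal C++ sketch of that shape, running a blocking call on another thread and handing the result back to the caller, might look like the following. It is an illustration under assumptions and not dart-sqlite code: blockingQuery is a hypothetical stand-in for the synchronous execute, QueryResult and queryAsync are made-up names, and the rows are delivered in one batch on completion instead of one message per row.

```cpp
#include <cassert>
#include <functional>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for the blocking execute(): invokes onRow for each
// row and returns the number of rows. A real version would step through a
// SQLite statement here.
int blockingQuery(const std::function<void(const std::string&)>& onRow) {
    const char* rows[] = {"first post", "second post", "third post"};
    int count = 0;
    for (const char* row : rows) {
        onRow(row);
        ++count;
    }
    return count;
}

// Everything the caller gets back once the query has finished.
struct QueryResult {
    std::vector<std::string> rows;
    int count = 0;
};

// Starts the blocking query on a worker thread and returns immediately.
// The caller collects the result later through the future.
std::future<QueryResult> queryAsync() {
    return std::async(std::launch::async, [] {
        QueryResult result;
        result.count = blockingQuery([&result](const std::string& row) {
            result.rows.push_back(row);
        });
        assert(result.count == static_cast<int>(result.rows.size()));
        return result;
    });
}
```

The caller keeps running after queryAsync() returns and only blocks when it calls get() on the future. Each call in this sketch starts its own thread; given the connection constraints described above, a real version would more likely funnel every query through one long-lived worker that owns the connection.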
But you'll take a real hit when you do this; in particular, you're committing to only doing one query at a time, as the technique runs into significant problems with semantic effects when you start using two connections at once and the DB forces serialization on you with one connection.
It might be simpler to do the above by putting the synchronous-asynchronous coupling at the Dart level by managing the worker thread and inter-thread communication there. That would let you avoid having to change the C++ code significantly. I don't know Dart well enough to be able to give much advice there.
Myself, I'd just stick with synchronous connection processing so that I can make my application use multi-threaded mode more usefully. I'd be taking the hit with the semantics and giving each thread its own connection (possibly allocated lazily) so that overall speed was better, but I do come from a programming community that regards threads as relatively heavyweight resources, so make of that what you will. (Heavy threads can do things that reduce the number of locks they need that it makes no sense to try to do with light threads; it's about overhead management.)

Sample Grabber Sink release() issue

I use a Sample Grabber Sink in my Media Session, using most of the code from the MSDN sample.
In the OnProcessSample method I memcpy the data into a media buffer, attach it to an MFSample and store that sample in a pointer owned by the main thread. The problem is that I get either memory leaks or crashes in ntdll.dll:
ntdll.dll!@RtlpLowFragHeapFree@8() Unknown
SampleGrabberSink:
OnProcessSample(...)
{
    MFCreateMemoryBuffer(dwSampleSize,&tmpBuff);
    tmpBuff->Lock(&data,NULL,NULL);
    memcpy(data,pSampleBuffer,dwSampleSize);
    tmpBuff->Unlock();
    MFCreateSample(&tmpSample);
    tmpSample->AddBuffer(tmpBuff);
    while(!(*Free) && (*pSample)!=NULL)
    {
        Sleep(1);
    }
    (*Free)=false;
    (*pSample)=tmpSample;
    (*Free)=true;
    SafeRelease(&tmpBuff);
}
In the main thread:
ReadSample()
{
    if(pSample==NULL)
        return;
    while(!Free)
        Sleep(1);
    Free=false;
    //process sample into dx surface//
    SafeRelease(&pSample);
    Free=true;
}
//hr checks omitted//
With this code I get that ntdll.dll error after playing a few videos.
I also tried pushing samples into a queue so that OnProcessSample doesn't have to wait, but then some memory was not freed after the video ended.
(Even now it practically doesn't wait: the session rate is 1 and the main thread can read at more than 60 fps.)
EDIT: It was a thread synchronization problem. Solved by using a critical section, thanks to Roman R.
It is not easy to see from the code snippet, but I suppose you are burning cycles on a streaming thread (the one your callback is called on) until a global/shared variable is NULL, and then you store a copy of the media sample there.
You need to look at synchronization APIs and serialize access to shared variables. You don't do that and eventually either you are accessing freed memory or breaking reference count of COM object.
You need an event set externally when you are ready to accept new buffer from the callback, then the callback sees the event, enters critical section (or, reader/writer lock), does your *pSample magic there, exits from critical section and sets another event indicating availability of a buffer.
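The handoff described above can be sketched in portable C++ as a one-element slot guarded by a mutex, with two condition variables standing in for the two events. This is an illustration and not Media Foundation code: HandoffSlot and its member names are hypothetical, and std::mutex and std::condition_variable are swapped in for the Win32 critical section and event objects. In the real program T would be something that owns a reference to the sample, for instance a COM smart pointer, so the reference count is only touched while the lock is held.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <utility>

// One-element mailbox shared by a producer (the streaming callback) and a
// consumer (the render loop). All access to the slot happens under mutex_.
template <typename T>
class HandoffSlot {
public:
    // Producer side: waits until the slot is free, then stores the value.
    void put(T value) {
        std::unique_lock<std::mutex> lock(mutex_);
        slotFree_.wait(lock, [this] { return !full_; });
        value_ = std::move(value);
        full_ = true;
        slotFull_.notify_one();
    }

    // Consumer side: waits until a value is available, then takes it.
    T take() {
        std::unique_lock<std::mutex> lock(mutex_);
        slotFull_.wait(lock, [this] { return full_; });
        assert(full_);
        T value = std::move(value_);
        full_ = false;
        slotFree_.notify_one();
        return value;
    }

    // Non-blocking variant for a loop that must not stall.
    // Returns false and leaves out untouched when the slot is empty.
    bool tryTake(T& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!full_) return false;
        out = std::move(value_);
        full_ = false;
        slotFree_.notify_one();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable slotFree_;  // signalled when the slot becomes empty
    std::condition_variable slotFull_;  // signalled when the slot is filled
    T value_{};
    bool full_ = false;
};
```

The callback would call put() with each new sample and the main thread would call tryTake() once per frame. Unlike the Sleep(1) polling on a shared flag, no thread ever reads or writes the slot without holding the lock.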

interrupted system call error when writing to a pipe

In my user-space Linux application, I have a thread which communicates with the main process through a pipe. Below is the code:
static void _notify_main(int cond)
{
    int r;
    int tmp = cond;
    r = write( _nfy_fd, &tmp, sizeof(tmp) );
    ERROR( "write failed: %d. %s\n", r, strerror(r) );
}
Pretty straightforward. It's been working fine for quite a while now. But recently, the write call started failing with an "interrupted system call" error after the programme underwent some stress testing.
Strangely, the stuff actually went through the pipe no problem. Of course I'd still like to go to the bottom of the error message and get rid of it.
Thanks,
The write(2) man page mentions:
Conforming to
SVr4, 4.3BSD, POSIX.1-2001.
Under SVr4 a write may be interrupted and return EINTR at any point, not just before any data is written.
I guess you were just lucky that it didn't occur so far.
If you google just for the "interrupted system call", you will find this thread which tells you to use siginterrupt() to auto-restart the write call.
From http://www.gnu.org/
A signal can arrive and be handled while an I/O primitive such as open
or read is waiting for an I/O device. If the signal handler returns,
the system faces the question: what should happen next?
POSIX specifies one approach: make the primitive fail right away. The
error code for this kind of failure is EINTR. This is flexible, but
usually inconvenient. Typically, POSIX applications that use signal
handlers must check for EINTR after each library function that can
return it, in order to try the call again. Often programmers forget to
check, which is a common source of error.
So you can handle the EINTR error yourself. There is another choice, by the way: you can use sigaction to establish a signal handler and specify how that handler should behave. With the SA_RESTART flag, returning from that handler resumes the primitive; otherwise, returning from that handler causes EINTR.
see interrupted primitives
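The first option, checking for EINTR and trying the call again, can be sketched as a small retry loop. write_all and notify_main are hypothetical names introduced for this example, not part of the question's code. One more thing may be worth checking in the snippet as posted: it logs unconditionally and passes the return value r to strerror instead of errno. If int is 4 bytes, a successful write returns 4, and errno value 4 is EINTR on Linux, so that line could print "Interrupted system call" even though nothing failed, which would fit the observation that the data went through the pipe.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <unistd.h>

// Writes exactly len bytes to fd. Retries when write() fails with EINTR and
// continues after a short write.
// Returns 0 on success, or -1 with errno set by the failing write().
int write_all(int fd, const void* buf, std::size_t len) {
    assert(buf != nullptr || len == 0);
    const char* p = static_cast<const char*>(buf);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) continue;  // interrupted by a signal: try again
            return -1;                     // genuine error
        }
        p += n;
        len -= static_cast<std::size_t>(n);
    }
    return 0;
}

// Hypothetical variant of _notify_main(): it reports only real failures and
// formats errno, not the return value of write().
int notify_main(int fd, int cond) {
    if (write_all(fd, &cond, sizeof(cond)) != 0) {
        std::fprintf(stderr, "write failed: %s\n", std::strerror(errno));
        return -1;
    }
    return 0;
}
```

With SA_RESTART set on every installed handler the retry branch would rarely be taken, but keeping it costs little and covers handlers installed by other code.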

Resources