OpenCL callback on already completed event

OpenCL callback on already completed event - opencl

In OpenCL I can register a callback function to be called when an event has completed on the GPU using clSetEventCallback.
But I get the cl_event only immediatly after enqueuing the command on the queue. So there is a small possibility that at the time when clSetEventCallback is called on the CPU, the event has already completed on the GPU.
If clSetEventCallback is called on an event that is already completed, will the OpenCL driver call the callback anyways?

OpenCL specification says that:
All callbacks registered for an event object must be called. All enqueued callbacks shall be called before the event object is destroyed. Callbacks must return promptly. The behavior of calling expensive system routines, OpenCL API calls to create contexts or command-queues, or blocking OpenCL operations from the following list below, in a callback is undefined.
It is a bit vague, but I think you can assume that a callback will be called even if an event is already completed by the time you call clSetEventCallback. Otherwise it makes user's code unnecessary complicated.

Related

Deadlocks, Mutexes and Signal,Slots in Qt

I am developing an application that uses QSerialPort to receive data via uart. In this application I use a log function that writes to a file. Before writing to the file the application locks a mutex and after writing to the file it unlocks the mutex. Between mutex lock and unlock I am not calling the log function again.
Of course the data coming from serial port arrives asynchronously and it triggers a signal and a slot where the data is processed. In this function where the data is processed I am calling the log function again.
I am not using multithreading in my application, as far as I know the slots are called in the same thread.
The question is: can the single thread deadlock itself when the data from QSerialPort arrives exactly after the mutex in the log function was locked ? (This would mean a double lock of the same mutex - assume we do not use a recursive mutex)
Is there a good source of knowledge about such a topic ?

If your serial port gets data exactly after the mutex locked it will not execute immediatly (because this is not interruption) but it wait when QEventLoop will execute a receiving data slot (in global event loop). You can execute all events manually with qApp->processEvents() so try to avoid it inside lock/unlock block.
If you use explicit calling of log function in single thread you will not have a deadlock (I think). But be carefull with writing a log via qDebug() macro with reimplementing qInstallMessageHandler because you can forget and use qDebug() inside mutex lock/unlock block. Then you will have a deadlock.
Also it could be some troubles with callback functions inside your lock/unlock block. So be carefull with callbacks too.

How is asynchronous callback implemented?

How do all the languages implements asynchronous callbacks?
For example in C++, one need to have a "monitor thread" to start a std::async. If it is started in main thread, it has to wait for the callback.
std::thread t{[]{std::async(callback_function).get();}}.detach();
v.s.
std::async(callback_function).get(); //Main thread will have to wait
What about asynchronous callbacks in JavaScript? In JS callbacks are massively used... How does V8 implement them? Does V8 create a lot of threads to listen on them and execute callback when it gets message? Or does it use one thread to listen on all the callbacks and keep refreshing?
For example,
setInterval(function(){},1000);
setInterval(function(){},2000);
Does V8 create 2 threads and monitor each callback state, or it has a pool thing to monitor all the callbacks?

V8 does not implement asynchronous functions with callbacks (including setInterval). Engine simply provides a way to execute JavaScript code.
As a V8 embedder you can create setInterval JavaScript function linked to your native C++ function that does what you want. For example, create thread or schedule some job. At this point it is your responsibility to call provided callback when it is necessary. Only one thread at a time can use V8 engine (V8 isolate instance) to execute code. This means synchronization is required if a callback needs to be called from another thread. V8 provides locking mechanism is you need this.
Another more common approach to solve this problem is to create a queue of functions for V8 to execute and use infinite queue processing loop to execute code on one thread. This is basically an event loop. This way you don't need to use execution lock, but instead use another thread to push callback function to a queue.
So it depends on a browser/Node.js/other embedder how they implement it.

TL;DR: To implement asynchronous callback is basically to allow the control flow to proceed without blocking for the callback. Before the callback function is finally called, the control flow is free to execute anything that has no dependence on the callback's result, e.g., the caller can proceed as if the callback function has returned, or the caller may yield its control to other functions.
Since the question is for general implementation rather than a specific language, my answer tries to be as general as to cover the implementation commonalities.
Different languages have different implementations for asynchronous callbacks, but the principles are the same. The key is to decouple the control flow from the code executed. They correspond to the execution context (like a thread of control with a runtime stack) and the executed task. Traditionally the execution context and the executed task are usually 1:1 associated. With asynchronous callbacks, they are decoupled.
1. The principles
To decouple the control flow from the code, it is helpful to think of every asynchronous callback as a conditional task. When the code registers an asynchronous callback, it virtually installs the task's condition in the system. The callback function is then invoked when the condition is satisfied. To support this, a condition monitoring mechanism and a task scheduler are needed, so that,
The programmer does not need to track the callback's condition;
Before the condition is satisfied, the program may proceed to execute other code that does not depend on the callback's result, without blocking on the condition;
Once the condition is satisfied, the callback is guaranteed to execute. The programmer does not need to schedule its execution;
After the callback is executed, its result is accessible to the caller.
2. Implementation for Portability
For example, if your code needs to process the data from a network connection, you do not need to write the code checking the connection state. You only registers a callback that will be invoked once the data is available for processing. The dirty work of connection checking is left to the language implementation, which is known to be tricky especially when we talk about scalability and portability.
The language implementation may employ asynchronous io, nonblocking io or a thread pool or whatever techniques to check the network state for you, and once the data is ready, the callback function is then scheduled to execute. Here the control flow of your code looks like directly going from the callback registration to the callback execution, because the language hides the intermediate steps. This is the portability story.
3. Implementation for Scalability
To hide the dirty work is only part of the whole story. The other part is that, your code itself does not need to block waiting for the task condition. It does not make sense to wait for one connection's data when you have lots of network connections simultaneously and some of them may already have data ready. The control flow of your code can simply register the callback, and then moves on with other tasks (e.g., the callbacks whose conditions have been satisfied), knowing that the registered callbacks will be executed anyway when their data are available.
If to satisfy the callback's condition does not involve much of the CPU (e.g., waiting for a timer, or waiting for the data from network), and the callback function itself is light-weighted, then single CPU (or single thread) is able to process lots of callbacks concurrently, such as incoming network requests processing. Here the control flow may look like jumping from one callback to another. This is the scalability story.
4. Implementation for Parallelism
Sometimes, the callbacks are not pending for non-blocking IO condition, but for blocking operations such as page fault; or the callbacks do not rely on any condition, but are pure computation logics. In this case, asynchronous callback does not save you the CPU waiting time (because there is no idle waiting). But since asynchronous callback implies that the callback function can be executed in parallel with the caller or other callbacks (subject to certain data sharing and synchronization constraints), the language implementation can dispatch the callback tasks to different threads, achieving the benefits of parallelism, if the platform has more than one hardware thread context. It still improves scalability.
5. Implementation for Productivity
The productivity with asynchronous callback may not be very positive when the code need to deal with chained callbacks, i.e., when callbacks register other callbacks in recursive way, known as callback hell. There are ways to rescue.
The semantics of an asynchronous callback can be explored so as to substitute the hopeless nested callbacks with other language constructs. Basically there can be two different views of callbacks:
From data flow point of view: asynchronous callback = event + task.
To register a callback essentially generates an event that will emit
when the task condition is satisfied. In this view, the chained
callbacks are just events whose processing triggers other event
emission. It can be naturally implemented in event-driven
programming, where the task execution is driven by events. Promise
and Observable may also be regarded as event-driven concept. When
multiple events are ready concurrently, their associated tasks can
be executed concurrently as well.
From control flow point of view: to register a callback yields the
control to other code, and the callback execution just resumes the
control flow once its condition is satisfied. In this view, chained
asynchronous callbacks are just resumable functions. Multiple
callbacks can be written as one after another in traditional
"synchronous" way, with yield operation in between (or await). It
actually becomes coroutine.
I haven't discussed the implementation of data passing between the asynchronous callback and its caller, but that is usually not difficult if using shared memory where caller and callback can share data. Actually Golang's channel can also be considered in line of yield/await but with its focus on data passing.

The callbacks that are passed to browser APIs, like setTimeout, are pushed into the same browser queue when the API has done its job.
The engine can check this queue when the stack is empty and push the next callback into the JS stack for execution.
You don’t have to monitor the progress of the API calls, you asked it to do a job and it will put your callback in the queue when it’s done.

Does cudaFree after asynchronous call work?

I want to ask whether calling to cudaFree after some asynchronous calls is valid? For example
int* dev_a;
// prepare dev_a...
// launch a kernel to process dev_a (asynchronously)
cudaFree(dev_a);
In this case, since kernel launch is asynchronous, when the cudaFree part is reached, the kernel may haven't finish running yet. Then will the cudaFree(dev_a) immediately after it destroy the data?

As per Jared's comment, I am about 99% certain that the CUDA driver free/malloc pair are implemented as blocking calls which will synchronize the context on which they operate before they execute the call.

CUDA now provide functions for asynchronous memory management based on streams: cudaMallocAsync, cudaMemcpyAsync, cudaMemcpyAsync.
A short introduction is available here

QObject destructors not being called

I have two QObject child classes in my Qt application. One object from each of these classes was instantiated on the stack. Previously, my application would exit cleanly. However, since I've updated to Qt5.1.0, their destructors are not being called. I get the following warning twice when I launch the debugger.
the debug information found in "/usr/lib/debug//lib64/libfreebl3.so.debug"
does not match "/lib64/libfreebl3.so" (CRC mismatch)
Is this a bug in Qt or in my code?

See the documentation of QCoreApplication::exec:
We recommend that you connect clean-up code to the aboutToQuit() signal, instead of putting it in your application's main() function because on some platforms the QCoreApplication::exec() call may not return. For example, on Windows when the user logs off, the system terminates the process after Qt closes all top-level windows. Hence, there is no guarantee that the application will have time to exit its event loop and execute code at the end of the main() function after the QCoreApplication::exec() call.
You're using it incorrectly. It is not guaranteed that exec will be terminated after windows are closed. You should use aboutToQuit signal to stop other threads. If this signal is not emitted either, you need to call QApplication::quit() explicitly when your window is closed.

I'm not exactly sure in this case if this is a bug in your code or not, but anyway it is not recommended to create QObjects in the stack.
The reason is that the parent object (if any) will automatically call delete when destroyed, but then the object will also be automatically destroyed when it goes out of scope. Hence the object is destroyed twice which is Undefined Behaviour. That may explained why it worked well in one case, and not in another, since you can't rely on any consistent behaviour.
(But in your case it is weird that you say the destructor is not called at all...)

QObject based class has a queued connection to itself

I was digging into some source code I am working on. I found a peculiar statement that someone had coded. The source code is a GUI application with a QML GUI and uses QT 4.7.x.
The snippet below belongs to core application logic.
// connect signal-slots for decoupling
QObject::connect (this, SIGNAL(setCurrentTaskSignal(int)), this,
SLOT(SetCurrentTaskSlot(int)), Qt::QueuedConnection);
It's strange that the object connects to itself via a queued connection which essentially means that the object may "live" in different threads at the same time?
At first glance It didn't made any sense to me. Can anyone think of any reason why such a connection would be plausible or needed?. Would this even work?

It will work without any problem. Maybe there was some event loop processing required before calling SetCurrentTaskSlot?
Note that QueuedConnection doesn't mean that something is in different thread. QueuedConnection means only that when signal is emitted, corresponding slot won't be called directly. It will be queued on event loop, and will be processed when control will be given back to event loop

The queued connection implies nothing about where the receiver lives. The opposite is true: to safely send signals to an object living in another thread, you must use queued connections. But you can use them for an object living in any thread!
One uses a queued connection to ensure that the signal will be delivered from within the event loop, and not immediately from the emit site as happens with direct connection. Direct connection is conceptually a set of calls to function pointers on a list. Queued connection is conceptually an event sent to a clever receiver who can execute a function call based on the contents of the event.
The event is the internal QMetaCallEvent, and it is QObject::event that acts upon this event and executes the call.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex