How can I make CUDA return control after kernel launch? - asynchronous

It might be a stupid question, but is there a way to return from a kernel asynchronously? For example, I have a kernel that does a first stream compaction, whose result is output to the user, but before returning it must also do a second stream compaction to update its internal structure.
Is there a way to return control to the user after the first stream compaction is done, while the GPU continues the second stream compaction in the background? Of course, the second stream compaction works only on shared and global memory; there is nothing from it that the user needs to retrieve.
I can't use thrust.

A GPU kernel does not, in itself, take control from the "user", i.e. from CPU threads on the system with the GPU.
However, with the CUDA runtime API, the plain way to invoke a GPU kernel enqueues it on the default stream, which serializes with other work on the device and which many subsequent runtime calls implicitly wait on:
my_kernel<<<my_grid_dims, my_block_dims, dynamic_shared_memory_size>>>(args, go, here);
but you can also use (non-default) streams. These are hardware-supported execution queues on which you can enqueue work (memory copies, kernel executions etc.) asynchronously, just as you asked.
Your launch in this case may look like:
cudaStream_t my_stream;
cudaError_t result = cudaStreamCreateWithFlags(&my_stream, cudaStreamNonBlocking);
if (result != cudaSuccess) { /* error handling */ }
my_kernel<<<my_grid_dims,my_block_dims,dynamic_shared_memory_size,my_stream>>>(args,go,here);
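Applied to the question's two-phase compaction, a rough sketch might look like the following (kernel names, buffers and sizes are made up; error checks are omitted; the host output buffer should be pinned, e.g. with cudaMallocHost, for the copy to overlap properly). The host gets the first result back as soon as it is ready, while the second, internal compaction keeps running on the same stream:

cudaStream_t stream;
cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

// Phase 1: the compaction whose result the user wants
first_compaction<<<grid_dims, block_dims, 0, stream>>>(d_input, d_user_output, n);
cudaMemcpyAsync(h_user_output, d_user_output, output_bytes,
                cudaMemcpyDeviceToHost, stream);

// Mark the point at which the user's result is complete
cudaEvent_t first_phase_done;
cudaEventCreateWithFlags(&first_phase_done, cudaEventDisableTiming);
cudaEventRecord(first_phase_done, stream);

// Phase 2: internal bookkeeping; nothing is copied back
second_compaction<<<grid_dims, block_dims, 0, stream>>>(d_internal_state, n);

cudaEventSynchronize(first_phase_done);  // wait only for the user's result
// ... hand h_user_output to the user; phase 2 finishes in the background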
There are lots of resources on using streams; try this blog post for starters. The CUDA programming guide has a large section on asynchronous execution.
Streams and various libraries
Thrust has offered asynchronous functionality for a while, using thrust::future and other constructs. See here.
My own Modern-C++ CUDA API wrappers make it somewhat easier to work with streams, relieving you of the need to check for errors all the time and to remember to destroy streams and release memory before it goes out of scope. See this example; the syntax looks something like this:
auto stream = device.create_stream(cuda::stream::async);
stream.enqueue.copy(d_a.get(), a.get(), nbytes);
stream.enqueue.kernel_launch(my_kernel, launch_config, d_a.get(), more, args);
(and errors throw an exception)

Related

Efficiently connecting an asynchronous IMFSourceReader to a synchronous IMFTransform

Given an asynchronous IMFSourceReader connected to a synchronous-only IMFTransform.
For the IMFSourceReaderCallback::OnReadSample() callback, is it a good idea not to call IMFTransform::ProcessInput directly within OnReadSample, but instead to push the produced sample onto another queue for another thread to call the transform's ProcessInput on?
Or would I just be replicating identical work that source readers typically do internally? Or, put another way, does work within OnReadSample risk blocking further decoding work within the source reader that could otherwise have happened more asynchronously?
So I am suggesting something like:
WorkQueue transformInputs;
...

// Called back asynchronously by the source reader
HRESULT OnReadSampleCallback(... IMFSample* sample)
{
    // Push the sample and return immediately
    Push(transformInputs, sample);
    return S_OK;
}

// Different worker thread, awoken for samples on the transformInputs queue
void OnTransformInputWork()
{
    // The transform object is not async capable
    transform->ProcessInput(0, Pop(transformInputs), 0);
    ...
}
This is touched on, but not elaborated on, under 'Implementing the Callback Interface' here:
https://learn.microsoft.com/en-us/windows/win32/medfound/using-the-source-reader-in-asynchronous-mode
Or is it completely dependent on whatever the source reader sets up internally and not easily determined?
It is not a good idea to perform a long blocking operation in IMFSourceReaderCallback::OnReadSample. Nothing is going to be fatal or serious but this is not the intended usage.
Taking into consideration your previous question about audio format conversion, though, audio sample data conversion is fast enough to happen in such a callback.
Also, while it is not clearly documented (it depends on the actual implementation), ProcessInput is often nearly instant and only references the input data; ProcessOutput is what is computationally expensive in this case. If you don't do ProcessOutput right there in the same callback, you might run into a situation where the MFT is no longer accepting input, and you'd have to implement a queue anyway.
With all this in mind, you would just do the processing in the callback (neglecting the performance impact, assuming your processing is not too heavy), or otherwise introduce the queue.
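For illustration only, "just do the processing in the callback" might look roughly like the sketch below. The MyReaderCallback class, its transform and reader members, and DeliverDownstream are assumptions; error, end-of-stream and stream-change handling are elided, and it assumes the MFT allocates its own output samples:

#include <mfapi.h>
#include <mfreadwrite.h>
#include <mftransform.h>
#include <mferror.h>

HRESULT MyReaderCallback::OnReadSample(HRESULT hrStatus, DWORD /*streamIndex*/,
                                       DWORD /*streamFlags*/, LONGLONG /*timestamp*/,
                                       IMFSample* sample)
{
    if (FAILED(hrStatus) || sample == nullptr)
        return hrStatus;

    // Feed the synchronous MFT directly in the callback...
    HRESULT hr = transform->ProcessInput(0, sample, 0);

    // ...and drain it right away so it never refuses further input.
    while (SUCCEEDED(hr))
    {
        MFT_OUTPUT_DATA_BUFFER output = {};
        DWORD status = 0;
        hr = transform->ProcessOutput(0, 1, &output, &status);
        if (hr == MF_E_TRANSFORM_NEED_MORE_INPUT)
            break;                                 // nothing left to drain
        if (SUCCEEDED(hr) && output.pSample)
        {
            DeliverDownstream(output.pSample);     // hypothetical consumer
            output.pSample->Release();
        }
        if (output.pEvents)
            output.pEvents->Release();
    }

    // Request the next sample so the reader keeps decoding.
    reader->ReadSample(MF_SOURCE_READER_FIRST_AUDIO_STREAM, 0,
                       nullptr, nullptr, nullptr, nullptr);
    return S_OK;
}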

Why is Async version slower than single threaded version?

I am reading a large XML file using XmlReader and am exploring potential performance improvements via Async and pipelining. The following initial foray into the world of Async shows that the Async version (which, for all intents and purposes, is at this point equivalent to the synchronous version) is much slower. Why would this be? All I've done is wrap the "normal" code in an async block and call it with Async.RunSynchronously.
Code
open System
open System.IO.Compression // support assembly required + FileSystem
open System.Xml // support assembly required
let readerNormal (reader:XmlReader) =
    let temp = ResizeArray<string>()
    while reader.Read() do
        ()
    temp

let readerAsync1 (reader:XmlReader) =
    async {
        let temp = ResizeArray<string>()
        while reader.Read() do
            ()
        return temp
    }

let readerAsync2 (reader:XmlReader) =
    async {
        while reader.Read() do
            ()
    }

[<EntryPoint>]
let main argv =
    let path = @"C:\Temp\LargeTest1000.xlsx"
    use zipArchive = ZipFile.OpenRead path
    let sheetZipEntry = zipArchive.GetEntry(@"xl/worksheets/sheet1.xml")

    let stopwatch = System.Diagnostics.Stopwatch()
    stopwatch.Start()
    let sheetStream = sheetZipEntry.Open() // again
    use reader = XmlReader.Create(sheetStream)
    let temp1 = readerNormal reader
    stopwatch.Stop()
    printfn "%A" stopwatch.Elapsed

    System.GC.Collect()

    let stopwatch = System.Diagnostics.Stopwatch()
    stopwatch.Start()
    let sheetStream = sheetZipEntry.Open() // again
    use reader = XmlReader.Create(sheetStream)
    let temp1 = readerAsync1 reader |> Async.RunSynchronously
    stopwatch.Stop()
    printfn "%A" stopwatch.Elapsed

    System.GC.Collect()

    let stopwatch = System.Diagnostics.Stopwatch()
    stopwatch.Start()
    let sheetStream = sheetZipEntry.Open() // again
    use reader = XmlReader.Create(sheetStream)
    readerAsync2 reader |> Async.RunSynchronously
    stopwatch.Stop()
    printfn "%A" stopwatch.Elapsed

    printfn "DONE"
    System.Console.ReadLine() |> ignore
    0 // return an integer exit code
INFO
I am aware that the above Async code does not do any actual async work - what I am trying to ascertain here is the overhead of simply making it Async.
I don't expect it to go faster just because I've wrapped it in an Async. My question is the opposite: why the dramatic (IMHO) slowdown.
TIMINGS
A comment below correctly pointed out that I should provide timings for datasets of various sizes, which is implicitly what led me to ask this question in the first place.
The following are some times based on small vs large datasets. While the absolute values are not too meaningful, the relativities are interesting:
30 elements (small dataset)
Normal: 00:00:00.0006994
Async1: 00:00:00.0036529
Async2: 00:00:00.0014863
(A lot slower but presumably indicative of Async setup costs - this is as expected)
1.5 million elements
Normal: 00:00:01.5749734
Async1: 00:00:03.3942754
Async2: 00:00:03.3760785
(~ 2x slower. Surprised that the difference in timing is not amortized as the dataset gets bigger. If this is the case, then pipelining/parallelization can only improve performance here if you have more than two cores - to outweigh the overhead that I can't explain...)
There's no asynchronous work to do. In effect, all you get is the overheads and no benefits. async {} doesn't mean "everything in the braces suddenly becomes asynchronous". It simply means you have a simplified way of using asynchronous code - but you never call a single asynchronous function!
Additionally, "asynchronous" doesn't necessarily mean "parallel", and it doesn't necessarily involve multiple threads. For example, when you do an asynchronous request to read a file (which you're not doing here), it means that the OS is told what you want done, and how you should be notified when it is done. When you run code like this using RunSynchronously, you're simply blocking one thread while posting asynchronous file requests - a scenario pretty much identical to using synchronous file requests in the first place.
The moment you do RunSynchronously, you throw away any reason whatsoever to use asynchronous code in the first place. You're still using a single thread, you just blocked another thread at the same time - instead of saving on threads, you waste one, and add another to do the real work.
EDIT:
Okay, I've investigated with the minimal example, and I've got some observations.
The difference is absolutely brutal with a profiler on - the non-async version is somewhat slower (up to 2x), but the async version is just never ending. It seems as if a huge number of allocations is going on - and yet, when I break the profiler, I can see that the non-async version (running in 4 seconds) makes a hundred thousand allocations (~20 MiB), while the async version (running over 10 minutes) only makes mere thousands. Maybe the memory profiler interacts badly with F# async? The CPU time profiler doesn't have this problem.
The generated IL is very different for the two cases. Most importantly, even though our async code doesn't actually do anything asynchronous, it creates a ton of async builder helpers, sprinkles a ton of (asynchronous) Delay calls through the code, and going into outright absurd territory, each iteration of the loop is an extra method call, including the setup of a helper object.
Apparently, F# automatically translates while into an asynchronous while. Now, given how well compressed xlsx data usually is, very little I/O is involved in those Read operations, so the overhead absolutely dominates - and since every iteration of the "loop" has its own setup cost, the overhead scales with the amount of data.
While this is mostly caused by the while not actually doing anything, it also obviously means that you need to be careful about what you select as async, and you need to avoid using it in a case where CPU time dominates (as in this case - after all, both the async and non-async cases are almost 100% CPU tasks in practice). This is further worsened by the fact that Read reads a single node at a time - something that's relatively trivial even in a big, non-compressed XML file. The overheads absolutely dominate. In effect, this is analogous to using Parallel.For with a body like sum += i - the setup cost of each of the iterations absolutely dwarfs any actual work being done.
The CPU profiling makes this rather obvious - the two most work intensive methods are:
XmlReader.Read (expected)
Thread::intermediateThreadProc - also known as "this code runs on a thread pool thread". The overhead from this in a no-op code like this is around 100% - yikes. Apparently, even though there is no real asynchronicity anywhere, the callbacks are never run synchronously. Every iteration of the loop posts work to a new thread pool thread.
The lesson learned? Probably something like "don't use loops in async if the loop body does very little work". The overhead is incurred for each and every iteration of the loop. Ouch.
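A rough sketch of that lesson applied to the code above (readerAsync3 is a made-up variant): hoist the hot loop into an ordinary let-bound function, whose body is compiled as a normal loop rather than being translated by the async builder, so the While/Delay machinery is paid once per block instead of once per iteration.

let readerAsync3 (reader: XmlReader) =
    async {
        let temp = ResizeArray<string>()
        // A plain inner function: the loop below is ordinary synchronous code,
        // not rewritten into builder.While/Delay calls.
        let readAll () =
            while reader.Read() do
                ()
        readAll ()
        return temp
    }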
Asynchronous code doesn't magically make your code faster. As you've discovered, it'll tend to make isolated code slower, because there's overhead involved with managing the asynchrony.
What it can do is to be more efficient, but that's not the same as being inherently faster. The main purpose of Async is to make Input/Output code more efficient.
If you invoke a 'slow', blocking I/O operation directly, you'll block the thread until the operation returns.
If you instead invoke that slow operation asynchronously, it may free up the thread to do other things. It does require that there's an underlying implementation that's not thread-bound, but uses another mechanism for receiving the response. I/O Completion Ports could be such a mechanism.
Now, if you run a lot of asynchronous code in parallel, it may turn out to be faster than attempting to run the blocking implementation in parallel, because the async versions use fewer resources (fewer threads = less memory).
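For contrast, a small sketch of genuinely asynchronous I/O in F# (the file paths are made up): AsyncRead hands the wait over to the OS and releases the thread, and Async.Parallel then overlaps the waits rather than the CPU work.

open System.IO

let readFileAsync (path: string) =
    async {
        // useAsync = true requests overlapped (non-blocking) I/O from the OS
        use stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                    FileShare.Read, 4096, useAsync = true)
        let buffer = Array.zeroCreate<byte> (int stream.Length)
        // The thread goes back to the pool while the read is in flight
        let! bytesRead = stream.AsyncRead(buffer, 0, buffer.Length)
        return bytesRead
    }

[ @"C:\Temp\a.xml"; @"C:\Temp\b.xml" ]
|> List.map readFileAsync
|> Async.Parallel
|> Async.RunSynchronously
|> printfn "bytes read per file: %A"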

A MailboxProcessor that operates with a LIFO logic

I am learning about F# agents (MailboxProcessor).
I am dealing with a rather unconventional problem.
I have one agent (dataSource) which is a source of streaming data. The data has to be processed by an array of agents (dataProcessor). We can consider dataProcessor as some sort of tracking device.
Data may flow in faster than the speed with which the dataProcessor may be able to process its input.
It is OK to have some delay. However, I have to ensure that the agent stays on top of its work and does not get buried under obsolete observations.
I am exploring ways to deal with this problem.
The first idea is to implement a stack (LIFO) in dataSource. dataSource would send over the latest observation available when dataProcessor becomes available to receive and process the data. This solution may work, but it may get complicated, as dataProcessor may need to be blocked and re-activated and communicate its status to dataSource, leading to a two-way communication problem. This may boil down to a blocking queue in the producer-consumer problem, but I am not sure.
The second idea is to have dataProcessor take care of message sorting. In this architecture, dataSource will simply post updates to dataProcessor's queue. dataProcessor will use Scan to fetch the latest data available in its queue. This may be the way to go. However, I am not sure whether, in the current design of MailboxProcessor, it is possible to clear a queue of messages, deleting the older, obsolete ones. Furthermore, here, it is written that:
Unfortunately, the TryScan function in the current version of F# is
broken in two ways. Firstly, the whole point is to specify a timeout
but the implementation does not actually honor it. Specifically,
irrelevant messages reset the timer. Secondly, as with the other Scan
function, the message queue is examined under a lock that prevents any
other threads from posting for the duration of the scan, which can be
an arbitrarily long time. Consequently, the TryScan function itself
tends to lock-up concurrent systems and can even introduce deadlocks
because the caller's code is evaluated inside the lock (e.g. posting
from the function argument to Scan or TryScan can deadlock the agent
when the code under the lock blocks waiting to acquire the lock it is
already under).
Having the latest observation bounced back may be a problem.
The author of this post, Jon Harrop, suggests that
I managed to architect around it and the resulting architecture was actually better. In essence, I eagerly Receive all messages and filter using my own local queue.
This idea is surely worth exploring but, before starting to play around with code, I would welcome some inputs on how I could structure my solution.
Thank you.
Sounds like you might need a destructive scan version of the mailbox processor. I implemented this with TPL Dataflow in a blog series that you might be interested in.
My blog is currently down for maintenance but I can point you to the posts in markdown format.
Part1
Part2
Part3
You can also check out the code on github
I also wrote about the issues with scan in my lurking horror post
Hope that helps...
tl;dr I would try this: take the Mailbox implementation from FSharp.Actor or Zach Bray's blog post, replace ConcurrentQueue with ConcurrentStack (plus add some bounded-capacity logic) and use this changed agent as a dispatcher to pass messages from dataSource to an army of dataProcessors implemented as ordinary MBPs or Actors.
tl;dr2 If workers are a scarce and slow resource and we need to process the message that is the latest at the moment a worker becomes ready, then it all boils down to an agent with a stack instead of a queue (with some bounded-capacity logic) plus a BlockingQueue of workers. The dispatcher dequeues a ready worker, then pops a message from the stack and sends this message to the worker. After the job is done, the worker enqueues itself to the queue when it becomes ready (e.g. before let! msg = inbox.Receive()). The dispatcher's consumer thread then blocks until any worker is ready, while the producer thread keeps the bounded stack updated. (A bounded stack could be done with an array + offset + size inside a lock; the one described below is more complex than necessary.)
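A rough sketch of the tl;dr2 wiring, assuming .NET's BlockingCollection and a made-up Observation message type (capacities, the worker count and the processing body are placeholders):

open System.Collections.Concurrent

type Msg = Observation of float

// LIFO message store with bounded capacity (TryAdd returns false when full).
let messages =
    new BlockingCollection<Msg>(ConcurrentStack<Msg>(), 1000)

// FIFO queue of workers that are ready to receive.
let readyWorkers =
    new BlockingCollection<MailboxProcessor<Msg>>(ConcurrentQueue<MailboxProcessor<Msg>>())

let makeWorker id =
    MailboxProcessor<Msg>.Start(fun inbox ->
        let rec loop () =
            async {
                readyWorkers.Add inbox                  // announce readiness first
                let! (Observation x) = inbox.Receive()  // then wait for a dispatched message
                printfn "worker %d processed %f" id x   // stand-in for the real processing
                return! loop ()
            }
        loop ())

let workers = List.init 4 makeWorker

// Dispatcher: one dedicated consumer thread pairing the newest observation
// with the next ready worker.
async {
    while true do
        let worker = readyWorkers.Take()    // block until a worker is ready
        let msg = messages.Take()           // newest observation (stack order)
        worker.Post msg
}
|> Async.Start

// dataSource side: this is where the bounded-capacity logic would go
// (here, overflow is simply dropped).
let publish obs = messages.TryAdd(Observation obs) |> ignore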
Details
MailBoxProcessor is designed to have only one consumer. This is even commented in the source code of MBP here (search for the word 'DRAGONS' :) )
If you post your data to an MBP then only one thread can take it from the internal queue or stack.
In your particular use case I would use ConcurrentStack directly or, better, wrapped in a BlockingCollection:
It will allow many concurrent consumers
It is very fast and thread safe
BlockingCollection has a BoundedCapacity property that allows you to limit the size of the collection. When the capacity is reached Add blocks, but you could use TryAdd, which returns false instead (see the short sketch after the diagram below). If A is the main stack and B is a standby, then TryAdd to A; on false, Add to B and swap the two with Interlocked.Exchange, then process the needed messages in A, clear it, and make a new standby - or use three stacks if processing A could take longer than B takes to become full again. This way you do not block and do not lose any messages, but can discard unneeded ones in a controlled way.
BlockingCollection has methods like AddToAny/TakeFromAny, which work on arrays of BlockingCollections. This could help, e.g.:
dataSource produces messages to a BlockingCollection with ConcurrentStack implementation (BCCS)
another thread consumes messages from BCCS and sends them to an array of processing BCCSs. You said that there is a lot of data. You may sacrifice one thread to be blocking and dispatching your messages indefinitely
each processing agent has its own BCCS, or is implemented as an Agent/Actor/MBP to which the dispatcher posts messages. In your case you need to send a message to only one processorAgent, so you may store the processing agents in a circular buffer to always dispatch a message to the least recently used processor.
Something like this:
(data stream produces 'T)
            |
   [dispatcher's BCCS]
            |
(a dispatcher thread consumes 'T and pushes to processors,
 manages capacity of the BCCS and the LRU queue)
      |                                    |
[processor1's BCCS/Actor/MBP]  ...  [processorN's BCCS/Actor/MBP]
      |                                    |
  (process)                            (process)
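For the dispatcher's BCCS itself, the construction and the overflow behaviour look like this (capacity and element type are illustrative only):

open System.Collections.Concurrent

// Bounded, LIFO-ordered blocking collection.
let bccs = new BlockingCollection<int>(ConcurrentStack<int>(), 2)

bccs.TryAdd 1 |> printfn "%b"   // true
bccs.TryAdd 2 |> printfn "%b"   // true
bccs.TryAdd 3 |> printfn "%b"   // false - capacity reached, nothing blocks
bccs.Take()   |> printfn "%d"   // 2 - the newest accepted item comes out first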
Instead of ConcurrentStack, you may want to read about the heap data structure. If you need your latest messages by some property of the messages, e.g. a timestamp, rather than by the order in which they arrive on the stack (e.g. if there could be delays in transit so that arrival order <> creation order), you can get the latest message by using a heap.
If you still need Agent semantics/APIs, you could read several sources in addition to Dave's links, and somehow adapt the implementation to multiple concurrent consumers:
An interesting article by Zach Bray on an efficient Actors implementation. There you do need to replace (under the comment // Might want to schedule this call on another thread.) the line execute true with a line like async { execute true } |> Async.Start, because otherwise the producing thread will be the consuming thread - not good for a single fast producer. However, for a dispatcher like the one described above this is exactly what is needed.
The FSharp.Actor (aka Fakka) development branch and the F# MBP source code (first link above) could be very useful for implementation details. The FSharp.Actor library has been in a freeze for several months but there is some activity in the dev branch.
You should also not miss the discussion about Fakka in Google Groups in this context.
I have a somewhat similar use case and for the last two days I have researched everything I could find on the F# Agents/Actors. This answer is a kind of TODO for myself to try these ideas, of which half were born during writing it.
The simplest solution is to greedily eat all messages in the inbox when one arrives and discard all but the most recent. Easily done using TryReceive:
let rec readLatestLoop oldMsg =
  async { let! newMsg = inbox.TryReceive 0
          match newMsg with
          | None -> return oldMsg
          | Some newMsg -> return! readLatestLoop newMsg }

let readLatest() =
  async { let! msg = inbox.Receive()
          return! readLatestLoop msg }
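Used inside the agent's body, the helper might be wired up like this (process' is a hypothetical handler; the two functions above are repeated here because they close over the agent's inbox):

let dataProcessor process' =
    MailboxProcessor.Start(fun inbox ->
        let rec readLatestLoop oldMsg =
            async { let! newMsg = inbox.TryReceive 0
                    match newMsg with
                    | None -> return oldMsg
                    | Some newMsg -> return! readLatestLoop newMsg }
        let readLatest () =
            async { let! msg = inbox.Receive()
                    return! readLatestLoop msg }
        // Main loop: always work on the newest message available.
        let rec loop () =
            async { let! latest = readLatest ()
                    process' latest
                    return! loop () }
        loop ())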
When faced with the same problem I architected a more sophisticated and efficient solution I called cancellable streaming, described in an F# Journal article here. The idea is to start processing messages and then cancel that processing if they are superseded. This significantly improves concurrency if significant processing is being done.

clEnqueueNDRange blocking on Nvidia hardware? (Also Multi-GPU)

On Nvidia GPUs, when I call clEnqueueNDRangeKernel, the program waits for it to finish before continuing. More precisely, I'm calling its equivalent C++ binding, cl::CommandQueue::enqueueNDRangeKernel, but this shouldn't make a difference. This only happens on Nvidia hardware (3 Tesla M2090s) accessed remotely; on our office workstations with AMD GPUs, the call is non-blocking and returns immediately. I don't have local Nvidia hardware to test on - we used to, and I remember similar behavior then, too, but it's a bit hazy.
This makes spreading the work across multiple GPUs harder. I've tried starting a new thread for each call to enqueueNDRangeKernel using std::async/std::future from the new C++11 spec, but that doesn't seem to work either - monitoring the GPU usage in nvidia-smi, I can see that the memory usage on GPU 0 goes up, then it does some work, then the memory on GPU 0 goes down and the memory on GPU 1 goes up, that one does some work, and so on. My gcc version is 4.7.0.
Here's how I'm starting the kernels, where increment is the desired global work size divided by the number of devices, rounded up to the nearest multiple of the desired local work size:
std::vector<cl::CommandQueue> queues;
/* Population of queues happens somewhere */
cl::NDRange offset, increment, local;
std::vector<std::future<cl_int>> enqueueReturns;
int numDevices = queues.size();
/* Calculation of increment (local is gotten from the function parameters) */

//Distribute the job among each of the devices in the context
for(int i = 0; i < numDevices; i++)
{
    //Update the offset for the current device
    offset = cl::NDRange(i*increment[0], i*increment[1], i*increment[2]);

    //Start a new thread for each call to enqueueNDRangeKernel
    enqueueReturns.push_back(std::async(
        std::launch::async,
        &cl::CommandQueue::enqueueNDRangeKernel,
        &queues[i],
        kernels[kernel],
        offset,
        increment,
        local,
        (const std::vector<cl::Event>*)NULL,
        (cl::Event*)NULL));
    //Without those last two casts, the program won't even compile
}

//Wait for all threads to join before returning
for(int i = 0; i < numDevices; i++)
{
    execError = enqueueReturns[i].get();
    if(execError != CL_SUCCESS)
        std::cerr << "Informative error omitted due to length" << std::endl;
}
The kernels definitely should be running on the call to std::async, since I can create a little dummy function, set a breakpoint on it in GDB and have it step into it the moment std::async is called. However, if I make a wrapper function for enqueueNDRangeKernel, run it there, and put in a print statement after the run, I can see that it takes some time between prints.
P.S. The Nvidia dev zone is down due to hackers and such, so I haven't been able to post the question there.
EDIT: Forgot to mention - the buffer that I'm passing to the kernel as an argument (the one I mention above that seems to get passed between the GPUs) is declared as using CL_MEM_COPY_HOST_PTR. I had been using CL_MEM_READ_WRITE, with the same effect.
I emailed the Nvidia guys and actually got a pretty fair response. There's a sample in the Nvidia SDK that shows that, for each device, you need to create separate:
queues - So you can represent each device and enqueue orders to it
buffers - One buffer for each array you need to pass to the device, otherwise the devices will pass around a single buffer, waiting for it to become available and effectively serializing everything.
kernel - I think this one's optional, but it makes specifying arguments a lot easier.
Furthermore, you have to call EnqueueNDRangeKernel for each queue in separate threads. That's not in the SDK sample, but the Nvidia guy confirmed that the calls are blocking.
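Put together, the per-device setup described above might look roughly like this (context, devices, program, the ranges and sizes are assumed to exist already; error handling is omitted, and the per-device offset adjustment from the question's code is left out for brevity):

// One queue, one buffer and one kernel object per device, created up front,
// then one std::async thread per enqueue. Names are illustrative only.
std::vector<cl::CommandQueue> queues;
std::vector<cl::Buffer>       buffers;
std::vector<cl::Kernel>       kernelCopies;

for (std::size_t i = 0; i < devices.size(); ++i)
{
    queues.emplace_back(context, devices[i]);
    buffers.emplace_back(context, CL_MEM_READ_WRITE, bufferBytes);
    kernelCopies.emplace_back(program, "my_kernel");
    kernelCopies[i].setArg(0, buffers[i]);   // each device gets its own buffer
}

std::vector<std::future<cl_int>> results;
for (std::size_t i = 0; i < queues.size(); ++i)
{
    results.push_back(std::async(std::launch::async, [&, i] {
        return queues[i].enqueueNDRangeKernel(kernelCopies[i], offset,
                                              globalPerDevice, local);
    }));
}

for (auto& r : results)
    if (r.get() != CL_SUCCESS) { /* handle the error */ }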
After doing all this, I achieved concurrency on multiple GPUs. However, there's still a bit of a problem. On to the next question...
Yes, you're right. AFAIK - the nvidia implementation has a synchronous "clEnqueueNDRange". I have noticed this when using my library (Brahma) as well. I don't know if there is a workaround or a way of preventing this, save using a different implementation (and hence device).

Main loop in event-driven programming and alternatives

To the best of my knowledge, event-driven programs require a main loop such as
while (1) {
}
I am just curious whether this while loop can cause high CPU usage. Is there any other way to implement event-driven programs without using a main loop?
Your example is misleading. Usually, an event loop looks something like this:
Event e;
while ((e = get_next_event()) != E_QUIT)
{
    handle(e);
}
The crucial point is that the function call to our fictitious get_next_event() pumping function will be generous and encourage a context switch or whatever scheduling semantics apply to your platform, and if there are no events, the function would probably allow the entire process to sleep until an event arrives.
So in practice there's nothing to worry about, and no, there's not really any alternative to an unbounded loop if you want to process an unbounded amount of information during your program's runtime.
Usually, the problem with a loop like this is that while it's doing one piece of work, it can't be doing anything else (e.g. Windows SDK's old 'cooperative' multitasking). The next naive jump up from this is generally to spawn a thread for each piece of work, but that's incredibly dangerous. Most people would end up with an executor that generally has a thread pool inside. Then, the handle call is actually just enqueueing the work and the next available thread dequeues it and executes it. The number of concurrent threads remains fixed as the total number of worker threads in the pool and when threads don't have anything to do, they are not eating CPU.
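A minimal C++11 sketch of such a pumping function (names are made up): the condition variable parks the consumer thread while the queue is empty, so an idle event loop costs essentially no CPU, and posting an event wakes it up again.

#include <condition_variable>
#include <mutex>
#include <queue>

enum Event { E_WORK, E_QUIT };

class EventQueue {
    std::queue<Event> events_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void post(Event e) {
        { std::lock_guard<std::mutex> lock(m_); events_.push(e); }
        cv_.notify_one();
    }
    Event get_next_event() {               // blocks (sleeps) while the queue is empty
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !events_.empty(); });
        Event e = events_.front();
        events_.pop();
        return e;
    }
};

void run(EventQueue& q) {
    Event e;
    while ((e = q.get_next_event()) != E_QUIT) {
        // handle(e);  // dispatch to the appropriate handler
    }
}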
