Concurrent write with OCaml Async

I'm reading data from the network and I'd like to write it to a file as it arrives. The writes are concurrent and non-sequential (think P2P file sharing). In C, I would get a file descriptor to the file (for the duration of the program), then use lseek followed by write, and eventually close the fd. These operations could be protected by a mutex in a multithreaded setting (in particular, the lseek/write pair should be atomic).
I don't really see how to get that behavior in Async. My initial idea is to have something like this.
let write fd s pos =
  let posl = Int64.of_int pos in
  Async_unix.Unix_syscalls.lseek fd ~mode:`Set posl
  >>| fun _ ->
  let wr = Writer.create fd in
  let len = String.length s in
  Writer.write wr s ~pos:0 ~len
Then, the writes are scheduled asynchronously when data is received.
My solution isn't correct. For one thing, this write task needs to be atomic, but it is not: two lseeks can be executed before the first Writer.write. And even if I could schedule the writes sequentially, it wouldn't help, since Writer.write doesn't return a Deferred.t. Any idea?
BTW, this is a follow-up to a previous answered question.

The basic approach would be to have a queue of workers, where each worker performs an atomic seek/write¹ operation. The invariant is that only one worker at a time is running. A more complicated strategy would employ a priority queue, where writes are ordered by some criterion that maximizes throughput, e.g., writes to subsequent positions. You may also implement a sophisticated buffering strategy: if you observe lots of small writes, a good idea would be to coalesce them into larger chunks.
But let's start with a simple non-prioritized queue, implemented via Async.Pipe.t. For the positional write, we can't use the Writer interface, as it is designed for buffered sequential writes. So we will use Unix.lseek from Async_unix.Std and the Bigstring.really_write function. really_write is a regular non-asynchronous function, so we need to lift it into the Async interface using the Fd.syscall_in_thread function, e.g.,
let really_pwrite fd offset bytes =
  Unix.lseek fd offset ~mode:`Set >>= fun (_ : int64) ->
  Fd.syscall_in_thread fd (fun desc ->
      Bigstring.really_write desc bytes)
Note: this function will write as many bytes as the system decides, but no more than the length of bytes. So you might be interested in implementing a really_pwrite function that writes all the bytes.
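A minimal sketch of such a write-all loop, assuming Fd.syscall_in_thread yields a `Ok / `Already_closed / `Error result and that Bigstring.write returns the number of bytes actually written (check the exact signatures against your Core/Async version):

open Core
open Async

(* Keep seeking and writing until every byte is flushed;
   a sketch, not a drop-in implementation. *)
let really_pwrite fd offset bytes =
  let len = Bigstring.length bytes in
  let rec go pos =
    if pos >= len then return (Ok ())
    else
      Unix.lseek fd (Int64.of_int (offset + pos)) ~mode:`Set
      >>= fun (_ : int64) ->
      Fd.syscall_in_thread fd ~name:"pwrite" (fun desc ->
          (* a partial write is fine: we retry with the rest *)
          Bigstring.write desc ~pos ~len:(len - pos) bytes)
      >>= function
      | `Ok written -> go (pos + written)
      | `Already_closed -> return (Error (Error.of_string "fd is closed"))
      | `Error exn -> return (Error (Error.of_exn exn))
  in
  go 0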
The overall scheme would include one master thread, which owns the file descriptor and accepts write requests from multiple clients via an Async.Pipe. Suppose that each write request is a message of the following type:
type chunk = {
  offset : int;
  bytes : Bigstring.t;
}
Then your master thread will look like this:
let process_requests fd requests =
  Async.Pipe.iter requests ~f:(fun {offset; bytes} ->
      really_pwrite fd offset bytes)
where really_pwrite is a function that really writes all the bytes and handles all the errors. You may also use the Async.Pipe.iter' function to presort and coalesce the writes before actually executing the pwrite syscall.
One more optimization note: allocating a bigstring is a rather expensive operation, so you may consider preallocating one big bigstring and serving small chunks from it. This makes the chunks a limited resource, so clients will wait until other clients finish their writes and release their chunks. As a result, you will have a throttled system with a limited memory footprint. A sketch of the wiring follows.
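A minimal sketch of how the pieces might be wired together (serve and its shape are illustrative, not part of the answer above): the master is the only code that touches fd, and clients merely push chunks into the pipe, with Pipe.write providing natural pushback.

open Core
open Async

let serve fd =
  let reader, writer = Pipe.create () in
  (* the master: sole owner of [fd] *)
  don't_wait_for (process_requests fd reader);
  (* a client queues a positional write without touching [fd];
     the deferred is determined once the pipe accepts the chunk *)
  fun ~offset ~bytes -> Pipe.write writer { offset; bytes }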
¹ Ideally we should use pwrite, though Jane Street provides only the pwrite_assume_fd_is_nonblocking function, which doesn't release the OCaml runtime when the call to the system pwrite is made, and will actually block the whole system. So we need to use a combination of a seek and a write. The latter releases the OCaml runtime so that the rest of the program can continue. (Also, given their definition of a non-blocking fd, this function doesn't really make much sense, as only sockets and FIFOs are considered non-blocking, and as far as I know they do not support the seek operation. I will file an issue on their bug tracker.)

Related

In Trio, how do you write data to a socket without waiting?

In Trio, if you want to write some data to a TCP socket then the obvious choice is send_all:
my_stream = await trio.open_tcp_stream("localhost", 1234)
await my_stream.send_all(b"some data")
Note that this both sends that data over the socket and waits for it to be written. But what if you just want to queue up the data to be sent, but not wait for it to be written (at least, you don't want to wait in the same coroutine)?
In asyncio this is straightforward because the two parts are separate functions: write() and drain(). For example:
writer.write(data)
await writer.drain()
So of course, if you just want to write the data and not wait for it, you can just call write() without awaiting drain(). Is there equivalent functionality in Trio? I know this two-function design is controversial, because it makes it hard to apply backpressure properly, but in my application I need them separated.
For now I've worked around it by creating a dedicated writer coroutine for each connection and having a memory channel to send data to that coroutine, but it's quite a lot of faff compared to choosing between calling one function or two, and it seems a bit wasteful (presumably there's still a send buffer under the hood, and my memory channel is like a buffer on top of that buffer).
I posted this on the Trio chat and Nathaniel J. Smith, the creator of Trio, replied with this:
Trio doesn't maintain a buffer "under the hood", no. There's just the kernel's send buffer, but the kernel will apply backpressure whether you want it to or not, so that doesn't help you.
Using a background writer task + an unbounded memory channel is basically what asyncio does for you implicitly.
The other option, if you're putting together a message in multiple pieces and then want to send it when you're done would be to append them into a bytearray and then call send_all once at the end, at the same place where you'd call drain in asyncio
(but obviously that only works if you're calling drain after every logical message; if you're just calling write and letting asyncio drain it in the background then that doesn't help)
So the question was based on a misconception: I wanted to write into Trio's hidden send buffer, but no such buffer exists! Using a separate coroutine that waits on a stream and calls send_all() makes more sense than I had thought.
I ended up using a hybrid of the two ideas (a separate coroutine with a memory channel vs. a bytearray): save the data to a bytearray, then use a ParkingLot (a low-level condition-variable-like primitive) to signal to the other coroutine that it's ready to be written. That lets me coalesce writes, and also manually check whether the buffer is getting too large.
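A minimal sketch of that hybrid, using Trio's trio.lowlevel.ParkingLot (the BufferedSender class and its method names are otherwise illustrative):

import trio

class BufferedSender:
    def __init__(self, stream):
        self._stream = stream
        self._buffer = bytearray()
        self._lot = trio.lowlevel.ParkingLot()

    def write(self, data):
        # Synchronous: queue the bytes and wake the background sender.
        self._buffer += data
        self._lot.unpark()

    async def send_loop(self):
        # Run this in the connection's nursery.
        while True:
            if not self._buffer:
                await self._lot.park()
            # Coalesce everything queued so far into one send_all call.
            chunk, self._buffer = self._buffer, bytearray()
            await self._stream.send_all(chunk)

async def main():
    stream = await trio.open_tcp_stream("localhost", 1234)
    sender = BufferedSender(stream)
    async with trio.open_nursery() as nursery:
        nursery.start_soon(sender.send_loop)
        sender.write(b"some data")  # returns immediately
        sender.write(b"more data")  # coalesced with the previous write
        await trio.sleep(1)         # let the sender flush
        nursery.cancel_scope.cancel()

trio.run(main)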

Why is Async version slower than single threaded version?

I am reading a large XML file using XmlReader and am exploring potential performance improvements via Async and pipelining. The following initial foray into the world of Async shows that the Async version (which, for all intents and purposes, is at this point equivalent to the synchronous version) is much slower. Why would this be? All I've done is wrap the "normal" code in an async block and call it with Async.RunSynchronously.
Code
open System
open System.IO.Compression // support assembly required + FileSystem
open System.Xml // support assembly required

let readerNormal (reader: XmlReader) =
    let temp = ResizeArray<string>()
    while reader.Read() do
        ()
    temp

let readerAsync1 (reader: XmlReader) =
    async {
        let temp = ResizeArray<string>()
        while reader.Read() do
            ()
        return temp
    }

let readerAsync2 (reader: XmlReader) =
    async {
        while reader.Read() do
            ()
    }
[<EntryPoint>]
let main argv =
    let path = @"C:\Temp\LargeTest1000.xlsx"
    use zipArchive = ZipFile.OpenRead path
    let sheetZipEntry = zipArchive.GetEntry(@"xl/worksheets/sheet1.xml")

    let stopwatch = System.Diagnostics.Stopwatch()
    stopwatch.Start()
    let sheetStream = sheetZipEntry.Open() // again
    use reader = XmlReader.Create(sheetStream)
    let temp1 = readerNormal reader
    stopwatch.Stop()
    printfn "%A" stopwatch.Elapsed

    System.GC.Collect()
    let stopwatch = System.Diagnostics.Stopwatch()
    stopwatch.Start()
    let sheetStream = sheetZipEntry.Open() // again
    use reader = XmlReader.Create(sheetStream)
    let temp1 = readerAsync1 reader |> Async.RunSynchronously
    stopwatch.Stop()
    printfn "%A" stopwatch.Elapsed

    System.GC.Collect()
    let stopwatch = System.Diagnostics.Stopwatch()
    stopwatch.Start()
    let sheetStream = sheetZipEntry.Open() // again
    use reader = XmlReader.Create(sheetStream)
    readerAsync2 reader |> Async.RunSynchronously
    stopwatch.Stop()
    printfn "%A" stopwatch.Elapsed

    printfn "DONE"
    System.Console.ReadLine() |> ignore
    0 // return an integer exit code
INFO
I am aware that the above Async code does not do any actual async work - what I am trying to ascertain here is the overhead of simply making it Async.
I don't expect it to go faster just because I've wrapped it in an Async. My question is the opposite: why the dramatic (IMHO) slowdown?
TIMINGS
A comment below correctly pointed out that I should provide timings for datasets of various sizes, which is implicitly what had led me to ask this question in the first instance.
The following are some timings based on small vs. large datasets. While the absolute values are not too meaningful, the relative differences are interesting:
30 elements (small dataset)
Normal: 00:00:00.0006994
Async1: 00:00:00.0036529
Async2: 00:00:00.0014863
(A lot slower but presumably indicative of Async setup costs - this is as expected)
1.5 million elements
Normal: 00:00:01.5749734
Async1: 00:00:03.3942754
Async2: 00:00:03.3760785
(~ 2x slower. Surprised that the difference in timing is not amortized as the dataset gets bigger. If this is the case, then pipelining/parallelization can only improve performance here if you have more than two cores - to outweigh the overhead that I can't explain...)
There's no asynchronous work to do. In effect, all you get is the overheads and no benefits. async {} doesn't mean "everything in the braces suddenly becomes asynchronous". It simply means you have a simplified way of using asynchronous code - but you never call a single asynchronous function!
Additionally, "asynchronous" doesn't necessarily mean "parallel", and it doesn't necessarily involve multiple threads. For example, when you make an asynchronous request to read a file (which you're not doing here), it means that the OS is told what you want done and how you should be notified when it is done. When you run code like this using RunSynchronously, you're simply blocking one thread while posting asynchronous file requests - a scenario pretty much identical to using synchronous file requests in the first place.
The moment you do RunSynchronously, you throw away any reason whatsoever to use asynchronous code in the first place. You're still using a single thread, you just blocked another thread at the same time - instead of saving on threads, you waste one, and add another to do the real work.
EDIT:
Okay, I've investigated with the minimal example, and I've got some observations.
The difference is absolutely brutal with a profiler on - the non-async version is somewhat slower (up to 2x), but the async version is just never ending. It seems as if a huge number of allocations is going on - and yet, when I break the profiler, I can see that the non-async version (running in 4 seconds) makes a hundred thousand allocations (~20 MiB), while the async version (running over 10 minutes) only makes mere thousands. Maybe the memory profiler interacts badly with F# async? The CPU time profiler doesn't have this problem.
The generated IL is very different for the two cases. Most importantly, even though our async code doesn't actually do anything asynchronous, it creates a ton of async builder helpers, sprinkles a ton of (asynchronous) Delay calls through the code, and going into outright absurd territory, each iteration of the loop is an extra method call, including the setup of a helper object.
Apparently, F# automatically translates while into an asynchronous while. Now, given how well compressed xlsx data usually is, very little I/O is involved in those Read operations, so the overhead absolutely dominates - and since every iteration of the "loop" has its own setup cost, the overhead scales with the amount of data.
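To make the "ton of Delay calls" concrete, this is roughly the shape the builder desugars the loop into (a sketch of the translation, not the exact compiled form):

// async { while reader.Read() do () } becomes, approximately:
let readerAsyncDesugared (reader: XmlReader) =
    async.Delay(fun () ->
        async.While(
            (fun () -> reader.Read()),             // guard, re-evaluated every iteration
            async.Delay(fun () -> async.Zero())))  // empty body as an Async<unit>

Every iteration allocates closures and goes through the builder's machinery, which is the per-iteration setup cost mentioned above.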
While this is mostly caused by the while not actually doing anything, it also obviously means that you need to be careful about what you select as async, and you need to avoid using it in a case where CPU time dominates (as in this case - after all, both the async and non-async cases are almost 100% CPU tasks in practice). This is further worsened by the fact that Read reads a single node at a time - something that's relatively trivial even in a big, non-compressed XML file. The overheads absolutely dominate. In effect, this is analogous to using Parallel.For with a body like sum += i - the setup cost of each of the iterations absolutely dwarfs any actual work being done.
The CPU profiling makes this rather obvious - the two most work intensive methods are:
XmlReader.Read (expected)
Thread::intermediateThreadProc - also known as "this code runs on a thread pool thread". The overhead from this in no-op code like this is around 100% - yikes. Apparently, even though there is no real asynchronicity anywhere, the callbacks are never run synchronously; every iteration of the loop posts work to a new thread pool thread.
The lesson learned? Probably something like "don't use loops in async if the loop body does very little work". The overhead is incurred for each and every iteration of the loop. Ouch.
Asynchronous code doesn't magically make your code faster. As you've discovered, it'll tend to make isolated code slower, because there's overhead involved with managing the asynchrony.
What it can do is to be more efficient, but that's not the same as being inherently faster. The main purpose of Async is to make Input/Output code more efficient.
If you invoke a 'slow', blocking I/O operation directly, you'll block the thread until the operation returns.
If you instead invoke that slow operation asynchronously, it may free up the thread to do other things. It does require that there's an underlying implementation that's not thread-bound, but uses another mechanism for receiving the response. I/O Completion Ports could be such a mechanism.
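For example, a sketch of a genuinely asynchronous read using the AsyncRead extension that FSharp.Core provides on Stream; here the thread really is freed while the OS completes the I/O:

open System.IO

let readAllAsync (path: string) =
    async {
        use stream = File.OpenRead path
        let buffer = Array.zeroCreate<byte> (int stream.Length)
        // Non-blocking: the thread is released until the OS signals completion.
        let! _bytesRead = stream.AsyncRead(buffer, 0, buffer.Length)
        return buffer
    }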
Now, if you run a lot of asynchronous code in parallel, it may turn out to be faster than attempting to run the blocking implementation in parallel, because the async versions use fewer resources (fewer threads = less memory).

A MailboxProcessor that operates with a LIFO logic

I am learning about F# agents (MailboxProcessor).
I am dealing with a rather unconventional problem.
I have one agent (dataSource) which is a source of streaming data. The data has to be processed by an array of agents (dataProcessor). We can consider dataProcessor as some sort of tracking device.
Data may flow in faster than the speed with which the dataProcessor may be able to process its input.
It is OK to have some delay. However, I have to ensure that the agent stays on top of its work and does not get buried under obsolete observations.
I am exploring ways to deal with this problem.
The first idea is to implement a stack (LIFO) in dataSource: dataSource would send over the latest observation available when dataProcessor becomes available to receive and process the data. This solution may work, but it may get complicated, as dataProcessor may need to be blocked and re-activated and to communicate its status to dataSource, leading to a two-way communication problem. This problem may boil down to a blocking queue in the producer-consumer problem, but I am not sure...
The second idea is to have dataProcessor take care of message sorting. In this architecture, dataSource would simply post updates to dataProcessor's queue, and dataProcessor would use Scan to fetch the latest data available in its queue. This may be the way to go. However, I am not sure whether, in the current design of MailboxProcessor, it is possible to clear a queue of messages, deleting the older, obsolete ones. Furthermore, here, it is written that:
Unfortunately, the TryScan function in the current version of F# is broken in two ways. Firstly, the whole point is to specify a timeout but the implementation does not actually honor it. Specifically, irrelevant messages reset the timer. Secondly, as with the other Scan function, the message queue is examined under a lock that prevents any other threads from posting for the duration of the scan, which can be an arbitrarily long time. Consequently, the TryScan function itself tends to lock-up concurrent systems and can even introduce deadlocks because the caller's code is evaluated inside the lock (e.g. posting from the function argument to Scan or TryScan can deadlock the agent when the code under the lock blocks waiting to acquire the lock it is already under).
Having the latest observation bounced back may be a problem.
The author of this post, @Jon Harrop, suggests that
I managed to architect around it and the resulting architecture was actually better. In essence, I eagerly Receive all messages and filter using my own local queue.
This idea is surely worth exploring but, before starting to play around with code, I would welcome some inputs on how I could structure my solution.
Thank you.
Sounds like you might need a destructive-scan version of the mailbox processor. I implemented this with TPL Dataflow in a blog series that you might be interested in.
My blog is currently down for maintenance but I can point you to the posts in markdown format.
Part1
Part2
Part3
You can also check out the code on github
I also wrote about the issues with scan in my lurking horror post
Hope that helps...
tl;dr I would try this: take the Mailbox implementation from FSharp.Actor or Zach Bray's blog post, replace the ConcurrentQueue with a ConcurrentStack (plus add some bounded-capacity logic), and use this changed agent as a dispatcher to pass messages from dataSource to an army of dataProcessors implemented as ordinary MBPs or Actors.
tl;dr2 If workers are a scarce and slow resource and we need to process the message that is the latest at the moment a worker becomes ready, then it all boils down to an agent with a stack instead of a queue (with some bounded-capacity logic) plus a BlockingQueue of workers. The dispatcher dequeues a ready worker, then pops a message from the stack and sends this message to the worker. After the job is done, the worker enqueues itself back to the queue when it becomes ready (e.g. before let! msg = inbox.Receive()). The dispatcher's consumer thread then blocks until any worker is ready, while the producer thread keeps the bounded stack updated. (A bounded stack could be done with an array + offset + size inside a lock; the one below is an overly complex alternative.)
Details
MailboxProcessor is designed to have only one consumer. This is even commented in the source code of the MBP here (search for the word 'DRAGONS' :) ).
If you post your data to an MBP, then only one thread can take it from the internal queue or stack.
In your particular use case, I would use a ConcurrentStack directly, or better, wrapped into a BlockingCollection:
It will allow many concurrent consumers
It is very fast and thread safe
BlockingCollection has a BoundedCapacity property that allows you to limit the size of the collection. It throws on Add when full, but you can catch the exception or use TryAdd. If A is the main stack and B is a standby, then TryAdd to A; on false, Add to B and swap the two with Interlocked.Exchange, then process the needed messages in A, clear it, and make it the new standby (or use three stacks if processing A could take longer than B takes to fill). This way you do not block and do not lose any messages, but you can discard unneeded ones in a controlled way.
BlockingCollection has methods like AddToAny/TakeFromAny, which work on arrays of BlockingCollections. This could help, e.g. (a sketch follows the diagram below):
dataSource produces messages to a BlockingCollection with ConcurrentStack implementation (BCCS)
another thread consumes messages from the BCCS and sends them to an array of processing BCCSs. You said that there is a lot of data, so you may sacrifice one thread to block and dispatch your messages indefinitely
each processing agent has its own BCCS, or is implemented as an Agent/Actor/MBP to which the dispatcher posts messages. In your case you need to send a message to only one processorAgent, so you may store processing agents in a circular buffer to always dispatch a message to the least recently used processor.
Something like this:
(data stream produces 'T)
            |
   [dispatcher's BCCS]
            |
(a dispatcher thread consumes 'T and pushes to processors,
 manages capacity of BCCS and LRU queue)
      |                                   |
[processor1's BCCS/Actor/MBP] ... [processorN's BCCS/Actor/MBP]
      |                                   |
  (process)                           (process)
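A minimal F# sketch of that wiring, using the stock System.Collections.Concurrent types (the Observation type, the capacities, and the processor count are illustrative):

open System.Collections.Concurrent
open System.Threading

type Observation = { Timestamp: System.DateTime; Value: float }

// Bounded LIFO hand-off: the ConcurrentStack backing store makes
// Take return the most recently added message.
let dispatcherStack =
    new BlockingCollection<Observation>(ConcurrentStack<Observation>(), 1000)

let processorQueues =
    Array.init 4 (fun _ ->
        new BlockingCollection<Observation>(ConcurrentStack<Observation>(), 10))

// The sacrificed dispatcher thread: consume from the central stack and
// hand each message to whichever processor accepts it first.
let dispatcher =
    Thread(fun () ->
        for msg in dispatcherStack.GetConsumingEnumerable() do
            BlockingCollection.AddToAny(processorQueues, msg) |> ignore)
dispatcher.Start()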
Instead of ConcurrentStack, you may want to read about the heap data structure. If you need your latest messages by some property of the messages, e.g. timestamp, rather than by the order in which they arrive on the stack (e.g. if there could be delays in transit and arrival order <> creation order), you can get the latest message by using a heap.
If you still need Agents semantics/API, you could read several sources in addition to Dave's links, and somehow adopt implementation to multiple concurrent consumers:
An interesting article by Zach Bray on an efficient Actors implementation. There you would need to replace the line execute true (under the comment // Might want to schedule this call on another thread.) by a line like async { execute true } |> Async.Start, because otherwise the producing thread would become the consuming thread - not good for a single fast producer. However, for a dispatcher like the one described above, this is exactly what's needed.
The FSharp.Actor (aka Fakka) development branch and the F# MBP source code (first link above) could be very useful for implementation details. The FSharp.Actor library has been frozen for several months, but there is some activity in the dev branch.
You should not miss the discussion about Fakka in Google Groups in this context.
I have a somewhat similar use case, and for the last two days I have researched everything I could find on F# Agents/Actors. This answer is a kind of TODO for myself to try these ideas, half of which were born while writing it.
The simplest solution is to greedily eat all messages in the inbox when one arrives and discard all but the most recent. Easily done using TryReceive:
let rec readLatestLoop oldMsg =
    async {
        let! newMsg = inbox.TryReceive 0
        match newMsg with
        | None -> return oldMsg
        | Some newMsg -> return! readLatestLoop newMsg
    }

let readLatest () =
    async {
        let! msg = inbox.Receive()
        return! readLatestLoop msg
    }
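Wrapped into a complete agent, the idea might look like this (a sketch; LatestOnlyAgent and the processing callback are illustrative names, not part of the original answer):

type LatestOnlyAgent<'T>(processMsg: 'T -> unit) =
    let agent =
        MailboxProcessor<'T>.Start(fun inbox ->
            // Drain the inbox, keeping only the newest message.
            let rec readLatestLoop oldMsg =
                async {
                    let! newMsg = inbox.TryReceive 0
                    match newMsg with
                    | None -> return oldMsg
                    | Some newMsg -> return! readLatestLoop newMsg
                }
            let readLatest () =
                async {
                    let! msg = inbox.Receive()
                    return! readLatestLoop msg
                }
            let rec loop () =
                async {
                    let! msg = readLatest ()  // discard obsolete observations
                    processMsg msg            // potentially slow tracking work
                    return! loop ()
                }
            loop ())
    member _.Post msg = agent.Post msg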
When faced with the same problem, I architected a more sophisticated and efficient solution I called cancellable streaming, described in an F# Journal article here. The idea is to start processing messages and then cancel that processing if it is superseded. This significantly improves concurrency if significant processing is being done.

How malloc() and sbrk() works in unix?

I am new to UNIX, and I am studying some of UNIX system calls such as brk(), sbrk(), and so on....
Last day I have read about malloc() function, and I was confused a little bit!
Can anybody tell me why malloc reduces the number of sbrk() system calls that the program must perform?
And another question, do brk(0), sbrk(0) and malloc(0) return the same value?
Syscalls are expensive to process because of the additional overhead a syscall places on the program: it has to switch to kernel mode. A system call gets into the kernel by issuing a "trap" or interrupt; it is a call to the kernel for a service, and because it executes in the kernel address space, it carries the high overhead of switching to the kernel (and then switching back).
This is why malloc reduces the number of calls to sbrk() and brk(): it requests more memory than you asked for, so that it doesn't have to issue a syscall every time you need more memory.
brk() and sbrk() are different.
brk is used to set the end of the data segment to the value you specify. It says "set the end of my data segment to this address". Of course, the address you specify must be reasonable, the operating system must have enough memory, and you can't make it point to somewhere that would otherwise exceed the process maximum data size. Thus, brk(0) is invalid, since you'd be trying to set the end of the data segment to address 0, which is nonsense.
On the other hand, sbrk increments the data segment size by the amount you specify, and returns a pointer to the previous break value. Calling sbrk with 0 is valid; it is a way to get a pointer to the current data segment break address.
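For instance, this small program (illustrative; the 4 KiB increment is arbitrary) prints how far the break moved:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    void *before = sbrk(0);   /* current program break, no change */
    sbrk(4096);               /* grow the data segment by 4 KiB   */
    void *after = sbrk(0);
    printf("break moved by %ld bytes\n",
           (long)((char *)after - (char *)before));
    return 0;
}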
malloc is not a system call, it's a C library function that manages memory using sbrk. According to the manpage, malloc(0) is valid, but not of much use:
If size is 0, then malloc() returns either NULL, or a unique pointer value that can later be successfully passed to free().
So, no, brk(0), sbrk(0) and malloc(0) are not equivalent: the first is invalid, the second is used to obtain the address of the program's break, and the last is mostly useless.
Keep in mind that you should never mix malloc with brk or sbrk in the same program. malloc assumes it has full control of brk and sbrk; if you interleave calls to malloc and brk, very weird things can happen.
why malloc reduces the number of sbrk() system calls that the program must perform?
Say you call malloc() to request 10 bytes of memory. The implementation may use sbrk (or another system call like mmap) to request 4K bytes from the OS. Then, when you call malloc() the next time to request another 10 bytes, it doesn't have to issue a system call; it can simply hand out some of the 4K it got from the earlier system call.
The malloc() function may call the sbrk system call to allocate memory dynamically while the process runs.
malloc() is declared in the stdlib.h header file, so you just call the library function and it issues the required system calls on your behalf.
With sbrk, you have to explicitly issue the call to the system yourself.
According to the size given to the function (or to the system call), memory is allocated and a pointer to it is returned for you to store in a variable.
The sbrk() function increases the program's data segment allocation by the specified number of bytes.

void *p = malloc(4096);  /* sbrk grows by 4096 bytes                   */
free(p);                 /* freeing does not move the break back down  */
p = malloc(4096);        /* reuses the freed space: no new sbrk() call */

Memory considerations when enqueing a long sequence of kernels and reads

I have a long sequence of kernels I need to run on some data like
data -> kernel1 -> data1 -> kernel2 -> data2 -> kernel3 -> data3 etc.
I need all the intermediate results to be copied back to the host as well, so the idea would be something like this (pseudocode):
inputdata = clCreateBuffer(...hostBuffer[0]);
for (int i = 0; i < N; ++i)
{
    // create output buffer
    outputdata = clCreateBuffer(...);
    // run kernel
    kernel = clCreateKernel(...);
    kernel.setArg(0, inputdata);
    kernel.setArg(1, outputdata);
    enqueueNDRangeKernel(kernel);
    // read intermediate result
    enqueueReadBuffer(outputdata, hostBuffer[i]);
    // output of operation becomes input of next
    inputdata = outputdata;
}
There are several ways to schedule these operations:
The simplest is to always wait for the event of the previous enqueue operation, so we wait for a read operation to complete before proceeding with the next kernel. I can release buffers as soon as they are not needed.
Or make everything as asynchronous as possible, where kernel and read enqueues only wait for previous kernels, so buffer reads can happen while another kernel is running.
In the second (asynchronous) case I have a few questions:
Do I have to keep references to all cl_mem objects in the long chain of actions and release them after everything is complete?
Importantly, how does OpenCL handle the case when the sum of all memory objects exceeds the total memory available on the device? At any point a kernel only needs its input and output buffers (which should fit in memory), but what if 4 or 5 of these buffers together exceed the total? How does OpenCL allocate/deallocate these memory objects behind the scenes, and how does this affect the reads?
I would be grateful if someone could clarify what happens in these situations, and perhaps there is something relevant to this in the OpenCL spec.
Thank you.
Your second case is the way to go.
In the second (asynchronous) case I have a few questions: Do I have to keep references to all cl_mem objects in the long chain of actions and release them after everything is complete?
Yes. But if all the data arrays are of the same size, I would use just two and overwrite one after the other each iteration.
Then you only need two memory zones, and allocation and release occur only at the beginning/end.
Don't worry about the data having bad values; if you set proper events, the processing will wait for the I/O to finish, i.e.:

data -> kernel1 -> data1 -> kernel2 -> data -> kernel3 -> data1
                      \-> I/O operation   \-> I/O operation
To do that, just set a condition that forces kernel3 to start only after the first I/O has finished. You can chain all the events that way.
NOTE: Using two queues, one for I/O and another for processing, will give you parallel I/O, which is up to twice as fast.
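A sketch of that event chaining against the C host API (error handling omitted; ctx, dev, the kernels, the buffers, size, gws, and hostBuffer are assumed to be set up as in the question):

cl_int err;
cl_command_queue compute_q = clCreateCommandQueue(ctx, dev, 0, &err);
cl_command_queue io_q      = clCreateCommandQueue(ctx, dev, 0, &err);
cl_event k1_done, read_done;

/* kernel1 writes data1 on the compute queue */
clEnqueueNDRangeKernel(compute_q, kernel1, 1, NULL, &gws, NULL,
                       0, NULL, &k1_done);

/* read data1 on the I/O queue once kernel1 is done; the read can
 * overlap with kernel2, which only reads data1 */
clEnqueueReadBuffer(io_q, data1, CL_FALSE, 0, size, hostBuffer[1],
                    1, &k1_done, &read_done);

/* kernel3 overwrites data1, so it must wait for the read to finish */
clEnqueueNDRangeKernel(compute_q, kernel3, 1, NULL, &gws, NULL,
                       1, &read_done, NULL);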
Importantly, how does OpenCL handle the case when the sum
of all memory objects exceeds that of the total memory available on the
device?
It gives an error, CL_OUT_OF_RESOURCES or similar, when allocating.
At any point a kernel only needs the input and output buffers (which should fit in memory), but what if 4 or 5 of these buffers exceed the total, how does OpenCL allocate/deallocate these memory objects behind the scenes? How does this affect the reads?
It will not do this automatically, unless you have created the memory with a host pointer. But even then I'm unsure the OpenCL driver will handle it properly. I would not allocate more than the maximum if I were you.
I was under the impression (sorry, I was going to cite the specification but can't find it today, so I've downgraded the strength of my assertion) that when you enqueue a kernel with cl_mem references, the runtime takes a retain on those objects and releases them when the kernel is done.
This could allow you to release these objects after enqueueing a kernel, without actually having to wait for the kernel to finish running. This is how the async "clEnqueue" operations are reconciled with the synchronous operations (i.e., memory release), and it prevents the runtime and kernel from using released memory objects.
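For what it's worth, the OpenCL spec does state that a memory object is only deleted once its reference count reaches zero and all enqueued commands that use it have finished, so the pattern would look like this (a sketch, reusing the names from the question's pseudocode):

clSetKernelArg(kernel, 0, sizeof(cl_mem), &inputdata);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &outputdata);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
clReleaseMemObject(inputdata);   /* the runtime's retain keeps the buffer
                                    alive until the kernel is done with it */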
