I ran some tests comparing the speed of async as a way of deferring results against CompletableDeferred combined with either a Job (launch) or startCoroutine to do the same work.
In summary there are 3 use cases:
1. async with default type of start (right away) [async]
2. CompletableDeferred + launch (basically Job) [cdl]
3. CompletableDeferred + startCoroutine [ccdl]
The results are presented here:
In short, every iteration of each use case test generates 10,000 async / cdl / ccdl requests and waits for them to complete. This is repeated 225 times, of which the first 25 are warm-up runs (not included in the results), and data points are collected over 100 iterations of the process above (as min, max, avg).
Here is the code:
import com.source.log.log
import kotlinx.coroutines.*
import kotlin.coroutines.Continuation
import kotlin.coroutines.startCoroutine
import kotlin.system.measureNanoTime
import kotlin.system.measureTimeMillis
/**
 * #project Bricks
 * #author SourceOne on 28.11.2019
 */
/*I know that there are better ways to benchmark speed
* but given the produced results this method is fine enough
* */
fun benchmark(warmUp: Int, repeat: Int, action: suspend () -> Unit): Pair<List<Long>, List<Long>> {
    // Warm-up runs: timed, but reported separately from the real measurements
    val warmUpResults = List(warmUp) {
        measureNanoTime {
            runBlocking { action() }
        }
    }
    // Measured runs
    val benchmarkResults = List(repeat) {
        measureNanoTime {
            runBlocking { action() }
        }
    }
    return warmUpResults to benchmarkResults
}
/* find way to cancel startedCoroutine when deferred is
* canceled (currently you have to cancel whole context)
* */
fun <T> CoroutineScope.completable(provider: suspend () -> T): Deferred<T> {
return CompletableDeferred<T>().also { completable ->
Continuation(coroutineContext) { result ->
suspend fun calculateAsyncStep() = coroutineScope {
val list = List(10000) {
async { "i'm a robot" }
suspend fun calculateCDLStep() = coroutineScope {
val list = List(10000) {
CompletableDeferred<String>().also {
launch {
it.complete("i'm a robot")
suspend fun calculateCCDLStep() = coroutineScope {
val list = List(10000) {
completable { "i'm a robot" }
fun main() {
val labels = listOf("async", "cdl", "ccdl")
val collectedResults = listOf(
mutableListOf<Pair<List<Long>, List<Long>>>(),
"stabilizing runs".log()
repeat(2) {
println("async $it")
benchmark(warmUp = 25, repeat = 200) {
println("CDL $it")
benchmark(warmUp = 25, repeat = 200) {
println("CCDL $it")
benchmark(warmUp = 25, repeat = 200) {
"\n#Benchmark start".log()
val benchmarkTime = measureTimeMillis {
repeat(100) {
println("async $it")
collectedResults[0] += benchmark(warmUp = 25, repeat = 200) {
println("CDL $it")
collectedResults[1] += benchmark(warmUp = 25, repeat = 200) {
println("CCDL $it")
collectedResults[2] += benchmark(warmUp = 25, repeat = 200) {
"\n#Benchmark completed in ${benchmarkTime}ms".log()
"#Benchmark results:".log()
val minMaxAvg = collectedResults.map { stageResults ->
stageResults.map { (_, benchmark) ->
benchmark.minBy { it }!!, benchmark.maxBy { it }!!, benchmark.average().toLong()
minMaxAvg.forEachIndexed { index, list ->
"results for: ${labels[index]} [min, max, avg]".log()
list.forEach { results ->
It is no surprise that the first two use cases (async and cdl) are very close to each other and that async is always faster (because you don't have the overhead of creating a job to complete the deferred object). But between async and CompletableDeferred + startCoroutine there is a huge gap (almost 2x) in favor of the latter. Why is there such a big difference and, if anyone knows, why shouldn't we just use a CompletableDeferred + startCoroutine wrapper (like completable() here) instead of async?
Here is a sample for 1000 points:
There are constant spikes in the async and cdl results and some in ccdl (maybe GC?), but there are still far fewer with ccdl. I will rerun these tests with the order of the tests interleaved differently, but it seems that this is related to something in the coroutines machinery.
I've accepted Marko Topolnik's answer, but in addition to it: you can still use this "bare launch" method, as he called it, if you await the result within the scope in which you launched it.
For example, if you launch a few deferred coroutines (async) and await them all at the end of that scope, then the ccdl method works as expected (at least from what I've seen in my tests).
Since launch and async are built as a layer on top of the low-level primitive createCoroutineUnintercepted(), whereas startCoroutine is practically a direct call into it, there aren't any surprises in your benchmark results.
why shouldn't we just be using CompletableDeferred + startCoroutine wrapper (like completable() here) instead of async?
A comment in your code already hints to the answer:
* find way to cancel startedCoroutine when deferred is
* canceled (currently you have to cancel whole context)
The layer you short-circuited with startCoroutine is precisely the layer that handles things such as cancellation, coroutine hierarchy, exception handling and propagation, and so on.
Here's a simple example that shows you one of the things that break when you replace launch with a bare coroutine:
fun main() = runBlocking {
bareLaunch {
try {
println("Coroutine done")
} catch (e: CancellationException) {
println("Coroutine cancelled, the exception is: $e")
fun CoroutineScope.bareLaunch(block: suspend () -> Unit) =
block.startCoroutine(Continuation(coroutineContext) { Unit })
fun <T> CoroutineScope.bareAsync(block: suspend () -> T) =
CompletableDeferred<T>().also { deferred ->
block.startCoroutine(Continuation(coroutineContext) { result ->
result.exceptionOrNull()?.also {
} ?: run {
When you run this, you'll see the bare coroutine got cancelled after 10 milliseconds. The runBlocking builder didn't realize it had to wait for it to complete. If you replace bareLaunch { with launch {, you'll restore the designed behavior where the child coroutine completes normally. The same thing happens with bareAsync.
I'm trying to write a parallel data loader for deep learning in Rust. The task is to write an iterator that under the hood does the following
1. Reads files from disk and applies some compute-heavy preprocessing to them, the result is generally a numeric array (or multiple)
2. Groups the results of the previous step into batches of size B and "collates" them - this generally means just concatenating the arrays - moderately compute heavy
3. Yields the results from step 2.
Step 1 can be both IO and compute bound, depending on network latency, size of files and complexity of preprocessing. It has to be run in parallel by many workers. Step 2 should be off the main thread but likely doesn't need a pool of workers. Step 3 happens on main thread (exposed to Python).
The reason I write it in Rust is that Python offers two options: pure Python implementation shipped with PyTorch, based on multiprocessing, which is somewhat slow but very flexible (arbitrary user-defined data preprocessing and batching) and C++ implementation shipped with Tensorflow, which is assembled by the user from a set of predefined primitives. The latter is substantially faster but too restrictive for the kinds of data processing I wish to do. I expect that Rust will give me the speed of Tensorflow with flexibility of arbitrary code as in PyTorch.
My question is purely about the way to implement parallelism. The ideal setup is to have N workers for step 1) -> channel -> worker for step 2) -> channel -> step 3. Because the iterator object may be dropped at any time, there is a strict requirement to be able to terminate the whole scheme after Drop. On the other hand, there is the flexibility of loading the files in an arbitrary order: for example if the batch size B == 16 and max_n_threads == 32, it is perfectly fine to start 32 workers and yield the first batch containing the 16 examples which happen to return first. This can be exploited for speed.
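To make the "yield whatever finishes first" part of that data flow concrete, here is a minimal sketch using only the standard library. It is a simplification and not the implementation discussed here: it spawns one thread per input instead of a bounded pool, and `process_unordered` and `work` are hypothetical names introduced for illustration.

```rust
use std::sync::mpsc::channel;
use std::thread;

/// Runs `work` on every input in its own thread and returns the outputs
/// in whatever order the workers happen to finish.
/// (Hypothetical helper; a real loader would cap the number of threads.)
fn process_unordered(inputs: Vec<u64>, work: fn(u64) -> u64) -> Vec<u64> {
    let (sender, receiver) = channel();
    let handles: Vec<_> = inputs
        .into_iter()
        .map(|x| {
            let sender = sender.clone();
            thread::spawn(move || {
                // Ignore the error: it only means the receiver is gone.
                let _ = sender.send(work(x));
            })
        })
        .collect();
    // Drop the original sender so the channel closes once all workers finish.
    drop(sender);
    // Items arrive in completion order, not in input order.
    let results: Vec<u64> = receiver.iter().collect();
    for handle in handles {
        handle.join().expect("worker panicked");
    }
    results
}
```

Because the receiver sees items in completion order, a slow example never holds up the ones that finished earlier.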
My naive implementation creates the DataLoader in 3 steps:
1. Create a n_working: Arc<AtomicUsize> to control the number of worker threads active and should_shutdown: Arc<AtomicBool> to signal shutdown (when Drop is called)
2. Create a thread responsible for maintaining the pool. It spins on n_working < max_n_threads and keeps spawning worker threads which terminate on should_shutdown, otherwise fetch a single example, send it down the worker->batcher channel and decrement n_working
3. Create a batching thread which polls the worker->batcher channel, upon receiving B objects concatenates them into a batch and sends down the batcher->yielder channel
struct DataLoader {
collate_worker: Option<thread::JoinHandle<()>>,
example_worker: Option<thread::JoinHandle<()>>,
should_shut_down: Arc<AtomicBool>,
receiver: Receiver<Batch>,
length: usize,
impl DataLoader {
fn new(
dataset: Dataset,
batch_size: usize,
capacity: usize,
) -> Self {
let n_batches = dataset.len() / batch_size;
let max_n_threads = capacity * batch_size;
let (example_sender, collate_receiver) = bounded((batch_size - 1) * capacity);
let should_shut_down = Arc::new(AtomicBool::new(false));
let shutdown_flag = should_shut_down.clone();
let example_worker = thread::spawn(move || {
rayon::scope_fifo(|s| {
let dataset = &dataset;
let n_working = Arc::new(AtomicUsize::new(0));
let mut current_index = 0;
while current_index < n_batches * batch_size {
if n_working.load(Ordering::Relaxed) == max_n_threads {
if shutdown_flag.load(Ordering::Relaxed) {
let index = current_index.clone();
let sender = example_sender.clone();
let counter = n_working.clone();
let shutdown_flag = shutdown_flag.clone();
s.spawn_fifo(move |_s| {
let example = dataset.get_example(index);
if !shutdown_flag.load(Ordering::Relaxed) {
_ = sender.send(example);
} // if we should shut down, skip sending
counter.fetch_sub(1, Ordering::Relaxed);
current_index += 1;
n_working.fetch_add(1, Ordering::Relaxed);
let (batch_sender, final_receiver) = bounded(capacity);
let shutdown_flag = should_shut_down.clone();
let collate_worker = thread::spawn(move || {
'outer: loop {
let mut batch = vec![];
for _ in 0..batch_size {
if let Ok(example) = collate_receiver.recv() {
} else {
break 'outer;
let collated = collate(batch);
if shutdown_flag.load(Ordering::Relaxed) {
break; // skip sending
_ = batch_sender.send(collated);
Self {
collate_worker: Some(collate_worker),
example_worker: Some(example_worker),
should_shut_down: should_shut_down,
receiver: final_receiver,
length: n_batches,
impl DataLoader {
fn __iter__(slf: PyRef<Self>) -> PyRef<Self> { slf }
fn __next__(&mut self) -> Option<Batch> {
fn __len__(&self) -> usize {
impl Drop for DataLoader {
fn drop(&mut self) {
self.should_shut_down.store(true, Ordering::Relaxed);
if self.collate_worker.take().unwrap().join().is_err() {
println!("Panic in collate worker");
if self.example_worker.take().unwrap().join().is_err() {
println!("Panic in example_worker");
println!("dropped the dataloader");
This implementation works and roughly matches the performance of PyTorch but provides no significant speedup. I don't know where to look for improvements, but I imagine it would help to have the thing load-balance automatically in a work-stealing way and to flexibly spawn workers depending on the proportion of IO and compute time. I am also expecting performance issues due to the spinning pool manager and likely corner cases in my handling of Drop.
My question is how to best approach the problem. I am generally unsure if this should be tackled with parallel crates like rayon, async crates like tokio, or a mix of both. I also have the hunch my implementation could be much simpler with the correct use of their combinators/higher order APIs. I tried with rayon but I couldn't get a solution which doesn't wastefully enforce the original sequential returning order and respects the Drop requirement.
Okay I think I've figured out a solution for you that uses rayon parallel iterators.
The trick is to use Results in the rayon iterators, and return Err if the cancellation flag is set.
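As a rough illustration of that trick, the sketch below uses the standard library's sequential `Iterator::try_for_each` as a stand-in for the rayon version, so it needs no dependencies. The function name and the `cancel_after` parameter are assumptions made for the example; `cancel_after` exists only to flip the flag deterministically, where real code would have another thread set it.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Marker error returned once the cancellation flag is observed.
#[derive(Debug, PartialEq)]
struct Cancelled;

/// Sums `items`, but aborts with `Err(Cancelled)` as soon as `flag` is set.
/// `cancel_after` (hypothetical) sets the flag when that index is reached.
fn sum_until_cancelled(
    items: &[u64],
    flag: &AtomicBool,
    cancel_after: usize,
) -> Result<u64, Cancelled> {
    let mut total = 0u64;
    items.iter().enumerate().try_for_each(|(i, &x)| {
        if i == cancel_after {
            flag.store(true, Ordering::Relaxed);
        }
        if flag.load(Ordering::Relaxed) {
            // Returning Err short-circuits the rest of the iteration.
            return Err(Cancelled);
        }
        total += x;
        Ok(())
    })?;
    Ok(total)
}
```

The same shape carries over to a parallel iterator: each item checks the flag first, and the first `Err` stops further work from being scheduled.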
I first created a utility type to create a cancellable thread in which you can execute rayon iterators. You use it by passing in the thread closure which takes the atomic cancellation token as a parameter. Then you have to check if the cancellation token is true, and if so, exit early.
use std::sync::Arc;
use std::sync::atomic::{Ordering, AtomicBool};
use std::thread::JoinHandle;
fn collate(batch: &[Computed]) -> Batch {
batch.iter().map(|&x| i128::from(x)).sum()
struct Cancelled;
struct CancellableThread<Output: Send + 'static> {
cancel_token: Arc<AtomicBool>,
thread: Option<JoinHandle<Result<Output, Cancelled>>>,
impl<Output: Send + 'static> CancellableThread<Output> {
fn new<F: FnOnce(Arc<AtomicBool>) -> Result<Output, Cancelled> + Send + 'static>(init: F) -> Self {
let cancel_token = Arc::new(AtomicBool::new(false));
let thread_cancel_token = Arc::clone(&cancel_token);
CancellableThread {
thread: Some(std::thread::spawn(move || init(thread_cancel_token))),
fn output(mut self) -> Output {
impl<Output: Send + 'static> Drop for CancellableThread<Output> {
fn drop(&mut self) {
self.cancel_token.store(true, Ordering::Relaxed);
if let Some(thread) = self.thread.take() {
let _ = thread.join().unwrap();
I found it useful to create a closure that returns a Result<(), Cancelled> so I could use the try operator (?) to exit early.
CancellableThread::new(move |cancel_token| {
let cancelled = || if cancel_token.load(Ordering::Relaxed) {
} else {
loop {
// was the thread dropped?
// if so, stop what we're doing
// do stuff and
// eventually return a result
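For reference, here is a self-contained variant of the same pattern that uses a plain `std::thread` instead of the `CancellableThread` wrapper. It is a sketch under assumptions: the function name, the iteration counter and the 1 ms sleep are all made up to give the worker something observable to do.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

struct Cancelled;

/// Spawns a worker that loops until the cancel token is set, then joins it
/// and returns how many loop iterations it completed.
fn run_worker_until_cancelled() -> usize {
    let cancel_token = Arc::new(AtomicBool::new(false));
    let iterations = Arc::new(AtomicUsize::new(0));

    let worker = {
        let cancel_token = Arc::clone(&cancel_token);
        let iterations = Arc::clone(&iterations);
        thread::spawn(move || -> Result<(), Cancelled> {
            // Yields Err(Cancelled) once the flag is set, so `?` exits early.
            let cancelled = || {
                if cancel_token.load(Ordering::Relaxed) {
                    Err(Cancelled)
                } else {
                    Ok(())
                }
            };
            loop {
                cancelled()?;
                iterations.fetch_add(1, Ordering::Relaxed);
                thread::sleep(Duration::from_millis(1)); // stand-in for real work
            }
        })
    };

    // Wait until the worker has done at least one iteration, then cancel it.
    while iterations.load(Ordering::Relaxed) == 0 {
        thread::yield_now();
    }
    cancel_token.store(true, Ordering::Relaxed);

    let outcome = worker.join().expect("worker panicked");
    assert!(outcome.is_err()); // the loop can only end via cancellation
    iterations.load(Ordering::Relaxed)
}
```

The worker only notices cancellation at the points where it calls `cancelled()?`, so long-running steps should be preceded by a check.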
I then used that CancellableThread abstraction in the DataLoader. No need to create a special Drop impl for it, because by default, it will call drop on each field anyways, which will handle the cancellation.
type Data = Vec<u8>;
type Dataset = Vec<Data>;
type Computed = u64;
type Batch = i128;
use rayon::prelude::*;
use crossbeam::channel::{unbounded, Receiver};
struct DataLoader {
example_worker: CancellableThread<()>,
collate_worker: CancellableThread<()>,
receiver: Receiver<Batch>,
length: usize,
I used unbounded channels, as it was one less thing to bother about. It shouldn't be hard to switch to bounded ones instead.
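A minimal sketch of what the bounded variant looks like, using the standard library's `sync_channel` as a dependency-free stand-in for crossbeam's `bounded` (both block the sender while the buffer is full). `pipe_through_bounded` is a hypothetical helper, not part of the loader.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Sends `n` items through a channel that buffers at most `capacity` of them
/// and returns everything the consumer received, in arrival order.
fn pipe_through_bounded(n: u32, capacity: usize) -> Vec<u32> {
    let (sender, receiver) = sync_channel::<u32>(capacity);
    let producer = thread::spawn(move || {
        for i in 0..n {
            // Blocks while the buffer is full; fails only if the receiver is gone.
            if sender.send(i).is_err() {
                break;
            }
        }
        // `sender` is dropped here, which ends the receiver's iteration.
    });
    let received: Vec<u32> = receiver.iter().collect();
    producer.join().expect("producer panicked");
    received
}
```

With a bounded channel the producer cannot run arbitrarily far ahead of the consumer, which caps memory use at roughly `capacity` items in flight.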
impl DataLoader {
fn new(dataset: Dataset, batch_size: usize) -> Self {
let (example_sender, collate_receiver) = unbounded();
let (batch_sender, final_receiver) = unbounded();
I'm not sure if you can always guarantee that the number of items in your dataset will be a multiple of the batch_size, so I decided to handle that explicitly.
let length = if dataset.len() % batch_size == 0 {
dataset.len() / batch_size
} else {
dataset.len() / batch_size + 1
I created the collating worker first, though that may not be necessary. As you can see, I had to duplicate a little bit to handle partial batches.
let collate_worker = CancellableThread::new(move |cancel_token| {
let cancelled = || if cancel_token.load(Ordering::Relaxed) {
} else {
'outer: loop {
let mut batch = Vec::with_capacity(batch_size);
for _ in 0..batch_size {
if let Ok(data) = collate_receiver.recv() {
} else {
if !batch.is_empty() {
// handle the last batch, if there
// weren't enough items to fill it
let collated = collate(&batch);
break 'outer;
let collated = collate(&batch);
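The partial-batch handling can also be written without duplicating the collate call inside the receive loop. The following is a sketch only: it uses `std::sync::mpsc` rather than crossbeam, leaves out cancellation, and `collate_all` is a hypothetical name.

```rust
use std::sync::mpsc::Receiver;

/// Drains `receiver` into batches of `batch_size` items and collates each one.
/// A final, shorter batch is emitted if the input does not divide evenly.
fn collate_all<T, B>(
    receiver: Receiver<T>,
    batch_size: usize,
    collate: impl Fn(&[T]) -> B,
) -> Vec<B> {
    assert!(batch_size > 0, "batch_size must be positive");
    let mut out = Vec::new();
    let mut batch = Vec::with_capacity(batch_size);
    // Iteration ends when every sender has been dropped.
    for item in receiver {
        batch.push(item);
        if batch.len() == batch_size {
            out.push(collate(&batch));
            batch.clear();
        }
    }
    // Handle the last batch if there weren't enough items to fill it.
    if !batch.is_empty() {
        out.push(collate(&batch));
    }
    out
}
```

Five items with a batch size of two produce three batches, which matches the rounded-up length computed earlier.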
The example worker is where things are really made much simpler, because we can just use rayon parallel iterators. As you can see, we check for cancellation before each heavy computation.
let example_worker = CancellableThread::new(move |cancel_token| {
let cancelled = || if cancel_token.load(Ordering::Relaxed) {
} else {
let heavy_compute = |data: Data| -> Result<Computed, Cancelled> {
Ok(data.iter().map(|&x| u64::from(x)).product())
.try_for_each(|computed| {
Then we just construct the DataLoader. You can see the Python impl is identical:
DataLoader {
receiver: final_receiver,
// #[pymethods]
impl DataLoader {
fn __iter__(this: Self /* PyRef<Self> */) -> Self /* PyRef<Self> */ { this }
fn __next__(&mut self) -> Option<Batch> {
fn __len__(&self) -> usize {
Firebase methods run on a worker thread automatically, but I have used coroutines and callbackFlow to write the Firebase listener code in a sequential style and to return values from the listeners.
Below is the code I am describing.
coroutine await with firebase for one shot
override suspend fun checkNickName(nickName: String): Results<Int> {
lateinit var result : Results<Int>
.addOnCompleteListener { document ->
if (document.isSuccessful) {
val list = document.result.data?.get("nickNameList") as List<String>
if (list.contains(nickName))
result = Results.Exist(1)
result = Results.No(0)
else {
return result
callbackflow with firebase listener
override fun getOwnUser(): Flow<UserEntity> = callbackFlow{
val document = fireStore.collection("database/user/userList/")
val subscription = document.addSnapshotListener { snapshot,_ ->
if (snapshot!!.exists()) {
val ownUser = snapshot.toObject<UserEntity>()
if (ownUser != null) {
awaitClose { subscription.remove() }
So I wonder whether these approaches are good or bad practice, and why.
Do not combine addOnCompleteListener with coroutines await(). There is no guarantee that the listener gets called before or after await(), so it is possible the code in the listener won't be called until after the whole suspend function returns. Also, one of the major reasons to use coroutines in the first place is to avoid using callbacks. So your first function should look like:
override suspend fun checkNickName(nickName: String): Results<Int> {
try {
val userList = fireStore.collection("database")
.get("nickNameList") as List<String>
return if (userList.contains(nickName)) Results.Exist(1) else Results.No(0)
} catch (e: Exception) {
// return a failure result here
Your use of callbackFlow looks fine, except you should add a buffer() call to the flow you're returning so you can specify how to handle backpressure. However, it's possible you will want to handle that downstream instead.
override fun getOwnUser(): Flow<UserEntity> = callbackFlow {
}.buffer(/* Customize backpressure behavior here */)
Thank you for taking the time to read my problem.
I'm currently using Firebase Firestore to retrieve a list of objects that I wish to display in the UI. I'm trying to use a suspend function to fold the accumulated values of a sequence of calls to the Firestore server, but at the moment I'm unable to pass the result value outside the scope of the coroutine.
This is my fold function:
suspend fun getFormattedList(): FirestoreState {
return foldFunctions(FirestoreModel(""), ::getMatchesFromBackend, ...., ....)
This is my custom fold function:
suspend fun foldFunctions(model: FirestoreModel,
vararg functions: suspend (FirestoreModel, SuccessData) -> FirestoreState): FirestoreState {
val successData: SuccessData = functions.fold(SuccessData()) { updatedSuccessData, function ->
val status = function(model, updatedSuccessData)
if (status !is FirestoreState.Continue) {
return status
updatedSuccessData <--- I managed to retrieve the list of values correctly here
val successModel = SuccessData()
successData.matchList?.let { successModel.matchList = it }
successData.usermatchList?.let { successModel.usermatchList = it }
successData.formattedList?.let { successModel.formattedList = it }
return FirestoreState.Success(successModel) <--- I can't even get to this line with the debugger on
This is my first function (which is working fine)
suspend fun getMatchesFromBackend(model: FirestoreModel, successData: SuccessData): FirestoreState {
return try {
val querySnapshot: QuerySnapshot? = db.collection("matches").get().await()
querySnapshot?.toObjects(Match::class.java).let { list ->
val matchList = mutableListOf<Match>()
list?.let {
for (document in it) {
successData.matchList = matchList <--- where list gets stored
} catch (e : Exception){
when (e) {
is RuntimeException -> FirestoreState.MatchesFailure
is ConnectException -> FirestoreState.MatchesFailure
is CancellationException -> FirestoreState.MatchesFailure
else -> FirestoreState.MatchesFailure
My hypothesis is that the suspend fun gets cancelled and the continuation of the scope gets blocked. I have tried to use runBlocking { } to no avail. If someone has an idea of how to circumvent this issue I'd be very grateful.
Let's say I have a list of repos. I want to iterate through all of them and, as each repo returns its result, pass that result on.
val repos = listOf(repo1, repo2, repo3)
val deferredItems = mutableListOf<Deferred<List<result>>>()
repos.forEach { repo ->
deferredItems.add(async { getResult(repo) })
val results = mutableListOf<Any>()
deferredItems.forEach { deferredItem ->
println("results :: $results")
In the above case, it waits for each repo to return its result and fills the results in sequence: the result of repo1 followed by the result of repo2. If repo1 takes more time than repo2 to return its result, we will be waiting for repo1's result even though we already have the result for repo2.
Is there any way to pass the result of repo2 as soon as we have the result?
The Flow API supports this almost directly:
.flatMapMerge { flow { emit(getResult(it)) } }
.collect { println(it) }
flatMapMerge first collects all the Flows that come out of the lambda you pass to it and then concurrently collects those and sends them into the downstream as soon as any of them completes.
That's what channels are for:
val repos = listOf("repo1", "repo2", "repo3")
val results = Channel<Result>()
repos.forEach { repo ->
launch {
val res = getResult(repo)
for (r in results) {
This example is incomplete, as I don't close the channel, so the resulting code will be forever suspended. Make sure that in your real code you close the channel once all results are received:
val count = AtomicInteger()
for (r in results) {
if (count.incrementAndGet() == repos.size) {
You should use Channels.
suspend fun loadReposConcurrent() = coroutineScope {
val repos = listOf(repo1, repo2, repo3)
val channel = Channel<List<YourResultType>>()
for (repo in repos) {
launch {
val result = getResult(repo)
var allResults = emptyList<YourResultType>()
repeat(repos.size) {
val result = channel.receive()
allResults = allResults + result
println("results :: $result")
In the code above, the for (repo in repos) {...} loop computes each request in a separate coroutine started with launch, and each coroutine sends its result to the channel as soon as it is ready.
In repeat(repos.size) {...}, channel.receive() waits for new values from all the coroutines and consumes them.
I am trying to understand the best way to have an asynchronous job fired at a scheduled rate in Kotlin while the application is running its normal tasks. Let's say I have a simple application that only prints out "..." every second, but every 5 seconds I want another job / thread / coroutine (whichever suits best) to print "You have a message!". For the async job I have a class NotificationProducer, and it looks like this.
class NotificationProducer {
fun produce() {
println("You have a message!")
Then, my main method looks like this.
while (true) {
Should I use GlobalScope.async, Timer().schedule(...) or some Quartz job to achieve what I want? Any advice is highly appreciated. The point is that notification must come from another class (e.g. NotificationProducer)
If I correctly understand the issue, using Kotlin Coroutines you can implement it as the following:
class Presenter : CoroutineScope { // implement CoroutineScope to create local scope
private var job: Job = Job()
override val coroutineContext: CoroutineContext
get() = Dispatchers.Default + job
// this method will help to stop execution of a coroutine.
// Call it to cancel coroutine and to break the while loop defined in the coroutine below
fun cancel() {
fun schedule() = launch { // launching the coroutine
var seconds = 1
val producer = NotificationProducer()
while (true) {
if (seconds++ == 5) {
seconds = 1
Then you can use an instance of the Presenter class to launch the coroutine and stop it:
val presenter = Presenter()
presenter.schedule() // calling `schedule()` function launches the coroutine
presenter.cancel() // cancel the coroutine when you need
For simple scheduling requirements, you can consider using coroutines:
class NotificationProducerScheduler(val service: NotificationProducer, val interval: Long, val initialDelay: Long?) :
CoroutineScope {
private val job = Job()
private val singleThreadExecutor = Executors.newSingleThreadExecutor()
override val coroutineContext: CoroutineContext
get() = job + singleThreadExecutor.asCoroutineDispatcher()
fun stop() {
fun start() = launch {
initialDelay?.let {
while (isActive) {
println("coroutine done")
Otherwise, the Java concurrency API is pretty solid too:
class NotificationProducerSchedulerJavaScheduler(
val service: NotificationProducer,
val interval: Long,
val initialDelay: Long = 0
) {
private val scheduler = Executors.newScheduledThreadPool(1)
private val task = Runnable { service.produce() }
fun stop() {
fun start() {
scheduler.scheduleWithFixedDelay(task, initialDelay, interval, TimeUnit.MILLISECONDS)
This function will run a task in the background while proceeding with a "main" task that controls the lifecycle of the background job. Below is an example of usage.
/**
 * Runs a task in the background in IO while the op proceeds.
 * The job is canceled when op returns.
 * This is useful for updating caches and the like.
 */
suspend fun withBackgroundTask(task: suspend () -> Unit, op: suspend () -> Unit) {
val job = CoroutineScope(Dispatchers.IO).launch { task() }
try {
} finally {
/** Updates the cache in a background task while op runs. */
suspend fun withCache(cache: Cache<*>, op: suspend () -> Unit) {
suspend fun cacheUpdate() {
while (true) {
withBackgroundTask(::cacheUpdate, op)