How to await `JoinHandle`s and update `JoinHandle`s at the same time?

Is it possible to both read a stream of Futures from a set of JoinHandle<()> tasks and update that set of tasks with new tasks at the same time?
I currently have a Service that runs some long tasks. Only thing is, I would actually like to (if possible) add new tasks in at the same time -- via a flag sent by some type of Receiver channel (not shown below to keep things simple).
Given that in Service::run handles becomes owned by that function, I would lean towards "no", this is not possible. Is this true? If this isn't possible given my setup, is there some way I could tweak the code below to make this possible?
I read in this answer that wrapping the HashMap in an Option allows me to use .take() in Service::run, since the value needs to be owned in order to call .into_values(). However, the problem with this is that .take() consumes the value in the Mutex, leaving None in its wake.
Here is my minimal reproducible example (did not compile this, but should give the idea):
use std::{collections::HashMap, future::Future};
use tokio::{time::{sleep, Duration}, task::JoinHandle};
use async_std::sync::{Arc, Mutex};
use futures::{
stream::{FuturesUnordered, StreamExt},
};
type Handles = Arc<Mutex<Option<HashMap<String, JoinHandle<()>>>>>;
fn a_task() -> impl Future<Output = ()> {
async move {
fn the_update_task(handles: Handles) -> impl Future<Output = ()> {
async move {
// would like to update `handles` here as I get new data from a channel
// calling .take() in Service::run nukes my handles here :(
struct Service {
handles: Handles,
impl Service {
fn new() -> Self {
let handles = Arc::new(Mutex::new(Some(HashMap::default())));
let handle = tokio::spawn(the_update_task(handles.clone()));
Self { handles }
async fn add_a_task(&mut self, id: String) {
let handle = tokio::spawn(a_task());
self.handles.lock().await.as_mut().unwrap().insert(id, handle);
async fn run(self) {
let Service { handles, .. } = self;
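let mut futs = FuturesUnordered::from_iter(
// .take() leaves None behind in the shared Mutex
handles.lock().await.take().unwrap().into_values(),
);
while let Some(fut) = futs.next().await {
info!("I completed a task! {fut:?}");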
async fn main() {
let mut srvc = Service::new();
let handle = tokio::spawn(;
I have tried
Using Arc(Mutex(HashMap))
Using Arc(Mutex(Option(HashMap)))
I seem to arrive always at the same conclusion:
I cannot both own handles in Service::run and update handles (even a copy or reference) from another part of the code.

Just answering my own question here with the help of @user1937198's comment.
The solution was to update a reference to the FuturesUnordered directly with new tasks, as opposed to being concerned with handles. This simplifies things quite a bit.
use std::{collections::HashMap, future::Future};
use tokio::{time::{sleep, Duration}, task::JoinHandle};
use async_std::sync::{Arc, Mutex};
use futures::{
stream::{FuturesUnordered, StreamExt},
};
fn a_task() -> impl Future<Output = ()> {
async move {
fn the_update_task(futs: Arc<Mutex<FuturesUnordered<JoinHandle<()>>>>) -> impl Future<Output = ()> {
async move {
// Just push another task
let fut = tokio::spawn(a_task());
futs.lock().await.push(fut);
struct Service {
handles: HashMap<String, JoinHandle<()>>,
impl Service {
fn new() -> Self {
let handles = HashMap::default();
Self { handles }
async fn add_a_task(&mut self, id: String) {
let handle = tokio::spawn(a_task());
self.handles.insert(id, handle);
async fn run(self) {
let Service { handles, .. } = self;
let futs = Arc::new(Mutex::new(FuturesUnordered::from_iter(handles.into_values())));
while let Some(fut) = futs.lock().await.next().await {
info!("I completed a task! {fut:?}");
async fn main() {
let mut srvc = Service::new();
let handle = tokio::spawn(;
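The underlying idea (let the consuming loop own what it iterates over, and hand the updater only a way to feed it) can also be sketched without tokio or futures. The following is not the FuturesUnordered code from the answer; it is a std-only analogue that uses threads and an mpsc channel, and the names `spawn_task` and `run_service` are made up for illustration.

```rust
use std::sync::mpsc::{self, Sender};
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for `a_task`: does some work, then reports its id.
fn spawn_task(id: u32, done: Sender<u32>) {
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(5));
        // The receiver may already be gone; ignoring the error is fine here.
        let _ = done.send(id);
    });
}

/// Starts the `initial` tasks, lets an updater thread add `added_later`
/// while completions are already being consumed, and returns every
/// completed id (sorted, because completion order is arbitrary).
fn run_service(initial: &[u32], added_later: &[u32]) -> Vec<u32> {
    let (done_tx, done_rx) = mpsc::channel();

    for &id in initial {
        spawn_task(id, done_tx.clone());
    }

    // The updater owns only a Sender clone, never the collection itself.
    let updater_tx = done_tx.clone();
    let later = added_later.to_vec();
    let updater = thread::spawn(move || {
        for id in later {
            spawn_task(id, updater_tx.clone());
        }
    });

    // Drop our own sender so the loop ends once every task has reported.
    drop(done_tx);

    let mut completed: Vec<u32> = done_rx.iter().collect();
    updater.join().expect("updater thread panicked");
    completed.sort_unstable();
    completed
}
```

Calling `run_service(&[1, 2], &[3, 4])` collects all four ids even though two of them were added after the loop started. Nothing is shared behind a Mutex, so the "who owns handles" question does not come up.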


How to asynchronously memoize a field of struct in an Option

Suppose I have some data Bar (e.g. database client) which I would like to create only once
but lazily for my structure Foo.
struct Bar;
struct Foo {
bar: Option<Bar>
To do this, I check that the field is initialized; if not, I run the async routine.
The result of the routine is then saved as Some to reuse later.
I know, that Option::get_or_insert_with perfectly fits this scenario, but I have to
deal with async, so I do this manually like this.
impl Foo {
pub async fn get_bar(&mut self) -> &Bar {
if let Some(bar) = &self.bar {
return bar;
}
let bar = Self::create_bar().await;
self.bar = Some(bar);
self.bar.as_ref().unwrap()
}
/// Long and heavy-resource routine,
/// we want to memoize it.
async fn create_bar() -> Bar {
However, this cannot be compiled due to the immutable and mutable borrowing of self.bar.
Is there a way to do this correctly?
Full example.
Interestingly, the borrow checker is able to infer better lifetimes if you use the ref keyword in your if let, so the following works:
pub async fn get_bar(&mut self) -> &Bar {
if let Some(ref bar) = self.bar {
return bar;
let bar = Self::create_bar().await;
You can use the as_ref() method of Option<T>.
Here's it being used in your get_bar() function:
async fn get_bar(&mut self) -> &Bar {
if {
} else {
let bar = Self::create_bar().await;
I posted this as an answer because my reputation is too low. Please let me know if this answer is not suitable.
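For reference, here is one way the check-then-initialise pattern can be written so that no borrow of the field is alive while it is assigned. This is a minimal sketch, not the exact code from either answer: `Bar(u32)`, the `created` counter, `block_on` and `demo` are assumptions added only so the example can run without an async runtime.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct Bar(u32);

#[derive(Default)]
struct Foo {
    bar: Option<Bar>,
    created: u32, // counts how often create_bar ran (illustration only)
}

impl Foo {
    async fn get_bar(&mut self) -> &Bar {
        if self.bar.is_none() {
            // No reference into self.bar is alive here, so assigning is fine.
            self.created += 1;
            self.bar = Some(Self::create_bar().await);
        }
        // The only borrow that escapes is taken after the mutation.
        self.bar.as_ref().expect("bar was initialised above")
    }

    /// Stand-in for the long, resource-heavy routine.
    async fn create_bar() -> Bar {
        Bar(42)
    }
}

/// Toy executor: enough for futures that never actually wait on anything.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

/// Calls get_bar twice and reports the value plus how often it was created.
fn demo() -> (u32, u32) {
    let mut foo = Foo::default();
    let first = block_on(foo.get_bar()).0;
    let _second = block_on(foo.get_bar()).0;
    (first, foo.created)
}
```

In a real program the toy `block_on` would be replaced by whatever runtime is already in use; the part that matters is the body of `get_bar`.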

Parallel work stealing in arbitrary order in Rust

I'm trying to write a parallel data loader for deep learning in Rust. The task is to write an iterator that under the hood does the following
1. Reads files from disk and applies some compute-heavy preprocessing to them; the result is generally a numeric array (or several).
2. Groups the results of the previous step into batches of size B and "collates" them. This generally means just concatenating the arrays, which is moderately compute heavy.
3. Yields the results from step 2.
Step 1 can be both IO and compute bound, depending on network latency, size of files and complexity of preprocessing. It has to be run in parallel by many workers. Step 2 should be off the main thread but likely doesn't need a pool of workers. Step 3 happens on main thread (exposed to Python).
The reason I write it in Rust is that Python offers two options: pure Python implementation shipped with PyTorch, based on multiprocessing, which is somewhat slow but very flexible (arbitrary user-defined data preprocessing and batching) and C++ implementation shipped with Tensorflow, which is assembled by the user from a set of predefined primitives. The latter is substantially faster but too restrictive for the kinds of data processing I wish to do. I expect that Rust will give me the speed of Tensorflow with flexibility of arbitrary code as in PyTorch.
My question is purely about the way to implement parallelism. The ideal setup is to have N workers for step 1) -> channel -> worker for step 2) -> channel -> step 3. Because the iterator object may be dropped at any time, there is a strict requirement to be able to terminate the whole scheme after Drop. On the other hand, there is the flexibility of loading the files in an arbitrary order: for example if the batch size B == 16 and max_n_threads == 32, it is perfectly fine to start 32 workers and yield the first batch containing the 16 examples which happen to return first. This can be exploited for speed.
My naive implementation creates the DataLoader in 3 steps:
1. Create an n_working: Arc<AtomicUsize> to control the number of active worker threads and a should_shutdown: Arc<AtomicBool> to signal shutdown (when Drop is called).
2. Create a thread responsible for maintaining the pool. It spins on n_working < max_n_threads and keeps spawning worker threads, each of which terminates on should_shutdown or otherwise fetches a single example, sends it down the worker->batcher channel and decrements n_working.
3. Create a batching thread which polls the worker->batcher channel and, upon receiving B objects, concatenates them into a batch and sends it down the batcher->yielder channel.
struct DataLoader {
collate_worker: Option<thread::JoinHandle<()>>,
example_worker: Option<thread::JoinHandle<()>>,
should_shut_down: Arc<AtomicBool>,
receiver: Receiver<Batch>,
length: usize,
impl DataLoader {
fn new(
dataset: Dataset,
batch_size: usize,
capacity: usize,
) -> Self {
let n_batches = dataset.len() / batch_size;
let max_n_threads = capacity * batch_size;
let (example_sender, collate_receiver) = bounded((batch_size - 1) * capacity);
let should_shut_down = Arc::new(AtomicBool::new(false));
let shutdown_flag = should_shut_down.clone();
let example_worker = thread::spawn(move || {
rayon::scope_fifo(|s| {
let dataset = &dataset;
let n_working = Arc::new(AtomicUsize::new(0));
let mut current_index = 0;
while current_index < n_batches * batch_size {
if n_working.load(Ordering::Relaxed) == max_n_threads {
if shutdown_flag.load(Ordering::Relaxed) {
let index = current_index.clone();
let sender = example_sender.clone();
let counter = n_working.clone();
let shutdown_flag = shutdown_flag.clone();
s.spawn_fifo(move |_s| {
let example = dataset.get_example(index);
if !shutdown_flag.load(Ordering::Relaxed) {
_ = sender.send(example);
} // if we should shut down, skip sending
counter.fetch_sub(1, Ordering::Relaxed);
current_index += 1;
n_working.fetch_add(1, Ordering::Relaxed);
let (batch_sender, final_receiver) = bounded(capacity);
let shutdown_flag = should_shut_down.clone();
let collate_worker = thread::spawn(move || {
'outer: loop {
let mut batch = vec![];
for _ in 0..batch_size {
if let Ok(example) = collate_receiver.recv() {
} else {
break 'outer;
let collated = collate(batch);
if shutdown_flag.load(Ordering::Relaxed) {
break; // skip sending
_ = batch_sender.send(collated);
Self {
collate_worker: Some(collate_worker),
example_worker: Some(example_worker),
should_shut_down: should_shut_down,
receiver: final_receiver,
length: n_batches,
impl DataLoader {
fn __iter__(slf: PyRef<Self>) -> PyRef<Self> { slf }
fn __next__(&mut self) -> Option<Batch> {
fn __len__(&self) -> usize {
impl Drop for DataLoader {
fn drop(&mut self) {
self.should_shut_down.store(true, Ordering::Relaxed);
if self.collate_worker.take().unwrap().join().is_err() {
println!("Panic in collate worker");
if self.example_worker.take().unwrap().join().is_err() {
println!("Panic in example_worker");
println!("dropped the dataloader");
This implementation works and roughly matches the performance of PyTorch but provides no significant speedup. I don't know where to look for improvements, but I imagine it would help to have the thing load-balance automatically in a work-stealing way and to flexibly spawn workers depending on the proportion of IO and compute time. I am also expecting performance issues due to the spinning pool manager and likely corner cases in my handling of Drop.
My question is how to best approach the problem. I am generally unsure if this should be tackled with parallel crates like rayon, async crates like tokio, or a mix of both. I also have the hunch my implementation could be much simpler with the correct use of their combinators/higher order APIs. I tried with rayon but I couldn't get a solution which doesn't wastefully enforce the original sequential returning order and respects the Drop requirement.
Okay I think I've figured out a solution for you that uses rayon parallel iterators.
The trick is to use Results in the rayon iterators, and return Err if the cancellation flag is set.
I first created a utility type to create a cancellable thread in which you can execute rayon iterators. You use it by passing in the thread closure which takes the atomic cancellation token as a parameter. Then you have to check if the cancellation token is true, and if so, exit early.
use std::sync::Arc;
use std::sync::atomic::{Ordering, AtomicBool};
use std::thread::JoinHandle;
fn collate(batch: &[Computed]) -> Batch {
batch.iter().map(|&x| i128::from(x)).sum()
struct Cancelled;
struct CancellableThread<Output: Send + 'static> {
cancel_token: Arc<AtomicBool>,
thread: Option<JoinHandle<Result<Output, Cancelled>>>,
impl<Output: Send + 'static> CancellableThread<Output> {
fn new<F: FnOnce(Arc<AtomicBool>) -> Result<Output, Cancelled> + Send + 'static>(init: F) -> Self {
let cancel_token = Arc::new(AtomicBool::new(false));
let thread_cancel_token = Arc::clone(&cancel_token);
CancellableThread {
thread: Some(std::thread::spawn(move || init(thread_cancel_token))),
fn output(mut self) -> Output {
impl<Output: Send + 'static> Drop for CancellableThread<Output> {
fn drop(&mut self) {
self.cancel_token.store(true, Ordering::Relaxed);
if let Some(thread) = self.thread.take() {
let _ = thread.join().unwrap();
I found it useful to create a closure that returns a Result<(), Cancelled> so I could use the try operator (?) to exit early.
CancellableThread::new(move |cancel_token| {
let cancelled = || if cancel_token.load(Ordering::Relaxed) {
Err(Cancelled)
} else {
Ok(())
};
loop {
// was the thread dropped?
// if so, stop what we're doing
cancelled()?;
// do stuff and
// eventually return a result
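To make the early-exit idea concrete, here is a small std-only sketch of the same pattern outside of any thread or rayon iterator. The function `sum_until_cancelled` is a made-up example; only the `Cancelled` type and the shape of the `cancelled` closure follow the description above.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

#[derive(Debug, PartialEq)]
struct Cancelled;

/// Sums the items, checking the cancellation flag before each one.
fn sum_until_cancelled(items: &[u64], cancel_token: &AtomicBool) -> Result<u64, Cancelled> {
    // Returns Err(Cancelled) once the flag is set, so `?` can bail out early.
    let cancelled = || {
        if cancel_token.load(Ordering::Relaxed) {
            Err(Cancelled)
        } else {
            Ok(())
        }
    };

    let mut total = 0;
    for &item in items {
        cancelled()?; // stop before doing any more work
        total += item;
    }
    Ok(total)
}
```

With the flag unset the function returns the sum; once the flag is set (which is what dropping the owner does in the thread wrapper) it returns `Err(Cancelled)` before touching the next item.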
I then used that CancellableThread abstraction in the DataLoader. No need to create a special Drop impl for it, because by default, it will call drop on each field anyways, which will handle the cancellation.
type Data = Vec<u8>;
type Dataset = Vec<Data>;
type Computed = u64;
type Batch = i128;
use rayon::prelude::*;
use crossbeam::channel::{unbounded, Receiver};
struct DataLoader {
example_worker: CancellableThread<()>,
collate_worker: CancellableThread<()>,
receiver: Receiver<Batch>,
length: usize,
I used unbounded channels, as it was one less thing to bother about. It shouldn't be hard to switch to bounded ones instead.
impl DataLoader {
fn new(dataset: Dataset, batch_size: usize) -> Self {
let (example_sender, collate_receiver) = unbounded();
let (batch_sender, final_receiver) = unbounded();
I'm not sure if you can always guarantee that the number of items in your dataset will be a multiple of the batch_size, so I decided to handle that explicitly.
let length = if dataset.len() % batch_size == 0 {
dataset.len() / batch_size
} else {
dataset.len() / batch_size + 1
I created the collating worker first, though that may not be necessary. As you can see, I had to duplicate a little bit to handle partial batches.
let collate_worker = CancellableThread::new(move |cancel_token| {
let cancelled = || if cancel_token.load(Ordering::Relaxed) {
Err(Cancelled)
} else {
Ok(())
};
'outer: loop {
let mut batch = Vec::with_capacity(batch_size);
for _ in 0..batch_size {
if let Ok(data) = collate_receiver.recv() {
} else {
if !batch.is_empty() {
// handle the last batch, if there
// weren't enough items to fill it
let collated = collate(&batch);
break 'outer;
let collated = collate(&batch);
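The batching logic, including the partial last batch, can be sketched on its own. This is a simplified illustration with std's mpsc channel instead of crossbeam and without cancellation; `collect_batches` and `n_batches` are hypothetical helper names, and plain `Vec<u64>` batches stand in for the collated result.

```rust
use std::sync::mpsc::Receiver;

/// Number of batches when the last one may be only partially filled.
fn n_batches(len: usize, batch_size: usize) -> usize {
    (len + batch_size - 1) / batch_size
}

/// Groups received examples into batches of `batch_size`, in arrival order.
fn collect_batches(examples: Receiver<u64>, batch_size: usize) -> Vec<Vec<u64>> {
    let mut batches = Vec::new();
    let mut batch = Vec::with_capacity(batch_size);

    // Iteration ends once every Sender has been dropped,
    // i.e. when all workers have finished.
    for example in examples {
        batch.push(example);
        if batch.len() == batch_size {
            let full = std::mem::replace(&mut batch, Vec::with_capacity(batch_size));
            batches.push(full);
        }
    }

    // Handle the last batch if there weren't enough items to fill it.
    if !batch.is_empty() {
        batches.push(batch);
    }
    batches
}
```

Swapping the buffer out with `mem::replace` when it is full means the collate step would only need to be written once for full batches and once after the loop for the remainder.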
The example worker is where things are really made much simpler, because we can just use rayon parallel iterators. As you can see, we check for cancellation before each heavy computation.
let example_worker = CancellableThread::new(move |cancel_token| {
let cancelled = || if cancel_token.load(Ordering::Relaxed) {
Err(Cancelled)
} else {
Ok(())
};
let heavy_compute = |data: Data| -> Result<Computed, Cancelled> {
Ok(data.iter().map(|&x| u64::from(x)).product())
.try_for_each(|computed| {
Then we just construct the DataLoader. You can see the Python impl is identical:
DataLoader {
receiver: final_receiver,
// #[pymethods]
impl DataLoader {
fn __iter__(this: Self /* PyRef<Self> */) -> Self /* PyRef<Self> */ { this }
fn __next__(&mut self) -> Option<Batch> {
fn __len__(&self) -> usize {

Rust Async doesn't execute in parallel for sockets

I'm trying to send and receive simultaneously to a multicast IP with Rust.
use futures::executor::block_on;
use async_std::task;
use std::{net::{UdpSocket, Ipv4Addr}, time::{Duration, Instant}};
fn main() {
let future = async_main();
async fn async_main() {
let mut socket = UdpSocket::bind("").unwrap();
let multi_addr = Ipv4Addr::new(234, 2, 2, 2);
let inter = Ipv4Addr::new(0,0,0,0);
let async_one = first(&socket);
let async_two = second(&socket);
futures::join!(async_one, async_two);
async fn first(socket: &std::net::UdpSocket) {
let mut buf = [0u8; 65535];
let now = Instant::now();
loop {
if now.elapsed().as_secs() > 10 { break; }
let (amt, src) = socket.recv_from(&mut buf).unwrap();
println!("received {} bytes from {:?}", amt, src);
async fn second(socket: &std::net::UdpSocket) {
let now = Instant::now();
loop {
if now.elapsed().as_secs() > 10 { break; }
socket.send_to(String::from("h").as_bytes(), "").unwrap();
The issue is that it first runs the receive function and only then runs the send function; it never sends and receives simultaneously. With Golang I can do this with Goroutines, but I'm finding this quite difficult in Rust.
I'm not very experienced with async in Rust, but your first() and second() functions don't appear to have any asynchronous calls in them; in other words, nothing in them uses .await. My understanding is that if nothing is awaited, each function runs synchronously to completion the first time it is polled, so join! ends up running first() and then second(). As far as I know the compiler itself does not warn about this, although Clippy's unused_async lint can flag it.
It doesn't look like std::net::UdpSocket provides any async methods that can be awaited, and you need to use async_std::net::UdpSocket instead.
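The effect of missing await points can be shown without any sockets. In this sketch, `run_both` is a toy round-robin poller standing in for join!, and `YieldNow` is a hand-written await point; both are assumptions made for illustration and are not part of async_std or futures.

```rust
use std::cell::RefCell;
use std::future::Future;
use std::pin::{pin, Pin};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Returns Pending once, then Ready: a minimal await point.
struct YieldNow(bool);

impl Future for YieldNow {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            return Poll::Ready(());
        }
        self.0 = true;
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

/// Logs its name twice, optionally yielding after each step.
async fn worker(name: &'static str, yield_between: bool, log: &RefCell<Vec<&'static str>>) {
    for _ in 0..2 {
        log.borrow_mut().push(name);
        if yield_between {
            YieldNow(false).await;
        }
    }
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls two workers in turn until both are done and returns the log.
fn run_both(yield_between: bool) -> Vec<&'static str> {
    let log = RefCell::new(Vec::new());
    {
        let waker = Waker::from(Arc::new(NoopWake));
        let mut cx = Context::from_waker(&waker);
        let mut a = pin!(worker("first", yield_between, &log));
        let mut b = pin!(worker("second", yield_between, &log));
        let (mut a_done, mut b_done) = (false, false);
        while !(a_done && b_done) {
            if !a_done {
                a_done = a.as_mut().poll(&mut cx).is_ready();
            }
            if !b_done {
                b_done = b.as_mut().poll(&mut cx).is_ready();
            }
        }
    }
    log.into_inner()
}
```

Without await points the log reads first, first, second, second; with them the two workers interleave. A blocking call such as std's recv_from behaves like the first case, except that it also stalls the whole thread while it waits.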

Can I return a struct which uses PhantomData from a trait implementation to add a lifetime to a raw pointer without polluting the interface?

In this question someone commented that you could use PhantomData to add a lifetime bound to a raw pointer inside a struct. I thought I'd try doing this on an existing piece of code I've been working on.
Here's our (minimised) starting point. This compiles (playground):
extern crate libc;
use libc::{c_void, free, malloc};
trait Trace {}
struct MyTrace {
    buf: *mut c_void,
}
impl MyTrace {
    fn new() -> Self {
        Self {
            buf: unsafe { malloc(128) },
        }
    }
}
impl Trace for MyTrace {}
impl Drop for MyTrace {
    fn drop(&mut self) {
        unsafe { free(self.buf) };
    }
}
trait Tracer {
    fn start(&mut self);
    fn stop(&mut self) -> Box<Trace>;
}
struct MyTracer {
    trace: Option<MyTrace>,
}
impl MyTracer {
    fn new() -> Self {
        Self { trace: None }
    }
}
impl Tracer for MyTracer {
    fn start(&mut self) {
        self.trace = Some(MyTrace::new());
        // Pretend the buffer is mutated in C here...
    }
    fn stop(&mut self) -> Box<Trace> {
        Box::new(self.trace.take().unwrap())
    }
}
fn main() {
    let mut tracer = MyTracer::new();
    tracer.start();
    let _trace = tracer.stop();
    println!("Hello, world!");
}
I think that the problem with the above code is that I could in theory move the buf pointer out of a MyTrace and use it after the struct has died. In this case the underlying buffer will have been freed due to the Drop implementation.
By using a PhantomData we can ensure that only references to buf can be obtained, and that the lifetimes of those references are bound to the instances of MyTrace from whence they came.
We can proceed like this (playground):
extern crate libc;
use libc::{c_void, free, malloc};
use std::marker::PhantomData;
trait Trace {}
struct MyTrace<'b> {
    buf: *mut c_void,
    _phantom: PhantomData<&'b c_void>,
}
impl<'b> MyTrace<'b> {
    fn new() -> Self {
        Self {
            buf: unsafe { malloc(128) },
            _phantom: PhantomData,
        }
    }
}
impl<'b> Trace for MyTrace<'b> {}
impl<'b> Drop for MyTrace<'b> {
    fn drop(&mut self) {
        unsafe { free(self.buf) };
    }
}
trait Tracer {
    fn start(&mut self);
    fn stop(&mut self) -> Box<Trace>;
}
struct MyTracer<'b> {
    trace: Option<MyTrace<'b>>,
}
impl<'b> MyTracer<'b> {
    fn new() -> Self {
        Self { trace: None }
    }
}
impl<'b> Tracer for MyTracer<'b> {
    fn start(&mut self) {
        self.trace = Some(MyTrace::new());
        // Pretend the buffer is mutated in C here...
    }
    fn stop(&mut self) -> Box<Trace> {
        Box::new(self.trace.take().unwrap())
    }
}
fn main() {
    let mut tracer = MyTracer::new();
    tracer.start();
    let _trace = tracer.stop();
    println!("Hello, world!");
}
But this will give the error:
error[E0495]: cannot infer an appropriate lifetime due to conflicting requirements
--> src/
53 | Box::new(self.trace.take().unwrap())
| ^^^^^^
note: first, the lifetime cannot outlive the lifetime 'b as defined on the impl at 46:1...
--> src/
46 | impl<'b> Tracer for MyTracer<'b> {
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
= note: that the types are compatible:
expected std::option::Option<MyTrace<'_>>
found std::option::Option<MyTrace<'b>>
= note: but, the lifetime must be valid for the static lifetime...
= note: that the expression is assignable:
expected std::boxed::Box<Trace + 'static>
found std::boxed::Box<Trace>
I have three sub-questions:
Did I understand the motivation for PhantomData in this scenario correctly?
Where is 'static coming from in the error message?
Can this be made to work without changing the interface of stop? Specifically, without adding a lifetime to the return type?
I'm going to ignore your direct question because I believe you arrived at it after misunderstanding several initial steps.
I could in theory move the buf pointer out of a MyTrace and use it after the struct has died
Copy the pointer, not move it, but yes.
By using a PhantomData we can ensure that only references to buf can be obtained
This is not true. It is still equally easy to get a copy of the raw pointer and misuse it even when you add a PhantomData.
Did I understand the motivation for PhantomData in this scenario correctly?
No. PhantomData is used when you want to act like you have a value of some type without actually having it. Pretending to have a reference to something is only useful when there is something to have a reference to. There's no such value to reference in your example.
The Rust docs say something about raw pointers and PhantomData, but I perhaps got it wrong
That example actually shows my point well. The Slice type is intended to behave as if it has a reference to the Vec that it's borrowed from:
fn borrow_vec<'a, T>(vec: &'a Vec<T>) -> Slice<'a, T>
Since this Slice type doesn't actually have a reference, it needs a PhantomData to act like it has a reference. Note that the lifetime 'a isn't just made up out of whole cloth — it's related to an existing value (the Vec). It would cause memory unsafety for the Slice to exist after the Vec has moved, thus it makes sense to include a lifetime of the Vec.
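A compact version of that idea, adapted from the shape of the documentation example, might look like this. The `len` field and the `get` method are additions for illustration and are not part of the example in the docs.

```rust
use std::marker::PhantomData;

/// Holds a raw pointer into a Vec, but behaves as if it held `&'a T`.
struct Slice<'a, T> {
    start: *const T,
    len: usize,
    phantom: PhantomData<&'a T>,
}

/// The lifetime 'a comes from a real value: the borrowed Vec.
fn borrow_vec<'a, T>(vec: &'a Vec<T>) -> Slice<'a, T> {
    Slice {
        start: vec.as_ptr(),
        len: vec.len(),
        phantom: PhantomData,
    }
}

impl<'a, T> Slice<'a, T> {
    fn get(&self, index: usize) -> Option<&'a T> {
        if index < self.len {
            // SAFETY: index is in bounds, and 'a keeps the Vec borrowed
            // (so neither moved nor mutated) for as long as the result lives.
            Some(unsafe { &*self.start.add(index) })
        } else {
            None
        }
    }
}
```

Because `Slice<'a, T>` carries `'a`, the compiler rejects code that moves or drops the Vec while a `Slice` is still in use, which is exactly the protection the raw pointer alone would not give.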
why the commenter in the other question suggested I use PhantomData to improve the type safety of my raw pointer
You can use PhantomData to improve the safety of raw pointers that act as references, but yours doesn't have some existing Rust value to reference. You can also use it for correctness if your pointer owns some value behind the reference, which yours seemingly does. However, since it's a c_void, it's not really useful. You'd usually see it as PhantomData<MyOwnedType>.
Where is 'static coming from in the error message?
Why is adding a lifetime to a trait with the plus operator (Iterator<Item = &Foo> + 'a) needed?
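In short, a boxed trait object with no explicit bound defaults to 'static, so Box<Trace> means Box<Trace + 'static>, and a type that carries a non-'static lifetime cannot be put into it. A small sketch of the rule, using the modern dyn spelling and a hypothetical `Borrowed` type that really does hold a reference:

```rust
trait Trace {
    fn len(&self) -> usize;
}

/// Hypothetical trace type that borrows its data.
struct Borrowed<'a>(&'a [u8]);

impl<'a> Trace for Borrowed<'a> {
    fn len(&self) -> usize {
        self.0.len()
    }
}

// Writing `Box<dyn Trace>` here would mean `Box<dyn Trace + 'static>` and
// would be rejected, because Borrowed<'a> does not live for 'static.
// Naming the lifetime in the return type makes it compile.
fn boxed<'a>(data: &'a [u8]) -> Box<dyn Trace + 'a> {
    Box::new(Borrowed(data))
}
```

This is why the lifetime has to show up in the return type of stop as soon as the boxed type has a lifetime parameter.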

How do you share implementation details in a functional language like rust?

I sometimes find myself writing abstract classes with partial implementation in C#:
abstract public class Executor {
abstract protected bool Before();
abstract protected bool During();
abstract protected bool After();
protected bool Execute() {
var success = false;
if (Before()) {
if (During()) {
if (After()) {
success = true;
return success;
Notwithstanding the wisdom of such a control structure, how would I accomplish this (partial shared implementation) in a functional language like rust?
Using default methods on traits is one way (and will probably/hopefully be the idiomatic way in the future; until recently, the struct-with-closures method @Slartibartfast demonstrates was the only thing that actually worked):
trait Executable {
    fn before(&self) -> bool;
    fn during(&self) -> bool;
    fn after(&self) -> bool;
    fn execute(&self) -> bool {
        self.before() && self.during() && self.after()
    }
}
impl Executable for i32 {
    fn before(&self) -> bool { *self < 10 }
    fn during(&self) -> bool { *self < 5 }
    fn after(&self) -> bool { *self < 0 }
    // execute is automatically supplied, if it is not implemented here
}
Note that it is possible for an implementation of Executable to override execute at the moment (I've opened an issue about a #[no_override] attribute that would disable this).
Also, default methods are experimental and prone to crashing the compiler (yes, more so than the rest of Rust), but they are improving quickly.
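The point about overriding can be seen in a short sketch written against current Rust. The types `Strict` and `AlwaysRuns` are made up for illustration.

```rust
trait Executable {
    fn before(&self) -> bool;
    fn during(&self) -> bool;
    fn after(&self) -> bool;

    /// Shared implementation; stops at the first step that fails.
    fn execute(&self) -> bool {
        self.before() && self.during() && self.after()
    }
}

/// Uses the default `execute`.
struct Strict(i32);

impl Executable for Strict {
    fn before(&self) -> bool { self.0 < 10 }
    fn during(&self) -> bool { self.0 < 5 }
    fn after(&self) -> bool { self.0 < 0 }
}

/// Replaces the shared implementation with its own.
struct AlwaysRuns;

impl Executable for AlwaysRuns {
    fn before(&self) -> bool { false }
    fn during(&self) -> bool { false }
    fn after(&self) -> bool { false }
    // Nothing prevents an implementor from overriding the default.
    fn execute(&self) -> bool { true }
}
```

`Strict(-1).execute()` passes all three checks through the default method, while `AlwaysRuns.execute()` returns true even though every step would fail.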
I'm not within reach of a rust compiler, so forgive broken code.
On a functional side of things, you could make a struct that holds three functions and invoke them
struct Execution {
    before: fn() -> bool,
    during: fn() -> bool,
    after: fn() -> bool,
}
fn execute(e: Execution) -> bool {
    (e.before)() && (e.during)() && (e.after)()
}
but once you have a function as a first-class value, you could pass, say, a list of boolean functions to check against instead of a fixed three, or something else depending on what you are trying to achieve.
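As a rough sketch of the list-of-functions variant in current Rust (the names `execute_all`, `succeed` and `fail` are placeholders):

```rust
/// Runs the steps in order and stops at the first one that returns false.
fn execute_all(steps: &[fn() -> bool]) -> bool {
    steps.iter().all(|step| step())
}

fn succeed() -> bool {
    true
}

fn fail() -> bool {
    false
}
```

`execute_all(&[succeed, fail, succeed])` returns false without calling the third step, which mirrors the nested ifs in the C# version; an empty list counts as success.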
On a rust side of things, you can make it more "object oriented" by using traits
trait Executable {
fn execute(&self);
impl Execution {
fn execute(&self) {
