I'm trying to write a parallel data loader for deep learning in Rust. The task is to write an iterator that under the hood does the following
Reads files from disk and applies some compute-heavy preprocessing to them, the result is generally a numeric array (or multiple)
Groups the results of the previous step into batches of size B and "collates" them - this generally means just concatenating the arrays - moderately compute heavy
Yields the results from step 2.
Step 1 can be both IO and compute bound, depending on network latency, size of files and complexity of preprocessing. It has to be run in parallel by many workers. Step 2 should be off the main thread but likely doesn't need a pool of workers. Step 3 happens on main thread (exposed to Python).
The reason I write it in Rust is that Python offers two options: pure Python implementation shipped with PyTorch, based on multiprocessing, which is somewhat slow but very flexible (arbitrary user-defined data preprocessing and batching) and C++ implementation shipped with Tensorflow, which is assembled by the user from a set of predefined primitives. The latter is substantially faster but too restrictive for the kinds of data processing I wish to do. I expect that Rust will give me the speed of Tensorflow with flexibility of arbitrary code as in PyTorch.
My question is purely about the way to implement parallelism. The ideal setup is to have N workers for step 1) -> channel -> worker for step 2) -> channel -> step 3. Because the iterator object may be dropped at any time, there is a strict requirement to be able to terminate the whole scheme after Drop. On the other hand, there is the flexibility of loading the files in an arbitrary order: for example if the batch size B == 16 and max_n_threads == 32, it is perfectly fine to start 32 workers and yield the first batch containing the 16 examples which happen to return first. This can be exploited for speed.
My naive implementation creates the DataLoader in 3 steps:
Create a n_working: Arc<AtomicUsize> to control the number of worker threads active and should_shutdown: Arc<AtomicBool> to signal shutdown (when Drop is called)
Create a thread responsible for maintaining the pool. It spins on n_working < max_n_threads and keeps spawning worker threads which terminate on should_shutdown, otherwise fetch a single example, send it down the worker->batcher channel and decrement n_working
Create a batching thread which polls the worker->batcher channel, upon receiving B objects concatenates them into a batch and sends down the batcher->yielder channel
#[pyclass]
struct DataLoader {
collate_worker: Option<thread::JoinHandle<()>>,
example_worker: Option<thread::JoinHandle<()>>,
should_shut_down: Arc<AtomicBool>,
receiver: Receiver<Batch>,
length: usize,
}
impl DataLoader {
fn new(
dataset: Dataset,
batch_size: usize,
capacity: usize,
) -> Self {
let n_batches = dataset.len() / batch_size;
let max_n_threads = capacity * batch_size;
let (example_sender, collate_receiver) = bounded((batch_size - 1) * capacity);
let should_shut_down = Arc::new(AtomicBool::new(false));
let shutdown_flag = should_shut_down.clone();
let example_worker = thread::spawn(move || {
rayon::scope_fifo(|s| {
let dataset = &dataset;
let n_working = Arc::new(AtomicUsize::new(0));
let mut current_index = 0;
while current_index < n_batches * batch_size {
if n_working.load(Ordering::Relaxed) == max_n_threads {
continue;
}
if shutdown_flag.load(Ordering::Relaxed) {
break;
}
let index = current_index.clone();
let sender = example_sender.clone();
let counter = n_working.clone();
let shutdown_flag = shutdown_flag.clone();
s.spawn_fifo(move |_s| {
let example = dataset.get_example(index);
if !shutdown_flag.load(Ordering::Relaxed) {
_ = sender.send(example);
} // if we should shut down, skip sending
counter.fetch_sub(1, Ordering::Relaxed);
});
current_index += 1;
n_working.fetch_add(1, Ordering::Relaxed);
};
});
});
let (batch_sender, final_receiver) = bounded(capacity);
let shutdown_flag = should_shut_down.clone();
let collate_worker = thread::spawn(move || {
'outer: loop {
let mut batch = vec![];
for _ in 0..batch_size {
if let Ok(example) = collate_receiver.recv() {
batch.push(example);
} else {
break 'outer;
}
};
let collated = collate(batch);
if shutdown_flag.load(Ordering::Relaxed) {
break; // skip sending
}
_ = batch_sender.send(collated);
};
});
Self {
collate_worker: Some(collate_worker),
example_worker: Some(example_worker),
should_shut_down: should_shut_down,
receiver: final_receiver,
length: n_batches,
}
}
}
#[pymethods]
impl DataLoader {
fn __iter__(slf: PyRef<Self>) -> PyRef<Self> { slf }
fn __next__(&mut self) -> Option<Batch> {
self.receiver.recv().ok()
}
fn __len__(&self) -> usize {
self.length
}
}
impl Drop for DataLoader {
fn drop(&mut self) {
self.should_shut_down.store(true, Ordering::Relaxed);
if self.collate_worker.take().unwrap().join().is_err() {
println!("Panic in collate worker");
};
if self.example_worker.take().unwrap().join().is_err() {
println!("Panic in example_worker");
};
println!("dropped the dataloader");
}
}
This implementation works and roughly matches the performance of PyTorch but provides no significant speedup. I don't know where to look for improvements, but I imagine it would help to have the thing load-balance automatically in a work-stealing way and to flexibly spawn workers depending on the proportion of IO and compute time. I am also expecting performance issues due to the spinning pool manager and likely corner cases in my handling of Drop.
My question is how to best approach the problem. I am generally unsure if this should be tackled with parallel crates like rayon, async crates like tokio, or a mix of both. I also have the hunch my implementation could be much simpler with the correct use of their combinators/higher order APIs. I tried with rayon but I couldn't get a solution which doesn't wastefully enforce the original sequential returning order and respects the Drop requirement.
Okay I think I've figured out a solution for you that uses rayon parallel iterators.
The trick is to use Results in the rayon iterators, and return Err if the cancellation flag is set.
I first created a utility type to create a cancellable thread in which you can execute rayon iterators. You use it by passing in the thread closure which takes the atomic cancellation token as a parameter. Then you have to check if the cancellation token is true, and if so, exit early.
use std::sync::Arc;
use std::sync::atomic::{Ordering, AtomicBool};
use std::thread::JoinHandle;
fn collate(batch: &[Computed]) -> Batch {
batch.iter().map(|&x| i128::from(x)).sum()
}
#[derive(Debug)]
struct Cancelled;
struct CancellableThread<Output: Send + 'static> {
cancel_token: Arc<AtomicBool>,
thread: Option<JoinHandle<Result<Output, Cancelled>>>,
}
impl<Output: Send + 'static> CancellableThread<Output> {
fn new<F: FnOnce(Arc<AtomicBool>) -> Result<Output, Cancelled> + Send + 'static>(init: F) -> Self {
let cancel_token = Arc::new(AtomicBool::new(false));
let thread_cancel_token = Arc::clone(&cancel_token);
CancellableThread {
thread: Some(std::thread::spawn(move || init(thread_cancel_token))),
cancel_token,
}
}
fn output(mut self) -> Output {
self.thread.take().unwrap().join().unwrap().unwrap()
}
}
impl<Output: Send + 'static> Drop for CancellableThread<Output> {
fn drop(&mut self) {
self.cancel_token.store(true, Ordering::Relaxed);
if let Some(thread) = self.thread.take() {
let _ = thread.join().unwrap();
}
}
}
I found it useful to create a closure that returns a Result<(), Cancelled> so I could use the try operator (?) to exit early.
CancellableThread::new(move |cancel_token| {
let cancelled = || if cancel_token.load(Ordering::Relaxed) {
Err(Cancelled)
} else {
Ok(())
};
loop {
// was the thread dropped?
// if so, stop what we're doing
cancelled?;
// do stuff and
// eventually return a result
}
});
I then used that CancellableThread abstraction in the DataLoader. No need to create a special Drop impl for it, because by default, it will call drop on each field anyways, which will handle the cancellation.
type Data = Vec<u8>;
type Dataset = Vec<Data>;
type Computed = u64;
type Batch = i128;
use rayon::prelude::*;
use crossbeam::channel::{unbounded, Receiver};
struct DataLoader {
example_worker: CancellableThread<()>,
collate_worker: CancellableThread<()>,
receiver: Receiver<Batch>,
length: usize,
}
I used unbounded channels, as it was one less thing to bother about. It shouldn't be hard to switch to bounded ones instead.
impl DataLoader {
fn new(dataset: Dataset, batch_size: usize) -> Self {
let (example_sender, collate_receiver) = unbounded();
let (batch_sender, final_receiver) = unbounded();
I'm not sure if you can always guarantee that the number of items in your dataset will be a multiple of the batch_size, so I decided to handle that explicitly.
let length = if dataset.len() % batch_size == 0 {
dataset.len() / batch_size
} else {
dataset.len() / batch_size + 1
};
I created the collating worker first, though that may not be necessary. As you can see, I had to duplicate a little bit to handle partial batches.
let collate_worker = CancellableThread::new(move |cancel_token| {
let cancelled = || if cancel_token.load(Ordering::Relaxed) {
Err(Cancelled)
} else {
Ok(())
};
'outer: loop {
let mut batch = Vec::with_capacity(batch_size);
for _ in 0..batch_size {
cancelled()?;
if let Ok(data) = collate_receiver.recv() {
batch.push(data);
} else {
if !batch.is_empty() {
// handle the last batch, if there
// weren't enough items to fill it
let collated = collate(&batch);
cancelled()?;
batch_sender.send(collated).unwrap();
}
break 'outer;
}
}
let collated = collate(&batch);
cancelled()?;
batch_sender.send(collated).unwrap();
}
Ok(())
});
The example worker is where things are really made much simpler, because we can just use rayon parallel iterators. As you can see, we check for cancellation before each heavy computation.
let example_worker = CancellableThread::new(move |cancel_token| {
let cancelled = || if cancel_token.load(Ordering::Relaxed) {
Err(Cancelled)
} else {
Ok(())
};
let heavy_compute = |data: Data| -> Result<Computed, Cancelled> {
cancelled()?;
Ok(data.iter().map(|&x| u64::from(x)).product())
};
dataset
.into_par_iter()
.map(heavy_compute)
.try_for_each(|computed| {
example_sender.send(computed?).unwrap();
Ok(())
})
});
Then we just construct the DataLoader. You can see the Python impl is identical:
DataLoader {
example_worker,
collate_worker,
receiver: final_receiver,
length,
}
}
}
// #[pymethods]
impl DataLoader {
fn __iter__(this: Self /* PyRef<Self> */) -> Self /* PyRef<Self> */ { this }
fn __next__(&mut self) -> Option<Batch> {
self.receiver.recv().ok()
}
fn __len__(&self) -> usize {
self.length
}
}
playground
Related
My rust code is like below.
#[tokio::main]
pub async fn main() {
for i in 1..10 {
tokio::spawn(async move {
println!("{}", i);
});
}
}
When run the code, I expect it to print 1 to 10 in a random sequence.
But it just print some random numbers:
1
3
2
Terminal will be reused by tasks, press any key to close it.
Why this is happenning?
https://docs.rs/tokio/latest/tokio/fn.spawn.html warns that:
There is no guarantee that a spawned task will execute to completion. When a runtime is shutdown, all outstanding tasks are dropped, regardless of the lifecycle of that task.
One solution that should work is to store all of the JoinHandles and then await all of them:
let mut join_handles = Vec::with_capacity(10);
for i in 1..10 {
join_handles.push(tokio::spawn(async move {
println!("{}", i);
}));
}
for join_handle in join_handles {
join_handle.await.unwrap();
}
P.S. In 1..10, the end is exclusive, so the last number is 9. You might want 1..=10 instead. (see https://doc.rust-lang.org/reference/expressions/range-expr.html)
How do you create an async recursive function that takes a mutex? Rust claims that this code holds a mutex across an await point. However, the value is dropped before the .await.
#[async_recursion]
async fn f(mutex: &Arc<Mutex<u128>>) {
let mut unwrapped = mutex.lock().unwrap();
*unwrapped += 1;
let value = *unwrapped;
drop(unwrapped);
tokio::time::sleep(tokio::time::Duration::from_millis(1000)).await;
if value < 100 {
f(mutex);
}
}
Error
future cannot be sent between threads safely
within `impl futures::Future<Output = ()>`, the trait `std::marker::Send` is not implemented for `std::sync::MutexGuard<'_, u128>`
required for the cast to the object type `dyn futures::Future<Output = ()> + std::marker::Send`rustc
lib.rs(251, 65): future is not `Send` as this value is used across an await
In this case, you can restructure the code to make it so unwrapped can't be used across an await:
let value = {
let mut unwrapped = mutex.lock().unwrap();
*unwrapped += 1;
*unwrapped
};
tokio::time::sleep(tokio::time::Duration::from_millis(1000)).await;
if value < 100 {
f(mutex);
}
If you weren't able to do this, then you'd need to make it so you don't return a Future that implements Send. The async_recursion docs specify an option you can pass to the macro to disable the Send bound it adds:
#[async_recursion(?Send)]
async fn f(mutex: &Arc<Mutex<u128>>) {
...
(playground)
You wouldn't be able to send such a Future across threads though.
I was unsure if I should post this here or in code review.
Code review seems to have only functioning code.
So I've a multitude of problems I don't really understand.
(I’m a noob) full code can be found here: https://github.com/NicTanghe/winder/blob/main/src/main.rs
main problem is here:
let temp = location_loc1.parent().unwrap();
location_loc1.push(&temp);
I’ve tried various things to get around problems with borrowing as mutable or as reference,
and I can’t seem to get it to work.
I just get a different set of errors with everything I try.
Furthermore, I'm sorry if this is a duplicate, but looking for separate solutions to the errors just gave me a different error. In a circle.
Full function
async fn print_events(mut selector_loc1:i8, location_loc1: PathBuf) {
let mut reader = EventStream::new();
loop {
//let delay = Delay::new(Duration::from_millis(1_000)).fuse();
let mut event = reader.next().fuse();
select! {
// _ = delay => {
// print!("{esc}[2J{esc}[1;1H{}", esc = 27 as char,);
// },
maybe_event = event => {
match maybe_event {
Some(Ok(event)) => {
//println!("Event::{:?}\r", event);
// if event == Event::Mouse(MouseEvent::Up("Left").into()) {
// println!("Cursor position: {:?}\r", position());
// }
print!("{esc}[2J{esc}[1;1H{}", esc = 27 as char,);
if event == Event::Key(KeyCode::Char('k').into()) {
if selector_loc1 > 0 {
selector_loc1 -= 1;
};
//println!("go down");
//println!("{}",selected)
} else if event == Event::Key(KeyCode::Char('j').into()) {
selector_loc1 += 1;
//println!("go up");
//println!("{}",selected)
} else if event == Event::Key(KeyCode::Char('h').into()) {
//-----------------------------------------
//-------------BackLogic-------------------
//-----------------------------------------
let temp = location_loc1.parent().unwrap();
location_loc1.push(&temp);
//------------------------------------------
//------------------------------------------
} else if event == Event::Key(KeyCode::Char('l').into()) {
//go to next dir
} if event == Event::Key(KeyCode::Esc.into()) {
break;
}
printtype(location_loc1,selector_loc1);
}
Some(Err(e)) => println!("Error: {:?}\r", e),
None => break,
}
}
};
}
}
also, it seems using
use async_std::path::{Path, PathBuf};
makes the rust not recognize unwrap() function → how would I use using ?
There are two problems with your code.
Your PathBuf is immutable. It's not possible to modify immutable objects, unless they support interior mutability. PathBuf does not. Therefore you have to make your variable mutable. You can either add mut in front of it like that:
async fn print_events(mut selector_loc1:i8, mut location_loc1: PathBuf) {
Or you can re-bind it:
let mut location_loc1 = location_loc1;
You cannot have borrow it both mutable and immutably - the mutable borrows are exclusive! Given that the method .parent() borrows the buffer, you have to create a temporary owned value:
// the PathBuf instance
let mut path = PathBuf::from("root/parent/child");
// notice the .map(|p| p.to_owned()) method - it helps us avoid the immutable borrow
let parent = path.parent().map(|p| p.to_owned()).unwrap();
// now it's fine to modify it, as it's not borrowed
path.push(parent);
Your second question:
also, it seems using use async_std::path::{Path, PathBuf}; makes the rust not recognize unwrap() function → how would I use using ?
The async-std version is just a wrapper over std's PathBuf. It just delegates to the standard implementation, so it should not behave differently
// copied from async-std's PathBuf implementation
pub struct PathBuf {
inner: std::path::PathBuf,
}
I'm trying to send and receive simultaneously to a multicast IP with Rust.
use futures::executor::block_on;
use async_std::task;
use std::{net::{UdpSocket, Ipv4Addr}, time::{Duration, Instant}};
fn main() {
let future = async_main();
block_on(future);
}
async fn async_main() {
let mut socket = UdpSocket::bind("0.0.0.0:8888").unwrap();
let multi_addr = Ipv4Addr::new(234, 2, 2, 2);
let inter = Ipv4Addr::new(0,0,0,0);
socket.join_multicast_v4(&multi_addr,&inter);
let async_one = first(&socket);
let async_two = second(&socket);
futures::join!(async_one, async_two);
}
async fn first(socket: &std::net::UdpSocket) {
let mut buf = [0u8; 65535];
let now = Instant::now();
loop {
if now.elapsed().as_secs() > 10 { break; }
let (amt, src) = socket.recv_from(&mut buf).unwrap();
println!("received {} bytes from {:?}", amt, src);
}
}
async fn second(socket: &std::net::UdpSocket) {
let now = Instant::now();
loop {
if now.elapsed().as_secs() > 10 { break; }
socket.send_to(String::from("h").as_bytes(), "234.2.2.2:8888").unwrap();
}
}
The issue with this is first it runs the receive function and then it runs the send function, it never sends and receives simultaneously. With Golang I can do this with Goroutines but I'm finding this quite difficult in Rust.
I'm not very experienced with async in Rust, but your first() and second() functions don't appear to have any asynchronous calls in them -- in other words, there are not any calls that use .await. My understanding is that if nothing is awaited, then the functions will run synchronously, and I believe you get a compiler warning about it as well.
It doesn't look like std::net::UdpSocket provides any async methods that can be awaited, and you need to use async_std::net::UdpSocket instead.
I have a loop:
let grace = 2usize;
for i in 0..100 {
if i % 10 == 0 {
expensive_function()
} else {
cheap_function()
}
}
The goal is that when it hits expensive_function(), it runs asynchronously and allows grace number of further iterations until waiting on expensive_function().
If expensive_function() triggers at iteration 10, it could then run iterations 11 and 12 before needing to wait for the expensive_function() run on iteration 10 to finish to continue.
How could I do this?
In my case expensive_function() is effectively:
fn expensive_function(&b) -> Vec<_> {
return b.iter().map(|a| a.inner_expensive_function()).collect();
}
As such I plan to use multi-threading within this function.
When you start the expensive computation, store the resulting future in a variable, along with the deadline time to wait for the result. Here, I use an Option of a tuple:
use std::{thread, time::Duration};
use tokio::task; // 0.2.21, features = ["full"]
#[tokio::main]
async fn main() {
let grace_period = 2usize;
let mut pending = None;
for i in 0..50 {
if i % 10 == 0 {
assert!(pending.is_none(), "Already had pending work");
let future = expensive_function(i);
let deadline = i + grace_period;
pending = Some((deadline, future));
} else {
cheap_function(i);
}
if let Some((deadline, future)) = pending.take() {
if i == deadline {
future.await.unwrap();
} else {
pending = Some((deadline, future));
}
}
}
}
fn expensive_function(n: usize) -> task::JoinHandle<()> {
task::spawn_blocking(move || {
println!("expensive_function {} start", n);
thread::sleep(Duration::from_millis(500));
println!("expensive_function {} done", n);
})
}
fn cheap_function(n: usize) {
println!("cheap_function {}", n);
thread::sleep(Duration::from_millis(1));
}
This generates the output of
cheap_function 1
expensive_function 0 start
cheap_function 2
expensive_function 0 done
cheap_function 3
cheap_function 4
cheap_function 5
Since you did not provide definitions of expensive_function and cheap_function, I have provided appropriate ones.
One tricky thing here is that I needed to add the sleep call in the cheap_function. Without it, my OS never schedules the expensive thread until it's time to poll it, effectively removing any parallel work. In a larger program, the OS is likely to schedule the thread simply because more work will be done by cheap_function. You might also be able to use thread::yield_now to the same effect.
See also:
How to create a dedicated threadpool for CPU-intensive work in Tokio?
How do I synchronously return a value calculated in an asynchronous Future in stable Rust?
What is the best approach to encapsulate blocking I/O in future-rs?