Efficiently retain a range of Vec elements

I want a method like this:
trait RetainRange {
    fn retain_range(&mut self, range: std::ops::Range<usize>);
}

impl<T> RetainRange for Vec<T> {
    fn retain_range(&mut self, range: std::ops::Range<usize>) {
        // Retain only the elements within the given range.
        let mut i = 0usize;
        self.retain(|_el| {
            let r = range.contains(&i);
            i += 1;
            r
        });
    }
}
But calling a closure and range.contains() for every element seems like it might be inefficient. Is there a better way?

This code generates what appears to be more efficient assembly.
fn retain_range(&mut self, range: std::ops::Range<usize>) {
    self.truncate(range.end);
    if range.start < self.len() {
        self.drain(0..range.start);
    } else {
        self.clear();
    }
}
Adding the if range.start < self.len() check avoids a panic when the start of the range is past the end of the (already truncated) Vec, and it actually improves the generated assembly as well.
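As a quick sanity check, here is a usage sketch of the truncate/drain version (the expected values are worked out by hand, not taken from the original post):

fn main() {
    let mut v: Vec<i32> = (0..10).collect();
    v.retain_range(3..6);
    assert_eq!(v, [3, 4, 5]); // truncate(6) keeps 0..6, then drain(0..3) removes the prefix

    // A range starting past the end just clears the Vec instead of panicking in drain().
    let mut w: Vec<i32> = (0..4).collect();
    w.retain_range(7..9);
    assert!(w.is_empty());
}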

Parallel work stealing in arbitrary order in Rust

I'm trying to write a parallel data loader for deep learning in Rust. The task is to write an iterator that under the hood does the following
1. Reads files from disk and applies some compute-heavy preprocessing to them; the result is generally a numeric array (or multiple arrays).
2. Groups the results of the previous step into batches of size B and "collates" them - this generally means just concatenating the arrays - moderately compute-heavy.
3. Yields the results from step 2.
Step 1 can be both IO and compute bound, depending on network latency, size of files and complexity of preprocessing. It has to be run in parallel by many workers. Step 2 should be off the main thread but likely doesn't need a pool of workers. Step 3 happens on main thread (exposed to Python).
The reason I write it in Rust is that Python offers two options: a pure Python implementation shipped with PyTorch, based on multiprocessing, which is somewhat slow but very flexible (arbitrary user-defined data preprocessing and batching), and a C++ implementation shipped with TensorFlow, which is assembled by the user from a set of predefined primitives. The latter is substantially faster but too restrictive for the kinds of data processing I wish to do. I expect that Rust will give me the speed of TensorFlow with the flexibility of arbitrary code, as in PyTorch.
My question is purely about the way to implement parallelism. The ideal setup is to have N workers for step 1) -> channel -> worker for step 2) -> channel -> step 3. Because the iterator object may be dropped at any time, there is a strict requirement to be able to terminate the whole scheme after Drop. On the other hand, there is the flexibility of loading the files in an arbitrary order: for example if the batch size B == 16 and max_n_threads == 32, it is perfectly fine to start 32 workers and yield the first batch containing the 16 examples which happen to return first. This can be exploited for speed.
My naive implementation creates the DataLoader in 3 steps:
1. Create an n_working: Arc<AtomicUsize> to control the number of active worker threads and a should_shutdown: Arc<AtomicBool> to signal shutdown (when Drop is called).
2. Create a thread responsible for maintaining the pool. It spins on n_working < max_n_threads and keeps spawning worker threads, which terminate on should_shutdown and otherwise fetch a single example, send it down the worker->batcher channel, and decrement n_working.
3. Create a batching thread which polls the worker->batcher channel and, upon receiving B objects, concatenates them into a batch and sends it down the batcher->yielder channel.
#[pyclass]
struct DataLoader {
    collate_worker: Option<thread::JoinHandle<()>>,
    example_worker: Option<thread::JoinHandle<()>>,
    should_shut_down: Arc<AtomicBool>,
    receiver: Receiver<Batch>,
    length: usize,
}

impl DataLoader {
    fn new(
        dataset: Dataset,
        batch_size: usize,
        capacity: usize,
    ) -> Self {
        let n_batches = dataset.len() / batch_size;
        let max_n_threads = capacity * batch_size;
        let (example_sender, collate_receiver) = bounded((batch_size - 1) * capacity);
        let should_shut_down = Arc::new(AtomicBool::new(false));
        let shutdown_flag = should_shut_down.clone();
        let example_worker = thread::spawn(move || {
            rayon::scope_fifo(|s| {
                let dataset = &dataset;
                let n_working = Arc::new(AtomicUsize::new(0));
                let mut current_index = 0;
                while current_index < n_batches * batch_size {
                    if n_working.load(Ordering::Relaxed) == max_n_threads {
                        continue;
                    }
                    if shutdown_flag.load(Ordering::Relaxed) {
                        break;
                    }
                    let index = current_index;
                    let sender = example_sender.clone();
                    let counter = n_working.clone();
                    let shutdown_flag = shutdown_flag.clone();
                    s.spawn_fifo(move |_s| {
                        let example = dataset.get_example(index);
                        if !shutdown_flag.load(Ordering::Relaxed) {
                            let _ = sender.send(example);
                        } // if we should shut down, skip sending
                        counter.fetch_sub(1, Ordering::Relaxed);
                    });
                    current_index += 1;
                    n_working.fetch_add(1, Ordering::Relaxed);
                }
            });
        });
        let (batch_sender, final_receiver) = bounded(capacity);
        let shutdown_flag = should_shut_down.clone();
        let collate_worker = thread::spawn(move || {
            'outer: loop {
                let mut batch = vec![];
                for _ in 0..batch_size {
                    if let Ok(example) = collate_receiver.recv() {
                        batch.push(example);
                    } else {
                        break 'outer;
                    }
                }
                let collated = collate(batch);
                if shutdown_flag.load(Ordering::Relaxed) {
                    break; // skip sending
                }
                let _ = batch_sender.send(collated);
            }
        });
        Self {
            collate_worker: Some(collate_worker),
            example_worker: Some(example_worker),
            should_shut_down,
            receiver: final_receiver,
            length: n_batches,
        }
    }
}
#[pymethods]
impl DataLoader {
    fn __iter__(slf: PyRef<Self>) -> PyRef<Self> {
        slf
    }
    fn __next__(&mut self) -> Option<Batch> {
        self.receiver.recv().ok()
    }
    fn __len__(&self) -> usize {
        self.length
    }
}
impl Drop for DataLoader {
    fn drop(&mut self) {
        self.should_shut_down.store(true, Ordering::Relaxed);
        if self.collate_worker.take().unwrap().join().is_err() {
            println!("Panic in collate worker");
        }
        if self.example_worker.take().unwrap().join().is_err() {
            println!("Panic in example_worker");
        }
        println!("dropped the dataloader");
    }
}
This implementation works and roughly matches the performance of PyTorch but provides no significant speedup. I don't know where to look for improvements, but I imagine it would help to have the thing load-balance automatically in a work-stealing way and to flexibly spawn workers depending on the proportion of IO and compute time. I am also expecting performance issues due to the spinning pool manager and likely corner cases in my handling of Drop.
My question is how to best approach the problem. I am generally unsure if this should be tackled with parallel crates like rayon, async crates like tokio, or a mix of both. I also have a hunch that my implementation could be much simpler with the correct use of their combinators/higher-order APIs. I tried with rayon but I couldn't get a solution which doesn't wastefully enforce the original sequential returning order and respects the Drop requirement.
Okay I think I've figured out a solution for you that uses rayon parallel iterators.
The trick is to use Results in the rayon iterators, and return Err if the cancellation flag is set.
I first created a utility type to create a cancellable thread in which you can execute rayon iterators. You use it by passing in the thread closure which takes the atomic cancellation token as a parameter. Then you have to check if the cancellation token is true, and if so, exit early.
use std::sync::Arc;
use std::sync::atomic::{Ordering, AtomicBool};
use std::thread::JoinHandle;

fn collate(batch: &[Computed]) -> Batch {
    batch.iter().map(|&x| i128::from(x)).sum()
}

#[derive(Debug)]
struct Cancelled;

struct CancellableThread<Output: Send + 'static> {
    cancel_token: Arc<AtomicBool>,
    thread: Option<JoinHandle<Result<Output, Cancelled>>>,
}

impl<Output: Send + 'static> CancellableThread<Output> {
    fn new<F: FnOnce(Arc<AtomicBool>) -> Result<Output, Cancelled> + Send + 'static>(init: F) -> Self {
        let cancel_token = Arc::new(AtomicBool::new(false));
        let thread_cancel_token = Arc::clone(&cancel_token);
        CancellableThread {
            thread: Some(std::thread::spawn(move || init(thread_cancel_token))),
            cancel_token,
        }
    }

    fn output(mut self) -> Output {
        self.thread.take().unwrap().join().unwrap().unwrap()
    }
}

impl<Output: Send + 'static> Drop for CancellableThread<Output> {
    fn drop(&mut self) {
        self.cancel_token.store(true, Ordering::Relaxed);
        if let Some(thread) = self.thread.take() {
            let _ = thread.join().unwrap();
        }
    }
}
I found it useful to create a closure that returns a Result<(), Cancelled> so I could use the try operator (?) to exit early.
CancellableThread::new(move |cancel_token| {
    let cancelled = || if cancel_token.load(Ordering::Relaxed) {
        Err(Cancelled)
    } else {
        Ok(())
    };
    loop {
        // was the thread dropped?
        // if so, stop what we're doing
        cancelled()?;
        // do stuff and
        // eventually return a result
    }
});
I then used that CancellableThread abstraction in the DataLoader. There is no need to create a special Drop impl for it, because by default it will call drop on each field anyway, which handles the cancellation.
type Data = Vec<u8>;
type Dataset = Vec<Data>;
type Computed = u64;
type Batch = i128;

use rayon::prelude::*;
use crossbeam::channel::{unbounded, Receiver};

struct DataLoader {
    example_worker: CancellableThread<()>,
    collate_worker: CancellableThread<()>,
    receiver: Receiver<Batch>,
    length: usize,
}
I used unbounded channels, as it was one less thing to bother about. It shouldn't be hard to switch to bounded ones instead.
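For instance, a possible swap (this assumes you reintroduce a capacity parameter like the one in the question's version; the sizing is only illustrative):

use crossbeam::channel::bounded;

// Back-pressure: at most `capacity` batches worth of examples in flight,
// and at most `capacity` collated batches waiting to be yielded.
let (example_sender, collate_receiver) = bounded(batch_size * capacity);
let (batch_sender, final_receiver) = bounded(capacity);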
impl DataLoader {
    fn new(dataset: Dataset, batch_size: usize) -> Self {
        let (example_sender, collate_receiver) = unbounded();
        let (batch_sender, final_receiver) = unbounded();
I'm not sure if you can always guarantee that the number of items in your dataset will be a multiple of the batch_size, so I decided to handle that explicitly.
        let length = if dataset.len() % batch_size == 0 {
            dataset.len() / batch_size
        } else {
            dataset.len() / batch_size + 1
        };
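On Rust 1.73 or later, usize::div_ceil expresses the same rounding-up division directly:

        let length = dataset.len().div_ceil(batch_size);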
I created the collating worker first, though that may not be necessary. As you can see, I had to duplicate a little bit to handle partial batches.
        let collate_worker = CancellableThread::new(move |cancel_token| {
            let cancelled = || if cancel_token.load(Ordering::Relaxed) {
                Err(Cancelled)
            } else {
                Ok(())
            };
            'outer: loop {
                let mut batch = Vec::with_capacity(batch_size);
                for _ in 0..batch_size {
                    cancelled()?;
                    if let Ok(data) = collate_receiver.recv() {
                        batch.push(data);
                    } else {
                        if !batch.is_empty() {
                            // handle the last batch, if there
                            // weren't enough items to fill it
                            let collated = collate(&batch);
                            cancelled()?;
                            batch_sender.send(collated).unwrap();
                        }
                        break 'outer;
                    }
                }
                let collated = collate(&batch);
                cancelled()?;
                batch_sender.send(collated).unwrap();
            }
            Ok(())
        });
The example worker is where things are really made much simpler, because we can just use rayon parallel iterators. As you can see, we check for cancellation before each heavy computation.
        let example_worker = CancellableThread::new(move |cancel_token| {
            let cancelled = || if cancel_token.load(Ordering::Relaxed) {
                Err(Cancelled)
            } else {
                Ok(())
            };
            let heavy_compute = |data: Data| -> Result<Computed, Cancelled> {
                cancelled()?;
                Ok(data.iter().map(|&x| u64::from(x)).product())
            };
            dataset
                .into_par_iter()
                .map(heavy_compute)
                .try_for_each(|computed| {
                    example_sender.send(computed?).unwrap();
                    Ok(())
                })
        });
Then we just construct the DataLoader. You can see the Python impl is identical:
        DataLoader {
            example_worker,
            collate_worker,
            receiver: final_receiver,
            length,
        }
    }
}
// #[pymethods]
impl DataLoader {
    fn __iter__(this: Self /* PyRef<Self> */) -> Self /* PyRef<Self> */ {
        this
    }
    fn __next__(&mut self) -> Option<Batch> {
        self.receiver.recv().ok()
    }
    fn __len__(&self) -> usize {
        self.length
    }
}
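A minimal sketch of driving the loader from plain Rust (using the toy Data/Computed/Batch aliases above, so each example is the product of its bytes; nothing here is PyO3-specific, and the dataset contents are made up for illustration):

fn main() {
    // Ten "files" of three bytes each, batched four at a time.
    let dataset: Dataset = (0u8..10).map(|i| vec![i, i, i]).collect();
    let mut loader = DataLoader::new(dataset, 4);
    // The channel closes once both workers finish, so this loop terminates.
    while let Some(batch) = loader.__next__() {
        println!("got batch: {batch}");
    }
    // Dropping `loader` sets the cancel tokens and joins both workers.
}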

Rust: Joining and iterating over futures' results

I have some code that iterates over objects and uses an async method on each of them sequentially before doing something with the results. I'd like to change it so that the async method calls are joined into a single future before being executed. The important bit below is in HolderStruct::add_squares. My current code looks like this:
use anyhow::Result;

struct AsyncMethodStruct {
    value: u64,
}

impl AsyncMethodStruct {
    fn new(value: u64) -> Self {
        AsyncMethodStruct { value }
    }

    async fn get_square(&self) -> Result<u64> {
        Ok(self.value * self.value)
    }
}

struct HolderStruct {
    async_structs: Vec<AsyncMethodStruct>,
}

impl HolderStruct {
    fn new(async_structs: Vec<AsyncMethodStruct>) -> Self {
        HolderStruct { async_structs }
    }

    async fn add_squares(&self) -> Result<u64> {
        let mut squares = Vec::with_capacity(self.async_structs.len());
        for async_struct in self.async_structs.iter() {
            squares.push(async_struct.get_square().await?);
        }
        let mut sum = 0;
        for square in squares.iter() {
            sum += square;
        }
        Ok(sum)
    }
}
I'd like to change HolderStruct::add_squares to something like this:
use futures::future::join_all;
// [...]

impl HolderStruct {
    async fn add_squares(&self) -> Result<u64> {
        let mut square_futures = Vec::with_capacity(self.async_structs.len());
        for async_struct in self.async_structs.iter() {
            square_futures.push(async_struct.get_square());
        }
        let square_results = join_all(square_futures).await;
        let mut sum = 0;
        for square_result in square_results.iter() {
            sum += square_result?;
        }
        Ok(sum)
    }
}
However, the compiler gives me this error using the above:
error[E0277]: the `?` operator can only be applied to values that implement `std::ops::Try`
--> src/main.rs:46:20
|
46 | sum += square_result?;
| ^^^^^^^^^^^^^^ the `?` operator cannot be applied to type `&std::result::Result<u64, anyhow::Error>`
|
= help: the trait `std::ops::Try` is not implemented for `&std::result::Result<u64, anyhow::Error>`
= note: required by `std::ops::Try::into_result`
How would I change the code to not have this error?
for square_result in square_results.iter()
Lose the iter() call here.
for square_result in square_results
You seem to be under the impression that calling iter() is mandatory to iterate over a collection. Actually, anything that implements IntoIterator can be used in a for loop.
Calling iter() on a Vec<T> goes through its deref to a slice (&[T]) and yields an iterator over references to the vector's elements. The ? operator tries to take the value out of the Result, but that is only possible if you own the Result rather than merely holding a reference to it.
However, if you simply use a vector itself in a for statement, it will use the IntoIterator implementation for Vec<T> which will yield items of type T rather than &T.
square_results.into_iter() does the same thing, albeit more verbosely. It is mostly useful when using iterators in a functional style, a la vector.into_iter().map(|x| x + 1).collect().
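For reference, here is how add_squares compiles with the owning loop (same types as in the question; a sketch, not tied to a particular futures version):

impl HolderStruct {
    async fn add_squares(&self) -> Result<u64> {
        let mut square_futures = Vec::with_capacity(self.async_structs.len());
        for async_struct in self.async_structs.iter() {
            square_futures.push(async_struct.get_square());
        }
        let mut sum = 0;
        // Iterating the Vec by value yields owned Result<u64> items,
        // so `?` can move the value out.
        for square_result in join_all(square_futures).await {
            sum += square_result?;
        }
        Ok(sum)
    }
}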

How to get the index of an element in a vector using pointer arithmetic?

In C you can use the pointer offset to get index of an element within an array, e.g.:
index = element_pointer - &vector[0];
Given a reference to an element in an array, this should be possible in Rust too.
While Rust has the ability to get the memory address from vector elements, convert them to usize, then subtract them - is there a more convenient/idiomatic way to do this in Rust?
There isn't a simpler way. I think the reason is that it would be hard to guarantee that any operation or method that gave you that answer only allowed you to use it with a Vec (or, more likely, a slice) and something inside that collection; Rust wouldn't want to let you call it with a reference into a different vector.
More idiomatic would be to avoid needing to do it in the first place. You can't store references into a Vec anywhere very permanent away from the Vec anyway, due to lifetimes, so you're likely to have the index handy when you've got the reference.
In particular, when iterating, you'd use enumerate() to iterate over (index, &item) pairs.
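For instance (a toy illustration, nothing beyond std):

let v = vec!["a", "b", "c"];
for (index, item) in v.iter().enumerate() {
    println!("{index}: {item}");
}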
So, given the issues people have brought up with the pointer approach, the best way, imho, to do this is:
fn index_of_unchecked<T>(slice: &[T], item: &T) -> usize {
    if std::mem::size_of::<T>() == 0 {
        return 0; // do what you will with this case
    }
    (item as *const _ as usize - slice.as_ptr() as usize)
        / std::mem::size_of::<T>()
}

// note: for zero-sized types here, you
// return Some(0) if item as *const T == slice.as_ptr()
// and None otherwise
fn index_of<T>(slice: &[T], item: &T) -> Option<usize> {
    let ptr = item as *const T;
    let start = slice.as_ptr();
    // `add` is an unsafe fn on raw pointers; computing the
    // one-past-the-end pointer is explicitly allowed.
    let end = unsafe { start.add(slice.len()) };
    if start <= ptr && ptr < end {
        Some(index_of_unchecked(slice, item))
    } else {
        None
    }
}
although, if you want methods:
trait IndexOfExt<T> {
    fn index_of_unchecked(&self, item: &T) -> usize;
    fn index_of(&self, item: &T) -> Option<usize>;
}

impl<T> IndexOfExt<T> for [T] {
    fn index_of_unchecked(&self, item: &T) -> usize {
        // ...
    }
    fn index_of(&self, item: &T) -> Option<usize> {
        // ...
    }
}
and then you'll be able to use these methods on any type that derefs to [T].
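A quick usage sketch, assuming the trait methods are filled in with the free functions above:

fn main() {
    let v = vec![10, 20, 30];
    assert_eq!(v.index_of(&v[1]), Some(1));

    let other = vec![10, 20, 30];
    // A reference into a different vector is rejected by the bounds check.
    assert_eq!(v.index_of(&other[1]), None);
}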

Is it possible to have a closure call itself in Rust?

This is a very simple example, but how would I do something similar to:
let fact = |x: u32| {
    match x {
        0 => 1,
        _ => x * fact(x - 1),
    }
};
I know that this specific example can be easily done with iteration, but I'm wondering if it's possible to make a recursive function in Rust for more complicated things (such as traversing trees) or if I'm required to use my own stack instead.
There are a few ways to do this.
You can put closures into a struct and pass this struct to the closure. You can even define structs inline in a function:
fn main() {
    struct Fact<'s> {
        f: &'s dyn Fn(&Fact, u32) -> u32,
    }
    let fact = Fact {
        f: &|fact, x| if x == 0 { 1 } else { x * (fact.f)(fact, x - 1) },
    };
    println!("{}", (fact.f)(&fact, 5));
}
This gets around the problem of having an infinite type (a function that takes itself as an argument) and the problem that fact isn't yet defined inside the closure itself when one writes let fact = |x| {...} and so one can't refer to it there.
Another option is to just write a recursive function as a fn item, which can also be defined inline in a function:
fn main() {
    fn fact(x: u32) -> u32 {
        if x == 0 { 1 } else { x * fact(x - 1) }
    }
    println!("{}", fact(5));
}
This works fine if you don't need to capture anything from the environment.
One more option is to use the fn item solution but explicitly pass the args/environment you want.
fn main() {
    struct FactEnv {
        base_case: u32,
    }
    fn fact(env: &FactEnv, x: u32) -> u32 {
        if x == 0 { env.base_case } else { x * fact(env, x - 1) }
    }
    let env = FactEnv { base_case: 1 };
    println!("{}", fact(&env, 5));
}
All of these work with Rust 1.17 and have probably worked since version 0.6. fns defined inside other fns are no different from those defined at the top level, except that they are only accessible within the fn they are defined in.
As of Rust 1.62 (July 2022), there's still no direct way to recurse in a closure. As the other answers have pointed out, you need at least a bit of indirection, like passing the closure to itself as an argument, or moving it into a cell after creating it. These things can work, but in my opinion they're kind of gross, and they're definitely hard for Rust beginners to follow. If you want to use recursion but you have to have a closure, for example because you need something that implements FnOnce() to use with thread::spawn, then I think the cleanest approach is to use a regular fn function for the recursive part and to wrap it in a non-recursive closure that captures the environment. Here's an example:
let x = 5;
let fact = || {
    fn helper(arg: u64) -> u64 {
        match arg {
            0 => 1,
            _ => arg * helper(arg - 1),
        }
    }
    helper(x)
};
assert_eq!(120, fact());
Here's a really ugly and verbose solution I came up with:
use std::{
    cell::RefCell,
    rc::{Rc, Weak},
};

fn main() {
    let weak_holder: Rc<RefCell<Weak<dyn Fn(u32) -> u32>>> =
        Rc::new(RefCell::new(Weak::<fn(u32) -> u32>::new()));
    let weak_holder2 = weak_holder.clone();
    let fact: Rc<dyn Fn(u32) -> u32> = Rc::new(move |x| {
        let fact = weak_holder2.borrow().upgrade().unwrap();
        if x == 0 {
            1
        } else {
            x * fact(x - 1)
        }
    });
    weak_holder.replace(Rc::downgrade(&fact));
    println!("{}", fact(5)); // prints "120"
    println!("{}", fact(6)); // prints "720"
}
The advantages of this are that you call the function with the expected signature (no extra arguments needed), it's a closure that can capture variables (by move), it doesn't require defining any new structs, and the closure can be returned from the function or otherwise stored in a place that outlives the scope where it was created (as an Rc<Fn...>) and it still works.
A closure is just a struct with additional context. Therefore, you can achieve recursion like this (suppose you want to compute a factorial with a recursive, mutable accumulator):
#[derive(Default)]
struct Fact {
    ans: i32,
}

impl Fact {
    fn call(&mut self, n: i32) -> i32 {
        if n == 0 {
            self.ans = 1;
            return 1;
        }
        self.call(n - 1);
        self.ans *= n;
        self.ans
    }
}
To use this struct, just:
let mut fact = Fact::default();
let ans = fact.call(5);
