I am implementing a simple B-tree in Rust (as hobby project).
The basic implementation works well.
As long as everything fits in memory, nodes can store direct references to their children and it is easy to traverse and use them.
However, to be able to store more data inside than fits in RAM, it should be possible to store tree nodes on disk and retrieve parts of the tree only when they are needed.
But now it becomes incredibly hard to return a reference to a part of the tree.
For instance, when implementing Node::lookup_value:
extern crate arrayvec;
use arrayvec::ArrayVec;
use std::collections::HashMap;
type NodeId = usize;
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum NodeContents<Value> {
Leaf(ArrayVec<Value, 15>),
Internal(ArrayVec<NodeId, 16>)
}
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Node<Key, Value> {
pub keys: ArrayVec<Key, 15>,
pub contents: NodeContents<Value>,
}
pub trait Pool<T> {
type PoolId;
/// Invariants:
/// - Ony call `lookup` with a key which you *know* exists. This should always be possible, as stuff is never 'simply' removed.
/// - Do not remove or alter the things backing the pool in the meantime (Adding more is obviously allowed!).
fn lookup(&self, key: &Self::PoolId) -> T;
}
#[derive(PartialEq, Eq, PartialOrd, Ord, Clone, Debug)]
pub struct EphemeralPoolId(usize);
// An example of a pool that keeps all nodes in memory
// Similar (more complicated) implementations can be made for pools storing nodes on disk, remotely, etc.
pub struct EphemeralPool<T> {
next_id: usize,
contents: HashMap<usize, T>,
}
impl<T: Clone> Pool<T> for EphemeralPool<T> {
type PoolId = EphemeralPoolId;
fn lookup(&self, key: &Self::PoolId) -> T {
let val = self.contents.get(&key.0).expect("Tried to look up a value in the pool which does not exist. This breaks the invariant that you should only use it with a key which you received earlier from writing to it.").clone();
val
}
}
impl<Key: Ord, Value> Node<Key, Value> {
pub fn lookup_value<'pool>(&self, key: &Key, pool: &'pool impl Pool<Node<Key, Value>, PoolId=NodeId>) -> Option<&Value> {
match &self.contents {
NodeContents::Leaf(values) => { // Simple base case. No problems here.
self.keys.binary_search(key).map(|index| &values[index]).ok()
},
NodeContents::Internal(children) => {
let index = self.keys.binary_search(key).map(|index| index + 1).unwrap_or_else(|index| index);
let child_id = &children[index];
// Here the borrow checker gets mad:
let child = pool.lookup(child_id); // NodePool::lookup(&self, NodeId) -> Node<K, V>
child.lookup_value(key, pool)
}
}
}
}
Rust Playground
The borrow checker gets mad.
I understand why, but not why to solve it.
It gets mad, because NodePool::lookup returns a node by value. (And it has to do so, because it reads it from disk. In the future there may be a cache in the middle, but elements from this cache might be removed at any time, so returning references to elements in this cache is also not possible.)
However, lookup_value returns a reference to a tiny part of this value. But the value stops being around once the function returns.
In the Borrow Checker's parlance:
child.lookup_value(key, pool)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ returns a reference to data owned by the current function
How to solve this problem?
The simplest solution would be to change the function in general to always return by-value (Option<V>). However, these values are often large and I would like to refrain from needless copies. It is quite likely that lookup_value is called in quick succession with closely related keys. (And also, there is a more complex scenario which the borrow checker similarly complains about, where we look up a range of sequential values with an interator. The same situation applies there.)
Are there ways in which I can make sure that child lives long enough?
I have tried to pass in an extra &mut HashMap<NodeId, Arc<Node<K, V>>> argument to store all children needed in the current traversal to keep their references alive for long enough. (See this variant of the code on the Rust Playground here)
However, then you end up with a recursive function that takes (and modifies) a hashmap as well as an element in the hashmap. Rust (rightfully) blocks this from working: It's (a) not possible to convince Rust that you won't overwrite earlier values in the hashmap, and (b) it might need to be reallocated when it grows which would also invalidate the references.
So that does not work.
Are there other solutions? (Even ones possibly involving unsafe?)
Minimal Reproducible Example
Your example contained a lot of errors and undefined types. Please provide a proper minimal reproducible example next time to avoid frustration and increase the chance of getting a quality answer.
EDIT: I saw your modified question too late with the proper minimal example, I hope this one fits. If not, I'll take another look. Let me know.
Either way, I attempted to reverse-engineer what your minimal example could have been:
use std::marker::PhantomData;
type NodeId = usize;
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum NodeContents<Value> {
Leaf(Vec<Value>),
Internal(Vec<NodeId>),
}
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Node<Key, Value> {
pub keys: Vec<Key>,
pub contents: NodeContents<Value>,
}
struct NodePool<Key, Value> {
_p: PhantomData<(Key, Value)>,
}
impl<Key, Value> NodePool<Key, Value> {
fn lookup(&self, _child_id: NodeId) -> Node<Key, Value> {
todo!()
}
}
impl<Key, Value> Node<Key, Value>
where
Key: Ord,
{
pub fn lookup_value(&self, key: &Key, pool: &NodePool<Key, Value>) -> Option<&Value> {
match &self.contents {
NodeContents::Leaf(values) => {
// Simple base case. No problems here.
self.keys
.binary_search(key)
.map(|index| &values[index])
.ok()
}
NodeContents::Internal(children) => {
let index = self
.keys
.binary_search(key)
.map(|index| index + 1)
.unwrap_or_else(|index| index);
let child_id = &children[index];
// Here the borrow checker gets mad:
let child = pool.lookup(*child_id); // NodePool::lookup(&self, NodeId) -> Node<K, V>
child.lookup_value(key, pool)
}
}
}
}
error[E0515]: cannot return reference to local variable `child`
--> src/lib.rs:50:17
|
50 | child.lookup_value(key, pool)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ returns a reference to data owned by the current function
Thoughts and Potential Solutions
The main question you have to ask yourself: Who owns the data once it's loaded from disk?
This is not a simple question with a simple answer. It opens many new questions:
If you query the same key twice, should it load from disk twice? Or should it be cached and some other previously cached value should be moved to disk?
If it should be loaded from disk twice, should the entire Node that lookup returned be kept alive until the borrow is returned, or should only the Value be moved out and kept alive?
If your answer is "it should be loaded twice" and "only the Value should be kept alive", you could use an enum that can carry either a borrowed or owned Value:
enum ValueHolder<'a, Value> {
Owned(Value),
Borrowed(&'a Value),
}
impl<'a, Value> std::ops::Deref for ValueHolder<'a, Value> {
type Target = Value;
fn deref(&self) -> &Self::Target {
match self {
ValueHolder::Owned(val) => val,
ValueHolder::Borrowed(val) => val,
}
}
}
In your code, it could look like this:
use std::marker::PhantomData;
type NodeId = usize;
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum NodeContents<Value> {
Leaf(Vec<Value>),
Internal(Vec<NodeId>),
}
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Node<Key, Value> {
pub keys: Vec<Key>,
pub contents: NodeContents<Value>,
}
pub struct NodePool<Key, Value> {
_p: PhantomData<(Key, Value)>,
}
impl<Key, Value> NodePool<Key, Value> {
fn lookup(&self, _child_id: NodeId) -> Node<Key, Value> {
todo!()
}
}
pub enum ValueHolder<'a, Value> {
Owned(Value),
Borrowed(&'a Value),
}
impl<'a, Value> std::ops::Deref for ValueHolder<'a, Value> {
type Target = Value;
fn deref(&self) -> &Self::Target {
match self {
ValueHolder::Owned(val) => val,
ValueHolder::Borrowed(val) => val,
}
}
}
impl<Key, Value> Node<Key, Value>
where
Key: Ord,
{
pub fn lookup_value(
&self,
key: &Key,
pool: &NodePool<Key, Value>,
) -> Option<ValueHolder<Value>> {
match &self.contents {
NodeContents::Leaf(values) => {
// Simple base case. No problems here.
self.keys
.binary_search(key)
.map(|index| ValueHolder::Borrowed(&values[index]))
.ok()
}
NodeContents::Internal(children) => {
let index = self
.keys
.binary_search(key)
.map(|index| index + 1)
.unwrap_or_else(|index| index);
let child_id = &children[index];
// Here the borrow checker gets mad:
let child = pool.lookup(*child_id); // NodePool::lookup(&self, NodeId) -> Node<K, V>
child
.into_lookup_value(key, pool)
.map(|val| ValueHolder::Owned(val))
}
}
}
pub fn into_lookup_value(self, key: &Key, pool: &NodePool<Key, Value>) -> Option<Value> {
match self.contents {
NodeContents::Leaf(mut values) => {
// Simple base case. No problems here.
self.keys
.binary_search(key)
.map(|index| values.swap_remove(index))
.ok()
}
NodeContents::Internal(children) => {
let index = self
.keys
.binary_search(key)
.map(|index| index + 1)
.unwrap_or_else(|index| index);
let child_id = &children[index];
// Here the borrow checker gets mad:
let child = pool.lookup(*child_id); // NodePool::lookup(&self, NodeId) -> Node<K, V>
child.into_lookup_value(key, pool)
}
}
}
}
In all other cases, however, it seems like further thought needs to be put into restructuring the project.
After a lot of searching and trying, and looking at what other databases are doing (great suggestion, #Finomnis!) the solution presented itself.
Use an Arena
The intuition to use a HashMap was a good one, but you have the problems with lifetimes, for two reasons:
A HashMap might be reallocated when growing, which would invalidate old references.
You cannot convince the compiler that new insertions into the Hashmap will not overwrite existing values.
However, there is an alternative datastructure which is able to provide these guarantees: Arena allocators.
With these, you essentially 'tie' the lifetimes of the values you insert into them, to the lifetime of the allocator as a whole.
Internally (at least conceptually), an arena keeps track of a list of memory regions in which data might be stored, the most recent one of which is not entirely full. Individual memory regions are never reallocated to ensure that old references remain valid; instead, when the current memory region is full, a new (larger) one is allocated and added to the list, becoming the new 'current'.
Everything is destroyed at once at the point the arena itself is deallocated.
There are two prevalent Rust libraries that provide arenas. They are very similar, with slightly different trade-offs:
typed-arena allows only a single type T per arena, but runs all Drop implementations when the arena is dropped.
bumpalo allows different objects to be stored in the same arena, but their Drop implementations do not run.
In the common situation of having only a single datatype with no special drop implementation, you might want to benchmark which one is faster in your use-case.
In the original question, we are dealing with a cache, and it makes sense to reduce copying of large datastructures by passing around Arcs between the cache and the arena. This means that you can only use typed-arena in this case, as Arc has a special Drop implementation.
I'm using the rusqlite crate and am doing some queries, basically i'm just trying to check the length of the query results before trying to proceed, however I'm running into error[E0382]: use of moved value: key_rows whilst trying to compile, I don't understand since i'm borrowing a reference to the variable so it wouldn't move it's local in memory?
Maybe it's due to the method i'm calling on the variable's pointer?
Full compiler error:
error[E0382]: use of moved value: `key_rows`
--> src/handle.rs:126:16
|
108 | let key_rows = key_stmt.query_map(&[(":key", key.as_str())], |row| {
| -------- move occurs because `key_rows` has type `MappedRows<'_, [closure#src/handle.rs:108:66: 113:6]>`, which does not implement the `Copy` trait
...
117 | if(&key_rows.count() == &0){
| ------- `key_rows` moved due to this method call
...
126 | for row in key_rows {
| ^^^^^^^^ value used here after move
|
note: this function takes ownership of the receiver `self`, which moves `key_rows`
Erroneous code:
let mut key_stmt = conn.prepare("SELECT id , key FROM key_table WHERE key = :key;").unwrap();
let key_rows = key_stmt.query_map(&[(":key", key.as_str())], |row| {
Ok(Table {
id: row.get(0)?,
payload: row.get(1)?,
})
}).unwrap();
//Checking that the key exists:
if(&key_rows.count() == &0){
panic!("Can't find the key...")
}
//Putting in a default value since the compiler is worried.
let mut reference_id : i32 = 0;
//For loop is nessessary since MappedRow type cannot be indexed regularly (weird)
for row in key_rows {
reference_id = row.unwrap().id;
println!("{:?}", reference_id.to_string());
}
You aren't borrowing key_rows then checking the count, but rather calling key_rows.count(), then borrowing the result.
Iterator#count consumes the entire iterator, returning how many elements were traversed.
let mut reference_id = key_rows.next().unwrap_or_else(|| panic!("Can't find the key..."));
// Remove the code past this point if you are only expecting one, or the first value.
for row in key_rows {
reference_id = row.unwrap().id;
println!("{:?}", reference_id.to_string());
}
How can I modify a Vec based on information from an item within the Vec without having both immutable and mutable references to the vector?
I've tried to create a minimal example that demonstrates my specific problem. In my real code, the Builder struct is already the intermediate struct that other answers propose. Specifically, I do not think this question is answered by other questions because:
Why does refactoring by extracting a method trigger a borrow checker error? - What would I put in the intermediate struct? I'm not calculating a separate value from Vec<Item>. The value being modified/operated on is Vec<Item> and is what would need to be in the intermediate struct
Suppose I have a list of item definitions, where items are either a String, a nested list of Items, or indicate that a new item should be added to the list of items being processed:
enum Item {
Direct(String),
Nested(Vec<Item>),
New(String),
}
There is also a builder that holds a Vec<Item> list, and builds an item at the specified index:
struct Builder {
items: Vec<Item>,
}
impl Builder {
pub fn build_item(&mut self, item: &Item, output: &mut String) {
match item {
Item::Direct(v) => output.push_str(v),
Item::Nested(v) => {
for sub_item in v.iter() {
self.build_item(sub_item, output);
}
}
Item::New(v) => self.items.push(Item::Direct(v.clone())),
}
}
pub fn build(&mut self, idx: usize, output: &mut String) {
let item = self.items.get(idx).unwrap();
self.build_item(item, output);
}
}
This doesn't compile due to the error:
error[E0502]: cannot borrow `*self` as mutable because it is also borrowed as immutable
--> src/main.rs:26:9
|
25 | let item = self.items.get(idx).unwrap();
| ---------- immutable borrow occurs here
26 | self.build_item(item, output);
| ^^^^^----------^^^^^^^^^^^^^^
| | |
| | immutable borrow later used by call
| mutable borrow occurs here
error: aborting due to previous error
For more information about this error, try `rustc --explain E0502`.
I don't know how to have the Builder struct be able to modify its items (i.e. have a mutable reference to self.items) based on information contained in one of the items (i.e. an immutable borrow of self.items).
Here is a playground example of the code.
Using Clone
#Stargateur recommended that I try cloning the item in build(). While this does work, I've been trying to not clone items for performance reasons. UPDATE: Without adding the Vec<Item> modification feature with Item::New, I implemented the clone() method in my real code and cloned the value in the equivalent of the example build() method above. I saw a 12x decrease in performance when I do self.items.get(idx).unwrap().clone() vs self.items.get(idx).unwrap(). I will continue looking for other solutions. The problem is, I'm still relatively new to Rust and am not sure how to bend the rules/do other things even with unsafe code.
Code that does work (playground)
impl Clone for Item {
fn clone(&self) -> Self {
match self {
Item::Direct(v) => Item::Direct(v.clone()),
Item::Nested(v) => Item::Nested(v.clone()),
Item::New(v) => Item::New(v.clone()),
}
}
}
and change build to clone the item first:
let item = self.items.get(idx).unwrap().clone();
Whenever approaching problems like this (which you will encounter relatively frequently while using Rust), the main goal should be to isolate the code requiring the immutable borrow from the code requiring the mutable borrow. If the borrow from the items vec in build is unavoidable (i.e. you cannot move the item out of self.items or copy/clone it) and you must pass in a reference to this item to build_item, you might want to consider rewriting your build_item function to not mutate self. In this case, build_item only ever appends new items to the end of self.items, which lets us make an interesting refactor: Rather than have build_item modify items, make it return the items to be added to the original vector, and then have the caller add the newly generated items to the items vector.
impl Builder {
fn generate_items(&self, item: &Item, output: &mut String) -> Vec<Item> {
match item {
Item::Direct(v) => {
output.push_str(v);
Vec::new()
}
Item::Nested(v) => {
v.iter()
.flat_map(|sub_item| self.generate_items(sub_item, output))
.collect()
}
Item::New(v) => vec![Item::Direct(v.clone())],
}
}
pub fn build_item(&mut self, item: &Item, output: &mut String) {
let mut new_items = self.generate_items(item, output);
self.items.append(&mut new_items);
}
pub fn build(&mut self, idx: usize, output: &mut String) {
// Non lexical lifetimes allow this to compile, as the compiler
// realizes that `item` borrow can be dropped before the mutable borrow
// Immutable borrow of self starts here
let item = self.items.get(idx).unwrap();
let mut new_items = self.generate_items(item, output);
// Immutable borrow of self ends here
// Mutable borrow of self starts here
self.items.append(&mut new_items);
}
}
Note that in order to preserve the API, your build_item function has been renamed to generate_items, and a new build_item function that uses generate_items has been created.
If you look carefully, you'll notice that generate_items doesn't even require self, and can be a free-standing function or a static function in Builder.
Playground
I'm very new to Rust and system languages in general. And I'm currently playing around with Rust to explore the language. I've a problem that I cannot fix by myself. And I think I've understanding problem whats going on.
I wan't to store objects that implements the trait BaseStuff in a vector. In Rust not a simple task for me :-).
Here is my example code that won't compile.
trait BaseStuff {}
struct MyStuff {
value: i32,
}
struct AwesomeStuff {
value: f32,
text: String,
}
impl BaseStuff for MyStuff {}
impl BaseStuff for AwesomeStuff {}
struct Holder {
stuff: Vec<BaseStuff>,
}
impl Holder {
fn register(&mut self, data: impl BaseStuff) {
self.stuff.push(data);
}
}
fn main() {
let my_stuff = MyStuff { value: 100 };
let awesome_stuff = AwesomeStuff {
value: 100.0,
text: String::from("I'm so awesome!"),
};
let mut holder = Holder { stuff: vec![] };
holder.register(my_stuff);
}
error[E0277]: the size for values of type (dyn BaseStuff + 'static)
cannot be known at compilation time --> src\main.rs:17:5 | 17 |
stuff: Vec, // unknown size | ^^^^^^^^^^^^^^^^^^^^^
doesn't have a size known at compile-time | = help: the trait
std::marker::Sized is not implemented for (dyn BaseStuff +
'static) = note: to learn more, visit
https://doc.rust-lang.org/book/second-edition/ch19-04-advanced-types.html#dynamically-sized-types-and-the-sized-trait
= note: required by std::vec::Vec
error: aborting due to previous error
For more information about this error, try rustc --explain E0277.
error: Could not compile playground.
The compiler message is clear and I understand the message. I can
implement the trait BaseStuff in any struct I want't so its unclear
which size it is. Btw the link isn't helpful because its pointing to
outdated site...
The size of String is also unknown, but the String implements the trait std::marker::Sized and that's why a Vec<String> will work without problems. Is that correct?
In the rust book I read for data types with unknown size I've to store these data on the heap instead of the stack. I changed my code as follows.
struct Holder {
stuff: Vec<Box<BaseStuff>>,
}
impl Holder {
fn register(&mut self, data: impl BaseStuff) {
self.stuff.push(Box::new(data));
}
}
Now I'm hitting a new compiler issue:
error[E0310]: the parameter type impl BaseStuff may not live long
enough --> src\main.rs:22:25 | 21 | fn register(&mut self,
data: impl BaseStuff) { |
--------------- help: consider adding an explicit lifetime bound impl BaseStuff: 'static... 22 | self.stuff.push(Box::new(data));
| ^^^^^^^^^^^^^^ | note: ...so that the
type impl BaseStuff will meet its required lifetime bounds -->
src\main.rs:22:25 | 22 | self.stuff.push(Box::new(data));
| ^^^^^^^^^^^^^^
error: aborting due to previous error
For more information about this error, try rustc --explain E0310.
error: Could not compile playground.
And know I'm out... I read in the book about lifetimes and changed my code a lot with 'a here and 'a in any combination but without luck... I don't want to write down any lifetime definition combination I tried.
I don't understand why I need the lifetime definition. The ownership is moved in any step so for my understanding its clear that Holder struct is the owner for all data. Is it?
How can I correct my code to compile?
Thanks for help.
You almost got it - the issue here is that the type for which BaseStuff is implemented may be a reference (e.g. impl BaseStuff for &SomeType). This means that even though you're passing data by value, the value may be a reference that will be outlived by your Box.
The way to fix this is to add a constraint such that the object has a 'static lifetime, meaning it will either be a value type or a static reference. You can apply this constraint to the trait or the method accepting the trait, depending on your use case.
Applying the constraint to the trait:
trait BaseStuff: 'static {}
Applying the constraint to the method:
impl Holder {
fn register(&mut self, data: impl BaseStuff + 'static) {
self.stuff.push(Box::new(data));
}
}
If you add the 'static constraint to the method, I would recommend also adding it to the Vec to avoid losing type information, like so:
struct Holder {
stuff: Vec<Box<dyn BaseStuff + 'static>>,
}
The following is the Deref example from The Rust Programming Language except I've added another assertion.
Why does the assert_eq with the deref also equal 'a'? Why do I need a * once I've manually called deref?
use std::ops::Deref;
struct DerefExample<T> {
value: T,
}
impl<T> Deref for DerefExample<T> {
type Target = T;
fn deref(&self) -> &T {
&self.value
}
}
fn main() {
let x = DerefExample { value: 'a' };
assert_eq!('a', *x.deref()); // this is true
// assert_eq!('a', x.deref()); // this is a compile error
assert_eq!('a', *x); // this is also true
println!("ok");
}
If I uncomment the line, I get this error:
error[E0308]: mismatched types
--> src/main.rs:18:5
|
18 | assert_eq!('a', x.deref());
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected char, found &char
|
= note: expected type `char`
found type `&char`
= help: here are some functions which might fulfill your needs:
- .to_ascii_lowercase()
- .to_ascii_uppercase()
= note: this error originates in a macro outside of the current crate (in Nightly builds, run with -Z external-macro-backtrace for more info)
First, let's spell out the generic types for your specific example: 'a' is char, so we have:
impl Deref for DerefExample<char> {
type Target = char;
fn deref(&self) -> &char {
&self.value
}
}
Notably, the return type of deref is a reference to a char. Thus it shouldn't be surprising that, when you use just x.deref(), the result is a &char rather than a char. Remember, at that point deref is just another normal method — it's just implicitly invoked as part of some language-provided special syntax. *x, for example, will call deref and dereference the result, when applicable. x.char_method() and fn_taking_char(&x) will also call deref some number of times and then do something further with the result.
Why does deref return a reference to begin with, you ask? Isn't that circular? Well, no, it isn't circular: it reduces library-defined smart pointers to the built-in type &T which the compiler already knows how to dereference. By returning a reference instead of a value, you avoid a copy/move (which may not always be possible!) and allow &*x (or &x when it's coerced) to refer to the actual char that DerefExample holds rather than a temporary copy.
See also:
Why is the return type of Deref::deref itself a reference?