I'm starting to use sync.RWMutex with a map in my Go project, since I now have more than one goroutine running at the same time, and while making the changes for that a question came to mind.
The thing is, I know that we must use RLock when only reading, to allow other goroutines to do the same, and Lock when writing, to block the map entirely. But what are we supposed to do when editing a previously created element of the map?
For example... let's say I have a map[int]string where I Lock, put "hello " inside, and then Unlock. What if I later want to append "world" to it? Should I use Lock, or can I use RLock?
You should approach the problem from another angle.
A simple rule of thumb you seem to understand just fine is
You need to protect the map from concurrent accesses when at least one of them is a modification.
Now the real question is what constitutes a modification of a map.
To answer it properly, it helps to notice that values stored in maps are not addressable — by design.
This was engineered that way simply because maps internally have an intricate implementation that might move the values they contain around in memory, in order to provide (amortized) fast access time when the map's structure changes due to insertions and/or deletions of its elements.
The fact that map values are not addressable means you cannot do something like this:
m := make(map[int]string)
m[42] = "hello"
go mutate(&m[42]) // take a single element and go modifying it...
// ...while other parts of the program change _other_ values
m[123] = "blah blah"
The reason you are not allowed to do this is that the insertion operation m[123] = ... might trigger moving the storage of the map's elements around, and that might involve moving the storage of the element keyed by 42 to some other place in memory, pulling the rug out from under the feet of the goroutine running the mutate function.
So, in Go, maps really only support three operations:
Insert — or replace — an element;
Read an element;
Delete an element.
You cannot modify an element "in place"; you can only proceed in three steps:
Read the element;
Modify the variable containing the (read) copy;
Replace the element with the modified copy.
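To make these three steps concrete for the map[int]string from the question (setting locking aside for a moment), a minimal sketch:

package main

import "fmt"

func main() {
    m := make(map[int]string)
    m[42] = "hello "

    s := m[42]   // step 1: read the element
    s += "world" // step 2: modify the copy
    m[42] = s    // step 3: replace the element with the modified copy

    fmt.Println(m[42]) // hello world
}

(The shorthand m[42] += "world" performs the same read-modify-replace.)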
As you can now see, the steps (1) and (3) are mere map accesses,
and so the answer to your question is (hopefully) apparent:
the step (1) shall be done under at least a read lock,
and the step (3) shall be done under a write (exclusive) lock.
In contrast, elements of other compound types —
arrays (and slices) and fields of struct types —
do not have the restriction maps have: provided the storage
of the "enclosing" variable is not relocated, it is fine to
change its different elements concurrently by different goroutines.
Since the only way to change the value associated with a key in a map is to reassign the changed value to the same key, that is a write / modification, so you have to obtain the write lock; simply using the read lock will not be sufficient.
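Putting the pieces together with sync.RWMutex, here is a minimal sketch of the pattern (the wrapper type and method names are illustrative, not from the question). Holding the write lock across the whole read-modify-replace also prevents a lost update if another goroutine writes between the read and the write:

package main

import (
    "fmt"
    "sync"
)

// store is an illustrative wrapper around a map guarded by an RWMutex.
type store struct {
    mu sync.RWMutex
    m  map[int]string
}

// get only reads the map, so a read lock is enough.
func (s *store) get(k int) string {
    s.mu.RLock()
    defer s.mu.RUnlock()
    return s.m[k]
}

// appendTo does read-modify-replace under the exclusive (write) lock.
func (s *store) appendTo(k int, suffix string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.m[k] += suffix
}

func main() {
    s := &store{m: make(map[int]string)}
    s.appendTo(42, "hello ")
    s.appendTo(42, "world")
    fmt.Println(s.get(42)) // hello world
}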
Related
In computer graphics, one of the most basic patterns is creating several buffers for vertex attributes and an index buffer that groups these attributes together.
In Rust, this basic pattern looks like this:
struct Position { ... }
struct Uv { ... }
struct Vertex {
    pos: usize,
    uv: usize
}
// v_vertex[_].pos is an index into v_pos
// v_vertex[_].uv is an index into v_uv
struct Object {
    v_pos: Vec<Position>,    // attribute buffer
    v_uv: Vec<Uv>,           // attribute buffer
    v_vertex: Vec<Vertex>    // index buffer
}
However, this pattern leaves a lot to be desired. Any operation on the attribute buffers that modifies existing data is going to be primarily concerned with making sure the index buffer isn't invalidated. In short, it leaves most of the promises the Rust compiler is capable of making on the table.
Making a scheme like this work isn't impossible. For example, the struct could include a HashMap that keeps track of changed indices and rebuilds the vectors when it gets too large. However, all the workarounds inevitably feel like another hack that doesn't address the underlying problem: there is no compile-checked guarantee that I'm not introducing data races or that the "reference" hasn't been invalidated somewhere else by accident.
When I first approached this problem while moving from C++ to Rust, I tried to make the Vertex object hold references to the attributes. That looked something like this:
struct Position { ... }
struct Uv { ... }
struct Vertex {
    pos: &Position,
    uv: &Uv
}
// obj.v_vertex[_].pos is a reference to an element of obj.v_pos
// obj.v_vertex[_].uv is a reference to an element of obj.v_uv
// This causes a lot of problems; it's effectively impossible to use
struct Object {
    v_pos: Vec<Position>,
    v_uv: Vec<Uv>,
    v_vertex: Vec<Vertex>
}
...which threw me deep down the rabbit hole of self-referential structs and why they cannot exist in safe Rust. After learning more about them, it turns out that, as I suspected, the original implementation hid a lot of unsafety pitfalls that were caught by the compiler once I started being more explicit.
I'm aware of the existence of unsafe solutions like Pin, but I feel like at that point I might as well stick with the original method.
This leads me to the core question: Is there an idiomatic way of representing this relationship? I want to be able to modify the contents of each of the Vecs in a compiler-checked manner.
Is there an idiomatic way of representing this relationship?
The usizes you started with are the idiomatic way of representing this relationship.
Any operation on the attribute buffers that modifies existing data is going to be primarily concerned with making sure the index buffer isn't invalidated. …
Yes; you should write those operations within the module that defines Object, and keep the fields private so the Object cannot become inconsistent as long as those operations are correctly defined.
In short, it leaves most of the promises the Rust compiler is capable of making on the table.
It doesn't — because the Rust compiler is not actually capable of making those promises. & and even &mut references are actually very limited — they work by statically enforcing “Nobody (else) is going to change this value while you have the reference”. They don't have any bigger picture than that. In your case, assuming you're planning to edit this data, you will need to do operations that modify multiple parts in a consistent fashion, like “add a Position and also a Vertex that uses it”, or maybe “simultaneously add 3 vertices making up a triangle, using these 3 existing Positions”. References cannot help you do this correctly.
The only kind of data structure of this sort that you can in fact build using references is an append-only one, using the help of, for example, typed-arena. This might be suitable for an algorithm which is building a mesh. However, given that it's append-only, there is very little benefit — the operation “append a vertex, choosing indices as you go” is easy to write correctly without references. Additionally, you won't be able to store the mesh constructed that way long-term (because it is made of vectors that borrow from the arena) unless you also throw in ouroboros to wrap up the self-reference.
Fundamentally, references are designed to be used as temporary things — as a formalization and enforcement of common patterns used in C and C++ when passing and returning pointers — hence also being called “borrows”. The rules which the compiler understands about references are rules designed to handle those temporary uses. They are almost never what you should be building a data structure out of.
Raku provides many types that are immutable and thus cannot be modified after they are created. Until I started looking into this area recently, my understanding was that these Types were not persistent data structures – that is, unlike the core types in Clojure or Haskell, my belief was that Raku's immutable types did not take advantage of structural sharing to allow for inexpensive copies. I thought that a statement like my List $new = (|$old-list, 42); literally copied the values in $old-list, without the data-sharing features of persistent data structures.
That description of my understanding is in the past tense, however, due to the following code:
my Array $a = do {
    $_ = [rand xx 10_000_000];
    say "Initialized an Array in $((now - ENTER now).round: .001) seconds"; $_ }
my List $l = do {
    $_ = |(rand xx 10_000_000);
    say "Initialized the List in $((now - ENTER now).round: .001) seconds"; $_ }
do { $a.push: rand;
     say "Pushed the element to the Array in $((now - ENTER now).round: .000001) seconds" }
do { my $nl = (|$l, rand);
     say "Appended an element to the List in $((now - ENTER now).round: .000001) seconds" }
do { my @na = |$l;
     say "Copied List \$l into a new Array in $((now - ENTER now).round: .001) seconds" }
which produced this output in one run:
Initialized an Array in 5.938 seconds
Initialized the List in 5.639 seconds
Pushed the element to the Array in 0.000109 seconds
Appended an element to the List in 0.000109 seconds
Copied List $l into a new Array in 11.495 seconds
That is, creating a new List with the old values + one more is just as fast as pushing to a mutable Array, and dramatically faster than copying the List into a new Array – exactly the performance characteristics that you'd expect to see from a persistent List (copying to an Array is still slow because it can't take advantage of structural sharing without breaking the immutability of the List). The fast copying of $l into $nl is not due to either List being lazy; neither is.
All of the above leads me to believe that Lists in Rakudo actually are persistent data structures, with all the performance benefits that implies. That leaves me with several questions:
Am I right about Lists being persistent data structures?
Are all other immutable Types also persistent data structures? Or are any?
Is any of this part of Raku, or just an implementation choice Rakudo has made?
Are any of these performance characteristics documented/guaranteed anywhere?
I have to say, I am both extremely impressed and more than a bit baffled to discover evidence that at least some of Raku(do)'s types are persistent. It's the sort of feature that other languages list as a key selling point or that leads to the creation of libraries with 30k+ stars on GitHub. Have we really had it in Raku without even mentioning it?
I remember implementing these semantics, and I certainly don't recall thinking about them giving rise to a persistent data structure at the time - although it does seem fair to attach that label to the result!
I don't think you'll find anywhere that explicitly spells out this exact behavior; however, the most natural implementation of the things that are required by the language quite naturally leads to it. Taking the ingredients:
The infix:<,> operator is the List constructor in Raku
When a List is created, it is non-committal with regards to laziness and flattening (these arise from how we use the List, which we don't - in general - know at the point of its construction)
When we write (|$x, 1), the prefix:<|> operator constructs a Slip, which is a kind of List that should melt into its surrounding List. Thus what infix:<,> sees is a Slip and an Int.
Making the Slip melt into the result List immediately would mean making a commitment about eagerness, which List construction alone should not do. Thus the Slip and everything after it is placed into the lazily evaluated ("non-reified") portion of the List.
The last of these is what gives rise to the observed persistent-data-structure-style behavior.
I expect it would be possible to have an implementation that inspects the Slip and chooses to eagerly copy things that are known not to be lazy, and still be in compliance with the specification test suite. That would change the time complexity of your example. If you want to be defensive against that, then:
do { my $nl = (|$l.lazy, rand);
say "Appended an element to the List in $((now - ENTER now).round: .000001) seconds" }
should be sufficient to force the issue even if the implementation changed.
Other cases that immediately come to mind that are related to persistent data structures, or at least tail sharing:
The MoarVM implementation of strings, which is behind str and thus Str, implements string concatenation by creating a new string that refers to the two that are being concatenated instead of copying the data in the two strings (and does similar tricks for substr and repetition). This is strictly an optimization, not a language requirement, and in some delicate cases (the last grapheme of one string and the first grapheme of the next will form a single grapheme in the resulting string), it gives up and takes the copying path.
Outside of the core, modules like Concurrent::Stack, Concurrent::Queue, and Concurrent::Trie use tail sharing as a technique to implement relatively efficient lock-free data structures.
From the Go source code, maps seem to follow a pretty standard hash-table implementation (i.e. an array of buckets). Based on this, it seems that iteration should be deterministic for an unchanged map (i.e. iterate the array in order, then iterate within each bucket in order). Why do they make the iteration random?
TL;DR: They intentionally made it random starting with Go 1 so that developers would not rely on a specific iteration order, which may change from release to release, from platform to platform, or even during a single run of an app when the map's internals change to accommodate more elements.
The Go Blog: Go maps in action: Iteration order:
When iterating over a map with a range loop, the iteration order is not specified and is not guaranteed to be the same from one iteration to the next. Since the release of Go 1.0, the runtime has randomized map iteration order. Programmers had begun to rely on the stable iteration order of early versions of Go, which varied between implementations, leading to portability bugs. If you require a stable iteration order you must maintain a separate data structure that specifies that order.
Also Go 1 Release Notes: Iterating in maps:
The old language specification did not define the order of iteration for maps, and in practice it differed across hardware platforms. This caused tests that iterated over maps to be fragile and non-portable, with the unpleasant property that a test might always pass on one machine but break on another.
In Go 1, the order in which elements are visited when iterating over a map using a for range statement is defined to be unpredictable, even if the same loop is run multiple times with the same map. Code should not assume that the elements are visited in any particular order.
This change means that code that depends on iteration order is very likely to break early and be fixed long before it becomes a problem. Just as important, it allows the map implementation to ensure better map balancing even when programs are using range loops to select an element from a map.
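If you do need a reproducible order in your own code, the usual approach is the one the blog post suggests: maintain (or build) a separate, ordered collection of keys and range over that instead. A minimal sketch:

package main

import (
    "fmt"
    "sort"
)

func main() {
    m := map[string]int{"banana": 2, "apple": 1, "cherry": 3}

    // Collect the keys, sort them, then index the map in that order.
    keys := make([]string, 0, len(m))
    for k := range m {
        keys = append(keys, k)
    }
    sort.Strings(keys)

    for _, k := range keys {
        fmt.Println(k, m[k]) // apple 1, then banana 2, then cherry 3
    }
}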
Notable exceptions
Please note that the "random" order applies when ranging over the map using for range.
For reproducible output (for easy testing and the other conveniences it brings), the standard library sorts map keys in numerous places:
1. encoding/json
The json package marshals maps using sorted keys. Quoting from json.Marshal():
Map values encode as JSON objects. The map's key type must either be a string, an integer type, or implement encoding.TextMarshaler. The map keys are sorted and used as JSON object keys by applying the following rules, subject to the UTF-8 coercion described for string values above:
keys of any string type are used directly
encoding.TextMarshalers are marshaled
integer keys are converted to strings
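For example, regardless of how the map was built, the marshaled keys come out sorted (a tiny illustration, not from the linked documentation):

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    b, err := json.Marshal(map[string]int{"b": 2, "a": 1, "c": 3})
    if err != nil {
        panic(err)
    }
    fmt.Println(string(b)) // {"a":1,"b":2,"c":3}
}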
2. fmt package
Starting with Go 1.12 the fmt package prints maps using sorted keys. Quoting from the release notes:
Maps are now printed in key-sorted order to ease testing. The ordering rules are:
When applicable, nil compares low
ints, floats, and strings order by <
NaN compares less than non-NaN floats
bool compares false before true
Complex compares real, then imaginary
Pointers compare by machine address
Channel values compare by machine address
Structs compare each field in turn
Arrays compare each element in turn
Interface values compare first by reflect.Type describing the concrete type and then by concrete value as described in the previous rules.
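A quick check of the above (requires Go 1.12 or newer):

package main

import "fmt"

func main() {
    m := map[string]int{"b": 2, "c": 3, "a": 1}
    fmt.Println(m) // prints map[a:1 b:2 c:3] (keys in sorted order)
}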
3. Go templates
The {{range}} action of the text/template and html/template packages also visits elements in sorted key order. Quoting from the package documentation of text/template:
{{range pipeline}} T1 {{end}}
The value of the pipeline must be an array, slice, map, or channel.
If the value of the pipeline has length zero, nothing is output;
otherwise, dot is set to the successive elements of the array,
slice, or map and T1 is executed. If the value is a map and the
keys are of basic type with a defined order, the elements will be
visited in sorted key order.
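A small sketch of that behavior (the template string and data here are just an illustration):

package main

import (
    "os"
    "text/template"
)

func main() {
    t := template.Must(template.New("t").Parse("{{range $k, $v := .}}{{$k}}={{$v}} {{end}}\n"))
    // The map keys are of a basic ordered type, so {{range}} visits them in sorted key order.
    if err := t.Execute(os.Stdout, map[string]int{"b": 2, "a": 1, "c": 3}); err != nil {
        panic(err)
    }
    // prints: a=1 b=2 c=3
}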
This is important for security, among other things.
There are lots of resources talking about this online -- see this post for example
Go doesn't allow taking the address of a map member:
// if I do this:
p := &mm["abc"]
// Compile error - cannot take the address of mm["abc"]
The rationale is that if Go allowed taking this address, then when the map's backing store grows or shrinks, the address could become invalid, confusing the user.
But a Go slice also gets relocated when it outgrows its capacity, yet Go allows us to take the address of a slice element:
type Test struct{ A int; B string } // definition assumed here; not shown in the original question

a := make([]Test, 5)
a[0] = Test{1, "dsfds"}
a[1] = Test{2, "sdfd"}
a[2] = Test{3, "dsf"}

addr1 := reflect.ValueOf(&a[2]).Pointer()
fmt.Println("Address of a[2]: ", addr1)

a = append(a, Test{4, "ssdf"})
addrx := reflect.ValueOf(&a[2]).Pointer()
fmt.Println("Address of a[2] After Append:", addrx)
// Note: after the append, &a[2] is a different address; the first pointer
// no longer refers to where a[2] now lives.
Address of a[2]: 833358258224
Address of a[2] After Append: 833358266416
Why is Go designed like this? What is special about taking the address of a slice element?
There is a major difference between slices and maps: Slices are backed by a backing array and maps are not.
If a map grows or shrinks, a pointer to a map element may become a dangling pointer pointing into nowhere (uninitialised memory). The problem here is not "confusion of the user" but that it would break a major design element of Go: no dangling pointers.
If a slice runs out of capacity, a new, larger backing array is created and the old backing array is copied into the new one; the old backing array continues to exist. Thus any pointers obtained from the "ungrown" slice that point into the old backing array are still valid pointers to valid memory.
If you have a slice still pointing to the old backing array (e.g. because you made a copy of the slice before growing it beyond its capacity), you still access the old backing array. This has less to do with pointers to slice elements and more to do with slices being views into arrays, and with the arrays being copied during slice growth.
Note that there is no "reducing the backing array of a slice" during slice shrinkage.
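To see this concretely, here is a small sketch (not from the original question) that keeps a pointer taken before the slice grows; after append reallocates the backing array, the old pointer still reads perfectly valid (if stale) memory, while indexing the grown slice reaches the new array:

package main

import "fmt"

func main() {
    a := make([]int, 1, 1) // len 1, cap 1: the next append must reallocate
    a[0] = 10

    p := &a[0] // pointer into the original backing array

    a = append(a, 20) // exceeds cap: new backing array, old contents copied over
    a[0] = 99         // modifies the new backing array only

    fmt.Println(*p)   // 10: the old backing array, still valid memory, just stale
    fmt.Println(a[0]) // 99: the new backing array
}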
A fundamental difference between map and slice is that a map is a dynamic data structure that moves the values that it contains as it grows. The specific implementation of Go map may even grow incrementally, a little bit during insert and delete operations until all values are moved to a bigger memory structure. So you may delete a value and suddenly another value may move. A slice on the other hand is just an interface/pointer to a subarray. A slice never grows. The append function may copy a slice into another slice with more capacity, but it leaves the old slice intact and is also a function instead of just an indexing operator.
In the words of the map implementor himself:
https://www.youtube.com/watch?v=Tl7mi9QmLns&feature=youtu.be&t=21m45s
"It interferes with this growing procedure, so if I take the address
of some entry in the bucket, and then I keep that entry around for a
long time and in the meantime the map grows, then all of a sudden that
pointer points to an old bucket and not a new bucket and that pointer
is now invalid, so it's hard to provide the ability to take the
address of a value in a map, without constraining how grow works...
C++ grows in a different way, so you can take the address of a bucket"
So, even though &m[x] could have been allowed and would be useful for short-lived operations (do a modification to the value and then not use that pointer again), and in fact the map internally does that, I think the language designers/implementors chose to be on the safe side with map, not allowing &m[x] in order to avoid subtle bugs with programs that might keep the pointer for a long time without realizing then it would point to different data than the programmer thought.
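For completeness, the two usual ways to work around the lack of &m[x] (not from the talk above, just the common patterns) are to do an explicit read-modify-write of the value, or to store pointers in the map so the pointed-to values have stable addresses:

package main

import "fmt"

type counter struct{ n int }

func main() {
    // Option 1: read-modify-write the value itself.
    m1 := map[string]counter{"a": {}}
    c := m1["a"]
    c.n++
    m1["a"] = c

    // Option 2: store pointers; the pointed-to structs do not move when the
    // map grows, so they can be mutated through the pointer directly.
    m2 := map[string]*counter{"a": {}}
    m2["a"].n++

    fmt.Println(m1["a"].n, m2["a"].n) // 1 1
}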
See also Why doesn't Go allow taking the address of map value? for related comments.
I've read a bunch of explanations about the difference between array pointers and map pointers and it all still seems a tad odd.
Consider this: https://go.dev/play/p/uzADxzdq2EP
I can get a pointer to the zeroth object of the array, but after I add another object to the array, the original pointer is still there; it just no longer points to the zeroth object of the current array. It points to the original value. Sure, it's not pointing to a nil object, it's pointing to the same object, but it's no longer 'correct' for some version of correct.
I'm not sure what my point is here other than it's just...odd.
I'm currently considering the implementation of a 2D data structure to allow me to store and draw objects in correct Z-Order (GDI+, entities are drawn in call order). The requirements are loosely:
Ability to add new objects to the top of any depth index
Ability to remove arbitrary object
(Ability to move object to the top of new depth index, accomplished by 2 points above)
Fast in-order and reverse-order traversal
As the main requirement is speed of traversal across the full data, the first thing that came to mind was an array-like structure, e.g. a vector. It also easily allows for pushing new objects (removing objects, not so much). This works perfectly fine for our requirements, as it just so happens that the bulk of the drawable entities don't change, and the ones that do sit at the top end of the order.
However it got me thinking of the implications for more dynamic requirements:
A vector will resize itself as required -> as the 'depth' vectors would need to be maintained contiguously in memory (the top-level vector enforces it), this could lead to some pretty expensive vector resizes. Worst case, all vectors need to be moved to a new memory location; average case, all vectors up the chain need to be moved.
Vectors will often hold a buffer at the end for adding new objects -> traversal could still easily force a cache miss while jumping between 'depth' vectors, rendering the top-level vector's contiguous memory less beneficial
Could someone confirm that these observations are indeed correct, making a vector a mostly very expensive structure for storing larger dynamic data sets?
From my thoughts above, I end up deducing that while traversing the whole dataset, specifically jumping between different vectors in the top-level vector, you might as well use any other data structure with inferior traversal complexity, or similar random access complexity (linked_list; map). Traversal would effectively be the same, as we might as well assume the cache misses will happen anyway, and we save ourselves a lot of bother by not keeping the depth vectors contiguously in memory.
Would that indeed be a good solution? If I'm not mistaken, on a 1D problem space, this would come down to what's more important traversal or addition/removal, vector or linked-list. On a 2D space I'm not so sure it is so black and white.
I'm wondering what sort of application requires good traversal across a 2D space, without compromising data addition/removal, and what sort of data structures are used there.
P.S. I just noticed I'm completely ignoring space-complexity, so might as well keep on ignoring it (unless you feel like adding more insight :D)
Your first assumption is somewhat incorrect.
Instead of thinking of a vector as the blob of memory itself, think of it as a pointer to an automatically managed blob of memory plus some metadata to keep track of it. The vector object itself has a fixed size; the memory it keeps track of doesn't. (See this example, and note that the size of the vector object is constant: https://ideone.com/3mwjRz)
A vector of vectors can be thought of as an array of pointers. Resizing what the pointers point to doesn't mean you need to resize the array that contains them. The promise of items being contiguous still holds: the parent array has all of the pointers adjacent to each other and each pointer points to a contiguous chunk of memory. However, it's not guaranteed that the end of arr[0][N-1] is adjacent to the beginning of arr[1][0]. (To this end, your second point is correct.)
I guess that a linked list would be more appropriate, as you will always be traversing the whole list (vectors are good for random access). Linked list insertions and removals are very cheap, and traversal isn't that different from vector traversal. Maybe you should consider a doubly linked list, as you want to traverse it in both directions.