The simplified structure of what I have is:
some single roots
each root has many (e.g. hundreds of) children
A user may update the root's information, and no other operation on its children should be allowed while that happens (because a root change may affect all of them).
A user may also operate on the children (if the root is not in use, of course). For example, a user may change two children at the same time, and this is allowed, since each child is independent.
I need locks in this structure in order to be sure there are no corruptions:
when a child is in use, lock that child. This prevents two operations on the same child at the same time.
when the root is in use, lock the root AND all of its children. This forbids operations on any child while the root is being updated.
What bothers me here is the need to lock all the children - in a distributed system that means sending that many requests to the distributed lock service.
Is there any better solution I don't see?
You're missing two things. First, it's safe for multiple threads to read from a node at the same time, as long as nobody is writing to it. Second, the child nodes can be viewed as their own roots of smaller trees, so the same algorithm/solution may be applied to all nodes except leaf nodes. The first one is most important. Here's how you could do this:
Use a read/write mutex on all nodes in the tree. This allows any number of processes to concurrently read, or a single process to write to a node at any time.
To read:
read-lock node and all parents all the way up to root.
read.
release all read-locks.
To write:
write-lock the least-upper-bound of the nodes you want to modify. If you're modifying a node (and possibly any of its children), write-lock that node.
do your modifications
release the write-lock
This means two siblings may be modified concurrently, and that any number of reads may execute concurrently. However, the cost of reading is that you need to grab one read-lock per level on the path to the root - that is, O(log_100(n)) read-locks for a tree of n nodes with roughly 100 children at each level. It's unlikely to be a real problem, unless your tree is huge, with extremely many reads and writes to the same leaf node.
This assumes that no child may modify its parent.
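The scheme above can be sketched in-process as follows. This is a single-machine analogue, assuming plain threads; in a distributed system each ReadWriteLock would instead be a lock held in your lock service, and all the names here (Node, read, write) are made up for illustration:

```python
import threading

class ReadWriteLock:
    """Minimal reader/writer lock: many concurrent readers, one writer."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writing = False

    def acquire_read(self):
        with self._cond:
            while self._writing:
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writing or self._readers:
                self._cond.wait()
            self._writing = True

    def release_write(self):
        with self._cond:
            self._writing = False
            self._cond.notify_all()

class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.lock = ReadWriteLock()

def _path_from_root(node):
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    chain.reverse()  # root first: one consistent acquisition order avoids deadlock
    return chain

def read(node, action):
    chain = _path_from_root(node)
    for n in chain:              # read-lock the node and every ancestor
        n.lock.acquire_read()
    try:
        return action()
    finally:
        for n in reversed(chain):
            n.lock.release_read()

def write(node, action):
    node.lock.acquire_write()    # write-lock only the least upper bound
    try:
        return action()
    finally:
        node.lock.release_write()

root = Node()
children = [Node(parent=root) for _ in range(3)]
log = []
read(children[0], lambda: log.append("read child 0"))
write(children[1], lambda: log.append("write child 1"))
write(root, lambda: log.append("write root"))
```

Note that a writer locks only one node, while a reader of a descendant read-locks that same node on its way up, which is exactly why a root write excludes all child reads without touching the children's own locks.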
Related
We need strong consistency (insert where not exists, check conditions, etc.) to keep things in order in a fast-moving DynamoDB store. However, we do far more reads than writes, and would prefer to set ConsistentRead = false because it is faster, more stable (when nodes are down) and (most importantly) less costly.
If we use a Transaction write items collection to commit changes, does this wait for all nodes to propagate before returning? If so, surely you don’t need to use a consistent read to query this… is that the case?
No. Transactional writes work like regular writes in that they are acknowledged when they are written to at least 2 of the 3 nodes in the partition. One of those 2 nodes must be the leader node for the partition. The difference in a transaction is that all of the writes in that transaction have to work or none of them work.
If you do an eventually consistent read after the transaction, there is a 33% chance you will get the one node that was not required for the ack. Now then, if all is healthy that third node probably has the write anyhow.
All that said, if your workload needs a strongly consistent read like you indicate, then do it. Don't play around. There should not be a performance hit for a strong consistent read, but like you pointed out, there is a cost implication.
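The 33% figure above can be checked with a toy model. This is an illustrative simulation of the quorum behavior described in the answer, not DynamoDB's actual internals: a write is acked by 2 of 3 replicas, an eventually consistent read hits one replica at random, so roughly one read in three lands on the replica that was not required for the ack (and may therefore be stale if that replica is lagging):

```python
import random

def stale_candidate_fraction(trials=30000, seed=42):
    """Fraction of eventually consistent reads served by the replica
    that was not part of the 2-of-3 write acknowledgement."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        acked = rng.sample(range(3), 2)   # the two replicas that acked the write
        read_from = rng.randrange(3)      # replica chosen to serve the read
        if read_from not in acked:
            hits += 1
    return hits / trials

fraction = stale_candidate_fraction()
```

As the answer notes, in a healthy partition the third replica usually has the write too, so this is an upper bound on actually-stale reads, not the stale rate itself.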
I am looking for a collection data structure that is:
thread safe
lock free
non-allocating (amortized or pre-allocated is fine)
non-intrusive
not using exotic intrinsics
Element order does not matter. Stack, queue, bag, anything is fine. I have found plenty of examples that fulfill four of these five requirements, for example:
.NET's List is not thread safe.
If I put a mutex on it, then it's not lock free.
.NET's ConcurrentStack is thread safe, lock free, uses a simple CompareExchange, but allocates a new Node for each element.
If I move the next pointer from the Node to the element itself, then it's intrusive.
Array based lock free data structures tend to require multi-word intrinsics.
I feel like I'm missing something super obvious. This should be a solved problem.
.NET's ConcurrentQueue fulfills all five requirements. It does allocate when the backing storage runs out of space, similar to List&lt;T&gt;, but as long as there is extra capacity, no allocations occur. Unfortunately, the only way to reserve extra capacity upfront is to initialize it with a collection of the same size and then dequeue all the elements.
The same is true for .NET's ConcurrentBag.
I want to make it scalable. Suppose the letters are all lower case. For example, if I only have two machines, queries whose first character is within 'a' ~ 'm' can be dispatched to the first machine, while the 'n' ~ 'z' queries can be dispatched to the second machine.
However, when the third machine comes, to make the queries spread as even as possible, I have to re-calculate the rules and re-distribute the contents stored in the previous two machines. I feel it could be messy. For example, the more complex case, when I already have 26 machines, what should I do when the 27th one comes? What do people usually do to achieve the scalability here?
The process of (self-) organizing machines in a DHT to split the load of handling queries to a pool of objects is called Consistent Hashing:
https://en.wikipedia.org/wiki/Consistent_hashing
I don't think there's a definitive answer to your question.
First is the question of balance. The DHT is balanced when:
each node is under similar load? (load balancing is probably what you're after)
each node is responsible for similar amounts of objects? (this is what you seemed to suggest)
(less likely) each node is responsible for a similar amount of the addressing space?
I believe your objective is to make sure none of the machines is overloaded. Unless queries to a single object are enough to saturate a single machine, this is unlikely to happen if you rebalance properly.
If one of the machines is under significantly lower load than the other, you can make the less-load machine take over some of the objects of the higher-load machine by shifting their positions in the ring.
Another way of rebalancing is through virtual nodes -- each machine can simulate being k machines. If its load is low, it can increase the number of virtual nodes (and take over more objects). If its load is high, it can remove some of its virtual nodes.
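A minimal sketch of such a ring with virtual nodes follows; the MD5 hash and the vnode count of 64 are arbitrary illustrative choices, not a standard. The point is that adding a machine only moves the keys that now fall on its vnodes, so no global re-distribution (your 27th-machine problem) is needed:

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative sketch)."""
    def __init__(self, vnodes=64):
        self.vnodes = vnodes
        self._ring = []          # sorted list of (hash, machine) points

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, machine):
        # Each machine appears at `vnodes` pseudo-random points on the ring.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{machine}#{i}"), machine))

    def remove(self, machine):
        self._ring = [(h, m) for h, m in self._ring if m != machine]

    def lookup(self, key):
        # A key belongs to the first vnode at or after its hash (wrapping).
        i = bisect.bisect(self._ring, (self._hash(key), ""))
        return self._ring[i % len(self._ring)][1]

ring = HashRing()
ring.add("m1")
ring.add("m2")
keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.add("m3")                      # third machine joins
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Only the keys in `moved` change owner, and every one of them moves to the new machine -- the existing machines never exchange keys with each other when a node joins.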
I use a QAbstractItemModel to represent a tree model (of up to a few hundred items). The data itself is dynamic; at any time nodes may appear or disappear, and values (or other roles) may change.
Making changes to the model is easy; I am wondering how to efficiently emit the signals in order to notify a QTreeView of the changes (most of its nodes are collapsed).
At any given time multiple changes may occur simultaneously (row insertions and/or deletions).
Using beginInsertRows / endInsertRows / beginRemoveRows / endRemoveRows - shouldn't there be a method to notify the view of multiple changes?
In terms of performance, what would be the best strategy? For example, starting at the leaves and going up to the root / for each node - bottom to top (vs. top to bottom) / deletions before insertions / etc.
Would beginResetModel / endResetModel necessarily be less efficient?
Is there any advantage for using QStandardItemModel? (for this specific case).
Yes. The method of notifying everyone of disjoint removals/additions is to emit multiple signals. It would, in most cases, cause more overhead to pass some complex data structure instead of just the parent index and delimiting row/column indices.
You should only notify about the removal/addition of the item closest to the root. It makes no sense to notify about the removal of children if their parent is subsequently going to vanish. The notification about the parent means that the children, obviously, aren't there anymore.
It's not only about efficiency, but also about state. A model reset resets the view state. The view, upon receiving a reset, can't but assume that it got an entirely new, unrelated model - so you lose selections, expanded/collapsed state, etc. There's no way for a view to act any other way in face of a reset. Doing otherwise would require a view to keep its own copy of the model's contents.
Since a model reset implies relayout of all of the items, and it can be a very expensive thing to do, you should only do it if, in aggregate, more than 50% of the original items are changed (removed/replaced/added).
No, there is no advantage, and unless you store your data as variants, using QStandardItemModel will always end up with larger memory overhead. It is a convenience class that makes sense if it fits your needs exactly. In fact, if you're not careful about how you use it, it will work worse.
For example, if you remove an item by iterating depth-first and removing the farthest children first, then QStandardItemModel can't foresee the future - namely, that you really want to remove the common ancestor of all of those children - and will emit a lot of change events unnecessarily. You can deal with this properly in your own model, or you can simply remove the common parent without touching the children, as they will be implicitly removed too.
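The "closest to the root" rule generalizes beyond Qt. As a sketch (the helper name and the parent-lookup callback are hypothetical, not part of the Qt API), a batch of removals can be collapsed to the minimal set of subtree roots before any begin/end signal pair is emitted:

```python
def minimal_removal_set(nodes, parent):
    """Given nodes scheduled for removal and a parent-lookup function,
    keep only the nodes whose ancestors are NOT also being removed:
    removing an ancestor implicitly removes its whole subtree, so only
    the nodes closest to the root need change notifications."""
    doomed = set(nodes)
    keep = []
    for node in nodes:
        p = parent(node)
        while p is not None and p not in doomed:
            p = parent(p)
        if p is None:            # no ancestor of this node is being removed
            keep.append(node)
    return keep

# Toy tree expressed as a child -> parent map.
parents = {"root": None, "a": "root", "b": "root", "a1": "a", "a2": "a"}
survivors = minimal_removal_set(["a", "a1", "a2", "b"], parents.get)
```

Here removing "a" already covers "a1" and "a2", so only "a" and "b" warrant beginRemoveRows/endRemoveRows notifications.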
I've messed up my views a bit (big surprise for CC) and I have a child stream that has most of what I need, but not all. My parent stream needs to be updated, but I can't because of some issues (maybe evil twins I dunno).
Is it possible/wise to do the following
1) clear all elements in the parent stream
2) use clearfsimport to perform mass update on child stream
3) deliver child stream to parent
This is of course dependent on the fact that child stream elements are not deleted when they are deleted from the parent stream.
Should I just clear out all elements of both views and start over? Any suggestions would be appreciated.
Yes, you can do a clearfsimport from whatever source you want to the child stream.
But I wouldn't recommend "clearing" (as in "rmnam'ing") all elements from the parent stream, even though it doesn't rmname them in the child stream, as my answer to your previous question details.
If you have a valid source (ie some directory with every file you need), you can clearfsimport it to your child stream view, in order to be complete.
Then try the deliver and identify the potential evil twins: your deliver will stop quickly at the "directory merge" stage, asking you to choose between two (identically named) files: you will choose the one coming from the child stream.
All the other files present in both streams will see their history updated as expected by that deliver.