How the chances of getting "read-your-writes" consistency are increased in Dynamo? - amazon-dynamodb

In Section 5 of Dynamo paper, there is the following content:
In particular, since each write usually follows a read operation, the
coordinator for a write is chosen to be the node that replied fastest to the
previous read operation which is stored in the context information of the
request. This optimization enables us to pick the node that has the data that
was read by the preceding read operation thereby increasing the chances of
getting "read-your-writes" consistency.
How the chances of getting "read-your-writes" consistency are increased?
"read-your-writes" means that a read following a write gets the value set by the
write. The read and the write are performed by two different clients for this
context. The reason is that the choice of the write coordinator does not impact
on the chances of getting "read-your-writes" by the same client.
But the above text is talking about a write following a read. Here is my guess.
The read coordinator will try to do syntactic reconciliation if it is possible.
If syntactic reconciliation is impossible because of divergent versions, the
client need to do semantic reconciliation before doing a write. Either way, the
versions on all the nodes involved in the read operation is an ancestor of the
reconciled version. So the following write can be sent to any of them to get
applied. The earliest time for a write to be seen by a read is after the
following steps are finished:
Client contact the write coordinator.
The write coordinator generates the version clock for the new version.
The write coordinator writes the new version locally.
The shorter the time to perform the above steps, the more likely another
following read sees the new version. Since it is very possible that the node
which replied fastest to the previous read can perform the following steps in a
shorter time. Such a node is chosen as the write coordinator.

Section 2.3 talks about performing the reconciliation at read time rather than write time.
Data versioning - "One can determine whether two versions of an
object are on parallel branches or have a causal ordering, by
examine their vector clocks."
This paragraph from section 4. [emphasis mine]
In Dynamo, when a client wishes to update an object, it must specify
which version it is updating. This is done by passing the context it
obtained from an earlier read operation, which contains the vector
clock information. Upon processing a read request, if Dynamo has
access to multiple branches that cannot be syntactically reconciled,
it will return all the objects at the leaves, with the corresponding
version information in the context. An update using this context is
considered to have reconciled the divergent versions and the
branches are collapsed into a single new version
So by performing the read first, you're effectively reconciling all divergent versions prior to writing. By writing to that same node, the version you've updated is marked with the context and vector clock of the most up to date version and all divergent branches can be collapsed. This is sent to the top N nodes (as you've stated) as fast as possible. But by removing the divergent branches - you reduce the chance that multiple values could be returned. You only need one of the N nodes read in the next read to get the reconciled write. ie - the node as part of the quorum of R reads says - "I am the reconciled version, and all others must bow to me". (and if that has already been distributed to another of the "R" nodes, then there's even greater chance of getting the reconciled version in the quorum)
But, if you wrote to a different node, one that you hadn't read from - the vector clock that is being updated may not necessarily be a reconciled version of the object. Therefore, you could still have divergent branches. The following read will try and reconcile it, but it's more more probable that you could have multiple divergent data and no reconciliation.
If you've made it this far, I think the most interesting part is that per Section 6, client applications can dictate the values of N, R and W - ie - number of nodes that constitute the pool to draw from, and the number of nodes that must agree on a read or write for it to be successful.
Geez - my head hurts now.

I re-read the Dynamo paper. I have a new understanding of "read-your-write" consistency. "read-your-writes" involves only one client. Image the following requests performed by one client on the same key:
read-1
write-1
read-2
"read-your-writes" means that read-2 sees write-1. The write coordinator has the best chance to have write-1. To ensure "read-your-writes", it is desired that the write coordinator replies fastest to read-2. It is highly possible that the node replies fastest to read-1 also reply fastest to read-2. So choose the node replies fastest to read-1 as the write coordinator.
And what is the node that replied fastest to the previous read operation? Such a node only makes sense if client-driven coordination is used. For server-side coordination, the coordinator nodes replies to the client and the other involved nodes reply to the coordinator node. replied fastest is meaningless in this case.

Related

Gremlin console keeps returning "Connection to server is no longer active" error

I tried to run a Gremlin query adding a property to vertex through Gremlin console.
g.V().hasLabel("user").has("status", "valid").property(single, "type", "valid")
I constantly get this error:
org.apache.tinkerpop.gremlin.jsr223.console.RemoteException: Connection to server is no longer active
This error happens after query is running for one or two minutes.
I tried some simple queries like g.V().limit(10) and it works fine.
Since the affected vertex count is more than 4 million, not sure if it is failing due to timeout issue.
I also tried to split it into small batches:
g.V().hasLabel("user").has("status", "valid").hasNot("type").limit(200000).property(single, "type", "valid")
It succeeded for first few batches and started failing again.
Is there any recommendations for updating millions of vertices?
The precise approach you take may vary depending on the backend graph database and storage you are using as well as the capacity of the hardware being used.
The capacity of the hardware where Gremlin Server is running in terms of number of CPUs and most importantly, memory, will also be a factor as will the setting of the query timeout value.
To do this in Gremlin, if you had a way to identify distinct ranges of vertices easily you could split this up into multiple threads each doing batches of updates. If the example you show is representative of your actual need then that is likely not possible in this case.
Likewise some graph databases provide a bulk load capability that is often a good way to do large batch updates but probably not an option here as you need to do essentially a conditional update based on looking at the current presence (or not) of a property.
Without more information about your data model and hardware etc. the best answer is probably to do two things:
Use smaller limits. Maybe try 5K or even just 1K at first and work up from there until you find a reliable sweet spot.
Increase the query timeout settings.
You may need to experiment to find the sweet spot for your environment as the capacity of the hardware will definitely play a role in situations like this as well as how you write your query.

How do I guarantee task order processing for a queue with multiple consumers in RabbitMQ?

Say I want to start friendship between A and B.
Say I want to end friendship between A and B.
Those are two tasks I want to send to a queue having multiple consumers (workers).
I want to guarantee processing order so, how to avoid the second task to be performed before the first?
My solution: make tasks sticky (tasks about A are always sent to the same consumer).
Implementation: use RabbitMQ's exchanges and map tasks to the available consumers.
How do I map A to its consumer? I'm thinking about nginx's ip_hash. I think I need something similar.
I don't know if it is relevant but A and B are uuid.v4() UUIDs.
Can you point me out to the algorithm I need to accomplish mapping, please?
Well, there are two options:
make one exchange / queue for all events and guarantee that they're gonna be inserted in proper order. Create one worker for them. This costs more on inserting data (and doesn't give you option of scalability).
prepare your app for such situation, e.g. when you get message destroyFriendship and friendship does not exist - save message to db containing future friendship ending. Then you can have multiple workers making and destroying friendship and do not have to care about proper order. Simply do your job, make friends and if there's row in db about ending of friendship - destroy it (or simply do not create). Of course you need to check timestamp of creation/destroying time and check if destroying time was after creation time!
Of course you can count somehow hash of A/B, but it would be IMO more costfull then preparing app. Scalling app using excahnges/queues is not really good - you're going to create more and more queues and it's going to end up in too many queues/exchanges in rabbitmq.
If you have to use solution you specified - you can for example count crc32 from A and B, and using it's value calcalate to which queue task should be send. But having multiple consumers might result wrong here - what if one of consumers is blocked somehow and other receive message with destroying friendship? Using this solution I'd say that it's dangerous to have more than 1 worker per group of A/B.

Setting a new label on all nodes takes too long in a huge graph

I'm working on a graph containing about 50 million nodes and 40 million relationships.
I need to update every node.
I'm trying to set a new label to these nodes, but it's taking too long.
The label applies to all 50 million nodes, so the operation never ends.
After some research, i found out that Neo4j treats this operation as a single transaction (i don't know if optimistic or not), keeping the changes uncommitted, until the end (which will never happen in this fashion).
I'm currently using Neo4j 2.1.4, which has a feature called "USING PERIODIC COMMIT" (already present in earlier versions). Unfortunately, this feature is coupled to the "LOAD CSV" feature, and not available to every cypher command.
The cypher is quite simple:
match n set n:Person;
I decided to use a workaround, and make some sort of block update, as follows:
match n
where not n:Person
with n
limit 500000
set n:node;
It's ugly, but i couldn't come up with a better solution yet.
Here are some of my confs:
== neo4j.properties =========
neostore.nodestore.db.mapped_memory=250M
neostore.relationshipstore.db.mapped_memory=500M
neostore.propertystore.db.mapped_memory=900M
neostore.propertystore.db.strings.mapped_memory=1300M
neostore.propertystore.db.arrays.mapped_memory=1300M
keep_logical_logs=false
node_auto_indexing=true
node_keys_indexable=name_autocomplete,document
relationship_auto_indexing=true
relationship_keys_indexable=role
execution_guard_enabled=true
cache_type=weak
=============================
== neo4j-server.properties ==
org.neo4j.server.webserver.limit.executiontime=20000
org.neo4j.server.webserver.maxthreads=200
=============================
The hardware spec is:
RAM: 24GB
PROC: Intel(R) Xeon(R) X5650 # 2.67GHz, 32 cores
HDD1: 1.2TB
In this environment, each block update of 500000 nodes took from 200 to 400 seconds. I think this is because every node satisfies the query at the start, but as the updates take place, more nodes need to be scanned to find the unlabeled ones (but again, it's a hunch).
So what's the best course of action whenever an operation needs to touch every node in the graph?
Any help towards a better solution to this will be appreciated!
Thanks in advance.
The most performant way to achieve this is using the batch inserter API. You might use the following recipe:
take a look at http://localhost:7474/webadmin and note the "node count". In fact it's not the number of nodes it's more the highest node id in use - we'll need that later on.
make sure to cleanly shut down your graph database.
take a backup copy of your graph.db directory.
write a short piece of java/groovy/(whatever jvm language you prefer...) program that performs the following tasks
open your graph.db folder using the batch inserter api
in a loop from 0..<node count> (from step above) check if the node with given id exists, if so grab its current labels and amend the list by the new label and use setNodeLabels to write it back.
make sure you run shutdown with the batchinserter
start up your Neo4j instance again

How to write with a single node in MPI

I want to implement some file io with the routines provided by MPI (in particular Open MPI).
Due to possible limitations of the environment, I wondered, if it is possible to limit the nodes, which are responsible for IO, so that all other nodes are required to perform a hidden mpi_send to this group of processes, to actually write the data. This would be nice in cases, where e.g. the master node is placed on a node with high-performance filesystem and the other nodes have only access to a low-performance filesystem, where the binaries are stored.
Actually, I already found some information, which might be helpful, but I couldn't find further information, how to actually implement these things:
1: There is an info key MPI_IO belonging to the communicator, which tells which ranks provide standard-conforming IO-routines. As this is listed as an environmental inquiry, I don't see, where I could modify this.
2: There is an info key io_nodes_list which seems to belong to file-related info-objects. Unfortunately, the possible values for this key are not documented and Open MPI doesn't seem to implement them in any way. Actually, I can't even get the filename from the info-object which is returned by mpi_file_get_info...
As a workaround, I could imagine two things: On the one hand, I could perform the IO with standard Fortran routines, or on the other hand, create a new communicator, which is responsible for IO. But in both cases, the processes, which are responsible for IO have to check for possible IO from the other processes to perform manual communication and file interaction.
Is there a nice and automatic way to restrict the IO to certain nodes? If yes, how could I implement this?
You explicitly asked about OpenMPI, but there are two MPI-IO implementations in OpenMPI. The old workhorse is ROMIO, the MPI-IO implementation shared among just about every MPI implementation. OpenMPI also has OMPIO, but I don't know a whole lot about tuning that one.
Next, if you want things to happen automatically for you, you'll have to use collective i/o. The independent I/O routines cannot send a message to anyone else -- they are independent and there's no way to know if the other side will be listening.
With those preliminaries out of the way...
You are asking about "i/o aggregaton". There is a bit of information here in the context of another optimization called "deferred open" (and which OMPIO calls Lazy Open)
https://press3.mcs.anl.gov/romio/2003/08/05/deferred-open/
In short, you can definitely say "only these N processes should do I/O", and then the collective I/O library will exchange data and make sure that happens. The optimization was developed some 15-odd years ago for just the situation you proposed: some nodes being better connected to storage than others (as was the case on the old ASCI Red machine, to give you a sense for how old this optimization is...)
I don't know where you got io_nodes_list. You probably want to use the MPI-IO info keys cb_config_list and cb_nodes
So, you've got a cluster with master1, master2, master3, and compute1, compute2, compute3 (or whatever the hostnames actually are). You can do something like this (in c, sorry. I'm not proficient in Fortran):
MPI_Info info;
MPI_File fh;
MPI_Info_create(&info);
MPI_Info_set(info, "cb_config_list", "master1:1,master2:1,master3:1");
MPI_File_open(MPI_COMM_WORLD, filename, MPI_MODE_CREATE|MPI_MODE_WRONLY, info, &fh)
With these hints, MPI_File_write_all will aggregate all the I/O through the MPI processes on master1, master2, and master3. ROMIO won't blow up your memory because it will chunk up the I/O into a smaller working set (specified with the "cb_buffer_size" hint: cranking this up, if you have the memory, is a good way to get better performance).
There is a ton of information about the hints you can set in the ROMIO users guide:
http://www.mcs.anl.gov/research/projects/romio/doc/users-guide/node6.html

Biztalk Ordered Delivery failure

We have a BizTalk application where the order of messages being inputted is very important and has to be kept, meaning they have to be outputted in the same order. Normally ordered delivery would do the trick here.
However I read that ordered delivery is only guaranteed when you connect a receive location directly to a send port. The moment you use orchestrations the order delivery isn't guaranteed anymore. Is there a way to work around or fix this? Because this kind of ruins our whole application and we've been working on this for months.
I read a work around from Microsoft where they use an extra field which has a counter and where they use an end orchestration which checks the counters. But this is way too much work for us to do now. So this work around is a no go. Plus not all messages are translated which creates holes in our flow and not all messages are coming from the same source either which makes this work around useless anyway.
Any other ideas?
Check out this page.
It explains that if you have an orchestration that follows the singleton pattern to ensure only one instance of the orchestration exists, and you make sure you set the orchestration's receive port to ordered delivery, than you should get a valid end-to-end ordered delivery scenario
To provide end-to-end ordered delivery the following conditions must be met:
Messages must be received with an adapter that preserves the order of the messages when submitting them to BizTalk Server. In BizTalk Server 2006, examples of such adapters are MSMQ, MQSeries, and MSMQT. In addition, HTTP or SOAP adapters can be used to submit messages in order, but in that case the HTTP or SOAP client needs to enforce the order by submitting messages one at a time.
You must subscribe to these messages with a send port that has the Ordered Delivery option set to True.
If an orchestration is used to process the messages, only a single instance of the orchestration should be used, the orchestration should be configured to use a sequential convoy, and the Ordered Delivery property of the orchestration's receive port should be set to True.
Resequencing strategies for ordered delivery in BizTalk:
I recently responded to a LinkedIn user's question regarding ordered delivery options in BizTalk.
I thought it would be useful for people to understand some of the strategies for re-sequencing messages using BizTalk.
Often as an BizTalk Developer, you are required to integrate to line-of-business systems which are unchangeable. This can be for one or more of many different reasons. As an example, the cost of changing a system can be too high or the vendor license states that support may be withdrawn if the system is changed.
This would not normally represent a problem where the vendor has provided a thoughtfully designed API as a point-of-integration endpoint. However, as many Integration Developers quickly learn, this is very rarely the case.
What do I mean by a thoughtfully designed API? Well, aside from all the SODA principals (service composition, fault contracts etc.), the most important feature of an API is to support the consumption of data which arrives in the wrong order.
This is a fairly simple thing to do. For example, if you are a vendor and you provide a HTTP operation as your integration point then one of the fields you could expose on your operation is a time-stamp or, even better, a sequence number. This means that if a call is made with an out-of-date payload, the relevant compensating mechanism can kick-in - which can be as simple as discarding the data.
This article discusses what to do when the vendor has not built this feature into an API, and as an integrator you therefore are forced to implement end-to-end ordered delivery as part of your integration solution.
As stated in my response to the user's post on LinkedIn (see link above), in BizTalk ordered delivery in any but the simplest of scenarios is complicated at best and at worst can represent a huge cost in increased complexity, both in terms of development and support. The basic reason is that BizTalk is designed to be massively concurrent to enable high throughput, and there is a direct and unavoidable conflict between concurrency and ordering. Shoe-horning E2E ordered delivery into a BizTalk solution relies on artefacts such as singleton orchestrations which introduce complexity and increase both failure rate and cost-per-failure numbers.
A far better solution is to maintain concurrent processing to as near as possible to the line-of-business system endpoints, and then implement what is called a re-sequencer wrapper around each of the endpoints which require data to be delivered in the correct order.
How to implement such a wrapper in BizTalk depends on some factors, which are outlined in the following table:
|Sequencing |Messages|Database |Wrapper |
|field |are |integration?|strategy |
| |deltas? | | |
|--------------|--------|------------|----------------------------------|
|n of a total m| N | Y |Stored procedure |
|n of a total m| N | N |Singleton orchestration |
|n of a total m| Y | Y |Batched singleton orchestration |
|n of a total m| Y | N |Batched singleton orchestration |
|Timestamp | N | Y |Stored procedure |
|Timestamp | N | N |Singleton orchestration |
|Timestamp | Y | Y |Buffer table with staggered reader|
|Timestamp | Y | N |Buffer table with staggered reader|
The first factor Sequencing field relates to the idea that in order to implement any kind of re-sequencer wrapper, as a minimum you will require that your message data contains some sequencing information. This can take the form of a source time-stamp; however a better, though rarer, kind of sequencing information consists of a sequence number combined with the total number of messages, for example, message 1 of 10, 2 of 10, etc..
The second factor Messages are deltas? relates to whether or not the payload of your message contains a single state change to the data or the sum of all past changes to the data. Put another way, is it possible to reconstruct the full current state of the data from this message? If the message payload contains just a single change then it may not be possible to reconstruct the state of the data from the single message, and in this instance your message is a delta.
The third factor Database integration? relates to whether or not the integration-entry-point to a system is a database. The reason this matters is that integrating at the database layer is a fairly common integration scenario, and if available can greatly simplify handling re-sequencing.
The strategies from the above table are described in detail below:
Stored procedure wrapper
This is the simplest of the resequencing strategies. A new stored procedure is created which queries the target data before making a decision about whether to update the target data. The decision can be as simple as Is the data I have newer than the data in the target system?
Of course, in order to implement this strategy, the target data also has to include the sequencing field of the source data, although an approximation can be made if necessary by relying on existing time-stamps which may already exist in the target data. The stored procedure wrapper can be contained either in the target database or ideally in a separate database.
Singleton orchestration wrapper
The idea behind this strategy is the singleton orchestration. This is a pattern you can implement to ensure that only a single instance of the orchestration will exist at any one time. There are many articles on the web demonstrating how to implement this pattern in BizTalk.
The core of the idea is that the singleton simply keeps a track of the most recent successfully processed message sequence (or time-stamp). If the singleton receives a message which is older than the most recent sequence it is simply discarded. This works because the messages are non-deltas, so the target system can commit only the most recent of a number of messages and the data will be in the most recent state. Only when data is committed successfully is the most recent sequence held by the singleton updated.
Batched singleton orchestration wrapper
This strategy is based on the Singleton orchestration wrapper above, except it is more complex. Rather than only keep the most recent sequence information in memory the singleton is required to create and hold a working set of messages in memory which it will re-order and then process once all expected messages from the batch have arrived. This is because the messages are deltas so the target system MUST receive each message in the order they were intended. Once the batch has been sent successfully the singleton can terminate.
To do this it is a requisite of the source data that it contain a correlation identifier of some description which allows the batch of messages to be defined. For example, processing a defined set of orders from a customer, the inbound messages must contain an identifier for the customer. This can then be used to route the messages to the singleton orchestration instance correlated with this customer. Furthermore the message sequence field available must be of the n of a total m form.
Once the singleton is initialised it assembles a working set of messages in memory and proceeds to populate it as new messages arrive. One way I have seen this done is using a System.Collections.Generic.List as the container for the working set. Once the list has been fully populated (list length = m) then it is assumed all messages in the batch have been received and the orchestration then loops over the working set in sequence and processes the messages into the target system.
One of the benefits of the batched singleton orchestration wrapper is it allows concurrent processing by correlation identifier. In the example above this means that messages from two customers would be processed concurrently.
Buffer table with staggered reader wrapper
Arguably the most complex of the strategies presented, this solution is to be used when you have delta messaging with a time-stamp-based sequencing field. It can be implemented with a database of some description which acts as a re-sequencing buffer.
It is worth noting here that this re-sequencing wrapper does not guarantee ordered delivery, but used well it makes ordered delivery highly likely.
As messages arrive, they are written into the buffer and in the same operation the buffer is reordered, so that the order of messages held in the buffer are always correct.
To create the buffer reader, have a receive location which reads the messages in the buffer before passing the messages to a send port with ordered delivery enabled, which then will process the messages into the target system. You can also use a singleton orchestration as an intermediary if your target system's API semantics are too complex for a send port.
However, using this wrapper as I have described it above will not enable ordered delivery, as the messages will almost certainly be committed to the buffer in the wrong order, which will result in the messages being processed into the target system in the same (wrong) order. This is where the staggered query comes in. This is a fancy way of saying your buffer query needs to only select data at intervals of time T, AND only select those rows where the row-number is lower than buffer total row count minus C.
This has the effect of allowing sequencing to occur over an appropriate timespan. T will be familiar to most BizTalk developers as the polling interval of some adapters (such as the WCF-SQL adapter). C is slightly more difficult to set, but by increasing this number you are reducing the chances that when you poll, you will miss a message older than the most recent one in your retrieved data set.
What T and C are depends on many things, although these values should be based on your latency SLA and your message volume (or throughput). As a guideline, if you have a SLA to deliver data into your target system within 30 seconds and you process 10 messages per second then T should be around 10 seconds and C should be around 100 rows.
Of course this only works if your messages for a given correlation id are sent by the source system during a short space of time (ideally back-to-back). The longer the interval between sends, the greater the required value of C, and the less effective the wrapper becomes.
One of the benefits of this strategy is you can also perform de-duplication of messages in the buffer if your data source is prone to sending duplicate messages and your target system endpoint is not idempotent. You can also use the buffer to implement FILO and other non-standard queueing semantics.
Conclusions
The strategies I have discussed here are ways of bending BizTalk to a task which is wasn't designed to do. As a result each has caveats around cost and complexity to support, and also may not work in certain scenarios. I would like to hear from anyone who has implemented other patterns for ordered delivery in BizTalk.

Resources