Can I do a read from replica using Spring Data Couchbase?

For reads, Couchbase recommends checking for certain exceptions and reading from a replica instead, in order to improve availability for operations that happen during a failover, as long as you're OK with possibly stale data. Does Spring Data provide anything for this? There's no getFromReplica operation exposed that I can find.

Indeed, getFromReplica is not exposed in Spring Data; you have to drop down to a lower level to do that.
Most people using Spring Data expect results to be consistent, so we want developers to be very aware when they make a decision that affects the consistency level. This is why getFromReplica is not available through Spring Data and why you have to use the Couchbase Bucket object directly: it has to be your decision, because it might give you inconsistent results.
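For example, you can take the underlying Bucket (e.g. the one exposed by your Couchbase configuration or template) and do the replica fallback yourself. A minimal sketch, assuming the 2.x Java SDK and treating a RuntimeException (such as a timeout during a failover) as the cue to fall back:
import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.ReplicaMode;
import com.couchbase.client.java.document.JsonDocument;
import java.util.List;

// Sketch only: assumes Couchbase Java SDK 2.x and that a RuntimeException
// (e.g. a timeout during a failover) is the signal to fall back to a replica,
// accepting possibly stale data.
public class ReplicaFallbackReader {

    public static JsonDocument getWithReplicaFallback(Bucket bucket, String id) {
        try {
            return bucket.get(id);                              // normal read from the active node
        } catch (RuntimeException e) {
            List<JsonDocument> replicas = bucket.getFromReplica(id, ReplicaMode.ALL);
            return replicas.isEmpty() ? null : replicas.get(0); // first replica copy, may be stale
        }
    }
}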
Note that this only applies to key/value gets. If you are using queries, you can tune the consistency level by setting a property in application.properties:
# Default level of consistency (read-your-own-writes|eventually-consistent|strongly-consistent|update-after)
spring.data.couchbase.consistency=read-your-own-writes
The consistency levels are explained in the documentation: http://docs.spring.io/spring-data/couchbase/docs/current/reference/html/#couchbase.repository.consistency

Related

Storing data on edges of GraphDB

It's being proposed that we store data about a relationship between two vertices on the edge between them. The idea is that the two vertices are related and there are user-level pieces of information that need to be stored in the graph. The best example I can think of would be a Book and a Reader, where the Reader can store cliff notes on the edge for retrieval later on.
Is this common practice? It seems to me that we should minimize the amount of data living in edges and that the vast majority of graph DB data should be derived data, rather than using the graph as an actual data store. Given that it's in memory, what happens when it goes down? (We're using Neptune, so there are technically backups.)
Sorry if the question is a bit vague, but I'm not sure how else to ask. I've googled around looking for best practices, and it's all pretty generic material about the concepts and theories of graph databases.
An additional question: is it common practice to expose the Gremlin API directly to users, or should there always be a GraphQL (or other) API in front of it?
Without much additional detail it is hard to provide exact modeling advice, but in general one of the advantages of using a graph database is that edges are first-class citizens and can carry properties. A common use case for this would be something like PERSON - purchases -> PRODUCT, where you might have a purchase_date property on the purchases edge to represent the date of the purchase, since someone might buy the same thing multiple times.
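To make that concrete for the Book/Reader example in the question, here is a rough Gremlin sketch (run against an in-memory TinkerGraph purely for illustration; the "reads", "cliff_notes" and other labels/property names are made up, but the same traversals work against Neptune):
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

public class EdgePropertyExample {
    public static void main(String[] args) {
        // In-memory TinkerGraph stands in for Neptune here; the Gremlin is the same.
        GraphTraversalSource g = TinkerGraph.open().traversal();

        Vertex reader = g.addV("reader").property("name", "Alice").next();
        Vertex book = g.addV("book").property("title", "Moby-Dick").next();

        // The user-level data (cliff notes) lives on the edge between them.
        g.addE("reads").from(reader).to(book)
                .property("cliff_notes", "Call me Ishmael...")
                .property("last_updated", "2023-01-01")
                .iterate();

        // Retrieve the notes for this particular reader/book pair from the edge.
        Object notes = g.V(reader).outE("reads")
                .where(__.inV().hasId(book.id()))
                .values("cliff_notes")
                .next();
        System.out.println(notes);
    }
}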
I am not sure exactly what you mean by "a vast majority of GraphDB data be derived data". You can use graphs to derive and infer data/relationships based on the connections, but they fully support storing data in them as well.
"Given that it's in memory, what happens when it goes down?" - Amazon Neptune (and most other databases) uses a buffer cache to keep some data in memory, but that data is also persisted to disk, so if the instance goes down there is no problem recovering it from durable storage.
"An additional question: is it common practice to expose the Gremlin API directly to users, or should there always be a GraphQL (or other) API in front of it?" - Just as with any database, I would not recommend exposing the Gremlin API directly to consumers, as doing so comes with a whole host of potential security risks. Generally, the underlying data store of any application should be hidden from its users. They should interact with an interface like REST/GraphQL that is designed to answer business-related questions, without knowing or caring that a graph database backs those requests.
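As a hypothetical illustration of that layering, a thin REST facade over the graph might look like the following sketch (Spring MVC assumed; the endpoint path, labels and property names are invented for the Book/Reader example). Consumers call the endpoint and never see a traversal:
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical REST facade: clients ask a business question ("what are my notes
// for this book?") and never see a Gremlin traversal.
@RestController
public class ReaderNotesController {

    private final GraphTraversalSource g;

    public ReaderNotesController(GraphTraversalSource g) {
        this.g = g;
    }

    @GetMapping("/readers/{readerId}/books/{bookId}/notes")
    public String getNotes(@PathVariable String readerId, @PathVariable String bookId) {
        // The traversal is an implementation detail hidden behind the endpoint.
        return g.V(readerId).outE("reads")
                .where(__.inV().hasId(bookId))
                .values("cliff_notes")
                .tryNext()
                .map(Object::toString)
                .orElse("");
    }
}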

Corda dynamic data handling at the time creating a contract

I am looking for a way to provide dynamic data / a dynamic list to all nodes, or to specified nodes, at the time of creating a contract in Corda. I don't think an oracle is a good approach in my case, for the following reasons:
The data can be a list of, for example, legal entity names; the values do not come from the outside world and are not a single value;
The list depends on the particular field(s) selected, so it would perhaps need a centralized place to maintain the data relationship.
I'd appreciate it if anyone can help with this. Thanks.
Kwan
This question is a little difficult to answer without further details on your use case. However, on the surface, an oracle doesn't sound like a bad solution:
The data provided by an oracle can be a list
The term "outside world" simply refers to any information not included in the transaction itself. This term should not be taken too literally.
Ultimately, you can think of an oracle as a provider of "official" data. You request from the oracle a command that includes the data, include that command in the transaction, and the oracle will sign over the transaction if and only if it agrees that the data in the command is true. As long as the oracle is trusted by all parties involved, this allows data from outside the transaction to be included in it in a reliable way.
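For instance, the oracle-provided fact could simply be modeled as a command that carries the list. A rough sketch (only the CommandData interface below is part of the Corda API; the class and field names are hypothetical):
import net.corda.core.contracts.CommandData;
import java.util.Collections;
import java.util.List;

// Hypothetical oracle-provided command: the oracle supplies the list of legal
// entity names that applies to the selected field(s), the transaction includes
// this command, and the oracle signs only if it agrees the list is correct.
public class ApprovedLegalEntities implements CommandData {

    private final String selectedField;
    private final List<String> legalEntityNames;

    public ApprovedLegalEntities(String selectedField, List<String> legalEntityNames) {
        this.selectedField = selectedField;
        this.legalEntityNames = Collections.unmodifiableList(legalEntityNames);
    }

    public String getSelectedField() {
        return selectedField;
    }

    public List<String> getLegalEntityNames() {
        return legalEntityNames;
    }
}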

Lack of timestamps in some entities in OneNote

I am currently working on a mobile application that uses the OneNote REST API, which sometimes has really high latency, so the cache implementation is one of the most important factors affecting my application's performance. To implement a traffic-efficient cache whose data stays up to date, a timestamp is needed for every entity that is not stable or whose count can grow (practically every entity fits these conditions). So the question is whether the timestamp properties (e.g. lastModified, lastModifiedTime, etc.) are genuinely absent from some entities, for example permissions or principal objects, or whether they are just hidden and can be retrieved using $expand.
If you don't see a timestamp in one of the entities, then the entity does not have it.
Adding a timestamp to some of these entities can be challenging for us, as the datastore we're using might not have it, but I encourage you to ask for it on our UserVoice page:
https://onenote.uservoice.com/forums/245490-onenote-developer-apis

High write concurrency backend for storing large set/array based data?

The problem:
I have a web service that needs to check whether a given string is a member of a set of strings, where the number of elements in the set is constantly growing, potentially reaching hundreds of millions.
If the string is not a member of the set, it gets added to the set. The string size is a constant 32 bytes. Only one set variable is required; no other variables need to be persisted.
This check is performed as part of a callback on a webhook, thus performance is critical.
While my use case pretty much fits a bloom filter perfectly, I'm having trouble finding a solution for the persistent storage vs. I/O concurrency part of the problem.
Environment:
DigitalOcean/Linux/Python/Flask, but open to change if required
Possible Solutions:
Redis, storing the variable in a set and then querying via SISMEMBER for a nice O(1) solution. This is what we are currently using, but it doesn't scale well with a large number of keys given that everything must fit in memory, and it also has issues with write concurrency when traffic increases.
SQLite, with WAL mode turned on. I'm concerned about lock contention (SQLITE_BUSY) when the server gets hit with a significant number of webhook requests, and a local server file doesn't scale across host machines.
Postgres, which seems like a nice middle-ground solution, but I might have to deal with lock contention here as well for write concurrency.
Cassandra, given its focus on write performance. Overkill for storing a single column, though?
A custom bloom filter backend; I'm not sure whether something like this exists that provides the functionality of a bloom filter on top of a storage backend with high I/O concurrency.
Thoughts?
The Redis solution can scale well with data sharding. You can set up several Redis instances (or use Redis Cluster), split your data into several parts, i.e. shards, and save each part in a different Redis instance.
When you want to check the membership of a given string, you send the SISMEMBER command to the corresponding Redis instance. Take this answer as an example of how to split data with hash functions.
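A minimal sketch of that routing, assuming the Jedis client (a real deployment would more likely use Redis Cluster or consistent hashing rather than a plain modulo):
import redis.clients.jedis.Jedis;
import java.util.List;

// Sketch: pick the Redis instance by hashing the value, so each instance only
// holds part of the set and memory is spread across hosts.
public class ShardedMembership {

    private final List<Jedis> shards;   // one client per Redis instance
    private final String setKey;

    public ShardedMembership(List<Jedis> shards, String setKey) {
        this.shards = shards;
        this.setKey = setKey;
    }

    private Jedis shardFor(String value) {
        return shards.get(Math.floorMod(value.hashCode(), shards.size()));
    }

    public boolean isMember(String value) {
        // Only the shard responsible for this value is queried.
        return shardFor(value).sismember(setKey, value);
    }
}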
Also, you can implement a bloom filter with Redis (GETBIT and SETBIT). Just a reminder: bloom filters have a false-positive problem.
First, you don't need to use SISMEMBER. Just do SADD systematically and test the returned value: if it's 0, the value was already in the set and so was not added. Doing this easily cuts the number of requests to Redis.
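A sketch of that single-round-trip pattern, again assuming the Jedis client:
import redis.clients.jedis.Jedis;

// SADD returns the number of members actually added: 1 means "new",
// 0 means "already in the set" - no separate SISMEMBER call needed.
public class AddIfAbsent {

    public static boolean addIfAbsent(Jedis jedis, String setKey, String value) {
        long added = jedis.sadd(setKey, value);   // 0 if the value was already present
        return added == 1;                        // true only for values seen for the first time
    }
}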
Second, the description of your problem looks like a perfect match for HBase, which is made for storing very large data sets and querying them using bloom filters. But you'll probably find it's overkill, just like Cassandra.

DynamoDB: Conditional writes vs. the CAP theorem

Using DynamoDB, suppose two independent clients try to write to the same item at the same time, using conditional writes, and both try to change the value that the condition references. Obviously, one of these writes is doomed to fail the condition check; that's OK.
Suppose that during the write operation something bad happens, and some of the DynamoDB nodes fail or lose connectivity to each other. What happens to my write operations?
Will they both block or fail (sacrificing "A" in the CAP theorem)? Will they both appear to succeed, and only later will it turn out that one of them was actually ignored (sacrificing "C")? Or will they somehow both work correctly thanks to some magic (consistent hashing?) going on in the DynamoDB system?
It just seems like a really hard problem, but I can't find anything discussing the possibility of availability issues with conditional writes (unlike, for instance, consistent reads, where the possibility of reduced availability is explicit).
There is a lack of clear information in this area, but we can make some pretty strong inferences. Many people assume that DynamoDB implements all of the ideas from its predecessor "Dynamo", but that doesn't seem to be the case, and it is important to keep the two separate in your mind. The original Dynamo system was carefully described by Amazon in the Dynamo paper. In thinking about this, it also helps to be familiar with the distributed databases based on the Dynamo ideas, like Riak and Cassandra, in particular Apache Cassandra, which provides a full range of trade-offs with respect to CAP.
By comparing DynamoDB, which is clearly distributed, to the options available in Cassandra, I think we can see where it sits in the CAP space. According to Amazon, "DynamoDB maintains multiple copies of each item to ensure durability. When you receive an 'operation successful' response to your write request, DynamoDB ensures that the write is durable on multiple servers. However, it takes time for the update to propagate to all copies." (Data Read and Consistency Considerations). Also, DynamoDB does not require the application to do conflict resolution the way Dynamo does. Assuming they want to provide as much availability as possible, and since they say they are writing to multiple servers, writes in DynamoDB are equivalent to Cassandra's QUORUM level. It would also seem that DynamoDB does not support hinted handoff, because that can lead to situations requiring conflict resolution. For maximum availability, an inconsistent read would only have to be at the equivalent of Cassandra's ONE level. However, to get a consistent read given the quorum writes would require a QUORUM-level read (following the R + W > N rule for consistency). For more information on levels in Cassandra, see About Data Consistency in Cassandra.
In summary, I conclude that:
Writes are "Quorum", so a majority of the nodes the row is replicated to must be available for the write to succeed
Inconsistent Reads are "One", so only a single node with the row need be available, but the data returned may be out of date
Consistent Reads are "Quorum", so a majority of the nodes the row is replicated to must be available for the read to succeed
So writes have the same availability as a consistent read.
To specifically address your question about two simultaneous conditional writes: one or both will fail depending on how many nodes are down, but there will never be an inconsistency. The availability of the writes really has nothing to do with whether they are conditional or not, I think.
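To illustrate the conditional-write scenario itself, here is a rough sketch using the AWS SDK for Java v2 (the table name, attribute names and versioning scheme are made up): if two clients run this with the same expectedVersion, at most one succeeds and the other sees a ConditionalCheckFailedException rather than a silent overwrite.
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.ConditionalCheckFailedException;
import software.amazon.awssdk.services.dynamodb.model.UpdateItemRequest;
import java.util.Map;

// Sketch of an optimistic, condition-protected update. "Items", "value" and
// "version" are hypothetical table/attribute names for this example.
public class ConditionalWriteExample {

    public static boolean tryUpdate(DynamoDbClient dynamo, String id,
                                    long expectedVersion, String newValue) {
        UpdateItemRequest request = UpdateItemRequest.builder()
                .tableName("Items")
                .key(Map.of("id", AttributeValue.builder().s(id).build()))
                .updateExpression("SET #val = :newValue, #ver = :newVersion")
                .conditionExpression("#ver = :expectedVersion")   // write only if nobody beat us to it
                .expressionAttributeNames(Map.of("#val", "value", "#ver", "version"))
                .expressionAttributeValues(Map.of(
                        ":newValue", AttributeValue.builder().s(newValue).build(),
                        ":newVersion", AttributeValue.builder().n(Long.toString(expectedVersion + 1)).build(),
                        ":expectedVersion", AttributeValue.builder().n(Long.toString(expectedVersion)).build()))
                .build();
        try {
            dynamo.updateItem(request);
            return true;                              // this client won the conditional write
        } catch (ConditionalCheckFailedException e) {
            return false;                             // the other client changed the item first
        }
    }
}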
