What’s the difference between KVStore and KVEngine in NebulaGraph Database?

Nebula Graph version: 3.3
I’ve been checking the source code of the NebulaGraph Database and have gotten confused about the relationship and the difference between KVStore and KVEngine. RocksEngine.cpp implements KVEngine to access RocksDB, while KVStore seems more like a logical concept, and all kinds of CRUD operations are executed through KVStore. So, how does KVStore interact with the underlying database?

Good question.
KVEngine is the abstraction of the key-value engine on a single storage instance (server), and RocksEngine is one of its implementations.
KVStore is the abstraction of the overall storage layer (as you mentioned, a logical concept): distributed consensus plus KVEngine together make up the KVStore. NebulaStore is one of its implementations.
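To make the layering concrete, here is a rough sketch in Python (the real interfaces live in the C++ source; the class and method names below are simplified for illustration only, and a dict stands in for RocksDB):

class KVEngine:
    # abstraction of the key-value engine on one storage instance
    def get(self, key):
        raise NotImplementedError
    def put(self, key, value):
        raise NotImplementedError

class RocksEngine(KVEngine):
    # one concrete implementation; a dict stands in for RocksDB here
    def __init__(self):
        self._db = {}
    def get(self, key):
        return self._db.get(key)
    def put(self, key, value):
        self._db[key] = value

class NebulaStore:
    # a KVStore implementation: distributed consensus + one KVEngine per part
    def __init__(self, parts):
        self.engines = {p: RocksEngine() for p in parts}
    def put(self, part, key, value):
        # in the real system the write is replicated through the consensus layer
        # before it is applied to the local engine; that step is omitted here
        self.engines[part].put(key, value)
    def get(self, part, key):
        return self.engines[part].get(key)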

Related

Storing data on edges of GraphDB

It's being proposed that we store data about a relationship between two vertices on the edge between them. The idea is that these two vertices are related and there are user-level pieces of information that need to be stored in the graph. The best example I can think of would be a Book and a Reader, where the Reader can store cliff notes on the edges for retrieval later on.
Is this common practice? It seems to me that we should minimize the amount of data living in edges and that the vast majority of graph DB data should be derived data, rather than using the graph as an actual data store. Given that it's in memory, what happens when it goes down? (We're using Neptune, so there are technically backups.)
Sorry if the question is a bit vague, but I'm not sure how else to ask. I've googled around looking for best practices and it's all pretty generic material about the concepts and theory of graph databases.
An additional question: is it common practice to expose the Gremlin API directly to users, or should there always be a GraphQL (or other) API in front of it?
Without much additional detail it is hard to provide exact modeling advice, but in general one of the advantages of using a graph database is that edges are first-class citizens and can carry properties. A common use case for this would be something like PERSON - purchases -> Product, where you might have a purchase_date property on the purchases edge to represent the date of the purchase, as someone might buy the same thing multiple times.
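For example, with gremlin-python (the endpoint URL, labels, and values below are placeholders, not something from the question):

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

g = traversal().withRemote(DriverRemoteConnection('wss://your-endpoint:8182/gremlin', 'g'))

# the purchase date lives directly on the edge between the two vertices
person = g.addV('person').property('name', 'alice').next()
product = g.addV('product').property('name', 'gremlin-book').next()
g.addE('purchases').from_(person).to(product).property('purchase_date', '2023-04-01').iterate()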
I am not sure exactly what you mean by "a vast majority of GraphDB data be derived data". You can use graphs to derive and infer data and relationships based on the connections, but they fully support storing data in them as well.
Given that it's in memory, what happens when it goes down? - Amazon Neptune (and most other databases) uses a buffer cache to keep some data in memory, but that data is also persisted to disk, so if the instance goes down there is no problem recovering it from durable storage.
An additional question, is it common practice to expose the gremlin API directly to users, or should there always be a GraphQL (or other) API in front of it? - Just as with any database, I would not recommend exposing the Gremlin API directly to consumers, as doing so comes with a whole host of potential security risks. Generally, the underlying data store of any application should be transparent to the users. They should be interacting with an interface like REST/GraphQL that is designed to answer business related questions and not really know or care that there is a graph database backing those requests.
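As a sketch of what "an interface in front of it" can look like, here is a hypothetical Flask endpoint wrapping a fixed traversal (the endpoint URL, labels, and properties are all invented for illustration; a real service would also add authentication and input validation):

from flask import Flask, jsonify
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

app = Flask(__name__)
g = traversal().withRemote(DriverRemoteConnection('wss://your-neptune-endpoint:8182/gremlin', 'g'))

# clients call a business-level endpoint; the Gremlin traversal never leaves the server
@app.route('/readers/<name>/notes')
def reader_notes(name):
    notes = g.V().has('reader', 'name', name).outE('annotated').values('note').toList()
    return jsonify(notes)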

How can I sort my results by property length?

I have these user vertices:
g.addV("user").property(single,"name", "bob")
g.addV("user").property(single,"name", "thomas")
g.addV("user").property(single,"name", "mike")
I'd like to return these sorted by the length of the name property.
bob
mike
thomas
Is this possible with Gremlin on AWS Neptune without storing a separate nameLength property to sort on?
Currently the Gremlin language does not have a step that can return the length of a string. This is something that may be added to Gremlin in a future version, possibly in the 3.6 release. You can of course do it using closures (in-line code), but many hosted TinkerPop graph stores, including Amazon Neptune, do not allow arbitrary code blocks to be run as part of Gremlin queries. For now this will need to be handled on the application side when using Neptune, or, as you suggest, by using a nameLength property. This is an area where the TinkerPop community recognizes that some additional steps are needed and does plan to prioritize this work.
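Until such a step exists, a minimal sketch of the application-side approach with gremlin-python (the endpoint URL is a placeholder):

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

g = traversal().withRemote(DriverRemoteConnection('wss://your-neptune-endpoint:8182/gremlin', 'g'))

# fetch the names, then sort by string length on the client
names = g.V().hasLabel('user').values('name').toList()
names.sort(key=len)
print(names)   # ['bob', 'mike', 'thomas']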

CosmosDB API selection: does it dictate how the data is stored, or only how we communicate with the instance?

When creating a CosmosDB instance, we can choose the API that we will use to communicate with the instance (e.g. SQL, MongoDB, Cassandra, etc.)
What is not clear to me is: does this selection dictate how the data is stored, or only the way we communicate with the instance? For example, if we choose MongoDB, does it mean that Cosmos DB will store data in a MongoDB fashion?
The choice of API does not change how the data is stored. Cosmos DB always stores data using something called atom-record-sequence (ARS) which is essentially a set of primitive types, structs and arrays. The database engine translates the native ARS format into the data structures used by the various APIs (i.e. json documents, table rows, etc.)
So the answer to your question is that the choice of API only impacts how you communicate with the databases for that Cosmos DB account.
As David Makogon points out in his comment on another answer, while the way the data is stored is the same regardless of the API used, the content of the data will be different, because each API requires its own metadata so that the underlying data can be projected into the format expected by that API.
Here is a good technical overview of how Cosmos works under the hood.
https://azure.microsoft.com/en-us/blog/a-technical-overview-of-azure-cosmos-db/
Data is always stored in the same fashion (as a bunch of JSON documents); only the way you interact with the data changes.
https://learn.microsoft.com/en-us/azure/cosmos-db/introduction#develop-applications-on-cosmos-db-using-popular-open-source-software-oss-apis

If a Corda OwnableState is owned by an AnonymousParty, who stores it?

In Corda, an OwnableState must specify an AbstractParty as an owner. There are two types of AbstractParty:
Party, with a well-known identity
AnonymousParty, identified solely by public key
If I create a CompositeKey to own the OwnableState, who then will store it in their vault as part of FinalityFlow?
At the moment nobody will unless lower level APIs are used.
The vault needs more work to fully understand multi-sig states, e.g. with cash, we need a way to select coins that we're participants of.
It's quite an advanced feature because composite keys have so many use cases. This is typical in the blockchain space: Bitcoin supported CHECKMULTISIG outputs in the protocol long before wallets that knew how to use them existed. And when wallets did start to appear, they had different code and features for different use cases. E.g. using multisig/composite keys for more secure wallets is different from using them to run dispute mediation protocols.
At least with flows we have a straightforward way to implement support - we can make flows that understand composite keys and either have the certs linking the components to real parties, or know who they are some other way, and then go gather the signatures automatically.

Which technology is best suited to store and query a huge readonly graph?

I have a huge directed graph: it consists of 1.6 million nodes and 30 million edges. I want users to be able to find all the shortest connections (including incoming and outgoing edges) between two nodes of the graph (via a web interface). At the moment I have stored the graph in a PostgreSQL database, but that solution is neither very efficient nor elegant; I basically need to store all the edges of the graph twice (see my question PostgreSQL: How to optimize my database for storing and querying a huge graph).
It was suggested that I use a graph database like Neo4j or AllegroGraph. However, the free version of AllegroGraph is limited to 50 million nodes and also has a very high-level API (RDF), which seems too powerful and complex for my problem. Neo4j, on the other hand, has only a very low-level API (and the Python interface is not mature yet). Both of them seem more suited to problems where nodes and edges are frequently added to or removed from a graph. For a simple search on a graph, these graph databases seem too complex.
One idea I had would be to "misuse" a search engine like Lucene for the job, since I'm basically only searching connections in a graph.
Another idea would be to have a server process storing the whole graph (500 MB to 1 GB) in memory. The clients could then query the server process and traverse the graph very quickly, since the graph is stored in memory. Is there an easy way to write such a server (preferably in Python) using some existing framework?
Which technology would you use to store and query such a huge readonly graph?
LinkedIn have to manage a sizeable graph. It may be instructive to check out this info on their architecture. Note particularly how they cache their entire graph in memory.
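If the graph really fits in memory, a minimal sketch of that in-memory idea in Python using networkx (both the library choice and the edge-list file are assumptions, not something from the question; the node ids are placeholders):

import networkx as nx

# load the whole graph into memory once, at server start-up
G = nx.DiGraph()
with open('edges.tsv') as f:            # hypothetical "source<TAB>target" edge list
    for line in f:
        src, dst = line.split()
        G.add_edge(src, dst)

# treat the graph as undirected so both incoming and outgoing edges are followed
paths = list(nx.all_shortest_paths(G.to_undirected(), 'node_a', 'node_b'))

A thin HTTP layer (Flask, or any other web framework) around this lookup would give clients the web interface described in the question.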
There is also OrientDB, an open-source document-graph DBMS with a commercially friendly license (Apache 2). It has a simple API, an SQL-like language, ACID transactions, and support for the Gremlin graph language.
The SQL has extensions for trees and graphs. Example:
select from Account where friends traverse (1,7) (address.city.country.name = 'New Zealand')
This returns all the Accounts with at least one friend that lives in New Zealand, where friends is traversed recursively up to the 7th level of depth.
I have a directed graph for which I (mis)used Lucene.
Each edge was stored as a Document, with the nodes as Fields of the document that I could then search for.
It performs well enough, and query times for fetching inbound and outbound links from a node would be acceptable to a user using it as a web-based tool. But for computationally intensive batch calculations, where I am doing many hundreds of thousands of queries, I am not satisfied with the query times I'm getting. I get the sense that I am definitely misusing Lucene, so I'm working on a second Berkeley DB-based implementation so that I can do a side-by-side comparison of the two. If I get a chance to post the results here, I will.
However, my data requirements are much larger than yours, at over 3 GB, more than could fit in my available memory. As a result, the Lucene index I used was on disk; but with Lucene you can use a "RAMDirectory" index, in which case the whole thing will be stored in memory, which may well suit your needs.
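For reference, the edge-as-document idea looks roughly like this. The sketch below uses Whoosh, a pure-Python Lucene-style library, rather than Lucene itself, and the field names and node ids are illustrative only:

import os
from whoosh.fields import Schema, ID
from whoosh.index import create_in
from whoosh.query import Term

# each edge becomes one document, with the two endpoint nodes as stored fields
schema = Schema(source=ID(stored=True), target=ID(stored=True))
os.makedirs('edge_index', exist_ok=True)
ix = create_in('edge_index', schema)

writer = ix.writer()
writer.add_document(source='a', target='b')
writer.add_document(source='b', target='c')
writer.commit()

# outbound links of node 'a' are simply all documents whose source field is 'a'
with ix.searcher() as s:
    outbound = [hit['target'] for hit in s.search(Term('source', 'a'))]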
Correct me if I'm wrong, but since each node is a list of the linked nodes, it seems to me a DB with a schema is more of a burden than an advantage.
It also sounds like Google App Engine would be right up your alley:
It's optimized for reading - and there's memcached if you want it even faster
It's distributed - so the size doesn't affect efficiency
Of course, if you somehow rely on a relational DB to find the path, it won't work for you...
And I just noticed that the question is 4 months old
So you have a graph as your data and want to perform a classic graph operation. I can't see what other technology could fit better than a graph database.
