Azure Cosmos Gremlin API: transactions and efficient graph traversal - azure-cosmosdb-gremlinapi

We are experimenting with the Cosmos Gremlin API because we are building a large-scale knowledge management system, which is naturally suited to a graph DB. Knowledge items are highly interconnected, so a graph is a much better fit than a relational or a document-oriented (hierarchical) structure.
We need atomic write operations (not full transaction support, just atomic writes). E.g. we need to create several vertices and edges in one atomic write operation.
After carefully reading the documentation and extensively searching for solutions, our current state of knowledge is as follows:
Cosmos Gremlin API stores vertices as documents and outgoing edges as part of the "outgoing document".
A Gremlin statement creating vertices and edges might be split up and executed in parallel.
There is no transaction support and there are no atomic write operations.
Write operations are not idempotent.
Taken together, the last two facts mean: if you execute a graph write operation and an error occurs somewhere along the traversal, you have no clean way to recover from it. Say you add an edge, add some vertices, perform some side-effect steps, and something goes wrong. Which vertices and edges were persisted and which were not? Since you cannot simply run the statement a second time (vertices with those ids already exist), you are stuck. In addition, this is not something that can be solved at the end-user level in the UI.
Taking these points into account, it seems that the Cosmos Gremlin API is not ready for a production app. A look at the Gremlin "data explorer" in the portal reinforces that impression: it looks like a prototype.
Since edges are stored on the "outgoing document", one should always traverse the graph using the outgoing edges, not the incoming.
This takes away much of the benefit of working with a graph DB, which is the ability to traverse efficiently in both directions.
It leads to workarounds: For each outgoing edge, create an "inverse edge" on the incoming vertex.
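The inverse-edge workaround can be sketched in plain Java. This is only a hypothetical in-memory model (VertexDoc, addEdgeWithInverse and the "_inv" label suffix are made-up names, not part of the Cosmos or TinkerPop API) in which each vertex "document" holds only its own outgoing edges, as in the storage layout described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: one "document" per vertex, holding only its outgoing edges.
class VertexDoc {
    final String id;
    // edge label -> ids of the target vertices
    final Map<String, List<String>> out = new HashMap<>();

    VertexDoc(String id) {
        this.id = id;
    }

    void addOut(String label, String targetId) {
        out.computeIfAbsent(label, k -> new ArrayList<>()).add(targetId);
    }

    List<String> targets(String label) {
        return out.getOrDefault(label, List.of());
    }
}

class InverseEdgeDemo {
    // Writes the edge on the source document and an inverse edge on the target document.
    // These are two separate writes; without atomic writes they can drift apart.
    static void addEdgeWithInverse(VertexDoc from, String label, VertexDoc to) {
        from.addOut(label, to.id);
        to.addOut(label + "_inv", from.id);
    }

    public static void main(String[] args) {
        VertexDoc article = new VertexDoc("article-1");
        VertexDoc topic = new VertexDoc("topic-graphs");
        addEdgeWithInverse(article, "about", topic);

        // Each direction is now answered from a single vertex document.
        System.out.println(article.targets("about"));
        System.out.println(topic.targets("about_inv"));
    }
}
```

The price of the workaround is visible in addEdgeWithInverse: every logical edge becomes two writes, which is exactly where the missing atomicity hurts.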
So I'd like to ask the question: Should one use Cosmos Gremlin API in production? So far I haven't seen or read about anyone who does so.

"Write operations are not idempotent."
It is possible to write queries in an idempotent way; however, the result is not particularly readable or maintainable. See an idempotent Gremlin example here: https://spin.atomicobject.com/2021/08/10/idempotent-queries-in-gremlin/
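To make "idempotent" concrete, here is a minimal sketch in plain Java against a hypothetical in-memory vertex store (VertexStore is an illustrative stand-in, not the Cosmos or TinkerPop API). A plain insert fails when retried because the id already exists, while a get-or-create can be repeated safely. In Gremlin the same idea is usually written with the fold().coalesce(unfold(), addV(...)) pattern.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a graph's vertex collection.
class VertexStore {
    private final Map<String, Map<String, Object>> vertices = new HashMap<>();

    // Non-idempotent: a second call with the same id throws,
    // much like adding a vertex whose id already exists.
    void insert(String id, Map<String, Object> props) {
        if (vertices.containsKey(id)) {
            throw new IllegalStateException("vertex already exists: " + id);
        }
        vertices.put(id, new HashMap<>(props));
    }

    // Idempotent: returns the existing vertex if present, otherwise creates it.
    Map<String, Object> getOrCreate(String id, Map<String, Object> props) {
        return vertices.computeIfAbsent(id, k -> new HashMap<>(props));
    }

    int size() {
        return vertices.size();
    }
}

class IdempotentWriteDemo {
    public static void main(String[] args) {
        VertexStore store = new VertexStore();
        Map<String, Object> props = Map.of("title", "Graph basics");

        store.getOrCreate("item-1", props);
        // Retrying after a failed request: no error and no duplicate vertex.
        store.getOrCreate("item-1", props);

        System.out.println(store.size());
    }
}
```

With writes shaped like getOrCreate, a failed multi-step statement can simply be run again, which removes the need to know which parts were persisted the first time.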
"Taking these points into account, it seems that the Cosmos Gremlin API is not ready for a production app"
This really depends on your application requirements; not all production applications require atomicity or transactions. Some systems can tolerate dropped data, and where needed you can take various measures to ensure data integrity, though this often puts more responsibility on the application developer.
"So I'd like to ask the question: Should one use Cosmos Gremlin API in production? So far I haven't seen or read about anyone who does so."
Anecdotally, I haven't seen many stories of it being used in production. Cosmos DB looks relatively popular, but it is hard to tell what proportion of users are running which API.

Related

Does microservice architecture break graph database optimisations & design?

I began researching graph databases and have hit a wall, hopefully someone can bring up a point I have not considered.
I wanted to build my application using GraphQL as it's simple to use and I love its flexibility. It works well with microservice architecture, and while it has its pitfalls I prefer it over REST. For the database, I assumed a graph database would be a naturally good fit. As a client might traverse several entities/nodes, it would be far easier to use a graph database, which is built with traversals as the key motivator. It's highly flexible and efficient.
Unfortunately, my issue is that while graphql with a graph db would work well for a monolith app, it wouldn't work well with Microservices (MS) unless I'm missing something.
For MS you want to restrict your database access per service, either via schema, tables or entire db. This allows separation of concerns and follows best coding practices, isolating business logic. This essentially limits very quickly the traversals across entities which in turn limits the optimisations offered by a graph database. For example, a request to read or write across multiple domains could be done in a graph database in one query easily. For write, I have no issue separating as there's often a considerable amount of business logic for each domain. However for a read, it's often just a permission check. The entire point of using a graph database is to map data that's graph-related, but by separating concerns you get very small graphs that at least for my app, would be near useless. The power is lost.
The separation of concerns is far more important than speed; however, I would like to know whether adopting CQRS would bring back the power of the graph. By providing an MS architecture for writes but one single endpoint for reads, the gains are kept. By this I mean that not even the services themselves do reads: everything uses the one endpoint (ideally replicated behind a load balancer). I am wondering what the pitfalls here would be with regard to a) deployments: would there be breaking changes across the read and write sides? I'm thinking it's possible but unlikely if reads are exclusive to the single endpoint. And b) development experience: would this end up being a pain?
Is there something I have not yet considered, or are graph databases more suited for second-tier ops where they are loaded with data to answer specific questions but not the first-level data store?

How to programmatically create a new graph instance at runtime on Gremlin Server

I have a Java project that connects to a Gremlin Server. I want to create 5 new Neo4j graph instances at runtime. How can I do this on Gremlin Server?
Also, I have heard about sessioned and sessionless states in Gremlin Server but don't really understand their purpose. Can someone please touch on this and, more importantly, show how to use a sessioned state and a sessionless state in my Java project on Gremlin Server?
Many thanks in advance.
For example:
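// Pseudocode: create five Neo4j graph instances at runtime
List<GraphTraversalSource> graphs = new ArrayList<>();
for (int x = 0; x < 5; x++) {
    // "/tmp/neo4j-" + x is a placeholder directory, one per instance
    Graph graph = Neo4jGraph.open("/tmp/neo4j-" + x);
    GraphTraversalSource g = graph.traversal();
    graphs.add(g);
}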
The short answer is that Gremlin Server doesn't allow for programmatic graph creation - graphs are configured up front prior to Gremlin Server starting.
The longer answer is that Gremlin Server is a bit of a reference implementation of the Gremlin Server Protocol, which means that depending on the TinkerPop-enabled graph database you use, you might get a different answer to your question. For example, DSE Graph and JanusGraph both have options for dynamic graph construction. Neo4j and TinkerGraph, on the other hand, use the raw reference implementation of Gremlin Server and therefore don't have that kind of functionality.
That last point about the reference implementation leads to a still longer answer. You can submit a script to create graphs like Neo4jGraph or TinkerGraph, but it won't add them to the global list of graphs that Gremlin Server holds (which you've tried to simulate in your pseudocode with graphs.add(g)). That of course means that you won't be able to access those newly created Graph instances on future requests... unless you use a session. The reason that TinkerPop has both sessionless and session-based requests is that sessions tend to be more expensive for the server: they maintain more state between requests, and they bind requests to a single Gremlin Server rather than spreading them across a cluster. TinkerPop recommends using sessionless requests for almost all use cases and reserving sessions for some fairly narrow ones (like tools, e.g. a Gremlin-based visualization UI).
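The difference in state handling can be illustrated with a toy model in plain Java. This is not Gremlin Server code: ToyServer and its submit methods are hypothetical and only mimic the behaviour described above, where variables set by a sessionless request are gone on the next request while a session keeps them. With the real Java gremlin-driver, the corresponding choice is made when connecting: cluster.connect() for a sessionless client, cluster.connect(sessionId) for a sessioned one.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy model of how request state is kept; not Gremlin Server code.
class ToyServer {
    // One variable map per session id, kept alive between requests.
    private final Map<String, Map<String, Object>> sessions = new HashMap<>();

    // Sessionless request: the script gets fresh bindings that are discarded afterwards.
    Object submit(Function<Map<String, Object>, Object> script) {
        return script.apply(new HashMap<>());
    }

    // Sessioned request: the script sees the bindings stored for this session id.
    Object submit(String sessionId, Function<Map<String, Object>, Object> script) {
        Map<String, Object> bindings = sessions.computeIfAbsent(sessionId, k -> new HashMap<>());
        return script.apply(bindings);
    }
}

class SessionDemo {
    public static void main(String[] args) {
        ToyServer server = new ToyServer();

        // Sessionless: the "graph" variable from the first request is not visible in the second.
        server.submit(b -> b.put("graph", "neo4j-0"));
        System.out.println(server.submit(b -> b.get("graph")));

        // Sessioned: the variable survives between requests that share a session id.
        server.submit("session-1", b -> b.put("graph", "neo4j-0"));
        System.out.println(server.submit("session-1", b -> b.get("graph")));
    }
}
```

The per-session map is also a picture of the cost: the server has to hold that state for as long as the session lives, and every request for the session has to reach the same server.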
There are likely some ways to extend Gremlin Server for your purposes (JanusGraph did it with their packaging of Gremlin Server), but it would require you to get knowledgeable on the code itself. I could probably provide you some additional guidance but StackOverflow probably isn't the right place to do that. Feel free to ask questions on the gremlin-users mailing list if you'd like to discuss that option in greater detail.

How to calculate the time and memory consumption of a Gremlin Query

I have found some good articles about optimizing queries on Gremlin, but I still don't know how to get the memory consumption and time of execution of a query.
Some places that I found talking about query optimization:
https://github.com/tinkerpop/gremlin/wiki/Traversal-Optimization
https://academy.datastax.com/content/dse-gremlin-queries-good-better-best
https://medium.com/@jayanta.mondal/analyzing-and-improving-the-performance-azure-cosmos-db-gremlin-queries-7f68bbbac2c
This link, https://github.com/tinkerpop/gremlin/wiki/Traversal-Optimization, explicitly states that it refers to "an outdated version of the TinkerPop framework and Gremlin language documentation". Please ignore it; it is for TinkerPop 2.x.
That said, the profile() step is the best that Gremlin directly offers, and it can tell you a lot about query execution: you can identify which steps are running slowest and whether you are seeing the expected number of traversers at specific parts of your query.
If you need memory consumption information you will either need to use tools specific to the graph database that you are using to get that information (if they offer such things) or you will need to use standard profiling tools like Java Flight Recorder, VisualVM, etc.
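For a first rough number before reaching for a profiler, a wall-clock and heap-delta measurement around the call can be sketched in plain Java. This is a minimal sketch under assumptions: QueryMeasurement and measure are hypothetical helper names, the workload in main is a stand-in for a real call such as g.V().count().next(), and the heap figure only means something for an embedded graph running in the same JVM. It is also noisy, since garbage collection can run at any time.

```java
import java.util.function.Supplier;

class QueryMeasurement {
    // Result of one measured run.
    static final class Measured<T> {
        final T value;
        final long elapsedNanos;
        final long heapDeltaBytes;

        Measured(T value, long elapsedNanos, long heapDeltaBytes) {
            this.value = value;
            this.elapsedNanos = elapsedNanos;
            this.heapDeltaBytes = heapDeltaBytes;
        }
    }

    // Runs the query once and records elapsed time and the change in used heap.
    static <T> Measured<T> measure(Supplier<T> query) {
        Runtime rt = Runtime.getRuntime();
        long heapBefore = rt.totalMemory() - rt.freeMemory();
        long start = System.nanoTime();
        T value = query.get();
        long elapsed = System.nanoTime() - start;
        long heapAfter = rt.totalMemory() - rt.freeMemory();
        return new Measured<>(value, elapsed, heapAfter - heapBefore);
    }

    public static void main(String[] args) {
        // Stand-in workload; replace with the traversal you want to measure.
        Measured<Long> m = measure(() -> {
            long sum = 0;
            for (int i = 0; i < 1_000_000; i++) {
                sum += i;
            }
            return sum;
        });
        System.out.printf("result=%d time=%.3f ms heapDelta=%d bytes%n",
                m.value, m.elapsedNanos / 1_000_000.0, m.heapDeltaBytes);
    }
}
```

Against a remote server this only captures client-side latency; per-step timings still have to come from profile(), and server memory from the database's own tooling.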

Gremlin JavaAPI vs Gremlin-Server?

Use the Gremlin Java API directly in my application
Deploy a Gremlin Server, use the gremlin-driver API, and connect to the Gremlin Server
Which one is better? What are the advantages and disadvantages of each?
As of today, the answer depends on your environment to some degree. If you ever intend to use a non-JVM language (Python, C#, JS, etc.), then you should likely use Gremlin Server, as that will be the only way to build your application. If you want to be in the best position to switch away from the graph database you've currently chosen, then using Gremlin Server might be better, as not all graph databases are available through the direct Java API (there are at least two that embed Gremlin Server and only allow connection via a driver).
So, if the answer to both of those questions is a resounding "no", then I don't think I could convince you to include Gremlin Server in your application, especially if you don't have a terribly complex project to worry about. It adds another layer to your architecture that you would probably prefer to avoid.
If you do choose to embed, then be sure to use the Traversal API over the Structure API. Recall that the Structure API is meant for graph providers implementing the TinkerPop interfaces for their graph system. In other words, for this code:
graph = TinkerGraph.open()
g = graph.traversal()
prefer use of the API provided by g over the one provided by graph for querying and mutating the graph.
I do believe however that TinkerPop's future does have a solid dependence on "Gremlin Server" (the future incarnation may look different, but would be a "server component" of some sort) as something that will be less of a question to users to include. It would be great to see this decision point removed and simplified.

some generic questions about neo4j

I'm new to non-PHP web applications and to NoSQL databases. I was looking for a smart solution matching my application requirements, and I was very surprised to learn that graph-based databases exist. I found Neo4j very nice and well suited to my application, but as I've already written, I'm new to this and have some trouble understanding how it works. I hope you can help me learn.
If I embed Neo4j in a servlet program, then the database access I create is shared among the different threads of that servlet, right? So I need to put database creation in the init() method and the shutdown in destroy(), right? And it will be thread safe. (Every full stop is a "right?") But what if I want to create a database shared across the whole application?
I heard that graph databases in general rely on a relational low level. Is that true for Neo4j? If it is, then what I see is a high-level interface to the real persistence layer, so what is a Connection in this case? Are there techniques like connection pooling, or are these low-level things all managed by Neo4j?
In my application I need to link some objects to users, along with a lot of other classification data. Each of these objects has a unique id (a String). If someone asks to view something about the object with id=QW, I need to load the vertex associated with object.QW. Is this an easy operation for graph databases?
I also need to manage authentication: I receive a pair (usr, pwd) and need to check whether this pair exists in my graph. Is this the same problem as before, or is there a better approach for managing authentication?
thanks
If you're coming from the PHP world, in most cases you're better off running Neo4j in server mode and accessing it either via REST directly or through a client driver like https://github.com/jadell/neo4jphp. If you still want to embed Neo4j in a servlet environment, the GraphDatabaseService is a shared component, which could be stored in the ServletContext. You then start and commit transactions on a per-request (and therefore per-thread) basis.
Neo4j is a native graph database. The bare metal persistence layer is optimized for navigating from one node to its neighbors as fast as possible and written by the Neo4j devteam themselves. There are other graph databases out there reusing other persistence technologies for their underlying persistence.
Best thing is to run the Neo4j online course at http://www.neo4j.org/learn/online_course.
For authentication, see SecurityRules.
Since Neo4j is a NoSQL graph database, you have to handle the generation of unique IDs yourself, for example with a GUID (with 3.x, an auto-incremented property is also supported for a particular label). The ID Neo4j generates by default is unique, but it can be reallocated to another object once the object it was first assigned to is deleted.
I am a .NET developer, and in my project I used the Neo4j REST API. It works well and I would suggest you go with that: since it is implemented using the async-await programming pattern, you can hand long-running operations off to the DB and use your web server's resources more efficiently.
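The GUID suggestion above can be sketched with the Java standard library. NodeIds and newId are hypothetical names; the idea is simply to generate an application-level id, store it as a node property, and look nodes up by that property rather than by the internal node id, which may be reused after a delete.

```java
import java.util.UUID;

class NodeIds {
    // Generates an application-level id to store as a node property,
    // instead of relying on the database's internal node id.
    static String newId() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        System.out.println(newId());
    }
}
```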

Resources