Gremlin schema validation - azure-cosmosdb

I'm a Gremlin newbie, and maybe I missed some basics in the Gremlin docs, but I didn't find a way to define schema validation rules for Gremlin.
I mean how can I allow in my graph DB (Gremlin Api in Azure Cosmos DB) the relations:
person->creates->software and person->knows->person,
but restrict:
person->knows->software or software->creates->person?

Gremlin and TinkerPop have no notion of a graph schema. The approaches to schema from different graph systems are too diverse to generalize (some don't even support a schema). If you need a schema, you need to either choose a TinkerPop-enabled system that has that support and use their APIs to define that schema or you need to handle such logic yourself in your application.
For the latter, you might consider a couple of options outside of just encapsulating that logic somewhere in your code:
Write a DSL for graph mutations - that can help enforce the schema you want at the API level
Develop a TraversalStrategy that will verify the mutations made as being schema-compliant. This is a Java only approach at this time and requires bytecode based traversals which CosmosDB doesn't yet support (though that support is currently under development).
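As a minimal sketch of the "handle it in your application" option, you could keep a whitelist of allowed (out-label, edge-label, in-label) triples and consult it before building any addE() traversal. The class and method names below are hypothetical, not part of TinkerPop:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical application-level schema check (not a TinkerPop API):
// consult this whitelist before issuing any addE() traversal.
class EdgeSchema {
    private final Set<String> allowed = new HashSet<>();

    public void allow(String outLabel, String edgeLabel, String inLabel) {
        allowed.add(outLabel + "->" + edgeLabel + "->" + inLabel);
    }

    public boolean isAllowed(String outLabel, String edgeLabel, String inLabel) {
        return allowed.contains(outLabel + "->" + edgeLabel + "->" + inLabel);
    }
}
```

A mutation DSL would then simply refuse to construct a traversal for any triple the whitelist rejects.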

Related

Azure Cosmos Gremlin API: transactions and efficient graph traversal

We are experimenting with the Cosmos Gremlin API because we are building a large scale knowledge-management-system which is naturally suited for a graph DB. Knowledge items are highly interconnected and therefore a graph is much better than a relational or a document-oriented (hierarchical) structure.
We need atomic write operations (not full transaction support, just atomic writes). E.g. we need to create several vertices and edges in one atomic write operation.
After carefully reading the documentation and extensively searching for solutions, our current state of knowledge is the following:
Cosmos Gremlin API stores vertices as documents and outgoing edges as part of the "outgoing document".
A Gremlin statement creating vertices and edges might be split up and executed in parallel.
There is no transaction support and there are no atomic write operations.
Write operations are not idempotent.
The two facts taken together mean: If you execute a graph write operation and an error occurs somewhere along the traversal, you have no chance whatsoever to recover from it in a clean way. Let's say you add an edge, add some vertices, perform some side-effect-steps and something goes wrong. Which vertices and edges are persisted and which are not? Since you cannot simply run the statement a second time (vertices with the ids already exist), you're kind of stuck. In addition, this is nothing which can be solved on the end-user-level in the UI.
Taking these points into account, it seems that the Cosmos Gremlin API is not ready for a production app. When you have a look at the Gremlin "data explorer" in the portal, that seems even more true. It looks like a prototype.
Since edges are stored on the "outgoing document", one should always traverse the graph using the outgoing edges, not the incoming.
This takes away one of the main efficiency benefits of working with a graph DB: traversing both directions efficiently.
It leads to workarounds: For each outgoing edge, create an "inverse edge" on the incoming vertex.
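The workaround can be sketched with a plain in-memory structure: every time an edge is written, a mirrored "inverse" edge is also written on the target vertex, so both directions are reachable through fast outgoing lookups. All names here are illustrative, not a CosmosDB API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the "inverse edge" workaround: for every logical edge we also
// materialize a reversed edge, so both directions can be followed using
// only fast outgoing lookups. Names are illustrative, not a CosmosDB API.
class MirroredGraph {
    // vertex id -> list of "edgeLabel->targetId" entries
    private final Map<String, List<String>> out = new HashMap<>();

    public void addEdge(String fromId, String label, String toId) {
        out.computeIfAbsent(fromId, k -> new ArrayList<>()).add(label + "->" + toId);
        // materialize the inverse edge on the "incoming" vertex
        out.computeIfAbsent(toId, k -> new ArrayList<>()).add(label + "_inv->" + fromId);
    }

    public List<String> outgoing(String id) {
        return out.getOrDefault(id, new ArrayList<>());
    }
}
```

The cost, of course, is double the writes and the burden of keeping the two edge sets consistent yourself.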
So I'd like to ask the question: Should one use Cosmos Gremlin API in production? So far I haven't seen or read about anyone who does so.
Write operations are not idempotent.
It is possible to write queries in an idempotent way however it's not really done in a nice readable and maintainable way. See an idempotent gremlin example here: https://spin.atomicobject.com/2021/08/10/idempotent-queries-in-gremlin/
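The core of the pattern in that article is Gremlin's fold().coalesce(unfold(), addV(...)) "get or create" idiom, e.g. g.V().has('person','id','QW').fold().coalesce(unfold(), addV('person').property('id','QW')). As a language-agnostic sketch of why that makes retries safe (the store and method names are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative analog of an idempotent "get or create" Gremlin write:
// running upsert() twice leaves the store in the same state as running
// it once, so a failed statement can simply be retried.
class VertexStore {
    private final Map<String, String> byId = new HashMap<>();

    public String upsert(String id, String label) {
        // only create the vertex if it does not exist yet
        return byId.computeIfAbsent(id, k -> label);
    }

    public int size() {
        return byId.size();
    }
}
```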
Taking these points into account, it seems that the Cosmos Gremlin API is not ready for a production app
This really depends on your application requirements; not all production applications require atomicity or transactions. Some systems can tolerate dropping data, or, if needed, you can do various things to ensure data integrity yourself, though this puts more responsibility on the application developer.
So I'd like to ask the question: Should one use Cosmos Gremlin API in production? So far I haven't seen or read about anyone who does so.
Anecdotally, I haven't seen too many stories of it being used in production. CosmosDB looks relatively popular, but it's hard to tell what proportion of its users are running which API.

How to programmatically create a new graph instance at runtime on Gremlin Server

I have a Java project that connects to a Gremlin Server. I want to create 5 new graph instances of Neo4j at runtime. How can I do this on Gremlin Server, please?
Also, I have heard about session and sessionless states in Gremlin Server but don't really understand their purpose. Can someone please touch on this and, more importantly, show how to use a session state and a sessionless state in my Java project on Gremlin Server?
Many thanks in advance.
For example:
List<GraphTraversalSource> graphs = new ArrayList<>();
for (int x = 0; x < 5; x++) {
    Graph graph = Neo4jGraph.open();
    GraphTraversalSource g = graph.traversal();
    graphs.add(g);
}
The short answer is that Gremlin Server doesn't allow for programmatic graph creation - graphs are configured up front prior to Gremlin Server starting.
The longer answer is that Gremlin Server is a bit of a reference implementation of the Gremlin Server protocol, which means that depending on the TinkerPop-enabled graph database you use, you might get a different answer to your question. For example, DSE Graph and JanusGraph both have options for dynamic graph construction. Neo4j and TinkerGraph, on the other hand, use the raw reference implementation of Gremlin Server and therefore don't have that kind of functionality.
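For context, the graphs Gremlin Server knows about are declared in its YAML configuration before startup; a fragment might look like this (the file path is illustrative):

```yaml
# gremlin-server.yaml (fragment): graphs are bound when the server starts
graphs: {
  graph: conf/neo4j-empty.properties
}
```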
That last point about the reference implementation leads to a yet longer answer. You can submit a script that creates graphs like Neo4jGraph or TinkerGraph, but it won't add them to the global list of graphs that Gremlin Server holds (which you've tried to simulate in your pseudocode with graphs.add(g)). That of course means that you won't be able to access those newly created Graph instances on future requests... unless you use a session. The reason TinkerPop has both sessionless and session-based requests is that sessions tend to be more expensive for the server, because they maintain more state between requests and bind requests to a single Gremlin Server rather than spreading them across a cluster. TinkerPop recommends sessionless requests for almost all use cases and reserves sessions for some fairly narrow ones (like tools, e.g. a Gremlin-based visualization UI).
There are likely some ways to extend Gremlin Server for your purposes (JanusGraph did it with their packaging of Gremlin Server), but it would require you to get knowledgeable on the code itself. I could probably provide you some additional guidance but StackOverflow probably isn't the right place to do that. Feel free to ask questions on the gremlin-users mailing list if you'd like to discuss that option in greater detail.

Gremlin JavaAPI vs Gremlin-Server?

Use gremlin java api directly in my application
Deploy a gremlin-server, use gremlin-driver api, connect to gremlin-server
Which one is better? Or what are the advantages and disadvantages of each?
As of today, the answer depends on your environment to some degree. If you ever intend to use a non-JVM language (Python, C#, JavaScript, etc.), then you should likely use Gremlin Server, as that will be the only way to build your application. If you want to be in the best position to switch away from the graph database you've currently chosen, then using Gremlin Server might also be better, as not all graph databases are available through a direct Java API (there are at least two that embed Gremlin Server and only allow connection via driver).
So, if the answer to those two questions is a resounding "no", then I don't think I could convince you to include Gremlin Server in your application, especially if you don't have a terribly complex project to worry about. It's adding another layer to your architecture that you would probably prefer to avoid.
If you do choose to embed, then be sure to use the Traversal API over the Structure API. Recall that the Structure API is meant for graph providers implementing the TinkerPop interfaces for their graph system. In other words, for this code:
graph = TinkerGraph.open()
g = graph.traversal()
prefer use of the API provided by g over the one provided by graph for querying and mutating the graph.
I do believe, however, that TinkerPop's future has a solid dependence on "Gremlin Server" (the future incarnation may look different, but there would be a "server component" of some sort) as something that will be less of a question for users to include. It would be great to see this decision point removed and simplified.

Best UI interface/Language to query MarkLogic Data

We will be moving from Oracle to MarkLogic 8 as our datastore and will be using MarkLogic's Java API to work with the data.
I am exploring UI tools (like SQL Developer for Oracle) which can be used for ML. I found that ML's Query Manager can be used for accessing data. But I see multiple options with respect to language:
SQL
SPARQL
XQuery
JavaScript
We need to perform CRUD operations and search for data, and our testing team knows SQL (for Oracle), so I am confused about which route I should follow and on what basis I should decide which one or two options will be better to explore. We are most likely to use the JSON document type.
Any help/suggestions would be helpful.
You already mention you will be using the MarkLogic Java Client API; that should cover most of the common needs you could have, including search, CRUD, facets, and lexicon values, and also custom extensions through REST extensions, as the Client API leverages the MarkLogic REST API. It saves you from having to code inside MarkLogic to a large extent.
Apart from that you can run ad hoc commands from the Query Console, using one of the above mentioned languages. SQL will require the presence of a so-called SQL view (see also your earlier question Using SQL in Query Manager in MarkLogic). SPARQL will require enabling the triple index, and ingestion of RDF data.
That leaves XQuery and JavaScript, which have pretty much identical expressive power and performance. If you are unfamiliar with XQuery and XML languages in general, JavaScript might be more appealing.
HTH!

some generic questions about neo4j

I'm new to non-PHP web applications and to NoSQL databases. I was looking for a smart solution matching my application requirements, and I was very surprised when I learned that graph-based DBs exist. I found Neo4j very nice and very suitable for my application, but as I've already written, I'm new to this and have some limitations in understanding how it works. I hope you guys can help me learn.
If I embed Neo4j in a servlet program, then the database access I create is shared among the different threads of that servlet, right? So I need to put database creation in the init() method and the shutdown in destroy(), right? And it will be thread-safe, right? But what if I want to create a database shared among the whole application?
I heard that graph databases in general rely on a relational layer at a low level. Is that true for Neo4j? If it is, then I see a high-level interface to the real persistence layer, so what is a Connection in this case? Are there techniques like connection pooling, or are these low-level things all managed by Neo4j?
In my application I need to join some objects to users, plus many other classification things. Each of these objects has a unique id (a String). So if someone asks to view some information about the object having id=QW, I need to load the vertex associated with object QW. Is this an easy operation for graph databases?
If I need to manage authentication, I receive the pair (usr, pwd) and need to check whether this pair exists in my graph. Is this the same problem as before, or is there some better approach for managing authentication?
thanks
If you're coming from the PHP world, in most cases you're better off running Neo4j in server mode and accessing it either via REST directly or via a client driver like https://github.com/jadell/neo4jphp. If you still want to embed Neo4j in a servlet environment, the GraphDatabaseService is a shared component, maybe stored within the ServletContext. On a per-request (and therefore per-thread) basis you start and commit transactions.
Neo4j is a native graph database. The bare metal persistence layer is optimized for navigating from one node to its neighbors as fast as possible and written by the Neo4j devteam themselves. There are other graph databases out there reusing other persistence technologies for their underlying persistence.
Best thing is to run the Neo4j online course at http://www.neo4j.org/learn/online_course.
see SecurityRules
Since Neo4j is a NoSQL graph database, you have to handle generation of unique IDs yourself, e.g. using a GUID (with 3.x, auto-incremented properties are also supported for a particular label). Neo4j's default generated id is unique, but it can be reallocated to another object once the first assigned object is deleted.
I am a .NET developer, and in my project I used the Neo4j REST API; it works well, and I suggest you go with it. It is implemented using the async-await programming pattern, so you can hand long-running operations to the DB and use your web server resources more efficiently.
