We are planning to use TinkerPop for our project.
I have a few questions about this:
I see that there are GremlinServer.start() and GremlinServer.stop() APIs which I can use to run Gremlin Server in embedded style. This means it will not start a separate process, unlike when we execute the Gremlin Server bash script. Is this correct?
My graph could be large, so I may not want it to be held in memory all the time, as is the case with TinkerGraph. If I have my graph in Gremlin Server, how will it be stored? In memory only, or is there a way to persist it?
Thanks.
I think that depends on what you pass into the constructor as specified here
There are several persistence layers you can look into which implement the gremlin server API. For example you can go open source with TitanDB or go for an enterprise solution with DSE Graph.
We are experimenting with the Cosmos Gremlin API because we are building a large scale knowledge-management-system which is naturally suited for a graph DB. Knowledge items are highly interconnected and therefore a graph is much better than a relational or a document-oriented (hierarchical) structure.
We need atomic write operations (not full transaction support, just atomic writes). E.g. we need to create several vertices and edges in one atomic write operation.
After carefully reading the documentation and extensively searching for solutions, our current state of knowledge is as follows:
Cosmos Gremlin API stores vertices as documents and outgoing edges as part of the "outgoing document".
A Gremlin statement creating vertices and edges might be split up and executed in parallel.
There is no transaction support and there are no atomic write operations.
Write operations are not idempotent.
The last two facts taken together mean: if you execute a graph write operation and an error occurs somewhere along the traversal, you have no chance whatsoever to recover from it in a clean way. Let's say you add an edge, add some vertices, perform some side-effect steps and something goes wrong. Which vertices and edges are persisted and which are not? Since you cannot simply run the statement a second time (vertices with those ids already exist), you're kind of stuck. In addition, this is nothing which can be solved at the end-user level in the UI.
Taking these points into account, it seems that the Cosmos Gremlin API is not ready for a production app. When you have a look at the Gremlin "data explorer" in the portal, that seems even more true. It looks like a prototype.
Since edges are stored on the "outgoing document", one should always traverse the graph using the outgoing edges, not the incoming.
This takes away much of the benefit of working with a graph DB: traversing efficiently in both directions.
It leads to workarounds: for each outgoing edge, create an "inverse edge" on the incoming vertex.
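As a rough illustration of that workaround, here is a plain-Java model (not the Cosmos or Gremlin API; the class, method and label names are all made up for this sketch). It assumes edges are stored only with their source vertex, so finding incoming edges means scanning every vertex, whereas an explicit inverse edge turns the reverse direction into an ordinary outgoing lookup.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory model: edges live only with their source vertex.
class InverseEdgeSketch {
    // source vertex id -> edge label -> target vertex ids
    private final Map<String, Map<String, Set<String>>> out = new HashMap<>();

    // Stores a single edge with its source vertex.
    void addEdge(String from, String label, String to) {
        out.computeIfAbsent(from, k -> new HashMap<>())
           .computeIfAbsent(label, k -> new TreeSet<>())
           .add(to);
    }

    // Workaround: also store an "inverse" edge with the target vertex.
    void addEdgeWithInverse(String from, String label, String inverseLabel, String to) {
        addEdge(from, label, to);
        addEdge(to, inverseLabel, from);
    }

    // Cheap: a direct lookup on the source vertex.
    Set<String> outNeighbours(String from, String label) {
        return out.getOrDefault(from, Collections.emptyMap())
                  .getOrDefault(label, Collections.emptySet());
    }

    // Expensive: without an inverse edge, every vertex has to be inspected.
    Set<String> inNeighboursByScan(String to, String label) {
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Map<String, Set<String>>> entry : out.entrySet()) {
            Set<String> targets = entry.getValue().getOrDefault(label, Collections.emptySet());
            if (targets.contains(to)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }
}
```

Note that addEdgeWithInverse performs two separate writes, so without atomic writes the two edges can get out of sync if the second write fails.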
So I'd like to ask the question: Should one use Cosmos Gremlin API in production? So far I haven't seen or read about anyone who does so.
Write operations are not idempotent.
It is possible to write queries in an idempotent way; however, the result is not particularly readable or maintainable. See an idempotent Gremlin example here: https://spin.atomicobject.com/2021/08/10/idempotent-queries-in-gremlin/
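To make the idea concrete: a common Gremlin idiom for an idempotent write is a "get or create" traversal of the form g.V().has(label, key, value).fold().coalesce(unfold(), addV(label).property(key, value)). The sketch below is not Gremlin; it is a plain-Java model of the same semantics over an in-memory map (the class and method names are hypothetical), just to show why a retried write becomes harmless.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for a vertex store; not a TinkerPop or Cosmos API.
class IdempotentWriteSketch {
    private final Map<String, Map<String, String>> vertices = new HashMap<>();

    // Non-idempotent write: a second call with the same id fails,
    // like re-running a plain addV() with a fixed id.
    void addVertex(String id, String name) {
        if (vertices.containsKey(id)) {
            throw new IllegalStateException("vertex already exists: " + id);
        }
        vertices.put(id, newProperties(name));
    }

    // Idempotent write: returns the existing vertex, or creates it if absent.
    Map<String, String> getOrCreateVertex(String id, String name) {
        return vertices.computeIfAbsent(id, k -> newProperties(name));
    }

    int vertexCount() {
        return vertices.size();
    }

    private static Map<String, String> newProperties(String name) {
        Map<String, String> props = new HashMap<>();
        props.put("name", name);
        return props;
    }
}
```

With getOrCreateVertex, re-running a statement that failed halfway simply skips what was already written, which is the property the original question was missing.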
Taking these points into account, it seems that the Cosmos Gremlin API is not ready for a production app
This really depends on your application requirements; not all production applications require atomicity or transactions. Some systems can tolerate dropped data, and if needed you can take various measures to ensure data integrity, though this often puts more responsibility on the application developer.
So I'd like to ask the question: Should one use Cosmos Gremlin API in production? So far I haven't seen or read about anyone who does so.
Anecdotally, I haven't seen many stories of it being used in production. CosmosDB looks relatively popular, but it is hard to tell what proportion of users are running which API.
I have a Java project that connects to a Gremlin Server. I want to create 5 new Neo4j graph instances at runtime. How can I do this on Gremlin Server?
Also, I have heard about session and sessionless states in Gremlin Server but don't really understand their purpose. Can someone please touch on this and, more importantly, show how to use a session state and a sessionless state in my Java project on Gremlin Server?
Many thanks in advance.
For example:
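// Pseudocode for what I would like to do at runtime
List<GraphTraversalSource> graphs = new ArrayList<>();
for (int x = 0; x < 5; x++) {
    // the directory argument is a placeholder, one per graph instance
    Graph graph = Neo4jGraph.open("/tmp/neo4j-" + x);
    GraphTraversalSource g = graph.traversal();
    graphs.add(g);
}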
The short answer is that Gremlin Server doesn't allow for programmatic graph creation - graphs are configured up front prior to Gremlin Server starting.
The longer answer is that Gremlin Server is a bit of a reference implementation of the Gremlin Server Protocol, which means that depending on the TinkerPop-enabled graph database you use you might get a different answer to your question. For example, DSE Graph and JanusGraph both have options for dynamic graph construction. Neo4j and TinkerGraph, on the other hand, utilize the raw reference implementation of Gremlin Server and therefore don't have that kind of functionality.
That last point about the reference implementation leads to a yet longer answer. You can submit a script to create graphs like Neo4jGraph or TinkerGraph, but it won't add them to the global list of graphs that Gremlin Server holds (which you've tried to simulate in your pseudocode with graphs.add(g)). That of course means that you won't be able to access those newly created Graph instances on future requests... unless you use a session. The reason that TinkerPop has both sessionless and session-based requests is that sessions tend to be more expensive to the server: they maintain more state between requests, and they bind requests to a single Gremlin Server rather than spreading them across a cluster. TinkerPop recommends using sessionless requests for almost all use cases and reserving session-based requests for some fairly narrow use cases (like tools, e.g. a Gremlin-based visualization UI).
There are likely some ways to extend Gremlin Server for your purposes (JanusGraph did it with their packaging of Gremlin Server), but it would require you to get knowledgeable on the code itself. I could probably provide you some additional guidance but StackOverflow probably isn't the right place to do that. Feel free to ask questions on the gremlin-users mailing list if you'd like to discuss that option in greater detail.
I'm thinking about learning JanusGraph to use in my new big project, but I can't understand some things.
JanusGraph can be used like any database and supports "insert", "update" and "delete" operations, so it will write data into Cassandra or another database to store it, right?
Wherever JanusGraph stores the nodes, edges, attributes etc., it will write these into the database, right?
Does this data have to be loaded into memory by JanusGraph, or will it be read from Cassandra all the time?
Must the data that JanusGraph reads be loaded into JanusGraph on every query, or will it do selects in the database to retrieve the data I need?
Is only the data I need retrieved from the database, or will JanusGraph read all records in the database all the time?
Should I use JanusGraph in my project in production or should I wait until it becomes production ready?
I'm developing a kind of social network that needs to store friendships, posts, comments and user blocks and do some Elasticsearch searches too. In this case, what database backend should I use?
Janus will write data into Cassandra or other database to store these data, right?
Where Janus store the Nodes, Edges, Attributes etc, it will write these into database, right?
JanusGraph will write the data into whatever storage backend you configure it to use. This includes Cassandra. It writes this data into the underlying database using the data model roughly outlined here
These data should be loaded in memory by Janus or will be read from Cassandra all the time?
The data retrieved in database is only what I need or Janus will read all records in database all the time?
JanusGraph will only load into memory the vertices and edges which you touch during a query/traversal. So if you do something like:
graph.traversal().V().hasLabel("My Amazing Label");
Janus will read and load into memory only the vertices with that label. So you don't need to worry about initializing a graph connection and then waiting for the entire graph to be serialised into memory before you can query. Janus is a lazy reader.
Should I use Janus in my project in production or should I wait until it becomes production ready?
That is entirely up to you and your use case. Janus is being used in production already, as can be seen here at the bottom of the page. Janus was forked from TitanDB and improved on it, and TitanDB is also used in several production use cases. So if you are wondering "is it ready?", then I would say yes, it's clearly ready given its existing uses.
what database backend should I use?
Again, that's entirely up to you. I use Cassandra because it can scale horizontally and I find it easier to work with. It also seems to suit all different sizes of data.
I have toyed with Google Big Table and that seems very powerful as well. However, it's only really suited for VERY big data and it's also only available in the cloud, whereas Cassandra can be hosted locally very easily.
I have not used Janus with HBase or BerkeleyDB so I can't comment there.
It's very simple to change between backends though (all you need to do is adjust some configs and check your dependencies are in place), so during your development feel free to play around with the backends. You only really need to commit to a backend when you go to production or are more sure of each backend.
When considering what storage backend to use for a new project, it's important to consider what tradeoffs you'd like to make. In my personal projects, I've enjoyed using NoSQL graph databases due to the following advantages over relational DBs:
Not needing to migrate schemas increases productivity when rapidly iterating on a new project
Traversing a heavily normalized data-model is not as expensive as with JOINs in an RDBMS
Most include in-memory configurations which are great for experimenting & testing.
Support for multi-machine clusters and Partition Tolerance.
Here are sample JanusGraph and Neo4j backends written in Kotlin:
https://github.com/pm-dev/janusgraph-exploration
https://github.com/pm-dev/neo4j-exploration
The main advantage of JanusGraph is the flexibility of plugging in whichever storage backend you'd like.
Use gremlin java api directly in my application
Deploy a gremlin-server, use gremlin-driver api, connect to gremlin-server
Which one is better? What are the advantages and disadvantages?
As of today, the answer depends on your environment to some degree. If you ever intend to use a non-JVM language (Python, C#, JS, etc.), then you should likely use Gremlin Server, as that will be the only way to build your application. If you want to be in the best position to switch away from your currently chosen graph database to another one, then using Gremlin Server might be better, as not all graph databases are available through the direct Java API (there are at least two that embed Gremlin Server and only allow connection via driver).
So, if the answer to those two questions is a resounding "no", then I don't think I could convince you to include Gremlin Server in your application, especially if you don't have a terribly complex project to worry about. It's adding another layer to your architecture that you would probably prefer to avoid.
If you do choose to embed, then be sure to use the Traversal API over the Structure API. Recall that the Structure API is meant for graph providers implementing the TinkerPop interfaces for their graph system. In other words, for this code:
graph = TinkerGraph.open()
g = graph.traversal()
prefer use of the API provided by g over the one provided by graph for querying and mutating the graph.
I do believe, however, that TinkerPop's future does have a solid dependence on "Gremlin Server" (the future incarnation may look different, but would be a "server component" of some sort), so whether to include it should become less of a question for users. It would be great to see this decision point removed and simplified.
I'm new to non-PHP web applications and to NoSQL databases. I was looking for a smart solution matching my application requirements and I was very surprised to learn that graph-based DBs exist. I found Neo4j very nice and very suitable for my application, but as I've already written I'm new to this and I have some limitations in understanding how it works. I hope you guys can help me learn.
If I embed Neo4j in a servlet program, then the database access I create is shared among the different threads of that servlet, right? So I need to put database creation in the init() method and the shutdown in destroy(), right? And it will be thread safe. (Every dot is a "right?") But what if I want to create a database shared among the whole application?
I heard that graph databases in general rely on a relational low level. Is that true for Neo4j? If it is, then I only see a high-level interface to the real persistence layer, so what is a Connection in this case? Are there techniques like connection pooling, or are these low-level things all managed by Neo4j?
In my application I need to link some objects to users and to many other classification items. Each of these objects has a unique id (a String). If someone asks to view information about the object having id=QW, then I need to load the vertex associated with object QW. Is this an easy operation for graph databases?
I also need to manage authentication: I receive a pair (usr, pwd) and need to check whether this pair exists in my graph. Is this the same problem as before, or is there a better approach for managing authentication?
thanks
If you're coming from the PHP world, in most cases you're better off running Neo4j in server mode and accessing it either via REST directly or through a client driver like https://github.com/jadell/neo4jphp. If you still want to embed Neo4j in a servlet environment, the GraphDatabaseService is a shared component, maybe stored within the ServletContext. On a per-request (and therefore per-thread) basis you start and commit transactions.
Neo4j is a native graph database. The bare metal persistence layer is optimized for navigating from one node to its neighbors as fast as possible and written by the Neo4j devteam themselves. There are other graph databases out there reusing other persistence technologies for their underlying persistence.
Best thing is to run the Neo4j online course at http://www.neo4j.org/learn/online_course.
see SecurityRules
As Neo4j is a NoSQL graph database, you have to handle generation of the unique ID yourself using a GUID (with 3.x, an auto-incremented property is also supported for a particular label), as the default ID generated by Neo4j is unique but can be reallocated to another object once the object it was first assigned to is deleted.
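A minimal sketch of the GUID approach in Java, assuming you store the generated value in a property of your own and look nodes up by that property instead of by Neo4j's internal id. The property name "uuid" and the helper class are made up for illustration; only java.util.UUID is a real API here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical helper: builds the property map for a new node.
class NodeIdSketch {
    static Map<String, Object> newNodeProperties(String name) {
        Map<String, Object> props = new HashMap<>();
        // Application-level id: a random (version 4) UUID stored as a string.
        props.put("uuid", UUID.randomUUID().toString());
        props.put("name", name);
        return props;
    }
}
```

Because the id is generated by the application, it stays stable even if the database later reuses an internal id after a delete.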
I am a .NET developer. In my project I used the Neo4j REST API and it works well, so I would suggest you go with that. As it is implemented using the async-await programming pattern, you can hand long-running operations to the DB and use your web server resources more efficiently.