Representing a graph in java application - graph

In my application, I have a domain model that is essentially a graph. I need to perform the following operations and then send the resulting graph to the client over the network.
Operations to be performed:
Filter certain nodes based on business policy
Augment the graph with more nodes and relationships (potentially from other data providers)
After filtering, I also need a serialization mechanism. Having worked with Neo4j and TinkerPop, I feel TinkerPop fits my use case well because it has
In-memory graph support (TinkerGraph)
Serialization mechanisms: GraphML, GML and GraphSON
I am wondering whether my understanding is accurate and my approach is correct. Please suggest.

Sounds right. I often extract subgraphs and store them in a TinkerGraph for follow-on processing. I also use GraphSON for serialization. Seems like you're on the right track.
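For illustration, here is a minimal sketch of that workflow, assuming the TinkerPop 2.x (Blueprints) API; the property names, edge label and file path are made up:

import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.tg.TinkerGraph;
import com.tinkerpop.blueprints.util.io.graphson.GraphSONWriter;

import java.io.FileOutputStream;
import java.io.OutputStream;

public class SubgraphExport {
    public static void main(String[] args) throws Exception {
        // Copy the filtered/augmented nodes into an in-memory graph
        Graph graph = new TinkerGraph();
        Vertex request = graph.addVertex(null);
        request.setProperty("type", "Request");          // example property
        Vertex response = graph.addVertex(null);
        response.setProperty("type", "Response");
        graph.addEdge(null, request, response, "answeredBy");

        // Serialize the whole in-memory graph as GraphSON for the client
        try (OutputStream out = new FileOutputStream("subgraph.json")) {
            GraphSONWriter.outputGraph(graph, out);
        }
    }
}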
Here are 2 good sources for additional information:
gremlindocs.com
https://groups.google.com/forum/#!forum/gremlin-users

Related

NetworkX vs GraphDB: do they serve similar purposes? When to use one or the other and when to use them together?

I am trying to understand whether I should use a GraphDB for my project. I am mapping a computer network and I use NetworkX. The relationships are physical or logical adjacency (L2 and L3). In the current incarnation my program scans the network and dumps the adjacency info into a Postgres relational database. From there I use Python to build my graphs using NetworkX.
I am trying to understand whether I should change my approach and whether there is any benefit in storing the info in a graph database. Postgres has AgensGraph, which seems to be built on top of Postgres as a graph-database overlay or add-on. I do not know yet whether installing this on top will make my life easier. I barely survived the migration from SQLite to Postgres :-) and to SQLAlchemy, so now, not even 3 months later, I am reconsidering things while I still can (the migration is not complete).
I could choose to use a mix, but I am not sure if it makes sense to use a GraphDB. From what I understand, it has advantages such as not needing a schema (which helps a lot for a DB newbie like me).
I am also wondering if NetworkX (a Python library) and GraphDB overlap in any way. As far as I understand these things, NetworkX could be instrumental in analyzing the topology of the graph, while GraphDB is mainly used to query the data stored in the DB. Do they overlap in any way? Can they be used together?
TLDR: Use Neo4j or OrientDB to store data and networkx for processing it (if you need complicated algorithms). It will be the best solution.
I strongly recommend against using GraphDB for your purposes. GraphDB is based on RDF, which is used for the semantic web and general knowledge storage. It is not meant to be used for problems like yours. There are many graph databases that will fit your needs much better. I can recommend Neo4j (the most popular graph database, as you can see; free, but non-open-source) or OrientDB (the most popular open-source graph database).
I used a graph database when I had a similar problem (but I used HP UCMDB, which is corporate software and is not free). It was really MUCH better than the average relational DB. So the idea of using a graph database is good, and it fits this kind of problem naturally.
I am not sure you really need NetworkX to analyze the graph (you can use graph query languages for that), but if you want, you can load the data from your DB into NetworkX via GraphML or some other method (OrientDB is similar) and process it with NetworkX.
And a little question-and-answer quiz at the end:
As far as I understand these things NetworkX could be instrumental in analyzing the topology of the graph
Absolutely right.
while GraphDB is mainly used to query the data stored in the DB.
It is a database. And, yes, it is mainly used to query the data.
Do they overlap in anyway?
They are both about graphs. Of course they overlap :)
Can they be used together?
Yes, they can. No, they should not be used together for your problem.

Using Neo4j ImpermanentGraphDatabase for real life scenarios with large amounts of data

I use Neo4j to do calculations on complex graphs from data stored in a relational database. These calculations must be done frequently, so the natural solution has been to use Neo4j to create impermanent Neo4j graphs on the fly.
I keep finding references like the one below on the internet (Neo4j: is it an in-memory graph database?):
Neo4j features a stripped down variant called ImpermanentGraphDatabase. This one is intended to be used for testing only. E.g. when you develop a graph enabled application your unit tests might use it. It is not recommended to use ImpermanentGraphDatabase for real life scenarios with large amounts of data.
I'm doing exactly that: using ImpermanentGraphDatabase in a real-life scenario with thousands of nodes, on which I do on-the-fly calculations.
Creating an embedded database each time I need to do a calculation on the fly is not feasible, so what solution does Neo4j offer for this scenario? What exactly happens if you use Neo4j's ImpermanentGraphDatabase for real-life scenarios with large amounts of data?
There is the little word 'not' in the text.
Also, in the (free) book "Graph Databases" (download), the impermanent server mode is recommended only for testing, NOT for production environments. If you want to use your graph locally (no cluster), you might use the embedded mode. Keep in mind that there are always two parts to a graph DB: the storage and the traversal engine. But I think you're looking for a (Cypher-)queryable data structure? It's worth having a look at TinkerPop/Gremlin (link).
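As a rough illustration of embedded mode (a sketch assuming the Neo4j 2.x embedded Java API; the store path and property are just examples):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class EmbeddedExample {
    public static void main(String[] args) {
        // Embedded mode: the store lives on local disk and survives restarts,
        // unlike ImpermanentGraphDatabase, which is intended for tests only.
        GraphDatabaseService db = new GraphDatabaseFactory()
                .newEmbeddedDatabase("target/calc-graph.db"); // example path
        try (Transaction tx = db.beginTx()) {
            Node n = db.createNode();
            n.setProperty("name", "example"); // example property
            tx.success();
        }
        db.shutdown();
    }
}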
For your purpose it depends on which part will change: the structure of your data, the formula for the calculation, or just the values? If your graph is static, you just need to update your data (not the graph database). If your graph is dynamic, you might optimize your algorithm and use different data structures. A graph is not always the best choice; neither are trees, lists or dictionaries.

Difference between traversal using gremlin and methods from Graph

Suppose I have the following simple graph.
I see two ways of traversing this graph.
Option 1
I can use the following API provided by the Graph class
Graph factory = ...
factory.getVertices("type", "Response");
Option 2
I can also use GremlinPipeline API as documented here
Graph g = ... // a reference to a Blueprints graph
GremlinPipeline pipe = new GremlinPipeline();
pipe.start(g.getVertex(1));
My questions are:
Why two API's?
When to use which one?
Does GremlinPipeline take advantage of the indices created using the index-related methods of TinkerGraph?
There are two APIs for getting data because they sit at different levels of abstraction. Blueprints is the lower level, offering utility-level functions for accessing graphs; Gremlin is the higher level, offering far greater expressivity when traversing graphs. The design principle is that Blueprints is an abstraction layer over different graph databases (i.e. Neo4j, OrientDB, etc.) and needs to be simple enough for implementations to be developed quickly. Gremlin, on the other hand, is a graph traversal language that works over the Blueprints API, which is what lets Gremlin operate over multiple graph databases.
Your examples don't allow Gremlin to shine at all. Sure, in those cases, there really isn't a reason to choose one over the other. Here's a similar example which I think is better:
// blueprints
g.getVertices()
// gremlin
g.V
Other than saving a few characters, Gremlin really isn't getting me anything. However, consider this Gremlin:
g.V.out('knows').outE('bought').has('weight',T.gt,100).inV.groupCount().cap()
I won't supply the Blueprints equivalent of that because it's a lot of code to type. You'll have to trust that this single line of Gremlin is worth many lines of code with tons of ugly looping.
Most of the time, usage of the raw Blueprints API isn't really necessary for traversals. You'll find yourself using it for loading data, but other than that, use Gremlin.
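If you need that traversal from Java rather than the Gremlin/Groovy shell, a rough equivalent using the GremlinPipeline fluent API (a sketch assuming Gremlin 2.x; the labels and weight threshold are just the ones from the example above) looks like this:

import com.tinkerpop.blueprints.Graph;
import com.tinkerpop.gremlin.Tokens;
import com.tinkerpop.gremlin.java.GremlinPipeline;

import java.util.Map;

public class TraversalExample {
    @SuppressWarnings("unchecked")
    public static Map<Object, Number> heavyPurchases(Graph g) {
        // Java version of: g.V.out('knows').outE('bought').has('weight',T.gt,100).inV.groupCount().cap()
        return (Map<Object, Number>) new GremlinPipeline(g)
                .V()
                .out("knows")
                .outE("bought")
                .has("weight", Tokens.T.gt, 100)
                .inV()
                .groupCount()
                .cap()
                .next();
    }
}

Still a single expression, as opposed to the nested loops you would write against raw Blueprints.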

Tinkerpop - Is it better to use Redis for key-value property indexes or to use KeyIndexableGraph

Pretty straightforward question, but I can't find the info I want: is it advisable to use TinkerPop's KeyIndexableGraph, or to roll your own high-performance key/index solution on a specialized key/value store like Redis to get the node/edge locations you need?
It would appear to me that Redis should be better here, as a technology that focuses only on key/value lookups, and then pass the address into the graph, but I'd like to justify the costs.
The promise from TinkerPop is that index lookups should be log(n) on elements that are indexed with the property, which is pretty good. Is it possible to do better in Redis, or is the constant factor much better than in the graph lookup?
Edit: I realized later this isn't really an intelligent question - Redis is an in-memory store, so it is bounded by memory. Looking up a graph node location would still require a second lookup of the node in the graph.
It is important to remember that, aside from TinkerGraph (an in-memory graph), TinkerPop is not a graph database on its own. KeyIndexableGraph is an interface implemented by the underlying graph database (Titan, Neo4j, OrientDB, etc.) using that graph's index capability. Therefore, you should make your indexing choice based on the capabilities of the underlying graph database itself.
Generally speaking, implementing Redis for indexing purposes for the graphs that do implement KeyIndexableGraph seems like an unnecessary layer. I would guess that it will complicate your programming without much benefit.
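For reference, using the built-in key index is a one-liner against the Blueprints API (a minimal sketch assuming TinkerPop 2.x / TinkerGraph; the property name is made up):

import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.tg.TinkerGraph;

public class KeyIndexExample {
    public static void main(String[] args) {
        // TinkerGraph implements KeyIndexableGraph, so key indices come built in
        TinkerGraph graph = new TinkerGraph();
        graph.createKeyIndex("type", Vertex.class); // index the "type" property

        Vertex v = graph.addVertex(null);
        v.setProperty("type", "Response");

        // This lookup is now served by the key index instead of a full scan
        for (Vertex hit : graph.getVertices("type", "Response")) {
            System.out.println(hit.getId());
        }
    }
}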
Here is the difference:
Databases like OrientDB have approximately O(log₂ n) lookup times on an index.
Redis has O(1), i.e. constant-time, lookup.

Specification of a directed graph

This is a fairly advanced topic related to directed graphs. I am looking into NoSQL technology for a project, in particular graph databases. It's a perfect fit: it supports the rich model whose relationships I want to persist, and the problem domain is itself a graph (vertices and edges). Obviously this made me look at Neo4j and other vendors in this space. I believe they have definitely solved, or closed the gap on, persisting data as a graph data structure, which is perfect.
However, my requirement goes further: there needs to be a specification of a directed graph that is understood when creating an actual instance of that 'directed graph', so that particular rules and constraints are adhered to when the graph is actually created. The graph database doesn't concern itself with this, which is correct, and I wouldn't want it to (I'm happy that it's agnostic of this). The problem is that this leaves it a little open-ended as to what ensures that the graph complies with your graph rules (i.e. that certain nodes can have certain relationships, or only have relationships to certain other nodes). What should I be using that will allow me to specify the specification/metadata of the directed graph such that, when creating an instance of it at runtime, it adheres correctly to its specification?
Any help or suggestions on what is available, or on the standard way to approach this, would be appreciated.
I think you should take a look at Spring Data Graph, http://www.springsource.org/spring-data/neo4j - that is as close as you get to having a powerful mapping layer that can express rules etc., much like JPA or Hibernate.
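Something along these lines, for example (a rough sketch assuming the Spring Data Neo4j 2.x annotations; the entity and relationship names are invented): only the relationships you declare on the entity can be created through the mapping layer, which is one way to encode "which nodes may relate to which" as application-level metadata.

import org.neo4j.graphdb.Direction;
import org.springframework.data.neo4j.annotation.GraphId;
import org.springframework.data.neo4j.annotation.NodeEntity;
import org.springframework.data.neo4j.annotation.RelatedTo;

import java.util.Set;

@NodeEntity
public class Task {
    @GraphId
    private Long id;

    private String name;

    // Only the outgoing DEPENDS_ON relationship is declared for a Task,
    // and only towards other Task nodes - the mapping layer won't create anything else.
    @RelatedTo(type = "DEPENDS_ON", direction = Direction.OUTGOING)
    private Set<Task> dependencies;
}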
WDYT?
/peter
