Why is an RDBMS bad for storing a considerably big graph?

I started using a graph database to store a big graph that I'm generating, but I'm not convinced why I should use a graph database for this job rather than a conventional RDBMS. My specific question is: why is a relational database bad at storing graphs, or rather, why is a graph database BETTER than an RDBMS for storing them?

After reading Graph Databases (early release, available here: http://graphdatabases.com/), it all comes down to performance. The deeper or more recursive your query, the longer your RDBMS or graph database will take to return results. Traditional RDBMSs are not designed for quick traversal of relationships between entities, whereas graph databases are built specifically for this. This may not be an issue if your recursion is only 2 levels deep, but beyond that point performance apparently degrades significantly.
Please take this with a grain of salt: it comes directly from Graph Databases and I have not replicated these results.
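To make the depth problem concrete, here is a minimal sketch (my own, not from the book) of what such a traversal looks like in an RDBMS, using SQLite's recursive common table expressions from Python; the friends table and the 3-hop limit are made up for illustration. Each extra level effectively re-joins the edge table once more, which is where the degradation comes from.

import sqlite3

# Toy schema: a single table of directed "friend" edges.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE friends (src INTEGER, dst INTEGER);
    INSERT INTO friends VALUES (1, 2), (2, 3), (3, 4), (1, 3);
""")

# Everyone reachable from person 1 within 3 hops. The recursive CTE
# joins the edge table once per level, so cost grows with depth and
# fan-out -- the effect the book describes.
rows = conn.execute("""
    WITH RECURSIVE reachable(person, depth) AS (
        SELECT dst, 1 FROM friends WHERE src = ?
        UNION
        SELECT f.dst, r.depth + 1
        FROM friends f JOIN reachable r ON f.src = r.person
        WHERE r.depth < 3
    )
    SELECT DISTINCT person FROM reachable
""", (1,)).fetchall()
print(rows)  # people reachable from person 1: 2, 3, 4 (order may vary)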

Related

NetworkX vs GraphDB: do they serve similar purposes? When to use one or the other and when to use them together?

I am trying to understand if I should use a GraphDB for my project. I am mapping a computer network and I use NetworkX. The relationships are physical or logical adjacency (L2 and L3). In the current incarnation my program scans the network and dumps the adjacency info into a Postgres RDB. From there I use Python to build my graphs using NetworkX.
I am trying to understand if I should change my approach and whether there is any benefit in storing the info in a GraphDB. Postgres has AgensGraph, which seems to be built on top of Postgres as a GraphDB overlay or addon. I do not know yet if installing this on top will make my life easier. I barely survived the migration from SQLite to Postgres :-) and to SQLAlchemy, so now, not even 3 months in, I am reconsidering things while I still can (the migration is not complete).
I could choose to use a mix, but I am not sure it makes sense to use a GraphDB. From what I understand, they have advantages such as not needing a schema (which helps a lot for a DB newbie like me).
I am also wondering if NetworkX (a Python library) and a GraphDB overlap in any way. As far as I understand these things, NetworkX could be instrumental in analyzing the topology of the graph, while a GraphDB is mainly used to query the data stored in the DB. Do they overlap in any way? Can they be used together?
TL;DR: Use Neo4j or OrientDB to store the data and networkx for processing it (if you need complicated algorithms). That will be the best solution.
I strongly recommend against using GraphDB for your purposes. GraphDB is based on RDF, which is meant for the semantic web and common-knowledge storage; it is not intended for problems like yours. There are many graph databases that will fit you much better. I can recommend Neo4j (the most popular graph database, as you can see; free, but not open-source) or OrientDB (the most popular open-source graph database).
I used a graph database when I had a similar problem (though I used HP UCMDB, which is corporate software and not free). It was really MUCH better than the average relational DB, so the idea of using a graph database is sound, and it fits this kind of problem naturally.
I am not sure you really need networkx to analyze the graph (you can use graph query languages for that), but if you want to, you can load the data from your DB into networkx via GraphML or other methods (OrientDB is similar) and process it there.
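The GraphML route might look roughly like this; a minimal sketch, assuming the graph has already been exported to a file named network.graphml (Neo4j's APOC export procedures and OrientDB's exporter can both produce GraphML):

import networkx as nx

# Load the GraphML file exported from the database.
g = nx.read_graphml("network.graphml")  # hypothetical file name

# networkx then handles the topology analysis the database is weaker at.
print(nx.number_connected_components(g.to_undirected()))
print(sorted(nx.degree_centrality(g).items(), key=lambda kv: -kv[1])[:5])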
And a little question-and-answer quiz to finish:
As far as I understand these things NetworkX could be instrumental in analyzing the topology of the graph
Absolutely right.
while GraphDB is mainly used to query the data stored in the DB.
It is a database. And, yes, it is mainly used to query the data.
Do they overlap in any way?
They are both about graphs. Of course they overlap :)
Can they be used together?
Yes, they can, and per the TL;DR above, using them together is the best solution for your problem.

Using Neo4j ImpermanentGraphDatabase for real life scenarios with large amounts of data

I use Neo4j to do calculations on complex graphs from data stored in a relational database. These calculations must be done frequently, so the natural solution has been to use Neo4j to create impermanent Neo4j graphs on the fly.
I keep finding references like the one below on the internet (Neo4j: is it an in-memory graph database?):
Neo4j features a stripped down variant called ImpermanentGraphDatabase. This one is intended to be used for testing only. E.g. when you develop a graph enabled application your unit tests might use it. It is not recommended to use ImpermanentGraphDatabase for real life scenarios with large amounts of data.
I'm doing exactly the above: using ImpermanentGraphDatabase in a real-life scenario, with thousands of nodes on which I do on-the-fly calculations.
Creating an embedded database each time I need to do an on-the-fly calculation is not feasible, so what solution does Neo4j offer for this scenario? What exactly happens if you use Neo4j ImpermanentGraphDatabase in real-life scenarios with large amounts of data?
Note the little word 'not' in the text you quoted.
Also, the (free) book "Graph Databases" (download) recommends the impermanent mode for testing only, NOT for production environments. If you want to use your graph locally (no cluster), you might use the embedded mode. Keep in mind that there are always two parts to a graph DB: the storage engine and the traversal engine. But I think you're looking for a (Cypher-)queryable data structure? It's worth having a look at TinkerPop/Gremlin (link).
For your purpose it depends on which part will change: the structure of your data, the formula for the calculation, or just the values? If your graph is static, you just need to update your data (not the graph database). If your graph is dynamic, you might optimize your algorithm and use different data structures. A graph is not always the best choice; neither are trees, lists, or dictionaries.
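If it is indeed only the values that change, a rebuild-once, query-many in-memory structure may be all you need. Here is a minimal Python/networkx sketch of that idea (an alternative technique, not the Java embedded mode mentioned above; the edges table and the file name are hypothetical):

import sqlite3
import networkx as nx

def load_graph(conn):
    """Build the in-memory graph once from the relational store."""
    g = nx.DiGraph()
    g.add_edges_from(conn.execute("SELECT src, dst FROM edges"))  # hypothetical table
    return g

conn = sqlite3.connect("data.db")  # hypothetical database file
graph = load_graph(conn)

# Reuse the same graph for every on-the-fly calculation instead of
# creating a throwaway database per request; rebuild only on data change.
def calculate(graph, a, b):
    return nx.shortest_path(graph, a, b)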

What are the factors to consider while choosing a graph DB for about 30 TB of data

I'm in the process of developing a software system (a graph database) to study the interconnections between multiple components. It could end up with about 30 TB of data. I would like to know what factors to consider in choosing the right database.
Some of the options I'm looking at are Apache Giraph and TitanDB. I'm also wondering whether a smaller-scale DB like Neo4j or OrientDB might itself work.
This is a very broad question, so I would define exactly what you're looking for, because size alone can be a bit vague.
I think any of the example graph DBs you listed can model data that large.
A few "more detailed" questions you could ask yourself include:
Do you care about horizontal scaling? If yes, then you should be looking at TitanDB, OrientDB, or DSE Graph, because Neo4j (at the time of writing) does not scale horizontally and is therefore limited by the size of a single server.
Does a standardized query/traversal language matter? If yes, then maybe you should be looking at TinkerPop vendors such as TitanDB, OrientDB, DSE Graph, and others. If not, any option will suit you.
Does my data have super nodes? If yes, then you should see how each vendor deals with super nodes. Some vendors shard; others use clever graph partitioning algorithms.
How much support do you want? If you need a lot, then maybe you should look at strong enterprise solutions such as DSE, OrientDB, or Neo4j. Neo4j is currently considered the most popular graph DB, and with that comes a large support base.
Do you want to use open-source software? If yes, then TitanDB, Neo4j, or OrientDB may be for you.
These are just some of the things you can look into when deciding between the vendors. Note: there are many other vendors you could consider; Blazegraph and HypergraphDB, to name just a few.

anybody tried neo4j vs titan - pros and cons [closed]

Can anybody please provide or point to a good comparison between Neo4j and Titan?
One thing I can see is in terms of scale: Titan scales out and requires an underlying scalable datastore like Cassandra, while Neo4j supports only HA and has its own embedded database. Any other pros and cons? Any specific use cases? (Is Titan being used anywhere currently?)
I also have the following link: http://architects.dzone.com/articles/16-graph-databases-compared, which gives an objective comparison of graph databases but not much on pros and cons between Neo4j and Titan.
We have a social graph to which we add almost 1 million nodes and twice as many edges in a day. We started with Neo4j because, yes, it is very fast, owing to the fact that its storage is on the same machine the graph engine runs on. But here are the experiences we would like to share about Neo4j.
Not a good fit for real-time queries. We have a social structure like Twitter's: we have to show the latest 20 activities (and their associated activities) of all the users that a user follows on his timeline.
We have some users who follow more than 1000 users. The Gremlin query we wrote for this (we can share it if you are interested) produced so much GC pressure that a server with 8 CPUs and 48 GB of RAM used to freeze, and we had to restart it to bring it back online.
Network partitions were observed many times.
There is no vertex-centric index, which is very much needed in a graph database.
Ultimately we were so fed up with server performance under Gremlin queries that we had to change the database to Titan.
On Titan we are getting reasonable performance, and scaling is also very easy since we use Cassandra as the backend storage. But mind you, using Gremlin there is not a great idea either, as multiget queries are very ugly to write, and without multiget the queries become very slow.
Great to see you exploring graph databases. I will speak to the Neo4j part of your question:
More than 30 of the Global 2000 now use Neo4j in production for a wide range of use cases, many of them surprising, even to us! (And we invented the property graph!)
A partial list of customers can be found below:
www.neotechnology.com/customers
Neo4j has been in 24x7 production for 10 years, and while the product has of course evolved significantly since then, it's built on a very solid foundation.
Most of the companies moving to graph databases (speaking for Neo4j, which is what I know about) are doing so because a) their RDBMSs weren't able to handle the scope and scale of their connected-query requirements, and/or b) they wanted the immense convenience and speed that comes from modeling domains that are naturally a graph (social, network and data center management, fraud, portfolios, identity, etc.) as a graph, not as tables.
For kicks, you can find a number of customer talks here, from the four (soon five) GraphConnect conferences that were held this year in major cities around the world:
http://watch.neo4j.org/
If you're in London, the last one will be held next week:
http://www.graphconnect.com
You'll find a summary below of some of the technology behind Neo4j, with some customer examples. To speak very directly to your question about scaling: Neo4j has a unique architecture designed to maximize query response time and query predictability, allowing horizontal scale-out in such a way that each instance can access the graph without having to hop over the network. (Need more read throughput? Just add instances.) It turns out that this approach works well for 95+% of the graphs out there, including some production customers who have more than half of the Facebook social graph running in a single Neo4j cluster, backing an "always on" 24x7 web site.
www.neotechnology.com/neo4j-scales-for-the-enterprise/
One of the world's largest postal delivery services does all of their real-time package routing with Neo4j. Railroads are building routing systems on Neo4j. Some of the world's largest customers are using them for HR and data governance, alternate-path routing, network & data center management, real-time fraud detection, bioinformatics, etc.
Neo4j's Cypher query language is the only declarative query language built expressly for property graphs. It takes all of the lessons learned from our 13-year-old native Java API (which was the basis for Blueprints, since adopted by some of the other graph databases) and rolls them into a next-generation language. Cypher is a great way to learn graphs and to develop applications, and there's always the native Java API if you have special needs or value "bare metal" performance (i.e. sub-millisecond vs. single-digit milliseconds) above convenience. Neo4j is built from the ground up to support graphs, with a storage engine built to store graphs, unlike some of the more recent additions to the graph database ecosystem, which are architected as graph libraries on top of non-graph databases and are subject to the inherent limitations of that approach. (e.g. FlockDB, because it is based on MySQL, will still be very slow for anything more than one hop.)
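To give a flavor of that, here is a minimal sketch of a variable-length Cypher pattern issued through the official Python driver; the server URL, the credentials, and the Person/KNOWS model are assumptions for illustration. The same traversal in SQL would need a chain of self-joins or a recursive query.

from neo4j import GraphDatabase

# Assumed local server and credentials.
driver = GraphDatabase.driver("bolt://localhost:7687",
                              auth=("neo4j", "password"))

# One declarative pattern covers 1 to 3 hops of the KNOWS relationship.
query = """
MATCH (p:Person {name: $name})-[:KNOWS*1..3]->(friend:Person)
RETURN DISTINCT friend.name AS name
"""

with driver.session() as session:
    for record in session.run(query, name="Alice"):
        print(record["name"])
driver.close()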
Definitely feel free to contact the Neo team if you need anything more specific. We'll be more than happy to help you! http://info.neotechnology.com/ContactUs.html
Good luck!

query language for graph sets: data modeling question

Suppose I have a set of directed graphs. I need to query those graphs. I would like to get a feeling for my best choice for the graph modeling task. So far I have these options, but please don't hesitate to suggest others:
A proprietary implementation (matrix) plus graph traversal algorithms.
The RDBMS and SQL option (too space-consuming).
The RDF and SPARQL option (too slow).
What would you guys suggest? Regards.
EDIT: Just to answer Mad's questions:
Each graph is relatively small, no more than 200 vertices and 400 edges. However, there are hundreds of them.
Frequency of querying: hard to say; it's an experimental system.
Speed: not real time, but practical, say 4-5 seconds tops.
You didn't give us enough information for a well-thought-out answer. For example: what size are these graphs? How frequently do you expect to query them? Do you need real-time responses? More information on what your application is for, and what your purpose is, would be helpful.
Anyway, to counter the usual responses which suppose that SQL-based DBMSs are unable to handle graph structures effectively, I will give some references:
Graph Transformation in Relational Databases (.pdf), by G. Varró, K. Friedl, D. Varró, presented at the International Workshop on Graph-Based Tools (GraBaTs) 2004:
5 Conclusion and Future Work
In the paper, we proposed a new graph transformation engine based on off-the-shelf relational databases. After sketching the main concepts of our approach, we carried out several test cases to evaluate our prototype implementation by comparing it to the transformation engines of the AGG [5] and PROGRES [18] tools.
The main conclusion that can be drawn from our experiments is that relational databases provide a promising candidate as an implementation framework for graph transformation engines. We call attention to the fact that our promising experimental results were obtained using a worst-case assessment method, i.e. by recalculating the views of the next rule to be applied from scratch, which is still highly inefficient, especially for model transformations with a large number of independent matches of the same rule. ...
They used PostgreSQL as the DBMS, which is probably not particularly good for this kind of application. You can try LucidDB and see if it does better, as I suspect it will.
Incremental SQL Queries (more than one paper here; you should concentrate on "Maintaining Transitive Closure of Graphs in SQL"):
"... we showed that transitive closure, alternating paths, same generation, and other recursive queries, can be maintained in SQL if some auxiliary relations are allowed. In fact, they can all be maintained using at most auxiliary relations of arity 2. ..."
Incremental Maintenance of Shortest Distance and Transitive Closure in First Order Logic and SQL.
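For a taste of what "maintained in SQL with auxiliary relations" means, here is a minimal SQLite sketch of the insertion-only case (the schema is made up, and deletions, which the papers also cover, are considerably harder):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edge (src INTEGER, dst INTEGER);
    -- The auxiliary relation of arity 2: the transitive closure itself.
    CREATE TABLE tc (src INTEGER, dst INTEGER, PRIMARY KEY (src, dst));
""")

def insert_edge(conn, a, b):
    """Add edge a->b and incrementally update the closure table.

    New closure pairs are (x, y) for every x already reaching a (or
    x = a) and every y already reached from b (or y = b) -- a single
    join, with no recomputation from scratch.
    """
    conn.execute("INSERT INTO edge VALUES (?, ?)", (a, b))
    conn.execute("""
        INSERT OR IGNORE INTO tc (src, dst)
        SELECT p.x, s.y
        FROM (SELECT src AS x FROM tc WHERE dst = :a UNION SELECT :a) AS p,
             (SELECT dst AS y FROM tc WHERE src = :b UNION SELECT :b) AS s
    """, {"a": a, "b": b})

insert_edge(conn, 1, 2)
insert_edge(conn, 2, 3)
print(conn.execute("SELECT * FROM tc ORDER BY src, dst").fetchall())
# [(1, 2), (1, 3), (2, 3)]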
Edit: you've given more details, so... I think the best way is to experiment a little, with both a main-memory dedicated graph library and a DBMS-based solution, then carefully evaluate the pros and cons of both.
For example: a DBMS needs to be installed (unless you use an "embeddable" DBMS like SQLite); only you know if/where your application needs to be deployed and who your users are. On the other hand, a DBMS gives you immediate benefits, like persistence (I don't know what support graph libraries give for persisting their graphs), transaction management, and countless others. Are these relevant to your application? Again, only you know.
The first option you mentioned seems best. If your graph won't have many edges (|E| = O(|V|)), then you might get better time and space complexity using a dictionary:
// adjacency structure: each vertex maps to its set of neighbors
var graph = new Dictionary<Vertex, HashSet<Vertex>>();
An interesting graph library is QuickGraph. I have never used it, but it seems promising :)
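The snippet above is C#; for comparison, here is the same adjacency-dictionary idea as a minimal Python sketch, with a breadth-first reachability check on top (all names are illustrative):

from collections import deque

# Adjacency dictionary: vertex -> set of successor vertices.
graph = {
    "a": {"b", "c"},
    "b": {"d"},
    "c": {"d"},
    "d": set(),
}

def reachable(graph, start):
    """Breadth-first search: every vertex reachable from start."""
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in graph.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

print(reachable(graph, "a"))  # {'a', 'b', 'c', 'd'} (set order may vary)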
I wrote and designed quite a few graph algorithms for various programming contests and for production code, and I noticed that every time I needed one, I had to develop it from scratch, assembling concepts from graph theory (BFS, DFS, topological sorting, etc.).
Perhaps a lack of experience is the reason, but it seems to me that there's still no reasonable general-purpose query language for graph problems. Pick a couple of general-purpose graph libraries and solve your particular task in a programming (not query!) language. That will give you the best performance and space consumption, but it also requires an understanding of basic graph theory concepts and their limitations.
And one last thing: do not use SQL for graphs.
