Best graph database for storing millions of nodes

I have a question about graph databases.
At first I used networkx in Python and built the graph in memory, but as the number of nodes grew I ran out of RAM.
Next I tried Neo4j. It is nice that it writes the graph to disk, but it seems slow to me (even with indexes and other tuning, it is slower than networkx). I have now created 500k nodes and 2,000,000 relationships, and when I try to find a path between two nodes, Neo4j just hangs on my server.
I have heard about OrientDB, but have not tried it yet.
So I need advice: what is the best graph database that can store the graph on disk?
Many thanks.
PS: I only want an open-source graph database.

First of all, there are native graph databases and non-native graph databases. Native graph databases really organize your data in a graph structure and connect the nodes to each other, while non-native ones use some other model to store a representation of your graph. You can represent a graph simply as an adjacency matrix, which is a table, and you could store that in a row-key store with columns (but in my opinion that would be neither effective nor sensible). So you first need to ask yourself whether you really need a graph database. Second, you need to think about the read and write operations you want to perform.
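To put rough numbers on why an adjacency matrix is a poor fit here, the following sketch uses the sizes from the question (500k nodes, 2,000,000 relationships). The one-bit-per-cell packing is an assumption made only for this estimate; real storage engines will differ.

```python
# Sizes taken from the question; the storage model is a simplifying assumption.
nodes = 500_000
relationships = 2_000_000

# A dense adjacency matrix needs one cell per ordered pair of nodes.
matrix_cells = nodes * nodes

# Assume the most compact case: one bit per cell.
matrix_bytes = matrix_cells // 8

# Fraction of cells that would actually hold a relationship.
fill_ratio = relationships / matrix_cells

print(f"cells: {matrix_cells:,}")
print(f"dense matrix at 1 bit/cell: {matrix_bytes / 1e9:.2f} GB")
print(f"fill ratio: {fill_ratio:.6%}")
```

An adjacency list, by contrast, only needs space proportional to the number of relationships, which is why sparse graphs are rarely stored as a full matrix.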
There is no best (graph) database. There are many different databases for many different use cases, so you need to identify your exact use case, and then you can think about the database.
As for your attempts with Neo4j: writing to Neo4j is indeed very slow if you do it wrong. You may like to have a look at this question and answer about the write performance of Neo4j.

Almost all graph databases can write the graph to disk.
But if you are doing some calculation, such as a shortest path with a very deep search (dozens of hops), memory is much more important than disk.

Related

Simple Graph database

I know about Neo4j, RedisGraph, DGraph, ArangoDB.
But I don't want such a heavy client/server application. I just need to load 4 million nodes and 10 million relationships and request the shortest path between 2 nodes. The graph is unweighted and undirected.
I wrote some Go code that can do the job in memory within 2 seconds on my laptop with the BFS algorithm. But I don't want to reinvent the wheel, and I can't actually do it in a few milliseconds like Neo4j, for example.
Is there an open source project for this?
If all you're doing is calculating bidirectional shortest paths, you may not need a full graph database for this. I would suggest looking at existing graph libraries that can quickly load your graph into memory and that have built-in functions for shortest path. Examples are things like:
https://networkx.programmingpedia.net/
https://igraph.org/
https://graph-tool.skewed.de/
A graph database is going to provide you with persistence, but is going to come with a trade-off of maintaining the database and perhaps needing to learn how to leverage the query language supported by that graph database. Graph databases are great for continual updates of the graph or performing more custom queries/path look-ups.
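For an unweighted, undirected graph like the one in the question, the in-memory approach can be sketched with nothing but the standard library. This is a minimal illustration, not the API of any of the libraries listed above; the function names `build_adjacency` and `shortest_path` and the tiny edge list are hypothetical.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def shortest_path(adjacency, source, target):
    """Return one shortest path as a list of nodes, or None if unreachable."""
    if source not in adjacency or target not in adjacency:
        return None
    parent = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source.
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for neighbour in adjacency[node]:
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return None


# Hypothetical toy data: two components, one short and one long route from 1 to 4.
edges = [(1, 2), (2, 3), (3, 4), (1, 5), (5, 4), (6, 7)]
graph = build_adjacency(edges)
print(shortest_path(graph, 1, 4))
print(shortest_path(graph, 1, 7))
```

The libraries above provide equivalent, and usually much faster, shortest-path routines, so a hand-written version like this is mainly useful for checking that a plain BFS really is all the use case needs.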

Performance: SQL Server 2017 Graph vs Neo4j

I am researching graph databases. I came across SQL Server 2017 and learned that it added the option to use a graph database. But I have some uncertainties about the performance. I have watched several YouTube videos and read tutorials and papers about SQL Server 2017 Graph. For example this page.
With the image above in mind: when I try to find a node, is it true that the time complexity is O(n)? And is the performance in other graph databases like Neo4j similar? I am only talking about node lookup, not shortest-path algorithms etc.
I also have a feeling that the graph functionality in SQL Server is just a relational database in disguise. Is this correct?
Thanks in advance.
There is a big difference between a graph database and a relational database with graph capabilities, in the sense of how the data is stored.
To summarise simply, when a triple (i.e. 2 nodes connected by a relationship) is stored, the difference in the underlying database is:
Neo4j: the triple is stored as a graph on disk. Nodes have pointers to their relationships, so retrieval is just pointer chasing from the nodes.
SQL-like: one node is stored in one table and the other node in another table. You can still query it as a graph, but the operation actually performs a JOIN.
Based on those two facts, we can say that in native graphs the join is performed at write time compared to having joins at query time in non-native graphs.
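The contrast can be sketched in a few lines. This is a deliberately simplified illustration, not how Neo4j or SQL Server store data internally: the `person` and `knows` tables and the `Node` class are hypothetical, and SQLite stands in for the relational side only because it ships with Python.

```python
import sqlite3

# Relational style: nodes and relationships live in tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE knows (src INTEGER, dst INTEGER);
    INSERT INTO person VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cid');
    INSERT INTO knows VALUES (1, 2), (1, 3);
""")

# The relationship is resolved by JOINs at query time.
rows = conn.execute("""
    SELECT friend.name
    FROM person AS p
    JOIN knows AS k ON k.src = p.id
    JOIN person AS friend ON friend.id = k.dst
    WHERE p.name = 'Ann'
    ORDER BY friend.name
""").fetchall()
friends_via_join = [name for (name,) in rows]


# Native-graph style (simplified): each node holds direct references
# to its neighbours, so the "join" was done when the edge was written.
class Node:
    def __init__(self, name):
        self.name = name
        self.knows = []


ann, bob, cid = Node("Ann"), Node("Bob"), Node("Cid")
ann.knows.extend([bob, cid])

# Retrieval is just following the stored references.
friends_via_pointers = sorted(node.name for node in ann.knows)

print(friends_via_join, friends_via_pointers)
```

Both approaches return the same neighbours; the difference lies in when the work of connecting the nodes is done.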
Be very careful when you hear about distributed graphs, partitions, planet scale and the like. If relationships have to be traversed over the network, you will always suffer performance issues. Most distributed graph platforms also note that for maximum performance you have to store everything on one partition (which defeats the purpose of partitioning).

Proper way to store graphs with Neo4J

I'm building a system which allows the user to call N different graphs through an API. Currently I have a working prototype which pulls graphs from CouchDB. However, for obvious reasons, I would like to move to a graph DB. My understanding is that Neo4j can only handle one graph at a time, or requires some sort of tagging system to avoid mixing graphs. Neither of those approaches seems optimal. What's the best-practice approach for this?
A few more things to note: I will be calling these graphs and manipulating them with something like networkx, and I've considered storing the graphs in a "regular" DB then moving them to Neo4J as the requests come in, which seems pretty intense.
Neo4j does not have a concept of multiple databases like most relational databases do using CREATE DATABASE. In Neo4j there is one graph space which you can use.
So you have 2 options:
use separate Neo4j instances (single or clustered) for each graph; using Neo4j in embedded mode may be helpful here
use one Neo4j instance (single or clustered) and store your data in distinct subgraphs. If the subgraphs need some interconnections you can use labels to identify to which subgraph a certain node belongs.

RRD basics and more!

I'm trying to use rrdtool to monitor access points, and what I'd like is to have a separate RRD file for each access point, which is something I'm not sure how to do. If I can do that, then for each site I'd be able to get a graph from different RRD databases according to site location. However, when I want to see a company-level graph I'd like to aggregate data across multiple RRD databases and show that on one graph. So if bandwidth is measured for two devices in two separate RRD databases, I would like to get an "average" of these two data sources and show it in my graph for the site that has these access points. Is this possible? I'm quite new to rrdtool and to thinking in the RRD way, so please do let me know if there are better ways of doing this.
Also, how does RRD use space internally? From what I have read so far, some people say the file size of an RRD database never grows. On the other hand, people ask how much file size it would accumulate over the years. So I'm kind of confused here. I thought it would hold data in memory and write to disk based on the consolidation functions.
Can I generate pie charts from rrdtool as well? I need to find the number of users connected to an access point, and it would be good if I could show that as a pie chart of the total number of users connected to each access point at any given time for a given site. For instance,
access point 1: 20
access point 2: 40
access point 3: 1
If I can generate a pie chart for that it would be sliced according to the number of users.
Sorry, that's quite a few questions. If rrdtool doesn't make a big difference then I might as well use MySQL, as I have a MySQL server running in production, and I can produce graphs on the fly using some funky Flash stuff too. If someone can enlighten me on the pros and cons of using RRD over an RDBMS for time series data, that would be amazing.
Many Thanks guys!!
You can aggregate data from multiple RRDs into one graph; you'd use the CDEF command in your rrdgraph statement to combine DEFs from individual databases.
rrd files stay the same size unless you explicitly resize them by adding rows. Older data is aged out and replaced with new data. (Hence the name "round robin database".)
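The fixed-size behaviour can be illustrated with a small ring buffer. This is only a sketch of the round-robin idea, not rrdtool's file format or API: the `RoundRobinArchive` class is hypothetical, and a real RRD also consolidates samples before storing them.

```python
from collections import deque


class RoundRobinArchive:
    """Hypothetical fixed-size archive: new rows push out the oldest ones."""

    def __init__(self, rows):
        # maxlen makes the deque drop the oldest entry once it is full.
        self.slots = deque(maxlen=rows)

    def update(self, timestamp, value):
        self.slots.append((timestamp, value))

    def fetch(self):
        return list(self.slots)


archive = RoundRobinArchive(rows=3)
for timestamp, value in enumerate([10, 20, 30, 40, 50]):
    archive.update(timestamp, value)

# Only the three most recent rows remain; the size never exceeds `rows`.
print(archive.fetch())
```

However many updates arrive, the archive holds at most the number of rows it was created with, which is why the file size stays constant over the years.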
pie charts...I dunno. :) I've never seen it, but that certainly doesn't mean it's not possible.
Have you read the basic tutorial? http://oss.oetiker.ch/rrdtool/tut/rrdtutorial.en.html That might help you decide what to do.
I would say Cacti is what you are after.
It is a web front end to rrdtool (and much more). You can create devices, add them, set up graphs and it will poll them for data into RRD files. You can have all kinds of graphs, and create aggregate ones etc. You can also query against rrd files for monthly/weekly/yearly/any-time-frame statistics you like.
Everything you have asked for can be done with Cacti except for pie charts.

Which technology is best suited to store and query a huge readonly graph?

I have a huge directed graph: it consists of 1.6 million nodes and 30 million edges. I want users to be able to find all the shortest connections (including incoming and outgoing edges) between two nodes of the graph via a web interface. At the moment I have stored the graph in a PostgreSQL database. But that solution is neither very efficient nor elegant; I basically need to store all the edges of the graph twice (see my question PostgreSQL: How to optimize my database for storing and querying a huge graph).
It was suggested that I use a graph DB like Neo4j or AllegroGraph. However, the free version of AllegroGraph is limited to 50 million nodes and also has a very high-level API (RDF), which seems too powerful and complex for my problem. Neo4j, on the other hand, has only a very low-level API (and the Python interface is not mature yet). Both of them seem better suited to problems where nodes and edges are frequently added to or removed from a graph. For a simple search on a graph, these graph DBs seem too complex.
One idea I had would be to "misuse" a search engine like Lucene for the job, since I'm basically only searching connections in a graph.
Another idea would be to have a server process that stores the whole graph (500MB to 1GB) in memory. The clients could then query the server process, which could traverse the graph very quickly since the graph is stored in memory. Is there an easy way to write such a server (preferably in Python) using some existing framework?
Which technology would you use to store and query such a huge readonly graph?
LinkedIn have to manage a sizeable graph. It may be instructive to check out this info on their architecture. Note particularly how they cache their entire graph in memory.
There is also OrientDB, an open source document-graph DBMS with a commercially friendly license (Apache 2). It has a simple API, an SQL-like language, ACID transactions and support for the Gremlin graph language.
The SQL has extensions for trees and graphs. Example:
select from Account where friends traverse (1,7) (address.city.country.name = 'New Zealand')
This returns all the Accounts with at least one friend who lives in New Zealand, where "friend" is followed recursively up to the 7th level of depth.
I have a directed graph for which I (mis)used Lucene.
Each edge was stored as a Document, with the nodes as Fields of the document that I could then search for.
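The edge-as-document layout described above can be sketched with a tiny in-memory inverted index. This is not the Lucene API; the `EdgeIndex` class and the field names `source` and `target` are hypothetical and only illustrate the idea of searching edges by their node fields.

```python
from collections import defaultdict


class EdgeIndex:
    """Hypothetical index: one document per edge, searchable by field."""

    def __init__(self):
        self.documents = []
        # One inverted index per field: value -> list of document ids.
        self.by_field = {
            "source": defaultdict(list),
            "target": defaultdict(list),
        }

    def add_edge(self, source, target):
        doc_id = len(self.documents)
        self.documents.append({"source": source, "target": target})
        self.by_field["source"][source].append(doc_id)
        self.by_field["target"][target].append(doc_id)

    def search(self, field, value):
        """Return all edge documents whose field equals value."""
        return [self.documents[i] for i in self.by_field[field].get(value, [])]


index = EdgeIndex()
index.add_edge("a", "b")
index.add_edge("a", "c")
index.add_edge("c", "a")

# Outbound links of "a": search on the source field.
outbound = [doc["target"] for doc in index.search("source", "a")]
# Inbound links of "a": search on the target field.
inbound = [doc["source"] for doc in index.search("target", "a")]
print(outbound, inbound)
```

Each lookup fetches only the direct links of one node, so a path search has to issue one query per visited node, which is consistent with batch workloads being much slower than single interactive lookups.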
It performs well enough, and query times for fetching inbound and outbound links from a node would be acceptable to a user using it as a web-based tool. But for computationally intensive batch calculations, where I am doing hundreds of thousands of queries, I am not satisfied with the query times I'm getting. I get the sense that I am definitely misusing Lucene, so I'm working on a second, Berkeley DB based implementation so that I can do a side-by-side comparison of the two. If I get a chance to post the results here, I will.
However, my data requirements are much larger than yours at > 3GB, more than could fit in my available memory. As a result the Lucene index I used was on disk, but with Lucene you can use a "RAMDirectory" index in which case the whole thing will be stored in memory, which may well suit your needs.
Correct me if I'm wrong, but since each node is a list of the linked nodes, it seems to me that a DB with a schema is more of a burden than an advantage.
It also sounds like Google App Engine would be right up your alley:
it's optimized for reading, and there's memcached if you want it even faster
it's distributed, so the size doesn't affect efficiency
Of course, if you somehow rely on a relational DB to find the path, it won't work for you...
And I just noticed that the question is 4 months old
So you have a graph as your data and want to perform a classic graph operation. I can't see what other technology could fit better than a graph database.

Resources