I know about Neo4j, RedisGraph, DGraph, ArangoDB.
But I don't want a so heavy client/server application. I just need to load 4 millions nodes and 10 millions relationships and request for the shortest path between 2 nodes. The graph in unweighted and not directed.
I wrote a Go code that can do the job in memory within 2 seconds on my laptop with the BFS algorithm. But I don't have to re-invent the wheel. I can't actually do it in a few milliseconds like Neo4j for example.
Is there an open source project for this ?
If all you're doing is calculating bidirectional shortest-path, you may not need a full graph database for this. I would suggesting looking at existing graph libraries that can quickly load your graph into memory and that have built-in functions for shortest-path. Examples are things like:
https://networkx.programmingpedia.net/
https://igraph.org/
https://graph-tool.skewed.de/
A graph database is going to provide you with persistence, but is going to come with a trade-off of maintaining the database and perhaps needing to learn how to leverage the query language supported by that graph database. Graph databases are great for continual updates of the graph or performing more custom queries/path look-ups.
Related
I want to ask a question about graph database.
First im using networkx in python and creating graph in memory, but when i reach more nodes - my RAM was not enough.
So, for next time i try to neo4j. Its nice, write graph on disk, but its slow(how i think. With index and other things, more slow than networkx). Now i create 500k nodes and 2000000 relationships, try to find path between two nodes, and neo4j just stuck on my server.
I hear about orientdb, but not try yet now.
So, i need advice, what the best graph database, who can write graph on disk?
Big thanks to you.
PS want only open-source graph database
First of all there are real or native graph databases or non native graph databases. The native graph databases really organize your data in a graph structure and connect the nodes to each other, while the non native are using some kind of model to store your graph representation. You can simply represent a graph as Adjacency matrix which is a table and you maybe could be stored in a row key store with columns (but that wouldn't be very effective and stupid in my opinion). So you first need to ask yourself if you really need a graph database? Second you need to think about the operations read und write you want to perform.
There is not best (graph) database. But there are many different databases for many different use cases - so you need to identify your exact use case and than you can think about the database.
For your tries with neo4j: Writing in neo4j is indeed very slow if you do it wrong. May you like to have a look at this question and answer about write performance of neo4j.
Almost all graph database can write graph on disk.
But if you're doing some calculation, such as shortest path for very deep search (dozens hop), memory is much much more important than disk.
I am researching about graph databases. I stumbled into SQL Server 2017 and learned that they added the option to use a graph database. But I have some uncertainties about the performance. I watched several Youtube videos, tutorials and papers about this SQL Server 2017 Graph. For example this page.
With the image above in mind. When I try to find a node, is it true that the time complexity is O(n)? And is the performance in other graph databases like Neo4j similar? I am only talking about node lookup and not shortest path algorithms etc.
I also have a feeling that the graph functionality in SQL Server is just a relational database in disguise. Is this correct?
Thanks in advance.
There is a big difference between a graph database and a relational database with graph capabilities, in the sense of how the data is stored.
To summarise simply, when a triple ( aka 2 nodes connected by a relationship ) is stored, the underlying database difference will be :
Neo4j, the triple is stored as a graph on disk, nodes have pointers to the relationships they have, so during retrieval it will just be pointer chasing from nodes
SQL like : one node is stored in one table, the other node is stored in another table, yet you can query as a graph but the operation will be really making a JOIN
Based on those two facts, we can say that in native graphs the join is performed at write time compared to having joins at query time in non-native graphs.
Be very careful when you hear distributed graphs, partitions, planet scale and the like. If you start having relationships that have to be traversed over the network you will always suffer performance issues. Most of the distributed graphs platforms note also that for maximum performance you have to store everything on one partition (which defeats the partitioning purpose).
I'm building a system which allows the user to call N number of different graphs through an API. Currently I have a working prototype which pulls graphs from CouchDB. However, for obvious reasons, I would like to move to a graph DB. My understanding is that Neo4J can only handle one graph at a time or requires so sort of tagging system to not mix graphs. Neither of those approaches seem optimal. What's the best practice approach for this?
A few more things to note: I will be calling these graphs and manipulating them with something like networkx, and I've considered storing the graphs in a "regular" DB then moving them to Neo4J as the requests come in, which seems pretty intense.
Neo4j does not have a concept of multiple databaseslike most relational databases do using CREATE DATABASE. In Neo4j there is one graph space which you can use.
So you have 2 options:
use seperate Neo4j instances (single or clustered) for each graph, maybe using Neo4j in embedded mode is helpful here
use one Neo4j instance (single or clustered) and store your data in distinct subgraphs. If the subgraphs need some interconnections you can use labels to identify to which subgraph a certain node belongs.
I've been experimenting with Titan over the past few weeks and would like some pointers on the way forward, plus a few specific questions. The purpose of the project is to store log data on a Cassandra cluster (for this question let's use the example of web traffic) and represent relationships in a Titan graph. All nodes are modelled as having an entity value and type (e.g. "google.com","hostname"), and edges have a label (e.g. "connects") as well as several attributes of the relationship (timestamp, flow length and so on).
Once this data is stored in cassandra and represented as a Titan graph, I plan to use d3 code to generate visualisations. At the end of the tunnel I am hoping to be able to build large-scale, interactive, complex graph networks that look something like this: http://goo.gl/CVEd55
My current setup is as follows:
A python script to convert log files into vertices.csv and edges.csv files for Gremlin to load in
Titan Server 0.4 (using CassandraThrift as the storage backend) - gremlin script to load converted data into Titan
Python script that uses NetworkX to open a RexPro connection, allowing the analyst to enter a custom Gremlin query, outputting the result as a JSON
Local web front-end that uses the generated JSON and d3 to display the results of the query as a graph
Ideally as a test base case, I would like the user to be able to type a Gremlin query into the web front-end and be directed to a page containing an interactive d3 graph of the result.
My specific questions are are follows:
What is the process for assigning attributes to edges? I have had trouble finding sample code that helps me represent the graph using the model listed above.
My gremlin script to load data into Titan uses bg.commit() to create a batch graph which is later referenced in the RexPro connection conn= RexProConnection('localhost,8184,'bg'). This was working originally but after changing my load script, clearing the graph in Gremlin and then reloading, the RexPro connection cannot be opened due to the graph bg apparently not existing. What is the process of updating graphs in Titan? Presumably running a load script twice using the same graph will only add nodes/vertices to the existing one, so how would I go about generating a new graph with the same name every time I update my model, and have RexPro be able to reference it when running a query?
How easy would it be to extend the interface to allow an analyst to enter SQL queries into the front end, using RexPro to access the graph in a similar way to the one described?
Apologies for the long post, but if anyone could share their expertise that would be much appreciated!
For d3 visualization, you can use force directed graph. There are a few variations of them.
Relationship Graph
https://vida.io/documents/qZ5SJdRJfj3XmSXYJ
Force Layout Tree
https://vida.io/documents/sy7vzWW7BJEvKdZeL
If your network contains a large number of node and edges, you'll need to cluster data before visualizing. You can use tools like Gephi, NodeXL to perform clustering. Then use clustered data to build force directed visualization.
What is the process for assigning attributes to edges?
The process is the same as adding properties to vertices. Get an Edge instance then do:
Edge e = g.addEdge(v1,v2,'label')
e.setProperty('weight',0.1d)
As for:
What is the process of updating graphs in Titan? Presumably running a load script twice using the same graph will only add nodes/vertices to the existing one, so how would I go about generating a new graph with the same name every time I update my model, and have RexPro be able to reference it when running a query?
You don't want a reference to a BatchGraph after loading as it comes with limitations that will prevent you from querying. It sounds like you should just configure "yourgraph" in rexster.xml, when you load through your script, simply wrap your rexster.xml configured Graph in your code, and perform your load operations against it. When you want to query it, simply reference "yourgraph" instead of "bg".
conn = RexProConnection('localhost,8184,'yourgraph')
How easy would it be to extend the interface to allow an analyst to enter SQL queries into the front end, using RexPro to access the graph in a similar way to the one described?
It's hard to say if that's "easy" as that depends on factors outside of just the technology. I'll say that it's possible to to build an interface that accepts Gremlin queries (your wrote SQL, but I assume you meant Gremlin), passes them to Rexster and gets back an answer. What you do with that answer is up to you, but as far as Rexster's part plays into it, I don't see why that would be a problem.
I have a huge directed graph: It consists of 1.6 million nodes and 30 million edges. I want the users to be able to find all the shortest connections (including incoming and outgoing edges) between two nodes of the graph (via a web interface). At the moment I have stored the graph in a PostgreSQL database. But that solution is not very efficient and elegant, I basically need to store all the edges of the graph twice (see my question PostgreSQL: How to optimize my database for storing and querying a huge graph).
It was suggested to me to use a GraphDB like neo4j or AllegroGraph. However the free version of AllegroGraph is limited to 50 million nodes and also has a very high-level API (RDF), which seems too powerful and complex for my problem. Neo4j on the other hand has only a very low level API (and the python interface is not mature yet). Both of them seem to be more suited for problems, where nodes and edges are frequently added or removed to a graph. For a simple search on a graph, these GraphDBs seem to be too complex.
One idea I had would be to "misuse" a search engine like Lucene for the job, since I'm basically only searching connections in a graph.
Another idea would be, to have a server process, storing the whole graph (500MB to 1GB) in memory. The clients could then query the server process and could transverse the graph very quickly, since the graph is stored in memory. Is there an easy possibility to write such a server (preferably in Python) using some existing framework?
Which technology would you use to store and query such a huge readonly graph?
LinkedIn have to manage a sizeable graph. It may be instructive to check out this info on their architecture. Note particularly how they cache their entire graph in memory.
There is also OrientDB a open source document-graph dbms with commercial friendly license (Apache 2). Simple API, SQL like language, ACID Transactions and the support for Gremlin graph language.
The SQL has extensions for trees and graphs. Example:
select from Account where friends traverse (1,7) (address.city.country.name = 'New Zealand')
To return all the Accounts with at least one friend that live in New Zealand. And for friend means recursively up to the 7th level of deep.
I have a directed graph for which I (mis)used Lucene.
Each edge was stored as a Document, with the nodes as Fields of the document that I could then search for.
It performs well enough, and query times for fetching in and outbound links from a node would be acceptable to a user using it as a web based tool. But for computationally intensive, batch calculations where I am doing many 100000s queries I am not satisfied with the query times I'm getting. I get the sense that I am definitely misusing Lucene so I'm working on a second Berkeley DB based implementation so that I can do a side by side comparison of the two. If I get a chance to post the results here I will do.
However, my data requirements are much larger than yours at > 3GB, more than could fit in my available memory. As a result the Lucene index I used was on disk, but with Lucene you can use a "RAMDirectory" index in which case the whole thing will be stored in memory, which may well suit your needs.
Correct me if I'm wrong, but since each node is list of the linked nodes, seems to me a DB with a schema is more of a burden than an advantage.
It also sound like Google App Engine would be right up your alley:
It's optimized for reading - and there's memcached if you want it even faster
it's distributed - so the size doesn't affect efficiency
Of course if you somehow rely on Relational DB to find the path, it won't work for you...
And I just noticed that the q is 4 months old
So you have a graph as your data and want to perform a classic graph operation. I can't see what other technology could fit better than a graph database.