Number of 'graphs' in an ArangoDB database - graph

I am exploring ArangoDB and I am not sure I understand correctly how to use the graph concept in ArangoDB.
Let's say I am modelling a social network. Should I create a single graph for the whole social network, or should I create a graph for every person and their connections?
I've got the feeling I should use a single graph, but is there any performance or functionality issue related to that choice?
Maybe the underlying question is this: should I consider the graph concept in ArangoDB as a technical or as a business-related concept?
Thanks

You should not use a graph per person. The quick answer is to use a single graph.
In general, I think you should treat the graph concept as a technical concept. Having said that, quite often, a (mathematical) graph models a relationship arising from business very naturally. Thus, the technical concept graph in a graph database maps very well to the business logic.
A social network is one of the prime examples. Typical questions here are "find the friends of a user?", "find the friends of the friends of a user?" or "what is the shortest path from person A to person B?". A graph database will be most useful for questions involving an a priori unknown path length, like for example in the shortest path example.
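The three typical questions above can be sketched in plain Python. This is only an illustration over a small in-memory adjacency dict: the data and the function names (`friends`, `friends_of_friends`, `shortest_path`) are made up for this example and are not ArangoDB's API. Note how the first two questions have a fixed path length (1 and 2), while the shortest path needs a search of unknown depth.

```python
from collections import deque

# Hypothetical friendship data: an undirected graph stored as an adjacency dict.
FRIENDS = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice"},
    "dave": {"bob", "erin"},
    "erin": {"dave"},
}

def friends(graph, user):
    """Path length 1: the direct neighbours of the user."""
    return set(graph.get(user, ()))

def friends_of_friends(graph, user):
    """Path length 2: neighbours of neighbours, minus the user and direct friends."""
    direct = friends(graph, user)
    result = set()
    for friend in direct:
        result |= graph.get(friend, set())
    return result - direct - {user}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of vertices on a shortest path, or None."""
    if start == goal:
        return [start]
    parent = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in graph.get(current, ()):
            if nxt in parent:
                continue  # already reached by a path that is at least as short
            parent[nxt] = current
            if nxt == goal:
                # Walk the parent links back to the start, then reverse.
                path = [nxt]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None
```

For example, `shortest_path(FRIENDS, "alice", "erin")` has to walk three hops, a depth that is not known before the search starts.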
Coming back to the original question: You should start by looking at the queries you will have about your data. You then want to arrange things so that these queries map conveniently onto the standard graph operations (or indeed other queries) your data store can answer. This then tells you what kind of information should be in the same graph, and which bits belong in separate graphs.
In your original use case of a social network, I would assume that you want to run queries involving chains of friendship-relations, so the edges in these chains must be in the same graph. However, in more complicated cases it is for example conceivable that you have a "friendship" graph and a "follows" graph, both using different edges but the same vertices. In that case you might have two graphs for your social network.
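As a rough sketch of the "two graphs, same vertices" idea, assume one shared set of people and two separate edge sets. All names and data below are hypothetical, and this is plain Python rather than ArangoDB's graph API; it only shows that each edge set can be queried on its own.

```python
# Hypothetical data: one shared vertex set, two separate edge sets.
PEOPLE = {"alice", "bob", "carol"}
FRIENDSHIP_EDGES = {("alice", "bob"), ("bob", "carol")}  # treated as undirected
FOLLOWS_EDGES = {("carol", "alice"), ("alice", "bob")}   # directed: follower -> followed

def neighbors(edges, vertex, directed):
    """Vertices reachable from `vertex` in one step within the given edge set."""
    out = {dst for src, dst in edges if src == vertex}
    if not directed:
        # For undirected edges, also follow them in the opposite direction.
        out |= {src for src, dst in edges if dst == vertex}
    return out
```

A friendship query would then run only over `FRIENDSHIP_EDGES` and a follower query only over `FOLLOWS_EDGES`, although both refer to the same people.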

Related

Does an Increased Number of Node Types Impact Performance of Graph DBs?

I am in the process of creating a graph database, a simple one for movies with several types of information like the actors, producers, directors and so on.
What I would like to know is, is it better to break down your nodes to a more granular level? For example, is it better to have two kinds of nodes for 'actors' and 'directors' or is it better to have one node, say 'person' and use different kinds of relationships like 'acted_in' and 'directed'? Does this even matter at all?
Further, is there any impact on the traversal queries? Does having more types of nodes mean that the traversal is slower?
Note: I intend to implement this using the Gremlin console in Amazon Neptune.
The answer really is: it depends. If I were building such a model I would break out the key "nouns" into their own nodes. I would also label the edges appropriately, such as ACTED_IN or DIRECTED.
The performance of any graph query depends on how much data it will need to touch (the fan out factor as you go from depth to depth).
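To make the fan-out point concrete, here is a minimal sketch that counts how many new vertices a breadth-first expansion reaches at each depth. It is plain Python over a made-up adjacency dict, not Gremlin or Neptune code, and the function name `touched_per_depth` is an assumption for this example.

```python
# Hypothetical movie graph: movies (m*) connected to people (p*).
GRAPH = {
    "m1": ["p1", "p2"],
    "p1": ["m1", "m2", "m3"],
    "p2": ["m1", "m3"],
    "m2": ["p1"],
    "m3": ["p1", "p2"],
}

def touched_per_depth(graph, start, max_depth):
    """Return a list whose i-th entry is the number of vertices first reached at depth i."""
    seen = {start}
    frontier = {start}
    counts = [1]  # depth 0 is just the start vertex
    for _ in range(max_depth):
        next_frontier = set()
        for vertex in frontier:
            for neighbor in graph.get(vertex, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.add(neighbor)
        counts.append(len(next_frontier))
        frontier = next_frontier
    return counts
```

The sum of the returned counts is a rough proxy for how much data a traversal of that depth has to touch, regardless of how many node types the model uses.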
The best advice I can give you is think about the questions you will need the graph to answer and try to design your data model so that writing those queries is as easy as possible. Don't be afraid to iterate multiple times on your data model also. That is common and expected.
Properties can be useful when you want to add a unique piece of information to a node - perhaps the birthday of the director.
Edge properties can be useful for filtering out unneeded edges, but edge labels can do that as well. In some cases you may find that a label such as DIRECTED-IN-2005 is a useful shortcut that avoids checking both a label and a property on an edge.
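The trade-off between a label plus a property and a single combined label can be sketched as follows. The edge data, the `year` property and the helper names are hypothetical, and this is plain Python rather than Gremlin; it only illustrates that both encodings select the same edges.

```python
# Hypothetical edges; names, labels and years are made up for illustration.
EDGES = [
    {"src": "person_1", "dst": "movie_a", "label": "DIRECTED", "year": 2005},
    {"src": "person_1", "dst": "movie_b", "label": "DIRECTED", "year": 2010},
    {"src": "person_2", "dst": "movie_a", "label": "ACTED_IN", "year": 2005},
]

def filter_by_label_and_year(edges, label, year):
    """Two checks per edge: the label and the 'year' property."""
    return [e for e in edges if e["label"] == label and e["year"] == year]

def with_year_in_label(edges):
    """Return copies of the edges whose label also encodes the year."""
    return [{**e, "label": f"{e['label']}-IN-{e['year']}"} for e in edges]

def filter_by_label(edges, label):
    """One check per edge: the label only."""
    return [e for e in edges if e["label"] == label]
```

Whether the combined label is worth it depends on the queries: it makes one specific filter cheaper, at the cost of many more distinct labels.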

Best graph database for storing millions of nodes

I want to ask a question about graph databases.
At first I was using networkx in Python and creating the graph in memory, but as the number of nodes grew, I ran out of RAM.
So next I tried Neo4j. It is nice and writes the graph to disk, but it is slow (or so I think; even with indexes and other tuning it is slower than networkx). I have now created 500k nodes and 2,000,000 relationships, and when I try to find a path between two nodes, Neo4j just hangs on my server.
I have heard about OrientDB, but have not tried it yet.
So I need advice: what is the best graph database that can write the graph to disk?
Many thanks.
PS: I only want an open-source graph database.
First of all, there are native graph databases and non-native graph databases. Native graph databases really organize your data in a graph structure and connect the nodes to each other, while non-native ones use some other model to store a representation of your graph. For example, you can represent a graph as an adjacency matrix, which is a table, and you could store that in a row-key store with columns (but in my opinion that would be neither effective nor sensible). So you first need to ask yourself whether you really need a graph database. Second, you need to think about the read and write operations you want to perform.
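To illustrate the difference between a table-like representation and one that links nodes directly, here is a small sketch in plain Python. The function names are made up for this example, and the adjacency list is only an analogy for how a native graph store connects nodes, not a description of any particular database's storage format.

```python
def to_adjacency_matrix(vertices, edges):
    """Table-like form: one row and one column per vertex, 1 where an edge exists."""
    index = {v: i for i, v in enumerate(vertices)}
    matrix = [[0] * len(vertices) for _ in vertices]
    for a, b in edges:
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1  # undirected: mirror the entry
    return matrix

def to_adjacency_list(vertices, edges):
    """Linked form: each vertex keeps only the vertices it is connected to."""
    adjacency = {v: [] for v in vertices}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    return adjacency
```

The matrix needs space proportional to the square of the number of vertices even when most entries are 0, while the adjacency list grows with the number of edges, which is one reason the table form scales poorly for sparse graphs.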
There is no best (graph) database. There are many different databases for many different use cases, so you need to identify your exact use case, and then you can think about the database.
Regarding your attempts with Neo4j: writing to Neo4j is indeed very slow if you do it wrong. You may want to have a look at this question and answer about the write performance of Neo4j.
Almost all graph databases can write the graph to disk.
But if you're doing some calculation, such as a shortest path with a very deep search (dozens of hops), memory is much, much more important than disk.

Proper way to store graphs with Neo4J

I'm building a system which allows the user to call N different graphs through an API. Currently I have a working prototype which pulls graphs from CouchDB. However, for obvious reasons, I would like to move to a graph DB. My understanding is that Neo4j can only handle one graph at a time or requires some sort of tagging system to avoid mixing graphs. Neither of those approaches seems optimal. What's the best-practice approach for this?
A few more things to note: I will be calling these graphs and manipulating them with something like networkx, and I've considered storing the graphs in a "regular" DB then moving them to Neo4J as the requests come in, which seems pretty intense.
Neo4j does not have a concept of multiple databases like most relational databases do using CREATE DATABASE. In Neo4j there is one graph space which you can use.
So you have 2 options:
use separate Neo4j instances (single or clustered) for each graph; using Neo4j in embedded mode may be helpful here
use one Neo4j instance (single or clustered) and store your data in distinct subgraphs. If the subgraphs need some interconnections you can use labels to identify to which subgraph a certain node belongs.
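The second option can be sketched in plain Python, assuming each node carries a tag that names the subgraph it belongs to. The `graph` property, the data and the `subgraph` helper are hypothetical and stand in for Neo4j labels; this is not Neo4j's API.

```python
# Hypothetical single graph space holding two tagged subgraphs.
NODES = {
    1: {"graph": "g1"},
    2: {"graph": "g1"},
    3: {"graph": "g2"},
    4: {"graph": "g2"},
}
EDGES = [(1, 2), (3, 4), (2, 3)]  # (2, 3) is an interconnection between the subgraphs

def subgraph(nodes, edges, tag):
    """Return the node ids with the given tag and the edges that stay inside that set."""
    keep = {node_id for node_id, props in nodes.items() if props["graph"] == tag}
    inner_edges = [(a, b) for a, b in edges if a in keep and b in keep]
    return keep, inner_edges
```

Each API request would then select one tag and work only with the matching nodes and edges, while interconnecting edges such as `(2, 3)` remain available for queries that span subgraphs.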

Hierarchy in a graph

How can one compute or quantify the hierarchical nature of a given graph, if there is a hierarchy in that graph?
More specifically, I want to know whether some sort of hierarchy exists in an artificial neural network (with some number of hidden layers), and I also want to measure it.
This is an interesting open ended question and so I'll answer loosely and off the cuff.
So you want to find out if the graph is logically like a tree? There is no up or down in a graph, so probably what you are really looking for is a way to find the node or nodes in your graph that are the most highly connected to other highly connected nodes, and then take a certain perspective supposition and determine if a "tree" like graph makes sense using that node as the trunk or root of the tree. Another thing you could do is just choose any random node, suppose that is the root of the tree, and then see what happens. You can re-balance the tree if the node connections lead you to want to do that based on a certain number of connections or traversals, and you can attempt to find the "real root" - if there is such a thing. If you re-balance a certain number of times, or detect a circular path has been traversed, you may decide that the graph is not very hierarchical at all. If you do find a "real root", then you might then decide to look for depth, avg. branch numbers, balance stats, etc. If you refine the question, I'll refine my answer.
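One part of this idea, picking a candidate root, checking whether a circular path is reachable, and then collecting depth and branching statistics, can be sketched as below. It is a minimal illustration that assumes a simple undirected graph stored as a dict of neighbour sets; the function name `tree_stats` is made up, and it does not cover the re-balancing or root-selection steps.

```python
from collections import deque

def tree_stats(graph, root):
    """Breadth-first walk from `root`.

    Returns None if a cycle is reachable from the root; otherwise a dict with the
    depth of the tree and the average number of children of non-leaf vertices.
    """
    parent = {root: None}
    depth = {root: 0}
    children = {root: 0}
    queue = deque([root])
    while queue:
        vertex = queue.popleft()
        for neighbor in graph.get(vertex, ()):
            if neighbor == parent[vertex]:
                continue  # the edge we arrived by
            if neighbor in parent:
                return None  # reached an already visited vertex: a cycle exists
            parent[neighbor] = vertex
            depth[neighbor] = depth[vertex] + 1
            children[neighbor] = 0
            children[vertex] += 1
            queue.append(neighbor)
    internal = [count for count in children.values() if count > 0]
    average = sum(internal) / len(internal) if internal else 0.0
    return {"depth": max(depth.values()), "avg_branching": average}

# Hypothetical examples: a small tree and a triangle.
TREE = {"r": {"a", "b"}, "a": {"r", "c", "d"}, "b": {"r"}, "c": {"a"}, "d": {"a"}}
TRIANGLE = {"x": {"y", "z"}, "y": {"x", "z"}, "z": {"x", "y"}}
```

Running it from several candidate roots and comparing the resulting depths is one crude way to look for a "real root".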

Storing multiple graphs in Neo4J

I have an application that stores relationship information in a MySQL table (contact_id, other_contact_id, strength, recorded_at). This is fine if all I need to do is show who a contact's relationships are or even to generate a list of mutual contacts for two contacts.
But now I need to generate stats like: 'what was the total number of 2-way connections of strength 3 or better in January 2011' or (assuming that each contact is part of a group) 'which group has the most number of connections to other groups' etc.
I quickly found that the SQL for generating these stats became unwieldy real fast.
So I wrote a script that generates a graph in memory for any given date. I could then run whatever stat I wanted against that graph. This was much easier to understand and, in general, much more performant as well, except for the part that generates the graph.
My next thought was to cache those graphs so I could call on them whenever I needed to run a new stat (or generate a later graph: eg for today's graph I take yesterday's graph and apply any changes that happened since yesterday). I tried memcached which worked great until the graphs grew > 1 MB.
So now I'm thinking about using a graph database like Neo4J.
Only problem is, I don't have just one graph. Or I do, but it is one that changes over time and I need to be able to query it with different reference times.
So, can I:
store multiple graphs in Neo4j and retrieve/interact with them separately? I would then create and store separate social graphs for each date.
or
add valid-from and valid-to timestamps to each edge and filter the graph appropriately: so if I wanted a graph for "May 1st" I would only follow the newest edge between two nodes that was created before "May 1st" (and if all the edges were created after May 1st then those nodes wouldn't be connected).
I'm pretty new to graph databases so any help/pointers/hints will be appreciated.
Right now you can store just one graph database in a single Neo4j instance, but this one graphdb can contain as many different sub-graphs as you like. You only have to keep that in mind when doing global operations (like index queries) but there you can do compound queries that include timestamped properties as well to limit the results.
One way of doing that is, as you said, to add temporal information to the edges to represent the structure of the graph for a given date; you can then traverse the structure of the graph as it was at that time.
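A minimal sketch of the temporal-edge approach, in plain Python rather than Neo4j code: each edge carries a validity interval and a snapshot is obtained by filtering on a reference date. The property names `valid_from` and `valid_to`, the half-open interval convention and the sample data are assumptions made for this example.

```python
from datetime import date

# Hypothetical edges with validity intervals; valid_to=None means "still valid".
EDGES = [
    {"src": "a", "dst": "b", "strength": 3,
     "valid_from": date(2011, 1, 5), "valid_to": date(2011, 4, 1)},
    {"src": "a", "dst": "b", "strength": 4,
     "valid_from": date(2011, 4, 1), "valid_to": None},
    {"src": "b", "dst": "c", "strength": 2,
     "valid_from": date(2011, 6, 1), "valid_to": None},
]

def snapshot(edges, on):
    """Edges that were valid on the date `on` (valid_from inclusive, valid_to exclusive)."""
    return [
        e for e in edges
        if e["valid_from"] <= on and (e["valid_to"] is None or on < e["valid_to"])
    ]
```

A traversal for a given reference date would then only follow the edges returned by `snapshot`, so one stored graph can answer questions about many points in time.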
Reference node has a different meaning in Neo4j.
Using category nodes per day (and linking them and also aggregating them for higher level timespans) is the more graphy way of categorizing nodes than indexed properties. (Effectively these are in-graph indices that you can easily include in your traversals and graph queries).
You don't have to duplicate the nodes as long as you are only interested in different temporal structures. If your nodes also differ (e.g. their properties change), you could either duplicate them, effectively creating different subgraphs, or create a linked list of history nodes on each node that contain just the changes (or the full snapshot, depending on your requirements).
Your domain sounds very fitting for the graph database. If you have more and detailed questions feel free to join the Neo4j mailing list.
Not the easiest solution (I'm assuming you only work with one machine), but if you really want to separate your graphs, you only need to remember that a graph is a directory.
You can then create a dynamic loader class which takes the path of the database you want, loads it into memory for the query, and closes it after you get your answer. You could also configure a proxy server and send 2 parameters to your loader: your query (which I presume is a Cypher query in this case) and the path of the database you want to query.
This is not adequate if you have tons of real-time queries to answer. But if it is simply for storing data sets and running some analytics over them, it can definitely meet your needs.
This is an old question, but starting with Neo4j 4.x, multi-tenancy is supported and you can have different databases within the same Neo4j server (with distinct RBAC permissions).
