Graph database design: Should I add relationships, or just traverse?

I have recently started exploring graph databases and Neo4j, and would like to work with my own data. At the moment I've hit some confusion. I've created an example image to illustrate my issue. In terms of efficiency, I'm wondering which option is better (and I want to get it right now, in the early days, before I start handling larger amounts of data).
Option A: Using only the blue relationships, I can work out whether things are related to, or come under, the Ancient group. This process will be done many, many times, but it is unlikely to involve more than ~6 generations.
Option B: I implement the red relationships, so that it is much faster to work out if young structures belong to the Ancient group.
I'm trying not to use labels in this scenario, because I'm reserving labels for a specific purpose to simplify my life (linking structures across separate networks), and I'm not sure if I should have a label to represent a node that already exists.
In summary, I'm wondering whether adding a whole new bunch of relationships, whilst taking more space, is worth it, or whether traversing to find all relatives is such a simple/inexpensive task that it isn't worth doing so. Or alternatively, both options are viable and this isn't a real issue at all. Thanks for reading.

I'd go with Option A. One of the strengths of Neo4j is that it traverses relationships very efficiently, so there is no need to materialise relationships (relationships are sometimes materialised in complex and/or extremely large graphs, but that is not your case).
Not sure why you don't want to use labels? Labels serve to group nodes into sets of the same type, and they are also index-backed. This makes it much faster to find the starting point of your query (an index lookup instead of a full database scan).
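To make the trade-off between the two options concrete, here is a minimal sketch in plain Python (not Neo4j or Cypher). The node names, the PARENT mapping and both helper functions are hypothetical stand-ins for the blue and red relationships in the question. It only illustrates how much extra data Option B stores; it says nothing about real Neo4j timings.

```python
# Hypothetical data: each node points to its parent via a blue "child of" edge.
PARENT = {
    "Ancient": None,
    "Old": "Ancient",
    "Middle": "Old",
    "Young": "Middle",
    "Other": None,
}

def belongs_to(node, group, parent=PARENT, max_depth=6):
    """Option A: walk up the parent edges, at most max_depth hops."""
    current = parent.get(node)
    for _ in range(max_depth):
        if current is None:
            return False
        if current == group:
            return True
        current = parent.get(current)
    return False

def materialise(parent=PARENT):
    """Option B: precompute every (node, ancestor) pair, i.e. the red edges."""
    ancestors = {}
    for node in parent:
        seen = set()
        current = parent[node]
        while current is not None:
            seen.add(current)
            current = parent[current]
        ancestors[node] = seen
    return ancestors

ANCESTORS = materialise()
blue_edges = sum(1 for p in PARENT.values() if p is not None)
red_edges = sum(len(a) for a in ANCESTORS.values())
print(belongs_to("Young", "Ancient"), "Ancient" in ANCESTORS["Young"])
print("blue edges:", blue_edges, "materialised pairs:", red_edges)
```

With only a handful of generations the walk in belongs_to is a few hops, which is why materialising the extra pairs is rarely worth the storage and the upkeep.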

Related

Firebase Database data consumption optimization (observing node only partly)

On my database, I have Post and User Models
User Model has a lot of information, but when I load the posts, I only need 3 out of like 20 parameters.
What I am currently doing is just loading the entire node. This is obviously not really efficient.
My question: Is it more efficient if I observe all 3 values (making 3 connections) individually or just observe the entire node once (making only a single connection).
I don't know exactly what would be more expensive (higher consumption as making 3 connections is probably not better than 1)
Kind regards
Firebase always loads complete nodes. While it is possible to get a subset of nodes with queries, that doesn't apply here.
So you will either have to load all nodes and do the subselection client-side, or you'll have to create another higher level node that only contains the three properties that you're interested in.
Which one to choose depends heavily on your use case, and is (honestly) largely subjective. The main options:
You can reduce bandwidth a bit by loading only the three properties, but if you store them as a duplicate, you'll end up paying for the storage of duplicate information.
You can also store the three properties separately without duplicating them. But that means that if you need all properties, you'll need to execute two read operations, which adds some overhead and complicates the code.
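As a rough illustration of the duplication option, the sketch below models the data as plain Python dictionaries. It does not use the Firebase SDK; the field names, the SUMMARY_FIELDS tuple and the user_summaries node are all hypothetical. It only shows that a small summary node carries far less data than the full user node.

```python
import json

# Hypothetical user node with many fields (only a few are shown here).
users = {
    "u1": {
        "name": "Ada",
        "avatar": "a.png",
        "verified": True,
        "bio": "x" * 500,
        "email": "ada@example.com",
        "settings": {"theme": "dark"},
    },
}

# The three properties the post list actually needs (assumed names).
SUMMARY_FIELDS = ("name", "avatar", "verified")

def build_summary(user):
    """Copy only the fields needed when rendering a post."""
    return {key: user[key] for key in SUMMARY_FIELDS}

# A separate top-level node holding the duplicated subset.
user_summaries = {uid: build_summary(user) for uid, user in users.items()}

def payload_size(node):
    """Approximate the transferred size as the length of the JSON encoding."""
    return len(json.dumps(node))

full_size = payload_size(users["u1"])
summary_size = payload_size(user_summaries["u1"])
print("full node:", full_size, "bytes; summary node:", summary_size, "bytes")
```

The cost of this layout is that every write to one of the three fields has to update both nodes.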

Does an Increased Number of Node Types Impact Performance of Graph DBs?

I am in the process of creating a graph database, a simple one for movies with several types of information like the actors, producers, directors and so on.
What I would like to know is, is it better to break down your nodes to a more granular level? For example, is it better to have two kinds of nodes for 'actors' and 'directors' or is it better to have one node, say 'person' and use different kinds of relationships like 'acted_in' and 'directed'? Does this even matter at all?
Further, is there any impact on the traversal queries? Does having more types of nodes mean that the traversal is slower?
Note: I intend to implement this using the Gremlin console in Amazon Neptune.
The answer really is "it depends". If I were building such a model, I would break out the key "nouns" into their own nodes. I would also label the edges appropriately, such as ACTED_IN or DIRECTED.
The performance of any graph query depends on how much data it will need to touch (the fan out factor as you go from depth to depth).
The best advice I can give you is think about the questions you will need the graph to answer and try to design your data model so that writing those queries is as easy as possible. Don't be afraid to iterate multiple times on your data model also. That is common and expected.
Properties can be useful when you want to add a unique piece of information to a node - perhaps the birthday of the director.
Edge properties can be useful for filtering out unneeded edges, but edge labels can do that too. In some cases you may find that a label such as DIRECTED-IN-2005 is a useful shortcut to avoid checking both a label and a property on an edge.
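The idea of a single "person" node type with labelled edges, filtered by label first and by edge property second, can be sketched in plain Python. This is not Gremlin or Neptune code; the sample data, the out_index structure and the movies helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample data: (person, edge label, movie, edge properties).
EDGES = [
    ("clint", "ACTED_IN", "unforgiven", {"year": 1992}),
    ("clint", "DIRECTED", "unforgiven", {"year": 1992}),
    ("clint", "DIRECTED", "million_dollar_baby", {"year": 2004}),
    ("hilary", "ACTED_IN", "million_dollar_baby", {"year": 2004}),
]

# Index outgoing edges by (source, label) so a query only touches
# the edges that carry the label it asks for.
out_index = defaultdict(list)
for src, label, dst, props in EDGES:
    out_index[(src, label)].append((dst, props))

def movies(person, label, year=None):
    """Follow edges with one label, optionally filtering on an edge property."""
    candidates = out_index.get((person, label), [])
    return [dst for dst, props in candidates
            if year is None or props["year"] == year]

print(movies("clint", "DIRECTED"))
print(movies("clint", "DIRECTED", year=2004))
```

The label narrows the set of edges touched before any property is inspected, which is the fan-out reduction described above.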

Gremlin: Which one is faster? within() or out().in()

I have a user search feature in my app where the searcher may not want to see some results. They do this by "blocking" a tag: when a tag is blocked, all users that are "subscribed" to that tag are ignored in the search results.
I'm writing the query to filter the search results and I found 2 ways of getting the same:
First:
g.V(1991)
  .out("blocked").fold().as("blockedTags")
  .V().hasLabel("user")
  .not(
    where(
      out("subscribed").where(
        within("blockedTags")
      )
    )
  )
Second:
g.V(1991).as("user")
  .V().hasLabel("user")
  .not(
    where(
      out("subscribed")
        .in("blocked")
        .as("user")
    )
  )
Gremlify: https://gremlify.com/xnqhvtzo6b
One uses within() and the other performs two steps, out() and in(). I want to know which one is faster so I can decide which one to use, since both options are possible in many queries of my application.
EDIT:
I ran both queries in the Gremlin Console with a profile() step at the end, but the >TOTAL field gives seemingly random times from 0.300ms to 1.220ms for both queries, so I don't know how to compare the performance of the two queries.
I will offer a general answer here that is largely derived from the comments on the question itself. It really isn't possible to profile() one graph and then project those results on another. They will each have different capabilities and performance characteristics. If you need to know which of two approaches to a query is better, then you must test both traversals on the graph system you intend to target.
I'd also be wary of going too far in a particular development direction without doing ongoing testing on the target graph. Just as you wouldn't do all your development on MySQL only to switch to Oracle when it was time to go to production, you really shouldn't try to build your entire application against a graph you don't intend to use. There are subtle differences between these systems that could make a significant difference to you.
As for the differing profile() times on TinkerGraph, there are bound to be timing differences on the JVM for what I'm guessing is a test on a small dataset that resides in memory. Or perhaps for TinkerGraph there is no significant difference between the two approaches. Consider executing each query a few thousand times, averaging the time taken, and comparing the averages. Gremlin Console has a clock() function that helps with that. Of course, as I alluded to earlier, what you learn there is no guarantee that you have the right solution on Neptune.
If you'd like a bit of analysis of your queries, I could offer a few words (though I don't base this thinking on Neptune specifically). How each performs depends a lot on your graph structure, but I'd expect the first query to be faster because it captures the "blocked" vertices with:
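The "run it many times and average" advice can be sketched with Python's standard timeit module. This is not the Gremlin Console clock() function; the average_time helper and the two lambdas are hypothetical stand-ins for the two traversals, and the batch sizes are arbitrary.

```python
import statistics
import timeit

def average_time(fn, repeat=3, number=200):
    """Mean seconds per call over `repeat` batches of `number` calls each."""
    batches = timeit.repeat(fn, repeat=repeat, number=number)
    return statistics.mean(batches) / number

# Hypothetical stand-ins for the two queries being compared.
users = list(range(200))
blocked_list = list(range(0, 200, 7))
blocked_set = set(blocked_list)

timings = {
    "list_membership": average_time(
        lambda: [u for u in users if u not in blocked_list]),
    "set_membership": average_time(
        lambda: [u for u in users if u not in blocked_set]),
}

# Print the candidates from fastest to slowest.
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds * 1e6:.1f} microseconds per call")
```

Averaging over many calls smooths out warm-up and scheduling noise that dominates a single sub-millisecond measurement, but the numbers still only describe the system they were measured on.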
.out("blocked").fold()
and reuses that list over and over for however many V().hasLabel('user') vertices there are. That's just a gut feeling, though. I'm guessing the blocked list will be relatively small for a single user, so traversing the opposite way with:
out("subscribed").in("blocked")
would just be more expensive as you would have to traverse a lot more "blocked" edges that don't terminate with the initial vertex.
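The reasoning above can be modelled in plain Python. This is not Gremlin; the toy data and both functions are hypothetical, and the edge counter is only a rough proxy for traversal work. The first function collects the searcher's blocked tags once, as the fold() does, while the second walks from each user's tags back along every "blocked" edge.

```python
# Hypothetical toy graph: who blocks which tags, and who subscribes to which tags.
blocked = {"me": {"spam", "ads"}, "other": {"spam"}, "third": {"spam"}}
subscribed = {
    "alice": {"music"},
    "bob": {"spam", "music"},
    "carol": {"ads"},
    "me": {"music"},
}

def visible_precomputed(searcher):
    """Like the first query: collect the blocked tags once, then test membership."""
    blocked_tags = blocked.get(searcher, set())
    return sorted(u for u, tags in subscribed.items() if not (tags & blocked_tags))

# Reverse index: tag -> users who blocked it (the incoming "blocked" edges).
blocked_by = {}
for user, tags in blocked.items():
    for tag in tags:
        blocked_by.setdefault(tag, set()).add(user)

def visible_roundtrip(searcher):
    """Like the second query: follow every 'blocked' edge back from each tag."""
    result, edges_walked = [], 0
    for user, tags in subscribed.items():
        hit = False
        for tag in tags:
            for blocker in blocked_by.get(tag, ()):
                edges_walked += 1  # includes edges that end at other users
                if blocker == searcher:
                    hit = True
        if not hit:
            result.append(user)
    return sorted(result), edges_walked

print(visible_precomputed("me"))
print(visible_roundtrip("me"))
```

Both functions return the same users; the second one also reports how many "blocked" edges it had to walk, most of which lead to users other than the searcher when a tag is blocked by many people.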

How to deal with high redundancy in Neo4j?

I am working on an application that uses both relational and graph databases (SQLite and Neo4j). I am trying to see whether I can get rid of SQLite and use only Neo4j, and I am confronted with a redundancy problem.
Let's say I have nodes that represent audio tracks. I want to store the musical genre of each track. With hundreds of thousands of nodes, I don't think repeating "South-African Psytrance" as a string property is a good idea, and I am pretty sure that creating a "South-African Psytrance" node and linking it to all concerned nodes is an even worse idea (bottleneck?).
Am I right if I say that using 1) properties takes too much space, and using 2) relationships is a bad design for this particular problem?
The current code uses the sqlite db to store a set of musical genres, and their indexes as properties in nodes (which are converted to their string representation in the application layer).
Is there a way to use only neo4j and avoid bottlenecks and redundancy?
Option 1 is definitely NOT the way to go, as it will waste space and is antithetical to good graph DB design.
Option 2 is the classic way you would do this with a graph DB. There are many examples of neo4j DBs with very large numbers of relationships per node. And neo4j currently supports up to 34 billion relationships in a DB, so there is little danger that you will exceed a capacity limit. So, I would recommend that you at least try using this approach.
There are also a few blogs about people using neo4j for storing similar data. For example:
http://neo4j.com/blog/musicbrainz-in-neo4j-part-1/
http://neo4j.com/blog/fun-with-music-neo4j-and-talend/
http://neo4j.com/blog/upload-last-fm-data-neo4j-rneo4j-transactional-endpoint/
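The difference between repeating the genre string on every track and linking every track to one shared genre node can be sketched in plain Python. This is not Neo4j code; the dictionaries below are hypothetical stand-ins for nodes, and the object references stand in for relationships.

```python
# Option 1 (hypothetical data): every track repeats the genre string as a property.
tracks = [{"title": f"track{i}", "genre": "South-African Psytrance"}
          for i in range(1000)]

# Option 2: one genre node; each track keeps only a reference to it,
# and the genre node lists the tracks attached to it.
genre_nodes = {}
track_nodes = []
for track in tracks:
    genre = genre_nodes.setdefault(
        track["genre"], {"name": track["genre"], "tracks": []})
    node = {"title": track["title"], "genre": genre}
    genre["tracks"].append(node)
    track_nodes.append(node)

psytrance = genre_nodes["South-African Psytrance"]
print(len(genre_nodes), "genre node,", len(psytrance["tracks"]), "linked tracks")
```

With the shared node, renaming a genre is a single update, and "all tracks of this genre" is one hop from the genre node instead of a scan over every track.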
[EDITED]
As the slides mentioned by @Pawamoy imply, there is actually a third option. That is, you can create a specific node label for each genre, and apply the appropriate genre label (a node can have more than one) to every track node. This would allow you to avoid using relationships for genres. However, it would tend to "muddy" the label space, since labels at least feel like "node types", and a "music genre" is not an "album track". Also, neo4j supports a very limited number of labels per node, and the maximum number of labels in a DB is also relatively small. So, I would not use this approach unless there was a definite advantage to doing so and the capacity limits are not an issue.

Storing multiple graphs in Neo4J

I have an application that stores relationship information in a MySQL table (contact_id, other_contact_id, strength, recorded_at). This is fine if all I need to do is show who a contact's relationships are or even to generate a list of mutual contacts for two contacts.
But now I need to generate stats like: 'what was the total number of 2-way connections of strength 3 or better in January 2011' or (assuming that each contact is part of a group) 'which group has the most number of connections to other groups' etc.
I quickly found that the SQL for generating these stats became unwieldy real fast.
So I wrote a script that for any given date it will generate a graph in memory. I could then run whatever stat I wanted against that graph. Much easier to understand and in general, much more performant also -- except for the generating the graph part.
My next thought was to cache those graphs so I could call on them whenever I needed to run a new stat (or generate a later graph: eg for today's graph I take yesterday's graph and apply any changes that happened since yesterday). I tried memcached which worked great until the graphs grew > 1 MB.
So now I'm thinking about using a graph database like Neo4J.
Only problem is, I don't have just one graph. Or I do, but it is one that changes over time and I need to be able to query it with different reference times.
So, can I:
store multiple graphs in Neo4J and retrieve/interact with them separately? I would then create and store separate social graphs for each date.
or
add valid-to and valid-from timestamps to each edge and filter the graph appropriately: so if I wanted a graph for "May 1st", I would only follow the newest edge between two nodes that was created before "May 1st" (and if all the edges were created after May 1st, then those nodes wouldn't be connected).
I'm pretty new to graph databases so any help/pointers/hints will be appreciated.
Right now you can store just one graph database in a single Neo4j instance, but this one graphdb can contain as many different sub-graphs as you like. You only have to keep that in mind when doing global operations (like index queries) but there you can do compound queries that include timestamped properties as well to limit the results.
One way of doing that is, as you said, to add temporal information to the edges. That captures the structure of the graph for any given date, so you can traverse the graph as it was back then.
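The timestamped-edge approach can be sketched in plain Python. This is not Neo4j code; the edge tuples mirror the MySQL columns from the question (contact_id, other_contact_id, strength, recorded_at), the sample rows are made up, and the snapshot helper is hypothetical. It keeps, for each pair, the newest edge recorded strictly before the reference date.

```python
from datetime import date

# Hypothetical edges: (contact, other_contact, strength, recorded_at).
EDGES = [
    ("ann", "bob", 2, date(2011, 1, 5)),
    ("ann", "bob", 4, date(2011, 3, 1)),
    ("bob", "cat", 3, date(2011, 6, 1)),
]

def snapshot(edges, as_of):
    """Keep, per pair, the newest edge recorded strictly before as_of."""
    latest = {}
    for a, b, strength, recorded_at in edges:
        if recorded_at >= as_of:
            continue  # this edge did not exist yet at the reference date
        key = (a, b)
        if key not in latest or recorded_at > latest[key][1]:
            latest[key] = (strength, recorded_at)
    return {pair: strength for pair, (strength, _) in latest.items()}

print(snapshot(EDGES, date(2011, 2, 1)))
print(snapshot(EDGES, date(2011, 5, 1)))
print(snapshot(EDGES, date(2011, 7, 1)))
```

Pairs whose edges were all recorded on or after the reference date simply do not appear in the snapshot, which matches the "those nodes wouldn't be connected" case in the question.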
Reference node has a different meaning in Neo4j.
Using category nodes per day (and linking them and also aggregating them for higher level timespans) is the more graphy way of categorizing nodes than indexed properties. (Effectively these are in-graph indices that you can easily include in your traversals and graph queries).
You don't have to duplicate the nodes as long as you are only interested in different temporal structures. If your nodes also differ over time (e.g. changing properties), you could either duplicate them, effectively creating different subgraphs, or create a connected list of history nodes on each node that contains just the changes (or the full snapshot, depending on your requirements).
Your domain sounds very fitting for the graph database. If you have more and detailed questions feel free to join the Neo4j mailing list.
Not the easiest solution (I'm assuming you only work with one machine), but if you really want to separate your graphs, you only need to remember that a graph is a directory.
You can then create a dynamic loader class which takes the path of the database you want, loads it into memory for the query, and closes it after you get your answer. You could also configure a proxy server, and send two parameters to your loader: your query (which I presume is a Cypher query in this case) and the path of the database you want to query.
This is not adequate if you have tons of real-time queries to answer. But if it is simply for storing data sets and running some analytics over them, it can definitely meet your needs.
This is an old question, but starting with Neo4j 4.x, multi-tenancy is supported and you can have different databases within the same Neo4j server (with distinct RBAC permissions).
