Does an Increased Number of Node Types Impact Performance of Graph DBs?

I am in the process of creating a graph database, a simple one for movies with several types of information like the actors, producers, directors and so on.
What I would like to know is, is it better to break down your nodes to a more granular level? For example, is it better to have two kinds of nodes for 'actors' and 'directors' or is it better to have one node, say 'person' and use different kinds of relationships like 'acted_in' and 'directed'? Does this even matter at all?
Further, is there any impact on the traversal queries? Does having more types of nodes mean that the traversal is slower?
Note: I intend to implement this using the Gremlin console in Amazon Neptune.

The answer really is: it depends. If I were building such a model, I would break out the key "nouns" into their own nodes. I would also label the edges appropriately, such as ACTED_IN or DIRECTED.
The performance of any graph query depends on how much data it will need to touch (the fan out factor as you go from depth to depth).
The best advice I can give you is think about the questions you will need the graph to answer and try to design your data model so that writing those queries is as easy as possible. Don't be afraid to iterate multiple times on your data model also. That is common and expected.
Properties can be useful when you want to add a unique piece of information to a node - perhaps the birthday of the director.
Edge properties can be useful for filtering out unneeded edges, but edge labels can do that too. In some cases you may find that a label such as DIRECTED-IN-2005 is a useful shortcut that avoids checking both a label and a property on an edge.
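To make the label-versus-property point concrete, here is a minimal sketch in plain Python (not Gremlin, and not how Neptune stores data). The `Graph` class, its method names, and the sample ids are all made up for illustration. It uses one `person` node type with ACTED_IN and DIRECTED edges, and counts how many edges each style of lookup has to touch.

```python
from collections import defaultdict


class Graph:
    """Tiny in-memory property graph (hypothetical); out-edges are grouped by label."""

    def __init__(self):
        self.labels = {}  # vertex id -> node label
        # vertex id -> edge label -> list of (target id, edge properties)
        self.out = defaultdict(lambda: defaultdict(list))

    def add_vertex(self, vid, label):
        self.labels[vid] = label

    def add_edge(self, src, edge_label, dst, **props):
        self.out[src][edge_label].append((dst, props))

    def out_by_label(self, src, edge_label):
        """Follow only edges with the given label; returns (targets, edges touched)."""
        edges = self.out[src][edge_label]
        return [dst for dst, _ in edges], len(edges)

    def out_by_property(self, src, key, value):
        """Scan every outgoing edge and filter on a property; returns (targets, edges touched)."""
        hits, touched = [], 0
        for edges in self.out[src].values():
            for dst, props in edges:
                touched += 1
                if props.get(key) == value:
                    hits.append(dst)
        return hits, touched


g = Graph()
g.add_vertex("p1", "person")
for movie in ("m1", "m2", "m3", "m4"):
    g.add_vertex(movie, "movie")
g.add_edge("p1", "ACTED_IN", "m1", year=2003)
g.add_edge("p1", "ACTED_IN", "m2", year=2005)
g.add_edge("p1", "DIRECTED", "m3", year=2005)
g.add_edge("p1", "DIRECTED", "m4", year=2008)

directed, touched_label = g.out_by_label("p1", "DIRECTED")
in_2005, touched_prop = g.out_by_property("p1", "year", 2005)
```

In this toy model the label lookup only visits the two DIRECTED edges, while the property filter visits all four. A real engine may index things differently, so treat this as an intuition for fan-out rather than a benchmark.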

Related

Modeling time in graph databases

I read in the Neo4j documentation a section about how to make queries that depend on time more efficient:
One way to model time-specific data and relationships is by including
data in the relationship type. Because Neo4j is optimized specifically
for traversing relationships between entities, you can often improve
query performance by specifying a date as the relationship type and
only traversing particular dated relationships.
But I was wondering: with this technique you have to repeat the same thing every time you want to make a time-based query more efficient. For example, if you want to query the posts created by a specific user on a specific date, you have to add (similarly to AirportDay) something like UserDay.
My question is: is there a way to model time universally in your graph, so that time becomes the main entry point for querying events and activities in the graph?
There's no answer to modelling time universally in your graph. It depends on your use cases.
The example in your post is one way to optimise non-performant queries that traverse too many relationships of the same type from a node.
You could also store time as a property on the node, and index it.
And then there's the option of a timetree: https://graphaware.com/neo4j/2014/08/20/graphaware-neo4j-timetree.html
To summarise, it depends on your use cases; there is usually no need to optimise prematurely.
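For a feel of what a time tree looks like, here is a rough sketch in plain Python. It is not the GraphAware API; the functions `attach`, `events_on` and `events_in_month` and the sample post ids are hypothetical. The idea is a root -> year -> month -> day hierarchy with events hanging off the day nodes, so a date becomes an entry point into the graph.

```python
from datetime import date


def attach(tree, when, event):
    """Attach an event under root -> year -> month -> day, creating nodes as needed."""
    day_node = (
        tree.setdefault(when.year, {})
        .setdefault(when.month, {})
        .setdefault(when.day, [])
    )
    day_node.append(event)


def events_on(tree, when):
    """Walk year -> month -> day and return the events attached to that day."""
    return list(tree.get(when.year, {}).get(when.month, {}).get(when.day, []))


def events_in_month(tree, year, month):
    """Collect events from every day node under one month node, in day order."""
    days = tree.get(year, {}).get(month, {})
    return [event for day in sorted(days) for event in days[day]]


tree = {}
attach(tree, date(2014, 8, 20), "post-1")
attach(tree, date(2014, 8, 21), "post-2")
attach(tree, date(2015, 1, 3), "post-3")

same_day = events_on(tree, date(2014, 8, 20))
same_month = events_in_month(tree, 2014, 8)
```

Whether this is worth the extra nodes depends on how many of your queries actually start from a date.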

Graph database design: Should I add relationships, or just traverse

I have recently started exploring graph databases and Neo4J, and would like to work with my own data. At the moment I've hit some confusion. I've created an example image to illustrate my issue. In terms of efficiency, I'm wondering which option is better (and I want to get it right now in early days before I start handling larger amounts).
Option A: Using only the blue relationships, I can work out whether things are related to, or come under, the Ancient group. This process will be done many many times, however it is unlikely to be more than ~6 generations.
Option B: I implement the red relationships, so that it is much faster to work out if young structures belong to the Ancient group.
I'm trying not to use Labels in this scenario, as I'm trying to use labels for a specific purpose to simplify my life (linking structures across separate networks), and I'm not sure if I should have a label to represent a node that already exists.
In summary, I'm wondering whether adding a whole new bunch of relationships, whilst taking more space, is worth it, or whether traversing to find all relatives is such a simple/inexpensive task that it isn't worth doing so. Or alternatively, both options are viable and this isn't a real issue at all. Thanks for reading.
I'd go with Option A. One of the strengths of Neo4j is that it traverses relationships very efficiently and quickly, and so, there is no need to materialise relationships (sometimes, relationships are materialised in complex and/or extremely large graphs, but this is not your case).
Not sure why you don't want to use labels? Labels serve to group nodes into sets of the same type, and are also index backed. This makes it much faster to find the starting point of your query (index lookup instead of a full database scan).
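As a sketch of what Option A amounts to, here is a bounded breadth-first walk up the parent links in plain Python. The `parents` mapping, the node names and the `belongs_to` function are hypothetical; in Neo4j you would express the same thing as a variable-length path query rather than hand-written code.

```python
from collections import deque


def belongs_to(parents, start, target, max_depth=6):
    """Breadth-first walk up the parent links, following at most max_depth hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if node == target:
            return True
        if depth == max_depth:
            continue  # do not expand beyond the hop limit
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, depth + 1))
    return False


# child -> list of parents (the "blue" relationships)
parents = {
    "young": ["middle"],
    "middle": ["old"],
    "old": ["Ancient"],
    "other": ["unrelated"],
}

young_is_ancient = belongs_to(parents, "young", "Ancient")
other_is_ancient = belongs_to(parents, "other", "Ancient")
```

With only around six generations to walk, each check visits a handful of nodes, which is why materialising the extra relationships is usually unnecessary here.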

How to deal with high redundancy in Neo4j?

I am working on an application that uses both relational and graph databases (sqlite and neo4j). I am trying to see if I can get rid of sqlite and use only neo4j, and I am faced with a problem of redundancy.
Let's say I have nodes that represent audio tracks. I want to store the musical genre of each track. With hundreds of thousands of nodes, I don't think repeating "South-African Psytrance" as a string property is a good idea, and I am pretty sure that creating a "South-African Psytrance" node and linking it to all concerned nodes is an even worse idea (bottleneck?).
Am I right if I say that using 1) properties takes too much space, and using 2) relationships is a bad design for this particular problem?
The current code uses the sqlite db to store a set of musical genres, and their indexes as properties in nodes (which are converted to their string representation in the application layer).
Is there a way to use only neo4j and avoid bottlenecks and redundancy?
Option 1 is definitely NOT the way to go, as it will waste space and is antithetical to good graph DB design.
Option 2 is the classic way you would do this with a graph DB. There are many examples of neo4j DBs with very large numbers of relationships per node. And neo4j currently supports up to 34 billion relationships in a DB, so there is little danger that you will exceed a capacity limit. So, I would recommend that you at least try using this approach.
There are also a few blogs about people using neo4j for storing similar data. For example:
http://neo4j.com/blog/musicbrainz-in-neo4j-part-1/
http://neo4j.com/blog/fun-with-music-neo4j-and-talend/
http://neo4j.com/blog/upload-last-fm-data-neo4j-rneo4j-transactional-endpoint/
[EDITED]
As the slides mentioned by @Pawamoy imply, there is actually a third option. That is, you can create a specific node label for each genre, and apply the appropriate genre label (a node can have more than one) to every track node. This would allow you to avoid using relationships for genres. However, it would tend to "muddy" the label space, since labels at least feel like "node types", and a "music genre" is not an "album track". Also, neo4j supports a very limited number of labels per node, and the maximum number of labels in a DB is also relatively small. So, I would not use this approach unless there is a definite advantage to doing so and the capacity limits are not an issue.
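To illustrate the shape of Option 2, here is a toy sketch in plain Python with made-up track ids and a hypothetical `tag` helper. Each genre exists once as a node, every track points at it through a HAS_GENRE-style link, and "all tracks in a genre" is a single hop from the genre node. It says nothing about how neo4j stores this on disk.

```python
from collections import defaultdict

genre_tracks = defaultdict(set)  # genre node -> track nodes linked to it
track_genre = {}                 # track node -> its genre node


def tag(track, genre):
    """Link a track to a genre node (one HAS_GENRE-style relationship)."""
    track_genre[track] = genre
    genre_tracks[genre].add(track)


for i in range(1000):
    tag(f"track-{i}", "South-African Psytrance" if i % 2 == 0 else "Ambient")

genre_count = len(genre_tracks)
psytrance_count = len(genre_tracks["South-African Psytrance"])
```

Here 1000 tracks share two genre nodes, and renaming a genre would mean touching one node instead of every track.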

Number of 'graphs' in an ArangoDB database

I am exploring ArangoDB and I am not sure I understand correctly how to use the graph concept in ArangoDB.
Let's say I am modelling a social network. Should I create a single graph for the whole social network, or should I create a graph for every person and their connections?
I've got the feeling I should use a single graph... But is there any performance/functionality issue related to that choice?
Maybe the underlying question is this: should I consider the graph concept in ArangoDB as a technical or as a business-related concept?
Thanks
You should not use a graph per person. The first quick answer would be to use a single graph.
In general, I think you should treat the graph concept as a technical concept. Having said that, quite often, a (mathematical) graph models a relationship arising from business very naturally. Thus, the technical concept graph in a graph database maps very well to the business logic.
A social network is one of the prime examples. Typical questions here are "find the friends of a user?", "find the friends of the friends of a user?" or "what is the shortest path from person A to person B?". A graph database will be most useful for questions involving an a priori unknown path length, like for example in the shortest path example.
Coming back to the original question: You should start by looking at the queries you will have about your data. You then want to arrange things so that these queries map conveniently onto the standard graph operations (or indeed other queries) your data store can answer. This then tells you what kind of information should be in the same graph, and which bits belong in separate graphs.
In your original use case of a social network, I would assume that you want to run queries involving chains of friendship-relations, so the edges in these chains must be in the same graph. However, in more complicated cases it is for example conceivable that you have a "friendship" graph and a "follows" graph, both using different edges but the same vertices. In that case you might have two graphs for your social network.
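A small sketch of that idea in plain Python (this is not the ArangoDB API; the names and the `shortest_path` helper are made up): one set of vertices, two edge sets, and a breadth-first shortest path that only sees the edges of the graph you hand it. For brevity, edges are treated as undirected here, including the "follows" ones.

```python
from collections import deque


def shortest_path(edges, start, goal):
    """BFS over an undirected edge list; returns the list of vertices or None."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk back to the start
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in adjacency.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


vertices = {"alice", "bob", "carol", "dave"}       # shared by both graphs
friendship = [("alice", "bob"), ("bob", "carol")]  # "friendship" graph
follows = [("dave", "alice")]                      # "follows" graph

friends_only = shortest_path(friendship, "alice", "carol")
unreachable = shortest_path(friendship, "alice", "dave")
combined = shortest_path(friendship + follows, "dave", "carol")
```

A chain of friendships can only be found if all its edges are in the graph being traversed, which is the reason to keep them in one graph rather than one graph per person.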

How does representing all information in Nodes vs Attributes affect storage, computations?

While using graph databases (in my case Neo4j), we can represent the same information in many ways: making each entity a Node and connecting all entities through relationships, or just adding the entities to the attribute list of a Node.
Following are two different representations of the same data.
Overall, which mechanism is suitable in which conditions?
My use case involves traversing the Database from different nodes until 4 depths and examining the information through connected nodes or attributes (based on which approach it is).
One query of interest may be, "Who are the friends of John who went to Stanford?"
What is the difference in terms of storage and computations?
Normally, properties are loaded lazily and are more expensive to hold in cache, especially strings. Nodes and relationships are most effective for traversal, especially since the relationship types are stored together with the relationship records and thus don't trigger property loads when used in traversals.
Also, a balanced graph (that is, not many dense nodes with over say 10K relationships) is most effective to traverse.
I would try to model most of the recurring properties as nodes connected to the entities, thus using the graph itself to index on these values, instead of having to fall back on filtering by property values or on an expensive index lookup for the property.
The first one is much better, since you're querying on entities such as Stanford, and that entity is related to many person nodes. My opinion is that modeling as nodes is more intuitive and easier to query on. "Find all persons who went to Stanford" would not be very easy to do in your second model, as you don't have a place to start traversing from.
I'd use attributes mainly to describe the node/entity and to filter results from the query, e.g. "Who are the friends of John who went to Stanford in the year 2010?" In this case, the year attribute would just be used to trim the results. It depends on your use case: if year is really important and drives a lot of queries, or is used to represent a timeline, you could even model the year as a node attached to Stanford.
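Here is a rough side-by-side of the two models in plain Python, with made-up people and hypothetical function names. In the node model, Stanford is a node you can start from. In the attribute model, the school is a property on each person, so "everyone who went to Stanford" becomes a scan over all persons.

```python
# Shared: friendship relationships.
friends = {"John": {"Mary", "Raj", "Li"}}

# Node model: each school is a node with links to the persons who attended.
attended = {"Stanford": {"Mary", "Li", "Omar"}, "MIT": {"Raj"}}

# Attribute model: the school is just a property on the person.
people = {
    "Mary": {"school": "Stanford"},
    "Li": {"school": "Stanford"},
    "Omar": {"school": "Stanford"},
    "Raj": {"school": "MIT"},
}


def friends_who_attended(person, school):
    """Node model: intersect the person's friends with the school node's neighbours."""
    return friends.get(person, set()) & attended.get(school, set())


def friends_who_attended_attr(person, school):
    """Attribute model: visit each friend and check the property."""
    return {
        f for f in friends.get(person, set())
        if people.get(f, {}).get("school") == school
    }


def all_who_attended_attr(school):
    """Attribute model: no starting node, so every person has to be scanned."""
    return {p for p, props in people.items() if props.get("school") == school}


via_nodes = friends_who_attended("John", "Stanford")
via_attrs = friends_who_attended_attr("John", "Stanford")
```

Both models answer the "friends of John" question, but only the node model gives "find all persons who went to Stanford" a place to start (`attended["Stanford"]`) without scanning.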
