Neo4j - family graph design and ancestor/pedigree lookup

I just started playing around with Neo4j, so my apologies if this is a simple concept...
I'm building a relatively large database of family information (a few million nodes with about 5-15 properties per node). As of right now, all data is being stored in a MySQL database using Redis as a caching layer, but I'm playing around with switching out Redis for Neo4j to help speed up some of our more expensive queries (and eventually using Neo4j as the main data store instead of MySQL).
I'm playing around with storing all my nodes and their properties in Neo4j, and connecting them via HAS_FATHER and HAS_MOTHER relationships. Is this a good approach? Would it be more beneficial to use HAS_PARENT and set a parent_type property on each relationship to either father or mother? Should I also save a reverse relationship called HAS_CHILD on all parents? What are the pros and cons of my options?
Secondly, assuming that I'm using the HAS_FATHER and HAS_MOTHER relationships, what's the optimal query to grab all nodes, properties, and relationships for all direct ancestors (pedigree) 7 generations away? Here's an example query that I'm currently playing with, but I'm new to Cypher and I'm not too familiar with the bottlenecks, optimizations, etc.
MATCH tree = (c)-[:HAS_FATHER|HAS_MOTHER*0..7]->(p)
WHERE c.id = 29421
RETURN nodes(tree), rels(tree)
Any help or tips would be appreciated. Thanks!

Having HAS_MOTHER and HAS_FATHER instead of a HAS_PARENT with a type property is definitely better. With more specific relationship types, a query for mothers, for example, doesn't need to dig into properties; the traversal can rely solely on the relationship type.
The reason for that being more performant is that properties are lazy loaded on demand, see http://neo4j.com/docs/stable/performance-guide.html#_neo4j_primitives_lifecycle.
If you have semantically inverse relationships, you don't have to model them explicitly: if a is the mother of b, then b is a child of a. So to query for children, just follow HAS_FATHER and HAS_MOTHER in the inverse direction.
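To make the two points above concrete, here is a minimal in-memory sketch in Python (not Neo4j itself). The toy data, the `PARENTS` layout and the function names are all assumptions made for illustration. It walks HAS_FATHER/HAS_MOTHER up to a depth limit, as the `*0..7` pattern in the question does, and finds children by reading the same relationships in reverse.

```python
from collections import deque

# Hypothetical toy data: child id -> {relationship type: parent id}.
PARENTS = {
    1: {"HAS_FATHER": 2, "HAS_MOTHER": 3},
    2: {"HAS_FATHER": 4, "HAS_MOTHER": 5},
    3: {"HAS_MOTHER": 6},
}

def pedigree(person_id, max_depth=7):
    """Collect (child, rel_type, parent, depth) edges up to max_depth generations."""
    edges = []
    queue = deque([(person_id, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the generation limit
        for rel_type, parent in sorted(PARENTS.get(node, {}).items()):
            edges.append((node, rel_type, parent, depth + 1))
            queue.append((parent, depth + 1))
    return edges

def children(parent_id):
    """Read HAS_FATHER / HAS_MOTHER in the inverse direction; no HAS_CHILD edge is stored."""
    return sorted(child for child, rels in PARENTS.items() if parent_id in rels.values())
```

Note that `children` scans the whole dictionary only because this toy structure is one-directional; a graph database can follow a relationship from either end, which is why the explicit HAS_CHILD relationship is unnecessary there.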

Related

Does an Increased Number of Node Types Impact Performance of Graph DBs?

I am in the process of creating a graph database, a simple one for movies with several types of information like the actors, producers, directors and so on.
What I would like to know is, is it better to break down your nodes to a more granular level? For example, is it better to have two kinds of nodes for 'actors' and 'directors' or is it better to have one node, say 'person' and use different kinds of relationships like 'acted_in' and 'directed'? Does this even matter at all?
Further, is there any impact on the traversal queries? Does having more types of nodes mean that the traversal is slower?
Note: I intend to implement this using the Gremlin console in Amazon Neptune.
The answer really is: it depends. If I were building such a model I would break out the key "nouns" into their own nodes. I would also label the edges appropriately, such as ACTED_IN or DIRECTED.
The performance of any graph query depends on how much data it will need to touch (the fan out factor as you go from depth to depth).
The best advice I can give you is think about the questions you will need the graph to answer and try to design your data model so that writing those queries is as easy as possible. Don't be afraid to iterate multiple times on your data model also. That is common and expected.
Properties can be useful when you want to add a unique piece of information to a node - perhaps the birthday of the director.
Edge properties can be useful for filtering out unneeded edges, but so can edge labels. In some cases you may find that a label such as DIRECTED-IN-2005 is a useful shortcut that avoids checking both a label and a property on an edge.
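As a rough illustration of filtering by edge label first and by edge property second, here is a small Python sketch. It is plain in-memory Python, not Gremlin or Neptune; the edge data and the `directed_in_year` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy edges: (source, label, target, properties).
EDGES = [
    ("nolan", "DIRECTED", "batman_begins", {"year": 2005}),
    ("nolan", "DIRECTED", "inception", {"year": 2010}),
    ("bale", "ACTED_IN", "batman_begins", {"year": 2005}),
]

# Group outgoing edges by (source, label) so a query only touches edges with the label it asks for.
OUT = defaultdict(list)
for src, label, dst, props in EDGES:
    OUT[(src, label)].append((dst, props))

def directed_in_year(person, year):
    """The label narrows the candidates first; the property check then trims by year."""
    candidates = OUT.get((person, "DIRECTED"), [])
    return [dst for dst, props in candidates if props["year"] == year]
```

The fewer edges survive the label step, the fewer properties need to be read, which is the "how much data it will need to touch" point made above.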

How to deal with high redundancy in Neo4j?

I am working on an application that uses both relational and graph databases (SQLite and Neo4j). I am trying to see whether I can get rid of SQLite and use only Neo4j, and I am confronted with a problem of redundancy.
Let's say I have nodes that represent audio tracks. I want to store each track's musical genre. With hundreds of thousands of nodes, I don't think repeating "South-African Psytrance" as a string property is a good idea, and I am pretty sure that creating a "South-African Psytrance" node and linking it to all concerned nodes is an even worse idea (bottleneck?).
Am I right if I say that using 1) properties takes too much space, and using 2) relationships is a bad design for this particular problem?
The current code uses the sqlite db to store a set of musical genres, and their indexes as properties in nodes (which are converted to their string representation in the application layer).
Is there a way to use only neo4j and avoid bottlenecks and redundancy?
Option 1 is definitely NOT the way to go, as it will waste space and is antithetical to good graph DB design.
Option 2 is the classic way you would do this with a graph DB. There are many examples of neo4j DBs with very large numbers of relationships per node. And neo4j currently supports up to 34 billion relationships in a DB, so there is little danger that you will exceed a capacity limit. So, I would recommend that you at least try using this approach.
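A minimal Python sketch of Option 2, assuming a toy in-memory model rather than Neo4j itself: each genre string is stored once as its own "node", and tracks only hold a reference to it. The dictionaries and the `add_track` helper are hypothetical names for illustration.

```python
# Hypothetical toy model: one node per genre; tracks point at it instead of repeating the string.
genres = {}           # genre name -> genre node id
track_genre = {}      # track id -> genre node id
tracks_by_genre = {}  # genre node id -> list of track ids (the reverse direction)

def add_track(track_id, genre_name):
    """Link a track to its genre node, creating the genre node on first use."""
    genre_id = genres.setdefault(genre_name, len(genres))
    track_genre[track_id] = genre_id
    tracks_by_genre.setdefault(genre_id, []).append(track_id)

for i in range(1000):
    add_track(i, "South-African Psytrance" if i % 2 else "Dub Techno")
```

After loading 1000 tracks there are still only two genre nodes, and listing every track of a genre is a lookup from the genre side rather than a scan over all tracks.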
There are also a few blogs about people using neo4j for storing similar data. For example:
http://neo4j.com/blog/musicbrainz-in-neo4j-part-1/
http://neo4j.com/blog/fun-with-music-neo4j-and-talend/
http://neo4j.com/blog/upload-last-fm-data-neo4j-rneo4j-transactional-endpoint/
[EDITED]
As the slides mentioned by @Pawamoy imply, there is actually a third option. That is, you can create a specific node label for each genre, and apply the appropriate genre label (a node can have more than one) to every track node. This would allow you to avoid using relationships for genres. However, it would tend to "muddy" the label space, since labels at least feel like "node types", and a "music genre" is not an "album track". Also, neo4j supports a very limited number of labels per node, and the maximum number of labels in a DB is also relatively small. So, I would not use this approach unless there was a definite advantage to doing so and the capacity limits are not an issue.

Neo4j Design: Property vs "Node & Relationship"

I have a node type with a string property that will very often have the same value, e.g. millions of nodes with only 5 possible values for that string. I will be doing searches by that property.
My question would be what is better in terms of performance and memory:
a) Implement it as a node property and have lots of duplicates (and search using WHERE).
b) Implement it as 5 additional nodes, where all original nodes reference one of them (and search using additional MATCH).
Without knowing further details it's hard to give a general purpose answer.
From a performance perspective it's better to limit the search as early as possible. Even more beneficial if you do not have to look into properties for a traversal.
Given that, I assume it's better to move the lookup property into a separate node and use the value as the relationship type.
Use labels; this blog post is a good intro to this new Neo4j 2.0 feature:
Labels and Schema Indexes in Neo4j
I've thought about this problem a little as well. In my case, I had to represent state:
STARTED
IN_PROGRESS
SUBMITTED
COMPLETED
Overall, the Node + Relationship approach looks more appealing: only a single relationship reference needs to be maintained each time, rather than a property string, and you don't need to scan an additional index that has to be maintained on the property (memory and performance would intuitively favor this approach).
Another advantage is that it easily supports the ability of a node being linked to multiple "special nodes". If you foresee a situation where this should be possible in your model, this is better than having to use a property array (and searching using "in").
In practice I found that the problem then became how to access these special nodes each time. Either you maintain some sort of constants reference holding the node IDs of these special nodes, so you can jump right to them in your START statement (this is what we do), or you need to search against a property of the special node each time (name, perhaps) and then traverse its relationships. This doesn't make for the prettiest of Cypher queries.

How does representing all information in Nodes vs Attributes affect storage, computations?

While using graph databases (Neo4j in my case), we can represent the same information in many ways: making each entity a node and connecting all entities through relationships, or just adding the entities to the attribute list of a node.
Following are two different representations of the same data.
Overall, which mechanism is suitable in which conditions?
My use case involves traversing the Database from different nodes until 4 depths and examining the information through connected nodes or attributes (based on which approach it is).
One query of interest may be, "Who are the friends of John who went to Stanford?"
What is the difference in terms of storage and computation?
Normally, properties are loaded lazily and are more expensive to hold in cache, especially strings. Nodes and relationships are most effective for traversal, especially since relationship types are stored together with the relationship records and thus don't trigger property loads when used in traversals.
Also, a balanced graph (that is, not many dense nodes with over say 10K relationships) is most effective to traverse.
I would try to model most of the recurring properties as nodes connected to the entities, thus using the graph itself to index these values, instead of having to fall back on filtering by property values or indexing the property with an expensive index lookup.
The first one is much better since you're querying on entities such as Stanford, and that entity is related to many person nodes. My opinion is that modeling as nodes is more intuitive and easier to query. "Find all persons who went to Stanford" would not be very easy to do in your second model, as you don't have a place to start traversing from.
I'd use attributes mainly to describe the node/entity and to filter results from the query, e.g. "Who are the friends of John who went to Stanford in the year 2010?" In this case, the year attribute would just be used to trim the results. It depends on your use case: if year is really important and drives a lot of queries, or is used to represent a timeline, you could even model the year as a node attached to Stanford.
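The "friends of John who went to Stanford" query can be sketched against a toy in-memory graph in Python. The names, the relationship types FRIEND_OF and WENT_TO, and the helper functions are assumptions for illustration; the original diagrams are not available here.

```python
# Hypothetical toy graph: (source, relationship type) -> set of targets.
GRAPH = {
    ("John", "FRIEND_OF"): {"Mary", "Raj", "Lee"},
    ("Mary", "WENT_TO"): {"Stanford"},
    ("Raj", "WENT_TO"): {"MIT"},
    ("Lee", "WENT_TO"): {"Stanford"},
}

def neighbors(node, rel_type):
    """Targets reachable from node over one relationship of the given type."""
    return GRAPH.get((node, rel_type), set())

def friends_who_went_to(person, school):
    """Two hops: person -> friends -> schools, keeping friends linked to the school."""
    return sorted(f for f in neighbors(person, "FRIEND_OF") if school in neighbors(f, "WENT_TO"))

def alumni(school):
    """With the school as its own node, the school itself is a place to start from."""
    return sorted(src for (src, rel), dsts in GRAPH.items() if rel == "WENT_TO" and school in dsts)
```

If the school were only a string attribute on each person, `alumni` would have no node to start from and would have to filter every person by property value, which is the difficulty described above.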

Riak link-walking like a join?

I am looking to store pictures in a NoSQL database (<5MB) and link them to articles in a different bucket. What kind of speed does Riak's link walking feature offer? Is it like an RDBMS join at all?
Links are not at all similar to JOINs (which involve a Cartesian product), but they can be used for similar purposes in some senses. They are very similar to links in an HTML document.
With link-walking you either start with a single key, or you create a map-reduce job that starts with multiple keys. (Link-walking/traversal is actually a special case of map-reduce.) Those values are fetched, their links filtered against your specification (bucket, tag) and then the matched links are passed along to the next phase (or back to the client). Of course, all of this is done in parallel (unlike a JOIN) with high data-locality.
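The phases described above (fetch, filter links against a bucket/tag specification, pass matches to the next phase) can be sketched in plain Python. This is a sequential in-memory stand-in, not the Riak client API; the `LINKS` data, the tag values and the `walk` function are hypothetical, and a boolean stands in for the "keep" field.

```python
# Hypothetical in-memory stand-in for Riak objects: (bucket, key) -> list of links.
# Each link is (target_bucket, target_key, tag).
LINKS = {
    ("artists", "TheBeatles"): [("albums", "AbbeyRoad", "album")],
    ("albums", "AbbeyRoad"): [
        ("tracks", "ComeTogether", "track"),
        ("tracks", "Something", "track"),
    ],
}

def walk(start, specs):
    """Apply (bucket, tag, keep) specs phase by phase; "_" matches anything."""
    current, kept = [start], []
    for bucket, tag, keep in specs:
        matched = []
        for obj in current:
            for t_bucket, t_key, t_tag in LINKS.get(obj, []):
                if bucket in ("_", t_bucket) and tag in ("_", t_tag):
                    matched.append((t_bucket, t_key))
        if keep:
            kept.append(matched)  # results of this phase are returned to the caller
        current = matched         # matched links feed the next phase
    return kept
```

For example, `walk(("artists", "TheBeatles"), [("albums", "_", False), ("tracks", "_", True)])` follows artist to albums to tracks and keeps only the final step. Unlike this loop, Riak runs the phases in parallel close to where the data lives.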
Also, map-reduce isn't slow by itself; you just don't have a sophisticated query planner to do the hard work for you, so you have to think about how you will query and organize your data around that as necessary.
Think one-way relationships and as fast as querying normally. Not as slow as MapReduce.
From:
http://seancribbs.com/tech/2010/02/06/why-riak-should-power-your-next-rails-app/
The first way that Riak deals with this is with link-walking. Every datum stored in Riak can have one-way relationships to other data via the Link HTTP header. In the canonical example, you know the key of a band that you have stored in the “artists” bucket (Riak buckets are like database tables or S3 buckets). If that artist is linked to its albums, which are in turn linked to the tracks on the albums, you can find all of the tracks produced in a single request. As I’ll describe in the next section, this is much less painful than a JOIN in SQL because each item is operated on independently, rather than a table at a time. Here’s what that query would look like:
GET /raw/artists/TheBeatles/albums,_,_/tracks,_,1
“/raw” is the top of the URL namespace, “artists” is the bucket, “TheBeatles” is the source object key. What follows are match specifications for which links to follow, in the form of bucket,tag,keep triples, where underscores match anything. The third parameter, “keep” says to return results from that step, meaning that you can retrieve results from any step you want, in any combination. I don’t know about you, but to me that feels more natural than this:
SELECT tracks.* FROM tracks INNER JOIN albums ON tracks.album_id = albums.id INNER JOIN artists ON albums.artist_id = artists.id WHERE artists.name = "The Beatles"
The caveat of links is that they are inherently unidirectional, but this can be overcome with little difficulty in your application. Without referential integrity constraints in your SQL database (which ActiveRecord has made painful in the past), you have no solid guarantee that your DELETE or UPDATE won’t cause a row to become orphaned, anyway. We’re kind of spoiled because ActiveRecord handles the linkage of associations automatically.
The place where the link-walking feature really shines is in self-referential and deep transitive relationships (think has_many :through writ large). Since you don’t have to create a virtual table via a JOIN and alias different versions of the same table, you can easily do things like social network graphs (friends-of-friends-of-friends), and data structures like trees and lists.
