In Neo4j, is it possible for a relationship to have a relationship?
To illustrate: imagine a domain model that encompasses a collection of geometric planes. Each plane has a collection of lines on it, and each line has a collection of points on it. Each point on a line is connected to the point after it by an outgoing -[NEXT]-> relationship, and to the point preceding it by an incoming one. The way I have it now, each of these NEXT relationships has a property lineID, which identifies the line on which it exists. The node entities representing lines in the database contain only an id, and perhaps a bit of metadata. We return line X by traversing the graph, finding all -[NEXT{lineID:X}]-> relationships, fetching the start and end nodes of each, and returning a list of them along with the line's metadata.
I was a bit more longwinded there than I intended to be, but my question is this: What if, rather than having a lineID property on each [NEXT] relationship, I wanted to create an -[ON]-> relationship between each [NEXT] and the node entity representing the line it is on?
To illustrate: Rather than doing
CREATE (:point)-[:NEXT{lineID:x}]->(:point)-[:NEXT{lineID:x}]-> ...
, what about something like:
CREATE (:point)-[z:NEXT]->(:point), (z)-[:ON]->(:line)
That's some ugly Cypher, but I hope it clarifies my point. Intuitively, it seems like this would make line traversals more efficient (because we'd be playing to Neo4j's strength by asking it to traverse all [ON] relationships from a line node rather than simply searching for a (presumably indexed) property). It would also make it easier to specify nested relationships:
(z)-[:ON]->(:line), (z)-[:ON]->(:plane)
Is this intuition misconceived? If not, would something like this be possible? I don't think it is, but am contemplating a workaround that would involve creating a node entity for each "relationship". Something like this:
(:point)<-[:FROM]-(x:next)-[:TO]->(:point), (x)-[:ON]->(:line)
, which would have the added advantage of facilitating hypergraph structures, which is something else I'm interested in. Leaving that conversation for another day (and another post), would such an approach be more trouble or expense than it's worth for the purposes described here? Might there be any advantages or disadvantages (aside from plain cost) I'm not considering? Or am I reinventing the wheel here: is there an existing solution for this situation that I'm unaware of?
Relationships cannot be linked to other relationships. I think that when you find yourself asking this kind of question, you may have a modelling problem, and the next thing to do is try to model the data differently. For instance, why should the relationship that links two points know which line the points are on? Wouldn't it be more natural for the point to know the line, and therefore to put the lineID property on the points? That way a point can be on several lines, which you can't model properly if the lineID is on the NEXT relationship. Perhaps even better, you can have a Line node that has a CONTAINS relationship to every point on that particular line, instead of using a lineID property.
This is not possible.
Restructure your model so that any data on your relationship that needs linking is a node.
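As a rough illustration of that restructuring, here is a minimal in-memory sketch in Python (not Neo4j code) of the workaround from the question: each NEXT hop becomes a node with FROM, TO and ON relationships. The `Graph` class, the `connect` and `points_on_line` helpers, and all ids are hypothetical names invented for this example.

```python
from collections import defaultdict


class Graph:
    """Tiny in-memory graph: nodes are ids, relationships are (type, other id) pairs."""

    def __init__(self):
        self.labels = {}              # node id -> label
        self.out = defaultdict(list)  # node id -> [(rel_type, target id)]
        self.inc = defaultdict(list)  # node id -> [(rel_type, source id)]

    def add_node(self, node_id, label):
        self.labels[node_id] = label

    def add_rel(self, source, rel_type, target):
        self.out[source].append((rel_type, target))
        self.inc[target].append((rel_type, source))

    def targets(self, source, rel_type):
        return [t for r, t in self.out[source] if r == rel_type]

    def sources(self, target, rel_type):
        return [s for r, s in self.inc[target] if r == rel_type]


def connect(graph, hop_id, start, end, line):
    """Reify one NEXT hop: (start)<-[:FROM]-(hop)-[:TO]->(end), (hop)-[:ON]->(line)."""
    graph.add_node(hop_id, "next")
    graph.add_rel(hop_id, "FROM", start)
    graph.add_rel(hop_id, "TO", end)
    graph.add_rel(hop_id, "ON", line)


def points_on_line(graph, line):
    """Start at the line node and follow incoming ON relationships to the hop nodes."""
    pairs = []
    for hop in graph.sources(line, "ON"):
        (start,) = graph.targets(hop, "FROM")
        (end,) = graph.targets(hop, "TO")
        pairs.append((start, end))
    return pairs


g = Graph()
for p in ("p1", "p2", "p3"):
    g.add_node(p, "point")
g.add_node("lineX", "line")
g.add_node("lineY", "line")
connect(g, "n1", "p1", "p2", "lineX")
connect(g, "n2", "p2", "p3", "lineX")
connect(g, "n3", "p3", "p1", "lineY")

print(points_on_line(g, "lineX"))  # [('p1', 'p2'), ('p2', 'p3')]
```

The trade-off is visible in `connect`: every hop now costs one extra node and three relationships instead of a single relationship with a property.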
I am in the process of creating a graph database, a simple one for movies with several types of information like the actors, producers, directors and so on.
What I would like to know is, is it better to break down your nodes to a more granular level? For example, is it better to have two kinds of nodes for 'actors' and 'directors' or is it better to have one node, say 'person' and use different kinds of relationships like 'acted_in' and 'directed'? Does this even matter at all?
Further, is there any impact on the traversal queries? Does having more types of nodes mean that the traversal is slower?
Note: I intend to implement this using the Gremlin console in Amazon Neptune.
The answer really is: it depends. If I were building such a model I would break out the key "nouns" into their own nodes. I would also label the edges appropriately, such as ACTED_IN or DIRECTED.
The performance of any graph query depends on how much data it will need to touch (the fan out factor as you go from depth to depth).
The best advice I can give you is think about the questions you will need the graph to answer and try to design your data model so that writing those queries is as easy as possible. Don't be afraid to iterate multiple times on your data model also. That is common and expected.
Properties can be useful when you want to add a unique piece of information to a node - perhaps the birthday of the director.
Edge properties can be useful for filtering out unneeded edges, but so can edge labels. In some cases you may find that a label such as DIRECTED-IN-2005 is a useful shortcut that avoids checking both a label and a property on an edge.
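To make the "one kind of person vertex, several edge labels" idea concrete, here is a small Python sketch (plain data structures, not Gremlin). The people, movies and helper names are invented for illustration. Edges are grouped by vertex and label, so a query only touches the edges carrying the label it asks for.

```python
from collections import defaultdict

# (source vertex, edge label, target vertex); all names are made up
edges = [
    ("alice", "ACTED_IN", "movie_a"),
    ("alice", "DIRECTED", "movie_b"),
    ("bob", "ACTED_IN", "movie_a"),
    ("bob", "ACTED_IN", "movie_b"),
]

# Group edges by (vertex, label) in both directions.
out_by_label = defaultdict(list)
in_by_label = defaultdict(list)
for src, label, dst in edges:
    out_by_label[(src, label)].append(dst)
    in_by_label[(dst, label)].append(src)


def movies(person, label):
    """Movies reachable from a person over edges with the given label."""
    return out_by_label[(person, label)]


def co_actors(person):
    """Two hops: person -ACTED_IN-> movie <-ACTED_IN- other person."""
    found = set()
    for movie in movies(person, "ACTED_IN"):
        for other in in_by_label[(movie, "ACTED_IN")]:
            if other != person:
                found.add(other)
    return sorted(found)


print(movies("alice", "DIRECTED"))  # ['movie_b']
print(co_actors("alice"))           # ['bob']
```

The work done by `co_actors` grows with the number of ACTED_IN edges it fans out over, which is the "how much data it touches" point made above.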
Please may you help me to write a query that returns each source vertex in my traversal along with its associated edges and vertices as arrays on each such source vertex? In short, I need a result set comprising an array of 3-tuples with item 1 of each tuple being the source vertex and items 2 and 3 being the associated arrays.
Thanks!
EDIT 1: Expanded on the graph data and added my current problem query.
EDIT 2: Improved Gremlin sample graph code (apologies, didn't think anyone would actually run it.)
Sample Graph
g.addV("blueprint").property("name","Mall").
addV("blueprint").property("name","HousingComplex").
addV("blueprint").property("name","Airfield").
addV("architect").property("name","Tom").
addV("architect").property("name","Jerry").
addV("architect").property("name","Sylvester").
addV("buildingCategory").property("name","Civil").
addV("buildingCategory").property("name","Commercial").
addV("buildingCategory").property("name","Industrial").
addV("buildingCategory").property("name","Military").
addV("buildingCategory").property("name","Residential").
V().has("name","Tom").addE("designed").to(V().has("name","HousingComplex")).
V().has("name","Tom").addE("assisted").to(V().has("name","Mall")).
V().has("name","Jerry").addE("designed").to(V().has("name","Airfield")).
V().has("name","Jerry").addE("assisted").to(V().has("name","HousingComplex")).
V().has("name","Sylvester").addE("designed").to(V().has("name","Mall")).
V().has("name","Sylvester").addE("assisted").to(V().has("name","Airfield")).
V().has("name","Sylvester").addE("assisted").to(V().has("name","HousingComplex")).
V().has("name","Mall").addE("classification").to(V().has("name","Commercial")).
V().has("name","HousingComplex").addE("classification").to(V().has("name","Residential")).
V().has("name","Airfield").addE("classification").to(V().has("name","Civil"))
Please note that the above is a very simplified rendering of our data.
Needed Query Results
I need to bring back each blueprint vertex as a base with each of its associated edges / vertices as arrays.
My Current Solution
Currently I do this very cumbersome query that gets the blueprints and assigns a label, gets the architects and assigns a label, then selects both labels. The solution is ok; however, it gets messy when I need to include edges or I need to get blueprint classification vertices (industrial, military, residential, commercial, etc.). In effect, the more associated data that I need to pull back for each blueprint, the sloppier my solution becomes.
My current query looks something like this:
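// In the sample graph the "designed"/"assisted" edges point from architect to blueprint,
// so they are followed inwards here. "none" is a placeholder: constant() needs an argument.
g.V().hasLabel("blueprint").as("blueprints").
inE().or(hasLabel("designed"),hasLabel("assisted")).outV().as("architects").
select("blueprints").coalesce(out("classification"),constant("none")).as("classifications").
select("blueprints","architects","classifications")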
The above produces a lot of duplication. If the number of blueprints is b, of architects is a, and of classifications is c, the result set comprises b * a * c results. I'd like one blueprint with an array of its associated architects and an array of its associated classifications, if any.
Complications
I'm trying to do this in one query so that I can get all blueprint data from the graph to populate a filtered list. Once I have the list comprising all of the vertices, edges, and their properties, users can then click links to blobs, browse to project sites, etc. Accordingly, I've got pagination as well as filtering to think about and I'd prefer to make one trip to the server each time I get a new page or the filters change.
I figured out an answer; however, it quadruples the compute charge for the query. Not sure if this can be optimized further.
// "designed"/"assisted" edges point from architect to blueprint in the sample graph
g.V().hasLabel("blueprint").
project("blueprints","architects").
by().
by(inE().or(hasLabel("designed"),hasLabel("assisted")).outV().dedup().fold())
I just solved for blueprints and architects, but classifications just needs another by(...traversal...) and projection label.
I may have to just get the blueprints in one query, get each of their associated items in parallel queries, then put it all together in the API. That would be very bad design for the API data layer but may be necessary for performance reasons.
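To see why the two result shapes differ in size, here is a small Python sketch (plain data, not Gremlin) using the sample graph's names. The `flattened` and `projected` helpers are hypothetical stand-ins for the `as()`/`select()` approach and the `project()`/`fold()` approach respectively.

```python
from itertools import product

# Sample data from the question, keyed by blueprint.
architects = {
    "Mall": ["Sylvester", "Tom"],
    "HousingComplex": ["Tom", "Jerry", "Sylvester"],
    "Airfield": ["Jerry", "Sylvester"],
}
classifications = {
    "Mall": ["Commercial"],
    "HousingComplex": ["Residential"],
    "Airfield": ["Civil"],
}


def flattened(blueprints):
    """One row per (blueprint, architect, classification) combination."""
    rows = []
    for b in blueprints:
        for a, c in product(architects[b], classifications[b]):
            rows.append((b, a, c))
    return rows


def projected(blueprints):
    """One row per blueprint, with its neighbours folded into lists."""
    return [
        {
            "blueprint": b,
            "architects": sorted(set(architects[b])),
            "classifications": classifications.get(b, []),
        }
        for b in blueprints
    ]


names = ["Mall", "HousingComplex", "Airfield"]
print(len(flattened(names)))  # 7
print(len(projected(names)))  # 3
```

The projected form stays at one row per blueprint however many kinds of associated data are added; each extra kind is just another key in the row.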
I have an EdgeLabel ContainsAttribute which has Multiplicity.SIMPLE.
These edges also have a property, let's call it x, that I want to build the vertex-centric index on.
PropertyKey propertyX = mgmt.getPropertyKey("x");
EdgeLabel containsAttributeLabel = mgmt.makeEdgeLabel(EdgeLabels.ContainsAttribute).multiplicity(Multiplicity.SIMPLE).make();
mgmt.buildEdgeIndex(containsAttributeLabel,"propXIndex",Direction.IN, propertyX);
So the edges represent Entity --containsAttribute --> Attribute. The query I am trying to make will try to search Entities given queries by filtering on the Property x.
I wonder why it doesn't allow this, saying:
The relation type [ContainsAttribute] has a multiplicity or cardinality constraint in direction [IN] and can therefore not be indexed
I think my use case makes sense and I wouldn't want to relax my edge label multiplicity from SIMPLE to MANY2ONE,ONE2MANY or MULTI, to make it work.
Edit: According to the example at http://s3.thinkaurelius.com/docs/titan/1.0.0/indexes.html, Hercules battled a lot of monsters, so the edges labeled 'battled' are found coming out of 'Hercules' multiple times, connecting to different monsters. The edge index is then on the attribute 'time', so filtering can be done. I want to do something similar, and I thought vertex-centric indices were the way. Those edges are Multiplicity.SIMPLE because there is at most one edge labeled 'battled' between Hercules and each of the monsters.
Edit 2:
Similar to the given example, consider again a SIMPLE graph.
I believe it would make sense to have a vertex-centric index for Hercules and the outgoing SIMPLE 'battled' edges. That would make queries like time >= 20 faster when traversing from Hercules to the monsters.
I don't understand why we must have a MULTI graph (less strict) to leverage vertex-centric indices.
Any help would be appreciated!
Thanks!
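For readers unfamiliar with the idea, here is a conceptual Python sketch of what a vertex-centric index buys for a query like time >= 20. It keeps one vertex's edges sorted by the indexed property and binary-searches them. It is not a description of Titan's storage layout and does not explain the multiplicity error above. The `Vertex` class, the monster names and the time values are invented for illustration.

```python
import bisect


class Vertex:
    """Keeps the outgoing edges of one label sorted by an edge property."""

    def __init__(self, name):
        self.name = name
        self._keys = []     # sorted values of the indexed edge property
        self._targets = []  # target vertex names, aligned with _keys

    def add_edge(self, target, time):
        i = bisect.bisect_right(self._keys, time)
        self._keys.insert(i, time)
        self._targets.insert(i, target)

    def out_scan(self, min_time):
        """No index: inspect every incident edge."""
        return [t for k, t in zip(self._keys, self._targets) if k >= min_time]

    def out_indexed(self, min_time):
        """With the sort key: binary-search to the first match, then read a slice."""
        i = bisect.bisect_left(self._keys, min_time)
        return self._targets[i:]


hercules = Vertex("hercules")
# At most one edge per (hercules, monster) pair, as in a SIMPLE graph.
for monster, time in [("monster_a", 1), ("monster_b", 2),
                      ("monster_c", 12), ("monster_d", 25)]:
    hercules.add_edge(monster, time)

print(hercules.out_indexed(20))  # ['monster_d']
```

Both methods return the same edges; the indexed one simply avoids looking at edges whose time is below the threshold, which matters for vertices with many incident edges.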
I have a node type with a string property that will very often have the same value: for example, millions of nodes with only 5 possible values for that string. I will be searching by that property.
My question would be what is better in terms of performance and memory:
a) Implement it as a node property and have lots of duplicates (and search using WHERE).
b) Implement it as 5 additional nodes, where all original nodes reference one of them (and search using additional MATCH).
Without knowing further details it's hard to give a general purpose answer.
From a performance perspective it's better to limit the search as early as possible. Even more beneficial if you do not have to look into properties for a traversal.
Given that, I assume it's better to move the lookup property into a separate node and use the value as the relationship type.
Use labels; this blog post is a good intro to this new Neo4j 2.0 feature:
Labels and Schema Indexes in Neo4j
I've thought about this problem a little as well. In my case, I had to represent state:
STARTED
IN_PROGRESS
SUBMITTED
COMPLETED
Overall, the Node + Relationship approach looks more appealing: only a single relationship reference needs to be maintained for each node, rather than a property string, and you don't need to scan an additional index that has to be maintained on the property. Intuitively, both memory and performance favor this approach.
Another advantage is that it easily supports the ability of a node being linked to multiple "special nodes". If you foresee a situation where this should be possible in your model, this is better than having to use a property array (and searching using "in").
In practice I found that the problem then became how to access these special nodes each time. Either you maintain some sort of constants reference holding the node IDs of these special nodes, so that you can jump right to them in your START statement (this is what we do), or you need to search against a property of the special node each time (name, perhaps) and then traverse down its relationships. This doesn't make for the prettiest of Cypher queries.
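The two options can be compared with a small Python sketch (in-memory data, not Cypher). The state names come from the answer above; the node ids, the 12-node data set and the HAS_STATE relationship name are assumptions made for this example.

```python
from collections import defaultdict

STATES = ["STARTED", "IN_PROGRESS", "SUBMITTED", "COMPLETED"]

# Option (a): the state is a property stored on every node.
nodes = {i: {"state": STATES[i % len(STATES)]} for i in range(12)}


def find_by_property(state):
    """Without an index this checks the property of every node."""
    return [node_id for node_id, props in nodes.items() if props["state"] == state]


# Option (b): one shared node per state; each node holds a single
# (hypothetical) HAS_STATE relationship pointing at it.
state_members = defaultdict(list)  # state node -> ids of nodes pointing at it
for node_id, props in nodes.items():
    state_members[props["state"]].append(node_id)


def find_by_state_node(state):
    """Jump to the state node and follow its incoming relationships."""
    return state_members[state]


print(find_by_property("SUBMITTED"))    # [2, 6, 10]
print(find_by_state_node("SUBMITTED"))  # [2, 6, 10]
```

Note that with millions of nodes and only a handful of values, each state node in option (b) becomes very dense, which is worth weighing against the cost of the property scan or index in option (a).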
When using graph databases (in my case Neo4j), we can represent the same information in many ways: making each entity a node and connecting all entities through relationships, or just adding the entities to the attribute list of a node.
Following are two different representations of the same data.
Overall, which mechanism is suitable in which conditions?
My use case involves traversing the Database from different nodes until 4 depths and examining the information through connected nodes or attributes (based on which approach it is).
One query of interest may be, "Who are the friends of John who went to Stanford?"
What is the difference in terms of storage and computation?
Normally, properties are loaded lazily and are more expensive to hold in cache, especially strings. Nodes and relationships are most effective for traversal, especially since the relationship types are stored together with the relationship records and thus don't trigger property loads when used in traversals.
Also, a balanced graph (that is, not many dense nodes with over say 10K relationships) is most effective to traverse.
I would try to model most of the recurring properties as nodes connected to the entities, thus using the graph itself to index these values, instead of having to fall back on filtering by property values or on an expensive index lookup for the property.
The first one is much better since you're querying on entities such as Stanford, and that entity is related to many person nodes. My opinion is that modeling these as nodes is more intuitive and easier to query. "Find all persons who went to Stanford" would not be very easy to do in your second model, as you don't have a place to start traversing from.
I'd use attributes mainly to describe the node/entity and to filter results from the query, e.g. "Who are the friends of John who went to Stanford in the year 2010?" In this case, the year attribute would just be used to trim the results. It depends on your use case: if the year is really important and drives a lot of queries, or is used to represent a timeline, you could even model the year as a node attached to Stanford.
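The difference between the two representations can be sketched in Python (plain dictionaries, not Cypher). The people other than John, their universities and the helper names are invented for illustration.

```python
# Model 1: universities are nodes connected to people.
friends = {"John": ["Mary", "Raj", "Lena"]}
attended = {"Mary": ["Stanford"], "Raj": ["MIT"], "Lena": ["Stanford", "MIT"]}

# Reverse adjacency gives an entry point at the university node.
alumni = {}
for person, schools in attended.items():
    for school in schools:
        alumni.setdefault(school, set()).add(person)


def friends_at_node(person, school):
    """Intersect the person's friends with the school node's neighbours."""
    return sorted(set(friends[person]) & alumni.get(school, set()))


# Model 2: the university is an attribute on each person.
people = {
    "Mary": {"university": ["Stanford"]},
    "Raj": {"university": ["MIT"]},
    "Lena": {"university": ["Stanford", "MIT"]},
}


def friends_by_attribute(person, school):
    """Visit each friend and filter on the attribute value."""
    return sorted(f for f in friends[person] if school in people[f]["university"])


def all_alumni_by_attribute(school):
    """No school node to start from, so every person must be checked."""
    return sorted(p for p, props in people.items() if school in props["university"])


print(friends_at_node("John", "Stanford"))       # ['Lena', 'Mary']
print(friends_by_attribute("John", "Stanford"))  # ['Lena', 'Mary']
```

For the friends-of-John query both models do similar work. The gap shows up in `all_alumni_by_attribute`, which has to look at every person because the attribute model has no Stanford node to start from.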