I am designing a graph and see examples where several vertices share a label, such as 'user'. When a vertex has a known unique value, one can assign that value as the vertex's id and look it up as:
g.V('person').has('id','unique-value'). ...
Or assign that unique value as a label, and reference it that way.
g.V('unique-value'). ...
Is there a particular reason not to use unique values (an id, essentially) as a label, such as performance? What is the best strategy for this?
Your question and your Gremlin examples don't quite align. I think that you mean to compare:
g.V().hasLabel('person').has(T.id,'unique-value')
and
g.V('unique-value')
Note my corrections in that first Gremlin statement. V() does not take a vertex label as an argument - it can only take a vertex id or a Vertex object. Also, the actual vertex identifier must be referenced by T.id and not 'id', the latter being a reference to a user-defined property named "id". T.id is what you get returned from g.V().id(). We often refer to T.id as just id and I will do so going forward.
With that straightened out, there is no need for hasLabel('person') if you have the id handy, so the two examples above return the same result. I would expect most graph databases to optimize away the label filter and use just the id for the lookup, so I wouldn't expect a difference in performance. For readability, though, I'd stick to V('unique-value').
Your question specifically asked about using a unique label as a way to identify a vertex, so I will also address that. A label is not meant for unique identification of a graph element. It is meant to categorize groups of elements. Aside from that convention, I think there are a number of technical reasons not to do that. Some graphs have limits on the number of labels you can have so that could be a problem depending on your graph provider. At the very least, you reduce the portability of your code by doing that. I think it would impact performance as label lookups are not going to be as fast as id lookups (especially as you scale the graph up in size).
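To make the id-versus-label distinction concrete, here is a toy in-memory sketch in plain Python. It is not Gremlin and not how any particular graph provider is implemented; the ToyGraph class and its method names are hypothetical. The only assumption is that an id maps to exactly one vertex while a label groups many vertices.

```python
from collections import defaultdict


class ToyGraph:
    """Hypothetical in-memory stand-in for a graph store (illustration only)."""

    def __init__(self):
        self.by_id = {}                    # id -> vertex, exactly one entry per vertex
        self.by_label = defaultdict(list)  # label -> all vertices in that category

    def add_vertex(self, vid, label, **props):
        if vid in self.by_id:
            raise ValueError(f"duplicate id: {vid}")
        vertex = {"id": vid, "label": label, **props}
        self.by_id[vid] = vertex
        self.by_label[label].append(vertex)
        return vertex

    def vertex(self, vid):
        # Roughly analogous to g.V('unique-value'): a direct key lookup.
        return self.by_id.get(vid)

    def has_label(self, label):
        # Roughly analogous to g.V().hasLabel('person'): returns a whole category.
        return list(self.by_label.get(label, []))


graph = ToyGraph()
graph.add_vertex("u-1", "person", name="alice")
graph.add_vertex("u-2", "person", name="bob")

found = graph.vertex("u-2")         # one vertex, found by key
people = graph.has_label("person")  # every vertex in the category
```

If every vertex had its own label, by_label would hold one single-element bucket per vertex and would simply duplicate by_id, which is one way to see why labels are better kept for categories.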
I am in the process of creating a graph database, a simple one for movies with several types of information like the actors, producers, directors and so on.
What I would like to know is, is it better to break down your nodes to a more granular level? For example, is it better to have two kinds of nodes for 'actors' and 'directors' or is it better to have one node, say 'person' and use different kinds of relationships like 'acted_in' and 'directed'? Does this even matter at all?
Further, is there any impact on the traversal queries? Does having more types of nodes mean that the traversal is slower?
Note: I intend to implement this using the Gremlin console in Amazon Neptune.
The answer really is it depends. If I were building such a model I would break out the key "nouns" into their own nodes. I would also label the edges appropriately such as ACTED_IN or DIRECTED.
The performance of any graph query depends on how much data it will need to touch (the fan out factor as you go from depth to depth).
The best advice I can give you is think about the questions you will need the graph to answer and try to design your data model so that writing those queries is as easy as possible. Don't be afraid to iterate multiple times on your data model also. That is common and expected.
Properties can be useful when you want to add a unique piece of information to a node - perhaps the birthday of the director.
Edge properties can be useful for filtering out unneeded edges, but so can edge labels. In some cases you may find that a label such as DIRECTED-IN-2005 is a useful shortcut that avoids checking both a label and a property on an edge.
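A small Python sketch of the two filtering styles, using made-up vertex names and a hypothetical out() helper rather than real Gremlin. It assumes edges are plain tuples of (source, label, target, properties).

```python
# Edges as (source, label, target, properties). All names and years are made up.
edges = [
    ("person-1", "DIRECTED", "movie-a", {"year": 2005}),
    ("person-1", "DIRECTED", "movie-b", {"year": 2010}),
    ("person-1", "ACTED_IN", "movie-a", {"year": 2005}),
    ("person-2", "ACTED_IN", "movie-b", {"year": 2010}),
]


def out(source, label, **where):
    """Targets reached from `source` over `label` edges whose properties match `where`."""
    return [
        target
        for src, lbl, target, props in edges
        if src == source
        and lbl == label
        and all(props.get(key) == value for key, value in where.items())
    ]


# Style 1: check the edge label and an edge property.
directed_2005 = out("person-1", "DIRECTED", year=2005)

# Style 2: fold the year into the label so only the label is checked.
edges.append(("person-1", "DIRECTED-IN-2005", "movie-a", {}))
shortcut = out("person-1", "DIRECTED-IN-2005")
```

Both calls return the same targets here. The trade-off is that the second style multiplies the number of edge labels, which matters on providers that limit or index labels differently.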
Please may you help me to write a query that returns each source vertex in my traversal along with its associated edges and vertices as arrays on each such source vertex? In short, I need a result set comprising an array of 3-tuples with item 1 of each tuple being the source vertex and items 2 and 3 being the associated arrays.
Thanks!
EDIT 1: Expanded on the graph data and added my current problem query.
EDIT 2: Improved Gremlin sample graph code (apologies, didn't think anyone would actually run it.)
Sample Graph
g.addV("blueprint").property("name","Mall").
addV("blueprint").property("name","HousingComplex").
addV("blueprint").property("name","Airfield").
addV("architect").property("name","Tom").
addV("architect").property("name","Jerry").
addV("architect").property("name","Sylvester").
addV("buildingCategory").property("name","Civil").
addV("buildingCategory").property("name","Commercial").
addV("buildingCategory").property("name","Industrial").
addV("buildingCategory").property("name","Military").
addV("buildingCategory").property("name","Residential").
V().has("name","Tom").addE("designed").to(V().has("name","HousingComplex")).
V().has("name","Tom").addE("assisted").to(V().has("name","Mall")).
V().has("name","Jerry").addE("designed").to(V().has("name","Airfield")).
V().has("name","Jerry").addE("assisted").to(V().has("name","HousingComplex")).
V().has("name","Sylvester").addE("designed").to(V().has("name","Mall")).
V().has("name","Sylvester").addE("assisted").to(V().has("name","Airfield")).
V().has("name","Sylvester").addE("assisted").to(V().has("name","HousingComplex")).
V().has("name","Mall").addE("classification").to(V().has("name","Commercial")).
V().has("name","HousingComplex").addE("classification").to(V().has("name","Residential")).
V().has("name","Airfield").addE("classification").to(V().has("name","Civil"))
Please note that the above is a very simplified rendering of our data.
Needed Query Results
I need to bring back each blueprint vertex as a base with each of its associated edges / vertices as arrays.
My Current Solution
Currently I do this very cumbersome query that gets the blueprints and assigns a label, gets the architects and assigns a label, then selects both labels. The solution is ok; however, it gets messy when I need to include edges or I need to get blueprint classification vertices (industrial, military, residential, commercial, etc.). In effect, the more associated data that I need to pull back for each blueprint, the sloppier my solution becomes.
My current query looks something like this:
g.V().hasLabel("blueprint").as("blueprints").
outE().or(hasLabel("designed"),hasLabel("assisted")).inV().as("architects").
select("blueprints").coalesce(out("classification"),constant()).as("classifications").
select("blueprints","architects","classifications")
The above produces a lot of duplication. If the number of: blueprints is b, architects is a, and classifications is c, the result set comprises b * a * c results. I'd like one blueprint with an array of its associated architects and an array of its associated classifications, if any.
Complications
I'm trying to do this in one query so that I can get all blueprint data from the graph to populate a filtered list. Once I have the list comprising all of the vertices, edges, and their properties, users can then click links to blobs, browse to project sites, etc. Accordingly, I've got pagination as well as filtering to think about and I'd prefer to make one trip to the server each time I get a new page or the filters change.
I figured out an answer; however, it quadruples the compute charge for the query. Not sure if this can be optimized further.
g.V().hasLabel("blueprint").
project("blueprints","architects").
by().
by(outE().or(hasLabel("designed"),hasLabel("assisted")).inV().dedup().fold())
I just solved for blueprints and architects, but classifications just needs another by(...traversal...) and projection label.
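For reference, the result shape that project() with folded by() traversals is aiming for can be sketched in plain Python. This is only an illustration of the target structure, based on the sample graph's edges (architects point at blueprints, blueprints point at categories); the project_blueprints function is hypothetical and says nothing about query cost on a real server.

```python
from collections import defaultdict

# (source name, edge label, target name), following the sample graph.
edges = [
    ("Tom", "designed", "HousingComplex"),
    ("Tom", "assisted", "Mall"),
    ("Jerry", "designed", "Airfield"),
    ("Jerry", "assisted", "HousingComplex"),
    ("Sylvester", "designed", "Mall"),
    ("Sylvester", "assisted", "Airfield"),
    ("Sylvester", "assisted", "HousingComplex"),
    ("Mall", "classification", "Commercial"),
    ("HousingComplex", "classification", "Residential"),
    ("Airfield", "classification", "Civil"),
]
blueprints = ["Mall", "HousingComplex", "Airfield"]


def project_blueprints(blueprints, edges):
    """Return one record per blueprint with its related vertices folded into lists."""
    architects = defaultdict(list)
    classifications = defaultdict(list)
    for source, label, target in edges:
        if label in ("designed", "assisted"):
            if source not in architects[target]:  # plays the role of dedup()
                architects[target].append(source)
        elif label == "classification":
            classifications[source].append(target)
    return [
        {
            "blueprint": name,
            "architects": architects[name],
            "classifications": classifications[name],
        }
        for name in blueprints
    ]


rows = project_blueprints(blueprints, edges)
```

The result has one row per blueprint (b rows) instead of the b * a * c rows produced by selecting three step labels.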
I may have to just get the blueprints in one query, get each of their associated items in parallel queries, then put it all together in the API. That would be very bad design for the API data layer but may be necessary for performance reasons.
I'm trying to produce a Gremlin query that finds vertexes which have edges from specific other vertexes. The less abstract version is that I have user vertexes, and those are related to group vertexes (i.e. subjects in a school, so students who are in "Year 6 Maths" and "Year 6 English" etc.). An extra difficulty is that subgroups can exist in this query.
The query needs to find those users who are in 2 or more groups specified by the user.
Currently I have a brief solution, but in production usage on Amazon Neptune this query performs far too poorly, even with a small amount of data. I'm sure there's a simpler way of achieving this :/
g.V()
.has('id', 'group_1')
.repeat(out("STUDENT", "SUBGROUP"))
.until(hasLabel("USER"))
.aggregate("q-1")
.V()
.has('id', 'group_2')
.repeat(out("STUDENT", "SUBGROUP"))
.until(hasLabel("USER"))
.where(within("q-1"))
.aggregate("q-2")
.V()
.hasLabel("USER")
.where(within("q-2"))
# We add some more filtering here, such as search terms
.dedup()
.range(0, 10)
.values("id")
.toList()
The first major change you can make is to not bother iterating all of V() again for "USER". Those users are already the output of the prior steps, so collecting "q-2" just to use it as a filter doesn't seem necessary:
g.V().
has('id', 'group_1').
repeat(out("STUDENT", "SUBGROUP")).
until(hasLabel("USER")).
aggregate("q-1").
V().
has('id', 'group_2').
repeat(out("STUDENT", "SUBGROUP")).
until(hasLabel("USER")).
where(within("q-1")).
# We add some more filtering here, such as search terms
dedup().
range(0, 10).
values("id")
That should already be a huge savings to your query because that change avoids iterating the entire graph in memory (i.e. full scan of all vertices) as there was no index lookup there.
I don't know what your additional filters are here:
# We add some more filtering here, such as search terms
but I would definitely try to filter the users earlier in your query rather than later. Perhaps consider using emit() on your repeat() steps to filter better. You should probably also dedup() your "q-1" to reduce the size of the list there.
I'd be curious to know how much the initial change I suggested helps on its own, as that was probably the bulk of your query cost (unless you have a really deep/wide tree of students/subgroups, I guess). Perhaps there is more that could be tweaked here, but it would be nice to know that you at least have a traversal with satisfying performance at this point.
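The logic of the trimmed traversal (collect the users under the first group, then keep only users under the second group that are also in that set) can be sketched in plain Python. The graph data, the GROUP label and the helper names below are hypothetical; only the STUDENT/SUBGROUP walk that stops at USER vertices comes from the query above.

```python
from collections import deque

# Hypothetical toy data: outgoing STUDENT / SUBGROUP edges and vertex labels.
out_edges = {
    "group_1": ["subgroup_1a", "user_a"],
    "subgroup_1a": ["user_b", "user_c"],
    "group_2": ["user_b", "user_c", "user_d"],
}
labels = {
    "group_1": "GROUP", "group_2": "GROUP", "subgroup_1a": "GROUP",
    "user_a": "USER", "user_b": "USER", "user_c": "USER", "user_d": "USER",
}


def users_under(group):
    """Walk outgoing edges until USER vertices are reached, deduplicating as we go."""
    seen, users, queue = {group}, set(), deque([group])
    while queue:
        current = queue.popleft()
        for nxt in out_edges.get(current, []):
            if nxt in seen:
                continue
            seen.add(nxt)
            if labels[nxt] == "USER":
                users.add(nxt)     # stop here, like until(hasLabel("USER"))
            else:
                queue.append(nxt)  # keep descending through subgroups
    return users


def users_in_all(groups, limit=10):
    """Users present under every listed group (expects at least one group)."""
    common = set.intersection(*(users_under(group) for group in groups))
    return sorted(common)[:limit]  # sorted only to make the page deterministic


result = users_in_all(["group_1", "group_2"])
```

Note that nothing here scans all vertices: every step starts from a known group, which is the point of dropping the final V().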
I have an EdgeLabel, ContainsAttribute, which has Multiplicity.SIMPLE.
These edges also have a property let's call it x that I want to make the vertex-centric index on.
PropertyKey propertyX = mgmt.getPropertyKey("x");
EdgeLabel containsAttributeLabel = mgmt.makeEdgeLabel(EdgeLabels.ContainsAttribute).multiplicity(Multiplicity.SIMPLE).make();
mgmt.buildEdgeIndex(containsAttributeLabel,"propXIndex",Direction.IN, propertyX);
So the edges represent Entity --containsAttribute --> Attribute. The query I am trying to make will try to search Entities given queries by filtering on the Property x.
I wonder why it doesn't allow this, saying:
The relation type [ContainsAttribute] has a multiplicity or cardinality constraint in direction [IN] and can therefore not be indexed
I think my use case makes sense and I wouldn't want to relax my edge label multiplicity from SIMPLE to MANY2ONE,ONE2MANY or MULTI, to make it work.
Edit: According to the example http://s3.thinkaurelius.com/docs/titan/1.0.0/indexes.html Hercules battled a lot of monsters so the edges labeled 'battled' are found coming out of 'Hercules' multiple times connecting with different monsters. Then the edge index is on attribute 'time' so filtering can be done. I want to do something similar and I thought vertex-centric indices are the way.. Those edges are Multiplicity.SIMPLE because there is at most one edge labeled 'battled' between Hercules and each of the monsters.
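The idea behind a vertex-centric index on 'time' can be sketched in plain Python: the edges of one vertex are kept sorted by the indexed property, so a range filter becomes a binary search instead of a scan over every edge. This is a conceptual illustration only, not Titan's storage layout, and the monster names and times are made up.

```python
from bisect import bisect_left

# Hypothetical 'battled' edges leaving one vertex, as (time, monster) pairs.
battled = [(12, "monster-c"), (1, "monster-a"), (20, "monster-d"), (2, "monster-b")]

# The per-vertex index: the same edges kept sorted by the indexed property.
index = sorted(battled)
times = [time for time, _ in index]


def battled_since(min_time):
    """Monsters battled at time >= min_time, located by binary search."""
    start = bisect_left(times, min_time)
    return [monster for _, monster in index[start:]]


recent = battled_since(12)
```

Each monster appears at most once in this list, which matches the SIMPLE case described above: one edge per pair of vertices, but many edges per source vertex.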
Edit 2:
Similar to the given example again a SIMPLE graph:
I believe it would make sense to have a vertex-centric index for Hercules and the outgoing SIMPLE 'battled' edges. That would make queries like time >= 20 faster when traversing from Hercules to the monsters.
I don't understand why we must have a MULTI graph (less strict) like below to leverage vertex-centric indices..
Any help would be appreciated!
Thanks!
In Neo4j, is it possible for a relationship to have a relationship?
To illustrate: Imagine a domain model that encompasses a collection of geometric planes. Each plane has a collection of lines on it, and each line has a collection of points on it. Each point on a line is connected to the point after it by an outgoing -[NEXT]-> relationship, and to the point preceding it by an incoming one. The way I have it now, each of these NEXT relationships contains a property lineID, which identifies the line on which it exists: The node entities representing lines in the database contain only an id, and perhaps a bit of metadata, and we return line X by traversing the graph, finding all -[NEXT{lineID:X}]-> relationships, fetching the start and end nodes of each and returning a list of them along with the line's metadata.
I was a bit more longwinded there than I intended to be, but my question is this: What if, rather than having a lineID property on each [NEXT] relationship, I wanted to create an -[ON]-> relationship between each [NEXT] and the node entity representing the line it is on?
To illustrate: Rather than doing
CREATE (:point)-[:NEXT{lineID:x}]->(:point)-[:NEXT{lineID:x}]-> ...
, what about something like:
CREATE (:point)-[z:NEXT]->(:point), (z)-[:ON]->(:line)
That's some ugly Cypher, but I hope it clarifies my point. Intuitively, it seems like this would make line traversals more efficient (because we'd be playing to Neo4j's strength by asking it to traverse all [ON] relationships from a line node rather than simply searching for a (presumably indexed) property). It would also make it easier to specify nested relationships:
(z)-[:ON]->(:line), (z)-[:ON]->(:plane)
Is this intuition misconceived? If not, would something like this be possible? I don't think it is, but am contemplating a workaround that would involve creating a node entity for each "relationship". Something like this:
(:point)<-[:FROM]-(x:next)-[:TO]->(:point), (x)-[:ON]->(:line)
, which would have the added advantage of facilitating hypergraph structures, which is something else I'm interested in. Leaving that conversation for another day (and another post), would such an approach be more trouble/expense than it's worth for the purposes elucidated here? Might there be any dis/advantages (aside from plain cost) I'm not considering? Or am I reinventing the wheel here - is there an extant solution in this situation that I'm unaware of?
You cannot link a relationship to another relationship. When you find yourself asking this kind of question, you may have a modelling problem, and the next step is to try modelling the data differently. For instance, why should the relationship that links two points know which line the points are on? Wouldn't it be more natural for the point to know the line, with the lineID property on the points? That way a point can be on several lines, which you can't model properly if the lineID is on the NEXT relationship. Perhaps even better, you can have a Line node with a CONTAINS relationship to every point on that line instead of using a lineID property.
This is not possible.
Restructure your model so that any data on your relationship which needs linking becomes a node.
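A minimal Python sketch of that restructuring, with the NEXT relationship turned into a node that carries FROM, TO and ON links. The dictionaries stand in for nodes, all ids are hypothetical, and walk() assumes an open, non-branching line.

```python
# Hypothetical toy model: each former NEXT relationship is now a node of its own.
next_nodes = [
    {"id": "n1", "from": "p1", "to": "p2", "on": ["line-x", "plane-1"]},
    {"id": "n2", "from": "p2", "to": "p3", "on": ["line-x", "plane-1"]},
    {"id": "n3", "from": "p9", "to": "p8", "on": ["line-y", "plane-1"]},
]


def segments_on(container):
    """(from, to) pairs of every NEXT node linked to the given line or plane."""
    return [(n["from"], n["to"]) for n in next_nodes if container in n["on"]]


def walk(line):
    """Chain the segments of one open, non-branching line into ordered points."""
    segments = dict(segments_on(line))
    if not segments:
        return []
    # The start point is the only source that is never a segment's target.
    start = (set(segments) - set(segments.values())).pop()
    path = [start]
    while path[-1] in segments:
        path.append(segments[path[-1]])
    return path


line_x = walk("line-x")
on_plane = segments_on("plane-1")
```

Because the NEXT node is an ordinary node, it can be linked to a line and a plane at the same time, which is what the nested (z)-[:ON]-> patterns in the question were after.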