Does Gremlin support unique properties other than ID property?
Trying to figure out if there's an equivalent in Gremlin to defining a field unique in Postgres for example.
I could always query the graph to see if the property exists first, I'm trying to figure out if there's a more efficient elegant way.
This is something that Apache TinkerPop leaves up to the database implementor, and it does vary by implementation. I see you tagged the question with amazon-neptune. Currently Neptune only enforces the unique ID constraints for vertices and edges.
It's possible in a future release that additional schema constraint capabilities will be added, but at the present time you would have to monitor/control that in your application logic.
Related
I am working with Azure CosmosDB, and more specifically with the Gremlin API, and I am a little bit stuck as to what to select as a partition key.
Indeed, since I'm using graph data, not all vertices follow the same data schema. If I select a property that not all vertices have in common, Azure won't let me store vertices which don't have a value for the partition key. The problem is, the only property they all have in common is /id, but Azure doesn't allow for this property to be used as a partition key.
Does that mean I need to create a property that all my vertices will have in common ? Doesn't that kill a little bit the purpose of graph data ? Or is there something I'm missing out ?
For example, in my case, I want to model an object and its parts. Each object and each part have a property /identificationNumber. Would it be better to use this property as a parition key, or to create a new property /partitionKey dedicated to the purpose of partitioning ? My concern is that, if I select /identificationNumber as the partition key, and if my data model has to evolve in the future, if I have to model new objects without an /identificationNumber, I will have to artificially add this property to these objects the data model, which might lead to some confusion.
Creating a dedicated property to use as a synthetic partition key is a good practice if there isn't an obvious existing property to use. This approach can mitigate cases where you don't have an /identificationNumber in some objects, since you can assign some other value as the partitionKey in those cases. This also allows flexibility around refactoring /identificationNumber in the future, since partitionKey is what needs to be unchanging.
We shouldn't be concerned about an "artificial property" because this is inherent with using a partitioned database. It doesn't need to be exposed to users, but devs need to understand Cosmos is somewhat different than traditional DBs. It's also possible to migrate to a new partition key by copying all data to a new container, in the worst case of regret down the road. It's probably best to start working on the project with a best guess and seeing how things work, and perhaps iterating on different ideas to compare performance etc.
We are stamping user permission as a property (of SET cardinality) on each nodes and edges. Wondering what is best way to apply the has step on all the visited nodes/edges for a given traversal gremlin query.
like a very simple travarsal query:
// Flights from London Heathrow (LHR) to airports in the USA
g.V().has('code','LHR').out('route').has('country','US').values('code')
add has('permission', 'team1') to all the visited vertices and edges while traversal using the above query.
There are two approaches you may consider.
Write a custom TraversalStrategy
Develop a Gremlin DSL
For a TraversalStrategy you would develop one similar to SubgraphStrategy or PartitionStrategy which would take your user permissions on construction and then automatically inject the necessary has() steps after out() / in() sorts of steps. The drawback here is that your TraversalStrategy must be written in a JVM language and if using Gremlin Server must be installed on the server. If you intend to configure this TraversalStrategy from the client-side in any way you would need to build custom serializers to make that possible.
For a DSL you would create new navigational steps for out() / in() sorts of steps and they would insert the appropriate combination of navigation step and has() step. The DSL approach is nice because you could write it in any programming language and it would work, but it doesn't allow server-side configuration and you must always ensure clients use the DSL when querying the graph.
We are stamping user permission as a property (of SET cardinality) on each nodes and edges.
As a final note, by "SET cardinality" I assume that you mean multi-properties. Edges don't allow for those so you would only be able to stamp such a property on vertices.
I’m just playing with the Graph API in Cosmos DB
which uses the Gremlin syntax for query.
I have a number of users (Vertex) in the graph and each have ‘knows’ properties to other users. Some of these are out edges (outE) and others are in edges (inE) depending on how the relationship was created.
I’m now trying to create a query which will return all ‘knows’ relationships for a given user (Vertex).
I can easily get the ID of either inE or outE via:
g.V('7112138f-fae6-4272-92d8-4f42e331b5e1').inE('knows')
g.V('7112138f-fae6-4272-92d8-4f42e331b5e1').outE('knows')
where '7112138f-fae6-4272-92d8-4f42e331b5e1' is the Id of the user I’m querying, but I don’t know ahead of time whether this is an in or out edge, so want to get both (e.g. if the user has in and out edges with the ‘knows’ label).
I’ve tried using a projection and OR operator and various combinations of things e.g.:
g.V('7112138f-fae6-4272-92d8-4f42e331b5e1').where(outE('knows').or().inE('knows'))
but its not getting me back the data I want.
All I want out is a list of the Id’s of all inE and outE that have the label ‘knows’ for a given vertex.
Or is there a simpler/better way to model bi-directional associations such as ‘knows’ or ‘friendOf’?
Thanks
You can use the bothE step in this case. g.V('7112138f-fae6-4272-92d8-4f42e331b5e1').bothE('knows')
The gremlin documentation says:
Many graph vendors do not allow the user to specify an element ID and
in such cases, an exception is thrown.
I assume this refers to only specifying an ID when creating a new vertex or edge, not to the overall use of IDs in queries. So which gremlin implementations do, and which do not allow specifying and ID along with vertex or edge creation?
It's easier to specify the graph databases that do allow id assignment rather than those that don't as most graph databases do not allow you to specify the id when you create a vertex/edge.
I'm only aware of two that allow you to specify the id: TinkerGraph and elastic-gremlin. The rest do not support that.
You can always check what a graph supports by calling the features() method on the Graph instance:
http://tinkerpop.incubator.apache.org/docs/3.0.1-incubating/#_features
I have a node type that has a string property that will have the same value really often. Etc. Millions of nodes with only 5 options of that string value. I will be doing searches by that property.
My question would be what is better in terms of performance and memory:
a) Implement it as a node property and have lots of duplicates (and search using WHERE).
b) Implement it as 5 additional nodes, where all original nodes reference one of them (and search using additional MATCH).
Without knowing further details it's hard to give a general purpose answer.
From a performance perspective it's better to limit the search as early as possible. Even more beneficial if you do not have to look into properties for a traversal.
Given that I assume it's better to move the lookup property into a seperate node and use the value as relationship type.
Use labels; this blog post is a good intro to this new Neo4j 2.0 feature:
Labels and Schema Indexes in Neo4j
I've thought about this problem a little as well. In my case, I had to represent state:
STARTED
IN_PROGRESS
SUBMITTED
COMPLETED
Overall the Node + Relationship approach looks more appealing in that only a single relationship reference needs to be maintained each time rather than a property string and you don't need to scan an extra additional index which has to be maintained on the property (memory and performance would intuitively be in favor of this approach).
Another advantage is that it easily supports the ability of a node being linked to multiple "special nodes". If you foresee a situation where this should be possible in your model, this is better than having to use a property array (and searching using "in").
In practice I found that the problem then became, how do you access these special nodes each time. Either you maintain some sort of constants reference where you have the node ID of these special nodes where you can jump right into them in your START statement (this is what we do) or you need to do a search against property of the special node each time (name, perhaps) and then traverse down it's relationships. This doesn't make for the prettiest of cypher queries.