Formatting CosmosDB Gremlin Query - azure-cosmosdb

I'm new to Gremlin and CosmosDB. I've been following the TinkerPop tutorials and am using the TinkerFactory.createModern() test graph.
What I am looking for is to return a GraphSON object similar to this from CosmosDB:
{
  "user": {
    "name": "Marko",
    "age": 29
  },
  "knows": [
    {"name": "josh", "age": 32},
    {"name": "vadas", "age": 27}
  ],
  "created": [
    {"name": "lop", "lang": "java"}
  ]
}
My thoughts were to try:
g.V().has('name', 'marko').as('user').out('knows').as('knows').out('created').as('created').select('user', 'knows', 'created')
What I actually get back is one flat map per traversal path (screenshot omitted), rather than the nested structure above.
I was hoping to get a single user object with an array of knows objects and an array of created objects.
If this is possible, can you please explain what steps are needed to get this format?
Hope my question is clear, and thanks to anyone that can help =)

You should use project():
gremlin> g.V().has('person','name','marko').
......1>   project('user','knows','created').
......2>     by(project('name','age').by('name').by('age')).
......3>     by(out('knows').project('name','age').by('name').by('age').fold()).
......4>     by(out('created').project('name','lang').by('name').by('lang').fold())
==>[user:[name:marko,age:29],knows:[[name:vadas,age:27],[name:josh,age:32]],created:[[name:lop,lang:java]]]
That syntax should work with CosmosDB. Note the fold() at the end of the child traversals: without it, by() only grabs the first "knows" vertex rather than the full list. In TinkerPop 3.4.0, things get a little nicer, as you can use valueMap() a bit more effectively (though I don't think CosmosDB supports that as of the time of this answer):
gremlin> g.V().has('person','name','marko').
......1>   project('user','knows','created').
......2>     by(valueMap('name','age').by(unfold())).
......3>     by(out('knows').valueMap('name','age').by(unfold()).fold()).
......4>     by(out('created').valueMap('name','lang').by(unfold()).fold())
==>[user:[name:marko,age:29],knows:[[name:vadas,age:27],[name:josh,age:32]],created:[[name:lop,lang:java]]]
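For what it's worth, here's a minimal sketch of submitting that query from Python with the gremlinpython driver; the endpoint, database/collection names, and key below are placeholders (and it assumes the "modern" sample data has been loaded into your graph). Note that Cosmos DB's Gremlin API speaks GraphSON 2.0, hence the v2 serializer:

# Minimal sketch; all connection values are placeholders.
from gremlin_python.driver import client, serializer

cosmos = client.Client(
    'wss://your-account.gremlin.cosmos.azure.com:443/',  # placeholder endpoint
    'g',
    username='/dbs/your-database/colls/your-graph',      # placeholder database/collection
    password='your-primary-key',                         # placeholder key
    message_serializer=serializer.GraphSONSerializersV2d0()  # Cosmos DB speaks GraphSON 2.0
)

query = """
g.V().has('person','name','marko').
  project('user','knows','created').
    by(project('name','age').by('name').by('age')).
    by(out('knows').project('name','age').by('name').by('age').fold()).
    by(out('created').project('name','lang').by('name').by('lang').fold())
"""

# The driver deserializes the GraphSON response into plain Python dicts/lists,
# so each row is already the nested structure the question asks for.
for row in cosmos.submit(query).all().result():
    print(row)

cosmos.close()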

Related

How to parse GraphSON data from Neptune as a list of dictionaries?

If you make a signed request using the code provided by AWS here: https://docs.aws.amazon.com/neptune/latest/userguide/iam-auth-connecting-python.html
and then run a query like this from a Python script:
make_signed_request(query="g.V().limit(10).valueMap(true).toList()")
it outputs an ugly, hard-to-use structure like this:
{
  "requestId": "bf942e84-ff49-42c7-a65c-ef43f45g5h63",
  "status": {
    "message": "",
    "code": 200,
    "attributes": {
      "@type": "g:Map",
      "@value": []
    }
  },
  "result": {
    "data": {
      "@type": "g:List",
      "@value": [
        {
          "@type": "g:Map",
          "@value": [
            "names",
            {
              "@type": "g:List",
              "@value": ["David Bowie"]
            }
            ..., etc.
Whereas if I run the same query in a notebook, like this:
%%gremlin --store-to foo
g.V().limit(10).valueMap(true).toList()
Then foo is a nicely formatted list of dictionaries, like this:
[
  {'names': ['David Bowie'], 'dob': ['08-01-1947']},
  {'names': ['Michael Jackson'], 'dob': ['29-08-1958']}
]
How do I get the make_signed_request function to return data in the same way that the notebook does?
What you are seeing is the default HTTP response format that you can expect from any "Gremlin Server compatible" TinkerPop endpoint. Under the covers, the graph-notebook notebooks use the Gremlin Python client and send the request over a WebSocket; the client nicely deserializes that result for you. You essentially have two options when calling the Neptune Gremlin endpoint:
1. Use a specific Gremlin client for your preferred programming language (if one exists).
2. Call the HTTP endpoint and post-process the GraphSON result yourself. Rather than write your own deserializer, you can most likely repurpose the serializers from one of the existing clients (see the sketch below).
If possible, I would use option 1.
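For option 2, here's a rough sketch (assuming make_signed_request from the AWS sample above returns the raw response shown earlier, as a JSON string or a parsed dict) that repurposes gremlinpython's GraphSON reader instead of hand-rolling a parser:

import json
from gremlin_python.structure.io.graphsonV3d0 import GraphSONReader

# make_signed_request is the helper from the AWS sample linked in the question;
# we assume it returns the GraphSON response as a JSON string or a parsed dict.
response = make_signed_request(query="g.V().limit(10).valueMap(true).toList()")
if isinstance(response, str):
    response = json.loads(response)

# Let the driver's GraphSON reader unwrap the @type/@value envelopes into
# plain Python objects, e.g. [{'names': ['David Bowie'], ...}, ...]
data = GraphSONReader().toObject(response['result']['data'])
print(data)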

Merging a list of maps in Gremlin

I have this relationship:
person --likes--> subject
This is my query:
g.V().
  hasLabel('person').
  has('name', 'Joe').
  outE('likes').
  range(0, 2).
  union(identity(), inV().hasLabel('subject')).
  valueMap('rating', 'name')
At this point, I get a result that looks like this:
[
  {
    "rating": 3.236155563
  },
  {
    "rating": 3.162886797
  },
  {
    "name": "math"
  },
  {
    "name": "history"
  }
]
I'd like to get something like this:
[
  {
    "rating": 3.236155563,
    "name": "math"
  },
  {
    "rating": 3.162886797,
    "name": "history"
  }
]
I've tried grouping the results - which gives me the structure I want - but because of the identical keys, I only get one set of results back.
It always helps when you post the code to create the graph, so that we can give you a tested answer. Like so:
g.addV('person').property('name', 'P1').as('p1').
addV('subject').property('name', 'Math').as('math').
addV('subject').property('name', 'History').as('history').
addV('subject').property('name', 'Geography').as('geography').
addE('likes').from('p1').to('math').property('rating', 1.2).
addE('likes').from('p1').to('history').property('rating', 2.3).
addE('likes').from('p1').to('geography').property('rating', 3.4)
I believe you are trying to write a traversal that starts from a certain person, goes out along the first two "likes" edges, and gets the names of the subjects he likes together with the rating on the corresponding "likes" edge:
g.V().has('person', 'name', 'P1').
outE('likes').
range(0, 2).
project('SubjectName', 'Rating').
by(inV().values('name')).
by(values('rating'))
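Against the sample graph above, this should return something like (edge order may vary by provider):
==>[SubjectName:Math,Rating:1.2]
==>[SubjectName:History,Rating:2.3]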

How to get the results from properties of different vertices in Gremlin?

I have this database:
Clients => Incidents => Files => Filenames
Clients have an ID
Incidents have an ID and a reportedOn property
Files have an ID and fileSize, mimeType, and malware properties
Filenames have an ID
Clients have an outgoing edge to incidents (reported), incidents have an outgoing edge to files (containsFile), and files have an outgoing edge to filenames (hasName).
What query do I have to execute in Gremlin to get the filename ID, the file ID, the file's fileSize, and the incident's reportedOn values in one result?
Here is some sample data:
g.addV('client').property('id','1')
addV('incident').property('id','11').property('reportedON'2/15/2019 8:01:19 AM')
addV('file').property('id','100').property('fileSize', '432534')
addV('fileName').property('id','file.pdf')
addE('reported').from('1').to('11').
addE('containsFile').from('11').to('100').
addE('hasName').from('100').to('file.pdf').iterate()
The traversal you've posted to create the sample data contains quite a few errors; please double-check what you post.
Anyway, here's a fixed version of your query:
g.addV('client').property('id','1').as('1').
addV('incident').property('id','11').property('reportedON', '2/15/2019 8:01:19 AM').as('11').
addV('file').property('id','100').property('fileSize', '432534').as('100').
addV('fileName').property('id','file.pdf').as('file.pdf').
addE('reported').from('1').to('11').
addE('containsFile').from('11').to('100').
addE('hasName').from('100').to('file.pdf').iterate()
To "get the filename-ID, the file-ID, the file-fileSize and the incident-reportedOn values" in one result:
gremlin> g.V().has('client','id','1').
......1> out('reported').as('incident').
......2> out('containsFile').
......3> out('hasName').
......4> path().
......5> from('incident').
......6> by(union(group().
......7> by(label).
......8> by('id'),
......9> valueMap()).
.....10> unfold().
.....11> filter(select(keys).is(neq('id'))).
.....12> group().
.....13> by(keys).
.....14> by(select(values).unfold())).
.....15> unfold().unfold().
.....16> group().
.....17> by(keys).
.....18> by(select(values).unfold())
==>[fileName:file.pdf,file:100,reportedON:2/15/2019 8:01:19 AM,fileSize:432534,incident:11]
Getting just path().from('incident').by(valueMap()) would already give you everything you need; I only added a bit of re-grouping on top of it to get a nicer formatted result.
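For comparison, the short version would be this sketch (same traversal as above, minus the re-grouping):
gremlin> g.V().has('client','id','1').
......1>   out('reported').as('incident').
......2>   out('containsFile').
......3>   out('hasName').
......4>   path().
......5>     from('incident').
......6>     by(valueMap())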

Gremlin on Azure CosmosDB: how to project the related vertices' properties?

I use the Microsoft.Azure.Graphs library to connect to a Cosmos DB instance and query the graph database.
I'm trying to optimize my Gremlin queries so that they select only the properties I require. However, I don't know how to choose which properties to select from edges and vertices.
Let's say we start from this query:
g.V().hasLabel('user').
  project('user', 'edges', 'relatedVertices').
    by().
    by(bothE().fold()).
    by(both().fold())
This will return something along the lines of:
{
  "user": {
    "id": "<userId>",
    "type": "vertex",
    "label": "user",
    "properties": [
      // all vertex properties
    ]
  },
  "edges": [{
    "id": "<edgeId>",
    "type": "edge",
    "label": "<edgeName>",
    "inV": "<relatedVertexId>",
    "inVLabel": "<relatedVertexLabel>",
    "outV": "<relatedVertexId>",
    "outVLabel": "<relatedVertexLabel>",
    "properties": [
      // edge properties, if any
    ]
  }],
  "relatedVertices": [{
    "id": "<vertexId>",
    "type": "vertex",
    "label": "<relatedVertexLabel>",
    "properties": [
      // all related vertex properties
    ]
  }]
}
Now let's say we take only a couple of properties from the root vertex, which we labeled "user":
g.V().hasLabel('user').
  project('id', 'prop1', 'prop2', 'edges', 'relatedVertices').
    by(id).
    by('prop1').
    by('prop2').
    by(bothE().fold()).
    by(both().fold())
This makes some progress for us and yields something along the lines of:
{
  "id": "<userId>",
  "prop1": "value1",
  "prop2": "value2",
  "edges": [{
    "id": "<edgeId>",
    "type": "edge",
    "label": "<edgeName>",
    "inV": "<relatedVertexId>",
    "inVLabel": "<relatedVertexLabel>",
    "outV": "<relatedVertexId>",
    "outVLabel": "<relatedVertexLabel>",
    "properties": [
      // edge properties, if any
    ]
  }],
  "relatedVertices": [{
    "id": "<vertexId>",
    "type": "vertex",
    "label": "<relatedVertexLabel>",
    "properties": [
      // all related vertex properties
    ]
  }]
}
Now, is it possible to do something similar for the edges and related vertices? Say, something along the lines of:
g.V().hasLabel('user').
  project('id', 'prop1', 'prop2', 'edges', 'relatedVertices').
    by(id).
    by('prop1').
    by('prop2').
    by(bothE().fold().
       project('edgeId', 'edgeLabel', 'edgeInV', 'edgeOutV').
         by(id).
         by(label).
         by(inV).
         by(outV)).
    by(both().fold().
       project('vertexId', 'someProp1', 'someProp2').
         by(id).
         by('someProp1').
         by('someProp2'))
My aim is to get an output like this:
{
  "id": "<userId>",
  "prop1": "value1",
  "prop2": "value2",
  "edges": [{
    "edgeId": "<edgeId>",
    "edgeLabel": "<edgeName>",
    "edgeInV": "<relatedVertexId>",
    "edgeOutV": "<relatedVertexId>"
  }],
  "relatedVertices": [{
    "vertexId": "<vertexId>",
    "someProp1": "someValue1",
    "someProp2": "someValue2"
  }]
}
You were pretty close:
gremlin> g.V().hasLabel('person').
......1> project('name','age','edges','relatedVertices').
......2> by('name').
......3> by('age').
......4> by(bothE().
......5> project('id','inV','outV').
......6> by(id).
......7> by(inV().id()).
......8> by(outV().id()).
......9> fold()).
.....10> by(both().
.....11> project('id','name').
.....12> by(id).
.....13> by('name').
.....14> fold())
==>[name:marko,age:29,edges:[[id:9,inV:3,outV:1],[id:7,inV:2,outV:1],[id:8,inV:4,outV:1]],relatedVertices:[[id:3,name:lop],[id:2,name:vadas],[id:4,name:josh]]]
==>[name:vadas,age:27,edges:[[id:7,inV:2,outV:1]],relatedVertices:[[id:1,name:marko]]]
==>[name:josh,age:32,edges:[[id:10,inV:5,outV:4],[id:11,inV:3,outV:4],[id:8,inV:4,outV:1]],relatedVertices:[[id:5,name:ripple],[id:3,name:lop],[id:1,name:marko]]]
==>[name:peter,age:35,edges:[[id:12,inV:3,outV:6]],relatedVertices:[[id:3,name:lop]]]
Two points you should consider when writing Gremlin:
1. The output of the previous step feeds into the input of the following step, and if you don't clearly see what's coming out of a particular step, then the steps that follow may not end up being right. In your example, in the first by() you added the project() after the fold(), which was basically saying: "Hey, Gremlin, project that List of edges for me." But in the by() modulators for that project() you treated the input not as a List but as individual edges, which likely led to an error. In Java, that error is: "java.util.ArrayList cannot be cast to org.apache.tinkerpop.gremlin.structure.Element". An error like that is a clue that somewhere in your Gremlin you are not properly following the outputs and inputs of your steps.
2. fold() takes all the elements in the stream of the traversal and converts them to a List, so where you had many objects, you will have one after the fold(). To process them as a stream again, you would need to unfold() them so that steps can operate on them individually. In this case, we just needed to move the fold() to the end of the statement, after doing the sub-project() for each edge/vertex. But why do we need fold() at all? The answer is that the traversal passed to a by() modulator is not iterated completely by the step it modifies (in this case project()); the step only calls next() to get the first element in the stream - this is by design. Therefore, in cases where you want the entire stream of a by() to be processed, you must reduce the stream to a single object. You might do that with fold(), but other examples include sum(), count(), mean(), etc. See the example below.
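To see the difference concretely, against the TinkerPop "modern" toy graph (where marko knows both vadas and josh):
gremlin> g.V().has('person','name','marko').project('knows').by(out('knows').values('name'))
==>[knows:vadas]
gremlin> g.V().has('person','name','marko').project('knows').by(out('knows').values('name').fold())
==>[knows:[vadas,josh]]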

How to represent the Data Model of a Graph

So we have been developing some graph-based analysis tools, using Neo4j as a persistence engine in the background. As part of this we are developing a graph data model suitable for our domain, and we want to use it in the application layer to restrict the types of nodes, or to ensure that nodes of certain types carry certain properties. Normal data-model restrictions.
So that's the background. What I am asking is: is there some standard way to represent a data model for a graph DB? The graph equivalent of an XSD, perhaps?
There's an open-source project supporting strong schema definitions in Neo4j: Structr (http://structr.org, see it in action: http://vimeo.com/structr/videos)
With Structr, you can define an in-graph schema of your data model including
Type inheritance
Supported data types: Boolean, String, Integer, Long, Double, Date, Enum (+ values)
Default values
Cardinality (1:1, 1:*, *:1)
Not-null constraints
Uniqueness constraints
Full type safety
Validation
Cardinality enforcement
Support for methods (custom actions) is currently being added to the schema.
The schema can be edited with an editor, or directly via REST by modifying the JSON representation of the data model:
{
  "query_time": "0.001618446",
  "result_count": 4,
  "result": [
    {
      "name": "Whisky",
      "extendsClass": null,
      "relatedTo": [
        {
          "id": "96d05ddc9f0b42e2801f06afb1374458",
          "name": "Flavour"
        },
        {
          "id": "28f85dca915245afa3782354ea824130",
          "name": "Location"
        }
      ],
      "relatedFrom": [],
      "id": "df9f9431ed304b0494da84ef63f5f2d8",
      "type": "SchemaNode",
      "_name": "String"
    },
    {
      "name": "Flavour",
      ...
    },
    {
      "name": "Location",
      ...
    },
    {
      "name": "Region",
      ...
    }
  ],
  "serialization_time": "0.000829985"
}
{
  "query_time": "0.001466743",
  "result_count": 3,
  "result": [
    {
      "name": null,
      "sourceId": "28f85dca915245afa3782354ea824130",
      "targetId": "e4139c5db45a4c1cbfe5e358a84b11ed",
      "sourceMultiplicity": null,
      "targetMultiplicity": "1",
      "sourceNotion": null,
      "targetNotion": null,
      "relationshipType": "LOCATED_IN",
      "sourceJsonName": null,
      "targetJsonName": null,
      "id": "d43902ad7348498cbdebcd92135926ea",
      "type": "SchemaRelationship",
      "relType": "IS_RELATED_TO"
    },
    {
      "name": null,
      "sourceId": "df9f9431ed304b0494da84ef63f5f2d8",
      "targetId": "96d05ddc9f0b42e2801f06afb1374458",
      "sourceMultiplicity": null,
      "targetMultiplicity": "1",
      "sourceNotion": null,
      "targetNotion": null,
      "relationshipType": "HAS_FLAVOURS",
      "sourceJsonName": null,
      "targetJsonName": null,
      "id": "bc9a6308d1fd4bfdb64caa355444299d",
      "type": "SchemaRelationship",
      "relType": "IS_RELATED_TO"
    },
    {
      "name": null,
      "sourceId": "df9f9431ed304b0494da84ef63f5f2d8",
      "targetId": "28f85dca915245afa3782354ea824130",
      "sourceMultiplicity": null,
      "targetMultiplicity": "1",
      "sourceNotion": null,
      "targetNotion": null,
      "relationshipType": "PRODUCED_IN",
      "sourceJsonName": null,
      "targetJsonName": null,
      "id": "a55fb5c3cc29448e99a538ef209b8421",
      "type": "SchemaRelationship",
      "relType": "IS_RELATED_TO"
    }
  ],
  "serialization_time": "0.000403616"
}
You can access nodes and relationships stored in Neo4j as JSON objects through a RESTful API, which is dynamically configured based on the in-graph schema.
$ curl try.structr.org:8082/structr/rest/whiskies?name=Ardbeg
{
  "query_time": "0.001267211",
  "result_count": 1,
  "result": [
    {
      "flavour": {
        "name": "J",
        "description": "Full-Bodied, Dry, Pungent, Peaty and Medicinal, with Spicy, Feinty Notes.",
        "id": "626ba892263b45e29d71f51889839ebc",
        "type": "Flavour"
      },
      "location": {
        "region": {
          "name": "Islay",
          "id": "4c7dd3fe2779492e85bdfe7323cd78ee",
          "type": "Region"
        },
        "whiskies": [
          ...
        ],
        "name": "Port Ellen",
        "latitude": null,
        "longitude": null,
        "altitude": null,
        "id": "47f90d67e1954cc584c868e7337b6cbb",
        "type": "Location"
      },
      "name": "Ardbeg",
      "id": "2db6b3b41b70439dac002ba2294dc5e7",
      "type": "Whisky"
    }
  ],
  "serialization_time": "0.010824154"
}
In the UI, there's also a data-editing (CRUD) tool, as well as CMS components that support building web applications on Neo4j.
Disclaimer: I'm a developer of Structr and founder of the project.
No, there's no standard way to do this. Indeed, even if there were, keep in mind that the only constraints Neo4j currently supports are uniqueness constraints.
Take for example some sample rules:
All nodes labeled :Person must have non-empty properties fname and lname
All nodes labeled :Person must have >= 1 outbound relationship of type :works_for
The trouble with present-day Neo4j is that even if you had a (standardized) schema language that could express these things, there wouldn't be a way for the DB engine itself to actually enforce those constraints.
So the simple answer is no, there's no standard way of doing that right now.
A few tricks I've seen people use to simulate the same:
Assemble a list of "test suite" Cypher queries with known results. Query for things you know shouldn't be there; non-empty result sets are a sign of a problem/integrity violation. Query for things you know should be there; empty result sets are a problem. (See the sketch after this list.)
Application-level control -- via some layer like spring-data or similar, control who can talk to the database. This essentially moves your data integrity/testing problem up into the app, away from the database.
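As a sketch of the first trick, hypothetical "test suite" queries for the two example rules above might look like this in Cypher (any non-zero count flags an integrity violation):

// Rule 1: every :Person must have non-empty fname and lname properties.
MATCH (p:Person)
WHERE p.fname IS NULL OR p.lname IS NULL OR p.fname = '' OR p.lname = ''
RETURN count(p) AS missingNameViolations

// Rule 2: every :Person must have at least one outbound :works_for relationship.
MATCH (p:Person)
WHERE NOT (p)-[:works_for]->()
RETURN count(p) AS worksForViolations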
It's a common (and IMHO annoying) aspect of many NoSQL solutions (not specifically Neo4j) that, because of their schema weakness, they tend to force validation up the tech stack into the application. Doing these things in the application tends to be harder and more error-prone. SQL databases permit you to implement all sorts of schema constraints, triggers, etc. -- specifically to make it really damn hard to put the wrong data into the database. The NoSQL databases typically either aren't there yet, or don't do this as a design decision. There are indeed flexibility/performance tradeoffs: databases can insert faster and adapt more quickly if they aren't burdened with checking each atom of data against a long list of schema rules.
EDIT: Two relevant resources: the metagraphs proposal talks about how you could represent the schema as a graph, and neoprofiler is an application that attempts to infer the actual structure of a Neo4j database and show you its "profile".
With time, I think it's reasonable to hope that Neo4j will include basic integrity features like requiring certain labels to have certain properties (the example above), restricting the data types of certain properties (lname must always be a String, never an integer), and so on. The graph data model is a bit wild and woolly, though (in the computational-complexity sense), and there are some constraints that people desperately want but will probably never get. An example would be the constraint that a graph can't have cycles in it; enforcing that on the creation of every relationship would be very computationally intensive.
