Assuming I have an arbitrary Gremlin query I don't control as input, and a graph database that I run it against, how can I capture the paths of all accessed nodes in the graph, as in, how can I see what parts of the graph are needed by an arbitrary query?
Clarification:
If I run the arbitrary query, how can I capture all the data accessed while it runs, not just the result?
Different databases may have explain plan options that give some insight into how a query will run but really the only way to know what a Gremlin query is going to need to visit in the graph is to run it. If you know the schema of the graph you could potentially write some code that analyzes the query to look at the various steps and labels used to make an estimate of what the query will touch but I am not aware of any existing tools that do that.
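The static-analysis idea mentioned above can be sketched roughly as follows. This is only an illustration: the function name estimate_labels and the list of steps it inspects are assumptions, it works on the query text rather than on real Gremlin bytecode, and it will miss labels that are built dynamically or passed through other steps. It gives an estimate of which labels a query mentions, not a record of what the database actually touched.

```python
import re

# Steps whose string arguments name vertex or edge labels.
# This list is an assumption for the sketch and is not exhaustive.
LABEL_STEPS = ("hasLabel", "out", "in", "both", "outE", "inE", "bothE")

STEP_PATTERN = re.compile(r"\b(" + "|".join(LABEL_STEPS) + r")\(([^()]*)\)")
STRING_PATTERN = re.compile(r"""["']([^"']+)["']""")


def estimate_labels(query: str) -> dict:
    """Map each label-bearing step name to the set of labels it mentions."""
    found = {}
    for step, args in STEP_PATTERN.findall(query):
        labels = STRING_PATTERN.findall(args)
        if labels:
            found.setdefault(step, set()).update(labels)
    return found


if __name__ == "__main__":
    query = ('g.V().hasLabel("person").has("name", "Tom")'
             '.inE("has_person").outV().hasLabel("region")')
    print(estimate_labels(query))
```

Combined with knowledge of the schema, the extracted labels indicate which vertex and edge types the query could reach, but only running the query shows what it really visits.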
I have a graph model: region (vertex) -> has_person (edge) -> person (vertex). I want to get the region vertices that have a person named Tom.
This query works fine:
g.V().hasLabel("person").has("name", "Tom").inE("has_person").outV().hasLabel("region")
But why do the following queries hang:
g.V().hasLabel("region").and(
__.hasLabel("person").has("name", "Tom").inE("has_person").outV().hasLabel("region")
)
g.V().and(
__.hasLabel("person").has("name", "Tom").inE("has_person").outV().hasLabel("region")
).hasLabel("region")
When writing graph traversals with Gremlin you need to think about how the graph database you are using is optimizing your traversal (e.g. is a global index being used?).
You should consider the indexing capability of your graph database and examine the output of the profile() step. It will tell you if indices are being used and where. My guess is that the query that works "fine" is using an index to find "Tom" and then is able to quickly traverse that one index to find the regions that have "has_person" edges related to him. Almost every graph will be capable of optimizing that sort of pattern. Your following queries that "hang" will typically not be optimized by most graphs to utilize an index, mostly because the pattern you've chosen with the and() step isn't one that most optimizations look for. My guess is that both of those traversals are filtering almost completely in-memory.
Fwiw, your query that works "fine" is the optimal way to write that I think given what you state as your desired output. Your first hanging query I don't think will ever return results because it requires that the vertex have a label that is both "region" and "person" which is not possible. The second hanging query seems to not require the and() in the first place and is double filtering the "region" label.
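To make the label conflict concrete, here is a small in-memory sketch in plain Python (not Gremlin). The toy data and the function names regions_of and regions_and_filter are made up for illustration. The first function mirrors the working traversal; the second mirrors the first hanging query, where both label filters are applied to the same vertex.

```python
# Toy graph: every vertex carries exactly one label; edges point region -> person.
vertices = {
    1: {"label": "region", "name": "North"},
    2: {"label": "region", "name": "South"},
    3: {"label": "person", "name": "Tom"},
    4: {"label": "person", "name": "Ann"},
}
edges = [(1, "has_person", 3), (2, "has_person", 4)]  # (out_vertex, label, in_vertex)


def regions_of(person_name):
    """Start at the person, then walk incoming has_person edges back to regions."""
    people = {v for v, props in vertices.items()
              if props["label"] == "person" and props["name"] == person_name}
    return sorted(out_v for out_v, label, in_v in edges
                  if label == "has_person" and in_v in people
                  and vertices[out_v]["label"] == "region")


def regions_and_filter(person_name):
    """Apply the 'region' and 'person' label filters to the same vertex."""
    return [v for v, props in vertices.items()
            if props["label"] == "region"
            and props["label"] == "person" and props["name"] == person_name]


if __name__ == "__main__":
    print(regions_of("Tom"))
    print(regions_and_filter("Tom"))
```

The second function has to look at every vertex and can never match, because no vertex has two labels. That is the same situation as the and() traversal: a full scan that produces nothing.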
I have a query (link below) I must execute once per day or once per week in my application to find groups of connected users. In the query I check all possible groups for each user of the application (not all users are evaluated but could be a lot). For the moment I'm only making performance tests in localhost using Gremlin Server, since my application is not live yet.
The problem is that when testing this query simulating many users the query reaches the time limit a request can take that is configured in Gremlin Server by default, another problem is that the query does not take full CPU usage since it seems a single query is designed to use a single thread or a reduced amount of CPU processing in some way.
So I have 2 solutions in mind, divide the query in one chunk per user or use OLAP:
Solution 1:
Send a query to get the users first and then send one query per user, then remove duplicates in the server code, this should work in my case and since I can send all the queries at the same time I can use all resources available and bypass the time limits.
Solution 2:
Use OLAP. I guess OLAP does not have a time limit. The problem: My idea is to use Amazon Neptune and OLAP is not supported there as far as I know.
In this question about it:
Gremlin OLAP queries on AWS Neptune
David says:
Update: Since GA (June 2018), Neptune supports multiple queries in a single request/transaction
What does it mean "multiple queries in a single request"?
How does my solution 1 compare with OLAP?
Should I look for another database service that supports OLAP instead of Neptune? Which one could it be? I don't want an option that implies learning to set up my own "Neptune like" server, I have limited time.
My query in case you want to take a look:
https://gremlify.com/69cb606uzaj
This is a bit of a complicated question.
The problem is that when testing this query simulating many users the query reaches the time limit a request can take that is configured in Gremlin Server by default,
I'll assume there is a reason you can't change the default value, but for those who might be reading this answer the timeout is configurable both at the server (with evaluationTimeout in the server yaml) and per request both for scripts and bytecode based requests.
another problem is that the query does not take full CPU usage since it seems a single query is designed to use a single thread or a reduced amount of CPU processing in some way.
If you're testing with TinkerGraph in Gremlin Server then know that TinkerGraph is really simple. It doesn't do anything internally to run any aspect of a traversal in parallel (without TinkerGraphComputer which is OLAP related).
So I have 2 solutions in mind, divide the query in one chunk per user or use OLAP:
Either approach has the potential to work. In the first solution you suggest a form of poor man's OLAP where you must devise your own methods for doing this parallel processing (i.e. manage thread pools, synchronize state, etc). I think that this approach is a common first step that folks take to deal with this sort of problem. I'd wonder if you need to be as fine grained as one user per request. I would think that sending several at a time would be acceptable but only testing in your actual environment would yield the answer to that. The nice thing about this solution is that it will typically work on any graph system, including Neptune.
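A minimal sketch of that first solution, sending several users per request and running the requests in parallel, could look like this. The function find_groups_for is a hypothetical stand-in that fakes results so the example runs on its own; in a real application it would submit the Gremlin traversal for the given batch of users to the server. The batch size and worker count are arbitrary and would need tuning in the actual environment.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def find_groups_for(user_batch):
    """Hypothetical stand-in for one request that covers a batch of users.

    It fakes the result: users 2k and 2k+1 belong to the same group, so the
    same group is reported once per member.
    """
    return [frozenset({u // 2 * 2, u // 2 * 2 + 1}) for u in user_batch]


def find_all_groups(user_ids, batch_size=3, workers=4):
    """Run one request per batch in parallel and de-duplicate the groups."""
    batches = chunked(user_ids, batch_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Groups are frozensets, so collecting them in a set removes duplicates.
        return {group
                for batch_result in pool.map(find_groups_for, batches)
                for group in batch_result}


if __name__ == "__main__":
    print(find_all_groups(list(range(6))))
```

Representing each group as a frozenset makes the de-duplication step trivial, since the same group found through different users compares equal.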
Using your second solution with OLAP is trickier. You have the obvious problem that Neptune does not directly support it, but going to a different provider that does will not instantly solve your problem. While OLAP rids you of having to worry about how to optimally parallelize your workload, it doesn't mean that you can instantly take that Gremlin query you want to run, throw it into Spark and get an instant win. For example, and I take this from the TinkerPop Reference Documentation:
In OLAP, where the atomic unit of computing is the vertex and its local
"star graph," it is important that the anonymous traversal does not leave the
confines of the vertex’s star graph. In other words, it can not traverse to an
adjacent vertex’s properties or edges.
In your query, there are already places where you "leave the star graph" so you would immediately find problems there to solve. Usually that limitation can be worked around for OLAP purposes but it's not as simple as adding withComputer() to your traversal and getting a win in this case.
Going further down this path of using OLAP with a graph other than Neptune, you would probably want to at least consider if this complex traversal could be better written as a custom VertexProgram which might better bind your use case to the capabilities of BSP than what the more generic TraversalVertexProgram does when processing arbitrary Gremlin. For that matter, a mix of Gremlin OLAP, a custom VertexProgram and some standard map/reduce style processing might ultimately lead to the most elegant and efficient answer.
An idea I've been considering for graphs that don't support OLAP has been to subgraph() (with Java) the portion of the graph that is relevant to your algorithm and then execute it locally in TinkerGraph! I think that might make sense in some use cases where the algorithm has some limits that can be defined ahead of time to form the subgraph, where those limits can be easily filtered and where the resulting subgraph is not so large that it takes an obscene amount of time to construct. It would be even better if the subgraph had some use beyond a single algorithm - almost behaving like a cache graph. I have no idea if that is useful to you but it's a thought. Here's a recent blog post I wrote that talks about writing VertexPrograms. Perhaps you will find it interesting.
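As a rough illustration of "limits that can be defined ahead of time", the sketch below extracts everything within a fixed number of hops of a start vertex from a plain adjacency dict. This is ordinary Python, not TinkerPop's subgraph() step, and the function name extract_subgraph and the hop-count limit are assumptions chosen only to show the idea of cutting out a bounded, locally processable piece of the graph.

```python
from collections import deque


def extract_subgraph(adjacency, start, max_depth):
    """Return the adjacency of all vertices within max_depth hops of start."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if depth[vertex] == max_depth:
            continue  # do not expand beyond the predefined limit
        for neighbour in adjacency.get(vertex, ()):
            if neighbour not in depth:
                depth[neighbour] = depth[vertex] + 1
                queue.append(neighbour)
    # Keep only edges whose endpoints were both reached.
    return {v: [n for n in adjacency.get(v, ()) if n in depth] for v in depth}


if __name__ == "__main__":
    graph = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
    print(extract_subgraph(graph, "a", 2))
```

Whether this pays off depends, as noted above, on how cheaply the limit can be evaluated and how large the resulting subgraph becomes.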
All that said about OLAP, I think that your first solution seems fine to start with. You don't have a multi-billion edge graph yet and can probably afford to take this approach for now.
What does it mean "multiple queries in a single request"?
I believe that this just means that you can send a script like:
g.addV().iterate()
g.addV().iterate()
g.V()
where multiple Gremlin commands can be executed within the scope of a single transaction where each command must be "separated by newline ('\n'), spaces (' '), semicolon ('; '), or nothing (for example: g.addV(‘person’).next()g.V() is valid)". I think that only the last command returns a value. It doesn't seem like that particular feature would be helpful in your case. I would look more to batch users within a particular request where possible.
If you are looking for a native OLAP graph engine, perhaps take a look at AnzoGraphDB, which scales and performs much better for that style of more complex querying than anything else we know of. It's an MPP engine, so every core works on the query in parallel. Depending on how much data you need it to act on, the free version (single node only, RAM limited) may well be all you need and can be used commercially. You can find it in the AWS Marketplace or on Docker Hub.
Disclaimer: I work for Cambridge Semantics Inc.
I've been experimenting with Titan over the past few weeks and would like some pointers on the way forward, plus a few specific questions. The purpose of the project is to store log data on a Cassandra cluster (for this question let's use the example of web traffic) and represent relationships in a Titan graph. All nodes are modelled as having an entity value and type (e.g. "google.com","hostname"), and edges have a label (e.g. "connects") as well as several attributes of the relationship (timestamp, flow length and so on).
Once this data is stored in cassandra and represented as a Titan graph, I plan to use d3 code to generate visualisations. At the end of the tunnel I am hoping to be able to build large-scale, interactive, complex graph networks that look something like this: http://goo.gl/CVEd55
My current setup is as follows:
A python script to convert log files into vertices.csv and edges.csv files for Gremlin to load in
Titan Server 0.4 (using CassandraThrift as the storage backend) - gremlin script to load converted data into Titan
Python script that uses NetworkX to open a RexPro connection, allowing the analyst to enter a custom Gremlin query, outputting the result as a JSON
Local web front-end that uses the generated JSON and d3 to display the results of the query as a graph
Ideally as a test base case, I would like the user to be able to type a Gremlin query into the web front-end and be directed to a page containing an interactive d3 graph of the result.
My specific questions are as follows:
What is the process for assigning attributes to edges? I have had trouble finding sample code that helps me represent the graph using the model listed above.
My gremlin script to load data into Titan uses bg.commit() to create a batch graph which is later referenced in the RexPro connection conn = RexProConnection('localhost', 8184, 'bg'). This was working originally but after changing my load script, clearing the graph in Gremlin and then reloading, the RexPro connection cannot be opened due to the graph bg apparently not existing. What is the process of updating graphs in Titan? Presumably running a load script twice using the same graph will only add nodes/vertices to the existing one, so how would I go about generating a new graph with the same name every time I update my model, and have RexPro be able to reference it when running a query?
How easy would it be to extend the interface to allow an analyst to enter SQL queries into the front end, using RexPro to access the graph in a similar way to the one described?
Apologies for the long post, but if anyone could share their expertise that would be much appreciated!
For d3 visualization, you can use a force-directed graph. There are a few variations of them.
Relationship Graph
https://vida.io/documents/qZ5SJdRJfj3XmSXYJ
Force Layout Tree
https://vida.io/documents/sy7vzWW7BJEvKdZeL
If your network contains a large number of nodes and edges, you'll need to cluster the data before visualizing. You can use tools like Gephi or NodeXL to perform the clustering, then use the clustered data to build the force-directed visualization.
What is the process for assigning attributes to edges?
The process is the same as adding properties to vertices. Get an Edge instance then do:
Edge e = g.addEdge(v1,v2,'label')
e.setProperty('weight',0.1d)
As for:
What is the process of updating graphs in Titan? Presumably running a load script twice using the same graph will only add nodes/vertices to the existing one, so how would I go about generating a new graph with the same name every time I update my model, and have RexPro be able to reference it when running a query?
You don't want a reference to a BatchGraph after loading as it comes with limitations that will prevent you from querying. It sounds like you should just configure "yourgraph" in rexster.xml, when you load through your script, simply wrap your rexster.xml configured Graph in your code, and perform your load operations against it. When you want to query it, simply reference "yourgraph" instead of "bg".
conn = RexProConnection('localhost', 8184, 'yourgraph')
How easy would it be to extend the interface to allow an analyst to enter SQL queries into the front end, using RexPro to access the graph in a similar way to the one described?
It's hard to say if that's "easy" as that depends on factors outside of just the technology. I'll say that it's possible to build an interface that accepts Gremlin queries (you wrote SQL, but I assume you meant Gremlin), passes them to Rexster and gets back an answer. What you do with that answer is up to you, but as far as Rexster's part in it goes, I don't see why that would be a problem.
I have an application that stores relationship information in a MySQL table (contact_id, other_contact_id, strength, recorded_at). This is fine if all I need to do is show who a contact's relationships are or even to generate a list of mutual contacts for two contacts.
But now I need to generate stats like: 'what was the total number of 2-way connections of strength 3 or better in January 2011' or (assuming that each contact is part of a group) 'which group has the most number of connections to other groups' etc.
I quickly found that the SQL for generating these stats became unwieldy real fast.
So I wrote a script that generates a graph in memory for any given date. I could then run whatever stat I wanted against that graph. Much easier to understand and, in general, much more performant too -- except for the graph generation part.
My next thought was to cache those graphs so I could call on them whenever I needed to run a new stat (or generate a later graph: eg for today's graph I take yesterday's graph and apply any changes that happened since yesterday). I tried memcached which worked great until the graphs grew > 1 MB.
So now I'm thinking about using a graph database like Neo4J.
Only problem is, I don't have just one graph. Or I do, but it is one that changes over time and I need to be able to query it with different reference times.
So, can I:
store multiple graphs in Neo4J and retrieve/interact with them separately? I would then create and store separate social graphs for each date.
or
add valid to and from timestamps to each edge and filter the graph appropriately: so if I wanted a graph for "May 1st" I would only follow the newest edge between two nodes that was created before "May 1st" (and if all the edges were created after May 1st then those nodes wouldn't be connected).
I'm pretty new to graph databases so any help/pointers/hints will be appreciated.
Right now you can store just one graph database in a single Neo4j instance, but this one graphdb can contain as many different sub-graphs as you like. You only have to keep that in mind when doing global operations (like index queries) but there you can do compound queries that include timestamped properties as well to limit the results.
One way of doing that is, as you said, to add temporal information to the edges. To represent the structure of the graph for a given date, you then traverse only the edges that were valid at that time.
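The timestamp-filtering approach can be sketched in plain Python like this. The sample rows reuse the column names from the question (contact, other contact, strength, recorded_at), but the data, the function name graph_as_of and the choice to treat an edge recorded on the reference date itself as valid are assumptions made for the illustration. In Neo4j the same filter would be expressed on relationship properties during the traversal.

```python
from datetime import date

# (contact, other_contact, strength, recorded_at): hypothetical sample rows
edges = [
    ("ann", "bob", 2, date(2011, 1, 5)),
    ("ann", "bob", 4, date(2011, 3, 1)),
    ("bob", "cy", 3, date(2011, 6, 10)),
]


def graph_as_of(edges, as_of):
    """Keep, for each pair, the newest edge recorded on or before `as_of`."""
    latest = {}
    for src, dst, strength, recorded_at in edges:
        if recorded_at > as_of:
            continue  # this edge did not exist yet at the reference date
        key = (src, dst)
        if key not in latest or recorded_at > latest[key][1]:
            latest[key] = (strength, recorded_at)
    return {pair: strength for pair, (strength, _) in latest.items()}


if __name__ == "__main__":
    print(graph_as_of(edges, date(2011, 5, 1)))
```

Pairs whose edges were all recorded after the reference date simply do not appear in the result, which matches the "those nodes wouldn't be connected" behaviour described in the question.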
Reference node has a different meaning in Neo4j.
Using category nodes per day (and linking them and also aggregating them for higher level timespans) is the more graphy way of categorizing nodes than indexed properties. (Effectively these are in-graph indices that you can easily include in your traversals and graph queries).
You don't have to duplicate the nodes as long as you are only interested in different temporal structures. If your nodes also differ (e.g. changing properties), you could either duplicate them, effectively creating different subgraphs, or create a connected list of history nodes on each node that contain just the changes (or the full snapshot, depending on your requirements).
Your domain sounds very fitting for the graph database. If you have more and detailed questions feel free to join the Neo4j mailing list.
Not the easiest solution (I'm assuming you only work with one machine), but if you really want to separate your graphs, you only need to remember that a graph is a directory.
You can then create a dynamic loader class which takes the path of the database you want, loads it in memory for the query, and closes it after getting your answer. You could also configure a proxy server, and send 2 parameters to your loader: your query (which I presume is a cypher query in this case) and the path of the database you want to query.
This is not adequate if you have tons of real-time queries to answer. But if it is simply for storing and doing some analytics over data sets, it can definitely answer your needs.
This is an old question, but starting with Neo4j 4.x, multi-tenancy is supported and you can have different databases within the same Neo4j server (with distinct RBAC permissions).
I have a huge directed graph: It consists of 1.6 million nodes and 30 million edges. I want the users to be able to find all the shortest connections (including incoming and outgoing edges) between two nodes of the graph (via a web interface). At the moment I have stored the graph in a PostgreSQL database. But that solution is not very efficient and elegant, I basically need to store all the edges of the graph twice (see my question PostgreSQL: How to optimize my database for storing and querying a huge graph).
It was suggested to me to use a GraphDB like neo4j or AllegroGraph. However the free version of AllegroGraph is limited to 50 million nodes and also has a very high-level API (RDF), which seems too powerful and complex for my problem. Neo4j on the other hand has only a very low level API (and the python interface is not mature yet). Both of them seem to be more suited for problems, where nodes and edges are frequently added or removed to a graph. For a simple search on a graph, these GraphDBs seem to be too complex.
One idea I had would be to "misuse" a search engine like Lucene for the job, since I'm basically only searching connections in a graph.
Another idea would be to have a server process that stores the whole graph (500MB to 1GB) in memory. The clients could then query the server process, which could traverse the graph very quickly since the graph is stored in memory. Is there an easy way to write such a server (preferably in Python) using some existing framework?
Which technology would you use to store and query such a huge readonly graph?
LinkedIn have to manage a sizeable graph. It may be instructive to check out this info on their architecture. Note particularly how they cache their entire graph in memory.
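For a sense of what querying a graph held entirely in memory can look like, here is a small Python sketch that finds all shortest connections between two nodes while following edges in either direction, as the question asks. The function names and sample edges are made up, and this is a generic breadth-first search, not a description of LinkedIn's system. Whether plain Python dicts and sets are compact enough for 30 million edges is a separate question that would need measuring.

```python
from collections import defaultdict, deque


def build_index(edges):
    """Make every directed edge traversable from either of its endpoints."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    return neighbours


def all_shortest_paths(neighbours, start, goal):
    """Breadth-first search that records every predecessor on a shortest path."""
    dist = {start: 0}
    preds = defaultdict(list)
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if vertex == goal:
            break  # all predecessors of goal are known at this point
        for neighbour in neighbours[vertex]:
            if neighbour not in dist:
                dist[neighbour] = dist[vertex] + 1
                queue.append(neighbour)
            if dist[neighbour] == dist[vertex] + 1:
                preds[neighbour].append(vertex)
    if goal not in dist:
        return []

    def unwind(vertex):
        # Rebuild the paths backwards from the goal through the predecessors.
        if vertex == start:
            return [[start]]
        return [path + [vertex] for p in preds[vertex] for path in unwind(p)]

    return sorted(unwind(goal))


if __name__ == "__main__":
    index = build_index([("a", "b"), ("c", "b"), ("a", "d"), ("d", "c"), ("c", "e")])
    print(all_shortest_paths(index, "a", "e"))
```

A long-running server process would build the index once at startup and answer each request with a call like the one in the usage example.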
There is also OrientDB, an open source document-graph DBMS with a commercially friendly license (Apache 2). It offers a simple API, an SQL-like language, ACID transactions and support for the Gremlin graph language.
The SQL has extensions for trees and graphs. Example:
select from Account where friends traverse (1,7) (address.city.country.name = 'New Zealand')
This returns all the Accounts with at least one friend who lives in New Zealand, where "friend" is followed recursively up to the 7th level of depth.
I have a directed graph for which I (mis)used Lucene.
Each edge was stored as a Document, with the nodes as Fields of the document that I could then search for.
It performs well enough, and query times for fetching inbound and outbound links from a node would be acceptable to a user using it as a web based tool. But for computationally intensive batch calculations, where I am doing many hundreds of thousands of queries, I am not satisfied with the query times I'm getting. I get the sense that I am definitely misusing Lucene, so I'm working on a second, Berkeley DB based implementation so that I can do a side by side comparison of the two. If I get a chance to post the results here, I will.
However, my data requirements are much larger than yours at > 3GB, more than could fit in my available memory. As a result the Lucene index I used was on disk, but with Lucene you can use a "RAMDirectory" index in which case the whole thing will be stored in memory, which may well suit your needs.
Correct me if I'm wrong, but since each node is a list of the linked nodes, it seems to me a DB with a schema is more of a burden than an advantage.
It also sounds like Google App Engine would be right up your alley:
It's optimized for reading - and there's memcached if you want it even faster
it's distributed - so the size doesn't affect efficiency
Of course if you somehow rely on Relational DB to find the path, it won't work for you...
And I just noticed that the q is 4 months old
So you have a graph as your data and want to perform a classic graph operation. I can't see what other technology could fit better than a graph database.