I've been looking into Neo4j graph algorithms and I've seen that a number of algorithms are only available in the write back format while others have both stream and write back implementations. However, I haven't been able to find anything explaining the difference between the two.
So my questions are:
When and why is write back a better choice than stream? (Basically, what are the advantages and disadvantages of write back?)
How does write back handle graph alterations? (if we add/remove nodes or edges from the graph after running the algorithm is there any way to tell that the property is now invalid?)
From what I see all graph algos have the stream & write behaviour, except some that only have the stream one (pretty much all the path algos).
Graph algos consume a lot of resources (they work on the entire graph), so if you have a big dataset, they will take time.
That's why it's really useful to write back the result: it lets you run Cypher queries against the result of your graph algos without recomputing it.
For your question about invalidation, there is no internal mechanism to do it, but with APOC you can create a trigger to invalidate the result when a node is created/deleted or when you add/remove a relationship.
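To make the stream versus write back distinction and the invalidation problem concrete, here is a minimal sketch in plain Python. It is a toy model and not Neo4j's or APOC's API: the ToyGraph class, its version counter and the outDegree property name are all assumptions made for illustration. Stream mode hands results to the caller and stores nothing; write mode stores each result as a node property and remembers which graph version it was computed on, so a later change can be detected.

```python
from typing import Dict, Iterator, Optional, Set, Tuple


class ToyGraph:
    """Tiny in-memory graph; a stand-in for a real database, not Neo4j's API."""

    def __init__(self) -> None:
        self.adjacency: Dict[str, Set[str]] = {}
        self.properties: Dict[str, Dict[str, int]] = {}
        self.version = 0                        # bumped on every structural change
        self.written_at: Optional[int] = None   # graph version at the last write back

    def add_edge(self, source: str, target: str) -> None:
        self.adjacency.setdefault(source, set()).add(target)
        self.adjacency.setdefault(target, set())
        self.version += 1

    def stream_out_degree(self) -> Iterator[Tuple[str, int]]:
        """'Stream' mode: yield results to the caller and store nothing."""
        for node, neighbours in self.adjacency.items():
            yield node, len(neighbours)

    def write_out_degree(self, key: str = "outDegree") -> None:
        """'Write' mode: persist each result as a node property."""
        for node, degree in self.stream_out_degree():
            self.properties.setdefault(node, {})[key] = degree
        self.written_at = self.version

    def results_are_stale(self) -> bool:
        """True if nothing was written yet or the graph changed since the write."""
        return self.written_at != self.version


graph = ToyGraph()
graph.add_edge("a", "b")
graph.add_edge("a", "c")
graph.write_out_degree()
stale_before = graph.results_are_stale()   # checked right after the write back
graph.add_edge("b", "c")                   # structural change after the write back
stale_after = graph.results_are_stale()
```

In a real deployment the version bump would be the job of something like the APOC trigger mentioned above; the sketch only shows why written properties need such a marker while streamed results do not.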
Related
It's being proposed that we store data about a relationship between two vertices on the edge between them. The idea is that these two vertices are related and there are user-level pieces of information that should be stored in the graph. The best example I can think of would be a Book and a Reader, where the Reader can store cliff notes on the edges for retrieval later on.
Is this common practice? It seems to me that we should minimize the amount of data living in edges and that the vast majority of GraphDB data should be derived data, rather than using it as an actual data store. Given that it's in memory, what happens when it goes down? (We're using Neptune so... there are technically backups.)
Sorry if the question is a bit vague, but I'm not sure how else to ask. I've googled around looking for best practices and it's all pretty generic material about the concepts and theories of graph databases.
An additional question: is it common practice to expose the Gremlin API directly to users, or should there always be a GraphQL (or other) API in front of it?
Without more detail it is hard to provide exact modeling advice, but in general one of the advantages of using a graph database is that edges are first-class citizens and can carry properties. A common use case for this would be something like Person - purchases -> Product, where you might have a purchase_date on the purchases edge to represent the date of the purchase, as someone might buy the same thing multiple times.
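As a rough illustration of why the property belongs on the edge, here is a small Python sketch. It uses plain data classes rather than any graph database API, and the names (Edge, purchase_dates, the sample people and product) are hypothetical. Two purchases edges connect the same pair of vertices, and only the purchase_date on each edge tells them apart.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Any, Dict, List


@dataclass
class Edge:
    source: str
    label: str
    target: str
    properties: Dict[str, Any] = field(default_factory=dict)


# Two "purchases" edges between the same pair of vertices; only the
# purchase_date property on each edge distinguishes them.
edges: List[Edge] = [
    Edge("alice", "purchases", "coffee", {"purchase_date": date(2021, 3, 1)}),
    Edge("alice", "purchases", "coffee", {"purchase_date": date(2021, 4, 12)}),
    Edge("bob", "purchases", "coffee", {"purchase_date": date(2021, 4, 2)}),
]


def purchase_dates(person: str, product: str) -> List[date]:
    """Return every date on which `person` bought `product`, oldest first."""
    return sorted(
        e.properties["purchase_date"]
        for e in edges
        if e.source == person and e.label == "purchases" and e.target == product
    )
```

Putting the date on either vertex would not work here, because it describes one particular purchase and not the person or the product.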
I am not sure exactly what you mean by "a vast majority of GraphDB data be derived data". You can use graphs to derive and infer data/relationships based on the connections, but they fully support storing data as well.
"Given that its in memory, what happens when it goes down?" - Amazon Neptune (like most other databases) uses a buffer cache to keep some data in memory, but that data is also persisted to disk, so if the instance goes down, it can be recovered from durable storage.
An additional question, is it common practice to expose the gremlin API directly to users, or should there always be a GraphQL (or other) API in front of it? - Just as with any database, I would not recommend exposing the Gremlin API directly to consumers, as doing so comes with a whole host of potential security risks. Generally, the underlying data store of any application should be transparent to the users. They should be interacting with an interface like REST/GraphQL that is designed to answer business related questions and not really know or care that there is a graph database backing those requests.
I have a user search feature in my app where the searcher doesn't want to see some results. They do this by "blocking" a tag: when a tag is blocked, all users that are "subscribed" to that tag are excluded from the searcher's results.
I'm writing the query to filter the search results and I found 2 ways of getting the same result:
First:
g.V(1991)
  .out("blocked").fold().as("blockedTags")
  .V().hasLabel("user")
  .not(
    where(
      out("subscribed").where(
        within("blockedTags")
      )
    )
  )
Second:
g.V(1991).as("user")
  .V().hasLabel("user")
  .not(
    where(
      out("subscribed")
        .in("blocked")
        .as("user")
    )
  )
Gremlify: https://gremlify.com/xnqhvtzo6b
One uses within() and the other performs two steps, out() and in(). I want to know which one is faster so I can decide which to use, since both options are possible in many queries in my application.
EDIT:
I ran both queries in the Gremlin Console with a profile() step at the end, but the >TOTAL field gives seemingly random times from 0.300ms to 1.220ms for both queries, so I don't know how to compare the performance of the two queries.
I will offer a general answer here that is largely derived from the comments on the question itself. It really isn't possible to profile() one graph and then project those results on another. They will each have different capabilities and performance characteristics. If you need to know which of two approaches to a query is better, then you must test both traversals on the graph system you intend to target.
I'd also be wary of going too far in a particular development direction without doing ongoing testing on the target graph. Just as you wouldn't do all your development on MySQL only to switch to Oracle when it was time to go to production, you really shouldn't try to build your entire application against a graph you don't intend to use. There are subtle differences in these systems that could make a significant difference to you.
As to the differences in profile() times on TinkerGraph, there are bound to be timing differences on the JVM for what I'm guessing is a test on a small dataset that resides in memory. Or perhaps for TinkerGraph there is no significant difference between the two approaches. Consider executing each query a few thousand times, averaging the time taken and comparing the averages. Gremlin Console has a clock() function that helps with that. Of course, as I alluded to earlier, what you learn there is no guarantee that you have the right solution on Neptune.
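The "run it a few thousand times and average" idea is not specific to the Gremlin Console. As a hedged sketch, here is the same approach in Python using the standard timeit module. The mean_millis helper and the two lambdas are hypothetical stand-ins; in practice each callable would submit one of the two traversals to the graph you actually intend to use.

```python
import statistics
import timeit
from typing import Callable


def mean_millis(query: Callable[[], object], runs: int = 2000, warmup: int = 200) -> float:
    """Average wall-clock time of one call to `query`, in milliseconds."""
    for _ in range(warmup):          # let caches settle before measuring
        query()
    timer = timeit.Timer(query)
    # Five batches of `runs` calls each; each entry is the total for one batch.
    batches = timer.repeat(repeat=5, number=runs)
    return statistics.mean(batches) / runs * 1000.0


# Stand-in workloads (assumptions); replace with calls that run each traversal.
first = mean_millis(lambda: sorted(range(200)))
second = mean_millis(lambda: sorted(range(200), reverse=True))
```

Single sub-millisecond measurements are dominated by noise, which is why the comparison is made between averages over many runs.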
If you'd like a bit of analysis about your queries I could offer a few words (though I don't base this thinking on Neptune specifically). How each performs depends a lot on your graph structure, but I think I'd expect the first query to be faster because it captures the "blocked" vertices with:
.out("blocked").fold()
and re-uses that list over and over for however many V().hasLabel('user') there are. That's just a gut feeling though. I'm guessing the blocked list will be relatively small for a single user, so traversing the opposing way with:
out("subscribed").in("blocked")
would just be more expensive as you would have to traverse a lot more "blocked" edges that don't terminate with the initial vertex.
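That gut feeling can be illustrated with a toy model in Python. This is not how TinkerGraph or Neptune execute the traversals (a real engine may short-circuit or optimize differently); the data and the way edge visits are counted are assumptions chosen only to show the shape of the argument. Both functions return the same visible users, but the second walks every "blocked" edge attached to each subscribed tag.

```python
from typing import Dict, List, Set, Tuple

# Toy data (hypothetical): user -> tags they subscribe to / have blocked.
subscribed: Dict[str, Set[str]] = {
    "ann": {"cats"}, "bob": {"dogs"}, "cy": {"cats", "dogs"}, "dee": set(),
}
blocked: Dict[str, Set[str]] = {
    "ann": {"dogs"}, "bob": {"dogs"}, "cy": {"dogs"}, "dee": {"cats"},
}
# Reverse index: tag -> users who blocked it (the in("blocked") direction).
blocked_by: Dict[str, Set[str]] = {}
for user, tags in blocked.items():
    for tag in tags:
        blocked_by.setdefault(tag, set()).add(user)


def search_with_folded_set(searcher: str) -> Tuple[List[str], int]:
    """First query: collect the blocked tags once, then test membership."""
    blocked_tags = blocked[searcher]
    visits = len(blocked_tags)                   # out("blocked"), done once
    visible = []
    for user, tags in subscribed.items():
        visits += len(tags)                      # out("subscribed")
        if not tags & blocked_tags:
            visible.append(user)
    return sorted(visible), visits


def search_with_back_traversal(searcher: str) -> Tuple[List[str], int]:
    """Second query: from each subscribed tag, walk back over "blocked" edges."""
    visible, visits = [], 0
    for user, tags in subscribed.items():
        hit = False
        for tag in tags:
            visits += 1                          # out("subscribed")
            for blocker in blocked_by.get(tag, ()):
                visits += 1                      # in("blocked")
                hit = hit or blocker == searcher
        if not hit:
            visible.append(user)
    return sorted(visible), visits


folded = search_with_folded_set("ann")
back = search_with_back_traversal("ann")
```

The more users block a popular tag, the more "blocked" edges the second approach has to walk that do not lead back to the searcher, while the cost of the first approach does not depend on that number.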
I have a query (link below) I must execute once per day or once per week in my application to find groups of connected users. In the query I check all possible groups for each user of the application (not all users are evaluated but could be a lot). For the moment I'm only making performance tests in localhost using Gremlin Server, since my application is not live yet.
The problem is that when I test this query while simulating many users, it reaches the default request time limit configured in Gremlin Server. Another problem is that the query does not use the full CPU, since it seems a single query is designed to use a single thread or is otherwise limited in how much CPU it can use.
So I have 2 solutions in mind, divide the query in one chunk per user or use OLAP:
Solution 1:
Send a query to get the users first, then send one query per user, then remove duplicates in the server code. This should work in my case, and since I can send all the queries at the same time I can use all the resources available and bypass the time limits.
Solution 2:
Use OLAP. I guess OLAP does not have a time limit. The problem: My idea is to use Amazon Neptune and OLAP is not supported there as far as I know.
In this question about it:
Gremlin OLAP queries on AWS Neptune
David says:
Update: Since GA (June 2018), Neptune supports multiple queries in a single request/transaction
What does it mean "multiple queries in a single request"?
How does my solution 1 compare with OLAP?
Should I look for another database service that supports OLAP instead of Neptune? Which one could it be? I don't want an option that implies learning to set up my own "Neptune-like" server, as I have limited time.
My query in case you want to take a look:
https://gremlify.com/69cb606uzaj
This is a bit of a complicated question.
The problem is that when testing this query simulating many users the query reaches the time limit a request can take that is configured in Gremlin Server by default,
I'll assume there is a reason you can't change the default value, but for those who might be reading this answer, the timeout is configurable both at the server (with evaluationTimeout in the server yaml) and per request, for both script and bytecode-based requests.
another problem is that the query does not take full CPU usage since it seems a single query is designed to use a single thread or a reduced amount of CPU processing in some way.
If you're testing with TinkerGraph in Gremlin Server then know that TinkerGraph is really simple. It doesn't do anything internally to run any aspect of a traversal in parallel (without TinkerGraphComputer which is OLAP related).
So I have 2 solutions in mind, divide the query in one chunk per user or use OLAP:
Either approach has the potential to work. In the first solution you suggest a form of poor man's OLAP where you must devise your own methods for doing this parallel processing (i.e. manage thread pools, synchronize state, etc). I think that this approach is a common first step that folks take to deal with this sort of problem. I'd wonder if you need to be as fine grained as one user per request. I would think that sending several at a time would be acceptable but only testing in your actual environment would yield the answer to that. The nice thing about this solution is that it will typically work on any graph system, including Neptune.
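A minimal sketch of that first solution, in Python with a standard-library thread pool, might look as follows. The find_all_groups helper, the batch size and the fake_query function are all hypothetical; run_query stands in for whatever client call submits the traversal for one batch of users. Representing each group as a frozenset means duplicates found from different starting users collapse when the results are merged.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, FrozenSet, Iterable, List, Sequence, Set

Group = FrozenSet[str]


def chunked(items: Sequence[str], size: int) -> Iterable[Sequence[str]]:
    """Split the user ids into consecutive batches of at most `size`."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def find_all_groups(
    user_ids: Sequence[str],
    run_query: Callable[[Sequence[str]], List[Group]],
    batch_size: int = 50,
    workers: int = 8,
) -> Set[Group]:
    """Run one request per batch of users in parallel and merge the results."""
    groups: Set[Group] = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for batch_result in pool.map(run_query, chunked(user_ids, batch_size)):
            groups.update(batch_result)   # frozensets make duplicates collapse
    return groups


def fake_query(batch: Sequence[str]) -> List[Group]:
    # Hypothetical stand-in for submitting the Gremlin traversal for these users.
    friends = {"a": {"b"}, "b": {"a"}, "c": set()}
    return [frozenset({user} | friends[user]) for user in batch]


groups = find_all_groups(["a", "b", "c"], fake_query, batch_size=1)
```

The batch_size parameter is the knob discussed above: one user per request is the finest grain, and testing against the real environment would show whether larger batches are acceptable.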
Using your second solution with OLAP is trickier. You have the obvious problem that Neptune does not directly support it, but going to a different provider that does will not instantly solve your problem. While OLAP rids you of having to worry about how to optimally parallelize your workload, it doesn't mean that you can instantly take that Gremlin query you want to run, throw it into Spark and get an instant win. For example, and I take this from the TinkerPop Reference Documentation:
In OLAP, where the atomic unit of computing is the vertex and its local
"star graph," it is important that the anonymous traversal does not leave the
confines of the vertex’s star graph. In other words, it can not traverse to an
adjacent vertex’s properties or edges.
In your query, there are already places where you "leave the star graph", so you would immediately find problems there to solve. Usually that limitation can be worked around for OLAP purposes, but it's not as simple as adding withComputer() to your traversal and getting a win in this case.
Going further down this path of using OLAP with a graph other than Neptune, you would probably want to at least consider whether this complex traversal could be better written as a custom VertexProgram, which might better bind your use case to the capabilities of BSP than what the more generic TraversalVertexProgram does when processing arbitrary Gremlin. For that matter, a mix of Gremlin OLAP, a custom VertexProgram and some standard map/reduce style processing might ultimately lead to the most elegant and efficient answer.
An idea I've been considering for graphs that don't support OLAP has been to subgraph() (with Java) the portion of the graph that is relevant to your algorithm and then execute it locally in TinkerGraph! I think that might make sense in some use cases where the algorithm has some limits that can be defined ahead of time to form the subgraph, where those limits can be easily filtered and where the resulting subgraph is not so large that it takes an obscene amount of time to construct. It would be even better if the subgraph had some use beyond a single algorithm - almost behaving like a cache graph. I have no idea if that is useful to you but it's a thought. Here's a recent blog post I wrote that talks about writing VertexPrograms. Perhaps you will find it interesting.
All that said about OLAP, I think that your first solution seems fine to start with. You don't have a multi-billion edge graph yet and can probably afford to take this approach for now.
What does it mean "multiple queries in a single request"?
I believe that this just means that you can send a script like:
g.addV().iterate()
g.addV().iterate()
g.V()
where multiple Gremlin commands can be executed within the scope of a single transaction where each command must be "separated by newline ('\n'), spaces (' '), semicolon ('; '), or nothing (for example: g.addV(‘person’).next()g.V() is valid)". I think that only the last command returns a value. It doesn't seem like that particular feature would be helpful in your case. I would look more to batch users within a particular request where possible.
If you are looking for a native OLAP graph engine, perhaps take a look at AnzoGraphDB, which scales and performs much better for that style of more complex querying than anything else we know of. It's an MPP engine, so every core works on the query in parallel. Depending on how much data you need it to act on, the free version (single node only, RAM limited) may well be all you need and can be used commercially. You can find it in the AWS Marketplace or on Docker Hub.
Disclaimer: I work for Cambridge Semantics Inc.
I want to ask a question about graph databases.
At first I was using networkx in Python and creating the graph in memory, but as the number of nodes grew, my RAM was not enough.
So next I tried Neo4j. It's nice that it writes the graph to disk, but it's slow (or so I think; even with indexes and other tweaks, it is slower than networkx). I have now created 500k nodes and 2,000,000 relationships, and when I try to find a path between two nodes, Neo4j just gets stuck on my server.
I have heard about OrientDB, but have not tried it yet.
So I need advice: what is the best graph database that can write the graph to disk?
Many thanks.
PS: I only want an open-source graph database.
First of all, there are native graph databases and non-native graph databases. Native graph databases really organize your data in a graph structure and connect the nodes to each other, while non-native ones use some other model to store your graph representation. You can simply represent a graph as an adjacency matrix, which is a table, and you could store it in a row-key store with columns (but that wouldn't be very efficient, and in my opinion it's a bad idea). So first you need to ask yourself whether you really need a graph database. Second, you need to think about the read and write operations you want to perform.
There is no best (graph) database. There are many different databases for many different use cases, so you need to identify your exact use case and then you can think about the database.
As for your attempts with Neo4j: writing in Neo4j is indeed very slow if you do it wrong. You may like to have a look at this question and answer about write performance of Neo4j.
Almost all graph databases can write the graph to disk.
But if you're doing some calculation, such as a shortest path with a very deep search (dozens of hops), memory is much, much more important than disk.
I have a huge directed graph: It consists of 1.6 million nodes and 30 million edges. I want the users to be able to find all the shortest connections (including incoming and outgoing edges) between two nodes of the graph (via a web interface). At the moment I have stored the graph in a PostgreSQL database. But that solution is not very efficient and elegant, I basically need to store all the edges of the graph twice (see my question PostgreSQL: How to optimize my database for storing and querying a huge graph).
It was suggested to me to use a GraphDB like neo4j or AllegroGraph. However the free version of AllegroGraph is limited to 50 million nodes and also has a very high-level API (RDF), which seems too powerful and complex for my problem. Neo4j on the other hand has only a very low level API (and the python interface is not mature yet). Both of them seem to be more suited for problems, where nodes and edges are frequently added or removed to a graph. For a simple search on a graph, these GraphDBs seem to be too complex.
One idea I had would be to "misuse" a search engine like Lucene for the job, since I'm basically only searching connections in a graph.
Another idea would be to have a server process storing the whole graph (500MB to 1GB) in memory. The clients could then query the server process and traverse the graph very quickly, since the graph is stored in memory. Is there an easy way to write such a server (preferably in Python) using some existing framework?
Which technology would you use to store and query such a huge readonly graph?
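For what it's worth, the in-memory idea from the question can be sketched in a few lines of plain Python. This is only an illustration of the technique (a breadth-first search that records every predecessor lying on a shortest path), not a tuned server; the function names and sample edges are assumptions, and whether 30 million edges fit comfortably in Python sets would need to be measured. Each edge is indexed from both endpoints, so incoming and outgoing edges are both followed.

```python
from collections import defaultdict, deque
from typing import Dict, Hashable, Iterable, List, Set, Tuple

Node = Hashable


def build_adjacency(edges: Iterable[Tuple[Node, Node]]) -> Dict[Node, Set[Node]]:
    """Index every edge from both endpoints so direction is ignored in searches."""
    neighbours: Dict[Node, Set[Node]] = defaultdict(set)
    for source, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)
    return neighbours


def all_shortest_paths(
    neighbours: Dict[Node, Set[Node]], start: Node, goal: Node
) -> List[List[Node]]:
    """Breadth-first search that keeps every predecessor on a shortest path."""
    distance: Dict[Node, int] = {start: 0}
    parents: Dict[Node, List[Node]] = defaultdict(list)
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break                     # all shorter-distance nodes are already done
        for nxt in neighbours.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
            if distance[nxt] == distance[node] + 1:
                parents[nxt].append(node)
    if goal not in distance:
        return []
    # Walk the predecessor lists back from the goal to enumerate the paths.
    paths: List[List[Node]] = []
    stack: List[List[Node]] = [[goal]]
    while stack:
        path = stack.pop()
        if path[-1] == start:
            paths.append(path[::-1])
        else:
            stack.extend(path + [p] for p in parents[path[-1]])
    return paths


sample = build_adjacency([("a", "b"), ("c", "b"), ("a", "d"), ("d", "c"), ("c", "e")])
paths_a_to_c = sorted(all_shortest_paths(sample, "a", "c"))
```

A long-running process would build the adjacency index once at start-up and answer each web request with a call to all_shortest_paths, which is the read-only usage pattern described in the question.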
LinkedIn have to manage a sizeable graph. It may be instructive to check out this info on their architecture. Note particularly how they cache their entire graph in memory.
There is also OrientDB, an open source document-graph DBMS with a commercially friendly license (Apache 2). It offers a simple API, an SQL-like language, ACID transactions and support for the Gremlin graph language.
The SQL has extensions for trees and graphs. Example:
select from Account where friends traverse (1,7) (address.city.country.name = 'New Zealand')
This returns all the Accounts with at least one friend that lives in New Zealand, where "friend" is followed recursively up to 7 levels deep.
I have a directed graph for which I (mis)used Lucene.
Each edge was stored as a Document, with the nodes as Fields of the document that I could then search for.
It performs well enough, and query times for fetching inbound and outbound links from a node would be acceptable to a user using it as a web-based tool. But for computationally intensive batch calculations where I am doing many hundreds of thousands of queries, I am not satisfied with the query times I'm getting. I get the sense that I am definitely misusing Lucene, so I'm working on a second, Berkeley DB based implementation so that I can do a side-by-side comparison of the two. If I get a chance to post the results here I will do.
However, my data requirements are much larger than yours at > 3GB, more than could fit in my available memory. As a result the Lucene index I used was on disk, but with Lucene you can use a "RAMDirectory" index in which case the whole thing will be stored in memory, which may well suit your needs.
Correct me if I'm wrong, but since each node is a list of the linked nodes, it seems to me a DB with a schema is more of a burden than an advantage.
It also sounds like Google App Engine would be right up your alley:
It's optimized for reading - and there's memcached if you want it even faster
it's distributed - so the size doesn't affect efficiency
Of course if you somehow rely on Relational DB to find the path, it won't work for you...
And I just noticed that the q is 4 months old
So you have a graph as your data and want to perform a classic graph operation. I can't see what other technology could fit better than a graph database.