Graph exploration: does the choice to use incoming edges or outgoing edges affect performance?

I have been tinkering with graphs for some time, with the goal of implementing appropriate portions of the server-side stack with them. I have used Scala-Graph and Neo4j, and I am learning Spark GraphX. In almost all the applications I have implemented, the model has been that of a property graph (Node -> Edge -> Node, with attributes).
When designing the graph (DAGs, to be precise), if I spot a strong, directed relationship between two nodes, I set up an edge from one node to the other. This is obvious and intuitive. If a Person likes a Site, an edge with property 'likes' connects them. Thus:
[Nirmalya] -- (Likes) --> [StackOverFlow]
[John] -- (Likes) --> [StackOverFlow]
[Ted] -- (Likes) --> [GoogleGroups ]
[Nirmalya] -- (Likes) --> [Neo4J]
Now, using outgoing edges, I can easily find out which sites Nirmalya likes.
But when I want to find out who else likes what Nirmalya likes (i.e., John), I tend to think that I should also create an edge from the Site-type node to the Person-type node (with property 'isLikedBy'), so that the path is obvious and the traversal is intuitive. Every Person and Site must be connected in both directions, so that I can reach one from the other to answer queries like this one.
[Nirmalya] -- (Likes) --> [StackOverFlow] -- (IsLikedBy) --> [John]
But from many examples given by experts, I see that this is not what is prescribed. Instead, this is achieved by making use of operators like incoming. In other words, if two nodes have an edge set up between them, I don't need to set up both directions explicitly (just 'likes' is sufficient; 'isLikedBy' is superfluous). Perhaps the implementation of the adjacency structure makes this possible, but I get a bit confused because I am being allowed to derive the contra-direction even when that direction is not explicit in the DAG.
My question is: where is the gap in my understanding? Is it that the 'isLikedBy' direction should ideally be present, but we are optimizing it away? Alternatively, are there use cases where such bidirectional edges are necessary, and I need to learn to spot them? Am I completely missing a theoretical underpinning?
I will be glad to become wiser.

I think it depends on the software. I can speak for Neo4j, but not for the other tools that you mentioned ;)
In Neo4j, relationships are designed to be traversable both forwards and backwards without a performance cost. This applies both to traversals via the Java APIs and to Cypher. You can query with an explicit direction (incoming or outgoing) or query for relationships without regard to direction, and the performance characteristics should be the same.
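To make the "traverse backwards" idea concrete, here is a minimal sketch in Gremlin (the traversal language used elsewhere on this page; the property names are illustrative, and Cypher expresses the same thing with pattern directions). It answers "who else likes what Nirmalya likes" using only the single 'likes' edge, walked forwards and then backwards:

g.V().has('name', 'Nirmalya').
  out('likes').                   // sites Nirmalya likes
  in('likes').                    // everyone who likes those sites
  has('name', neq('Nirmalya')).   // excluding Nirmalya himself
  dedup()

No 'isLikedBy' edge is needed; in() simply reads the same stored edge from the other end.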

Related

Does an Increased Number of Node Types Impact Performance of Graph DBs?

I am in the process of creating a graph database, a simple one for movies with several types of information like the actors, producers, directors and so on.
What I would like to know is: is it better to break down your nodes to a more granular level? For example, is it better to have two kinds of nodes, 'actor' and 'director', or is it better to have one node type, say 'person', and use different kinds of relationships like 'acted_in' and 'directed'? Does this even matter at all?
Further, is there any impact on the traversal queries? Does having more types of nodes mean that the traversal is slower?
Note: I intend to implement this using the Gremlin console in Amazon Neptune.
The answer really is: it depends. If I were building such a model, I would break out the key "nouns" into their own nodes. I would also label the edges appropriately, such as ACTED_IN or DIRECTED.
The performance of any graph query depends on how much data it needs to touch (the fan-out factor as you go from depth to depth).
The best advice I can give you is think about the questions you will need the graph to answer and try to design your data model so that writing those queries is as easy as possible. Don't be afraid to iterate multiple times on your data model also. That is common and expected.
Properties can be useful when you want to add a unique piece of information to a node - perhaps the birthday of the director.
Edge properties can be useful for filtering out unneeded edges, but edge labels can be too. In some cases you may find that a label such as DIRECTED-IN-2005 is a useful shortcut to avoid checking both a label and a property on an edge.
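As a sketch of the single 'person' node type with role-bearing edge labels (the movie title and property names here are assumptions for illustration, not from the question):

g.V().has('movie', 'title', 'Heat').   // find the movie vertex
  in('DIRECTED').                      // only DIRECTED edges are touched
  values('name')

Because the traversal filters on the edge label, the presence of other node or edge types in the graph does not add to the fan-out of this query.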

How to do efficient bulk upserts (insert new vertex, or update property(ies)) in Gremlin?

Context:
I have a graph with about 2,000 vertices and 6,000 edges, which over time might grow to 10,000 vertices and 100,000 edges. Currently I am upserting the new vertices using the following traversal query:
Upserting Vertices & Edges
queryVertex = "g.V().has(label, name, foo).fold().coalesce(
unfold(), addV(label).property(name, foo).property(model, 2)
).property(model, 2)"
The intent here is to look for a vertex named foo and, if found, update its model property; otherwise create a new vertex and set the model property. This is issued twice: once for the source vertex and then for the target vertex.
Once the two related vertices are created, another query is issued to create the edge between them:
queryEdge = "g.V('id_of_source_vertex').coalesce(
outE(edge_label).filter(inV().hasId('id_of_target_vertex')),
addE(edge_label).to(V('id_of_target_vertex'))
).property(model, 2)"
Here, if there is an edge between the two vertices, the model property on the edge is updated; otherwise the edge between them is created.
And the pseudocode that does this is something like the following:
for each edge in the list of new edges:
    // upsert source and target vertices:
    execute queryVertex for edge.source
    execute queryVertex for edge.target
    // upsert edge:
    execute queryEdge
This works, but it is highly inefficient; for the mentioned graph size it takes several minutes to finish, and with some in-app concurrency it reduces the time only by a couple of minutes. Surely there must be a more efficient way of doing this for such a small graph.
Question
* How can I make these upserts faster?
Bulk loading should typically be relegated to the provider-specific tools that are optimized to handle such tasks. Gremlin really doesn't provide abstractions to cover the diverse group of bulk loader tools that are out there for each of the various graph database systems that implement TinkerPop. For Neptune, which is how you tagged your question, that would mean using the Neptune Bulk Loader.
Speaking specifically to your question, though, you might see some optimizations to the approach you described. From a Gremlin perspective, I imagine you would see some savings here by submitting a single Gremlin request per edge, combining your existing traversals:
// upsert the source vertex...
g.V().has(label, name, foo).fold().
  coalesce(unfold(),
           addV(label).property(name, foo)).
  property(model, 2).as('source').
  // ...then the target vertex...
  V().has(label, name, bar).fold().
  coalesce(unfold(),
           addV(label).property(name, bar)).
  property(model, 2).as('target').
  // ...then the edge between them, reusing the two step labels
  coalesce(inE(edge_label).where(outV().as('source')),
           addE(edge_label).from('source').to('target')).
  property(model, 2)
I think I got that right - untested, but hopefully you get the idea. Basically, we just reference the vertices already in memory via step labels so that we don't need to requery them. If you continue with Gremlin-style bulk loading, you might try other tactics as well, like ordering your edges so that you can batch together more edge loads to reduce the number of vertex lookups, and submitting vertex/edge data in a more dynamic fashion as described here.

Graph database design: Should I add relationships, or just traverse

I have recently started exploring graph databases and Neo4j, and would like to work with my own data. At the moment I've hit some confusion. I've created an example image to illustrate my issue. In terms of efficiency, I'm wondering which option is better (and I want to get it right now, in the early days, before I start handling larger amounts of data).
Option A: Using only the blue relationships, I can work out whether things are related to, or come under, the Ancient group. This process will be done many, many times; however, it is unlikely to go more than ~6 generations deep.
Option B: I implement the red relationships, so that it is much faster to work out if young structures belong to the Ancient group.
I'm trying not to use labels in this scenario, as I'm reserving labels for a specific purpose to simplify my life (linking structures across separate networks), and I'm not sure if I should have a label to represent a node that already exists.
In summary, I'm wondering whether adding a whole bunch of new relationships, whilst taking more space, is worth it, or whether traversing to find all relatives is such a simple/inexpensive task that it isn't worth doing. Or, alternatively, both options are viable and this isn't a real issue at all. Thanks for reading.
I'd go with Option A. One of the strengths of Neo4j is that it traverses relationships very efficiently and quickly, and so, there is no need to materialise relationships (sometimes, relationships are materialised in complex and/or extremely large graphs, but this is not your case).
Not sure why you don't want to use labels? Labels serve to group nodes into sets of the same type, and are also index-backed; this makes it much faster to find the starting point of your query (an index lookup rather than a full database scan).
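For a sense of what Option A looks like as a query, here is a sketch in Gremlin, the traversal language used elsewhere on this page (in Cypher this would be a variable-length pattern; the labels and property names are made up for illustration):

g.V().has('name', 'YoungStructure').        // start at the young structure
  repeat(out('PART_OF')).emit().times(6).   // walk up to ~6 generations
  has('name', 'Ancient').                   // did we reach the Ancient group?
  hasNext()                                 // true if it is reachable

A bounded walk like this over a handful of hops is exactly the kind of work a native graph store does cheaply, which is why materialising the red relationships buys little here.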

Understanding Neo4j Algo write back option

I've been looking into the Neo4j graph algorithms, and I've seen that a number of algorithms are only available in write-back form while others have both stream and write-back implementations. However, I haven't been able to find anything explaining the difference between the two.
So my questions are:
When and why is write-back a better implementation than stream? (Basically, what are the advantages and disadvantages of write-back?)
How does write-back handle graph alterations? (If we add/remove nodes or edges from the graph after running the algorithm, is there any way to tell that the property is now invalid?)
From what I see, all graph algos have both the stream and write-back behaviours, except for some that only have the stream one (pretty much all the path algos).
Graph algos consume a lot of resources (they work on the entire graph), so if you have a big dataset, they will take time.
That's why it's really useful to write back the result: it allows you to run Cypher queries based on the results of your graph algos.
For your question about invalidation, there is no internal mechanism for it, but with APOC you can create a trigger to invalidate the result when a node is created/deleted or when you add/remove a relationship.

How to ensure a single edge in a graph for a given order_id?

My current scenario is that I have product, customer, and seller nodes in my graph ecosystem. The problem I am facing is that I have to ensure the uniqueness of
(customer)-[buys]->(product)
with order_item_id as a property of the buys edge. I have to ensure that there is a unique buys edge for a given order_item_id. In this way I want to ensure that my graph insertion remains idempotent and no repeated buys edges are created for a given order_item_id.
Creating an order_item_id property:
// reuse the property key if it already exists, otherwise create it
if (!mgmt.getPropertyKey("order_item_id")) {
    order_item_id = mgmt.makePropertyKey("order_item_id").dataType(Integer.class).make()
} else {
    order_item_id = mgmt.getPropertyKey("order_item_id")
}
What I have found so far is that building a unique index might solve my problem, like:
// reuse the graph index if it already exists, otherwise build a unique composite index
if (mgmt.getGraphIndex("order_item_id")) {
    ridIndexBuilder = mgmt.getGraphIndex("order_item_id")
} else {
    ridIndexBuilder = mgmt.buildIndex("order_item_id", Edge.class).
                           addKey(order_item_id).
                           unique().
                           buildCompositeIndex()
}
Or I can also use something like
mgmt.buildEdgeIndex(graph.getOrCreateEdgeLabel("product"), "uniqueOrderItemId", Direction.BOTH, order_item_id)
How should I ensure the uniqueness of a single buys edge for a given order_item_id? (I don't have a use case to search based on order_item_id.)
What is the basic difference between creating an index on an edge using buildIndex and using buildEdgeIndex?
You cannot enforce the uniqueness of properties at the edge level, i.e. between two vertices (see this question on the mailing list). If I understand your problem correctly, building a CompositeIndex on edges with a uniqueness constraint for the given property should address your problem, even though you do not plan to search these edges by the indexed property. However, this could lead to performance issues when inserting many edges simultaneously, due to locking. Depending on the rate at which you insert data, you may have to skip the locking (= skip the uniqueness constraint) and risk duplicate edges, then handle the deduplication yourself at read time and/or run post-import batch jobs to clean up potential duplicates.
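A hedged, untested sketch of that setup against the Titan management API (the index name is assumed, and whether you keep the lock is exactly the trade-off described above):

mgmt = graph.openManagement()
order_item_id = mgmt.makePropertyKey("order_item_id").dataType(Integer.class).make()
// global composite index over edges, with a uniqueness constraint
index = mgmt.buildIndex("byOrderItemId", Edge.class).
             addKey(order_item_id).
             unique().
             buildCompositeIndex()
// the constraint is only enforced under locking; this is also what hurts
// concurrent insert throughput
mgmt.setConsistency(index, ConsistencyModifier.LOCK)
mgmt.commit()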
buildIndex() builds a global graph index (either a CompositeIndex or a MixedIndex). These kinds of indexes typically allow you to quickly find the starting points of your traversal within a graph.
buildEdgeIndex(), however, lets you build a local "vertex-centric index", which is meant to speed up traversals through vertices with a potentially high degree (= a huge number of incident edges, incoming and/or outgoing). Such vertices are called "super nodes" (see the "A solution to the supernode problem" blog post from Aurelius); although they tend to be quite rare, the likelihood of traversing them isn't that low.
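By contrast, a vertex-centric index for your buys edges might look like the following sketch (index name assumed; note that it enforces no uniqueness, it only organizes the incident edges of each vertex by the key):

buys = mgmt.getOrCreateEdgeLabel("buys")
// local index: speeds up walking the (possibly many) buys edges of a single
// customer or product vertex, filtered by order_item_id
mgmt.buildEdgeIndex(buys, "buysByOrderItemId", Direction.BOTH, order_item_id)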
Reference: Titan documentation, Ch. 8 - Indexing for better Performance.
