Light Weight Edge - graph

Can you define a lightweight edge properly?
Can you give a short example of the concept of a lightweight edge as you understand it?
What is the importance of lightweight edges in graphs, and what are their disadvantages?

From the official documentation:
Lightweight edges
Starting from OrientDB v1.4.x edges, by default, are managed as lightweight edges: they don't have own identities as record, but are physically stored as links inside vertices. OrientDB automatically uses Lightweight edges only when edges have no properties, otherwise regular edges are used. From the logic point of view, lightweight edges are edges at all the effects, so all the graph functions work correctly. This is to improve performance and reduce the space on disk. But as a consequence, since lightweight edges don't exist as separate records in the database, the following query will not return the lightweight edges:
SELECT FROM E
In most of the cases Edges are used from Vertices, so this doesn't cause any particular problem. In case you need to query Edges directly, even those with no properties, disable lightweight edge feature by executing this command once:
ALTER DATABASE CUSTOM useLightweightEdges=false
This will only take effect for new edges. For more information look at: https://github.com/orientechnologies/orientdb/wiki/Troubleshooting#why-i-cant-see-all-the-edges.
For more information look at Graph API.
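Since the question asked for a short example, here is a minimal sketch of the idea in Python. It is a toy in-memory model and not OrientDB's actual storage code: the names ToyGraph, add_edge, out and select_from_e are made up for illustration. It only shows why a property-less edge stored as a direct link is still traversable from its vertex but does not show up when edge records are scanned.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    rid: str
    # Each entry is either the RID of an edge record (regular edge)
    # or the RID of the target vertex itself (lightweight edge).
    out_links: list = field(default_factory=list)


class ToyGraph:
    """Toy storage model (hypothetical), not OrientDB's implementation."""

    def __init__(self, use_lightweight_edges=True):
        self.use_lightweight_edges = use_lightweight_edges
        self.vertices = {}
        self.edge_records = {}  # the only thing a "SELECT FROM E" style scan sees

    def add_vertex(self, rid):
        self.vertices[rid] = Vertex(rid)

    def add_edge(self, src, dst, **props):
        if self.use_lightweight_edges and not props:
            # Lightweight: no edge record, just a link inside the source vertex.
            self.vertices[src].out_links.append(dst)
            return None
        # Regular: the edge gets its own record and identity.
        rid = f"#e:{len(self.edge_records)}"
        self.edge_records[rid] = {"out": src, "in": dst, **props}
        self.vertices[src].out_links.append(rid)
        return rid

    def out(self, src):
        """Traverse from a vertex; works for both kinds of edge."""
        targets = []
        for link in self.vertices[src].out_links:
            record = self.edge_records.get(link)
            targets.append(record["in"] if record else link)
        return targets

    def select_from_e(self):
        """Scan edge records only, so lightweight edges are missing."""
        return list(self.edge_records.values())
```

In this model an edge added with no properties returns no identity of its own, which is the trade-off the documentation describes: less space and one less hop, at the price of not being able to query the edge directly.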

How to do efficient bulk upserts (insert new vertex, or update property(ies)) in Gremlin?

Context:
I have a graph with about 2000 vertices and 6000 edges; over time this might grow to 10000 vertices and 100000 edges. Currently I am upserting the new vertices using the following traversal query:
Upserting Vertices & Edges
queryVertex = "g.V().has(label, name, foo).fold().coalesce(
    unfold(), addV(label).property(name, foo).property(model, 2)
).property(model, 2)"
The intent here is to look for a vertex named foo and, if it is found, update its model property; otherwise create a new vertex and set the model property. This is issued twice: once for the source vertex and once for the target vertex.
Once the two related vertices are created, another query is issued to create the edge between them:
queryEdge = "g.V('id_of_source_vertex').coalesce(
    outE(edge_label).filter(inV().hasId('id_of_target_vertex')),
    addE(edge_label).to(V('id_of_target_vertex'))
).property(model, 2)"
Here, if there is already an edge between the two vertices, the model property on that edge is updated; otherwise the edge is created between them.
The pseudocode that does this is roughly as follows:
for each edge in the list of new edges:
    // upsert source and target vertices:
    execute queryVertex for edge.source
    execute queryVertex for edge.target
    // upsert edge:
    execute queryEdge
This works, but it is highly inefficient: for the graph size mentioned above it takes several minutes to finish, and adding some in-app concurrency reduces the time by only a couple of minutes. Surely there must be a more efficient way of doing this for such a small graph.
Question
* How can I make these upserts faster?
Bulk loading should typically be relegated to the provider specific tools that are optimized to handle such tasks. Gremlin really doesn't provide abstractions to cover the diverse group of bulk loader tools that are out there for each of the various graph database systems that implement TinkerPop. For Neptune, which is how you tagged your question, that would mean using the Neptune Bulk Loader.
Speaking specifically to your question, though, you might see some optimizations to the approach you described. From a Gremlin perspective, I imagine you would see some savings by submitting a single Gremlin request per edge, combining your existing traversals:
g.V().has(label, name, foo).fold().
  coalesce(unfold(),
           addV(label).property(name, foo)).
  property(model, 2).as('source').
  V().has(label, name, bar).fold().
  coalesce(unfold(),
           addV(label).property(name, bar)).
  property(model, 2).as('target').
  coalesce(inE(edge_label).where(outV().as('source')),
           addE(edge_label).from('source').to('target')).
  property(model, 2)
I think I got that right - untested, but hopefully you get the idea. Basically, we just reference the vertices already in memory via step labels so that we don't need to requery them. You might try other tactics as well if you continue with Gremlin-style bulk loading like ordering your edges so that you could batch together more edge loads to reduce the amount of vertex lookups and submit vertex/edge data in a more dynamic fashion as described here.
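To make the "order your edges and batch them" idea a bit more concrete, here is a rough client-side sketch in Python. It does not talk to Gremlin at all; plan_upserts and chunked are hypothetical helper names, and how each batch is turned into a request is left open. It only shows the bookkeeping: each vertex name is upserted once instead of once per edge, and edges are grouped by source so one source lookup can serve several edges.

```python
from collections import defaultdict


def plan_upserts(edges):
    """Split (source, target) pairs into distinct vertices and grouped edges.

    Returns (vertex_names, edges_by_source): vertex_names lists every
    vertex once, in first-seen order; edges_by_source maps each source
    to its distinct targets.
    """
    vertex_names = []
    seen = set()
    edges_by_source = defaultdict(list)
    for source, target in edges:
        for name in (source, target):
            if name not in seen:
                seen.add(name)
                vertex_names.append(name)
        if target not in edges_by_source[source]:
            edges_by_source[source].append(target)
    return vertex_names, dict(edges_by_source)


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


new_edges = [("foo", "bar"), ("foo", "baz"), ("bar", "baz"), ("foo", "bar")]
vertex_names, edges_by_source = plan_upserts(new_edges)
vertex_batches = list(chunked(vertex_names, 2))
```

With the numbers from the question, the original loop issues three requests per edge, so about 18000 requests for 6000 edges, of which 12000 are vertex upserts. Deduplicating first brings the vertex upserts down to the roughly 2000 distinct vertices, before any batching of several upserts into one request.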

graph representation tool for huge data with specific features

I'm building a JavaFX app and I want to display an interactive graph of my huge data in it, something like embedding Cytoscape in a JavaFX app and working with the graph inside my app. The graph may have up to 30000 nodes at most, but usually it has about 200 nodes after filtering.
Key features (sorted by importance):
1. generating a good-looking graph with the best layout, good performance and low overlap (same as Cytoscape)
2. selecting some nodes and marking them (same as Ctrl+L in Cytoscape)
3. selecting the neighbours of some nodes
4. building a new graph from number 3
5. filtering the graph based on weights, number of edges and so on
6. hiding and showing selected edges and nodes
7. capturing an image of the built graph
Additional features:
zoom in and zoom out
node tagging
multi-colour nodes and edges
changing the width of edges based on weight
changing the colour of specific nodes and edges without rebuilding the graph
directed edge support
I have tested cytoscape.js but couldn't use it in the JavaFX browser. I'm testing WebVOWL now. Is there anything better than these for my purpose? If you suggest something that can't be placed in a JavaFX app directly, please show how to do it.
Thanks
Depending on what you're trying to do, you could use Cytoscape as the data model and build a JavaFX renderer around it. I've wanted to do this, but it's not on the roadmap associated with our funding.
I've done a few JavaFX projects that might be good starting points, but they don't integrate directly with Cytoscape, which has a richer model of subnetworks, groups, etc.
https://github.com/AdamStuart/appFX/tree/master2/src/main/java/diagrams
one of which is based on a great example from TESIS DYNAware GmbH.
As you realize, the key issue is filtering down the network before trying to visualize it. The number of edges associated with 30000 nodes will bog down most any system, if you try to build something interactive.
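As a rough illustration of "filter first, render second", the following Python sketch reduces an edge list before anything is handed to a renderer. It is library-agnostic and the function names are made up for this example; the same logic would carry over to whatever data model the JavaFX app uses. It covers three of the requested operations: filtering by weight, selecting the neighbours of some nodes, and building a new graph from a selection.

```python
def filter_edges(edges, min_weight):
    """Keep only (u, v, weight) edges whose weight is at least min_weight."""
    return [(u, v, w) for u, v, w in edges if w >= min_weight]


def neighbours(edges, selected):
    """Return nodes adjacent to any selected node, excluding the selection."""
    selected = set(selected)
    found = set()
    for u, v, _ in edges:
        if u in selected:
            found.add(v)
        if v in selected:
            found.add(u)
    return found - selected


def induced_subgraph(edges, nodes):
    """Keep only edges whose endpoints are both in `nodes`."""
    nodes = set(nodes)
    return [(u, v, w) for u, v, w in edges if u in nodes and v in nodes]


edges = [("a", "b", 0.9), ("b", "c", 0.2), ("c", "d", 0.7), ("a", "d", 0.5)]
strong = filter_edges(edges, 0.5)
around_a = neighbours(edges, {"a"})
small_graph = induced_subgraph(edges, {"a"} | around_a)
```

Only the reduced edge list (small_graph here) would then be laid out and drawn, which keeps the interactive part working on hundreds of nodes rather than tens of thousands.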

How to reduce the number of edges with the same label between two vertices in Titan

Let's say we have two types of vertex, LOGIN_USER (property: user_id) and IP (property: ip), and the edge between them is LOGIN (properties: session_id, login_time).
The problem with this model is that there are too many edges between one USER and one IP (possibly thousands).
Is there any way to reduce the number of edges between the two vertices while still keeping the session_id and login_time properties? We want to filter on these two properties in some queries.
Edge properties don't support cardinality list, which vertex properties do.
If we put all the edge properties into the vertex, does that affect the performance of fetching the vertex?
When does Titan load the properties of a vertex? When traversing to a vertex, say with g.V(1).next(), does Titan load all the properties of that vertex?
When you say "thousands" of edges between USER and IP, do you think it could actually be "millions" or "tens of millions" or more? If not, then "thousands" should not be a problem for Titan with vertex centric indices. Index your edge properties and you should have fast ordering and traversals.
When you start to get deep into "millions", you might start to experience some problems - for me that has always been with processing global queries with titan-hadoop as the Vertex and its edges must be held in memory. That can make for some trouble spots when you're doing global analytics. From an operational perspective, Titan was always happy to keep writing edges into the millions on a vertex, but I'd tend to avoid it. Of course, much of my experience with this came before vertex cutting in Titan 1.0:
Cutting a vertex means storing a subset of that vertex’s adjacency
list on each partition in the graph. In other words, the vertex and
its adjacency list is partitioned thereby effectively distributing the
load on that single vertex across all of the instances in the cluster
and removing the hot spot.
which you might experiment with as you start to grow supernodes into the millions.
I suppose the other option for supernodes in the millions of edges would be to model around it. Perhaps you introduce some structure between USER and IP. Convert that single LOGIN edge to some vertices/edges that might introduce a time concept between them like:
USER -> LOGIN_YEAR -> LOGIN_MONTH -> IP
So now, instead of creating just one edge between USER and IP you create a LOGIN_YEAR vertex and a LOGIN_MONTH vertex.
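A small Python sketch of that time-bucketed model follows. It only builds the edge list in memory and makes two assumptions that are mine and not part of the suggestion above: the LOGIN_YEAR and LOGIN_MONTH vertices are created per user (keyed as "user/year" and "user/year/month"), and session_id and login_time are kept as properties on the last hop, from LOGIN_MONTH to IP.

```python
from datetime import datetime


def login_path(user_id, ip, login_time):
    """Return the three hops USER -> LOGIN_YEAR -> LOGIN_MONTH -> IP."""
    year_key = f"{user_id}/{login_time.year}"
    month_key = f"{year_key}/{login_time.month:02d}"
    return [
        (("USER", user_id), ("LOGIN_YEAR", year_key)),
        (("LOGIN_YEAR", year_key), ("LOGIN_MONTH", month_key)),
        (("LOGIN_MONTH", month_key), ("IP", ip)),
    ]


def build_edges(logins):
    """logins: iterable of (user_id, ip, session_id, login_time).

    Structural edges (the first two hops) are deduplicated; the last hop
    is kept once per login and carries the session properties.
    """
    structural = set()
    month_to_ip = []
    for user_id, ip, session_id, login_time in logins:
        path = login_path(user_id, ip, login_time)
        structural.update(path[:2])
        props = {"session_id": session_id, "login_time": login_time}
        month_to_ip.append((path[2], props))
    return structural, month_to_ip


logins = [
    ("u1", "10.0.0.1", "s1", datetime(2016, 3, 1, 9, 0)),
    ("u1", "10.0.0.1", "s2", datetime(2016, 3, 15, 9, 0)),
    ("u1", "10.0.0.1", "s3", datetime(2017, 1, 2, 9, 0)),
]
structural, month_to_ip = build_edges(logins)
user_edges = [e for e in structural if e[0] == ("USER", "u1")]
```

Under these assumptions the USER vertex ends up with one edge per year in which the user logged in, however many logins there were, and a query for a given period only has to walk into the matching year and month buckets.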

Graph exploration: does the choice to use incoming edges or outgoing edges affect performance?

I have been tinkering with Graphs for some time, with the objective that I implement appropriate portions of the server-side stack using them. I have used Scala-Graph and Neo4J, and I am learning Spark GraphX. In almost all the applications I have implemented, the model has been that of a Property Graph (Node -> Edge -> Node, with attributes).
When designing the graph (DAGs to be precise), if I spot a strong and directed relationship between two nodes, I set up an edge from one node to the other. This is obvious and intuitive. If a Person likes a Site, an edge with property 'likes' connects them. Thus:
[Nirmalya] -- (Likes) --> [StackOverFlow]
[John] -- (Likes) --> [StackOverFlow]
[Ted] -- (Likes) --> [GoogleGroups ]
[Nirmalya] -- (Likes) --> [Neo4J]
Now, using outgoing edges, I can easily find out which sites Nirmalya likes.
But when I want to find out who else likes what Nirmalya likes (i.e., John), I tend to think that I should also create an edge from the Site-type node to the Person-type node (with property 'isLikedBy'), so that the path is obvious and the traversal is intuitive. Every Person and Site would then have to be connected in both directions, so that I can reach either one from the other to answer queries like this one.
[Nirmalya] -- (Likes) --> [StackOverFlow] -- (IsLikedBy) --> [John]
But from many examples given by experts, I see that this is not prescribed. Instead, this is achieved by making use of operators like incoming. In other words, if two Nodes have an edge set up between them, I don't need to set both the directions of the edge explicitly (just 'likes' is sufficient, 'isLikedBy' is superfluous). Implementation of adjacency matrix makes this possible perhaps but I get a bit confused because I am being allowed to derive a contra-direction even when that direction is not explicit in the DAG.
My question is where is the gap in my understanding? Is it that 'IsLikedBy' direction should ideally be present, but we are optimizing? Alternatively, is it that there can be UseCases where such bidirectional edges are necessary and I need to spot them? Am I completely missing a theoretical underpinning?
I will be glad to become wiser.
I think it depends on the software. I can speak for Neo4j, but not for the other tools that you mentioned ;)
In Neo4j, relationships are designed to be traversable both forwards and backwards without a performance cost. This applies to traversing with the Java APIs as well as with Cypher. You can query with a specific direction (incoming or outgoing) or query for relationships without regard to direction, and you should see the same performance characteristics either way.
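To see why a single directed edge can be enough, here is a toy Python sketch. It is not how Neo4j stores relationships internally; LikesGraph and its methods are hypothetical names. The point is only that one stored 'likes' edge can be indexed from both of its endpoints, so the 'isLikedBy' direction is derived and never has to be written as a second edge.

```python
from collections import defaultdict


class LikesGraph:
    """One directed edge per like, reachable from either endpoint."""

    def __init__(self):
        self._out = defaultdict(set)  # person -> sites (outgoing)
        self._in = defaultdict(set)   # site -> persons (incoming)

    def add_like(self, person, site):
        # A single logical edge, registered at both of its ends.
        self._out[person].add(site)
        self._in[site].add(person)

    def likes(self, person):
        """Follow outgoing edges from a person."""
        return set(self._out.get(person, ()))

    def liked_by(self, site):
        """Follow the same edges backwards from a site."""
        return set(self._in.get(site, ()))

    def who_else_likes(self, person):
        """Go out along 'likes', then back in along 'likes'."""
        others = set()
        for site in self.likes(person):
            others |= self.liked_by(site)
        others.discard(person)
        return others


g = LikesGraph()
g.add_like("Nirmalya", "StackOverFlow")
g.add_like("John", "StackOverFlow")
g.add_like("Ted", "GoogleGroups")
g.add_like("Nirmalya", "Neo4J")
shared = g.who_else_likes("Nirmalya")
```

With the four edges from the question, the out-then-in walk from Nirmalya reaches John through StackOverFlow without any 'isLikedBy' edge being stored.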

What is wrong with light weight edges?

I created an edge without attributes and, guess what, it was created but I still cannot query it. Then I created the same edge again, and now both edges have the same RID?
I suggest you start learning OrientDB from the tutorial. This is an extract:
Starting from OrientDB v1.4.x edges, by default, are managed as lightweight edges: they don't have own identities as record, but are physically stored as links inside vertices. OrientDB automatically uses Lightweight edges only when edges have no properties, otherwise regular edges are used. From the logic point of view, lightweight edges are edges at all the effects, so all the graph functions work correctly. This is to improve performance and reduce the space on disk. But as a consequence, since lightweight edges don't exist as separate records in the database, the following query will not return the lightweight edges:
SELECT FROM E
In most of the cases Edges are used from Vertices, so this doesn't cause any particular problem. In case you need to query Edges directly, even those with no properties, disable lightweight edge feature by executing this command once:
ALTER DATABASE CUSTOM useLightweightEdges=false
This will only take effect for new edges. For more information look at Troubleshooting.
You can query for a list of names of edges with:
select name from ( select expand(classes) from metadata:schema ) where superClass="E"
