TITAN : Identify and remove duplicate vertices in graph - graph

I am using TITAN 0.4 over Cassandra, I have indexed my key ("ip_address" in my case), but as NON-UNIQUE, for performance and scalability.
Now the challenge is graph allows duplicates vertices.
I am running a background task to cleanup the duplicate vertices in graph, by iterating through all vertices.
What is the best way or approach to identify a duplicate vertex in a graph.
The the estimated size of graph in production is around 10M ~ 15M vertices or even more than that.
Is there any feature exist in TITAN index, which helps to easily identify a duplicate?
Thanks in advance
Index creation Gremlin script
g.makeKey("ip_address").dataType(String.class).indexed("standard",Vertex.class).make();

I would start with a Titan/Hadoop job:
g.V().ip_address.groupCount()
Then use those IP addresses with a count > 1 to clean up / merge duplicated vertices in OLTP mode.

Related

Is there any efficient way to find all connected subgraphs in NebulaGraph

I have several isolated subgraphs in a graph space like: [1-2, 2-3], [4-5,5-6], [7-8]. I want to get subsets of nodes of all connected subgraphs such as:
[1,2,3], [4,5,6], [7,8] # 3 subgraphs
Can I get these results in an efficient way by the nGQL in Nebula Graph?
This is typically a Graph Computation job rather than a graph database query.
In theory, we could do this with LOOKUP xxx | GET SUBGRAPH (or of course with a complex multiple MATCH OpenCypher query) while, it in most cases(unless we are in an extremely small dataset, or, the graph is quite isolated).
Instead, if we are talking about all isolated subgraphs, we should go for NebulaGraph Algorithm, which is Open Source, too.

How to write this in Cypher

I have around 644 nodes in my graph database(Neo4j) . I need to compute distances between all these 644 nodes and display it visually in the GUI. I want to pre-compute and store the distances between every two pairs of nodes in the database itself rather than retrieving the nodes on to the server and then finding the distances between them on the fly and then showing on the GUI.
I want to understand how to write such a query in CYPHER. Please let me know.
I think this can work:
// half cross product
match (a),(b)
where id(a) < id(b)
match p=shortestPath((a)-[*]-(b))
with a,b,length(p) as l
create (a)-[:DISTANCE {distance:l}]->(b)
Set 4950 properties, created 4950 relationships, returned 0 rows in 4328 ms
But the browser viz will blow up with this, just that you know.
Regarding your distance measure (it won't be that fast but should work):
MATCH (a:User)-[:READ]->(book)<-[:READ]-(b:User)
WITH a,b,count(*) as common,
length(a-[:READ]->()) as a_read,
length(b-[:READ]->()) as b_read
CREATE UNIQUE (a)-[:DISTANCE {distance:common/(a_read+b_read-common)}]-(b)

Reachable vertices from each other

Given a directed graph, I need to find all vertices v, such that, if u is reachable from v, then v is also reachable from u. I know that, the vertex can be find using BFS or DFS, but it seems to be inefficient. I was wondering whether there is a better solution for this problem. Any help would be appreciated.
Fundamentally, you're not going to do any better than some kind of search (as you alluded to). I wouldn't worry too much about efficiency: these algorithms are generally linear in the number of nodes + edges.
The problem is a bit underspecified, so I'll make some assumptions about your data structure:
You know vertex u (because you didn't ask to find it)
You can iterate both the inbound and outbound edges of a node (efficiently)
You can iterate all nodes in the graph
You can (efficiently) associate a couple bits of data along with each node
In this case, use a convenient search starting from vertex u (depth/breadth, doesn't matter) twice: once following the outbound edges (marking nodes as "reachable from u") and once following the inbound edges (marking nodes as "reaching u"). Finally, iterate through all nodes and compare the two bits according to your purpose.
Note: as worded, your result set includes all nodes that do not reach vertex u. If you intended the conjunction instead of the implication, then you can save a little time by incorporating the test in the second search, rather than scanning all nodes in the graph. This also relieves assumption 3.

Cycle detection in graphs containing multiple cycles

I have the following graph:
Is there a way I can identify all cycles in this graph? I know that DFS can be used to detect cycles by simply doing DFS until a back edge is found, but I was wondering if there is a computationally efficient way to return the individual cycles, considering that there are actually 3 cycles in the graph (1-2-3-4-5-6, 4-5-7-8-9, 1-2-3-4-9-8-7-5-6). I am a bit stuck because it seems like the carbon atom belongs to multiple graphs and I can't think of any way other than brute-forcing all possible paths originating from every vertex.
You don't have to find all pathes from EVERY vertex.
Only vertex refering to 3 or more other may belong to multiple cycles
You have to check only 4,5,6(,9)

Three ways to store a graph in memory, advantages and disadvantages

There are three ways to store a graph in memory:
Nodes as objects and edges as pointers
A matrix containing all edge weights between numbered node x and node y
A list of edges between numbered nodes
I know how to write all three, but I'm not sure I've thought of all of the advantages and disadvantages of each.
What are the advantages and disadvantages of each of these ways of storing a graph in memory?
One way to analyze these is in terms of memory and time complexity (which depends on how you want to access the graph).
Storing nodes as objects with pointers to one another
The memory complexity for this approach is O(n) because you have as many objects as you have nodes. The number of pointers (to nodes) required is up to O(n^2) as each node object may contain pointers for up to n nodes.
The time complexity for this data structure is O(n) for accessing any given node.
Storing a matrix of edge weights
This would be a memory complexity of O(n^2) for the matrix.
The advantage with this data structure is that the time complexity to access any given node is O(1).
Depending on what algorithm you run on the graph and how many nodes there are, you'll have to choose a suitable representation.
A couple more things to consider:
The matrix model lends itself more easily to graphs with weighted edges, by storing the weights in the matrix. The object/pointer model would need to store edge weights in a parallel array, which requires synchronization with the pointer array.
The object/pointer model works better with directed graphs than undirected graphs because the pointers would need to be maintained in pairs, which can become unsynchronized.
The objects-and-pointers method suffers from difficulty of search, as some have noted, but are pretty natural for doing things like building binary search trees, where there's a lot of extra structure.
I personally love adjacency matrices because they make all kinds of problems a lot easier, using tools from algebraic graph theory. (The kth power of the adjacency matrix give the number of paths of length k from vertex i to vertex j, for example. Add an identity matrix before taking the kth power to get the number of paths of length <=k. Take a rank n-1 minor of the Laplacian to get the number of spanning trees... And so on.)
But everyone says adjacency matrices are memory expensive! They're only half-right: You can get around this using sparse matrices when your graph has few edges. Sparse matrix data structures do exactly the work of just keeping an adjacency list, but still have the full gamut of standard matrix operations available, giving you the best of both worlds.
I think your first example is a little ambiguous — nodes as objects and edges as pointers. You could keep track of these by storing only a pointer to some root node, in which case accessing a given node may be inefficient (say you want node 4 — if the node object isn't provided, you may have to search for it). In this case, you'd also lose portions of the graph that aren't reachable from the root node. I think this is the case f64 rainbow is assuming when he says the time complexity for accessing a given node is O(n).
Otherwise, you could also keep an array (or hashmap) full of pointers to each node. This allows O(1) access to a given node, but increases memory usage a bit. If n is the number of nodes and e is the number of edges, the space complexity of this approach would be O(n + e).
The space complexity for the matrix approach would be along the lines of O(n^2) (assuming edges are unidirectional). If your graph is sparse, you will have a lot of empty cells in your matrix. But if your graph is fully connected (e = n^2), this compares favorably with the first approach. As RG says, you may also have fewer cache misses with this approach if you allocate the matrix as one chunk of memory, which could make following a lot of edges around the graph faster.
The third approach is probably the most space efficient for most cases — O(e) — but would make finding all the edges of a given node an O(e) chore. I can't think of a case where this would be very useful.
Take a look at comparison table on wikipedia. It gives a pretty good understanding of when to use each representation of graphs.
Okay, so if edges don't have weights, the matrix can be a binary array, and using binary operators can make things go really, really fast in that case.
If the graph is sparse, the object/pointer method seems a lot more efficient. Holding the object/pointers in a data structure specifically to coax them into a single chunk of memory might also be a good plan, or any other method of getting them to stay together.
The adjacency list - simply a list of connected nodes - seems by far the most memory efficient, but probably also the slowest.
Reversing a directed graph is easy with the matrix representation, and easy with the adjacency list, but not so great with the object/pointer representation.
There is another option: nodes as objects, edges as objects too, each edge being at the same time in two doubly-linked lists: the list of all edges coming out from the same node and the list of all edges going into the same node.
struct Node {
... node payload ...
Edge *first_in; // All incoming edges
Edge *first_out; // All outgoing edges
};
struct Edge {
... edge payload ...
Node *from, *to;
Edge *prev_in_from, *next_in_from; // dlist of same "from"
Edge *prev_in_to, *next_in_to; // dlist of same "to"
};
The memory overhead is big (2 pointers per node and 6 pointers per edge) but you get
O(1) node insertion
O(1) edge insertion (given pointers to "from" and "to" nodes)
O(1) edge deletion (given the pointer)
O(deg(n)) node deletion (given the pointer)
O(deg(n)) finding neighbors of a node
The structure also can represent a rather general graph: oriented multigraph with loops (i.e. you can have multiple distinct edges between the same two nodes including multiple distinct loops - edges going from x to x).
A more detailed explanation of this approach is available here.

Resources