How to get biggest group of nodes connected to each other? - graph

I've been given a huge graph of randomly connected nodes. I need to find the biggest group of nodes in which each one is connected to every other node in the group.
I can solve it using brute force, but I'd like to know whether there is a better way to do it.

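I think your problem is that of finding the maximum clique in the graph, i.e. the largest set of nodes that are all pairwise connected. The problem is NP-hard, so nothing is known that is fast in the worst case, but there are several algorithms in the literature (Bron–Kerbosch, for example) which should get you started.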
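As a starting point, here is a minimal sketch of Bron–Kerbosch with pivoting, adapted to keep only the largest clique seen so far. The function name, the adjacency-dict input format, and the sample graph are assumptions made for illustration; on a genuinely huge graph the worst case is still exponential.

```python
def maximum_clique(adj):
    """Return one largest clique of an undirected graph.

    adj maps each node to the set of its neighbours (assumed symmetric).
    """
    best = []

    def expand(r, p, x):
        # r: current clique, p: candidates that extend r, x: already explored
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = list(r)
            return
        if len(r) + len(p) <= len(best):
            return  # this branch cannot beat the best clique found so far
        # Pivot on the node with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r + [v], p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand([], set(adj), set())
    return best


# Hypothetical sample graph: a triangle a-b-c with a tail c-d-e.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
print(sorted(maximum_clique(graph)))
```

The size bound is the only change compared with plain maximal-clique enumeration; it prunes branches that cannot produce a bigger clique than one already found.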

Related

Pruning large amount of stale records in Neptune

I am following the best practices of pruning stale data from our Neptune Graph Database seen below.
https://docs.aws.amazon.com/neptune/latest/userguide/best-practices-gremlin-prune.html
g.V().has('timestamp', lt(datetime('2021-02-23'))).drop()
This works fine on small datasets, but my graph generates about a million vertices a day. Am I supposed to have a service running continuously just dropping vertices in batches like below? What's the best approach for pruning large datasets?
while (pruneCount > 0):
    g.V().has('timestamp', lt(datetime('2021-02-23'))).limit(1000).drop()
    pruneCount = g.V().has('timestamp', lt(datetime('2021-02-23'))).count()
If you need to drop, say, a million vertices, one strategy that I find works is to retrieve the IDs of all the vertices (and edges) that you need to drop, and then drop them in batches across multiple threads. In this way you can drop 1M elements fairly efficiently. It's generally best to drop the edges before you drop the vertices, to avoid possible concurrent modification exceptions when two threads try to drop two adjacent vertices.
You may be able to adapt the algorithm used here to your purposes: https://github.com/awslabs/amazon-neptune-tools/tree/master/drop-graph
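The batching part of that strategy could look roughly like the sketch below. The helper names (chunked, drop_in_batches), the batch size, and the worker count are all assumptions, and the actual drop call is passed in as a function so the sketch runs without a Neptune connection; it is not taken from the linked tool.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(ids, size):
    """Yield consecutive slices of ids, each at most size long."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def drop_in_batches(ids, drop_batch, batch_size=1000, workers=4):
    """Call drop_batch(batch) for every batch of ids on a thread pool.

    Returns the number of ids submitted. An exception raised by any
    drop_batch call is re-raised here.
    """
    batches = list(chunked(list(ids), batch_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() consumes the results so that errors surface.
        list(pool.map(drop_batch, batches))
    return sum(len(batch) for batch in batches)


# Assumed gremlin-python usage, with an existing traversal source g
# (not created here). Edges go first, then vertices:
#   drop_in_batches(edge_ids, lambda b: g.E(*b).drop().iterate())
#   drop_in_batches(vertex_ids, lambda b: g.V(*b).drop().iterate())
```

Collecting the IDs up front means each batch is a cheap lookup by ID instead of a repeated scan on the timestamp property.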

Gremlin query to find the entire sub-graph that a specific node is connected in any way to

I am brand new to Gremlin and am using gremlin-python to traverse my graph. The graph is made up of many clusters or sub-graphs which are intra-connected, and not inter-connected with any other cluster in the graph.
A simple example of this is a graph with 5 nodes and 3 edges:
Customer_1 is connected to CreditCard_A with 1_HasCreditCard_A edge
Customer_2 is connected to CreditCard_B with 2_HasCreditCard_B edge
Customer_3 is connected to CreditCard_A with 3_HasCreditCard_A edge
I want a query that will return a sub-graph object of all nodes and edges connected (in or out) to the queried node. I can then store this sub-graph as a variable and then run different traversals on it to calculate different things.
This query would need to be recursive as these clusters could be made up of nodes which are many (inward or outward) hops away from each other. There are also many different types of nodes and edges, and they all must be returned.
For example:
If I specified Customer_1 in the query, the resulting sub-graph would contain Customer_1, Customer_3, CreditCard_A, 1_HasCreditCard_A, and 3_HasCreditCard_A.
If I specified Customer_2, the returned sub-graph would consist of Customer_2, CreditCard_B, and 2_HasCreditCard_B.
If I queried Customer_3, the exact same sub-graph object as returned from the Customer_1 query would be returned.
I have used both Neo4J with Cypher and Dgraph with GraphQL and found this task quite easy in these two languages, but I am struggling a bit more with understanding Gremlin.
EDIT:
From this question, the selected answer should achieve what I want, once I stop specifying the edge type by changing .both('created') to just .both().
However, the loop syntax: .loop{true}{true} is invalid in Python of course. Is this loop function available in gremlin-python? I cannot find anything.
EDIT 2:
I have tried this and it seems to be working as expected, I think.
g.V(node_id).repeat(bothE().otherV().simplePath()).emit()
Is this a valid solution to what I am looking for? Is it also possible to include the queried node in this result?
Regarding the second edit, this looks like a valid solution that returns all the vertices connected to the starting vertex.
Some small fixes:
You can change bothE().otherV() to both().
If you also want to get the starting vertex, you need to move the emit step before the repeat.
I would add a dedup step to remove duplicate vertices (there can be more than one path to a vertex).
g.V(node_id).emit().repeat(both().simplePath()).dedup()
Example: https://gremlify.com/jngpuy3dwg9
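To make clear what vertex set the traversal above returns, here is a plain-Python sketch that computes the same thing, the connected component of the start vertex, with a breadth-first search over an edge list. It does not use Gremlin at all; the function name and the edge-list representation are assumptions, and the data is the three-edge example from the question.

```python
from collections import deque


def connected_component(edges, start):
    """Return the set of vertices reachable from start, ignoring edge direction."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen = {start}          # includes the starting vertex, like emit() before repeat()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:   # plays the role of dedup()
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


edges = [
    ("Customer_1", "CreditCard_A"),
    ("Customer_2", "CreditCard_B"),
    ("Customer_3", "CreditCard_A"),
]
print(sorted(connected_component(edges, "Customer_1")))
# ['CreditCard_A', 'Customer_1', 'Customer_3']
```

Starting from Customer_3 gives the same set, and starting from Customer_2 gives only Customer_2 and CreditCard_B, matching the expectations listed in the question.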

How do you group social network nodes?

A graph is made up of nodes/vertexes connected by edges/arcs. Multiple sub groups of nodes often exist (colored below). These can be people in a social network, items and purchase records, travel data, or many other things.
How do you:
split the nodes into groups based on edges?
find the leader (most connected node) in each subgroup?
To answer your first question, you can use one of several community structure algorithms, such as:
Minimum-cut method
Hierarchical clustering
Girvan–Newman algorithm
Modularity maximization
Among others.
As for your second question, once you know the members within a group you can rank them by number of connections.
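For the second step, a minimal sketch might look like this. It assumes the groups have already been produced by one of the algorithms above, counts only connections inside each group, and breaks ties alphabetically; the function name and the sample names are made up for illustration.

```python
def group_leaders(edges, groups):
    """Return, for each group, the member with the most in-group connections."""
    leaders = []
    for group in groups:
        members = set(group)
        degree = {node: 0 for node in members}
        for a, b in edges:
            # Count an edge only if both endpoints belong to this group.
            if a in members and b in members:
                degree[a] += 1
                degree[b] += 1
        # sorted() makes the tie-break deterministic (first name alphabetically).
        leaders.append(max(sorted(members), key=degree.get))
    return leaders


# Hypothetical data: two groups joined by a single cross-group edge (dan-hal).
edges = [
    ("ann", "bob"), ("ann", "cat"), ("ann", "dan"),
    ("eve", "fay"), ("eve", "gus"), ("fay", "gus"), ("eve", "hal"),
    ("dan", "hal"),
]
groups = [{"ann", "bob", "cat", "dan"}, {"eve", "fay", "gus", "hal"}]
print(group_leaders(edges, groups))
# ['ann', 'eve']
```

Whether cross-group edges should count towards a leader's score is a modelling choice; this sketch ignores them.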

Number of 'graphs' in an ArangoDB database

I am exploring ArangoDB and I am not sure I understand correctly how to use the graph concept in ArangoDB.
Let's say I am modelling a social network. Should I create a single graph for the whole social network, or should I create a graph for every person and their connections?
I've got the feeling I should use a single graph... But is there any performance or functionality issue related to that choice?
Maybe the underlying question is this: should I consider the graph concept in ArangoDB as a technical or as a business-related concept?
Thanks
You should not use a graph per person. The first quick answer would be to use a single graph.
In general, I think you should treat the graph concept as a technical concept. Having said that, quite often, a (mathematical) graph models a relationship arising from business very naturally. Thus, the technical concept graph in a graph database maps very well to the business logic.
A social network is one of the prime examples. Typical questions here are "who are the friends of a user?", "who are the friends of the friends of a user?" or "what is the shortest path from person A to person B?". A graph database will be most useful for questions involving an a priori unknown path length, as in the shortest path example.
Coming back to the original question: You should start by looking at the queries you will have about your data. You then want to arrange things so that these queries map conveniently onto the standard graph operations (or indeed other queries) your data store can answer. This then tells you what kind of information should be in the same graph, and which bits belong in separate graphs.
In your original use case of a social network, I would assume that you want to run queries involving chains of friendship-relations, so the edges in these chains must be in the same graph. However, in more complicated cases it is for example conceivable that you have a "friendship" graph and a "follows" graph, both using different edges but the same vertices. In that case you might have two graphs for your social network.

Hierarchy in a graph

How can one compute or quantify the hierarchical nature of a given graph, if there is a hierarchy in that graph?
More specifically, I want to know whether some sort of hierarchy exists in an artificial neural network (with some number of hidden layers), and I also want to measure it.
This is an interesting open ended question and so I'll answer loosely and off the cuff.
So you want to find out if the graph is logically like a tree? There is no up or down in a graph, so probably what you are really looking for is a way to find the node or nodes in your graph that are the most highly connected to other highly connected nodes, and then, taking that node as the trunk or root, determine whether a "tree"-like graph makes sense. Another thing you could do is just choose any random node, suppose that it is the root of the tree, and then see what happens. You can re-balance the tree if the node connections lead you to want to do that based on a certain number of connections or traversals, and you can attempt to find the "real root", if there is such a thing. If you re-balance a certain number of times, or detect that a circular path has been traversed, you may decide that the graph is not very hierarchical at all. If you do find a "real root", you might then decide to look at depth, average branch numbers, balance stats, etc. If you refine the question, I'll refine my answer.
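A very rough sketch of the "pick a root and see what happens" idea: run a breadth-first search from a chosen root, flag any circular path, and report depth and average branching. The function name, the returned tuple, and the sample graphs are assumptions, and this is only one possible way to put numbers on the idea; it does no re-balancing.

```python
from collections import deque


def tree_stats(adj, root):
    """BFS from root over an undirected graph {node: set(neighbours)}.

    Returns (is_tree, depth, avg_branching), where avg_branching is the
    mean number of children over nodes that have at least one child.
    """
    parent = {root: None}
    depth = {root: 0}
    children = {root: 0}
    has_cycle = False
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbour in adj[node]:
            if neighbour not in parent:
                parent[neighbour] = node
                depth[neighbour] = depth[node] + 1
                children[neighbour] = 0
                children[node] += 1
                queue.append(neighbour)
            elif parent[node] != neighbour:
                has_cycle = True  # an edge that is not part of the BFS tree
    internal = [count for count in children.values() if count > 0]
    avg_branching = sum(internal) / len(internal) if internal else 0.0
    is_tree = not has_cycle and len(parent) == len(adj)
    return is_tree, max(depth.values()), avg_branching


# Hypothetical graphs: a small tree, and the same tree with one extra edge c-d.
tree = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a"}, "d": {"b"}}
cyclic = {"a": {"b", "c"}, "b": {"a", "d"}, "c": {"a", "d"}, "d": {"b", "c"}}
print(tree_stats(tree, "a"))
print(tree_stats(cyclic, "a")[0])
```

Trying several candidate roots and comparing the resulting depth and branching figures is one way to look for the "real root" mentioned above.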
