Use METIS to partition a graph into unequal parts for heterogeneous architectures

Based on the official METIS manual, METIS can partition a graph into k unequal parts with different capacities for the vertices:
METIS’ graph and mesh partitioning programs and API routines are designed to partition a graph into k parts such that each part contains a pre-specified fraction of the total number of vertices/elements/nodes. In addition, in the case of multi-constraint partitioning, these pre-specified fractions are provided for each one of the vertex weights.
Now my question is: how is this possible? How can we tell METIS to partition a graph into k unequal parts, and how can we specify the capacity of these parts? I found an option, -tpwgts, for defining target partition weights, but I don't understand how it affects the partitioning process, and the description in the manual is not very intelligible. So can you please describe how it is possible to create partitions with different sizes?

It is probably possible to partition a graph unequally if we take into consideration the sum of the weights of the edges contained in each partition, or some metric related to that.
For instance, if the sums of the weights are equal for each partition, that does not necessarily mean that the partitions contain an equal number of nodes.
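To make the tpwgts semantics concrete, here is a minimal Python sketch (the numbers are made up for illustration): tpwgts is simply an array of target fractions, one per part (and per constraint in the multi-constraint case), that must sum to 1, and METIS aims for part i to receive roughly tpwgts[i] of the total vertex weight while still minimizing the edge cut.

    # A rough sketch of the bookkeeping only, not a complete METIS call.
    nparts = 3
    tpwgts = [0.5, 0.3, 0.2]               # one target fraction per part; must sum to 1
    assert abs(sum(tpwgts) - 1.0) < 1e-9

    total_vertex_weight = 1000             # e.g. 1000 unit-weight vertices
    targets = [f * total_vertex_weight for f in tpwgts]
    print(targets)                         # [500.0, 300.0, 200.0]

    # This array is what you hand to METIS: as the tpwgts argument of
    # METIS_PartGraphKway / METIS_PartGraphRecursive (one float per part
    # and per balance constraint), or through the -tpwgts option of the
    # gpmetis program.  METIS then tries to keep the vertex weight of part
    # i within the allowed imbalance of tpwgts[i] * total weight while
    # minimizing the edge cut; leaving tpwgts unset gives equal parts.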

Graph Querying on edge

Attributed graphs are most commonly represented as an adjacency matrix or an adjacency list, where nodes are treated as first-class citizens. Many graph queries, such as neighborhood, shortest path, PageRank, and connected components, operate on these node-based matrix and list structures. The attributes of a node/edge can also be stored apart from the connections.
Another representation of a graph is the incidence matrix, where the edges incident to a node are recorded in a matrix. I understand it represents exactly the same information as the previous node-based methods.
My question is: are there any graph queries/workloads/algorithms that can benefit from the incidence matrix structure rather than the node-based structures, i.e. favoring an edge-based structure? When exactly is the incidence matrix used?
I can think of only one case where the incidence matrix may prove faster:
Finding the degree of a node, or finding its adjacent nodes, is an operation with complexity O(V) when using an adjacency matrix and O(E) when using an incidence matrix.
Usually E > V, but this may not be the case if the graph has many 0-degree nodes. Since finding adjacent nodes is a basic operation, many algorithms may prove to be faster on such graphs.
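To make the comparison concrete, here is a small NumPy sketch (the toy graph and names are mine): the adjacency matrix answers a degree query by scanning a row of length V, the incidence matrix by scanning a row of length E, so the edge-based row scan is shorter only when E < V.

    import numpy as np

    # Toy undirected graph: 4 vertices, 2 edges (0-1 and 1-2); vertex 3 is isolated.
    V, E = 4, 2
    edges = [(0, 1), (1, 2)]

    A = np.zeros((V, V), dtype=int)    # adjacency matrix, V x V
    B = np.zeros((V, E), dtype=int)    # incidence matrix, V x E
    for e, (u, v) in enumerate(edges):
        A[u, v] = A[v, u] = 1
        B[u, e] = B[v, e] = 1

    node = 1
    deg_from_A = A[node].sum()         # scans a row of length V  -> O(V)
    deg_from_B = B[node].sum()         # scans a row of length E  -> O(E)
    incident = np.nonzero(B[node])[0]  # edges touching the node, same O(E) row scan
    print(int(deg_from_A), int(deg_from_B), incident.tolist())   # 2 2 [0, 1]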

How to choose k in Shi-Malik Algorithm?

I'm wondering how one chooses a specific k in the Shi-Malik algorithm.
Do we choose several ks and rank them via their SSE measures?
Does k reflect the number of clusters we assume for the data?
kind regards Mikey
Yes, k is the number of natural groupings we believe there are in the data.
You can find k by exploring the eigenvalues.
One tool designed specifically for spectral clustering is the eigengap heuristic (also called the spectral gap): the number of clusters k is usually given by the value of k that maximizes the eigengap (the difference between consecutive eigenvalues), i.e., choose the number k such that all eigenvalues λ1, ..., λk are very small but λk+1 is relatively large.
The larger this eigengap is, the closer the eigenvectors are to those of the ideal case, and hence the better spectral clustering works. If you're interested in the justification for this procedure, it is based on perturbation theory and spectral graph theory.
You can read more here: A Tutorial on Spectral Clustering - Ulrike von Luxburg
Another way to explore the natural grouping is through the number of connected components and the spectrum of the Laplacian matrix: the number of times 0 appears as an eigenvalue of the Laplacian is the number of connected components in the graph. Your affinity matrix can be regarded as a graph, so check how many connected components that graph has. That will give you a sense of the natural structure of your data.
In addition, as you mentioned, we can set a validation criterion (for example, SSE) and see its value under different values of k. That is fine once you have labeled data (which is not always the case in clustering) and you know that this criterion/quality measure is really meaningful.
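Here is a small NumPy sketch of both ideas from this answer (the function name and the toy affinity matrix are mine): count the near-zero eigenvalues of the unnormalized Laplacian L = D - W to get the number of connected components, and pick k at the largest gap between consecutive eigenvalues.

    import numpy as np

    def choose_k(W, k_max=10, tol=1e-8):
        """Return (k from the eigengap heuristic, number of connected components)."""
        L = np.diag(W.sum(axis=1)) - W          # unnormalized graph Laplacian
        eigvals = np.linalg.eigvalsh(L)         # symmetric -> real, ascending order
        n_components = int(np.sum(eigvals < tol))
        gaps = np.diff(eigvals[:k_max + 1])     # gap after the 1st, 2nd, ... eigenvalue
        k = int(np.argmax(gaps)) + 1            # largest gap sits after the k-th eigenvalue
        return k, n_components

    # Toy affinity matrix: two tight groups {0,1,2} and {3,4}, weakly linked via 0-3.
    W = np.array([
        [0,    1, 1, 0.05, 0],
        [1,    0, 1, 0,    0],
        [1,    1, 0, 0,    0],
        [0.05, 0, 0, 0,    1],
        [0,    0, 0, 1,    0],
    ], dtype=float)
    print(choose_k(W, k_max=4))                 # expect k = 2, one connected component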

How to find total number of nodes in a Distributed hash table

How can one find the total number of nodes in a distributed hash table in an efficient way?
You generally do that by estimating from a small sample of the network, as enumerating all nodes of a large network is prohibitively expensive for most use cases, and would still be inaccurate due to NAT anyway. So you have to accept that you are sampling the reachable nodes.
Assuming that nodes are randomly distributed throughout the keyspace and you have some sort of distance metric in your DHT (e.g. the XOR metric in Kademlia's case), you can take the median of the distances within a sample and then estimate the population as the keyspace size divided by the average distance between neighbouring nodes.
If you use the median you may have to compensate by some factor due to the skewness of the distribution, but my statistics are rusty; maybe someone else can chip in on that.
The result will be very noisy, so you'll want to keep enough samples around for averaging, especially given the skewed distribution and the fact that everything happens on an exponential scale (twiddle one bit to the left and the population estimate suddenly doubles or halves).
I would also suggest basing estimates only on outgoing queries that you control, not on incoming traffic, as incoming traffic may be biased by implementation details.
Another, crude way to get rough estimates is simply extrapolating from your routing table structure, assuming it scales with the network size.
Depending on your statistics prowess, you might want to do some of the following: read scientific papers describing the network, steal code from existing implementations that already do estimation, or run simulations over broad ranges of population sizes; simply fitting a few million random node addresses into RAM and doing some calculations on them shouldn't be too difficult.
Maybe also talk to developers of existing implementations.
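Following the simulation suggestion above, here is a self-contained Python sketch of the whole experiment (all function and variable names are mine; it mimics Kademlia-style 160-bit IDs with the XOR metric instead of talking to a real network): generate a known number of random IDs, run fake lookups, and turn the distances of the closest nodes into population estimates.

    import heapq, random, statistics

    KEYSPACE_BITS = 160                        # Kademlia-style node IDs
    KEYSPACE = 1 << KEYSPACE_BITS

    def estimate_population(nodes, sample_size=32, lookups=50, rng=random):
        """Estimate the number of nodes from the XOR distances of the
        sample_size closest nodes to random lookup targets."""
        estimates = []
        for _ in range(lookups):
            target = rng.getrandbits(KEYSPACE_BITS)
            # In a real DHT this set would come from an iterative lookup;
            # here we simply take the globally closest nodes.
            closest = heapq.nsmallest(sample_size, nodes, key=lambda n: n ^ target)
            distances = sorted(n ^ target for n in closest)
            # With uniformly distributed IDs the i-th closest node sits at
            # distance roughly i * KEYSPACE / population, so every rank
            # yields one (noisy) population estimate.
            per_rank = [(i + 1) * KEYSPACE / d for i, d in enumerate(distances) if d]
            estimates.append(statistics.median(per_rank))
        # Median over many lookups to tame the noise and the skew.
        return statistics.median(estimates)

    if __name__ == "__main__":
        true_size = 20_000
        nodes = [random.getrandbits(KEYSPACE_BITS) for _ in range(true_size)]
        print(true_size, round(estimate_population(nodes)))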

Need a graph partitioning technique

I have a graph G = (V,E), with V the set of nodes and E the set of edges. There are two types of nodes: Source nodes and Consumer nodes (the number of Source nodes is much lower than the number of Consumer nodes). The nodes have geographic positions.
I want to partition the graph into a collection of sub-graphs which are:
a- connected sub-graphs,
b- of a proper size (the sizes of the partitions must be balanced, though not necessarily equal; e.g. between 2000 and 3000 nodes),
c- the partitions should preferably be directly connected to a Source. So if there is no Source in a partition, the path from the partition to a Source node should not include any nodes of the other partitions. (The most important constraint)
d- The nodes in a partition should be close to each other (geographically)
The minimum cut set is preferable. The Source nodes can be isolated from the other partitions (they can be in partitions of size one, containing only themselves).
Is there any existing partitioning technique that I can use? Any kind of help is fully appreciated.
There are some works based on the modularity measure used in community detection. For instance, in Chen et al. 2012, they extend the modularity to spatial, weighted, directed networks. The spatial distance is used to modulate the link weights.
This would fit your points a) and d). However, the (regular) modularity is not designed to find communities of similar size, so it won't fulfil your point b). Maybe you'd better use a classic minimum-cut approach, by modifying a measure such as the conductance in a way similar to that of Chen et al.
For your point c), I must say I have never met this type of constraint before, and I find it very interesting. I guess you could try to perform some bi-criterion optimization, trying to minimize both the conductance (or modularity) and a criterion such as the average distance to the closest Source. But that would not guarantee that point c) is respected. You could also force the number of detected communities to be less than the number of Sources.
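Purely as an illustration of how constraints a) and c) can be enforced directly (this seeded-BFS baseline is my own naive sketch, not something proposed above): grow one connected region around each Source with an interleaved BFS and a size cap. Balance b) is only handled crudely by the cap, and geography d) only indirectly through graph distance.

    from collections import deque

    def seeded_regions(adj, sources, max_size=3000):
        """Grow one connected region around each Source node with an
        interleaved BFS, stopping a region once it reaches max_size.
        adj: dict mapping node -> iterable of neighbours."""
        label = {s: i for i, s in enumerate(sources)}   # region index per node
        size = [1] * len(sources)
        frontier = deque((s, i) for i, s in enumerate(sources))
        while frontier:
            node, region = frontier.popleft()
            for nb in adj[node]:
                if nb not in label and size[region] < max_size:
                    label[nb] = region
                    size[region] += 1
                    frontier.append((nb, region))
        # Nodes left unlabelled (caps reached or unreachable) would need a
        # second pass, e.g. attaching them to the nearest labelled neighbour.
        return label

    # Tiny example: two Sources s1, s2 on a path of Consumers.
    adj = {
        "s1": ["a"], "a": ["s1", "b"], "b": ["a", "c"],
        "c": ["b", "s2"], "s2": ["c"],
    }
    print(seeded_regions(adj, ["s1", "s2"], max_size=3))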

What are the differences between community detection algorithms in igraph?

I have a list of about 100 igraph objects with a typical object having about 700 vertices and 3500 edges.
I would like to identify groups of vertices within which ties are more likely. My plan is to then use a mixed model to predict how many within-group ties vertices have using vertex and group attributes.
Some people may want to respond to other aspects of my project, which would be great, but the thing I'm most interested in is information about functions in igraph for grouping vertices. I've come across these community detection algorithms but I'm not sure of their advantages and disadvantages, or whether some other function would be better for my case. I saw the links here as well, but they aren't specific to igraph. Thanks for your advice.
Here is a short summary about the community detection algorithms currently implemented in igraph:
edge.betweenness.community is a hierarchical decomposition process where edges are removed in the decreasing order of their edge betweenness scores (i.e. the number of shortest paths that pass through a given edge). This is motivated by the fact that edges connecting different groups are more likely to be contained in multiple shortest paths simply because in many cases they are the only option to go from one group to another. This method yields good results but is very slow because of the computational complexity of edge betweenness calculations and because the betweenness scores have to be re-calculated after every edge removal. Your graphs with ~700 vertices and ~3500 edges are around the upper size limit of graphs that are feasible to be analyzed with this approach. Another disadvantage is that edge.betweenness.community builds a full dendrogram and does not give you any guidance about where to cut the dendrogram to obtain the final groups, so you'll have to use some other measure to decide that (e.g., the modularity score of the partitions at each level of the dendrogram).
fastgreedy.community is another hierarchical approach, but it is bottom-up instead of top-down. It tries to optimize a quality function called modularity in a greedy manner. Initially, every vertex belongs to a separate community, and communities are merged iteratively such that each merge is locally optimal (i.e. yields the largest increase in the current value of modularity). The algorithm stops when it is not possible to increase the modularity any more, so it gives you a grouping as well as a dendrogram. The method is fast and it is the method that is usually tried as a first approximation because it has no parameters to tune. However, it is known to suffer from a resolution limit, i.e. communities below a given size threshold (depending on the number of nodes and edges if I remember correctly) will always be merged with neighboring communities.
walktrap.community is an approach based on random walks. The general idea is that if you perform random walks on the graph, then the walks are more likely to stay within the same community because there are only a few edges that lead outside a given community. Walktrap runs short random walks of 3-4-5 steps (depending on one of its parameters) and uses the results of these random walks to merge separate communities in a bottom-up manner like fastgreedy.community. Again, you can use the modularity score to select where to cut the dendrogram. It is a bit slower than the fast greedy approach but also a bit more accurate (according to the original publication).
spinglass.community is an approach from statistical physics, based on the so-called Potts model. In this model, each particle (i.e. vertex) can be in one of c spin states, and the interactions between the particles (i.e. the edges of the graph) specify which pairs of vertices would prefer to stay in the same spin state and which ones prefer to have different spin states. The model is then simulated for a given number of steps, and the spin states of the particles in the end define the communities. The consequences are as follows: 1) There will never be more than c communities in the end, although you can set c to as high as 200, which is likely to be enough for your purposes. 2) There may be fewer than c communities in the end, as some of the spin states may become empty. 3) It is not guaranteed that nodes in completely remote (or disconnected) parts of the network have different spin states. This is more likely to be a problem for disconnected graphs only, so I would not worry about it. The method is not particularly fast and not deterministic (because of the simulation itself), but it has a tunable resolution parameter that determines the cluster sizes. A variant of the spinglass method can also take into account negative links (i.e. links whose endpoints prefer to be in different communities).
leading.eigenvector.community is a top-down hierarchical approach that optimizes the modularity function again. In each step, the graph is split into two parts in a way that the separation itself yields a significant increase in the modularity. The split is determined by evaluating the leading eigenvector of the so-called modularity matrix, and there is also a stopping condition which prevents tightly connected groups from being split further. Due to the eigenvector calculations involved, it might not work on degenerate graphs where the ARPACK eigenvector solver is unstable. On non-degenerate graphs, it is likely to yield a higher modularity score than the fast greedy method, although it is a bit slower.
label.propagation.community is a simple approach in which every node is assigned one of k labels. The method then proceeds iteratively and re-assigns labels to nodes in a way that each node takes the most frequent label of its neighbors in a synchronous manner. The method stops when the label of each node is one of the most frequent labels in its neighborhood. It is very fast but yields different results based on the initial configuration (which is decided randomly), therefore one should run the method a large number of times (say, 1000 times for a graph) and then build a consensus labeling, which could be tedious.
igraph 0.6 will also include the state-of-the-art Infomap community detection algorithm, which is based on information theoretic principles; it tries to build a grouping which provides the shortest description length for a random walk on the graph, where the description length is measured by the expected number of bits per vertex required to encode the path of a random walk.
Anyway, I would probably go with fastgreedy.community or walktrap.community as a first approximation and then evaluate other methods when it turns out that these two are not suitable for a particular problem for some reason.
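As a concrete first pass, here is roughly what that looks like with the python-igraph bindings, which expose the same algorithms under community_* names (the Zachary toy graph and variable names are mine; treat this as a sketch rather than canonical usage): run fastgreedy and walktrap, cut each dendrogram, and compare the resulting modularity scores.

    import igraph as ig

    # Toy graph standing in for one of the ~700-vertex networks.
    g = ig.Graph.Famous("Zachary")              # Zachary's karate club, 34 vertices

    # fastgreedy and walktrap both return a dendrogram; as_clustering() cuts
    # it at the count igraph deems optimal (modularity-based) when no
    # explicit number of groups is given.
    fg = g.community_fastgreedy().as_clustering()
    wt = g.community_walktrap(steps=4).as_clustering()

    print("fastgreedy:", len(fg), "communities, modularity =", round(fg.modularity, 3))
    print("walktrap:  ", len(wt), "communities, modularity =", round(wt.modularity, 3))

    # Per-vertex community labels, e.g. for feeding group attributes
    # into a mixed model later on.
    membership = fg.membership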
A summary of the different community detection algorithms can be found here: http://www.r-bloggers.com/summary-of-community-detection-algorithms-in-igraph-0-6/
Notably, the InfoMAP algorithm is a recent newcomer that could be useful (it supports directed graphs too).

Resources