What is the difference between the basic Graph Convolutional Neural Networks and GraphSage?
Which of the methods is more suited to unsupervised learning and in that case how is the loss function defined?
Please share the base papers for both the methods.
Graph Convolutional Networks are inherently transductive, i.e. they can only generate embeddings for nodes present in the fixed graph seen during training.
This implies that if the graph later evolves and new nodes (unseen during training) make their way into it, the model has to be retrained on the whole graph to compute embeddings for the new nodes. This limitation makes transductive approaches a poor fit for ever-evolving graphs (such as social networks or protein-protein interaction networks), because they cannot generalize to unseen nodes.
On the other hand, the GraphSage algorithm exploits both the rich node features and the topological structure of each node's neighborhood to efficiently generate representations for new nodes without retraining. In addition, GraphSage performs neighborhood sampling, which gives it the ability to scale up to graphs with billions of nodes.
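To make the inductive idea concrete, here is a minimal sketch of a two-layer GraphSage encoder using PyTorch Geometric (the library used in the blog post linked just below); the class name and dimensions are illustrative and not from the original answer.

import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class SAGEEncoder(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden_dim)   # aggregates sampled 1-hop neighbours
        self.conv2 = SAGEConv(hidden_dim, out_dim)  # aggregates the 2-hop neighbourhood

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)

# Because each layer only combines a node's own features with (sampled)
# neighbour features, the trained encoder can embed nodes it never saw
# during training, unlike a transductive GCN tied to a fixed graph.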
For more detail, you can follow this blog post: https://sachinsharma9780.medium.com/a-comprehensive-case-study-of-graphsage-algorithm-with-hands-on-experience-using-pytorchgeometric-6fc631ab1067
GCN paper: Kipf & Welling, "Semi-Supervised Classification with Graph Convolutional Networks" (ICLR 2017)
GraphSage paper: Hamilton, Ying & Leskovec, "Inductive Representation Learning on Large Graphs" (NeurIPS 2017)
I am applying the Cluster Info Map algorithm for community detection on a set of large networks. I can achieve high modularity scores of around 0.6, but this comes with very high numbers of communities (18-24), which complicates my final analysis. Ideally, I would like to work with 10-15 communities. Is there any way to adjust how many communities are ultimately detected in my network, using a resolution/modularity parameter or otherwise?
I've seen similar questions for Louvain and other algorithms, but nothing specifically on the Cluster Info Map algorithm. Nor does the official documentation for the standard function show anything about a resolution parameter. https://search.r-project.org/CRAN/refmans/igraph/html/cluster_infomap.html
I'm wondering if there is some other workaround available. An example of my code is below, for a weighted, directed network.
ci <- cluster_infomap(network,
                      e.weights = E(network)$weight,
                      nb.trials = 50)
The Infomap implementation included in igraph does not have any parameters that allow controlling the number of communities or the resolution. igraph's cluster_infomap is based on the old Infomap package.
The new Python Infomap package supports several features which you might find helpful, including hierarchical partitioning.
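As a minimal, illustrative sketch (not from the original answer), the Python package can be driven like this; the toy edge list is made up, and the calls follow the package's documented API:

from infomap import Infomap

im = Infomap("--directed")         # directed map equation, matching the question's network
im.add_link(0, 1, 1.0)             # add_link(source, target, weight)
im.add_link(1, 2, 0.5)
im.add_link(2, 0, 2.0)
im.run()

print(f"Found {im.num_top_modules} top-level modules")
for node in im.tree:
    if node.is_leaf:
        print(node.node_id, node.module_id)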
Imagine you have 100-1000 images that look like the following.
What is the best algorithm to identify this pattern uniquely, even if it is rotated, zoomed, or even shifted and/or partly cropped?
What you are trying to solve here is a cluster identification problem. The 100-1000 images you describe in your question together form a large cluster of unlabeled data. Multiple cluster identification algorithms exist that could work in your case, such as the k-means algorithm, the k-modes algorithm, or the k-nearest-neighbor algorithm.
Basically, data clustering works by statistically grouping items based on multiple similarity features, such as a cluster's size, density, distance, and shape, so that similar items fall into the same class and dissimilar items into different classes. Using a clustering algorithm, your machine can learn to recognize patterns from as many datasets as you feed it.
Now, when you zoom, rotate, or crop an image, you just increase the noise in your dataset. Noise makes the clustering process more tedious, but it is doable. You can refer to this paper if you want to learn more about data clustering algorithms.
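As a minimal sketch of the k-means route (not from the original answer), using scikit-learn on made-up feature vectors standing in for per-image descriptors:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.random((200, 64))          # placeholder for extracted image descriptors

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(features)     # one cluster label per image
print(labels[:10])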
For my thesis assignment I need to perform a cluster analysis on a high-dimensional data set containing purchase data from a retail store (1000+ dimensions). Because traditional clustering algorithms are not well suited to high dimensions (and dimension reduction is not really an option), I would like to try algorithms specifically developed for high-dimensional data (e.g. ProClus).
Here however, my problem starts.
I have no clue what value I should use for parameter d. Can anyone help me?
This is just one of the many limitations of ProClus.
The parameter d is the average dimensionality of your clusters. ProClus assumes there is a linear cluster somewhere in your data, which likely will not hold for purchase data, but you can try. For sparse data such as purchases, I would rather focus on frequent itemset mining.
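For example, a minimal frequent-itemset sketch (assuming the third-party mlxtend package; the baskets are made up):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

transactions = [
    ["milk", "bread", "butter"],
    ["milk", "bread"],
    ["bread", "eggs"],
    ["milk", "eggs", "bread"],
]
te = TransactionEncoder()
onehot = te.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=te.columns_)

frequent = apriori(df, min_support=0.5, use_colnames=True)  # itemsets appearing in >=50% of baskets
print(frequent)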
There is no universal clustering algorithm. Any clustering algorithm will come with a variety of parameters that you need to experiment with.
For cluster analysis it is essential that you somehow can visualize or analyze the result, to be able to find out if and how well the method worked.
I have an unweighted and undirected graph as my network, which is basically a network of proteins. I want to cluster this graph and divide it into disjoint clusters. Can anyone suggest clustering algorithms that I can apply to this biological network, which is an unweighted and undirected graph?
Several graph partitioning algorithms exist; they use different paradigms to tackle the same problem.
The most common is the Louvain method, which optimizes Newman's modularity.
In Python, using NetworkX as the graph library, you can use the community package (python-louvain) to partition your graph.
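A minimal sketch (assuming the python-louvain package, which installs the community module; the karate-club graph stands in for a protein network):

import networkx as nx
import community as community_louvain

G = nx.karate_club_graph()                        # unweighted, undirected example graph
partition = community_louvain.best_partition(G)   # maps each node to a disjoint community id
print(partition)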
The fastest graph partitioning uses METIS. It is based on hierarchical graph coarsening.
You also have N-cut (normalized cut), originally designed to segment images.
Finally, you can partition graphs using stochastic block-model techniques. A very good Python implementation of Louvain and several block-model techniques can be found in graph-tool.
My favorite is the latter; it is fast (based on the Boost Graph Library), relatively easy to use, and tunable.
EDIT: Note that in graph-tool, what we call Louvain modularity is indeed Newman's algorithm; the docs are here.
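For the block-model route, a minimal sketch with graph-tool (assuming it is installed; the polbooks graph is one of its bundled examples):

import graph_tool.all as gt

g = gt.collection.data["polbooks"]       # bundled example graph
state = gt.minimize_blockmodel_dl(g)     # fit a stochastic block model by minimizing description length
blocks = state.get_blocks()              # vertex -> block (community) assignment
for v in g.vertices():
    print(int(v), blocks[v])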
I have a JUNG graph containing about 10K vertices and 100K edges, and I'd like to get a measure of similarity between any pair of vertices.
The vertices represent concepts (e.g. dog, house, etc), and the links represent relations between concepts (e.g. related, is_a, is_part_of, etc).
The vertices are densely inter-linked, so a shortest-path approach doesn't give good results (the shortest paths are always very short).
What approaches would you recommend to rank the connectivity between vertices?
JUNG has some algorithms to score the importance of vertices, but I can't tell whether it has measures of similarity between two vertices.
SimPack seems also promising.
Any hints?
The centrality scores don't measure similarity between pairs of vertices, but rather some kind of centrality (depending on the method) of single nodes in the network. Therefore this approach is probably not what you want.
SimPack indeed has a nice goal set out, but for graphs it implements isomorphism-based comparisons, which compare multiple graphs for similarity rather than pairs of nodes within one given graph. Therefore it is out of scope for now.
What you are seeking are so-called graph clustering methods (also called network module determination or network community determination methods), which divide the graph (network) into multiple partitions so that the nodes in each partition are more strongly interconnected with each other than with nodes of other partitions.
The most classic method is perhaps the betweenness centrality clustering of Newman & Girvan, where you can exploit the dendrogram for similarity calculation, and it is available in JUNG. Of course there are throngs of methods nowadays. You may want to try (shameless plug) our ModuLand method, or read the fine table of module detection algorithms at the end of the Electronic Supplementary Material. ModuLand is an overlapping graph clustering method family; that is, its result for each node is a vector containing the strengths of belonging to each respective cluster of the network. Pairwise node similarity is easy to derive from pairs of these node-to-cluster vectors.
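As an illustrative sketch of the dendrogram idea (in Python/NetworkX rather than JUNG, and not from the original answer): two vertices can be scored by how many successive Girvan-Newman partitions still place them in the same community.

import itertools
import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.karate_club_graph()                              # stand-in for the concept graph
levels = list(itertools.islice(girvan_newman(G), 5))    # first few dendrogram levels

def same_community_depth(u, v):
    # Count successive partitions in which u and v still share a community.
    depth = 0
    for partition in levels:
        if any(u in c and v in c for c in partition):
            depth += 1
        else:
            break
    return depth

print(same_community_depth(0, 1))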
Graph clustering is non-trivial, and possibly you would need to adapt a method to get very precise domain-specific results, but that's up to the reader ;) Good luck!