revealing clusters of interaction in igraph - r

I have an interaction network and I used the following code to make an adjacency matrix and subsequently calculate the dissimilarity between the nodes of the network and then cluster them to form modules:
ADJ1 <- abs(adjacent_mat)^6
dissADJ1 <- 1 - ADJ1
hierADJ <- hclust(as.dist(dissADJ1), method = "average")
Now I would like those modules to appear when I plot the igraph.
g <- simplify(graph_from_adjacency_matrix(adjacent_mat, weighted = TRUE))
plot.igraph(g)
However the only thing that I have found thus far to translate hclust output to graph is as per the following tutorial: http://gastonsanchez.com/resources/2014/07/05/Pretty-tree-graph/
phylo_tree = as.phylo(hierADJ)
graph_edges = phylo_tree$edge
graph_net = graph.edgelist(graph_edges)
plot(graph_net)
which is useful for hierarchical lineage but rather I just want the nodes that closely interact to cluster as follows:
Can anyone recommend how to use a command such as components from igraph to get these clusters to show?

igraph provides a bunch of different layout algorithms which are used to place nodes in the plot.
A good one to start with for a weighted network like this is the force-directed layout (implemented by layout.fruchterman.reingold in igraph, also available as layout_with_fr in current releases).
Below is an example of the force-directed layout applied to some simple simulated data.
First, we create some mock data and clusters, along with some "noise" to make it more realistic:
library('dplyr')
library('igraph')
library('RColorBrewer')
set.seed(1)
# generate a couple clusters
nodes_per_cluster <- 30
n <- 10
nvals <- nodes_per_cluster * n
# cluster 1 (increasing)
cluster1 <- matrix(rep((1:n)/4, nodes_per_cluster) +
rnorm(nvals, sd=1),
nrow=nodes_per_cluster, byrow=TRUE)
# cluster 2 (decreasing)
cluster2 <- matrix(rep((n:1)/4, nodes_per_cluster) +
rnorm(nvals, sd=1),
nrow=nodes_per_cluster, byrow=TRUE)
# noise cluster
noise <- matrix(sample(1:2, nvals, replace=TRUE) +
rnorm(nvals, sd=1.5),
nrow=nodes_per_cluster, byrow=TRUE)
dat <- rbind(cluster1, cluster2, noise)
colnames(dat) <- paste0('n', 1:n)
rownames(dat) <- c(paste0('cluster1_', 1:nodes_per_cluster),
paste0('cluster2_', 1:nodes_per_cluster),
paste0('noise_', 1:nodes_per_cluster))
Next, we can use Pearson correlation to construct our adjacency matrix:
# create correlation matrix
cor_mat <- cor(t(dat))
# shift to [0,1] to separate positive and negative correlations
adj_mat <- (cor_mat + 1) / 2
# emphasise strong correlations, then get rid of low ones and self-loops
adj_mat <- adj_mat^3
adj_mat[adj_mat < 0.5] <- 0
diag(adj_mat) <- 0
Cluster the data using hclust and cutree:
# convert to dissimilarity matrix and cluster using hclust
dissim_mat <- 1 - adj_mat
dend <- dissim_mat %>%
as.dist %>%
hclust
clusters = cutree(dend, h=0.65)
# color the nodes
pal = colorRampPalette(brewer.pal(11,"Spectral"))(length(unique(clusters)))
node_colors <- pal[clusters]
Finally, create an igraph graph from the adjacency matrix and plot it using the fruchterman.reingold layout:
# create graph
g <- graph.adjacency(adj_mat, mode='undirected', weighted=TRUE)
# set node color and plot using a force-directed layout (fruchterman-reingold)
V(g)$color <- node_colors
coords_fr = layout.fruchterman.reingold(g, weights=E(g)$weight)
# igraph plot options
igraph.options(vertex.size=8, edge.width=0.75)
# plot network
plot(g, layout=coords_fr, vertex.color=V(g)$color)
In the above code, I generated two "clusters" of correlated rows, and a third group of "noise".
Hierarchical clustering (hclust + cutree) is used to assign the data points to clusters, and they are colored based on cluster membership.
The result looks like this:
For some more examples of clustering and plotting graphs with igraph, check out: http://michael.hahsler.net/SMU/LearnROnYourOwn/code/igraph.html
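If the goal is specifically to make the hclust modules stand out in the plot, one further option is the mark.groups argument of igraph's plot function, which draws a shaded region around each group of vertices. The following is only a minimal sketch on a toy similarity matrix: the matrix sim, the two-block structure and the choice k = 2 are assumptions made for illustration, not the asker's data.

```r
library(igraph)

set.seed(42)

# Toy similarity matrix (assumption): two blocks of 5 strongly similar nodes
sim <- matrix(0.05, nrow = 10, ncol = 10)
sim[1:5, 1:5] <- 0.9
sim[6:10, 6:10] <- 0.9
diag(sim) <- 0

g <- graph_from_adjacency_matrix(sim, mode = "undirected", weighted = TRUE)

# Same recipe as in the question: dissimilarity -> hclust -> cut into modules
hc <- hclust(as.dist(1 - sim), method = "average")
modules <- cutree(hc, k = 2)

# mark.groups expects a list of vertex index vectors, one per module
groups <- split(seq_len(vcount(g)), modules)

plot(g,
     layout = layout_with_fr(g, weights = E(g)$weight),
     vertex.color = modules,
     mark.groups = groups)
```

With real data, the number of modules (or the cut height) would have to be chosen from the dendrogram rather than fixed in advance.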

You haven't shared any toy data for us to play with and suggest improvements to the code, but your question states that you are only interested in plotting your clusters distinctly - that is, graphical presentation.
Although igraph comes with some nice force-directed layout algorithms, such as layout.fruchterman.reingold, layout_with_kk, etc., their output can, in the presence of a large number of nodes, quickly become difficult to interpret or make sense of at all.
Like this:
With these traditional methods of visualising networks:
the layout algorithms, rather than the data, determine the visualisation;
similar networks may end up being visualised very differently;
a large number of nodes makes the visualisation difficult to interpret.
Instead, I find Hive Plots to be better at displaying important network properties, which, in your instance, are the cluster and the edges.
In your case, you can:
Plot each cluster on a different straight line
Order the placement of nodes intelligently, so that nodes with certain properties are placed at the very end or start of each straight line
Colour the edges to identify direction of edge
To achieve this you will need to:
use the ggnetwork package to turn your igraph object into a dataframe
map your clusters to the nodes present in this dataframe
generate coordinates for the straight lines and map these to each cluster
use ggplot to visualise
There is also a hiveR package in R, should you wish to use a packaged solution. You might also find another visualisation technique for graphs very useful: BioFabric
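The idea of one straight line per cluster can be sketched without ggnetwork or hiveR by passing a hand-built coordinate matrix to igraph's plot function. This is a simplified stand-in for a hive plot, not the ggnetwork/ggplot workflow listed above; the toy graph, the use of Louvain communities as clusters and the ordering of nodes by degree are all assumptions made for illustration.

```r
library(igraph)

set.seed(1)
g <- sample_gnm(30, 60)                        # toy graph (assumption)
cl <- as.vector(membership(cluster_louvain(g)))
k <- max(cl)

# One axis per cluster, evenly spaced around the origin
angles <- 2 * pi * (seq_len(k) - 1) / k

# Position along the axis: rank of each node's degree within its cluster,
# so the best-connected nodes sit at the outer end of their line
deg <- degree(g)
radius <- ave(deg, cl, FUN = function(d) rank(d, ties.method = "first"))

coords <- cbind(radius * cos(angles[cl]), radius * sin(angles[cl]))

plot(g, layout = coords,
     vertex.color = rainbow(k)[cl], vertex.size = 6,
     vertex.label = NA, edge.curved = 0.3)
```

Because the coordinates depend only on cluster membership and degree, two similar networks drawn this way should look similar, which is the property the layout-algorithm approach lacks.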

Related

Applying Graph Clustering Algorithms on the (famous) Iris data set

My question deals with the application of graph clustering algorithms. Most times, I see that graphs are made by using nodes and edges within the data. For example, suppose we have social media data: each individual in the data could be represented as a node and the relationship between individuals could be represented as edges. Using this information, we could build a graph and then perform graph clustering algorithms (e.g. Louvain Clustering) on this graph.
Sometimes, graphs can also be made using distances between points. Distances between points can be thought of as edges. For example, in the Spectral Clustering algorithm, a KNN (k nearest neighbor) graph is made from the data and then the K-Means clustering algorithm is performed on this graph.
My question is this: Suppose we take the famous Iris data and remove the response variable ("Species"). Would it make any sense to create a graph of this Iris data in which each node corresponds to an individual flower and edges correspond to pairwise Euclidean distances between each points? Assuming this is a logical and correct approach, could graph clustering algorithms be then performed on this Iris graph?
Below, I have attempted to first create a graph of the Iris data using pairwise Euclidean distances (in R). I then performed Louvain Clustering and Infomap Clustering on the resulting graph. After that, I attempted to create a KNN graph of the Iris data and perform MST (minimum spanning tree) clustering on this KNN graph, as well as perform Louvain Clustering.
Could someone please provide an opinion on what I have done? Is this intuitive and does it make mathematical sense? As a way of "cheating" - the Iris data only has 3 species. Thus, if a given clustering algorithm returns significantly more than 3 clusters, we know that the graph and/or the clustering algorithm may not be the best choice. However, in real applications, we are unable to know how many "true" classes exist within the data.
library(igraph)
library(network)
library(reshape2)
library(mstknnclust)
library(visNetwork)
library(cluster)
# louvain clustering done on a distance based graph - maybe this is correct
x <- iris[,1:4]
dist <- daisy(x,
metric = "euclidean"
)
d_mat <- as.matrix(dist)
d_long <- melt(d_mat)
colnames(d_long) <- c("from", "to", "correlation")
d_mat_long <- d_long[which(d_long$correlation > .5),]
graph <- graph_from_data_frame(d_mat_long, directed = FALSE)
nodes <- as_data_frame(graph, what = "vertices")
colnames(nodes) <- "id"
nodes$label <- nodes$id
links <- as_data_frame(graph, what = "edges")
visNetwork(nodes, links) %>% visIgraphLayout(layout = "layout_with_fr")
cluster <- cluster_louvain(graph)
nodes$cluster <- cluster$membership
nodes$color <- ifelse(nodes$cluster == 1, "red", "blue")
visNetwork(nodes, links) %>% visIgraphLayout(layout = "layout_with_fr") %>% visOptions(selectedBy = "cluster") %>% visNodes(color = "color")
# infomap and louvain clustering done on a distance based graph but with a different algorithm: I think this is wrong
imc <- cluster_infomap(graph)
membership(imc)
communities(imc)
plot(imc, graph)
lc <- cluster_louvain(graph, weights = NULL)
membership(lc)
communities(lc)
plot(lc, graph)
# mst spanning algorithm on the knn graph: based on the number of clusters I think this is wrong
cg <- generate.complete.graph(1:nrow(x),d_mat)
##Generates kNN graph
knn <- generate.knn(cg)
plot(knn$knn.graph,
main=paste("kNN \n k=", knn$k, sep=""))
results <- mst.knn(d_mat)
igraph::V(results$network)$label.cex <- seq(0.6,0.6,length.out=2)
plot(results$network, vertex.size=8,
vertex.color=igraph::clusters(results$network)$membership,
layout=igraph::layout.fruchterman.reingold(results$network, niter=10000),
main=paste("MST-kNN \n Clustering solution \n Number of clusters=",results$cnumber,sep="" ))
# louvain clustering and infomap done on the knn graph - maybe this is correct
#louvain
lc <- cluster_louvain(knn$knn.graph, weights = NULL)
membership(lc)
communities(lc)
plot(lc, knn$knn.graph)
imc <- cluster_infomap(knn$knn.graph)
membership(imc)
communities(imc)
plot(imc, knn$knn.graph)
"louvain clustering done on a distance based graph - maybe this is correct"
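Not really. Community detection methods such as cluster_louvain read edge weights as connection strength, so a larger weight should mean more similar nodes; distances are the appropriate weights for path-based measures such as betweenness centrality. If your interest is similarity, convert the distances to similarities first.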
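One way to follow that advice is sketched below. The Gaussian kernel, its bandwidth (the median pairwise distance) and the choice of k = 10 neighbours are assumptions picked for illustration, not recommendations from the answer; other similarity transforms would work on the same principle.

```r
library(igraph)

x <- as.matrix(iris[, 1:4])
d <- as.matrix(dist(x))                     # pairwise Euclidean distances

# Distance -> similarity with a Gaussian kernel (bandwidth is an assumption)
sim <- exp(-d^2 / (2 * median(d)^2))
diag(sim) <- 0                              # no self-loops

# Keep only each flower's k most similar neighbours to sparsify the graph
k <- 10
knn_sim <- matrix(0, nrow(sim), ncol(sim))
for (i in seq_len(nrow(sim))) {
  nn <- order(sim[i, ], decreasing = TRUE)[seq_len(k)]
  knn_sim[i, nn] <- sim[i, nn]
}
knn_sim <- pmax(knn_sim, t(knn_sim))        # symmetrise

g <- graph_from_adjacency_matrix(knn_sim, mode = "undirected", weighted = TRUE)

# Louvain uses the "weight" edge attribute: larger weight = stronger tie
lc <- cluster_louvain(g)

# Compare the detected communities with the held-out species labels
table(membership(lc), iris$Species)
```

The cross-tabulation at the end is the "cheating" check described in the question: it shows how the communities line up with the three species, without the labels having been used to build the graph.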

R: Convert correlation matrix to edge list

I want to create a network graph of my data, where the weight of the edges is defined by the correlation coefficient in a correlation matrix. The connection is defined by being statistically significant or not.
Since I want to play around with some parameters I need to have this information in an edge list rather than in matrix form, but I'm struggling as to how to convert this. I have tried to used igraph as shown below, but I cannot figure out how to get the information on which correlations are significant and which are not into the edge list. I guess weight could be set to zero to code that info, but how do I combine a correlation matrix and a p-value matrix?
library(igraph)
g <- graph.adjacency(a,weighted=TRUE)
df <- get.data.frame(g)
df
It'd be great if you could provide a minimal reproducible example, but I think I understand what you're asking for. You'll need to make a graph from the matrix using graph_from_adjacency_matrix, making sure to pass something to the weighted parameter, because otherwise the elements of the matrix are interpreted as numbers of edges (so values below 1 mean no edge). Then you can create an edge list from the graph using as_data_frame, perform whatever calculation you want or join any external data you have, and convert the result back to a graph with graph_from_data_frame:
cor_mat <- cor(mtcars)
cor_g <- graph_from_adjacency_matrix(cor_mat, mode='undirected', weighted = 'correlation')
cor_edge_list <- as_data_frame(cor_g, 'edges')
only_sig <- cor_edge_list[abs(cor_edge_list$correlation) > .75, ]
new_g <- graph_from_data_frame(only_sig, F)
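The example above filters on the size of the correlation rather than on significance. To carry p-values into the edge list, one possibility is to build the edge list directly from the upper triangles of the two matrices. The sketch below assumes Pearson correlations tested with cor.test and a 0.05 cut-off with no multiple-testing correction; p_mat and the column names are illustrative, and a p-value matrix from any other source could be substituted.

```r
library(igraph)

dat <- mtcars
cor_mat <- cor(dat)
n <- ncol(dat)

# p-value matrix from pairwise Pearson tests (assumption: cor.test defaults)
p_mat <- matrix(NA_real_, n, n, dimnames = dimnames(cor_mat))
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    p_mat[i, j] <- p_mat[j, i] <- cor.test(dat[[i]], dat[[j]])$p.value
  }
}

# One row per unordered pair of variables, taken from the upper triangle
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)
edges <- data.frame(
  from        = rownames(cor_mat)[idx[, "row"]],
  to          = colnames(cor_mat)[idx[, "col"]],
  correlation = cor_mat[idx],
  p_value     = p_mat[idx]
)
edges$significant <- edges$p_value < 0.05

# Graph containing only the significant correlations; the extra columns
# become edge attributes
sig_g <- graph_from_data_frame(edges[edges$significant, ], directed = FALSE)
```

Keeping the significance flag as a column, instead of setting weights to zero, leaves the full edge list available for experimenting with other thresholds.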
For the ones who still need this, here is the answer
library(igraph)
g <- graph.adjacency(a, mode="upper", weighted=TRUE, diag=FALSE)
e <- get.edgelist(g)
df <- data.frame(from = e[, 1], to = e[, 2], weight = E(g)$weight)  # keeps weight numeric

Change Layout Structure in IGraph Plot based on Community

I created an igraph with a community membership identified:
fc <- fastgreedy.community(graph)
colors <- rainbow(max(membership(fc)))
This provided me the clusters that each of the nodes belong to.
Now when I plot this:
plot(graph,vertex.color=colors[membership(fc)],
layout=layout.kamada.kawai)
it doesn't provide a layout where it exclusively separates each group of nodes based on the membership. Does anyone know a different layout that can provide this? All this is doing is taking the layout: kamada.kawai and coloring in the memberships rather than restructuring the layout so that it is organized by membership.
Hope this question makes sense. Thanks!
You have to calculate the Kamada-Kawai layout with an artificial weight vector that assigns a high weight to edges within clusters and a low weight to edges that cross cluster boundaries:
> graph <- grg.game(100, 0.2) # example graph
> cl <- fastgreedy.community(graph)
> weights <- ifelse(crossing(cl, graph), 1, 100)
> layout <- layout_with_kk(graph, weights=weights)
> plot(graph, layout=layout)
The trick here is the ifelse(crossing(cl, graph), 1, 100) part -- crossing(cl, graph) takes a clustering and the graph that the clustering belongs to, and returns a Boolean vector that defines for each edge whether the edge is crossing cluster boundaries or not. The ifelse() call then simply replaces TRUE (i.e. edge crossing boundaries) in this vector with 1 and FALSE (i.e. edge stays within the cluster) with 100.
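One caveat worth checking against your igraph version: the documentation for layout_with_kk describes its weights as edge lengths (larger values give longer edges), which is the opposite of layout_with_fr, where larger weights pull vertices closer together. A minimal sketch of the same trick with the Fruchterman-Reingold layout, where high within-cluster weights are expected to tighten the clusters, might look like this; the example graph and the weights 1 and 100 are simply carried over from the answer above.

```r
library(igraph)

set.seed(1)
graph <- sample_grg(100, 0.2)               # example geometric random graph
cl <- cluster_fast_greedy(graph)

# TRUE for edges that run between two different communities
is_crossing <- crossing(cl, graph)

# Strong attraction inside communities, weak attraction across them
w <- ifelse(is_crossing, 1, 100)
coords <- layout_with_fr(graph, weights = w)

mem <- as.vector(membership(cl))
plot(graph, layout = coords,
     vertex.color = rainbow(max(mem))[mem],
     vertex.label = NA, vertex.size = 5)
```

If you prefer to stay with Kamada-Kawai and find the clusters pushed apart internally, swapping the two weight values is the corresponding adjustment.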

After clustering in R (iGraph, etc), can you maintain nodes+edges from a cluster to do individual cluster analysis?

Basically I have tried a few different ways of clustering. I can usually get to a point in iGraph where each node is labeled with a cluster. I can then identify all the nodes within a single cluster. However, this loses their edges.
I'd have to re-iterate back over the original dataset for all the nodes in cluster 1 to get only those where both nodes+the edge are within the cluster. I'd have to do this for every cluster.
This seems like a painfully long process and there is probably a shortcut my google-fu is missing.
So, is there an easy way, after clustering or running a community detection process, to keep an individual cluster/community as its own smaller graph -- that is, retaining all nodes AND the edges between them?
You can use delete.vertices() to create a subgraph. Example:
library(igraph)
set.seed(123)
# create random graph
g <- barabasi.game(100, directed = F)
plot(g, layout=layout.fruchterman.reingold)
# do community detection
wc <- multilevel.community(g)
V(g)$community <- membership(wc)
# make community 1 subgraph
g_sub <- delete.vertices(g, V(g)[community != 1])
plot(g_sub, layout=layout.fruchterman.reingold)
An alternative:
#Create random network
d <- sample_gnm(n=50,m=40)
#Identify the communities
dc <- cluster_walktrap(d)
#Induce a subgraph out of the first community
dc_1 <- induced.subgraph(d,dc[[1]])
#plot that specific community
plot(dc_1)
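Since the question mentions having to repeat this for every cluster, both approaches can be extended to all communities at once. The sketch below is one way to do it, assuming Louvain communities on a toy preferential-attachment graph; community_graphs is just an illustrative name.

```r
library(igraph)

set.seed(123)
g <- sample_pa(100, directed = FALSE)       # toy graph (assumption)
wc <- cluster_louvain(g)

# communities() returns one vector of vertex ids per community;
# induced_subgraph() keeps those vertices and every edge between them
community_graphs <- lapply(communities(wc),
                           function(v) induced_subgraph(g, v))

# Size of each community subgraph
sapply(community_graphs, vcount)
sapply(community_graphs, ecount)
```

Each element of the list is an ordinary igraph object, so per-cluster analysis (degree, centrality, plotting) can be run on it directly. Edges that cross between communities are dropped by construction.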

How to spread out community graph made by using igraph package in R

I am trying to find communities in tweet data. The cosine similarity between different words forms the adjacency matrix, and I created a graph out of that adjacency matrix. Visualization of the graph is the task here:
# Document Term Matrix
dtm = DocumentTermMatrix(tweets)
### adjust threshold here
dtms = removeSparseTerms(dtm, 0.998)
dim(dtms)
# cosine similarity matrix
t = as.matrix(dtms)
# comparing two word feature vectors
#cosine(t[,"yesterday"], t[,"yet"])
numWords = dim(t)[2]
# cosine measure between all column vectors of a matrix.
adjMat = cosine(t)
r = 3
for(i in 1:numWords)
{
highElement = sort(adjMat[i,], partial=numWords-r)[numWords-r]
adjMat[i,][adjMat[i,] < highElement] = 0
}
# build graph from the adjacency matrix
g = graph.adjacency(adjMat, weighted=TRUE, mode="undirected", diag=FALSE)
V(g)$name
# remove loop and multiple edges
g = simplify(g)
wt = walktrap.community(g, steps=5) # default steps=2
table(membership(wt))
# set vertex color & size
nodecolor = rainbow(length(table(membership(wt))))[as.vector(membership(wt))]
nodesize = as.matrix(round((log2(10*membership(wt)))))
nodelayout = layout.fruchterman.reingold(g,niter=1000,area=vcount(g)^1.1,repulserad=vcount(g)^10.0, weights=NULL)
par(mai=c(0,0,1,0))
plot(g,
layout=nodelayout,
vertex.size = nodesize,
vertex.label=NA,
vertex.color = nodecolor,
edge.arrow.size=0.2,
edge.color="grey",
edge.width=1)
I just want to have some more gap between separate clusters/communities.
To the best of my knowledge, you can't lay out vertices of the same community close to each other using igraph only. I have implemented this function in my package NetPathMiner. Since it is a bit hard to install the package just for the visualization function, I will write a simple version of it here and explain what it does.
layout.by.attr <- function(graph, wc, cluster.strength=1,layout=layout.auto) {
g <- graph.edgelist(get.edgelist(graph)) # create a lightweight copy of graph w/o the attributes.
E(g)$weight <- 1
attr <- cbind(id=1:vcount(g), val=wc)
g <- g + vertices(unique(attr[,2])) + igraph::edges(unlist(t(attr)), weight=cluster.strength)
l <- layout(g, weights=E(g)$weight)[1:vcount(graph),]
return(l)
}
Basically, the function adds one extra vertex per community, connected to all vertices belonging to that community. The layout is calculated based on the new graph. Since the members of each community are now connected by a common vertex, they tend to cluster together.
As Gabor said in the comment, increasing edge weights will also have a similar effect. The function leverages this: by increasing cluster.strength, edges between the created vertices and their communities are given higher weights.
If this is still not enough, you can extend this principle (calculating the layout on a more connected graph) by adding edges between all vertices of the same community (forming a clique). From my experience, this is a bit of an overkill.
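For reference, here is a variant of the same idea written with explicit vertex indices, so that integer community labels cannot be mistaken for existing vertex ids. It is a sketch under assumptions, not the NetPathMiner implementation: the name layout_by_community, its arguments and the default of layout_with_fr are all hypothetical choices.

```r
library(igraph)

# Hypothetical helper: compute a layout on a copy of the graph that has one
# auxiliary "hub" vertex per community, then drop the hubs' coordinates.
layout_by_community <- function(graph, membership, cluster_strength = 1,
                                layout_fun = layout_with_fr) {
  n <- vcount(graph)
  membership <- as.integer(factor(as.vector(membership)))  # relabel as 1..k
  k <- max(membership)

  # Structure-only copy of the graph; original edges get weight 1
  g <- make_empty_graph(n, directed = FALSE)
  g <- add_edges(g, t(as_edgelist(graph, names = FALSE)))
  E(g)$weight <- 1

  # Hubs are vertices n+1 .. n+k; vertex i is tied to hub n + membership[i]
  g <- add_vertices(g, k)
  hub_edges <- as.vector(rbind(seq_len(n), n + membership))
  g <- add_edges(g, hub_edges, weight = cluster_strength)

  coords <- layout_fun(g, weights = E(g)$weight)
  coords[seq_len(n), , drop = FALSE]        # keep only the original vertices
}

# Usage on a toy graph (assumption)
set.seed(7)
g0 <- sample_gnm(30, 40)
mem <- membership(cluster_walktrap(g0))
co <- layout_by_community(g0, mem, cluster_strength = 5)
plot(g0, layout = co, vertex.color = rainbow(max(mem))[mem], vertex.label = NA)
```

Raising cluster_strength should pull each community more tightly around its hub, which in turn tends to widen the gaps between communities.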

Resources