How to find measures after community detection in igraph (R)?

I am working with community detection in graphs. I have gone through the different community detection algorithms implemented in igraph and have been plotting the community structures. Now, after getting the communities object for different algorithms, I want to compare the algorithms based on different measures such as density, cut ratio, and coverage (I know that modularity is already implemented). I can obtain a subgraph and then calculate the intra-cluster density, but I do not know how to proceed to find the inter-cluster density. This is the code I have been using to find the intra-cluster density:
karate <- graph.famous("Zachary")
wckarate <- walktrap.community(karate) #any algorithm
subg1<-induced.subgraph(karate, which(membership(wckarate)==1)) #membership id differs for each cluster
intradensity1 <- ecount(subg1)/ecount(karate) #for each cluster
Similarly, I could proceed for each cluster and sum all the densities or take the average of them all. My question is: if the number of communities is very large, how should I proceed?
And if I want to extract the number of edges between different communities, is there a nice way to do that?
Please pardon me if this question has already been asked. I am a novice with igraph and R.

Well, we can just adapt your code to loop over the different subgroups
karate <- graph.famous("Zachary")
wckarate <- walktrap.community(karate) #any algorithm
sapply(unique(membership(wckarate)), function(g) {
  subg <- induced.subgraph(karate, which(membership(wckarate) == g)) # subgraph of community g
  ecount(subg) / ecount(karate)
})
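Note that ecount(subg)/ecount(karate) gives the share of all edges that fall inside each cluster, not a density in the usual edges/possible-edges sense. If a per-cluster internal density is what you are after, a sketch of the same loop using graph.density (called edge_density in newer igraph versions) would be:
# per-cluster internal density: realised edges / possible edges within that cluster
sapply(unique(membership(wckarate)), function(g) {
  subg <- induced.subgraph(karate, which(membership(wckarate) == g))
  graph.density(subg)
})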
As for getting the edges between the communities, you could do:
#get all combinations of communities
cs <- data.frame(combn(unique(membership(wckarate)),2))
cx <- sapply(cs, function(x) {
  es <- E(karate)[V(karate)[membership(wckarate) == x[1]] %--%
                  V(karate)[membership(wckarate) == x[2]]]
  length(es)
})
cbind(t(cs),cx)
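If you just need the total number of inter-community edges rather than the per-pair counts, igraph's crossing() flags every edge whose endpoints lie in different communities, so a short sketch would be:
xe <- crossing(wckarate, karate)   # TRUE for each edge that runs between two communities
sum(xe)                            # total number of inter-community edges
sum(xe) / ecount(karate)           # share of all edges that are inter-community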
Also, you can plot the communities to make sure the result looks reasonable:
plot(wckarate, karate)

Related

Generating multiple random graphs in R with same number of nodes and ties?

I would like to generate a large number of random graphs with the same number of nodes and ties, and use the results to find the distributions, etc., of the standard metrics.
I found this link for generating random graphs with a given number of nodes and ties (Graph generation given number of edges and nodes). Is there an easy way to tell R to do this 1000 times or so and combine all of the graphs into one object that I can then analyse (for things like average distance, degree, diameter, etc.)?
Ultimately I want to be able to use this information for comparison with an empirical network.
I got this answer from a friend, and it appears to work:
RR1 <- list()
for (i in 1:1000) {
  RR1[[i]] <- erdos.renyi.game(559, 9640, type = "gnm", directed = FALSE, loops = FALSE)
}
number_of_edges <- sapply(RR1, gsize)
number_of_edges
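From there, a possible next step is to compute each metric of interest over the whole list and look at its distribution; a sketch for the metrics mentioned in the question (average distance, degree, diameter) could be:
# one row of summary metrics per random graph, for comparison with the empirical network
metrics <- data.frame(
  avg.path.length = sapply(RR1, average.path.length),
  diameter        = sapply(RR1, diameter),
  mean.degree     = sapply(RR1, function(g) mean(degree(g)))
)
summary(metrics)  # distribution of each metric across the 1000 random graphs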

Spatstat Point Pattern Analysis for colocalization

I am trying to do some colocalization analysis, i.e. I want to show whether one cell type tends to appear significantly closer to another cell type in a microscopy image.
I tried to do this with the R spatstat package and was able to visualize my dataset:
mypattern is one kind of cell and mypattern2 is another kind of cell. When you look at the L-plots you can see that there is some kind of clustering, as the curve deviates from Poisson.
I thought about using a nearest-neighbour approach, i.e. the nncross function in spatstat.
But how can I now show whether this distance is what you would expect for two random point patterns or whether it is significant? Does anyone have an idea? I have seen a lot about Monte Carlo simulations, but I have no idea how to begin coding...
I would be glad for any help!
Kind regards,
Hashirama
The L function should not be used here because the data are highly inhomogeneous.
I suggest you combine the two point patterns into a single "marked" point pattern,
X <- superimpose(A = mypattern, B = mypattern2)
Then estimate the spatially-varying densities of points
D <- density(split(X))
plot(D)
or the spatially varying proportions of each type of cell
R <- relrisk(X)
plot(R)
You can also use segregation.test or a contingency table of nearest neighbours (dixon).
See Chapter 14 of the spatstat book and the help files for relrisk, density.splitppp and segregation.test.
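As a rough sketch of the segregation test mentioned above (assuming X is the marked pattern created with superimpose; the number of simulations is only an illustrative choice):
# Monte Carlo test of spatial segregation between the two cell types
segregation.test(X, nsim = 99)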

Correct way of calculating modularity for weighted graphs

I have about 13000 genes which I am trying to cluster using igraph as follows:
g.communities <- edge.betweenness.community(as.undirected(g), weights = E(g)$weight)
which returns 97 communities with modularity 0.9773353:
modularity(as.undirected(g), membership = g.communities$membership, weights = E(g)$weight)
#0.9773353
When I tried to set the number of communities manually, as below, I got a modularity of 0.0094:
modularity(as.undirected(g), membership = cutat(g.communities, steps = 97), weights =
E(g)$weight)
#0.0094
Shouldn't these functions return similar results? Also, is it possible to use the above function to find the correct number of clusters (since by just increasing the steps the modularity always increases)?
Finally g.communities$modularity returns a number for each vertex.
Can these numbers be interpreted as the correlation of each vertex to its corresponding module?
You are using the steps argument of cut_at. This does not specify the number of communities, but the number of merging steps to perform on the dendrogram. If you want 97 communities, use cut_at(g.communities, no=97) or simply cut_at(g.communities, 97).
That said, I do not suggest using edge.betweenness.community on weighted graphs at this time, for the reasons I described here.
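To make the distinction concrete, a small sketch (assuming g and g.communities are the objects from the question):
# cut the dendrogram into exactly 97 communities, rather than performing 97 merge steps
memb <- cut_at(g.communities, no = 97)
modularity(as.undirected(g), membership = memb, weights = E(g)$weight)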

Timeseries cluster validation: using cluster.stats metrics to decide optimal cluster number

I am clustering timeseries data using appropriate distance measures and clustering algorithms for longitudinal data. My goal is to validate the optimal number of clusters for this dataset, through cluster result statistics. I read a number of articles and posts on stackoverflow on this subject, particularly: Determining the Optimal Number of Clusters. Visual inspection is only possible on a subset of my data; I cannot rely on it to be representative of my whole dataset since I am dealing with big data.
My approach is the following:
1. I cluster several times using different numbers of clusters.
2. For each of these options I calculate the cluster statistics using the cluster.stats function from the fpc R package (Cluster.Stats from FPC Cran Package). I plot these and decide, for each metric, which is the best cluster number (see my code below).
My problem is that these metrics each evaluate a different aspect of the clustering "goodness", and the best number of clusters for one metric may not coincide with the best number of clusters of a different metric. For example, Dunn's index may point towards using 3 clusters, while the within-sum of squares may indicate that 75 clusters is a better choice.
I understand the basics: distances between points within a cluster should be small, clusters should be well separated from each other, the sum of squares should be minimized, and observations in different clusters should be strongly dissimilar. However, I do not know which of these metrics is most important to consider when evaluating cluster quality.
How do I approach this problem, keeping in mind the nature of my data (timeseries) and the goal to cluster identical series / series with strongly similar pattern regions together?
Am I approaching the clustering problem the right way, or am I missing a crucial step? Or am I misunderstanding how to use these statistics?
Here is how I am deciding the best number of clusters using the statistics:
cs_metrics is my data frame containing the statistics.
Average.within.best <- cs_metrics$cluster.number[which.min(cs_metrics$average.within)]
Average.between.best <- cs_metrics$cluster.number[which.max(cs_metrics$average.between)]
Avg.silwidth.best <- cs_metrics$cluster.number[which.max(cs_metrics$avg.silwidth)]
Calinsky.best <- cs_metrics$cluster.number[which.max(cs_metrics$ch)]
Dunn.best <- cs_metrics$cluster.number[which.max(cs_metrics$dunn)]
Dunn2.best <- cs_metrics$cluster.number[which.max(cs_metrics$dunn2)]
Entropy.best <- cs_metrics$cluster.number[which.min(cs_metrics$entropy)]
Pearsongamma.best <- cs_metrics$cluster.number[which.max(cs_metrics$pearsongamma)]
Within.SS.best <- cs_metrics$cluster.number[which.min(cs_metrics$within.cluster.ss)]
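One small, purely illustrative helper (using only the variables computed above) is to collect the per-metric winners into a single table so they can be compared at a glance:
# gather the "best" cluster number suggested by each metric
best.k <- data.frame(
  metric = c("average.within", "average.between", "avg.silwidth", "ch",
             "dunn", "dunn2", "entropy", "pearsongamma", "within.cluster.ss"),
  best   = c(Average.within.best, Average.between.best, Avg.silwidth.best,
             Calinsky.best, Dunn.best, Dunn2.best, Entropy.best,
             Pearsongamma.best, Within.SS.best)
)
best.k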
Here is the result:
Here are the plots that compare the cluster statistics for the different numbers of clusters:

find number of clusters in hierarchical clustering dendrogram after cutree in r

I have a problem finding the number of clusters after using cutree on a dendrogram. Here is my approach:
mat <- ...  # a huge matrix
hc <- hclust(as.dist(mat), method = "average", members = NULL)
# to cut the tree just 1 level below the maximum height
tree <- cutree(hc, h = hc$height[[length(hc$height) - 1]])
By printing the tree variable I can see that my dendrogram is cut into two clusters. I can also get the labels from each cluster using names(tree[tree==1]), but how can I get the number of clusters without looking at the data? I want to automate this in a pipeline based on the number of clusters in the tree variable.
Finally I managed to answer my own question by running a loop over the tree object after cutting the dendrogram, but this might not be an optimal solution. Feel free to suggest modifications to make it more elegant.
clust <- c()
for (i in 1:length(tree)) {
  clust[i] <- tree[[i]]
}
length(unique(clust))
This should give the answer, as far as I know.
Thank you
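As a side note, the loop is not strictly necessary: cutree already returns a membership vector, so the cluster count can be read off directly (an equivalent one-liner to the loop above):
# tree is the vector returned by cutree; counting its distinct labels gives the number of clusters
length(unique(tree))
# or, since cutree labels clusters consecutively from 1:
max(tree)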
