Correct way of calculating modularity for weighted graphs - r

I have about 13000 genes which I am trying to cluster using igraph as follows:
g.communities <- edge.betweenness.community(as.undirected(g), weights = E(g)$weight)
which returns 97 communities with modularity 0.9773353:
modularity(as.undirected(g), membership = g.communities$membership, weights = E(g)$weight)
#0.9773353
when I tried to custom made the number of communities as below I get modularity of 0.0094:
modularity(as.undirected(g), membership = cutat(g.communities, steps = 97), weights =
E(g)$weight)
#0.0094
Shouldn't these functions return similar results? Also, is it possible to use the above
function to find the correct number of clusters? (since by just increasing the steps the modularity always increases)
Finally g.communities$modularity returns a number for each vertex.
Can these numbers be interpreted as the correlation of each vertex to its corresponding module?

You are using the steps argument of cut_at. This does not specify the number of communities, but the number of merging steps to perform on the dendrogram. If you want 97 communities, use cut_at(g.communities, no=97) or simply cut_at(g.communities, 97).
That said, I do not suggest using edge.betweenness.community on weighted graphs at this time, for the reasons I described here.

Related

Generating multiple random graphs in R with same number of nodes and ties?

I would like to generate a large number of random graphs with the same number of nodes and ties, and use the result to find the distributions etc of the standard metrics.
I found this link for generating random graphs with a given number of nodes and ties (Graph generation given number of edges and nodes). Is there an easy way to tell R to do this 1000x or so, and combine all of those into one object, that I can then analyze? (for things like av. distance, degree, diameter, etc).
Ultimately I want to be able to use this information for comparison with an empirical network.
I got this answer from a friend, and it appears to work:
RR1 <- list()
for(i in 1:1000) {
RR1[[i]] <- erdos.renyi.game(559,9640,type=c("gnm"),directed=FALSE,loops=FALSE)
}
number_of_edges <- sapply(RR1, gsize)
number_of_edges

How can I calculate degree, eigenvector, bonacich power centrality using R igraph package?

I am trying to calculate degree centrality, eigenvetor centrality and bonacich power centrality using igraph package in R. My data is South Korea's commuting data.
My data looks like this1st column: orientation region code, 2nd column: destination region code, 3rd column: commuting times between the two regions
I have made igraph graph using function graph_from_data_frame() like the picture below.
od18 is the data I used. The same one mentioned at the first picture.
But here are my problems.
I can't make an adjacency matrix using this graph.
: Error in get.adjacency.sparse(graph, type = type, attr = attr, edges = edges, :
Sparse matrices must be either numeric or logical,and the edge attribute is not
this is the code I executed.
this is the error.
Actually, I don't care about the adjacency matrix if centrality calculations don't have problems. But I am worrying what if this means that I am not going to get correct results of centrality calculations.
I tried to calculate the degree centrality using function degree() but the results values are all same.
All of my nodes have the same degree values as 250.
Any help about Bonacich power centrality using -beta.
Can you help me with these problems?

Understanding R subspace package clustering output

Somewhat related to this question.
I am using R subspace package for subspace clustering. As in the question above, I have failed to use the generic plotting method to plot out my resulting clusters in a way native to the package. The next step is to understand the output of the command
CLIQUE(df, xi = 40, tau = 0.2)
That looks something like this:
I understand that the "object" is the row number for the clustered unit, and the subspace indicates the dimensions of the data in which the clustering was done. However I don't see how the clusters in the given dimensions can be distinguished.
The documentation does not contain information on the output. Ideally, my goal is to plot out all the clusters with something like ggplot2, or in 3D, what have you. And I need to know which units are in which clusters in corresponding dimensions.
Additionally, checked if the dimensions of any of the two members of the output list are the same like this:
cluster_result <- clique_model
equalities_matrix <- matrix(
0L, nrow = length(cluster_result), ncol = length(cluster_result)
)
for (i in 1:length(cluster_result)){
for (j in 1:length(cluster_result)){
equalities_matrix[i,j] <- (
all(cluster_result[[i]]$subspace == cluster_result[[j]]$subspace)
)
}
}
sum(equalities_matrix)
The answer is no.
So, here is what my researches lead to, might be helpful for someone in the future.
Above, the CLIQUE algorithm above outputted one cluster per every set of dimensions, possibly by the virtue of the data or tuning of the algorithm. I added more features to the data, ran it again, and checked if the dimensions of any of the two members of the output list are the same again. This time, yes, several dimensions were the same as the sum(equalities_matrix) yielded a number larger than the number of features.
In conclusion, the output of the algorithm is a list of lists where each member list represents one cluster in one subspace:
subspace ... the dimensions making the subspace indicated with TRUE,
objects ... members of the cluster.
If there is more than one cluster in a given subspace, there will be more member lists with the same subspace, and different members of the cluster.
Here are the papers that helped me understand the theory:
Parsons, Haque, and Liu; 2004
Agrawal Gehrke, Gunopulos, Raghavan

Replication of Erdos-Renyi graph

I'm really new to R and I have got an assignment for my classes. I have to create 1000 networks of Erdos-Renyi model. The thing is, I actually can create one model, check it parameters like degree distribution, plot it etc.. Also I can check its transitivity and so on. However, I have to compare the average clustering coefficient (local transitivity) of those 1000 networks to some network that we have been working on classes in Cytoscape. This is the code that I already know:
library(igraph)
g<-erdos.renyi.game(1000,2000, type=c("gnm"))
transitivity(g) #and other atrributes...
g2<-replicate(g,1000)
transitivity(g2[[1]])
#now I have sort of list with every graph but when I try to analyze
#I got the communicate that it is not a graph object
I have to calculate standard deviation and mean ACC from those 1000 networks, and then compare it.
I will appreciate any kind of help.
I tried a lot actually:
g<-erdos.renyi.game(1026,2222,type=c("gnm"))
g2<-1000*g
transitivity(g2[2]) # this however ends in "not a graph object"error
g2[1] #outputs the adjacency matrix but instead of 1026 vertices,
#I've got 1026000 vertices, so multiplication doesn't replicate function
#but parameters
Also, I have tried unifying the list of graphs
glist<-1000*g
acc<-union(glist, byname="auto")
transitivity(acc) #outputs the same value as first function g (only one
#erdos-renyi graph
To multiply many graphs use replication function below
g<-erdos.renyi.game(100, 20, type=c("gnm"))
g2<-replicate(1000, erdos.renyi.game(100, 20, type=c("gnm")), simplify=FALSE);
sapply(g2, transitivity)
To calculate mean of some attribute like average degree or transitivity use:
mean(sapply(g2, transitivity))

How to calculate the quality of clustering by dtw?

my aim is to cluster 126 time-series concerning 26 weeks (so each time-series has 26 observation). I used pam{cluster} = partitioning around medoids to cluster these time-series.
Before clustering I wanted to compare which distance measure is the most appropriate: euclidean, manhattan or dynamic time warping. I used each distance to cluster and compare by silhouette plot. Is there any way I can compare different distance measure?
For example I know that procedure clValid {clValid} to validate cluster results, however I cannot implement dtw to calculate indexes.
So how can I compare different distance metrics (not only by silhouette)?
Additional question: is GAP statistic enough to decide how many clusters choose? Or should I evaluate number of clusters with different methods or compare two or three ways how to do it?
I would be grateful for any suggestions.
I have just read the book "cluster analysis, fifth edition" by Brian S. Everitt, etc. And currently, I adopt the following strategy to select method to calculate distance matrix, clustering and validation:
for distance: using cmdscale{stats} function to calculate multidimentional scaling, and plot the scatterplot of the two scaling dimensions with density information. As expected, if there is distinct clusters or nested clusters, the scatterplot will give some hints.
for clustering: for every clustering method, calculate cophenetic correlation between clustering results and the distance, this can be calculated using cophenetic{stats} function. The best clustering method will give higher correlation. However, this is only working for hierarchical clustering. I haven't idea for other clustering methods, like pam, or kmeans.
for partition evaluation: package {clusterSim} give several function to calculate the index to evaluate the clustering quality. Another package {NbClust} also calculate so many as 30 index to evaluate the combination of "distance", "clustering" and "number of clusters". However, this package partition the hierarchical tree using {cutree}, which is not suitable for nested clustering structure. Another method provided by {dynamicTreeCut} give reasonable results.
for cluster number determination: will added later.
Cluster data for which you have class labels, and use the RAND index to measure cluster quality.
50 such datasets are at the UCR time series archive
This paper does something similar
http://www.cs.ucr.edu/~eamonn/ClusteringTimeSeriesUsingUnsupervised-Shapelets.pdf

Resources