Generating multiple random graphs in R with the same number of nodes and ties?

I would like to generate a large number of random graphs with the same number of nodes and ties, and use the result to find the distributions etc. of the standard metrics.
I found this link for generating random graphs with a given number of nodes and ties (Graph generation given number of edges and nodes). Is there an easy way to tell R to do this 1000 times or so, and to combine all of those graphs into one object that I can then analyse (for things like average distance, degree, diameter, etc.)?
Ultimately I want to be able to use this information for comparison with an empirical network.

I got this answer from a friend, and it appears to work:
library(igraph)
# generate 1000 G(n, m) random graphs, each with 559 nodes and 9640 ties
RR1 <- list()
for (i in 1:1000) {
  RR1[[i]] <- erdos.renyi.game(559, 9640, type = "gnm", directed = FALSE, loops = FALSE)
}
# sanity check: every replicate should report exactly 9640 edges
number_of_edges <- sapply(RR1, gsize)
number_of_edges
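With RR1 in hand, per-graph metrics such as average distance, degree, and diameter can be collected the same way. A minimal sketch (the specific metric calls here are illustrative, not from the friend's answer):
# one value per replicate, ready for summary(), hist(), sd(), etc.
mean_distances <- sapply(RR1, mean_distance)
diameters      <- sapply(RR1, diameter)
mean_degrees   <- sapply(RR1, function(g) mean(degree(g)))
summary(mean_distances)
hist(diameters)
These empirical distributions are what you would then compare the observed network's statistics against.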

Related

Understanding R subspace package clustering output

Somewhat related to this question.
I am using the R subspace package for subspace clustering. As in the question above, I have failed to use the generic plotting method to plot my resulting clusters in a way native to the package. The next step is to understand the output of the command
CLIQUE(df, xi = 40, tau = 0.2)
That looks something like this:
I understand that the "object" is the row number for the clustered unit, and the subspace indicates the dimensions of the data in which the clustering was done. However I don't see how the clusters in the given dimensions can be distinguished.
The documentation does not contain information on the output. Ideally, my goal is to plot out all the clusters with something like ggplot2, or in 3D, what have you. And I need to know which units are in which clusters in corresponding dimensions.
Additionally, I checked whether the subspaces of any two members of the output list are the same, like this:
cluster_result <- clique_model
# pairwise comparison: TRUE where two clusters share exactly the same subspace
equalities_matrix <- matrix(
  0L, nrow = length(cluster_result), ncol = length(cluster_result)
)
for (i in 1:length(cluster_result)) {
  for (j in 1:length(cluster_result)) {
    equalities_matrix[i, j] <- (
      all(cluster_result[[i]]$subspace == cluster_result[[j]]$subspace)
    )
  }
}
sum(equalities_matrix)
The answer is no.
So, here is what my research led to; it might be helpful for someone in the future.
Above, the CLIQUE algorithm output one cluster per set of dimensions, possibly by virtue of the data or the tuning of the algorithm. I added more features to the data, ran it again, and checked again whether the subspaces of any two members of the output list are the same. This time, yes: several subspaces were the same, since sum(equalities_matrix) yielded a number larger than the length of the list (which the diagonal alone accounts for).
In conclusion, the output of the algorithm is a list of lists, where each member list represents one cluster in one subspace:
subspace ... a logical vector in which TRUE marks the dimensions that make up the subspace,
objects ... the members (row indices) of the cluster.
If there is more than one cluster in a given subspace, there will be several member lists with the same subspace and different cluster members.
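Given that structure, pulling out which units fall in which cluster, and in which dimensions, is a short exercise. A sketch under those assumptions (using the clique_model object from above; nothing here comes from the package documentation):
# one summary entry per cluster: its index, its dimensions, and its member rows
cluster_summary <- lapply(seq_along(clique_model), function(i) {
  list(
    cluster    = i,
    dimensions = which(clique_model[[i]]$subspace),  # TRUE entries mark the subspace
    members    = clique_model[[i]]$objects           # row numbers of df in this cluster
  )
})
cluster_summary[[1]]
From there, a ggplot2 scatter of df restricted to the dimensions of one cluster, coloured by membership, is straightforward.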
Here are the papers that helped me understand the theory:
Parsons, Haque, and Liu (2004), Subspace Clustering for High Dimensional Data: A Review
Agrawal, Gehrke, Gunopulos, and Raghavan (1998), Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications

nstart for k-means in R

Search results in numerous places report that the argument nstart in R's function kmeans sets a number of random starts of the algorithm and chooses 'the best one'; see e.g. https://datascience.stackexchange.com/questions/11485/k-means-in-r-usage-of-nstart-parameter. Can anyone provide any clarity on how it does this, i.e. by what measure does it define best?
Secondly: R's kmeans function takes an argument centers. Here, as typical in k-means, it is possible to initialise the centroids before the algorithm begins expectation-maximisation, by choosing as initial centroids rows (data-points) from within your data-set. (You could supply, in vector form, points not present in your data-set as well, with considerably greater effort. In this case you could in theory choose the global optimum as your centroids. This is not what I'm asking for.) When nstart or the seed randomises initializations, I am quite sure that it does so by picking a random choice of centroids from your data-set and starting from those (not just a random set of points within the space).
In general, therefore, I'm looking for a way to get a good (e.g. best out of $n$ trials, or best from nstart) set of starting data-instances from the data-set as initial centroids. Is there any way of extracting the 'winning' (=best) set of initial centroids from nstart (which I could then use, say, in the centers parameter in future)? Any other streamlined & quick way to get a very good set of starting centroids (presumably, reasonably close to where the cluster centres will end up being)?
Is there perhaps, at least, a way to extract from a given kmeans run, what initial centroids it chose to start with?
The criterion that kmeans tries to minimize is the trace of the within scatter matrix, i.e. (unfortunately, this forum does not support LaTeX, but you hopefully can read it nevertheless):
$$ \mathrm{trace}(S_w) = \sum_{k=1}^{K} \sum_{x \in C_k} \|x - \mu_k\|^2 $$
Concerning the best starting point: obviously, the "best" starting point would be the cluster centers eventually chosen by kmeans. These are returned in the attribute centers:
km <- kmeans(iris[,-5], 3)
print(km$centers)
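As a quick sanity check that tot.withinss is exactly that criterion, it can be recomputed by hand from the fit above (a small sketch, not part of the original answer):
# squared distance of every point to the centre of its assigned cluster
manual <- sum((as.matrix(iris[,-5]) - km$centers[km$cluster, ])^2)
all.equal(manual, km$tot.withinss)  # TRUE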
If you are looking for the best random start point, you can create random start points yourself (with runif), do this nstart times and evaluate which initial configuration leads to the smallest km$tot.withinss:
nstart <- 10
K <- 3 # number of clusters
D <- 4 # data point dimension
# select possible range
r.min <- apply(iris[,-5], MARGIN=2, FUN=min)
r.max <- apply(iris[,-5], MARGIN=2, FUN=max)
for (i in 1:nstart) {
  # first coordinate of the K random centers
  centers <- data.frame(runif(K, r.min[1], r.max[1]))
  # remaining coordinates
  for (d in 2:D) {
    centers <- cbind(centers, data.frame(runif(K, r.min[d], r.max[d])))
  }
  names(centers) <- names(iris[,-5])
  # call kmeans with centers and compare tot.withinss
  # ...
}
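One hedged way to finish that loop, keeping the winning start point alongside its fit, might look like this (a sketch only, not the canonical approach):
best <- NULL
for (i in 1:nstart) {
  # draw K random starting centers uniformly within the range of the data
  centers <- as.data.frame(sapply(1:D, function(d) runif(K, r.min[d], r.max[d])))
  names(centers) <- names(iris[,-5])
  km <- kmeans(iris[,-5], centers = centers)  # an unlucky draw may give an empty-cluster error
  # keep the start point with the smallest total within-cluster sum of squares
  if (is.null(best) || km$tot.withinss < best$fit$tot.withinss) {
    best <- list(start = centers, fit = km)
  }
}
best$start            # the random initial centroids of the winning run
best$fit$tot.withinss # its objective value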

Replication of Erdos-Renyi graph

I'm really new to R and I have got an assignment for my classes. I have to create 1000 networks from the Erdos-Renyi model. The thing is, I actually can create one model, check its parameters like the degree distribution, plot it, etc. Also I can check its transitivity and so on. However, I have to compare the average clustering coefficient (local transitivity) of those 1000 networks to some network that we have been working on in class in Cytoscape. This is the code that I already know:
library(igraph)
g<-erdos.renyi.game(1000,2000, type=c("gnm"))
transitivity(g) #and other atrributes...
g2<-replicate(g,1000)
transitivity(g2[[1]])
#now I have a sort of list with every graph, but when I try to analyse it
#I get an error saying it is not a graph object
I have to calculate standard deviation and mean ACC from those 1000 networks, and then compare it.
I will appreciate any kind of help.
I tried a lot actually:
g<-erdos.renyi.game(1026,2222,type=c("gnm"))
g2<-1000*g
transitivity(g2[2]) # this, however, ends in a "not a graph object" error
g2[1] #outputs an adjacency matrix, but with 1026000 vertices instead of 1026,
#so multiplication doesn't replicate the graph,
#it multiplies its parameters
Also, I have tried unifying the list of graphs
glist<-1000*g
acc<-union(glist, byname="auto")
transitivity(acc) #outputs the same value as the first graph g (only one
#Erdos-Renyi graph)
To create many graphs, use replicate as shown below:
g<-erdos.renyi.game(100, 20, type=c("gnm"))
g2<-replicate(1000, erdos.renyi.game(100, 20, type=c("gnm")), simplify=FALSE);
sapply(g2, transitivity)
To calculate the mean of some attribute, like average degree or transitivity, use:
mean(sapply(g2, transitivity))
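Since the assignment asks for the mean and standard deviation of the average clustering coefficient, one possible follow-up (the type and isolates choices below are assumptions on my part, not from the original answer):
# average local transitivity per replicate, counting isolated/degree-1 vertices as zero
acc <- sapply(g2, function(g) transitivity(g, type = "localaverage", isolates = "zero"))
mean(acc)
sd(acc)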

Timeseries cluster validation: using cluster.stats metrics to decide optimal cluster number

I am clustering timeseries data using appropriate distance measures and clustering algorithms for longitudinal data. My goal is to validate the optimal number of clusters for this dataset, through cluster result statistics. I read a number of articles and posts on stackoverflow on this subject, particularly: Determining the Optimal Number of Clusters. Visual inspection is only possible on a subset of my data; I cannot rely on it to be representative of my whole dataset since I am dealing with big data.
My approach is the following:
1. I cluster several times using different numbers of clusters and calculate the cluster statistics for each of these options
2. I calculate the cluster statistic metrics using the cluster.stats function from the fpc CRAN package (Cluster.Stats from FPC Cran Package). I plot these and decide, for each metric, which cluster number is best (see my code below, and the sketch right after this list).
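For context, here is a minimal sketch of how such a cs_metrics table could be assembled; the pam call on a precomputed distance matrix d stands in for my actual longitudinal clustering step, so both are assumptions rather than my real code:
library(fpc)
library(cluster)
ks <- 2:20
cs_metrics <- do.call(rbind, lapply(ks, function(k) {
  cl <- pam(d, k)                        # placeholder clustering step
  st <- cluster.stats(d, cl$clustering)  # validation statistics for this k
  data.frame(cluster.number = k,
             average.within = st$average.within, average.between = st$average.between,
             avg.silwidth = st$avg.silwidth, ch = st$ch,
             dunn = st$dunn, dunn2 = st$dunn2, entropy = st$entropy,
             pearsongamma = st$pearsongamma, within.cluster.ss = st$within.cluster.ss)
}))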
My problem is that these metrics each evaluate a different aspect of the clustering "goodness", and the best number of clusters for one metric may not coincide with the best number of clusters of a different metric. For example, Dunn's index may point towards using 3 clusters, while the within-sum of squares may indicate that 75 clusters is a better choice.
I understand the basics: that distances between points within a cluster should be small, that clusters should be well separated from each other, that the sum of squares should be minimised, and that observations in different clusters should be strongly dissimilar. However, I do not know which of these metrics is most important to consider in evaluating cluster quality.
How do I approach this problem, keeping in mind the nature of my data (timeseries) and the goal to cluster identical series / series with strongly similar pattern regions together?
Am I approaching the clustering problem the right way, or am I missing a crucial step? Or am I misunderstanding how to use these statistics?
Here is how I am deciding the best number of clusters using the statistics:
cs_metrics is my dataframe which contains the statistics.
Average.within.best <- cs_metrics$cluster.number[which.min(cs_metrics$average.within)]
Average.between.best <- cs_metrics$cluster.number[which.max(cs_metrics$average.between)]
Avg.silwidth.best <- cs_metrics$cluster.number[which.max(cs_metrics$avg.silwidth)]
Calinsky.best <- cs_metrics$cluster.number[which.max(cs_metrics$ch)]
Dunn.best <- cs_metrics$cluster.number[which.max(cs_metrics$dunn)]
Dunn2.best <- cs_metrics$cluster.number[which.max(cs_metrics$dunn2)]
Entropy.best <- cs_metrics$cluster.number[which.min(cs_metrics$entropy)]
Pearsongamma.best <- cs_metrics$cluster.number[which.max(cs_metrics$pearsongamma)]
Within.SS.best <- cs_metrics$cluster.number[which.min(cs_metrics$within.cluster.ss)]
Here is the result:
Here are the plots that compare the cluster statistics for the different numbers of clusters:

How to find measures after community detection in igraph (R)?

I am working with community detection in graphs. I have been through the different community detection algorithms implemented in igraph and have plotted the community structures. Now, after getting the communities object for different algorithms, I want to compare the algorithms based on different measures like density, cut ratio, and coverage (I know that modularity is already implemented). I can obtain a subgraph and then calculate the intra-cluster density, but to find the inter-cluster density, I do not know how to proceed. This is the code I have been using to find intra-cluster density:
karate <- graph.famous("Zachary")
wckarate <- walktrap.community(karate) #any algorithm
subg1<-induced.subgraph(karate, which(membership(wckarate)==1)) #membership id differs for each cluster
intradensity1 <- ecount(subg1)/ecount(karate) #for each cluster
Similarly I could proceed for each cluster and add up all the densities, or take the average of them all. My question is: if the number of communities is very large, how do I proceed?
And if I want to extract the number of edges between different communities, is there a nice way to extract the number of edges?
Please pardon me if this question has already been asked. I am a novice to igraph and R.
Well, we can just adapt your code to loop over the different subgroups:
karate <- graph.famous("Zachary")
wckarate <- walktrap.community(karate) #any algorithm
sapply(unique(membership(wckarate)), function(g) {
  subg1 <- induced.subgraph(karate, which(membership(wckarate) == g)) #membership id differs for each cluster
  ecount(subg1)/ecount(karate)
})
and as far as getting the edges between the communities, you could do
#get all combinations of communities
cs <- data.frame(combn(unique(membership(wckarate)),2))
cx <- sapply(cs, function(x) {
  es <- E(karate)[V(karate)[membership(wckarate) == x[1]] %--%
                  V(karate)[membership(wckarate) == x[2]]]
  length(es)
})
cbind(t(cs),cx)
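If your igraph version provides crossing(), the total number of edges running between (rather than within) communities can also be read off directly, as a cross-check of the table above:
# TRUE for every edge whose endpoints fall in different communities
sum(crossing(wckarate, karate))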
Also, you can plot the communities to make sure the result looks reasonable:
plot.communities(wckarate, karate)

Resources