I'm using igraph community-detection and the community sizes are either too small or too large. Is there any way to specify the size of the detected communities? If not, is there any way for me to manually split or merge communities detected from igraph? Thanks!
Whilst I don't think it's possible to set/specify the size of a community detected by igraph, some of the community detection algorithms allow you to specify how many communities you want (an alternative to splitting/merging).
You can either use the cluster_spinglass() function, setting spins to the number of communities you want (strictly speaking, spins is an upper limit on the number of communities), or use one of the hierarchical methods and then call cut_at(), using its no argument to specify how many communities you want.
Example code:
# Set up your graph object
g <- [an igraph object]  # set up your graph
# Use spinglass to create a set number of communities
sg <- g %>% cluster_spinglass(spins = 10) # produces at most 10 communities using the spinglass algorithm
# Use hierarchical methods and cut_at to create a set number of communities
walk <- g %>% cluster_walktrap() %>% cut_at(no = 10)
eb <- g %>% cluster_edge_betweenness() %>% cut_at(no = 10)
Note that the spinglass method will give you back a communities object, whereas the cut_at method simply gives you back the community indices for all nodes in the graph (i.e. a simple numeric vector).
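If you later need a full communities object from the cut_at() result (which is just a membership vector), you can wrap it back up with make_clusters(). A minimal sketch, assuming g is your graph:
# cut_at() returns a plain membership vector...
memb <- g %>% cluster_walktrap() %>% cut_at(no = 10)
# ...which make_clusters() can wrap into a proper communities object
comms <- make_clusters(g, membership = memb)
sizes(comms) # community sizes, as for any communities object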
You can find more details on the communities help page.
I'm in the process of creating a weighted igraph network object from an edge list containing two columns, from and to. It has proven somewhat challenging for me, because when using a workaround I notice changes in the network metrics, and I believe I'm doing something wrong.
library(igraph)
links <- read.csv2("edgelist.csv")
vertices <- read.csv2("vertices.csv")
network <- graph_from_data_frame(d=links,vertices = vertices,directed = TRUE)
## the following step removes the self-loops that I used to include all isolate nodes in the network ##
network <- simplify(network,remove.multiple = FALSE, remove.loops = TRUE)
In this situation I have successfully created a network object. However, it is not weighted. Therefore I create a second network object by taking the adjacency matrix from the object created earlier and creating the new igraph object from it like this:
gettheweights <- get.adjacency(network)
network2 <- graph_from_adjacency_matrix(gettheweights,mode = "directed",weighted = TRUE)
However, when I then print both objects, I notice a difference in the number of edges. Why is this?
network2
IGRAPH ef31b3a DNW- 200 1092 --
network
IGRAPH 934d444 DN-- 200 3626 --
Additionally, I believe I've done something wrong because, if they really were the same network, shouldn't their densities be the same? That is not the case here:
graph.density(network2)
[1] 0.02743719
graph.density(network)
[1] 0.09110553
I browsed and tried several answers found here, but none matched my case exactly and I failed to find a solution.
All seems to be in order. When you re-project a network so that duplicate edges are represented as a weight (the number of edges between the given vertices), the density of your graph should change.
When you test graph.density(network2) and graph.density(network), they should indeed differ if duplicate edges were reduced to single edges with the weight as an edge attribute, as your output for network2 and network suggests.
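Before diving in, a quick sanity check you can run on your own objects (a sketch, assuming network and network2 as defined in your question):
# The total edge weight of the collapsed graph should equal the
# number of (non-loop) edges in the original multigraph
sum(E(network2)$weight) == ecount(network) # expect TRUE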
This (over-) commented code goes through the process.
library(igraph)
# Data that should resemble yours
edges <- data.frame(from = c("A","B","C","D","E","A","A","A","B","C"),
                    to   = c("A","C","D","A","B","B","B","C","B","D"))
vertices <- unique(unlist(edges))
# Building the graph in the same way as you do
g0 <- graph_from_data_frame(d=edges, vertices=vertices, directed = TRUE)
# Note that the graph is "DN--": directed, named, but NOT weighted, since
# instead of weighted edges we have a whole lot of double edges
(g0)
plot(g0)
# We can see the double edges in the adjacency matrix as values > 1
get.adjacency(g0)
# Use simplify to remove LOOPS ONLY, as the adjacency matrix test below shows
g1 <- simplify(g0, remove.multiple = FALSE, remove.loops = TRUE)
get.adjacency(g1) == get.adjacency(g0)
# Turn the multiple edges into edge-weights by jumping through an adjacency matrix
g2 <- graph_from_adjacency_matrix(get.adjacency(g1), mode = "directed", weighted = TRUE)
# Instead of multiple edges (like many links between "A" and "B"), there are now
# just single edges with weights (hence the density of the network's changed).
graph.density(g1) == graph.density(g2)
# The former double edges are now here:
E(g2)$weight
# And we can see that g2 is now "Named-Directed-Weighted", where g1 was only
# "Named-Directed" with no weights.
(g1);(g2)
# Let's plot the weights
E(g2)$width = E(g2)$weight*5
plot(g2)
A shortcoming of this/your method, however, is that the adjacency matrix can only carry the edge count between any given pair of vertices. If your edge list contains more variables than just from and to, graph_from_data_frame() would normally embed those variables as edge attributes for you straight from your csv import (which is nice).
When you convert the edges into weights, however, you lose that information. And, come to think of it, that information would have to be "converted" too. What would we do with two edges between the same vertices that have different edge attributes?
At this point, the answer goes slightly beyond your question, but still stays in the realm of explaining the relation between graphs with multiple edges between the same vertices and their representation as weighted graphs with only one structural edge per vertex pair.
To carry edge attributes over into the weighted graph, I suggest using dplyr to "rebuild" them manually, so that you keep control of how they are merged down when recasting into a weighted graph.
This picks up where the code above left off:
# Let's imagine that our original network had these two edge-attributes
E(g0)$coolness <- c(1,2,1,2,3,2,3,3,2,2)
E(g0)$hotness <- c(9,8,2,3,4,5,6,7,8,9)
# Plot the hotness
E(g0)$color <- colorRampPalette(c("green", "red"))(10)[E(g0)$hotness]
plot(g0)
# Note that the hotness between C and D are very different
# When we make your transformation into a weighted network, we lose the
# coolness and hotness information
g2 <- g0 %>% simplify(remove.multiple = FALSE, remove.loops = TRUE) %>%
get.adjacency() %>%
graph_from_adjacency_matrix(mode = "directed", weighted = TRUE)
E(g2)$hotness # NULL: the edge attributes were lost!
# We can use dplyr to take control over how we'd like the edge attributes
# transferred when multiple edges in g0 with different edge attributes are
# merged into one single edge
library(dplyr)
library(tidyr) # for unite()
recalculated_edge_attributes <-
  data.frame(name = ends(g0, E(g0)) %>% as.data.frame() %>% unite("name", V1:V2, sep = "->"),
             hotness = E(g0)$hotness) %>%
  group_by(name) %>%
  summarise(mean_hotness = mean(hotness))
# We used a string version of the names of the connected vertices (like "A->B")
# to refer to the attributes of each edge. This can now be used to merge the
# re-calculated edge-attributes back onto the weighted graph in g2
g2_attributes <- data.frame(name = ends(g2, E(g2)) %>% as.data.frame() %>% unite("name", V1:V2, sep = "->")) %>%
  left_join(recalculated_edge_attributes, by = "name")
# And manually re-attach our mean attributes onto the g2 network
E(g2)$mean_hotness <- g2_attributes$mean_hotness
E(g2)$color <- colorRampPalette(c("green", "red"))(ceiling(max(E(g2)$mean_hotness)))[round(E(g2)$mean_hotness)]
# Note how the link between A and B has turned into the brown mean of the two previous
# green and red hotness-edges
plot(g2)
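For completeness (an alternative to the adjacency-matrix and dplyr route above, not something the answer relies on): igraph's simplify() can merge parallel edges and combine their attributes in one step via its edge.attr.comb argument. A minimal sketch:
# Give every parallel edge weight 1; simplify() then sums those weights,
# averages hotness and coolness, and drops anything not listed ("ignore")
E(g0)$weight <- 1
g2b <- simplify(g0, remove.multiple = TRUE, remove.loops = TRUE,
                edge.attr.comb = list(weight = "sum", hotness = "mean",
                                      coolness = "mean", "ignore"))
E(g2b)$weight  # same edge counts as the adjacency-matrix route
E(g2b)$hotness # mean hotness per merged edge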
Sometimes your analyses may benefit from either structure (weighted without duplicates, or unweighted with duplicates). Algorithms for shortest paths, for example, can incorporate edge weights as described in this answer, but other analyses might not allow for, or be intuitive with, the weighted version of your network data.
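A small illustration of that point (a sketch using the g2 built above): igraph's path functions pick up a weight edge attribute automatically, and you can opt out per call:
# With weights = NULL (the default), distances() uses E(g2)$weight
distances(g2, weights = NULL)
# With weights = NA, the weights are ignored and plain hop counts are returned
distances(g2, weights = NA)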
Let purpose guide your structure.
I am trying to perform DBSCAN clustering on the data https://www.kaggle.com/arjunbhasin2013/ccdata. I have cleaned the data and applied the algorithm.
data1 <- read.csv('C:\\Users\\write\\Documents\\R\\data\\Project\\Clustering\\CC GENERAL.csv')
head(data1)
data1 <- data1[,2:18]
dim(data1)
colnames(data1)
head(data1,2)
#to check if data has empty col or rows
library(purrr)
is_empty(data1)
#to check if data has duplicates
library(dplyr)
any(duplicated(data1))
#to check if data has NA values
any(is.na(data1))
data1 <- na.omit(data1)
any(is.na(data1))
dim(data1)
Algorithm was applied as follows.
#DBSCAN
data1 <- scale(data1)
library(fpc)
library(dbscan)
set.seed(500)
#to find optimal eps
kNNdistplot(data1, k = 34)
abline(h = 4, lty = 3)
The figure shows the 'knee' used to identify the eps value. Since there are 17 attributes to be considered for clustering, I have taken k = 17 * 2 = 34.
db <- dbscan(data1,eps = 4,minPts = 34)
db
The result I obtained is "The clustering contains 1 cluster(s) and 147 noise points."
No matter what values I change eps and minPts to, the result is the same.
Can anyone tell where I have gone wrong?
Thanks in advance.
You have two options:
Increase the radius of your center points (given by the epsilon parameter)
Decrease the minimum number of points (minPts) to define a center point.
I would start by decreasing the minPts parameter: I think it is very high, and if DBSCAN does not find that many points within the radius, it will not group more points into a cluster.
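One way to explore this (a hedged sketch, assuming data1 is the scaled matrix from the question) is to sweep a small grid of eps and minPts values and see how many clusters and noise points each combination produces:
# Try a grid of parameter values and report cluster/noise counts;
# dbscan::dbscan() is called explicitly since fpc is also loaded
for (eps in c(2, 3, 4, 5)) {
  for (minPts in c(10, 17, 34)) {
    db <- dbscan::dbscan(data1, eps = eps, minPts = minPts)
    cat(sprintf("eps=%.1f minPts=%d -> %d cluster(s), %d noise points\n",
                eps, minPts, max(db$cluster), sum(db$cluster == 0)))
  }
}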
A common problem with DBSCAN (and clustering in general) is that real data typically does not fall into nice clusters, but forms one connected point cloud. In this case, DBSCAN will always find only a single cluster. You can check this with several methods. The most direct is a pairs plot (a scatterplot matrix):
plot(as.data.frame(data1))
Since you have many variables, the scatterplot panels are very small, but you can see that the points are very close together in almost all panels. DBSCAN will connect all points in these dense areas into a single cluster. k-means would just partition the dense area.
Another option is to check for clusterability with methods like VAT or iVAT (https://link.springer.com/chapter/10.1007/978-3-642-13657-3_5).
library("seriation")
## calculate distances for a small sample
d <- dist(data1[sample(seq(nrow(data1)), size = 1000), ])
iVAT(d)
You will see that the plot shows no block structure around the diagonal indicating that clustering will not find much.
To improve clustering, you need to work on the data. You can remove irrelevant variables, you may have very skewed variables that should be transformed first. You could also try non-linear embedding before clustering.
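As a hedged sketch of the kind of preprocessing meant here (the column names are taken from the linked Kaggle data set and the file path is a placeholder; adjust both as needed), you could log-transform the heavily right-skewed monetary variables before scaling:
# Start again from the unscaled data (before the scale() step above)
raw <- na.omit(read.csv("CC GENERAL.csv")[, 2:18])
# log1p() compresses the long right tails of the monetary variables
skewed <- c("BALANCE", "PURCHASES", "CASH_ADVANCE", "CREDIT_LIMIT", "PAYMENTS")
raw[skewed] <- lapply(raw[skewed], log1p)
data2 <- scale(raw)
kNNdistplot(data2, k = 34) # re-check the knee after transforming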
I created an igraph with a community membership identified:
fc <- fastgreedy.community(graph)
colors <- rainbow(max(membership(fc)))
This provided me the clusters that each of the nodes belong to.
Now when I plot this:
plot(graph, vertex.color = colors[membership(fc)],
     layout = layout.kamada.kawai)
it doesn't produce a layout that visually separates each group of nodes based on membership. Does anyone know a different layout that can provide this? All this does is take the Kamada-Kawai layout and color in the memberships, rather than restructuring the layout so that it is organized by membership.
Hope this question makes sense. Thanks!
You have to calculate the Kamada-Kawai layout with an artificial weight vector that assigns a high weight to edges within clusters and a low weight to edges that cross cluster boundaries:
> graph <- grg.game(100, 0.2) # example graph
> cl <- fastgreedy.community(graph)
> weights <- ifelse(crossing(cl, graph), 1, 100)
> layout <- layout_with_kk(graph, weights=weights)
> plot(graph, layout=layout)
The trick here is the ifelse(crossing(cl, graph), 1, 100) part -- crossing(cl, graph) takes a clustering and the graph that the clustering belongs to, and returns a Boolean vector that defines for each edge whether the edge crosses cluster boundaries or not. The ifelse() call then simply replaces TRUE (i.e. the edge crosses boundaries) in this vector with 1 and FALSE (i.e. the edge stays within a cluster) with 100.
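As a small addition (not part of the original answer): you can also shade the communities on the plot itself, which combines nicely with the weighted layout:
# The communities plot method shades each cluster automatically...
plot(cl, graph, layout = layout)
# ...or do it manually with mark.groups on a plain igraph plot
plot(graph, layout = layout, mark.groups = communities(cl))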
Basically I have tried a few different ways of clustering. I can usually get to a point in iGraph where each node is labeled with a cluster. I can then identify all the nodes within a single cluster. However, this loses their edges.
I'd have to iterate back over the original dataset for all the nodes in cluster 1 to keep only the edges whose endpoints are both within the cluster, and I'd have to do this for every cluster.
This seems like a painfully long process and there is probably a shortcut my google-fu is missing.
So, after clustering or community detection, is there an easy way to keep an individual cluster/community as its own smaller graph -- that is, retaining all nodes AND the edges between them?
You can use delete.vertices() to create a subgraph. Example:
library(igraph)
set.seed(123)
# create random graph
g <- barabasi.game(100, directed = F)
plot(g, layout=layout.fruchterman.reingold)
# do community detection
wc <- multilevel.community(g)
V(g)$community <- membership(wc)
# make community 1 subgraph
g_sub <- delete.vertices(g, V(g)[community != 1])
plot(g_sub, layout=layout.fruchterman.reingold)
An alternative:
#Create random network
d <- sample_gnm(n=50,m=40)
#Identify the communities
dc <- cluster_walktrap(d)
#Induce a subgraph out of the first community
dc_1 <- induced.subgraph(d,dc[[1]])
#plot that specific community
plot(dc_1)
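If you want every community as its own graph in one go, a small extension of the above (same objects, same functions):
# One induced subgraph per community, kept in a list
sub_list <- lapply(seq_along(dc), function(i) induced.subgraph(d, dc[[i]]))
# e.g. plot the second community
plot(sub_list[[2]])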
I have 4 undirected graphs, each with 1000 vertices, and with 176672, 150994, 193477, and 236060 edges respectively. I am trying to see the interaction between a specific set of nodes (16 in number) in each graph. Visualizing this in tkplot is not feasible, as 1000 vertices is already way too much for it. I was thinking there might be a way to extract the interaction of these 16 nodes from the parent graph and view it separately, which would be easier to handle and work with in tkplot. I don't want to lose information: if an interaction path passes through nodes other than the 16 pre-specified ones, I want to see those nodes too. Is there a way to achieve this?
In such a dense graph, if you only take the shortest paths connecting each pair of these 16 vertices, you will still get a graph too large for tkplot, or even too large to see anything meaningful on a cairo PDF plot.
However, if you aim to do it, this is one possible way:
require(igraph)
g <- erdos.renyi.game(n = 1000, p = 0.1)
set <- sample(1:vcount(g), 16)
in.shortest.paths <- NULL
for (v in set) {
  in.shortest.paths <- c(in.shortest.paths,
                         unlist(get.all.shortest.paths(g, from = v, to = set)$res))
}
subgraph <- induced.subgraph(g, unique(in.shortest.paths))
In this example, subgraph will include approx. half of all the vertices.
After this, I think you should consider finding some way other than visualization to investigate the relationships between your vertices of interest. It could be some topological metric, but it really depends on the aims of your analysis.
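For instance (a sketch reusing the g, set and in.shortest.paths objects from the code above), the pairwise distance matrix between the 16 vertices summarizes their interaction without plotting anything, and you can count how often outside vertices sit on the connecting paths:
# Pairwise shortest-path lengths among the 16 vertices of interest
distances(g, v = set, to = set)
# How often each outside vertex appears on those shortest paths
on_paths <- in.shortest.paths[!in.shortest.paths %in% set]
head(sort(table(on_paths), decreasing = TRUE), 10)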