Computing neighbor group percentages based on cluster assignments is slow - R

I am doing some analysis using igraph in R, and one of my calculations is very expensive. I need to run it across all of the nodes in my graph, so if someone knows a more efficient way to do it, I would appreciate it.
I start out with a graph, g, and first run some community detection on it:
library(igraph)
n <- 8000
adj_matrix <- matrix(rbinom(n * n, 1, 0.5), ncol = n, nrow = n)
# symmetrise so the matrix describes an undirected graph
adj_matrix[lower.tri(adj_matrix)] <- t(adj_matrix)[lower.tri(adj_matrix)]
g <- graph_from_adjacency_matrix(adj_matrix, mode = 'undirected', diag = FALSE)
c <- cluster_louvain(g)
Then, I basically assign each cluster to one of two groups:
nc <- length(c)
assignments <- rbinom(nc, 1, .5)
Now, for each node, I want to find out what percentage of its neighbors are in a given group (as defined by the cluster assignments). I currently do this as follows:
pct_neighbors_1 <- function(g, vertex, c, assignments) {
  sum(
    ifelse(assignments[membership(c)[neighbors(g, vertex)]] == 1, 1, 0)
  ) / length(neighbors(g, vertex))
}
Then, given a data frame with each row corresponding to one vertex in the graph, I do this for all vertices with
data$pct_neighbors_1 <- sapply(1:nrow(data),
                               pct_neighbors_1,
                               g = g, c = c,
                               assignments = assignments)
Is there somewhere in here that I can make things more efficient? Thanks!

This should be faster:
library(igraph)
# for reproducibility's sake
set.seed(1234)
# create a random graph with 1000 vertices
nverts <- 1000
g <- igraph::random.graph.game(nverts, 0.1, type = 'gnp', directed = FALSE)
# clustering
c <- cluster_louvain(g)
# assignments
nc <- length(c)
assignments <- rbinom(nc, 1, .5)
# precalculate whether each vertex belongs to a community assigned to group 1
vertsInAssignments <- membership(c) %in% which(assignments == 1)
# compute probabilities
probs <- sapply(1:vcount(g), FUN = function(i) {
  neigh <- neighbors(g, i)
  sum(vertsInAssignments[neigh]) / length(neigh)
})
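For a quick sanity check (my addition; it assumes the asker's function, with the sapply argument names fixed to match its formals), the vectorised result can be compared against the per-vertex version:
pct_neighbors_1 <- function(g, vertex, c, assignments) {
  sum(
    ifelse(assignments[membership(c)[neighbors(g, vertex)]] == 1, 1, 0)
  ) / length(neighbors(g, vertex))
}
probs_slow <- sapply(1:vcount(g), pct_neighbors_1,
                     g = g, c = c, assignments = assignments)
all.equal(probs, probs_slow)  # should be TRUE
The speed-up comes from computing the group-membership lookup once for all vertices, instead of re-indexing assignments inside every call.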

Does igraph have a function that generates sub-graphs limited by weights? (dfs, random_walk)

I have a weighted graph in the igraph R environment.
I need to obtain sub-graphs recursively, starting from any random node. The sum of the weights in each sub-graph has to be less than a given number.
Depth-first search (dfs) seems suited to this problem, as does the random_walk function.
Does anybody know which igraph function could tackle this?
This iterative function finds the sub-graph grown from a given vertex of any undirected graph which contains the largest possible weight-sum below the value specified in limit.
A challenge in finding such a graph is the computational load of evaluating the weight-sum of every possible sub-graph. Consider this example, where one iteration has found a sub-graph A-B with a weight-sum of 1.
The shortest path to any new vertex is A-C (with a weight of 3), the sub-graph A-B-D has a weight-sum of 6, while A-B-C would have a weight-sum of 12 because of the inclusion of the edge B-C in the sub-graph.
The function below looks ahead at each iterative step and chooses to enlarge the sub-graph with whichever vertex results in the lowest sub-graph weight-sum, rather than the vertex with the shortest direct path.
In terms of optimisation this leaves something to be desired, but I think it does what you requested in your first question.
find_maxweight_subgraph_from <- function(graph, vertex, limit = 0,
                                         sub_graph = c(vertex), current_ws = 0) {
  # Keep a shortlist of possible vertices to go to next
  shortlist <- data.frame(k = integer(0), ws = numeric(0))
  limit <- min(limit, sum(E(graph)$weight))
  while (current_ws < limit) {
    # To find the next possible vertices to include, a listing of
    # potential candidates is computed so the most efficient one can be chosen.
    # Each iteration chooses amongst vertices that are connected to the sub-graph:
    adjacents <- as.vector(adjacent_vertices(graph, vertex, mode = "all")[[1]])
    # A shortlist of possible enlargements of the sub-graph is kept to be able
    # to compare each potential enlargement of the sub-graph and always choose
    # the one which results in the smallest increase of the sub-graph weight-sum.
    #
    # The shortlist is enlarged by vertices that are:
    # 1) adjacent to the latest added vertex
    # 2) not already in the sub-graph
    new_k <- adjacents[!adjacents %in% sub_graph]
    shortlist <- rbind(shortlist[!is.na(shortlist$k), ],
                       data.frame(k = new_k,
                                  ws = rep(Inf, length(new_k))))
    # The addition to the weight-sum is NOT calculated from the weight on the individual
    # edges leading to vertices on the shortlist BUT from the ACTUAL weight-sum of
    # the sub-graph that would result from adding a vertex `k` to the sub-graph.
    shortlist$ws <- sapply(shortlist$k, function(x)
      sum(E(induced_subgraph(graph, c(sub_graph, x)))$weight))
    # We choose the vertex with the lowest impact on the weight-sum:
    shortlist <- shortlist[order(shortlist$ws), ]
    vertex <- shortlist$k[1]
    current_ws <- shortlist$ws[1]
    shortlist <- shortlist[2:nrow(shortlist), ]
    # Each iteration adds a new vertex to the sub-graph
    if (current_ws <= limit) {
      sub_graph <- c(sub_graph, vertex)
    }
  }
  induced_subgraph(graph, sub_graph)
}
# Test function using a random graph
g <- erdos.renyi.game(16, 30, type = "gnm", directed = FALSE)
E(g)$weight <- sample(1:1000 / 100, length(E(g)))
sum(E(g)$weight)
plot(g, edge.width = E(g)$weight, vertex.size = 2)
sg <- find_maxweight_subgraph_from(g, vertex = 12, limit = 60)
sum(E(sg)$weight)
plot(sg, edge.width = E(sg)$weight, vertex.size = 2)
# Test function using your example code:
g <- make_tree(10, children = 2, mode = "undirected")
s <- seq_len(ecount(g))  # one weight per edge (this tree has 9 edges)
g <- set_edge_attr(g, "weight", value = s)
plot(g, edge.width = E(g)$weight)
sg <- find_maxweight_subgraph_from(g, 2, 47)
sum(E(sg)$weight)
plot(sg, edge.width = E(sg)$weight)
I attempted it as shown below; however, it does not seem to be effective.
#######Example code
g <- make_tree(10, children = 2, mode = "undirected")
s <- seq_len(ecount(g))
g <- set_edge_attr(g, "weight", value = s)
plot(g)
is_weighted(g)
E(g)$weight
threshold <- 5
eval <- function(r) {
  Vertice_dfs <- dfs(g, root = r)
  Sequencia <- as.numeric(Vertice_dfs$order)
  for (i in 1:length(Sequencia)) {
    # callback that truncates the DFS at the i-th vertex of the order
    f.in <- function(graph, data, extra) {
      data[1] == Sequencia[i] - 1
    }
    # DFS truncated by the callback
    dfs_res <- dfs(g, root = r, in.callback = f.in)
    # vertices visited by the truncated DFS
    dfs_eges <- na.omit(as.numeric(dfs_res$order))
    # resulting subgraph
    g2 <- induced_subgraph(g, dfs_eges)
    # total weight of subgraph g2
    T_W <- sum(E(g2)$weight)
    if (T_W > threshold) {
      print(T_W)
      return(T_W)
    }
  }
}
#search by vertex
result <- lapply(1:length(V(g)), eval)
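A lighter variant of the same idea (my sketch, reusing g and threshold from above) walks each DFS order once and grows the prefix until the induced weight-sum first exceeds the threshold, instead of re-running dfs() with a callback for every prefix:
eval2 <- function(r) {
  ord <- as.numeric(dfs(g, root = r)$order)
  for (i in seq_along(ord)) {
    T_W <- sum(E(induced_subgraph(g, ord[1:i]))$weight)
    if (T_W > threshold) return(T_W)
  }
  NA  # the whole component stays at or under the threshold
}
result2 <- lapply(seq_along(V(g)), eval2)
This still recomputes the induced subgraph per prefix, so it remains quadratic in the worst case, but it avoids one full dfs() call per prefix.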

Shortest Paths based on edge attribute with igraph

I'm trying to get the shortest paths of a graph but based on its edge ids.
So having the following graph:
library(igraph)
set.seed(45)
g <- erdos.renyi.game(25, 1/10, directed = TRUE)
E(g)$id <- sample(1:3, length(E(g)), replace = TRUE)
The shortest_paths(g, 1, V(g)) function finds all the shortest paths from node 1 to all the other nodes. However, I would like to calculate this not just by the geodesic distance, but by a mix of the geodesic distance and the minimum number of edge-id changes.
For example, suppose this were a train network and the edge ids represented trains. I would like to calculate how to get from node A to all the other nodes using the shortest path, while changing trains as few times as possible.
OK, I think I have a working solution, although the code is a little ugly. The basic algorithm (let's call it gs(i, j)) goes like this:
1) Find the shortest path from i to j considering all trains; if this path has length 0 or 1, return it (there is either no path, or a path on one train).
2) Split the graph up by 'trains' (subset the graph by edge id) so as to consider each train network separately, and find the shortest path between i and j in each individual train network.
3) If a single train will get you from i to j, return the train route with the fewest stops between i and j; else
4) if no single train runs from i to j, call gs(i, j-1), where (j-1) is the stop before j on the shortest path between i and j on the full network.
So basically, we look to see if a single train can do it, and if it can't, we call the function recursively, asking whether a single train can get you to the stop before the last stop, and so on.
library(igraph)
# First your data
set.seed(45)
g <- erdos.renyi.game(25, 1/10, directed = TRUE)
E(g)$id <- sample(1:3, length(E(g)), replace = TRUE)
plot(g, edge.color = E(g)$id)
# The function takes as arguments the graph, and the id of the vertex
# you want to go from/to. It should work for a vector of
# destinations but I have not rigorously tested it so proceed with
# caution!
get.shortest.routes <- function(g, from, to) {
  train.routes <- lapply(unique(E(g)$id), function(id) {
    subgraph.edges(g, eids = which(E(g)$id == id), delete.vertices = FALSE)
  })
  target.sp <- shortest_paths(g, from = from, to = to, output = 'vpath')$vpath
  single.train.paths <- lapply(train.routes, function(gs) {
    shortest_paths(gs, from = from, to = to, output = 'vpath')$vpath
  })
  for (i in seq_along(target.sp)) {
    if (length(target.sp[[i]]) > 1) {
      cands <- lapply(single.train.paths, function(l) { l[[i]] })
      if (sum(unlist(lapply(cands, length))) != 0) {
        cands <- cands[lapply(cands, length) != 0]
        cands <- cands[lapply(cands, length) == min(unlist(lapply(cands, length)))]
        target.sp[[i]] <- cands[[1]]
      } else {
        target.sp[[i]] <- c(get.shortest.routes(g, from = as.numeric(target.sp[[i]][1]),
                                                to = as.numeric(target.sp[[i]][(length(target.sp[[i]]) - 1)]))[[1]],
                            get.shortest.routes(g, from = as.numeric(target.sp[[i]][(length(target.sp[[i]]) - 1)]),
                                                to = as.numeric(target.sp[[i]][length(target.sp[[i]])]))[[1]][-1])
      }
    }
  }
  target.sp
}
OK, now let's run some tests. If you squint at the graph above, you can see that getting from vertex 5 to vertex 21 takes two edges if you ride two trains, but that you can get there on one train if you pass through an extra station. Our new function should return the longer path:
shortest_paths(g, 5, 21)$vpath
#> [[1]]
#> + 3/25 vertices, from b014eb9:
#> [1] 5 13 21
get.shortest.routes(g, 5, 21)
#> Warning in shortest_paths(gs, from = from, to = to, output = "vpath"): At
#> structural_properties.c:745 :Couldn't reach some vertices
#> Warning in shortest_paths(gs, from = from, to = to, output = "vpath"): At
#> structural_properties.c:745 :Couldn't reach some vertices
#> [[1]]
#> + 4/25 vertices, from c22246c:
#> [1] 5 13 15 21
Let's make a really easy graph where we are sure of what we want to see: here we should get 1-2-4-5 instead of 1-3-5:
df <- data.frame(from = c(1, 1, 2, 3, 4), to = c(2, 3, 4, 5, 5))
g1 <- graph_from_data_frame(df)
E(g1)$id <- c(1, 2, 1, 3, 1)
plot(g1, edge.color = E(g1)$id)
get.shortest.routes(g1, 1, 5)
#> Warning in shortest_paths(gs, from = from, to = to, output = "vpath"): At
#> structural_properties.c:745 :Couldn't reach some vertices
#> Warning in shortest_paths(gs, from = from, to = to, output = "vpath"): At
#> structural_properties.c:745 :Couldn't reach some vertices
#> [[1]]
#> + 4/5 vertices, named, from c406649:
#> [1] 1 2 4 5
I'm sure there is a more rigorous solution, and you'll probably want to optimize the code a bit. For instance, I just realized that I don't stop the function immediately if the shortest path on the full graph has only two nodes; doing so would avoid some needless computations! This was a fun problem; I hope some other answers get posted.
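That early exit could look like this (a sketch; it reuses the names from get.shortest.routes above and would go right after target.sp is computed):
# early exit: paths with at most one edge cannot be improved upon
if (all(sapply(target.sp, length) <= 2)) {
  return(target.sp)
}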
Created on 2018-05-11 by the reprex package (v0.2.0).
Here is my take on the problem. A few notes:
1) all_simple_paths will not scale well to large or highly connected graphs.
2) I favored fewest changes above all else, which means a path with two changes and a distance of 40 will beat a path with three changes and a distance of 3.
3) I can imagine an even faster approach in which the number of changes and the distance swap priority whenever there is no path on a single id.
library(igraph)
# First your data
set.seed(45)
g <- erdos.renyi.game(25, 1/10, directed = TRUE)
E(g)$id <- sample(1:3, length(E(g)), replace = TRUE)
plot(g, edge.color = E(g)$id)
##Option 1:
rst <- all_simple_paths(g, from = 1, to = 18, mode = "out")
rst <- lapply(rst, as_ids)
rst1 <- lapply(rst, function(x) c(x[1], rep(x[2:(length(x) - 1)], each = 2), x[length(x)]))
rst2 <- lapply(rst1, function(x) data.frame(eid = get.edge.ids(graph = g, vp = x),
                                            train = E(g)$id[get.edge.ids(graph = g, vp = x)]))
rst3 <- data.frame(pathID = seq_along(rst),
                   changes = sapply(rst2, function(x) length(rle(x$train)$lengths)),
                   dist = sapply(rst2, nrow))
spath <- rst3[order(rst3$changes, rst3$dist), ][1,1]
#Vertex IDs
rst[[spath]]
#[1] 1 23 8 18
plot(g, edge.color = E(g)$id,
     vertex.color = ifelse(V(g) %in% rst[[spath]], "firebrick", "gray80"),
     edge.arrow.size = 0.5)

Extracting simplified graph in igraph

I have a network that looks like this
library(igraph)
library(igraphdata)
data("kite")
plot(kite)
I run a community detection and the result looks like this
community <- cluster_fast_greedy(kite)
plot(community,kite)
Now I want to extract a network based on the communities. The edge weight should be the number of ties between communities (how strong are communities connected to each other), the vertex attribute should be the number of nodes in the community (called numnodes).
d <- data.frame(E = c(1, 2, 3),
                A = c(2, 3, 1))
g2 <- graph_from_data_frame(d, directed = FALSE)
E(g2)$weight <- c(5, 1, 1)
V(g2)$numnodes <- c(4, 3, 3)
plot.igraph(g2, vertex.label = V(g2)$name, edge.color = "black",
            edge.width = E(g2)$weight, vertex.size = V(g2)$numnodes)
The graph should look like this
One node is larger than the others, one edge has a lot of weight in comparison to the others.
As far as I know, igraph doesn't have a method to count the edges connecting groups of vertices, so to count the edges connecting communities you need to iterate over each pair of communities. To count the members of each community, you can use the sizes() method.
library(igraph)
library(igraphdata)
data("kite")
plot(kite)
community <- cluster_fast_greedy(kite)
plot(community,kite)
cedges <- NULL
for (i in seq(1, max(community$membership) - 1)) {
  for (j in seq(i + 1, max(community$membership))) {
    imembers <- which(community$membership == i)
    jmembers <- which(community$membership == j)
    weight <- sum(
      mapply(function(v1) mapply(
        function(v2) are.connected(kite, v1, v2),
        jmembers),
        imembers)
    )
    cedges <- rbind(cedges, c(i, j, weight))
  }
}
cedges <- as.data.frame(cedges)
names(cedges)[3] <- 'weight'
cgraph <- graph_from_data_frame(cedges, directed = FALSE)
V(cgraph)$numnodes <- sizes(community)
plot.igraph(cgraph,
            vertex.label = V(cgraph)$name,
            edge.color = "black",
            edge.width = E(cgraph)$weight,
            vertex.size = V(cgraph)$numnodes)
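For what it's worth, a shorter route to the same graph (a sketch on my part; verify that the weights match the loop above on your data) is igraph's contract() followed by simplify(), which merges each community into a single vertex and can sum an edge attribute across the merged edges:
E(kite)$weight <- 1                      # each original tie counts as 1
cg <- contract(kite, membership(community), vertex.attr.comb = "ignore")
V(cg)$numnodes <- as.vector(sizes(community))
cg <- simplify(cg, remove.loops = TRUE,
               edge.attr.comb = list(weight = "sum"))
plot.igraph(cg, edge.color = "black",
            edge.width = E(cg)$weight, vertex.size = V(cg)$numnodes)
Within-community ties become self-loops after contraction, which remove.loops = TRUE discards.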

Efficient way to analyse the neighbours of subsets of nodes in a large graph

I have a graph of 6 million nodes, such as
require(igraph)
# Graph of 1000 nodes
g <- ba.game(1000)
with the following four attributes defined for each node
# Attributes
V(g)$attribute1 <- V(g) %in% sample(V(g), 20)
V(g)$attribute2 <- V(g) %in% sample(V(g), 20)
V(g)$attribute3 <- V(g) %in% sample(V(g), 20)
V(g)$attribute4 <- V(g) %in% sample(V(g), 20)
Among the nodes I have a subset of 12,000 that are of particular interest:
# Subset of 100 nodes
V(g)$subset <- V(g) %in% sample(V(g), 100)
What I want to obtain is an analysis (count) of the neighbourhood of my subset. That is, I want to define
V(g)$neigh.attr1 <- rep(NA, vcount(g))
V(g)$neigh.attr2 <- rep(NA, vcount(g))
V(g)$neigh.attr3 <- rep(NA, vcount(g))
V(g)$neigh.attr4 <- rep(NA, vcount(g))
such that NA is replaced for every node in the subset with the corresponding count of neighbouring nodes with V(g)$attribute{1..4}==TRUE.
I can easily create a list of the neighbourhood of interest with
neighbours <- neighborhood(g, order = 1, V(g)[V(g)$subset==TRUE], mode = "out")
but I can't think of an efficient way to iterate over every neighbourhood and compute the statistics for each of the four attributes. Indeed, the only way I've come up with is a loop which, given the size of my original graph, just takes too long:
subset_indices <- as.numeric(V(g)[V(g)$subset == TRUE])
for (i in 1:length(neighbours)) {
  V(g)$neigh.attr1[subset_indices[i]] <- sum(V(g)$attribute1[neighbours[[i]]])
  V(g)$neigh.attr2[subset_indices[i]] <- sum(V(g)$attribute2[neighbours[[i]]])
  V(g)$neigh.attr3[subset_indices[i]] <- sum(V(g)$attribute3[neighbours[[i]]])
  V(g)$neigh.attr4[subset_indices[i]] <- sum(V(g)$attribute4[neighbours[[i]]])
}
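One common way to vectorise such neighbourhood counts (a sketch of mine, not from the original thread) is a single sparse adjacency-matrix multiplication that computes all four counts for every vertex at once. Note that neighborhood() includes the vertex itself while this product, like neighbors(), does not, so the two differ by each node's own attribute values:
library(Matrix)
A <- as_adjacency_matrix(g, sparse = TRUE)  # A[i, j] = 1 if there is an edge i -> j
attrs <- 1 * cbind(V(g)$attribute1, V(g)$attribute2,
                   V(g)$attribute3, V(g)$attribute4)  # logical -> numeric
counts <- as.matrix(A %*% attrs)            # row i: attribute counts over i's out-neighbours
V(g)$neigh.attr1[subset_indices] <- counts[subset_indices, 1]
V(g)$neigh.attr2[subset_indices] <- counts[subset_indices, 2]
V(g)$neigh.attr3[subset_indices] <- counts[subset_indices, 3]
V(g)$neigh.attr4[subset_indices] <- counts[subset_indices, 4]
The work is proportional to the number of edges, which should scale to a 6-million-node graph far better than the per-vertex loop.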

Most representative instance of a cluster

After performing a cluster analysis on my dataset (a data frame named data.matrix), I added a new column, named cluster, at the end (col 27), containing the name of the cluster each instance belongs to.
What I want now is a representative instance from each cluster. I tried to find the instance with the smallest Euclidean distance from the cluster's centroid (repeating the procedure for each of my clusters).
This is what I did. Can you think of other (perhaps more elegant) ways? (Assume numeric columns with no nulls.)
clusters <- levels(data.matrix$cluster)
cluster_col <- 27
for (j in 1:length(clusters)) {
  # get the subset for cluster j
  data <- data.matrix[data.matrix$cluster == clusters[j], ]
  # remove the cluster column
  data <- data[, -cluster_col]
  # calculate the centroid
  cent <- colMeans(data)
  # copy data to data.matrix_cl, attaching a distance column at the end
  data.matrix_cl <- cbind(data, dist = apply(data, 1, function(x) sqrt(sum((x - cent)^2))))
  # get the instances with minimum distance
  candidates <- data.matrix_cl[data.matrix_cl$dist == min(data.matrix_cl$dist), ]
  # print their rownames
  print(paste("Candidates for cluster ", j))
  print(rownames(candidates))
}
First, I'm not sure your distance formula is right: it should be either sqrt(sum((x - cent)^2)) (Euclidean) or sum(abs(x - cent)) (Manhattan). I assumed the first.
Second, just printing the solution is not a good idea, so I first compute, then print.
Third, I recommend using plyr, but I give both solutions (with and without plyr).
# Simulated data:
n <- 100
data.matrix <- cbind(
  data.frame(matrix(runif(26 * n), n, 26)),
  cluster = sample(letters[1:6], n, replace = TRUE)
)
cluster_col <- which(names(data.matrix) == "cluster")
# With plyr:
require(plyr)
candidates <- dlply(data.matrix, "cluster", function(data) {
  dists <- colSums(laply(data[, -cluster_col], function(x) (x - mean(x))^2))
  rownames(data)[dists == min(dists)]
})
l_ply(names(candidates), function(c_name, c_list = candidates[[c_name]]) {
  print(paste("Candidates for cluster ", c_name))
  print(c_list)
})
# Without plyr:
candidates <- tapply(
  1:nrow(data.matrix),
  data.matrix$cluster,
  function(id, data = data.matrix[id, ]) {
    dists <- rowSums(sapply(data[, -cluster_col], function(x) (x - mean(x))^2))
    rownames(data)[dists == min(dists)]
  }
)
invisible(lapply(names(candidates), function(c_name, c_list = candidates[[c_name]]) {
  print(paste("Candidates for cluster ", c_name))
  print(c_list)
}))
Is the technique you're interested in 'k-means clustering'? If so, here's how the centroids are calculated at each iteration:
1) Choose a value of k (an integer specifying the number of clusters to divide your data set into).
2) Randomly select k rows from your data set; those are the centroids for the 1st iteration.
3) Calculate the distance of each data point from each centroid.
4) Each data point has a 'closest centroid', which determines its 'group'.
5) Calculate the mean of each group; those are the new centroids.
6) Go back to step 3 (the stopping criterion is usually based on a comparison with the respective centroid values in successive loops, i.e., quit if the values change by no more than 0.01%).
Those steps in code:
# toy data set: 12 points in 5 dimensions
mx = matrix(runif(60, 10, 99), nrow=12, ncol=5, byrow=F)
cndx = sample(nrow(mx), 2)
# the two centroids at iteration 1
cn1 = mx[cndx[1],]
cn2 = mx[cndx[2],]
# Euclidean distance to each centroid (using the first two dimensions)
fnx1 = function(a){sqrt((cn1[1] - a[1])^2 + (cn1[2] - a[2])^2)}
fnx2 = function(a){sqrt((cn2[1] - a[1])^2 + (cn2[2] - a[2])^2)}
# calculate the distance matrix (one column per centroid)
dx1 = apply(mx, 1, fnx1)
dx2 = apply(mx, 1, fnx2)
dx = cbind(dx1, dx2)
# index for extracting the new groups from the data set
ndx = apply(dx, 1, which.min)
group1 = mx[ndx==1,]
group2 = mx[ndx==2,]
# calculate the new centroids for the next iteration
new_cnt1 = apply(group1, 2, mean)
new_cnt2 = apply(group2, 2, mean)
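In practice, base R's kmeans() performs this whole loop for you; one way (a sketch of my own, not part of the steps above) to then pull the most representative row of each cluster from its output, tying back to the original question:
km <- kmeans(mx, centers = 2)
# for each cluster, the index of the row nearest its fitted centroid
reps <- sapply(seq_len(nrow(km$centers)), function(k) {
  idx <- which(km$cluster == k)
  d2  <- rowSums(sweep(mx[idx, , drop = FALSE], 2, km$centers[k, ])^2)
  idx[which.min(d2)]
})
reps  # row indices of the most representative instance per cluster
Here which.min picks the row with the smallest squared distance to the fitted centroid, mirroring the minimum-distance approach in the question.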
