(R language) Understanding what is a "weighted" graph - r

I am using R and the igraph library to learn about network graph data. In particular, I am trying to understand the concept of a "weighted graph" - from what I have read, the "weights" are generally associated with the "Edges" in the graph. But can the "weights" ever be associated with the "nodes"? (sometimes, I see that "nodes" are also referred to as "vertexes")
Suppose I have two datasets : one for the nodes and one for the edges.
library(igraph)
library(visNetwork)
Nodes <-data.frame(
"Source" = c("123","124","125","122","111", "126"),
"Salary" = c("100","150","200","200","100", "100"),
"Debt" = c("10","15","20","20","10", "10"),
"Savings" = c("1000","1500","2000","2000","1000", "1000")
)
Nodes$Salary= as.numeric(Nodes$Salary)
Nodes$Debt = as.numeric(Nodes$Debt)
Nodes$Savings = as.numeric(Nodes$Savings)
mydata <-data.frame(
"source" = c("123","124","123","125","123"),
"target" = c("126", "123", "125", "122", "111"),
"color" = c("red","red","green","blue","red"),
"food" = c("pizza","pizza","cake","pizza","cake")
)
Normally, I would have made a simple binary graph for this data, in which the entire analysis would only involve two columns:
#make graph
graph <- graph_from_data_frame(mydata[,c(1:2)], directed=FALSE)
simple_graph<- simplify(graph)
plot(simple_graph)
#do some clustering on the graph#
fc <- fastgreedy.community(simple_graph)
V(simple_graph)$community <- fc$membership
nodes <- data.frame(id = V(simple_graph)$name, title = V(simple_graph)$name, group = V(simple_graph)$community)
nodes <- nodes[order(nodes$id, decreasing = F),]
edges <- get.data.frame(simple_graph, what="edges")[1:2]
visNetwork(nodes, edges) %>%
visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE)
Now, I want to explore the concept of a "weighted graph". I want to make a graph such that I can use the financial information (salary, debt, savings) for each node in the analysis. The way I see it, this would assign a notion of "weight" to the nodes and not the edges, correct?
A very basic way to approach this problem, would be to take the average(salary, debt and savings) for each node and considering this average amount as a weight. This way, we could begin to ask questions such as "are nodes with larger average financial amounts more likely to form relationships with one another, compared to nodes with smaller average financial amounts?" (in network science, I believe this concept is referred to as "homophily")
Thus, we can modify the file containing information about the nodes (calculate average financial amount for each node) :
nodes_avg = data.frame(ID=Nodes[,1], Means=rowMeans(Nodes[,-1]))
Now, we need to create a new graph in which this averaged financial information is considered as a "weight". This is where I begin to get confused.
This way does not work:
set_vertex_attr(simple_graph, Weight, index = V(graph), nodes_avg$Means)
Error in as.igraph.vs(graph, index) :
Cannot use a vertex sequence from another graph.
I tried the following command, but I got a warning message:
E(simple_graph)$weight <- nodes_avg$Means
Warning message:
In eattrs[[name]][index] <- value :
number of items to replace is not a multiple of replacement length
Finally, I tried this command, but I don't think it is using the averaged financial amounts as node weights:
weighted_graph <- graph_from_data_frame(mydata, directed=TRUE, vertices=nodes_avg)
Does anyone know how can I make a "weighted_graph" with the averaged financial amounts, and then run a clustering algorithm on the network graph which takes into consideration the node weights? Something like this:
simple_weighted_graph<- simplify(weighted_graph)
plot(simple_weighted_graph)
#do some clustering on the weighted_graph#
fc <- fastgreedy.community(simple_weighted_graph)
V(simple_weighted_graph)$community <- fc$membership
nodes <- data.frame(id = V(simple_weighted_graph)$name, title = V(simple_weighted_graph)$name, group = V(simple_weighted_graph)$community)
nodes <- nodes[order(nodes$id, decreasing = F),]
edges <- get.data.frame(simple_weighted_graph, what="edges")[1:2]
visNetwork(nodes, edges) %>%
visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE)
Or is this not possible? That is, weighted graphs are only made using "edge weights" and CAN NOT be done using "node weights" ... and therefore, graph network clustering can not be done on a weighted graph made of node weights.
Thanks

Related

Creating weighted igraph network using two-column edge list

I'm in the process of creating a weighted igraph network object from a edge list containing two columns from and to. It has proven to be somewhat challenging for me, because when doing a workaround, I notice changes in the network metrics and I believe I'm doing something wrong.
library(igraph)
links <- read.csv2("edgelist.csv")
vertices <- read.csv2("vertices.csv")
network <- graph_from_data_frame(d=links,vertices = vertices,directed = TRUE)
##the following step is included to remove self-loops that I have used to include all isolate nodes to the network##
network <- simplify(network,remove.multiple = FALSE, remove.loops = TRUE)
In this situation I have successfully created a network object. However, it is not weighted. Therefore I create a second network object by taking the adjacency matrix from the objected created earlier and creating the new igraph object from it like this:
gettheweights <- get.adjacency(network)
network2 <- graph_from_adjacency_matrix(gettheweights,mode = "directed",weighted = TRUE)
However, after this when I call both of the objects, I notice a difference in the number of edges, why is this?
network2
IGRAPH ef31b3a DNW- 200 1092 --
network
IGRAPH 934d444 DN-- 200 3626 --
Additionally, I believe I've done something wrong because if they indeed would be the same network, shouldn't their densities be the same? Now it is not the case:
graph.density(network2)
[1] 0.02743719
graph.density(network)
[1] 0.09110553
I browsed and tried several different answers found from here but many were not 1:1 identical and I failed to find a solution.
All seems to be in order. When you re-project a network with edge-duplicates to be represented as a weight by the number of edges between given vertices, the density of your graph should change.
When you you test graph.density(network2) and graph.density(network), they should be different if indeed edge-duplicates were reduced to single-edges with weight as an edge attribute, as your output from network2 and network suggest.
This (over-) commented code goes through the process.
library(igraph)
# Data that should resemble yours
edges <- data.frame(from=c("A","B","C","D","E","A","A","A","B","C"),
to =c("A","C","D","A","B","B","B","C","B","D"))
vertices <- unique(unlist(edges))
# Building graphh in the same way as you do
g0 <- graph_from_data_frame(d=edges, vertices=vertices, directed = TRUE)
# Note that the graph is "DN--": directed, named, but NOT Weighted, since
# Instead of weighted edges, we have a whole lot of dubble edges
(g0)
plot(g0)
# We can se the dubble edges in the adjacency matrix as >1
get.adjacency(g0)
# Use simplify to remove LOOPS ONLY as we can see in the adjacency metrix test
g1 <- simplify(g0, remove.multiple = FALSE, remove.loops = TRUE)
get.adjacency(g1) == get.adjacency(g0)
# Turn the multiple edges into edge-weights by jumping through an adjacency matrix
g2 <- graph_from_adjacency_matrix(get.adjacency(g1), mode = "directed", weighted = TRUE)
# Instead of multiple edges (like many links between "A" and "B"), there are now
# just single edges with weights (hence the density of the network's changed).
graph.density(g1) == graph.density(g2)
# The former doubble edges are now here:
E(g2)$weight
# And we can see that the g2 is now "Named-Directed-Weighted" where g1 was only
# "Named-Directed" and no weights.
(g1);(g2)
# Let's plot the weights
E(g2)$width = E(g2)$weight*5
plot(g2)
A shortcoming of this/your method, however, is that the adjacency matrix is able to carry only the edge-count between any given vertices. If your edge-list contains more variables than i and j, the use of graph_from_data_frame() would normally embed edge-attributes of those variables for you straight from your csv-import (which is nice).
When you convert the edges into weights, however, you would loose that information. And, come to think of it, that information would have to be "converted" too. What would we do with two edges between the same vertices that have different edge-attributes?
At this point, the answer goes slightly beyond your question, but still stays in the realm of explaining the relation between graphs of multiple edges between the same vertices and their representation as weighted graphs with only one structural edge per verticy.
To convert edge-attributes along this transformation into a weighted graph, I suggest you'd use dplyr to "rebuild" any edge-attributes manually in order to keep control of how they are supposed to be merged down when recasting into a weighted one.
This picks up where the code above left off:
# Let's imagine that our original network had these two edge-attributes
E(g0)$coolness <- c(1,2,1,2,3,2,3,3,2,2)
E(g0)$hotness <- c(9,8,2,3,4,5,6,7,8,9)
# Plot the hotness
E(g0)$color <- colorRampPalette(c("green", "red"))(10)[E(g0)$hotness]
plot(g0)
# Note that the hotness between C and D are very different
# When we make your transformations for a weighted netowk, we loose the coolness
# and hotness information
g2 <- g0 %>% simplify(remove.multiple = FALSE, remove.loops = TRUE) %>%
get.adjacency() %>%
graph_from_adjacency_matrix(mode = "directed", weighted = TRUE)
g2$hotness # Naturally, the edge-attributes were lost!
# We can use dplyr to take controll over how we'd like the edge-attributes transfered
# when multiple edges in g0 with different edge attributes are supposed to merge into
# one single edge
library(dplyr)
recalculated_edge_attributes <-
data.frame(name = ends(g0, E(g0)) %>% as.data.frame() %>% unite("name", V1:V2, sep="->"),
hotness = E(g0)$hotness) %>%
group_by(name) %>%
summarise(mean_hotness = mean(hotness))
# We used a string-version of the names of connected verticies (like "A->B") to refere
# to the attributes of each edge. This can now be used to merge back the re-calculated
# edge-attributes onto the weighted graph in g2
g2_attributes <- data.frame(name = ends(g2, E(g2)) %>% as.data.frame() %>% unite("name", V1:V2, sep="->")) %>%
left_join(recalculated_edge_attributes, by="name")
# And manually re-attatch our mean-attributes onto the g2 network
E(g2)$mean_hotness <- g2_attributes$mean_hotness
E(g2)$color <- colorRampPalette(c("green", "red"))(max(E(g2)$mean_hotness))[E(g2)$mean_hotness]
# Note how the link between A and B has turned into the brown mean of the two previous
# green and red hotness-edges
plot(g2)
Sometimes, your analyses may benefit from either structure (weighted no duplicates or unweighted with duplicates). Algorithms for, for example, shortest paths are able to incorporate edge-weight as described in this answer, but other analyses might not allow for or be intuitive when using the weighted version of your network data.
Let purpose guide your structure.

Applying Graph Clustering Algorithms on the (famous) Iris data set

My question deals with the application of graph clustering algorithms. Most times, I see that graphs are made by using nodes and edges within the data. For example, suppose we have social media data: each individual in the data could be represented as a node and the relationship between individuals could be represented as edges. Using this information, we could build a graph and then perform graph clustering algorithms (e.g. Louvain Clustering) on this graph.
Sometimes, graphs can also be made using distances between points. Distances between points can be thought of as edges. For example, in the Spectral Clustering algorithm, a KNN (k nearest neighbor) graph is made from the data and then the K-Means clustering algorithm is performed on this graph.
My question is this: Suppose we take the famous Iris data and remove the response variable ("Species"). Would it make any sense to create a graph of this Iris data in which each node corresponds to an individual flower and edges correspond to pairwise Euclidean distances between each points? Assuming this is a logical and correct approach, could graph clustering algorithms be then performed on this Iris graph?
Below, I have attempted to first create a graph of the Iris data using pairwise Euclidean distances (in R). I then performed Louvain Clustering and Infomap Clustering on the resulting graph. After that, I attempted to create a KNN graph of the Iris data and perform MST (minimum spanning tree) clustering on this KNN graph, as well as perform Louvain Clustering.
Could someone please provide an opinion on what I have done? Is this intuitive and does it make mathematical sense? As a way of "cheating" - the Iris data only has 3 species. Thus, if a given clustering algorithm returns significantly more than 3 clusters, we know that the graph and/or the clustering algorithm may not be the best choice. However, in real applications, we are unable to know how many "true" classes exist within the data.
library(igraph)
library(network)
library(reshape2)
library(mstknnclust)
library(visNetwork)
library(cluster)
/****louvain clustering done on a distance based graph - maybe this is correct****/
x <- iris[,1:4]
dist <- daisy(x,
metric = "euclidean"
)
d_mat <- as.matrix(dist)
d_long <- melt(d_mat)
colnames(d_long) <- c("from", "to", "correlation")
d_mat_long <- d_long[which(d_long$correlation > .5),]
graph <- graph_from_data_frame(d_mat_long, directed = FALSE)
nodes <- as_data_frame(graph, what = "vertices")
colnames(nodes) <- "id"
nodes$label <- nodes$id
links <- as_data_frame(graph, what = "edges")
visNetwork(nodes, links) %>% visIgraphLayout(layout = "layout_with_fr")
cluster <- cluster_louvain(graph)
nodes$cluster <- cluster$membership
nodes$color <- ifelse(nodes$cluster == 1, "red", "blue")
visNetwork(nodes, links) %>% visIgraphLayout(layout = "layout_with_fr") %>% visOptions(selectedBy = "cluster") %>% visNodes(color = "color")
/***infomap and louvain clustering done a distance based graph but with a different algorithm: I think this is wrong***/
imc <- cluster_infomap(graph)
membership(imc)
communities(imc)
plot(imc, graph)
lc <- cluster_louvain(graph, weights = NULL)
membership(lc)
communities(lc)
plot(lc, graph)
/****mst spanning algorithm on the knn graph : based on the number of clusters I think this is wrong****/
cg <- generate.complete.graph(1:nrow(x),d_mat)
##Generates kNN graph
knn <- generate.knn(cg)
plot(knn$knn.graph,
main=paste("kNN \n k=", knn$k, sep=""))
results <- mst.knn(d_mat)
igraph::V(results$network)$label.cex <- seq(0.6,0.6,length.out=2)
plot(results$network, vertex.size=8,
vertex.color=igraph::clusters(results$network)$membership,
layout=igraph::layout.fruchterman.reingold(results$network, niter=10000),
main=paste("MST-kNN \n Clustering solution \n Number of clusters=",results$cnumber,sep="" ))
/*****louvain clustering and infomap done on the knn graph - maybe this is correct****/
#louvain
lc <- cluster_louvain(knn$knn.graph, weights = NULL)
membership(lc)
communities(lc)
plot(lc, knn$knn.graph)
imc <- cluster_infomap(knn$knn.graph)
membership(imc)
communities(imc)
plot(imc, knn$knn.graph)
"louvain clustering done on a distance based graph - maybe this is correct"
Not really, distance is used when graphing things like betweenness centrality. If your interest is similarity then convert distance to similarity.

compare communities from graphs with different number of vertices

I am calculating louvain communities on graphs of communications data, where vertices represent performers on a big project. The graphs represent different communication methods (e.g., email, phone).
We want to try to identify teams of performers from their communication data. Since performers have preferences for different communication methods, the graphs are of different sizes and may have some unique vertices which may not be present in both. When I try to compare the community objects from the respective graphs, igraph::compare() throws an exception. See toy reprex below.
I considered a dplyr::full_join() or inner_join() of the vertex lists before constructing the graph & community objects to make them the same size, but worry about the impact of doing so on the resulting cluster_louvain() solutions.
Any ideas on how I can compare the community objects to one another from these different communication methods? Thanks in advance!
library(tidyverse, warn.conflicts = FALSE)
library(igraph, warn.conflicts = FALSE)
nodes <- as_tibble(list(id = c("sample1", "sample2", "sample3")))
edge <- as_tibble(list(from = "sample1",
to = "sample2"))
net <- graph_from_data_frame(d = edge, vertices = nodes, directed = FALSE)
com <- cluster_louvain(net)
nodes2 <- as_tibble(list(id = c("sample1","sample21", "sample22","sample23"
)))
edge2 <- as_tibble(list(from = c("sample1", "sample21"),
to = c("sample21", "sample22")))
net2 <- graph_from_data_frame(d = edge2, vertices = nodes2, directed = FALSE)
com2 <- cluster_louvain(net2)
# # uncomment to see graph plots
# plot.igraph(net, mark.groups = com)
# plot.igraph(net2, mark.groups = com2)
compare(com, com2)
#> Error in i_compare(comm1, comm2, method): At community.c:3106 : community membership vectors have different lengths, Invalid value
Created on 2019-02-22 by the reprex package (v0.2.1)
You will not (I don't believe) be able to compare clustering algorithms from two different graphs that contain two different sets of nodes. Practically you can't do it in igraph and conceptually its hard because the way clustering algorithms are compared is by considering all pairs of nodes in a graph and checking whether they are placed in the same cluster or a different cluster in each of the two clustering approaches. If both clustering approaches typically put the same nodes together and the same nodes apart then they are considered more similar.1
I suppose another valid way to approach the problem would be to evaluate how similar the clustering schemes are for purely the set of nodes that are the intersection of the two graphs. You'll have to decide what makes more sense in your setting. I'll show how to do it using the union of nodes rather than the intersection.
So you need all the same nodes in both graphs in order to make the comparison. In fact, I think the easier way to do it is to put all the same nodes in one graph and have different edge types. Then you can compute your clusters for each edge type separately and then make the comparison. The reprex below is hopefully clear:
# repeat your set-up
library(tidyverse, warn.conflicts = FALSE)
library(igraph, warn.conflicts = FALSE)
nodes <- as_tibble(list(id = c("sample1", "sample2", "sample3")))
edge <- as_tibble(list(from = "sample1",
to = "sample2"))
nodes2 <- as_tibble(list(id = c("sample1","sample21", "sample22","sample23")))
edge2 <- as_tibble(list(from = c("sample1", "sample21"),
to = c("sample21", "sample22")))
# approach from a single graph
# concatenate edges
edges <- rbind(edge, edge2)
# create an edge attribute indicating network type
edges$type <- c("phone", "email", "email")
# the set of nodes (across both graphs)
nodes <- unique(rbind(nodes, nodes2))
g <- graph_from_data_frame(d = edges, vertices = nodes, directed = F)
# We cluster over the graph without the email edges
com_phone <- cluster_louvain(g %>% delete_edges(E(g)[E(g)$type=="email"]))
plot(g, mark.groups = com_phone)
# Now we can cluster over the graph without the phone edges
com_email <- cluster_louvain(g %>% delete_edges(E(g)[E(g)$type=="phone"]))
plot(g, mark.groups = com_email)
# Now we can compare
compare(com_phone, com_email)
#> [1] 0.7803552
As you can see from the plots we pick out the same initial clustering structure you found in the separate graphs with the additions of the extra isolated nodes.
1: Obviously this is a pretty vague explanation. The default algorithm used in compare is from this paper, which has a nice discussion.

cluster igraph on attribute

I am fairly new to igraph in R and to clustering/partitioning algorithms in general.
I have a general question on clusters. My idea is to build a contiguous cluster from a (directed) graph based on an attribute. What I try to achieve is something very similar to https://www.sixhat.net/finding-communities-in-networks-with-r-and-igraph.html. However, I am not sure I understand the functions (e.g. cluster_walktrap()) correctly. (I know there are some clustering methods that work on directed graphs, and some that don't.)
As an example: I use the network net from http://kateto.net/networks-r-igraph. Assuming I would like to end up with contiguous clusters based on the attribute audience.size (contrary to a clustering based on the betweeness), how would I use a clustering function from igraph?
cluster_walktrap(net, weights = E(net)$audience.size, steps = 4)?
How do I interpret the weight in this case?
MWE with data from: http://www.kateto.net/wordpress/wp-content/uploads/2016/01/netscix2016.zip
library(igraph)
nodes <- read.csv("Dataset1-Media-Example-NODES.csv", header=T, as.is=T)
links <- read.csv("Dataset1-Media-Example-EDGES.csv", header=T, as.is=T)
net <- graph_from_data_frame(d=links, vertices=nodes, directed=T)
net <- simplify(net, remove.multiple = F, remove.loops = T)
cluster_walktrap(net, weights = E(net)$audience.size, steps = 4)
Thank you very much!

Visualizing collaboration network structure with existing R applications/packages

I am trying to visualize a relational data structure of "joint venture" (i.e., firms collaborate with others in products). For example, firm i may be involved in joint venture A with firm j, yet firm i also participates in joint venture B with firm j and firm k, etc., so both firm i, j, k all share some sort of co-membership relations ({i, j}, {i, j, k}), but the strength of collaboration between firm {i, j} is stronger than that of firm {i, k} as firm i and j collaborate in more joint venture.
I would to visualize this in those iconic network graphs but emphasize the strength of relationship that varies between different dyads (firms). A relevant example that came to my mind is Mark Newman's co-authorship studies in PNAS (Newman 2004), in Fig. 6 each pair of nodes (i.e., authors) are connected by edges of different thickness, representing the strength of co-authorship intensity between each pair of authors (i.e., number of collaborative works between the two), like the picture shown below:
I have checked out a number of previous posts (such as this one) pertaining to R's igraph and bipartite packages, but do not think bipartite network and its application fit my purpose here.
I am wondering (1) if there are any existing R packages/applications out there that will help to visualize the strength of connectedness between each nodes in a network, and (2) how should the structure of this type of data look like? (using 'firm', 'project' as columns or rows?)
Thank you.
As #R.B noted, you may use the visNetwork library. The code with invented data may look like this:
library(igraph)
library(visNetwork)
set.seed(98765) # for reproducibility
### generate some data,
### nodes are entitities: letters represent contributors
nodes <- data.frame(id = 1:11,
label = LETTERS[1:11], # name of node
title = LETTERS[1:11]) # optional tooltip
### edges represent relations
edges <- data.frame(
from = sample(1:11, 50, replace = TRUE),
to = sample(1:11, 50, replace = TRUE),
arrows = "",
width = c(rep(1, 20), rep(4, 20), rep(6,6), rep(10, 3), 15) ## weights
)
visNetwork(nodes, edges, width = "100%") %>%
visIgraphLayout(layout = "layout_in_circle") %>%
visNodes(size = 25) %>%
visOptions(highlightNearest = list(enabled = F, hover = T) )
This generates the following plot (interactive in html)
Please let me know whether this is what you want.

Resources