Different results from betweenness score weighted graph vs multigraph - r

I am playing around with StatsBomb FIFA World Cup 18 data and am trying to figure out the central players in each team. I do this by constructing a network of passes (a directed graph from the player making a pass to the player receiving it). Then I look at various centrality measures to gauge the most pivotal players (with regard to playmaking).
There are essentially two ways to present this data to the algorithm. One is to give each event (pass) as individual row when creating the graph (multiple edges between two nodes), the other is to aggregate by passer and receiver and give weight to the edge according to how many times player A passed to player B etc.
When looking at strength-scores, the results are exactly the same. However, that alone is uninteresting (strikers receive a lot of balls but they no longer pass it onwards as much so they cannot be said to be in the thick of things)
Betweenness-score would be more appropriate, calculating how often player X is the bridge in the ball going from A to B (to my best understanding of the weighted version of the betweenness measure).
However, here results fluctuate wildly. The one-pass-per-row version, with multiple edges between two nodes, gives reasonably logical results. Looking at Argentina vs Croatia, Messi is the second most central player after defender Nicolas Tagliafico, with strikers Aguero and later Higuain at the other end. If you remember, the game was a disaster for Argentina: https://www.independent.co.uk/sport/football/world-cup/world-cup-2018-argentina-jorge-sampaoli-group-nigeria-lionel-messi-tactics-a8411031.html
The weighted-version however puts sub Higuain at top and good scores for Aguero and two other subs Dybala/Pavon as well. This cannot be right but I have no idea why the results are so different. Does the fact that on individual level we have an ordering of passes matter?
Here's my R code, StatsBombR needs to be installed from Github, via devtools::install_github("statsbomb/StatsBombR")
library(StatsBombR)
library(dplyr)
library(igraph)
matches <- FreeMatches(43)
#download match events
match <- get.matchFree(matches[9,])
#take info about passes, remove non-pass events
passes <- select(match,player.name,pass.recipient.name,team.name)
passes <- na.omit(passes)
#teams in match
teams <- unique(passes$team.name)
#two ways of presenting data, pass-per-row or aggregated player-wise
teamPasses <- passes[passes$team.name==teams[2],1:2]
weightPasses <- teamPasses %>%
group_by(player.name, pass.recipient.name) %>%
summarise(weight=n())
#create graphs
net <- graph_from_data_frame(teamPasses, directed = TRUE)
net2 <- graph_from_data_frame(weightPasses, directed = TRUE)
E(net2)$weight <- weightPasses$weight
#scores
betweenness(net)
betweenness(net2)
strength(net, mode = "out")
strength(net2, mode = "out")

Ad strength():
They are identical, just the ordering of vertices is different
all.equal(
sort(strength(net, mode="out")),
sort(strength(net2, mode="out"))
)
# TRUE
Ad betweenness():
It is easier to explain on a simple graph like these two:
el <- tibble(
from = c("A", "B", "A", "D", "A"),
to = c("B", "C", "D", "C", "B")
)
g <- graph_from_data_frame(el)
elw <- el %>%
count(from, to) %>%
rename(weight=n)
gw <- graph_from_data_frame(elw)
plot(g)
In gw there are only 4 arcs (vs 5 in g) and arc A -> B has weight 2 while all others have weight 1. Let's focus on betweenness of B:
In the multigraph g it will be 2/3 because:
B lies only on the shortest paths between A and C
There are altogether 3 shortest paths from A to C: A->D->C and two paths A->B->C, as we can pick either of the two arcs in the (A,B) dyad.
Two of those paths involve B
In the weighted graph gw it will be 0 because:
The path A->B->C has weight 3
The path A->D->C has weight 2
The "shortest" (minimal weight) path is only one, and it does not involve B
In other words, in the weighted graph a shortest path is actually "path with minimum sum of arc weights". In that sense you are finding paths of players that pass the ball the least, which is probably totally not what you are after.
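If the pass counts should make a link "shorter" rather than "longer", one common workaround is to hand betweenness() the inverse of the counts as distances. The following is only a sketch on the toy graph gw from above, rebuilt here with base R so it runs on its own; whether 1/count is the right distance for passing data is an assumption, not something the data dictates.

```r
library(igraph)

# Aggregated toy edge list: A -> B was observed twice, every other arc once
elw <- data.frame(
  from   = c("A", "B", "A", "D"),
  to     = c("B", "C", "D", "C"),
  weight = c(2, 1, 1, 1),
  stringsAsFactors = FALSE
)
gw <- graph_from_data_frame(elw, directed = TRUE)

# Default: the "weight" attribute is treated as a distance,
# so the frequently used arc A -> B looks long
b_count <- betweenness(gw)

# Assumption: frequent passes mean "close", so use 1 / count as the distance
b_inv <- betweenness(gw, weights = 1 / E(gw)$weight)

b_count
b_inv
```

With the inverted weights the path A->B->C has length 1.5 and A->D->C has length 2, so B rather than D sits on the only shortest path from A to C.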

Related

Creating weighted igraph network using two-column edge list

I'm in the process of creating a weighted igraph network object from an edge list containing two columns, from and to. It has proven to be somewhat challenging for me, because when using a workaround I notice changes in the network metrics, and I believe I'm doing something wrong.
library(igraph)
links <- read.csv2("edgelist.csv")
vertices <- read.csv2("vertices.csv")
network <- graph_from_data_frame(d=links,vertices = vertices,directed = TRUE)
##the following step is included to remove self-loops that I have used to include all isolate nodes to the network##
network <- simplify(network,remove.multiple = FALSE, remove.loops = TRUE)
In this situation I have successfully created a network object. However, it is not weighted. Therefore I create a second network object by taking the adjacency matrix from the object created earlier and creating the new igraph object from it like this:
gettheweights <- get.adjacency(network)
network2 <- graph_from_adjacency_matrix(gettheweights,mode = "directed",weighted = TRUE)
However, after this when I call both of the objects, I notice a difference in the number of edges, why is this?
network2
IGRAPH ef31b3a DNW- 200 1092 --
network
IGRAPH 934d444 DN-- 200 3626 --
Additionally, I believe I've done something wrong because if they indeed would be the same network, shouldn't their densities be the same? Now it is not the case:
graph.density(network2)
[1] 0.02743719
graph.density(network)
[1] 0.09110553
I browsed and tried several different answers found from here but many were not 1:1 identical and I failed to find a solution.
All seems to be in order. When you re-project a network so that duplicate edges are represented as a single edge weighted by the number of duplicates between the given vertices, the density of your graph should change.
When you test graph.density(network2) and graph.density(network), they should be different if the edge-duplicates were indeed reduced to single edges with weight as an edge attribute, as your output from network2 and network suggests.
This (over-) commented code goes through the process.
library(igraph)
# Data that should resemble yours
edges <- data.frame(from=c("A","B","C","D","E","A","A","A","B","C"),
to =c("A","C","D","A","B","B","B","C","B","D"))
vertices <- unique(unlist(edges))
# Building the graph in the same way as you do
g0 <- graph_from_data_frame(d=edges, vertices=vertices, directed = TRUE)
# Note that the graph is "DN--": directed, named, but NOT weighted, since
# instead of weighted edges, we have a whole lot of double edges
(g0)
plot(g0)
# We can see the double edges in the adjacency matrix as >1
get.adjacency(g0)
# Use simplify to remove LOOPS ONLY, as we can see in the adjacency matrix test
g1 <- simplify(g0, remove.multiple = FALSE, remove.loops = TRUE)
get.adjacency(g1) == get.adjacency(g0)
# Turn the multiple edges into edge-weights by jumping through an adjacency matrix
g2 <- graph_from_adjacency_matrix(get.adjacency(g1), mode = "directed", weighted = TRUE)
# Instead of multiple edges (like many links between "A" and "B"), there are now
# just single edges with weights (hence the density of the network's changed).
graph.density(g1) == graph.density(g2)
# The former double edges are now here:
E(g2)$weight
# And we can see that the g2 is now "Named-Directed-Weighted" where g1 was only
# "Named-Directed" and no weights.
(g1);(g2)
# Let's plot the weights
E(g2)$width = E(g2)$weight*5
plot(g2)
A shortcoming of this/your method, however, is that the adjacency matrix is able to carry only the edge-count between any given vertices. If your edge-list contains more variables than i and j, the use of graph_from_data_frame() would normally embed edge-attributes of those variables for you straight from your csv-import (which is nice).
When you convert the edges into weights, however, you would lose that information. And, come to think of it, that information would have to be "converted" too. What would we do with two edges between the same vertices that have different edge-attributes?
At this point, the answer goes slightly beyond your question, but still stays in the realm of explaining the relation between graphs with multiple edges between the same vertices and their representation as weighted graphs with only one structural edge per pair of vertices.
To convert edge-attributes along this transformation into a weighted graph, I suggest you'd use dplyr to "rebuild" any edge-attributes manually in order to keep control of how they are supposed to be merged down when recasting into a weighted one.
This picks up where the code above left off:
# Let's imagine that our original network had these two edge-attributes
E(g0)$coolness <- c(1,2,1,2,3,2,3,3,2,2)
E(g0)$hotness <- c(9,8,2,3,4,5,6,7,8,9)
# Plot the hotness
E(g0)$color <- colorRampPalette(c("green", "red"))(10)[E(g0)$hotness]
plot(g0)
# Note that the hotness values between C and D are very different
# When we make your transformations for a weighted network, we lose the coolness
# and hotness information
g2 <- g0 %>% simplify(remove.multiple = FALSE, remove.loops = TRUE) %>%
get.adjacency() %>%
graph_from_adjacency_matrix(mode = "directed", weighted = TRUE)
E(g2)$hotness # NULL: naturally, the edge-attributes were lost!
# We can use dplyr to take control over how we'd like the edge-attributes transferred
# when multiple edges in g0 with different edge attributes are supposed to merge into
# one single edge
library(dplyr)
library(tidyr) # unite() comes from tidyr, not dplyr
recalculated_edge_attributes <-
data.frame(name = ends(g0, E(g0)) %>% as.data.frame() %>% unite("name", V1:V2, sep="->"),
hotness = E(g0)$hotness) %>%
group_by(name) %>%
summarise(mean_hotness = mean(hotness))
# We used a string version of the names of connected vertices (like "A->B") to refer
# to the attributes of each edge. This can now be used to merge the re-calculated
# edge-attributes back onto the weighted graph in g2
g2_attributes <- data.frame(name = ends(g2, E(g2)) %>% as.data.frame() %>% unite("name", V1:V2, sep="->")) %>%
left_join(recalculated_edge_attributes, by="name")
# And manually re-attach our mean-attributes onto the g2 network
E(g2)$mean_hotness <- g2_attributes$mean_hotness
E(g2)$color <- colorRampPalette(c("green", "red"))(max(E(g2)$mean_hotness))[E(g2)$mean_hotness]
# Note how the link between A and B has turned into the brown mean of the two previous
# green and red hotness-edges
plot(g2)
Sometimes, your analyses may benefit from either structure (weighted no duplicates or unweighted with duplicates). Algorithms for, for example, shortest paths are able to incorporate edge-weight as described in this answer, but other analyses might not allow for or be intuitive when using the weighted version of your network data.
Let purpose guide your structure.
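As a possible alternative to the adjacency-matrix round trip, simplify() can collapse the duplicates itself and combine edge attributes on the way through its edge.attr.comb argument. This is a minimal sketch on a made-up four-edge graph; the hotness values and the choice of "sum" for weight and "mean" for hotness are assumptions for illustration only.

```r
library(igraph)

# Toy multigraph: A -> B twice, B -> C once, plus a self-loop on C
edges <- data.frame(
  from    = c("A", "A", "B", "C"),
  to      = c("B", "B", "C", "C"),
  hotness = c(2, 4, 5, 9),          # hypothetical edge attribute
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(edges, directed = TRUE)
E(g)$weight <- 1                    # each original edge counts once

# Drop loops, merge duplicates: sum the counts, average the hotness,
# and ignore any other edge attribute
gs <- simplify(
  g,
  remove.multiple = TRUE,
  remove.loops    = TRUE,
  edge.attr.comb  = list(weight = "sum", hotness = "mean", "ignore")
)

res <- igraph::as_data_frame(gs, what = "edges")
res <- res[order(res$from, res$to), ]
res

# Fewer structural edges on the same vertices, hence a lower density
edge_density(g)
edge_density(gs)
```

This keeps the attribute handling in one call, at the price of being limited to the combination rules simplify() accepts (a rule name or a function per attribute).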

How to calculate the number of vertices contracted into one graph?

I have a few large igraph objects that represent social networks. All nodes have various attributes, among them sector, which is a factor variable. I have contracted this large network into a smaller one in which vertices represent groups and edges carry the sum of the individual edges in the original network. The label attribute in the second network represents the sector attribute in the first.
groupnet <- contract(g, as.integer(as.factor(V(g)$sector)), "ignore")
E(groupnet)$weight <- 1
groupnet <- simplify(groupnet, edge.attr.comb = list(weight = "sum"))
V(groupnet)$label <- levels(as.factor(V(g)$sector))
I would like to add another attribute to the second object V(groupnet)$groupsize that represents the number of original vertices that were contracted into groupnet. I have tried it with the following code but it did not work:
V(groupnet)$groupsize <- length(V(g)$sector[V(g)$sector == V(groupnet)$label])
How can I do this properly?
table() could be helpful here. Try out:
set.seed(1234)
library(igraph)
g <- make_ring(1000)
V(g)$sector <- factor(sample(LETTERS, vcount(g), replace = TRUE)) # one sector per vertex
V(g)$sector
## contracted network
groupnet <- contract(g, as.integer(as.factor(V(g)$sector)), "ignore")
E(groupnet)$weight <- 1
V(groupnet)$label <- levels(as.factor(V(g)$sector))
## number of original vertices that were contracted into groupnet
# the tip is to see that table(V(g)$sector) provides the number of vertices per sector and
# its output is also arranged like V(groupnet)
table(V(g)$sector)
V(groupnet)
# solution
V(groupnet)$groupsize <- as.numeric(table(V(g)$sector))
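Another way to get the same counts, sketched here under the assumption that adding a helper attribute to the original graph is acceptable, is to let contract() do the counting: give every original vertex a groupsize of 1 and ask contract() to sum it. The attribute name groupsize follows the question; the ring graph and random sectors are just test data.

```r
library(igraph)

set.seed(1234)
g <- make_ring(1000)
V(g)$sector <- sample(LETTERS, vcount(g), replace = TRUE)

# Helper attribute: every original vertex counts as one member
V(g)$groupsize <- 1

sector_f <- factor(V(g)$sector)

# Sum groupsize within each contracted vertex, drop all other vertex attributes
groupnet <- contract(
  g,
  as.integer(sector_f),
  vertex.attr.comb = list(groupsize = "sum", "ignore")
)
V(groupnet)$label <- levels(sector_f)

# The summed attribute should agree with the table() approach
data.frame(label     = V(groupnet)$label,
           groupsize = V(groupnet)$groupsize,
           via_table = as.numeric(table(sector_f)))
```

The advantage of this variant is that the counts travel with the contraction, so they cannot get out of step with the mapping that was used.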

Why igraph::cluster_walktrap gives a different result for non directed isomorphic graphs?

I'm trying to use igraph::cluster_walktrap in R to look for communities inside of a graph, however I noticed a weird behaviour (or at least, a behaviour I am not able to explain).
Suppose you are given an undirected graph by defining a list of its edges. Say
a,b
c,d
e,f
...
Then, if I define another graph by swapping randomly selected vertices in the edge list definition:
a,b
d,c
e,f
...
I expect the two graphs to be isomorphic and the difference between the two graph to be empty. This is exactly what happens in R in my toy example. Following this line of reasoning, calling cluster_walktrap on the two graphs (using set.seed appropriately) should yield the same result since the two graphs are the same. This is not happening and the only explanation I can give is that the starting point of each random walk is not the same for the two graphs. Why is this?
You can follow my reasoning in the toy example below. I don't understand why the last two objects are not identical.
require(igraph)
# Number of vertices
verteces <- 50
# Swap randomly some elements in the edges definition
set.seed(20)
row_swapped <- sample(1:verteces,25,replace=F)
m_values <- sample(letters, verteces*2, replace=T) #1:100
# Build edge lists
m1 <- matrix(m_values, verteces, 2)
m1
a <- m1
colS <- seq(round(ncol(m1)*0.3))
m1[row_swapped, 2:1] <- m1[row_swapped, 1:2]
m1
b <- m1
# Define the two graphs
ag <- igraph::graph_from_edgelist(a, directed = F)
bg <- igraph::graph_from_edgelist(b, directed = F)
# Another way of building an isomorphic graph for testing
#bg <- permute(ag, sample(vcount(ag)))
# Should be empty: ok
difference(ag, bg)
# Should be TRUE: ok
isomorphic(ag,bg)
# I expect it to be TRUE but it isn't...
identical(ag, bg)
# Vertices
V(ag)
ag
V(bg)
bg
# Calculate community
set.seed(100)
ac1 <- cluster_walktrap(ag)
set.seed(100)
bc1 <- cluster_walktrap(bg)
# I expect all to be TRUE, however
# merges is different
# membership is different
# names are different
identical(ac1$merges, bc1$merges)
identical(ac1$modularity, bc1$modularity)
identical(ac1$membership, bc1$membership)
identical(ac1$names, bc1$names)
identical(ac1$vcount, bc1$vcount)
identical(ac1$algorithm, bc1$algorithm)
The results are not different. You have two things going on that make your graphs isomorphic but not identical. I emphasize identical because it has a very strict definition.
1) identical(ag, bg) is FALSE because the vertices and edges are not stored in the same order in the two graphs. The same nodes and edges exist, but they are not in the same position or orientation. For example, if I shuffle the rows of a and make a new graph:
a1 <- a[sample(1:nrow(a)), ]
a1g <- igraph::graph_from_edgelist(a1, directed = F)
identical(ag, a1g)
#[1] FALSE
2) This goes for edges as well. An edge is stored as node1, node2 and a flag saying whether the edge is directed or not. So when you swap the endpoints within a row, the representation at the "byte level" (I use this term loosely) is different even though the relationship is the same. Edge 44 represents the same relationship but is stored according to how it was constructed.
E(ag)[44]
# + 1/50 edge from 6318240 (vertex names):
# [1] q--d
E(bg)[44]
# + 1/50 edge from 38042e0 (vertex names):
# [1] d--q
On to your cluster_walktrap: first, the function returns membership by vertex index, not by vertex name, which can be misleading. This means the reason the objects aren't identical is that ag and bg store their nodes in a different order.
If I reorder the membership by node name the two become identical.
identical(membership(bc1)[order(names(membership(bc1)))], membership(ac1)[order(names(membership(ac1)))])
#[1] TRUE
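A slightly more defensive check, sketched below, aligns both membership vectors by vertex name and then compares the partitions with compare(), which does not depend on how the community ids happen to be numbered. The two-clique graph and the helper named_membership() are made up for illustration; a variation of information of 0 means the two partitions are the same.

```r
library(igraph)

# Toy graph: two 4-cliques joined by the single edge d--e
edges_a <- rbind(
  t(combn(c("a", "b", "c", "d"), 2)),
  t(combn(c("e", "f", "g", "h"), 2)),
  c("d", "e")
)
# Same edges, but rows reversed and endpoints swapped within each row
edges_b <- edges_a[nrow(edges_a):1, 2:1]

ag <- graph_from_edgelist(edges_a, directed = FALSE)
bg <- graph_from_edgelist(edges_b, directed = FALSE)

# Hypothetical helper: plain membership vector keyed by vertex name
named_membership <- function(graph) {
  comm <- cluster_walktrap(graph)
  setNames(as.integer(comm$membership), V(graph)$name)
}

ma <- named_membership(ag)
mb <- named_membership(bg)

# Put both vectors into the same (alphabetical) vertex order
common <- sort(names(ma))
ma <- ma[common]
mb <- mb[common]

# Variation of information between the two partitions
vi <- compare(ma, mb, method = "vi")
vi
```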

igraph --- find shortest path including weight at turns

The following example gives shortest path 1-2-6-7-3-4, where only the weight of edges is considered and the weight of turns at vertices is not accounted for. Can someone suggest a procedure to include the weight at each vertex for a no-turn, right-turn, or left-turn? We can assume the weights (NT, RT, LT) = (0, 0.5, 1). When edge weight is combined with the turn effect, the shortest path would become 1-2-3-4. Below is the example in question. Thank you.
#
library(igraph)
n <- c(1,2,3,4,5,6,7,8)
x <- c(1,4,7,10,1,4,7,10)
y <- c(1,1,1,1,4,4,4,4)
node <- data.frame(n,x,y)
fm <- c(1,2,3,5,6,7,1,2,3,4)
to<-c(2,3,4,6,7,8,5,6,7,8)
weight<- c(1,4,1,1,1,2,5,1,1,1)
link <- data.frame(fm,to,weight)
g <- graph.data.frame(link,directed=FALSE,vertices=node)
sv <- get.shortest.paths(g,1,4,weights=NULL,output="vpath")
sv
E(g)$color <- "pink"
E(g, path=sv$vpath[[1]])$width <- 8
plot(g,edge.color="red")
plot(g,edge.label=weight,edge.label.color="blue",edge.label.cex=2)
As a preprocessing step: for each vertex v with a incoming edges and b outgoing edges, split it into a vertexes connected to those incoming edges and b vertexes connected to those outgoing edges. Then create edges representing turning costs in between.
In principle, Jeffery is describing what we want, but at our problem size we need a programmatic solution: maybe 200,000 vertices, each with 3 to 6 edges. Is there a way to explode, for instance, a standard intersection of 4 edges in and 4 edges out into its 16 movements (right, through, left) and automatically assign left, through and right penalties?
Most important is the ability to have smaller penalties for turning at T intersections (ease of wayfinding) than for turning at traditional intersections/vertices.
Is this tractable for a huge network?
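One way to make the preprocessing idea programmatic is an arc-based expansion: every directed arc of the street network becomes a node, and a transition from arc u->v to arc v->w costs the weight of v->w plus a turn penalty derived from the coordinates. The sketch below applies this to the 8-node example; the names turn_penalty, "S" and "T" are made up, U-turns are simply forbidden, and left/right is decided from the sign of a cross product assuming y points up. The expanded graph has one node per arc and deg(v) * (deg(v) - 1) transitions per vertex, so its size grows linearly with the network when degrees stay small.

```r
library(igraph)

node <- data.frame(n = 1:8, x = c(1, 4, 7, 10, 1, 4, 7, 10),
                   y = c(1, 1, 1, 1, 4, 4, 4, 4))
link <- data.frame(fm = c(1, 2, 3, 5, 6, 7, 1, 2, 3, 4),
                   to = c(2, 3, 4, 6, 7, 8, 5, 6, 7, 8),
                   weight = c(1, 4, 1, 1, 1, 2, 5, 1, 1, 1))

# Each undirected link becomes two directed arcs, labelled like "2>6"
arcs <- rbind(link, data.frame(fm = link$to, to = link$fm, weight = link$weight))
arcs$id <- paste0(arcs$fm, ">", arcs$to)

# Penalty for moving u -> v -> w: straight 0, right 0.5, left 1
turn_penalty <- function(u, v, w) {
  p <- node[match(c(u, v, w), node$n), c("x", "y")]
  cross <- (p$x[2] - p$x[1]) * (p$y[3] - p$y[2]) -
           (p$y[2] - p$y[1]) * (p$x[3] - p$x[2])
  if (cross > 0) 1 else if (cross < 0) 0.5 else 0
}

# All consecutive arc pairs u->v, v->w, excluding U-turns (w == u)
incoming <- data.frame(u = arcs$fm, v = arcs$to, id_in = arcs$id,
                       stringsAsFactors = FALSE)
outgoing <- data.frame(v = arcs$fm, w = arcs$to, w_out = arcs$weight,
                       id_out = arcs$id, stringsAsFactors = FALSE)
pairs <- merge(incoming, outgoing, by = "v")
pairs <- pairs[pairs$u != pairs$w, ]
pairs$cost <- pairs$w_out + mapply(turn_penalty, pairs$u, pairs$v, pairs$w)

# Artificial source "S" feeds arcs leaving node 1; arcs entering node 4 feed sink "T"
src <- 1; dst <- 4
start <- arcs[arcs$fm == src, ]
end   <- arcs[arcs$to == dst, ]
el <- rbind(
  data.frame(from = pairs$id_in, to = pairs$id_out, weight = pairs$cost,
             stringsAsFactors = FALSE),
  data.frame(from = "S", to = start$id, weight = start$weight,
             stringsAsFactors = FALSE),
  data.frame(from = end$id, to = "T", weight = 0, stringsAsFactors = FALSE)
)
h <- graph_from_data_frame(el, directed = TRUE)

total_cost <- distances(h, v = "S", to = "T", mode = "out")[1, 1]
route <- names(shortest_paths(h, from = "S", to = "T", mode = "out")$vpath[[1]])
total_cost
route
```

For a large network the mapply() call would be the first thing to vectorise, and junction-specific penalties (for example cheaper turns at T intersections) could be added by making turn_penalty depend on the degree of v.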

Calculate degree, closeness and betweenness in R

I have a data table which consists of names of users who post in the same thread in a forum, it looks like that:
X1 X2
1. g79 kian
2. g79 greyracer
3. g79 oldskoo1 ...
I need to calculate degree, closeness and betweenness. I'm using the following code:
library(igraph)
setwd("/Volumes/NATASHKA/api/R files")
load("edgelist_one_mode.rda")
load("map.rda")
load ("result.rda")
el <- as.matrix(whatwewant)
el[,1] <- as.character(el[,1])
el[,2] <- as.character(el[,2])
g <- graph.data.frame(el, directed=FALSE)
plot(g, edge.arrow.size=.5)
indegreeG <- degree(g, mode="in")
outdegreeG <- degree(g, mode="out")
totaldegreeG <- degree(g)
inclosenessG <- closeness(g, mode='in')
outclosenessG <- closeness(g, mode='out')
totalclosenessG <- closeness(g)
betweennessG <- betweenness(g)
forumG <- data.frame(V(g)$name, indegreeG, outdegreeG, totaldegreeG, inclosenessG, outclosenessG, totalclosenessG, betweennessG)
write.table(forumG,file="forumG.csv",sep=";")
The question is why do I get the same values for in-degree, out-degree and total-degree, the same for closeness? Besides, at the beginning I have 41213 users, but after analysis (when I calculate degree, etc..) I only have 37874. How could I lose so many observations? Please tell me if I have a mistake in the code.
Thanks
The reason you get the same value for in-degree, out-degree and total degree is because you are creating an undirected network with the graph.data.frame(el, directed=FALSE).
In an undirected network, the number of links from a node and to a node are the same and they are both equal to the global degree.
If you want a directed network, you will need to do graph.data.frame(el, directed=TRUE).
It will create a directed network in which the id in the first column of your dataframe is the id of the node sending the tie and the id in the second column indicates the node receiving that tie.
As for losing nodes, my guess would be that you have some individuals who never interact with anyone and are therefore lost when you transform your two-mode network into a one-mode network (I assume you do this, but don't show us how, because of your line load("edgelist_one_mode.rda")).
Short of a reproducible example, I think that is all I can deduce from your code.
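Both points can be reproduced on a tiny example. The sketch below uses three user names from the question plus two made-up rows, and a hypothetical user "lurker" who never appears in the edge list; passing the full user list through the vertices argument is one way to keep such isolates in the graph.

```r
library(igraph)

el <- data.frame(
  X1 = c("g79", "g79", "g79", "kian"),
  X2 = c("kian", "greyracer", "oldskoo1", "greyracer"),
  stringsAsFactors = FALSE
)
# Full user list, including someone with no ties at all (hypothetical)
users <- data.frame(name = c("g79", "kian", "greyracer", "oldskoo1", "lurker"),
                    stringsAsFactors = FALSE)

g_u <- graph_from_data_frame(el, directed = FALSE, vertices = users)
g_d <- graph_from_data_frame(el, directed = TRUE,  vertices = users)

# Undirected: the mode argument has no effect, all three degrees coincide
cbind(in_deg  = degree(g_u, mode = "in"),
      out_deg = degree(g_u, mode = "out"),
      total   = degree(g_u))

# Directed: first column sends the tie, second column receives it
cbind(in_deg  = degree(g_d, mode = "in"),
      out_deg = degree(g_d, mode = "out"),
      total   = degree(g_d))

# Without the vertices argument the isolate silently disappears
vcount(graph_from_data_frame(el, directed = TRUE))
vcount(g_d)
```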

Resources