Calculate degree, closeness and betweenness in R

I have a data table of the names of users who post in the same thread in a forum. It looks like this:
X1 X2
1. g79 kian
2. g79 greyracer
3. g79 oldskoo1 ...
I need to calculate degree, closeness and betweenness. I'm using the following code:
library(igraph)
setwd("/Volumes/NATASHKA/api/R files")
load("edgelist_one_mode.rda")
load("map.rda")
load ("result.rda")
el <- as.matrix(whatwewant)
el[,1] <- as.character(el[,1])
el[,2] <- as.character(el[,2])
g <- graph.data.frame(el, directed=FALSE)
plot(g, edge.arrow.size=.5)
indegreeG <- degree(g, mode="in")
outdegreeG <- degree(g, mode="out")
totaldegreeG <- degree(g)
inclosenessG <- closeness(g, mode='in')
outclosenessG <- closeness(g, mode='out')
totalclosenessG <- closeness(g)
betweennessG <- betweenness(g)
forumG <- data.frame(V(g)$name, indegreeG, outdegreeG, totaldegreeG, inclosenessG, outclosenessG, totalclosenessG, betweennessG)
write.table(forumG,file="forumG.csv",sep=";")
Why do I get the same values for in-degree, out-degree and total degree, and likewise for closeness? Also, I start with 41213 users, but after the analysis (when I calculate degree, etc.) I only have 37874. How could I lose so many observations? Please tell me if there is a mistake in my code.
Thanks

You get the same value for in-degree, out-degree and total degree because graph.data.frame(el, directed=FALSE) creates an undirected network.
In an undirected network, the number of links to a node and from a node are the same, and both equal the total degree. The mode argument is ignored for undirected graphs, which is also why the three closeness values coincide.
If you want a directed network, use graph.data.frame(el, directed=TRUE).
This creates a directed network in which the first column of your data frame holds the id of the node sending the tie and the second column holds the id of the node receiving it.
As for losing nodes, my guess is that some individuals never interact with anyone and are therefore dropped when you transform your two-mode network into a one-mode one. (I assume you do this without showing us how, because of the line load("edgelist_one_mode.rda").) A graph built from an edge list contains only the vertices that appear in at least one edge.
Short of a reproducible example, I think that is all I can deduce from your code.
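Both points can be checked on a toy edge list. The sketch below uses the three rows shown in the question plus a made-up user "lurker" who never shares a thread with anyone (an assumption for illustration only). It calls graph_from_data_frame(), the current name of graph.data.frame().

```r
library(igraph)

# Toy edge list built from the rows shown in the question
el <- data.frame(X1 = c("g79", "g79", "g79"),
                 X2 = c("kian", "greyracer", "oldskoo1"))

g_undir <- graph_from_data_frame(el, directed = FALSE)
g_dir   <- graph_from_data_frame(el, directed = TRUE)

# Undirected: mode is ignored, so "in" and "out" both return the total degree
degree(g_undir, mode = "in")
degree(g_undir, mode = "out")

# Directed: g79 sends three ties and receives none
degree(g_dir, mode = "in")
degree(g_dir, mode = "out")

# Users absent from the edge list are not in the graph unless listed explicitly.
# "lurker" is a hypothetical isolated user.
users <- data.frame(name = c("g79", "kian", "greyracer", "oldskoo1", "lurker"))
g_all <- graph_from_data_frame(el, directed = TRUE, vertices = users)

vcount(g_dir)  # 4 vertices: only those that appear in an edge
vcount(g_all)  # 5 vertices: the isolate is kept
```

Comparing the number of unique names in the edge list with the number of users in the original data should show whether the missing 3339 users are isolates of this kind.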

Related

Remove maximal cliques in igraph

I have a disconnected undirected network.
I want to identify and remove all the components that are cliques.
I do not want to remove all the cliques, just those that are themselves a component of the network.
How should I proceed?
library(igraph)
g <- graph_from_literal(a-b-c-d-b,e-f-g-e,h-i-l)
result <- graph_from_literal(a-b-c-d-b,h-i-l)
One solution is the following, but I do not know how efficient it is on large networks.
d <- graph_from_literal(a-b-c-d-b,e-f-g-e,h-i-l)
d0 <- decompose.graph(d)
d1 <- disjoint_union(d0[unlist(lapply(d0, function(x) count_max_cliques(x)!=1))])
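A possible alternative that avoids decomposing the graph and enumerating cliques: a connected component with n vertices is a clique exactly when it has n(n-1)/2 edges. The sketch below assumes a simple graph (no loops or multi-edges); the variable names are illustrative. Note that an isolated vertex counts as a clique here, as it does in the count_max_cliques() version.

```r
library(igraph)

g <- graph_from_literal(a-b-c-d-b, e-f-g-e, h-i-l)

comp <- components(g)

# Number of vertices per component
n_v <- comp$csize

# Number of edges per component: look up the component of each edge's first endpoint
el  <- ends(g, E(g), names = FALSE)
n_e <- tabulate(comp$membership[el[, 1]], nbins = comp$no)

# A component is a clique when it has all n(n-1)/2 possible edges
is_clique <- n_e == n_v * (n_v - 1) / 2

# Drop every vertex that belongs to a clique component
result <- delete_vertices(g, which(is_clique[comp$membership]))
```

This is a single pass over the vertices and edges, so it should scale better than building one graph object per component, though that has not been benchmarked here.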

Different results from betweenness score weighted graph vs multigraph

I am playing around with StatsBomb FIFA World Cup 18 data and am trying to figure out the central players in each team. I do this by constructing a network of passes (a directed graph from the player making a pass to the player receiving it). Then I look at various centrality measures to gauge the most pivotal players (as regards playmaking).
There are essentially two ways to present this data to the algorithm. One is to give each event (pass) as an individual row when creating the graph (multiple edges between two nodes); the other is to aggregate by passer and receiver and weight each edge by how many times player A passed to player B.
When looking at strength scores, the results are exactly the same. However, that alone is uninteresting (strikers receive a lot of balls but do not pass them onwards as much, so they cannot be said to be in the thick of things).
A betweenness score would be more appropriate, calculating how often player X is the bridge when the ball goes from A to B (to the best of my understanding of the weighted version of the betweenness measure).
However, here the results fluctuate wildly. The one-pass-per-row version, with multiple edges between two nodes, gives reasonably logical results (here, looking at Argentina vs Croatia, Messi is the second most central player after defender Nicolas Tagliafico, with strikers Aguero and later Higuain at the other end). If you remember, the game was a disaster for Argentina: https://www.independent.co.uk/sport/football/world-cup/world-cup-2018-argentina-jorge-sampaoli-group-nigeria-lionel-messi-tactics-a8411031.html
The weighted version, however, puts substitute Higuain at the top and gives good scores to Aguero and two other substitutes, Dybala and Pavon, as well. This cannot be right, but I have no idea why the results are so different. Does it matter that at the individual level the passes are ordered?
Here's my R code. StatsBombR needs to be installed from GitHub via devtools::install_github("statsbomb/StatsBombR")
library(StatsBombR)
library(dplyr)
library(igraph)
matches <- FreeMatches(43)
#download match events
match <- get.matchFree(matches[9,])
#take info about passes, remove non-pass events
passes <- select(match,player.name,pass.recipient.name,team.name)
passes <- na.omit(passes)
#teams in match
teams <- unique(passes$team.name)
#two ways of presenting data, pass-per-row or aggregated player-wise
teamPasses <- passes[passes$team.name==teams[2],1:2]
weightPasses <- teamPasses %>%
  group_by(player.name, pass.recipient.name) %>%
  summarise(weight=n())
#create graphs
net <- graph_from_data_frame(teamPasses, directed = TRUE)
net2 <- graph_from_data_frame(weightPasses, directed = TRUE)
E(net2)$weight <- weightPasses$weight
#scores
betweenness(net)
betweenness(net2)
strength(net, mode = "out")
strength(net2, mode = "out")
Regarding strength():
The values are identical; only the ordering of the vertices differs.
all.equal(
sort(strength(net, mode="out")),
sort(strength(net2, mode="out"))
)
# TRUE
Regarding betweenness():
It is easier to explain on a pair of simple graphs like these two:
el <- tibble(
from = c("A", "B", "A", "D", "A"),
to = c("B", "C", "D", "C", "B")
)
g <- graph_from_data_frame(el)
elw <- el %>%
  count(from, to) %>%
  rename(weight=n)
gw <- graph_from_data_frame(elw)
plot(g)
In gw there are only 4 arcs (vs 5 in g) and arc A -> B has weight 2 while all others have weight 1. Let's focus on betweenness of B:
In the multigraph g it will be 2/3 because:
B lies only on the shortest paths between A and C
There are altogether 3 shortest paths from A to C: A->D->C and two copies of A->B->C, since we can pick either of the two arcs in the (A,B) dyad.
Two of those paths involve B
In the weighted graph gw it will be 0 because:
The path A->B->C has weight 3
The path A->D->C has weight 2
There is only one "shortest" (minimal weight) path, and it does not involve B
In other words, in the weighted graph a shortest path is the path with the minimum sum of arc weights. In that sense you are finding paths through players who pass the ball to each other the least, which is probably not what you are after.
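If frequent passing should count as a short distance rather than a long one, one common workaround is to pass inverted weights to betweenness(). The sketch below rebuilds the small weighted graph gw from above with base R and compares the two calls. Whether 1/weight is the right transformation for a pass network is a modelling assumption, not something the data dictates.

```r
library(igraph)

# Same weighted graph as gw above: A->B was passed twice, the others once
elw <- data.frame(from   = c("A", "B", "A", "D"),
                  to     = c("B", "C", "D", "C"),
                  weight = c(2, 1, 1, 1))
gw <- graph_from_data_frame(elw, directed = TRUE)

# Default: the weight attribute is read as a distance, so A->B is a "long" arc
bw_default <- betweenness(gw)

# Inverted: many passes means a short distance, so A->B->C (0.5 + 1) beats A->D->C (1 + 1)
bw_inverse <- betweenness(gw, weights = 1 / E(gw)$weight)

bw_default[c("B", "D")]  # B = 0, D = 1
bw_inverse[c("B", "D")]  # B = 1, D = 0
```

With inverted weights the ranking follows the heavily used passing lanes, which is closer to what the multigraph version measures, although the two scores are still not identical in general.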

After clustering in R (iGraph, etc), can you maintain nodes+edges from a cluster to do individual cluster analysis?

Basically I have tried a few different ways of clustering. I can usually get to a point in iGraph where each node is labeled with a cluster. I can then identify all the nodes within a single cluster. However, this loses their edges.
I'd have to re-iterate back over the original dataset for all the nodes in cluster 1 to get only those where both nodes+the edge are within the cluster. I'd have to do this for every cluster.
This seems like a painfully long process and there is probably a shortcut my google-fu is missing.
So, after clustering or community detection, is there an easy way to keep an individual cluster/community as its own smaller graph, that is, retaining all of its nodes AND the edges between them?
You can use delete.vertices() to create a subgraph. Example:
library(igraph)
set.seed(123)
# create random graph
g <- barabasi.game(100, directed = FALSE)
plot(g, layout=layout.fruchterman.reingold)
# do community detection
wc <- multilevel.community(g)
V(g)$community <- membership(wc)
# make community 1 subgraph
g_sub <- delete.vertices(g, V(g)[community != 1])
plot(g_sub, layout=layout.fruchterman.reingold)
An alternative:
#Create random network
d <- sample_gnm(n=50,m=40)
#Identify the communities
dc <- cluster_walktrap(d)
#Induce a subgraph out of the first community
dc_1 <- induced.subgraph(d,dc[[1]])
#plot that specific community
plot(dc_1)
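To get every community as its own graph in one step, the same idea can be applied over the whole membership. This is a minimal sketch using the current function names (sample_pa() and cluster_louvain() in place of barabasi.game() and multilevel.community()); the object names are illustrative.

```r
library(igraph)
set.seed(123)

# Random graph and community detection, as in the first example
g  <- sample_pa(100, directed = FALSE)
wc <- cluster_louvain(g)

# communities() returns one vector of vertex ids per community;
# induced_subgraph() keeps those vertices and all edges among them
subgraphs <- lapply(communities(wc), function(v) induced_subgraph(g, v))

# Each list element is an ordinary igraph object
sapply(subgraphs, vcount)
sapply(subgraphs, ecount)
```

Edges that run between two different communities are not part of any subgraph, so the edge counts add up to the number of non-crossing edges rather than to ecount(g).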

extract a connected subgraph from a subset of vertices with igraph

I have an unweighted, undirected and connected graph G(V,E) with 12744 nodes and 166262 edges. I have a set of nodes (sub_set) that is a subset of V. I am interested in extracting the smallest connected subgraph that contains all of sub_set. I have managed to get a subgraph that includes my subset of nodes, but I would like to know if there is a way to make it smaller.
Here is my code (adapted from http://sidderb.wordpress.com/2013/07/16/irefr-ppi-data-access-from-r/)
library('igraph')
g <- erdos.renyi.game(10000, 0.003) # graph for illustrating my purpose
sub_set <- sample(V(g), 80)
order <- 1
edges <- get.edges(g, 1:(ecount(g)))
neighbours.vid <- unique(unlist(neighborhood(g, order, which(V(g) %in% sub_set))))
rel.vid <- edges[intersect(which(edges[,1] %in% neighbours.vid), which(edges[,2] %in% neighbours.vid)),]
rel <- as.data.frame(cbind(V(g)[rel.vid[,1]], V(g)[rel.vid[,2]]), stringsAsFactors=FALSE)
names(rel) <- c("from", "to")
subgraph <- graph.data.frame(rel, directed=F)
subgraph <- simplify(subgraph)
I have read the post "minimum connected subgraph containing a given set of nodes", so I guess that my problem could be the Steiner tree problem. Is there any way to find a suboptimal solution using igraph?
Not sure if that's what you meant, but
subgraph <- minimum.spanning.tree(subgraph)
produces a graph with the minimum number of edges in which all nodes stay connected in one component (provided subgraph is connected to begin with).
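A spanning tree of the neighbourhood subgraph still keeps every neighbour, including those that are not needed to connect sub_set. One simple heuristic for a smaller (but generally not optimal) Steiner-like tree is to keep only the vertices on shortest paths from one terminal to all the others. The sketch below is an assumption-laden illustration on a smaller random graph, not a proper Steiner tree solver; the names terminals, keep and tree are made up.

```r
library(igraph)
set.seed(42)

# Random graph; keep its largest component so every terminal is reachable
g    <- sample_gnp(1000, 0.01)
comp <- components(g)
g    <- induced_subgraph(g, which(comp$membership == which.max(comp$csize)))
V(g)$name <- paste0("v", seq_len(vcount(g)))

# The vertices that must end up in the result
terminals <- sample(V(g)$name, 20)

# Shortest paths from the first terminal to each of the others
paths <- shortest_paths(g, from = terminals[1], to = terminals[-1],
                        output = "vpath")$vpath

# Keep only vertices lying on one of those paths
keep <- unique(c(terminals[1], unlist(lapply(paths, as_ids))))

# Induce the subgraph and reduce it to a tree
sub  <- induced_subgraph(g, keep)
tree <- mst(sub)
```

Because every path starts at the same terminal, the result is connected and contains all terminals. It may still contain non-terminal leaves that could be pruned, and choosing a different root terminal can give a different size.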

igraph --- find shortest path including weight at turns

The following example gives the shortest path 1-2-6-7-3-4, where only the weight of the edges is considered and the weight of turns at vertices is not accounted for. Can someone suggest a procedure to include a weight at each vertex depending on whether the movement is a no-turn, right-turn or left-turn? We can assume the weights (NT, RT, LT) = (0, 0.5, 1). When the edge weight is combined with the turn effect, the shortest path would become 1-2-3-4. Below is the example in question. Thank you.
library(igraph)
n <- c(1,2,3,4,5,6,7,8)
x <- c(1,4,7,10,1,4,7,10)
y <- c(1,1,1,1,4,4,4,4)
node <- data.frame(n,x,y)
fm <- c(1,2,3,5,6,7,1,2,3,4)
to<-c(2,3,4,6,7,8,5,6,7,8)
weight<- c(1,4,1,1,1,2,5,1,1,1)
link <- data.frame(fm,to,weight)
g <- graph.data.frame(link,directed=FALSE,vertices=node)
sv <- get.shortest.paths(g,1,4,weights=NULL,output="vpath")
sv
E(g)$color <- "pink"
E(g, path=sv$vpath[[1]])$width <- 8
plot(g,edge.color="red")
plot(g,edge.label=weight,edge.label.color="blue",edge.label.cex=2)
As a preprocessing step: for each vertex v with a incoming edges and b outgoing edges, split it into a vertices, one per incoming edge, and b vertices, one per outgoing edge. Then add edges between them that represent the turning costs.
In principle, Jeffery is describing what we want, but the problem is large enough that we need a programmatic solution: maybe 200,000 vertices with 3 to 6 edges each. Is there a way to explode, for instance, a standard intersection of 4 edges in and 4 edges out into the 16 right, through and left movements, and automatically assign left, through and right penalties?
Most important is the ability to give turns at T intersections a lesser penalty (ease of wayfinding) than turns at traditional intersections/vertices.
Is this tractable for a huge network?
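The expansion can be generated programmatically. The sketch below is one possible way to do it on the 8-node example, under several assumptions: every directed arc becomes a node of an expanded graph, U-turns are forbidden, the turn type is read from the sign of the cross product of the incoming and outgoing directions (with y pointing up), and node ids equal row numbers in node. The helper turn_cost() and the artificial nodes "S" and "T" are hypothetical names introduced here.

```r
library(igraph)

node <- data.frame(n = 1:8,
                   x = c(1, 4, 7, 10, 1, 4, 7, 10),
                   y = c(1, 1, 1, 1, 4, 4, 4, 4))
link <- data.frame(fm = c(1, 2, 3, 5, 6, 7, 1, 2, 3, 4),
                   to = c(2, 3, 4, 6, 7, 8, 5, 6, 7, 8),
                   weight = c(1, 4, 1, 1, 1, 2, 5, 1, 1, 1))

# Each undirected link becomes two directed arcs; each arc is a node of the expanded graph
arcs <- rbind(link, data.frame(fm = link$to, to = link$fm, weight = link$weight))
arcs$id <- paste(arcs$fm, arcs$to, sep = ">")

# Hypothetical helper: penalty for moving u -> v -> w, from the cross product sign
turn_cost <- function(u, v, w, pen = c(NT = 0, RT = 0.5, LT = 1)) {
  d1 <- c(node$x[v] - node$x[u], node$y[v] - node$y[u])
  d2 <- c(node$x[w] - node$x[v], node$y[w] - node$y[v])
  cr <- d1[1] * d2[2] - d1[2] * d2[1]
  if (cr > 0) pen[["LT"]] else if (cr < 0) pen[["RT"]] else pen[["NT"]]
}

# Transitions: arc u>v connects to arc v>w (w != u); cost = weight of v>w + turn at v
trans <- do.call(rbind, lapply(seq_len(nrow(arcs)), function(i) {
  nxt <- which(arcs$fm == arcs$to[i] & arcs$to != arcs$fm[i])
  if (length(nxt) == 0) return(NULL)
  turn <- vapply(nxt, function(j) turn_cost(arcs$fm[i], arcs$to[i], arcs$to[j]),
                 numeric(1))
  data.frame(from = arcs$id[i], to = arcs$id[nxt], weight = arcs$weight[nxt] + turn)
}))

# Artificial source "S" and sink "T" tie the expanded graph to the original endpoints
src <- 1
dst <- 4
start <- data.frame(from = "S", to = arcs$id[arcs$fm == src],
                    weight = arcs$weight[arcs$fm == src])
end   <- data.frame(from = arcs$id[arcs$to == dst], to = "T", weight = 0)

h  <- graph_from_data_frame(rbind(start, trans, end), directed = TRUE)
sp <- shortest_paths(h, from = "S", to = "T", output = "vpath")$vpath[[1]]

# Translate the sequence of arcs back into original vertex ids
arc_ids <- setdiff(as_ids(sp), c("S", "T"))
route   <- c(src, arcs$to[match(arc_ids, arcs$id)])
total   <- distances(h, "S", "T", mode = "out")[1, 1]

route  # 1 2 3 4
total  # 6
```

The expanded graph has one node per directed arc and roughly (degree - 1) transitions per arc, so for 200,000 vertices with 3 to 6 edges each it stays in the low millions of edges. The row-by-row loop above would need to be vectorised for that size. A lower penalty at T intersections could be handled by letting turn_cost() also look at the degree of v.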
