Extract graphs based on identifier and calculate network measures in igraph - r

I want to separately analyze groups within a network. For example, the UK faculty data in the igraphdata package has some network data with group information on the node level.
library(igraph)
library(igraphdata)
data("UKfaculty")
V(UKfaculty)$Group
I want to extract networks based on the 4 groups and run a few calculations on the extracted graph (density, average degree, diameter, clustering coefficient, etc.) and store this information based on the groups in a dataframe. I want to calculate the measures only based on the nodes within a group, not on the whole network level (e.g. calculating only centrality based on connections in group 1, not taking connections to other groups into account).
Group density diameter
1 x x
2 x x
3 x x
Any idea how to efficiently do this?

You can use induced_subgraph to extract the subgraphs based on a list of vertices for every group.
library(igraph)
library(igraphdata)
data("UKfaculty")
ig <- UKfaculty
# `list` of vertices for every group
idx <- split(V(ig), V(ig)$Group)
# Create subgraphs based on the `list` of vertices
lst <- lapply(idx, function(v) induced_subgraph(ig, v))
It's then straightforward to calculate any subgraph-specific metrics, e.g.
do.call(rbind, lapply(lst, function(ig)
  data.frame(
    Group    = unique(V(ig)$Group),
    density  = edge_density(ig),
    diameter = diameter(ig))))
# Group density diameter
#1 1 0.3001894 21
#2 2 0.3561254 12
#3 3 0.2807018 14
#4 4 1.0000000 12
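The same pattern extends to the other measures mentioned in the question; for example, a small sketch (output omitted) adding mean degree and the global clustering coefficient via transitivity():
do.call(rbind, lapply(lst, function(ig)
  data.frame(
    Group       = unique(V(ig)$Group),
    density     = edge_density(ig),
    diameter    = diameter(ig),
    mean_degree = mean(degree(ig)),
    clustering  = transitivity(ig, type = "global"))))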

Related

Adding ties in a network based on node attribute (weight)

I am simulating network change over time using igraph in r and am looking for an efficient and scalable way to code this for use in business.
The main drivers of network change are:
New nodes
New ties
New node weights
In the first stage, in the network of 100 nodes 10% are randomly connected. The node weights are also assigned at random. The network is undirected. There are 100 stages.
In each of the following stages:
Ten (10) new nodes occur randomly and are added to the model. They are unconnected in this stage.
The node weights of these new nodes are assigned at random.
The new ties between two nodes at time t+1 are a probabilistic function of the network distance between these nodes and of the node weights at the previous stage (time t). Nodes at greater network distance are less likely to connect than nodes at shorter distance. The decay function is exponential.
Nodes with greater weight attract more ties than those with smaller weights. The relationship between node weight and the probability of tie formation should be super-linear.
In each step, ties amounting to 10% of the total existing ties are added, following the two points above.
The network ties and nodes from previous stages are carried over (i.e. the networks are cumulative).
At each stage, each node weight can change randomly by up to 10% of its current value (i.e. a weight of 1 can change to a value in {0.9-1.1} at t+1)
At each stage, the network needs to be saved.
How can this be written?
Edit: these networks will be examined on a number of graph-level characteristics at a later stage
This is what I have now, but it doesn't include the node weights. How can these be included efficiently?
library(igraph)
library(stringr)   # str_split() is used below
# number of nodes and ties to start with
n = 100
p = 0.1
r = -2
# build random network
net1 <- erdos.renyi.game(n, p, "gnp", directed = F)
#plot(net1)
write_graph(net1, paste0("D://network_sim_0.dl"), format="pajek")
for(i in seq(1,100,1)){
print(i)
time <- proc.time()
net1 <- read_graph(paste0("D://network_sim_",i-1,".dl"), format="pajek")
# how many will we build in next stage?
new_ties <- round(0.1*ecount(net1), 0) # 10% of those in net1
# add 10 new nodes
net2 <- add_vertices(net1, 10)
# get network distance for each dyad in net1 + the new nodes
spel <- reshape2::melt(shortest.paths(net2))
names(spel) <- c("node_i", "node_j", "distance")
# replace inf with max observed value + 1
spel$distance[which(!is.finite(spel$distance))] <- max(spel$distance[is.finite(spel$distance)]) +1
# assign a probability (?) with a exponential decay function. Smallest distance == greatest prob.
spel$prob <- -0.5 * spel$distance^r # is this what I need?
#hist(spel$prob, freq=T, xlab="Probability of tie-formation")
#hist(spel$distance, freq=T, xlab="Network Distance")
# lets sample new ties from this probability
spel$index <- seq_along(spel$prob)
to_build <- subset(spel, index %in% sample(spel$index, size = new_ties, prob=spel$prob))
net2 <- add_edges(net2, as.numeric(unlist(str_split(paste(to_build$node_i, to_build$node_j), " "))))
# save the network
write_graph(net2, paste0("D://network_sim_",i,".dl"), format="pajek")
print(proc.time()-time)
}
I will try to answer this question as far as I understand it.
There are a couple of assumptions I made; I should clarify them.
First, what distribution will node weights follow?
If you are modeling a naturally occurring event, the node weights will most likely follow a normal distribution. However, if the event is socially oriented and other social mechanisms influence the event or its popularity, the node weights might follow a different distribution, most likely a power-law distribution.
This is mainly true for customer-related behaviors, so it is worth considering which random distribution you model the node weights with.
For the following example, I draw each node's weight from a normal distribution. At the end of each iteration, I let the node weights change by up to 10% (a multiplicative factor in {0.9, 1.1}).
Second, what is the probability function of tie formation?
We have two inputs for making the decision: distance weights and node weights. So we build a function of these two inputs and use it to define the probability weights. What I understood is: the smaller the distance, the higher the likelihood of a tie; and likewise, the greater the node weight, the higher the likelihood.
It might not be the best solution, but I did the following:
First, I calculate the decay function of the distances and call it the distance weight. Then I take the node weights and combine both in a super-linear function.
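Concretely, the probability weight used for a candidate tie (i, j) in the code below can be summarised by this small helper (illustrative only; the arguments distance_ij and node_weight_j stand for the dyad's network distance and the target node's weight):
tie_prob_weight <- function(distance_ij, node_weight_j,
                            beta_distance_weight = -0.4,
                            impact_of_distances = 0.3,
                            impact_of_nodes = 0.7,
                            power_base = 5.5) {
  distance_weight <- exp(beta_distance_weight * distance_ij)   # exponential decay in distance
  power_base^(impact_of_distances * distance_weight +
              impact_of_nodes * node_weight_j)                  # super-linear in the combined weights
}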
So, there are some parameters you can play with and see whether you get a result you want.
By the way, I did not change most of your code, and I did not focus much on processing time; there is still room to improve.
library(scales)
library(stringr)
library(igraph)
# number of nodes and ties to start with
n <- 100
p <- 0.2
number_of_simulation <- 100
new_nodes <- 15 ## new nodes for each iteration
## Parameters ##
## How much will distance be weighted?
## Exponential decay parameter
beta_distance_weight <- -.4
## probability function parameters for the distance and node weights
impact_of_distances <- 0.3 ## how important are the distance weights?
impact_of_nodes <- 0.7 ## how important are the node weights?
power_base <- 5.5 ## how important is having a high score? Preferential attachment / super-linear function
# build random network
net1 <- erdos.renyi.game(n, p, "gnp", directed = F)
# Assign normally distributed random weights
V(net1)$weight <- rnorm(vcount(net1))
graph_list <- list(net1)
for(i in seq(1,number_of_simulation,1)){
print(i)
time <- proc.time()
net1 <- graph_list[[i]]
# how many will we build in next stage?
new_ties <- round(0.1*ecount(net1), 0) # 10% of those in net1
# add the new nodes
net2 <- add_vertices(net1, new_nodes)
## Add random weights to new nodes from a normal distribution
V(net2)$weight[is.na(V(net2)$weight)] <- rnorm(new_nodes)
# get network distance for each dyad in net1 + the new nodes
spel <- reshape2::melt(shortest.paths(net2))
names(spel) <- c("node_i", "node_j", "distance")
# replace inf with max observed value + 1
spel$distance[which(!is.finite(spel$distance))] <- max(spel$distance[is.finite(spel$distance)]) +1
# Do not select dyads that are self-loops or already linked
spel <- spel[!spel$distance %in% c(0,1) , ]
# Assign a distance weight to each dyad
spel$distance_weight <- exp(beta_distance_weight*spel$distance)
#hist(spel$distance_weight, freq=T, xlab="Probability of tie-formation")
#hist(spel$distance, freq=T, xlab="Network Distance")
## Get the node weights for merging the data with the distances
node_weights <- data.frame(id= 1:vcount(net2),node_weight=V(net2)$weight)
spel <- merge(spel,node_weights,by.x='node_j',by.y='id')
## probability is a function of distance and node weight
spel$prob <- power_base^((impact_of_distances * spel$distance_weight) + (impact_of_nodes * spel$node_weight))
spel <- spel[order(spel$prob, decreasing = T),]
# sample new ties, weighting each dyad by the rank of its probability (1/index)
spel$index <- seq_along(spel$prob)
to_build <- subset(spel, index %in% sample(spel$index, new_ties, p = 1/spel$index ))
net2 <- add_edges(net2, as.numeric(unlist(str_split(paste(to_build$node_i, to_build$node_j), " "))))
# change the weights by up to 10%
V(net2)$weight <- V(net2)$weight*rescale(rnorm(vcount(net2)), to = c(0.9, 1.1))
graph_list[[i+1]] <- net2
print(proc.time()-time)
}
To get the results or write the graphs to Pajek format, you can use the following:
lapply(seq_along(graph_list),function(x) write_graph(graph_list[[x]], paste0("network_sim_",x,".dl"), format="pajek"))
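Since the question mentions that the networks will later be examined on graph-level characteristics, here is a minimal sketch of collecting a few such measures per stage directly from graph_list (the choice of measures is illustrative):
stage_stats <- do.call(rbind, lapply(seq_along(graph_list), function(i)
  data.frame(stage   = i - 1,
             nodes   = vcount(graph_list[[i]]),
             ties    = ecount(graph_list[[i]]),
             density = edge_density(graph_list[[i]]))))
head(stage_stats)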
EDIT
To change the node weight distribution (for example to a power-law-like one), you can use the following:
library(scales)
library(stringr)
library(igraph)
# number of nodes and ties to start with
n <- 100
p <- 0.2
number_of_simulation <- 100
new_nodes <- 10 ## new nodes for each iteration
## Parameters ##
## How much will distance be weighted?
## Exponential decay parameter
beta_distance_weight <- -.4
## Node weights for power-law dist
power_law_parameter <- -.08
## probability function parameters for the distance and node weights
impact_of_distances <- 0.3 ## how important are the distance weights?
impact_of_nodes <- 0.7 ## how important are the node weights?
power_base <- 5.5 ## how important is having a high score? Preferential attachment / super-linear function
# build random network
net1 <- erdos.renyi.game(n, p, "gnp", directed = F)
## MADE A CHANGE HERE
# Assign power-law-like random weights (instead of normally distributed ones)
V(net1)$weight <- runif(vcount(net1))^power_law_parameter
graph_list <- list(net1)
for(i in seq(1,number_of_simulation,1)){
print(i)
time <- proc.time()
net1 <- graph_list[[i]]
# how many will we build in next stage?
new_ties <- round(0.1*ecount(net1), 0) # 10% of those in net1
# add 10 new nodes
net2 <- add_vertices(net1, new_nodes)
## Add random weights to new nodes from the same power-law-like distribution
V(net2)$weight[is.na(V(net2)$weight)] <- runif(new_nodes)^power_law_parameter
# get network distance for each dyad in net1 + the new nodes
spel <- reshape2::melt(shortest.paths(net2))
names(spel) <- c("node_i", "node_j", "distance")
# replace inf with max observed value + 2
spel$distance[which(!is.finite(spel$distance))] <- max(spel$distance[is.finite(spel$distance)]) + 2
# Do not select dyads that are self-loops or already linked
spel <- spel[!spel$distance %in% c(0,1) , ]
# Assign a distance weight to each dyad
spel$distance_weight <- exp(beta_distance_weight*spel$distance)
#hist(spel$distance_weight, freq=T, xlab="Probability of tie-formation")
#hist(spel$distance, freq=T, xlab="Network Distance")
## Get the node weights for merging the data with the distances
node_weights <- data.frame(id= 1:vcount(net2),node_weight=V(net2)$weight)
spel <- merge(spel,node_weights,by.x='node_j',by.y='id')
## probability is a function of distance and node weight
spel$prob <- power_base^((impact_of_distances * spel$distance_weight) + (impact_of_nodes * spel$node_weight))
spel <- spel[order(spel$prob, decreasing = T),]
# sample new ties, weighting each dyad by the rank of its probability (1/index)
spel$index <- seq_along(spel$prob)
to_build <- subset(spel, index %in% sample(spel$index, new_ties, p = 1/spel$index ))
net2 <- add_edges(net2, as.numeric(unlist(str_split(paste(to_build$node_i, to_build$node_j), " "))))
# change the weights by up to 10%
V(net2)$weight <- V(net2)$weight*rescale(rnorm(vcount(net2)), to = c(0.9, 1.1))
graph_list[[i+1]] <- net2
print(proc.time()-time)
}
Result
So, to validate that the code works, I checked a small number of iterations with a limited number of nodes: 10 iterations starting with 4 nodes. In each iteration, I added 3 new nodes and one new tie.
I did this simulation with three different settings.
The first setting uses only the weight function of distance: the closer two nodes are, the more likely a new tie is formed between them.
The second setting uses only the weight function of the nodes: the higher a node's weight, the more likely a new tie is formed with it.
The third setting uses the weight functions of both distance and node weight: the closer and the more heavily weighted the nodes, the more likely a new tie is formed.
Observe how each setting produces different network behavior.
Figures: Only Distance Matters; Only Node Weight Matters; Both Node Weight and Distance Matter.

Find number of clusters using distance matrix with hierarchical clustering

How do I determine the optimal number of clusters when using hierarchical clustering? I only have the distance matrix, since I am measuring pairwise (Levenshtein) distances; how do I find the optimal number of clusters from it? I looked at other posts, but they all use k-means or hierarchical clustering on numeric data, not on string data as shown below. Any suggestions on how to find the number of clusters in R?
set.seed(1)
rstr <- function(n, k) { # vector of n random char(k) strings
  sapply(1:n, function(i) {
    do.call(paste0, as.list(sample(letters, k, replace = T)))
  })
}
str<- c(paste0("aa",rstr(10,3)),paste0("bb",rstr(10,3)),paste0("cc",rstr(10,3)))
# Levenshtein Distance
d <- adist(str)
rownames(d) <- str
hc <- hclust(as.dist(d))
Several statistics can be used.
Look for example at the WeightedCluster package that can compute and plot a series of such statistics.
To illustrate, you get the optimal number of groups for each available statistic as follows:
library("WeightedCluster")
hcRange <- as.clustrange(hc, diss=as.dist(d), ncluster=6)
summary(hcRange)
## 1. N groups 1. stat
## PBC 3 0.8799136
## HG 3 1.0000000
## HGSD 3 0.9987651
## ASW 3 0.4136550
## ASWw 3 0.4722895
## CH 3 8.3605263
## R2 6 0.4734561
## CHsq 3 20.6538462
## R2sq 6 0.6735039
## HC 3 0.0000000
You can also plot the statistics (here we show the average silhouette width ASWw, Huber's Gamma HG, and the point-biserial correlation PBC) for all the computed solutions:
plot(hcRange, stat = c("ASWw", "HG", "PBC"), lwd = 2)
The best solution seems to be the three-group solution.
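If you then want the actual cluster assignment for that solution, base R's cutree() can be applied to the hclust object from the question:
# Cut the dendrogram at the chosen number of groups
groups <- cutree(hc, k = 3)
table(groups)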

Compute distance based Local Moran for 527k+ point dataset using spdep library

As the title says, I'm trying to compute the Local Moran statistic for a 527k-point dataset using the spdep package, creating neighborhoods based on distance. The general process I'm following is:
library(spdep)
# Convert coordinates to matrix
matrix_pts <- as.matrix(coordinates)
# Generate neighbors
neighbors <- dnearneigh(matrix_pts,
                        d1 = 0,
                        d2 = range)
# Get weight matrix
wm <- nb2listw(neighbors,
               zero.policy = T,
               style = style)
# Get moran statistics
moran_stat <- localmoran(value,
                         wm,
                         zero.policy = T)
But I've run into a problem where I can't create the neighborhoods using dnearneigh, since the dataset is way too large, and neighborhoods are composed of 200-1000 points.
I tried the solution depicted here and got a dataframe with a first column of IDs and a second column holding, for each point, the list of IDs of its neighboring points (i.e.):
id int_ids
1: 239226 239226,242762,339386,444833,243000,240521,...
2: 242762 239226,242762,339386,444833,243000,240521,...
3: 339386 239226,242762,339386,444833,243000,240521,...
4: 444833 239226,242762,339386,444833,243000,240521,...
5: 243000 239226,242762,339386,444833,243000,240521,...
6: 240521 239226,242762,339386,444833,243000,240521,...
However, I don't know how to create the nb object required by nb2listw, and digging around hasn't helped me much.
Is there a way to transform this dataframe into an nb object? If so, would the weight matrix be as hard to create as the neighbors?
Is there another way to compute the Local Moran statistic for a dataset of this volume?
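For the first sub-question, a rough, untested sketch of one possible conversion (assuming df is the id/int_ids table shown above and that every ID in int_ids also appears in the id column): an spdep nb object is essentially a list of integer vectors giving neighbour positions.
library(spdep)
# Build the neighbour list by matching the comma-separated IDs in int_ids
# back to row positions in df$id (names here are illustrative)
ids <- df$id
nb <- lapply(seq_along(ids), function(i) {
  nbrs <- as.integer(strsplit(as.character(df$int_ids[i]), ",")[[1]])
  idx  <- match(setdiff(nbrs, ids[i]), ids)   # map IDs to row positions, drop self
  idx  <- sort(idx[!is.na(idx)])
  if (length(idx) == 0) 0L else idx           # spdep codes "no neighbours" as 0L
})
class(nb) <- "nb"
attr(nb, "region.id") <- as.character(ids)
# nb can then be passed to nb2listw() as in the code above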

extract correlated elements of a correlation matrix

I have a correlation matrix in R and I want to find the groups of elements that correlate with each other at more than 0.95, and put each group into a vector.
X <- matrix(0,3,5)
X[,1] <- c(1,2,3)
X[,2] <- c(1,2.2,3)*2
X[,3] <- c(1,2,3.3)*3
X[,4] <- c(6,5,1)
X[,5] <- c(6.1,5,1.2)*4
cor.matrix <- cor(X)
cor.matrix <- cor.matrix*lower.tri(cor.matrix)
cor.vector <- which(cor.matrix>0.95, arr.ind=TRUE)
cor.vector then contains:
row col
[1,] 2 1
[2,] 3 1
[3,] 3 2
[4,] 5 4
That means, as expected, that columns 1, 2 and 3 correlate with each other, and so do columns 4 and 5.
What I would need is to get the two vectors c(1,2,3) and c(4,5) as the final result.
This is a simple example, I am processing large matrices though.
Here's an approach using igraph package:
require(igraph)
g <- graph.data.frame(cor.vector, directed = FALSE)
split(unique(as.vector(cor.vector)), clusters(g)$membership)
# $`1`
# [1] 2 3 1
# $`2`
# [1] 5 4
What this essentially does is find the clusters in the graph g (disconnected components). Since the vertices are added to the graph in the order they appear in your cor.vector, the cluster membership comes back in the same order. That is: for vertices c(2,3,5,1,4) the clusters are c(1,1,2,1,2), with a total of two clusters (cluster 1 and cluster 2). So we just use this membership to split the vertices into groups.
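As a side note, graph.data.frame and clusters still work but are the older names; with current igraph the same idea can be written roughly as:
g2 <- graph_from_data_frame(as.data.frame(cor.vector), directed = FALSE)
split(as.integer(V(g2)$name), components(g2)$membership)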

Generating random graph in r

I would like to generate a random graph in R using any of the packages.
The desired output would be a two column matrix with the first column listing agents and the second column their connections of the following form:
1 3
1 4
1 6
1 7
2 2
2 5
3 9
3 11
3 32
3 43
3 2
4 5
I would like to be able to specify the average degree and minimum and maximum number of contacts.
What is the easiest way of doing this?
Since you don't specify the need for anything other than just a graph, we can do this very simply:
actor <- sample(1:4, 10, replace=TRUE)
receiver <- sample(3:43, 10, replace=TRUE)
graph <- cbind(actor,receiver)
If you want something more specific, have a look at igraph, for instance:
library(igraph)
graph <- erdos.renyi.game(21, 0.3, type = c("gnp", "gnm"),
                          directed = FALSE, loops = FALSE)
# here the 0.3 is the probability of ties and 21 is the number of nodes
# this is a one mode network
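If you need the two-column agent/contact matrix described in the question, you can convert the igraph object directly:
edge_list <- as_edgelist(graph)   # two-column matrix of ties
head(edge_list)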
Or use the bipartite package, which focuses specifically on two-mode networks:
library(bipartite)
web <- genweb(N1 = 5, N2 = 10, dens = 2)
web2edges(web,return=TRUE)
# here N1 is the number of nodes in set 1 and N2 the number of nodes in set 2
# and dens the average number of ties per node
There are many more things you could take into account, for instance constraining the degree distribution, the probability of ties between particular agents, etc.
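If you also want to control the degree distribution (minimum, maximum, or average number of contacts), one option is igraph's sample_degseq(); a rough sketch:
deg <- sample(2:6, 20, replace = TRUE)        # degrees between a chosen min and max
if (sum(deg) %% 2 == 1) deg[1] <- deg[1] + 1  # a degree sequence must sum to an even number
g <- sample_degseq(deg)
mean(degree(g))                               # realised average degree
head(as_edgelist(g))                          # two-column agent/contact format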
