Creating a network from an adjacency matrix using weights in networkx

Here is my question: from an HMM model I created, I want to build a network using the transition matrix (with the probability of going from one state to another) as an adjacency matrix. There are no zeros in this matrix, not even on the diagonal, because you can go from state 1 back to state 1.
How can I be sure that the network created is actually using the weights (the values in the transition matrix) to create communities?
I'm using the following code:
import community
import networkx as nx
G = nx.from_numpy_matrix(tm, parallel_edges=False, create_using=None)
# Relabel nodes
G = nx.relabel_nodes(G, {i: f"node_{i}" for i in G.nodes})
# Compute partition
partition = community.best_partition(G)
# Get a set of the communities
communities = set(partition.values())
# Create a dictionary mapping community number to nodes within that community
communities_dict = {c: [k for k, v in partition.items() if v == c] for c in communities}
communities_dict
and the output makes sense: 5 communities (0 to 4) grouping the 25 nodes (node_0 to node_24):
{0: ['node_0', 'node_1', 'node_14', 'node_17'],
1: ['node_5', 'node_10', 'node_11', 'node_12'],
2: ['node_2', 'node_20', 'node_24'],
3: ['node_3', 'node_8', 'node_9', 'node_15', 'node_18', 'node_22'],
4: ['node_4', 'node_6', 'node_7', 'node_13', 'node_16', 'node_19', 'node_21', 'node_23']}
I noticed a post here on Stack Overflow in which they used
partition = community.best_partition(G, weight='weights')
I tried it, and the results are awful: each node ends up in its own community (25 communities for 25 nodes).
My question is: is my first snippet correct? Or is the second one, which returns such poor communities, the right approach?
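For what it's worth, a quick way to check which attribute name the weights ended up under is to inspect the edge data directly. A minimal sketch, assuming G was built from tm with from_numpy_matrix as above (networkx stores the matrix entries under the edge attribute 'weight', singular, and best_partition also defaults to weight='weight'):
import networkx as nx
# from_numpy_matrix stores each matrix entry under the attribute 'weight'
print(list(G.edges(data=True))[:3])
print(len(nx.get_edge_attributes(G, 'weight')))   # non-zero: weights were stored here
print(len(nx.get_edge_attributes(G, 'weights')))  # 0: no such attribute exists
# best_partition defaults to weight='weight', so this is equivalent to the first snippet
partition = community.best_partition(G, weight='weight')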

Related

Extract graphs based on identifier and calculate network measures in igraph

I want to separately analyze groups within a network. For example, the UK faculty data in the igraphdata package has some network data with group information on the node level.
library(igraph)
library(igraphdata)
data("UKfaculty")
V(UKfaculty)$Group
I want to extract networks based on the 4 groups, run a few calculations on each extracted graph (density, average degree, diameter, clustering coefficient, etc.), and store this information per group in a dataframe. The measures should be computed only from the nodes within a group, not at the whole-network level (e.g. centrality based only on connections within group 1, ignoring connections to other groups).
Group density diameter
1 x x
2 x x
3 x x
Any idea how to efficiently do this?
You can use induced_subgraph to extract the subgraphs based on a list of vertices for every group.
library(igraph)
library(igraphdata)
data("UKfaculty")
ig <- UKfaculty
# `list` of vertices for every group
idx <- split(V(ig), V(ig)$Group)
# Create subgraphs based on the `list` of vertices
lst <- lapply(idx, function(v) induced_subgraph(ig, v))
It's then straightforward to calculate any subgraph-specific metrics, e.g.
do.call(rbind, lapply(lst, function(ig)
  data.frame(
    Group = unique(V(ig)$Group),
    density = edge_density(ig),
    diameter = diameter(ig))))
# Group density diameter
#1 1 0.3001894 21
#2 2 0.3561254 12
#3 3 0.2807018 14
#4 4 1.0000000 12
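To also cover the other measures mentioned in the question (average degree, clustering coefficient), the same pattern extends naturally; a sketch along the same lines, reusing lst from above:
do.call(rbind, lapply(lst, function(ig)
  data.frame(
    Group      = unique(V(ig)$Group),
    density    = edge_density(ig),
    avg_degree = mean(degree(ig)),
    clustering = transitivity(ig, type = "global"),
    diameter   = diameter(ig))))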

Compute distance-based Local Moran for a 527k+ point dataset using the spdep library

As the title says, I'm trying to compute Local Moran for a 527k-point dataset using the spdep package, creating neighborhoods based on distance. The general process I'm following is:
library(spdep)
# Convert coordinates to matrix
matrix_pts <- as.matrix(coordinates)
# Generate neighbors
neighbors <- dnearneigh(matrix_pts, d1 = 0, d2 = range)
# Get weight matrix
wm <- nb2listw(neighbors, zero.policy = TRUE, style = style)
# Get Moran statistics
moran_stat <- localmoran(value, wm, zero.policy = TRUE)
But I've run into a problem: I can't create the neighborhoods using dnearneigh, since the dataset is way too large and each neighborhood is composed of 200-1000 points.
I tried the solution depicted here, and I got a dataframe with a first column of IDs and a second column holding a list of the IDs of each point's neighbors, i.e.:
id int_ids
1: 239226 239226,242762,339386,444833,243000,240521,...
2: 242762 239226,242762,339386,444833,243000,240521,...
3: 339386 239226,242762,339386,444833,243000,240521,...
4: 444833 239226,242762,339386,444833,243000,240521,...
5: 243000 239226,242762,339386,444833,243000,240521,...
6: 240521 239226,242762,339386,444833,243000,240521,...
However, I don't know how to create the nb object required by nb2listw, and digging around hasn't helped me much.
Is there a way to transform this dataframe into an nb object? If so, would the weight matrix be as hard to create as the neighbors?
Is there another way to compute Local Moran for a dataset of this volume?
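Not a full answer, but on the nb question: an nb object is essentially a plain list of integer vectors (one per observation, holding the row indices of its neighbours in increasing order) with a class attribute, so it can in principle be assembled by hand. A rough sketch, assuming df is the table shown above, with df$id the point ids and df$int_ids a list column of neighbour ids:
# rough sketch: build an nb object by hand from the id/int_ids table
nb <- lapply(seq_len(nrow(df)), function(i) {
  # translate neighbour ids into row indices and drop the point itself
  nbrs <- setdiff(match(df$int_ids[[i]], df$id), i)
  if (length(nbrs) == 0) 0L else as.integer(sort(nbrs)) # 0L marks "no neighbours"
})
attr(nb, "region.id") <- as.character(df$id)
class(nb) <- "nb"
# nb should then be usable with nb2listw(nb, zero.policy = TRUE)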

Subtour elimination constraint in Pyomo

I am struggling to formulate the following subtour elimination constraint for a TSP-like problem in Pyomo, given a graph G(V, A) where node 1 is the depot:

sum_{i in S} sum_{j in V \ S} x_ij >= y_h,   for every subset S of V with 1 in S, and every h not in S with h != 1

where x_ij and y_h are binary variables that I have previously defined.
First, I created a dictionary, subsets_s, of all possible subsets S that contain node 1.
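(For concreteness, one way such a dictionary could be built; this is a hypothetical reconstruction, since the original construction of subsets_s is not shown, and it enumerates exponentially many subsets:)
from itertools import combinations
# hypothetical reconstruction of subsets_s: every subset of V that contains node 1
others = [v for v in model1.V if v != 1]
subsets_s = {}
for r in range(len(others) + 1):
    for comb in combinations(others, r):
        subsets_s[len(subsets_s)] = {'nodes_subset': {1, *comb}}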
Then, I have been trying with something like this, but I am running into errors:
model1.con3 = ConstraintList()
for h in model1.V:
    if h is not 1:
        for i in model1.S:
            if h not in subsets_s[i]['nodes_subset']:
                S = subsets_s[i]['nodes_subset']
                for v in S:
                    print(v)
                    model1.con3.add(sum(sum(model1.x[v, j]) for j in
                                        model1.V if j not in S) >= model1.y[h])
Do you have any suggestions?
Thank you
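For what it's worth, two things stand out in the snippet above: h is not 1 tests object identity rather than equality (use h != 1), and the inner sum(model1.x[v,j]) wraps a single variable rather than a generator, which is one source of the errors. A minimal sketch of how the intended constraint could be written, assuming the sets and variables defined above:
model1.con3 = ConstraintList()
for h in model1.V:
    if h != 1:  # value comparison, not identity
        for i in model1.S:
            S = subsets_s[i]['nodes_subset']
            if h not in S:
                # at least one arc must leave the subset S whenever y[h] is active
                model1.con3.add(
                    sum(model1.x[v, j] for v in S for j in model1.V if j not in S)
                    >= model1.y[h])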

Louvain community detection in R using igraph - format of edges and vertices

I have a correlation matrix of scores that I would like to run community detection on using the Louvain method in igraph, in R. I converted the correlation matrix to a distance matrix using cor2dist, as below:
distancematrix <- cor2dist(correlationmatrix)
This gives a 400 x 400 matrix of distances ranging from 0 to 2. I then made the list of edges (the distances) and vertices (each of the 400 individuals) using the method below from http://kateto.net/networks-r-igraph (section 3.1).
library(igraph)
test <- as.matrix(distancematrix)
mode(test) <- "numeric"
test2 <- graph.adjacency(test, mode = "undirected", weighted = TRUE, diag = TRUE)
E(test2)$weight
get.edgelist(test2)
From this I then wrote csv files of the 'from' and 'to' edge list, and corresponding weights:
edgeweights <-E(test2)$weight
write.csv(edgeweights, file = "edgeweights.csv")
fromtolist <- get.edgelist(test2)
write.csv(fromtolist, file = "fromtolist.csv")
From these two files I produced a .csv file called "nodes.csv" which simply had all the vertex IDs for the 400 individuals:
id
1
2
3
4
...
400
And a .csv file called "edges.csv", which detailed 'from' and 'to' between each node, and provided the weight (i.e. the distance measure) for each of these edges:
from to weight
1 2 0.99
1 3 1.20
1 4 1.48
...
399 400 0.70
I then tried to use this node and edge list to create an igraph object, and run louvain clustering in the following way:
nodes <- read.csv("nodes.csv", header = TRUE, as.is = TRUE)
edges <- read.csv("edges.csv", header = TRUE, as.is = TRUE)
clustergraph <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)
clusterlouvain <- cluster_louvain(clustergraph)
Unfortunately this did not perform the Louvain community detection correctly. I expected it to return around 2-4 different communities, which could be plotted similarly to here, but sizes(clusterlouvain) returned:
Community sizes
1
400
indicating that all individuals were sorted into the same community. The clustering also ran immediately (i.e. with almost no computation time), which also makes me think it was not working correctly.
My question is: Can anyone suggest why the cluster_louvain method did not work and identified just one community? I think I must be specifying the distance matrix or edges/nodes incorrectly, or in some other way not giving the correct input to the cluster_louvain method. I am relatively new to R so would be very grateful for any advice. I have successfully used other methods of community detection on the same distance matrix (i.e. k-means) which identified 2-3 communities, but would like to understand what I have done wrong here.
I'm aware there are multiple other queries about using igraph in R, but I have not found one which explicitly specifies the input format of the edges and nodes (from a correlation matrix) to get the louvain community detection working correctly.
Thank you for any advice! I can provide further information if helpful.
I believe that cluster_louvain did exactly what it should do with your data. The problem is your graph. Your code included the line get.edgelist(test2), which must have produced a lot of output. Instead, try this:
vcount(test2)
ecount(test2)
Since you say that your correlation matrix is 400 x 400, I expect that vcount gives 400 and ecount gives 79800 = 400 * 399 / 2. As you have constructed it, every node is directly connected to all other nodes, so of course there is only one big community.
I suspect that what you are trying to do is group variables that are correlated.
If the correlation is near zero, the variables should be unconnected. What seems less clear is what to do with variables with correlation near -1. Do you want them to be connected or not? We can do it either way.
You do not provide any data, so I will illustrate with the Ionosphere data from the mlbench package. I will try to mimic your code pretty closely, but will change a few variable names. Also, for my purposes, it makes no sense to write the edges to a file and then read them back again, so I will just directly use the edges that are constructed.
First, assuming that you want variables with correlation near -1 to be connected.
library(igraph)
library(mlbench) # for Ionosphere data
library(psych) # for cor2dist
data(Ionosphere)
correlationmatrix = cor(Ionosphere[, which(sapply(Ionosphere, class) == 'numeric')])
distancematrix <- cor2dist(correlationmatrix)
DM1 <- as.matrix(distancematrix)
## Zero out connections where there is low (absolute) correlation
## Keeps connection for cor ~ -1
## You may wish to choose a different threshold
DM1[abs(correlationmatrix) < 0.33] = 0
G1 <- graph.adjacency(DM1, mode = "undirected", weighted = TRUE, diag = TRUE)
vcount(G1)
[1] 32
ecount(G1)
[1] 140
Not a fully connected graph! Now let's find the communities.
clusterlouvain <- cluster_louvain(G1)
plot(G1, vertex.color=rainbow(3, alpha=0.6)[clusterlouvain$membership])
If instead you do not want variables with negative correlation to be connected, just get rid of the absolute value above. The resulting graph should be much less connected:
DM2 <- as.matrix(distancematrix)
## Zero out connections where there is low correlation
DM2[correlationmatrix < 0.33] = 0
G2 <- graph.adjacency(DM2, mode = "undirected", weighted = TRUE, diag = TRUE)
clusterlouvain <- cluster_louvain(G2)
plot(G2, vertex.color=rainbow(4, alpha=0.6)[clusterlouvain$membership])
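As a quick sanity check (mirroring the sizes() call from the question), the thresholded graph should now yield more than one community:
sizes(clusterlouvain)       # community sizes, no longer a single block of all nodes
membership(clusterlouvain)  # community assignment for each variable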

Global multi-optimization function specification in R

I would like to use nsga2 from the mco package to solve an optimization problem with 3 objectives. In short, I am looking for optimal land uses to solve an environmental problem.
Here is my experiment:
- 100 land uses are possible in total (all.options in the code below), each land use being characterized by three performances (main.goal1, main.goal2 and main.goal3).
- I have 50 fields, whose characteristics (soil in fields.Kq) subset the 100 land uses (i.e., not all land uses are possible for each field) => options.soil1 and options.soil2
My objective is to assign a land use to each of my 50 fields in order to jointly minimize main.goal1, main.goal2 and main.goal3. From what I have read, genetic algorithms are very powerful for this type of problem.
So here are my virtual data.
set.seed(0)
all.options <- data.frame(num.option = 1:100,
                          main.goal1 = abs(rnorm(100)),
                          main.goal2 = abs(rnorm(100)),
                          main.goal3 = abs(rnorm(100))) # all possible combinations of the 3 goals
options.soil1 <- subset(all.options, main.goal1 > 0.5) # possible combinations for soil1
options.soil2 <- subset(all.options, main.goal3 < 0.5) # possible combinations for soil2
fields.Kq <- data.frame(num.field = 1:50, soil = round(runif(50, 0, 1), 0))
I guess that my objective function should look something like this:
my.function <- function(x) {
  x[1] <- sum(A[, 1]) # main.goal1 for the selected option of each field in fields.Kq
  x[2] <- sum(A[, 2]) # main.goal2 for the selected option of each field in fields.Kq
  x[3] <- sum(A[, 3]) # main.goal3 for the selected option of each field in fields.Kq
} # where A should be a matrix with one row per field, holding its "chosen" land-use option
nsga2(my.function)
Unfortunately I could not get any further, as I am new to optimization in R. How should I build the matrix A, with the chosen land use for each field?
And using nsga2, how can I recover these land uses, together with the optimized (minimized) values of main.goal1, main.goal2 and main.goal3?
Thanks in advance for all the help you can provide; I am really looking forward to advice/links/books... to advance on my optimization problem.
Best regards,
LH
Here is how I solved the problem:
library("mco")
set.seed(0)
all.options<-data.frame(num.option=1:100,main.goal1 = abs(rnorm(100)),
main.goal2 = abs(rnorm(100)),
main.goal3 = abs(rnorm(100)),soil=c(rep("soilType1",50),rep("soilType2",50))) # all possible combinations of the 3 goals
fields.Kq<-data.frame(num.field=1:50,soil=rep(c("soilType1","soilType2"),25))
main.goal1=function(x) # x - a vector
{
main.goal1=sum(all.options[x,1]) # compute main.goal1
return(main.goal1) }
main.goal2=function(x) # x - a vector
{
main.goal2=sum(all.options[x,2]) # compute main.goal2
return(main.goal2) }
main.goal3=function(x) # x - a vector
{
main.goal3=sum(all.options[x,3]) # compute main.goal3
return(main.goal3) }
eval=function(x) c(main.goal1(x),main.goal2(x),main.goal3(x)) #objectivefunction
D<-length(fields.Kq[,1]) # number of fields
D2<-length(fields.Kq[,1])/2 # number of fields per type (simplified)
D.soil1<-max(which(all.options$soil=="soilType1")) # get boundary for bound soil1
D.soil2<-min(which(all.options$soil=="soilType2")) # get boundary for bound soil2
G=nsga2(fn=eval,idim=D,odim=3,
lower.bounds=c(rep(1,D2),rep(D.soil2,D2)),upper.bounds=c(rep(D.soil1,D2),rep(100,D2)), # lower/upper bound: min/max num option
popsize=20,generations=1:1000, cprob = 0.7, cdist = 5,
mprob = 0.2, mdist = 10)
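To recover the chosen land uses and the minimized objective values (a sketch; note that because generations was passed as the vector 1:1000, nsga2 returns a list of populations, one per generation):
res <- G[[length(G)]] # final generation; with a scalar generations, use G directly
sol <- res$par[res$pareto.optimal, , drop = FALSE]   # land-use choice per field
obj <- res$value[res$pareto.optimal, , drop = FALSE] # minimized main.goal1..3
head(floor(sol)) # decision variables are real-valued; truncate to get option numbers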
I defined it thanks to examples found in the very helpful and informative book "Modern Optimization with R" by Paulo Cortez.
LH
