Normalizing graph edit distance to [0,1] (networkx) - graph

I want to have a normalized graph edit distance.
I'm using this function:
https://networkx.github.io/documentation/stable/reference/algorithms/generated/networkx.algorithms.similarity.graph_edit_distance.html#networkx.algorithms.similarity.graph_edit_distance
I'm trying to understand the graph_edit_distance function in order to normalize it between 0 and 1, but I don't fully understand it.
For example:
def compare_graphs(Ga, Gb):
    draw_graph(Ga)
    draw_graph(Gb)
    graph_edit_distance = nx.graph_edit_distance(Ga, Gb, node_match=lambda x, y: x == y)
    return graph_edit_distance

compare_graphs(G1, G3)
Why is the graph_edit_distance = 4?
Graph construction:
e1 = [(1,2), (2,3)]
e3 = [(1,3), (3,1)]
G1 = nx.DiGraph()
G1.add_edges_from(e1)
G3 = nx.DiGraph()
G3.add_edges_from(e3)
The edit distance is measured by:
nx.graph_edit_distance(Ga, Gb, node_match=lambda x,y : x==y)
The difference from graph_edit_distance is that it relates to node indexes.
This is the output of optimize_edit_paths:
list(optimize_edit_paths(G1, G2, node_match, edge_match,
node_subst_cost, node_del_cost, node_ins_cost,
edge_subst_cost, edge_del_cost, edge_ins_cost,
upper_bound, True))
Out[3]:
[([(1, 1), (2, None), (3, 3)],
[((1, 2), None), ((2, 3), None), (None, (1, 3)), (None, (3, 1))],
5.0),
([(1, 1), (2, 3), (3, None)],
[((1, 2), (1, 3)), (None, (3, 1)), ((2, 3), None)],
4.0)]
I know it should be the minimum sequence of node and edge edit operations transforming graph G1 to graph isomorphic to G2.
When I try to count, I get:
1. Add node 2 to G3,
2. Cancel e1=(1,3) from G3
3. Cancel e2=(3,1) from G3
4. Add e3 = (1,2) to G3
5. Add e4 = (2,3) to G3
graph_edit_distance = 5.
What am I missing?
Or alternatively, what can I do in order to normalize the distance I receive?
I thought about dividing by |V1| + |V2| + |E1| + |E2|, or dividing by max(|V1| + |E1|, |V2| + |E2|)) but I'm not sure.
Thanks in advance.

I know it's an old post, but I am currently reading about GED and wanted to answer it for anyone looking for this in the future.
Graph edit distance is 4
Reason:
Nodes 1 and 3 are connected by an undirected edge (an edge in each direction), while nodes 1 and 2 are connected by a directed edge. The graph edit path is then:
1. Turn the undirected edge into a directed edge
2. Change 3 to 2 (substitution)
3. Add an edge to 2
4. Finally, add a node to that edge

The Graph Edit Distance is unbounded. The more nodes you add to one graph, that the other graph doesn't have, the more edits you need to make them the same. So, the GED can be arbitrarily large.
I haven't seen this proposed anywhere else, but I have an idea:
Instead of GED(G1,G2), you can compute GED(G1,G2)/[GED(G1,G0) + GED(G2,G0)],
where G0 is the empty graph.
The situation is analogous to the difference between real numbers.
Imagine that I give you |A-B| and |C-D|. They are not on the same footing.
E.g., you could have A=1, B=2 and C=1000, D=1001.
The differences are equal, but the relative differences are very different.
To convey that, you would compute |A-B|/(|A|+|B|) instead of just |A-B|.
This is symmetric under swapping A and B, and it's a relative distance.
Since it's relative, it can be compared to the other relative distance: |C-D|/(|C|+|D|). These relative distances are comparable, they're expressing a notion that is universal and applies to all pairs of numbers.
In summary, compute the relative GED, using G0, the null graph, like you would use 0 if you were measuring the relative distance between real numbers.
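The relative GED described above can be sketched in a few lines of Python. This is only an illustration under one stated assumption: with unit insertion and deletion costs (the networkx defaults), turning a graph into the empty graph G0 costs one edit per node plus one per edge, so GED(G, G0) = |V| + |E|. The function name relative_ged and its parameters are hypothetical, not part of networkx.

```python
def relative_ged(ged, n_nodes_1, n_edges_1, n_nodes_2, n_edges_2):
    """Relative GED: GED(G1, G2) / (GED(G1, G0) + GED(G2, G0)).

    Assumption: unit costs, so the distance from a graph to the
    empty graph G0 is simply its node count plus its edge count.
    """
    to_empty_1 = n_nodes_1 + n_edges_1  # GED(G1, G0) under unit costs
    to_empty_2 = n_nodes_2 + n_edges_2  # GED(G2, G0) under unit costs
    denominator = to_empty_1 + to_empty_2
    if denominator == 0:
        # Both graphs are empty, so they are identical.
        return 0.0
    return ged / denominator


# Numbers from the question: G1 has 3 nodes and 2 edges,
# G3 has 2 nodes and 2 edges, and the reported distance is 4.
print(relative_ged(4, 3, 2, 2, 2))
```

Under that assumption the denominator is |V1| + |E1| + |V2| + |E2|, the first normalization suggested in the question. Since deleting all of G1 and then inserting all of G2 is always a valid edit path, the ratio should stay within [0, 1]. With networkx, the inputs could come from nx.graph_edit_distance(G1, G2), G.number_of_nodes() and G.number_of_edges().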

Related

Change edge size in igraph

I want to plot a simple star graph in which the size of the edges depends on a score representing a difference of perception between the central node (e.g., a leader) and the other nodes (e.g., its employees).
I succeeded in modifying the colors, the size of the nodes, and the width of the edges, but not the size of the latter.
How would you do this?
library(igraph)
nodes <- read.csv("exemple_nodes.csv", header=T, as.is=T)
links <- read.csv("exemple_edges.csv", header=T, as.is=T)
st <- graph_from_data_frame(d=links, vertices=nodes, directed=T)
plot(st, vertex.color=V(st)$perception.type)
You can use the ggraph package with one of the geom_edge_* functions (e.g., geom_edge_arc, geom_edge_diagonal) and map the edge_width aesthetic to a numeric value associated with each edge in the edge list (called "value" below). For example:
ggraph::ggraph(st) +
  ggraph::geom_edge_diagonal(ggplot2::aes(edge_width = as.numeric(value)))
In addition, ggraph allows you to specify other edge parameters inside the geom_edge_* function, for example edge_alpha = as.numeric(value).
I think that what you want is to position the vertices so that you can control the length of the edges. If that is not what you want, then please explain what you mean by the "size" of the edges.
You do not provide your data so that we cannot use exactly your graph. I will use a generic star graph as an example. In order to control the placement of the vertices, you need to use the layout parameter. The basic function layout_as_star will place the first vertex at the center and the other vertices equally spaced around it at the same distance. Because this layout function places the center vertex at (0,0) and the remaining nodes on a unit circle around the center, it is easy to adjust it so that the distance of the outer vertices is controlled by a parameter. Just multiply the coordinates by the parameter and it will proportionally change the distance. I just make something up for the distances, but you can use your parameter.
## Make up perception parameter
set.seed(271828)
Perception = sample(4, 9, replace=T)
Perception
[1] 2 3 4 4 1 4 2 2 1
Now there is one weight for every outer vertex, but we need a weight for the central vertex. We don't want it to move so we use a weight of 1.
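Weight = c(1, Perception)
## S10 is not defined in the original; assumed here to be a 10-vertex star
S10 = make_star(10, mode="undirected")
LO = layout_as_star(S10)
LO = LO*Weight
plot(S10, layout=LO)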
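The geometry behind this trick can be illustrated outside of igraph. The sketch below is in Python rather than R and does not call igraph; the helper names star_layout and scale_layout are hypothetical, and the angular ordering of the outer vertices is an assumption that may differ from what layout_as_star produces. It only shows that multiplying unit-circle coordinates by a weight sets each outer vertex's distance from the center.

```python
import math


def star_layout(n_outer):
    """Center at (0, 0), outer vertices evenly spaced on the unit circle."""
    coords = [(0.0, 0.0)]
    for k in range(n_outer):
        angle = 2 * math.pi * k / n_outer
        coords.append((math.cos(angle), math.sin(angle)))
    return coords


def scale_layout(coords, weights):
    """Multiply each vertex's coordinates by its own weight."""
    return [(x * w, y * w) for (x, y), w in zip(coords, weights)]


perception = [2, 3, 4, 4, 1, 4, 2, 2, 1]  # values from the answer above
weights = [1] + perception                # weight 1 for the center vertex
layout = scale_layout(star_layout(len(perception)), weights)

# Each outer vertex now sits at a distance equal to its weight.
for (x, y), w in zip(layout[1:], perception):
    print(round(math.hypot(x, y), 6), w)
```

The center stays at the origin whatever its weight is, because its coordinates are (0, 0) before scaling.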

Identification of most frequent observation in numeric vector using minimal observation ranges

Problem:
Say I have a numeric vector x of observations of some distance (in R). For instance it could be the throwing length of a number of people.
x <- c(3,3,3,7,7,7,7,8,8,12,15,15,15,20,30)
h <- hist(x, breaks = 30, xlim = c(1,30))
I then want to define a set S of "selectors" (ranges) that select as much of my observations as possible and at the same time span as little distance as possible (the cost is the sum of ranges in S). Each selector si range must be at least 3 (its resolution).
Example:
In the toy data x I could put the first selector s1 at [6;8], which will select 4+2 observations (distances 7 and 8), use 3 distances and select 6/15 observations in total ([7;9] would give the same, but for simplicity I put the selector midpoint at the max frequency). Next would be adding s2 at [14;16] (6 distances, selecting 9/15). In summary, S would be built along these steps:
[6;8] (3, 6/15) #s1
[6;8], [14;16] (6, 9/15) #s2
[3;8], [14;16] (9, 12/15) #Extending s1 (cheapest)
[3;8], [12;16] (11, 13/15) #Extending s2
[3;8], [12;16], [29;31], (14, 14/15) #s3
[3;8], [12;20], [29;31], (18, 15/15) #Extending s2
One would stop the iterations when a certain total distance is used (sum of S) or when a certain fraction of data is covered by S. Or plot the sum of S against fraction of data covered and decide from that.
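To make the bookkeeping in the steps above concrete, here is a minimal sketch in Python (not R) that scores a given set of selectors. It does not search for the best S; it only computes the two numbers shown in parentheses in each step. It assumes integer distances, inclusive range endpoints, and non-overlapping selectors; the function name evaluate_selectors is hypothetical.

```python
def evaluate_selectors(x, selectors):
    """Return (total distance used, number of observations covered).

    Assumptions: integer distances, inclusive [lo, hi] ranges,
    and selectors that do not overlap each other.
    """
    cost = sum(hi - lo + 1 for lo, hi in selectors)
    covered = sum(
        1 for value in x
        if any(lo <= value <= hi for lo, hi in selectors)
    )
    return cost, covered


x = [3, 3, 3, 7, 7, 7, 7, 8, 8, 12, 15, 15, 15, 20, 30]

steps = [
    [(6, 8)],
    [(6, 8), (14, 16)],
    [(3, 8), (14, 16)],
    [(3, 8), (12, 16)],
    [(3, 8), (12, 16), (29, 31)],
    [(3, 8), (12, 20), (29, 31)],
]
for selectors in steps:
    cost, covered = evaluate_selectors(x, selectors)
    print(selectors, cost, f"{covered}/{len(x)}")
```

A greedy search could call such a scoring function on each candidate extension or new selector and keep the one with the best gain per unit of added distance, though whether greedy steps are optimal here is not established by the question.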
For very large data (100,000s of clustered observations in 1,000,000s of distance space) I could probably be more sloppy by increasing the minimum step allowed (above it is 1, maybe try 100) and decreasing the resolution (above it's 3, one could try maybe 1000).
Since it's equivalent to maximizing the area under density(x) while minimizing the ranges of x, my intuition is that one could approximate the steps described (for time and memory considerations) using density() and optim(). Maybe it's even a well-known maximization/minimization problem.
Any suggestions that could get me started would be very appreciated.

How to calculate minimum spanning tree in R

Given a graph of N vertices, with the distances from each vertex to the others stored in tuples T1 = (d11, d12, …, d1n) to Tn = (dn1, dn2, …, dnn), find a minimum spanning tree of this graph starting from vertex V1. Also, print the total distance needed to travel this generated tree.
Example:
For N =5
T1 = (0, 4, 5, 7, 5)
T2 = (4, 0, 6, 2, 5)
T3 = (5, 6, 0, 2, 1)
T4 = (7, 2, 2, 0, 5)
T5 = (5, 5, 1, 5, 0)
Selection of edges according to minimum distance are:
V1 -> V2 = 4
V2 -> V4 = 2
V4 -> V3 = 2
V3 -> V5 = 1
Thus, MST is V1 -> V2 -> V4 -> V3 -> V5 and the distance travelled is 9 (4+2+2+1)
I literally have no idea how to create a graph of n vertices in R.
I searched on Google, but I didn't understand how to approach the above problem.
Please help me.
Your question doesn't seem to match the title - you're after the graph creation, not the MST? Once you've got a graph, as @user20650 says, the MST itself is easy.
It is easy to create a graph of size n, but there is a whole lot of complexity about which nodes are connected and their weights (distances) that you don't tell us about, so this is a really basic illustration.
If we assume that all nodes are connected to all other nodes (full graph), we can use make_full_graph. If that isn't the case, you either need data to say which nodes are connected or use a random graph.
# create graph
n <- 5
g <- make_full_graph(n)
The next issue is the distances. You haven't given us any information on how those distances are distributed, but we can demonstrate assigning them to the graph. Here, I'll just use random uniform [0-1] numbers:
# number of edges in an (undirected) full graph is (n^2 - n) / 2 but
# it is easier to just ask the graph how many edges it has - this
# is more portable if you change from make_full_graph
n_edge <- gsize(g)
g <- set_edge_attr(g, 'weight', value=runif(n_edge))
plot(g)
The next bit is just the MST itself, using minimum.spanning.tree:
mst <- minimum.spanning.tree(g)
The output mst looks like this:
IGRAPH dc21276 U-W- 5 4 -- Full graph
+ attr: name (g/c), loops (g/l), weight (e/n)
+ edges from dc21276:
[1] 1--4 1--5 2--3 2--5
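The worked example from the question (N = 5, tuples T1 to T5) can also be checked independently of igraph. Below is a minimal sketch of Prim's algorithm in Python (not R), starting from V1 and using the distance matrix exactly as given in the question. The function name prim_mst is hypothetical, and a complete graph with a symmetric matrix is assumed.

```python
import heapq


def prim_mst(dist, start=0):
    """Prim's algorithm on a full symmetric distance matrix.

    Returns (edges, total), where edges are (from, to, weight)
    with 1-based vertex labels in the order they were selected.
    """
    n = len(dist)
    visited = [False] * n
    visited[start] = True
    # Candidate edges leaving the tree: (weight, from, to).
    heap = [(dist[start][v], start, v) for v in range(n) if v != start]
    heapq.heapify(heap)
    edges, total = [], 0
    while heap and len(edges) < n - 1:
        weight, u, v = heapq.heappop(heap)
        if visited[v]:
            continue  # this edge would close a cycle
        visited[v] = True
        edges.append((u + 1, v + 1, weight))
        total += weight
        for w in range(n):
            if not visited[w]:
                heapq.heappush(heap, (dist[v][w], v, w))
    return edges, total


T = [
    (0, 4, 5, 7, 5),
    (4, 0, 6, 2, 5),
    (5, 6, 0, 2, 1),
    (7, 2, 2, 0, 5),
    (5, 5, 1, 5, 0),
]
edges, total = prim_mst(T)
for u, v, w in edges:
    print(f"V{u} -> V{v} = {w}")
print("total:", total)
```

On this matrix the sketch selects V1 -> V2, V2 -> V4, V4 -> V3, V3 -> V5 for a total of 9, matching the selection listed in the question.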

Checking validity of topological sort

Given the following directed graph:
I determined the topological sort to be 0, 1, 2, 3, 7, 6, 5, 4 with the values for each node being:
d[0] = 1
f[0] = 16
d[1] = 2
f[1] = 15
d[2] = 3
f[2] = 14
d[3] = 4
f[3] = 13
d[4] = 7
f[4] = 8
d[5] = 6
f[5] = 9
d[6] = 5
f[6] = 10
d[7] = 11
f[7] = 12
Where d is discovery-time and f is finishing-time.
How can I check whether the topological sort is valid or not?
With python and networkx, you can check it as follows:
import networkx as nx
G = nx.DiGraph()
G.add_edges_from([(0, 2), (1, 2), (2, 3)])
all_topological_sorts = list(nx.algorithms.dag.all_topological_sorts(G))
print([0, 1, 2, 3] in all_topological_sorts) # True
print([2, 3, 1, 0] in all_topological_sorts) # False
However, note that in order to have a topological ordering, the graph must be a Directed Acyclic Graph (DAG). If G is not directed, NetworkXNotImplemented will be raised. If G is not acyclic (as in your case) NetworkXUnfeasible will be raised.
See documentation here.
If you want a less coding approach to this question (since it looks like your original topological ordering was generated without code), you can go back to the definition of a topological sort. Paraphrased from Emory University:
Topological ordering of nodes = an ordering (label) of the nodes/vertices such that for every edge (u,v) in G, u appears earlier than v in the ordering.
There are two ways you could approach this question: from an edge perspective or from a vertex perspective. I describe a naive implementation of both below (naive meaning that with some additional space and cleverness, you could improve on them).
Edge approach
Iterate through the edges in G. For each edge, retrieve the index of each of its vertices in the ordering. Compare the indices. If the origin vertex isn't earlier than the destination vertex, return false. If you iterate through all of the edges without returning false, return true.
Complexity: O(E*V)
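The edge approach can be sketched in plain Python as follows. This version uses the "additional space" improvement mentioned above: it builds a dictionary from vertex to position once, so each index lookup is O(1) and the whole check runs in O(V + E) instead of O(E*V). The function name is_topological_order is hypothetical, and the edge list is the small example from the networkx answer, not the graph from the question.

```python
def is_topological_order(order, edges):
    """Check that every edge (u, v) has u before v in `order`."""
    position = {vertex: index for index, vertex in enumerate(order)}
    if len(position) != len(order):
        return False  # a vertex appears more than once
    for u, v in edges:
        if u not in position or v not in position:
            return False  # the ordering misses an endpoint
        if position[u] >= position[v]:
            return False  # origin is not earlier than destination
    return True


edges = [(0, 2), (1, 2), (2, 3)]
print(is_topological_order([0, 1, 2, 3], edges))
print(is_topological_order([2, 3, 1, 0], edges))
```

Note that this only sees vertices that appear in some edge; an isolated vertex missing from the ordering would not be detected unless the full vertex set is checked separately.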
Vertex approach
Iterate through the vertices in your ordering. For each vertex, retrieve its list of outgoing edges. If any of those edges end in a vertex that precedes the current vertex in the ordering, return false. If you iterate through all the vertices without returning false, return true.
Complexity: O(V^2*E)
First, do a graph traversal to get the incoming degree of each vertex. Then walk through the vertices in the order given by your list. For each vertex, check whether its incoming degree is 0. If it is not, some predecessor has not been placed yet, so this is not a valid topological order. Otherwise, decrement the incoming degree of all its neighbors, as if we cut its outgoing edges, and continue. (The vertex does not need to be a neighbor of the previous vertex; two consecutive vertices in a valid order may be unconnected.) If every vertex passes the check, the order is valid. This takes O(V + E) time.
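A minimal Python sketch of this in-degree check is given below. It relies only on the in-degree test, which is sufficient on its own: a vertex may be placed exactly when all of its incoming edges have already been "cut". The function name is_valid_by_indegree is hypothetical, and the example graph is the small one from the networkx answer.

```python
from collections import defaultdict


def is_valid_by_indegree(order, vertices, edges):
    """Validate a topological order by simulating edge removal."""
    indegree = dict.fromkeys(vertices, 0)
    successors = defaultdict(list)
    for u, v in edges:
        indegree.setdefault(u, 0)
        indegree[v] = indegree.get(v, 0) + 1
        successors[u].append(v)
    # The order must list every vertex exactly once.
    if len(order) != len(indegree) or set(order) != set(indegree):
        return False
    for vertex in order:
        if indegree[vertex] != 0:
            return False  # a predecessor has not been placed yet
        for neighbor in successors[vertex]:
            indegree[neighbor] -= 1  # cut the outgoing edge
    return True


vertices = range(4)
edges = [(0, 2), (1, 2), (2, 3)]
print(is_valid_by_indegree([0, 1, 2, 3], vertices, edges))
print(is_valid_by_indegree([0, 2, 1, 3], vertices, edges))
```

If the graph contains a cycle, no order can pass, because at least one vertex on the cycle never reaches in-degree 0.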

Draw Network in R (control edge thickness plus non-overlapping edges)

I need to draw a network with 5 nodes and 20 directed edges (an edge connecting each 2 nodes) using R, but I need two features to exist:
To be able to control the thickness of each edge.
The edges must not overlap (i.e., the edge from A to B is not drawn over the edge from B to A).
I've spent hours looking for a solution, and tried many packages, but there's always a problem.
Can anybody suggest a solution please and provide a complete example as possible?
Many Thanks in advance.
If it is ok for the lines to be curved then I know two ways. First I create an edgelist:
Edges <- data.frame(
from = rep(1:5,each=5),
to = rep(1:5,times=5),
thickness = abs(rnorm(25)))
Edges <- subset(Edges,from!=to)
This contains the node of origin in the first column, the node of destination in the second, and the weight in the third. You can use my package qgraph to plot a weighted graph using this. By default the edges are curved if there are multiple edges between two nodes:
library("qgraph")
qgraph(Edges,esize=5,gray=TRUE)
However this package is not really intended for this purpose and you can't change the edge colors (yet, working on it:) ). You can only make all edges black with a small trick:
qgraph(Edges,esize=5,gray=TRUE,minimum=0,cut=.Machine$double.xmin)
For more control you can use the igraph package. First we make the graph:
library("igraph")
g <- graph.edgelist(as.matrix(Edges[,-3]))
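Note the conversion to matrix. Older versions of igraph numbered vertices from 0, which required subtracting one from the node ids; current versions number vertices from 1, so no subtraction is needed here. Next we define the layout: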
l <- layout.fruchterman.reingold(g)
Now we can change some of the edge parameters with the E()function:
# Define edge widths:
E(g)$width <- Edges$thickness * 5
# Define arrow widths:
E(g)$arrow.width <- Edges$thickness * 5
# Make edges curved:
E(g)$curved <- 0.2
And finally plot the graph:
plot(g,layout=l)
While not an R answer specifically, I would recommend using Cytoscape to generate the network.
You can automate it using RCytoscape.
http://bioconductor.org/packages/release/bioc/html/RCytoscape.html
The package informatively named 'network' can draw directed networks fairly well, and handle your issues.
ex.net <- rbind(c(0, 1, 1, 1), c(1, 0, 0, 1), c(0, 0, 0, 1), c(1, 0, 1, 0))
plot(network(ex.net), usecurve = T, edge.curve = 0.00001,
edge.lwd = c(4, rep(1, 7)))
The edge.curve argument, if set very low and combined with usecurve=T, separates the edges, although there might be a more direct way of doing this, and edge.lwd can take a vector as its argument for different sizes.
It's not always the prettiest result, I admit. But it's fairly easy to get decent looking network plots that can be customized in a number of different ways (see ?network.plot).
The 'non overlapping' constraint on edges is the big problem here. First, your network has to be 'planar', otherwise it's impossible in 2 dimensions (you can't connect three houses to gas, electric, and phone company buildings without crossovers).
I think an algorithm for planar graph layout essentially solves the 4-colour problem. Have fun with that. Heuristics exist, search for planar graph layout, and force-directed, and read Planar Graph Layouts
