Any ideas about the variation of the diameter of a network/graph as the number of nodes increases?

My question concerns the increase/decrease of the diameter of a network. I'm thinking that as one adds more nodes to an existing network, the density should effectively increase, and the edges created by the new nodes could result in a higher degree of clustering. If this is the case, my assumption is that the diameter of the network should decrease as we add more nodes, owing to the probability that shorter geodesic paths can now exist and become the new diameter. Am I wrong with this logic? Or is there a better explanation, or perhaps something I'm missing?

Work by Leskovec, Kleinberg, and Faloutsos has examined this question specifically [1,2]. They find:
"First, graphs densify over time, with the number of edges growing super-linearly in the number of nodes. Second, the average distance between nodes often shrinks over time, in contrast to the conventional wisdom that such distance parameters should increase slowly as a function of the number of nodes."

Related

Why do nodes (vertices) in peripheral positions have higher betweenness centrality scores in the igraph network visualization?

I calculated the betweenness centrality for a matrix using the 'igraph' package and obtained the scores. After plotting the network, I found that nodes (vertices) in peripheral positions of the network have higher betweenness centrality scores than the more centrally positioned nodes. Since betweenness centrality is defined as "the number of geodesics (shortest paths) going through a vertex or an edge", shouldn't more central nodes have higher betweenness centrality? The scores I am getting here, with higher centrality scores in the peripheral positions of the network, do not fit the definition or the other betweenness centrality plots I have seen. Do you know what's happening here? The original matrix used to create the network is shared on GitHub here (https://github.com/evaliu0077/network.matrix.git). My code for plotting the network and the resulting network visualization are both attached.
matrix <- read.csv("matrix.csv")
matrix <- as.matrix(matrix)
network <- graph_from_adjacency_matrix(matrix, weighted=TRUE, mode="undirected", diag=FALSE)
network <- delete.edges(network, which(E(network)$weight <= .1)) # drop the negative/weak correlations before plotting
set.seed(10)
l <- layout.fruchterman.reingold(network)
plot.igraph(network, layout=l,
            vertex.size=betweenness(network),
            edge.width=E(network)$weight*2, # rescaled by 2
            edge.color=ifelse(E(network)$weight>0.25, "blue", "red"),
            main="Betweenness centrality for the sample")
Thank you!
Pay attention to the meaning of edge weights before you use them.
In the context of betweenness centrality, edge "weights" are interpreted as "lengths", and are used for determining shortest paths. The length of a path is the sum of the weights/lengths of edges along the path. Higher "length" values indicate a weaker link, not a stronger one.
Are your weight values suitable for this use? Does it make sense to add them up along a path? If they are correlations, then I would say no. You could transform them so that weaker links have higher lengths, for example by inverting the values. You will sometimes see this in the literature, but it is a rather dubious practice. It still does not make much sense to add up inverse correlation values.
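To make the "weights as lengths" point concrete, here is a sketch using networkx (an assumed substitute for igraph, with made-up correlation values), applying the inversion discussed above; note the inversion itself remains a debatable choice:

```python
import networkx as nx

G = nx.Graph()
# Toy correlation weights (assumed data, not the question's matrix)
G.add_edge("a", "b", weight=0.9)
G.add_edge("b", "c", weight=0.8)
G.add_edge("a", "c", weight=0.2)

# Interpret a strong correlation as a SHORT distance by inverting it,
# and store it separately so the original weights stay untouched.
for u, v, d in G.edges(data=True):
    d["distance"] = 1.0 / d["weight"]

# Point the betweenness computation at the distance attribute.
bc = nx.betweenness_centrality(G, weight="distance", normalized=False)
print(bc)
```

Here the a-to-c shortest path runs through b (length about 2.36 versus 5.0 for the direct edge), so b is the only node with nonzero betweenness; using the raw correlations as lengths would have given the opposite picture.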
Similarly, check whether the layout function you are calling makes use of weights, and if so, in what way. First, your graph is almost complete, so with layout methods that do not use weights, the vertex positions are essentially meaningless. Generally, be careful about reading too much into any kind of network visualization unless there are very obvious effects (such as an indisputable community structure). Here you use igraph's Fruchterman-Reingold layout algorithm, which happens to draw vertices connected by a high-weight edge closer to each other, not further apart. Thus, it interprets weights in exactly the opposite way to the betweenness calculation: high weight indicates a "strong" connection. Some other layout algorithms, such as Kamada-Kawai, interpret high weights (lengths) as weak (long) connections. Yet other layout algorithms ignore weights completely. It's good to keep this in mind when trying to interpret a network visualization.
should more central nodes have higher betweenness centrality?
I think the problem is that you're mixing two notions of centrality here. There's the well-defined 'betweenness centrality', and then there's 'nodes that end up in the center of the picture after a Fruchterman-Reingold layout'. They are not the same.
For example, take a complete graph, and then add one new node A connected only to node B (just some random node in the complete graph). Then B will have a high betweenness, but there's no reason to draw it in the middle of the picture. If I wanted to make a nice drawing of this, I would put A and B at the edge. Maybe Fruchterman-Reingold does that too, because it will force A outward, as A is not connected to most nodes.
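This toy scenario is easy to check numerically; a minimal sketch with networkx (an assumed library choice, not from the thread):

```python
import networkx as nx

G = nx.complete_graph(5)   # a complete graph on nodes 0..4
G.add_edge("A", 0)         # pendant node A, attached only to node 0 ("B")

# Every shortest path from A to the rest of the clique passes through
# node 0, so it is the only node with nonzero betweenness, even though
# a force-directed layout would likely draw A and 0 at the periphery.
bc = nx.betweenness_centrality(G, normalized=False)
print(bc)
```

Node 0 accumulates one unit of betweenness per pair (A, v) for the four other clique members, while every other node scores zero, illustrating that pictorial centrality and betweenness centrality can disagree.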
Betweenness-based layout algorithms do exist:
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-10-19, but I don't think igraph has one available.

Minimum length of lines needed

Suppose we have a set of N points on the Cartesian plane (x_i, y_i), and suppose we connect those points with lines.
Is there a way, using a graph and something like a shortest-path algorithm or a minimum spanning tree, to ensure we can reach any point starting from any point while minimizing the total length of the lines?
I thought that maybe I could set the cost of each edge to the distance between its endpoints and use a shortest-path algorithm, but I'm not sure if this is possible.
Any ideas?
I'm not 100% sure what you want, so I'll offer two algorithms.
First: if you just want a robust algorithm, use Dijkstra's algorithm. The only challenge left is to define the edge cost, which I assume would be 1 for neighboring nodes.
Second: if you want to use a heuristic to estimate the next best node and reduce running time, use A*, but you need to write a heuristic that underestimates the distance; you could use the Euclidean distance for this. The edge-cost question stays the same.
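Since the stated goal is connecting all points while minimizing total line length, a minimum spanning tree over the complete Euclidean graph is the standard answer; a minimal Kruskal sketch in plain Python (the coordinates are made-up example data):

```python
import math
from itertools import combinations

points = [(0, 0), (3, 0), (0, 4), (3, 4)]  # example coordinates

# Build all pairwise Euclidean edges, sorted by length for Kruskal.
edges = sorted(
    (math.dist(p, q), i, j)
    for (i, p), (j, q) in combinations(enumerate(points), 2)
)

parent = list(range(len(points)))  # union-find forest

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

total = 0.0
for w, i, j in edges:
    ri, rj = find(i), find(j)
    if ri != rj:          # edge joins two components: keep it
        parent[ri] = rj
        total += w
print(total)              # minimum total line length
```

For this 3-by-4 rectangle the tree keeps both short sides (length 3) plus one long side (length 4), for a total of 10.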

How to find total number of minimum spanning trees in a graph?

I don't want to find all the minimum spanning trees, but I want to know how many of them there are. Here is the method I considered:
Find one minimum spanning tree using Prim's or Kruskal's algorithm, then find the weights of all the spanning trees and increment a running counter whenever one equals the weight of the minimum spanning tree.
I couldn't find any method to obtain the weights of all the spanning trees, and the number of spanning trees might be very large, so this method might not be suitable for the problem.
Since the number of minimum spanning trees can be exponential, enumerating them won't be a good idea.
All the weights will be positive.
We may also assume that no weight will appear more than three times in the graph.
The number of vertices will be less than or equal to 40,000.
The number of edges will be less than or equal to 100,000.
There is only one minimum spanning tree in a graph where all edge weights are different. I think the best way of finding the number of minimum spanning trees must somehow use this property.
EDIT:
I found a solution to this problem, but I am not sure why it works. Can anyone please explain it?
Solution: The problem of finding the length of a minimal spanning tree is fairly well-known; the two simplest algorithms for finding one are Prim's algorithm and Kruskal's algorithm. Of these two, Kruskal's algorithm processes edges in increasing order of their weights. There is an important point about Kruskal's algorithm to keep in mind, though: when considering a list of edges sorted by weight, edges can be greedily added into the spanning tree (as long as they do not connect two vertices that are already connected in some way).
Now consider a partially-formed spanning tree using Kruskal's algorithm. We have inserted some number of edges with lengths less than N, and now have to choose several edges of length N. The algorithm states that we must insert these edges, if possible, before any edges with length greater than N. However, we can insert these edges in any order that we want. Also note that, no matter which edges we insert, it does not change the connectivity of the graph at all. (Let us consider two possible graphs, one with an edge from vertex A to vertex B and one without. The second graph must have A and B as part of the same connected component; otherwise the edge from A to B would have been inserted at one point.)
These two facts together imply that our answer will be the product of the number of ways, using Kruskal's algorithm, to insert the edges of length K (for each possible value of K). Since there are at most three edges of any length, the different cases can be brute-forced, and the connected components can be determined after each step as they would be normally.
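The scheme described above can be sketched as follows. This is an illustrative Python implementation (the names are my own); it relies on the stated promise that at most a handful of edges share any weight, so each weight class can be brute-forced:

```python
import itertools

class DSU:
    """Union-find over n vertices."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False      # edge would close a cycle
        self.parent[ra] = rb
        return True
    def copy(self):
        c = DSU(0)
        c.parent = self.parent[:]
        return c

def count_msts(n, edges):
    """Count MSTs of a connected graph given as (u, v, weight) triples,
    multiplying the number of valid Kruskal choices per weight class."""
    edges = sorted(edges, key=lambda e: e[2])
    dsu = DSU(n)
    total = 1
    i = 0
    while i < len(edges):
        j = i
        while j < len(edges) and edges[j][2] == edges[i][2]:
            j += 1
        group = edges[i:j]            # all edges sharing this weight
        after = dsu.copy()
        rank = sum(after.union(u, v) for u, v, _ in group)
        ways = 0                      # acyclic subsets of size `rank`
        for subset in itertools.combinations(group, rank):
            trial = dsu.copy()
            if all(trial.union(u, v) for u, v, _ in subset):
                ways += 1
        total *= ways
        dsu = after   # connectivity after the group is choice-independent
        i = j
    return total
```

For a triangle with three equal weights this returns 3 (any two edges form an MST), and for a graph with all-distinct weights it returns 1, matching the uniqueness property mentioned earlier.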
Looking at Prim's algorithm, it says to repeatedly add the edge with the lowest weight. What happens if there is more than one edge with the lowest weight that can be added? Possibly choosing one may yield a different tree than when choosing another.
If you use Prim's algorithm, run it with every edge as a starting edge, and also exercise all the ties you encounter, then you'll have a forest containing all the minimum spanning trees Prim's algorithm is able to find. I don't know whether that equals the forest containing all possible minimum spanning trees.
This does still come down to finding all minimum spanning trees, but I can see no simple way to determine whether a different choice would yield the same tree or not.
MSTs and their count in a graph are well-studied. See, for instance: http://www14.informatik.tu-muenchen.de/konferenzen/Jass08/courses/1/pieper/Pieper_Paper.pdf.

Undirected graph edge "relink" measure

I wish to characterize a dynamic graph based on the frequency of edges being reformed between vertices and the duration between these relinking instances. I refer to such a measure as 'link repetition'. A high value would indicate that newly formed edges often reconnect vertices that were connected recently. A low value would indicate that new edges are being formed between new pairs of vertices, or between non-recent neighbors.
I have searched a while for a measure of this sort but have found mostly measures dealing with new edges that aren't ever removed. A reference to an existing dynamic graph measure would be ideal. My current solution is just the inverse 'time since last link between i and j' averaged over number of timesteps, but I would like to stick with an established solution if it exists.
Could you keep a counter matrix that increments each time a link is reformed between two nodes of the graph, and then base your measures on that?
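As an illustrative sketch of the asker's own "inverse time since last link" idea (the function name and the snapshot format, a list of edge sets per timestep, are my own invention):

```python
def link_repetition(snapshots):
    """Toy 'link repetition' score for a dynamic graph.

    snapshots: list of edge sets, one per timestep. A newly formed edge
    that reconnects a recently linked pair scores 1/gap (close to 1);
    a first-time link scores 0. The result averages over all new edges.
    """
    last_seen = {}        # pair -> last timestep the edge existed
    scores = []
    prev = set()
    for t, edges in enumerate(snapshots):
        for e in edges - prev:                 # newly formed edges only
            pair = tuple(sorted(e))
            if pair in last_seen:
                scores.append(1.0 / (t - last_seen[pair]))
            else:
                scores.append(0.0)
        for e in edges:
            last_seen[tuple(sorted(e))] = t
        prev = set(edges)
    return sum(scores) / len(scores) if scores else 0.0
```

For example, an edge that appears, disappears for one step, and reappears scores 0.5 on its reappearance, while its first appearance scores 0.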

Graph Drawing With Weighted Edges

I'm looking to build an algorithm (or reuse one) that organizes nodes and edges on a two-dimensional canvas, where edges can have corresponding weights.
Any starting material and info would be helpful.
What would the weights do to affect their placement on your canvas?
That being said, you might want to look into graphviz and, more specifically, the DOT language, which organizes nodes on a canvas.
Many graph visualization frameworks use a force-based simulation, in which all nodes exert a repulsive force against each other (with their mass being their size), and edges exert tension on the nodes they connect. This creates aesthetically-arranged graph visualizations.
Although again, I'm not sure where you want the "weights" to come into play. Do you want weighted nodes to be more in the center? To be larger? To be further apart?
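The force-based simulation mentioned above can be sketched in plain Python. This is a toy, unweighted Fruchterman-Reingold-style loop, simplified from real implementations: a fixed movement cap instead of a cooling schedule, and naive O(n^2) repulsion:

```python
import math
import random

def force_layout(nodes, edges, iters=200, k=1.0, cap=0.1):
    """Toy force-directed layout: all node pairs repel (~ k^2/d),
    edges pull their endpoints together like springs (~ d^2/k)."""
    random.seed(0)                            # deterministic demo
    pos = {v: [random.random(), random.random()] for v in nodes}
    for _ in range(iters):
        disp = {v: [0.0, 0.0] for v in nodes}
        for a in nodes:                       # pairwise repulsion
            for b in nodes:
                if a == b:
                    continue
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k * k / d
                disp[a][0] += f * dx / d
                disp[a][1] += f * dy / d
        for a, b in edges:                    # spring attraction
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9
            f = d * d / k
            disp[a][0] -= f * dx / d
            disp[a][1] -= f * dy / d
            disp[b][0] += f * dx / d
            disp[b][1] += f * dy / d
        for v in nodes:                       # capped move per iteration
            dx, dy = disp[v]
            d = math.hypot(dx, dy) or 1e-9
            pos[v][0] += dx / d * min(d, cap)
            pos[v][1] += dy / d * min(d, cap)
    return pos
```

On a three-node path 0-1-2, the endpoints 0 and 2 repel each other with no spring between them, so they end up further apart than either connected pair, which is exactly the "aesthetic" behavior described above.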
Many graph/network layout algorithms are implicitly capable of handling weighted networks, but you may need to do some pre-processing and tweak the implementation to get it to work. Usually the first step is to determine whether your weights represent "similarities" (usually interpreted to mean that stronger weights should place nodes closer together) or "dissimilarities" (stronger weights = farther apart). The most common case is the former, so you will need to translate them into dissimilarities, often done by subtracting each edge value from the maximum observed edge value in the network. The matrix of dissimilarity values for each edge can then be fed to the algorithm and interpreted as desired distances in the layout space for each edge (i.e. "spring lengths"), usually after multiplying by some constant to convert to display units (pixels).
If you tell me what language you are using, I may be able to point you to some code examples.
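For example, the similarity-to-dissimilarity transform just described could look like this in Python (the weights and the pixel scale are made-up example values):

```python
# Toy similarity weights on three edges (assumed values)
weights = {("a", "b"): 0.9, ("b", "c"): 0.4, ("a", "c"): 0.1}

max_w = max(weights.values())
scale = 100  # constant converting to display units (pixels), arbitrary

# similarity -> dissimilarity: the strongest edge gets the shortest
# spring; note the maximum-weight edge gets length 0, so in practice
# you may want a small offset so no spring collapses entirely
lengths = {e: (max_w - w) * scale for e, w in weights.items()}
print(lengths)
```

The resulting per-edge lengths can then be handed to a layout routine as target spring lengths.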
