Alternative for the shortest_paths algorithm in R

I have a network consisting of 335 nodes. I computed the weighted shortest paths between all of the nodes with shortest.paths.
Now I would like to see which path sequences were used to travel between the nodes.
I use the shortest_paths command in igraph and iterate through all combinations of nodes in my network: (335² - 335) / 2 combinations, since the path from a node to itself has length 0 and the graph is undirected. So all in all I have to iterate over 55,945 combinations.
My approach looks like this (net is my network; sp_data is a data.frame with all combinations of links in the network):
results1 <- sapply(sp_data[, 1], function(x) {shortest_paths(net, from = x, to = V(net), output = "epath")})
Unfortunately this takes ages to compute, and in the end I don't have enough memory to store the information (Error: cannot allocate vector of size 72 Kb).
Basically I have two questions:
How can it be that the shortest.paths command needs only seconds to compute the distances between all nodes of my network, whereas extracting the path sequences (not just their lengths) takes days and exceeds the memory capacity?
Is there an alternative way to get the desired output (the node sequences of the shortest paths)? I guess the sapply syntax should already be faster than a for loop?

You could try the cppRouting package.
It provides the get_multi_paths function, which returns a list containing the node sequence for each shortest path.
library(igraph)
library(cppRouting)
# example graph with 335 nodes
g <- make_full_graph(335)
# convert to a three-column data.frame (from, to, dist)
df <- data.frame(igraph::as_data_frame(g), dist = 1)
# instantiate a cppRouting graph
gr <- cppRouting::makegraph(df)
# extract all nodes
all_nodes <- unique(c(df$from, df$to))
# get all path sequences
all_paths <- get_multi_paths(Graph = gr, from = all_nodes, to = all_nodes)
# get the path from node 1 to node 3
all_paths[["1"]][["3"]]

How to create network with both edges and isolates using statnet/igraph

My question is similar to the one posted here: Network adding edges error
I am creating a network from scratch: I have data on 228 vertices, over a period of 13 years. In the first year I have just 1781 edges, and they do not involve all of my vertices (only 164), so the remaining nodes should appear as isolates.
I created the network starting from my edgelist, using the code
fdi.graph.2003 <- graph_from_data_frame(fdi.edge.2003, directed = T, vertices = fdi.attr.2003)
where fdi.edge.2003 is a data.frame containing the edge list plus edge attributes (including some potential weight columns): it only involves 164 vertices out of the total, and fdi.attr.2003 is a data.frame containing a row for each vertex that is involved in the edge list (164 in total).
All I get is a network with 164 vertices and no isolates. However, I know they do exist in my data! Any suggestion on how to do this? I think I should initialize a network with all 228 vertices, add their attributes and then add the edges. However, nothing I am trying is working: rather, I am receiving the most disparate errors related to "Illegal vertex reference in addEdges_R".
Any suggestion is more than welcome, also if it involves the alternative package igraph, for which I am finding the same problem.
Filippo
Use add.isolates from the sna package
net1 = as.network(cbind(1:3, 3:5)) #5 vertices, 3 edges
net2 = as.network(add.isolates(net1, 10), matrix.type = "edgelist") #15 v, 3 e
And then you'll probably want to create new vertex names, e.g.
net2%v%"vertex.names" = 1:15
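Since the question also mentions igraph: graph_from_data_frame keeps every vertex named in its vertices argument, even vertices with no incident edges, so the isolates disappear only because fdi.attr.2003 lists just the 164 vertices that occur in the edge list. A minimal sketch, assuming a hypothetical fdi.attr.2003.full with one row per vertex (all 228 of them):
library(igraph)
# vertices listed in fdi.attr.2003.full but absent from the edge list become isolates
fdi.graph.2003 <- graph_from_data_frame(fdi.edge.2003, directed = TRUE,
                                        vertices = fdi.attr.2003.full)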

How to extract only the vertices with multiple edges from a graph using igraph in R

I am new to igraph and graph theory. I have a very large file (> 4 GB) and I was told it is a graph. I can see that the format consists of pairs separated by tabs, so I can read it as a table first and then convert it to a graph data frame.
The number of vertices (vcount) and the number of edges (ecount) suggest that there are vertices with multiple edges. I have been looking at various sources, but I could not find information about directly extracting the vertices with more than one edge.
Any help is appreciated.
To get the edges incident to each vertex (if g is your igraph object):
ie <- igraph::incident_edges(g, igraph::V(g))
Then, to get the number of edges incident to each vertex:
num.incident.edges <- sapply(ie, length)
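From there, extracting the vertices the question actually asks for (those with more than one incident edge) is one more line; note that for graphs without self-loops, igraph::degree(g) gives the same counts directly:
multi_edge_vertices <- igraph::V(g)[num.incident.edges > 1]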
Sorry, I guess I was wrong with the terminology. What I meant by vertices with multiple edges is called 'articulation_points'.
This was what I was looking for:
library(igraph)
bi <- biconnected_components(g)
bi$articulation_points

How to create graph with large number of points in R?

I have a large dataset containing a large number of nodes (more than 25,000), organized in a .csv file. The structure is similar to the following:
node freq
a    3
b    2
c    5
I want to create a graph from these nodes in which edges between nodes are constructed as a function of the freq column. I have used the rgraph function from the sna package, like this:
num_nodes <- length(data$node)
pLink <- data$freq / 10
# create 1 graph with nodes and link probability, graph loops = FALSE
graph_adj <- rgraph(num_nodes, 1, pLink, "graph", FALSE)
graph <- graph.adjacency(graph_adj, mode = "undirected")
The above code runs fine for a small number of nodes, but with a large number of nodes the R session aborts with the following error:
Error: C stack usage 19924416 is too close to the limit
Is there another way to create a graph with the mentioned properties, i.e. a large number of nodes with edges created according to a probability?
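One memory-friendly direction, offered as a hedged sketch rather than a definitive answer: skip the dense 25000 x 25000 adjacency matrix that rgraph/graph.adjacency implies and sample the edge list directly, one row of the upper triangle at a time. The sketch assumes the intended pairwise link probability is pLink[i] * pLink[j], which the question leaves open:
library(igraph)
set.seed(1)
pLink <- data$freq / 10
n <- length(pLink)
edges <- vector("list", n - 1L)
for (i in seq_len(n - 1L)) {
  j <- (i + 1L):n                                   # upper triangle: undirected, no loops
  hit <- j[runif(length(j)) < pLink[i] * pLink[j]]  # one Bernoulli draw per pair
  if (length(hit) > 0) edges[[i]] <- rbind(i, hit)  # 2 x k block of (from, to) pairs
}
g <- add_edges(make_empty_graph(n, directed = FALSE),
               as.vector(do.call(cbind, edges)))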

Kmeans clustering of text data with percentage match

I have hundreds of large strings and want to cluster them into groups (clusters). I found k-means as one way to do this, but my problem is that it only takes the number of clusters as an argument. My requirement is to take the percentage match between strings as an argument and cluster only those strings that meet or exceed that criterion. For example, only if strings 1 and 2 match > 90% do I want them in the same cluster. The ones which do not match can be put in single-element clusters. Is there a way to do this in R, Python or any other language?
Clustering algorithm
k-means
As its name suggests, k-means will try to make k clusters, using the mean of all values in a cluster as its center. You then update the positions of the centers, assign each element to the closest center, and repeat until nothing changes anymore.
As you can see, all you need to define is the number of centers (and their starting points, but these are often randomized and the procedure repeated many times).
Your classification
What you want is to cluster words that are very similar to one another, based on a threshold.
You can do that by computing the distance between elements (the distance being your similarity).
The pseudo-code for that, implemented in the sketch after the list, would be:
1) initialize a cluster with the first word
2) add to the cluster all words that are "close enough" to this word
3) pick a word that has not been clustered yet, and initialize a new cluster with it
4) add to that cluster all words "close enough" to this word
5) repeat 3 and 4 until all words are used
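A minimal R sketch of this pseudo-code, assuming similarity is measured as 1 - edit distance / length of the longer string (base R's adist); the 0.9 cutoff mirrors the ">90% match" example:
threshold_cluster <- function(words, cutoff = 0.9) {
  # similarity in [0, 1]: 1 means identical strings
  sim <- function(a, b) 1 - adist(a, b) / pmax(nchar(a), nchar(b))
  clusters <- list()
  remaining <- words
  while (length(remaining) > 0) {
    seed <- remaining[1]                                # steps 1 and 3
    close <- remaining[sim(seed, remaining) >= cutoff]  # steps 2 and 4
    clusters[[length(clusters) + 1]] <- close
    remaining <- setdiff(remaining, close)              # step 5
  }
  clusters
}
# e.g. threshold_cluster(c("kitten", "kittens", "banana"), cutoff = 0.8)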

Generate an OD list of nodes within n stops

I have a graph G(V, E) with 35000 edges and 3500 nodes.
Is there any way I can generate an origin-destination list within n (say 4) stops for each node?
I think the function neighborhood() does exactly what you want. Set the order argument to 4 and for each vertex you'll get a vector of vertex ids for the vertices that are at most 4 steps away from it.
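A short sketch of that, including one way to flatten the result into an origin-destination list (neighborhood() includes the origin vertex itself, so it is dropped here):
library(igraph)
nb <- neighborhood(g, order = 4)   # one vertex set per origin
od <- do.call(rbind, lapply(seq_along(nb), function(i) {
  data.frame(origin = i, destination = setdiff(as.integer(nb[[i]]), i))
}))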
I figured it out:
Use a property of the adjacency matrix A: the entry in row i and column j of A^n gives the number of (directed or undirected) walks of length n from vertex i to vertex j. So for n stops, construct the matrices A^1, A^2, ..., A^n. The union of their nonzero patterns (equivalently, the nonzero entries of A^1 + A^2 + ... + A^n) is then the matrix representing the destinations reachable from each origin within n stops.
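A hedged sketch of that idea, using a sparse adjacency matrix so the powers stay affordable for 3500 nodes (assuming an igraph object g and n = 4):
library(igraph)
library(Matrix)
A <- as_adjacency_matrix(g, sparse = TRUE)
reach <- A
P <- A
for (k in 2:4) {
  P <- P %*% A        # P is now A^k: walks of length k
  reach <- reach + P  # accumulate walks of length <= k
}
within_n_stops <- reach != 0   # TRUE where j is within 4 stops of i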
