Is there any way to make Node2Vec faster? - graph

I have a graph with 480k nodes and 34M edges. I want to create node embeddings using Node2Vec on this graph. But, It is not even able to calculate transition probabilities. I am using a Google Cloud Machine with 32 cores and 120 GB RAM. Infrastructure is not the problem, the problem is that the function _precompute_probabilities in the node2vec pip library is not paraller. It is using only a single thread to calculate the transition probabilities. Is there a way to make this parallel or is they any other parallel version of Node2Vec ?

TLDR
To compute embeddings on large graphs use GRAPE.
pip install grape
Example:
from grape import Graph
from grape.embedders import Node2VecGloVeEnsmallen
graph = Graph.from_csv(
## The path to the edges list tsv
edge_path="edges.csv",
sources_column="source",
destinations_column="destination",
directed=False,
)
embedding = Node2VecGloVeEnsmallen().fit_transform(graph)
Longer answer
To solve this issue, we developed GRAPE, our Rust library with Python bindings. We needed to run node2vec on big graphs and the libraries we found weren't fast enough. We re-implemented and optimized many models, including Node2vec's, from the ground up without using Tensorflow or Pytorch.
Here's some benchmarks on our server with 12 cores (24 threads) and 128GB of ram.
Info about the tested graphs:
WikiEN has 130M edges and 17M nodes.
CTD has 45M edges and 100K nodes.
PheKnowLator has 7M edges and 800K nodes.
Here's the tutorial on how to load your own custom graph, we support CSV-like formats:
https://github.com/AnacletoLAB/grape/blob/main/tutorials/Loading_a_Graph_in_Ensmallen.ipynb
and here's the complete tutorial on how to run and visualize node2vec with Glove on a given graph:
https://github.com/AnacletoLAB/grape/blob/main/tutorials/Using_Node2Vec_GloVe_to_embed_Cora.ipynb
Our paper is still under review ( https://arxiv.org/abs/2110.06196 ) and we are just two developers, so, if you need any help contact us on Discord, Github, or Twitter #GRAPElib.

I found a library Graph2Vec, it uses a CSR Matrix to generate walks instead of jumping from node to node in memory. It is way faster than Node2Vec.
Link: https://www.singlelunch.com/2019/08/01/700x-faster-node2vec-models-fastest-random-walks-on-a-graph/
Github: https://github.com/VHRanger/graph2vec
Also, you can refer to this issue and try the mentioned libraries:
https://github.com/aditya-grover/node2vec/issues/10

I've tried https://github.com/eliorc/node2vec with the "temp_folder" property. Thought I didn't feel that it was much faster, so I ended up with the version with CSR Matrices.
Oh... was it yourself, who answered the question? :)
Good to know, thank you for the tip

Related

Is it possible to do cohesion block in parallel with igraph?

I have a network of about 2k nodes and 4.8k edges which exhibits high small world attributes. I want to get its cohesion blocks with cohesive.blocks() in igraph, however, it has been running for days and still not output.
There was a link similar with this issue (https://lists.nongnu.org/archive/html/igraph-help/2008-01/msg00020.html), and MPI was mentioned to run cohesive.blocks() in parallel. So I wonder how to run cohesive.blocks() or other functions in igraph in parallel.
Thanks.

How to manage R packages in spark cluster

I´m working with a small spark cluster (5 DataNodes and 2 NameNodes) with version 1.3.1. I read this entry in a berkeley blog:
https://amplab.cs.berkeley.edu/large-scale-data-analysis-made-easier-with-sparkr/
Where it´s detailled how to implement gradient descent using sparkR; running a user-defined gradient function in parallel through the sparkR method "lapplyPartition". If lapplyPartition makes user-defined gradient function executed in every node, I guess that all the methods used inside the user-defined gradient function should be available in very node too. That means, R and all its packages should be installed in every node. Did I understood well?
If so, is there any way of managing the R packages? Now my cluster is small so we can do it manually, but I guess those people with large clusters won't do it like this. Any suggestion?
Thanks a lot!

Compute automorphism group of graph / check if two graphs are isometric (DAG)

This has to be a well researched problem, but I am struggling researching it.
I started here, but I am looking for algorithms to study and implement.
http://en.wikipedia.org/wiki/Graph_isomorphism_problem
For example, if I have two of these DAGs (Directe Acyclic Graphs), I want to mark/delete one of them because it is just a rotation/reflection of the first. Being in the same automorphism group means they can be rotated/reflected to have the exact same adjacency matrix right?
you can use nauty or saucy algorithm to compute this problem.
This links might be helpful to you. :)
Nauty:
http://cs.anu.edu.au/~bdm/nauty/
http://cs.anu.edu.au/~bdm/nauty/
Saucy:
http://vlsicad.eecs.umich.edu/BK/SAUCY/
There is also a list of ready made tools available (especially for command line in linux based OS), in the saucy page.

R tree and Graph partitioning library in R

I need to make efficient d-dimensional points searching and also make efficient k-NN queries of a point in d-dimension. Therefore i require an R-Tree library. I require a library which will build the R-Tree structure, which i can use to query whenever needed.
Also i need to have some library like that of METIS or hMETIS, although my application does not involve hypergraphs. My requirement is to find the min cut set of a graph which divides the graph in roughly two equal sized graphs.
The thing is i would require libraries which support these in R.
I have found a library RANN, which has kd-tree based k-NN queries, but the problem is that either i have to make all the k-NN queries at once and store the results in a huge array, or need to call the function (nn or nn2) every time i need, which defeates the O(n lg n) retrieval growth of time.
Can anyone tell me if there is any such libraries in R?
Note: I would require the R-Tree library for implementing clustering algorithms efficiently, and the graph partition library would be required to implement the CHAMELEON clustering algorithm.
After some study on R and its libraries i think it is better to get the required libraries or make my own code in C or C++ and then use it through the .C() or .Call() R to C language interface.
Also i need to have some library like that of METIS or hMETIS, although my application does not involve hypergraphs. My requirement is to find the min cut set of a graph which divides the graph in roughly two equal sized graphs.
Despite that this is an old question, I have written something like this recently. That is,
A Kernighan-Lin like algorithm.
An algorithm to find an approximately connected balanced partition using the method suggested by Chlebíková (1996).
An algorithm that takes the solution found by the method in 2. and tries to minimize the cut price using a Kernighan-Lin like algorithm while still requiring that the two sets in the partition are connected.
From the graphs I am working with, 3. seems to often find a quite good solution often for bigger graphs (say ~ 1-4 million edges with ~ 1 million vertices). This takes seconds or a few minutes. The implementation is in the pedmod package at https://github.com/boennecd/pedmod. Call the following to install the package and to find a vignette with further details:
remotes::install_github("boennecd/pedmod", build_vignettes = TRUE)
vignette("pedigree_partitioning", package = "pedmod")
I am not sure how my implementation compares in terms of speed and quality of the partition compared with other software though.
References
Chlebíková, Janka. 1996. “Approximating the Maximally Balanced Connected Partition Problem in Graphs.” Information Processing Letters 60 (5): 225–30.
Kernighan, B. W., and S. Lin. 1970. “An Efficient Heuristic Procedure for Partitioning Graphs.” The Bell System Technical Journal 49 (2): 291–307

Parallelize Solve() for Ax=b?

Crossposted with STATS.se since this problem could straddle both STATs.se/SO
https://stats.stackexchange.com/questions/17712/parallelize-solve-for-ax-b
I have some extremely large sparse matrices created using spMatrix function from the matrix package.
Using the solve() function works for my Ax=b issue, but it takes a very long time. Several days.
I noticed that http://cran.r-project.org/web/packages/RScaLAPACK/RScaLAPACK.pdf
appears to have a function that can parallelize the solve function, however, it can take several weeks to get new packages installed on this particular server.
The server already has the snow package installed it.
So
Is there a way of using snow to parallelize this operation?
If not, are there other ways to speed up this type of operation?
Are there other packages like RScaLAPACK? My search on RScaLAPACK seemed to suggest people had a lot of issues with it.
Thanks.
[EDIT] -- Additional details
The matrices are about 370,000 x 370,000.
I'm using it to solve for alpha centrality, http://en.wikipedia.org/wiki/Alpha_centrality. I was originally using the alpha centrality function in the igraph package, but it would crash R.
More details
This is on a single machine with 12 cores and 96 gigs of memory (I believe)
It's a directed graph along the lines of paper citation relationships.
Calculating condition number and density will take awhile. Will post as it comes available.
Will crosspost on stat.SE and will add a link back to here
[Update 1: For those just tuning in: The original question involved parallelizing computations to solving a regression problem; given that the underlying problem is related to alpha centrality, some of the issues, such as bagging and regularized regression may not be as immediately applicable, though that leads down the path of further statistical discussions.]
There are a bundle of issues to address here, from the infrastructural to the statistical.
Infrastructure
[Updated - also see Update #2 below.]
Regarding parallelized linear solvers, you can replace R's BLAS / LAPACK library with one that supports multithreaded computations, such as ATLAS, Goto BLAS, Intel's MKL, or AMD's ACML. Personally, I use AMD's version. ATLAS is irritating, because one fixes the number of cores at compilation, not at run-time. MKL is commercial. Goto is not well supported anymore, but is often the fastest, but only by a slight margin. It's under the BSD license. You can also look at Revolution Analytics's R, which includes, I think, the Intel libraries.
So, you can start using all of the cores right away, with a simple back-end change. This could give you a 12X speedup (b/c of the number of cores) or potentially much more (b/c of better implementation). If that brings down the time to an acceptable range, then you're done. :) But, changing the statistical methods could be even better.
You've not mentioned the amount of RAM available (or the distribution of it per core or machine), but A sparse solver should be pretty smart about managing RAM accesses and not try to chew on too much data at once. Nonetheless, if it is on one machine and if things are being done naively, then you may encounter a lot of swapping. In that case, take a look at packages like biglm, bigmemory, ff, and others. The former addresses solving linear equations (or GLMs, rather) in limited memory, the latter two address shared memory (i.e. memory mapping and file-based storage), which is handy for very large objects. More packages (e.g. speedglm and others) can be found at the CRAN Task View for HPC.
A semi-statistical, semi-computational issue is to address visualization of your matrix. Try sorting by the support per row & column (identical if graph is undirected, else do one then the other, or try a reordering method like reverse Cuthill-McKee), and then use image() to plot the matrix. It would be interesting to see how this is shaped, and that affects which computational and statistical methods one could try.
Another suggestion: Can you migrate to Amazon's EC2? It is inexpensive, and you can manage your own installation. If nothing else, you can prototype what you need and migrate it in-house once you have tested the speedups. JD Long has a package called segue that apparently makes life easier for distributing jobs on Amazon's Elastic MapReduce infrastructure. No need to migrate to EC2 if you have 96GB of RAM and 12 cores - distributing it could speed things up, but that's not the issue here. Just getting 100% utilization on this machine would be a good improvement.
Statistical
Next up are multiple simple statistical issues:
BAGGING You could consider sampling subsets of your data in order to fit the models and then bag your models. This can give you a speedup. This can allow you to distribute your computations on as many machines & cores as you have available. You can use SNOW, along with foreach.
REGULARIZATION The glmnet supports sparse matrices and is very fast. You would be wise to test it out. Be careful about ill-conditioned matrices and very small values of lambda.
RANK Your matrices are sparse: are they full rank? If they are not, that could be part of the issue you're facing. When matrices are either singular or very nearly so (check your estimated condition number, or at least look at how your 1st and Nth eigenvalues compare - if there's a steep drop off, you're in trouble - you might check eval1 versus ev2,...,ev10,...). Again, if you have nearly singular matrices, then you need to go back to something like glmnet to shrink out the variables are either collinear or have very low support.
BOUNDING Can you reduce the bandwidth of your matrix? If you can block diagonalize it, that's great, but you'll likely have cliques and members of multiple cliques. If you can trim the most poorly connected members, then you may be able to estimate their alpha centrality as being upper bounded by the lowest value in the same clique. There are some packages in R that are good for this sort of thing (check out Reverse Cuthill-McKee; or simply look to see how you'd convert it into rectangles, often relating to cliques or much smaller groups). If you have multiple disconnected components, then, by all means, separate the data into separate matrices.
ALTERNATIVES Are you wedded to the Alpha Centrality? There may be other measures that are monotonically correlated (i.e. have high rank correlation) with the same value that could be calculated more cheaply or at least implemented quite efficiently. If those will work, then your analyses could proceed with a lot less effort. I have a few ideas, but SO isn't really the place to go about that discussion.
For more statistical perspectives, appropriate Q&A should occur on the stats.stackexchange.com, Cross-Validated.
Update 2: I was a bit too quick in answering and didn't address this from the long-term perspective. If you are planning to do research on such systems for the long-term, you should look at other solvers that may be more applicable to your type of data and computing infrastructure. Here is a very nice directory of the options for both solvers and pre-conditioners. It seems this doesn't include IBM's "Watson" solver suite. Although it may take weeks to get software installed, it's quite possible that one of the packages is already installed if you have a good HPC administrator.
Also, keep in mind that R packages can be installed to the user directory - you need not have a package installed in the general directory. If you need to execute something as a user other than yourself, you could also download a package to the scratch or temporary space (if you're running within just 1 R instance, but using multiple cores, check out tempdir).

Resources