Why does LightGraphs.betweenness_centrality() run so long - julia

When I try to calculate betweenness_centrality() on my SimpleWeightedGraph with Julia's LightGraphs package, it runs indefinitely. It keeps increasing its RAM usage until at some point it crashes without an error message. Is there something wrong with my graph? Or what would be the best way to find the cause of this problem?
My graphs are not generated by LightGraphs but by another library, FlashWeave. I don't know if that matters...
The problem does not occur for unweighted SimpleGraphs or for a weighted graph I created directly in LightGraphs...
using BenchmarkTools
using FlashWeave
using ParserCombinator
using GraphIO.GML
using LightGraphs
using SimpleWeightedGraphs

data_path = "/path/to/my/data"
netw_results_no_meta = FlashWeave.learn_network(data_path,
                                                sensitive = true,
                                                heterogeneous = false)

dummy_weighted_graph = SimpleWeightedGraph(smallgraph(:house))
# {5, 6} undirected simple Int64 graph with Float64 weights

my_weighted_graph = graph(netw_results_no_meta)
# {6558, 8484} undirected simple Int64 graph with Float64 weights

# loadgraph() only loads unweighted graphs
save_network(gml_no_meta_path, netw_results_no_meta)
my_unweighted_graph = loadgraph(gml_no_meta_path, GMLFormat())
# {6558, 8484} undirected simple Int64 graph

@time betweenness_centrality(my_unweighted_graph)
# 12.467820 seconds (45.30 M allocations: 7.531 GiB, 2.73% gc time)
@time betweenness_centrality(dummy_weighted_graph)
# 0.271050 seconds (282.41 k allocations: 13.838 MiB)
@time betweenness_centrality(my_weighted_graph)
# steadily increasing RAM usage until RAM is full and Julia crashes.

This was already answered in https://github.com/JuliaGraphs/LightGraphs.jl/issues/1531, before this question was posted here. As mentioned in that resolution, you have negative weights in your graph. Dijkstra does not handle graphs with negative weights, and LightGraphs' betweenness centrality uses Dijkstra.

Did you check whether your graph contains negative weights? LightGraphs.betweenness_centrality() uses Dijkstra's shortest paths to calculate betweenness centrality, and therefore expects non-negative weights.
LightGraphs.betweenness_centrality() doesn't validate its input, which is why it didn't throw an error. The issue is already reported here, but for now, check your graphs yourself if you're not sure they are valid (a small guard you could adapt follows the example below).
A = Float64[
0 4 2
4 0 1
2 1 0
]
B = Float64[
0 4 2
4 0 -1
2 -1 0
]
graph_a = SimpleWeightedGraph(A)
# {3, 3} undirected simple Int64 graph with Float64 weights
graph_b = SimpleWeightedGraph(B)
# {3, 3} undirected simple Int64 graph with Float64 weights
minimum(graph_a.weights)
# 0.00
minimum(graph_b.weights)
# -1.00
@time betweenness_centrality(graph_a)
# 0.321796 seconds (726.13 k allocations: 36.906 MiB, 3.53% gc time)
# Vector{Float64} with 3 elements
# 0.00
# 0.00
# 1.00
@time betweenness_centrality(graph_b)
# reproduces your problem.
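If you just want a quick guard in your own pipeline, a sketch like the following works (the helper name is mine, not part of either library); it inspects only the stored entries of the weight matrix:
using SparseArrays: nonzeros
using LightGraphs, SimpleWeightedGraphs

# Fail fast before handing a weighted graph to a Dijkstra-based routine
# such as betweenness_centrality().
function assert_nonnegative_weights(g)
    w = nonzeros(weights(g))   # stored edge weights (skips structural zeros)
    if !isempty(w) && minimum(w) < 0
        error("graph has negative edge weights; Dijkstra-based algorithms will misbehave")
    end
    return g
end

assert_nonnegative_weights(graph_b)  # throws instead of exhausting RAM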

Related

How can I find first principal component (and loadings) fast without using covariance matrix?

I have a matrix $X$ and I would like to find its first principal component and the corresponding loadings. I would like to do this without computing the covariance matrix of $X$. How can I do so?
This is the standard version, which uses the eigendecomposition of the covariance matrix.
using LinearAlgebra: eigen
using Statistics: mean

function find_principal_component(X)
    n = size(X, 1)
    B = X .- mapslices(mean, X, dims=[1])  # Center columns of X
    evalues, V = eigen(B'B / (n - 1))      # Eigendecomposition of covariance matrix
    PC = V[:, argmax(evalues)]             # Grab principal component and compute loading
    return B * PC, PC
end
Alternatively, one could use the power method, which still uses the covariance matrix:
using LinearAlgebra: norm

function power_method(X, niter=50)
    pc = randn(size(X, 2))
    pc /= norm(pc)
    M = X'X
    for i in 1:niter
        pc = M * pc
        pc /= norm(pc)
    end
    return X * pc, pc
end
I would like something like the power method, but without needing to compute the covariance matrix, which can be quite costly.
Possible solution
I noticed something interesting. Let $r_t$ be the principal component vector at time $t$. The idea of the power method is to start with a random $r_t$ and multiply it by $X^\top X$ many times to stretch it towards the principal component; in other words, $r_{t+1} = X^\top X r_t$.
Once we have the principal component $r_t$, the loadings are simply $\ell_t = X r_t$. This means we can write $r_{t+1} = X^\top \ell_t$.
One could therefore start with $r_t$ and $\ell_t$ initialized randomly and then iterate
$$r_{t+1} = \operatorname{normalize}(X^\top \ell_t), \qquad \ell_{t+1} = X r_{t+1}.$$
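A minimal sketch of that alternating update in code (assuming X has already been centered; the function name is just illustrative):
using LinearAlgebra: norm

# Covariance-free power iteration: never forms X'X, only products with X and X'.
function power_method_no_cov(X, niter=50)
    r = randn(size(X, 2))
    r /= norm(r)
    l = X * r
    for _ in 1:niter
        r = X' * l        # r_{t+1} = X' * l_t
        r /= norm(r)      # normalize
        l = X * r         # l_{t+1} = X * r_{t+1}
    end
    return l, r           # loadings, principal component
end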
In general, you may find singular value decompositions more useful for this.
The definition of the singular value decomposition is
B = U Σ V'
This means that
B'B = V Σ² V'
As a result, your code can avoid computing B'B. More importantly, the singular values are always real, so you don't have to worry about whether B'B will be exactly symmetric.
Even better, Arpack.svds allows you to compute just the largest few singular values.
Here is a version of your code that uses SVD instead of eigen decomposition:
using LinearAlgebra: eigen
using Statistics: mean
using Arpack: svds

function find_principal_component(X)
    n = size(X, 1)
    # Center columns of X
    B = X .- mapslices(mean, X, dims=[1])
    # Decomposition of covariance matrix via a truncated SVD
    svd, _ = svds(B / (n - 1), nsv=1)
    # Grab principal component and compute loading
    PC = svd.V[:, 1]
    return B * PC, PC
end
Running this on a large sparse matrix (100k x 1k, 1M non-zeros) gives this speed:
julia> @time find_principal_component(sprandn(100_000, 1_000, 0.01))
25.529426 seconds (18.45 k allocations: 3.015 GiB, 0.02% gc time)
([0.014242904195824286, 0.10635817357717596, -0.010142643763158442, ...])
and on a large non-sparse example (1M x 100 entries):
julia> @time find_principal_component(randn(1_000_000, 100))
4.922949 seconds (1.31 k allocations: 2.280 GiB, 0.02% gc time)
([-0.06629858174095249, 0.6996443876327108, -1.1783642870384952, ...])
Try using KrylovKit.jl. Specifically, eigsolve(X, 1, :LM) will give you the eigenvalue with the largest magnitude and the associated eigenvector. Docs are at https://jutho.github.io/KrylovKit.jl/stable/man/eig/
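For the covariance-free PCA use case, a hedged sketch along those lines could pass a function to eigsolve so that B'B is never formed explicitly (the matrix size and variable names below are illustrative, not from the question):
using KrylovKit: eigsolve

B = randn(10_000, 50)            # stand-in for the centered data matrix
# Matrix-free: eigsolve only needs the action v -> B'(Bv).
vals, vecs, info = eigsolve(v -> B' * (B * v), size(B, 2), 1, :LM)
pc = vecs[1]                     # leading principal direction
loadings = B * pc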

Extremely Sparse Integer Quadratic Programming

I am working on an optimization problem with a huge number of variables (upwards of hundreds of millions). Each of them should be a 0-1 binary variable.
I can write it in the form (maximize x'Qx) where Q is positive semi-definite, and I am using Julia, so the package COSMO.jl seems like a great fit. However, there is a ton of sparsity in my problem. Q is 0 except on approximately sqrt(|Q|) entries, and for the constraints there are approximately sqrt(|Q|) linear constraints on the variables.
I can describe this system pretty easily using SparseArrays, but it appears the most natural way to input problems into COSMO uses standard arrays. Is there a way I can take advantage of the sparsity in this massive problem?
While there is no sample code in your question, perhaps this could help:
JuMP works with sparse arrays, so perhaps the easiest thing would be to just use it when constructing the objective function:
julia> using JuMP, SparseArrays, COSMO
julia> m = Model(with_optimizer(COSMO.Optimizer));
julia> q = sprand(Bool, 20, 20, 0.05) # for readability I use a binary q
20×20 SparseMatrixCSC{Bool, Int64} with 21 stored entries:
⠀⠀⠀⡔⠀⠀⠀⠀⡀⠀
⠀⠀⠂⠀⠠⠀⠀⠈⠑⠀
⠀⠀⠀⠀⠀⠤⠀⠀⠀⠀
⠀⢠⢀⠄⠆⠀⠂⠀⠀⠀
⠀⠀⠀⠀⠀⠀⠄⠀⠀⠌
julia> @variable(m, x[1:20], Bin);
julia> x'*q*x
x[1]*x[14] + x[14]*x[3] + x[15]*x[8] + x[16]*x[5] + x[18]*x[4] + x[18]*x[13] + x[19]*x[14] + x[20]*x[11]
You can see that the expression gets correctly reduced.
Indeed you could check the performance with a very sparse q having 100M elements:
julia> q = sprand(10000, 10000, 0.000001)
10000×10000 SparseMatrixCSC{Float64, Int64} with 98 stored entries:
...
julia> @variable(m, z[1:10000], Bin);
julia> @btime $z'*$q*$z
1.276 ms (51105 allocations: 3.95 MiB)
You can see that you are getting the expected performance when constructing the objective function.
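If it helps, here is a hedged sketch of how the sparse linear constraints could be added in the same style (the sizes, A, and b are illustrative placeholders, not from the question; note that COSMO targets convex conic problems, so the binary variables may need a relaxation or a different solver at solve time):
using JuMP, SparseArrays, COSMO

n = 10_000
model = Model(with_optimizer(COSMO.Optimizer))
@variable(model, x[1:n], Bin)
Q = sprand(n, n, 1e-6)              # very sparse quadratic cost (placeholder)
A = sprand(100, n, 1e-4)            # sparse linear constraint matrix (placeholder)
b = rand(100)
@objective(model, Max, x' * Q * x)  # only the stored entries of Q end up in the expression
@constraint(model, A * x .<= b)     # element-wise constraints built from the sparse A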

Cannot figure out simple use of Cumulants.jl

I cannot for the life of me figure out how to use Cumulants.jl to get moments or cumulants from some data. I find the docs (https://juliahub.com/docs/Cumulants/Vrq25/1.0.4/) completely over my head.
Suppose I have a vector of some data e.g.:
using Distributions
d = rand(Exponential(1), 1000)
The documentation suggests, so far as I can understand it, that cumulants(d, 3) should return the first three cumulants. The function is defined like so:
cumulants(data::Matrix{T}, m::Int = 4, b::Int = 2) where T<: AbstractFloat
A Matrix in Julia is, so far as I understand, a 2D array. So I convert my data to a 2D array:
dm = reshape(d, length(d), 1)
But I get:
julia> cumulants(dm,3)
ERROR: DimensionMismatch("bad block size 2 > 1")
My question concisely: how do I use Cumulants.jl to get the first m cumulants and the first m moments from some simulated data?
Thanks!
EDIT: In the above example, c = cumulants(dm,3,1) as suggested in a comment will give, for c:
3-element Array{SymmetricTensors.SymmetricTensor{Float64,N} where N,1}:
SymmetricTensors.SymmetricTensor{Float64,1}(Union{Nothing, Array{Float64,1}}[[1.0122452678071678]], 1, 1, 1, true)
SymmetricTensors.SymmetricTensor{Float64,2}(Union{Nothing, Array{Float64,2}}[[1.0336298356976195]], 1, 1, 1, true)
SymmetricTensors.SymmetricTensor{Float64,3}(Union{Nothing, Array{Float64,3}}[[2.5438037582591146]], 1, 1, 1, true)
I find that I can access the first, second, and third cumulants by:
c[1][1]
c[2][1,1]
c[3][1,1,1]
Which I arrived at essentially by guessing. I have no idea why this nutty output format exists. I still cannot figure out how to get the first m cumulants as a vector easily.
As I wrote in the comments, if you have a univariate problem you should use cumulants(dm, 3, 1), because the cumulants are calculated using tensors, and the tensors are stored in a block structure whose blocks are of size b×b, where b is the third argument of the function call. If you have only one column, the tensors have size 1, so it doesn't make sense to store them in 2×2 blocks.
To access the cumulants in Array form you have to convert them first. This is done by Array(cumulants(data, nc, b)[c]), where nc is the number of cumulants you want to calculate, b is the block size (for efficient storage of the tensors), and c is the order of the cumulant you need.
Summing up:
using Cumulants
# univariate data
unidata = rand(1000,1)
uc = cumulants(unidata, 3, 1)
Array(uc[1])
#1-element Array{Float64,1}:
# 0.48772026299259374
Array(uc[2])
#1×1 Array{Float64,2}:
# 0.0811428357438324
Array(uc[3])
#[:, :, 1] =
# 0.0008653019738796724
# multivariate data
multidata = rand(1000,3)
mc = cumulants(multidata, 3, 2)
Array(mc[1])
#3-element Array{Float64,1}:
# 0.5024511157116442
# 0.4904838734508787
# 0.48286680648519215
Array(mc[2])
#3×3 Array{Float64,2}:
# 0.0834021 -0.00368562 -0.00151614
# -0.00368562 0.0835084 0.00233202
# -0.00151614 0.00233202 0.0808521
Array(mc[3])
# [:, :, 1] =
# -0.000506926 -0.000763061 -0.00183751
# -0.000763061 -0.00104804 -0.00117227
# -0.00183751 -0.00117227 0.00112968
#
# [:, :, 2] =
# -0.000763061 -0.00104804 -0.00117227
# -0.00104804 0.000889305 -0.00116559
# -0.00117227 -0.00116559 -0.000106866
#
# [:, :, 3] =
# -0.00183751 -0.00117227 0.00112968
# -0.00117227 -0.00116559 -0.000106866
# 0.00112968 -0.000106866 0.00131965
The optimal block size can be found in their software paper (https://arxiv.org/pdf/1701.05420.pdf), where they write (for proper LaTeX formatting have a look at the paper):
5.2.1. The optimal size of blocks.
The number of coefficients required to store a super-symmetric tensor of order d and n dimensions is equal to (d+n−1 choose n). Storing the tensor while disregarding the super-symmetry requires n^d coefficients. The block structure introduced in [49] uses more than the minimal amount of memory but allows for easier further processing of super-symmetric tensors. If we store the super-symmetric tensor in the block structure, the block size parameter b appears. In our implementation, in order to store a super-symmetric tensor in the block structure we need, assuming n|b, an array of (n/b)^d pointers to blocks and an array of the same size of flags that indicate whether a pointer points to a valid block. Recall that diagonal blocks contain redundant information. Therefore, on the one hand, the smaller the value of b, the fewer redundant elements on the diagonals of the block structure. On the other hand, the larger the value of b, the smaller the number of blocks, the smaller the blocks' operation overhead, and the fewer pointers pointing to empty blocks. For a detailed discussion of memory usage see [49]. The analysis of the influence of the parameter b on the computational time of cumulants for some parameters is presented in Fig. 2. We obtain the shortest computation time for b = 2 in almost all test cases, and this value is set as the default and used in all efficiency tests. Note that for b = 1 we lose all the memory savings.
Using Oskar's helpful answer, I thought I'd provide my wrapper function which accomplishes the goal of returning a vector of the first m cumulants, given an input of a 1D array of data.
using Cumulants
# Given a 1D array of data d, return a vector of the first m cumulants
function mycumulants(d, m)
    res = zeros(m)
    dm = reshape(d, length(d), 1)  # convert the 1D array to a 2D one
    c = cumulants(dm, m, 1)        # block size 1 is needed, or else it errors
    for i in 1:m
        res[i] = Array(c[i])[1]
    end
    return res
end
But it turns out this is really slow compared to just directly calculating the raw moments and converting them to cumulants, e.g. k[5] = u[5] - 5*u[4]*u[1] - 10*u[3]*u[2] + 20*u[3]*u[1]^2 + 30*u[2]^2*u[1] - 60*u[2]*u[1]^3 + 24*u[1]^5, so I think I won't be using Cumulants.jl after all for my purposes, which only involve univariate data at this time. (A sketch of that direct approach follows the timings below.)
Example of time difference for calculating the first six cumulants from some simulated data:
----Data set 2----
Direct calculation:
1.997 ms (14 allocations: 469.47 KiB)
Cumulants.jl:
152.798 ms (318435 allocations: 17.59 MiB)
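For reference, a minimal sketch of that direct univariate approach (written out only up to order 3 to keep it short; the function name is just illustrative):
using Statistics: mean

function direct_cumulants3(d)
    u = [mean(d .^ p) for p in 1:3]      # raw moments u[1], u[2], u[3]
    k1 = u[1]
    k2 = u[2] - u[1]^2                   # variance
    k3 = u[3] - 3u[2]*u[1] + 2u[1]^3     # third cumulant
    return [k1, k2, k3]
end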

pagerank in igraph with isolated node

I wonder how page_rank() in R's igraph package works when there are isolated nodes. For example,
g <- graph(edges=c(1,2), n = 3, directed = F)
page_rank(g, algo = "prpack")
I got (with the default damping factor of 0.85):
$vector
[1] 0.46511628 0.46511628 0.06976744
Why is this the result? I thought node 3 should be 0.15 / 3.
I think I figured out the reason. Using the standard PageRank algorithm (see the Wikipedia article), I get (1/3, 1/3, 1/20). Normalizing it to a distribution, I get (0.46511628, 0.46511628, 0.06976744).
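Spelled out with damping $d = 0.85$ and $N = 3$: the isolated node gets $\mathrm{PR}(3) = \frac{1-d}{N} = \frac{0.15}{3} = \frac{1}{20}$, while the two connected nodes satisfy $\mathrm{PR}(1) = \mathrm{PR}(2) = \frac{1-d}{N} + d\,\mathrm{PR}(2)$, giving $\mathrm{PR}(1) = \mathrm{PR}(2) = \frac{1}{3}$. The vector $(1/3,\ 1/3,\ 1/20)$ sums to $43/60$, and dividing by that sum gives $(20/43,\ 20/43,\ 3/43) \approx (0.4651,\ 0.4651,\ 0.0698)$, matching the prpack output.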

Create directed random graph specifying alpha of power-law degree distribution

I have a real directed graph for which I know the number of nodes and edges. The degree distribution approximates a power-law distribution. Now I want to create a random graph replicating the following features of my real graph:
Number of nodes
Number of edges
(Similar) power-law in-degree and out-degree distributions
Let's assume g is my real graph of 10000 nodes and 30000 edges
exp.out = 2.2
exp.in = 2.3
set.seed(123)
g <- static.power.law.game(10000, 30000, exp.out, exp.in, multiple=TRUE)
Yet I don't know exp.out and exp.in. Then I try to estimate the power-law exponents with the plfit function (downloaded here):
plfit(degree(g, mode="in")+1)
# $xmin
# [1] 5
#
# $alpha
# [1] 2.97
#
# $D
# [1] 0.01735342
plfit(degree(g, mode="out")+1)
# $xmin
# [1] 5
#
# $alpha
# [1] 2.83
#
# $D
# [1] 0.01589222
From which I then derive my distribution functions (respectively for indegree and outdegree):
p(x) ~ x^-2.97 for x >= 5
p(x) ~ x^-2.83 for x >= 5
According to the documentation of static.power.law.game
The game simply uses static.fitness.game with appropriately
constructed fitness vectors. In particular, the fitness of vertex i is
i^(-alpha), where alpha = 1/(gamma-1) and gamma is the exponent given
in the arguments
As far as I understand it, to replicate my alphas I should pass as gammas respectively 1.3367 (from 2.97 = 1/(x-1)) and 1.35336 (from 2.83 = 1/(x-1)). Then
set.seed(321)
random.g <- static.power.law.game(10000, 30000, 1.35336, 1.3367, multiple=TRUE)
# Error in .Call("R_igraph_static_power_law_game", no.of.nodes, no.of.edges, :
# At games.c:3748 : out-degree exponent must be >= 2, Invalid value
Yet the fact that static.power.law.game only takes degree exponents greater than or equal to 2 makes me think that I am probably missing something...
exp.out and exp.in should simply be the desired exponents of the out-degree and in-degree distributions; there is no need to do any transformation on the exponents you have obtained from plfit. However, note that it is unlikely that you will recover your "observed" exponents exactly, due to finite-size effects.
