I have a 3 million x 9 million sparse matrix with several billion non-zero entries. R and Python do not allow sparse matrices with more than MAXINT non-zero entries, thus why I found myself using Julia.
While scaling this data with the standard deviation is trivial, demeaning is of course a no-go in a naive manner as that would create a dense, 200+ terabyte matrix.
The relevant code for doing svd is julia can be found at https://github.com/JuliaLang/julia/blob/343b7f56fcc84b20cd1a9566fd548130bb883505/base/linalg/arnoldi.jl#L398
From my reading, a key element of this code is the AtA_or_AAt struct and several of the functions around those, specifically A_mul_B!. Copied below for your convenience
struct AtA_or_AAt{T,S} <: AbstractArray{T, 2}
A::S
buffer::Vector{T}
end
function AtA_or_AAt(A::AbstractMatrix{T}) where T
Tnew = typeof(zero(T)/sqrt(one(T)))
Anew = convert(AbstractMatrix{Tnew}, A)
AtA_or_AAt{Tnew,typeof(Anew)}(Anew, Vector{Tnew}(max(size(A)...)))
end
function A_mul_B!(y::StridedVector{T}, A::AtA_or_AAt{T}, x::StridedVector{T}) where T
if size(A.A, 1) >= size(A.A, 2)
A_mul_B!(A.buffer, A.A, x)
return Ac_mul_B!(y, A.A, A.buffer)
else
Ac_mul_B!(A.buffer, A.A, x)
return A_mul_B!(y, A.A, A.buffer)
end
end
size(A::AtA_or_AAt) = ntuple(i -> min(size(A.A)...), Val(2))
ishermitian(s::AtA_or_AAt) = true
This is passed into the eigs function, where some magic happens, and the output is then processed in to the relevant components for SVD.
I think the best way to make this work for a 'centering on the fly' type setup is to do something like subclass AtA_or_AAT with a AtA_or_AAT_centered version that more or less mimics the behavior but also stores the column means, and redefines the A_mul_B! function appropriately.
However, I do not use Julia very much and have run in to some difficulty modifying things already. Before I try to dive into this again, I was wondering if I could get feedback if this would be considered an appropriate plan of attack, or if there is simply a much easier way of doing SVD on such a large matrix (I haven't seen it, but I may have missed something).
edit: Instead of modifying base Julia, I've tried writing a "Centered Sparse Matrix" package that keeps the sparsity structure of the input sparse matrix, but enters the column means where appropriate in various computations. It's limited in what it has implemented, and it works. Unfortunately, it is still too slow, despite some pretty extensive efforts to try to optimize things.
After much fuddling with the sparse matrix algorithm, I realized that distributing the multiplication over the subtraction was dramatically more efficient:
If our centered matrix Ac is formed from the original nxm matrix A and its vector of column means M, with a nx1 vector of ones that I will just call 1. We are multiplying by a mxk matrix X
Ac := (A - 1M')
AcX = X
= AX - 1M'X
And we are basically done. Stupidly simple, actually.
AX is can be carried out with the usual sparse matrix multiplication function, M'X is a dense vector-matrix inner product, and the vector of 1's "broadcasts" (to use Julia's terminology) to each row of the AX intermediate result. Most languages have a way of doing that broadcasting without realizing the extra memory allocation.
This is what I've implemented in my package for AcX and Ac'X. The resulting object can then be passed to algorithms, such as the svds function, which only depend on matrix multiplication and transpose multiplication.
Related
I am trying R package apcluster on a set of objects that I want to cluster, but I'm running into performance/memory problems, and I suspect I'm not doing it right. I'd like to hear your opinion, please.
In short: I have a set of about 13000 objects. Each object is associated with a set of 2 to 5 'features'. The similarity (by which I want to cluster, eventually) between any two objects i and j is equal to the number of features they have in common divided by the total number of distinct features they 'span'. E.g. if i = {a,b,c} and j = {c,d}, then sim[i,j] = 1/4 = 0.25, because they have only 1 feature in common ({c}) and in total they describe 4 distinct features ({a,b,c,d}).
Calculating my NxN similarity matrix is not a problem in theory: it can be done using set operations if each object's features are stored as a list; or features can be pivoted to a matrix of 1's and 0's, where each column is a feature, and then R's function dist with method="binary" does the trick.
In practice however, the first problem is that such similarity calculations are extremely slow. For 13 K objects, there are about 84.5 M similarities to calculate, but this doesn't sound so bad for a modern computer. I don't understand why it should take a few hours to do that. And the set operation version, that should be quicker as far as I can tell, is actually much slower than dist. [Another package called fingerprint is supposed to deal with such cases more efficiently, but so far I haven't been able to make it work, it gives a lot of errors when trying to make what they call 'featvec' objects].
The other thing to consider is that the 2-5 features per object are not very repetitive. There may be a group of 100 or so objects with at least one feature in common between them, but then none of the other 12.9 K objects has any feature in common with these 100 objects. The consequence is that the pivoted feature matrix is very sparse (if we consider 0's as empty). There are about 4000 columns in the pivoted matrix, and each row has at most 5 1's. I wonder if this is negatively impacting the performance of dist, in that it has to multiply through a lot of 0's that could instead be ignored.
Does it seem normal to you that it should take a few hours to apply dist to a matrix like the one I described? Can you suggest a different way to calculate the similarity that takes advantage of the sparseness of the matrix?
Anyway, I managed to get the output from dist, which however had class 'dist', and was a distance matrix, not a similarity one, so I had to use 1 - as.matrix(distance_matrix) to be able to make the similarity matrix apcluster needs as input.
That's when I got the first 'memory' problem. R said the vector could not be allocated due to its size. I tried the usual tricks, but in the end I could not get more than 4 GB, and my matrices are (apparently) bigger.
I overcame this by assigning each time new matrices to their old 'self'.
And then when I submitted this painstakingly put together similarity matrix to apcluster, again the vector size error popped up, as if the first thing apcluster did was create some other large object from what I had fed it.
I had a look at as.Sparse... in apcluster, but it does not seem to help a lot, considering that you have to calculate the full matrix first anyway.
In the end the only thing that worked a little bit was 'leveraged affinity propagation' by apclusterL, which however is an approximation.
Does anybody know if and how I could do this better? E.g. is it wise to pivot the data first, or should I stick to list and set operations? Or, can the fact that the initial matrix is sparse be used to compute directly a sparse similarity matrix, rather than compute it fully and reduce it to sparse later?
Any advice would be greatly appreciated. Thanks!
BTW, yes, I saw this thread: Cluster Analysis in R on large sparse matrix ; which does not seem to have been answered conclusively.
The R interpreter is really slow.
So you should use R mostly to "drive" your program, but implement all the computations heavy stuff in C or FORTRAN.
You didn't show the code you are using, but I guess it involves nested for loops? Try to rewrite it without any for loops in R, or rewrite it in C.
But no matter what, AP clustering will always remain very slow. It involves many passes over O(n²) matrixes, i.e. it scales very badly.
I have a difficult R computation to do, and I have an option of 2 computers, called V and L, to run the code. V is supposed to be faster than L, but I did not experience this. So I decided to test it out.
As a simple test, I decided to ask them invert a 3000*3000 matrice 500 times, and record the time.
set.seed(123)
I=500
n=3000
time=matrix(NA,ncol=3,nrow=I)
for(i in 1:I){
t0<-proc.time()
x<-solve(matrix(runif(n^2),n))
mt1<-proc.time()
time[i,]<-(mt1-t0)[1:3]
}
The problem is that during a particular iteration, it got stuck. I don't know why but I suspect it is because the matrix generated was near singular. So I would like to improve the code. I can think of 3 ways:
make sure the matrix generated is easily invertible. But how do i enforce this??? Of course, any solution needs to be computationally inexpensive, otherwise the exercise becomes meaningless.
ask R to skip that iteration if solve takes too long? But again, how do I do that?
assign them a different computation task instead, any recommendation?
A random matrix is invertible with probability 1, meaning that, in practice, the probability of generating a singular (i.e. non-invertible) matrix is infinitesimally small.
Moreover, from the point of view of the algorithm that R uses to invert matrices, there is no such thing as an "easily invertible" matrix. Either the algorithm succeeds, or it determines that a matrix is singular and fails. But there is no scenario under which it tries "really hard" and takes a long time to invert a matrix. It's a deterministic algorithm which either runs into a 0 (or a value smaller than some given epsilon), in which case if fails, or else it doesn't.
On which iteration do you get stuck? Are you sure you are getting stuck on the inversion of the matrix, and it's not something like garbage collection that is taking a long time?
I can't reproduce the problem you describe. Starting with random seed 123, I can invert 500 random 3000x3000 matrices in a row, using your code, without any significant timing discrepancies. Can you find a random seed that generates a "hard to invert matrix" directly?
What would be an efficient way (in terms of CPU-time and/or memory requirements) of multiplying, in fortran9x, an arbitrary M x N matrix, say A, only containing +1 and -1 as its entries (and fully populated!), with an arbitrary (dense) N-dimensional vector, v?
Many thanks,
Osmo
P.S. The size of A (i.e., M and N) is not known at the compilation time.
My guess is that it would be faster to just do the multiplication instead of trying to avoid the multiplication by checking the sign of the matrix element and adding/subtracting accordingly. Hence, just use a general optimized matrix-vector multiply routine. E.g. xGEMV from BLAS.
Depending on the usage scenario, if you have to apply the same matrix multiple times, you might separate it into two parts, one with the positive entries and one with the negatives.
With this you can avoid the need for multiplications, however it would introduce an indirection, which might be more expensive then the multiplications.
Thus janneb's solution might be the most suitable.
i am trying to make a 100 x 100 tridiagonal matrix with 2's going down the diagonal and -1's surrounding the 2's. i can make a tridiagonal matrix with only 1's in the three diagonals and preform matrix addition to get what i want, but i want to know if there is a way to customize the three diagonals to what ever you want. maplehelp doesn't list anything useful.
The Matrix function in the LinearAlgebra package can be called with a parameter (init) that is a function that can assign a value to each entry of the matrix depending on its position.
This would work:
f := (i, j) -> if i = j then 2 elif abs(i - j) = 1 then -1 else 0; end if;
Matrix(100, f);
LinearAlgebra[BandMatrix] works too (and will be WAY faster), especially if you use storage=band[1]. You should probably use shape=symmetric as well.
The answers involving an initializer function f will do O(n^2) work for square nxn Matrix. Ideally, this task should be O(n), since there are just less than 3*n entries to be filled.
Suppose also that you want a resulting Matrix without any special (eg. band) storage or indexing function (so that you can later write to any part of it arbitrarily). And suppose also that you don't want to get around such an issue by wrapping the band structure Matrix with another generic Matrix() call which would double the temp memory used and produce collectible garbage.
Here are two ways to do it (without applying f to each entry in an O(n^2) manner, or using a separate do-loop). The first one involves creation of the three bands as temps (which is garbage to be collected, but at least not n^2 size of it).
M:=Matrix(100,[[-1$99],[2$100],[-1$99]],scan=band[1,1]);
This second way uses a routine which walks M and populates it with just the three scalar values (hence not needing the 3 band lists explicitly).
M:=Matrix(100):
ArrayTools:-Fill(100,2,M,0,100+1);
ArrayTools:-Fill(99,-1,M,1,100+1);
ArrayTools:-Fill(99,-1,M,100,100+1);
Note that ArrayTools:-Fill is a compiled external routine, and so in principal might well be faster than an interpreted Maple language (proper) method. It would be especially fast for a Matrix M with a hardware datatype such as 'float[8]'.
By the way, the reason that the arrow procedure above failed with error "invalid arrow procedure" is likely that it was entered in 2D Math mode. The 2D Math parser of Maple 13 does not understand the if...then...end syntax as the body of an arrow operator. Alternatives (apart from writing f as a proc like someone else answered) is to enter f (unedited) in 1D Maple notation mode, or to edit f to use the operator form of if. Perhaps the operator form of if here requires a nested if to handle the elif. For example,
f := (i,j) -> `if`(i=j,2,`if`(abs(i-j)=1,-1,0));
Matrix(100,f);
jmbr's proposed solutions can be adapted to work:
f := proc(i, j)
if i = j then 2
elif abs(i - j) = 1 then -1
else 0
end if
end proc;
Matrix(100, f);
Also, I understand your comment as saying you later need to destroy the band matrix nature, which prevents you from using BandMatrix - is that right? The easiest solution to that is to wrap the BandMatrix call in a regular Matrix call, which will give you a Matrix you can change however you'd like:
Matrix(LinearAlgebra:-BandMatrix([1,2,1], 1, 100));
Disclaimer
This is not strictly a programming question, but most programmers soon or later have to deal with math (especially algebra), so I think that the answer could turn out to be useful to someone else in the future.
Now the problem
I'm trying to check if m vectors of dimension n are linearly independent. If m == n you can just build a matrix using the vectors and check if the determinant is != 0. But what if m < n?
Any hints?
See also this video lecture.
Construct a matrix of the vectors (one row per vector), and perform a Gaussian elimination on this matrix. If any of the matrix rows cancels out, they are not linearly independent.
The trivial case is when m > n, in this case, they cannot be linearly independent.
Construct a matrix M whose rows are the vectors and determine the rank of M. If the rank of M is less than m (the number of vectors) then there is a linear dependence. In the algorithm to determine the rank of M you can stop the procedure as soon as you obtain one row of zeros, but running the algorithm to completion has the added bonanza of providing the dimension of the spanning set of the vectors. Oh, and the algorithm to determine the rank of M is merely Gaussian elimination.
Take care for numerical instability. See the warning at the beginning of chapter two in Numerical Recipes.
If m<n, you will have to do some operation on them (there are multiple possibilities: Gaussian elimination, orthogonalization, etc., almost any transformation which can be used for solving equations will do) and check the result (eg. Gaussian elimination => zero row or column, orthogonalization => zero vector, SVD => zero singular number)
However, note that this question is a bad question for a programmer to ask, and this problem is a bad problem for a program to solve. That's because every linearly dependent set of n<m vectors has a different set of linearly independent vectors nearby (eg. the problem is numerically unstable)
I have been working on this problem these days.
Previously, I have found some algorithms regarding Gaussian or Gaussian-Jordan elimination, but most of those algorithms only apply to square matrix, not general matrix.
To apply for general matrix, one of the best answers might be this:
http://rosettacode.org/wiki/Reduced_row_echelon_form#MATLAB
You can find both pseudo-code and source code in various languages.
As for me, I transformed the Python source code to C++, causes the C++ code provided in the above link is somehow complex and inappropriate to implement in my simulation.
Hope this will help you, and good luck ^^
If computing power is not a problem, probably the best way is to find singular values of the matrix. Basically you need to find eigenvalues of M'*M and look at the ratio of the largest to the smallest. If the ratio is not very big, the vectors are independent.
Another way to check that m row vectors are linearly independent, when put in a matrix M of size mxn, is to compute
det(M * M^T)
i.e. the determinant of a mxm square matrix. It will be zero if and only if M has some dependent rows. However Gaussian elimination should be in general faster.
Sorry man, my mistake...
The source code provided in the above link turns out to be incorrect, at least the python code I have tested and the C++ code I have transformed does not generates the right answer all the time. (while for the exmample in the above link, the result is correct :) -- )
To test the python code, simply replace the mtx with
[30,10,20,0],[60,20,40,0]
and the returned result would be like:
[1,0,0,0],[0,1,2,0]
Nevertheless, I have got a way out of this. It's just this time I transformed the matalb source code of rref function to C++. You can run matlab and use the type rref command to get the source code of rref.
Just notice that if you are working with some really large value or really small value, make sure use the long double datatype in c++. Otherwise, the result will be truncated and inconsistent with the matlab result.
I have been conducting large simulations in ns2, and all the observed results are sound.
hope this will help you and any other who have encontered the problem...
A very simple way, that is not the most computationally efficient, is to simply remove random rows until m=n and then apply the determinant trick.
m < n: remove rows (make the vectors shorter) until the matrix is square, and then
m = n: check if the determinant is 0 (as you said)
m < n (the number of vectors is greater than their length): they are linearly dependent (always).
The reason, in short, is that any solution to the system of m x n equations is also a solution to the n x n system of equations (you're trying to solve Av=0). For a better explanation, see Wikipedia, which explains it better than I can.