BLAS: gemm vs. gemv - linear-algebra

Why does BLAS have a gemm function for matrix-matrix multiplication and a separate gemv function for matrix-vector multiplication? Isn't matrix-vector multiplication just a special case of matrix-matrix multiplication where one matrix only has one row/column?

Mathematically, matrix-vector multiplication is a special case of matrix-matrix multiplication, but that's not necessarily true of them as realized in a software library.
They support different options. For example, gemv supports strided access to the vectors on which it is operating, whereas gemm does not support strided matrix layouts. In the C language bindings, gemm requires that you specify the storage ordering of all three matrices, whereas that is unnecessary in gemv for the vector arguments because it would be meaningless.
Besides supporting different options, there are families of optimizations that might be performed on gemm that are not applicable to gemv. If you know that you are doing a matrix-vector product, you don't want the library to waste time figuring out that's the case before switching into a code path that is optimized for that case; you'd rather call it directly instead.
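To make the difference concrete, here is a small sketch (not taken from the answer above) using the CBLAS C bindings: dgemv takes explicit strides (incX, incY) for its vectors, while dgemm only takes a layout, transpose flags, and leading dimensions for its matrices. The values are arbitrary and only meant to show the parameter lists.

#include <cblas.h>

int main(void)
{
    double A[4] = {1, 2,
                   3, 4};          /* 2x2 matrix, row-major */
    double x[4] = {1, 0, 2, 0};    /* logical vector (1, 2) stored with stride 2 */
    double y[2] = {0, 0};

    /* y = 1.0*A*x + 0.0*y, reading x with stride incX = 2 */
    cblas_dgemv(CblasRowMajor, CblasNoTrans, 2, 2,
                1.0, A, 2, x, 2, 0.0, y, 1);

    /* the matrix-matrix analogue has no per-operand stride, only layout,
       transpose flags, and leading dimensions */
    double B[4] = {1, 0,
                   0, 1};
    double C[4] = {0};
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 2, 2, 2,
                1.0, A, 2, B, 2, 0.0, C, 2);

    return 0;
}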

When you optimize gemv and gemm, different techniques apply:
For the matrix-matrix operation you use blocked algorithms, whose block sizes depend on the cache sizes.
For optimizing the matrix-vector product you use so-called fused Level 1 operations (e.g. fused dot products or fused axpy).
Let me know if you want more details.
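For illustration, here is a rough, hedged sketch of the two ideas in plain C (nowhere near a real BLAS implementation): a blocked matrix-matrix product where the block size NB is a made-up tuning parameter, and a "fused" matrix-vector product that computes two dot products per pass over x.

#include <stddef.h>

#define NB 64   /* hypothetical block size, tuned so three NB x NB blocks fit in cache */

/* blocked C += A*B for n x n row-major matrices */
static void gemm_blocked(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += NB)
        for (size_t kk = 0; kk < n; kk += NB)
            for (size_t jj = 0; jj < n; jj += NB)
                for (size_t i = ii; i < ii + NB && i < n; ++i)
                    for (size_t k = kk; k < kk + NB && k < n; ++k) {
                        double aik = A[i * n + k];
                        for (size_t j = jj; j < jj + NB && j < n; ++j)
                            C[i * n + j] += aik * B[k * n + j];
                    }
}

/* "fused" y = A*x for an m x n row-major matrix: two dot products share
   each pass over x, so x is loaded from memory only once per pair of rows */
static void gemv_fused2(size_t m, size_t n, const double *A, const double *x, double *y)
{
    size_t i = 0;
    for (; i + 1 < m; i += 2) {
        double s0 = 0.0, s1 = 0.0;
        for (size_t j = 0; j < n; ++j) {
            s0 += A[i * n + j] * x[j];
            s1 += A[(i + 1) * n + j] * x[j];
        }
        y[i] = s0;
        y[i + 1] = s1;
    }
    if (i < m) {                         /* leftover row when m is odd */
        double s = 0.0;
        for (size_t j = 0; j < n; ++j)
            s += A[i * n + j] * x[j];
        y[i] = s;
    }
}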

I think it just fits the BLAS hierarchy better, with its level 1 (vector-vector), level 2 (matrix-vector) and level 3 (matrix-matrix) routines. And it may be optimizable a bit better if you know one operand is only a vector.

Related

What is the point of the Symmetric type in Julia?

What is the point of the Symmetric type in the LinearAlgebra package of Julia? It seems like it is equivalent to the type Hermitian for real matrices (although: is this true?). If that is true, then the only case for which Symmetric is not redundant with Hermitian is for complex matrices, and it would be surprising to want to have a symmetric as opposed to Hermitian complex matrix (maybe I am mistaken on that though).
I ask this question in part because I sometimes find myself doing casework like this: if I have a real matrix, then use Symmetric; if complex, then Hermitian. It seems though that I could save work by just always using Hermitian. Will I be missing out on performance or otherwise if I do this?
(Also, bonus question that may be related: why is there no HermTridiagonal type in addition to SymTridiagonal? I could use the former. Plus, it seems more useful than SymTridiagonal in consideration of the above.)
To copy the answer from the linked Discourse thread (via @stevengj):
Always use Hermitian. For real elements, there is no penalty compared to Symmetric.
There aren’t any specialized routines for complex Symmetric matrices that I know of. My feeling is that it was probably a mistake to have a separate Symmetric type in LinearAlgebra, but it is hard to remove at this point.

OpenCL and the power iteration method (eigendecomposition)

I'm new to OpenCL and I'm trying to implement the power iteration method (described over here).
Matrix sizes are over 100000x100000!
Actually I have no idea how to implement this.
This is because work-groups have the CL_DEVICE_MAX_WORK_GROUP_SIZE restriction (so I can't make one work-group with 1000000 work-items).
But on each iteration step I need to synchronize and normalize the vector.
1) So is it possible to do all calculations inside one kernel? (I think the answer is no if the matrix size is larger than CL_DEVICE_MAX_WORK_GROUP_SIZE)
2) Can I make a "while" loop in the host code? And is it still profitable to use the GPU in this case?
something like:
while (condition)
{
    kernel calling
    synchronization
}
2: Yes, you can make a while loop in host code. Whether this is still profitable in terms of performance depends on whether the kernel that is called achieves a good speedup. My personal preference is not to pack too much logic into a single kernel, because smaller kernels are easier to maintain and sometimes easier to optimize. But of course, invoking a kernel has a (small) overhead that has to be taken into account. And whether combining two kernels into one can bring a speedup (or new potential for optimizations) depends on what the kernels are actually doing. But in this case (matrix multiplication and vector normalization) I'd personally start with two different kernels that are invoked from the host in a while loop.
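As a rough, hedged sketch of what that host-side loop could look like (the kernel names "matvec" and "normalize", their argument lists, and the surrounding setup are all assumptions, not code from the question):

#include <CL/cl.h>

/* One power-iteration loop: y = A*x, then x = y/||y||, repeated max_iters times.
   Context, program, kernels and buffers are assumed to exist already;
   global_size must be a multiple of local_size (see the rounding sketch below). */
void power_iteration(cl_command_queue queue,
                     cl_kernel matvec, cl_kernel normalize,
                     cl_mem x_buf, cl_mem y_buf,
                     size_t global_size, size_t local_size,
                     int max_iters)
{
    for (int iter = 0; iter < max_iters; ++iter) {
        /* y = A*x (the matrix buffer is assumed to be bound as another kernel arg) */
        clSetKernelArg(matvec, 0, sizeof(cl_mem), &x_buf);
        clSetKernelArg(matvec, 1, sizeof(cl_mem), &y_buf);
        clEnqueueNDRangeKernel(queue, matvec, 1, NULL,
                               &global_size, &local_size, 0, NULL, NULL);

        /* x = y / ||y|| */
        clSetKernelArg(normalize, 0, sizeof(cl_mem), &y_buf);
        clSetKernelArg(normalize, 1, sizeof(cl_mem), &x_buf);
        clEnqueueNDRangeKernel(queue, normalize, 1, NULL,
                               &global_size, &local_size, 0, NULL, NULL);

        /* host-side synchronization point; a real implementation would also
           read back a convergence measure here and break out of the loop */
        clFinish(queue);
    }
}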
1: Since a 100000x100000 matrix with float values will take at least 40GB of memory, you'll have to think about the approach in general anyhow. There is a vast amount of literature on matrix operations, their parallelization, and the corresponding implementations on the GPU. One important aspect from the "high level" point of view is whether the matrices are dense or sparse ( http://en.wikipedia.org/wiki/Sparse_matrix ). Depending on the sparsity, it might even be possible to handle 100000x100000 matrices in main memory. Apart from that, you might consider having a look at a library for matrix operations (e.g. http://viennacl.sourceforge.net/ ), because implementing an efficient matrix multiplication is challenging, particularly for sparse matrices. But if you want to go the whole way on your own: good luck ;-) and ...

... the CL_DEVICE_MAX_WORK_GROUP_SIZE imposes no limitation on the problem size. In fact, the problem size (that is, the total number of work-items) in OpenCL is virtually unlimited. If your CL_DEVICE_MAX_WORK_GROUP_SIZE is 256 and you want to handle 10000000000 elements, then you create 10000000000/256 work-groups and let OpenCL take care of how they are actually dispatched and executed. For matrix operations, the CL_DEVICE_MAX_WORK_GROUP_SIZE is primarily relevant when you want to use local memory (and you will have to, in order to achieve good performance): the size of the work-groups thus implicitly defines how large your chunks of local memory may be.
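For a feel of the numbers in this answer, a tiny back-of-the-envelope program (matrix size taken from the question, 256 chosen as an example work-group size; assumes a 64-bit build):

#include <stdio.h>

int main(void)
{
    unsigned long long n          = 100000ULL;             /* matrix dimension */
    unsigned long long bytes      = n * n * sizeof(float); /* dense float storage */
    unsigned long long items      = n * n;                 /* one work-item per element, say */
    unsigned long long local_size = 256;                   /* example CL_DEVICE_MAX_WORK_GROUP_SIZE */
    unsigned long long groups     = (items + local_size - 1) / local_size;

    printf("dense matrix: %llu bytes (about %.0f GB)\n", bytes, bytes / 1e9);
    printf("work-groups of size %llu needed: %llu\n", local_size, groups);
    printf("global size to enqueue: %llu\n", groups * local_size);
    return 0;
}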

math.h functions in fixed-point (32,32) format (64-bit) library for C

I'm looking for a 64-bit fixed-point (32,32) library for one of my C implementations.
Similar to this one http://code.google.com/p/libfixmath/
I need support for the standard math.h operations.
Has anyone seen such an implementation?
fixedptc seems to be what you are looking for. It is a header-only, integer-only C library for fixed-point operations, located at http://www.sourceforge.net/projects/fixedptc
Bit width is settable through defines. In your case, you want to compile with -DFIXEDPT_BITS=64 -DFIXEDPT_WBITS=32 to get a (32,32) fixed-point number format.
Implemented functions are conversion to string, multiplication, division, square root, sine, cosine, tangent, exponential, power, natural logarithm and arbitrary-base logarithm.
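Independent of fixedptc, here is a minimal standalone sketch of what a (32,32) format boils down to, in case you want to understand or verify the basics: the value is stored as a signed 64-bit integer scaled by 2^32, and multiplication is widened to 128 bits before shifting back (assumes gcc/clang __int128 and arithmetic right shift).

#include <stdint.h>
#include <stdio.h>

typedef int64_t fix32_32;               /* value stored as round(x * 2^32) */
#define FIX_FRAC_BITS 32

static fix32_32 fix_from_double(double x) { return (fix32_32)(x * 4294967296.0); }
static double   fix_to_double(fix32_32 f) { return (double)f / 4294967296.0; }

static fix32_32 fix_mul(fix32_32 a, fix32_32 b)
{
    /* (a*2^32)*(b*2^32) = (a*b)*2^64, so shift right by 32 to get (a*b)*2^32;
       right-shifting a negative value is arithmetic on gcc/clang */
    return (fix32_32)(((__int128)a * (__int128)b) >> FIX_FRAC_BITS);
}

int main(void)
{
    fix32_32 a = fix_from_double(1.5);
    fix32_32 b = fix_from_double(-2.25);
    printf("1.5 * -2.25 = %f\n", fix_to_double(fix_mul(a, b)));   /* about -3.375 */
    return 0;
}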

Fortran: multiplication with matrices only containing +1 and -1 as entries

What would be an efficient way (in terms of CPU time and/or memory requirements) of multiplying, in Fortran 9x, an arbitrary M x N matrix, say A, containing only +1 and -1 as its entries (and fully populated!), with an arbitrary (dense) N-dimensional vector, v?
Many thanks,
Osmo
P.S. The size of A (i.e., M and N) is not known at compile time.
My guess is that it would be faster to just do the multiplication instead of trying to avoid the multiplication by checking the sign of the matrix element and adding/subtracting accordingly. Hence, just use a general optimized matrix-vector multiply routine. E.g. xGEMV from BLAS.
Depending on the usage scenario, if you have to apply the same matrix multiple times, you might separate it into two parts, one with the positive entries and one with the negatives.
With this you can avoid the need for multiplications; however, it would introduce an indirection, which might be more expensive than the multiplications.
Thus janneb's solution might be the most suitable.
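To make the trade-off concrete, here is a small sketch (in C rather than Fortran, purely for illustration) of the two variants being compared: a plain multiply-accumulate, which is essentially what an optimized xGEMV does, and the add/subtract-on-sign variant that avoids the multiplications at the cost of a branch.

#include <stddef.h>

/* y = A*v with A stored row-major (m x n), straightforward multiply */
void matvec_mul(size_t m, size_t n, const double *A, const double *v, double *y)
{
    for (size_t i = 0; i < m; ++i) {
        double s = 0.0;
        for (size_t j = 0; j < n; ++j)
            s += A[i * n + j] * v[j];
        y[i] = s;
    }
}

/* same product, but adding or subtracting v[j] depending on the sign of A(i,j);
   the branch (or an indirection into precomputed index lists) is typically what
   makes this slower than just multiplying */
void matvec_sign(size_t m, size_t n, const double *A, const double *v, double *y)
{
    for (size_t i = 0; i < m; ++i) {
        double s = 0.0;
        for (size_t j = 0; j < n; ++j)
            s += (A[i * n + j] > 0.0) ? v[j] : -v[j];
        y[i] = s;
    }
}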

Accuracy of ZHEEV and ZHEEVD

I am using LAPACK to diagonalize complex Hermitian matrices. I can choose between ZHEEV and ZHEEVD. Which one of these routines is more accurate for matrices of the size 40 and a range of eigenvalues from 1E-2 to 1E1?
ZHEEVD uses a divide-and-conquer method to compute eigenvalues.
If your matrices are 40 x 40 and the eigenvalues are within the range [1e-2, 1e1], then you should have absolutely no numerical issues. You can use either routine.
I don't know the answer, but it probably depends on which LAPACK library you're using. There are a number of them out there, optimized for various platforms. Are you using Netlib, MKL, ACML, ...?
Why would you take a total stranger's word for this when you can measure it yourself?
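In that spirit, a hedged sketch of how one might measure it with the LAPACKE C interface (random test matrix, both routines, element-wise comparison of the ascending eigenvalues; the matrix contents and link flags such as -llapacke -llapack are assumptions):

#include <complex.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <lapacke.h>

#define N 40

int main(void)
{
    lapack_complex_double a[N * N] = {0}, b[N * N];
    double w_ev[N], w_evd[N];

    /* random Hermitian test matrix: fill the lower triangle, real diagonal */
    srand(42);
    for (int i = 0; i < N; ++i)
        for (int j = 0; j <= i; ++j) {
            double re = (double)rand() / RAND_MAX - 0.5;
            double im = (i == j) ? 0.0 : (double)rand() / RAND_MAX - 0.5;
            a[i * N + j] = lapack_make_complex_double(re, im);
        }
    for (int i = 0; i < N * N; ++i) b[i] = a[i];

    /* 'V' = also compute eigenvectors, 'L' = lower triangle is stored */
    LAPACKE_zheev (LAPACK_ROW_MAJOR, 'V', 'L', N, a, N, w_ev);
    LAPACKE_zheevd(LAPACK_ROW_MAJOR, 'V', 'L', N, b, N, w_evd);

    /* both routines return eigenvalues in ascending order, so compare directly */
    double maxdiff = 0.0;
    for (int i = 0; i < N; ++i) {
        double d = fabs(w_ev[i] - w_evd[i]);
        if (d > maxdiff) maxdiff = d;
    }
    printf("max |lambda_zheev - lambda_zheevd| = %g\n", maxdiff);
    return 0;
}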
