How do you do parallel matrix multiplication in Julia?

Is there a good way to do parallel matrix multiplication in julia? I tried using DArrays, but it was significantly slower than just a single-thread multiplication.

Parallel in what sense? If you mean single-machine, multi-threaded, then Julia does this by default as OpenBLAS (the underlying linear algebra library used) is multithreaded.
If you mean multiple-machine, distributed-computing-style, then you will be encountering a lot of communications overhead that will only be worth it for very large problems, and a customized approach might be needed.

The problem is most likely that direct matrix multiplication is normally performed by an optimized library function. In the case of OpenBLAS, this is already multithreaded. For arrays of size 2000x2000, the simple matrix multiplication
@time c = sa * sb;
takes 0.3 seconds multithreaded and 0.7 seconds single-threaded.
If you split the multiplication along one dimension by hand, the times get much worse and reach around 17 seconds in single-threaded mode.
@time for j = 1:n
    sc[:,j] = sa[:,:] * sb[:,j]
end
shared arrays
The solution to your problem might be the use of shared arrays, which share the same data across your processes on a single computer. Please note that shared arrays are still marked as experimental.
# create shared arrays and initialize them with random numbers
sa = SharedArray(Float64,(n,n),init = s -> s[localindexes(s)] = rand(length(localindexes(s))))
sb = SharedArray(Float64,(n,n),init = s -> s[localindexes(s)] = rand(length(localindexes(s))))
sc = SharedArray(Float64,(n,n));
Then you have to create a function, which performs a cheap matrix multiplication on a subset of the matrix.
@everywhere function mymatmul!(n,w,sa,sb,sc)
    # works only for 4 workers and n divisible by 4
    range = 1+(w-2) * div(n,4) : (w-1) * div(n,4)
    sc[:,range] = sa[:,:] * sb[:,range]
end
Finally, the main process tells the workers to work on their part.
@time @sync begin
    for w in workers()
        @async remotecall_wait(w, mymatmul!, n, w, sa, sb, sc)
    end
end
This takes around 0.3 seconds, which is the same as the multithreaded single-process time.
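The column split used by mymatmul! can be sketched in a few lines of Python. This is only an illustration of the partitioning idea, not the Julia code above: the helper names `column_ranges` and `matmul_columns` are made up here, and unlike the original formula the split also handles a worker count that does not divide n.

```python
def column_ranges(n, worker_ids):
    """Split columns 1..n into contiguous 1-based inclusive ranges, one per worker."""
    base, extra = divmod(n, len(worker_ids))
    ranges = {}
    start = 1
    for i, w in enumerate(worker_ids):
        # The first `extra` workers take one additional column each.
        width = base + (1 if i < extra else 0)
        ranges[w] = (start, start + width - 1)
        start += width
    return ranges


def matmul_columns(a, b, first, last):
    """Return columns first..last (1-based, inclusive) of the product a * b."""
    inner = len(b)
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(first - 1, last)]
            for i in range(len(a))]


# Worker ids 2..5 are assumed here, matching the (w-2) offset in the Julia function.
print(column_ranges(8, [2, 3, 4, 5]))
# {2: (1, 2), 3: (3, 4), 4: (5, 6), 5: (7, 8)}
```

Each worker only ever writes its own block of columns, which is why no locking is needed on the shared result.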

Here is my problem - I would like to generate a fairly large number of factorial combinations and then apply some constraints on them to narrow down the list of all possible combinations. However, this becomes an issue when the number of all possible combinations becomes extremely large.
Let's take an example - Assume we have 8 variables (A; B; C; etc.) each taking 3 levels/values (A={1,2,3}; B={1,2,3}; etc.).
The list of all possible combinations would contain 3^8 (= 6561) entries and can be generated as follows:
tic <- function(){start.time <<- Sys.time()}
toc <- function(){round(Sys.time() - start.time, 4)}
nX = 8
lk = as.list(NULL)
lk = lapply(1:nX, function(x) c(1,2,3))
mapx = expand.grid(lk)
mapx$idx = 1:nrow(mapx)
So far so good, these operations are done pretty quickly (< 1 second) even if we significantly increase the number of variables.
The next step is to generate a corrected set of all pairwise comparisons. (An uncorrected set would be obtained by freely combining all 6561 options with each other, leading to 6561 × 6561 = 43,046,721 combinations.) The size of this "universe" would be 6561 × (6561 - 1) / 2 = 21,520,080. Already pretty big!
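As a quick sanity check of these counts, here is a small Python sketch. The function name `universe_sizes` is made up for illustration, and the memory figure assumes the pairs are stored as two columns of 4-byte integers, which is only a rough lower bound on what R would need.

```python
from math import comb


def universe_sizes(levels, n_vars):
    """Return (combinations, ordered pairs, unordered pairs) for a full factorial design."""
    n_combos = levels ** n_vars
    return n_combos, n_combos ** 2, comb(n_combos, 2)


n_combos, ordered, unordered = universe_sizes(3, 8)
print(n_combos, ordered, unordered)  # 6561 43046721 21520080

# Assumption: two integer columns, 4 bytes per value.
print(round(unordered * 2 * 4 / 1e6), "MB")  # 172 MB
```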
I am using the R built-in function combn to get it done. In this example the running time remains acceptable (about 20 seconds on my PC), but things become impossible with a higher number of variables and/or more levels per variable (the running time grows very quickly; for example, it already took 177 seconds with 9 variables!). But my biggest concern is actually that the object would become so large that R can no longer handle it (memory issue).
univ = t(combn(mapx$idx,2))
The next step would be to identify the combinations meeting some pre-defined constraints. For instance, I would like to select all pairs sharing exactly 3 common elements (i.e. 3 variables take the same values). Again the running time is very long (even with 8 variables), as my approach is to loop over all the pairs previously defined.
vrf = NULL
vrf = sapply(1:nrow(univ), function(x){
  j1 = mapx[mapx$idx==univ[x,1],-ncol(mapx)]
  j2 = mapx[mapx$idx==univ[x,2],-ncol(mapx)]
  cond = ifelse(sum(j1==j2)==3,1,0)
  cond
})
univ = univ[vrf==1,]
Would you know how to overcome this issue? Any tips or advice would be more than welcome!
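To make the constraint concrete, here is a minimal Python sketch that streams the pairs and keeps only those agreeing in exactly k positions, so the full pair matrix is never stored. It is not a drop-in replacement for the R code: the function name `pairs_with_k_matches` is hypothetical, indices are 0-based, and `itertools.product` orders the rows differently from `expand.grid`. The work is still quadratic in the number of combinations.

```python
from itertools import combinations, product


def pairs_with_k_matches(levels, n_vars, k):
    """Yield index pairs (i, j), i < j, whose level vectors agree in exactly k positions."""
    combos = list(product(levels, repeat=n_vars))
    for (i, a), (j, b) in combinations(enumerate(combos), 2):
        if sum(x == y for x, y in zip(a, b)) == k:
            yield i, j


# Small case so it runs instantly: 4 variables, 3 levels, exactly 2 shared values.
print(sum(1 for _ in pairs_with_k_matches((1, 2, 3), 4, 2)))  # 972
```

The count can be cross-checked by hand: each of the 81 combinations has C(4,2) × 2² = 24 partners that agree in exactly 2 positions, and 81 × 24 / 2 = 972.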

Writing a chunk of MPI distributed data via hdf5 in fortran

I have a 3d array distributed into different MPI processes:
real :: DATA(i1:i2, j1:j2, k1:k2)
where i1, i2, ... are different for each MPI process, but the MPI grid is cartesian.
For simplicity let's assume I have a 120 x 120 x 120 array, and 27 MPI processes distributed as 3 x 3 x 3 (so that each processor has an array of size 40 x 40 x 40).
Using the HDF5 library, I need to write only a slice of that data, say, a slice that goes through the middle perpendicular to the second axis. The resulting (global) array would be of size 120 x 1 x 120.
I'm a bit confused about how to properly use HDF5 here, and how to generalize from writing the full DATA array (which I can do). The problem is that not every MPI process is going to be writing. For instance, in the case above, only 9 processes will have to write something; the others (which are on the +/-y faces of the cube) will not have to, since they don't contain any chunk of the slab I need.
I tried the chunking technique described here, but it looks like that's just for a single thread.
Would be very grateful if the hdf5 community can help me in this :)
When writing an HDF5 dataset in parallel, all MPI processes must participate in the operation (even if a certain MPI process does not have values to write).
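To see which ranks actually own part of the slab, and what each would select, here is a small Python sketch of the index arithmetic only (no HDF5 or MPI calls). The function names are made up, the rank-to-coordinate mapping is assumed to be row-major, and the slice is assumed to sit at global index 60 along the second axis.

```python
def block_bounds(coord, nprocs, n):
    """1-based inclusive index range owned by process `coord` along one axis."""
    size = n // nprocs  # assumes n is divisible by nprocs
    return coord * size + 1, (coord + 1) * size


def slab_selection(rank, dims=(3, 3, 3), shape=(120, 120, 120), axis=1, index=60):
    """Return (global_start, count) for this rank's part of the slab, or None if it owns none."""
    # Row-major rank -> Cartesian coordinates (assumption about the process grid).
    coords = []
    r = rank
    for d in reversed(dims):
        coords.append(r % d)
        r //= d
    coords.reverse()

    start, count = [], []
    for ax, (c, p, n) in enumerate(zip(coords, dims, shape)):
        lo, hi = block_bounds(c, p, n)
        if ax == axis:
            if not lo <= index <= hi:
                return None  # this rank joins the collective write with an empty selection
            lo = hi = index
        start.append(lo)
        count.append(hi - lo + 1)
    return tuple(start), tuple(count)


writers = [r for r in range(27) if slab_selection(r) is not None]
print(len(writers))       # 9
print(slab_selection(3))  # ((1, 60, 1), (40, 1, 40))
```

In the 120 x 1 x 120 output dataset the start along the second axis would be 1 rather than 60; the other two starts carry over unchanged.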
If you are not bound to a specific library, take a look at HDFql. Based on what I could understand from the use case you have posted, here is an example of how to write data in parallel in Fortran using HDFql.
PROGRAM example
    ! use HDFql module (make sure it can be found by the Fortran compiler)
    USE HDFql
    ! declare variables
    REAL(KIND=8), DIMENSION(40, 40, 40) :: values
    CHARACTER(2) :: start
    INTEGER :: state
    INTEGER :: x, y, z
    ! create an HDF5 file named "example.h5" and use (i.e. open) it in parallel
    state = hdfql_execute("CREATE AND USE FILE example.h5 IN PARALLEL")
    ! create a dataset named "dset" of data type double of three dimensions (size 120x120x120)
    state = hdfql_execute("CREATE DATASET dset AS DOUBLE(120, 120, 120)")
    ! populate variable "values" with certain values
    DO x = 1, 40
        DO y = 1, 40
            DO z = 1, 40
                values(z, y, x) = hdfql_mpi_get_rank() * 100000 + (x * 1600 + y * 40 + z)
            END DO
        END DO
    END DO
    ! register variable "values" for subsequent use (by HDFql)
    state = hdfql_variable_register(values)
    IF (hdfql_mpi_get_rank() < 3) THEN
        ! insert (i.e. write) values from variable "values" into dataset "dset" using a hyperslab that depends on the MPI rank (each rank writes 40x40x40 values)
        WRITE(start, "(I0)") hdfql_mpi_get_rank() * 40
        state = hdfql_execute("INSERT INTO dset(" // start // ":1:1:40) IN PARALLEL VALUES FROM MEMORY 0")
    ELSE
        ! if the MPI rank is equal to or greater than 3, nothing is written
        state = hdfql_execute("INSERT INTO dset IN PARALLEL NO VALUES")
    END IF
END PROGRAM example
Please check HDFql reference manual to get additional information on how to work with HDF5 files in parallel (i.e. with MPI) using this library.

Julia : BLAS.gemm!() parameters

I want to use the BLAS package. To do so, I need to understand the first two parameters of the gemm() function, whose meaning is not clear to me.
What do the parameters 'N' and 'T' represent?
BLAS.gemm!('N', 'T', lr, alpha, A, B, beta, C)
What is the difference between BLAS.gemm and BLAS.gemm! ?
According to the documentation
gemm!(tA, tB, alpha, A, B, beta, C)
Update C as alpha * A * B + beta*C or the other three variants according to tA (transpose A) and tB. Returns the updated C.
Note: here, alpha and beta must be float type scalars. A, B and C are all matrices. It's up to you to make sure the matrix dimensions match.
Thus, the tA and tB parameters specify whether A or B should be transposed before multiplying. Select 'N' for no transpose and 'T' for transpose; you must pick one or the other for each matrix. Note that the flag does not require a transposed copy to be allocated on the Julia side: the BLAS routine reads the matrix in transposed order. The memory-access pattern can still affect speed somewhat, so if you are going to apply the same multiplication many times, it may be worth benchmarking whether storing the matrix in transposed form from the beginning helps.
The difference between gemm!() and gemm() is that for gemm!() you need to have already allocated the matrix C. The ! is a "modify in place" signal. Consider the following illustration of their different uses:
A = rand(5,5)
B = rand(5,5)
C = Array(Float64, 5, 5)
BLAS.gemm!('N', 'T', 1.0, A, B, 0.0, C)
D = BLAS.gemm('N', 'T', 1.0, A, B)
julia> C == D
Each of these, in essence, performs the calculation C = A * B'. (Technically, gemm!() performs C = (0.0)*C + (1.0)*A * B'.)
Thus, the syntax for the in-place gemm!() is a bit unusual in some respects (unless you've already worked with a language like C, in which case it seems very intuitive). You don't have the explicit = sign that you usually see when assigning the result of a function call in a high-level language like Julia.
As the illustration above shows, the outcome of gemm!() and gemm() in this case is identical, even though the syntax and procedure to achieve that outcome are a bit different. Practically speaking, however, performance differences between the two can be significant, depending on your use case. In particular, if you are going to perform that multiplication many times, replacing/updating the value of C each time, then gemm!() can be a decent bit quicker because you don't need to keep allocating new memory each time, which has time costs both in the initial memory allocation and in the garbage collection later on.
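For readers who find the semantics easier to follow in plain code, here is a pure-Python sketch of what the two calls compute. It only mimics the arithmetic on lists of lists: the names `gemm_inplace` and `gemm` are chosen here for illustration, and building the transposed copy in `_op` is a shortcut of this sketch, not something BLAS needs to do.

```python
def _op(m, flag):
    """Return m for 'N' or its transpose for 'T'."""
    if flag == 'N':
        return m
    if flag == 'T':
        return [list(col) for col in zip(*m)]
    raise ValueError("flag must be 'N' or 'T'")


def gemm_inplace(tA, tB, alpha, A, B, beta, C):
    """Overwrite C with alpha * op(A) * op(B) + beta * C and return it (like gemm!)."""
    a, b = _op(A, tA), _op(B, tB)
    for i in range(len(a)):
        for j in range(len(b[0])):
            acc = sum(a[i][k] * b[k][j] for k in range(len(b)))
            C[i][j] = alpha * acc + beta * C[i][j]
    return C


def gemm(tA, tB, alpha, A, B):
    """Allocate a fresh result and return alpha * op(A) * op(B) (like gemm)."""
    rows = len(A) if tA == 'N' else len(A[0])
    cols = len(B[0]) if tB == 'N' else len(B)
    C = [[0.0] * cols for _ in range(rows)]
    return gemm_inplace(tA, tB, alpha, A, B, 0.0, C)


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(gemm('N', 'T', 1.0, A, B))  # [[17.0, 23.0], [39.0, 53.0]]
```

The in-place variant reuses the caller's C on every call, which is where the saving in allocations comes from.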

Efficiently multiply OpenCL vector components?

I have a float8 vector whose components I multiply together using vector component addressing, as follows (note that the variable v below isn't a constant in reality):
float8 v = (float8) (1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f);
float result = v.s0 * v.s1 * v.s2 * v.s3 * v.s4 * v.s5 * v.s6 * v.s7;
However this prevents my kernel from being vectorised when being compiled with Intel Code builder.
Device build started
Device build done
Kernel <test> was not vectorized
To overcome this, I started to create copies of the vector, masking the required components and multiplying them all together before trying to call the dot function; however, this all seemed rather inefficient and convoluted.
My question is therefore: how can I multiply the components of my vector in an efficient, vectorised manner?
My comment was wrong, as it is not a dot product that you need in the result. It is simply a multiplication of 8 numbers. Data that is processed in parallel should be laid out in parallel, not packed into the same container. If you want to multiply s0, s1, s2, ..., s7, then put them in consecutive vector variables:
variable-1: s0 p0 r0 q0 .... z0
variable-2: s1 p1 r1 q1 .... z1
variable-8: s7 p7 .... z7
You can then multiply those at SIMD speed, performing 8 multiplications at a time with the float8 type, and continue for as many variables as you need, not just 8.
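The layout described above can be sketched in Python with plain lists standing in for vector registers. This only illustrates the data arrangement, not OpenCL code: `lanewise_product` is a made-up name, and each list element plays the role of one lane (one independent work item).

```python
def lanewise_product(variables):
    """Multiply vectors element by element: result[lane] = product of variables[i][lane]."""
    result = [1.0] * len(variables[0])
    for vec in variables:  # one vector multiply per input variable
        result = [r * x for r, x in zip(result, vec)]
    return result


# Two independent lanes. variables[i][lane] holds factor i of that lane's product.
variables = [[float(i), float(2 * i)] for i in range(1, 9)]
print(lanewise_product(variables))  # [40320.0, 10321920.0]
```

Every lane gets its own product of 8 factors, and no multiplication ever mixes two lanes, which is what lets the hardware do them side by side.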
At each multiplication, you are responsible for checking for errors and overflows. But when the hardware does 8 multiplications in a single instruction, which order do you want? Do you want them multiplied in increasing index order (serial, slow), or with something like a pairwise multiplication over a tree of elements (fewer multiplication steps, faster, but possibly giving different results)? The order of operations may sometimes be important.
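The two orders can be compared with a short Python sketch. It is an illustration only: `serial_product` and `tree_product` are names chosen here, and the halving step is meant to resemble multiplying the low and high halves of a vector (what `.lo` and `.hi` give you in OpenCL), assuming the length is a power of two.

```python
def serial_product(xs):
    """Multiply in increasing index order: 7 dependent steps for 8 values."""
    acc = xs[0]
    for x in xs[1:]:
        acc *= x
    return acc


def tree_product(xs):
    """Multiply low half by high half repeatedly: 3 steps for 8 values."""
    xs = list(xs)
    while len(xs) > 1:
        half = len(xs) // 2  # assumes len(xs) is a power of two
        xs = [lo * hi for lo, hi in zip(xs[:half], xs[half:])]
    return xs[0]


v = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(serial_product(v), tree_product(v))  # 40320.0 40320.0
```

For these small integers both orders agree exactly; with general floating-point inputs the rounding can differ because the intermediate products are different.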
If it is a GPU, simply multiply the items; instruction-level parallelism plus the GPU's hardware thread scheduling will achieve efficiency. If it is a CPU, you should first check whether your CPU supports instructions that multiply the elements within a single vector (I doubt such a thing exists); if not, then you need to multiply across array elements, not vector elements. This should be easier to vectorise, as the data is contiguous in main memory and a CPU does not give explicit control over local memory.

Efficient Multiplication of Varying-Length #s [Conceptual]

So it seems I "underestimated" what varying-length numbers meant. I didn't even think about situations where the operands are 100 digits long. In that case, my proposed algorithm is definitely not efficient. I'd probably need an implementation whose complexity depends on the number of digits in each operand as opposed to its numerical value, right?
As suggested below, I will look into the Karatsuba algorithm...
Write the pseudocode of an algorithm that takes in two arbitrary length numbers (provided as strings), and computes the product of these numbers. Use an efficient procedure for multiplication of large numbers of arbitrary length. Analyze the efficiency of your algorithm.
I decided to take the (semi) easy way out and use the Russian Peasant Algorithm. It works like this:
a * b = (a/2) * (2b) if a is even
a * b = ((a-1)/2) * (2b) + b if a is odd
My pseudocode is:
rpa(x, y){
    if x is 1
        return y
    if x is even
        return rpa(x/2, 2y)
    if x is odd
        return rpa((x-1)/2, 2y) + y
}
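For reference, the same idea written as a runnable Python function. This is a sketch on native integers, written iteratively instead of recursively, with x = 0 handled as an extra case that the pseudocode does not cover.

```python
def rpa(x, y):
    """Russian peasant multiplication for an integer x >= 0 and any integer y."""
    result = 0
    while x > 0:
        if x % 2 == 1:  # odd: set aside one copy of the current y
            result += y
        x //= 2         # halve x (the odd remainder was accounted for above)
        y *= 2          # double y
    return result


print(rpa(13, 7))  # 91
```

The loop runs once per bit of x, so the number of iterations grows with the length of x, not with its value.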
I have 3 questions:
Is this efficient for arbitrary length numbers? I implemented it in C and tried varying length numbers. The run-time was near-instant in all cases, so it's hard to tell empirically...
Can I apply the Master Theorem to understand the complexity...?
a = # subproblems in recursion = 1 (max 1 recursive call across all states)
n / b = size of each subproblem = n / 1 -> b = 1 (problem doesn't change size...?)
f(n^d) = work done outside recursive calls = 1 -> d = 0 (the addition when a is odd)
a = 1, b^d = 1, a = b^d -> complexity is in n^d*log(n) = log(n)
this makes sense logically since we are halving the problem at each step, right?
What might my professor mean by providing arbitrary length numbers "as strings". Why do that?
Many thanks in advance
What might my professor mean by providing arbitrary length numbers "as strings". Why do that?
This actually changes everything about the problem (and makes your algorithm incorrect).
It means that 1234 is provided as 1,2,3,4 and you cannot operate directly on the whole number. You need to analyze your algorithm in terms of the number of additions, multiplications, and divisions.
You should expect a division to be a bit more expensive than a multiplication, and a multiplication to be a lot more expensive than an addition. So a good algorithm tries to reduce the number of divisions and multiplications.
Check out the Karatsuba algorithm (P.S. don't copy it, that's not what your teacher wants); it is one of the fastest for this specification.
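To show the shape of the Karatsuba recursion, here is a minimal Python sketch. It works on Python integers and splits on decimal digits for readability; a version that truly operates on digit strings would also need string addition and subtraction, which are left out here.

```python
def karatsuba(x, y):
    """Karatsuba product of two non-negative integers, splitting on decimal digits."""
    if x < 10 or y < 10:
        return x * y  # single-digit base case
    half = max(len(str(x)), len(str(y))) // 2
    base = 10 ** half
    x_hi, x_lo = divmod(x, base)
    y_hi, y_lo = divmod(y, base)
    lo = karatsuba(x_lo, y_lo)
    hi = karatsuba(x_hi, y_hi)
    # One extra multiplication recovers the cross terms x_hi*y_lo + x_lo*y_hi.
    mid = karatsuba(x_lo + x_hi, y_lo + y_hi) - lo - hi
    return hi * base * base + mid * base + lo


print(karatsuba(1234, 5678))  # 7006652
```

The point is that each level makes three recursive multiplications on half-length numbers instead of four, which is what brings the cost below the quadratic schoolbook method.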
Regarding 3): Native integers are limited in how large (or small) a number they can represent (32- or 64-bit integers, for example). To represent arbitrary length numbers you can choose strings, because then you are not really limited by this. The problem is then, of course, that your arithmetic units are not really made to add strings ;-)
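As an illustration of what arithmetic on strings looks like, here is a sketch of the schoolbook method on decimal strings in Python. It is the quadratic baseline, not Karatsuba, and it assumes both inputs are non-negative and contain only digits.

```python
def multiply_strings(a, b):
    """Schoolbook product of two non-negative decimal strings, returned as a string."""
    # Least-significant digit first, so index i corresponds to 10**i.
    x = [int(ch) for ch in reversed(a)]
    y = [int(ch) for ch in reversed(b)]
    out = [0] * (len(x) + len(y))
    for i, dx in enumerate(x):
        carry = 0
        for j, dy in enumerate(y):
            total = out[i + j] + dx * dy + carry
            out[i + j] = total % 10
            carry = total // 10
        out[i + len(y)] += carry
    digits = "".join(str(d) for d in reversed(out)).lstrip("0")
    return digits or "0"


print(multiply_strings("1234", "5678"))  # 7006652
```

Only single-digit products and small carries are ever held in native integers, so the operand length is limited by memory and not by the machine word size. The two nested loops make the cost proportional to the product of the two lengths.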
