Parallel Computing SharedArray - julia

I am trying to implement a simple parallel algorithm in Julia for a multiply-add operation on a matrix. This is more for understanding how to do parallel computations in Julia than for practicality.
Ultimately, this should implement C = C + A*B where A, B, and C are matrices. The code I am using is:
function MultiplyAdd!(C::SharedArray{Int64,2}, A::SharedArray{Int64,2}, B::SharedArray{Int64,2})
    assert(size(A)[2] == size(B)[1])
    const p = size(A)[1] # rows(A), rows(C)
    const m = size(A)[2] # cols(A), rows(B)
    const n = size(B)[2] # cols(C), cols(B)
    # thresh is some threshold based on the number of operations required
    # to compute C before we should switch to parallel computation.
    const thresh = 190
    M = 2*p*m*n
    if ( M < thresh )
        C += A*B # base case for recursion
    elseif ( n >= max(p,m) )
        # Partition C and B (Vertical Split)
        if (iseven(p))
            # even partition of C and B
            # MultiplyAdd!( C0, A, B0 )
            # MultiplyAdd!( C1, A, B1 )
            a = @spawn MultiplyAdd!( C[:,1:convert(Int64,p/2)], A, B[:,1:convert(Int64,p/2)] )
            b = @spawn MultiplyAdd!( C[:,1:convert(Int64,p-(p/2))], A, B[:,1:convert(Int64,p-(p/2))] )
            fetch(a), fetch(b)
        else
            # odd partition of C and B
            a = @spawn MultiplyAdd!( C[:,1:convert(Int64,p/2+0.5)], A, B[:,1:convert(Int64,p/2+0.5)] )
            b = @spawn MultiplyAdd!( C[:,1:convert(Int64,p-(p/2+0.5))], A, B[:,1:convert(Int64,p-(p/2+0.5))] )
            fetch(a), fetch(b)
        end
    elseif ( p >= m )
        # Partition C and A (Horizontal Split)
        if (iseven(n))
            # even partition of C and A
            # MultiplyAdd!( C0, A0, B )
            # MultiplyAdd!( C1, A1, B )
            a = @spawn MultiplyAdd!( C[1:convert(Int64,n/2),:], A[1:convert(Int64,n/2),:], B )
            b = @spawn MultiplyAdd!( C1 = C[1:convert(Int64,n-(n/2)),:], A[1:convert(Int64,n-(n/2)),:], B )
            fetch(a), fetch(b)
        else
            # odd partition of C and A
            # MultiplyAdd!( C0, A0, B )
            # MultiplyAdd!( C1, A1, B )
            a = @spawn MultiplyAdd!( C[1:convert(Int64,n/2 + 0.5),:], A[1:convert(Int64,n/2),:], B )
            b = @spawn MultiplyAdd!( C[1:convert(Int64,n-(n/2 + 0.5)),:], A[1:convert(Int64,n-(n/2 + 0.5)),:], B )
            fetch(a), fetch(b)
        end
    else
        # Begin Serial Recursion
        # A is Vertical Split, B is Horizontal Split
        # MultiplyAdd!( C, A0, B0 )
        # MultiplyAdd!( C, A1, B1 )
        if (iseven(m))
            MultiplyAdd!( C, A[:,1:convert(Int64,m/2)], B[1:convert(Int64,m/2),:] )
            MultiplyAdd!( C, A[:,1:convert(Int64,m-(m/2))], B[1:convert(Int64,m-(m/2)),:] )
        else
            MultiplyAdd!( C, A[:,1:convert(Int64,m/2 + 0.5)], B[1:convert(Int64,m/2 + 0.5),:] )
            MultiplyAdd!( C, A[:,1:convert(Int64,m-(m/2 + 0.5))], B[1:convert(Int64,m-(m/2 + 0.5)),:] )
        end
    end
end
First, this doesn't work at all. I get the following error when I run it.
LoadError: On worker 5:
UndefVarError: #MultiplyAdd! not defined
in deserialize_datatype at ./serialize.jl:823
in handle_deserialize at ./serialize.jl:571
in collect at ./array.jl:307
in deserialize at ./serialize.jl:588
in handle_deserialize at ./serialize.jl:581
in deserialize at ./serialize.jl:541
in deserialize_datatype at ./serialize.jl:829
in handle_deserialize at ./serialize.jl:571
in deserialize_msg at ./multi.jl:120
in message_handler_loop at ./multi.jl:1317
in process_tcp_streams at ./multi.jl:1276
in #618 at ./event.jl:68
while loading In[32], in expression starting on line 73
in #remotecall_fetch#606(::Array{Any,1}, ::Function, ::Function, ::Base.Worker, ::Base.RRID, ::Vararg{Any,N}) at ./multi.jl:1070
in remotecall_fetch(::Function, ::Base.Worker, ::Base.RRID, ::Vararg{Any,N}) at ./multi.jl:1062
in #remotecall_fetch#609(::Array{Any,1}, ::Function, ::Function, ::Int64, ::Base.RRID, ::Vararg{Any,N}) at ./multi.jl:1080
in remotecall_fetch(::Function, ::Int64, ::Base.RRID, ::Vararg{Any,N}) at ./multi.jl:1080
in call_on_owner(::Function, ::Future, ::Int64, ::Vararg{Int64,N}) at ./multi.jl:1130
in fetch(::Future) at ./multi.jl:1156
in MultiplyAdd!(::SharedArray{Int64,2}, ::SharedArray{Int64,2}, ::SharedArray{Int64,2}) at ./In[32]:24
Second, it would seem to me that I should not run two @spawn tasks. I should let the second one just be a MultiplyAdd!(C,A,B) call in each of these cases. In other words, just assign a and fetch(a).
Third, Julia passes arrays to functions by reference, so wouldn't all of the operations naturally operate on the same C, A, and B matrices? As it stands, taking a slice such as:
C0 = C[:, 1:p/2]
C1 = C[:, 1:p-p/2]
creates an entirely new object, which explains taking the slices inside of the function calls in the above code. In essence, I am avoiding assignment to try and always operate on the same object. There must be a better way to do that than what I have implemented. Ultimately, I want to operate on the same data in memory and just "move a magnifying glass over the array" to operate on subsets of it in parallel.

It is difficult to help you here because you have not really asked a question. At the risk of sounding condescending, I suggest that you peek at suggestions for asking a good SO question.
As for your code, there are several problems with your approach.
As coded, MultiplyAdd! only parallelizes to a maximum of two workers.
MultiplyAdd! performs several calls like A[:,1:n] which allocate new arrays.
In that vein, calls like A[:,1:n] will make Array objects and not SharedArray objects, so recursive calls to MultiplyAdd! with arguments strictly typed to SharedArray{Int,2} will not work.
Lastly, MultiplyAdd! does not respect the indexing scheme of SharedArray objects.
The last point is important. Asking worker 1 to access the parts of A and B assigned to worker 2 requires data transfer, which kills any performance gain from parallelizing your code. You can see this by running
for i in procs(A)
    @show i, localindexes(A)
end
on your SharedArray object A. Each worker i should ideally operate only on its own local indices, though it can be helpful to allow data sharing at local index boundaries to save yourself some bookkeeping headache.
If you insist on using SharedArrays for your prototype, then you still have options. The SharedArray documentation has some good suggestions. I have found that the construct
function f(g, S::SharedArray)
    @sync begin
        for p in procs(S)
            @async begin
                remotecall_fetch(g, p, S, p)
            end
        end
    end
    S
end
with some kernel function g (e.g. MultiplyAdd!) will typically parallelize operations nicely across all participating workers. Obviously you must decide on how to partition the execution; the advection_shared! example in the docs is a good guide.
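For concreteness, here is a rough sketch of what such a kernel could look like for the multiply-add. Each worker updates only the columns of C that its local index range covers; the function name and the assumption that each worker's chunk falls on whole columns are my own illustrative choices, not a drop-in implementation.
function make_multiplyadd_kernel(A::SharedArray, B::SharedArray)
    return function (C::SharedArray, pid::Int)
        # pid is unused here; localindexes already reflects the calling worker
        idx = localindexes(C)                        # linear indices owned by this worker
        isempty(idx) && return C
        nrows = size(C, 1)
        cols = ((first(idx) - 1) ÷ nrows + 1):((last(idx) - 1) ÷ nrows + 1)
        C[:, cols] = C[:, cols] + A * B[:, cols]     # update only the local column block
        return C
    end
end

# f(make_multiplyadd_kernel(A, B), C)   # runs the update on every participating worker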
You may also consider using Julia's native multithreading. That parallel framework is a little different from shared-memory multiprocessing, but it allows you to operate on Array objects directly with familiar iteration constructs.
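As a minimal sketch of that route (assuming Julia is started with several threads, e.g. JULIA_NUM_THREADS=4; the row-wise partitioning here is my own choice, not the recursive scheme from the question):
# Multithreaded multiply-add on ordinary Arrays: each thread handles rows of C.
function multiplyadd_threads!(C::Matrix, A::Matrix, B::Matrix)
    @assert size(A, 2) == size(B, 1) && size(C) == (size(A, 1), size(B, 2))
    Threads.@threads for i in 1:size(C, 1)
        # each thread owns whole rows of C, so no two threads write the same element
        for j in 1:size(C, 2)
            s = zero(eltype(C))
            for k in 1:size(A, 2)
                s += A[i, k] * B[k, j]
            end
            C[i, j] += s
        end
    end
    return C
end

# C = zeros(Int, 4, 4); A = rand(1:9, 4, 3); B = rand(1:9, 3, 4)
# multiplyadd_threads!(C, A, B)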

Related

function composition for multiple arguments and nested functions

I have a pure function that takes 18 arguments, processes them, and returns an answer.
Inside this function I call many other pure functions, and those functions call other pure functions within them, as deep as 6 levels.
This way of composing is cumbersome to test, since the top-level functions, in addition to their own logic, have to gather parameters for the inner functions.
# Minimal conceptual example
main_function(a, b, c, d, e) = begin
    x = pure_function_1(a, b, d)
    y = pure_function_2(a, c, e, x)
    z = pure_function_3(b, c, y, x)
    answer = pure_function_4(x, y, z)
    return answer
end
# real example
calculate_time_dependant_losses(
    Ap,
    u,
    Ac,
    e,
    Ic,
    Ep,
    Ecm_t,
    fck,
    RH,
    T,
    cementClass::Char,
    ρ_1000,
    σ_p_start,
    f_pk,
    t0,
    ts,
    t_start,
    t_end,
) = begin
    μ = σ_p_start / f_pk
    fcm = fck + 8
    Fr = σ_p_start * Ap
    _σ_pb = σ_pb(Fr, Ac, e, Ic)
    _ϵ_cs_t_start_t_end = ϵ_cs_ti_tj(ts, t_start, t_end, Ac, u, fck, RH, cementClass)
    _ϕ_t0_t_start_t_end = ϕ_t0_ti_tj(RH, fcm, Ac, u, T, cementClass, t0, t_start, t_end)
    _Δσ_pr_t_start_t_end =
        Δσ_pr(σ_p_start, ρ_1000, t_end, μ) - Δσ_pr(σ_p_start, ρ_1000, t_start, μ)
    denominator =
        1 +
        (1 + 0.8 * _ϕ_t0_t_start_t_end) * (1 + (Ac * e^2) / Ic) * ((Ep * Ap) / (Ecm_t * Ac))
    shrinkageLoss = (_ϵ_cs_t_start_t_end * Ep) / denominator
    relaxationLoss = (0.8 * _Δσ_pr_t_start_t_end) / denominator
    creepLoss = (Ep * _ϕ_t0_t_start_t_end * _σ_pb) / Ecm_t / denominator
    return shrinkageLoss + relaxationLoss + creepLoss
end
I see examples of functional composition (dot chaining, the pipe operator, etc.) with single-argument functions.
Is it practical to compose the above function using functional programming? If yes, how?
The standard and simple way is to recast your example so that it can be written as
# Minimal conceptual example, re-cast
main_function(a, b, c, d, e) = begin
    x = pure_function_1'(a, b, d)()
    y = pure_function_2'(a, c, e)(x)
    z = pure_function_3'(b, c)(y)    # I presume you meant `y` here
    answer = pure_function_4(z)      # and here, z
    return answer
end
Meaning, we use functions that return functions of one argument. Now these functions can be easily composed, using e.g. a forward-composition operator (f >>> g)(x) = g(f(x)):
# Minimal conceptual example, re-cast, composed
main_function(a, b, c, d, e) = begin
    composed_calculation =
        pure_function_1'(a, b, d) >>>
        pure_function_2'(a, c, e) >>>
        pure_function_3'(b, c) >>>
        pure_function_4
    answer = composed_calculation()
    return answer
end
If you really need the various x, y and z at differing points in time during the composed computation, you can pass them around in a compound, record-like data structure. We can avoid the coupling of this argument handling if we have extensible records:
# Minimal conceptual example, re-cast, composed, args packaged
main_function(a, b, c, d, e) = begin
    composed_calculation =
        pure_function_1'(a, b, d) >>> put('x') >>>
        get('x') >>> pure_function_2'(a, c, e) >>> put('y') >>>
        get('x') >>> pure_function_3'(b, c, y) >>> put('z') >>>
        get({'x';'y';'z'}) >>> pure_function_4
    answer = composed_calculation(empty_initial_state)
    return value(answer)
end
The passed around "state" would be comprised of two fields: a value and an extensible record. The functions would accept this state, use the value as their additional input, and leave the record unchanged. get would take the specified field out of the record and put it in the "value" field in the state. put would mutate the extensible record in the state:
put(field_name) = ( {value:v ; record:r} =>
    {v ; put_record_field( r, field_name, v)} )
get(field_name) = ( {value:v ; record:r} =>
    {get_record_field( r, field_name) ; r} )
pure_function_2'(a, c, e) = ( {value:v ; record:r} =>
    {pure_function_2(a, c, e, v); r} )
value(r) = get_record_field( r, value)
empty_initial_state = { novalue ; empty_record }
All in pseudocode.
Augmented function application, and hence composition, is one way of thinking about "what monads are". Passing around the pairing of a produced/expected argument and a state is known as State Monad. The coder focuses on dealing with the values while treating the state as if "hidden" "under wraps", as we do here through the get/put etc. facilities. Under this illusion/abstraction, we do get to "simply" compose our functions.
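Here is a small, self-contained Julia sketch of that value-plus-record idea; putf, getf, lift and the NamedTuple state are illustrative stand-ins of my own, not a real State-monad library:
# State is a NamedTuple: value = current value, record = named tuple of saved fields.
putf(field) = st -> (value = st.value,
                     record = merge(st.record, NamedTuple{(field,)}((st.value,))))
getf(field) = st -> (value = st.record[field], record = st.record)
lift(f)     = st -> (value = f(st.value), record = st.record)   # wrap a plain value function

# Compose state-passing steps left-to-right.
thread(fs...) = st -> foldl((s, f) -> f(s), fs; init = st)

pipeline = thread(
    lift(_ -> 10),              putf(:x),
    getf(:x), lift(x -> 2x),    putf(:y),
    getf(:x), lift(x -> x + 1), putf(:z),
)

final = pipeline((value = nothing, record = NamedTuple()))
final.record   # (x = 10, y = 20, z = 11)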
I can make a small start at the end:
sum $ map (/ denominator)
    [ _ϵ_cs_t_start_t_end * Ep
    , 0.8 * _Δσ_pr_t_start_t_end
    , (Ep * _ϕ_t0_t_start_t_end * _σ_pb) / Ecm_t
    ]
As mentioned in the comments (repeatedly), the function composition operator does indeed accept multiple argument functions. Cite: https://docs.julialang.org/en/v1/base/base/#Base.:%E2%88%98
help?> ∘
"∘" can be typed by \circ<tab>
search: ∘
f ∘ g
Compose functions: i.e. (f ∘ g)(args...; kwargs...) means f(g(args...; kwargs...)). The ∘ symbol
can be entered in the Julia REPL (and most editors, appropriately configured) by typing
\circ<tab>.
Function composition also works in prefix form: ∘(f, g) is the same as f ∘ g. The prefix form
supports composition of multiple functions: ∘(f, g, h) = f ∘ g ∘ h and splatting ∘(fs...) for
composing an iterable collection of functions.
The challenge is chaining the operations together, because any function can only pass on a tuple to the next function in the composed chain. The solution could be making sure your chained functions 'splat' the input tuples into the next function.
Example:
# splat to turn max into a tuple-accepting function
julia> f = (x->max(x...)) ∘ minmax;
julia> f(3,5)
5
Using this will in no way help make your function cleaner, though, in fact it will probably make a horrible mess.
Your problems do not at all seem to me to be related to how you call, chain or compose your functions, but are entirely due to not organizing the inputs in reasonable types with clean interfaces.
Edit: Here's a custom composition operator that splats arguments, to avoid the tuple output issue, though I don't see how it can help picking the right arguments, it just passes everything on:
⊕(f, g) = (args...) -> f(g(args...)...)
⊕(f, g, h...) = ⊕(f, ⊕(g, h...))
Example:
julia> myrev(x...) = reverse(x);
julia> (myrev ⊕ minmax)(5,7)
(7, 5)
julia> (minmax ⊕ myrev ⊕ minmax)(5,7)
(5, 7)
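To make the earlier remark about organizing the inputs into reasonable types more concrete, here is one purely illustrative grouping of the 18 parameters with Base.@kwdef; the struct and field groupings are my own guesses, not a prescribed design:
# Hypothetical parameter groupings; adjust names and fields to your domain.
Base.@kwdef struct Concrete
    Ac; Ic; e; u; fck; RH; T; cementClass::Char
end

Base.@kwdef struct Tendon
    Ap; Ep; f_pk; σ_p_start; ρ_1000
end

Base.@kwdef struct Times
    t0; ts; t_start; t_end; Ecm_t
end

# The top-level function now takes three arguments instead of 18,
# and inner helpers can accept whole structs as well.
calculate_time_dependant_losses(c::Concrete, p::Tendon, t::Times) = begin
    μ   = p.σ_p_start / p.f_pk
    fcm = c.fck + 8
    # ... same body as before, reading fields from c, p and t ...
end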

Access methods for struct initialisation

I would like to do something like this:
Base.@kwdef mutable struct Setup
    # physics
    lx = 20.0
    dc = 1.0
    n = 4
    # initial condition
    ic(x) = exp(-(x-lx/4)^2)
    # numerics
    nx = 200
    nvis = 50
    # derived numerics
    dx = lx/nx
    dt = dx^2/dc/10
    nt = nx^2 ÷ 5
    # arrays
    xc = LinRange(dx/2,lx-dx/2,nx)
    C0 = ic.(xc)
    C = copy(C)
    q = zeros(nx-1)
    # collections for easy use
    dgl_params = [dc,n]
end
The problem here is that it says ic was undefined. Makes sense, because ic is not in the global scope.
Then I tried writing an outer constructor instead (I am not writing an inner constructor, as that would overwrite the default constructor).
Base.@kwdef mutable struct Setup
    # physics
    lx = 20.0
    dc = 1.0
    n = 4
    # initial condition
    ic(x) = exp(-(x-lx/4)^2)
    # numerics
    nx = 200
    nvis = 50
    # derived numerics
    dx = lx/nx
    dt = dx^2/dc/10
    nt = nx^2 ÷ 5
    # arrays
    xc = LinRange(dx/2,lx-dx/2,nx)
    # C0 = ic.(xc)
    C0
    C = copy(C)
    q = zeros(nx-1)
    # collections for easy use
    dgl_params = [dc,n]
end
function Setup()
    Setup(Setup.ic(Setup.xc))
end
Setup()
But now it says DataType has no field ic, which of course makes sense; I want the ic of the object itself. However, there appears to be no self or this keyword in Julia.
Strangely enough, the above seems to work fine with dx or dt, which also depend on other variables.
Normally the Julia design is to use multiple dispatch, with functions defined outside of the object.
When creating structs, always provide the data types of the fields.
For large structs like this, you will usually find the Parameters package more convenient when later debugging.
The easiest way to circumvent the limitation is to have a lambda function in a field, such as the following (this is, however, not the recommended Julia style):
using Parameters

@with_kw mutable struct Setup
    lx::Float64 = 20.0
    ic::Function = x -> lx * x
end
This can be now used as:
julia> s = Setup(lx=30)
Setup
lx: Float64 30.0
ic: #10 (function of type var"#10#14"{Int64})
julia> s.ic(10)
300
Actually, it is not in Julia's design to have what in Java or C++ you would call "member functions". Part of this is Julia's desire to benefit from the multiple-dispatch programming paradigm. In Julia, mutable structs are passed by reference, so you pass them directly to a function, e.g.
function ic(setup::Setup, x)
    return exp(-(x - setup.lx/4)^2)
end
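For illustration, assuming a Setup instance can be constructed from its defaults, the call site would then look like this (hypothetical usage matching the function above):
setup = Setup()                  # hypothetical: all fields built from their defaults
ic(setup, 5.0)                   # evaluates exp(-(5.0 - setup.lx/4)^2)
ic.(Ref(setup), setup.xc)        # broadcast over the grid while keeping `setup` fixed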
That said, there is still a way to have more Java-esque classes, though it is not super recommended. Check this thread and, particularly, the answer marked as the solution, given by one of Julia's authors themselves.
Okay, I found the solution.
This does not work, because a struct cannot hold method definitions in Julia:
Base.@kwdef mutable struct S
    n = 5
    m
    f(x) = x + 100
    A = f.(randn(n,m))
end
s = S(m=5) # ERROR: UndefVarError: f not defined
s.A
s.f(5)
But this does work, because here f is a field holding an anonymous function rather than a method definition:
Base.@kwdef mutable struct S
    n = 5
    m
    f = x -> x + 100
    A = f.(randn(n,m))
end
s = S(m=5)
s.A
s.f(5)

Use of Memory-mapped in Julia

I have a Julia code, version 1.2, which performs a lot of operations on a 10000 x 10000 Array. Due to an OutOfMemory() error when I run the code, I’m exploring other options to run it, such as memory-mapping. Concerning the use of Mmap.mmap, I’m a bit confused about how to use the Array that I map to my disk, because the explanations at https://docs.julialang.org/en/v1/stdlib/Mmap/index.html are brief. Here is the beginning of my code:
using Distances
using LinearAlgebra
using Distributions
using Mmap
data=Float32.(rand(10000,15))
Eucldist=pairwise(Euclidean(),data,dims=1)
D=maximum(Eucldist.^2)
sigma2hat=mean(((Eucldist.^2)./D)[tril!(trues(size((Eucldist.^2)./D)),-1)])
L=exp.(-(Eucldist.^2/D)/(2*sigma2hat))
L is the 10000 x 10000 Array with which I want to work, so I mapped it to my disk with
s = open("mmap.bin", "w+")
write(s, size(L,1))
write(s, size(L,2))
write(s, L)
close(s)
What am I supposed to do after that? The next step is to perform K=eigen(L) and apply other commands to K. How should I do that? With K=eigen(L) or K=eigen(s)? What’s the role of the object s and when does it get involved? Moreover, I don’t understand why I have to use Mmap.sync! and when: after each line following eigen(L)? At the end of the code? How can I be sure that I’m using my disk space instead of RAM? I would like some pointers about memory-mapping, please. Thank you!
If memory usage is a concern, it is often best to re-assign your very large arrays to 0, or to a similarly type-safe small matrix, so that the memory can be garbage-collected, assuming you are done with those intermediate matrices. After that, you just call Mmap.mmap() on your stored data file, with the type and dimensions of the data as the second and third arguments to mmap, and then assign the function's return value to your variable, in this case L, resulting in L being bound to the file contents:
using Distances
using LinearAlgebra
using Distributions
using Mmap

function testmmap()
    data = Float32.(rand(10000, 15))
    Eucldist = pairwise(Euclidean(), data, dims=1)
    D = maximum(Eucldist.^2)
    sigma2hat = mean(((Eucldist.^2) ./ D)[tril!(trues(size((Eucldist.^2) ./ D)), -1)])
    L = exp.(-(Eucldist.^2 / D) / (2 * sigma2hat))
    s = open("./tmp/mmap.bin", "w+")
    write(s, size(L,1))
    write(s, size(L,2))
    write(s, L)
    close(s)
    # deref and gc collect
    Eucldist = data = L = zeros(Float32, 2, 2)
    GC.gc()
    s = open("./tmp/mmap.bin", "r+") # allow read and write
    m = read(s, Int)
    n = read(s, Int)
    L = Mmap.mmap(s, Matrix{Float32}, (m, n)) # now L references the file contents
    K = eigen(L)
    K
end

testmmap()
@time testmmap() # 109.657995 seconds (17.48 k allocations: 4.673 GiB, 0.73% gc time)
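Regarding Mmap.sync!: as far as I understand (please verify against the Mmap documentation), it is only needed when you mutate the memory-mapped array and want the pending changes flushed to the file; plain reads, such as eigen(L), do not require it. A minimal sketch (these lines would go inside testmmap, after the Mmap.mmap call):
L[1, 1] = 0.0f0        # write into the memory-mapped matrix (the file was opened "r+")
Mmap.sync!(L)          # push pending changes out to mmap.bin
close(s)               # closing the stream afterwards is still your responsibility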

Can I use <- instead of = in Julia?

Like in R:
a <- 2
or even better
a ← 2
which should translate to
a = 2
and if possible respect method overloading.
= is overloaded (not in the multiple dispatch sense) a lot in Julia.
1. It binds a new variable, as in a = 3. You won't be able to use ← instead of = in this context, because you can't overload binding in Julia.
2. It gets lowered to setindex!. As in, a[i] = b gets lowered to setindex!(a, b, i). Unfortunately, setindex! takes 3 arguments while ← can only take 2, so you can't overload = with 3 arguments. But you can use only 2 arguments and overload a[:] = b, for example; so you can define ←(a,b) = (a[:] = b) or ←(a,b) = setindex!(a,b,:).
3. a .= b gets lowered to (Base.broadcast!)(Base.identity, a, b). You can overload this by defining ←(a,b) = (a .= b) or ←(a,b) = (Base.broadcast!)(Base.identity, a, b).
So, there are two potentially nice ways of using ←. Good luck ;)
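A quick illustrative demo of the broadcast variant from point 3 above (the infix use assumes ← parses as a regular binary operator, which it should in current Julia):
←(a, b) = (a .= b)          # overload ← as an in-place elementwise copy

x = zeros(3)
x ← [1.0, 2.0, 3.0]         # same as x .= [1.0, 2.0, 3.0]
x                           # [1.0, 2.0, 3.0]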
Btw, if you really want to use ← to do binding (like in 1.), the only way to do it is using macros. But then, you will have to write a macro in front of every single assignment, which doesn't look very good.
Also, if you want to explore how operators get lowered in Julia, do f(a,b) = (a .= b), for example, and then @code_lowered f(x,y).
No. = is not an operator in Julia, and cannot be assigned to another symbol.
Disclaimer: You are fully responsible if you try my (still beginner's) experiments below! :P
MacroHelper is a module (big thanks to @Alexander_Morley and @DanGetz for their help) that I plan to play with in the future, and we could probably try it here:
julia> module MacroHelper
       # modified from the julia source ./test/parse.jl
       function parseall(str)
           pos = start(str)
           exs = []
           while !done(str, pos)
               ex, pos = parse(str, pos) # returns next starting point as well as expr
               ex.head == :toplevel ? append!(exs, ex.args) : push!(exs, ex)
           end
           if length(exs) == 0
               throw(ParseError("end of input"))
           elseif length(exs) == 1
               return exs[1]
           else
               return Expr(:block, exs...) # convert the array of expressions
                                           # back to a single expression
           end
       end
       end;
With the module above you could define a simple test "language":
julia> module TstLang
       export @tst_cmd
       import MacroHelper
       macro tst_cmd(a)
           b = replace("$a", "←", "=") # just simply replacing ←
           # in real life you would probably like
           # to escape comments, strings etc
           return MacroHelper.parseall(b)
       end
       end;
And by using it you could probably get what you want:
julia> using TstLang
julia> tst```
a ← 3
println(a)
a +← a + 3 # maybe not wanted? :P
```
3
9
What about performance?
julia> function test1()
           a = 3
           a += a + 3
       end;

julia> function test2()
           tst```
           a ← 3
           a +← a + 3
           ```
       end;

julia> test1(); @time test1();
  0.000002 seconds (4 allocations: 160 bytes)

julia> test2(); @time test2();
  0.000002 seconds (4 allocations: 160 bytes)
If you would like to see syntax highlighting (for example in the Atom editor), then you need to use it differently:
function test3()
    @tst_cmd begin
        a ← 3
        a ← a + a + 3 # the parser doesn't allow you to use "+←" here!
    end
end;
We could hope that future Julia IDEs could syntax highlight cmd macros too. :)
What could be the problem with the "solution" above? I am not such an experienced Julian, so probably many things. :P (At the moment, something about "macro hygiene" and "global scope" comes to mind...)
But what you want is, IMHO, good for some domain-specific languages and not for redefining the basics of the language! Readability counts for a lot, and if everybody redefines everything, it will end in a Tower of Babel...

Rearranging a vector in parallel for fast performance

I have a vector whose length can go up to about a few million or more.
If I say the vector is vec = [a1 a2 ...b1 b2 ... c1 c2 ...d1 d2 ...]
I need to rearrange vec to new_vec = [a1 b1 c1 d1 a2 b2 c2 d2 ...] .
If viewed as a matrix of column vectors, then this could be viewed as a transpose, but I do not have a two dimensional vector. I understand how to do it on a sequential computer. That is very simple.
But I am not sure how to do this on a multiple processor cluster or on a GPU, or even if this would be feasible on parallel machines. Memory seems to be the obvious bottleneck. If there are any algorithms or any architecture specific optimizations that I can use, please let me know.
Edit: More information below.
The code structure is:
subroutine reorder(vec, parameter)
  real(kind=8), dimension(parameter%length), intent(inout) :: vec
  real(kind=8), dimension(parameter%length) :: temp
  type(param) :: parameter  ! just a struct holding certain constant parameters
  integer :: i, j, k, q1, q2, q3, nn1, n1, n2, nn2

  i1 = parameter%len1  ! lengths of sub-vectors in each direction
  i2 = parameter%len2  ! the multiplication of the 3 gives the
                       ! overall length of vec
  i3 = parameter%len3

  temp = vec
  n1 = i2*i1
  n2 = i2*i3
  do k = 1, i3
    q1 = n1*(k-1)
    q2 = i2*(k-1)
    do j = 1, i2
      q3 = i1*(j-1)
      do i = 1, i1
        nn1 = q1+q3+i
        nn2 = q2+j+n2*(i-1)
        vec(nn2) = temp(nn1)
      end do
    end do
  end do
end subroutine reorder
Therefore the code aims to reorder the elements of the vector according to the rule. As you can see, as the length of the vector becomes very large, a significant amount of time is spent in this routine.
This routine is called in a main routine multiple times. A cartesian decomposition produces a cartesian 3D arrangement of ranks at the beginning, and each rank calls this subroutine when a reordering of the elements is required for its own next subroutine call. The cartesian communicator subroutine is shown in the skeleton below:
subroutine cartesian_comm(ndim, comm_cart, comm_one_d, coord_cart)
  use mpi
  implicit none
  integer, dimension(:), intent(in) :: ndim
  integer, intent(out) :: comm_cart
  integer, dimension(:), pointer :: comm_one_d, coord_cart
  logical, dimension(size(ndim)) :: period, remain
  integer :: dim, code, i, rank

  ! creating the cartesian communicator
  dim = 3
  allocate(comm_one_d(dim), coord_cart(dim))
  period = .FALSE.
  call MPI_CART_CREATE(MPI_COMM_WORLD, dim, ndim, period, .FALSE., comm_cart, code)
  call MPI_COMM_RANK(comm_cart, rank, code)
  call MPI_CART_COORDS(comm_cart, rank, dim, coord_cart, code)

  ! Creating sub-communicator for each direction
  do i = 1, dim
    remain = .FALSE.
    remain(i) = .TRUE.
    call MPI_CART_SUB(comm_cart, remain, comm_one_d(i), code)
  end do
end subroutine cartesian_comm
And it is called in the main function as follows:
Program main
  ! initialize some stuff and initialize all the required variables
  ! ndim is the number of processes the program is called with;
  ! "mpirun -np 8 ./exec" would mean ndim is the cube root of 8,
  ! and therefore 2 for the 3D case. It is always made sure that
  ! np is a cube (or square for 2D) when calling the program
  call cartesian_comm(ndim, comm_cart, comm_one_d, coord_cart)

  do while (t < tend - 1D-8)  ! start time loop
    t = t + dt
    ! do some computations to get the vector "vec" for
    ! each rank separately (different and independent in each rank)
    call reorder(vec, parameter)  ! all ranks call this subroutine
    ! do some computations here with the new reordered vec
  end do  ! end time loop

  ! do other stuff (unrelated to reorder but using the "vec" vector) here
end Program main
I would like to know if there is a better way to do this in a multiprocessor cluster, or how I could proceed in the case of a GPU.
