MPI_ALLREDUCE PROBLEM - mpi

I have a code with a 2D local array (cval). This local array is calculated by every processor, and at the end I call MPI_ALLREDUCE to sum the local arrays into a global array (gns).
This local array has different sizes on different processors. The way I do the all-reduce is as follows:
k = n2spmax - n2spmin + 1   ! an arbitrary big value
do i = nmin, nmax
   call MPI_ALLREDUCE(cval(i,:), gns(i,:), k, MPI_DOUBLE_PRECISION, MPI_SUM, MPI_COMM_WORLD, ierr)
end do
Is this the correct way of writing it? I am not sure about it.

No, you can't do it this way. MPI_Allreduce requires that all of the processes in the communicator are contributing the same amount of data. That's why there's a single count argument.
To give more guidance on what is the right way to do this, we'll need a bit more clarity on what you're trying to do. Is the idea that you're calculating gns(i,j) = the sum over all ranks of cval(i,j), but not all ranks have all the cval(i,j)s?
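If that is the case, one common workaround is to give every rank a buffer of the same, globally agreed size, zero the entries it did not compute, and reduce the whole block in a single call. Below is a minimal C sketch of that idea; the sizes, ownership pattern and fill values are made up for illustration and are not taken from the question:

/* mpicc allreduce_pad.c -o allreduce_pad && mpirun -np 4 ./allreduce_pad */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define NROWS 4   /* hypothetical nmax - nmin + 1 */
#define NCOLS 6   /* hypothetical n2spmax - n2spmin + 1, the agreed maximum width */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double cval[NROWS][NCOLS];
    double gns[NROWS][NCOLS];

    /* Zero everything first, then fill only the part this rank actually computed;
       the zero entries contribute nothing to the sum. */
    memset(cval, 0, sizeof cval);
    int owned = 2 + rank % 3;                 /* pretend each rank computes a different width */
    for (int i = 0; i < NROWS; i++)
        for (int j = 0; j < owned && j < NCOLS; j++)
            cval[i][j] = 1.0;

    /* One collective call, with the same count on every rank. */
    MPI_Allreduce(cval, gns, NROWS * NCOLS, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("gns[0][0] = %g\n", gns[0][0]);

    MPI_Finalize();
    return 0;
}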

Nondeterministic growth in Julia loop

I have a computation loop with adaptive time stepping and I need to store the results at each iteration. In other words, I do not know the vector size before the computation, so I cannot preallocate a vector to store the data. Right now, I build the vectors using the push! function:
function comp_loop(model_time)
    clock = [0.0]
    data = [0.0]
    run_time = 0.0
    while run_time < model_time
        # Calculate new timestep (random here as a stand-in for the adaptive step)
        timestep = rand(Float64)
        run_time += timestep
        # Build the vectors
        push!(clock, run_time)
        push!(data, timestep)
    end
    return clock, data
end
Is there a more efficient way to go about this? Again, I know the optimal choice is to preallocate the vector, but I do not have that luxury available. Buffers are theoretically not an option either, as I don't know how large to make them. I'm looking for something more "optimal" in how to implement this in Julia (i.e. maybe some advanced facility available in the language).
Theoretically, you can use a linked list such as the one from DataStructures.jl to get O(1) appending. And then optionally write that out to a Vector afterwards (probably in reverse order, though).
In practice, push!ing to a Vector is often efficient enough -- Vectors use a doubling strategy to manage their dynamic size, which leads to amortized constant time and the advantage of contiguous memory access.
So you could try the linked list, but be sure to benchmark whether it's worth the effort.
Now, the above is about time complexity. When you care about allocation, the argument is quite similar; with a vector, you are going to end up with memory proportional to the next power of two after your actual requirements. Most often, that can be considered amortized, too.
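For intuition about that doubling strategy, here is a rough conceptual sketch in C (this is not Julia's actual implementation, and Julia's real growth heuristics differ in detail; error handling is omitted):

#include <stdlib.h>

/* Conceptual doubling dynamic array: the capacity doubles only when full, so n pushes
   trigger O(log n) reallocations that copy O(n) elements in total -- amortized O(1). */
typedef struct {
    double *data;
    size_t  len;
    size_t  cap;
} Vec;

void vec_push(Vec *v, double x)
{
    if (v->len == v->cap) {
        v->cap  = v->cap ? 2 * v->cap : 8;              /* grow geometrically */
        v->data = realloc(v->data, v->cap * sizeof *v->data);
    }
    v->data[v->len++] = x;                              /* append in O(1) on average */
}

The memory overhead mentioned above comes from the same mechanism: after the last growth step, the capacity can be up to twice the number of elements actually stored.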

Saving multiple sparse arrays in one big sparse array

I have been trying to implement some code in Julia/JuMP. The idea of my code is that I have a for loop inside my while loop that runs S times. In each of these loops I solve a subproblem and get some variables, as well as opt=1 if the subproblem was optimal or opt=0 if it was not. Depending on the value of opt, I have two types of constraints, either optimality cuts (if opt=1) or feasibility cuts (if opt=0). So the intention with my code is that I only add all of the optimality cuts if there are no feasibility cuts for s=1:S (i.e. we get opt=1 in every iteration from 1:S).
What I am looking for is a better way to save the values of ubar, vbar and wbar. Currently I am saving them one at a time with the for-loop, which is quite expensive.
So the problem is that my values of ubar, vbar and wbar are sparse axis arrays. I have tried to save them in other ways, like making a 3D sparse axis array, which I could not get to work, since I couldn't figure out how to initialize it.
The below code works (with the correct code inserted inside my <>'s of course), but does not perform as well as I wish. So if there is some way to save the values of 2D sparse axis arrays more efficiently, I would love to know it! Thank you in advance!
ubar2 = zeros(nV, nV, S)
vbar2 = zeros(nV, nV, S)
wbar2 = zeros(nV, nV, S)
while <some condition>
    opts = 0
    for s = 1:S
        <solve a subproblem, get new ubar, vbar, wbar and opt=1 if optimal or 0 if not>
        opts += opt
        if opt == 1
            # Save values for the opt cut constraints (added below once opts == S)
            for i = 1:nV
                for k = 1:nV
                    if i != k
                        ubar2[i, k, s] = ubar[i, k]
                    end
                end
                for j = i:nV
                    if links[i, j] == 1
                        vbar2[i, j, s] = vbar[i, j]
                        wbar2[i, j, s] = wbar[i, j]
                    end
                end
            end
        else
            # Add feas cut constraint
            # @constraint(mas, <constraint from ubar,vbar,wbar> <= 0)
            break
        end
    end
    if opts == S
        for s = 1:S
            # @constraint(mas, <constraint from ubar2,vbar2,wbar2> <= <some variable>)
        end
    end
end
A SparseAxisArray is simply a thin wrapper on top of a Dict.
It is defined so that when a user creates a container in a JuMP macro, the result behaves as similarly as possible whether it is an Array, a DenseAxisArray or a SparseAxisArray, so for most operations the user does not need to care which container was obtained.
For this reason we could not just return a Dict, as it behaves differently from an array; for instance, you cannot call getindex with multiple indices such as x[2, 2].
Here you can use either a Dict or a SparseAxisArray, as you prefer.
Both of them have O(1) complexity for setting and getting elements, and sparse storage, which seems adequate for what you need.
If you choose SparseAxisArray, you can initialize it with
ubar2 = JuMP.Containers.SparseAxisArray(Dict{Tuple{Int,Int,Int},Float64}())
and set it with
ubar2[i,k,s]=ubar[i,k]
If you choose Dict, you can initialize it with
ubar2 = Dict{Tuple{Int,Int,Int},Float64}()
and set it with
ubar2[(i,k,s)]=ubar[i,k]

R: Faster way to repeat an array along a dimension?

Suppose I have an array z given by:
z = array(runif(100*50*200),c(100,50,200))
Is there a faster way to do:
dim(z) = c(100,50,1,200)
z = z[,,rep(1,300),]
Note that this is an example; the new dimension along which I want to repeat the array is not always the 3rd, and the number of dimensions of the starting array is not always 3.
profvis::profvis() only shows that the garbage collector takes a certain amount of the time, but it does not show other internals.
It might be an allocation issue, although I'm not sure why it takes that kind of time. I have several of these very basic calls in my code, and 95% of my runtime is spent there. So even if it's unavoidable, can you explain to me why it takes so long?

Iterating results back into an OpenCL kernel

I have written an OpenCL kernel that takes 25 million points and checks them relative to two lines (A & B). It then outputs two lists, i.e. set A of all of the points found to be beyond line A, and vice versa.
I'd like to run the kernel repeatedly, updating the input points with each of the line result sets in turn (and also updating the checking line). I'm guessing that reading the two result sets out of the kernel, forming them into arrays and then passing them back in one at a time as inputs is quite a slow solution.
As an alternative, I've tested keeping a global index in the kernel that logs which points relate to which line. This is updated at each line-checking cycle. During each iteration, the index for each point in the overall set is switched to 0 (no line), A, B and so forth (i.e. the related line id). In subsequent iterations only points with an index that matches the 'live' set being checked in that cycle (i.e. tagged with A for set A) are tested further.
The problem is that, in each iteration, the kernels still have to check through the full index (i.e. all 25 million points) to discover whether or not they are in the 'live' set. As a result, the speed of each cycle does not significantly improve as the size of the result set decreases over time. Again, this seems a slow solution; while it avoids passing too much information between GPU and CPU, it means that a large number of the work items aren't doing very much work at all.
Is there an alternative solution to what I am trying to do here?
You could use atomics to sort the outputs into two arrays, i.e. if a point is in A, get its position by incrementing the A counter and write it into A, and do the same for B.
Using global atomics on everything might be horribly slow (fast on AMD, slow on NVIDIA, no idea about other devices) - instead you can use atomic_inc on a zeroed local integer to do exactly the same thing (but only for the local set of work-items), and then at the end do an atomic_add to each global counter based on your local counters.
To put this more clearly in code (my explanation is not great):
__local int local_a, local_b;           // per-work-group output counters
__local int a_base, b_base;             // where this group's block starts in the global buffers
int lid = get_local_id(0);

// zero the local counters once per work-group
if(lid == 0)
{
    local_a = 0;
    local_b = 0;
}
barrier(CLK_LOCAL_MEM_FENCE);

// each work-item claims a slot in its group's A or B block
int id;
if(is_a)
    id = atomic_inc(&local_a);
else
    id = atomic_inc(&local_b);
barrier(CLK_LOCAL_MEM_FENCE);

// one work-item per group reserves space in the global buffers
if(lid == 0)
{
    a_base = atomic_add(a_counter, local_a);
    b_base = atomic_add(b_counter, local_b);
}
barrier(CLK_LOCAL_MEM_FENCE);

// scatter the point into the right output buffer
if(is_a)
    a_buffer[id + a_base] = data;
else
    b_buffer[id + b_base] = data;
This involves faffing around with atomics, which are inherently slow, but depending on how quickly your dataset shrinks it might be much faster. Additionally, if the B data is not considered live, you can omit getting the b ids and all the atomics involving b, as well as the write-back.

In MPI how to communicate a part of a "shared" array to all other ranks?

I have an MPI program with an array of data. Every rank needs the whole array to do its work, but will only work on a patch of it. After a calculation step I need every rank to communicate its computed piece of the array to all other ranks.
How do I achieve this efficiently?
In pseudo code I would do something like this as a first approach:
if rank == 0:                                  // only master rank
    initialise_data()
end if
MPI_Bcast(all_data, 0)                         // from master to every rank
compute which part of the data to work on
for ( several steps ):                         // each rank
    execute_computation(part_of_data)
    for ( each rank ):
        MPI_Bcast(part_of_data, rank_number)   // from every rank to every rank
    end for
end for
The disadvantage is that there are as many broadcasts, i.e. barriers, as there are ranks. So how would I replace the MPI_Bcasts?
edit: I just might have found a hint... Is it MPI_Allgather I am looking for?
Yes, you are looking for MPI_Allgather. Note that recvcount is not the length of the whole receive buffer, but the amount of data to be received from each single process. Analogously, in MPI_Allgatherv, recvcounts[i] is the amount of data you want to receive from the i-th process. Moreover, recvcount should be equal to (not less than) the corresponding sendcount. I tested this on my implementation (OpenMPI), and if I tried to receive fewer elements than were sent, I got an MPI_ERR_TRUNCATE error.
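For illustration, here is a minimal C sketch of that pattern (the chunk size and variable names are invented for the example), where every rank contributes an equally sized piece and receives everyone's pieces in rank order:

/* mpicc allgather_demo.c -o allgather_demo && mpirun -np 4 ./allgather_demo */
#include <mpi.h>
#include <stdlib.h>

#define CHUNK 100   /* hypothetical size of each rank's patch */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double part_of_data[CHUNK];
    for (int i = 0; i < CHUNK; i++)
        part_of_data[i] = rank;            /* this rank's computed patch */

    /* The receive buffer holds every rank's chunk, concatenated in rank order. */
    double *all_data = malloc((size_t)size * CHUNK * sizeof *all_data);

    /* Note: recvcount is CHUNK (the amount received from EACH rank),
       not size * CHUNK (the total length of the receive buffer). */
    MPI_Allgather(part_of_data, CHUNK, MPI_DOUBLE,
                  all_data, CHUNK, MPI_DOUBLE, MPI_COMM_WORLD);

    free(all_data);
    MPI_Finalize();
    return 0;
}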
Also, in some rare cases I have used MPI_Allreduce for that purpose. For example, if we have the following arrays:
process0: AA0000
process1: 0000BB
process2: 00CC00
then we can do an Allreduce with the MPI_SUM operation and get AACCBB on all processes. Obviously, the same trick can be done with ones instead of zeros and MPI_PROD instead of MPI_SUM.
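A rough C sketch of that trick (assuming exactly three ranks and the disjoint layout shown above, with numbers standing in for A, B and C):

/* mpicc allreduce_trick.c -o allreduce_trick && mpirun -np 3 ./allreduce_trick */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* rank 0 owns slots 0-1 (AA0000), rank 1 owns slots 4-5 (0000BB), rank 2 owns slots 2-3 (00CC00) */
    int start[3] = {0, 4, 2};
    double local[6] = {0};
    double full[6];
    local[start[rank]]     = rank + 1.0;
    local[start[rank] + 1] = rank + 1.0;

    /* The zeros contribute nothing, so the sum assembles the full array on every rank. */
    MPI_Allreduce(local, full, 6, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) {
        for (int i = 0; i < 6; i++)
            printf("%g ", full[i]);        /* prints: 1 1 3 3 2 2, i.e. AACCBB */
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}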
