Does Mpi Scatter send data in any kind of order - mpi

I saw this on internet
MPI_Scatter takes an array of elements and distributes the elements in the order of process rank.
But I could not find it on the documentation.
I have an array and there are 4 processes. One process is root -> will scatter data among other 3 processes. The id-s are 0, 1, 2, 3.
Question: Will MPI_Scatter() or MPI_Scatterv() send the data in order, guaranteed?
Example 1:
0: [a, b, c, d, e]
// after scatter
1: [a, b]
2: [c, d]
3: [e]
Example 2:
0: [a, b]
// after scatter
1: [a]
2: [b]
3: [ ]
Also, does gather do the same thing? (preserve order)

The order is guaranteed according to the rank in the MPI_Comm.
Below statement is copy pasted from open-mpi v2.1 documentation:
An alternative description is that the root sends a message with
MPI_Send(sendbuf, sendcount * n, sendtype, ...). This message is split
into n equal segments, the ith segment is sent to the ith process in
the group, and each process receives this message as above. The send
buffer is ignored for all nonroot processes.

Related

Detecting cycles in Topological sort using Kahn's algorithm (in degree / out degree)

I have been practicing graph questions lately.
https://leetcode.com/problems/course-schedule-ii/
https://leetcode.com/problems/alien-dictionary/
The current way I detect cycles is to use two hashsets. One for visiting nodes, and one for fully visited nodes. And I push the result onto a stack with DFS traversal.
If I ever visit a node that is currently in the visiting set, then it is a cycle.
The code is pretty verbose and the length is long.
Can anyone please explain how I can use a more standard top-sort algorithm (Kahn's) to detect cycles and generate the top sort sequence?
I just want my method to exit or set some global variable which flags that a cycle has been detected.
Many thanks.
Khan's algorithm with cycle detection (summary)
Step 1: Compute In-degree: First we create compute a lookup for the in-degrees of every node. In this particular Leetcode problem, each node has a unique integer identifier, so we can simply store all the in-degrees values using a list where indegree[i] tells us the in-degree of node i.
Step 2: Keep track of all nodes with in-degree of zero: If a node has an in-degree of zero it means it is a course that we can take right now. There are no other courses that it depends on. We create a queue q of all these nodes that have in-degree of zero. At any step of Khan's algorithm, if a node is in q then it is guaranteed that it's "safe to take this course" because it does not depend on any courses that "we have not taken yet".
Step 3: Delete node and edges, then repeat: We take one of these special safe courses x from the queue q and conceptually treat everything as if we have deleted the node x and all its outgoing edges from the graph g. In practice, we don't need to update the graph g, for Khan's algorithm it is sufficient to just update the in-degree value of its neighbours to reflect that this node no longer exists.
This step is basically as if a person took and passed the exam for
course x, and now we want to update the other courses dependencies
to show that they don't need to worry about x anymore.
Step 4: Repeat: When we removing these edges from x, we are decreasing the in-degree of x's neighbours; this can introduce more nodes with an in-degree of zero. During this step, if any more nodes have their in-degree become zero then they are added to q. We repeat step 3 to process these nodes. Each time we remove a node from q we add it to the final topological sort list result.
Step 5. Detecting Cycle with Khan's Algorithm: If there is a cycle in the graph then result will not include all the nodes in the graph, result will return only some of the nodes. To check if there is a cycle, you just need to check whether the length of result is equal to the number of nodes in the graph, n.
Why does this work?:
Suppose there is a cycle in the graph: x1 -> x2 -> ... -> xn -> x1, then none of these nodes will appear in the list because their in-degree will not reach 0 during Khan's algorithm. Each node xi in the cycle can't be put into the queue q because there is always some other predecessor node x_(i-1) with an edge going from x_(i-1) to xi preventing this from happening.
Full solution to Leetcode course-schedule-ii in Python 3:
from collections import defaultdict
def build_graph(edges, n):
g = defaultdict(list)
for i in range(n):
g[i] = []
for a, b in edges:
g[b].append(a)
return g
def topsort(g, n):
# -- Step 1 --
indeg = [0] * n
for u in g:
for v in g[u]:
indeg[v] += 1
# -- Step 2 --
q = []
for i in range(n):
if indeg[i] == 0:
q.append(i)
# -- Step 3 and 4 --
result = []
while q:
x = q.pop()
result.append(x)
for y in g[x]:
indeg[y] -= 1
if indeg[y] == 0:
q.append(y)
return result
def courses(n, edges):
g = build_graph(edges, n)
ordering = topsort(g, n)
# -- Step 5 --
has_cycle = len(ordering) < n
return [] if has_cycle else ordering

State diagram, calculate the value after these functions

I have this picture of a state diagram and have to calculate the value of x after a few events. The events are e1-e2-e2-e2-e2
The x would be 2 in the beginning.
First event is e1, so I think it would become 4 after that event.
Next is e2 and I was wondering because the exit is x=x-1, so would it go to state B, because it is less that 4, or C because it was 4, but became 3 in the exit?
And lets suppose it goes to B, and becomes 5, and we do e2 again. Would nothing happen because the only possibility is x>5 and it is equal to 5?
Assuming that the guard between A and C is x>=4 (since there is no e defined) I made a small transition table:
So the final state should be B and X is 11.
In UML state machines, the guards are evaluated, when beeing in the original state. I.e., when receiving e2 the first time, x is 4 and thus you take the transition to C, unser the assumption that e is x (otherwise it doesnt make sense) . After you decided going to C, and thus leave A, you aubstract 1 from x due to the exit ocndition. When beeing in C, you can change B by trigger of e2, which is unguarded (the guard x>5 belongs to the transition from B to C). Now x is 6, as you add 3 due to the entry condition. Then you receive the next e2 and transide to B, where you add 1, so x is now 7. When receiving the next e2, you check the guard on the transition to C, which demands that x is greater 5, which holds. So lets go to C and execute the entry action once again. So x is now 10. Then you get one more e2, so the state changes to C and its entry action is executed, thus x is 11.
So after the execution of the given events, x is 11 and the statemachine is in state B.

Find paths of length = 4, starting by an adjacency matrix of a directed graph, considering only distinct edges?

Given an EREW-PRAM model, that allows me to use an arbitrary number of processors in parallel without them conflicting nor in read, nor in write access, I need to find the number of paths of length 4, considering that I have an input node-node adjacency matrix A representing a directed graph and that I need to exclude paths that don't use distinct edges (e.g.: (a,b),(b,a),(a,b),(b,a) is not a valid path).
I have a function that uses n^3 processors and calculates the matrix multiplication of two given matrices in time O(logn):
mult-matrix(A, A, n) => B --> gives me the paths of length 2.
mult-matrix(B, B, n) => C --> gives me the paths of length 4, but I think it considers paths that run across the same edges.
I tried subtracting 1 from elements of C that have a node u communicating with a node v in both directions, but I'm not sure it works.
How could I solve the problem considering that I just need to exclude some paths from the resulting matrix C?
Any working solution is appreciated, considering that the number of processors is constrained to n^3 and time must be O(logn) in the worst case. The exercises must be solved using a pseudo-pascal language, but given a working solution, I should be able to write the pseudocode by myself.
I think I found a solution in https://www.perlmonks.org/?node_id=522270
Given an input matrix A, I am able to calculate the adjacency matrix for paths of length 2, 3 and 4 with the provided function.
A2 is the adjacency matrix obtained by multiplying A*A and contains paths of length 2
A3 is obtained by multiplying A2*A and contains paths of length 3
A4 is obtained by multiplying A3*A and contains paths of length 4
In order to exclude the repeated edges, I have to compute the matrix C, obtained by doing an element-wise subtraction among the calculated matrices.
C[i,j] = A4[i,j] - A3[i,j] - A2[i,j] - A[i,j]
C contains the final result.
The following pseudocode solves the problem with an EREW-PRAM using O(n^3) processors and in time O(logn).
procedure paths_length_4(A, n) // Work = O(n^3 logn)
begin
A2 := mult_matrix(A, A, n) // T=O(logn), P=O(n^3)
A3 := mult_matrix(A2, A, n) // T=O(logn), P=O(n^3)
A4 := mult_matrix(A3, A, n) // T=O(logn), P=O(n^3)
for all i,j where 1 ≤ i ≤ n, 1 ≤ j ≤ n pardo // P=O(n^2)
C[i,j] := A4[i,j] - A3[i,j] - A2[i,j] - A[i,j]
end

Prolog infinite loop

I'm fairly new to Prolog and I hope this question hasn't been asked and answered but if it has I apologize, I can't make sense of any of the other similar questions and answers.
My problem is that I have 3 towns, connected by roads. Most are one way, but there are two towns connected by a two way street. i.e.
facts:
road(a, b, 1).
road(b, a, 1).
road(b, c, 3).
where a, b and c are towns, and the numbers are the distances
I need to be able to go from town a to c without getting stuck between a and b
Up to here I can solve with the predicates: (where r is a list of towns on the route)
route(A, B, R, N) :-
road(A, B, N),
R1 = [B],
R = [A|R1],
!.
route(A, B, R, N) :-
road(A, C, N1),
route(C, B, R1, N2),
\+ member(A, R1),
R = [A | R1],
N is N1+N2.
however if I add a town d like so
facts:
road(b, d, 10)
I can't get Prolog to recognize this is a second possible route. I know that this is because I have used a cut, but without the cut it doesn't stop and ends in stack overflow.
Furthermore I will then need to be able to write a new predicate that returns true when R is given as the shortest route between a and c.
Sorry for the long description. I hope someone can help me!
This is a problem of graph traversal. I think your problem is that you've got a cyclic graph — you find the leg a-->b and the next leg you find is b-->a where it again finds the leg a-->b and ... well, you get the picture.
I would approach the problem like this, using a helper predicate with accumulators to build my route and compute total distance. Something like this:
% ===========================================================================
% route/4: find the route(s) from Origin to Destination and compute the total distance
%
% This predicate simply invoke the helper predicate with the
% accumulator(s) variables properly seeded.
% ===========================================================================
route(Origin,Destination,Route,Distance) :-
route(Origin,Destination,[],0,Route,Distance)
.
% ------------------------------------------------
% route/6: helper predicate that does all the work
% ------------------------------------------------
route(D,D,V,L,R,L) :- % special case: you're where you want to be.
reverse([D|V],R) % - reverse the visited list since it get built in reverse order
. % - and unify the length accumulator with the final value.
route(O,D,V,L,Route,Length) :- % direct connection
road(O,D,N) , % - a segment exists connecting origin and destination directly
L1 is L+N , % - increment the length accumulator
V1 = [O|V] , % - prepend the current origin to the visited accumulator
route(D,D,V1,L1,Route,Length) % - recurse down, indicating that we've arrived at our destination
. %
route(O,D,V,L,Route,Length) :- % indirect connection
road(O,X,N) , % - a segment exists from the current origin to some destination
X \= D , % - that destination is other than the desired destination
not member(X,V) , % - and we've not yet visited that destination
L1 is L+N , % - increment the length accumulator
V1 = [O|V] , % - prepend the current origin to the visited accumulator
route(X,D,V1,L1,Route,Length) % - recurse down using the current destination as the new origin.

MPI several broadcast at the same time

I have a 2D processor grid (3*3):
P00, P01, P02 are in R0, P10, P11, P12, are in R1, P20, P21, P22 are in R2.
P*0 are in the same computer. So the same to P*1 and P*2.
Now I would like to let R0, R1, R2 call MPI_Bcast at the same time to broadcast from P*0 to p*1 and P*2.
I find that when I use MPI_Bcast, it takes three times the time I need to broadcast in only one row.
For example, if I only call MPI_Bcast in R0, it takes 1.00 s.
But if I call three MPI_Bcast in all R[0, 1, 2], it takes 3.00 s in total.
It means the MPI_Bcast cannot work parallel.
Is there any methods to make the MPI_Bcast broadcast at the same time?
(ONE node broadcast with three channels at the same time.)
Thanks.
If I understand your question right, you would like to have simultaneous row-wise broadcasts:
P00 -> P01 & P02
P10 -> P11 & P12
P20 -> P21 & P22
This could be done using subcommunicators, e.g. one that only has processes from row 0 in it, another one that only has processes from row 1 in it and so on. Then you can issue simultaneous broadcasts in each subcommunicator by calling MPI_Bcast with the appropriate communicator argument.
Creating row-wise subcommunicators is extreamly easy if you use Cartesian communicator in first place. MPI provides the MPI_CART_SUB operation for that. It works like that:
// Create a 3x3 non-periodic Cartesian communicator from MPI_COMM_WORLD
int dims[2] = { 3, 3 };
int periods[2] = { 0, 0 };
MPI_Comm comm_cart;
// We do not want MPI to reorder our processes
// That's why we set reorder = 0
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &comm_cart);
// Split the Cartesian communicator row-wise
int remaindims[2] = { 0, 1 };
MPI_Comm comm_row;
MPI_Cart_sub(comm_cart, remaindims, &comm_row);
Now comm_row will contain handle to a new subcommunicator that will only span the same row that the calling process is in. It only takes a single call to MPI_Bcast now to perform three simultaneous row-wise broadcasts:
MPI_Bcast(&data, data_count, MPI_DATATYPE, 0, comm_row);
This works because comm_row as returned by MPI_Cart_sub will be different in processes located at different rows. 0 here is the rank of the first process in comm_row subcommunicator which will correspond to P*0 because of the way the topology was constructed.
If you do not use Cartesian communicator but operate on MPI_COMM_WORLD instead, you can use MPI_COMM_SPLIT to split the world communicator into three row-wise subcommunicators. MPI_COMM_SPLIT takes a color that is used to group processes into new subcommunicators - processes with the same color end up in the same subcommunicator. In your case color should equal to the number of the row that the calling process is in. The splitting operation also takes a key that is used to order processes in the new subcommunicator. It should equal the number of the column that the calling process is in, e.g.:
// Compute grid coordinates based on the rank
int proc_row = rank / 3;
int proc_col = rank % 3;
MPI_Comm comm_row;
MPI_Comm_split(MPI_COMM_WORLD, proc_row, proc_col, &comm_row);
Once again comm_row will contain the handle of a subcommunicator that only spans the same row as the calling process.
The MPI-3.0 draft includes a non-blocking MPI_Ibcast collective. While the non-blocking collectives aren't officially part of the standard yet, they are already available in MPICH2 and (I think) in OpenMPI.
Alternatively, you could start the blocking MPI_Bcast calls from separate threads (I'm assuming R0, R1 and R2 are different communicators).
A third possibility (which may or may not be possible) is to restructure the data so that only one broadcast is needed.

Resources