I understand what the recursive algorithm to solve Tower of Hanoi is doing:
For example, if we have three pegs (A, B, C) and we want to move 3 discs from A to C, we first move discs 1 and 2 to B, then move disc 3, the biggest one, to peg C, and finally move discs 1 and 2 (which we moved earlier to B) onto C. This algorithm is expressed in pseudocode in the following way:
FUNCTION MoveTower(disk, source, dest, spare):
    IF disk == 0, THEN:
        move disk from source to dest
    ELSE:
        MoveTower(disk - 1, source, spare, dest)
        move disk from source to dest
        MoveTower(disk - 1, spare, dest, source)
    END IF
I make the call MoveTower(3, A, C, B), which calls MoveTower(2, A, B, C), which calls MoveTower(1, A, C, B), which calls MoveTower(0, A, B, C), finally reaching the base case, which moves a disc from A to B.
This is where I am confused. When we arrive at the base case, we move the top disc on A to B (in a single shot). How does the recursion move everything else? How does the "backing out" phase move the other discs (which in this case means all discs but the biggest onto B)? Isn't it only moving the top disc at the base case?
For example, I know that in the factorial function, after we reach the base case, the recursion "returns" a value which is passed on to the previous call, which passes it on to its previous call, all the way up to the top call. At each level, it's waiting for its recursive call to come back.
Can someone help me understand how the recursion accomplishes anything in the first MoveTower(disk - 1, source, spare, dest) call other than reaching the base case?
Thanks
Just be patient, and be precise.
FUNCTION MoveTower(disk, source, dest, spare):
1. IF disk == 0, THEN:
2. move disk from source to dest
3. ELSE:
4. MoveTower(disk - 1, source, spare, dest)
5. move disk from source to dest
6. MoveTower(disk - 1, spare, dest, source)
7. END IF
Imagine yourself being a human computer, sitting at a nice comfy desk, with paper and pencils aplenty.
To call a function you copy the function recipe onto an empty sheet of paper and place it in front of you.
To call another function you copy that function's recipe onto an empty sheet of paper and place this sheet of paper on the stack of sheets of paper in front of you. It does not matter whether you are calling the same function or not, because you are working with its recipe's copy.
CALL MoveTower(3, A, C, B)
==> MoveTower_recipe{ disk=3, source=A, dest=C, spare=B }
| 1. IF disk==0, THEN:
= IF 3 ==0, THEN:
....
3. ELSE:
4. CALL MoveTower(disk - 1, source, spare, dest)
= CALL MoveTower(3-1, A, B, C)
==> MoveTower_recipe{ disk=2, source=A, dest=B, spare=C }
| 1. IF 2 == 0 THEN:
.....
3. ELSE:
4. CALL MoveTower( 1, A, C, B)
==> MoveTower_recipe{ disk=1, source=A, dest=C, spare=B }
| 1. IF 1 == 0 THEN:
3. ELSE:
4. CALL MoveTower(0, A, B, C)
==> MoveTower_recipe{ disk=0, source=A, dest=B, spare=C }
| 1. IF 0 == 0 THEN:
2. ___move disk{0} from source{A} to dest{B}___ (*1)
7. END IF
<==
5. ___move disk{1} from source{A} to dest{C}___ (*2)
6. CALL MoveTower( 0, B, C, A)
==>
.....
.....
See? The copies of the MoveTower function recipe are stacked on top of each other, with each copy holding its own actual values of the function's named parameters (here the stack grows - visually - down, but on your desk the stack of papers would be piling up).
You work along the recipe on the top sheet of paper, making notes on its margins as to where you currently are along the recipe's lines, and the values of various named parameters and / or named internal variables (if any) and / or interim unnamed values (like disk - 1).
When you're done with the top sheet of paper, you throw it away, but not before copying its return value (if any) onto the sheet of paper that is now on top, at the place where you were before you entered the now-discarded recipe's copy.
You can also note the input-output instructions performed by you the human computer, following the (copies of) the function recipes, on yet another sheet of paper on your side, recording the effects on the Real World your program would be having (marked with (*1), (*2), etc., above).
That's it.
The state of your computation is recorded in the margins on each recipe copy in the stack.
When you've reached the base case, not only have you produced the first output instruction; you've also piled up a whole lot of function recipes' copies in front of you, each with its associated data and state (current point of execution).
You must visualize the call tree in your head:
MoveTower 3, from A, to B, using C:
1. MoveTower 2, from A, to C, using B:
1.1. MoveTower 1, from A, to B, using C:
1.1.1. Move disk 1 from A to B.
1.2. Move disk 2 from A to C.
1.3. MoveTower 1, from B, to C, using A:
1.3.1. Move disk 1 from B to C.
2. Move disk 3 from A to B.
3. MoveTower 2, from C, to B, using A:
3.1. MoveTower 1, from C, to A, using B:
3.1.1. Move disk 1 from C to A.
3.2. Move disk 2 from C to B.
3.3. MoveTower 1, from A, to B, using C:
3.3.1. Move disk 1 from A to B.
Now if we read only the "Move disk" operations, we have:
Move disk 1 from A to B.
Move disk 2 from A to C.
Move disk 1 from B to C.
Move disk 3 from A to B.
Move disk 1 from C to A.
Move disk 2 from C to B.
Move disk 1 from A to B.
As to "how it accomplishes anything", it's like this:
We have to move a tower of n disks from post A, to post B, using post C as temporary storage.
There are two possibilities, n is 1, or n is greater than 1.
If n is 1 we just move the disk.
If n is greater than 1 we do it in three steps:
Move a smaller tower of n − 1 disks from A to C using B as temporary storage.
Move the bottom disk from A to B.
Move the smaller tower of n - 1 disks from C to B using A as temporary storage.
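For illustration, here is that three-step recipe written as a runnable Fortran sketch (the program name, the disk numbering 1..n and the peg labels are illustrative; calling it as below prints exactly the seven "Move disk" lines of the call tree above):

PROGRAM hanoi
  IMPLICIT NONE
  CALL move_tower(3, 'A', 'B', 'C')   ! move 3 disks from A to B, using C as temporary storage
CONTAINS
  RECURSIVE SUBROUTINE move_tower(n, source, dest, spare)
    INTEGER, INTENT(IN) :: n
    CHARACTER, INTENT(IN) :: source, dest, spare
    IF (n == 1) THEN
      PRINT *, 'Move disk 1 from ', source, ' to ', dest
    ELSE
      CALL move_tower(n - 1, source, spare, dest)                ! step 1: smaller tower onto the spare peg
      PRINT *, 'Move disk ', n, ' from ', source, ' to ', dest   ! step 2: the bottom disk
      CALL move_tower(n - 1, spare, dest, source)                ! step 3: smaller tower on top of it
    END IF
  END SUBROUTINE move_tower
END PROGRAM hanoi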
The image below shows a set that is supposed to describe a binary transitive relation:
That first arrow notation looked fine to me at first, until I saw the d node. I thought that since d cannot reach b (or any other node beyond c, even though it connects to c), the relation cannot be transitive?
A little bit of clarification would be great
The first panel is fine, i.e., it is transitive. Transitivity does not require that d have a (directed) path to b in this case. Transitivity, by definition, requires: "if there are x and y such that d → x and x → y, then it must be that d → y". Since c (which would potentially play the role of x here) does not point to anything, there is no chain of two arrows starting from d, so there is no condition that needs to be satisfied (i.e., the condition is vacuously true when starting from d).
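Spelled out formally, this is just a restatement of the definition used above (in LaTeX notation):

    \forall x \, \forall y \, \forall z : (x \to y \wedge y \to z) \Rightarrow (x \to z)

As described above, the only arrow leaving d goes to c, and c points to nothing, so no chain d → x, x → y exists and the implication is vacuously true for d.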
This is the moment I feel I need something like MPI_Neighbor_allreduce, but I know it doesn't exist.
Foreword
Given a 3D MPI cartesian topology describing how a 3D physical domain is distributed among processes, I wrote a function probe that asks for a scalar value (which is supposed to be put in a simple REAL :: val) given the 3 coordinates of a point inside the domain.
There can only be 1, 2, 4, or 8 process(es) that are actually involved in the computation of val.
1 if the point is internal to a process subdomain (and it has no neighbors involved),
2 if the point is on a face between 2 processes' subdomains (and each of them has 1 neighbor involved),
4 if the point is on a side between 4 processes' subdomains (and each of them has 2 neighbors involved),
8 if the point is a vertex shared by 8 processes' subdomains (and each of them has 3 neighbors involved).
After the call to probe as it is now, each process holds val, which is some meaningful value for involved processes and 0 or NaN (I decide by (de)commenting the proper lines) for not-involved processes. Each process also knows whether it is involved or not (through a LOGICAL :: found variable), but it does not know whether it is the only one involved, nor which of its neighbors are involved if it is not.
In the case of 1 involved process, the value held by that single process is enough, and the process can write it, use it, or whatever is needed.
In the latter three cases, the sum of the different scalar values of the processes involved must be computed (and divided by the number of neighbors +1, i.e. self included).
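In other words, with k denoting the number of involved neighbors of a given involved process (so k + 1 processes contribute in total):

    \bar{v} = \frac{1}{k+1} \left( v_{\text{self}} + \sum_{i=1}^{k} v_i \right)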
The question
What is the best strategy to accomplish this communication and computation?
What solutions I'm thinking about
I'm thinking about the following possibilities.
Every process executes val = 0 before the call to probe; then MPI_(ALL)REDUCE can be used (the involved processes participating with val /= 0 in general, all the others with val == 0), as sketched after this list. But this would mean that, if val is requested for several points, those points would be treated serially, even if the sets of involved processes for the different points do not overlap.
Every process calls MPI_Neighbor_allgather to share found among neighboring processes, so that each involved process knows which of its 6 neighbors participate(s) in the sum, and then performs individual MPI_SEND(s) and MPI_RECV(s) to communicate val. But this would still involve every process (even though each communicates only with its 6 neighbors).
Maybe the best choice is that each process defines a communicator made up of itself plus its 6 neighbors and then uses a collective reduction within it.
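As a reference, the first option could look like the following minimal sketch (variable names are illustrative; a second reduction counts the contributors so that the sum can be averaged):

REAL    :: val, val_sum
INTEGER :: n_involved, n_total, ierr
LOGICAL :: found

IF (.NOT. found) val = 0.0          ! not-involved processes contribute nothing
n_involved = MERGE(1, 0, found)     ! 1 if this process is involved, 0 otherwise

CALL MPI_ALLREDUCE(val, val_sum, 1, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, ierr)
CALL MPI_ALLREDUCE(n_involved, n_total, 1, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, ierr)

IF (found) val = val_sum / REAL(n_total)   ! average over the k+1 contributors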
EDIT
For what concerns the risk of deadlock mentioned by @JorgeBellón, I initially solved it by calling MPI_SEND before MPI_RECV for communications in the positive direction (i.e., those corresponding to even indices in who_is_involved) and vice versa in the negative direction. As a special case, this could not deal with a periodic direction having only two processes along it (since each of the two would see the other one as a neighbor in both the positive and negative directions, resulting in both processes calling MPI_SEND and MPI_RECV in the same order and thus causing a deadlock). The solution to this special case was the following ad-hoc edit to who_is_involved (which I called found_neigh in my code):
DO id = 1, ndims
  IF (ALL(found_neigh(2*id - 1:2*id))) found_neigh(2*id - 1 + mycoords(id)) = .FALSE.
END DO
As a reference for readers, the solution I have implemented so far (one I'm not so satisfied with) is the following.
found = ...   ! .TRUE. or .FALSE. depending on whether the process is/isn't involved in the computation of val
IF (found)       val = ...   ! compute own contribution
IF (.NOT. found) val = NaN

! share found among neighbors
found_neigh(:) = .FALSE.
CALL MPI_NEIGHBOR_ALLGATHER(found, 1, MPI_LOGICAL, found_neigh, 1, MPI_LOGICAL, procs_grid, ierr)
found_neigh = found_neigh .AND. found

! modify found_neigh to deal with the special case of TWO processes along a PERIODIC direction
DO id = 1, ndims
  IF (ALL(found_neigh(2*id - 1:2*id))) found_neigh(2*id - 1 + mycoords(id)) = .FALSE.
END DO

! exchange contributions with neighbors
val_neigh(:) = NaN
IF (found) THEN
  DO id = 1, ndims
    IF (found_neigh(2*id)) THEN
      CALL MPI_SEND(val, 1, MPI_DOUBLE_PRECISION, idp(id), 999, MPI_COMM_WORLD, ierr)
      CALL MPI_RECV(val_neigh(2*id), 1, MPI_DOUBLE_PRECISION, idp(id), 666, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
    END IF
    IF (found_neigh(2*id - 1)) THEN
      CALL MPI_RECV(val_neigh(2*id - 1), 1, MPI_DOUBLE_PRECISION, idm(id), 999, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
      CALL MPI_SEND(val, 1, MPI_DOUBLE_PRECISION, idm(id), 666, MPI_COMM_WORLD, ierr)
    END IF
  END DO
END IF

! combine own contribution with the others
val = somefunc(val, val_neigh)
As you said, MPI_Neighbor_allreduce does not exist.
You can create derived communicators that only include your adjacent processes and then perform a regular MPI_Allreduce on them. Each process can have up to 7 communicators in a 3D grid.
One communicator in which the process itself is placed at the center of the stencil.
One communicator for each of its adjacent processes (where that neighbor is the center).
This can be a quite expensive process, but that does not mean it cannot be worthwhile (HPLinpack makes extensive use of derived communicators, for example).
If you already have a cartesian topology, a good approach is to use MPI_Neighbor_allgather. This way you will not only know how many neighbors are involved but also which ones they are.
int found;                            // logical: either 1 or 0
int num_neighbors;                    // how many neighbors I have
int who_is_involved[num_neighbors];   // unknown, to be received
double val_neigh[num_neighbors];      // per-neighbor receive buffer for val

MPI_Neighbor_allgather( &found, ..., who_is_involved, ..., comm );

int actually_involved = 0;
int r = 0;
MPI_Request reqs[2*num_neighbors];
for( int i = 0; i < num_neighbors; i++ ) {
    if( who_is_involved[i] != 0 ) {
        actually_involved++;
        MPI_Isend( &val, ..., &reqs[r++] );
        MPI_Irecv( &val_neigh[i], ..., &reqs[r++] );   // receive into a separate buffer, not into val
    }
}
MPI_Waitall( r, reqs, MPI_STATUSES_IGNORE );
Note that I'm using non-blocking point-to-point routines. This is important in most cases because MPI_Send may wait for the receiver to call MPI_Recv. Unconditionally calling MPI_Send and then MPI_Recv in all processes may cause a deadlock (see MPI 3.1 standard, section 3.4).
Another possibility is to send both the real value and found in a single communication, so that the number of transfers is reduced. Since all processes are involved in the MPI_Neighbor_allgather anyway, you could use it to get everything done (for a small increase in the amount of data transferred, it really pays off).
INTEGER :: neighbor, num_neighbors, found
REAL :: val
REAL :: sendbuf(2)
REAL :: recvbuf(2,num_neighbors)

sendbuf(1) = found   ! found encoded as 0.0 / 1.0 so it can travel in the REAL buffer
sendbuf(2) = val
CALL MPI_Neighbor_allgather( sendbuf, 1, MPI_2REAL, recvbuf, 1, MPI_2REAL, ...)
DO neighbor = 1,num_neighbors
  IF (recvbuf(1,neighbor) == 1.0) THEN
    ! use the neighbor's val, placed in recvbuf(2,neighbor)
  END IF
END DO
Assume we have a binary search tree, and the rule above(X,Y), meaning that X is directly above Y. I also created the rule root(X), meaning that X has no parent.
Then I was trying to figure out the depth of a node in this tree.
Assume the root node of the tree is "r", so I have the fact level(r,0). In order to implement the rule level(N,D) :- ..., what I was thinking is that there should be a recursion here.
Thus, I tried
level(N,D) :- \+ root(N), above(X,N), D is D+1, level(X,D).
So if N is not a root, there is a node X above N, the level is D plus one, and then we recurse. But when I tested this, it only works for the root case. When I created more facts, such as node "s" being the left child of node "r", my query level(s,D) returns "no". I traced the query, and it shows me:
1 1 Call: level(s,_16) ?
1 1 Fail: level(s,_16) ?
I am just confused about why it fails when I call level(s,D).
There are some problems with your program:
In Prolog you cannot write something like D is D+1, because a variable can only be assigned one value;
at the moment you call D is D+1, D is not yet instantiated, so it will probably cause an error; and
You never state (at least not in the visible code) that the level/2 of the root is 0.
A solution is to first state that the level of any root is 0:
level(N,0) :-
root(N).
Now we have to define the inductive case: first we indeed look for a parent using the above/2 predicate. Strictly speaking, performing a check that N is not a root/1 is not necessary, because being a root would conflict with the fact that there is an above/2 fact for it. Next we determine the level of that parent, LP, and finally we calculate the level of our node by stating that L is LP+1, where L is the level of N and LP the level of P:
level(N,L) :-
above(P,N),
level(P,LP),
L is LP+1.
Or putting it all together:
level(N,0) :-
root(N).
level(N,L) :-
above(P,N),
level(P,LP),
L is LP+1.
Since you did not provide a sample tree, I have no means to test whether this predicate behaves as you expect it to.
About root/1
Note that by writing root/1, you introduce data duplication: you can simply write:
root(R) :-
\+ above(_,R).
I am fairly new to using MPI. My question is the following: I have a matrix with 2000 rows and 3 columns stored as a 2D array (not contiguous data). Without changing the structure of the array, depending on the number of processes np, each process should get a portion of the matrix.
Example:
A: 2D array of 2000 rows by 3 columns; with np = 2, P0 gets the first half of A, which would be a 2D array of the first 1000 rows by 3 columns, and P1 gets the second half, which would be the second 1000 rows by 3 columns.
Now np can be any number (as long as it divides the number of rows). Any easy way to go about this?
I will have to use FORTRAN 90 for this assignment.
Thank you
Row-wise distribution of 2D arrays in Fortran is tricky (but not impossible) using scatter/gather operations directly because of the column-major storage. Two possible solutions follow.
Pure Fortran 90 solution: With Fortran 90 you can specify array sections like A(1:4,2:3), which would take a small 4x2 block out of the matrix A. You can pass array slices to MPI routines. Note that with current MPI implementations (conforming to the now old MPI-2.2 standard), the compiler would create a temporary contiguous copy of the section data and pass it to the MPI routine (since the lifetime of the temporary storage is not well defined, one should not pass array sections to non-blocking MPI operations like MPI_ISEND). MPI-3.0 introduces a new and very modern Fortran 2008 interface that allows MPI routines to directly take array sections (without intermediate arrays) and supports passing of sections to non-blocking calls.
With array sections you only have to implement a simple DO loop in the root process:
INTEGER :: i, rows_per_proc, start_row, end_row

rows_per_proc = 2000/nproc

IF (rank == root) THEN
  DO i = 0, nproc-1
    IF (i /= root) THEN
      start_row = 1 + i*rows_per_proc
      end_row = (i+1)*rows_per_proc
      CALL MPI_SEND(mat(start_row:end_row,:), 3*rows_per_proc, MPI_REAL, &
                    i, 0, MPI_COMM_WORLD, ierr)
    END IF
  END DO
ELSE
  CALL MPI_RECV(submat(1,1), 3*rows_per_proc, MPI_REAL, ...)
END IF
Pure MPI solution (also works with FORTRAN 77): First, you have to declare a vector datatype with MPI_TYPE_VECTOR. The number of blocks would be 3, the block length would be the number of rows that each process should get (e.g. 1000), and the stride should be equal to the total height of the matrix (e.g. 2000). If this datatype is called blktype, then the following would send the top and bottom halves of the matrix to p0 and p1, respectively:
REAL, DIMENSION(2000,3) :: mat
CALL MPI_SEND(mat(1,1), 1, blktype, p0, ...)
CALL MPI_SEND(mat(1001,1), 1, blktype, p1, ...)
Calling MPI_SEND with blktype would take 1000 elements from the specified starting address, then skip the next 2000 - 1000 = 1000 elements, take another 1000 and so on, 3 times in total. This would form a 1000-row sub-matrix of your big matrix.
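For reference, the construction and commit of blktype with the parameters just described might look like this (a minimal sketch; error handling is omitted as in the rest of the answer):

INTEGER :: blktype, ierr

! 3 blocks (one per column), each 1000 rows long, separated by a stride of
! 2000 elements (the total number of rows of the matrix)
CALL MPI_TYPE_VECTOR(3, 1000, 2000, MPI_REAL, blktype, ierr)
CALL MPI_TYPE_COMMIT(blktype, ierr)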
You can now run a loop to send a different sub-block to each process in the communicator, effectively performing a scatter operation. In order to receive this sub-block, the receiving process could simply specify:
REAL, DIMENSION(1000,3) :: submat
CALL MPI_RECV(submat(1,1), 3*1000, MPI_REAL, root, ...)
If you are new to MPI, this is all you need to know about scattering matrices by rows in Fortran. If you know well how the type system of MPI works, then read ahead for a more elegant solution.
(See here for an excellent description on how to do that with MPI_SCATTERV by Jonathan Dursi. His solution deals with splitting a C matrix in columns, which essentially poses the same problem as the one here as C stores matrices in row-major fashion. Fortran version follows.)
You could also make use of MPI_SCATTERV but it is quite involved. It builds on the pure MPI solution presented above. First you have to resize the blktype datatype into a new type, that has an extent, equal to that of MPI_REAL so that offsets in array elements could be specified. This is needed because offsets in MPI_SCATTERV are specified in multiples of the extent of the datatype specified and the extent of blktype is the size of the matrix itself. But because of the strided storage, both sub-blocks would start at only 4000 bytes apart (1000 times the typical extent of MPI_REAL). To modify the extent of the type, one would use MPI_TYPE_CREATE_RESIZED:
INTEGER(KIND=MPI_ADDRESS_KIND) :: lb, extent
! Get the extent of MPI_REAL
CALL MPI_TYPE_GET_EXTENT(MPI_REAL, lb, extent, ierr)
! Bestow the same extent upon the brother of blktype
CALL MPI_TYPE_CREATE_RESIZED(blktype, lb, extent, blk1b, ierr)
This creates a new datatype, blk1b, which has all characteristics of blktype, e.g. can be used to send whole sub-blocks, but when used in array operations, MPI would only advance the data pointer with the size of a single MPI_REAL instead of with the size of the whole matrix. With this new type, you could now position the start of each chunk for MPI_SCATTERV on any element of mat, including the start of any matrix row. Example with two sub-blocks:
INTEGER, DIMENSION(2) :: sendcounts, displs
! First sub-block
sendcounts(1) = 1
displs(1) = 0
! Second sub-block
sendcounts(2) = 1
displs(2) = 1000
CALL MPI_SCATTERV(mat(1,1), sendcounts, displs, blk1b, &
submat(1,1), 3*1000, MPI_REAL, &
root, MPI_COMM_WORLD, ierr)
Here the displacement of the first sub-block is 0, which coincides with the beginning of the matrix. The displacement of the second sub-block is 1000, i.e. it would start on the 1000-th row of the first column. On the receiver's side the data count argument is 3*1000 elements, which matches the size of the sub-block type.
Given an undirected cyclic graph, I want to find all possible traversals with Breadth-First search or Depth-First search. That is, given a graph as an adjacency list:
A-BC
B-A
C-ADE
D-C
E-C
So all BFS paths from root A would be:
{ABCDE,ABCED,ACBDE,ACBED}
and for DFS:
{ABCDE,ABCED,ACDEB,ACEDB}
How would I generate those traversals algorithmically in a meaningful way? I suppose one could generate all permutations of letters and check their validity, but that seems like a last resort to me.
Any help would be appreciated.
Apart from the obvious way where you actually perform all possible DFS and BFS traversals you could try this approach:
Step 1.
In a DFS traversal starting from the root A, transform the adjacency list of the currently visited node like so: first, remove the parent of the node from the list; second, generate all permutations of the remaining nodes in the adjacency list.
So if you are at node C having come from node A you will do:
C -> ADE transforms into C -> DE, which transforms into C -> [DE, ED]
Step 2.
After step 1 you have the following transformed adj list:
A -> [CB, BC]
B -> []
C -> [DE, ED]
D -> []
E -> []
Now you launch a process starting from (A,0), where the first item in the pair is the traversal path and the second is an index. Let's assume we have two queues: a BFS queue and a DFS queue. We put this pair into both queues.
Now we repeat the following, first for one queue until it is empty and then for the other queue.
We pop the first pair off the queue. We get (A,0). The node A maps to [CB, BC]. So we generate two new paths (ACB,1) and (ABC,1). Put these new paths in the queue.
Take the first one of these off the queue to get (ACB,1). The index is 1 so we look at the second character in the path string. This is C. Node C maps to [DE, ED].
The BFS children of this path would be (ACBDE,2) and (ACBED,2) which we obtained by appending the child permutation.
The DFS children of this path would be (ACDEB,2) and (ACEDB,2) which we obtained by inserting the child permutation right after C into the path string.
We generate the new paths according to which queue we are working on, based on the above and put them in the queue. So if we are working on the BFS queue we put in (ACBDE,2) and (ACBED,2). The contents of our queue are now : (ABC,1) , (ACBDE,2), (ACBED,2).
We pop (ABC,1) off the queue and generate (ABC,2), since B has no children. The queue is now:
(ACBDE,2), (ACBED,2), (ABC,2), and so on. At some point we will end up with a bunch of pairs where the index points past the end of the path string. For example, if we get (ACBED,5) we know this is a finished path.
BFS should be quite simple: each node has a certain depth at which it will be found. In your example you find A at depth 0, B and C at depth 1, and D and E at depth 2. In each BFS path, you will have the element with depth 0 (A) as the first element, followed by any permutation of the elements at depth 1 (B and C), followed by any permutation of the elements at depth 2 (D and E), etc...
If you look at your example, your 4 BFS paths match that pattern. A is always the first element, followed by BC or CB, followed by DE or ED. You can generalize this for graphs with nodes at deeper depths.
To find those depths, all you need is a single BFS (or a Dijkstra search with unit weights), which is quite cheap.
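As a small illustration of this pattern for the example graph (the per-level permutations are hard-coded for this particular graph: depth 0 = {A}, depth 1 = {B,C}, depth 2 = {D,E}), the following sketch simply takes the cross product of the per-level permutations and prints the four BFS paths listed in the question:

PROGRAM bfs_orders
  IMPLICIT NONE
  ! per-level permutations for the example graph
  CHARACTER(LEN=2), DIMENSION(2) :: depth1 = (/ 'BC', 'CB' /)
  CHARACTER(LEN=2), DIMENSION(2) :: depth2 = (/ 'DE', 'ED' /)
  INTEGER :: i, j

  DO i = 1, 2
    DO j = 1, 2
      PRINT *, 'A' // depth1(i) // depth2(j)   ! one valid BFS ordering
    END DO
  END DO
END PROGRAM bfs_orders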
In DFS, you don't have the nice separation by depth which makes BFS straightforward. I don't immediately see an algorithm that is as efficient as the one above. You could set up a graph structure and build up your paths by traversing your graph and backtracking. There are some cases in which this would not be very efficient but it might be enough for your application.