I have an MPI_Isend and MPI_Recv program.
Assume that I have 2 processes and both of them execute this sequence:
MPI_Isend
MPI_Recv
MPI_Wait
What I expect from this is that both processes send their data without blocking, then wait for the data to arrive, then resume, like this:
0 sends to 1
1 sends to 0
0 receives from 1
1 receives from 0
But what i get is this.
0 sends to 1
0 receives from 1 (although 1 didn't send!)
1 sends to 0 (now it sends)
1 receives from 0
I thought that MPI_Recv was supposed to wait until the data arrives. What could be causing this?
MPI_Recv does block.
You just do not see the messages in the expected order because standard output is buffered, so the outputs from the different processes are not shown as they happen.
What you can do to get roughly unbuffered output is to interleave each print-and-flush with an MPI_Barrier. Provided you have P processes, the rank of the current process is stored in the variable rank and you are using the communicator comm, you can do:
for (int p = 0; p < P; ++p)
{
    // Only one process writes at this time
    // std::endl flushes the buffer
    if (p == rank)
        std::cout << "Message from process " << rank << std::endl;

    // Block the other processes until the one that is writing
    // flushes the buffer
    MPI_Barrier(comm);
}
Of course this is just a C++ example; you may have to translate it into C or Fortran. Also notice that this code still does not guarantee that the output will be exactly in the order you expect, but it works with high probability.
Anyway, the principle is to always add a barrier between two output operations and to flush the buffer.
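If it helps, one possible C translation of the same idea is sketched below (the ordered_print wrapper is just for illustration; it assumes rank, P and comm are already set up as described above, and fflush(stdout) plays the role of std::endl):

#include <stdio.h>
#include <mpi.h>

/* Assumes rank, P and comm are set up as in the explanation above. */
void ordered_print(int rank, int P, MPI_Comm comm)
{
    for (int p = 0; p < P; ++p)
    {
        /* Only one process writes at this time; fflush empties the buffer */
        if (p == rank)
        {
            printf("Message from process %d\n", rank);
            fflush(stdout);
        }
        /* Block the other processes until the writer has flushed */
        MPI_Barrier(comm);
    }
}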
In MPI, if I have the following code, will a copy of variable 'a' be created for both processes, or do I have to declare 'a' inside each if block? Or are they both the same?
main()
{
    int a;
    if (rank == 0)
    {
        a += 1;
    }
    if (rank == 1)
    {
        a += 2;
    }
}
MPI follows a distributed-memory programming paradigm.
To put it simply: if you have an application binary (e.g. hello.out) and you run it under an MPI runtime with mpirun -n 4 hello.out, then what happens is:
It launches 4 instances of the application hello.out (you can think of it as launching 4 different applications on 4 different nodes). They don't know about each other; each executes its own code in its own address space. That means every variable, function, etc. belongs to its own instance and is not shared with any other process. So every process has its own variable a.
That is, the code below will be executed 4 times (if we use mpirun -n 4), at the same time, on different cores/nodes. So the variable a exists in all 4 instances. You can use the rank to identify your MPI process and manipulate a's value. In the example below, a stores the process's rank. All processes print My rank is a, with a taking values from 0 to 3, and only one process prints I am rank 0, since a==0 is true only for the process with rank 0.
#include <stdio.h>
#include <mpi.h>

int main(void)
{
    int a;
    int rank;
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    a = rank;
    printf("My rank is %d\n", a);
    if (a == 0)
    {
        printf("I am rank 0\n");
    }
    MPI_Finalize();
    return 0;
}
So, to interact with each other, the launched processes (e.g. hello.out) use the MPI (Message Passing Interface) library (as Hristo Iliev commented).
So basically, it's like launching the same regular C program on multiple cores/nodes, with the instances communicating with each other by message passing (as Gilles Gouaillardet pointed out in a comment).
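To illustrate the message-passing part, here is a minimal sketch (my own example, not from the question) in which rank 0 sends its private copy of a to rank 1; run it with at least 2 processes, e.g. mpirun -n 2 ./a.out:

#include <stdio.h>
#include <mpi.h>

int main(void)
{
    int a = 0, rank;
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        a = 42;                                          /* rank 0 modifies its own private copy of a */
        MPI_Send(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* ship the value to rank 1                  */
    } else if (rank == 1) {
        /* rank 1's a is a different variable; it only sees 42 because of the message */
        MPI_Recv(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received a = %d\n", a);
    }

    MPI_Finalize();
    return 0;
}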
Currently I have a Python program (serial) that calls a C executable (parallel through MPI) through subprocess.run. However, this is a terribly clunky implementation as it means I have to pass some very large arrays back and forth from the Python to the C program using the file system. I would like to be able to directly pass the arrays from Python to C and back. I think ctypes is what I should use. As I understand it, I would create a dll instead of an executable from my C code to be able to use it with Python.
However, to use MPI you need to launch the program using mpirun/mpiexec. This is not possible if I am simply using the C functions from a dll, correct?
Is there a good way to enable MPI for the function called from the dll? The two possibilities I've found are
launch the python program in parallel using mpi4py, then pass MPI_COMM_WORLD to the C function (per this post How to pass MPI information to ctypes in python)
somehow initialize and spawn processes inside the function without using mpirun. I'm not sure if this is possible.
One possibility, if you are OK with passing everything through the C program's rank 0, is to use subprocess.Popen() with stdin=subprocess.PIPE and the communicate() function on the Python side and fread() on the C side.
This is obviously fragile, but it does keep everything in memory. Also, if your data size is large (which you said it was), you may have to write the data to the child process in chunks. Another option could be to use exe.stdin.write(x) rather than exe.communicate(x).
I created a small example program.
C code (program named child):
#include "mpi.h"
#include "stdio.h"
int main(int argc, char *argv[]){
MPI_Init(&argc, &argv);
int size, rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
double ans;
if(rank == 0){
fread(&ans, sizeof(ans), 1, stdin);
}
MPI_Bcast(&ans, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
printf("rank %d of %d received %lf\n", rank, size, ans);
MPI_Finalize();
}
Python code (named driver.py):
#!/usr/bin/env python
import ctypes as ct
import subprocess as sp
x = ct.c_double(3.141592)
exe = sp.Popen(['mpirun', '-n', '4', './child'], stdin=sp.PIPE)
exe.communicate(x)
x = ct.c_double(101.1)
exe = sp.Popen(['mpirun', '-n', '4', './child'], stdin=sp.PIPE)
exe.communicate(x)
results:
> python ./driver.py
rank 0 of 4 received 3.141592
rank 1 of 4 received 3.141592
rank 2 of 4 received 3.141592
rank 3 of 4 received 3.141592
rank 0 of 4 received 101.100000
rank 2 of 4 received 101.100000
rank 3 of 4 received 101.100000
rank 1 of 4 received 101.100000
I tried using MPI_Comm_connect() and MPI_Comm_accept() through mpi4py, but I couldn't seem to get that working on the Python side.
Since most of the time is spent in the C subroutine, which is invoked multiple times, and you are running within a resource manager, I would suggest the following approach:
Start all the MPI tasks at once via the following command (assuming you have allocated n+1 slots):
mpirun -np 1 python wrapper.py : -np <n> a.out
You likely want to start with an MPI_Comm_split() in order to generate a communicator only for the n tasks implemented by the C program.
Then you will define a "protocol" so the Python wrapper can pass parameters to the C tasks, and wait for the result or direct the C program to MPI_Finalize().
You might also consider using an intercommunicator (the first group for Python, the second group for C), but this is really up to you. Intercommunicator semantics can be non-intuitive, so make sure you understand how they work if you want to go in that direction.
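A minimal sketch of what the split could look like on the C side is below (the color values and the assumption that the Python wrapper is world rank 0 are mine, not part of the original answer; the Python wrapper would have to make the matching MPI_Comm_split call through mpi4py):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);

    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* Assumption: the Python wrapper is world rank 0 (it is listed first on the
       mpirun command line), so it takes one color and the C workers another.   */
    int color = (world_rank == 0) ? 0 : 1;

    MPI_Comm work_comm;   /* communicator containing only the n C tasks */
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &work_comm);

    int work_rank, work_size;
    MPI_Comm_rank(work_comm, &work_rank);
    MPI_Comm_size(work_comm, &work_size);
    printf("world rank %d is worker %d of %d\n", world_rank, work_rank, work_size);

    /* ... exchange parameters and results with the wrapper over MPI_COMM_WORLD,
       following whatever "protocol" you define, then shut down ...             */

    MPI_Comm_free(&work_comm);
    MPI_Finalize();
    return 0;
}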
I would like to partition a vector into vectors of different sizes with MPI_Scatterv. When I choose a partition in decreasing order, the code runs fine, but when I choose an increasing order, it fails.
Is it possible that MPI_Scatterv can only be used for partitions in decreasing order? I don't know where the error is. The working code and the failing variation follow.
program scatt
   include 'mpif.h'
   integer idproc, num, ierr, tag, namelen, status(MPI_STATUS_SIZE), comm
   character *(MPI_MAX_PROCESSOR_NAME) processor_name
   integer, allocatable :: myray(:), send_ray(:)
   integer counts(3), displ(3)
   integer siz, mysize, i, k, j, total

   call MPI_INIT(ierror)
   comm = mpi_comm_world
   call MPI_COMM_SIZE(comm, num, ierror)
   call MPI_COMM_RANK(comm, idproc, ierror)

   siz = 12

   ! create the segmentation in decreasing manner
   counts(1) = 5
   counts(2) = 4
   counts(3) = 3
   displ(1) = 0
   displ(2) = 5
   displ(3) = 9

   allocate(myray(counts(idproc+1)))
   myray = 0

   ! create the data to be sent on the root
   if (idproc == 0) then
      !size=count*num
      allocate(send_ray(0:siz-1))
      do i = 0, siz
         send_ray(i) = i + 1
      enddo
      write(*,*) send_ray
   endif

   ! send different data to each processor
   call MPI_Scatterv( send_ray, counts, displ, MPI_INTEGER, &
                      myray, counts, MPI_INTEGER, &
                      0, comm, ierr)

   write(*,*) "myid= ", idproc, " ray= ", myray

   call MPI_FINALIZE(ierr)
end
The correct result is:
myid= 1 ray= 6 7 8 9
myid= 0 ray= 1 2 3 4 5
myid= 2 ray= 10 11 12
When I run the same code with the segmentation in increasing order,
counts(1)=2
counts(2)=4
counts(3)=6
displ(1)=0
displ(2)=2
displ(3)=6
the scatter is performed only on the root:
myid= 0 ray= 1 2
and the error message is:
Fatal error in PMPI_Scatterv: Message truncated, error stack:
PMPI_Scatterv(671)......................: MPI_Scatterv(sbuf=(nil), scnts=0x6b4da0, displs=0x6b4db0, MPI_INTEGER, rbuf=0x26024b0,
rcount=2, MPI_INTEGER, root=0, MPI_COMM_WORLD) failed
MPIR_Scatterv_impl(211).................:
I_MPIR_Scatterv_intra(278)..............: Failure during collective
I_MPIR_Scatterv_intra(272)..............:
MPIR_Scatterv(147)......................:
MPIDI_CH3_PktHandler_EagerShortSend(441): Message from rank 0 and tag 6 truncated; 16 bytes received but buffer size is 8
Fatal error in PMPI_Scatterv: Message truncated, error stack:
PMPI_Scatterv(671)................: MPI_Scatterv(sbuf=(nil), scnts=0x6b4da0, displs=0x6b4db0, MPI_INTEGER, rbuf=0x251f4b0, rcount=2, MPI_INTEGER, root=0, MPI_COMM_WORLD) failed
MPIR_Scatterv_impl(211)...........:
I_MPIR_Scatterv_intra(278)........: Failure during collective
I_MPIR_Scatterv_intra(272)........:
MPIR_Scatterv(147)................:
MPIDI_CH3U_Receive_data_found(131): Message from rank 0 and tag 6 truncated; 24 bytes received but buffer size is 8
forrtl: error (69): process interrupted (SIGINT)
There are two problems in your code.
First, the invocation of MPI_Scatterv is wrong. The receive count must be a scalar, not an array, and it gives the size of the receive buffer in the calling rank only. In your case you need to change the second occurrence of counts to counts(idproc+1):
call MPI_Scatterv(send_ray, counts, displ, MPI_INTEGER, &
myray, counts(idproc+1), MPI_INTEGER, &
0, comm, ierr)
The same applies to the complementary operation MPI_Gatherv - there the size of the local send buffer is also a scalar.
Another problem is the out-of-bounds access in this initialisation loop:
allocate(send_ray(0:siz-1))
do i=0,siz
send_ray(i)=i+1
enddo
Here send_ray is allocated with bounds 0:siz-1, but the loop runs from 0 to siz, which is one element past the end of the array. Some compilers have options to enable run-time out-of-bound access checks. For example, with Intel Fortran the option is -check bounds. For Gfortran the option is -fcheck=bounds. Accessing arrays past their end could overwrite and thus alter the values in other arrays (worst case, hard to spot) or destroy the heap pointers and crash your program (best case, easy to spot).
As Gilles Gouaillardet has noted, do not use mpif.h. Instead, the mpi module, or even better the mpi_f08 module, should be used in newly developed programs.
My OpenCL kernel crunches some numbers. This particular kernel then searches an array of 8-bit char4 vectors for a matching string of numbers. For example, the array holds 3 67 8 2 56 1 3 7 8 2 0 2 - the kernel loops over that (the actual string is 1024 digits long), searches for 1 3 7 8 2 and "returns" data letting the host program know it found a match.
As a combined learning exercise/programming experiment, I wanted to see if I could loop over an array and search for a range of values, where the array is not just char values but char4 vectors, WITHOUT using a single if statement in the kernel. Two reasons:
1: After half an hour of getting compile errors I realized that you cannot do:
if(charvector[3] == searchvector[0])
Because some may match and some may not. And 2:
I'm new to OpenCL and I've read a lot about how branches can hurt a kernel's speed, and if I understand the internals of kernels correctly, some math may actually be faster than if statements. Is that the case?
Anyway... first, the kernel in question:
void search(__global uchar4 *rollsrc, __global uchar *srch, char srchlen)
{
    size_t gx = get_global_id(0);
    size_t wx = get_local_id(0);
    __private uint base = 0;
    __local uchar4 queue[8092];
    __private uint chunk = 8092 / get_local_size(0);
    __private uint ctr, start, overlap = srchlen-1;
    __private int4 srchpos = 0, srchtest = 0;
    uchar4 searchfor;
    event_t e;

    start = max((int)((get_group_id(0)*32768) - overlap), 0);

    barrier(CLK_LOCAL_MEM_FENCE);
    e = async_work_group_copy(queue, rollsrc+start, 8092, 0);
    wait_group_events(1, &e);

    for(ctr = 0; ctr < chunk+overlap; ctr++) {
        base = min((uint)((get_group_id(0) * chunk) + ctr), (uint)((N*32768)-1));
        searchfor.x = srch[max(srchpos.x, 0)];
        searchfor.y = srch[max(srchpos.y, 0)];
        searchfor.z = srch[max(srchpos.z, 0)];
        searchfor.w = srch[max(srchpos.w, 0)];
        srchpos += max((convert_int4(abs_diff(queue[base], searchfor))*-100), -100) | 1;
        srchpos = max(srchpos, 0);
        srchtest = clamp(srchpos-(srchlen-1), 0, 1) << 31;
        srch[0] |= (any(srchtest) * 255);
        // if(get_group_id(0) == 0 && get_local_id(0) == 0)
        //     printf("%u: %v4u %v4u\n", ctr, srchpos, srchtest);
    }

    barrier(CLK_LOCAL_MEM_FENCE);
}
There's extra unneeded code in there; this was a copy from a previous kernel, and I haven't cleaned up the extra junk yet. That being said, in short and in English, here is how the math-based if statement works:
Since I need to search for a range, and I'm searching a vector, I first set a char4 vector (searchfor) to have its elements x, y, z, w individually set to the number I am searching for. It's done individually because each of x, y, z and w holds a different stream, and the search counter - how many matches in a row we've had - will be different for each member of the vector. I'm sure there's a better way to do it than what I did. Suggestions?
So then, an int4 vector, srchpos, which holds the current position in the search array for each of the 4 vector lanes, gets this added to it:
max((convert_int4(abs_diff(queue[base], searchfor))*-100), -100) | 1;
What this does: Take the ABS difference between the current location in the target queue (queue) and the searchfor vector set in the previous 4 lines. A vector is returned where each member will have either a positive number (not a match) or zero (a match - no difference).
It's converted to int4 (as uchar cannot be negative), then multiplied by -100, then run through max(x, -100). Now each element is either -100 or 0. We OR it with 1 and now it's -99 or 1.
End result: searchpos either increments by 1 (a match), or is reduced by 99, resetting any previous partial match increments. (Searches can be up to 96 characters long - there exists a chance to match 91, then miss, so it has to be able to wipe that all out). It is then max'ed with 0 so any negative result is clamped to zero. Again - open to suggestions to make that more efficient. I realized as I was writing this I could probably use addition with saturation to remove some of the max statements.
The last part takes the current srchpos, which now equals the number of consecutive matches, subtracts 1 less than the length of the search string, then clamps it to 0-1, thus ending up with either a 1 - a full match - or 0. We bit shift this << 31. The result is 0 or 0x80000000. Put this into srchtest.
Lastly, we bitwise OR the first character of the search string with the result of any(srchtest) * 255 - it's one of the few ways (that I'm aware of) to test something across a vector and return a single integer from it. (any() returns 1 if any member of the vector has its MSB set - which we set in the line above.)
End result? srch[0] is unchanged, or, in the case of a match, it's set to 0xff. When the kernel returns, the host can read back srch from the buffer. If the first character is 0xff, we found a match.
It probably has too many steps and can be cleaned up. It also may be less efficient than just doing 4 if checks per loop. Not sure.
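To make the arithmetic above concrete, here is a small plain-C sketch of the same branch-free counter applied to a single scalar stream (the function and variable names are mine, and the final if only reports the result of the demo; it is not part of the branchless trick itself):

#include <stdio.h>
#include <stdlib.h>

/* One step of the branch-free running-match counter: pos grows by 1 on a
   match and is knocked back by 99 (then clamped at 0) on a mismatch --
   the same idea the kernel applies per vector lane with abs_diff/max.   */
static int step(int pos, unsigned char data, unsigned char wanted)
{
    int diff  = abs((int)data - (int)wanted);  /* 0 on match, > 0 otherwise    */
    int delta = diff * -100;                   /* 0 or a large negative value  */
    delta = (delta < -100) ? -100 : delta;     /* max(delta, -100): 0 or -100  */
    delta |= 1;                                /* 0|1 = 1, -100|1 = -99        */
    pos += delta;                              /* +1 on match, -99 on mismatch */
    return (pos < 0) ? 0 : pos;                /* max(pos, 0)                  */
}

int main(void)
{
    unsigned char haystack[] = {3, 67, 8, 2, 56, 1, 3, 7, 8, 2, 0, 2};
    unsigned char needle[]   = {1, 3, 7, 8, 2};
    int needle_len = 5, pos = 0;

    for (size_t i = 0; i < sizeof haystack; ++i) {
        pos = step(pos, haystack[i], needle[pos]);
        if (pos == needle_len) {               /* reporting only, not the trick */
            printf("match ending at index %zu\n", i);
            pos = 0;
        }
    }
    return 0;
}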
But, after this massive post, the thing that has me pulling my hair out:
When I UNCOMMENT the two lines at the end that print debug information, the kernel works. This is the end of the output in my terminal window when I run it:
36: 0,0,0,0 0,0,0,0
37: 0,0,0,0 0,0,0,0
38: 0,0,0,0 0,0,0,0
39: 0,0,0,0 0,0,0,0
Search = 613.384 ms
Positive
Done read loop: -1 27 41
Positive means the string was found. The -1 27 41 is the first 3 characters of the search string, the first being set to -1 (signed char on the host side).
Here's what happens when I comment out the printf debugging info:
Search = 0.150 ms
Negative
Done read loop: 55 27 41
IT DOES NOT FIND IT. What?! How is that possible? Of course, I notice that the kernel execution time jumps from 0.15 ms to 600+ ms because of the printf, so I think maybe the host is somehow reading the data back BEFORE the kernel finishes, and the extra delay from the printf gives it a pause. So I add a barrier(CLK_LOCAL_MEM_FENCE); to the end, thinking that will make sure all threads are done before returning. Nope. No effect. I then add a 2-second sleep on the host side, after running the kernel, after calling clFinish, and before calling clEnqueueReadBuffer.
NOPE! Still Negative. But I put the printf back in - and it works. How is that possible? Why? Does anyone have any idea? This is the first time I've had a programming bug that baffled me to the point of pulling hair out, because it makes absolutely zero sense. The work items are not clashing, they each read their own block, and even have an overlap in case the search string is split across two work item blocks.
Please - save my hair - how can a printf of irrelevant data cause this to work and removing it causes it to not?
Oh - one last fun thing: If I remove the parameters from the printf - just have it print text like "grr please work" - the kernel returns a negative, AND, nothing prints out. The printf is ignored.
What the heck is going on? Thanks for reading, I know this was absurdly long.
For anyone referencing this question in the future, the issue was caused by my arrays being read out of bounds. When that happens, all heck breaks loose and all results are unpredictable.
Once I fixed the work and group size and made sure I was not exceeding the memory bounds, it worked as expected.
I've created the following toy example that counts in a loop and writes the value to an Async.Pipe:
open Sys
open Unix
open Async.Std

let (r, w) = Pipe.create ()

let rec readloop r =
  Pipe.read r >>=
  function
  | `Eof -> return ()
  | `Ok v -> return (printf "Got %d\n" v) >>=
             fun () -> after (Core.Time.Span.of_sec 0.5) >>=
             fun () -> readloop r

let countup hi w =
  let rec loop i =
    printf "i=%d\n" i;
    if i < hi && not (Pipe.is_closed w) then
      Pipe.write w i >>>
      fun () -> loop (i+1)
    else Pipe.close w
  in
  loop 0

let () =
  countup 10 w;
  ignore (readloop r);;

Core.Never_returns.never_returns (Scheduler.go ())
Notice the readloop function is recursive - it just continuously reads values from the Pipe as they are available. However, I've added a delay there of 0.5 sec between each read. The countup function is kind of similar but it loops and does a write to the same Pipe.
When I run this I get:
i=0
i=1
Got 0
i=2
Got 1
i=3
Got 2
i=4
Got 3
i=5
Got 4
i=6
Got 5
i=7
Got 6
i=8
Got 7
i=9
Got 8
i=10
Got 9
Aside from the first three lines of output above, all the rest of the output lines seem to wait for the half second. So it seems that the Pipe blocks after a write until there is a read from the Pipe (Pipe.write w data appears to block waiting for a Pipe.read r).
What I thought should happen (since this is an Async Pipe of some sort) is that values would be queued up in the Pipe until the reads take place, something like:
i=0
Got 0 (* now reader side waits for 1/2 second before reading again *)
i=1 (* meanwhile writer side keeps running *)
i=2
i=3
i=4
i=5
i=6
i=7
i=8
i=9 (* up till here, all output happens pretty much simultaneously *)
Got 1 (* 1/2 second between these messages *)
Got 2
Got 3
Got 4
Got 5
Got 6
Got 7
Got 8
Got 9
I'm wondering if there is a way to get this behavior using Async?
My real use case is that I've got a TCP socket open (as a client), and if I were using threads, after some setup between the client and the server I would start a thread that just sits and reads data coming in on the socket from the server and puts that data into a queue of messages that can be examined in the main thread of the program when it's ready. However, instead of using threads I want to use Core.Async to achieve the same thing: read data from the socket as it comes in from the server and, when data is available, examine the message and do something based on its content. There could be other things going on as well, which is simulated by the "wait half a second" in the code above. I thought Pipe would queue up the messages so that they could be read when the reader side was ready, but that doesn't seem to be the case.
Indeed, a pipe is a queue, but by default its size budget is 0, so the producer is pushed back immediately after a write and waits until the value has been read. You can control the queue size with the set_size_budget function.