Currently I have a Python program (serial) that calls a C executable (parallel through MPI) through subprocess.run. However, this is a terribly clunky implementation as it means I have to pass some very large arrays back and forth from the Python to the C program using the file system. I would like to be able to directly pass the arrays from Python to C and back. I think ctypes is what I should use. As I understand it, I would create a dll instead of an executable from my C code to be able to use it with Python.
However, to use MPI you need to launch the program using mpirun/mpiexec. This is not possible if I am simply using the C functions from a dll, correct?
Is there a good way to enable MPI for the function called from the dll? The two possibilities I've found are
launch the python program in parallel using mpi4py, then pass MPI_COMM_WORLD to the C function (per this post How to pass MPI information to ctypes in python)
somehow initialize and spawn processes inside the function without using mpirun. I'm not sure if this is possible.
One possibility, if you are OK with passing everything through the c program rank 0, is to use subprocess.Popen() with stdin=subprocess.PIPE and the communicate() function on the python side and fread() on the c side.
This is obviously fragile, but does keep everything in memory. Also, if your data size is large (which you said it was) you may have to write the data to the child process in chunk. Another option could be to use exe.stdin.write(x) rather than exe.communicate(x)
I created a small example program
c code (program named child):
#include "mpi.h"
#include "stdio.h"
int main(int argc, char *argv[]){
MPI_Init(&argc, &argv);
int size, rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
double ans;
if(rank == 0){
fread(&ans, sizeof(ans), 1, stdin);
}
MPI_Bcast(&ans, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
printf("rank %d of %d received %lf\n", rank, size, ans);
MPI_Finalize();
}
python code (named driver.py):
#!/usr/bin/env python
import ctypes as ct
import subprocess as sp
x = ct.c_double(3.141592)
exe = sp.Popen(['mpirun', '-n', '4', './child'], stdin=sp.PIPE)
exe.communicate(x)
x = ct.c_double(101.1)
exe = sp.Popen(['mpirun', '-n', '4', './child'], stdin=sp.PIPE)
exe.communicate(x)
results:
> python ./driver.py
rank 0 of 4 received 3.141592
rank 1 of 4 received 3.141592
rank 2 of 4 received 3.141592
rank 3 of 4 received 3.141592
rank 0 of 4 received 101.100000
rank 2 of 4 received 101.100000
rank 3 of 4 received 101.100000
rank 1 of 4 received 101.100000
I tried using MPI_Comm_connect() and MPI_Comm_accept() through mpi4py, but I couldn't seem to get that working on the python side.
Since most of the time is spent in the C subroutine which is invoked multiple times, and you are running within a resource manager, I would suggest the following approach :
Start all the MPI tasks at once via the following command (assuming you have allocated n+1 slots
mpirun -np 1 python wrapper.py : -np <n> a.out
You likely want to start with a MPI_Comm_split() in order to generate a communicator only for the n tasks implemented by the C program.
Then you will define a "protocol" so the python wrapper can pass parameters to the C tasks, and wait for the result or direct the C program to MPI_Finalize().
You might as well consider using an intercommunicator (first group is for python, second group is for C) but this is really up to you. Intercommunicator semantic can be seen as non intuitive, so make sure you understand how this works if you want to go into that direction.
Related
In MPI, if I have the following code will a copy of variable 'a' be created for both the processes or do I have to declare 'a' inside every loop? Or are they both the same?
main()
{
int a;
if(rank==0)
{
a+=1;
}
if(rank==1)
{
a+=2;
}
}
MPI has a distributed memory programming paradigm.
To simply put, if you have an application binary (for eg: hello.out) and if you run it with an mpi runtime by mpirun -n 4 hello.out then what happens is:
It launches 4 instances of the application hello.out (we can say it's similar to launching 4 different applications in 4 different nodes). They don't know each other. They execute their own code in their own address spaces. That means, every variables, functions etc belong to it's own instance and not shared with any other processes. So all process has their own variable a.
i.e, below code will be called 4 times (if we use mpirun -n 4) at same time in different cores/nodes. So, variable a will be available in all 4 instances. You can use rank to identify your MPI process and manipulates it's value. In below example a will store the processes rank value. All processes will print My rank is a with a taking values from 0 to 4. And only one process will print I am rank 0 since a==0 will only be true for process with rank 0.
main()
{
int a;
int rank;
MPI_Init(NULL, NULL);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
a=rank;
printf("My rank is %d",a);
if(a==0)
{
printf("I am rank 0");
}
}
So, to interact with each other, the launched processes (eg: hello.out) uses MPI (message passing interface) library (as #Hristo Iliev commented).
So basically, it's like launching same regular C program on multiple cores/nodes and communicating with each other using message passing (as Gilles Gouaillardet pointed in comment ).
I run a subprocess from an OCaml program and check its termination status. If it exited normally (WEXITED int), I get the expected return code (0 usually indicating success).
However, if it was terminated by a signal (WSIGNALED int), I don't get the proper POSIX signal number. Instead, I get some (negative) OCaml specific signal number.
How do I convert this nonstandard signal number to a proper POSIX signal number, for proper error reports? Alternatively, how do I convert this number to a string?
(I'm aware that there are tons of named integer values like Sys.sigabrt, but do I really have to write that large match statement myself? Moreover, I don't get why they didn't use a proper variant type in the first place, given that those signal numbers are OCaml specific anyway.)
There is a function in the OCaml runtime that does this conversion (naturally). It is not kosher to call this function, but if you don't mind writing code that can break in future releases of OCaml (and other possibly bad outcomes), here is code that works for me:
A wrapper for the OCaml runtime function:
$ cat wrap.c
#include <caml/mlvalues.h>
extern int caml_convert_signal_number(int);
value oc_sig_to_host_sig(value ocsignum)
{
/* Convert a signal number from OCaml to host system.
*/
return Val_int(caml_convert_signal_number(Int_val(ocsignum)));
}
A test program.
$ cat m.ml
external convert : int -> int = "oc_sig_to_host_sig"
let main () =
Printf.printf "converted %d -> %d\n" Sys.sigint (convert Sys.sigint)
let () = main ()
Compile the program and try it out:
$ ocamlopt -o m -I $(ocamlopt -where) wrap.c m.ml
$ ./m
converted -6 -> 2
All in all, it might be better just to write some code that compares against the different signals defined in the Sys module and translates them to strings.
When using MPI 3 shared memory, it occurred to me that writing to adjacent memory positions of a shared memory window simultaneously on different tasks seemingly does not work.
I guessed that MPI ignores possible cache conflicts and now my question is if that is correct and MPI indeed does not care about cache coherency, or if this is a quirk of the implementation, or if there is a completely different explanation to that behaviour?
This is a minimal example where, in Fortran, simultaneously writing to distinct addresses in a shared memory window causes a conflict (tested with intel MPI 2017, 2018, 2019 and GNU OpenMPI 3).
program testAlloc
use mpi
use, intrinsic :: ISO_C_BINDING, only: c_ptr, c_f_pointer
implicit none
integer :: ierr
integer :: window
integer(kind=MPI_Address_kind) :: wsize
type(c_ptr) :: baseptr
integer, pointer :: f_ptr
integer :: comm_rank
call MPI_Init(ierr)
! Each processor allocates one entry
wsize = 1
call MPI_WIN_ALLOCATE_SHARED(wsize,4,MPI_INFO_NULL,MPI_COMM_WORLD,baseptr,window,ierr)
! Convert to a fortran pointer
call c_f_pointer(baseptr, f_ptr)
! Now, assign some value simultaneously
f_ptr = 4
! For output, get the mpi rank
call mpi_comm_rank(MPI_COMM_WORLD, comm_rank, ierr)
! Output the assigned value - only one task reports 4, the others report junk
print *, "On task", comm_rank, "value is", f_ptr
call MPI_Win_free(window, ierr)
call MPI_Finalize(ierr)
end program
Curiously, the same program in C does seem to work as intended, which leads to the question if there is something wrong with the Fortran implementation, or the C program is just lucky (tested with the same MPI libraries)
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[]){
MPI_Init(&argc, &argv);
// Allocate a single resource per task
MPI_Aint wsize = 1;
// Do a shared allocation
int *resource;
MPI_Win window;
MPI_Win_allocate_shared(wsize, sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &resource, &window);
// For output clarification, get the mpi rank
int comm_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);
// Assign some value
*resource = 4;
// Tell us the value - this seems to work
printf("On task %d the value is %d\n",comm_rank,*resource);
MPI_Win_free(&window);
MPI_Finalize();
}
From the MPI 3.1 standard (chapter 11.2.3 page 407)
MPI_WIN_ALLOCATE_SHARED(size, disp_unit, info, comm, baseptr, win)
IN size size of local window in bytes (non-negative integer)
Note the window size is in bytes and not in number of units.
So all you need is to use
wsize = 4
in Fortran (assuming your INTEGER size is indeed 4) and
wsize = sizeof(int);
in C
FWIW
even if the C version seems to give the expected result most of the time, it is also incorrect and I am able to evidence this by running under the program under a debugger.
generally speaking, you might have to declare volatile int * resource; in C to prevent the compiler from performing some optimizations that might impact the behavior of your app (and this is not needed here).
Is there a rule of thumb for keeping the compiler happy when it looks at a kernel and assigns registers?
The compiler has a lot of flexibility, but I worry that it might start using excessive local memory if I created like, 500 variables in my kernel... or a very long single line with a ton of operations.
I know the only way my program could really examine register use on a specific device is by using the AMD SDK or the NVIDIA SDK (or comparing the assembly code to the Device's architecture). Unfortunately, I am using PyOpenCL, so working with those SDKs would be impractical.
My program generates semi-random kernels, and I'm trying to prevent it from doing things that would choke the compiler and start dumping registers in local memory.
There is an option for NVIDIA platforms that even works programmatically, without the SDK. (Maybe there is something similar for AMD cards?)
You can specify the "-cl-nv-verbose" as the "build option" when calling clBuildProgram. This will generate some log information that can later be obtained via the build logs.
clBuildProgram(program, 0, NULL, "-cl-nv-verbose", NULL, NULL);
clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, ...);
(sorry, I'm not sure about the python syntax for this).
The result should be a string containing the desired information. For a simple vector addition, this shows
ptxas : info : 0 bytes gmem
ptxas : info : Compiling entry function 'sampleKernel' for 'sm_21'
ptxas : info : Function properties for sampleKernel
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 4 registers, 44 bytes cmem[0], 4 bytes cmem[16]
You can also use the "-cl-nv-maxrregcount=..." option to specify the maximum register count, but of course, all this is device- and platform specific, and thus should be used with care.
The compiler will keep track of the private variables scope, is not the number of variables you declare that matters, but how hey are used.
For example, in the following example, only 2 registers are used. Although, 5 private variables are used:
//Notice here, that a value is used in the register when it has to be stored
// not when it is declared. So declaring a variable that is never used will be
// optimized and removed by the compiler.
R1 | R2 | Code
a | - | int a = 1;
a | b | int b = 3;
a | b | int c;
c | b | c = a + b;
c | b | int d;
c | d | d = c + b;
c | d | int e;
e | - | int e = c + d;
- | - | out[idx] = e; //Global memory output
It all depends on the scope of the variable (when is it needed, if it is needed, and for how long).
The only critical thing is NOT to create more memory than needed if the compiler cannot predict that memory.
int a[100];
//Initialize a with some value
int b;
b = a[global_index];
The compiler will not be able to predict the values you are using, therefore it needs the 100 values, and will spil out the memory if needed. For those kind of operations is better to create a table or even do a single reading to a global table.
this is how we use MPI_Init function
int main(int argc, char **argv)
{
MPI_Init(&argc, &argv);
…
}
why does MPI_Init use pointers to argc and argv instead of values of argv?
According to the answer stated here:
Passing arguments via command line with MPI
Most MPI implementations will remove all the mpirun-related arguments in this function so that, after calling it, you can address command line arguments as though it were a normal (non-mpirun) command execution.
i.e. after
mpirun -np 10 myapp myparam1 myparam2
argc = 7(?) because of the mpirun parameters (it also seems to add some) and the indices of myparam1 and myparam2 are unknown
but after
MPI_Init(&argc, &argv)
argc = 3 and myparam1 is at argv[1] and myparam2 is at argv[2]
Apparently this is outside the standard, but I've tested it on linux mpich and it certainly seems to be the case. Without this behaviour it would be very difficult (impossible?) to distinguish application parameters from mpirun parameters.
my guess to potentially allow to remove mpi arguments from commandline.
passing argument count by pointer allows to modify its value from the point of main.
According to OpenMPI man pages:
MPI_Init(3) man page
Open MPI accepts the C/C++ argc and argv arguments to main, but neither modifies, interprets, nor distributes them.
I'm not an expert but I believe the simple answer is that each node that you're working with is working with its own copy of the code. Passing these arguments allows each of the nodes to have access to argc and argv even though they were not passed them through the command line interface.
The original or master node that calls MPI_Init is passed these arguments. MPI_Init allows the other nodes to access them as well.
It is less overhead to just pass two pointers.