Variable use in MPI

In MPI, if I have the following code, will a copy of variable 'a' be created for both processes, or do I have to declare 'a' inside each branch? Or are they both the same?
main()
{
    int a;
    if (rank == 0)
    {
        a += 1;
    }
    if (rank == 1)
    {
        a += 2;
    }
}

MPI follows a distributed-memory programming paradigm.
To put it simply: if you have an application binary (e.g. hello.out) and you run it with an MPI runtime via mpirun -n 4 hello.out, then what happens is:
It launches 4 instances of the application hello.out (we can say it's similar to launching 4 different applications on 4 different nodes). They don't know about each other; each executes its own code in its own address space. That means every variable, function, etc. belongs to its own instance and is not shared with any other process. So every process has its own variable a.
That is, the code below will be executed 4 times (if we use mpirun -n 4) at the same time on different cores/nodes, so variable a exists in all 4 instances. You can use the rank to identify your MPI process and manipulate a's value. In the example below, a stores the process's rank. Every process will print My rank is a, with a taking a value from 0 to 3, and only one process will print I am rank 0, since a==0 is true only in the process with rank 0.
#include <stdio.h>
#include <mpi.h>

int main(void)
{
    int a;
    int rank;
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    a = rank;
    printf("My rank is %d\n", a);
    if (a == 0)
    {
        printf("I am rank 0\n");
    }
    MPI_Finalize();
    return 0;
}
So, to interact with each other, the launched processes (e.g. hello.out) use the MPI (Message Passing Interface) library (as Hristo Iliev commented).
So basically, it's like launching the same regular C program on multiple cores/nodes and having the instances communicate with each other using message passing (as Gilles Gouaillardet pointed out in a comment).
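As a hedged sketch of what that message passing looks like (this example is mine, not from the original answer): rank 0 modifies its private copy of a and then explicitly sends it to rank 1, which is the only way the value can cross between the two address spaces.

#include <stdio.h>
#include <mpi.h>

/* Minimal sketch; assumes at least 2 processes (mpirun -n 2 or more). */
int main(void)
{
    int a = 0, rank;
    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        a = 42;                                         /* only rank 0's copy changes */
        MPI_Send(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD); /* explicit message to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received a = %d\n", a);          /* prints 42 */
    }
    MPI_Finalize();
    return 0;
}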

Related

How to convert OCaml signal to POSIX signal or string?

I run a subprocess from an OCaml program and check its termination status. If it exited normally (WEXITED int), I get the expected return code (0 usually indicating success).
However, if it was terminated by a signal (WSIGNALED int), I don't get the proper POSIX signal number. Instead, I get some (negative) OCaml-specific signal number.
How do I convert this nonstandard signal number to a proper POSIX signal number, for proper error reports? Alternatively, how do I convert this number to a string?
(I'm aware that there are tons of named integer values like Sys.sigabrt, but do I really have to write that large match statement myself? Moreover, I don't get why they didn't use a proper variant type in the first place, given that those signal numbers are OCaml-specific anyway.)
There is a function in the OCaml runtime that does this conversion (naturally). It is not kosher to call this function, but if you don't mind writing code that can break in future releases of OCaml (and other possibly bad outcomes), here is code that works for me:
A wrapper for the OCaml runtime function:
$ cat wrap.c
#include <caml/mlvalues.h>
extern int caml_convert_signal_number(int);
value oc_sig_to_host_sig(value ocsignum)
{
    /* Convert a signal number from OCaml to the host system. */
    return Val_int(caml_convert_signal_number(Int_val(ocsignum)));
}
A test program:
$ cat m.ml
external convert : int -> int = "oc_sig_to_host_sig"
let main () =
  Printf.printf "converted %d -> %d\n" Sys.sigint (convert Sys.sigint)

let () = main ()
Compile the program and try it out:
$ ocamlopt -o m -I $(ocamlopt -where) wrap.c m.ml
$ ./m
converted -6 -> 2
All in all, it might be better just to write some code that compares against the different signals defined in the Sys module and translates them to strings.

What is the rule behind instruction count in Intel PIN?

I wanted to count instructions in a simple recursive fibo function, O(2^n). I succeeded in doing so with bubble sort and matrix multiplication, but in this case it seemed like the instruction count ignored my fibo function. Here is the code used for instrumentation:
// Insert a call at the entry point of a routine to increment the call count
RTN_InsertCall(rtn, IPOINT_BEFORE, (AFUNPTR)docount, IARG_PTR, &(rc->_rtnCount), IARG_END);

// For each instruction of the routine
for (INS ins = RTN_InsHead(rtn); INS_Valid(ins); ins = INS_Next(ins))
{
    // Insert a call to docount to increment the instruction counter for this rtn
    INS_InsertCall(ins, IPOINT_BEFORE, (AFUNPTR)docount, IARG_PTR, &(rc->_icount), IARG_END);
}
I started to wonder what the difference is between this program and the previous ones, and my first thought was: here I'm not using an array.
This is what I realised after some manual tests:
a = 5;       // instruction ignored by PIN, and so is
             // pretty much everything not using an array
fibo[1] = 1; // instruction counted properly
a = fibo[1]; // instruction ignored by PIN
So it seems like the only instructions counted are writes to memory (that's what I assume). After I changed my fibo function to this, it works:
long fibonacciNumber(int n, long *fiboNumbers)
{
    if (n < 2) {
        fiboNumbers[n] = n;
        return n;
    }
    fiboNumbers[n] = fiboNumbers[n-1] + fiboNumbers[n-2];
    return fibonacciNumber(n - 1, fiboNumbers) + fibonacciNumber(n - 2, fiboNumbers);
}
But I would like to count instructions for programs that aren't written by me as well. Is there a way to count all types of instructions? Is there any particular reason why only these instructions are counted? Any help appreciated.
Edit:
I used the disassembly option in Visual Studio to check how it looks, and it still makes no sense to me. I can't find the reason why only an assignment to an array is interpreted by PIN as an instruction.
[screenshot: disassembly comparing the two assignments]
This exceeded all my expectations: it was counted as 2 instructions, not one.
PIN, like other low-level profiling and analysis tools, measures individual instructions: low-level orders like "add these two registers" or "load a value from that memory address". The sequence of instructions that a program comprises is generally produced from a high-level language like C++ by a compiler. An individual line of C++ code might be transformed into exactly one instruction, but it's also common for a line to translate to several instructions, or even to zero instructions; and the instructions for one line of code may be interleaved with those of other lines.
Your compiler can output an assembly-language file for your source code, showing which instructions were produced for which lines of code. (For GCC and Clang, this is done with the -S flag.) Note that reading the assembly output of a compiler is not the best way to learn assembly. Also, I would point you to godbolt.org, a very convenient tool for examining assembly output.
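To make that concrete, here is a hedged sketch (mine, not from the original answer) of why PIN may see no instruction at all for a plain scalar assignment: an optimizing compiler is free to fold it away, while a store through a pointer or into an array usually survives as a real memory write.

/* sketch.c -- inspect the generated assembly with: gcc -O2 -S sketch.c */
int scalar(void)
{
    int a;
    a = 5;       /* at -O2 this is typically folded into the return value,  */
    return a;    /* e.g. a single "mov eax, 5"; no separate instruction     */
}                /* exists for the assignment itself                        */

void store(long *fibo)
{
    fibo[1] = 1; /* a store to memory usually survives optimization, so a   */
}                /* real instruction remains for PIN to count               */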

How to interface Python with a C program that uses MPI

Currently I have a Python program (serial) that calls a C executable (parallelized with MPI) through subprocess.run. However, this is a terribly clunky implementation, as it means I have to pass some very large arrays back and forth between the Python and C programs using the file system. I would like to be able to pass the arrays directly from Python to C and back. I think ctypes is what I should use. As I understand it, I would create a DLL instead of an executable from my C code to be able to use it with Python.
However, to use MPI you need to launch the program using mpirun/mpiexec. This is not possible if I am simply calling the C functions from a DLL, correct?
Is there a good way to enable MPI for the function called from the DLL? The two possibilities I've found are:
launch the Python program in parallel using mpi4py, then pass MPI_COMM_WORLD to the C function (per this post: How to pass MPI information to ctypes in python)
somehow initialize and spawn processes inside the function without using mpirun. I'm not sure if this is possible.
One possibility, if you are OK with passing everything through the C program's rank 0, is to use subprocess.Popen() with stdin=subprocess.PIPE and the communicate() function on the Python side, and fread() on the C side.
This is obviously fragile, but it does keep everything in memory. Also, if your data size is large (which you said it is), you may have to write the data to the child process in chunks. Another option could be to use exe.stdin.write(x) rather than exe.communicate(x).
I created a small example program
c code (program named child):
#include "mpi.h"
#include "stdio.h"

int main(int argc, char *argv[]){
    MPI_Init(&argc, &argv);
    int size, rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    double ans;
    if(rank == 0){
        fread(&ans, sizeof(ans), 1, stdin);
    }
    MPI_Bcast(&ans, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    printf("rank %d of %d received %lf\n", rank, size, ans);
    MPI_Finalize();
}
python code (named driver.py):
#!/usr/bin/env python
import ctypes as ct
import subprocess as sp
x = ct.c_double(3.141592)
exe = sp.Popen(['mpirun', '-n', '4', './child'], stdin=sp.PIPE)
exe.communicate(x)
x = ct.c_double(101.1)
exe = sp.Popen(['mpirun', '-n', '4', './child'], stdin=sp.PIPE)
exe.communicate(x)
results:
> python ./driver.py
rank 0 of 4 received 3.141592
rank 1 of 4 received 3.141592
rank 2 of 4 received 3.141592
rank 3 of 4 received 3.141592
rank 0 of 4 received 101.100000
rank 2 of 4 received 101.100000
rank 3 of 4 received 101.100000
rank 1 of 4 received 101.100000
I tried using MPI_Comm_connect() and MPI_Comm_accept() through mpi4py, but I couldn't seem to get that working on the python side.
Since most of the time is spent in the C subroutine which is invoked multiple times, and you are running within a resource manager, I would suggest the following approach :
Start all the MPI tasks at once via the following command (assuming you have allocated n+1 slots):
mpirun -np 1 python wrapper.py : -np <n> a.out
You likely want to start with an MPI_Comm_split() in order to create a communicator containing only the n tasks implemented by the C program.
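As a hedged sketch (mine, not from the original answer; names are illustrative), the C side of that split could look like this; the python wrapper on rank 0 would have to make the matching call through mpi4py:

/* Illustrative: rank 0 is the python wrapper and opts out of the workers'
   communicator; ranks 1..n (the C program) share 'workers'. */
int rank;
MPI_Comm workers;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
int color = (rank == 0) ? MPI_UNDEFINED : 1;   /* MPI_UNDEFINED yields MPI_COMM_NULL */
MPI_Comm_split(MPI_COMM_WORLD, color, rank, &workers);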
Then you will define a "protocol" so the python wrapper can pass parameters to the C tasks, and wait for the result or direct the C program to MPI_Finalize().
You might as well consider using an inter-communicator (the first group for Python, the second for C), but this is really up to you. Inter-communicator semantics can be seen as non-intuitive, so make sure you understand how they work before going in that direction.

'clEnqueueNDRangeKernel' failed with error 'out of resources'

From my kernel, I call a function, say f, which has an infinite loop that breaks on ++depth > 5. This works without the following snippet.
for(int j = 0; j < 9; j++){
    f1 = inside(prev.o, s[j]);
    f2 = inside(x.o, s[j]);
    if((f1 ^ f2)){
        stage = 0;
        break;
    }
    else if(fabs(offset(x.o, s[j])) < EPSILON)
    {
        id = j;
        stage = 1;
        break;
    }
}
Looping over the 9 elements of s is the only thing I do here. This is inside the infinite loop. I checked, and this runs without a problem 2 times, but the third time it runs out of memory. What is going on? It's not like I am creating any new variables anywhere. There is a lot of code in the while loop that does more complicated computation than the above snippet, and it does not run into a problem. My guess is that I'm doing something wrong with storing s.
If you read the OpenCL documentation, the error is not produced because the kernel code is wrong. The code is not even run at all; everything happens at the queueing step:
OpenCL: clEnqueueNDRangeKernel
CL_OUT_OF_RESOURCES:
If there is a failure to queue the execution instance of kernel on the command-queue because of insufficient resources needed to execute the kernel. For example, the explicitly specified local_work_size causes a failure to execute the kernel because of insufficient resources such as registers or local memory.
Another example would be the number of read-only image args used in kernel exceed the CL_DEVICE_MAX_READ_IMAGE_ARGS value for device or the number of write-only image args used in kernel exceed the CL_DEVICE_MAX_WRITE_IMAGE_ARGS value for device or the number of samplers used in kernel exceed CL_DEVICE_MAX_SAMPLERS for device.
Check the local memory size, work-group size, constant memory, and kernel argument sizes.
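As a hedged illustration of that advice (this snippet is mine, not from the original answer; kernel and device are assumed host-side handles), the host can query what the kernel actually needs before choosing a local_work_size:

/* Sketch: query per-kernel limits before picking local_work_size. */
size_t max_wg;       /* largest work-group size this kernel supports */
cl_ulong local_mem;  /* local memory the kernel consumes             */
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                         sizeof(max_wg), &max_wg, NULL);
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                         sizeof(local_mem), &local_mem, NULL);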

Exit early on found in OpenCL

I'm trying to write an OpenCL implementation of memchr to help me learn how OpenCL works. What I'm planning to do is to assign each work item a chunk of memory to search. Then, inside each work item, it loops through the chunk searching for the character.
Especially if the buffer is large, I don't want the other threads to keep searching after an occurrence has already been found (assume there is only one occurrence of the character in any given buffer).
What I'm stuck on is: how does a work item indicate, both to the host and to the other threads, that it has found the character?
Thanks,
One way you could do this is to use a global flag variable: atomically set it to 1 when you find the value, and have the other threads check that value as they work.
For example:
__kernel void test(__global int* buffer, __global volatile int* flag)
{
    int tid = get_global_id(0);
    int sx = get_global_size(0);
    int i = tid;
    while(buffer[i] != 8)               // Whatever value we're trying to find.
    {
        int stop = atomic_add(flag, 0); // Read the atomic value
        if(stop)
            break;
        i = i + sx;
    }
    atomic_xchg(flag, 1);               // Set the atomic value
}
This might add more overhead than just running the whole kernel (unless you are doing a lot of work on every iteration). In addition, this method won't work if each thread only checks a single value in the array; each thread must have multiple iterations of work.
Finally, I've seen instances where a write to an atomic variable doesn't commit immediately, so you need to check whether this code deadlocks on your system because the write isn't committing.
