MPI 3 shared memory and cache conflicts

When using MPI 3 shared memory, I noticed that writing simultaneously from different tasks to adjacent memory positions of a shared memory window seemingly does not work.
My guess was that MPI ignores possible cache conflicts. Is that correct, i.e. does MPI indeed not care about cache coherency? Is this a quirk of the implementation, or is there a completely different explanation for this behaviour?
Here is a minimal example where, in Fortran, simultaneously writing to distinct addresses in a shared memory window causes a conflict (tested with Intel MPI 2017, 2018 and 2019, and with Open MPI 3 built with GNU compilers).
program testAlloc
  use mpi
  use, intrinsic :: ISO_C_BINDING, only: c_ptr, c_f_pointer
  implicit none
  integer :: ierr
  integer :: window
  integer(kind=MPI_Address_kind) :: wsize
  type(c_ptr) :: baseptr
  integer, pointer :: f_ptr
  integer :: comm_rank
  call MPI_Init(ierr)
  ! Each processor allocates one entry
  wsize = 1
  call MPI_WIN_ALLOCATE_SHARED(wsize, 4, MPI_INFO_NULL, MPI_COMM_WORLD, baseptr, window, ierr)
  ! Convert to a fortran pointer
  call c_f_pointer(baseptr, f_ptr)
  ! Now, assign some value simultaneously
  f_ptr = 4
  ! For output, get the mpi rank
  call mpi_comm_rank(MPI_COMM_WORLD, comm_rank, ierr)
  ! Output the assigned value - only one task reports 4, the others report junk
  print *, "On task", comm_rank, "value is", f_ptr
  call MPI_Win_free(window, ierr)
  call MPI_Finalize(ierr)
end program
Curiously, the same program in C does seem to work as intended, which raises the question of whether something is wrong with the Fortran version or whether the C program is just lucky (tested with the same MPI libraries).
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[]){
    MPI_Init(&argc, &argv);
    // Allocate a single resource per task
    MPI_Aint wsize = 1;
    // Do a shared allocation
    int *resource;
    MPI_Win window;
    MPI_Win_allocate_shared(wsize, sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &resource, &window);
    // For output clarification, get the mpi rank
    int comm_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);
    // Assign some value
    *resource = 4;
    // Tell us the value - this seems to work
    printf("On task %d the value is %d\n", comm_rank, *resource);
    MPI_Win_free(&window);
    MPI_Finalize();
}

From the MPI 3.1 standard (chapter 11.2.3, page 407):
MPI_WIN_ALLOCATE_SHARED(size, disp_unit, info, comm, baseptr, win)
IN size size of local window in bytes (non-negative integer)
Note that the window size is in bytes, not in number of units.
So all you need is to use
wsize = 4
in Fortran (assuming your INTEGER size is indeed 4) and
wsize = sizeof(int);
in C
FWIW:
Even if the C version seems to give the expected result most of the time, it is also incorrect; I was able to evidence this by running the program under a debugger.
Generally speaking, you might have to declare volatile int *resource; in C to prevent the compiler from performing optimizations that might impact the behavior of your app (though this is not needed here).
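For reference, here is a sketch of the C program with the fix above applied (the Fortran fix is the same idea, wsize = 4): the size passed to MPI_Win_allocate_shared is now given in bytes.
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[]){
    MPI_Init(&argc, &argv);
    // The size argument is in bytes, so request sizeof(int) bytes per task
    MPI_Aint wsize = sizeof(int);
    int *resource;
    MPI_Win window;
    MPI_Win_allocate_shared(wsize, sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &resource, &window);
    int comm_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &comm_rank);
    // Assign and print the value
    *resource = 4;
    printf("On task %d the value is %d\n", comm_rank, *resource);
    MPI_Win_free(&window);
    MPI_Finalize();
}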

Related

How to interface Python with a C program that uses MPI

Currently I have a Python program (serial) that calls a C executable (parallel through MPI) through subprocess.run. However, this is a terribly clunky implementation, as it means I have to pass some very large arrays back and forth between the Python and C programs using the file system. I would like to be able to directly pass the arrays from Python to C and back. I think ctypes is what I should use. As I understand it, I would create a dll instead of an executable from my C code to be able to use it with Python.
However, to use MPI you need to launch the program using mpirun/mpiexec. This is not possible if I am simply using the C functions from a dll, correct?
Is there a good way to enable MPI for the function called from the dll? The two possibilities I've found are
launch the python program in parallel using mpi4py, then pass MPI_COMM_WORLD to the C function (per this post How to pass MPI information to ctypes in python)
somehow initialize and spawn processes inside the function without using mpirun. I'm not sure if this is possible.
One possibility, if you are OK with passing everything through the C program's rank 0, is to use subprocess.Popen() with stdin=subprocess.PIPE and the communicate() function on the Python side, and fread() on the C side.
This is obviously fragile, but it does keep everything in memory. Also, if your data size is large (which you said it is), you may have to write the data to the child process in chunks. Another option could be to use exe.stdin.write(x) rather than exe.communicate(x).
I created a small example program
C code (program named child):
#include "mpi.h"
#include "stdio.h"
int main(int argc, char *argv[]){
MPI_Init(&argc, &argv);
int size, rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
double ans;
if(rank == 0){
fread(&ans, sizeof(ans), 1, stdin);
}
MPI_Bcast(&ans, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
printf("rank %d of %d received %lf\n", rank, size, ans);
MPI_Finalize();
}
Python code (named driver.py):
#!/usr/bin/env python
import ctypes as ct
import subprocess as sp
x = ct.c_double(3.141592)
exe = sp.Popen(['mpirun', '-n', '4', './child'], stdin=sp.PIPE)
exe.communicate(x)
x = ct.c_double(101.1)
exe = sp.Popen(['mpirun', '-n', '4', './child'], stdin=sp.PIPE)
exe.communicate(x)
results:
> python ./driver.py
rank 0 of 4 received 3.141592
rank 1 of 4 received 3.141592
rank 2 of 4 received 3.141592
rank 3 of 4 received 3.141592
rank 0 of 4 received 101.100000
rank 2 of 4 received 101.100000
rank 3 of 4 received 101.100000
rank 1 of 4 received 101.100000
I tried using MPI_Comm_connect() and MPI_Comm_accept() through mpi4py, but I couldn't seem to get that working on the python side.
Since most of the time is spent in the C subroutine, which is invoked multiple times, and you are running within a resource manager, I would suggest the following approach:
Start all the MPI tasks at once via the following command (assuming you have allocated n+1 slots):
mpirun -np 1 python wrapper.py : -np <n> a.out
You likely want to start with an MPI_Comm_split() in order to generate a communicator containing only the n tasks implemented by the C program.
Then you will define a "protocol" so the Python wrapper can pass parameters to the C tasks and wait for the result, or direct the C program to MPI_Finalize().
You might also consider using an intercommunicator (first group for Python, second group for C), but this is really up to you. Intercommunicator semantics can be unintuitive, so make sure you understand how they work before going in that direction.
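To make the split/protocol idea concrete, here is a minimal sketch of the C side only. The command codes (CMD_COMPUTE, CMD_SHUTDOWN) and the single double parameter are made up for illustration; the Python wrapper at world rank 0 would have to issue matching calls via mpi4py, since MPI_Comm_split and MPI_Bcast are collective over MPI_COMM_WORLD.
#include <mpi.h>
/* Hypothetical command codes agreed on with the Python wrapper. */
#define CMD_SHUTDOWN 0
#define CMD_COMPUTE  1
int main(int argc, char *argv[]){
    MPI_Init(&argc, &argv);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    /* World rank 0 is the Python wrapper; split off a communicator
       that contains only the C workers (the wrapper uses a different color). */
    MPI_Comm workers;
    MPI_Comm_split(MPI_COMM_WORLD, 1, world_rank, &workers);
    int cmd;
    do {
        /* The wrapper broadcasts the next command to everyone. */
        MPI_Bcast(&cmd, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (cmd == CMD_COMPUTE) {
            double param;
            MPI_Bcast(&param, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
            /* ... do the parallel work on the `workers` communicator ... */
        }
    } while (cmd != CMD_SHUTDOWN);
    MPI_Comm_free(&workers);
    MPI_Finalize();
}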

Variable transform output with fftw3 mpi and modern fortran: heap corruption? Round off error?

I'm new to modern Fortran and am trying to write a lengthy program using FFTW3 and MPI with transposed output. In the process of debugging my (exploding) program, I noticed that the output of the transforms varies slightly from one execution of the program to the next. I wrote a small test program, which on my platform reproduces the variation, and which appears below. The "problem" doesn't appear for 64x64 or 128x128 arrays, but does for 512x512 arrays. The variation seems to be in the last few digits, and doesn't happen every time. (I have to run the program repeatedly to see it.) Does this suggest heap corruption? I couldn't find anything with Valgrind. I searched Stack Overflow and Googled but can't figure out what's wrong. Am I using pointers incorrectly? Is it just round-off error?
I compiled FFTW3 from source with the default double precision.
I'm compiling the program below with mpifort -o fickledigits fickledigits.F90 -L/usr/local/lib/openmpi -lfftw3_mpi -lfftw3 -lm
program fickledigits
  use, intrinsic :: iso_c_binding
  use mpi
  implicit none
  include 'fftw3-mpi.f03'
  !--------------------------------------------------------------------
  integer(kind=4) :: np,id,ierr
  integer(kind=4) :: i,j,it
  integer(C_INTPTR_T), parameter :: L=512,M=L,LL=L/2+1
  integer(C_INTPTR_T) :: alloc_local,local_LL,local_M
  integer(C_INTPTR_T) :: local_kj_offset,local_i_offset
  real(C_DOUBLE), pointer :: in(:,:)
  complex(C_DOUBLE_COMPLEX), pointer :: out(:,:)
  type(C_PTR) :: data,plan_r2c,plan_c2r
  complex(C_DOUBLE_COMPLEX), target, allocatable :: qq(:,:)
  !--------------------------------------------------------------------
  call MPI_Init ( ierr )
  call MPI_Comm_size ( MPI_COMM_WORLD, np, ierr )
  call MPI_Comm_rank ( MPI_COMM_WORLD, id, ierr )
  call fftw_mpi_init()
  alloc_local = fftw_mpi_local_size_2d_transposed(LL,M,MPI_COMM_WORLD,&
       local_LL,local_kj_offset,local_M,local_i_offset)
  data = fftw_alloc_complex(alloc_local)
  call c_f_pointer(data,out,[M,local_LL])
  call c_f_pointer(data,in,[2*LL,local_M])
  plan_r2c = fftw_mpi_plan_dft_r2c_2d(L,M,in,out,MPI_COMM_WORLD,&
       ior(FFTW_MEASURE,FFTW_MPI_TRANSPOSED_OUT))
  plan_c2r = fftw_mpi_plan_dft_c2r_2d(L,M,out,in,MPI_COMM_WORLD,&
       ior(FFTW_MEASURE,FFTW_MPI_TRANSPOSED_IN))
  allocate(qq(M,local_LL))
  do i=1,local_LL
     do j=1,M
        qq(j,i)=dcmplx(j,i)
     enddo
  enddo
  out=>qq
  call fftw_mpi_execute_dft_c2r(plan_c2r,out,in)
  print*, 'in(10,10)=', in(10,10)
  call fftw_destroy_plan(plan_r2c)
  call fftw_destroy_plan(plan_c2r)
  call fftw_free(data)
  deallocate(qq)
  call MPI_FINALIZE(ierr)
end program

Exit early on found in OpenCL

I'm trying to write an OpenCL implementation of memchr to help me learn how OpenCL works. What I'm planning to do is to assign each work item a chunk of memory to search; each work item then loops through its chunk searching for the character.
Especially if the buffer is large, I don't want the other threads to keep searching after an occurrence has already been found (assume there is only one occurrence of the character in any given buffer).
What I'm stuck on is: how does a work item indicate, both to the host and to the other threads, that it has found the character?
Thanks,
One way you could do this is to use a global flag variable. You atomically set it to 1 when you find the value, and the other threads check that value while they are doing their work.
For example:
__kernel void test(__global int* buffer, __global volatile int* flag)
{
    int tid = get_global_id(0);
    int sx = get_global_size(0);
    int i = tid;
    while(buffer[i] != 8) // Whatever value we're trying to find.
    {
        int stop = atomic_add(flag, 0); // Read the atomic value
        if(stop)
            break;
        i = i + sx;
    }
    atomic_xchg(flag, 1); // Set the atomic value
}
This might add more overhead than just running the whole kernel (unless you are doing a lot of work on every iteration). In addition, this method won't work if each thread only checks a single value in the array; each thread must have multiple iterations of work.
Finally, I've seen instances where a write to an atomic variable doesn't become visible immediately, so check whether this code deadlocks on your system because the write isn't committing.

How can I perform 64-bit division with a 32-bit divide instruction?

This is (AFAIK) a specific question within this general topic.
Here's the situation:
I have an embedded system (a video game console) based on a 32-bit RISC microcontroller (a variant of NEC's V810). I want to write a fixed-point math library. I read this article, but the accompanying source code is written in 386 assembly, so it's neither directly usable nor easily modifiable.
The V810 has built-in integer multiply/divide, but I want to use the 18.14 format mentioned in the above article. This requires dividing a 64-bit int by a 32-bit int, and the V810 only does (signed or unsigned) 32-bit/32-bit division (which produces a 32-bit quotient and a 32-bit remainder).
So, my question is: how do I simulate a 64-bit/32-bit divide with a 32-bit/32-bit one (to allow for the pre-shifting of the dividend)? Or, to look at the problem another way, what's the best way to divide one 18.14 fixed-point number by another using standard 32-bit arithmetic/logic operations? ("best" meaning fastest, smallest, or both).
Algebra, (V810) assembly, and pseudo-code are all fine. I will be calling the code from C.
Thanks in advance!
EDIT: Somehow I missed this question... However, it will still need some modification to be super-efficient (it has to be faster than the floating-point div provided by the v810, though it may already be...), so feel free to do my work for me in exchange for reputation points ;) (and credit in my library documentation, of course).
GCC has such a routine for many processors, named __divdi3 (usually implemented using a common divmod helper). Here's one. Some Unix kernels have an implementation too, e.g. FreeBSD.
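As an illustration (a sketch, not V810-specific): on a 32-bit target without a hardware 64-bit divide, writing the division in plain C is enough to make GCC emit a call to that libgcc helper, so you only need hand-written code if the generic routine turns out to be too slow.
#include <stdint.h>
/* GCC lowers this to a __divdi3 call on 32-bit targets. */
int64_t div64_by_32(int64_t num, int32_t den) {
    return num / den;
}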
If your dividend is an unsigned 64-bit integer, your divisor is an unsigned 32-bit integer, and the architecture is i386 (x86), the div assembly instruction can help you with some preparation:
#include <stdint.h>
/* Returns *a % b, and sets *a = *a_old / b; */
uint32_t UInt64DivAndGetMod(uint64_t *a, uint32_t b) {
#ifdef __i386__  /* u64 / u32 division with little i386 machine code. */
    uint32_t upper = ((uint32_t*)a)[1], r;
    ((uint32_t*)a)[1] = 0;
    if (upper >= b) {
        ((uint32_t*)a)[1] = upper / b;
        upper %= b;
    }
    __asm__("divl %2" : "=a" (((uint32_t*)a)[0]), "=d" (r) :
            "rm" (b), "0" (((uint32_t*)a)[0]), "1" (upper));
    return r;
#else
    const uint64_t q = *a / b;      /* Calls __udivdi3 in libgcc. */
    const uint32_t r = *a - b * q;  /* `r = *a % b' would use __umoddi3. */
    *a = q;
    return r;
#endif
}
If the line above with __udivdi3 doesn't compile for you, use the __div64_32 function from the Linux kernel: https://github.com/torvalds/linux/blob/master/lib/div64.c
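To tie this back to the 18.14 fixed-point use case, here is a minimal sketch of a fixed-point divide built on the helper above (Fx14Div is a made-up name; it assumes the quotient fits in 32 bits and handles the sign separately, since the helper is unsigned).
#include <stdint.h>
/* 18.14 fixed-point divide: pre-shift the dividend left by 14 bits,
   strip the signs, divide with the 64/32 helper, then restore the sign. */
int32_t Fx14Div(int32_t a, int32_t b) {
    int negative = (a < 0) != (b < 0);
    uint64_t dividend = (uint64_t)(a < 0 ? -(int64_t)a : a) << 14;
    uint32_t divisor  = (uint32_t)(b < 0 ? -(int64_t)b : b);
    UInt64DivAndGetMod(&dividend, divisor);  /* dividend now holds the quotient */
    int32_t q = (int32_t)dividend;
    return negative ? -q : q;
}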

Why does MPI_Init accept pointers to argc and argv?

This is how we use the MPI_Init function:
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    …
}
Why does MPI_Init take pointers to argc and argv instead of their values?
According to the answer stated here:
Passing arguments via command line with MPI
Most MPI implementations will remove all the mpirun-related arguments in this function so that, after calling it, you can address command line arguments as though it were a normal (non-mpirun) command execution.
i.e. after
mpirun -np 10 myapp myparam1 myparam2
argc = 7(?) because of the mpirun parameters (it also seems to add some) and the indices of myparam1 and myparam2 are unknown
but after
MPI_Init(&argc, &argv)
argc = 3 and myparam1 is at argv[1] and myparam2 is at argv[2]
Apparently this is outside the standard, but I've tested it with MPICH on Linux and it certainly seems to be the case. Without this behaviour it would be very difficult (impossible?) to distinguish application parameters from mpirun parameters.
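A quick way to check this on your own installation (a sketch; the exact counts are implementation-dependent) is to print the arguments before and after the call:
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv)
{
    printf("before MPI_Init: argc = %d\n", argc);
    MPI_Init(&argc, &argv);
    printf("after  MPI_Init: argc = %d\n", argc);
    for (int i = 0; i < argc; i++)
        printf("  argv[%d] = %s\n", i, argv[i]);
    MPI_Finalize();
    return 0;
}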
My guess: to potentially allow removing MPI-specific arguments from the command line.
Passing the argument count by pointer allows MPI_Init to modify its value as seen from main.
According to OpenMPI man pages:
MPI_Init(3) man page
Open MPI accepts the C/C++ argc and argv arguments to main, but neither modifies, interprets, nor distributes them.
I'm not an expert, but I believe the simple answer is that each node you're working with runs its own copy of the code. Passing these arguments allows each of the nodes to have access to argc and argv even though they were not passed to them through the command line interface.
The original or master node that calls MPI_Init is passed these arguments; MPI_Init allows the other nodes to access them as well.
It is less overhead to just pass two pointers.
