MPI synchronized timing

Basically, I need a function that returns the same value at the same physical time on all nodes. I am currently using gettimeofday, but I am not sure whether it provides synchronized timing across different nodes.
Now I am considering using MPI_Wtime. First I checked the value of MPI_WTIME_IS_GLOBAL when running MPI tasks, since the manual says:
The value returned for MPI_WTIME_IS_GLOBAL is 1 if clocks at all processes in MPI_COMM_WORLD are synchronized, 0 otherwise.
My MPI run returns 3 instead of 1 or 0. What does the returned "3" mean?
BTW, the manual also says that:
The boolean variable MPI_WTIME_IS_GLOBAL, a predefined attribute key that indicates whether clocks are synchronized, does not have a valid value in Open MPI, as the clocks are not guaranteed to be synchronized.
but I am actually using OpenMPI-1.7.2. Does that mean I can't get synchronized timing with OpenMPI?

From the OpenMPI (1.7.1) documentation:
The boolean variable MPI_WTIME_IS_GLOBAL, a predefined attribute key
that indicates whether clocks are synchronized, does not have a valid
value in Open MPI, as the clocks are not guaranteed to be
synchronized.
It seems unlikely to me that this feature will have changed from 1.7.1 to 1.7.2 so I suspect that the answer to your question is No, OpenMPI does not provide synchronised timing across processes. An integer value of 3 supports the idea that a boolean variable does not have a valid value.

Related

NetBSD - Can we force semget() to return same semid?

I am working on a NetBSD system.
As I recall from a book on UNIX programming by Richard Stevens, a semget() call returns different values for different invocations, even from the same thread.
I recently saw a group of processes in which different invocations of semget() returned the same set of values for the respective IPC keys. The same process image on different boxes also yields the same value for semid.
So, my question is: is there any way we can force semget() to exhibit this behaviour?
semget always returns the semaphore associated with the specified key. If you specify the same key, you will get the same semaphore. I believe this has to be true even on NetBSD.
int semget(key_t key, int nsems, int semflg);
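A minimal C sketch of the answer above, assuming a System V IPC implementation such as the one on NetBSD or Linux. The helper names open_sem_set and remove_sem_set and the key value used below are made up for illustration.

```c
#define _XOPEN_SOURCE 700   /* ask for the X/Open (System V IPC) declarations */
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Hypothetical helper: open the one-semaphore set registered under `key`,
 * creating it if it does not exist yet. Returns the semid, or -1 on error. */
int open_sem_set(key_t key)
{
    return semget(key, 1, IPC_CREAT | 0600);
}

/* Hypothetical helper: remove the set so the example leaves nothing behind.
 * Returns 0 on success, -1 on error. */
int remove_sem_set(int semid)
{
    return semctl(semid, 0, IPC_RMID);
}
```

Calling open_sem_set twice with the same key, for example (key_t)0x4d504953, should give the same semid for as long as the set exists on that machine. Passing IPC_PRIVATE as the key creates a new set on every call instead. Nothing in this sketch suggests the identifier has to match on a different machine or after the set has been removed and recreated.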

MPI and global variables

I have to implement an MPI program. There are some global variables (4 arrays of floats and 6 other single float variables) which are first initialized by the main process reading data from a file. Then I call MPI_Init and, while the process of rank 0 waits for results, the other processes (ranks 1, 2, 3, 4) work on the arrays, etc.
The problem is that those arrays no longer seem to be initialized: everything is set to 0. I tried moving the global variables inside the main function, but the result is the same. When MPI_Init() is called all processes are created by fork, right? So every process has a copy of the parent's memory, so why do they see uninitialized arrays?
I fear you have misunderstood.
It is probably best to think of each MPI process as an independent program, albeit one with the same source code as every other process in the computation. Operations that process 0 carries out on variables in its address space have no impact on the contents of the address spaces of other processes.
I'm not sure that the MPI standard even requires process 0 to have values for variables which were declared and initialised prior to the call to mpi_init, that is before process 0 really exists.
Whether it does or not you will have to write code to get the values into the variables in the address space of the other processes. One way to do this would be to have process 0 send the values to the other processes, either one by one or using a broadcast. Another way would be for all processes to read the values from the input files; if you choose this option watch out for contention over i/o resources.
In passing, I don't think it is common for MPI implementations to create processes by forking at the call to mpi_init, forking is more commonly used for creating threads. I think that most MPI implementations actually create the processes when you make a call to mpiexec, the call to mpi_init is the formality which announces that your program is starting its parallel computations.
When MPI_Init() is called all processes are created by fork right?
Wrong.
MPI spawns multiple instances of your program. These instances are separate processes, each with its own memory space. Each process has its own copy of every variable, including globals. MPI_Init() only initializes the MPI environment so that other MPI functions can be called.
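To illustrate that separate processes do not share variables, here is a small C sketch. It does not use MPI at all: plain fork() stands in for "another process", which is only an assumption made for the demonstration and not how MPI launches ranks. The names global_value and parent_view_after_child_write are hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* A global, like the arrays in the question. */
float global_value = 1.0f;

/* Starts a child process that overwrites its own copy of the global, then
 * returns the value the parent still sees. Returns -1.0f on fork/wait error. */
float parent_view_after_child_write(void)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1.0f;

    if (pid == 0) {
        global_value = 42.0f;   /* changes only the child's address space */
        _exit(0);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1.0f;

    return global_value;        /* the parent's copy was never touched */
}
```

The parent's copy keeps its original value after the child's write. With MPI the separation is the same, so values have to be sent explicitly (for example with a broadcast) or read by every rank.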
As the other answers say, that's not how MPI works. Data is unique to each process and must be explicitly transferred between processes using the API available in the MPI specification.
However, there are programming models that allow this sort of behavior. If, when you say parallel computing, you mean multiple cores on one processor, you might be better served by using something like OpenMP to share your data between threads.
Alternatively, if you do in fact need to use multiple processors (either because your data is too big to fit in one processor's memory, or some other reason), you can take a look at one of the Parallel Global Address Space (PGAS) languages. In those models, you have memory that is globally available to all processes in an execution.
Last, there is a part of MPI that does allow you to expose memory from one process to other processes. It's the Remote Memory Access (RMA) or One-Sided chapter. It can be complex, but powerful if that's the kind of computing model you need.
All of these models will require changing the way your application works, but it sounds like they might map to your problem better.

Are there any reasons why one should use MPI's Wtime?

I've been wondering whether there are any particular reasons why one should use Wtime instead of other time measurement methods? Is it more accurate or reliable?
The only reason I see is platform independence.
Since MPI_Wtime() guarantees that the beginning time at all ranks is the same, it can be used not only for calculating the time between any two points on the same rank, but also for conveniently comparing the time taken by different ranks to reach a certain point.
There may be other applications for this globally synchronized clock too, but right now I can only think of this one.
MPI_Wtime() does not guarantee global synchronization among processes on different nodes. It does provide a synchronized clock for processes on the same node, but gettimeofday() provides that as well.
According to the manual for MPI_Wtime (Open MPI 4.0.0):
On POSIX platforms, this function may utilize a timer that is cheaper to invoke than the gettimeofday() system call, but will fall back to gettimeofday() if a cheap high-resolution timer is not available. The ompi_info command can be consulted to see if Open MPI supports a native high-resolution timer on your platform; see the value for "MPI_WTIME support" (or "options:mpi-wtime" when viewing the parsable output). If this value is "native", a method that is likely to be cheaper than gettimeofday() will be used to obtain the time when MPI_Wtime is invoked.

Asynchronous MPI with SysV shared memory

We have a large Fortran/MPI code-base which makes use of system-V shared memory segments on a node. We run on fat nodes with 32 processors, but only 2 or 4 NICs, and relatively little memory per CPU; so the idea is that we set up a shared memory segment, on which each CPU performs its calculation (in its block of the SMP array). MPI is then used to handle inter-node communications, but only on the master in the SMP group. The procedure is double-buffered, and has worked nicely for us.
The problem came when we decided to switch to asynchronous comms, for a bit of latency hiding. Since only a couple of CPUs on the node communicate over MPI, but all of the CPUs see the received array (via shared memory), a CPU doesn't know when the communicating CPU has finished, unless we enact some kind of barrier, and then why do asynchronous comms?
The ideal, hypothetical solution would be to put the request tags in an SMP segment and run mpi_request_get_status on the CPU which needs to know. Of course, the request tag is only registered on the communicating CPU, so it doesn't work! Another proposed possibility was to branch a thread off on the communicating thread and use it to run mpi_request_get_status in a loop, with the flag argument in a shared memory segment, so all the other images can see. Unfortunately, that's not an option either, since we are constrained not to use threading libraries.
The only viable option we've come up with seems to work, but feels like a dirty hack. We put an impossible value in the upper-bound address of the receive buffer, that way once the mpi_irecv has completed, the value has changed and hence every CPU knows when it can safely use the buffer. Is that ok? It seems that it would only work reliably if the MPI implementation can be guaranteed to transfer data consecutively. That almost sounds convincing, since we've written this thing in Fortran and so our arrays are contiguous; I would imagine that the access would be also.
Any thoughts?
Thanks,
Joly
Here's a pseudo-code template of the kind of thing I'm doing. Haven't got the code as a reference at home, so I hope I haven't forgotten anything crucial, but I'll make sure when I'm back to the office...
pseudo(array_arg1(:,:), array_arg2(:,:)...)
    integer, parameter : num_buffers=2
    Complex64bit, smp : buffer(:,:,num_buffers)
    integer : prev_node, next_node
    integer : send_tag(num_buffers), recv_tag(num_buffers)
    integer : current, next
    integer : num_nodes
    boolean : do_comms
    boolean, smp : safe(num_buffers)
    boolean, smp : calc_complete(num_cores_on_node,num_buffers)
    allocate_arrays(...)
    work_out_neighbours(prev_node,next_node)
    am_i_a_slave(do_comms)
    setup_ipc(buffer,...)
    setup_ipc(safe,...)
    setup_ipc(calc_complete,...)
    current = 1
    next = mod(current,num_buffers)+1
    safe=true
    calc_complete=false
    work_out_num_nodes_in_ring(num_nodes)
    do i=1,num_nodes
        if(do_comms)
            check_all_tags_and_set_safe_flags(send_tag, recv_tag, safe) # just in case anything else has finished.
            check_tags_and_wait_if_need_be(current, send_tag, recv_tag)
            safe(current)=true
        else
            wait_until_true(safe(current))
        end if
        calc_complete(my_rank,current)=false
        calc_complete(my_rank,current)=calculate_stuff(array_arg1,array_arg2..., buffer(current), bounds_on_process)
        if(not calc_complete(my_rank,current)) error("fail!")
        if(do_comms)
            check_all_tags_and_set_safe(send_tag, recv_tag, safe)
            check_tags_and_wait_if_need_be(next, send_tag, recv_tag)
            recv(prev_node, buffer(next), recv_tag(next))
            safe(next)=false
            wait_until_true(all(calc_complete(:,current)))
            check_tags_and_wait_if_need_be(current, send_tag, recv_tag)
            send(next_node, buffer(current), send_tag(current))
            safe(current)=false
        end if
        work_out_new_bounds()
        current=next
        next=mod(next,num_buffers)+1
    end do
end pseudo
So ideally, I would have liked to have run "check_all_tags_and_set_safe_flags" in a loop in another thread on the communicating process, or even better: do away with "safe flags" and make the handle to the sends / receives available on the slaves, then I could run: "check_tags_and_wait_if_need_be(current, send_tag, recv_tag)" (mpi_wait) before the calculation on the slaves instead of "wait_until_true(safe(current))".
"...unless we enact some kind of barrier, and then why do asynchronous comms?"
That sentence is a bit confused. The purpose of asynchronous communications is to overlap communication and computation, so that you can hopefully get some real work done while the communication is going on. But this means you now have two tasks occurring which eventually have to be synchronized, so there has to be something which blocks the tasks at the end of the first communication phase before they go on to the second computation phase (or whatever).
The question of what to do in this case to implement things nicely (it seems like what you've got now works but you're rightly concerned about the fragility of the result) depends on how you're doing the implementation. You use the word threads, but (a) you're using sysv shared memory segments, which you wouldn't need to do if you had threads, and (b) you're constrained not to be using threading libraries, so presumably you actually mean you're fork()ing processes after MPI_Init() or something?
I agree with Hristo that your best bet is almost certainly to use OpenMP for on-node distribution of computation, and would probably greatly simplify your code. It would help to know more about your constraint to not use threading libraries.
Another approach which would still avoid you having to "roll your own" process-based communication layer that you use in addition to MPI would be to have all the processes on the node be MPI processes, but create a few communicators - one to do the global communications, and one "local" communicator per node. Only a couple of processes per node would be a part of a communicator which actually does off-node communications, and the others do work on the shared memory segment. Then you could use MPI-based methods for synchronization (Wait, or Barrier) for the on-node synchronization. The upcoming MPI3 will actually have some explicit support for using local shared memory segments this way.
Finally, if you're absolutely bound and determined to keep doing things through what's essentially your own local-node-only IPC implementation --- since you're already using SysV shared memory segments, you might as well use SysV semaphores to do the synchronization. You're already using your own (somewhat delicate) semaphore-like mechanism to "flag" when the data is ready for computation; here you could use a more robust, already-written semaphore to let the non-MPI processes know when the data is ready for computation (and a similar mechanism to let the MPI process know when the others are done with the computation).
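The SysV semaphore idea can be sketched in C roughly as follows. This is only an illustration under stated assumptions: fork() stands in for the non-MPI worker processes, no MPI call or shared buffer appears, and the names semun_arg, sem_change and demo_buffer_ready are hypothetical.

```c
#define _XOPEN_SOURCE 700
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>
#include <unistd.h>

/* Several systems expect the caller to define the semctl() argument union. */
union semun_arg { int val; struct semid_ds *buf; unsigned short *array; };

/* Add `delta` to semaphore 0; a negative delta blocks until it can be applied. */
int sem_change(int semid, short delta)
{
    struct sembuf op = { .sem_num = 0, .sem_op = delta, .sem_flg = 0 };
    return semop(semid, &op, 1);
}

/* One worker blocks until the "communicating" process signals that the
 * buffer is ready. Returns 0 if the hand-off worked, -1 otherwise. */
int demo_buffer_ready(void)
{
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (semid < 0)
        return -1;

    union semun_arg arg;
    arg.val = 0;                         /* start as "buffer not ready" */
    if (semctl(semid, 0, SETVAL, arg) < 0) {
        semctl(semid, 0, IPC_RMID);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        semctl(semid, 0, IPC_RMID);
        return -1;
    }
    if (pid == 0) {
        /* Worker: sleeps here until the semaphore is posted. */
        _exit(sem_change(semid, -1) == 0 ? 0 : 1);
    }

    /* Communicating process: this is where the receive would have completed. */
    int rc = sem_change(semid, 1);
    if (rc != 0)
        semctl(semid, 0, IPC_RMID);      /* wakes the worker with an error */

    int status = 0;
    waitpid(pid, &status, 0);
    if (rc == 0)
        semctl(semid, 0, IPC_RMID);

    return (rc == 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

With several workers per node, the communicating process would post once per worker (or use one semaphore per worker), and a second semaphore could carry the "calculation finished" signal in the other direction.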

size of _POSIX_PATH_MAX

Is the size of _POSIX_PATH_MAX the same for all Unix flavors (Linux, Solaris)?
No, the path length limit is not even necessarily the same for given instances of the exact same version of the kernel. In most kernels it's a configurable parameter. It will often require a kernel recompile or relink to change, but it can change without having a whole new kernel. (Strictly speaking, _POSIX_PATH_MAX in <limits.h> is the fixed minimum that POSIX requires every system to support; the limit that actually varies is PATH_MAX.)
On some systems the actual limit is not available as an integer literal at compile time (PATH_MAX may be left undefined); it has to be obtained through a call that returns an integer. So if the kernel allows the system to be reconfigured at runtime, the call will return the current value of the parameter.
I would simply assume that it can't change during the lifetime of your program. If you assume it can change at any time you end up with race conditions where the value changes in between the time you read it and the time you use it. If you just explicitly state that your program assumes it never changes during the lifetime of the program, then system admins who run it will have to adopt the practice they should be adopting anyway and only change the kernel parameter at startup.
There are three POSIX specified calls that will interest you here:
pathconf and fpathconf
sysconf
I would recommend hunting down other sources as well to get a good feel for which variables are widely supported and which aren't.
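A short C sketch of querying limits at run time with the calls listed above. The wrapper names runtime_path_max and runtime_open_max are hypothetical, and _SC_OPEN_MAX is used only as an example of a sysconf variable.

```c
#define _XOPEN_SOURCE 700
#include <limits.h>
#include <unistd.h>

/* Path-length limit for the filesystem that holds `path`.
 * Returns -1 if the system reports no fixed limit, or on error. */
long runtime_path_max(const char *path)
{
    return pathconf(path, _PC_PATH_MAX);
}

/* Example of a system-wide limit obtained through sysconf():
 * the maximum number of files a process may have open. */
long runtime_open_max(void)
{
    return sysconf(_SC_OPEN_MAX);
}
```

A program could call runtime_path_max("/") once at startup and size its buffers from the result, falling back to a value of its own choosing when -1 is returned. The compile-time constant _POSIX_PATH_MAX from <limits.h> is only a lower bound on what a conforming system supports.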
