Why is MPI_Win_unlock so slow? - mpi

My application uses one-sided communications (MPI_Rget, MPI_Raccumulate) together with the passive-target synchronization primitives MPI_Win_lock and MPI_Win_unlock.
I profiled my application and found that most of the time is spent in MPI_Win_unlock (not MPI_Win_lock), and I cannot understand why.
(1) Does anyone know why MPI_Win_unlock takes so much time? (Maybe it's an implementation issue?)
(2) Would the situation improve if I moved to the S/C/P/W (start/complete/post/wait) synchronization model?
I just need to be sure that the one-sided operations do not overlap concurrently.
I am using Intel MPI Library version 5.1, which implements MPI-3.
I appended some snippets of my code (actually, that's all of it :D).
Each MPI process runs 'Run()':
Run()
    // Join
    For each Target_Proc i in MPI_COMM_WORLD:
        RequestDataFrom((i + k) % nprocs)  // asynchronously request the k-step-away neighbor's data
        ConsumeDataFrom(i)
        JoinWithMyData(my_rank, i)
        WriteBackDataTo(i)
        go to the above 'For loop' again if the termination condition does not hold
    MPI_Barrier(MPI_COMM_WORLD)

    // Update data in window
    UpdateMyWindow(my_rank)

RequestDataFrom(target_rank_id)
    MPI_Win_lock(MPI_LOCK_SHARED, target_rank_id, win)
    MPI_Rget(from target_rank_id, win, &requests[target_rank_id])
    MPI_Win_unlock(target_rank_id, win)

ConsumeDataFrom(target_rank_id)
    MPI_Wait(&requests[target_rank_id])
    GetPointerToBuffer(target_rank_id)

WriteBackDataTo(target_rank_id)
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target_rank_id, win)
    MPI_Rput(to target_rank_id, win, &requests[target_rank_id])
    MPI_Win_unlock(target_rank_id, win)

UpdateMyWindow()
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target_rank_id, win)
    Update()
    MPI_Win_unlock(target_rank_id, win)

The function MPI_Win_unlock blocks until all RMA operations of the access epoch have completed.
As such it is no surprise that your profiler shows this function taking the majority of the time: it blocks until the MPI implementation has completed all one-sided communication operations that were posted since the corresponding MPI_Win_lock.
Note that the one-sided operations (Put, Get, etc.) merely dispatch the operation and do not block until it has completed. These operations are therefore effectively very similar to the non-blocking communication functions (MPI_Isend/MPI_Irecv), only without the MPI_Request object. To continue the analogy, MPI_Win_unlock waits for all operations of the epoch to complete, similar to MPI_Waitall.
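To make the analogy concrete, here is a minimal sketch of such an access epoch (the window, buffer, and target rank are illustrative placeholders, not taken from the question's code):

#include <mpi.h>

/* Illustrative only: buf, n, target and win are placeholders. */
void epoch_example(MPI_Win win, double *buf, int n, int target)
{
    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);               /* open the access epoch    */
    MPI_Get(buf, n, MPI_DOUBLE, target, 0, n, MPI_DOUBLE, win);  /* only starts the transfer */
    /* further one-sided calls could be issued here without blocking */
    MPI_Win_unlock(target, win);  /* blocks until every operation above has completed,
                                     analogous to MPI_Waitall on the whole epoch */
}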

Related

Device-side enqueue causes CL_OUT_OF_RESOURCES

I have a program utilizing OpenCL 2.0 because I want to take advantage of device-side enqueue. I have a test program that performs the following tasks on the host side:
1. Allocates 16 kilobytes of floating point memory on the device and zeros it out.
2. Builds the OpenCL program below and creates a kernel of masterKernel().
3. Sets the first argument of masterKernel() (heap) to the memory allocated in step 1.
4. Enqueues that masterKernel() via clEnqueueNDRangeKernel() with a work_dim of 1 and a global work size of 1 (so it only runs once, with get_global_id(0) always being zero).
5. Reads the memory back into the host and displays it.
Here is the OpenCL code:
//This function was stripped down to nothing for testing purposes.
kernel void childKernel(global float* heap)
{
}

//Enqueues the child kernel.
kernel void masterKernel(global float* heap)
{
    ndrange_t ndRange = ndrange_1D(16); //Arbitrary, could be any number.
    if(get_global_id(0) == 0)
    {
        enqueue_kernel(get_default_queue(), 0, ndRange,
                       ^{ childKernel(heap); });
    }
}
The program builds successfully. However, when I try to run masterKernel(), the call to enqueue_kernel() causes the host-side call to clEnqueueNDRangeKernel() to fail with an error code of CL_OUT_OF_RESOURCES. OpenCL's documentation says enqueue_kernel() should return CL_SUCCESS or CL_ENQUEUE_FAILURE depending on whether the block enqueues successfully or not; it does not say that clEnqueueNDRangeKernel() itself should fail. Here are some other things I've tried:
Commenting out the call to enqueue_kernel() causes the program to succeed.
Adding a line that sets heap[0] to any number causes the host-side program to reflect that change, so I know it's not a problem with how I'm feeding in the arguments.
Modifying the if statement so that it reads something impossible, like if(get_global_id(0) == 6000), still causes the error. This tells me the error is not caused by enqueue_kernel() executing (I verified get_global_size(0) == 1), but merely by its presence in the program at all.
Modifying the if statement to if(0) does make the error go away.
Making childKernel() actually do something does not make the error go away.
I am not really sure what to try next. I know my device supports OpenCL 2.0. My device is an AMD Radeon R9 380 graphics card. I do not have access to any other OpenCL 2.0 capable hardware to test on.
I ended up figuring this one out. The issue happened because I had not created a device-side queue (one with the flags CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_ON_DEVICE | CL_QUEUE_ON_DEVICE_DEFAULT).
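For reference, a hedged sketch of creating that device-side default queue on the host with clCreateCommandQueueWithProperties (the context and device arguments are assumed to come from the surrounding host code; error handling omitted):

#include <CL/cl.h>

/* Illustrative helper: returns the device-side default queue used by get_default_queue(). */
cl_command_queue make_device_default_queue(cl_context context, cl_device_id device)
{
    cl_queue_properties props[] = {
        CL_QUEUE_PROPERTIES,
        (cl_queue_properties)(CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE |
                              CL_QUEUE_ON_DEVICE |
                              CL_QUEUE_ON_DEVICE_DEFAULT),
        0
    };
    cl_int err = CL_SUCCESS;
    return clCreateCommandQueueWithProperties(context, device, props, &err);
}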

When to use MPI_BUFFER_ATTACH?

As far as I know, MPI_BUFFER_ATTACH must be called by a process if it is going to do buffered communication. But does this include the standard MPI_SEND as well? We know that MPI_SEND may behave either as a synchronous send or as a buffered send.
You need to call MPI_Buffer_attach() only if you plan to perform (explicitly) buffered sends via MPI_Bsend().
If you only plan to MPI_Send() or MPI_Isend(), then you do not need to invoke MPI_Buffer_attach().
FWIW, buffered sends are error prone and I strongly encourage you not to use them.
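In case it helps, a minimal, self-contained sketch of the buffered-send pattern (the counts, datatypes, and destination rank are arbitrary; the sizing rule follows the MPI_Buffer_attach notes quoted below):

#include <mpi.h>
#include <stdlib.h>

/* Illustrative only: message sizes and the destination rank are arbitrary. */
void bsend_example(MPI_Comm comm, double *v1, double *v2)
{
    int s1, s2, size;
    char *buffer;

    MPI_Pack_size(20, MPI_DOUBLE, comm, &s1);
    MPI_Pack_size(40, MPI_DOUBLE, comm, &s2);
    size = s1 + s2 + 2 * MPI_BSEND_OVERHEAD;

    buffer = (char *)malloc(size);
    MPI_Buffer_attach(buffer, size);

    MPI_Bsend(v1, 20, MPI_DOUBLE, 1 /* dest */, 0 /* tag */, comm);
    MPI_Bsend(v2, 40, MPI_DOUBLE, 1 /* dest */, 1 /* tag */, comm);

    /* Detach blocks until the buffered messages have actually been sent. */
    MPI_Buffer_detach(&buffer, &size);
    free(buffer);
}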
MPI_Buffer_attach
Attaches a user-provided buffer for sending
Synopsis
int MPI_Buffer_attach(void *buffer, int size)
Input Parameters
buffer
initial buffer address (choice)
size
buffer size, in bytes (integer)
Notes
The size given should be the sum of the sizes of all outstanding
Bsends that you intend to have, plus MPI_BSEND_OVERHEAD for each Bsend
that you do. For the purposes of calculating size, you should use
MPI_Pack_size. In other words, in the code
MPI_Buffer_attach( buffer, size );
MPI_Bsend( ..., count=20, datatype=type1, ... );
...
MPI_Bsend( ..., count=40, datatype=type2, ... );
the value of size in the MPI_Buffer_attach call should be greater than the value computed by
MPI_Pack_size( 20, type1, comm, &s1 );
MPI_Pack_size( 40, type2, comm, &s2 );
size = s1 + s2 + 2 * MPI_BSEND_OVERHEAD;
MPI_BSEND_OVERHEAD gives the maximum amount of buffer space that the BSEND routines themselves may consume per message. This value is defined in mpi.h (for C) and mpif.h (for Fortran).
Thread and Interrupt Safety
The user is responsible for ensuring that multiple threads do not try to update the same MPI object from different threads. This routine should not be used from within a signal handler.
The MPI standard defined a thread-safe interface but this does not mean that all routines may be called without any thread locks. For example, two threads must not attempt to change the contents of the same MPI_Info object concurrently. The user is responsible in this case for using some mechanism, such as thread locks, to ensure that only one thread at a time makes use of this routine. Because the buffer for buffered sends (e.g., MPI_Bsend) is shared by all threads in a process, the user is responsible for ensuring that only one thread at a time calls this routine or MPI_Buffer_detach.
Notes for Fortran
All MPI routines in Fortran (except for MPI_WTIME and MPI_WTICK) have an additional argument ierr at the end of the argument list. ierr is an integer and has the same meaning as the return value of the routine in C. In Fortran, MPI routines are subroutines, and are invoked with the call statement.
All MPI objects (e.g., MPI_Datatype, MPI_Comm) are of type INTEGER in Fortran.
Errors
All MPI routines (except MPI_Wtime and MPI_Wtick) return an error value; C routines as the value of the function and Fortran routines in the last argument. Before the value is returned, the current MPI error handler is called. By default, this error handler aborts the MPI job. The error handler may be changed with MPI_Comm_set_errhandler (for communicators), MPI_File_set_errhandler (for files), and MPI_Win_set_errhandler (for RMA windows). The MPI-1 routine MPI_Errhandler_set may be used but its use is deprecated. The predefined error handler MPI_ERRORS_RETURN may be used to cause error values to be returned. Note that MPI does not guarantee that an MPI program can continue past an error; however, MPI implementations will attempt to continue whenever possible.
MPI_SUCCESS
No error; MPI routine completed successfully.
MPI_ERR_BUFFER
Invalid buffer pointer. Usually a null buffer where one is not valid.
MPI_ERR_INTERN
An internal error has been detected. This is fatal. Please send a bug report to mpi-bugs@mcs.anl.gov.
See Also MPI_Buffer_detach, MPI_Bsend
Refer Here For More
Buffer allocation and usage
Programming with MPI
MPI - Bsend usage

Deadlock: will order of resource return have any potential issue?

// down = acquire the resource
// up   = release the resource
typedef int semaphore;
semaphore resource_1;
semaphore resource_2;

void process_A(void) {
    down(&resource_1);
    down(&resource_2);
    use_both_resources();
    up(&resource_2);
    up(&resource_1);
}
If the resources are instead released in the same order as they were acquired, i.e.,
void process_A(void) {
    down(&resource_1);
    down(&resource_2);
    use_both_resources();
    up(&resource_1);
    up(&resource_2);
}
would that cause any potential problem?
Thanks for any explanation!
The important part is whether you are taking the locks in the same order in different threads or not.
The order of release has no effect; nothing stops the program from releasing the second lock after the first one has been released (unless you're taking new locks in between, but then you're back at the first case: taking the locks in the correct order).
If you have two functions that try to take the same two locks, in different orders, they could grab one lock each, and wait forever for the other one to release their lock. Example code:
down(first_lock)
down(second_lock)
running concurrently with
down(second_lock)
down(first_lock)
they could both take their first lock before either of them takes its second lock, and then they'll deadlock.
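A minimal sketch of the same point with POSIX mutexes (the names are illustrative): as long as every thread acquires the two locks in the same order, no deadlock can occur, and the release order is irrelevant.

#include <pthread.h>

pthread_mutex_t resource_1 = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t resource_2 = PTHREAD_MUTEX_INITIALIZER;

/* Every thread runs this: the acquisition order is always 1 then 2. */
void *worker(void *arg)
{
    pthread_mutex_lock(&resource_1);
    pthread_mutex_lock(&resource_2);
    /* use_both_resources(); */
    pthread_mutex_unlock(&resource_1);  /* same order as acquisition ...   */
    pthread_mutex_unlock(&resource_2);  /* ... or reversed: both are fine. */
    return arg;
}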

qt connectNotify usage

I have a Qt class instance, called C (C inherits QObject), that sends a signal S.
In my program, instances of other Qt classes, X, are created and destroyed while the program runs. These other classes connect to and disconnect from S, i.e. they run:
connect(C,SIGNAL(S()), this, SLOT(my_func())); // <this> is an instance of X
or
disconnect(C,SIGNAL(S()), this, SLOT(my_func()));
In class C, the calculation of whether S should be emitted (and of the data associated with it, not shown here) is rather complicated, so I would like the instance of class C (which emits the signal) to be notified when one or more objects are connected (listening) to S, and when all of them have disconnected.
I have read about the connectNotify and disconnectNotify functions, but their usage is discouraged. Besides, the documentation does not state very clearly whether there is a one-to-one relationship between the number of (dis)connectNotify calls and the number of "listeners" of the signal (or can a single connectNotify call be made for more than one listener?).
Can I simply increment a counter (count++) on each connectNotify, decrement it (count--) on each disconnectNotify, and react to a non-zero value?
Any better way to do this?
First, I think you've got it right that connectNotify and disconnectNotify can be used for this purpose - each connect event will be counted properly, even if it is a duplicate from the same object.
You can also double check this with QObject::receivers
int QObject::receivers ( const char * signal ) const [protected]
Returns the number of receivers connected to the signal. Since both slots and signals can be used as receivers for signals, and the same connections can be made many times, the number of receivers is the same as the number of connections made from this signal. When calling this function, you can use the SIGNAL() macro to pass a specific signal:
if (receivers(SIGNAL(valueChanged(QByteArray))) > 0) {
    QByteArray data;
    get_the_value(&data); // expensive operation
    emit valueChanged(data);
}
As the code snippet above illustrates, you can use this function to avoid emitting a signal that nobody listens to.
Warning: This function violates the object-oriented principle of modularity. However, it might be useful when you need to perform expensive initialization only if something is connected to a signal.
My suggestion would be to write a simple test program. Override connectNotify and disconnectNotify to increment/decrement a counter, but also use receivers to verify that the counter is correct. Try connecting multiple times, disconnecting multiple times, disconnecting even if there is no connection, etc.
Something to be careful of: connect and disconnect are thread-safe; I'm not sure if the matching Notify functions are safe also.
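For what it's worth, here is a minimal sketch of that counting approach with Qt 5 signatures (the class and signal names mirror the question, everything else is illustrative; note that a wildcard disconnect() reports an invalid QMetaMethod, so the counter is only exact for ordinary per-signal disconnects):

#include <QObject>
#include <QMetaMethod>

class C : public QObject
{
    Q_OBJECT
public:
    int listeners() const { return m_listeners; }

signals:
    void S();

protected:
    void connectNotify(const QMetaMethod &signal) override
    {
        if (signal == QMetaMethod::fromSignal(&C::S))
            ++m_listeners;   // called once per connect(), duplicates included
    }
    void disconnectNotify(const QMetaMethod &signal) override
    {
        if (signal == QMetaMethod::fromSignal(&C::S))
            --m_listeners;
    }

private:
    int m_listeners = 0;
};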
Since Qt 5.0, you can do this more easily with the QObject::isSignalConnected function. Example from the documentation:
static const QMetaMethod valueChangedSignal = QMetaMethod::fromSignal(&MyObject::valueChanged);
if (isSignalConnected(valueChangedSignal)) {
    QByteArray data;
    data = get_the_value(); // expensive operation
    emit valueChanged(data);
}

Confusion over the time taken by write() and read() sys calls

The below code simply calculates the time taken to write a file.
#include <fcntl.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    int fp;
    time_t a, b;
    char *str = "Life is like that only";

    fp = open("tmp.txt", O_WRONLY | O_CREAT, 0666);
    time(&a);
    write(fp, str, strlen(str));
    time(&b);
    /* (b-a) should be the time taken to write
     * the file tmp.txt.
     */
    close(fp);
    return 0;
}
My question is: if we have a single CPU, will the measured time (b-a) be exact, or can it be affected by other processes running in parallel?
Some posts here mention that write() and read() can be treated almost as atomic syscalls: if they are not successful, EINTR is set, which simply means to try again. But does that still mean that, while such a call is executing successfully, all other processes are on hold?
Other processes (that are not using I/O or that are using I/O on different devices) can run while your process is waiting for the write to complete, and your process may not immediately get the CPU back after it completes.
In practice, for a small write to a regular file, your write() will probably return immediately after copying your data into a kernel-space buffer, rather than waiting for it to go all the way to the disk.
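If the goal is to measure how long the data takes to reach the disk rather than the kernel's page cache, the write has to be forced out (e.g. with fsync()), and time()'s one-second resolution is too coarse for a 23-byte write anyway. A hedged sketch of such a measurement, not taken from the original post:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const char *str = "Life is like that only";
    int fd = open("tmp.txt", O_WRONLY | O_CREAT, 0666);
    struct timespec a, b;

    clock_gettime(CLOCK_MONOTONIC, &a);
    write(fd, str, strlen(str));
    fsync(fd);                              /* wait until the data reaches the disk */
    clock_gettime(CLOCK_MONOTONIC, &b);

    printf("%ld ns\n",
           (b.tv_sec - a.tv_sec) * 1000000000L + (b.tv_nsec - a.tv_nsec));
    close(fd);
    return 0;
}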
