I have some MPI processes which should write to the same file after they finish their task. The problem is that the length of the results is variable and I cannot assume that each process will write at a certain offset.
A possible approach would be to open the file in every process, to write the output at the end and then to close the file. But this way a race condition could occur.
How can I open and write to that file so that the result would be the expected one?
You might think you want the shared file or ordered mode routines. But these routines get little use and so are not well optimized (so they get little use... quite the cycle...)
I hope you intend on doing this collectively. then you can use MPI_SCAN to collect the offsets, then call MPI_FILE_WRITE_AT_ALL to have the MPI library optimize the I/O for you.
(If you are doing this independently, then you will have to do something like... master slave? passing a token? fall back to the shared file pointer routines even though I hate them?)
Here's an approach for a good collective method:
incr = (count*datatype_size);
/* you can skip this call and assume 'offset' is zero if you don't care
about the contents of the file */
MPI_File_get_position(mpi_fh, &offset);
MPI_Scan(&incr, &new_offset, 1, MPI_LONG_LONG_INT,
MPI_SUM, MPI_COMM_WORLD);
new_offset -= incr;
new_offset += offset;
ret = MPI_File_write_at_all(mpi_fh, new_offset, buf, count,
datatype, status);
Related
Trying to roll my own code for STM32F4 UART.
A peculiarity of this chip is that if you use byte addressing as the GNAT compiler does when setting a single bit, the corresponding bit in the other byte of the half word is set. The data sheet says use half word addressing. Is there a way to tell the compiler to do this? I tried
for CR1_register'Size use 16;
but this had no effect. Writing the whole 16 bit word works, but you lose the ability to set named bits.
The GNAT way to do this, as used in the AdaCore Ada Drivers Library, is to use the GNAT-only aspect Volatile_Full_Access, about which the GNAT Reference Manual says
This is similar in effect to pragma Volatile, except that any reference to the object is guaranteed to be done only with instructions that read or write all the bits of the object. Furthermore, if the object is of a composite type, then any reference to a subcomponent of the object is guaranteed to read and/or write all the bits of the object.
The intention is that this be suitable for use with memory-mapped I/O devices on some machines. Note that there are two important respects in which this is different from pragma Atomic. First a reference to a Volatile_Full_Access object is not a sequential action in the RM 9.10 sense and, therefore, does not create a synchronization point. Second, in the case of pragma Atomic, there is no guarantee that all the bits will be accessed if the reference is not to the whole object; the compiler is allowed (and generally will) access only part of the object in this case.
Their code is
-- Control register 1
type CR1_Register is record
-- Send break
SBK : Boolean := False;
...
end record
with Volatile_Full_Access, Size => 32,
Bit_Order => System.Low_Order_First;
for CR1_Register use record
SBK at 0 range 0 .. 0;
...
end record;
Portable way is to do this explicitly: read whole record, modify, then write it back. As long as it is declared Volatile a compiler will not optimize reads and writes out.
-- excerpt from my working code --
declare
R : Control_Register_1 := Module.CR1;
begin
R.UE := True;
Module.CR1 := R;
end;
This is very verbose, but it does its work.
I have a program utilizing OpenCL 2.0 because I want to take advantage of device-side enqueue. I have a test program that performs the following tasks on the host side:
Allocates 16 kilobytes of floating point memory on the device and zeros it out.
Builds the OpenCL program below, and creates a kernel of masterKernel()
Sets the first argument of masterKernel() (heap) to the allocated memory in step 1
Enqueues that masterKernel() via clEnqueueNDRangeKernel() with a work_dim of 1 and a global work size of 1. (So it only runs once, with get_global_id(0) always being zero)
Reads the memory back into the host and displays it.
Here is the OpenCL code:
//This function was stripped down to nothing for testing purposes.
kernel void childKernel(global float* heap)
{
}
//Enqueues the child kernel.
kernel void masterKernel(global float* heap)
{
ndrange_t ndRange = ndrange_1D(16); //Arbitrary, could be any number.
if(get_global_id(0) == 0)
{
enqueue_kernel(get_default_queue(), 0, ndRange,
^{ childKernel(heap); });
}
}
The program builds successfully. However, when I try to run masterKernel(), The call to enqueue_kernel() here causes the host side call to clEnqueueNDRangeKernel() to fail with an error code of CL_OUT_OF_RESOURCES. OpenCL's documentation says enqueue_kernel() should return CL_SUCCESS or CL_ENQUEUE_FAILURE depending on if the block enqueues successfully or not. It does not say that clEnqueueNDRangeKernel() itself should fail. Here are some other things I've tried:
Commenting out the call to enqueue_kernel() causes the program to succeed.
Adding a line that sets heap[0] to any number causes the host-side program to reflect that change. So I know that it's not a problem with how I'm feeding the arguments in
Modifying the if statement so that it reads something impossible like if(get_global_id(0) == 6000) still causes the error. This tells me that the error is not caused by enqueue_kernel() executing (I verified get_global_size(0) == 1), but merely that it exists in the program at all.
Modifying the if statement to if(0) does make the error not happen.
Making it so childKernel() actually does something does not make the error go away.
I am not really sure what to try next. I know my device supports OpenCL 2.0. My device is an AMD Radeon R9 380 graphics card. I do not have access to any other OpenCL 2.0 capable hardware to test it on.
I ended up figuring this one out. This issue happened because I did not create a device-side queue (one with the flags of CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_ON_DEVICE | CL_QUEUE_ON_DEVICE_DEFAULT).
As far as I know, MPI_BUFFER_ATTACH must be called by a process if it is going to do buffered communication. But does this include the standard MPI_SEND as well? We know that MPI_SEND may behave either as a synchronous send or as a buffered send.
You need to call MPI_Buffer_attach() only if you plan to perform (explicitly) buffered sends via MPI_Bsend().
If you only plan to MPI_Send() or MPI_Isend(), then you do not need to invoke MPI_Buffer_attach().
FWIW, buffered sends are error prone and I strongly encourage you not to use them.
MPI_Buffer_attach
Attaches a user-provided buffer for sending
Synopsis
int MPI_Buffer_attach(void *buffer, int size)
Input Parameters
buffer
initial buffer address (choice)
size
buffer size, in bytes (integer)
Notes
The size given should be the sum of the sizes of all outstanding
Bsends that you intend to have, plus MPI_BSEND_OVERHEAD for each Bsend
that you do. For the purposes of calculating size, you should use
MPI_Pack_size. In other words, in the code
MPI_Buffer_attach( buffer, size );
MPI_Bsend( ..., count=20, datatype=type1, ... );
...
MPI_Bsend( ..., count=40, datatype=type2, ... );
the value of size in the MPI_Buffer_attach call should be greater than the value computed by
MPI_Pack_size( 20, type1, comm, &s1 );
MPI_Pack_size( 40, type2, comm, &s2 );
size = s1 + s2 + 2 * MPI_BSEND_OVERHEAD;
The MPI_BSEND_OVERHEAD gives the maximum amount of space that may be used in the buffer for use by the BSEND routines in using the buffer. This value is in mpi.h (for C) and mpif.h (for Fortran).
Thread and Interrupt Safety
The user is responsible for ensuring that multiple threads do not try to update the same MPI object from different threads. This routine should not be used from within a signal handler.
The MPI standard defined a thread-safe interface but this does not mean that all routines may be called without any thread locks. For example, two threads must not attempt to change the contents of the same MPI_Info object concurrently. The user is responsible in this case for using some mechanism, such as thread locks, to ensure that only one thread at a time makes use of this routine. Because the buffer for buffered sends (e.g., MPI_Bsend) is shared by all threads in a process, the user is responsible for ensuring that only one thread at a time calls this routine or MPI_Buffer_detach.
Notes for Fortran
All MPI routines in Fortran (except for MPI_WTIME and MPI_WTICK) have an additional argument ierr at the end of the argument list. ierr is an integer and has the same meaning as the return value of the routine in C. In Fortran, MPI routines are subroutines, and are invoked with the call statement.
All MPI objects (e.g., MPI_Datatype, MPI_Comm) are of type INTEGER in Fortran.
Errors
All MPI routines (except MPI_Wtime and MPI_Wtick) return an error value; C routines as the value of the function and Fortran routines in the last argument. Before the value is returned, the current MPI error handler is called. By default, this error handler aborts the MPI job. The error handler may be changed with MPI_Comm_set_errhandler (for communicators), MPI_File_set_errhandler (for files), and MPI_Win_set_errhandler (for RMA windows). The MPI-1 routine MPI_Errhandler_set may be used but its use is deprecated. The predefined error handler MPI_ERRORS_RETURN may be used to cause error values to be returned. Note that MPI does not guarentee that an MPI program can continue past an error; however, MPI implementations will attempt to continue whenever possible.
MPI_SUCCESS
No error; MPI routine completed successfully.
MPI_ERR_BUFFER
Invalid buffer pointer. Usually a null buffer where one is not valid.
MPI_ERR_INTERN
An internal error has been detected. This is fatal. Please send a bug report to mpi-bugs#mcs.anl.gov.
See Also MPI_Buffer_detach, MPI_Bsend
Refer Here For More
Buffer allocation and usage
Programming with MPI
MPI - Bsend usage
Let's say I'm implementing the kill program in Go. I can accept numeric signals and PIDs from the commandline and send them to syscall.Kill no problem.
However, I don't know how to implement the "string" form of signal dispatch, e.g. kill -INT 12345.
The real use case is a part of a larger program that prompts the user to send kill signals; not a replacement for kill.
Question:
How can I convert valid signal names to signal numbers on any supported platform, at runtime (or at least without writing per-platform code to be run at compile time)?
What I've tried:
Keep a static map of signal names to numbers. This doesn't work in a cross-platform way (for example, different signal lists are returned by kill -l on Mac OSX versus a modern Linux versus an older Linux, for example). The only way to make this solution work in general would be to make maps for every OS, which would require me to know the behavior of every OS, and keep up to date as they add new signal support.
Shell out to the GNU kill tool and capture the signal lists from it. This is inelegant and kind of a paradox, and also requires a) being able to find kill, b) having the ability/permission to exec subprocesses, and c) being able to predict/parse the output of kill-the-binary.
Use the various Signal types' String method. This just returns strings containing the signal number, e.g. os.Signal(4).String() == "signal 4", which is not useful.
Call the private function runtime.signame, which does exactly what I want. go://linkname hacks will work, but I'm assuming that this sort of thing is frowned-upon for a reason.
Ideas/Things I Haven't Tried:
Use CGo somehow. I'd rather not venture into CGO territory for a project that is otherwise not low-level/needful of native integration at all. If that's the only option, I will, but have no idea where to start.
Use templating and code generation to build lists of signals based on external sources at compile time. This is not preferable for the same reasons as CGo.
Reflect and parse the members of syscall that start with SIG somehow. I am told that this is not possible because names are compiled away; is it possible that, for something as fundamental as signal names, there's someplace they're not compiled away?
Commit d455e41 added this feature in March 2019 as sys/unix.SignalNum() and is thus available at least since Go 1.13. More details in GitHub issue #28027.
From the documentation of the golang.org/x/sys/unix package:
func SignalNum(s string) syscall.Signal
SignalNum returns the syscall.Signal for signal named s, or 0 if a signal with such name is not found. The signal name should start with "SIG".
To answer a similar question, "how can I list the names of all available signals (on a given Unix-like platform)", we can use the inverse function sys/unix.SignalName():
import "golang.org/x/sys/unix"
// See https://github.com/golang/go/issues/28027#issuecomment-427377759
// for why looping in range 0,255 is enough.
for i := syscall.Signal(0); i < syscall.Signal(255); i++ {
name := unix.SignalName(i)
// Signal numbers are not guaranteed to be contiguous.
if name != "" {
fmt.Println(name)
}
}
Update some time after I posted the below answer, Golang's stdlib acquired this functionality. An answer describing how to use that functionality was posted by #marco.m and accepted; the below is not recommended unless the version of Go you are using pre-dates the availability of the right tool for the job.
Since no answers were posted, I'll post the less-than-ideal solution I was able to use by "breaking into" a private signal-enumeration function inside Go's standard library.
The signame internal function can get a signal name by number on Unix and Windows. To call it, you have to use the linkname/assembler workaround. Basically, make a file in your project called empty.s or similar, with no contents, and then a function declaration like so:
//go:linkname signame runtime.signame
func signame(sig uint32) string
Then, you can get a list of all signals known by the operating system by calling signame on an increasing number until it doesn't return a value, like so:
signum := uint32(0)
signalmap = make(map[uint32]string)
for len(signame(signum)) > 0 {
words := strings.Fields(signame(signum))
if words[0] == "signal" || ! strings.HasPrefix(words[0], "SIG") {
signalmap[signum] = ""
} else {
// Remove leading SIG and trailing colon.
signalmap[signum] = strings.TrimRight(words[0][3:], ":")
}
signum++
}
After that runs, signalmap will have keys for every signal that can be sent on the current operating system. It will have an empty string where Go doesn't think the OS has a name for the signal (the kill(1) may name some signals that Go won't return names for, I've found, but it's usually the higher-numbered/nonstandard ones), or a string name, e.g. "INT" where a name can be found.
This behavior is undocumented, subject to change, and may not hold true on some platforms. It would be nice if this were made public, though.
The below code simply calculates the time taken to write a file.
#include<time.h>
void main()
{
int fp;
long a,b;
char *str = "Life is like that only";
fp = open("tmp.txt",O_WRONLY,0666);
time(&a);
write(fp,str);
time(&b);
/*(b-a) should be the time taken to write
* the file tmp.txt.
*/
close(fp);
return;
}
My question is that if we have a single CPU then whether the time taken (b-a) would be exact or it can be affected by the execution of other process running parallel.
Some posts here mention that write() and read() can be treated almost like atomic syscalls as if they are not successful EINTR is set that simply means to try again.But still does that mean if it is successful then in the course of its execution all other processes are on hold.
Other processes (that are not using I/O or that are using I/O on different devices) can run while your process is waiting for the write to complete, and your process may not immediately get the CPU back after it completes.
In practice, for a small write to a regular file, your write() will probably return immediately after copying your data into a kernel-space buffer, rather than waiting for it to go all the way to the disk.