Inconsistent behavior from MPI_Gather

I am running an MPI C++ program locally with two processes: mpirun -np 2 <programname>. I am seeing inconsistent behavior from the MPI_Gather call. To test it, I wrote a very short code snippet. When I copied the snippet to the start of main it worked fine, but when I copied it to other points in the code it sometimes gives the correct result and sometimes not. The snippet is copied below. I doubt the issue is with the snippet itself, since it sometimes works properly. Typically, when I see inconsistent behavior like this, I suspect memory corruption. However, I have run Valgrind in this case and it did not report anything amiss (although maybe I am not running Valgrind correctly for MPI - I am not experienced using Valgrind on MPI programs). What could be causing this kind of inconsistent behavior, and what can I do to detect the problem?
Here is the code snippet.
double val[2] = {0, 1};
val[0] += 10.0*double(gmpirank);
val[1] += 10.0*double(gmpirank);
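// at the root, recv must hold recvcount * nprocs doubles: 2 per rank x 2 ranks = 4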
double recv[4];
printdebug("send", val[0],val[1]);
int err = MPI_Gather(val,2,MPI_DOUBLE,recv,2,MPI_DOUBLE,0,MPI_COMM_WORLD);
if (gmpirank == 0) {
    printdebug("recv");
    printdebug(recv[0], recv[1]);
    printdebug(recv[2], recv[3]);
}
printdebug("finished test", err);
The printdebug function writes to a file (a separate file for each process) and separates its arguments with commas.
Process 1 prints:
send, 10, 11
finished test, 0
Sometimes, Process 0 prints:
send, 0, 1
recv
0, 1
10, 11
finished test, 0
But when I place the snippet in other parts of the code, Process 0 sometimes prints something like this:
send, 0, 1
recv
0, 1
2.9643938750474793e-322, 0
finished test, 0

I found the solution. As suspected, the problem was a memory corruption.
I made a beginner mistake when running Valgrind with MPI. I ran:
valgrind <options> mpirun -np 2 <programname>
instead of
mpirun -np 2 valgrind <options> <programname>
Thus, I was running valgrind on "mpirun" itself, not on the intended program. When I ran Valgrind correctly, it identified the memory corruption in an unrelated part of the code.
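In case it helps someone else: giving each rank its own log file made the reports much easier to read. Something like the following should work, since Valgrind expands %p in --log-file to the process ID (the other option shown is just a typical choice):
mpirun -np 2 valgrind --leak-check=full --log-file=valgrind.%p.log <programname>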
Kudos to another Stack Overflow Q/A for helping me figure this out: Using valgrind to spot error in mpi code

In a UNIX pipeline, how do I get the user-tool interaction of the first stage piped into the next stage?

Page 32 of the book "The UNIX Programming Environment" makes this profound statement about UNIX pipes:
The programs in a pipeline actually run at the same time, not one
after another. This means that the programs in a pipeline can be
interactive; the kernel looks after whatever scheduling and
synchronization is needed to make it all work.
Wow! By using pipes I can get parallel processing for free!
"I've got to illustrate this awesome capability to my colleagues" I thought. I will implement a demo: create a simple interactive tool that mimics what I type in at the command window. I'll name that tool mimic. Use a pipe to connect it to another tool which counts the number of lines and characters. I'll name that tool wc (sadly, I am working on Windows, which doesn't have the UNIX wc program so I must implement my own). The two tools will run on a command line like this:
mimic | wc
The output of mimic is piped into wc.
Well, that's the idea.
I implemented the mimic tool with a very simple Flex lexer, which I show below. Compiling it generated mimic.exe.
When I run mimic.exe from the command line it does indeed mimic what I type:
> mimic
hello world
hello world
greetings
greetings
ctrl-c
I implemented wc using AWK. The AWK program (wc.awk) is shown below. When I run wc from the command line it does indeed count the lines and characters:
> echo Hello World | awk -f wc.awk
lines 1 chars 13
However, when I put them together with a pipe, they don't work as I imagined they would. Here's a sample run:
> mimic | awk -f wc.awk
hello world
greetings
ctrl-c
Hmm, nothing. No mimicking. No line counts. No char counts.
How come it's not working? What am I doing wrong, please?
What can I do to make it work as I expected? I expected it to work like this: I type something in at the command line. mimic repeats it, sending it to the pipe which sends it to wc which reports the number of lines and characters. I type in the next thing at the command line. mimic repeats it, sending it to the pipe which sends it to wc which reports the number of lines and characters. And so forth. That's the behavior I thought I was implementing.
Here is my simple Flex lexer (mimic):
%option noyywrap
%option always-interactive
%%
%%
int main(int argc, char *argv[])
{
    yyin = stdin;
    yylex();
    return 0;
}
Here is my simple AWK program (wc):
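# awk strips the record separator from $0, so add 1 per line to count the newline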
{ nchars = nchars + length($0) + 1 }
END { printf("lines %-10d chars %d\n", NR, nchars) }
This question has little or nothing to do with Flex, Bison, or Awk, and not really that much to do with Unix either (since you're experimenting on Windows).
I don't have Windows handy, but the underlying issue is basically about stdio buffering, so it's reproducible on Unix as well.
To simplify, I only implemented mimic, which I did directly rather than using Flex (which is clearly overkill):
#include <stdio.h>
int main(void) {
    for (int ch; (ch = getchar()) != EOF; ) putchar(ch);
    return 0;
}
Since you use %option always-interactive, which forces Flex to read one character at a time with fgetc(), your lexer makes basically the same sequence of standard library calls as this program, except that I simplified its one-byte fwrite to the equivalent putchar.
It certainly has the same execution characteristic:
$ ./mimic
Here we go round the mulberry busy,
Here we go round the mulberry busy,
the mulberry bush, the mulberry bush.
the mulberry bush, the mulberry bush.
In the above, I signalled end-of-input by typing Ctrl-D for the third line. On Windows, I would have had to type Ctrl-Z followed by Enter to get the same effect. If I kill the execution by typing Ctrl-C instead, I get roughly the same result (other than the fact that the Ctrl-C shows up in the console):
$ ./mimic
Here we go round the mulberry bush,
Here we go round the mulberry bush,
the mulberry bush, the mulberry bush.
the mulberry bush, the mulberry bush.
^C
Now, since mimic just copies stdin to stdout, I might expect to be able to pipe it into itself and get the same result. But the output is a little different:
$ ./mimic | ./mimic
Here we go round the mulberry bush,
the mulberry bush, the mulberry bush.
Here we go round the mulberry bush,
the mulberry bush, the mulberry bush.
Again, I signaled end of input by typing Ctrl-D. And it was only after I typed the Ctrl-D that any output appeared; at that point, both lines were echoed. And if I terminate the program abruptly with Ctrl-C, I don't get any output at all:
$ ./mimic | ./mimic
Here we go round the mulberry bush,
the mulberry bush, the mulberry bush.
^C
OK, that's the data. Now, we need an explanation. And the explanation has to do with the way C standard library streams are buffered, which you can read about in the setvbuf manpage (and in many places on the web). I'll summarise:
The C standard specifies that all input functions execute "as if" implemented by repeated calls to fgetc, and all output functions "as if" implemented by repeated calls to fputc. Note that this means that there is nothing special about the chunk of bytes written by a single printf or fwrite. The library does not do anything to ensure that the sequence of calls is atomic. If you have two processes both writing to stdout, the messages can get interleaved, and that will happen from time to time.
The data written by fputc (and, consequently, all stdio output functions) is actually placed into an output buffer. This buffer is not part of the Operating System (which may add another layer of buffering). It's strictly part of the stdio library functions, which are ordinary userland functions. You could write them yourself (and it's a pretty good exercise to do so). From time to time, the contents of this buffer are sent to the appropriate operating system interface in order to be transferred to the output device.
"From time to time" is deliberately unspecific. There are three standard buffering modes, although the standard doesn't require them all to be used by a particular library implementation, and it also doesn't restrict the library implementation from using different buffering modes (although the three specified modes do basically cover the useful possibilities). However, most C library implementations you're likely to use do implement all three, pretty well as described in the standard. So take the following as a description of a common implementation technique, but be aware that on certain idiosyncratic platforms, it might not be accurate.
The three buffering modes are:
Unbuffered. In this mode, each byte written is transferred to the operating system (and, it is hoped, to the actual output device) as soon as possible.
Fully buffered. In this mode, there is a buffer of a predetermined size (often 8 kilobytes, but different library implementations have different defaults for different platforms). If you want to, you can supply your own buffer (of an arbitrary size) for a particular output stream, using the setvbuf standard library function (q.v.). Fully-buffered output might stay in the output buffer until it is full (although a given implementation may release output earlier). The buffer will, however, be sent to the operating system if you call fflush or fclose on the stream, or if fclose is called automatically when main returns. (It's not sent if the process dies, though.)
Line-buffered. In this mode, the stream again has an output buffer of a predetermined size, which is usually exactly the same as the buffer used in "fully-buffered" mode. The difference is that the buffer is also sent to the operating system when a new line character ('\n') is written. (If the buffer gets full before an end-of-line character is written, then it is sent to the operating system, just as in Fully-Buffered mode. But most of the time, lines will be fairly short, so the buffer will be sent to the OS at the end of each line.)
Finally, you need to know that stdout is fully-buffered by default unless the standard library can determine that stdout is connected to some kind of console device, in which case it is not fully-buffered. (On Unix, it is typically line-buffered.) By contrast, stderr is not fully-buffered. (On Unix, it is typically unbuffered.) You can change the buffering mode, and the buffer size if relevant, by calling setvbuf before the first write operation to the stream.
The above-mentioned defaults are one of the reasons you are encouraged to write error messages to stderr rather than stdout: since stderr is unbuffered, the error message will appear as soon as possible. Also, you should normally put \n at the end of each output line; if standard output is a terminal and therefore line-buffered, the \n will ensure that the line is actually output, rather than languishing in the output buffer.
You can see all of this in action in the above examples. When I just ran ./mimic, leaving stdout mapped to the terminal, the output showed up each time I entered a line. (That also has to do with the way terminal input is handled by the terminal driver, which is another kettle of fish.)
But when I piped mimic into itself, the first mimic's standard output was redirected to a pipe. A pipe is not a terminal, so that mimic's stdout is fully-buffered by default. Since the buffer is longer than the total input, the entire program runs without sending anything to stdout, until the buffer is flushed when stdout is implicitly closed by main returning.
Moreover, if I kill the process (by typing Ctrl-C, for example, or by sending it a SIGKILL signal), then the output buffer is never sent to the operating system, and nothing appears on the console.
If you're writing console apps using standard C library calls, it's very important to understand how stdio output buffering affects the sequence of outputs you see. That's just as true on Windows as on Unix. (Of course, it doesn't apply if you use native Windows or Posix I/O interfaces.)
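If you want mimic | wc to behave the way you originally imagined, the usual cure is to make mimic flush its own stdout at each newline, or force line buffering, rather than relying on the defaults. Here is a minimal sketch in plain C, assuming (as above) that mimic just echoes its input; either option on its own is enough:
#include <stdio.h>

int main(void) {
    /* Option 1: request line buffering up front, before the first write. */
    setvbuf(stdout, NULL, _IOLBF, BUFSIZ);

    for (int ch; (ch = getchar()) != EOF; ) {
        putchar(ch);
        /* Option 2: flush explicitly at the end of every line. */
        if (ch == '\n')
            fflush(stdout);
    }
    return 0;
}
Note that your wc.awk as written still prints nothing until its END block runs, i.e. until mimic closes the pipe, so to get a running count after every line you would also need to make the AWK side print per line rather than only at END.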

why "netstat -a" do not exit immediately but "netstat -n" does?

I have checked the documented function of "-n":
"Displays active TCP connections, however, addresses and port numbers are expressed numerically and no attempt is made to determine names."
But I can't see why "-n" would make netstat exit immediately.
From a quick check, I don't see the same description for the "-n" option as you do, and it doesn't make netstat run continuously.
As you didn't specify the version and exact command you are using, I tried both the version that comes with RH7.6 (net-tools 2.10-alpha) and the latest from source code (net-tools 3.14-alpha). The net-tools source code can be found in github [1].
As I couldn't find the exact option you describe, I tried all flags (without combinations) that don't require an argument. As far as I can tell, the only options that cause netstat to not exit immediately are '-g' and '-c'. '-c' makes sense, as it is the flag for running netstat continuously. For '-g' it is less obvious: the seemingly continuous behavior comes from reading the /proc/net/igmp and /proc/net/igmp6 files line by line. The first file is read quickly, but the igmp6 file takes much longer (about 1 line per second). So the '-g' option isn't really continuous; it just takes a long time to finish.
From the code, the only construct that causes continuous execution is this (it appears 4 times in the source):
if (i || !flag_cnt)
break;
wait_continous();
'i' is a return code from a function and the 'break' command is to break from an infinite for loop, so basically the code will run continuously only if flag_cnt is set (only happens when '-c' is provided) and there were no errors with previous commands.
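For context, the surrounding control flow looks roughly like this (a simplified sketch; do_one_pass is a stand-in for the various per-protocol dump functions in the real source):
for (;;) {
    i = do_one_pass();      /* non-zero return means an error occurred */
    if (i || !flag_cnt)     /* stop on error, or when -c was not given */
        break;
    wait_continous();       /* pause, then repeat for continuous mode */
}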
For the specific issue above there could be a few reasons:
The option involves reading from a file and it takes very long time to finish, but it is not really continuous.
There's a correlation between the given option and flag_cnt, which causes flag_cnt to be set.
There's a call to wait_continous() which doesn't follow the condition above.
As I said, I couldn't reproduce the issue in the original question, nor could I find any flag with the description above. Also, none of the flags besides '-c' caused netstat to run continuously.
If you still want to figure this out, I suggest you take a look at the source of the netstat you are running, or at least specify the net-tools version you use. The kernel version is also important, as some code may be compiled out due to missing kernel support.
[1] https://github.com/ecki/net-tools

Instrumenting R programs using intel-pin

I am instrumenting R programs using the pinatrace.so tool to generate a trace of read and write memory instructions. What I observe is that multiple #eof statements get printed in the trace file at different places (they should actually appear only at the end of the trace). Also, the line immediately after each #eof is distorted and not printed properly.
I am invoking R shell and my R program using the following command:
../../../pin -follow_execv -t obj-intel64/pinatrace.so -- /home/R-3.5.3./bin/R -f hello.R
The trace file gets printed as shown:
0 0x7ffc812cd1c8
1 0x7ffc812cd1c8
0 0x7f7f8555ee78
#eof
f6971ce8
1 0x6f4518
0 0x7ffc171a0b70
.....
.....
1 0x7ffc6da8f078
0 0x7f7c38786e78
#eof
ffc171a07c8
0 0x6f4e30
0 0x6ff918
What is wrong with this instrumentation?
When Pin is invoked with the follow_execv knob, it will create a new copy of itself in every child process that is created. The new copy is not aware that another copy is running in the parent or at all. See here:
If -follow_execv is enabled and the user has not registered to get a notification, Pin will be injected into child/exec-ed process with the same command line as of current process.
If the Pintool wasn't created with -follow_execv in mind, all copies of the Pintool will normally write to the same file. This creates strange artifacts such as what you're seeing: different processes write to the same file, and one writes its #eof terminator while other processes are still writing after it.
The simplest solution is to add a PID suffix to the file name; another option is to use the Follow Child Process API (linked above) to determine which subprocess is the actual R program you want to trace. Finally, R may have its own support for instrumentation, which you could use to instrument the program directly.
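A rough sketch of the PID-suffix idea (the helper name and file-name pattern here are made up; the stock pinatrace example writes to one fixed file, and if memory serves the Pin API also offers PIN_GetPid() in place of the POSIX call used below):
#include <stdio.h>
#include <unistd.h>

/* Open a per-process trace file so that the copies of the tool injected
   into child processes by -follow_execv don't clobber each other. */
static FILE *open_trace_file(void)
{
    char name[64];
    snprintf(name, sizeof name, "pinatrace.%ld.out", (long)getpid());
    return fopen(name, "w");
}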

Issue in executing a batch file using PeopleCode in Application engine program

I want to execute a batch file using PeopleCode in an Application Engine program, but I have an issue: Exec returns a non-zero code (Value - 1).
Below is the PeopleCode snippet.
Global File &FileLog;
Global string &LogFileName, &Servername, &commandline;
Local string &Footer;
If &Servername = "PSNT" Then
&ScriptName = "D: && D:\psoft\PT854\appserv\prcs\RNBatchFile.bat";
End-If;
&commandline = &ScriptName;
/* Need to commit work or Exec will fail */
CommitWork();
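/* With %Exec_Synchronous, Exec waits for the command to finish and returns its exit code rather than a process ID */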
&ExitCode = Exec("cmd.exe /c " | &commandline, %Exec_Synchronous + %FilePath_Absolute);
If &ExitCode <> 0 Then
MessageBox(0, "", 0, 0, ("Batch File Call Failed! Exit code returned by script was " | &ExitCode));
End-If;
Any help on how to resolve this issue would be appreciated.
Best bet is to do a trace of the execution.
Thoughts:
Can you log on to the process scheduler you are running this on and execute the script OK?
Is the AE being scheduled or called at run-time?
You should not need to change directory as you are using a fully qualified path to the script.
You should not need to call "cmd /c", as this will create an additional shell for your application to run within, making debugging harder, etc.
Run a trace, and drop us the output. :) HTH
What about changing the working directory to D: inside of the script instead? You are invoking two commands and I'm wondering what the shell is returning to exec. I'm assuming you wrote your script to give the appropriate return code and that isn't the problem.
I couldn't tell from the question text, but are you looking for a negative result, such as -1? I think return codes are usually positive: 0 for success, some other positive number for failure. Negative numbers may be acceptable, but I am wondering if Exec doesn't like negative numbers.
Perhaps the PeopleCode ChDir function still works as an alternative to two commands in one line? I haven't tried it for a LONG time.
Another alternative that gives you significant control over the process is to use java.lang.Runtime.exec from PeopleCode: http://jjmpsj.blogspot.com/2010/02/exec-processes-while-controlling-stdin.html.

Prefix Sum with global memory and an error with local memory

I have a Mali GPU which does not support local memory at all.
Every time I run code that uses local memory, I get errors from the device.
So I want to convert my code to a version that only uses global memory.
I was wondering whether it is possible to run a prefix sum / parallel reduction algorithm on the GPU using global memory only.
EDIT: I was debugging the error and found something strange: one particular line is giving the error.
I have code like this:
#define LOG_LSIZE 8
#define LSIZE_SHIFT_VALUE 4
#define LOG_NUM_BANKS 2
#define GET_CONFLICT_OFFSET(lid) ((lid) >> LOG_NUM_BANKS)
#define LSIZE 32
__local int lm_sum[2][LSIZE + LOG_LSIZE]
lm_sum[lid >> LSIZE_SHIFT_VALUE][bi] += lm_sum[lid >> LSIZE_SHIFT_VALUE][ai]
lid is the local id and I used a work-group size of 32. I found that the last line above (the lm_sum += statement) is the cause of the error. I tried using fixed values and found that I cannot use lm_sum on the right-hand side of a statement. If I do, it gives me an error. For example, this line also gives me an error:
int temp = lm_sum[0][0];
Any idea on what is going on?
Error:
In initial.cpp:
[14100.684249] Mali<ERROR, BASE_MMU>: In file: /home/jbmaster/work/01.LPD_OpenCL_RFS/01.arm_work_3_0_31/SEC_All_EVT0_TX013-BU-00001-r2p0-00rel0/TX013-BU-00001-r2p0-00rel0/driver/product/kernel/drivers/gpu/arm/t6xx/kbase/src/common/mali_kbase_mmu.c line: 1240 function:kbase_mmu_report_fault_and_kill
[14100.709724] Unhandled Page fault in AS0 at VA 0x00000002000EC1A0
[14100.709728] raw fault status 0x500003C3
[14100.709730] decoded fault status: SLAVE FAULT
[14100.709733] exception type 0xC3: TRANSLATION_FAULT
[14100.709736] access type 0x3: WRITE
[14100.709738] source id 0x5000
[14100.734958]
[14100.736432] Mali<ERROR, BASE_JD>: In file: /home/jbmaster/work/01.LPD_OpenCL_RFS/01.arm_work_3_0_31/SEC_All_EVT0_TX013-BU-00001-r2p0-00rel0/TX013-BU-00001-r2p0-00rel0/driver/product/kernel/drivers/gpu/arm/t6xx/kbase/src/common/mali_kbase_jm.c line: 899 function:kbase_job_slot_hardstop
[14100.761458] Issueing GPU soft-reset instead of hard stopping job due to a hardware issue
[14100.769517]
You said your GPU doesn't support local memory, yet lm_sum is declared in local memory (__local int lm_sum[2][LSIZE + LOG_LSIZE]). Since even reading lm_sum[0][0] faults, the array evidently never gets allocated, which is why any statement that touches it fails.
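As for the original question: within a single work-group a prefix sum can indeed be done with global memory only, because barrier() can also fence global memory. Here is a rough OpenCL C sketch (a hypothetical Hillis-Steele inclusive scan, assuming the kernel is launched as one work-group of at least n work-items and that the host supplies a scratch buffer tmp of the same size as data):
__kernel void scan_global(__global int *data, __global int *tmp, const int n)
{
    int lid = get_local_id(0);
    __global int *in = data;
    __global int *out = tmp;

    for (int offset = 1; offset < n; offset <<= 1) {
        if (lid < n) {
            int v = in[lid];
            if (lid >= offset)
                v += in[lid - offset];   /* add the element offset places back */
            out[lid] = v;
        }
        barrier(CLK_GLOBAL_MEM_FENCE);   /* make this pass visible to the group */
        __global int *t = in; in = out; out = t;   /* ping-pong the buffers */
    }
    if (lid < n)
        data[lid] = in[lid];             /* leave the result in data[] */
}
Scanning across many work-groups needs an extra step (scan each block, then scan the block sums and add them back in), but the per-block kernel above shows that nothing in the algorithm itself requires __local memory.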
