I have code that works perfectly in serial, but with mpirun -n 2 ./out it gives the following error:
./out': malloc(): smallbin double linked list corrupted: 0x00000000024aa090
I tried running it under valgrind like this:
valgrind --leak-check=yes mpirun -n 2 ./out
I got the following output:
==3494== Memcheck, a memory error detector
==3494== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==3494== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==3494== Command: mpirun -n 2 ./out
==3494==
Grid_0/NACA0012.msh
Grid_0/NACA0012.msh
>>> Number of cells: 7734
>>> Number of cells: 7734
0.000000 0 1.470622e-02
*** Error in `./out': malloc(): smallbin double linked list corrupted: 0x00000000024aa090 ***
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 3497 RUNNING AT orhan
= EXIT CODE: 134
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
==3494==
==3494== HEAP SUMMARY:
==3494== in use at exit: 131,120 bytes in 2 blocks
==3494== total heap usage: 1,064 allocs, 1,062 frees, 231,859 bytes allocated
==3494==
==3494== LEAK SUMMARY:
==3494== definitely lost: 0 bytes in 0 blocks
==3494== indirectly lost: 0 bytes in 0 blocks
==3494== possibly lost: 0 bytes in 0 blocks
==3494== still reachable: 131,120 bytes in 2 blocks
==3494== suppressed: 0 bytes in 0 blocks
==3494== Reachable blocks (those to which a pointer was found) are not shown.
==3494== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==3494==
==3494== For counts of detected and suppressed errors, rerun with: -v
==3494== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
I am not experienced with valgrind, but as far as I understand, valgrind saw no problem. Are there better valgrind options to spot the source of the specific error mentioned above?
Note that with the invocation above,
valgrind --leak-check=yes mpirun -n 2 ./out
you are running valgrind on the program mpirun, which presumably has been extensively tested and works correctly, and not the program ./out, which you know to have a problem.
To run valgrind on your test program you will want to do:
mpirun -n 2 valgrind --leak-check=yes ./out
This uses mpirun to launch 2 processes, each running valgrind --leak-check=yes ./out.
You can never go wrong with a Jonathan Dursi answer, but let me just add that with more than one processor it can be a pain to read valgrind's output.
Instead of outputting to the console, dump it to a log file. Of course, if you dump both processes to the same log file, that's not going to be helpful. Instead, log to multiple files -- valgrind interprets '%p' as the process ID, so you get two (or more) log files:
mpiexec -np 2 valgrind --leak-check=full \
--show-reachable=yes --log-file=nc.vg.%p ./noncontig_coll2 -fname blah
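Combining both answers for the ./out program from the question, an invocation along these lines (a sketch; the out.vg.%p log names are just a suggestion) gives one readable Memcheck log per rank, and the invalid read/write errors reported there usually point at the write that corrupts malloc's bookkeeping:
mpirun -n 2 valgrind --leak-check=full --log-file=out.vg.%p ./out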
For research purposes, I am trying to run the sqlite3 full test suite.
I am running under Linux for this test.
I believe I have installed the correct libraries:
sudo apt-get install tcl-dev tk-dev
And when I run the configure command, it finds Tcl:
checking whether to use an in-ram database for temporary tables... no
checking if executables have the .exe suffix... unknown
checking for Tcl configuration... found /usr/lib/tclConfig.sh
checking for existence of /usr/lib/tclConfig.sh... loading
checking for library containing readline... no
checking for library containing tgetent... no
checking for readline in -lreadline... no
checking readline.h usability... no
Then, when I run make test, it ends with:
Time: orderby8.test 5080 ms
Time: orderby9.test 291 ms
Makefile:1201: recipe for target 'tcltest' failed
make: *** [tcltest] Killed
It is hard to tell because your question is missing important details:
What *nix exactly?
More importantly, which Tcl versions are installed by apt?
How did you obtain the sqlite3 sources, including the test harness?
I just successfully tested the following:
Ubuntu Xenial
apt-get install tcl tcl-dev (turns out to be 8.6.7)
wget https://www.sqlite.org/2018/sqlite-src-3250200.zip
unzip sqlite-src-3250200.zip
cd sqlite-src-3250200
./configure && make test
... and it completes the make test run successfully:
...
Time: orderby7.test 6 ms
Time: orderby8.test 115 ms
Time: orderby9.test 6 ms
Time: oserror.test 49 ms
...
Time: zipfile2.test 10 ms
SQLite 2018-09-25 19:08:10 fb90e7189ae6d62e77ba3a308ca5d683f90bbe633cf681865365b8e92792d1c7
0 errors out of 147094 tests on builda Linux 64-bit little-endian
All memory allocations freed - no leaks
Maximum memory usage: 9278848 bytes
Current memory usage: 0 bytes
Number of malloc() : -1 calls
It seems that oserror.test crashes whatever Tcl you have installed in your case.
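If you want to narrow that down, one option (a sketch, assuming the in-tree build above and the testfixture target of the generated Makefile) is to build the Tcl test driver and run only the failing script:
cd sqlite-src-3250200
make testfixture
./testfixture test/oserror.test
If that single run also dies, the problem is in your Tcl installation rather than in SQLite itself.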
Since the beginning of November, I have been stuck trying to run a parallel job on a Linux cluster. I have already searched A LOT on the internet for information, but I simply can't make progress. When I started searching for parallelism in R on a cluster, I discovered Rmpi. It looked quite simple, but now I no longer know what to do. I have a script to submit my job:
#PBS -S /bin/bash
#PBS -N ANN_residencial
#PBS -q linux.q
#PBS -l nodes=8:ppn=8
cd $PBS_O_WORKDIR
source /hpc/modulos/bash/R-3.3.0.sh
export LD_LIBRARY_PATH=/hpc/nlopt-2.4.2/lib:$LD_LIBRARY_PATH
export CPPFLAGS='-I/hpc/nlopt-2.4.2/include '$CPPFLAGS
export PKG_CONFIG_PATH=/hpc/nlopt-2.4.2/lib/pkgconfig:$PKG_CONFIG_PATH
# OPENMPI 1.10 + GCC 5.3
source /hpc/modulos/bash/openmpi-1.10-gcc53.sh
mpiexec --mca orte_base_help_aggregate 0 -np 1 -hostfile ${PBS_NODEFILE} /hpc/R-3.3.0/bin/R --slave -f sunhpc_mpi.r
And this is the beginning of my R program:
library(caret)
library(Rmpi)
library(doMPI)
cl <- startMPIcluster()
registerDoMPI(cl)
So here are my questions:
1- Is this the way I should initialize the processes (i.e. using startMPIcluster without a parameter and passing -np 1 on the command line)?
2- Why, when I use these commands, does MPI complain with this message:
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process....
Note: it says this for all 64 processes (because there are 8 nodes with 8 CPUs each and I'm creating 63 worker processes).
3- Why, when I use these commands on a machine with 60 CPUs, does it spawn only two workers?
Finally, I got it!
To run a parallel program in R using Rmpi on a cluster, you need to configure the job script according to the system. Then the command line:
mpiexec --mca orte_base_help_aggregate 0 -np 1 -hostfile ${PBS_NODEFILE} /hpc/R-3.3.0/bin/R --slave -f sunhpc_mpi.r
has to be modified to:
mpiexec -np NUM_PROC -hostfile ${PBS_NODEFILE} /hpc/R-3.3.0/bin/R --slave -f sunhpc_mpi.r
In the R code, you do not need to pass anything to startMPIcluster(), so the code stays exactly as I wrote it above.
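For reference, here is a minimal doMPI skeleton of the kind that ends up running under mpiexec -np NUM_PROC (a sketch; the foreach body is just a placeholder for the real caret/ANN work):
library(foreach)
library(doMPI)

# The workers are the MPI processes started by mpiexec -np NUM_PROC,
# so startMPIcluster() needs no arguments here.
cl <- startMPIcluster()
registerDoMPI(cl)

# Placeholder parallel loop; replace the body with the real work.
res <- foreach(i = 1:100, .combine = c) %dopar% {
  sqrt(i)
}

closeCluster(cl)
mpi.quit()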
I am just getting started with the Pintos projects, working from my home computer, which runs a 64-bit Ubuntu 14.04 system.
I'm able to compile the project from the src/threads/ directory, and the initial test pintos run alarm-multiple seems to work okay (notice that it runs qemu by default):
zay#ubuntu:~/Documents/pintos/src/threads/build$ pintos run alarm-multiple
Prototype mismatch: sub main::SIGVTALRM () vs none at /home/zay/Documents/pintos/src/utils/pintos line 935.
Constant subroutine SIGVTALRM redefined at /home/zay/Documents/pintos/src/utils/pintos line 927.
qemu-system-x86_64 -drive cache=writeback,file=/tmp/YS3E7FICwo.dsk -m 4 -net none -serial stdio
PiLo hda1
Loading..........
Kernel command line: run alarm-multiple
Pintos booting with 4,088 kB RAM...
382 pages available in kernel pool.
382 pages available in user pool.
Calibrating timer... 286,310,400 loops/s.
Boot complete.
Executing 'alarm-multiple':
(alarm-multiple) begin
(alarm-multiple) Creating 5 threads to sleep 7 times each.
(alarm-multiple) Thread 0 sleeps 10 ticks each time,
(alarm-multiple) thread 1 sleeps 20 ticks each time, and so on.
(alarm-multiple) If successful, product of iteration count and
(alarm-multiple) sleep duration will appear in nondescending order.
(alarm-multiple) thread 0: duration=10, iteration=1, product=10
(alarm-multiple) thread 0: duration=10, iteration=2, product=20
However, when I run make check under src/threads/build, all tests fail with a timeout:
zay#ubuntu:~/Documents/pintos/src/threads/build$ make check
pintos -v -k -T 60 --qemu -- -q run alarm-multiple < /dev/null 2> tests/threads/alarm-multiple.errors > tests/threads/alarm-multiple.output
perl -I../.. ../../tests/threads/alarm-multiple.ck tests/threads/alarm-multiple tests/threads/alarm-multiple.result
FAIL tests/threads/alarm-multiple
run: TIMEOUT after 61 seconds of wall-clock time - load average: 0.20, 0.45, 0.26
pintos -v -k -T 60 --qemu -- -q run alarm-simultaneous < /dev/null 2> tests/threads/alarm-simultaneous.errors > tests/threads/alarm-simultaneous.output
perl -I../.. ../../tests/threads/alarm-simultaneous.ck tests/threads/alarm-simultaneous tests/threads/alarm-simultaneous.result
FAIL tests/threads/alarm-simultaneous
run: TIMEOUT after 61 seconds of wall-clock time - load average: 0.18, 0.40, 0.25
pintos -v -k -T 60 --qemu -- -q run alarm-priority < /dev/null 2> tests/threads/alarm-priority.errors > tests/threads/alarm-priority.output
perl -I../.. ../../tests/threads/alarm-priority.ck tests/threads/alarm-priority tests/threads/alarm-priority.result
FAIL tests/threads/alarm-priority
run: TIMEOUT after 61 seconds of wall-clock time - load average: 0.10, 0.34, 0.2
What changes should I make to solve this problem?
Apparently, QEMU no longer supports the power-off sequence on port 0x8900. Here is a fix that made it work for me (found in chaOs): in the file devices/shutdown.c, patch shutdown_power_off as follows:
void
shutdown_power_off (void)
{
// ...
printf ("Powering off...\n");
serial_flush ();
outw (0xB004, 0x2000); // <-- Add this line
// ...
}
If you are using QEMU for Pintos, you need to add one line of code in devices/shutdown.c.
Insert the line
outw (0x604, 0x0 | 0x2000);
after printf ("Powering off...\n"); and serial_flush ();, as shown below:
printf ("Powering off...\n");
serial_flush ();

/* Add the following line: */
outw (0x604, 0x0 | 0x2000);

/* This is a special power-off sequence supported by Bochs and
   QEMU, but not by physical hardware. */
for (p = s; *p != '\0'; p++)
  outb (0x8900, *p);
Follow this guide to find out more
I'd like to take advantage of the MPI checkpointing feature to save my job. According to the suggestion at https://wiki.mpich.org/mpich/index.php/Checkpointing, I should be able to send SIGUSR1 to mpiexec (in my case, I send it to mpirun) to trigger a checkpoint. However, when I do so, I don't see any files saved in the checkpoint directory that I specified with -ckpoint-prefix.
Here is my mpirun -info output:
HYDRA build details:
Version: 4.1 Update 1
Release Date: 20130522
Process Manager: pmi
Bootstrap servers available: ssh rsh fork slurm srun ll llspawn.stdio lsf blaunch sge qrsh persist jmi
Resource management kernels available: slurm srun ll llspawn.stdio lsf blaunch sge qrsh pbs
Checkpointing libraries available: blcr
Demux engines available: poll select
My command line is:
mpirun -ckpointlib blcr -ckpoint-prefix /home/user/temp/ckpoint -ckpoint-interval 1800 -np 274 $PROGPATH/myapp
The way I send the signal is kill -s USR1 1900, where 1900 is the PID of mpirun. Whenever I send the signal, the program simply ends; there is no crash, though. Does anybody have experience with MPI checkpointing?
I think I figured it out. I was sending USR1 to mpirun, but I should send it to mpiexec.hydra instead, even though some online articles say mpirun and mpiexec are the same thing.
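For example, something along these lines (a sketch; it assumes pgrep is available and that only one mpiexec.hydra instance is running on the node) delivers the checkpoint signal to the right process:
kill -s USR1 "$(pgrep -n mpiexec.hydra)"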
On an Oracle Solaris 11 console, when I issue ps -ef | grep java, I can see the PID of a running Java process that was started in another console window, which has since been closed (the .jar application's output was visible there). Is there some way to grab that application's output again without restarting the .jar file?
The application was started like this (as the root user):
java -jar SomeFile.jar &
Writing the output to a file is not an option in this case.
Yes, you can do that, but it involves some mad skills with gdb. Here is how to do it on Linux, and I believe you can do the same on Solaris (since it has gdb and all the system calls I'm going to use below).
There are 3 file descriptors for standard streams:
stdin: 0
stdout: 1
stderr: 2
You are interested in stdout and stderr (both are console output), so you need the file descriptors numbered 1 and 2; just keep that in mind.
Now I'm going to show you how to do what you ask, using the "okular" application (instead of your "java" application) and its stderr stream.
Run "okular" in terminal, like this:
$ okular &
and then close this terminal. This is just to simulate your situation.
Open another terminal
Look for "okular" process:
$ ps aux | grep okular
Output:
joe 27599 2.2 0.9 515644 73944 ? S 23:46 0:00 okular
So "okular" PID is 27599.
Look for open file descriptors of "okular" process:
$ ls -l /proc/27599/fd
Output:
lrwx------ 1 joe joe 64 Feb 18 23:46 0 -> /dev/pts/0 (deleted)
lrwx------ 1 joe joe 64 Feb 18 23:46 1 -> /dev/pts/0 (deleted)
lrwx------ 1 joe joe 64 Feb 18 23:46 2 -> /dev/pts/0 (deleted)
You can see that all 3 streams are marked as deleted.
Now let's attach to our process with gdb:
$ gdb -p 27599 /usr/bin/okular
Inside of gdb perform next operations:
(gdb) p close(2)
(gdb) p creat("/tmp/okular_2", 0600)
(gdb) detach
(gdb) quit
Here we invoked 2 system calls:
close(), to close the file backing our process's stderr stream
creat(), to create a new file for our process's stderr stream
p is a gdb command; in our case it prints the system calls' return values.
Now all new stderr output of our process will be appended to the text file /tmp/okular_2. We can follow it continuously like this:
$ tail -f /tmp/okular_2
Conclusion
OK, that's it; we revived the stderr stream. You can do the same for the stdout stream; the only difference is that you need to call "close(1)" instead of "close(2)" in gdb. Also, in your case, be sure to replace "okular" everywhere with your "java" process.
Most of this answer was inspired by this article.
If you need to revive the stdin stream, you can attach it to a pipe (FIFO) file; see details here.
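If you would rather not type the commands interactively, the same steps can be scripted with gdb's batch mode (a sketch; 27599 and the /tmp/okular_* paths are the example values from above, and the (int) casts help when gdb does not know the return types of close() and creat()):
gdb -p 27599 -batch \
    -ex 'call (int) close(1)' \
    -ex 'call (int) creat("/tmp/okular_1", 0600)' \
    -ex 'call (int) close(2)' \
    -ex 'call (int) creat("/tmp/okular_2", 0600)'
Because creat() returns the lowest free descriptor, the first creat() takes over descriptor 1 (stdout) and the second takes over descriptor 2 (stderr).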
Yes, it is possible to snoop any process's output with native Solaris tools.
One way is to use dtrace, which allows tracing processes even when they are already grabbed by a debugger or similar tool.
This dtrace script will display a given process stdout:
#!/bin/ksh
pid=$1
dtrace -qn "syscall::write:entry /pid == $pid && arg0 == 1 /
{ printf(\"%s\",copyinstr(arg1)); }"
You should pass the process ID of the Java application to trace as its first argument, e.g. $(pgrep -f "java -jar SomeFile.jar").
Replace arg0 == 1 with arg0 == 2 if you want to trace stderr instead of stdout.
Should you want to see non-displayable characters (in octal), you might use this slightly modified version:
#!/bin/ksh
pid=$1
dtrace -qn "syscall::write:entry /pid == $pid && arg0 == 1 /
{ printf(\"%s\",copyinstr(arg1)); }" | od -c
Another native way is to use the truss command. The following command will show all writes from your process to any file descriptor and include a full, detailed trace of both stdout and stderr (3799 is your target process's PID):
truss -w1,2 -t write -p 3799
dtrace:
http://docs.oracle.com/cd/E18752_01/html/819-5488/gcgkk.html
truss:
http://docs.oracle.com/cd/E36784_01/html/E36870/truss-1.html#scrolltoc