Collecting an MPI Trace

How can I collect an MPI communication trace on Supercomputers?
I need text files with details of each message (say sender, receiver, size, etc.) that I can parse.
I was using the following command with Intel MPI but do not see any text files:
mpirun -trace -n 4 -trace-pt2pt -trace-collectives ./myApp

I am not familiar with Intel MPI's integrated solution, but there are a number of tools that provide MPI tracing.
Performance-focused:
Score-P (file format: OTF2)
TAU
Extrae
Correctness checking:
MUST
I recommend not rolling your own solution, because it is not straightforward to match receives to sends, and you might run into timing issues because timers are not synchronized across nodes.
You could, for example, trace a run using Score-P and then use the otf2-print command on the resulting trace to get the text output you want. Alternatively, you can use the OTF2 reader library and develop a tool on top of it. Here is a short tutorial on how to run Score-P, starting at slide 17.
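A minimal sketch of that workflow, assuming a C source built with mpicc (the file name myApp.c is illustrative; the scorep-* directory name varies per run):
scorep mpicc myApp.c -o myApp       # recompile through the Score-P compiler wrapper
export SCOREP_ENABLE_TRACING=true   # Score-P only profiles by default; turn tracing on
mpirun -n 4 ./myApp                 # run as usual; a scorep-* experiment directory is written
otf2-print scorep-*/traces.otf2     # dump the trace as text, one event per line, ready to parse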

Related

read and copy buffer from kernel in CPU to kernel in FPGA with OpenCL

I'm trying to speed up the Ethash algorithm on a Xilinx U50 FPGA. My problem is not the FPGA itself; it is passing the DAG file, which is generated on the CPU, to the FPGA.
First, I'm using this code in my test. I made a few changes to support the Intel OpenCL driver. If I use only the CPU to run the Ethash (in this case, xleth) program, everything completes. In my case, though, I first generate the DAG file on the CPU, and with 4 cores it takes 30 seconds to generate epoch number 0. After that I want to pass the DAG file (shown as m_dag in the code) to a new buffer, g_dag, to send it to the U50's HBMs.
I can't use only one context in this program, because I'm using two separate kernel files (.cl for the CPU and .xclbin for the FPGA), and when I try to create the program and kernel it gives me error 33 (CL_INVALID_DEVICE). So I made a separate context (named g_context).
Now I want to know: how can I send data from m_context to g_context? And is that OK and efficient? (Suggest another solution if you have one.) A sketch of the usual pattern follows below.
I've posted my code at this link, so if you can, please send a code solution.
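For context: OpenCL buffers cannot be shared between two contexts, so the data has to round-trip through host memory -- read the buffer back from the first context, then write it into a buffer belonging to the second. A minimal sketch (untested; the queue handles and dag_size are illustrative, the buffer names follow the question):
#include <CL/cl.h>
#include <cstdint>
#include <vector>

// Copy a buffer from one OpenCL context to another via host memory.
// m_queue belongs to m_context (CPU), g_queue to g_context (FPGA).
// Error checking omitted for brevity.
void copy_dag_between_contexts(cl_command_queue m_queue, cl_mem m_dag,
                               cl_command_queue g_queue, cl_mem g_dag,
                               size_t dag_size)
{
    std::vector<uint8_t> host_dag(dag_size);

    // 1. Blocking read: the DAG leaves the CPU context into host memory.
    clEnqueueReadBuffer(m_queue, m_dag, CL_TRUE, 0, dag_size,
                        host_dag.data(), 0, NULL, NULL);

    // 2. Blocking write: host memory into the FPGA context's buffer.
    clEnqueueWriteBuffer(g_queue, g_dag, CL_TRUE, 0, dag_size,
                         host_dag.data(), 0, NULL, NULL);
}
For a DAG of several GB this extra copy is not free, but it happens only once per epoch, so it is usually acceptable; pinned host memory (CL_MEM_ALLOC_HOST_PTR) can speed the transfer up.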

Intel PIN: print backtrace when segfault happens in tool

I'm developing a tool for Intel PIN. Somewhere at runtime, it gives me the error below. I want to know if there is a way to tell PIN to print a backtrace, or to let me handle the segfault in the tool itself.
I'm running my tool with MPI and it crashes when I insert values into an unordered map.
C: Tool (or Pin) caused signal 11 at PC 0x2b09594533cb
mpirun -np 44 pin-3.7-97619-g0d0c92f4f-gcc-linux/pin -follow_execv -t pin-3.7-97619-g0d0c92f4f-gcc-linux/source/tools/Simp ... -- program
You can use the following API:
PIN_AddInternalExceptionHandler()
Its callback gives you access to an EXCEPTION_INFO structure, which you can inspect and manipulate with the exception API.
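A minimal sketch of how that could look in a tool (untested; the handler name is illustrative):
#include <iostream>
#include "pin.H"

// Called by Pin when the tool itself (not the application) raises an
// exception, e.g. the SIGSEGV from the unordered_map insert.
static EXCEPT_HANDLING_RESULT HandleToolException(THREADID tid,
                                                  EXCEPTION_INFO *pExceptInfo,
                                                  PHYSICAL_CONTEXT *pPhysCtxt,
                                                  VOID *v)
{
    // PIN_ExceptionToString() includes the exception code and the faulting PC.
    std::cerr << "Tool exception on thread " << tid << ": "
              << PIN_ExceptionToString(pExceptInfo) << std::endl;
    return EHR_UNHANDLED;  // hand the exception back to Pin's default handling
}

int main(int argc, char *argv[])
{
    if (PIN_Init(argc, argv)) return 1;
    PIN_AddInternalExceptionHandler(HandleToolException, NULL);
    // ... register the usual instrumentation callbacks here ...
    PIN_StartProgram();  // never returns
    return 0;
}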
Otherwise, you can also debug your tool from within a debugger by launching it with the -pause_tool 20 option. You then have 20 seconds to attach your debugger to the process. Once attached, the debugger stops (at least with Visual Studio) and lets you set the breakpoints you need in your tool's code.
This is not that easy to debug, though, as the whole system constantly switches between pintool code, Pin itself, and the target application. Hence there is no continuous sequence of steps inside your pintool code that you can follow, as you would expect when debugging classic single-threaded applications.

How can I see detailed work of nodes on a Rocks Cluster?

I built a Rocks Cluster for my school project, which is matrix multiplication, with one frontend and 5 other computers as nodes. Over MPI I send them partitions of a matrix, which they use for multiplication, and then they send the data back. The command I run is:
mpirun -hostfile myhostfile ./myprogram
where myhostfile is a file with the names of the nodes and their slot (thread) counts.
My program is working, and I'm trying to analyze it now.
My question is: how can I see each node's cores/processors working on their tasks? Are all the processors working, and is there some kind of overload?
I tried to install the Vampir profiler and Intel's VTune Amplifier, but I have problems attaching them to my program with the command above (other commands don't let me run my program on all threads of a node). All I have accomplished so far (besides seeing my nodes work in Ganglia) is to log in to a node from the frontend; with the command "top" I could see when my program was executing by the number of threads and the almost 100% CPU usage on each thread.
Take a look at mpstat.
With no parameters it will show the aggregated load across all cores;
mpstat -P ALL shows the load for each core.
This will give you realtime stats for your nodes:
watch pdsh -w compute-01-[01-10] mpstat
(use your own compute node names)

Process stop getting network data

We have a process (written in C++/managed) which receives network data via TCP/IP.
After running the process for a while while tracking network load, it seems the network gets into a frozen state and the process stops receiving data; other processes on the system that use networking (on the same NIC) operate normally.
The process gets out of this frozen state by itself after several minutes.
Any idea what is happening?
Is there any counter I can track to see if my process is reaching some limit?
It is going to be very difficult to answer specifically
-- without knowing what exactly your process/application is about,
-- whether it is a network chat application, a file server/client, or something else,
-- without other details about your process: how it is implemented, and what libraries it uses, where relevant to the problem.
Also, you haven't mentioned what OS and environment you are running this process under, so there is very little anyone can do to help. It could be anything: a busy-wait loop in your code, locking problems if it's multi-threaded code, ...
Nonetheless, here are some options to check.
If it's Linux, try the commands below to debug and monitor the behaviour of the process and see what the problem could be:
top
Check top to see how much of each resource (CPU, memory) your process is using, and whether any values, such as its CPU usage, are abnormally high.
pstack
This shows the stack frames of the process at the time of the problem.
netstat
Run this with the necessary options (tcp/udp) to check the state of the network sockets opened by your process.
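For example, to watch whether data is piling up unread (the <pid> filter is illustrative):
netstat -tnp | grep <pid>
A Recv-Q value that keeps growing means the kernel is receiving data on the socket but your process is not reading it -- which would match the "frozen" symptom.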
gcore -s -c
This forces your process to dump core when the mentioned problem happens; you can then analyze that core file using gdb.
gdb
Then use the command where at the gdb prompt to get a full backtrace of the process (the function it was executing last and the chain of calls that led there).
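For example (the binary name ./myprocess and PID 1234 are illustrative):
gcore -o myprocess.core 1234
gdb ./myprocess myprocess.core.1234
(gdb) where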

which signal does gdb send when attaching to a process?

Which signal does gdb send when attaching to a process? Does this work the same on different UNIXes, e.g. Linux and Mac OS X?
So far I have only found out that SIGTRAP is used to implement breakpoints. Is it used for attaching as well?
AFAIK it does not need any signals to attach. It just suspends the "inferior" by calling ptrace. It also reads the debugged process's memory and registers using these calls, and it can request instruction single-stepping (provided it's implemented on that port of Linux), etc.
Software breakpoints are implemented by placing, at the right location, an instruction that triggers a "trap" or something similar when reached; the debugged process can run at full speed until then.
Also (next to reading man ptrace, as already mentioned), see the ptrace explanation on Wikipedia.
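To make the mechanics concrete, here is a minimal sketch of attaching via ptrace directly (Linux-specific, error handling trimmed):
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    pid_t pid = (pid_t)std::atoi(argv[1]);

    // Attach: the tracer just calls ptrace; on Linux the kernel stops the
    // target (a SIGSTOP is generated as part of PTRACE_ATTACH, not sent by us).
    if (ptrace(PTRACE_ATTACH, pid, nullptr, nullptr) == -1) {
        std::perror("ptrace(PTRACE_ATTACH)");
        return 1;
    }
    waitpid(pid, nullptr, 0);  // wait until the target has actually stopped

    // A debugger would now inspect the target with PTRACE_PEEKDATA,
    // PTRACE_GETREGS, set breakpoints, single-step, and so on.

    ptrace(PTRACE_DETACH, pid, nullptr, nullptr);  // let the target resume
    return 0;
}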
