Intel Pin inscount0 reports different count vs Linux perf - intel

I have a program where I'm trying to trace (instruction trace) with Intel's Pin program (v3.25). The commands I'm running:
pin -t obj-intel64/inscount0.so -- <my binary>
perf stat --event=instructions:{k,u} -- <my binary>
However, the reported number of instructions is wildly different (10x difference) with perf report much much more.
When I used the itrace.so program and the replayed the instructions, I also noticed way fewer instructions than I would've expected for the program. So it got me thinking that there might be something wrong with the pin setup. But I'm not sure where to go from here to debug. Any advice?

Related

Intel PIN: print backtrace when segfault happens in tool

I'm developing a tool for Intel PIN. Somewhere in the runtime, it gives me the below error. I want to know if there is a way to tell PIN to print the backtrace or let me handle the segfault in the tool itself.
I'm running my tool with MPI and it crashes when I insert values into an unordered map.
C: Tool (or Pin) caused signal 11 at PC 0x2b09594533cb
mpirun -np 44 pin-3.7-97619-g0d0c92f4f-gcc-linux/pin -follow_execv -t pin-3.7-97619-g0d0c92f4f-gcc-linux/source/tools/Simp ... -- program
You can use the following API:
PIN_AddInternalExceptionHandler()
from where you get access to an EXCEPTION_INFO structure which is supposed to be manipulated with the exception API.
Otherwise, you can also debug your tool from within a debugger, by launching your tool with the -pause_tool 20 option. Then you have 20 seconds to attach your debugger to the process. Once attached, the debugger stops (at least with Visual Studio) and lets you set the breakpoints you need in your tool's code.
This is not that easy to debug though, as the whole system switch from pintool code, to pin, to target application constantly. Hence there is not a continuous process of steps inside your pintool code that you can follow, as you can expect when debugging "classic single threaded applications".

Collecting an MPI Trace

How can I collect an MPI communication trace on Supercomputers?
I need text files with details of each message (say sender, receiver, size, etc.) that I can parse.
I was using following command for Intel MPI and do not see any text files.
mpirun -trace -n 4 -trace-pt2pt -trace-collectives ./myApp
I am not familiar with Intel MPI's integrated solution.
There is a number of tools that provide MPI tracing.
Performance focussed:
Score-P (Fileformat OTF2)
TAU
Extrae
Correctness checking:
MUST
I recommend to not roll your own solution, because it's not straight forward to match receives to sends and you might run into timing issues because timers are not synchronized across nodes.
You could e.g. trace a run using Score-P, and then use the otf2-print command on the trace to get the text output you wanted. Or you can use the OTF2 reader library and develop a tool on top of it. Here is a short tutorial on how to run Score-P, starting at slide 17

Process stop getting network data

We have a process (written in c++ /managed), which receives network data via tcpip.
After running the process for a while while tracking network load, it seems that network get into freeze state and the process does not getting data, there are other processes in the system that using networking (same nic) which operates normally.
the process gets out of this frozen situation by itself after several minutes.
Any idea what is happening?
Any counter i can track to see if my process reach some limitations ?
It is going to be very difficult to answer specifically,
-- without knowing what exactly is your process/application about,
-- whether it is a network chat application, or a file server/client, or ......
-- without other details about your process how it is implemented, what libraries it uses, if relevant to problem.
Also you haven't mentioned what OS and environment you are running this process under,
there is very little anyone can help . It could be anything, a busy wait loopl in your code, locking problems if its a multi-threaded code,....
Nonetheless , here are some options to check:
If its linux try below commands to debug and monitor the behaviour of the process and see what could be problem-
top
Check top to see ow much resources(CPU, memory) your process is using and if there is anything abnormally high values in CPU usage for it.
pstack
This should stack frames of the process executing at time of the problem.
netstat
Run this with necessary options (tcp/udp) to check what is the stae of the network sockets opened by your process
gcore -s -c
This forces your process to core when the mentioned problem happens, and then analyze that core file using gdb
gdb
and then use command where at gdb prompt to get full back trace of the process (which functions it was executing last and previous function calls.

Confusion about the performance counter

I am working on a program written in OpenCL and running on Fusion APU (CPU+GPU on one die). I wan to get some performance counters such as instructions number, branch number and so on. I have two tools on hand: AMD APP Profiler and CodeAnalyst. When I use the APP Profiler, I found that it seems can only provide instructions counter for GPU, cannot for CPU. Then I use CodeAnalyst, but then three confusions occurred.
On App Profiler, it can give the number of ALUInsts (i.e. the number of executed ALU instructions per work-item) is about 70000. The whole thread space on GPU has 8192 threads, so I intuitively think there are 70000 * 8192 instructions executed by GPU. Is that right?
When I use CodeAnalyst to measure the instructions for the same program on CPU part, it just gave "Ret inst", "Ret branch" such kind of counters, but I am not sure about one thing: this program runs on both CPU and GPU at the same time, what are these counters for? For CPU only, for GPU only? or the sum?
No matter what these counters for, I found that the value of Ret Inst (i.e. retired instructions) is about 40000, it seems too small for the whole program, I guess the instructions for a program should be at order of billions, how it could be only 4w? The attached pic shows the results.
Is there any people can help me resolve these confusions, I am just a tyro here, wish kind help from all of you. Thanks!

Serial Comms dies in WinXP

A bit of history: We have an application, which was originally written many years ago (1998 is the first date in PVCS but the app is about 5 years older than that as it originally was a DOS program). This application communicates with a piece of hardware via serial. When we got to Windows XP we started receiving reports of the app dying after a short time of running. It seems that the serial comms just 'died' and the app was left in a stuck state. The only way to recover from this situation was to restart the application.
The only information I can find regarding this problem was apparently the Windows Message system would miss that information was received, the buffer would fill and the system would get stuck. This snippet of information was left in a old word document, but there's no evidence to back this up. It also mentions that this is only prevalent at high baud rates (115200+).
The solution was to provide customers with USB->Serial converters along with the hardware.
Today: We are working on a new version of the hardware that will run across a network as well as serial ports. So to allow me to work on the network code, minus the actual hardware we are using a VSCOM NetCom113 device. It also installs a virtual comm port on the users (ie: mine) machine.
Now I have got the network code integrated with the app, it appears that the NetCom device exhibits the same behaviour as a physical commport. This is undesirable as I need the app to run longer than ~30 seconds.
Google turns up zero problems that we experience.
I was wondering:
Has anyone experienced this before? If so what did you do to fix/workaround the problem?
Does anyone have any suggestions as to whether the original author of the document is correct and what I can do to test the theory?
Unfortunately I can't post code as the serial code is tightly couple with the rest of the system, though if you have questions regarding it I can answer questions about it.
Updates:
The code is written using Win32 Comm routines - so I am using CreateFile, ReadFile. There's also judicious calls to GetOverlappedResult.
It's not hanging per se, it's just that the comms stops. You can access the menus, click the buttons, but nothing can interact with the connected hardware. Using realterm you can see that no data is coming in or going out.
I think the reference to the windows message is that the problem is internal to windows. Data has arrived but the kernal has missed it and thus not told the rest of the system about it.
Flow control is not used.
Writing a 'simple' test is difficult due the the fact that the code is tightly coupled and the underlying protocol is quite complex and would require a lot of work.
Are you using DOS-style serial code, or the Win32 CreateFile approach?
If the former, be very suspicious: if at all possible I'd convert to the latter.
If the latter, do you know on what kind of system call it's hanging? Are you in a blocking read call? or an overlapped I/O call? or waiting on an event? (I'm not sure I have enough experience to help, but those are the kinds of questions that come to mind)
You might also check into the queue size, which you can set with the SetupComm function.
I don't buy the "Windows Message system" stuff -- it sounds fishy; you can write good Win32 serial i/o code that never uses Windows messages.
edit: does your Overlapped I/O use events? I seem to remember something about auto-reset events occasionally missing their trigger... check your overlapped I/O calls very carefully to see whether you're handling the possible outcomes properly. Perhaps there's a way to make your code more robust by automatically cancelling the overlapped i/o and restarting another read. (I assume the problem is in the read half, not the write half?)
edit 2: A suggestion: assuming the win32 side has missed a byte or packet, and your devices are in deadlock because they're both expecting each other to respond to something, can you tweak the other side of the serial I/O to regularly send some type of "ping" packet with an incrementing counter? (and log the ping packets on the PC side; that way you can see whether you've missed any)
Are you sure you have your flow control set up correctly? DTR, RTS, etc...
-Adam
i have written apps that use usb / bluetooth serial ports and have never had an issue. with bluetooth i have seen bit rates (sustained) of 800,000 bps for long periods of time. most people don't properly implement the port.
My serial port
Not sure if this is a possibility for you, but if you could re-write the code using C#.NET you'd have access to the SerialPort class there. It might remedy your problem. I know a lot of legacy code based around the Win32 API for hardware I/O ports tended to fail in XP due to timing (had a small bit of experience with MIDI).
In addition, I don't know if you can use the Win32 method of Serial Port access in Vista, so that might shut out future MS OSes from being able to use your code.

Resources