perf_event_open and PERF_COUNT_HW_INSTRUCTIONS - intel

I'm trying to profile an existing application with a quite complicated structure. For now I am using perf_event_open and the necessary ioctl calls to enable the events I'm interested in.
The manpage states that PERF_COUNT_HW_INSTRUCTIONS should be used carefully, so which event should be preferred on a Skylake processor? Maybe one specific to the Intel PMU?

The perf_event_open manpage http://man7.org/linux/man-pages/man2/perf_event_open.2.html
says about PERF_COUNT_HW_INSTRUCTIONS:
PERF_COUNT_HW_INSTRUCTIONS Retired instructions. Be careful, these can be affected by various issues, most notably hardware interrupt counts.
I think this means that PERF_COUNT_HW_INSTRUCTIONS can be used (and it is supported almost everywhere), but the exact value of PERF_COUNT_HW_INSTRUCTIONS for a given code fragment may differ slightly from run to run because of noise from interrupts or other logic.
So it is safe to use the PERF_COUNT_HW_INSTRUCTIONS and PERF_COUNT_HW_CPU_CYCLES events on most CPUs. The perf_events subsystem in the Linux kernel maps PERF_COUNT_HW_CPU_CYCLES to a raw event suited to the CPU in use and its PMU.
Depending on your goals, you should collect some statistics on the PERF_COUNT_HW_INSTRUCTIONS values for your code fragment. You can also check the stability of this counter by running perf stat several times on a simple program:
perf stat -e cycles:u,instructions:u /bin/echo 123
perf stat -e cycles:u,instructions:u /bin/echo 123
perf stat -e cycles:u,instructions:u /bin/echo 123
Or use the built-in repeat option of perf stat:
perf stat --repeat 10 -e cycles:u,instructions:u /bin/echo 123
I see a variation of about ±10 instruction events (less than 0.1%) out of roughly 200 thousand instructions executed, so the counter is very stable. For cycles I see about 5% variation, so it is really the cycles event that deserves the "be careful" warning.
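To put a number on run-to-run stability, the readings from repeated perf stat runs can be compared with a few lines of Python. This is only a minimal sketch: the helper name relative_spread and the sample readings are made up for illustration and are not measured data.

```python
import statistics


def relative_spread(samples):
    """Return (max - min) / mean for a list of counter readings."""
    mean = statistics.mean(samples)
    return (max(samples) - min(samples)) / mean


# Hypothetical readings (assumption: typed in by hand from several runs).
instructions = [200_010, 200_000, 199_995, 200_005]
cycles = [310_000, 300_000, 295_000, 305_000]

print(f"instructions spread: {relative_spread(instructions):.4%}")
print(f"cycles spread:       {relative_spread(cycles):.4%}")
```

A spread well below one percent suggests the counter is stable enough to compare code fragments directly; a spread of several percent suggests averaging over many runs.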

clearmake doesn't like my MAKEFLAGS=j12 values

I use both GNU Make and - woe is me - ClearCase's clearmake.
Now, GNU Make honors an environment variable named MAKEFLAGS, which for me is set to j20 on this multi-core machine I'm on. Unfortunately, clearmake also recognizes this variable, yet doesn't accept this value. It tells me:
clearmake: Error: Bad option (j)
clearmake: Error: Bad option (2)
clearmake: Error: Bad option (0)
Questions:
Why is this happening? Should ClearMake accommodate GNU Make's usage?
How can I get around it, other than turning the flag off and on repeatedly?
It's been 15 years or so since I used clearmake, but assuming it doesn't support the GNU make-specific GNUMAKEFLAGS variable you can use:
export GNUMAKEFLAGS=-j20
and leave MAKEFLAGS unset.
The "BUILDING SOFTWARE WITH CLEARCASE" documentation clearly states, under "unsupported Gnu make features", that this option is indeed not supported:
-j [JOBS]
--jobs=[JOBS]
Maybe a clearmake -C -J can help (for testing): there should then be no limit to the number of parallel builds.
Are you calling GNU make from a clearmake build script? Or are you trying to create a single makefile that will support both build tools? I think the GNUMAKEFLAGS environment variable (EV) is safer for GNU make specific values. I would also use
CCASE_MAKEFLAGS for any makeflags that are specific to clearmake.
CCASE_CONC to set the concurrency value. While clearmake no longer passes the -J in MAKEFLAGS, it used to, and if you're using an older clearmake (somewhere in the 7's as I recall), you could upset "child" GNU make sessions since they like -J about as much as clearmake likes -j.
Finally, check the env_ccase man page for the behavior mentioned in CCASE_MAKEFLAGS_V6_OBSOLETE. If you pass MAKEFLAGS explicitly in the build script like
$(MAKE) $(MAKEFLAGS) TARGET=x
And had started clearmake like this:
clearmake -C gnu TARGET=Y
You'll actually get both TARGET macro definitions in the command line. Setting the mentioned EV (at all) avoids the "pass defined macros in MAKEFLAGS" behavior. The switch exists because some people have makefiles that DEPEND on this behavior, while others have ones BROKEN BY this behavior...
Assuming for the sake of argument that your company has a support agreement with either IBM or HCL, this is a good time to use your support channels to bring up clearmake concerns.

Why does "netstat -a" not exit immediately when "netstat -n" does?

I have checked what "-n" does:
"Displays active TCP connections, however, addresses and port numbers are expressed numerically and no attempt is made to determine names."
But I can't see why "-n" makes netstat exit immediately.
From a quick check, I don't see the same description for the "-n" option as you do, and it doesn't make netstat run continuously.
As you didn't specify the version and exact command you are using, I tried both the version that comes with RH7.6 (net-tools 2.10-alpha) and the latest from source code (net-tools 3.14-alpha). The net-tools source code can be found in github [1].
As I couldn't find the exact option you describe, I tried all flags (without combinations) that don't require an argument. As far as I can tell the only options that cause netstat to not exit immediately are '-g' and '-c'. '-c' makes sense as it is the flag for running netstat continuously. For '-g' it isn't as obvious as the continuous behavior is coming from reading the /proc/net/igmp and /proc/net/igmp6 files line-by-line. The first file is read quickly but the igmp6 file takes much longer (1 line per ~1 sec). The '-g' option isn't really continuous, but just takes a lot of time to finish.
From the code, the only reason for continuous execution is (appears 4 times in the code):
if (i || !flag_cnt)
    break;
wait_continous();
'i' is a return code from a function and the 'break' command is to break from an infinite for loop, so basically the code will run continuously only if flag_cnt is set (only happens when '-c' is provided) and there were no errors with previous commands.
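The control flow described above can be modeled in a few lines of Python. This is a sketch of the logic only, not the net-tools code: the function name run_netstat_loop, the report and wait callbacks, and the max_iterations cap (added so the model terminates) are all assumptions made for illustration.

```python
def run_netstat_loop(flag_cnt, report, wait=lambda: None, max_iterations=5):
    """Model of the loop: report once, repeat only if flag_cnt is set.

    report() stands in for the function whose return code is stored in i;
    a non-zero value means an error. max_iterations is a hypothetical cap,
    the C code loops without a bound.
    """
    iterations = 0
    while iterations < max_iterations:
        i = report()
        iterations += 1
        if i or not flag_cnt:
            break  # error, or '-c' not given: leave the loop
        wait()     # stands in for wait_continous()
    return iterations


print(run_netstat_loop(flag_cnt=False, report=lambda: 0))
print(run_netstat_loop(flag_cnt=True, report=lambda: 0))
print(run_netstat_loop(flag_cnt=True, report=lambda: 1))
```

In this model a single pass happens whenever flag_cnt is unset or the report fails, which matches the observation that only '-c' keeps netstat running.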
For the specific issue above there could be a few reasons:
The option involves reading from a file and it takes very long time to finish, but it is not really continuous.
There's a correlation between the given option and flag_cnt, which cause flag_cnt to be set.
There's a call to wait_continous() which doesn't follow the condition above.
As I said, I couldn't reproduce the issue in the original question, nor could I find any flag with the description above. Also, none of the flags besides '-c' caused netstat to run continuously.
If you still want to figure this out I suggest you take a look at your code, or at least specify the net-tools version you use. The kernel version is also important as some code would be compiled-out due to missing kernel support.
[1] https://github.com/ecki/net-tools

Using MPI on Slurm, is there a way to send messages between separate jobs?

I'm new to using MPI (mpi4py) and Slurm. I need to run about 50000 tasks, so to obey the administrator-set limit of about 1000, I've been running them like this:
sbrunner.sh:
#!/bin/bash
for i in {1..50}
do
    sbatch m2slurm.sh $i
    sleep 0.1
done
m2slurm.sh:
#!/bin/bash
#SBATCH --job-name=mpi
#SBATCH --output=mpi_50000.out
#SBATCH --time=0:10:00
#SBATCH --ntasks=1000
srun --mpi=pmi2 --output=mpi_50k${1}.out python par.py data_50000.pkl ${1} > ${1}py.out 2> ${1}.err
par.py (irrelevant stuff omitted):
offset = (int(sys.argv[2]) - 1) * 1000
comm = MPI.COMM_WORLD
k = comm.Get_rank()
d = data[k + offset]
# ... do something with d ...
allresults = comm.gather(result, root=0)
comm.Barrier()
if k == 0:
    print(allresults)
Is this a sensible way to get around the limit of 1000 tasks?
Is there a better way to consolidate results? I now have 50 files I have to concatenate manually. Is there some comm_world that can exist between different jobs?
I think you need to make your application divide the work among 1000 tasks (MPI ranks) and then consolidate the results with MPI collective calls, e.g. MPI_Reduce or MPI_Allreduce.
Trying to work around the limit won't help you, as the jobs you start will be queued one after another.
Job arrays will behave much like the batch script you provided, so your application must still be able to process all data items given only N tasks (MPI ranks).
There is no need to poll to make sure all other jobs have finished; take a look at Slurm's job dependency parameter:
https://hpc.nih.gov/docs/job_dependencies.html
Edit:
You can use a job dependency to create a new job that runs after all the other jobs finish; this job collects the results and merges them into one big file. I still believe you are overthinking it; the obvious solution is to make rank 0 (the master) collect all results and save them to disk.
This looks like a perfect candidate for job arrays. Each job in an array is identical with the exception of a $SLURM_ARRAY_TASK_ID environment variable. You can use this in the same way that you're using the command line variable.
(You'll need to check that MaxArraySize is set high enough by your sysadmin. Check the output of scontrol show config | grep MaxArraySize )
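As a rough sketch of how par.py could pick its data item inside a job array, the offset can be derived from SLURM_ARRAY_TASK_ID instead of a command-line argument. The helper names below are hypothetical, and the sketch assumes the array is submitted with IDs 1 to 50 (for example with sbatch --array=1-50) and 1000 ranks per job, as in the question.

```python
import os

CHUNK = 1000  # ranks per job, matching --ntasks=1000 in the question


def array_task_id_from_env(env=os.environ):
    """Read the array index that Slurm exports to each job of an array."""
    return int(env["SLURM_ARRAY_TASK_ID"])


def global_index(rank, array_task_id, chunk=CHUNK):
    """Map (array task, MPI rank) to an index into the full data set.

    Assumes array IDs start at 1, mirroring the 1..50 loop in sbrunner.sh.
    """
    return (array_task_id - 1) * chunk + rank


# Example with a fake environment instead of a real Slurm job.
fake_env = {"SLURM_ARRAY_TASK_ID": "3"}
print(global_index(rank=0, array_task_id=array_task_id_from_env(fake_env)))
```

Inside par.py the rank would come from comm.Get_rank(), so the lookup becomes data[global_index(k, array_task_id_from_env())].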
What do you mean by 50000 tasks?
1) Do you mean one MPI job with 50000 MPI tasks?
2) Or do you mean 50000 independent serial programs?
3) Or do you mean any combination where (number of MPI jobs) * (number of tasks per job) = 50000?
If 1), consult your system administrator. Of course, you can allocate 50000 slots across multiple SLURM jobs, manually wait until they are all running at the same time, and then mpirun your app outside of SLURM. This is both ugly and inefficient, and you might get in trouble if it is seen as an attempt to circumvent the system limits.
If 2) or 3), then a job array is a good candidate. If I understand your app correctly, you will need an extra post-processing step to concatenate/merge all your outputs into a single file.
If you go with 3), you will need to find the sweet spot. Generally speaking, 50000 serial programs are more efficient than fifty 1000-task MPI jobs or one 50000-task MPI program, but merging 50000 files is less efficient than merging 50 files (or not merging anything at all).
If you cannot do the post-processing on a frontend node, you can use a job dependency to start it after all the computations have completed.
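The post-processing step mentioned above could be as small as the following Python sketch, which concatenates the per-job output files in numeric job order. The function names and the merged file name are hypothetical; the pattern "*py.out" is an assumption based on the ${1}py.out redirection in m2slurm.sh.

```python
import re
import tempfile
from pathlib import Path


def job_number(path):
    """Extract the leading job number from a name such as '12py.out'."""
    match = re.match(r"(\d+)", Path(path).name)
    return int(match.group(1)) if match else -1


def merge_outputs(directory, pattern, merged_name):
    """Concatenate all files matching pattern into merged_name, in job order."""
    directory = Path(directory)
    parts = sorted(directory.glob(pattern), key=job_number)
    with open(directory / merged_name, "w") as out:
        for part in parts:
            out.write(part.read_text())
    return [part.name for part in parts]


if __name__ == "__main__":
    # Demo on throwaway files; a real run would point at the job directory.
    with tempfile.TemporaryDirectory() as tmp:
        for n in (10, 2, 1):
            (Path(tmp) / f"{n}py.out").write_text(f"result {n}\n")
        print(merge_outputs(tmp, "*py.out", "merged.txt"))
        print((Path(tmp) / "merged.txt").read_text(), end="")
```

Sorting by the numeric prefix avoids the lexicographic order in which 10py.out would come before 2py.out.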

Intel icc compiler -O flags and -qopt-report

I am working on an HPC system at the moment, and I have a question regarding the icc compiler.
What I want to do is to have a peek at what is going on when I change the optimisation levels through [O0..O3]. The data I want, regarding vectorization and whether code was folded inline etc., seems to be in the report generated by the -qopt-report flag.
I decided to use the greatest level of verbosity for the report which is
-qopt-report5 (I think this is the correct way to use it)
however, when reducing the O-level, the report gets progressively smaller until becoming empty when using the -O0 flag:
icc -O0 -qopt-report5 -c test1.c
I'll keep looking, but if anyone notices me being brain dead, I'd appreciate a pointer to the use of these flags together!
Thanks in advance for any hints.
Cheers,
MArk.
-qopt-report5 is always disabled when you use -O0.
This is "by definition", because opt-report == "optimization report" and O0 == "no optimization", so there is nothing to report about.
Auto-vectorization is, generally speaking, enabled starting from the O2 optimization level, so if you want to explore vectorization aspects, you need to use at least the "-O2 -qopt-report5" combination or higher.
If you want to correlate performance "peaks" and "optimization report", consider using Intel "Vectorization Advisor" (read more here, download from this location for now: https://software.intel.com/en-us/advisor_getting_started_intro )

Who knows the history of Unix fork?

Fork is a great tool in Unix. We can use it to create a copy of our process and change its behaviour. But I don't know the history of fork.
Can someone tell me the story?
Actually, unlike many of the basic UNIX features, fork was a relative latecomer (a).
The earliest existence of multiple processes within UNIX consisted of a few (fixed number of) processes, one per terminal that was attached to the PDP-7 machine (b).
The basic idea was that the shell process for a given terminal would accept a command from the user, locate the program file, load a small bootstrap program into high memory and jump to it, passing enough details for the bootstrap code to load the program file.
The bootstrap code, after loading the program into low memory (overwriting the shell), would then jump to it.
When the program was finished, it would call exit but it wasn't like the exit we know and love today. This exit would simply reload the shell and run it using pretty much the same method used to load the program in the first place.
So it was really more like a rudimentary exec command, the one that replaces your current program with another, in the same process space.
The shell would exec your program then, when your program was done, it would again exec the shell by calling exit.
This method was similar to that found in many other interactive systems at the time, including Multics, from whence UNIX got its name.
From the two-way exec, it was actually not that big a leap to adding fork as a process duplicator to work in conjunction. While many systems run another program directly, it's this "just add what's needed" method which is responsible for the separation of duties between fork and exec in UNIX. It also resulted in a very simple fork function.
If you're interested in the early history of various features(c) of Unix, you cannot go past the article The Evolution of the Unix Time-Sharing System by Dennis Ritchie, presented at a 1979 conference in Australia, and subsequently published by AT&T.
(a) Though I mean latecomer in the sense that the separation of the four fundamental forces in the universe was "late", happening some 0.00000000001 seconds after the big bang.
(b) Since a question was raised in a comment as to how the shells were originally started off, there's a great resource holding very early source code for Unix over at The Unix Heritage Society, specifically the source code archives and, in particular, the first edition.
The init.s file from the first edition shows how the fixed number of shell processes were created (slightly reformatted):
    ...
    mov     $itab, r1       / address of table to r1
1:
    mov     (r1)+, r0       / 'x, x=0, 1... to r0
    beq     1f              / branch if table end
    movb    r0, ttyx+8      / put symbol in ttyx
    jsr     pc, dfork       / go to make new init for this ttyx
    mov     r0, (r1)+       / save child id in word offer '0, '1, etc
    br      1b              / set up next child
1:
    ...
itab:
    '0; ..
    '1; ..
    '2; ..
    '3; ..
    '4; ..
    '5; ..
    '6; ..
    '7; ..
    0
Here you can see the snippet which creates the processes for each connected terminal. These are the days of hard-coded values, no auto detection of terminal quantity involved. The zero-terminated table at itab is used to create a number of processes and hopefully the comments from the code explain how (the only possibly tricky bit is the labels - though there are multiple 1 labels, you branch to the nearest one in a given direction, hence 1b means the closest 1 label in the backwards direction).
The code shown simply processes the table, calling dfork to create a process for each terminal and start getty, the login prompt. The getty program, in turn, eventually started the shell. From that point, it's as I described in the main part of this answer.
(c) No paths (and use of temporary links to get around this limitation), limited processes, why there's a GECOS field in the password file, and all sorts of other trivia, generally interesting only to uber-geeks, of course.