Let perf use certain performance counters properly with newer processors - intel

I'm trying to use perf to measure certain events, including L1-dcache-stores, on my machine, which has a relatively new processor (i9-10900K) running the relatively old CentOS 7 with kernel 3.10.0-1127.
The problem is that perf reports L1-dcache-stores, together with some other events, as not supported when I run perf stat -e L1-dcache-stores, so I can't use it, at least not in any straightforward way that I know of. However, under CentOS 8 with kernel 4.18.0-193, perf works fine for this event on the same machine. So I suspect the older kernel doesn't know how to deal with certain performance counters on processors that are too new, perf being essentially part of the kernel.
What can I do to use perf on the CentOS 7 system and have things like L1-dcache-stores working properly for my processor? I can't just take the perf binary from CentOS 8 and use it on CentOS 7, because the glibc version is different.
$ sudo perf stat -e L1-dcache-stores echo

 Performance counter stats for 'echo':

   <not supported>      L1-dcache-stores

       0.000486304 seconds time elapsed

       0.000389000 seconds user
       0.000000000 seconds sys

Support for your processor (Comet Lake) was added starting with the CentOS kernel package version 4.18.0-151.el8 (see the commit titled "perf/x86/intel: Add Comet Lake CPU support" listed in the changelog of the kernel). So all model-specific hardware events, including L1-dcache-stores, are supported by name in perf in 4.18.0-193 but not in 3.10.0-1127. That's why you're getting <not supported>. perf stat will report all events categorized as "hardware cache events" as not supported on 3.10.0-1127. Intel architectural hardware events, such as cycles and instructions, are supported by name in all versions of the perf_event subsystem.
The only way to use model-specific hardware events on a kernel that doesn't support the processor on which it's running is by specifying raw event codes rather than event names in the perf command. This method is described in the "ARBITRARY PMUS" section of the perf-list manual. For example, the perf event L1-dcache-stores would be mapped to the native event MEM_INST_RETIRED.ALL_STORES on your processor. The event code can then be determined by looking up the event name in the Intel manual.
perf stat -e cpu/event=0xd0,umask=0x82,name=MEM_INST_RETIRED.ALL_STORES/ ...
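For example, re-running the measurement from the question with the raw event (a sketch: the event code 0xd0 and umask 0x82 come from Intel's documentation for MEM_INST_RETIRED.ALL_STORES; verify them against the manual for your exact model, and drop the name= term if the older perf rejects a name containing dots):
sudo perf stat -e cpu/event=0xd0,umask=0x82,name=MEM_INST_RETIRED.ALL_STORES/ echo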

Related

Running performance tools using RBAC

I recently started a job that will involve a lot of performance tweaking.
I was wondering whether tools like eBPF and perf can be used with RBAC, or will full root access be required? Getting root access might be difficult. We're mainly using fairly old Linux machines - RHEL 6.5. I'm not too familiar with RBAC. At home I have used DTrace on Solaris, macOS and FreeBSD, but there I have the root password.
RHEL lists several profiling and tracing solutions for RHEL6 including perf in its
Performance Tuning Guide and Developer Guide:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/performance_tuning_guide/s-analyzperf-perf
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/developer_guide/perf-using
Chapter 3. Monitoring and Analyzing System Performance of the Performance Tuning Guide mentions several tools: GNOME System Monitor, KDE System Guard, Performance Co-Pilot (PCP), top/ps/vmstat/sar, tuned and ktune, MRG Tuna, and the application profilers SystemTap, OProfile, Valgrind (which is not a real profiler, but a CPU emulator with instruction and cache event counting), and perf.
Chapter 5. Profiling of Developer Guide lists Valgrind, oprofile, SystemTap, perf, and ftrace.
Usually, profiling of the kernel or the whole system is allowed only for root, or for a user with the CAP_SYS_ADMIN capability. Some profiling is limited by sysctl variables:
kernel.perf_event_paranoid (documented in https://www.kernel.org/doc/Documentation/sysctl/kernel.txt):
perf_event_paranoid:
Controls use of the performance events system by unprivileged
users (without CAP_SYS_ADMIN). The default value is 2.
-1: Allow use of (almost) all events by all users
Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>=0: Disallow ftrace function tracepoint by users without CAP_SYS_ADMIN
Disallow raw tracepoint access by users without CAP_SYS_ADMIN
>=1: Disallow CPU event access by users without CAP_SYS_ADMIN
>=2: Disallow kernel profiling by users without CAP_SYS_ADMIN
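You can check the current value and, as root, relax it for the running system (a sketch; pick the value according to the table above):
sysctl kernel.perf_event_paranoid
sudo sysctl -w kernel.perf_event_paranoid=1   # allow per-process profiling for unprivileged users, but no CPU-wide monitoring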
kernel.kptr_restrict (https://www.kernel.org/doc/Documentation/sysctl/kernel.txt), which also changes perf's ability to profile the kernel:
kptr_restrict:
This toggle indicates whether restrictions are placed on
exposing kernel addresses via /proc and other interfaces.
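As root this can likewise be relaxed for the running system (a sketch):
sudo sysctl -w kernel.kptr_restrict=0   # expose kernel addresses so perf can resolve kernel symbols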
More recent versions of Ubuntu and RHEL (7.4) also have kernel.yama.ptrace_scope http://security-plus-data-science.blogspot.com/2017/09/some-security-updates-in-rhel-74.html
... use kernel.yama.ptrace_scope to set who can ptrace. The different
values have the following meaning:
# 0 - Default attach security permissions.
# 1 - Restricted attach. Only child processes plus normal permissions.
# 2 - Admin-only attach. Only executables with CAP_SYS_PTRACE.
# 3 - No attach. No process may call ptrace at all. Irrevocable until next boot.
You can temporarily set it like this:
echo 2 > /proc/sys/kernel/yama/ptrace_scope
To profile a program you need to be able to debug it, e.g. attach to it with gdb (the ptrace capability) or strace. I don't know RHEL or its RBAC, so you should check what is available to you. Generally, perf profiling of your own userspace programs using software events is available in more cases. Access to per-process CPU hardware counters, profiling of other users' programs, and profiling of the kernel are more restricted. I would expect a correctly configured RBAC setup not to allow you (or even root) to profile the kernel, since perf can inject tracing probes and leak information from the kernel or from other users.
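For example, counting software events on a program you own usually works without extra privileges (the program name here is just a placeholder):
perf stat -e task-clock,context-switches,page-faults ./myprog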
Qeole says in a comment that eBPF is not implemented for RHEL 6 (it was added in RHEL 7.6; XDP - eXpress Data Path - came in RHEL 8), so you can only try ftrace for tracing, or stap (SystemTap) for advanced tracing.

libvirt cpu-mode='host-model' confuses while mapping cpu models?

I have a physical host whose CPU model is 'Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz', and it has the 'avx2' flag in cpuinfo. The host has the kvm/qemu hypervisor and libvirt configured. I set the cpu mode to host-model in the domain XML. Guest VMs can be created on the host. When I check the CPU model of a guest VM, it shows as 'SandyBridge', and it also has the 'avx2' flag in cpuinfo. But 'SandyBridge' does not support the 'avx2' flag, while the 'Haswell' model does. Is it just that, due to host-model mode, libvirt finds the nearest CPU model to 'Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz' as 'SandyBridge', when it should show 'Haswell' instead? Does that mean libvirt has a bug, or is this a valid representation in this scenario? I am using libvirt version 1.2.2.
Within a particular chip generation (SandyBridge, Haswell, etc), Intel does not in fact guarantee that all the different models it makes have the same CPU flags present. We can see this with Haswell or later, where some CPUs have the TSX feature and some don't. QEMU/libvirt generally only provide a single model for each Intel generation though, so it's possible that your physical CPU might not actually be compatible with the correspondingly named QEMU model.
From libvirt's POV, the names are just a shortcut for a particular group of features. As such, when identifying the CPU for "host-model", libvirt completely ignores the names and just looks for the CPU whose list of features most closely matches your host CPU, then lists any extra CPU features explicitly in the XML. All this means that even though you have a Haswell as your physical CPU, it is entirely possible that libvirt will display a different model name for your guest. There's nothing really wrong with this from a functional POV - the features should all still be present (except for a few that KVM intentionally blocks); it is merely a bit "surprising" to look at.
In your case, what I think is going on is due to the bug in Intel's TSX support. This feature was introduced in Haswell, but then blocked in a microcode update after Intel found out it was broken. This causes the 'tsx' feature to disappear from the CPU model in your physical machine. The libvirt/QEMU Haswell CPU model still contains 'tsx', so libvirt won't match it against your Haswell CPU. In libvirt >= 1.2.14 we introduced a new Haswell-noTSX CPU model to deal with this particular problem, but you say you only have 1.2.2. SandyBridge is simply the next best compatible CPU model that libvirt can find for you.
I found another workaround which doesn't require upgrading libvirt. I removed the hle and rtm flags from the definition of Haswell in the CPU map XML file used by libvirt (/usr/share/libvirt/cpu_map.xml), and then I restarted the libvirt process. After rebooting the VM, it showed the correct model name, Haswell.
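A minimal sketch of that workaround, assuming the stock file location and an init-script-managed libvirtd (the restart command depends on your distribution):
sudo cp /usr/share/libvirt/cpu_map.xml /usr/share/libvirt/cpu_map.xml.bak
# edit the file and delete the <feature name='hle'/> and <feature name='rtm'/> lines
# inside the <model name='Haswell'> element
sudo vi /usr/share/libvirt/cpu_map.xml
sudo service libvirtd restart   # or: systemctl restart libvirtd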

Is there a way to limit the number of R processes running

I use the doMC package, which uses the multicore package. It has happened (several times) that when I was debugging (in the console) things went sideways and fork-bombed.
Does R have access to the setrlimit() syscall?
In Python I would use resource.RLIMIT_NPROC for this.
Ideally I'd like to restrict the number of running R processes to a fixed number.
EDIT: The OS is Linux, CentOS 6.
There should be several choices. Here is the relevant section from Writing R Extensions, Section 1.2.1.1:
Packages are not standalone programs, and an R process could
contain more than one OpenMP-enabled package as well as other components
(for example, an optimized BLAS) making use of OpenMP. So careful
consideration needs to be given to resource usage. OpenMP works with
parallel regions, and for most implementations the default is to use as
many threads as 'CPUs' for such regions. Parallel regions can be
nested, although it is common to use only a single thread below the
first level. The correctness of the detected number of 'CPUs' and the
assumption that the R process is entitled to use them all are both
dubious assumptions. The best way to limit resources is to limit the
overall number of threads available to OpenMP in the R process: this can
be done via environment variable 'OMP_THREAD_LIMIT', where
implemented.(4) Alternatively, the number of threads per region can be
limited by the environment variable 'OMP_NUM_THREADS' or API call
'omp_set_num_threads', or, better, for the regions in your code as part
of their specification. E.g. R uses
#pragma omp parallel for num_threads(nthreads) ...
That way you only control your own code and not that of other OpenMP
users.
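For example, from the shell you can cap the OpenMP thread counts before launching R (a sketch; the script name is a placeholder, and OMP_THREAD_LIMIT is only honoured where the runtime implements it, as the passage notes):
OMP_THREAD_LIMIT=4 OMP_NUM_THREADS=2 R --vanilla -f myscript.R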
One of my favourite tools is a package controlling this: RhpcBLASctl. Here is its Description:
Control the number of threads on 'BLAS' (Aka 'GotoBLAS', 'ACML' and
'MKL'). and possible to control the number of threads in 'OpenMP'. get
a number of logical cores and physical cores if feasible.
After all, you need to control the number of parallel sessions as well as the number of BLAS cores allocated to each of the parallel threads. There is a reason the parallel package has a default of 2 threads per session...
All of this should be largely independent of the flavour of Linux or Unix you are running. Well, apart from the fact that OS X of course (still !!) does not give you OpenMP.
And the very outermost level you can control from doMC and friends.
You can use registerDoMC (see the doc here)
registerDoMC(cores=<some number>)
Another option is to use the ulimit command before running the R script:
ulimit -u <some number>
to limit the number of processes R will be able to spawn.
If you want to limit the total number of CPUs that several R processes use at the same time, you will need to use cgroups or cpusets and attach the R processes to the cgroup or cpuset. They will then be confined to the physical CPUs defined in the cgroup or cpuset. cgroups allow more control (for instance also over memory) but are more complex to set up.
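A minimal sketch with the libcgroup tools on CentOS 6 (yum install libcgroup; the group name, CPU list and script name are placeholders):
sudo cgcreate -g cpuset:/rlimit
sudo cgset -r cpuset.cpus=0-3 rlimit    # confine the group to CPUs 0-3
sudo cgset -r cpuset.mems=0 rlimit      # a cpuset also needs at least one memory node
cgexec -g cpuset:rlimit R --vanilla -f myscript.R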

How to get cpu cores utilizations in xen?

I have installed Xen as the hypervisor, and there are dom0 and some paravirtualized machines as domU VMs on it.
I know xentop is used for checking the performance of the system and the virtual machines, and I can read its output to measure a virtual machine's CPU utilization. But it only gives the total usage across all CPUs!
So, is there any tool or any way to get per-core CPU usage?
I think you may be able to get what you want from XenMon. http://www.virtuatopia.com/index.php/Xen_Monitoring_Tools_and_Techniques
Also, try using xentop with the VCPUs option: -v or press V when inside xentop.
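For instance, something like this should print per-VCPU rows in batch mode (a sketch; adjust the delay to taste):
xentop -b -d 2 -v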
If you are using XenServer, then there are lots of interesting host metrics available, much more than the base Xen setup. Check out http://support.citrix.com/servlet/KbServlet/download/38321-102-714737/XenServer-6.5.0_Administrators%20Guide.pdf, Chapter 9.
It's all particularly good if you use XenCenter to view those metrics, but then I would say that since I wrote a significant portion of it ;-)

How to use GNU make --max-load on a multicore Linux machine?

From the documentation for GNU make: http://www.gnu.org/software/make/manual/make.html#Parallel
When the system is heavily loaded, you will probably want to run fewer
jobs than when it is lightly loaded. You can use the ‘-l’ option to
tell make to limit the number of jobs to run at once, based on the
load average. The ‘-l’ or ‘--max-load’ option is followed by a
floating-point number. For example,
-l 2.5
will not let make start more than one job if the load average is above 2.5.
The ‘-l’ option with no following number removes the load limit, if one was
given with a previous ‘-l’ option.
More precisely, when make goes to start up a job, and it already has
at least one job running, it checks the current load average; if it is
not lower than the limit given with ‘-l’, make waits until the load
average goes below that limit, or until all the other jobs finish.
From the Linux man page for uptime: http://www.unix.com/man-page/Linux/1/uptime/
System load averages is the average number of processes that are
either in a runnable or uninterruptable state. A process in a runnable
state is either using the CPU or waiting to use the CPU. A process
in uninterruptable state is waiting for some I/O access, eg waiting
for disk. The averages are taken over the three time intervals.
Load averages are not normalized for the number of CPUs in a system,
so a load average of 1 means a single CPU system is loaded all the
time while on a 4 CPU system it means it was idle 75% of the time.
I have a parallel makefile and I want to do the obvious thing: have make keep adding processes until I get full CPU usage without inducing thrashing.
Many (all?) machines today are multicore, so that means that the load average is not the number make should be checking, as that number needs to be adjusted for the number of cores.
Does this mean that the --max-load (aka -l) flag to GNU make is now useless? What are people doing who are running parallel makefiles on multicore machines?
My short answer: --max-load is useful if you're willing to invest the time it takes to make good use of it. With its current implementation there's no simple formula to pick good values, or a pre-fab tool for discovering them.
The build I maintain is fairly large. Before I started maintaining it, the build took 6 hours. With -j64 on a ramdisk it now finishes in 5 minutes (30 on an NFS mount with -j12). My goal here was to find reasonable caps for -j and -l that allow our developers to build quickly without making the server (build server or NFS server) unusable for everyone else.
To begin with:
- If you choose a reasonable -jN value (on your machine) and find a reasonable upper bound for load average (on your machine), they work nicely together to keep things balanced.
- If you use a very large -jN value (or leave it unspecified, e.g. -j with no number) and limit the load average, gmake will:
  - continue spawning processes (gmake 3.81 added a throttling mechanism, but that only helps mitigate the problem a little) until the max number of jobs is reached or until the load average goes above your threshold
  - while the load average is over your threshold:
    - do nothing until all sub-processes are finished
    - spawn one job at a time
  - do it all over again
On Linux at least (and probably other *nix variants), the load average is an exponential moving average (UNIX Load Average Reweighed, Neil J. Gunther) that represents the average number of processes waiting for CPU time (which can be caused by too many runnable processes, waiting for IO, page faults, etc). Since it's an exponential moving average, it's weighted such that newer samples have a stronger influence on the current value than older samples.
If you can identify a good "sweet spot" for the right max load and number of parallel jobs (through a combination of educated guesses and empirical testing), assuming you have a long running build: your 1 min avg will hit an equilibrium point (won't fluctuate much). However, if your -jN number is too high for a given max load average, it'll fluctuate quite a bit.
Finding that sweet spot is essentially equivalent to finding optimal parameters to a differential equation. Since it will be subject to initial conditions, the focus is on finding parameters that get the system to stay at equilibrium as opposed to coming up with a "target" load average. By "at equilibrium" I mean: 1m load avg doesn't fluctuate much.
Assuming you're not bottlenecked by limitations in gmake: When you've found a -jN -lM combination that gives a minimum build time: that combination will be pushing your machine to its limits. If the machine needs to be used for other purposes ...
... you may want to scale it back a bit when you're finished optimizing.
Without regard to load avg, the improvements I saw in build time with increasing -jN appeared to be [roughly] logarithmic. That is to say, I saw a larger difference between -j8 and -j12 than between -j12 and -j16.
Things peaked for me somewhere between -j48 and -j64 (on the Solaris machine it was about -j56) because the initial gmake process is single-threaded; at some point that thread cannot start new jobs faster than they finish.
My tests were performed on:
- A non-recursive build
  - recursive builds may see different results; they won't run into the bottleneck I did around -j64
  - I've done my best to minimize the amount of make-isms (variable expansions, macros, etc) in recipes, because recipe parsing occurs in the same thread that spawns parallel jobs. The more complicated the recipes are, the more time is spent in the parser instead of spawning/reaping jobs. For example:
    - no $(shell ...) macros are used in recipes; those are run during the first parsing pass and cached
    - most variables are assigned with := to avoid recursive expansion
- Solaris 10/sparc
  - 256 cores
  - no virtualization/logical domains
  - the build ran on a ramdisk
- x86_64 Linux
  - 32 cores (4x hyper-threaded)
  - no virtualization
  - the build ran on a fast local drive
Even for a build where the CPU is the bottleneck, -l is not ideal. I use -jN, where N is the number of cores that exist or that I want to spend on the build. Choosing a bigger number doesn't speed up the build in my situation. It doesn't slow it down either, as long as you don't go overboard (such as unlimited launching through -j).
Using -lN is broadly equivalent to -jN, and can work better if the machine has other independent work to do, but there are two quirks (apart from the one you mentioned, the number of cores not accounted for):
Initial spike: when the build starts, make launches a lot of jobs, many more than N. The system load number doesn't immediately increase when a process is forked. That's not a problem in my situation.
Starvation: when some build jobs take a long time while the others are quick and finish at about the same time, then at the moment the first M quick jobs end, the system load is still ≥ N. Soon the system load drops to N - M, but as long as those few slow jobs are dragging on, no new jobs are launched, and cores are left hungry. Make only thinks about launching a new job when an old job ends, and at the start; it doesn't notice the system load dropping in between.
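If you do combine the two, one simple pattern is to cap the job count at the core count and use the same number as the load ceiling (a sketch; nproc reports the number of available cores):
make -j"$(nproc)" -l "$(nproc)"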
Many (all?) machines today are multicore, so that means that the load
average is not the number make should be checking, as that number
needs to be adjusted for the number of cores.
Does this mean that the --max-load (aka -l) flag to GNU make is now
useless?
No. Imagine jobs with demanding disk i/o. If you started as many jobs as you had CPUs, you still wouldn't utilize the CPU very well.
Personally, I simply use -j because so far it worked well enough for me.
Does this mean that the --max-load (aka -l) flag to GNU make is now useless? What are people doing who are running parallel makefiles on multicore machines?
One example is running jobs in a test suite where each test has to compile and link a program. Linking sometimes loads the system too much, resulting in: fatal error: ld terminated with signal 9 [Killed]. In my case it was not memory overhead but CPU usage, so the usually suggested swap file didn't help.
With the option -l 1, execution is still parallel but linking becomes almost sequential. For example (a sketch; the -j value and the test-suite target name here are placeholders):
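make -j8 -l 1 check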
This is really about finding the right balance between RAM usage and CPU usage. The RAM has to feed the CPU with data and the CPU needs to do the work; they need to work in sync, and that depends on your exact settings with regard to your machine's specs.
For my system (CPU: i5-1035G4, 4 cores / 8 threads, RAM: 8GB plus 10GB swap with swappiness at 99%) the best settings were: -l 1.9 -j7.
With those settings my system compiled quickly, using about 50% of its capacity, so I could still use it to do everything else in the foreground.
