Difference between two perf events for Intel processors

What is the difference between the following perf events for Intel processors:
UNC_CHA_DIR_UPDATE.HA: Counts only multi-socket cacheline Directory state updates memory writes issued from the HA pipe. This does not include memory write requests which are for I (Invalid) or E (Exclusive) cachelines.
UNC_CHA_DIR_UPDATE.TOR: Counts only multi-socket cacheline Directory state updates due to memory writes issued from the TOR pipe which are the result of remote transaction hitting the SF/LLC and returning data Core2Core. This does not include memory write requests which are for I (Invalid) or E (Exclusive) cachelines.
UNC_M2M_DIRECTORY_UPDATE.ANY: Counts when the M2M (Mesh to Memory) updates the multi-socket cacheline Directory to a new state.
The above descriptions of the perf events are taken from here.
In particular, if there is a directory update because of a memory write request coming from a remote socket, which perf event (if any) will account for it?
As per my understanding, since the CHA is responsible for handling requests coming from remote sockets via UPI, directory updates caused by remote requests should be reflected in UNC_CHA_DIR_UPDATE.HA or UNC_CHA_DIR_UPDATE.TOR. But when I run a program (which I will explain shortly), the UNC_M2M_DIRECTORY_UPDATE.ANY count is much larger (more than 34M), whereas the other two events are only on the order of a few thousand. Since there are no writes happening other than those coming from the remote socket, it appears that it is UNC_M2M_DIRECTORY_UPDATE.ANY, and not the other two events, that counts the directory updates caused by remote writes.
Description of the system
Intel Xeon Gold 6242 CPU (Intel Cascade Lake architecture)
4 sockets with each socket having PMEM attached to it
part of the PMEM on sockets 2 and 3 is configured to be used as system RAM
OS: Linux (kernel 5.4.0-72-generic)
Description of the program (a C sketch follows this list):
Note: use numactl to bind the process to node 2, which is a DRAM node
Allocate two buffers of size 1GB each
Initialize these buffers
Move the second buffer to the PMEM attached to socket 3
Perform a data copy from the first buffer to the second buffer
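For concreteness, here is a minimal C sketch of such a test program, using mmap for the allocations and mbind to migrate the second buffer. It is a reconstruction, not the poster's actual code; in particular, the PMEM NUMA node number (7 here) is an assumption and depends on how the PMEM regions are enumerated on the machine (check numactl -H).

```c
/* Hypothetical reconstruction of the test program described above.
 * Build: gcc -O2 copy_test.c -lnuma -o copy_test
 * Run:   numactl --cpunodebind=2 --membind=2 ./copy_test
 * PMEM_NODE is an assumption; check `numactl -H` on the real system. */
#include <numaif.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define BUF_SIZE  (1UL << 30)   /* 1 GiB per buffer */
#define PMEM_NODE 7             /* assumed NUMA node of socket 3's PMEM */

int main(void)
{
    /* Allocate two 1 GiB buffers; first touch places them on node 2
     * because the process is bound there with numactl. */
    char *src = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    char *dst = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (src == MAP_FAILED || dst == MAP_FAILED)
        return 1;

    /* Initialize both buffers (this is the first touch). */
    memset(src, 0xa5, BUF_SIZE);
    memset(dst, 0x5a, BUF_SIZE);

    /* Move the second buffer to the PMEM node attached to socket 3. */
    unsigned long nodemask = 1UL << PMEM_NODE;
    if (mbind(dst, BUF_SIZE, MPOL_BIND, &nodemask,
              sizeof(nodemask) * 8, MPOL_MF_MOVE) != 0) {
        perror("mbind");
        return 1;
    }

    /* Remote copy: reads from local DRAM, writes to the PMEM node. */
    memcpy(dst, src, BUF_SIZE);

    printf("copy done: %d\n", dst[0]);  /* keep the copy from being optimized away */
    return 0;
}
```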

Related

Low performance with MPI communication within a single node

I have a program that uses the Open MPI implementation of MPI for data exchange between processes. Right now I am running this program on a single node, where the data has to be shared from one process to all the others. The total amount of data that the master process sends is 130 Gb, which is split and sent to 6-8 client processes, but this data transfer takes a very long time (about 1 hour).
Knowing that the code is running on the very same node, I would expect the data transfer could be sped up through the settings I pass when launching mpirun. Do you know which settings could help me get a faster data transfer in this scenario? Right now I am using only "--mca btl vader,self" as optional components.
The actual code uses MPI_Send() calls that each transfer an amount of data close to the maximum that can be sent in a single call. After the data has been transferred to one client process via multiple MPI_Send() calls, the master process sends data to the other pending client processes.
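For reference, a minimal sketch of the chunked-send pattern described above (not the poster's actual code); the chunk size, the MPI_BYTE datatype, and the rank layout are all assumptions made for illustration. Note that the vader BTL in the --mca option already selects Open MPI's shared-memory transport for intra-node messages.

```c
/* Minimal sketch of sending a large buffer in chunks, since a single
 * MPI_Send is limited to INT_MAX elements of the given datatype.
 * Build: mpicc -O2 chunked_send.c && mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdlib.h>

#define CHUNK_BYTES (1UL << 30)   /* 1 GiB per MPI_Send call (assumption) */

static void send_large(const char *buf, size_t total, int dest, MPI_Comm comm)
{
    size_t sent = 0;
    while (sent < total) {
        size_t n = total - sent;
        if (n > CHUNK_BYTES)
            n = CHUNK_BYTES;
        MPI_Send(buf + sent, (int)n, MPI_BYTE, dest, 0, comm);
        sent += n;
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t total = 8UL << 30;     /* 8 GiB, an arbitrary example size */
    if (rank == 0) {
        char *buf = calloc(total, 1);
        send_large(buf, total, 1, MPI_COMM_WORLD);
        free(buf);
    } else if (rank == 1) {
        char *buf = malloc(total);
        size_t recvd = 0;
        while (recvd < total) {
            size_t n = total - recvd;
            if (n > CHUNK_BYTES)
                n = CHUNK_BYTES;
            MPI_Recv(buf + recvd, (int)n, MPI_BYTE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            recvd += n;
        }
        free(buf);
    }

    MPI_Finalize();
    return 0;
}
```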

Does NUMA impact memory bandwidth, or just latency?

I have a problem that is memory bandwidth limited -- I need to read a lot (many GB) of data sequentially from RAM, do some quick processing and write it sequentially to a different location in RAM. Memory latency is not a concern.
Is there any benefit from dividing the work between two or more cores in different NUMA zones? Equivalently, does working across zones reduce the available bandwidth?
For bandwidth-limited, multi-threaded code, the behavior in a NUMA system will primarily depend on how "local" each thread's data accesses are, and secondarily on details of the remote accesses.
In a typical 2-socket server system, the local memory bandwidth available to two NUMA nodes is twice that available to a single node. (But remember that it may take many threads running on many cores to reach asymptotic bandwidth for each socket.)
The STREAM Benchmark, for example, is typically run in a configuration that allows almost all accesses from every thread to be "local". This is implemented by assuming "first touch" NUMA placement -- when allocated memory is first written, the OS has to create mappings from the process virtual address space to physical addresses, and (by default) the OS chooses physical addresses that are in the same NUMA node as the core that executed the store instruction.
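A minimal sketch of this first-touch pattern with OpenMP is shown below; the array size is an arbitrary example, and the key point is that the initialization loop and the later work loop use the same static schedule, so each thread touches (and therefore places) the pages it will later operate on.

```c
/* Sketch of STREAM-style "first touch" placement.
 * Build: gcc -O2 -fopenmp first_touch.c */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long n = 1L << 27;              /* 128M doubles per array (assumption) */
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    if (!a || !b)
        return 1;

    /* First touch: each thread initializes its own chunk, so the OS backs
     * those pages with memory local to that thread's NUMA node. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++) {
        a[i] = 1.0;
        b[i] = 0.0;
    }

    /* The work loop uses the same schedule, so accesses stay local. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        b[i] = a[i];

    printf("b[last] = %f\n", b[n - 1]);   /* keep the copy from being optimized away */
    free(a);
    free(b);
    return 0;
}
```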
"Local" bandwidth (to DRAM) in most systems is approximately symmetric (for reads and writes) and relatively easy to understand. "Remote" bandwidth is much more asymmetric for reads and writes, and there is usually significant contention between the read/write commands going between the chips and the data moving between the chips. The overall ratio of local to remote bandwidth also varies significantly across processor generations. For some processors (e.g., Xeon E5 v3 and probably v4), the interconnect is relatively fast, so jobs with poor locality can often be run with all of the memory interleaved between the two sockets.
Local bandwidths have increased significantly since then, with more recent processors generally strongly favoring local access.
Example from the Intel Xeon Platinum 8160 (2 UPI links between chips):
Local Bandwidth for Reads (each socket) ~112 GB/s
Remote Bandwidth for Reads (one-direction at a time) ~34 GB/s
Local bandwidth scales perfectly in two-socket systems, and remote bandwidth also scales very well when using both sockets (each socket reading data from the other socket).
It gets more complicated with combined read and write traffic between sockets, because the read traffic from node 0 to node 1 competes with the write traffic from node 1 to node 0, etc.
Local Bandwidth for 1R:1W (each socket) ~101 GB/s (reduced due to read/write scheduling overhead)
Remote Bandwidth for 1R:1W (one socket running at a time) ~50 GB/s -- more bandwidth is available because both directions are being used, but this also means that if both sockets are doing the same thing, there will be conflicts. I see less than 60 GB/s aggregate when both sockets are running 1R:1W remote at the same time.
Of course different ratios of local to remote accesses will change the scaling. Timing can also be an issue -- if the threads are doing local accesses at the same time, then remote accesses at the same time, there will be more contention in the remote access portion (compared to a case in which the threads are doing their remote accesses at different times).

Kernel pipeline and clEnqueueReadBuffer

I have a pipeline of kernels:
1) kernel A writes data into buffer X
2) buffer X is copied to host via clEnqueueReadBuffer
3) host data is processed, in callback triggered by clEnqueueReadBuffer
repeat above
Buffer X is created with the following flags :
CL_MEM_USE_HOST_PTR | CL_MEM_READ_WRITE | CL_MEM_HOST_READ_ONLY
My question: once clEnqueueReadBuffer is complete (I have an event triggered by CL_COMPLETE), is it safe for kernel A to run again without overwriting the data being processed on the host?
Or should I process the data on the host before I allow kernel A to run again?
Because I am seeing a bug in my code indicating that it is not safe for kernel A to run until I process the data on the host.
Thanks!
This is what the OpenCL 1.2 specification has to say about buffers created with CL_MEM_USE_HOST_PTR:
If specified, it indicates that the application wants the OpenCL implementation to use memory referenced by host_ptr as the storage bits for the memory object.
The implication of this is that it is not safe to simultaneously access this buffer from both the host and the device (unless both are just reading). If you want the host and device allocations to be distinct, just create your buffer without the CL_MEM_USE_HOST_PTR flag.
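As a sketch of that last suggestion, the host-side code below creates the buffer without CL_MEM_USE_HOST_PTR and reads the results into a separate host array, so the kernel can be re-enqueued while the host processes its private copy. The function and variable names are illustrative only, the context and queue are assumed to exist already, and error checking is omitted.

```c
/* Host-side sketch: device buffer kept separate from host memory. */
#include <CL/cl.h>
#include <stdlib.h>

void read_results(cl_context ctx, cl_command_queue queue, size_t nbytes)
{
    /* Device-side storage only: no host_ptr is handed to the runtime. */
    cl_int err;
    cl_mem buf_x = clCreateBuffer(ctx,
                                  CL_MEM_READ_WRITE | CL_MEM_HOST_READ_ONLY,
                                  nbytes, NULL, &err);

    float *host_copy = malloc(nbytes);

    /* ... enqueue kernel A writing into buf_x here ... */

    /* Blocking read: when this returns, host_copy is a private snapshot,
     * so kernel A can safely be enqueued again while the host processes it. */
    clEnqueueReadBuffer(queue, buf_x, CL_TRUE, 0, nbytes, host_copy,
                        0, NULL, NULL);

    /* ... process host_copy on the host ... */

    free(host_copy);
    clReleaseMemObject(buf_x);
}
```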

what does the return value of cudaDeviceProp::asyncEngineCount mean?

I have read the documentation, and it said that if it returns 1:
device can concurrently copy memory between host and device while executing a kernel
If it is 2:
device can concurrently copy memory between host and device in both directions and execute a kernel at the same time
What exactly is the difference?
With 1 DMA engine, the device can either download data from the CPU or upload data to the CPU, but not do both simultaneously. With 2 DMA engines, the device can do both in parallel.
Regardless of the number of available DMA engines, the device also has an execution engine that can run a kernel in parallel with ongoing memory operations.
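A small sketch of querying this property through the CUDA runtime API (the file name and output format are just for illustration):

```c
/* Query asyncEngineCount for every visible device.
 * Build: nvcc query_engines.c -o query_engines */
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    for (int d = 0; d < ndev; d++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        /* 0: no copy/compute overlap; 1: one copy engine (one direction
         * at a time); 2: two copy engines (both directions concurrently). */
        printf("device %d (%s): asyncEngineCount = %d\n",
               d, prop.name, prop.asyncEngineCount);
    }
    return 0;
}
```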

Munin Graphs meaning

I've been using Munin for some days and I think the information is very interesting, but I don't understand some of the graphs, or how they can be read and used to get information for improving the system.
The ones I don't understand are:
Disk
Disk throughput per device
Inode usage in percent
IOstat
Firewall Throughput
Processes
Fork rate
Number of threads
VMstat
System
Available entropy
File table usage
Individual interrupts
Inode table usage
Interrupts and context switches
Ty!
Munin creates graphs that enable you to see trends. This is very useful for checking that a change you made doesn't negatively impact the performance of the system.
Disk
Disk throughput per device & IOstat
The amount of data written to or read from a disk device. Disks are always slow compared to memory. A lot of disk reads could, for example, indicate that your database server doesn't have enough RAM.
Inode usage in percent
Every filesystem has an index where information about files is stored, such as name, permissions, and location on the disk. With many small files, the space available for this index can run out. If that happens, no new files can be saved to that filesystem, even if there is enough free space on the device.
Firewall Throughput
Just like it says, the number of packets going through the iptables firewall. Often this firewall is active on all interfaces on the system. This is only really interesting if you run Munin on a router/firewall/gateway system.
Processes
Fork rate
Processes are created by forking an existing process into two processes. This is the rate at which new processes are created.
Number of threads
The total number of processes running in the system.
VMstat
Usage of CPU time.
running: time spent running non-kernel code
I/O sleep: time spent waiting for I/O
System
Available entropy: the entropy is a measure of the random data available from /dev/urandom. These random numbers are needed, for example, to create SSL connections. If you create a large number of SSL connections, this randomness pool could possibly run out of real random numbers.
File table usage
The total number of files open in the system. If this number suddenly goes up there might be a program that is not releasing its file handles properly.