Why are PMM (Intel Optane Persistent Memory) event counters on Intel Xeon Cascade Lake processors giving non-deterministic results?

Some of the events such as unc_m_pmm_wpq_inserts, unc_m_pmm_rpq_inserts, unc_m2m_imc_writes.to_pmm, unc_m2m_imc_reads.to_pmm are giving drastically different values on multiple runs of the same program with the same parameters. What might be the issue here?

Related

Should I aim for multiple physical cores or multiple threads for parallel computing in R?

I am new to parallel computing, so this may be a trivial question. I am thinking about which processor to choose for parallel computing (on a single machine). In particular, I would like to know whether I should aim for a high number of (physical) cores or a high number of threads.
I am working with R (package parallel) on Windows. Typically, the datasets are not large, so the limit is not the memory but the number and duration of independent processes run on the data.
I understood that parallel makes use of logical cores (i.e., hardware threads) but that such threads do not work truly in parallel because they share “execution resources” https://en.wikipedia.org/wiki/Hyper-threading. So, would e.g., 4 (physical) cores with 1 thread each result in more speed (throughput) than 2 (physical) cores with 2 threads each (i.e., 4 logical cores)?
Suggestions on specific processors are also more than welcome.
For memory- or I/O-intensive workloads, an HT-enabled processor provides better performance and power efficiency at lower cost. For compute-intensive workloads, the possible gain from the additional logical threads shrinks. Your application seems to be compute intensive. If that is the only kind of workload the system has to execute, look for a system with a higher physical core count.
The problem is that only a limited number of processors ship without logical threads: most Intel processors support Hyper-Threading, and the price difference between parts with and without HT is small. With an HT-enabled processor a system can handle more diverse workloads and multitask more efficiently.
If needed, HT can be disabled in the BIOS.
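To make the physical-vs-logical distinction concrete for CPU-bound work, here is a minimal sketch (in Python rather than R, purely for illustration; the same idea applies to the worker clusters the parallel package creates). Note that os.cpu_count() reports logical cores, so a pool of that size on a Hyper-Threaded machine runs two workers per physical core, which rarely doubles throughput for purely arithmetic tasks:

```python
import multiprocessing as mp
import os

def cpu_bound(n):
    # Busy arithmetic loop: a stand-in for one independent CPU-bound task.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    # os.cpu_count() reports *logical* cores; with Hyper-Threading this is
    # typically twice the number of physical cores.
    logical = os.cpu_count()
    with mp.Pool(processes=logical) as pool:
        results = pool.map(cpu_bound, [100_000] * 8)
    # Every task computes the same sum of squares 0^2 + ... + 99_999^2.
    expected = sum(i * i for i in range(100_000))
    assert all(r == expected for r in results)
```

Timing this pool at different sizes on your own machine (e.g. 2, 4, and 8 workers on a 4-core/8-thread part) is a quick way to see where the extra logical cores stop helping for your workload.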

How do the OS and the driver affect OpenCL kernel timing?

For measuring an OpenCL kernel's execution time we can use either:
1- CPU timers .. but we need to remember that the OpenCL enqueue functions are non-blocking, so we must call the clFinish() routine to make sure the kernel has actually completed before stopping the timer.
2- GPU timers .. that is, using the clGetEventProfilingInfo() routine along with setting the CL_QUEUE_PROFILING_ENABLE flag in the properties argument of either the clCreateCommandQueue() or clSetCommandQueueProperty() routines.
How can the operating system and the driver version affect the accuracy of the timers used to measure the kernel execution time?
All that I know is that we need to warm-up the device with at least one kernel call to absorb the latency of the OpenCL resource allocation at the very beginning.
1- You will not get accurate timings using CPU timers alone: the kernel launch is non-blocking, part of the measured time is spent in the driver, and the results can vary with context switches on the OS side.
2- GPU timers rely on GPU hardware counters. Reading those counters through events gives you the most accurate timings you can get. Since neither the CPU nor the OS touches the GPU hardware counters, their effect is none; the only component that can matter is the driver, depending on how it handles the counters.
The warm-up is for data transfers and memory allocation, so it does not affect how the hardware counters behave.
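The non-blocking pitfall in point 1 can be illustrated without any OpenCL at all. The sketch below (plain Python, with a thread pool standing in for the command queue and a sleep standing in for kernel work) contrasts timing only the "enqueue" with timing through a blocking wait, the analogue of clFinish():

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_kernel():
    # Stand-in for ~50 ms of device work.
    time.sleep(0.05)

with ThreadPoolExecutor(max_workers=1) as queue:
    # Timing only the submission: returns almost immediately, like timing
    # clEnqueueNDRangeKernel without a following clFinish().
    t0 = time.perf_counter()
    future = queue.submit(fake_kernel)
    enqueue_time = time.perf_counter() - t0

    # Timing through the blocking wait, the analogue of clFinish().
    t0 = time.perf_counter()
    future.result()
    full_time = time.perf_counter() - t0

# The submission alone misses essentially all of the real work.
assert enqueue_time < full_time
```

The same logic is why a host-side timer around a non-blocking enqueue reports microseconds for a kernel that actually ran for milliseconds.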

Is OpenMP and MPI hybrid program faster than pure MPI?

I am developing a program that runs on a 4-node cluster with 4 cores on each node. I have a quite fast OpenMP version of the program that only runs on one node, and I am trying to scale it out using MPI. Due to my limited experience I am wondering which would give me better performance, a hybrid OpenMP/MPI architecture or an MPI-only architecture? I have seen this slide claiming that the hybrid one generally cannot outperform the pure MPI one, but it gives no supporting evidence and is kind of counter-intuitive to me.
BTW, My platform use infiniband to interconnect nodes.
Thanks a lot,
Bob
Shared memory is usually more efficient than message passing, as the latter usually requires increased data movement (moving data from the source to its destination) which is costly both performance-wise and energy-wise. This cost is predicted to keep growing with every generation.
The material states that MPI-only applications are usually on par with or better than hybrid applications, although they usually have larger memory requirements.
However, those results rest on the fact that most of the large hybrid applications shown followed a pattern of parallel computation followed by serialized communication.
This kind of implementation is usually susceptible to the following problems:
Non-uniform memory access: having two sockets in a single node is a popular setup in HPC. Since modern processors have their memory controller on chip, half of the memory will be easily accessible from the local memory controller, while the other half has to pass through the remote memory controller (i.e., the one in the other socket). Therefore, how the program allocates memory is very important: if the memory is reserved in the serialized phase (on the closest possible memory), then half of the cores will suffer longer main-memory accesses.
Load balance: each *parallel computation, then serialized communication* phase implies a synchronization barrier. These barriers force the fastest cores to wait for the slowest cores in a parallel region. Fast/slow imbalance may be caused by OS preemption (time is shared with other system processes), dynamic frequency scaling, etc.
Some of these issues are more straightforward to solve than others. For example,
the multiple-socket NUMA problem can be mitigated by placing different MPI processes in different sockets inside the same node.
To really exploit the efficiency of shared memory parallelism, the best option is trying to overlap communication with computation and ensure load balance between all processes, so that the synchronization cost is mitigated.
However, developing hybrid applications which are both load balanced and do not impose big synchronization barriers is very difficult, and nowadays there is a strong research effort to address this complexity.
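The load-balance point can be made concrete with a little arithmetic: with a barrier after every parallel phase, each phase costs as much as its slowest worker, so per-phase noise accumulates; with perfect overlap only each worker's total matters. A small sketch with hypothetical per-phase timings (the numbers are made up for illustration):

```python
# Hypothetical per-phase times (seconds) for 4 workers over 3 phases.
# In each phase a different worker is slowed down (OS preemption,
# dynamic frequency scaling, ...).
phase_times = [
    [1.0, 1.0, 1.0, 1.3],   # phase 1: worker 3 is slow
    [1.0, 1.2, 1.0, 1.0],   # phase 2: worker 1 is slow
    [1.1, 1.0, 1.0, 1.0],   # phase 3: worker 0 is slow
]

# With a synchronization barrier after each phase, every phase costs
# as much as its slowest worker: 1.3 + 1.2 + 1.1 = 3.6 s.
with_barriers = sum(max(phase) for phase in phase_times)

# If phases overlapped perfectly (no barriers), the run would cost as
# much as the slowest *per-worker total*: max(3.1, 3.2, 3.0, 3.3) = 3.3 s.
no_barriers = max(sum(worker) for worker in zip(*phase_times))

assert with_barriers > no_barriers
```

The gap grows with the number of phases and the amount of noise, which is why removing or hiding the barriers (overlapping communication with computation) pays off.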

Parallelism in OpenCL on 1 cpu device

Is it possible to achieve the same level of parallelism with a multi-core CPU device as with multiple heterogeneous devices (like GPU and CPU) in OpenCL?
I have an intel i5 and am looking to optimise my code. When I query the platform for devices I get only one device returned: the CPU. I was wondering how I could optimise my code by using this.
Also, if I used a single command queue for this device, would the application automatically assign the kernels to different compute devices or does it have to be done manually by the programmer?
Can a cpu device achieve the same level of parallelism as a gpu? Pretty much always no.
The number of compute units in a gpu is almost always more than in a cpu. For example, $50 can get you a video card with 10 compute units (Radeon 6450). The cheapest 8-core cpus on newegg are going for $189 (desktop cpu) and $269 (server).
The compute units of a cpu will run faster due to clock speed, and execute branching code much better than a gpu. You want a cpu if your workload has a lot of conditional statements.
A gpu will execute the same instructions on many pieces of data. The 6450 gpu has 16 'stream processors' per compute unit to make this happen. Gpus are great when you have to do the same (small/medium) tasks many times. Matrix multiplication, n-body computations, reduction operations, and some sorting algorithms run much better on gpu/accelerator hardware than on a cpu.
I answered a similar question with more detail a few weeks ago. (This one)
Getting back to your question about the "same level of parallelism" -- cpus don't have the same level of parallelism as gpus, except in cases where the gpu underperforms on the execution of the actual kernel.
On your i5 system, there would be only one cpu device. This represents the entire cpu. When you query for the number of compute units, opencl will return the number of cores you have. If you want to use all cores, you just run the kernel on your device, and opencl will use all of the compute units (cores) for you.
Short answer: yes, it will run in parallel and no, no need to do it manually.
Long answer:
Also, if I used a single command queue for this device, would the application automatically assign the kernels to different compute devices [...]
Either you need to revise your OpenCL vocabulary or I didn't understand your question. You only have one device and core != device!
One CPU, regardless of how many cores it has, is one device. The same goes for a GPU: one GPU, which has hundreds of cores, is only one device. You send jobs to the device through the queue and the device's driver. Your jobs can (and will) be split up into work-items. Then some work-items (how many depends on the device/driver) are executed in parallel. On the GPU as well as on the CPU, each work-item executes one instance of the kernel. (This might not be completely true, but it is a very helpful abstraction.)
If you enqueue several kernels in one queue (without connecting them through a wait event!), the driver may or may not run them in parallel.
It is the very goal of OpenCL to allow you to compute work-items in parallel, regardless of whether it is using several devices' cores or only a single device's cores.
If this confuses you, watch these really good (and long) videos: http://macresearch.org/opencl
How are you determining the OpenCL device count? I have an Intel i3 laptop that gives me 2 OpenCL compute units; it has 2 cores.
According to Intel's spec, an i5-2300 has 4 cores and supports 4 threads; it isn't hyper-threaded. I would expect an OpenCL query of the number of compute units to give you a count of 4.
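As the comments above suggest, what OpenCL typically reports as CL_DEVICE_MAX_COMPUTE_UNITS on a CPU device is the logical core count. A quick OpenCL-free cross-check of that number (Python sketch; on Hyper-Threaded parts the logical count is typically twice the physical count):

```python
import os

# os.cpu_count() returns the number of *logical* cores (hardware threads),
# which is the figure an OpenCL CPU device usually reports as its
# compute-unit count.
logical = os.cpu_count()
print(f"logical cores: {logical}")
assert isinstance(logical, int) and logical >= 1
```

Comparing this number against the compute-unit count your OpenCL runtime reports is an easy sanity check of which cores the CPU device actually covers.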

OpenCL Execution model multiple queued kernels

I was curious as to how the GPU executes the same kernel multiple times.
I have a kernel which is being queued hundreds (possibly thousands) of times in a row, and using the AMD App Profiler I noticed that it would execute clusters of kernels extremely fast, then like clockwork every so often a kernel would "hang" (i.e. take orders of magnitude longer to execute). I think it's every 64th kernel that hangs.
This is odd because each time through the kernel performs the exact same operations with the same local and global sizes. I'm even re-using the same buffers.
Is there something about the execution model that I'm missing (perhaps other programs/the OS accessing the GPU, or the timing frequency of the GPU memory)? I'm testing this on an ATI HD5650 card under Windows 7 (64-bit), with AMD APP SDK 2.5 and in-order queue execution.
As a side note, if I don't have any global memory accesses in my kernel (a rather impractical prospect), the gaps remain: where the slow-executing kernels were before, the profiler now shows a large empty region in which none of my kernels execute.
As a follow-up question, is there anything that can be done to fix this?
It's probable you're seeing the effects of your GPU's maximum number of concurrent tasks. Each task enqueued is assigned to one or more multiprocessors, which are frequently capable of running hundreds of workitems at a time - of the same kernel, enqueued in the same call. Perhaps what you're seeing is the OpenCL runtime waiting for one of the multiprocessors to free up. This relates most directly to the occupancy issue - if the work size can't keep the multiprocessor busy, through memory latencies and all, it has idle cycles. The limit here depends on how many registers (local or private memory) your kernel requires. In summary, you want to write your kernel to operate on multiple pieces of data more so than queueing it many times.
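The register limit mentioned above can be sketched numerically. The hardware figures below are hypothetical, chosen only to show the shape of the calculation (real per-multiprocessor register-file sizes and caps vary by GPU generation):

```python
# Hypothetical hardware numbers, for illustration only.
registers_per_multiprocessor = 16384   # size of the register file
registers_per_work_item = 64           # what the compiled kernel uses
max_work_items_hw = 1024               # hardware cap per multiprocessor

# Register pressure limits how many work-items can be resident at once:
# 16384 // 64 = 256, well below the 1024 the hardware could otherwise hold.
resident = min(max_work_items_hw,
               registers_per_multiprocessor // registers_per_work_item)
assert resident == 256

# Occupancy: the fraction of the hardware cap actually kept busy.
occupancy = resident / max_work_items_hw
assert occupancy == 0.25
```

Low occupancy like this leaves the multiprocessor with fewer work-items to switch to while memory requests are in flight, which is exactly the idle-cycle problem described above.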
Did your measurement include reading back results from the apparently fast executions?
