How does an Intel Xeon CPU handle cores contending for AVX units?

I'm designing a signal processing application to run on an Intel Xeon CPU under Linux. It will have several parallel threads, each allocated to its own core. Each will also use the IPP library to speed up calculations using the AVX units. What will happen if I run more AVX-unit-dependent threads than there are AVX units? Will threads just block until an AVX unit is available? Can they be shared somehow? Something more sinister?

Each core can run two hardware threads (Hyper-Threading). If you run many threads and do not explicitly assign them to different cores, the operating system is likely to place two of your threads on the same core. Two threads running on the same core compete for the same execution units, including the AVX units. If execution-unit throughput is the bottleneck, there is no advantage in hyperthreading.
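To avoid that contention you can pin each worker to its own core from the application. Below is a minimal sketch, assuming Linux with glibc and that logical CPUs 0-3 map to distinct physical cores (check your topology first, since the numbering is firmware-dependent); the worker count and core IDs are placeholders:

```cpp
#include <pthread.h>   // pthread_setaffinity_np (GNU extension)
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
#include <cstdio>
#include <thread>
#include <vector>

// Restrict the calling thread to a single logical CPU.
static void pin_current_thread(int cpu_id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu_id, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main() {
    const int workers = 4;  // assumption: one worker per physical core
    std::vector<std::thread> pool;
    for (int cpu = 0; cpu < workers; ++cpu) {
        pool.emplace_back([cpu] {
            pin_current_thread(cpu);   // keep this thread off its HT sibling
            // ... AVX/IPP-heavy processing for this thread goes here ...
            std::printf("worker running on logical CPU %d\n", cpu);
        });
    }
    for (auto &t : pool) t.join();
    return 0;
}
```

Build with g++ -pthread. With this layout no two workers share a core, so they do not contend for the same AVX units.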

Related

CUDA MPS for OpenCL?

CUDA MPS allows you to run multiple processes in parallel on the GPU, keeping it fully utilized even when no single process takes full advantage of it. Is there an equivalent for OpenCL? Or is there a different approach in OpenCL?
If you use multiple OpenCL command queues that don't have event interdependencies, an OpenCL runtime could keep the GPU cores busy with varied work from each queue. It's really up to the implementation as to whether this actually happens. You'd need to check each vendor's OpenCL guide to see if they support concurrent GPU kernels.
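For instance, enqueuing independent kernels to two queues on the same device leaves the runtime free to overlap them. A minimal host-side sketch, assuming an OpenCL 2.0 runtime (older runtimes use clCreateCommandQueue) and that a context, device, and kernels already exist; the names here are placeholders:

```cpp
#include <CL/cl.h>

// Create an in-order queue with default properties. Queues created this way
// share no event dependencies, so the runtime may run their work concurrently.
cl_command_queue make_queue(cl_context ctx, cl_device_id dev, cl_int *err) {
    return clCreateCommandQueueWithProperties(ctx, dev, nullptr, err);
}

// Usage inside existing host code:
//   cl_int err;
//   cl_command_queue qA = make_queue(ctx, dev, &err);
//   cl_command_queue qB = make_queue(ctx, dev, &err);
//   clEnqueueNDRangeKernel(qA, kernelA, 1, nullptr, &gwsA, nullptr, 0, nullptr, nullptr);
//   clEnqueueNDRangeKernel(qB, kernelB, 1, nullptr, &gwsB, nullptr, 0, nullptr, nullptr);
//   clFinish(qA);
//   clFinish(qB);
```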

Hyperthreading makes my code run slower?

Some multithreaded code I just wrote appears to run slower under hyperthreaded CPUs - i.e. disabling hyperthreading makes it run FASTER. Is this normal?
This depends entirely on the use case. A subjective term like "normal" has a lot of leeway! There are use cases where Hyper-Threading (HT) makes sense, and cases where it will hurt performance.
One case where performance decreases is applications making heavy use of AVX instructions. AVX instructions are carried out in the vector processing unit (VPU), of which there is one per core in Intel Xeon processors. When the VPU is busy, the sibling hardware thread on that core stalls waiting for it, so HT brings no performance improvement for this kind of work.
If you have, say, 4 cores with HT, allowing you to run 8 threads, you can still only issue 4 VPU operations at a time, so the other 4 threads stall until a VPU becomes free. The added overhead of that contention and scheduling usually nets you lower throughput than running 4 threads on 4 cores with HT disabled.
Likewise, when running just 4 threads on the 8 logical processors, the OS scheduler can place the threads on any logical processor, so two of them may still land on the same physical core and contend for its VPU. Some newer applications and job schedulers can coordinate with the OS to "pin" threads to physical cores, allowing HT to stay enabled without oversubscribing the number of threads running on a core. Over time this will probably get better, but it does require awareness on the developer's part.
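On Linux, the mapping from logical CPUs to physical cores that such a pinning scheme needs can be read from sysfs. A minimal sketch (Linux only; it simply prints which logical CPUs share a physical core, so a launcher can pick one per core before pinning):

```cpp
#include <fstream>
#include <iostream>
#include <string>

int main() {
    for (int cpu = 0; ; ++cpu) {
        // Standard sysfs topology file listing the HT siblings of this CPU.
        std::string path = "/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                           "/topology/thread_siblings_list";
        std::ifstream f(path);
        if (!f) break;                 // no more logical CPUs
        std::string siblings;
        std::getline(f, siblings);     // e.g. "0,4" means cpu0 and cpu4 share a core
        std::cout << "cpu" << cpu << " shares a physical core with: " << siblings << "\n";
    }
    return 0;
}
```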
For more general-purpose use cases, such as a generic server handling many kinds of workloads, running additional threads per core with HT is usually a net performance gain.

What are some computers that support NUMA?

What are some computers that support NUMA? Also, how many cores are required? I have tried searching in Google and Bing but couldn't find any answers.
NUMA Support
The traditional model for multiprocessor support is symmetric multiprocessor (SMP). In this model, each processor has equal access to memory and I/O. As more processors are added, the processor bus becomes a limitation for system performance.
System designers use non-uniform memory access (NUMA) to increase processor speed without increasing the load on the processor bus. The architecture is non-uniform because each processor is close to some parts of memory and farther from other parts of memory. The processor quickly gains access to the memory it is close to, while it can take longer to gain access to memory that is farther away.
In a NUMA system, CPUs are arranged in smaller systems called nodes. Each node has its own processors and memory, and is connected to the larger system through a cache-coherent interconnect bus.
The system attempts to improve performance by scheduling threads on processors that are in the same node as the memory being used. It attempts to satisfy memory-allocation requests from within the node, but will allocate memory from other nodes if necessary. It also provides an API to make the topology of the system available to applications. You can improve the performance of your applications by using the NUMA functions to optimize scheduling and memory usage.
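On Linux (the platform in the original question), the corresponding user-space interface is libnuma. A minimal sketch, assuming libnuma is installed (link with -lnuma); it allocates a buffer on the node local to the CPU the calling thread is currently running on:

```cpp
#include <numa.h>    // libnuma: numa_available, numa_alloc_onnode, ...
#include <sched.h>   // sched_getcpu
#include <cstdio>

int main() {
    if (numa_available() < 0) {
        std::printf("this system does not support the NUMA API\n");
        return 1;
    }
    int cpu  = sched_getcpu();
    int node = numa_node_of_cpu(cpu);   // node closest to this thread's CPU
    std::printf("running on CPU %d, node %d (highest node: %d)\n",
                cpu, node, numa_max_node());

    const size_t bytes = 64u << 20;              // 64 MiB working set (placeholder)
    void *buf = numa_alloc_onnode(bytes, node);  // memory placed on the local node
    // ... process buf with threads scheduled on CPUs of the same node ...
    numa_free(buf, bytes);
    return 0;
}
```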
************************************************************************
Multiple Processors
Computers with multiple processors are typically designed for one of two architectures: non-uniform memory access (NUMA) or symmetric multiprocessing (SMP).
In a NUMA computer, each processor is closer to some parts of memory than others, making memory access faster for some parts of memory than other parts. Under the NUMA model, the system attempts to schedule threads on processors that are close to the memory being used. For more information about NUMA, see NUMA Support.
In an SMP computer, two or more identical processors or cores connect to a single shared main memory. Under the SMP model, any thread can be assigned to any processor. Therefore, scheduling threads on an SMP computer is similar to scheduling threads on a computer with a single processor. However, the scheduler has a pool of processors, so that it can schedule threads to run concurrently. Scheduling is still determined by thread priority, but it can be influenced by setting thread affinity and thread ideal processor, as discussed in this topic.
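This passage describes the Windows scheduler; the thread affinity and ideal-processor settings it mentions correspond to the Win32 calls SetThreadAffinityMask and SetThreadIdealProcessor. A minimal sketch, assuming a Windows build (error handling omitted); affinity restricts where the thread may run, while the ideal processor is only a scheduler hint:

```cpp
#include <windows.h>

int main() {
    HANDLE self = GetCurrentThread();
    SetThreadAffinityMask(self, 0x1);   // allow this thread to run on processor 0 only
    SetThreadIdealProcessor(self, 0);   // hint: prefer processor 0 when scheduling
    // ... thread work ...
    return 0;
}
```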

Vectorized Code on GPU

I am using OpenCL to execute a procedure on different GPUs and CPUs simultaneously to get high-performance results. The Intel OpenCL runtime always shows a message that the kernel is not vectorized, so it will only run on different cores but will not use SIMD instructions. My question is: if I rewrite the code so that SIMD instructions can be exploited by the OpenCL code, will it also increase GPU performance?
Yes, but beware that this is not necessary for good performance on AMD GCN-based APUs/GPUs or on Nvidia Fermi-or-newer hardware; they execute scalar operations with very good utilization. CPUs and Intel GPUs, however, can benefit greatly from SIMD instructions, which is what the vector operations boil down to.
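As an illustration of what "vectorizing the kernel" can mean here, the second kernel below processes four floats per work-item using OpenCL's float4 type, which Intel's CPU and GPU compilers can map onto SIMD lanes. This is only a sketch; the kernel names and the scaling operation are placeholders:

```c
// Scalar: one float per work-item.
__kernel void scale_scalar(__global const float *in, __global float *out, float k) {
    size_t i = get_global_id(0);
    out[i] = k * in[i];
}

// Vectorized: four floats per work-item (launch with 1/4 the global size).
__kernel void scale_vec4(__global const float4 *in, __global float4 *out, float k) {
    size_t i = get_global_id(0);
    out[i] = k * in[i];
}
```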

How does OpenCL host execute kernels on itself?

When we have a multi-core CPU, OpenCL treats it as a single device with multiple compute units, and for every device we can create one or more command queues. How can the CPU, as the host, create a command queue on itself? I think in this situation it becomes multithreading rather than parallel computing.
Some devices, including most CPU devices, can be segmented into sub-devices using the extension cl_ext_device_fission. When you use device fission, you still get parallel processing in that the host thread can do other tasks while the kernel is running on some of the CPU cores.
When not using device fission, the CPU device will essentially block the host program while a kernel is running. Even if some OpenCL implementations are non-blocking during kernel execution, the performance hit to the host would be too great to allow much work to be done by the host thread.
So it's still parallel computation, but I guess the core running the host application is technically being multithreaded during kernel execution.
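Device fission was promoted from the cl_ext_device_fission extension into core OpenCL 1.2 as clCreateSubDevices. A minimal sketch, assuming an OpenCL 1.2 CPU driver with at least two compute units; it splits the CPU device in half, so a context built on one sub-device leaves the remaining cores free for the host application:

```cpp
#include <CL/cl.h>

int main() {
    cl_platform_id platform;
    cl_device_id cpu;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &cpu, nullptr);

    cl_uint num_units = 0;
    clGetDeviceInfo(cpu, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(num_units), &num_units, nullptr);

    // Partition the CPU device into two sub-devices of equal size.
    cl_device_partition_property props[] = {
        CL_DEVICE_PARTITION_EQUALLY,
        (cl_device_partition_property)(num_units / 2),
        0};
    cl_device_id sub[2];
    cl_uint num_sub = 0;
    clCreateSubDevices(cpu, props, 2, sub, &num_sub);

    // A context and queue created on sub[0] will run kernels on only half the
    // cores, leaving the other half available to the host's own threads.
    return 0;
}
```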
It's parallel computing using multithreading. When you use an OpenCL CPU driver and enqueue kernels for the CPU device, the driver uses threads to execute the kernel in order to fully leverage all of the CPU's cores (and it also typically uses vector instructions such as SSE to fully utilize each core).
