SSE4 memory load optimizations - Intel

When using SIMD instructions/intrinsics (SSE/AVX), say for the 256-bit registers, has anyone been able to reduce the time spent loading the extended registers from memory, either by using the prefetch instruction on the next 32-byte chunk or by some other technique? Assume the data to be loaded is already properly aligned in memory.

See the x86 tag wiki for more info about x86 CPU performance. Hardware prefetchers are pretty good at locking onto patterns of sequential access, so you don't usually need software prefetch instructions.
Usually it's not a win to do a wide vector load and unpack it into separate integer registers. Once you've touched a cache line, more loads from it are cheap, and throughput from L1 cache into registers isn't usually the problem. Using ALU instructions to unpack a 256-bit load into separate 32- or 64-bit integers just takes more instructions and means you're more likely to bottleneck on ALU throughput.
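To make this concrete, here is a minimal sketch (mine, not from the answer above) of an aligned 256-bit load loop with an explicit software prefetch a fixed distance ahead. The function name sum_avx2 and the prefetch distance are illustrative, and AVX2 is assumed for the packed integer add; for a purely sequential walk like this, the hardware prefetcher usually makes the _mm_prefetch redundant in practice.

```c
/* Sketch: summing a 32-byte-aligned int array with 256-bit loads, plus an
 * optional software prefetch a fixed distance ahead.  Assumes AVX2 and that
 * n is a multiple of 8; compile with e.g. gcc -O2 -mavx2. */
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

int32_t sum_avx2(const int32_t *data, size_t n)   /* data must be 32-byte aligned */
{
    __m256i acc = _mm256_setzero_si256();
    const size_t prefetch_dist = 16;              /* elements ahead; tune per workload */

    for (size_t i = 0; i < n; i += 8) {
        if (i + prefetch_dist < n)
            _mm_prefetch((const char *)&data[i + prefetch_dist], _MM_HINT_T0);
        __m256i v = _mm256_load_si256((const __m256i *)&data[i]);
        acc = _mm256_add_epi32(acc, v);
    }

    /* Horizontal sum of the 8 lanes. */
    __m128i lo = _mm256_castsi256_si128(acc);
    __m128i hi = _mm256_extracti128_si256(acc, 1);
    __m128i s  = _mm_add_epi32(lo, hi);
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
}
```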

Related

Memory Object Assignation to Context Mechanism In OpenCL

I'd like to know what exactly happens when we assign a memory object to a context in OpenCL.
Does the runtime copy the data to all of the devices which are associated with the context?
I'd be thankful if you help me understand this issue :-)
Generally, the copy happens when the runtime handles the clEnqueueWriteBuffer / clEnqueueReadBuffer commands.
However, if you created the memory object using certain combinations of flags, the runtime can choose to copy the memory sooner than that (like right after creation) or later (like on-demand before running a kernel or even on-demand as it needs it). Vendor documentation often indicates if they take special advantage of any of these flags.
A couple of the "interesting" variations:
Shared memory (Intel integrated graphics GPUs, AMD APUs, and CPU drivers): You can allocate a buffer and never copy it to the device because the device can access host memory.
On-demand paging: Some discrete GPUs can copy buffer memory over PCIe as it is read or written by a kernel.
Those are both "advanced" usage of OpenCL buffers. You should probably start with "regular" buffers and work your way up if they don't do what you need.
This post describes the extra flags fairly well.
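As a concrete illustration (not from the answer above, just a sketch of the common cases), here is host code creating a "regular" buffer plus the two flag-driven variants mentioned; whether CL_MEM_USE_HOST_PTR actually avoids the copy is up to the vendor's runtime. The function name and parameters are made up for the example, and error checking is trimmed.

```c
#include <CL/cl.h>

void create_buffers(cl_context ctx, cl_command_queue queue, size_t bytes, float *host_data)
{
    cl_int err;

    /* "Regular" buffer: allocation only; data is copied explicitly later. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
    /* The copy to the device typically happens here (or is deferred until a
     * kernel actually needs the buffer, at the runtime's discretion). */
    err = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, bytes, host_data, 0, NULL, NULL);

    /* Copy at creation time: the runtime snapshots host_data immediately. */
    cl_mem buf_copy = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                     bytes, host_data, &err);

    /* Zero-copy candidate: on shared-memory devices (integrated GPUs, CPU
     * drivers) the runtime may use host_data in place instead of copying. */
    cl_mem buf_use = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                    bytes, host_data, &err);

    clReleaseMemObject(buf);
    clReleaseMemObject(buf_copy);
    clReleaseMemObject(buf_use);
}
```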

Is OpenMP and MPI hybrid program faster than pure MPI?

I am developing a program that runs on a 4-node cluster with 4 cores on each node. I have a quite fast OpenMP version of the program that only runs on one node, and I am trying to scale it using MPI. Due to my limited experience I am wondering which would give me faster performance: an OpenMP/MPI hybrid architecture or an MPI-only architecture? I have seen this slide claiming that the hybrid one generally cannot outperform the pure MPI one, but it does not give supporting evidence and is kind of counter-intuitive to me.
BTW, my platform uses InfiniBand to interconnect nodes.
Thanks a lot,
Bob
Shared memory is usually more efficient than message passing, as the latter usually requires increased data movement (moving data from the source to its destination) which is costly both performance-wise and energy-wise. This cost is predicted to keep growing with every generation.
The material states that MPI-only applications are usually on-par or better than hybrid applications, although they usually have larger memory requirements.
However, those results are based on the fact that most of the large hybrid applications shown followed a parallel computation then serialized communication scheme.
This kind of implementation is usually susceptible to the following problems:
Non-uniform memory access: having two sockets in a single node is a popular setup in HPC. Since modern processors have their memory controller on chip, half of the memory will be easily accessible from the local memory controller, while the other half has to pass through the remote memory controller (i.e., the one present in the other socket). Therefore, how the program allocates memory is very important: if the memory is reserved in the serialized phase (on the closest possible memory), then half of the cores will suffer longer main memory accesses.
Load balance: each *parallel computation then serialized communication* phase implies a synchronization barrier. These barriers force the fastest cores to wait for the slowest cores in a parallel region. Fastest/slowest imbalance may be caused by OS preemption (time is shared with other system processes), dynamic frequency scaling, etc.
Some of these issues are more straightforward to solve than others. For example, the multiple-socket NUMA problem can be mitigated by placing different MPI processes in different sockets inside the same node.
To really exploit the efficiency of shared memory parallelism, the best option is trying to overlap communication with computation and ensure load balance between all processes, so that the synchronization cost is mitigated.
However, developing hybrid applications which are both load balanced and do not impose big synchronization barriers is very difficult, and nowadays there is a strong research effort to address this complexity.
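As a rough illustration of that overlap idea, here is a hedged hybrid sketch: a non-blocking MPI halo exchange posted first, OpenMP computation on the interior rows while the messages are in flight, then the thin boundary layer once the exchange completes. update_interior and update_boundary are hypothetical placeholders for the application's stencil, an n×n row-major field is assumed, and per-socket process placement would be handled by the launcher (e.g. Open MPI's --map-by socket).

```c
#include <mpi.h>
#include <omp.h>

void update_interior(double *field, int row, int n);                              /* hypothetical */
void update_boundary(double *field, const double *lo, const double *hi, int n);   /* hypothetical */

void halo_step(double *field, double *halo_lo, double *halo_hi,
               int n, int left, int right, MPI_Comm comm)
{
    MPI_Request reqs[4];

    /* Post the halo exchange first (left/right may be MPI_PROC_NULL at the
     * domain edges, which turns the calls into no-ops). */
    MPI_Irecv(halo_lo, n, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Irecv(halo_hi, n, MPI_DOUBLE, right, 1, comm, &reqs[1]);
    MPI_Isend(field,               n, MPI_DOUBLE, left,  1, comm, &reqs[2]);  /* first row */
    MPI_Isend(field + (n - 1) * n, n, MPI_DOUBLE, right, 0, comm, &reqs[3]);  /* last row  */

    /* ...then compute the interior rows (which need no halo) while the
     * messages are in flight. */
    #pragma omp parallel for
    for (int i = 1; i < n - 1; ++i)
        update_interior(field, i, n);

    /* Finish the exchange, then update the thin boundary layer. */
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
    update_boundary(field, halo_lo, halo_hi, n);
}
```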

OpenCL shared memory optimisation

I am solving a 2d Laplace equation using OpenCL.
The global memory access version runs faster than the one using shared memory.
The algorithm used for shared memory is the same as that in the OpenCL Game of Life code.
https://www.olcf.ornl.gov/tutorials/opencl-game-of-life/
If anyone has faced the same problem please help. If anyone wants to see the kernel I can post it.
If your global-memory version really runs faster than your local-memory version (assuming both are equally optimized for the memory space they use), maybe this paper could answer your question.
Here's a summary of what it says:
Usage of local memory in a kernel adds another constraint on the number of concurrent workgroups that can be run on the same compute unit.
Thus, in certain cases, it may be more efficient to remove this constraint and live with the high latency of global memory accesses. More wavefronts (warps in NVidia-parlance, each workgroup is divided into wavefronts/warps) running on the same compute unit allow your GPU to hide latency better: if one is waiting for a memory access to complete, another can compute during this time.
In the end, each kernel will take more wall-time to proceed, but your GPU will be completely busy because it is running more of them concurrently.
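For reference, below is roughly the kind of local-memory tiling being discussed, sketched by me in the style of the linked Game of Life tutorial rather than taken from it: a 5-point Laplace/Jacobi update that stages a (TILE+2)×(TILE+2) tile in __local memory. The per-workgroup local allocation is exactly the occupancy constraint described above, which is how a plain global-memory version can end up faster. TILE is illustrative; a TILE×TILE workgroup size and dimensions that are multiples of TILE are assumed.

```c
#define TILE 16

__kernel void laplace_local(__global const float *in,
                            __global float *out,
                            const int width, const int height)
{
    __local float tile[TILE + 2][TILE + 2];

    const int gx = get_global_id(0), gy = get_global_id(1);
    const int lx = get_local_id(0) + 1, ly = get_local_id(1) + 1;

    /* Centre cell plus halo (work-items on the tile edges also fetch their
     * outside neighbour). */
    tile[ly][lx] = in[gy * width + gx];
    if (get_local_id(0) == 0        && gx > 0)          tile[ly][0]        = in[gy * width + gx - 1];
    if (get_local_id(0) == TILE - 1 && gx < width - 1)  tile[ly][TILE + 1] = in[gy * width + gx + 1];
    if (get_local_id(1) == 0        && gy > 0)          tile[0][lx]        = in[(gy - 1) * width + gx];
    if (get_local_id(1) == TILE - 1 && gy < height - 1) tile[TILE + 1][lx] = in[(gy + 1) * width + gx];

    barrier(CLK_LOCAL_MEM_FENCE);

    if (gx > 0 && gx < width - 1 && gy > 0 && gy < height - 1)
        out[gy * width + gx] = 0.25f * (tile[ly][lx - 1] + tile[ly][lx + 1] +
                                        tile[ly - 1][lx] + tile[ly + 1][lx]);
}
```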
No, it doesn't. It only says that ALL OTHER THINGS BEING EQUAL, an access from local memory is faster than an access from global memory. It seems to me that global accesses in your kernel are being coalesced which yields better performance.
Using shared memory (memory shared with the CPU) isn't always going to be faster. On a modern graphics card it would only be faster in the situation where the GPU and CPU are both performing operations on the same data and need to share information with each other, as memory wouldn't have to be copied from the card to the system and vice versa.
However, if your program is running entirely on the GPU, it could very well execute faster by running in local memory (GDDR5) exclusively, since the GPU's memory will not only likely be much faster than your system's, there will not be any latency caused by reading memory over the PCI-E lanes.
Think of the graphics card's memory as a kind of "L3 cache" and your system's memory as a resource shared by the entire system; you only use it when multiple devices need to share information (or if your cache is full). I'm not a CUDA or OpenCL programmer, I've never even written Hello World in either. I've only read a few white papers, it's just common sense (or maybe my Computer Science degree is useful after all).

How to "stream" data from and to global memory?

The codeproject.com showcase Part 2: OpenCL™ – Memory Spaces states that Global memory should be considered as streaming memory [...] and that the best performance will be achieved when streaming contiguous memory addresses or memory access patterns that can exploit the full bandwidth of the memory subsystem.
My understanding of this sentence is that for optimal performance one should constantly fill and read global memory while the GPU is working on the kernels. But I have no idea how I would implement such a concept, and I am not able to recognize it in the (rather simple) examples and tutorials I've read.
Do you know a good example, or can you link to one?
Bonus question: Is this analogous in the CUDA framework?
I agree with talonmies about his interpretation of that guideline: sequential memory accesses are fastest. It's pretty obvious (to any OpenCL-capable developer) that sequential memory accesses are the fastest though, so it's funny that NVidia explicitly spells it out like that.
Your interpretation, although not what that document is saying, is also correct. If your algorithm allows it, it is best to upload in reasonably sized chunks asynchronously so it can get started on the compute sooner, overlapping compute with DMA transfers to/from system RAM.
It is also helpful to have more than one wavefront/warp, so the device can interleave them to hide memory latency. Good GPUs are heavily optimized to be able to do this switching extremely fast to stay busy while blocked on memory.
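One hedged way to realize that overlap on the host side is a ping-pong of two buffers with non-blocking writes, sketched below. The function and buffer names are illustrative; for true overlap a real implementation would want an out-of-order queue (or a second queue), plus an event making each re-write of a buffer wait for the kernel that last read it, and proper error checking. CHUNK is assumed to divide total_bytes, and the kernel is assumed to take one buffer argument.

```c
#include <CL/cl.h>

#define CHUNK (1 << 20)   /* 1 MiB per transfer; illustrative */

void process_stream(cl_command_queue queue, cl_kernel kernel,
                    cl_mem buf[2], const char *host_src, size_t total_bytes,
                    size_t work_items)
{
    cl_event write_done[2] = {NULL, NULL};

    for (size_t off = 0, i = 0; off < total_bytes; off += CHUNK, ++i) {
        int cur = i & 1;

        /* Non-blocking upload of the next chunk... */
        clEnqueueWriteBuffer(queue, buf[cur], CL_FALSE, 0, CHUNK,
                             host_src + off, 0, NULL, &write_done[cur]);

        /* ...and a kernel launch that only waits for *this* chunk's upload,
         * so the write of chunk i+1 can proceed while chunk i is computed. */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf[cur]);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &work_items, NULL,
                               1, &write_done[cur], NULL);
        clReleaseEvent(write_done[cur]);
    }
    clFinish(queue);
}
```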
"My understanding of this sentence is that for optimal performance one should constantly fill and read global memory while the GPU is working on the kernels."
That isn't really a correct interpretation.
Typical OpenCL devices (i.e. GPUs) have extremely high bandwidth, high latency global memory systems. This sort of memory system is highly optimized for contiguous or linear memory access. What that piece you quote is really saying is that OpenCL kernels should be designed to access global memory in the sort of contiguous fashion which is optimal for GPU memory. NVIDIA calls this sort of optimal, contiguous memory access "coalesced", and discusses memory access pattern optimization for their hardware in some detail in both their CUDA and OpenCL guides.
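To show the difference in kernel code, here is a small illustrative pair of kernels (mine, not from any of the guides mentioned): in the first, adjacent work-items read adjacent addresses at every loop iteration, so the loads of a wavefront/warp combine into a few wide memory transactions; in the second, neighbouring work-items touch addresses a full row apart, which defeats coalescing.

```c
__kernel void copy_coalesced(__global const float *in, __global float *out,
                             const int width, const int height)
{
    int col = get_global_id(0);
    for (int row = 0; row < height; ++row)      /* consecutive ids -> consecutive addresses */
        out[row * width + col] = in[row * width + col];
}

__kernel void copy_strided(__global const float *in, __global float *out,
                           const int width, const int height)
{
    int row = get_global_id(0);
    for (int col = 0; col < width; ++col)       /* consecutive ids -> addresses `width` apart */
        out[row * width + col] = in[row * width + col];
}
```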

UNIX Domain sockets vs Shared Memory (Mapped File)

Can anyone tell how slow UNIX domain sockets are, compared to shared memory (or the alternative, a memory-mapped file)?
Thanks.
It's more a question of design than speed (shared memory is faster); domain sockets are definitely more UNIX-style, and cause a lot fewer problems. In terms of choice, know beforehand:
Domain Sockets advantages
blocking and non-blocking mode and switching between them
you don't have to free them when tasks are completed
Domain sockets disadvantages
must read and write in a linear fashion
Shared Memory advantages
non-linear storage
will never block
multiple programs can access it
Shared Memory disadvantages
need locking implementation
need manual freeing, even if unused by any program
That's all I can think of now. However, I'd go with domain sockets any day -- not to mention that it's then a lot easier to reimplement them to do distributed computing. The speed gain of shared memory will be lost because of the need for a safe design. However, if you know exactly what you're doing, and use the proper kernel calls, you can achieve greater speed with shared memory.
In terms of speed shared memory is definitely the winner. With sockets you will have at least two copies of the data - from sending process to the kernel buffer, then from the kernel to the receiving process. With shared memory the latency will only be bound by the cache consistency algorithm between the cores on the box.
As Kornel notes though, dealing with shared memory is more involved since you have to come up with your own synchronization/signalling scheme, which might add a delay depending on which route you go. Definitely use semaphores in shared memory (implemented with futex on Linux) to avoid system calls in the non-contended case.
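A hedged sketch of that suggestion, using POSIX shared memory with a process-shared semaphore; the struct, function, and "/demo_chan" names are made up for the example, error handling is trimmed, and on Linux you link with -lrt -pthread.

```c
#include <fcntl.h>
#include <semaphore.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

struct channel {
    sem_t lock;          /* process-shared mutex-style semaphore */
    char  data[4096];
};

struct channel *channel_create(const char *name)   /* e.g. "/demo_chan" */
{
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    ftruncate(fd, sizeof(struct channel));
    struct channel *ch = mmap(NULL, sizeof(struct channel),
                              PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    sem_init(&ch->lock, /*pshared=*/1, /*value=*/1);
    return ch;
}

void channel_write(struct channel *ch, const char *msg)
{
    sem_wait(&ch->lock);      /* glibc fast path: no syscall when uncontended */
    strncpy(ch->data, msg, sizeof(ch->data) - 1);
    sem_post(&ch->lock);
}
```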
Both are inter process communication (IPC) mechanisms.
UNIX domain sockets are used for communication between processes on one host, similar to how TCP sockets are used between different hosts.
Shared memory (SHM) is a piece of memory where you can put data and share it between processes.
SHM provides random access via pointers; sockets can be written or read, but you cannot rewind or do positioning.
@Kornel Kisielewicz's answer is good IMO. Just adding my own results here for sockets, not only UNIX domain sockets.
Shared Memory
Performance is very high. No copies, you access the data raw. Fastest access for sure.
Synchronization needed. Design not so easy to set up for complex cases.
Fixed size. Growing shared memory is doable, but the memory has to be unmapped first, grown, and then remapped.
The signaling mechanism can be quite slow, see here: Boost.Interprocess notify() performance. Especially if you want to do lots of exchanges between processes. The signaling mechanism is not so easy to set up either.
Sockets
Easy to setup.
Can be used on different machines.
No complex synchronisation needed.
Size is not a problem if you use TCP. Simple design: a header containing the packet size, then send the data (see the framing sketch after this answer).
Ping/pong exchange is fast because it can be treated like a hardware interrupt by the OS.
Performance is average: a few copies of the data are made.
High CPU consumption compared to shared memory. Socket calls are not that cheap if you use them a lot.
In my tests, exchanges of small chunks of data (around 1 MByte/second) showed no real advantage for shared memory. I would even say that ping/pong exchanges were faster using TCP (due to the simple and efficient signaling mechanism). BUT when exchanging large amounts of data (around 200 MBytes/second), I had 20% CPU consumption with sockets, compared to 3% CPU using shared memory. So a huge win for shared memory in terms of CPU, because read and write socket calls are not cheap.
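For completeness, here is a hedged sketch of the length-prefixed framing mentioned in the list above, over a connected stream socket; the helper names are illustrative and error handling and partial-write loops are kept to the essentials.

```c
#include <stdint.h>
#include <unistd.h>

int send_frame(int fd, const void *payload, uint32_t len)
{
    uint32_t hdr = len;                       /* same host, so no byte-order swap needed */
    if (write(fd, &hdr, sizeof(hdr)) != sizeof(hdr)) return -1;
    return write(fd, payload, len) == (ssize_t)len ? 0 : -1;
}

int recv_frame(int fd, void *buf, uint32_t bufsize)
{
    uint32_t len;
    if (read(fd, &len, sizeof(len)) != sizeof(len) || len > bufsize) return -1;
    uint32_t got = 0;
    while (got < len) {                       /* stream socket: reads may be short */
        ssize_t n = read(fd, (char *)buf + got, len - got);
        if (n <= 0) return -1;
        got += (uint32_t)n;
    }
    return (int)len;
}
```

For testing on a single host, a pair of connected descriptors can come from socketpair(AF_UNIX, SOCK_STREAM, 0, fds); the same framing works over TCP as well.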
In this case, sockets are faster. Writing to shared memory is faster than any IPC, but writing to a memory-mapped file and writing to shared memory are two completely different things.
When writing to a memory-mapped file you need to "flush" what was written to the shared memory to an actual backing file (not exactly, the flush is being done for you), so you copy your data first to the shared memory, and then you copy it again (flush) to the actual file, and that is super duper expensive - more than anything, even more than writing to a socket. You gain nothing by doing that.
