Tree-based communication in MPI barrier

The link below describes the barrier implementation in MPI (Message Passing Interface):
How is barrier implemented in message passing systems?
However, tree-based communication in an MPI barrier is not described there. What would a tree-based implementation of the MPI barrier look like?

Since the major implementations Open MPI and MPICH2 are open source, and some commercial flavors are based on them, you can find out for yourself. The people behind MPICH2 may even have published papers about their implementation.
The answer may even depend on the actual device (shm, tcp, infiniband) being used.
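A common scheme is a two-phase tree barrier: each rank reports its arrival up a tree toward the root, and the root's release signal then flows back down. Below is a minimal sketch of that message pattern in plain Python. The binary-tree shape and all names are illustrative assumptions, not taken from any MPI implementation; real implementations often use binomial trees or dissemination patterns instead.

```python
# Sketch of a two-phase (gather + release) tree barrier. This models the
# message schedule only -- it is not MPI code; names are illustrative.

def tree_parent(rank):
    """Parent of `rank` in a binary tree rooted at rank 0."""
    return None if rank == 0 else (rank - 1) // 2

def tree_children(rank, size):
    """Children of `rank` in the same binary tree."""
    return [c for c in (2 * rank + 1, 2 * rank + 2) if c < size]

def barrier_schedule(size):
    """Return the messages of one barrier as (src, dst, phase) tuples.
    Phase 1 ("arrive"):  every non-root rank notifies its parent upward.
    Phase 2 ("release"): the root's go-ahead propagates back downward."""
    msgs = []
    for rank in range(size):
        p = tree_parent(rank)
        if p is not None:
            msgs.append((rank, p, "arrive"))    # upward gather
        for c in tree_children(rank, size):
            msgs.append((rank, c, "release"))   # downward release
    return msgs

if __name__ == "__main__":
    for m in barrier_schedule(4):
        print(m)
```

With N ranks this costs 2·(N−1) messages in O(log N) rounds, instead of the O(N) rounds of a naive central-counter barrier.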

Why should there be minimal work before MPI_Init?

The documentation for both MPICH and OpenMPI mentions that a minimal amount of work should be done before MPI_Init or after MPI_Finalize:
The MPI standard does not say what a program can do before an MPI_INIT or after an MPI_FINALIZE. In the MPICH implementation, you should do as little as possible.
What is the reason behind this?
To me it seems perfectly reasonable for processes to do a significant amount of calculations before starting the communication with each other.
I believe it was worded like that in order to allow MPI implementations that spawn their ranks within MPI_Init. That means not all ranks are technically guaranteed to exist before MPI_Init. If you had opened file descriptors or performed other things with side effects on the process state, it would become a huge mess.
As far as I know, no major current MPI implementation does that; nevertheless, an MPI implementation might use this requirement for other tricks.
EDIT: I found no evidence of this and only remember it from way back, so I'm not sure about it. I can't seem to find the formulation in the MPI standard that you quoted from MPICH. However, the MPI standard does regulate which MPI functions you may call before MPI_Init:
The only MPI functions that may be invoked before the MPI initialization routines are called are MPI_GET_VERSION, MPI_GET_LIBRARY_VERSION, MPI_INITIALIZED, MPI_FINALIZED, and any function with the prefix MPI_T_.
The MPI_Init documentation of MPICH is giving some hints:
The MPI standard does not say what a program can do before an MPI_INIT or after an MPI_FINALIZE. In the MPICH implementation, you should do as little as possible. In particular, avoid anything that changes the external state of the program, such as opening files, reading standard input or writing to standard output.
BTW, I would not expect MPI_Init to do communications. These would happen later.
And the mpich/init.c implementation is free software; you can study its source code and understand that it is initializing some timers, some threads, etc... (and that should indeed happen really early).
To me it seems perfectly reasonable for processes to do a significant amount of calculations before starting the communication with each other.
Of course, but these should happen after MPI_Init (but before some MPI_Send etc).
On some supercomputers, MPI might use dedicated hardware (like InfiniBand, Fibre Channel, etc.), and there might be hardware or operating-system reasons to initialize it very early. So it makes sense to call MPI_Init very early. BTW, it is also given pointers to main's arguments, and I guess it might modify them before further processing by your main. So the call to MPI_Init is probably best made the first statement of your main.

AES encryption for ESP8266: implemented in software or hardware? How to implement it?

I need to write a basic encryption program for the ESP8266. I did read the datasheet (https://www.espressif.com/sites/default/files/documentation/0a-esp8266ex_datasheet_en.pdf), and it says the following encryption methods exist: WEP/TKIP/AES. My main question is: is the AES method implemented in software or hardware? This module is very simple (36KB RAM, 90MHz CPU clock), so the algorithm is heavy for it to process. If AES is implemented in hardware, I think this task gets simpler, but I don't know how to use it. I did read around the web, and the examples use a #include "AES.h" lib; I don't know if this is implemented in hardware or software. The ESP8266 site does not answer this question. So I want to know about this, and how, or where to find help, to implement it.
P.S.: I don't want to use Arduino.
Also, I've already used this, https://github.com/CHERTS/esp8266-devkit/tree/master/Espressif/examples/ESP8266, but only for small jobs.
It's a software implementation. The RTOS SDK contains two implementations of AES, one of them shared with the basic SDK - all in software:
https://github.com/CHERTS/esp8266-devkit/blob/master/Espressif/ESP8266_RTOS_SDK/third_party/mbedtls/library/aes.c
https://github.com/CHERTS/esp8266-devkit/blob/master/Espressif/ESP8266_RTOS_SDK/third_party/ssl/crypto/ssl_aes.c
https://github.com/CHERTS/esp8266-devkit/blob/master/Espressif/ESP8266_SDK/third_party/mbedtls/library/aes.c
In addition, there's an implementation optimized for the AES-NI instruction set: https://github.com/CHERTS/esp8266-devkit/blob/master/Espressif/ESP8266_RTOS_SDK/third_party/mbedtls/library/aesni.c
However, AES-NI is only implemented by certain Intel and AMD CPUs, so that file will not be compiled for the ESP8266.
There are no signs of a hardware implementation.

OpenCL, Vulkan, Sycl

I am trying to understand the OpenCL ecosystem and how Vulkan comes into play.
I understand that OpenCL is a framework to execute code on GPUs as well as CPUs, using kernels that may be compiled to SPIR.
Vulkan can also be used as a compute-API using the same SPIR language.
SYCL is a new specification that allows writing OpenCL code as proper standard-conforming C++14. It is my understanding that there are no free implementations of this specification yet.
Given that,
How does OpenCL relate to Vulkan? I understand that OpenCL is higher level and abstracts the devices, but does (or could) it use Vulkan internally, instead of relying on vendor-specific drivers?
Vulkan is advertised as both a compute and graphics API, yet I found very few resources for the compute part. Why is that?
Vulkan has performance advantages over OpenGL. Is the same true for Vulkan vs OpenCL? (OpenCL is sadly notorious for being slower than CUDA.)
Does SYCL use OpenCL internally, or could it use Vulkan? Or does it use neither and instead rely on low-level, vendor-specific APIs to be implemented?
How does OpenCL relate to Vulkan? I understand that OpenCL is higher level and abstracts the devices, but does (or could) it use Vulkan internally?
They're not related to each other at all.
Well, they do technically use the same intermediate shader language, but Vulkan forbids the Kernel execution model, and OpenCL forbids the Shader execution model. Because of that, you can't just take a shader meant for OpenCL and stick it in Vulkan, or vice-versa.
Vulkan is advertised as both a compute and graphics API, however I found very few resources for the compute part - why is that?
Because the Khronos Group likes misleading marketing blurbs.
Vulkan is no more of a compute API than OpenGL. It may have Compute Shaders, but they're limited in functionality. The kind of stuff you can do in an OpenCL compute operation is just not available through OpenGL/Vulkan CS's.
Vulkan CS's, like OpenGL's CS's, are intended to be used for one thing: to support graphics operations. To do frustum culling, build indirect graphics commands, manipulate particle systems, and other such things. CS's operate at the same numerical precision as graphical shaders.
Vulkan has performance advantages over OpenGL. Is the same true for Vulkan vs OpenCL?
The performance of a compute system is based primarily on the quality of its implementation. It's not OpenCL that's slow; it's your OpenCL implementation that's slower than it possibly could be.
Vulkan CS's are no different in this regard. The performance will be based on the maturity of the drivers.
Also, there's the fact that, again, there's a lot of stuff you can do in an OpenCL compute operation that you cannot do in a Vulkan CS.
Does SYCL use OpenCL internally, or could it use Vulkan?
From the Khronos Group:
SYCL (pronounced ‘sickle’) is a royalty-free, cross-platform abstraction layer that builds on the underlying concepts, portability and efficiency of OpenCL...
So yes, it's built on top of OpenCL.
How does OpenCL relate to Vulkan?
Both can pipeline separable work from host to GPU and GPU to host, using queues to reduce communication overhead with multiple threads. DirectX/OpenGL cannot?
OpenCL: initial release August 28, 2009. Broader hardware support. Pointers allowed, but only for use on the device. You can use local memory shared between threads. Much easier to get a hello world started. Has API overhead for commands unless they are device-side queued. You can choose implicit multi-device synchronization or explicit management. Bugs are mostly fixed for 1.2, but I don't know about version 2.0.
Vulkan: initial release February 16, 2016 (but in progress since 2014). Narrower hardware support. Can SPIR-V handle pointers? Maybe not? No local-memory option? Hard to get a hello world started. Less API overhead. Can you choose implicit multi-device management? Still buggy for the Dota 2 game and some other games. Using both the graphics and compute pipelines at the same time can hide even more latency.
If OpenCL had Vulkan in it, it would have been hidden from the public for 7-9 years. If they could add it, why didn't they do it for OpenGL? (Maybe because of pressure from PhysX/CUDA?)
Vulkan is advertised as both a compute and graphics API, however I found very few resources for the compute part - why is that?
It needs more time, just like OpenCL did.
You can check info about compute shaders here:
https://www.khronos.org/registry/vulkan/specs/1.0/xhtml/vkspec.html#fundamentals-floatingpoint
Here is an example of particle system managed by compute shaders:
https://github.com/SaschaWillems/Vulkan/tree/master/computeparticles
Below that, there are raytracer and image-processing examples too.
Vulkan has performance advantages over OpenGL. Is the same true for Vulkan vs OpenCL?
Vulkan doesn't need to synchronize with another API. It's about command-buffer synchronization between command queues.
OpenCL needs to synchronize with OpenGL or DirectX (or Vulkan?) before using a shared buffer (CL-GL or DX-CL interop buffers). This has an overhead, and you need to hide it using buffer swapping and pipelining. If no shared buffer exists, it can run concurrently on modern hardware alongside OpenGL or DirectX.
OpenCL is sadly notorious for being slower than CUDA
It was, but it is now mature and challenges CUDA, especially with much wider hardware support, from all gaming GPUs to FPGAs, as of version 2.1. For example, in the future Intel could put an FPGA into a Core i3 and enable it as a (soft-x86 core IP) many-core CPU model, closing the gap between GPU and CPU performance to upgrade its CPU-PhysX gaming experience, or simply let an OpenCL physics implementation shape it and use at least 90% of the die area effectively, instead of a soft core's 10%-20%.
For the same price, AMD GPUs can compute faster in OpenCL, and for the same compute power, Intel iGPUs draw less power. (Edit: except when algorithms are sensitive to cache performance, where Nvidia has the upper hand.)
Besides, I wrote an SGEMM OpenCL kernel and ran it on an HD 7870 at 1.1 Tflops, then checked the internet and saw an SGEMM benchmark on a GTX 680 with the same performance, using a popular CUDA library! (The price ratio of GTX 680 to HD 7870 was 2.) (Edit: Nvidia's cc3.0 doesn't use the L1 cache when reading global arrays, and my kernel was purely local/shared memory plus some registers, "tiled".)
Does SYCL use OpenCL internally, or could it use Vulkan? Or does it use neither and instead rely on low-level, vendor-specific APIs to be implemented?
Here,
https://www.khronos.org/assets/uploads/developers/library/2015-iwocl/Khronos-SYCL-May15.pdf
says
Provides methods for dealing with targets that do not have OpenCL (yet!)
A fallback CPU implementation is debuggable!
so it can fall back to a pure threaded version (similar to Java's Aparapi).
Can access OpenCL objects from SYCL objects
Can construct SYCL objects from OpenCL object
Interop with OpenGL remains in SYCL
- Uses the same structures/types
So it uses OpenCL (maybe not directly, but with an upgraded driver communication?); it develops in parallel with OpenCL but can fall back to threads.
from the smallest OpenCL 1.2 embedded device to the most advanced OpenCL 2.2 accelerators

MPI and OpenMP on Desktop CPUs

I was just wondering how it is possible that OpenMP (shared memory) and MPI (distributed memory) can run on normal desktop CPUs like the i7, for example. Is there some kind of virtual machine that can simulate shared and distributed memory on these CPUs? I am asking because when learning OpenMP and MPI, the structure of supercomputers is shown, with shared memory or different nodes for distributed memory, each node with its own processor and memory.
MPI assumes nothing about how and where MPI processes run. As far as MPI is concerned, processes are just entities that have a unique address known as their rank, and MPI gives them the ability to send and receive data in the form of messages. How exactly the messages are transferred is left to the implementation. The model is so general that MPI can run on virtually any platform imaginable.
OpenMP deals with shared-memory programming using threads. Threads are just concurrent instruction flows that can access a shared memory space. They can execute in a timesharing fashion on a single CPU core, or they can execute on multiple cores inside a single CPU chip, or they can be distributed among multiple CPUs connected together by some sophisticated network that allows them to access each other's memory.
Given all that, MPI does not require that each process execute on a dedicated CPU core, or that millions of cores necessarily be put on separate boards connected by some high-speed network - performance does, as well as technical limitations. You can happily run a 100-process MPI job on a single CPU core, though performance would be very, very bad; it will still work (given enough memory is available). The same applies to OpenMP - it does not require that each thread be scheduled on a dedicated CPU core, but doing so gives the best performance.
That's why MPI and OpenMP are called abstractions - they are general enough that the execution hardware can vary greatly while source code is kept the same.
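The shared-memory model described above can be illustrated with plain threads. In this sketch, Python's threading module stands in for OpenMP threads (it is an analogy, not OpenMP itself): all threads update the same object in the same address space, with a lock playing the role of an OpenMP critical section.

```python
# Illustration of the shared-memory model: threads in one process all see
# the same data structure, the way OpenMP threads share an address space.
# (Python threading is a stand-in for OpenMP here; this is a sketch.)
import threading

def parallel_count(n_threads, n_increments):
    """Have several threads increment one shared counter."""
    counter = [0]                    # shared state, visible to every thread
    lock = threading.Lock()          # plays the role of a critical section

    def work():
        for _ in range(n_increments):
            with lock:               # serialize the read-modify-write
                counter[0] += 1

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]

if __name__ == "__main__":
    print(parallel_count(4, 1000))   # 4000: all threads updated one counter
```

Without the lock, the concurrent increments could race and lose updates, which is exactly why OpenMP provides constructs like critical sections and reductions.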
A modern multicore-CPU-based PC is a shared-memory computer. It is a sensible approximation to think of each core as a processor, and that they all have equal access to the same RAM. This approximation hides a lot of details of processor and chip architectures.
It has always (well, perhaps not always, but for almost as long as MPI has been around) been possible to use message-passing (of which MPI is one standard) on a shared-memory computer so that you can run the same MPI-enabled program as you would on a genuinely distributed-memory machine.
At the application level, a programmer only cares about calls to MPI routines. At the system level, the MPI run-time translates these calls - on a cluster or supercomputer - into instructions to send stuff over the interconnect. On a shared-memory computer, it could instead translate them into instructions to send stuff over the internal bus.
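This idea can be demonstrated with two ordinary processes on one machine exchanging a message through an OS pipe, which is roughly the kind of local transport an MPI shared-memory device uses instead of a network. The sketch below uses Python's multiprocessing as a stand-in; it is not MPI, and the function names are illustrative.

```python
# Message passing between two processes on one shared-memory machine,
# via an OS pipe -- a rough analogue of an MPI local/"shm" transport.
# (multiprocessing is a stand-in here; this is a sketch, not MPI.)
import multiprocessing as mp

def _echo_upper(conn):
    """Child process: blocking receive (like MPI_Recv), then reply."""
    msg = conn.recv()
    conn.send(msg.upper())
    conn.close()

def send_and_receive(msg):
    """Parent process: send a message to a child and await the reply."""
    parent_end, child_end = mp.Pipe()
    p = mp.Process(target=_echo_upper, args=(child_end,))
    p.start()
    parent_end.send(msg)      # the "message" crosses a process boundary
    reply = parent_end.recv()
    p.join()
    return reply

if __name__ == "__main__":
    print(send_and_receive("hello"))  # HELLO
```

The two processes share no Python objects; data only moves through explicit send/receive calls, which is precisely the distributed-memory programming model MPI presents even when everything runs on one chip.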
This is by no means a comprehensive introduction to the topics you've raised, but that's what Google and all the published sources out there are for.

Do most modern kernels use DMA for network IO with generic Ethernet controllers?

In most modern operating systems like Linux and Windows, is network IO typically accomplished using DMA? This is concerning generic Ethernet controllers; I'm not asking about things that require special drivers (such as many wireless cards, at least in Linux). I imagine the answer is "yes," but I'm interested in any sources (esp. for the Linux kernel), as well as resources providing more general information. Thanks.
I don't know that there really is such a thing as a generic network interface controller, but the nearest thing I know of -- the NE2000 interface specification, implemented by a large number of cheap controllers -- appears to have at least some limited DMA support, and more sophisticated controllers are likely to include more sophisticated features.
The question should be a bit different:
Does a typical network adapter have a DMA controller on board?
After finding the answer to that question (I guess in 99.9% of cases it will be yes), you should ask about the specific driver for each card. I assume that any decent driver will fully utilize the hardware's capabilities (i.e., DMA support in our case), but the question about the OS is not relevant, since no OS can force a driver to implement DMA support. High-level OSes like Windows and Linux provide primitives to ease the implementation of DMA, but implementing it is the responsibility of the driver.