Since MPI processes are invoked by mpirun/mpiexec, not by the resource manager or scheduler (Torque/Maui), how can we use cgroups to isolate memory and cpuset usage for every MPI process? Note that we cannot modify the MPI library (Open MPI/MPICH2), but modifying the resource manager and scheduler is acceptable. Thanks.
Related
When we have a multi-core CPU, OpenCL treats it as a single device with multiple compute units, and for every device we can create some command queues. How can the CPU, as a host, create a command queue on itself? I think in this situation it becomes multithreading rather than parallel computing.
Some devices, including most CPU devices, can be segmented into sub-devices using the extension "cl_ext_device_fission". When you use device fission, you still get parallel processing in that the host thread can do other tasks while the kernel is running on some of the CPU cores.
When not using device fission, the CPU device will essentially block the host program while a kernel is running. Even if some OpenCL implementations are non-blocking during kernel execution, the performance hit to the host would be too great to allow much work to be done by the host thread.
So it's still parallel computation, but I guess the host application's core is technically multi-threading during kernel execution.
It's parallel computing using multithreading. When you use an OpenCL CPU driver and enqueue kernels for the CPU device, the driver uses threads to execute the kernel in order to fully leverage all of the CPU's cores (and it also typically uses vector instructions like SSE to fully utilize each core).
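As a rough sanity check (a minimal sketch, not tied to any particular vendor's driver), you can see this by querying the CPU device: each logical core typically shows up as a compute unit, and the driver's worker threads spread the enqueued work-groups across them:

/* Sketch: query how many compute units (logical cores) the OpenCL CPU driver exposes. */
#define CL_TARGET_OPENCL_VERSION 120
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id cpu;
    cl_uint units;

    clGetPlatformIDs(1, &platform, NULL);                      /* first available platform */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &cpu, NULL);
    clGetDeviceInfo(cpu, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(units), &units, NULL);

    /* The driver will typically launch one worker thread per compute unit
       (and vectorize work-items within each thread). */
    printf("CPU device exposes %u compute units\n", units);
    return 0;
}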
As far as I know, mpirun and mpiexec are both launchers. Can anybody tell me the exact difference between mpiexec and mpirun?
mpiexec is defined in the MPI standard (well, the recent versions at least) and I refer you to those (your favourite search engine will find them for you) for details.
mpirun is a command implemented by many MPI implementations. It has never, however, been standardised and there have always been, often subtle, differences between implementations. For details see the documentation of the implementation(s) of your choice.
And yes, they are both used to launch MPI programs; these days mpiexec is generally preferable because it is standardised.
I know the question's been answered, but I think the answer isn't the best. I ran into a few issues on the cluster here with mpirun and looked to see if there was a difference between mpirun and mpiexec. This is what I found:
Description
Mpiexec is a replacement program for the script mpirun, which is part of the mpich package. It is used to initialize a parallel job from within a PBS batch or interactive environment. Mpiexec uses the task manager library of PBS to spawn copies of the executable on the nodes in a PBS allocation.
Reasons to use mpiexec rather than a script (mpirun) or an external daemon (mpd):
- Starting tasks with the TM interface is much faster than invoking a separate rsh or ssh once for each process.
- Resources used by the spawned processes are accounted correctly with mpiexec, and reported in the PBS logs, because all the processes of a parallel job remain under the control of PBS, unlike when using startup scripts such as mpirun.
- Tasks that exceed their assigned limits of CPU time, wallclock time, memory usage, or disk space are killed cleanly by PBS. It is quite hard for processes to escape control of the resource manager when using mpiexec.
- You can use mpiexec to enforce a security policy. If all jobs are required to start up using mpiexec and the PBS execution environment, it is not necessary to enable rsh or ssh access to the compute nodes in the cluster.
Ref: https://www.osc.edu/~djohnson/mpiexec/
I need to debug an MPI code for which I only have access to a single node/machine. The problem is that the bug I am looking for only arises when running on more than one node; when running, for example, two MPI tasks on the same node, everything goes fine. I assume that my MPI implementation (MVAPICH2) cleverly handles tasks running on the same node by, for example, replacing network communication with IPC strategies or even memcpy.
So my question is: how could I run two MPI tasks on a single node but make MPI treat them as if they were on different nodes? Is that possible?
You can disable the MVAPICH2 shared memory device by setting the MV2_USE_SHARED_MEM environment variable to 0:
mpiexec ... -env MV2_USE_SHARED_MEM 0 ... ./executable
Make sure that your MVAPICH2 was built with the TCP/IP device; otherwise your ranks won't be able to communicate with shared memory support turned off.
I was just wondering how it is possible that OpenMP (shared memory) and MPI (distributed memory) could run on normal desktop CPUs, like an i7 for example. Is there some kind of virtual machine that can simulate shared and distributed memory on these CPUs? I am asking because when learning OpenMP and MPI, the structure of supercomputers is shown, with shared memory or different nodes for distributed memory, each node with its own processor and memory.
MPI assumes nothing about how and where MPI processes run. As far as MPI is concerned, processes are just entities that have a unique address known as their rank, and MPI gives them the ability to send and receive data in the form of messages. How exactly the messages are transferred is left to the implementation. The model is so general that MPI can run on virtually any platform imaginable.
OpenMP deals with shared-memory programming using threads. Threads are just concurrent instruction flows that can access a shared memory space. They can execute in a timesharing fashion on a single CPU core, they can execute on multiple cores inside a single CPU chip, or they can be distributed among multiple CPUs connected together by some sophisticated network that allows them to access each other's memory.
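For instance, an ordinary desktop chip runs an OpenMP program the same way a large shared-memory node would; here is a minimal sketch (nothing is assumed beyond an OpenMP-capable compiler):

/* omp_sum.c - compile with e.g.: gcc -fopenmp omp_sum.c -o omp_sum
   The threads all share the same address space, so the reduction below
   combines per-thread partial sums into a single shared result. */
#include <stdio.h>
#include <omp.h>

int main(void) {
    long long sum = 0;
    /* Each thread runs on whatever core(s) the OS schedules it on; on a
       desktop i7 those are simply the chip's own cores. */
    #pragma omp parallel for reduction(+:sum)
    for (long long i = 0; i < 1000000; ++i)
        sum += i;
    printf("max threads: %d, sum = %lld\n", omp_get_max_threads(), sum);
    return 0;
}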
Given all that, MPI does not require that each process executes on a dedicated CPU core or that millions of cores should necessarily be put on separate boards connected by some high-speed network - performance does, as do technical limitations. You can happily run a 100-process MPI job on a single CPU core, and though performance would be very, very bad, it will still work (given enough memory is available). The same applies to OpenMP - it does not require that each thread be scheduled on a dedicated CPU core, but doing so gives the best performance.
That's why MPI and OpenMP are called abstractions - they are general enough that the execution hardware can vary greatly while source code is kept the same.
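As a concrete sketch of that point, a toy ring program like the one below (hypothetical file name ring.c) can be launched heavily oversubscribed on a laptop or with one rank per node on a cluster without changing a line of source:

/* ring.c - each rank passes a token to the next rank.
   The same source runs oversubscribed on one core (e.g. mpiexec -n 100 ./ring)
   or with one rank per core/node on a cluster; only performance differs. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size, token;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size == 1) {                     /* nothing to pass around */
        printf("only one rank, nothing to do\n");
        MPI_Finalize();
        return 0;
    }

    if (rank == 0) {
        token = 42;
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("token made it around %d ranks\n", size);
    } else {
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}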
A modern multicore-CPU-based PC is a shared-memory computer. It is a sensible approximation to think of each core as a processor, and that they all have equal access to the same RAM. This approximation hides a lot of details of processor and chip architectures.
It has always (well, perhaps not always, but for almost as long as MPI has been around) been possible to use message-passing (of which MPI is one standard) on a shared-memory computer so that you can run the same MPI-enabled program as you would on a genuinely distributed-memory machine.
At the application level a programmer only cares about calls to MPI routines. At the systems level the MPI run-time translates these calls, on a cluster or supercomputer, into instructions to send stuff over the interconnect. On a shared-memory computer it could instead translate these calls into instructions to send stuff over the internal bus.
This is by no means a comprehensive introduction to the topics you've raised, but that's what Google and all the published sources out there are for.
We have a multi-device system which distributes a main task across the devices. Each subtask consists of:
enqueue write buffer
enqueue kernel
enqueue read buffer
All enqueues are asynchronous and the command queues are in-order. We attach a callback to the cl_event of the enqueued read buffer, in which we determine whether the main task is complete. If it is not, we schedule one more subtask on the queue.
Unfortunately, we've found that keeping the host's CPU busy prevents it from processing callbacks from the other devices (GPUs), and most of the time they are not involved in the work. The idea is to exclude the host's CPU from the list of devices we use to complete the main task.
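For context, the enqueue/callback cycle described above looks roughly like the following sketch; the helper names (main_task_done, schedule_next_subtask) and the buffer/kernel handles are hypothetical placeholders for the application's own code:

/* Sketch of the callback-driven scheme described in the question. */
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>

/* Hypothetical application hooks, defined elsewhere in the real program. */
extern int  main_task_done(void);
extern void schedule_next_subtask(void *user_data);

static void CL_CALLBACK on_read_done(cl_event ev, cl_int status, void *user_data)
{
    /* Runs on a driver thread once the read-back has completed.
       Keep it short: decide whether more work is needed and enqueue it. */
    if (status == CL_COMPLETE && !main_task_done())
        schedule_next_subtask(user_data);     /* enqueue write/kernel/read again */
    clReleaseEvent(ev);
}

static void submit_subtask(cl_command_queue queue, cl_kernel kernel,
                           cl_mem device_buf, void *host_buf, size_t nbytes,
                           size_t gsize)
{
    cl_event read_done;

    /* In-order queue: write -> kernel -> read are implicitly ordered. */
    clEnqueueWriteBuffer(queue, device_buf, CL_FALSE, 0, nbytes, host_buf, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsize, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, device_buf, CL_FALSE, 0, nbytes, host_buf, 0, NULL, &read_done);

    /* The callback fires when the read has finished. */
    clSetEventCallback(read_done, CL_COMPLETE, on_read_done, host_buf);
    clFlush(queue);   /* make sure the commands are actually submitted */
}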
You should look into device fission. If your platform supports this feature, you will be able to create an OpenCL device with any combination of CPU cores. Look here for details. This extension will allow you to reserve some number of cores for your host application.
I like how it allows you to create sub-devices which share various levels of cache memory. You might be interested in CL_DEVICE_PARTITION_BY_NAMES_EXT (search for "CL_DEVICE_PARTITION_BY_NAMES_EXT" on the page).
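For anyone on an OpenCL 1.2+ platform: device fission was promoted into the core API as clCreateSubDevices, and reserving a couple of cores for the host looks roughly like the sketch below. It partitions by counts (the variant that made it into core; the EXT by-names form is analogous), and the 6/2 split is just an assumed example for an 8-core CPU device:

/* Sketch: split a CPU device into a 6-core sub-device for OpenCL work and a
   2-core sub-device that is simply left unused, so the host thread keeps cores.
   Uses the OpenCL 1.2 core API; the older cl_ext_device_fission extension
   (clCreateSubDevicesEXT) works the same way. */
#define CL_TARGET_OPENCL_VERSION 120
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id cpu, subs[2];
    cl_uint nsubs = 0;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &cpu, NULL);

    /* 6 compute units for kernels, 2 kept free for the host application
       (assumes the device exposes at least 8 compute units). */
    cl_device_partition_property props[] = {
        CL_DEVICE_PARTITION_BY_COUNTS, 6, 2,
        CL_DEVICE_PARTITION_BY_COUNTS_LIST_END, 0
    };

    cl_int err = clCreateSubDevices(cpu, props, 2, subs, &nsubs);
    if (err != CL_SUCCESS) {
        fprintf(stderr, "device does not support this partitioning (err %d)\n", err);
        return 1;
    }

    /* Build the context and command queue on subs[0] only;
       the cores behind subs[1] stay available to the host. */
    printf("created %u sub-devices\n", nsubs);
    return 0;
}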