I was wondering if it is possible to change the number of PEs at run time in MPI Fortran using the Intel compilers.
My problem is very specific: I would like to know if I can reduce the number of PEs after I reach some point in my computation.
My case is as follows:
I have a code that crunches a lot of numbers. To solve huge problems I need around 128 PEs. But when I finish my computation and start printing the solution, the other 127 PEs stay idle, which is a huge waste of resources.
Is it possible to "deallocate" those 127 PEs when I am done with my computation and am still printing the solution?
I do not think there is a straightforward way to achieve this.
A relatively simple option is to start your MPI app with one task, then MPI_Comm_spawn() 127 PEs, do your computation, terminate the 127 PEs and continue with the serial part.
Generally speaking, such a 128 PE job is started via a resource manager, and imho the real issue is whether the batch manager can support shrinking a job (iirc, SLURM does), and whether this can be done without any impact on MPI (this is a desired feature and PMIx has plans for it, but I have no idea whether SLURM supports this).
My best advice is to do things differently, and use MPI-IO to print your solution in parallel.
I ran into a problem when using clBuildProgram() on a GTX 750. The kernel failed to build with error code -5 (CL_OUT_OF_RESOURCES) and an empty build log.
There is a possible solution, which is adding '-cl-nv-verbose' as an input option to clBuildProgram(). However, it doesn't work for all kernels.
Based on that, I tried another build option, '-cl-opt-disable'. It also works only for some kernels.
Then I got confused.
I cannot find the real cause of the error.
Why do different build options help for some kernels but not others?
The error seems to be architecture dependent, since the same OpenCL code runs successfully on a GTX 750 but fails on a Tesla P100.
Does anyone have any ideas?
Possible reasons I can think of:
Running out of registers. This happens if you have a lot of (private) variables in your kernel code, especially arrays. Each core only has a certain amount of registers available (architecture dependent), and it may not be possible for the compiler to "spill" them to global memory. If this is the problem, you can try to rearrange your code so your variables have more limited scope, or you can try to move some arrays to local memory (bearing in mind this is shared between work items in a group, and also limited in size). A good GPU profiler/code analysis tool should be able to tell you how much register pressure there is, so if you've got the kernel working on some hardware, you should be able to find out register pressure for that, and draw conclusions for other hardware too.
Code size itself. I didn't think this should be much of a problem anymore on modern GPUs, but it might be possible if you have truly gigantic kernels.
In my program, I would like to heavily parallelize many mathematical calculations, the results of which are then written to an output file.
I successfully implemented this using collective communication (gather, scatter etc.), but I noticed that with these synchronizing routines the slowest of all the processors dominates the execution time and heavily degrades overall performance, as the fast processors spend a lot of time waiting.
So I decided to switch to a scheme where one (master) processor is dedicated to receiving chunks of results and handling the file output, and all the other processors calculate these results and send them to the master using non-blocking send routines.
Unfortunately, I don't really know how to implement the master code. Do I need to run an infinite loop with MPI_Recv(), listening for incoming messages? How do I know when to stop the loop? Can I combine MPI_Isend() and MPI_Recv(), or do both calls need to be non-blocking? How is this typically done?
MPI 3.0 and later provide non-blocking collectives. I would strongly recommend using them instead of implementing this on your own.
However, it may not help you after all. Eventually you need the data from all processes, even the slow ones. So you are likely to wait at some point again. Non-blocking communication overlaps communication and computation, but it doesn't fix your load imbalances.
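To make the load-imbalance point concrete, here is a minimal sketch in plain Python (a model, not MPI code). The function name collective_cost and the per-rank times are made up for illustration; the only assumption is that a synchronizing collective cannot complete before the slowest rank arrives.

```python
def collective_cost(work_times):
    """Model a synchronizing collective: nobody leaves before the slowest rank arrives.

    work_times: hypothetical per-rank compute times (seconds) before the collective.
    Returns (finish_time, idle_time_per_rank, parallel_efficiency).
    """
    finish = max(work_times)                      # the slowest rank sets the pace
    idle = [finish - t for t in work_times]       # time each rank spends waiting
    efficiency = sum(work_times) / (finish * len(work_times))
    return finish, idle, efficiency


# Three fast ranks and one slow rank (illustrative numbers only).
times = [1.0, 1.0, 1.0, 4.0]
finish, idle, efficiency = collective_cost(times)
print(finish)      # 4.0
print(idle)        # [3.0, 3.0, 3.0, 0.0]
print(efficiency)  # 0.4375
```

Overlapping communication with computation can hide some of the waiting, but the total still cannot drop below the slowest rank's time unless the work itself is redistributed.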
Update (more or less a long clarification comment)
There are several layers to your question; I might have been confused by the title as to what kind of answer you were expecting. Maybe the question is rather
How do I implement a centralized work queue in MPI?
This pops up regularly, most recently here. But it is often undesirable, because a central component quickly becomes a bottleneck in large-scale programs. So the actual problem you have is that your work decomposition and mapping are imbalanced. So the more fundamental "X-question" is
How do I load balance an MPI application?
At that point you must provide more information about your mathematical problem and its current implementation, preferably in the form of an [mcve]. Again, there is no standard solution. Load balancing is a huge research area. It may even be a topic for CS.SE rather than SO.
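For the master loop asked about in the question, one common pattern is to have every worker send a final "done" message, so that the master loops until it has heard "done" from each worker rather than forever. The sketch below models this with Python threads and a queue instead of MPI. The names RESULT, DONE, worker, master and run are invented for illustration. In a real MPI code, mailbox.get() would roughly correspond to MPI_Recv with MPI_ANY_SOURCE and MPI_ANY_TAG, and the two message kinds to two tags. A blocking MPI_Recv on the master can match an MPI_Isend on a worker, so the two can be combined.

```python
import queue
import threading

RESULT, DONE = "result", "done"   # stand-ins for two MPI message tags


def worker(rank, n_chunks, mailbox):
    """Compute some chunks, send each one, then announce completion."""
    for i in range(n_chunks):
        payload = rank * 100 + i              # placeholder for a real result
        mailbox.put((RESULT, rank, payload))  # plays the role of a non-blocking send
    mailbox.put((DONE, rank, None))           # termination message


def master(n_workers, mailbox):
    """Receive from any worker until every worker has reported DONE."""
    received = []
    active = n_workers
    while active > 0:                          # bounded loop, not an infinite one
        tag, source, payload = mailbox.get()   # blocking receive from any source
        if tag == DONE:
            active -= 1
        else:
            received.append((source, payload)) # file output would happen here
    return received


def run(n_workers=3, n_chunks=4):
    mailbox = queue.Queue()
    threads = [threading.Thread(target=worker, args=(rank, n_chunks, mailbox))
               for rank in range(1, n_workers + 1)]
    for t in threads:
        t.start()
    results = master(n_workers, mailbox)
    for t in threads:
        t.join()
    return results


results = run()
print(len(results))  # 12
```

The arrival order of results varies from run to run, so anything that depends on ordering has to carry its own index in the payload.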
I was given a little exercise where I had to implement a Monte Carlo algorithm using MPI to estimate the total volume of n spheres, given the coordinates of their centers and their radii in 3 dimensions. Even though we must use MPI, we can launch all the processes on our local machine, so there's no network overhead. I implemented two versions of this exercise:
One, using MPI_Send and MPI_Recv (where the process of rank 0 only waits for partial results from the others to perform the final sum)
http://pastebin.com/AV41hJqn
The other, using MPI_Reduce; here too, the rank 0 process waits for partial results.
http://pastebin.com/8b0czv6a
I expected both programs to take the same time to finish, but I see that the one using MPI_Reduce is faster. Why is this? Where's the difference?
There could be a lot of reasons depending on which MPI implementation you're using, what kind of hardware you're running on and how optimized the implementation is to take advantage of that. This Google Scholar search gives some idea of the variety of work done on this. To give you a few ideas of what it could be:
Since reductions can be completed in intermediate steps, it may be possible to use a different topology than the basic rank 0 collect-from-all approach, with tradeoffs in latency and bandwidth.
Within a compute node (or on your desktop or laptop if you're trying this with a toy problem), it may be possible to exploit locality within cores, between cores on a CPU socket or between sockets to order the computations and communication in a way that's more efficient for the hardware. It sounds from the abstract like this paper from IBM may give some concrete details about some of these design decisions. Alternatively, the implementation might choose a cache-oblivious scheme for better performance within a general compute node.
Persistent communication (MPI_Send_init and MPI_Recv_init) can be used under the hood in the MPI_Reduce implementation. These routines can perform better than their blocking and non-blocking counterparts due to providing the MPI implementation and hardware with extra details about how the program is grouping its communications.
This is not a comprehensive list, but hopefully it gets you started and provides some ideas for how to search out more details if you're interested.
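As an illustration of the first point (a different topology for the reduction), the following plain-Python sketch compares a rank-0-collects-everything sum with a pairwise tree. It only counts sequential steps and does not model latency or bandwidth. The function names are invented here, and no claim is made that any particular MPI implementation reduces exactly this way.

```python
def linear_reduce(values):
    """Rank 0 receives and adds the other partial results one after another."""
    total, steps = values[0], 0
    for v in values[1:]:
        total += v
        steps += 1          # each addition waits for the previous one
    return total, steps


def tree_reduce(values):
    """Combine partial results pairwise; additions within a round are independent."""
    vals, rounds = list(values), 0
    while len(vals) > 1:
        paired = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:   # odd one out is carried to the next round
            paired.append(vals[-1])
        vals = paired
        rounds += 1
    return vals[0], rounds


partials = list(range(1, 9))      # eight hypothetical partial sums
print(linear_reduce(partials))    # (36, 7)
print(tree_reduce(partials))      # (36, 3)
```

With 8 partial results, the linear scheme needs 7 dependent steps at rank 0, while the tree needs 3 rounds whose additions could proceed concurrently on different ranks.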
Let's say there is a computer with 4 CPUs, each having 2 cores, for a total of 8 cores. With my limited understanding, I think that all processors share the same memory in this case. Now, is it better to use OpenMP directly, or to use MPI to keep the code general so that it works in both distributed and shared-memory settings? Also, if I use MPI in a shared-memory setting, would performance decrease compared with OpenMP?
Whether you need or want MPI or OpenMP (or both) depends heavily on the type of application you are running, and on whether your problem is mostly memory-bound or CPU-bound (or both). Furthermore, it depends on the type of hardware you are running on. A few examples:
Example 1
You need parallelization because you are running out of memory, e.g. you have a simulation and the problem size is so large that your data does not fit into the memory of a single node anymore. However, the operations you perform on the data are rather fast, so you do not need more computational power.
In this case you probably want to use MPI and start one MPI process on each node, thereby making maximum use of the available memory while limiting communication to the bare minimum.
Example 2
You usually have small datasets and only want to speed up your application, which is computationally heavy. Also, you do not want to spend much time thinking about parallelization, but rather about your algorithms in general.
In this case OpenMP is your first choice. You only need to add a few statements here and there (e.g. in front of your for loops that you want to accelerate), and if your program is not too complex, OpenMP will do the rest for you automatically.
Example 3
You want it all. You need more memory, i.e. more computing nodes, but you also want to speed up your calculations as much as possible, i.e. running on more than one core per node.
Now your hardware comes into play. From my personal experience, if you have only a few cores per node (4-8), the performance penalty created by the general overhead of using OpenMP (i.e. starting up the OpenMP threads etc.) is more than the overhead of processor-internal MPI communication (i.e. sending MPI messages between processes that actually share memory and would not need MPI to communicate).
However, if you are working on a machine with more cores per node (16+), a hybrid approach, i.e. parallelizing with MPI and OpenMP at the same time, will become necessary to make full use of your computational resources. It is also the most difficult to code and to maintain.
Summary
If you have a problem that is small enough to be run on just one node, use OpenMP. If you know that you need more than one node (and thus definitely need MPI), but you favor code readability/effort over performance, use only MPI. If using MPI only does not give you the speedup you would like/require, you have to do it all and go hybrid.
To your second question (in case that did not become clear):
If your setup is such that you do not need MPI at all (because you will always run on only one node), use OpenMP, as it will be faster. But if you know that you need MPI anyway, I would start with that and only add OpenMP later, when you know that you've exhausted all reasonable optimization options for MPI.
With most distributed memory platforms nowadays consisting of SMP or NUMA nodes it just makes no sense to not use OpenMP. OpenMP and MPI can perfectly work together; OpenMP feeds the cores on each node and MPI communicates between the nodes. This is called hybrid programming. It was considered exotic 10 years ago but now it is becoming mainstream in High Performance Computing.
As for the question itself, the right answer, given the information provided, has always been one and the same: IT DEPENDS.
For use on a single shared memory machine like that, I'd recommend OpenMP. It makes some aspects of the problem simpler and might be faster.
If you ever plan to move to a distributed memory machine, then use MPI. It'll save you solving the same problem twice.
The reason I say OpenMP only "might" be faster is that a good implementation of MPI could be clever enough to spot that it's being used in a shared memory environment and optimise its behaviour accordingly.
Just for a bigger picture, hybrid programming has become popular because OpenMP benefits from the cache topology by using the same address space. As MPI might have the same data replicated across memory (because processes can't share data), it might suffer from cache cancellation.
On the other hand, if you partition your data correctly, and each processor has a private cache, you might reach a point where each processor's share of the problem fits completely in cache. In this case you can see superlinear speedups.
Speaking of caches, recent processors have very different cache topologies, so, as always: IT DEPENDS...
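A back-of-the-envelope way to see where the "fits completely in cache" point might lie is sketched below. The data size and per-processor cache size are hypothetical numbers picked for illustration, and the helper names are invented. The model assumes an even partition with no replicated data.

```python
MIB = 1024 * 1024


def chunk_bytes(total_bytes, n_procs):
    """Size of each processor's share under an even partition (rounded up)."""
    return -(-total_bytes // n_procs)


def min_procs_to_fit_cache(total_bytes, cache_bytes_per_proc):
    """Smallest processor count for which each share fits in a private cache."""
    return -(-total_bytes // cache_bytes_per_proc)


total = 64 * MIB   # hypothetical working set
cache = 8 * MIB    # hypothetical private cache per processor

for n in (2, 4, 8, 16):
    print(n, chunk_bytes(total, n) <= cache)
# 2 False
# 4 False
# 8 True
# 16 True

print(min_procs_to_fit_cache(total, cache))  # 8
```

Under these assumptions, superlinear behaviour would first become plausible at around 8 processors. Real cache hierarchies are shared and layered, so this is only a first estimate.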
I have a C++ code that uses MPI and executes in a sequential-parallel-sequential pattern. This pattern is repeated in a time loop.
When validating the code against the serial version, I see a reduction in time for the parallel part; in fact, the reduction is almost linear in the number of processors.
The problem I am facing is that the time required for the sequential part also increases considerably when using a higher number of processors.
The parallel part takes less time to execute than the total sequential time of the entire program.
Therefore, although the parallel part gets faster with more processors, most of that saving is lost to the increase in time spent in the sequential part. The sequential part also includes a large number of computations at each time step and writes the data to an output file at specified times.
All the processors run during the sequential part; the data is gathered to the root processor after the parallel computation, and only the root processor is allowed to write the file.
So can anyone suggest an efficient way to handle the serial part (large number of operations + writing the file) of the parallel code? I am happy to clarify any point if required.
Thanks in advance.
First of all, do the file writing from a separate thread (or process, in MPI terms), so the other threads can use your cores for computations.
Then, check why your parallel version is much slower than the sequential one. Often this means you create tasks that are too small, so communication between threads (synchronization) eats your performance. Consider whether tasks can be combined into chunks, with complete chunks processed in parallel.
And, of course, use a profiler that works well in a multithreaded environment.
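As a rough yardstick for the timings described in the question, Amdahl's law bounds the overall speedup when a fixed fraction of the run is sequential. The sketch below uses a hypothetical serial fraction of 0.6, chosen only because the question says the sequential part takes longer than the parallel part. It also assumes the sequential time stays constant, which is optimistic here, since the question reports that it grows with the processor count.

```python
def amdahl_speedup(serial_fraction, n_procs):
    """Amdahl's law: speedup when only the non-serial fraction scales with n_procs."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)


serial_fraction = 0.6   # hypothetical: sequential part is 60% of the serial run time

for n in (1, 2, 8, 128):
    print(n, round(amdahl_speedup(serial_fraction, n), 3))
# 1 1.0
# 2 1.25
# 8 1.538
# 128 1.658

print(round(1.0 / serial_fraction, 3))  # upper bound as n grows: 1.667
```

Even with perfectly linear scaling of the parallel part, the whole program can never run more than about 1.67 times faster under these assumptions, which is why shrinking or overlapping the sequential part matters more than adding processors.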
[EDIT]
sequential part = the part of your logic that cannot be (and is not) parallelized; do you mean the same? The sequential part on a multicore machine can run a bit slower, probably because of the OS dispatcher or something like that. It's weird that you see a noticeable difference.
A disk is sequential by nature, so writing to it from many threads doesn't give any benefit, and can lead to a situation where many threads try to write simultaneously and wait for each other instead of doing something useful.
BTW, what MPI implementation do you use?
Your problem description is too high-level. Please provide some pseudo-code or something like it; that would help us to help you.