I feel like this is a dumb question, but I really don't know how to google this, since every combination of the words 'ScaLAPACK' and 'block' just bombards you with information about the block-cyclic matrix distribution ScaLAPACK uses. However, what I want to know is whether ScaLAPACK subroutines are generally 'blocking' or 'non-blocking' in MPI lingo, i.e. whether or not they wait to finish on all other processes before they return. Sorry again for the stupid question.
Yes, ScaLAPACK routines are blocking calls. Non-blocking ScaLAPACK-style routines are still in their infancy, but if you want them I would recommend checking out the SLATE project (https://www.icl.utk.edu/research/slate).
Hope it helped!
My current understanding of MPI non-blocking routines is that they allow for the overlapping of communication and computation. However, I also understand that this overlapping is not guaranteed by the MPI implementation. What, then, are the factors that can inhibit the overlapping? Thanks.
Non-blocking routines were not primarily motivated by latency hiding (I'll use this as a shorter synonym for "overlap of computation and communication"): the prime use was to be able to write deadlock/serialization-free code. For the longest time, achieving an actual performance improvement required periodically activating the MPI library with MPI_Iprobe or similar tricks. The basic problem was that during your computation there was no guarantee that the MPI layer would do anything at all.
The problem of forcing "MPI progress" still persists, but these days MPI implementations such as those from Intel or MVAPICH (sorry, I don't know about Open MPI) have environment variables with which you can force "progress threads". Also, network cards may be clever enough to work while your processor is otherwise engaged. And even with all this, improvement is not guaranteed, because of the overhead you are introducing.
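To make the pattern concrete, here is a minimal sketch of my own (not from the answer above, run with at least two ranks): post a non-blocking send/receive, do the computation in slices, and periodically call MPI_Test to give the library a chance to make progress before the final MPI_Wait. Whether this actually hides latency is implementation and hardware dependent, as discussed above.

    #include <mpi.h>
    #include <vector>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int count = 1 << 20;
        std::vector<double> buf(count);
        MPI_Request req = MPI_REQUEST_NULL;
        if (rank == 0)        // rank 0 receives from rank 1; other ranks stay idle
            MPI_Irecv(buf.data(), count, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
        else if (rank == 1)
            MPI_Isend(buf.data(), count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

        int done = 0;
        for (int step = 0; step < 1000; ++step) {
            // ... do a slice of local computation here ...
            if (!done)
                MPI_Test(&req, &done, MPI_STATUS_IGNORE);  // nudge the MPI progress engine
        }
        MPI_Wait(&req, MPI_STATUS_IGNORE);  // make sure the transfer has completed

        MPI_Finalize();
        return 0;
    }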
I'm learning C++ on my own. I'm an EE and learned it about 20 years ago, but over the course of my career I stopped programming and didn't take it up again until recently. I should add that I never took any classes in programming.
I have a theoretical question about pointers. Reading books about C++, it seems pointers have an important role in the language. My problem is that I can't see what that role is. I see that pointers have a role in arrays, but I can't see their role in anything else.
I can see what they do, but I don't see why you would use pointers in the situations I see them used in. Either references or plain variables would seem to work just as well. I have a feeling the answer lies in the area of memory (its optimal use), but I just don't know.
Any answers would be appreciated. Thanks.
Consider the following from cplusplus.com:
"[T]here may be cases where the memory needs of a program can only be
determined during runtime. For example, when the memory needed depends
on user input. On these cases, programs need to dynamically allocate
memory, for which the C++ language integrates the operators new and
delete."
If you could determine all your memory needs prior to run time and did not need to make use of any abstract data type like a linked list, then yes, it would be difficult to see their use. However, what if you want to store values in an array, but you don't yet know how big that array will need to be?
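As a tiny illustration of that last point (my own example, not from cplusplus.com): the size below is only known at runtime, so the storage has to be obtained through a pointer with new[] and released with delete[].

    #include <cstddef>
    #include <iostream>

    int main() {
        std::size_t n = 0;
        std::cout << "How many values? ";
        std::cin >> n;                      // size is only known at runtime

        int *values = new int[n];           // dynamically allocated array, reached via a pointer
        for (std::size_t i = 0; i < n; ++i)
            values[i] = static_cast<int>(i * i);

        if (n > 0)
            std::cout << "last value: " << values[n - 1] << '\n';

        delete[] values;                    // give the memory back
        return 0;
    }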
Another use of pointers arises when you consider passing values from function to function. You may find this thread helpful regarding the differences between pointers and references in C++ and how/why to use each.
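For instance, a function can modify its caller's variable either through a pointer (C style) or through a reference (C++ style); a small sketch of my own:

    #include <iostream>

    void add_one_ptr(int *x) { *x += 1; }   // caller passes an address explicitly
    void add_one_ref(int &x) { x += 1; }    // call site looks like pass-by-value

    int main() {
        int a = 41;
        add_one_ptr(&a);                    // the & at the call site signals "may be modified"
        add_one_ref(a);                     // no visual hint at the call site
        std::cout << a << '\n';             // prints 43
        return 0;
    }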
We have been having several pedagogical conversations focused on pointers on the CSEducators.SE site. I'd encourage you to read those as well:
Simple Pointer Examples in C
Lesson Idea: Arrays, Pointers, and Syntactic Sugar
Pointers come from C, which has no concept of references and from which C++ inherited them.
Everything that can be done with a reference in C++ is done with a pointer in C.
I find this question really great because it is pure.
A programming language is considered "safe" when the programs written in it can only call functions and access data that the program can name.
Now, the concept of a pointer was invented to break out of this sandbox of safety and to give the developer the freedom to think and act outside of the box.
Think of pointers as a poor man's tool to achieve something not provided by the programming language itself.
It is misleading to think you could achieve higher performance by programming some algorithm with pointers. Optimization is the privilege of the compiler and the hardware, not the human.
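As one concrete (and deliberately low-level) illustration of that "outside the box" access, here is a sketch of my own: a pointer lets you look at the raw bytes of an object, which neither a plain variable nor a reference will give you.

    #include <cstddef>
    #include <cstdio>

    int main() {
        double d = 1.0;

        // Viewing the object representation through an unsigned char pointer is
        // permitted by the language, but it steps outside the type system's usual view.
        const unsigned char *bytes = reinterpret_cast<const unsigned char *>(&d);
        for (std::size_t i = 0; i < sizeof d; ++i)
            std::printf("%02x ", bytes[i]);
        std::printf("\n");  // e.g. "00 00 00 00 00 00 f0 3f" on a little-endian IEEE-754 machine
        return 0;
    }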
In my program, I would like to heavily parallelize many mathematical calculations, the results of which are then written to an output file.
I successfully implemented that using collective communication (gather, scatter, etc.), but I noticed that with these synchronizing routines the slowest processor dominates the execution time and severely hurts overall performance, as the fast processors spend a lot of time waiting.
So I decided to switch to a scheme where one (master) processor is dedicated to receiving chunks of results and handling the file output, and all the other processors compute these results and send them to the master using non-blocking send routines.
Unfortunately, I don't really know how to implement the master code. Do I need to run an infinite loop with MPI_Recv(), listening for incoming messages? How do I know when to stop the loop? Can I combine MPI_Isend() and MPI_Recv(), or do both ends need to be non-blocking? How is this typically done?
MPI 3.1 provides non-blocking collectives. I would strongly recommend using those instead of implementing this on your own.
However, they may not help you after all. Eventually you need the data from all processes, even the slow ones, so you are likely to end up waiting at some point anyway. Non-blocking communication overlaps communication and computation, but it doesn't fix your load imbalance.
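For reference, a minimal sketch of what a non-blocking collective looks like (my own example, with a made-up chunk size): each rank posts MPI_Igather, keeps computing into other buffers while the gather is in flight, and only waits when the root actually needs the data.

    #include <cstddef>
    #include <mpi.h>
    #include <vector>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int chunk = 1000;                          // per-rank result size (made up)
        std::vector<double> local(chunk, rank);          // pretend these were just computed
        std::vector<double> all;
        if (rank == 0)
            all.resize(static_cast<std::size_t>(chunk) * size);

        MPI_Request req;
        MPI_Igather(local.data(), chunk, MPI_DOUBLE,
                    all.data(), chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req);

        // ... compute the next batch into a different buffer while the gather runs ...

        MPI_Wait(&req, MPI_STATUS_IGNORE);               // now `all` is complete on rank 0
        if (rank == 0) {
            // write `all` to the output file here
        }

        MPI_Finalize();
        return 0;
    }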
Update (more or less a long clarification comment)
There are several layers to your question; I might have been confused by the title as to what kind of answer you were expecting. Maybe the question is rather:
How do I implement a centralized work queue in MPI?
This pops up regularly, most recently here. But that is actually often undesirable, because a central component quickly becomes a bottleneck in large-scale programs. So the actual problem you have is that your work decomposition & mapping is imbalanced, and the more fundamental "X-question" is:
How do I load balance an MPI application?
At that point you must provide more information about your mathematical problem and its current implementation, preferably in the form of an [mcve] (minimal reproducible example). Again, there is no standard solution. Load balancing is a huge research area. It may even be a topic for CS.SE rather than SO.
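That said, to address the literal master-loop part of the question, here is a minimal sketch of my own (not from any linked answer) of a centralized receiver. Workers send result chunks tagged TAG_RESULT and finish with an empty TAG_DONE message; the master loops on MPI_Recv with MPI_ANY_SOURCE and stops once every worker has reported done. The tag names and chunk size are made up for the sketch.

    #include <mpi.h>
    #include <vector>

    enum { TAG_RESULT = 1, TAG_DONE = 2 };   // hypothetical tags for this sketch

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int chunk = 1000;                          // per-message result size (made up)

        if (rank == 0) {                                 // master: receive results, write output
            std::vector<double> buf(chunk);
            int workers_left = size - 1;
            while (workers_left > 0) {
                MPI_Status st;
                MPI_Recv(buf.data(), chunk, MPI_DOUBLE,
                         MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                if (st.MPI_TAG == TAG_DONE) {
                    --workers_left;                      // this worker has no more results
                } else {
                    // write `buf` to the output file here
                }
            }
        } else {                                         // worker: compute, send, repeat
            std::vector<double> results(chunk, rank);    // pretend these were computed
            MPI_Request req;
            MPI_Isend(results.data(), chunk, MPI_DOUBLE, 0, TAG_RESULT, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);           // `results` may be reused after this
            double dummy = 0.0;
            MPI_Send(&dummy, 0, MPI_DOUBLE, 0, TAG_DONE, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }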
In OpenCL, which is based on C99, we have two options for creating something function-like:
macros
functions
[edit: well, or use a templating language, new option 3 :-) ]
I heard somewhere (I can't find any official reference for this, just a comment I once saw somewhere on Stack Overflow) that functions are almost always inlined in practice, and that therefore using functions is OK performance-wise?
But macros are basically guaranteed to be inlined, by the nature of macros. They are susceptible to bugs, though (e.g. if you don't add parentheses around everything), and they are not type-safe.
In practice, what works well? What is most standard? What is likely to be most portable?
I suppose my requirements are some combination of:
as fast as possible
as little register pressure as possible
when used with compile-time constants, should ideally be guaranteed to be optimized-away to another constant
easy to maintain...
standard, not too weird, since I'm looking at using this for an open-source project that I hope other people will contribute to
But macros are basically guaranteed to be inlined, by the nature of macros.
On GPU at least, OpenCL compilers aggressively inline pretty much everything, with the exception of recursive functions (OpenCL 2.0). This is done both because of hardware constraints and for performance reasons.
While this is indeed implementation dependent, I have yet to see a GPU binary that is not aggressively inlined. I do not work much with CPU OpenCL but I believe the optimizer strategy may be similar, although the hardware constraints are not the same.
But as far as the standard is concerned, there are no guarantees.
Let's go through your requirements:
as fast as possible
as little register pressure as possible
when used with compile-time constants, should ideally be guaranteed to be optimized-away to another constant
Inline functions are as fast as macros, do not use more registers, and will be optimized away when possible.
easy to maintain...
Functions are much easier to maintain than macros. They are type-safe, they can be refactored easily, etc.; the list goes on forever.
standard, not too weird, since I'm looking at using this for an opensource project, that I hope other people will contribute to
I believe that is very subjective. I personally hate macros with a passion and avoid them like the plague. But I know some very successful projects that use them extensively (for example Boost.Compute). It's up to you.
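To make the macro pitfalls from the question concrete, here is a small sketch of my own; the definitions are written so that they are valid both in an OpenCL C kernel file and in C++:

    #define SQUARE_BAD(x) x * x                  // unparenthesized: expands purely textually
    #define SQUARE(x)     ((x) * (x))            // parenthesized, but still not type-safe

    inline float square(float x) {               // type-checked; OpenCL compilers inline this anyway
        return x * x;
    }

    // SQUARE_BAD(a + 1)  expands to  a + 1 * a + 1   -> not (a + 1) squared
    // SQUARE(i++)        evaluates i++ twice         -> still surprising
    // square(a + 1.0f)   does what it says, and a compile-time constant argument folds away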
Can someone suggest a good way to understand how MPI works?
If you are familiar with threads, then you can treat each MPI process as a thread (to an extent).
You send a message (some work) to a process, it does the work, and then it returns some results to you.
Similar behaviors between threads & MPI:
Both involve partitioning the work and processing the pieces separately.
Both incur overhead as more nodes/threads get involved. MPI's overhead is more significant than a thread's: passing messages between nodes is costly, and if the work is not carefully partitioned you can end up with the time spent passing messages exceeding the computation time needed to process the job.
Different behaviors:
They have different memory models: an MPI process does not share memory with the others and knows nothing about the rest of the world unless you explicitly send something to it.
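To make the "no shared memory" point concrete, here is a minimal sketch of my own (run with at least two processes, e.g. mpirun -np 2): every process has its own copy of value, and rank 1 only learns rank 0's value through an explicit message.

    #include <cstdio>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int value = rank * 100;   // each process has its own, private `value`

        if (rank == 0) {
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);              // ship it to rank 1
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::printf("rank 1 now holds rank 0's value: %d\n", value);     // prints 0
        }

        MPI_Finalize();
        return 0;
    }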
Here you can find some learning materials: http://www.mcs.anl.gov/research/projects/mpi/
Parallel programming is one of those subjects that is "intrinsically" complex (as opposed to merely "accidentally" complex, to borrow Fred Brooks's distinction).
I used Parallel Programming with MPI by Peter Pacheco. This book gives a good overview of the basic MPI topics, the available APIs, and common patterns for parallel program construction.