timing single function in a hybrid code of MPI/OpenMP - mpi

I have a hybrid code with MPI/OpenMP. I want to know what is the time spent for particular function, let's say A, for every MPI process. This function is called inside the OpenMP do/for loops also in a very complicated way by various functions on top of it (i.e. some other functions let's say B and C may be calling A which also might be inside the OpenMP do/for loops). I was planning to do it as follows:
double A()
{
time1 = MPI_Wtime();
//compute result...
//Note: inside this function there is no OpenMP or MPI calls...
// just pure computation of results...
time2 = MPI_Wtime();
printf("myRank=%d timeSpent=%f\n", myRank, (time2-time1));
return result;
}
Would the sum of all the times per every MPI process be the total time spent for this function by that MPI process? If not please can you show me how to get it correctly, thanks!

We don't want to reinvent the wheel and we don't want to reinvent the MPI profiler. That would be hard.
There are very powerful tools available from the manufactures of many cluster systems. For example Cray machines usually come with CrayPat which spits out magic.
Additionally there is free software such as this http://mpip.sourceforge.net/

I would recommend not to reinvent the wheel but to use some professional grade software already built for profiling such as TAU, or MPIP or Gprof ...
heres a decent presentation to get you started

Related

MPI_Send/Recv vs. MPI_Reduce

I was given a little excercise where I had to implement a Monte Carlo algorithm by using MPI to estimate the total volume of n spheres, having the coordinates of their center and radius in 3 dimensions. Even if we must use MPI, we can launch all the processes on our local machine, so there's no network overhead. I implemented two versions of this excericse:
One, using MPI_Send and MPI_Recv (where the process of rank 0 only waits for partial results from the others to perform the final sum)
http://pastebin.com/AV41hJqn
The other, using MPI_Reduce, also here process of rank 0 waits for partial results.
http://pastebin.com/8b0czv6a
I expected that both the programs would take the same time to finish, but I see that the one using MPI_Reduce is faster. Why this? Where's the difference?
There could be a lot of reasons depending on which MPI implementation you're using, what kind of hardware you're running on and how optimized the implementation is to take advantage of that. This Google Scholar search gives some idea of the variety of work done on this. To give you a few ideas of what it could be:
Since reductions can be completed in intermediate steps, it may be possible to use a different topology than the basic rank 0 collect-from-all approach, with tradeoffs in latency and bandwidth.
Within a compute node (or on your desktop or laptop if you're trying this with a toy problem), it may be possible to exploit locality within cores, between cores on a CPU socket or between sockets to order the computations and communication in a way that's more efficient for the hardware. It sounds from the abstract like this paper from IBM may give some concrete details about some of these design decisions. Alternatively, the implementation might choose a cache-oblivious scheme for better performance within a general compute node.
Persistent communication (MPI_Send_init and MPI_Recv_init) can be used under the hood in the MPI_Reduce implementation. These routines can perform better than their blocking and non-blocking counterparts due to providing the MPI implementation and hardware with extra details about how the program is grouping its communications.
This is not a comprehensive list, but hopefully it gets you started and provides some ideas for how to search out more details if you're interested.

STD Classes in CUDA Kernel

I know that there is no way using std classes such as string, vector, map or set in CUDA kernel. However, it's very uncomfortable without them. I have to write a lot of code in CUDA kernel, so I would like to use at least strings and vectors. I'm not talking about something like thrust. I want to be able to write something like this:
__global__ void kernel()
{
cuda_vector<int> a;
for(int i=0;i<10;i++)
a.push_back(i);
}
int main()
{
kernel<<<1,512>>>();
return 0;
}
This should create 512 threads and in each thread I want to create cuda_vector class and use it as std::vector. I didn't find any solution on the internet and I started to write my own class. Each function of this class is defined as "__ host __ " and " __ device __" function so that I can use it on both CPU and GPU.
Theoretically, it can be implemented, however only on Fermi architecture. Because, we need to allocate memory dynamically. I have GTX 580 and started to write my own Vector. But it's tiring and needs a lot of time. Isn't there any implementation which I can use? I can't believe that there isn't any. Do so many software developers write on CUDA without it? And noone tried to write his/her own version?
The reason you don't find something like std::vector for cuda is performance. Your traditional vector object doesn't fit well with the CUDA model. If you are planning on using only 512 threads and each one will be managing a std::vector like object your performance is going to be worse than running the same code on the CPU.
GPU threads are not like CPU threads, they should be as light as possible. Use thread blocks and shared memory to have the threads cooperate. If you are manipulating a string, each thread should be working on one character, if you are using vectors in the CPU pass an array of that to the GPU, and have each thread work on one element. Basically, think about how to solve the problem with the CUDA programming model as apposed to solving it with a CPU approach and then translating it to CUDA.
I've not used it, but the CuPP framework may be of interest to you, especially the vector<T> implementation. Looks like it could do what you need it to do.

Efficient way to execute the sequential part(large no of operations + writing file) of a parallel code?

I have a C++ code using mpi and is executed in a sequential-parallel-sequential pattern. The above pattern is repeated in a time loop.
While validating the code with the serial code, I could get a reduction in time for the parallel part and in fact the reduction is almost linear with the no of processors.
The problem that I am facing is that the time required for the sequential part also increases considerably when using higher no of processors.
The parallel part takes less time to be executed in comparison with total sequential time of the entire program.
Therefore although there is a reduction in time for the parallel part when using higher no of processors, the saving in time is lost considerably due to increase in time while executing the sequential part. Also the sequential part includes a large no of computations at each time step and writing the data to an output file at some specified time.
All the processors are made to run during the execution of sequential part and the data is gathered to the root processor after the parallel computation and only the root processor is allowed to write the file.
Therefore can anyone suggest what is the efficient way to calculate the serial part (large no of operations + write the file) of the parallel code ? I would also like to clarify on any of the point if required.
Thanks in advance.
First of all, do file writing from separate thread (or process in MPI terms), so other threads can use your cores for computations.
Then, check why your parallel version is much slower than sequential. Often this means you creates too small tasks so communication between threads (synchronization) eats your performance. Think if tasks can be combined into chunks and complete chunks processed in parallel.
And, of course, use any profiler that is good for multithreading environment.
[EDIT]
sequential part = part of your logic that cannot be (and is not) paralleled, do you mean the same? sequential part on multicore can work a bit slower, probably because of OS dispatcher or something like this. It's weird that you see noticable difference.
Disk is sequential by its nature, so writing to disk from many threads don't give any benefits, but can lead to the situation when many threads try to do this simultaneously and waits for each other instead of doing something useful.
BTW, what MPI implementation do you use?
Your problem description is too high-level, provide some pseudo-code or something like this, this can help us to help you.

Why shouldn't I use F# asynchronous workflows for parallelism?

I have been learning F# recently, being particularly interested in its ease of exploiting data parallelism. The data |> Array.map |> Async.Parallel |> Async.RunSynchronously idiom seems very easy to understand and straightforward to use and get real value from.
So why is it that async is not really intended for this? Donald Syme himself says that PLINQ and Futures are probably a better choice. And other answers I've read here agree with that as well as recommending TPL. (PLINQ doesn't seem too much different to the above built-in functions, as long as you're using the F# Powerpack to get the PSeq functions.)
F# and functional languages make a lot of sense for this, and some applications have achieved great success with async parallelism.
So why shouldn't I use async to execute parallel data processes? What am I going to lose by writing parallel async code instead of using PLINQ or TPL?
So why shouldn't I use async to execute parallel data processes?
If you have a tiny number of completely independent non-async tasks and lots of cores then there is nothing wrong with using async to achieve parallelism. However, if your tasks are dependent in any way or you have more tasks than cores or you push the use of async too far into the code then you will be leaving a lot of performance on the table and could do a lot better by choosing a more appropriate foundation for parallel programming.
Note that your example can be written even more elegantly using the TPL from F# though:
Array.Parallel.map f xs
What am I going to lose by writing parallel async code instead of using PLINQ or TPL?
You lose the ability to write cache oblivious code and, consequently, will suffer from lots of cache misses and, therefore, all cores stalling waiting for shared memory which means poor scalability on a multicore.
The TPL is built upon the idea that child tasks should execute on the same core as their parent with a high probability and, therefore, will benefit from reusing the same data because it will be hot in the local CPU cache. There is no such assurance with async.
I wrote an article that re-implements one C# TPL sample using both Task and Async, which also has some comments on the difference between the two. You can find it here and there is also a more advanced async-based version.
Here is a quote from the first article that compares the two options:
The choice between the two possible implementations depends on many factors. Asynchronous workflows were designed specifically for F#, so they more naturally fit with the language. They offer better performance for I/O bound tasks and provide more convenient exception handling. Moreover, the sequential syntax is quite convenient. On the other hand, tasks are optimized for CPU bound calculations and make it easier to access the result of calculation from other places of the application without explicit caching.
I always figured it's what TPL, PLinq etc... give you over and above what Async does. (Cancellation mechanisms is the one that comes to mind.) This question has some better answers.
This article hints at a slight performance advantage to TPL, but probably not enough to be significant.

Just in Time compilation always faster?

Greetings to all the compiler designers here on Stack Overflow.
I am currently working on a project, which focuses on developing a new scripting language for use with high-performance computing. The source code is first compiled into a byte code representation. The byte code is then loaded by the runtime, which performs aggressive (and possibly time consuming) optimizations on it (which go much further, than what even most "ahead-of-time" compilers do, after all that's the whole point in the project). Keep in mind the result of this process is still byte code.
The byte code is then run on a virtual machine. Currently, this virtual machine is implemented using a straight-forward jump table and a message pump. The virtual machine runs over the byte code with a pointer, loads the instruction under the pointer, looks up an instruction handler in the jump table and jumps into it. The instruction handler carries out the appropriate actions and finally returns control to the message loop. The virtual machine's instruction pointer is incremented and the whole process starts over again. The performance I am able to achieve with this approach is actually quite amazing. Of course, the code of the actual instruction handlers is again fine-tuned by hand.
Now most "professional" run-time environments (like Java, .NET, etc.) use Just-in-Time compilation to translate the byte code into native code before execution. A VM using a JIT does usually have much better performance than a byte code interpreter. Now the question is, since all an interpreter basically does is load an instruction and look up a jump target in a jump table (remember the instruction handler itself is statically compiled into the interpreter, so it is already native code), will the use of Just-in-Time compilation result in a performance gain or will it actually degrade performance? I cannot really imagine the jump table of the interpreter to degrade performance that much to make up the time that was spent on compiling that code using a JITer. I understand that a JITer can perform additional optimization on the code, but in my case very aggressive optimization is already performed on the byte code level prior to execution. Do you think I could gain more speed by replacing the interpreter by a JIT compiler? If so, why?
I understand that implementing both approaches and benchmarking will provide the most accurate answer to this question, but it might not be worth the time if there is a clear-cut answer.
Thanks.
The answer lies in the ratio of single-byte-code-instruction complexity to jump table overheads. If you're modelling high level operations like large matrix multiplications, then a little overhead will be insignificant. If you're incrementing a single integer, then of course that's being dramatically impacted by the jump table. Overall, the balance will depend upon the nature of the more time-critical tasks the language is used for. If it's meant to be a general purpose language, then it's more useful for everything to have minimal overhead as you don't know what will be used in a tight loop. To quickly quantify the potential improvement, simply benchmark some nested loops doing some simple operations (but ones that can't be optimised away) versus an equivalent C or C++ program.
When you use an interpreter, the code cache in your processor caches the interpreter code; not the byte code (which may be cached in the data cache). Since code caches are 2 to 3 times faster than data caches, IIRC; you may see a performance boost if you JIT compile. Also, the native, real code you are executing is probably PIC; something which can be avoided for JITted code.
Everything else depends on how optimized the byte code is, IMHO.
JIT can theoretically optimize better, since it has information not available at compile time (especially about typical runtime behavior). So it can for example do better branch prediction, roll out loops as needed, et.c.
I am sure your jumptable approach is OK, but I still think it would perform rather poor compared to straight C code, don't you think?

Resources