Variable memory allocation in MPI Code - mpi

In a cluster running MPI code, is a copy of all declared variables sent to all nodes, so that every node can access them locally instead of performing remote memory accesses?

No, MPI itself does not do this for you.
Every MPI process has its own memory, and the value of any variable may differ from one process to another.
The only way to send or receive data is through explicit MPI calls such as MPI_Send or MPI_Recv. You can pack much of your data into a memory area and send that area to each MPI process, but it will not contain 'every declared variable', only the variables you place into it manually.
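As a minimal sketch of that kind of explicit transfer (assuming at least two processes; the variable name 'value' and the number 42 are made up for illustration):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;                 /* every process has its own 'value' */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;                      /* only rank 0 knows this value */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }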
Update:
Each node runs a copy of the program. Each copy initializes its variables as it wants (the initialization can be the same everywhere, or individual, based on the MPI process number, called the rank, obtained from the MPI_Comm_rank function). So every variable exists in N copies: one set per MPI process. Every process sees variables, but only the set it owns, and their values are not synchronized automatically.
So the programmer's task is to synchronize the values of variables between nodes (MPI processes).
For example, here is a small MPI program that computes Pi:
http://www.mcs.anl.gov/research/projects/mpi/usingmpi/examples/simplempi/cpi_c.htm
It sends the value of the 'n' variable from the first process to all the others (MPI_Bcast); then every process sends its own locally computed 'mypi' to be accumulated into the 'pi' variable of the first process (the individual values are added together by the MPI_Reduce function).
Only the first process reads N from the user (via scanf); this code is executed conditionally based on the rank of the process. The other processes must get N from the first process because they did not read it from the user themselves.
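A condensed sketch of that pattern, loosely following the linked example (error handling and the outer input loop are omitted):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int n, rank, size, i;
        double mypi, pi = 0.0, h, sum, x;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                     /* only rank 0 reads from the user */
            printf("Enter the number of intervals: ");
            fflush(stdout);
            scanf("%d", &n);
        }
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* now every process has n */

        h = 1.0 / (double)n;                 /* each process integrates its share */
        sum = 0.0;
        for (i = rank + 1; i <= n; i += size) {
            x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x * x);
        }
        mypi = h * sum;

        /* the individual 'mypi' values are summed into 'pi' on rank 0 */
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("pi is approximately %.16f\n", pi);

        MPI_Finalize();
        return 0;
    }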
Update 2 (sorry for the late answer):
Consider the signature of MPI_Bcast (sketched below): the programmer passes the address of a variable to this function. Each MPI process passes the address of its own 'n' variable (the address can be different in each process). MPI_Bcast then compares the rank of the current process with another argument, the rank of the "broadcaster" (the root).
If the current process is the broadcaster, MPI_Bcast reads the value stored in memory at the given address (i.e. the value of the broadcaster's 'n' variable) and sends that value over the network.
Otherwise, the current process is a "receiver": MPI_Bcast obtains the value from the broadcaster (via the network, using the MPI library internals) and stores it in the current process's memory at the given address.
So the address is passed to this function because on some nodes the function writes into the variable. Only the value itself travels over the network.
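A small self-contained sketch of that behaviour (the prototype is quoted in the comment; the value 100 is arbitrary):

    #include <mpi.h>
    #include <stdio.h>

    /* Prototype, for reference:
       int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
                     int root, MPI_Comm comm);                        */

    int main(int argc, char **argv)
    {
        int rank, n = -1;                    /* every process has its own 'n' */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            n = 100;                         /* only the broadcaster's value matters */

        /* rank 0 reads 'n' from its own memory and sends it over the network;
           every other rank writes the received value into its own 'n'        */
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

        printf("rank %d now has n = %d\n", rank, n);

        MPI_Finalize();
        return 0;
    }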

Related

Using MPI_Gatherv with only a subset of all processes

I want to use MPI_Gatherv to collect data onto a single process. The thing is, I don't need data from all other processes, just a subset of them. The only communicator in the code is MPI_COMM_WORLD. So, does every process in MPI_COMM_WORLD have to call MPI_Gatherv? I have looked at the MPI standard and can't really make out what it is saying. If all processes must make the call, can some of the values in MPI_Gatherv's "recvcounts" array be zero? Or would there be some other way to signal to the root process which processes should be ignored? I guess I could introduce another communicator, but for this particular problem that would be coding overkill.

How are variables found, and is it done in constant time

One thing I was recently thinking about is how a computer finds its variables. When we run a program, the program creates multiple layers on the stack, one layer for every new scope it opens, and puts either the variable's value, or a pointer in the case of heap storage, in that scope. When the scope ends, it and all its variables are destroyed. But how does the computer know where its variables are? And which one should it use if the same variable name occurs more than once?
The way I imagine it, the computer searches the scope it is in like an array and, if it doesn't find the variable, follows the stack downwards like a linked list and searches the next scope like an array.
That leads to the assumption that a global variable is the slowest to use, since the search has to traverse all the way back to the outermost scope. So it would have a lookup cost of roughly a * n (a = average number of variables per scope, n = number of scopes). If my code is recursive and the recursive function refers to a global variable (say I have defined the constant PI = 3.1416 and I use it in every recursion), then it would traverse backwards for every single call, and if the recursion is 1000 levels deep, it does that 1000 times.
On the other hand, while learning about recursion, I have never heard that referring to variables not found inside the recursive scope should be avoided if possible. So I wonder whether my mental model is right. Can someone please shed some light on this?
You got it the other way around: scopes, frames, heaps don't make variables, variables make scopes, frames, heaps.
Both statements are a bit of a stretch, actually, but my point is to avoid focusing on the lifetime of a variable (that is what terms like heap and stack really mean) and instead take a look under the hood.
Memory is a form of storage where each cell is assigned a number; the cell is called a word and the number is called an address.
The set of addresses is called the address space; an address space is usually a range of addresses or a union of ranges of addresses.
The compiler assumes the program data will be loaded at a specific address, say X, and that there is enough memory after X (i.e. X+1, X+2, X+3, ... all exist) for all the data.
Variables are then laid out sequentially from X onward; it is the job of the compiler to keep the association between the address X+k and the variable instance.
Note that a variable may be instantiated more than once; calling a function twice and recursion are both examples of that.
In the first case, the two instances can share the same address X+k since they don't overlap in time (by the time the second instance is alive, the first is over).
In the second case, the two instances overlap in time and two addresses must be used.
So we see that it is the lifetime of a variable that affects how the mapping between the variable name and its address (a.k.a. the allocation of the variable) is done.
Two common strategies are:
A stack
We start from an address X+b and allocate new instances at successive addresses X+b+1, X+b+2, etc.
The current address (e.g. X+b+54) is stored somewhere (it is the stack pointer).
When we want to free a variable we set the stack pointer back (e.g. from X+b+54 to X+b+53).
We can see that it's impossible to free a variable that is not the last allocated.
This allows for a very fast allocation/deallocation and naturally fits the need of a function frame that holds the local variables: when a function is invoked the new variables are allocated, when it ends they are removed.
From what we noted above, we see that if f calls g (i.e. f is the parent of g) then the variables of f cannot be deallocated before those of g.
This again naturally fits the semantics of functions.
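A small illustration, assuming a typical C compiler that keeps local variables in stack frames (the printed addresses will vary between runs):

    #include <stdio.h>

    /* each live invocation of f() gets its own instance of 'local',
       so overlapping recursive calls see different addresses */
    static void f(int depth)
    {
        int local = depth;
        printf("depth %d: &local = %p\n", depth, (void *)&local);
        if (depth < 3)
            f(depth + 1);
    }

    int main(void)
    {
        f(0);
        return 0;
    }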
The heap
This strategy dynamically allocates a variable instance at some address X+o.
The runtime reserves a block of addresses and manages their status (free or occupied); when asked, it can hand out a free address and mark it occupied.
This is useful to allocate an object whose size depends on the user input, for example.
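A sketch of that case in C, where the number of elements comes from user input (error handling omitted):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t n;
        printf("How many doubles? ");
        scanf("%zu", &n);

        /* the runtime hands out a free block of the requested size
           and marks it occupied */
        double *data = malloc(n * sizeof *data);

        /* ... use data[0] .. data[n-1] ... */

        free(data);                 /* mark the block free again */
        return 0;
    }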
The heap (static)
Some variables have the lifespan of the whole program, but their size and number are known at compile time.
In this case, the compiler simply assigns each instance a unique address X+i.
They cannot be deallocated; they are loaded into memory along with the program code and stay there until the program is unloaded.
I left out some details, such as the fact that the stack more often than not grows from higher to lower addresses (so it can be put at the far edge of memory) and that a variable may occupy more than one address.
Some programming languages, especially interpreted ones, don't associate addresses with variable instances; instead, they keep a map between the (properly qualified) variable name and the variable value. This way the lifespan of a variable can be controlled in many particular ways (see closures in JavaScript).
Global variables are allocated in the static heap, only one instance is present (only one address).
Each recursive function that uses it always references directly to the sole instance because the unique address is known at compile time.
Local variables in a function are allocated in the stack and each invocation of a function (recursive or not) uses a new set of instances (the addresses don't need to be the same each time, but they could).
Simply put, there is no lookup: variables are allocated so that the compiled code can access them directly (either relatively, on the stack, or absolutely, in the static heap).
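Coming back to the question: a quick way to see this, assuming a typical C toolchain (the constant PI mirrors the example in the question):

    #include <stdio.h>

    static const double PI = 3.1416;   /* one instance, one fixed address */

    static double sum_pi(int depth)
    {
        /* every recursive call refers to the same, statically known address
           of PI; no scope walking or lookup happens at run time */
        printf("depth %d: &PI = %p\n", depth, (const void *)&PI);
        if (depth == 0)
            return 0.0;
        return PI + sum_pi(depth - 1);
    }

    int main(void)
    {
        printf("sum = %f\n", sum_pi(3));
        return 0;
    }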

MPI_Scatter: order of scatter

In my work, I noticed that even if I scatter the same amount of data to each process, it takes more time to transfer data from the root to the highest-rank process. I tested this on a distributed-memory machine. If an MWE is needed I will prepare one, but before that I would like to know whether MPI_Scatter gives priority to lower-rank processes.
The MPI standard does not say such a thing, so MPI libraries are free to implement MPI_Scatter() the way they want regarding which task might return earlier than others.
Open MPI, for example, can do either a linear or a binomial scatter (by default, the algorithm is chosen based on the communicator and message sizes).
That being said, all data has to be sent from the root process to the other nodes, so some nodes will obviously be served first. If the root process has rank zero, I would expect the highest-rank process to receive the data last (I am not aware of any MPI library implementing a topology-aware MPI_Scatter(), but that might come some day). If the root process does not have rank zero, then MPI might internally renumber the ranks (so the root is always virtual rank zero); if this pattern is implemented, the last process to receive the data would be (root + size - 1) % size.
If this is suboptimal from your application's point of view, you always have the option of re-implementing MPI_Scatter() your own way (it can call the library-provided PMPI_Scatter() if needed). Another approach would be to use MPI_Comm_split() (with a single color) to renumber the ranks, and then use the new communicator for MPI_Scatter(), as sketched below.
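A sketch of the MPI_Comm_split() approach, assuming you want the highest world ranks served first; the choice of key below (reversing the order) and the chunk size of 4 ints are arbitrary:

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int world_rank, world_size, chunk[4];
        int *sendbuf = NULL;
        MPI_Comm newcomm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        /* single color: everyone lands in the same new communicator;
           the key reverses the ordering, so the old highest rank becomes rank 0 */
        MPI_Comm_split(MPI_COMM_WORLD, 0, world_size - 1 - world_rank, &newcomm);

        int new_rank;
        MPI_Comm_rank(newcomm, &new_rank);
        if (new_rank == 0)                    /* root in the new numbering */
            sendbuf = calloc(4 * world_size, sizeof(int));

        /* scatter 4 ints to each process using the new numbering */
        MPI_Scatter(sendbuf, 4, MPI_INT, chunk, 4, MPI_INT, 0, newcomm);

        free(sendbuf);
        MPI_Comm_free(&newcomm);
        MPI_Finalize();
        return 0;
    }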

Query in MPI initialization

If we call MPI_Init() we know that multiple copies of the same executable run on different machines. Suppose MPI_Init() is in a function f(), then will multiple copies of main() function exist too?
The main problem I am facing is taking input. What happens is that the input is read once but the main function runs several times: the process with rank 0 always seems to have the input, while the rest have random values. So, to distribute the values, do we have to broadcast the input from process 0 to all the other processes?
MPI_Init() doesn't create multiple copies; it just initializes the MPI library within the process. Multiple copies of your process are created before that, most probably by some kind of mpirun command (that is how you run your MPI application). All processes are independent from the beginning, so to answer the first part of your question: yes, multiple copies of main() will exist, and they will exist even if you don't call MPI_Init().
The answer to your question about input depends on its nature: if it is typed in at the console, then you have to read the values in only one process (e.g. rank 0) and then broadcast them. If the input is in a file or specified as a command-line argument, then all processes can access it directly.
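A sketch covering both cases, assuming the console input is a single number and one optional command-line argument (the names 'threshold' and 'iterations' are made up):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank;
        double threshold = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* command-line arguments are available to every process directly */
        int iterations = (argc > 1) ? atoi(argv[1]) : 10;

        /* console input is only read on rank 0 and then broadcast */
        if (rank == 0) {
            printf("Enter a threshold: ");
            fflush(stdout);
            scanf("%lf", &threshold);
        }
        MPI_Bcast(&threshold, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        printf("rank %d: iterations=%d threshold=%g\n", rank, iterations, threshold);

        MPI_Finalize();
        return 0;
    }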

MPI and global variables

I have to implement an MPI program. There are some global variables (4 arrays of floats and 6 single float variables) which are first initialized by the main process, which reads the data from a file. Then I call MPI_Init and, while the process of rank 0 waits for results, the other processes (ranks 1, 2, 3, 4) work on the arrays, etc.
The problem is that those arrays no longer seem to be initialized; everything is set to 0. I tried moving the global variables inside the main function but the result is the same. When MPI_Init() is called, all processes are created by fork, right? So each one has a copy of the parent's memory, so why do they see uninitialized arrays?
I fear you have misunderstood.
It is probably best to think of each MPI process as an independent program, albeit one with the same source code as every other process in the computation. Operations that process 0 carries out on variables in its address space have no impact on the contents of the address spaces of other processes.
I'm not sure that the MPI standard even requires process 0 to have values for variables which were declared and initialised prior to the call to mpi_init, that is before process 0 really exists.
Whether it does or not, you will have to write code to get the values into the variables in the address space of the other processes. One way to do this would be to have process 0 send the values to the other processes, either one by one or using a broadcast, as sketched below. Another way would be for all processes to read the values from the input files; if you choose this option, watch out for contention over I/O resources.
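A sketch of the broadcast approach, assuming the arrays have a fixed length N and that only rank 0 reads the file (names and sizes below are placeholders):

    #include <mpi.h>
    #include <stdio.h>

    #define N 1024                       /* placeholder array length */

    float a[N], b[N], c[N], d[N];        /* the global arrays */
    float p0, p1, p2, p3, p4, p5;        /* the single float parameters */

    int main(int argc, char **argv)
    {
        int rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* read a[], b[], c[], d[] and the six parameters from the file here */
        }

        /* copy rank 0's values into every other process's address space */
        MPI_Bcast(a, N, MPI_FLOAT, 0, MPI_COMM_WORLD);
        MPI_Bcast(b, N, MPI_FLOAT, 0, MPI_COMM_WORLD);
        MPI_Bcast(c, N, MPI_FLOAT, 0, MPI_COMM_WORLD);
        MPI_Bcast(d, N, MPI_FLOAT, 0, MPI_COMM_WORLD);

        float params[6] = {p0, p1, p2, p3, p4, p5};
        MPI_Bcast(params, 6, MPI_FLOAT, 0, MPI_COMM_WORLD);
        p0 = params[0]; p1 = params[1]; p2 = params[2];
        p3 = params[3]; p4 = params[4]; p5 = params[5];

        /* ... ranks 1..4 work on the arrays here ... */

        MPI_Finalize();
        return 0;
    }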
In passing, I don't think it is common for MPI implementations to create processes by forking at the call to mpi_init; forking is more commonly used for creating threads. I think that most MPI implementations actually create the processes when you launch the program with mpiexec; the call to mpi_init is the formality which announces that your program is starting its parallel computations.
When MPI_Init() is called all processes are created by fork right?
Wrong.
MPI spawns multiple instances of your program. These instances are separate processes, each with its own memory space. Each process has its own copy of every variable, including globals. MPI_Init() only initializes the MPI environment so that other MPI functions can be called.
As the other answers say, that's not how MPI works. Data is unique to each process and must be explicitly transferred between processes using the API available in the MPI specification.
However, there are programming models that allow this sort of behavior. If, when you say parallel computing, you mean multiple cores on one processor, you might be better served by using something like OpenMP to share your data between threads.
Alternatively, if you do in fact need to use multiple processors (either because your data is too big to fit in one processor's memory, or for some other reason), you can take a look at one of the Partitioned Global Address Space (PGAS) languages. In those models, you have memory that is globally available to all processes in an execution.
Last, there is a part of MPI that does allow you to expose memory from one process to other processes. It's the Remote Memory Access (RMA) or One-Sided chapter. It can be complex, but powerful if that's the kind of computing model you need.
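A minimal sketch of one-sided access, assuming rank 0 exposes a small array and the other ranks read it with MPI_Get (fence synchronization is the simplest RMA mode):

    #include <mpi.h>
    #include <stdio.h>

    #define N 8

    int main(int argc, char **argv)
    {
        int rank;
        float shared[N] = {0};          /* exposed by rank 0, read by the others */
        float local[N];
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            for (int i = 0; i < N; i++)
                shared[i] = (float)i;

        /* every process creates the window; only rank 0 exposes real data */
        MPI_Win_create(shared, (rank == 0 ? N : 0) * sizeof(float), sizeof(float),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);
        if (rank != 0)
            MPI_Get(local, N, MPI_FLOAT, 0, 0, N, MPI_FLOAT, win);  /* read rank 0's array */
        MPI_Win_fence(0, win);

        if (rank != 0)
            printf("rank %d read local[3] = %f\n", rank, local[3]);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }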
All of these models will require changing the way your application works, but it sounds like they might map to your problem better.
