How to compute the sending/receiving partner process in MPI_Allreduce

In MPI_Allreduce(), at each step you send a number of bytes to another process, and you also receive a number of bytes from the same process you're sending to.
Is there a straightforward way to compute these partner processes? For example:
In step 0: processes 0-1, 2-3, ...
In step 1: processes 0-2, 1-3, ...
In step 2: processes 0-4, ...
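Whether a given MPI library actually uses this schedule depends on the implementation and the message size, but the pairing in the example matches the classic recursive-doubling pattern: in step k, each rank exchanges with the rank whose k-th bit differs. A minimal sketch, assuming a power-of-two number of processes (the names are illustrative, not part of any MPI API):

/* Recursive-doubling partner schedule: in step k, rank r pairs with r XOR 2^k. */
#include <stdio.h>

int main(void) {
    int comm_size = 8;                            /* assumed power of two */
    for (int step = 0; (1 << step) < comm_size; ++step) {
        for (int rank = 0; rank < comm_size; ++rank) {
            int partner = rank ^ (1 << step);     /* flip bit number 'step' */
            printf("step %d: rank %d <-> rank %d\n", step, rank, partner);
        }
    }
    return 0;
}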

Related

Calculate Queue Wait Times

I am looking to calculate the wait time in a queue, either per position or as a general estimate based on your queue position. It is a FIFO queue.
The current performance status of the service:
Size             AvgTime   Queue  Processing  AvgFileSize(mb)
1 (0 - 1 mb)        2.57      18           3             0.21
2 (1 - 5 mb)       12.43       2           4             2.16
3 (5 - 10 mb)      23.38       9           8             6.72
4 (10 - 25 mb)     38.17       1           4            12.52
5 (>= 25 mb)      109.31       0           0            32.41
The current list of processing and queued batch files. It only lists the current user's files, which is why some queue numbers are missing.
Queue Filename Status
30 Batch (3456).XML(2) Queue
20 Batch (2399).xml(3) Queue
14 batch (1495).xml(1) Queue
12 batch (1497).xml(1) Queue
15 batch (1499).xml(1) Queue
10 batch (1500).xml(4) Queue
13 batch (1496).xml(1) Queue
11 batch (1501).xml(1) Queue
9 batch (1498).xml(1) Queue
8 batch (1494).xml(1) Queue
7 batch (1493).xml(1) Queue
6 batch (1492).xml(1) Queue
5 batch (1491).xml(1) Queue
4 batch (1490).xml(1) Queue
3 batch (1).xml(1) Queue
2 Batch1.xml(1) Queue
1 Batch1.XML(2) Queue
Batch1.xml(1) Processing
Batch1.xml(1) Processing
Batch1.xml(3) Processing
Batch1.xml(4) Processing
Batch1.xml(1) Processing
Batch1.xml(3) Processing
Batch1.xml(3) Processing
Batch1.xml(3) Processing
Batch1.xml(4) Processing
Batch1.xml(4) Processing
Batch1.xml(2) Processing
Batch1.xml(3) Processing
Batch1.xml(3) Processing
Batch1.xml(2) Processing
Batch1.xml(2) Processing
Batch1.xml(3) Processing
Batch1.xml(3) Processing
Batch1.xml(4) Processing
Batch1.xml(2) Processing
So I am looking to add more information to the list: how long a batch file at, say, position 20 will wait in the queue before it starts processing.
Queue Filename Status
30 (*30min) Batch (3456).XML(2) Queue
20 (*10min) Batch (2399).xml(3) Queue
...
*estimated
Your question doesn't quite provide enough context to make it possible to answer, but I can make some guesses based on the sample displays you provided.
Looks like you have a "single queue, multiple server" setup. In other words, you have a single FIFO queue and some fixed number N of jobs that can be in processing at any given time. Is that right?
For your algorithm, let's assume you have the following information:
Position of our job in the queue (position K means there are K jobs ahead
of us)
Size of our job
Size of each job ahead of us in the queue
Pool of jobs being processed, with a certain maximum size N
Size of each job currently being processed
Elapsed time for each job currently in process (how long since that job started)
First of all, you will need a function ExpectedJobDuration(jobsize) that computes an expected job processing time for a job of a given size, based on the statistics shown in your "performance status" table. This looks pretty straightforward. Given a job size, first figure out which of your five size categories it falls into (0: 0-1mb, 1: 1-5mb, etc.) Then take your job size and multiply by the average time divided by the average size of jobs in that category. That will give you an estimate of ExpectedJobDuration(jobsize), which will tell you how long it takes to run a job of a given size, under the assumption that job time is proportional to job size, for jobs within a particular size range.
Now, for a job of a given size that's already been in process for a given time ElapsedProcessingTime, how long do we expect it to take to complete? A simple answer would be something like:
ExpectedRemainingTime = ExpectedJobDuration(jobsize) - ElapsedProcessingTime.
For jobs sitting in the queue this will be exactly the same as the expected job duration; for jobs already being processed we subtract the time the job has already been in work. However, if there is some random variation in job processing times, this is not exactly right, and could turn out to be negative. This is sort of like the actuarial problem: the average lifespan of a person is X years; how long do we expect someone to live if they are already Y years old? You would need a lot more statistical data to compute this, so for practical purposes, if the answer comes out negative, just set it to zero. (If someone is 100 years old, and the average human lifespan is 90, expect them to die at any moment. That's not quite right, but perhaps OK as a first approximation. Unless you are the 100 year old person, and not yet ready to die. :-))
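A minimal sketch of these two estimates in C, using the statistics from the performance table above (the array layout and names are illustrative):

/* Per-category statistics from the performance table (upper size bound,
 * average processing time, average file size). */
static const double cat_max_mb[]   = {  1.0,   5.0,  10.0,  25.0,  1e9    };
static const double cat_avg_time[] = {  2.57, 12.43, 23.38, 38.17, 109.31 };
static const double cat_avg_size[] = {  0.21,  2.16,  6.72, 12.52,  32.41 };

/* Expected processing time for a job of the given size, assuming time is
 * proportional to size within its category. */
double ExpectedJobDuration(double size_mb) {
    int cat = 0;
    while (cat < 4 && size_mb >= cat_max_mb[cat]) cat++;
    return size_mb * cat_avg_time[cat] / cat_avg_size[cat];
}

/* Expected time left for a job that has already been running for `elapsed`;
 * clamped at zero so an overdue job never produces a negative estimate. */
double ExpectedRemainingTime(double size_mb, double elapsed) {
    double remaining = ExpectedJobDuration(size_mb) - elapsed;
    return remaining > 0.0 ? remaining : 0.0;
}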
OK, now we have a way to compute how long each job ahead of us in the queue should take, and how long it should take to complete jobs already in process.
If the number of jobs currently being processed is less than N (the max that can be processed at any given time) then our job can start right away. So in that case we have the answer - expected delay until our job can start is zero seconds.
Now let's look at the case where we are in position 0 in the queue. That means there are no jobs ahead of us in the queue, so our expected time to start is the minimum of the ExpectedRemainingTime of the jobs in the processing pool.
Now that gives us the basis for a recursive function that computes delay until our expected start time.
DelayUntilStart(jobPool, currentJob, queue) {
    if jobPool has fewer than N jobs
        return 0                       // a processing slot is free, no wait
    find minJob in jobPool with minimum ExpectedRemainingTime
    let wait = ExpectedRemainingTime(minJob)
    if currentJob is in position zero of the queue
        return wait
    else
        remove minJob from jobPool
        reduce the remaining time of every job left in jobPool by wait
            (that much time passes while minJob finishes)
        pop the top job from the queue and put it in jobPool
        return wait + DelayUntilStart(jobPool, currentJob, queue)
}
Note - we may have a very long job ahead of us in the queue - but that doesn't mean we have to wait for it to complete. We just have to wait for it to get into the pool of jobs currently being processed, and then a shorter job might complete and let us into the pool.
The algorithm I just described is going to be an approximation. But it's probably about as good as you are going to get without a lot of statistics about job processing times. For practical purposes I bet it would work pretty well.
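For completeness, here is one way the recursion could look in C. It is a sketch under the same assumptions, reusing the ExpectedRemainingTime helper sketched earlier; the job layout and parameter names are illustrative:

typedef struct { double size_mb; double elapsed; } Job;

/* Expected wait before the job `pos` places from the front of the queue can
 * start. `pool` holds the jobs currently processing (pool_n <= pool_max) and
 * `queue` the waiting jobs in FIFO order; both are modified, so pass copies. */
double DelayUntilStart(Job *pool, int pool_n, int pool_max,
                       Job *queue, int queue_n, int pos) {
    if (pool_n < pool_max)
        return 0.0;                               /* a processing slot is free */

    /* Find the pooled job expected to finish first. */
    int min_i = 0;
    for (int i = 1; i < pool_n; ++i)
        if (ExpectedRemainingTime(pool[i].size_mb, pool[i].elapsed) <
            ExpectedRemainingTime(pool[min_i].size_mb, pool[min_i].elapsed))
            min_i = i;
    double wait = ExpectedRemainingTime(pool[min_i].size_mb, pool[min_i].elapsed);

    if (pos == 0)
        return wait;                              /* the next free slot is ours */

    /* That job finishes: every other pooled job has then run `wait` longer,
     * and the head of the queue takes the freed slot. */
    for (int i = 0; i < pool_n; ++i)
        pool[i].elapsed += wait;
    pool[min_i] = queue[0];
    pool[min_i].elapsed = 0.0;
    for (int i = 1; i < queue_n; ++i)
        queue[i - 1] = queue[i];

    return wait + DelayUntilStart(pool, pool_n, pool_max,
                                  queue, queue_n - 1, pos - 1);
}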

CPU Pipeline: How to find average instruction execution time

In a CPU with a four (4)-stage pipeline composed of fetch, decode, execute, and write
back, each stage takes 10, 6, 8, and 8 ns, respectively. Which of the following is an
approximate average instruction execution time in nanoseconds (ns) in the CPU? Here, the
number of instructions to be executed is sufficiently large. In addition, the overhead for the
pipelining process is negligible, and the latency impact from all hazards is ignored.
a) 6
b) 8
c) 10
d) 32
The answer is 10 ns, but I thought it might be 8 ns since the execute stage takes 8 ns. Please explain simply. Thanks.
Each instruction must go through the four stages. Once the pipeline is full, the flow of instructions in and out is determined by the duration of the longest stage:
Fetch | Decode | Exec | Write |
 10ns |   6ns  |  8ns |   8ns |
------+--------+------+-------+
I7 I6 I5 -->  I4  :  I3  :  I2  :  I1  --> out
------+--------+------+-------+
I1..I7 are instructions. I1..I4 are in the pipeline, I5..I7 are
waiting to enter the pipeline.
After 6ns, I3 is ready to move from Decode to Exec, but cannot because the Exec stage is still occupied by I2.
After 2ns more (8ns total), I1 moves out of Write, I2 moves from Exec to Write, and I3 can finally move from Decode to Exec.
I4 is still blocking Fetch, so I5 cannot enter.
After 2ns more (10ns total), I4 moves from Fetch to Decode, and I5 can enter Fetch.
You see that the pipeline stalls until the longest stage is completed; one instruction enters the pipeline every 10ns. (The Decode stage will be idle 40% of the time, and the Exec and Write stages 20% of the time.)
In a pipelined situation, "the rate at which output is produced" is determined by the slowest stage. It doesn't matter how fast the rest of the pipeline works; things are bound by the rate at which the slowest stage (here, Fetch at 10 ns) operates. Therefore we can expect the pipeline to produce an output every 10 ns. "The rate at which output is produced" can be interpreted as the average execution time, so it's 10 ns.
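As a quick check of that figure: assuming all four stages advance on a common clock set by the slowest stage (10 ns), the first instruction completes after 4 x 10 ns and each one after it completes 10 ns later, so for N instructions

total time = (4 + N - 1) x 10 ns
average time per instruction = (N + 3) x 10 ns / N -> 10 ns for large N

With no pipelining at all, one instruction would instead take 10 + 6 + 8 + 8 = 32 ns, which is answer (d).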

Is there a size limit on a variable in MPI_Bcast?

I have the latest MPICH2 (3.0.4) compiled with the Intel Fortran compiler on a quad-core, dual-CPU (Intel Xeon) machine.
I am running into an MPI_Bcast problem where I am unable to broadcast the array
gpsi(1:201,1:381,1:38,1:20,1:7)
which makes it an array of size 407410920. When I try to broadcast this array I get the following error:
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1525)......: MPI_Bcast(buf=0x7f506d811010, count=407410920,
MPI_DOUBLE_PRECISION, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1369).:
MPIR_Bcast_intra(1160):
MPIR_SMP_Bcast(1077)..: Failure during collective
rank 1 in job 31 Grace_52261 caused collective abort of all ranks
exit status of rank 1: killed by signal 9
MPI launch string is: mpiexec -n 2 %B/tvdbootstrap
Testing MPI configuration with 'mpich2version'
Exit value was 127 (expected 0), status: execute_command_t::exited
Launching MPI job with command: mpiexec -n 2 %B/tvdbootstrap
Server args: -callback 127.0.0.1:4142 -set_pw 65f76672:41f20a5c
So the question: is there a limit on the size of a variable in MPI_Bcast, or is the size of my array more than what it can handle?
As John said, your array is too big because its total size in bytes can no longer be described by an int variable. When this is the case, you have a few options.
Use multiple MPI calls to send your data. For this option, you would just divide your data up into chunks smaller than 2^31 and send them individually until you've received everything.
Use MPI datatypes. With this option, you need to create a datatype to describe some portion of your data, then send multiples of that datatype. For example, if you are just sending an array of 100 integers, you can create a datatype of 10 integers using MPI_TYPE_VECTOR, then send 10 of that new datatype. Datatypes can be a bit confusing when you're first taking a look at them, but they are very powerful for sending either large data or non-contiguous data.
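As an illustration of the first option, here is a rough C sketch of broadcasting a large buffer in chunks (the original code is Fortran, but the idea carries over; the chunk size and names are arbitrary):

#include <mpi.h>

/* Broadcast a large double array in pieces so that each individual
 * MPI_Bcast count stays well below 2^31 elements (and bytes). */
void bcast_large(double *buf, long long total, int root, MPI_Comm comm) {
    const long long chunk = 64 * 1024 * 1024;     /* 64M doubles = 512 MB */
    for (long long off = 0; off < total; off += chunk) {
        long long n = total - off;
        if (n > chunk) n = chunk;
        MPI_Bcast(buf + off, (int)n, MPI_DOUBLE, root, comm);
    }
}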
Yes, there is a limit. It's usually 2^31, so about two billion elements. You say your array has 407 million elements, so the element count seems fine. However, if the limit is two billion bytes, then your 407 million double-precision values come to roughly 3.3 billion bytes, well over that limit. Try cutting your array size in half and see if that works.
See: Maximum amount of data that can be sent using MPI::Send

Error in MPI broadcast

Sorry for the long post. I did read some other MPI broadcast related errors but I couldn't
find out why my program is failing.
I am new to MPI and I am facing this problem. First I will explain what I am trying to do:
My declarations :
ROWTAG 400
COLUMNTAG 800
Create a 2 X 2 Cartesian topology.
Rank 0 has the whole matrix. It wants to distribute parts of the matrix to all the processes in the 2 X 2 Cartesian topology. For now, instead
of a matrix I am just dealing with integers. So for process P(i,j) in the 2 X 2 Cartesian topology (i - row, j - column), I want it to receive
(ROWTAG + i) in one message and (COLUMNTAG + j) in another message.
My strategy to do so is:
Processes: P(0,0) , P(0,1), P(1,0), P(1,1)
P(0,0) has all the initial data.
P(0,0) sends (ROWTAG+1) (in this case 401) to P(1,0) - in essence P(1,0) is responsible for distributing information related to row 1 to all the processes in row 1 - I just used a blocking send
P(0,0) sends (COLUMNTAG+1) (in this case 801) to P(0,1) - in essence P(0,1) is responsible for distributing information related to column 1 to all the processes in column 1 - used a blocking send
For each process, I made a row_group containing all the processes in that row and out of this created a row_comm (communicator object)
For each process, I made a col_group containing all the processes in that column and out of this created a col_comm (communicator object)
At this point, P(0,0) has given information related to row 'i' to Process P(i,0) and P(0,0) has given information related to column 'j' to
P(0,j). I call P(i,0) and P(0,j) as row_head and col_head respectively.
For Process P(i,j) , P(i,0) gives information related to row i, and P(0,j) gives information related to column j.
I used a broadcast call:
MPI_Bcast(&row_data,1,MPI_INT,row_head,row_comm)
MPI_Bcast(&col_data,1,MPI_INT,col_head,col_comm)
Please find my code here: http://pastebin.com/NpqRWaWN
Here is the error I see:
* An error occurred in MPI_Bcast
on communicator MPI COMMUNICATOR 5 CREATE FROM 3
MPI_ERR_ROOT: invalid root
* MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
Also please let me know if there is any better way to distribute the matrix data.
There are several errors in your program. First, row_Ranks is declared with one element fewer than needed, and when writing to it you may overwrite other stack variables:
int col_Ranks[SIZE], row_Ranks[SIZE-1];
// ^^^^^^
On my test system the program just hangs because of that.
Second, you create new subcommunicators out of matrixComm, but when performing the broadcast you use rank numbers from matrixComm to address processes in those subcommunicators. That doesn't work. For example, in a 2x2 Cartesian communicator ranks range from 0 to 3. In any column- or row-wise subgroup there are only two processes, with ranks 0 and 1 - there is neither rank 2 nor rank 3. If you take a look at the value of row_head across the ranks, it is 2 in two of them, hence the error.
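A minimal sketch of how the root can be translated from matrixComm into a sub-communicator before broadcasting (using MPI_Group_translate_ranks; the variable names follow the description above and are otherwise illustrative):

#include <mpi.h>

/* Convert a rank given in the parent communicator into the corresponding
 * rank inside a sub-communicator, so it is valid as an MPI_Bcast root there. */
int translate_root(MPI_Comm parent, MPI_Comm sub, int root_in_parent) {
    MPI_Group parent_grp, sub_grp;
    int root_in_sub;
    MPI_Comm_group(parent, &parent_grp);
    MPI_Comm_group(sub, &sub_grp);
    MPI_Group_translate_ranks(parent_grp, 1, &root_in_parent,
                              sub_grp, &root_in_sub);
    MPI_Group_free(&parent_grp);
    MPI_Group_free(&sub_grp);
    return root_in_sub;   /* MPI_UNDEFINED if the root is not in `sub` */
}

/* Usage, with row_head being the rank of P(i,0) in matrixComm:
 * MPI_Bcast(&row_data, 1, MPI_INT,
 *           translate_root(matrixComm, row_comm, row_head), row_comm); */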
For a much better way to distribute the data, you should refer to this extremely informative answer.

How to calculate the number of messages sent in a distributed system?

Suppose we have n processes forming a general network. We don't know which are connected together, but we know the number of processes (n). If at each round a process sends a message to all processes it is connected to and receives one message from each of them, and the program executes for r rounds, is there a way to find how many messages have been sent during the program execution?
As you have pointed out, without the exact network structure it is impossible to put a specific value on the number of messages sent. Instead, we can look at its Big-O value.
Now just to be clear what we mean by Big-O:
Big-O refers to the upper bound (ie. the worst possible case) complexity
It is possible (and quite likely in real systems) that the actual value will be less
Without some function that describes the average case (eg. each process is connected to, on average N / 2 other processes) we must assume the worst case
By "worst case" for this problem we mean the maximum number of messages are sent
So let us assume the worst case, in which each process is connected to N - 1 other processes.
Let us also define some variables:
S := the set of processes
N := the number of processes in S
We can represent the set S as a complete (every node connects to every other node), undirected graph in which each node in the graph corresponds to a process and each edge in the graph corresponds to 2 messages sent (1 outgoing transmission and one reply).
From here, we see that the number of edges in a complete graph is (N(N-1))/2
So in the worst case, the number of messages sent is N(N-1), or N^2 - N.
Now because we are dealing with Big-O notation, we are interested in how this value grows as a function of N.
By dropping the lower-order term, we can see that N^2 - N is an element of O(N^2).
So the number of messages sent grows as N^2 in the worst case.
It is also possible to arrive at this result using an adjacency matrix, that is an N x N matrix where a 1 in the (i, j)th element refers to an edge from node i to node j.
Because in the original problem each process sends a single message to all connected processes, which respond with a single message each, we can see that for every pair (i, j) and (j, i) there will be an edge (one representing an outgoing message, one a reply). The exception will be pairs where i = j, i.e. a process doesn't send itself a message.
So the matrix will be completely filled with 1s with the exception of the diagonal.
0 1 1 1
1 0 1 1
1 1 0 1
1 1 1 0
Above: the adjacency matrix for N = 4.
So we first want to determine a formula for the total number of messages sent as a function of the number of nodes.
By the area of a rectangle, the total number of entries in the matrix is N x N = N^2.
Now we must consider the diagonal. The pairs (x, x) are given by the function f : Z(N) -> Z(N) x Z(N), f(i) = (i, i), which has exactly N distinct values - one for each node.
So the overall result is that we have N^2 - N messages when the diagonal is considered.
Now, we use the same Big-O reasoning from above to arrive at the same conclusion, the number of messages grows in the worst case as O(N^2).
So, now you need only take into consideration the number of rounds that have occurred, leaving you with O(RN^2).
Of course now you must consider whether you really do have the worst case ...
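If you do assume the fully connected worst case, the total is just a product; a tiny sketch (the function name is illustrative, and for a sparser network you would substitute the actual edge count):

/* Worst-case message count: in each of r rounds, every one of the n
 * processes sends one message to, and receives one from, the n - 1 others. */
long long worst_case_messages(long long n, long long r) {
    return r * n * (n - 1);
}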
