I have the latest MPICH2 (3.0.4) compiled with the Intel Fortran compiler on a quad-core, dual-CPU (Intel Xeon) machine.
I am running into an MPI_Bcast problem: I am unable to broadcast the array
gpsi(1:201,1:381,1:38,1:20,1:7)
which has 407,410,920 elements in total. When I try to broadcast this array I get the following error:
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1525)......: MPI_Bcast(buf=0x7f506d811010, count=407410920,
MPI_DOUBLE_PRECISION, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1369).:
MPIR_Bcast_intra(1160):
MPIR_SMP_Bcast(1077)..: Failure during collective
rank 1 in job 31 Grace_52261 caused collective abort of all ranks
exit status of rank 1: killed by signal 9
MPI launch string is: mpiexec -n 2 %B/tvdbootstrap
Testing MPI configuration with 'mpich2version'
Exit value was 127 (expected 0), status: execute_command_t::exited
Launching MPI job with command: mpiexec -n 2 %B/tvdbootstrap
Server args: -callback 127.0.0.1:4142 -set_pw 65f76672:41f20a5c
So the question is: is there a limit on the size of the data that MPI_Bcast can handle, or is my array simply larger than that limit?
As John said, your array is too big because its size in bytes can no longer be described by an int variable. When this is the case, you have a few options.
Use multiple MPI calls to send your data. For this option, you would just divide your data into chunks of fewer than 2^31 elements and broadcast them one after another until everything has been transferred (a sketch of both options follows below).
Use MPI datatypes. With this option, you need to create a datatype to describe some portion of your data, then send multiples of that datatype. For example, if you are just sending an array of 100 integers, you can create a datatype of 10 integers using MPI_TYPE_VECTOR, then send 10 of that new datatype. Datatypes can be a bit confusing when you're first taking a look at them, but they are very powerful for sending either large data or non-contiguous data.
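To make this concrete, here is a minimal C sketch of both options (the helper names are illustrative and your code is Fortran, but the same MPI calls exist there):

#include <mpi.h>
#include <stddef.h>

/* Option 1: broadcast a large buffer in pieces whose element counts
   each fit into an int (the count argument of MPI_Bcast). */
static void bcast_large(double *buf, size_t total, int root, MPI_Comm comm)
{
    const size_t chunk = 1u << 30;                 /* elements per call, < 2^31 */
    for (size_t offset = 0; offset < total; ) {
        size_t n = total - offset;
        if (n > chunk)
            n = chunk;
        MPI_Bcast(buf + offset, (int)n, MPI_DOUBLE, root, comm);
        offset += n;
    }
}

/* Option 2: the 100-integer example, sent as 10 copies of a
   10-integer datatype built with MPI_Type_vector. */
static void bcast_100_ints(int *buf, int root, MPI_Comm comm)
{
    MPI_Datatype ten_ints;
    MPI_Type_vector(10, 1, 1, MPI_INT, &ten_ints); /* 10 contiguous ints */
    MPI_Type_commit(&ten_ints);
    MPI_Bcast(buf, 10, ten_ints, root, comm);      /* 10 x 10 = 100 ints */
    MPI_Type_free(&ten_ints);
}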
Yes, there is a limit. The count argument is an int, so at most 2^31 - 1, i.e. about two billion, elements per call. You say your array has about 407 million elements, so at first glance it seems like it should work. However, if the limit is effectively two billion bytes, then your 407,410,920 double-precision values (about 3.26 GB) exceed it by roughly 50%. Try cutting your array size in half and see if that works.
See: Maximum amount of data that can be sent using MPI::Send
I have an old APL application that runs on DOS and performs FFT and IFFT. It generates a rank limit error on GNU APL. With a workspace of 55 GB there should be no limit on rank or symbols; the only limit that makes sense is a user-settable workspace size limit so we don't max out the memory on a 64-bit machine.
To test this, one can observe that a←(n⍴2)⍴2 fails with n>8 on 64-bit GNU APL, whereas a 16-bit APL for DOS runs until it is out of memory.
Is there a way to change the rank limit?
The default rank limit is indeed 8, but it can be configured using the MAX_RANK configuration parameter. You can either use a GNU APL configuration file to do so, or simply use a command line parameter, e.g. MAX_RANK=64.
By the way, all APL implementations I know of have a maximum rank, and I believe your old DOS APL has such a limit too; you just happen to run out of memory before you hit it, because every axis of length 2 doubles the number of elements (and the elements are at least one byte each) when you do a←(n⍴2)⍴2. Try a←(n⍴1)⍴2, which doesn't add additional elements, and you're likely to find that the maximum rank is 15 or 63 or something like that.
I'm running a Notebook on JupyterLab. I am loading in some large Monte Carlo chains as numpy arrays which have the shape (500000, 150). I have 10 chains which I load into a list in the following way:
chains = []
for i in range(10):
    chain = np.loadtxt('my_chain_{}.txt'.format(i))
    chains.append(chain)
If I load 5 chains then all works well. If I try to load 10 chains, after about 6 or 7 I get the error:
Kernel Restarting
The kernel for my_code.ipynb appears to have died. It will restart automatically.
I have tried loading the chains in different orders to make sure there is not a problem with any single chain. It always fails when loading number 6 or 7 no matter the order, so I think the chains themselves are fine.
I have also tried to load 5 chains in one list and then, in the next cell, tried to load the other 5, but the failure still happens when I get to 6 or 7, even when I split the loading like this.
So it seems like the problem is that I'm loading too much data into the Notebook or something like that. Does this seem right? Is there a work around?
It is indeed possible that you are running out of memory, though unlikely that it's actually your system that is running out of memory (unless it's a very small system). It is typical behavior that if jupyter exceeds its memory limits, the kernel will die and restart, see here, here and here.
Consider that if you are using the float64 datatype by default, the memory usage (in megabytes) per array is:
N_rows * N_cols * 64 / 8 / 1024 / 1024
For N_rows = 500000 and N_cols = 150, that's 572 megabytes per array. You can verify this directly using numpy's dtype and nbytes attributes (noting that the output is in bytes, not bits):
chain = np.loadtxt('my_chain_{}.txt'.format(i))
print(chain.dtype)
print(chain.nbytes / 1024 / 1024)
If you are trying to load 10 of these arrays, that's about 6 gigabytes.
One workaround is increasing the memory limits for jupyter, per the posts referenced above. Another simple workaround is using a less memory-intensive floating-point datatype. If you don't really need the digits of accuracy afforded by float64 (see here), you could simply use a smaller floating-point representation, e.g. float32:
chain = np.loadtxt('my_chain_{}.txt'.format(i), dtype=np.float32)
chains.append(chain)
Given that you can get to 6 or 7 already, halving the data usage of each chain should be enough to get you to 10.
You could be running out of memory.
Try to load the chains one by one and then concatenate them.
chains = []
for i in range(10):
    chain = np.loadtxt('my_chain_{}.txt'.format(i))
    chains.append(chain)
    if i > 0:
        # merge the newly loaded chain into the first one so that only a
        # single large array is kept in the list
        chains[0] = np.concatenate((chains[0], chains[1]), axis=0)
        chains.pop(1)
I'm trying to execute a parallel code with two processes on one node of a 48-core machine with two sockets (1 process per socket). Each process must use only 23 cores of its socket: process 0 must run on cores 1 to 23 (socket 0), and process 1 on cores 25 to 47 (socket 1). I would like to know how I can keep the processes off the first core of each socket (core 0 on socket 0 and core 24 on socket 1) using SRUN.
You can use the map_cpu option provided by srun, passed via the --cpu_bind flag.
From the srun manual:
map_cpu:<list>
Bind by setting CPU masks on tasks (or ranks) as specified where <list> is
<cpu_id_for_task_0>,<cpu_id_for_task_1>,... CPU IDs are interpreted as
decimal values unless they are preceded with '0x' in which case they are
interpreted as hexadecimal values. If the number of tasks (or ranks)
exceeds the number of elements in this list, elements in the list will be
reused as needed starting from the beginning of the list. To simplify
support for large task counts, the lists may follow a map with an asterisk
and repetition count. For example "map_cpu:0x0f*4,0xf0*4".
Something like this could be applied in your case:
srun --cpu_bind=map_cpu:1,2...23,25,26...47 binary #replace ... with the actual list of core numbers
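If you want to double-check the result, each task can print the cores it is allowed to run on; here is a small illustrative C check (not part of the original answer) using sched_getaffinity:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {  /* 0 = calling process */
        perror("sched_getaffinity");
        return 1;
    }
    printf("pid %d is bound to cores:", (int)getpid());
    for (int c = 0; c < CPU_SETSIZE; ++c)
        if (CPU_ISSET(c, &set))
            printf(" %d", c);
    printf("\n");
    return 0;
}

Launch it with the same srun binding options; cores 0 and 24 should not appear in the output.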
Sorry for the long post. I did read some other questions about MPI broadcast errors, but I couldn't figure out why my program is failing.
I am new to MPI and I am facing this problem. First I will explain what I am trying to do:
My declarations:
ROWTAG 400
COLUMNTAG 800
Create a 2 X 2 Cartesian topology.
Rank 0 has the whole matrix. It wants to distribute parts of the matrix to all the processes in the 2 X 2 Cartesian topology. For now, instead of a matrix I am just dealing with integers. So for process P(i,j) in the 2 X 2 Cartesian topology (i - row, j - column), I want it to receive (ROWTAG + i) in one message and (COLUMNTAG + j) in another message.
My strategy to do so is:
Processes: P(0,0) , P(0,1), P(1,0), P(1,1)
P(0,0) has all the initial data.
P(0,0) sends (ROWTAG+1) (in this case 401) to P(1,0) - in essence P(1,0) is responsible for distributing information related to row 1 to all the processes in row 1 - I just used a blocking send
P(0,0) sends (COLUMNTAG+1) (in this case 801) to P(0,1) - in essence P(0,1) is responsible for distributing information related to column 1 to all the processes in column 1 - I used a blocking send
For each process, I made a row_group containing all the processes in that row and out of this created a row_comm (communicator object)
For each process, I made a col_group containing all the processes in that column and out of this created a col_comm (communicator object)
At this point, P(0,0) has given information related to row 'i' to Process P(i,0) and P(0,0) has given information related to column 'j' to
P(0,j). I call P(i,0) and P(0,j) the row_head and col_head, respectively.
For Process P(i,j) , P(i,0) gives information related to row i, and P(0,j) gives information related to column j.
I used a broadcast call:
MPI_Bcast(&row_data, 1, MPI_INT, row_head, row_comm);
MPI_Bcast(&col_data, 1, MPI_INT, col_head, col_comm);
Please find my code here: http://pastebin.com/NpqRWaWN
Here is the error I see:
*** An error occurred in MPI_Bcast
*** on communicator MPI COMMUNICATOR 5 CREATE FROM 3
*** MPI_ERR_ROOT: invalid root
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
Also please let me know if there is any better way to distribute the matrix data.
There are several errors in your program. First, row_Ranks is declared with one element too few, and when writing to it you may overwrite other stack variables:
int col_Ranks[SIZE], row_Ranks[SIZE-1];
// ^^^^^^
On my test system the program just hangs because of that.
Second, you create new subcommunicators out of matrixComm but you use rank numbers from the latter to address processes in the former when performing the broadcast. That doesn't work. For example, in a 2x2 Cartesian communicator ranks range from 0 to 3. In any column- or row-wise subgroup there are only two processes with ranks 0 and 1 - there is neither rank 2 nor rank 3. If you take a look at the value of row_head across the ranks, it is 2 in two of them, hence the error.
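To illustrate the fix, here is a hedged sketch (not your actual code) that lets MPI_Cart_sub build the row and column communicators; inside each sub-communicator the head P(i,0) or P(0,j) is simply rank 0, so no rank translation is needed:

#include <mpi.h>
#include <stdio.h>

#define ROWTAG    400
#define COLUMNTAG 800

int main(int argc, char **argv)            /* run with 4 processes */
{
    MPI_Init(&argc, &argv);

    int dims[2] = {2, 2}, periods[2] = {0, 0}, coords[2], cart_rank;
    MPI_Comm cart, row_comm, col_comm;
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 0, &cart);
    MPI_Comm_rank(cart, &cart_rank);
    MPI_Cart_coords(cart, cart_rank, 2, coords);

    /* keep dimension 1 -> same row; keep dimension 0 -> same column */
    int keep_row[2] = {0, 1}, keep_col[2] = {1, 0};
    MPI_Cart_sub(cart, keep_row, &row_comm);
    MPI_Cart_sub(cart, keep_col, &col_comm);

    /* The heads P(i,0) and P(0,j) start out with the data ... */
    int row_data = 0, col_data = 0;
    if (coords[1] == 0) row_data = ROWTAG + coords[0];
    if (coords[0] == 0) col_data = COLUMNTAG + coords[1];

    /* ... and they are rank 0 inside their own sub-communicators. */
    MPI_Bcast(&row_data, 1, MPI_INT, 0, row_comm);
    MPI_Bcast(&col_data, 1, MPI_INT, 0, col_comm);

    printf("P(%d,%d): row_data=%d col_data=%d\n",
           coords[0], coords[1], row_data, col_data);

    MPI_Finalize();
    return 0;
}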
For a much better way to distribute the data, you should refer to this extremely informative answer.
Say, I run a parallel program using MPI. Execution command
mpirun -n 8 -npernode 2 <prg>
launches 8 processes in total, that is, 2 processes per node on 4 nodes (Open MPI 1.5). Each node has one dual-core CPU, and the nodes are interconnected with InfiniBand.
Now, the rank number (or process number) can be determined with
int myrank;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
This returns a number between 0 and 7.
But, How can I determine the node number (in this case a number between 0 and 3) and the process number within a node (number between 0 and 1)?
I believe you can achieve that with MPI-3 in this manner:
MPI_Comm shmcomm;
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
MPI_INFO_NULL, &shmcomm);
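/* shmcomm now groups the ranks that share a node (shared memory), so its
   rank is the process number within the node (0 or 1 in your case) */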
int shmrank;
MPI_Comm_rank(shmcomm, &shmrank);
It depends on the MPI implementation - and there is no standard for this particular problem.
Open MPI has some environment variables that can help. OMPI_COMM_WORLD_LOCAL_RANK will give you the local rank within a node - i.e. this is the process number which you are looking for. A call to getenv will therefore answer your problem - but this is not portable to other MPI implementations.
See this for the (short) list of variables in OpenMPI.
I don't know of a corresponding "node number".
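For example, a minimal sketch of that getenv call (Open MPI only):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Only set when the process is launched by Open MPI's mpirun/mpiexec. */
    const char *s = getenv("OMPI_COMM_WORLD_LOCAL_RANK");
    int local_rank = s ? atoi(s) : -1;    /* -1: variable not present */
    printf("node-local rank: %d\n", local_rank);
    return 0;
}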
This exact problem is discussed on Markus Wittmann's Blog, MPI Node-Local Rank determination.
There, three strategies are suggested:
A naive, portable solution employs MPI_Get_processor_name or gethostname to create an unique identifier for the node and performs an MPI_Alltoall on it. [...]
[Method 2] relies on MPI_Comm_split, which provides an easy way to split a communicator into subgroups (sub-communicators). [...]
Shared memory can be utilized, if available. [...]
For some working code (presumably LGPL licensed?), Wittmann links to MpiNodeRank.cpp from the APSM library.
Alternatively you can use
int MPI_Get_processor_name( char *name, int *resultlen )
to retrieve the node name, then use it as the color in
int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
This is not as simple as MPI_Comm_split_type, but it offers a bit more freedom to split your communicator the way you want.
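For completeness, here is a hedged sketch of that approach: it gathers every rank's processor name, derives a node number from the order in which distinct names first appear (avoiding a hash that could collide), and uses it as the color for MPI_Comm_split; the rank inside the resulting communicator is the process number within the node.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    memset(name, 0, sizeof(name));
    MPI_Get_processor_name(name, &len);

    /* Every rank learns every node name. */
    char *all = malloc((size_t)size * MPI_MAX_PROCESSOR_NAME);
    MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                  all,  MPI_MAX_PROCESSOR_NAME, MPI_CHAR, MPI_COMM_WORLD);

    /* node_id = index of this node's name among the distinct names,
       in order of first appearance; it doubles as the split color. */
    int node_id = 0;
    for (int r = 0; r < size; ++r) {
        const char *rname = all + (size_t)r * MPI_MAX_PROCESSOR_NAME;
        if (strcmp(rname, name) == 0)
            break;                          /* reached my node's first rank */
        int seen = 0;
        for (int q = 0; q < r; ++q)
            if (strcmp(all + (size_t)q * MPI_MAX_PROCESSOR_NAME, rname) == 0) {
                seen = 1;
                break;
            }
        if (!seen)
            node_id++;                      /* a new node appeared before mine */
    }

    MPI_Comm node_comm;
    MPI_Comm_split(MPI_COMM_WORLD, node_id, rank, &node_comm);

    int local_rank;
    MPI_Comm_rank(node_comm, &local_rank);
    printf("world rank %d -> node %d, local rank %d\n", rank, node_id, local_rank);

    free(all);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}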