Determine number of processes in an MPMD run for each program [duplicate] - mpi

This question already has answers here:
Openmpi mpmd get communication size
(3 answers)
Closed 5 years ago.
Is it possible to know from inside the program how many processes are executing prog_1 and prog_2?
mpirun -np 3 prog_1 : -np 5 prog_2
I mean, how can I know inside prog_1 that it is being executed by 3 processes?

I do not think there is a straightforward and portable way to achieve this.
The program name is in argv[0], so you can MPI_Gather() the names and then MPI_Bcast() or MPI_Scatter() the counts you need.
Another approach is to start with the first program only, and then MPI_Comm_spawn() the second program.
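Here is a minimal sketch of the argv[0] approach in C; an MPI_Allgather() combines the MPI_Gather() and MPI_Bcast() steps into one call, and NAME_LEN is an arbitrary assumption:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NAME_LEN 256   /* assumed upper bound on the program name length */

int main(int argc, char **argv)
{
    int rank, size, i, count = 0;
    char name[NAME_LEN] = {0};
    char *all;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    strncpy(name, argv[0], NAME_LEN - 1);
    all = malloc((size_t)size * NAME_LEN);
    /* every rank receives every rank's program name */
    MPI_Allgather(name, NAME_LEN, MPI_CHAR, all, NAME_LEN, MPI_CHAR, MPI_COMM_WORLD);

    /* count the ranks running the same binary as this rank */
    for (i = 0; i < size; i++)
        if (strcmp(all + (size_t)i * NAME_LEN, name) == 0)
            count++;
    printf("rank %d: %d process(es) run %s\n", rank, count, name);

    free(all);
    MPI_Finalize();
    return 0;
}

With the mpirun line above, ranks running prog_1 would report a count of 3 and ranks running prog_2 a count of 5 (assuming the two argv[0] strings differ).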

Related

Is hierarchical parallelism possible with MPI libraries?

I'm writing a computational code with MPI. The software has a few parts, each computing a different part of the problem. Each part is written with MPI and can therefore run as an independent module. Now I want to combine these parts into one program, so that all parts run together while each part itself also runs in parallel.
e.g. total number of nodes = 10, part 1 runs on 6 nodes and part 2 runs on 4 nodes, both running together.
Is there a way I can mpirun with 10 nodes and MPI_Init each part with the desired number of nodes, without rewriting the overall program to allocate processes to each part of the code?
This is not straightforward.
One option is to use an external program that MPI_Comm_spawn()s each of your sub-programs (so, twice). The drawback is that this requires one extra slot for the spawning program.
Another option needs some rewriting: since all the tasks will end up in the same MPI_COMM_WORLD, it is up to them to MPI_Comm_split() based on who they are, and to use the resulting communicator instead of MPI_COMM_WORLD. A sketch of this option is below.
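A minimal sketch of the MPI_Comm_split() option, assuming the 6/4 split from the example above (the color rule is the only application-specific piece):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, color, sub_rank, sub_size;
    MPI_Comm part_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    /* 0 -> part 1 (first 6 ranks), 1 -> part 2 (remaining ranks) */
    color = (world_rank < 6) ? 0 : 1;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &part_comm);

    MPI_Comm_rank(part_comm, &sub_rank);
    MPI_Comm_size(part_comm, &sub_size);
    printf("world rank %d is rank %d of %d in part %d\n",
           world_rank, sub_rank, sub_size, color + 1);

    /* each part now uses part_comm wherever it used MPI_COMM_WORLD */
    MPI_Comm_free(&part_comm);
    MPI_Finalize();
    return 0;
}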

Torque/OpenMPI dynamically allocate nodes based on number of processors

I was wondering if Torque is smart enough to assign the correct number of nodes based on how many MPI cores you request. For our cluster, we have heterogeneous nodes, and it can be quite wasteful to just specify the number of nodes you want and the processors per node. So I was wondering if you could just do something like this:
qsub -I -l procs=1000
mpiexec -n 1000 mympijob
However, Torque only allocates one node with this command (as I didn't specify a number of nodes). Is there a way to get the correct number of nodes based on my number of procs, so the allocation can be maximally efficient?
Sidebar: we are probably switching to SLURM soon; is this within its capabilities?
Typically, what we do after the resources are allocated is not something the scheduler can control.
In this case,
mpirun/mpiexec -n 1000
gets executed after the resources are allocated by the scheduler.
The best way forward is to use an environment variable set by the scheduler, such as
$MPI_HOSTS
as the value passed through the -n switch.
Example:
mpirun $MPI_HOSTS <your program of choice>
You can request the number of cores you want by adding the ppn argument to nodes:
qsub -l nodes=2:ppn=16
This allocates 32 cores on two nodes.
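For completeness, here is a sketch of a Torque job script that derives the rank count from the node file the scheduler provides ($PBS_NODEFILE is the standard Torque/PBS variable listing one line per allocated core; the resource request and program name are assumptions):

#!/bin/bash
#PBS -l nodes=2:ppn=16
# $PBS_NODEFILE lists one line per allocated core
NP=$(wc -l < "$PBS_NODEFILE")
mpiexec -n "$NP" ./mympijob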

Is there a size limit on a variable in MPI_Bcast?

I have the latest MPICH2 (3.0.4), compiled with the Intel Fortran compiler on a quad-core, dual-CPU (Intel Xeon) machine.
I am encountering an MPI_Bcast problem: I am unable to broadcast the array
gpsi(1:201,1:381,1:38,1:20,1:7)
which amounts to 407,410,920 elements. When I try to broadcast this array I get the following error:
Fatal error in PMPI_Bcast: Other MPI error, error stack:
PMPI_Bcast(1525)......: MPI_Bcast(buf=0x7f506d811010, count=407410920,
MPI_DOUBLE_PRECISION, root=0, MPI_COMM_WORLD) failed
MPIR_Bcast_impl(1369).:
MPIR_Bcast_intra(1160):
MPIR_SMP_Bcast(1077)..: Failure during collective
rank 1 in job 31 Grace_52261 caused collective abort of all ranks
exit status of rank 1: killed by signal 9
MPI launch string is: mpiexec -n 2 %B/tvdbootstrap
Testing MPI configuration with 'mpich2version'
Exit value was 127 (expected 0), status: execute_command_t::exited
Launching MPI job with command: mpiexec -n 2 %B/tvdbootstrap
Server args: -callback 127.0.0.1:4142 -set_pw 65f76672:41f20a5c
So the question: is there a limit on the size of a variable in MPI_Bcast, or is the size of my array more than it can handle?
As John said, your array is too big: its total size in bytes can no longer be described by an int variable. When this is the case, you have a few options.
Use multiple MPI calls to send your data. For this option, you would just divide your data into chunks smaller than 2^31 and send them individually until everything has been received (see the sketch after these options).
Use MPI datatypes. With this option, you need to create a datatype describing some portion of your data, then send multiples of that datatype. For example, if you are just sending an array of 100 integers, you can create a datatype of 10 integers using MPI_TYPE_VECTOR, then send 10 of that new datatype. Datatypes can be a bit confusing when you first look at them, but they are very powerful for sending either large data or non-contiguous data.
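A minimal sketch of the chunking option in C (the chunk size is an arbitrary assumption chosen so each call stays well under 2^31 bytes; the same idea carries over to Fortran):

#include <mpi.h>

/* broadcast a large double array in pieces small enough for one call */
void bcast_large(double *buf, long long n, int root, MPI_Comm comm)
{
    const long long chunk = 1LL << 26;   /* 64M doubles = 512 MB per call */
    long long offset;

    for (offset = 0; offset < n; offset += chunk) {
        long long remaining = n - offset;
        int count = (int)(remaining < chunk ? remaining : chunk);
        MPI_Bcast(buf + offset, count, MPI_DOUBLE, root, comm);
    }
}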
Yes, there is a limit. The count argument is an int, so the limit is usually 2^31, i.e. about two billion elements. You say your array has 407 million elements, so it seems like it should work. However, if the limit is two billion bytes, then your 407 million doubles come to about 3.26 billion bytes, exceeding it by roughly 50%. Try cutting your array size in half and see if that works.
See: Maximum amount of data that can be sent using MPI::Send

what do "user","system", and "elapsed" times mean in R [duplicate]

This question already has answers here:
What are 'user' and 'system' times measuring in R system.time(exp) output?
(5 answers)
Closed 9 years ago.
I am adopting parallel computing in R and doing some benchmarking. I notice that when multiple cores are used, system.time shows increased user and system times, but decreased elapsed time. Does this indicate that the parallel computing is effective? Thanks.
If you do help(system.time) you get a hint to also look at help(proc.time). I quote from its help page:
Value:
An object of class ‘"proc_time"’ which is a numeric vector of
length 5, containing the user, system, and total elapsed times for
the currently running R process, and the cumulative sum of user
and system times of any child processes spawned by it on which it
has waited. (The ‘print’ method uses the ‘summary’ method to
combine the child times with those of the main process.)
The definition of ‘user’ and ‘system’ times is from your OS.
Typically it is something like
The ‘user time’ is the CPU time charged for the execution of user
instructions of the calling process. The ‘system time’ is the CPU
time charged for execution by the system on behalf of the calling
process.
Times of child processes are not available on Windows and will
always be given as ‘NA’.
The resolution of the times will be system-specific and on
Unix-alikes times are rounded down to milliseconds. On modern
systems they will be that accurate, but on older systems they
might be accurate to 1/100 or 1/60 sec. They are typically
available to 10ms on Windows.
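In short, yes: user and system sum the CPU seconds of the main process and its workers, while elapsed is wall-clock time, so user exceeding elapsed is exactly what effective parallelism looks like. A quick sketch with the base parallel package (mclapply is POSIX-only; timings are illustrative):

library(parallel)
f <- function(i) sum(rnorm(1e6))                 # CPU-bound toy workload
system.time(lapply(1:8, f))                      # sequential: user is close to elapsed
system.time(mclapply(1:8, f, mc.cores = 4))      # user grows, elapsed shrinks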

Why does sqrt(4) - 2 equal -8.1648465955514287168521180122928e-39 when using the Windows calculator? [duplicate]

This question already has answers here:
Possible Duplicate: How is floating point stored? When does it matter?
Closed 10 years ago.
Using the built-in calculator on my Win7 x64, I get the number -8.1648465955514287168521180122928e-39 when calculating sqrt(4) - 2.
I would expect the result to be 0.
Floating-point values carry small rounding errors, and subtracting two of them occasionally exposes that error: the calculator computes sqrt(4) to finite precision, the result is not exactly 2, and subtracting 2 leaves the tiny residual. You may get a representation that is exactly 0 or merely really close to 0 (10^-39 is pretty close).
For more information, check out Fractions in Binary on Wikipedia.
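The effect is easy to reproduce in C (a sketch; the Windows calculator uses its own extended-precision arithmetic, so the exact residual differs):

#include <math.h>
#include <stdio.h>

int main(void)
{
    /* IEEE double sqrt is correctly rounded, so sqrt(4.0) is exactly 2 */
    printf("%.17g\n", sqrt(4.0) - 2.0);             /* prints 0 */
    /* but chained operations accumulate rounding error */
    printf("%.17g\n", sqrt(2.0) * sqrt(2.0) - 2.0); /* tiny nonzero value */
    return 0;
}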
