Depending on the message size, the same MPI collective can use different algorithms. I am particularly interested in MPICH: how can I print the threshold values, in bytes, for every collective I am using?
MPICH provides a number of environment variables to control which algorithm it chooses at runtime. Check your MPICH installation's documentation for the default values of these variables. For example, on my PC running Fedora 23, after installing the mpich packages, I have a file at /usr/share/doc/mpich/README.envvar documenting these variables. The following is a section of that file:
MPIR_CVAR_ALLGATHER_LONG_MSG_SIZE
Aliases: MPIR_PARAM_ALLGATHER_LONG_MSG_SIZE
MPICH_ALLGATHER_LONG_MSG_SIZE
Description: For MPI_Allgather and MPI_Allgatherv, the long message
algorithm will be used if the send buffer size is >= this value (in
bytes) (See also: MPIR_CVAR_ALLGATHER_SHORT_MSG_SIZE)
Default: 524288
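If you would rather query these thresholds from inside a program than read the README, MPICH also exposes its control variables (CVARs) through the standard MPI_T tools interface. Here is a minimal sketch, using only standard MPI-3 calls, that prints every plain integer-valued CVAR (which includes the *_MSG_SIZE thresholds); the file name is arbitrary:

/* cvars.c - print every integer-valued MPICH control variable (CVAR) */
/* build with: mpicc cvars.c -o cvars                                  */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, ncvar;

    MPI_Init(&argc, &argv);
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);
    MPI_T_cvar_get_num(&ncvar);

    for (int i = 0; i < ncvar; i++) {
        char name[256], desc[1024];
        int namelen = sizeof(name), desclen = sizeof(desc);
        int verbosity, bind, scope, count, value;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;
        MPI_T_cvar_handle handle;

        MPI_T_cvar_get_info(i, name, &namelen, &verbosity, &dtype,
                            &enumtype, desc, &desclen, &bind, &scope);

        /* only unbound integer CVARs (the message-size thresholds are) */
        if (dtype != MPI_INT || bind != MPI_T_BIND_NO_OBJECT)
            continue;

        MPI_T_cvar_handle_alloc(i, NULL, &handle, &count);
        if (count == 1) {
            MPI_T_cvar_read(handle, &value);
            printf("%s = %d\n", name, value);
        }
        MPI_T_cvar_handle_free(&handle);
    }

    MPI_T_finalize();
    MPI_Finalize();
    return 0;
}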
Suppose you take an RV32 program and try to run it on a 64-bit system. What compatibility issues are likely to arise?
As I understand it, the instruction encoding is the same, and on RISC-V (like other modern RISC architectures, though unlike x86), ALU operations automatically operate on whatever the word size is, so if you add the contents of a pair of registers, you will get a 32-bit or 64-bit addition as applicable. Load and store of course work on an explicitly specified size because they depend on how many bytes have been allocated in memory.
One theoretically possible compatibility issue would arise if the code depends on bits past 32 being discarded, e.g. add 2^31 to itself and compare the result with zero.
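In C terms, a minimal sketch of that pattern (the constants are only illustrative; the point is which register width the addition is done in):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t a32 = 0x80000000u;   /* 2^31 */
    uint64_t a64 = 0x80000000u;

    /* with 32-bit registers the ADD wraps: 2^31 + 2^31 == 0 */
    printf("%u\n", a32 + a32);                            /* prints 0 */

    /* with 64-bit registers it does not wrap, so a test for
       zero now takes the other branch */
    printf("%llu\n", (unsigned long long)(a64 + a64));    /* prints 4294967296 */

    return 0;
}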
Another more practical issue would arise if the operating system supplies memory addresses outside the first 4 gigabytes, which would be garbled when the code stores the addresses in 32-bit variables.
Are there any other issues I am missing?
You are correct about both of those possible compatibility issues.
Additionally, some control and status registers (namely cycleh, instreth, and timeh) are not needed in RV64I and therefore don't exist there. Any code that tries to access them should trap with an illegal-instruction exception.
However, there are instructions that perform ALU operations on only the lower 32 bits, so the original behaviour could potentially be restored by rewriting the opcode and funct3 in the binary.
So, with an operating-system mode that returns only 32-bit addresses, it would be possible to replace a binary with a working 64-bit version, as long as cycleh and friends aren't used.
References (RISC-V Specification v2.2):
Chapter 4 outlines the differences between RV32I and RV64I.
Chapter 2.8 goes over control and status registers.
Table 19.3 lists all of the CSRs in the standard.
The MPI standard states that when parallel programs run in a heterogeneous environment, they may have different representations for the same datatype (for example, big-endian and little-endian machines for integers), so datatype representation conversion might be needed when doing point-to-point communication. I don't know how Open MPI implements this.
For instance, current Open MPI uses the UCX library by default, and I have studied some of the UCX library code and Open MPI's ucx module. However, for a contiguous datatype like MPI_INT, I did not find any representation conversion happening. Is it because I missed that part, or does the implementation not satisfy the standard?
If you want to run an Open MPI app on a heterogeneous cluster, you have to configure Open MPI with --enable-heterogeneous (this is disabled by default). Keep in mind this is supposed to work, but it is lightly tested, mainly because of a lack of interest/real use cases. FWIW, IBM Power is now little endian, and Fujitsu is moving from Sparc to ARM for HPC, so virtually all HPC processors are (or will soon be) little endian.
Open MPI uses convertors (see opal/datatype/opal_convertor.h) to pack the data before sending it, and unpack it once received.
The data is packed in its current endianness. Data conversion (e.g. swap bytes) is performed by the receiver if the sender has a different endianness.
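As a rough sketch of what that receiver-side conversion amounts to for a contiguous buffer of 4-byte integers (illustrative C only, not the actual convertor code):

#include <stdint.h>
#include <stddef.h>

/* Illustrative only: byte-swap a buffer of 4-byte integers in place,
   which is conceptually what the receiver does when the sender has a
   different endianness.  Open MPI's real logic lives in opal/datatype/. */
static void swap_uint32_buffer(uint32_t *buf, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        uint32_t v = buf[i];
        buf[i] = (v >> 24) | ((v >> 8) & 0x0000ff00u)
               | ((v << 8) & 0x00ff0000u) | (v << 24);
    }
}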
There are two ways of using UCX: pml/ucx and pml/ob1 + btl/ucx, and I have tested neither of them in a heterogeneous environment. If you are facing issues with pml/ucx, try mpirun --mca pml ob1 ....
On a test server there are two Samsung 960 Pro SSDs, exactly the same maker, model and size. On both I have done a fresh install of exactly the same OS, OmniOS r15026.
By pressing F8 at POST time, I can access the motherboard boot manager and choose one of the two boot drives. Thus, I know which one the system booted from.
But how can one know programmatically, after boot, which is the boot disk?
It seems that this is:
Not possible on Linux,
Not possible on FreeBSD,
Possible on macOS.
Does Solaris/illumos offer some introspective hooks to determine which is the boot disk?
Is it possible to programmatically determine which is the boot disk on Solaris/illumos?
A command line tool would be fine too.
Edit 1: Thanks to @andrew-henle, I have come to know about the eeprom command.
As expected, it is available on illumos, but on the test server with OmniOS it unfortunately doesn't return much:
root@omnios:~# eeprom
keyboard-layout=US-English
ata-dma-enabled=1
atapi-cd-dma-enabled=1
ttyd-rts-dtr-off=false
ttyd-ignore-cd=true
ttyc-rts-dtr-off=false
ttyc-ignore-cd=true
ttyb-rts-dtr-off=false
ttyb-ignore-cd=true
ttya-rts-dtr-off=false
ttya-ignore-cd=true
ttyd-mode=9600,8,n,1,-
ttyc-mode=9600,8,n,1,-
ttyb-mode=9600,8,n,1,-
ttya-mode=9600,8,n,1,-
lba-access-ok=1
root@omnios:~# eeprom boot-device
boot-device: data not available.
Solution on OmniOS r15026
Thanks to @abarczyk I was able to determine the correct boot disk.
I had to use a slightly different syntax:
root@omnios:~# /usr/sbin/prtconf -v | ggrep -1 bootpath
value='unix'
name='bootpath' type=string items=1
value='/pci@38,0/pci1022,1453@1,1/pci144d,a801@0/blkdev@w0025385971B16535,0:b'
With /usr/sbin/format, I was able to see that this entry corresponds to
16. c1t0025385971B16535d0 <Samsung-SSD 960 PRO 512GB-2B6QCXP7-476.94GB>
/pci@38,0/pci1022,1453@1,1/pci144d,a801@0/blkdev@w0025385971B16535,0
which is correct, as that is the disk I manually selected in BIOS.
Thank you very much to @abarczyk and @andrew-henle for considering this and offering instructive help.
The best way to find the device from which the system is booted is to check the prtconf -vp output:
# /usr/sbin/prtconf -vp | grep bootpath
bootpath: '/pci@0,600000/pci@0/scsi@1/disk@0,0:a'
On my Solaris 11.4 Beta system, there is a very useful command called devprop which helps answer your question:
$ devprop -s bootpath
/pci@0,0/pci1849,8c02@1f,2/disk@1,0:b
then you just have to look through the output of format to see what that translates to. On my system, that is
9. c2t1d0 <ATA-ST1000DM003-1CH1-CC47-931.51GB>
/pci@0,0/pci1849,8c02@1f,2/disk@1,0
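If you need that value from inside a program rather than from a shell, the simplest (admittedly crude) sketch is just to read the command's output. This assumes devprop is available; on OmniOS you would substitute the prtconf pipeline shown above:

#include <stdio.h>
#include <string.h>

/* Illustrative sketch: read the firmware boot path and print it, so it
   can be matched against the device paths listed by format. */
int main(void)
{
    char path[1024];
    FILE *p = popen("/usr/sbin/devprop -s bootpath", "r");

    if (p == NULL)
        return 1;
    if (fgets(path, sizeof(path), p) == NULL) {
        pclose(p);
        return 1;
    }
    pclose(p);

    path[strcspn(path, "\n")] = '\0';
    printf("booted from: %s\n", path);
    return 0;
}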
Use the eeprom command.
Per the eeprom man page:
Description
eeprom displays or changes the values of parameters in the EEPROM.
It processes parameters in the order given. When processing a
parameter accompanied by a value, eeprom makes the indicated
alteration to the EEPROM; otherwise, it displays the parameter's
value. When given no parameter specifiers, eeprom displays the values
of all EEPROM parameters. A '-' (hyphen) flag specifies that
parameters and values are to be read from the standard input (one
parameter or parameter=value per line).
Only the super-user may alter the EEPROM contents.
eeprom verifies the EEPROM checksums and complains if they are
incorrect.
platform-name is the name of the platform implementation and can be
found using the -i option of uname(1).
SPARC
SPARC based systems implement firmware password protection with
eeprom, using the security-mode, security-password and
security-#badlogins properties.
x86
EEPROM storage is simulated using a file residing in the
platform-specific boot area. The /boot/solaris/bootenv.rc file
simulates EEPROM storage.
Because x86 based systems typically implement password protection in
the system BIOS, there is no support for password protection in the
eeprom program. While it is possible to set the security-mode,
security-password and security-#badlogins properties on x86 based
systems, these properties have no special meaning or behavior on x86
based systems.
I use doMC, which uses the multicore package. It has happened (several times) that when I was debugging in the console, things went sideways and fork-bombed.
Does R have access to the setrlimit() syscall?
In Python I would use resource.RLIMIT_NPROC for this.
Ideally I'd like to restrict the number of R processes running to a given number.
EDIT: The OS is Linux, CentOS 6.
There should be several choices. Here is the relevant section from Writing R Extensions, Section 1.2.1.1
Packages are not stand-alone programs, and an R process could
contain more than one OpenMP-enabled package as well as other components
(for example, an optimized BLAS) making use of OpenMP. So careful
consideration needs to be given to resource usage. OpenMP works with
parallel regions, and for most implementations the default is to use as
many threads as 'CPUs' for such regions. Parallel regions can be
nested, although it is common to use only a single thread below the
first level. The correctness of the detected number of 'CPUs' and the
assumption that the R process is entitled to use them all are both
dubious assumptions. The best way to limit resources is to limit the
overall number of threads available to OpenMP in the R process: this can
be done via environment variable 'OMP_THREAD_LIMIT', where
implemented.(4) Alternatively, the number of threads per region can be
limited by the environment variable 'OMP_NUM_THREADS' or API call
'omp_set_num_threads', or, better, for the regions in your code as part
of their specification. E.g. R uses
#pragma omp parallel for num_threads(nthreads) ...
That way you only control your own code and not that of other OpenMP
users.
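For illustration, the num_threads clause the manual refers to looks like this in a package's C/OpenMP code (a minimal sketch, not taken from any particular package):

#include <omp.h>

/* Minimal sketch: cap the threads used by this one parallel region,
   regardless of how many CPUs OpenMP detected.  OMP_THREAD_LIMIT
   (where implemented) caps the whole process instead. */
void scale(double *x, int n, double a, int nthreads)
{
    #pragma omp parallel for num_threads(nthreads)
    for (int i = 0; i < n; i++)
        x[i] *= a;
}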
One of my favourite tools is a package controlling this: RhpcBLASctl. Here is its Description:
Control the number of threads on 'BLAS' (Aka 'GotoBLAS', 'ACML' and
'MKL'). and possible to control the number of threads in 'OpenMP'. get
a number of logical cores and physical cores if feasible.
After all, you need to control the number of parallel sessions as well as the number of BLAS cores allocated to each of the parallel threads. There is a reason the parallel package has a default of 2 threads per session...
All of this should be largely independent of the flavour of Linux or Unix you are running. Well, apart from the fact that OS X of course (still!!) does not give you OpenMP.
And at the very outer level you can control things from doMC and friends.
You can use registerDoMC (see the doMC documentation):
registerDoMC(cores=<some number>)
Another option is to use the ulimit command before running the R script:
ulimit -u <some number>
to limit the number of processes R will be able to spawn.
If you want to limit the total number of CPUs that several R processes use at the same time, you will need to use cgroups or cpusets and attach the R processes to the cgroup or cpuset. They will then be confined to the physical CPUs defined in the cgroup or cpuset. cgroups allow more control (for instance also over memory) but are more complex to set up.
I'm trying to optimize my kernel functions and ran into a bit of an issue. First, this may be Radeon R9 (Hawaii) related, but it should happen for other GPU devices as well.
For the host I have two platform options: either compile and run as an x86 program, or as an x64 program. Depending on which platform I choose, I get different compiled kernels: one that uses 32-bit pointers and pointer arithmetic, and another that uses 64-bit pointers. The generated IL code shows the difference; in the first case it is
prog kernel &__OpenCL_execute_kernel(
kernarg_u32 %_.global_offset_0,
kernarg_u32 %_.global_offset_1,
...
and in the second case it is:
prog kernel &__OpenCL_execute_kernel(
kernarg_u64 %_.global_offset_0,
kernarg_u64 %_.global_offset_1,
...
64-bit arithmetic on a GPU is rather expensive and consumes a lot of additional VGPRs. In my case, the 64-bit pointer version requires 8 more VGPRs and has about 140 more VALUInsts, as shown by CodeXL. Overall performance is about 37% worse in my case between the slower 64-bit and the faster 32-bit kernel code, which is, other than the internal pointer arithmetic, completely identical. I have tried to optimize this, but even with plain offsets I'm still stuck with a lot of ADD_U64 IL instructions, which in ISA code each produce two instructions: V_ADD_I32 and V_ADDC_U32. And of course all pointers require double the private memory space (hence more VGPRs).
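To illustrate what I mean by plain offsets, the pattern is roughly this (a simplified sketch, not my actual kernel):

// Simplified sketch: one base pointer plus 32-bit offsets.  The offsets
// are uint, but the final address computation base + offset is still
// widened to 64 bit by the compiler in the x64 build.
__kernel void execute(__global float *base, uint in_off, uint out_off)
{
    uint gid = (uint)get_global_id(0);
    base[out_off + gid] = base[in_off + gid] * 2.0f;
}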
Now my question is: is there a way to "cross"-compile an OpenCL kernel so that an x64 program can create a 32-bit-pointer kernel? I don't need to address that much memory on the GPU, so addressing less than 4 GiB of memory space is fine. As my host also executes AVX-512 instructions using all 32 zmm registers, which are only available in 64-bit mode, an x86 program is not an option. That makes the whole situation a bit challenging.
Well, my fallback solution is to spawn an x86 child process that uses shared memory and acts as a compiling gate. But I'd rather not do that if a simple flag or (AMD-specific) setting in OpenCL does the trick.
Please don't reply with a why-it-is-that-way response. I'm completely aware of why the x64 program and kernel behave that way.
I have a couple of ideas but, not being familiar with the guts of the AMD GPU OpenCL implementation, I am stabbing in the dark.
Can you pass the data in via an image (even if it isn't image data)? On Intel GPUs, going through the sampler provides a different path and can avoid 64-bit arithmetic even in the 64-bit version.
Does AMD have an extension that allows you to do block reads and writes? This can help if the compiler can prove that the address is uniform (scalar), e.g. something like Intel subgroups (which enable some block IO). On Intel this helps avoid shipping a SIMD's worth of addresses across the bus for a scatter/gather (and saves register space too).
(This is a stretch.) Does compiling for OpenCL 1.2 or lower help? That is, specify -cl-std=CL1.2? If the compiler knows that SVM is not being used (>=OpenCL 2.0) and were to run a conservative analysis on the program to prove that it's not doing something wild with pointer arithmetic, it could feasibly do arithmetic in 32-bit and implicitly add a 64-bit relative offset to all addresses (making the GPU program think that it's using 32-bit addresses).
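For the third idea, the build call would look something like the following (a hedged sketch: program and device are whatever handles you already have, and whether AMD's compiler actually narrows the address arithmetic under this flag is exactly what would need testing):

/* Sketch: request the OpenCL C 1.2 dialect at build time so the compiler
   knows SVM cannot be in play.  Error handling omitted for brevity. */
cl_int err = clBuildProgram(program, 1, &device, "-cl-std=CL1.2", NULL, NULL);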
Again, I know nothing about AMD specifics, but I feel your pain with this problem.