I'm trying to run the Intel version of the HPL benchmark and I'm a bit confused by the options.
What I want to do (for now) is a single-node run. The node has 2x Xeon Platinum 8276 processors, so 56 cores total. So my PxQ should be 56.
However, the Intel docs say:
MPI_PROC_NUM should be equal to PxQ (i.e. 56) - this gets passed to mpirun -np
MPI_PER_NODE should be equal to the number of sockets in the system (i.e. 2) - this gets passed to mpirun -perhost
To me those don't seem consistent? And how does using OMP_NUM_THREADS fit into this?
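One possible reading that makes the two settings consistent is sketched below. It rests on an assumption that is not confirmed anywhere in this thread: that the Intel binary is a hybrid MPI + OpenMP build, so each MPI rank covers a whole socket and OpenMP threads fill that socket's cores. Under that assumption PxQ counts ranks, not cores. The function name and its parameters are made up for illustration.

```python
def hybrid_layout(nodes, sockets_per_node, cores_per_socket):
    """Return (mpi_proc_num, mpi_per_node, omp_num_threads).

    Assumption (not taken from the docs quoted above): one MPI rank per
    socket, with OpenMP threads filling that socket's cores.
    """
    mpi_per_node = sockets_per_node        # value that would go to -perhost
    mpi_proc_num = nodes * mpi_per_node    # value that would go to -np, equal to P*Q
    omp_num_threads = cores_per_socket     # threads inside each rank
    return mpi_proc_num, mpi_per_node, omp_num_threads


# Single node with 2 sockets of 28 cores each (56 cores in total).
np_total, per_node, threads = hybrid_layout(1, 2, 28)
print(np_total, per_node, threads)         # 2 2 28

# Every core is used exactly once: ranks * threads == total cores.
assert np_total * threads == 56
```

With these numbers PxQ would be 2 (for example P=1, Q=2) and OMP_NUM_THREADS would be 28, rather than PxQ being 56.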
I am a high school student. I ran into an error while studying the basics of MPI and writing some code. I searched the internet and tried everything I found, but I still don't understand it well.
The code is really simple. There is no problem with the code and I understood it well.
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int num_procs, my_rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &num_procs);  /* total number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);    /* rank of this process */
    printf("Hello world! I'm rank %d among %d processes.\n", my_rank, num_procs);
    MPI_Finalize();
    return 0;
}
But there is a problem when I run it with mpirun. It works well when I type this:
mpirun -np 2 ./hello
Hello world! I'm rank 1 among 2 processes.
Hello world! I'm rank 0 among 2 processes.
This error occurs at -np 3.
mpirun -np 3 ./hello
There are not enough slots available in the system to satisfy the 3
slots that were requested by the application:
./hello
Either request fewer slots for your application, or make more slots
available for use.
A "slot" is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of
processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the
hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an
RM is present, Open MPI defaults to the number of processor cores
In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
My laptop has an Intel i5 CPU with 2 cores and 4 threads. Did this problem happen because there are only 2 cores? I don't fully understand this part.
There is not much material about MPI in Korea, so I am always googling while I study. If that's the cause, is there any way to increase the number of processes? Other people wrote that they got the error at -np 17; how did they get the process count into double digits? Does it depend on the computer? Please explain it simply so that I can understand.
My laptop has an Intel i5 CPU with 2 cores and 4 threads. Did this problem happen because there are only 2 cores?
Yes. By default Open MPI uses the number of cores as the number of slots. Since you only have 2 cores, you can launch a maximum of 2 processes.
If that's the cause, is there any way to increase the number of processes?
Yes. If you add --use-hwthread-cpus to your mpirun command, you can use up to 4 MPI processes on your laptop, since it has 4 hardware threads. Try running: mpirun -np 4 --use-hwthread-cpus a.out
You can also use the --oversubscribe option to launch more processes than there are available cores/threads. For example, try: mpirun -np 10 --oversubscribe a.out
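The slot rule described in the error message and in this answer can be modelled in a few lines. This is only a sketch of the rule for a single machine with no hostfile or resource manager, not Open MPI's actual logic, and launch_allowed is a made-up helper name.

```python
def launch_allowed(cores, threads_per_core, requested,
                   use_hwthread_cpus=False, oversubscribe=False):
    """Model whether `mpirun -np requested` fits the available slots."""
    # Slots default to the core count; --use-hwthread-cpus counts hardware threads.
    slots = cores * threads_per_core if use_hwthread_cpus else cores
    # --oversubscribe ignores the slot count entirely.
    return oversubscribe or requested <= slots


# The laptop from the question: 2 cores, 2 hardware threads per core.
print(launch_allowed(2, 2, 2))                          # True
print(launch_allowed(2, 2, 3))                          # False
print(launch_allowed(2, 2, 4, use_hwthread_cpus=True))  # True
print(launch_allowed(2, 2, 10, oversubscribe=True))     # True
```

Oversubscribing lets the launch succeed, but the extra processes then share the same cores, so it helps with learning and testing rather than with speed.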
I'm using RocksDB via the C API.
I have a test program that opens a database, does 1,000 writes (gathering timing data between initiation of write and callback), does 1,000 reads, and shuts down.
This works. Average time to do a write is about 1ms.
I modified the test program to turn on write syncing via this
rocksdb_writeoptions_set_sync(wri_u, 1);
and ran it again. Average time to do a write is about 8ms.
So far, so good.
HOWEVER, I then ran strace on both versions of the program to verify that fsync() or fdatasync() or msync() is getting called.
The no-sync program shows 4 invocations of fsync(), 2 of fdatasync() and 0 of msync(). Reasonable.
...but the sync version of the program shows the same 4, 2, and 0. Odd! Surprising! Worrying!
The sync version DOES show 2 interesting deltas from the no-sync version: (i) 2 calls to nanosleep() per write, (ii) an 80% increase in the time spent in mmap().
One out-of-my-butt theory is that perhaps msync() [ or a stand-in for it ] is actually implemented in terms of nanosleep() ?
This is on desktop Ubuntu 16.04:
uname -a
Linux mithril 4.4.0-139-generic #165-Ubuntu SMP Wed Oct 24 10:58:50 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Anyway, my question is, as per the subject line:
Am I properly forcing RocksDB to use fsync? ... because neither fsync() nor msync() shows in strace
Thanks.
Yes, this is the correct way to turn fsync() on.
The issue is that strace must be used with the -f flag to trace system calls in new threads ... and RocksDB was doing all syncs in other threads.
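The effect can be reproduced without RocksDB. The sketch below is a Python analogy, not RocksDB code: it issues fsync() from a worker thread, the way a database might sync from a background thread. The file name demo.log and the function names are placeholders.

```python
import os
import tempfile
import threading


def write_and_sync(path, payload):
    """Write payload to path and fsync it before closing."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # issued by whichever thread runs this function
    finally:
        os.close(fd)


def sync_in_worker(path, payload):
    """Run write_and_sync in a separate thread and wait for it to finish."""
    worker = threading.Thread(target=write_and_sync, args=(path, payload))
    worker.start()
    worker.join()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "demo.log")
        sync_in_worker(target, b"hello\n")
        print(os.path.getsize(target))  # 6
```

Saved under a hypothetical name such as sync_demo.py, running it under strace -e trace=fsync should not show the worker's fsync(), while adding -f should show it attributed to a second thread ID.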
I have just done a clean install of R on Ubuntu 18.04, and it does not really work.
I can do stuff like make vectors and dataframes, and get summaries. I can also make a histogram with the hist command just fine. However, the plot command does not work.
The following code, which is very basic and should work just fine:
data(faithful)
plot(faithful$eruptions)
runs for about 30 seconds, before giving the following error:
Error: C stack usage 7970244 is too close to the limit
I have seen lots of posts from other people with the same error, but in those cases it seems to be caused by large datasets, deep recursion, or something similar. I have this problem even with a dataset of just 3 values. R should definitely be able to handle this without me increasing the limit, and it should not take 30 seconds to run.
Does anybody know what the problem could be?
Edit:
Output of ulimit -a:
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 28697
max locked memory       (kbytes, -l) 16384
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 28697
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
Version info:
R version 3.4.4 (2018-03-15) -- "Someone to Lean On"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
This is plain R in the terminal, not through any IDE.
Edit 2:
I have discovered that it works just fine on other user accounts on the computer. I've also started getting other weird issues (after a reboot I have to log in to a GNOME session and then log out before I can log in to Plasma, but other users don't have that problem; sometimes the terminal can't launch; lots of things are crashing; etc.), so I think this has nothing to do with R and is much bigger. Unfortunately, it might be a hardware issue (this computer has had wacky issues before on other operating systems).
Some weird behaviour was recently observed by a colleague, and I have been able to reproduce it. We have a computer for simulations, which is powered by two Xeon processors with 18 cores each, giving us 36 cores to work with.
When we launch an application using 2 processes, MPI always binds them to cores 0 and 1 of socket 0. Thus, if we run 4 simulations using 2 processes each, cores 0 and 1 do all the work, and each process gets only 25% CPU usage.
See the reported bindings of MPI below. When we use more than 2 processes for each simulation, MPI behaves as expected, i.e. when running 4 simulations using 3 processes each, then 12 cores are working with each process having 100% CPU-use.
[user#apollo3 tmp]$ mpirun -np 2 --report-bindings myApp -parallel > run01.log &
[1] 5374
[user#apollo3 tmp]$ [apollo3:05374] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././.][./././././././././././././././././.]
[apollo3:05374] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././././././.][./././././././././././././././././.]
[user#apollo3 tmp]$ mpirun -np 2 --report-bindings myApp > run02.log &
[2] 5385
[user#apollo3 tmp]$ [apollo3:05385] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././.][./././././././././././././././././.]
[apollo3:05385] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././././././.][./././././././././././././././././.]
What could be the reason for this binding behavior of MPI?
We run Open MPI 1.10 on our machine:
[user#apollo3 tmp]$ mpirun --version
mpirun (Open MPI) 1.10.0
Long story short, this is not a bug but a feature.
Separate instances of mpirun do not communicate with each other, so each MPI job believes it is running alone on the system and uses cores 0 and 1.
The simplest option is to disable binding if you know you will be running several jobs on the same machine.
mpirun -bind-to none ...
will do the trick.
A better option is to use a resource manager (such as SLURM, PBS or others) and make sure Open MPI was built to support it.
The resource manager will allocate a different set of cores to each job, so there will be no more overlap.
A similar question was asked recently; see yet another option at How to use mpirun to use different CPU cores for different programs?
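The overlap can be illustrated with a toy model. This is a simplification with made-up names, not Open MPI's mapping code: every mpirun instance starts counting at core 0 because it knows nothing about the other jobs.

```python
def default_core_binding(num_ranks, cores_per_socket=18, sockets=2):
    """Cores that one isolated mpirun would pick when binding each rank to a core."""
    total = cores_per_socket * sockets
    # Counting starts at core 0 because the job sees no other jobs.
    return [rank % total for rank in range(num_ranks)]


# Four independent 2-rank jobs, as in the question.
jobs = [default_core_binding(2) for _ in range(4)]
busy_cores = sorted({core for job in jobs for core in job})
print(busy_cores)            # [0, 1]
print(100 // len(jobs))      # 25, the percent of a core left for each rank
```

The mpirun man page for the 1.8/1.10 series describes the default as bind-to-core for up to 2 processes and bind-to-socket for more, which would be consistent with the 3-process runs in the question spreading over more cores.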
Do you know if there is a UNIX command that will tell me what the CPU configuration for my Sun OS UNIX machine is? I am also trying to determine the memory configuration. Is there a UNIX command that will tell me that?
There is no standard Unix command, AFAIK. I haven't used Sun OS, but on Linux, you can use this:
cat /proc/cpuinfo
Sorry that it is Linux, not Sun OS. There is probably something similar though for Sun OS.
The nproc command shows the number of processing units available:
$ nproc
Sample output:
4
lscpu gathers CPU architecture information from /proc/cpuinfo in a human-readable format:
$ lscpu
Sample outputs:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    1
Core(s) per socket:    4
CPU socket(s):         2
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 15
Stepping:              7
CPU MHz:               1866.669
BogoMIPS:              3732.83
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
NUMA node0 CPU(s):     0-7
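The fields in this output are related: sockets x cores per socket x threads per core should equal the CPU(s) count. A small sketch of that check follows; the helper names are made up, and the sample text is cut down from the output above.

```python
def parse_lscpu(text):
    """Turn 'Key: value' lines into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def expected_cpus(fields):
    """Sockets x cores per socket x threads per core."""
    # Older lscpu prints "CPU socket(s)", newer versions print "Socket(s)".
    sockets = int(fields.get("CPU socket(s)") or fields["Socket(s)"])
    cores = int(fields["Core(s) per socket"])
    threads = int(fields["Thread(s) per core"])
    return sockets * cores * threads


sample = """CPU(s): 8
Thread(s) per core: 1
Core(s) per socket: 4
CPU socket(s): 2
"""
fields = parse_lscpu(sample)
print(expected_cpus(fields), fields["CPU(s)"])   # 8 8
```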
Try psrinfo to find the processor type and the number of physical processors installed on the system.
Firstly, it probably depends which version of Solaris you're running, but also what hardware you have.
On SPARC at least, you have psrinfo to show you processor information; run on its own, it shows the number of CPUs the machine sees. psrinfo -p shows you the number of physical processors installed. From those two numbers you can deduce the number of threads/cores per physical processor.
prtdiag will display a fair bit of info about the hardware in your machine. It looks like on a V240 you do get memory channel info from prtdiag, but you don't on a T2000. I guess that's an architecture issue between UltraSPARC IIIi and UltraSPARC T1.
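The psrinfo deduction is a single division. Here is a sketch, assuming plain psrinfo prints one line per processor the machine sees and psrinfo -p prints a single number; the sample text is illustrative and was not captured from a real machine.

```python
def per_physical(psrinfo_output, physical_count):
    """Processors seen by the OS per physical processor."""
    # Count non-empty lines: one per processor reported by plain psrinfo.
    virtual = sum(1 for line in psrinfo_output.splitlines() if line.strip())
    if physical_count <= 0 or virtual % physical_count:
        raise ValueError("counts do not divide evenly")
    return virtual // physical_count


# Illustrative sample only.
sample = """0 on-line since 01/01/2020 00:00:00
1 on-line since 01/01/2020 00:00:00
2 on-line since 01/01/2020 00:00:00
3 on-line since 01/01/2020 00:00:00
"""
print(per_physical(sample, 2))   # 2
```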
I think you can use prtdiag or prtconf on many UNIX systems.
My favorite is to look at the boot messages. If it's been recently booted try running /etc/dmesg. Otherwise find the boot messages, logged in /var/adm or some place in /var.