How does nats bench work? I reach 30.34 TB/sec using nats bench but I don't understand why - nats.io

I'm new to NATS. I first benchmarked NATS with nats-bench:
quanlm@quanlm2:~/go/src/github.com/nats-io/nats.go/examples/nats-bench$ go run main.go -np 1 -n 100000000 -ms 1600000 -csv test foo
Starting benchmark [msgs=100000000, msgsize=1600000, pubs=1, subs=0]
Pub stats: 20,848,474 msgs/sec ~ 30.34 TB/sec
Saved metric data in csv file test
My computer setup:
"Intel(R) Xeon(R) CPU E5-2630 v4 # 2.20GHz CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 16
CPU MHz: 2199.996"
Memory: 24GB
SSD
I have no idea how nats-bench works to reach that 30.34 TB/sec.
Is this just a bug, or did I do something wrong?
Btw: what unit does msgsize count in: bits or bytes?

Sorry for the delay. You are publishing 100 million messages, each of 1,600,000 bytes (about 1.5 MiB); msgsize is counted in bytes, not bits. The bench utility reports that it was able to send 20,848,474 messages per second, and since one message is 1,600,000 bytes, that is 20,848,474 * 1,600,000 = 33,357,558,400,000 bytes per second, which is the reported ~30.34 TB per second (dividing by 1024^4).
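For reference, here is a minimal Go sketch that simply reproduces that arithmetic; the constants are the figures taken from the bench output above:

package main

import "fmt"

func main() {
	const (
		msgsPerSec = 20848474 // messages per second reported by the bench
		msgSize    = 1600000  // -ms value: message size in bytes
	)
	bytesPerSec := float64(msgsPerSec) * float64(msgSize)
	fmt.Printf("%.0f bytes/sec\n", bytesPerSec)      // 33357558400000
	fmt.Printf("%.2f TB/sec\n", bytesPerSec/(1<<40)) // ~30.34, matching the reported figure
}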

Related

Fatal error in PMPI_Init: Other MPI error, error stack: Missing hostname or invalid host/port

A simple MPI application is failing with the following error when host1 is included in the hostfile.
Error: Fatal error in PMPI_Init: Other MPI error, error stack: Missing hostname or invalid host/port description in business card
This application works fine when host1 is excluded from the hostfile.
I tried using Intel Cluster Checker and have attached the corresponding log.
Can you please help me interpret this log? It seems to mostly contain differences between the various hosts specified with -f (machinelist), without really highlighting any issue with host1 that could explain this error.
Please find the logs below:
SUMMARY
Command-line: clck -f machinesToTest -c clck.xml -Fhealth_user -Fhealth_base
-Fhealth_extended_user -Fmpi_prereq_user -l debug
Tests Run: health_user, health_base, health_extended_user,
mpi_prereq_user
**WARNING**: 9 tests failed to run. Information may be incomplete. See
clck_execution_warnings.log for more information.
Overall Result: 33 issues found - FUNCTIONALITY (3), HARDWARE UNIFORMITY (11),
PERFORMANCE (9), SOFTWARE UNIFORMITY (10)
--------------------------------------------------------------------------------
7 nodes tested: host-a2, host-b[1,3,6], host1,
host-c1, host-d
0 nodes with no issues:
7 nodes with issues: host-a2, host-b[1,3,6], host1,
host-c1, host-d
--------------------------------------------------------------------------------
FUNCTIONALITY
The following functionality issues were detected:
1. mpi-local-broken
Message: The single node MPI "Hello World" program did not run
successfully.
7 nodes: host-a2, host-b[1,3,6], host1,
host-c1, host-d
Test: mpi_local_functionality
2. memlock-too-small
Message: The memlock limit, '64', is smaller than recommended.
Remedy: We recommend correcting the limit of locked memory in
/etc/security/limits.conf to the following values: "* hard
memlock unlimited" "* soft memlock unlimited"
7 nodes: host-a2, host-b[1,3,6], host1,
host-c1, host-d
Test: memory_uniformity_user
3. memlock-too-small-ethernet
Message: The memlock limit, '64', is smaller than recommended.
Remedy: We recommend correcting the limit of locked memory in
/etc/security/limits.conf to the following values: "* hard
memlock unlimited" "* soft memlock unlimited"
7 nodes: host-a2, host-b[1,3,6], host1,
host-c1, host-d
Test: mpi_ethernet
HARDWARE UNIFORMITY
The following hardware uniformity issues were detected:
1. memory-not-uniform
Message: The amount of physical memory is not within the range of
792070572.0 KiB +/- 262144.0 KiB defined by nodes in the same
grouping.
5 nodes: host-b[1,3], host1, host-c1, host-d
Test: memory_uniformity_base
Details:
#Nodes Memory Nodes
1 1584974816.0 KiB host-c1
1 2113513608.0 KiB host1
1 529153152.0 KiB host-d
1 790940180.0 KiB host-b1
1 790940184.0 KiB host-b3
2. logical-cores-not-uniform:24
Message: The logical cores, '24', is not uniform across all nodes in the
same grouping. 67% of nodes in the same grouping have the same
number of logical cores.
Remedy: Please ensure that BIOS settings that can influence the number
of logical cores, like Hyper-Threading, VMX, VT-d and x2apic,
are uniform across nodes in the same grouping.
2 nodes: host-b[1,3]
Test: cpu_base
3. logical-cores-not-uniform:48
Message: The logical cores, '48', is not uniform across all nodes in the
same grouping. 33% of nodes in the same grouping have the same
number of logical cores.
Remedy: Please ensure that BIOS settings that can influence the number
of logical cores, like Hyper-Threading, VMX, VT-d and x2apic,
are uniform across nodes in the same grouping.
1 node: host-b6
Test: cpu_base
4. threads-per-core-not-uniform:1
Message: The number of threads available per core, '1', is not uniform.
67% of nodes in the same grouping have the same number of
threads available per core.
Remedy: Please enable/disable hyper-threading uniformly on Intel(R)
CPUs.
2 nodes: host-b[1,3]
Test: cpu_base
5. threads-per-core-not-uniform:2
Message: The number of threads available per core, '2', is not uniform.
33% of nodes in the same grouping have the same number of
threads available per core.
Remedy: Please enable/disable hyper-threading uniformly on Intel(R)
CPUs.
1 node: host-b6
Test: cpu_base
6. cpu-model-name-not-uniform
Message: The CPU model, 'Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz', is
not uniform. 14% of nodes in the same grouping have the same
CPU model.
1 node: host-a2
Test: cpu_base
7. cpu-model-name-not-uniform
Message: The CPU model, 'Intel(R) Xeon(R) Gold 6256 CPU @ 3.60GHz', is
not uniform. 43% of nodes in the same grouping have the same
CPU model.
3 nodes: host-b[1,3,6]
Test: cpu_base
8. cpu-model-name-not-uniform
Message: The CPU model, 'Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz', is
not uniform. 14% of nodes in the same grouping have the same
CPU model.
1 node: host1
Test: cpu_base
9. cpu-model-name-not-uniform
Message: The CPU model, 'Intel(R) Xeon(R) CPU E7-4830 v3 @ 2.10GHz', is
not uniform. 14% of nodes in the same grouping have the same
CPU model.
1 node: host-c1
Test: cpu_base
10. cpu-model-name-not-uniform
Message: The CPU model, 'Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz', is
not uniform. 14% of nodes in the same grouping have the same
CPU model.
1 node: host-d
Test: cpu_base
11. ethernet-firmware-version-is-not-consistent
Message: Inconsistent Ethernet firmware version.
3 nodes: host-a2, host1, host-c1
Test: ethernet
Details:
#Nodes Firmware Version Nodes
1 0x80000887, 1.2028.0 host-c1
1 0x800008e8 host-a2
1 4.0.596 host1
1 5719-v1.46 NCSI v1.3.16.0 host-a2
PERFORMANCE
The following performance issues were detected:
1. process-is-high-cpu
Message: Processes using high CPU.
Remedy: If this command is running in error, kill the process on the
node (if you are not the owner of the process, elevated
privileges may be required.)
5 nodes: host-a2, host-b[3,6], host1, host-c1
Test: node_process_status
Details:
#Nodes User PID %CPU Process Nodes
1 usera 204058 98.9 /med/code7/usera/blue4/rnd/software/amd64.linux.gnu.product/distribVelsyn host-b3
1 userb 120854 98.5 /med/code7/userb/rb21B/software/amd64.linux.gnu.product/velsyn host1
1 userb 71486 98.6 /med/code7/userb/rb21B/software/amd64.linux.gnu.product/velsyn host1
1 wvgrid 11116 37.2 /wv/wv-med/sge/bin/lx-amd64/sge_execd host-a2
1 wvgrid 19160 21.1 /wv/wv-med/sge/bin/lx-amd64/sge_execd host-b6
1 wvgrid 25097 79.7 /wv/wv-med/sge/bin/lx-amd64/sge_execd host-c1
1 wvgrid 90731 58.1 /wv/wv-med/sge/bin/lx-amd64/sge_execd host1
2. substandard-dgemm-due-to-high-cpu-process
Message: The substandard DGEMM benchmark result of 1.528 TFLOPS is due to
a conflicting process, pid '204058', using a large amount of
cpu.
Remedy: If this command is running in error, kill the process on the
node (if you are not the owner of the process, elevated
privileges may be required.)
1 node: host-b3
Test: node_process_status
3. substandard-sgemm-due-to-high-cpu-process
Message: The substandard SGEMM benchmark result of 3.277 TFLOPS is due to
a conflicting process, pid '204058', using a large amount of
cpu.
Remedy: If this command is running in error, kill the process on the
node (if you are not the owner of the process, elevated
privileges may be required.)
1 node: host-b3
Test: node_process_status
4. sgemm-data-is-substandard-avx512
Message: The following SGEMM benchmark results are below the accepted
4.147 TFLOPS(100%). The acceptable fraction (90%) can be set
using the <sgemm-peak-fraction> option in the configuration
file. For more details, please refer to the Intel(R) Cluster
Checker User Guide.
3 nodes: host-b[1,3,6]
Test: sgemm_cpu_performance
Details:
#Nodes Result %Below Peak Nodes
1 2.355 TFLOPS 57 host-b6
1 3.181 TFLOPS 77 host-b1
1 3.277 TFLOPS 79 host-b3
5. substandard-sgemm-due-to-high-cpu-process
Message: The substandard SGEMM benchmark result of 2.355 TFLOPS is due to
a conflicting process, pid '19160', using a large amount of cpu.
Remedy: If this command is running in error, kill the process on the
node (if you are not the owner of the process, elevated
privileges may be required.)
1 node: host-b6
Test: node_process_status
6. dgemm-data-is-substandard-avx512
Message: The DGEMM benchmark result is below the accepted 2.074
TFLOPS(100%). The acceptable fraction (90%) can be set using the
<dgemm-peak-fraction> option in the configuration file. For more
details, please refer to the Intel(R) Cluster Checker User
Guide.
3 nodes: host-b[1,3,6]
Test: dgemm_cpu_performance
Details:
#Nodes Result %Below Peak Nodes
1 1.389 TFLOPS 67 host-b1
1 1.528 TFLOPS 74 host-b3
1 1.570 TFLOPS 76 host-b6
7. dgemm-data-is-substandard
Message: The following DGEMM benchmark results are below the theoretical
peak of 1.165 TFLOPS.
1 node: host-a2
Test: dgemm_cpu_performance
Details:
#Nodes Result %Below Peak Nodes
1 845.441 GFLOPS 73 host-a2
8. substandard-dgemm-due-to-high-cpu-process
Message: The substandard DGEMM benchmark result of 845.441 GFLOPS is due
to a conflicting process, pid '11116', using a large amount of
cpu.
Remedy: If this command is running in error, kill the process on the
node (if you are not the owner of the process, elevated
privileges may be required.)
1 node: host-a2
Test: node_process_status
9. substandard-dgemm-due-to-high-cpu-process
Message: The substandard DGEMM benchmark result of 1.570 TFLOPS is due to
a conflicting process, pid '19160', using a large amount of cpu.
Remedy: If this command is running in error, kill the process on the
node (if you are not the owner of the process, elevated
privileges may be required.)
1 node: host-b6
Test: node_process_status
SOFTWARE UNIFORMITY
The following software uniformity issues were detected:
1. ethernet-driver-is-not-consistent
Message: Inconsistent Ethernet driver.
2 nodes: host-a2, host1
Test: ethernet
Details:
#Nodes Driver Nodes
1 netxen_nic host1
1 tg3 host-a2
2. kernel-not-uniform
Message: The Linux kernel version, '3.10.0-957.27.2.el7.x86_64', is not
uniform. 86% of nodes in the same grouping have the same
version.
6 nodes: host-a2, host-b[1,3,6], host1,
host-c1
Test: kernel_version_uniformity
3. kernel-not-uniform
Message: The Linux kernel version, '2.6.32-573.26.1.el6.x86_64', is not
uniform. 14% of nodes in the same grouping have the same
version.
1 node: host-d
Test: kernel_version_uniformity
4. environment-variable-not-uniform
Message: Environment variables are not uniform across the nodes.
7 nodes: host-a2, host-b[1,3,6], host1,
host-c1, host-d
Test: environment_variables_uniformity
Details:
#Nodes Variable Value Nodes
6 G_BROKEN_FILENAMES host-a2, host-b[1,3,6], host1, host-c1
6 KDE_IS_PRELINKED host-a2, host-b[1,3,6], host1, host-c1
6 MODULEPATH host-a2, host-b[1,3,6], host1, host-c1
6 MODULESHOME host-a2, host-b[1,3,6], host1, host-c1
1 G_BROKEN_FILENAMES 1 host-d
1 KDE_IS_PRELINKED 1 host-d
1 MODULEPATH /usr/share/Modules/modulefiles:/etc/modulefiles host-d
1 MODULESHOME /usr/share/Modules host-d
5. perl-not-uniform
Message: The Perl version, '5.16.3', is not uniform. 86% of nodes in the
same grouping have the same version.
6 nodes: host-a2, host-b[1,3,6], host1,
host-c1
Test: perl_functionality
6. perl-not-uniform
Message: The Perl version, '5.10.1', is not uniform. 14% of nodes in the
same grouping have the same version.
1 node: host-d
Test: perl_functionality
7. python-not-uniform
Message: The Python version, '2.7.5', is not uniform. 86% of nodes in
the same grouping have the same version.
6 nodes: host-a2, host-b[1,3,6], host1,
host-c1
Test: python_functionality
8. python-not-uniform
Message: The Python version, '2.6.6', is not uniform. 14% of nodes in
the same grouping have the same version.
1 node: host-d
Test: python_functionality
9. ethernet-driver-version-is-not-consistent
Message: Inconsistent Ethernet driver version.
2 nodes: host-a2, host1
Test: ethernet
Details:
#Nodes Version Nodes
1 3.137 host-a2
1 4.0.82 host1
10. ethernet-interrupt-coalescing-state-not-uniform
Message: Ethernet interrupt coalescing is not enabled/disabled uniformly
across nodes in the same grouping.
Remedy: Append "/sbin/ethtool -C eno1 rx-usecs <value>" to the site
specific system startup script. Use '0' to permanently disable
Ethernet interrupt coalescing or other value as needed. The
site specific system startup script is typically
/etc/rc.d/rc.local or /etc/rc.d/boot.local.
1 node: host1
Test: ethernet
Details:
#Nodes State Interface Nodes
1 enabled eno1 host1
1 enabled eno3 host1
--------------------------------------------------------------------------------
INFORMATIONAL
The following additional information was detected:
1. mpi-network-interface
Message: The cluster has 1 network interfaces (Ethernet). Intel(R) MPI
Library uses by default the first interface detected in the
order of: (1) Intel(R) Omni-Path Architecture (Intel(R) OPA),
(2) InfiniBand, (3) Ethernet. You can set a specific interface
by setting the environment variable I_MPI_OFI_PROVIDER.
Ethernet: I_MPI_OFI_PROVIDER=sockets mpiexec.hydra; InfiniBand:
I_MPI_OFI_PROVIDER=verbs mpiexec.hydra; Intel(R) OPA:
I_MPI_OFI_PROVIDER=psm2 mpiexec.hydra.
7 nodes: host-a2, host-b[1,3,6], host1,
host-c1, host-d
Test: mpi_prereq_user
--------------------------------------------------------------------------------
Intel(R) Cluster Checker 2021 Update 1
00:34:46 April 23 2021 UTC
Nodefile used: machinesToTest
Databases used: $HOME/.clck/2021.1.1/clck.db
I tried using a consistent Ethernet driver version on host1, followed the remedy provided in the log for ethernet-interrupt-coalescing-state-not-uniform, and ran the sample on heterogeneous nodes including host1.

Running jobs (scripts) on Beowulf with multiple machines with memory allocation

I built a simple Beowulf cluster with one master and 4 nodes (total of 128 cores) following this tutorial.
https://www.youtube.com/watch?v=gvR1eQyxS9I
I successfully ran a "Hello World" program using some of the cores of my cluster. This is what I used:
$ mpiexec -n 64 -f hosts ./mpi_hello
Now that I know how to run a parallel program, I would like to allocate some memory, as I am planning to do some data analysis.
Each node has 16 GB of RAM. How can I allocate 32 GB or 64 GB of RAM for data analysis?
Thank you very much for your help.

Using all cores with Microsoft R Open and Google Compute Engine

I'm using Microsoft R Open on a GCE instance that has two vCPUs. Here are its specs.
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 2
Core(s) per socket: 1
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU @ 2.30GHz
Stepping: 0
CPU MHz: 2300.000
BogoMIPS: 4600.00
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0,1
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc eagerfpu pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms xsaveopt
Even though I have two cores, Microsoft R Open seems to recognize only one of them, so I'm not taking full advantage of my computing capacity. I can't set the number of threads manually either.
Microsoft R Open 3.3.2
The enhanced R distribution from Microsoft
Microsoft packages Copyright (C) 2016 Microsoft Corporation
Using the Intel MKL for parallel mathematical computing(using 1 cores).
Default CRAN mirror snapshot taken on 2016-11-01.
See: https://mran.microsoft.com/.
> getMKLthreads()
[1] 1
> setMKLthreads(2)
Number of threads at maximum: no change has been made.
Here's a graph showing CPU usage. It never uses more than 50% of CPU power.
So, what should I do so I can use all my cores with MRO?
You are running a Xeon which is hyper-threaded. You have one physical CPU core with hyper-threading; the OS treats it as two CPUs, but there is only one physical core. MRO uses the physical cores only (without hyper-threading).
You can use this:
library(doParallel)
no_cores <- detectCores() - 1
registerDoParallel(cores=no_cores)
It will use one core less than the total number of cores you have, leaving one core for OS operations. Try it out.

Scaling nginx with static files -- non-Persistent requests kill req/s

We are working on a project where we need to serve a small static XML file at ~40k requests/s.
All incoming requests are sent to the server from HAProxy. However, none of the requests will be persistent.
The issue is that when benchmarking with non-persistent requests, the nginx instance caps out at 19,114 req/s. When persistent connections are enabled, performance increases by nearly an order of magnitude, to 168,867 req/s. The results are similar with G-WAN.
When benchmarking non-persistent requests, CPU usage is minimal.
What can I do to increase performance with non-persistent connections and nginx?
[root@spare01 lighttpd-weighttp-c24b505]# ./weighttp -n 1000000 -c 100 -t 16 "http://192.168.1.40/feed.txt"
finished in 52 sec, 315 millisec and 603 microsec, 19114 req/s, 5413 kbyte/s
requests: 1000000 total, 1000000 started, 1000000 done, 1000000 succeeded, 0 failed, 0 errored
status codes: 1000000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 290000000 bytes total, 231000000 bytes http, 59000000 bytes data
[root@spare01 lighttpd-weighttp-c24b505]# ./weighttp -n 1000000 -c 100 -t 16 -k "http://192.168.1.40/feed.txt"
finished in 5 sec, 921 millisec and 791 microsec, 168867 req/s, 48640 kbyte/s
requests: 1000000 total, 1000000 started, 1000000 done, 1000000 succeeded, 0 failed, 0 errored
status codes: 1000000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 294950245 bytes total, 235950245 bytes http, 59000000 bytes data
Your two tests are identical except for HTTP Keep-Alives:
./weighttp -n 1000000 -c 100 -t 16 "http://192.168.1.40/feed.txt"
./weighttp -n 1000000 -c 100 -t 16 -k "http://192.168.1.40/feed.txt"
And the one with HTTP Keep-Alives is nearly 10x faster:
finished in 52 sec, 19114 req/s, 5413 kbyte/s
finished in 5 sec, 168867 req/s, 48640 kbyte/s
First, HTTP Keep-Alives (persistent connections) make HTTP requests run faster because:
Without HTTP Keep-Alives, the client must establish a new CONNECTION for EACH request (this is slow because of the TCP handshake).
With HTTP Keep-Alives, the client can send many requests over the SAME CONNECTION. This is faster because there is less work to do per request.
Second, you say that the static XML file is "small".
Is "small" nearer to 1 KB or 1 MB? We don't know. But that makes a huge difference in terms of the options available to speed things up.
Huge files are usually served through sendfile() because it works in the kernel, freeing the usermode server from the burden of reading from disk and buffering.
Small files can use more flexible options available for application developers in usermode, but here also, file size matters (bytes and kilobytes are different animals).
Third, you are using 16 threads with your test. Are you really enjoying 16 PHYSICAL CPU Cores on BOTH the client and the server machines?
If that's not the case, then you are simply slowing down the test to the point that you are no longer testing the web servers.
As you see, many factors have an influence on performance. And there are more with OS tuning (the TCP stack options, available file handles, system buffers, etc.).
To get the most out of a system, you need to examine all those parameters and pick the best values for your particular exercise.
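As a rough illustration of that first point (connection reuse), here is a small Go sketch, not a tuned benchmark, that times the same URL with keep-alives enabled and then disabled; the URL is simply the one from your test and is a placeholder:

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

// timeRequests issues n sequential GET requests with the given client
// and returns the total elapsed time.
func timeRequests(client *http.Client, url string, n int) time.Duration {
	start := time.Now()
	for i := 0; i < n; i++ {
		resp, err := client.Get(url)
		if err != nil {
			panic(err)
		}
		io.Copy(io.Discard, resp.Body) // drain the body so the connection can be reused
		resp.Body.Close()
	}
	return time.Since(start)
}

func main() {
	url := "http://192.168.1.40/feed.txt" // placeholder target

	keepAlive := &http.Client{Transport: &http.Transport{}}                          // reuses connections
	noKeepAlive := &http.Client{Transport: &http.Transport{DisableKeepAlives: true}} // new TCP connection per request

	const n = 1000
	fmt.Println("keep-alive:   ", timeRequests(keepAlive, url, n))
	fmt.Println("no keep-alive:", timeRequests(noKeepAlive, url, n))
}

The second client should be noticeably slower, and the gap grows with network latency, because every request pays for a fresh TCP handshake.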

Command to find information about CPUs on a UNIX machine

Do you know if there is a UNIX command that will tell me what the CPU configuration for my Sun OS UNIX machine is? I am also trying to determine the memory configuration. Is there a UNIX command that will tell me that?
There is no standard Unix command, AFAIK. I haven't used Sun OS, but on Linux, you can use this:
cat /proc/cpuinfo
Sorry that it is Linux, not Sun OS. There is probably something similar though for Sun OS.
The nproc command shows the number of processing units available:
$ nproc
Sample outputs: 4
lscpu gathers CPU architecture information from /proc/cpuinfo in a human-readable format:
$ lscpu
Sample outputs:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 4
CPU socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 15
Stepping: 7
CPU MHz: 1866.669
BogoMIPS: 3732.83
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
NUMA node0 CPU(s): 0-7
Try psrinfo to find the processor type and the number of physical processors installed on the system.
Firstly, it probably depends on which version of Solaris you're running, but also on what hardware you have.
On SPARC at least, you have psrinfo to show you processor information; run on its own, it will show you the number of CPUs the machine sees. psrinfo -p shows the number of physical processors installed. From that you can deduce the number of threads/cores per physical processor.
prtdiag will display a fair bit of info about the hardware in your machine. It looks like on a V240 you do get memory channel info from prtdiag, but you don't on a T2000. I guess that's an architecture issue between UltraSPARC IIIi and UltraSPARC T1.
I think you can use prtdiag or prtconf on many UNIX systems.
My favorite is to look at the boot messages. If it has been recently booted, try running /etc/dmesg. Otherwise, find the boot messages logged in /var/adm or somewhere in /var.