Fatal error in PMPI_Init: Other MPI error, error stack: Missing hostname or invalid host/port - mpi

A simple MPI application is failing with the following error when host1 is included in the hostfile.
Error: Fatal error in PMPI_Init: Other MPI error, error stack: Missing hostname or invalid host/port description in business card
This application works fine when host1 is excluded from the hostfile.
I tried running Intel Cluster Checker and have attached the corresponding log.
Can you please help me interpret this log? It mostly seems to list differences between the hosts specified with -f (machine list) without really highlighting any issue with host1 that would explain this error.
Please find the log below:
SUMMARY
Command-line: clck -f machinesToTest -c clck.xml -Fhealth_user -Fhealth_base
-Fhealth_extended_user -Fmpi_prereq_user -l debug
Tests Run: health_user, health_base, health_extended_user,
mpi_prereq_user
**WARNING**: 9 tests failed to run. Information may be incomplete. See
clck_execution_warnings.log for more information.
Overall Result: 33 issues found - FUNCTIONALITY (3), HARDWARE UNIFORMITY (11),
PERFORMANCE (9), SOFTWARE UNIFORMITY (10)
--------------------------------------------------------------------------------
7 nodes tested: host-a2, host-b[1,3,6], host1,
host-c1, host-d
0 nodes with no issues:
7 nodes with issues: host-a2, host-b[1,3,6], host1,
host-c1, host-d
--------------------------------------------------------------------------------
FUNCTIONALITY
The following functionality issues were detected:
1. mpi-local-broken
Message: The single node MPI "Hello World" program did not run
successfully.
7 nodes: host-a2, host-b[1,3,6], host1,
host-c1, host-d
Test: mpi_local_functionality
2. memlock-too-small
Message: The memlock limit, '64', is smaller than recommended.
Remedy: We recommend correcting the limit of locked memory in
/etc/security/limits.conf to the following values: "* hard
memlock unlimited" "* soft memlock unlimited"
7 nodes: host-a2, host-b[1,3,6], host1,
host-c1, host-d
Test: memory_uniformity_user
3. memlock-too-small-ethernet
Message: The memlock limit, '64', is smaller than recommended.
Remedy: We recommend correcting the limit of locked memory in
/etc/security/limits.conf to the following values: "* hard
memlock unlimited" "* soft memlock unlimited"
7 nodes: host-a2, host-b[1,3,6], host1,
host-c1, host-d
Test: mpi_ethernet
HARDWARE UNIFORMITY
The following hardware uniformity issues were detected:
1. memory-not-uniform
Message: The amount of physical memory is not within the range of
792070572.0 KiB +/- 262144.0 KiB defined by nodes in the same
grouping.
5 nodes: host-b[1,3], host1, host-c1, host-d
Test: memory_uniformity_base
Details:
#Nodes Memory Nodes
1 1584974816.0 KiB host-c1
1 2113513608.0 KiB host1
1 529153152.0 KiB host-d
1 790940180.0 KiB host-b1
1 790940184.0 KiB host-b3
2. logical-cores-not-uniform:24
Message: The logical cores, '24', is not uniform across all nodes in the
same grouping. 67% of nodes in the same grouping have the same
number of logical cores.
Remedy: Please ensure that BIOS settings that can influence the number
of logical cores, like Hyper-Threading, VMX, VT-d and x2apic,
are uniform across nodes in the same grouping.
2 nodes: host-b[1,3]
Test: cpu_base
3. logical-cores-not-uniform:48
Message: The logical cores, '48', is not uniform across all nodes in the
same grouping. 33% of nodes in the same grouping have the same
number of logical cores.
Remedy: Please ensure that BIOS settings that can influence the number
of logical cores, like Hyper-Threading, VMX, VT-d and x2apic,
are uniform across nodes in the same grouping.
1 node: host-b6
Test: cpu_base
4. threads-per-core-not-uniform:1
Message: The number of threads available per core, '1', is not uniform.
67% of nodes in the same grouping have the same number of
threads available per core.
Remedy: Please enable/disable hyper-threading uniformly on Intel(R)
CPUs.
2 nodes: host-b[1,3]
Test: cpu_base
5. threads-per-core-not-uniform:2
Message: The number of threads available per core, '2', is not uniform.
33% of nodes in the same grouping have the same number of
threads available per core.
Remedy: Please enable/disable hyper-threading uniformly on Intel(R)
CPUs.
1 node: host-b6
Test: cpu_base
6. cpu-model-name-not-uniform
Message: The CPU model, 'Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz', is
not uniform. 14% of nodes in the same grouping have the same
CPU model.
1 node: host-a2
Test: cpu_base
7. cpu-model-name-not-uniform
Message: The CPU model, 'Intel(R) Xeon(R) Gold 6256 CPU @ 3.60GHz', is
not uniform. 43% of nodes in the same grouping have the same
CPU model.
3 nodes: host-b[1,3,6]
Test: cpu_base
8. cpu-model-name-not-uniform
Message: The CPU model, 'Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz', is
not uniform. 14% of nodes in the same grouping have the same
CPU model.
1 node: host1
Test: cpu_base
9. cpu-model-name-not-uniform
Message: The CPU model, 'Intel(R) Xeon(R) CPU E7-4830 v3 @ 2.10GHz', is
not uniform. 14% of nodes in the same grouping have the same
CPU model.
1 node: host-c1
Test: cpu_base
10. cpu-model-name-not-uniform
Message: The CPU model, 'Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz', is
not uniform. 14% of nodes in the same grouping have the same
CPU model.
1 node: host-d
Test: cpu_base
11. ethernet-firmware-version-is-not-consistent
Message: Inconsistent Ethernet firmware version.
3 nodes: host-a2, host1, host-c1
Test: ethernet
Details:
#Nodes Firmware Version Nodes
1 0x80000887, 1.2028.0 host-c1
1 0x800008e8 host-a2
1 4.0.596 host1
1 5719-v1.46 NCSI v1.3.16.0 host-a2
PERFORMANCE
The following performance issues were detected:
1. process-is-high-cpu
Message: Processes using high CPU.
Remedy: If this command is running in error, kill the process on the
node (if you are not the owner of the process, elevated
privileges may be required.)
5 nodes: host-a2, host-b[3,6], host1, host-c1
Test: node_process_status
Details:
#Nodes User PID %CPU Process Nodes
1 usera 204058 98.9 /med/code7/usera/blue4/rnd/software/amd64.linux.gnu.product/distribVelsyn host-b3
1 userb 120854 98.5 /med/code7/userb/rb21B/software/amd64.linux.gnu.product/velsyn host1
1 userb 71486 98.6 /med/code7/userb/rb21B/software/amd64.linux.gnu.product/velsyn host1
1 wvgrid 11116 37.2 /wv/wv-med/sge/bin/lx-amd64/sge_execd host-a2
1 wvgrid 19160 21.1 /wv/wv-med/sge/bin/lx-amd64/sge_execd host-b6
1 wvgrid 25097 79.7 /wv/wv-med/sge/bin/lx-amd64/sge_execd host-c1
1 wvgrid 90731 58.1 /wv/wv-med/sge/bin/lx-amd64/sge_execd host1
2. substandard-dgemm-due-to-high-cpu-process
Message: The substandard DGEMM benchmark result of 1.528 TFLOPS is due to
a conflicting process, pid '204058', using a large amount of
cpu.
Remedy: If this command is running in error, kill the process on the
node (if you are not the owner of the process, elevated
privileges may be required.)
1 node: host-b3
Test: node_process_status
3. substandard-sgemm-due-to-high-cpu-process
Message: The substandard SGEMM benchmark result of 3.277 TFLOPS is due to
a conflicting process, pid '204058', using a large amount of
cpu.
Remedy: If this command is running in error, kill the process on the
node (if you are not the owner of the process, elevated
privileges may be required.)
1 node: host-b3
Test: node_process_status
4. sgemm-data-is-substandard-avx512
Message: The following SGEMM benchmark results are below the accepted
4.147 TFLOPS(100%). The acceptable fraction (90%) can be set
using the <sgemm-peak-fraction> option in the configuration
file. For more details, please refer to the Intel(R) Cluster
Checker User Guide.
3 nodes: host-b[1,3,6]
Test: sgemm_cpu_performance
Details:
#Nodes Result %Below Peak Nodes
1 2.355 TFLOPS 57 host-b6
1 3.181 TFLOPS 77 host-b1
1 3.277 TFLOPS 79 host-b3
5. substandard-sgemm-due-to-high-cpu-process
Message: The substandard SGEMM benchmark result of 2.355 TFLOPS is due to
a conflicting process, pid '19160', using a large amount of cpu.
Remedy: If this command is running in error, kill the process on the
node (if you are not the owner of the process, elevated
privileges may be required.)
1 node: host-b6
Test: node_process_status
6. dgemm-data-is-substandard-avx512
Message: The DGEMM benchmark result is below the accepted 2.074
TFLOPS(100%). The acceptable fraction (90%) can be set using the
<dgemm-peak-fraction> option in the configuration file. For more
details, please refer to the Intel(R) Cluster Checker User
Guide.
3 nodes: host-b[1,3,6]
Test: dgemm_cpu_performance
Details:
#Nodes Result %Below Peak Nodes
1 1.389 TFLOPS 67 host-b1
1 1.528 TFLOPS 74 host-b3
1 1.570 TFLOPS 76 host-b6
7. dgemm-data-is-substandard
Message: The following DGEMM benchmark results are below the theoretical
peak of 1.165 TFLOPS.
1 node: host-a2
Test: dgemm_cpu_performance
Details:
#Nodes Result %Below Peak Nodes
1 845.441 GFLOPS 73 host-a2
8. substandard-dgemm-due-to-high-cpu-process
Message: The substandard DGEMM benchmark result of 845.441 GFLOPS is due
to a conflicting process, pid '11116', using a large amount of
cpu.
Remedy: If this command is running in error, kill the process on the
node (if you are not the owner of the process, elevated
privileges may be required.)
1 node: host-a2
Test: node_process_status
9. substandard-dgemm-due-to-high-cpu-process
Message: The substandard DGEMM benchmark result of 1.570 TFLOPS is due to
a conflicting process, pid '19160', using a large amount of cpu.
Remedy: If this command is running in error, kill the process on the
node (if you are not the owner of the process, elevated
privileges may be required.)
1 node: host-b6
Test: node_process_status
SOFTWARE UNIFORMITY
The following software uniformity issues were detected:
1. ethernet-driver-is-not-consistent
Message: Inconsistent Ethernet driver.
2 nodes: host-a2, host1
Test: ethernet
Details:
#Nodes Driver Nodes
1 netxen_nic host1
1 tg3 host-a2
2. kernel-not-uniform
Message: The Linux kernel version, '3.10.0-957.27.2.el7.x86_64', is not
uniform. 86% of nodes in the same grouping have the same
version.
6 nodes: host-a2, host-b[1,3,6], host1,
host-c1
Test: kernel_version_uniformity
3. kernel-not-uniform
Message: The Linux kernel version, '2.6.32-573.26.1.el6.x86_64', is not
uniform. 14% of nodes in the same grouping have the same
version.
1 node: host-d
Test: kernel_version_uniformity
4. environment-variable-not-uniform
Message: Environment variables are not uniform across the nodes.
7 nodes: host-a2, host-b[1,3,6], host1,
host-c1, host-d
Test: environment_variables_uniformity
Details:
#Nodes Variable Value Nodes
6 G_BROKEN_FILENAMES host-a2, host-b[1,3,6], host1, host-c1
6 KDE_IS_PRELINKED host-a2, host-b[1,3,6], host1, host-c1
6 MODULEPATH host-a2, host-b[1,3,6], host1, host-c1
6 MODULESHOME host-a2, host-b[1,3,6], host1, host-c1
1 G_BROKEN_FILENAMES 1 host-d
1 KDE_IS_PRELINKED 1 host-d
1 MODULEPATH /usr/share/Modules/modulefiles:/etc/modulefiles host-d
1 MODULESHOME /usr/share/Modules host-d
5. perl-not-uniform
Message: The Perl version, '5.16.3', is not uniform. 86% of nodes in the
same grouping have the same version.
6 nodes: host-a2, host-b[1,3,6], host1,
host-c1
Test: perl_functionality
6. perl-not-uniform
Message: The Perl version, '5.10.1', is not uniform. 14% of nodes in the
same grouping have the same version.
1 node: host-d
Test: perl_functionality
7. python-not-uniform
Message: The Python version, '2.7.5', is not uniform. 86% of nodes in
the same grouping have the same version.
6 nodes: host-a2, host-b[1,3,6], host1,
host-c1
Test: python_functionality
8. python-not-uniform
Message: The Python version, '2.6.6', is not uniform. 14% of nodes in
the same grouping have the same version.
1 node: host-d
Test: python_functionality
9. ethernet-driver-version-is-not-consistent
Message: Inconsistent Ethernet driver version.
2 nodes: host-a2, host1
Test: ethernet
Details:
#Nodes Version Nodes
1 3.137 host-a2
1 4.0.82 host1
10. ethernet-interrupt-coalescing-state-not-uniform
Message: Ethernet interrupt coalescing is not enabled/disabled uniformly
across nodes in the same grouping.
Remedy: Append "/sbin/ethtool -C eno1 rx-usecs <value>" to the site
specific system startup script. Use '0' to permanently disable
Ethernet interrupt coalescing or other value as needed. The
site specific system startup script is typically
/etc/rc.d/rc.local or /etc/rc.d/boot.local.
1 node: host1
Test: ethernet
Details:
#Nodes State Interface Nodes
1 enabled eno1 host1
1 enabled eno3 host1
--------------------------------------------------------------------------------
INFORMATIONAL
The following additional information was detected:
1. mpi-network-interface
Message: The cluster has 1 network interfaces (Ethernet). Intel(R) MPI
Library uses by default the first interface detected in the
order of: (1) Intel(R) Omni-Path Architecture (Intel(R) OPA),
(2) InfiniBand, (3) Ethernet. You can set a specific interface
by setting the environment variable I_MPI_OFI_PROVIDER.
Ethernet: I_MPI_OFI_PROVIDER=sockets mpiexec.hydra; InfiniBand:
I_MPI_OFI_PROVIDER=verbs mpiexec.hydra; Intel(R) OPA:
I_MPI_OFI_PROVIDER=psm2 mpiexec.hydra.
7 nodes: host-a2, host-b[1,3,6], host1,
host-c1, host-d
Test: mpi_prereq_user
--------------------------------------------------------------------------------
Intel(R) Cluster Checker 2021 Update 1
00:34:46 April 23 2021 UTC
Nodefile used: machinesToTest
Databases used: $HOME/.clck/2021.1.1/clck.db
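
As a quick follow-up on the memlock-too-small findings above, the per-node limit and the limits.conf lines from the remedy can be checked with something like this sketch (plain ssh shown; substitute a parallel shell if you have one):

ssh host1 'ulimit -l'        # reports 64 on the flagged nodes, per the log
# lines the remedy suggests adding to /etc/security/limits.conf on every node:
# * hard memlock unlimited
# * soft memlock unlimited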

I tried making the Ethernet driver version consistent on host1, followed the remedy provided in the log for ethernet-interrupt-coalescing-state-not-uniform, and re-ran the sample on the heterogeneous nodes including host1.
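
For anyone trying to reproduce this, a minimal way to isolate host1 might look like the following sketch (the sockets provider comes from the informational note in the clck log; ./hello_world stands in for the single-node test binary):

ssh host1 hostname -f                      # confirm which name host1 reports for itself
getent hosts host1                         # and that it resolves to the expected address
I_MPI_DEBUG=5 I_MPI_OFI_PROVIDER=sockets mpiexec.hydra -n 2 -hosts host1 ./hello_world

If even this single-node run fails, that would point toward the hostname/interface setup on host1 itself rather than differences between nodes.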

Related

Openmpi 4.0.5 fails to distribute tasks to more than 1 node

We are having trouble with openmpi 4.0.5 on our cluster: It works as long as only 1 node is requested, but as soon as more than 1 is requested (e.g. mpirun -np 24 ./hello_world with --ntasks-per-node=12) it crashes and we get the following error message:
--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 2
slots that were requested by the application:
./hello_world
Either request fewer slots for your application, or make more slots
available for use.
A "slot" is the Open MPI term for an allocatable unit where we can
launch a process. The number of slots available are defined by the
environment in which Open MPI processes are run:
1. Hostfile, via "slots=N" clauses (N defaults to number of
processor cores if not provided)
2. The --host command line parameter, via a ":N" suffix on the
hostname (N defaults to 1 if not provided)
3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
4. If none of a hostfile, the --host command line parameter, or an
RM is present, Open MPI defaults to the number of processor cores
In all the above cases, if you want Open MPI to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.
Alternatively, you can use the --oversubscribe option to ignore the
number of available slots when deciding the number of processes to
launch.
--------------------------------------------------------------------------
I have tried using --oversubscribe, but this will still only use 1 node, even though smaller jobs would run that way. I have also tried specifically requesting nodes (e.g. -host node36,node37), but this results in the following error message:
[node37:16739] *** Process received signal ***
[node37:16739] Signal: Segmentation fault (11)
[node37:16739] Signal code: Address not mapped (1)
[node37:16739] Failing at address: (nil)
[node37:16739] [ 0] /lib64/libpthread.so.0(+0xf5f0)[0x2ac57d70e5f0]
[node37:16739] [ 1] /lib64/libc.so.6(+0x13ed5a)[0x2ac57da59d5a]
[node37:16739] [ 2] /usr/lib64/openmpi/lib/libopen-rte.so.12(orte_daemon+0x10d7)[0x2ac57c6c4827]
[node37:16739] [ 3] orted[0x4007a7]
[node37:16739] [ 4] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2ac57d93d505]
[node37:16739] [ 5] orted[0x400810]
[node37:16739] *** End of error message ***
The cluster has 59 nodes. Slurm 19.05.0 is used as a scheduler and gcc 9.1.0 to compile.
I don't have much experience with mpi - any help would be much appreciated! Maybe someone is familiar with this error and could point me towards what the problem might be.
Thanks for your help,
Johanna
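
For what it's worth, the hostfile mechanism described in that error text would look roughly like this sketch (node names node36/node37 come from the post; 12 slots per node mirrors --ntasks-per-node=12 and is an assumption; the file name myhosts is arbitrary):

# myhosts
node36 slots=12
node37 slots=12

mpirun -np 24 --hostfile myhosts ./hello_world

Under Slurm, an Open MPI build with Slurm support normally takes the node and slot list straight from the allocation, so an explicit hostfile is usually only needed when running outside the scheduler.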

Total count of HugePages getting reduced from 6000 to 16 and Free pages to 0

I am testing a DPDK application with 2 MB hugepages, so I changed the kernel command line (/proc/cmdline) of my Red Hat VM to start with 6000 huge pages, as shown below; the VM has 32 GB of memory in total.
grep Huge /proc/meminfo
AnonHugePages: 6144 kB
HugePages_Total: 6000
HugePages_Free: 6000
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
But now when I start my application, it reports that the application is asking for 5094 MB of memory but only 32 MB is available, as shown below:
./build/app -l 4-7 -n 4 --socket-mem 5094,5094 --file-prefix dp -w 0000:13:00.0 -w 0000:1b:00.0
EAL: Detected 8 lcore(s)
EAL: Multi-process socket /var/run/.dp_unix
EAL: Probing VFIO support...
EAL: Not enough memory available on socket 0! Requested: 5094MB, available: 32MB
EAL: FATAL: Cannot init memory
EAL: Cannot init memory
EAL: Error - exiting with code: 1
Cause: Error with EAL initialization
And now when I check huge pages again, only 16 pages are shown, as below. Please let me know why my huge pages are being reduced from the initial 6000 to 16, which is why my application is not able to get memory.
grep Huge /proc/meminfo
AnonHugePages: 6144 kB
HugePages_Total: 16
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
./dpdk-devbind --status
Network devices using DPDK-compatible driver
============================================
0000:13:00.0 'VMXNET3 Ethernet Controller 07b0' drv=igb_uio unused=vmxnet3
0000:1b:00.0 'VMXNET3 Ethernet Controller 07b0' drv=igb_uio unused=vmxnet3
Network devices using kernel driver
===================================
0000:04:00.0 'VMXNET3 Ethernet Controller 07b0' if=ens161 drv=vmxnet3 unused=igb_uio *Active*
0000:0b:00.0 'VMXNET3 Ethernet Controller 07b0' if=ens192 drv=vmxnet3 unused=igb_uio *Active*
0000:0c:00.0 'VMXNET3 Ethernet Controller 07b0' if=ens193 drv=vmxnet3 unused=igb_uio *Active*
I also tried to increase the huge pages at run time, but it doesn't help: the count first increases, but on running the app again it reports that memory is not available.
echo 6000 > /proc/sys/vm/nr_hugepages
echo "vm.nr_hugepages=6000" >> /etc/sysctl.conf
grep Huge /proc/meminfo
AnonHugePages: 6144 kB
HugePages_Total: 6000
HugePages_Free: 5984
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
./build/app -l 4-7 -n 4 --socket-mem 5094,5094 --file-prefix dp -w 0000:13:00.0 -w 0000:1b:00.0
EAL: Detected 8 lcore(s)
EAL: Multi-process socket /var/run/.dp_unix
EAL: Probing VFIO support...
EAL: Not enough memory available on socket 0! Requested: 5094MB, available: 32MB
EAL: FATAL: Cannot init memory
EAL: Cannot init memory
EAL: Error - exiting with code: 1
Cause: Error with EAL initialization
grep Huge /proc/meminfo
AnonHugePages: 6144 kB
HugePages_Total: 16
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
It seems there was some issue with the CentOS 7 VM, as the huge page count was not making any sense, so I recreated the VM, which resolved the issue.
If the requirement of your application is 5094 MB of 2 MB pages, can you re-run your application with --socket-mem 5094,1?
But if your requirement is 5094 * 2, can you create the huge pages during boot by editing grub.conf with 'default_hugepagesz=2M hugepagesz=2M hugepages=10188'?
Note: there is a huge difference between DPDK 17.11 LTS and 18.11 LTS in how huge pages are mapped and used.
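
For reference, a sketch of that boot-time setup on a RHEL/CentOS 7 style VM (the paths and the grub2-mkconfig invocation are assumptions for a BIOS-booted system; adjust for EFI or another distro):

# add to GRUB_CMDLINE_LINUX in /etc/default/grub:
#   default_hugepagesz=2M hugepagesz=2M hugepages=10188
grub2-mkconfig -o /boot/grub2/grub.cfg
reboot
# afterwards, verify the pool before starting the app
grep Huge /proc/meminfo
mount | grep hugetlbfs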

How does nats-bench work? I reach 30.34 TB/sec using nats-bench but I don't understand why

I'm new to NATS. I first benchmarked NATS with nats-bench:
quanlm@quanlm2:~/go/src/github.com/nats-io/nats.go/examples/nats-bench$ go run main.go -np 1 -n 100000000 -ms 1600000 -csv test foo
Starting benchmark [msgs=100000000, msgsize=1600000, pubs=1, subs=0]
Pub stats: 20,848,474 msgs/sec ~ 30.34 TB/sec
Saved metric data in csv file test
My computer setup:
"Intel(R) Xeon(R) CPU E5-2630 v4 # 2.20GHz CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 16
CPU MHz: 2199.996"
Memory: 24GB
SSD
I have no idea how nats-bench works to reach that 30.34 TB/sec.
Is this just a bug, or did I do something wrong?
Btw: is msgsize counted in bits or bytes?
Sorry for the delay. You are publishing 100 million messages, each of 1,600,000 bytes (about 1.5 MB). The bench utility reports that it was able to send 20,848,474 messages per second, and since one message is 1,600,000 bytes, that is 20,848,474 * 1,600,000 = 33,357,558,400,000 bytes per second, which is about 30.34 TB per second (33,357,558,400,000 / 1024^4 ≈ 30.34).
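For a quick sanity check of that arithmetic (a one-liner, assuming the reported figure is 1024-based):

awk 'BEGIN { printf "%.2f TB/sec\n", 20848474 * 1600000 / 1024^4 }'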

Airflow simple tasks failing without logs with small parallelism LocalExecutor (was working with SequentialExecutor)

An airflow (v1.10.5) DAG that ran fine with SequentialExecutor now has many (though not all) simple tasks failing without any log information when running with LocalExecutor and minimal parallelism, e.g.
<airflow.cfg>
# overall task concurrency limit for airflow
parallelism = 8 # which is same as number of cores shown by lscpu
# max tasks per dag
dag_concurrency = 2
# max instances of a given dag that can run on airflow
max_active_runs_per_dag = 1
# max threads used per worker / core
max_threads = 2
# 40G of RAM available total
# CPUs: 8 (sockets 4, cores per socket 4)
see https://www.astronomer.io/guides/airflow-scaling-workers/
Looking at the airflow-webserver.* logs nothing looks out of the ordinary, but looking at airflow-scheduler.out I see...
[airflow#airflowetl airflow]$ tail -n 20 airflow-scheduler.out
....
[2019-12-18 11:29:17,773] {scheduler_job.py:1283} INFO - Executor reports execution of mydag.task_level1_table1 execution_date=2019-12-18 21:21:48.424900+00:00 exited with status failed for try_number 1
[2019-12-18 11:29:17,779] {scheduler_job.py:1283} INFO - Executor reports execution of mydag.task_level1_table2 execution_date=2019-12-18 21:21:48.424900+00:00 exited with status failed for try_number 1
[2019-12-18 11:29:17,782] {scheduler_job.py:1283} INFO - Executor reports execution of mydag.task_level1_table3 execution_date=2019-12-18 21:21:48.424900+00:00 exited with status failed for try_number 1
[2019-12-18 11:29:18,833] {scheduler_job.py:832} WARNING - Set 1 task instances to state=None as their associated DagRun was not in RUNNING state
[2019-12-18 11:29:18,844] {scheduler_job.py:1283} INFO - Executor reports execution of mydag.task_level1_table4 execution_date=2019-12-18 21:21:48.424900+00:00 exited with status success for try_number 1
....
but I'm not really sure what to take away from this.
Anyone know what could be going on here or how to get more helpful debugging info?
Looking again at my lscpu specs, I noticed...
[airflow#airflowetl airflow]$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 2
Notice Thread(s) per core: 1
Looking at my airflow.cfg settings I see max_threads = 2. Setting max_threads = 1 and restarting the scheduler seems to have fixed the problem.
If anyone knows more about what exactly is going wrong under the hood (e.g. why the task fails rather than just waiting for another thread to become available), I would be interested to hear about it.
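
For reference, the change that seems to have fixed it, as it would sit in airflow.cfg (the restart command is an assumption; use whatever manages your scheduler process):

# airflow.cfg, [scheduler] section
max_threads = 1

# then restart the scheduler, e.g.
systemctl restart airflow-scheduler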

Command to find information about CPUs on a UNIX machine

Do you know if there is a UNIX command that will tell me what the CPU configuration for my Sun OS UNIX machine is? I am also trying to determine the memory configuration. Is there a UNIX command that will tell me that?
There is no standard Unix command, AFAIK. I haven't used Sun OS, but on Linux, you can use this:
cat /proc/cpuinfo
Sorry that it is Linux, not Sun OS. There is probably something similar though for Sun OS.
The nproc command shows the number of processing units available:
$ nproc
Sample outputs: 4
lscpu gathers CPU architecture information from /proc/cpuinfo in human-readable format:
$ lscpu
Sample outputs:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 4
CPU socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 15
Stepping: 7
CPU MHz: 1866.669
BogoMIPS: 3732.83
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
NUMA node0 CPU(s): 0-7
Try psrinfo to find the processor type and the number of physical processors installed on the system.
Firstly, it probably depends which version of Solaris you're running, but also what hardware you have.
On SPARC at least, you have psrinfo to show you processor information, which run on its own will show you the number of CPUs the machine sees. psrinfo -p shows you the number of physical processors installed. From that you can deduce the number of threads/cores per physical processors.
prtdiag will display a fair bit of info about the hardware in your machine. It looks like on a V240 you do get memory channel info from prtdiag, but you don't on a T2000. I guess that's an architecture issue between UltraSPARC IIIi and UltraSPARC T1.
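Putting those together, a short sketch of the Solaris-side commands mentioned here (the grep pattern assumes prtconf prints a 'Memory size' line, which it does on typical Solaris releases):

psrinfo                        # one line per virtual CPU
psrinfo -p                     # count of physical processors
psrinfo -pv                    # physical processors with their cores/threads
prtconf | grep 'Memory size'   # total physical memory
prtdiag                        # hardware summary (memory layout on some platforms)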
I think you can use prtdiag or prtconf on many UNIX systems.
My favorite is to look at the boot messages. If it's been recently booted try running /etc/dmesg. Otherwise find the boot messages, logged in /var/adm or some place in /var.
