Some multithreaded code I just wrote appears to run slower under hyperthreaded CPUs - i.e. disabling hyperthreading makes it run FASTER. Is this normal?
This depends entirely on the use case. A subjective term like normal has a lot of leeway! There are use cases where Hyper-Threading (HT) makes sense, and cases where it will hurt performance.
One such case of performance decrease is applications making heavy use of AVX instructions. AVX instructions are carried out in the vector processing unit (VPU), of which there is one per core in Intel Xeon processors. Additional threads will block when trying to access the VPU if it is not available, leading to no performance improvement with the use of HT.
If you have, say, 4 cores with HT, allowing you to run 8 threads, you will only actually be able to run 4 VPU instructions at a time - so your other 4 threads will be blocked until those complete. The additional overhead of the blocking and scheduling will usually net you a lower throughput than if you were running 4 threads on 4 cores, with HT disabled.
Likewise, running just 4 threads on the 8 logical cores, the OS scheduler can schedule the threads to run on any logical core - so two threads may still end up sharing one physical core, with one blocking while it waits for the other to complete. Some newer applications and job schedulers can now coordinate with the OS to "pin" threads to physical cores, allowing HT to stay enabled without oversubscribing the number of threads running on a core. Over time this will probably get better, but it does require awareness on the developer's part.
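To make the pinning idea concrete, here is a minimal sketch in Python that picks one logical CPU per physical core. The function name and the (logical_cpu, physical_id, core_id) tuple layout are my own assumptions for illustration; on a real Linux box you would fill the list from /proc/cpuinfo or lscpu.

```python
def one_cpu_per_core(topology):
    """Pick the lowest-numbered logical CPU of every physical core.

    topology: iterable of (logical_cpu, physical_id, core_id) tuples
    (hypothetical layout, chosen for this example).
    """
    chosen = {}
    for logical_cpu, physical_id, core_id in sorted(topology):
        # setdefault keeps the first (lowest) logical CPU seen for each core.
        chosen.setdefault((physical_id, core_id), logical_cpu)
    return sorted(chosen.values())


# Assumed example: 1 socket, 4 cores, HT on, logical CPUs 4-7 are the
# hyperthread siblings of logical CPUs 0-3.
topology = [(cpu, 0, cpu % 4) for cpu in range(8)]
print(one_cpu_per_core(topology))
```

On Linux, each worker could then be restricted to one of the returned CPUs with os.sched_setaffinity(0, {cpu}); that call is not available on every platform, so treat it as an option rather than a given.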
For more general purpose use cases, like a generic server handling many types of workloads, the advantage of HT running additional threads in a single core is usually a performance gain.
If one had the software to check, would the program be able to complete in a reasonable amount of time, or at all? What software is necessary? Are multiple P.C.'s necessary for checking just one number?
Let's quote an article about the most recently discovered large prime, 2^74,207,281 - 1.
https://www.mersenne.org/primes/?press=M74207281
The primality proof took 31 days of non-stop computing on a PC with an Intel I7-4790 CPU. To prove there were no errors in the prime discovery process, the new prime was independently verified using both different software and hardware. Andreas Hoglund and David Stanfill each verified the prime using the CUDALucas software running on NVidia Titan Black GPUs in 2.3 days. David Stanfill verified it using ClLucas on an AMD Fury X GPU in 3.5 days. Serge Batalov also verified it using Ernst Mayer's MLucas software on two Intel Xeon 18-core Amazon EC2 servers in 3.5 days.
So anywhere from about 2 days to a month, depending on how much hardware you put in a single computer.
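As for what the software does: as far as I know, the programs named in the quote check Mersenne numbers with the Lucas-Lehmer test. Below is a minimal sketch of that test in Python (the function name is mine). It is only practical for small exponents; the real tools rely on FFT-based multiplication to make the repeated squaring feasible for numbers with millions of digits.

```python
def is_mersenne_prime(p: int) -> bool:
    """Lucas-Lehmer test for M = 2**p - 1, where p itself must be prime."""
    if p == 2:
        return True  # M = 3, handled separately because the loop needs odd p
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    # M is prime exactly when the final residue is zero.
    return s == 0


small_prime_exponents = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31]
print([p for p in small_prime_exponents if is_mersenne_prime(p)])
```

For the exponent in the quote, the loop would run about 74 million times, each time squaring a number of about 74 million bits, which is why the quoted run times are measured in days.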
Environment :
machines : 2.1 xeon, 128 GB ram, 32 cpu
os : centos 7.2 15.11
cassandra version : 2.1.15
opscenter version : 5.2.5
3 keyspaces : Opscenter (3 tables), OpsCenter (10 tables), application's keyspace (485 tables)
2 datacenters : 1 for cassandra (5 machines) and another one, DCOPS, to store opscenter data (1 machine).
Right now the agents on the nodes consume on average ~1300% CPU (out of 3200% available). The only traffic is ~1500 writes/s on the application keyspace.
Is there any relation between the number of tables and opscenter? Is this expected behaviour, with the agents eating a lot of CPU because they are trying to write data from too many metrics, or is it some kind of bug?
Note: same behaviour on the previous version of opscenter, 5.2.4. For this reason I first tried to upgrade opscenter to the newest version available.
From opscenter 5.2.5 release notes :
"Fixed an issue with high CPU usage by agents on some cluster topologies. (OPSC-6045)"
Any help/opinion much appreciated.
Thank you.
Observing the specific agent's PID with the awesome tool you provided, Chris, I noticed that heap utilisation was constantly above 90%, which triggered a lot of GC activity with huge GC pauses of almost 1 second. During those pauses I suspect the pooling threads had to wait and blocked my CPU a lot. Anyway, I am not a specialist in this area.
I decided to enlarge the heap for the agent from the default 128 MB to a nicer value of 512 MB, and I saw that all the GC pressure went away and every thread's allocation is now doing nicely.
Overall the CPU utilization dropped from 40-50% down to 1-2% for the opscenter agent. I can live with 1-2%, since I know for sure that this CPU is consumed by the jmx-metrics.
So my advice is to edit the file:
datastax-agent-env.sh
and alter the default 128 value of Xmx
-Xmx512M
save the file, restart the agent, and monitor for a while.
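If you want to script that edit across several nodes, a minimal sketch follows. It assumes the file contains a -Xmx<size> token somewhere in its JVM options; the sample line in the code is made up for illustration and is not copied from a real datastax-agent-env.sh, and the function name is mine.

```python
import re


def set_max_heap(env_text: str, size: str = "512M") -> str:
    """Replace every -Xmx<number><unit> token in env_text with -Xmx<size>."""
    return re.sub(r"-Xmx\d+[kKmMgG]?", f"-Xmx{size}", env_text)


# Hypothetical file content, for illustration only.
sample = 'JVM_OPTS="$JVM_OPTS -Xmx128M"'
print(set_max_heap(sample))
```

You would read the real file, pass its text through the function, write it back, and then restart the agent as described above.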
http://s000.tinyupload.com/?file_id=71546243887469218237
Thank you again Chris.
Hope this will help other folks.
My windows system has 8 cores.
When I use 8 CPUs with my MPI: mpiexec.exe -n 8, all of my 8 available processors are busy in task manager which makes sense.
When I use 2 cores: mpiexec.exe -n 2, I expect only 2 cores should be busy but that's not the case and I have an irregular CPU usage distributed over 8 cores.
Is this observation expected?
Yes, this behaviour is expected. A general-purpose operating system such as (most versions of) Windows moves processes around cores. One reason for this is to ensure that no process is starved of execution time. Don't forget that on most Windows computers there will be all sorts of processes running at the same time as your computational processes. Use the Task Manager to see what is going on and don't be surprised if there are dozens of processes running in addition to the 2 running your MPI program.
So, yes, with two processes running a computationally-intensive program you can expect the core usage to be irregular, but to average out at 2/8 over time.
Of course, for the special case of parallel MPI programs this behaviour may hurt performance. Generally MPI implementations provide some way to 'pin' processes to cores. Consult the documentation for your MPI implementation for how to do this. But don't be surprised if you find that performance actually drops.
Hi I am kind of MPI noob so please bear with me on this one. :)
Say I have an MPI program called foo.c and I run the executable with
mpirun -np 3 ./foo
Now this means the program will be run in parallel using 3 processors (1 process per processor). But since most processors today have more than one core, (take 2 cores per processor say) does this mean the program will be run on 3 cores or 3 processors?
Probably this has to do with my poor understanding of what the difference between a core and a processor really is so if you could also explain a little more that would be helpful.
Thank you.
mpirun will execute a number of "processes" on the machine. The cpu or core where these processes are executed is operating-system dependent.
On a machine with N cpus and M cores on each cpu, you have room for N*M processes running at full speed.
But, typically:
If you have multiple cores, each process will run on a separate core
If you ask for more processes than the available core*cpus, everything will run, but with a lower efficiency (yes, you can run multi-process jobs on a single-cpu single-core machine...)
If you are using a queuing system or a preconfigured MPI system for which a list of remote machines exists, the allocation will be distributed on the remote machines.
(Depending on the mpi implementation, there might be some options to force a specific cpu or core, but you should not need to worry about that).
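The N*M arithmetic above can be written down as a tiny Python check. This is only an illustration of the counting; the function name is mine and nothing here is part of any MPI implementation.

```python
def slots_and_oversubscribed(num_procs: int, cpus: int, cores_per_cpu: int):
    """Return (full-speed slots, whether num_procs exceeds them)."""
    slots = cpus * cores_per_cpu
    return slots, num_procs > slots


# mpirun -np 3 on one 4-core cpu: every process gets its own core.
print(slots_and_oversubscribed(3, cpus=1, cores_per_cpu=4))
# mpirun -np 3 on one 2-core cpu: it still runs, but two processes share a core.
print(slots_and_oversubscribed(3, cpus=1, cores_per_cpu=2))
```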
Distribution of processes to cores and processors is handled by the operating system and the MPI implementation. Running on a desktop, the operating system will generally put each process on a different core, potentially redistributing processes during run-time. In larger systems such as a supercomputer or a cluster, the distribution is handled by resource managers such as SLURM. However this happens, one or multiple processes will be assigned to each core.
Regarding hardware, a core can run only a single process at a time. Technologies such as hyper-threading allow multiple processes to share the resources of a single core. There are cases where two or more processes per core is optimal. For instance, if a process is doing a large amount of file I/O, another may take its place and do computation while the first is waiting on a read or write.
In short, give MPI the number of processes you want to execute. Distribution of these processes is then handled transparently to the user. The number of processes that you use should be determined by the requirements of the application (powers of 2, number of files to be read), the number of cores available, and the optimal number of processes per core for the application.
The OS scheduler will try to optimally allocate separate cores to your parallel application's processes in a multi-core system, or separate processors in a multi-processor system.
The interesting case is a multi-core, multi-cpu system. Again you can let the OS scheduler do it for you, or you can enforce (logical/physical) core affinity for your processes to bind them to a particular core.
The mpirun command uses a hostlist. If you don't specify it, it will probably use "localhost" and run all your processes there. If you run 3 processes and you have a 4 core machine, you probably get good speedup because the OS will generally put them on different cores. If you only have two cores, then one core will get two processes.
The previous is not entirely true, since the OS is allowed to move processes, so you may want to use numactl to bind them to a core.
If you are on a multi-node cluster, then a well-setup mpi will generate a hostfile where each node appears as many times as it has cores. So on a 4 node cluster with 8 cores per node, you can request up to 32 processes and expect close to perfect speedup. (If your code and your algorithm allow that, of course.) Requesting 9 processes on that cluster may put 8 on one node and the 9th on another, which is of course not great for performance. You'd hope that your cluster software comes with an mpirun that spreads the processes out better than that.
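To see the difference between the two placements in the 9-process example, here is a small Python sketch. Both functions are my own illustration of "fill a node first" versus "spread over nodes"; they are not the algorithm of any particular mpirun.

```python
def fill_node_first(num_procs: int, nodes: int, cores_per_node: int):
    """Fill every core of a node before moving on to the next node."""
    counts = [0] * nodes
    for rank in range(num_procs):
        counts[(rank // cores_per_node) % nodes] += 1
    return counts


def spread_over_nodes(num_procs: int, nodes: int):
    """Deal ranks out to the nodes one at a time, round-robin."""
    counts = [0] * nodes
    for rank in range(num_procs):
        counts[rank % nodes] += 1
    return counts


# 9 processes on the 4-node, 8-cores-per-node cluster from the text.
print(fill_node_first(9, nodes=4, cores_per_node=8))
print(spread_over_nodes(9, nodes=4))
```

Each list holds the number of processes per node, so the first placement leaves two nodes idle while the second keeps the load nearly even.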
From the performance point of view of an MPI job, there are some explicit rules:
1) if your code is pure MPI code (BLAS is not tuned with OpenMP), turn off hyperthreading and set the number of tasks per node to the number of cores per node
2) if your code is MPI+OpenMP, you can set PPN (processes per node) to the number of cores per node and OMP_NUM_THREADS to 2 (if there are two hardware threads per core)
3) if your code is MPI+OpenMP and your cluster is huge, then you can set PPN (processes per node) to 1 and OMP_NUM_THREADS to the number of logical CPUs to save communication overhead
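The three rules can be summarized as a small Python helper. The function and parameter names are mine, and returning OMP_NUM_THREADS = 1 for the pure MPI case is my assumption, since rule 1 does not mention that variable.

```python
def job_layout(cores_per_node: int, threads_per_core: int,
               hybrid: bool, huge_cluster: bool = False):
    """Return (processes per node, OMP_NUM_THREADS) following the rules above."""
    if not hybrid:
        # Rule 1: pure MPI, hyperthreading off, one process per core.
        # OMP_NUM_THREADS = 1 is an assumption, not stated in the rule.
        return cores_per_node, 1
    if huge_cluster:
        # Rule 3: one process per node, one OpenMP thread per logical CPU.
        return 1, cores_per_node * threads_per_core
    # Rule 2: one process per core, one OpenMP thread per hardware thread.
    return cores_per_node, threads_per_core


# Example node: 16 cores with 2 hardware threads each.
print(job_layout(16, 2, hybrid=False))
print(job_layout(16, 2, hybrid=True))
print(job_layout(16, 2, hybrid=True, huge_cluster=True))
```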
In order to provide a useful framework I would consider this hierarchy:
a motherboard can hold one or more chips/dice;
a chip/die can contain one or more cores (independent CPUs);
a CPU can work out one or more threads concurrently (the multithreading I know of consists of two threads)
In the early days, you most often had one motherboard with one chip with one CPU running one thread. Only one process at a time could be dealt with, and the attending hardware set was referred to as the processor. There was a one-to-one mapping between pieces of software (the task to run) and pieces of hardware (the device to run the task).
Process is definitely a software notion. 'Thread' is, put quite simply, a specification of 'process' in the context of parallel concurrent computing. Nowadays 'processor' can refer to a physical device as well as to its extended processing capabilities (multithreading again, which to be sure is a technological implementation). For example, you can have machines with two chips on the motherboard, with four cores/CPUs per chip, and with each core/CPU running two threads concurrently. Then you would be able to run 2x4x2=16 processes (without oversubscription of resources, of course).
The MPI syntax you quote addresses processes (option np), or threads if you like. The description part of man mpirun even refers to processes as 'slots' (for example, see the specs for the hostfile).
Slots indicate how many processes can potentially execute on a node.
This usage sounds like a legacy of that close correspondence between units of hardware and units of software that was standard back then. 'Slot' is originally a material/hardware feature, not unlike the term 'socket' that has undergone a similar change of semantics at times.
So indeed I feel quite some sympathy for your confusion. If you are a Linux user, you can visualize the report of cat /proc/cpuinfo. These lines refer to one processor named '2' out of four:
processor : 2
...
physical id : 0
siblings : 4
core id : 2
cpu cores : 4
They say that this one machine has only one chip (since 'physical id' takes only one value in the whole list, omitted here), that this one chip has 4 'cpu cores', and that this one chip is running four siblings (4 threads, so there is no multithreading). In this case there are 4 processing elements and 4 cpu cores.
In the example above with multithreading, you would see a listing for 16 processors, 2 values for 'physical id' (chips), 'cpu cores' equal to 4 (per chip) and 'siblings' equal to 8 (per chip), since multithreading is enabled on that chip. In this case you have twice as many processors (16) as physical cores (8).
Therefore, in this extended context, 'processor' indicates the machine's capability to work on a 'process', and this is what MPI and you want to use, regardless of the number and features of the cores that enable this. You only need to gain an overview of where these processing capabilities come from.
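If you would rather count than read the listing by eye, here is a minimal Python sketch that summarizes text in the /proc/cpuinfo format shown above. The function name is mine, and the sample text is a made-up 4-processor listing in the same shape as the excerpt. It assumes the 'physical id' and 'core id' fields are present, which is not the case on every architecture.

```python
def summarize_cpuinfo(text: str) -> dict:
    """Count logical processors, chips (sockets) and physical cores."""
    logical = 0
    sockets = set()
    cores = set()
    physical_id = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processor entries
        key, value = key.strip(), value.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            physical_id = value
            sockets.add(value)
        elif key == "core id":
            # A core is identified by its chip plus its id on that chip.
            cores.add((physical_id, value))
    return {"logical": logical, "sockets": len(sockets), "cores": len(cores)}


# Made-up listing: one chip, four cores, no multithreading.
SAMPLE = "\n\n".join(
    f"processor : {i}\nphysical id : 0\nsiblings : 4\ncore id : {i}\ncpu cores : 4"
    for i in range(4)
)
print(summarize_cpuinfo(SAMPLE))
```

On a Linux machine you could pass it open("/proc/cpuinfo").read() instead of the sample. With multithreading enabled, 'logical' comes out larger than 'cores'.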
Another useful Linux command is then lscpu:
...
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
...
There, 'socket' is the physical connection on the motherboard that the chip is plugged into, so it is effectively another name for the chip. Again, no multithreading here.
I am indebted to the discussions in this other post https://unix.stackexchange.com/q/146051/132913