Finest grain at which CPU usage can be logged / recorded

There is something fractal about CPU loads. Depending on the time resolution at which you look at the output of typical "procstat"- or "top"-like programs, the curve will look more or less detailed.
Apart from load averages, which report the average CPU usage over 1, 5, and 15 minutes, many tools report CPU usage with one-second accuracy.
When the sampling rate becomes higher than a program's malloc/free frequency, spikes start to become visible on a graph where the CPU usage previously looked smooth.
The question is: what is the highest sampling rate available to display CPU usage?
Like in quantum systems, measuring probably changes the state of the entity being measured (if the sampling tool runs on the CPU that is sampled). Are there any systems where CPU usage is monitored, very accurately, from another, external CPU, in a way that the measurements have no effect on the load of the monitored CPU?
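For what it's worth, on Linux the aggregate counters exposed by /proc/stat are expressed in USER_HZ ticks (typically 1/100 s), so that particular interface effectively caps the useful sampling resolution; finer granularity needs other kernel interfaces. Below is a minimal sketch of a high-rate sampler, assuming Linux and /proc/stat; note that the sampler itself adds load to the CPU it measures, which is exactly the observer effect raised above.

    /* Sketch: sample aggregate CPU usage from /proc/stat at ~100 Hz (Linux only). */
    #include <stdio.h>
    #include <unistd.h>

    static void read_cpu(unsigned long long *busy, unsigned long long *total)
    {
        unsigned long long v[7] = {0};   /* user nice system idle iowait irq softirq */
        FILE *f = fopen("/proc/stat", "r");
        if (!f) return;
        fscanf(f, "cpu %llu %llu %llu %llu %llu %llu %llu",
               &v[0], &v[1], &v[2], &v[3], &v[4], &v[5], &v[6]);
        fclose(f);
        *busy  = v[0] + v[1] + v[2] + v[5] + v[6];
        *total = *busy + v[3] + v[4];
    }

    int main(void)
    {
        unsigned long long b0 = 0, t0 = 0, b1 = 0, t1 = 0;
        read_cpu(&b0, &t0);
        for (;;) {
            usleep(10000);               /* sample every 10 ms */
            read_cpu(&b1, &t1);
            if (t1 > t0)
                printf("%.1f%%\n", 100.0 * (double)(b1 - b0) / (double)(t1 - t0));
            b0 = b1;
            t0 = t1;
        }
    }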

Related

Monitoring CPU Utilization using Prometheus

I am trying to monitor the CPU utilization of the machine on which Prometheus is installed and running. I have the metric process_cpu_seconds_total, and I can compute irate or rate of this metric, but I am not sure how to turn that into a percentage value for CPU utilization. Is there any way I can use this process_cpu_seconds_total metric to find the CPU utilization of the machine where Prometheus runs?
A late answer for others' benefit too:
If you just want to monitor the percentage of CPU that the prometheus process itself uses, you can use process_cpu_seconds_total, e.g. something like:
avg by (instance) (irate(process_cpu_seconds_total{job="prometheus"}[1m]))
However, if you want a general monitor of the machine's CPU, as I suspect you might, you should set up the Node exporter and then use a similar query to the above with the metric node_cpu_seconds_total, e.g.:
avg by (instance,mode) (irate(node_cpu_seconds_total{mode!='idle'}[1m]))
The rate or irate value is already a fraction (out of 1), since it measures how many CPU seconds are used per second, but it usually needs to be aggregated across the cores/CPUs of the machine to obtain a percentage.
Brian Brazil's post on Prometheus CPU monitoring is very relevant and useful: https://www.robustperception.io/understanding-machine-cpu-usage
One way to do this is to leverage proper cgroup resource reporting. A cgroup divides a CPU core's time into 1024 shares, so by knowing how many shares the process consumes, you can always derive the percentage of CPU utilization.
In your case, you have the rate of change of CPU seconds, which is how much CPU time the process used in the last time unit (assume 1 s from now on). Each core provides 1 CPU second per time unit, so it depends on how many cores you have. If your rate of change is 3 and you have 4 cores:
3/4 = 75% CPU utilization.
This is only a rough estimation, as your process_cpu_seconds_total value is probably not very accurate due to delay, latency, etc.
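As a trivial sketch of the arithmetic above (the function name is mine, not part of any library):

    /* Hypothetical helper: turn a rate of CPU-seconds-per-second into a
       utilization percentage for a machine with num_cores cores. */
    double cpu_utilization_percent(double cpu_seconds_rate, int num_cores)
    {
        return 100.0 * cpu_seconds_rate / (double)num_cores;
    }
    /* cpu_utilization_percent(3.0, 4) == 75.0, matching the example above */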

CPU memory access time

Does the average data and instruction access time of the CPU depend on the execution time of an instruction?
For example, if the miss ratio is 0.1, 50% of instructions need memory access, the L1 access time is 3 clock cycles, the miss penalty is 20 cycles, and instructions execute in 1 cycle, what is the average memory access time?
I'll assume you're talking about a CISC architecture where compute instructions can have memory references. If you have a sequence of ADDs that access memory, then memory requests will come more often than in a sequence of the same number of DIVs, because the DIVs take longer. This won't affect the time of an individual memory access; only locality of reference will affect the average memory access time.
If you're talking about a RISC arch, then we have separate memory access instructions. If memory instructions have a miss rate of 10%, then the average access latency will be the L1 access time (3 cycles for hit or miss) plus the L1 miss penalty times the miss rate (0.1 * 20), totaling an average access time of 5 cycles.
If half of your instructions are memory instructions, then that would factor into clocks per instruction (CPI), which would depend on miss rate and also dependency stalls. CPI will also be affected by the extent to which memory access time can overlap computation, which would be the case in an out-of-order processor.
I can't answer your question much better because you're not being very specific. To do well in a computer architecture class, you will have to learn how to compute average access times and CPI.
Well, I'll go ahead and answer your question, but then, please read my comments below to put things into a modern perspective:
Time = Cycles * (1/Clock_Speed) [ unit check: seconds = clocks * seconds/clocks ]
So, to get the exact time you'll need to know the clock speed of your machine; for now, my answer will be in terms of cycles.
Avg_mem_access_time_in_cycles = cache_hit_time + miss_rate*miss_penalty
= 3 + 0.1*20
= 5 cycles
Remember, here I'm assuming your miss rate of 0.1 means 10% of cache accesses miss the cache. If you mean 10% of instructions, then the per-access miss rate is actually 0.1 / 0.5 = 0.2, because only 50% of instructions are memory ops.
Now, if you want the average CPI (cycles per instr)
CPI = mem_instr% * Avg_mem_access_time + non_mem_instr% * instr_exec_time
= 0.5*5 + 0.5*1 = 3 cycles per instruction
Finally, if you want the average instr execution time, you need to multiply 3 by the reciprocal of the frequency (clock speed) of your machine.
Comments:
Computer architecture classes basically teach a very simplified model of what the hardware is doing. Current architectures are much more complex, and such a model (i.e. the equations above) is quite unrealistic. For one thing, the access time to the various levels of cache can be variable (depending on where the responding cache physically sits on a multi- or many-core CPU); the access time to memory (typically hundreds of cycles) is also variable, depending on contention for resources (e.g. bandwidth), etc. Finally, in modern CPUs instructions typically execute in parallel (ILP), depending on the width of the processor pipeline. This means that adding up instruction execution latencies is basically wrong (unless your processor is a single-issue processor that only executes one instruction at a time and blocks other instructions on miss events such as cache misses and branch mispredicts). However, for educational purposes and for "average" results, the equations are okay.
One more thing: if you have a multi-level cache hierarchy, then the miss_penalty of the level-1 cache will be as follows:
L1$ miss penalty = L2 access time + L2_miss_rate*L2_miss_penalty
If you have an L3 cache, you compute L2_miss_penalty the same way from L3, and so on.
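To make the formulas concrete, here is a short sketch that plugs in the numbers from the question; the L2 figures are invented placeholders only to show how the multi-level recursion works:

    /* Sketch of the textbook formulas above (question's values; L2 numbers are
       made-up examples to illustrate the multi-level miss penalty). */
    #include <stdio.h>

    int main(void)
    {
        double l1_hit_time  = 3.0;   /* cycles */
        double l1_miss_rate = 0.1;   /* per cache access */
        double l1_miss_pen  = 20.0;  /* cycles */

        double amat = l1_hit_time + l1_miss_rate * l1_miss_pen;   /* = 5 cycles */
        double cpi  = 0.5 * amat + 0.5 * 1.0;                     /* = 3 cycles/instr */

        /* Two-level hierarchy: derive the L1 miss penalty from assumed L2 numbers. */
        double l2_access = 10.0, l2_miss_rate = 0.2, l2_miss_pen = 100.0;
        double l1_miss_pen_2lvl = l2_access + l2_miss_rate * l2_miss_pen;
        double amat_2lvl = l1_hit_time + l1_miss_rate * l1_miss_pen_2lvl;

        printf("AMAT=%.1f cycles, CPI=%.1f, AMAT(2-level)=%.1f cycles\n",
               amat, cpi, amat_2lvl);
        return 0;
    }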

Local memory and registers scale linearly with work group size - how to choose optimal size?

My kernel's local memory and register usage scale linearly with the work group size. Besides trial and error, are there guidelines for choosing the optimal work group size? I am targeting AMD hardware, where the maximum work group size is 256; should I try to maximize the number of work items per group, or does this risk reducing occupancy and causing register spilling?
You should do both: try to maximize occupancy while avoiding register spilling at all costs, i.e. get the most out of the resources available on your platform.
If you are using nvcc, you can get the number of registers a single thread needs to execute your kernel (e.g. from the ptxas verbose output). Then, using this information together with the local memory needed (that's your input), you can use the CUDA occupancy calculator to see the impact on occupancy. That does not replace good old trial and error, though.
EDIT: You are using AMD, though. I don't know how to map NVIDIA compute capability to AMD devices.
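On AMD (or any OpenCL implementation) you can at least query the per-kernel limits that the runtime computes after compilation and use them as a starting point. A minimal sketch, assuming you already hold a built cl_kernel and its cl_device_id:

    /* Sketch: query per-kernel resource limits to guide the work-group size choice. */
    #include <stdio.h>
    #include <CL/cl.h>

    void print_kernel_limits(cl_kernel kernel, cl_device_id device)
    {
        size_t max_wg = 0, pref_mult = 0;
        cl_ulong local_mem = 0, private_mem = 0;

        clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                                 sizeof(max_wg), &max_wg, NULL);
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(pref_mult), &pref_mult, NULL);
        clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_LOCAL_MEM_SIZE,
                                 sizeof(local_mem), &local_mem, NULL);
        clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_PRIVATE_MEM_SIZE,
                                 sizeof(private_mem), &private_mem, NULL);

        printf("max work-group size for this kernel: %zu\n", max_wg);
        printf("preferred work-group size multiple:  %zu\n", pref_mult);
        printf("local mem per work-group: %lu bytes, private mem per work-item: %lu bytes\n",
               (unsigned long)local_mem, (unsigned long)private_mem);
    }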

How to improve the speed of MPI_Scatter/MPI_Gather?

I found that the time used by MPI_Scatter/MPI_Gather increases (roughly linearly) as the number of workers increases, especially when the workers span different nodes.
I thought that MPI_Scatter/MPI_Gather were parallel operations, and I wonder what leads to the increase above. Is there any trick to make them faster, especially for workers distributed across compute nodes?
The root rank has to push a fixed amount of data to the other ranks. As long as all ranks reside on the same compute node, the process is limited by the memory bandwidth available. Once more nodes become involved, the network bandwidth, usually much lower than the memory bandwidth, becomes the limiting factor.
Also, the time to send a message is roughly divided into two parts: the initial latency (network setup and MPI protocol handshake) and the time it takes to physically transfer the actual data bits. As the amount of data is fixed, the total physical transfer time remains the same (as long as the transport type, and therefore the bandwidth, stays the same), but more setup/latency overhead is added with each new rank that data is scattered to or gathered from, hence the linear increase in the time it takes to complete the operation.
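In rough terms, a simplified linear model (my own summary of the point above, not a quote):
T_scatter(P) ≈ (P - 1) * per_message_latency + total_bytes / bandwidth
With a fixed payload the bandwidth term stays roughly constant, while the latency term grows with the number of ranks P.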
How MPI_Scatter/Gather works varies between implementations. Some MPI implementations may choose to use a series of MPI_Send calls as the underlying mechanism.
The parameters that may affect how MPI_Scatter works are:
1. Number of processes
2. Size of data
3. Interconnect
For example, an implementation may avoid using a broadcast-style algorithm for a very small number of ranks sending/receiving very large data.
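To see the effect directly, here is a small timing sketch (my own; buffer sizes are arbitrary) that keeps the total payload fixed while you vary the number of ranks with mpirun -np, comparing single-node and multi-node placements:

    /* Sketch: time MPI_Scatter for a fixed total payload across a varying rank count. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int total = 1 << 24;                 /* ~16 Mi doubles in total */
        int chunk = total / size;                  /* each rank receives total/size */
        double *sendbuf = NULL;
        double *recvbuf = malloc(chunk * sizeof(double));
        if (rank == 0)
            sendbuf = calloc((size_t)chunk * size, sizeof(double));

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        MPI_Scatter(sendbuf, chunk, MPI_DOUBLE,
                    recvbuf, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("%d ranks: MPI_Scatter took %.6f s\n", size, t1 - t0);

        free(recvbuf);
        free(sendbuf);
        MPI_Finalize();
        return 0;
    }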

OpenCL statement, true or false?

I was reading some results, and there I saw that they used 5120 work-groups and a local size of 1. I have limited knowledge of OpenCL and I was wondering if this statement is correct:
As can be seen for the GPU, the first test has 5120 work-groups, with 1 work-item each. This means that the threads which are executed in parallel are limited to the amount of computing units there are in the machine. For example if a GPU has 20 computing units there can only be a maximum of 20 threads which are working in parallel. Though when the local size is increased to 2, twice the amount of threads are run simultaneously.
From reading some info on OpenCL, it seems about right, though I need a second opinion.
Update: nat chouf's comment is right; I understood the question as "in flight at the same time" instead of "physically executed at the same time".
As I wrote, several work-groups can be scheduled at a given time in a single compute unit. The number of such "in-flight" work-groups is limited by the available resources (local memory, registers, etc.) on each compute unit.
In existing implementations (afaik) a compute unit will pick a block (warp/wavefront) of work-items from the same work-group for execution, among all blocks in flight in the compute unit. One "instruction" of this block is inserted in the pipeline (it may take several cycles, and each "instruction" may correspond to several operations in each work-item), and then another block is picked.
So, yes, if work-group size is 1, only 1 work-item per compute unit will be physically started simultaneously. But potentially all work-items may be in-flight in the GPU at the same time.
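For reference, the number of compute units the statement refers to (and the device-wide work-group size cap) can be queried at run time; a small sketch assuming an existing cl_device_id:

    /* Sketch: query the device figures discussed above (device handle assumed). */
    #include <stdio.h>
    #include <CL/cl.h>

    void print_device_parallelism(cl_device_id device)
    {
        cl_uint compute_units = 0;
        size_t max_wg_size = 0;

        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(compute_units), &compute_units, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                        sizeof(max_wg_size), &max_wg_size, NULL);

        printf("compute units: %u, max work-group size: %zu\n",
               compute_units, max_wg_size);
    }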

Resources