Understand what is using up "nice" CPU - cpu-usage

I am running a small Cassandra cluster on Google Compute Engine. From our CPU graphs (as reported by collectd), I notice that a nontrivial amount of processor time is spent in NICE. How can I find out which process is consuming this? I've tried just starting top and staring at it for a while, but the NICE CPU usage is a bit spiky (most of the time NICE is at 0%; only occasionally does it spike up to 30-40%), so "sit and wait" isn't very effective.

"Nice" generally refers to to the priority of a process. (More positive values are lower priority, more negative values are higher priority.) You can run ps -eo nice,pid,args | grep '^\s*[1-9]' to get a list of positive nice (low priority) commands.
On a CPU graph, NICE time is time spent running processes with a positive nice value (i.e., low priority). This means the process is consuming CPU, but will give up that CPU time to most other processes. Any USER CPU time for one of the processes listed by the above ps command will show up as NICE.
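Because the spikes are short-lived, it can help to sample periodically rather than watching top live. As a rough sketch (GNU ps keywords; the two-second interval and the 0.5% CPU threshold are arbitrary choices):

while true; do
    date
    # list processes with a positive nice value that are currently using some CPU
    ps -eo nice,pid,pcpu,comm --sort=-pcpu | awk '$1 > 0 && $3 > 0.5'
    sleep 2
done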

Related

Effective total time for a callee function is higher than that of caller function in intel-vtune

I have a multi-threaded application, and when I run VTune Profiler on it, under the Caller/Callee tab I see that the callee function's CPU Time: Total - Effective Time is larger than the caller function's CPU Time: Total - Effective Time.
e.g.
caller function: A
callee function: B (nothing calls B except A)
Function    CPU Time: Total - Effective Time
A           54%
B           57%
My understanding is that CPU Time: Total is the sum of CPU Time: Self plus the time of all of that function's callees. By that definition, shouldn't CPU Time: Total of A be greater than that of B?
What am I missing here?
It may be that function B is also being called by some function other than A, which would explain the discrepancy.
Intel VTune Profiler works by sampling, and the numbers are less accurate for short run times. If your application runs for a very short duration, consider using the "Allow multiple runs" option in VTune or increasing the run time.
Intel VTune Profiler also sometimes rounds the numbers, so the results may not be exact, but that difference is typically tiny (around 0.1%). Your question shows a 3% difference, so rounding is not the cause here.

Display used CPU hours with slurm

I have a user account on a super computer where jobs are handled with slurm.
I would like to know the total amount of CPU hours that I have consumed on this super computer. I think that's an understandable question, because there is only a limited number of CPU hours available per project. I'm surprised that an answer is not easy to find.
I know that there are all these commands like sacct, sreport, sshare, etc... but it seems that there is no simple command that displays the used CPU hours.
Can someone help me out?
As others have commented, sacct should give you that information. You will need to look at the man page to see how to query past jobs. You can specify --starttime and --endtime to restrict your query to match your allocation as it ends/renews. The -l option gives you more information than you need, so you can get a smaller set of fields by specifying what you want with --format.
In your instance, the correct answer is to ask the administrators. You have been given an allocation of time to draw from. They likely have a system that will show you your balance and you can reconcile your balance against the output of sacct. Also, if the system you are using has different node types such as high memory, GPU, MIC, or old, they will likely charge you differently for those resources.
You can get an overview of the used CPU hours with the following:
sacct -SYYYY-MM-DD -u username -ojobid,nodelist,state,start,end,alloccpus,cputime | column -t
You could calculate the total accounting units (SBUs, in our system) by multiplying CPUTime by AllocCPUS, which means multiplying the total (system + user) CPU time by the number of CPUs used.
An example:
JobID NodeList State Start End AllocCPUS CPUTime
------------ --------------- ---------- ------------------- ------------------- ---------- ----------
6328552 tcn[595-604] CANCELLED+ 2019-05-21T14:07:57 2019-05-23T16:48:15 240 506-17:12:00
6328552.bat+ tcn595 CANCELLED 2019-05-21T14:07:57 2019-05-23T16:48:16 24 50-16:07:36
6328552.0 tcn[595-604] FAILED 2019-05-21T14:10:37 2019-05-23T16:48:18 240 506-06:44:00
6332520 tcn[384,386,45+ COMPLETED 2019-05-23T16:06:04 2019-05-24T00:26:36 72 25-00:38:24
6332520.bat+ tcn384 COMPLETED 2019-05-23T16:06:04 2019-05-24T00:26:36 24 8-08:12:48
6332520.0 tcn[384,386,45+ COMPLETED 2019-05-23T16:06:09 2019-05-24T00:26:33 60 20-20:24:00
6332530 tcn[37,41,44,4+ FAILED 2019-05-23T17:11:31 2019-05-25T09:13:34 240 400-08:12:00
6332530.bat+ tcn37 FAILED 2019-05-23T17:11:31 2019-05-25T09:13:34 24 40-00:49:12
6332530.0 tcn[37,41,44,4+ CANCELLED+ 2019-05-23T17:11:35 2019-05-25T09:13:34 240 400-07:56:00
The fields are described in the man page. They can be selected with -ooption (lower case) or in proper POSIX notation, --format='Option,AnotherOption,...' (the full list is in the man page).
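If all you want is a single total, one rough approach is to sum CPUTimeRAW (which sacct reports in core-seconds) over the relevant period; the start date and username here are placeholders:

# total CPU hours for whole allocations (-X skips the individual job steps)
sacct -X -S2019-01-01 -u username -ocputimeraw -n -P | awk '{ s += $1 } END { printf "%.1f CPU hours\n", s / 3600 }'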
So far so good. But there is a big caveat here:
What you see here is perfect for getting an idea of what you have run, or what to expect, in terms of CPU hours. But this will not necessarily reflect your real budget status: in many cases each node / partition has an extra parameter, a weight, set for accounting purposes and not part of Slurm itself. For instance, GPU nodes may have a weight of 3x, meaning each GPU hour is charged as 3 SBUs instead of 1 for budgetary purposes. In short, you can use sacct to gain insight into CPU time, but it will not necessarily tell you how many SBU credits you have left.

OpenCL: Confused by CL_DEVICE_MAX_COMPUTE_UNITS

I'm confused by CL_DEVICE_MAX_COMPUTE_UNITS. For instance, for my Intel GPU on a Mac, this number is 48. Does this mean the maximum number of parallel tasks that can run at the same time is 48, or some multiple of 48, maybe 96, 144...? (I know each compute unit is composed of one or more processing elements, and each processing element is in charge of a "thread". What if each of the 48 compute units is composed of more than one processing element?) In other words, for my Mac, is the "ideal" speedup, although impossible in reality, 48 times faster than a CPU core (assuming the single-"core" computation speed of the CPU and GPU is the same), or some multiple of 48, maybe 96, 144...?
Summary: Your speedup calculation is a little complicated, but your machine's (Intel GPU, probably GEN8 or GEN9) fp32 throughput is 768 FLOPs per (GPU) clock, and 1536 for fp16. Let's assume fp32, so something less than 768x (maybe a third of that, depending on CPU speed). See below for the reasoning and some very important caveats.
A Quick Aside on CL_DEVICE_MAX_COMPUTE_UNITS:
Intel does something wonky with CL_DEVICE_MAX_COMPUTE_UNITS in its GPU driver.
The clGetDeviceInfo documentation (OpenCL 2.0) says of CL_DEVICE_MAX_COMPUTE_UNITS:
The number of parallel compute units on the OpenCL device. A
work-group executes on a single compute unit. The minimum value is 1.
However, the Intel graphics driver does not actually follow this definition and instead returns the number of EUs (Execution Units). An EU is a grouping of SIMD ALUs plus slots for 7 different SIMD threads (registers and so on). Each SIMD thread represents 8, 16, or 32 work-items, depending on what the compiler picks (we want higher, but register pressure can force us lower).
A workgroup is actually limited to a "Slice" (see the figure in section 5.5, "Slice Architecture"), which happens to be 24 EUs in recent hardware; pick the GEN8 or GEN9 documents. Each slice has its own SLM, barriers, and L3. Given that your Apple machine is reporting 48 EUs, I'd say you have two slices.
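If you want to see what the driver actually reports without writing any host code, the clinfo utility (assuming it is installed) dumps these device properties:

# prints the value returned for CL_DEVICE_MAX_COMPUTE_UNITS, among other properties
clinfo | grep -i -E 'device name|max compute units'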
Maximum Speedup:
Let's ignore this major annoyance and work with the EU number (and the figures from those architecture docs above). For "speedup" I'm comparing against a single-threaded FP32 calculation on the CPU. With good parallelization etc. on the CPU, the speedup would be less, of course.
Each of the 48 EUs can issue two SIMD4 operations per clock in ideal circumstances. Assuming those are fused multiply-adds (so really two ops each), that gives us:
48 EUs * 2 SIMD4 ops per EU * 2 (if the op is a fused multiply add)
= 192 SIMD4 ops per clock
= 768 FLOPs per clock for single precision floating point
So your ideal speedup is actually ~768. But there are a bunch of things that chip away at this ideal number.
Setup and teardown time. Let's ignore this (assume the workload time dominates the runtime).
The GPU clock maxes out around a gigahertz while the CPU runs faster. Factor that ratio in (crudely, maybe 1/3: 3 GHz on the CPU vs. 1 GHz on the GPU).
If the computation is not heavily multiply-add ("mad") based, divide by 2, since I doubled above. Many important workloads are mad-dominated, though.
The execution must be mostly non-divergent. If a SIMD thread branches into an if-then-else, the entire SIMD thread (8, 16, or 32 work-items) has to execute that code.
Register bank collision delays can reduce EU ALU throughput. Typically the compiler does a great job of avoiding this, but it can theoretically chew into your performance a bit (usually a few percent, depending on register pressure).
Buffer address calculation can chew off a few percent too (the EU must spend time doing integer arithmetic to compute read and write addresses).
If you use too much SLM or too many barriers, the GPU must leave some of the EUs idle so that there's enough SLM for each work-group on the machine. (You can tweak your algorithm to fix this.)
We must keep the workload compute-bound. If we blow out any cache in the data access hierarchy, we run into scenarios where no thread is ready to run on an EU and it stalls. Assume we avoid this.
?. I'm probably forgetting other things that can go wrong.
We call the percentage of the theoretical peak the efficiency. So if our workload runs at ~530 FLOPs per clock, we are about 69% efficient relative to the theoretical 768. I've seen very carefully tuned workloads exceed 90% efficiency, but it definitely can take some work.
The ideal speedup you can get is the total number of processing elements, which in your case corresponds to 48 * (number of processing elements per compute unit). I do not know of a way to get the number of processing elements from OpenCL (that does not mean it is not possible), but you can simply look it up for your GPU.
To my knowledge, a compute unit consists of one or more processing elements (for GPUs, usually a lot), a register file, and some local memory. The threads of a compute unit are executed in SIMD (single instruction, multiple data) fashion, which means they all execute the same operation but on different data.
Also, the speedup you get depends on how you launch the kernel. Since a single work-group cannot run on multiple compute units, you need a sufficient number of work-groups to fully utilize all of the compute units. In addition, the work-group size should be a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.

Generating CPU utilization levels

First, I should mention that I recently asked this question already; however, it was considered unclear (see Linux: CPU benchmark requiring longer time and different CPU utilization levels). This is a new attempt to formulate the question using a different approach.
What I need: In my research, I look at the CPU utilization of a computer and analyze the CPU utilization pattern within a period of time. For example, a CPU utilization pattern within time period 0 to 10 has the following form:
time, % CPU used
0 , 21.1
1 , 17
2 , 18
3 , 41
4 , 42
5 , 60
6 , 62
7 , 62
8 , 61
9 , 50
10 , 49
I am interested in finding a simple representation for a given CPU utilization pattern. For the evaluation part, I need to create some CPU utilization patterns on my laptop, which I will then record and analyse. These CPU utilization patterns should:
be over a time period of more than 5 minutes, ideally about 20 minutes;
have "some kind of dynamic behavior", i.e., the % CPU used should not be (almost) constant but should vary over time.
My Question: How can I create such a utilization pattern? Of course, I could just run an arbitrary program on my laptop and obtain a desired CPU pattern. However, this solution is not ideal, since a reader of my work would have no means to repeat the experiment, not having access to the program I used. It would therefore be much more beneficial to use something standard instead of an arbitrary program on my laptop (in my previous post I was thinking about open-source CPU benchmarks, for example). Can anyone recommend something?
Many thanks!
I suggest a moving average. Select a window size and use it to average over. You'll need to decide what type of patterns you want to identify since the wider the window, the more smoothing you get and the fewer "features" you'll see. And CPU activity is very bursty. For example, if you are trying to identify cache bottlenecks, you'll want a small window, probably in the 10ms to 100ms range. If instead you want to correlate to longer term features, such as energy or load, you'll want a larger window, perhaps 10sec to minutes.
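As a sketch of what that looks like on data shaped like the sample above (the 5-sample window and the file name usage.csv are assumptions):

# moving average of the "% CPU used" column over the last 5 samples
awk -F',' 'NR > 1 { buf[NR % 5] = $2; if (n < 5) n++;
                    s = 0; for (i in buf) s += buf[i];
                    printf "%s,%7.1f\n", $1, s / n }' usage.csv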
It looks like you are using OS provided CPU usage and not hardware registers. This means that the OS is already doing some smoothing. It may also be doing estimation for some performance values. Try to find documentation on this if you are integrating over a smaller window. A word of warning: this level of information can be hard to find. You may have to do a lot of digging. Depending upon your familiarity with kernel code, it may be easier to look at the code.
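If you want full control over the integration window, on Linux you can sample the OS counters in /proc/stat yourself. A minimal sketch (the one-second interval is an assumption; the first value printed averages over everything since boot):

prev_total=0; prev_idle=0
while true; do
    # first line of /proc/stat: cpu user nice system idle iowait irq softirq steal ...
    read -r _ user nice system idle iowait irq softirq steal _ < /proc/stat
    total=$((user + nice + system + idle + iowait + irq + softirq + steal))
    idle_all=$((idle + iowait))
    dt=$((total - prev_total)); di=$((idle_all - prev_idle))
    awk -v dt="$dt" -v di="$di" 'BEGIN { printf "%.1f\n", 100 * (dt - di) / dt }'
    prev_total=$total; prev_idle=$idle_all
    sleep 1
done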

Give CPU more power to plot in Octave

I made this function in Octave which plots fractals. It takes a long time to plot all the points I've calculated. I've made my function as efficient as possible; the only way I think I can make it plot faster is by having my CPU focus entirely on the function, or somehow telling it that it should prioritize my plot.
Is there a way I can do this or is this really the limit?
To determine how much CPU is being consumed by your plot, run your plot and, in a separate window (assuming you're on Linux/Unix), run the top command. (For Windows, launch Task Manager, switch to the 'Processes' tab, and click on the CPU header to sort by CPU.)
(The rollover description for the Octave tag on your question says that Octave is a scripting language. I would expect it to be calling gnuplot to create the plots; look for gnuplot among the highest CPU consumers.)
You should see your Octave/gnuplot command near the top of the list, and in top there is a column labeled %CPU (or similar). This shows you how much CPU that process is consuming.
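To capture a snapshot you can paste into your question, top can also run non-interactively; the octave/gnuplot filter here is an assumption, so adjust it to whatever process names top actually shows:

# one batch-mode iteration, keeping the header row and any octave/gnuplot processes
top -b -n 1 | grep -E 'PID|octave|gnuplot'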
I would expect to see that process consuming 95% or more CPU. If you see a significantly lower number, then check the processes below it: are they consuming the remaining CPU (some sort of virus scan on a PC, or a database or server)? If a competing program is the problem, you'll have to decide whether you can wait until it is finished, or whether you can kill it and restart it later. (For Linux, use kill -15 pid (SIGTERM) first; only use kill -9 pid (SIGKILL) as a last resort. Search here for articles on the correct order of signals to try.)
If there are no competing processes and Octave/gnuplot is using less than 95%, then you'll have to find other tools to see what is holding up the process. (This is unlikely, but it's possible some part of your overall plotting process is disk I/O or network I/O bound.)
So, it depends on the timescale you're currently experiencing versus the time you "want" to experience.
Does your system have multiple CPUs? Then you'll need to study the Octave/gnuplot documentation to see if it supports a switch meaning "use $n available CPUs for processing" (or find a plotting program that does support using multiple CPUs).
Realistically, if your process now takes 10 minutes and, by eliminating competing processes, you can go from 60% to 90% of the CPU, that is a 50% increase in CPU share, but it will only reduce the runtime to roughly 6-7 minutes (10 minutes x 60/90). Being able to divide the task over 5, 10, or more CPUs will be the most certain path to faster turnaround times.
So, to go further with this, you'll need to edit your question with some data points. How long does your plot take? How big is the file it's processing? Is there something especially math-intensive in the plotting you're doing? Could a pre-processed data file speed up the calculations? Also, if the results of top don't show gnuplot running at 99% CPU, edit your posting to show the top output; that will help us understand your problem. (Paste in your top output, select it with your mouse, and then use the formatting tool {} at the top of the input box to keep the formatting and avoid having the output wrap in your posting.)
IHTH.
P.S. Note the number of followers for each of the tags you've assigned to your question by hovering over them. You might get more useful "eyes" on your question by including a tag for the OS you're using and a tag related to performance measurement/testing. (Go to the tags tab and type in various terms to see how many followers they have. One bit of S.O. etiquette is to only specify one programming language, if appropriate, and that may apply to OSes too.)
