Effective total time for a callee function is higher than that of caller function in intel-vtune

Effective total time for a callee function is higher than that of caller function in intel-vtune - intel

I have a multi-threading application and when I run vtune-profiler on it, under the caller/callee tab, I see that the callee function's CPU Time: Total - Effective Time is larger than caller function's CPU Time: Total - Effective Time.
eg.
caller function - A
callee function - B (no one calls B but A)
Function
CPU time: Total
-
Effective Time
A
54%
B
57%
My understanding is that Cpu Time: Total is the sum of CPU time: self + time of all the callee's of that function. By that definition should not Cpu Time: Total of A be greater than B?
What am I missing here?

It might have happened that the function B is being called by some other function along with A so there must be this issue.
Intel VTune profiler works by sampling and numbers are less accurate for short run time. If your application runs for a very short duration you could consider using allow multiple runs in VTune or increasing the run time.
Also Intel VTune Profiler sometimes rounds off the numbers so it might not give ideal result but the difference is very small like 0.1% but in your question its 3% difference so this won't be the reason for it.

Related

Difference Between mem_load_uops_retired.l3_miss and offcore_response.demand_data_rd.l3_miss.local_dram Events

I have an Intel(R) Core(TM) i7-4720HQ CPU # 2.60GHz (Haswell) processor. AFAIK, mem_load_uops_retired.l3_miss, counts the number of DRAM demand (i.e., non-prefetch) data read accesses. offcore_response.demand_data_rd.l3_miss.local_dram, as its name suggests, counts the number of demand data reads targeted to DRAM. Therefore, these two events seem to be equivalent (or at least almost the same). But based on the following benchmarks the former event is much less frequent than the latter:
1) Initializing a 1000-Elment Global Array in a Loop in C:
Performance counter stats for '/home/ahmad/Simple Progs/loop':
1,363 mem_load_uops_retired.l3_miss
1,543 offcore_response.demand_data_rd.l3_miss.local_dram
0.000749574 seconds time elapsed
0.000778000 seconds user
0.000000000 seconds sys
2) Opening a PDF Document in Evince:
Performance counter stats for '/opt/evince-3.28.4/bin/evince':
936,152 mem_load_uops_retired.l3_miss
1,853,998 offcore_response.demand_data_rd.l3_miss.local_dram
4.346408203 seconds time elapsed
1.644826000 seconds user
0.103411000 seconds sys
3) Running Wireshark for 5 seconds:
Performance counter stats for 'wireshark':
5,161,671 mem_load_uops_retired.l3_miss
8,126,526 offcore_response.demand_data_rd.l3_miss.local_dram
15.713828395 seconds time elapsed
0.904280000 seconds user
0.693906000 seconds sys
4) Running Blur Filter on an Image in Inkscape:
Performance counter stats for 'inkscape':
13,852,121 mem_load_uops_retired.l3_miss
23,475,970 offcore_response.demand_data_rd.l3_miss.local_dram
25.355643897 seconds time elapsed
7.244404000 seconds user
1.019895000 seconds sys
In all four benchmarks, offcore_response.demand_data_rd.l3_miss.local_dram is nearly twice as frequent as mem_load_uops_retired.l3_miss. Is this reasonable? Why? Please, tell me if the benchmarks are too complicated and coarse-grained!

The following table shows the differences between these two events on Haswell to the best of my (current) knowledge:
mem_load_uops_retired.l3_miss
offcore_response.demand _data_rd.l3_miss.local_dram
Cacheable Retired Load Uops
Per uop per line
Y
Cacheable Non-Retired Load Uops
N
Y
Uncacheable WC Retired Load Uops
One event per line
N
Uncacheable UC Retired Load Uops
May occur
N
Uncacheable WC or UC Non-Retired Load Uops
N
N
Locked Loads of any type to any memory type
May occur
I don't know
Legacy IO requests
May occur
N
L1D Prefetches
N
Y
L2 Prefetches into L2 or L3
N
N
Software prefetches with no intention for write
N
Y
Page Walk Loads
N
Y
Servicing Unit
Any
Local DRAM
Reliability
May not be reliable
Reliable
It should be clear to you now that these events, in general, are not equivalent at all. Also comparing the counts of these two events to deduce something meaningful is not an easy task.
In all of the examples you presented, the offcore_response.demand_data_rd.l3_miss.local_dram event count is larger than the mem_load_uops_retired.l3_miss event count. However, it's not hard to come up with real examples where the latter is larger than the former.
In all four benchmarks,
offcore_response.demand_data_rd.l3_miss.local_dram is nearly twice as
frequent as mem_load_uops_retired.l3_miss. Is this reasonable?
I think the description "nearly twice" really only applies to the second example, but not the others. I can't comment on the numbers you've shown without seeing the exact code and execution environment information.

groupsize vs execution time?

I make a simple program abput a vector adder and want to test the execution time vs the groupsize.
when I change the groupsize from 1024 to 5012 to 256 and to 128. The execution time is very similar. Why? in my view, when I `use smaller groupsizes, we should have more groups and they can work in the cores in parallel which could lead less execution time(for example, if workgroupsize change from 512 to 256, the execution time should reduce half?) but in my experinment in gpu, the execution time is siilar? is my view wrong?

Because number of workitems per group is not a visible bottleneck for vector addition. The global memory performance ise bottleneck. If data comes from host then pci-e performance is also bottleneck.

OpenCL: Confused by CL_DEVICE_MAX_COMPUTE_UNITS

I'm confused by this CL_DEVICE_MAX_COMPUTE_UNITS. For instance my Intel GPU on Mac, this number is 48. Does this mean the max number of parallel tasks run at the same time is 48 or the multiple of 48, maybe 96, 144...? (I know each compute unit is composed of 1 or more processing elements and each processing element is actually in charge of a "thread". What if these each of the 48 compute units is composed of more than 1 processing elements ). In other words, for my Mac, the "ideal" speedup, although impossible in reality, is 48 times faster than a CPU core (we assume the single "core" computation speed of CPU and GPU is the same), or the multiple of 48, maybe 96, 144...?

Summary: Your speedup is a little complicated, but your machine's (Intel GPU, probably GEN8 or GEN9) fp32 throughput 768 FLOPs per (GPU) clock and 1536 for fp16. Let's assume fp32, so something less than 768x (maybe a third of this depending on CPU speed). See below for the reasoning and some very important caveats.
A Quick Aside on CL_DEVICE_MAX_COMPUTE_UNITS:
Intel does something wonky when with CL_DEVICE_MAX_COMPUTE_UNITS with its GPU driver.
From the clGetDeviceInfo (OpenCL 2.0). CL_DEVICE_MAX_COMPUTE_UNITS says
The number of parallel compute units on the OpenCL device. A
work-group executes on a single compute unit. The minimum value is 1.
However, the Intel Graphics driver does not actually follow this definition and instead returns the number of EUs (Execution Units) --- An EU a grouping of the SIMD ALUs and slots for 7 different SIMD threads (registers and what not). Each SIMD thread represents 8, 16, or 32 workitems depending on what the compiler picks (we want higher, but register pressure can force us lower).
A workgroup is actually limited to a "Slice" (see the figure in section 5.5 "Slice Architecture"), which happens to be 24 EUs (in recent HW). Pick the GEN8 or GEN9 documents. Each slice has it's own SLM, barriers, and L3. Given that your apple book is reporting 48 EUs, I'd say that you have two slices.
Maximum Speedup:
Let's ignore this major annoyance and work with the EU number (and from those arch docs above). For "speedup" I'm comparing a single threaded FP32 calculation on the CPU. With good parallelization etc on the CPU, the speedup would be less, of course.
Each of the 48 EUs can issue two SIMD4 operations per clock in ideal circumstances. Assuming those are fused multiply-add's (so really two ops), that gives us:
48 EUs * 2 SIMD4 ops per EU * 2 (if the op is a fused multiply add)
= 192 SIMD4 ops per clock
= 768 FLOPs per clock for single precision floating point
So your ideal speedup is actually ~768. But there's a bunch of things that chip into this ideal number.
Setup and teardown time. Let's ignore this (assume the WL time dominates the runtime).
The GPU clock maxes out around a gigahertz while the CPU runs faster. Factor that ratio in. (crudely 1/3 maybe? 3Ghz on the CPU vs 1Ghz on the GPU).
If the computation is not heavily multiply-adds "mads", divide by 2 since I doubled above. Many important workloads are "mad"-dominated though.
The execution is mostly non-divergent. If a SIMD thread branches into an if-then-else, the entire SIMD thread (8,16,or 32 workitems) has to execute that code.
Register banking collisions delays can reduce EU ALU throughput. Typically the compiler does a great job avoiding this, but it can theoretically chew into your performance a bit (usually a few percent depending on register pressure).
Buffer address calculation can chew off a few percent too (EU must spend time doing integer compute to read and write addresses).
If one uses too much SLM or barriers, the GPU must leave some of the EU's idle so that there's enough SLM for each work item on the machine. (You can tweak your algorithm to fix this.)
We must keep the WL compute bound. If we blow out any cache in the data access hierarchy, we run into scenarios where no thread is ready to run on an EU and must stall. Assume we avoid this.
?. I'm probably forgetting other things that can go wrong.
We call the efficiency the percentage of theoretical perfect. So if our workload runs at ~530 FLOPs per clock, then we are 60% efficient of the theoretical 768. I've seen very carefully tuned workloads exceed 90% efficiency, but it definitely can take some work.

The ideal speedup you can get is the total number of processing elements which in your case corresponds to 48 * number of processing elements per compute unit. I do not know of a way to get the number of processing elements from OpenCL (that does not mean that it is not possible), however you can just google it for your GPU.
Up to my knowledge, a compute unit consists of one or multiple processing elements (for GPUs usually a lot), a register file, and some local memory. The threads of a compute unit are executed in a SIMD (single instruction multiple data) fashion. This means that the threads of a compute unit all execute the same operation but on different data.
Also, the speedup you get depends on how you execute a kernel function. Since a single work-group can not run on multiple compute units you need a sufficient number of work-groups in order to fully utilize all of the compute units. In addition, the work-group size should be a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE.

Understand what is using up "nice" CPU

I am running a small Cassandra cluster on Google Compute Engine. From our CPU graphs (as reported by collectd), I notice that a nontrivial amount of processor time is spent in NICE. How can I find out what process is consuming this? I've tried just start top and staring at it for a while, but the NICE cpu usage is a bit spikey (most of the time, NICE is at 0%; only on occasion will it spike up to 30-40%) so "sit and wait" isn't very effective.

"Nice" generally refers to to the priority of a process. (More positive values are lower priority, more negative values are higher priority.) You can run ps -eo nice,pid,args | grep '^\s*[1-9]' to get a list of positive nice (low priority) commands.
On a CPU graph NICE time is time spent running processes with positive nice value (ie low priority). This means that it is consuming CPU, but will give up that CPU time for most other processes. Any USER CPU time for one of the processes listed in the above ps command will show up as NICE.

How does this formula that calculates CPU utilization work?

I've been given this question
Consider a system running ten I/0-bound tasks and one CpU-bound task. Assume that the I/O-bound tasks issue and I/O operation once for every millisecond of CPU computing and that each I/O operation takes 10 milliseconds to complete. Also assume that the context-switching overhead is .1 millisecond and that all processes are long running tasks Describe the CPU utilization for round-robin scheduler when:
a. The time quantum is 1 millisecond
b. The time quantum is 10 milliseconds
and I found answer for it
The time quantum is 1 millisecond: Irrespective of which process is scheduled, the
scheduler incurs a 0.1 millisecond context-switching cost for every context-switch.
This results in a CPU utilization of 1/1.1 * 100 = 91%.
The time quantum is 10 milliseconds: The I/O-bound tasks incur a context switch
after using up only 1 millisecond of the time quantum. The time required to cycle
through all the processes is therefore 10*1.1 + 10.1 (as each I/O-bound task
executes for 1millisecond and then incur the context switch task, whereas the CPU-
bound task executes for 10 milliseconds before incurring a context switch). The CPU
utilization is therefore 20/21.1 * 100 = 94%.
My only question how is this person deriving the formula for CPU Utilization? I can't seem to under stand where he/she is getting the numbers 20/21.1 * 100 = 94%, and 1/1.1 * 100 = 91%.

For the first case, every task uses 1msec to do work and .1msec to switch; thus, it is spending 1 of every 1.1 msec doing work.
For the second case, it is similar: of the 21.1 msec spent to go through all tasks, only 20 of that is doing actual work.

This is the best possible explanation to above problem :
http://jade-cheng.com/uh/coursework/ics-412/homework-4.pdf

for part a
we have 11 process(10 i/o,1 cpu). Each takes 1ms execution time and 0.1ms switching time.
So total time taken by a process is: 10(I/o)*1(1ms of cpu)+1(CPU bounded process)*1(1ms of cpu)+11*0.1(total switching time)=12.1ms.
In this 12.1ms, time for which cpu was busy/doing execution=10*1(For 10 I/O precoess)+1*1(for 1 CPU process)=10+1=11
CPU utilisation=(11/12.1)*100=(1/1.1)*100=91%approx
for part b
Though time quantum is 10ms, but I/O bound process will only occupy 1ms of cpu and then go to block state as it need I/O, and thus there is 0.1ms of context switching.
So total time taken by I/O bound process will be= 10*1
But CPU bounded process uses its whole 10ms of time slice and 0.1ms of switching. So it takes total time of 1*10=10ms
And total context switching time=11*0.1=1.1ms
Therefor total time taken=10+10+1.1=21.1ms
and time for which cpu was busy/doing execution=10*1+1*10=20
CPU utilisation=(20/21.1)*100=94%approx

I was going through the same question. this is how i understood it
In first case , when time quantum is 1 msec, if we think about gantt chart, all I/O bound process will come (lets call p1-p10) followed by p11 which is CPU bound. so total 10 context switches in 11 ms. so effective work done by CPU in that 11 msec is only 11-(10*.1ms) ie 10 ms. so CPU utilization is (10/11)*100= 90%
same way, in 2nd case, there will be 11 switches(last one is of CPU bound process) if i consider 20.1 msec of time. so effective time cpu worked is 20.1-(11*.1)= 19ms. so CPU utilization (19/20.1)*100=94%

I was confused beyond belief for some reason on this question...after looking at all the answers here I finally understood through carefully looking at the jade-cheng link given by another user. There was no formula I could find in the book (maybe I missed it) but here is my version of the answer, in a kind of pseudo-formula style:
WARNING: This is probably wrong, but maybe you can show me where I went wrong.
a)
[(10 I/O processes)(1ms) + (1 cpu process)(1ms)] / [(10 I/O processes)(1ms) + (1 cpu process)(1ms) + (10 context switches)*(0.1ms)] = 10/11 = 91%
b)
[(10 I/O processes)(1ms) + (1 cpu process)(10ms)] / [(10 I/O processes)(1ms) + (1 cpu process)(10ms) + (10 context switches)*(0.1ms)] = 20/21 = 95%

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex