How does this formula that calculates CPU utilization work?

I've been given this question
Consider a system running ten I/O-bound tasks and one CPU-bound task. Assume that the I/O-bound tasks issue an I/O operation once for every millisecond of CPU computation and that each I/O operation takes 10 milliseconds to complete. Also assume that the context-switching overhead is 0.1 millisecond and that all processes are long-running tasks. Describe the CPU utilization for a round-robin scheduler when:
a. The time quantum is 1 millisecond
b. The time quantum is 10 milliseconds
and I found answer for it
The time quantum is 1 millisecond: Irrespective of which process is scheduled, the scheduler incurs a 0.1 millisecond context-switching cost for every context switch. This results in a CPU utilization of 1/1.1 * 100 = 91%.
The time quantum is 10 milliseconds: The I/O-bound tasks incur a context switch after using up only 1 millisecond of the time quantum. The time required to cycle through all the processes is therefore 10*1.1 + 10.1 (each I/O-bound task executes for 1 millisecond and then incurs a context switch, whereas the CPU-bound task executes for 10 milliseconds before incurring a context switch). The CPU utilization is therefore 20/21.1 * 100 = 94%.
My only question is: how is this person deriving the formula for CPU utilization? I can't seem to understand where he/she is getting the numbers 20/21.1 * 100 = 94% and 1/1.1 * 100 = 91%.

For the first case, every task uses 1 msec to do work and 0.1 msec to switch; thus, it is spending 1 of every 1.1 msec doing work.
For the second case it is similar: of the 21.1 msec spent going through all the tasks, only 20 msec is actual work.
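As a quick sanity check of those two ratios, here is a minimal sketch (plain Python, using only the numbers above):

```python
switch = 0.1                               # context-switch cost in ms

# Case 1: each task does 1 ms of work, then pays one 0.1 ms switch.
util_1ms = 1.0 / (1.0 + switch)            # ~0.909 -> about 91%

# Case 2: one cycle is 10 * 1.1 ms (I/O-bound) + 10.1 ms (CPU-bound) = 21.1 ms,
# of which 20 ms is useful work.
util_10ms = 20.0 / (10 * 1.1 + 10.1)       # ~0.948 -> about 94%

print(f"{util_1ms:.1%}  {util_10ms:.1%}")  # 90.9%  94.8%
```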

This is the best explanation of the above problem that I have found:
http://jade-cheng.com/uh/coursework/ics-412/homework-4.pdf

For part (a):
We have 11 processes (10 I/O-bound, 1 CPU-bound). Each takes 1 ms of execution time and 0.1 ms of switching time.
So the total time for one full cycle is: 10 (I/O) * 1 ms + 1 (CPU-bound process) * 1 ms + 11 * 0.1 ms (total switching time) = 12.1 ms.
Within those 12.1 ms, the time the CPU was busy executing = 10*1 (for the 10 I/O processes) + 1*1 (for the 1 CPU process) = 10 + 1 = 11 ms.
CPU utilisation = (11/12.1)*100 = (1/1.1)*100 ≈ 91%.
For part (b):
Although the time quantum is 10 ms, an I/O-bound process occupies the CPU for only 1 ms before it blocks for I/O, and each switch still costs 0.1 ms.
So the total CPU time used by the I/O-bound processes = 10*1 = 10 ms.
The CPU-bound process uses its whole 10 ms time slice before its switch, so it uses 1*10 = 10 ms.
Total context-switching time = 11*0.1 = 1.1 ms.
Therefore the total cycle time = 10 + 10 + 1.1 = 21.1 ms,
and the time the CPU was busy executing = 10*1 + 1*10 = 20 ms.
CPU utilisation = (20/21.1)*100 ≈ 94%.
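The same breakdown can be written as a small function. This is only a sketch; the parameter names are mine, and it assumes every I/O-bound task blocks after 1 ms of CPU time while the CPU-bound task always uses its full quantum:

```python
def rr_utilization(quantum, n_io=10, n_cpu=1, io_burst=1.0, switch=0.1):
    # CPU time each task actually uses before its context switch, per cycle.
    io_run = min(quantum, io_burst)     # I/O-bound tasks block after io_burst ms
    cpu_run = quantum                   # CPU-bound task uses the whole quantum
    busy = n_io * io_run + n_cpu * cpu_run
    total = busy + (n_io + n_cpu) * switch
    return busy / total

print(rr_utilization(quantum=1))    # 11/12.1 ~= 0.909 -> ~91%
print(rr_utilization(quantum=10))   # 20/21.1 ~= 0.948 -> ~94%
```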

I was going through the same question; this is how I understood it.
In the first case, when the time quantum is 1 msec, think of the Gantt chart: all the I/O-bound processes run first (call them P1 to P10), followed by P11, which is CPU-bound. That is 10 context switches within 11 ms, so the effective work done by the CPU in those 11 msec is only 11 - 10*0.1 = 10 ms, and the CPU utilization is (10/11)*100 ≈ 91%.
In the same way, in the second case there are 11 switches (the last one belongs to the CPU-bound process) within 20.1 msec, so the effective time the CPU worked is 20.1 - 11*0.1 = 19 ms, and the CPU utilization is (19/20.1)*100 ≈ 94%.

I was confused beyond belief on this question for some reason... After looking at all the answers here, I finally understood it by carefully working through the jade-cheng link given by another user. There was no formula I could find in the book (maybe I missed it), but here is my version of the answer, in a kind of pseudo-formula style:
WARNING: This is probably wrong, but maybe you can show me where I went wrong.
a)
[(10 I/O processes)(1ms) + (1 cpu process)(1ms)] / [(10 I/O processes)(1ms) + (1 cpu process)(1ms) + (10 context switches)*(0.1ms)] = 10/11 = 91%
b)
[(10 I/O processes)(1ms) + (1 cpu process)(10ms)] / [(10 I/O processes)(1ms) + (1 cpu process)(10ms) + (10 context switches)*(0.1ms)] = 20/21 = 95%

Related

Effective total time for a callee function is higher than that of caller function in intel-vtune

I have a multi-threaded application, and when I run the VTune profiler on it, under the Caller/Callee tab I see that the callee function's CPU Time: Total - Effective Time is larger than the caller function's CPU Time: Total - Effective Time.
eg.
caller function - A
callee function - B (no one calls B but A)
Function    CPU Time: Total - Effective Time
A           54%
B           57%
My understanding is that CPU Time: Total is the sum of CPU Time: Self plus the time of all the callees of that function. By that definition, shouldn't the CPU Time: Total of A be greater than that of B?
What am I missing here?
It might be that function B is also being called by some other function besides A, which would explain this.
Intel VTune Profiler works by sampling, and the numbers are less accurate for short run times. If your application runs for a very short duration, you could consider using the "allow multiple runs" option in VTune or increasing the run time.
Intel VTune Profiler also sometimes rounds off the numbers, so it might not give the ideal result, but that difference is very small, around 0.1%; in your case the difference is 3%, so rounding won't be the reason.
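To see how the first point can produce these numbers, here is a toy calculation with made-up figures (they are not taken from VTune): once B has a second caller, B's inclusive time can exceed A's.

```python
# Toy (hypothetical) numbers showing how callee B's inclusive time can exceed
# caller A's when B is also called by some other function C.
self_A = 5                      # time spent in A's own code
b_under_A, b_under_C = 30, 10   # time spent in B on behalf of A and of C

total_B = b_under_A + b_under_C   # B's "CPU Time: Total" = 40
total_A = self_A + b_under_A      # A's "CPU Time: Total" = 35
print(total_A, total_B)           # 35 < 40, so B's total can exceed A's
```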

Difference Between mem_load_uops_retired.l3_miss and offcore_response.demand_data_rd.l3_miss.local_dram Events

I have an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell) processor. AFAIK, mem_load_uops_retired.l3_miss counts the number of DRAM demand (i.e., non-prefetch) data read accesses. offcore_response.demand_data_rd.l3_miss.local_dram, as its name suggests, counts the number of demand data reads targeted to DRAM. Therefore, these two events seem to be equivalent (or at least almost the same). But based on the following benchmarks, the former event is much less frequent than the latter:
1) Initializing a 1000-Element Global Array in a Loop in C:
Performance counter stats for '/home/ahmad/Simple Progs/loop':
1,363 mem_load_uops_retired.l3_miss
1,543 offcore_response.demand_data_rd.l3_miss.local_dram
0.000749574 seconds time elapsed
0.000778000 seconds user
0.000000000 seconds sys
2) Opening a PDF Document in Evince:
Performance counter stats for '/opt/evince-3.28.4/bin/evince':
936,152 mem_load_uops_retired.l3_miss
1,853,998 offcore_response.demand_data_rd.l3_miss.local_dram
4.346408203 seconds time elapsed
1.644826000 seconds user
0.103411000 seconds sys
3) Running Wireshark for 5 seconds:
Performance counter stats for 'wireshark':
5,161,671 mem_load_uops_retired.l3_miss
8,126,526 offcore_response.demand_data_rd.l3_miss.local_dram
15.713828395 seconds time elapsed
0.904280000 seconds user
0.693906000 seconds sys
4) Running Blur Filter on an Image in Inkscape:
Performance counter stats for 'inkscape':
13,852,121 mem_load_uops_retired.l3_miss
23,475,970 offcore_response.demand_data_rd.l3_miss.local_dram
25.355643897 seconds time elapsed
7.244404000 seconds user
1.019895000 seconds sys
In all four benchmarks, offcore_response.demand_data_rd.l3_miss.local_dram is nearly twice as frequent as mem_load_uops_retired.l3_miss. Is this reasonable? Why? Please, tell me if the benchmarks are too complicated and coarse-grained!
The following table shows the differences between these two events on Haswell to the best of my (current) knowledge:
Event A = mem_load_uops_retired.l3_miss
Event B = offcore_response.demand_data_rd.l3_miss.local_dram

Attribute                                            Event A               Event B
Cacheable Retired Load Uops                          Per uop per line      Y
Cacheable Non-Retired Load Uops                      N                     Y
Uncacheable WC Retired Load Uops                     One event per line    N
Uncacheable UC Retired Load Uops                     May occur             N
Uncacheable WC or UC Non-Retired Load Uops           N                     N
Locked Loads of any type to any memory type          May occur             I don't know
Legacy IO requests                                   May occur             N
L1D Prefetches                                       N                     Y
L2 Prefetches into L2 or L3                          N                     N
Software prefetches with no intention for write      N                     Y
Page Walk Loads                                      N                     Y
Servicing Unit                                       Any                   Local DRAM
Reliability                                          May not be reliable   Reliable
It should be clear to you now that these events, in general, are not equivalent at all. Also comparing the counts of these two events to deduce something meaningful is not an easy task.
In all of the examples you presented, the offcore_response.demand_data_rd.l3_miss.local_dram event count is larger than the mem_load_uops_retired.l3_miss event count. However, it's not hard to come up with real examples where the latter is larger than the former.
In all four benchmarks, offcore_response.demand_data_rd.l3_miss.local_dram is nearly twice as frequent as mem_load_uops_retired.l3_miss. Is this reasonable?
I think the description "nearly twice" really only applies to the second example, but not the others. I can't comment on the numbers you've shown without seeing the exact code and execution environment information.

Calculate Queue Wait Times

I am looking to calculate the wait time in a queue per position, or a general time based on your queue position. It is a FIFO.
List of current performance status of the service
Size AvTime Queue Processing AvgFileSize(mb)
1 (0 - 1 mb) 2.57 18 3 0.21
2 (1 - 5 mb) 12.43 2 4 2.16
3 (5 - 10 mb) 23.38 9 8 6.72
4 (10 - 25 mb) 38.17 1 4 12.52
5 (>= 25 mb) 109.31 0 0 32.41
The current list of processing and queued batch files. It only lists the current user's files, which is why some queue numbers are missing.
Queue Filename Status
30 Batch (3456).XML(2) Queue
20 Batch (2399).xml(3) Queue
14 batch (1495).xml(1) Queue
12 batch (1497).xml(1) Queue
15 batch (1499).xml(1) Queue
10 batch (1500).xml(4) Queue
13 batch (1496).xml(1) Queue
11 batch (1501).xml(1) Queue
9 batch (1498).xml(1) Queue
8 batch (1494).xml(1) Queue
7 batch (1493).xml(1) Queue
6 batch (1492).xml(1) Queue
5 batch (1491).xml(1) Queue
4 batch (1490).xml(1) Queue
3 batch (1).xml(1) Queue
2 Batch1.xml(1) Queue
1 Batch1.XML(2) Queue
Batch1.xml(1) Processing
Batch1.xml(1) Processing
Batch1.xml(3) Processing
Batch1.xml(4) Processing
Batch1.xml(1) Processing
Batch1.xml(3) Processing
Batch1.xml(3) Processing
Batch1.xml(3) Processing
Batch1.xml(4) Processing
Batch1.xml(4) Processing
Batch1.xml(2) Processing
Batch1.xml(3) Processing
Batch1.xml(3) Processing
Batch1.xml(2) Processing
Batch1.xml(2) Processing
Batch1.xml(3) Processing
Batch1.xml(3) Processing
Batch1.xml(4) Processing
Batch1.xml(2) Processing
So I am looking to add more information to the list: for example, how long a batch file at position 20 will wait in the queue before it starts processing.
Queue Filename Status
30 (*30min) Batch (3456).XML(2) Queue
20 (*10min) Batch (2399).xml(3) Queue
...
*estimated
Your question doesn't quite provide enough context to make it possible to answer, but I can make some guesses based on the sample displays you provided.
Looks like you have a "single queue, multiple server" setup. In other words, you have a single FIFO queue and some fixed number N of jobs that can be in processing at any given time. Is that right?
For your algorithm, let's assume you have the following information:
Position of our job in queue (position N means there are N jobs ahead of us)
Size of our job
Size of each job ahead of us in the queue
Pool of jobs being processed, with a certain maximum size N
Size of each job currently being processed
Elapsed time for each job currently in process (how long since that job started)
First of all, you will need a function ExpectedJobDuration(jobsize) that computes an expected job processing time for a job of a given size, based on the statistics shown in your "performance status" table. This looks pretty straightforward. Given a job size, first figure out which of your five size categories it falls into (0: 0-1mb, 1: 1-5mb, etc.) Then take your job size and multiply by the average time divided by the average size of jobs in that category. That will give you an estimate of ExpectedJobDuration(jobsize), which will tell you how long it takes to run a job of a given size, under the assumption that job time is proportional to job size, for jobs within a particular size range.
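As a sketch of that function in Python, using the AvTime and AvgFileSize(mb) columns from the "performance status" table above (the function name, the category-table layout, and the linear-scaling assumption are mine):

```python
# (upper size bound in MB, average time, average file size in MB), per category
SIZE_CATEGORIES = [
    (1.0,           2.57,   0.21),
    (5.0,          12.43,   2.16),
    (10.0,         23.38,   6.72),
    (25.0,         38.17,  12.52),
    (float("inf"), 109.31, 32.41),
]

def expected_job_duration(jobsize_mb):
    """Expected processing time for a job of the given size (same unit as AvTime)."""
    for upper_bound, avg_time, avg_size in SIZE_CATEGORIES:
        if jobsize_mb < upper_bound:
            # Assume time is proportional to size within the category.
            return jobsize_mb * (avg_time / avg_size)
    # Not reached: the last category's upper bound is infinite.
```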
Now, for a job of a given size that's already been in process for a given time ElapsedProcessingTime, how long do we expect it to take to complete? A simple answer would be something like:
ExpectedRemainingTime = ExpectedJobDuration(jobsize) - ElapsedProcessingTime.
For jobs sitting in the queue this will be exactly the same as the expected job duration; for jobs already being processed we subtract the time the job has already been in work. However, if there is some random variation in job processing times, this is not exactly right, and could turn out to be negative. This is sort of like the actuarial problem: the average lifespan of a person is X years, how long do we expect someone to live if they are already Y years old? You would need a lot more statistical data to compute this, so for practical purposes, if the answer comes out negative, just set it to zero. (If someone is 100 years old and the average human lifespan is 90, expect them to die at any moment. That's not quite right, but perhaps OK as a first approximation. Unless you are the 100-year-old person, and not yet ready to die. :-))
OK, now we have a way to compute how long each job ahead of us in the queue should take, and how long it should take to complete jobs already in process.
If the number of jobs currently being processed is less than N (the max that can be processed at any given time) then our job can start right away. So in that case we have the answer - expected delay until our job can start is zero seconds.
Now let's look at the case where we are in position 0 in the queue. That means there are no jobs ahead of us in the queue, so our expected time to start is the minimum of the ExpectedRemainingTime of the jobs in the processing pool.
Now that gives us the basis for a recursive function that computes delay until our expected start time.
DelayUntilStart(jobPool, currentJob, queue) {
    find minJob in jobPool with minimum ExpectedRemainingTime
    if currentJob is in position zero of queue
        return ExpectedRemainingTime(minJob)
    else
        remove minJob from jobPool
        pop the top job from the queue and put it in the jobPool
        return ExpectedRemainingTime(minJob) + DelayUntilStart(jobPool, currentJob, queue)
}
Note - we may have a very long job ahead of us in the queue - but that doesn't mean we have to wait for it to complete. We just have to wait for it to get into the pool of jobs currently being processed, and then a shorter job might complete and let us into the pool.
The algorithm I just described is going to be an approximation. But it's probably about as good as you are going to get without a lot of statistics about job processing times. For practical purposes I bet it would work pretty well.
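Here is a runnable Python version of that idea. It is only a sketch under the assumptions above: job sizes in MB, the expected_job_duration() helper sketched earlier, and a fixed-size processing pool. The iterative form below is equivalent to the recursion and also covers the case where the pool is not yet full:

```python
def delay_until_start(pool, queue, max_pool_size):
    """Expected wait until our job enters the processing pool.

    pool:  list of (size_mb, elapsed_time) for jobs currently processing
    queue: list of size_mb for queued jobs; index 0 is the head of the queue,
           and the last element is our job
    """
    # Expected remaining time of each in-process job, clamped at zero.
    remaining = [max(expected_job_duration(size) - elapsed, 0.0)
                 for size, elapsed in pool]
    queue = list(queue)
    delay = 0.0
    while True:
        if len(remaining) < max_pool_size:   # a slot is free, so a job starts now
            if len(queue) <= 1:              # and that job is ours
                return delay
            remaining.append(expected_job_duration(queue.pop(0)))
            continue
        soonest = min(remaining)             # wait for the quickest job to finish
        delay += soonest
        remaining = [t - soonest for t in remaining]
        remaining.remove(min(remaining))     # that finished job leaves the pool
```

With the sample display above, max_pool_size would be however many jobs the service can run at once (the number of "Processing" rows when the pool is full), pool would be built from those rows, and queue from the "Queue" rows ordered head-first.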

groupsize vs execution time?

I made a simple vector-adder program and want to test the execution time vs. the group size.
When I change the group size from 1024 to 512, to 256, and to 128, the execution time is very similar. Why? In my view, with smaller group sizes we should have more groups, which can run on the cores in parallel and should lead to less execution time (for example, if the work-group size changes from 512 to 256, shouldn't the execution time be cut in half?). But in my experiment on the GPU the execution times are similar. Is my view wrong?
Because the number of work-items per group is not the bottleneck for vector addition; global memory performance is the bottleneck. If the data comes from the host, then PCI-e performance is also a bottleneck.
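To make "memory-bound" concrete, here is a rough back-of-the-envelope sketch; the bandwidth and FLOP figures are illustrative assumptions, not measurements of any particular GPU:

```python
# Arithmetic intensity of c[i] = a[i] + b[i] with 32-bit floats.
bytes_per_element = 3 * 4             # two loads + one store, 4 bytes each
flops_per_element = 1                 # one addition
intensity = flops_per_element / bytes_per_element   # ~0.083 FLOP/byte

# Illustrative (assumed) GPU figures:
peak_bw_gb_s = 200.0                  # memory bandwidth, GB/s
peak_gflops = 4000.0                  # compute throughput, GFLOP/s
balance = peak_gflops / peak_bw_gb_s  # ~20 FLOP/byte needed to keep the ALUs busy

print(f"intensity = {intensity:.3f} FLOP/B, balance point = {balance:.1f} FLOP/B")
# The kernel's intensity is far below the balance point, so execution time is
# set by memory bandwidth, not by how the work-items are grouped.
```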

Constant size of task - the same execution time on 1x and 2x CPU - OpenCL

I have a problem understanding my results for an integral algorithm (implemented in OpenCL).
I have access to two Intel Xeon E5-2680 v3 CPUs; each has 12 cores.
From OpenCL, for some reason I can see only one device, but I can request 12 or 24 cores, so I guess it does not matter whether I "see" one or two devices as long as 24 cores (2 CPUs) are used.
I ran the tasks with a maximum local size of 4096 and a minimal global size of 4096, and for 1 CPU and 2 CPUs the execution time was the same. I kept increasing the global size to 2*4096, 4*4096, 8*4096, and when I reached a global size of 16*4096, 1 CPU started slowing down while 2 CPUs kept speeding up; for every larger global size after that it stayed that way, with 2 CPUs being 2x faster than 1 CPU.
I don't understand why we can't see the advantage of 2 CPUs over 1 CPU from the beginning.
What is also important to me: I collected power consumption for the CPUs, and at that last global size of 8*4096 where we see the same execution time for 1 and 2 CPUs, I see slightly lower power consumption for 2 CPUs. As the global size grew, the 2-CPU consumption stayed lower than the 1-CPU consumption, I guess because of the 2x faster execution; but shouldn't it be equal to or bigger than for 1 CPU?
What may be important: I checked that both the 1-CPU and 2-CPU runs always stay at 2.5 GHz; the frequency is not changing.
My questions regarding the above are:
Why do 1 CPU and 2 CPUs have equal execution times at smaller global sizes?
Why do 2 CPUs have lower power consumption at bigger global sizes?
Why, at the one point where the global size is 8*4096 and the execution times are equal, do I see slightly lower power consumption with 2 CPUs than with 1 CPU?
I need to add that every run was made 10 times, so those results are not accidental.
Here are my results:
Why do 1 CPU and 2 CPUs have equal execution times at smaller global sizes?
Because you used 4096 as the local size. Each compute unit of a CPU device is one core. You put 16*4096 as the global size, so it used 16 cores. You probably used a memory-bound kernel, or one core was accessing the other CPU's cache or memory, so it didn't matter whether 1 core or N cores were used. When you increase the global size, the other CPU's memory can be used more often and the memory access pattern becomes more symmetrical.
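Here is a small sketch of the mapping this describes; the one-work-group-per-core assumption comes from this answer and is not something verified on that system:

```python
# How many cores can be busy at each global size, assuming one work-group
# occupies one core at a time (local size fixed at 4096, 2 x 12-core CPUs).
local_size = 4096
cores_per_cpu = 12

for multiplier in (1, 2, 4, 8, 16, 32):
    global_size = multiplier * 4096
    groups = global_size // local_size
    print(f"global = {multiplier}*4096  work-groups = {groups}  "
          f"cores busy (1 CPU) = {min(groups, cores_per_cpu)}  "
          f"cores busy (2 CPUs) = {min(groups, 2 * cores_per_cpu)}")
# Up to 12 work-groups a single CPU is enough; from 16*4096 upward one CPU is
# oversubscribed while two CPUs still have idle cores, which matches the point
# where the 2-CPU advantage appears.
```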
Why do 2 CPUs have lower power consumption at bigger global sizes?
2 CPUs have more cache, so they can schedule more kernels at the same time; reusing data from cache may also take less power than accessing RAM. Getting data from RAM should consume more power than getting it from cache.
Why, at the one point where the global size is 8*4096 and the execution times are equal, do I see slightly lower power consumption with 2 CPUs than with 1 CPU?
With 8 cores in use (8 * local size), a single CPU may have been doing all the work, and even if it wasn't, the same memory bank groups could be in use by both CPUs, with memory bandwidth as the bottleneck. Again, 2 CPUs have more cache, so there must be some data reuse taking advantage of the bigger cache, which decreases power consumption.
You should try different device fission combinations to get maximum locality and data shareability for the cores. Threads can be distributed randomly among CPUs, cores, and hardware threads; device fission solves this problem and gives more control over thread scheduling.
