Difference Between mem_load_uops_retired.l3_miss and offcore_response.demand_data_rd.l3_miss.local_dram Events

Difference Between mem_load_uops_retired.l3_miss and offcore_response.demand_data_rd.l3_miss.local_dram Events - intel

I have an Intel(R) Core(TM) i7-4720HQ CPU # 2.60GHz (Haswell) processor. AFAIK, mem_load_uops_retired.l3_miss, counts the number of DRAM demand (i.e., non-prefetch) data read accesses. offcore_response.demand_data_rd.l3_miss.local_dram, as its name suggests, counts the number of demand data reads targeted to DRAM. Therefore, these two events seem to be equivalent (or at least almost the same). But based on the following benchmarks the former event is much less frequent than the latter:
1) Initializing a 1000-Elment Global Array in a Loop in C:
Performance counter stats for '/home/ahmad/Simple Progs/loop':
1,363 mem_load_uops_retired.l3_miss
1,543 offcore_response.demand_data_rd.l3_miss.local_dram
0.000749574 seconds time elapsed
0.000778000 seconds user
0.000000000 seconds sys
2) Opening a PDF Document in Evince:
Performance counter stats for '/opt/evince-3.28.4/bin/evince':
936,152 mem_load_uops_retired.l3_miss
1,853,998 offcore_response.demand_data_rd.l3_miss.local_dram
4.346408203 seconds time elapsed
1.644826000 seconds user
0.103411000 seconds sys
3) Running Wireshark for 5 seconds:
Performance counter stats for 'wireshark':
5,161,671 mem_load_uops_retired.l3_miss
8,126,526 offcore_response.demand_data_rd.l3_miss.local_dram
15.713828395 seconds time elapsed
0.904280000 seconds user
0.693906000 seconds sys
4) Running Blur Filter on an Image in Inkscape:
Performance counter stats for 'inkscape':
13,852,121 mem_load_uops_retired.l3_miss
23,475,970 offcore_response.demand_data_rd.l3_miss.local_dram
25.355643897 seconds time elapsed
7.244404000 seconds user
1.019895000 seconds sys
In all four benchmarks, offcore_response.demand_data_rd.l3_miss.local_dram is nearly twice as frequent as mem_load_uops_retired.l3_miss. Is this reasonable? Why? Please, tell me if the benchmarks are too complicated and coarse-grained!

The following table shows the differences between these two events on Haswell to the best of my (current) knowledge:
mem_load_uops_retired.l3_miss
offcore_response.demand _data_rd.l3_miss.local_dram
Cacheable Retired Load Uops
Per uop per line
Y
Cacheable Non-Retired Load Uops
N
Y
Uncacheable WC Retired Load Uops
One event per line
N
Uncacheable UC Retired Load Uops
May occur
N
Uncacheable WC or UC Non-Retired Load Uops
N
N
Locked Loads of any type to any memory type
May occur
I don't know
Legacy IO requests
May occur
N
L1D Prefetches
N
Y
L2 Prefetches into L2 or L3
N
N
Software prefetches with no intention for write
N
Y
Page Walk Loads
N
Y
Servicing Unit
Any
Local DRAM
Reliability
May not be reliable
Reliable
It should be clear to you now that these events, in general, are not equivalent at all. Also comparing the counts of these two events to deduce something meaningful is not an easy task.
In all of the examples you presented, the offcore_response.demand_data_rd.l3_miss.local_dram event count is larger than the mem_load_uops_retired.l3_miss event count. However, it's not hard to come up with real examples where the latter is larger than the former.
In all four benchmarks,
offcore_response.demand_data_rd.l3_miss.local_dram is nearly twice as
frequent as mem_load_uops_retired.l3_miss. Is this reasonable?
I think the description "nearly twice" really only applies to the second example, but not the others. I can't comment on the numbers you've shown without seeing the exact code and execution environment information.

Related

Effective total time for a callee function is higher than that of caller function in intel-vtune

I have a multi-threading application and when I run vtune-profiler on it, under the caller/callee tab, I see that the callee function's CPU Time: Total - Effective Time is larger than caller function's CPU Time: Total - Effective Time.
eg.
caller function - A
callee function - B (no one calls B but A)
Function
CPU time: Total
-
Effective Time
A
54%
B
57%
My understanding is that Cpu Time: Total is the sum of CPU time: self + time of all the callee's of that function. By that definition should not Cpu Time: Total of A be greater than B?
What am I missing here?

It might have happened that the function B is being called by some other function along with A so there must be this issue.
Intel VTune profiler works by sampling and numbers are less accurate for short run time. If your application runs for a very short duration you could consider using allow multiple runs in VTune or increasing the run time.
Also Intel VTune Profiler sometimes rounds off the numbers so it might not give ideal result but the difference is very small like 0.1% but in your question its 3% difference so this won't be the reason for it.

Foreach in R: optimise RAM & CPU use by sorting tasks (objects)?

I have ~200 .Rds datasets that I perform various operations on (different scripts) in a pipeline (of multiple scripts). In most of these scripts I've begun with a for loop and upgraded to a foreach. My problem is that the dataset objects are different sizes (x axis is size in mb):
so if I optimise core number usage (I have a 12core 16gbRAM machine at the office and a 16core 32gbRAM machine at home), it'll whip through the first 90 without incident, but then larger files bunch up and max out the total RAM allocation (remember Rds files are compressed so these are larger in RAM than on disk, but the variability in file size at least gives an indication of the problem). This causes workers to crash and typically leaves me with 1 to 3 cores running through the remainder of the big files (using .errorhandling = "pass"). I'm thinking it would be great to optimise the core number based on number and RAM size of workers, and total available RAM, and figured others might have been in a similar dilemma and developed strategies to address this. Some approaches I've thought of but not tried:
Approach 1: first loop or list through the files on disk, potentially by opening & closing them, use object.size() to get their sizes in RAM, sort largest to smallest, cut halfway, reverse the order of the second half, and intersperse them: smallest, biggest, 2nd smallest, 2nd biggest, etc. 2 workers (or any even numbered multiple) should therefore be working on the 'mean' RAM usage. However: worker 1 will finish its job faster than any other job in the stack and then go onto job 3, the 2nd smallest, likely finish that really quickly also then do job 4, the second largest, while worker 2 is still on the largest, meaning that by job 4, this approach has the machine processing the 2 largest RAM objects concurrently, the opposite of what we want.
Approach 2: sort objects by size-in-RAM for each object, small to large. Starting from object 1, iteratively add subsequent objects' RAM usage until total RAM core number is exceeded. Foreach on that batch. Repeat. This would work but requires some convoluted coding (probably a for loop wrapper around the foreach which passes the foreach its task list each time?). Also if there are a lot of tasks which won't exceed the RAM (per my example), the cores limit batching process will mean all 12 or 16 have to complete before the next 12 or 16 are started, introducing inefficiency.
Approach 3: sort small-large per 2. Run foreach with all cores. This will churn through the small ones maximally efficiently until the tasks get bigger, at which point workers will start to crash, reducing the number of workers sharing the RAM and thus increasing the chance the remaining workers can continue. Conceptually this will mean cores-1 tasks fail and need to be re-run, but the code is easy and should work fast. I already have code that checks the output directory and removes tasks from the jobs list if they've already been completed, which means I could just re-run this approach, however I should anticipate further losses and therefore reruns required unless I lower the cores number.
Approach 4: as 3 but somehow close the worker (reduce core number) BEFORE the task is assigned, meaning the task doesn't have to trigger a RAM overrun and fail in order to reduce worker count. This would also mean no having to restart RStudio.
Approach 5: ideally there would be some intelligent queueing system in foreach that would do this all for me but beggars can't be choosers! Conceptually this would be similar to 4, above: for each worker, don't start the next task until there's sufficient RAM available.
Any thoughts appreciated from folks who've run into similar issues. Cheers!

I've thought a bit about this too.
My problem is a bit different, I don't have any crash but more some slowdowns due to swapping when not enough RAM.
Things that may work:
randomize the iterations so that it is approximately evenly distributed (without needing to know the timings in advance)
similar to approach 5, have some barriers (waiting of some workers with a while loop and Sys.sleep()) while not enough memory (e.g. determined via package {memuse}).
Things I do in practice:
always store the results of iterations in foreach loops and test if already computed (RDS file already exists)
skip some iterations if needed
rerun the "intensive" iterations using less cores

Display used CPU hours with slurm

I have a user account on a super computer where jobs are handled with slurm.
I would like to know the total amount of CPU hours that I have consumed on this super computer. I think that's an understandable question, because there is only a limited number of CPU hours available per project. I'm surprised that an answer is not easy to find.
I know that there are all these commands like sacct, sreport, sshare, etc... but it seems that there is no simple command that displays the used CPU hours.
Can someone help me out?

As others have commented, sacct should give you that information. You will need to look at the man page to get information for past jobs. You can specify a --starttime and --endtime to restrict your query to match your allocation as it ends/renews. The -l options should get you more information than you need so you can get a smaller set of options by specifying what you need with --format.
In your instance, the correct answer is to ask the administrators. You have been given an allocation of time to draw from. They likely have a system that will show you your balance and you can reconcile your balance against the output of sacct. Also, if the system you are using has different node types such as high memory, GPU, MIC, or old, they will likely charge you differently for those resources.

You can get an overview of the used CPU hours with the following:
sacct -SYYYY-mm-dd -u username -ojobid,start,end,alloccpu,cputime | column -t
You will could calculate the total accounting units (SBU in our system) multiplying CPUTime by AllocCPU which means multiplying the total (sysem+user) CPU time by the amount of CPU used.
An example:
JobID NodeList State Start End AllocCPUS CPUTime
------------ --------------- ---------- ------------------- ------------------- ---------- ----------
6328552 tcn[595-604] CANCELLED+ 2019-05-21T14:07:57 2019-05-23T16:48:15 240 506-17:12:00
6328552.bat+ tcn595 CANCELLED 2019-05-21T14:07:57 2019-05-23T16:48:16 24 50-16:07:36
6328552.0 tcn[595-604] FAILED 2019-05-21T14:10:37 2019-05-23T16:48:18 240 506-06:44:00
6332520 tcn[384,386,45+ COMPLETED 2019-05-23T16:06:04 2019-05-24T00:26:36 72 25-00:38:24
6332520.bat+ tcn384 COMPLETED 2019-05-23T16:06:04 2019-05-24T00:26:36 24 8-08:12:48
6332520.0 tcn[384,386,45+ COMPLETED 2019-05-23T16:06:09 2019-05-24T00:26:33 60 20-20:24:00
6332530 tcn[37,41,44,4+ FAILED 2019-05-23T17:11:31 2019-05-25T09:13:34 240 400-08:12:00
6332530.bat+ tcn37 FAILED 2019-05-23T17:11:31 2019-05-25T09:13:34 24 40-00:49:12
6332530.0 tcn[37,41,44,4+ CANCELLED+ 2019-05-23T17:11:35 2019-05-25T09:13:34 240 400-07:56:00
The fields are shown in the the manpage. They can be shown as -oOPTION (in lower case or in proper POSIX notation --format='Option,AnotherOption...' (a list is in the man).
So far so good. But there is a big caveat here:
What you see here is perfect to get an idea of what you have run or what to expect in terms of CPU / hours. But this will not necessarily reflect your real budget status, as in many cases each node / partition may have an extra parameter, the weight, which is a parameter set for accounting purposes and not part of SLURM. For instance,the GPU nodes may have a weight value of x3, which means that each GPU/hour is measured as 3 SBU instead of 1 for budgetary purposes. What I mean to say is that you can use sacct to gain insight on the CPU times but this will not necessarily reflect how much SBU credits you still have.

what do "user","system", and "elapsed" times mean in R [duplicate]

This question already has answers here:
What are 'user' and 'system' times measuring in R system.time(exp) output?
(5 answers)
Closed 9 years ago.
I am adopting parallel computing in R, and doing some benchmark works. I notice that when multiple cores are used, system.time shows increased times for user and system, but the elapsed time is decreased. Does this indicate that parallel computing is effective? Thanks.

If you do help(system.time) you get a hint to also look at help(proc.time). I quote from its help page:
Value:
An object of class ‘"proc_time"’ which is a numeric vector of
length 5, containing the user, system, and total elapsed times for
the currently running R process, and the cumulative sum of user
and system times of any child processes spawned by it on which it
has waited. (The ‘print’ method uses the ‘summary’ method to
combine the child times with those of the main process.)
The definition of ‘user’ and ‘system’ times is from your OS.
Typically it is something like
_The ‘user time’ is the CPU time charged for the execution of user
instructions of the calling process. The ‘system time’ is the CPU
time charged for execution by the system on behalf of the calling
process._
Times of child processes are not available on Windows and will
always be given as ‘NA’.
The resolution of the times will be system-specific and on
Unix-alikes times are rounded down to milliseconds. On modern
systems they will be that accurate, but on older systems they
might be accurate to 1/100 or 1/60 sec. They are typically
available to 10ms on Windows.

How does this formula that calculates CPU utilization work?

I've been given this question
Consider a system running ten I/0-bound tasks and one CpU-bound task. Assume that the I/O-bound tasks issue and I/O operation once for every millisecond of CPU computing and that each I/O operation takes 10 milliseconds to complete. Also assume that the context-switching overhead is .1 millisecond and that all processes are long running tasks Describe the CPU utilization for round-robin scheduler when:
a. The time quantum is 1 millisecond
b. The time quantum is 10 milliseconds
and I found answer for it
The time quantum is 1 millisecond: Irrespective of which process is scheduled, the
scheduler incurs a 0.1 millisecond context-switching cost for every context-switch.
This results in a CPU utilization of 1/1.1 * 100 = 91%.
The time quantum is 10 milliseconds: The I/O-bound tasks incur a context switch
after using up only 1 millisecond of the time quantum. The time required to cycle
through all the processes is therefore 10*1.1 + 10.1 (as each I/O-bound task
executes for 1millisecond and then incur the context switch task, whereas the CPU-
bound task executes for 10 milliseconds before incurring a context switch). The CPU
utilization is therefore 20/21.1 * 100 = 94%.
My only question how is this person deriving the formula for CPU Utilization? I can't seem to under stand where he/she is getting the numbers 20/21.1 * 100 = 94%, and 1/1.1 * 100 = 91%.

For the first case, every task uses 1msec to do work and .1msec to switch; thus, it is spending 1 of every 1.1 msec doing work.
For the second case, it is similar: of the 21.1 msec spent to go through all tasks, only 20 of that is doing actual work.

This is the best possible explanation to above problem :
http://jade-cheng.com/uh/coursework/ics-412/homework-4.pdf

for part a
we have 11 process(10 i/o,1 cpu). Each takes 1ms execution time and 0.1ms switching time.
So total time taken by a process is: 10(I/o)*1(1ms of cpu)+1(CPU bounded process)*1(1ms of cpu)+11*0.1(total switching time)=12.1ms.
In this 12.1ms, time for which cpu was busy/doing execution=10*1(For 10 I/O precoess)+1*1(for 1 CPU process)=10+1=11
CPU utilisation=(11/12.1)*100=(1/1.1)*100=91%approx
for part b
Though time quantum is 10ms, but I/O bound process will only occupy 1ms of cpu and then go to block state as it need I/O, and thus there is 0.1ms of context switching.
So total time taken by I/O bound process will be= 10*1
But CPU bounded process uses its whole 10ms of time slice and 0.1ms of switching. So it takes total time of 1*10=10ms
And total context switching time=11*0.1=1.1ms
Therefor total time taken=10+10+1.1=21.1ms
and time for which cpu was busy/doing execution=10*1+1*10=20
CPU utilisation=(20/21.1)*100=94%approx

I was going through the same question. this is how i understood it
In first case , when time quantum is 1 msec, if we think about gantt chart, all I/O bound process will come (lets call p1-p10) followed by p11 which is CPU bound. so total 10 context switches in 11 ms. so effective work done by CPU in that 11 msec is only 11-(10*.1ms) ie 10 ms. so CPU utilization is (10/11)*100= 90%
same way, in 2nd case, there will be 11 switches(last one is of CPU bound process) if i consider 20.1 msec of time. so effective time cpu worked is 20.1-(11*.1)= 19ms. so CPU utilization (19/20.1)*100=94%

I was confused beyond belief for some reason on this question...after looking at all the answers here I finally understood through carefully looking at the jade-cheng link given by another user. There was no formula I could find in the book (maybe I missed it) but here is my version of the answer, in a kind of pseudo-formula style:
WARNING: This is probably wrong, but maybe you can show me where I went wrong.
a)
[(10 I/O processes)(1ms) + (1 cpu process)(1ms)] / [(10 I/O processes)(1ms) + (1 cpu process)(1ms) + (10 context switches)*(0.1ms)] = 10/11 = 91%
b)
[(10 I/O processes)(1ms) + (1 cpu process)(10ms)] / [(10 I/O processes)(1ms) + (1 cpu process)(10ms) + (10 context switches)*(0.1ms)] = 20/21 = 95%

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex