mclapply user time larger than elapsed time - r

I am trying to use the mclapply function from the parallel package in R. The function assigns values to a sequence matrix by calculating log-likelihood distances - a CPU-intensive operation.
The resulting system.time values are confusing:
> system.time(mclapply(worksample,function(x){p_seqi_modj(x,worksample[[1]],c(1:17))}))
user system elapsed
29.339 1.242 18.581
I thought that elapsed meant the aggregated time (user + system). What does the above result mean in this case, and which time should I go by? My unparallelized version takes less user time and much more elapsed time.

The help page ?system.time says the value returned by the function is an object of class proc_time, and that we should consult ?proc.time. There we learn that user time is
cumulative sum of user and system times of any child processes
so your task has spent about 15s on each core (mclapply defaults to using 2 cores, see the mc.cores argument).
Actually, we see earlier in the help page that proc.time() returns five elements that separate the process and child times, and that the summary method used for printing collapses the user and system times each into a single process + child total, so there is a bit more information available than the three printed numbers suggest.
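To see those five elements yourself, call proc.time() around the parallel call. A minimal sketch, using a dummy CPU-bound task in place of the p_seqi_modj call from the question (mclapply relies on forking, so this runs in parallel on Unix-alikes only):
library(parallel)
t0 <- proc.time()
# dummy CPU-bound task standing in for the log-likelihood computation
invisible(mclapply(1:4, function(i) sum(sqrt(seq_len(5e6))), mc.cores = 2))
t1 <- proc.time()
# print() folds the child times into the main process times;
# unclass() exposes all five components:
# user.self, sys.self, elapsed, user.child, sys.child
unclass(t1 - t0)
With forked children doing the work, most of the CPU time shows up in user.child, which the print method folds into the single user figure you saw.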

Related

Difference Between mem_load_uops_retired.l3_miss and offcore_response.demand_data_rd.l3_miss.local_dram Events

I have an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell) processor. AFAIK, mem_load_uops_retired.l3_miss counts the number of DRAM demand (i.e., non-prefetch) data read accesses. offcore_response.demand_data_rd.l3_miss.local_dram, as its name suggests, counts the number of demand data reads targeted to DRAM. Therefore, these two events seem to be equivalent (or at least almost the same). But based on the following benchmarks, the former event is much less frequent than the latter:
1) Initializing a 1000-Element Global Array in a Loop in C:
Performance counter stats for '/home/ahmad/Simple Progs/loop':
1,363 mem_load_uops_retired.l3_miss
1,543 offcore_response.demand_data_rd.l3_miss.local_dram
0.000749574 seconds time elapsed
0.000778000 seconds user
0.000000000 seconds sys
2) Opening a PDF Document in Evince:
Performance counter stats for '/opt/evince-3.28.4/bin/evince':
936,152 mem_load_uops_retired.l3_miss
1,853,998 offcore_response.demand_data_rd.l3_miss.local_dram
4.346408203 seconds time elapsed
1.644826000 seconds user
0.103411000 seconds sys
3) Running Wireshark for 5 seconds:
Performance counter stats for 'wireshark':
5,161,671 mem_load_uops_retired.l3_miss
8,126,526 offcore_response.demand_data_rd.l3_miss.local_dram
15.713828395 seconds time elapsed
0.904280000 seconds user
0.693906000 seconds sys
4) Running Blur Filter on an Image in Inkscape:
Performance counter stats for 'inkscape':
13,852,121 mem_load_uops_retired.l3_miss
23,475,970 offcore_response.demand_data_rd.l3_miss.local_dram
25.355643897 seconds time elapsed
7.244404000 seconds user
1.019895000 seconds sys
In all four benchmarks, offcore_response.demand_data_rd.l3_miss.local_dram is nearly twice as frequent as mem_load_uops_retired.l3_miss. Is this reasonable? Why? Please tell me if the benchmarks are too complicated and coarse-grained!
The following table shows the differences between these two events on Haswell to the best of my (current) knowledge:
|  | mem_load_uops_retired.l3_miss | offcore_response.demand_data_rd.l3_miss.local_dram |
| --- | --- | --- |
| Cacheable Retired Load Uops | Per uop per line | Y |
| Cacheable Non-Retired Load Uops | N | Y |
| Uncacheable WC Retired Load Uops | One event per line | N |
| Uncacheable UC Retired Load Uops | May occur | N |
| Uncacheable WC or UC Non-Retired Load Uops | N | N |
| Locked Loads of any type to any memory type | May occur | I don't know |
| Legacy IO requests | May occur | N |
| L1D Prefetches | N | Y |
| L2 Prefetches into L2 or L3 | N | N |
| Software prefetches with no intention for write | N | Y |
| Page Walk Loads | N | Y |
| Servicing Unit | Any | Local DRAM |
| Reliability | May not be reliable | Reliable |
It should be clear to you now that these events, in general, are not equivalent at all. Also comparing the counts of these two events to deduce something meaningful is not an easy task.
In all of the examples you presented, the offcore_response.demand_data_rd.l3_miss.local_dram event count is larger than the mem_load_uops_retired.l3_miss event count. However, it's not hard to come up with real examples where the latter is larger than the former.
In all four benchmarks, offcore_response.demand_data_rd.l3_miss.local_dram is nearly twice as frequent as mem_load_uops_retired.l3_miss. Is this reasonable?
I think the description "nearly twice" really only applies to the second example, but not the others. I can't comment on the numbers you've shown without seeing the exact code and execution environment information.

system.time command - difference?

When I run this command:
system.time(fread('x.csv', header = T))
I receive this output:
user system elapsed
4.740 0.048 4.785
In simple terms, what does each of those mean, besides "elapsed", which is the time that has passed since running the command? What do "user" and "system" mean?
From http://www.ats.ucla.edu/stat/r/faq/timing_code.htm
The values presented (user, system, and elapsed) will be defined by your operating system, but generally, the user time relates to the execution of the code, the system time relates to your CPU, and the elapsed time is the difference in times since you started the stopwatch (and will be equal to the sum of user and system times if the chunk of code was run altogether).
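As a rough illustration (exact numbers will vary by machine): a sleeping task accumulates elapsed time but almost no CPU time, while a CPU-bound task accumulates user time close to its elapsed time:
# Sleeping: the wall clock advances but the CPU does almost no work,
# so elapsed is about 2s while user and system stay near zero
system.time(Sys.sleep(2))
# CPU-bound: the process burns CPU the whole time,
# so user time comes out close to elapsed time
system.time(sum(sqrt(seq_len(1e8))))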

Why is the R match function so slow?

The following should find the location of the first instance of the integer 1:
array <- rep(1,10000000)
system.time(match(1,array))
This returns
user system elapsed
0.720 1.243 1.964
If I run the same task using an array of size 100 I get this:
user system elapsed
0 0 0
Since all it should be doing is looking at the first value in the array and returning a match, the time taken should be that of a lookup and a comparison, regardless of the size of the array. If I wrote this in a lower-level language it would cost on the order of a handful of clock cycles (a microsecond or less?) regardless of the array size. Why does it take a second in R? It seems to be iterating through the whole array...
Is there a way for it to abort once it has found its match, rather than continuing to iterate unnecessarily?
The reason is that R is not actually doing a linear search; match first builds a hash table of the vector being searched. This is effective if you are searching for several items, but not very effective if you are searching for one value only.
A "better" implementation could use a linear search when you are matching a single value against a vector; I would guess that would be faster.

How to verify number of function evaluations when profiling R code

When profiling R code with Rprof-type functions we get the time spent in function alone and the time spent in function and callees. However, as far as I know we don't get the number of times a given function was evaluated.
For example, assume I want to compare two integration functions:
integrate_1(myfunc, from = -Inf, to = Inf)
integrate_2(myfunc, from = -Inf, to = Inf)
I could easily see how much time each function takes and where this time was spent, but I don't know how to check how many times myfunc had to be evaluated in each of the integrate functions.
One way of implementing Joran's counter method is to use the trace function.
For example, first we set the counter to zero. (Assigned in the global environment, for convenience.)
count <- 0
Then set up the trace. Here we set it on the identity function (which simply returns whatever value you pass to it).
trace("identity", quote(count <<- count + 1), print = FALSE)
Now whenever identity is called, the value of count is incremented. print = FALSE just stops a message being printed to the console when the function is called.
Let's call the function a few times and inspect the count:
for(i in seq_len(123)) identity(1)
count
## [1] 123
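The same idea carries over to the integration example in the question. Since integrate_1 and integrate_2 are not standard functions, this sketch uses base R's integrate as a stand-in:
count <- 0
myfunc <- function(x) {
  # integrate() passes a vector of points per call, so add length(x)
  # to count individual evaluations rather than calls
  count <<- count + length(x)
  dnorm(x)
}
integrate(myfunc, lower = -Inf, upper = Inf)
count  # total number of points at which myfunc was evaluated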
Rprof works by sampling the call stack on a timer. It does not count calls.
It records the sampled call stacks in a file, and though it does not record line numbers where calls occur, those samples are still useful for seeing what causes time to be spent.
For example, if you happen to look at M random samples, and you see a pattern like A calling B calling C on N of them, then you know the program spends roughly fraction N/M of its time doing that (assuming N > 1).
If you see such a thing, and you can think of a way to avoid even part of it, you will save a substantial fraction of the total time.
Rprof comes with a summarization tool that gives you the kind of numbers you mentioned, but I don't find those numbers useful anyway.
I would much rather get a real sense of what's happening.
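For completeness, a minimal sketch of driving Rprof by hand and inspecting the raw samples rather than the summary:
Rprof("prof.out")   # start sampling the call stack on a timer
for (i in 1:10000) r <- integrate(dnorm, lower = -Inf, upper = Inf)
Rprof(NULL)         # stop profiling
# after a header line, each line of the file is one sampled call stack;
# patterns that recur across many samples are where the time is going
head(readLines("prof.out"))
# the aggregated view, if you do want those numbers
summaryRprof("prof.out")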

what do "user", "system", and "elapsed" times mean in R [duplicate]

This question already has answers here:
What are 'user' and 'system' times measuring in R system.time(exp) output?
(5 answers)
Closed 9 years ago.
I am adopting parallel computing in R and doing some benchmarking. I notice that when multiple cores are used, system.time shows increased user and system times, but decreased elapsed time. Does this indicate that parallel computing is effective?
If you do help(system.time) you get a hint to also look at help(proc.time). I quote from its help page:
Value:
An object of class ‘"proc_time"’ which is a numeric vector of
length 5, containing the user, system, and total elapsed times for
the currently running R process, and the cumulative sum of user
and system times of any child processes spawned by it on which it
has waited. (The ‘print’ method uses the ‘summary’ method to
combine the child times with those of the main process.)
The definition of ‘user’ and ‘system’ times is from your OS.
Typically it is something like
The ‘user time’ is the CPU time charged for the execution of user
instructions of the calling process. The ‘system time’ is the CPU
time charged for execution by the system on behalf of the calling
process.
Times of child processes are not available on Windows and will
always be given as ‘NA’.
The resolution of the times will be system-specific and on
Unix-alikes times are rounded down to milliseconds. On modern
systems they will be that accurate, but on older systems they
might be accurate to 1/100 or 1/60 sec. They are typically
available to 10ms on Windows.
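A minimal sketch that reproduces the pattern described in the question (on a Unix-alike, since mclapply relies on forking): spreading a CPU-bound task over several cores adds up the children's user times while shrinking the wall clock, so user exceeds elapsed exactly when the parallelism is effective:
library(parallel)
f <- function(i) sum(sqrt(seq_len(2e7)))     # CPU-bound dummy task
system.time(lapply(1:8, f))                  # serial: user roughly equals elapsed
system.time(mclapply(1:8, f, mc.cores = 4))  # parallel: user > elapsed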
