In the RGui terminal, I just typed
ptm = proc.time()
ptm
The result is like
user system elapsed
0.21 0.87 50.32
So, it takes 50.32 seconds? I did nothing. What is the time unit for elapsed?
Many thanks if someone can help or forward this to an expert.
All three numbers are measured in seconds.
elapsed is counted from the moment the RGui session starts, so if you run the code again you will see the elapsed value keep growing; it is cumulative.
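To time a specific chunk of code rather than the whole session, take the difference between two proc.time() calls, or let system.time() do that for you. A minimal sketch (Sys.sleep(2) just stands in for whatever code you want to time):
ptm <- proc.time()
Sys.sleep(2)               # the code being timed
proc.time() - ptm          # elapsed will be about 2 seconds; user/system near 0
system.time(Sys.sleep(2))  # equivalent one-liner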
I have an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell) processor. AFAIK, mem_load_uops_retired.l3_miss counts the number of DRAM demand (i.e., non-prefetch) data read accesses. offcore_response.demand_data_rd.l3_miss.local_dram, as its name suggests, counts the number of demand data reads targeted to DRAM. Therefore, these two events seem to be equivalent (or at least almost the same). But based on the following benchmarks, the former event is much less frequent than the latter:
1) Initializing a 1000-Element Global Array in a Loop in C:
Performance counter stats for '/home/ahmad/Simple Progs/loop':
1,363 mem_load_uops_retired.l3_miss
1,543 offcore_response.demand_data_rd.l3_miss.local_dram
0.000749574 seconds time elapsed
0.000778000 seconds user
0.000000000 seconds sys
2) Opening a PDF Document in Evince:
Performance counter stats for '/opt/evince-3.28.4/bin/evince':
936,152 mem_load_uops_retired.l3_miss
1,853,998 offcore_response.demand_data_rd.l3_miss.local_dram
4.346408203 seconds time elapsed
1.644826000 seconds user
0.103411000 seconds sys
3) Running Wireshark for 5 seconds:
Performance counter stats for 'wireshark':
5,161,671 mem_load_uops_retired.l3_miss
8,126,526 offcore_response.demand_data_rd.l3_miss.local_dram
15.713828395 seconds time elapsed
0.904280000 seconds user
0.693906000 seconds sys
4) Running Blur Filter on an Image in Inkscape:
Performance counter stats for 'inkscape':
13,852,121 mem_load_uops_retired.l3_miss
23,475,970 offcore_response.demand_data_rd.l3_miss.local_dram
25.355643897 seconds time elapsed
7.244404000 seconds user
1.019895000 seconds sys
In all four benchmarks, offcore_response.demand_data_rd.l3_miss.local_dram is nearly twice as frequent as mem_load_uops_retired.l3_miss. Is this reasonable? Why? Please tell me if the benchmarks are too complicated and coarse-grained!
The following table shows the differences between these two events on Haswell to the best of my (current) knowledge:
| | mem_load_uops_retired.l3_miss | offcore_response.demand_data_rd.l3_miss.local_dram |
| --- | --- | --- |
| Cacheable Retired Load Uops | Per uop per line | Y |
| Cacheable Non-Retired Load Uops | N | Y |
| Uncacheable WC Retired Load Uops | One event per line | N |
| Uncacheable UC Retired Load Uops | May occur | N |
| Uncacheable WC or UC Non-Retired Load Uops | N | N |
| Locked Loads of any type to any memory type | May occur | I don't know |
| Legacy IO requests | May occur | N |
| L1D Prefetches | N | Y |
| L2 Prefetches into L2 or L3 | N | N |
| Software prefetches with no intention for write | N | Y |
| Page Walk Loads | N | Y |
| Servicing Unit | Any | Local DRAM |
| Reliability | May not be reliable | Reliable |
It should be clear by now that these events are, in general, not equivalent at all. Also, comparing the counts of these two events to deduce something meaningful is not an easy task.
In all of the examples you presented, the offcore_response.demand_data_rd.l3_miss.local_dram event count is larger than the mem_load_uops_retired.l3_miss event count. However, it's not hard to come up with real examples where the latter is larger than the former.
In all four benchmarks, offcore_response.demand_data_rd.l3_miss.local_dram is nearly twice as frequent as mem_load_uops_retired.l3_miss. Is this reasonable?
I think the description "nearly twice" really only applies to the second example, but not the others. I can't comment on the numbers you've shown without seeing the exact code and execution environment information.
When I run this command:
system.time(fread('x.csv', header = T))
I receive this output:
user system elapsed
4.740 0.048 4.785
In simple terms, what does each of those mean, besides "elapsed," which is the time that has passed since running the command? What do user and system mean?
From http://www.ats.ucla.edu/stat/r/faq/timing_code.htm
The values presented (user, system, and elapsed) will be defined by your operating system, but generally, the user time relates to the execution of the code, the system time relates to your CPU, and the elapsed time is the difference in times since you started the stopwatch (and will be equal to the sum of user and system times if the chunk of code was run altogether). While the difference of .42 seconds may not seem like much, this gain in efficiency is huge!
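As a rough illustration (not from the linked page): user and system measure CPU time charged to the R process, while elapsed is wall-clock time. Code that merely waits consumes elapsed time but almost no CPU time, whereas CPU-bound code shows user time roughly equal to elapsed time:
system.time(Sys.sleep(1))     # waiting: elapsed about 1 s, user and system near 0
system.time(sum(rnorm(1e7)))  # computing: user time roughly equals elapsed time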
I am trying to use the mclapply function of the parallel package in R. The function assigns values to a sequence matrix by calculating a log-likelihood distance, a CPU-intensive operation.
The resulting system.time values are confusing:
> system.time(mclapply(worksample,function(x){p_seqi_modj(x,worksample[[1]],c(1:17))}))
user system elapsed
29.339 1.242 18.581
I thought that elapsed means the aggregated time (user + system). What does the above result mean in this case, and which time should I orient myself by? My unparallelized version takes less user time and much more elapsed time.
The help page ?system.time says the value returned by the function is an object of class proc_time, and that we should consult ?proc.time. There we learn that user time is
cumulative sum of user and system times of any child processes
so your task has spent about 15s on each core (mclapply defaults to using 2 cores, see the mc.cores argument).
Actually, earlier in the help page we see that proc.time() returns five elements that separate the process and child times, and that the summary method used in printing collapses the user and system times into process + child totals, so there is a bit more information available.
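You can see all five components yourself with something like the following sketch (the toy task is just a placeholder):
library(parallel)
t <- system.time(mclapply(1:4, function(i) sum(rnorm(1e6)), mc.cores = 2))
print(t)           # the usual three-column summary: child times folded into user/system
print(unclass(t))  # all five components: user.self, sys.self, elapsed, user.child, sys.child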
I'm trying to process a bunch of csv files and return data frames in R, in parallel using mclapply(). I have a 64-core machine, and I can't seem to get more than 1 core utilized at the moment using mclapply(). In fact, it is currently a bit quicker to run lapply() than mclapply(). Here is an example that shows that mclapply() is not utilizing more than one of the available cores:
library(parallel)
test <- lapply(1:100,function(x) rnorm(10000))
system.time(x <- lapply(test,function(x) loess.smooth(x,x)))
system.time(x <- mclapply(test,function(x) loess.smooth(x,x), mc.cores=32))
user system elapsed
0.000 0.000 7.234
user system elapsed
0.000 0.000 8.612
Is there some trick to getting this working? I had to compile R from source on this machine (v3.0.1); are there some compile flags that I missed to allow forking? detectCores() tells me that I do indeed have 64 cores to play with...
Any tips appreciated!
I get similar results to you, but if I change rnorm(10000) to rnorm(100000), I get a significant speed-up. I would guess that the additional overhead cancels out any performance benefit for such a small-scale problem.
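For instance, the same comparison with the larger vectors (a sketch of the change I mean; exact timings will depend on your machine) should show mclapply() pulling ahead:
library(parallel)
test <- lapply(1:100, function(x) rnorm(100000))   # 10x more data per element
system.time(x <- lapply(test, function(x) loess.smooth(x, x)))
system.time(x <- mclapply(test, function(x) loess.smooth(x, x), mc.cores = 32))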
I'm playing around with parallelization in R for the first time. As a first toy example, I tried
library(doMC)
registerDoMC()
B<-10000
myFunc<-function()
{
for(i in 1:B) sqrt(i)
}
myFunc2<-function()
{
foreach(i = 1:B) %do% sqrt(i)
}
myParFunc<-function()
{
foreach(i = 1:B) %dopar% sqrt(i)
}
I know that sqrt() executes too fast for parallelization to matter, but what I didn't expect was that foreach() %do% would be slower than for():
> system.time(myFunc())
user system elapsed
0.004 0.000 0.005
> system.time(myFunc2())
user system elapsed
6.756 0.000 6.759
> system.time(myParFunc())
user system elapsed
6.140 0.524 6.096
In most examples that I've seen, foreach() %dopar% is compared to foreach() %do% rather than for(). Since foreach() %do% was much slower than for() in my toy example, I'm now a bit confused. Somehow, I thought that these were equivalent ways of constructing for-loops. What is the difference? Are they ever equivalent? Is foreach() %do% always slower?
UPDATE: Following @Peter Fine's answer, I updated myFunc as follows:
a<-rep(NA,B)
myFunc<-function()
{
for(i in 1:B) a[i]<-sqrt(i)
}
This makes for() a bit slower, but not much:
> system.time(myFunc())
user system elapsed
0.036 0.000 0.035
> system.time(myFunc2())
user system elapsed
6.380 0.000 6.385
for will run sqrt B times, presumably discarding the answer each time. foreach, however, returns a list containing the result of each execution of the loop body. This would contribute considerable extra overhead, regardless of whether it's running in parallel or sequential mode (%dopar% or %do%).
I based my answer on the following code, and it appears to be confirmed by the foreach vignette, which states: "foreach differs from a for loop in that its return is a list of values, whereas a for loop has no value and uses side effects to convey its result."
> print(for(i in 1:10) sqrt(i))
NULL
> print(foreach(i = 1:10) %do% sqrt(i))
[[1]]
[1] 1
[[2]]
[1] 1.414214
[[3]]
... etc
UPDATE: I see from your updated question that the above answer isn't nearly sufficient to account for the performance difference. So I looked at the source code for foreach, and I can see that there is a LOT going on! I haven't tried to understand exactly how it works, but do.R and foreach.R show that even when %do% is run, large parts of the foreach configuration are still executed. This would make sense if the %do% option is largely provided to allow you to test foreach code without having a parallel backend configured and loaded. It also needs to support the more advanced nesting and iteration facilities that foreach provides.
There are references in the code to results caching, error checking, debugging and the creation of local environment variables for the arguments of each iteration (see the function doSEQ in do.R for example). I'd imagine this is what creates the difference that you've observed. Of course, if you were running much more complicated code inside your loop (that would actually benefit from a parallelisation framework like foreach), this overhead would become irrelevant compared with the benefits it provides.
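For example, with a deliberately expensive loop body (a rough sketch; heavy() is a made-up stand-in for real work, and it assumes a multi-core machine), the per-iteration overhead of foreach becomes negligible and %dopar% starts to pay off:
library(doMC)
registerDoMC(2)                                            # assumes at least 2 cores
heavy <- function(i) mean(replicate(20, sd(rnorm(1e5))))   # placeholder for real work
system.time(for (i in 1:50) heavy(i))                      # plain for loop
system.time(foreach(i = 1:50) %do% heavy(i))               # sequential foreach
system.time(foreach(i = 1:50) %dopar% heavy(i))            # parallel foreach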