Getting more info from Rprof() - r

I've been trying to dig into what the time-hogs are in some R code I've written, so I'm using Rprof. The output isn't yet very helpful though:
> summaryRprof()
$by.self
self.time self.pct total.time total.pct
"$<-.data.frame" 2.38 23.2 2.38 23.2
"FUN" 2.04 19.9 10.20 99.6
"[.data.frame" 1.74 17.0 5.54 54.1
"[.factor" 1.42 13.9 2.90 28.3
...
Is there some way to dig deeper and find out which specific invocations of $<-.data.frame, and FUN (which is probably from by()), etc. are actually the culprits? Or will I need to refactor the code and make smaller functional chunks in order to get more fine-grained results?
The only reason I'm resisting refactoring is that I'd have to pass data structures into the functions, and all the passing is by value, so that seems like a step in the wrong direction.
Thanks.

The existing CRAN package profr and proftools are useful for this. The latter can use Rgraphviz which isn't always installable.
The R Wiki page on profiling has additional info and a nice script by Romain which can also visualize (but requires graphviz).

Rprof takes samples of the call stack at intervals of time - that's the good news.
What I would do is get access to the raw stack samples (stackshots) that it collects, and pick several at random and examine them. What I'm looking for is call sites (not just functions, but the places where one function calls another) that appear on multiple samples. For example, if a call site appears on 50% of samples, then that's what it costs, because its possible removal would save roughly 50% of total time. (Seems obvious, right? But it's not well known.)
Not every costly call site is optimizable, but some are, unless the program is already as fast as possible.
(Don't be distracted by issues like how many samples you need to look at. If something is going to save you a reasonable fraction of time, then it appears on a similar fraction of samples. The exact number doesn't matter. What matters is that you find it. Also don't be distracted by graph and recursion and time measurement and counting issues. What matters is, for each call site you see, the fraction of stack samples that show it.)

Parsing the output that Rprof generates isn't too hard, and then you get access to absolutely everything.

Related

eval(parse(...))) pattern for potentially numeric values

I came across the following code and wondering if there is a reason for this given that the num variable comes from a numeric column (integer or double).
num_val <- eval(parse(text = num %>% as.character()))
Could it be a pattern that is (was) useful for something in older versions of R?
No, that hasn't really changed for a very long time.
It allows strings like "1+1" to be evaluated and stored as the result (i.e. 2).
If num was already numeric, it would have the effect of rounding it to the value that as.character() would calculate. Typically that value is not affected by the options("digits") setting and is usually much more precise, but it's not quite the same precision as used internally.
If num held an integer, that would convert it to a double.
If num held a vector of values, that would evaluate all of them, then only keep the last one.
I'd have to see the context, but I can't think of any useful reason to use code like that if num was known to be numeric. Only the first reason (allowing users to enter expressions instead of numbers) would make sense.
A very good general piece of advice is to never use the eval(parse( combination in your own code. From the fortunes package:
> library(fortunes)
> fortune(106)
If the answer is parse() you should usually rethink the question.
-- Thomas Lumley
R-help (February 2005)
There is almost always a better alternative in R, the above code looks like a more complicated, hard to understand version of:
num_val <- as.numeric(as.character(num))
if the values in num are numbers, or a factor with numeric labels then this works (there is a slightly more efficient version in the FAQ). The eval-parse method would work if num contained something like "c(1,4,6)" or "1:10", but in those cases it would probably be better to figure out where num is coming from that it would contain something like the above and figure out a better workflow.
Using eval and parse can be dangerous (if num contains some attack code, then this could cause major problems) and very hard to debug (google for "debug action at a distance").
That said, there is code out there where the programmers used parse and eval as a quick and dirty (but not best) way to do something, then that code was copied and modified and people took the quick and dirty approach to somehow be a reasonable approach, so things like this are out there. But you best approach (for good code and learning to be a better R programmer) is to find better options and never use eval(parse(.
My own contribution to fortunes on this topic:
> fortune(181)
Personally I have never regretted trying not to underestimate my own future
stupidity.
-- Greg Snow (explaining why eval(parse(...)) is often suboptimal, answering a
question triggered by the infamous fortune(106))
R-help (January 2007)

Why is FASTR (ie GraalVM version of R) 10x *slower* compared to normal R despite Oracle's claim of 40x *faster*?

Oracle claims that its graalvm implementaion of R (called "FastR") is up to 40x faster than normal R (https://www.graalvm.org/r/). However, I ran this super simple (but realistic) 4 line test program and not only was GraalVM/FastR not 40x faster, it was actually 10x SLOWER!
x <- 1:300000/300000
mu <- exp(-400*(x-0.6)^2)+
5*exp(-500*(x-0.75)^2)/3+2*exp(-500*(x-0.9)^2)
y <- mu+0.5*rnorm(300000)
t1 <- system.time(fit1 <- smooth.spline(x,y,spar=0.6))
t1
In FASTR, t1 returns this value:
user system elapsed
0.870 0.012 0.901
While in the original normal R, I get this result:
user system elapsed
0.112 0.000 0.113
As you can see, FAST R is super slow even for this simple (ie 4 lines of code, no extra/special library imported etc). I tested this on a 16 core VM on Google Cloud. Thoughts? (FYI: I did a quick peek at the smooth.spline code, and it does call Fortran, but according to the Oracle marketing site, GraalVM/FastR is faster than even Fortran-R code.)
====================================
EDIT:
Per the comments from Ben Bolker and user438383 below, I modified the code to include a for loop so that the code ran for much longer and I had time to monitor CPU usage. The modified code is below:
x <- 1:300000/300000
mu <- exp(-400*(x-0.6)^2)+
5*exp(-500*(x-0.75)^2)/3+2*exp(-500*(x-0.9)^2)
y <- mu+0.5*rnorm(300000)
forloopfunction <- function(xTrain, yTrain) {
for (x in 1:100) {
smooth.spline(xTrain, yTrain, spar=0.6)
}
}
t1 <- system.time(fit1 <-forloopfunction(x,y))
t1
Now, the normal R returns this for t1:
user system elapsed
19.665 0.008 19.667
while FastR returns this:
user system elapsed
76.570 0.210 77.918
So, now, FastR is only 4x slower, but that's still considerably slower. (I would be ok with 5% to even 10% difference, but that's 400% difference.) Moreoever, I checked the cpu usage. Normal R used only 1 core (at 100%) for the entirety of the 19 seconds. However, surprisingly, FastR used between 100% and 300% of CPU usage (ie between 1 full core and 3 full cores) during the ~78 seconds. So, I think it fairly reasonably to conclude that at least for this test (which happens to be a realistic test for my very simple scenario), FastR is at least 4x slower while consuming ~1x to 3x more CPU cores. Particularly given that I'm not importing any special libraries which the FASTR team may not have time to properly analyze (ie I'm using just vanilla R code that ships with R), I think that there's something not quite right with the FASTR implementation, at least when it comes to speed. (I haven't tested accuracy, but that's now moot I think.) Has anyone else experienced anything similar or does anyone know of any "magic" configuration that one needs to do to FASTR to get its claimed speeds (or at least similar, ie +- 5% speeds to normal R)? (Or maybe there's some known limitation to FASTR that I may be able to work around, ie don't use normal fortran binaries etc, but use these special ones etc.)
TL;DR: your example is indeed not the best use-case for FastR, because it spends most of its time in R builtins and Fortran code. There is no reason for it to be slower on FastR, though, and we will work on fixing that. FastR may be still useful for your application overall or just for some selected algorithms that run slowly on GNU-R, but would be a good fit for FastR (loopy, "scalar" code, see FastRCluster package).
As others have mentioned, when it comes to micro benchmarks one needs to repeat the benchmark multiple times to allow the system to warm-up. This is important in any case, but more so for systems that rely on dynamic compilation, like FastR.
Dynamic just-in-time compilation works by first interpreting the program while recording the profile of the execution, i.e., learning how the program executes, and only then compiling the program using this knowledge to optimize it better(*). In case of dynamic languages like R, this can be very beneficial, because we can observe types and other dynamic behavior that is hard if not impossible to statically determine without actually running the program.
It should be now clear why FastR needs few iterations to show the best performance it can achieve. It is true that the interpretation mode of FastR has not been optimized very much, so the first few iterations are actually slower than GNU-R. This is not inherent limitation of the technology that FastR is based on, but tradeoff of where we put our resources. Our priority in FastR has been peak performance, i.e., after a sufficient warm-up for micro benchmarks or for applications that run for long enough time.
To your concrete example. I could also reproduce the issue and I analyzed it by running the program with builtin CPU sampler:
$GRAALVM_HOME/bin/Rscript --cpusampler --cpusampler.Delay=20000 --engine.TraceCompilation example.R
...
-----------------------------------------------------------------------------------------------------------
Thread[main,5,main]
Name || Total Time || Self Time || Location
-----------------------------------------------------------------------------------------------------------
order || 2190ms 81.4% || 2190ms 81.4% || order.r~1-42:0-1567
which || 70ms 2.6% || 70ms 2.6% || which.r~1-6:0-194
ifelse || 140ms 5.2% || 70ms 2.6% || ifelse.r~1-34:0-1109
...
--cpusampler.Delay=20000 delays the start of sampling by 20 seconds
--engine.TraceCompilation prints basic info about the JIT compilation
when the program finishes, it prints the table from CPU sampler
(example.R runs the micro benchmark in a loop)
One observation is that the Fotran routine called from smooth.spline is not to blame here. It makes sense because FastR runs the very same native Fortran code as GNU-R. FastR does have to convert the data to native memory, but that is probably small cost compared to the computation itself. Also the transition between native and R code is in general more expensive on FastR, but here it does not play a role.
So the problem here seems to be a builtin function order. In GNU-R builtin functions are implemented in C, they basically do a big switch on the type of the input (integer/real/...) and then just execute highly optimized C code doing the work on plain C integer/double/... array. That is already the most effective thing one can do and FastR cannot beat that, but there is no reason for it to not be as fast. Indeed it turns out that there is a performance bug in FastR and the fix is on its way to master. Thank you for bringing our attention to it.
Other points raised:
but according to the Oracle marketing site, GraalVM/FastR is faster than even Fortran-R code
YMMV. That concrete benchmark presented at our website does spend considerable amount of time in R code, so the overhead of R<->native transition does not skew the result as much. The best results are when translating the Fortran code to R, so making the whole thing just a pure R program. This shows that FastR can run the same algorithm in R as fast as or quite close to Fortran and that is, performance wise, the main benefit of FastR. There is no free lunch. Warm-up time and the costs of R<->native transition is currently the price to pay.
FastR used between 100% and 300% of CPU usage
This is due to JIT compilations going on on background threads. Again, no free lunch.
To summarize:
FastR can run R code faster by using dynamic just-in-time compilation and optimizing chunks of R code (functions or possibly multiple functions inlined into one compilation unit) to the point that it can get close or even match equivalent native code, i.e., significantly faster than GNU-R. This matters on "scalar" R code, i.e., code with loops. For code that spends majority of time in builtin R functions, like, e.g., sum((x - mean(x))^2) for large x, this doesn't gain that much, because that code already spends much of the time in optimized native code even on GNU-R.
What FastR cannot do is to beat GNU-R on execution of a single R builtin function, which is likely to be already highly optimized C code in GNU-R. For individual builtins we may beat GNU-R, because we happen to choose slightly better algorithm or GNU-R has some performance bug somewhere, or it can be the other way around like in this case.
What FastR also cannot do is speeding up native code, like Fortran routines that some R code may call. FastR runs the very same native code. On top of that, the transition between native and R code is more costly in FastR, so programs doing this transition too often may end up being slower on FastR.
Note: what FastR can do and is a work-in-progress is to run LLVM bitcode instead of the native code. GraalVM supports execution of LLVM bitcode and can optimize it together with other languages, which removes the cost of the R<->native transition and even gives more power to the compiler to optimize across this boundary.
Note: you can use FastR via the cluster package interface to execute only parts of you application.
(*) the first profiling tier may be also compiled, which gives different tradeoffs

How to pair events from different recordings?

I have what I think to be an optimization problem that I have already solved. I think, however, that there might be a more robust solution which I don't quite know how to implement. Maybe someone here has some ideas.
Problem Description
I need to pair events from two separate recordings. An event has a name, a time t at which it occurred and an amplitude a with the constraint that the next event occurs at t_next = t + a. The two recordings should produce events at the same time and amplitude, but there are recording errors to deal with like drift.
The task is to group the same events into pairs of 2.
Here is an example:
Event Name t a t a
39770 155648.16 41.96 154726.4 41.75
39780 155690.12 42.01 154768.15 41.78
39790 155732.13 41.24 154809.94 41.13
39800 154851.06 5.74
39810 155773.37 1.55
39815 155774.03 1.55
39820 155774.92 3.75
39830 155778.67 0.65
39840 155779.32 22.35 154856.81 22.24
39850 155801.67 1.36 154879.04 1.27
39855 155802.4 1.36 154879.66 1.27
39860 155803.04 12.79 154880.31 12.74
As you can see, the t and a values are not exactly the same within a pair (same row), so I have to work with tolerances. Sometimes, you have events in recording 1 what you don't have in recording 2 and vice verse. So how can I pair them?
Idea For Solution
I was thinking of a "rubberband" approach somehow. Basically, you group the events by amplitude within the allowed tolerance such that you maximize the number of paired up events. But I'm not sure how to deal with missing events in one recording vs the other. Somehow, I have to ignore them and I'm not sure how to best do that without this problem exploding due to exorbitant combinations to try.
Has anyone come across this kind of problem and knows a robust solution for it?

Why is Unix/Terminal faster than R?

I'm new to Unix, however, I have recently realized that very simple Unix commands can do very simple things to large data set very very quickly. My question is why are these Unix commands so fast relative to R?
Let's begin by assuming that the data is big, but not larger than the amount of RAM on your computer.
Computationally, I understand that Unix commands are likely faster than their R counterparts. However, I can't imagine that this would explain the entire time difference. After all basic R functions, like Unix commands, are written in low-level languages like C/C++.
I therefore suspect that the speed gains have to do with I/O. While I only have a basic understanding of how computers work, I do understand that to manipulate data it most first be read from disk (assuming the data is local). This is slow. However, regardless of whether you use R functions or Unix commands to manipulate data both most obtain the data from disk.
Therefore I suspect that how data is read from disk, if that even makes sense, is what is driving the time difference. Is that intuition correct?
Thanks!
UPDATE: Sorry for being vague. This was done on purpose, I was hoping to discuss this idea in general, rather than focus on a specific example.
Regardless, I'll generate an example of counting the number of rows
First I'll generate a big data set.
row = 1e7
col = 50
df<-matrix(rpois(row*col,1),row,col)
write.csv(df,"df.csv")
Doing it with Unix
time wc -l df.csv
real 0m12.261s
user 0m1.668s
sys 0m2.589s
Doing it with R
library(data.table)
system.time({ nrow(fread("df.csv")) })
...
user system elapsed
26.77 1.67 47.07
Notice that elapsed/real > user + system. This suggests that the CPU is waiting on the disk.
I suspected the slow speed of R has to do with reading the data in. It appears that I'm right:
system.time(fread("df.csv"))
user system elapsed
34.69 2.81 47.41
My question is how is the I/O different for Unix and R. Why?
I'm not sure what operations you're talking about, but in general, more complex processing systems like R use more complex internal data structures to represent the data being manipulated, and constructing these data structures can be a big bottleneck, significantly slower than the simple lines, words, and characters that Unix commands like grep tend to operate on.
Another factor (depending on how your scripts are set up) is whether you're processing the data one thing at a time, in "streaming mode", or reading everything into memory. Unix commands tend to be written to operate in pipelines, and to read a small piece of data (usually one line), process it, maybe write out a result, and move on to the next line. If, on the other hand, you read the entire data set into memory before processing it, then even if you do have enough RAM, allocating and organizing all the necessary memory can be very expensive.
[updated in response to your additional information]
Aha. So you were asking R to read the whole file into memory at once. That accounts for much of the difference. Let's talk about a few more things.
I/O. We can think about three ways of reading characters from a file, especially if the style of processing we're doing affects the way that's most convenient to do the reading.
Unbuffered small, random reads. We ask the operating system for 1 or a few characters at a time, and process them as we read them.
Unbuffered large, block-sized reads. We ask the operating for big chunks of memory -- usually of a size like 1k or 8k -- and chew on each chunk in memory before asking for the next chunk.
Buffered reads. Our programming language gives us a way of asking for as many characters as we want out of an intermediate buffer, and code that's built into the language ("library" code) automatically takes care of keeping that buffer full by reading large, block-sized chunks from the operating system.
Now, the important thing to know is that the operating system would much rather read big, block-sized chunks. So #1 can be drastically slower than 2 and 3. (I've seen factors of 10 or 100.) But no well-written programs use #1, so we can pretty much forget about it. As long as you're using 2 or 3, the I/O speed will be roughly the same. (In extreme cases, if you know what you're doing, you can get a little efficiency increase by using 2 instead of 3, if you can.)
Now let's talk about the way each program processes the data. wc has basically 5 steps:
Read characters one at a time. (I can assure you it uses method 3.)
For each character read, add one to the character count.
If the character read was a newline, add one to the line count.
If the character read was or wasn't a word-separator character, update the word count.
At the very end, print out the counts of lines, words, and/or characters, as requested.
So as you can see it's all I/O and very simple, character-based processing. (The only step that's at all complicated is 4. As an exercise, I once wrote a version of wc that contrived not to do all of steps 2, 3, and 4 inside the read loop if the user didn't ask for all the counts. My version did indeed run significantly faster if you invoked wc -c or wc -l. But obviously the code was significantly more complicated.)
In the case of R, on the other hand, things are quite a bit more complicated. First, you told it to read a CSV file. So as it reads, it has to find the newlines separating lines and the commas separating columns. That's roughly equivalent to the processing that wc has to do. But then, for each number that it finds, it has to convert it into an internal number that it can work with efficiently. For example, if somewhere in the CSV file occurs the sequence
...,12345,...
R is going to have to read those digits (as individual characters) and then do the equivalent of the math problem
1 * 10000 + 2 * 1000 + 3 * 100 + 4 * 10 + 5 * 1
to get the value 12345.
But there's more. You asked R to build a table. A table is a specific, highly regular data structure which orders all the data into rigid rows and columns for efficient lookup. To see how much work that can be, let's use a slightly far-fetched hypothetical real-world example.
Suppose you're a survey company and it's your job to ask people walking by on the street certain questions. But suppose that the questions are complicated enough that you need all the people seated in a classroom at once. (Suppose further that the people don't mind this inconvenience.)
But first you have to build that classroom. You're not sure how many people are going to walk by, so you build an ordinary classroom, with room for 5 rows of 6 desks for 30 people, and you haul in the desks, and the people start filing in, and after 30 people file in you notice there's a 31st, so what do you do? You could ask him to stand in the back, but you're kind of fixated on the rigid-rows-and-columns idea, so you ask the 31st person to wait, and you quickly call the builders and ask them to build a second 30-person classroom right next to the first, and now you can accept the 31st person and in fact 29 more for a total of 60, but then you notice a 61st person.
So you ask him to wait, and you call the builders back again, and you have them build two more classrooms, so now you've got a nice 2x2 grid of 30-person classrooms, but the people keep coming and soon enough the 121st person shows up and there's not enough room and you still haven't even started asking your survey questions yet.
So you call some fancier builders that know how to do steelwork and you have them build a big 5-story building next door with 50-person classrooms, 5 on each floor, for a total of 50 x 5 x 5 = 1,250 desks, and you have the first 120 people (who've been waiting patiently) file out of the old rooms into the new building, and now there's room for the 121st person and quite a few more behind him, and you hire some wreckers to demolish the old classrooms and recycle some of the materials, and the people keep coming and pretty soon there's 1,250 people in your new building waiting to be surveyed and the 1,251st has just showed up.
So you build a giant new skyscraper with 1,000 desks on each floor and 100 floors, and you demolish the old 5-story building, but the people keep coming, and how big did you say your big data set was? 1e7 x 50? So I don't think the 100-story building is going to be big enough, either. (And when you're all done with all this, the only "survey question" you're going to ask is "How many rows are there?")
Contrived as it may seem, this is actually not too bad an analogy for what R is having to do internally to build the table to store that data set in.
Meanwhile, Bob's discount survey company, who can only tell you how many people he surveyed and how many were men and women and in which age brackets, is down there on the streetcorner, and the people are filing by, and Bob is jotting down tally marks on his clipboards, and the people, once surveyed, are walking away and going about their business, and Bob isn't wasting time and money building any classrooms at all.
I don't know anything about R, but see if there's a way to construct an empty 1e7 x 50 matrix up front, and read the CSV file into it. You might find that significantly quicker. R will still have to do some building, but at least it won't have any false starts.

Digging into R profiling information

I am trying to optimize a bit of code, and am puzzled about information from summaryRprof(). In particular, it looks like a number of calls are made to external C programs, but I'm not able to pin down which C program, from which R function. I am planning to resolve this through a bunch of slicing and dicing of the code, but wondered if I am overlooking some better way to interpret the profiling data.
The highest-consuming function is .Call, which is apparently a generic description for calls to C code; the next leading functions appear to be assignment operations:
$by.self
self.time self.pct total.time total.pct
".Call" 2281.0 54.40 2312.0 55.14
"[.data.frame" 145.0 3.46 218.5 5.21
"initialize" 123.5 2.95 217.5 5.19
"$<-.data.frame" 121.5 2.90 121.5 2.90
"as.vector" 110.5 2.64 416.0 9.92
I decided to focus on the .Call to see how this arises. I looked through the profiling file to find those entries with .Call in the call stack, and the following are the top entries in the call stack (by count of # of appearances):
13640 "eval"
11252 "["
7044 "standardGeneric"
4691 "<Anonymous>"
4658 "tryCatch"
4654 "tryCatchList"
4652 "tryCatchOne"
4648 "doTryCatch"
This list is as clear as mud: I have <Anonymous> and standardGeneric in there.
I believe this is due to calls to functions in the Matrix package, but that's because I'm looking at the code and that package appears to be the only possible source of C code. However, a lot of different functions from Matrix are called in this package, and it seems very difficult to determine which function is consuming this time.
So, my question is pretty basic: is there some way of deciphering and attributing these calls (e.g. .Call, <Anonymous>, etc.) in some other way? The plot of the call graph for this code is rather tricky to render, given the # of functions involved.
The fallback tactics I see are to either (1) comment out bits of code (and hack around to make the code work with this) to see where the time consumption occurs, or to (2) wrap certain operations inside of other functions and see when those functions appear on the call stack. The latter is inelegant, but it seems like it's the best way to add a tag to the call stack. The former is unpleasant because it takes quite some time to run the code, and iteratively uncommenting code and rerunning is unpleasant.
May I suggest you use the profr package. This is another bit of Hadley magic. It's a wrapper around Rprof and gives a visulation of the call stack and timings.
I find profr very easy to use and interpret. For example, here is a profile of a bit of ddply example code and the resulting profr plot:
library(profr)
p <- profr(
ddply(baseball, .(year), "nrow"),
0.01
)
plot(p)
You can immediately see the following:
How ddply calls ldply, llply and loop_apply.
Inside loop_apply there is a .Call function.
You can confirm this by reading the source code for loop_apply:
> plyr:::loop_apply
function (n, f, env = parent.frame())
{
.Call("loop_apply", as.integer(n), f, env)
}
<environment: namespace:plyr>
Edit. There is something very odd about the ggplot.profr method. I have proposed the following fix to Hadley. (You may wish to try this on your example.)
ggplot.profr <- function (data, ..., minlabel = 0.1, angle = 0){
if (!require("ggplot2", quiet = TRUE))
stop("Please install ggplot2 to use this plotting method")
data$range <- diff(range(data$time))
ggplot(as.data.frame(data), aes(y=level)) +
geom_rect(
#aes(xmin=(level), xmax=factor(level)+1, ymin=start, ymax=end),
aes(ymin=level-0.5, ymax=level+0.5, xmin=start, xmax=end),
#position = "identity", stat = "identity", width = 1,
fill = "grey95",
colour = "black", size = 0.5) +
geom_text(aes(label = f, x = start + range/60),
data = subset(data, time > max(time) * minlabel), size = 4, angle = angle, vjust=0.5, hjust = 0) +
scale_x_continuous("time") +
scale_y_continuous("level")
}
It seems that the short answer is "No" and the long answer is "Yes, but you're not going to enjoy this." Even answering this question is going to take some time (so stick around, I may be updating it).
There are several basic things to get one's head around when working with profiling in R:
First, there are many different ways to think about profiling. It is quite typical to think in terms of a call stack. At any given instant, this is the sequence of function calls that are active, essentially nested within each other (subroutines, if you will). This is quite useful for understanding the state of evaluations, where functions will return, and lots of other things that are important for seeing things as the computer / interpreter / OS may see them. Rprof does call stack profiling.
Second, a different perspective is that I've got a bunch of code and a particular call is taking a long time: which line in my code caused that call to be made? This is line profiling. R doesn't have line profiling, as far as I can tell. This is in contrast with Python and Matlab, which both have line profilers.
Third, the map from from lines to calls is surjective, but it is not bijective: given a particular call stack we cannot guarantee that we can map it back to the code. In fact, call stack analyses often summarize the calls completely out of context of the whole stack (i.e. cumulative times are reported no matter where that call was on all of the different stacks in which it occurred).
Fourth, even though we have these constraints, we can put on our statistical hats and analyze the call stack data carefully and see what we can make of it. The call stack information is data and we like data, don't we? :)
Just a quick intro to a call stack. Let's just assume that our call stack looked like this:
"C" "B" "A"
This means that function A called B which then calls C (the order is reversed), and the call stack is 3 levels deep. In my code, the call stack gets to as many as 41 levels deep. Since the stacks can be so deep and are presented in reverse order, this is more interpretable by software than by a human. Naturally, we begin cleaning and transforming this data. :)
Now, our data really comes along looking like:
".Call" "subCsp_cols" "[" "standardGeneric" "[" "eval" "eval" "callGeneric"
"[" "standardGeneric" "[" "myFunc2" "myFunc1" "eval" "eval" "doTryCatch"
"tryCatchOne" "tryCatchList" "tryCatch" "FUN" "lapply" "mclapply"
"<Anonymous>" "%dopar%"
Miserable, isn't it? It even has duplicates of things like eval, some guy called <Anonymous> - probably some darn hacker. (Anonymous is legion, by the way. :-))
The first step in transforming this into something useful was to split each line of Rprof() output and reverse the entries (via strsplit and rev). The first 12 entries (last 12 if you look at the raw call stack, rather than the post-rev version) were the same for every line (of which there were about 12000, the sampling interval was 0.5 seconds - so about 100 minutes of profiling), and these can be discarded.
Remember, we're still interested in knowing which line(s) led to .Call, which took so much time. Before we get to that question, we put on our statistical caps: the profiling reports, e.g. from summaryRprof, profr, ggplot, etc., only reflect the cumulative time spent for a given call or for calls beneath a given call. What does this cumulative information not tell us? Bingo: whether that call was made many times, or a few, and whether the time spent was constant over all invocations of that call or whether there are some outliers. A particular function might be executed 100 times or 100K times, but all of the cost may come from a single invocation (it shouldn't, but we don't know until we look at the data).
This only begins to describe the fun. The A->B->C example doesn't reflect the way things may really appear, such as A->B->C->D->B->E. Now, "B" may be counted a couple of times. What's more, suppose that a lot of time is spent in the C level, but we never sample at precisely that level, only seeing its child calls in the stack. We may see a sizable time for "total.time", but not for "self.time". If there are lots of different child calls under C, we may lose sight of what to optimize - should we take out C altogether or tweak the children, B, D, and E?
Just to account for the time spent, I took the sequences and ran them through digest, storing counts for the digested values, via hash. I also split up the sequences, storing {(A),(A,B), (A,B,C), etc.}. This doesn't seem so interesting, but removing singletons from the counts helps a lot in cleaning up the data. We can also store the time spent in each call by using rle(). This is useful for analyzing the distribution of time spent for a given call.
Still we're nowhere closer to finding the actual time spent per line of code. We'll never get lines of code from the call stack. A simpler way to do this is to store a list of times throughout the code, which stores the output of proc.time(), for a given invocation. Taking the difference of these times reveals which lines or sections of code are taking a long time. (Hint: that's what we're really looking for, not the actual calls.)
However, we have this call stack and we might as well do something useful. Going up the stack is somewhat interesting, but if we rewind the profile information to a little earlier, we can find which calls tend to precede the longer running calls. This allows us to look for landmarks in the call stack - positions where we can tie a call to a particular line of code. This makes it a bit easier to map more calls back to code, if all we have is the call stack, rather than instrumented code. (As I keep mentioning: out of context, there isn't a 1:1 mapping, but at a fine enough granularity, especially in repeatedly hit calls that are distinctive, you may be able to find landmarks in the calls that map to code.)
Altogether, I was able to find which calls were taking a lot of time, whether that was based on 1 long interval or many small ones, what the distribution of time spent was like, and, with some effort, I was able to map the most important & time consuming calls back to the code and discover which parts of the code could benefit the most from rewriting or from a change in algorithms.
Statistical analyses of the call stack is loads of fun, but investigating a particular call based on cumulative time consumption is not a very good way to go. The cumulative time consumed by a call is informative on a relative basis, but it doesn't enlighten us as whether one or many calls consumed this time, nor the depth of the call in the stack, nor the section of code responsible for the invocations. The first two things can be addressed via a bit more R code, while the latter is best pursued through instrumented code.
As R doesn't yet have line profilers like Python and Matlab, the simplest way to handle this is to just instrument one's code.
A line in a profile file might look like
"strsplit" ".parseTabix" ".readVcf" "readVcf" "standardGeneric" "readVcf" "system.time"
which says, reading right to left, that the outermost function was system.time, which invoked readVcf, which was an S4 generic that dispatched to a readVcf method, invoking a function .readVcf, which invoked .parseTabix, which finally called strsplit.
Here we read in the profile file, sort the lines, tally them up (using rle -- run length encoding), then select the six most common paths in the profile file
r = rle(sort(readLines("readVcf.Rprof"))
o = order(r$lengths, decreasing=TRUE)
r$values[head(o)]
This
r$lengths[head(o)]
tells us how many times each of those call stacks were sampled.
There are some common patterns that can help interpret this. Here's an S4 generic being dispatched to its method
"readVcf" "standardGeneric" "readVcf"
an lapply iterating over its function
"FUN" "lapply"
and a tryCatch surrounding a .Call
".Call" "doTryCatch" "tryCatchOne" "tryCatchList" "tryCatch"
Usually one tries to profile relatively small chunks of code, rather than a whole script, with the small chunk identified by, e.g., stepping through the code interactively or making some educated guesses about what parts are likely to be slow. The fact that .Call is the most commonly sampled function is not encouraging -- it suggests most of the time is already being spent in C. Likely your best bet will involve coming up with a better overall algorithm, rather than say a brute force approach.

Resources