Why could Shark be so slow?

I'm trying to profile the Ruby interpreter. I run shark -i ./ruby bm_sudoku.rb or something like that; the script finishes in less than a second, and then Shark reaches "CHUDData - Analyzing samples... 99.3%.." and stays frozen there for about 10 minutes. It finishes eventually, but it's so ridiculously slow it's pretty much unusable.
The version I have here is OS X 10.5, Shark 4.6.1 (227).
Any ideas what that might be?

There's some sort of strange problem in 10.5 where Shark is glacially slow loading symbols, which can take several minutes. I don't see the problem in 10.6, and that mirrors the behaviour reported by Chromium developers.

Try Instruments -- its CPU sampler is similar to Shark and it is a lot faster, with no pause at the end.

Related

Why is FastR (i.e. the GraalVM version of R) 10x *slower* than normal R despite Oracle's claim of 40x *faster*?

Oracle claims that its GraalVM implementation of R (called "FastR") is up to 40x faster than normal R (https://www.graalvm.org/r/). However, I ran this super simple (but realistic) 4-line test program, and not only was GraalVM/FastR not 40x faster, it was actually 10x SLOWER!
x <- 1:300000/300000
mu <- exp(-400*(x-0.6)^2) +
  5*exp(-500*(x-0.75)^2)/3 + 2*exp(-500*(x-0.9)^2)
y <- mu + 0.5*rnorm(300000)
t1 <- system.time(fit1 <- smooth.spline(x, y, spar=0.6))
t1
In FastR, t1 returns this value:
   user  system elapsed
  0.870   0.012   0.901
While in the original normal R, I get this result:
   user  system elapsed
  0.112   0.000   0.113
As you can see, FastR is super slow even for this simple program (i.e. 4 lines of code, no extra/special libraries imported, etc.). I tested this on a 16-core VM on Google Cloud. Thoughts? (FYI: I took a quick peek at the smooth.spline code, and it does call Fortran, but according to the Oracle marketing site, GraalVM/FastR is faster than even Fortran-R code.)
====================================
EDIT:
Per the comments from Ben Bolker and user438383 below, I modified the code to include a for loop so that it ran for much longer and I had time to monitor CPU usage. The modified code is below:
x <- 1:300000/300000
mu <- exp(-400*(x-0.6)^2) +
  5*exp(-500*(x-0.75)^2)/3 + 2*exp(-500*(x-0.9)^2)
y <- mu + 0.5*rnorm(300000)
forloopfunction <- function(xTrain, yTrain) {
  for (x in 1:100) {
    smooth.spline(xTrain, yTrain, spar=0.6)
  }
}
t1 <- system.time(fit1 <- forloopfunction(x, y))
t1
Now, the normal R returns this for t1:
   user  system elapsed
 19.665   0.008  19.667
while FastR returns this:
   user  system elapsed
 76.570   0.210  77.918
So now FastR is only 4x slower, but that's still considerably slower. (I would be OK with a 5% or even 10% difference, but that's a 400% difference.) Moreover, I checked the CPU usage. Normal R used only 1 core (at 100%) for the entirety of the 19 seconds. However, surprisingly, FastR used between 100% and 300% CPU (i.e. between 1 and 3 full cores) during the ~78 seconds. So I think it is fairly reasonable to conclude that, at least for this test (which happens to be a realistic test for my very simple scenario), FastR is at least 4x slower while consuming 1x to 3x as many CPU cores.
Particularly given that I'm not importing any special libraries which the FastR team may not have had time to properly analyze (i.e. I'm using just the vanilla R code that ships with R), I think there's something not quite right with the FastR implementation, at least when it comes to speed. (I haven't tested accuracy, but that's moot now, I think.) Has anyone else experienced anything similar, or does anyone know of any "magic" configuration one needs to apply to FastR to get its claimed speeds (or at least similar, i.e. within +-5% of normal R)? (Or maybe there's some known limitation to FastR that I may be able to work around, e.g. don't use the normal Fortran binaries, use these special ones instead, etc.)
TL;DR: your example is indeed not the best use case for FastR, because it spends most of its time in R builtins and Fortran code. There is no reason for it to be slower on FastR, though, and we will work on fixing that. FastR may still be useful for your application overall, or just for selected algorithms that run slowly on GNU-R but are a good fit for FastR (loopy, "scalar" code; see the FastRCluster package).
As others have mentioned, when it comes to micro benchmarks, one needs to repeat the benchmark multiple times to allow the system to warm up. This is important in any case, but more so for systems that rely on dynamic compilation, like FastR.
Dynamic just-in-time compilation works by first interpreting the program while recording a profile of the execution, i.e., learning how the program executes, and only then compiling the program using this knowledge to optimize it better(*). In the case of dynamic languages like R, this can be very beneficial, because we can observe types and other dynamic behavior that is hard, if not impossible, to determine statically without actually running the program.
It should now be clear why FastR needs a few iterations to show the best performance it can achieve. It is true that the interpretation mode of FastR has not been optimized very much, so the first few iterations are actually slower than GNU-R. This is not an inherent limitation of the technology that FastR is based on, but a tradeoff of where we put our resources. Our priority in FastR has been peak performance, i.e., performance after a sufficient warm-up, for micro benchmarks or for applications that run long enough.
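A minimal sketch of such a repeated benchmark (the 20-iteration count is an arbitrary choice for illustration, not from the original post):

```r
# Sketch: run the same workload repeatedly so a JIT-based runtime can warm up.
# The data setup mirrors the example above.
x <- 1:300000/300000
mu <- exp(-400*(x-0.6)^2) + 5*exp(-500*(x-0.75)^2)/3 + 2*exp(-500*(x-0.9)^2)
y <- mu + 0.5*rnorm(300000)
timings <- sapply(1:20, function(i) {
  system.time(smooth.spline(x, y, spar=0.6))["elapsed"]
})
# On FastR, the first few iterations (interpreted + profiling) should be the
# slowest; compare them against the tail once compilation has kicked in.
round(head(timings, 3), 3)
round(tail(timings, 3), 3)
```

Comparing the head and tail of `timings` separates warm-up cost from peak performance, which is the number FastR's claims are about.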
To your concrete example: I could also reproduce the issue, and I analyzed it by running the program with the built-in CPU sampler:
$GRAALVM_HOME/bin/Rscript --cpusampler --cpusampler.Delay=20000 --engine.TraceCompilation example.R
...
-----------------------------------------------------------------------------------------------------------
Thread[main,5,main]
Name || Total Time || Self Time || Location
-----------------------------------------------------------------------------------------------------------
order || 2190ms 81.4% || 2190ms 81.4% || order.r~1-42:0-1567
which || 70ms 2.6% || 70ms 2.6% || which.r~1-6:0-194
ifelse || 140ms 5.2% || 70ms 2.6% || ifelse.r~1-34:0-1109
...
--cpusampler.Delay=20000 delays the start of sampling by 20 seconds
--engine.TraceCompilation prints basic info about the JIT compilation
when the program finishes, it prints the table from CPU sampler
(example.R runs the micro benchmark in a loop)
One observation is that the Fortran routine called from smooth.spline is not to blame here. That makes sense, because FastR runs the very same native Fortran code as GNU-R. FastR does have to convert the data to native memory, but that is probably a small cost compared to the computation itself. Also, the transition between native and R code is in general more expensive on FastR, but it does not play a role here.
So the problem here seems to be the builtin function order. In GNU-R, builtin functions are implemented in C; they basically do a big switch on the type of the input (integer/real/...) and then execute highly optimized C code that does the work on a plain C integer/double/... array. That is already the most effective thing one can do, and FastR cannot beat it, but there is no reason for FastR not to be as fast. Indeed, it turns out that there is a performance bug in FastR, and the fix is on its way to master. Thank you for bringing it to our attention.
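One way to check such a hypothesis on both runtimes is to time the builtin in isolation (a rough sketch; the vector size and repetition count are arbitrary):

```r
# Rough sketch: benchmark order() alone, on GNU-R and on FastR, to isolate
# the builtin from the rest of smooth.spline.
v <- runif(1e6)
system.time(for (i in 1:10) order(v))
```

If the gap between the two runtimes on this snippet matches the gap in the sampler output above, the builtin is confirmed as the hot spot.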
Other points raised:
but according to the Oracle marketing site, GraalVM/FastR is faster than even Fortran-R code
YMMV. The concrete benchmark presented on our website spends a considerable amount of time in R code, so the overhead of the R<->native transition does not skew the result as much. The best results come from translating the Fortran code to R, making the whole thing a pure R program. This shows that FastR can run the same algorithm in R as fast as, or quite close to, Fortran, and that is, performance-wise, the main benefit of FastR. There is no free lunch: warm-up time and the cost of the R<->native transition are currently the price to pay.
FastR used between 100% and 300% of CPU usage
This is due to JIT compilation happening on background threads. Again, no free lunch.
To summarize:
FastR can run R code faster by using dynamic just-in-time compilation and optimizing chunks of R code (functions, or possibly multiple functions inlined into one compilation unit) to the point that it can get close to, or even match, equivalent native code, i.e., significantly faster than GNU-R. This matters for "scalar" R code, i.e., code with loops. For code that spends the majority of its time in builtin R functions, like, e.g., sum((x - mean(x))^2) for large x, this doesn't gain much, because that code already spends most of its time in optimized native code even on GNU-R.
What FastR cannot do is beat GNU-R on the execution of a single R builtin function, which is likely to be already highly optimized C code in GNU-R. For individual builtins we may beat GNU-R, because we happen to choose a slightly better algorithm or GNU-R has a performance bug somewhere, or it can be the other way around, as in this case.
What FastR also cannot do is speed up native code, like the Fortran routines that some R code may call. FastR runs the very same native code. On top of that, the transition between native and R code is more costly in FastR, so programs that make this transition too often may end up slower on FastR.
Note: what FastR can do and is a work-in-progress is to run LLVM bitcode instead of the native code. GraalVM supports execution of LLVM bitcode and can optimize it together with other languages, which removes the cost of the R<->native transition and even gives more power to the compiler to optimize across this boundary.
Note: you can use FastR via the cluster package interface to execute only parts of your application.
(*) the first profiling tier may be also compiled, which gives different tradeoffs

Julia: ctrl+c does not interrupt

I'm using the REPL inside VS Code and trying to fix code that gets stuck inside a certain package. I want to figure out which call is taking so long by looking at the stack trace, but I cannot interrupt it because the REPL does not respond to Ctrl+C. I pressed Ctrl+X by accident and that showed ^X on the screen.
I am using JuMP and GLPK so it could be stuck there. However, I am not seeing any outputs.
I would also appreciate any tips on figuring out which process is causing it to be stuck.
Interrupts are not implemented in GLPK.jl. I've opened an issue: https://github.com/jump-dev/GLPK.jl/issues/171 (but it's unlikely to get fixed quickly).
If you're interested in contributing to JuMP, it's a good issue to get started with. You could look at the Gurobi.jl code for how we handle interrupts there as inspiration.
I started out using GLPK.jl and I also found that it would "hang" on large problems. However, I recommend trying the Cbc.jl solver. It has a time limit parameter which will interrupt the solver in a set number of seconds. I found it to produce quality results. (Or you could use Cbc for Dev/QA testing to determine what might be causing the hanging and switch to GLPK for your production runs.)
You can set the time limit using the seconds parameter as follows.
For newer package versions:
model = Model(optimizer_with_attributes(Cbc.Optimizer,
    "seconds" => 60,
    "threads" => 4,
    "loglevel" => 0,
    "ratioGap" => 0.0001))
Or like this for older package versions:
model = Model(with_optimizer(Cbc.Optimizer,
    seconds=60,
    threads=4,
    loglevel=0,
    ratioGap=0.0001))
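Recent JuMP versions also expose a solver-independent way to set a time limit, which (to the best of my knowledge) works with any optimizer that reports support for it; a sketch, using Cbc as the example backend:

```julia
# Sketch: solver-independent time limit via JuMP's generic attribute API.
using JuMP, Cbc

model = Model(Cbc.Optimizer)
set_time_limit_sec(model, 60.0)  # give up after 60 seconds
```

This avoids hard-coding solver-specific parameter names like "seconds", so the same code keeps working if you later swap out the solver.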

Data corruption when uploading a texture (Direct3D11)

I have encountered the following issue in my Direct3D11-based application:
Sometimes (and on some machines) I get a texture which is seemingly corrupted, a couple of lines are just black. Something like this (showing one texture):
Sometimes it is as bad as this, sometimes it is only one or two lines. I made sure that it is not a rendering issue; all observations indicate that the texture-data is not correct.
So far this has been observed only under the following conditions:
we are creating an immutable texture (CPUAccessFlags=None, Usage=Default, BindFlags=ShaderResource)
using a laptop with "Intel HD Graphics 520" and "AMD Radeon R7 M360" with Windows 7 (with "AMD Enduro"-technology)
only with texture-formats "R16G16B16A16_UNorm", "R8G8B8A8_UNorm", "R32_Float", not with "R8_UNorm" and "R16_UNorm" (for the rest it is unknown to me whether they work or not)
and it seems to occur only intermittently (seems to be more likely if the GPU is quite busy)
I tried putting together a small sample which reproduces the issue: https://github.com/ptahmose/ImmutableTextureUploadTest
In this code, I am uploading an immutable texture, then copying it to a staging texture, and then comparing the data to what was uploaded.
Now, with this repro-project I get the following:
If this repro-application is running alone on the machine, all is fine.
If I run another application which uses D3D11 on the machine, we get errors with this test-application.
I.e. this test-application reports errors if we run two instances of it (or, if I run the actual D3D11-based application).
Did somebody run across something similar? Can this issue be reproduced? Am I missing something? ...and of course: what are my options to solve/work around this issue?
Thanks for reading!

How does RStudio determine the console width, and why does it seem to be getting it consistently wrong?

I just discovered wid <- options()$width in RStudio, and it seems to be the source (or rather, much closer to the source) of much irritation in my everyday console usage. I should say up front that I'm currently on R 3.2.2, RStudio 0.99.491, on Linux Mint 17.3 (built over Ubuntu 14.04.3 LTS)
As I understand it, wid should be measured in characters -- if wid is equal to 52, say, then one should be able to fit the alphabet on the screen twice (given the fixed-width default font), but this doesn't appear to be the case:
As you can see, despite having wid equal to 52, I am unable to fit the alphabet twice -- I come up 6 characters short. I also note that this means it is not simply due to the presence of the command prompt arrow and space (> ).
The problem seems roughly proportional -- if I have wid up to 78, I can only fit 70 characters; up to 104, only 93. So the usable width is consistently about 88-90% of wid (side note: this also suggests my assumption that wid is measured in characters is probably right).
The problem this engenders is that console output often overflows beyond its intended line, making the output ugly and hard to digest; take, for example, the simple snippet setDT(lapply(1:30, function(x) 1:3))[] which produces for me:
It seems clear to me that the output was attempted on a screen width which was not available in practice -- that internally, a larger screen width than actually exists was used for printing.
This leaves me with three questions:
How is options()$width determined?
Why is it so consistently wrong?
What can we do to override this error?
Found a post about this on RStudio support, and it seems the issue has to do with high-DPI displays; there is a claimed bug fix in RStudio version 0.99.878 (released just today, as luck would have it!), according to the release notes:
Bug Fixes
...
Correct computation of getOption("width") on high DPI displays
Hope this helps anyone else experiencing this! I'm tempted to post about this on /r/oddlysatisfying B-)
Would love to see the relevant commit on the RStudio GitHub page if anyone can track it down (I had no luck).
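Until the fixed version lands, a stopgap is to override the width by hand, e.g. at the console or in your .Rprofile (80 here is an arbitrary value; use whatever matches your actual console):

```r
# Workaround sketch: pin the console width manually instead of relying on the
# misdetected value. Pick the number that matches your console's real width.
options(width = 80)
getOption("width")  # confirm the override took effect
```

Note this has to be re-applied whenever RStudio recomputes the width (e.g. after resizing the pane).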

Give CPU more power to plot in Octave

I made this function in Octave which plots fractals. It takes a long time to plot all the points I've calculated. I've made my function as efficient as possible; the only way I can think of to make it plot faster is to have my CPU focus entirely on the function, or to somehow tell it to prioritize my plot.
Is there a way I can do this or is this really the limit?
To determine how much CPU is being consumed by your plot, run your plot and, in a separate window (assuming you're on Linux/Unix), run the top command. (For Windows, launch the Task Manager, switch to the 'Processes' tab, and click on the CPU header to sort by CPU.)
(The rollover description for Octave on the tag on your question says that Octave is a scripting language. I would expect it's calling gnuplot to create the plots. Look for this as the highest CPU consumer).
You should see that your Octave/gnuplot cmd is near the top of the list, and for top there is a column labeled %CPU (or similar). This will show you how much CPU that process is consuming.
I would expect to see that process consuming 95% or more CPU. If you see a significantly lower number, then check the processes below it: are they consuming the remaining CPU (some sort of virus scan (on a PC), or a DB or server)? If a competing program is the problem, you'll have to decide whether you can wait until it finishes, or whether you can kill it and restart it later. (On Linux, use kill -15 pid (SIGTERM) first; only use kill -9 pid (SIGKILL) as a last resort. Search here for articles on the correct order of signals to try.)
If there are no competing processes AND octave/gnuplot is using less than 95%, then you'll have to find other tools to see what is holding up the process. (This is unlikely, but it's possible some part of your overall plotting process is Disk I/O or Network I/O bound.)
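To capture a snapshot you can paste into the question, something like the following works on most Linux systems (the ps flags are GNU/procps style, so adjust for other Unixes):

```shell
# Snapshot of the busiest processes, sorted by CPU share (procps 'ps' assumed).
ps -eo pid,pcpu,comm --sort=-pcpu | head -n 10
# Narrow it down to the plotting processes, if any are running:
ps -eo pid,pcpu,comm | grep -Ei 'octave|gnuplot' || echo "octave/gnuplot not running"
```

Unlike interactive top, this produces plain text you can copy directly into a post.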
So, it depends on the timescale you're currently experiencing versus the time you "want" to experience.
Does your system have multiple CPUs? Then you'll need to study the Octave/gnuplot documentation to see if it supports a switch meaning "use $n available CPUs for processing" (or find a plotting program that does support using multiple CPUs).
Realistically, if your process now takes 10 minutes, and by eliminating competing processes you can go from 60% to 90% CPU, that is a 50% increase in available CPU, but it will only cut the run time to roughly 6-7 minutes. Being able to divide the task over 5-10-?? CPUs will be the most certain path to faster turn-around times.
So, to go further with this, you'll need to edit your question with some data points. How long does your plot take? How big is the file it's processing? Is there something especially math-intensive about the plotting you're doing? Could a pre-processed data file speed up the calculations? Also, if the results of top don't show gnuplot running at 99% CPU, then edit your posting to show the top output; that will help us understand your problem. (Paste in your top output, select it with your mouse, and then use the formatting tool {} at the top of the input box to keep the formatting and avoid having the output wrap in your posting.)
IHTH.
P.S. Note the number of followers for each of the tags you've assigned to your question by rolling over them. You might get more useful "eyes" on your question by including a tag for the OS you're using and a tag related to performance measurement/testing. (Go to the tags tab and type in various terms to see how many followers they have. One bit of S.O. etiquette is to specify only one programming language (if appropriate), and that may apply to OSes too.)
