Julia speed of execution - julia

I'm benchmarking Julia execution speed. I executed @time [i^2 for i in 1:1000] at the Julia prompt, which reported something on the order of 20 ms. This seems strange, since my computer is modern with an i7 processor (I am using Linux Ubuntu).
Another strange thing is that when I execute the same command on the range 1:10, the execution time is 15 ms.
There must be something trivial that I am missing here?

Several things; see the performance tips section of the Julia manual:
- Don't benchmark in global scope.
- Don't measure the first execution of something like this.
- Use BenchmarkTools.jl.
Julia is a JIT-compiled language, so the first time you measure things, you're measuring compilation time. This is a small fixed overhead, so for anything that takes a substantial time, it's negligible, but for short-running code like this, it's almost all of the time. Non-constant global variables force the compiler to assume almost nothing about types, which tends to poison all of your performance. This is fine in some circumstances, but most of the time, you a) should write code so that the inputs are explicit parameters to functions, rather than implicit parameters coming from some globals, and b) shouldn't write code that uses mutable global state.
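As a concrete sketch of those three points (the function name squares_upto is just illustrative, not from the original question):

using BenchmarkTools

# Wrap the work in a function so the compiler can specialize on the argument type.
squares_upto(n) = [i^2 for i in 1:n]

squares_upto(1000)         # first call: pays the one-time JIT compilation cost
@time squares_upto(1000)   # second call: @time now measures only execution
n = 1000
@btime squares_upto($n)    # BenchmarkTools takes many samples; $ interpolates the
                           # global so its unknown type doesn't poison the timing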

Related

Why is FASTR (ie GraalVM version of R) 10x *slower* compared to normal R despite Oracle's claim of 40x *faster*?

Oracle claims that its GraalVM implementation of R (called "FastR") is up to 40x faster than normal R (https://www.graalvm.org/r/). However, I ran the super simple (but realistic) 4-line test program below, and not only was GraalVM/FastR not 40x faster, it was actually 10x SLOWER!
x <- 1:300000/300000
mu <- exp(-400*(x-0.6)^2)+
    5*exp(-500*(x-0.75)^2)/3+2*exp(-500*(x-0.9)^2)
y <- mu+0.5*rnorm(300000)
t1 <- system.time(fit1 <- smooth.spline(x,y,spar=0.6))
t1
In FastR, t1 returns this value:
user system elapsed
0.870 0.012 0.901
While in the original normal R, I get this result:
user system elapsed
0.112 0.000 0.113
As you can see, FastR is super slow even for this simple case (i.e., 4 lines of code, no extra/special library imported, etc.). I tested this on a 16-core VM on Google Cloud. Thoughts? (FYI: I took a quick peek at the smooth.spline code, and it does call Fortran, but according to the Oracle marketing site, GraalVM/FastR is faster than even Fortran-R code.)
====================================
EDIT:
Per the comments from Ben Bolker and user438383 below, I modified the code to include a for loop so that the code ran for much longer and I had time to monitor CPU usage. The modified code is below:
x <- 1:300000/300000
mu <- exp(-400*(x-0.6)^2)+
    5*exp(-500*(x-0.75)^2)/3+2*exp(-500*(x-0.9)^2)
y <- mu+0.5*rnorm(300000)
forloopfunction <- function(xTrain, yTrain) {
    for (x in 1:100) {
        smooth.spline(xTrain, yTrain, spar=0.6)
    }
}
t1 <- system.time(fit1 <- forloopfunction(x, y))
t1
Now, the normal R returns this for t1:
user system elapsed
19.665 0.008 19.667
while FastR returns this:
user system elapsed
76.570 0.210 77.918
So now FastR is only 4x slower, but that's still considerably slower. (I would be OK with a 5% or even 10% difference, but that's a 400% difference.) Moreover, I checked the CPU usage. Normal R used only 1 core (at 100%) for the entirety of the 19 seconds. However, surprisingly, FastR used between 100% and 300% of CPU (i.e., between 1 full core and 3 full cores) during the ~78 seconds. So I think it is fairly reasonable to conclude that, at least for this test (which happens to be a realistic test for my very simple scenario), FastR is at least 4x slower while consuming ~1x to 3x more CPU cores.
Particularly given that I'm not importing any special libraries that the FastR team may not have had time to properly analyze (i.e., I'm using just the vanilla R code that ships with R), I think there's something not quite right with the FastR implementation, at least when it comes to speed. (I haven't tested accuracy, but that's now moot, I think.) Has anyone else experienced anything similar, or does anyone know of any "magic" configuration one needs to apply to FastR to get its claimed speeds (or at least similar, i.e., +-5%, speeds to normal R)? (Or maybe there's some known limitation of FastR that I may be able to work around, e.g., don't use the normal Fortran binaries but use these special ones, etc.)
TL;DR: your example is indeed not the best use case for FastR, because it spends most of its time in R builtins and Fortran code. There is no reason for it to be slower on FastR, though, and we will work on fixing that. FastR may still be useful for your application overall, or just for some selected algorithms that run slowly on GNU-R but would be a good fit for FastR (loopy, "scalar" code; see the FastRCluster package).
As others have mentioned, when it comes to micro-benchmarks one needs to repeat the benchmark multiple times to allow the system to warm up. This is important in any case, but more so for systems that rely on dynamic compilation, like FastR.
Dynamic just-in-time compilation works by first interpreting the program while recording a profile of the execution, i.e., learning how the program executes, and only then compiling the program using this knowledge to optimize it better (*). In the case of dynamic languages like R, this can be very beneficial, because we can observe types and other dynamic behavior that is hard, if not impossible, to determine statically without actually running the program.
It should now be clear why FastR needs a few iterations to show the best performance it can achieve. It is true that the interpretation mode of FastR has not been optimized very much, so the first few iterations are actually slower than GNU-R. This is not an inherent limitation of the technology that FastR is based on, but a tradeoff of where we put our resources. Our priority in FastR has been peak performance, i.e., performance after a sufficient warm-up, for micro-benchmarks or for applications that run for a long enough time.
To your concrete example: I could also reproduce the issue, and I analyzed it by running the program with the built-in CPU sampler:
$GRAALVM_HOME/bin/Rscript --cpusampler --cpusampler.Delay=20000 --engine.TraceCompilation example.R
...
-----------------------------------------------------------------------------------------------------------
Thread[main,5,main]
Name || Total Time || Self Time || Location
-----------------------------------------------------------------------------------------------------------
order || 2190ms 81.4% || 2190ms 81.4% || order.r~1-42:0-1567
which || 70ms 2.6% || 70ms 2.6% || which.r~1-6:0-194
ifelse || 140ms 5.2% || 70ms 2.6% || ifelse.r~1-34:0-1109
...
- --cpusampler.Delay=20000 delays the start of sampling by 20 seconds
- --engine.TraceCompilation prints basic info about the JIT compilation
- when the program finishes, it prints the table from the CPU sampler
- (example.R runs the micro-benchmark in a loop)
One observation is that the Fortran routine called from smooth.spline is not to blame here. That makes sense, because FastR runs the very same native Fortran code as GNU-R. FastR does have to convert the data to native memory, but that is probably a small cost compared to the computation itself. Also, the transition between native and R code is in general more expensive on FastR, but here it does not play a role.
So the problem here seems to be the builtin function order. In GNU-R, builtin functions are implemented in C: they basically do a big switch on the type of the input (integer/real/...) and then execute highly optimized C code doing the work on a plain C integer/double/... array. That is already the most effective thing one can do, and FastR cannot beat that, but there is no reason for it not to be as fast. Indeed, it turns out that there is a performance bug in FastR, and the fix is on its way to master. Thank you for bringing it to our attention.
Other points raised:
but according to the Oracle marketing site, GraalVM/FastR is faster than even Fortran-R code
YMMV. The concrete benchmark presented on our website does spend a considerable amount of time in R code, so the overhead of the R<->native transition does not skew the result as much. The best results come from translating the Fortran code to R, making the whole thing a pure R program. This shows that FastR can run the same algorithm in R as fast as, or quite close to, Fortran, and that is, performance-wise, the main benefit of FastR. There is no free lunch. Warm-up time and the cost of the R<->native transition are currently the price to pay.
FastR used between 100% and 300% of CPU usage
This is due to JIT compilation running on background threads. Again, no free lunch.
To summarize:
FastR can run R code faster by using dynamic just-in-time compilation and optimizing chunks of R code (functions, or possibly multiple functions inlined into one compilation unit) to the point that it can get close to, or even match, equivalent native code, i.e., be significantly faster than GNU-R. This matters for "scalar" R code, i.e., code with loops. For code that spends the majority of its time in builtin R functions, e.g., sum((x - mean(x))^2) for large x, this doesn't gain much, because that code already spends most of its time in optimized native code even on GNU-R.
What FastR cannot do is beat GNU-R on the execution of a single R builtin function, which is likely to be already highly optimized C code in GNU-R. For individual builtins we may still beat GNU-R, because we happen to choose a slightly better algorithm or GNU-R has a performance bug somewhere, or it can be the other way around, as in this case.
What FastR also cannot do is speed up native code, like the Fortran routines that some R code may call. FastR runs the very same native code. On top of that, the transition between native and R code is more costly in FastR, so programs that make this transition too often may end up being slower on FastR.
Note: what FastR can do, and what is a work in progress, is run LLVM bitcode instead of the native code. GraalVM supports execution of LLVM bitcode and can optimize it together with other languages, which removes the cost of the R<->native transition and even gives the compiler more power to optimize across this boundary.
Note: you can use FastR via the cluster package interface to execute only parts of your application.
(*) the first, profiling tier may also be compiled, which gives different tradeoffs

Line-by-line profiling of Julia code possible?

I have some Julia functions that are several hundreds of lines long that I would like to profile so that I can then work on optimizing the code.
I am aware of the BenchmarkTools package, which allows the overall execution time and memory consumption of a function to be measured using @btime or @benchmark. But those macros tell me nothing about where inside the functions the bottlenecks are. So my first step would have to be to use some tool to identify which parts of the code are slow.
In Matlab for instance there is a very nice built-in profiler which runs a script/function and then reports the time spent on every line of the code. Similarly in Python there is a module called line_profiler which can produce a line-by-line report showing how much time was spent on every single line of a function.
What I’m looking for is simply a line-by-line report showing the total time spent on each line of code and how many times a particular piece of code was called.
Is there such a functionality in Julia? Either built-in or via some third-party package.
There is a Profiling chapter in the Julia docs with all the necessary info.
Also, you can use ProfileView.jl or similar packages for visual exploration of the profiled code.
And, while not exactly a profiler, a package that is very useful in practice is TimerOutputs.jl.
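For instance, a minimal sketch of how TimerOutputs.jl can be used (the function and section names here are made up for illustration): create a TimerOutput, wrap the sections you care about in @timeit, and print the accumulated table at the end.

using TimerOutputs

const to = TimerOutput()

function work(n)
    x = @timeit to "setup"   rand(n)        # time the data generation
    s = @timeit to "compute" sum(abs2, x)   # time the actual computation
    return s
end

work(10^6)
show(to)   # prints time and allocations accumulated per labelled section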
UPD: Since Julia is a compiled language, it makes no sense to measure the timing of individual lines, because the code that is actually executed can be very different from what is written in the Julia source.
For example, the following Julia code
function f()
    x = 0
    for i in 0:100_000_000
        x += i
    end
    x
end
is lowered to
julia> @code_llvm f()
;  @ REPL[8]:1 within `f'
define i64 @julia_f_594() {
top:
;  @ REPL[8]:7 within `f'
  ret i64 5000000050000000
}
I.e., there is no loop at all. This is why, instead of execution time, the proxy metric of how often a line appears in the set of all backtraces is used. Of course, it is not the same as the execution time, but it gives a good approximation of where the bottleneck is, because lines with long execution times appear in backtraces more often.
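As a minimal sketch of that sampling workflow with the built-in Profile standard library (the function heavy is just a placeholder workload):

using Profile

heavy(n) = sum(inv(rand(n, n)))   # placeholder workload dominated by the matrix inverse

heavy(100)             # run once first so compilation is not profiled
Profile.clear()
@profile heavy(1000)   # collect stack-trace samples while the call runs
Profile.print()        # per-line sample counts; ProfileView.jl can render the same data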
There is also OwnTime.jl. It doesn't do call counts, though that should be easy to add.

How come Go calculates fibonacci recursion so fast?

This is not a correct version of it; I am just playing around with Go, but I was shocked at how fast Go calculated the 42nd (43rd actually) number in the Fibonacci sequence.
Can someone please explain how it calculates this so fast? I tried comparing it to Python (I know it's slow compared to other languages), but Python took > 1 minute and I had to break the recursion.
package main

import "fmt"

func fib(a uint) uint {
    if a <= 1 {
        return 1
    }
    return fib(a-1) + fib(a-2)
}

func main() {
    fmt.Println(fib(42))
}
[ `go run Hello.go` | done: 2.316821835s ]
433494437
Its compiler isn't as smart or mature as C's (at least not yet), but Go is still closer to C in its time performance than Python (space performance is a separate thing, and not what you asked about). Just being a compiled language instead of an interpreted language gives it a major leg up in time performance over Python (and it is still faster than PyPy in general, but not as much faster).
Why compiled languages generally offer greater time performance than interpreted languages has been thoroughly covered elsewhere. You can research this question on stackoverflow and elsewhere on the internet. For example, here's the TL;DR in one stackoverflow answer to that question:
Native programs run using instructions written for the processor they run on.
And here's the TL;DR in another answer:
Interpreted languages are slower because their method, object and global variable space model is dynamic
You can also find plenty of benchmark case studies and results comparing implementations in different languages if you look for them.
Performance improvements to the Go compiler and Go toolchain are also frequently made, which you can read about in the release notes (and elsewhere) such as this excerpt about version 1.8:
The new back end, based on static single assignment form (SSA), generates more compact, more efficient code and provides a better platform for optimizations such as bounds check elimination. The new back end reduces the CPU time required by our benchmark programs by 20-30% on 32-bit ARM systems.

The best way to write code in Julia working on GPU's via ArrayFire

In Julia, I have mainly seen that, to accelerate and optimize code when working on a matrix, it is better to, e.g.:
- work by columns instead of by rows, because of the way Julia stores matrices;
- in loops, use the @inbounds and @simd macros;
- any other function, macro, or method you could recommend is welcome :D
But it seems that the above tips do not work when I use the ArrayFire package with a matrix stored on the GPU: similar code on the CPU and on the GPU does not seem to favor the GPU, which runs much slower in some cases. I think it shouldn't be like that; I think the problem is in the way the code is written. Any help will be welcome.
GPU computing should be done with optimized GPU kernels as much as possible. Indexing a GPU array launches a small kernel that copies one value back to the CPU. This is really, really bad for performance, so you should almost never index a GPUArray unless you have to (this is true for any implementation! It's just a hardware problem!).
Thus, instead of writing looping code for GPUs, you should write broadcasting ("vectorized") code. With the v0.6 broadcast changes, broadcasted operations are nearly as efficient as loops anyway (unless you hit this bug), so there's no reason to avoid them in generic code. However, there are cases where broadcasting is faster than looping, and GPUs are one big case.
Let me explain a little bit why. When you write the code:
@. A = B*C + D*E
it lowers to
A .= B.*C .+ D.*E
which then lowers to:
broadcast!((b,c,d,e)->b*c + d*e,A,B,C,D,E)
Notice that in there you have a fused anonymous function for the entire broadcast. For GPUArrays, this is then overridden so that a single GPU kernel is automatically created that performs this fused operation element-wise. Thus only one GPU kernel is required to do the whole operation! Notice that this is even more efficient than the R/Python/MATLAB way of doing GPU computing, since those vectorized forms create temporaries and would require four kernels here, whereas this has no temporary arrays and is a single kernel, which is pretty much exactly how you'd write it if you were writing the kernel yourself. So if you exploit broadcasting, your GPU calculations will be fast.
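Here is a small sketch of that fusion on ordinary CPU arrays; the near-zero allocation is the observable sign that the whole right-hand side ran as one fused pass (on a GPU array type the same line would, per the explanation above, become a single kernel, which is not demonstrated here):

using BenchmarkTools

n = 10^6
B, C, D, E = rand(n), rand(n), rand(n), rand(n)
A = similar(B)

# One fused elementwise pass that writes directly into A: no temporary arrays.
fused!(A, B, C, D, E) = (@. A = B*C + D*E; A)

@btime fused!($A, $B, $C, $D, $E);   # ~0 bytes allocated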

OpenCL computation times much longer than CPU alternative

I'm taking my first steps in OpenCL (and CUDA) for my internship. All nice and well, I now have working OpenCL code, but the computation times are way too high, I think. My guess is that I'm doing too much I/O, but I don't know where that could be.
The code for the main is here: http://pastebin.com/i4A6kPfn, and for the kernel here: http://pastebin.com/Wefrqifh. I start measuring time after segmentPunten(segmentArray, begin, eind); has returned, and I stop measuring time after the last clEnqueueReadBuffer.
Computation time on an Nvidia GT440 is 38.6 seconds, on a GT555M 35.5 seconds, on an Athlon II X4 5.6 seconds, and on an Intel P8600 6 seconds.
Can someone explain this to me? Why are the computation times so high, and what solutions are there?
What is it supposed to do (short version): calculate how much noise load is produced by an airplane that is passing by.
Long version: there are several Observer Points (OPs), which are the points at which sound from a passing airplane is measured. The flight path is segmented into 10,000 segments; this is done in the function segmentPunten. The double for loop in the main gives the OPs a coordinate. There are two kernels. The first one calculates the distance from a single OP to a single segment. This is then saved in the array "afstanden". The second kernel calculates the sound load in an OP from all the segments.
Just eyeballing your kernel, I see this:
kernel void SEL(global const float *afstanden, global double *totaalSEL,
                const int aantalSegmenten)
{
    // ...
    for (i = 0; i < aantalSegmenten; i++) {
        double distance = afstanden[threadID * aantalSegmenten + i];
        // ...
    }
    // ...
}
It looks like aantalSegmenten is being set to 1000. You have a loop in each kernel that accesses global memory 1000 times. Without crawling through the code, I'm guessing that many of these accesses overlap when considering your computation as a whole. Is this the case? Will two work items access the same global memory? If so, you will see a potentially huge win on the GPU from rewriting your algorithm to partition the work such that you read each global memory location only once, saving it in local memory. After that, each work item in the work group that needs that location can read it quickly.
As an aside, the CL specification allows you to omit the leading __ from CL keywords like global and kernel. I don't think many newcomers to CL realize that.
Before optimizing further, you should first get an understanding of what is taking all that time. Is it the kernel compiles, data transfer, or actual kernel execution?
As mentioned above, you can get rid of the kernel compiles by caching the results. I believe some OpenCL implementations (the Apple one at least) already do this automatically. With others, you may need to do the caching manually. Here are instructions for the caching.
If the performance bottleneck is the kernel itself, you can probably get a major speed-up by organizing the 'afstanden' array lookups differently. Currently, when a block of threads performs a read from memory, the addresses are spread out through memory, which is a real killer for GPU performance. You'd ideally want to index the array with something like afstanden[ndx*NUM_THREADS + threadID], which would make accesses from a work group load a contiguous block of memory. This is much faster than the current, essentially random, memory lookups.
First of all, you are measuring not just the computation time but the whole kernel read-in/compile/execute mumbo-jumbo. To do a fair comparison, measure the computation time from the first "non-static" part of your program (for example, from the first clSetKernelArg call to the last clEnqueueReadBuffer).
If the execution time is still too high, then you can use some kind of profiler (such as the Visual Profiler from NVidia) and read the OpenCL Best Practices guide, which is included in the CUDA Toolkit documentation.
On the raw kernel execution time: consider (and measure) whether you really need double precision for your calculation, because double-precision calculations are artificially slowed down on consumer-grade NVidia cards.
