How does CPython know to optimise (m**e)%n? - cpython

If I run the following, assuming e is a large value like 65537:
x**e
it takes a very long time. However, with a reasonable n:
(x**e)%n
it runs much faster, so Python is clearly optimising modular exponentiation. My question is how CPython identifies that it can make this optimisation: is it the bytecode or something else?
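For reference, one way to compare the two expressions, along with the three-argument pow(x, e, n) (which CPython does implement as modular exponentiation), is the timeit module. A minimal sketch, with illustrative values for x, e, and n rather than the ones from the question:

# Time the full power followed by a modulo versus three-argument pow.
# x, e, n below are placeholder values chosen only for illustration.
import timeit

x, e, n = 1234567, 65537, 999999937

t_pow = timeit.timeit(lambda: pow(x, e, n), number=20)   # modular exponentiation throughout
t_mod = timeit.timeit(lambda: (x ** e) % n, number=20)   # full power first, then one modulo
print(f"pow(x, e, n): {t_pow:.4f}s   (x**e) % n: {t_mod:.4f}s")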

Related

Why is FastR (i.e., the GraalVM version of R) 10x *slower* than normal R despite Oracle's claim of 40x *faster*?

Oracle claims that its GraalVM implementation of R (called "FastR") is up to 40x faster than normal R (https://www.graalvm.org/r/). However, I ran this super simple (but realistic) 4-line test program, and not only was GraalVM/FastR not 40x faster, it was actually 10x SLOWER!
x <- 1:300000/300000
mu <- exp(-400*(x-0.6)^2)+
5*exp(-500*(x-0.75)^2)/3+2*exp(-500*(x-0.9)^2)
y <- mu+0.5*rnorm(300000)
t1 <- system.time(fit1 <- smooth.spline(x,y,spar=0.6))
t1
In FastR, t1 returns this value:
user system elapsed
0.870 0.012 0.901
While in the original normal R, I get this result:
user system elapsed
0.112 0.000 0.113
As you can see, FastR is super slow even for this simple test (i.e., 4 lines of code, no extra/special libraries imported, etc.). I tested this on a 16-core VM on Google Cloud. Thoughts? (FYI: I did a quick peek at the smooth.spline code, and it does call Fortran, but according to the Oracle marketing site, GraalVM/FastR is faster than even Fortran-R code.)
====================================
EDIT:
Per the comments from Ben Bolker and user438383 below, I modified the code to include a for loop so that the code ran for much longer and I had time to monitor CPU usage. The modified code is below:
x <- 1:300000/300000
mu <- exp(-400*(x-0.6)^2)+
5*exp(-500*(x-0.75)^2)/3+2*exp(-500*(x-0.9)^2)
y <- mu+0.5*rnorm(300000)
forloopfunction <- function(xTrain, yTrain) {
    for (x in 1:100) {
        smooth.spline(xTrain, yTrain, spar=0.6)
    }
}
t1 <- system.time(fit1 <- forloopfunction(x, y))
t1
Now, the normal R returns this for t1:
user system elapsed
19.665 0.008 19.667
while FastR returns this:
user system elapsed
76.570 0.210 77.918
So now FastR is only 4x slower, but that's still considerably slower. (I would be OK with a 5% to even 10% difference, but that's a 400% difference.) Moreover, I checked the CPU usage. Normal R used only 1 core (at 100%) for the entirety of the 19 seconds. However, surprisingly, FastR used between 100% and 300% of CPU (i.e., between 1 full core and 3 full cores) during the ~78 seconds. So I think it is fairly reasonable to conclude that, at least for this test (which happens to be a realistic test for my very simple scenario), FastR is at least 4x slower while consuming ~1x to 3x more CPU cores. Particularly given that I'm not importing any special libraries that the FastR team may not have had time to properly analyze (i.e., I'm using just vanilla R code that ships with R), I think there's something not quite right with the FastR implementation, at least when it comes to speed. (I haven't tested accuracy, but that's now moot, I think.) Has anyone else experienced anything similar, or does anyone know of any "magic" configuration that one needs to apply to FastR to get its claimed speeds (or at least similar, i.e., within +-5% of normal R)? (Or maybe there's some known limitation of FastR that I may be able to work around, e.g., don't use the normal Fortran binaries but use these special ones instead, etc.)
TL;DR: your example is indeed not the best use case for FastR, because it spends most of its time in R builtins and Fortran code. There is no reason for it to be slower on FastR, though, and we will work on fixing that. FastR may still be useful for your application overall, or just for selected algorithms that run slowly on GNU-R but would be a good fit for FastR (loopy, "scalar" code; see the FastRCluster package).
As others have mentioned, when it comes to micro-benchmarks one needs to repeat the benchmark multiple times to allow the system to warm up. This is important in any case, but more so for systems that rely on dynamic compilation, like FastR.
Dynamic just-in-time compilation works by first interpreting the program while recording a profile of the execution, i.e., learning how the program executes, and only then compiling the program, using this knowledge to optimize it better(*). In the case of dynamic languages like R, this can be very beneficial, because we can observe types and other dynamic behavior that is hard, if not impossible, to determine statically without actually running the program.
It should now be clear why FastR needs a few iterations to show the best performance it can achieve. It is true that the interpretation mode of FastR has not been optimized very much, so the first few iterations are actually slower than GNU-R. This is not an inherent limitation of the technology that FastR is based on, but a tradeoff of where we put our resources. Our priority in FastR has been peak performance, i.e., performance after a sufficient warm-up, for micro-benchmarks or for applications that run for a long enough time.
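To illustrate the warm-up methodology itself (the idea is language-agnostic; this sketch just happens to be in Python, and the workload function is a stand-in rather than the benchmark from the question):

# Run the workload repeatedly, record per-iteration times, and judge peak
# performance from the later iterations rather than the first few, which
# on a JIT-compiled runtime also include compilation work.
import time

def workload():
    # stand-in for the real benchmark body
    return sum(i * i for i in range(100_000))

timings = []
for _ in range(20):
    start = time.perf_counter()
    workload()
    timings.append(time.perf_counter() - start)

print("first 3 runs:", [f"{t:.4f}s" for t in timings[:3]])
print("last 3 runs: ", [f"{t:.4f}s" for t in timings[-3:]])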
On to your concrete example: I could also reproduce the issue, and I analyzed it by running the program with the built-in CPU sampler:
$GRAALVM_HOME/bin/Rscript --cpusampler --cpusampler.Delay=20000 --engine.TraceCompilation example.R
...
-----------------------------------------------------------------------------------------------------------
Thread[main,5,main]
Name || Total Time || Self Time || Location
-----------------------------------------------------------------------------------------------------------
order || 2190ms 81.4% || 2190ms 81.4% || order.r~1-42:0-1567
which || 70ms 2.6% || 70ms 2.6% || which.r~1-6:0-194
ifelse || 140ms 5.2% || 70ms 2.6% || ifelse.r~1-34:0-1109
...
--cpusampler.Delay=20000 delays the start of sampling by 20 seconds
--engine.TraceCompilation prints basic info about the JIT compilation
when the program finishes, it prints the table from CPU sampler
(example.R runs the micro benchmark in a loop)
One observation is that the Fortran routine called from smooth.spline is not to blame here. That makes sense, because FastR runs the very same native Fortran code as GNU-R. FastR does have to convert the data to native memory, but that is probably a small cost compared to the computation itself. Also, the transition between native and R code is in general more expensive on FastR, but here it does not play a role.
So the problem here seems to be the builtin function order. In GNU-R, builtin functions are implemented in C; they basically do a big switch on the type of the input (integer/real/...) and then execute highly optimized C code doing the work on a plain C integer/double/... array. That is already the most effective thing one can do, and FastR cannot beat it, but there is no reason for FastR not to be as fast. Indeed, it turns out that there is a performance bug in FastR, and the fix is on its way to master. Thank you for bringing it to our attention.
Other points raised:
but according to the Oracle marketing site, GraalVM/FastR is faster than even Fortran-R code
YMMV. The concrete benchmark presented on our website does spend a considerable amount of time in R code, so the overhead of the R<->native transition does not skew the result as much. The best results come from translating the Fortran code to R, making the whole thing a pure R program. This shows that FastR can run the same algorithm in R as fast as, or quite close to, Fortran, and that is, performance-wise, the main benefit of FastR. There is no free lunch. Warm-up time and the cost of the R<->native transition are currently the price to pay.
FastR used between 100% and 300% of CPU usage
This is due to JIT compilation going on in background threads. Again, no free lunch.
To summarize:
FastR can run R code faster by using dynamic just-in-time compilation and optimizing chunks of R code (functions, or possibly multiple functions inlined into one compilation unit) to the point that it gets close to, or even matches, equivalent native code, i.e., significantly faster than GNU-R. This matters for "scalar" R code, i.e., code with loops. For code that spends the majority of its time in builtin R functions, e.g., sum((x - mean(x))^2) for a large x, this doesn't gain much, because that code already spends most of its time in optimized native code even on GNU-R.
What FastR cannot do, in general, is beat GNU-R on the execution of a single R builtin function, which is likely already highly optimized C code in GNU-R. For individual builtins we may beat GNU-R, because we happen to choose a slightly better algorithm or GNU-R has some performance bug somewhere, or it can be the other way around, as in this case.
What FastR also cannot do is speed up native code, like the Fortran routines that some R code may call. FastR runs the very same native code. On top of that, the transition between native and R code is more costly in FastR, so programs that make this transition too often may end up being slower on FastR.
Note: what FastR can do and is a work-in-progress is to run LLVM bitcode instead of the native code. GraalVM supports execution of LLVM bitcode and can optimize it together with other languages, which removes the cost of the R<->native transition and even gives more power to the compiler to optimize across this boundary.
Note: you can use FastR via the cluster package interface to execute only parts of your application.
(*) the first profiling tier may be also compiled, which gives different tradeoffs

How come Go calculates fibonacci recursion so fast?

This is not a correct version of it; I am just playing around with Go, but I was shocked at how fast Go calculated the 42nd (43rd, actually) number in the Fibonacci sequence.
Can someone please explain how it calculates it this fast? I tried comparing it to Python (I know it's slow compared to other languages), but Python took > 1 minute and I had to break the recursion.
package main

import "fmt"

func fib(a uint) uint {
    if a <= 1 {
        return 1
    }
    return fib(a-1) + fib(a-2)
}

func main() {
    fmt.Println(fib(42))
}
[ `go run Hello.go` | done: 2.316821835s ]
433494437
Its compiler isn't as smart or mature as C's (at least not yet), but Go is still closer to C in its time performance than Python (space performance is a separate thing, and not what you asked about). Just being a compiled language instead of an interpreted language gives it a major leg up in time performance over Python (and it is still faster than PyPy in general, but not as much faster).
Why compiled languages generally offer greater time performance than interpreted languages has been thoroughly covered elsewhere. You can research this question on stackoverflow and elsewhere on the internet. For example, here's the TL;DR in one stackoverflow answer to that question:
Native programs run using instructions written for the processor they run on.
And here's the TL;DR in another answer:
Interpreted languages are slower because their method, object and global variable space model is dynamic
You can also find plenty of benchmark case studies and results comparing implementations in different languages if you look for them.
Performance improvements to the Go compiler and Go toolchain are also frequently made, which you can read about in the release notes (and elsewhere) such as this excerpt about version 1.8:
The new back end, based on static single assignment form (SSA), generates more compact, more efficient code and provides a better platform for optimizations such as bounds check elimination. The new back end reduces the CPU time required by our benchmark programs by 20-30% on 32-bit ARM systems.
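For comparison, here is what the same naive recursion looks like in Python, timed with time.perf_counter; this is a sketch of the kind of run the question describes, not a rigorous benchmark:

# Intentionally exponential-time Fibonacci, matching the Go version
# (fib(0) and fib(1) both return 1, as in the original code).
import time

def fib(a: int) -> int:
    if a <= 1:
        return 1
    return fib(a - 1) + fib(a - 2)

start = time.perf_counter()
print(fib(42))                                  # 433494437
print(f"done: {time.perf_counter() - start:.2f}s")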

The best way to write code in Julia working on GPU's via ArrayFire

In Julia, I have seen that, to accelerate and optimize code when working on a matrix, it is better to, e.g.:
- work by columns instead of by rows, because of the way Julia stores matrices;
- in loops, use the @inbounds and @simd macros;
- any other function, macro, or method you could recommend is welcome :D
But it seems that the above tips do not help when I use the ArrayFire package with a matrix stored on the GPU; similar code on the CPU and GPU does not seem to favor the GPU, which runs much slower in some cases. I don't think it should be like that; I think the problem is in the way I am writing the code. Any help will be welcome.
GPU computing should be done on optimized GPU kernels as much as possible. Indexing a GPU array is a small kernel that copies one value back to the CPU. This is really really bad for performance, so you should almost never index a GPUArray unless you have to (this is true for any implementation! It's just a hardware problem!)
Thus, instead of writing looping code for GPUs, you should write broadcasting ("vectorized") code. With the v0.6 broadcast changes, broadcasted operations are nearly as efficient as loops anyway (unless you hit this bug), so there's no reason to avoid them in generic code. However, there are cases where broadcasting is faster than looping, and GPUs are one big case.
Let me explain a little bit why. When you do the code:
@. A = B*C + D*E
it lowers to
A .= B.*C .+ D.*E
which then lowers to:
broadcast!((b,c,d,e)->b*c + d*e,A,B,C,D,E)
Notice that in there you have a fused anonymous function for the entire broadcast. For GPUArrays, this is then overridden so that a single GPU kernel is automatically created that performs this fused operation element-wise. Thus only one GPU kernel is required to do the whole operation! Notice that this is even more efficient than the R/Python/MATLAB way of doing GPU computing, since those vectorized forms have temporaries and would require 4 kernels here, whereas this has no temporary arrays and is a single kernel, which is pretty much exactly how you'd write it if you were writing the kernel yourself. So if you exploit broadcast, your GPU calculations will be fast.
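To make the point about temporaries concrete in NumPy terms (this is only an analogy for the unfused vectorized style, not how GPUArrays or ArrayFire actually works): A = B*C + D*E allocates an intermediate array for each binary operation, whereas writing into preallocated buffers avoids them, which is roughly what the single fused broadcast kernel gives you automatically on the GPU. A minimal sketch:

# Unfused vectorized style versus reusing preallocated output buffers.
import numpy as np

n = 1_000_000
B, C, D, E = (np.random.rand(n) for _ in range(4))

A = B * C + D * E            # allocates temporaries for B*C and D*E

A2  = np.empty(n)            # preallocated output buffer
tmp = np.empty(n)            # one reusable scratch buffer
np.multiply(B, C, out=A2)    # A2 = B*C (no fresh allocation)
np.multiply(D, E, out=tmp)   # tmp = D*E
np.add(A2, tmp, out=A2)      # A2 = B*C + D*E

assert np.allclose(A, A2)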

Julia speed of execution

I'm benchmarking Julia's execution speed. I executed @time [i^2 for i in 1:1000] at the Julia prompt, which resulted in something of the order of 20 ms. This seems strange, since my computer is modern, with an i7 processor (I am using Linux Ubuntu).
Another strange thing is that when I execute the same command on a range of 1:10 the execution time is 15 ms.
There must be something trivial that I am missing here?
Several things, see performance tips:
Don't benchmark in global scope.
Don't measure the first execution of something like this.
Use BenchmarkTools.
Julia is a JIT-compiled language, so the first time you measure things, you're measuring compilation time. This is a small fixed overhead, so for anything that takes a substantial time, it's negligible, but for short-running code like this, it's almost all of the time. Non-constant global variables force the compiler to assume almost nothing about types, which tends to poison all of your performance. This is fine in some circumstances, but most of the time, you a) should write code so that the inputs are explicit parameters to functions, rather than implicit parameters coming from some globals, and b) shouldn't write code that uses mutable global state.

BLAS daxpy and dcopy when inc=1, are they any faster than just using a(:,t) = b(:,t)?

Well, the title says it. I'm doing the following kind of operations in Fortran:
a(:,t) = b(:,t)
c(:,t) = x(i,t)*d(:,t)
Are there any benefits to using the daxpy and dcopy subroutines from BLAS in this case, where inc=1?
No difference you would ever notice.
BLAS has to be compatible with Fortran 77, which I'm pretty sure didn't have those fancy features.
Those subroutines are in there to make array or matrix copying 1 line of code, because it's done quite a lot. Cycles tend to get hogged in other routines, like matrix inverse, so copying is not usually a performance issue.
If you're worried about performance, just code it up in a reasonable way. Then what I would do is interrupt it a few times. This will show you where the time's actually going. If it is spending much time in copying, it will tell you. If not, it will tell you.

Resources