This is not a correct version of it; I am just playing around with Go, but I was shocked how fast Go calculated the 42nd (43rd, actually) number in the Fibonacci sequence.
Can someone please explain how it calculates this so fast? I tried comparing it to Python (I know it's slow compared to other languages), but Python took more than a minute and I had to interrupt the recursion.
package main

import "fmt"

func fib(a uint) uint {
    if a <= 1 {
        return 1
    }
    return fib(a-1) + fib(a-2)
}

func main() {
    fmt.Println(fib(42))
}
[ `go run Hello.go` | done: 2.316821835s ]
433494437
Go's compiler isn't as smart or mature as C's (at least not yet), but Go is still much closer to C than to Python in time performance (space performance is a separate matter, and not what you asked about). Simply being a compiled language rather than an interpreted one gives it a major leg up in time performance over Python (and it is generally faster than PyPy as well, though by a smaller margin).
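For reference, the Python version the question compares against was presumably something like the following naive recursion (an assumption, since the Python code isn't shown); it performs the same roughly 2^n calls, but pays interpreter overhead on every one of them:
def fib(a: int) -> int:
    # same exponential-time recursion as the Go version above
    if a <= 1:
        return 1
    return fib(a - 1) + fib(a - 2)

print(fib(42))  # 433494437, but typically takes on the order of a minute or more under CPython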
Why compiled languages generally offer better time performance than interpreted languages has been thoroughly covered elsewhere; you can research the question on Stack Overflow and elsewhere on the internet. For example, here's the TL;DR from one Stack Overflow answer to that question:
Native programs run using instructions written for the processor they run on.
And here's the TL;DR in another answer:
Interpreted languages are slower because their method, object and global variable space model is dynamic
You can also find plenty of benchmark case studies and results comparing implementations in different languages if you look for them.
Performance improvements to the Go compiler and toolchain are also made frequently, which you can read about in the release notes (and elsewhere); for example, this excerpt about version 1.8:
The new back end, based on static single assignment form (SSA), generates more compact, more efficient code and provides a better platform for optimizations such as bounds check elimination. The new back end reduces the CPU time required by our benchmark programs by 20-30% on 32-bit ARM systems.
Related
Oracle claims that its GraalVM implementation of R (called "FastR") is up to 40x faster than normal R (https://www.graalvm.org/r/). However, I ran this super simple (but realistic) 4-line test program, and not only was GraalVM/FastR not 40x faster, it was actually 10x SLOWER!
x <- 1:300000/300000
mu <- exp(-400*(x-0.6)^2) +
  5*exp(-500*(x-0.75)^2)/3 + 2*exp(-500*(x-0.9)^2)
y <- mu + 0.5*rnorm(300000)
t1 <- system.time(fit1 <- smooth.spline(x, y, spar=0.6))
t1
In FastR, t1 returns this value:
user system elapsed
0.870 0.012 0.901
While in the original normal R, I get this result:
user system elapsed
0.112 0.000 0.113
As you can see, FastR is very slow even for this simple case (i.e., 4 lines of code, no extra/special libraries imported, etc.). I tested this on a 16-core VM on Google Cloud. Thoughts? (FYI: I took a quick peek at the smooth.spline code, and it does call Fortran, but according to the Oracle marketing site, GraalVM/FastR is faster than even Fortran-R code.)
====================================
EDIT:
Per the comments from Ben Bolker and user438383 below, I modified the code to include a for loop so that the code ran for much longer and I had time to monitor CPU usage. The modified code is below:
x <- 1:300000/300000
mu <- exp(-400*(x-0.6)^2) +
  5*exp(-500*(x-0.75)^2)/3 + 2*exp(-500*(x-0.9)^2)
y <- mu + 0.5*rnorm(300000)
forloopfunction <- function(xTrain, yTrain) {
  for (x in 1:100) {
    smooth.spline(xTrain, yTrain, spar=0.6)
  }
}
t1 <- system.time(fit1 <- forloopfunction(x, y))
t1
Now, the normal R returns this for t1:
user system elapsed
19.665 0.008 19.667
while FastR returns this:
user system elapsed
76.570 0.210 77.918
So now FastR is only 4x slower, but that's still considerably slower. (I would be OK with a 5% to even 10% difference, but that's a 400% difference.) Moreover, I checked the CPU usage. Normal R used only 1 core (at 100%) for the entirety of the 19 seconds. However, surprisingly, FastR used between 100% and 300% of CPU (i.e., between 1 and 3 full cores) during the ~78 seconds.
So I think it is fairly reasonable to conclude that, at least for this test (which happens to be a realistic test for my very simple scenario), FastR is at least 4x slower while consuming roughly 1x to 3x more CPU cores. Particularly given that I'm not importing any special libraries that the FastR team may not have had time to properly analyze (i.e., I'm using just the vanilla R code that ships with R), I think there's something not quite right with the FastR implementation, at least when it comes to speed. (I haven't tested accuracy, but that's now moot, I think.) Has anyone else experienced anything similar, or does anyone know of any "magic" configuration that one needs to apply to FastR to get its claimed speeds (or at least similar, i.e., within +-5% of normal R)? (Or maybe there's some known limitation to FastR that I may be able to work around, e.g., don't use the normal Fortran binaries but use special ones instead, etc.)
TL;DR: your example is indeed not the best use case for FastR, because it spends most of its time in R builtins and Fortran code. There is no reason for it to be slower on FastR, though, and we will work on fixing that. FastR may still be useful for your application overall, or just for selected algorithms that run slowly on GNU-R but would be a good fit for FastR (loopy, "scalar" code; see the FastRCluster package).
As others have mentioned, when it comes to micro benchmarks one needs to repeat the benchmark multiple times to allow the system to warm up. This is important in any case, but more so for systems that rely on dynamic compilation, like FastR.
Dynamic just-in-time compilation works by first interpreting the program while recording a profile of the execution, i.e., learning how the program executes, and only then compiling the program using this knowledge to optimize it better(*). In the case of dynamic languages like R, this can be very beneficial, because we can observe types and other dynamic behavior that is hard, if not impossible, to determine statically without actually running the program.
It should now be clear why FastR needs a few iterations to show the best performance it can achieve. It is true that the interpretation mode of FastR has not been optimized very much, so the first few iterations are actually slower than GNU-R. This is not an inherent limitation of the technology that FastR is based on, but a tradeoff of where we put our resources. Our priority in FastR has been peak performance, i.e., performance after a sufficient warm-up, whether for micro benchmarks or for applications that run for a long enough time.
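To make the warm-up concrete, a minimal sketch of such a repeated micro benchmark might look like the following (the data and iteration count are illustrative, not taken from the original benchmark); on a JIT-compiling runtime like FastR, the first few timings are expected to be the slowest:
x <- 1:300000/300000
y <- sin(10*x) + 0.5*rnorm(300000)
for (i in 1:20) {
  # repeat the same workload so the runtime can profile and compile the hot code
  print(system.time(smooth.spline(x, y, spar = 0.6)))
}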
To your concrete example: I could also reproduce the issue, and I analyzed it by running the program with the built-in CPU sampler:
$GRAALVM_HOME/bin/Rscript --cpusampler --cpusampler.Delay=20000 --engine.TraceCompilation example.R
...
-----------------------------------------------------------------------------------------------------------
Thread[main,5,main]
Name || Total Time || Self Time || Location
-----------------------------------------------------------------------------------------------------------
order || 2190ms 81.4% || 2190ms 81.4% || order.r~1-42:0-1567
which || 70ms 2.6% || 70ms 2.6% || which.r~1-6:0-194
ifelse || 140ms 5.2% || 70ms 2.6% || ifelse.r~1-34:0-1109
...
--cpusampler.Delay=20000 delays the start of sampling by 20 seconds
--engine.TraceCompilation prints basic info about the JIT compilation
when the program finishes, it prints the table from the CPU sampler
(example.R runs the micro benchmark in a loop)
One observation is that the Fortran routine called from smooth.spline is not to blame here. That makes sense, because FastR runs the very same native Fortran code as GNU-R. FastR does have to convert the data to native memory, but that is probably a small cost compared to the computation itself. The transition between native and R code is in general more expensive on FastR, but here it does not play a role.
So the problem here seems to be the builtin function order. In GNU-R, builtin functions are implemented in C: they basically do a big switch on the type of the input (integer/real/...) and then execute highly optimized C code that does the work on a plain C integer/double/... array. That is already the most effective thing one can do, and FastR cannot beat it, but there is no reason for FastR not to be just as fast. Indeed, it turns out that there is a performance bug in FastR, and the fix is on its way to master. Thank you for bringing it to our attention.
Other points raised:
but according to the Oracle marketing site, GraalVM/FastR is faster than even Fortran-R code
YMMV. The concrete benchmark presented on our website spends a considerable amount of time in R code, so the overhead of the R<->native transition does not skew the result as much. The best results come from translating the Fortran code to R, making the whole thing a pure R program. This shows that FastR can run the same algorithm in R as fast as, or quite close to, Fortran, and that is, performance-wise, the main benefit of FastR. There is no free lunch: warm-up time and the cost of the R<->native transition are currently the price to pay.
FastR used between 100% and 300% of CPU usage
This is due to JIT compilation running on background threads. Again, no free lunch.
To summarize:
FastR can run R code faster by using dynamic just-in-time compilation and optimizing chunks of R code (functions, or possibly multiple functions inlined into one compilation unit) to the point that it can get close to or even match equivalent native code, i.e., significantly faster than GNU-R. This matters for "scalar" R code, i.e., code with loops. For code that spends the majority of its time in builtin R functions, e.g., sum((x - mean(x))^2) for large x, this doesn't gain much, because that code already spends most of its time in optimized native code even on GNU-R.
What FastR cannot do is beat GNU-R on the execution of a single R builtin function, which is likely to be already highly optimized C code in GNU-R. For individual builtins we may beat GNU-R, because we happen to choose a slightly better algorithm or GNU-R has some performance bug somewhere, or it can be the other way around, as in this case.
What FastR also cannot do is speed up native code, like the Fortran routines that some R code may call. FastR runs the very same native code. On top of that, the transition between native and R code is more costly in FastR, so programs doing this transition too often may end up being slower on FastR.
Note: what FastR can do, and what is a work in progress, is run LLVM bitcode instead of the native code. GraalVM supports execution of LLVM bitcode and can optimize it together with other languages, which removes the cost of the R<->native transition and gives the compiler even more power to optimize across this boundary.
Note: you can use FastR via the cluster package interface to execute only parts of your application.
(*) the first profiling tier may also be compiled, which gives different tradeoffs
Newbie to OpenCL here. I'm trying to convert a numerical method I've written to OpenCL for acceleration. I'm using the PyOpenCL package as I've written this once in Python already and as far as I can tell there's no compelling reason to use the C version. I'm all ears if I'm wrong on this, though.
I've managed to translate most of the functionality I need into OpenCL kernels. My question is about how to (properly) tell OpenCL to ignore my boundary/ghost cells. The reason I need to do this is that my method, for point i, accesses cells at [i-2:i+2], so if i=1 I'll run off the end of the array. So I add some extra points that serve to prevent this, and then just tell my algorithm to run only on points [2:nPts-2]. It's easy to see how to do this with a for loop, but I'm a little less clear on the 'right' way to do this for a kernel.
Is it sufficient to do, for example (pseudocode)
__kernel void myMethod(__global float *retVal /* , nPts, nGhostCells, ... */) {
    int gid = get_global_id(0);
    if (gid < nGhostCells || gid > nPts - nGhostCells) {
        retVal[gid] = 0;
        return;   // skip ghost/boundary cells so the code below never runs for them
    }
    // Otherwise perform my calculations
}
or is there another/more appropriate way to enforce this constraint?
It looks sufficient.
The branch is the same for nPts - nGhostCells*2 of the points, and it is predictable if nPts and nGhostCells are compile-time constants. Even if it is not predictable, a sufficiently large nPts relative to nGhostCells (say, 1024 vs 3) should not be noticeably slower than a zero-branching version, apart from the latency of the "or" operation. Even that latency should be hidden behind array-access latency, thanks to thread-level parallelism.
At those boundary points, at most 16 or 32 threads (one SIMD group) would lose some performance, and only for several clock cycles, because of the lock-step execution of SIMD-like architectures.
If you happen to have chaotic branching, such as a data-driven code path, then you should split the work into different kernels (for different regions) or sort the data before launching the kernel so that the average branch divergence between neighboring threads is minimized.
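Since the question uses PyOpenCL, here is a minimal sketch (the constant values and kernel body are illustrative, not from the question) of how nPts and nGhostCells could be passed as compile-time constants via build options, making the guard branch fully predictable:
import pyopencl as cl

nPts, nGhostCells = 1024, 2   # illustrative values

ctx = cl.create_some_context()
src = """
__kernel void myMethod(__global float *retVal) {
    int gid = get_global_id(0);
    if (gid < NGHOST || gid > NPTS - NGHOST) {   // NPTS/NGHOST are folded at build time
        retVal[gid] = 0;
        return;
    }
    // interior-point calculations go here
}
"""
# -D defines turn NPTS and NGHOST into compile-time constants inside the kernel
prg = cl.Program(ctx, src).build(options=["-DNPTS=%d" % nPts,
                                          "-DNGHOST=%d" % nGhostCells])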
I'm benchmarking Julia execution speed. I executed @time [i^2 for i in 1:1000] at the Julia prompt, which resulted in something on the order of 20 ms. This seems strange, since my computer is modern with an i7 processor (I am using Linux Ubuntu).
Another strange thing is that when I execute the same command on a range of 1:10 the execution time is 15 ms.
There must be something trivial that I am missing here?
Several things, see performance tips:
Don't benchmark in global scope.
Don't measure the first execution of something like this.
Use BenchmarkTools.
Julia is a JIT-compiled language, so the first time you measure things, you're measuring compilation time. This is a small fixed overhead, so for anything that takes a substantial time, it's negligible, but for short-running code like this, it's almost all of the time. Non-constant global variables force the compiler to assume almost nothing about types, which tends to poison all of your performance. This is fine in some circumstances, but most of the time, you a) should write code so that the inputs are explicit parameters to functions, rather than implicit parameters coming from some globals, and b) shouldn't write code that uses mutable global state.
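A minimal sketch of what that advice looks like in practice (the function name is illustrative; the BenchmarkTools package must be installed):
using BenchmarkTools

# Wrap the work in a function so the compiler can specialize on the argument
# types instead of benchmarking in global scope.
squares(n) = [i^2 for i in 1:n]

squares(1000)          # the first call pays the one-time compilation cost
@btime squares(1000)   # @btime runs the call many times and reports the minimum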
I'm working with large numbers that I can't have rounded off. Using Lua's standard math library, there seems to be no convenient way to preserve precision past some internal limit. I also see there are several libraries that can be loaded to work with big numbers:
http://oss.digirati.com.br/luabignum/
http://www.tc.umn.edu/~ringx004/mapm-main.html
http://lua-users.org/lists/lua-l/2002-02/msg00312.html (might be identical to #2)
http://www.gammon.com.au/scripts/doc.php?general=lua_bc (but I can't find any source)
Further, there are many libraries in C that could be called from Lua, if the bindings were established.
Have you had any experience with one or more of these libraries?
Using lbc instead of lmapm would be easier because lbc is self-contained.
local bc = require"bc"
local s = bc.pow(2, 1000):tostring()   -- 2^1000 as a decimal string
local z = 0
for i = 1, #s do
  z = z + s:byte(i) - ("0"):byte(1)    -- add the value of each decimal digit
end
print(z)                               -- sum of the digits of 2^1000
I used Norman Ramsey's suggestion to solve Project Euler problem #16. I don't think it's a spoiler to say that the crux of the problem is calculating a 302-digit integer accurately.
Here are the steps I needed to install and use the library:
Lua needs to be built with dynamic loading enabled. I use Cygwin, but I changed PLAT in src/Makefile to be linux. The default, none, doesn't enable dynamic loading.
MAPM needs to be built and installed somewhere that your C compiler can find it. I put libmapm.a in /usr/local/lib/. Next, m_apm.h and m_apm_lc.h went to /usr/local/include/.
The makefile for lmapm needs to be altered to point to the correct locations of the Lua and MAPM libraries. For me, that meant uncommenting the second declaration of LUA, LUAINC, LUALIB, and LUABIN and editing the declaration of MAPM.
Finally, mapm.so needs to be placed somewhere that Lua will find it. I put it at /usr/local/lib/lua/5.1/.
Thank you all for the suggestions!
The lmapm library by Luiz Figueiredo, one of the authors of the Lua language.
I can't really answer, but I will add LGMP, a GMP binding (I haven't used it myself).
Not my field of expertise, but I would expect the GNU multiple precision arithmetic library to be quite a standard here, no?
Though not arbitrary precision, Lua decNumber, a Lua 5.1 wrapper for IBM decNumber, implements the proposed General Decimal Arithmetic standard IEEE 754r. It has the Lua 5.1 arithmetic operators and more, full control over rounding modes, and working precision up to 69 decimal digits.
There are several libraries for the problem, each one with its advantages and disadvantages; the best choice depends on your requirements. I would say lbc is a good first pick if it fulfills your requirements, as is any other library by Luiz Figueiredo. The most efficient would probably be anything using GMP bindings, as GMP is a standard C library for dealing with large integers and is very well optimized.
Nevertheless, in case you are looking for a pure Lua option, the lua-bint library could be a choice for dealing with big integers. I wouldn't say it's the best, because there are more efficient and complete ones, such as those mentioned above, but they usually require compiling C code or can be troublesome to set up. When comparing pure Lua big integer libraries, and depending on your use case, it could perhaps be an efficient choice. The library is documented, the code is fully covered by tests, and there are many examples. But take this recommendation with a grain of salt, because I am the library's author.
To install, you can use LuaRocks if you already have it on your computer, or simply drop the bint.lua file into your project, as it has no dependencies other than requiring Lua 5.3+.
Here is a small example using it to solve problem #16 from Project Euler (mentioned in previous answers):
local bint = require 'bint'(1024)
local n = bint(1) << 1000
local digits = tostring(n)
local sum = 0
for i = 1, #digits do
  sum = sum + tonumber(digits:sub(i, i))
end
print(sum) -- should output 1366
What is the smartest way to design a math parser? What I mean is a function that takes a math string (like "2 + 3 / 2 + (2 * 5)") and returns the calculated value. I did write one in VB6 ages ago, but it ended up being way too bloated and not very portable (or smart, for that matter...). General ideas, pseudocode, or real code are appreciated.
A pretty good approach involves two steps. The first step converts the expression from infix to postfix notation (e.g. via Dijkstra's shunting-yard algorithm). Once that's done, it's pretty trivial to write a postfix evaluator.
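For illustration, here is a minimal sketch of that two-step approach in Java (the class and method names are made up, tokens are assumed to be space-separated, only + - * / and parentheses are handled, and error handling is omitted):
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class TinyCalc {
    static boolean isOp(String t) {
        return t.equals("+") || t.equals("-") || t.equals("*") || t.equals("/");
    }

    static int prec(String op) {
        return (op.equals("+") || op.equals("-")) ? 1 : 2;   // * and / bind tighter
    }

    // Step 1: infix -> postfix via the shunting-yard algorithm.
    static List<String> toPostfix(String[] tokens) {
        List<String> out = new ArrayList<>();
        Deque<String> ops = new ArrayDeque<>();
        for (String t : tokens) {
            if (isOp(t)) {
                // pop operators of higher OR EQUAL precedence (left-associative)
                while (!ops.isEmpty() && isOp(ops.peek()) && prec(ops.peek()) >= prec(t))
                    out.add(ops.pop());
                ops.push(t);
            } else if (t.equals("(")) {
                ops.push(t);
            } else if (t.equals(")")) {
                while (!ops.peek().equals("(")) out.add(ops.pop());
                ops.pop();                       // discard the "("
            } else {
                out.add(t);                      // a number goes straight to the output
            }
        }
        while (!ops.isEmpty()) out.add(ops.pop());
        return out;
    }

    // Step 2: evaluate the postfix token list with a single operand stack.
    static double evalPostfix(List<String> postfix) {
        Deque<Double> stack = new ArrayDeque<>();
        for (String t : postfix) {
            if (isOp(t)) {
                double b = stack.pop(), a = stack.pop();
                switch (t) {
                    case "+": stack.push(a + b); break;
                    case "-": stack.push(a - b); break;
                    case "*": stack.push(a * b); break;
                    default:  stack.push(a / b); break;
                }
            } else {
                stack.push(Double.parseDouble(t));
            }
        }
        return stack.pop();
    }

    public static void main(String[] args) {
        String[] tokens = "2 + 3 / 2 + ( 2 * 5 )".split(" ");
        System.out.println(evalPostfix(toPostfix(tokens)));   // prints 13.5
    }
}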
I wrote a few blog posts about designing a math parser. There is a general introduction, basic knowledge about grammars, sample implementation written in Ruby and a test suite. Perhaps you will find these materials useful.
You have a couple of approaches. You could generate dynamic code and execute it in order to get the answer without needing to write much code. Just perform a search on runtime generated code in .NET and there are plenty of examples around.
Alternatively you could create an actual parser and generate a little parse tree that is then used to evaluate the expression. Again this is pretty simple for basic expressions. Check out codeplex as I believe they have a math parser on there. Or just look up BNF which will include examples. Any website introducing compiler concepts will include this as a basic example.
Codeplex Expression Evaluator
If you have an "always on" application, you could just post the math string to Google and parse the result. It's a simple approach, and maybe not what you need, but smart in some way, I guess.
I know this is old, but I came across it while trying to develop a calculator as part of a larger app, and ran into some issues using the accepted answer. The links were IMMENSELY helpful in understanding and solving this problem and should not be discounted. I was writing an Android app in Java, and for each item in the expression "string" I actually stored a String in an ArrayList as the user types on the keypad. For the infix-to-postfix conversion, I iterated through each String in the ArrayList, then evaluated the newly arranged postfix ArrayList of Strings. This was fantastic for a small number of operands/operators, but longer calculations were consistently off, especially as the expressions started evaluating to non-integers.
In the provided link for infix-to-postfix conversion, it suggests popping the stack if the scanned item is an operator and the top-of-stack item has a higher precedence. I found that this is almost correct. Popping the top-of-stack item if its precedence is higher than OR EQUAL to that of the scanned operator finally made my calculations come out correct. Hopefully this will help anyone working on this problem, and thanks to Justin Poliey (and fas?) for providing some invaluable links.
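Expressed as code (a sketch; ops is the operator stack, prec is a hypothetical precedence lookup, and isOp tests for an operator token), the corrected pop condition for left-associative operators is:
// pop while the operator on top has higher OR EQUAL precedence than the scanned one
while (!ops.isEmpty() && isOp(ops.peek()) && prec(ops.peek()) >= prec(scanned)) {
    output.add(ops.pop());
}
ops.push(scanned);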
The related question Equation (expression) parser with precedence? has some good information on how to get started with this as well.
-Adam
Assuming your input is an infix expression in string format, you could convert it to postfix and, using a pair of stacks (an operator stack and an operand stack), work out the solution from there. You can find general algorithm information at the Wikipedia link.
ANTLR is a very nice LL(*) parser generator. I recommend it highly.
Developers always want a clean approach and try to implement the parsing logic from the ground up, usually ending up with the Dijkstra shunting-yard algorithm. The result is neat-looking code, but it is possibly riddled with bugs. I have developed such an API, JMEP, that does all that, but it took me years to get to stable code.
Even with all that work, you can see from the project page that I am seriously considering switching over to JavaCC or ANTLR.
11 years into the future from when this question was asked: If you don't want to re-invent the wheel, there are many exotic math parsers out there.
There is one that I wrote years ago which supports arithmetic operations, equation solving, differential calculus, integral calculus, basic statistics, function/formula definition, graphing, etc.
It's called ParserNG, and it's free.
Evaluating an expression is as simple as:
MathExpression expr = new MathExpression("(34+32)-44/(8+9(3+2))-22");
System.out.println("result: " + expr.solve());
result: 43.16981132075472
Or using variables and calculating simple expressions:
MathExpression expr = new MathExpression("r=3;P=2*pi*r;");
System.out.println("result: " + expr.getValue("P"));
Or using functions:
MathExpression expr = new MathExpression("f(x)=39*sin(x^2)+x^3*cos(x);f(3)");
System.out.println("result: " + expr.solve());
result: -10.65717648378352
Or to evaluate the derivative at a given point (note that it does symbolic differentiation (not numerical) behind the scenes, so the accuracy is not limited by the errors of numerical approximation):
MathExpression expr = new MathExpression("f(x)=x^3*ln(x); diff(f,3,1)");
System.out.println("result: " + expr.solve());
result: 38.66253179403897
Which differentiates x^3 * ln(x) once at x=3.
The number of times you can differentiate is 1 for now.
or for Numerical Integration:
MathExpression expr = new MathExpression("f(x)=2*x; intg(f,1,3)");
System.out.println("result: " + expr.solve());
result: 7.999999999998261... approx: 8
This parser is decently fast and has lots of other functionality.
Work has been concluded on porting it to Swift via bindings to Objective-C, and we have used it in graphing applications, among other iterative use cases.
DISCLAIMER: ParserNG is authored by me.