Speeding up AVR function pointers - arduino

I have a program for AVR where I would like to use a pointer to a function. But why is calling through a function pointer almost 4 times slower than a normal call? And how do I speed it up?
I have:
void simple_call(){ PORTB |= _BV(1); }
void (*simple)() = &simple_call;
Then if I compile with -O3 and call:
simple_call()
it takes 250ns to complete. If I instead call:
simple()
it takes 960ns to complete!!
How can I make it faster?

Why is it slower?
You see a 710 ns increase in time. For a 16 MHz clock, that time is 11 ticks.
It is not really fair to call it 4X, because the time increase is a constant overhead of the function pointer call. In your case, the function body is tiny, so the overhead is relatively large. But if the function were large and took 1 ms to execute, the time increase would still be 710 ns, and you would instead be asking why the function pointer takes 0.07% longer.
To see why one approach is faster than another, you need to look at the assembler code. Using build tools such as Eclipse allows you to get an assembler listing from the GCC compiler by adding command-line options that are not available in the Arduino IDE. This is invaluable for figuring out what is going on.
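One possible way to get a similar interleaved listing from the command line (an assumption on my part; the file names and the MCU are placeholders, and the listing below may have been produced differently):
avr-gcc -g -O3 -mmcu=atmega2560 sketch.cpp -o sketch.elf
avr-objdump -d -S sketch.elf > sketch.lst   # -S interleaves the C++ source with the disassembly (needs -g)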
Here is a section of the assembler listing showing what you think is going on:
simple_call();
308: 0e 94 32 01 call 0x264 ; 0x264 <_Z11simple_callv>
simple();
30c: e0 91 0a 02 lds r30, 0x020A
310: f0 91 0b 02 lds r31, 0x020B
314: 19 95 eicall
These listings show the source code and the assembler produced by the compiler. To make sense of them and figure out timing, you need the Atmel AVR instruction set reference, which contains a description of every instruction and the number of clock ticks it takes. The simple_call() is probably what you expected and takes 4 ticks. The simple() call does:
LDS = load address byte - 2 ticks
LDS = load address byte - 2 ticks
EICALL = indirect call to address loaded - 4 ticks
Those both call the function simple_call():
void simple_call(){ PORTB |= _BV(1); }
264: df 93 push r29
266: cf 93 push r28
268: cd b7 in r28, 0x3d ; 61
26a: de b7 in r29, 0x3e ; 62
26c: a5 e2 ldi r26, 0x25 ; 37
26e: b0 e0 ldi r27, 0x00 ; 0
270: e5 e2 ldi r30, 0x25 ; 37
272: f0 e0 ldi r31, 0x00 ; 0
274: 80 81 ld r24, Z
276: 82 60 ori r24, 0x02 ; 2
278: 8c 93 st X, r24
27a: cf 91 pop r28
27c: df 91 pop r29
27e: 08 95 ret
So the function pointer should take just 4 more ticks, which is small compared to all the instructions in the function body.
Above, I said should and what you think is going on. I cheated a bit: the assembler above was compiled with no optimizations.
You used the optimization -O3 which changes everything.
With the optimizations, the function body gets squashed to almost nothing:
void simple_call(){ PORTB |= _BV(1); }
264: 29 9a sbi 0x05, 1 ; 5
266: 08 95 ret
That is 2 + 4 ticks. The compiler gurus have taught the compiler a much better way to execute that one line of C++. But wait, there is more. When you "call" your function, the compiler says "why do that? It is just one assembler instruction." The compiler decides your call is pointless and puts the instruction inline:
void simple_call(){ PORTB |= _BV(1); }
2d6: 29 9a sbi 0x05, 1 ; 5
But with the optimizations, the function pointer call remains a call:
simple();
2d8: e0 91 0a 02 lds r30, 0x020A
2dc: f0 91 0b 02 lds r31, 0x020B
2e0: 19 95 eicall
So let's see if the math adds up. With the inlining, the "call" is 3 ticks. The indirect call is 8 + 6 = 14 ticks. The difference is 11 ticks! (I can add!)
So that is **why**.
How do I speed it up?
You don't need to: it is only 4 clock ticks more to make a function pointer call than a plain call. Except for the most trivial functions, it does not matter.
You can't: even if you inline the functions, you still need a conditional branch to choose between them. A bunch of load, compare, and conditional-jump instructions will take longer than the indirect call. In other words, the function pointer is a better way of branching than any chain of conditionals.
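To make that concrete, here is a small hypothetical C sketch (the handler names and the which flag are my own, not from the question): dispatching through a pointer costs one fixed indirect call, while dispatching through conditionals costs a load plus compares and branches that grow with the number of cases.

#include <avr/io.h>
#include <stdint.h>

static void handler_a(void) { PORTB |= _BV(1); }
static void handler_b(void) { PORTB |= _BV(2); }

/* Option 1: one indirect call (two LDS plus EICALL, or ICALL on smaller parts) -- a fixed overhead. */
static void (*handler)(void) = handler_a;

/* Option 2: conditional dispatch -- the compiler still has to load 'which',
   compare and branch, which quickly costs as much as the indirect call. */
static volatile uint8_t which = 0;

void dispatch(void)
{
    handler();          /* option 1: fixed-cost indirect call */

    if (which == 0)     /* option 2: cost grows with the number of cases */
        handler_a();
    else
        handler_b();
}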

Related

Generate QR Code with CIQRCodeDescriptor initWithPayload: symbolVersion: maskPattern: errorCorrectionLevel:

I am trying to generate a QR Code using CoreImage.
I want to be able to control the symbol version, the masking pattern, and the error correction level.
Using the simple "CIFilter filterWithName:" does not give you the ability to set the symbol version or the mask pattern.
The only way it seems possible is to use a CIQRCodeDescriptor - using "CIQRCodeDescriptor initWithPayload: symbolVersion: maskPattern: errorCorrectionLevel:"
Has anyone been able to use this method to successfully generate a QR Code image?
If so, can you please post a simple complete example?
To be able to use CIQRCodeDescriptor you need
codewords (mode + character count + data + terminator + padding)
correct symbol version (version for the character count; 1-40)
correct mask pattern (mask with minimum penalty; 0-7)
An example for "Think Different" follows.
Notice the extra bits in the codeword:
Think Different: 54 68 69 6E 6B 20 44 69 66 66 65 72 65 6E 74
Codeword: 40 F5 46 86 96 E6 B2 04 46 96 66 66 57 26 56 E7 40 EC 11
The codeword construction is explained at Nayuki or at the bottom.
let codeword: [UInt8] = [0x40, 0xf5, 0x46, 0x86, 0x96, 0xe6, 0xb2, 0x04, 0x46, 0x96, 0x66, 0x66, 0x57, 0x26, 0x56, 0xe7, 0x40, 0xec, 0x11]
let data = Data(bytes: codeword, count: codeword.count)

if let descriptor = CIQRCodeDescriptor(payload: data, symbolVersion: 1, maskPattern: 4, errorCorrectionLevel: .levelL) {
    if let image = imageFromBarcodeCodeDescriptor(descriptor)?.transformed(by: .init(scaleX: 10, y: 10)) {
        let newImage = NSImage()
        newImage.addRepresentation(NSCIImageRep(ciImage: image))
        imageView1.image = newImage
    }
}

func imageFromBarcodeCodeDescriptor(_ descriptor: CIBarcodeDescriptor) -> CIImage? {
    let filter = CIFilter(name: "CIBarcodeGenerator", parameters: ["inputBarcodeDescriptor": descriptor])
    return filter?.outputImage
}
Concatenate segments, add padding, make codewords
Notes:
The segment mode is always a 4-bit field.
The character count’s field width depends on the mode and version.
The terminator is normally four “0” bits, but fewer if the data codeword capacity is reached.
The bit padding is between zero and seven “0” bits, to fill all unused bits in the last byte.
The byte padding consists of alternating (hexadecimal) EC and 11 until the capacity is reached.
The entire sequence of data bits:
01000000111101010100011010000110100101101110011010110010000001000100011010010110011001100110011001010111001001100101011011100111010000001110110000010001
The entire sequence of data codeword bytes (by splitting the bit string into groups of 8 bits), displayed in hexadecimal: 40 F5 46 86 96 E6 B2 04 46 96 66 66 57 26 56 E7 40 EC 11
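For reference, the packing described in these notes can be reproduced with a short program. A minimal C sketch (my own illustration, not from the answer; it assumes byte mode, symbol version 1 and error correction level L, i.e. 19 data codewords, and it does not compute any error-correction codewords):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Append 'n' bits of 'value' (most significant bit first) to the bit buffer. */
static void append_bits(uint8_t *buf, size_t *bitpos, unsigned value, int n)
{
    for (int b = n - 1; b >= 0; b--, (*bitpos)++)
        if ((value >> b) & 1)
            buf[*bitpos / 8] |= (uint8_t)(0x80 >> (*bitpos % 8));
}

/* Build the 19 data codewords of a version 1-L, byte-mode QR symbol.
   Assumes the message fits (at most 17 bytes), so the 8-bit count field
   and the full 4-bit terminator both apply. */
static void make_codewords(const char *msg, uint8_t out[19])
{
    size_t len = strlen(msg), bitpos = 0;
    memset(out, 0, 19);

    append_bits(out, &bitpos, 0x4, 4);            /* mode indicator: 0100 = byte mode */
    append_bits(out, &bitpos, (unsigned)len, 8);  /* character count field            */
    for (size_t i = 0; i < len; i++)
        append_bits(out, &bitpos, (uint8_t)msg[i], 8);
    append_bits(out, &bitpos, 0x0, 4);            /* terminator (four "0" bits)       */
    /* bit padding to the next byte boundary is already 0 because out[] was zeroed */

    /* byte padding: alternate 0xEC and 0x11 until all 19 codewords are filled */
    int toggle = 0;
    for (size_t i = (bitpos + 7) / 8; i < 19; i++, toggle ^= 1)
        out[i] = toggle ? 0x11 : 0xEC;
}

int main(void)
{
    uint8_t cw[19];
    make_codewords("Think Different", cw);
    for (int i = 0; i < 19; i++)
        printf("%02X ", cw[i]);
    printf("\n");
    return 0;
}

Running it should print the same 19 bytes listed above (40 F5 46 ... 40 EC 11).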
It seems CIQRCodeGenerator doesn't support those parameters.
Maybe you can find what you are looking for in this library.
You need to use "CIBarcodeGenerator" CIFilter with the CIQRCodeDescriptor as input:
let data = ... // error corrected payload data
if let barcode = CIQRCodeDescriptor(payload: data,
                                    symbolVersion: 1,              // 1...40
                                    maskPattern: 0,                // 0...7
                                    errorCorrectionLevel: .levelL) // any of the available enum values
{
    let filter = CIFilter(name: "CIBarcodeGenerator",
                          parameters: ["inputBarcodeDescriptor": barcode])
    let image = filter?.outputImage
}
The caveat, though, is that you somehow need to obtain the errorCorrectedPayload data for the message you are trying to encode. One way to do this would be to use "CIQRCodeGenerator" to encode the message, parse the resulting image with Vision to extract the barcode descriptor from it, and then get the errorCorrectedPayload data from that descriptor.
A simple working example is:
// Create the CIFilter (CIQRCodeGenerator)
CIFilter *ciFilter = [CIFilter filterWithName:@"CIQRCodeGenerator"];
[ciFilter setDefaults];
NSData *data = [@"123456" dataUsingEncoding:NSUTF8StringEncoding]; // QR code value
[ciFilter setValue:data forKey:@"inputMessage"];
[ciFilter setValue:@"L" forKey:@"inputCorrectionLevel"]; // L: low, M: Medium, Q: Quartile, H: High
// Create the image at the desired size
CGSize size = CGSizeMake(280, 280);// Desired QR code size
CGRect rect = CGRectIntegral(ciFilter.outputImage.extent);
CIImage *ciImage = [ciFilter.outputImage imageByApplyingTransform:CGAffineTransformMakeScale(size.width/CGRectGetWidth(rect), size.height/CGRectGetHeight(rect))];
// Create a UIImage (if needed)
UIImage *image = [UIImage imageWithCIImage:ciImage];
_imageView.image = image;

What is the difference between @code_native, @code_typed and @code_llvm in Julia?

While going through Julia, I wanted functionality similar to Python's dis module.
Searching around the net, I found that the Julia community has worked on this issue and provided these (https://github.com/JuliaLang/julia/issues/218):
finfer -> code_typed
methods(function, types) -> code_lowered
disassemble(function, types, true) -> code_native
disassemble(function, types, false) -> code_llvm
I have tried these personally in the Julia REPL, but I find them quite hard to understand.
In Python, I can disassemble a function like this.
>>> import dis
>>> dis.dis(lambda x: 2*x)
1 0 LOAD_CONST 1 (2)
3 LOAD_FAST 0 (x)
6 BINARY_MULTIPLY
7 RETURN_VALUE
>>>
Can anyone who has worked with these help me understand them more? Thanks.
The standard CPython implementation of Python parses source code and does some pre-processing and simplification of it – aka "lowering" – transforming it to a machine-friendly, easy-to-interpret format called "bytecode". This is what is displayed when you "disassemble" a Python function. This code is not executable by the hardware – it is "executable" by the CPython interpreter. CPython's bytecode format is fairly simple, partly because that's what interpreters tend to do well with – if the bytecode is too complex, it slows down the interpreter – and partly because the Python community tends to put a high premium on simplicity, sometimes at the cost of high performance.
Julia's implementation is not interpreted, it is just-in-time (JIT) compiled. This means that when you call a function, it is transformed to machine code which is executed directly by the native hardware. This process is quite a bit more complex than the parsing and lowering to bytecode that Python does, but in exchange for that complexity, Julia gets its hallmark speed. (The PyPy JIT for Python is also much more complex than CPython but also typically much faster – increased complexity is a fairly typical cost for speed.) The four levels of "disassembly" for Julia code give you access to the representation of a Julia method implementation for particular argument types at different stages of the transformation from source code to machine code. I'll use the following function which computes the next Fibonacci number after its argument as an example:
function nextfib(n)
    a, b = one(n), one(n)
    while b < n
        a, b = b, a + b
    end
    return b
end
julia> nextfib(5)
5
julia> nextfib(6)
8
julia> nextfib(123)
144
Lowered code. The @code_lowered macro displays code in a format that is the closest to Python byte code, but rather than being intended for execution by an interpreter, it's intended for further transformation by a compiler. This format is largely internal and not intended for human consumption. The code is transformed into "single static assignment" form in which "each variable is assigned exactly once, and every variable is defined before it is used". Loops and conditionals are transformed into gotos and labels using a single unless/goto construct (this is not exposed in user-level Julia). Here's our example code in lowered form (in Julia 0.6.0-pre.beta.134, which is just what I happen to have available):
julia> @code_lowered nextfib(123)
CodeInfo(:(begin
nothing
SSAValue(0) = (Main.one)(n)
SSAValue(1) = (Main.one)(n)
a = SSAValue(0)
b = SSAValue(1) # line 3:
7:
unless b < n goto 16 # line 4:
SSAValue(2) = b
SSAValue(3) = a + b
a = SSAValue(2)
b = SSAValue(3)
14:
goto 7
16: # line 6:
return b
end))
You can see the SSAValue nodes and unless/goto constructs and label numbers. This is not that hard to read, but again, it's also not really meant to be easy for human consumption. Lowered code doesn't depend on the types of the arguments, except in as far as they determine which method body to call – as long as the same method is called, the same lowered code applies.
Typed code. The @code_typed macro presents a method implementation for a particular set of argument types after type inference and inlining. This incarnation of the code is similar to the lowered form, but with expressions annotated with type information and some generic function calls replaced with their implementations. For example, here is the typed code for our example function:
julia> @code_typed nextfib(123)
CodeInfo(:(begin
a = 1
b = 1 # line 3:
4:
unless (Base.slt_int)(b, n)::Bool goto 13 # line 4:
SSAValue(2) = b
SSAValue(3) = (Base.add_int)(a, b)::Int64
a = SSAValue(2)
b = SSAValue(3)
11:
goto 4
13: # line 6:
return b
end))=>Int64
Calls to one(n) have been replaced with the literal Int64 value 1 (on my system the default integer type is Int64). The expression b < n has been replaced with its implementation in terms of the slt_int intrinsic ("signed integer less than") and the result of this has been annotated with return type Bool. The expression a + b has also been replaced with its implementation in terms of the add_int intrinsic and its result type annotated as Int64. And the return type of the entire function body has been annotated as Int64.
Unlike lowered code, which depends only on argument types to determine which method body is called, the details of typed code depend on argument types:
julia> @code_typed nextfib(Int128(123))
CodeInfo(:(begin
SSAValue(0) = (Base.sext_int)(Int128, 1)::Int128
SSAValue(1) = (Base.sext_int)(Int128, 1)::Int128
a = SSAValue(0)
b = SSAValue(1) # line 3:
6:
unless (Base.slt_int)(b, n)::Bool goto 15 # line 4:
SSAValue(2) = b
SSAValue(3) = (Base.add_int)(a, b)::Int128
a = SSAValue(2)
b = SSAValue(3)
13:
goto 6
15: # line 6:
return b
end))=>Int128
This is the typed version of the nextfib function for an Int128 argument. The literal 1 must be sign extended to Int128 and the result types of operations are of type Int128 instead of Int64. The typed code can be quite different if the implementation of a type is considerably different. For example nextfib for BigInts is significantly more involved than for simple "bits types" like Int64 and Int128:
julia> @code_typed nextfib(big(123))
CodeInfo(:(begin
$(Expr(:inbounds, false))
# meta: location number.jl one 164
# meta: location number.jl one 163
# meta: location gmp.jl convert 111
z#_5 = $(Expr(:invoke, MethodInstance for BigInt(), :(Base.GMP.BigInt))) # line 112:
$(Expr(:foreigncall, (:__gmpz_set_si, :libgmp), Void, svec(Ptr{BigInt}, Int64), :(&z#_5), :(z#_5), 1, 0))
# meta: pop location
# meta: pop location
# meta: pop location
$(Expr(:inbounds, :pop))
$(Expr(:inbounds, false))
# meta: location number.jl one 164
# meta: location number.jl one 163
# meta: location gmp.jl convert 111
z#_6 = $(Expr(:invoke, MethodInstance for BigInt(), :(Base.GMP.BigInt))) # line 112:
$(Expr(:foreigncall, (:__gmpz_set_si, :libgmp), Void, svec(Ptr{BigInt}, Int64), :(&z#_6), :(z#_6), 1, 0))
# meta: pop location
# meta: pop location
# meta: pop location
$(Expr(:inbounds, :pop))
a = z#_5
b = z#_6 # line 3:
26:
$(Expr(:inbounds, false))
# meta: location gmp.jl < 516
SSAValue(10) = $(Expr(:foreigncall, (:__gmpz_cmp, :libgmp), Int32, svec(Ptr{BigInt}, Ptr{BigInt}), :(&b), :(b), :(&n), :(n)))
# meta: pop location
$(Expr(:inbounds, :pop))
unless (Base.slt_int)((Base.sext_int)(Int64, SSAValue(10))::Int64, 0)::Bool goto 46 # line 4:
SSAValue(2) = b
$(Expr(:inbounds, false))
# meta: location gmp.jl + 258
z#_7 = $(Expr(:invoke, MethodInstance for BigInt(), :(Base.GMP.BigInt))) # line 259:
$(Expr(:foreigncall, ("__gmpz_add", :libgmp), Void, svec(Ptr{BigInt}, Ptr{BigInt}, Ptr{BigInt}), :(&z#_7), :(z#_7), :(&a), :(a), :(&b), :(b)))
# meta: pop location
$(Expr(:inbounds, :pop))
a = SSAValue(2)
b = z#_7
44:
goto 26
46: # line 6:
return b
end))=>BigInt
This reflects the fact that operations on BigInts are pretty complicated and involve memory allocation and calls to the external GMP library (libgmp).
LLVM IR. Julia uses the LLVM compiler framework to generate machine code. LLVM defines an assembly-like language which it uses as a shared intermediate representation (IR) between different compiler optimization passes and other tools in the framework. There are three isomorphic forms of LLVM IR:
A binary representation that is compact and machine readable.
A textual representation that is verbose and somewhat human readable.
An in-memory representation that is generated and consumed by LLVM libraries.
Julia uses LLVM's C++ API to construct LLVM IR in memory (form 3) and then calls some LLVM optimization passes on that form. When you do @code_llvm you see the LLVM IR after generation and some high-level optimizations. Here's the LLVM code for our ongoing example:
julia> @code_llvm nextfib(123)
define i64 @julia_nextfib_60009(i64) #0 !dbg !5 {
top:
br label %L4
L4: ; preds = %L4, %top
%storemerge1 = phi i64 [ 1, %top ], [ %storemerge, %L4 ]
%storemerge = phi i64 [ 1, %top ], [ %2, %L4 ]
%1 = icmp slt i64 %storemerge, %0
%2 = add i64 %storemerge, %storemerge1
br i1 %1, label %L4, label %L13
L13: ; preds = %L4
ret i64 %storemerge
}
This is the textual form of the in-memory LLVM IR for the nextfib(123) method implementation. LLVM IR is not easy to read – it's not intended to be written or read by people most of the time – but it is thoroughly specified and documented. Once you get the hang of it, it's not hard to understand. This code jumps to the label L4 and initializes the "registers" %storemerge1 and %storemerge with the i64 (LLVM's name for Int64) value 1 (their values are derived differently when jumped to from different locations – that's what the phi instruction does). It then does an icmp slt comparing %storemerge with register %0 – which holds the argument untouched for the entire method execution – and saves the comparison result into the register %1. It does an add i64 on %storemerge and %storemerge1 and saves the result into register %2. If %1 is true, it branches back to L4 and otherwise it branches to L13. When the code loops back to L4, the register %storemerge1 gets the previous value of %storemerge and %storemerge gets the previous value of %2.
Native code. Since Julia executes native code, the last form a method implementation takes is what the machine actually executes. This is just binary code in memory, which is rather hard to read, so long ago people invented various forms of "assembly language" which represent instructions and registers with names and have some amount of simple syntax to help express what instructions do. In general, assembly language remains close to one-to-one correspondence with machine code, in particular, one can always "disassemble" machine code into assembly code. Here's our example:
julia> @code_native nextfib(123)
.section __TEXT,__text,regular,pure_instructions
Filename: REPL[1]
pushq %rbp
movq %rsp, %rbp
movl $1, %ecx
movl $1, %edx
nop
L16:
movq %rdx, %rax
Source line: 4
movq %rcx, %rdx
addq %rax, %rdx
movq %rax, %rcx
Source line: 3
cmpq %rdi, %rax
jl L16
Source line: 6
popq %rbp
retq
nopw %cs:(%rax,%rax)
This is on an Intel Core i7, which is in the x86_64 CPU family. It only uses standard integer instructions, so it doesn't matter beyond that what the architecture is, but you can get different results for some code depending on the specific architecture of your machine, since JIT code can be different on different systems. The pushq and movq instructions at the beginning are a standard function preamble, saving registers to the stack; similarly, popq restores the registers and retq returns from the function; nopw is a 2-byte instruction that does nothing, included just to pad the length of the function. So the meat of the code is just this:
movl $1, %ecx
movl $1, %edx
nop
L16:
movq %rdx, %rax
Source line: 4
movq %rcx, %rdx
addq %rax, %rdx
movq %rax, %rcx
Source line: 3
cmpq %rdi, %rax
jl L16
The movl instructions at the top initialize registers with 1 values. The movq instructions move values between registers and the addq instruction adds registers. The cmpq instruction compares two registers and jl either jumps back to L16 or continues to return from the function. This handful of integer machine instructions in a tight loop is exactly what executes when your Julia function call runs, presented in slightly more pleasant human-readable form. It's easy to see why it runs fast.
If you're interested in JIT compilation in general as compared to interpreted implementations, Eli Bendersky has a great pair of blog posts where he goes from a simple interpreter implementation of a language to a (simple) optimizing JIT for the same language:
http://eli.thegreenplace.net/2017/adventures-in-jit-compilation-part-1-an-interpreter/
http://eli.thegreenplace.net/2017/adventures-in-jit-compilation-part-2-an-x64-jit.html

Implement recursion in ASM without procedures

I'm trying to implement functions and recursion in an ASM-like simplified language that has no procedures. Only simple jumpz, jump, push, pop, add, mul type commands.
Here are the commands:
(all variables and literals are integers)
set (sets the value of an already existing variable or declares and initializes a new variable) e.g. (set x 3)
push (pushes a value onto the stack. can be a variable or an integer) e.g. (push 3) or (push x)
pop (pops the stack into a variable) e.g. (pop x)
add (adds the second argument to the first argument) e.g. (add x 1) or (add x y)
mul (same as add but for multiplication)
jump (jumps to a specific line of code) e.g. (jump 3) would jump to line 3 or (jump x) would jump to the line # equal to the value of x
jumpz (jumps to a line number if the second argument is equal to zero) e.g. (jumpz 3 x) or (jumpz z x)
The variable 'IP' is the program counter and is equal to the line number of the current line of code being executed.
In this language, functions are blocks of code at the bottom of the program that are terminated by popping a value off the stack and jumping to that value (using the stack as a call stack). The functions can then be called anywhere else in the program by simply pushing the instruction pointer onto the stack and then jumping to the start of the function.
This works fine for non-recursive functions.
How could this be modified to handle recursion?
I've read that implementing recursion with a stack is a matter of pushing parameters and local variables onto the stack (and in this lower level case, also the instruction pointer I think)
I wouldn't be able to do something like x = f(n) directly. Instead, I'd have some variable y (that is also used in the body of f), set y equal to n, and call f, which assigns its "return value" to y and then jumps control back to where f was called from, where we then set x equal to y.
(a function that doubles a number, whose definition starts at line 36)
1 - set y 3
2 - set returnLine IP
3 - add returnLine 2
4 - push returnLine
5 - jump 36
6 - set x y
...
36 - mul y 2
37 - pop returnLine
38 - jump returnLine
This doesn't seem to lend itself to recursion. Arguments and intermediate values would need to go on the stack, and I think recursive calls would leave multiple instances of the same return address on the stack, which is fine.
The following code raises the number "base" to the power "exponent" recursively in "John Smith Assembly":
1 - set base 2 ;RAISE 2 TO ...
2 - set exponent 4 ;... EXPONENT 4 (2^4=16).
3 - set result 1 ;MUST BE 1 IN ORDER TO MULTIPLY.
4 - set returnLine IP ;IP = 4.
5 - add returnLine 4 ;RETURNLINE = 4+4.
6 - push returnLine ;PUSH 8.
7 - jump 36 ;CALL FUNCTION.
.
.
.
;POWER FUNCTION.
36 - jumpz 43 exponent ;FINISH IF EXPONENT IS ZERO.
37 - mul result base ;RESULT = ( RESULT * BASE ).
38 - add exponent -1 ;RECURSIVE CONTROL VARIABLE.
39 - set returnLine IP ;IP = 39.
40 - add returnLine 4 ;RETURN LINE = 39+4.
41 - push returnLine ;PUSH 43.
42 - jump 36 ;RECURSIVE CALL.
43 - pop returnLine
44 - jump returnLine
;POWER END.
In order to test it, let's run it manually:
LINE | BASE EXPONENT RESULT RETURNLINE STACK
------|---------------------------------------
1 | 2
2 | 4
3 | 1
4 | 4
5 | 8
6 | 8
7 |
36 |
37 | 2
38 | 3
39 | 39
40 | 43
41 | 43(1)
42 |
36 |
37 | 4
38 | 2
39 | 39
40 | 43
41 | 43(2)
42 |
36 |
37 | 8
38 | 1
39 | 39
40 | 43
41 | 43(3)
42 |
36 |
37 | 16
38 | 0
39 | 39
40 | 43
41 | 43(4)
42 |
36 |
43 | 43(4)
44 |
43 | 43(3)
44 |
43 | 43(2)
44 |
43 | 43(1)
44 |
43 | 8
44 |
8 |
Edit: the parameters for the function are now on the stack (didn't run it manually):
1 - set base 2 ;RAISE 2 TO ...
2 - set exponent 4 ;... EXPONENT 4 (2^4=16).
3 - set result 1 ;MUST BE 1 IN ORDER TO MULTIPLY.
4 - set returnLine IP ;IP = 4.
5 - add returnLine 7 ;RETURNLINE = 4+7.
6 - push returnLine ;PUSH 11.
7 - push base ;FIRST PARAMETER.
8 - push result ;SECOND PARAMETER.
9 - push exponent ;THIRD PARAMETER.
10 - jump 36 ;FUNCTION CALL.
...
;POWER FUNCTION.
36 - pop exponent ;THIRD PARAMETER.
37 - pop result ;SECOND PARAMETER.
38 - pop base ;FIRST PARAMETER.
39 - jumpz 49 exponent ;FINISH IF EXPONENT IS ZERO.
40 - mul result base ;RESULT = ( RESULT * BASE ).
41 - add exponent -1 ;RECURSIVE CONTROL VARIABLE.
42 - set returnLine IP ;IP = 42.
43 - add returnLine 7 ;RETURN LINE = 42+7.
44 - push returnLine ;PUSH 49.
45 - push base
46 - push result
47 - push exponent
48 - jump 36 ;RECURSIVE CALL.
49 - pop returnLine
50 - jump returnLine
;POWER END.
Your asm does provide enough facilities to implement the usual procedure call / return sequence. You can push a return address and jump as a call, and pop a return address (into a scratch location) and do an indirect jump to it as a ret. We can just make call and ret macros. (Except that generating the correct return address is tricky in a macro; we might need a label (push ret_addr), or something like set tmp, IP / add tmp, 4 / push tmp / jump target_function). In short, it's possible and we should wrap it up in some syntactic sugar so we don't get bogged down with that while looking at recursion.
With the right syntactic sugar, you can implement Fibonacci(n) in assembly that will actually assemble for both x86 and your toy machine.
You're thinking in terms of functions that modify static (global) variables. Recursion requires local variables so each nested call to the function has its own copy of local variables. Instead of having registers, your machine has (apparently unlimited) named static variables (like x and y). If you want to program it like MIPS or x86, and copy an existing calling convention, just use some named variables like eax, ebx, ..., or r0 .. r31 the way a register architecture uses registers.
Then you implement recursion the same way you do in a normal calling convention, where either the caller or callee use push / pop to save/restore a register on the stack so it can be reused. Function return values go in a register. Function args should go in registers. An ugly alternative would be to push them after the return address (creating a caller-cleans-the-args-from-the-stack calling convention), because you don't have a stack-relative addressing mode to access them the way x86 does (above the return address on the stack). Or you could pass return addresses in a link register, like most RISC call instructions (usually called bl or similar, for branch-and-link), instead of pushing it like x86's call. (So non-leaf callees have to push the incoming lr onto the stack themselves before making another call)
A (silly and slow) naively-implemented recursive Fibonacci might do something like:
int Fib(int n) {
    if (n <= 1) return n;   // Fib(0) = 0; Fib(1) = 1
    return Fib(n-1) + Fib(n-2);
}
## valid implementation in your toy language *and* x86 (AMD64 System V calling convention)
### Convenience macros for the toy asm implementation
# pretend that the call implementation has some way to make each return_address label unique so you can use it multiple times.
# i.e. just pretend that pushing a return address and jumping is a solved problem, however you want to solve it.
%define call(target) push return_address; jump target; return_address:
%define ret pop rettmp; jump rettmp # dedicate a whole variable just for ret, because we can
# As the first thing in your program, set eax, 0 / set ebx, 0 / ...
global Fib
Fib:
# input: n in edi.
# output: return value in eax
# if (n<=1) return n; // the asm implementation of this part isn't interesting or relevant. We know it's possible with some adds and jumps, so just pseudocode / handwave it:
... set eax, edi and ret if edi <= 1 ... # (not shown because not interesting)
add edi, -1
push edi # save n-1 for use after the recursive call
call Fib # eax = Fib(n-1)
pop edi # restore edi to *our* n-1
push eax # save the Fib(n-1) result across the call
add edi, -1
call Fib # eax = Fib(n-2)
pop edi # use edi as a scratch register to hold Fib(n-1) that we saved earlier
add eax, edi # eax = return value = Fib(n-1) + Fib(n-2)
ret
During a recursive call to Fib(n-1) (with n-1 in edi as the first argument), the n-1 arg is also saved on the stack, to be restored later. So each function's stack frame contains the state that needs to survive the recursive call, and a return address. This is exactly what recursion is all about on a machine with a stack.
Jose's example doesn't demonstrate this as well, IMO, because no state needs to survive the call for pow. So it just ends up pushing a return address and args, then popping the args, building up just some return addresses. Then at the end, it follows the chain of return addresses. It could be extended to save local state across each nested call, but as written it doesn't actually illustrate that.
My implementation is a bit different from how gcc compiles the same C function for x86-64 (using the same calling convention of first arg in edi, ret value in eax). gcc6.1 with -O1 keeps it simple and actually does two recursive calls, as you can see on the Godbolt compiler explorer. (-O2 and especially -O3 do some aggressive transformations). gcc saves/restores rbx across the whole function, and keeps n in ebx so it's available after the Fib(n-1) call. (and keeps Fib(n-1) in ebx to survive the second call). The System V calling convention specifies rbx as a call-preserved register, but rdi as call-clobbered (and used for arg-passing).
Obviously you can implement Fib(n) much faster non-recursively, with O(n) time and O(1) space, instead of the O(Fib(n)) time and O(n) stack space of the naive recursion. That makes Fibonacci a terrible example for recursion, but the iterative version is trivial.
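For comparison, a minimal iterative sketch in C (my own, using the same Fib(0) = 0, Fib(1) = 1 convention as the recursive version above):

/* O(n) time, O(1) space: no call stack needed at all. */
int fib_iter(int n)
{
    int a = 0, b = 1;               /* Fib(0), Fib(1) */
    for (int i = 0; i < n; i++) {   /* advance the pair n times */
        int next = a + b;
        a = b;
        b = next;
    }
    return a;                       /* a == Fib(n) */
}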
Margaret's pastebin modified slightly to run in my interpreter for this language: (infinite loop problem, probably due to a transcription error on my part)
set n 3
push n
set initialCallAddress IP
add initialCallAddress 4
push initialCallAddress
jump fact
set finalValue 0
pop finalValue
print finalValue
jump 100
:fact
set rip 0
pop rip
pop n
push rip
set temp n
add n -1
jumpz end n
push n
set link IP
add link 4
push link
jump fact
pop n
mul temp n
:end
pop rip
push temp
jump rip
Successful transcription of Peter's Fibonacci calculator:
String[] x = new String[] {
//n is our input, which term of the sequence we want to calculate
"set n 5",
//temp variable for use throughout the program
"set temp 0",
//call fib
"set temp IP",
"add temp 4",
"push temp",
"jump fib",
//program is finished, prints return value and jumps to end
"print returnValue",
"jump end",
//the fib function, which gets called recursively
":fib",
//if this is the base case, then we assert that f(0) = f(1) = 1 and return from the call
"jumple base n 1",
"jump notBase",
":base",
"set returnValue n",
"pop temp",
"jump temp",
":notBase",
//we want to calculate f(n-1) and f(n-2)
//this is where we calculate f(n-1)
"add n -1",
"push n",
"set temp IP",
"add temp 4",
"push temp",
"jump fib",
//return from the call that calculated f(n-1)
"pop n",
"push returnValue",
//now we calculate f(n-2)
"add n -1",
"set temp IP",
"add temp 4",
"push temp",
"jump fib",
//return from call that calculated f(n-2)
"pop n",
"add returnValue n",
//this is where the fib function ultimately ends and returns to caller
"pop temp",
"jump temp",
//end label
":end"
};

Details of Syscall.RawSyscall() & Syscall.Syscall() in Go?

I'm reading the source code of the syscall package now, and I've run into some problems:
Since I'm totally a noob with syscalls and assembly, don't hesitate to share anything you know about them :)
First, about func RawSyscall(trap, a1, a2, a3 uintptr) (r1, r2 uintptr, err Errno): what do its parameters trap, a1, a2, a3 and the return values r1, r2 mean? I've searched the documentation and the site, but there seems to be very little description of this.
Second, since I'm using darwin/amd64, I searched the source code and found it here:
http://golang.org/src/pkg/syscall/asm_darwin_amd64.s?h=RawSyscall
It seems to be written in assembly (which I can't understand). Can you explain what happens in lines 61-80, and what is the meaning of the ok1: part at line 76?
I also found some code in http://golang.org/src/pkg/syscall/zsyscall_darwin_amd64.go. What does zsyscall mean in its filename?
What's the difference between Syscall and RawSyscall?
How and when should I use them if I want to write my own syscall function? (Yes, the os package gives many choices, but there are still some situations it doesn't cover.)
So many noob questions, thanks for your patience to read and answer :)
I'll share my reduced assembly knowledge with you:
61 TEXT ·RawSyscall(SB),7,$0
62 MOVQ 16(SP), DI
63 MOVQ 24(SP), SI
64 MOVQ 32(SP), DX
65 MOVQ $0, R10
66 MOVQ $0, R8
67 MOVQ $0, R9
68 MOVQ 8(SP), AX // syscall entry
69 ADDQ $0x2000000, AX
70 SYSCALL
71 JCC ok1
72 MOVQ $-1, 40(SP) // r1
73 MOVQ $0, 48(SP) // r2
74 MOVQ AX, 56(SP) // errno
75 RET
76 ok1:
77 MOVQ AX, 40(SP) // r1
78 MOVQ DX, 48(SP) // r2
79 MOVQ $0, 56(SP) // errno
80 RET
81
Line 61 is the routine entry point
Line 76 is a label called ok1
Line 71 is a conditional jump to label ok1.
The short names you see on every line on the left side are called mnemonics and stand for assembly instructions:
MOVQ means Move Quadword (64 bits of data).
ADDQ is Add Quadword.
SYSCALL is kinda obvious
JCC is Jump if Carry Clear (the kernel indicates an error from SYSCALL by setting the carry flag, so this jumps to the success path ok1 when there was no error)
RET is return
On the right side of the mnemonics you'll find each instruction's arguments which are basically constants and registers.
SP is the Stack Pointer
AX is the Accumulator
BX is the Base register
Each register can hold a certain amount of data. On 64-bit CPU architectures I believe it's in fact 64 bits per register.
The only difference between Syscall and RawSyscall is on lines 14, 28 and 34, where Syscall calls runtime·entersyscall(SB) and runtime·exitsyscall(SB) and RawSyscall does not. I assume this means that Syscall notifies the runtime that it has entered a blocking syscall operation so the runtime can give CPU time to another goroutine/thread, whereas RawSyscall just blocks.

Why would declaring variables as volatile speed up code execution?

Any ideas? I'm using the GCC cross-compiler for a PPC750. Doing a simple multiply operation of two floating-point numbers in a loop and timing it. I declared the variables to be volatile to make sure nothing important was optimized out, and the code sped up!
I've inspected the assembly instructions for both cases and, sure enough, the compiler generated many more instructions to do the same basic job in the non-volatile case. Execution time for 10,000,000 iterations dropped from 800ms to 300ms!
assembly for volatile case:
0x10eeec stwu r1,-32(r1)
0x10eef0 lis r9,0x1d # 29
0x10eef4 lis r11,0x4080 # 16512
0x10eef8 lfs fr0,-18944(r9)
0x10eefc li r0,0x0 # 0
0x10ef00 lis r9,0x98 # 152
0x10ef04 stfs fr0,8(r1)
0x10ef08 mtspr CTR,r9
0x10ef0c stw r11,12(r1)
0x10ef10 stw r0,16(r1)
0x10ef14 ori r9,r9,0x9680
0x10ef18 mtspr CTR,r9
0x10ef1c lfs fr0,8(r1)
0x10ef20 lfs fr13,12(r1)
0x10ef24 fmuls fr0,fr0,fr13
0x10ef28 stfs fr0,16(r1)
0x10ef2c bc 0x10,0, 0x10ef1c # 0x0010ef1c
0x10ef30 addi r1,r1,0x20 # 32
assembly for non-volatile case:
0x10ef04 stwu r1,-48(r1)
0x10ef08 stw r31,44(r1)
0x10ef0c or r31,r1,r1
0x10ef10 lis r9,0x1d # 29
0x10ef14 lfs fr0,-18832(r9)
0x10ef18 stfs fr0,12(r31)
0x10ef1c lis r0,0x4080 # 16512
0x10ef20 stw r0,16(r31)
0x10ef24 li r0,0x0 # 0
0x10ef28 stw r0,20(r31)
0x10ef2c li r0,0x0 # 0
0x10ef30 stw r0,8(r31)
0x10ef34 lwz r0,8(r31)
0x10ef38 lis r9,0x98 # 152
0x10ef3c ori r9,r9,0x967f
0x10ef40 cmpl crf0,0,r0,r9
0x10ef44 bc 0x4,1, 0x10ef4c # 0x0010ef4c
0x10ef48 b 0x10ef6c # 0x0010ef6c
0x10ef4c lfs fr0,12(r31)
0x10ef50 lfs fr13,16(r31)
0x10ef54 fmuls fr0,fr0,fr13
0x10ef58 stfs fr0,20(r31)
0x10ef5c lwz r9,8(r31)
0x10ef60 addi r0,r9,0x1 # 1
0x10ef64 stw r0,8(r31)
0x10ef68 b 0x10ef34 # 0x0010ef34
0x10ef6c lwz r11,0(r1)
0x10ef70 lwz r31,-4(r11)
0x10ef74 or r1,r11,r11
0x10ef78 blr
If I read this correctly, it's loading the values from memory during every iteration in both cases, but it seems to have generated a lot more instructions to do so in the non-volatile case.
Here's the source:
void floatTest()
{
unsigned long i;
volatile double d1 = 500.234, d2 = 4.000001, d3=0;
for(i=0; i<10000000; i++)
d3 = d1*d2;
}
Are you sure you didn't also change optimization settings?
The original looks un-optimized - here's the looping part:
0x10ef34 lwz r0,8(r31) //Put 'i' in r0.
0x10ef38 lis r9,0x98 # 152 //Put MSB of 10000000 in r9
0x10ef3c ori r9,r9,0x967f //Put LSB of 10000000 in r9
0x10ef40 cmpl crf0,0,r0,r9 //compare r0 to r9
0x10ef44 bc 0x4,1, 0x10ef4c //branch to loop if r0<r9
0x10ef48 b 0x10ef6c //else branch to end
0x10ef4c lfs fr0,12(r31) //load d1
0x10ef50 lfs fr13,16(r31) //load d2
0x10ef54 fmuls fr0,fr0,fr13 //multiply
0x10ef58 stfs fr0,20(r31) //save d3
0x10ef5c lwz r9,8(r31) //load i into r9
0x10ef60 addi r0,r9,0x1 //add 1
0x10ef64 stw r0,8(r31) //save i
0x10ef68 b 0x10ef34 //go back to top, must reload r9
The volatile version looks quite optimized - it rearranges instructions, and uses the special-purpose counter register instead of storing i on the stack:
0x10ef00 lis r9,0x98 # 152 //MSB of 10M
//.. 4 initialization instructions here ..
0x10ef14 ori r9,r9,0x9680 //LSB of 10,000000
0x10ef18 mtspr CTR,r9 // store r9 in Special Purpose CTR register
0x10ef1c lfs fr0,8(r1) // load d1
0x10ef20 lfs fr13,12(r1) // load d2
0x10ef24 fmuls fr0,fr0,fr13 // multiply
0x10ef28 stfs fr0,16(r1) // store result
0x10ef2c bc 0x10,0, 0x10ef1c // decrement counter and branch if not 0.
The CTR optimization reduces the loop to 5 instructions, instead of the 14 in the original code. I don't see any reason 'volatile' by itself would enable that optimization.
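If the goal is only to keep a benchmark loop from being deleted while still letting the optimizer keep values in registers, one alternative to volatile (my own sketch, not from the question) is to feed the operands in as parameters and return the accumulated result, so the work is observable to the caller:

/* Hypothetical variant: the products feed the return value (so the loop body is
   not dead code) and d1 changes each iteration (so the multiply cannot simply be
   hoisted out of the loop). */
double float_test(double d1, double d2)
{
    double d3 = 0;
    for (unsigned long i = 0; i < 10000000UL; i++) {
        d3 += d1 * d2;
        d1 += 1.0;
    }
    return d3;
}

Whether this times the same work as the original still has to be confirmed by inspecting the generated assembly, as above.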
